Abstract: The current mainstream architecture for image captioning is the encoder-decoder architecture based on deep neural networks. Most prior work focuses on attention mechanisms and the extraction of image features, such as the hard attention model and the top-down attention model. These methods use only information from the previous time step to predict the output at the current step, so the decoder's input covers a single time dimension; likewise, the decoder's single output reduces prediction accuracy. This paper proposes a horizontal-and-vertical model that fuses information across multiple time dimensions. The horizontal structure of the model enriches the decoder's input with semantic information from both the past and present time steps, while the vertical structure enriches the decoder's output by simultaneously generating prediction vectors for the present and future time steps. The decoders of the two independent structures each produce multiple outputs, which are then combined by weighted fusion to form the final output of each structure. Experiments on the Flickr30k and MSCOCO datasets show that both models score higher than other mainstream models on multiple evaluation metrics and generate more accurate image descriptions.
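As a rough illustration of the weighted-fusion step described in the abstract, the sketch below combines two decoder prediction vectors (e.g., one for the present step and one for the future step) by a convex combination. The function name and the fixed weight `alpha` are hypothetical; the paper's actual fusion weights are not specified in the abstract.

```python
import numpy as np

def weighted_fusion(p_current, p_future, alpha=0.6):
    """Fuse two decoder probability vectors by a convex combination.

    alpha is a hypothetical fusion weight for illustration only; the
    model's real weighting scheme is not given in the abstract.
    """
    p_current = np.asarray(p_current, dtype=float)
    p_future = np.asarray(p_future, dtype=float)
    fused = alpha * p_current + (1.0 - alpha) * p_future
    return fused / fused.sum()  # renormalize to a valid distribution

# Example over a toy 4-token vocabulary
p_now = [0.1, 0.6, 0.2, 0.1]   # prediction vector for the current step
p_next = [0.2, 0.3, 0.4, 0.1]  # prediction vector for the future step
fused = weighted_fusion(p_now, p_next, alpha=0.6)
print(fused)  # -> [0.14 0.48 0.28 0.1 ]
```

In practice the fused vector would feed the final word-selection step (e.g., an argmax or beam search over the vocabulary); here the weight merely biases the result toward the current-step prediction.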
LI Kun, ZHOU Shi-bin, ZHU Jia-ming, ZHANG Guo-peng. Information Fusion in Multiple Time Dimensions for Image Captioning[J]. Journal of Chinese Computer Systems, 2022, 43(1): 103-110.