
From image to language and back again

Published online by Cambridge University Press: 23 April 2018

A. BELZ
Affiliation: Computing, Engineering and Mathematics, University of Brighton, Lewes Road, Brighton BN2 4GJ, UK
e-mail: A.S.Belz@brighton.ac.uk

T.L. BERG
Affiliation: Computer Science, UNC Chapel Hill, Chapel Hill, NC 27599-3175, USA
e-mail: berg.tamara@gmail.com

L. YU
Affiliation: Computer Science, UNC Chapel Hill, Chapel Hill, NC 27599-3175, USA
e-mail: licheng@cs.unc.edu

Extract

Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).

Type: Articles
Copyright: © Cambridge University Press 2018

