Understanding visual scenes

Published online by Cambridge University Press: 28 March 2018

CARINA SILBERER
Affiliation:
DTCL, Universitat Pompeu Fabra, Roc Boronat 138, 08018 Barcelona, Spain e-mail: CarinaSilberer@gmail.com
JASPER UIJLINGS
Affiliation:
School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, UK e-mail: jrr.uijlings@gmail.com
MIRELLA LAPATA
Affiliation:
ILCC, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, UK e-mail: mlap@inf.ed.ac.uk

Abstract

A growing body of recent work focuses on the challenging problem of scene understanding using a variety of cross-modal methods which fuse techniques from image and text processing. In this paper, we develop representations for the semantics of scenes by explicitly encoding the objects detected in them and their spatial relations. We represent image content via two well-known types of tree representations, namely constituents and dependencies. Our representations are created deterministically, can be applied to any image dataset irrespective of the task at hand, and are amenable to standard NLP tools developed for tree-based structures. We show that we can apply syntax-based SMT and tree kernel methods in order to build models for image description generation and image-based retrieval. Experimental results on real-world images demonstrate the effectiveness of the framework.
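To make the tree-kernel idea concrete, here is a minimal sketch of a subset-tree convolution kernel, a standard similarity measure for labelled trees, applied to two toy constituent-like scene trees. The nested-tuple encoding, the node labels (SCENE, OBJ) and all function names are illustrative assumptions, not the representation or implementation used in the paper.

```python
import math
from itertools import product

# Hypothetical encoding (an assumption for illustration only):
# a tree is a nested tuple (label, child, child, ...); leaves are strings.


def internal_nodes(tree):
    """Yield every internal (tuple) node of the tree, root first."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from internal_nodes(child)


def production(node):
    """Node label plus its children's labels, marking internal vs leaf children."""
    children = tuple(
        (c[0], True) if isinstance(c, tuple) else (c, False) for c in node[1:]
    )
    return (node[0], children)


def common_subtrees(n1, n2, decay=1.0):
    """Weighted count of subtree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    # Equal productions guarantee children align in label and kind.
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):
            score *= 1.0 + common_subtrees(c1, c2, decay)
    return score


def tree_kernel(t1, t2, decay=1.0):
    """Sum the shared-fragment counts over all pairs of internal nodes."""
    return sum(
        common_subtrees(n1, n2, decay)
        for n1, n2 in product(internal_nodes(t1), internal_nodes(t2))
    )


def normalized_kernel(t1, t2, decay=1.0):
    """Scale the kernel to [0, 1] so that tree size does not dominate."""
    denom = math.sqrt(tree_kernel(t1, t1, decay) * tree_kernel(t2, t2, decay))
    return tree_kernel(t1, t2, decay) / denom if denom else 0.0


# Two toy scenes that share a dog but differ in the second object.
scene_a = ("SCENE", ("OBJ", "dog"), ("OBJ", "sofa"))
scene_b = ("SCENE", ("OBJ", "dog"), ("OBJ", "chair"))
print(tree_kernel(scene_a, scene_b))        # 3.0 shared fragments
print(normalized_kernel(scene_a, scene_b))  # 0.5
```

A normalized score of this kind could, in principle, rank candidate images against a query image for retrieval. How the trees are actually built from detections and which kernel variant is used are left to the full article.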

Type
Articles
Copyright
Copyright © Cambridge University Press 2018 
