A Neuro-Symbolic ASP Pipeline for Visual Question Answering

THOMAS EITER; NELSON HIGUERA; JOHANNES OETSCH; MICHAEL PRITZ

doi:10.1017/S1471068422000229

Published online by Cambridge University Press: 11 July 2022

Affiliation (all authors): Institute of Logic and Computation, Vienna University of Technology (TU Wien), Austria
E-mails: eiter@kr.tuwien.ac.at, higuera@kr.tuwien.ac.at, oetsch@kr.tuwien.ac.at, pritz@kr.tuwien.ac.at

Abstract

We present a neuro-symbolic visual question answering (VQA) pipeline for CLEVR, a well-known dataset of images showing scenes with objects, together with questions about them. Our pipeline covers (i) training neural networks for object classification and bounding-box prediction on the CLEVR scenes, (ii) statistical analysis of the distribution of the networks' prediction values to determine a threshold for high-confidence predictions, and (iii) a translation of CLEVR questions and of network predictions that pass the confidence threshold into logic programmes, so that answers can be computed with an answer-set programming solver. By exploiting choice rules, we consider deterministic and non-deterministic scene encodings. Our experiments show that, in comparison with the deterministic approach, the non-deterministic scene encoding achieves good results even if the neural networks are trained rather poorly. This is important for building robust VQA systems when network predictions are less than perfect. Furthermore, we show that restricting non-determinism to reasonable choices allows for more efficient implementations than related neuro-symbolic approaches without losing much accuracy.
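
To make the difference between the two scene encodings concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical predicate `obj/2`, a hypothetical helper `encode_object`, and a threshold of the form "mean minus k standard deviations"; the abstract only states that a threshold is derived from the distribution of prediction values, so that formula is an assumption. The sketch only builds the text of ASP facts and choice rules and does not call a solver.

```python
from statistics import mean, pstdev
from typing import Dict, List


def confidence_threshold(scores: List[float], k: float = 1.0) -> float:
    """Assumed threshold rule: mean of the scores minus k standard deviations."""
    return mean(scores) - k * pstdev(scores)


def encode_object(obj_id: int, class_scores: Dict[str, float],
                  threshold: float, deterministic: bool) -> str:
    """Encode the class prediction for one detected object as an ASP rule.

    Deterministic: a single fact for the highest-scoring class.
    Non-deterministic: a choice rule over all classes whose score passes
    the threshold (falling back to the best class if none passes).
    """
    # Rank classes from highest to lowest score.
    ranked = sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)
    if deterministic:
        return f"obj({obj_id},{ranked[0][0]})."
    kept = [label for label, score in ranked if score >= threshold]
    if not kept:
        kept = [ranked[0][0]]
    atoms = "; ".join(f"obj({obj_id},{label})" for label in kept)
    # Exactly one of the kept classes is chosen in each answer set.
    return f"1 {{ {atoms} }} 1."


if __name__ == "__main__":
    scores = {"cube": 0.55, "cylinder": 0.40, "sphere": 0.05}
    print(encode_object(0, scores, threshold=0.3, deterministic=True))
    print(encode_object(0, scores, threshold=0.3, deterministic=False))
```

In the non-deterministic case, the solver can pick `cylinder` when `cube` makes the question unanswerable, which illustrates why such an encoding can tolerate imperfect network predictions; limiting the choice to classes above the threshold keeps the number of candidate answer sets small.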

Keywords

answer-set programming; visual question answering; neuro-symbolic computation

Information

Type: Original Article
Information: Theory and Practice of Logic Programming, Volume 22, Issue 5: 38th International Conference on Logic Programming Special Issue II, September 2022, pp. 739–754

DOI: https://doi.org/10.1017/S1471068422000229
Copyright: © The Author(s), 2022. Published by Cambridge University Press

Footnotes

This work was partially funded by the Bosch Center for Artificial Intelligence at Renningen, Germany.
