
20 - AI, Human–Robot Interaction, and Natural Language Processing

from Part V - Advances in Multimodal and Technological Context-Based Research

Published online by Cambridge University Press: 30 November 2023

Jesús Romero-Trillo, Universidad Autónoma de Madrid

Summary

From an engineering perspective, an AI-driven (or AI-assisted) speech or dialogue system can be decomposed into a pipeline drawing on a subset of three distinct processing activities: (1) Speech processing, which turns sampled acoustic sound waves into enriched phonetic information through automatic speech recognition (ASR), and vice versa via text-to-speech (TTS); (2) Natural Language Processing (NLP), which operates at both syntactic and semantic levels to extract meaning from words and from the enriched phonetic information; and (3) Dialogue processing, which ties the two together so that the system can operate within specified latency and semantic constraints. This perspective allows for at least three levels of context. The lowest level is phonetic, where the fundamental components of speech are built from a time-sequenced string of acoustic symbols (analyzed in ASR or generated in TTS). The next higher level of context is word- or character-level, normally cast as sequence-to-sequence modeling. The highest level of context typically used today keeps track of a conversation or topic. An even higher level of context, generally missing today but essential in future, is that of our beliefs, desires, and intentions.
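
To make the pipeline concrete, the sketch below traces one conversational turn through the three stages. It is a minimal illustration only, assuming nothing beyond the decomposition described above: every class and function name is a hypothetical placeholder, not an API from this chapter or from any real toolkit.

from dataclasses import dataclass, field

@dataclass
class DialogueState:
    # Conversation- or topic-level context: the highest level the
    # summary notes is routinely tracked today.
    topic: str = ""
    history: list = field(default_factory=list)

def asr(audio: bytes) -> str:
    # Stage 1 (phonetic context): acoustic samples -> enriched text.
    return "<recognised utterance>"  # placeholder

def nlp(text: str, state: DialogueState) -> str:
    # Stage 2 (word-/character-level context): syntactic and semantic
    # analysis, typically cast as sequence-to-sequence modeling.
    return "<meaning of " + text + ">"  # placeholder

def dialogue(meaning: str, state: DialogueState) -> str:
    # Stage 3 (conversation-level context): choose a reply within the
    # system's latency and semantic constraints.
    return "<reply>"  # placeholder

def tts(reply: str) -> bytes:
    # The inverse of ASR: text back to synthesised audio.
    return b"<waveform>"  # placeholder

def one_turn(audio_in: bytes, state: DialogueState) -> bytes:
    text = asr(audio_in)
    reply = dialogue(nlp(text, state), state)
    state.history += [text, reply]  # carry conversational context forward
    return tts(reply)

Each comment marks the level of context the corresponding stage works at; a real system would replace the placeholders with trained ASR, sequence-to-sequence, dialogue-management, and TTS components.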

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2023


