
Improving speech emotion recognition based on acoustic words emotion dictionary

Published online by Cambridge University Press:  10 June 2020

Wang Wei
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
Xinyi Cao
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
He Li
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China; Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
Lingjie Shen
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
Yaqin Feng
Affiliation:
School of Education Science, Nanjing Normal University, Nanjing, 210097, China
Paul A. Watters*
Affiliation:
Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
*Corresponding author. E-mail: P.Watters@latrobe.edu.au

Abstract

To improve speech emotion recognition, a U-AWED features model is proposed, built on an acoustic words emotion dictionary (AWED). The method models emotional information at the acoustic-word level across different emotion classes. The top-ranked words for each emotion are selected to generate the AWED vector. The U-AWED model is then constructed by combining utterance-level acoustic features with the AWED features. A support vector machine and a convolutional neural network are employed as classifiers in our experiments. The results show that the proposed method yields a significant improvement in unweighted average recall across all four emotion classification tasks.
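As a rough illustration of the pipeline described above, the sketch below (Python, with hypothetical word lists, feature dimensions, and function names; not the authors' code) builds an AWED-style vector by counting how many of an utterance's words appear in each emotion's top-list, concatenates it with utterance-level acoustic features, and fits an SVM classifier.

```python
# Minimal sketch of the U-AWED idea (assumed data layout and toy values).
import numpy as np
from sklearn.svm import SVC

# Hypothetical emotion -> top-ranked acoustic words mapping (assumed input).
top_words_per_emotion = {
    "angry":   {"no", "never", "stop"},
    "happy":   {"great", "yes", "love"},
    "sad":     {"sorry", "miss", "alone"},
    "neutral": {"okay", "fine", "maybe"},
}

def build_awed_vector(utterance_words):
    """One dimension per emotion class: how many of the utterance's words
    fall in that emotion's top-list."""
    return np.array(
        [sum(1 for w in utterance_words if w in top_words)
         for top_words in top_words_per_emotion.values()],
        dtype=float,
    )

def u_awed_features(utterance_words, acoustic_feats):
    """Concatenate utterance-level acoustic features with the AWED vector."""
    return np.concatenate([acoustic_feats, build_awed_vector(utterance_words)])

# Toy example: 8 utterances with 88-dimensional acoustic features
# (eGeMAPS-style) plus the 4-dimensional AWED vector, classified by an SVM.
rng = np.random.default_rng(0)
X_acoustic = rng.normal(size=(8, 88))
words = [["great", "yes"], ["sorry"], ["no", "stop"], ["okay"]] * 2
y = ["happy", "sad", "angry", "neutral"] * 2
X_combined = np.stack([u_awed_features(w, x) for w, x in zip(words, X_acoustic)])
clf = SVC(kernel="rbf").fit(X_combined, y)
```

In this sketch the AWED vector is a simple per-emotion word count; the paper's dictionary construction and weighting may differ, and a CNN could be substituted for the SVM on the same combined features.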

Type: Article
Copyright: © The Author(s), 2020. Published by Cambridge University Press

