
Automatic speech recognition and pronunciation learning

Published online by Cambridge University Press:  06 August 2025

Shannon McCrocklin
Affiliation:
Linguistics, Southern Illinois University, Carbondale, IL, USA
John Levis*
Affiliation:
English, Iowa State University, Ames, IA, USA
Corresponding author: John Levis; Email: jlevis@iastate.edu

Type: Research Timeline

Copyright: © The Author(s), 2025. Published by Cambridge University Press.

1. Introduction

Although pronunciation is a central factor in speech intelligibility (Jenkins, 2000; Munro & Derwing, 1995, 2020), it remains a relatively neglected area of language learning because of time constraints in the classroom (Murphy & Baker, 2015), lack of teacher expertise (Huensch, 2019), and the difficulty of providing targeted feedback to diverse students (Khaustova et al., 2023). The application of Automatic Speech Recognition (ASR) technologies, which have the potential to provide individualized learner feedback on pronunciation, was imagined long before the technical development of ASR systems made it practicable (Wohlert, 1984*).1 With improvements in the accuracy and speaker independence of systems, interest in using ASR for pronunciation learning and teaching has grown.

This timeline traces research into the application of ASR to pronunciation teaching and learning. The process began with the identification of 126 articles through library searches of EBSCOhost and Linguistics and Language Behavior Abstracts, followed by searches in Google Scholar. Search keywords included automatic speech recognition, speech-to-text, and computer-assisted pronunciation training, along with combinations of terms related to pronunciation, technology, and/or computer-assisted language learning. The selection process weighed the quality and impact of studies as well as connections between studies across time. Through this iterative process we selected 46 influential articles, later expanded to 50 based on reviewer feedback. Notably, many of the included articles were published in the past decade.

Automatic speech recognition works by applying computational processes to decode and transcribe oral speech input. The system records spoken input and analyzes it with a probabilistic algorithm, producing an output, usually in the form of written text (Levis & Suvorov, 2020). Today, variants of ASR programs are found in a wide variety of applications, including personal assistants, mobile devices, video chat programs, closed captioning on videos, dictation programs for hands-free typing, and language-learning programs. As a consequence, ASR systems now play an increasingly expansive role in applied linguistics and language learning by providing feedback to learners on their spoken language and pronunciation.
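To make the record-decode-transcribe pipeline concrete, the following is a minimal sketch of a dictation-style transcription call in Python. It assumes the third-party SpeechRecognition package and its wrapper around Google's free Web Speech API; the file name learner_recording.wav is a hypothetical placeholder, not a resource from any study cited here.

```python
# Minimal dictation-style transcription, assuming the third-party
# SpeechRecognition package (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()

# "learner_recording.wav" is a hypothetical placeholder file.
with sr.AudioFile("learner_recording.wav") as source:
    audio = recognizer.record(source)  # read the entire file into memory

try:
    # Send the audio to Google's Web Speech API and print the most
    # probable transcription (the probabilistic decoding step).
    text = recognizer.recognize_google(audio, language="en-US")
    print("ASR transcript:", text)
except sr.UnknownValueError:
    print("Speech was unintelligible to the recognizer.")
except sr.RequestError as e:
    print(f"Could not reach the recognition service: {e}")
```

In a dictation-based practice task, a learner would compare the printed transcript against the sentence they intended to say; mismatched words serve as the implicit pronunciation feedback discussed below.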

With origins dating to the 1950s, ASR is a relatively young technology (Davis et al., 1952). Its modern development began in earnest in the late 1970s, and the first business-oriented ASR applications appeared in the 1980s (Rabiner & Juang, 2008). One of the main challenges facing developers of ASR was the variability of speech. Early programs were speaker-dependent; they were trained with speech samples to recognize the speech of a single speaker or a single speech variety. Even with extensive training, early programs had high error rates due to individual differences, speech rate, and variations in pronunciation when words were produced in connected speech. Given these challenges, early research into ASR for L2 speech focused heavily on the accuracy of recognition (Bernstein et al., 1990*). In one of the first studies to harness a commercial dictation program for pronunciation, Derwing et al. (2000*) examined the transcription accuracy of a promising commercial ASR program, Dragon NaturallySpeaking, for both L1 and L2 English speech. The study found that while the program recognized L1 speech with accuracy rates of around 90%, it had much lower recognition rates for second-language (L2) speakers (∼70%), even though human listeners understood the L2 speakers at a much higher rate (∼95%). This study was highly influential and halted work with commercial dictation programs until later studies could document improvement in the recognition of L2 speech. Now, although some concerns remain regarding accuracy (see Inceoglu et al., 2023*), studies have found that Google Voice Typing's recognition of L2 speech exceeds 90% and is similar to that of human listeners (Johnson et al., 2024*; McCrocklin & Edalatishams, 2020*). Automatic speech recognition thus provides a useful estimate of intelligibility. Notably, improved recognition accuracy, which correlates with human listener comprehension, supports the goal of making intelligibility (in this case, the successful recognition of spoken words) rather than accentedness the emphasis in pronunciation learning (Levis, 2020).
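Recognition-accuracy figures of the kind reported above are conventionally derived from word error rate (WER): the word-level edit distance between a reference transcript and the ASR output, divided by the length of the reference. A minimal sketch follows; the function name and the example sentences are our own illustration, not data from the studies cited.

```python
# Word-level edit distance (Levenshtein) used to compute word error rate,
# the standard basis for recognition-accuracy figures like those above.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:  # substitution, deletion, or insertion
                dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])
    return dp[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: accuracy is often reported as 1 - WER.
wer = word_error_rate("she sells sea shells", "she sell sea shelf")
print(f"WER = {wer:.2f}, accuracy = {1 - wer:.0%}")  # WER = 0.50, accuracy = 50%
```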

Given concerns about the accuracy of recognition, early work in ASR for second-language learning concentrated on building bespoke Computer-Assisted Pronunciation Training (CAPT) applications that integrated ASR to enable pronunciation feedback. These CAPT programs often controlled learner utterances to enable more accurate feedback on pronunciation errors (Hincks, 2015). Some programs also incorporated other data, such as information about the learners' first languages (Moustroufas & Digalakis, 2007*), to further improve accuracy. Early examples of ASR-CAPT studies are found in Neri et al. (2002*, 2008*), which demonstrated the potential of ASR-CAPT for pronunciation learning. Interest in ASR-CAPT has continued, with examples including Bashori et al. (2022*, 2024*), which investigated both pronunciation improvement and vocabulary learning using ASR-based language-learning programs.

In the last ten years, ASR has dramatically increased in accuracy thanks to improvements based on deep learning (Cho, 2022) and the use of extensive spoken corpora (Lamel & Gauvain, 2022). These improvements reignited interest in harnessing other types of programs, such as dictation programs, as part of pronunciation learning and teaching (e.g., Liakin et al., 2015*; McCrocklin, 2016*; Mroz, 2018*), and studies highlighted the potential of dictation programs to facilitate pronunciation learning, spurring additional research that continues today (e.g., Inceoglu et al., 2024*; Johnson et al., 2024*).

Beyond accuracy and learning potential, research has also explored the experience that learners have with ASR. Although some studies reported learners' concerns about the accuracy of ASR feedback or their frustrations with low rates of recognition (McCrocklin, 2019), learners generally enjoy using ASR to practice their pronunciation because the feedback helps them improve (Ahn & Lee, 2016*; Mroz, 2018*; Neri et al., 2006*; Wang & Young, 2015*). The use of ASR also encourages autonomous learning (Inceoglu, 2023*; McCrocklin, 2016*). Studies using screen and video recordings of learner practice have further shown that most learners are willing to attempt repairs when ASR indicates a possible pronunciation error (Inceoglu et al., 2024*) and can improve their ASR transcription accuracy in subsequent attempts (McCrocklin, 2019b*). Studies applying the Technology Acceptance Model (TAM) have further shown that learners and teachers are likely to continue using ASR after the conclusion of a study (Dillon & Wells, 2021*; Hsu, 2024*), indicating that they find it useful enough to make it part of their continued language learning. As studies have continued to show that ASR (whether through CAPT programming or through the incorporation of generalized tools, such as dictation programs) can support pronunciation learning and improvement, researchers' questions have increasingly turned to factors that affect the success of implementation, such as the nature of the feedback provided, work arrangements, segmental targets, and length of training.

In general, ASR systems can provide three types of feedback. First, they can provide an overall score for learners' pronunciation, for example, a Goodness of Pronunciation (GOP) score (see Witt & Young, 1997*). However, GOP scores are opaque, requiring learners to work out where their pronunciation falls short in order to raise their scores. Automatic speech recognition feedback can also be provided in the form of written text, in which errors in transcription may suggest pronunciation errors. However, this type of feedback is also indirect, and mistranscriptions may not help learners understand the patterns of errors in their speech or how to improve. Finally, ASR can be used in conjunction with explicit feedback on particular phonemic errors to highlight mispronunciations and provide explanatory feedback. The meta-analysis by Ngo et al. (2024*) showed that while implicit feedback was useful and could support learning gains, explicit feedback was more beneficial for learning. Still, more work is needed to explore the effectiveness of various ASR-feedback options.
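For readers unfamiliar with GOP scoring, the following is a common simplified statement of the measure introduced by Witt & Young; the notation is ours. Here $O^{(p)}$ is the stretch of acoustic frames aligned to the target phone $p$, $Q$ is the full phone set, and $NF(p)$ is the number of frames in that stretch:

\[
\mathrm{GOP}(p) \;=\; \frac{1}{NF(p)} \left| \log \frac{p\!\left(O^{(p)} \mid p\right)}{\max_{q \in Q} \, p\!\left(O^{(p)} \mid q\right)} \right|
\]

Scores near zero mean the intended phone is also the recognizer's best-matching phone; larger scores flag likely mispronunciations once a threshold is applied. The single number, however, tells the learner nothing about what went wrong articulatorily, which is the opacity noted above.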

Another area of investigation has been work arrangements for ASR practice, such as whether the work is done individually, in pairs or groups, or with teacher support. Although most early research assumed learners would work individually with ASR programs, and Elimat & AbuSeileek (2014*) found an advantage for students working individually, studies have increasingly found that learner support (practice with a peer) may increase the efficacy of such practice (Dai & Wu, 2023*; Evers & Chen, 2021*). Learners also expressed greater satisfaction working with ASR when they had teacher support (Liu et al., 2022*). There is likely a link between the feedback quality of the ASR program and the need for additional support from a peer or teacher: learners may derive greater benefits from support when using programs that provide only indirect or implicit feedback, which they may struggle to interpret on their own. This should be a target for further research.

Several studies have shown that ASR training has uneven effects across segmental pronunciation targets. McCrocklin (2019a*) examined learner improvement across a range of English vowels and consonants, finding that some segmentals (for example, /ɹ/ and /i/) showed little to no improvement. Inceoglu et al. (2020*) and Guskaroska (2020) also examined learner improvement for vowel sounds in English, finding that some vowel pronunciations improved while others did not. Future research may explore whether there are consistent differences, which would suggest that some segmentals are more suitably trained through ASR practice than others.

Due to practical constraints, many ASR studies have relied on short training windows. A recent meta-analysis (Ngo et al., 2024*), however, suggests that longer and more consistent use of ASR leads to greater improvement. The analysis showed that studies involving four weeks of training or less were rarely effective, while studies employing medium (5–8 weeks) or long (9 weeks or longer) training durations led to substantial improvement. Studies that employ longer training periods, such as Hsu (2024*) and Nickolai (2024*), will help the field better understand the ultimate impact of sustained ASR training and the extent of improvement possible with ASR practice.

Looking to the future, ASR programs will continue to grow in influence within language learning in general and pronunciation learning in particular. Automatic speech recognition continues to improve, not only overall but specifically in recognizing and assessing L2 speech (Çalık et al., 2024; Yan et al., 2024*). Ultimately, the gold standard of ASR for pronunciation improvement (Derwing & Munro, 2015) is met when listeners understand the spontaneous speech of L2 speakers more successfully. Programming that integrates artificial intelligence (AI) will increasingly allow learners to engage in spontaneous speech practice while receiving pronunciation feedback on features affecting intelligibility. Generative AI will likely blur the lines between CAPT, which can offer explicit feedback but often gives learners little to no flexibility to create and produce their own sentences, and repurposed commercial programs (e.g., dictation programs), which provide only implicit feedback through a written transcript but give learners the flexibility and control to create and practice their own learning materials. Currently, programs like ELSA Speak and Gliglish are debuting options for flexible, communicative practice through role-play conversations that make explicit pronunciation feedback available. Newer programs, however, may have goals that focus on other language skills (see Bashori et al., 2024*) and do not support a sustained focus on pronunciation. Ideally, future research will continue to closely examine the pronunciation learning enabled by these technologies and find ways to better support learners' pronunciation improvement.

Our exploration of the history of research in ASR for pronunciation learning and teaching identified five major themes. Coding advancements (A) have largely taken place apart from pronunciation learning and teaching but underpin the current usefulness of ASR in this area. Evaluations of programs in terms of accuracy and affordances for learning (B) continue to play a role in research, but these areas are less important now than in the past, as programs have been established as sufficiently accurate to be useful for pronunciation learning. In regard to types of programs (C), early emphasis on specially created programs has been replaced by emphasis on repurposing programs built for other purposes (e.g., dictation programs); future developments integrating AI or focusing on suprasegmental features may again change the direction of research in this area. Research has focused heavily on learner improvement (D), or how and where learners improve. While most studies have focused on improvement of segmental pronunciation, some have addressed larger issues of intelligibility or comprehensibility; research on ASR for suprasegmental training has been rare but may become more common if or when commercial programs that can address suprasegmentals become available (see Kochem et al., 2022*). Research has also explored the learner experience (E), including how learners use the tools and the degree to which they find ASR enjoyable and valuable as a language-learning tool.

A. Coding advancements

B. Program evaluations

B1. Accuracy of ASR

B2. Affordances for learning

C. Type of program

C1. Assessment

C2. CAPT programs

C3. Dictation programs

D. Learner improvement

D1. Segmental accuracy changes

D2. Intelligibility/comprehensibility

E. Learning experience

Shannon McCrocklin is an Associate Professor in the School of Languages and Linguistics at Southern Illinois University. She earned her M.A. in TESOL from the University of Illinois and her Ph.D. in Applied Linguistics and Technology from Iowa State University. Her research focuses on the intersection of CALL and second language pronunciation learning and teaching, with a primary emphasis on automatic speech recognition. She is the editor of Technological resources for second language pronunciation learning and teaching: Research-based approaches (2022, Bloomsbury Publishing).

John Levis is Distinguished Professor of Applied Linguistics and TESL at Iowa State University. He is the author of Intelligibility, oral communication, and the teaching of pronunciation (2018, Cambridge University Press) as well as the co-author of Second language pronunciation: Bridging the gap between research and practice (2022, Wiley Blackwell). He co-edited the Handbook of English pronunciation (2015, Wiley Blackwell) as well as Social dynamics in second language accent (2014, DeGruyter Mouton). He is the founder of an annual conference, Pronunciation in Second Language Learning and Teaching, held since 2009, and the founding editor of the Journal of Second Language Pronunciation.


Footnotes

1. Asterisk indicates full reference can be found in the timeline itself.

Work in small capitals is mentioned elsewhere in the timeline.

References

Çalık, Ş. S., Küçükmanisa, A., & Kilimci, Z. H. (2024). A novel framework for mispronunciation detection of Arabic phonemes using audio-oriented transformer models. Applied Acoustics, 215, 109711. https://doi.org/10.1016/j.apacoust.2023.109711
Cho, K. (2022). Deep learning. In Mitkov, R. (Ed.), The Oxford handbook of computational linguistics (pp. 359–414). Oxford University Press.
Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic recognition of spoken digits. The Journal of the Acoustical Society of America, 24(6), 637–642. https://doi.org/10.1121/1.1906946
Derwing, T. M., & Munro, M. J. (2015). Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research. John Benjamins.
Guskaroska, A. (2020). ASR-dictation on smartphones for vowel pronunciation practice. Journal of Contemporary Philology, 3(2), 45–61. https://doi.org/10.37834/jcp2020045g
Hincks, R. (2015). Technology and learning pronunciation. In Reed, M. & Levis, J. (Eds.), The handbook of English pronunciation (pp. 505–519). John Wiley & Sons.
Huensch, A. (2019). Pronunciation in foreign language classrooms: Instructors' training, classroom practices, and beliefs. Language Teaching Research, 23(6), 745–764. https://doi.org/10.1177/136216881876718
Jenkins, J. (2000). The phonology of English as an international language. Oxford University Press.
Khaustova, V., Pyshkin, E., Khaustov, V., Blake, J., & Bogach, N. (2023, November). CAPTuring accents: An approach to personalize pronunciation training for learners with different L1 backgrounds. In International conference on speech and computer (pp. 59–70). Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-48312-7_5
Lamel, L., & Gauvain, J.-L. (2022). Speech recognition. In Mitkov, R. (Ed.), The Oxford handbook of computational linguistics (pp. 770–788). Oxford University Press.
Levis, J. (2020). Revisiting the intelligibility and nativeness principles. Journal of Second Language Pronunciation, 6(3), 310–328. https://doi.org/10.1075/jslp.20050.lev
Levis, J., & Suvorov, R. (2020). Automatic speech recognition. In Chapelle, C. (Ed.), The encyclopedia of applied linguistics. Wiley. https://doi.org/10.1002/9781405198431.wbeal0066.pub2
McCrocklin, S. (2019). Learners' feedback regarding ASR-based dictation practice for pronunciation learning. CALICO Journal, 36(2), 119–137. https://doi.org/10.1558/cj.34738
Munro, M. J., & Derwing, T. M. (1995). Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1), 73–97. https://doi.org/10.1111/j.1467-1770.1995.tb00963.x
Munro, M. J., & Derwing, T. M. (2020). Foreign accent, comprehensibility and intelligibility, redux. Journal of Second Language Pronunciation, 6(3), 283–309. https://doi.org/10.1075/jslp.20038.mun
Murphy, J. M., & Baker, A. A. (2015). History of ESL pronunciation teaching. In Reed, M. & Levis, J. (Eds.), The handbook of English pronunciation (pp. 36–65). Wiley-Blackwell. https://doi.org/10.1002/9781118346952.ch3
Rabiner, L., & Juang, B. H. (2008). Historical perspective of the field of ASR/NLU. In Benesty, J., Sondhi, M. M., & Huang, Y. A. (Eds.), Springer handbook of speech processing (pp. 521–538). Springer.