
Multilingual pronunciation by analogy

Published online by Cambridge University Press: 01 October 2008

TASANAWAN SOONKLANG
Affiliation:
Information: Signals, Images, Systems (ISIS) Research Group, School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK e-mail: anncenter@gmail.com, rid@ecs.soton.ac.uk
ROBERT I. DAMPER
Affiliation:
Information: Signals, Images, Systems (ISIS) Research Group, School of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK e-mail: anncenter@gmail.com, rid@ecs.soton.ac.uk
YANNICK MARCHAND
Affiliation:
Institute for Biodiagnostics (Atlantic), National Research Council Canada, Neuroimaging Research Laboratory, 1796 Summer Street, Suite 3900, Halifax, Nova Scotia, Canada B3H 3A7 e-mail: yannick.marchand@nrc.cnrc.gc.ca

Abstract

Automatic pronunciation of unknown words (i.e., those not in the system dictionary) is a difficult problem in text-to-speech (TTS) synthesis. Many data-driven approaches have been applied to the problem as a backup strategy for cases where dictionary matching fails. The difficulty of the problem depends on the complexity of the spelling-to-sound mappings in the writing system of the language concerned. Hence, the degree of success achieved varies widely not only across languages but also across dictionaries, even for the same language with the same method. Further, the sizes of the training and test sets are an important consideration in data-driven approaches. In this paper, we study the variation of letter-to-phoneme transcription accuracy across seven European languages with twelve different lexicons. We also study the relationship between dictionary size and the accuracy obtained. The largest dictionary for each language has been partitioned into ten approximately equal-sized subsets, which were combined to give ten different-sized test sets. In view of its superior performance in previous work, the transcription method used is pronunciation by analogy (PbA). Best results are obtained for Spanish, generally believed to have a very regular (‘shallow’) orthography, and poorest results for English, a language whose irregular spelling system is legendary. For those languages for which multiple dictionaries were available (i.e., French and English), results were found to vary across dictionaries. Regarding the relationship between dictionary size and transcription accuracy, we find that performance grows monotonically with dictionary size. However, the performance gain decelerates (tends to saturate) as the dictionary increases in size; the relation can be described simply by a logarithmic regression, one parameter of which (α) can be taken as quantifying the depth of orthography of a language. We find that α for a language is significantly correlated with transcription performance on a small dictionary (approximately 10,000 words) for that language, but less so with asymptotic performance. This may be because our measure of asymptotic performance is unreliable, being extrapolated from the fitted logarithmic regression.
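
To make the logarithmic relation between dictionary size and accuracy concrete, the following is a minimal sketch of an ordinary least-squares fit of the form accuracy = intercept + slope × ln(size). This functional form, the function name `fit_log_regression`, and the numbers used are assumptions for illustration only: the abstract does not state which of the two regression parameters is α, and the data below are synthetic, not results from the paper.

```python
import math


def fit_log_regression(sizes, accuracies):
    """Least-squares fit of accuracy = intercept + slope * ln(size).

    Hypothetical helper: the paper's exact parameterisation is not given
    in the abstract, so the parameters are named neutrally here.
    """
    if len(sizes) != len(accuracies) or len(sizes) < 2:
        raise ValueError("need at least two (size, accuracy) pairs")
    xs = [math.log(n) for n in sizes]  # regress on the log of dictionary size
    x_mean = sum(xs) / len(xs)
    y_mean = sum(accuracies) / len(accuracies)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Synthetic data generated from a known curve (NOT figures from the paper):
# ten dictionary sizes, mimicking ten cumulative subsets of a 100,000-word lexicon.
sizes = [10_000 * k for k in range(1, 11)]
accuracies = [20.0 + 5.0 * math.log(n) for n in sizes]

intercept, slope = fit_log_regression(sizes, accuracies)

# Under a logarithmic model, equal absolute increases in size yield
# shrinking gains: the first extra 10,000 words help more than the last.
gain_first = slope * math.log(sizes[1] / sizes[0])
gain_last = slope * math.log(sizes[-1] / sizes[-2])
print(intercept, slope, gain_first, gain_last)
```

Because the synthetic points lie exactly on the generating curve, the fit recovers its parameters up to floating-point error, and `gain_first` exceeds `gain_last`, which is the saturation behaviour the abstract describes.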

Type
Papers
Copyright
Copyright © Cambridge University Press 2008

