The Kestrel TTS text normalization system

Published online by Cambridge University Press: 12 December 2014

PETER EBDEN
Affiliation:
Google, Inc (now at Thought Machine), London, UK email: pebden@google.com
RICHARD SPROAT
Affiliation:
Google, Inc, New York, USA email: rws@google.com

Abstract

This paper describes the Kestrel text normalization system, a component of the Google text-to-speech synthesis (TTS) system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is not itself new, Kestrel differs from previous systems in separating the initial tokenization and classification phase of analysis from verbalization. Input text is first tokenized, and the tokens are classified, using WFSTs. As part of the classification, detected semiotic classes (expressions such as currency amounts, dates, times, and measure phrases) are parsed into protocol buffers (https://code.google.com/p/protobuf/). The protocol buffers are then verbalized, with possible reordering of the elements, again using WFSTs. This paper describes the architecture of Kestrel and the protocol buffer representations of semiotic classes, and presents some examples of grammars for various languages. We also discuss applications and deployments of Kestrel as part of the Google TTS system, which runs on both the server and the client side on multiple devices, and is used daily by millions of people in nineteen languages and counting.
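
The separation of classification from verbalization can be illustrated with a minimal sketch. It is not Kestrel's implementation: a regular expression stands in for the WFST classifier, a Python dataclass stands in for the protocol buffer, and every name in it (`Money`, `classify`, `verbalize`, `normalize`) is hypothetical and chosen only for illustration. It handles whole-number amounts below 100 in two currencies.

```python
import re
from dataclasses import dataclass


@dataclass
class Money:
    """Hypothetical stand-in for a protocol-buffer message for a money amount."""
    currency: str
    amount: int


# Assumed toy lexicon: (singular, plural) names for each currency symbol.
_CURRENCY_NAMES = {"$": ("dollar", "dollars"), "£": ("pound", "pounds")}
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]
# A regular expression stands in for the WFST-based classifier.
_MONEY_RE = re.compile(r"([$£])(\d{1,2})")


def number_name(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"


def classify(token: str):
    """Stage 1: parse a semiotic-class token into a structured record."""
    match = _MONEY_RE.fullmatch(token)
    if match:
        return Money(currency=match.group(1), amount=int(match.group(2)))
    return token  # ordinary words pass through unchanged


def verbalize(item) -> str:
    """Stage 2: turn a record into words, reordering its fields."""
    if isinstance(item, Money):
        singular, plural = _CURRENCY_NAMES[item.currency]
        unit = singular if item.amount == 1 else plural
        # The symbol precedes the number in writing but follows it in speech.
        return f"{number_name(item.amount)} {unit}"
    return item


def normalize(text: str) -> str:
    """Tokenize on whitespace, then classify and verbalize each token."""
    return " ".join(verbalize(classify(tok)) for tok in text.split())
```

Because the intermediate record is independent of surface order, the verbalizer is free to emit the currency name after the number even though the symbol comes first in the written form.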

Type
Articles
Copyright
Copyright © Cambridge University Press 2014 
