Weighted finite-state transducers for normalization of historical texts

Published online by Cambridge University Press:  01 April 2019

Izaskun Etxeberria*
Affiliation:
IXA Group, University of the Basque Country, Donostia-San Sebastián, Spain
Iñaki Alegria
Affiliation:
IXA Group, University of the Basque Country, Donostia-San Sebastián, Spain
Larraitz Uria
Affiliation:
IXA Group, University of the Basque Country, Donostia-San Sebastián, Spain
*Corresponding author. Email: izaskun.etxeberria@ehu.eus

Abstract

This paper presents a study of methods for normalizing historical texts. These methods aim to learn the relations between historical and contemporary word forms. We have compiled training and test corpora for different languages and scenarios, and we have tried to interpret the results in relation to the features of the corpora and languages. Our proposed method, based on weighted finite-state transducers, is compared with previously published ones. It learns to map phonological changes using a noisy channel model; it is a simple solution that can achieve adequate performance with a limited amount of supervision. The compiled corpora are available for other researchers to use when comparing results. Concerning the amount of supervision for the task, we investigate how the size of the training corpus affects the results and identify some factors that help anticipate the difficulty of the task.
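
The noisy channel idea mentioned in the abstract can be illustrated with a minimal sketch: for a historical form h, choose the contemporary form c that maximizes P(c) · P(h | c). The code below is not the authors' weighted finite-state transducer system. It is a toy stand-in that assumes equal-length word pairs and a character-by-character substitution channel. The training pairs, the lexicon, its frequencies, and the names `train_channel`, `channel_logprob` and `normalize` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data (NOT from the paper's corpora).
# Each pair is (historical form, contemporary form).
PAIRS = [("vpon", "upon"), ("vnto", "unto"), ("haue", "have"), ("loue", "love")]
# Hypothetical contemporary lexicon with toy frequencies: the source model P(c).
LEXICON = Counter({"upon": 5, "unto": 2, "have": 9, "love": 4, "give": 3, "under": 3})


def train_channel(pairs):
    """Count character substitutions (contemporary -> historical).

    Assumption: only equal-length pairs are used, aligned position by position.
    """
    counts, totals = Counter(), Counter()
    for hist, mod in pairs:
        if len(hist) != len(mod):
            continue  # toy restriction: no insertions or deletions
        for h, m in zip(hist, mod):
            counts[(m, h)] += 1
            totals[m] += 1
    return counts, totals


def channel_logprob(hist, mod, counts, totals, alpha=1.0, keep=0.9):
    """log P(hist | mod) under the toy character channel.

    Smoothing assumption: unseen events back off to a prior that keeps a
    character unchanged with probability `keep` and spreads the rest over
    25 other letters.
    """
    logp = 0.0
    for h, m in zip(hist, mod):
        prior = keep if h == m else (1.0 - keep) / 25
        logp += math.log((counts[(m, h)] + alpha * prior) / (totals[m] + alpha))
    return logp


def normalize(hist, pairs=PAIRS, lexicon=LEXICON):
    """Return the lexicon word maximizing log P(c) + log P(hist | c)."""
    counts, totals = train_channel(pairs)
    total_freq = sum(lexicon.values())
    candidates = [w for w in lexicon if len(w) == len(hist)]
    if not candidates:
        return hist  # fallback: leave the historical form unchanged
    return max(
        candidates,
        key=lambda w: math.log(lexicon[w] / total_freq)
        + channel_logprob(hist, w, counts, totals),
    )
```

In this sketch the u/v alternation seen in the toy pairs transfers to unseen words, so `normalize("giue")` selects `give` from the lexicon. A transducer-based system would instead encode the channel and the lexicon as weighted automata and search their composition, which also handles insertions and deletions.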

Type
Article
Copyright
© Cambridge University Press 2019 

