
Efficiently generating correction suggestions for garbled tokens of historical language

Published online by Cambridge University Press:  21 March 2011

ULRICH REFFLE*
Affiliation:
Centrum für Informations- und Sprachverarbeitung, University of Munich, Germany. Email: uli@cis.uni-muenchen.de

Abstract

Text correction systems rely on a core mechanism where suitable correction suggestions for garbled input tokens are generated. Current systems, which are designed for documents including modern language, use some form of approximate search in a given background lexicon. Due to the large amount of spelling variation found in historical documents, special lexica for historical language can only offer restricted coverage. Hence historical language is often described in terms of a matching procedure to be applied to modern words. Given such a procedure and a base lexicon of modern words, the question arises of how to generate correction suggestions for garbled historical variants. In this paper we suggest an efficient algorithm that solves this problem. The algorithm is used for postcorrection of optical character recognition results on historical document collections.
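To make the problem statement concrete, the following minimal sketch shows a naive baseline. It is not the efficient algorithm proposed in the paper. It expands each modern lexicon word into hypothetical historical variants using rewrite patterns, then keeps the variants within a small Levenshtein distance of the garbled token. The function names, the toy lexicon, and the two patterns are illustrative assumptions and do not come from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def historical_variants(word, patterns):
    """Return every string obtained by optionally rewriting each
    occurrence of a pattern's left side with its right side.
    Exponential in the worst case; acceptable only for a toy example."""
    def expand(i):
        if i == len(word):
            return {""}
        # Option 1: copy the current character unchanged.
        results = {word[i] + rest for rest in expand(i + 1)}
        # Option 2: apply any pattern whose left side starts at position i.
        for left, right in patterns:
            if word.startswith(left, i):
                results |= {right + rest for rest in expand(i + len(left))}
        return results
    return expand(0)


def suggest(token, lexicon, patterns, max_dist=1):
    """Brute-force suggestions: (distance, historical variant, modern word)."""
    hits = []
    for modern in lexicon:
        for variant in historical_variants(modern, patterns):
            d = levenshtein(token, variant)
            if d <= max_dist:
                hits.append((d, variant, modern))
    return sorted(hits)


# Hypothetical toy data: modern German words and two spelling patterns.
LEXICON = ["teil", "turm"]
PATTERNS = [("t", "th"), ("ei", "ey")]

if __name__ == "__main__":
    # "thcil" stands for an OCR-garbled reading of the historical form "theil".
    print(suggest("thcil", LEXICON, PATTERNS, max_dist=1))
```

In this toy setting the garbled token is two edits away from the modern word "teil", so a distance-1 search in the modern lexicon alone would miss it. The hypothetical variant "theil" is only one edit away. Enumerating all variants of every lexicon entry does not scale, which is the cost an efficient algorithm has to avoid.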

Type
Papers
Copyright
Copyright © Cambridge University Press 2011

