
35 - Natural Language Processing

from Part 6 - Experimental and Quantitative Approaches

Published online by Cambridge University Press: 16 May 2024

Danko Šipka
Affiliation:
Arizona State University
Wayles Browne
Affiliation:
Cornell University, New York

Summary

This chapter surveys the history and main directions of natural language processing (NLP) research in general, and for Slavic languages in particular. The field has grown enormously since its beginnings. Since 2010 especially, the volume of digital text has grown rapidly, and research has yielded an ever-greater number of highly usable applications. This growth is reflected in the rising number of NLP conferences and workshops and in their attendance. Slavic countries are no exception: several have been organising international conferences for decades, and the proceedings of these conferences are the best place to find publications on Slavic NLP research. The general course of NLP's evolution is difficult to predict. What seems certain is that deep learning, including various new types of word embeddings (e.g. contextual, multilingual) and similar ‘deep’ models, will play an increasing role. Predictions also point to the growing importance of the Universal Dependencies framework and treebanks, and of research into the theory, not only the practice, of deep learning, coupled with attempts to make the resulting models more explainable.

Type: Chapter
Publisher: Cambridge University Press
Print publication year: 2024


