
A note on constituent parsing for Korean

Published online by Cambridge University Press: 10 November 2020

Mija Kim
Affiliation:
Kyung Hee University, Seoul, South Korea
Jungyeul Park*
Affiliation:
University of Washington, Seattle, WA, USA
*Corresponding author. E-mail: jungyeul@uw.edu

Abstract

This study addresses widespread issues in constituent parsing for Korean, including quantitative and qualitative error analyses of parsing results. Earlier treebank grammars have been regarded as interpretable across various annotation schemes, whereas recent parsers turn out to be much harder for humans to interpret. This paper therefore aims to establish a concrete typology of parsing errors, to describe how these parsers handle sentences, and to show their statistical distribution, using state-of-the-art statistical and neural parsers. To this end, we train and evaluate statistical and neural parsing systems on the phrase-structure Sejong treebank and obtain results of up to an 89.18% F$_1$ score, which outperforms previous constituent parsing results for Korean. We also define best practices for fair comparison with future work by proposing a standard corpus division for the Sejong treebank.

Type
Article
Copyright
© The Author(s), 2020. Published by Cambridge University Press

Footnotes

Mija Kim and Jungyeul Park contributed equally.
