A note on constituent parsing for Korean

Mija Kim; Jungyeul Park

doi:10.1017/S1351324920000479

Published online by Cambridge University Press: 10 November 2020

Mija Kim: Kyung Hee University, Seoul, South Korea
Jungyeul Park*: University of Washington, Seattle, WA, USA
*Corresponding author. E-mail: jungyeul@uw.edu

Abstract

This study addresses common issues in constituent parsing for Korean, including quantitative and qualitative error analyses of parsing results. Earlier treebank grammars have been regarded as interpretable across various annotation schemes, whereas recent parsers are much harder for humans to interpret. This paper therefore aims to establish a concrete typology of parsing errors, to describe how these parsers handle sentences, and to show the statistical distribution of these errors, using state-of-the-art statistical and neural parsers. To this end, we train and evaluate statistical and neural parsing systems on the phrase-structure Sejong treebank and obtain results of up to an 89.18% F$_1$ score, which outperforms previous constituent parsing results for Korean. We also define best practices for fair comparison with future work by proposing a standard corpus division for the Sejong treebank.

Keywords

Constituent parsing; Sejong treebank; Error analysis; Korean

Information

Type: Article
Information: Natural Language Engineering, Volume 28, Issue 2, March 2022, pp. 199–222

DOI: https://doi.org/10.1017/S1351324920000479
Copyright: © The Author(s), 2020. Published by Cambridge University Press

Footnotes

† Mija Kim and Jungyeul Park contributed equally.
