
Out-domain Chinese new word detection with statistics-based character embedding

Published online by Cambridge University Press:  11 February 2019

Yuzhi Liang
Affiliation:
Department of Information Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
Min Yang
Affiliation:
Frontier Science and Technology Research Centre, Shenzhen Institutes of Advanced Technology, Shenzhen, China
Jia Zhu*
Affiliation:
Department of Computer Science, South China Normal University, Guangzhou, China
S. M. Yiu
Affiliation:
Department of Computer Science, The University of Hong Kong, Hong Kong, China
*Corresponding author. Email: jzhu@m.scnu.edu.cn

Abstract

Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words with spaces. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be treated as a tagging problem similar to part-of-speech (POS) tagging: a corpus is segmented by assigning each character a label that indicates its position within a word (e.g., “B” for the beginning of a word and “E” for the end). Chinese word segmentation appears to be well studied, and machine learning models such as conditional random fields (CRFs) and bi-directional long short-term memory (LSTM) networks have shown outstanding performance on this task. However, segmentation accuracy drops significantly when the same approaches are applied to out-domain cases, in which high-quality in-domain training data are not available. One example of an out-domain application is new word detection in Chinese microblogs, for which high-quality corpora are scarce. In this paper, we focus on out-domain Chinese new word detection. We first design a new method, Edge Likelihood (EL), for Chinese word boundary detection. We then propose a domain-independent Chinese new word detector (DICND), in which each Chinese character is represented as a low-dimensional vector whose values are segmentation-related features of the character.
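To make the character-labelling view of segmentation concrete, the following minimal Python sketch converts a segmented sentence into per-character position labels and back. The abstract names only the “B” and “E” labels; the “M” (middle) and “S” (single-character word) labels, as well as the function names `words_to_tags` and `tags_to_words`, are assumptions borrowed from the commonly used BMES convention and are not necessarily the label set used in the paper.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a list of segmented words to one position label per character.

    Assumed label set (BMES): B = word beginning, M = word middle,
    E = word end, S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty strings rather than emitting labels
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B")
            tags.extend("M" * (len(word) - 2))  # zero or more middle labels
            tags.append("E")
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation from characters and their position labels."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # keep any trailing characters of an unfinished word
        words.append(current)
    return words


if __name__ == "__main__":
    segmented = ["我", "喜欢", "自然语言"]
    labels = words_to_tags(segmented)
    print(labels)
    print(tags_to_words("".join(segmented), labels))
```

Under this view, a segmenter such as a CRF or a bi-directional LSTM only has to predict the label sequence; the word boundaries then follow deterministically from the labels.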

Type
Article
Copyright
© Cambridge University Press 2019 
