Classification of regional and genre varieties of Chinese: A correspondence analysis approach based on comparable balanced corpora

Renkui Hou; Chu-Ren Huang

doi:10.1017/S1351324920000121

Published online by Cambridge University Press: 09 March 2020

Renkui Hou: College of Humanities, Guangzhou University, Guangzhou, China; Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
Chu-Ren Huang*: Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong
*Corresponding author. E-mail: churen.huang@polyu.edu.hk

Abstract

This paper proposes a robust text classification and correspondence analysis approach to the identification of similar languages. In particular, we propose using readily available information on clauses and word length distribution to model similar languages. The modeling and classification rest on the hypothesis that languages are self-adaptive complex systems and can therefore be classified by dynamic features describing the system, especially the distributional relations among its constituents. For similar languages, whose grammatical differences are often subtle, classification based on dynamic system features should be more effective. To test this hypothesis, we considered both regional and genre varieties of Mandarin Chinese for classification. The data are extracted from two comparable balanced corpora to minimize possible confounding factors. The two corpora are the Sinica Corpus from Taiwan and the Lancaster Corpus of Mandarin Chinese from Mainland China, and the two genres are reportage and review. Our text classification and correspondence analysis results show that the linguistically felicitous two-level constituency model, which combines power functions between words and clauses, effectively classifies the two varieties of Chinese for both genres. In addition, we found that genre does have a compounding effect on the classification of regional varieties. In particular, reportage in the two varieties is more readily classified than review, corroborating the complex system view of language variation: variation and change typically do not take place evenly across the complete language system. This further supports our hypothesis that dynamic complex system features, such as the power functions captured by the Menzerath–Altmann law, provide effective models for the classification of similar languages.
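As a rough illustration of the kind of power function the abstract refers to, the sketch below fits y = a * x**b (the truncated power-law form commonly associated with the Menzerath–Altmann law) by linear regression in log-log space. It is not the authors' implementation: the function name `fit_power_function` and all numeric values are hypothetical and generated only for demonstration.

```python
import numpy as np


def fit_power_function(x, y):
    """Fit y = a * x**b by least squares on log-transformed data.

    Taking logs gives log(y) = log(a) + b * log(x), a straight line
    whose slope is b and whose intercept is log(a).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    return float(np.exp(log_a)), float(b)


# Hypothetical data: construct length (in constituents) versus mean
# constituent length. The values are synthetic, not taken from the
# Sinica Corpus or the Lancaster Corpus of Mandarin Chinese.
construct_lengths = np.array([1, 2, 3, 4, 5, 6, 8, 10])
mean_constituent_lengths = 2.1 * construct_lengths ** -0.15

a, b = fit_power_function(construct_lengths, mean_constituent_lengths)
print(f"a = {a:.3f}, b = {b:.3f}")
```

In a setup like the one described, fitted parameters such as a and b could plausibly serve as per-text features for a classifier, although the abstract does not specify the exact feature set.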

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 6: Natural Language Processing for Similar Languages, Varieties, and Dialects, November 2020, pp. 613–640

DOI: https://doi.org/10.1017/S1351324920000121
Copyright: © Cambridge University Press 2020

