
Investigating translated Chinese and its variants using machine learning

Published online by Cambridge University Press: 03 April 2020

Hai Hu*
Affiliation:
Department of Linguistics, Indiana University, Bloomington, IN, USA
Sandra Kübler
Affiliation:
Department of Linguistics, Indiana University, Bloomington, IN, USA
*Corresponding author. E-mail: huhai@indiana.edu

Abstract

Translations are generally assumed to share universal features that distinguish them from texts originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show different characteristics depending on the source language. Consequently, we argue that these variants constitute different “dialects” of translations into the same target language. Studies using machine learning techniques on Indo-European languages have investigated the universal characteristics of translationese and how translations from various source languages differ. However, for typologically very different languages such as Chinese, there are only a few corpus studies that tap into the intricate relation between translations and the originals, as well as into the relations among translations themselves. In this contribution, we investigate the following questions: (1) What are the characteristics of Chinese translationese, both in general and with respect to different source languages? (2) Can we find differences not only at the lexical but also at the syntactic level? and (3) Given the characteristics identified in answering the previous questions, which of the proposed laws and universals can we corroborate with our evidence from Chinese? We use machine learning to determine the importance of different characteristics and to compare the characteristics that matter for our Chinese dataset with those previously reported in studies on English. In addition, our methodology allows us to add syntactic features, which have rarely been used to study translations into Chinese. Our results show that Chinese translations as a whole can be reliably distinguished from non-translations, even based on only five features.
More interestingly, typological traces from the source languages can often be found in their translations, thereby creating what we call dialects of translationese. For instance, translations from two Altaic languages exhibit more noun repetition and less frequent use of pronouns. Additionally, some characteristics that are not discriminative for English work well for Chinese, possibly because the distance between Chinese and the source languages is greater than that in English studies.
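
One way to picture how the importance of a characteristic can be quantified is to rank features by how well each one separates translated from original texts. The sketch below is only an illustration: the abstract does not say which ranking criterion the authors used, so information gain is assumed here, and the feature names (`pronoun_rate`, `noun_repetition`, `mean_sentence_length`) as well as all numbers are invented toy values, not data or results from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the texts at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    remainder = sum(
        len(part) / total * entropy(part) for part in (left, right) if part
    )
    return entropy(labels) - remainder


def best_gain(values, labels):
    """Best information gain over all midpoints between distinct values."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return 0.0
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(information_gain(values, labels, t) for t in thresholds)


def rank_features(rows, labels, names):
    """Return (feature name, gain) pairs, most informative first."""
    scores = {
        name: best_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical feature names and made-up values, for illustration only.
names = ["pronoun_rate", "noun_repetition", "mean_sentence_length"]
rows = [
    (0.09, 0.10, 21.0),
    (0.08, 0.12, 18.0),
    (0.10, 0.11, 25.0),
    (0.04, 0.09, 20.0),
    (0.05, 0.13, 24.0),
    (0.03, 0.10, 19.0),
]
labels = ["translated"] * 3 + ["original"] * 3

ranked = rank_features(rows, labels, names)
for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
```

Keeping only the top few entries of such a ranking and retraining a classifier on them is one plausible way to test whether a handful of features is enough to tell the two classes apart.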

Type
Article
Copyright
© Cambridge University Press 2020

