1. Introduction
Syntactic parsing is one of the areas in Natural Language Processing that has profited immensely from deep learning, especially from the development of large-scale pre-trained language models. One of the most difficult problems in the pre-neural era was data sparsity: Manually annotated treebanks are expensive and tend to be small, so that parsers frequently encountered unknown words and unknown constructions. Neural models have proven to be better suited for making correct parsing decisions, and embeddings working on the subword level reduce the problems of handling unknown words (Vania, Grivas, and Lopez 2018).
Another achievement resulting from the use of neural models is cross-lingual parsing, where we train on a source language and then parse a different target language, with a multilingual language model, such as multilingual Bidirectional Encoder Representations from Transformers (mBERT) (Devlin et al. 2019), serving as a bridge (Das and Sarkar 2020). This is a solution for many low-resource languages for which no treebanks exist. However, while cross-lingual transfer learning has shown benefits in cross-lingual parsing (Das and Sarkar 2020), certain underlying assumptions are often made in the process. One such assumption is that we can successfully leverage the orthographic representations of languages (henceforth: scripts) to facilitate information sharing. This is particularly pertinent when sharing lexical information, such as word or character embeddings, or embeddings derived from pre-trained multilingual language models: Although transfer has improved performance, its success is still influenced by the data used to create the embeddings (Joshi, Peters, and Hopkins 2018; Dakota 2021). Additionally, how the scripts of languages are handled by the tokenizer of the language model becomes central to its ability to deal with the large number of word representations to which it is exposed during training and testing. A central question then arises: How effective are such lexicalized approaches when the language is not only low-resource but also written in a unique script that cannot benefit from lexical sharing? And what additional strategies can be employed in such cross-lingual settings?
To examine these questions, we investigate how to effectively develop dependency parsing strategies for Xibe, which is not only a low-resource language but is also written in a unique script that is not present in pre-trained language models, nor fully covered by any related language. We investigate lexicalized (Kübler et al. 2009) and delexicalized parsing (Zeman and Resnik 2008). The latter approach makes cross-lingual parsing feasible in settings where large multilingual language models are not available or insufficient.
In a first step, we investigate monolingual parsing before performing cross-lingual parsing experiments, using both lexicalized and delexicalized approaches to determine the difficulties that arise when applying standard cross-lingual approaches. We find that word plus character embeddings yield a high performance in monolingual parsing, which often deteriorates when language models are included, due to their inability to handle the Xibe script. Our best results are obtained with a lexicalized model using word and character embeddings plus the Pre-trained Language Model for Chinese Minority Languages (CINO; Yang et al. 2022). But we reach competitive results using delexicalized monolingual approaches, as cross-lingual approaches suffer either from an inability to transfer knowledge due to script incompatibilities or from conflicting syntactic constructions between the various source languages and Xibe.
In summary, the major contributions of our work presented here are the following: (1) We investigate parsing Xibe, a language with a unique script. To our knowledge, the only relevant work on that language is our work on delexicalized parsing of Xibe (Zhou and Kübler 2021). Our results on suitable source languages build the basis for our multilingual experiments here. (2) We show that monolingual lexicalized parsing is successful when trained on a small set of sentences, in combination with word and character embeddings. (3) We show that pre-trained multilingual language models do not improve performance in such a setting. (4) We show that monolingual delexicalized parsing is a competitive alternative. (5) We show that multilingual parsing results in a lower performance.
We first cover relevant cross-lingual parsing literature in Section 2 before introducing the Xibe language and treebank in Section 3. We present our methodology in Section 4 and initial monolingual parsing results in Section 5, where we also discuss how Xibe is handled by various embedding representations and analyze delexicalized performance. Cross-lingual approaches are presented in Section 6, covering both single-source and multi-source approaches, with a particular focus on how differences between languages and annotations in treebanks can impact performance. We conclude in Section 7.
2. Related work
Since there is a wide range of literature on parsing in general, and on cross-lingual parsing more specifically, we restrict our overview to cross-lingual dependency parsing, with a specific focus on the different lexicalized and delexicalized approaches.
Cross-lingual transfer learning has been useful in improving the accuracy of low-resource target languages and has been applied in a multitude of tasks (Lin et al. 2019). This process refers to the application (and potentially adaptation) of resources and models from high-resource source languages to low-resource target languages on different levels, with the assumption that existing similarities between languages can be exploited.
The main challenge for cross-lingual parsing is to reduce the discrepancies on different levels (such as in the annotation, or on the lexical level with respect to scripts) between the source language and the target language. For dependency parsing, there are four main cross-lingual parsing approaches: annotation projection (Yarowsky, Ngai, and Wicentowski 2001; Hwa et al. 2005), treebank translation (Tiedemann, Agić, and Nivre 2014; Tiedemann and Agić 2016), multilingual parsing models (Duong et al. 2015b; Ammar et al. 2016; Kondratyuk and Straka 2019), and model transfer (Zeman and Resnik 2008; McDonald, Petrov, and Hall 2011).
Annotation projection uses word alignment to project annotations from the source to the target language; treebank translation refers to the automatic translation of treebank data, while in model transfer approaches, models trained on source language treebanks are directly applied to parse target languages. For Xibe, there are currently no parallel corpora or machine translation systems available, which makes model transfer the most feasible approach.
2.1 Delexicalized approaches
A common approach to cross-lingual parsing is delexicalization. The strategy was first employed in constituency parsing by Zeman and Resnik (2008), who replaced words with their morphological (part-of-speech (POS)) representations when adapting a Danish treebank to Swedish. Delexicalization results in similar input to the parser, independent of the lexical differences between the languages. This approach produced a model that matched the performance of a model trained on approximately 1,500 sentences from the target language. The approach sparked several concurrent research directions. McDonald et al. (2011) concatenated multiple delexicalized source treebanks when training a dependency parser for a target language. Results were noticeably better than unsupervised parsing approaches. Simultaneous work by Cohen et al. (2011) used delexicalized models to provide initial parameters for unsupervised parsing when using treebanks with non-parallel data. Rosa and Žabokrtský (2015b) trained an MSTParser model interpolation as an alternative for multi-source cross-lingual delexicalized dependency parser transfer, finding that performance was comparable to parse tree combinations, but computationally less expensive. The work by Rosa (2015) involved the training of several independent parsers which were applied to the same input sentence. The resulting tree was obtained by finding the maximum spanning tree of a weighted directed graph of the potential parse tree edges from the different parsers; this yielded better trees than those generated from a single model, with the largest increases in lower-resource settings.
One persistent issue in cross-lingual modeling is optimal data source selection for specific target data. Søgaard (2011) showed that using perplexity for data point selection in a cross-lingual setup, that is, selecting only similar sentences from the source treebank to parse the target language, is more suitable than utilizing the entire source data. Rosa and Žabokrtský (2015a) used the KL divergence (Kullback and Leibler 1951) of POS trigrams to detect language similarity for source and target selection in delexicalized dependency parsing. Results showed good performance for single-source transfer and were further strengthened in a multi-source setup, but results seemingly degrade the further apart two languages are linguistically, suggesting that additional language characteristics may be required. In a similar vein, Dozat et al. (2017) used delexicalized language family models to parse additional surprise languages within the target family, assuming that language families share closer POS distributions and syntactic characteristics. Parallel work by Das et al. (2017) used transformations of syntactic features of the source languages to create delexicalized models to improve source-target selection criteria when parsing an unknown target language.
However, linguistic relatedness is no guarantee of optimal source selection. Lynn et al. (2014) found that Indonesian was a better source language than the more closely related Indo-European languages when parsing a Celtic language in a cross-lingual setting. They noted that specific linguistic/annotation phenomena in treebanks, such as nominal modifications and long-range dependencies, can be more closely aligned in a specific language/treebank pair, the impact of which corresponds to the prevalence of such phenomena in the target language.
Little delexicalized work exists for parsing Xibe. Zhou and Kübler (2021) performed cross-lingual experiments using three source selection criteria: typology, perplexity, and LangRank, noting that syntactic similarity was the most important factor in source selection, with Japanese being the optimal source language.
2.2 Lexicalized approaches
While delexicalized parsing seems to be a natural choice in cross-lingual experiments due to lexical differences between source and target languages, it also means that the parser has to rely on a very coarse-grained POS tagset, the Universal Dependencies (UD) POS tagset, to model syntactic phenomena. Consequently, lexicalized cross-lingual parsing has in some cases yielded superior results (Falenska and Çetinoğlu 2017). Early work by Täckström et al. (2012) used parallel data to induce cross-lingual word clusters, which were added as features for their delexicalized parser. Results showed a relative error reduction across treebanks of up to 13 percent, and importantly, such features never had a negative impact on model performance. Xiao and Guo (2014) proposed that source and target language words with the same meaning share a common embedding. The embeddings are jointly trained with a neural model and are used for dependency parsing. Results showed superior cross-lingual generalizability across nine languages.
A hybrid representation was proposed by Duong et al. (2015a), where cross-lingual embeddings were generated using lexical and POS representations, yielding more syntactically aware embeddings. Results were higher than for corresponding delexicalized baselines. Ahmad et al. (2019) explored encoder representations pairing English as source language against a typologically diverse set of thirty target languages. While training was performed only on English, at test time, embeddings derived from pre-trained multilingual models were used as input, projected into a shared embedding space, and aligned with the English embeddings. They found that RNN encoders were better for target languages that were syntactically closer to English, while transformers are more flexible with word order modeling. He et al. (2019) focused on cross-lingual approaches to more distant languages by using URIEL (Littell et al. 2017), which encodes information such as typological, geographical, and phylogenetic features of languages in vector representations, to determine language distances. They utilized a structured flow model to induce interlingual embeddings to enhance the ability to learn and share syntactic information on a target language. Results using English as the source yielded improvements on a set of ten distant languages.
While pre-trained language models such as BERT (Devlin et al. 2019) have resulted in substantial performance gains in cross-lingual experiments with the use of pre-trained multilingual embeddings (e.g., mBERT), they still possess some limitations. Since those models split words into subwords, subword overlap between source and target language can be a potential indicator for source language selection (Wu and Dredze 2019). However, such models do not necessarily yield optimal results for very low-resource languages. Wu and Dredze (2020) found that subword pieces of low-resource languages may be present in the base vocabulary of mBERT, but this does not mean that these represent the most relevant subword types in the target language, which may be missing altogether. They note that one of the main contributing factors to the limitations of mBERT is the simple lack of data availability and quality for low-resource languages, and they caution against relying solely on such paradigms in low-resource settings. Work by Rust et al. (2021) examining the tokenizers in monolingual and multilingual models shows the impact that monolingual tokenizers have on performance: When using a multilingual model, simply replacing the multilingual tokenizer with a monolingual tokenizer for the target language improves performance on multiple tasks and languages. Recent work by Blaschke et al. (2023) examined how subword overlap aids in source and target selection for cross-dialect POS tagging. They use metrics including the split word ratio difference (the ratio of words split into subwords between the source and target), the overlap of seen subwords in the source and target dialects, and the subword-level type-token ratio. Experiments on a range of dialects across multiple languages indicate that similarity in the split word ratio is the strongest indicator of source and target performance.
3. The Xibe language and treebank
3.1 Xibe: A severely endangered Tungusic language in China
The Xibe language (ISO 639-3: sjo) is a Tungusic language used by the Xibe ethnic group in China. According to the Seventh National Population Census conducted in 2020 (Office for the National Population Census 2022), the Xibe ethnic group has around 190,000 members, who are mainly distributed across the northeastern provinces (Liaoning, Jilin, and Heilongjiang) and the Xinjiang Uyghur Autonomous Region. For historical reasons, the Northeast Xibe have almost lost their language, whereas the Xinjiang Xibe still actively use their native language.
The modern Xibe language is mainly spoken in the Cabcal Xibe Autonomous County and its adjacent areas, including Huocheng, Tacheng, Gulja, and Urumqi (Gorelova 2002). However, the actual number of speakers ranges between 10,000 and 50,000 (Chog 2016a); it is in decline as the social function of the language is weakening. UNESCO recognizes Xibe as a severely endangered language, that is, the language is spoken by grandparents and older generations; the parent generation may understand it, but they do not speak it to their children or among themselves. Therefore, documenting this language is necessary from the perspectives of linguistic research and the preservation of culture.
Xibe shows significant differences between its spoken and written forms. Spoken Xibe is a collection of dialects, and there is no standard. Previous linguistic research on the Xibe language mostly focused on documenting Xibe dialectal differences or studying the phonology and morphology of spoken Xibe (Norman 1974; Li 1979, 1982, 1984, 1985, 1988; Jang 2008; Zikmundová 2013). Written Xibe uses the Xibe script and is used less often than spoken Xibe, as the younger generation mostly cannot read or write the script (Chog 2016b). Previous studies related to written Xibe are mainly concerned with comparisons to spoken Xibe or to literary Manchu, a closely related language (Gu 2016). In this study, we focus on parsing written Xibe.
3.2 Writing system
The modern Xibe writing system is slightly modified from the Manchu script tongki fuka sindaha hergen (Eng.: letters with circle and dot), which is derived from the traditional Mongolian writing system. The Xibe script is written vertically from top to bottom. Xibe has five vowels and nine consonants, plus ten "foreign" letters, which are specifically used for loanword transliteration (Šetuken 2009), shown in Table 1. In addition, each Xibe letter has three shapes depending on its position in a word: initial, medial, or final. Vowels additionally have an isolated shape, as they can be used alone.
According to the Unicode Standard Version 15.0 (The Unicode Consortium 2022), Xibe graphemes are encoded partly using traditional Mongolian graphemes and partly using graphemes specific to Xibe. Xibe shares twelve graphemes with traditional Mongolian and uses twenty-one Xibe-specific graphemes; see Table 2.
3.3 The Xibe Universal Dependencies Treebank
Zhou et al. (2020a) annotated a small dependency treebank based on the Universal Dependencies framework (de Marneffe et al. 2021). The treebank currently contains a total of 1,200 sentences, including 544 grammar examples collected from a written Xibe grammar book (Šetuken 2009), 266 sentences from the Cabcal Newspaper, and 390 sentences from the introductory Xibe textbooks Nimangga Gisun (Eng.: Mother Tongue).
In the treebank, annotation for each tree includes sentence-level and token-level information. Sentence information includes the Xibe sentence, Latinized transliteration and English translation. Each token is annotated for lexical, morphological and syntactic information. Figure 1 shows an example annotation for a simple sentence. meiherehebi (Eng.: have carried) is the ROOT, in sentence final position. The sentence has a core argument and an adjunct: suduri i ujen tašan (Eng.: serious duty of history) is the object marked by the accusative case marker be, and musei meiren is the adjunct marked by the locative case marker de. Xibe is a pro-drop language; in this sentence, the subject is dropped and consequently does not occur in the annotation. Within the object phrase, there is a nominal modifier marked by the genitive case i.
3.4 Xibe syntax
As a Tungusic language, Xibe shares a range of morphological and syntactic traits with other languages from the same language family: All have agglutinative morphology and use Subject-Object-Verb (SOV) word order. However, the languages differ in the degree of agglutination. In comparison to other Tungusic languages such as Evenki and Nanai, Xibe (as well as Manchu and Jurchen) has less inflectional morphology, which is assumed to be the result of language contact with Sinitic languages (Whaley and Oskolskaya 2020).
Xibe sentences follow a rigid SOV word order. Arguments are marked for case, but case marking is optional (see Section 3.4.1 for more details). Phrases are consistently head-final, and subordinate clauses are located before the head they modify. The verbs of subordinate clauses are in non-finite form. Adverbial clauses are headed by converbs (see Section 3.4.2), and adnominal clauses are headed by participles. In written Xibe, inflectional morphology is mainly present on nouns and verbs. Nouns inflect for number and case. The verbal morphology of Xibe is the most complex: A verb consists of a verb stem and an inflectional suffix. Inflectional suffixes include finite and non-finite verb suffixes. Additional verbal suffixes express tense, aspect, voice, and mood.
For the remainder of this section, we focus on the two syntactic phenomena that are relevant for understanding our parsing results, namely the case marking system and converbs.
3.4.1 Case marking
Written Xibe has eight cases, as shown in Table 3. Subjects are unmarked. There are two cases of syncretism: i marks genitive or instrumental case, and ci marks ablative or lative case. The case markers follow their nouns and can be written either as suffixes attached to the nouns or as independent tokens. Following the UD guidelines (de Marneffe et al. 2021), separate case markers depend on their preceding nouns. If the case marker is suffixed to the noun, it is not annotated separately.
The example tree in Figure 1 has three separately written case markers. The sentence in Figure 2 has an accusative and a lative case marker, but the lative case is suffixed to boo (Eng.: home), which depends on the verb as an oblique.
Note that the case markers are clear indicators of the grammatical function of the noun, and the separate case markers are accessible for the parser in lexicalized parsing settings. Whether the suffixes are also accessible may depend on the amount of training data. However, in delexicalized parsing, none of this information is available.
3.4.2 Converbs
Non-finite verb forms in Xibe consist of converbs and participles. Converbs are non-finite verbs that express adverbial subordination. This syntactic construction is typical for Tungusic, Turkic, and Mongolic languages. Xibe uses a range of converb suffixes, denoting different aspectual or modal meanings, as shown in Table 4. In example (1), the suffix -hai in hadahai is the durative suffix, which adds the meaning that an action continues at the same time as another action.
(1) yasa hadahai tuwambi
    eye stare.conv look.pres
    "The eye keeps staring."
In the Xibe UD treebank, converbs are dependents of their matrix verbs, via an adverbial clause (advcl) relation. This relation accounts for 9.04 percent of all dependencies in the Xibe treebank.
(2) ajige jui eme be sabufi injeršeme feksime jimbi
    small child mother acc see.conv laugh.conv run.conv come.pres
    "Having seen (his) mother, the small kid laughs and runs over."
Example (2) has a perfect converb, sabufi, and two imperfect converbs, injeršeme and feksime. In the annotation shown in Figure 3, the perfect converb sabufi depends on the main verb of the sentence, jimbi, indicating that the action of "seeing" is completed before the action "to come" is undertaken. The two imperfect converbs carry the suffix -me, which denotes simultaneity of actions. In the example sentence, injeršeme depends on the other imperfect converb, feksime, as an accompanying action, and feksime in turn depends on the main verb as an accompanying action.
It is clear that in a cross-lingual setting, this construction can only be successfully parsed when the source language also has similar converb constructions.
4. Experimental setup
4.1 Treebank data
We use Xibe as our target language. The current Xibe treebank is relatively small, with approximately 1,200 sentences, and, following UD guidelines for small treebanks, provides only a test set (see Table 5).
For our source languages, we select languages based on language typology. Xibe is a Tungusic language. The Tungusic languages, together with the Mongolic and Turkic languages, belong to the Altaic language family (Robbeets 2020). When we consider the broader category of the Transeurasian language hypothesis (Robbeets 2020), Korean and Japanese are also considered close relatives of Xibe. We have decided to use both the Altaic and Transeurasian language neighbors since these language families exhibit similar linguistic characteristics, especially on the morpho-syntactic and syntactic level. Such shared characteristics are often important criteria for determining optimal source languages for cross-lingual parsing. Additionally, these characteristics have already been shown to be of high relevance for Xibe in cross-lingual settings (Zhou and Kübler 2021).
The similarities in morphology and syntax between Xibe and the other Transeurasian languages (Robbeets and Savelyev 2020) mainly include the following:
1. They have a predominant SOV word order, and the predicate verb is strictly located in sentence-final position.
2. Phrases follow a head-final word order.
3. Transeurasian languages all have a high degree of morphological agglutination, where bound morphemes are predominantly suffixing.
In UD version 2.10, there are twenty-five treebanks representing nine Transeurasian languages. Following Zhou and Kübler (2021), we select ten treebanks in six languages, as shown in Table 5, and exclude three languages, Old Turkic, Tatar, and Yakut, due to their limited number of available trees (fewer than one hundred).
4.2 Data split
For most treebanks, we use the training, development, and test splits provided by the UD treebanks. For Xibe, Buryat, and Kazakh, however, only small treebanks with few non-test sentences are available. For these languages, we choose to split the data into three folds per treebank.
For the monolingual Xibe experiments (Section 5.1), this means we run threefold experiments in which the folds serve as training, development, and test set, respectively. For all other experiments, where Xibe is used only for testing, we use the full treebank as test set.
For the single-source cross-lingual experiments (Section 6.1), since Buryat and Kazakh are only used for training, we combine two folds into a training set while using the third as development set.
For multi-source cross-lingual experiments (Section 6.3), we concatenate training, development and test data of all selected treebanks. Then we use 80 percent for training and the remaining 20 percent as development set. For the training/development split, we use perplexity-based stratified sampling to ensure that both sets have the same distribution in terms of perplexity.
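To make the split concrete, the following sketch shows one way to implement perplexity-based stratified sampling, assuming per-sentence perplexities have already been computed; the number of strata (n_bins) and the random seed are illustrative choices, not taken from our actual setup.

```python
import random

def perplexity_stratified_split(sentences, perplexities, train_frac=0.8, n_bins=10, seed=42):
    """Split sentences into train/dev so both sets share the perplexity distribution.

    Sentences are sorted by perplexity, chunked into contiguous quantile bins,
    and each bin is split train_frac / (1 - train_frac).
    """
    rng = random.Random(seed)
    order = sorted(range(len(sentences)), key=lambda i: perplexities[i])
    train, dev = [], []
    for b in range(n_bins):
        # contiguous slice of the perplexity-sorted indices = one quantile bin
        bin_idx = order[b * len(order) // n_bins:(b + 1) * len(order) // n_bins]
        rng.shuffle(bin_idx)
        cut = int(len(bin_idx) * train_frac)
        train += [sentences[i] for i in bin_idx[:cut]]
        dev += [sentences[i] for i in bin_idx[cut:]]
    return train, dev
```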
4.3 Multilingual word embeddings
For pre-trained multilingual language models, we use mBERT (multilingual Bidirectional Encoder Representations from Transformers; Devlin et al. 2019), cross-lingual RoBERTa (Robustly Optimized BERT Approach; XLM-R; Conneau et al. 2020), and the Pre-trained Language Model for Chinese Minority Languages (CINO; Yang et al. 2022). The three pre-trained language models are trained on corpora covering different sets of languages and use different tokenizers.
mBERT is a large-scale pre-trained language model trained on 104 languages, which include typological relatives of Xibe: Kazakh, Turkish, Tajik, Kyrgyz, Tatar, Uzbek, Korean, and Japanese. It uses the WordPiece tokenizer (Wu et al. 2016), which breaks words into subword pieces.
XLM-R is a multilingual model trained on CommonCrawl data covering one hundred languages, including Kazakh, Kyrgyz, Turkish, Uyghur, Uzbek, Mongolian (Cyrillic), Korean, and Japanese. XLM-R uses the SentencePiece tokenizer (Kudo and Richardson 2018), which encodes a sentence as a Unicode sequence and decodes the tokenized Unicode sequence back to text.
CINO is built on XLM-R but has been adapted for minority languages by resizing its vocabulary and adopting a fast masked language modeling objective for pre-training. The added languages include Cantonese, Tibetan, traditional Mongolian, Uyghur, Kazakh in Arabic script, Latinized Zhuang, and Korean. The CINO tokenizer also extends the XLM-R tokenizer by merging in a Tibetan tokenizer and a traditional Mongolian tokenizer so that it can recognize these two languages (Yang et al. 2022).
All three language models are trained on more than one hundred languages and include syntactically similar languages, such as Turkic languages, Korean and Japanese, but to the best of our knowledge, Xibe is not present in any of the language models.
A complication arises from the fact that Xibe is written in a script derived from the Manchu alphabet, which in turn is based on the traditional Mongolian script. Although the Xibe alphabet is included in the Unicode block for traditional Mongolian, only twelve out of thirty-five Xibe letters share their Unicode encodings with traditional Mongolian, while the remaining twenty-three letters use unique encodings. Therefore, we assume that mBERT and XLM-R will be at a serious disadvantage, while CINO can be assumed to have better, but still limited, capability in recognizing Xibe (see Section 5.2 for more details).
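The encoding situation can be inspected with Python's unicodedata module, since Unicode names the Sibe-specific letters "MONGOLIAN LETTER SIBE …" inside the Mongolian block (U+1800–U+18AF). A small sketch; treating a "SIBE" substring in the character name as the marker of a Xibe-specific code point is our simplification:

```python
import unicodedata

# Scan the Unicode Mongolian block and collect Sibe-specific letters,
# i.e., code points whose official names contain "SIBE".
sibe_letters = []
for codepoint in range(0x1800, 0x18B0):
    name = unicodedata.name(chr(codepoint), "")
    if "SIBE" in name:
        sibe_letters.append((hex(codepoint), name))

print(f"{len(sibe_letters)} Sibe-specific code points, e.g.:")
for cp, name in sibe_letters[:3]:
    print(cp, name)
```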
4.4 Parser
We use the implementation of the deep biaffine parser (Dozat and Manning 2017; Dozat et al. 2017) by Sayyed and Dakota (2021). The parser is a neural graph-based dependency parser. It uses a biaffine classifier, replacing traditional bilinear or MLP-based scoring with biaffine attention. The deep biaffine attention allows more relevant information to be retained before it is used in the biaffine classifier.
The main reason for choosing this implementation is that the default parser implementation requires word embeddings, which are then concatenated with an additional embeddings model (e.g., word+POS, word+char). This behavior makes pure delexicalized experiments difficult: We would have to replace words with their POS tags (see Section 4.5), and the parser would concatenate these delexicalized embeddings with another embeddings model by default, for example, delexicalized+char or delexicalized+POS; the latter would be a reduplication of POS embeddings. The implementation by Sayyed and Dakota (2021) has an option that enables single embeddings to be used without any concatenation, resulting in only a single delexicalized embeddings model when words are replaced with their POS tags, avoiding both embeddings concatenation and any reduplication.
We use the default hyper-parameters of the original base parser (Zhang, Li, and Min 2020), shown in Table 6. The word, POS tag, and character embeddings are randomly initialized. For the embeddings derived from the pre-trained language models, a scalar mixture of the last four layers is passed through a linear layer to produce embeddings of the specified dimension (Peters et al. 2018; Tenney, Das, and Pavlick 2019a; Tenney et al. 2019b).
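The scalar mixture works roughly as sketched below, assuming a PyTorch setting; the layer count, hidden size, and output dimension are placeholders, and details may differ from the parser implementation we use.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learned scalar mixture of the last k transformer layers, followed by a
    linear projection to the embedding dimension the parser expects."""

    def __init__(self, num_layers: int = 4, hidden_size: int = 768, out_dim: int = 100):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.gamma = nn.Parameter(torch.ones(1))                    # global scaling factor
        self.proj = nn.Linear(hidden_size, out_dim)

    def forward(self, layers):
        # layers: list of tensors (batch, seq_len, hidden_size), e.g. the last four layers
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = self.gamma * sum(w * h for w, h in zip(weights, layers))
        return self.proj(mixed)  # (batch, seq_len, out_dim)
```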
4.5 Delexicalization
Since the Xibe alphabet is not represented in the training data of the pre-trained language models, we also experiment with delexicalized models of the parser. This will eliminate the script problem, but it reduces the parser’s input to the coarse UD POS tags. We delexicalize all treebanks by replacing words by their UD POS tags. While this severely limits the amount of information available to the parser, it is a standard technique for pre-neural cross-lingual parsing (see Section 2.1).
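Operationally, delexicalizing a UD treebank amounts to a one-pass rewrite of its CoNLL-U files. A minimal sketch; whether the LEMMA column is also replaced is our assumption:

```python
def delexicalize_conllu(in_path: str, out_path: str) -> None:
    """Rewrite a CoNLL-U file, replacing each token's FORM and LEMMA with its UPOS tag."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            cols = line.rstrip("\n").split("\t")
            # Only plain token lines have 10 columns and an integer ID;
            # comments, multiword-token ranges, and empty nodes pass through unchanged.
            if len(cols) == 10 and cols[0].isdigit():
                cols[1] = cols[3]  # FORM  := UPOS
                cols[2] = cols[3]  # LEMMA := UPOS
                dst.write("\t".join(cols) + "\n")
            else:
                dst.write(line)
```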
4.6 Evaluation
We evaluate all experiments using the CoNLL 2018 Shared Task scorer (Zeman et al. 2018). Note that the scorer ignores dependency subtypes. Consequently, our evaluation is based on the main types, even though subtypes are present during training and are assigned in the parsing process. We report the unlabeled (UAS) and labeled attachment score (LAS), both macro-averaged, but mainly focus on the analysis of LAS. We additionally provide analyses by dependency label and POS tag.
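For reference, once gold and predicted trees are aligned, the two metrics reduce to simple per-token comparisons. A sketch that also strips subtypes, as in our main-type evaluation (the real scorer additionally handles tokenization mismatches):

```python
def attachment_scores(gold, pred):
    """Sentence-level UAS/LAS. gold and pred are lists of (head, deprel) pairs,
    one per token; deprel subtypes ('acl:relcl' -> 'acl') are stripped."""
    n = len(gold)
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred)) / n
    las = sum(gh == ph and gd.split(":")[0] == pd.split(":")[0]
              for (gh, gd), (ph, pd) in zip(gold, pred)) / n
    return uas, las
```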
For significance testing, we use Dan Bikel's Randomized Parsing Evaluation Comparator (Yeh 2000). The null hypothesis ($H_0$) is that for each test sentence, the scores obtained from the two models are equally likely. For each setting, the weighted average UAS and LAS are calculated per model and per sentence. To determine significance, the script loops over the UAS and LAS scores of each sentence and randomly chooses whether to switch the sentence's scores across models. After a single pass, the scorer checks whether the new weighted UAS and LAS scores are more extreme than the originally calculated averages. This procedure is repeated for a predefined number of iterations, after which a form of randomized paired-sample t-test is performed based on how often the new averages were more extreme. The results are the $p$-values for the differences in UAS and LAS between the two models. We only report significance testing for LAS since, in our experiments, both scores show the same trends.
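This procedure corresponds to a standard approximate randomization test over per-sentence scores. A sketch; the comparator's exact weighting of sentences by length is omitted here:

```python
import random

def randomization_test(scores_a, scores_b, iterations=10_000, seed=0):
    """Approximate randomization test on paired per-sentence LAS scores.
    Returns the p-value for the observed difference between two models."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)
    more_extreme = 0
    for _ in range(iterations):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap this sentence's scores across models
                a, b = b, a
            diff += a - b
        if abs(diff / n) >= observed:
            more_extreme += 1
    return (more_extreme + 1) / (iterations + 1)
```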
5. Monolingual parsing
In this section, we investigate parsing quality when training and testing on the small Xibe treebank. This gives us a (potentially competitive) baseline for a setting in which we have a small number of sentences available. It is an open question whether pre-trained multilingual word embeddings can provide useful information in a situation where neither the target language nor its script is included in the embeddings. We assume that the subword embeddings from these models can still provide useful information since the word embeddings are trained on a small training set of only 400 sentences.
For the experiments, we train both lexicalized and delexicalized models. For the lexicalized models, we can use different combinations of representations as features: We use POS tag embeddings, character embeddings, and embeddings from large-scale pre-trained language models (see Section 4.3 for a description) in combination with word embeddings derived from the training data. The delexicalized model replaces words by their corresponding POS tags.
5.1 Results
The results for the monolingual experiments using different feature representations are shown in Table 7. When using only word embeddings trained on the Xibe training set, we reach an LAS of 56.57. Adding POS embeddings (WORD + POS) decreases the results, but adding character embeddings (WORD + CHAR) improves results, as expected, to an LAS of 62.34. That a concatenation of word and character embeddings can improve parsing performance has previously been shown across typologically diverse languages (Sayyed and Dakota 2021), pointing to their superior ability to handle out-of-vocabulary (OOV) words (Vania et al. 2018).
$diff_{LAS}$ shows the LAS difference to the best model (WORD + CHAR + CINO); * indicates significant differences ($p_{LAS} < 0.05$).
Results for using the pre-trained multilingual language models are mostly negative, with LAS results around or slightly below the results for word embeddings. The only exception is the setting when we use the combination of word embeddings, character embeddings, and CINO subword embeddings (WORD + CHAR + CINO). This results in the highest LAS of 62.47.
Table 7 also shows the difference in LAS between the best model and each of the remaining models, along with significance results for this comparison. Note that there is no significant difference between the best performing model and WORD + CHAR, WORD + CHAR + POS, and WORD + CHAR + POS + CINO, thus indicating that CINO’s contribution is rather minimal. We assume that this may be due to the fact that the Xibe script has only minimal overlap with Mongolian in CINO, but no overlap with any of the other languages.
However, these results raise a number of questions: Is the missing overlap in scripts the reason for the poor performance of the pre-trained multilingual embeddings? Why do the POS embeddings perform so poorly given that they provide gold POS information? How severe is the OOV problem when using Xibe word embeddings? We will investigate these issues below.
5.2 A closer look at pre-trained language models
Our results for monolingual parsing in Table 7 are somewhat surprising if we ignore the missing overlap in scripts: None of the pre-trained language models manage to outperform a model of Xibe word embeddings plus character embeddings (WORD + CHAR). In fact, two of the pre-trained models perform significantly worse. In this section, we will investigate whether the missing overlap in scripts is the reason for this performance.
As described in Section 4.3, the multilingual pre-trained language models that we use are mBERT, XLM-R, and CINO. None of the training corpora of the three models contain any Xibe data. However, the three language models use different tokenizers, which may influence parsing in different ways.
(3) ini hahajui beijing deri bedereme jihebi
    his son Beijing abl return.conv come.past
    "His son came back from Beijing."
We use a randomly chosen treebank sentence, see example (3), and tokenize it using the tokenizers of the three pre-trained language models. The results are shown in Table 8. The table shows clearly that mBERT is not equipped to handle an unknown script: It does not recognize the Xibe characters. As a consequence, all words are treated as unknown tokens (UNK). A closer look at the mBERT tokenization of the whole Xibe treebank shows that all words written in the Xibe script are represented as UNK by the mBERT tokenizer; only punctuation, numbers, and words spelled in Roman characters, such as "wifi," are handled by the tokenizer, but these are very few in number.
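Such a comparison can be reproduced with the HuggingFace tokenizers. In the sketch below, the model identifiers (in particular hfl/cino-base-v2 for CINO) are assumptions, and the Latin transliteration of example (3) merely stands in for the actual Xibe-script input used in our analysis:

```python
from transformers import AutoTokenizer

# Transliteration of example (3); the real experiment feeds the Xibe script.
sentence = "ini hahajui beijing deri bedereme jihebi"

for model_name in ["bert-base-multilingual-cased",  # mBERT (WordPiece)
                   "xlm-roberta-base",              # XLM-R (SentencePiece)
                   "hfl/cino-base-v2"]:             # CINO (extended SentencePiece)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(model_name, tokenizer.tokenize(sentence))
```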
XLM-R, in contrast, uses the SentencePiece tokenizer, which encodes the string as a Unicode sequence and decodes the normalized Unicode sequence back to a string. This procedure works better, and XLM-R is able to tokenize some words. However, there are considerable differences across words: One word (hahajui; Eng.: son) is split into characters, while all the others are kept as complete words.
To determine the extent of the problem, we check how many words in the Xibe treebank are not subdivided into subwords (excluding all punctuation). Results show that 60.68 percent (12,461 out of 20,535) of the words are kept in their original form. This means that for Xibe, the XLM-R tokenization is closer to word-level tokenization, which explains why the XLM-R results are very similar to the results when using word embeddings in Table 7. The fact that the XLM-R results are slightly lower than those for word embeddings shows that the tokenization that does happen generally does not provide useful information to the parser.
CINO utilizes the same architecture as XLM-R. The main differences between CINO and XLM-R concern the SentencePiece tokenizer and the data used to train the embeddings (see Section 4.3). Table 7 shows that the WORD + CINO model outperforms the WORD model by 1.22 percentage points in LAS, suggesting that using the additional Tibetan and Mongolian information is helpful. The effectiveness of CINO can partially be attributed to its tokenizer, which is more capable of splitting words into subwords, as shown in Table 8. When using the CINO tokenizer, only 7.43 percent (1,525 out of 20,535) of the words are not split into subwords. However, the example shows that this tokenizer has a tendency to tokenize words into individual characters, which explains the similarity of the parsing results to those of the WORD + CHAR model. One reason for this behavior may be found in the fact that Xibe and traditional Mongolian share only twelve out of thirty-five letters, that is, CINO is not capable of recognizing the remaining twenty-three letters. However, a closer look at the remaining cases, that is, subwords longer than one character, shows that these consist of only Mongolian characters, only Xibe characters, and mixes of both in almost equal parts. Consequently, familiarity with the Mongolian characters is not the only criterion for forming longer subwords.
Overall, it is evident that all pre-trained multilingual word embeddings have problems determining useful subword units given Xibe’s unique script, which explains the low results for models using those embeddings.
5.3 A closer look at word plus part-of-speech embeddings
In general, integrating POS information with the Xibe word embeddings is expected to improve parsing results since the parser has access to gold POS tags and should thus be able to better disambiguate lexical ambiguities. However, from Table 7, we observe that parsing performance declines when part-of-speech embeddings are added to the word embeddings. In comparison to the model using word embeddings, the WORD + POS model reaches an LAS that is more than one point lower (55.52 vs. 56.57). Additionally, when adding POS embeddings to the best performing model (WORD + CHAR + CINO), the LAS deteriorates slightly from 62.47 to 62.26. The same trend also occurs when adding POS embeddings to WORD + CINO. While these differences are not significant, they still show that the gold POS information cannot be used successfully by the parser.
Table 9 provides a comparison of the WORD + CHAR and the WORD + CHAR + POS models in terms of the per-POS-tag accuracy of being assigned the correct head, along with the absolute frequency of each POS tag in the Xibe treebank. The accuracies show no clear trend: About half of the POS tags improve when POS embeddings are added, while the other half deteriorates, and the split is distributed across frequencies and across open and closed class POS tags. This lack of trends is also evident in other ways of probing the results. We interpret it as an indication of a complex interaction between the different types of embeddings, without any clear reason for the lower accuracies when POS embeddings are used. However, we also acknowledge that this may be an artifact of the small training set size.
5.4 A closer look at word embeddings
Here, we investigate the reasons for the poor performance of the Xibe word embeddings model as compared to the delexicalized POS model. When representing a sentence via word embeddings, the LAS is 3.36 points lower than for the delexicalized POS model and 5.77 points lower than for the WORD + CHAR model. Since the lexicalized model is trained on only 400 sentences and tested on a second fold of 400 sentences, we assume that the main problem is OOV words. We check the OOV ratio in the three folds of the test data, shown in Table 10. The table shows that the OOV ratio is 46.38 percent on average. Since nearly half of the words in the test data are unknown, it is not surprising that this model struggles. We further check the number of unknown words on the sentence level, finding that the test sentences on average contain 17.99 percent unknown words relative to the training data. The missing information degrades prediction performance. The OOV problem is mitigated by the WORD + CHAR model, which reaches an LAS very close to the highest LAS for monolingual parsing.
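The two OOV statistics can be computed as follows; whether punctuation is excluded, and whether tokens or types are counted at the corpus level, are our assumptions and may differ from the exact counts in Table 10:

```python
def oov_statistics(train_sents, test_sents):
    """Corpus-level OOV token ratio and mean per-sentence unknown-word rate.
    Each sentence is a list of word forms."""
    train_vocab = {w for sent in train_sents for w in sent}
    test_tokens = [w for sent in test_sents for w in sent]
    corpus_oov = sum(w not in train_vocab for w in test_tokens) / len(test_tokens)
    per_sentence = [sum(w not in train_vocab for w in sent) / len(sent)
                    for sent in test_sents if sent]
    return corpus_oov, sum(per_sentence) / len(per_sentence)
```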
5.5 A closer look at the delexicalized model
In contrast to the low performance of the WORD model, the delexicalized model performs significantly better. Using the coarse-grained universal part-of-speech tags instead of words alleviates the OOV problem of the word embeddings model. Part-of-speech sequences can encode local syntactic information, which the parser mainly relies on. However, the rather coarse-grained universal POS tags tend to cluster together many relevant syntactic distinctions. For example, all verbs are replaced by “VERB,” and morphological information is largely lost. This may be detrimental if some of the encoded information is necessary for syntactic decisions.
We now investigate whether adding suffix information to the POS tag provides the parser with better information. For this approach, we need to walk a fine line between adding syntactically relevant information and increasing the size of the POS tagset, and consequently the OOV rate, given the small size of the treebank. We focus on verbs since these have the richest morphology in Xibe. For each verb, instead of representing the verb form by the POS tag "VERB," we attach the verbal suffix to the POS tag. For example, we use "VERB+mbi" for taci-mbi, "VERB+me" for gene-me, and "VERB+re" for niru-re.
Since no morphological analyzer exists for Xibe, we use a bootstrapping approach, extracting the relevant suffixes using a set of rules. Table 11 lists all the inflectional suffixes for Xibe verbs based on Xiboyu Yufa Tonglun (Eng.: General Introduction to Xibe Grammar; Šetuken 2009), including affirmative and negative forms. We use this list of suffixes to extend the verbal POS tag.
To ensure that we do not overgeneralize, we have manually compared the “VERB + suffix” tags and the original verbs. We found only three erroneous cases. These suffixes were removed before parsing.
After applying the rules, 95.6 percent of all verbs are represented by a more specific POS tag including a suffix. The remaining verbs are mainly uninflected verbs, participles marked for case, and irregular verbs.
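A sketch of the rule-based tagging, with a hypothetical transliterated suffix inventory standing in for Table 11 (the actual rules operate on Xibe-script forms and cover the full affirmative and negative paradigms):

```python
# Illustrative subset of verbal suffixes; the full inventory comes from Table 11.
VERB_SUFFIXES = sorted(["mbi", "me", "fi", "ra", "re", "ro", "ha", "he", "ho", "hai"],
                       key=len, reverse=True)  # try longest suffixes first

def verb_pos_tag(form: str) -> str:
    """Map a verb form to an enriched POS tag such as 'VERB+mbi'."""
    for suffix in VERB_SUFFIXES:
        if form.endswith(suffix):
            return f"VERB+{suffix}"
    return "VERB"  # uninflected verbs, case-marked participles, irregular verbs

print(verb_pos_tag("tacimbi"))  # -> VERB+mbi
print(verb_pos_tag("geneme"))   # -> VERB+me
```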
In order to investigate whether adding verbal morphology to the POS tag provides relevant information, we compare results using this version of delexicalization to the standard method. We are aware that this model may not be completely delexicalized since we use suffix information, but POS tags often include partial morphological information, such as plural or past tense in the Penn Treebank POS tagset for English (Santorini 1990) and the finite/non-finite verb distinction in the Penn Treebank POS tagset and the Stuttgart-Tübingen tagset for German (Schiller, Teufel, and Thielen 1995). However, we consider this method a knowledge-poor way of injecting morphological information into the parser.
The results of this experiment are shown in Table 12. A comparison of the two models shows that the LAS significantly improves by 1.74 points to 61.67, thus indicating clearly that the morphological information present in the verbal suffixes is important for parsing.
$diff_{LAS}$ shows the LAS improvement from the POS ONLY model to the POS + VERB SUFFIX model; * indicates significant differences ($p_{LAS} < 0.05$).
Table 13 gives a selective overview of how adding suffixes changes the parser's performance on the verb POS tags as well as on clausal dependencies. The results show a considerable increase in accuracy for the POS tag VERB, which improves by 4.57 percentage points. The dependency relations for clauses show increases between 3.78 and 37.18 percentage points. The most substantial change occurs for clausal complements, which increase from an F-score of 0.00 to 37.18.
A manual inspection of these cases in the two models shows that in the POS ONLY model, "ccomp" relations tend to be misclassified as "advcl." This over-generalization is partly corrected in the POS + VERB SUFFIX model. Figure 4 shows one example. Here, the main verb is the imperative se, and bi sain fonjimbi is its clausal complement. Since "advcl" is more frequent than "ccomp," the parser prefers this dependency label over "ccomp" when it does not have access to verbal features. Adding verb suffixes to the POS tag provides the information that fonjimbi is finite and thus cannot be a converb.
Our results show clearly that the verbal suffixes provide explicit cues for identifying syntactic patterns.
6. Cross-lingual dependency parsing
As we have seen in Section 5, monolingual parsing using a pre-trained multilingual language model is only moderately successful. In monolingual parsing, the parser only has a limited number of sentences from which to learn Xibe syntax, and the substantial proportion of unknown words creates a challenge. Another way of addressing the problem is cross-lingual parsing: training a parser on a source language that is syntactically closely related and higher-resourced, and then parsing Xibe. Here, we rely on the commonality of the UD annotation scheme.
As shown in Section 2.2, lexicalized approaches can lead to good results in a cross-lingual setting. Since in the case of Xibe, there will be no or very little overlap in the scripts of source and target language, we assume that a lexicalized approach will not be successful. In Section 5, we have shown that delexicalized parsing is highly successful in a monolingual setting. Our assumption is that this will transfer to a cross-lingual setting as long as the languages are syntactically similar enough, since in this case, the parser can profit from the (generally) higher number of training sentences. As source languages, we use the six Transeurasian languages described in Section 4.1.
Since in monolingual parsing, using word and character embeddings performs better than using only word embeddings or word plus POS embeddings, we also use the WORD + CHAR setting here, and for the same reason, we additionally experiment with adding CINO embeddings.
6.1 Results
Results of the cross-lingual dependency parsing experiments are shown in Table 14. These results are considerably lower than the monolingual baseline (repeated in the last row of Table 14), across the board. The highest LAS is reached by training on one data split of the Kazakh treebank, in the delexicalized setting. This reaches an LAS of 41.42, which is about 18.5 percentage points lower than the result of the corresponding monolingual baseline.
$diff_{LAS}$ is the LAS difference between the delexicalized source model and the Xibe baseline; * indicates significant differences ($p_{LAS} < 0.05$).
As expected, all the lexicalized models perform poorly in the cross-lingual setting. The results in Table 14 show that the word-based models reach an LAS in the range between 3.23 and 11.65. Adding characters to the word model improves results in about two-thirds of the cases; however, it is unclear how to delineate the positive and negative cases, since even the different splits of Buryat and Kazakh show inconclusive results. Adding CINO embeddings generally improves performance to a limited extent; exceptions are Kazakh split 2 + 3 and Uyghur. However, even those improved results do not exceed an LAS of 21.27 (based on the Japanese GSDLUW treebank).
Note that the six source languages cover a wide range of scripts, but do not have any overlap with the Xibe script. This leads us to the conclusion that lexicalized cross-lingual parsing is not a viable option in cases where the target script is not included in the training data.
In contrast to the lexicalized models, the delexicalized models perform better. The best setting reaches an LAS of 41.42 when training on the Kazakh treebank (split 3). However, there are significant differences across the three splits of the Kazakh treebank, indicating that this result may not be stable. The next highest result is obtained by training on the Turkish BOUN treebank, which reaches an LAS of 41.00. However, these results are still substantially lower than those of the monolingual delexicalized model (LAS: 59.93). This means that the parser cannot harness the larger training data size to improve parsing results.
When investigating the causes of the low performance of all cross-lingual models, the first assumption would be that the source and target languages differ in terms of word order and types of syntactic constructions. However, we chose our source languages mainly by language family, ensuring that all source languages show the same large-scale syntactic characteristics as Xibe: All languages have a strict SOV word order, and phrases are all head-final. This does not preclude other word order differences across languages, but these should have less effect on parsing accuracy. For this reason, we next take a closer look at the dependency labels to determine possible reasons for the low performance.
6.2 Dependency relations
In this section, we investigate potential discrepancies in dependency labels between the source treebanks and Xibe as our target treebank. It is a well-known fact that not all UD treebanks use the full set of dependency labels (Nivre et al. 2016, 2020; de Marneffe et al. 2021). For this reason, we first concentrate on the overlap between the labels in the Xibe treebank and each source treebank to decide whether there is a correlation between the (lack of) overlap and parsing performance.
The Xibe treebank uses all forty dependency relations, including the thirty basic dependency relations and ten relations with subtypes. Since the evaluation focuses on main types, we restrict our investigation to those as well.
Table 15 lists the dependency relations that do not occur in the training data of a given source treebank, along with the percentage of unknown labels in the Xibe test set. The table shows a certain interdependence between the rate of unknown labels and parser performance: The source treebanks that reach an LAS higher than forty on Xibe, that is, Kazakh and the Turkish BOUN, all show minimal rates of unknown labels. However, the remaining picture is less clear. The Korean GSD and the Japanese GSD and GSDLUW treebanks have the highest unknown label rates, of around 7 percent. However, even though these rates are very similar, the parsing performance ranges between an LAS of 33.27 (Korean GSD) and 36.75 (Japanese GSDLUW). Additionally, the Korean Kaist treebank has an unknown label rate of only 1.49 percent, but results in a considerably lower LAS of 22.03. Consequently, we can conclude that while a low unknown label rate is a prerequisite for high parsing accuracy, it alone is not sufficient to guarantee good performance.
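The unknown label rate in Table 15 reduces to a set difference over main-type labels; a sketch:

```python
def unknown_label_rate(source_train_labels, xibe_test_labels):
    """Percentage of target test dependencies whose main-type label
    never occurs in the source training data."""
    seen = {label.split(":")[0] for label in source_train_labels}
    unknown = [l for l in xibe_test_labels if l.split(":")[0] not in seen]
    return 100 * len(unknown) / len(xibe_test_labels)
```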
We now look more closely at the parsing performance on specific dependency relations for the highest-performing model, trained on Turkish BOUN, along with the models trained on the two Korean and the two Japanese treebanks. Specifically, we choose the dependency relations with the highest frequency in the Xibe treebank: nominal subject (nsubj), object (obj), oblique (obl), case marking (case), and adverbial clause (advcl). These labels account for 40.85 percent of the dependency relations in the Xibe treebank.
The F1 scores for the five labels are shown in Table 16. The results show that all parsing models have significant difficulties locating subjects, objects, and obliques. All of the F-scores for these dependency relations range between 0.00 and 26.63. Nominal subjects seem to be the most difficult to identify for the Japanese source treebanks, while the Korean source treebanks lead to problems in identifying obliques.
As described in Section 3, Xibe has a strict SOV order, and the main verb is in sentence-final position. This strict word order should provide the parser with cues about the order of arguments. However, in delexicalized parsing, it is more difficult to differentiate between the arguments. Figure 5 shows the dependency tree of example (1) together with an incorrect parse. Here, we have a pro-drop subject, plus an oblique and an object. In delexicalized parsing, the parser only has access to the POS level, which obscures all case information in the adpositions, thus leaving the parser guessing. The incorrect parse attaches all three adpositional phrases to the verb as obliques. This is a clear indication that we need to provide more case information to the parser.
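To illustrate the information loss, here is a hypothetical Python sketch of the delexicalization step, which overwrites the word form and lemma of a CoNLL-U token line with its UPOS tag; this approximates the representation available to the delexicalized parser and is not the exact preprocessing used in our experiments.

def delexicalize(line):
    """Replace FORM and LEMMA with the UPOS tag in one CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10 or not cols[0].isdigit():
        return line.rstrip("\n")  # keep comments and multiword tokens unchanged
    cols[1] = cols[3]  # FORM  <- UPOS
    cols[2] = cols[3]  # LEMMA <- UPOS
    return "\t".join(cols)

After this step, all case-marking adpositions in example (1) surface as identical ADP tokens, so the parser can no longer tell the oblique from the object.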
Another problem that we observe in Table 16 is that none of the five source treebanks allows an accurate recognition of adverbial clauses (advcl): The F-scores range from 0.16 (Korean Kaist) to 20.88 (Turkish BOUN). Next, we compare the two Korean treebanks with the Xibe treebank. More specifically, we determine the types of heads and dependents that share the “advcl” dependency relation. We list the five most frequent head-dependent pairs per treebank in Table 17. In the Korean Kaist treebank, the most frequent types all have an adverbial (ADV) dependent. In the Korean GSD treebank, in contrast, the most frequent dependents are verbs. This is a clear indication of variability in annotation. For Xibe, 93.96 percent of the adverbial clause relations consist of a VERB depending on a VERB head. This is a consequence of the frequent use of converbs, as described in Section 3.4. Korean also uses converbs, but to a lesser degree. Whether the higher variability of dependents in the Korean treebanks is due to differences in language phenomena or due to annotation artifacts is difficult to determine, but it is clear that these differences lead to a deterioration in parsing performance in a cross-lingual setting.
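The head-dependent counts underlying Table 17 can be reproduced along the following lines; this is a sketch that assumes well-formed CoNLL-U input, and the file name is again an illustrative placeholder.

from collections import Counter

def advcl_pairs(path):
    """Count (head UPOS, dependent UPOS) pairs for the advcl relation."""
    pairs = Counter()
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")  # one block per sentence
    for block in blocks:
        rows = [line.split("\t") for line in block.split("\n")
                if not line.startswith("#")]
        rows = [c for c in rows if len(c) == 10 and c[0].isdigit()]
        upos = {c[0]: c[3] for c in rows}  # token id -> UPOS
        for c in rows:
            if c[7].split(":")[0] == "advcl" and c[6] in upos:
                pairs[(upos[c[6]], c[3])] += 1  # (head UPOS, dependent UPOS)
    return pairs

print(advcl_pairs("korean-train.conllu").most_common(5))  # illustrative file name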
In summary, our investigation has shown that the low results in cross-lingual parsing have a range of underlying causes, including mismatches in the set of dependency labels used, differences in annotation, and differences in syntactic preferences across languages, even when we choose the languages that are the most closely related to the target language (and that possess a UD treebank). Additionally, delexicalized parsing has to rely on the seventeen tags of the UD POS tagset to represent a sentence, and these categories are too coarse-grained to allow the parser to make reliable syntactic decisions.
In the next section, we focus on one method to address differences in syntax and annotation automatically: we combine all source languages but then determine the subset of sentences that is the most similar to the target language.
6.3 Multi-source parsing
As described above, in single-source, delexicalized cross-lingual dependency parsing, the parser cannot profit from the larger training set size of a source language because of differences between languages, differences in annotation, and overly coarse-grained information at the POS level. To address the first two issues, we investigate multi-source parsing, where we combine all reliable source languages and then select only those sentences that are most similar to the target language, using perplexity.
We restrict ourselves to those treebanks that obtained an LAS greater than 36.00 in Table 14, that is, Kazakh, Uyghur, Turkish BOUN, Turkish Kenet, and Japanese GSDLUW. Since this set contains two Turkish treebanks, both of them comparatively large (Turkish Kenet with 18,687 sentences and Turkish BOUN with 9,768 sentences), Turkish may gain too much influence in the multi-source setup. Initial experiments including the Turkish Kenet treebank indeed yielded lower results, suggesting that this is the case. For this reason, the final set of treebanks consists of only the Kazakh, Uyghur, Turkish BOUN, and Japanese GSDLUW treebanks.
(Caption of Table 18) $diff_{LAS}$ is the difference in LAS between a given model (the Turkish BOUN model or one of the multi-source models) and the Xibe baseline. * indicates significant differences ( $p_{LAS}\lt 0.05$ )
6.3.1 Perplexity
Perplexity has been shown to be a good metric for selecting additional training data for cross-lingual parsing (Søgaard Reference Søgaard2011) as well as in other multi-source setups, such as domain adaptation (Hwa Reference Hwa2001; Khan, Dickinson, and Kübler Reference Khan, Dickinson and Kübler2013). To calculate perplexity, we use NLTK (Loper, Bird, and Klein Reference Loper, Bird and Klein2009) to train a POS bigram language model on the Xibe treebank. Laplace smoothing is performed with an $\alpha$ of 1 to handle the high likelihood of unseen POS sequences on such small datasets. Then, for each sentence in the selected source treebanks, we compute the perplexity and select those sentences below thresholds of 10, 15, and 20, respectively, as additional training data.
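The selection step can be sketched as follows, using NLTK's nltk.lm package; the toy POS sequences are purely illustrative stand-ins for the Xibe and source treebank data.

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import bigrams

# Toy data: each sentence is represented as its sequence of UD POS tags.
xibe_pos = [["PRON", "NOUN", "ADP", "NOUN", "VERB", "PUNCT"],
            ["NOUN", "ADP", "NOUN", "ADP", "VERB", "AUX", "PUNCT"]]
source_pos = [["NOUN", "NOUN", "VERB", "PUNCT"],
              ["ADV", "ADJ", "NOUN", "ADP", "VERB", "PUNCT"]]

# Train a POS bigram model with Laplace (add-one) smoothing on the Xibe tags.
train, vocab = padded_everygram_pipeline(2, xibe_pos)
lm = Laplace(2)
lm.fit(train, vocab)

def pos_perplexity(tags):
    # Perplexity of one POS sequence under the Xibe bigram model.
    return lm.perplexity(bigrams(pad_both_ends(tags, n=2)))

# Keep only those source sentences whose POS sequence looks Xibe-like.
threshold = 15
selected = [sent for sent in source_pos if pos_perplexity(sent) <= threshold]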
6.3.2 Results
Table 18 shows the results of this experiment. For convenience, we repeat the results for using the Turkish BOUN treebank as source language and for monolingual training on Xibe from Table 14. The results show that we gain about 7 percentage points in UAS and 8.6 percentage points in LAS when we use all four reliable source treebanks for training. While this improves results, they are still about 4 percentage points below the monolingual UAS and 10 percentage points below the monolingual LAS. Restricting the sentences to those with a perplexity lower than 15 gives a slight boost, to a UAS of 65.85 and an LAS of 49.86, while removing about 2,000 sentences. Lowering the threshold to 10 decreases the results below the scores of the full set of treebanks.
Table 19 provides an analysis of the dependency labels that show the largest differences across the individual settings. We use the Turkish BOUN model for the single-source results and the multi-source model ( $PPL\leq 15$ ) for the multi-source results. The results show a sizable difference for adverbial clauses (advcl): This dependency relation is parsed most reliably in the monolingual setting, fares poorly in the single-source cross-lingual setting, but is handled more successfully in the multi-source setting. This result makes sense when we consider what we know about converbs (see Section 3.4): They are a highly frequent construction in Xibe, which explains the good results in the monolingual setting. Since the other source languages also use converbs, the decrease in performance in the single-source model shows that the source languages use them in fewer or more restricted functions. By combining source languages into the multi-source training set, we provide the parser with a wider range of converb functionality, which helps the parser handle this construction more successfully.
The results of the multi-source experiments fall in line with other work (Søgaard Reference Søgaard2011; Rosa and Žabokrtský Reference Rosa and Žabokrtský2015a) showing that it is possible to improve the results of delexicalized cross-lingual parsing by choosing a more relevant set of source data. We see that the sheer size of a source treebank is less important than its similarity to the target language: The Korean Kaist treebank is larger than our best multi-source treebank, and the Turkish Kenet treebank is about the same size, but both result in considerably lower performance. Combining source languages can provide the parser with a wider range of phenomena than is present in any single source language. While this may lead to a decrease in performance if the target language does not use this full range, it can also improve performance, as in the case of the Xibe converbs.
However, we also see that sentence similarity is not a panacea: The cross-lingual results are still far below the monolingual ones. A comparison of POS bigrams, especially given the rather coarse-grained nature of UD POS tags, may not provide enough information to find the most relevant sentences.
7. Conclusion
In this work, we have investigated parsing for an under-resourced language, Xibe, which uses a unique script that is not present in any of the other languages for which we have resources. We first investigated a monolingual setting, determining how best to parse Xibe using the small treebank that is available. These experiments show that we obtain the best results when we combine word and character embeddings with the CINO language model. Since all other language models lead to a deterioration of parsing performance, we conclude that this improvement results from Mongolian being included in the language model, since Mongolian is the only language that shares a subset of characters with Xibe. We also show that we reach competitive results with completely delexicalized parsing, that is, by focusing on POS tags instead of word embeddings. Finally, we show that the UD POS tagset is impoverished and that adding automatically extracted verbal suffixes to the POS tags improves results.
In a second set of experiments, we have focused on a cross-lingual setting where we train on related source languages. Our investigation has shown that this setting leads to significantly lower results than training on the small Xibe training set. These low results have a range of underlying causes, including mismatches in the set of dependency labels, differences in annotation, and differences in syntactic preferences across languages, even though we chose closely related languages. None of these issues is new or different from settings where source and target share the same script, but they are exacerbated by the difference in script, since a multilingual language model cannot provide the necessary bridge across the languages. Using delexicalization to bridge the languages is necessary but comes at a high price: We abstract away from specific types of information, such as case, which are necessary for many parsing decisions.
Our next steps need to focus on injecting more information into the POS tags. In a way, this is reminiscent of the parsing situation before the arrival of large-scale neural language models, when much attention was paid to finding the optimal level of abstraction between using words or POS tags as input for the parser. While neural parsers have drastically improved performance, they can still be too dependent on lexical information, making them insufficiently robust to lexical variation (Kasai and Frank Reference Kasai and Frank2019), while questions persist about the granularity of POS tags for a given language within a standard annotation scheme (e.g., UD; Anderson and Gómez-Rodríguez Reference Anderson and Gómez-Rodríguez2020) or, in some instances, even about their necessity (Zhou et al. Reference Zhou, Zhang, Li and Zhang2020b). This shows very clearly that the problems in parsing remain the same over time, even though we are making progress in addressing them.