1 Introduction
Text-as-data methods allow generating insights from text corpora that could otherwise be analyzed only by investing large amounts of human effort, time, and financial resources (cf. Grimmer and Stewart 2013). However, when applied in cross-lingual research, many existing quantitative text analysis methods face limitations (Baden et al. 2021), such as picking up on language differences instead of substantively more interesting patterns (cf. Lind et al. 2021a). Analyzing multilingual corpora as-is, in turn, requires analysts to duplicate their efforts (Lucas et al. 2015; Reber 2019).
Machine translation (MT) has been proposed and validated as a remedy to these limitations (e.g., Lucas et al. 2015). However, when relying on commercial MT services, translating large multilingual corpora can be expensive. Translating only dictionary keywords or the words retained after tokenizing documents in their original languages (e.g., Proksch et al. 2019; Reber 2019), in turn, can lead to incorrect translations.
This paper presents an alternative approach to cross-lingual quantitative text analysis. Instead of translating texts, they are represented in a language-independent vector space by processing them through a pre-trained multilingual sentence embedding (MSE) model. Existing pre-trained models enable semantically meaningful text representation and are publicly available for replicable and resource-efficient use in research. However, only a minority of the texts used to pre-train these models stem from the political domain.
To assess whether pre-trained MSE models nonetheless enable reliable measurement, I focus on cross-lingual text classification as an application. First, I rely on a dataset compiled by Düpont and Rachuj (2022) that records machine-translated and original sentences of election manifestos in the Comparative Manifestos Project (CMP) corpus (Volkens et al. 2020). I assess how reliably MSE-based classifiers perform in classifying sentences’ topics and positions compared to classifiers trained using bag-of-words (BoW) representations of machine-translated texts (the “MT+BoW” benchmark). I also include MT-based classifiers in this comparison that rely on translations by the open-source M2M model (Fan et al. 2021) to compare “free” alternatives. This analysis shows that relying on MSEs for text representation enables training classifiers that are no less reliable than their MT-based counterparts. Moreover, I find that relying on free MT (i.e., the M2M model) instead of Google’s commercial MT service reduces the reliability of classifiers only slightly.
Next, I examine how these classifiers’ reliability depends on the amount of labeled data available for training. This analysis shows that adopting the MSE approach tends to result in more reliable cross-lingual classifiers than the MT+BoW approach and, at least, likely results in no less reliable classifiers—particularly when working with training data sizes typically available in applied research. However, as more training data are added, this comparative advantage decreases.
Lastly, I compare how MT+BoW and MSE-based classifiers perform in cross-lingual transfer classification, that is, classifying sentences written in languages not in the training data. Annotated text corpora are often limited in their country coverage, and extending them to new countries beyond their original language coverage is a promising application of cross-lingual classifiers. I probe the MSE and MT+BoW approaches in this task based on a dataset compiled by Lehmann and Zobel (2018) covering eight languages that records human codings of manifesto quasi-sentences into those discussing the immigration issue and those that do not. Specifically, I conduct an extensive text classification experiment to estimate how much less reliable cross-lingual transfer classification is compared to the “within-language” classification benchmark examined in the first two analyses. This experiment shows that cross-lingual transfer tends to result in fewer reliability losses when relying on the MSE instead of the MT approach.
2 Approaches to Cross-Lingual Quantitative Text Analysis
The goal of quantitative text analysis is to infer indicators of latent concepts from text (Grimmer and Stewart 2013; Laver, Benoit, and Garry 2003). Achieving this goal in multilingual applications is challenging because similar ideas are expressed with different words in different languages. The two sentences in Table 1 illustrate this: in both, the authors pledge to lower unemployment, but this idea is expressed in different words in English and German.
Hence, the goal of cross-lingual quantitative text analysis is to obtain identical measurements for documents that indicate the same concept, independent of the language they are written in (cf. Lucas et al. 2015, 258). There are currently two dominant approaches to tackling this challenge: “separate analysis” and “input alignment” through MT.
2.1 Established Approaches
The first approach is to separately analyze documents in their original languages. For example, in the case of human coding, separate analysis requires human coders to annotate each language-specific subcorpus. Analysts can then use these annotations to directly estimate quantities of interest or to train language-specific supervised text classifiers.Footnote 1 A significant shortcoming of the separate analysis approach is thus that analysts need to duplicate their efforts for each language present in a corpus (Lucas et al. 2015; Reber 2019). This duplication makes separate analysis a relatively resource-intensive strategy.
An alternative approach is input alignment. The idea of input alignment is to represent documents in a language-independent way that enables analysts to apply standard quantitative text analysis methods to their multilingual corpora instead of analyzing them language by language (Lucas et al. 2015). Translating text inputs into one target language using commercial MT services, such as Google Translate, has been established as a best practice to achieve this. Specifically, with the full-text translation approach, texts are translated as-is into the “target” language (e.g., English).Footnote 2 Full-text translated documents can then be pre-processed and tokenized into words and phrases (n-gram tokens) to obtain monolingual BoW representations of originally multilingual documents.Footnote 2 This approach has been shown to enable reliable dictionary analysis (Windsor, Cupit, and Windsor 2019), topic modeling (de Vries, Schoonvelde, and Schumacher 2018; Lucas et al. 2015; Maier et al. 2021; Reber 2019), and supervised text classification (Courtney et al. 2020; Lind et al. 2021b).
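The following is a minimal sketch of this MT+BoW representation step, assuming documents have already been machine translated into English; the example sentences, the scikit-learn vectorizer settings, and all variable names are illustrative rather than the exact pre-processing used in the studies cited above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative machine-translated (English) sentences; in practice these would
# come from a commercial MT service or an open-source MT model.
translated_docs = [
    "we will reduce unemployment and create new jobs",
    "taxes on low incomes will be cut to strengthen purchasing power",
]

# Unigram and bigram bag-of-words representation of the translated corpus.
vectorizer = CountVectorizer(ngram_range=(1, 2))
dfm = vectorizer.fit_transform(translated_docs)  # documents x n-gram tokens

print(dfm.shape)
print(vectorizer.get_feature_names_out()[:5])
```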
However, when researchers rely on commercial MT services, translating full texts can be very expensive, rendering this approach relatively resource-intensive, too (but see Lind et al. 2021b). An alternative is to tokenize documents in their original languages and only translate the resulting language-specific sets of words and phrases (e.g., Düpont and Rachuj 2022; Lucas et al. 2015). Similar to the full-text translation approach, token translation enables representing documents as BoW vectors in the target language.Footnote 2 Moreover, it is relatively resource-efficient because it implies translating fewer characters. However, token translation implies translating words and phrases outside their textual contexts, which can result in incorrect translations that impair the quality of BoW text representations.
Hence, researchers’ dependence on commercial MT services for full-text translation has created a trade-off between cost efficiency and text representation quality. The recent publication of open, pre-trained MT models (for example, the M2M model by Fan et al. 2021) promises to break this dependence, and I evaluate this possibility in my analyses below. However, I first present MSE as an alternative, MT-free approach to language-independent text representation.
2.2 Multilingual Sentence Embedding
MSE is a method to represent sentence-like textsFootnote 3 as fixed-length, real-valued vectors such that texts with similar meaning are placed close in the joint vector space, independent of their language. Because MSE allows representing documents written in different languages in the same feature space, it presents an alternative input alignment approach to cross-lingual quantitative text analysis. Table 2 and Figure 1 illustrate this for the two sentences in Table 1: because these sentences are semantically very similar, their embeddings are very similar, and they are hence placed close together in the embedding space.
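To make this intuition concrete, the sketch below embeds two illustrative parallel sentences (paraphrasing the unemployment pledges in Table 1) with a publicly available pre-trained multilingual model and computes their cosine similarity. The sentences and the specific model name are assumptions for illustration, not the exact inputs behind Table 2.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A knowledge-distilled multilingual model from the sentence-transformers model
# hub (an illustrative choice; LASER or mUSE could be substituted).
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

sentences = [
    "We will reduce unemployment.",             # English pledge (illustrative)
    "Wir werden die Arbeitslosigkeit senken.",  # German equivalent (illustrative)
]
emb = model.encode(sentences)  # shape: (2, embedding dimension)

# Cosine similarity: semantically equivalent sentences should score close to 1.
cos = np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
print(round(float(cos), 3))
```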
The idea of representing textual inputs as dense vectors (i.e., “embed” them) to encode their semantic relationships is old (Harris 1954). Word embedding models obtain such vectors for short n-grams (e.g., Mikolov et al. 2013; Pennington, Socher, and Manning 2014), and have already been popularized in the social sciences (cf. Garg et al. 2018; Rodman 2020; Rodriguez and Spirling 2021). Sentence embedding models obtain such fixed-length vectors for sentence-like texts such that semantically similar texts are placed relatively close in the embedding space (e.g., Conneau et al. 2017). MSE methods, in turn, obtain such vectors in a language-agnostic way.
Researchers have developed different MSE methods in recent years (e.g., Artetxe and Schwenk 2019; Reimers and Gurevych 2020; Yang et al. 2020). They commonly use corpora recording translations of sentences in different languages (“parallel sentences”) as inputs to train a neural network model that learns to induce a sentence embedding of the input text.Footnote 4 The Language-Agnostic Sentence Embedding Representations (LASER) model proposed by Artetxe and Schwenk (2019), for example, trains to translate parallel sentences and learns to induce language-agnostic sentence embeddings as an intermediate step.
Once “pre-trained” on large amounts of parallel data, MSE models can be used to embed texts they have not seen during pre-training.Footnote 5 Indeed, existing pre-trained MSE models have been shown to obtain sentence embeddings that (i) encode texts’ semantic similarity independent of language and (ii) provide critical signals to achieve competitive performances in a wide range of natural language processing tasks (e.g., Artetxe and Schwenk 2019; Reimers and Gurevych 2020; Yang et al. 2020). Moreover, publicly available MSE models have been pre-trained on large parallel corpora covering very many languages (e.g., 113 in the case of LASER; see Section B.1 of the Supplementary Material).
This suggests that MSE is an attractive alternative to the MT approach to input alignment discussed above. Instead of BoW count vectors of documents’ machine-translated texts, one combines their MSE vectors in a document-feature matrix (cf. Table 2).Footnote 6 As elaborated below, this enables using MSEs as features to train cross-lingual supervised text classifiers.
3 Empirical Strategy
To assess whether MSE enables reliable cross-lingual analyses of political texts, I evaluate this approach for cross-lingual supervised text classification (CLC).Footnote 7 The overarching goal of my analyses is to establish whether relying on pre-trained MSE models for text representation enables reliable measurement in relevant political text classification tasks. The reliability of classifiers trained using BoW representations of machine-translated texts—the “MT+BoW” approach—constitutes the reference point in this assessment.
I focus on supervised text classification as an application for three reasons. First, it figures prominently in quantitative text analysis (e.g., Barberá et al. 2021; Burscher, Vliegenthart, and De Vreese 2015; D’Orazio et al. 2014; Rudkowsky et al. 2018). However, in contrast to topic modeling (Chan et al. 2020; cf. Lind et al. 2021a), relatively little attention has been paid to evaluating translation-free CLC approaches besides the separate analysis strategy (but see Glavaš, Nanni, and Ponzetto 2017).
Second, MSEs can be directly integrated into the supervised text classification pipeline. Labeled documents are first sampled into training and validation data splits. Documents in the training data are then embedded using a pre-trained MSE model and their MSEs used as features to train a supervised classifier.
Third, the reliability of supervised text classifiers can be evaluated using clearly defined metrics (cf. Grimmer and Stewart 2013, 279). One first applies a classifier to predict the labels of documents in the validation data split. Comparing a classifier’s predictions to “true” labels then allows quantifying its reliability in labeling held-out documents with metrics such as the F1 score.
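The sketch below illustrates this pipeline under the MSE approach: embed labeled texts, split them into training and validation data, fit a classifier on the training embeddings, and score held-out predictions with the (macro-averaged) F1 metric. The toy corpus, the logistic regression classifier, and the embedding model name are illustrative assumptions, not the exact setup evaluated below.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy multilingual corpus standing in for an annotated dataset (illustrative).
texts = [
    "We will reduce unemployment.",                # en
    "Wir werden die Arbeitslosigkeit senken.",     # de
    "Taxes on high incomes must rise.",            # en
    "Steuern auf hohe Einkommen müssen steigen.",  # de
]
labels = ["welfare", "welfare", "economy", "economy"]

encoder = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
X = encoder.encode(texts)  # language-independent document features

X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.5, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_val)

# Cross-class (macro-averaged) F1 score on the held-out validation split.
print(f1_score(y_val, y_pred, average="macro"))
```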
3.1 Analysis 1: Comparative Reliability
I first assess how reliably MSE-based classifiers perform in classifying the topic and position of sentences in political parties’ election manifestos compared to classifiers trained with the MT+BoW approach.Footnote 8 Classifying the topical focus and left–right orientation of political texts is among the main applications of text classification methods in comparative politics research (Benoit et al. 2016; Burscher et al. 2015; Osnabrügge et al. 2021; Quinn et al. 2010). Assessing the comparative reliability of MSE-based classifiers in these tasks is thus relevant to a large group of researchers. Moreover, the data I use are representative of other annotated political text corpora with sentence-like texts or paragraphs as units of annotation (e.g., Barberá et al. 2021; Baumgartner, Breunig, and Grossman 2019; Rudkowsky et al. 2018).
3.1.1 Data
The annotated sentences used in this analysis stem from a subset of the CMP corpus (Volkens et al. 2020)Footnote 9 for which machine-translated full texts are available from the replication materials of Düpont and Rachuj (2022, henceforth D&R).Footnote 10 D&R study programmatic diffusion between parties across countries and have translated a sample of manifestos covering 12 languages (see Table 3) with Google Translate to validate their token translation-based measurement strategy. Sentences in D&R’s original data are not labeled, however, and I have hence matched them to the original quasi-sentence-level CMP codings.Footnote 11 This allows training and evaluating topic and position classifiers with both the MT+BoW and MSE approaches on the same data, and hence comparing them directly.
3.1.2 Classifier Training and Evaluation
The resulting corpus records 70,999 sentences. I randomly sample these sentences five times into 50:50 training and validation data splits.Footnote 12 This ensures that the out-of-sample performance estimates I report are not dependent on the data split.
For each training dataset and classification task, I train five classifiers: two MT+BoW and three MSE-based ones. In the case of the MT+BoW approach, I train one classifier using D&R’s original Google Translate translations and another one using translations of the same sentences that I have obtained using the open-source M2M model (Fan et al. 2021). This allows assessing whether relying on “free” MT instead of a commercial service impairs the reliability of BoW-based classifiers. In both cases, I apply a five-times repeated fivefold cross-validation (5 × 5 CV) procedure to select the best-performing classifier.Footnote 13
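A minimal sketch of this model-selection step, assuming scikit-learn: the repeated stratified K-fold splitter implements the 5 × 5 CV scheme, while the estimator and hyperparameter grid are illustrative placeholders for whichever classifier is being tuned; X_train and y_train refer to the features and labels of one training split (see the pipeline sketch in Section 3).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Five-times repeated fivefold cross-validation (5 x 5 CV) for model selection.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# Illustrative hyperparameter grid; X_train and y_train come from one training split.
search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1_macro",
    cv=cv,
)
search.fit(X_train, y_train)
best_clf = search.best_estimator_  # classifier evaluated on the validation split
```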
In the case of the MSE approach, I train classifiers relying on three different publicly available pre-trained MSE models.Footnote 13 These are the LASER model (Artetxe and Schwenk 2019) already discussed in Section 2.2 and two models that have been trained for sequence alignment of parallel sentences by adopting the multilingual “knowledge distillation” procedure proposed by Reimers and Gurevych (2020): a multilingual Universal Sentence Encoder (mUSE; Yang et al. 2020) and an XLM-RoBERTa (XLM-R) model (Conneau et al. 2020). Embedding texts with different pre-trained models allows comparing their suitability for political text classification applications.
For each training dataset and task, I then evaluate the five resulting classifiers on sentences in the corresponding validation datasets and bootstrap 50 F1 score estimates per classifier. I summarize these estimates in Figure 2 below.
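One simple way to obtain such bootstrapped F1 estimates is to resample the validation cases with replacement and re-score the classifier’s fixed predictions on each resample, as in the sketch below. This is a generic illustration of the idea, not a description of the exact resampling procedure in the replication code.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(y_true, y_pred, n_boot=50, seed=42):
    """Bootstrap cross-class mean F1 scores by resampling validation cases."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    return np.array(scores)

# Example usage with the validation split and tuned classifier from above:
# f1_draws = bootstrap_f1(y_val, best_clf.predict(X_val))
```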
3.2 Analysis 2: Comparative Effectiveness
Next, I examine how these classifiers’ reliability depends on the amount of labeled data available at training time (cf. Barberá et al. 2021; Burscher et al. 2015). The amount of digitized texts available for quantitative analyses is increasing, but collecting annotations for these data is usually very resource-intensive (cf. Benoit et al. 2016; Hillard, Purpura, and Wilkerson 2008). As a consequence, applied researchers can often afford to collect annotations for only a few documents in their target corpus. It is thus practically relevant to know which of the text representation approaches I compare proves more reliable in data-scarce scenarios.
3.2.1 Data
I use the same data as in Analysis 1.
3.2.2 Classifier Training and Evaluation
The training and evaluation procedure I adopt is the same as in Analysis 1, with two exceptions. First, I vary the size of the training datasets from 5% to 45% (in 5-percentage-point increments) of the target corpus, whereas in Analysis 1, I have trained on 50%. The smallest (largest) training dataset in Analysis 2 thus records 3,549 (31,948) labeled sentences.Footnote 14 Second, I rely only on the knowledge-distilled XLM-R model for sentence embedding because it results in the most reliable MSE-based classifiers in Analysis 1.
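The sketch below illustrates the logic of this learning-curve analysis: hold out a validation set (here a fixed 50% split, an assumption for illustration), then, for each training fraction, fit a classifier on a stratified subsample of the remaining data and record its held-out F1 score. Variable names (X for the sentence embeddings, y for the CMP labels) and the estimator are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# X: sentence embeddings (n_sentences x dim); y: CMP labels (illustrative names).
X_pool, X_val, y_pool, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

results = {}
for frac in np.arange(0.05, 0.50, 0.05):  # 5% to 45% of the full corpus
    n_train = int(round(frac * len(y)))
    X_train, _, y_train, _ = train_test_split(
        X_pool, y_pool, train_size=n_train, random_state=42, stratify=y_pool
    )
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    results[round(float(frac), 2)] = f1_score(y_val, clf.predict(X_val), average="macro")
```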
3.3 Analysis 3: Cross-Lingual Transfer
Last, I investigate which of the two text representation approaches I compare enables more reliable cross-lingual transfer classification, that is, classifying documents written in languages not present in the training data. Such “out-of-language” classification is a promising application of cross-lingual text classifiers. Annotated text corpora are often limited in their country coverage. Training cross-lingual text classifiers on these data promises to extend their coverage to new countries beyond their original language coverage.
3.3.1 Data
I rely on a dataset compiled by Lehmann and Zobel (2018, henceforth L&Z) covering eight languages (see Table 3). Their data record human codings of election manifesto quasi-sentences into those that discuss the immigration issue and those that do not, and I train cross-lingual classifiers for this binary classification task.
The example of identifying passages in political documents that discuss the issue of immigration is an ideal case for probing the reliability of supervised text classifiers in cross-lingual transfer. The politicization of immigration by the radical right since the 1990s has raised scholars’ interest in studying how governments, mainstream parties, and the media change their attention to this issue. However, then-existing databases, such as the CMP, lacked suitable indicators to address this question quantitatively. Despite scholars’ impressive efforts to obtain such indicators by means of content analysis, the resulting annotated corpora are often limited in their geographic coverage (cf. Lehmann and Zobel 2018; Ruedin and Morales 2019). The methodological problem of expanding the coverage of these corpora by means of cross-lingual transfer classification thus has considerable practical relevance.
3.3.2 Classifier Training and Evaluation
I examine this problem in an experimental setup designed to estimate how much less reliable cross-lingual transfer classification is compared to the “within-language” classification benchmark examined in Analyses 1 and 2. The basic idea of this setup is to use quasi-sentences written in some “source languages” to train a classifier that is then evaluated on held-out quasi-sentences. By repeating this for many different combinations of source languages, I can estimate how reliably a given set of held-out quasi-sentences can be classified when the languages they are written in are among the source languages (“within-language” classification) compared to when they are not (“out-of-language” classification, i.e., cross-lingual transfer).Footnote 15
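A stylized version of this experimental logic is sketched below: for each combination of source languages, train on quasi-sentences from those languages only and score held-out quasi-sentences separately by whether their language was among the source languages. Data structures, the number of source languages per combination, and the estimator are illustrative; the actual design follows the description in the text and the Supplementary Material.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# X: quasi-sentence embeddings; y: immigration-issue labels; langs: language of
# each quasi-sentence; held_out: boolean mask of evaluation cases (names illustrative).
languages = np.unique(langs)
records = []

for source_langs in combinations(languages, 4):  # e.g., all four-language subsets
    train_mask = np.isin(langs, source_langs) & ~held_out
    clf = LogisticRegression(max_iter=1000).fit(X[train_mask], y[train_mask])

    for lang in languages:
        eval_mask = (langs == lang) & held_out
        records.append({
            "eval_lang": lang,
            "within_language": lang in source_langs,  # False: cross-lingual transfer
            "f1": f1_score(y[eval_mask], clf.predict(X[eval_mask]), average="macro"),
        })
```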
4 Results
4.1 Comparative Reliability
Figure 2 reports the reliability of position and topic classifiers in terms of their cross-class mean F1 scores. Comparing average cross-class mean F1 scores shows that the best MSE-based classifier (the one trained using XLM-R embeddings) outperforms the benchmark classifier (the one relying on commercial MT) in topic classification while performing as reliably in position classification. In addition, the best MSE-based classifier is more reliable than the BoW-based classifier relying on free MT in topic classification and a similar tendency can be observed in the case of position classification. There is thus no indication that training using MSEs instead of BoW representations of machine-translated texts substantially reduces the reliability of cross-lingual text classifiers.Footnote 16
Note, however, that absolute F1 scores indicate that the reliability achieved with either approach is rather modest. One reason for this may be the strong class imbalance across label categories.Footnote 17 The poor quality of human annotations in the CMP corpus is another likely reason (cf. Mikhaylov, Laver, and Benoit 2012). As shown in Section 4.3, better performance can be achieved with less noisy labels.
Comparing MSE-based classifiers, it is notable that using the knowledge-distilled XLM-R model for sentence embedding results in the most reliable classifiers in both tasks, whereas using LASER consistently results in the least reliable classifiers (cf. Reimers and Gurevych 2020). With regard to differences in MT-based classifiers’ F1 scores, it is striking that the classifiers relying on translation with the open-source M2M model label held-out sentences only slightly less reliably than those relying on Google’s commercial MT service.Footnote 18
Finally, when comparing classifiers’ reliability across languages (see Figure S.7 in the Supplementary Material), it is notable that the F1 scores of the classifiers relying on commercial MT and the classifiers trained using XLM-R embeddings are strongly correlated for both tasks.Footnote 19 Moreover, the standard deviation of languagewise differences in classifiers’ F1 scores is modest in both tasksFootnote 20 and these differences are mostly indistinguishable from zero (accounting for variability in bootstrapped F1 scores; see Figure S.8 in the Supplementary Material). Furthermore, with 0.15, the correlation in language-specific F1 scores between tasks is rather low, suggesting that task-specific factors contribute significantly to between-language differences in classifiers’ reliability. These findings are reassuring since they provide little evidence of systematic language bias in the pre-trained embedding model.Footnote 21
In summary, Figure 2 provides evidence that the MSE approach enables training position and topic classifiers that are no less reliable than classifiers trained using BoW representations of machine-translated texts. Moreover, relying on a free MT model instead of a commercial service reduces the reliability of BoW classifiers only slightly in the two tasks examined here.
4.2 Comparative Effectiveness as a Function of Training Data Size
However, how effective using MSEs for text representation is compared to the MT approach also depends on the amount of labeled data available at training time. This is shown in Figure 3 by plotting the F1 scores of classifiers trained on different amounts of labeled data when adopting these different text representation approaches.
Three patterns stand out from the data presented in Figure 3. First, MSE-based classifiers tend to outperform their MT-based counterparts when training data are scarce. Second, as more training data are added, this comparative advantage decreases. Third, relying on the open-source M2M model for MT instead of Google’s commercial service consistently results in less reliable classifiers, but, in line with the findings presented above, differences in terms of F1 score points are overall very small.
The first pattern is more pronounced in the case of topic classification. Taking variability in bootstrapped F1 estimates into account, the MSE-based topic classifiers outperform the ones relying on commercial MT across the entire range of training data sizes examined here.Footnote 22 What is more, when training on only 3.5K labeled sentences, the topic classifiers relying on MT are only slightly more reliable than human coders (Mikhaylov et al. 2012, 85), whereas MSE-based classifiers perform relatively well.
While this comparative advantage is less pronounced in the case of position classification,Footnote 23 the BoW-based classifiers outperform their MSE-based counterparts at none of the training data sizes examined here. This suggests that the amount of labeled data needed to reach a “tipping point” at which MT-based classifiers begin outperforming their MSE-based counterparts is quite large and likely larger than what is typically available in applied political and communication science research.
Nevertheless, adding more training data results in greater F1 improvements for MT-based than for MSE-based classifiers in both tasks. As a consequence, the comparative reliability advantage of the MSE approach tends to decrease. This difference between approaches in how adding more training data affects classifiers’ reliability is not surprising. Training on BoW representations, classifiers learn to identify tokens in the training data that allow reliable classification. The features enabling reliable classification based on BoW representations are thus “domain-specific.” In contrast, classifiers trained using MSEs hinge on the representations the embedding model has learned to induce while pre-training on corpora that overwhelmingly stem from other domains. Learning domain-specific BoW features from machine-translated texts should thus eventually trump the “transfer learning” logic underpinning the MSE approach.Footnote 24 But as emphasized above, the amount of labeled data needed to reach this “tipping point” is likely very large.
This reasoning also helps explain why the range of training data sizes at which MSE-based classifiers are more reliable than their MT-based counterparts is larger for topic than for position classification. Identifying tokens that reliably predict held-out sentences’ topical focus among seven different policy issue areas from strongly imbalanced training data is more difficult than identifying tokens that discriminate between three positional categories. This gives the MSE-based classifiers a greater head start in topic classification.
Viewed together, the data presented in Figure 3 suggest that adopting the MSE approach tends to enable more—but at least no less—reliable cross-lingual classification than training on BoW representations of machine-translated texts. While the comparative reliability advantage of the MSE approach decreases as the training data size increases, none of the training data sizes examined here results in BoW-based classifiers that outperform their MSE-based counterparts. This suggests that the MSE approach is particularly suited when working with training data sizes typically available in applied research.
4.3 Cross-Lingual Transfer Classification
But how do the MSE and MT+BoW approaches to cross-lingual text classification perform when applied to label documents written in languages not present in the training data? As described above, I rely on the L&Z dataset recording codings of manifestos’ quasi-sentences into those that discuss the immigration issue and those that discuss other issues to address this question.
To establish a baseline estimate of the two approaches’ reliability in this binary classification task, I have first trained MSE- and MT-based classifiers on a balanced dataset recording a total of 10,394 quasi-sentences sampled from all eight languages and evaluated them on held-out quasi-sentences.Footnote 25 In this within-language classification setup, both approaches result in very reliable classifiers. With average F1 scores of 0.85 [0.84, 0.86] and 0.83 [0.82, 0.85], respectively, the MSE-based and MT-based classifiers are about equally reliable. Given that they were trained on about 10K labeled quasi-sentences, this finding is in line with the results presented in Figure 3. However, the immigration issue classifiers are much more reliable than the topic and position classifiers trained on comparable amounts of CMP data. Moreover, their language-specific average F1 scores are all above 0.79 in the case of the MSE-based classifier and above 0.75 in the case of the MT-based classifier. This suggests that both approaches should enable training classifiers on the L&Z data that perform well in cross-lingual transfer.
However, Figure 4 provides evidence that cross-lingual transfer tends to result in smaller F1 reductions (relative to the within-language classification benchmark) when relying on the MSE instead of the MT+BoW approach. For example, the “reliability cost”Footnote 26 of cross-lingual transfer into Danish is about 2.8 F1 score points with the MT+BoW approach and only 0.8 F1 score points with the MSE approach. This pattern is consistent across the languages recorded in the L&Z data. Moreover, the average F1 scores achieved by MSE-based classifiers when predicting quasi-sentences written in languages that were not in the training data (i.e., out-of-language evaluation) are higher than those of their MT-based counterparts. This suggests that the MSE approach enables more reliable cross-lingual transfer classification.
5 Conclusion
In this paper, I have argued that relying on MSEs presents an attractive alternative approach to text representation in cross-lingual analysis. Instead of translating texts written in different languages, they are represented in a language-independent vector space by processing them through a pre-trained MSE model.
To support this claim empirically, I have evaluated whether relying on pre-trained MSE models enables reliable cross-lingual measurement in supervised text classification applications. Based on a subset of the CMP corpus (Volkens et al. 2020) for which machine-translated full texts are available, I have first assessed how reliably MSE-based classifiers perform in classifying manifesto sentences’ topics and positions compared to classifiers trained using BoW representations of machine-translated texts. Moreover, I have evaluated how these classifiers’ reliability depends on the amount of labeled data available for training. These analyses show that adopting the MSE approach tends to result in more reliable cross-lingual classifiers than the MT+BoW approach and, at least, likely results in no less reliable classifiers. However, as more training data are added, this comparative advantage decreases. Moreover, I show that relying on an open-source MT model (Fan et al. 2021) reduces MT-based classifiers’ reliability only slightly.
In addition, I have assessed how MSE- and MT-based classifiers perform when applied to classify sentences written in a language that was not present in their training data (i.e., cross-lingual transfer). Using a dataset compiled by Lehmann and Zobel (2018) that records human codings of manifesto quasi-sentences into those discussing the immigration issue and those that do not, I show that cross-lingual transfer tends to result in fewer reliability losses when relying on the MSE instead of the MT approach, compared to the within-language classification benchmark examined in the first two analyses.
These results suggest that MSE is an important addition to applied researchers’ text analysis toolkit, especially when their resources to collect labeled data are limited. When they want to train a cross-lingual classifier on a small to modestly sized labeled corpus, adopting the MSE approach can benefit the reliability of their classifier but, at least, will likely not harm it. Moreover, when the country coverage of their labeled corpus is limited and extending it by means of cross-lingual transfer would require “out-of-language” classification, my analyses suggest that adopting the MSE instead of the MT approach should result in fewer additional classification errors.
Acknowledgments
I thank Tarik Abou-Chadi, Elliott Ash, Pablo Barberá, Nicolai Berk, Theresa Gessler, Fabrizio Gilardi, Benjamin Guinaudeau, Christopher Klamm, Fabienne Lind, Stefan Müller, Sven-Oliver Proksch, Martijn Schoonvelde, Ronja Sczepanski, Jon Slapin, Lukas Stötzer, and three anonymous reviewers for their thoughtful comments on this manuscript. I acknowledge support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC 2126/1-390838866.
Data Availability Statement
Replication code for this article has been published in Code Ocean, a computational reproducibility platform that enables users to run the code, and can be viewed interactively at https://doi.org/10.24433/CO.5199179.v1 (Licht 2022a). A preservation copy of the same code and data can also be accessed via Dataverse at https://doi.org/10.7910/DVN/OLRTXA (Licht 2022b).
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2022.29.