1 Motivation
Word embeddings (e.g., Mikolov et al. 2013) are now an important tool of social science. In contrast to traditional ways of representing the contents of documents, these estimated real-valued vectors enable us to talk more directly about the “meanings” and connotations of terms in natural language (Caliskan, Bryson, and Narayanan 2017; Rodman 2020). Applications include modeling political emotions (e.g., Gennaro and Ash 2022) and legislative ideology (e.g., Rheault and Cochrane 2020). At least two challenges remain: First, obtaining high-quality embeddings for non-English languages can be difficult. Second, it has proved nontrivial to place embeddings in a modeling framework, such that one can answer questions of the form “does this group differ in a statistically significant way in terms of their embeddings of a given term?” Here, we provide resources that address both issues. We use the embedding models and multilingual data from the fastText project of Grave et al. (2018) and combine them with recent advances in “à la carte” (ALC) embeddings (Khodak et al. 2018). The latter can then be seamlessly placed in a regression-style setup courtesy of Rodriguez, Spirling, and Stewart (2023).
1.1 New fastText Embeddings
The fastText project underpins the first contribution and provides two types of resources: first, an (open-source) modeling architecture “that allows users to learn text representations”Footnote 1; second, the output of applying that embedding model to 157 languages for which training data come from Common Crawl and Wikipedia. A strength of the fastText model is that it uses subword information in addition to the usual context-word arrangement for prediction. This can result in higher-quality embeddings than training on whole words (only) because tokens that are not identical but that contain similar parts (like policy and policies) are not treated as completely separate entities. This is helpful when, say, a specific form of a word (e.g., a misspelled word) is rare in the training documents, but we still have some information about it from other, more common tokens.Footnote 2
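To make the subword logic concrete, the following minimal sketch (purely illustrative, and not part of the fastText codebase) enumerates character n-grams in the way fastText-style models do, assuming the usual "<"/">" boundary markers and a 3–6 n-gram range; it shows why policy and policies share much of their representation.

```python
# Minimal sketch of fastText-style character n-grams, assuming "<"/">"
# boundary markers and an illustrative 3-6 n-gram range. Shared n-grams
# are why related forms such as "policy" and "policies" end up with
# overlapping subword representations.

def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    return {marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)}

shared = char_ngrams("policy") & char_ngrams("policies")
print(sorted(shared))  # shared subword units such as '<po', 'pol', 'poli', 'olic', 'polic'
```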
On inspection, we saw that Common Crawl includes many typos and rare terms (plus many English loan words). Beyond this potential for noise, Common Crawl is not separated by language—it is one combined corpus that requires nontrivial division for the end user we have in mind here. Our first contribution is simply taking the fastText pipeline and fitting it to Wikipedia in various languages. Thus, we have “our” version of fastText, which is cleaner than the original (though the training domain is admittedly more restricted).Footnote 3
1.2 New ALC Embeddings and Transformation Matrices
Our second set of contributions is to produce ALC embeddings: first, for this “new” version of fastText; second, for GloVe embeddings that we also trained on Wikipedia corpora. Details on these embeddings can be found in the SI,Footnote 4 but the logic is straightforward. Essentially, the embedding of a given word $w_v$ is estimated by taking the mean of the pre-trained embeddings of the tokens around it ($u_w$) and then using a transformation matrix (denoted $\mathbf{A}$) to redirect the new embedding away from common directions in the embedding space (e.g., those associated with function words) that are otherwise likely to be overrepresented in that averaging process. This allows analysts to produce high-quality vector representations even when they have very little data—including single instances of terms, assuming one has the context of that word and a sufficiently large corpus on which to pre-train embeddings. This, in turn, facilitates statistical inference because one can place the embeddings on “the left-hand side” and covariates of interest as predictors: for this purpose, Rodriguez et al. (2023) give machinery for estimating both coefficients (on, say, group membership variables) and the uncertainty around them. We provide the required pre-trained embeddings, from both fastText and GloVe models applied to Wikipedia, together with the relevant learned transformation matrices. We note that while there certainly are other non-English language embedding resources (e.g., Devlin et al. 2019), they do not easily slot into a broader regression-style inference model with standard errors, p-values, etc.
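As a concrete illustration, here is a minimal sketch of that induction step, assuming `pretrained` maps tokens to d-dimensional numpy vectors (e.g., loaded from our fastText or GloVe files) and `A` is the corresponding d-by-d transformation matrix; the function and variable names are ours, for exposition, rather than part of our released pipeline.

```python
import numpy as np

def alc_embedding(target, tokens, pretrained, A, window=6):
    """Embed `target` from its contexts in a tokenized corpus `tokens`."""
    context_vectors = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # collect pre-trained vectors of words around each occurrence
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context_vectors += [pretrained[t] for t in tokens[lo:hi]
                            if t != target and t in pretrained]
    if not context_vectors:
        raise ValueError(f"no usable contexts found for '{target}'")
    u = np.mean(context_vectors, axis=0)   # average of context embeddings
    return A @ u                           # redirect away from common directions

# Example: a single-instance embedding of a rare term
# v_iel = alc_embedding("iel", tokens, pretrained, A)
```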
1.3 Coverage and Intended Use
At the time of writing, we make all required products available for 40 of the most common languages (other than English).Footnote 5 This covers the majority of first- and second-language speakers on Earth and the great majority of languages on the Web. Moreover, we provide production pipeline code for anyone who wishes to produce similar resources for any of the 157 languages originally provided via fastText.
Our materials are aimed at two (often overlapping) sets of low-resource users. First, analysts who work with languages that have relatively small corpora from which it is hard to learn high-quality embeddings. For example, scholars with a few political pamphlets or tweets from France may struggle to build embeddings for a relatively new term like “iel” (a gender-neutral pronoun) from such a small corpus. The alternative strategy—translating the small corpus into a language for which embeddings do exist—may be unpalatable. Second, analysts who do not have local access to the computational resources required to train embedding models; we mean this both in terms of time and skill and in terms of computing power per se.
We now validate these approaches and discuss their relative performance. We first show that the ALC representations work well relative to the “full” embeddings that they approximate. We then focus high-cost efforts (i.e., crowdsourcing) on comparing (1) our version of fastText (fit to Wikipedia) with the original version of fastText and then (2) our version of fastText with an ALC version of our fastText. We do this because the fastText resources are the most innovative part of what we provide.
2 Performance and Validation
The resources we provide are useful to the extent that they provide reasonable representations of concepts, especially political ones. We now show that this is the case.
2.1 Reconstruction: ALC Embeddings Provide Reasonable Approximations of the “Truth”
Recall that ALC embeddings are an approximation to (what we might describe as) true ones, where “true” means the embeddings estimated from a vast corpus. We have the latter insofar as we can learn fastText or GloVe embeddings from, say, Wikipedia. We can then compare that truth to our estimate (our ALC embedding). We would hope that our ALC embedding can reconstruct that truth and, on average, be “close” to it rather than “far” from it. These standards are vague in an absolute sense, but they do allow us some comparison across languages. The unit of comparison here is 100 random terms per language, constrained to have a higher frequency than the median token in the corpus.Footnote 6 For each term and each language, we estimate the cosine similarity between its pre-trained embedding and its corpus-wide ALC embedding. In SI E, we describe exactly how this test proceeds.
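For readers who want to replicate this check on their own materials, the sketch below shows the per-term comparison, assuming `sampled_terms` holds the 100 random tokens for a language and that `pretrained` and `alc` map each token to its pre-trained (“true”) and corpus-wide ALC vectors; the names are illustrative rather than our released code.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reconstruction_scores(sampled_terms, pretrained, alc):
    """Cosine similarity between ALC and pre-trained vectors, term by term."""
    return {w: cosine(pretrained[w], alc[w]) for w in sampled_terms}

# scores = reconstruction_scores(sampled_terms, pretrained, alc)
# print(np.mean(list(scores.values())))  # the per-language mean (diamond in Figure 1)
```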
The cosine similarities by construction range between $-1$ and 1. If this number is 1, then the ALC embeddings (of our random terms) perfectly approximate our “true” embeddings; if they are zero or even negative, they provide a very poor approximation. In Figure 1, we report the results for all the languages we have worked with so far, including the mean (diamond) and the cosine for each of the 100 random terms (circles).
We have two immediate observations: first, ALC generally recovers both architectures’ pre-trained embeddings very well for any language. In general, means are around 0.77 for fastText and 0.67 for GloVe.Footnote 7 Second, there is nontrivial variation within and between languages. In particular, and as we show more explicitly in Figure 3 of SI D, ALC does best when there is more training data—for example, English has a higher mean than Irish. Moreover, within languages with lower means, we see longer left tails—that is, there are more terms further from the mean where ALC does a worse job of approximating the “truth.” Again, this is primarily a consequence of training data availability.
A more qualitatively informative procedure is to check that words represented via our embeddings “mean” what we expect them to. We first verify this by studying a curated domain setting—specifically, translated English/Spanish speeches at the European Parliament (EP), 1999–2001 (Høyland, Sircar, and Hix 2009). We proceed as described in SI F.
2.2 Crowdsourcing: Similar Aggregate Performance, ALC Delivers More Substantive Connotations
Another and somewhat easier way to assess the quality of our embedding resources in different languages is to look at the nearest neighbors of certain political terms. Consider Table 1. There, we provide nearest neighbors (by cosine similarity) for the terms democracy and equality. The nearest neighbors are drawn from two resources: our recompiled version of fastText and our ALC-based version of fastText.Footnote 8 Consistent with our notes above, the training corpus is (English) Wikipedia.
Reassuringly, these nearest neighbors make sense—that is, neither model produces “odd” results. Arguably, by moving beyond lexical similarities and similar word stems, ALC produces slightly more “useful” results than the pure fastText model. The same is true when we analyze the French terms nationalisme (nationalism) and racisme (racism), for which the training corpus is French Wikipedia, per Table 2.
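Nearest neighbors such as those in Tables 1 and 2 can be retrieved with a few lines of code; the sketch below assumes `embeddings` maps each vocabulary word to a unit-normalized numpy vector (whether our fastText vectors or ALC vectors) and is illustrative rather than our exact tooling.

```python
import numpy as np

def nearest_neighbors(query_vec, embeddings, k=10):
    """Rank vocabulary words by cosine similarity to `query_vec`."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scored = [(w, float(query_vec @ v)) for w, v in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# e.g., nearest_neighbors(embeddings["nationalisme"], embeddings)
```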
To scale these comparisons between models, we turn to crowdsourcing (Benoit et al. 2016). Following Rodriguez and Spirling (2022), we designed a lightweight web application that shows crowdworkers a token with political connotations and then asks which of two words (drawn from two models) the worker thinks is a more plausible “context” term for that token. We translated the app into all of the (non-English) United Nations “Official Languages,” and, in each language, we use eight “political” terms (law, liberty, equality, justice, politics, tax, citizen, and police). Hence, we evaluate Arabic, (traditional Mandarin) Chinese, French, Russian, and Spanish. In addition, we also created Japanese and Korean versions. If we take Rodriguez et al. (2023) as sufficient evidence for the merits of ALC in English, then, combined with our exercise, we “cover” around 45% of the world’s first and second languages and around 77% of the Web’s content languages.Footnote 9 Locating native speakers of these (non-English) languages was not trivial (and not cheap) in some cases. We worked with a specialist crowdsourcing firm, CloudResearch, for this purpose. In SI G, we give more details on this process.
We ask crowdworkers to make two sets of comparisons: original fastText versus our version and then our version of fastText versus an ALC version of that resource. In Figure 2, we give an overview of the results. In the top subfigure, we report the comparison of our version of fastText to the original fastText. Each bar represents a term in the task (the far left bar is an overall result); we also include 95% confidence intervals. When that bar is higher than 1, respondents (on average) preferred our version; when below 1, they preferred the original. Ultimately, this comparison is equivocal, with the original fastText being preferred in a couple of cases, but mostly, the difference is not statistically significant. The bottom subfigure compares our fastText to our ALC. Here, we see that, for the crowdworkers, ALC is generally not the preferred option, though again, this is equivocal in some cases.
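For transparency about what the bars in Figure 2 represent, the sketch below computes one plausible version of such a preference ratio with a bootstrap 95% confidence interval; the 0/1 coding, the placeholder `choices` data, and the bootstrap itself are assumptions for illustration, not necessarily the exact estimator we used.

```python
import numpy as np

# `choices` is assumed to be 1 when a worker picked model A's candidate
# context word and 0 when they picked model B's.

def preference_ratio(choices):
    a, b = choices.sum(), (1 - choices).sum()
    return a / b

rng = np.random.default_rng(0)
choices = rng.integers(0, 2, size=500)  # placeholder data
boot = [preference_ratio(rng.choice(choices, size=choices.size, replace=True))
        for _ in range(2000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"ratio={preference_ratio(choices):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```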
Across languages, crowdworkers mostly do not see huge differences in quality and have a mild preference for the (original) fastText resources (see SI H).Footnote 10 So does this mean an analyst should always prefer the original fastText over our version, including the one using ALC? The answer is “no” for two reasons. First, the ALC embeddings give one access to the inferential machinery we discussed above. That is, the ALC embeddings are, by construction, an approximation, but they also allow one to conduct regressions, do statistical tests, and so forth. Second, and perhaps more fundamentally, these contest results disguise some important heterogeneity in use cases. Put simply, crowdworkers prefer more obvious “everyday” or “vanilla” nearest neighbors, whereas our new resources are likely helpful to analysts interested in technical terms. To see this concretely, consider the Arabic word for law. Its ALC nearest neighbor is the term for legislator, whereas its fastText nearest neighbor is the adverb legally. Going down the list, fastText returns many lexical neighbors, such as legal and a combination of a function word with the original keyword. Meanwhile, ALC returns more context-specific terms, such as binding and legislation.
A final note on our crowdsourcing data is that the comparisons were based on minimal preprocessing and post-processing of the embeddings. For example, we imposed only very small minimum counts for a term to be included in the set of embeddings, specifically a minimum frequency of 10 occurrences in the language-specific Wikipedia corpus. We did this to make the comparison as “raw” and clear as possible. However, following some internal experiments, we adjusted the various cutoffs upward in our distributed resources. We did this especially for larger languages to ensure more robust and sensible embeddings. Put otherwise, the relative ALC versus non-ALC crowd comparisons above are likely the worst-case scenario for ALC.Footnote 11
3 Advice to Researchers Using Our Resources
Our observations about ALC above are with reference to the relevant transformation matrix ($\mathbf{A}$) having been estimated from the underlying corpus—specifically, Wikipedia. Unsurprisingly, whether this is appropriate for a given problem is a function of how “close” the researcher’s corpus is to Wikipedia. Here are three gradated scenarios to guide researchers in making such choices in practice:
1. Approximately in sample: if the researcher’s local corpus is “close enough” to Wikipedia, then using our pre-fitted transformation matrix will work as well as anything else from the perspective of producing ALC embeddings. We demonstrate this with an example in SI J, where we use ALC embeddings for the German Wikipedia to identify homonyms.
2. Out of sample, small corpus. The researcher is out of sample if their corpus does not particularly resemble Wikipedia. If their corpus is too small to fit local models, we recommend using our estimated $\mathbf {A}$ matrix and carefully checking its validity. We give an example for this case using French and Italian parliamentary corpora in SI K.
3. Out of sample, large corpus. If their corpus is large, we advise researchers to simply fit a local transformation matrix using our pipeline code—and potentially fit their own embeddings (a minimal sketch of this local fitting step appears after this list). Of course, this involves a judgment call: the user must decide whether their inferences are better with our $\mathbf{A}$ for the language and corpus at stake or with their own (and/or with their own local embeddings). We did local fitting of $\mathbf{A}$ to our various parliamentary corpora to provide calibration. As illustrated in SI K, the results are satisfactory for the Congressional Record (median speech length 215 words) but unsatisfactory for the French and Italian corpora (median speech lengths 40 and 140 words, respectively).
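Below is a minimal sketch, in the spirit of Khodak et al. (2018), of what such local fitting of $\mathbf{A}$ involves: regress each word’s pre-trained embedding on the average pre-trained embedding of its corpus contexts. It assumes `pretrained` maps tokens to vectors and `context_means` holds the per-word context averages ($u_w$ above); the names and the plain least-squares solver are illustrative rather than our pipeline code.

```python
import numpy as np

def fit_transformation(pretrained, context_means, min_terms=1000):
    """Fit A such that pretrained[w] is approximated by A @ context_means[w]."""
    shared = [w for w in context_means if w in pretrained]
    if len(shared) < min_terms:
        raise ValueError("corpus is likely too small to fit A reliably")
    U = np.stack([context_means[w] for w in shared])   # n x d context averages
    V = np.stack([pretrained[w] for w in shared])      # n x d pre-trained vectors
    # solve V ~ U @ A^T by least squares, i.e., v_w ~ A u_w for each word w
    A_T, *_ = np.linalg.lstsq(U, V, rcond=None)
    return A_T.T

# A_local = fit_transformation(pretrained, context_means)
```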
To the extent researchers seek more concrete advice, our evidence suggests using our estimated quantities as a first cut at the problem. If they seem suitable and can be validated—for example, via substantive inspection of the nearest neighbors—then one can build out from there. If they do not seem suitable, researchers should consider estimating their own with our code. Subsumed in this recommendation is the idea that one might train with something other than Wikipedia on quality grounds. That is, we acknowledge that this resource plausibly varies in quality across languages, and analysts should use their expert judgment in deciding whether it is appropriate for their use case. In any case, our resources are a reasonable comparison point for any such work.
Acknowledgements
First version: March 17, 2023. This version: November 11, 2024. A previous version of our work won the “Best Virtual Poster” award at the Summer Polmeth Meeting (2022). Christopher Lucas, Sebastian Popa, Clara Suong, and Luwei Ying provided very helpful comments on an earlier draft. We thank Mikhail Khodak for providing us with helpful feedback. We received excellent research assistance and advice from Dias Akhmetbekov, Alia ElKattan, Tatsuya Koyama, Cristina Mac Gregor Vanegas, Francis William Touola Meda, Yinxuan Wang, and Kyu Sik Yang.
Data Availability Statement
The resources discussed in this paper are available at http://alcembeddings.org/. This includes the training pipeline, the trained resources, and data. Replication code for this article has been published in Code Ocean, a computational reproducibility platform that enables users to run the code, and can be viewed interactively at https://doi.org/10.24433/CO.1866319.v3 (Wirsching et al. 2024).
Supplementary Material
For supplementary material accompanying this paper, please visit https://doi.org/10.1017/pan.2024.17.