
Meemi: A simple method for post-processing and integrating cross-lingual word embeddings

Published online by Cambridge University Press:  13 October 2021

Yerai Doval*
Affiliation:
Grupo COLE, Escola Superior de Enxeñaría Informática, Universidade de Vigo, Ourense Vigo, Spain
Jose Camacho-Collados
Affiliation:
School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
Luis Espinosa-Anke
Affiliation:
School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
Steven Schockaert
Affiliation:
School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
*Corresponding author. E-mail: yerai.doval@uvigo.es

Abstract

Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space where word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddings by aligning two disjoint monolingual vector spaces through an orthogonal transformation which preserves the structure of the monolingual counterparts. In this work, we propose to apply an additional transformation after this initial alignment step, which aims to bring the vector representations of a given word and its translations closer to their average. Since this additional transformation is non-orthogonal, it also affects the structure of the monolingual spaces. We show that our approach both improves the integration of the monolingual spaces and the quality of the monolingual spaces themselves. Furthermore, because our transformation can be applied to an arbitrary number of languages, we are able to effectively obtain a truly multilingual space. The resulting (monolingual and multilingual) spaces show consistent gains over the current state-of-the-art in standard intrinsic tasks, namely dictionary induction and word similarity, as well as in extrinsic tasks such as cross-lingual hypernym discovery and cross-lingual natural language inference.

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2021. Published by Cambridge University Press

1. Introduction

A popular research direction in multilingual Natural Language Processing (NLP) consists in learning mappings between two or more monolingual word embedding spaces. These mappings, together with the initial monolingual spaces, define a multilingual word embedding space in which words from different languages with a similar meaning are represented as similar vectors. Such multilingual embeddings not only play a central role in multilingual NLP tasks, but also provide a natural tool for transferring models that were trained on resource-rich languages (typically English) to other languages, where the availability of annotated data may be more limited.

State-of-the-art models for aligning monolingual word embeddings currently rely on learning an orthogonal mapping from the monolingual embedding of a source language into the embedding of a target language. Somewhat surprisingly, perhaps, this restriction to orthogonal mappings, as opposed to arbitrary linear or even non-linear mappings, has proven crucial to obtain optimal results. The advantages of using orthogonal transformations are twofold. First, because they are more constrained than arbitrary linear transformations, they can be learned from noisy data in a more robust way. This plays a particularly important role in settings where alignments between monolingual spaces have to be learned from small and/or noisy dictionaries (Artetxe, Labaka, and Agirre 2017), including dictionaries that have been heuristically induced in a purely unsupervised way (Conneau et al. 2018a; Artetxe, Labaka, and Agirre 2018b). Second, orthogonal transformations preserve the distances between the word vectors, which means that the internal structure of the monolingual spaces is not affected by the alignment. Approaches that rely on orthogonal transformations thus have to assume that the word embedding spaces for different languages are approximately isometric (Barone 2016). However, it has been argued that this assumption is not always satisfied (Kementchedjhieva et al. 2018; Søgaard, Ruder, and Vulić 2018; Patra et al. 2019). Moreover, rather than treating the monolingual embeddings as fixed elements, we may intuitively expect that embeddings from different languages can actually be used to improve each other. This idea was exploited by Faruqui and Dyer (2014), who learn linear mappings from two monolingual spaces onto a new, shared, multilingual space. They found that the resulting changes to the internal structure of the monolingual spaces can indeed bring benefits. In multilingual evaluation tasks, however, their method is outperformed by approaches that rely on orthogonal transformations (Artetxe, Labaka, and Agirre 2016).

While the emphasis has shifted from static word vectors to contextualised language models in recent years, it is worth mentioning that static vectors remain an important object of study. On the one hand, static vectors are still needed in applications where the computational demands of contextualised language models are prohibitive, or where word meaning needs to be captured in the absence of context (e.g., ontology alignment). On the other hand, static vectors can also provide useful prior knowledge when training contextualised models such as mBERT (Devlin et al. 2019). In particular, Artetxe, Ruder, and Yogatama (2020) show how static cross-lingual embeddings can be exploited for zero-shot multilingual transfer of contextualised models.

In this article, we propose a simple method that combines the advantages of orthogonal transformations with the potential benefit of allowing monolingual spaces to affect each other’s internal structure. Specifically, we first align the given monolingual spaces by learning an orthogonal transformation using an existing state-of-the-art method. Subsequently, we aim to reduce any remaining discrepancies by trying to find the middle ground between the aligned monolingual spaces. More precisely, let (w,v) be an entry from a bilingual dictionary (i.e., v is the translation of w), and let $\mathbf{w}$ and $\mathbf{v}$ be the vector representations of w and v in the aligned monolingual spaces. Our aim is to learn linear mappings $\mathbf{M_s}$ and $\mathbf{M_t}$ such that $\mathbf{w}\mathbf{M_s} \approx \mathbf{v}\mathbf{M_t} \approx \frac{\mathbf{v}+\mathbf{w}}{2}$ , for each entry (w,v) from a given dictionary. Crucially, because we start from monolingual spaces that are already aligned, applying the mappings $\mathbf{M_s}$ and $\mathbf{M_t}$ can be thought of as a fine-tuning step. We will refer to this proposed fine-tuning step as Meemi (Meeting in the middle).Footnote a Our experimental analysis reveals that this combination of an orthogonal transformation followed by a simple non-orthogonal fine-tuning step consistently, and often substantially, outperforms existing approaches in cross-lingual evaluation tasks. We also find that the proposed transformation leads to improvements in the monolingual spaces, which, as already mentioned, is not possible with orthogonal transformations. This article extends our earlier work in Doval et al. (2018) in the following ways:

  1. We introduce a more general formulation of Meemi, in which the averages that are used to compute the linear transformations can be weighted (e.g., by word frequencies, as we explore in this article).

  2. We generalise the approach to an arbitrary number of languages, thus allowing us to learn truly multilingual vector spaces.

  3. We compare the obtained multilingual models more thoroughly, extending the number of baselines and evaluation tasks. We now also include a more extensive analysis of the results, for example, studying the impact of the size of the bilingual dictionaries in more detail.

  4. In the evaluation, we now include two distant languages which do not use the Latin alphabet: Farsi and Russian. This further supports the generalisation of our conclusions.

2. Background: Cross-lingual alignment methods

In this article, we analyse cross-lingual word embedding models that are based on aligning monolingual vector spaces. The overall process underpinning these methods is as follows. Given two monolingual corpora, a word vector space is first learned independently for each language. This can be achieved with standard word embedding models such as Word2vec (Mikolov et al. 2013a), GloVe (Pennington, Socher, and Manning 2014) or FastText (Bojanowski et al. 2017). Second, a linear alignment strategy is used to map the monolingual embeddings to a common bilingual vector space. It is worth mentioning that, unlike the approaches of Zennaki, Semmar, and Besacier (2019) or Vulić and Moens (2016), we do not require parallel or comparable corpora to build our multilingual models.

These linear transformations are learned from a supervision signal in the form of a bilingual dictionary (although some methods can also deal with dictionaries that are automatically generated as part of the alignment process; see below). This approach was popularised by Mikolov, Le, and Sutskever (2013b). Specifically, they proposed to learn a matrix $\mathbf{W}$ which minimises the following objective:

(1) \begin{equation}\sum_{i=1}^n \| \mathbf{x_i}\mathbf{W} - \mathbf{z_i} \|^2\end{equation}

where we write $\mathbf{x_i}$ for the vector representation of some word $x_i$ in the source language and $\mathbf{z_i}$ for the vector representation of the translation $z_i$ of $x_i$ in the target language. This optimisation problem corresponds to a standard least-squares regression problem, whose exact solution can be efficiently computed (although Mikolov et al. 2013b do not use this method). Note that this approach relies on a bilingual dictionary containing the training pairs $(x_1,z_1),...,(x_n,z_n)$ . However, once the matrix $\mathbf{W}$ has been learned, for any word x in the source language, we can use $\mathbf{x}\mathbf{W}$ as a prediction of the vector representation of the translation of x. In particular, to predict which word in the target language is the most likely translation of the word x from the source language, we can then simply take the word z whose vector $\mathbf{z}$ is closest to the prediction $\mathbf{x}\mathbf{W}$ .
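The least-squares formulation in (1) and the nearest-neighbour translation step can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not the implementation used by Mikolov et al.; the function names are hypothetical, and cosine similarity is assumed as the notion of "closest".

```python
import numpy as np

def learn_linear_map(X, Z):
    """Least-squares solution W of min ||XW - Z||^2 (rows are word vectors)."""
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def translate(x, W, target_vectors, target_words):
    """Return the target word whose vector is closest (cosine) to xW."""
    pred = x @ W
    sims = (target_vectors @ pred) / (
        np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(pred)
    )
    return target_words[int(np.argmax(sims))]

# Toy data (assumption): the target space is an exact linear image of the source.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))        # source vectors x_1..x_n
W_true = rng.normal(size=(5, 5))
Z = X @ W_true                      # target vectors z_1..z_n
words = [f"z{i}" for i in range(50)]

W = learn_linear_map(X, Z)
best = translate(X[3], W, Z, words)
```

In this idealised setting the learned matrix coincides with the generating one, so every source word retrieves its own translation; with real embeddings the fit is only approximate.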

The restriction to linear mappings might intuitively seem overly strict. However, it was found that higher-quality alignments can be obtained by being even more restrictive. In particular, Xing et al. (2015) suggested normalising the word vectors in the monolingual spaces and restricting the matrix $\mathbf{W}$ to an orthogonal matrix (i.e., imposing the constraint that $\mathbf{W}\mathbf{W}^T=\mathbf{I}$ ). Under this restriction, the optimisation problem (1) is known as the orthogonal Procrustes problem, whose exact solution can still be computed efficiently. Another approach was taken by Faruqui and Dyer (2014), who proposed to learn linear transformations $\mathbf{W_s}$ and $\mathbf{W_t}$ , which, respectively, map vectors from the source and target language word embeddings onto a shared vector space. They used Canonical Correlation Analysis to find the transformations $\mathbf{W_s}$ and $\mathbf{W_t}$ which maximise the dimension-wise correlation between $\mathbf{X}\mathbf{W_s}$ and $\mathbf{Z}\mathbf{W_t}$ , where $\mathbf{X}$ is a matrix whose rows are $\mathbf{x_1},...,\mathbf{x_n}$ and similarly $\mathbf{Z}$ is a matrix whose rows are $\mathbf{z_1},...,\mathbf{z_n}$ . Note that while the aim of Xing et al. (2015) is to avoid making changes to the cosine similarities between word vectors from the same language, Faruqui and Dyer (2014) specifically want to take into account information from the other language with the aim of improving the monolingual embeddings themselves. Artetxe et al. (2016) propose a model which combines ideas from Xing et al. (2015) and Faruqui and Dyer (2014). Specifically, they use the formulation in (1) with the constraint that $\mathbf{W}$ be orthogonal, as in Xing et al. (2015), but they also apply a preprocessing strategy called mean centering, which is closely related to the model from Faruqui and Dyer (2014). On top of this, in Artetxe, Labaka, and Agirre (2018a) they propose a multi-step framework in which they experiment with several pre-processing and post-processing strategies. These include whitening (which involves applying a linear transformation to the word vectors such that their covariance matrix is the identity matrix), re-weighting each coordinate according to its cross-correlation (which means that the relative importance of those coordinates with the strongest agreement between both languages is increased), de-whitening (i.e., inverting the whitening step to restore the original covariances) and a dimensionality reduction step, which is seen as an extreme form of re-weighting (i.e., those coordinates with the least agreement across both languages are simply dropped). They also consider the possibility of using orthogonal mappings from both embedding spaces into a shared space, rather than mapping one embedding space onto the other, where the objective is based on maximising cross-covariance. This route is also followed by Kementchedjhieva et al. (2018). Other approaches that have been proposed for aligning monolingual word embedding spaces include models which replace (1) with a max-margin objective (Lazaridou, Dinu, and Baroni 2015) and models which rely on neural networks to learn non-linear transformations (Lu et al. 2015).
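The orthogonal Procrustes problem mentioned above has a well-known closed-form solution based on the singular value decomposition of $\mathbf{X}^T\mathbf{Z}$ . The sketch below illustrates it on synthetic data; the helper name `procrustes` and the toy setup are assumptions made for illustration and do not reproduce any of the cited systems.

```python
import numpy as np

def procrustes(X, Z):
    """Orthogonal W minimising ||XW - Z||, obtained from the SVD of X^T Z."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # length-normalised word vectors
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))    # a random orthogonal matrix
Z = X @ Q                                       # target space: rotated copy of X

W = procrustes(X, Z)

# Orthogonal maps leave distances between word vectors unchanged.
d_before = np.linalg.norm(X[0] - X[1])
d_after = np.linalg.norm(X[0] @ W - X[1] @ W)
```

Because the target space here is an exact rotation of the source, the recovered matrix equals the generating one and monolingual distances are unchanged, which is the structure-preserving property discussed in the text.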

A central requirement of the aforementioned methods is that they need a sufficiently large bilingual dictionary. Several approaches have been proposed to address this limitation, showing that high-quality results can be obtained in a purely unsupervised way. For instance, Artetxe et al. (2017) propose a method that can work with a small synthetic seed dictionary, for example, only containing pairs of identical numerals (1,1), (2,2), (3,3), etc. To this end, they alternate between using the current dictionary to learn a corresponding orthogonal transformation and using the learned cross-lingual embedding to improve the synthetic dictionary. This improved dictionary is constructed by assuming that the translation of a given word x is the nearest neighbour of $\mathbf{x}\mathbf{W}$ among all words from the target language. This approach was subsequently improved in Artetxe et al. (2018b), where state-of-the-art results were obtained without even assuming the availability of a synthetic seed dictionary. The key idea underlying their approach, called VecMap, is to initialise the seed dictionary in a fully unsupervised way based on the idea that the histogram of similarity scores between a given word x and the other words from the source language should be similar to the histogram of similarity scores between its translation z and the other words from the target language. Another approach which aims to learn bilingual word embeddings in a fully unsupervised way, called MUSE, is proposed in Conneau et al. (2018a). The main difference with VecMap lies in how the initial seed dictionary is learned. For this purpose, MUSE relies on adversarial training (Goodfellow et al. 2014), similarly to earlier models (Barone 2016; Zhang et al. 2017a) but using a simpler formulation, based on the model in (1) with the orthogonality constraint on $\mathbf{W}$ . The main intuition is to choose $\mathbf{W}$ such that it is difficult for a classifier to distinguish between word vectors $\mathbf{z}$ sampled from the target word embedding and vectors $\mathbf{x}\mathbf{W}$ , with $\mathbf{x}$ sampled from the source word embedding. There have been other approaches to create this initial bilingual dictionary without supervision via adversarial training (Zhang et al. 2017b; Hoshen and Wolf 2018; Xu et al. 2018) or stochastic processes (Alvarez-Melis and Jaakkola 2018), but their performance has not generally surpassed existing methods (Artetxe et al. 2018b; Glavaš et al. 2019). For a more comprehensive summary of existing methods, please refer to Ruder, Vulić, and Søgaard (2019).
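The alternation between learning an orthogonal map from the current dictionary and re-inducing the dictionary by nearest-neighbour search can be sketched as below. This is only a schematic reading of the self-learning idea described above, under strong simplifying assumptions (noise-free synthetic spaces, a seed of index pairs, a fixed number of iterations); it is not the VecMap code, and all function names are hypothetical.

```python
import numpy as np

def procrustes(X, Z):
    """Orthogonal W minimising ||XW - Z|| (SVD solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def induce_dictionary(X, Z, W):
    """Pair every source word with the nearest target neighbour of xW (cosine)."""
    P = X @ W
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return np.argmax(P @ Zn.T, axis=1)

def self_learning(X, Z, seed_src, seed_tgt, n_iter=5):
    """Alternate between fitting W on the dictionary and re-inducing the dictionary."""
    src, tgt = np.asarray(seed_src), np.asarray(seed_tgt)
    for _ in range(n_iter):
        W = procrustes(X[src], Z[tgt])
        tgt = induce_dictionary(X, Z, W)   # new dictionary over the full vocabulary
        src = np.arange(X.shape[0])
    return W, tgt

# Toy setup (assumption): the target space is an exact rotation of the source,
# and word i in the source translates to word i in the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
Z = X @ Q

seed = np.arange(10)                       # small seed dictionary of index pairs
W, dictionary = self_learning(X, Z, seed, seed)
```

On real embeddings the spaces are only approximately isometric, so the induced dictionary is noisy and the published methods add further safeguards that are omitted here.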

In this work, we make use of the three mentioned variants of VecMap, namely the supervised implementation based on the multi-step framework from Artetxe et al. (2018a), which will be referred to as VecMap $_\textrm{multistep}$ , the orthogonal method (VecMap $_\textrm{ortho}$ ) (Artetxe et al. 2016) and its unsupervised version (VecMap $_\textrm{uns}$ ) (Artetxe et al. 2018b). Similarly, we will consider the supervised and unsupervised variants of MUSE (MUSE and MUSE $_\textrm{uns}$ , respectively) (Conneau et al. 2018a). In the next section, we present our proposed post-processing method based on an unconstrained linear transformation to improve the results of the previous methods.Footnote b

3. Fine-tuning cross-lingual embeddings by meeting in the middle

After the initial alignment of the monolingual spaces, we propose to apply a post-processing step which aims to bring the two monolingual spaces closer together by lifting the orthogonality constraint. To this end, we learn an unconstrained linear transformation that maps word vectors from one space onto the average of that word vector and the vector representation of its translation (according to a given bilingual dictionary). This approach, which we call Meemi (Meeting in the middle), is illustrated in Figure 1. In particular, the figure illustrates the two-step nature, where we first learn an orthogonal transformation (using VecMap or MUSE), which aligns the two monolingual spaces as much as possible without changing their internal structure. Then, our approach aims to find a middle ground between the two resulting monolingual spaces. This involves applying a non-orthogonal transformation to both monolingual spaces.

Figure 1. Step-by-step integration of two monolingual embedding spaces: (1) obtaining isolated monolingual spaces, (2) aligning these spaces through an orthogonal linear transformation and (3) mapping both spaces using an unconstrained linear transformation learned on the averages of translation pairs.

By averaging between the representations obtained from different languages, we hypothesise that the impact of language-specific phenomena and corpus-specific biases will be reduced, whereas the core semantic features of each word will become more dominant. However, because we start from aligned spaces, the changes which are made by this transformation are relatively small. Our transformation is thus intuitively fine-tuning the usual orthogonal transformation, rather than replacing it. Note that this approach can naturally be applied to more than two monolingual spaces (Section 3.2). First, however, we will consider the standard bilingual case.

3.1 Bilingual models

Let D be the given bilingual dictionary, encoded as a set of word pairs (w, $w^{\prime}$ ). Using the pairs in D as training data, we learn a linear mapping $\mathbf{X}$ such that $\mathbf{w} \mathbf{X} \approx \frac{\mathbf{w}+\mathbf{w'}}{2}$ for all $(w,w')\in D$ , where we write $\mathbf{w}$ for the vector representation of word w in the given (aligned) monolingual space. This mapping $\mathbf{X}$ can then be used to predict the averages for words outside the given dictionary. To find the mapping $\mathbf{X}$ , we solve the following least squares linear regression problem:

(2) \begin{equation} E=\sum_{(w,w') \in D} \left\|\mathbf{w} \mathbf{X}-\frac{\mathbf{w}+\mathbf{w'}}{2}\right\|^2\end{equation}

Similarly, we separately learn a mapping $\mathbf{X'}$ such that $\mathbf{w'} \mathbf{X'} \approx \frac{\mathbf{w}+\mathbf{w'}}{2}$ .
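Since (2) is an ordinary least-squares problem, both mappings can be obtained with a standard solver. The following sketch assumes the dictionary pairs have already been looked up in the aligned spaces and stacked row-wise into two matrices `S` and `T` (hypothetical names); the data are synthetic and the function is an illustration rather than the released Meemi code.

```python
import numpy as np

def meemi(S, T):
    """Learn X, X' such that S X and T X' both approximate (S + T) / 2.

    S, T: (m, d) arrays; row i of S is a source word, row i of T its translation,
    both taken from the already aligned monolingual spaces.
    """
    avg = (S + T) / 2
    X_s, *_ = np.linalg.lstsq(S, avg, rcond=None)
    X_t, *_ = np.linalg.lstsq(T, avg, rcond=None)
    return X_s, X_t

# Toy aligned spaces (assumption): translations differ by small noise.
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 5))
T = S + 0.1 * rng.normal(size=(200, 5))

X_s, X_t = meemi(S, T)
gap_before = np.linalg.norm(S - T)
gap_after = np.linalg.norm(S @ X_s - T @ X_t)
```

Once learned, `X_s` and `X_t` would be applied to every vector of the respective vocabulary, not only to the dictionary entries. On this toy data the gap between translation pairs shrinks only slightly, since the noise is unstructured; the sketch is meant to show the mechanics, not the gains reported in the article.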

It is worth mentioning that we also experimented with non-linear mappings before arriving at the present formulation. However, multilayer perceptrons paired with different regularisation terms to avoid overfitting, such as penalising mappings that deviated excessively from the identity mapping, performed worse, which led us to discard this path for now.

We also consider a weighted variant of Meemi where the linear model is trained on weighted averages based on word frequency. Specifically, let $f_{w}$ be the occurrence count of word w in the corresponding monolingual corpus, then $\frac{\mathbf{w}+\mathbf{w'}}{2}$ is replaced by

(3) \begin{equation}\frac{f_{w} \mathbf{w} + f_{w'} \mathbf{w'}}{f_{w} + f_{w'}}\end{equation}

The intuition behind this weighted model is that the word w might be much more prevalent in the first language than the word $w^{\prime}$ is in the second language. A clear example is when $w=w'$ , which may be the case, among others, if w is a named entity. For instance, suppose that w is the name of a Spanish city. Then, we may expect to see more occurrences of w in a Spanish corpus than in an English corpus. In such cases, it may be beneficial to consider the word vector obtained from the Spanish corpus to be of higher quality, and thus give more weight to it in the average.
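The weighted target in (3) only changes the right-hand side of the regression. A minimal sketch, assuming the occurrence counts are available as arrays aligned with the dictionary rows (the names `f_s`, `f_t` and both functions are hypothetical):

```python
import numpy as np

def weighted_targets(S, T, f_s, f_t):
    """Frequency-weighted average of each translation pair, as in (3)."""
    f_s = np.asarray(f_s, dtype=float)[:, None]
    f_t = np.asarray(f_t, dtype=float)[:, None]
    return (f_s * S + f_t * T) / (f_s + f_t)

def meemi_weighted(S, T, f_s, f_t):
    """Same regression as plain Meemi, but against the weighted averages."""
    target = weighted_targets(S, T, f_s, f_t)
    X_s, *_ = np.linalg.lstsq(S, target, rcond=None)
    X_t, *_ = np.linalg.lstsq(T, target, rcond=None)
    return X_s, X_t

# A word seen three times as often in the first corpus pulls the target
# three quarters of the way towards its own vector.
S = np.array([[1.0, 0.0]])
T = np.array([[0.0, 1.0]])
example = weighted_targets(S, T, [3], [1])   # [[0.75, 0.25]]
```

With equal counts the weighted target reduces to the plain average used in (2).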

We will write Meemi (M) to refer to the model obtained by applying Meemi after the base method M, where M may be any variant of VecMap or MUSE. Similarly, we will write Meemi $_\textrm{w}$ (M) in those cases where the weighted version of Meemi was used.

3.2 Multilingual models

To apply Meemi in a multilingual setting, we exploit the fact that bilingual orthogonal methods such as VecMap (without re-weighting) and MUSE do not modify the target monolingual space but only apply an orthogonal transformation to the source. Hence, by simply applying this method to multiple language pairs while fixing the target language (i.e., for languages $l_{1}, l_{2}, ..., l_{n}$ , we construct pairs of the form $(l_{i}, l_{n})$ with $i \in \{1,...,n-1\}$ ), we can obtain a multilingual space in which all of the corresponding monolingual models are aligned with, or mapped onto, the same target embedding space. Note, however, that if we applied a re-weighting strategy, as suggested in Artetxe et al. (2018a) for VecMap, the target space would no longer remain fixed for all source languages and would instead change depending on the source in each case. While most previous work has been limited to bilingual settings, multilingual models involving more than two languages have already been studied by Ammar et al. (2016), who used an approach based on Canonical Correlation Analysis. As in our approach, they also fix one specific language as the reference language.

Formally, let D be the given multilingual dictionary, encoded as a set of tuples $(w_1,w_2,...,w_n)$ , where n is the number of languages. Using the tuples in D as training data, we learn a linear mapping $\mathbf{X_i}$ for each language, such that $\mathbf{w_i}\mathbf{X_i} \approx \frac{\mathbf{w_1}+...+\mathbf{w_n}} {n}$ for all $(w_1,...,w_n)\in D$ . This mapping $\mathbf{X_i}$ can then be used to predict the averages for words in the ith language outside the given dictionary. To find the mappings $\mathbf{X_i}$ , we solve the following least squares linear regression problem for each language:

(4) \begin{equation} E_{\textit{multi}}=\sum_{(w_1,...,w_n) \in D} \left\|\mathbf{w_i}\mathbf{X_i}- \frac{\mathbf{w_1}+...+\mathbf{w_n}} {n}\right\|^2\end{equation}

Note that while a weighted variant of this model can straightforwardly be formulated, we will not consider this in the experiments.
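The multilingual objective in (4) amounts to one least-squares problem per language against the shared average. The sketch below assumes that all spaces have already been aligned to a common target language with an orthogonal method and that row j of every matrix corresponds to the same dictionary tuple; the function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def meemi_multi(spaces):
    """Learn one mapping X_i per language towards the average of all n languages.

    spaces: list of n arrays of shape (m, d); row j of each array holds the
    vector of the j-th dictionary tuple in that (already aligned) language.
    """
    avg = np.mean(spaces, axis=0)
    return [np.linalg.lstsq(V, avg, rcond=None)[0] for V in spaces]

# Toy data (assumption): three languages that are noisy copies of a shared space.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 4))
spaces = [base + 0.1 * rng.normal(size=base.shape) for _ in range(3)]

maps = meemi_multi(spaces)
avg = np.mean(spaces, axis=0)
residuals = [np.linalg.norm(V @ X - avg) for V, X in zip(spaces, maps)]
baseline = [np.linalg.norm(V - avg) for V in spaces]
```

Each learned mapping fits the average at least as well as leaving the space untouched, because the identity matrix is itself a candidate solution of the regression.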

4. Experimental setting

In this section, we explain the common training settings for all experiments. First, the monolingual corpora that were used, as well as other training details that pertain to the initial monolingual embeddings, are discussed in Section 4.1. Then, in Section 4.2, we explain which bilingual and multilingual dictionaries were used as supervision signals. Finally, all compared systems are listed in Section 4.3.

4.1 Corpora and monolingual embeddings

Instead of using comparable corpora such as Wikipedia, as in much of the previous work (Artetxe et al. 2017; Conneau et al. 2018a), we make use of independent corpora extracted from the web. This represents a more realistic setting where alignments are harder to obtain, as already noted by Artetxe et al. (2018b). For English, we use the UMBC WebBase Corpus (Han et al. 2013), containing over 3 billion words. For Spanish, we use the Spanish Billion Words Corpus (Cardellino 2016), consisting of over a billion words. For Italian and German, we use the itWaC and sdeWaC corpora from the WaCky project (Baroni et al. 2009), containing 2 and 0.8 billion words, respectively.Footnote c For Finnish and Russian, we use their corresponding Common Crawl monolingual corpora from the Machine Translation of News Shared Task 2016,Footnote d composed of 2.8B and 1.1B words, respectively. Finally, for Farsi we leverage the newswire Hamshahri corpus (AleAhmad et al. 2009), composed of almost 200M words.

In a preprocessing step, all corpora were tokenised using the Stanford tokeniser (Manning et al. 2014) and lowercased. Then we trained FastText word embeddings (Bojanowski et al. 2017) on the preprocessed corpora for each language. The dimensionality of the vectors was set to 300, using the default values for the remaining hyperparameters.

In our experiments, we consider, first, six Indo-European languages, of which Spanish and Italian are Romance, English and German are Germanic, Russian is Slavic and Farsi is Iranian. Second, we also include experiments for Finnish, which is a Uralic Finnic language (Dryer and Haspelmath 2013). Finally, we have also included a set of exclusively distant languages: Arabic and Hebrew, both of them Semitic Afro-Asiatic; Finnic Uralic Estonian, Slavic Indo-European Polish and Sino-Tibetan Chinese. For this latter set of languages, we use the pretrained monolingual embeddings available from the FastText website,Footnote e obtained from Common Crawl and Wikipedia. Since we could not access the source corpora for these monolingual embeddings, we could not gather frequency information and therefore we only tested the default variant of Meemi (i.e., not weighted). Furthermore, for the multilingual version of Meemi (see Section 3.2), we consider those languages for which we train the corresponding monolingual embeddings: English, Spanish, Italian, German, Russian, Farsi and Finnish.

4.2 Training dictionaries

We use the training dictionaries provided by Conneau et al. (2018a) as supervision. These bilingual dictionaries were compiled using the internal translation tools from Facebook. To make the experiments comparable across languages, we randomly extracted 8000 training pairs for all language pairs considered, as this is the size of the smallest available dictionary. For completeness, we also present results for fully unsupervised systems (see the following section), which do not take advantage of any dictionaries.

4.3 Compared systems

We have trained both bilingual and multilingual models involving up to seven languages. In the bilingual case, we consider the supervised and unsupervised variants of VecMap and MUSE to obtain the base alignments and then apply plain Meemi and weighted Meemi on the results. For supervised VecMap we compare with its orthogonal version VecMap $_\textrm{ortho}$ and the multi-step procedure VecMap $_\textrm{multistep}$ . For the multilingual case, we follow the procedure described in Section 3.2 making use of all seven languages considered in the evaluation, that is, English, Spanish, Italian, German, Finnish, Farsi and Russian. Note that in the bilingual case all three variants of VecMap can be used, whereas in the multilingual setting we can only use VecMap $_\textrm{ortho}$ .

5. Intrinsic evaluation

In this section, we assess the intrinsic performance of our post-processing techniques in cross-lingual (Section 5.1) and monolingual (Section 5.2) settings.

5.1 Cross-lingual performance

We evaluate the performance of all compared cross-lingual embedding models on standard purely cross-lingual tasks, namely dictionary induction (Section 5.1.1) and cross-lingual word similarity (Section 5.1.2).

5.1.1 Bilingual dictionary induction

Also referred to as word translation, this task consists in automatically retrieving the word translations in a target language for words in a source language. Acting on the corresponding cross-lingual embedding space which integrates the two (or more) languages of a particular test case, we obtain the nearest neighbours to the source word in the target language as our translation candidates. The performance is measured with precision at k ( $P@k$ ), defined as the proportion of test instances where the correct translation candidate for a given source word was among the k highest ranked candidates. The nearest neighbours ranking is obtained by using cosine similarity as the scoring function. For this evaluation, we use the corresponding test dictionaries released by Conneau et al. (2018a).
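The $P@k$ measure described above can be computed as in the following sketch. The representation of the gold standard as one set of acceptable target indices per source word is an assumption made for illustration, as are the function name and the toy vectors.

```python
import numpy as np

def precision_at_k(src_vecs, tgt_vecs, gold, k):
    """Proportion of source words with a correct translation among the top k.

    src_vecs: (m, d) test source vectors in the shared space.
    tgt_vecs: (V, d) target vocabulary vectors in the shared space.
    gold:     list of m sets of acceptable target indices.
    """
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = S @ T.T                                  # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]         # k best candidates per word
    hits = [bool(gold[i] & set(topk[i].tolist())) for i in range(len(gold))]
    return float(np.mean(hits))

tgt = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
src = np.array([[1.0, 0.1],     # ranks target 0 first
                [0.6, 0.8]])    # ranks target 1 first, target 0 second
gold = [{0}, {0}]

p1 = precision_at_k(src, tgt, gold, k=1)   # 0.5
p2 = precision_at_k(src, tgt, gold, k=2)   # 1.0
```

For full vocabularies the similarity matrix is large, so practical implementations typically process the test words in batches.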

We show the results attained by a wide array of models in Tables 1 and 2, where we can observe that the best figures are generally obtained by Meemi over the bilingual VecMap models. The impact of Meemi is more apparent when used in combination with the orthogonal base models, with improvements over the multi-step version of VecMap as well in most languages. These improvements are statistically significant at the 0.05 level across all language pairs, using paired t-tests. On the other hand, using the weighted version of Meemi (i.e., Meemi $_\textrm{w}$ in Table 1) does not seem to be particularly beneficial on this task, with the only exception of English-Farsi. In general, the performance of unsupervised models (i.e., VecMap $_\textrm{uns}$ and MUSE $_\textrm{uns}$ ) is competitive in closely-related languages such as English-Spanish or English-German but they considerably under-perform for distant languages, especially English-Finnish and English-Russian. We have double-checked the anomalous results for English-Finnish, and they appear to be correct under our current testing framework after five runs obtaining the same result. Finally, the results obtained by the multilingual model that includes all seven languages considered, that is, Meemi-multi (VecMap $_\textrm{ortho}$ ) in Table 1, improve over the base orthogonal model, but they do not improve over the results of our bilingual model. We further discuss the impact of adding languages to the multilingual model in Section 7.3.

Table 1. Precision at k ( $P@K$ ) performance of different cross-lingual embedding models in the bilingual dictionary induction task

5.1.2 Cross-lingual word similarity

Cross-lingual word similarity constitutes a straightforward benchmark to test the quality of bilingual embeddings. In this case, and in contrast to monolingual similarity, words in a given pair (a,b) belong to different languages, for example, a belonging to English and b to Farsi. For this task we make use of the SemEval-17 multilingual similarity benchmark (Camacho-Collados et al. 2017), considering the four cross-lingual data sets that include English as target language in particular, but discarding multi-word expressions. Also, we use the Multi-SimLex data set published by Vulić et al. (2020) for our experiments on the set of exclusively distant languages: Arabic, Hebrew, Estonian, Polish and Chinese. Performance is computed in terms of Pearson and Spearman correlation with respect to the gold standard.
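To make the two correlation measures concrete, the sketch below computes Pearson and Spearman correlation between gold similarity ratings and model scores (e.g., cosine similarities). It is a plain-Python illustration under our own naming; the helper functions and the toy scores are assumptions and are not taken from the benchmark tooling.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Made-up gold ratings and model similarities for four word pairs.
gold_scores = [4.0, 3.0, 1.0, 0.5]
model_scores = [0.9, 0.6, 0.7, 0.1]
r = pearson(gold_scores, model_scores)
rho = spearman(gold_scores, model_scores)
```

Pearson is sensitive to the actual score values, while Spearman only depends on how the word pairs are ordered.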

Table 2. Dictionary induction results for distant language pairs using FastText pre-trained monolingual embeddings as input using precision at k ( $P@K$ )

Table 3. Cross-lingual word similarity results in terms of Pearson (r) and Spearman ( $\rho$ ) correlation. Languages codes: English-EN, Spanish-ES, Italian-IT, German-DE and Farsi-FA

Tables 3 and 4 show the results of the different embedding models in the cross-lingual word similarity task. Except in a few cases for the VecMap $_{multistep}$ model, our Meemi transformation proves superior to the base models (at the 0.05 level for paired t-tests over all language pairs) and to all their unsupervised variants. For distant languages, where the results are lower overall, our Meemi transformation proves useful, generally outperforming the best VecMap models. As in the bilingual dictionary induction task, the weighted version of Meemi proves robust only on English-Farsi (Table 3), which suggests that this weighting scheme is most useful for distant languages, as in this case the Farsi monolingual space (which is learned from a smaller corpus and hence, as we will see in the next section, has a lower quality) gets closer to the English monolingual space. As far as the multilingual model is concerned, it proves beneficial in all cases with respect to the orthogonal version of VecMap, as well as compared to the bilingual variant of Meemi.

As for the results with distant languages in Table 4 (using pre-trained FastText embeddings), the trend is even more pronounced. Meemi helps improve the performance in all languages for the MUSE and VecMap orthogonal methods, and it also improves the performance of VecMap $_{multistep}$ in Arabic, Hebrew and Estonian.

Table 4. Cross-lingual word similarity results for distant language pairs using FastText pre-trained monolingual embeddings as input in terms of Pearson (r) and Spearman ( $\rho$ ) correlation. Language codes: English-EN, Arabic-AR, Hebrew-HE, Estonian-ET, Polish-PL and Chinese-ZH

5.2 Monolingual performance

One of the advantages of breaking the orthogonality of the transformation is the potential to improve the monolingual quality of the embeddings. To test the difference between the original word embeddings and the embeddings obtained after applying the Meemi transformation, we take monolingual word similarity as a benchmark. Given a word pair, this task consists in assessing the semantic similarity between both words in the pair, in this case from the same language. The evaluation is then performed in terms of Spearman and Pearson correlation with respect to human judgements. In particular, we use the monolingual data sets (English, Spanish, German and Farsi) from the SemEval-17 task on multilingual word similarity. The results provided by the original monolingual FastText embeddings are also reported as baseline.

Table 5 shows the results on the monolingual word similarity task. In this task, our multilingual model representing seven languages in a single space clearly stands out, obtaining the best overall results for English, Spanish and Italian and improving over the base VecMap $_\textrm{ortho}$ model on the rest. With the exception of German, where the multi-step framework of Artetxe et al. (2018a) proves most effective, the plain Meemi transformation improves over the base models, for both VecMap and MUSE.

Table 5. Monolingual word similarity results in terms of Pearson (r) and Spearman ( $\rho$ ) correlation

6. Extrinsic evaluation

We complement the intrinsic evaluation experiments, which are typically a valuable source for understanding the properties of the vector spaces, with downstream extrinsic cross-lingual tasks. This evaluation is especially necessary given that intrinsic behaviour does not always correlate well with downstream performance (Bakarov, Suvorov, and Sochenkov 2018; Glavaš et al. 2019). For this extrinsic evaluation we focus on the following question: how does our post-processing method help alleviate limitations of cross-lingual models that are due to their use of orthogonality constraints? To answer it, we perform experiments with the orthogonal model of VecMap (i.e., VecMap $_{\text{ortho}}$ ), in combination with the proposed Meemi strategy, both in bilingual and multilingual settings. For the latter case, we considered all six source languages, that is, Spanish, Italian, German, Finnish, Farsi and Russian, keeping English as the target language.

The tasks considered are cross-lingual hypernym discovery (Section 6.1) and cross-lingual natural language inference (Section 6.2).

6.1 Cross-lingual hypernym discovery

Hypernymy is an important lexical relation, which, if properly modeled, directly impacts downstream NLP tasks such as semantic search (Hoffart, Milchevski, and Weikum 2014; Roller and Erk 2016), question answering (Prager et al. 2008; Yahya et al. 2013) or textual entailment (Geffet and Dagan 2005). Hypernyms, in addition, are the backbone of taxonomies and lexical ontologies (Yu et al. 2015), which are in turn useful for organising, navigating and retrieving online content (Bordea, Lefever, and Buitelaar 2016). We propose to evaluate the quality of a range of cross-lingual vector spaces in the extrinsic task of hypernym discovery, that is, given an input word (e.g., ‘cat’), retrieve or discover its most likely (set of) valid hypernyms (e.g., ‘animal’, ‘mammal’, ‘feline’ and so on). Intuitively, by leveraging a bilingual vector space condensing the semantics of two languages, one of them being English, the need for large amounts of training data in the target language may be reduced (see footnote f).

The base model is a (cross-lingual) linear transformation trained with hyponym-hypernym pairs (Espinosa-Anke et al. 2016), which is afterwards used to predict the most likely (set of) hypernyms given a new term. Training and evaluation data come from the SemEval 2018 Shared Task on Hypernym Discovery (Camacho-Collados et al. 2018). Note that current state-of-the-art systems aimed at modeling hypernymy (Shwartz, Goldberg, and Dagan 2016; Bernier-Colborne and Barriere 2018; Held and Habash 2019) combine large amounts of annotated data along with language-specific rules and cue phrases such as Hearst Patterns (Hearst 1992), both of which are generally scarcely (if at all) available for languages other than English. As a reference, we have included the best performing unsupervised system for both Spanish and Italian (we will refer to this baseline as BestUns). This unsupervised baseline is based on the distributional models described in Shwartz, Santus, and Schlechtweg (2017).

Table 6. Cross-lingual hypernym discovery results in terms of Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) and precision at 5 ( $P@5$ ). In this case, VecMap = VecMap $_{\text{ortho}}$

As such, we report experiments (Table 6) with training data only from English (11,779 hyponym-hypernym pairs), and enriched models informed with relatively few training pairs (500, 1K, and 2K) from the target languages. Evaluation is conducted with the same metrics as in the original SemEval task, that is, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) and precision at 5 ( $P@5$ ). Specifically, MRR rewards the position of the first correct retrieved hypernym:

\begin{equation*}\mbox{{MRR}} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}\end{equation*}

where Q is a sample of experiment runs and $rank_i$ refers to the rank position of the first relevant outcome for the ith run. However, in this hypernym discovery data set, the vast majority of terms accept more than one correct hypernym, which is why MAP was considered as the official task metric in the SemEval task. This metric is defined as follows:

\begin{equation*}\mbox{MAP} = \frac{1}{|Q|}\sum_{q \in Q} AP(q)\end{equation*}

where AP (Average Precision) is the average of the $P@{K_1},...,P@{K_n}$ scores, where $K_1,...,K_n$ are the positions where the gold hypernyms appear in the ranking. As the maximum number of hypernyms allowed per term was 15, we only consider the first 15 gold hypernyms in cases where there are more.
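The two ranking metrics defined above can be sketched as follows. This is an illustrative implementation that follows the definitions as worded in this section (AP averages the precision values at the positions where gold hypernyms are found); it is not the official SemEval scorer, it omits the cap of 15 hypernyms per term, and all names and toy data are assumptions.

```python
def reciprocal_rank(ranked, gold):
    """1 / position of the first correct candidate, or 0 if none is found."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / position
    return 0.0

def average_precision(ranked, gold):
    """Mean of P@K_1, ..., P@K_n, where K_i are the positions at which
    gold hypernyms appear in the ranking (0 if none appears)."""
    hits = 0
    precisions = []
    for position, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_over_queries(metric, queries):
    """Average a per-query metric over (ranked, gold) pairs, i.e. over Q."""
    return sum(metric(ranked, gold) for ranked, gold in queries) / len(queries)

# Two made-up queries: (system ranking, set of gold hypernyms).
queries = [
    (["animal", "vehicle", "mammal"], {"animal", "mammal"}),
    (["tool", "fruit", "food"], {"fruit"}),
]
mrr = mean_over_queries(reciprocal_rank, queries)
map_score = mean_over_queries(average_precision, queries)
```

For the first query the reciprocal rank is 1 and the average precision is (1/1 + 2/3)/2; for the second both are 1/2, which shows how MAP, unlike MRR, is affected by the position of every correct hypernym.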

We report comparative results between the following systems: VecMap $_\textrm{uns}$ (the unsupervised variant), VecMap $_\textrm{ortho}$ (the orthogonal transformation variant), VecMap $_{multistep}$ (the supervised multi-stage variant) and three Meemi variants: Meemi (VecMap), Meemi $_\textrm{w}$ (VecMap) and Meemi-multi (VecMap). The first noticeable trend is the better performance of the unsupervised VecMap version versus its supervised orthogonal and multi-step counterparts. Nevertheless, we find remarkably consistent gains over both VecMap variants when applying Meemi, across all configurations for the two language pairs considered. In fact, the weighted (Meemi $_\textrm{w}$ ) version brings an increase in performance between 1 and 2 MRR and MAP points across the whole range of target language supervision (from zero to 2K pairs). This is in contrast to the intrinsic evaluation, where the weighted model did not seem to provide noticeable improvements over the plain version of Meemi. Finally, concerning the fully multilingual model, the experimental results suggest that, while still better than the orthogonal baselines, it falls short when compared to the weighted bilingual version of Meemi. This result suggests that exploring weighting schemes for the multilingual setting may bring further gains, but we leave this extension for future work.

6.2 Cross-lingual natural language inference

The task of natural language inference (NLI) consists in detecting entailment, contradiction or neutral relations in pairs of sentences. In our case, we test a zero-shot cross-lingual transfer setting where a system is trained with English corpora and is then evaluated on a different language. We base our approach on the assumption that better aligned cross-lingual embeddings should lead to better NLI models, and that the impact of the input embeddings may become more apparent in simple methods, as opposed to, for instance, complex neural network architectures. Hence, and also to account for the coarser linguistic granularity of this task (being a sentence classification problem rather than word level), we employ a simple bag-of-words approach where a sentence embedding is obtained through word vector averaging. We then train a linear classifier (see footnote g) to predict one of the three possible labels in this task, namely entailment, contradiction or neutral. We use the full MultiNLI English corpus (Williams, Nangia, and Bowman 2018) for training and the Spanish and German test sets from XNLI (Conneau et al. 2018b) for testing. For comparison, we also include a lower bound obtained by considering English monolingual embeddings for input; in this case, FastText trained on the UMBC corpus, which is the same model used to obtain multilingual embeddings.
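The bag-of-words sentence representation mentioned above can be sketched in a few lines. This is only an illustration of word vector averaging: skipping out-of-vocabulary tokens, returning a zero vector for empty sentences, and the toy vectors are assumptions of this example, and the linear classifier itself is left out.

```python
def sentence_embedding(tokens, vectors, dim):
    """Average the vectors of all in-vocabulary tokens.

    Tokens missing from `vectors` are skipped (an assumption of this
    sketch); a sentence with no known token maps to the zero vector.
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, one column at a time.
    return [sum(column) / len(known) for column in zip(*known)]

# Made-up shared space holding English and Spanish words.
toy_vectors = {
    "the": [0.0, 2.0], "cat": [2.0, 0.0],
    "el": [0.1, 1.9], "gato": [1.9, 0.1],
}
english = sentence_embedding(["the", "cat"], toy_vectors, dim=2)
spanish = sentence_embedding(["el", "gato", "zzz"], toy_vectors, dim=2)
```

When translations lie close to each other in the shared space, as in this toy example, the averaged representations of an English sentence and its Spanish counterpart are also close, which is what allows a classifier trained on English features to be applied to another language.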

Accuracy results are shown in Table 7. The main conclusion in light of these results is the remarkable performance of the unsupervised VecMap model and, most notably, multilingual Meemi for both Spanish and German, clearly outperforming the orthogonal bilingual mapping baseline. Our results are encouraging for two reasons. First, they suggest that, at least for this task, collapsing several languages into a unified vector space is better than performing pairwise alignments. Second, they highlight the inherent benefit of having a single model that accounts for an arbitrary number of languages.

Table 7. Accuracy, or the number of correct classifications (entailment, contradiction or neutral) over the total number of test instances, on the XNLI task using different cross-lingual embeddings as features

7. Analysis

We complement our quantitative (intrinsic and extrinsic) evaluations with an analysis aimed at discovering the most salient characteristics of the transformation that is found by Meemi. We present a qualitative analysis with examples in Section 7.1, as well as an analysis on the impact of the size of training dictionaries in Section 7.2 and on the performance of the multilingual model in Section 7.3.

7.1 Studying word translations

Table 8 lists a number of examples where, for a source English word, we explore its highest ranked cross-lingual synonyms (or word translations) in a target language. We select Spanish as a use case.

Let us study the examples listed in Table 8, as they constitute illustrative cases of linguistic phenomena which go beyond correct or incorrect translations. First, the word ‘crazy’ is correctly translated by both VecMap and Meemi; loco (masculine singular), locos (masculine plural) or loca (feminine) being standard translations, with no further connotations, of the source word. However, the most interesting finding lies in the fact that for Meemi-multi, the preferred translation is a colloquial (or even vulgar) translation which was not considered as correct in the gold test dictionary. The Spanish word chifladas translates to English as ‘going mental’ or ‘losing it’. Similarly, we would like to highlight the case of ‘telegraph’. This word is used in two major senses, namely to refer to a message transmitter and as a reference to media outlets (several newspapers have the word ‘telegraph’ in their name). VecMap and Meemi (correctly) translate this word into the common translation telégrafo (the transmission device), whereas Meemi-multi prefers its named-entity sense.

Other cases, such as ‘conventions’ and ‘discover’, illustrate the behaviour for common ambiguous words. In both cases, candidate translations are either misspellings of the correct translation (descubr for ‘discover’) or misspellings involving tokens conflating two words whose compositional meaning is actually a correct candidate translation for the source word; for example, legislaciones nacionales (‘national rulings’) for ‘conventions’. Finally, ‘remark’ offers an example of a case where ambiguity causes major disruptions. In particular, ‘remark’ translates in Spanish to observación, which in turn has an astronomical sense; ‘astronomical observatory’ translates to observatorio astronómico.

Table 8. Word translation examples from English and Spanish, comparing VecMap with the bilingual and multilingual variants of Meemi. For each source word, we show its five nearest cross-lingual synonyms. Bold translations are correct, according to the source test dictionary (cf. Section 5.1.1)

7.2 Impact of training dictionary and corpus size

Our method relies on the availability of suitable bilingual training dictionaries, where we can expect that the size of these dictionaries should have a clear impact on the quality of the final transformation. This is analysed in Figure 2 for the task of cross-lingual word similarity. The figure shows the absolute improvement (in percentage points) over VecMap by applying Meemi, using different training dictionary sizes for supervision.

Figure 2. Absolute improvement (in terms of Pearson correlation percentage points) obtained by applying Meemi over the two base orthogonal models VecMap and MUSE on the cross-lingual word similarity task, with different training dictionary sizes. As data points in the X-axis we selected 100, 1000, 3000, 5000 and 8000 word pairs in the dictionary.

As can be observed, using Meemi improves the results, for all language pairs, when dictionaries of 8K, 5K or 3K word pairs are used, but its performance drops sharply with dictionaries of smaller sizes (i.e., 1K and especially 100). In fact, having a larger dictionary helps avoid overfitting, which is a recurring problem in cross-lingual word embedding learning (Zhang et al. 2017a). The most remarkable case is that of Farsi, where Meemi improves the most, but where access to a sufficiently large dictionary becomes even more important. This behaviour clearly shows under which conditions our proposed final transformation can be applied with higher success rates. We leave exploring larger dictionaries and their impact in different tasks and languages for future work.

On the other hand, we have observed that while corpus size plays a role in the performance of our models, it is not as notable as it might seem at first. Given the different corpus sizes of the data we used to train our monolingual embeddings, we analysed the correlation between these sizes, mentioned in Section 4.1, and the performance figures presented in Tables 1, 3 and 5. The average Pearson correlation across multilingual models in dictionary induction, where all languages are available, is 0.38 (discarding VecMap $_\textrm{uns}$ due to its anomalous results for Finnish), while for cross-lingual and monolingual word similarity it is 0.69 and 0.65, respectively. Note, however, that in these latter cases we are missing two distant languages, that is, Finnish and Russian.

7.3 Multilingual performance

In this section we assess the benefits of our proposed multilingual integration (cf. Section 3.2). To this end, we measure fluctuations in performance as more languages were added to the initially bilingual model. Thus, starting from a bilingual embedding space obtained with VecMap $_\textrm{ortho}$ , we apply Meemi over a number of aligned spaces, which ultimately leads to a fully multilingual space containing the following languages: Spanish, Italian, German, Finnish, Farsi, Russian and English. This latter language is used as the target embedding space for the orthogonal transformations due to it being the richest in terms of resource availability.

To avoid a lengthy and overly exhaustive approach where all possible combinations from two to seven languages are evaluated, we opted for conducting an experiment where languages are divided into two groups and added one by one in a fixed order: the first group is formed by languages that obtain the best alignments with English in previous experiments, which broadly coincides with those that are closer to English in terms of language family and alphabet (i.e., Spanish, Italian and German), and then the second group formed by the remaining languages (i.e., Finnish, Farsi and Russian). However, this approach does not allow us to use, for example, the English-Farsi test set until reaching the fifth step. To solve this, if the language that is needed for the test set has not yet been included, we replace the last language that was added by the one that is needed for the test set. For instance, while we normally add Italian as the second source language (resulting in trilingual space en-es-it), for the English-German test set, the results are instead based on a space where we added German instead of Italian (i.e., the trilingual space en-es-de). In Table 9, we show the results obtained by the multilingual models in bilingual dictionary induction.
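The language selection procedure just described can be summarised with the following sketch. It only encodes the substitution rule as stated in the paragraph above; the function name, the use of two-letter language codes and the notion of a numbered step (the number of source languages) are assumptions made for illustration.

```python
# Fixed order in which source languages are added (English is the target).
ORDER = ["es", "it", "de", "fi", "fa", "ru"]

def languages_for_step(step, test_lang):
    """Languages in the multilingual space built with `step` source languages.

    If the language required by the test set has not been added yet,
    it replaces the last language that was added.
    """
    sources = ORDER[:step]
    if test_lang not in sources:
        sources[-1] = test_lang
    return ["en"] + sources

default_trilingual = languages_for_step(2, "es")   # en-es-it
german_trilingual = languages_for_step(2, "de")    # en-es-de
farsi_fifth_step = languages_for_step(5, "fa")     # no replacement needed
```

With this rule every test language can be evaluated at every step, at the cost of the spaces for different test sets not always containing exactly the same languages.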

The best results are achieved when more than two languages are involved in the training, which correlates with the results obtained in the rest of the tasks and highlights the ability of Meemi to successfully exploit multilingual information to improve the quality of the embedding models involved. In general, the performance fluctuates more significantly when adding the first language to the bilingual models and then stabilises at a similar level to the bilingual case when adding more distant languages.

Table 9. Dictionary induction results obtained with the multilingual extension of Meemi over (VecMap $_\textrm{ortho}$ ) in terms of precision at k ( $P@K$ ). The sequence in which source languages are added to the multilingual models is: Spanish, Italian, German, Finnish, Farsi and Russian (English is the target). The x indicates the use of the test language in each case (if the test language is already included, the following language in the sequence is added). We also include the scores of the original VecMap $_\textrm{ortho}$ as baseline

8. Conclusion

In this article, we have presented an extended study of Meemi, a simple post-processing method for improving cross-lingual word embeddings which was first presented in Doval et al. (2018). Our initial goal was to learn improved bilingual alignments from those obtained by state-of-the-art cross-lingual methods such as VecMap (Artetxe et al. 2018a) or MUSE (Conneau et al. 2018a). We do this by applying a final unconstrained linear transformation to their initial mappings. Our extensive evaluation reveals that Meemi, using only dictionary translation as supervision, can improve on the supervised and unsupervised variants of these models, in both close and distant languages. This also confirms findings from recent work that unsupervised models may be more brittle than supervised models, even if these are using only word translations as supervision (Vulić et al. 2019).

In this work, we have also gone beyond the bilingual setting by exploring an extension of the original Meemi model to align embeddings from an arbitrary number of languages in a single shared vector space. In particular, we take advantage of the fact that, assuming the initial alignment was obtained with an orthogonal mapping, Meemi can naturally be applied to any number of languages through a single linear transformation per language.

Regarding the evaluation, we extended the language set to include, in addition to the usual Indo-European languages such as English, Spanish, Italian or German, other distant languages such as Finnish, Farsi and Russian. The results we report in this article show that Meemi is highly competitive, consistently yielding better results than competing baselines, especially in the case of distant languages. We are particularly encouraged by the multilingual results, which suggest that bringing together distant languages from different families in a shared vector space is beneficial in most cases.

9. Future work

We will continue to explore the possibilities of post-processing multilingual models, investigating their impact in different tasks. Given the fact that going from restrictive orthogonal transformations to the less constrained Meemi transformation was found to be beneficial in the integration of monolingual models, it remains to be seen whether there are benefits in further fine-tuning the alignment, in the form of some kind of constrained non-linear transformation.

Given the recent breakthroughs in multilingual contextualized language models such as mBERT (Devlin et al. 2019), we also plan on exploring the use of static (i.e., non-contextualized) cross-lingual word embeddings as prior knowledge for those models, as was suggested by Artetxe et al. (2020) (see ending of Section 1). More specifically, instead of freezing the pretrained input embeddings when training the contextualized model, it would be interesting to analyse the effect of updating the parameters of the cross-lingual word vectors jointly with the rest of the language model. An advantage of our cross-lingual vectors, compared to the ones that were considered by Artetxe et al. (2020), is that we can train them on a wider range of languages (i.e., not just bilingual), which would allow for a more comprehensive exploitation of multilingual training corpora.

Acknowledgements

Yerai Doval has been supported by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO) through the ANSWER-ASAP project (TIN2017-85160-C2-2-R); by the Spanish State Secretariat for Research, Development and Innovation (which belongs to MINECO) and the European Social Fund (ESF) through a FPI fellowship (BES-2015-073768) associated to TELEPARES project (FFI2014-51978-C2-1-R) and by the Xunta de Galicia through TELGALICIA research network (ED431D 2017/12). This work was partly supported by ERC Starting Grant 637277.

Footnotes

a Code is available at https://github.com/yeraidm/meemi. This page will be updated with pre-trained models for new languages.

b Other works have also shown how the orthogonal constraint can be relaxed (Joulin et al. 2018) when training with a specific bilingual dictionary induction objective, but this has been shown not to be optimal for other tasks (Glavaš et al. 2019).

c The same English, Spanish and Italian corpora are used as input corpora for the hypernym discovery SemEval task (Section 6.1).

f Note that this task is more challenging than hypernym detection, which is typically framed as a binary classification problem (Upadhyay et al. 2018), as the search space is equal to the size of the vocabulary considered for each language.

g The codebase for these experiments is that of SentEval (Conneau and Kiela 2018).

References

AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M. and Oroumchian, F. (2009). Hamshahri: a standard Persian text collection. Knowledge-Based Systems 22(5), 382387.CrossRefGoogle Scholar
Alvarez-Melis, D. and Jaakkola, T. (2018). Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 18811890.CrossRefGoogle Scholar
Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C. and Smith, N.A.(2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.Google Scholar
Artetxe, M., Labaka, G. and Agirre, E. (2016). Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas. Association for Computational Linguistics, pp. 2289–2294.CrossRefGoogle Scholar
Artetxe, M., Labaka, G. and Agirre, E. (2017). Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada. Association for Computational Linguistics, pp. 451–462.CrossRefGoogle Scholar
Artetxe, M., Labaka, G. and Agirre, E. (2018a). Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana. Association for the Advancement of Artificial Intelligence, pp. 5012–5019.CrossRefGoogle Scholar
Artetxe, M., Labaka, G. and Agirre, E. (2018b). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics, pp. 789–798.CrossRefGoogle Scholar
Artetxe, M., Ruder, S. and Yogatama, D. (2020). On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online. Association for Computational Linguistics, pp. 4623–4637.CrossRefGoogle Scholar
Bakarov, A., Suvorov, R. and Sochenkov, I. (2018). The limitations of cross-language word embeddings evaluation. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics, New Orleans, Louisiana. Association for Computational Linguistics, pp. 94–100.CrossRefGoogle Scholar
Barone, A.V.M. (2016). Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany. Association for Computational Linguistics, pp. 121–126.Google Scholar
Baroni, M., Bernardini, S., Ferraresi, A. and Zanchetta, E. (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3), 209226.CrossRefGoogle Scholar
Bernier-Colborne, G. and Barriere, C. (2018). CRIM at SemEval-2018 Task 9: a hybrid approach to hypernym discovery. In Proceedings of The 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana. Association for Computational Linguistics, pp. 722–728.CrossRefGoogle Scholar
Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5(1), 135–146.
Bordea, G., Lefever, E. and Buitelaar, P. (2016). SemEval-2016 task 13: Taxonomy Extraction Evaluation (TExEval-2). In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California. Association for Computational Linguistics, pp. 1081–1091.
Camacho-Collados, J., Delli Bovi, C., Espinosa-Anke, L., Oramas, S., Pasini, T., Santus, E., Shwartz, V., Navigli, R. and Saggion, H. (2018). SemEval-2018 task 9: hypernym discovery. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans, Louisiana. Association for Computational Linguistics, pp. 712–724.
Camacho-Collados, J., Pilehvar, M.T., Collier, N. and Navigli, R. (2017). SemEval-2017 task 2: multilingual and cross-lingual semantic word similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada. Association for Computational Linguistics, pp. 15–26.
Cardellino, C. (2016). Spanish Billion Words Corpus and Embeddings. http://crscardellino.me/SBWCE/.
Conneau, A. and Kiela, D. (2018). SentEval: an evaluation toolkit for universal sentence representations. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA), pp. 1699–1704.
Conneau, A., Lample, G., Ranzato, M., Denoyer, L. and Jégou, H. (2018a). Word translation without parallel data. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, Canada. OpenReview.net.
Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S.R., Schwenk, H. and Stoyanov, V. (2018b). XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 2475–2485.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics, pp. 4171–4186.
Doval, Y., Camacho-Collados, J., Espinosa-Anke, L. and Schockaert, S. (2018). Improving cross-lingual word embeddings by meeting in the middle. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 294–304.
Dryer, M.S. and Haspelmath, M. (eds) (2013). WALS Online. Leipzig, Germany: Max Planck Institute for Evolutionary Anthropology.
Espinosa-Anke, L., Camacho-Collados, J., Delli Bovi, C. and Saggion, H. (2016). Supervised distributional hypernym discovery via domain adaptation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas. Association for Computational Linguistics, pp. 424–435.
Faruqui, M. and Dyer, C. (2014). Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden. Association for Computational Linguistics, pp. 462–471.
Geffet, M. and Dagan, I. (2005). The distributional inclusion hypotheses and lexical entailment. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, Michigan. Association for Computational Linguistics, pp. 107–114.
Glavaš, G., Litschko, R., Ruder, S. and Vulić, I. (2019). How to (properly) evaluate cross-lingual word embeddings: on strong baselines, comparative analyses, and some misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics, pp. 710–721.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, Montreal, Canada. MIT Press, pp. 2672–2680.
Han, L., Kashyap, A., Finin, T., Mayfield, J. and Weese, J. (2013). UMBC EBIQUITY-CORE: semantic textual similarity systems. In Proceedings of the 2nd Joint Conference on Lexical and Computational Semantics, Volume 1, Atlanta, Georgia. Association for Computational Linguistics, pp. 44–52.
Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, Nantes, France. Association for Computational Linguistics, pp. 539–545.
Held, W. and Habash, N. (2019). The effectiveness of simple hybrid systems for hypernym discovery. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics, pp. 3362–3367.
Hoffart, J., Milchevski, D. and Weikum, G. (2014). STICS: searching with strings, things, and cats. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Queensland, Australia. Association for Computing Machinery, pp. 1247–1248.
Hoshen, Y. and Wolf, L. (2018). Non-adversarial unsupervised word translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 469–478.
Joulin, A., Bojanowski, P., Mikolov, T., Jégou, H. and Grave, É. (2018). Loss in translation: learning bilingual word mapping with a retrieval criterion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 2979–2984.
Kementchedjhieva, Y., Ruder, S., Cotterell, R. and Søgaard, A. (2018). Generalizing procrustes analysis for better bilingual dictionary induction. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium. Association for Computational Linguistics, pp. 211–220.
Lazaridou, A., Dinu, G. and Baroni, M. (2015). Hubness and pollution: delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China. Association for Computational Linguistics, pp. 270–280.
Lu, A., Wang, W., Bansal, M., Gimpel, K. and Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado. Association for Computational Linguistics, pp. 250–256.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, Maryland. Association for Computational Linguistics, pp. 55–60.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Le, Q.V. and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
Patra, B., Moniz, J.R.A., Garg, S., Gormley, M.R. and Neubig, G. (2019). Bilingual lexicon induction with semi-supervision in non-isometric embedding spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics, pp. 184–193.
Pennington, J., Socher, R. and Manning, C.D. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar. Association for Computational Linguistics, pp. 1532–1543.
Prager, J., Chu-Carroll, J., Brown, E.W. and Czuba, K. (2008). Question answering by predictive annotation. In Advances in Open Domain Question Answering. Springer, pp. 307–347.
Roller, S. and Erk, K. (2016). Relations such as hypernymy: identifying and exploiting hearst patterns in distributional vectors for lexical entailment. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas. Association for Computational Linguistics, pp. 2163–2172.
Ruder, S., Vulić, I. and Søgaard, A. (2019). A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research 65, 569–631.
Shwartz, V., Goldberg, Y. and Dagan, I. (2016). Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany. Association for Computational Linguistics, pp. 2389–2398.
Shwartz, V., Santus, E. and Schlechtweg, D. (2017). Hypernyms under Siege: linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain. Association for Computational Linguistics, pp. 65–75.
Søgaard, A., Ruder, S. and Vulić, I. (2018). On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics, pp. 778–788.
Upadhyay, S., Vyas, Y., Carpuat, M. and Roth, D. (2018). Robust cross-lingual hypernymy detection using dependency context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Association for Computational Linguistics, pp. 607–618.
Vulić, I., Baker, S., Ponti, E.M., Petti, U., Leviant, I., Wing, K., Majewska, O., Bar, E., Malone, M., Poibeau, T., Reichart, R. and Korhonen, A. (2020). Multi-SimLex: a large-scale evaluation of multilingual and cross-lingual lexical semantic similarity. arXiv preprint arXiv:2003.04866.
Vulić, I., Glavaš, G., Reichart, R. and Korhonen, A. (2019). Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics, pp. 4398–4409.
Vulić, I. and Moens, M.-F. (2016). Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research 55, 953–994.
Williams, A., Nangia, N. and Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Association for Computational Linguistics, pp. 1112–1122.
Xing, C., Wang, D., Liu, C. and Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado. Association for Computational Linguistics, pp. 1006–1011.
Xu, R., Yang, Y., Otani, N. and Wu, Y. (2018). Unsupervised cross-lingual transfer of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics, pp. 2465–2474.
Yahya, M., Berberich, K., Elbassuoni, S. and Weikum, G. (2013). Robust question answering over the web of linked data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, New York, USA. Association for Computing Machinery, pp. 1107–1116.
Yu, Z., Wang, H., Lin, X. and Wang, M. (2015). Learning term embeddings for hypernymy identification. In Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina. AAAI Press, pp. 1390–1397.
Zennaki, O., Semmar, N. and Besacier, L. (2019). A neural approach for inducing multilingual resources and natural language processing tools for low-resource languages. Natural Language Engineering 25(1), 43–67.
Zhang, M., Liu, Y., Luan, H. and Sun, M. (2017a). Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics, pp. 1959–1970.
Zhang, M., Liu, Y., Luan, H. and Sun, M. (2017b). Earth mover’s distance minimisation for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational Linguistics, pp. 1934–1945.

Figure 1. Step-by-step integration of two monolingual embedding spaces: (1) obtaining isolated monolingual spaces, (2) aligning these spaces through an orthogonal linear transformation and (3) mapping both spaces using an unconstrained linear transformation learned on the averages of translation pairs.
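
The three steps in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `orthogonal_align` and `average_mapping`, the row-vector convention, the use of plain least squares, and the synthetic data standing in for a bilingual dictionary are all assumptions made for illustration.

```python
import numpy as np

def orthogonal_align(src, tgt):
    """Step 2 (sketch): orthogonal Q minimising ||src @ Q - tgt||_F (Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def average_mapping(src, tgt):
    """Step 3 (sketch): unconstrained least-squares maps onto the pair averages."""
    avg = (src + tgt) / 2.0
    w_src, *_ = np.linalg.lstsq(src, avg, rcond=None)
    w_tgt, *_ = np.linalg.lstsq(tgt, avg, rcond=None)
    return w_src, w_tgt

# Step 1 (stand-in): synthetic "monolingual" vectors for 200 translation pairs.
# Row i of src and row i of tgt are assumed to be translations of each other.
rng = np.random.default_rng(0)
n_pairs, dim = 200, 10
tgt = rng.normal(size=(n_pairs, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src = tgt @ rotation + 0.1 * rng.normal(size=(n_pairs, dim))

# Step 2: orthogonal alignment of the source space onto the target space.
q = orthogonal_align(src, tgt)
src_aligned = src @ q

# Step 3: map both spaces towards the averages of the translation pairs.
w_src, w_tgt = average_mapping(src_aligned, tgt)
src_final = src_aligned @ w_src
tgt_final = tgt @ w_tgt

# Distance between translation pairs before and after the extra transformation.
gap_before = np.linalg.norm(src_aligned - tgt)
gap_after = np.linalg.norm(src_final - tgt_final)
```

In this toy setting `gap_after` is smaller than `gap_before`, while `w_src` and `w_tgt` are generally not orthogonal, which is why the second transformation can also change the structure of each monolingual space.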

Table 1. Precision at k ($P@K$) performance of different cross-lingual embedding models in the bilingual dictionary induction task
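
As a reading aid for the metric named in the Table 1 caption, the following sketch computes precision at k under the assumption of plain cosine nearest-neighbour retrieval; the retrieval procedure actually used for the reported numbers is not restated here. The function name `precision_at_k`, the `gold` structure and the toy vectors are hypothetical.

```python
import numpy as np

def precision_at_k(src_vecs, tgt_vecs, gold, k):
    """Fraction of source words with a gold translation among the k nearest targets.

    gold[i] is the set of acceptable target row indices for source row i.
    """
    # Normalise rows so that the dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T
    # Indices of the k most similar target words for every source word.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = [bool(gold[i] & set(top_k[i].tolist())) for i in range(len(gold))]
    return sum(hits) / len(hits)

# Toy cross-lingual space: three target words and three source words.
tgt_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
src_vecs = np.array([[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]])
gold = [{0}, {1}, {2}]

p1 = precision_at_k(src_vecs, tgt_vecs, gold, k=1)
p2 = precision_at_k(src_vecs, tgt_vecs, gold, k=2)
```

Here the third source word ranks its gold translation second, so `p1` is 2/3 and `p2` is 1.0.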

Table 2. Dictionary induction results for distant language pairs in terms of precision at k ($P@K$), using FastText pre-trained monolingual embeddings as input

Table 3. Cross-lingual word similarity results in terms of Pearson (r) and Spearman ($\rho$) correlation. Language codes: English-EN, Spanish-ES, Italian-IT, German-DE and Farsi-FA

Table 4. Cross-lingual word similarity results for distant language pairs in terms of Pearson (r) and Spearman ($\rho$) correlation, using FastText pre-trained monolingual embeddings as input. Language codes: English-EN, Arabic-AR, Hebrew-HE, Estonian-ET, Polish-PL and Chinese-ZH

Table 5. Monolingual word similarity results in terms of Pearson (r) and Spearman ($\rho$) correlation

Table 6. Cross-lingual hypernym discovery results in terms of Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) and precision at 5 ($P@5$). In this case, VecMap = VecMap$_{\text{ortho}}$

Table 7. Accuracy, or the number of correct classifications (entailment, contradiction or neutral) over the total number of test instances, on the XNLI task using different cross-lingual embeddings as features

Table 8. Word translation examples from English and Spanish, comparing VecMap with the bilingual and multilingual variants of Meemi. For each source word, we show its five nearest cross-lingual synonyms. Bold translations are correct, according to the source test dictionary (cf. Section 5.1.1)

Figure 2. Absolute improvement (in Pearson correlation percentage points) obtained by applying Meemi over the two base orthogonal models, VecMap and MUSE, on the cross-lingual word similarity task with different training dictionary sizes. The data points on the x-axis correspond to dictionaries of 100, 1000, 3000, 5000 and 8000 word pairs.

Table 9. Dictionary induction results obtained with the multilingual extension of Meemi over VecMap$_\textrm{ortho}$ in terms of precision at k ($P@K$). The sequence in which source languages are added to the multilingual models is: Spanish, Italian, German, Finnish, Farsi and Russian (English is the target). The x indicates the use of the test language in each case (if the test language is already included, the following language in the sequence is added). We also include the scores of the original VecMap$_\textrm{ortho}$ as a baseline