1. Introduction
Bilingual lexicons provide meaningful information on the semantic equivalence of words across languages and are useful for various cross-lingual tasks, such as cross-lingual information retrieval (CLIR) (Levow et al. 2005; Li and Gaussier 2012). Even though parallel corpora and the word alignment methods that can be deployed on them have proven to be useful for building bilingual lexicons (Och and Ney 2004), they are only available for a limited number of domains among resource-rich languages. Consequently, the scarcity of multilingual parallel corpora, particularly for specialized areas, has led researchers to focus their efforts on finding word translation pairs in comparable corpora (Fung and Cheung 2004; Haghighi et al. 2008; Prochasson et al. 2009; Tamura et al. 2012). Hence, the exploitation of comparable corpora has marked a turning point in the task of bilingual lexicon extraction and has raised constant interest, thanks to the abundance, continuous growth, and availability of such corpora (Morin and Hazem 2016; Zhang et al. 2017; Søgaard et al. 2018).
Bilingual lexicon extraction from a comparable corpus, sometimes referred to as bilingual lexicon induction, is the task that aims at automatically extracting translation pairs from two monolingual corpora. Most state-of-the-art approaches using comparable corpora to extract bilingual lexicons assume that “a word and its translations tend to appear in similar contexts across languages” (Fung 1998). Contexts consist of co-occurring words and are either represented as explicit vectors (Fung 1998; Laroche and Langlais 2010) or based on word embeddings (Mikolov et al. 2013; Pennington et al. 2014; Vulic and Moens 2016; Artetxe et al. 2016; Fast 2017; Hazem and Morin 2017; Xu et al. 2018). Once contexts have been identified, they are mapped across languages using a bilingual lexicon (Fung 1998; Rapp 1999). As the bilingual lexicon usually used is either a large, general dictionary or a small, domain-specific lexicon, there is a high risk of missing potential associations across languages when trying to extract bilingual lexicons in specific domains (Gaussier et al. 2004; Déjean et al. 2005; Tamura et al. 2012; Irvine and Callison-Burch 2013; Linard et al. 2015; Vulic and Moens 2016; Morin and Hazem 2016). We refer to this problem as the sparsity problem.
Facing this situation, we follow in this paper the core idea of distributional methods and propose to combine context vectors with additional, automatically extracted knowledge. We conjecture here that a gain can be expected by relying on enriched representations of words derived from formal concept analysis (FCA) (Ganter and Wille 1999). Indeed, one can formulate another distributional hypothesis with formal concepts: “If a word belonging to a formal concept C appears in a context vector V, then it is likely that all the words that belong to the same formal concept C appear as well in V.”
Our contributions in this study are threefold:
(1) We first show how FCA can be used to build monolingual and bilingual closed concepts from comparable collections. From these closed concepts, we then derive monolingual and bilingual clusters with high comparability scores.
(2) We then propose to combine standard context vectors with concept vectors based on closed concepts for extracting bilingual lexicons from comparable corpora. Our experiments show that the proposed combination improves the performance of bilingual lexicon extraction compared to the standard approach and two recent state-of-the-art unsupervised models (Zhang et al. 2017; Xu et al. 2018).
(3) Finally, we exploit the extracted bilingual lexicons in a CLIR system and show that this leads to improved quality in terms of precision and mean average precision (MAP).
It is important to note here that our goal is to show that FCA can be used to solve, at least partly, the sparsity problem. We do not claim that this is the only possible approach to do so. We nevertheless claim that FCA is an easy-to-deploy and simple approach to solve the sparsity problem.
In terms of language resources, our approach only requires unlabeled bilingual corpora and a general bilingual dictionary. In particular, it can be used, by relying on standard context vectors, for languages for which word embeddings, and in particular contextualized word embeddings, are not available and may be difficult to acquire for lack of sufficiently large unlabeled corpora and/or computational power. As such, our approach can be deployed on almost all language classes defined in Joshi et al. (2020), with the exception of The Left-Behinds, which corresponds to languages with exceptionally limited resources, and to some extent The Scraping-Bys, for which only some amount of unlabeled data is available.
The remainder of the paper is organized as follows: Section 2 presents the related work on bilingual lexicon extraction from comparable corpora. We then present FCA foundations for mining closed concepts in Section 3. Section 4 describes how to use closed concepts to improve, through clustering, the quality of a given comparable corpus, following Li and Gaussier (2010) and Li et al. (2011). We then propose to combine standard context vectors with concept vectors based on closed concepts for bilingual lexicon extraction (Section 5). The efficiency of our approach is validated by our experimental study, which shows that FCA leads to corpora of improved quality (in terms of comparability scores) as well as to better bilingual lexicons (Section 6). We finally illustrate the benefits of the bilingual lexicons extracted in a CLIR setting (Section 6). The conclusion section wraps up the article and outlines future work.
2. Related work
Distributional approaches, which aim at building representations of words based on the contexts they occur in, are at the core of methods to extract information from corpora. In the context of comparable corpora, a basic assumption is that words which are translations of each other are likely to appear in similar contexts across languages. Under this hypothesis, Fung (1995) and Rapp (1995) pioneered bilingual lexicon extraction and proposed three main steps for this task: context modeling, context-similarity calculation, and translation-pair finding. The context of a given word, which we will refer to as the head word, usually consists of neighboring words within a predefined window (Rapp 1999; Andrade et al. 2011), a sentence (Laroche and Langlais 2010), a paragraph (Fung and McKeown 1997), a document (Shao and Ng 2004), or dependency relations (Otero 2008; Garera et al. 2009). Words in a context are usually weighted, based on, for example, $tf*idf$ , pointwise mutual information, or log-likelihood ratio tests, in order to reflect the strength of their relation with the head word (Fung 1995; Rapp 1999; Fung and Lo 1998; Chiao and Zweigenbaum 2003; Andrade et al. 2011). Once context vectors have been built, they can be translated using a seed bilingual lexicon, usually a bilingual dictionary from the general domain. One can then compare context vectors from different languages using different similarity measures, such as the Euclidean distance (Fung 1995), the cosine similarity (Fung and Lo 1998), the city-block metric (Rapp 1999), the number of overlapping context words (Andrade et al. 2011), the Jensen–Shannon divergence (Pekar et al. 2006), and the weighted Jaccard index (Hazem and Morin 2012). A latent space can also be constructed from the seed lexicon to capture polysemy and synonymy prior to computing a similarity between context vectors (Gaussier et al. 2004). Lastly, the target candidate translations of a given source head word correspond to the head words of the target context vectors closest to the translation of the source context vector.
Additional clues, such as transliteration information (Shao and Ng 2004) or co-occurrence information from aligned documents (Prochasson and Fung 2011), can also be integrated in the above process. The study in Irvine and Callison-Burch (2017) reviews several such additional clues and introduces a supervised method to extract bilingual lexicons from comparable corpora. As supervision is costly and additional clues are not always available, we solely make use in this study of unsupervised methods with no additional clues. However, like the vast majority of methods, we also use a seed lexicon, as previous studies attempting to dispense with it have not been successful (Jagarlamudi et al. 2011). Lastly, another line of distributional approaches has focused on inferring multilingual topic models from parallel and comparable corpora (Vulic et al. 2015), with the possibility to address such tasks as cross-lingual event-centered news clustering (which is only a special case of cross-lingual document clustering), cross-lingual document classification, cross-lingual semantic similarity, and CLIR. The bilingual lexicons obtained with this type of approach are usually not very useful if the multilingual topic model is solely trained on comparable corpora.
More recently, Mikolov et al. (2013) introduced a new method for building word representations that aims at learning word vectors so as to maximize the probability of a word given its context. Canonical correlation analysis was then used in Faruqui and Dyer (2014) to project the embeddings of both languages into a shared space on the basis of an existing bilingual dictionary. In the same vein, an approach to learn bilingual word-embedding mappings was presented in Artetxe et al. (2016), again from a bilingual dictionary. This approach preserves monolingual invariance through the use of several constraints in connection with the method proposed in Faruqui and Dyer (2014). In addition, multilingual word embeddings were trained on sentence-aligned parallel data in Chandar et al. (2014) and on document-aligned non-parallel data in Vulic and Moens (2016) to produce bilingual word embeddings. In Hazem and Morin (2018), the authors put forward a combination of different embedding models learned from specialized and general-domain data sets, resulting in higher performance. In a more specific domain such as the biomedical domain, Heyman et al. (2018) considered bilingual lexicon extraction as a classification problem and trained a neural network combining recurrent long short-term memory and deep feed-forward networks in order to obtain word-level and character-level representations.
Lastly, Langlais and Jakubina (2017) carefully compared different approaches (using or not word embeddings) for bilingual lexicon extraction from comparable corpora and showed that word embeddings were to be preferred for frequent terms, but not for less frequent ones.
In contrast with all these previous approaches, we propose here to combine context-vector representations, based on word embeddings, with semantically related words obtained with closed concept mining methods pertaining to FCA. We first identify words in the context vector of a given head word and use word embeddings to weight context words (each coordinate corresponding to the cosine between the embedding of the head word and that of the context word). The use of semantically related words allows relying on richer representations and finally leads to improved lexicons and CLIR systems. It is worth noting that this paper is an extension of Chebel et al. (2017), as it involves a complete formalization and additional experiments regarding corpus comparability and the impact of the combination weight. In addition, we make use here of the bilingual lexicons we extract from comparable corpora in the context of CLIR systems.
3. Mathematical foundations: Key FCA settings
We present in this section the notions used for mining closed concepts. We rely here on the FCA framework for text mining presented in Ganter and Wille (1999) and adapted to our problem in Chebel et al. (2017). We first formalize an extraction context made up of documents and index terms, called the textual context.
Definition 1 A textual context is a triplet $\mathfrak{M} \;:\!=\; (\mathscr{C},\mathscr{T},\mathscr{I})$ where:
• $\mathscr{C} \;:\!=\; \{d_1, d_2, \ldots, d_n\}$ is the collection of documents (finite set of n documents);
• $\mathscr{T} \;:\!=\; \{t_1, t_2,\ldots, t_m\}$ is a finite set of m distinct words in the corpus. $\mathscr{T}$ comprises the words of the different documents in $\mathscr{C}$ ;
• $\mathscr{I}$ $\subseteq \mathscr{C}\times \mathscr{T}$ is a binary (incidence) relation. Each relation (d, t) $\in$ $\mathscr{I}$ indicates that document d $\in$ $\mathscr{C}$ contains term t $\in$ $\mathscr{T}$ .
Example 1 Consider the textual context given in Figure 1 (left). Here, $\mathscr{C}\;:\!=\; \{d_{1}, d_{2}, d_{3}, d_{4}, d_{5}\}$ and $\mathscr{T}\;:\!=\; \{{A}, {B}, {C}, {D}, {E} \}$ . ( $d_{2}$ , B) $\in$ $\mathscr{I}$ , meaning that document $d_2$ contains term B.
We first recall the basic definitions of the Galois lattice-based paradigm in FCA (Ganter and Wille 1999) and its application to closed concept mining.
Definition 2 Concept $C = (T,D)$ is defined by two sets, a set of terms T and a set of documents D, respectively called the “intension” and “extension” of the concept, such that all terms in T co-occur in all documents of D. The support of C in $\mathfrak{M}$ is equal to the number of documents in $\mathscr{C}$ containing all the terms of T. The absolute support is formally defined as follows (Han et al. 2000):
\begin{equation*}\textit{Supp}(C) \;:\!=\; \big|\{d \in \mathscr{C} \mid \forall\, t \in T,\ (d,t) \in \mathscr{I}\}\big|\end{equation*}
The relative support (aka frequency) of C $\in$ $\mathfrak{M}$ is equal to $\frac{\displaystyle \textit{Supp}(C)}{\displaystyle |\mathscr{C}|}$ , where $|\mathscr{C}|$ denotes the number of documents in the collection $\mathscr{C}$ (we denote by $|X|$ the cardinality of set X).
A concept is said to be frequent if its terms co-occur in corpus $\mathscr{C}$ a number of times greater than or equal to a user-defined support threshold, denoted minsupp. Otherwise, it is said to be infrequent (aka rare).
Example 2 Consider the textual context given in Figure 1. Since terms B and C simultaneously appear in documents $d_2$ , $d_3$ , and $d_5$ , the pair $\langle \{B,C\}, \{d_2, d_3, d_5\}\rangle$ is a formal concept. Its intension is given by the set of terms $\{B,C\}$ and its extension by the set of documents $\{d_2, d_3, d_5\}$ . Its absolute support is equal to $|\{d_2, d_3, d_5\}|$ = 3. This concept is frequent since its support is greater than $minsupp=2$ .
Definition 3 (Galois closure operator) Let $C=(T,D)$ be a concept. Two functions are defined in order to map sets of documents to sets of terms and vice versa:
\begin{equation*}\Psi \;:\; \mathscr{P}(\mathscr{T}) \rightarrow \mathscr{P}(\mathscr{C}), \quad \Psi(T) \;:\!=\; \{d \in \mathscr{C} \mid \forall\, t \in T,\ (d,t) \in \mathscr{I}\}\end{equation*}
\begin{equation*}\Phi \;:\; \mathscr{P}(\mathscr{C}) \rightarrow \mathscr{P}(\mathscr{T}), \quad \Phi(D) \;:\!=\; \{t \in \mathscr{T} \mid \forall\, d \in D,\ (d,t) \in \mathscr{I}\}\end{equation*}
where $\mathscr{P}( X)$ denotes the power set of X. Both functions $\Psi$ and $\Phi$ constitute Galois operators. $\Psi(T)$ is the set of documents containing all words of T; its cardinality is equal to Supp(T). $\Phi(D)$ is the set of words appearing in all the documents of D. Consequently, the compound operator $\Omega\;:\!=\; \Phi \circ \Psi$ is a Galois closure operator which associates to a set of words T the set of words which appear in all the documents where the words of T co-occur.
A closed concept is then defined as follows:
Definition 4 Concept $C = (T,D)$ is said to be closed if $\Omega (T)$ = T. A closed concept is thus given by a maximal set of words common to a given set of documents. A closed concept is said to be frequent w.r.t. the minsupp threshold if Supp(C) = $|\Psi(T)| \geq \textit{minsupp}$ . Hereafter, we denote a closed concept by CC.
Example 3 With respect to the previous example, $\{B,C,E\}$ is a closed termset since there is no other term appearing in all documents containing $\{B,C,E\}$ : $\{B,C,E\}$ is the maximal set of terms common to documents $\{d_2, d_3, d_5\}$ . We then have: $\Omega(\{B,C,E\}) = \{B,C,E\}$ . If minsupp is set to 2, $\{B,C,E\}$ is also frequent since $|\Psi (\{B,C,E\})| = |\{d_2, d_3, d_5\}| = 3 \geq 2$ .
It is worth noting that in our work, each closed concept CC represents a class of documents grouped by a set of representative terms. Consequently, a closed concept represents a maximal group of terms appearing in the same documents.
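To make these definitions concrete, the sketch below implements $\Psi$, $\Phi$, and the closure $\Omega = \Phi \circ \Psi$ on a small hypothetical incidence relation. The full document-term matrix is our own assumption, chosen only to be consistent with the fragments given in Examples 1–3.

```python
# Hypothetical textual context consistent with Examples 1-3:
# documents d1..d5 over terms A..E (the full matrix is assumed, not given).
CONTEXT = {
    "d1": {"A", "D"},
    "d2": {"B", "C", "E"},
    "d3": {"B", "C", "E"},
    "d4": {"A", "C", "D"},
    "d5": {"B", "C", "E"},
}

def psi(terms):
    """Psi: documents containing every term of `terms`."""
    return {d for d, ts in CONTEXT.items() if terms <= ts}

def phi(docs):
    """Phi: terms appearing in every document of `docs`."""
    return set.intersection(*(CONTEXT[d] for d in docs)) if docs else set()

def omega(terms):
    """Galois closure Omega = Phi o Psi."""
    return phi(psi(terms))

print(omega({"B", "C"}))       # {B, C} is not closed: its closure adds E
print(omega({"B", "C", "E"}))  # {B, C, E} is closed: fixed point of Omega
```

As in Example 3, $\{B,C,E\}$ is a fixed point of the closure, while $\{B,C\}$ is not.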
4. FCA-based clustering to extract coherent concepts within and across languages
This section describes the application of the above concepts to comparable corpora. In the remainder, $\mathscr{C}^{comp}= \mathscr{C}^s \cup \mathscr{C}^t$ denotes a comparable corpus, usually unbalanced in the amount of source ( $\mathscr{C}^s$ ) and target ( $\mathscr{C}^t$ ) texts. More generally, notation $\mathscr{C}$ refers to any monolingual corpus of a source or target language.
Our goal here is to improve corpus comparability for better bilingual lexicon extraction. As described in Section 6, we work on two comparable corpora, one based on the French–English language pair and the other on the Italian–English pair.
Our approach relies on three main steps, namely:
(1) Mining closed concepts: This step consists in extracting the closed concepts from comparable corpora.
(2) Translation and disambiguation: This step deals with the translations and the disambiguation of the terms in the closed concept extensions.
(3) Alignment: The extracted monolingual closed concepts are aligned based on their disambiguated translations and using an unsupervised classification algorithm. This allows grouping monolingual closed concepts of different languages into multilingual closed concepts.
The aforementioned steps are detailed in what follows.
4.1 Mining closed concepts from comparable corpora
In the remainder, for simplicity, we use the term closed concept instead of frequent closed concept. The reader should bear in mind that all the closed concepts considered occur a certain number of times. Extracting closed concepts from comparable corpora requires preprocessing steps in order to extract the most representative terms. We rely here on a part-of-speech tagger, namely TreeTagger, so as to focus on terms that are either common nouns, proper nouns, or adjectives. The rationale for this focus is that nouns and adjectives are the most informative grammatical categories and are the most likely to represent the content of documents (Barker and Cornacchia 2000). A stoplist is used to discard very common functional French, Italian, and English terms. This task is carried out on the French–English and Italian–English comparable corpora (cf. Section 6.1). The document-term context $\mathfrak{M}$ is then built by retaining only terms corresponding to the selected grammatical categories.
In order to extract closed concepts from comparable corpora, we adapt the Charm-L algorithm (Zaki and Hsiao 2005) to consider any given textual context $\mathfrak{M}$ . The algorithm extracts all the closed concepts as described in Zaki and Hsiao (2005), with respect to the minimal and maximal support thresholds minsupp and maxsupp. These thresholds are set experimentally: considering the Zipf distribution of each collection, the maximal support threshold is set so as to filter out trivial terms that occur in most documents and are thus not informative, while the minimal threshold eliminates marginal terms that occur in only a few documents.
The Charm-L algorithm iteratively generates frequent closed concepts in the form of pairs of sets of terms and documents:
\begin{equation*}CC \;:\!=\; \langle \{t_1, t_2,\ldots, t_n\}, \{d_1, d_2,\ldots,d_m\}\rangle\end{equation*}
Each pair represents a set of documents $\{d_1, d_2,\ldots,d_m\}$ (extension) sharing a set of terms $\{t_1, t_2,\ldots, t_n\}$ (intension), with a support greater than or equal to minsupp. At the end, we obtain all French closed concepts, $CC^{(f)}$ , all English closed concepts, $CC^{(e)}$ , and all Italian closed concepts, $CC^{(i)}$ .
Within a given language, one can define (dis-)similarity measures between closed concepts. We make use here of the Euclidean distance based on the $tf*idf$ (Salton and Buckley 1988) representation of each term, where tf denotes term frequency and idf inverse document frequency. For any term t in the vocabulary V of a given collection and any closed concept CC with n terms and m documents extracted from this collection, tf(t,CC) is equal to the normalized number of occurrences of t in the intension of CC: $tf(t,CC) = \frac{1}{n}$ if t appears in the intension of CC and 0 otherwise. The inverse document frequency is based on the ratio of the number of documents in the extension of CC to the number of documents which contain t, denoted $m_t$ : $idf(t,CC) = \log(1 + \frac{m}{1+m_t})$ . The distance between two closed concepts $CC_i$ and $CC_j$ is then defined as:
(5) \begin{equation}dist(CC_i,CC_j) = \sqrt{\sum_{t \in V} \big(tf(t,CC_i)\,idf(t,CC_i) - tf(t,CC_j)\,idf(t,CC_j)\big)^2}\end{equation}
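The $tf*idf$ weighting and the Euclidean distance between closed concepts just described can be sketched as follows. The toy concepts and the per-term document frequencies are made-up examples, not values from our corpora.

```python
import math

def tf(t, intension):
    """tf(t,CC) = 1/n if t is in the intension of CC (n terms), else 0."""
    return 1.0 / len(intension) if t in intension else 0.0

def idf(m, m_t):
    """idf(t,CC) = log(1 + m / (1 + m_t)), with m documents in the
    extension of CC and m_t documents containing t."""
    return math.log(1 + m / (1 + m_t))

def weight(t, cc, m_t):
    intension, extension = cc
    return tf(t, intension) * idf(len(extension), m_t)

def distance(cc_i, cc_j, vocab, doc_freq):
    """Euclidean distance between two closed concepts over vocabulary V."""
    return math.sqrt(sum((weight(t, cc_i, doc_freq[t]) -
                          weight(t, cc_j, doc_freq[t])) ** 2
                         for t in vocab))

vocab = ["A", "B", "C"]
doc_freq = {"A": 2, "B": 3, "C": 4}        # hypothetical corpus counts m_t
cc1 = ({"A", "B"}, {"d1", "d2"})           # (intension, extension)
cc2 = ({"B", "C"}, {"d2", "d3", "d4"})
print(round(distance(cc1, cc2, vocab, doc_freq), 3))
```

The distance is zero for identical concepts and grows as their weighted term profiles diverge.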
We now turn to multilingual closed concepts.
4.2 Closed concepts translation and selection
In order to relate closed concepts across languages, we first expand each term in the intension of a closed concept with its translations in the target language using an existing bilingual dictionary. To do so, starting with a closed concept in the source language, we first consider all possible translations of the terms in the intension and produce all combinations of such translations. Thus, if there are p terms in the intension of a concept, and m possible translations for each of them on average, one ends up, on average, with $m^p$ possible translations of the original closed concept. In practice, however, both p and m are small and the number of possible translations for any closed concept remains tractable (at most a few tens).
Then, for each possible translation of the intension of a closed concept, we associate an extension consisting of the set of documents containing the translated terms. By doing so, the representation of a translation parallels that of closed concepts. We finally select, among the set of target closed concepts, the one that is closest, according to the Euclidean distance described above, to a possible translation of the source closed concept. We will refer to the selected target closed concept as the most likely translation (indeed, a correct candidate translation of a closed concept is more likely to be present as a closed concept in the target language). In case of ties, the target closed concept is chosen uniformly at random.
At the end of this process, each closed concept in the source language is either left unchanged, if no translation was identified, or is associated with a closed concept in the target language.
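The expansion of an intension into its $m^p$ candidate translations can be sketched as follows. The tiny French–English dictionary and the helper name `candidate_translations` are our own illustrative assumptions.

```python
from itertools import product

# Made-up French -> English dictionary fragment (illustration only).
dictionary = {
    "maladie": ["disease", "illness"],
    "sein": ["breast", "bosom"],
}

def candidate_translations(intension):
    """All term-by-term translation combinations of an intension.
    Terms absent from the dictionary are kept as-is."""
    options = [dictionary.get(t, [t]) for t in intension]
    return [set(combo) for combo in product(*options)]

candidates = candidate_translations(["maladie", "sein"])
print(len(candidates))  # p = 2 terms, m = 2 translations each -> 4
```

With p terms and m translations per term on average, the list grows as $m^p$, which stays tractable for the small p and m observed in practice.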
4.3 Closed concept alignment
From the closed concepts obtained above, one can build clusters so as to find similar concepts within and across languages. We rely here on the standard K-means algorithm using the Euclidean distance defined in Equation (5). The similarity between closed concepts in the source language is based on their original intension and extension, while the similarity between source and target closed concepts is based on the most likely translation of the source closed concept. The number of clusters K is chosen so as to have enough clusters to cover the whole collection without separating the documents into too many clusters. We find that 300 is a reasonable choice for the larger collection we are considering (CLEF 2003), whereas 40 is reasonable for Breast Cancer. Furthermore, we rely on a discretization step with equal-width binning prior to running the K-means algorithm, as in Fayyad and Irani (1993).
Finally, we obtain clusters of closed concepts that can either comprise bilingual documents (when closed concepts from different languages are grouped) or monolingual documents (when only closed concepts from the same language are grouped).
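A minimal K-means sketch over tiny, made-up closed-concept vectors (real runs use K = 300 or K = 40 as stated above, and an off-the-shelf implementation would do equally well):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means with Euclidean distance: assign, then re-center."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two obvious groups of 2-d "concept vectors" (made-up values).
X = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [0.9, 1.0]])
labels, _ = kmeans(X, k=2)
print(labels)
```

On this toy input the first two and last two vectors end up in separate clusters.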
In the next section, we make use of context vectors extracted from the comparable corpora corresponding to the bilingual clusters obtained above.
5. FCA-based approach for bilingual lexicon extraction
The description given in this section follows the one in Chebel et al. (2017). We nevertheless detail it here for completeness as well as to facilitate reading (as readers can find all the information within a single document).
As aforementioned, we propose here to enrich context vectors with concept vectors for bilingual lexicon extraction from comparable corpora. Our proposed method is based on the following steps: (1) computing context vectors, (2) building concept vectors, and (3) combining them.
5.1 Computing context vectors
We rely here on word-embedding-based context vectors (context vectors for short in the remainder), that is, standard context vectors in which the coordinates correspond to similarities between word embeddings. To do so, the word2vec toolkit is used to compute word embeddings, learning 300-dimensional representations of words with the Skip-Gram model introduced by Mikolov et al. (2013). We rely in this study on standard parameter settings, without trying to optimize parameter values: the size of the contextual window is set to 5 (Laroche and Langlais 2010), the sub-sampling rate to $10^{-5}$ , and we use negative sampling to estimate the probability of a target word.
We then compute context vectors in a standard way, using again a window of five words. The final context vector for a given word t is then given by:
\begin{equation*}\overrightarrow{\mbox{V}}_t \;:\!=\; (w_1, w_2, \ldots, w_{|\mathscr{C}|})^{T}\end{equation*}
where T denotes the transpose, $|\mathscr{C}|$ the number of words in $\mathscr{C}$ , and $w_i$ the weight of the association of the $i^{th}$ word with t, measured here by the cosine between the embeddings of the words obtained before. If a word does not appear in the context of a given word, its weight is set to 0. Examples of French, English, and Italian context vectors computed from the comparable SDA95_french, GlasgowHerald95, and SDA95_italian (CLEF’2003) corpora are given in Table 1.
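The embedding-weighted context vectors can be sketched as follows, with small random vectors standing in for the 300-dimensional word2vec embeddings; the vocabulary and token sequence are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cancer", "breast", "treatment", "patient", "study"]
emb = {w: rng.standard_normal(50) for w in vocab}  # stand-in embeddings

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def context_vector(head, tokens, window=5):
    """Weight each word co-occurring with `head` within a +-window span
    by the cosine between its embedding and the head's embedding."""
    vec = {w: 0.0 for w in vocab}
    for i, tok in enumerate(tokens):
        if tok != head:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            w = tokens[j]
            if j != i and w != head and w in vocab:
                vec[w] = cosine(emb[head], emb[w])
    return vec

tokens = ["breast", "cancer", "treatment", "study", "cancer", "patient"]
v = context_vector("cancer", tokens)
print({w: round(x, 2) for w, x in v.items() if x != 0.0})
```

Words outside the window (or outside the vocabulary) keep a weight of 0, matching the definition above.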
5.2 Building concept vectors
Concept vectors are based on the closed concepts obtained as described in Section 4.1. Examples of such closed concepts, obtained from the SDA95_french, GlasgowHerald95, and SDA95_italian (CLEF’2003) comparable corpora, are given in Table 2.
Let $\mathscr{N}_{\mathscr{C}}$ denote the number of closed concepts in a collection $\mathscr{C}$ . For any term t in the vocabulary of $\mathscr{C}$ , a concept vector for t is a vector over the closed concepts of $\mathscr{C}$ , that is:
\begin{equation*}\overrightarrow{\mbox{VC}}_t \;:\!=\; \big(\mu(t,CC_1), \mu(t,CC_2), \ldots, \mu(t,CC_{\mathscr{N}_{\mathscr{C}}})\big)^{T}\end{equation*}
where $\mu(t,CC_i)$ represents the importance of t in $CC_i$ . We rely here on the weight proposed in Chebel et al. (2015), based on the $tf \times idf$ weighting schema (Salton and Buckley 1988) and defined by:
\begin{equation*}\mu(t,CC) \;:\!=\; tf(t,CC) \times idf(t,CC)\end{equation*}
with tf(t,CC) and idf(t,CC) defined as above. As $\mu$ is null when the term does not co-occur with all the terms in the intension of a closed concept, only the closed concepts containing t are effectively taken into account in the concept vector of t.
5.3 Combination
The extracted concept vectors are then combined with context vectors according to two combination strategies:
(1) Direct combination: For each head word, a single vector of dimension $|\mathscr{C}|+\#(CC)$ is built from its concept and context vectors, denoted in the following as a combined vector. It contains both the local co-occurrence information provided by the context vector and the global information provided by the concept vector. Once the combined vectors in the source language are translated, they are compared to combined vectors in the target language via the standard cosine similarity measure.
(2) Weighted combination: Context and concept vectors are treated as distinct vectors, which are translated separately and which are then compared via a weighted linear combination. Thus, the similarity between two words t (from the source corpus) and t ′ (from the target corpus) is assessed as follows:
(8) \begin{equation}SIM(t,t') = \lambda cos(\overrightarrow{\mbox{V}}_t^{\mbox{trans}},\overrightarrow{\mbox{V}}_{t'}) + (1-\lambda) cos(\overrightarrow{\mbox{VC}}_t^{\mbox{trans}},\overrightarrow{\mbox{VC}}_{t'}),\end{equation}where $\lambda \in [0,1]$ is a parameter weighing the relative importance of context and concept vectors, which can be, for example, learned by k-fold cross-validation (cf. section 6.3). “trans” denotes here that the vector has been translated, meaning that the weight of a source word is transferred to its translation(s), as provided by a bilingual dictionary for context vectors and by the most likely translation (see Section 4.2) for concept vectors.
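Equation (8) reduces to a simple interpolation of two cosine similarities; a sketch with made-up translated and target vectors:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim(ctx_src_trans, ctx_tgt, cpt_src_trans, cpt_tgt, lam=0.5):
    """Weighted combination of context-vector and concept-vector
    similarities, with lambda weighing their relative importance."""
    return (lam * cosine(ctx_src_trans, ctx_tgt)
            + (1 - lam) * cosine(cpt_src_trans, cpt_tgt))

# Made-up translated source vectors and target vectors.
ctx_s = np.array([0.9, 0.1, 0.0]); ctx_t = np.array([0.8, 0.2, 0.1])
cpt_s = np.array([0.0, 1.0, 0.5]); cpt_t = np.array([0.1, 0.9, 0.4])
print(round(sim(ctx_s, ctx_t, cpt_s, cpt_t, lam=0.7), 3))
```

With $\lambda = 1$ the score falls back to the context vectors alone, with $\lambda = 0$ to the concept vectors alone.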
6. Experimental study
Our approach was evaluated on both specialized and unspecialized comparable corpora. Note that domain-specific comparable corpora are often of small size, unlike unspecialized comparable corpora, which tend to be larger (as, for example, journalistic corpora). These particularities make the standard approach more sensitive to context representations (Morin and Hazem 2016). We use here both an intrinsic evaluation, assessing the quality of the bilingual lexicons obtained, and an extrinsic one, assessing the usefulness of these lexicons in the context of CLIR systems.
6.1 Linguistic resources
We evaluate our multilingual document clustering approach on two different corpora, using the two language pairs French–English and Italian–English. These two comparable corpora have different characteristics:
(1) Unspecialized comparable corpora: We consider a subset of the multilingual collection used in the Cross-Language Evaluation Forum CLEF’2003. We rely here on the news articles (from newspapers or news agencies) of SDA95 in French (42,615 documents), GlasgowHerald95 in English (56,472 documents), and SDA95 in Italian (48,980 documents).
(2) Specialized comparable corpus: The Breast Cancer corpus is an unbalanced corpus composed of documents collected from the Elsevier websiteFootnote h. The documents are retrieved from the medical domain, within the sub-domain of “breast cancer.” We use the same corpus as in Morin and Hazem (Reference Morin and Hazem2016). The corpus comprises 130 French documents (about 530,000 words) and 1640 English documents (about 7.4 million words).
Each corpus is preprocessed with TreeTagger, and only nouns and adjectives are used for building the concept vectors. Table 3 summarizes the main characteristics of each corpus. As a bilingual dictionary, we utilize the general French–English bilingual dictionary of Li and Gaussier (Reference Li and Gaussier2012), which contains 74,921 entries. The number of dictionary entries present in CLEF’2003 is 20,432, whereas it is 6861 for Breast Cancer. We also use an Italian–English bilingual dictionary that contains 28,744 entriesFootnote i. The number of its entries present in CLEF’2003 is 7011. Lastly, for concept vector extraction, we set minsupp to 30 for the CLEF’2003 corpora and 20 for the Breast Cancer corpus so as to focus on informative closed concepts (low values of minsupp tend to yield non-informative concepts).
6.2 Bilingual clusters comparability evaluation
The bilingual clusters constructed previously can be seen as units of a comparable corpus. We propose here to assess the quality of such units in terms of comparability. To do so, we make use of two standard comparability measures defined between documents in source and target languages, $d_s$ and $d_t$ :
(1) Binary measure: For a given source document $d_s$ and a target document $d_t$ , the binary measure counts terms in $d_s$ which have translations in $d_t$ and then normalizes these counts by the vector size. The binary measure uses the function trans(t,d), which returns 1 if a translation of term t is found in document d, and 0 otherwise. The similarity using the binary measure is computed as follows (Saad et al. Reference Saad, Langlois and Smali2014):
(9) \begin{equation}bin(d_s,d_t) = \frac{\sum_{t \in d_s} trans(t,d_t)}{\mid d_s\mid}.\end{equation}In its original form, bin is not symmetric. We symmetrize it here by averaging over both directions (source $\leftrightarrow$ target):
(10) \begin{equation}sbin(d_s,d_t) = \frac{bin(d_s, d_t)+ bin(d_t, d_s)}{2}.\end{equation}
(2) Cosine measure: The cosine measure computes the cosine similarity between source and target vectors of the documents using the standard $tf*idf$ weighting scheme (Salton and Buckley Reference Salton and Buckley1988):
(11) \begin{equation}cosine(d_s, d_t) = \frac{ \overrightarrow{d_s} \cdot \overrightarrow{d_t}}{\parallel \overrightarrow{d_s} \parallel \cdot \parallel \overrightarrow{d_t} \parallel}.\end{equation}
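The binary and symmetrized binary measures of equations (9) and (10) can be sketched as follows. This is a hedged sketch under our own representation choices: documents are given as term sets and the bilingual dictionary as a term-to-translation-set mapping; the cosine measure of equation (11) is the standard tf*idf cosine and is not repeated here.

```python
def bin_comp(d_s, d_t, dico):
    """Equation (9): fraction of terms of d_s having a translation in d_t.

    d_s and d_t are sets of terms; dico maps a source term to the set of
    its candidate translations (direction d_s -> d_t).
    """
    hits = sum(1 for t in d_s if dico.get(t, set()) & d_t)
    return hits / len(d_s) if d_s else 0.0

def sbin(d_s, d_t, dico_st, dico_ts):
    """Equation (10): symmetrized binary comparability, averaging the
    binary measure over both translation directions."""
    return (bin_comp(d_s, d_t, dico_st) + bin_comp(d_t, d_s, dico_ts)) / 2
```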
To evaluate the comparability quality of the mined bilingual clusters, we compute the comparability scores of documents within each bilingual cluster through both sbin and cosine. We then derive an average comparability score over all clusters. Table 4 summarizes the results obtained on the French–English and Italian–English corpora. For comparison purposes, we also display the comparability score of the original unclustered bilingual corpora, denoted here $\mathscr{C}^{comp}$ . This amounts to considering that there is a single bilingual cluster encompassing all documents in the collection.
As one can note, the clustering improves the comparability of the bilingual collection. This holds for both measures and all three collections considered. This result is important as bilingual clusters can be used to refine the search space of candidate translations in bilingual lexicon extraction.
In order to better understand the results of the clustering process described previously, we report in Table 5 the percentage of actual bilingual clusters in each collection as well as the percentage of bilingual clusters above specific values of sbin (namely, $0.37$ and $0.6$ ). As can be noted, the percentage of bilingual clusters is relatively small (15% and 17% on CLEF 2003 French–English and Italian–English, respectively, and 34% on Breast Cancer). This shows that the method gathers the bilingual part of the collection in a few clusters; the remaining clusters gather the documents that are specific to each language. Furthermore, the vast majority of bilingual clusters (84%) in CLEF-2003 French–English have a comparability score with sbin that is higher than the comparability score of the original bilingual corpus ( $0.37$ ). The same is true for CLEF-2003 Italian–English. On Breast Cancer, the improvement in comparability is even higher (22%), which we attribute to the fact that, as the corpus is a specialized one, the clusters obtained tend to be more homogeneous within and across languages.
6.3 Experimental evaluation of FCA-based bilingual lexicon
6.3.1 Comparative baselines
In our experiments, as comparative baselines, we consider the standard approach (Rapp Reference Rapp1999) and two recent unsupervised, neural-based approaches introduced in Xu et al. (Reference Xu, Yang, Otani and Wu2018) and Zhang et al. (Reference Zhang, Liu, Luan and Sun2017).
• The standard approach (Rapp Reference Rapp1999): The standard approach follows the three steps (modeling contexts, calculating context similarities, and finding translation pairs) described in Section 2. It is used here with weights based on word embeddings, as described in Section 5.1.
• Zhang et al. (Reference Zhang, Liu, Luan and Sun2017) approachFootnote j: This approach takes monolingual word embeddings as input and uses an adversarial network to cross the language barrier. In our experiments, we use as input the CBOW model (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013) with the default word2vec hyperparameters, trained on our comparable corpora described in Section 6.1. The embedding dimension d is 50. The word embeddings are normalized to unit length. When sampling words for adversarial training, frequent words are penalized in a way similar to Mikolov et al. (Reference Mikolov, Sutskever, Chen, Corrado and Dean2013). G is initialized with a random orthogonal matrix, and the hidden layer size D is set to 500.
• Xu et al. (Reference Xu, Yang, Otani and Wu2018) approachFootnote k: This approach makes use of optimal transport to identify word equivalents across languages based on their word embeddings. We directly use their code with word embeddings of dimension 50 trained on our comparable corpora (see Section 6.1).
6.3.2 Experimental settings and evaluation protocol
Evaluation metrics: We evaluate the performance of our approach, denoted $d_{FCA}$ , using precision (P), recall (R), F1-score (F1), and MAP as defined in Manning et al. (Reference Manning, Raghavan and Schütze2008) and Morin and Hazem (Reference Morin and Hazem2016). We will compare our results on the Breast Cancer corpus to those of Morin and Hazem (Reference Morin and Hazem2016). Note that the standard approach we use here follows exactly the same process as $d_{FCA}$ , without the enrichment with concept vectors.
Precision assesses the proportion of lists containing the correct translation, whereas recall R gives the proportion of translations that are recovered in the candidate lists. The F1-score is the harmonic mean of precision and recall. In case of multiple translations, a list is deemed to contain the correct translation as soon as one of the possible translations is present (Li and Gaussier Reference Li and Gaussier2012). The MAP (Manning et al. Reference Manning, Raghavan and Schütze2008) is used to show the ability of the algorithm to precisely rank the selected candidate translations. Assuming the total number of English words in the reference list is m, let $r_i$ be the rank of the first correct translation in the candidate translation list for the $i^{th}$ term in the evaluation set. The MAP score is then defined by (Manning et al. Reference Manning, Raghavan and Schütze2008):
(12) \begin{equation}MAP = \frac{1}{m}\sum_{i=1}^{m}\frac{1}{r_i},\end{equation}
with the convention that if the correct translation does not appear in the top N candidates, $\frac{1}{r_i}$ is set to 0. MAP is our primary measure to compare the proposed methods.
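Under this definition, the MAP computation can be sketched as below (the function name and the rank-list representation are ours: a value of None encodes a term whose correct translation never appears among the candidates).

```python
def map_score(ranks, m, n=500):
    """MAP as used here: mean of 1/r_i over the m reference words, where
    r_i is the rank of the first correct translation of the i-th term,
    with 1/r_i taken as 0 when no correct translation appears in the
    top n candidates."""
    total = 0.0
    for r in ranks:
        if r is not None and r <= n:
            total += 1.0 / r
    return total / m
```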
Evaluation protocol: To evaluate the quality of the lexicons extracted for the various runs, including the comparative baselines described in Section 6.3.1, we use 10-fold cross-validation for CLEF’2003 and the reference list of 169 French/English single words utilized in Morin and Hazem (Reference Morin and Hazem2016) for Breast Cancer. The bilingual dictionaries are divided into three parts, namely:
• 10% of the source words together with their translations are randomly chosen and used as the test set;
• 10% of the source words together with their translations are randomly chosen and utilized as the validation set for learning the parameter $\lambda$ defined in equation (8) for weighting the combined model;
• The rest is devoted to the training corpus on which the context and concept vectors are extracted.
Note that source words not present in source context/concept vectors, or with no translation in target context/concept vectors, are excluded from the evaluation and validation sets. The value of $\lambda$ obtained through cross-validation on CLEF’2003 is $0.7$ for $SDA95\_French$ and GlasgowHerald95 (FR-EN) and $0.6$ for $SDA95\_Italian$ and GlasgowHerald95 (IT-EN). For Breast Cancer, the value of $\lambda$ is $0.6$ . We use these values in all our experiments. Table 6 illustrates the evolution of the various evaluation measures (precision, recall, F1-score, and MAP) for N equal to 200 according to different values of $\lambda$ on the two corpora (the evaluation is computed on the test set for CLEF’2003).
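The protocol above (10%/10%/80% split of the seed dictionary and selection of $\lambda$ on the validation set) can be sketched as follows. The helper names and the grid of $\lambda$ values are our own assumptions; `validation_score` stands for any validation metric, such as MAP on the held-out validation set.

```python
import random

def split_dictionary(entries, seed=0):
    """Random 10%/10%/80% split of the seed bilingual dictionary into
    test, validation, and training parts (evaluation protocol above)."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n = len(entries) // 10
    return entries[:n], entries[n:2 * n], entries[2 * n:]

def select_lambda(validation_score, grid=None):
    """Pick the lambda of equation (8) maximizing a validation score;
    validation_score is a callable mapping a lambda value to a score."""
    grid = grid if grid is not None else [i / 10 for i in range(11)]
    return max(grid, key=validation_score)
```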
6.3.3 Results and discussion
Figures 2, 3, and 4 highlight the precision values obtained respectively with the standard approach, the weighted combination, and the direct combination, using the best value of $\lambda$ for the different comparable corpora. We consider different sizes (N) for the candidate list, varying from 1 to 500. As one can see, the weighted combination outperforms both the standard approach and the direct combination, especially for medium and large candidate lists ( $N = 100, 300$ , and 500). Furthermore, while the improvement is relatively modest for the French–English general corpus (Figure 2), it is larger and more systematic for the specialized corpus (Figure 3) and the Italian–English general corpus (Figure 4). We attribute this difference to the fact that, compared to the French–English general corpus, relatively few dictionary entries are present in the specialized corpus and the Italian–English general corpus. This suggests that our method is particularly appropriate for specialized domains and for languages for which large, general bilingual dictionaries may not be available. Indeed, as already mentioned in Section 1, our approach is well adapted to low-resource languages as it can be deployed on top of traditional context vectors that do not rely on word embeddings, which are not available for all languages.
Table 7 displays the results obtained in terms of MAP. From these results, we notice that the overall MAP is improved with the concept vectors on the comparable corpus CLEF’2003 (FR-EN) with the weighted combination. This combination significantly outperforms all baselines, demonstrating that the information in concept vectors is relevant for representing words in a bilingual lexicon extraction setting. For CLEF’2003 (IT-EN), $d_{FCA}$ , with direct and weighted combinations, outperforms the standard approach ( $21.80\%$ and $20.20\%$ vs $20.10\%$ ). Nevertheless, its performance is below that of the word-embedding-based models. This can be explained by the fact that, for this pair (IT-EN), the monolingual word embeddings built on the comparable corpus are better aligned across languages than the concept vectors are. Indeed, as there are more unique English words than Italian words, the translation process used with concept vectors may have difficulties in identifying the correct translations.
For the specialized comparable corpus, namely Breast Cancer (FR-EN), $d_{FCA}$ with a weighted combination ( $42.40\%$ ) outperforms all other models. This is an important finding since, to the best of our knowledge, no previous evaluation of FCA-based models has been conducted for bilingual lexicon extraction from specialized corpora. Moreover, the experiments show that the results obtained on Breast Cancer are above the ones obtained on CLEF’2003, which can be explained by the fact that the vocabulary used in the breast cancer field is more specific and less ambiguous than the one used in a general domain corpus. It is worth noting that for Breast Cancer, the obtained result ( $42.40\%$ with the weighted combination) slightly exceeds the one reported in Morin and Hazem (Reference Morin and Hazem2016) ( $42.30\%$ , the best MAP obtained on the unbalanced version of the corpus). That said, a different bilingual dictionary is used in this study; furthermore, as the two approaches are different, they could certainly complement each other.
Recently, in Hazem and Morin (Reference Hazem and Morin2018), the authors proposed meta-embedding representations. The important finding of their work was the efficiency of the character n-gram models, namely the character-based CBOW (CharCBOW) and skip-gram (CharSG) models. For Breast Cancer, Hazem and Morin (Reference Hazem and Morin2018) report that the CharSG, CBOW, and CharCBOW models, used individually, obtain MAP scores of $36.4\%$ , $21.9\%$ , and $60.8\%$ , respectively. Our model $d_{FCA}$ with a weighted combination obtains a MAP score of $42.40\%$ , above both CharSG ( $36.4\%$ ) and CBOW ( $21.9\%$ ) but below CharCBOW. We conjecture that using CharCBOW in our approach as well could lead to even better results.
As conjectured earlier, our approach allows obtaining a representation of concept vectors that is less sparse than the one obtained with context vectors. Indeed, as shown in Table 8, the average size of vectors for CLEF’2003 increases from 28 to 41 words (FR-EN) and from 30 to 44 (IT-EN) when considering concept vectors. It furthermore increases from 19 to 36 words for Breast Cancer. This growth is important and shows that the similarity between a base word and its candidate translations relies on more information. This information is valuable as illustrated in the results discussed previously.
6.4 Embedding extracted lexicons in CLIR
CLIR is concerned with the problem of finding documents written in a language different from that of the query. Although attempts to model multilinguality in information retrieval date back to the early 1970s, renewed interest in the field appeared in the mid-1990s with the rise of the Web, as pages written in many different languages suddenly became available. International organizations and government agencies of multilingual countries have been, and still are, traditional users of CLIR systems.
There are two main approaches to crossing the language barrier in CLIR systems: either using a bilingual dictionary to translate query words, or using a machine translation system, in which case one can translate entire documents. While the second approach can lead to very good results, it is only applicable to language pairs for which machine translation systems are available. We focus here on the first approach and show that bilingual lexicons extracted from comparable corpora can complement existing bilingual dictionaries and improve CLIR systems.
6.4.1 Experimental settings
We now assess the impact of using the extracted bilingual lexicon within standard CLIR systems, namely:
• A vector space model based on Robertson’s tf and Sparck Jones’ idf (Robertson and Sparck Jones Reference Robertson and Sparck Jones1988), referred to as TF-IDF,
• BM25 with the default parameter setting given by the Terrier system,
• The Jelinek-Mercer, Dirichlet and Hiemstra versions of the language models, again with the default parameters of the Terrier system ( $\lambda =0.15$ and $\mu = 2500$ ), referred to as LM-JM, LM-DIR, and LM-H.
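For illustration, the query-likelihood score with Jelinek-Mercer smoothing can be sketched as below. This is a hedged sketch using the common convention $p(t|d) = (1-\lambda)\,p_{ml}(t|d) + \lambda\,p(t|C)$; implementations differ on which component $\lambda$ weights, so this is not necessarily Terrier's exact formulation.

```python
import math

def lm_jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.15):
    """Log query likelihood with Jelinek-Mercer smoothing:
    p(t|d) = (1-lam) * tf(t,d)/|d| + lam * tf(t,C)/|C|.
    Terms with zero smoothed probability are skipped."""
    score = 0.0
    for t in query_terms:
        p = (1 - lam) * doc_tf.get(t, 0) / doc_len \
            + lam * coll_tf.get(t, 0) / coll_len
        if p > 0:
            score += math.log(p)
    return score
```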
We use in our experiments the English text collections from the bilingual tasks of the CLEF campaigns, with English, French, and Italian queries, from 2000 to 2003. Table 9 lists the number of documents (Nd), the number of distinct words (Nw), and the average document length ( $DL_{avg}$ ) in the English document collections, as well as the number of queries, Nq, in each task (all the queries are available in all languages). As the queries from 2000 to 2002 have the same target collection, they are combined into a single task. In all our experiments, we utilize bilingual dictionaries composed respectively of 70,000 entries for the French–English language pair and 28,000 entries for the Italian–English language pair. For evaluation, we use the standard IR metric MAP to compare the different models. Finally, we rely on a paired t-test (at the $0.05$ level) to assess the significance of the differences between the various CLIR systems.
Our experiments are conducted on the French–English and Italian–English language pairs, using two bilingual lexicons: (i) the original dictionary Od, as used in the previous lexicon extraction experiments (Section 6.3), and (ii) the $d_{FCA}$ bilingual lexicon, which corresponds to the best automatically extracted lexicon obtained before (Section 5). The CLIR systems using the original dictionary Od are considered as baselines. We then combine in each experiment the extracted lexicon $d_{FCA}$ with Od using the SYN strategy (Pirkola Reference Pirkola1998; Li and Gaussier Reference Li and Gaussier2012). The idea behind this synonym operator is to translate the query terms using a bilingual lexicon and to treat all the alternative translations of a word as a synonym set, considered as a single word in the documents (Pirkola Reference Pirkola1998); this is known as structured query translation. This strategy has been shown to outperform alternatives in several studies (Ballesteros and Sanderson Reference Ballesteros and Sanderson2003).
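The SYN strategy can be sketched as follows, with hypothetical helper names of our own: each query term is replaced by the set of its translations, and the frequency of a synonym class in a document is the summed frequency of its members, as if they formed one word.

```python
def syn_translate(query_terms, dico):
    """Pirkola's SYN operator: replace each source query term by the set
    of all its translations, treated as a single synonym class. Terms
    absent from the dictionary are kept as singleton classes."""
    return [dico.get(t, {t}) for t in query_terms]

def syn_tf(syn_class, doc_tf):
    """Term frequency of a synonym class in a document: the summed
    frequencies of its members, as if they were one word."""
    return sum(doc_tf.get(w, 0) for w in syn_class)
```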
6.4.2 CLIR experiments and results
The obtained results are summarized in Table 10. They show that the combination of bilingual lexicons $Od + d_{FCA}$ leads to a significant improvement over the baselines, especially on the French–English language pair on CLEF 2000–2002. A smaller improvement is observed on CLEF 2003 for the Italian–English language pair. This last result is not really surprising as only 7011 out of the 28,744 entries of the Italian–English bilingual dictionary Od are present in CLEF 2003. One can also note that, compared to the best runs reported in Savoy (Reference Savoy2003), our approach improves the MAP for the two language pairs FR-EN ( $47.38\%$ vs $42.70\%$ ) and IT-EN ( $40.04\%$ vs $37.77\%$ ) with the Okapi-BM25 model. In addition, the improvement obtained by $Od + d_{FCA}$ with the Jelinek-Mercer model on the French–English language pair reaches $6.67\%$ on CLEF 2003 and $2.09\%$ on CLEF 2000–2002, which respectively account for $95.79\%$ and $93.12\%$ of the corresponding monolingual baselines. The improvement obtained by $Od + d_{FCA}$ with the Jelinek-Mercer model on the Italian–English language pair reaches $1.80\%$ on CLEF 2003 and $1.18\%$ on CLEF 2000–2002, which respectively account for $93.79\%$ and $87.52\%$ of the corresponding monolingual baselines. A better gain ( $2.89\%$ ) is achieved with the $tf \times idf$ model on the French–English language pair and with BM25 on the Italian–English language pair ( $2.69\%$ ).
Lastly, Figures 5, 6, 7, and 8 further validate these findings. The CLIR systems based on the Jelinek-Mercer model and the extracted lexicon Od + $d_{FCA}$ obtain the best scores for all test collections, both for exact precision at low recall and for the percentage of the corresponding monolingual performance. This shows that the lexicon entries added from formal closed concepts improve retrieval scores and lead to better CLIR models.
7. Conclusion
We have proposed in this paper a new approach to bilingual lexicon extraction based on closed concepts directly extracted from the comparable corpora under consideration. The extracted concepts are then used to build concept vectors that complement the embedding-based context vectors currently used in bilingual lexicon extraction from comparable corpora. The experimental study, conducted on two comparable corpora, a specialized one from the medical domain and a general one made of news articles, has shown that the retained concept vectors provide a partial solution to the sparsity problem encountered with context vectors. Furthermore, the quality of the lexicons extracted with both concept and context vectors is higher than that of the lexicons extracted with context vectors only. Lastly, we have integrated the bilingual lexicon extracted from comparable corpora within CLIR systems and demonstrated that the extracted lexicon based on concept vectors contributes to improving CLIR performance. All in all, our approach based on FCA is a simple, easy-to-deploy approach to addressing the sparsity problem of bilingual lexicon extraction.
In the future, we plan on testing our approach on the task in which one is not given a particular corpus but rather a particular bilingual dictionary that needs to be completed. We plan in this context to integrate the concept vectors we have used here with methods based on word embeddings in order to further assess the usefulness of concept vectors.