1. Introduction
Bilingual phrase alignment from comparable corpora is the task of making explicit the translation equivalence relations that exist between the phrases of two texts that are not in a source text–target text relationship. Unsupervised bilingual phrase alignment is difficult. In this work, the term phrase covers single words and multiword expressions of any type, such as nominal or verbal phrases. Hence, the first challenge consists in learning a unified phrase representation, so that phrases can be compared independently of their length. The second challenge is the alignment itself, which cannot be solved directly without supervised cross-lingual information.
In this work we tackle these two challenges: first, we propose a method for learning a length-independent phrase representation; then we integrate this method into an end-to-end training architecture to learn bilingual representations in an unsupervised manner. Consequently, bilingual phrase alignment becomes a vector comparison task using the bilingual representations previously learned.
1.1 Unified phrase representation
Learning a unified phrase representation can be seen as a short-sequence language modeling task with one special property: the modeled representation should be a single unit (e.g., one vector) for inputs of variable length. As in a long-sequence (e.g., sentence) modeling task, both the compositionality and the hierarchical syntactic relations of the composing words should be taken into consideration. For instance, although most phrases are freely combined, as in “wind turbine” and “life quality”, the meaning of some idiomatic or semi-idiomatic phrases can diverge from that of their constituent words, as in “couch potato”. Besides, even among more compositional phrases, the inner syntactic structure determines how constituent words are connected, hence influencing the overall semantics. For example, in the compound noun “sneaker shoe”, the “sneaker” constituent dominates the semantics of the phrase when associated with “shoe”.
Naively, we can pretrain phrase embeddings by treating each phrase as a single token, but this ignores the compositionality and the inner relations between the components of the phrase. Furthermore, learning phrase embeddings as individual vocabulary entries is extremely memory intensive and leads to a data sparsity problem. Finally, phrases not seen during training cannot be handled by this approach. Artetxe, Labaka, and Agirre (2018b) proposed a generalized skip-gram that learns n-gram embeddings on the fly while keeping the desirable property of unigram invariance to handle compositional phrases, but it still suffers from the sparsity and memory problems. Regarding compositional methods, two major approaches have been exploited in previous works; both use word-level vectors to compose the phrase representation. The first consists in simple linear functions such as addition, element-wise multiplication, or concatenation (Mitchell and Lapata 2009; Mikolov et al. 2013b; Garten et al. 2015; Goikoetxea, Agirre, and Soroa 2016; Hazem and Daille 2018; Liu, Morin, and Peña Saldarriaga 2018). The first two vector combination methods are simple and have proven very effective in many NLP tasks; however, they ignore the syntactic structure of the phrase. In other words, these methods do not distinguish word order: for example, “service department” and “department service” will have the same representation although they do not convey the same semantics. The concatenation method does register word order, but variable-length phrases are then no longer semantically comparable, even with padding. In addition, all these functions ignore the inner structure of the phrase. The second family of approaches includes more complex information, as they usually involve neural networks trained with extra information such as the textual context of the phrase (the words before and after the phrase) or a syntax tree structure (a part-of-speech parsing tree). Several works (Socher, Manning, and Ng 2010; Socher et al. 2013b; Irsoy and Cardie 2014; Paulus, Socher, and Manning 2014; Le and Zuidema 2015) have obtained promising results by using recursive neural networks (Goller and Küchler 1996) to capture syntactic information. However, the recursive neural network requires a tree structure for each training sample, which may not always be available. To address this limitation, we propose a new tree-free recursive neural network to encode phrases of variable length into a single vector while preserving the compositionality and the syntactic information within the phrase.
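As a minimal illustration of the word-order problem with purely additive composition (the vectors below are hypothetical toy values):

```python
import numpy as np

# Toy 4-dimensional word vectors (hypothetical values, for illustration only).
emb = {
    "service":    np.array([0.2, 0.7, 0.1, 0.5]),
    "department": np.array([0.6, 0.1, 0.4, 0.3]),
}

def additive_phrase(words):
    # Additive composition: mean (or sum) of the component word vectors.
    return np.mean([emb[w] for w in words], axis=0)

v1 = additive_phrase(["service", "department"])
v2 = additive_phrase(["department", "service"])
print(np.allclose(v1, v2))  # True: both word orders collapse to the same vector
```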
More recently, contextualized word representations (Peters et al. 2018; Devlin et al. 2018; Radford et al. 2019) have achieved appealing improvements on a range of NLP tasks, but the models are mainly evaluated on classification-like or span prediction tasks, whether at the sequence or the token level (Rajpurkar et al. 2016; Wang et al. 2018; Williams, Nangia, and Bowman 2018; Zellers et al. 2018). Since our final task, bilingual phrase alignment, is a vector comparison task, we evaluate these approaches on similar comparative tasks, namely phrase synonymy and phrase similarity, by extracting the vectors computed by the layers preceding the final classification layer. Note that these models generate a sequence of vectors, so we still have to apply some extra procedure to retrieve a single fixed-length vector representing the whole sequence in our scenario. Two major strategies are available: we can simply use the mean vector over the sequence or choose one specific vector to represent the whole sequence, or we can stack another neural network that generates a single fixed-length vector from a sequence input.
Our tree-free recursive neural network can be trained in a typical encoder–decoder architecture, as exploited in many neural machine translation frameworks, or in a Siamese-like system (Bromley et al. 1993). The advantage of these end-to-end systems is that they can easily be scaled up or incorporated into other networks without extra training information.
We evaluate our system on four data sets: two open domain data sets from Semeval 2013 and 2017 and two specialized domain data sets. The first specialized corpus, from a European public project, covers the renewable energy domain in English and French, while the second will be released with this paper and covers a cancer subtopic in the medical domain. The results obtained improve over state-of-the-art approaches on the similarity and synonymy tasks. Furthermore, several ablation tests are conducted to evaluate the impact of our phrase encoder, its training objective, and the contextualized embeddings used as input.
1.2 Unsupervised bilingual phrase alignment
Beginning with the seminal works of Fung (1995) and Rapp (1999) based on word co-occurrences for bilingual word alignment (BWA), significant improvements have recently been achieved by neural network-based approaches (Mikolov, Le, and Sutskever 2013a; Faruqui and Dyer 2014; Xing et al. 2015; Artetxe, Labaka, and Agirre 2018a; Peng, Lin, and Stevenson 2021), but most work on the subject focuses on single terms. The alignment of multiword expressions (MWE) from comparable corpora has received less attention (Robitaille et al. 2006; Morin and Daille 2012). Our work is in line with Liu et al. (2018), where the objective is, given a source phrase, to rank all the candidates in a list containing phrases of variable length. Moreover, unlike Liu et al. (2018), our work can align phrases in an unsupervised manner without explicit cross-lingual information.
We adapt our tree-free recursive neural network as a phrase encoder for the bilingual phrase alignment task because it can generate a single fixed-length vector for phrases of variable length while preserving the syntactic relations between words.
Concerning model training, since the meaning of domain-specific phrases is highly context dependent, the commonly used sequence-to-sequence systems fit our needs well. After encoding a phrase, we can decode its representation to predict its context, thus establishing a relation between the phrase and its context. Unlike common neural machine translation sequence-to-sequence systems, our model encodes a phrase with our tree-free recursive neural network and decodes it with regard to its syntactic context. In order to be able to align phrases in different languages, we make the encoder cross-lingual, which means that the input vectors in different languages share the same vector space (Artetxe et al. 2018a; Liu et al. 2018). We also incorporate a back-translation mechanism (Sennrich, Haddow, and Birch 2016) for single words during training by using pretrained bilingual word embeddings (BWE). Moreover, our model relies exclusively on monolingual data and is trained in an unsupervised manner. After completion of the training phase, we obtain a shared cross-lingual phrase encoder that can generate a unified representation of phrases of any length.
As for the data sets, we use the same specialized corpora as in our monolingual evaluation: one covers the renewable energy domain and the other the cancer subtopic of the medical domain. We manually create three gold standards for the first domain, covering three language pairs (English-Spanish, English-French and English-Chinese), and one English-Spanish gold standard for the medical domain corpus. Our experiments on these data sets show that our method significantly outperforms existing unsupervised methods for variable-length phrase alignment, by an average of 8.8 MAP points.
2. Background
2.1 Sequence representation modeling
The simple additive approach for encoding a sequence of word vectors into one single vector is generally considered an effective baseline (Mikolov et al. 2013b; Del, Tättar, and Fishel 2018; Liu et al. 2018; Laville et al. 2020; Huang et al. 2020). Another possible improvement is to use a recursive neural network (RNN) (Goller and Küchler 1996). It generalizes the recurrent neural network (Elman 1990), which implicitly applies a left-branching binary tree: the first two leaves are combined to form a node, this node is combined with the next leaf to form the next-level node, and so on. The recursive neural network encodes a sequence of word vectors along a tree structure, for example a parse tree, by recursively applying the weight matrices to each node association. This architecture has been successfully exploited in a variety of tasks: Socher et al. (2013a) use an untied-weight RNN for constituent parsing, with different weight matrices depending on the syntactic category of the constituent; Le and Zuidema (2014) collect context information by adding an outer representation for each node, and their system is used for dependency parsing. Moreover, various works (Socher et al. 2013b; Irsoy and Cardie 2014; Paulus et al. 2014) apply the RNN to generate sentence-level representations for sentiment analysis using labeled data.
Figure 1 shows an example of a sequence of length four. Given a parse tree, each input is a word vector $v_i \in \mathbb{R}^d$. The network applies a linear function with a weight matrix $W_l \in \mathbb{R}^{d*d}$ for each left node child and a weight matrix $W_r \in \mathbb{R}^{d*d}$ for each right node child in the given tree. So, for each non-leaf node $\eta$, the corresponding vector $x_{\eta}$ is calculated as follows:
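With $g$ a nonlinear activation function (and an optional bias term $b \in \mathbb{R}^{d}$, which is an assumption here), a formulation consistent with these definitions is:

$x_{\eta} = g\big(W_l\, v_{l(\eta)} + W_r\, v_{r(\eta)} + b\big)$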
where $v_{l(\eta)}$ and $v_{r(\eta)}$ denote, respectively, the left and the right child vector of the node $\eta$.
The disadvantage of the RNN in our scenario is its need for a tree structure: as stated above, parse trees are not available for all languages, and it is not possible to retrieve a context sentence for parsing when we meet a new freely combined phrase that never occurred in the corpus. The recurrent neural network or the LSTM does not need a tree structure but applies a universal left-branching binary tree to all sequences. The convolutional neural network with a kernel size of 2 can be considered a specialized RNN that adopts element-wise multiplication rather than matrix multiplication, with only one layer followed by a pooling operation. The more advanced and purely self-attention-based model, the Multi-Head Attention cell (Vaswani et al. 2017), has shown great potential in sequence modeling, but it has many more parameters than the previously mentioned models. Since a model with more parameters has greater capacity, it is not directly comparable to models that have significantly fewer learnable parameters.
Figure 2 shows how the recurrent neural network and a convolutional neural network with a kernel size of 2 can model a sequence of 4 tokens. $W_{xh}\in \mathbb{R}^{h*d}$ and $W_{hh}\in \mathbb{R}^{h*h}$ are the parameters of a typical recurrent neural network, where h is the hidden dimension. For the convolutional network with a kernel size of 2, we can consider the convolution operation as two element-wise multiplications (dashed lines in Figure 2) with a left multiplier $v_{l}\in \mathbb{R}^d$ and a right multiplier $v_{r}\in \mathbb{R}^d$; stacking $v_{l}$ and $v_{r}$ forms the actual convolution kernel. The final vector is obtained by a pooling operation such as max or average. Note that the addition-based approach (Liu et al. 2018) can be viewed as a specialized version of the CNN where all the values in $v_{l}$ and $v_{r}$ are equal to one and pooling is done by averaging the vectors.
Recently published language models such as BERT (Devlin et al. 2018) or ELMo (Peters et al. 2018) can also encode the sequence of word vectors of a phrase into one single vector if we use the output vector at one particular step, for example, the last token in ELMo or the first special [CLS] token in BERT. The additive approach can also be applied to all the output vectors. Like the static word embedding models, these models are pretrained on large general corpora. By default, they all encode a sequence of word vectors into a new sequence of word vectors, also known as contextualized word embeddings. Consequently, tasks similar to sequence labeling (e.g., sentence tagging) and span prediction (e.g., question answering) naturally fit these models. Sequence-level classification can be achieved by representing the sequence with the output vector of a special token position such as the first or last token of the sequence. To the best of our knowledge, these models have mostly been applied to classification or span prediction tasks.
2.2 Feature-based and fine-tuning-based language representations
The feature-based strategy has existed since the 1990s. The traditional co-occurrence count-based method (Church and Hanks 1990; Dagan, Pereira, and Lee 1994; Niwa and Nitta 1994; Bullinaria and Levy 2007; Turney and Pantel 2010) represents a word by a sparse co-occurrence vector and often applies the pointwise mutual information to associate the word and its context. Neural network-based methods (Mikolov et al. 2013b; Pennington, Socher, and Manning 2014) represent a word by a dense embedding vector, which has led to significant improvements in major NLP tasks. These word-level static vectors can be incorporated into other systems as the basic input units to generate higher level representations.
Contextualized models (Peters et al. 2018; Devlin et al. 2018; Radford et al. 2019) are sequence-level representations with word-level granularity. In fact, they all exploit word-level representations as the basic input and output units. Once pretrained, we can use these models for a specific task by stacking supplementary layers onto them. The difference between the feature-based and the fine-tuning-based approach lies in whether we freeze the parameters of these pretrained models when we incorporate them into a task-specific training framework. The feature-based approach extracts the output of the pretrained model and uses this output as static features of the input, omitting the gradients of the parameters, while the fine-tuning-based approach updates these parameters during the back-propagation of the training. The advantage of the fine-tuning-based approach is that the whole system can be readjusted to the task-specific training corpus, but it is much more time and space consuming than the feature-based approach. Moreover, according to Devlin et al. (2018), similar performance can be obtained (a drop of only 0.3 F1 points on CoNLL-2003 NER) using the same BERT model in the feature-based and the fine-tuning-based settings. This is particularly interesting because fine-tuning a large model with millions of parameters can be exceedingly long, while updating only a few layers is much more efficient.
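As a sketch of what this difference amounts to in practice (assuming a PyTorch setup with a Hugging Face-style BertModel; the model name and the task head are illustrative):

```python
import torch
from transformers import BertModel  # assumes the Hugging Face transformers package

bert = BertModel.from_pretrained("bert-base-multilingual-cased")

# Feature-based: freeze the pretrained parameters so no gradients flow through BERT
# and its outputs act as static features for the task-specific layers.
for param in bert.parameters():
    param.requires_grad = False

task_head = torch.nn.Linear(bert.config.hidden_size, 2)  # illustrative task layer
optimizer = torch.optim.Adam(
    task_head.parameters(),  # feature-based: only the task layers are updated
    # In the fine-tuning setting, BERT's parameters would be added here as well.
    lr=1e-4,
)
```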
2.3 Cross-lingual word embeddings
In order to map phrases of different languages into one common space with compositional models, word-level mapping is an essential prerequisite. Following the success of word embeddings (Mikolov et al. 2013b) trained on monolingual data, a large proportion of research has concentrated on mapping word embeddings into a common space for multiple languages. Cross-lingual word embeddings were pioneered by Mikolov et al. (2013a) using a linear transformation matrix. A large number of works have since tried to improve the linear transformation method (Lazaridou, Dinu, and Baroni 2015; Artetxe, Labaka, and Agirre 2016; Liu et al. 2018). Artetxe et al. (2018a) compiled a substantial amount of similar works (Mikolov et al. 2013a; Faruqui and Dyer 2014; Xing et al. 2015; Shigeto et al. 2015; Zhang et al. 2016; Artetxe et al. 2016; Smith et al. 2017) into a multistep bilingual word embedding framework. More recently, Lample and Conneau (2019) proposed pretrained cross-lingual transformer-based language models using masked language modeling as in Devlin et al. (2018) and a translation language modeling training objective with parallel data to further improve the quality of pretrained cross-lingual embeddings for languages that share the same alphabet.
2.4 Training objectives in language modeling
Predicting the next word or sentence is the most common training objective in a wide range of previous works with an encoder–decoder-like architecture (Bahdanau, Cho, and Bengio 2014; Sutskever et al. 2014; Cho et al. 2014; Luong, Pham, and Manning 2015; Gehring et al. 2017; Vaswani et al. 2017; Peters et al. 2018; Devlin et al. 2018; Radford et al. 2019). In addition, the BERT model has a joint training objective of predicting only randomly masked tokens, and the corresponding ablation studies have proven this to be beneficial. This objective can be considered a special version of the denoising objective (Vincent et al. 2008), which reconstructs the original sentence from a randomly noised version.
2.5 Encoder–decoders in neural machine translation with low resources
To train our network, we use the encoder–decoder model widely exploited in neural machine translation (NMT). Although there are many different models, they all implement an encoder–decoder architecture, optionally combined with an attention mechanism (Bahdanau et al. 2014; Luong et al. 2015) to tackle long sequences. This type of model has become the main trend in recent years, producing the current state-of-the-art results. It takes advantage of longer context information, and continuous representations can easily be trained in an end-to-end system.
Cho et al. (2014) proposed a model to learn representations of variable-length sequences; however, their approach requires parallel phrase pairs for training. Therefore, we looked at NMT models making use of monolingual corpora to enhance translation in low-resource scenarios. When no parallel data exist between source and target languages, several works proposed the use of a pivot language (Firat et al. 2016; Saha et al. 2016; Chen et al. 2017) acting as a bridge between source and target. Following the same idea, Johnson et al. (2017) proposed a multilingual NMT model that creates an implicit bridge between language pairs for which no parallel data are used for training. Whether explicitly or implicitly, all these works still require parallel corpora between the pivot language and the other languages.
More interestingly for our work, some research has recently been conducted on training NMT models with monolingual corpora only (Lample et al. 2018; Artetxe et al. 2018c; Yang et al. 2018). They all use pretrained cross-lingual word embeddings as input. A shared encoder is then used to encode noised sequences in the source and the target languages, and the decoder decodes the encoded vector to reconstruct the original sequence. This strategy is called denoising (Vincent et al. 2008), and its objective is to minimize the following cross-entropy loss:
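A formulation consistent with the notation below, with $\Delta$ the token-level cross-entropy between the reconstruction and the original sequence, is:

$\mathcal{L}_{denoise}(\theta_{enc}, \theta_{dec}) = \mathbb{E}_{x \sim D_{l}}\big[\Delta\big(dec_{\rightarrow l}(enc(\mathcal{N}(x))),\, x\big)\big]$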
where $\theta_{enc}$ and $\theta_{dec}$ denote, respectively, the parameters of the encoder and the decoder, $x \in D_{l}$ is a sequence sampled from the monolingual data, and $dec_{\rightarrow l}(enc(\mathcal{N}(x)))$ represents a sequence reconstructed from the noised version of the original sequence x.
In addition, the back-translation mechanism (Sennrich et al. 2016; Zhang and Zong 2016) has been widely exploited in unsupervised neural machine translation (Lample et al. 2018; Artetxe et al. 2018c) to build the link between the two languages by alternately applying the source-to-target model to source sentences in order to generate inputs for training the target-to-source model (and vice versa):
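In the same notation, a formulation consistent with the definitions below (with $y$ the synthetic translation of $x$ produced by the current model) is:

$\mathcal{L}_{bt}(\theta_{enc}, \theta_{dec}) = \mathbb{E}_{x \sim D_{l1}}\big[\Delta\big(dec_{\rightarrow l1}(enc(y)),\, x\big)\big] + \mathbb{E}_{x \sim D_{l2}}\big[\Delta\big(dec_{\rightarrow l2}(enc(y)),\, x\big)\big]$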
where $D_{l1}$ and $D_{l2}$ are the two language corpora, and $dec_{\rightarrow l1}$ means that the decoder will decode the sequence into language l1 (or l2, respectively). Suppose y is the translation of $x \in D_{l1}$; then $dec_{\rightarrow l1}(enc(y))$ represents the source sentence reconstructed from the synthetic translation. The goal is to generate pseudo-parallel sentence pairs to train the models with a reconstruction loss.
Also pertaining to our work, Yang et al. (2018) introduce a semi-shared encoder to retain specific properties of each language and a directional self-attention to model word order. More recently, Wu, Wang, and Wang (2019) propose an alternative approach that extracts and edits candidate translation sentences with a comparative loss.
To sum up, most previous works represent a sequence either with compositional approaches, for instance the average of all the word vectors of a sentence, or with one representative vector of a neural network, for instance the special [CLS] token in BERT. For bilingual phrase alignment, previous studies essentially exploit cross-lingual word embeddings. This work extends Liu et al. (2020) for the bilingual phrase alignment part, which also uses cross-lingual word embeddings as input. Moreover, we provide a full study of phrase representation learning.
3. Unified phrase representation learning
3.1 Tree-free recursive neural network
In order to encode phrases of variable length into a fixed-length vector without tree structures, we propose a new network, the tree-free recursive neural network (TF-RNN). We consider it a variant of the original recursive neural network because the basic idea is still to associate tokens following a bottom-up structure. This structure is required as input by the original recursive neural network, whereas in the TF-RNN we eliminate this requirement by recursively splitting each node into a left and a right semantic part, then associating the left part with its right-hand neighbor and the right part with its left-hand neighbor. This is motivated by our hypothesis that the semantics of a pair of words can be retrieved by combining their meanings with some position-specific weights, and consequently the semantics of a sequence of words can be retrieved by recursively combining the semantics of each word pair. In fact, by doing this, we create a pseudo binary tree structure where we associate each adjacent node pair without any parsing. This kind of structure can be seen as an approximation of a generalized sentence syntax, as each language unit is directly associated with its adjacent neighbors and hierarchically associated with the other units, eventually yielding the overall semantics of all the units.
Let $\big[v_1^0, v_2^0, v_3^0, ..., v_n^0\big]$ with $v_i^0 \in \mathbb{R}^{d}$ be the input word vector sequence of n words; the TF-RNN outputs a single fixed-length vector $v_o \in \mathbb{R}^{p}$ by the following steps:
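One reading of the description above (with $g$ a nonlinear activation function; each node contributes its right semantic part and its right-hand neighbor contributes its left part; indices start at input level 0, so the single top-level node is $v_1^{n-1}$) is:

$v_{i}^{j+1} = g\big(W_r\, v_{i}^{j} + b_r + W_l\, v_{i+1}^{j} + b_l\big), \quad 1 \le i \le n-j-1$

$v_o = g\big(U\, v_{1}^{n-1} + b\big)$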
where j indicates the level in the pseudo-tree structure; a phrase with n word components will have n levels in such a structure. $W_l \in \mathbb{R}^{d*d}$ and $W_r \in \mathbb{R}^{d*d}$, respectively, represent the left and right weight matrices for the extraction of the word semantics; $b_l \in \mathbb{R}^{d}$ and $b_r \in \mathbb{R}^{d}$ are the corresponding bias vectors. A node vector on level $j+1$, $v_{i}^{j+1}$, is calculated from a pair of adjacent node vectors of the previous level j. Once we reach the final level, the final output vector $v_o$ is calculated by a linear layer on top with $U \in \mathbb{R}^{p*d}$ and $b \in \mathbb{R}^{p}$ as its parameters. A nonlinear activation function is applied after each operation. An example of a sequence of length three is illustrated in Figure 3.
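A minimal PyTorch sketch of this scheme, under our reading above (tanh as the activation and the exact combination order are assumptions):

```python
import torch
import torch.nn as nn

class TreeFreeRNN(nn.Module):
    # Sketch of the TF-RNN described above: adjacent nodes are combined level by
    # level until a single node remains, which is then projected to dimension p.
    def __init__(self, d, p):
        super().__init__()
        self.W_r = nn.Linear(d, d)   # right-semantics extraction (W_r, b_r)
        self.W_l = nn.Linear(d, d)   # left-semantics extraction (W_l, b_l)
        self.U = nn.Linear(d, p)     # top linear layer (U, b)
        self.g = torch.tanh          # activation function (assumed)

    def forward(self, v):            # v: (n, d) word vectors of one phrase
        while v.size(0) > 1:
            # combine each node with its right-hand neighbour
            v = self.g(self.W_r(v[:-1]) + self.W_l(v[1:]))
        return self.g(self.U(v[0]))  # single fixed-length phrase vector in R^p

phrase = torch.randn(3, 400)          # e.g., a three-word phrase with d = 400
encoder = TreeFreeRNN(d=400, p=500)
print(encoder(phrase).shape)          # torch.Size([500])
```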
3.2 Complexity
We compare the complexity of different neural network layers that can encode sequences of variable length: the RctNN and self-attention encode the input sequence into another sequence of equal length, while our proposed TF-RNN and the CNN with padding and pooling (right part of Figure 2) encode it into one fixed-length vector.
We mainly compare two criteria for one layer of each architecture: the first is the computational complexity, which reflects how many weight parameters are involved in the linear transformations; the second is the maximum dependency length, that is, the length of the paths that forward and backward signals have to traverse in the network, which is critical for learning long-range dependencies in many sequence transduction tasks. The complexity comparison is shown in Table 1.
Regarding computational complexity, the RctNN performs n linear transformations with matrices in $\mathbb{R}^{d*d}$, while the CNN is more expensive than the RctNN by a factor of the kernel width k, which can be seen as having k times the weight matrices of an RctNN. Self-attention is faster in most cases since usually $n<d$. Our proposal seems to be the most expensive since its complexity is quadratic in both input length and model dimension; however, it should be noted that our objective is to encode phrases, which are most of the time n-grams with $n \in [1,5]$. For unigrams and bigrams, our encoder is less complex than the RctNN and the CNN. Also, our encoder is a one-layer architecture, whereas the self-attention has a “depth” of 8 heads in the Transformer-base architecture.
As the self-attention is a dynamic fully connected layer, it reaches each input position with one linear transformation. As for our proposal, the maximum dependency length is linearly related to the input length; yet again, since our inputs are mostly short sequences, this is not problematic in our scenario.
3.3 Encoder–decoder training with wrapped context prediction
We use a fairly standard encoder–decoder architecture to train the phrase encoder. Our TF-RNN is used as the encoder, so phrases of variable length can be represented by a fixed-length vector without the need for a tree structure. The decoder is a two-layer LSTM. Furthermore, instead of predicting the next word or phrase as in many other similar systems, we let the generator produce the context of the phrase. However, one disadvantage of predicting only the context is that the syntax of the output sequence is distorted by the missing phrase. Since most phrases are either nominal or verbal, we decided to use a single random vector to wrap all the tokens of a phrase, helping the generator reconstruct a syntactically complete context during training.
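For illustration, the wrapped training target can be built along these lines (the wrap token name and the helper below are hypothetical):

```python
def wrap_sentence(tokens, phrase_start, phrase_end, wrap_token="[PHRASE]"):
    # Replace the phrase span by a single wrap token so that the decoder is trained
    # on a syntactically complete context around the (hidden) phrase.
    return tokens[:phrase_start] + [wrap_token] + tokens[phrase_end:]

sentence = ["the", "wind", "turbine", "produces", "clean", "energy"]
print(wrap_sentence(sentence, 1, 3))
# ['the', '[PHRASE]', 'produces', 'clean', 'energy']
```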
Apart from static word embeddings, we also use recent contextualized embeddings as input to our TF-RNN phrase encoder. As mentioned in Section 2.2, the feature-based usage of these models degrades the results only slightly compared to the fine-tuning-based usage, which takes up far more time and space. In our framework, we apply the feature-based approach to accelerate our experiments with relatively low resources. More concretely, this means that we freeze all the parameters of the contextualized embedding models when back-propagating through the whole network. After training, the encoder serves as a phrase vector generator that takes either static or contextualized embedding vectors. Figure 4 shows an instance of the framework.
Note that in the case of static embeddings, the embedding layer is actually a look-up table, while for the contextualized ones it is a forward pass of the pretrained model. More specifically, to get a word vector from static embeddings, one can directly use one row or column of the hidden layer matrix indexed by the word. For contextualized embeddings, the entire sentence (or several words around the target word) is passed to the neural network, and a forward pass produces the context-dependent word vector.
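Schematically, the contrast looks as follows (the names are illustrative, and the commented contextualized part assumes a Hugging Face-style model):

```python
import torch

# Static embeddings: a context-free look-up in the embedding matrix.
vocab = {"wind": 0, "turbine": 1}
static_table = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=400)
v_static = static_table(torch.tensor(vocab["turbine"]))  # one fixed row per word

# Contextualized embeddings: a forward pass over the whole sentence is required,
# and the vector obtained for "turbine" depends on the surrounding words, e.g.:
#   inputs = tokenizer("the wind turbine produces energy", return_tensors="pt")
#   hidden = bert(**inputs).last_hidden_state          # (1, seq_len, hidden_size)
#   v_contextual = hidden[0, turbine_token_index]
```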
4. Toward unsupervised bilingual phrase alignment
4.1 Tree-free phrase encoder in cross-lingual context
We would have used the same encoder described in Section 3.3; however, in our preliminary experiments, it did not perform well, as the synthetic translations are sometimes of low quality and the accumulated translation errors are amplified by the recursivity (Wu et al. 2019). The same phenomenon also occurs in other similar networks such as the recurrent or LSTM network. Meanwhile, since the additive approach (Liu et al. 2018) manages to maintain a decent performance, we decided to adapt the tree-free recursive neural network to the cross-lingual context by flattening the network. Consequently, the network has more additive features while still being able to distinguish word order and distribute different weights. More concretely, the adapted version has three layers. In the first, we split the semantics of each word into two parts, a left side and a right side, by a linear transformation; we then associate these nodes by concatenation, the left side being associated with the right side of the previous token and vice versa. The second layer is a fully connected layer that maps the input vectors to output vectors of a specified dimension. Finally, the third layer adds up all the intermediate-level nodes and outputs a single fixed-length vector. The sum operation is motivated by the additive characteristics mentioned in Mikolov et al. (2013b), as the additive approach showed interesting results in our preliminary experiments. Figure 5 shows the schema of the proposed network, which is clearly a flat version of the TF-RNN presented in Section 3.1.
We use pretrained cross-lingual embeddings as the input vector sequence $[v_1, v_2, v_3, ..., v_n]$ with $v_i \in \mathbb{R}^{d}$; the output vector $v_o \in \mathbb{R}^{p}$ is calculated as follows:
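One possible reading of the three layers described above (a sketch rather than the exact published formulation; the association operator $a(\cdot,\cdot)$ stands for the concatenation of the right part of word $i$ with the left part of word $i+1$) is:

$h_i = g\big(U\, a\big(g(W_r v_i + b_r),\, g(W_l v_{i+1} + b_l)\big) + b\big), \quad v_o = \sum_{i=1}^{n-1} h_i$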
where $W_l \in \mathbb{R}^{d*d}$ and $W_r \in \mathbb{R}^{d*d}$ denote the left and the right weight matrix, respectively, in the linear transformation of the semantic association, $b_l \in \mathbb{R}^{d}$ and $b_r \in \mathbb{R}^{d}$ are the corresponding bias vectors, and $U \in \mathbb{R}^{p*d}$ and $b \in \mathbb{R}^{p}$ are the parameters of the fully connected layer, with d the input dimension and p the output vector dimension.
Consequently, our phrase encoder produces vector representations that are word order sensitive and that can distribute different weights for the different phrase components without using structured input.
4.2 Unsupervised training
The general encoder–decoder architecture of our method is shown in Figure 6. Since the input sequence is always a short sequence of fewer than 7 tokens, usually a two- or three-word phrase, we did not use an attention mechanism, which is intended to capture long-range dependencies. The network tries to predict the sentence containing the input phrase from its encoded vector. One can argue that our system is only unsupervised under the prerequisite of pretrained bilingual embeddings. This is true. However, since pretrained embeddings are largely available and can easily be obtained with general public parallel data, we consider that our system is unsupervised because we do not need specific parallel data.
As illustrated in Figure 6, in addition to our phrase encoder, we incorporate a pseudo back-translation mechanism for single words based on bilingual word embeddings (Artetxe et al. 2018a; Liu et al. 2018). The decoder consists of a single-layer LSTM and a fully connected layer on top of it. The goal of the decoder is to reconstruct the wrapped sentence which contains the current input phrase. We name this process context prediction. The intuition behind context prediction is based on the distributional hypothesis (Harris 1954), that is, words in similar contexts tend to have similar meanings. This idea is studied in Del et al. (2018): instead of an end-to-end system, they first learn all the phrase embeddings with Skip-gram by considering each phrase as a single word, and then learn the composition function with a regression model that predicts the pretrained phrase embeddings from the embeddings of the component words. However, they limit the phrase length to 2, whereas we propose a unified end-to-end framework that learns the composition of phrases of variable length and the mapping simultaneously. Overall, the system uses three key concepts:
Wrapped sentences. As in NMT, we use special tokens to mark the start and the end of a sentence. Apart from these standard special tokens, we exploit the same training strategy as in our monolingual system: the wrapped sentence. In addition to what we have stated in Section 2.4, this allows the system to recognize the phrase when decoding and strengthens the links between languages.
Shared encoder. The system treats input phrases in different languages via the universal encoder detailed in Section 4.1. Works using a similar idea are He et al. (2016), Lee, Cho, and Hofmann (2017) and Artetxe et al. (2018c). As the input embeddings are already mapped to a common space, the representation generated by the shared encoder is also a cross-lingual vector. After the training, we use exclusively the shared encoder to generate cross-lingual phrase representations, which is essential for our final task: bilingual phrase alignment.
Pseudo back-translation. Since we do not have cross-lingual data, a direct link between a phrase in language l1 and one in language l2 is not feasible. However, synthetic translations of single words can easily be obtained using bilingual word embeddings. By using translated single-word phrases to train our model, we create stronger links between the two languages. This can be viewed as pseudo back-translation, as we generate synthetic translations by BWE, while in NMT systems the translation is generated by the corresponding decoder (Sennrich et al. 2016; Artetxe et al. 2018c).
Therefore, the system potentially has four objective loss functions when we alternately iterate over all the phrases in the two languages l1 and l2:
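With $\Delta$ again denoting the cross-entropy loss, these four objectives can be written as follows (a reconstruction consistent with the definitions below):

$\mathcal{L}_{cp\ l \rightarrow l} = \mathbb{E}_{x \sim D_{l}}\big[\Delta\big(dec_{\rightarrow l}(enc(x)),\, ws(x)\big)\big], \quad l \in \{l1, l2\}$

$\mathcal{L}_{cp\ lp \rightarrow lq} = \mathbb{E}_{x \sim D_{lq}}\big[\Delta\big(dec_{\rightarrow lq}(enc({\small\text{BWE}}(x))),\, ws(x)\big)\big], \quad (lp, lq) \in \{(l1, l2), (l2, l1)\}$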
where $\mathcal{L}_{cp\ lp \rightarrow lq}$ means the context prediction loss from an encoded phrase in language lp to the context of language lq, $dec_{\rightarrow l}(enc(x))$ is the reconstructed version of the wrapped sentence, ws(x) denotes the real wrapped sentence containing the phrase x and ${\small\text{BWE}}(x)$ is the translated single-word phrase for x using bilingual word embedding.
5. Experiment settings
5.1 Phrase synonymy and similarity
Data and resources. For the phrase synonymy task, we use two specialized domain corpora: Wind Energy (WE) and a new Breast Cancer (BC) corpus. The WE corpus comes in 6 languages; in this work, we only evaluate on the English and the French corpora, which contain, respectively, 13,338 and 33,887 sentences. The BC corpus is in English and contains 26,716 sentences. The aim of the phrase synonymy task is to find the synonyms of a phrase in a given corpus. Usually, a large list of candidates is first extracted from the corpus so that we can select the candidates which are the most likely phrase synonyms. In order to build the candidate list, we use the IXA pipes library (Agerri et al. 2014) to preprocess the corpora with the built-in preprocessing tools in the following order: normalization, tokenization and pos-tagging. Then a list of phrases of a maximum of 7 words is extracted using the open source tool PKE. Finally, 8923 and 6412 phrases are extracted from the English and French WE corpora and 8989 phrases from the BC corpus after filtering hapaxes (threshold 1). We use the same gold standard as Hazem and Daille (2018) for the WE corpus. The gold standard for the BC corpus was built based on the MeSH 2018 thesaurus and contains 108 phrases.
As for the phrase similarity task, two public open domain data sets are obtained from previous Semeval campaigns: one from task 2 of Semeval 2017 (Camacho-Collados et al. 2017) and the other from task 5 of Semeval 2013 (Korkontzelos et al. 2013). The Semeval 2017 data set has a gold standard of 95 pairs of phrases after filtering those containing only single words, an evaluation script, a 64-dimensional static word embedding model and a wiki corpus of 46 million sentences which contains the context information of the phrases. The Semeval 2013 data set only contains a gold standard of 7814 pairs of multiword phrases.
Input embeddings. Regarding the embedding model, we use deeplearning4j to train domain-specific 100-dimensional word embeddings with the Skip-gram model, 15 negative samples and a window size of 5. Since the specialized corpora are fairly small, we concatenate these embeddings with the 300-dimensional fastText vectors pretrained on Wikipedia (Grave et al. 2018), resulting in 400-dimensional vectors. This technique has proven very effective for specialized domain corpora (Hazem and Morin 2017; Liu et al. 2018). For the general domain corpus, we simply use the model provided with Semeval 2017. As for the contextualized embedding models, we use the implementations of BERT and ELMo because they both provide pretrained models for multiple languages: the BERT implementation has a multilingual model that covers 104 languages, while the ELMo implementation has 44 separate language models. It is worth mentioning that all these models are pretrained on large general corpora (1B words for ELMo and 3.3B for BERT). Finally, we have two types of input embeddings:
- Static. The 400-dimensional static word embedding vectors obtained by concatenating the pretrained fastText vectors and the vectors trained on the small specialized domain corpora for the phrase synonymy task, as well as the 64-dimensional static word embedding vectors provided with the Semeval 2017 data set for the phrase similarity task.
- BERT or ELMo. Pretrained contextualized embeddings with the feature-based usage setting.
5.2 Bilingual phrase alignment experiments
Data and resources. For the bilingual phrase alignment task, we use the same Wind Energy (WE) comparable corpus as in the monolingual tasks. This time we evaluate the English, French, Spanish, and Chinese corpora. Furthermore, we extend the Breast Cancer (BC) corpus to an English-Spanish comparable corpus by crawling a scientific website.
The English BC corpus has 26,716 sentences and the Spanish one 62,804 sentences. The gold standard was constructed based on the MeSH 2018 thesaurus and contains 108 phrase pairs which exist in our corpus. Concerning the WE corpora, the English, French, Spanish and Chinese parts contain, respectively, 13,338, 33,887, 29,083 and 17,932 sentences. Hazem and Morin (2016) proposed a reference list consisting of 139 single words for the English-French corpus, while Liu et al. (2018) provided a gold standard with 73 multiword phrases for the same corpus. Based on the reference list of Liu et al. (2018), we propose a new gold standard which also includes single words. Moreover, we extended this gold standard to the other languages while ensuring that all reference lists share the same 90 English phrases to be aligned. Finally, alignment reference lists were obtained for three language pairs: English-French, English-Spanish and English-Chinese. For the sake of comparability, we also report results on the data sets of Liu et al. (2018) and Hazem and Morin (2016). Table 2 shows the detailed distribution of the gold terms in terms of length for the BC and WE corpora. Note that for one English term, we can have multiple correct alignments in the target language.
For the preprocessing and phrase extraction, we also use the IXA pipes library to tokenize and lemmatize the French and Spanish corpora; the WE Chinese corpus is already pre-segmented. We then use the Stanford CoreNLP pos-tagger for all languages, and for the phrase extraction we use the same PKE tool as in the monolingual tasks. After hapax filtering, each corpus contains roughly 6000 candidate phrases of maximal length 7.
Input cross-lingual embeddings. We implement the bilingual word embedding framework mentioned in Section 2.3 using deeplearning4j 1.0.0-beta3. We also use this method to obtain 400-dimensional word embedding vectors as in the monolingual tasks; recall that this technique follows the idea discussed in Hazem and Morin (2017) and Liu et al. (2018). We then apply the bilingual word embedding framework so that all word embeddings at the input level in each experiment are mapped to a common space. For each language pair, the seed lexicon is selected with a frequency threshold of 50, giving around 2000 word pairs. We use unit length normalization, mean centering, matrix whitening, re-weighting, and de-whitening to generate the cross-lingual word embeddings. Since our goal is to evaluate the contributions of our system, we do not measure the impact of different pretrained embeddings but focus on those achieving state-of-the-art results to date.
5.3 Training settings
For all our experiments, the dimension of the encoded vector ($v_o$ in Figure 3) for the shared encoder is set to 500, which is also the hidden size of the LSTM decoders. For the fastText or ELMo input embeddings, sentences longer than 100 words are cut off, while for BERT, sentences longer than 150 tokens are cut off, because BERT uses the SentencePiece (Kudo and Richardson 2018) tokenizer and some words are tokenized into several tokens. By truncating sentences, the training is quicker and more stable. We prepend the special token [CLS] to every sentence for the models using BERT. To extract the features from the BERT model, we sum the output vectors of the last four hidden layers (Devlin et al. 2018); this has been shown to be the second-best method, only 0.2 F-score points behind concatenating the last four layers, which is 4 times less space efficient. The model is trained with a minibatch size of 20, which means that given one phrase, we calculate the mean of the cross-entropy loss between 20 predicted and real sentences. We run our experiments for a maximum of 200 epochs with an early-stop condition of three consecutive loss increases. One model with static word embeddings takes about 2 days to train on a single Geforce 1080 Ti GPU with Pytorch 1.0 and Cuda 10 on Ubuntu 16.04, while training with contextualized embeddings takes about 4 days with the feature-based strategy.
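A sketch of this feature extraction step (assuming a BERT implementation that returns all hidden states, e.g. via output_hidden_states=True in Hugging Face-style models):

```python
import torch

def bert_features(hidden_states):
    # hidden_states: tuple of (num_layers + 1) tensors of shape (batch, seq_len, hidden).
    # Sum the last four hidden layers and use the result as frozen input features.
    return torch.stack(hidden_states[-4:]).sum(dim=0)

# Dummy tensors standing in for the 13 hidden states of a base-size BERT model.
dummy = tuple(torch.randn(1, 150, 768) for _ in range(13))
print(bert_features(dummy).shape)   # torch.Size([1, 150, 768])
```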
5.4 Evaluation settings
The generated phrase vectors are compared by cosine similarity. For the synonymy and bilingual alignment tasks, we simply calculate the cosine similarity between the source phrase and all pre-extracted phrase candidates and rank the candidates. We use the evaluation script provided with the Semeval2017 data set for the similarity task, and the MAP (Mean Average Precision) score (Manning et al. 2008) to evaluate the synonymy and bilingual alignment tasks:
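A formulation consistent with the description below (using, for each test entry, the rank of its correct candidate) is:

$MAP = \frac{1}{|W|} \sum_{i=1}^{|W|} \frac{1}{Rank_{i}}$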
where $|W|$ corresponds to the size of the evaluation list, and $Rank_{i}$ corresponds to the ranking of a correct synonym candidate i.
5.5 Reference methods
Baseline approaches. Regarding the monolingual tasks, we have implemented three types of baseline approaches:
- Skip-gram-ext. The extended version of Skip-gram (Artetxe et al. 2018b) with 300-dimensional vectors. The implementation is publicly available.
- Static mean. The additive approach, which has proven to be surprisingly effective (Mikolov et al. 2013b; Liu et al. 2018).
- ELMo/BERT mean/reduce/concat. We extract a single fixed-length vector from the feature-based output sequence of ELMo/BERT with two strategies. The mean is similar to the additive approach: we simply calculate the mean vector over all the normalized vectors in the output sequence. The reduce strategy uses one vector to represent the whole sequence: for ELMo it is the last token vector, while for BERT we use the output of the hidden layer for the first [CLS] token. The concatenation for ELMo is based on the original ELMo paper (Peters et al. 2018), where the authors propose to concatenate the first and the last token vectors to represent a sequence.
For the bilingual alignment task, we have also implemented two baseline classes:
- Static mean. This is the same approach as in the monolingual tasks.
- Co-occurrence based approach. The compositional approach (Grefenstette 1999; Tanaka 2002; Robitaille et al. 2006) is a quick and direct method for aligning multiword expressions. It is basically a dictionary look-up approach that translates each word via a dictionary and sorts all candidates by frequency. Morin and Daille (2012) proposed a co-occurrence based approach, the compositional method with context-based projection (CMCBP), to tackle the problem of out-of-dictionary words. However, this approach can only align phrases of the same length, so we compare only a subset of the multiword phrase pairs.
Encoder–decoder system with other phrase encoders. To compare with our proposed TF-RNN and its adapted cross-lingual version, we also implemented several neural networks which do not require structured input: RecurrentNN, CNN, Transformer encoder, and LSTM, the latter being reported to obtain the best results in Del et al. (2018). They all have the same output dimension, and the CNN has a kernel size of 2 and zero-padding so that even single-word phrases can be encoded. A small Transformer encoder with 4 layers and 4 heads is also implemented; the hidden dimension of its feed-forward layer is twice the model dimension. Note that it still has many more parameters than the other architectures (5 million parameters vs. roughly 0.5 million for the others).
6. Results and discussion
6.1 Phrase synonymy and similarity
Overall results on phrase synonymy and similarity tasks are shown in Tables 3 and 4. We compare our proposal with several state-of-the-art methods that can be applied to our tasks.
Our approach with static word embeddings and the TF-RNN as phrase encoder obtains the best results on the phrase synonymy task on specialized domain corpora. The TF-RNN also obtains the third-best result for the phrase similarity task on Semeval2017. Given that the Semeval2013 data set does not provide any textual data and the model is trained on the textual corpus of Semeval2017, the results on Semeval2013 for the context prediction approaches are biased by data availability. Moreover, compared to the approaches with contextualized embedding input, the context prediction approaches (last four lines in both tables) mostly obtain better results on the data sets that provide a textual corpus to train the model. Although the contextualized embedding models capture the inner relations between the component words of a phrase, they cannot exploit the context information of the phrase during the test phase or if the phrase is outside the training corpus. The encoder–decoder training-based approach, however, memorizes and generalizes the context information of different phrases in the training corpus. In addition, if we compare the four phrase encoders of the context prediction-based approaches, our proposed TF-RNN outperforms the existing neural networks on the synonymy task on every data set, obtains tangible improvements on the similarity task on the Semeval2017 data set, and gets slightly better results on the Semeval2013 data set than the recurrent neural network. Although the Transformer encoder has better results on the similarity task, our encoder has comparable performance while having far fewer parameters (0.5M in the TF-RNN vs. 5M in the Transformer). Therefore, we believe that carefully representing the phrase following a relevant syntactic structure can generate better vector representations.
Among the non-context prediction-based approaches (the first six lines in both tables), the extended Skip-gram first performs very poorly on the synonymy task. This is because many phrases at inference time are freely combined and may not appear in the training corpus; as a consequence, these phrases have no representation in the look-up table. This phenomenon can also be observed on the Semeval2013 similarity task. However, it performs surprisingly well on the Semeval2017 similarity task, probably because Semeval2017 has a large training corpus which contains all the phrases in our test. We also notice that the contextualized embeddings (from the second to the fifth lines in both tables) are not better than the static embeddings; in fact, the static embeddings hold the best results on the BC and the Semeval2017 data sets. For the contextualized embedding models, it seems that the mean of the output vectors better fits our tasks, except for the ELMo model on the Semeval2013 data set. Comparing the BERT and ELMo models with the mean representation, the BERT model has relatively respectable results on the English synonymy data sets, while the ELMo model is more effective on the French data set and on the similarity task. Our explanation is that the BERT model is a multilingual model mixing 104 languages, so it is not surprising that it is biased by the English training corpus; conversely, the ELMo French model is a separate model trained only on French data. For the similarity task, the ELMo model largely outperforms the BERT model on the Semeval2013 data set although the two models have similar results on the Semeval2017 data set.
In addition to the comparison with other existing methods, we have conducted a series of ablation tests to better understand the behavior of the key components in our system.
Static versus contextualized input embeddings: results are reported in Table 5.
As stated before, the static embeddings for the synonymy task are open domain pretrained vectors reinforced with specialized domain embeddings, trained on small specialized domain corpora. This solution has been exploited to generate meaningful embedding vectors on specialized domain corpora for bilingual lexicon extraction (Hazem and Morin Reference Hazem and Morin2017; Liu et al. Reference Liu, Morin and Peña Saldarriaga2018).
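As a rough illustration of this domain reinforcement, the sketch below concatenates an open domain vector with a specialized domain vector for each word; the file names and the zero-vector fallback for out-of-vocabulary words are assumptions for the example, not the exact setup of the cited works.

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file names: an open domain model and a model trained on a
# small specialized domain corpus (e.g., wind energy or breast cancer).
general = KeyedVectors.load_word2vec_format("general.vec")
specialized = KeyedVectors.load_word2vec_format("specialized.vec")

def domain_reinforced_vector(word):
    """Concatenate the open domain and specialized domain vectors of a word.

    Words missing from one model fall back to a zero vector of the matching
    dimensionality, so every word ends up with the same final size.
    """
    gen = general[word] if word in general else np.zeros(general.vector_size)
    spec = specialized[word] if word in specialized else np.zeros(specialized.vector_size)
    return np.concatenate([gen, spec])

vec = domain_reinforced_vector("turbine")
# vec.shape == (general.vector_size + specialized.vector_size,)
```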
As shown in Table 5, the static embeddings concatenated with specialized domain information achieve clearly better results on the specialized domain data sets (WE and BC), whereas the ELMo model trained on general domain corpora has the best results on the general domain data sets. We can then deduce that the availability of domain-specific information outweighs the choice of a particular word embedding architecture: for a specialized domain corpus, it is more effective to exploit domain information than to use more advanced architectures or high-coverage open domain resources. Besides, it shows that our system can efficiently incorporate contextualized embeddings, as we obtain the best results on the general domain Semeval2017 data set, improving the state-of-the-art approach by nearly 9 points.
Results with the BERT model are the worst on the French and the Semeval data sets. Yet, it outperforms the ELMo model on the English synonymy data sets. Again, this confirms that the model is less effective on non-English data sets, as previously discussed. As for the similarity task, we assume that increasing the training size (e.g., 831 phrases in Semeval2017 vs. 8923 in WE-en) would improve the system, because the BERT model uses a subword tokenizer that often splits a word into multiple units, which could make it more difficult to learn parameter weights that generalize well during training.
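To make the subword issue concrete, the short sketch below shows how the standard multilingual BERT tokenizer from the Hugging Face transformers library splits phrases into WordPiece units; the example terms are ours and the exact splits depend on the model vocabulary.

```python
from transformers import AutoTokenizer

# Multilingual BERT may split rare or domain-specific words into several
# WordPiece units, so a short phrase can become a much longer token sequence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

for phrase in ["wind turbine", "breast cancer screening"]:
    print(phrase, "->", tokenizer.tokenize(phrase))
```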
Wrapped context prediction versus other training objectives: results are shown in Table 6.
To demonstrate the effectiveness of the proposed training objective, we evaluated two additional models trained with different objectives under exactly the same experimental settings. The first predicts all the sentence tokens, denoted "plain". The second predicts only the context tokens around the phrase, without the wrapped phrase token, denoted "context".
We can clearly see that the wrapped context training objective consistently obtains the best results compared to the other possible objectives in our scenario. Although the context prediction strategy comes fairly close, adding a wrapped token in place of the phrase lets the system learn from a syntactically more complete sequence. Predicting all the tokens including the phrase is worse than the context prediction objective, even though it predicts a syntactically complete sequence. The reason is possibly that predicting the phrase tokens ties the encoder too closely to the specific phrase components rather than to features that generalize across different but similar phrases, which eventually makes it difficult for the encoder to generate close vectors for such phrases.
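The following sketch spells out how the three target sequences differ for a single training sentence, under our reading of the objectives; the <PHRASE> wrapped token name is purely illustrative.

```python
def build_targets(tokens, phrase_start, phrase_end, objective):
    """Build the decoder target sequence for one training sentence.

    tokens       : list of word tokens of the sentence
    phrase_start : index of the first phrase token (inclusive)
    phrase_end   : index after the last phrase token (exclusive)
    objective    : "plain", "context" or "wrapped"
    """
    left = tokens[:phrase_start]
    phrase = tokens[phrase_start:phrase_end]
    right = tokens[phrase_end:]
    if objective == "plain":     # predict every token, phrase included
        return left + phrase + right
    if objective == "context":   # predict only the surrounding context tokens
        return left + right
    if objective == "wrapped":   # replace the phrase by a single wrapped token
        return left + ["<PHRASE>"] + right
    raise ValueError(objective)

sentence = "the new wind turbine produces more energy".split()
print(build_targets(sentence, 2, 4, "wrapped"))
# ['the', 'new', '<PHRASE>', 'produces', 'more', 'energy']
```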
Pseudo-Siamese network versus encoder–decoder system: results are shown in Table 7.
There is another applicable framework for training our phrase encoder in an unsupervised way: the pseudo-Siamese network (Bromley et al. Reference Bromley, Guyon, LeCun, Säckinger and Shah1993; Zagoruyko and Komodakis Reference Zagoruyko and Komodakis2015; Wang, Li, and Lazebnik Reference Wang, Li and Lazebnik2016). The idea is quite simple: instead of predicting a sequence of context tokens with a decoder, the network minimizes the vector distance between the phrase vector and the context vector. We use an LSTM with a self-attention mechanism (Lin et al. Reference Lin, Feng, dos Santos, Yu, Xiang, Zhou and Bengio2017) to encode the context.
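A minimal PyTorch sketch of this alternative is given below, assuming the phrase encoder already produces a single vector of the same dimensionality as the context encoder output; the single-layer attention scorer is a simplification of the self-attentive encoder of Lin et al., and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """LSTM over the context tokens followed by a simple self-attention pooling."""

    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)   # simplified attention scorer

    def forward(self, context_embs):           # (batch, seq_len, emb_dim)
        states, _ = self.lstm(context_embs)    # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.attn(states), dim=1)
        return (weights * states).sum(dim=1)   # (batch, hidden_dim)

def pseudo_siamese_loss(phrase_vec, context_vec):
    """Pull the phrase vector towards the vector of its surrounding context.

    phrase_vec and context_vec are assumed to share the same dimensionality.
    """
    return (1.0 - nn.functional.cosine_similarity(phrase_vec, context_vec)).mean()
```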
We can see that the pseudo-Siamese network performs very poorly on all data sets, with extremely large drops on the similarity task. This is somewhat unexpected, since both training approaches are inspired by the Harris distributional hypothesis (Harris Reference Harris1954). It may be due to the nature of the comparison tasks or to the small size of our training samples. It perhaps also explains why encoder–decoder systems have become more popular than Siamese-style architectures in recent studies.
6.2 Bilingual phrase alignment
Table 8 shows the overall results on all test phrases. Since the distributional approach (Morin and Daille Reference Morin and Daille2012) does not handle the alignment of phrases of different lengths, the corresponding results are omitted from the table.
It is clearly shown that the proposed method has a better overall performance. In particular, for the alignment of phrases of different lengths, the new approach significantly improves the MAP, by an average of 8.8 points. This proves that the proposed method is able to produce high-quality alignments for phrases of variable length. Keep in mind that the different length distribution represents a small proportion of all test phrases except for the English-Chinese corpus, so the overall score would improve further with a uniform distribution over all kinds of alignment. The second best method is the previously described addition approach, which obtains good results (Mikolov et al. Reference Mikolov, Sutskever, Chen, Corrado and Dean2013b; Del et al. Reference Del, Tättar and Fishel2018). However, we observe that for linguistically distant language pairs (here, English-Chinese), all encoder–decoder systems outperform the addition-based approach. The CNN has some interesting results on same-length alignment, and the LSTM is strong on short phrase alignment but, unlike in Del et al. (Reference Del, Tättar and Fishel2018), falls behind on the other types. This difference may be explained by the fact that they limit the alignment to two-word phrases.
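For reference, the MAP scores reported here average, over all source phrases, the precision of the ranked candidate list against the gold translations; a minimal sketch of the metric is given below (the ranking itself is assumed to come from cosine similarity between phrase vectors).

```python
def average_precision(ranked_candidates, references):
    """Average precision of one ranked candidate list against a set of references."""
    hits, precisions = 0, []
    for rank, candidate in enumerate(ranked_candidates, start=1):
        if candidate in references:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(references) if references else 0.0

def mean_average_precision(rankings):
    """rankings: list of (ranked_candidates, references) pairs, one per source phrase."""
    return sum(average_precision(c, r) for c, r in rankings) / len(rankings)

# With a single reference per phrase, MAP reduces to the mean reciprocal rank
# of the reference translation in the candidate list.
print(mean_average_precision([(["cáncer de mama", "cáncer mamario"], {"cáncer de mama"})]))  # 1.0
```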
The Transformer encoder obtains neither better results than the addition-based approach nor better results than the other encoders. First, addition remains more adaptive and effective for short sequence comparison between linguistically close language pairs (Hazem and Morin Reference Hazem and Morin2017; Liu et al. Reference Liu, Morin and Peña Saldarriaga2018; Del et al. Reference Del, Tättar and Fishel2018). Second, as we set a maximum of 200 epochs, we think that the Transformer encoder may not have converged, because it has a much higher parameter-to-sample ratio than the other encoders. Finally, Transformer architectures are essentially stacks of multihead self-attention layers, which are designed to capture relations in long sequences, whereas we mostly encode short sequences.
The relatively poor results on the English-Chinese corpus may be due to the segmentation of Chinese words. More concretely, as the input vectors for the Chinese sequences are at word level, many words in our gold standard are not segmented in the same way as in the given corpus, which is already pre-segmented. We would like to replace the word-level embeddings with character-level ones in future work.
Concerning the single-word alignment on BC, 25 of the 72 single words are in fact acronyms, which are particularly difficult to align. This would explain why single-word alignment obtains much poorer results than the other distributions. The proposed method nevertheless obtains comparatively strong results for single-word alignment, and we believe this is because the system sees more single-word alignment samples generated by the pseudo back-translation during training.
In order to show that the proposed method maintains a reasonable performance on single words, we present in Table 9 the results for single words compared to state-of-the-art work on bilingual word embeddings (Artetxe et al. Reference Artetxe, Labaka and Agirre2018a), including the 139 English-French single words of Hazem and Morin (Reference Hazem and Morin2016) (suffixed -HM in Table 9). For comparability, we test only on single-word phrases and the candidate list is limited to all single words in the corpus vocabulary. In our data sets, the source English words are the same 15 as in the sw line of Table 8.
We can see that, in general, compared to Artetxe et al. (Reference Artetxe, Labaka and Agirre2018a), the proposed approach does not significantly degrade the results except for the English-Chinese words. In addition, we maintain better results than the original transformation matrix method (Mikolov et al. Reference Mikolov, Le and Sutskever2013a), with only one exception on the English-French wind energy data set. This shows that our approach is not biased by the compositionality of multiword expressions.
6.3 Bilingual phrase alignment qualitative analysis
For a better understanding of how the proposed method succeeds or fails to align different types of phrases, we analyzed some of the alignments proposed by our system.
Table 10 shows examples extracted from the top 2 nearest candidates to the source phrase in column 2. Again, we see that the proposed method is capable of generating better results across different types of alignment. In the first example, with the proposed approach, the source phrase breast cancer is aligned to cáncer de mama (lit. "cancer of breast"), which is the expected phrase in Spanish and is far more idiomatic than cáncer mamario (lit. "cancer mammary") obtained by the addition approach. In line 7, we see that the perfect translation for wind vane is found by our proposal: , while the additive approach finds (lit. "yaw electric machine"). Besides, the examples in lines 3, 5, 6, 7, and 8 all involve phrases of variable length, and the corresponding reference phrase can be found in the fourth column. Interestingly, we find that the proposed system finds paraphrases for fairly domain-specific phrases such as blade tip, which is aligned to côté supérieur de la pale (lit. "side top of the blade"). This is also the case for Darrieus rotor aligned to rotor vertical, which is an outstanding result since the Darrieus rotor is a kind of vertical rotor.
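For context, the candidates shown in Table 10 come from a nearest-neighbour search in the shared phrase vector space; a minimal sketch of such a top-k retrieval by cosine similarity is shown below (function and variable names are ours).

```python
import numpy as np

def top_k_candidates(source_vec, candidate_vecs, candidate_phrases, k=2):
    """Return the k target phrases whose vectors are closest (cosine) to the source phrase."""
    src = source_vec / np.linalg.norm(source_vec)
    cands = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    scores = cands @ src                 # cosine similarity with each candidate
    best = np.argsort(-scores)[:k]       # indices of the k highest scores
    return [(candidate_phrases[i], float(scores[i])) for i in best]
```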
Though the proposed method generally performs well on phrases, we observe that it occasionally puts too much emphasis on the syntactic head of a multiword phrase. For instance, in the second example, cell death is aligned to muerte ("death"), while the addition-based approach manages to align it to muerte celular (lit. "death cellular"), which is the reference phrase in Spanish. While death is undoubtedly the syntactic head of the noun phrase cell death, it is clear that the proposed method puts more weight on the syntactic information than on the compositional property for this phrase. More generally, the translation proposed for an English source phrase with a syntactic pattern such as ADJ NOUN may be reduced to the NOUN alone. This also explains why we do not obtain better results on equal-length phrase alignment on the English-Spanish and English-Chinese wind energy corpora (Table 8). This bias could be due to the increased number of single-word phrase samples introduced by the pseudo back-translation reinforced training, and it suggests that we could improve the system by also adding synthetic translations for multiword phrases during training.
7. Conclusion and perspective
Significant advances have been achieved in bilingual word-level alignment, yet the challenge remains for phrase-level alignment. Moreover, the need for parallel data is a critical drawback for the alignment task. This work proposes a system that alleviates these two problems: a unified phrase representation model using cross-lingual word embeddings as input and an unsupervised training algorithm inspired by recent works on neural machine translation.
The proposed system consists of an encoder–decoder architecture. For the encoder part, we introduce a new short sequence encoder called the tree-free recursive neural network (TF-RNN), which constructs cross-lingual representations of phrases of any length while taking word order into account. For the decoder part, we use a two-layer LSTM that decodes these representations with respect to their contexts. As for the training strategy, we incorporate a pseudo back-translation mechanism in order to train the network in an unsupervised way. Experiments on five data sets show that the adaptability our method offers does not imply performance drawbacks: on the bilingual phrase alignment task, the results are on par with the state of the art. For the alignment of phrases of different lengths, our method improves the latest results by a mean of 8.8 points in MAP and seems mainly limited by segmentation issues, which we intend to address in future work using character-level embeddings.
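As a rough sketch of how such an encoder–decoder can be wired (the TF-RNN internals are described earlier and are not reproduced here), the PyTorch skeleton below conditions a two-layer LSTM decoder on a single phrase vector to predict the wrapped context tokens; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PhraseContextDecoder(nn.Module):
    """Two-layer LSTM decoder conditioned on a single phrase vector."""

    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, phrase_vec, target_ids):
        # phrase_vec : (batch, hidden_dim) vector produced by the phrase encoder
        # target_ids : (batch, seq_len) wrapped context token ids (teacher forcing)
        h0 = torch.stack([phrase_vec, phrase_vec])   # initial state for both layers
        c0 = torch.zeros_like(h0)
        states, _ = self.lstm(self.embed(target_ids), (h0, c0))
        return self.out(states)                      # (batch, seq_len, vocab_size)

# Training would minimize the cross-entropy between the prediction at position t
# and the wrapped context token at position t + 1.
```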
We would also like to carry out more extensive evaluations in future work and to further study the extract-edit approach (Wu et al. Reference Wu, Wang and Wang2019) to improve our system. Based on a method similar to back-translation, we could use extracted and edited phrases as synthetic translations, which would avoid the noise introduced by poor translations. Finally, two data selection strategies could be explored more deeply: pretraining corpus merging and post-training embedding merging. The former examines the nature and quality of the corpora and trains the word embeddings on a single, carefully merged corpus. The latter trains word embeddings separately on the general and specialized domain corpora and then merges the resulting embeddings. We would like to study the behavior of different merging approaches, such as a dedicated merging layer or multitask learning over the two sets of embeddings.