
An empirical study of cyclical learning rate on neural machine translation

Published online by Cambridge University Press:  09 February 2022

Weixuan Wang
Affiliation:
Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China
Choon Meng Lee
Affiliation:
Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China
Jianfeng Liu
Affiliation:
Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China
Talha Colakoglu
Affiliation:
Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China
Wei Peng*
Affiliation:
Artificial Intelligence Application Research Center, Huawei Technologies, Co., Ltd., Shenzhen, People’s Republic of China
*
*Corresponding author. E-mail: peng.wei1@huawei.com

Abstract

In training deep learning networks, the optimizer and related learning rate are often used without much thought or with minimal tuning, even though they are crucial in ensuring fast convergence to a good-quality minimum of the loss function that can also generalize well on the test dataset. Drawing inspiration from the successful application of cyclical learning rate policies to computer vision tasks, we explore how the cyclical learning rate can be applied to train transformer-based neural networks for neural machine translation. From our carefully designed experiments, we show that the choice of optimizer and the associated cyclical learning rate policy can have a significant impact on performance. In addition, we establish guidelines for applying cyclical learning rates to neural machine translation tasks.

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

1. Introduction

There has been much interest in deep learning optimizer research recently (Kingma and Ba 2014; Reddi, Kale, and Kumar 2018; Luo et al. 2019; Zhang et al. 2019; Liu et al. 2019). These works attempt to answer the question: what is the best step size to use in each step of the gradient descent? With first-order gradient descent being the de facto standard in deep learning optimization, the question of the optimal step size, or learning rate, in each step of the gradient descent arises naturally. The difficulty of choosing a good learning rate can be better understood by considering the two extremes: (1) when the learning rate is too small, training takes a long time; (2) when the learning rate is too large, training may diverge instead of converging to a satisfactory solution.

The two main classes of optimizers commonly used in deep learning are momentum-based Stochastic Gradient Descent (SGD) (Bottou 2010) and adaptive momentum-based methods (Duchi, Hazan, and Singer 2010; Kingma and Ba 2014; Reddi et al. 2018; Luo et al. 2019; Liu et al. 2019). The difference between the two lies in how the newly computed gradient is used in the update. In SGD with momentum, the new gradient is a convex combination of the current gradient and the exponentially averaged previous gradients. In the adaptive case, the current gradient is further weighted by a term involving the sum of squares of the previous gradients. For a more detailed description and convergence analysis, please refer to Reddi et al. (2018).

SGD maintains a fixed learning rate for each gradient step. The frequent updates in the training process produce high-variance parameters, causing the objective function to fluctuate with varying intensity. This fluctuation may enable SGD to find new and potentially better local minima in non-convex optimization problems. However, SGD suffers in loss surfaces with sharp minima due to frequent overshooting. The SGD method can diverge or converge incredibly slowly if its learning rate is set inappropriately (Ruder 2016).

With its adaptive learning rate, Adam is well suited for optimization problems with noisy gradients. Adam, with a decaying schedule, tends to avoid the overshooting problems caused by loss function fluctuation. Experiments on the MNIST and CIFAR10 datasets showed that Adam has the fastest convergence compared to other optimizers, in particular SGD with Nesterov momentum (Kingma and Ba 2014). Adam has been popular with the deep learning community because of this speed of convergence. However, Adabound (Luo et al. 2019), a proposed improvement to Adam that clips the gradient range, showed experimentally that, given enough training epochs, SGD can converge to a better quality solution than Adam. To quote from the future work of Adabound, “why SGD usually performs well across diverse applications of machine learning remains uncertain.” The choice of optimizer is by no means straightforward or cut and dried.

Another critical aspect of training a deep learning model is the batch size. Once again, while the batch size was previously regarded as just another hyperparameter, recent studies such as Keskar et al. (2016) have shed light on its role in generalization, that is, how the trained model performs on the test dataset. Research works (Keskar et al. 2016; Hochreiter and Schmidhuber 1997a) explored the idea of sharp versus flat minima in relation to generalization. From experimental results on convolutional networks, for example, AlexNet (Krizhevsky, Sutskever, and Hinton 2017) and VggNet (Simonyan and Zisserman 2014), Keskar et al. (2016) demonstrated that an overly large batch size tends to lead to sharp minima while a sufficiently small batch size brings about flat minima. Dinh et al. (2017), however, argue that sharp minima can also generalize well in deep networks, provided that the notion of sharpness is taken in context. The computer vision (CV) community tends to advocate training schedules with large batch sizes (Smith et al. 2018). There is a lack of empirical study investigating interaction effects between learning rate scheduling and batch size. Increasing the learning rate while scaling up the batch size (Smith et al. 2018) achieves good performance with fewer parameter updates, but it has not been proven effective when applied to neural machine translation (NMT).

While the aforementioned works have contributed to our understanding of the nature of the various optimizers, they mainly focus on computer vision (CV)-related deep learning networks and datasets. In contrast, the rich body of work in NMT and other natural language processing (NLP)-related tasks has been largely left untouched. Recall that CV deep learning networks and NMT deep learning networks are very different. For instance, the convolutional network that forms the basis of many successful CV deep learning networks is translation invariant; for example, in a face recognition network, the convolutional filters produce the same response even when the same face is shifted or translated. On the other hand, recurrent neural networks (RNN) (Hochreiter and Schmidhuber 1997b; Chung et al. 2014) and transformer-based deep learning networks (Vaswani et al. 2017; Devlin et al. 2019) for NMT specifically look for patterns in sequences. It is often assumed that using the mainstream optimizer (Adam) with the default settings is good enough. However, our empirical study demonstrates that there is significant room for improvement. While there is some awareness in the NMT and NLP community of the cyclical learning rate (CLR), there is still a lack of understanding of the effect of applying CLR to the NMT task. In addition, how the combined effect of learning rate policies and batch size affects the NMT loss is also largely unexplored.

1.1 The contributions

The contributions of this study are to:

  • Explore the use of cyclical learning rates for NMT in our empirical study and provide detailed practical insights into the application of CLR.

  • Reveal the interaction between learning rate and batch size and its effect on the loss in NMT. From our experiments, we show how the use of CLR enables NMT to work with smaller batch sizes while achieving validation loss comparable to that of large batch sizes. This empirical finding has implications for NMT training in resource-constrained scenarios.

  • Establish an explanation by visualizing the loss surface and optimizer trajectory of the NMT training process. At the same time, the rationale behind the performance leap when applying CLR to NMT is unveiled.Footnote a

2. Related works

Our work is mainly related to the CLR study (Smith 2017), which addresses the learning rate issue by having repeated cycles of linearly increasing and decreasing learning rates, constituting the triangular policy for each cycle. CLR draws its inspiration from curriculum learning (Bengio et al. 2009) and simulated annealing (Aarts and Korst 2003). Bengio et al. (2009) clarify when and why a curriculum or “starting small” strategy can benefit machine learning algorithms. They contribute to this question by showing several cases involving vision and language tasks in which very simple multi-stage curriculum strategies give rise to improved generalization and faster convergence. Smith (2017) demonstrated the effectiveness of CLR on the standard computer vision (CV) datasets CIFAR-10 and CIFAR-100, using well-established CV architectures such as ResNet (He et al. 2015) and DenseNet (Huang, Liu, and Weinberger 2016).

One interesting aspect of CLR is the need to balance regularizations such as weight decay, dropout, and batch size, as pointed out in Smith and Topin (2017). Their experiments verified that various forms of regularization need to be toned down when using CLR to achieve good results. In particular, the generalization results for small batch sizes from the above-mentioned studies no longer hold for CLR. This is interesting because the use of CLR allows training to be accelerated with a larger batch size without the sharp-minima generalization concern. A related work (McCandlish et al. 2018) sets a theoretical upper limit on the speed-up in training time obtainable by increasing the batch size; beyond this limit, increasing the batch size further yields no reduction in training time.

Li et al. (2018) propose various visualization methods for understanding the loss landscape defined by the loss functions and how the various deep learning architectures affect this landscape. The proposed visualization techniques allow a depiction of the optimization trajectory, which is particularly helpful in understanding the behavior of the various optimizers and how they eventually reach their local minima.

NMT is based on an end-to-end framework without the need to handle SMT (statistical machine translation)-specific problems like word alignments, translation rules, and complicated decoding methods. Bahdanau, Cho, and Bengio (2015) proposed a sequence-to-sequence NMT model based on RNNs, leading to significant improvements in translation quality. The words of the input sentences from a source language are first encoded into vectors of an intermediate representation and passed to the decoder. The context vector is derived by applying an attention mechanism that measures the degree of alignment between the source sentences and target sentences. The decoder then converts the context vector to the target translated word based on the previously translated words. However, RNNs process tokens serially, resulting in long training times for the NMT model. Vaswani et al. (2017) proposed a more efficient transformer architecture. Both the encoder and the decoder of a transformer contain self-attention and pointwise fully connected layers without relying on an RNN architecture. Decoder layers have another sublayer that computes attention values over the encoder’s output, enabling learning of alignment patterns between the source and the target sentences.

In order to get better performance for NMT, most existing works focus on modifying the framework of the NMT model and designing better attention mechanisms. There are few research works on the optimization methods for model parameters. Wiseman and Rush (2016) propose a model using a beam search training scheme to obtain sequence-level scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage while preserving the proven model architecture of seq2seq (sequence-to-sequence) and its efficient training approach. Hoang, Haffari, and Cohn (2017) convert decoding from a discrete optimization problem to a continuous optimization problem. The main difference between our work and these studies is that we focus on improving the NMT training process from another perspective. As far as we know, CLR has not been applied to NMT; its methodology, best practices, and experiments are mainly based on results from CV architectures and datasets. It is by no means apparent or straightforward that the same approach can be directly carried over to NMT. Therefore, we explore the use of a cyclical learning rate policy for NMT.

3. Neural machine translation

NMT models are implemented using an encoder-decoder framework with attention mechanisms to learn a mapping between two sequences of symbolic representations. Given a sentence pair (X, Y), X is the source sentence and Y is the target sentence, where $X = (x_1, x_2, ..., x_{m})$ and $Y = (y_1, y_2, ..., y_{n})$ are the words in the sentence pair. An NMT model contains two parts, an encoder and a decoder, which probabilistically decodes a target sequence given the encoded source sequence. The encoder transforms the source sentence X into context vectors, and the decoder generates the target translation Y from the context vectors by maximizing the probability $p(y_j|y_{<j})$ . During training, an NMT model factorizes the translation probability as:

(1) \begin{equation}p(y|x) = \prod_{j=1}^{n}p\left(y_j|y_{<j},x,\theta\right)\end{equation}

where $\theta$ denotes the model parameters and $y_{<j} = (y_1, y_2,...,y_{j-1})$ is the partial translation. The conditional probability of generating $y_j$ is produced by the decoder and represents its final output.

Given the sentence-aligned bilingual training data $ D = \{< X^{(k)},Y^{(k)} >\} _{k=1}^K$ , the cost function can be defined as the following conditional log-likelihood:

(2) \begin{equation}L( \theta ,D) = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{n} \log\left(p\left(y_j^{(k)}|y_{<j}^{(k)},X^{(k)}, \theta \right)\right)\end{equation}

in which the corresponding gradient $ \nabla L( \theta ,D)$ can be calculated as the sum of the gradients of the sentences in minibatch B:

(3) \begin{equation}\nabla L( \theta ,D) = \sum_{k=1}^{B} \nabla L\left( \theta,\left(X^{(k)},Y^{(k)}\right)\right)\end{equation}

The gradients of each sentence in a parallel corpus can be calculated as a sum of gradients per step:

(4) \begin{equation}\nabla L\Big( \theta,\Big(X^{(k)},Y^{(k)}\Big)\Big) = \sum_{j=1}^{n} \nabla \log\left( p\left(y_j^{(k)}|y_{<j}^{(k)},X^{(k)}, \theta \right)\right)\end{equation}

We can then use maximum likelihood estimation with SGD or Adam optimization strategy and backpropagation to get the optimal parameters. For SGD, the calculation with learning rate $\eta$ is:

(5) \begin{equation}\theta \leftarrow \theta - \eta \times \nabla L( \theta , D)\end{equation}

and the final parameter optimization method can be defined as follows:

(6) \begin{equation}\theta \leftarrow \theta - \eta \times \sum_{k=1}^{B} \sum_{j=1}^{n} \nabla \log\left( p\left(y_j^{(k)}|y_{<j}^{(k)},X^{(k)}, \theta \right)\right)\end{equation}

For Adam, the update with learning rate $\eta$ is as follows, where $\beta _1$ (set to 0.9) controls the exponential moving average of the gradient, $\beta _2$ (set to 0.99) controls the RMSprop-style second-moment term, and t denotes the current update step:

(7) \begin{equation}g = \frac{1}{|B|} \sum_{k \in B} \nabla L\left( \theta,\left(X^{(k)},Y^{(k)}\right)\right)\end{equation}
(8) \begin{equation}m \leftarrow \beta _1 m + \left(1 - \beta _1 \right) g\end{equation}
(9) \begin{equation} v \leftarrow \beta _2 v + \left(1 - \beta _2 \right) g ^2\end{equation}
(10) \begin{equation} \widehat{m} \leftarrow \frac{m}{1 - \beta _1^{t}}\end{equation}
(11) \begin{equation}\widehat{v} \leftarrow \frac{v}{1 - \beta _2^{t}}\end{equation}
(12) \begin{equation} \theta \leftarrow \theta - \eta \times \frac{ \widehat{m} }{ \sqrt{ \widehat{v} } + \epsilon }\end{equation}
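To make the two update rules concrete, the following is a minimal NumPy sketch of the SGD update in Equation (5) and the Adam update in Equations (7)-(12); the toy quadratic loss in the usage example is an illustration only and not part of the NMT setup.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Plain SGD update of Equation (5): theta <- theta - lr * grad."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr, beta1=0.9, beta2=0.99, eps=1e-8):
    """Adam update of Equations (7)-(12), with bias correction at step t."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # Eq. (8)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2     # Eq. (9)
    m_hat = state["m"] / (1 - beta1 ** state["t"])                # Eq. (10)
    v_hat = state["v"] / (1 - beta2 ** state["t"])                # Eq. (11)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)            # Eq. (12)

# Usage with a toy quadratic loss L(theta) = ||theta||^2 / 2, whose gradient is theta.
theta = np.ones(4)
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
for _ in range(100):
    theta = adam_step(theta, theta.copy(), state, lr=1e-2)
```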

4. The proposed approach

Our learning rate policy for NMT is based on CLR’s triangular learning rate policy. Figure 1 depicts the learning rate decay policy, that is, the way the learning rate changes over the training epochs. For various optimizers, the learning rate is usually decayed to a small value to ensure convergence. Commonly used decay schemes include the piece-wise constant step function and the inverse (reciprocal) square root schedule. This study adopts two learning rate decay policies:

  • Fixed decay (shrinking) policy where the max learning rate is halved after each learning rate cycle, and

  • No decay. This is unusual because for both SGD and adaptive momentum optimizers, a decay policy is required to ensure convergence.

Figure 1. The learning rate decay used in our experiments.

Our adopted learning rate decay policy is interesting because the experiments in Smith (2017) showed that using a decay rate is detrimental to the resultant accuracy.

The CLR decay policy should be contrasted with the standard inverse square root policy (INV) commonly used in deep learning platforms, for example, in fairseq (Ott et al. 2019). INV typically starts with a warm-up phase in which the learning rate is linearly increased to a maximum value; the learning rate is then decayed from this maximum as the reciprocal of the square root of the number of epochs.
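For reference, a common formulation of such a warm-up plus inverse square root schedule is sketched below; the peak learning rate and warm-up length are illustrative values, and the exact constants used by fairseq's implementation may differ.

```python
def inv_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5
```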

Given the base learning rate and the maximum learning rate, the CLR update is as follows:

(13) \begin{equation}c = \left\lfloor 1 + \frac{i}{2 \times s} \right\rfloor\end{equation}
(14) \begin{equation}x = \left| \frac{i}{s} - 2 \times c + 1\right|\end{equation}
(15) \begin{equation}\eta = \eta_{base} + ( \eta_{\max} - \eta_{base} ) \times \max(0 , 1 - x)\end{equation}

where i is the current training iteration, s is the step size denoting half the period or cycle length, c denotes the current cycle, and $\eta_{\max}$ and $\eta_{base}$ represent the maximum and base learning rates, respectively. This policy varies the learning rate linearly between the minimum $\eta_{base}$ and the maximum $\eta_{\max}$ . The calculation above corresponds to the no decay policy. For the fixed decay policy, the maximum learning rate is no longer a constant: $\eta_{\max}$ is halved, that is, multiplied by the decay rate d, after each learning rate cycle:

(16) \begin{equation}\eta_{\max} = \eta_{\max} \times d\end{equation}

The learning rate is subsequently applied in the gradient calculation to update the parameters of the NMT model. Compared with the traditional NMT backpropagation process, CLR varies the learning rate within a range of values rather than adopting a stepwise fixed or exponentially decreasing value. A short run of only a few epochs is sufficient to estimate the boundary learning rates of the CLR policies in a range test (Section 4.2). A policy in which the learning rate varies cyclically between these bounds is then adequate to obtain near-optimal results, often with fewer iterations. Unlike adaptive learning rate methods, this policy is easy to implement and incurs essentially no additional computational expense.
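A minimal sketch of the triangular policy in Equations (13)-(16), with the optional fixed decay ("shrink") of the maximum learning rate folded in as a per-cycle multiplier; variable names follow the notation above.

```python
import math

def triangular_lr(i, s, base_lr, max_lr, decay=1.0):
    """Cyclical (triangular) learning rate, Equations (13)-(16).

    i: current training iteration; s: step size (half the cycle length);
    decay: per-cycle multiplier for max_lr (1.0 = no shrink, 0.5 = halve each cycle).
    """
    c = math.floor(1 + i / (2 * s))                # current cycle, Eq. (13)
    x = abs(i / s - 2 * c + 1)                     # position within the cycle, Eq. (14)
    cycle_max = max_lr * decay ** (c - 1)          # closed form of the per-cycle shrink, Eq. (16)
    return base_lr + (cycle_max - base_lr) * max(0.0, 1.0 - x)   # Eq. (15)
```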

4.1 CLR-based NMT

Algorithm 1 summarizes the training of the NMT model using CLR. First, we train the NMT model with the base learning rate $\eta_{base}$ on the range test set for some number of epochs $n_1$ , deriving the maximum learning rate $\eta_{\max}$ . The parameters of the NMT model are subsequently updated using the CLR policy defined by $\eta_{\max}$ and $\eta_{base}$ . If the current iteration i is a multiple of the cycle length C, which is twice the step size, $\eta_{\max}$ is updated by multiplying it with the decay rate d.

Algorithm 1 CLR-based NMT
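A minimal PyTorch-style sketch of this training procedure is given below; `model` and the `train_batches` iterator are hypothetical placeholders, the model call is assumed to return the training loss for a batch, and `triangular_lr` refers to the helper sketched above.

```python
import torch

def train_with_clr(model, train_batches, n_epochs, s, base_lr, max_lr, decay=1.0):
    """CLR-based NMT training: recompute the cyclical learning rate before every step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.99))
    i = 0  # global iteration counter
    for _ in range(n_epochs):
        for batch in train_batches:
            lr = triangular_lr(i, s, base_lr, max_lr, decay)
            for group in optimizer.param_groups:
                group["lr"] = lr               # override the optimizer's learning rate
            optimizer.zero_grad()
            loss = model(**batch)              # assumed to return the training loss
            loss.backward()
            optimizer.step()
            i += 1
    return model
```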

4.2 Learning rate range test

For CLR, some pertinent parameters need to be determined: the base learning rate ( $\eta_{base}$ ), the maximum learning rate ( $\eta_{\max}$ ), and the cycle length C. As suggested in the CLR paper, we perform a range test to set the base and maximum learning rates, while the cycle length is some multiple of the number of epochs. Without the range test, the base and maximum learning rates would need to be tuned as hyperparameters, which is difficult and time consuming. The purpose of the learning rate range test is to find the largest learning rate that does not cause divergence, since a large learning rate leads to faster model convergence than a small one.

In a range test, the network is trained for several epochs with the learning rate linearly increased from an initial rate. For instance, the range test for the IWSLT2014 (DE2EN) dataset was run for 35 epochs, with the initial learning rate set to a small value, for example, $1e^{-5}$ for Adam, and increased linearly over the 35 epochs. Given the range test curve, for example, Figure 2, the base learning rate ( $\eta_{base}$ ) is set to the point where the loss starts to decrease, while the maximum learning rate ( $\eta_{\max}$ ) is selected as the point where the loss starts to plateau or to increase. We observe qualitatively different range test curves for CV and NMT datasets, as can be seen from Figures 2 and 3. The CV range test curve is more clear-cut in terms of choosing the maximum learning rate from the point where the curve starts to become ragged. For NMT, the range curve is smoother and exhibits more of a plateau. As shown in Figure 2, the base learning rate is selected as the initial learning rate for the range test; this initial value should be small to ensure model convergence, otherwise searching for a maximum learning rate to speed up training convergence is meaningless. The maximum learning rate is the point where the loss stagnates. For the step size, we follow the guideline given in Smith (2017) of selecting a step size between 2 and 10 times the number of iterations in an epoch, and set the step size to 4.5 epochs. We also perform the range test for SGD, as shown in Figure A1 in Appendix A.
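A minimal sketch of such a range test, assuming a hypothetical `train_one_epoch(lr)` routine that trains for one epoch at the given learning rate and returns the resulting loss:

```python
def lr_range_test(init_lr, final_lr, n_epochs, train_one_epoch):
    """Linearly increase the learning rate each epoch and record the loss curve."""
    curve = []
    for epoch in range(n_epochs):
        lr = init_lr + (final_lr - init_lr) * epoch / (n_epochs - 1)
        loss = train_one_epoch(lr)   # assumed: trains one epoch at lr, returns the loss
        curve.append((lr, loss))
    # Plot loss vs. lr: eta_base ~ where the loss starts to decrease,
    # eta_max ~ where the loss starts to plateau or increase.
    return curve
```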

Figure 2. Range test curve for the IWSLT2014-de-en dataset, showing the chosen base and maximum learning rate for the triangular policy.

Figure 3. Range test curve for the CV CIFAR-100 dataset.

An intuitive understanding of why CLR methods work comes from considering the loss function topology of the NMT model. The difficulty in minimizing the loss arises from saddle points rather than poor local minima. Saddle points have small gradients that slow the learning process; however, increasing the learning rate allows more rapid traversal of saddle point plateaus. A more practical reason why CLR works is that, by following the methods in Section 5, the optimal learning rate is likely to lie between the chosen bounds, so near-optimal learning rates are used throughout training.

4.3 Batch size

The other point of interest is how to deal with the batch size when using CLR in NMT. While it is well known in the NMT community that a larger batch size improves performance, how CLR affects the batch size-performance relationship is one of the central themes of our study. Following the lead of Smith and Topin (2017), we look at how NMT tasks perform when the batch size is varied on top of the CLR policy. Compared to Smith and Topin (2017), we stretch the batch size range, going from a batch size as small as 256 to as high as 8192. Only by examining the extreme behaviors can we better understand the effect of batch size superimposed on CLR.

5. Experiments

The purpose of this section is to explore the effects of applying CLR and various batch sizes to train NMT models in a series of experiments.

5.1 Experiment settings

The experiments are performed on two translation directions (DE $\rightarrow$ EN and FR $\rightarrow$ EN) for IWSLT2014 and IWSLT2017 (Cettolo, Girardi, and Federico 2012).

The data are pre-processed using functions from Moses (Koehn et al. 2007). The punctuation is normalized into a standard format. After tokenization, byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016) is applied to the data to mitigate the adverse effects of out-of-vocabulary (OOV) rare words. Sentence pairs with a source-target length ratio greater than 1.5 are removed to reduce potential errors from sentence misalignment. Long sentences with a length greater than 250 are also removed, as is common practice. The split of the datasets produces the training, validation (valid.), and test sets with the number of sentence pairs and the vocabulary sizes presented in Table 1.
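A minimal sketch of the length-based filtering described above; the thresholds follow the text, while the inputs are assumed to be already tokenized lists of (sub)word tokens.

```python
def keep_pair(src_tokens, tgt_tokens, max_len=250, max_ratio=1.5):
    """Drop sentence pairs that are too long or badly length-mismatched."""
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = max(1, min(len(src_tokens), len(tgt_tokens)))
    return longer / shorter <= max_ratio
```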

Table 1. Datasets used for the experiment

The transformer architecture (Vaswani et al. 2017) from fairseq (Ott et al. 2019)Footnote b is used for all the experiments. The hyperparameters are presented in Table 2. We compare training under CLR against the inverse square root schedule for two popular optimizers used in machine translation tasks, Adam and SGD. All models are trained using one NVIDIA V100 GPU.

Table 2. Hyperparameters for the experiments

The learning rate boundaries of the CLR are selected by the range test (shown in Figure 2). The base and maximum (max) learning rates adopted in this study are presented in Table 3. A shrink strategy is applied when examining the effects of CLR in training NMT. The optimizers (Adam and SGD) are run with two options: (1) without shrink (“nshrink”); (2) with shrink at a rate of 0.5 (“yshrink”), which means the maximum learning rate for each cycle is reduced at a decay rate of 0.5.

Table 3. Learning rate boundary for CLR

5.2 Effects of applying CLR to NMT training

A hypothesis we hold is that NMT training under CLR may result in a better local minimum than that reached by training with the default learning rate schedule. A comparison experiment is performed in which NMT models are trained on the “IWSLT2014-de-en” corpus using CLR and INV with a range of initial learning rates on two optimizers (Adam and SGD), respectively. It can be observed that both Adam and SGD are very sensitive to the initial learning rate under the default INV schedule before CLR is applied (as shown in Figures 4 and 5). In general, SGD prefers a bigger initial learning rate when CLR is not applied, while the preferred initial learning rates of Adam are more concentrated towards the central range.

Applying CLR has positive impacts on NMT training for both Adam and SGD. When applied to SGD, CLR removes the need for a big initial learning rate, as it enables the optimizer to explore the local minima better. Shrinking on CLR for SGD is not desirable as a higher learning rate is required (Figure 5). Furthermore, we observe that applying CLR to SGD with the shrink option leads to a high convergence error (Figure 6). In this sense, the “nshrink” option is mandatory for SGD. It is noted that applying CLR to Adam produces consistent improvements regardless of the shrink option (Figure 4). Furthermore, it can be observed that the effects of applying CLR to Adam are more pronounced than those for SGD, as shown in Figure 6. Similar results are obtained from our experiments on the “IWSLT2017-de-en” and “IWSLT2014-fr-en” corpora (Figures A2 and A3 in Appendix A). The corresponding BLEU scores are presented in Table 4, in which the above-mentioned effects of CLR on Adam can also be established. Training takes fewer epochs to converge to a local minimum with better BLEU scores (i.e., bold fonts in Table 4).

Figure 4. A comparison study of training NMT models on IWSLT2014-de-en using CLR and INV with a range of initial learning rates on Adam. The learning rate policy “adam_cyc_nshink_5e-4” denotes that the optimizer Adam is trained under CLR with the no shrink option and a maximum learning rate of $5e^{-4}$ .

Figure 5. A comparison study of training NMT models on IWSLT2014-de-en using CLR and INV with a range of initial learning rates on SGD. The learning rate policy “sgd_cyc_yshink_5e-4” denotes that the optimizer SGD is trained under CLR with the shrink option and a maximum learning rate of $5e^{-4}$ .

5.3 Effects of batch size on CLR

Batch size is regarded as a significant factor influencing deep learning models in the various CV studies detailed in Section 1. It is well known to CV researchers that a large batch size is often associated with poor test accuracy.

Table 4. The best BLEU for various learning rate policies when training NMT models on IWSLT2014-de-en, IWSLT2017-de-en and IWSLT2014-fr-en. The total number of training epochs for all the experiments is 50. The table is sorted by the best BLEU in descending order

Figure 6. A view of effects of applying CLR to Adam and SGD when training the NMT on IWSLT2014-de-en.

Figure 7. Effects of various batch sizes when training the NMT on IWSLT2014-de-en corpus with CLR.

Figure 8. Effects of various batch sizes when training the NMT on IWSLT2014-de-en corpus with TRI, COS, SINE, and INV under different learning rates. The validation loss shown in this figure is the minimal validation loss when training each model with a different batch size and learning rate configuration. The training was run for a fixed number of epochs (50).

The situation is reversed in the case of NMT, as demonstrated in Popel and Bojar (2018), where the use of a larger batch size generally helps improve training BLEU. Figure 5 in Popel and Bojar (2018) illustrates the benefits of large batch sizes, showing steady training BLEU improvement going from a batch size of 1000 to 4500 and eventually plateauing at a batch size of 6000. In this paper, Figure 7 shows that, under CLR, a larger batch size for NMT training does indeed lead to better performance. Furthermore, once the batch size reaches a certain value, it is no longer a critical factor; for example, we observe similar validation losses when training with batch sizes of 4096 and 8192. The effect of the various learning rate strategies is shown in Figure 8. The various strategies share similarly good performance under large batch sizes, in line with Figure 5 in Popel and Bojar (2018). Interestingly, as the batch size decreases to values as small as 256, the differences among the strategies start to show. Both the triangular and sine no-shrink strategies show exceptional performance even at a batch size of 256, comparable to using a large batch size of 4096. Our findings in Figure 8 have a direct bearing on NMT training with constrained GPU resources, particularly GPU memory. We ran the same experiment on one GPU and on four GPUs and show the results in Table A1 (Appendix A). We find that the benefit of a large batch size has a ceiling effect when extending the experiment to $4096 \times 4$ GPUs: a batch size of 4096 on a single GPU produces slightly better results than those obtained with a batch size of $4096 \times 4$ GPUs. For memory-constrained GPU training, both the Adam triangular and sine “nshrink” strategies present themselves as promising learning rate scheduler candidates.

A further experiment is performed to unveil whether the effect of batch size on CLR is sensitive to the learning rate scheduler. We test four learning rate schedulers, namely the triangular (TRI), a full-phase cosine (COS),Footnote c a sine (SINE), and a standard INV, for batch sizes of 256, 1024, and 4096, on learning rates of $3e^{-4}$ and $5e^{-4}$ . A clear trend emerges from the results of this experiment (Figure 8). All four learning rate schedulers can achieve reasonably low validation loss using a large batch size (4096) given enough training epochs, and all three learning rate schedulers from the CLR family outperform INV at this batch size. The TRI learning rate scheduler is a consistent winner, performing well across batch size bands. The larger batch size effect afforded by CLR appears to be immune to the learning rate, although the larger learning rate ( $5e^{-4}$ ) is associated with a higher validation loss at the smaller batch size (256). An immediate question that may arise from Figure 8 is the significant performance difference between the sine and cosine forms. One possible explanation is that the sine form allows a warm-up, whereas the cosine form starts with a large learning rate before decaying. As many previous studies have shown, having a warm-up stage helps achieve better training results.
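For reference, the three cyclical shapes compared here can be generated from the same cycle position as sketched below; the exact phase conventions are assumptions, chosen so that TRI and SINE start at the base rate (warm-up first) while the full-phase COS starts at the maximum rate.

```python
import math

def cyclical_lr(i, s, base_lr, max_lr, shape="tri"):
    """Cycle position-based schedules: triangular, sine (warm-up first), cosine (decay first)."""
    phase = (i % (2 * s)) / (2 * s)              # position within the current cycle, in [0, 1)
    if shape == "tri":
        frac = 1.0 - abs(2.0 * phase - 1.0)      # 0 -> 1 -> 0 over one cycle
    elif shape == "sine":
        frac = math.sin(math.pi * phase)         # starts and ends at 0, peaks mid-cycle
    elif shape == "cos":
        frac = 0.5 * (1.0 + math.cos(2 * math.pi * phase))  # starts at 1, dips to 0, returns to 1
    else:
        raise ValueError(shape)
    return base_lr + (max_lr - base_lr) * frac
```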

6. Further analysis and discussions

An unsuitable maximum learning rate for the triangular policy of CLR-based NMT often leads to non-convergence. The learning rate range test illustrated in Section 4.2 is further discussed in this section to unveil the practical details. It is also worthwhile to validate the effect of CLR on a larger-scale dataset; to this end, we perform a comparison test training NMT models using Adam with a TRI and an INV scheduler on WMT2014-de-en. Finally, we discuss the rationale for why CLR works.

6.1 How to apply CLR to NMT training matters

As mentioned in Section 4.2, the base learning rate is required to be a small value to guarantee convergence during training. We believe that setting the minimum learning rate to $1e^{-5}$ is reasonable; it is a value that empirically ensures convergence when training transformer-based NMT (note that the observations and analysis are based on the IWSLT datasets). Training with a base learning rate of $1e^{-8}$ generates an almost identical result to that from $1e^{-5}$ (Figure 11). The experiment shows that training may not be sensitive to the selection of the base learning rate as long as its value ensures convergence.

Figure 9. The first $\eta_{\max}$ (MLR1) with “nshrink” is a preferable option for SGD when applying CLR to train NMT models on IWSLT2014-de-en.

A range test is performed to identify the maximum learning rate for the triangular policy of CLR (Figure 2). The experiments show that training is sensitive to the selection of $\eta_{\max}$ . As the range curve for training NMT models is distinct from that obtained in a typical computer vision case, it is not immediately clear how to choose $\eta_{\max}$ when applying CLR. A comparison experiment is performed trying $\eta_{\max}$ at different values. It can be observed that the first $\eta_{\max}$ (MLR1) is the preferable option for both SGD and Adam (Figures 9 and 10). Choosing a larger learning rate at the extreme right end as the triangular policy’s maximum learning rate often leads to the loss not converging due to gradient divergence. It is better to be more conservative and choose the point where the loss stagnates as the maximum learning rate for the triangular policy.

Figure 10. The first $\eta_{\max}$ (MLR1) is a preferable option for Adam when applying CLR to train NMT models on IWSLT2014-de-en.

Figure 11. Comparison of the base learning rate with values of $1e^{-5}$ and $1e^{-8}$ for the IWSLT2014-de-en dataset.

The “nshrink” option is mandatory for SGD, but this constraint can be relaxed for Adam. Adam is sensitive to an excessive learning rate, that is, the second $\eta_{\max}$ (MLR2). The pairing of CLR with Adam produces consistent improvements regardless of the shrink rate, but the size of the improvement differs across datasets. This indicates that the pre-defined shrink rate (i.e., 0.5) can be further optimized.

6.2 Validating the effect of CLR on NMT training in a large scale dataset

As shown in Table 5, the WMT2014-de-en data are used to train NMT models with Adam under a TRI and an INV learning rate scheduler. We perform a range test and obtain a maximum learning rate ( $\eta_{\max}$ ) of $1e^{-3}$ from the initial rate of $1e^{-5}$ . A step size of 4.5 epochs is chosen to set up the TRI learning rate scheduler with the shrink option. The experimental results demonstrate that the NMT model trained with the TRI learning rate scheduler reaches a lower validation loss than the model with the INV scheduler (Figure 12). The corresponding best BLEU scores for this comparison study are presented in Table 6. It can be observed that the NMT model trained with the TRI learning rate scheduler produces a higher BLEU score than the one using the INV scheduler, in far fewer epochs.

Table 5. WMT 2014 DE2EN dataset used for the experiment

Table 6. The best BLEU scores for training NMT models on WMT14-de-en dataset using TRI and INV learning rate schedulers

Figure 12. Comparison of the validation loss with a maximum learning rate of $1e^{-3}$ for the WMT2014-de-en dataset.

Figure 13. Loss surface, optimizer trajectory, and learning rates visualization for training NMT models on IWSLT2014-de-en.

6.3 Rationale behind applying CLR to NMT training

As pointed out in Dauphin, de Vries, and Bengio (2015), the difficulty in optimizing deep learning networks is due to saddle points, not local minima. One theoretical perspective on why CLR works is that the increasing learning rate helps the optimizer to escape from saddle point plateaus (Smith 2017). The other, more intuitive, reason is that the learning rates covered by CLR are likely to include the near-optimal learning rate that would ideally be used throughout training (Smith 2017). Encountering this near-optimal learning rate during the cycle may allow training to escape from a saddle point in which it is stuck.

The other benefit that CLR brings is fast convergence. The mental model we have is that of training sitting at the mouth of a valley containing a near-global minimum; the near-optimal learning rate encountered during the cycle allows training to make rapid progress towards the valley bottom. Figures 4 and 5 lend credence to this hypothesis, where we see CLR converging much faster than INV. Leveraging the visualization techniques proposed by Li et al. (2018), we take a peek at the error surface, optimizer trajectory, and learning rate. The first thing to note is the smoothness of the error surface. This is perhaps not so surprising given the abundance of skip connections in transformer-based networks. Referring to Figure 13(c), we see the cyclical learning rate greatly amplifying Adam’s learning rate in the flatter regions, while nearer the local minimum the cyclical learning rate policy does not harm convergence. This is in contrast to Figure 13(a) and (b), where, although the adaptive nature of the learning rate in Adam helps to move quickly across flatter regions, the effect is much less pronounced without the cyclical learning rate. Figure 13 not only gives credence to the hypothesis that the cyclical learning rate helps to escape saddle point plateaus but also demonstrates that the near-optimal learning rate is likely included in the cyclical learning rate policy.

Some explanation of Figure 13 is in order here. Following Li et al. (2018), we first assemble the network weight matrix by concatenating the network weights at each epoch as columns. We then perform a principal component analysis and use the first two components for plotting the loss landscape. Even though all three plots in Figure 13 seem to converge to the local minimum, bear in mind that this is only for the first two components, which account for 84.84%, 88.89%, and 89.5% of the variance, respectively. With the first two components accounting for a large portion of the variance, it is reasonable to use Figure 13 as a qualitative guide.
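A minimal sketch of this projection, assuming per-epoch snapshots of the flattened model weights; it uses scikit-learn's PCA and reports how much variance the first two components explain.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_trajectory(weight_snapshots):
    """weight_snapshots: list of 1-D arrays, the flattened weights saved at each epoch."""
    W = np.stack(weight_snapshots)      # shape: (n_epochs, n_parameters)
    W = W - W[-1]                       # express each snapshot relative to the final weights
    pca = PCA(n_components=2)
    coords = pca.fit_transform(W)       # 2-D optimizer trajectory for plotting
    explained = pca.explained_variance_ratio_.sum()
    # In the experiments above, the first two components account for roughly 85-90% of the variance.
    return coords, explained
```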

7. Conclusion

In this paper, we perform an empirical study exploring how the cyclical learning rate can be applied to train transformer-based neural networks for NMT. From the various experimental results, we have explored the use of CLR and demonstrated its benefits for transformer-based networks unequivocally. Not only does CLR help to improve the generalization capability in terms of test set results, but it also allows the use of a smaller batch size for training without adversely affecting generalization. Instead of just using default optimizers and learning rate policies, we hope to bring insights to the NMT community on the importance of choosing a suitable optimizer and learning rate policy, and on how to choose them.

Acknowledgements

We would like to thank Peng Yang for his support relating to the loss surface visualization.

A. Appendix

Figures A1–A3 and Table A1 are included in this Appendix.

Figure A1. Range test curve of SGD for the IWSLT2014-de-en dataset.

Table A1. The comparison results of one GPU and four GPUs with a batch size of 4096

Figure A2. Effects of applying CLR to training NMT on IWSLT2017-de-en.

Figure A3. Effects of applying CLR to training NMT on IWSLT2014-fr-en.

Footnotes

a The codes are available at https://github.com/Vicky-Wil/NLE_CLR_NMT.

c This scheduler uses a full-phase cosine wave. This is different from the one described in Loshchilov and Hutter (2016), which uses a half cosine wave and warm restarts.

References

Aarts, E.H.L. and Korst, J.H.M. (2003). Simulated annealing and Boltzmann machines. In Arbib M.A. (ed), Handbook of Brain Theory and Neural Networks (2nd ed). Cambridge, MA: MIT Press, pp. 1039–1044.
Bahdanau, D., Cho, K. and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Bengio Y. and LeCun Y. (eds), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
Bengio, Y., Louradour, J., Collobert, R. and Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009, ACM International Conference Proceeding Series, vol. 382, pp. 41–48.
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In 19th International Conference on Computational Statistics, COMPSTAT 2010, Paris, France, August 22–27, 2010 - Keynote, Invited and Contributed Papers, pp. 177–186.
Cettolo, M., Girardi, C. and Federico, M. (2012). WIT$^3$: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), Trento, Italy, pp. 261–268.
Chung, J., Gülçehre, Ç., Cho, K. and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv, abs/1412.3555.
Dauphin, Y., de Vries, H. and Bengio, Y. (2015). Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7–12, 2015, Montreal, Quebec, Canada, pp. 1504–1512.
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186.
Dinh, L., Pascanu, R., Bengio, S. and Bengio, Y. (2017). Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, pp. 1019–1028.
Duchi, J.C., Hazan, E. and Singer, Y. (2010). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 2121–2159.
He, K., Zhang, X., Ren, S. and Sun, J. (2015). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Hoang, C.D.V., Haffari, G. and Cohn, T. (2017). Decoding as continuous optimization in neural machine translation. arXiv preprint arXiv:1701.02854.
Hochreiter, S. and Schmidhuber, J. (1997a). Flat minima. Neural Computation 9, 1–42.
Hochreiter, S. and Schmidhuber, J. (1997b). Long short-term memory. Neural Computation 9, 1735–1780.
Huang, G., Liu, Z. and Weinberger, K.Q. (2016). Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269.
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M. and Tang, P.T.P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. ArXiv, abs/1609.04836.
Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, Prague, Czech Republic. Association for Computational Linguistics, pp. 177–180.
Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 84–90.
Li, H., Xu, Z., Taylor, G. and Goldstein, T. (2018). Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pp. 6391–6401.
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J. and Han, J. (2019). On the variance of the adaptive learning rate and beyond. ArXiv, abs/1908.03265.
Loshchilov, I. and Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Luo, L., Xiong, Y., Liu, Y. and Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. ArXiv, abs/1902.09843.
McCandlish, S., Kaplan, J., Amodei, D. and Team, O.D. (2018). An empirical model of large-batch training. ArXiv, abs/1812.06162.
Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D. and Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Demonstrations, pp. 48–53.
Popel, M. and Bojar, O. (2018). Training tips for the transformer model. Prague Bulletin of Mathematical Linguistics 110, 43–70.
Reddi, S.J., Kale, S. and Kumar, S. (2018). On the convergence of Adam and beyond. ArXiv, abs/1904.09237.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. ArXiv, abs/1609.04747.
Sennrich, R., Haddow, B. and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics, pp. 1715–1725.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
Smith, L.N. (2017). Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472.
Smith, L.N. and Topin, N. (2017). Super-convergence: Very fast training of residual networks using large learning rates. ArXiv, abs/1708.07120.
Smith, S.L., Kindermans, P., Ying, C. and Le, Q.V. (2018). Don’t decay the learning rate, increase the batch size. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings. OpenReview.net.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.U. and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., pp. 5998–6008.
Wiseman, S. and Rush, A.M. (2016). Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1–4, 2016, pp. 1296–1306.
Zhang, M.R., Lucas, J., Hinton, G.E. and Ba, J. (2019). Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp. 9593–9604.