I. INTRODUCTION
Deep neural networks (DNNs) have been changing the history of machine learning in terms of performance [Reference Schmidhuber1–Reference Silver4]. Although their high performance originates from their exponential expressive power owing to the depth [Reference Bengio5–Reference Raghu, Poole, Kleinberg, Ganguli and Sohl-Dickstein8], such deep networks are difficult to train owing to the so-called vanishing gradient. In fact, a classic feedforward network with 56 layers had a larger empirical risk than one with 20 layers [Reference He, Zhang, Ren and Sun9], implying that the network is not fully trained. To overcome this degradation problem, many heuristics have been proposed and some of them improved their performance. In particular, skip connections in the residual networks (ResNet) [Reference He, Zhang, Ren and Sun9,Reference He, Zhang, Ren and Sun10] and batch normalization (BN) [Reference Ioffe and Szegedy11] enable extremely deep NNs (1202 layers) to be trained with a small empirical risk and a small expected risk. In addition, a ResNet skipping two layers showed better performance than a ResNet skipping one layer or a standard feedforward neural network [Reference He, Zhang, Ren and Sun9].
In the case of the linear model, the expected risk and empirical risk have been theoretically evaluated. Model selection theory, such as AIC [Reference Akaike12] and MDL [Reference Grunwald13], evaluates the expected risk by measuring the gap between a trained model and the true model that generates the data using the Cramer–Rao bound. Convex analysis evaluates how fast the empirical risk decreases by calculating properties of the loss landscape such as strong convexity, Lipschitzness, and smoothness [Reference Bottou, Frank and Jorge14]. However, these theoretical analyses cannot be applied to recently proposed DNN techniques since DNNs have singular points [Reference Fukumizu, Akaho and Amari15,Reference Amari, Hyeyoung and Tomoko16] and are non-convex [Reference Kawaguchi17] even when their activation function is the identity. The singular points make the Fisher information matrix degenerate, so the Cramer–Rao bound does not hold, which implies that the classical model selection theory cannot be applied. The non-convexity also makes the convex analysis difficult to apply.
Despite this difficulty, the recent popularity of DNNs has promoted the development of the following new methodologies for theoretical analyses of DNN techniques [Reference Bousquet and Elisseeff18–Reference Furusho and Ikeda26]. One is to calculate the algorithmic stability to evaluate the generalization gap, defined as the difference between the expected risk and the empirical risk. The other is to calculate the eigenvalues of the Fisher information matrix of DNNs to evaluate how fast the empirical risk decreases around the minimal point.
Since this method is widely applicable and has succeeded in quantifying the effectiveness of skip connections and BN, this overview paper briefly introduces the method and shows how to apply it to DNN techniques.
II. PROBLEM FORMULATION
A) Samples for training
Let the training set be denoted by $S = \{ z(n) \}_{n=1}^N$, where each training example $z(n)$ consists of an input $x(n)$ and the corresponding target $y(n)$. Each example $z(n) = ( x(n), y(n) )$ is drawn independently and identically from a probability distribution $\mathcal {D}$ on the joint space $\mathcal {Z}$ of the input space $\mathcal {X}$ and the output space $\mathcal {Y}$. Note that the indices of the examples are omitted if they are clear from the context.
B) Training of deep neural network
The DNN $f: \mathcal {X} \times \Theta \rightarrow \mathcal {Y}$ with parameters $\theta \in \Theta$ predicts the corresponding target $y \in \mathcal {Y}$ for a given input $x \in \mathcal {X}$, where Θ is the parameter space. Its performance is measured on the basis of the expected risk,
$$R(\theta ) = \mathbb {E}_{z \sim \mathcal {D}} \left[ \ell (z, \theta ) \right],$$
where $\ell (z, \theta ) = \frac {1}{2} \left\Vert f(x;\theta ) - y \right\Vert ^2$ is the squared loss. The parameters are trained by the gradient descent (GD),
$$\theta _{t+1} = \theta _t - \eta \nabla _{\theta } R_S(\theta _t),$$
to minimize the empirical risk,
$$R_S(\theta ) = \frac {1}{N} \sum _{n=1}^{N} \ell (z(n), \theta ),$$
instead of the expected risk because the data distribution $\mathcal {D}$ is not known, where $\theta _t$ and η denote the output of the GD at the tth update and the learning rate, respectively. Note that the parameters $\theta _0$ are initialized according to the method specified in each subsequent analysis section.
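As a minimal sketch of the GD update on the empirical risk, the following trains a one-parameter toy model $f(x;\theta ) = \theta x$ with the squared loss. The model, the data distribution in `sample_data`, the learning rate, and the number of steps are all assumptions made for illustration and are not taken from the analyses in this paper.

```python
import random


def squared_loss(theta, x, y):
    """Squared loss 0.5 * (f(x; theta) - y)^2 for the toy model f(x; theta) = theta * x."""
    return 0.5 * (theta * x - y) ** 2


def empirical_risk(theta, samples):
    """Average loss over the training set S."""
    return sum(squared_loss(theta, x, y) for x, y in samples) / len(samples)


def empirical_grad(theta, samples):
    """Gradient of the empirical risk with respect to theta."""
    return sum((theta * x - y) * x for x, y in samples) / len(samples)


def gradient_descent(samples, theta0=0.0, eta=0.1, steps=200):
    """Plain GD: theta_{t+1} = theta_t - eta * grad R_S(theta_t)."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * empirical_grad(theta, samples)
    return theta


def sample_data(n, rng, true_theta=2.0, noise=0.1):
    """Hypothetical data distribution D: x ~ N(0, 1), y = true_theta * x + Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        data.append((x, true_theta * x + rng.gauss(0.0, noise)))
    return data


rng = random.Random(0)
train = sample_data(50, rng)
theta_hat = gradient_descent(train)
```

The fitted `theta_hat` should land close to the slope used to generate the data, and the empirical risk at `theta_hat` should be lower than at the initial point.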
C) Decomposition of expected risk
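The expected risk $R(\theta _t)$ is decomposed into two components,
$$R(\theta _t) = \underbrace {R(\theta _t) - R_S(\theta _t)}_{\text {generalization gap}} + \underbrace {R_S(\theta _t)}_{\text {empirical risk}}.$$
The generalization gap measures the difference between the expected risk and empirical risk, while the empirical risk reflects how well the GD has optimized the parameters after t updates. Recent analytical techniques evaluate each of them as described next.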
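The decomposition can be illustrated numerically. The sketch below assumes a hypothetical one-parameter model $f(x;\theta ) = \theta x$ and a synthetic distribution (the function `draw`), and it approximates the expected risk by the average loss on a large fresh sample, since the true distribution is unknown in practice.

```python
import random


def risk(theta, samples):
    """Average squared loss 0.5 * (theta * x - y)^2 over the given samples."""
    return sum(0.5 * (theta * x - y) ** 2 for x, y in samples) / len(samples)


def draw(n, rng):
    """Hypothetical distribution D: x ~ N(0, 1), y = 2 x + N(0, 0.5^2)."""
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5))
            for x in (rng.gauss(0.0, 1.0) for _ in range(n))]


rng = random.Random(1)
train = draw(20, rng)
fresh = draw(20000, rng)  # large fresh sample used as a Monte Carlo proxy for D

# Least-squares minimizer of the empirical risk for this one-parameter model.
theta = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

empirical = risk(theta, train)          # R_S(theta)
expected_estimate = risk(theta, fresh)  # Monte Carlo estimate of R(theta)
gap = expected_estimate - empirical     # estimated generalization gap
```

By construction `gap + empirical` recovers `expected_estimate`; the sign and size of `gap` depend on the particular training sample.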
III. GENERALIZATION GAP
A) Formulation of ResNets
We evaluate the effectiveness of ResNets by deriving upper bounds of the generalization gaps of the following linear DNNs: $f: \mathbb {R}^D \times \Theta \rightarrow \mathbb {R}^D$, where θ denotes the parameters of each NN.
MLP:
ResNet1:
ResNet2:
Although these DNNs are linear with respect to the input x and have the same expressive ability, they are nonconvex with respect to the parameter θ and have different parameter representations.
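To make the difference in parameterization concrete, the following sketch implements three linear networks under an assumed form: a plain product of weight matrices for the MLP, blocks $(W^l + I)$ for a ResNet that skips one layer, and blocks $(W^{l,2} W^{l,1} + I)$ for a ResNet that skips two layers. These forms and all function names are assumptions for illustration and may differ in detail from the definitions used in the analysis.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Matrix-matrix product A @ B for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add_identity(W):
    """Return W + I for a square matrix W."""
    return [[w + (1.0 if i == j else 0.0) for j, w in enumerate(row)]
            for i, row in enumerate(W)]


def mlp(x, weights):
    """Assumed form: f(x) = W^L ... W^1 x."""
    for W in weights:
        x = matvec(W, x)
    return x


def resnet1(x, weights):
    """Assumed form: each block maps x to (W^l + I) x, skipping one layer."""
    for W in weights:
        x = matvec(add_identity(W), x)
    return x


def resnet2(x, weight_pairs):
    """Assumed form: each block maps x to (W^{l,2} W^{l,1} + I) x, skipping two layers."""
    for W1, W2 in weight_pairs:
        x = matvec(add_identity(matmul(W2, W1)), x)
    return x


zero = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, -2.0]
out_mlp = mlp(x, [zero, zero])
out_res1 = resnet1(x, [zero, zero])
out_res2 = resnet2(x, [(zero, zero)])
```

With all weights set to zero, the MLP maps every input to zero while both ResNets reduce to the identity map, which shows that the same linear function corresponds to different points in parameter space for each architecture.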
B) Algorithmic stability
A training algorithm $\mathcal {A}$ receives the training set S and outputs a trained model $\mathcal {A}(S)$. The algorithmic stability measures how much the removal of one example $z(n)$ from the training set S affects the trained model $\mathcal {A}(S^n)$ in terms of the expected loss, where $S^n = S {\setminus} z(n)$ (Fig. 1).
Definition 1 (Definition 4 in [Reference Bousquet and Elisseeff18])
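The training algorithm $\mathcal {A}$ is pointwise hypothesis stable if there exists $\epsilon _{stab}$ such that $\forall n \in [N],$
$$\mathbb {E}_{\mathcal {A}, S} \left[ \left\vert \ell (z(n), \mathcal {A}(S)) - \ell (z(n), \mathcal {A}(S^n)) \right\vert \right] \leq \epsilon _{stab},$$
where the expectation is taken with respect to the randomness of the algorithm $\mathcal {A}$ and the training set S.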
A stable algorithm $\mathcal {A}$ with small $\epsilon _{stab}$ outputs a trained model with a small generalization gap in the framework of statistical learning theory.
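The leave-one-out idea behind the stability can be sketched empirically. The example below assumes a toy training algorithm (a least-squares fit of $f(x;\theta ) = \theta x$) and a synthetic data source `draw`, both hypothetical, and averages the change in the loss at $z(n)$ when $z(n)$ is removed from the training set. It is only an empirical proxy for $\epsilon _{stab}$, not the estimator used in the experiments of this paper.

```python
import random


def fit(samples):
    """Toy training algorithm A: least-squares fit of f(x; theta) = theta * x."""
    return sum(x * y for x, y in samples) / sum(x * x for x, _ in samples)


def loss(theta, z):
    """Squared loss of the toy model on one example z = (x, y)."""
    x, y = z
    return 0.5 * (theta * x - y) ** 2


def loo_stability(samples):
    """Average |loss(z(n), A(S)) - loss(z(n), A(S^n))| over all n."""
    theta_full = fit(samples)
    total = 0.0
    for n, z in enumerate(samples):
        reduced = samples[:n] + samples[n + 1:]  # S^n = S \ z(n)
        total += abs(loss(theta_full, z) - loss(fit(reduced), z))
    return total / len(samples)


def draw(n, rng):
    """Hypothetical distribution D: x ~ N(0, 1), y = 2 x + N(0, 0.5^2)."""
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5))
            for x in (rng.gauss(0.0, 1.0) for _ in range(n))]


rng = random.Random(0)
data = draw(400, rng)
eps_small_n = loo_stability(data[:20])
eps_large_n = loo_stability(data)
```

For this toy algorithm the estimate typically shrinks as the training set grows, since removing one example out of many changes the fitted parameter only slightly.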
Theorem 1 (Theorem 11 in [Reference Bousquet and Elisseeff18])
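If the training algorithm $\mathcal {A}$ is pointwise hypothesis stable, the following holds with probability at least $1-\delta :$
$$R(\mathcal {A}(S)) \leq R_S(\mathcal {A}(S)) + \sqrt {\frac {M^2 + 12 M N \epsilon _{stab}}{2 N \delta }},$$
where R and $R_S$ denote the expected risk and the empirical risk, respectively, and M is an upper bound of the loss function.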
The algorithmic stability $\epsilon _{stab}$ of the GD depends on the flatness of the loss landscape around a global minimum.
Definition 2 (Definition 4 in [Reference Charles and Papailiopoulos19])
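The empirical risk $R_S$ satisfies the Polyak–Lojasiewicz $($PL$)$ condition with a constant μ if the following holds for all $\theta \in \Theta$:
$$\frac {1}{2} \left\Vert \nabla _{\theta } R_S(\theta ) \right\Vert ^2 \geq \mu \left( R_S(\theta ) - R_S(\theta _*) \right),$$
where $\theta _*$ is a global minimum.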
Here, the constant μ for the PL condition expresses the flatness of the loss landscape around a global minimum. If the empirical risk $R_S$ satisfies the PL condition with μ and is β-smooth, that is, the gradient is β-Lipschitz, then the following inequality holds:
where $\prod _{\Theta _*}(\theta _t)$ is the projection of $\theta _t$ on the set of global minima $\Theta _*$. This shows that an excess risk is smaller than a quadratic function of parameters and that the constant μ controls its flatness (Fig. 2).
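As a small numerical check, the sketch below tests the usual form of the PL inequality, $\frac {1}{2} \Vert \nabla R(\theta ) \Vert ^2 \geq \mu ( R(\theta ) - R(\theta _*) )$, on a toy quadratic risk. The quadratic, its curvatures, and the sampled points are assumptions for illustration. For such a quadratic the inequality holds with μ equal to the smallest curvature and fails for larger constants.

```python
import random


def risk(theta, curvatures):
    """Toy risk R(theta) = 0.5 * sum_i lambda_i * theta_i^2, minimized at theta = 0."""
    return 0.5 * sum(l * t * t for l, t in zip(curvatures, theta))


def grad(theta, curvatures):
    """Gradient of the toy risk: (lambda_i * theta_i)_i."""
    return [l * t for l, t in zip(curvatures, theta)]


def pl_holds(theta, curvatures, mu, tol=1e-9):
    """Check 0.5 * ||grad R||^2 >= mu * (R(theta) - R(theta_*)), with R(theta_*) = 0."""
    g = grad(theta, curvatures)
    return 0.5 * sum(gi * gi for gi in g) + tol >= mu * risk(theta, curvatures)


curvatures = [0.5, 2.0, 8.0]  # hypothetical eigenvalues of the quadratic
rng = random.Random(0)
points = [[rng.uniform(-3.0, 3.0) for _ in curvatures] for _ in range(1000)]

# mu = smallest curvature: the inequality holds at every sampled point.
ok_at_min_curvature = all(pl_holds(p, curvatures, mu=min(curvatures)) for p in points)
# mu = largest curvature: the inequality is violated at some sampled point.
ok_at_max_curvature = all(pl_holds(p, curvatures, mu=max(curvatures)) for p in points)
```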
A training algorithm has better stability if it converges faster and its loss function has flatter minima.
Theorem 2 (Theorem 3 in [Reference Charles and Papailiopoulos19])
Suppose that the empirical risk $R_S$ satisfies the PL condition with μ and the loss function is α-Lipschitz. If the training algorithm $\mathcal {A}$ converges to a global minimum $\theta _*$ with $\left\Vert \theta _t - \theta _* \right\Vert \leq \epsilon _{t},$ it is pointwise hypothesis stable, that is,
$$\epsilon _{stab} \leq 2 \alpha \epsilon _t + \frac {2 \alpha ^2}{\mu (N-1)}.$$
C) Upper bounds of the generalization errors
We applied the above stability analysis to MLP, ResNet1, and ResNet2 under Assumptions 1 and 2 [Reference Furusho, Liu and Ikeda24].
Assumption 1 The input correlation matrix is the identity, $\sum _{(x,y) \in S} x x^T = I$.
Assumption 2 The eigenvalues of the output–input correlation matrix $\sum _{(x,y) \in S} y x^T$ are greater than one.
These assumptions are rather weak since a dataset satisfies Assumption 1 if it is preprocessed by principal component analysis (PCA) whitening. In addition, the PCA-whitened MNIST dataset satisfies Assumption 2.
Theorem 3 (Theorems 3 and 4 in [Reference Furusho, Liu and Ikeda24])
Initialize the linear DNNs by orthogonal initialization [Reference Saxe, McClelland and Ganguli27]. Then, under Assumptions 1 and 2, ResNet2 has flatter minima, as shown in Table 1, where $a_{\min }$ and $a_{\max }$ are the minimum and maximum singular values of the weights, during training, respectively, and C is a constant $($Fig. 3$)$. In addition, its parameters converge slower than the other DNNs, as shown in Table 2, where γ is the minimum singular value of the transform by the layer during training.
Remark 1 Theorem 3 implies that ResNet2 has a smaller generalization gap than MLP or ResNet1 when the parameters are updated by the GD a sufficient number of times.
D) Numerical experiments
To confirm the validity of the above analyses and the applicability to the DNNs with the ReLU activation function, $\phi (\cdot ) = \max \{ 0, \cdot \}$, some numerical experiments were carried out. The dataset was the MNIST dataset [Reference LeCun, Bottou, Bengio and Haffner28] after PCA whitening, so that $\forall d \in [D], \mathbb {E} [ x_d ] = 0$ and ${\rm Var} (x_d)=1$, and projection into the principal subspace of 10 dimensions.
We initialized the DNNs with 10 hidden units in each layer by the orthogonal initialization and trained these by the GD. During the training, the training loss, the test loss, and the approximate value of the stability $\epsilon _{stab}$,
were calculated every five updates (Figs. 4 and 5). The results show that ResNet2 had a greater stability and a smaller generalization gap than the other DNNs and that the analyses are valid even for the DNNs with the ReLU activation.
IV. EMPIRICAL RISK
A) Formulation of batch normalization
We evaluate the effectiveness of BN by deriving the empirical risk of the following DNNs with the ReLU activation $\phi (\cdot ) = \max \{ 0, \cdot \}$:
ResNet:
ResNet with BN:
Here, $h^0 = W^0 x$ is the projection of the input, and the expectation of the BN is taken with respect to the input in the batch of the GD. Without loss of generality, the projection matrix is a square matrix initialized by Xavier initialization [Reference Glorot and Bengio29], and the inputs in the training set are normalized to $\mathbb {E}_{x} [ x_i ] = 0$ and ${\rm Var} ( x_i ) = 1$.
B) Hessian and Fisher information matrices
The empirical risk $R_S(\theta _t)$ is approximated by the second-order Taylor expansion around the minima $\theta _*$,
$$R_S(\theta _t) \approx R_S(\theta _*) + \frac {1}{2} (\theta _t - \theta _*)^T H(\theta _*) (\theta _t - \theta _*),$$
where $H(\theta _*) = \nabla _{\theta } \nabla _{\theta } R_S(\theta _*)$ is the Hessian matrix and the first-order term vanishes because the gradient is zero at $\theta _*$. The Hessian matrix is decomposed as $H(\theta _*) = U \Lambda U^T$, where U and Λ are a unitary square matrix comprising the eigenvectors and a diagonal matrix filled with the eigenvalues, respectively, which simplifies the empirical risk to
$$R_S(\theta _t) \approx R_S(\theta _*) + \frac {1}{2} v_t^T \Lambda v_t,$$
where $v_t = U^T (\theta _t - \theta _*)$, and the GD to
$$v_{t+1} = (I - \eta \Lambda ) v_t.$$
Let $\lambda _{\rm min}$ and $\lambda _{\rm max}$ be the minimum and maximum eigenvalues of $H(\theta _*)$, respectively. The GD converges when the learning rate satisfies $\eta < 2/\lambda _{\rm max}$ and it converges fastest when the learning rate is $\eta = 1/\lambda _{\rm max}$. The fastest convergence rate is the reciprocal of the condition number, $\lambda _{\rm max} / \lambda _{\rm min}$ [Reference LeCun, Bottou, Orr and Muller21], as is well known in adaptive filtering theory [Reference Haykin30]. However, the Hessian matrix and its eigenvalues are difficult to calculate owing to the complicated structure of DNNs.
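In the eigenbasis of the Hessian, the GD on the quadratic approximation acts on each coordinate independently, multiplying $v_{t,i}$ by $(1 - \eta \lambda _i)$ at every update. The sketch below iterates this recursion for an assumed spectrum (the values in `eigenvalues` are hypothetical) to illustrate that the excess risk shrinks for a learning rate below $2/\lambda _{\rm max}$ and grows just above it.

```python
def gd_on_quadratic(eigenvalues, eta, steps=100, v0=1.0):
    """Iterate v_{t+1,i} = (1 - eta * lambda_i) * v_{t,i}.

    Returns the excess risk 0.5 * sum_i lambda_i * v_i^2 after the given number of steps.
    """
    v = [v0 for _ in eigenvalues]
    for _ in range(steps):
        v = [(1.0 - eta * lam) * vi for lam, vi in zip(eigenvalues, v)]
    return 0.5 * sum(lam * vi * vi for lam, vi in zip(eigenvalues, v))


eigenvalues = [0.1, 1.0, 10.0]  # hypothetical Hessian spectrum
lam_max = max(eigenvalues)

initial = gd_on_quadratic(eigenvalues, eta=0.0, steps=0)    # excess risk before training
stable = gd_on_quadratic(eigenvalues, eta=1.0 / lam_max)    # inside eta < 2 / lam_max
unstable = gd_on_quadratic(eigenvalues, eta=2.1 / lam_max)  # just outside the threshold
```

In the stable run the direction with the smallest eigenvalue decays slowest, which is why a large condition number slows convergence even when the learning rate is within the stable range.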
Recently, the Fisher information matrix (FIM) of $p(x,y;\theta )$,
has been found to approximate the Hessian matrix of a DNN, $f(x;\theta )$, where $p(x,y;\theta ) = p(x) p(y \vert x;\theta )$, $p(y \vert x; \theta ) = \mathcal {N} ( f(x;\theta ), 1 )$, and $p(x)$ is the probability of the input. In this case, the FIM is rewritten as
and the following holds:
the second term of which is negligible when the error is small. In addition, the eigenvalues of the FIM of a sufficiently wide neural network do not change during training [Reference Jacot, Gabriel and Hongler31,Reference Karakida, Akaho and Amari32].
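The relation between the Hessian and the FIM can be checked on a toy model. The sketch below assumes a scalar model $f(x;\theta ) = \theta ^2 x$ with the unit-variance Gaussian output model, for which the FIM reduces to the mean of $(\partial f / \partial \theta )^2$ and the Hessian of the squared-loss risk equals the FIM plus a term proportional to the residual. The model, inputs, and targets are hypothetical and chosen so that the error vanishes at `theta_star`.

```python
def fisher(theta, xs):
    """FIM for p(y|x) = N(f(x; theta), 1) with f = theta^2 * x: mean of (df/dtheta)^2."""
    return sum((2.0 * theta * x) ** 2 for x in xs) / len(xs)


def hessian(theta, xs, ys):
    """Second derivative of the empirical risk 0.5 * mean (theta^2 * x - y)^2."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = theta * theta * x - y
        # Gauss-Newton part (same as the FIM term) plus the residual-weighted curvature of f.
        total += (2.0 * theta * x) ** 2 + residual * 2.0 * x
    return total / len(xs)


xs = [-2.0, -1.0, 0.5, 1.0, 1.5]
theta_star = 1.5
ys = [theta_star ** 2 * x for x in xs]  # noiseless targets: zero error at theta_star

gap_at_minimum = abs(hessian(theta_star, xs, ys) - fisher(theta_star, xs))
gap_away = abs(hessian(0.5, xs, ys) - fisher(0.5, xs))
```

At `theta_star` the two quantities coincide because every residual is zero, whereas away from the minimum the residual term makes them differ.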
C) Bounds of the empirical risk
We calculated the eigenvalues of the FIMs of the naive ResNet and the ResNet with BN averaged over the random He initialization [Reference He, Zhang, Ren and Sun33] (expected FIM) under Assumptions 3 and 4 [Reference Furusho and Ikeda25].
Assumption 3 The forward signal $u^l_i$ is independent of the backward error signal ${\partial f(x;\theta )}/{\partial h^l_i}$.
Assumption 4 Half of the hidden units per layer are active, that is, $\phi ' (h^l_i) = 1$ for half of the units in each layer.
Although Assumption 3 is rather unrealistic, some theorems have been derived on the basis of Assumption 3 and their results were in agreement with those of numerical experiments [Reference Karakida, Akaho and Amari22,Reference Poole, Lahiri, Sohl-Dickstein and Ganguli34,Reference Yang and Schoenholz35]. In addition, the binary class PCA-whitened MNIST dataset satisfies Assumption 4 (Fig. 6).
Theorem 4 (Modification of Table 1 in [Reference Furusho and Ikeda25])
Under Assumptions 3 and 4, the maximum eigenvalue $\lambda _{\rm max}$ of the expected FIM of the ResNet grows exponentially with the depth,
where $m_{\lambda }$ is the mean of all the eigenvalues $\{ \lambda _i \}_{i=1}^{(L+1) D^2}$.
Remark 2 The learning rate of the GD must be exponentially small with respect to the depth of the ResNet for convergence of the parameters to the minima.
Theorem 5 (Modification of Table 1 in [Reference Furusho and Ikeda25])
Under Assumptions 3 and 4, BN relaxes the exponential growth of the eigenvalue to $L \log L$ order at most,
where $H_L = \sum _{k=1}^{L} \frac {1}{k}$ is the harmonic number.
Remark 3 BN enables the GD to use a larger learning rate than that of the ResNet for convergence of the parameters to the minima.
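The $L \log L$ order follows from the growth of the harmonic number, which satisfies $H_L \leq 1 + \log L$ and approaches $\log L$ plus the Euler–Mascheroni constant for large L. The sketch below only checks these properties of $H_L$ numerically; it does not reproduce the bound of Theorem 5, and the function name `harmonic` is chosen here for illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, truncated


def harmonic(L):
    """Harmonic number H_L = sum_{k=1}^{L} 1 / k."""
    return sum(1.0 / k for k in range(1, L + 1))


depths = [1, 10, 100, 1000]
# L * H_L compared with the envelope L * (1 + log L), which is of order L log L.
table = [(L, L * harmonic(L), L * (1.0 + math.log(L))) for L in depths]
```

Each entry of `table` pairs a depth with $L H_L$ and its $L (1 + \log L)$ envelope, so a quantity proportional to $L H_L$ grows far more slowly with depth than any exponential in L.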
Note that our discussion is focused on the minima $\theta _*$. This is justified by the fact that the GD and the stochastic GD drive the parameters to the minima under some conditions [Reference Jacot, Gabriel and Hongler31,Reference Ge, Huang, Jin and Yuan36,Reference Lee, Jason, Scimochowitz, Jordan and Recht37].
D) Numerical experiments
To confirm the validity of the above analyses, some numerical experiments were carried out. The dataset was a subset of the MNIST dataset [Reference LeCun, Bottou, Bengio and Haffner28] with class labels of 0 and 1 after the PCA whitening so that $\forall d \in [D], \mathbb {E} [x_d ] = 0$, and ${\rm Var} (x_d) = 1$, with projection into the principal subspace of 50 dimensions.
We initialized the ResNet and ResNet with BN, which have 50 hidden units in each layer, by the He initialization, calculated the mean eigenvalues and maximum eigenvalues of the expected FIMs of these DNNs, and found that the mean eigenvalues were in agreement with the theoretical values and that the maximum eigenvalues were bounded by the theoretical upper and lower bounds (Fig. 7).
In addition, the convergence properties of the ResNet and ResNet with BN were numerically examined. Each algorithm with various numbers of layers L and learning rates η updated the parameters 50 times for each run and the training loss was averaged over five runs (Fig. 8). It was found that the algorithms converged if the learning rate was less than the theoretical lower bound of the stability threshold $2/\lambda _{\rm max}$, that is, $2 / ( \mbox {upper bound of } \lambda _{\rm max})$.
From a more practical viewpoint, we evaluated the training loss and test loss of the ResNet and ResNet with BN at each update, where each algorithm used an optimal learning rate, $\eta = 1 / ( \mbox {upper bound of } \lambda _{\rm max})$ (Fig. 9). The result shows that the BN accelerates the convergence and increases the stability.
V. CONCLUSION
Some theoretical tools have been developed to analyze the theoretical properties of DNNs. We applied them to DNNs with new techniques such as skip connections and BN and showed why and how they improve the performance of DNNs. Skip connections reduce the generalization gap of standard DNNs, and the reductions are greater when the connections skip two layers at once, by smoothing the loss landscape around the minima. BN enables the GD to use a larger learning rate for convergence and accelerates training by smoothing the entire loss landscape. These analytical techniques may help researchers develop new DNN models.
FINANCIAL SUPPORT
This work was supported in part by JSPS-KAKENHI grant numbers 18J15055 and 18K19821, and the NAIST Big Data Project.
STATEMENT OF INTEREST
None.
Yasutaka Furusho received his B.E. from National Institute of Technology, Kumamoto College, Japan, in 2015 and his M.E. from Graduate School of Information Science, Nara Institute of Science and Technology, Japan, in 2017. He is currently working toward his D.E. at Nara Institute of Science and Technology. His research interests include the analysis of neural networks based on statistical learning theory and information geometry, and its application to the model selection of neural networks.
Kazushi Ikeda received his B.E., M.E., and Ph.D in Mathematical Engineering and Information Physics from the University of Tokyo in 1989, 1991, and 1994, respectively. He was a research associate with the Department of Electrical and Computer Engineering of Kanazawa University from 1994 to 1998. He was a research associate of Chinese University of Hong Kong for three months in 1995. He was with Graduate School of Informatics, Kyoto University, as an associate professor from 1998 to 2008. Since 2008, he has been a full professor of Nara Institute of Science and Technology. He was the editor-in-chief of the Journal of the Japanese Neural Network Society and is currently an action editor of Neural Networks and an associate editor of IEEE Transactions on Neural Networks and Learning Systems.