1. Introduction
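We consider the following two-block separable convex optimization problem with linear equality constraints:
$$\min \left\{ \theta_1\left(x\right) + \theta_2\left(y\right) \;\middle|\; Ax + By = b,\ x \in \mathcal{X},\ y \in \mathcal{Y} \right\}, \tag{1}$$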
where $A \in \mathbb{R}^{n \times n_1}$ , $B \in \mathbb{R}^{n \times n_2}$ , $b \in \mathbb{R}^n$ ; $\mathcal{X} \subseteq \mathbb{R}^{n_1}$ and $\mathcal{Y} \subseteq \mathbb{R}^{n_2}$ are closed convex sets; and ${\theta _2}:{\mathbb{R}^{{n_2}}} \to \mathbb{R} \cup \left\{ { + \infty } \right\}$ is a convex function (not necessarily smooth). The function ${\theta _1}:{\mathbb{R}^{{n_1}}} \to \mathbb{R}$ is convex and smooth on an open set containing $\mathcal{X}$ , but has a specific structure: we assume that there is a stochastic first-order oracle (SFO) for $\theta_1$ , which returns a stochastic gradient $G\left( {x,\xi } \right)$ at $x$ , where $\xi$ is a random variable whose distribution is supported on $\Xi \subseteq \mathbb{R}^d$ , satisfying
$$\text{a)}\ \mathbb{E}\left[ G\left(x,\xi\right) \right] = \nabla \theta_1\left(x\right), \qquad \text{b)}\ \mathbb{E}\left[ \left\| G\left(x,\xi\right) - \nabla \theta_1\left(x\right) \right\|^2 \right] \le \sigma^2,$$
where $\sigma > 0$ is some constant. In addition, we make the following assumptions throughout the paper: (i) the solution set of (1) is nonempty; (ii) the gradient of $\theta_1$ is L-Lipschitz continuous for some $L > 0$ , i.e., $\left\| \nabla{\theta _1}\left( x \right) - \nabla{\theta _1}\left( y \right) \right\|$ $\le L\left\| {x - y} \right\|$ for any $x, y \in \mathcal{X}$ ; (iii) the y-subproblem has a minimizer at each iteration. Although the model (1) is a special linearly constrained convex optimization problem, it is rich enough to characterize many optimization problems arising from various application fields, such as machine learning, image processing, and signal processing. In these fields, a typical scenario is that one of the functions represents a data fidelity term and the other is a regularization term.
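The SFO assumption can be illustrated on a finite-sum least-squares term. The following is a minimal sketch, not the paper's experimental setup: the data `DATA_A`, `DATA_C` and all function names are hypothetical, and $\xi$ is taken to be a uniformly drawn sample index, so that the oracle is unbiased by construction.

```python
import random

# Hypothetical data for theta_1(x) = (1/m) * sum_i 0.5 * (a_i^T x - c_i)^2.
DATA_A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]]
DATA_C = [1.0, -0.5, 2.0, 0.0]


def theta1(x):
    """Smooth convex term: average of the per-sample squared residuals."""
    m = len(DATA_A)
    return sum(0.5 * (sum(aj * xj for aj, xj in zip(a, x)) - c) ** 2
               for a, c in zip(DATA_A, DATA_C)) / m


def sample_gradient(x, i):
    """Gradient of the i-th summand 0.5 * (a_i^T x - c_i)^2."""
    a = DATA_A[i]
    residual = sum(aj * xj for aj, xj in zip(a, x)) - DATA_C[i]
    return [residual * aj for aj in a]


def full_gradient(x):
    """Exact gradient of theta_1: the average of all per-sample gradients."""
    m = len(DATA_A)
    grads = [sample_gradient(x, i) for i in range(m)]
    return [sum(g[j] for g in grads) / m for j in range(len(x))]


def sfo(x, rng):
    """Stochastic first-order oracle: xi is a uniformly drawn sample index."""
    xi = rng.randrange(len(DATA_A))
    return sample_gradient(x, xi)


def noise_second_moment(x):
    """E||G(x, xi) - grad theta_1(x)||^2, exact over the finite support of xi."""
    g = full_gradient(x)
    m = len(DATA_A)
    return sum(
        sum((gi - gj) ** 2 for gi, gj in zip(sample_gradient(x, i), g))
        for i in range(m)
    ) / m
```

For this quadratic example the second moment grows with $\left\| x \right\|$ , so a uniform constant $\sigma^2$ exists only when $x$ ranges over a bounded set.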
When the specific structure of $\theta_1$ is ignored, i.e., when function values and gradients are readily available, a classical method for solving problem (1) is the alternating direction method of multipliers (ADMM). ADMM was originally proposed by Glowinski and Marroco (1975), and Gabay and Mercier (1976); it can be viewed as a Gauss-Seidel implementation of the augmented Lagrangian method (Glowinski, 2014) or as an application of the Douglas-Rachford splitting method to the dual problem of (1) (Eckstein and Bertsekas, 1992). For both convex and non-convex problems, there are extensive studies on the theoretical properties of ADMM. In particular, for convex optimization problems, theoretical results on convergence behavior are abundant, covering global convergence, sublinear convergence rates, and linear convergence rates; see, e.g., Eckstein and Bertsekas (1992); He and Yuan (2012); Monteiro and Svaiter (2013); He and Yuan (2015); Deng and Yin (2016); Yang and Han (2016); Han et al. (2016). Recently, ADMM has been studied on non-convex models satisfying the KL inequality or other similar properties; see, e.g., Li and Pong (2015); Wang et al. (2019); Jiang et al. (2019); Zhang and Luo (2020). For a thorough understanding of some recent developments of ADMM, one can refer to the survey (Han, 2022). However, when the objective function value (or its gradient) in (1) is computationally costly or even impossible to compute, we can only access noisy information, and deterministic ADMM is no longer applicable. Such a setting is exactly what the stochastic programming (SP) model considers.
In SP, the objective function often takes the form of an expectation. In this case, obtaining the full function value or gradient information is impractical. To tackle this problem, Robbins and Monro introduced the stochastic approximation (SA) approach in 1951 (Robbins and Monro, 1951). Since then, SA has gone through many developments; for more detail, readers are referred to a series of works by Nemirovski, Ghadimi, Lan, and others; see, e.g., Nemirovski et al. (2009); Ghadimi and Lan (2012); Lan (2012); Ghadimi and Lan (2013); Ghadimi et al. (2016). For solving problem (1), motivated by SA, some stochastic ADMM-type algorithms have been proposed recently; see, e.g., Ouyang et al. (2013); Suzuki (2013, 2014); Zhao et al. (2015); Gao et al. (2018). Note that in these works, only the basic iterative scheme of ADMM was considered. It is well known that incorporating an acceleration factor into the subproblem and the update of the dual variables often improves the algorithmic performance, which is the idea of generalized ADMM (Eckstein and Bertsekas, 1992; Fang et al., 2015). In this paper, we study generalized ADMM in the stochastic setting. In particular, we propose a stochastic linearized generalized ADMM (SLG-ADMM) for solving the two-block separable stochastic optimization problem (1) and analyze the corresponding worst-case convergence rate within the variational inequality framework. Moreover, we establish the large deviation properties of SLG-ADMM under certain light-tail assumptions.
The rest of this paper is organized as follows. In Section 2, we present the iterative scheme of SLG-ADMM and summarize some preliminaries that will be used in the theoretical analysis. In Section 3, we analyze the worst-case convergence rate and the high probability guarantees for the objective error and constraint violation of SLG-ADMM. Finally, we draw some conclusions in Section 4.
Notation. 1. For two matrices A and B, the ordering relation $A \succ B$ ( $A \succeq B$ ) means $A-B$ is positive definite (semidefinite). $I_m$ denotes the $m \times m$ identity matrix. For a vector x, $\left\| x \right\|$ denotes its Euclidean norm; for a matrix X, $\left\| X \right\|$ denotes its spectral norm. For any symmetric matrix G, define $\left\| x \right\|_G^2: = {x^T}Gx$ and ${\left\| x \right\|_G}: = \sqrt {{x^T}Gx} $ if $G \succeq 0$ . $\mathbb{E}\left[ \cdot \right]$ denotes the mathematical expectation of a random variable. ${\rm Pr}\left\{\cdot\right\}$ denotes the probability value of an event. $\partial$ and $\nabla$ denote the subdifferential and gradient operator of a function, respectively. We also sometimes use $\left( {x,y} \right)$ and $\left( {x,y,\lambda } \right)$ to denote the vectors ${\left( {{x^T},{y^T}} \right)^T}$ and ${\left( {{x^T},{y^T},{\lambda ^T}} \right)^T}$ , respectively.
2. Stochastic Linearized Generalized ADMM
In this section, we first present the iterative scheme of SLG-ADMM for solving (1), and then, we introduce some preliminaries that will be frequently used in the later analysis.
We give some remarks on this algorithm. Algorithm 1 is an ADMM-type algorithm, which alternates through one x-subproblem, one y-subproblem, and an update of the dual variables (multipliers). The algorithm is stochastic since at each iteration the SFO is called to obtain a stochastic gradient $G\left(x^k , \xi\right)$ , which is an unbiased estimate of $\nabla {\theta _1}\left( x^k \right)$ and whose deviation from $\nabla {\theta _1}\left( x^k \right)$ is bounded in expectation. The algorithm is linearized in the following two respects: (i) The term $G\left(x^k , \xi\right)^T\left(x - x^k\right)$ in the x-subproblem of SLG-ADMM is a stochastic version of the linearization of $\theta_1$ at $x^k$ . (ii) The x-subproblem and y-subproblem are augmented with the proximal terms $\frac{1}{{2}}\left\| {x - {x^k}} \right\|_{{G_{1,k}}}^2 $ and $\frac{1}{2}\left\| {y - {y^k}} \right\|_{{G_{2,k}}}^2$ , respectively, where $\left\{G_{1,k}\right\}$ and $\left\{G_{2,k}\right\}$ are two sequences of symmetric positive definite matrices that may change with the iteration; with the choice ${G_{2,k}} \equiv \tau {I_{{n_2}}} - \beta {B^T}B,\tau > \beta \left\| {{B^T}B} \right\|$ , the quadratic term in the y-subproblem is linearized. The same applies to the x-subproblem. Furthermore, SLG-ADMM incorporates an acceleration factor $\alpha$ ; generally, the case $\alpha \in \left(1,2\right)$ could lead to better numerical results than the special case $\alpha = 1$ . When $\alpha = 1$ , $G_{1,k} \equiv I_{n_1}$ , and the term $\frac{1}{2}\left\| {y - {y^k}} \right\|_{{G_{2,k}}}^2$ vanishes, SLG-ADMM reduces to the algorithms that appeared in the earlier literature (Ouyang et al., 2013; Gao et al., 2018).
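The ingredients described in these remarks (an SFO call, a linearized x-step, a relaxation factor $\alpha$ , a proximal y-step, and a multiplier update) can be sketched on a lasso-type instance of (1) with $A = I$ , $B = -I$ , $b = 0$ . This is only an illustrative sketch under assumptions: it follows the standard generalized ADMM template with the sign convention $L_\beta = \theta_1 + \theta_2 - \lambda^T\left(Ax + By - b\right) + \frac{\beta}{2}\left\| Ax + By - b \right\|^2$ , takes $G_{1,k} = \tau I - \beta A^TA$ and $G_{2,k} = 0$ , and may differ in details from Algorithm 1. All function names and the demo data are hypothetical.

```python
import math
import random


def soft_threshold(v, t):
    """Proximal mapping of t * |.| applied to a scalar."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def lasso_objective(rows, targets, nu, x, y):
    """theta_1(x) + theta_2(y) with theta_2 = nu * ||.||_1."""
    fit = sum(0.5 * (sum(a * xj for a, xj in zip(row, x)) - c) ** 2
              for row, c in zip(rows, targets)) / len(rows)
    return fit + nu * sum(abs(yj) for yj in y)


def slg_admm_lasso(rows, targets, nu, beta, alpha, lipschitz, n_iter, seed=0):
    """Sketch for min (1/m) sum_i 0.5*(a_i^T x - c_i)^2 + nu*||y||_1 s.t. x - y = 0."""
    rng = random.Random(seed)
    d = len(rows[0])
    x, y, lam = [0.0] * d, [0.0] * d, [0.0] * d
    x_bar, y_bar = [0.0] * d, [0.0] * d
    # Constant tau = sqrt(N) + M with M >= L + beta * ||A^T A||; here A = I.
    tau = math.sqrt(n_iter) + lipschitz + beta
    for _ in range(n_iter):
        # SFO call: gradient of one uniformly sampled summand of theta_1.
        i = rng.randrange(len(rows))
        residual = sum(a * xj for a, xj in zip(rows[i], x)) - targets[i]
        grad = [residual * a for a in rows[i]]
        # x-step with G_1 = tau*I - beta*A^T A: an explicit gradient-type step.
        x = [xj - (gj - lj + beta * (xj - yj)) / tau
             for xj, yj, lj, gj in zip(x, y, lam, grad)]
        # Relaxation with factor alpha (alpha = 1 gives the unrelaxed scheme).
        r = [alpha * xj + (1.0 - alpha) * yj for xj, yj in zip(x, y)]
        # y-step with G_2 = 0: soft-thresholding, the prox of (nu/beta)*||.||_1.
        y = [soft_threshold(rj - lj / beta, nu / beta) for rj, lj in zip(r, lam)]
        # Multiplier step.
        lam = [lj - beta * (rj - yj) for lj, rj, yj in zip(lam, r, y)]
        # Running (ergodic) averages of the primal iterates.
        x_bar = [xb + xj / n_iter for xb, xj in zip(x_bar, x)]
        y_bar = [yb + yj / n_iter for yb, yj in zip(y_bar, y)]
    return x_bar, y_bar


# Hypothetical demo: noise-free sparse regression data.
demo_rng = random.Random(1)
x_true = [1.0, -2.0, 0.0, 0.0, 3.0]
rows = [[demo_rng.gauss(0.0, 1.0) for _ in x_true] for _ in range(100)]
targets = [sum(a * xj for a, xj in zip(row, x_true)) for row in rows]
# The trace of (1/m) * sum_i a_i a_i^T upper-bounds the Lipschitz constant L.
lipschitz = sum(a * a for row in rows for a in row) / len(rows)
x_bar, y_bar = slg_admm_lasso(rows, targets, nu=0.01, beta=1.0, alpha=1.5,
                              lipschitz=lipschitz, n_iter=4000)
```

With $G_{1,k} = \tau I - \beta A^TA$ the x-subproblem has a closed-form solution, so each iteration costs one sampled gradient and one soft-thresholding pass.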
Let the Lagrangian function of the problem (1) be
defined on $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}^n$ . We call $\left( {{x^ * },{y^ * },{\lambda ^ * }} \right) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^n$ a saddle point of $L\left(x, y, \lambda \right)$ if the following inequalities are satisfied:
Obviously, a saddle point $\left( {{x^ * },{y^ * },{\lambda ^ * }} \right)$ can be characterized by the following inequalities
Below we invoke a proposition that characterizes the optimality condition of an optimization model by a variational inequality. The proof can be found in He (2017).
Proposition. 2. Let $\mathcal{X} \subset \mathbb{R}^n$ be a closed convex set and let $\theta \left( x \right):{\mathbb{R}^n} \to \mathbb{R}$ and $f \left( x \right):{\mathbb{R}^n} \to \mathbb{R}$ be convex functions. In addition, $f\left(x\right)$ is differentiable. Assuming that the solution set of the minimization problem $\min \left\{ {\theta \left( x \right) + f\left( x \right)\left| {x \in \mathcal{X}} \right.} \right\}$ is nonempty, then we have the assertion that
if and only if
Hence, by this proposition, solving (1) is equivalent to solving the following variational inequality problem under the assumption that the solution set of problem (1) is nonempty: find ${w^ * } = \left( {{x^ * },{y^ * },{\lambda ^ * }} \right) \in \Omega : = \mathcal{X} \times \mathcal{Y} \times {\mathbb{R}^n}$ such that
where
The variables with superscript or subscript such as $u^k, w^k, {\bar{u}_k}, {\bar{w}_k}$ are denoted similarly. In addition, we define two auxiliary sequences for the convergence analysis. More specifically, for the sequence $\left\{ {{w^k}} \right\}$ generated by the SLG-ADMM, let
Based on the above notations and the update scheme of $\lambda ^ k$ in SLG-ADMM, we have
and
Then, we get
where M is defined as
For notational simplicity, we define two sequences of matrices that will be used later: for $k = 0, 1, \ldots$
Obviously, for any k, the matrices $M, H_k$ , and $Q_k$ satisfy $Q_k = H_k M$ .
Throughout the paper, we need the following assumptions:
Assumption. (i) The primal-dual solution set $\Omega ^*$ of problem (1) is nonempty.
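(ii) $\theta_1 \left(x\right)$ is differentiable, and its gradient satisfies the L-Lipschitz condition
$$\left\| \nabla \theta_1\left(x_1\right) - \nabla \theta_1\left(x_2\right) \right\| \le L\left\| x_1 - x_2 \right\|$$
for all $x_1, x_2 \in \mathcal{X}$ .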
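(iii)
$$\text{a)}\ \mathbb{E}\left[ G\left(x,\xi\right) \right] = \nabla \theta_1\left(x\right), \qquad \text{b)}\ \mathbb{E}\left[ \left\| G\left(x,\xi\right) - \nabla \theta_1\left(x\right) \right\|^2 \right] \le \sigma^2,$$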
where $\sigma > 0$ is some constant.
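Under the second assumption, it holds that for all $x, y \in \mathcal{X}$ ,
$$\theta_1\left(x\right) \le \theta_1\left(y\right) + \nabla \theta_1\left(y\right)^T\left(x - y\right) + \frac{L}{2}\left\| x - y \right\|^2.$$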
A direct result of combining this property with convexity is shown in the following lemma.
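Lemma. 1. Suppose the function f is convex and differentiable, and its gradient is L-Lipschitz continuous. Then for any x, y, z, we have
$$\left(x - y\right)^T\nabla f\left(z\right) \le f\left(x\right) - f\left(y\right) + \frac{L}{2}\left\| y - z \right\|^2.$$
In addition, if f is $\mu$ -strongly convex, then for any x, y, z we have
$$\left(x - y\right)^T\nabla f\left(z\right) \le f\left(x\right) - f\left(y\right) + \frac{L}{2}\left\| y - z \right\|^2 - \frac{\mu}{2}\left\| x - z \right\|^2.$$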
Proof. Since the gradient of f is L-Lipschitz continuous, then for any y, z we have $f\left( y \right) \le f\left( z \right) + {\left( {y - z} \right)^T}\nabla f\left( z \right) + \frac{L}{2}{\left\| {y - z} \right\|^2}.$ Also, due to the convexity of f, we have for any x, z, $f\left( x \right) \ge f\left( z \right) + {\left( {x - z} \right)^T}\nabla f\left( z \right).$ Adding the above two inequalities, we get the conclusion. If f is $\mu$ -strongly convex, then for any x, z, $f\left( x \right) \ge f\left( z \right) + {\left( {x - z} \right)^T}\nabla f\left( z \right) + \frac{\mu }{2}{\left\| {x - z} \right\|^2}.$ Then, we combine this inequality with $f\left( y \right) \le f\left( z \right) + {\left( {y - z} \right)^T}\nabla f\left( z \right) + \frac{L}{2}{\left\| {y - z} \right\|^2}$ , and the proof is completed.
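Adding the two inequalities used in the proof gives the three-point bound $\left(x - y\right)^T\nabla f\left(z\right) \le f\left(x\right) - f\left(y\right) + \frac{L}{2}\left\| y - z \right\|^2$ . As a quick numerical sanity check (a sketch, not part of the proof), the snippet below evaluates the gap of this bound for the scalar logistic loss, an assumed test function that is convex with a $\frac{1}{4}$ -Lipschitz derivative.

```python
import math
import random


def f(x):
    """Logistic loss log(1 + e^x): convex with a (1/4)-Lipschitz derivative."""
    return math.log1p(math.exp(x))


def grad_f(x):
    """Derivative of f, i.e. the sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))


L_CONST = 0.25  # Lipschitz constant of grad_f (maximum of the sigmoid's slope)


def three_point_gap(x, y, z):
    """f(x) - f(y) + (L/2)*(y - z)^2 - (x - y)*grad_f(z); should be >= 0."""
    return f(x) - f(y) + 0.5 * L_CONST * (y - z) ** 2 - (x - y) * grad_f(z)


# Smallest gap over random triples (x, y, z) in [-5, 5]^3.
rng = random.Random(0)
worst_gap = min(
    three_point_gap(rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(-5, 5))
    for _ in range(10_000)
)
```

The gap vanishes at $x = y = z$ , so the bound is tight there.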
3. Theoretical Analysis of SLG-ADMM
In this section, we establish the theoretical properties of SLG-ADMM. More specifically, we analyze the expected convergence rates of SLG-ADMM in Subsection 3.1 and its large deviation properties in Subsection 3.2.
3.1 Expected convergence rate
We first consider the case where the function $\theta_1$ is convex. The next several lemmas establish an upper bound for $\theta \left( {{{\tilde u}^k}} \right) - \theta \left( u \right) + {\left( {{{\tilde w}^k} - w} \right)^T}F\left( {{{\tilde w}^k}} \right)$ . With such a bound, it is possible to estimate the worst-case convergence rate of SLG-ADMM.
Lemma. 2. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM and the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2). Then, we have
where $Q_k$ is defined in (7), and $\delta ^ k = G\left( {{x^k},\xi } \right) - \nabla {\theta _1}\left( {{x^k}} \right)$ , similarly hereinafter.
Proof. The optimality condition of the x-subproblem in SLG-ADMM is
Using $\tilde{x}^k$ and $\tilde{\lambda}^k$ defined in (2) and notation of $\delta^k$ , (9) can be rewritten as
In lemma 1, letting $y = \tilde{x}^k$ , $z = x^k$ , and $f = \theta_1$ , we get
Combining (10) and (11), we obtain
Similarly, the optimality condition of y-subproblem is
Substituting (3) into (13), we obtain that
At the same time,
That is,
Combining (12), (14), and (15), we get
Finally, by the definition of F and $Q_k$ , we come to the conclusion.
Next, we need to further explore the terms on the right-hand side of (8).
Lemma. 3. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM and the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2). Then for any $w \in \mathcal{X} \times \mathcal{Y} \times {\mathbb{R}^n}$ , we have
Proof. Using $Q_k = H_k M$ and ${w^k} - {w^{k + 1}} = M\left( {{w^k} - {{\tilde w}^k}} \right)$ in (5), we have
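Now we apply the following identity: for vectors a, b, c, d and a symmetric matrix H of appropriate dimension,
$${\left( {a - b} \right)^T}H\left( {c - d} \right) = \frac{1}{2}\left( {\left\| {a - d} \right\|_H^2 - \left\| {a - c} \right\|_H^2} \right) + \frac{1}{2}\left( {\left\| {c - b} \right\|_H^2 - \left\| {d - b} \right\|_H^2} \right).$$
In this identity, letting $a = w$ , $b = \tilde{w}^k$ , $c = w^k$ , $d = w^{k+1}$ , and $H = H_k$ , we have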
Next, we simplify the term ${\left\| {{w^k} - {{\tilde w}^k}} \right\|_{{H_k}}^2 - \left\| {{w^{k + 1}} - {{\tilde w}^k}} \right\|_{{H_k}}^2}$ .
where the second equality uses ${w^k} - {w^{k + 1}} = M\left( {{w^k} - {{\tilde w}^k}} \right)$ in (5), and the last equality holds since the transpose of $M^TH_k$ is $H_kM,$ and hence,
The remaining task is to prove
By simple algebraic operation,
With this result, (19) holds and the proof is completed.
Now, we are ready to establish the first main theorem for SLG-ADMM. In this theorem, we take $G_{1,k}$ of the form $\tau _k I_{n_1} - \beta A^TA, \tau_k > 0$ , which simplifies the linear system in the x-subproblem, and $G_{2,k} \equiv G_2$ . Of course, $G_2$ can also take a form similar to that of $G_{1,k}$ . In particular, if $G_2 = \eta I_{n_2} - \beta B^TB, \eta \ge \beta \left\| {{B^T}B} \right\|$ , then the y-subproblem reduces to the proximal mapping of $\theta_2$ .
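To see why the choice $G_2 = \eta I_{n_2} - \beta B^TB$ turns the y-subproblem into a proximal step, consider a generic y-subproblem of the form below. The vector $v$ and the point $q^k$ are placeholders introduced only for this sketch: $v$ collects everything that does not depend on $y$ (the multiplier term can be absorbed into it by completing the square), and the constraint $y \in \mathcal{Y}$ is assumed to be absorbed into $\theta_2$ as an indicator function.

```latex
\begin{aligned}
y^{k+1} &= \arg\min_{y}\Big\{ \theta_2(y) + \tfrac{\beta}{2}\|By - v\|^2
           + \tfrac{1}{2}\|y - y^k\|_{G_2}^2 \Big\},
           \qquad G_2 = \eta I_{n_2} - \beta B^TB, \\[4pt]
\tfrac{\beta}{2}\|By - v\|^2 + \tfrac{1}{2}\|y - y^k\|_{G_2}^2
        &= \tfrac{\beta}{2}\, y^TB^TBy - \beta\, y^TB^Tv
           + \tfrac{1}{2}\, y^T(\eta I_{n_2} - \beta B^TB)\,y
           - y^T(\eta I_{n_2} - \beta B^TB)\,y^k + \text{const} \\
        &= \tfrac{\eta}{2}\|y\|^2
           - y^T\big(\eta\, y^k - \beta B^T(By^k - v)\big) + \text{const} \\
        &= \tfrac{\eta}{2}\|y - q^k\|^2 + \text{const},
           \qquad q^k := y^k - \tfrac{\beta}{\eta} B^T(By^k - v), \\[4pt]
y^{k+1} &= \arg\min_{y}\Big\{ \theta_2(y) + \tfrac{\eta}{2}\|y - q^k\|^2 \Big\}
         = \operatorname{prox}_{\theta_2/\eta}\big(q^k\big).
\end{aligned}
```

The term $\frac{\beta}{2} y^TB^TBy$ cancels exactly, which is why no linear system involving $B^TB$ has to be solved; for $\theta_2 = \nu \left\| \cdot \right\|_1$ the proximal mapping is componentwise soft-thresholding.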
Theorem. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM, the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2), and
for some pre-selected integer N. Choosing $ {{\tau _k}} \equiv \sqrt{N} + M$ , where M is a constant satisfying the ordering relation $MI_{n_1} \succeq LI_{n_1} + \beta A^TA $ , then we have
Proof. Combining lemma 2 and lemma 3, we get
where the second inequality holds owing to Young’s inequality and $\alpha \in \left(0,2\right)$ . Meanwhile,
where the equality holds since for any $w_1$ and $w_2$ ,
and the inequality follows from the convexity of $\theta$ . Now summing both sides of (21) from 0 to N and then taking the average, and using (22), the assertion of this theorem follows directly.
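The equality step in this proof relies on a structural property of $F$ . Assuming $F$ has the affine form that is standard in the variational inequality framework for ADMM, namely $F\left(w\right) = \left( -A^T\lambda, -B^T\lambda, Ax + By - b \right)$ (an assumption of this sketch, since the displayed definition is not reproduced here), its linear part is skew-symmetric and hence $\left(w_1 - w_2\right)^T\left(F\left(w_1\right) - F\left(w_2\right)\right) = 0$ . The snippet below checks this numerically on random data; all helper names are hypothetical.

```python
import random


def matvec(mat, vec):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def transpose(mat):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*mat)]


def F(A, B, b, x, y, lam):
    """Assumed affine operator F(w) = (-A^T lam, -B^T lam, A x + B y - b)."""
    top = [-v for v in matvec(transpose(A), lam)]
    mid = [-v for v in matvec(transpose(B), lam)]
    bottom = [ax + by - bi
              for ax, by, bi in zip(matvec(A, x), matvec(B, y), b)]
    return top + mid + bottom


def monotonicity_gap(A, B, b, w1, w2):
    """(w1 - w2)^T (F(w1) - F(w2)) for w = (x, y, lam) given as three lists."""
    f1, f2 = F(A, B, b, *w1), F(A, B, b, *w2)
    flat1 = w1[0] + w1[1] + w1[2]
    flat2 = w2[0] + w2[1] + w2[2]
    return sum((p - q) * (u - v) for p, q, u, v in zip(flat1, flat2, f1, f2))


rng = random.Random(0)


def rand_vec(n):
    """Random vector with entries uniform in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Random instance with n = 2, n_1 = 3, n_2 = 2.
A = [rand_vec(3) for _ in range(2)]
B = [rand_vec(2) for _ in range(2)]
b = rand_vec(2)
w1 = (rand_vec(3), rand_vec(2), rand_vec(2))
w2 = (rand_vec(3), rand_vec(2), rand_vec(2))
gap_value = monotonicity_gap(A, B, b, w1, w2)
```

Up to rounding, `gap_value` is zero for any data, since the cross terms $-\Delta x^TA^T\Delta\lambda - \Delta y^TB^T\Delta\lambda$ cancel against $\Delta\lambda^T\left(A\Delta x + B\Delta y\right)$ .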
Corollary. 6. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM, the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2), and
for some pre-selected integer N. Choosing $ {{\tau _k}} \equiv \sqrt{N} + M$ , where M is a constant satisfying the ordering relation $MI_{n_1} \succeq LI_{n_1} + \beta A^TA $ , then we have
and
where e is a unit vector satisfying $ - {e^T}\left( {A{{\bar x}_N} + B{{\bar y}_N} - b} \right) = \left\| {A{{\bar x}_N} + B{{\bar y}_N} - b} \right\|$ .
Proof. In (20), let $w = \left( {{x^ * },{y^ * },\lambda } \right)$ , where $\lambda = {\lambda ^ * } + e$ . Obviously, $\left\| e \right\| = 1$ . Then, the left-hand side of (20) is
Such a result is attributed to
where the first equality follows from the definition of F, and the second and last equalities hold due to ${A{x^ * } + B{y^ * } - b} = 0$ and the choice of $\lambda$ . On the other hand, substituting $w = \bar w_N$ into the variational inequality associated with (1), we get
Combining (25) and (26), we obtain that the left-hand side of (20) is no less than $\left\| {A{{\bar x}_N} + B{{\bar y}_N} - b} \right\|$ when letting $w = \left(x^*, y^*, \lambda^* + e\right)$ . Hence,
where in the first inequality we use $\mathbb{E}\left[ {{\delta ^k}} \right] = 0$ and $\mathbb{E}\left[ {{{\left\| {{\delta ^k}} \right\|}^2}} \right] \le {\sigma ^2}$ . The first part of this corollary is proved. Next, we prove the second part. Substituting $w = \bar w_N$ into the variational inequality associated with (1), we get
i.e.,
Take expectation on both sides of (28), and hence, (24) is proved.
Remark. 7. (i) In the above theorem and corollary, N needs to be selected in advance, and hence the $\tau_k$ are constant. In fact, $\tau_k$ can also vary with the iteration number. In the case of $\tau_k = \sqrt{k} + M$ , if the distance between $w^k$ and $w^*$ is bounded, i.e., ${\left\| {{w^k} - {w^ * }} \right\|^2} \le {R^2}$ for any k, we can also obtain a worst-case convergence rate. The main difference from the proof of the above theorem and corollary lies in bounding the term $\sum\nolimits_{t = 0}^k {\left( {\left\| {{x^t} - {x^ * }} \right\|_{G_{1,t}}^2 - \left\| {{x^{t + 1}} - {x^ * }} \right\|_{G_{1,t}}^2} \right)} $ , which is now bounded as follows.
(ii) The above corollary reveals that the worst-case expected convergence rate of SLG-ADMM for solving general convex problems is $\mathcal{O}\left(\frac{1}{\sqrt{N}}\right)$ , where N is the iteration number.
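For item (i) of this remark, one way to bound the sum when $\tau_t = \sqrt{t} + M$ is a telescoping argument. The derivation below is a sketch under the stated assumptions: $G_{1,t} = \tau_t I_{n_1} - \beta A^TA$ , so that $G_{1,t} - G_{1,t-1} = \left(\tau_t - \tau_{t-1}\right) I_{n_1}$ ; $G_{1,k} \succeq 0$ , which follows from $MI_{n_1} \succeq LI_{n_1} + \beta A^TA$ ; and $\left\| x^t - x^* \right\| \le \left\| w^t - w^* \right\| \le R$ .

```latex
\begin{aligned}
\sum_{t=0}^{k}\Big( \|x^t - x^*\|_{G_{1,t}}^2 - \|x^{t+1} - x^*\|_{G_{1,t}}^2 \Big)
 &= \|x^0 - x^*\|_{G_{1,0}}^2
    + \sum_{t=1}^{k}\Big( \|x^t - x^*\|_{G_{1,t}}^2 - \|x^t - x^*\|_{G_{1,t-1}}^2 \Big)
    - \|x^{k+1} - x^*\|_{G_{1,k}}^2 \\
 &= \|x^0 - x^*\|_{G_{1,0}}^2
    + \sum_{t=1}^{k} (\tau_t - \tau_{t-1})\, \|x^t - x^*\|^2
    - \|x^{k+1} - x^*\|_{G_{1,k}}^2 \\
 &\le \|x^0 - x^*\|_{G_{1,0}}^2 + R^2 (\tau_k - \tau_0) \\
 &= \|x^0 - x^*\|_{G_{1,0}}^2 + \sqrt{k}\, R^2 .
\end{aligned}
```

After dividing by the number of iterations, the extra $\sqrt{k}\,R^2$ term contributes $\mathcal{O}\left(1/\sqrt{k}\right)$ , which matches the order of the rate obtained with the constant choice $\tau_k \equiv \sqrt{N} + M$ .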
At the end of this subsection, we assume that $\theta_1$ is $\mu$ -strongly convex, i.e., ${\theta _1}\left( x \right) \ge {\theta _1}\left( y \right) + \left\langle {\nabla {\theta _1}\left( y \right),x - y} \right\rangle + \frac{\mu }{2}{\left\| {x - y} \right\|^2}, \mu > 0$ for all $x, y \in \mathcal{X}$ . With strong convexity, we can show not only that the objective function value gap and the constraint violation converge to zero in expectation, but also that the ergodic iterates of SLG-ADMM converge.
Theorem. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM and the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2), and
Choosing $\tau_k = \mu\left(k+1\right) + M$ , where M is a constant satisfying the ordering relation $M{I_{{n_1}}} \succeq L{I_{{n_1}}} + \beta {A^T}A$ , then SLG-ADMM has the following properties
and
where e is a unit vector satisfying $ - {e^T}\left( {A{{\bar x}_k} + B{{\bar y}_k} - b} \right) = \left\| {A{{\bar x}_k} + B{{\bar y}_k} - b} \right\|$ , and
Proof. First, similar to the proof of lemma 2, using the $\mu$ -strong convexity of $\theta_1$ , we conclude that for any $w \in \Omega$
where $Q_k$ is defined in (7). Then using the result in lemma 3,
Combining (31) and (32), we get
Now using (22) and (33), we have
Finally, taking expectation on both sides of (32) and following the proof for getting (27) and (28), we obtain
and
Therefore, (29) and (30) are proved.
This theorem implies that, under the assumption that $\theta_1$ is strongly convex, the worst-case expected convergence rate of SLG-ADMM can be improved to $\mathcal{O}\left(\ln k / k\right)$ with a diminishing step size. The following theorem shows the convergence of the ergodic iterates of SLG-ADMM, which is not covered in the earlier literature (Ouyang et al., 2013; Gao et al., 2018). Furthermore, if $\theta_2$ is also strongly convex, the assumption that B has full column rank can be removed.
Theorem. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM, the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2), and
Choosing $\tau_k = \mu\left(k+1\right) + M$ , where M is a constant satisfying the ordering relation $M{I_{{n_1}}} \succeq L{I_{{n_1}}} + \beta {A^T}A$ , and assuming B is full column rank and $\lambda_{\min}$ denotes the minimum eigenvalue of $B^TB$ , then we have
where the bounds for $\mathbb{E}\left[ {\left\| {A{{\bar x}_k} + B{{\bar y}_k} - b} \right\|} \right]$ and $\mathbb{E}\left[ {\theta \left( {{{\bar u}_k}} \right) - \theta \left( {{u^ * }} \right)} \right]$ are the same as in (29) and (30), respectively.
Proof. Since $\left(x^*, y^*, \lambda^*\right)$ is a solution of (1), we have ${A^T}{\lambda ^ * } = \nabla {\theta _1}\left( {{x^ * }} \right) \ {\rm and} \ {B^T}{\lambda ^ * } \in \partial {\theta _2}\left( {{y^ * }} \right).$ Hence, since $\theta_1$ is strongly convex and $\theta_2$ is convex, we have
and
Adding up (36) and (37), we get $\theta \left( {{{\bar u}_k}} \right) \ge \theta \left( {{u^ * }} \right) + {\left( {{\lambda ^ * }} \right)^T}\left( {A{{\bar x}_k} + B{{\bar y}_k} - b} \right) + \frac{\mu }{2}{\left\| {{{\bar x}_k} - {x^ * }} \right\|^2},$ that is
On the other hand,
this implies ${\left\| {B\left( {{{\bar y}_k} - {y^ * }} \right)} \right\|} \le {\left\| A \right\|}{\left\| {{{\bar x}_k} - {x^ * }} \right\|} + {\left\| {A{{\bar x}_k} + B{{\bar y}_k} - b} \right\|,}$ and hence,
Adding (38) and (39), using Jensen’s inequality $\mathbb{E}X^{\frac{1}{2}} \le \left(\mathbb{E}X\right)^{\frac{1}{2}}$ for a random variable X, and taking expectation imply
The proof is completed.
3.2 High probability performance analysis
In this subsection, we shall establish the large deviation properties of SLG-ADMM. By (23) and (24), and Markov’s inequality, we have for any $\varepsilon_1 > 0$ and $\varepsilon_2 > 0$ that
and
However, these bounds are not strong. In the following, we show that these high probability bounds can be significantly improved under the standard “light-tail” assumption; see, e.g., Nemirovski et al. (2009); Lan (2020). Specifically, assume that for any $x \in \mathcal{X}$
$$\mathbb{E}\left[ \exp \left\{ \left\| G\left(x,\xi\right) - \nabla \theta_1\left(x\right) \right\|^2 / \sigma^2 \right\} \right] \le \exp \left\{ 1 \right\}.$$
This assumption is a little bit stronger than b) in Assumption (iii), which can be explained by Jensen’s inequality. For further analysis, we assume that $\mathcal{X}$ is bounded and its diameter is denoted by $D_X$ , defined as $\max _{x_1,x_2\in \mathcal{X} } \left \| x_1 - x_2 \right \| $ . The following theorem shows the high probability bound for objective error and constraint violation of SLG-ADMM.
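The relation between the light-tail condition and the bounded-variance condition b) can be made explicit. The sketch below assumes the light-tail condition in its standard form from the cited references, $\mathbb{E}\left[\exp\left\{\left\|\delta\right\|^2/\sigma^2\right\}\right] \le \exp\left\{1\right\}$ with $\delta = G\left(x,\xi\right) - \nabla\theta_1\left(x\right)$ , and adds a scalar Gaussian example (the variance $s^2$ is a hypothetical parameter) to show that the condition is not vacuous.

```latex
\begin{aligned}
&\text{Jensen's inequality for the convex function } \exp: \\
&\qquad \exp\Big\{ \mathbb{E}\big[\|\delta\|^2\big] / \sigma^2 \Big\}
   \;\le\; \mathbb{E}\Big[ \exp\big\{ \|\delta\|^2 / \sigma^2 \big\} \Big]
   \;\le\; \exp\{1\}
   \quad\Longrightarrow\quad
   \mathbb{E}\big[\|\delta\|^2\big] \le \sigma^2 . \\[6pt]
&\text{Scalar Gaussian noise } \delta \sim \mathcal{N}(0, s^2) \text{ with } 2s^2 < \sigma^2: \\
&\qquad \mathbb{E}\Big[ \exp\big\{ \delta^2 / \sigma^2 \big\} \Big]
   = \Big( 1 - \tfrac{2 s^2}{\sigma^2} \Big)^{-1/2}
   \le \exp\{1\}
   \quad\Longleftrightarrow\quad
   s^2 \le \tfrac{1 - e^{-2}}{2}\, \sigma^2 \approx 0.432\, \sigma^2 .
\end{aligned}
```

So Gaussian noise satisfies the light-tail condition once $\sigma^2$ is taken somewhat larger than twice its variance, whereas condition b) alone only requires $\sigma^2 \ge s^2$ .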
Theorem. Let the sequence $\left\{ {{w^k}} \right\}$ be generated by the SLG-ADMM, the associated $\left\{ {{{\tilde w}^k}} \right\}$ be defined in (2), and
for some pre-selected integer N. Choosing $ {{\tau _k}} \equiv \sqrt{N} + M$ , where M is a constant satisfying the ordering relation $MI_{n_1} \succeq LI_{n_1} + \beta A^TA $ , then SLG-ADMM has the following properties
(i)
(ii)
where e is an unit vector satisfying $ - {e^T}\left( {A{{\bar x}_N} + B{{\bar y}_N} - b} \right) = \left\| {A{{\bar x}_N} + B{{\bar y}_N} - b} \right\|$ .
Proof. Let $\zeta^t = \frac{1}{N}\left ( x^* - x^{t} \right ) ^{T} \delta ^{t} $ . Clearly, $\left \{ \zeta ^{t} \right \} _{t\ge 1} $ is a martingale-difference sequence. Moreover, it follows from the definition of $D_X$ and the light-tail assumption that
Now using a well-known result (see Lemma 4.1 in Lan (2020)) for the martingale-difference sequence, we have for any $\Theta \ge 0$
Also, observe that by Jensen’s inequality
whence, taking expectation,
It then follows from Markov’s inequality that for any $\Theta \ge 0$
Using (44) and (45) in (20) for $w = \left(x^*, y^*, \lambda ^* + e\right)$ , we conclude that
and
The result immediately follows from the above inequalities.
Remark. 8. In view of the last theorem, if we take $\Theta = {\rm ln} \ N$ , then we have
and
For the strongly convex case, using a similar derivation, the high probability bound for the objective error and constraint violation of SLG-ADMM is
and
Observe that the convergence rate of the ergodic iterates of SLG-ADMM is obtained in (35). The high probability bound can also be established, as shown below
where N is the iteration number. In contrast to (40) and (41), we can observe that the results in the last theorem are much finer.
4. Conclusion
In this paper, we analyze the expected convergence rates and the large deviation properties of a stochastic variant of generalized ADMM using the variational inequality framework. This framework keeps the proofs concise and transparent. When the model is deterministic and the SFO is not needed, our proposed algorithm reduces to a generalized proximal ADMM, and the convergence region of $\alpha$ is the same as that in the corresponding literature.