
Almost exact recovery in noisy semi-supervised learning

Published online by Cambridge University Press:  11 November 2024

Konstantin Avrachenkov
Affiliation:
Inria Sophia Antipolis, 2004 Rte des Lucioles, Valbonne, France
Maximilien Dreveton*
Affiliation:
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
*Corresponding author: Maximilien Dreveton; Email: maximilien.dreveton@epfl.ch

Abstract

Graph-based semi-supervised learning methods combine the graph structure and labeled data to classify unlabeled data. In this work, we study the effect of a noisy oracle on classification. In particular, we derive the maximum a posteriori (MAP) estimator for clustering a degree corrected stochastic block model when a noisy oracle reveals a fraction of the labels. We then propose an algorithm derived from a continuous relaxation of the MAP, and we establish its consistency. Numerical experiments show that our approach achieves promising performance on synthetic and real data sets, even in the case of very noisy labeled data.

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2024. Published by Cambridge University Press.

1. Introduction

Semi-supervised learning (SSL) aims at achieving superior learning performance by combining unlabeled and labeled data. Since typically the amount of unlabeled data is large compared to the amount of labeled data, SSL methods are relevant when the performance of unsupervised learning is low, or when the cost of getting a large amount of labeled data for supervised learning is too high. Unfortunately, many standard SSL methods have been shown to not efficiently use the unlabeled data, leading to unsatisfactory or unstable performance [Reference Chapelle, Schölkopf and Zien11, Chap. 4], [Reference Ben-David, Lu and Pál9, Reference Cozman, Cohen and Cirelo12]. Moreover, noise in the labeled data can further degrade the performance. In practice, the noise can come from a tired or non-diligent expert carrying out the labeling task or even from adversarial data corruption.

In this paper, we investigate the problem of graph clustering, where one aims to group the nodes of a graph into different classes. Our working model is the two-class degree corrected stochastic block model (DC-SBM), with side information on some nodes' community assignments given by a noisy oracle. The DC-SBM was introduced in [Reference Karrer and Newman18] to account for degree heterogeneity and block structure. Let n be the number of nodes. Each node $i \in [n]$ is given a community label $Z_i \in \{-1,1\}$ chosen uniformly at random and a parameter $\theta_i \gt 0$. Given $Z = (Z_1, \dots, Z_n)$ and $\theta = \left( \theta_1, \dots, \theta_n \right)$, an undirected edge is added between nodes i and j with probability $\min( 1, \theta_i \theta_j p_{\rm in})$, if $Z_i = Z_j$, and with probability $\min(1, \theta_i\theta_j p_{\rm out})$, otherwise. This model reduces to the standard stochastic block model (SBM) [Reference Abbe1] if $\theta_i = 1$ for every node i. The unsupervised clustering problem consists of inferring the latent community structure Z given one observation of a DC-SBM graph. We make the problem semi-supervised by introducing a noisy oracle. For every node, this oracle reveals the correct community label with probability $\eta_1$, a wrong community label with probability $\eta_0$, and reveals nothing with probability $1-\eta_1-\eta_0$.

We first derive the maximum a posteriori (MAP) estimator for SSL-clustering in a DC-SBM graph given the a priori information induced by a noisy oracle and graph structure. We note that, despite its simplicity, this result did not appear previously in the literature, neither for a perfect oracle nor for SBM. In particular, we show that the MAP is the solution to a minimization problem that involves a trade-off between three factors: a cut-based term (as in the unsupervised scenario), a regularization term (penalizing solutions with unbalanced clusters), and a loss term (penalizing predictions that differ from the oracle information).

As solving the MAP estimator is NP-hard, we propose a continuous relaxation and derive an SSL version of a spectral method based on the adjacency matrix. We establish a bound on the ratio of misclassified nodes for this continuous relaxation, and we show that this ratio goes to zero under the hypothesis that the average degree diverges and an almost perfect oracle (see Corollary 3.2 for a rigorous statement). As a result, the proposed SSL method guarantees almost exact recovery (recovering all but o(n) labels when n goes to infinity) even when a part of the side information is incorrect. We note that even though we work with the case of two clusters, most of our results are extendable to the setting of more than two clusters at the expense of more cumbersome notations.

One can make several parallels between our continuous relaxation and state-of-the-art techniques. Indeed, SSL-clustering often relies on minimization frameworks (see [Reference Avrachenkov, Mishenin, Gonçalves and Sokol5, Reference Chapelle, Schölkopf and Zien11] for an overview). The idea of minimizing a well-chosen energy function was proposed in [Reference Zhu, Ghahramani and Lafferty30], under the constraint of keeping the labeled nodes’ predictions equal to the oracle labels. As we show in the numerical section, this hard constraint is unsuitable if the oracle reveals false information. Consequently, Belkin et al. [Reference Belkin, Matveeva and Niyogi8] introduced an additional loss term in the energy function to allow the prediction to differ from the oracle information. We recover this loss term with an additional theoretical justification because it comes from a relaxation of the MAP.

Moreover, the regularization term is necessary to prevent the solution from being flat and making classification rely on second-order fluctuations. This phenomenon was previously observed by [Reference Nadler, Srebro and Zhou23] in the limit of an infinite amount of unlabeled data, as well as by [Reference Mai and Couillet21] in the large dimension limit. The regularization term here consists of subtracting a constant term from all the entries of the adjacency matrix. It resembles previous regularization techniques, like the centering of the adjacency matrix proposed in [Reference Mai and Couillet22]. However, contrary to [Reference Mai and Couillet22], we study a noisy framework without assuming a large-dimension asymptotic regime. Moreover, we solve exactly the relaxed minimization problem instead of giving a heuristic with an extra parameter.

It was shown in [Reference Saad and Nosratinia25] that even with a perfect oracle revealing a constant fraction of the labels, the phase transition phenomenon for exact recovery in SBM (recovering all the correct labels with high probability) remains unchanged. Thus, for the exact recovery problem, one could discard all the side information and simply use unsupervised algorithms when the number of data points goes to infinity. Of course, wasting potentially valuable information is not entirely satisfactory. Thus, in the present work, we consider the case of almost exact recovery and an oracle with noisy information. Criteria different from exact recovery have also been considered in the framework of SSL in [Reference Banerjee, Deka and Olvera-Cravioto7, Reference Kadavankandy, Avrachenkov, Cottatellucci and Sundaresan17].

The paper is structured as follows. We introduce the model and main notations in Section 2, along with the derivation of the MAP estimator (Section 2.2). A continuous relaxation of the MAP is presented in Section 3 as well as the guarantee of its convergence to the true community structure (Subsection 3.2). We postpone some proofs to the Appendix and leave in the main text only those we consider important to the material exposition. We conclude the paper with numerical results (Section 4), emphasizing the effect of the noise on the clustering accuracy. In particular, we outperform state-of-the-art graph-based SSL methods in a difficult regime (few label points or large noise).

Lastly, the present paper is a follow-up work on [Reference Avrachenkov and Dreveton3], with several important developments. In [Reference Avrachenkov and Dreveton3] we only established almost exact recovery on SBM for the Label Spreading heuristic [Reference Zhou, Bousquet, Lal, Weston and Schölkopf29] with a linear number of labeled nodes (see [Reference Avrachenkov and Dreveton3, Assumption 3]). In the present work, we extend the analysis to DC-SBM, investigate the effect of noisy labeled data, and allow a potentially sublinear number of labeled nodes. We also add experiments with real and synthetic data that illustrate our theoretical results.

2. MAP estimator in a noisy semi-supervised setting

2.1. Problem formulation and notations

A homogeneous DC-SBM is parametrized by the number of nodes n, two class-affinity parameters $p_{\rm in}, p_{\rm out}$, and a pair $(\theta, Z)$ where $\theta \in \mathbf{R}^n$ is a vector of intrinsic connection intensities and $Z \in \{-1,1\}^n$ is the community labeling vector. Given $(p_{\rm in}, p_{\rm out}, \theta, Z)$, the graph adjacency matrix $A = (a_{ij})$ is generated as

(2.1)\begin{align} A_{ij} = A_{ji} \sim \left\{ \begin{array}{ll} \operatorname{Ber} \left( \theta_i \theta_j p_{\rm in} \right), & \qquad \mathrm{if}\quad Z_i = Z_j, \\ \operatorname{Ber} \left( \theta_i \theta_j p_{\rm out} \right), & \qquad \mathrm{otherwise,} \\ \end{array} \right. \end{align}

for $i \not= j$, and $A_{ii} = 0$. We assume throughout the paper that $Z_i \sim \operatorname{Uni} \left( \{-1,1\} \right) $, and that the entries of θ are independent random variables satisfying $\theta_i \in [ \theta_{\min}, \theta_{\max} ]$ with $\mathbb{E} \theta_i = 1$, $\theta_{\min} \gt 0$, and $\theta_{\max}^2 \max(p_{\rm in}, p_{\rm out}) \leq 1$. In particular, when all the $\theta_i$'s are equal to one, the model reduces to the SBM:

(2.2)\begin{align} A_{ij} = A_{ji} \sim \left\{ \begin{array}{ll} \operatorname{Ber} \left( p_{\rm in} \right), & \qquad \mathrm{if}\quad Z_i = Z_j, \\ \operatorname{Ber} \left( p_{\rm out} \right), & \qquad \mathrm{otherwise.} \\ \end{array} \right. \end{align}
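The generative rule (2.1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `sample_dcsbm`, the uniform distribution chosen for θ, and the parameter values are assumptions made for the example.

```python
import numpy as np

def sample_dcsbm(z, theta, p_in, p_out, rng):
    """Draw a symmetric 0/1 adjacency matrix with zero diagonal, following (2.1)."""
    z = np.asarray(z)
    theta = np.asarray(theta, dtype=float)
    n = z.shape[0]
    same = np.equal.outer(z, z)                       # True where Z_i = Z_j
    probs = np.outer(theta, theta) * np.where(same, p_in, p_out)
    probs = np.minimum(probs, 1.0)                    # keep valid Bernoulli parameters
    upper = np.triu(rng.random((n, n)) < probs, k=1)  # sample each pair i < j once
    return (upper | upper.T).astype(int)              # symmetrize, diagonal stays zero

rng = np.random.default_rng(0)
n = 200
z = rng.choice([-1, 1], size=n)          # uniform community labels
theta = rng.uniform(0.5, 1.5, size=n)    # illustrative choice with mean one
A = sample_dcsbm(z, theta, p_in=0.2, p_out=0.05, rng=rng)
```

Setting `theta` to a vector of ones gives a sample from the SBM (2.2).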

In addition to the observation of the graph adjacency matrix A, an oracle gives us extra information about the cluster assignment of some nodes. This can be represented as a vector $s \in \{0,-1,1\}^n$, whose entries $s_i$ are independent and distributed as follows:

(2.3)\begin{align} s_{i} \ = \ \left\{ \begin{array}{ll} Z_i, & \qquad \mathrm{with \ probability} \quad \eta_1, \\ -Z_i, & \qquad \mathrm{with \ probability} \quad \eta_0, \\ 0, & \qquad \mathrm{otherwise}. \end{array} \right. \end{align}

In other words, the oracle (2.3) reveals the correct cluster assignment of node i with probability $\eta_1$ and gives a false cluster assignment with probability $\eta_0$. It reveals nothing with probability $1-\eta_1-\eta_0$. The quantity $\mathbb{P}\left( s_i \not = Z_i \, | \, s_i \not= 0 \right)$ is the rate of mistakes of the oracle (i.e., the probability that the oracle reveals false information given that it reveals something), and is equal to $\eta_0/(\eta_1+\eta_0)$. The oracle is informative if this quantity is less than $1/2$, which is equivalent to $\eta_1 \gt \eta_0$. In the following, we will always assume that the oracle is informative.

Assumption 2.1. The oracle is informative, that is, $\eta_1 \gt \eta_0$.
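As an illustration of the oracle (2.3), the following sketch draws s from a label vector and checks empirically that the rate of mistakes is close to $\eta_0/(\eta_1+\eta_0)$. The function name `sample_oracle` and the numerical values are assumptions chosen for the example.

```python
import numpy as np

def sample_oracle(z, eta1, eta0, rng):
    """Return s in {-1, 0, 1}^n drawn entry-wise as in (2.3)."""
    if not (0 <= eta0 and 0 <= eta1 and eta0 + eta1 <= 1):
        raise ValueError("eta0 and eta1 must be non-negative with eta0 + eta1 <= 1")
    z = np.asarray(z)
    u = rng.random(z.shape[0])
    correct = u < eta1                          # probability eta1
    wrong = (u >= eta1) & (u < eta1 + eta0)     # probability eta0
    s = np.zeros_like(z)                        # 0 means "nothing revealed"
    s[correct] = z[correct]
    s[wrong] = -z[wrong]
    return s

rng = np.random.default_rng(1)
z = rng.choice([-1, 1], size=100_000)
s = sample_oracle(z, eta1=0.3, eta0=0.1, rng=rng)
labeled = s != 0
# Empirical rate of mistakes; the model value is 0.1 / (0.3 + 0.1) = 0.25.
mistake_rate = np.mean(s[labeled] != z[labeled])
```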

Given the observation of A and s, the goal of clustering is to recover the community labeling vector Z. For an estimator $\widehat{Z} \in \{-1,1\}^n$ of Z, the relative error is defined as the proportion of misclassified nodes

(2.4)\begin{align} L \left( \widehat{Z} , Z \right) \ = \ \frac{1}{n} \sum_{i=1}^n 1\left( \widehat{Z} _i \not= Z_i \right). \end{align}

Note that, unlike unsupervised clustering, we do not take a minimum over the permutations of the predicted labels since we should be able to learn the correct community labels from the informative oracle.
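The loss (2.4) is straightforward to compute. The small sketch below (with a hypothetical helper name `relative_error`) also shows the consequence of not minimizing over label permutations: a prediction with all labels flipped has error one, not zero.

```python
def relative_error(z_hat, z):
    """Proportion of misclassified nodes, as in (2.4); no minimum over permutations."""
    if len(z_hat) != len(z):
        raise ValueError("z_hat and z must have the same length")
    return sum(1 for a, b in zip(z_hat, z) if a != b) / len(z)

z = [1, 1, -1, -1]
one_mistake = relative_error([1, -1, -1, -1], z)   # one node out of four is wrong
all_flipped = relative_error([-1, -1, 1, 1], z)    # same partition, swapped labels
```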

Notations. Given an oracle s, we let $\ell$ be the set of labeled nodes, that is $\ell := \{i \in [n] : s_i \not= 0 \}$, and denote $\mathcal{P}$ the diagonal matrix with entries $\left( \mathcal{P} \right)_{ii} = 1$, if $i \in \ell$, and $\left(\mathcal{P} \right)_{ii} = 0$, otherwise.

The notation $I_n$ stands for the identity matrix of size n × n, and $1_n$ (resp., $0_n$) is the vector of size n × 1 of all ones (resp., of all zeros).

For any matrix $A = \left(a_{ij}\right)_{i \in [n], j \in [m]}$ and two sets $S \subset [n]$, $T \subset [m]$, we denote $A_{S,T} = \left(a_{ij}\right)_{i \in S, j \in T}$ the matrix obtained from A by keeping elements, whose row indices are in S and column indices are in T. We denote by $\|x\|$ the Euclidean norm of a vector x and by $\|A\|$ the spectral norm of a matrix $A \in \mathbf{R}^{n\times m}$. Finally, $A \odot B$ refers to the entry-wise matrix product between two matrices A and B of the same size.

2.2. MAP estimator for semi-supervised recovery in DC-SBM

Given a realization of a DC-SBM graph adjacency matrix A and the oracle information s, the MAP estimator is defined as

(2.5)\begin{align} \widehat{Z} ^{\operatorname{MAP}} & \ = \ \mathrm{arg\,max}_{z \in \{-1,1\}^n } \mathbb{P}( z \, | \, A, s). \end{align}

This estimator is known to be optimal (in the sense that if it fails then any other estimator would also fail, see, e.g., [Reference Iba16]) for the exact recovery of all the community labels. Theorem 2.2 provides an expression of the MAP.

Theorem 2.2. Let G be a graph drawn from DC-SBM as defined in (2.1) and s be the oracle information as defined in (2.3). Denote $ M \ = \ (F_1-F_0) \odot A + F_0, $ where $F_0 = \left(f^{(0)}_{ij} \right)$ and $F_1 = \left(f^{(1)}_{ij} \right)$ such that $f^{(a)}_{ij} = \log \frac{\mathbb{P}(A_{ij} = a \, | \, z_i = z_j )}{\mathbb{P}(A_{ij} = a \, | \, z_i \not= z_j ) } $ for $a \in \{0,1\}$. The MAP estimator defined in (2.5) is given by

(2.6)\begin{equation} \widehat{Z} ^{\operatorname{MAP}} \ = \ \mathrm{arg\,min}_{\substack {z \in \{-1,1\}^n } } \, \left( - z^T M z + \log \left( \frac{\eta_1}{\eta_0} \right) \left\| \mathcal{P} z - s \right\|^2 \right). \end{equation}

For a perfect oracle $(\eta_0 = 0)$ this reduces to

(2.7)\begin{equation} \widehat{Z} ^{\operatorname{MAP}} \ = \ \mathrm{arg\,min}_{\substack{z \in \{-1,1\}^n \\ z_{\ell} = s_{\ell} } } \left( - z^T M z \right). \end{equation}

The proof of Theorem 2.2 is standard and postponed to Appendix A. We note that, despite being a priori standard, this result did not appear previously in the literature (neither for the standard SBM nor for the perfect oracle).

The minimization problem (2.6) consists of a trade-off between minimizing the quadratic function $-z^T M z$ and a penalty term. This trade-off reads as follows: for each labeled node such that the prediction contradicts the oracle, a positive penalty proportional to $\log \left( \frac{\eta_1}{\eta_0} \right)$ is added. In particular, when the oracle is uninformative, that is $\eta_1 = \eta_0$, this term is null, and Expression (2.6) reduces to the MAP for unsupervised clustering.

The following Corollary 2.3, whose proof is in Appendix A, provides the expression of the MAP estimator for a standard SBM.

Corollary 2.3. The MAP estimator for semi-supervised clustering on SBM graph with $p_{\rm in} \gt p_{\rm out}$ and with an oracle s defined in (2.3) is given by

(2.8)\begin{align} \widehat{Z} ^{\operatorname{MAP}} \ = \ \mathrm{arg\,min}_{z \in \{-1,1\}^n } \, \left( - z^T \left(A - \tau 1_n 1_n^T \right) z + \lambda^* \left\| \mathcal{P} z - s \right\|_2^2 \right), \end{align}

where $ \tau = \dfrac{\log \Big( \dfrac{1-p_{\rm out}}{1-p_{\rm in}} \Big) }{\log \Big( \dfrac{p_{\rm in} (1-p_{\rm out})}{p_{\rm out} (1-p_{\rm in})} \Big) }$ and $\lambda^* = \dfrac{\log \Big( \dfrac{\eta_1}{\eta_0} \Big)}{\log \bigg( \dfrac{p_{\rm in} (1-p_{\rm out})}{p_{\rm out}(1-p_{\rm in})} \bigg)}. $ For the perfect oracle, this reduces to

(2.9)\begin{equation} \widehat{Z} ^{\operatorname{MAP}} \ = \ \mathrm{arg\,min}_{\substack{z \in \{-1,1\}^n \\ z_{\ell} = s_{\ell} } } \, z^T \left( -A + \tau 1_n 1_n^T \right) z. \end{equation}
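When the model parameters are known, the constants τ and $\lambda^*$ of Corollary 2.3 follow directly from their closed-form expressions. The helper below is a minimal sketch (the name `sbm_map_parameters` and the example values are assumptions); it requires $\eta_0 \gt 0$, since $\lambda^*$ is infinite for a perfect oracle.

```python
import math

def sbm_map_parameters(p_in, p_out, eta1, eta0):
    """Return (tau, lambda_star) from the expressions of Corollary 2.3."""
    if not (0 < p_out < p_in < 1):
        raise ValueError("need 0 < p_out < p_in < 1")
    if not (0 < eta0 < eta1):
        raise ValueError("need an informative, imperfect oracle: 0 < eta0 < eta1")
    # Common denominator: log-odds ratio between intra- and inter-cluster edges.
    log_odds = math.log(p_in * (1 - p_out) / (p_out * (1 - p_in)))
    tau = math.log((1 - p_out) / (1 - p_in)) / log_odds
    lam_star = math.log(eta1 / eta0) / log_odds
    return tau, lam_star

tau, lam_star = sbm_map_parameters(p_in=0.1, p_out=0.05, eta1=0.3, eta0=0.1)
```

For these example values, τ is roughly 0.072 and $\lambda^*$ is roughly 1.47.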

3. Almost exact recovery using a continuous relaxation

As finding the MAP estimate is NP-hard [Reference Wagner and Wagner26], we perform a continuous relaxation (Section 3.1). We then give an upper bound on the number of misclassified nodes in Section 3.2.

3.1. Continuous relaxation of the MAP

For the sake of presentation simplicity, we focus on the MAP for SBM, that is, minimization problem (2.8). We perform a continuous relaxation mirroring what is commonly done for spectral methods [Reference Newman24], namely

(3.1)\begin{align} \widehat{X} \ = \ \mathrm{arg\,min}_{\substack{x \in \mathbf{R}^{n} \\ \sum_i \kappa_i x_i^2 = \sum_i \kappa_i } } \left( -x^T A_\tau x + \lambda ( s - \mathcal{P} x )^T ( s - \mathcal{P} x ) \right), \end{align}

where $A_{\tau} = A - \tau 1_n 1_n^T$ and $\kappa = (\kappa_1, \dots, \kappa_n)$ is a vector of positive entries. We choose to constrain x on the hyper-sphere $\|x\|^2 = n$ by letting $\kappa_i = 1$, but other choices would lead to a similar analysis. In particular, in the numerical Section 4 we will compare this choice with a degree-normalization approach (i.e., $\kappa_i = d_i$).

We further note that for the perfect oracle, the corresponding relaxation of (2.9) is

(3.2)\begin{equation} \widehat{X} = \mathrm{arg\,min}_{\substack{x \in \mathbf{R}^n \\ x_{\ell} = s_{\ell} \\ \|x\|^2 = n } } \left( - x^T A_\tau x \right). \end{equation}

Given the classification vector $\widehat{X} \in \mathbf{R}^{n}$, node i is classified into cluster $\widehat{Z} _i \in \{-1, 1\}$ such that

(3.3)\begin{equation} \widehat{Z} _i \ = \ \left\{ \begin{array}{ll} 1, & \qquad \mathrm{if}\quad \widehat{X} _i \gt 0, \\ -1, & \qquad \mathrm{otherwise}. \\ \end{array} \right. \end{equation}

Let us solve the minimization problem (3.1). By letting $\gamma \in \mathbf{R}$ be the Lagrange multiplier associated with the constraint $\|x\|^2 = n$, the Lagrangian of the optimization problem (3.1) is

\begin{equation*} -x^T A_\tau x + \lambda ( s - \mathcal{P} x )^T ( s - \mathcal{P} x ) - \gamma \left( x^T x - n \right). \end{equation*}

This leads to the constrained linear system

(3.4)\begin{align} \left\{ \begin{array}{rc} \left( - A_\tau + \lambda \mathcal{P} - \gamma I_n \right) x \ = \ & \lambda s, \\ x^T x \ = \ & n, \end{array} \right. \end{align}

whose unknowns are γ and x.

While [Reference Mai and Couillet22] let γ be a hyper-parameter (hence the norm constraint $x^T x = n$ is no longer satisfied), the exact optimal value of γ can be found explicitly following [Reference Gander, Golub and Von Matt14]. Firstly, we note that if $(\gamma_1,x_1)$ and $(\gamma_2,x_2)$ are solutions of the system (3.4), then (see Lemma D.1 for the derivations)

\begin{align*} \mathcal{C} (x_1) - \mathcal{C} (x_2) \ = \ \frac{\gamma_1 - \gamma_2 }{2} \, \left\| x_1 - x_2 \right\|^2, \end{align*}

where $\mathcal{C}(x) = - x^T A_\tau x + \lambda (s-\mathcal{P} x)^T (s-\mathcal{P} x)$ is the cost function minimized in (3.1). Hence, among the solution pairs $(\gamma, x)$ of the system (3.4), the solution of the minimization problem (3.1) is the vector x associated with the smallest γ.
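The identity above can be checked in a few lines; the following is only a sketch of the argument, the full derivation being in Lemma D.1. Write $H := -A_\tau + \lambda \mathcal{P}$ (a shorthand introduced here for convenience). Since $\mathcal{P} s = s$ and $\mathcal{P}^2 = \mathcal{P}$, the cost is $\mathcal{C}(x) = x^T H x - 2 \lambda s^T x + \lambda s^T s$. For a solution $(\gamma_k, x_k)$ of (3.4) we have $H x_k = \gamma_k x_k + \lambda s$ and $x_k^T x_k = n$, hence

\begin{align*} \mathcal{C}(x_k) & \ = \ \gamma_k n - \lambda s^T x_k + \lambda s^T s, \\ x_2^T H x_1 = x_1^T H x_2 \ & \Longrightarrow \ (\gamma_1 - \gamma_2) \, x_1^T x_2 \ = \ \lambda s^T (x_1 - x_2), \\ \mathcal{C}(x_1) - \mathcal{C}(x_2) & \ = \ (\gamma_1 - \gamma_2) \left( n - x_1^T x_2 \right) \ = \ \frac{\gamma_1 - \gamma_2}{2} \left\| x_1 - x_2 \right\|^2, \end{align*}

where the last equality uses $\|x_1 - x_2\|^2 = 2n - 2 x_1^T x_2$.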

Secondly, the eigenvalue decomposition of $-A_{\tau} + \lambda \mathcal{P}$ reads as

\begin{align*} -A_{\tau} + \lambda \mathcal{P} \ = \ Q \Delta Q^T, \end{align*}

where $\Delta = \mathrm{diag} ( \delta_1, \dots, \delta_n)$ with $\delta_1 \leq \dots \leq \delta_n$ and $Q^T Q = I_n$. Therefore, after the change of variables $u = Q^T x$ and $b = \lambda Q^T s$, the system (3.4) is transformed to

\begin{align*} \left\{ \begin{array}{rl} \Delta u & \ = \ \gamma u + b, \\ u^T u & \ = \ n. \end{array} \right. \end{align*}

Thus, the solution $\widehat{X} $ of the optimization problem (3.1) satisfies

(3.5)\begin{align} \left( - A_\tau + \lambda \mathcal{P} - \gamma_* I_n \right) \widehat{X} & \ = \ \lambda s, \end{align}

where $\gamma_*$ is the smallest solution of the explicit secular equation [Reference Gander, Golub and Von Matt14]

(3.6)\begin{align} \sum_{i=1}^n \left( \frac{b_i }{ \delta_i - \gamma } \right)^2 - n \ = \ 0. \end{align}
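The smallest root of (3.6) lies below $\delta_1$, where the left-hand side is increasing in γ, so it can be located by bisection. The sketch below is one possible way to do this and is not taken from the paper: the function name, the bracketing interval, and the example values are assumptions. It assumes that b has a non-zero component on the smallest eigenvalue; the degenerate case is not handled.

```python
import math

def smallest_secular_root(delta, b, n, iters=200):
    """Smallest root of sum_i (b_i / (delta_i - gamma))^2 = n, cf. (3.6)."""
    d_min = min(delta)
    # Norm of the part of b attached to the smallest eigenvalue.
    b_low = math.sqrt(sum(bi * bi for bi, di in zip(b, delta) if di == d_min))
    if b_low == 0.0:
        raise ValueError("degenerate case: b has no component on the smallest eigenvalue")
    b_norm = math.sqrt(sum(bi * bi for bi in b))

    def f(gamma):
        return sum((bi / (di - gamma)) ** 2 for bi, di in zip(b, delta)) - n

    # Bracket: every term is at most (b_i / (d_min - gamma))^2, so f(lo) <= 0,
    # while the terms attached to d_min alone already give f(hi) >= 0.
    lo = d_min - b_norm / math.sqrt(n)
    hi = d_min - b_low / math.sqrt(n)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = [-3.0, 0.5, 2.0]
b = [1.0, 2.0, 0.5]
gamma_star = smallest_secular_root(delta, b, n=3)
residual = sum((bi / (di - gamma_star)) ** 2 for bi, di in zip(b, delta)) - 3
```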

We summarize this in Algorithm 1. Note that for the sake of generality, we let λ and τ be hyper-parameters of the algorithm. If the model parameters are known, we can use the expressions of λ and τ derived in Corollary 2.3. The choice of λ and τ is further discussed in Section 4. To apply Algorithm 1 to large data sets, one must use power iterations or Krylov subspace methods. The main computational bottleneck in those methods is the matrix-vector product $A_\tau v$. Although the matrix $A_\tau$ is not sparse, it is the sum of a sparse matrix and a rank-one matrix, so the product $A_\tau v = Av - \tau (1_n^T v) 1_n$ can be computed efficiently by subtracting the same scalar $\tau (1_n^T v)$ from all the entries of the sparse matrix-vector product $Av$.
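The rank-one structure of $A_\tau$ can be exploited as follows. This sketch stores the graph as an edge list (an assumption made to keep the example dependency-light; a sparse matrix library would serve the same purpose) and never forms the dense n × n matrix. The function name `a_tau_matvec` is hypothetical.

```python
import numpy as np

def a_tau_matvec(rows, cols, n, tau, v):
    """Compute (A - tau * 1 1^T) v from an undirected edge list (rows[k], cols[k])."""
    v = np.asarray(v, dtype=float)
    av = np.zeros(n)
    np.add.at(av, rows, v[cols])    # contribution of edge (i, j) to entry i
    np.add.at(av, cols, v[rows])    # symmetric contribution to entry j
    return av - tau * v.sum()       # subtract the scalar tau * (1^T v) everywhere

edges = np.array([(0, 1), (1, 2), (0, 3)])
rows, cols = edges[:, 0], edges[:, 1]
v = np.array([1.0, 2.0, 3.0, 4.0])
fast = a_tau_matvec(rows, cols, n=4, tau=0.5, v=v)

# Dense reference, only for checking on this tiny example.
A = np.zeros((4, 4))
A[rows, cols] = 1.0
A[cols, rows] = 1.0
slow = (A - 0.5 * np.ones((4, 4))) @ v
```

The cost of `a_tau_matvec` is proportional to the number of edges plus n, whereas the dense product costs on the order of n² operations.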

Algorithm 1. Semi-supervised learning with regularized adjacency matrix.
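The listing of Algorithm 1 is not reproduced in this text. The following sketch assembles its steps from Eqs. (3.3), (3.5), and (3.6) for small graphs. It is an illustration under stated assumptions, not the authors' implementation: it uses a dense eigendecomposition and a bisection for the secular equation, and the function name, the test graph, and the values of λ and τ are choices made for the example.

```python
import numpy as np

def ssl_regularized_adjacency(A, s, lam, tau, iters=200):
    """Solve the relaxation (3.1) with kappa_i = 1 and threshold as in (3.3)."""
    A = np.asarray(A, dtype=float)
    s = np.asarray(s, dtype=float)
    n = A.shape[0]
    P = np.diag((s != 0).astype(float))            # indicator of labeled nodes
    H = -(A - tau * np.ones((n, n))) + lam * P     # -A_tau + lam * P
    delta, Q = np.linalg.eigh(H)                   # eigenvalues in ascending order
    b = lam * (Q.T @ s)
    if b[0] == 0.0:
        raise ValueError("degenerate case: s is orthogonal to the first eigenvector")

    def f(gamma):                                  # left-hand side of (3.6)
        return np.sum((b / (delta - gamma)) ** 2) - n

    # The smallest root lies below delta[0]; bracket it and bisect.
    lo = delta[0] - np.linalg.norm(b) / np.sqrt(n)
    hi = delta[0] - abs(b[0]) / np.sqrt(n)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    x = Q @ (b / (delta - gamma))                  # solves (3.5)
    z_hat = np.where(x > 0, 1, -1)                 # decision rule (3.3)
    return z_hat, x, gamma

# Small SBM example with a noisy oracle (values chosen for illustration).
rng = np.random.default_rng(0)
n, p_in, p_out = 200, 0.5, 0.05
z = np.repeat([1, -1], n // 2)
probs = np.where(np.equal.outer(z, z), p_in, p_out)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)
u = rng.random(n)
s = np.where(u < 0.2, z, np.where(u < 0.22, -z, 0))   # eta1 = 0.2, eta0 = 0.02
z_hat, x, gamma = ssl_regularized_adjacency(A, s, lam=1.0, tau=(p_in + p_out) / 2)
unlabeled = s == 0
error = np.mean(z_hat[unlabeled] != z[unlabeled])
```

For large graphs the dense eigendecomposition should be replaced by iterative methods, as discussed above.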

3.2. Ratio of misclassified nodes

This section gives bounds on the number of unlabeled nodes misclassified by Algorithm 1. We then specialize the results for some particular cases.

Theorem 3.1. Consider a DC-SBM with a noisy oracle as defined in (2.1) and (2.3). Let $\bar{d} = \frac{n}{2}(p_{\rm in} + p_{\rm out})$ and $\bar{\alpha} = \frac{n}{2}(p_{\rm in} - p_{\rm out})$. Suppose that $\tau \gt p_{\rm out}$ and that $\eta_0 n \sqrt{\eta_1+\eta_0} \ll \lambda$, and let $\widehat{Z} $ be the output of Algorithm 1. Then, for any r > 0, there exists a constant C such that the proportion of misclassified unlabeled nodes satisfies

\begin{equation*} L \left( \widehat{Z} _u , Z_u \right) \ \le \ C \left( \frac{p_{\rm in} + p_{\rm out} }{p_{\rm in} - p_{\rm out} } \right)^2 \Bigg(\dfrac{\bar{\alpha } + \lambda }{ \lambda } \Bigg)^2 \frac{1 }{(\eta_1 + \eta_0) \left( \eta_1 - \eta_0 \right)^2 \bar{d} }, \end{equation*}

with probability at least $1 - n^{-r}$.
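To get a feeling for how the bound scales with the parameters, one can evaluate its right-hand side without the unspecified constant C. The helper below is a hypothetical illustration; its output is only meaningful up to that constant.

```python
def theorem_3_1_rate(n, p_in, p_out, eta1, eta0, lam):
    """Right-hand side of the bound of Theorem 3.1, with the constant C omitted."""
    d_bar = n * (p_in + p_out) / 2
    alpha_bar = n * (p_in - p_out) / 2
    ratio = ((p_in + p_out) / (p_in - p_out)) ** 2
    oracle = (eta1 + eta0) * (eta1 - eta0) ** 2
    return ratio * ((alpha_bar + lam) / lam) ** 2 / (oracle * d_bar)

# Example: d_bar = 60 and alpha_bar = 40; lam is set equal to alpha_bar.
rate = theorem_3_1_rate(n=1000, p_in=0.1, p_out=0.02, eta1=0.3, eta0=0.1, lam=40.0)
# A larger lam decreases the factor ((alpha_bar + lam) / lam)^2 toward one.
rate_large_lam = theorem_3_1_rate(n=1000, p_in=0.1, p_out=0.02, eta1=0.3, eta0=0.1, lam=400.0)
```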

The value of λ in Theorem 3.1 serves as a hyper-parameter of the algorithm and may not necessarily be equal to the value $\lambda^*$ computed in Corollary 2.3. Consequently, one can opt for a λ significantly larger than $\eta_0 n \sqrt{\eta_0+\eta_1}$, even if the $\lambda^*$ from Corollary 2.3 is not much larger than $\eta_0 n \sqrt{\eta_0+\eta_1}$. Selecting $\lambda \gt \lambda^*$ indicates an excessive reliance on the information provided by the oracle, but it has a benign effect on the error bound of the unlabeled nodes given in Theorem 3.1.

The core of the proof relies on the concentration of the adjacency matrix toward its expectation. This result, as presented in [Reference Le, Levina and Vershynin19], holds under loose assumptions: it is valid for any random graph whose edges are independent of each other. To use this result for $\bar{d} = o\big(\log n \big)$, one needs to replace the matrix $A_\tau$ by $A_\tau' = A' - \tau 1_n 1_n^T$, where $A'$ is the adjacency matrix of the graph obtained after reducing the weights on the edges incident to the high degree vertices. We refer to [Reference Le, Levina and Vershynin19, Sect. 1.4] for more details. This extra technical step is not necessary when $\bar{d} = \Omega(\log n)$. Moreover, concentration also occurs if we replace the adjacency matrix with the normalized Laplacian in Eq. (3.5). In that case, we obtain a generalization of the Label Spreading algorithm [Reference Zhou, Bousquet, Lal, Weston and Schölkopf29], [Reference Chapelle, Schölkopf and Zien11, Chap. 11].

In the following, the mean-field graph refers to the weighted graph formed by the expected adjacency matrix of a DC-SBM graph. Furthermore, we assume without loss of generality that the first $n/2$ nodes are in the first cluster and the last $n/2$ are in the second cluster. Therefore, $ \mathbb{E} A = Z B Z^T $ with $ B = \begin{pmatrix} p_{\rm in} & p_{\rm out} \\ p_{\rm out} & p_{\rm in} \end{pmatrix} $ and $ Z = \begin{pmatrix} 1_{n/2} & 0_{n/2} \\ 0_{n/2} & 1_{n/2} \end{pmatrix} . $ In particular, the coefficients $\theta_i$ disappear because $\mathbb{E} \theta_i = 1$. We consider the setting in which the diagonal elements of $\mathbb{E} A$ are not zeros. This amounts to modifying the definition of DC-SBM so that self-loops can occur with probability $p_{\rm in}$. Nevertheless, we could set the diagonal elements of $\mathbb{E} A$ to zeros and our results would still hold at the expense of cumbersome expressions. Note that the matrix $\mathbb{E} A$ has two non-zero eigenvalues: $\bar{d} = n \frac{p_{\rm in} + p_{\rm out}}{2} $ and $\bar{\alpha } = n \frac{p_{\rm in} - p_{\rm out}}{2} $.
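The spectrum of the mean-field matrix can be verified numerically on a small instance. The sizes and probabilities below are arbitrary example values.

```python
import numpy as np

n, p_in, p_out = 8, 0.6, 0.2
half = n // 2

# Membership matrix Z (n x 2) and block matrix B, as in the text.
Z = np.zeros((n, 2))
Z[:half, 0] = 1.0
Z[half:, 1] = 1.0
B = np.array([[p_in, p_out], [p_out, p_in]])
EA = Z @ B @ Z.T                                    # expected adjacency, self-loops included

eigenvalues = np.sort(np.linalg.eigvalsh(EA))[::-1]  # descending order
d_bar = n * (p_in + p_out) / 2
alpha_bar = n * (p_in - p_out) / 2
```

The two leading entries of `eigenvalues` coincide with `d_bar` and `alpha_bar`, and the remaining ones vanish up to rounding.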

Proof of Theorem 3.1

We prove the statement in three steps. We first show that the solution $\widehat{X} $ of the constrained linear system (3.4) is concentrated around the solution $\bar{x} $ of the same system for the mean-field model. Then, we compute $\bar{x} $ and show that we can retrieve the correct cluster assignment from it. We finally conclude with the derivation of the bound.

(i) Similarly to [Reference Avrachenkov, Kadavankandy and Litvak4] and [Reference Avrachenkov and Dreveton3], let us rewrite Eq. (3.5) as a perturbation of a system of linear equations corresponding to the mean-field solution. We thus have

    \begin{equation*} \big( \mathbb{E} \tilde{\mathcal{L}} + \Delta \tilde{\mathcal{L}} \big) \big( \bar{x} + \Delta x \big) = \lambda s, \end{equation*}

    where $\tilde{\mathcal{L}} = - A_\tau + \lambda \mathcal{P} - \gamma_* I_n$, $\Delta x := \widehat{X} - \bar{x} $ and $\Delta \tilde{\mathcal{L}} := \tilde{\mathcal{L}} - \mathbb{E} \tilde{\mathcal{L}}$.

We recall that a perturbation of a system of linear equations $ (A + \Delta A) (x + \Delta x) = b $ leads to the following sensitivity inequality (see, e.g., [Reference Horn and Johnson15, Sect. 5.8]):

\begin{equation*} \dfrac{\|\Delta x\|}{\|x\|} \ \le \ \kappa(A) \dfrac{\| \Delta A \| }{\| A \| }, \end{equation*}

where $\|.\|$ is the operator norm associated with a vector norm $\|.\|$ (we use the same notations for simplicity) and $\kappa(A) := \|A^{-1}\| \cdot \|A\|$ is the condition number. In our case, the above inequality can be rewritten as follows:

(3.7)\begin{equation} \dfrac{\left\| \widehat{X} - \bar{x} \right\| }{\left\| \bar{x} \right\| } \ \le \ \left\| \left( \mathbb{E} \: \tilde{\mathcal{L}} \right)^{-1} \right\| \cdot \left\| \Delta \: \tilde{\mathcal{L}} \right\|, \end{equation}

employing the Euclidean vector norm and spectral operator norm. The spectral study of $\mathbb{E} \: \tilde{\mathcal{L}}$ (see Corollary B.3 in Appendix B.1) gives:

\begin{align*} \left\| \left( \mathbb{E} \: \tilde{\mathcal{L}}\right)^{-1} \right\| & \ = \ \dfrac{1}{\min \big \{|\lambda| : \lambda \in \mathrm{Sp} \big( \mathbb{E} \: \tilde{\mathcal{L}} \big) \big \} } = \dfrac{1}{- t_2^+ -\bar{\gamma}_* }, \end{align*}

where $t_2^+$ is defined in Corollary B.3 in Appendix B.1 and $\bar{\gamma}_*$ is the solution of Eq. (3.6) for the mean-field model. Lemma B.4 in Appendix B.1.1 leads to

(3.8)\begin{align} \left\| \left( \mathbb{E} \: \tilde{\mathcal{L}} \right)^{-1} \right\| & \ \le \ \frac{1}{\lambda + \bar{\alpha} }. \end{align}

The last ingredient needed is the concentration of the adjacency matrix around its expectation. We have

\begin{align*} \Big\|\tilde{\mathcal{L}} - \mathbb{E} \tilde{\mathcal{L}} \Big\| \ \le \ \left\| \left( \gamma_* - \bar{\gamma }_* \right) I_n \right\| + \| A - \mathbb{E} \: A \| \ \le \ \left| \: \gamma_* - \bar{\gamma }_* \: \right| + \| A - \mathbb{E} \: A \|. \end{align*}

Proposition B.5 in Appendix B.1.2 shows that

\begin{align*} \left| \: \gamma_* - \bar{\gamma} _* \: \right| & \ \le \ \left( 1 + \frac{\left( \bar{\alpha} + \lambda \right)^3 }{2 \sqrt{\eta_1+\eta_0} ( \eta_1 - \eta_0 ) \bar{\alpha}^2 \lambda } \right) \sqrt{\overline{d}}. \end{align*}

Moreover, when $\bar{d} = \Omega(\log n)$, it is shown in [Reference Feige and Ofek13] that for every r > 0 there exists a constant $C'$ such that $\left\| A - \mathbb{E} \: A \right\| \ \le \ C' \sqrt{\overline{d}}$ holds with probability at least $1-n^{-r}$. If $\bar{d} = o(\log n)$, the same result holds with a proper preprocessing on A, and we refer the reader to [Reference Le, Levina and Vershynin19] for more details. We will omit this extra step in the proof to keep notations short. Using this concentration bound, we have

\begin{align*} \Big\|\tilde{\mathcal{L}} - \mathbb{E} \tilde{\mathcal{L}} \Big\| & \ \le \ \left( C' + \frac{27 \left( \bar{\alpha} + \lambda \right)^3 }{\sqrt{2} \sqrt{\eta_1+\eta_0} ( \eta_1 - \eta_0 ) \bar{\alpha}^2 \lambda } \right) \sqrt{\overline{d}} \\ & \ \le \ \left( C' + \frac{27}{\sqrt{2}} \right) \frac{(\lambda+\bar{\alpha})^3 }{\bar{\alpha}^2 \lambda } \frac{\sqrt{\overline{d}}}{\sqrt{\eta_1+\eta_0} \left( \eta_1-\eta_0 \right)} \end{align*}

for some constant $C'$. Let $C = C' + \frac{27}{\sqrt{2}}$. By combining the above with inequality (3.8), the inequality (3.7) becomes

(3.9)\begin{equation} \dfrac{\left\| \widehat{X} - \bar{x} \right\| }{\left\| \bar{x} \right\| } \ \le \ C \, \frac{(\lambda+\bar{\alpha})^2 }{\bar{\alpha}^2 \lambda } \frac{\sqrt{\overline{d}} }{\sqrt{\eta_1+\eta_0} \left( \eta_1-\eta_0 \right)}. \end{equation}

(ii) Node i in the mean-field model is correctly classified by decision rule (3.3) if the sign of $\bar{x} _i$ equals the sign of $Z_i$. Corollary C.2 in Appendix C shows that this is indeed the case for the unlabeled nodes.

(iii) Finally, for an unlabeled node i to be correctly classified, the node’s value $\widehat{X} _{i}$ should be close enough to its mean-field value $\bar{x} _{i}$.

In particular, part (ii) shows that if $|\widehat{X} _{i} - \bar{x} _{i}|$ is smaller than some non-vanishing constant β, then an unlabeled node i will be correctly classified. An unlabeled node i is said to be β-bad if $\left| \widehat{X} _i - \bar{x} _i \right| \gt \beta$. We denote by $S_\beta$ the set of β-bad nodes. The nodes that are not β-bad are almost surely correctly classified, and thus $ L \left( \widehat{Z} _u, Z_u \right) \le \frac{| S_\beta | }{n } $.

From $\left\| \widehat{X} - \bar{x} \right\|^2 \geq \sum\limits_{i \in S_{\beta} } \left| \widehat{X} _i - \bar{x} _i \right|^2$, it follows that $\left\| \widehat{X} - \bar{x} \right\|^2 \geq | S_{\beta} | \times \beta^2$. Thus, using inequality (3.9) and the norm constraint $\left\| \bar{x} \right\|^2 = n$, we have with probability at least $1-n^{-r}$,

\begin{align*} \left| S_{\beta} \right| & \ \le \ \dfrac{1}{\beta^2} \left( \frac{C}{\eta_1 - \eta_0} \frac{\bar{\alpha} + \lambda }{\bar{\alpha} \lambda} \sqrt{\overline{d}} \right)^2 n, \end{align*}

for some constant C. We end the proof by noticing that $\frac{\bar{d} }{\bar{\alpha} } = \frac{p_{\rm in} + p_{\rm out}}{p_{\rm in} - p_{\rm out}}$.

Corollary 3.2. (Almost exact recovery in the diverging degree regime)

Consider a DC-SBM such that $\bar{d} \gg 1$, $\frac{p_{\rm in} + p_{\rm out}}{p_{\rm in} - p_{\rm out}} = O(1)$, $\sqrt{\eta_0+\eta_1}(\eta_1 - \eta_0) \gg \frac{1}{\sqrt{\overline{d}}}$, and $\eta_0 n \sqrt{\eta_0+\eta_1} \ll \lambda$. Suppose that $\tau \gt p_{\rm out}$ and $\lambda \gtrsim \bar{\alpha }$. Then, Algorithm 1 correctly classifies almost all the unlabeled nodes.

Proof. Under the corollary’s assumptions, $(\eta_1-\eta_0)^2 \bar{d} \rightarrow +\infty$ and $\frac{\bar{\alpha } + \lambda}{\lambda} = O(1)$. Hence, by Theorem 3.1, the fraction of misclassified nodes is o(1).

The quantity $(\eta_1 - \eta_0)n$ is the expected difference between the number of nodes correctly labeled and the number of nodes wrongly labeled by the oracle. In particular, Corollary 3.2 allows for a sub-linear number of labeled nodes since $\eta_0$ and $\eta_1$ can go to zero.
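To illustrate the interpretation of $(\eta_1 - \eta_0) n$, the following sketch simulates a noisy oracle. It assumes the oracle model suggested by Eq. (A.2): each node is independently labeled correctly with probability $\eta_1$, wrongly with probability $\eta_0$, and left unlabeled ($s_i = 0$) otherwise. The function name `noisy_oracle` and the parameter values are illustrative choices.

```python
import numpy as np

def noisy_oracle(z, eta1, eta0, rng):
    """Return s in {-1, 0, +1}^n: s_i = z_i w.p. eta1, s_i = -z_i w.p. eta0, else 0."""
    u = rng.random(z.shape[0])
    s = np.zeros_like(z)
    correct = u < eta1
    wrong = (u >= eta1) & (u < eta1 + eta0)
    s[correct] = z[correct]
    s[wrong] = -z[wrong]
    return s

rng = np.random.default_rng(1)
n = 100_000
z = rng.choice([-1, 1], size=n)          # true community labels
s = noisy_oracle(z, eta1=0.03, eta0=0.01, rng=rng)
surplus = int(np.sum(s * z))             # (#correct labels) - (#wrong labels)
print(surplus, (0.03 - 0.01) * n)
```

The empirical surplus fluctuates around $(\eta_1 - \eta_0) n$, while the oracle-mistake ratio is $\frac{\eta_0}{\eta_0 + \eta_1}$.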

Corollary 3.3. (Detection in the constant degree regime)

Consider a DC-SBM such that $p_{\rm in} = \frac{c_{\rm in}}{n}$ and $p_{\rm out} = \frac{c_{\rm out}}{n}$ where $c_{\rm in}, c_{\rm out}$ are constants. Suppose that $\sqrt{\eta_0+\eta_1}(\eta_1 - \eta_0)$ is a non-zero constant, and let $\tau \gt 2p_{\rm out}$ and $\lambda \gtrsim 1$. Assume furthermore that $\eta_0 n \sqrt{\eta_0+\eta_1} \ll \lambda$. Then, for $\frac{(c_{\rm in} - c_{\rm out})^2}{c_{\rm in} + c_{\rm out}}$ bigger than some constant, w.h.p. Algorithm 1 performs better than a random guess.

Proof. According to Theorem 3.1, the fraction of misclassified nodes is smaller than $\frac{1}{2}$ when $\frac{(c_{\rm in} - c_{\rm out})^2}{c_{\rm in} + c_{\rm out}}$ is larger than $\frac{4C}{(\eta_1 - \eta_0)^2} \left( \frac{\bar{\alpha } + \lambda}{\lambda} \right)^2$, which is indeed upper-bounded by a constant under the corollary’s assumptions.

The quantity $\frac{(c_{\rm in} - c_{\rm out})^2}{c_{\rm in} + c_{\rm out}}$ can be interpreted as the signal-to-noise ratio. Unfortunately, Corollary 3.3 does not give explicit control over the constant appearing in its statement. This constant comes from the concentration of the adjacency matrix. Similar remarks were made in [Reference Le, Levina and Vershynin19] for the analysis of spectral clustering of SBM graphs in the constant degree regime.

4. Numerical experiments

This section presents numerical experiments both on simulated data sets generated from DC-SBMs and on real networks. In particular, we discuss the impact of the oracle mistakes (defined by the ratio $\frac{\eta_0}{\eta_0+\eta_1}$) on the performance of the algorithms. The code for the numerical experiments is available on GitHub at https://github.com/mdreveton/ssl-sbm

4.1. Synthetic data sets

4.1.1. Choice of λ and τ

Let us denote by $\sigma_1$ and $\sigma_2$ the largest and second largest eigenvalues of A. We choose $\tau = \frac{4}{n} (\sigma_1+\sigma_2)$ and $\lambda = \frac{\log \frac{\eta_1}{\eta_0} }{\log \frac{\sigma_1 + \sigma_2}{\sigma_1-\sigma_2} }$, if $\eta_0 \not=0$, and $\lambda = \frac{\log \left( n \eta_1 \right) }{\log \frac{\sigma_1 + \sigma_2}{\sigma_1-\sigma_2} }$, otherwise. The heuristic for this choice is as follows. For an SBM graph, we have $\sigma_1 \approx \frac{n}{2}\left(p_{\rm in} + p_{\rm out} \right)$ and $\sigma_2 \approx \frac{n}{2} ( p_{\rm in} - p_{\rm out})$, hence $ \frac{4}{n} (\sigma_1+\sigma_2) \approx 4 p_{\rm in} \gt p_{\rm out} $, and τ satisfies the condition of Theorem 3.1. For λ, we have $ \frac{\log \frac{\eta_1}{\eta_0} }{\log \frac{\sigma_1 + \sigma_2}{\sigma_1-\sigma_2} } \approx \frac{\log \frac{\eta_1}{\eta_0} }{\log \frac{p_{\rm in}}{p_{\rm out}} }$, which is indeed close to the expression of λ derived in Corollary 2.3 if $p_{\rm in}, p_{\rm out} = o(1)$.
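The parameter choice can be sketched in a few lines. This is a minimal illustration rather than the code of the repository: the function name `choose_tau_lambda` is hypothetical, a dense symmetric adjacency matrix is assumed, and the sanity check uses the expected adjacency matrix of a two-block SBM (an assumption made so that $\sigma_1$ and $\sigma_2$ are known exactly).

```python
import numpy as np

def choose_tau_lambda(A, eta1, eta0):
    """Evaluate the formulas for tau and lambda from the two largest eigenvalues of A."""
    n = A.shape[0]
    eig = np.linalg.eigvalsh(A)              # eigenvalues of a symmetric matrix, ascending
    sigma1, sigma2 = eig[-1], eig[-2]
    tau = 4.0 / n * (sigma1 + sigma2)
    denom = np.log((sigma1 + sigma2) / (sigma1 - sigma2))   # requires sigma1 > sigma2
    num = np.log(eta1 / eta0) if eta0 != 0 else np.log(n * eta1)
    return tau, num / denom

# Sanity check on an expected adjacency matrix with two equal-size blocks.
n, p_in, p_out = 200, 0.3, 0.1
z = np.repeat([1, -1], n // 2)
EA = np.where(np.outer(z, z) > 0, p_in, p_out)
# Here sigma1 = n (p_in + p_out) / 2 and sigma2 = n (p_in - p_out) / 2.
tau, lam = choose_tau_lambda(EA, eta1=0.08, eta0=0.02)
print(tau, lam)
```

On this idealized input, the value of λ reduces to $\log \frac{\eta_1}{\eta_0} / \log \frac{p_{\rm in}}{p_{\rm out}}$, in line with the heuristic above.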

4.1.2. Choice of relaxation

We first compare the choice of the constraint in the continuous relaxation (3.1). Specifically, we compare the choice $\sum_i x_i^2 = n$ (we refer to it as standard relaxation) versus $\sum_i d_i x_i^2 = 2|E|$ (we refer to it as degree-normalized relaxation). This leads to two versions of Algorithm 1, whose costs on SBM graphs with a noisy oracle are presented in Figure 1. We observe that the degree-normalized choice leads to a smaller cost. Therefore, in the following we only consider the version of Algorithm 1 solving the relaxed problem (3.1) with constraint $\sum_{i} d_i x_i^2 = 2 |E|$ instead of $\sum_i x_i^2 = n$.

Figure 1. Cost in Algorithm 1 with the standard and normalized versions of the constraint, on 50 realizations of SBM with $n = 500, p_{\rm out} = 0.03$ and 50 labeled nodes with $10\%$ noise.

4.1.3. Experiments on synthetic graphs

We first consider clustering on DC-SBM. We set n = 2000, $p_{\rm in} = 0.04$, and $p_{\rm out} = 0.02$. We consider three scenarios.

  • In Figure 2(a) we consider a standard SBM ($\theta_i= 1$ for all i);

  • In Figure 2(b) we generate $\theta_i$ according to $|\mathcal{N}( 0, \sigma^2)| + 1 - \sigma \sqrt{2/\pi} $ where $| \mathcal{N}(0, \sigma^2) |$ denotes the absolute value of a normal random variable with mean 0 and variance $\sigma^2$. We take σ = 0.25. Note that this definition enforces $\mathbb{E} \theta_i = 1$.

  • In Figure 2(c) we generate $\theta_i$ from a Pareto distribution with density function $f(x) = \frac{a m^a}{x^{a+1} } 1(x \geq m)$ with a = 3 and $m = 2/3$ (chosen such that $\mathbb{E} \theta_i = 1$).
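The three scenarios can be reproduced along the following lines. This is a sketch under stated assumptions: the helper names `sample_theta` and `sample_dcsbm` are hypothetical, edges are drawn independently with probability $\theta_i \theta_j p_{\rm in}$ or $\theta_i \theta_j p_{\rm out}$, and probabilities are clipped to $[0, 1]$ (the clipping is an assumption needed for heavy-tailed $\theta$, not something stated in the text).

```python
import numpy as np

def sample_theta(kind, n, rng, sigma=0.25, a=3.0, m=2.0 / 3.0):
    """Degree-correction parameters with mean 1 for the three scenarios."""
    if kind == "sbm":
        return np.ones(n)
    if kind == "half_normal":
        return np.abs(rng.normal(0.0, sigma, size=n)) + 1.0 - sigma * np.sqrt(2.0 / np.pi)
    if kind == "pareto":
        # Generator.pareto samples the Lomax (Pareto II) law; adding 1 and
        # scaling by m gives the classical Pareto law with minimum value m.
        return m * (1.0 + rng.pareto(a, size=n))
    raise ValueError(f"unknown scenario: {kind}")

def sample_dcsbm(z, theta, p_in, p_out, rng):
    """Symmetric 0/1 adjacency matrix without self-loops."""
    n = z.shape[0]
    base = np.where(np.outer(z, z) > 0, p_in, p_out)
    prob = np.clip(np.outer(theta, theta) * base, 0.0, 1.0)  # clipping is an assumption
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    return (upper | upper.T).astype(float)

rng = np.random.default_rng(0)
n = 400                                   # smaller than n = 2000 to keep the sketch fast
z = np.repeat([1, -1], n // 2)
theta = sample_theta("half_normal", n, rng)
A = sample_dcsbm(z, theta, p_in=0.04, p_out=0.02, rng=rng)
means = {k: sample_theta(k, 200_000, rng).mean() for k in ("sbm", "half_normal", "pareto")}
print(means)
```

The empirical means stay close to 1 in all three cases, as required by the normalization $\mathbb{E} \theta_i = 1$.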

We compare the performance of Algorithm 1 with that of the algorithm of [Reference Mai and Couillet22] (referred to as Centered similarities) and the Poisson learning algorithm described in [Reference Calder, Cook, Thorpe and Slepcev10]. We chose these two algorithms as references since they perform very well on real data sets and are designed to avoid flat solutions. Results are shown in Figure 2. We observe that when the oracle noise is low, the performance of Algorithm 1 is comparable to Centered similarities. But, when the noise becomes non-negligible, the performance of Centered similarities deteriorates, while the accuracy of Algorithm 1 remains high. We notice that Poisson learning gives poor results on synthetic data sets.

Figure 2. Average accuracy obtained by different semi-supervised clustering methods on DC-SBM graphs, with $n = 2000,\ p_{\rm in} = 0.04$, and $ p_{\rm out} = 0.02$ with different distributions for θ. The number of labeled nodes is equal to 40. Accuracies are computed on the unlabeled nodes, and are averaged over 100 realizations; the error bars show the standard error.

4.2. Experiments on real data

We next use real data to show that even if real networks are not generated by the DC-SBM, Algorithm 1 still performs well.

4.2.1. MNIST

As a real-life example, we perform simulations on the standard MNIST data set [Reference LeCun, Cortes and Burges20]. As preprocessing, we select 1000 images corresponding to two digits and compute the k-nearest-neighbors graph (we take k = 8) with Gaussian weights $w_{ij} = \exp\left( - \|x_i - x_j\|^2 / s_i^2 \right)$ where $x_i$ represents the data for image i and $s_i$ is the average distance between $x_i$ and its k nearest neighbors. Figure 3 gives accuracy for different digit pairs. While the performance of Poisson learning is excellent, it can suffer from the oracle noise. On the other hand, the accuracy of Algorithm 1 remains unchanged.
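The graph construction described above can be sketched as follows. This is an illustration on generic feature vectors, not the preprocessing code used for the experiments: the function name `knn_gaussian_graph` is hypothetical, and the final symmetrization by an entry-wise maximum is an assumption, since the text does not specify how the directed k-nearest-neighbors graph is symmetrized.

```python
import numpy as np

def knn_gaussian_graph(X, k=8):
    """k-NN graph with weights w_ij = exp(-||x_i - x_j||^2 / s_i^2)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    np.fill_diagonal(d2, np.inf)                  # a point is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :k]          # indices of the k nearest neighbors
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    dist = np.sqrt(d2[rows, cols]).reshape(n, k)
    s = dist.mean(axis=1)                         # s_i: mean distance to the k neighbors
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-(dist ** 2) / s[:, None] ** 2).ravel()
    return np.maximum(W, W.T)                     # symmetrization (assumption)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))                  # stand-in for flattened images
W = knn_gaussian_graph(X, k=8)
print(W.shape, int((W > 0).sum()))
```

The sketch assumes distinct data points; duplicated points would give $s_i = 0$ and would need separate handling.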

Figure 3. Average accuracy obtained on a subset of the MNIST data set by different semi-supervised algorithms as a function of the oracle-misclassification ratio, when the number of labeled nodes is equal to 10. Accuracy is averaged over 100 random realizations, and the error bars show the standard error.

To further highlight the influence of the noise, we plot in Figure 4 the accuracy obtained by the three algorithms on the unlabeled nodes, the correctly labeled nodes, and the wrongly labeled nodes. We observe that the hard constraint $X_\ell = s_\ell$ imposed by Centered similarities forces the correctly labeled nodes to be correctly classified. In contrast, the wrongly labeled nodes are not classified much better than a random guess. This heavily penalizes the unlabeled nodes’ accuracy in an extremely noisy setting. On the contrary, Algorithm 1 allows for a smoother recovery: the unlabeled, correctly labeled, and wrongly labeled nodes have roughly the same classification accuracy. While some correctly labeled nodes are misclassified, many wrongly labeled nodes become correctly classified, and the unlabeled nodes are better recovered. Finally, Poisson learning shows a performance somewhere in between these two extreme cases: its accuracy on the unlabeled nodes is excellent, but it fails at correctly classifying the erroneously labeled nodes.
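The three accuracies compared in this paragraph can be computed as in the sketch below. The function name `accuracy_by_group` and the toy vectors are illustrative assumptions; labels are encoded in $\{-1, +1\}$ and $s_i = 0$ marks an unlabeled node.

```python
import numpy as np

def accuracy_by_group(z_true, z_hat, s):
    """Accuracy on unlabeled, correctly labeled, and wrongly labeled nodes."""
    groups = {
        "unlabeled": s == 0,
        "correctly_labeled": (s != 0) & (s == z_true),
        "wrongly_labeled": (s != 0) & (s != z_true),
    }
    return {
        name: float(np.mean(z_hat[mask] == z_true[mask])) if mask.any() else float("nan")
        for name, mask in groups.items()
    }

z_true = np.array([1, 1, 1, -1, -1, -1])
s = np.array([1, -1, 0, -1, 1, 0])       # two correct labels, two wrong, two unlabeled
z_hat = np.array([1, 1, -1, -1, 1, -1])  # hypothetical output of a classifier
acc = accuracy_by_group(z_true, z_hat, s)
print(acc)
```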

Figure 4. Average accuracy obtained on the unlabeled, correctly labeled, and wrongly labeled nodes by the oracle. Simulations are done on the 1,000 digits (2,4). The noisy oracle correctly classifies 24 nodes and misclassifies 16 nodes, and the boxplots show 100 realizations.

4.2.2. Common benchmark networks

Finally, we perform simulations on three benchmark networks: Political Blogs, LiveJournal, and DBLP. These networks are commonly used for graph clustering since the “ground truth” clusters are known. For LiveJournal and DBLP, we consider only the two largest clusters. The dimensions of the data sets are given in Table 1, and the performance of the semi-supervised algorithms is shown in Figure 5. We observe that Algorithm 1 and Poisson learning outperform Centered similarities and can still achieve good accuracy even in the presence of noise in labeled data.

Figure 5. Average accuracy obtained on real networks by different semi-supervised algorithms as a function of the oracle-misclassification ratio. The number of labeled nodes is 30 for Political Blogs and LiveJournal, and 100 for DBLP. Accuracy is averaged over 50 random realizations, and the error bars show the standard error.

Table 1. Parameters of the real data sets. n1 (resp., n2) corresponds to the size of the first (resp., second) cluster, and $|E|$ is the number of edges of the network.

Funding statement

This research has been done within the project of Inria—Nokia Bell Labs “Distributed Learning and Control for Network Analysis.”

Appendix A. Derivation of the MAP

Proof of Theorem 2.2

Bayes’ formula gives $ \mathbb{P}(z \, | \, A, s) \propto \mathbb{P}( A \, | \, z, s ) \ \mathbb{P}( z \, | \, s ), $ where the proportionality symbol hides the factor $1 / \mathbb{P}(A \, | \, s)$, which is independent of z.

The likelihood term can be rewritten as follows:

\begin{align*} \mathbb{P}( A \, | \, z, s ) \ = \ \mathbb{P}( A \, | \, z) \, \propto \, \prod_{\substack{i \lt j \\ z_i = z_j } } \left( \frac{p_{\rm in}}{p_{\rm out}} \frac{1-\theta_i \theta_j p_{\rm out}}{1 - \theta_i \theta_j p_{\rm in}} \right)^{a_{ij}} \left( \frac{1- \theta_i \theta_j p_{\rm in}}{1 - \theta_i \theta_j p_{\rm out} } \right), \end{align*}

where the proportionality hides a constant $ C = \prod\limits_{i \lt j} \left( \frac{\theta_i \theta_j p_{\rm out}}{1 - \theta_i \theta_j p_{\rm out}} \right)^{a_{ij}} \left( 1 -\theta_i \theta_j p_{\rm out} \right)$ independent of z. Hence,

(A.1)\begin{align} \log \mathbb{P}\left( A \, | \, z, s \right) & \ = \ \log C + \frac{1}{2} \sum_{i,j} 1( z_i \not= z_j ) \left( \left( f^{(1)}_{ij} - f^{(0)}_{ij} \right) a_{ij} + f^{(0)}_{ij} \right) \nonumber \\ & \ = \ \log C + \frac{1}{2} \sum_{i,j= 1}^n \frac{1 - z_i z_j}{2} \left( \left( f^{(1)}_{ij} - f^{(0)}_{ij} \right) a_{ij} + f^{(0)}_{ij} \right) \nonumber \\ & \ = \ \log C' - \frac{1}{4} z^T M z \end{align}

for some constant Cʹ and $M = (F_1 - F_0) \odot A + F_0 $.

The oracle information, given by the term $\mathbb{P}( z \, | \, s )$, is equal to

(A.2)\begin{align} \mathbb{P}(z \, | \, s) & \ = \ \prod_{i=1}^n \frac{\mathbb{P}(s_i \, | \, z_i) }{\mathbb{P}(s_i)} \mathbb{P}( z_i ) \nonumber \\ & \ = \ \left( \dfrac{\eta_1}{\eta_1 + \eta_0 }\right)^{\big|\{i\in \ell \colon z_i = s_i \} \big| } \ \left( \dfrac{\eta_0}{\eta_1 + \eta_0}\right)^{\big|\{i\in \ell \colon z_{i} \not= s_{i} \} \big| } \ \left( \dfrac{1}{2} \right)^{n } \nonumber \\ & \ = \ \left( \dfrac{\eta_0}{\eta_1}\right)^{ \big|\{i \in \ell \, \colon z_{i} \not= s_{i} \} \big| } \left( \dfrac{\eta_1}{\eta_1+\eta_0} \right)^{ \big| \ell \big| } \left( \dfrac{1}{2} \right)^{n }, \end{align}

where we used $ \big|\{i \in \ell \colon z_{i} = s_{i} \} \big| + \big|\{i \in \ell \colon z_{i} \not= s_{i} \} \big| = \big| \ell \big|$ in the last line. Noticing that

\begin{align*} \left| \{i \in \ell : z_{i} \not= s_{i} \} \right| \ = \ \frac{1}{4} \sum_{i=1}^n \left( \left( \mathcal{P} z\right)_{i} - s_{i} \right)^2 \ = \ \frac14 \left( \mathcal{P} z - s \right)^T \left( \mathcal{P} z - s \right), \end{align*}

yields

(A.3)\begin{align} \log \mathbb{P}\left( z \, | \, s \right) & \ = \ - \frac{1}{4} \log \left( \frac{\eta_1}{\eta_0} \right) \cdot \left\| \mathcal{P} z - s \right\|^2 + C', \end{align}

where Cʹ is a term independent of z.
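The identity used to pass from Eq. (A.2) to Eq. (A.3), namely that the number of mislabeled nodes equals $\frac{1}{4} \| \mathcal{P} z - s \|^2$, can be verified on a toy example. The sketch assumes that $\mathcal{P}$ is the diagonal 0/1 matrix selecting the labeled nodes (as in Corollary B.3); the sizes and the number of flipped labels are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
z = rng.choice([-1, 1], size=n)                # candidate community assignment
s = np.zeros(n, dtype=int)                     # s_i = 0 means node i is unlabeled
labeled = rng.choice(n, size=12, replace=False)
s[labeled] = z[labeled]
s[labeled[:4]] *= -1                           # the oracle disagrees with z on 4 nodes
P = np.diag((s != 0).astype(int))              # diagonal selector of labeled nodes

mismatches = int(np.sum((s != 0) & (s != z)))
quadratic = 0.25 * float((P @ z - s) @ (P @ z - s))
print(mismatches, quadratic)
```

Each disagreement contributes $(\pm 2)^2 = 4$ to the squared norm, while labeled nodes that agree and unlabeled nodes contribute 0, which explains the factor $\frac{1}{4}$.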

If $\eta_0 \not= 0$, the combination of Eqs. (A.1) and (A.3) with Bayes’ formula gives Expression (2.6). If $\eta_0 = 0$, then from Eq. (A.2) the term $\mathbb{P}( z \, | \, s)$ is non-zero (and constant) if and only if $z_{i} = s_{i}$ for every labeled node $i \in [\ell]$, and we obtain Expression (2.7).

Proof of Corollary 2.3

The proof follows from Theorem 2.2 and the fact that $f_{ij}^{(0) } = \log \frac{1-p_{\rm in} }{1-p_{\rm out} }$ and $f_{ij}^{(1) } = \log \frac{p_{\rm in} }{p_{\rm out} }$.

Appendix B. Lemmas related to mean-field solution of the secular equation

Appendix B.1. Spectral study of a perturbed rank-2 matrix

Lemma B.1. (Matrix determinant lemma)

Suppose $A \in \mathbf{R}^{n \times n}$ is invertible, and let $U, V$ be two n by m matrices. Then $\det(A + U V^T) = \det A \det(I_m + V^T A^{-1} U)$.

Proof. Taking the determinant on both sides of the factorization $ \begin{pmatrix} A & -U \\ V^T & I \end{pmatrix} = \begin{pmatrix} A & 0 \\ V^T & I \end{pmatrix} \cdot \begin{pmatrix} I & -A^{-1} U \\ 0 & I + V^T A^{-1} U \end{pmatrix} $ gives $\det A \det(I_m + V^T A^{-1} U)$ on the right-hand side, while $ \det \begin{pmatrix} A & -U \\ V^T & I \end{pmatrix} = \det I \det\left( A + U V^T \right) $ by the Schur complement formula [Reference Horn and Johnson15, Sect. 0.8.5].
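As a quick numerical illustration of Lemma B.1, the sketch below compares both sides of the identity on random matrices. The dimensions, the seed, and the diagonal shift (used only to keep $A$ well conditioned in practice) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonal shift keeps A well conditioned
U = rng.standard_normal((n, m))
V = rng.standard_normal((n, m))

lhs = np.linalg.det(A + U @ V.T)
# A^{-1} U is computed with a linear solve rather than an explicit inverse.
rhs = np.linalg.det(A) * np.linalg.det(np.eye(m) + V.T @ np.linalg.solve(A, U))
print(lhs, rhs)
```

The practical benefit is that the $n \times n$ determinant of the perturbed matrix reduces to an $m \times m$ determinant, which is what makes the rank-2 computation in Proposition B.2 tractable.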

Proposition B.2. Let $M = Z B Z^T $, where $B = \begin{pmatrix} a & b \\ b & a \end{pmatrix}$ is a $2\times2$ matrix, and $Z = \begin{pmatrix} 1_{n/2} & 0_{n/2} \\ 0_{n/2} & 1_{n/2} \end{pmatrix}$ is an n × 2 matrix. Let m be an even number. We denote by $P_\mathcal{L}$ the n × n diagonal matrix whose first $\frac{m}{2}$ and last $\frac{m}{2}$ diagonal elements are ones, all other elements being zeros. Then, $ \det \Big( t I_n + \lambda P_\mathcal{L} - M \Big) \ = \ t^{n-m-2} (t+\lambda)^{m-2} (t-t_1^+) (t-t_1^-) (t-t_2^+) (t-t_2^-) $ with

\begin{align*} t_1^\pm & \ = \ \dfrac{1}{2}\Bigg( \frac{n}{2}(a+b) - \lambda \pm \sqrt{\Big( \lambda + \frac{n}{2} (a+b) \Big)^2 - 2 (a+b) \lambda m } \Bigg), \\ t_2^\pm & \ = \ \dfrac{1}{2}\Bigg( \frac{n}{2}(a-b) - \lambda \pm \sqrt{\Big( \lambda + \frac{n}{2} (a-b) \Big)^2 - 2 (a-b) \lambda m } \Bigg). \end{align*}

Proof. For now, assume that $t \not = - \lambda$ and $t\not = 0$. Then, $t I_n + \lambda \, P_\mathcal{L} $ is invertible, and by Lemma B.1,

(B.1)\begin{align} \det \Big( t I_n + \lambda P_\mathcal{L} - M \Big) & \ = \ \det( tI_n + \lambda P_\mathcal{L} ) \det\Big(I_2 + Z^T (t I_n + \lambda P_\mathcal{L} )^{-1} (-ZB) \Big) \nonumber \\ & \ = \ (t+\lambda)^{m} t^{n-m} \det\Big(I_2 - Z^T (t I_n + \lambda P_\mathcal{L} )^{-1} ZB \Big). \end{align}

Moreover,

\begin{align*} \big( t I_n + \lambda \, P_\mathcal{L} \big)^{-1} \ = \ \dfrac{1}{t}(I_n - P_\mathcal{L}) + \dfrac{1}{t+\lambda} P_\mathcal{L} \ = \ \dfrac{1}{t} I_n - \dfrac{\lambda}{t (t+\lambda)} P_\mathcal{L}. \end{align*}

Therefore, we can write

\begin{align*} Z^T \big( t I_n + \lambda \, P_\mathcal{L} \big)^{-1} ZB \ = \ \dfrac{1}{t} Z^T Z B - \dfrac{\lambda}{t(t+\lambda) } Z^T P_{\mathcal{L}} ZB \ = \ \dfrac{1}{t} \dfrac{n}{2} B - \dfrac{\lambda}{t(t+\lambda)} \dfrac{m}{2} B \ = \ x B, \end{align*}

where $x := \dfrac{n}{2} \dfrac{1}{t(t+\lambda)} \left( t + \lambda \left( 1 - \dfrac{m}{n} \right) \right) $. Thus, a direct computation of the determinant gives

\begin{align*} \det \Big(I_2 - Z^T \big( t I_n + \lambda \, P_\mathcal{L} \big)^{-1} ZB \Big) \ = \ \Big(1-x(a+b)\Big) \Big(1-x(a-b)\Big). \end{align*}

Going back to Eq. (B.1), we can write

(B.2)\begin{align} \det \Big( t I_n + \lambda P_\mathcal{L} - M \Big) & \ = \ (t+ \lambda)^{m-2} t^{n-m-2} P_1(t) P_2(t), \end{align}

with $P_1(t) = t(t+\lambda) - \frac{n}{2} (a+b) \big(t+\lambda(1-\frac{m}{n}) \big) $ and $P_2(t) = t(t+\lambda) - \frac{n}{2} (a-b) \big(t+\lambda(1-\frac{m}{n}) \big) $. Since $t \in \mathbf{R} \mapsto \det (t I_n + \lambda P_\mathcal{L} - M )$ is continuous (even analytic), expression (B.2) is also valid for t = 0 and $t = - \lambda$ [Reference Avrachenkov, Filar and Howlett6]. We end the proof by observing that

\begin{align*} P_1(t) \ = \ (t-t_1^+) (t-t_1^-) \qquad \text{and } \qquad P_2(t) \ = \ (t-t_2^+) (t-t_2^-), \end{align*}

where $t_1^\pm$ and $t_2^\pm$ are defined in the proposition’s statement.
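Since $\det(t I_n + \lambda P_\mathcal{L} - M)$ is the characteristic polynomial of $M - \lambda P_\mathcal{L}$, Proposition B.2 can be checked by comparing the predicted roots with numerically computed eigenvalues. The sketch below does this for one small instance; the values of $n$, $m$, $a$, $b$, $\lambda$ and the helper name `predicted_roots` are illustrative assumptions.

```python
import numpy as np

def predicted_roots(n, m, a, b, lam):
    """Roots of det(t I + lam P_L - M) according to Proposition B.2."""
    roots = [0.0] * (n - m - 2) + [-lam] * (m - 2)
    for c in (n / 2 * (a + b), n / 2 * (a - b)):
        # (lam + c)^2 - 2 (a +/- b) lam m, written with c = n (a +/- b) / 2
        disc = np.sqrt((lam + c) ** 2 - 4 * c * lam * m / n)
        roots += [0.5 * (c - lam + disc), 0.5 * (c - lam - disc)]
    return np.sort(roots)

n, m, a, b, lam = 12, 4, 0.5, 0.2, 1.5
Z = np.kron(np.eye(2), np.ones((n // 2, 1)))       # n x 2 block indicator matrix
M = Z @ np.array([[a, b], [b, a]]) @ Z.T
p = np.zeros(n)
p[: m // 2] = 1.0                                  # first m/2 diagonal entries
p[-(m // 2):] = 1.0                                # last m/2 diagonal entries
P_L = np.diag(p)

numeric = np.sort(np.linalg.eigvalsh(M - lam * P_L))
predicted = predicted_roots(n, m, a, b, lam)
print(np.max(np.abs(numeric - predicted)))
```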

Corollary B.3. Let A be the adjacency matrix of a DC-SBM with $p_{\rm in} \gt p_{\rm out} \gt 0$, and s be the oracle information. Let $\lambda, \tau \gt 0$, and $\bar{d}_\tau = \frac{n}{2}\left( p_{\rm in} + p_{\rm out} \right) - n \tau$, $\bar{\alpha} = \frac{n}{2}\left( p_{\rm in} - p_{\rm out} \right)$. Let $A_\tau := A - \tau 1_n 1_n^T$ and $\mathcal{P}$ be the diagonal matrix whose element $\mathcal{P}_{ii}$ is 1 if $s_i \not=0$, and 0 otherwise (it plays the role of $P_\mathcal{L}$ in Proposition B.2, with $m = (\eta_0 + \eta_1) n$). Then, the spectrum of $\mathbb{E} \tilde{\mathcal{L}} =- \mathbb{E} A_\tau + \lambda \mathcal{P} - \gamma I_n $ is $ \left\{-\gamma - t_1^\pm; -\gamma - t_2^\pm; -\gamma; -\gamma + \lambda \right\}, $ where

\begin{align*} t_1^\pm & \ = \ \dfrac{1}{2}\Bigg( \bar{d}_\tau - \lambda \pm \sqrt{\left( \lambda + \bar{d}_\tau \right)^2 - 4 \bar{d}_\tau \lambda \left( \eta_1 + \eta_0 \right) } \Bigg), \\ t_2^\pm & \ = \ \dfrac{1}{2} \Bigg( \bar{\alpha} - \lambda \pm \sqrt{\Big( \lambda + \bar{\alpha} \Big)^2 - 4 \bar{\alpha} \lambda \left( \eta_1 + \eta_0 \right) } \Bigg). \end{align*}

Proof. Let $B = \begin{pmatrix} p_{\rm in} - \tau & p_{\rm out} - \tau \\ p_{\rm out} - \tau & p_{\rm in} - \tau \end{pmatrix}$ and $Z = \begin{pmatrix} 1_{n/2} & 0_{n/2} \\ 0_{n/2} & 1_{n/2} \end{pmatrix}$. Then, we notice that $\mathbb{E} A_\tau = Z B Z^T$ and we can apply Proposition B.2 to compute the characteristic polynomial of $\mathbb{E} \tilde{\mathcal{L}}$. For $x\in \mathbf{R}$, $ \det \left( \mathbb{E} \tilde{\mathcal{L}} - x I_n \right) \ = \ \det \Big( (-\gamma - x) I_n - \mathbb{E} A_\tau + \lambda \mathcal{P} \Big) , $ whose roots are $-\gamma - t_1^{\pm}, -\gamma-t_2^{\pm}$, $-\gamma$, and $-\gamma + \lambda$.

Appendix B.1.1. Bounds for $\bar{\gamma} _*$

Lemma B.4. Let $\bar{\gamma}_*$ be the solution of Eq. (3.6) for the mean-field model. Then,

\begin{align*} -\bar{\alpha } \ \le \ \bar{\gamma}_* \ \le \ -\bar{\alpha} ( 1-2\eta_0). \end{align*}

Proof. For $\lambda \ge 0$, we denote by $(\bar{x}_\lambda, \bar{\gamma}_*(\lambda) )$ the solution of the system (3.4) on a mean-field DC-SBM. The proof is in two steps. First, let us show that $\bar{\gamma} _*(0) = -\bar{\alpha} $ and $\bar{\gamma} _*(\infty) = -\bar{\alpha} (1-2\eta_0)$. For λ = 0, the constrained linear system (3.4) reduces to an eigenvalue problem, and hence $\bar{\gamma} _*(0)$ equals $-\bar{\alpha}$, the smallest eigenvalue of $-\mathbb{E} A_{\tau}$. Moreover, when $\lambda = \infty $, the hard constraint $x_{\ell} = \bar{s} _{\ell}$ is enforced, and the system (3.4) becomes

\begin{align*} \left\{ \begin{array}{rl} (-\mathbb{E} A_{\tau} - \bar{\gamma} _*(\infty) I_n)_{uu} \bar{x} _u & \ = \ (\mathbb{E} A_{\tau})_{u\ell} \bar{s} _\ell \\ \bar{x} _u^T \bar{x} _u & \ = \ n (1 - \eta_0 - \eta_1) \end{array} \right. \end{align*}

and we verify by hand that $ \bar{\gamma} _*(\infty) = - \bar{\alpha} (1-2\eta_0)$ together with $\bar{x} _u = Z_u$ is indeed the solution.

Second, if we let $C_\lambda(x) = -x^T \mathbb{E} A_\tau x + \lambda ( \bar{s} - \mathcal{P} x)^T ( \bar{s} - \mathcal{P} x)$ be the cost function minimized in (3.1), then from Eq. (3.4) we have $\bar{\gamma} _*(\lambda_1) - \bar{\gamma} _*(\lambda_2) = C_{\lambda_1}(\bar{x} _1) - C_{\lambda_2}(\bar{x} _2) + \lambda_1 \bar{x} _1^T \bar{s} - \lambda_2 \bar{x} _2^T \bar{s} $. Since $\lambda \mapsto C_\lambda(x)$ is increasing, $\lambda_1 \le \lambda_2$ implies $C_{\lambda_1}(\bar{x} _1) \le C_{\lambda_2}(\bar{x} _2)$. Since $\bar{x} _\lambda^T \bar{s} \ge 0$ (if this were not the case, then $C_\lambda(-\bar{x} _\lambda) \le C_\lambda(\bar{x} _\lambda)$, and hence $\bar{x} _\lambda \not= \mathrm{arg\,min}_{x \in \mathbf{R}^n} C_\lambda(x)$), we can conclude that $\bar{\gamma} _*(0) \le \bar{\gamma} _*(\lambda)$ and that $\bar{\gamma} _*(\lambda) \le \bar{\gamma} _*(\infty)$.

Appendix B.1.2. Concentration of $\gamma_*$

Proposition B.5. Let $\gamma_*$ and $\bar{\gamma}_*$ be the solutions of Eq. (3.4) for a DC-SBM and the mean-field DC-SBM, respectively. Then

\begin{equation*} \left| \gamma_* - \bar{\gamma}_* \right| \ \le \ \left( 1 + \frac{\left( \bar{\alpha} + \lambda \right)^3 }{2 \sqrt{\eta_1+\eta_0} ( \eta_1 - \eta_0 ) \bar{\alpha}^2 \lambda } \right) \sqrt{\overline{d}}. \end{equation*}

Proof. The gradient with respect to $(\bar{\delta} _1,...,\bar{\delta} _n, \bar{b} _1,...,\bar{b} _n, \gamma)$ of the left-hand-side of Eq. (3.6) is equal to

\begin{equation*} 2 \sum_{i=1}^n \frac{\bar{b} _i }{\bar{\delta} _i - \bar{\gamma} _* } \left[ \frac{\Delta b_i }{\bar{\delta} _i - \bar{\gamma} _* } - \frac{\bar{b} _i \Delta \delta_i }{( \bar{\delta} _i - \bar{\gamma} _* )^2 } + \frac{\bar{b} _i \Delta \gamma}{( \bar{\delta} _i - \bar{\gamma} _* )^2 } \right]. \end{equation*}

Thus, we have

\begin{equation*} \Delta \gamma \sum_{i=1}^n \frac{\bar{b} _i^2 }{( \bar{\delta} _i - \bar{\gamma} _* )^3 } \ = \ \sum_{i=1}^n \frac{\bar{b} _i^2 }{( \bar{\delta} _i - \bar{\gamma} _*)^3 } \Delta \delta_i - \sum_{i=1}^n \frac{\bar{b} _i }{( \bar{\delta} _i - \bar{\gamma} _* )^2 } \Delta b_i + o \left( \Delta \delta_i, \Delta b_i \right). \end{equation*}

Firstly, by Weyl’s inequality and the concentration of the adjacency matrix of a DC-SBM graph, we have for all $i \in [n]$, $\Delta \delta_i = \left| \delta_i - \bar{\delta}_i \right| \le \left\| A - \mathbb{E} A \right\| \le \sqrt{\overline{d}}$. Therefore, using this fact and $\bar{\gamma}_* \le \bar{\delta}_1 \le \bar{\delta}_2 \le \cdots \le \bar{\delta}_n$,

\begin{align*} \Delta \gamma \ = \ \left| \gamma_* - \bar{\gamma}_* \right| & \ \le \ \max_{i} \left| \delta_i - \bar{\delta} _i \right| + \frac{\max_{i} \frac{1}{( \bar{\delta} _i - \bar{\gamma} _*)^2 } }{\min_{i} \frac{1}{( \bar{\delta} _i - \bar{\gamma} _* )^3 } } \frac{\sum_{i} | \bar{b} _i | \cdot | b_i - \bar{b} _i | }{\sum_i \bar{b} _i^2 } \\ & \ \le \ \sqrt{\overline{d}} + \frac{\max_i \left( \bar{\delta} _i - \bar{\gamma} _* \right)^3 }{\min_i \left( \bar{\delta} _i - \bar{\gamma} _* \right)^2 } \frac{\sum_{i} | \bar{b} _i | \cdot | b_i - \bar{b} _i |}{\sum_i \bar{b} _i^2 }. \end{align*}

We notice that $\min_i | \bar{\delta}_i - \bar{\gamma}_* | = \bar{\delta}_1 - \bar{\gamma}_*$. By using Lemma B.4 and the expression of $\bar{\delta}_1$ given in Corollary B.3, we have

\begin{equation*} \min_i | \bar{\delta} _i - \bar{\gamma} _* | \ \ge \ \bar{\alpha} + \lambda. \end{equation*}

Similarly, $\max_i | \bar{\delta}_i - \bar{\gamma}_* | = \bar{\delta}_n - \bar{\gamma}_* = \bar{\delta}_n - \bar{\delta}_1 + \bar{\delta}_1 - \bar{\gamma}_*$. Corollary B.3 implies $\bar{\delta}_n = \lambda$ and $\bar{\delta}_1 = \frac{1}{2} \left( \lambda - \bar{\alpha} - \sqrt{\left(\lambda + \bar{\alpha}\right)^2 - 4\bar{\alpha} \lambda (\eta_0+\eta_1)}\right)$, thus $\bar{\delta}_n - \bar{\delta}_1 \le \bar{\alpha} + \lambda$. Hence, using Lemma B.4,

\begin{equation*} \max_i | \bar{\delta} _i - \bar{\gamma} _* | \ \le \ \frac{3}{2} \left( \bar{\alpha} + \lambda \right). \end{equation*}

Therefore, we have

(B.3)\begin{align} \left| \gamma_* - \bar{\gamma}_* \right| & \ \le \ \sqrt{\overline{d}} + \frac{27}{8} ( \bar{\alpha} + \lambda ) \cdot \frac{\sum_{i} | \bar{b} _i | \cdot | b_i - \bar{b} _i |}{\sum_i \bar{b} _i^2 }. \end{align}

The term $\frac{\sum_{i} | \bar{b} _i | \cdot | b_i - \bar{b} _i | }{\sum_i \bar{b} _i^2}$ can be bounded as follows. Let $\mathcal{I} = \{i \in [n] \colon \bar{b} _i \not=0 \}$. Then

\begin{align*} \sum_{i} | \bar{b} _i | \cdot | b_i - \bar{b} _i | & \ \le \ \max_{i \in \mathcal{I}} | b_i - \bar{b} _i | \cdot \sum_{i \in \mathcal{I}} \left| \bar{b} _i \right|. \end{align*}

Combining the Cauchy–Schwarz inequality

\begin{align*} \left| b_i - \bar{b} _i \right| \ = \ \lambda \left| ( Q_{\cdot i} - \bar{Q}_{\cdot i})^T \bar{s} \right| \ \le \ \lambda \left\| Q_{\cdot i} - \bar{Q}_{\cdot i} \right\|_2 \cdot \| \bar{s} \|, \end{align*}

with the Davis–Kahan theorem [Reference Yu, Wang and Samworth28]

\begin{align*} \left\| Q_{\cdot i} - \bar{Q}_{\cdot i} \right\|_2 & \ \le \ \frac{2^{3/2} \left\| A - \mathbb{E} A\right\| }{\min \left\{\bar{\delta}_i - \bar{\delta}_{i-1}, \bar{\delta}_{i+1} - \bar{\delta}_i \right\} }, \end{align*}

$\| \bar{s} \| = \sqrt{(\eta_0+\eta_1)n}$, and the concentration of A toward $\mathbb{E} A$, yields

\begin{align*} \max_{i \in \mathcal{I}} | b_i - \bar{b} _i | \ \le \ \frac{\lambda \sqrt{(\eta_0+\eta_1) n } }{ \min_{i \in \mathcal{I} } \left\{\bar{\delta} _i - \bar{\delta} _{i-1}, \bar{\delta} _{i+1} - \bar{\delta}_i \right\} } \cdot 2^{3/2} \sqrt{\overline{d}}. \end{align*}

Using Lemma B.6, we see that $\mathcal{I} = \{i \in [n] : \bar{\delta}_i \not\in \{0, -t_1^-\} \}$. Combining this with Corollary B.3 gives

\begin{align*} \min_{i \in \mathcal{I} } \left\{\bar{\delta}_i - \bar{\delta}_{i-1}, \bar{\delta}_{i+1} - \bar{\delta}_i \right\} & \ = \ \lambda + t_2^- \\ & \ = \ \frac{\alpha+\lambda}{2} \left( 1 - \sqrt{1 - 4 \frac{\alpha \lambda}{(\alpha+\lambda)^2} (\eta_0+\eta_1) } \right) \\ & \ \ge \ \frac{\alpha \lambda}{\alpha + \lambda} (\eta_0+\eta_1), \end{align*}

where we used $\sqrt{1-x} \le 1 - x/2$. Therefore,

\begin{equation*} \max_{i \in \mathcal{I}} \left| b_i - \bar{b} _i \right| \ \le \ 2^{3/2} \sqrt{\frac{n \bar{d} }{\eta_0+\eta_1 } } \cdot \frac{\alpha + \lambda}{\alpha} . \end{equation*}

Finally, Lemma B.7 ensures that

\begin{align*} \frac{\sum_i \left| \bar{b} _i \right| }{\sum_i \bar{b} _i^2 } \ \le \ \frac{2}{\sqrt{n} (\eta_1-\eta_0)} \cdot \frac{\lambda + \alpha}{\alpha \lambda} \left( 1 + \frac{2 \eta_0 n \sqrt{\eta_1+\eta_0}}{\lambda } \right). \end{align*}

Therefore,

\begin{align*} \frac{\sum_i \left| \bar{b} _i \right| \cdot \left| b_i - \bar{b} _i \right| }{\sum_i \bar{b} _i^2} & \ \le \ 2^{5/2} \frac{(\alpha+\lambda)^2}{\alpha^2 \lambda} \frac{\sqrt{\overline{d}} }{(\eta_1-\eta_0)\sqrt{\eta_0+\eta_1} } \left( 1 + \frac{2 \eta_0 n \sqrt{\eta_1+\eta_0}}{\lambda } \right) \\ & \ \le \ 2^{3} \frac{(\alpha+\lambda)^2}{\alpha^2 \lambda} \frac{\sqrt{\overline{d}} }{(\eta_1-\eta_0)\sqrt{\eta_0+\eta_1} }, \end{align*}

where we used the condition $2 \eta_0 n \sqrt{\eta_1+\eta_0} \ll \lambda$.

Going back to inequality (B.3), this implies that $ \left| \gamma_* - \bar{\gamma}_* \right| \ \le \ \left( 1 + \frac{27}{2^{6} } \frac{(\alpha+\lambda)^3}{\alpha^2 \lambda} \frac{1}{(\eta_1-\eta_0) \sqrt{\eta_0+\eta_1}} \right) \sqrt{\overline{d}}, $ and this concludes the proof.

Lemma B.6. Let $-\mathbb{E} A_\tau + \lambda \mathcal{P} = \bar{Q} \bar{\Delta} \bar{Q}^T$, where $\bar{\Delta} = \mathrm{diag}\left( \bar{\delta} _1, \dots , \bar{\delta} _n\right)$ and $\bar{Q}^T \bar{Q} = I_n$. Denote $\bar{b} = \lambda \bar{Q}^T s$. We have $ \bar{b} _1 \ \ge \ \sqrt{n} \frac{\lambda (\eta_1 - \eta_0)}{2} \frac{\bar{\alpha } }{\lambda + \bar{\alpha } }. $ Moreover, $\bar{b} _i = 0$ if $\bar{\delta} _i = 0$ or if $\bar{\delta} _i = - t_1^-$.

Proof. First, from Corollary B.3, $ \bar{\delta }_1 = -t_2^+ = - \frac{1}{2}\Bigg( \bar{\alpha } - \lambda + \sqrt{\Big( \lambda + \bar{\alpha } \Big)^2 - 4 \bar{\alpha } \lambda \left( \eta_1 + \eta_0 \right) } \Bigg). $ By symmetry, the ith component of the first eigenvector $\bar{Q}_{\cdot 1}$ (associated with $\bar{\delta}_1$) is equal to

\begin{align*} \begin{cases} v_1 \, Z_i & \quad \text{if } i \in [\ell], \\ v_0 \, Z_i &\quad \text{if } i \not \in [\ell], \end{cases} \end{align*}

where v 1 and v 0 are to be determined. Thus, the equation $\left( -\mathbb{E} A_\tau + \lambda \mathcal{P} \right) \bar{Q}_{\cdot 1} = \bar{\delta}_1 \bar{Q}_{\cdot 1}$ leads to

\begin{align*} \begin{cases} \bar{\alpha } \left( (\eta_1 + \eta_0) v_1 + (1 - \eta_1 - \eta_0) v_0 \right) & \ = \ -t_2^+ v_0 \\ \bar{\alpha } \left( (\eta_1 + \eta_0) v_1 + (1-\eta_1 - \eta_0) v_0 \right) + \lambda v_1 & \ = \ -t_2^+ v_1, \end{cases} \end{align*}

which, given the norm constraint $\|v\|_2 = 1$, yields

\begin{align*} \begin{cases} v_1 & \ = \ \dfrac{1}{\sqrt{n}} \frac{t_2^+ }{\sqrt{(\eta_1+\eta_0) \left( t_2^+ \right)^2 + (1-\eta_1 - \eta_0) \left( t_2^+ + \lambda \right)^2 } }, \\ v_0 & \ = \ \dfrac{1}{\sqrt{n}} \frac{t_2^+ + \lambda}{\sqrt{(\eta_1+\eta_0) \left( t_2^+ \right)^2 + (1-\eta_1 - \eta_0) \left( t_2^+ + \lambda \right)^2 }}. \end{cases} \end{align*}

Since $\bar{b} _1 = \lambda v^T \bar{s} = \lambda (\eta_1 - \eta_0) n v_1$, we have

\begin{align*} \frac{\bar{b} _1 }{\sqrt{n} } & \ = \ \lambda (\eta_1 - \eta_0) \frac{t_2^+}{\sqrt{(\eta_1+\eta_0) \left( t_2^+ \right)^2 + (1-\eta_1 - \eta_0) \left( t_2^+ + \lambda \right)^2 } }. \end{align*}

The proof ends by noticing that $t_2^+ \geq \frac{\bar{\alpha }}{2}$ and $t_2^+ \leq \bar{\alpha }$. Indeed,

\begin{align*} \frac{\bar{b} _1 }{\sqrt{n} } & \ \ge \ \lambda(\eta_1 - \eta_0) \frac{\bar{\alpha } }{2 \sqrt{(\eta_1 + \eta_0) \bar{\alpha }^2 + (1-\eta_1-\eta_0) (\bar{\alpha } + \lambda)^2 } } \\ & \ \ge \ \frac{\lambda(\eta_1 - \eta_0)}{2} \frac{\bar{\alpha } }{\left( \bar{\alpha } + \lambda \right) \sqrt{(\eta_1 + \eta_0) \left( \frac{\bar{\alpha }}{\bar{\alpha } + \lambda } \right)^2 + 1-\eta_1 - \eta_0 } } \\ & \ \ge \ \frac{\lambda (\eta_1 - \eta_0) }{2 } \frac{\bar{\alpha } }{\lambda + \bar{\alpha } }. \end{align*}

This proves the first claim of the lemma.
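The first claim, together with the value $\bar{\delta}_1 = -t_2^+$ from Corollary B.3, can be checked numerically on a small mean-field instance. The sketch below makes several simplifying assumptions that are not part of the lemma: $\theta_i = 1$ for all nodes, two clusters of equal size, and an oracle that labels exactly 3 nodes correctly and 1 node wrongly in each cluster. All numerical values are arbitrary.

```python
import numpy as np

n, p_in, p_out, tau, lam = 40, 0.5, 0.1, 0.2, 2.0
half = n // 2
Z = np.repeat([1.0, -1.0], half)                   # true community labels
s = np.zeros(n)
for start in (0, half):                            # per cluster: 3 correct labels, 1 wrong
    s[start:start + 3] = Z[start:start + 3]
    s[start + 3] = -Z[start + 3]
eta1, eta0 = 6 / n, 2 / n                          # fractions of correct / wrong labels

P = np.diag((s != 0).astype(float))                # selector of labeled nodes
EA_tau = np.where(np.outer(Z, Z) > 0, p_in, p_out) - tau   # mean-field E[A_tau]
alpha = n / 2 * (p_in - p_out)

delta, Q = np.linalg.eigh(-EA_tau + lam * P)       # eigenvalues in ascending order
t2_plus = 0.5 * (alpha - lam
                 + np.sqrt((lam + alpha) ** 2 - 4 * alpha * lam * (eta1 + eta0)))
b1 = lam * abs(Q[:, 0] @ s)                        # eigenvector sign is arbitrary
lower = np.sqrt(n) * lam * (eta1 - eta0) / 2 * alpha / (lam + alpha)
print(delta[0], -t2_plus, b1, lower)
```

With these values $\tau \gt p_{\rm out}$, so the smallest eigenvalue is $-t_2^+$, and $\bar{b}_1$ sits comfortably above the lower bound of the lemma.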

Similarly, by symmetry the ith component of the eigenvector vʹ associated with $-t_1^-$ equals $v_{\ell}'$ if $i \in \ell$, and $v_u'$ otherwise, and therefore $(v')^T s = 0$.

Finally, let $I_0 := \{i \in [n] : \bar{\delta}_i = 0 \}$. By Corollary B.3, we have $|I_0| = n (1-\eta_1-\eta_0) - 2$. Since 0 is also an eigenvalue of order $n(1-\eta_0-\eta_1)-2$ of the extracted sub-matrix $\left( -\mathbb{E} A_\tau + \lambda \mathcal{P} \right)_{u,u} = \left( -\mathbb{E} A_\tau \right)_{u,u} $, we have for all $k \in I_0$, $\bar{Q}_{ik} = 0$ for every labeled node $i \in \ell$. Therefore, for $k \in I_0$, $\bar{b}_k = \lambda \bar{Q}^T_{\cdot k} s = 0$.

Lemma B.7. Let $-\mathbb{E} A_\tau + \lambda \mathcal{P} = \bar{Q} \bar{\Delta} \bar{Q}^T$, where $\bar{\Delta} = \mathrm{diag}\left( \bar{\delta} _1, \dots , \bar{\delta} _n\right)$ and $\bar{Q}^T \bar{Q} = I_n$. Denote $\bar{b} = \lambda \bar{Q}^T s$ and let $\mathcal{I} = \{i \in [n] \colon \bar{b} _i \not=0 \}$. We have $ \frac{\sum_{i \in \mathcal{I}} |\bar{b} _i| }{\sum_{i \in \mathcal{I}} |\bar{b} _i|^2} \ \le \ \frac{2}{\sqrt{n} (\eta_1-\eta_0)} \cdot \frac{\lambda + \alpha}{\alpha \lambda} \left( 1 + \frac{2 \eta_0 n \sqrt{\eta_1+\eta_0}}{\lambda } \right) $.

Proof. Using Lemma B.6, we see that $\mathcal{I} = \{i \in [n] : \bar{\delta}_i \not\in \{0, -t_1^-\} \}$. Thus,

\begin{align*} \frac{\sum_{i \in \mathcal{I}} |\bar{b} _i| }{\sum_{i \in \mathcal{I}} |\bar{b} _i|^2} \ = \ \frac{|\bar{b} _1| + \sum_{i \colon \delta_i = \lambda} |\bar{b} _i| }{|\bar{b} _1|^2 + \sum_{i \colon \delta_i = \lambda} |\bar{b} _i|^2} \ \le \ \frac{1 }{|\bar{b} _1| } + \frac{\sum_{i \colon \delta_i = \lambda} |\bar{b} _i| }{|\bar{b} _1|^2}, \end{align*}

where $\bar{b} _1$ denotes the element of vector $\bar{b} $ corresponding to eigenvalue $\delta_1 = -t_2^+$. By Lemma B.6, we have $\bar{b} _1 \ \ge \ \sqrt{n} \frac{\lambda (\eta_1 - \eta_0)}{2} \frac{\bar{\alpha } }{\lambda + \bar{\alpha } }$. Hence,

(B.4)\begin{align} \frac{1 }{|\bar{b} _1|} \ \le \ \frac{\lambda + \bar{\alpha} }{\bar{\alpha} \lambda } \frac{2}{(\eta_1-\eta_0) \sqrt{n}}. \end{align}

We note that the eigenvalue $\delta_i = \lambda$ is of multiplicity $(\eta_0+\eta_1) n - 2$. Let us denote by $\{v_i\}$ the corresponding $(\eta_0+\eta_1) n - 2$ orthonormal eigenvectors associated with eigenvalue λ. Let $v_{ij}$ denote the jth entry of $v_i$. We notice from the block structure of $-\mathbb{E} A_{\tau} + \lambda \mathcal{P}$ that $v_{ij} = 0$ if $j \notin \ell$. Moreover, if we let $\tilde{v}_i$ be the restriction of $v_i$ to $\ell$, then $\tilde{v}_i$ belongs to the kernel of $ \left( -\mathbb{E} A_\tau \right)_{\ell \ell}$. Therefore, $\sum_{j \in \ell} \tilde{v}_{ij} = 0$, and

\begin{align*} \bar{b} _i \ = \ \lambda v_i^T s \ = \ -2 \lambda \sum_{j \in \ell_0} \tilde{v}_{ij}, \end{align*}

where $\ell_0 = \{j \in \ell \colon s_j \ne z_j\}$ is the set of nodes mislabeled by the oracle. Hence,

\begin{align*} \sum_{i \colon \delta_i = \lambda} |\bar{b} _i| & \ = \ 2 \lambda \sum_{i \colon \delta_i = \lambda } \left| \sum_{j \in \ell_0} \tilde{v}_{ij} \right| \ \le \ 2 \lambda \sum_{i \in \ell} \sum_{j \in \ell_0} |\tilde{v}_{ij}| \ \le \ 2 \lambda \sum_{j \in \ell_0} \sqrt{|\ell|} \sqrt{\sum_{i\in \ell} |\tilde{v}_{ij}|^2} \ \le \ 2 \lambda |\ell_0| \sqrt{|\ell|}, \end{align*}

where the last inequality follows from the fact that the matrix $(\tilde{v}_{ij})_{i,j}$ is orthogonal. Hence, using $\bar{b} _1 \ \ge \ \sqrt{n} \frac{\lambda (\eta_1 - \eta_0)}{2} \frac{\bar{\alpha } }{\lambda + \bar{\alpha } }$, $|\ell_0| = \eta_0 n$, and $|\ell| = (\eta_0+\eta_1) n$, we obtain

\begin{align*} \frac{\sum_{i \colon \delta_i = \lambda} |\bar{b} _i| }{|\bar{b} _1|^2 } \ \le \ 4 \eta_0 \sqrt{n} \frac{\sqrt{\eta_1+\eta_0}}{\eta_1-\eta_0} \frac{\lambda+\bar{\alpha}}{\bar{\alpha}}. \end{align*}

Combining the latter inequality with (B.4) leads to the desired result.

Appendix C. Mean-field solution

In this section, we calculate the solution $\bar{x}$ to the mean-field model and deduce from it the conditions to recover the clusters.

Proposition C.1. Suppose that $\tau \gt p_{\rm out}$. Then the solution of Eq. (3.5) on the mean-field DC-SBM is the vector $\bar{x}$ whose element $\bar{x}_i$ is given by

\begin{align*} \bar{x}_i \ = \ \begin{cases} C \left( - 1 + (\eta_1 - \eta_0) \bar{\alpha} B \right) Z_i, & \quad \text{if } i \in \ell \text{ and } s_i \not = Z_i, \\ C \left( 1 + (\eta_1 - \eta_0) \bar{\alpha} B \right) Z_i, & \quad \text{if } i \in \ell \text{ and } s_i = Z_i, \\ \frac{- \bar{\alpha} C}{\bar{\alpha } (1-\eta_1 - \eta_0) + \bar{\gamma}_* } (\eta_1 - \eta_0) \left( 1 + (\eta_1 + \eta_0) \bar{\alpha} B \right) Z_i , & \quad \text{if } i \not \in \ell, \end{cases} \end{align*}

where $\bar{\alpha} = \frac{n}{2}(p_{\rm in} - p_{\rm out})$, $B = \frac{\bar{\gamma} _* }{\lambda \bar{\alpha} (1-\eta_1-\eta_0) + \bar{\gamma} _*(\lambda - \bar{\alpha } - \bar{\gamma} _*)}$ and $C = \frac{\lambda}{\lambda - \bar{\gamma} _*}$.

Proof. Let $\bar{x}$ be a solution of Eq. (3.5). By symmetry, we have

\begin{align*} \bar{x}_{i} \ = \ \begin{cases} x_t \, Z_i, & \quad \text{if } i \in [\ell] \text{ and } \bar{s} _i = Z_i, \\ x_f \, Z_i, &\quad \text{if } i \in [\ell] \text{ and } \bar{s} _i = - Z_i, \\ x_0 \, Z_i, &\quad \text{if } i \not \in [\ell], \end{cases} \end{align*}

where $x_t$, $x_f$ and $x_0$ are unknowns to be determined. Since for every $i \in [n]$

\begin{align*} \left( \mathbb{E} A_\tau \bar{x} \right)_i \ = \ \bar{\alpha } \left( x_0 (1-\eta_1 - \eta_0) + x_t \eta_1 + x_f \eta_0 \right), \end{align*}

the linear system composed of the equations $ \big( \left( -\mathbb{E} A_\tau + \lambda \mathcal{P} - \bar{\gamma} _* I_n \right) \bar{x} \big)_i \ = \ \lambda s_i $ for all $i \in [n]$ leads to the system

\begin{align*} \begin{cases} - \bar{\alpha} \left( (1-\eta_1 - \eta_0) x_0 + x_t \eta_1 + x_f \eta_0 \right) - \bar{\gamma} _* x_0 & \ = \ 0, \\ - \bar{\alpha} \left( (1-\eta_1 - \eta_0) x_0 + x_t \eta_1 + x_f \eta_0 \right) - \bar{\gamma} _* x_t + \lambda x_t & \ = \ \lambda, \\ - \bar{\alpha} \left( (1-\eta_1 - \eta_0) x_0 + x_t \eta_1 + x_f \eta_0 \right) - \bar{\gamma} _* x_f + \lambda x_f & \ = \ - \lambda. \end{cases} \end{align*}

The rows of the latter system correspond to a node unlabeled by the oracle, correctly labeled and falsely labeled, respectively. This system can be rewritten as follows:

\begin{align*} \begin{cases} x_0 & \ = \ \frac{- \bar{\alpha } }{\bar{\alpha } (1-\eta_1-\eta_0) + \bar{\gamma} _* } \left( \eta_1 x_t + \eta_0 x_f \right), \\ \bar{\gamma} _* x_0 + x_t (\lambda - \bar{\gamma} _*) & \ = \ \lambda, \\ \bar{\gamma} _* x_0 + x_f (\lambda - \bar{\gamma} _*) & \ = \ - \lambda. \end{cases} \end{align*}

In particular, subtracting the third equation from the second gives $ x_t - x_f \ = \ \frac{2\lambda }{\lambda - \bar{\gamma} _* }. $ By successively eliminating $x_0$ and $x_t$ in the equation $\bar{\gamma} _* x_0 + x_f (\lambda - \bar{\gamma} _*) = - \lambda$, we find

\begin{align*} x_f & \ = \ \frac{\lambda}{\lambda - \bar{\gamma} _*} \left( -1 + \frac{\bar{\alpha } \bar{\gamma} _* \left( \eta_1 - \eta_0 \right) }{\lambda \bar{\alpha } (1-\eta_1 - \eta_0) + \lambda \bar{\gamma} _* - \bar{\gamma} _* ( \bar{\alpha } + \bar{\gamma} _* ) } \right), \\ x_t & \ = \ \frac{\lambda }{\lambda - \bar{\gamma} _* } \left( 1 + \frac{\bar{\alpha } \bar{\gamma} _* \left( \eta_1 - \eta_0 \right)}{\lambda \bar{\alpha } (1-\eta_1 - \eta_0) + \lambda \bar{\gamma} _* - \bar{\gamma} _* ( \bar{\alpha } + \bar{\gamma} _* ) }\right), \end{align*}

and finally

\begin{align*} x_0 \ = \ \frac{- \bar{\alpha} }{\bar{\alpha }(1-\eta_1-\eta_0) + \bar{\gamma} _* } \cdot \frac{\lambda }{\lambda - \bar{\gamma} _*} \left( \eta_1 - \eta_0 \right) \left( 1 + \frac{\bar{\alpha} \bar{\gamma} _* \left( \eta_1 + \eta_0 \right)}{\lambda \bar{\alpha} (1-\eta_1 - \eta_0) + \lambda \bar{\gamma} _* - \bar{\gamma} _* ( \bar{\alpha} + \bar{\gamma} _* ) }\right). \end{align*}
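
The closed-form expressions for $x_t$ and $x_f$ can be sanity-checked numerically by substituting them back into the three-row linear system above. The following Python sketch does this for one arbitrary parameter choice. The function names and the numerical values are illustrative assumptions, not taken from the paper. In particular, $\bar{\gamma}_*$ is treated here as a free input, whereas in the paper it is determined by the norm constraint. The value of $x_0$ is obtained from the first row of the rewritten system.

```python
def mean_field_solution(alpha, gamma, lam, eta1, eta0):
    """Closed-form (x_0, x_t, x_f) following the proof of Proposition C.1.

    alpha, gamma, lam stand for bar-alpha, bar-gamma_* and lambda.
    """
    denom = lam * alpha * (1 - eta1 - eta0) + lam * gamma - gamma * (alpha + gamma)
    shift = alpha * gamma * (eta1 - eta0) / denom
    scale = lam / (lam - gamma)
    x_t = scale * (1 + shift)
    x_f = scale * (-1 + shift)
    # First row of the rewritten system: x_0 in terms of x_t and x_f.
    x_0 = -alpha / (alpha * (1 - eta1 - eta0) + gamma) * (eta1 * x_t + eta0 * x_f)
    return x_0, x_t, x_f


def residuals(alpha, gamma, lam, eta1, eta0, x_0, x_t, x_f):
    """Left-hand side minus right-hand side for the three rows of the system."""
    mean = -alpha * ((1 - eta1 - eta0) * x_0 + eta1 * x_t + eta0 * x_f)
    return (
        mean - gamma * x_0,                   # unlabeled node
        mean - gamma * x_t + lam * x_t - lam, # correctly labeled node
        mean - gamma * x_f + lam * x_f + lam, # mislabeled node
    )


# Hypothetical parameter values chosen only for illustration.
params = dict(alpha=5.0, gamma=-4.8, lam=3.0, eta1=0.15, eta0=0.05)
x_0, x_t, x_f = mean_field_solution(**params)
res = residuals(**params, x_0=x_0, x_t=x_t, x_f=x_f)
print("x_0, x_t, x_f =", x_0, x_t, x_f)
print("residuals =", res)
```

All three residuals should vanish up to floating-point error, which confirms that the closed form solves the system for this parameter choice.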

Corollary C.2. Suppose that $\tau \gt p_{\rm out} $. Then $\mathrm{sign}\left( \bar{x} _i \right) = \mathrm{sign}\left( Z_i \right)$ if one of the following conditions holds:

  • node $i$ is not labeled by the oracle;

  • node $i$ is correctly labeled by the oracle;

  • node $i$ is mislabeled by the oracle and $\lambda \lt (1-2\eta_0) \bar{\alpha} \frac{\eta_1-\eta_0}{\eta_1+\eta_0}$.

Proof. A node $i$ is correctly classified by decision rule (3.3) if the sign of $\bar{x} _i$ is equal to the sign of $Z_i$. Using Lemma B.4 in Appendix B.1.1, we have $- \bar{\alpha} \le \bar{\gamma} _* \le -\bar{\alpha} (1 - 2 \eta_0)$. Therefore, the quantities $B$ and $C$ in Proposition C.1 satisfy $C \ge 0$ and $\frac{1-2\eta_0}{\lambda(\eta_0+\eta_1)} \le B \le \frac{1}{\lambda(\eta_0+\eta_1)}$. The statement then follows from the expression of $\bar{x}_i$ computed in Proposition C.1.

Appendix D. Cost comparison in the constrained eigenvalue problem

Lemma D.1. Let $(\gamma_1,x_1)$ and $(\gamma_2,x_2)$ be two solutions of the system (3.4), and denote by $\mathcal{C}(x) = - x^T A_\tau x + \lambda (s-\mathcal{P} x)^T (s-\mathcal{P} x)$ the cost function minimized in (3.1). Then, we have

\begin{align*} \mathcal{C} (x_1) - \mathcal{C} (x_2) \ = \ \frac{\gamma_1 - \gamma_2 }{2} \, \left\| x_1 - x_2 \right\|^2. \end{align*}

Proof. Because $(\gamma_1,x_1)$ and $(\gamma_2,x_2)$ are solutions of (3.4), it holds that

\begin{align} \left( - A_\tau + \lambda \mathcal{P} \right) x_1 & \ = \ \gamma_1 x_1 + \lambda s, \tag{D.1} \\ \left( - A_\tau + \lambda \mathcal{P} \right) x_2 & \ = \ \gamma_2 x_2 + \lambda s, \tag{D.2} \end{align}

as well as $\|x_1\|^2 = \|x_2\|^2=n$. Thus, we notice that

\begin{align*} \mathcal{C}(x_1) & \ = \ x_1^T \left( - A_{\tau} + \lambda \mathcal{P} \right) x_1 + \lambda s^T s - 2 \lambda \, x_1^T \mathcal{P} s \\ & \ = \ - \lambda x_1^T s + \gamma_1 n + \lambda s^T s, \end{align*}

where we used $\mathcal{P} s = s$ and the fact that $(\gamma_1, x_1 )$ is a solution of the system (3.4). Therefore,

\begin{align*} \mathcal{C}(x_1) - \mathcal{C}(x_2) \ = \ \left( \gamma_1 - \gamma_2 \right) n + \lambda \left( x_2 - x_1 \right)^T s. \end{align*}

Finally, left-multiplying Eq. (D.1) by $x_2^T$ and Eq. (D.2) by $x_1^T$, we obtain

\begin{align*} \lambda x_2^T s & \ = \ x_2^T \left( - A_\tau + \lambda \mathcal{P} \right) x_1 - \gamma_1 x_2^T x_1 , \\ \lambda x_1^T s & \ = \ x_1^T \left( - A_\tau + \lambda \mathcal{P} \right) x_2 - \gamma_2 x_1^T x_2. \end{align*}

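Since $- A_\tau + \lambda \mathcal{P}$ is symmetric, the first terms on the two right-hand sides coincide, so that $\lambda \left( x_2 - x_1 \right)^T s = - \left( \gamma_1 - \gamma_2 \right) x_1^T x_2$. Thus,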

\begin{align*} \mathcal{C}(x_1) - \mathcal{C}(x_2) \ = \ \left( \gamma_1 - \gamma_2 \right) \left( n - x_1^T x_2 \right) \ = \ \frac{\gamma_1 - \gamma_2 }{2 } \left( \|x_1\|^2 + \|x_2\|^2 - 2 x_1^T x_2 \right) \ = \ \frac{\gamma_1 - \gamma_2 }{2} \, \left\| x_1 - x_2 \right\|^2, \end{align*}

where we used the constraints $\|x_1\|^2 = \|x_2\|^2=n$.
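
The identity of Lemma D.1 can be checked numerically on a small random instance. The sketch below is an illustration under stated assumptions, not the authors' implementation: an arbitrary symmetric Gaussian matrix stands in for $A_\tau$, the helper names `constrained_solution` and `cost` are hypothetical, and two solutions of the system are obtained by bisection on the secular equation $\sum_i b_i^2 / (\delta_i - \gamma)^2 = n$, once below the smallest eigenvalue of $-A_\tau + \lambda \mathcal{P}$ and once above the largest.

```python
import numpy as np


def constrained_solution(M, s, lam, n, side):
    """Return (gamma, x) with (M - gamma I) x = lam * s and ||x||^2 = n.

    side="below" looks for gamma under the smallest eigenvalue of M,
    any other value looks for gamma over the largest eigenvalue.
    """
    delta, Q = np.linalg.eigh(M)
    b = lam * (Q.T @ s)

    def f(g):
        return np.sum(b**2 / (delta - g) ** 2) - n

    radius = 2.0 * np.linalg.norm(b) / np.sqrt(n) + 1.0
    eps = 1e-9
    if side == "below":
        lo, hi = delta[0] - radius, delta[0] - eps    # f(lo) < 0 < f(hi)
    else:
        lo, hi = delta[-1] + radius, delta[-1] + eps  # f(lo) < 0 < f(hi)
    for _ in range(200):  # plain bisection, keeping f(lo) < 0 <= f(hi)
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    return gamma, Q @ (b / (delta - gamma))


def cost(x, A, P, s, lam):
    """Cost C(x) = -x^T A x + lam * ||s - P x||^2."""
    r = s - P @ x
    return -x @ A @ x + lam * (r @ r)


# Hypothetical small instance: symmetric matrix A, first 10 nodes labeled.
rng = np.random.default_rng(0)
n, n_labeled, lam = 40, 10, 2.0
G = rng.normal(size=(n, n))
A = (G + G.T) / 2
P = np.zeros((n, n))
P[:n_labeled, :n_labeled] = np.eye(n_labeled)
s = np.zeros(n)
s[:n_labeled] = rng.choice([-1.0, 1.0], size=n_labeled)

M = -A + lam * P
g1, x1 = constrained_solution(M, s, lam, n, "below")
g2, x2 = constrained_solution(M, s, lam, n, "above")
lhs = cost(x1, A, P, s, lam) - cost(x2, A, P, s, lam)
rhs = 0.5 * (g1 - g2) * np.sum((x1 - x2) ** 2)
print("cost difference:", lhs, "lemma right-hand side:", rhs)
```

The two printed numbers should agree up to numerical error. Since $\gamma_1 \lt \gamma_2$ in this construction, the lemma also implies that the solution with the smaller multiplier has the lower cost.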

Algorithm 1. Semi-supervised learning with regularized adjacency matrix.

Figure 1. Cost in Algorithm 1 with the standard and normalized versions of the constraint, on 50 realizations of SBM with $n = 500, p_{\rm out} = 0.03$ and 50 labeled nodes with $10\%$ noise.

Figure 2. Average accuracy obtained by different semi-supervised clustering methods on DC-SBM graphs, with $n = 2000,\ p_{\rm in} = 0.04$, and $ p_{\rm out} = 0.02$ with different distributions for $\theta$. The number of labeled nodes is equal to 40. Accuracies are computed on the unlabeled nodes, and are averaged over 100 realizations; the error bars show the standard error.

Figure 3. Average accuracy obtained on a subset of the MNIST data set by different semi-supervised algorithms as a function of the oracle-misclassification ratio, when the number of labeled nodes is equal to 10. Accuracy is averaged over 100 random realizations, and the error bars show the standard error.

Figure 4. Average accuracy obtained on the unlabeled, correctly labeled, and wrongly labeled nodes by the oracle. Simulations are done on the 1,000 digits (2,4). The noisy oracle correctly classifies 24 nodes and misclassifies 16 nodes, and the boxplots show 100 realizations.

Figure 5. Average accuracy obtained on real networks by different semi-supervised algorithms as a function of the oracle-misclassification ratio. The number of labeled nodes is 30 for Political Blogs and LiveJournal, and 100 for DBLP. Accuracy is averaged over 50 random realizations, and the error bars show the standard error.

Table 1. Parameters of the real data sets. $n_1$ (resp., $n_2$) corresponds to the size of the first (resp., second) cluster, and $|E|$ is the number of edges of the network.