
On model selection for dense stochastic block models

Published online by Cambridge University Press:  14 January 2022

Ilkka Norros*
Affiliation:
University of Helsinki
Hannu Reittu*
Affiliation:
VTT Technical Research Centre of Finland
Fülöp Bazsó*
Affiliation:
Wigner Research Centre for Physics
*Postal address: Department of Mathematics and Statistics, P.O. Box 64, FI-00014 University of Helsinki, Finland. Email address: ilkka.norros@elisanet.fi
**Postal address: VTT Technical Research Centre of Finland, Espoo, 02044 VTT, Finland. Email address: hannu.reittu@vtt.fi
***Postal address: Department of Computational Sciences, Institute for Particle and Nuclear Physics, Wigner Research Centre for Physics, P.O. Box 49, H-1525 Budapest, Hungary. Email address: bazso.fulop@wigner.hu

Abstract

This paper studies estimation of stochastic block models with Rissanen’s minimum description length (MDL) principle in the dense graph asymptotics. We focus on the problem of model specification, i.e., identification of the number of blocks. Refinements of the true partition always decrease the code part corresponding to the edge placement, and thus a respective increase of the code part specifying the model should overweight that gain in order to yield a minimum at the true partition. The balance between these effects turns out to be delicate. We show that the MDL principle identifies the true partition among models whose relative block sizes are bounded away from zero. The results are extended to models with Poisson-distributed edge weights.

Type
Original Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

Let us define a stochastic block model (SBM) as a random graph $G=(V,E)$ , whose structure is defined by a partition $\xi=\left\{{{A_1,\ldots,A_k}}\right\}$ of V and by a symmetric $k\times k$ matrix $D=(d_{ij})_{i,j=1}^k$ of real numbers $d_{ij}\in[0,1]$ as follows: for every pair $\left\{{{v,w}}\right\}$ of distinct vertices of V such that $v\in A_i$ , $w\in A_j$ , $\left\{{{v,w}}\right\}\in E$ with probability $d_{ij}$ , and all Bernoulli random variables $e_{vw}=e_{wv}=1_{\{{\left\{{{v,w}}\right\}\in E}\}}$ are independent. For unambiguity, we also set the irreducibility condition that no two rows of D are equal, i.e.,

(1.1) \begin{equation}\mbox{for all }i,\ j,\ i<j,\mbox{ there is }q_{ij}\in\left\{{{1,\ldots,k}}\right\}\mbox{ such that }d_{iq_{ij}}\not=d_{jq_{ij}}.\end{equation}

An SBM is said to be irreducible if its partition $\xi$ and matrix D satisfy (1.1). This condition is necessary for identifiability, because a random graph generated by an irreducible block model with $(\xi,D)$ can be equivalently generated by any reducible model obtained by refining $\xi$ , i.e., splitting one or more of its blocks into smaller sets.
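To make the definition concrete, the following minimal Python sketch (our illustration only; numpy and all parameter values are assumptions, not taken from the paper) samples the adjacency matrix of an SBM:

import numpy as np

def sample_sbm(block_sizes, D, rng=None):
    # Sample the symmetric 0/1 adjacency matrix of an SBM.
    # block_sizes: cardinalities |A_1|, ..., |A_k|.
    # D: symmetric k x k matrix of edge probabilities d_ij in [0, 1].
    rng = np.random.default_rng() if rng is None else rng
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    n = labels.size
    A = np.zeros((n, n), dtype=int)
    for v in range(n):
        for w in range(v + 1, n):  # independent Bernoulli variables e_vw
            A[v, w] = A[w, v] = int(rng.random() < D[labels[v], labels[w]])
    return A, labels

# Example: two blocks of 50 vertices, within-density 0.6, between-density 0.2.
A, labels = sample_sbm([50, 50], np.array([[0.6, 0.2], [0.2, 0.6]]))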

A slightly different variant of SBMs has been defined by first drawing the sizes of blocks $A^{(n)}_i$ as independent Poisson random variables and then proceeding with the matrix D as before. SBMs have also been called generalized random graphs; the case of trivial partition $\xi=\left\{{{V}}\right\}$ yields a random graph with edge probability $d_{11}$ .

SBMs were first defined for community modeling in [Reference Holland, Laskey and Leinhardt8], and community detection has remained their main application as well as a source of research problems—see e.g. [Reference Heimlicher, Lelarge and Massoulié7, Reference Massoulié10, Reference Peixoto13]. However, large networks appear now in almost all imaginable application areas, and there are grounds to consider SBMs as a rather generic form of the separation of structure and randomness in large real-world systems. Indeed, Szemerédi’s regularity lemma [Reference Szemerédi22], a fundamental result in graph theory, states that any large enough graph can be approximated by an SBM-like block structure, where the randomness is replaced by pseudo-randomness in terms of so-called $\epsilon$ -regularity, with a number of blocks k that depends only on $\epsilon$ . This remarkable combinatorial fact has been utilized to solve hard problems in many areas of mathematics (see e.g. [Reference Alon, Fischer, Newman and Shapira1, Reference Komlós, Simonovits, Miklós, Sós and Szonyi9]). Looking for SBM-like structures in large real-world graphs and matrices is a generic way to obtain low-dimensional approximations and compressed representations of them.

The typical case of SBM estimation is that the graph $G=(V,E)$ is observed, but the partition $\xi$ is unobserved. For any fixed candidate partition, a corresponding estimate of the matrix D of edge densities is obtained immediately as consisting of empirical means.

If the number of blocks k is known (and the irreducibility condition (1.1) holds, as we always assume), the partition can be identified accurately by several methods, when the graph is sufficiently dense. In mathematical theorems, one considers asymptotics, and dense graph asymptotics means that the densities $d_{ij}$ remain the same when the graph size $n=|V|$ grows to infinity, the relative block sizes staying fixed. The main estimation techniques are maximum likelihood fitting, where the optimization may utilize expectation maximization (e.g. [Reference Bolla and Elbanna3]), simulated annealing or Markov chain Monte Carlo algorithms, and spectral methods (e.g. [Reference Bolla2, Reference Massoulié10]). In sparse graph asymptotics, where all densities decrease to zero while keeping their proportions, identification of the partition may be possible or impossible even with known k, depending on the proportions of the densities; see [Reference Heimlicher, Lelarge and Massoulié7, Reference Massoulié10]. Gao et al. [Reference Gao, Ma, Zhang and Zhou5] consider achieving the minimal possible fraction of errors in SBM estimation. Their work is mostly relevant for the sparse case and when k is known.

We focus on the case called model selection, where k is unknown, and on the dense graph setting. In practical tasks of grouping real-world data, it is often a major challenge to select the ‘right’ partition size k. Intuitively, the optimal choice of k should strike a balance between the degree of pseudo-randomness of the edge-level structures and the model complexity specified by the size of the partition. A popular device has been the Akaike information criterion (AIC), which has been applied to block model estimation in, e.g., [Reference Nepusz, Négyessy, Tusnády, Bazsó, Bollobás, Kozma, Miklós and Springer11]. The AIC simply adds the ‘number of model parameters’ to the maximal log-likelihood and chooses the model that minimizes the sum.

We apply a more refined information-theoretic approach, Rissanen’s minimum description length (MDL) principle (see [Reference Grünwald6]), according to which the quality of a model should be measured by the length of the bit string that it yields for unique encoding of the data. In our case, the data consist of the graph G, and the model consists of the partition $\xi$ and the density matrix D. Rosvall and Bergstrom [Reference Rosvall and Bergstrom21] pointed out information theory as a natural choice for community structure analysis, grouping redundant elements into a few communities. The MDL principle, in particular, has been applied to SBMs at least by Peixoto [Reference Peixoto14]. He derived interesting estimates which indicate that there is an upper limit on the number of communities that can be detected, $k\sim \sqrt{n}$ as a function of the number of vertices n. However, he did not consider the exact identification of the number of blocks k, which is the focus of the present paper.

Wang and Bickel [Reference Wang and Bickel23] also used information theory for likelihood-based model selection for SBMs. Their conclusions are similar to ours. In addition, they show the validity of results in a sparse case when the average degree grows at a polylog (polynomial in $\log n$ ) rate and in the case of degree-corrected block models. They use asymptotic distributions instead of the exact ones that we use. The algorithmic part of their work is also different from ours. They arrive at a likelihood-based Bayesian information criterion (BIC) that is asymptotically consistent. Along with a term that corresponds to the log-likelihood, there is a term proportional to $k^2n\log n$ with a tuning coefficient that has to be chosen separately in every specific case. So defined, the BIC has a minimum at the right value of k. Instead of such a term, we prefer to use an MDL model complexity that requires no case-by-case tuning. Our techniques are different as well.

The main contributions of this paper are the following: (i) a systematic application of the MDL approach, providing (ii) simple and transparent techniques, building on Chernoff bounds, to (iii) prove three theorems on model specification. A crucial role is played by Theorem 3.3, which states that all refinements of $\xi$ increase an MDL-based objective function with high probability. We show that the MDL principle identifies the true partition with high probability among models whose relative block sizes are bounded away from zero. The results are also shown to hold with Poissonian edge weights. We have applied Poissonian block models as a heuristic for finding regular structures in large real-world data matrices, in a method we call regular decomposition [Reference Nepusz, Négyessy, Tusnády, Bazsó, Bollobás, Kozma, Miklós and Springer11, Reference Pehkonen and Reittu12, Reference Reittu, Bazsó, Weiss, Fränti, Brown, Loog, Escolano, Pelillo and Springer15, Reference Reittu, Norros and Bazsó17, Reference Reittu18].

The paper is structured as follows. Section 2 defines the main notions: SBMs and the MDL principle. Section 3 presents our main results. The proofs are given in Section 4; some of the auxiliary propositions may be of independent interest. Finally, the results are discussed in Section 5. The proofs apply simple information-theoretic tools collected in Appendix A.

2. Basics and definitions

2.1. Stochastic block models

If G(V, E) is a graph, where V is the set of vertices and E is the set of edges, the link density of a nonempty vertex set $X\subseteq V$ is defined as

\begin{equation*}d(X)=\frac{|e(X)|}{\binom{|X|}{2}},\quad\mbox{where}\quad e(X)=\left\{{{\left\{{{v,w}}\right\}\in E}} \,{:}\,{{v,w\in X}}\right\},\end{equation*}

and $\left|\cdot\right|$ denotes the cardinality of a set. Similarly, the link density between two disjoint nonempty vertex sets $X,Y\subseteq V$ is defined as

\begin{equation*}d(X,Y)\,{:\!=}\, \frac{|e(X,Y)|}{\left|X\right|\left|Y\right|},\quad\mbox{where}\quad e(X,Y)=\left\{{{\left\{{{v,w}}\right\}\in E}} \,{:}\, {{v\in X,\ w\in Y}}\right\}.\end{equation*}
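In code, these empirical densities are straightforward; a small sketch (ours, reusing the numpy adjacency matrix A from the sampling sketch in Section 1):

def link_density_within(A, X):
    # Empirical d(X): fraction of present edges among the pairs inside X.
    # Assumes |X| >= 2.
    X = list(X)
    m = len(X)
    edges = sum(A[X[i], X[j]] for i in range(m) for j in range(i + 1, m))
    return edges / (m * (m - 1) / 2)

def link_density_between(A, X, Y):
    # Empirical d(X, Y) for disjoint nonempty vertex sets X and Y.
    return sum(A[v, w] for v in X for w in Y) / (len(X) * len(Y))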

Definition 2.1. Let $\epsilon>0$ . A pair of disjoint sets $A,B\subseteq V$ of G(V, E) is called $\epsilon$ -regular if, for every $X\subseteq A$ and $Y\subseteq B$ such that $\left|X\right|> \epsilon \left|A\right|$ and $\left|Y\right|> \epsilon \left|B\right|$ , we have

\begin{equation*}\left|d(X,Y)-d(A,B)\right|< \epsilon.\end{equation*}

The notion of a (binary) stochastic block model (SBM) was defined at the very beginning of this paper. A Poissonian block model is defined similarly except that the elements of the matrix D are numbers in $[0,\infty)$ , and the random variables $e_{ij}$ have Poisson distributions with respective parameters $d_{ij}$ . Thanks to the independence assumption, the sums $\sum_{u\in X}\sum_{v\in Y}e_{uv}$ are Poisson-distributed for any disjoint $X,Y\subset V$ .

A sequence of random graphs $G_n=(V_n,E_n)$ representing copies of the same SBM at different sizes can, for definiteness, be constructed as follows.

Construction 2.1. Let $\gamma_1,\ldots,\gamma_k$ be positive, distinct real numbers such that $\sum_{i=1}^k\gamma_i=1$ .

  1. Divide the interval (0,1] into k segments

    \begin{equation*} I_1=(0,\gamma_1],\ I_2=(\gamma_1,\gamma_1+\gamma_2],\ \ldots,\ I_k=\left(\sum_{i=1}^{k-1}\gamma_i,1\right],\end{equation*}
    and define $\Gamma=\left\{{{I_1,\ldots,I_k}}\right\}$ . For $n=1,2,\ldots$ , let the vertices of $G_n$ be
    \begin{equation*}V_n=\left\{{{\frac{i}{n}}} \,{:}\, {{i\in\left\{{{1,\ldots,n}}\right\}}}\right\}.\end{equation*}
    For each n, let $\xi_n$ be the partition of $V_n$ into the blocks
    \begin{equation*} A^{(n)}_i=I_i\cap V_n,\quad i=1,\ldots,k.\end{equation*}

For small n, some of the blocks $A^{(n)}_i$ may be empty. However, from some $n_0$ on, all blocks are nonempty and $\xi_n=\left\{{{A^{(n)}_1,\ldots,A^{(n)}_k}}\right\}$ is a genuine partition of $V_n$ . We can then generate a sequence of SBMs based on the sequence $(V_n,\xi_n,D)$ .
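A sketch of Construction 2.1 in Python (ours; exact rational arithmetic via the fractions module avoids floating-point errors at the interval endpoints):

from fractions import Fraction

def construction_blocks(gammas, n):
    # Partition V_n = {1/n, ..., n/n} into blocks A_i^(n) = I_i intersect V_n,
    # where I_i are consecutive intervals of lengths gamma_i.
    cuts = [sum(gammas[:i], Fraction(0)) for i in range(len(gammas) + 1)]
    blocks = [[] for _ in gammas]
    for i in range(1, n + 1):
        v = Fraction(i, n)
        b = next(j for j in range(len(gammas)) if cuts[j] < v <= cuts[j + 1])
        blocks[b].append(v)
    return blocks

# Example with (gamma_1, gamma_2, gamma_3) = (1/2, 3/10, 1/5) and n = 20:
sizes = [len(B) for B in construction_blocks(
    (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)), 20)]
print(sizes)  # [10, 6, 4]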

2.2. The minimum description length principle

The minimum description length (MDL) principle was introduced by J. Rissanen, inspired by Kolmogorov’s complexity theory (see [Reference Grünwald6, Reference Rissanen20]; our general reference on coding and information theory is [Reference Cover and Thomas4]). The basic idea is the following: a set $\mathcal{D}$ of data is optimally explained by a model $\mathcal{M}$ when a combined unique encoding of (i) the model and (ii) the data as interpreted in this model is as short as possible. An encoding means here a bijective mapping of objects to a binary prefix code, i.e., a code where no code word is the beginning of another code word, and any sequence of them is thus uniquely decodable.

The principle is best illustrated by our actual case, that of simple graphs. A graph $G=(V,E)$ with $|V|=n$ can always be encoded as a binary string of length $\binom{n}{2}=n(n-1)/2$ , where each binary variable corresponds to a vertex pair, and a value 1 (resp. 0) indicates an edge (resp. absence of an edge). Thus, the MDL of G is always at most $\binom{n}{2}$ . However, G may have a structure whose disclosure would allow a much shorter description. For example, let G be a bipartite graph consisting of two blocks with cardinality $n/2$ . Knowing this, it suffices to code the edges between the blocks, which requires at most $(n/2)^2$ bits, just half of the previous number. However, the partition must still be specified. It requires n additional bits to say which block each vertex belongs to. On the other hand, if the overall density of G is a number $d=h/\binom{n}{2}\not=1/2$ , the edges can be encoded using only $\binom{n}{2}H(d)$ bits, where H(d) denotes the Shannon entropy $H(d)=-d\log_2d-(1-d)\log_2(1-d)$ . (There are $\binom{N}{K}$ binary sequences of length N that have exactly K 1s, and $\binom{N}{K}\le\exp(NH(K/N))$ ; see e.g. [24, IV.2].) However, the integer h must be encoded, and the shortest-length prefix coding for integers requires

(2.1) \begin{equation}l^* (h)=\max(0,\log_2(h))+\max(0,\log_2\log_2(h))+\cdots\end{equation}

bits [Reference Grünwald6, Reference Rissanen19].
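For illustration, (2.1) can be evaluated by iterating the base-2 logarithm until it drops below zero (a sketch of ours):

import math

def l_star_bits(h):
    # Code length l*(h) of (2.1) for a positive integer h, in bits.
    total, x = 0.0, float(h)
    while x > 1.0:  # max(0, log2 x) > 0 exactly when x > 1
        x = math.log2(x)
        total += x
    return total

print(l_star_bits(1000))  # log2 terms 9.97 + 3.32 + 1.73 + 0.79, about 15.8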

Definition 2.2. Denote by $\mathcal{M}_{n/k}$ the set of all irreducible (see the condition (1.1)) SBMs $(V,\xi,D)$ with $V=\left\{{{1,\ldots,n}}\right\}$ and $\xi=\left\{{{V_1,\ldots,V_k}}\right\}$ such that, for $i,j\in\left\{{{1,\ldots,k}}\right\}$ ,

\begin{equation*} d_{ij}=\frac{h_{ij}}{|V_i||V_j|},\quad d_{ii}=\frac{h_{ii}}{\binom{|V_i|}{2}},\quad h_{ij}\in\left\{{{0,\ldots,|V_i||V_j|}}\right\},\quad h_{ii}\in\left\{{{0,\ldots,\binom{|V_i|}{2}}}\right\}. \end{equation*}

$\mathcal{M}_{n/k}$ is called the modeling space of SBMs with n vertices and k blocks, and

(2.2) \begin{eqnarray}\mathcal{M}_n\,{:\!=}\,\bigcup_{1\leq k\leq n}\mathcal{M}_{n/k}\end{eqnarray}

is called the modeling space of all SBMs with n vertices.

The condition that the $h_{ij}$ be integers ensures that the modeling space $\mathcal{M}_n$ is finite. The models in $\mathcal{M}_{n/k}$ are parameterized by $\Theta=(\xi,D)$ . Note that without the irreducibility condition (1.1) there would not be a bijection between SBMs and their parameterizations.

A good model $\Theta\in\mathcal{M}_{n/k}$ for a graph G is one that gives the largest probability for G, and it is called the maximum likelihood model. Denote the parameter of this model by

(2.3) \begin{eqnarray}\hat{\Theta}_k(G)\,{:\!=}\, \arg\ \max_{\Theta\in \mathcal{M}_{n/k}}P(G\mid\Theta),\end{eqnarray}

where $P(G\mid \Theta)$ denotes the probability that the probabilistic model specified by $\Theta$ produces G. Part of likelihood optimization is trivial: when a partition $\eta=\left\{{{A_1,\ldots,A_k}}\right\}$ is considered for a given graph G, the optimal edge probabilities are the empirical edge densities,

(2.4) \begin{eqnarray}d_{ij}=\frac{|e(A_i,A_j)|}{|A_i|| A_j|},\quad i\neq j, \qquad d_{ii}=\frac{|e(A_i)|}{\binom{|A_i|}{2}}.\end{eqnarray}

As a result, the edge densities are always rational numbers, which explains the corresponding definition used in $\mathcal{M}_{n/k}$ .

The nontrivial part of maximum likelihood estimation within $\mathcal{M}_{n/k}$ is to find the optimal partition, but, as noted in the introduction, this has also been well studied in the literature.

Kraft’s inequality (e.g. [Reference Cover and Thomas4]) implies that if letters are drawn from an alphabet with probabilities $p_1,\ldots,p_N$ , there exists a prefix coding with code lengths $\lceil -\log_2 p_1\rceil, \ldots, \lceil-\log_2 p_N\rceil$ , and this coding scheme is within one bit per letter of the minimal mean code length. Thus, for any probability distribution P on the space of all graphs G with n vertices, there exists a prefix coding with code lengths $\lceil-\log_2{P(\left\{{{G}}\right\})}\rceil$ . In particular, to every graph $G_0$ with $|V|=n$ we can associate an encoding with code lengths $\lceil-\log_2 P(G|\hat{\Theta}_k(G_0))\rceil$ corresponding to the SBM $\hat{\Theta}_k(G_0)\in\mathcal{M}_{n/k}$ . However, this is not all, since in order to be able to decode we must know what particular model was used. This means that $\hat{\Theta}_k(G_0)$ must also be prefix-encoded, with some code length $L(\hat{\Theta}_k(G_0))$ . We end up with a description length called the two-part MDL [Reference Grünwald6]:

(2.5) \begin{eqnarray}L(G)=\lceil-\log_2{P(G\mid \hat{\Theta}_k(G))}\rceil+L(\hat{\Theta}_k(G)).\end{eqnarray}

A detailed implementation of this principle is proposed in the next section.

3. MDL analysis of stochastic block models

3.1. Block model codes

In the rest of this paper, we define information-theoretic functions in terms of natural logarithms, and certain notions (such as code lengths) should be divided by $\log 2$ to obtain their values in bits. We denote by $H(\cdot)$ both Shannon’s entropy function for a partition and the entropy of a Bernoulli distribution:

\begin{equation*}H(\eta)=-\sum_{B\in\eta}\frac{|B|}{|V|}\log\frac{|B|}{|V|},\qquad H(p)=-p\log p-(1-p)\log(1-p).\end{equation*}

Definition 3.1. A block model code of a graph $G=(V,E)$ with respect to a partition $\eta=\left\{{{B_1,\ldots,B_m}}\right\}$ of V is a code with the structure described below. The model part is as follows:

  1. The sizes of the blocks $B\in\eta$ are given as integers.

  2. The edge density d(B) inside each block $B\in\eta$ and the edge density $d(B,B^{\prime})$ between each pair of distinct blocks $B,B^{\prime}\in\eta$ are given as the numerators of the rational numbers presenting the exact (empirical) densities.

  3. For $v\in V$ , define $i_v=\sum_{i=1}^m i1_{\{{v\in B_i}\}}$ . The partition $\eta$ is specified by encoding the sequence $(i_v)_{v\in V}$ by a prefix code corresponding to the membership distribution $P(v\in B_j)=|B_j|/n$ .

The data part is as follows:

  4. The edges inside each block $B_i\in\eta$ are specified by a prefix code corresponding to a random distribution of edges with density $d(B_i)$ .

  5. The edges between each pair of distinct blocks $B_i,B_j\in\eta$ are specified by a prefix code corresponding to a random distribution of edges with density $d(B_i,B_j)$ .

Note that a block model code can be given for any graph with respect to any partition of its vertices. Next, we shall specify the functions we use to estimate the lengths of each of the five code parts described in Definition 3.1. Parts 3–5 present long sequences from a finite ‘alphabet’. By classical information theory, basically Kraft’s inequality, the required code length per element is the entropy of the distribution of the ‘letters’.

Parts 1 and 2 encode certain integers. To obtain mathematically tractable and well-behaved estimates, we make some simplifications. Instead of Rissanen’s $l^*$ function (2.1), we use the plain logarithm, and instead of the lengths of numerators of rational numbers, we use the lengths of their denominators. (Besides simplicity, this choice has the desirable effect that the respective code length estimates then depend only on the partition structure, not on the edge densities.) All length estimates should be nonzero. On the other hand, we do not need to care about ceiling functions. These considerations lead us to estimate the code length of a positive integer N as $\log(1+N)$ . With the additional simplification $\binom{m}{2}\approx m^2/2$ , the estimate of the length of Part 2 of the code would then be

(3.1) \begin{equation}L_2^{\prime}(G|\eta)=\sum_{B\in\eta}\log\frac{(1+|B|)^2}{2} +\frac{1}{2}\sum_{B,B^{\prime}\in\eta,\ B\not=B^{\prime}}\log\big((1+|B|)(1+|B^{\prime}|)\big).\end{equation}

So far we have followed the MDL methodology quite strictly, up to the above simplifications. Now, however, we introduce a twist that is motivated only by its usefulness in proving Theorem 3.3: we strengthen the impact of Part 2 by multiplying $L_2^{\prime}(G|\eta)$ by $|\eta|+1$.

Thus, we propose to base SBM selection on the following weighted estimate of the length of a block model code:

(3.2) \begin{align}\nonumber L(G|\eta)&=L_1(G|\eta)+L_2(G|\eta)+L_3(G|\eta)+L_4(G|\eta)+L_5(G|\eta),\quad\mbox{where}\\[1pt]\nonumber L_1(\eta)&=\sum_{i=1}^m\log(1+|B_i|),\\[1pt]L_2(\eta)&=(m+1)\left(\sum_{i=1}^m\log\frac{(1+|B_i|)^2}{2} +\sum_{i<j}\log\big((1+|B_i|)(1+|B_j|)\big)\right),\\[1pt] \nonumber L_3(G|\eta)&=|V|H(\eta),\\[1pt]\nonumber L_4(G|\eta)&=\sum_{i=1}^m\binom{|B_i|}{2}H(d(B_i)),\\[1pt]\nonumber L_5(G|\eta)&=\sum_{i<j}|B_i||B_j|H(d(B_i,B_j)).\end{align}

For sums of only some of the functions $L_i$ , we define $L_{12}(G|\eta)\,{:\!=}\,L_{1}(G|\eta)+L_{2}(G|\eta)$ , etc.
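A direct transcription of (3.2) into Python may clarify the bookkeeping (a sketch of ours, in nats, reusing the density helpers from the sketch in Section 2.1; the function names are our assumptions):

import math

def H_bern(p):
    # Bernoulli entropy in nats, with H(0) = H(1) = 0.
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

def description_length(A, blocks):
    # Weighted block-model code length L(G|eta) of (3.2).
    # A: 0/1 adjacency matrix; blocks: list of lists of vertex indices.
    n = sum(len(B) for B in blocks)
    m = len(blocks)
    sizes = [len(B) for B in blocks]
    L1 = sum(math.log(1 + b) for b in sizes)
    L2 = (m + 1) * (sum(math.log((1 + b) ** 2 / 2) for b in sizes)
                    + sum(math.log((1 + sizes[i]) * (1 + sizes[j]))
                          for i in range(m) for j in range(i + 1, m)))
    L3 = -n * sum((b / n) * math.log(b / n) for b in sizes)
    L4 = sum(math.comb(b, 2) * H_bern(link_density_within(A, B))
             for b, B in zip(sizes, blocks) if b >= 2)
    L5 = sum(sizes[i] * sizes[j]
             * H_bern(link_density_between(A, blocks[i], blocks[j]))
             for i in range(m) for j in range(i + 1, m))
    return L1 + L2 + L3 + L4 + L5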

Proposition 3.1. For any graph $G=(V,E)$ , the model part $L_{123}(G|\eta)$ is monotonically increasing and the data part $L_{45}(G|\eta)$ is monotonically decreasing in $\eta$ with respect to refinement of partitions. That is, for any partitions $\eta$ and $\zeta$ of V such that $\eta\le\zeta$ , we have

\begin{align*}L_{123}(G|\eta)\le L_{123}(G|\zeta),\quad L_{45}(G|\eta)\ge L_{45}(G|\zeta).\end{align*}

Proof. The claim concerning the model part follows from the subadditivity of $\log(1+x)$ for $x\ge1$ and from the monotonicity of $\eta\mapsto H(\eta)$ .

As regards the data part, note that the internal density of $B\in\eta$ can be written as a convex combination of densities related to the partition $\zeta\cap B$ of B:

\begin{align*}d(B)&=\frac{|E(B)|}{\displaystyle\binom{|B|}{2}}\\&=\frac{1}{\displaystyle\binom{|B|}{2}}\left(\sum_{C\in\zeta\cap B}|E(C)|+\sum_{\begin{array}{c}\scriptstyle C,C^{\prime}\in\zeta\cap B\\\scriptstyle C\not=C^{\prime} \end{array}}|E(C,C^{\prime})|\right)\\&=\sum_{C\in\zeta\cap B}\frac{\displaystyle\binom{|C|}{2}}{\displaystyle\binom{|B|}{2}}d(C)+\frac12\sum_{\begin{array}{c}\scriptstyle C,C^{\prime}\in\zeta\cap B\\\scriptstyle C\not=C^{\prime} \end{array}}\frac{|C||C^{\prime}|}{\displaystyle\binom{|B|}{2}}d(C,C^{\prime}).\end{align*}

By the concavity of H, we then have

\begin{align*}L_{45}(G|\eta)&=\sum_{B\in\eta}\binom{|B|}{2}H(d(B))+\frac12\sum_{\begin{array}{c}\scriptstyle B,B^{\prime}\in\eta\\\scriptstyle B\not=B^{\prime} \end{array}}|B||B^{\prime}|H(d(B,B^{\prime}))\end{align*}
(3.3) \begin{align}&\ge\sum_{B\in\eta}\sum_{C\in\zeta\cap B}\binom{|C|}{2}H(d(C))\nonumber\\[3pt]&\quad+\sum_{B\in\eta}\frac12\sum_{\begin{array}{c}\scriptstyle C,C^{\prime}\in\zeta\cap B \\\scriptstyle C\not=C^{\prime} \end{array}}|C||C^{\prime}|H(d(C,C^{\prime}))\nonumber\\[3pt]&\quad+\frac12\sum_{\begin{array}{c}\scriptstyle B,B^{\prime}\in\eta \\\scriptstyle B\not=B^{\prime} \end{array}}\sum_{C\in\zeta\cap B}\sum_{C^{\prime}\in\zeta\cap B^{\prime}}|C||C^{\prime}|H(d(C,C^{\prime}))\nonumber\\[3pt]&=L_{45}(G|\zeta),\end{align}

where the first two sums after the inequality sign come from $L_4(G|\eta)$ and the third from $L_5(G|\eta)$ .

Poissonian block models

Let us now consider Poissonian block models, where the entries of E are Poisson-distributed integers. For a pair of disjoint sets $A,B\subset V$ , the set $\left\{{{e_{ij}}} \,{:}\, {{i\in A,\,j\in B}}\right\}$ is a sample from a distribution $R=(r_\ell)_{\ell\ge0}$ that is a mixture of Poisson distributions. It would be hard to encode the sample by first estimating the unknown mixture distribution. Instead, we base the code simply on the sample mean

\begin{equation*}e_{AB}=\frac{1}{|A||B|}\sum_{i\in A,\ j\in B}e_{ij}\end{equation*}

and encode $\left\{{{e_{ij}}} \,{:}\, {{i\in A,\,j\in B}}\right\}$ as if it came from a Poisson distribution $P=(p_\ell)$ with parameter $e_{AB}$ . If the $e_{ij}$ really came from a Poisson distribution, a value $e_{ij}=\ell$ would, by Kraft’s inequality, be well encoded by a code word with approximate length

\begin{equation*}-\log p_\ell=-\log\bigg(\frac{e_{AB}^\ell}{\ell!}e^{-e_{AB}}\bigg).\end{equation*}

However, a mixture of different Poisson distributions is not a Poisson distribution. Denote by $R=(r_\ell)$ the empirical distribution of $\left\{{{e_{ij}}} \,{:}\, {{i\in A,\ j\in B}}\right\}$ ; i.e., $r_\ell$ is the number of $e_{ij}$ with value $\ell$ , divided by $|A||B|$ . The fundamental information inequality

(3.4) \begin{equation}\sum_\ell r_\ell(\!-\log p_\ell)\ge\sum_\ell r_\ell(\!-\log r_\ell)=H(R)\end{equation}

implies that this encoding is suboptimal for arbitrary disjoint subsets A and B, but it is optimal when A and B are blocks of the model partition $\xi$ and R comes from a sample from a pure Poisson distribution. (The inequality (3.4) is a special case of the nonnegativity of the Kullback–Leibler divergence; it says that using a code recommended by Kraft’s inequality with wrong probabilities increases the expected code word length.) This suboptimality is no problem, because for our purposes it only improves the contrast between $\xi$ and other partitions of V.

Using the above coding rule for an arbitrary partition $\eta=\left\{{{B_1,\ldots,B_m}}\right\}$ , the amount of code needed to encode all of the $e_{ij}$ is

\begin{align*}&\sum_{i=1}^m\binom{|B_i|}{2}\sum_{\ell\ge0}r_\ell^{(B_i)}\left(\!-\log\left(\frac{e_{B_i}^\ell}{\ell!}e^{-e_{B_i}}\right)\right)\\[3pt]&\quad +\sum_{i<j}|B_i||B_j|\sum_{\ell\ge0}r_\ell^{(B_iB_j)}\left(\!-\log\left(\frac{e_{B_iB_j}^\ell}{\ell!}e^{-e_{B_iB_j}}\right)\right)\\[3pt]&=\sum_{i=1}^m\binom{|B_i|}{2}(\!-\log e_{B_i})\sum_{\ell\ge0}\ell r_\ell^{(B_i)}+\sum_{i<j}|B_i||B_j|(\!-\log e_{B_iB_j})\sum_{\ell\ge0}\ell r_\ell^{(B_iB_j)}\\[3pt]&\quad +\sum_{i=1}^m\binom{|B_i|}{2}\sum_{\ell\ge0}r_\ell^{(B_i)}\log \ell!+\sum_{i<j}|B_i||B_j|\sum_{\ell\ge0}r_\ell^{(B_iB_j)}\log\ell!\\[3pt]&\quad +\sum_{i=1}^m\binom{|B_i|}{2}e_{B_i}+\sum_{i<j}|B_i||B_j|e_{B_iB_j}\\[3pt]&=\sum_{i=1}^m\binom{|B_i|}{2}\phi(e_{B_i})+\sum_{i<j}|B_i||B_j|\phi(e_{B_iB_j})+\frac{1}{2}\sum_{v,w\in V,\,v\not=w}\log e_{vw}!+\frac{1}{2}\sum_{v,w\in V,\,v\not=w}e_{vw},\end{align*}

where

\begin{equation*}e_{B_i}\,{:\!=}\,\frac{\sum_{v,w\in B_i}e_{vw}}{|B_i|(|B_i|-1)},\quad e_{B_iB_j}\,{:\!=}\,\frac{\sum_{v\in B_i,\,w\in B_j}e_{vw}}{|B_i||B_j|},\end{equation*}

the r-symbols refer to the empirical distribution of the $e_{vw}$ , and $\phi(x)=-x\log x$ . Now, we define our MDL-based objective function for a Poissonian block model E with respect to any partition $\eta=\left\{{{B_1,\ldots,B_m}}\right\}$ as

(3.5) \begin{align}\nonumber L(E|\eta)&=L_0(E|\eta)+L_1(E|\eta)+L_2(E|\eta)+L_3(E|\eta)+L_4(E|\eta)+L_5(E|\eta),\\[3pt]\nonumber L_0(E|\eta)&=\frac{1}{2}\sum_{v,w\in V,\,v\not=w}\log e_{vw}!+\frac{1}{2}\sum_{v,w\in V,\,v\not=w}e_{vw},\\[3pt]\nonumber L_1(E|\eta)&=\sum_{i=1}^m\log(1+|B_i|),\\[3pt] L_2(E|\eta)&=(m+1)\left(\sum_{i=1}^m\log\frac{(1+|B_i|)^2}{2} +\sum_{i<j}\log\big((1+|B_i|)(1+|B_j|)\big)\right),\\[3pt] \nonumber L_3(E|\eta)&=|V|H(\eta),\\[3pt] \nonumber L_4(E|\eta)&=\sum_{i=1}^m\binom{|B_i|}{2}\phi(e_{B_i}),\\[3pt] \nonumber L_5(E|\eta)&=\sum_{i<j}|B_i||B_j|\phi(e_{B_iB_j}).\end{align}

Note that the term $L_0(E|\eta)$ is independent of the partition $\eta$ and can be neglected when minimizing over $\eta$ . The parts $L_1(E|\eta)$ and $L_2(E|\eta)$ are copied from the case of plain graphs.
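Correspondingly, a sketch (ours) of the partition-dependent part of (3.5); $L_0$ is omitted since it does not affect the minimization over $\eta$ :

import math

def phi(x):
    # phi(x) = -x log x, with phi(0) = 0.
    return 0.0 if x == 0 else -x * math.log(x)

def poisson_description_length(E, blocks):
    # L1 + ... + L5 of (3.5) in nats, for a symmetric integer matrix E
    # of edge weights e_vw and a partition given as lists of indices.
    n = sum(len(B) for B in blocks)
    m = len(blocks)
    sizes = [len(B) for B in blocks]
    L1 = sum(math.log(1 + b) for b in sizes)
    L2 = (m + 1) * (sum(math.log((1 + b) ** 2 / 2) for b in sizes)
                    + sum(math.log((1 + sizes[i]) * (1 + sizes[j]))
                          for i in range(m) for j in range(i + 1, m)))
    L3 = -n * sum((b / n) * math.log(b / n) for b in sizes)
    L4 = sum(b * (b - 1) / 2
             * phi(sum(E[v][w] for v in B for w in B if v != w) / (b * (b - 1)))
             for B, b in zip(blocks, sizes) if b >= 2)
    L5 = sum(sizes[i] * sizes[j]
             * phi(sum(E[v][w] for v in blocks[i] for w in blocks[j])
                   / (sizes[i] * sizes[j]))
             for i in range(m) for j in range(i + 1, m))
    return L1 + L2 + L3 + L4 + L5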

One application of Poisson block models was found in [Reference Reittu, Leskelä, Räty and Fiorucci16], which used them to analyze graph distance matrices. This approach can help in finding communities in large and sparse networks. In [Reference Reittu, Bazsó, Weiss, Fränti, Brown, Loog, Escolano, Pelillo and Springer15], it was suggested that the Poisson block model objective function (3.5) could be used for any matrix with nonnegative entries, resulting in the same cost function as in the integer case.

3.2. Accuracy of model selection

Our results on MDL-based model selection are summarized in the following theorems, formulated in terms of the asymptotic behavior of a model sequence as specified by Construction 2.1. In this framework, an event is said to happen with high probability if its probability tends to 1 as $n\to\infty$ .

Consider a sequence of irreducible SBMs $(G_n,\xi_n)$ based on a vector $(\gamma_1,\ldots,\gamma_k)$ of relative block sizes and a matrix $D=(d_{ij})_{i,j=1}^k$ of edge probabilities. Define

\begin{equation*}\mathfrak{d}(\eta,\xi_n)=\frac{1}{n}\max_{B\in\eta}\min_{A\in\xi_n}|B\setminus A|.\end{equation*}

Thus, $\mathfrak{d}(\eta,\xi_n)=0$ if and only if $\eta$ is a refinement of $\xi_n$ . For $\sigma\in(0,1)$ , denote by $\mathcal{P}_{n,\sigma}$ the set of partitions of $V_n$ whose blocks each have at least $\sigma n$ vertices. Obviously, $|\eta|\le1/\sigma$ for $\eta\in\mathcal{P}_{n,\sigma}$ . If $v\in V_n$ and $B\in\eta$ , denote by $\eta_{v,B}$ the partition obtained from $\eta$ by moving vertex v to block B (if $v\in B$ , then $\eta_{v,B}=\eta$ ).
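The quantity $\mathfrak{d}$ is easy to compute; a one-function sketch (ours):

def partition_distance(eta, xi, n):
    # (1/n) * max over B in eta of min over A in xi of |B \ A|;
    # it is zero exactly when eta is a refinement of xi.
    return max(min(len(set(B) - set(A)) for A in xi) for B in eta) / n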

Theorem 3.1. Fix some minimal relative block size $\sigma\in(0,\min_i\gamma_i)$ . There are two numbers $\epsilon_0>0$ and $\kappa_0>0$ such that the following holds with high probability: Let a partition $\eta\in\mathcal{P}_{n,\sigma}$ satisfy $|\eta|\ge k$ and $\mathfrak{d}(\eta,\xi_n)\le\epsilon_0$ . If $A\in\xi_n$ and $B\in\eta$ are such that $0<|B\setminus A|/n\le\epsilon_0$ , choose any vertex $v\in B\setminus A$ . Then there is a block $B^{\prime}\in\eta$ such that

\begin{equation*}L(G_n|\eta_{v,B^{\prime}})<L(G_n|\eta)-\kappa_0 n.\end{equation*}

Moreover, $L(G_n|\eta)\ge L(G_n|\tilde\eta)$ , where $\tilde\eta$ is a refinement of $\xi_n$ .

Theorem 3.2. For any $\epsilon>0$ and positive integer m, there is a constant $c_{\epsilon,m}>0$ such that the following holds with high probability:

\begin{equation*}\mbox{For all }\eta\mbox{ such that }|\eta|\le m\mbox{ and }\mathfrak{d}(\eta,\xi_n)>\epsilon, \quad L(G_n|\xi_n\vee\eta)< L(G_n|\eta)-c_{\epsilon,m}n^2.\end{equation*}

Theorem 3.3. With high probability, minimizing $L(G_n|\eta)$ identifies $\xi_n$ among all refinements $\eta$ of $\xi_n$ .

Corollary 3.1. For any $\sigma\in(0,\min_i\gamma_i)$ , $\xi_n$ is with high probability the unique minimizer of $L(G_n|\eta)$ for $\eta\in\mathcal{P}_{n,\sigma}$ , and

\begin{equation*} L(G_n|\eta)\ge L(G_n|\eta\vee\xi_n)\ge L(G_n|\xi_n).\end{equation*}

The corresponding results hold mutatis mutandis for Poissonian block models. The proofs are essentially identical. The differences, required by the replacement of binomial distributions by Poisson distributions, are indicated at the end of Appendix A.

4. Proofs of the theorems

Throughout this section, we consider a sequence $(G_n,\xi_n)$ of increasing versions of a fixed SBM based on a vector $(\gamma_1,\ldots,\gamma_k)$ of relative block sizes and a matrix $D=(d_{ij})_{i,j=1}^k$ of edge probabilities, as specified in Construction 2.1.

Proof of Theorem 3.1. Let $\epsilon,\delta>0$ be small numbers and m a positive integer to be specified. They can be chosen so that the following hold:

  • The value of $\epsilon$ is small enough so that $\eta$ is close to $\eta\vee\xi_n$ when $\mathfrak{d}(\eta,\xi_n)\le\epsilon$ :

    (4.1) \begin{equation}m\epsilon<\delta\min_i{\gamma_i}. \end{equation}
  • All the differing link probabilities are widely separated in $\delta$ units:

    (4.2) \begin{equation} \delta\le\frac{1}{m}\min\left\{{{|d_{ij_1}-d_{ij_2}|}} \,{:}\, {{i,j_1,j_2\in\left\{{{1,\ldots,k}}\right\},\ d_{ij_1}\not=d_{ij_2}}}\right\}. \end{equation}

For any $A_i,A_j\in\xi_n$ (possibly $i=j$ ), we then have

(4.3) \begin{equation} |d(A_i,A_j)-d_{ij}|+m\epsilon\le\delta \end{equation}

with high probability.

Let $\eta$ be a partition of $V_n$ such that $\mathfrak{d}(\eta,\xi_n)\le\epsilon$ . For $i=1,\ldots,k$ , denote by $B_i$ a block $B\in\eta$ such that $|B\cap A_i|$ is maximal. Then $|B_i\setminus A_i|/n\le\epsilon$ , and the blocks $B_1,\ldots,B_k$ are distinct for small $\epsilon$ . Let us number arbitrarily any remaining blocks of $\eta=\left\{{{B_1,\ldots,B_{|\eta|}}}\right\}$ . We can further assume that if $|\eta|>k$ , then for each $q>k$ there is a unique $j_q$ such that $|B_q\setminus A_{j_q}|/n\le\epsilon$ (for $q\le k$ , let $j_q=q$ ).

Assume now that $v\in B_s\setminus A_j$ , $|B_s\setminus A_j|/n\le\epsilon$ , and $v\in A_i\cap B_s$ with $i\not=j$ . We choose $B^{\prime}=B_i$ and compare the partitions $\eta$ and $\eta_{v,B_i}$ . Let $b_q=|B_q|$ , $q=1,\ldots,|\eta|$ , and $\tilde{B}_i=B_i\cup\left\{{{v}}\right\}$ , $\tilde{B}_s=B_s\setminus\left\{{{v}}\right\}$ . Then

(4.4) \begin{equation}\begin{split}&L_{45}(G_n|\eta)-L_{45}(G_n|\eta_{v,B_i})\\&=\binom{b_i}{2}H(d(B_i))+\binom{b_s}{2}H(d(B_s)) -\binom{b_i+1}{2}H(d(\tilde{B}_i))-\binom{b_s-1}{2}H(d(\tilde{B}_s))\\&\quad+\sum_{q\not=i,s}\Big[b_ib_qH(d(B_i,B_q))+b_sb_qH(d(B_s,B_q))\\&\quad\quad-(b_i+1)b_qH(d(\tilde{B}_i,B_q)) -(b_s-1)b_qH(d(\tilde{B}_s,B_q))\Big]\\&\quad+b_ib_sH(d(B_i,B_s))-(b_i+1)(b_s-1)H(d(\tilde{B}_i,\tilde{B}_s)).\end{split}\end{equation}

Consider first the sum over q. Leaving out the common factor $b_q$ , each term of the sum can be written as

\begin{align*}&\quad b_s\bigg[H(d(B_s,B_q))-\frac{b_s-1}{b_s}H(d(\tilde{B}_s,B_q))\bigg]\\&\quad-(b_i+1)\bigg[H(d(\tilde{B}_i,B_q))-\frac{b_i}{b_i+1}H(d(B_i,B_q))\bigg]\\&=b_s\bigg[H\left(\frac{b_s-1}{b_s}d(\tilde{B}_s,B_q) +\frac{1}{b_s}d(\{v\},B_q)\right)\\&\quad\quad-\frac{b_s-1}{b_s}H(d(\tilde{B}_s,B_q)) -\frac{1}{b_s}H(d(\{v\},B_q))\bigg]\\&\quad-(b_i+1)\bigg[H\left(\frac{b_i}{b_i+1}d(B_i,B_q) +\frac{1}{b_i+1}d(\{v\},B_q)\right)\\&\quad\quad-\frac{b_i}{b_i+1}H(d(B_i,B_q)) -\frac{1}{b_i+1}H(d(\{v\},B_q))\bigg]\end{align*}

(note the addition and subtraction of the term $H(d(\{v\},B_q))$ ). Using Lemmas A.2 and A.3 ( $\epsilon$ -regularity) and the relations (4.1), (4.2), and (4.3), the last expression can be made, with high probability, arbitrarily close to the number

\begin{equation*}D_B(d_{ij_q}\|d_{jj_q})-D_B(d_{ij_q}\|d_{ij_q})=D_B(d_{ij_q}\|d_{jj_q})\end{equation*}

(the function $D_B(\cdot\|\cdot)$ , the Kullback–Leibler divergence of Bernoulli distributions, is defined in (A.2)). Thus, the sum over q is, with high probability, close to

\begin{equation*} \sum_{q\not=i,s}b_qD_B(d_{ij_q}\|d_{jj_q}).\end{equation*}

Let us then turn to the remaining parts of (4.4), which refer to two codings of the internal links of $B_i\cup B_s$ . Similarly as above, we can add and subtract terms to transform these parts into

\begin{align*}&\binom{b_s}{2}\Bigg[H\left(\frac{\binom{b_s-1}{2}}{\binom{b_s}{2}} d(\tilde{B}_s)+\frac{b_s-1}{\binom{b_s}{2}}d(\{v\},\tilde{B}_s)\right)\\[2pt]&\quad\quad-\frac{\binom{b_s-1}{2}}{\binom{b_s}{2}}H(d(\tilde{B}_s)) -\frac{b_s-1}{\binom{b_s}{2}}H(d(\{v\},\tilde{B}_s))\Bigg]\\[2pt]&-\binom{b_i+1}{2}\Bigg[H\left(\frac{\binom{b_i}{2}}{\binom{b_i+1}{2}} d(B_i)+\frac{b_i}{\binom{b_i+1}{2}}d(\{v\},B_i)\right)\\[2pt]&\quad\quad-\frac{\binom{b_i}{2}}{\binom{b_i+1}{2}}H(d(B_i)) -\frac{b_i}{\binom{b_i+1}{2}}H(d(\{v\},B_i))\Bigg]\\[2pt]&+b_ib_s\Bigg[H\left(\frac{b_s-1}{b_s}d(B_i,\tilde{B}_s) +\frac{1}{b_s}d(\{v\},B_i)\right)\\[2pt]&\quad\quad-\frac{b_s-1}{b_s}H(d(B_i,\tilde{B}_s)) -\frac{1}{b_s}H(d(\{v\},B_i))\Bigg]\\[2pt]&-(b_i+1)(b_s-1)\Bigg[H\left(\frac{b_i}{b_i+1}d(B_i,\tilde{B}_s) +\frac{1}{b_i+1}d(\{v\},\tilde{B}_s)\right)\\[2pt]&\quad\quad-\frac{b_i}{b_i+1}H(d(B_i,\tilde{B}_s)) -\frac{1}{b_i+1}H(d(\{v\},\tilde{B}_s))\Bigg]\\[2pt]&\approx(b_s-1)D_B(d_{ij}\|d_{jj})-(b_i+1)D_B(d_{ii}\|d_{ii}) +b_iD_B(d_{ii}\|d_{ij})-(b_s-1)D_B(d_{ij}\|d_{ij})\\[2pt]&\approx b_sD_B(d_{ij}\|d_{jj})+b_iD_B(d_{ii}\|d_{ij}).\end{align*}

By the above analysis of (4.4), we have obtained

(4.5) \begin{align}\nonumber&L_{45}(G_n|\eta)-L_{45}(G_n|\eta_{v,B_i})\\&\approx \sum_{q\not=i,s}b_qD_B(d_{ij_q}\|d_{jj_q})+b_sD_B(d_{ij}\|d_{jj})+b_iD_B(d_{ii}\|d_{ij}),\end{align}

where the approximation can be made arbitrarily accurate by the choice of $\epsilon$ , m, and $\delta$ . By the irreducibility assumption (1.1), there is a block $A_{j_q}$ such that $d_{{j_q}i}\not=d_{{j_q}j}$ , with the possibility that $q\in\left\{{{i,s}}\right\}$ . It follows that at least one of the $D_B(x\|y)$ terms in (4.5) is positive. Let $\kappa^*=\min\left\{{{D_B(d_{ij_1}\|d_{ij_2})}} \,{:}\, {{d_{ij_1}\not=d_{ij_2}}}\right\}$ . Thus, with high probability,

\begin{equation*}L_{45}(G_n|\eta)-L_{45}(G_n|\eta_{v,B_i})>\frac{1}{2}(\kappa^*\min_i\gamma_i)n.\end{equation*}

Choose $\epsilon_0=\epsilon$ and $\kappa_0=\kappa^*\min_i\gamma_i/2$ . As regards the other code parts, it is easy to compute that

\begin{equation*}L_3(G_n|\eta)-L_3(G_n|\eta_{v,B_i})=n(H(\eta)-H(\eta_{v,B_i}))\to\log\frac{\gamma_j}{\gamma_i},\end{equation*}

and the changes of $L_1$ and $L_2$ when moving from $\eta$ to $\eta_{v,B_i}$ are negligible. This concludes the proof of the claim concerning moving a single vertex from $B_s$ to $B_i$ . The second claim follows from noting that the first claim continues to hold a fortiori when moving similarly all remaining vertices that prevent $\eta$ from being a refinement of $\xi_n$ .

Proof of Theorem 3.2. Fix an $\epsilon\in(0,1)$ and let $\eta$ be a partition of $V_n$ such that $\mathfrak{d}(\eta,\xi_n)>\epsilon$ . Proposition 3.1 yields that $L_{45}(G_n|\eta)\ge L_{45}(G_n|\eta\vee\xi_n)$ .

By assumption, there exists $B\in\eta$ such that $|B\setminus A|>\epsilon n$ for every $A\in\xi_n$ . It is easy to see that there must be (at least) two distinct blocks, say $A_i$ and $A_j$ , such that

(4.6) \begin{equation}\min\left\{{{|A_i\cap B|,|A_j\cap B|}}\right\}\ge\frac{\epsilon}{k-1}n.\end{equation}

By the irreducibility assumption (1.1), there is a block $A_q$ such that $d_{qi}\not=d_{qj}$ , with the possibility that $q\in\left\{{{i,j}}\right\}$ . Fix an arbitrary $\delta>0$ to be specified later. By $\epsilon$ -regularity (Claim 2 of Lemma A.3), with high probability, every choice of a partition $\eta$ satisfying $\exists B\in\eta\ \forall A\in\xi_n\ |B\setminus A|>\epsilon n$ results in some blocks $A_i$ , $A_j$ , $A_q$ with the above characteristics plus the regularity properties

(4.7) \begin{equation}|d(A_i\cap B,A_q\cap B^{\prime})-d_{iq}|\le\delta,\qquad |d(A_j\cap B,A_q\cap B^{\prime})-d_{jq}|\le\delta,\end{equation}

where $B^{\prime}$ denotes a block of $\eta$ that maximizes $|A_q\cap B^{\prime}|$ (note that because $|\eta|\le m$ , $|A_q\cap B^{\prime}|\ge|A_q|/m$ ). By the concavity of H,

\begin{align*}&|A_i\cap B||A_q\cap B^{\prime}|H(d(A_i\cap B,A_q\cap B^{\prime}))\\[3pt]&\quad+|A_j\cap B||A_q\cap B^{\prime}|H(d(A_j\cap B,A_q\cap B^{\prime}))\\[3pt]&=|(A_i\cup A_j)\cap B||A_q\cap B^{\prime}| \bigg[\frac{|A_i\cap B|}{|(A_i\cup A_j)\cap B|}H(d(A_i\cap B,A_q\cap B^{\prime}))\\[3pt]&\quad+\frac{|A_j\cap B|}{|(A_i\cup A_j)\cap B|} H(d(A_j\cap B,A_q\cap B^{\prime}))\bigg]\\[3pt] &<|(A_i\cup A_j)\cap B||A_q\cap B^{\prime}|H(d((A_i\cup A_j)\cap B,A_q\cap B^{\prime})).\end{align*}

In the case that $q\in\left\{{{i,j}}\right\}$ and $B=B^{\prime}$ , we obtain a similar equation where $|A_q\cap B|$ is partly replaced by $|A_q\cap B|-1$ . Because of (4.7) and (4.6), the difference between the sides of the inequality has a positive lower bound that holds with high probability. On the other hand, this difference is part of the overall concavity inequality (3.3) in the proof of Proposition 3.1. Thus, we can choose $c_{\epsilon,m}$ so that

\begin{equation*}L_{45}(G_n|\xi_n\vee\eta)< L_{45}(G_n|\eta)-c_{\epsilon,m}n^2\end{equation*}

with high probability. It remains to note that $L_{123}(G_n|\xi_n\vee\eta)-L_{123}(G_n|\eta)$ grows at most proportionally to n.

Let us now turn to preparing the proof of Theorem 3.3 concerning refinements of $\xi_n$ . For any partition $\eta\ge\xi_n$ , write

\begin{equation*}\Delta L(\eta)=L(G_n|\eta)-L(G_n|\xi_n)\end{equation*}

and, for individual components of the code,

\begin{equation*}\Delta L_{123}(\eta)\,{:\!=}\,L_{123}(G_n|\eta)-L_{123}(G_n|\xi_n),\quad \text{etc.}\end{equation*}

Lemma 4.1. For any partition $\eta$ of $V_n$ ,

\begin{equation*} L_{2}(\eta)=(|\eta|+1)^2\sum_{B\in\eta}\log(1+|B|)-|\eta|(|\eta|+1)\log2.\end{equation*}

Proof. The claim is proved by elaborating (3.1) as follows:

\begin{align*} L_2^{\prime}(\eta)&=2\sum_{B\in\eta}\log(1+|B|)-|\eta|\log2 +\frac12\sum_{B\in\eta} \sum_{B^{\prime}\in\eta\setminus\left\{{{B}}\right\}}(\!\log(1+|B|) +\log(1+|B^{\prime}|))\\ &=2\sum_{B\in\eta}\log(1+|B|)-|\eta|\log2 +\sum_{B\in\eta}\log(1+|B|) \sum_{B^{\prime}\in\eta\setminus\left\{{{B}}\right\}}1\\ &=2\sum_{B\in\eta}\log(1+|B|)-|\eta|\log2 +(|\eta|-1)\sum_{B\in\eta}\log(1+|B|)\\ &=(|\eta|+1)\sum_{B\in\eta}\log(1+|B|)-|\eta|\log2,\end{align*}

where the second line follows from symmetry. The claim follows by multiplying by $|\eta|+1$ , since $L_2(\eta)=(|\eta|+1)L_2^{\prime}(\eta)$ .
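The closed form of Lemma 4.1 is easy to check numerically against a direct evaluation of $L_2$ (a sketch of ours, with arbitrary illustrative block sizes):

import math

def L2_direct(sizes):
    # L2 of (3.2): (m + 1) times L2'(G|eta) of (3.1).
    m = len(sizes)
    return (m + 1) * (sum(math.log((1 + b) ** 2 / 2) for b in sizes)
                      + sum(math.log((1 + sizes[i]) * (1 + sizes[j]))
                            for i in range(m) for j in range(i + 1, m)))

def L2_closed(sizes):
    # Closed form of Lemma 4.1.
    m = len(sizes)
    return ((m + 1) ** 2 * sum(math.log(1 + b) for b in sizes)
            - m * (m + 1) * math.log(2))

sizes = [7, 5, 3, 1]
print(abs(L2_direct(sizes) - L2_closed(sizes)) < 1e-9)  # True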

Lemma 4.2. For $\eta\ge\xi_n$ ,

(4.8) \begin{align}\Delta L_2(\eta)&\ge(|\eta|-k)(|\eta|+k+1)\left(k\log n-c_n(\xi_n)-\log2\right),\end{align}
(4.9) \begin{align}\log \Delta L_{23}(\eta)&\le\log n+2\log|\eta|+2\log2,\end{align}

where

(4.10) \begin{equation}c_n(\xi_n)=-\sum_{A\in\xi_n}\log\frac{1+|A|}{n}.\end{equation}

Proof of (4.8). Note first that it follows from the subadditivity of $\log(1+x)$ for $x\ge1$ that for any $\eta\ge\xi_n$ we have

(4.11) \begin{equation}\sum_{B\in\eta}\log(1+|B|)\ge\sum_{A\in\xi_n}\log(1+|A|).\end{equation}

Using Lemma 4.1 and (4.11), we have

\begin{align*} \Delta L_2(\eta) =&(|\eta|+1)^2\sum_{B\in\eta}\log(1+|B|)-|\eta|(|\eta|+1)\log2\\[3pt] &-(k+1)^2\sum_{A\in\xi_n}\log(1+|A|)+k(k+1)\log2\\[3pt] \ge&(|\eta|-k)(|\eta|+k+2)\sum_{A\in\xi_n}\log(1+|A|) -(|\eta|-k)(|\eta|+k+1)\log2\\[3pt] \ge&(|\eta|-k)(|\eta|+k+1)\left(\sum_{A\in\xi_n}\log(1+|A|)-\log2\right)\\[3pt] =&(|\eta|-k)(|\eta|+k+1)\left(\sum_{A\in\xi_n} \left(\log n+\log\frac{1+|A|}{n}\right)-\log2\right)\\[3pt] \ge&(|\eta|-k)(|\eta|+k+1)(k\log n-c_n(\xi_n)-\log2).\end{align*}

Proof of (4.9). We now use (4.11) in the other direction, and note that the concavity of $\log x$ implies

\begin{align*} \sum_{B\in\eta}\log(1+|B|) =&|\eta|\sum_{B\in\eta}\frac{1}{|\eta|}\log(1+|B|)\\[3pt] \le&|\eta|\log\left(\sum_{B\in\eta}\frac{1+|B|}{|\eta|}\right)\\[3pt] =&|\eta|\log\left(1+\frac{n}{|\eta|}\right)\\[3pt] \le& \log e^n=n.\end{align*}

We then have

\begin{align*} &\log\Delta L_{23}(\eta)\\[3pt] &\le\log\bigg((|\eta|-k)(|\eta|+k+2)\sum_{B\in\eta}\log(1+|B|) -(|\eta|-k)(|\eta|+1)\log2+nH(\eta|\xi_n)\bigg)\\[3pt] &\le\log\Big((|\eta|-k)(|\eta|+k+2)n+n(H(\eta)-H(\xi_n))\Big)\\[3pt] &\le\log\left(n\left[(|\eta|-k)(|\eta|+k+2)+\log|\eta|\right]\right).\end{align*}

Proposition 4.1. For any refinement $\eta$ of $\xi_n$ , we have

(4.12) \begin{equation}-\Delta L_{45}(\eta)\buildrel{(st)}\over\le\sum_{j=1}^{M(|\eta|)-M(k)}(\!\log 2+Y_j),\end{equation}

where (st) refers to stochastic order, the $Y_j$ are independent Exp(1) random variables, and

(4.13) \begin{equation}M(x)=\frac{x(x+1)}{2}.\end{equation}

Proof. Here we apply results presented in Appendix A. Denote by $\eta\cap A_i$ the subset of $\eta$ whose members are subsets of the block $A_i$ of $\xi_n$ . Writing the edge code lengths of the coarser and finer partition similarly as in (3.3), taking the difference, and using (A.5), we obtain

(4.14) \begin{equation}\begin{split}&\quad -\Delta L_{45}(\eta)\\&=\sum_{i=1}^k\Bigg(\sum_{B\in\eta\cap A_i}\binom{|B|}{2}D_B(d(B)\|d_{ii}) +\frac12\sum_{\begin{array}{c}\scriptstyle B,B^{\prime}\in\eta\cap A_i\\\scriptstyle B\not=B^{\prime} \end{array}}|B||B^{\prime}|D_B(d(B,B^{\prime})\|d_{ii})\\&\qquad-\binom{|A_i|}{2}D_B(d(A_i)\|d_{ii})\Bigg)\\&\quad +\sum_{i<j}\Bigg(\sum_{B\in\eta\cap A_i}\sum_{B^{\prime}\in\eta\cap A_j}|B||B^{\prime}| D_B(d(B,B^{\prime})\|d_{ij})-|A_i||A_j|D_B(d(A_i,A_j)\|d_{ij})\Bigg).\end{split}\end{equation}

Applying Proposition A.4 to each term of both outer sums now yields the claim, because

\begin{align*}\sum_{i=1}^k(M(|\eta\cap A_i|)-1)+\sum_{i<j}(|\eta\cap A_i||\eta\cap A_j|-1)=M(|\eta|)-M(k).\end{align*}

It is rather surprising that the stochastic bound (4.12) depends only on the number of blocks in $\eta$ —not on their relative sizes, nor on the overall model size n.

Proof of Theorem 3.3. Let $\eta$ be a refinement of $\xi_n$ and recall Proposition 3.1: refining the partition with respect to $\xi_n$ yields a gain, based on the concavity of H, in the data code part $L_{45}$ , but additional costs in the model code part $L_{123}$ . We have to relate these to each other. Since $L_1(G_n|\eta)$ is just a small addition to $L_2(G_n|\eta)$ (see Lemma 4.1), we can ignore it and focus on $L_2(G_n|\eta)$ and $L_3(G_n|\eta)$ as regards the model part.

The refinement gain in code part $L_{45}$ was bounded in Proposition 4.1 stochastically by Exp(1) random variables. The rate function (see the beginning of Appendix A) of the distribution Exp(1) is

\begin{equation*} I_E(x)=x-1-\log x.\end{equation*}

Proposition 4.1 yields, using (A.1) and Proposition A.1, that for $y>\log 2$

\begin{equation*} \begin{split}& \mathbb{P}\left({-\Delta L_{45}(\eta)>y}\right)\\[3pt]&\le\mathbb{P}\left({\sum_{j=1}^{M(|\eta|)-M(k)}Y_i>y-(M(|\eta|)-M(k))\log 2}\right)\\[3pt]&\le\exp\left(\!-(M(|\eta|)-M(k)) I_E\left(\frac{y-(M(|\eta|)-M(k))\log 2}{M(|\eta|)-M(k)}\right)\right)\\[3pt]&\le\exp\big(\!-y+(M(|\eta|)-M(k))\log y\big) \left(\frac{2e}{M(|\eta|)-M(k)}\right)^{M(|\eta|)-M(k)}. \end{split}\end{equation*}

For two refinements of $\xi_n$ , write $\eta^{\prime}\sim\eta$ if the block sizes of $\eta'$ in each $A_i$ are identical to those of $\eta$ . The number of refinements $\eta'$ of $\xi_n$ with $\eta'\sim\eta$ is bounded above by

\begin{equation*}\exp\left(\sum_{A\in\xi_n}|A|H(\eta\cap A)\right)=e^{nH(\eta|\xi_n)},\end{equation*}

where $H(\eta\cap A)$ denotes the entropy of the partition of A induced by $\eta$ and $H(\eta|\xi_n)$ the conditional entropy of $\eta$ given $\xi_n$ . On the other hand, we have

\begin{equation*}\Delta L_3(\eta)=nH(\eta)-nH(\xi_n)=nH(\eta|\xi_n).\end{equation*}

Let $m=|\eta|$ . With $y=\Delta L_{23}^{\prime}(\eta)$ , the union bound and Lemma 4.2 yield

(4.15) \begin{align}\nonumber&\mathbb{P}\left({\inf_{\eta'\sim\eta}\Delta L(\eta)<0}\right)\\[2pt]\nonumber\le&\exp\left(nH(\eta|\xi_n)-\Delta L_3(\eta)-\Delta L_2^{\prime}(\eta) +\frac12(m-k)(m+k+1)\log\Delta L_{23}^{\prime}(\eta)\right)\\[2pt]\nonumber &\cdot\left(\frac{2e}{M(m)-M(k)}\right)^{M(m)-M(k)}\\[2pt]\nonumber \le&\exp\Big(\!-(m-k)(m+k+1)\left(k\log n-c_n(\xi_n)-\log2\right)\\[2pt]\nonumber &+\frac12(m-k)(m+k+1)\log\left(n\left[(m-k)(m+k+2)+\log m\right]\right)\\[2pt]\nonumber &-\frac12(m-k)(m+k+1)\big(\log((m-k)(m+k+1))-2\log2-1\big)\Big)\\[2pt] =&\exp\Big\{\!-(m-k)(m+k+1)\left(k-\frac12\right)\log n\\[2pt]\nonumber &+\frac12(m-k)(m+k+1)\rho_1(m)\Big\},\end{align}

where

(4.16) \begin{equation}\rho_1(m)=\log\left(\frac{\displaystyle(m-k)(m+k+2)+\log m}{(m-k)(m+k+1)-2c_n(\xi_n)-4\log2-1}\right),\end{equation}

and $\rho_1(m)=o(m)$ as $m\to\infty$ .

Consider all different block size sequences of partitions $\eta>\xi$ such that $|\eta|=m$ :

\begin{align*} &\kappa_{11}\ge\kappa_{12}\ge\cdots\ge\kappa_{1\sigma_1}; \ \ldots;\ \kappa_{k1}\ge\cdots\ge\kappa_{k\sigma_k},\\[2pt] &\sum_{j=1}^{\sigma_i}\kappa_{ij}=|A_i|,\ i=1,\ldots,k; \quad \sum_{i=1}^k\sigma_{i}=m.\end{align*}

A rough upper bound of their number is

(4.17) \begin{equation} k^{m-k}n^{m-k} =\exp\Big((m-k)\log n+(m-k)\log k\Big).\end{equation}

Indeed, choose first the number of subblocks in each block $A_i$ of $\xi_n$ , then the sizes of the subblocks, each in decreasing order. Note that within each block of $\xi_n$ , the size of its smallest subblock is determined by the others; i.e., k of the m block sizes ‘come for free’.

Using (4.15), (4.17), and the union bound, we obtain for any integer $m>k$

(4.18) \begin{align}\nonumber&\mathbb{P}\left({\inf_{\eta>\xi,\ |\eta|=m}\Delta L(\eta)<0}\right)\\[2pt] \nonumber\le&\exp\Big\{(m-k)\log n+(m-k)\log k-(m-k)(m+k+1)\left(k-\frac12\right)\log n\\[2pt] \nonumber &+\frac12(m-k)(m+k+1)\rho_1(m)\Big\}\\[2pt] &=\exp\Big\{\!-(m-k)\left((m+k+1)\left(k-\frac12\right)-1\right)\log n+m^2\rho_2(m)\Big\},\end{align}

where

\begin{equation*} \rho_2(m)=\frac{1}{m^2}\left(\frac12(m-k)(m+k+1)\rho_1(m)+(m-k)\log k\right),\end{equation*}

and $\rho_2(m)=o(m)$ . Adding and subtracting $m\log n$ and using the fact that $1\le k<m$ , (4.18) can be further bounded by

\begin{align*}&\exp\left(\!-m\log n-\left[(m-k)\left(\frac{m}{2}-\frac12\right)-m\right]\log m+m^2\rho_2(m)\right)\\[2pt] =&\exp\left(\!-m\log n-\left(\frac14\log m-\rho_2(m)\right)m^2-\frac{m}{2}\left(\frac{m}{2}-k-3+\frac{k}{m}\right)\right).\end{align*}

Now define

\begin{equation*}m^*\,{:\!=}\,\inf\left\{{{m\ge k+1}} \,{:}\, {{\rho_2(m')\le\frac{1}{4}\log m'\quad \forall m'\ge m}}\right\}\vee (2k+6).\end{equation*}

Let $n\ge m^*$ . Then we have

\begin{equation*}\mathbb{P}\left({\inf_{\eta>\xi,\ |\eta|=m}\Delta L(\eta)<0}\right)\le\left\{\begin{array}{ll}\displaystyle n^{-k}\exp\left(m^2\rho_2(m)+\frac{m}{2}+4\right),&\quad m<m^*,\\e^{-m\log n},&\quad m\ge m^*.\end{array}\right.\end{equation*}

Finally,

\begin{align*}\mathbb{P}\left({\inf_{\eta>\xi}\Delta L(\eta)<0}\right)\le&\sum_{m=k+1}^{n}\mathbb{P}\left({\inf_{\eta>\xi,\ |\eta|=m}\Delta L(\eta)<0}\right)\\[3pt] \le&n^{-k}\sum_{m=k+1}^{m^*-1}e^{m^2\rho_2(m)+m/2+4}+\sum_{m=m^*}^{n}e^{-m\log n}\\[3pt] \le&\textit{const}(k)\cdot n^{-k}+\frac{n^{-m^*}}{1-n^{-1}}\to 0,\quad\mbox{as }n\to\infty.\end{align*}

5. Discussion

The results of this paper provide rather clear insight into the description length optimization landscape for SBMs. When a partition $\eta$ deviates only a little from the true partition $\xi_n$ and $|\eta|=k=|\xi_n|$ , minimization of $L_{45}(\eta)$ (likelihood maximization) in a single round by moving each vertex to its most appropriate block leads to exact identification of $\xi_n$ (Theorem 3.1). $L_{45}$ is essentially the negative log-likelihood function. With $|\eta|>k$ , however, minimization of $L_{45}$ may lead instead to some refinement of $\xi_n$ , with a value $L_{45}(\eta)$ that is strictly smaller than $L_{45}(\xi_n)$ . We have quantified this additional gain, which is of order n, and shown that it is outweighed by the additional model complexity, measured by functions suggested by the MDL principle. MDL stands out as a theoretically solid way to weigh likelihood gain against model complexity.

In the earlier paper [Reference Reittu18], we and our coauthors analyzed several questions on regular decomposition, i.e., our MDL-based methodology, focusing on the case in which k is known. It was found that not a single vertex can be misplaced without a substantial penalty in $L_{45}$ , proportional to n. In this sense, the global minimum is well separated from all local minima. However, there is no easy way of testing whether a minimum found by a greedy algorithm is the global one. This problem is acute with real-life data, in which there is no SBM-type ground truth. It is an interesting theoretical challenge to determine the probability of finding the global optimum of MDL using a greedy minimization algorithm.
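For concreteness, a minimal greedy local search of the kind discussed above (a sketch of ours, not the algorithm of [Reference Reittu18]; it reuses the description_length function from Section 3 and moves single vertices as long as the objective decreases):

def greedy_minimize(A, blocks, max_rounds=100):
    # Repeatedly move single vertices between blocks whenever the move
    # decreases L(G|eta); stops at a local minimum, which need not be global.
    best = description_length(A, blocks)
    for _ in range(max_rounds):
        improved = False
        for v, i in [(v, i) for i, B in enumerate(blocks) for v in list(B)]:
            if len(blocks[i]) == 1:
                continue  # keep every block nonempty
            for j in range(len(blocks)):
                if j == i:
                    continue
                blocks[i].remove(v)
                blocks[j].append(v)
                cand = description_length(A, blocks)
                if cand < best - 1e-12:
                    best, i, improved = cand, j, True
                else:  # undo the move
                    blocks[j].remove(v)
                    blocks[i].append(v)
        if not improved:
            break
    return blocks, best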

The paper [Reference Reittu18] proposed a way to speed up the $L_{45}$ minimization when k is known and $n=|V|$ arbitrarily large. The idea is that a moderately sized random sample from V reveals the block structure, which can be identified by some brute force method, in our case by repeating the greedy algorithm many times with random initial partitions. The majority of the vertices can then be placed into their blocks with very high accuracy.

In [Reference Reittu18], we also compared regular decomposition by simulations with a spectral clustering method. The results suggested that the spectral clustering method is sensitive to the relative block sizes and requires them to be well balanced, whereas regular decomposition is not sensitive to the relative block sizes.

The additional insights provided by the present paper may be useful in designing algorithms where k is unknown and likelihood maximization is combined with some different kinds of operations. One simple operation worth studying is that of merging blocks according to the MDL criterion.

Practical implementation of MDL often requires a bit of art in addition to science. In the present work, we could not prove Theorem 3.3 completely without using the slightly bigger function $L_2$ instead of the better motivated $L^{\prime}_2$ , although MDL is theoretically able to identify SBM-like models exactly. Another remaining technical question is whether the restriction to partitions with nonnegligible relative block sizes is really necessary for $L(\eta)\ge L(\eta\vee\xi_n)$ .

Appendix A. Information-theoretic preliminaries

Consider a random variable X with moment-generating function

\begin{equation*}\phi_X(\beta)=\mathbb{E}\,{e^{\beta X}},\end{equation*}

and let $\mathcal{D}_X=\left\{{{\beta}} \,{:}\, {{\phi_X(\beta)<\infty}}\right\}$ . We restrict our attention to distributions of X for which $\mathcal{D}_X$ is an open (finite or infinite) interval. The corresponding rate function is

\begin{equation*}I_X(x)=-\inf_{\beta\in\mathcal{D}_X}(\!\log\phi_X(\beta)-\beta x);\end{equation*}

this is a strictly convex function with minimum 0 at $\mathbb{E}\,{X}$ and value $+\infty$ outside the range of X. For the mean $\bar{X}_n=\frac{1}{n}\sum_1^nX_i$ of independent and identically distributed copies of X, we have

(A.1) \begin{equation}I_{\bar{X}_n}(x)=nI_X(x).\end{equation}

The following inequalities are known as the Chernoff bounds for the random variable X.

Proposition A.1. We have

\begin{align*} \mathbb{P}\left({X\le x}\right)&\le e^{-I_X(x)}\quad\mbox{for }x\le\mathbb{E}\,{X},\\ \mathbb{P}\left({X\ge x}\right)&\le e^{-I_X(x)}\quad\mbox{for }x\ge\mathbb{E}\,{X}.\end{align*}

We make frequent use of the following consequence of the Chernoff bounds. Let the convex hull of the support of X be the closure of $(x^-,x^+)$ , and define

\begin{equation*} a^-\,{:\!=}\,\lim_{x\downarrow x^-}I_X(x)=-\log\mathbb{P}\left({X=x^-}\right),\qquad a^+\,{:\!=}\,\lim_{x\uparrow x^+}I_X(x)=-\log\mathbb{P}\left({X=x^+}\right)\end{equation*}

(note that $a^-<\infty$ if and only if the distribution of X has an atom at $x^-$ , and similarly for $a^+$ ). We can now write

\begin{equation*} \begin{split}I_X(x)&=I^-_X(x)1_{\{{(x^-,\mathbb{E}\,{X}]}\}}(x)+I^+_X(x)1_{\{{[\mathbb{E}\,{X},x^+)}\}}(x)\\ &\quad+a^-\cdot1_{\{{x^-}\}}(x)+a^+\cdot1_{\{{x^+}\}}(x) +\infty\cdot1_{\{{\mathbb{R}\setminus[x^-,x^+]}\}}(x). \end{split}\end{equation*}

With the assumptions made above, the functions $I^-_X(x)$ and $I^+_X(x)$ are, respectively, bijections from $(x^-,\mathbb{E}\,{X}]$ and $[\mathbb{E}\,{X},x^+)$ to $[0,a^-)$ and $[0,a^+)$ .

Lemma A.1. We have

\begin{align*} I_X(X)\buildrel{(st)}\over\le \log 2 +Y,\end{align*}

where $\buildrel{(st)}\over\le$ denotes stochastic order and Y is a random variable with distribution Exp(1).

Proof. For any $z\ge0$ , we have

\begin{align*}\mathbb{P}\left({I_X(X)>z}\right)&=\mathbb{P}\left({1_{\{{X<\mathbb{E}\,{X}}\}}I_X(X)>z\mbox{ or }1_{\{{X>\mathbb{E}\,{X}}\}}I_X(X)>z}\right)\\[3pt]&=\mathbb{P}\left({1_{\{{X<\mathbb{E}\,{X}}\}}I^-_X(X)>z}\right)+\mathbb{P}\left({1_{\{{X>\mathbb{E}\,{X}}\}}I^+_X(X)>z}\right)\\[3pt]&=1_{\{{z<a^-}\}}\mathbb{P}\left({X<I^{-(\!-1)}_X(z)}\right)+1_{\{{z<a^+}\}}\mathbb{P}\left({X>I^{+(\!-1)}_X(z)}\right)\\[3pt]&\le1_{\{{z<a^-}\}}\exp(\!-I_X(I^{-(\!-1)}_X(z)))+1_{\{{z<a^+}\}}\exp(\!-I_X(I^{+(\!-1)}_X(z)))\\[3pt]&\le2e^{-z}\\[3pt]&=e^{-(z-\log 2)},\end{align*}

where the first inequality comes from Proposition A.1. Thus,

\begin{align*} \mathbb{P}\left({I_X(X)>z}\right)\le\min\left\{1,\,e^{-(z-\log 2)}\right\}=e^{-(z-\log 2)^+} =\mathbb{P}\left({\log 2+Y>z}\right).\end{align*}
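The domination in Lemma A.1 is easy to watch empirically. The sketch below (an illustrative editorial addition; n, p, the seed, and the grid of z values are arbitrary) samples $I_X(X)$ for a binomial X and compares its empirical survival function with the bound $\min\{1,\,2e^{-z}\}$.

\begin{verbatim}
import math, random

# Empirical check of Lemma A.1 for X ~ Bin(n, p): by (A.1) and (A.2),
# I_X(X) = n * D_B(X/n || p), and P(I_X(X) > z) <= min(1, 2*exp(-z)).

def D_B(x, p):
    if x == 0.0: return -math.log(1 - p)   # atom at the lower endpoint
    if x == 1.0: return -math.log(p)       # atom at the upper endpoint
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

random.seed(1)
n, p, reps = 40, 0.25, 20_000
samples = [n * D_B(sum(random.random() < p for _ in range(n)) / n, p)
           for _ in range(reps)]

for z in (0.5, 1.0, 2.0, 4.0):
    emp = sum(s > z for s in samples) / reps
    print(z, emp, min(1.0, 2 * math.exp(-z)))  # empirical column <= bound
\end{verbatim}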

In the case that X has the Bernoulli(p) distribution, we have

(A.2) \begin{equation}I_X(x)=D_B(x\|p)\,{:\!=}\,x\log\frac{x}{p}+(1-x)\log\frac{1-x}{1-p}.\end{equation}

Lemma A.2. The first and second derivatives of the functions H(x) and $x\mapsto D_B(x\|p)$ are

(A.3) \begin{align}H'(x)&=\log\frac{1-x}{x},&\quad H''(x)&=-\frac{1}{x(1-x)},\end{align}
(A.4) \begin{align}D_B^{\prime}(x\|p)&=H'(p)-H'(x),&\quad D_B^{\prime\prime}(x\|p)&=\frac{1}{x(1-x)}.\end{align}

Since $D_B(p\|p)=D_B^{\prime}(p\|p)=0$ and $D_B^{\prime\prime}(p\|p)=-H''(p)$ , we also have

(A.5) \begin{equation}H(q)-(H(p)+H'(p)(q-p))=-D_B(q\|p),\end{equation}

and

\begin{align*}&\lim_{n\to\infty}n\left[H\left(\left(1-\frac{1}{n}\right)p+\frac{1}{n}q\right)-\left(\left(1-\frac{1}{n}\right)H(p)+\frac{1}{n}H(q)\right)\right]\\[3pt]&=(q-p)H'(p)-(H(q)-H(p))\\[3pt] &=D_B(q\|p).\end{align*}
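Both (A.5) and the limit above are straightforward to verify numerically; the following spot-check (an editorial addition, with arbitrary sample points) confirms (A.5):

\begin{verbatim}
import math

# Spot-check of (A.5): H(q) - (H(p) + H'(p)(q - p)) = -D_B(q||p).

def H(x):  return -x * math.log(x) - (1 - x) * math.log(1 - x)
def dH(x): return math.log((1 - x) / x)
def D_B(x, p):
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

for p, q in ((0.2, 0.7), (0.5, 0.1), (0.35, 0.35)):
    assert abs(H(q) - (H(p) + dH(p) * (q - p)) + D_B(q, p)) < 1e-12
print("identity (A.5) holds at the sample points")
\end{verbatim}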

Proposition A.2. Let $n\ge2$ , and let $X_1$ and $X_2$ be independent random variables with distributions Bin(m,p) and Bin $(n-m,p)$ , respectively. Let $X_{12}=X_1+X_2$ and $\bar{X}_1=X_1/m$ , $\bar{X}_2=X_2/(n-m)$ , $\bar{X}_{12}=X_{12}/n$ . Then the following identities hold:

(A.6) \begin{align}&mD_B(\bar{X}_1\|p)+(n-m)D_B(\bar{X}_2\|p)-nD_B(\bar{X}_{12}\|p)\end{align}
(A.7) \begin{align}&=X_{12}D_B\left(\left.\frac{X_1}{X_{12}}\right\|\frac{m}{n}\right) +(n-X_{12})D_B\left(\left.\frac{m-X_1}{n-X_{12}}\right\|\frac{m}{n}\right)\end{align}
(A.8) \begin{align}&=mD_B(\bar{X}_1\|\bar{X}_{12}) +(n-m)D_B\left(\left.\frac{X_{12}-X_1}{n-m}\right\|\bar{X}_{12}\right).\end{align}

The identities in Proposition A.2 are obtained by writing out the full expression for (A.6) and rearranging the $\log$ terms in two other ways. The formulae (A.7) and (A.8) are written without $X_2$, expressing the fact that any two of the three random variables $X_1$, $X_2$, and $X_{12}$ contain the same information as the full triple. Note that (A.7) and (A.8) do not contain p; this reflects the fact that the conditional distribution of $X_1$ given $X_{12}$, known as the hypergeometric distribution, does not depend on p. The equality of (A.6) and (A.8) can be read as saying that the two positive terms of (A.6) carry exactly as much information about p as the negative term subtracts. Moreover, (A.8) has the additional interpretation of giving the rate function of the hypergeometric distribution, as stated in the following proposition.
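The rearrangement can also be checked mechanically. The sketch below (an editorial illustration; the realization $X_1=4$, $X_2=5$ and the parameters are arbitrary) evaluates (A.6)–(A.8) at one point and confirms that they agree:

\begin{verbatim}
import math

# Numerical spot-check of the identities (A.6)-(A.8) at one realization.

def D_B(x, p):
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

n, m, p = 20, 8, 0.4
X1, X2 = 4, 5
X12 = X1 + X2

a6 = (m * D_B(X1 / m, p) + (n - m) * D_B(X2 / (n - m), p)
      - n * D_B(X12 / n, p))
a7 = (X12 * D_B(X1 / X12, m / n)
      + (n - X12) * D_B((m - X1) / (n - X12), m / n))
a8 = (m * D_B(X1 / m, X12 / n)
      + (n - m) * D_B((X12 - X1) / (n - m), X12 / n))
assert abs(a6 - a7) < 1e-12 and abs(a7 - a8) < 1e-12
print(a6, a7, a8)
\end{verbatim}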

Proposition A.3. Let X have the distribution Hypergeometric(n,m,z), i.e. the conditional distribution of $X_1$ of Proposition A.2 given that $X_{12}=z$ . The rate function of X is

(A.9) \begin{equation}I_X(x)=mD_B\left(\left.\frac{x}{m}\right\|\frac{z}{n}\right) +(n-m)D_B\left(\left.\frac{z-x}{n-m}\right\|\frac{z}{n}\right).\end{equation}

Proof. Define the bivariate moment-generating function of $(X_1,X_2)$ as

\begin{equation*}\phi(\alpha,\beta)=\mathbb{E}\,{e^{\alpha X_1+\beta X_2}}.\end{equation*}

Write

\begin{equation*}\mathbb{P}\left[\left.{X_1=m}\,\right|{X_{12}=z}\right]=\frac{\mathbb{P}\left({X_1=m,\ X_2=z-m}\right)}{\mathbb{P}\left({X_{12}=z}\right)},\end{equation*}

and note that, since this conditional distribution does not depend on p, we can assume $p=z/n$. The claim can now be derived using $\phi(\alpha,\beta)$ in the same manner as in the well-known proof of the one-dimensional Chernoff bound.

Proposition A.4. Let $k\ge2$ , and let $X_i$ , $i\in\left\{{{1,\ldots,k}}\right\}$ , be independent random variables with distributions Bin $(n_i,p)$ , respectively. Let $n=\sum_{i=1}^kn_i$ , $X_{1\ldots j}=\sum_{i=1}^jX_i$ , $\bar{X}_i=X_i/n_i$ , and $\bar{X}_{1\dots j}=X_{1\dots j}/\sum_{i=1}^jn_i$ . Then

(A.10) \begin{align}\sum_{i=1}^kn_iD_B(\bar{X}_i\|p)-nD_B(\bar{X}_{1\ldots k}\|p)&\buildrel{(st)}\over\le\sum_{i=1}^{k-1}\left(\log2+Y_i\right),\end{align}

where $Y_1,\ldots,Y_{k-1}$ are independent Exp(1) random variables.

Proof. For $k=2$ , the left-hand side of (A.10) equals

(A.11) \begin{equation}n_1D_B(\bar{X}_1\|\bar{X}_{12})+n_2D_B\left(\left.\frac{X_{12}-X_1}{n_2}\right\|\bar{X}_{12}\right)\end{equation}

by Proposition A.2. For any $N\in\left\{{{0,\ldots,n}}\right\}$, consider the conditional distribution of (A.11), given that $X_{12}=N$. By Proposition A.3, this is the distribution of the rate function of Hypergeometric$(n,n_1,N)$, evaluated at a random variable $X_1$ with that same distribution. The claim now follows by Lemma A.1, because the stochastic upper bound does not depend on N, i.e. on the value of $X_{12}$.

For $k>2$ we proceed by induction. Assume that the claim holds for $k-1$ and write

\begin{align*}&\sum_{i=1}^kn_iD_B(\bar{X}_i\|p)-nD_B(\bar{X}_{1\ldots k}\|p)\\[2pt]&=\sum_{i=1}^{k-1}n_iD_B(\bar{X}_i\|p)-(n-n_k)D_B(\bar{X}_{1\ldots(k-1)}\|p)\\[2pt]&\quad+n_kD_B(\bar{X}_k\|p)+(n-n_k)D_B(\bar{X}_{1\ldots(k-1)}\|p)-nD_B(\bar{X}_{1\ldots k}\|p).\end{align*}

By the induction hypothesis, the first line of the last expression is stochastically bounded by $\sum_{i=1}^{k-2}(\log2+Y_i)$, irrespective of the value of $X_{1\ldots(k-1)}$. Similarly, the second line is stochastically bounded by $\log2+Y_{k-1}$, where $Y_{k-1}\sim\mbox{Exp}(1)$, irrespective of the value of $X_{1\ldots k}$. It remains to note that $Y_{k-1}$ can be chosen to be independent of $(Y_1,\ldots,Y_{k-2})$, because $X_k$ is independent of $(X_1,\ldots,X_{k-1})$, and of $\bar{X}_{1\ldots(k-1)}$ in particular.
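A Monte Carlo illustration of Proposition A.4 (an editorial sketch under the stated Bin$(n_i,p)$ assumptions; the block sizes, p, seed, and z values are arbitrary) compares the empirical survival function of the left-hand side of (A.10) for $k=3$ with that of $\sum_{i=1}^{2}(\log2+Y_i)$:

\begin{verbatim}
import math, random

# Empirical check of (A.10) for k = 3.

def D_B(x, p):
    if x == 0.0: return -math.log(1 - p)
    if x == 1.0: return -math.log(p)
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def erlang2_sf(t):
    # survival function of Y_1 + Y_2 with Y_i ~ Exp(1): exp(-t)*(1 + t)
    return 1.0 if t <= 0 else math.exp(-t) * (1 + t)

random.seed(7)
ns, p, reps = (15, 25, 40), 0.3, 20_000
n = sum(ns)
lhs = []
for _ in range(reps):
    xs = [sum(random.random() < p for _ in range(ni)) for ni in ns]
    lhs.append(sum(ni * D_B(x / ni, p) for ni, x in zip(ns, xs))
               - n * D_B(sum(xs) / n, p))

for z in (1.0, 2.0, 4.0, 6.0):
    emp = sum(v > z for v in lhs) / reps
    print(z, emp, erlang2_sf(z - 2 * math.log(2)))  # empirical <= bound
\end{verbatim}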

Lemma A.3. Consider the sequence $(G_n,\xi_n)$ of SBMs as in Subsection 3.2. Then the following hold:

  1. For any blocks $A_i$ and $A_j$ such that $d_{ij}\not\in\left\{{{0,1}}\right\}$, it holds for an arbitrary $\epsilon>0$ with high probability that

    \begin{equation*}\min_{v\in A_i}\frac{|e(\{v\},A_j)|}{|A_j|}\ge d_{ij}-n^{-\frac12+\epsilon},\qquad\max_{v\in A_i}\frac{|e(\{v\},A_j)|}{|A_j|}\le d_{ij}+n^{-\frac12+\epsilon}.\end{equation*}
  2. For any $\epsilon>0$, all block pairs of the partition $\xi_n$ are $\epsilon$-regular (Definition 2.1) with high probability.

Proof. Claim 1: By Proposition A.1 and (A.4),

\begin{align*}&\mathbb{P}\left({\max_{v\in A_i}\frac{|e(\{v\},A_j)|}{|A_j|}>d_{ij}+h}\right)\\[3pt]&\le\sum_{v\in A_i}\mathbb{P}\left({\frac{|e(\{v\},A_j)|}{|A_j|}>d_{ij}+h}\right)\\[3pt]&\le|A_i|\exp\left(\!-|A_j|D_B(d_{ij}+h\,\|\,d_{ij})\right)\\[3pt] &=|A_i|\exp\left(\!-|A_j|\left(\frac{h^2}{2d_{ij}(1-d_{ij})} +\frac{h^3}{6}D_B^{\prime\prime\prime}(z\|d_{ij})\right)\right).\end{align*}

Here z lies between $d_{ij}$ and $d_{ij}+h$, by Taylor's theorem with the Lagrange form of the remainder (recall from Lemma A.2 that $D_B(d_{ij}\|d_{ij})=D_B^{\prime}(d_{ij}\|d_{ij})=0$). The last expression converges to zero with the choice $h=n^{-\frac12+\epsilon}$ (recall that $|A_i|\sim n\gamma_i$ and $|A_j|\sim n\gamma_j$), which proves the claim for the maximum. The case of the minimum is symmetric.

Claim 2: Fix $\epsilon>0$ and consider any i, j. Let $U_1\subseteq A_i$ and $U_2\subseteq A_j$ be such that $|U_1|\ge\epsilon|A_i|$ and $|U_2|\ge\epsilon|A_j|$ . By Proposition A.1,

\begin{equation*}\mathbb{P}\left({|d(U_1,U_2)-d_{ij}|>\epsilon}\right)\le e^{-|U_1||U_2|D_B(d_{ij}+\epsilon\|d_{ij})}+e^{-|U_1||U_2|D_B(d_{ij}-\epsilon\|d_{ij})}.\end{equation*}

Let $\iota(\epsilon)=\min\left\{{{D_B(d_{ij}+\epsilon\|d_{ij}),D_B(d_{ij}-\epsilon\|d_{ij})}}\right\}$ . The union bound yields

\begin{align*}&\mathbb{P}\left({\exists\, U_1\subseteq A_i,\ U_2\subseteq A_j\,{:}\, |U_1|\ge\epsilon|A_i|,\ |U_2|\ge\epsilon|A_j|,\ |d(U_1,U_2)-d_{ij}|>\epsilon}\right)\\[3pt] &\le2|A_i||A_j|\exp\left((|A_i|+|A_j|)\log2 -\epsilon^2|A_i||A_j|\iota(\epsilon)\right)\\[3pt] &\le2n^2\exp\left(n^2\left(\frac{(\gamma_i+\gamma_j)\log2}{n} -\gamma_i\gamma_j\epsilon^2\iota(\epsilon)\right)\right)\to0\quad\mbox{as } n\to\infty,\end{align*}

because $\iota(\epsilon)>0$.
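Claim 1 of Lemma A.3 can be visualized with a small simulation (an editorial sketch; the model parameters below are illustrative and not taken from the paper): for one pair of blocks of a dense SBM, the normalized counts $|e(\{v\},A_j)|/|A_j|$ should, for large n, all fall within the window $d_{ij}\pm n^{-1/2+\epsilon}$.

\begin{verbatim}
import random

# Simulate the edges between one vertex class A_i and one class A_j of
# an SBM with edge probability d_ij, and compare the extreme normalized
# vertex degrees with the window of Claim 1 of Lemma A.3.

random.seed(3)
n, gamma_i, gamma_j, d_ij, eps = 2000, 0.5, 0.5, 0.3, 0.15
size_i, size_j = int(n * gamma_i), int(n * gamma_j)

ratios = []
for _ in range(size_i):                    # one row per vertex v in A_i
    deg = sum(random.random() < d_ij for _ in range(size_j))
    ratios.append(deg / size_j)

h = n ** (-0.5 + eps)
print("min:", min(ratios), "max:", max(ratios))
print("window:", d_ij - h, "to", d_ij + h)  # holds with high probability
\end{verbatim}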

Preliminaries for the Poissonian block model

For the Poissonian block model, the function $\phi(x)=-x\log x$ replaces the binomial entropy in the counterparts of the code lengths $L_4+L_5$. We indicate below how the crucial steps of the proofs would change.

Denote by $D_P(b\|a)$ the Kullback–Leibler divergence of the distribution $Poisson(b)$ with respect to $Poisson(a)$,

(A.12) \begin{equation}D_P(b\|a)=a-b+b\log b-b\log a.\end{equation}

For a counterpart to Lemma A.2, note that, for any $z>0$ ,

(A.13) \begin{equation}\phi''(x)=-\frac1x=-\frac{{\,\textrm{d}}^2}{{\,\textrm{d}}{x^2}}D_P(x\|z).\end{equation}

Lemma A.4. For any $\alpha\in[0,1]$ and $x,y>0$ , let $z=\alpha x+(1-\alpha)y$ . Then we have

(A.14) \begin{align}\phi(z)-(\alpha\phi(x)+(1-\alpha)\phi(y))&=\alpha D_P(x\|z)+(1-\alpha)D_P(y\|z)\end{align}
(A.15) \begin{align}&=zI_{\textit{Ber}(\alpha)}\left(\frac{\alpha x}{z}\right).\end{align}

Proof. The equality (A.14) follows from (A.13) by an argument similar to the derivation of (A.5). The expression (A.15) is obtained by writing the right-hand side of (A.14) with the substitution $y=(z-\alpha x)/(1-\alpha)$ and recombining the log terms.
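The identities of Lemma A.4 admit the same kind of mechanical confirmation (an editorial spot-check with arbitrary sample points):

\begin{verbatim}
import math

# Spot-check of (A.14) and (A.15) for phi(x) = -x*log(x).

def phi(x):
    return -x * math.log(x)

def D_P(b, a):        # formula (A.12)
    return a - b + b * math.log(b) - b * math.log(a)

def I_ber(alpha, t):  # Bernoulli(alpha) rate function, cf. (A.2)
    return t * math.log(t / alpha) + (1 - t) * math.log((1 - t) / (1 - alpha))

for alpha, x, y in ((0.3, 2.0, 5.0), (0.6, 1.5, 0.5)):
    z = alpha * x + (1 - alpha) * y
    lhs = phi(z) - (alpha * phi(x) + (1 - alpha) * phi(y))
    mid = alpha * D_P(x, z) + (1 - alpha) * D_P(y, z)
    rhs = z * I_ber(alpha, alpha * x / z)
    assert abs(lhs - mid) < 1e-12 and abs(mid - rhs) < 1e-12
print("Lemma A.4 identities hold at the sample points")
\end{verbatim}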

The Poissonian counterpart of Proposition A.4 is the following.

Proposition A.5. Let $a>0$ , $k\ge2$ , $n_i\ge1$ , $i=1,\ldots,k$ , and $n=\sum_in_i$ . Let $X_i$ , $i\in\left\{{{1,\ldots,k}}\right\}$ , be independent random variables with distributions Poisson $(n_ia)$ , respectively. Let $X_{1\ldots j}=\sum_{i=1}^jX_i$ , $\bar{X}_i=X_i/n_i$ , and $\bar{X}_{1\dots j}=X_{1\dots j}/\sum_{i=1}^jn_i$ . Then

(A.16) \begin{align}n\phi(\bar{X}_{1\dots k})-\sum_{i=1}^kn_i\phi(\bar{X}_i)&\buildrel{(st)}\over\le\sum_{i=1}^{k-1}\left(\log2+Y_i\right),\end{align}

where $Y_1,\ldots,Y_{k-1}$ are independent Exp(1) random variables.

Proof. The proof of Proposition A.4 can be imitated as follows:

  • Using induction, it suffices to consider the case $k=2$ .

  • Apply Lemma A.4 to the left-hand side of (A.16) with $x=\bar{X}_1$ , $y=\bar{X}_2$ , $z=\bar{X}_{12}$ , and $\alpha=n_1/n$ . This yields

    \begin{equation*}X_{12}I_{\textit{Ber}\left(\frac{n_1}{n}\right)}\left(\frac{X_1}{X_{12}}\right)=I_{\textit{Bin}\left(X_{12},\frac{n_1}{n}\right)}(X_1).\end{equation*}
  • Now, the conditional distribution of $X_1$ given $X_{12}$ is the above binomial distribution. Thus, we can apply Lemma A.1 in the same way as in the proof of Proposition A.4.
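As with Proposition A.4, the Poissonian bound can be illustrated by simulation (an editorial sketch; the intensities, block sizes, seed, and z values are arbitrary):

\begin{verbatim}
import math, random

# Empirical check of (A.16) for k = 3 with X_i ~ Poisson(n_i * a).

def phi(x):
    return 0.0 if x == 0.0 else -x * math.log(x)

def poisson_sample(lam, rng):
    # inverse-transform sampling; adequate for moderate lam
    u, k, p = rng.random(), 0, math.exp(-lam)
    c = p
    while u > c and k < 10_000:            # safety cap
        k += 1
        p *= lam / k
        c += p
    return k

def erlang2_sf(t):
    # survival function of Y_1 + Y_2 with Y_i ~ Exp(1)
    return 1.0 if t <= 0 else math.exp(-t) * (1 + t)

rng = random.Random(11)
ns, a, reps = (10, 20, 30), 1.5, 20_000
n = sum(ns)
vals = []
for _ in range(reps):
    xs = [poisson_sample(ni * a, rng) for ni in ns]
    vals.append(n * phi(sum(xs) / n)
                - sum(ni * phi(x / ni) for ni, x in zip(ns, xs)))

for z in (1.0, 2.0, 4.0, 6.0):
    emp = sum(v > z for v in vals) / reps
    print(z, emp, erlang2_sf(z - 2 * math.log(2)))  # empirical <= bound
\end{verbatim}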

Funding

This work was supported by the Academy of Finland project 294763 (Stomograph).

Competing interests

There were no competing interests to declare that arose during the preparation or publication of this article.

References

Alon, N., Fischer, E., Newman, I. and Shapira, A. (2006). A combinatorial characterization of the testable graph properties: it’s all about regularity. In Proc. 38th Annual ACM Symposium on Theory of Computing (STOC ’06), Association for Computing Machinery, New York, pp. 251–260.
Bolla, M. (2016). Relating multiway discrepancy and singular values of nonnegative rectangular matrices. Discrete Appl. Math. 203, 26–34.
Bolla, M. and Elbanna, A. (2015). Estimating parameters of a probabilistic heterogeneous block model via the EM algorithm. J. Prob. Statist. 2015, 657965.
Cover, T. and Thomas, J. (1991). Elements of Information Theory. John Wiley, New York.
Gao, C., Ma, Z., Zhang, A. and Zhou, H. (2017). Achieving optimal misclassification proportion in stochastic block model. J. Mach. Learning Res. 18, 1–45.
Grünwald, P. (2007). Minimum Description Length Principle. MIT Press.
Heimlicher, S., Lelarge, M. and Massoulié, L. (2012). Community detection in the labelled stochastic block model. In NIPS Workshop on Algorithmic and Statistical Approaches for Large Social Networks, Lake Tahoe, NV. Available at https://arxiv.org/abs/1209.2910.
Holland, P., Laskey, K. and Leinhardt, S. (1983). Stochastic blockmodels: first steps. Social Networks 5, 109–137.
Komlós, J. and Simonovits, M. (1996). Szemerédi’s regularity lemma and its applications in graph theory. In Combinatorics, Paul Erdös Is Eighty, eds Miklós, D., Sós, V. and Szonyi, T., János Bolyai Mathematical Society, Budapest, pp. 295–352.
Massoulié, L. (2014). Community detection thresholds and the weak Ramanujan property. In Proc. 46th Annual ACM Symposium on Theory of Computing (STOC ’14), Association for Computing Machinery, New York, pp. 694–703.
Nepusz, T., Négyessy, L., Tusnády, G. and Bazsó, F. (2008). Reconstructing cortical networks: case of directed graphs with high level of reciprocity. In Handbook of Large-Scale Random Networks (Bolyai Society Mathematical Studies 18), eds Bollobás, B., Kozma, R. and Miklós, D., Springer, Berlin, Heidelberg, pp. 325–368.
Pehkonen, V. and Reittu, H. (2011). Szemerédi-type clustering of peer-to-peer streaming system. In Proc. 2011 International Workshop on Modeling, Analysis, and Control of Complex Networks (Cnet ’11), Association for Computing Machinery, New York, pp. 23–30.
Peixoto, T. P. (2012). Entropy of stochastic blockmodel ensembles. Phys. Rev. E 85, 056122.
Peixoto, T. P. (2013). Parsimonious module inference in large networks. Phys. Rev. Lett. 110, 148701.
Reittu, H., Bazsó, F. and Weiss, R. (2014). Regular decomposition of multivariate time series and other matrices. In Structural, Syntactic, and Statistical Pattern Recognition (S+SSPR 2014), eds Fränti, P., Brown, G., Loog, M., Escolano, F. and Pelillo, M., Springer, Berlin, Heidelberg, pp. 424–433.
Reittu, H., Leskelä, L., Räty, T. and Fiorucci, M. (2018). Analysis of large sparse graphs using regular decomposition of graph distance matrices. In Proc. 2018 IEEE International Conference on Big Data, Institute of Electrical and Electronics Engineers, New York, pp. 3784–3792.
Reittu, H., Norros, I. and Bazsó, F. (2017). Regular decomposition of large graphs and other structures: scalability and robustness towards missing data. In Proc. 2017 IEEE International Conference on Big Data, Institute of Electrical and Electronics Engineers, New York, pp. 3352–3357.
Reittu, H. et al. (2019). Regular decomposition of large graphs: foundation of a sampling approach to stochastic block model fitting. Data Sci. Eng. 4, 44–60.
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11, 416–431.
Rissanen, J. (1998). Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
Rosvall, M. and Bergstrom, C. T. (2007). An information-theoretic framework for resolving community structure in complex networks. Proc. Nat. Acad. Sci. USA 104, 7327–7331.
Szemerédi, E. (1976). Regular partitions of graphs. In Problèmes Combinatoires et Théorie des Graphes (Colloq. Int. CNRS 260), CNRS, Orsay, pp. 399–401.
Wang, Y. X. R. and Bickel, P. J. (2017). Likelihood-based model selection for stochastic block models. Ann. Statist. 45, 500–528.
Yaglom, A. and Yaglom, I. (1983). Probability and Information. D. Reidel, Dordrecht.