1. Introduction
Let us define a stochastic block model (SBM) as a random graph $G=(V,E)$, whose structure is defined by a partition $\xi=\{A_1,\ldots,A_k\}$ of V and by a symmetric $k\times k$ matrix $D=(d_{ij})_{i,j=1}^k$ of real numbers $d_{ij}\in[0,1]$ as follows: for every pair $\{v,w\}$ of distinct vertices of V such that $v\in A_i$, $w\in A_j$, $\{v,w\}\in E$ with probability $d_{ij}$, and all Bernoulli random variables $e_{vw}=e_{wv}=1_{\{\{v,w\}\in E\}}$ are independent. To avoid ambiguity, we also impose the irreducibility condition that no two rows of D are equal, i.e.,
(1.1) \begin{equation} \text{for all } i\not=j \text{ there exists } l\in\{1,\ldots,k\} \text{ such that } d_{il}\not=d_{jl}. \end{equation}
An SBM is said to be irreducible if its partition $\xi$ and matrix D satisfy (1.1). This condition is necessary for identifiability, because a random graph generated by an irreducible block model with $(\xi,D)$ can be equivalently generated by any reducible model obtained by refining $\xi$ , i.e., splitting one or more of its blocks into smaller sets.
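To make the definition concrete, here is a minimal sampling sketch (the function `sample_sbm` and its NumPy-based representation are our own illustration, not part of the model definition):

```python
import numpy as np

def sample_sbm(block_sizes, D, seed=None):
    """Sample the adjacency matrix of a binary SBM.

    block_sizes: cardinalities |A_1|, ..., |A_k| of the blocks.
    D: symmetric k x k matrix of edge probabilities d_ij in [0, 1].
    """
    rng = np.random.default_rng(seed)
    n = sum(block_sizes)
    # labels[v] = index i of the block A_i containing vertex v
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    A = np.zeros((n, n), dtype=int)
    for v in range(n):
        for w in range(v + 1, n):
            # each pair {v, w} is an independent Bernoulli(d_ij) variable
            if rng.random() < D[labels[v], labels[w]]:
                A[v, w] = A[w, v] = 1
    return A, labels

# Two blocks; the rows of D differ, so the irreducibility condition (1.1) holds.
A, labels = sample_sbm([60, 40], np.array([[0.7, 0.2],
                                           [0.2, 0.4]]), seed=1)
```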
A slightly different variant of SBMs has been defined by first drawing the sizes of blocks $A^{(n)}_i$ as independent Poisson random variables and then proceeding with the matrix D as before. SBMs have also been called generalized random graphs; the case of trivial partition $\xi=\left\{{{V}}\right\}$ yields a random graph with edge probability $d_{11}$ .
SBMs were first defined for community modeling in [Reference Holland, Laskey and Leinhardt8], and community detection has remained their main application as well as a source of research problems—see e.g. [Reference Heimlicher, Lelarge and Massoulié7, Reference Massoulié10, Reference Peixoto13]. However, large networks now appear in almost every imaginable application area, and there are grounds to consider SBMs a rather generic form of the separation of structure and randomness in large real-world systems. Indeed, Szemerédi’s regularity lemma [Reference Szemerédi22], a fundamental result in graph theory, states that any large enough graph can be approximated by an SBM-like block structure, where the randomness is replaced by pseudo-randomness in terms of so-called $\epsilon$-regularity, with a number of blocks k that depends only on $\epsilon$. This remarkable combinatorial fact has been utilized to solve hard problems in many areas of mathematics (see e.g. [Reference Alon, Fischer, Newman and Shapira1, Reference Komlós, Simonovits, Miklós, Sós and Szonyi9]). Looking for SBM-like structures in large real-world graphs and matrices is a generic way to obtain low-dimensional approximations and compressed representations of them.
The typical case of SBM estimation is that the graph $G=(V,E)$ is observed, but the partition $\xi$ is unobserved. For any fixed candidate partition, a corresponding estimate of the matrix D of edge densities is obtained immediately as consisting of empirical means.
If the number of blocks k is known (and the irreducibility condition (1.1) holds, as we always assume), the partition can be identified accurately by several methods, when the graph is sufficiently dense. In mathematical theorems, one considers asymptotics, and dense graph asymptotics means that the densities $d_{ij}$ remain the same when the graph size $n=|V|$ grows to infinity, the relative block sizes staying fixed. The main estimation techniques are maximum likelihood fitting, where the optimization may utilize expectation maximization (e.g. [Reference Bolla and Elbanna3]), simulated annealing or Markov chain Monte Carlo algorithms, and spectral methods (e.g. [Reference Bolla2, Reference Massoulié10]). In sparse graph asymptotics, where all densities decrease to zero while keeping their proportions, identification of the partition may be possible or impossible even with known k, depending on the proportions of the densities; see [Reference Heimlicher, Lelarge and Massoulié7, Reference Massoulié10]. Gao et al. [Reference Gao, Ma, Zhang and Zhou5] consider achieving the minimal possible fraction of errors in SBM estimation. Their work is mostly relevant for the sparse case and when k is known.
We focus on the case called model selection, where k is unknown, and on the dense graph setting. In practical tasks of grouping real-world data, it is often a major challenge to select the ‘right’ partition size k. Intuitively, the optimal choice of k should strike a balance between the degree of pseudo-randomness of the edge-level structures and the model complexity specified by the size of the partition. A popular device has been the Akaike information criterion (AIC), which has been applied to block model estimation in, e.g., [Reference Nepusz, Négyessy, Tusnády, Bazsó, Bollobás, Kozma, Miklós and Springer11]. The AIC simply adds the ‘number of model parameters’ to the negated maximal log-likelihood and chooses the model that minimizes the sum.
We apply a more refined information-theoretic approach, Rissanen’s minimum description length (MDL) principle (see [Reference Grünwald6]), according to which the quality of a model should be measured by the length of the bit string that it yields for unique encoding of the data. In our case, the data consist of the graph G, and the model consists of the partition $\xi$ and the density matrix D. Rosvall and Bergstrom [Reference Rosvall and Bergstrom21] pointed out information theory as a natural choice for community structure analysis, grouping redundant elements into a few communities. The MDL principle, in particular, has been applied to SBMs at least by Peixoto [Reference Peixoto14]. He derived interesting estimates which indicate that there is an upper limit on the number of communities that can be detected, $k\sim \sqrt{n}$ as a function of the number of vertices n. However, he did not consider the exact identification of the number of blocks k, which is the focus of the present paper.
Wang and Bickel [Reference Wang and Bickel23] also used information theory for likelihood-based model selection for SBMs. Their conclusions are similar to ours. In addition, they show the validity of results in a sparse case when the average degree grows at a polylog (polynomial in $\log n$) rate and in the case of degree-corrected block models. They use asymptotic distributions instead of the exact ones that we use. The algorithmic part of their work is also different from ours. They end up with a likelihood-based Bayesian information criterion (BIC) that is asymptotically consistent. Along with a term that corresponds to log-likelihood, there is a term proportional to $k^2n\log n$ with some tuning coefficient that has to be defined separately in every specific case. So defined, the BIC has a minimum at the right value of k. Instead of such a term, we prefer to use the MDL model complexity, which requires no case-by-case tuning. Our techniques are different as well.
The main contributions of this paper are the following: (i) the systematic application of the MDL approach, providing (ii) basically simple and transparent techniques, building on Chernoff bounds, to (iii) prove three theorems on model specification. A crucial role is played by Theorem 2, which states that all refinements of $\xi$ increase an MDL-based objective function with high probability. We show that the MDL principle identifies the true partition with high probability among models whose relative block sizes are bounded away from zero. The results are also shown to hold with Poissonian edge weights. We have applied Poissonian block models as a heuristic for finding regular structures in large real-world data matrices, in a method we call regular decomposition [Reference Nepusz, Négyessy, Tusnády, Bazsó, Bollobás, Kozma, Miklós and Springer11, Reference Pehkonen and Reittu12, Reference Reittu, Bazsó, Weiss, Fränti, Brown, Loog, Escolano, Pelillo and Springer15, Reference Reittu, Norros and Bazsó17, Reference Reittu18].
The paper is structured as follows. Section 2 defines the main notions: SBMs and the MDL principle. Section 3 presents our main results. The proofs are given in Section 4; some of the auxiliary propositions may be of independent interest. Finally, the results are discussed in Section 5. The proofs apply simple information-theoretic tools collected in Appendix A.
2. Basics and definitions
2.1. Stochastic block models
If G(V, E) is a graph, where V is the set of vertices and E is the set of edges, the link density of a nonempty vertex set $X\subseteq V$ is defined as
\begin{equation*} d(X)=\frac{|e(X)|}{\binom{|X|}{2}}, \end{equation*}
where e(X) denotes the set of edges with both endpoints in X and $\left|\cdot\right|$ denotes the cardinality of a set. Similarly, the link density between two disjoint nonempty vertex sets $X,Y\subseteq V$ is defined as
\begin{equation*} d(X,Y)=\frac{|e(X,Y)|}{|X||Y|}, \end{equation*}
where e(X,Y) denotes the set of edges with one endpoint in X and the other in Y.
Definition 2.1. Let $\epsilon>0$. A pair of disjoint sets $A,B\subseteq V$ of G(V, E) is called $\epsilon$-regular if, for every $X\subseteq A$ and $Y\subseteq B$ such that $\left|X\right|> \epsilon \left|A\right|$ and $\left|Y\right|> \epsilon \left|B\right|$, we have
\begin{equation*} |d(X,Y)-d(A,B)|\le\epsilon. \end{equation*}
The notion of a (binary) stochastic block model (SBM) was defined at the very beginning of this paper. A Poissonian block model is defined similarly except that the elements of the matrix D are numbers in $[0,\infty)$ , and the random variables $e_{ij}$ have Poisson distributions with respective parameters $d_{ij}$ . Thanks to the independence assumption, the sums $\sum_{u\in X}\sum_{v\in Y}e_{uv}$ are Poisson-distributed for any disjoint $X,Y\subset V$ .
A sequence of random graphs $G_n=(V_n,E_n)$ representing copies of the same SBM at different sizes can, for definiteness, be constructed as follows.
Construction 2.1. Let $\gamma_1,\ldots,\gamma_k$ be positive, distinct real numbers such that $\sum_{i=1}^k\gamma_i=1$ .
1. Divide the interval (0,1] into k segments
\begin{equation*} I_1=(0,\gamma_1],\ I_2=(\gamma_1,\gamma_1+\gamma_2],\ \ldots,\ I_k=\left(\sum_{i=1}^{k-1}\gamma_i,1\right],\end{equation*}
and define $\Gamma=\{I_1,\ldots,I_k\}$.

2. For $n=1,2,\ldots$, let the vertices of $G_n$ be
\begin{equation*}V_n=\left\{\frac{i}{n} \,{:}\, i\in\{1,\ldots,n\}\right\}.\end{equation*}
For each n, let $\xi_n$ be the partition of $V_n$ into the blocks
\begin{equation*} A^{(n)}_i=I_i\cap V_n,\quad i=1,\ldots,k.\end{equation*}
For small n, some of the sets $A^{(n)}_i$ may be empty. However, from some $n_0$ on, all blocks are nonempty and $\xi_n=\{A^{(n)}_1,\ldots,A^{(n)}_k\}$ is a genuine partition of $V_n$. We can then generate a sequence of SBMs based on the sequence $(V_n,\xi_n,D)$.
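A sketch of this construction (our own code; block boundaries are computed as integer cut points to avoid floating-point edge cases):

```python
import numpy as np

def construction_blocks(gammas, n):
    """Blocks A_i^(n) = I_i ∩ V_n of Construction 2.1, where V_n = {1/n, ..., 1}.

    Vertex i/n lies in I_j iff it falls in (cum_{j-1}, cum_j], i.e. iff the
    integer i lies in the range (n * cum_{j-1}, n * cum_j].
    """
    cum = np.concatenate(([0.0], np.cumsum(gammas)))
    # floor(n * cum_j) is the largest integer i with i/n <= cum_j
    cuts = np.floor(n * cum + 1e-9).astype(int)  # tolerance for rounding error
    return [np.arange(cuts[j] + 1, cuts[j + 1] + 1) / n
            for j in range(len(gammas))]

blocks = construction_blocks([0.5, 0.3, 0.2], n=10)      # sizes 5, 3, 2
blocks_small = construction_blocks([0.5, 0.3, 0.2], n=2)  # middle block is empty
```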
2.2. The minimum description length principle
The minimum description length (MDL) principle was introduced by J. Rissanen, inspired by Kolmogorov’s complexity theory (see [Reference Grünwald6, Reference Rissanen20]; our general reference on coding and information theory is [Reference Cover and Thomas4]). The basic idea is the following: a set $\mathcal{D}$ of data is optimally explained by a model $\mathcal{M}$ when a combined unique encoding of (i) the model and (ii) the data as interpreted in this model is as short as possible. An encoding means here a bijective mapping of objects to a binary prefix code, i.e., a code where no code word is the beginning of another code word, and any sequence of them is thus uniquely decodable.
The principle is best illustrated by our actual case, that of simple graphs. A graph $G=(V,E)$ with $|V|=n$ can always be encoded as a binary string of length $\binom{n}{2}=n(n-1)/2$, where each binary variable corresponds to a vertex pair, and a value 1 (resp. 0) indicates an edge (resp. absence of an edge). Thus, the MDL of G is always at most $\binom{n}{2}$. However, G may have a structure whose disclosure would allow a much shorter description. For example, let G be a bipartite graph consisting of two blocks with cardinality $n/2$. Knowing this, it suffices to code the edges between the blocks, which requires at most $(n/2)^2$ bits, just half of the previous number. However, the partition must still be specified. It requires n additional bits to say which block each vertex belongs to. On the other hand, if the overall density of G is a number $d=h/\binom{n}{2}\not=1/2$, the edges can be encoded using only $\binom{n}{2}H(d)$ bits, where H(d) denotes the Shannon entropy $H(d)=-d\log_2d-(1-d)\log_2(1-d)$. (There are $\binom{N}{K}$ binary sequences of length N that have exactly K 1s, and $\binom{N}{K}\le2^{NH(K/N)}$; see e.g. [24, IV.2].) However, the integer h must be encoded, and the shortest-length prefix coding for integers requires
(2.1) \begin{equation} l^*(h)=\log_2c_0+\log_2h+\log_2\log_2h+\cdots \end{equation}
bits, where the sum continues as long as the iterated logarithms remain positive and $c_0\approx2.865$ [Reference Grünwald6, Reference Rissanen19].
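The following sketch puts numbers to this comparison (our own illustration; `l_star` implements the iterated-logarithm code length above with Rissanen's constant $c_0\approx2.865$):

```python
import math

def H2(d):
    """Binary Shannon entropy in bits, with H2(0) = H2(1) = 0."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * math.log2(d) - (1 - d) * math.log2(1 - d)

def l_star(h, c0=2.865):
    """Universal prefix-code length (bits) for a positive integer h:
    log2(c0) + log2 h + log2 log2 h + ..., keeping positive iterates only."""
    bits, x = math.log2(c0), float(h)
    while True:
        x = math.log2(x)
        if x <= 0:
            break
        bits += x
    return bits

n, d = 1000, 0.1
pairs = n * (n - 1) // 2
h = round(d * pairs)                       # number of edges
naive_bits = pairs                         # one bit per vertex pair
structured_bits = pairs * H2(h / pairs) + l_star(h)
# the entropy code is less than half the naive length
print(naive_bits, round(structured_bits))
```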
Definition 2.2. Denote by $\mathcal{M}_{n/k}$ the set of all irreducible (see the condition (1.1)) SBMs $(V,\xi,D)$ with $V=\{1,\ldots,n\}$ and $\xi=\{V_1,\ldots,V_k\}$ such that, for $i,j\in\{1,\ldots,k\}$,
\begin{equation*} d_{ij}=\frac{h_{ij}}{N_{ij}},\qquad h_{ij}\in\{0,1,\ldots,N_{ij}\}, \end{equation*}
where $N_{ij}=|V_i||V_j|$ for $i\not=j$ and $N_{ii}=\binom{|V_i|}{2}$.
$\mathcal{M}_{n/k}$ is called the modeling space of SBMs with n vertices and k blocks, and
\begin{equation*} \mathcal{M}_n=\bigcup_{k=1}^n\mathcal{M}_{n/k} \end{equation*}
is called the modeling space of all SBMs with n vertices.
The condition that the $h_{ij}$ be integers entails that the modeling space $\mathcal{M}_n$ is finite. The models in $\mathcal{M}_{n/k}$ are parameterized by $\Theta=(\xi,D)$. Note that without the irreducibility condition (1.1) there would not be a bijection between SBMs and their parameterizations.
A good model $\Theta\in\mathcal{M}_{n/k}$ for a graph G is one that gives the largest probability for G, and it is called the maximum likelihood model. Denote the parameter of this model by
\begin{equation*} \hat{\Theta}_k(G)=\mathop{\arg\max}_{\Theta\in\mathcal{M}_{n/k}}P(G\mid\Theta), \end{equation*}
where $P(G\mid \Theta)$ denotes the probability that the probabilistic model specified by $\Theta$ produces G. Part of likelihood optimization is trivial: when a partition $\eta=\{A_1,\ldots,A_k\}$ is considered for a given graph G, the optimal edge probabilities are the empirical edge densities,
\begin{equation*} \hat{d}_{ij}=d(A_i,A_j)\ \text{ for }i\not=j,\qquad \hat{d}_{ii}=d(A_i). \end{equation*}
As a result, the edge densities are always rational numbers, which explains the corresponding definition used in $\mathcal{M}_{n/k}$ .
The nontrivial part of maximum likelihood estimation within $\mathcal{M}_{n/k}$ is to find the optimal partition, but, as noted in the introduction, this has also been well studied in the literature.
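As a sketch of the trivial step above (our own code, applicable to any symmetric 0/1 adjacency matrix with zero diagonal, such as one produced by the earlier sampling sketch):

```python
import numpy as np

def empirical_densities(A, labels):
    """Empirical densities d(A_i), d(A_i, A_j) for a candidate partition;
    these are the maximum likelihood estimates of D for that fixed partition."""
    ids = np.unique(labels)
    k = len(ids)
    D_hat = np.zeros((k, k))
    for a in range(k):
        Va = np.flatnonzero(labels == ids[a])
        for b in range(a, k):
            Vb = np.flatnonzero(labels == ids[b])
            if a == b:
                pairs = len(Va) * (len(Va) - 1) // 2
                edges = A[np.ix_(Va, Va)].sum() // 2  # each edge counted twice
            else:
                pairs = len(Va) * len(Vb)
                edges = A[np.ix_(Va, Vb)].sum()
            D_hat[a, b] = D_hat[b, a] = edges / pairs if pairs else 0.0
    return D_hat
```

The entries of the result are rational numbers, as required in Definition 2.2.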
Kraft’s inequality (e.g. [Reference Cover and Thomas4]) implies that if letters are drawn from an alphabet with probabilities $p_1,\ldots,p_N$, there exists a prefix coding with code lengths $\lceil -\log p_1\rceil, \ldots, \lceil-\log p_N\rceil$; moreover, the mean code length of this scheme is within one bit of the optimum. Thus, for any probability distribution P on the space of all graphs G with n vertices, there exists a prefix coding with code lengths $\lceil-\log_2{P(\{G\})}\rceil$. In particular, to every graph $G_0$ with $|V|=n$ we can associate an encoding with code lengths $\lceil-\log_2 P(G|\hat{\Theta}_k(G_0))\rceil$ corresponding to the SBM $\hat{\Theta}_k(G_0)\in\mathcal{M}_{n/k}$. However, this is not all, since in order to be able to decode we must know what particular model was used. This means that $\hat{\Theta}_k(G_0)$ must also be prefix-encoded, with some code length $L(\hat{\Theta}_k(G_0))$. We end up with a description length called the two-part MDL [Reference Grünwald6]:
\begin{equation*} \mathrm{MDL}(G_0)=\min_k\Big(L\big(\hat{\Theta}_k(G_0)\big)+\big\lceil-\log_2P\big(G_0\mid\hat{\Theta}_k(G_0)\big)\big\rceil\Big). \end{equation*}
A detailed implementation of this principle is proposed in the next section.
3. MDL analysis of stochastic block models
3.1. Block model codes
In the rest of this paper, we define information-theoretic functions in terms of natural logarithms, and certain notions (such as code lengths) should be divided by $\log 2$ to obtain their values in bits. We denote by $H(\cdot)$ both Shannon’s entropy function for a partition and the entropy of a Bernoulli distribution:
\begin{equation*} H(\eta)=-\sum_{B\in\eta}\frac{|B|}{n}\log\frac{|B|}{n},\qquad H(p)=-p\log p-(1-p)\log(1-p). \end{equation*}
Definition 3.1. A block model code of a graph $G=(V,E)$ with respect to a partition $\eta=\{B_1,\ldots,B_m\}$ of V is a code with the structure described below.

The model part is as follows:

1. The sizes of the blocks $B\in\eta$ are given as integers.

2. The edge density d(B) inside each block $B\in\eta$ and the edge density $d(B,B^{\prime})$ between each pair of distinct blocks $B,B^{\prime}\in\eta$ are given as the numerators of the rational numbers presenting the exact (empirical) densities.

3. For $v\in V$, define $i_v=\sum_{i=1}^m i1_{\{v\in B_i\}}$. The partition $\eta$ is specified by encoding the sequence $(i_v)_{v\in V}$ by a prefix code corresponding to the membership distribution $P(i\in B_j)=|B_j|/n$.

The data part is as follows:

4. The edges inside each block $B_i\in\eta$ are specified by a prefix code corresponding to a random distribution of edges with density $d_i$.

5. The edges between each pair of distinct blocks $B_i,B_j\in\eta$ are specified by a prefix code corresponding to a random distribution of edges with density $d_{ij}$.
Note that a block model code can be given for any graph with respect to any partition of its vertices. Next, we shall specify the functions we use to estimate the lengths of each of the five code parts described in Definition 3.1. Parts 3–5 present long sequences from a finite ‘alphabet’. By classical information theory, basically Kraft’s inequality, the required code length per element is the entropy of the distribution of the ‘letters’.
Parts 1 and 2 encode certain integers. To obtain mathematically tractable and well-behaved estimates, we make some simplifications. Instead of Rissanen’s $l^*$ function (2.1), we use the plain logarithm, and instead of the lengths of numerators of rational numbers, we use the lengths of their denominators. (Besides simplicity, this choice has the desirable effect that the respective code length estimates then depend only on the partition structure, not on the edge densities.) All length estimates should be nonzero. On the other hand, we do not need to care about ceiling functions. These considerations lead us to estimate the code length of a positive integer N as $\log(1+N)$ . With the additional simplification $\binom{m}{2}\approx m^2/2$ , the estimate of the length of Part 2 of the code would then be
So far we have followed the MDL methodology quite strictly, up to the above simplifications. Now, however, we introduce a twist that is motivated only by its usefulness in proving Theorem 3.3: we strengthen the impact of Part 2 by multiplying $L_2^{\prime}(G|\eta)$ by $|\eta|+1.$
Thus, we propose to base SBM selection on the following weighted estimate of the length of a block model code:
\begin{equation*} L(G|\eta)\,{:\!=}\,L_1(G|\eta)+L_2(G|\eta)+L_3(G|\eta)+L_4(G|\eta)+L_5(G|\eta),\qquad L_2(G|\eta)\,{:\!=}\,(|\eta|+1)\,L_2^{\prime}(G|\eta). \end{equation*}
For sums of only some of the functions $L_i$, we define $L_{12}(G|\eta)\,{:\!=}\,L_{1}(G|\eta)+L_{2}(G|\eta)$, etc.
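Assembling the pieces, a sketch of the resulting objective function follows. Since the displayed formulas for the individual parts are only summarized above, the part-by-part details here (in particular the exact form of $L_2^{\prime}$) are our own reconstruction of the described scheme, not a verbatim transcription:

```python
import numpy as np

def bernoulli_H(d):
    """Bernoulli entropy in nats, with 0 log 0 = 0."""
    if d <= 0.0 or d >= 1.0:
        return 0.0
    return -d * np.log(d) - (1 - d) * np.log(1 - d)

def code_length(A, labels):
    """Weighted block model code length L(G | eta) for a labelled partition.

    L1: block sizes, log(1 + |B|) nats each;
    L2: densities, log(1 + denominator) per block pair,
        weighted by (|eta| + 1) as in the text;
    L3: membership sequence, n * H(eta);
    L4 + L5: edge data, pair count times Bernoulli entropy of the density.
    """
    n = len(labels)
    blocks = [np.flatnonzero(labels == i) for i in np.unique(labels)]
    m = len(blocks)
    L1 = sum(np.log(1 + len(B)) for B in blocks)
    L2p = L3 = L45 = 0.0
    for a in range(m):
        Ba = blocks[a]
        L3 += len(Ba) * (-np.log(len(Ba) / n))
        for b in range(a, m):
            Bb = blocks[b]
            if a == b:
                pairs = len(Ba) * (len(Ba) - 1) // 2
                edges = A[np.ix_(Ba, Ba)].sum() / 2
            else:
                pairs = len(Ba) * len(Bb)
                edges = A[np.ix_(Ba, Bb)].sum()
            L2p += np.log(1 + pairs)       # length of the density's denominator
            if pairs:
                L45 += pairs * bernoulli_H(edges / pairs)
    return L1 + (m + 1) * L2p + L3 + L45
```

Minimizing `code_length` over candidate partitions is then the model selection procedure studied below.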
Proposition 3.1. For any graph $G=(V,E)$, the model part $L_{123}(G|\eta)$ is monotonically increasing and the data part $L_{45}(G|\eta)$ is monotonically decreasing in $\eta$ with respect to refinement of partitions. That is, for any partitions $\eta$ and $\zeta$ of V such that $\eta\le\zeta$, we have
\begin{equation*} L_{123}(G|\eta)\le L_{123}(G|\zeta)\quad\text{and}\quad L_{45}(G|\eta)\ge L_{45}(G|\zeta). \end{equation*}
Proof. The claim concerning the model part follows from the subadditivity of $\log(1+x)$ for $x\ge1$ and from the monotonicity of $\eta\mapsto H(\eta)$ .
As regards the data part, note that the internal density of $B\in\eta$ can be written as a convex combination of densities related to the partition $\zeta\cap B$ of B:
\begin{equation*} d(B)=\sum_{C\in\zeta\cap B}\frac{\binom{|C|}{2}}{\binom{|B|}{2}}\,d(C)+\sum_{\substack{\{C,C^{\prime}\}\subseteq\zeta\cap B\\ C\not=C^{\prime}}}\frac{|C||C^{\prime}|}{\binom{|B|}{2}}\,d(C,C^{\prime}). \end{equation*}
By the concavity of H, we then have
where the first two sums after the inequality sign come from $L_4(G|\eta)$ and the third from $L_5(G|\eta)$.
Poissonian block models
Let us now consider Poissonian block models, where the entries of E are Poisson-distributed integers. For a pair of disjoint sets $A,B\subset V$, the set $\left\{e_{ij} \,{:}\, i\in A,\,j\in B\right\}$ is a sample from a distribution $R=(r_\ell)_{\ell\ge0}$ that is a mixture of Poisson distributions. It would be hard to encode the sample by first estimating the unknown mixture distribution. Instead, we base the code simply on the sample mean
\begin{equation*} e_{AB}=\frac{1}{|A||B|}\sum_{i\in A}\sum_{j\in B}e_{ij}, \end{equation*}
and encode $\left\{e_{ij} \,{:}\, i\in A,\,j\in B\right\}$ as if it came from a Poisson distribution $P=(p_\ell)$ with parameter $e_{AB}$. If the $e_{ij}$ really came from a Poisson distribution, a value $e_{ij}=\ell$ would, by Kraft’s inequality, be well encoded by a code word with approximate length
\begin{equation*} -\log p_\ell=e_{AB}-\ell\log e_{AB}+\log(\ell!). \end{equation*}
However, a mixture of different Poisson distributions is not a Poisson distribution. Denote by $R=(r_\ell)$ the empirical distribution of $\left\{e_{ij} \,{:}\, i\in A,\ j\in B\right\}$; i.e., $r_\ell$ is the number of $e_{ij}$ with value $\ell$, divided by $|A||B|$. The fundamental information inequality
(3.4) \begin{equation} -\sum_{\ell\ge0}r_\ell\log p_\ell\ \ge\ -\sum_{\ell\ge0}r_\ell\log r_\ell \end{equation}
implies that this encoding is suboptimal for arbitrary disjoint subsets A and B, but it is optimal when A and B are blocks of the model partition $\xi$ and R comes from a sample from a pure Poisson distribution. (The inequality (3.4) is a special case of the nonnegativity of the Kullback–Leibler divergence; it says that using a code recommended by Kraft’s inequality with wrong probabilities increases the expected code word length.) This suboptimality is no problem, because for our purposes it only improves the contrast between $\xi$ and other partitions of V.
Using the above coding rule for an arbitrary partition $\eta=\left\{{{B_1,\ldots,B_m}}\right\}$ , the amount of code needed to encode all of the $e_{ij}$ is
where
the r-symbols refer to the empirical distribution of the $e_{vw}$ , and $\phi(x)=-x\log x$ . Now, we define our MDL-based objective function for a Poissonian block model E with respect to any partition $\eta=\left\{{{B_1,\ldots,B_m}}\right\}$ as
Note that the term $L_0(E|\eta)$ is independent of the partition $\eta$ and can be neglected when minimizing over $\eta$ . The parts $L_1(E|\eta)$ and $L_2(E|\eta)$ are copied from the case of plain graphs.
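A sketch of the partition-dependent part of this objective (our own reconstruction; the term $L_0(E|\eta)$ and the total weight $\sum e_{ij}$, both independent of $\eta$, are dropped as noted above):

```python
import numpy as np

def phi(x):
    """phi(x) = -x log x, with phi(0) = 0."""
    return 0.0 if x <= 0 else -x * np.log(x)

def poisson_data_part(E, labels):
    """For each block pair, (number of vertex pairs) * phi(sample mean of the
    weights); E is a symmetric nonnegative integer matrix with zero diagonal."""
    blocks = [np.flatnonzero(labels == i) for i in np.unique(labels)]
    total = 0.0
    for a in range(len(blocks)):
        for b in range(a, len(blocks)):
            Ba, Bb = blocks[a], blocks[b]
            if a == b:
                pairs = len(Ba) * (len(Ba) - 1) // 2
                weight = E[np.ix_(Ba, Ba)].sum() / 2
            else:
                pairs = len(Ba) * len(Bb)
                weight = E[np.ix_(Ba, Bb)].sum()
            if pairs:
                total += pairs * phi(weight / pairs)
    return total
```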
One application of Poisson block models was found in [Reference Reittu, Leskelä, Räty and Fiorucci16], which used them to analyze graph distance matrices. This approach can help in finding communities in large and sparse networks. In [Reference Reittu, Bazsó, Weiss, Fränti, Brown, Loog, Escolano, Pelillo and Springer15], it was suggested that the Poisson block model objective function (3.5) could be used for any matrix with nonnegative entries, resulting in the same cost function as in the integer case.
3.2. Accuracy of model selection
Our results on MDL-based model selection are summarized in the following theorem. It is formulated in terms of the asymptotic behavior of a model sequence as specified by Construction 2.1. In this framework, an event is said to happen with high probability if its probability tends to 1 when $n\to\infty$.
Consider a sequence of irreducible SBMs $(G_n,\xi_n)$ based on a vector $(\gamma_1,\ldots,\gamma_k)$ of relative block sizes and a matrix $D=(d_{ij})_{i,j=1}^k$ of edge probabilities. Define
\begin{equation*} \mathfrak{d}(\eta,\xi_n)\,{:\!=}\,\frac{1}{n}\,\max_{B\in\eta}\,\min_{A\in\xi_n}\,|B\setminus A|. \end{equation*}
Thus, $\mathfrak{d}(\eta,\xi_n)=0$ if and only if $\eta$ is a refinement of $\xi_n$ . For $\sigma\in(0,1)$ , denote by $\mathcal{P}_{n,\sigma}$ the set of partitions of $V_n$ whose blocks each have at least $\sigma n$ vertices. Obviously, $|\eta|\le1/\sigma$ for $\eta\in\mathcal{P}_{n,\sigma}$ . If $v\in V_n$ and $B\in\eta$ , denote by $\eta_{v,B}$ the partition obtained from $\eta$ by moving vertex v to block B (if $v\in B$ , then $\eta_{v,B}=\eta$ ).
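In this notation, a small sketch (our own code, assuming the max–min form of $\mathfrak{d}$ displayed above; partitions are represented as label arrays):

```python
import numpy as np

def frak_d(eta, xi):
    """(1/n) * max over B in eta of min over A in xi of |B \\ A|."""
    n = len(eta)
    worst = 0
    for B in np.unique(eta):
        in_B = (eta == B)
        worst = max(worst, min(np.sum(in_B & (xi != A)) for A in np.unique(xi)))
    return worst / n

def move_vertex(eta, v, B):
    """The partition eta_{v,B}: vertex v moved to block B."""
    out = eta.copy()
    out[v] = B
    return out

xi  = np.array([0, 0, 0, 1, 1, 1])          # true partition
eta = np.array([0, 0, 1, 1, 1, 1])          # vertex 2 misplaced
print(frak_d(eta, xi))                      # 1/6
print(frak_d(move_vertex(eta, 2, 0), xi))   # 0.0: a refinement of xi
```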
Theorem 3.1. Fix some minimal relative block size $\sigma\in(0,\min_i\gamma_i)$. There are two numbers $\epsilon_0>0$ and $\kappa_0>0$ such that the following holds with high probability: Let a partition $\eta\in\mathcal{P}_{n,\sigma}$ satisfy $|\eta|\ge k$ and $\mathfrak{d}(\eta,\xi_n)\le\epsilon_0$. If $A\in\xi_n$ and $B\in\eta$ are such that $0<|B\setminus A|/n\le\epsilon_0$, choose any vertex $v\in B\setminus A$. Then there is a block $B^{\prime}\in\eta$ such that
\begin{equation*} L(G_n|\eta)-L(G_n|\eta_{v,B^{\prime}})\ge\kappa_0n. \end{equation*}
Moreover, $L(G_n|\eta)\ge L(G_n|\tilde\eta)$, where $\tilde\eta$ is a refinement of $\xi_n$.
Theorem 3.2. For any $\epsilon>0$ and positive integer m, there is a constant $c_{\epsilon,m}>0$ such that the following holds with high probability: for every partition $\eta$ of $V_n$ such that $|\eta|\le m$ and $\mathfrak{d}(\eta,\xi_n)>\epsilon$,
\begin{equation*} L(G_n|\eta)-L(G_n|\eta\vee\xi_n)\ge c_{\epsilon,m}n^2. \end{equation*}
Theorem 3.3. With high probability, minimizing $L(\eta)$ identifies $\xi_n$ among all refinements $\eta$ of $\xi_n$ .
Corollary 3.1. For any $\sigma\in(0,\min_i\gamma_i)$ , $\xi_n$ is with high probability the unique minimizer of $L(G_n|\eta)$ for $\eta\in\mathcal{P}_{n,\sigma}$ , and
The corresponding results hold mutatis mutandis for Poissonian block models. The proofs are essentially identical. The differences, required by the replacement of binomial distributions by Poisson distributions, are indicated at the end of Appendix A.
4. Proofs of the theorems
Throughout this section, we consider a sequence $(G_n,\xi_n)$ of increasing versions of a fixed SBM based on a vector $(\gamma_1,\ldots,\gamma_k)$ of relative block sizes and a matrix $D=(d_{ij})_{i,j=1}^k$ of edge probabilities, as specified in Construction 2.1.
Proof of Theorem 3.1. Let $\epsilon,\delta>0$ be small numbers and m a positive integer to be specified. They can be chosen so that the following hold:
- The value of $\epsilon$ is small enough so that $\eta$ is close to $\eta\vee\xi_n$ when $\mathfrak{d}(\eta,\xi_n)\le\epsilon$:
(4.1) \begin{equation}m\epsilon<\delta\min_i{\gamma_i}. \end{equation}
- All the differing link probabilities are widely separated in $\delta$ units:
(4.2) \begin{equation} \delta\le\frac{1}{m}\min\left\{|d_{ij_1}-d_{ij_2}| \,{:}\, i,j_1,j_2\in\{1,\ldots,k\},\ d_{ij_1}\not=d_{ij_2}\right\}. \end{equation}
For any $A_i,A_j\in\xi_n$ (possibly $i=j$), we then have
(4.3) \begin{equation} |d(A_i,A_j)-d_{ij}|\le\delta \end{equation}
with high probability.
Let $\eta$ be a partition of $V_n$ such that $\mathfrak{d}(\eta,\xi_n)\le\epsilon$ . For $i=1,\ldots,k$ , denote by $B_i$ a block $B\in\eta$ such that $|B\cap A_i|$ is maximal. Then $|B_i\setminus A_i|/n\le\epsilon$ , and the blocks $B_1,\ldots,B_k$ are distinct for small $\epsilon$ . Let us number arbitrarily any remaining blocks of $\eta=\left\{{{B_1,\ldots,B_{|\eta|}}}\right\}$ . We can further assume that if $|\eta|>k$ , then for each $q>k$ there is a unique $j_q$ such that $|B_q\setminus A_{j_q}|/n\le\epsilon$ (for $q\le k$ , let $j_q=q$ ).
Assume now that $v\in B_s\setminus A_j$ , $|B_s\setminus A_j|/n\le\epsilon$ , and $v\in A_i\cap B_s$ with $i\not=j$ . We choose $B^{\prime}=B_i$ and compare the partitions $\eta$ and $\eta_{v,B_i}$ . Let $b_q=|B_q|$ , $q=1,\ldots,|\eta|$ , and $\tilde{B}_i=B_i\cup\left\{{{v}}\right\}$ , $\tilde{B}_s=B_s\setminus\left\{{{v}}\right\}$ . Then
Consider first the sum over q. Leaving out the common factor $b_q$ , each term of the sum can be written as
(note the addition and subtraction of the term $H(d(\{v\},B_q))$). Using Lemmas A.2 and A.3 ($\epsilon$-regularity) and the relations (4.1), (4.2), and (4.3), the last expression can be made, with high probability, arbitrarily close to the number
(the function $D_B(\cdot\|\cdot)$ , the Kullback–Leibler divergence of Bernoulli distributions, is defined in (A.2)). Thus, the sum over q is, with high probability, close to
Let us then turn to the remaining parts of (4.4), which refer to two codings of the internal links of $B_i\cup B_s$ . Similarly as above, we can add and subtract terms to transform these parts into
By the above analysis of (4.4), we have obtained
where the approximation can be made arbitrarily accurate by the choice of $\epsilon$ , m, and $\delta$ . By the irreducibility assumption (1.1), there is a block $A_{j_q}$ such that $d_{{j_q}i}\not=d_{{j_q}j}$ , with the possibility that $q\in\left\{{{i,s}}\right\}$ . It follows that at least one of the $D_B(x\|y)$ terms in (4.5) is positive. Let $\kappa^*=\min\left\{{{D_B(d_{ij_1}\|d_{ij_2})}} \,{:}\, {{d_{ij_1}\not=d_{ij_2}}}\right\}$ . Thus, with high probability,
Choose $\epsilon_0=\epsilon$ and $\kappa_0=\kappa^*\min_i\gamma_i/2$ . As regards the other code parts, it is easy to compute that
and the changes of $L_1$ and $L_2$ when moving from $\eta$ to $\eta_{v,B_i}$ are negligible. This concludes the proof of the claim concerning moving a single vertex from $B_s$ to $B_i$. The second claim follows from noting that the first claim continues to hold a fortiori when moving similarly all remaining vertices that prevent $\eta$ from being a refinement of $\xi_n$.
Proof of Theorem 3.2. Fix an $\epsilon\in(0,1)$ and let $\eta$ be a partition of $V_n$ such that $\mathfrak{d}(\eta,\xi_n)>\epsilon$. Proposition 3.1 yields that $L_{45}(G_n|\eta)\ge L_{45}(G_n|\eta\vee\xi_n)$.
By assumption, there exists $B\in\eta$ such that $|B\setminus A|>\epsilon n$ for every $A\in\xi_n$ . It is easy to see that there must be (at least) two distinct blocks, say $A_i$ and $A_j$ , such that
By the irreducibility assumption (1.1), there is a block $A_q$ such that $d_{qi}\not=d_{qj}$ , with the possibility that $q\in\left\{{{i,j}}\right\}$ . Fix an arbitrary $\delta>0$ to be specified later. By $\epsilon$ -regularity (Claim 2 of Lemma A.3), with high probability, every choice of a partition $\eta$ satisfying $\exists B\in\eta\ \forall A\in\xi_n\ |B\setminus A|>\epsilon n$ results in some blocks $A_i$ , $A_j$ , $A_q$ with the above characteristics plus the regularity properties
where $B^{\prime}$ denotes a block of $\eta$ that maximizes $|A_q\cap B^{\prime}|$ (note that because $|\eta|\le m$, $|A_q\cap B^{\prime}|\ge|A_q|/m$). By the concavity of H,
In the case that $q\in\left\{{{i,j}}\right\}$ and $B=B^{\prime}$, we obtain a similar equation where $|A_q\cap B|$ is partly replaced by $|A_q\cap B|-1$. Because of (4.7) and (4.6), the difference between the sides of the inequality has a positive lower bound that holds with high probability. On the other hand, this difference is part of the overall concavity inequality (3.3) in the proof of Proposition 3.1. Thus, we can choose $c_{\epsilon,m}$ so that
with high probability. It remains to note that $L_{123}(G_n|\xi_n\vee\eta)-L_{123}(G_n|\eta)$ is at most proportional to n.
Let us now turn to preparing the proof of Theorem 3.3 concerning refinements of $\xi_n$. For any partition $\eta\ge\xi_n$, write
\begin{equation*} \Delta L(\eta)\,{:\!=}\,L(G_n|\eta)-L(G_n|\xi_n), \end{equation*}
and, for individual components of the code,
\begin{equation*} \Delta L_i(\eta)\,{:\!=}\,L_i(G_n|\eta)-L_i(G_n|\xi_n),\qquad i\in\{1,\ldots,5\}. \end{equation*}
Lemma 4.1. For any partition $\eta$ of $V_n$ ,
Proof. The claim is proved by elaborating (3.1) as follows:
where the second line follows from symmetry.
Lemma 4.2. For $\eta\ge\xi_n$ ,
where
Proof of (4.8). Note first that it follows from the subadditivity of $\log(1+x)$ for $x\ge1$ that for any $\eta\ge\xi_n$ we have
Using Lemma 4.1 and (4.11), we have
Proof of (4.9). We now use (4.11) in the other direction, and note that the concavity of $\log x$ implies
We then have
Proposition 4.1. For any refinement $\eta$ of $\xi_n$, we have
(4.12) \begin{equation} L_{45}(G_n|\xi_n)-L_{45}(G_n|\eta)\ \buildrel{(st)}\over\le\ \sum_{i=1}^{|\eta|-k}\big(\log2+Y_i\big), \end{equation}
where (st) refers to stochastic order, the $Y_i$ are independent Exp(1) random variables, and
\begin{equation*} X\buildrel{(st)}\over\le Z\quad\text{means that}\quad\mathbb{P}(X>y)\le\mathbb{P}(Z>y)\ \text{for all }y. \end{equation*}
Proof. Here we apply results presented in Appendix A. Denote by $\eta\cap A_i$ the subset of $\eta$ whose members are subsets of the block $A_i$ of $\xi_n$ . Writing the edge code lengths of the coarser and finer partition similarly as in (3.3), taking the difference, and using (A.5), we obtain
Applying Proposition A.4 to each term of both outer sums now yields the claim, because
\begin{equation*} \sum_{i=1}^k\big(|\eta\cap A_i|-1\big)=|\eta|-k. \end{equation*}
It is rather surprising that the stochastic bound (4.12) depends only on the number of blocks in $\eta$ —not on their relative sizes, nor on the overall model size n.
Proof of Theorem 3.3. Let $\eta$ be a refinement of $\xi_n$ and recall Proposition 3.1: refining the partition with respect to $\xi_n$ yields a gain, based on the concavity of H, in the data code part $L_{45}$, but additional costs in the model code part $L_{123}$. We have to relate these to each other. Since $L_1(G_n|\eta)$ is just a small addition to $L_2(G_n|\eta)$ (see Lemma 4.1), we can ignore it and focus on $L_2(G_n|\eta)$ and $L_3(G_n|\eta)$ as regards the model part.
The refinement gain in code part $L_{45}$ was bounded in Proposition 4.1 stochastically by Exp(1) random variables. The rate function (see the beginning of Appendix A) of the distribution Exp(1) is
\begin{equation*} I_Y(y)=y-1-\log y,\qquad y>0. \end{equation*}
Proposition 4.1 yields, using (A.1) and Proposition A.1, that for $y>\log 2$
For two refinements of $\xi_n$ , write $\eta^{\prime}\sim\eta$ if the block sizes of $\eta'$ in each $A_i$ are identical to those of $\eta$ . The number of refinements $\eta'$ of $\xi_n$ with $\eta'\sim\eta$ is bounded above by
where $H(\eta\cap A)$ denotes the entropy of the partition of A induced by $\eta$ and $H(\eta|\xi_n)$ the conditional entropy of $\eta$ given $\xi_n$. On the other hand, we have
\begin{equation*} \Delta L_3(\eta)=n\big(H(\eta)-H(\xi_n)\big)=nH(\eta|\xi_n). \end{equation*}
Let $m=|\eta|$ . With $y=\Delta L_{23}^{\prime}$ , the union bound and Lemma 2 yield
where
and $\rho_1(m)=o(m)$ as $m\to\infty$ .
Consider all different block size sequences of partitions $\eta>\xi_n$ such that $|\eta|=m$:
A rough upper bound of their number is
Indeed, choose first the number of subblocks in each block $A_i$ of $\xi_n$ , then the sizes of the subblocks, each in decreasing order. Note that within each block of $\xi_n$ , the size of its smallest subblock is determined by the others; i.e., k of the m block sizes ‘come for free’.
Using (4.15), (4.17), and the union bound, we obtain for any integer $m>k$
where
and $\rho_2(m)=o(m)$ . Adding and subtracting $m\log n$ and using the facts $1\le k<m$ and $k\le m$ , (4.18) can be further bounded by
Now define
Let $n\ge m^*$ . Then we have
Finally,
5. Discussion
The results of this paper provide rather clear insight into the description length optimization landscape for SBMs. When a partition $\eta$ deviates only a little from the true partition $\xi_n$ and $|\eta|=k=|\xi_n|$, minimization of $L_{45}(\eta)$ (likelihood maximization) in a single round by moving each vertex to its most appropriate block leads to exact identification of $\xi_n$ (Theorem 3.1). $L_{45}$ is essentially the log-likelihood function. With $|\eta|>k$, however, minimization of $L_{45}$ may lead instead to some refinement of $\xi_n$, with a value $L_{45}(\eta)$ that is strictly smaller than $L_{45}(\xi_n)$. We have quantified this additional gain, which is of order n, and shown that it is outweighed by the additional model complexity, measured by functions suggested by the MDL principle. MDL stands out as a theoretically solid way to weigh likelihood gain against model complexity.
In the earlier paper [Reference Reittu18], we and our coauthors analyzed several questions on regular decomposition, i.e., our MDL-based methodology, focusing on the case in which k is known. It was found that not a single vertex can be misplaced without a substantial penalty in $L_{45}$ , proportional to n. In this sense, the global minimum is well separated from all local minima. However, there is no easy way of testing whether a minimum found by a greedy algorithm is the global one. This problem is acute with real-life data, in which there is no SBM-type ground truth. It is an interesting theoretical challenge to determine the probability of finding the global optimum of MDL using a greedy minimization algorithm.
The paper [Reference Reittu18] proposed a way to speed up the $L_{45}$ minimization when k is known and $n=|V|$ arbitrarily large. The idea is that a moderately sized random sample from V reveals the block structure, which can be identified by some brute force method, in our case by repeating the greedy algorithm many times with random initial partitions. The majority of the vertices can then be placed into their blocks with very high accuracy.
In [Reference Reittu18], we also compared regular decomposition by simulations with a spectral clustering method. The results suggested that the spectral clustering method is sensitive to the relative block sizes and requires them to be well balanced, whereas regular decomposition is not sensitive to the relative block sizes.
The additional insights provided by the present paper may be useful in designing algorithms where k is unknown and likelihood maximization is combined with some different kinds of operations. One simple operation worth studying is that of merging blocks according to the MDL criterion.
Practical implementation of MDL often requires a bit of art in addition to science. In the present work, we could not prove Theorem 3.3 completely without using the slightly bigger function $L_2$ instead of the better motivated $L^{\prime}_2$ , although MDL is theoretically able to identify SBM-like models exactly. Another remaining technical question is whether the restriction to partitions with nonnegligible relative block sizes is really necessary for $L(\eta)\ge L(\eta\vee\xi_n)$ .
Appendix A. Information-theoretic preliminaries
Consider a random variable X with moment-generating function
\begin{equation*} \phi_X(\beta)=\mathbb{E}\,e^{\beta X}, \end{equation*}
and let $\mathcal{D}_X=\{\beta \,{:}\, \phi_X(\beta)<\infty\}$. We restrict our attention to distributions of X for which $\mathcal{D}_X$ is an open (finite or infinite) interval. The corresponding rate function is
\begin{equation*} I_X(x)=\sup_{\beta\in\mathcal{D}_X}\big(\beta x-\log\phi_X(\beta)\big); \end{equation*}
this is a strictly convex function with minimum 0 at $\mathbb{E}\,{X}$ and value $+\infty$ outside the range of X. For the mean $\bar{X}_n=\frac{1}{n}\sum_1^nX_i$ of independent and identically distributed copies of X, we have
(A.1) \begin{equation} I_{\bar{X}_n}(x)=nI_X(x). \end{equation}
The following inequalities are known as the Chernoff bounds for the random variable X.
Proposition A.1. We have
\begin{equation*} \mathbb{P}(X\ge x)\le e^{-I_X(x)}\quad\text{for }x\ge\mathbb{E}\,X,\qquad \mathbb{P}(X\le x)\le e^{-I_X(x)}\quad\text{for }x\le\mathbb{E}\,X. \end{equation*}
We make frequent use of the following consequence of the Chernoff bounds. Let the convex hull of the support of X be the closure of $(x^-,x^+)$, and define
\begin{equation*} a^-=\lim_{x\downarrow x^-}I_X(x),\qquad a^+=\lim_{x\uparrow x^+}I_X(x) \end{equation*}
(note that $a^-<\infty$ if and only if the distribution of X has an atom at $x^-$, and similarly for $a^+$). We can now write
\begin{equation*} I^-_X\,{:\!=}\,I_X\big|_{(x^-,\,\mathbb{E}X]},\qquad I^+_X\,{:\!=}\,I_X\big|_{[\mathbb{E}X,\,x^+)}. \end{equation*}
With the assumptions made above, the functions $I^-_X(x)$ and $I^+_X(x)$ are, respectively, bijections from $(x^-,\mathbb{E}\,{X}]$ and $[\mathbb{E}\,{X},x^+)$ to $[0,a^-)$ and $[0,a^+)$ .
Lemma A.1. We have
\begin{equation*} I_X(X)\ \buildrel{(st)}\over\le\ \log2+Y, \end{equation*}
where $\buildrel{(st)}\over\le$ denotes stochastic order and Y is a random variable with distribution Exp(1).
Proof. For any $z\ge0$, we have
\begin{equation*} \mathbb{P}(I_X(X)\ge z)\le\mathbb{P}\big(X\le(I^-_X)^{-1}(z)\big)+\mathbb{P}\big(X\ge(I^+_X)^{-1}(z)\big)\le2e^{-z}, \end{equation*}
where a term is dropped when z exceeds $a^-$ or $a^+$, and the last inequality comes from Proposition A.1. Thus,
\begin{equation*} \mathbb{P}(I_X(X)\ge z)\le e^{-(z-\log2)}=\mathbb{P}(\log2+Y\ge z)\quad\text{for }z\ge\log2, \end{equation*}
and the bound is trivial for $z<\log2$.
In the case that X has the Bernoulli(p) distribution, we have
(A.2) \begin{equation} I_X(x)=D_B(x\|p)\,{:\!=}\,x\log\frac{x}{p}+(1-x)\log\frac{1-x}{1-p}. \end{equation}
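A quick numerical sanity check of the Bernoulli Chernoff bound (our own illustration):

```python
import numpy as np

def D_B(x, p):
    """Kullback-Leibler divergence D_B(x || p) of Bernoulli distributions, in nats."""
    def t(a, b):
        return 0.0 if a == 0 else a * np.log(a / b)
    return t(x, p) + t(1 - x, 1 - p)

rng = np.random.default_rng(0)
n, p, x, trials = 200, 0.3, 0.35, 20000
means = rng.binomial(n, p, size=trials) / n
# empirical tail P(mean >= x) versus the bound exp(-n * D_B(x || p))
print((means >= x).mean(), np.exp(-n * D_B(x, p)))
```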
Lemma A.2. The first and second derivatives of the functions H(x) and $x\mapsto D_B(x\|p)$ are
\begin{equation*} H^{\prime}(x)=\log\frac{1-x}{x},\quad H^{\prime\prime}(x)=-\frac{1}{x(1-x)},\quad \frac{\partial}{\partial x}D_B(x\|p)=\log\frac{x(1-p)}{p(1-x)},\quad \frac{\partial^2}{\partial x^2}D_B(x\|p)=\frac{1}{x(1-x)}. \end{equation*}
Since $D_B(p\|p)=D_B^{\prime}(p\|p)=0$ and $D_B^{\prime\prime}(p\|p)=-H''(p)$, we also have
(A.3) \begin{equation} D_B(p+h\|p)=\frac{h^2}{2p(1-p)}+O(h^3)\quad\text{as }h\to0, \end{equation}
and
(A.4) \begin{equation} D_B(p+h\|p)\ge2h^2. \end{equation}
Proposition A.2. Let $n\ge2$ , and let $X_1$ and $X_2$ be independent random variables with distributions Bin(m,p) and Bin $(n-m,p)$ , respectively. Let $X_{12}=X_1+X_2$ and $\bar{X}_1=X_1/m$ , $\bar{X}_2=X_2/(n-m)$ , $\bar{X}_{12}=X_{12}/n$ . Then the following identities hold:
The identities in Proposition A.2 are obtained by writing out the full expression for (A.6) and rearranging the $\log$ terms in two other ways. The formulae (A.7) and (A.8) are written without $X_2$ , expressing the fact that any two of the three random variables $X_1$ , $X_2$ , and $X_{12}$ contain the same information as the full triple. Note that (A.7) and (A.8) do not contain p. This reflects the fact that the conditional distribution of $X_1$ given $X_{12}$ , known as the hypergeometric distribution, does not depend on p. The identity of (A.6) and (A.8) can be interpreted so that the two positive terms of (A.6) measure exactly the same amount of information about p as what is subtracted by the negative term. Moreover, (A.8) has the additional interpretation of presenting the rate function of the hypergeometric distribution, as stated in the following proposition.
Proposition A.3. Let X have the distribution Hypergeometric(n,m,z), i.e. the conditional distribution of $X_1$ of Proposition A.2 given that $X_{12}=z$ . The rate function of X is
Proof. Define the bivariate moment-generating function of $(X_1,X_2)$ as
\begin{equation*} \phi(\alpha,\beta)=\mathbb{E}\,e^{\alpha X_1+\beta X_2}. \end{equation*}
Write
and note that we can assume $p=z/n$ . We can now derive the claim using $\phi(\alpha,\beta)$ in a similar manner as in the well-known proof of the one-dimensional Chernoff bound.
Proposition A.4. Let $k\ge2$, and let $X_i$, $i\in\{1,\ldots,k\}$, be independent random variables with distributions Bin$(n_i,p)$, respectively. Let $n=\sum_{i=1}^kn_i$, $X_{1\ldots j}=\sum_{i=1}^jX_i$, $\bar{X}_i=X_i/n_i$, and $\bar{X}_{1\dots j}=X_{1\dots j}/\sum_{i=1}^jn_i$. Then
(A.10) \begin{equation} nH(\bar{X}_{1\ldots k})-\sum_{i=1}^kn_iH(\bar{X}_i)\ \buildrel{(st)}\over\le\ \sum_{i=1}^{k-1}\big(\log2+Y_i\big), \end{equation}
where $Y_1,\ldots,Y_{k-1}$ are independent Exp(1) random variables.
Proof. For $k=2$ , the left-hand side of (A.10) equals
by Proposition A.2. For any $N\in\{0,\ldots,n\}$, consider the conditional distribution of (A.11), given that $X_{12}=N$. By Proposition A.3, this is the distribution of the Hypergeometric$(n,n_1,N)$ rate function taken at the random variable $X_1$ with the same distribution. The claim now follows by Lemma A.1, because the stochastic upper bound does not depend on N, i.e. on the value of $X_{12}$.
For $k>2$ we proceed by induction. Assume that the claim holds for $k-1$ and write
By the induction hypothesis, the first row of the second expression is stochastically bounded by $\sum_{i=1}^{k-2}(\log2+Y_i)$ , irrespective of the value of $X_{1\ldots(k-1)}$ . Similarly, the second row is stochastically bounded by $\log2+Y_k$ , where $Y_k\sim\mbox{ Exp}(1)$ , irrespective of the value of $X_{1\ldots k}$ . It remains to note that $Y_k$ can be chosen to be independent of $(Y_1,\ldots,Y_{k-2})$ , because $X_k$ is independent of $(X_1,\ldots,X_{k-1})$ , and of $\bar{X}_{1\ldots(k-1)}$ in particular.
Lemma A.3. Consider the sequence $(G_n,\xi_n)$ of SBMs as in Subsection 3.2. Then the following hold:

1. For any blocks $A_i$ and $A_j$ such that $d_{ij}\not\in\{0,1\}$, it holds for an arbitrary $\epsilon>0$ with high probability that
\begin{equation*}\min_{v\in A_i}\frac{|e(\{v\},A_j)|}{|A_j|}\ge d_{ij}-n^{-\frac12+\epsilon},\qquad\max_{v\in A_i}\frac{|e(\{v\},A_j)|}{|A_j|}\le d_{ij}+n^{-\frac12+\epsilon}.\end{equation*}

2. For any $\epsilon>0$, all block pairs of the partition $\xi_n$ are $\epsilon$-regular (Definition 2.1) with high probability.
Proof. Claim 1: By Proposition A.1 and (A.4),
The last expression converges to zero with the choice $h=n^{-\frac12+\epsilon}$ (recall that $|A_i|\sim n\gamma_i$ and $|A_j|\sim n\gamma_j$ ), which proves the claim on the maximum. The case of the minimum is symmetric.
Claim 2: Fix $\epsilon>0$ and consider any i, j. Let $U_1\subseteq A_i$ and $U_2\subseteq A_j$ be such that $|U_1|\ge\epsilon|A_i|$ and $|U_2|\ge\epsilon|A_j|$ . By Proposition A.1,
Let $\iota(\epsilon)=\min\left\{{{D_B(d_{ij}+\epsilon\|d_{ij}),D_B(d_{ij}-\epsilon\|d_{ij})}}\right\}$ . The union bound yields
because $\iota(\epsilon)>0$ .
Preliminaries for the Poissonian block model
For the Poissonian block model, the function $\phi(x)=-x\log x$ replaces binomial entropy in the counterparts of code lengths $L_4+L_5$ . We indicate below how the crucial steps of the proofs would change.
Denote by $D_P(b\|a)$ the Kullback–Leibler divergence between the distributions $Poisson(a)$ and $Poisson(b)$,
\begin{equation*} D_P(b\|a)=b\log\frac{b}{a}-b+a. \end{equation*}
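A short check that the closed form just displayed matches the defining series (our own illustration):

```python
import math

def D_P(b, a):
    """Closed form b log(b/a) - b + a."""
    return b * math.log(b / a) - b + a

def kl_series(b, a, terms=200):
    """sum_k P_b(k) * log(P_b(k) / P_a(k)), with Poisson pmfs via logs."""
    total = 0.0
    for k in range(terms):
        log_pb = -b + k * math.log(b) - math.lgamma(k + 1)
        # log(P_b(k) / P_a(k)) = (a - b) + k * log(b / a)
        total += math.exp(log_pb) * ((a - b) + k * math.log(b / a))
    return total

print(D_P(2.0, 3.0), kl_series(2.0, 3.0))  # both approximately 0.18907
```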
For a counterpart to Lemma A.2, note that, for any $z>0$,
(A.13) \begin{equation} \phi(x)=\phi(z)+(x-z)\phi^{\prime}(z)-D_P(x\|z),\qquad x\ge0. \end{equation}
Lemma A.4. For any $\alpha\in[0,1]$ and $x,y>0$ , let $z=\alpha x+(1-\alpha)y$ . Then we have
Proof. The equality (A.14) follows from (A.13) by an argument similar to the derivation of (A.5). The expression (A.15) is obtained by writing the right-hand side of (A.14) with the substitution $y=(z-\alpha x)/(1-\alpha)$ and recombining the log terms.
The Poissonian counterpart of Proposition A.4 is the following.
Proposition A.5. Let $a>0$, $k\ge2$, $n_i\ge1$, $i=1,\ldots,k$, and $n=\sum_in_i$. Let $X_i$, $i\in\{1,\ldots,k\}$, be independent random variables with distributions Poisson$(n_ia)$, respectively. Let $X_{1\ldots j}=\sum_{i=1}^jX_i$, $\bar{X}_i=X_i/n_i$, and $\bar{X}_{1\dots j}=X_{1\dots j}/\sum_{i=1}^jn_i$. Then
(A.16) \begin{equation} n\phi(\bar{X}_{1\ldots k})-\sum_{i=1}^kn_i\phi(\bar{X}_i)\ \buildrel{(st)}\over\le\ \sum_{i=1}^{k-1}\big(\log2+Y_i\big), \end{equation}
where $Y_1,\ldots,Y_{k-1}$ are independent Exp(1) random variables.
Proof. The proof of Proposition A.4 can be imitated as follows:

- Using induction, it suffices to consider the case $k=2$.
- Apply Lemma A.4 to the left-hand side of (A.16) with $x=\bar{X}_1$, $y=\bar{X}_2$, $z=\bar{X}_{12}$, and $\alpha=n_1/n$. This yields
\begin{equation*}X_{12}I_{\textit{Ber}\left(\frac{n_1}{n}\right)}\left(\frac{X_1}{X_{12}}\right)=I_{\textit{Bin}\left(X_{12},\frac{n_1}{n}\right)}(X_1).\end{equation*}
- Now, the conditional distribution of $X_1$ given $X_{12}$ is the above binomial distribution. Thus, we can apply Lemma A.1 in a similar way as in the proof of Proposition A.4.
Funding
This work was supported by the Academy of Finland project 294763 (Stomograph).
Competing interests
There were no competing interests to declare which arose during the preparation or publication process for this article.