1. Introduction
Random generation of combinatorial structures forms a prominent research area of theoretical computer science. Its wide applications include such topics as simulation of large physical statistical models [LB14], automated software testing [Pał+11, Cla+00] and counterexample construction for interactive theorem provers [Par+15], statistical analysis of queueing networks [BBR14], RNA design [Ham+19], or network theory, where one of the major challenges is to devise a realistic model of random graphs reflecting the properties of real-world networks [Bar16].
Given a formal specification defining a set of combinatorial structures, such as graphs, proteins, or tree-like data structures, we are interested in designing an efficient algorithmic sampling scheme generating such structures according to some prescribed and rigorously controlled distribution. For instance, if we are interested in sampling certain plane trees under a uniform outcome distribution, where plane trees with an equal number of nodes share the same probability of being constructed, we want to obtain a combinatorial sampler satisfying these input requirements.
There exists a number of different sampling techniques in the literature. Depending on the considered class of structures, there exist different ad-hoc methods, such as the prominent sampler for plane binary trees due to Rémy [Rém85]. If no direct sampling technique is applicable, a common technique is to use rejection sampling, in which one generates objects from a larger, yet simpler class and rejects unwanted samples. Although usually straightforward to implement, rejection sampling may quickly become infeasible, especially if the rejection rate grows exponentially fast. Another family of methods, quite popular in physical modelling, consists of Markov chain Monte Carlo algorithms. If each of the chain states has an equal number of transitions, then the stationary distribution is in fact uniform. Let us remark that this technique and its modifications were successfully applied in sampling random walks and dimer models [PW96].
One of the earliest examples of a universal sampling template is Nijenhuis and Wilf’s recursive method [NW78], later systematised by Flajolet, Zimmermann and Van Cutsem [FZC94]. In this approach, the input specification, as well as the combinatorial structures it defines, are recursively decomposed into primitive building blocks. Accordingly, sampling such objects follows closely the recursive structure of their specification. The generation scheme is split into two stages – an initial preprocessing phase, and the proper sampling itself. During the former, a set of decision probabilities based on a target size n is computed and stored for later use. The probabilities are chosen so as to guarantee a uniform distribution among structures of size n constructed in the latter phase. Consequently, the sampling process reduces to a series of random decisions following precomputed distributions, dictating how to compose the output structure.
Although quite general, the recursive method poses considerable practical limitations. In both phases, the designed algorithm can manipulate integers of size exponential in the target size n, turning its effective bit complexity to $O(n^{3+o(1)})$ , compared to the $\Theta(n^2)$ arithmetic operations required. Denise and Zimmermann later reduced the average-case bit complexity of the recursive method to $O(n \log n)$ in time and O(n) in space using a certified floating-point arithmetic optimisation [DZ99]. Nevertheless, the worst-case space bit complexity remained $O(n^2)$ , as did the bit complexity for specifications defining non-algebraic languages. Remarkably, for rational languages Bernardi and Giménez [BG12] recently linked the floating-point optimisation of Denise and Zimmermann with a specialised divide-and-conquer scheme, further reducing the worst-case space bit complexity and the average-case time bit complexity to O(n).
For many years, the exact-size sampling paradigm was the de facto standard in combinatorial generation. In many applications, however, such precision is not necessary, and the outcome size may fluctuate around some target value n. Such a relaxed paradigm was made possible by the seminal paper of Duchon, Flajolet, Louchard and Schaeffer, who proposed a universal sampler construction framework of so-called Boltzmann samplers [Duc+04]. The key idea of their approach is to relax the previous exact-size setting and allow for approximate-size samplers, generating structures within a target size window $[(1-\varepsilon) n,(1+\varepsilon) n]$ centred around some input size n. As in the recursive method, Boltzmann samplers closely follow the recursive structure of the input specification. Now, however, instead of directly manipulating large integers or floating-point numbers in order to compute the respective decision probabilities, the preprocessing phase uses numerical oracles to evaluate the systems of generating functions corresponding to the specified combinatorial structures.
Throughout the years, a series of important extensions and improvements of Boltzmann samplers was proposed. Let us mention, for instance, linear approximate-size (and quadratic exact-size) Boltzmann samplers for planar graphs [Fus05], general-purpose samplers for unlabelled structures [FFP07], efficient samplers for plane partitions [BFP10] or the cycle pointing operator for Pólya structures [Bod+11]. The framework of Boltzmann samplers was moreover generalised to differential specifications [BRS12, Bod+16]. Finally, let us mention linear exact-size samplers for Catalan and Motzkin trees exploiting the shape of their holonomic specifications [BBJ13].
What was left open since the initial work of Duchon, Flajolet, Louchard, and Schaeffer was the development of (i) efficient, general-purpose Boltzmann oracles providing effective means of evaluating combinatorial systems within their disks of convergence, and (ii) an automated tuning procedure controlling the expected parameter values of generated objects. The former problem was finally addressed by Pivoteau, Salvy and Soria [PSS12], who defined a rapidly converging combinatorial variant of the Newton oracle by lifting the combinatorial version of Newton’s iteration of Bergeron, Labelle and Leroux [BLL98] to a new numerical level. In principle, using their Newton iteration and an appropriate use of binary search, it became possible to approximate the singularity of a given algebraic combinatorial system with arbitrarily high precision. However, even if the singularity $\rho$ is estimated with high precision, say $10^{-10}$ , its approximation quality does not translate into an equally accurate approximation of the generating function values at $\rho$ , often not better than $10^{-2}$ . Precise evaluation at z close to $\rho$ requires extremely high precision in z. Fortunately, it is possible to trade off the evaluation precision for an additional rejection phase using the idea of analytic samplers [BLR15], which retain uniformity even with rough evaluation estimates.
Nonetheless, in practical applications such as software testing, a uniform distribution of outcome structures might not be the most effective choice [Art+15]. In fact, it can be argued that most software bugs are minuscule, neglected corner cases, which will not be caught using large, typical instances of random data, see [Pał+11, RNL08]. In such cases, additional control over the internal structure of generated objects is required, cf. [Pał12]. Non-uniform generation schemes are also required in genomics [DRT00]. Patterns observed in real genomic sequences are tested against randomness, and therefore sequences with given nucleotide frequencies need to be sampled. Random generation becomes more involved when the properties do not relate to simple motifs but, for example, to secondary protein structure [Ham+19] or to evolution histories [CPW20].
In [DRT00], a multi-parametric random generation framework for context-free languages was suggested, while the question of efficient numerical tuning of the required generating function arguments was left open. In [BP10], Bodini and Ponty proposed a multidimensional Boltzmann sampler model, developing a tuning algorithm meant for the random generation of words from context-free languages with a given target letter frequency vector. This was a major improvement over [DRT00]; however, their algorithm converges only in an a priori unknown vicinity of the target tuning variable vector. In practice, it is therefore possible to control no more than a few tuning parameters at the same time.
In the present paper, we propose a novel polynomial-time tuning algorithm based on convex optimisation techniques, overcoming the previous convergence difficulties. We demonstrate the effectiveness of our approach with several examples of rational, algebraic, Pólya, and labelled structures. Remarkably, with our new method, we are able to easily handle large combinatorial systems with thousands of combinatorial classes and tuning parameters.
In order to illustrate the effectiveness of our approach, we implemented a prototype Python library called $\textsf{Paganini}$ meant to provide suitable tuning vectors, and a standalone sampler generator $\textsf{Boltzmann Brain}$ . Our software is freely available as open source.
The paper is structured as follows. In Section 2 we outline the basic concepts of combinatorial random sampling, including the principles of the recursive method and Boltzmann sampling. In Section 2.2 we introduce a new (to the best of our knowledge) admissible combination operation on combinatorial classes, and explain its usefulness in certain auxiliary transformations of combinatorial specifications used during the construction of the tuning problem. In the following Section 3, we show how to express the tuning problem as a convex optimisation problem and provide important simplifications for context-free grammars, labelled structures, increasing trees and other types of structures. In Section 4, we provide a detailed complexity analysis of the convex optimisation problems obtained from various combinatorial specifications using the notion of self-concordant barriers. Then, in Section 5 we address the problem of finding an optimal target expectation for Boltzmann samplers when the target size is constrained to a finite interval $[n_1, n_2] $ . Next, in Section 6 we describe our prototype implementation. Finally, in Section 7 we illustrate the effectiveness of our approach by providing several exemplary applications of our tuning algorithm.
2. Combinatorial random sampling
2.1 Admissible combinatorial classes
Let us consider the neutral class $\mathcal{E}$ , commonly denoted as 1, consisting of a single object of size zero, and its atomic counterpart $\mathcal{Z}$ , a class consisting of a single object of size one. Both are equipped with a finite set of admissible operators, such as the disjoint union $+$ , Cartesian product $\times$ , and sequence ${\rm S{\small EQ}}$ , see [FS09, Section I.2]. Depending on whether we consider labelled or unlabelled structures, we allow for more expressive admissible operators, including for instance the multiset ${\rm MS{\small ET}}$ , set ${\rm S{\small ET}}$ , or cycle ${\rm C{\small YC}}$ constructions. In this setting, the combinatorial specifications we consider in the current paper are finite systems of equations (possibly mutually recursive) built from the elementary classes $\mathcal{E}$ , $\mathcal{Z}$ , and admissible operators.
Example 1. Consider the following joint specification for $\mathcal{T}$ and $\mathcal{Q}$ . In the combinatorial class $\mathcal T $ of trees, nodes at even levels (the root starts at level one) have either no children or exactly two, whereas each node at an odd level has an arbitrary number of non-planarily ordered children:
In order to distinguish, in other words mark, certain additional combinatorial parameters, we consider the following natural multivariate extension of specifiable classes.
Definition 2 (Specifiable k-parametric combinatorial classes). A specifiable k-parametric combinatorial class is a combinatorial specification built, in a possibly recursive manner, from k distinct atomic classes $\mathcal{Z}_1, \ldots, \mathcal{Z}_k$ , the neutral class $\mathcal{E}$ , and admissible operators. In particular, a vector ${\boldsymbol{\mathcal{C}}} = \left({\mathcal{C}_1,\ldots,\mathcal{C}_m}\right) $ forms a specifiable k-parametric combinatorial class if its specification can be written down as
where the right-hand side expressions $\Phi_i(\boldsymbol{\mathcal{C}}, \mathcal{Z}_1, \ldots,\mathcal{Z}_k)$ are composed from $\boldsymbol{\mathcal{C}},\mathcal{Z}_1,\ldots,\mathcal{Z}_k$ , admissible operators, and the neutral class $\mathcal{E} $ .
Example 3. Let us continue our running example, see (1). Note that we can introduce two additional marking classes $\mathcal U $ and $\mathcal V $ into (2), each of weight zero, turning our example into a specifiable k-parametric combinatorial class:
Here, $\mathcal{U}$ is meant to mark the occurrences of nodes at odd levels, whereas $\mathcal{V}$ is meant to mark leaves at even levels. In effect, we decorate the univariate specification with explicit information regarding the internal structural patterns of interest.
Much like their univariate variants, k-parametric combinatorial specifications are naturally linked to ordinary multivariate generating functions, see [FS09, Chapter III].
Definition 4 (Multivariate generating functions). The multivariate ordinary generating function $C(z_1,\ldots,z_k)$ in variables $z_1,\ldots,z_k$ associated to a specifiable k-parametric combinatorial class $\mathcal{C}$ is defined as
where $c_\boldsymbol{p}=c_{p_1,\ldots,p_k}$ denotes the number of structures with $p_i$ atoms of type $\mathcal{Z}_i$ , and $\boldsymbol{z}^\boldsymbol{p}$ denotes the product $z_1^{p_1} \cdots z_k^{p_k}$ . In the sequel, we call $\boldsymbol{p}$ the (composition) size of the structure.
In this setting, we can easily lift the usual univariate generating function construction rules associated with admissible constructions to the realm of multivariate generating functions associated to specifiable k-parametric combinatorial classes.
2.2 Combination operator
Many examples of combinatorial structures are naturally expressed as collections of disjoint unions and do not require explicitly excluding certain undesired configurations. Consequently, the right-hand side expressions $\Phi_i(\boldsymbol{\mathcal{C}}, \mathcal{Z}_1, \ldots, \mathcal{Z}_k)$ of their combinatorial specifications tend to be composed of summands with positive coefficients. One notable exception, however, is specifications constructed using the inclusion-exclusion principle, which, due to the induced class subtraction, may cause considerable difficulties for both tuning and sampling.
From the sampling perspective, exclusion in the combinatorial specification requires an additional rejection phase, whose cost may undesirably dominate the cost of sampling itself. Even worse, later in Section 3 we will argue that, in general, negative coefficients do not allow the tuning problem to be expressed in convex optimisation form.
Remarkably, in some cases it is possible to rewrite the initial system with negative coefficients in such a way that the resulting system contains only positive terms. A typical example of such a situation is the construction of a non-empty product containing at most one object from each of the classes $\mathcal{C}_1, \ldots,\mathcal{C}_d $ . Symbolically:
Note that (5) is a combinatorial class consisting of non-empty tuples of length d, such that their i-th coordinate is either $\mathcal{E}$ or an object from $\mathcal{C}_i$ . This concise specification explicitly excludes the empty tuple; however, it introduces a subtraction into the specification. We could eliminate the subtraction by expanding all brackets in (5); such a naive transformation, however, produces $2^d - 1$ summands in the resulting specification. Instead, we propose another transformation.
Definition 5 (Combination operator). Let $\mathcal{C}_1, \ldots, \mathcal{C}_d$ be combinatorial classes. The admissible k-combination or k-selection ${\rm S{\small ELECT}}_k(\mathcal{C}_1, \ldots, \mathcal{C}_d) $ of $\mathcal{C}_1, \ldots, \mathcal{C}_d$ is a combinatorial class consisting of all tuples $\left({c_1,\ldots,c_d}\right)$ such that
- $c_i$ is either empty, that is $\mathcal{E}$ , or an element of $\mathcal{C}_i$ , and
- exactly k elements of $\left({c_1,\ldots,c_d}\right)$ are non-empty.
Example 6. Consider the classes $\mathcal{C}_1, \ldots, \mathcal{C}_d$ . Note that
In words, the class ${\rm S{\small ELECT}}_k(\mathcal{C}_1, \ldots, \mathcal{C}_d)$ is isomorphic to the disjoint union of all ordered k-products of classes in $\mathcal{C}_1, \ldots, \mathcal{C}_d$ . In particular, ${\rm S{\small ELECT}}_0(\mathcal{C}_1,\ldots, \mathcal{C}_d) \cong 1$ and ${\rm S{\small ELECT}}_d(\mathcal{C}_1,\ldots, \mathcal{C}_d) \cong \prod_i \mathcal{C}_i$ . It should be noted that the isomorphism in (6) cannot be replaced with strict equality, as ${\rm S{\small ELECT}}_k(\mathcal{C}_1, \ldots, \mathcal{C}_d)$ consists of tuples of length d, whereas the right-hand side sum consists of tuples of length k.
Let us denote ${\rm S{\small ELECT}}_k(\mathcal{C}_1, \ldots, \mathcal{C}_d) $ as $\mathcal{S}_k$ . Note that
The resulting classes $\mathcal{S}_k $ are, in fact, elementary symmetric polynomials, consisting of the $\binom{d}{k}$ degree-k summands in the expansion of $\prod_{i = 1}^{d} (1 + \mathcal{C}_i)$ . Since a direct, verbose representation of ${\rm S{\small ELECT}}_{k}(\mathcal{C}_1, \ldots, \mathcal{C}_d)$ is exponential in d, we suggest an indirect dynamic programming approach, allowing us to obtain a much more succinct representation of each of the selection operations using a total of $O(d^2)$ auxiliary combinatorial classes.
Let $\mathcal{P}_{k, m}$ be a subset of $\mathcal{S}_k$ in which products are restricted to involve classes from $\mathcal{C}_{1}, \ldots, \mathcal{C}_m$ . Note that $\mathcal{P}_{k,m}$ is empty if and only if $m < k$ . Moreover, $\mathcal{P}_{k,d} = \mathcal{S}_k$ . We start with
Now, suppose that we have computed $\mathcal{P}_{0,m},\mathcal{P}_{1,m},\ldots,\mathcal{P}_{k,m}$ for all values of m, and wish to compute the next row of classes corresponding to $k + 1$ . We iterate over m in increasing order. For all m such that $m < k + 1$ , we set $\mathcal{P}_{k+1, m} = \emptyset$ . Otherwise, if $m \geqslant k+1$ , we note that
The correctness of the above scheme can be proven by induction. With its help, we can compute a lower-triangular matrix of symbolic representations for $\mathcal{P}_{k,m}$ , and hence also the elementary symmetric polynomials $\mathcal{S}_k$ . As a by-product, we can therefore efficiently rewrite the initial system (5) into an equivalent one with positive terms. The scheme is illustrated by the sketch below.
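For illustration, the following minimal Python sketch (our own, not part of the described implementation) evaluates the generating functions of all classes $\mathcal{S}_0, \ldots, \mathcal{S}_d$ at a fixed point, given the values $c_i = C_i(\boldsymbol{z})$ , using the recurrence above with $O(d^2)$ intermediate quantities instead of the $2^d - 1$ expanded summands:

```python
def select_values(c):
    """Evaluate S_0, ..., S_d given the values c[i] of C_1, ..., C_d.

    P[k][m] holds the elementary symmetric polynomial of degree k in the
    first m values, following P[k][m] = P[k][m-1] + c[m] * P[k-1][m-1].
    """
    d = len(c)
    P = [[0.0] * (d + 1) for _ in range(d + 1)]
    for m in range(d + 1):
        P[0][m] = 1.0  # the empty product: SELECT_0 = 1
    for k in range(1, d + 1):
        for m in range(k, d + 1):
            P[k][m] = P[k][m - 1] + c[m - 1] * P[k - 1][m - 1]
    return [P[k][d] for k in range(d + 1)]

# For c = (x, y, z) this returns the values of
# 1, x + y + z, xy + xz + yz, and xyz.
print(select_values([0.1, 0.2, 0.3]))
```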
This transformation can be used as an auxiliary construction in other combinatorial specifications. For example, we apply the described technique in Section 7.4 in order to sample from a multi-parametric class
2.3 Boltzmann samplers and the recursive method
There exists a number of different sampling techniques in the literature. In the current paper, we focus on the two most prominent general-purpose frameworks — the recursive method [NW78] and Boltzmann samplers [Duc+04]. In what follows, we focus specifically on their distinguishing features.
2.3.1 The recursive method
The three basic building blocks of the recursive method are, similarly to the symbolic method and admissible classes, disjoint union, Cartesian product, and pointing. In this framework, the counting sequences of the combinatorial classes need to be readily available, as they determine the branching probabilities of the random generation process. The three operations are processed as follows.
Disjoint union. If the counting sequence $(C_n)_{n \geqslant 0} $ of the target class $\mathcal{C} $ is given by $C_n = A_n + B_n $ where $A_n $ and $B_n $ are the counting sequences of the classes $\mathcal{A} $ and $\mathcal{B} $ , respectively, then with probability $\frac{A_n}{A_n + B_n} $ an object from $\mathcal{A} $ is constructed, otherwise an object from class $\mathcal{B} $ is constructed.
Cartesian product. If the counting sequence of the target class $\mathcal{C} $ satisfies the equation $C_n = \sum_{k = 0}^n A_k \cdot B_{n-k} $ , then a tuple of sizes $(k, n-k) $ is chosen with probability $\frac{A_kB_{n-k}}{C_n} $ and objects from $\mathcal{A} $ and $\mathcal{B} $ of respective sizes k and $n-k $ are generated.
Pointing. If $\mathcal{A} $ is a combinatorial class, then an object in the pointed class $\Theta \mathcal{A} $ is isomorphic to an object from $\mathcal{A} $ with one distinguished atom. Now, if the counting sequence of $\mathcal{A} $ is $A_n $ , then the counting sequence of $\Theta \mathcal{A} $ is given by $n \cdot A_n$ . Given a sampler for one of $\mathcal{A} $ or $\Theta \mathcal{A} $ , it is easy to obtain the other one, either by distinguishing a label uniformly at random or, the other way round, by forgetting which label is distinguished.
Note that the pointing operator substantially enriches the set of admissible specifications and plays a central rôle in the design of recursive samplers. For example, the two labelled operations ${\rm S{\small ET}} $ and ${\rm C{\small YC}}$ are bijectively transformed using the pointing operation as follows:
We are not aware of a direct and efficient sampling algorithm based on the recursive method for unlabelled structures involving the ${\rm MS{\small ET}}$ and ${\rm C{\small YC}}$ operators.
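To fix ideas, here is a minimal Python sketch (our own illustration) of the recursive method for plane binary trees specified by $\mathcal{B} = \mathcal{Z} + \mathcal{B} \times \mathcal{B}$ , combining the disjoint union and Cartesian product rules; the output is uniform among trees with exactly n leaves:

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def count(n):
    # number of plane binary trees with n leaves; B_1 = 1 and
    # B_n = sum_k B_k * B_{n-k} (shifted Catalan numbers)
    if n == 1:
        return 1
    return sum(count(k) * count(n - k) for k in range(1, n))

def sample(n):
    if n == 1:
        return "Z"
    # Cartesian product rule: pick the split (k, n - k) with
    # probability count(k) * count(n - k) / count(n)
    r = random.randrange(count(n))
    for k in range(1, n):
        weight = count(k) * count(n - k)
        if r < weight:
            return (sample(k), sample(n - k))
        r -= weight

print(sample(8))
```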
2.3.2 Boltzmann samplers
While the recursive method is applicable to specifications irrespective of the analyticity of their generating functions, Boltzmann samplers work only with classes whose generating functions are analytic. However, unlike the recursive method, Boltzmann samplers do not require the underlying counting series. Instead, they rely on the values of the generating functions. As a consequence, the size of the outcome structure is no longer fixed, but follows a Boltzmann distribution $\mathbb P(\text{size} = k) = \frac{A_kz^k}{\sum_{n \geqslant 0} A_n z^n} $ with z being its parameter. Nevertheless, conditioned on size, such samplers generate a uniformly chosen object from the target class.
Boltzmann samplers support three basic operations, that is disjoint union, Cartesian product, and substitution. For instance, assuming that $\varphi(x) $ is analytic, we can consider a family of simply generated trees satisfying
Let us remark, however, that while substitutions of the type $\varphi(T(z)) $ are usually easy to handle, substitutions of the form $T(\varphi(z)) $ are much more involved. Note that the recursive method does not easily support substitutions. Now, let us focus on how these three basic operations are processed.
Disjoint union. Consider a target class $\mathcal{C} $ with a generating function C(x). Let $\mathcal{A} $ and $\mathcal{B} $ be two combinatorial classes with generating functions A(x) and B(x), respectively, such that $\mathcal{C} = \mathcal{A} +\mathcal{B} $ . Then, with probability $\frac{A(x)}{A(x) + B(x)} $ an object from $\mathcal{A} $ is drawn, otherwise an object from class $\mathcal{B} $ is generated.
Cartesian product. Consider a target class $\mathcal{C} $ with a generating function C(x). Let $\mathcal{A} $ and $\mathcal{B} $ be two combinatorial classes with generating functions A(x) and B(x), respectively, such that $\mathcal{C} = \mathcal{A}\times \mathcal{B} $ . Then, an independent pair of recursively generated objects from $\mathcal{A} $ and $\mathcal{B} $ is drawn.
Substitution. Let $C(x) = \varphi(B(x)) $ where $\varphi(t) = \sum_i \phi_i t^i $ . Then, a Boltzmann sampler for $\mathcal{C} $ is obtained as follows. Fix $t = B(x)$ and sample a random integer k from the distribution $\mathbb P(k) =\phi_k t^k / \varphi(t) $ . Finally, draw k independent copies of recursively sampled objects from $\mathcal{B} $ .
The substitution rule, in particular, provides Boltzmann samplers for the labelled ${\rm S{\small ET}} $ , ${\rm S{\small EQ}} $ and ${\rm C{\small YC}}$ constructions. It turns out that such samplers also cover a wide and interesting family of combinatorial classes and constructions, including unlabelled structures, first-order differential specifications, Hadamard products and Dirichlet generating series, see [Bod10] for further details.
We summarise the most common rules in Tables 1, 2, and 3. We write $X \implies \Gamma $ to denote a procedure generating random objects from $\Gamma $ based on the random discrete distribution X – we draw an integer r from X and then, repeatedly and independently, invoke the respective sampler $\Gamma$ r times. As a result, we obtain an r-tuple of sampled objects. For more details we refer the reader to [Duc+04] and [FFP07].
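As a concrete illustration (our own sketch, independent of the cited implementations), the following Python procedure combines the disjoint union and Cartesian product rules into a Boltzmann sampler for plane binary trees $\mathcal{B} = \mathcal{Z} + \mathcal{B} \times \mathcal{B}$ , whose generating function satisfies $B(x) = x + B(x)^2$ :

```python
import math
import random

def gamma_binary_tree(x):
    # B(x) = (1 - sqrt(1 - 4x)) / 2 solves B = x + B^2 for 0 < x <= 1/4.
    B = (1 - math.sqrt(1 - 4 * x)) / 2
    # Disjoint union rule: a leaf with probability x / B(x),
    # otherwise an independent pair of recursive calls (product rule).
    if random.random() < x / B:
        return "Z"
    return (gamma_binary_tree(x), gamma_binary_tree(x))

def size(tree):
    return 1 if tree == "Z" else size(tree[0]) + size(tree[1])

# The closer x is to the singularity 1/4, the larger the expected size;
# conditioned on its size, the output tree is uniform.
print(size(gamma_binary_tree(0.24)))
```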
2.3.3 Multi-parametric Boltzmann samplers
Consider a multi-parametric combinatorial class $\mathcal{C} $ with a multivariate generating function $C(\boldsymbol{z}) $ . Let $\omega \in \mathcal{C} $ be a combinatorial object with composition size $\boldsymbol{p} $ . Then, a multi-parametric Boltzmann sampler $\Gamma \mathcal{C}(\boldsymbol{z}) $ outputs $\omega$ with probability
Such samplers can be constructed from multi-parametric combinatorial specifications in the same way as ordinary Boltzmann samplers. Whenever the expressions of the generating functions involve the tuning variables, these variables, after substitution, yield new branching probabilities.
Proposition 7 (Log-exp transform of the tuning problem). Let $\boldsymbol{N} = (N_1, \ldots, N_k) $ be the random vector where $N_i $ equals the number of atoms of type $\mathcal{Z}_i$ in a random combinatorial structure returned by the k-parametric Boltzmann sampler $\Gamma\mathcal{C}(\boldsymbol{z})$ . Then, the expectation vector $\mathbb E_\boldsymbol{z}(\boldsymbol{N}) $ and the covariance matrix $\mbox{Cov}_\boldsymbol{z}(\boldsymbol{N})$ are given by
Hereafter, we use $e^\boldsymbol{z} $ to denote coordinate-wise exponentiation.
Proof. Let the following nabla notation denote the vector of derivatives (so-called gradient vector) with respect to the variable vector $\boldsymbol{z} =(z_1, \ldots, z_k) $ :
Let $(c_\boldsymbol{n})_{\boldsymbol{n} \succeq \boldsymbol 0} $ be the counting sequence of the combinatorial class $\mathcal{C} $ . The probability generating function $p(\boldsymbol{u} | \boldsymbol{z}) $ for $\boldsymbol{N} $ with $\boldsymbol{u} $ as an argument and $\boldsymbol{z} $ as a parameter takes the form
where $\bullet $ denotes component-wise vector multiplication. Hence, the expected value and the covariance of $\boldsymbol{N} $ can be immediately expressed through its probability generating function as
The proof is finished by expanding the log-exp transform in (14) and comparing the result to the expressions obtained from the probability generating functions.
Remark 8. The function $\gamma(\boldsymbol{z}) := \log C(e^\boldsymbol{z}) $ is convex because its matrix of second derivatives, as a covariance matrix, is positive semi-definite inside the set of convergence. This crucial assertion will later prove central to the design of our tuning algorithm. The expressions for the expectation and the covariance matrix are similar to those obtained in [BR83] for the central limit theorem for multivariate generating functions.
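As a quick univariate illustration (a side computation of ours), take binary words with $C(z) = \frac{1}{1-2z}$ . Then $\gamma(\zeta) = \log C(e^\zeta) = -\log(1-2e^\zeta)$ for $e^\zeta < \frac12$ , so that $\gamma'(\zeta) = \frac{2e^\zeta}{1-2e^\zeta}$ recovers the expected Boltzmann size, while $\gamma''(\zeta) = \frac{2e^\zeta}{(1-2e^\zeta)^2} > 0$ is its variance, confirming convexity.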
Remark 9. The uni-parametric recursive samplers of Nijenhuis and Wilf take, just like Boltzmann samplers, a system of generating functions as their input. This system can be modified by substituting fixed values of the tuning variables, in effect altering the corresponding branching probabilities. The resulting distribution of the random variable corresponding to a weighted recursive sampler coincides with the distribution of the Boltzmann-generated variable conditioned on the structure size. As a corollary, the tuning procedure that we discuss in the following section is also valid for exact-size, approximate-frequency recursive sampling.
2.4 Complexity of exact parameter sampling
While the current paper is devoted to tuning the parameters in expectation, let us pause for a moment and ask the following natural question – what is the complexity of exact-parameter sampling for multi-parametric combinatorial specifications?
In the current section, we show that, unless the class $\mbox{RP}$ of decision problems solvable in randomised polynomial time equals the class $\mbox{NP}$ of those solvable in nondeterministic polynomial time, already for unambiguous context-free languages there exists no fully polynomial-time algorithm for the almost-uniform exact-parameter sampling problem. Since it is widely conjectured that $\mbox{RP} \neq \mbox{NP}$ , cf. [WG01], this infeasibility result justifies parameter tuning in expectation, which can be regarded as a continuous relaxation of the exact-parameter problem variant.
Consider a context-free grammar G with derivation rules of the form
where $A_i $ is a non-terminal symbol, and the right-hand side expression $T_{i,j} $ is a (possibly empty) word consisting of both terminal and non-terminal symbols. Recall that the context-free grammar G is said to be unambiguous if each word it generates has a unique derivation. Let $a_1, \ldots, a_d $ be distinct terminal symbols. The exact multi-parametric sampling problem for unambiguous context-free grammars can be stated as follows – given natural numbers $n_1, n_2, \ldots, n_d $ , sample uniformly at random a word of length $n = n_1 + \cdots + n_d $ from the language L(G) generated by G, such that the number of occurrences of each terminal symbol $a_j $ is equal to $n_j $ .
Example 10. As a simple illustrating example of the discussed problem, consider the following grammar B generating (unambiguously) all binary words over the alphabet $\Sigma = \{{\texttt{0},\texttt{1}}\}$ :
Recall that $\varepsilon $ denotes the empty word. In this example, given numbers $n_0$ and $n_1$ , the multi-parametric sampling problem asks to generate a uniformly random binary word over $\Sigma$ with exactly $n_0$ occurrences of $\texttt{0}$ and $n_1$ occurrences of $\texttt{1}$ .
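For this toy grammar the answer is classical (we recall it only for intuition): the bivariate generating function is $B(z_0, z_1) = \frac{1}{1 - z_0 - z_1}$ , hence the number of admissible words is $[z_0^{n_0} z_1^{n_1}]\, B(z_0, z_1) = \binom{n_0 + n_1}{n_0}$ , and exact-parameter sampling amounts to a uniformly random shuffle of $n_0$ $\texttt{0}$ s and $n_1$ $\texttt{1}$ s. As we now show, no such efficient scheme can exist for general unambiguous context-free grammars.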
In what follows we prove that, in general, the problem of exact-size multi-parametric sampling (even if the specification does not involve loops) is at least as hard as the $\#P$ -complete $\#2$ -SAT problem, which asks to count the number of satisfying variable assignments of a given 2-CNF formula. As suggested to us by Sergey Tarasov in personal communication, such a complexity result might be folklore knowledge; however, we did not manage to find it in the literature. We therefore take the liberty to fill this gap. Detailed definitions from complexity theory can be found in the papers referenced during the proof of the following theorem.
Theorem 11 (Infeasibility of exact parameter sampling). Unless $\mbox{NP} = \mbox{RP} $ , there is no fully polynomial-time algorithm for almost-uniform multi-parametric sampling from unambiguous context-free grammars.
Proof. We start by showing that extracting the coefficients $[z_1^{k_1} \cdots z_m^{k_m}] F_i(z_1,\ldots,z_m)$ of a multivariate generating function $F_i(z_1,\ldots,z_m)$ satisfying a system of polynomial equations of the form
is $\#P$ -hard. We proceed by reduction from the $\#P$ -complete $\#2$ -SAT problem.
Consider a 2-SAT formula
with n Boolean variables $x_1,\ldots,x_n$ and m clauses. For each clause $(\alpha_j \lor \beta_j)$ we create a distinct (complex) variable $c_j$ . Next, for each literal $x \in \{{x_1,\ldots,x_n, \overline x_1, \ldots, \overline x_n}\}$ we introduce a corresponding multivariate generating function $X(c_1,\ldots,c_m)$ (where $X \in \{ X_1, \ldots, X_n, \overline X_1, \ldots, \overline X_n \}$ ) defined as
where $x \mapsto \textbf{1}_{A}(x)$ denotes the indicator function with respect to a set A (i.e. $\textbf{1}_{A}(x) = 1$ if $x \in A$ , and $\textbf{1}_{A}(x) = 0$ if $x \not\in A$ ). Finally, we create a multivariate generating function
Clearly, all of these generating functions form a polynomial system of equations. Intuitively, generating functions $X_i(c_1,\ldots,c_m)$ are products of variables $c_j$ corresponding to clauses which become satisfied once the respective literal $x_i$ is true. Furthermore, $H(c_1,\ldots,c_m)$ encodes all possible variable assignments. Indeed, if we expand all brackets in (23) we obtain $2^n$ summands, each corresponding to a distinct variable assignment – occurrences of $X_i(c_1,\ldots,c_m)$ in each summand encode setting the respective variable $x_i$ to true, whereas $\overline X_i(c_1,\ldots,c_m)$ encode setting the respective variable $x_i$ to false.
Let us consider an arbitrary summand $H_\varphi$ in the expanded (23) corresponding to some variable assignment $\varphi \colon \{{x_1,\ldots,x_n}\} \to \{ {True}, {False} \}$ . Note that once we unfold the respective definitions of $X_i(c_1,\ldots,c_m)$ and $\overline X_i(c_1,\ldots,c_m)$ , $H_\varphi$ becomes a monomial in the variables $c_j$ corresponding to the clauses satisfied by $\varphi$ . Each clause $(\alpha_j \lor \beta_j)$ in F consists of two literals, hence its corresponding variable $c_j$ can occur at most twice in $H_\varphi$ . Consequently, the number of assignments satisfying F is equal to
that is, to the number of monomials $H_\varphi$ in which each clause variable occurs at least once. In order to obtain the number (24) of satisfying assignments using a single, exact coefficient extraction, we note that
And so, extracting the coefficients of a generating function is as hard as solving a $\#2 $ -SAT instance. Nevertheless, it should be noticed that not every hard enumeration problem automatically corresponds to a hard uniform random sampling problem.
In order to complete the proof, we use the celebrated [JVV86, Theorem 6.4], which proves that if there exists a fully polynomial almost-uniform random sampling algorithm (with an exponentially small error), then there exists a fully polynomial randomised approximation scheme for the corresponding counting problem as well (within a polynomially small error). Moreover, unless $\mbox{RP} = \mbox{NP}$ , there exists no fully polynomial randomised approximation scheme for $\#2$ -SAT [WG01]. Hence, the theorem statement must indeed hold.
3. Multi-parametric tuning
A combinatorial specification typically involves several classes $\mathcal{C}_1,\ldots, \mathcal{C}_m $ which are defined in a mutually recursive manner. Let us denote $(\mathcal{C}_1, \ldots, \mathcal{C}_m) $ as $\boldsymbol{\mathcal{C}} $ . For tuning and sampling purposes, one particular class in $\boldsymbol{\mathcal{C}} $ is chosen. For example, consider a system
Suppose that we want to sample the objects from the class $\mathcal C_1 $ and we fix the expected values of parameters $\mathcal{Z}_1, \ldots, \mathcal{Z}_k $ to be $N_1, \ldots, N_k $ , respectively. Let $C_1(z_1, \ldots, z_k), \ldots,C_m(z_1, \ldots, z_k) $ be the related generating functions. Then, the system of polynomial equations corresponding to the tuning problem (see e.g. [Reference Duchon, Flajolet, Louchard and SchaefferDuc+04, Proposition 2.1]) is given by
Note that, in general, the values of the tuning parameters cannot be obtained by independently solving each of the equations (27). Each of the functions depends on all of the arguments simultaneously (see also Figure 1). Hence, we propose an alternative procedure to achieve this goal.
From a technical point of view, the easiest case for theoretical analysis is combinatorial specifications corresponding to unambiguous context-free grammars. Here, the system defining the generating functions becomes a well-founded system of polynomial equations, see [PSS12]. The most general framework of Boltzmann sampling comprises many more cases, including labelled objects, Pólya structures, and first- and second-order differential specifications. There also exist specifications whose equations include subtractions (related to the inclusion-exclusion principle) or substitutions; in these cases, different sampling strategies should be applied, for example recursive sampling or sampling with rejection, see Section 2.3. In this section, we are only concerned with tuning, setting all of these sampling issues aside.
3.1 Tuning as a convex optimisation problem
It turns out that instead of solving the system of polynomial equations (27) involving the derivatives of the generating function, a simpler convex optimisation problem, involving only the values of the generating functions, can be considered. Access to the derivatives of this function comes as an advantage, because it allows the use of first-order subroutines to solve the optimisation problem. The following theorem contains the most general form of our tuning approach. For convenience, we will write $f({\cdot})\to \min_\boldsymbol{z} $ , $f({\cdot}) \to \max_\boldsymbol{z} $ to denote the minimisation (maximisation, respectively) problem of the target function $f({\cdot}) $ with respect to the vector variable $\boldsymbol{z} $ .
Theorem 12 (Tuning as convex optimisation). Let $F(z_1, \ldots, z_d) $ be a formal power series with non-negative coefficients analytic in an open d -dimensional set $\Omega \subseteq \{z_1 > 0, \ldots, z_d > 0 \} $ . Assume that the solution of the multi-parametric tuning problem
belongs to $\Omega $ .
Let $\boldsymbol{N}$ denote the vector $(N_1,\ldots,N_d)$ . Then, the tuning problem (28) is equivalent to the following convex optimisation problem over real variables $\varphi $ and $\boldsymbol{\zeta} = (\zeta_1, \ldots, \zeta_d)$ :
provided that the arguments of F meet its domain, and the logarithm is well-defined. The respective tuning parameters $\boldsymbol{z}^*$ then satisfy $\boldsymbol{z}^* = e^{\boldsymbol{\zeta}}$ .
Proof. Using a log-exp transformation, we note that the tuning problem (28) is equivalent to $\nabla_{\boldsymbol{\zeta}} \log F(e^{\boldsymbol{\zeta}}) = \boldsymbol{N}$ , cf. Proposition 7. Since the right-hand side vector $\boldsymbol{N}$ is equal to $\nabla_{\boldsymbol{\zeta}}(\boldsymbol{N}^{\top}\boldsymbol{\zeta})$ , the tuning problem is further equivalent to
Note that the function under the gradient is a sum of a convex function and a linear function, and is therefore convex itself. We can therefore equivalently express (30) as a convex minimisation problem of the form
This problem can be reduced to a standard form (29) by adding an auxiliary variable $\varphi $ .
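As a small worked instance (our own illustration), take binary words with $F(z) = \frac{1}{1-2z}$ and target expected size N. Problem (29) becomes $\varphi - N\zeta \to \min$ subject to $\varphi \geqslant -\log(1 - 2e^{\zeta})$ ; at the optimum the constraint is tight and the stationarity condition reads $\frac{2e^{\zeta}}{1-2e^{\zeta}} = N$ , giving $z^* = e^{\zeta} = \frac{N}{2(N+1)}$ , which approaches the singularity $\frac12$ as $N \to \infty$ , as expected.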
Remark 13. We do not require the exponents of the generating function to be non-negative integers. Depending on the specific application, they might, for instance, be positive real or rational numbers [Ham+19]. The non-negativity requirement on the exponents can be omitted as well, as illustrated by the following univariate example. Let the exponents of the generating function F(z) belong to the set A, which may potentially include negative elements. Then, the log-exp transform of the function F(z) is a composition of a convex function and a set of linear functions, which is again convex:
In contrast, it is crucial that the coefficients of the generating functions remain non-negative, as otherwise the logarithm of a sum of exponentials with potentially negative weights $a_s $ ceases to be convex.
In subsequent sections we show that having a combinatorial specification for the generating function $F(z_1, \ldots, z_d) $ is a great advantage – instead of requiring additional oracles providing the values of the generating function and its derivatives, a more direct approach is available.
3.2 Unambiguous context-free grammars
In the current section, we refine our general tuning procedure for regular and unambiguous context-free specifications, avoiding any use of external oracles. Recall that such systems are often used in connection with the Drmota–Lalley–Woods framework (see [FS09, Section VII.6]). However, contrary to [FS09, Section VII.6], we do not distinguish between the linear and non-linear cases. Instead, we develop a general result covering both. Also note that we do not require the system to be strongly connected, replacing this condition by the weaker requirement that every state is reachable from the initial one.
Theorem 14 (Tuning with finite parameter expectation). Let $\boldsymbol{\mathcal{C}} = \boldsymbol \Phi(\boldsymbol{\mathcal{C}}, \boldsymbol{\mathcal{Z}}) $ be a multi-parametric system with $\boldsymbol{\mathcal{C}} = \left({\mathcal{C}_1,\ldots,\mathcal{C}_m}\right) $ and $\boldsymbol \Phi = \left({\Phi_1, \ldots, \Phi_m}\right) $ , where all the functions $\Phi_1, \ldots, \Phi_m $ are positive polynomials. Suppose that in the dependency graph corresponding to $\boldsymbol \Phi $ all the states are reachable from the initial state $\mathcal C_1 $ . Let $\boldsymbol{N} = (N_1, \ldots, N_k) $ be the random vector counting the atom occurrences of each type in objects sampled from $\mathcal C_1 $ , and fix its target expectation $\mathbb E_\boldsymbol{z} \boldsymbol{N} = \boldsymbol \nu $ . Then, the tuning vector $\boldsymbol{z}$ is equal to $e^{\boldsymbol \xi} $ where $\boldsymbol \xi $ comes from the convex problem:
Hereafter, $e^{\boldsymbol \xi} $ and $\log \boldsymbol \Phi $ denote coordinate-wise exponentiation and logarithm, respectively.
Proof. Consider the vector $\boldsymbol{z}^\ast $ such that $\mathbb E_{\boldsymbol{z}^\ast} (\boldsymbol{N}) = \boldsymbol \nu. $ Let $\boldsymbol{c} $ denote the logarithms of the values of generating functions at point $\boldsymbol{z}^\ast = e^{\boldsymbol \xi^\ast} $ . Clearly, for such a choice of the vectors $\boldsymbol{c} $ and $\boldsymbol \xi = \boldsymbol \xi^\ast $ all inequalities in (32) become equalities.
Let us show that if the point $(\boldsymbol{c}, \boldsymbol \xi ) $ is optimal, then all the inequalities in (32) become equalities. Firstly, consider the case when the inequality
does not turn to an equality. Certainly, there is a gap and the value $c_1 $ can be decreased without affecting the validity of other inequalities. In doing so, the target function value is decreased as well. Hence, the point $(\boldsymbol{c}, \boldsymbol \xi) $ cannot be optimal.
Now, suppose that the initial inequality does turn into an equality, however $c_k >\log \Phi_k(e^\boldsymbol{c}, e^{\boldsymbol \xi}) $ for some $k \neq 1$ . Since each of the states is reachable from the initial state $\mathcal C_1 $ , there exists a path $P = c_1 \to c_2 \to \cdots \to c_k $ (indices are chosen without loss of generality) in the corresponding dependency graph. Note that for pairs of consecutive variables $(c_i, c_{i+1})$ in P, the function $\log\Phi_i(e^\boldsymbol{c}, e^{\boldsymbol \xi})$ is strictly monotonic in $c_{i+1}$ (as it is obtained as a log-exp transform of a positive polynomial and it references $c_{i+1}$ ). In such a case we can decrease $c_{i+1}$ so as to ensure that $c_i >\log \Phi_i(e^\boldsymbol{c}, e^{\boldsymbol \xi})$ while the point $(\boldsymbol{c}, \boldsymbol \xi) $ remains feasible. Decreasing $c_{i+1},c_i,\ldots,c_1$ in order, we finally arrive at a feasible point with a decreased target function value. In consequence, $(\boldsymbol{c}, \boldsymbol \xi) $ could not have been optimal to begin with.
So, eventually, the optimisation problem reduces to minimising the target expression subject to the system of equations $\boldsymbol{c} = \log \boldsymbol \Phi(e^\boldsymbol{c}, e^{\boldsymbol \xi})$ or, equivalently, $\boldsymbol C(\boldsymbol{z}) = \boldsymbol \Phi(\boldsymbol C(\boldsymbol{z}), \boldsymbol{z})$ , and can therefore be further reduced to Theorem 12.
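To make the construction concrete, here is a minimal sketch (ours, not the actual $\textsf{Paganini}$ implementation, which may differ) of the above convex problem in cvxpy, for binary trees $\mathcal{B} = 1 + \mathcal{Z} \times \mathcal{B}^2$ tuned to a target expected size N. In log-exp variables $b = \log B(e^{\xi})$ , the constraint $b \geqslant \log(1 + e^{\xi + 2b})$ is convex:

```python
import math
import cvxpy as cp

N = 1000                       # target expected number of Z-atoms
b, xi = cp.Variable(), cp.Variable()

problem = cp.Problem(
    cp.Minimize(b - N * xi),
    # log(1 + e^x) is cvxpy's 'logistic' atom; convex <= affine is a
    # valid constraint, encoding b >= log Phi(e^b, e^xi), Phi = 1 + z B^2
    [cp.logistic(xi + 2 * b) <= b],
)
problem.solve()

z = math.exp(xi.value)
print(z)                       # tends to the singularity 1/4 as N grows
print(math.exp(-b.value))      # Boltzmann probability of generating a leaf
```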
Remark 15. Let us note that the above theorem extends to the case of labelled structures with ${\rm S{\small ET}} $ and ${\rm C{\small YC}}$ operators. For unlabelled Pólya operators like ${\rm MS{\small ET}}$ or ${\rm C{\small YC}}$ , we have to truncate the specification to bound the number of substitutions. In consequence, it becomes possible to sample corresponding unlabelled structures, including partitions, functional graphs, series-parallel circuits, etc. For more details, see Section 6.1.
Singular Boltzmann samplers (also defined in [Duc+04]) are the limit variant of ordinary Boltzmann samplers, with an infinite expected size of generated structures. In their multivariate version, samplers are considered singular if their corresponding variable vectors belong to the boundary of the respective convergence sets. We present an alternative way to tune such singular samplers within the Drmota–Lalley–Woods framework [FS09, Section VII.6], provided that their corresponding dependency graphs are strongly connected.
Theorem 16 (Tuning with infinite parameter expectation). Let $\boldsymbol{\mathcal{C}} = \boldsymbol \Phi(\boldsymbol{\mathcal{C}}, \mathcal{Z},\boldsymbol{\mathcal{U}}) $ be a strongly connected multi-parametric system of positive polynomial equations with $\boldsymbol{\mathcal{C}} = \left({\mathcal{C}_1,\ldots,\mathcal{C}_m}\right) $ , the atomic class $\mathcal{Z}$ marking the corresponding structure size, and $\boldsymbol{\mathcal{U}} = \left({\mathcal{U}_1,\ldots,\mathcal{U}_k}\right) $ being a vector (possibly empty) of distinguished atoms. Assume that the target expected frequencies of the atoms $\mathcal U_i $ are given by the vector $\boldsymbol \alpha $ . Then, the variables $(z, \boldsymbol{u}) $ that deliver the tuning of the corresponding singular Boltzmann sampler are the result of the following convex optimisation problem, where $z = e^{\xi} $ , $\boldsymbol{u} = e^{\boldsymbol \eta} $ :
Proof. By similar reasoning as in the previous proof, we can show that the maximum is attained when all the inequalities turn to equalities. Indeed, suppose that at least one inequality is strict, say $c_j > \log \Phi_j(e^\boldsymbol{c}, e^{\xi}, e^{\boldsymbol \eta}) $ .
Because the right-hand side of each inequality is monotonic with respect to $c_j$ , the value $c_j $ can be slightly decreased by choosing a sufficiently small distortion $\varepsilon > 0 $ , turning all the equalities containing $c_j $ in the right-hand side $\log \Phi_i(e^\boldsymbol{c}, e^{\xi}, e^{\boldsymbol \eta}) $ into strict inequalities. Clearly, we can repeat this process until all equalities turn into inequalities.
Finally, let us focus on the target function. Again, because the right-hand side of each inequality is monotonic with respect to $\xi$ , we can slightly increase its value, and thereby the target function, while remaining inside the feasible set.
Let us fix $\boldsymbol{u} = e^{\boldsymbol \eta} $ . For rational and algebraic grammars within the Drmota–Lalley–Woods framework, see for instance [Drm97], the corresponding singular approximation of the generating function takes the form
If $t < 0 $ , then the asymptotically dominant term becomes $-b_0 \left( 1 -\frac{z}{\rho(\boldsymbol{u})} \right)^{t}$ . In this case, tuning the target expected frequencies corresponds to solving the following equation as $z \to \rho(u)$ :
Let us substitute the asymptotic expansion (35) into (36) to track how $\boldsymbol{u} $ depends on $\boldsymbol \alpha $ :
Only dominant terms are accounted for. Then, by the binomial theorem
With $z = \rho(\boldsymbol{u}) $ , as $n \to \infty $ , we obtain after cancellations
which can be rewritten as
Passing to exponential variables (40) becomes
As we already discovered, the dependence $\xi(\boldsymbol \eta) $ is given by the system of equations, because the maximum is achieved only when all inequalities turn into equalities. That is, tuning the singular sampler is equivalent to maximising $\xi + \boldsymbol \alpha^\top \boldsymbol \eta $ over the set of feasible points.
Remark 17. For ordinary and singular samplers, the corresponding feasible set remains the same; what differs is the optimised target function. Singular samplers correspond to imposing an infinite target size. In practice, however, the required singularity is almost never known exactly, but rather calculated up to some feasible finite precision. The tuned structure size is therefore enormously large, but nevertheless finite. In this context, singular samplers provide a natural limiting understanding of the tuning phenomenon and, as such, there are several possible ways of proving Theorem 16.
Figure 2 illustrates the feasible set for the class of binary trees and its transition after applying the log-exp transform, turning the set into a convex collection of feasible points. In both figures, the singular point is the rightmost point on the plot. Ordinary sampler tuning corresponds to finding the tangent line which touches the set, given the angle between the line and the abscissa axis.
Remark 18. As an interesting by-product, our tuning algorithm provides a way of obtaining the singularities of a system and the values of the generating functions at the point of this singularity.
Remark 19. It is also possible to consider singular tuning in the case of non-strongly connected specifications. However, practically speaking, it should be noted that the notion of a singular sampler for non-strongly connected specifications such as
is ambiguous – the singular samplers for T and F admit different values of the variable z. For T it is $z = e^{-1} $ , whereas for F we have $z = e^{-1/2}/2 $ .
If the substitutions in the dependency graph of the specification include only subcritical compositions (see [FS09, Section VI.9]), it is possible to incrementally tune its parts in topological order. Further theoretical analysis of singular samplers involving supercritical compositions is somewhat more complicated, but can be developed as well. Nevertheless, the easiest practical way to derive a singular tuner is to tune the target class with a large, yet finite object size.
3.3 Labelled and unlabelled structures
Systems originating from labelled specifications, that is, those whose generating functions are typically of exponential type, feature such admissible operators as ${\rm S{\small ET}}$ or ${\rm C{\small YC}}$ , both in their unrestricted and cardinality-restricted variants. Consider a labelled multi-parametric combinatorial class $\mathcal A$ with a generating function $A(\boldsymbol{z}) $ . Then, the resulting exponential generating functions obtained using these operators take the form
Classes whose definition involves one of the above operators can be incorporated into the convex optimisation problem using a log-exp transformation $F(z)\mapsto e^{\varphi} $ , similarly to the case of context-free grammars, see Section 3.2. Broadly speaking, the application of such operators results in a composition with one of the basic functions
expressing, respectively, the exponential generating functions for the class of labelled sets with k elements, labelled cycles with k elements, and both unrestricted labelled sets and cycles.
On the other hand, ordinary generating functions, used for the enumeration of unlabelled structures, feature such operators as ${\rm MS{\small ET}}$ , ${\rm PS{\small ET}}$ , and ${\rm C{\small YC}}$ , standing for the multiset, set, and cycle constructions, respectively. These, in contrast, are evaluated differently from their labelled counterparts. Specifically, if applied to a class with an ordinary generating function $A(\boldsymbol{z}) = \sum_{\boldsymbol{n} \succ \boldsymbol 0} a_\boldsymbol{n} \boldsymbol{z}^\boldsymbol{n}$ , we obtain, respectively,
Here $\varphi(k) $ denotes Euler’s totient function.
Let us note that both the first and the third expressions, after log-exp transformations, become convex. The resulting infinite series can then be truncated at a finite threshold, given the fact that if the corresponding singular value satisfies $\boldsymbol{\rho} < 1$ , the sequence $A(\boldsymbol{z}^k) $ converges at geometric speed to $A(\boldsymbol{0}) $ . Such a truncation is common practice and has been applied a number of times in the context of sampling unlabelled structures, see for example [BLR15, FFP07]. A more detailed discussion regarding these transformations is given in Section 6.1. The sketch below illustrates the truncation on a concrete example.
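For instance, for integer partitions $\mathcal{P} = {\rm MS{\small ET}}({\rm S{\small EQ}}_{\geqslant 1}(\mathcal{Z}))$ one has $\log P(e^{s}) = \sum_{k \geqslant 1} \frac{1}{k} A(e^{ks})$ with $A(z) = \frac{z}{1-z}$ , which expands into $\sum_{n \geqslant 1} \frac{\sigma(n)}{n} e^{ns}$ . The following self-contained Python sketch (our own illustration, independent of the discussed implementations) truncates this series and tunes the expected partition size by bisection, exploiting the monotonicity of the derivative:

```python
import math

K = 20000  # truncation threshold; the tail decays geometrically for z < 1

def divisor_sums(n_max):
    # sigma(n) = sum of divisors of n, for all n <= n_max
    s = [0] * (n_max + 1)
    for d in range(1, n_max + 1):
        for m in range(d, n_max + 1, d):
            s[m] += d
    return s

SIGMA = divisor_sums(K)

def expected_size(s):
    # d/ds log P(e^s) = sum_n sigma(n) e^{ns}, truncated at n <= K
    return sum(SIGMA[n] * math.exp(n * s) for n in range(1, K + 1))

def tune(target):
    # expected_size is increasing in s, so bisect on s = log z
    lo, hi = -1.0, -1e-9
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if expected_size(mid) < target else (lo, mid)
    return math.exp((lo + hi) / 2)

print(tune(1000))  # tuned value of z, just below the singularity z = 1
```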
Unfortunately, the remaining ${\rm PS{\small ET}}$ operator, in its aforementioned series form with negative coefficients (45), does not easily fit the convex programming framework. In principle, an alternative form could be used
though in this case, it is not clear how to convert the problem into a polynomially tractable form.
3.4 Increasing trees
Boltzmann samplers for first- and second-order differential specifications have been developed in [BRS12, Bod+16]. In particular, in [BRS12], the authors solve the problem of Boltzmann sampling from specifications of type $\mathcal{T}^{\prime} = \mathcal{F}(\mathcal{Z}, \mathcal{T})$ . There exist several particular cases of such a differential specification which admit an explicit solution of the corresponding differential equation. For example, a differential equation is said to be autonomous if it does not depend explicitly on the independent variable, that is $\mathcal{T}^{\prime} = \mathcal{F}(\mathcal{T}) $ . In this case, the underlying differential equation can be solved by separation of variables:
The final solution T(z) is obtained by inverting z(T), which, in the case of a differential equation of combinatorial origin, can be done using binary search. A classical instance is recalled below.
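For illustration (a standard textbook computation), consider increasing binary trees with $F(T) = 1 + T^2$ and $T(0) = 0$ . Separation of variables gives $z(T) = \int_0^T \frac{dt}{1+t^2} = \arctan T$ , hence $T(z) = \tan z$ , with the dominant singularity $\rho = \pi/2$ .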
Several different strategies for evaluating $\int\frac{1}{F(t)} $ may also be suggested. For instance, if F(t) is a polynomial in t, all its complex roots (together with their multiplicity structure) can be efficiently found through a numerical procedure, see [Zen05]. In the context of multi-parametric tuning, we are mostly interested in the case when $F(T) $ also depends on auxiliary variables $\boldsymbol{u}=(u_1, \ldots, u_d) $ , which need to be tuned. In this case, the roots of the polynomial F(T) become dependent on $\boldsymbol{u} $ .
Theorem 20 (Multi-parametric increasing trees). Suppose that the generating function $T(z, \boldsymbol{u}) $ corresponding to a family of increasing trees is described by a functional equation
and that T is a formal power series with non-negative coefficients, $\boldsymbol{u} = (u_1, \ldots, u_d) $ .
Let $N_0, N_1,\ldots,N_d$ be the expected size and the expected parameter values of $u_1, \ldots,u_d$ , respectively. Then, the tuning problem is equivalent to a convex optimisation problem over real variables $\varphi, \zeta, \eta_0, \ldots,\eta_d $ :
where the solution of the tuning problem is obtained by assigning
Proof. Following Theorem 12, multi-parametric tuning is equivalent to the following convex optimisation problem over real variables $\varphi, \zeta,\eta_1, \ldots, \eta_d $ :
where $T(z, u_1, \ldots, u_d) $ is the solution of differential equation $T^{\prime}_z(z, \boldsymbol{u}) = F(T(z, \boldsymbol{u}), \boldsymbol{u})$ with initial conditions $T(0, \boldsymbol{u}) = t_0 $ . The target solution is given by $(z, u_1, \ldots, u_d) = (e^{\zeta}, e^{\eta_1}, \ldots, e^{\eta_d}) $ and $F(z, u_1, \ldots, u_d) = e^{\varphi} $ . Let us denote by $Z(\tau, \eta_1, \ldots, \eta_d) $ the inverse function of $\log T(e^\zeta, e^{\boldsymbol \eta}) $ with respect to $\zeta $ , so that
Since T is a formal power series with non-negative coefficients, the function $\log T(e^\zeta, e^{\eta_1}, \ldots, e^{\eta_d}) $ is convex and increasing in both $\zeta $ and $\boldsymbol \eta $ . Therefore, its inverse function with respect to $\zeta $ is a concave increasing function. Moreover, the function $Z(\tau, \eta_1, \ldots, \eta_d) $ is jointly concave in all of its arguments. That is because a function $f(\boldsymbol{z}) $ is concave if and only if its hypograph $\textbf{hyp} f = \{ (y, \boldsymbol{z}) \colon y\leqslant f(\boldsymbol{z}) \} $ is a convex set.
The hypograph of the inverse function $Z(\tau, \boldsymbol \eta) $ directly corresponds to the epigraph of the initial function $\log T(e^\zeta, e^{\boldsymbol\eta}) $ . Consequently, a convex constraint $\varphi \geqslant \log T(e^{\zeta},e^{\eta_1}, \ldots, e^{\eta_d}) $ can be replaced by a different convex constraint obtained by taking the inverse with respect to $\zeta $ , resulting in a new convex optimisation problem
In order to construct $Z(\varphi, \eta_1, \ldots, \eta_d) $ , we use the solution of the autonomous differential equation, see (47). Denote by $z(t, \boldsymbol{u}) $ the inverse of the generating function $T(z, \boldsymbol{u}) $ , that is
and recall that $z(t, \boldsymbol{u}) $ is given by
Finally, a direct calculation shows that $Z(\varphi, \eta_1, \ldots, \eta_d) = \log z(e^\varphi, e^{\eta_1}, \ldots, e^{\eta_d})$ .
Remark 21. With an external oracle available, systems of differential equations involving more than one variable can still be tuned using Theorem 12. However, there is little hope of generalising Theorem 20 to multivariate differential equations, even if we assume that all the involved differential equations are autonomous. In fact, by introducing an additional dimension to the problem, any first-order system of differential equations
can be transformed into a system of autonomous differential equations. Moreover, starting from dimension four, systems of differential equations can admit solutions which exhibit chaotic behaviour, see [Reference Flajolet and SedgewickFS09, Remark VII.51].
On the other hand, any system of differential equations of arbitrarily high order
can be reduced to a system of first-order differential equations by expanding the dimension space, and therefore, Boltzmann samplers for such specifications may apply. As discussed in [Reference Bodini, Roussel and SoriaBRS12], for such systems, the first sampling step is the most expensive one, as it requires generating a random variable defined on the interval $[0, \rho) $ , where $\rho $ is the radius of convergence of the formal power series. For all the consecutive operations, the support of the required random variables is separated from $\rho $ , so a variety of approximation methods, including Runge–Kutta, can be applied.
3.5 Other types of combinatorial structures
The general technique described in Section 3.1 can be applied to any analytic multivariate generating function provided that the solution to the multiparametric tuning problem lies inside the respective domain of analyticity. It can even be applied to generating functions coming from somewhat exotic systems, including partial differential equations, systems with negative coefficients, catalytic equations, systems with non-trivial substitutions, etc. As long as an oracle providing the values and derivatives of the generating functions is available, the source of the equations is irrelevant.
Still, for most of these systems no efficient oracle is known. Typically, the following methods, which could be called oracles for simplicity, can be used for the aforementioned types of functional equations:

- Runge–Kutta methods for ordinary differential equations,
- grid approximation methods for partial differential equations,
- Gröbner bases for systems of polynomial equations, or
- plain coefficient-wise evaluation for arbitrary systems (see the sketch below).
None of these methods, however, guarantees polynomial complexity in terms of the bit length of the target size and the number of equations. When additional parameters are introduced, the approach that we propose in Section 3.1 reduces the number of equations the oracle has to solve, but adds the optimisation procedure on top. Specifically, with this approach, the tuning system of equations (28) is no longer included as an input to the oracle; the tuning parameters are chosen by the optimisation procedure, and are therefore considered fixed real numbers from the viewpoint of the oracle.
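To illustrate the last option, the following sketch (a toy example of ours, for binary plane trees specified by $B(z) = z + z\,B(z)^2$ ) evaluates the generating function by computing its coefficients up to a truncation threshold and summing the resulting series; such a plain oracle is easy to write, but offers no polynomial complexity guarantees.

```python
def oracle_B(z, N=600):
    """Plain coefficient-wise oracle for B(z) = z + z * B(z)^2 (binary plane
    trees, a toy example): compute b[n] = [z^n] B(z) for n <= N by discrete
    convolution, then evaluate the truncated series at z."""
    b = [0.0] * (N + 1)
    for n in range(1, N + 1):
        # [z^n] (z + z * B^2) = [n = 1] + [z^(n-1)] B(z)^2
        b[n] = (1.0 if n == 1 else 0.0) + sum(b[i] * b[n - 1 - i] for i in range(n))
    return sum(c * z ** n for n, c in enumerate(b))

# sanity check against the closed form (1 - sqrt(1 - 4 z^2)) / (2 z)
print(oracle_B(0.4))  # approximately 0.5
```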
Remark 22. It is worth noting that for certain combinatorial classes beyond the scope of the current paper, the tuning problem does not have a solution in the analyticity domain. The simplest non-trivial example (with an infinite number of objects in a class) is the case of unrooted trees $U(z) = \sum_{n \geqslant 0} n^{n-2} \frac{z^n}{n!} $ : near its dominant singular point $z = e^{-1} $ , the Puiseux expansion takes the form
This means that the maximal possible expected value of an unrooted tree generated from the Boltzmann distribution is equal to
However, for most examples of this kind, the Boltzmann sampler itself is not directly available, not to mention that there would be no practical motivation to apply it for expectations exceeding the maximal attainable value.
Still, there exist techniques different from Boltzmann sampling which can be used to draw objects from such classes in a controlled manner. The reason why the solution does not exist in the above example is that locally around the singularity $\rho $ , the singular term of the generating function takes the form $(1 - \frac{z}{\rho})^\alpha $ , where $\alpha > 1 $ . By repeatedly applying the pointed derivative to such a class, the singular term can be modified, and the exponent $\alpha $ is consequently replaced by a value strictly smaller than 1. From the combinatorial viewpoint, the pointed derivative serves to distinguish atoms in the underlying object, which can be made indistinguishable again after the sampling is performed. The possible drawbacks of this approach are that, firstly, successive pointing can increase the size of the grammar and, secondly, the resulting distribution is no longer Boltzmann, which is not surprising given that the tuning problem has no solution in the Boltzmann case.
It is still important to discover and understand controlled sampling techniques in the cases not covered by admissible tuning in the sense of (28): when inclusion-exclusion is used, or even in more sophisticated cases involving stretched exponentials or other kinds of essential singularities.
4. Complexity of convex optimisation
In the previous section, we have reduced the problem of multiparametric tuning to a problem of convex optimisation. In this section, we provide the complexity results, case by case. Let us introduce the key elements of the interior-point optimisation and discuss how to evaluate its complexity.
A convex optimisation problem can be expressed as minimisation of a linear multivariate function (or, in an equivalent formulation, a convex one) subject to a set of constraints, expressed as inequalities of the form $f_i(\boldsymbol{z}) \leqslant 0 $ involving multivariate convex functions $f_i(\boldsymbol{z}) $ . In addition, affine inequalities and linear equations are often treated separately. Each of these constraints describes a convex set, and therefore, an intersection of these sets is also convex. A classical convex optimisation program can be formulated as
where $f_1, \ldots, f_m $ are convex functions of the vector argument $\boldsymbol{z} $ . Such programs are widely believed to be solvable in polynomial time, due to the popularisation of interior-point methods. Nevertheless, a detailed answer regarding the algorithmic complexity of convex optimisation requires careful investigation.
For instance, if $f_i(\boldsymbol{z}) $ are linear functions with integer coefficients, then the solution to the optimisation problem is a rational number. In this case the problem is known to be solvable exactly using $O(n^{3.5} L) $ arithmetic operations, where n denotes the number of variables, and L is the bit length of the coefficient representation [Reference KarmarkarKar84]. Remarkably, the existence of a polynomial algorithm whose arithmetic complexity does not depend on L (a so-called strongly polynomial algorithm) is listed among the 18 problems on Smale’s celebrated list [Reference SmaleSma98]. If the functions $f_i(\boldsymbol{z}) $ are not necessarily linear, the optimal solution does not have to be a rational number, and so we usually ask for an $\varepsilon$ -approximation instead.
Nesterov and Nemirovskii [Reference Nesterov and NemirovskiiNN94] developed a theoretically and practically efficient framework covering a large subset of convex programs. Their method relies on the notion of a self-concordant barrier whose construction depends on the problem formulation and needs to exploit the global structure of the optimisation problem. The number of Newton iterations required by their method to solve the optimisation problem (58) can be bounded by
where $\varepsilon $ is the required precision, $\nu $ is the complexity of the constructed barrier (typically proportional to the problem dimension), and $\mu_0 $ is related to the choice of the starting point. More specifically, the value of the target function is guaranteed to lie within an $\varepsilon $ -neighbourhood of the optimal value. Since $\mu_0 $ does not depend on the target precision, it is often omitted in the final analysis [Reference Potra and WrightPW00]. Further on, we compute $\nu $ for various types of combinatorial specifications.
The cost of each Newton iteration is a sum of two parts: the cost of composing the matrix of second derivatives, and the cost of inverting this matrix. In Section 4.2, we specify in more detail what a Newton iteration in the interior-point method consists of. While the cost of the inversion part is assumed to be cubic in the number of variables, the cost of composing the matrix of second derivatives depends on the problem structure.
One of the most significant achievements of interior-point programming was a proof of existence of a universal family of barriers, which admits a $O(n) $ self-concordance complexity for any convex body in $\mathbb R^n $ . Unfortunately, such an existence result seems to be currently only of theoretical interest – a construction of such a barrier requires computing the convex body’s volume, which in itself is known to be an NP-hard problem. In practice, for each concrete convex optimisation problem the barriers have to be constructed separately. There is no known constructive, general-purpose polynomial-time barrier construction algorithm for convex optimisation problems.
To summarise, the practical complexity of the convex optimisation is determined by the complexity parameter $\nu $ of the associated self-concordant barrier, which requires the problem structure to be exploited in order to be constructed. More details on the algorithmic complexity of convex optimisation methods can be found in [Reference Potra and WrightPW00].
4.1 Disciplined Convex Programming
Disciplined Convex Programming ( $\textsf{DCP} $ , sometimes also referred to as $\textsf{CVX} $ ) is another framework meant for modelling convex optimisation problems [Reference Grant, Boyd and YeGBY06]. Both the objective and constraints are built using certain atomic expressions such as $\log({\cdot})$ , $\|\cdot\|_2$ , or $\log \sum e^{\cdot}$ , and a restricted set of composition rules following basic principles of convex analysis. The inductive definition of expressions allows the framework to track their curvature and monotonicity and, in consequence, automatically determine whether the given program is indeed a valid convex program. Essentially, $\textsf{DCP} $ can be viewed as both a domain-specific language and a compiler which transforms the given convex program into a mixture of linear, quadratic, and exponential conic optimisation problems. Such problems are passed to corresponding solvers in the necessary standard form. $\textsf{DCP} $ therefore covers a strict subset of the convex programs of Nesterov and Nemirovskii’s framework, but serves well as a prototyping interface for provably convex programs. In our implementation, see Section 6, we rely on $\textsf{DCP} $ with two particular conic solvers, the second-order Embedded Conic Solver [Reference Domahidi, Chu and BoydDCB13] and the recently developed first-order Splitting Conic Solver [Reference O’Donoghue, Chu, Parikh and BoydODo+16].
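As a toy illustration of this modelling style (a sketch of our own in plain cvxpy, not a fragment of any cited solver), the tuning problem for binary plane trees $B = z + zB^2$ with target expected size n becomes, after the substitution $z = e^{\zeta}$ , $B(z) = e^{\beta}$ , a one-constraint DCP program:

```python
import math
import cvxpy as cp

n = 1000                                    # target expected tree size (arbitrary choice)
zeta, beta = cp.Variable(), cp.Variable()   # zeta = log z, beta = log B(z)

# log-exp transform of B = z + z * B^2: log(e^zeta + e^(zeta + 2*beta)) <= beta
constraints = [cp.log_sum_exp(cp.hstack([zeta, zeta + 2 * beta])) <= beta]
problem = cp.Problem(cp.Minimize(beta - n * zeta), constraints)
problem.solve()

print(math.exp(zeta.value), math.exp(beta.value))
# as n grows, z approaches the singularity 1/2 and B(z) approaches 1
```

At the optimum the constraint is tight, and the first-order conditions recover the tuning equation $z B'(z)/B(z) = n$ .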
Some of the classical combinatorial constructions, for example ${\rm MS{\small ET}}$ , ${\rm C{\small YC}}$ or ${\rm S{\small ET}_{>0}}$ , can be expressed in $\textsf{DCP} $ ; however, they involve (at least in principle) an infinite number of summands. Some of the operators, for example ${\rm MS{\small ET}}$ , can be efficiently represented in a truncated form, cf. [Reference Bodini, Lumbroso and RolinBLR15], but for others it is intrinsically impossible to provide a reasonable truncation level, since a large number of summands take non-negligible values. Therefore, for such constructions $\textsf{DCP} $ is not the preferred choice, and classical interior-point methods should be selected instead. In what follows, we provide some additional theoretical background on self-concordant barriers and propose our own constructions which can be further used to cover these operations.
4.2 Self-concordant barriers
While the precise descriptions of the interior-point optimisation schemes and the definitions of the self-concordant barriers can be found in [Reference NemirovskiNem04], we only need to borrow a couple of statements. We intentionally do not give the definition of a self-concordant barrier, but instead use a sufficient construction condition. More specifically, we rely on the fact that self-concordant barriers are functions assigned to various convex sets in $\mathbb R^n $ and we will only need to construct such sets for epigraphs of convex functions, defined as
Lemma 23 ([Reference NemirovskiNem04, Proposition 9.2.2]). Let f(x) be a three times continuously differentiable real-valued convex function on the ray $\{x > 0\} $ such that
for some parameter $\beta > 0$ .
Then, the function
is a self-concordant barrier with complexity parameter $\nu = 1+\max[1, \beta^2] $ for the two-dimensional convex domain $\{(t, x) \in \mathbb R^2 \, \mid \, x > 0, \, f(x) \leqslant t \}$ .
The two important operations for composing convex problems are addition and composition (the composition of a convex function with an increasing convex function is also convex). Each composition is treated by introducing an additional variable; for example, if F(x) is convex and increasing, and $G(x) $ is convex, then the epigraph of the composition $\{ (t, x) \, \mid \,F(G(x)) \leqslant t \} $ can be expressed as a projection of a three-dimensional set
The behaviour of self-concordant barriers with respect to taking linear combinations or with respect to combining multiple dimensions is described by the following proposition.
Lemma 24 ([Reference NemirovskiNem04, Proposition 3.1.1]). The following propositions hold:
- (i) Let $F_i$ be self-concordant barriers with complexity parameters $\nu_i$ for closed convex domains $Q_i \subset \mathbb R^{n_i}$ , $i = 1, \ldots, m$ . Then the function
(64) \begin{equation} F(x_1, \ldots, x_m) = F_1(x_1) + \cdots + F_m(x_m), \end{equation}
is a $\left(\sum_i \nu_i\right)$ -self-concordant barrier for $Q = Q_1 \times \cdots \times Q_m$ .
- (ii) Let $F_i$ be self-concordant barriers with complexity parameters $\nu_i$ for closed convex domains $Q_i \subset \mathbb R^{n}$ and $\alpha_i \geqslant 1$ be reals, $i = 1, \ldots, m$ . Assume that $Q = \cap_{i=1}^m Q_i$ has a non-empty interior. Then the function
(65) \begin{equation} F(x) = \alpha_1 F_1(x) + \cdots + \alpha_m F_m(x), \end{equation}
is a $\left(\sum_i \alpha_i \nu_i\right)$ -self-concordant barrier for Q.
Self-concordant barrier design is one of the central issues in modern convex optimisation – knowing these barriers allows one to reduce any convex program to a standard form [Reference NesterovNes98, Section 4.2.6]. The reduction proceeds as follows.
Here, C is a technical constant, bounding the value of the target function from above. Suppose that $F_0(\boldsymbol{z}, \tau) $ , $F_1(\boldsymbol{z}, \kappa),\ldots, F_m(\boldsymbol{z}, \kappa) $ are the $\nu_0 $ -self-concordant and $\nu_i$ -self-concordant barrier functions for the epigraphs of the functions $f_0(\boldsymbol{z}), f_1(\boldsymbol{z}),\ldots,f_m(\boldsymbol{z}) $ , respectively. Then, according to Lemmas 23 and 24, the resulting self-concordant barrier
has a complexity parameter
As specified in [Reference Nesterov and NemirovskiiNN94, Section 2], each Newton step is a step of the form
where $\mu_i $ is a dampening parameter defined by the procedure, and $x_i \in\mathbb R^m $ , where m is the dimension of the parameter space. The computation of the dampening parameter, as well as of the Newton step itself, involves computing the inverse of the matrix of second derivatives. It costs $O(m^2 L) $ arithmetic operations to compose this matrix (a cost which may vary if some exotic constructions are introduced), and $O(m^3) $ steps to compute its numerical inverse.
Finally, let us note that there is room for a variety of additional fine-tuned optimisation techniques. For instance, the Newton iterations can use matrix inversion algorithms faster than $O(n^3)$ , based on $O(n^\omega)$ matrix multiplication, or even algorithms dedicated to sparse matrices.
Remark 25. The $\textsf{DCP}$ software that we use in our prototype tuner implementation follows a completely different principle and expresses the problem in a standard form using so-called exponential cones. The barrier approach gives more theoretical guarantees, as it allows one to cover a broader range of constructions, but for ease of prototyping we chose $\textsf{DCP}$ instead.
4.3 Barriers for context-free specifications
In this section, we give rigorous complexity bounds for the case of context-free grammars, see Section 3.2. Each of the constraints in the auxiliary convex optimisation problem that we construct takes the form
where $\boldsymbol{x} $ is the vector of unknowns, and $\boldsymbol{a}_{ij} $ are vectors of coefficients. Such a convex constraint can be rewritten into an equivalent form
where $\boldsymbol{e}_i $ is the ith basis vector. Each of the functions $\exp((\boldsymbol{a}_{ij} - \boldsymbol{e}_i)^\top \boldsymbol{x}) $ is convex, as $(\boldsymbol{a}_{ij} - \boldsymbol{e}_i)^\top \boldsymbol{x} $ is linear and $e^x $ is both increasing and convex. Clearly, their sum is also convex. Using the sum-substitution technique from Lemma 24, we conclude that in order to provide a fast convex optimisation interior-point procedure, it is sufficient to construct a self-concordant barrier for the epigraph of $e^x$ . Note that $\textbf{epi}\ e^x = \{ (t, x) \in \mathbb R^2 \, \mid \, e^x\leqslant t \}$ can be equivalently rewritten as $\{ (t, x) \in \mathbb R^2 \, \mid \, t > 0, \, - \log t \leqslant -x\}$ . Furthermore, a linear variable change $x \mapsto -x $ , and variable swapping $(x, t) \mapsto (t, x) $ converts it into the epigraph of the convex function $f(x) = - \log x $ :
Finally, we note that the condition from Lemma 23 is fulfilled with $\beta = \frac{2}{3}$ . Therefore, a logarithmic barrier
is a 2-self-concordant barrier for the epigraph of $-\log(x)$ . This also gives a 2-self-concordant barrier for the epigraph of $e^x$ . Consequently, each unambiguous context-free specification can be converted into a standard form for an interior-point method with total barrier complexity parameter $O(L) $ , where L is the total number of terms in the specification.
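The constant $\beta = \frac{2}{3}$ behind this claim is elementary to verify; for instance, the following quick symbolic check (ours) with sympy confirms that the ratio in Lemma 23 is constant for $f(x) = -\log x$ :

```python
import sympy as sp

x = sp.symbols('x', positive=True)
f = -sp.log(x)

# Lemma 23 requires |f'''(x)| <= (3*beta/x) * f''(x); here the ratio
# |f'''(x)| / (f''(x)/x) equals 2 identically, so 3*beta = 2 and beta = 2/3.
ratio = sp.simplify(sp.Abs(sp.diff(f, x, 3)) / (sp.diff(f, x, 2) / x))
print(ratio)  # 2

# Since beta^2 = 4/9 < 1, the complexity parameter is 1 + max(1, beta^2) = 2.
```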
4.4 Cycle and positive set constructions
Among the basic labelled operators, the ${\rm C{\small YC}}$ operator cannot be easily fitted into the more restricted DCP framework that we use in our prototype implementation. The same issue concerns all of the restricted versions of labelled operators, including ${\rm S{\small ET}_{>0}}$ , ${\rm S{\small ET}_{>k}}$ , and $\rm C{\small YC}_{>k} $ . We provide a further, more heuristic analysis of the logarithmic barriers for these operators and hint at why they can be efficiently supported. We do not provide practical implementations of these modified versions; however, in principle, they can be handled as well.
Let us start with two particular functions obtained as a log-exp transform of the ${\rm C{\small YC}}$ and ${\rm S{\small ET}_{>0}}$ constructors, see Section 6.1. These two functions are
Note that the domain of L(x) is equal to $\{x \in \mathbb R \, \mid \, x< 0 \} $ . The convexity of each of E(x) and L(x) is ensured by noticing that they are log-sum-exp expressions with an infinite number of positive summands:
Unfortunately, we cannot afford an infinite number of summands, as adding each summand contributes to the final barrier complexity. Instead, we are going to explicitly construct self-concordant barriers for the epigraphs of L(x) and E(x) with the help of Lemma 23.
The case of $f(x) = L(x) $ . Direct numerical evaluation shows that when $0 < x \lesssim 3 \beta $ it holds that $|f^{\prime\prime\prime}(x)| \leqslant \frac{3 \beta}{x} f^{\prime\prime}(x)$ . Relative errors of $\,\widehat{x}$ satisfying $|f^{\prime\prime\prime}(\widehat{x})| = \frac{3 \beta}{\widehat{x}}f^{\prime\prime}(\widehat{x})$ for varying values of $\beta$ are listed in Table 4.
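This observation is straightforward to reproduce numerically. The snippet below (our own check; we read the statement after the variable change $x \mapsto -x$ , so that the domain of L becomes the positive ray required by Lemma 23) tabulates, at a few sample points x, the smallest admissible $\beta$ using mpmath:

```python
import mpmath as mp

mp.mp.dps = 40  # extra precision for high-order numerical derivatives

# log-exp transform of CYC after the change x -> -x, with domain {x > 0}
f = lambda x: mp.log(mp.log(1 / (1 - mp.exp(-x))))

def beta_needed(x):
    # smallest beta with |f'''(x)| <= (3*beta/x) * f''(x) at the point x
    return abs(mp.diff(f, x, 3)) * x / (3 * mp.diff(f, x, 2))

for x in [0.5, 1, 2, 5, 10, 30]:
    print(x, beta_needed(x))
```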
Let us notice the practical consequences of this fact. For most typical combinatorial systems, tuning the expected size to n gives the following behaviour of the tuning parameter:

- $x_n = \rho \left(1 - O(n^{-1})\right) $ for algebraic and rational systems, and
- $x_n = O(\mathrm{Poly}(n)) $ for some specifications with singularity $\rho $ at infinity, such as entire functions (e.g. labelled sets or similar constructs).

In fact, we are not aware of any natural example of a combinatorial system where the tuning parameter approaches the singular value at an exponential speed $x_n = \rho(1 - e^{-n}) $ .
Therefore, if we choose, say, $\beta = 10 $ , then according to Lemma 23, the resulting logarithmic barrier will be $\beta^2$ -self-concordant for $x \lesssim 30 $ . In practical terms, $\beta = 10 $ covers expected values up to $N \sim e^{30} \approx 10^{13} $ . More generally, for $N \to \infty $ , the complexity of the whole optimisation scheme is proportional to the square root of the sum of all the barrier complexities, which yields an additional $\log N $ factor for the resulting barrier complexity, and hence for the complexity of the final tuning algorithm.
The case of $f(x) = E(x) $ . If we try to apply Lemma 23 directly to E(x), we discover that the conditions of the lemma cannot be satisfied. Therefore, we pre-process the epigraph of E(x) by taking its inverse, similarly as we did for the 2-self-concordant barrier corresponding to the exponential function $e^x $ . Specifically, we transform $\textbf{epi}\ E $ into
It can be proven that $f(x) = - \log \log(1 + e^x) $ is convex for $x > 0$ . Moreover, we can numerically verify that the condition of Lemma 23 is satisfied for all $x \leqslant 10^{100} $ . We conjecture that it is valid for all $x \in \mathbb R $ , but, as we discussed above, in practice we only need it to be valid for relatively small values of $x $ .
4.4.1 Restricted versions of cycles and sets
It is also possible to consider cycles and sets containing more than k elements, so that we need to construct self-concordant barriers for the functions
For the cycle construction, we again empirically discover that for $x \lesssim3 \beta$ the logarithmic barrier is self-concordant. We have numerically tested this assertion for $k \in \{ 0, 1, 2, 3\} $ and $\beta \leqslant 10 $ .
For the restricted construction ${\rm S{\small ET}_{>k}}$ , we proceed in the same way as we did for ${\rm S{\small ET}_{>0}}$ , by constructing the epigraph of the negative inverse function. In this case, however, the inverse function cannot be expressed explicitly, but can only be computed numerically. All the numerical computations can be done efficiently using binary search for the inverse function, and the implicit function theorem for its derivatives. Again, this gives an algorithmic definition of a self-concordant barrier for restricted sets without the need to manipulate infinite sums.
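As an illustration, the following sketch (ours; it assumes the log-exp transform of ${\rm S{\small ET}}_{>k}$ is $E_{>k}(x) = \log\big(e^{e^x} - \sum_{j \leqslant k} e^{jx}/j!\big)$ , consistent with the formula for $E_{>1}$ appearing in the next subsection) computes the inverse function by plain bisection using mpmath:

```python
import mpmath as mp

mp.mp.dps = 80  # high precision: the sum below suffers heavy cancellation

def E_gt(k, x):
    # log-exp transform of Set_{>k}: log( e^(e^x) - sum_{j<=k} e^(jx)/j! )
    t = mp.exp(x)
    return mp.log(mp.exp(t) - sum(t ** j / mp.factorial(j) for j in range(k + 1)))

def E_gt_inv(k, y, lo=mp.mpf(-30), hi=mp.mpf(30), steps=200):
    # E_gt(k, .) is increasing in x, so its inverse is found by bisection
    for _ in range(steps):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if E_gt(k, mid) < y else (lo, mid)
    return (lo + hi) / 2

x = E_gt_inv(2, mp.mpf(1))
print(x, E_gt(2, x))  # round trip: the second value is close to 1
```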
4.4.2 Unlabelled restricted cycles and multi-sets
Note that the unlabelled version of the ${\rm C{\small YC}}$ operator expressed by
is in fact a weighted sum of already familiar labelled operators, for which we have already provided efficient self-concordant barriers. Again, restricted variations of the corresponding operators ${\rm MS{\small ET}_{>0}}$ , ${\rm MS{\small ET}_{>k}}$ and ${\rm C{\small YC}}$ cannot be efficiently treated by $\textsf{DCP}$ , but the usage of self-concordant barriers can provide some insight. For example, ${\rm MS{\small ET}_{>0}}$ defined as
can be represented as a composition with $e^x - 1 $ , whose log-exp transform is $E(x) = \log(e^{e^x}-1) $ .
Next, an unlabelled cycle construction with a lower bound on the number of components $\rm C{\small YC}_{\geqslant m} $ can be expressed as
Again, as in the case of unlabelled compositions, the sequence of values of the generating functions $\left({A(\boldsymbol{z}^k)}\right)_{k=1}^\infty $ tends to $A(0) = 0$ at the speed of a geometric progression. Note that this gives an efficient composition scheme for unlabelled cycles as well, using
Finally, let us sketch how to express restricted unlabelled sets using the self-concordant barriers for restricted labelled sets. We start by considering two examples, ${\rm MS{\small ET}}_{\geqslant 2} $ and ${\rm MS{\small ET}}_{\geqslant 3} $ .
For sets with at least two elements we have
If we set $h_1 = A(\boldsymbol{z}) $ and $h_2 = \sum_{k \geqslant 2} \frac{A(\boldsymbol{z}^k)}{k} $ , then the resulting value can be rewritten as
The required composition scheme is obtained by composing with the log-exp transform of the function $e^x - x - 1$ which is $E_{\geqslant 2}(x) :=E_{>1}(x) = \log(e^{e^x} - e^x - 1) $ .
For sets with at least three elements it holds
Again, by introducing auxiliary variables $h_1 = A(\boldsymbol{z}) $ , $h_2 =\frac{A(\boldsymbol{z}^2)}{2} $ , $h_3 = \sum_{k \geqslant 3} \frac{A(\boldsymbol{z}^k)}{k} $ , and $H = h_1 + h_2 + h_3 $ , we can rewrite the operator as
where R is a positive quantity obtained by subtracting $\frac{h_1^2}{2} $ from $\frac{(h_1 + h_2 + h_3)^2}{2} $ . Again, we obtain a composition scheme by composing with $E_{\geqslant 3}(x) := E_{>2}(x) $ .
Finally, for sets with general restrictions the above scheme can be repeated for multi-sets with forbidden cardinalities by subtracting ${\rm MS{\small ET}}_{=m} $ obtained by
Afterwards, the result can be represented by adding the missing parts to ${\rm MS{\small ET}}_{>m} $ . We do not provide a detailed analysis of this case, as for large m this procedure becomes less and less efficient due to the potential super-polynomial growth in m of the number of required terms.
5. Optimal biased expectation and tuning precision
In the current section, we address the choice of the optimal expected value for the Boltzmann sampler with a prescribed anticipated rejection domain. We also investigate the precision of the control vector $\boldsymbol{x}$ required to guarantee a linear-time, approximate-size anticipated rejection sampling scheme.
In what follows, we analyse the rejection cost independently for each parameter. That is to say, we evaluate the cost of the anticipated rejection until the value of the investigated parameter N falls into its prescribed tolerance window $[n_1, n_2] $ . The sum of these costs taken over all the parameters is an upper bound for the cost of a global rejection scheme (which, of course, could be lower because some rejections overlap).
We distinguish two types of behaviours. The first concerns parameters admitting a singular behaviour of type ${\left(1-\frac{z}{\rho}\right)}^{-\alpha}$ which corresponds, for instance, to the size parameter of peak and flat distributions described in [Reference Bodini, Genitrini and RolinBGR15]. In this case, we will prove that for each $\alpha$ the required precision is of order $O\left(\frac{1}{N}\right)$ where N is the approximate size of the expected outcome.
The second case concerns so-called bumpy distributions, appearing in numerous situations, such as parameter analysis in strongly connected algebraic or rational grammars. Since the distribution is concentrated around its mean value, the needed precision is related to the standard deviation and is, usually, of order $O\left(\frac{1}{\sqrt{N}}\right)$ . In contrast, the size of the target window is not linear, as in the non-concentrated case, but instead is of order $\sqrt{N}$ , or more generally of order $N^\gamma $ .
In the first case, we show that the standard target tuning expectation $\mathbb{E}(N) =n := \frac{n_1 + n_2}{2}$ involving the admissible window $[n_1,n_2]$ does not minimise the overall cost of rejections, if anticipated rejection is taken into account. Instead, $\mathbb{E}(N) = \beta n$ , for some suitable parameter $\beta(n_1,n_2)$ should be used, cf. Figure 3. In what follows, we describe how to find the optimal value of $\beta $ and how it improves the rejection cost, compared to the usual $n := \frac{n_1+ n_2}{2}$ parameter.
5.1 Non-concentrated case
For simplicity, we assume that the analysis concerns the size parameter z. Let $C_\boldsymbol{u}(z)$ be the generating function of a class $\mathcal{C}$ for which the parameters $\boldsymbol{u}$ are fixed (i.e. introduce some additional weights). Assume that $C_\boldsymbol{u}(z)$ is $\Delta$ -analytic and there exist analytic functions $\alpha(z)$ , $\beta(z)$ , and a corresponding singularity $\rho$ (all depending on $\boldsymbol{u}$ ) such that as $z \to \rho $ it holds
Let $C^{< n_1}(z)$ , $C^{> n_2}(z)$ and $C^{[n_1, n_2]}(z)$ denote the generating functions corresponding to the subclasses of objects of size strictly smaller than $n_1$ , strictly greater than $n_2$ , and in between (inclusively) $n_1$ and $n_2$ , respectively. Let $T_n$ stand for the cumulative size of the generated and rejected objects produced by a sampler calibrated with parameter x, and an admissible size window $[(1-\varepsilon)n, (1+\varepsilon)n]$ . From [Reference Duchon, Flajolet, Louchard and SchaefferDuc+04, Theorem 7.3], the corresponding probability generating function
satisfies
and so
Remark 26. Notice that even if the studied parameter does not denote the object size, the expectation $\mathbb{E}(T_n)$ stays essentially similar to (89). The involved generating functions then correspond to the subclasses of objects whose size contribution of the studied parameter is smaller, greater, or in between the respective window parameters. It ensues that the following analysis for the size parameter can be readily translated to other parameters.
Proposition 27. Let us consider an approximate-size, anticipated rejection Boltzmann sampler $\Gamma(\mathcal{C})$ for a class $\mathcal{C}$ of the above kind. Assume that $\Gamma(\mathcal{C})$ is calibrated using $x_n = \rho\left(1-\frac{\delta}{n}\right)$ for some parameter $\delta$ . Let $T_n$ denote the cumulative size of objects rejected by $\Gamma(\mathcal{C})$ . Then, as n tends to infinity
Proof. This theorem is very similar to [Reference Bodini, Genitrini and RolinBGR15, Theorem 3.2]. It has an analogous proof using Euler–Maclaurin summation, with the difference that we no longer suppose that $\alpha = \delta $ . Let us emphasise again that keeping these two parameters distinct is the key to obtaining faster samplers with lower anticipated rejection cost.
Note that $\kappa_c(\varepsilon,\alpha,\delta)$ is independent of n. Moreover, $\kappa_c(\varepsilon,\alpha,\delta)$ is bounded for each $\delta > 0$ . Consequently, $\Gamma(\mathcal{C})$ calibrated with parameter $x_n$ of precision of order $O\left(\frac{1}{n}\right)$ has expected linear time complexity.
Remark 28. The value of $\kappa_c(\varepsilon,\alpha,\delta)$ is also independent of the weight $\boldsymbol{u}$ , depending only on the type of the singularity $\alpha$ and the prescribed tolerance $\varepsilon$ . Note that it provides strong stability guarantees regarding the sampler’s performance under the change of the involved parameters; most notably, in the vicinity of the values of $\boldsymbol{u}$ where the type of the singularity changes discontinuously, cf. [Reference Bodini and PontyBP10, Ban12]. Even in these particular cases, our result guarantees the linearity of the anticipated rejection sampling scheme.
With an explicit formula (90) for $\kappa_c(\varepsilon,\alpha,\delta)$ it is possible to optimise its value for specific values of $\varepsilon$ and $\alpha$ , in other words, to improve the multiplicative constant in the expected rejection cost $T_n$ . Table 5 provides some exemplary values of $\delta_{\text{min}}$ and the corresponding $\kappa_c(\varepsilon,\alpha,\delta_{\text{min}})$ .
Remark 29. The careful reader might be surprised by the fact that $\delta_{\text{min}}$ is not equal to $\alpha$ , as suggested in the seminal paper [Reference Duchon, Flajolet, Louchard and SchaefferDuc+04]. The reason behind this is the fact that anticipated rejection creates a small bias in the distribution, initially not taken into account. For instance, for $\alpha=1$ and tolerance $\varepsilon=0.1$ , the best choice for $\beta$ is not 1, but $1.6572067$ . Notably, this decreases the rejection complexity from $8.05n$ to $6.97n$ .
As a consequence of the introduced bias, it is no longer optimal to find x by solving $\mathbb{E}_x(N) = n$ . Instead, we have to make a small correction accounting for the effect of anticipated rejection. Luckily, this correction is easy to compute. Recall that, asymptotically, $\mathbb{E}_x(N) = n$ is attained for $x_n=\rho(1-\frac{\alpha}{n})$ . After computing the expectation, it follows that in order to obtain $x_n = \rho\left(1-\frac{\delta_{\text{min}}}{n}\right)$ we have to solve the corrected equation $\mathbb{E}_x(N) = \frac{\alpha}{\delta_{\text{min}}} n$ , instead.
5.2 Concentrated cases
In the following part, we consider cases where the investigated size parameter tends asymptotically to a Gaussian law, provided that the target expectation is fixed and large. Let C(z) be, as usual, the generating function corresponding to the class $\mathcal C $ , and N be the random variable representing the size of an object generated according to the associated Boltzmann distribution. As previously, we want to evaluate the rejection cost and find an appropriate bias parameter which minimises its value.
Before we begin, let us consider a few examples of concentrated distributions analysed using the following generalised quasi-powers theorem.
Lemma 30 (Generalised quasi-powers, [Reference Flajolet and SedgewickFS09, Theorem IX.13]). Assume that, for u in a fixed complex neighbourhood $\Omega $ of $1 $ , the probability generating functions $p_n(u) $ of non-negative discrete random variables $X_n $ admit representations of the form
uniformly with respect to u, where each $h_n(u) $ is analytic in $\Omega $ . Assume also the conditions
as $n \to \infty$ , uniformly for $u \in \Omega$ .
Then, the random variable
converges in distribution to $\mathcal N(0,1) $ .
In what follows, we apply the generalised quasi-powers theorem to a series of probability generating functions capturing the outcome size distributions of example Boltzmann samplers. Therefore, the discrete random variables of interest $X_n $ denote these outcome sizes, where n stands for the target expected size, as $n \to \infty $ . We use the standard formulas for the mean and variance corresponding to the size N of objects sampled according to the Boltzmann distribution parametrised with x, namely $\mu(x) = x \frac{C'(x)}{C(x)}$ and $\sigma^2(x) = x \frac{d}{dx}\mu(x)$ .
For convenience, we also use $x_n$ to denote the solution of the tuning equation $\mu(x_n) = n$ .
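These two formulas make the examples below easy to check mechanically; the following small sympy helper (ours) recovers the moments quoted in the next subsections.

```python
import sympy as sp

x = sp.symbols('x', positive=True)

def boltzmann_moments(C):
    # mu(x) = x * C'(x) / C(x) and sigma^2(x) = x * mu'(x) for the Boltzmann
    # model with generating function C
    mu = sp.simplify(x * sp.diff(C, x) / C)
    var = sp.simplify(x * sp.diff(mu, x))
    return mu, var

print(boltzmann_moments(sp.exp(2 * x)))          # bicoloured sets: (2x, 2x)
print(boltzmann_moments(sp.exp(x + x**2 / 2)))   # involutions: (x + x^2, x + 2x^2)
print(boltzmann_moments(sp.exp(sp.exp(x) - 1)))  # set partitions: (x e^x, (x + x^2) e^x)
```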
5.2.1 Bicoloured sets
Consider the combinatorial class $\mathcal C $ consisting of finite sets with atoms of two different types (colours). The corresponding generating function satisfies $C(x) = e^{2x} $ . Accordingly, the mean value $\mu(x)$ and standard deviation $\sigma(x)$ satisfy $\mu(x) = 2x$ and $\sigma(x) = \sqrt{2x}$ , whereas $x_n = n/2 $ .
The probability generating function $p_n(u)$ capturing the size distribution calibrated with parameter $x_n$ therefore satisfies $p_n(u) = \frac{C(u x_n)}{C(x_n)} = e^{n(u-1)}$ .
A direct calculation reveals that all the conditions of Lemma 30 hold, and the size distribution tends to a Gaussian law as $x_n \to \infty$ . In particular, we have $\frac{X_n - n}{\sqrt{n}} \to \mathcal N(0, 1)$ in distribution.
5.2.2 Involutions
Consider the combinatorial class of involutions, that is n-element permutations $\pi $ satisfying $\pi \circ \pi = {{id}} $ . The respective generating function satisfies $C(x)=\exp\left(x+\frac{x^2}{2}\right)$ . Consequently, the mean value $\mu(x)$ , variance ${\sigma(x)}^2$ , and tuning parameter $x_n$ satisfy $\mu(x) = x + x^2$ , $\sigma^2(x) = x + 2x^2$ , and $x_n = \frac{\sqrt{1+4n}-1}{2} \sim \sqrt{n}$ .
And so
Again, the premises of Lemma 30 can be easily verified. In the end, we obtain a limit Gaussian distribution where
5.2.3 Set partitions
Consider the combinatorial class of set partitions for which we have $C(x) = e^{e^x-1} $ . A direct computation provides the identities $\mu(x) = x e^x$ , $\sigma^2(x) = (x + x^2) e^x$ , and $x_n = W(n)$ , where W(n) is the Lambert function defined as the positive solution of $W(n) e^{W(n)} = n $ .
In this case, $p_n(u) = \exp(h_n(u))\left(1 + o(1)\right)$ where $h_n(u) = e^{uW(n)}-e^{W(n)}$ . We can easily check that
Since $W(n) = \ln n - \ln \ln n + o(1)$ we note that $h^{\prime}_n(1) + h^{\prime\prime}_n(1) \to\infty$ . Moreover
which for u fixed near one tends to 0 as $n \to \infty$ .
Hence, by Lemma 30 the distribution tends, again, to a Gaussian law. In the limit we obtain
5.2.4 Fragmented permutations
Next, consider the class of fragmented permutations, that is sets of non-empty labelled sequences, see [Reference Flajolet and SedgewickFS09, Example VIII.7, p. 562]. The corresponding generating function satisfies $C(x) = \exp \left( \frac{x}{1 - x} \right)$ . Consequently, we find that $\mu(x) = \frac{x}{(1-x)^2}$ and $\sigma^2(x) = \frac{x(1+x)}{(1-x)^3}$ .
Note that as $n \to \infty$ , the tuning parameter $x_n \to 1$ . Again, we verify that, as $n \to \infty$ , the function $h_n(u) = \frac{u x_n}{1-ux_n}-\frac{x_n}{1-x_n} $ satisfies both
However, now $h_n(u)$ is not analytic in a fixed neighbourhood of $u = 1$ , as its singularity at $u = 1/x_n$ approaches 1. We cannot therefore apply Lemma 30. Nonetheless, it is still possible to prove that the limiting distribution is Gaussian, using the explicit formula for the number of fragmented permutations from [Reference Flajolet and SedgewickFS09].
From the above examples, we see that there is a variety of different behaviours regarding the variance of the limit Gaussian distribution. In what follows we will show how the mean value is connected to the anticipated rejection cost (89) in the Gaussian case.
Theorem 31 (Tuning precision for concentrated distributions). Let the size parameter asymptotically follow a Gaussian law with expectation $\mu(x)$ and standard deviation $\sigma(x)$ , and let X(n) be the inverse function of $\mu(x) $ , that is $\mu(X(n)) = n $ . Denote by $x_{n,\delta}$ the biased tuning value
Then the anticipated rejection cost $T_n $ with the target tolerance window $[n-\varepsilon\sigma(x_n),n+\varepsilon\sigma(x_n)]$ satisfies, when n tends to infinity:
where $\Phi(x) := \dfrac{1}{\sqrt{2 \pi}} \displaystyle\int_{-\infty}^xe^{-w^2/2} dw$ is the Gaussian distribution function. The minimal cost is achieved when $\delta = \delta_{\min} = 0 $ .
Proof. Using the Euler–MacLaurin estimate similarly to [Reference Bodini, Genitrini and RolinBGR15], we obtain the following estimates for $C^{[n_1, n_2]}(x), C^{>n_2}(x) $ and $\frac{d}{dx} C^{<n_1}(x) $ , when $x = x_n $ as $n \to \infty $ :
Since $\mu(x) \gg \sigma(x) $ for $x = x_n $ as $n \to \infty $ , the second summand dominates the first summand in the last expression.
Combining these quantities and substituting into $\mathbb{E}(T_n) = \frac {x \frac{d}{dx} C^{<n_1}(x)+ n_2 C^{>n_2}(x)} { C^{[n_1,n_2]}(x)}$ , we get the estimated cost of anticipated rejection:
Let us choose $x = x_{n, \delta} $ in such a way that $\mu(x_{n, \delta}) = n + \delta \sigma(x_n)$ . Using the first two terms of the Taylor expansion of $\mu(x_{n,\delta}) $ around $x_n $ , we can show that this value satisfies, asymptotically, $x_{n, \delta} \sim x_n+\delta\sigma(x_n)\frac{d X(n)}{dn}$ . This finally allows us to obtain $\kappa(\varepsilon, \delta):= \lim_{n \to \infty} \dfrac{\mathbb E T_n}{n}$ :
This expression achieves its minimum when $\Phi(\varepsilon - \delta) -\Phi({-}\varepsilon - \delta) $ achieves its maximum. Its derivative has only one root $\delta = 0 $ which corresponds to the global maximum. Therefore, $\delta_{\min} = 0 $ .
Remark 32. In the case when the standard deviation $\sigma(x_n) $ is negligible compared to the mean value $\mu(x_n) $ as $n \to \infty $ , the rejection cost can be shown to be asymptotically equal to
where p(x) is the probability density function of the limiting distribution. In the case when the limiting distribution is not symmetric, and the probability density function is known, the corresponding bias can be computed as $\delta_{\min} = \arg\min \kappa(\varepsilon, \delta) $ .
6. Paganini: a multi-parametric tuner prototype
To illustrate the effectiveness of our tuning procedure, we developed $\textsf{Paganini} $ , a lightweight Python library implementing the tuning-as-convex-optimisation idea. Our software relies on cvxpy, a Python-embedded modelling language for Disciplined Convex Programming (DCP) [Reference Grant, Boyd and YeGBY06]. With its help, $\textsf{Paganini} $ is able to automatically compose and solve adequate optimisation problems so as to compute the parameter vector corresponding to user-defined expectations.
6.1 Implementation details
Due to the imposed restrictions of Disciplined Convex Programming, $\textsf{Paganini} $ supports a strict, though substantial, subset of admissible constructions. In both the labelled and unlabelled case, Paganini provides the basic empty class $\varepsilon$ , disjoint sum $+$ , and Cartesian product $\times$ operations. More involved constructions are briefly discussed below.
6.1.1 Sequence operator
Consider ${\rm S{\small EQ}}(\mathcal{A})$ for some class $\mathcal{A}$ (either labelled or unlabelled). By definition, ${{\rm S{\small EQ}}{(\mathcal{A})} = \varepsilon + \mathcal{A}\times {\rm S{\small EQ}}{(\mathcal{A})}}$ and so we can treat ${\rm S{\small EQ}}{(\mathcal{A})}$ as a new, auxiliary variable with a corresponding definition of the above shape. The log-exp transform of ${\rm S{\small EQ}}{(\mathcal{A})}$ then takes the form of an elementary DCP log-sum-exp function:
where $e^{\alpha} = A(\boldsymbol{z})$ and $e^{\sigma} ={\rm S{\small EQ}}{(\mathcal{A})}(\boldsymbol{z})$ .
Likewise, since
it is readily possible to translate ${\rm S{\small EQ}}$ with its restricted variants into valid DCP constraints.
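For instance (a toy sketch of ours in plain cvxpy, not the $\textsf{Paganini}$ interface itself), tuning binary strings $\mathcal{S} = {\rm S{\small EQ}}(a + b)$ , with u marking the letter a, towards an expected length of 1000 and an expected number of 300 occurrences of a, amounts to:

```python
import math
import cvxpy as cp

# S(z, u) = 1 / (1 - z*u - z); variables are the logarithms of z, u and S
zeta, eta, sigma = cp.Variable(), cp.Variable(), cp.Variable()

# log-exp transform of S = 1 + (z*u + z) * S
constraints = [cp.log_sum_exp(cp.hstack([0, zeta + eta + sigma, zeta + sigma])) <= sigma]
problem = cp.Problem(cp.Minimize(sigma - 1000 * zeta - 300 * eta), constraints)
problem.solve()

print(math.exp(zeta.value), math.exp(eta.value))
# approximately 700/1001 and 3/7, matching the closed-form tuning for this toy case
```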
6.1.2 Pólya structures
Consider ${\rm MS{\small ET}}{(\mathcal{A})}$ and its log-exp variant:
where $e^{\alpha_k} = {A(\boldsymbol{z}^k)}$ and $e^{\mu} ={\rm MS{\small ET}}{(\mathcal{A})}(\boldsymbol{z})$ .
The right-hand side of the ${\rm MS{\small ET}}$ log-exp transform is an infinite sum of exponents with positive weights. For practical purposes, we can therefore truncate the series at a finite threshold and notice that it conforms with $\textsf{DCP}$ rules. What remains is to construct constraints for the respective diagonals $A(\boldsymbol{z}^k)$ based on the definition of $A(\boldsymbol{z})$ .
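As a concrete toy instance (our own cvxpy sketch, with a heuristically chosen truncation threshold K), consider integer partitions $\mathcal{P} = {\rm MS{\small ET}}({\rm S{\small EQ}}_{\geqslant 1}(\mathcal{Z}))$ , for which $\log P(z) = \sum_{k \geqslant 1} A(z^k)/k$ with $A(z) = z/(1-z)$ ; the constraints below encode the diagonals ${\rm S{\small EQ}}_{\geqslant 1}$ evaluated at $z^k$ :

```python
import math
import cvxpy as cp
import numpy as np

n, K = 100, 30                 # target expected size and truncation threshold
zeta = cp.Variable()           # log z
alpha = cp.Variable(K)         # alpha[k-1] = log A(z^k)

# diagonal constraints: A_k = z^k * (1 + A_k), i.e. Seq_{>=1} evaluated at z^k
constraints = [
    cp.log_sum_exp(cp.hstack([k * zeta, k * zeta + alpha[k - 1]])) <= alpha[k - 1]
    for k in range(1, K + 1)
]
weights = np.array([1.0 / k for k in range(1, K + 1)])
problem = cp.Problem(cp.Minimize(weights @ cp.exp(alpha) - n * zeta), constraints)
problem.solve()

print(math.exp(zeta.value))    # tuned z; for n = 100, close to exp(-pi / sqrt(6 * n))
```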
It is also possible to handle the restricted ${\rm MS{\small ET}}_{= k}$ . Following [Reference Flajolet, Fusy and PivoteauFFP07, Section 2.4] we notice that ${{\rm MS{\small ET}}}_{= k}{(\mathcal{A})}(\boldsymbol{z})$ can be expressed as
where $\mathcal{P}_k$ consists of so-called partition sequences of size k, that is sequences $\left({n_i}\right)$ of natural numbers satisfying the condition $\sum_{i= 1}^k i n_i = k$ . In this form, (116) unfolds to a polynomial in $A(\boldsymbol{z}), A(\boldsymbol{z}^2),\ldots,A(\boldsymbol{z}^k)$ with positive coefficients. Consequently, ${{\rm MS{\small ET}}}_{= k}$ can be converted to a $\textsf{DCP}$ constraint just like a regular algebraic equation. Following the same idea, we can handle ${{\rm MS{\small ET}}}_{\leqslant k}$ as
The ${{\rm MS{\small ET}}}_{\geqslant k}$ variant is much more involved. Since the difference of convex functions is not necessarily a convex function itself, we cannot directly translate the defining
into a valid $\textsf{DCP}$ constraint. Let us recall, however, that for $k = 1$ in the case when $\mathcal A = z_1 + z_2 + \cdots + z_d $ it is possible to rewrite ${{\rm MS{\small ET}}}_{\geqslant k}$ so as to avoid subtraction altogether and compose a corresponding $\textsf{DCP}$ constraint (cf. Section 7.4 and, more generally, Section 2.2).
For general k, a more heuristic approach using Disciplined Convex-Concave Programming ( $\textsf{DCCP}$ ) might be preferred, see [Reference Shen, Diamond, Gu and BoydShe+16]. Consider the following exp transform:
where $e^{\alpha_k} = {A(\boldsymbol{z}^k)}$ and $e^{\mu} ={{\rm MS{\small ET}}}_{\geqslant 1}{(\mathcal{A})}(\boldsymbol{z})$ .
Here, both sides of the inequality are convex expressions. Although such constraints do not conform with $\textsf{DCP}$ rules, they are allowed in the $\textsf{DCCP}$ framework, and can therefore be included in the problem statement.
Now, let us focus on the cycle construction $\rm C{\small YC}{(\mathcal{A})}$ . Note that
where $e^{\alpha_k} = {A(\boldsymbol{z}^k)}$ , $e^{\gamma} =\rm C{\small YC}{(\mathcal{A})}(\boldsymbol{z})$ , and $\varphi(k)$ is the Euler totient function.
Unfortunately, such a constraint does not meet the requirements of $\textsf{DCP}$ , even though its right-hand side is convex. On the other hand, the restricted ${\rm C{\small YC}}_{=k}{(\mathcal{A})}$ satisfies
and so it is possible to express its log-exp transform as a standard DCP log-sum-exp function. In order to emulate the unrestricted ${\rm C{\small YC}}{(\mathcal{A})}$ operator, one can either use $\textsf{DCCP}$ , reformulating (120) as a $\textsf{DCCP}$ constraint, or use the relation
with a (heuristically chosen) truncation threshold. See further comments about the cycle construction and the function $\log \log \frac{1}{1 - e^x} $ in Section 4.4.
Finally, consider the power set construction ${\rm PS{\small ET}}{(\mathcal{A})}$ :
where $e^{\alpha_k} = {A(\boldsymbol{z}^k)}$ and $e^{\pi} ={\rm PS{\small ET}}{(\mathcal{A})}(\boldsymbol{z})$ .
Due to the alternating summation and subtraction in (123), the right-hand side of the constraint cannot be expressed as an elementary, convex, $\textsf{DCP}$ expression. Consequently, it is not supported in our prototype implementation.
6.1.3 Labelled constructions
Consider the labelled set operator ${\rm S{\small ET}}{(\mathcal{A})}$ . Recall that for both the restricted and unrestricted variants we have
where $e^{\alpha} = {A(\boldsymbol{z})}$ , $e^{\sigma} ={\rm S{\small ET}}{(\mathcal{A})}(\boldsymbol{z})$ , and $e^{\sigma_k} = {{\rm S{\small ET}}}_{=k}{(\mathcal{A})}(\boldsymbol{z})$ .
In this form, it is clear that both log-exp transformations can be expressed in terms of elementary $\textsf{DCP}$ functions. Hence, both ${\rm S{\small ET}}{(\mathcal{A})}$ and ${{\rm S{\small ET}}}_{= k}{(\mathcal{A})}$ can be effectively handled.
Let us now focus on the remaining labelled cycle operator $\rm C{\small YC}{(\mathcal{A})}$ . Note that
where $e^{\alpha} = {A(\boldsymbol{z})}$ , $e^{\gamma} =\rm C{\small YC}{(\mathcal{A})}(\boldsymbol{z})$ , and $e^{\gamma_k} = {\rm C{\small YC}}_{=k}{(\mathcal{A})}(\boldsymbol{z})$ .
The log-exp transform of ${\rm C{\small YC}}_{= k}$ is an elementary $\textsf{DCP}$ constraint. Unfortunately, the same does not hold for ${\rm C{\small YC}}$ and, in general, expressions of the form
Although (126) is convex, it cannot be modelled in terms of basic $\textsf{DCP}$ functions. Consequently, heuristic approaches (as discussed above) or alternative convex programming techniques, like interior-point methods, should be applied instead. More labelled constructions, and ways to bypass this limitation, are discussed in Section 4.4.
6.2 Sampler construction
Given a combinatorial specification enriched with user-specified parameters and their target expectations, it is possible to mechanically compute the respective tuning values and compile dedicated samplers. To illustrate this point, we implemented a sampler compiler called $\textsf{Boltzmann}$ $\textsf{Brain}$ supporting algebraic, and in particular rational, specifications.
Constructed samplers for general algebraic specifications use the principle of anticipated rejection whereas samplers for rational specifications implement the idea of interruptible sampling, see [Reference Bodini, Genitrini and RolinBGR15]. In addition, both sampler types are endowed with near-optimal decision trees based on established branching probabilities [Reference Knuth and YaoKY76]. Combined, these optimisations lead to remarkably efficient samplers, supporting all of the algebraic specifications included in the current paper, see Section 7.
7. Applications
In this section, we present several examples illustrating the wide range of applications of our tuning techniques.
7.1 Polyomino tilings
We start with a benchmark example of a rational specification defining ${n\times 7}$ rectangular tilings using up to 126 different tile variants (a toy example of so-called transfer matrix models, cf. [Reference Flajolet and SedgewickFS09, Chapter V.6, Transfer matrix models]).
We begin the construction with defining the set T of admissible tiles. Each tile $t \in T$ consists of two horizontal layers. The base layer is a single connected block of width $w_t \leqslant 6$ . The second layer, placed on top of the base one, is a subset (possibly empty) of $w_t$ blocks, see Figure 4. For presentation purposes, each tile is given a unique, distinguishable colour.
Next, we construct the asserted rational specification following the general method of defining a deterministic automaton with one state per possible partial tiling configuration using the set T of available tiles. Tracking the evolution of attainable configurations as new tiles arrive, we connect relevant configurations by suitable transition rules in the automaton. Finally, we (partially) minimise the constructed automaton by removing states unreachable from the initial empty configuration. Once the automaton is created, we tune the tiling sampler such that the target colour frequencies are uniform, that is, each colour occupies, on average, approximately $\tfrac{1}{126}\approx 0.7936\%$ of the outcome tiling area. Figure 5 depicts an exemplary tiling generated by our sampler.
Remark 33. The automaton corresponding to our ${n \times 7}$ tiling sampler consists of more than 2,000 states and 28,000 transitions. Pushing this toy construction to its extreme, we were able to develop a sampler for ${n \times 9}$ tilings, using 1,022 different tiles. The corresponding automaton has more than 19,000 states and 357,000 edges.
We remark that both examples are a notable improvement over the work of Bodini and Ponty [Reference Bodini and PontyBP10], who were able to sample ${n \times 6}$ tilings using 7 different tiles, with a corresponding automaton consisting of roughly 1,500 states and 3,200 transitions.
7.2 Simply-generated trees with node degree constraints
Next, we give an example of simple varieties of plane trees with fixed sets of admissible node degrees, satisfying the general equation $y(z) = z\, \phi(y(z))$ .
Let us consider the case of plane trees where nodes have degrees in the set $D= \{{0,\ldots,9}\}$ , that is $\phi(y(z)) = a_0 + a_1 y(z) + a_2 {y(z)}^2 + \cdots + a_{9} {y(z)}^{9}$ . Here, the numbers $a_0, a_1, a_2, \ldots, a_{9} $ are non-negative real coefficients. We tune the corresponding algebraic specification so as to achieve a target frequency of 1% for all nodes of degrees $d \geqslant 2$ . Frequencies of nodes with degrees $d \leqslant 1$ are left undistorted. For presentation purposes, all nodes of equal degree are given the same unique, distinguishable colour. Figure 6 depicts two exemplary trees generated in this manner.
Empirical frequencies for the right tree of Figure 6 and a simply-generated tree of size between 10,000 and 10,050 with default node degree frequencies are included in Table 6.
We briefly remark that for this particular problem, Bodini, David and Marchal proposed a different, bit-optimal sampling procedure for random trees with given partition of node degrees [Reference Bodini, David and MarchalBDM16].
7.3 Variable distribution in plain $\lambda$ -terms
To exhibit the benefits of distorting the intrinsic distribution of various structural patterns in algebraic data types, we present an example specification defining so-called plain $\lambda$ -terms with explicit control over the distribution of de Bruijn indices.
In their nameless representation due to de Bruijn [Reference de BruijnBru72], $\lambda$-terms are defined by the formal grammar $L\, ::\!=\, \lambda L | (L L) | D$ where $D = \{{{\underline{\textsf {0}}},{\underline{\textsf {1}}},{\underline{\textsf {2}}},\ldots}\}$ is an infinite denumerable set of so-called indices (cf. [Reference Bendkowski, Grygiel, Lescanne and ZaioncBen+17, Reference Gittenberger and GoŁebiewskiGG16]). Assuming that we encode de Bruijn indices as sequences of successors of zero (i.e. use a unary base representation), the class $\mathcal{L}$ of plain $\lambda$-terms can be specified as $\mathcal{L} = \mathcal{Z} \mathcal{L} +\mathcal{Z} {\mathcal{L}}^2 + \mathcal{D} $ where $\mathcal{D} = \mathcal{Z}{\rm S{\small EQ}}(\mathcal{Z}) $ . In order to control the distribution of de Bruijn indices, we need a more explicit specification for them. For instance:
Here, we roll out the $k+1$ initial indices and assign distinct marking variables to each one of them, leaving the remainder sequence intact. In doing so, we are in a position to construct a sampler tuned to enforce a uniform distribution of 8% among all marked indices, that is indices ${\underline{\textsf {0}}},{\underline{\textsf {1}}},\ldots,{\underline{\textsf {8}}}$ , distorting in effect their intrinsic geometric distribution.
Figure 7 illustrates two random $\lambda$ -terms with such a new distribution of indices. For presentation purposes, each index in the left picture is given a distinct colour.
Empirical frequencies for the right term of Figure 7 and a plain $\lambda$-term of size between 10,000 and 10,050 with default de Bruijn index frequencies are included in Table 7.
Let us note that algebraic data types are an essential conceptual ingredient of various functional programming languages, such as Haskell or OCaml, and that the random generation of their inhabitants satisfying additional structural or semantic properties is one of the central problems in the field of property-based software testing (see, e.g. [Reference Claessen and HughesCH00, Pał12]). In such an approach to software quality assurance, programmer-declared function invariants (so-called properties) are checked using random inputs, generated according to some predetermined, though usually not rigorously controlled, distribution. In this context, our techniques provide a novel and effective approach to generating random algebraic data types with fixed average frequencies of type constructors. In particular, using our methods it is possible to boost the intrinsic frequencies of certain desired subpatterns or diminish those which are unwanted.
7.4 Weighted partitions
Integer partitions are one of the most intensively studied objects in number theory, algebraic combinatorics and statistical physics. Hardy and Ramanujan obtained the famous asymptotics which was later refined by Rademacher [Reference Flajolet and SedgewickFS09, Chapter VIII]. In his article [Reference VershikVer96], Vershik considers several combinatorial examples related to statistical mechanics and obtains the limit shape for a random integer partition of size $n $ with $\alpha \sqrt{n} $ parts and summands bounded by $\theta\sqrt{n} $ . Let us remark that Bernstein, Fahrbach, and Randall [Reference Bernstein, Fahrbach and RandallBFR18] have recently analysed the complexity of an exact-size Boltzmann sampler for weighted partitions. In the model of ideal gas, there are several particles (bosons) which form a so-called assembly of particles. The overall energy of the system is the sum of the energies $\Lambda = \sum_{i=1}^N \lambda_i$ where $\lambda_i $ denotes the energy of the i-th particle. We assume that energies are positive integers. Depending on the energy level $\lambda$ , there are $j(\lambda) $ possible available states for each particle; the function $j(\lambda) $ depends on the physical model. Since all the particles are indistinguishable, the generating function P(z) for the number of assemblies $p(\Lambda) $ with energy $\Lambda $ takes the form
In the model of a d-dimensional harmonic trap (also known as the Bose-Einstein condensation), according to [Reference Chase, Mekjian and ZamickCMZ99, Reference Haugset, Haugerud and AndersenHHA97, Reference Lucietti and RangamaniLR08], the number of states for a particle with energy $\lambda $ is $\binom{d + \lambda - 1}{\lambda} $ , so that each state can be represented as a multiset of $\lambda $ elements in d different colours. Accordingly, an assembly is a multiset of particles (since they are bosons and hence indistinguishable), and therefore the generating function for the number of assemblies takes the form
$P(z) = \prod_{\lambda \geqslant 1} \left(1 - z^{\lambda}\right)^{-\binom{d + \lambda - 1}{\lambda}};$
equivalently, once each of the d colours receives its own atom, the class of assemblies can be specified as ${\rm MS{\small ET}}\left({\rm MS{\small ET}}_{\geqslant 1}(\mathcal Z_1 + \cdots + \mathcal Z_d)\right)$ .
It is possible to control the expected frequencies of colours using our tuning procedure and to sample the resulting assemblies in the form of Young tableaux: each row corresponds to a particle, while the colouring of the row displays the multiset of colours it comprises; see Figure 8. We also generated weighted partitions of expected size 1000 (too large to display) with tuned frequencies of 5 colours; see Table 8.
Let us briefly explain our generation procedure. Boltzmann sampling for the outer ${\rm MS{\small ET}} $ operator is described in [Reference Flajolet, Fusy and PivoteauFFP07]. Sampling the inner ${\rm MS{\small ET}}_{\geqslant 1}(\mathcal Z_1 + \ldots + \mathcal Z_d) $ is more delicate. The generating function of this multiset can be written as
$\prod_{i=1}^{d} \frac{1}{1 - z_i} - 1.$
In order to correctly compute the branching probabilities, we introduce slack variables $s_1, \ldots, s_d $ satisfying $(1 + s_i) = (1 - z_i)^{-1}$ , that is $s_i = z_i(1 - z_i)^{-1}$ . Boltzmann samplers $\Gamma\mathcal S_i $ for the corresponding combinatorial classes $\mathcal S_i $ are essentially Boltzmann samplers for ${\rm S{\small EQ}}_{\geqslant1}(\mathcal Z_i) $ . Let us note that after expanding the brackets the expression becomes
$\prod_{i=1}^{d} (1 + s_i) - 1 = \sum_{\varnothing \neq S \subseteq \{1,\ldots,d\}} \, \prod_{i \in S} s_i.$
The total number of summands is $2^{d} - 1 $ , where each summand corresponds to the choice of a non-empty subset of colours. Finally, we pre-compute all the symmetric polynomials and efficiently handle the branching process in quadratic time using the dynamic programming approach discussed in Section 2.2.
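The sketch below illustrates one way to realise this branching for a single inner particle, under our own naming conventions (it is not the implementation of Section 6): suffix products of $(1+s_i)$ allow choosing the non-empty subset of present colours with the correct probability, after which each present colour receives a geometric ${\rm S{\small EQ}}_{\geqslant 1}$ multiplicity.

```python
import random

def sample_particle(s, rng=random):
    """Sample MSET_{>=1}(Z_1 + ... + Z_d) given slack weights s[i],
    where s[i] = z[i] / (1 - z[i]). Returns {colour: multiplicity}."""
    d = len(s)
    # suffix[i] = prod_{j >= i} (1 + s[j]); suffix[d] = 1 (empty product)
    suffix = [1.0] * (d + 1)
    for i in range(d - 1, -1, -1):
        suffix[i] = (1.0 + s[i]) * suffix[i + 1]
    multiset, included_any = {}, False
    for i in range(d):
        if included_any:
            p = s[i] / (1.0 + s[i])                    # independent choice
        else:
            # conditioned on the final subset being non-empty and on no
            # colour j < i being present
            p = s[i] * suffix[i + 1] / (suffix[i] - 1.0)
        if rng.random() < p:
            included_any = True
            m = 1                                       # SEQ_{>=1}(Z_i):
            while rng.random() < s[i] / (1.0 + s[i]):   # continue with
                m += 1                                  # probability z[i]
            multiset[i] = m
    return multiset
```

Note that $s_i/(1+s_i) = z_i$ , so the multiplicity loop is the usual geometric sampler for ${\rm S{\small EQ}}_{\geqslant 1}(\mathcal Z_i)$ .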
7.5 Multi-partite rooted labelled trees
Consider the family of rooted labelled trees in which the children of each node are not ordered. The exponential generating function T(z) of such trees satisfies the equation
$T(z) = z e^{T(z)}.$
In this example, we suggest an alteration of this model in which the nodes on each level have distinct colours. We consider a periodic colouring, where the levels $1, 2, \ldots, d $ have distinct colours and then the colours repeat, that is, the levels $d+1, \ldots, 2d $ have the same colours as the levels $1, 2, \ldots, d $ . Let $u_1, \ldots, u_d $ be the marking variables for the respective colours, and let $T_1, \ldots, T_d $ denote the multivariate exponential generating functions for trees whose root is coloured, respectively, with the colour $1, 2, \ldots, d $ . Then, these functions satisfy the system of functional equations
$T_i(z) = u_i z \exp\left(T_{i+1}(z)\right), \quad i \in \{1, \ldots, d\},$
where the index is understood cyclically, that is $T_{d+1} = T_1$ .
Using our software (see Section 6), we implement the multi-parametric tuning and sampling for $d = 10 $ , with the target proportions of vertices of the respective colours forming the arithmetic progression $0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15, 0.17, 0.19$ .
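For intuition, the Boltzmann sampler associated with this system takes a particularly simple form: a root of colour i receives a Poisson-distributed number of children of colour $i+1$ (cyclically) with mean $T_{i+1}(z)$ . The following sketch, with small illustrative parameter values in place of the actual tuned ones from Table 9, evaluates the $T_i$ by fixed-point iteration and generates a tree.

```python
import numpy as np

rng = np.random.default_rng(2024)
d = 3                                    # small illustrative instance
z = 0.3                                  # assumed, not a tuned value
u = np.array([0.6, 0.8, 1.0])            # assumed colour weights

# Evaluate T_i(z) = u_i * z * exp(T_{i+1}(z)) by fixed-point iteration;
# np.roll(T, -1)[i] equals T[(i + 1) % d].
T = np.zeros(d)
for _ in range(200):
    T = u * z * np.exp(np.roll(T, -1))

def sample_tree(colour):
    """Root of the given colour with Poisson(T_{colour+1}) children."""
    nxt = (colour + 1) % d
    k = rng.poisson(T[nxt])
    return (colour, [sample_tree(nxt) for _ in range(k)])

tree = sample_tree(0)
```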
An example of a resulting tree of size 1665 is shown in Figure 9, and the empirical frequencies of the colours inside this tree are given in Figure 10.
Note that even in such a simple example, where the dependency graph forms a cycle and the structure of the equations is symmetric, no simpler tuning procedure is available: no variable can be tuned separately from the others, and the resulting variable values do not form the same arithmetic pattern as the target weights. Let us also point out that a curious pattern in the numerical values of the tuning parameters can be observed in Table 9.
7.6 Otter trees
Starting from the seminal paper of Otter [Reference OtterOtt48], unlabelled tree-like structures have played an important role in chemistry, phylogenetics [Reference PennyPen82] and synthetic biology [Reference Fichtner, Voigt and SchusterFVS17]. Their study also informs new methods for the numerical solution of partial differential equations involving automatic differentiation and the construction of expression trees [Reference Mullier, Chapoutot and dit SandrettoMCA18].
In this section, we consider a relatively simple example: rooted unlabelled binary trees in which the children of each node are not ordered. Under the additional assumption that the leaves are coloured in d distinct colours, we obtain the following specification:
$\mathcal{T} = \mathcal{Z}\,(\mathcal{U}_1 + \cdots + \mathcal{U}_d) + {\rm MS{\small ET}}_2(\mathcal{T}),$
where the ${\rm MS{\small ET}}_2$ operator is defined as
$[{\rm MS{\small ET}}_2(\mathcal{T})](\boldsymbol{z}) = \frac{T(\boldsymbol{z})^2 + T(\boldsymbol{z}^2)}{2}.$
Recall that the corresponding univariate model was considered in [Reference Bodini, Lumbroso and RolinBLR15], where the authors constructed a system of quadratic equations that can be solved in reverse order. As in the previous example (see Section 7.5), we solve the multi-parametric tuning problem for $d = 10 $ , with the expected colour frequencies forming the arithmetic progression $0.01, 0.03, \ldots, 0.19 $ .
After setting an appropriate truncation level (the probability that a tree has nodes at level h decreases exponentially in h), we solve the tuning problem and generate the corresponding trees; see Figure 11 for a generated tree of size 1434 and Figure 12 for the empirical frequency histogram. The numerical values of the tuning parameters are given in Table 10.
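A minimal sketch of the truncated evaluation, under the assumptions above: one iterates the defining equation simultaneously at $z, z^2, z^4, \ldots$ , truncating the Pólya substitution at depth $h_{\max}$ .

```python
def otter_gf(z, colour_weights, depth=12, iters=200):
    """Evaluate T(z) for T(z) = z * (u_1 + ... + u_d)
    + (T(z)^2 + T(z^2)) / 2, truncating T at z^(2^depth)."""
    U = sum(colour_weights)
    zs = [z ** (2 ** h) for h in range(depth + 1)]
    T = [0.0] * (depth + 2)        # T[depth + 1] stays 0 (truncation)
    for _ in range(iters):
        for h in range(depth, -1, -1):
            T[h] = zs[h] * U + (T[h] ** 2 + T[h + 1]) / 2.0
    return T[0]
```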
7.7 Substitution-closed permutation classes
Permutation patterns form a growing body of research which originated in Knuth's study of sorting algorithms and, later, in the study of sorting networks by Tarjan and Pratt. Recently, the authors of [Bas17] obtained a method which allows one to automatically construct a combinatorial specification for permutation classes avoiding a given set of permutations.
Let us start by giving a few basic definitions. A permutation $\sigma =(\sigma(1), \sigma(2), \ldots, \sigma(n)) $ is said to contain a pattern $\pi = (\pi(1), \ldots, \pi(k)) $ if $\sigma $ has a subsequence whose terms have the same relative ordering as $\pi $ . An interval is a contiguous set of indices $\{ i \, \mid \, a \leqslant i \leqslant b\} $ such that the set of values $\{ \sigma(i) \, \mid \, a \leqslant i \leqslant b\} $ is also contiguous; a permutation of length $n$ is said to be simple if it contains no interval of length strictly between 1 and n. And so, for instance, the permutation from Figure 13 is not simple, since it contains an interval $\{ 1, 2, 3\} $ whose image is $\{5,6,7\} $ .
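A direct quadratic-time simplicity check follows immediately from the definition; the sketch below uses 0-indexed value tuples, a representation we adopt for illustration only.

```python
def is_simple(perm):
    """True iff the permutation (a 0-indexed tuple of values) has no
    interval of length strictly between 1 and n."""
    n = len(perm)
    for a in range(n):
        lo = hi = perm[a]
        for b in range(a + 1, n):
            lo, hi = min(lo, perm[b]), max(hi, perm[b])
            width = b - a + 1
            if width < n and hi - lo + 1 == width:
                return False       # values at positions a..b are contiguous
    return True

assert is_simple((1, 3, 0, 2))         # 2413, a smallest simple permutation
assert not is_simple((2, 3, 1, 0))     # 3421 contains the interval 34
```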
Many interesting permutation classes can be described in the augmented language of context-free specifications. Given a permutation $\sigma \in S_m $ and non-empty permutations $\tau_1, \ldots, \tau_m $ , the inflation of $\sigma $ by $\tau_1, \ldots, \tau_m $ is the permutation obtained by replacing each entry $\sigma(i) $ with an interval having the same relative ordering as $\tau_i $ . If $\tau_1, \ldots, \tau_m $ belong, respectively, to the classes $\mathcal{T}_1, \ldots, \mathcal{T}_m $ , such an inflation is denoted as $\sigma[\mathcal{T}_1, \ldots, \mathcal{T}_m] $ . While from the counting point of view, $\sigma[\mathcal{T}_1, \ldots, \mathcal{T}_m] $ is isomorphic to a Cartesian product $\mathcal{T}_1 \times \cdots \times \mathcal{T}_m $ , it is useful to explicitly keep the external permutation $\sigma $ for sampling and construction purposes.
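The inflation operation itself is straightforward to implement; the following sketch (again with 0-indexed tuples, our own convention) replaces each entry of $\sigma$ by a block of values ordered like the corresponding $\tau_i$ .

```python
def inflate(sigma, taus):
    """Inflation sigma[tau_1, ..., tau_m]: each entry sigma(i) becomes an
    interval with the relative ordering of tau_i (0-indexed tuples)."""
    m = len(sigma)
    sizes = [len(t) for t in taus]
    # offset[v]: how many output values are consumed by the blocks whose
    # sigma-entry is smaller than v
    order = sorted(range(m), key=lambda i: sigma[i])
    offset, acc = {}, 0
    for i in order:
        offset[sigma[i]] = acc
        acc += sizes[i]
    result = []
    for i in range(m):
        result.extend(offset[sigma[i]] + v for v in taus[i])
    return tuple(result)

# 21[12, 1] = 231: the larger entry becomes the increasing block 23.
assert inflate((1, 0), ((0, 1), (0,))) == (1, 2, 0)
```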
Interestingly, Albert and Atkinson describe a specification for any substitution-closed class of permutations [Reference Albert and AtkinsonAA05].
Proposition 34 ([Reference Albert and AtkinsonAA05, Lemma 11]). Suppose that a class $\mathcal{C} $ is substitution-closed and contains $12 $ and $21$ . Let $\mathcal{S} $ be the class of all simple permutations contained in $\mathcal{C} $ . Then, $\mathcal{C} $ satisfies the following system of equations
$\mathcal{C} = \mathcal{Z} + 12[\mathcal{C}^{+}, \mathcal{C}] + 21[\mathcal{C}^{-}, \mathcal{C}] + \sum_{\pi \in \mathcal{S}} \pi[\mathcal{C}, \ldots, \mathcal{C}],$
$\mathcal{C}^{+} = \mathcal{Z} + 21[\mathcal{C}^{-}, \mathcal{C}] + \sum_{\pi \in \mathcal{S}} \pi[\mathcal{C}, \ldots, \mathcal{C}],$
$\mathcal{C}^{-} = \mathcal{Z} + 12[\mathcal{C}^{+}, \mathcal{C}] + \sum_{\pi \in \mathcal{S}} \pi[\mathcal{C}, \ldots, \mathcal{C}],$
where $\mathcal{C}^{+} $ (respectively, $\mathcal{C}^{-} $ ) denotes the class of permutations in $\mathcal{C} $ that are not of the form $12[\tau_1, \tau_2] $ (respectively, $21[\tau_1, \tau_2] $ ), and the sums range over the simple permutations $\pi \in \mathcal{S} $ of length at least four.
It is, therefore, possible to endow each of the simple permutations $\pi \in\mathcal{S} $ with a distinguished marking variable $u_{\pi} $ and insert these variables into the specification, replacing each summand $\pi[\mathcal{C}, \ldots, \mathcal{C}] $ with $u_{\pi}\, \pi[\mathcal{C}, \ldots, \mathcal{C}] $ .
Finally, note that by tuning the expectations attached to $u_{\pi} $ we alter the expected frequencies of inflations used during the construction of a permutation.
Acknowledgements
We are grateful to Anne Bouillard, Matthieu Dien, Michael Grant, Hsien-Kuei Hwang, Élie de Panafieu, Martin Pépin, and Bruno Salvy for their valuable feedback and fruitful discussions. We are especially grateful to Cédric Chauve, Yann Ponty, and Sebastian Will for showing us numerous applications of Boltzmann sampling in biology, and to Sergey Tarasov for asking a question about the computational complexity of random sampling. We appreciate the anonymous referee's comments, which helped to improve the paper, in particular the comment about the existence of a solution to the tuning problem, which resulted in Remark 22.