
Exchangeable and sampling-consistent distributions on rooted binary trees

Published online by Cambridge University Press:  14 January 2022

Benjamin Hollering*
Affiliation:
North Carolina State University
Seth Sullivant*
Affiliation:
North Carolina State University
*Postal address: North Carolina State University, Department of Mathematics, Box 8205, Raleigh, NC 27695, USA.

Abstract

We introduce a notion of finite sampling consistency for phylogenetic trees and show that the set of finitely sampling-consistent and exchangeable distributions on n-leaf phylogenetic trees is a polytope. We use this polytope to show that the set of all exchangeable and sampling-consistent distributions on four-leaf phylogenetic trees is exactly Aldous’ beta-splitting model, and give a description of some of the vertices for the polytope of distributions on five leaves. We also introduce a new semialgebraic set of exchangeable and sampling-consistent models we call the multinomial model and use it to characterize the set of exchangeable and sampling-consistent distributions. Using this new model, we obtain a finite de Finetti-type theorem for rooted binary trees in the style of Diaconis’ theorem on finite exchangeable sequences.

Type
Original Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press on behalf of Applied Probability Trust

1. Introduction

Leaf-labelled binary trees, which are commonly called phylogenetic trees, are frequently used to represent the evolutionary relationships between species. In this paper we restrict our attention to rooted binary trees, and our label set for a tree with n leaves will always be $[n] = \{1,2, \ldots, n\}$ . We call these trees [n]-trees and denote the set of [n]-trees by $\textrm{RB}_\textrm{L}(n)$ .

Processes for generating random [n]-trees play an important role in phylogenetics. Two common examples are the uniform distribution (where a tree is chosen uniformly at random from among all trees in $\textrm{RB}_\textrm{L}(n)$ ) and the Yule–Harding distribution (a simple Markov branching process). Some other examples of random tree models include Aldous’ $\beta$ -splitting model [Reference Aldous, Aldous, Pemantle and Springer1], the $\alpha$ -splitting model [Reference Ford8], and the coalescent process (which generates trees with edge lengths) [Reference Wakeley15]. Two features common to all these random tree processes, and desirable for any such tree process, are that they are exchangeable and sampling consistent.

Let $p_n$ denote a probability distribution on $\textrm{RB}_\textrm{L}(n)$ . Exchangeability refers to the fact that relabelling the leaves of a tree does not change its probability. That is, for all $T \in \textrm{RB}_\textrm{L}(n)$ and $\sigma \in S_n$ , $p_n(T) = p_n(\sigma T)$ . Exchangeability is a natural condition since it does not allow the names of the species to play any special role in the probability distribution. A family of distributions $\{p_n\}_{n=2}^\infty$ on trees is sampling consistent if, for each n, the distribution $p_n$ on [n]-trees arises as the marginalization of the distribution $p_m$ on [m]-trees for every $m > n$ . That is, the probability of an [n]-tree T under $p_n$ can be written as

\begin{equation*}\pi_n(p_m) (T) = p_n^m(T) = \sum_{\{S \in \textrm{RB}_\textrm{L}(m) | T = S|_{[n]}\}} p_m(S).\end{equation*}

Sampling consistency is a natural condition for a random tree model because it means that randomly missing species do not affect the underlying distribution on the species that were observed.

The goal of this paper is to study the structure of finitely sampling-consistent distributions on rooted binary trees. In particular, we aim to obtain a finite de Finetti-type theorem for these trees in the style of Diaconis’ Theorem 3 in [Reference Diaconis6]. Our motivation is two-fold. First of all, there has been significant work on understanding the set of exchangeable, sampling-consistent distributions on other discrete objects, including rooted trees. A classic result in this theory is de Finetti’s theorem for infinitely exchangeable sequences of binary random variables, which shows that every subsequence of the infinite sequence can be expressed as a mixture of independent and identically distributed sequences. This does not hold for finitely exchangeable sequences, but Diaconis later developed a finite form of de Finetti’s theorem. He showed that if a finite exchangeable sequence of binary random variables, $\{X_i\}_{i=1}^{n}$ , can be extended to an exchangeable sequence, $\{X_i\}_{i=1}^{m}$ where $m > n$ , then the original sequence can be approximated with a mixture of independent and identically distributed sequences with error $O\big(\frac{1}{m}\big)$ [Reference Diaconis6]. A substantial amount of work has been done on exchangeable arrays (see [Reference Diaconis and Janson7] for example) as well, which has been used to prove de Finetti theorems for other discrete objects. For instance, Lauritzen, Rinaldo, and Sadeghi recently developed a de Finetti theorem for exchangeable random networks [Reference Lauritzen, Rinaldo and Sadeghi12].

As previously mentioned, there has already been considerable work characterizing exchangeable and sampling-consistent distributions on trees using weighted real trees as limit objects [Reference Forman9, Reference Forman, Haulk and Pitman10, Reference Haas, Miermont, Pitman and Winkel11]. In [Reference Haas, Miermont, Pitman and Winkel11], a characterization of the exchangeable and sampling-consistent Markov branching models we discuss in Section 3.1 is obtained. A true de Finetti theorem for trees is conjectured in [Reference Forman, Haulk and Pitman10] and proven in [Reference Forman9, Theorem 3]. The approach taken in these papers is to characterize all infinitely sampling-consistent distributions on trees using a limiting object called a weighted real tree. In this paper, we instead take a geometric and combinatorial approach to the study of exchangeable and finitely sampling-consistent distributions on binary trees and examine what happens as we take the limit.

A second motivation comes from the combinatorial phylogenetics problem of studying properties of the distribution of the maximum agreement subtree of pairs of random trees. Let $T \in \textrm{RB}_\textrm{L}(n)$ and $S \subseteq [n]$ . The restriction tree $T|_S$ is the rooted binary tree with leaf label set S obtained by removing all leaves of T not in S and suppressing all vertices of degree 2 except the root. Two trees $T_1, T_2 \in \textrm{RB}_\textrm{L}(n)$ agree on a set $S \subseteq [n]$ , called an agreement set, if $T_1|_S = T_2|_S$ . A maximum agreement set is an agreement set of largest cardinality for $T_1$ and $T_2$ ; its size is denoted $\textrm{MAST}(T_1,T_2)$ . If S is an agreement set with $|S| = \textrm{MAST}(T_1,T_2)$ , then the resulting tree $T_1|_S = T_2|_S$ is a maximum agreement subtree of $T_1$ and $T_2$ .
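To make these definitions concrete, here is a small brute-force sketch (our own illustration, not from the paper, and practical only for small n) that represents leaf-labelled trees as nested tuples and computes $\textrm{MAST}(T_1,T_2)$ by checking agreement on every subset of [n]:

```python
from itertools import combinations

def restrict(t, S):
    """Restriction tree t|_S: drop leaves outside S, suppress degree-2 vertices."""
    if not isinstance(t, tuple):
        return t if t in S else None
    kids = [c for c in (restrict(child, S) for child in t) if c is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def min_leaf(t):
    return t if not isinstance(t, tuple) else min(min_leaf(c) for c in t)

def canon(t):
    """Canonical form of a leaf-labelled tree (children sorted by smallest leaf)."""
    if not isinstance(t, tuple):
        return t
    return tuple(sorted((canon(c) for c in t), key=min_leaf))

def mast(t1, t2, n):
    """Size of a maximum agreement subtree, by brute force over subsets of [n]."""
    for k in range(n, 0, -1):
        for S in combinations(range(1, n + 1), k):
            if canon(restrict(t1, set(S))) == canon(restrict(t2, set(S))):
                return k

# The balanced and comb trees on four leaves agree on {1,2,3} but not on all of [4]:
T1 = ((1, 2), (3, 4))
T2 = (((1, 2), 3), 4)
print(mast(T1, T2, 4))  # 3
```

The subset enumeration is exponential in n; it is meant only to make the definition of agreement sets tangible, not as a practical MAST algorithm.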

Understanding the distribution of $\textrm{MAST}(T_1, T_2)$ for random tree distributions would help in conducting hypothesis tests that the similarity between the trees is no greater than the similarity between random trees. For example, it was suggested in [Reference de Vienne, Giraud and Martin5] that $\textrm{MAST}(T_1, T_2)$ could be used to test the hypothesis that no cospeciation occurred between a family of host species and a family of parasite species that prey on them. The study of the distribution of $\textrm{MAST}(T_1,T_2)$ for random trees $T_1,T_2$ is primarily conducted with the assumption that $T_1$ and $T_2$ are drawn from an exchangeable, sampling-consistent distribution on rooted binary trees. Bryant, McKenzie, and Steel began the study of the distribution of $\textrm{MAST}(T_1,T_2)$ and obtained some first bounds on $\mathbb{E}(\textrm{MAST}(T_1,T_2))$ for random trees $T_1$ and $T_2$ drawn from the uniform or Yule–Harding distributions [Reference Bryant, McKenzie and Steel3]. Later work on the distribution obtained an $O(\sqrt{n})$ upper bound for $\mathbb{E}(\textrm{MAST}(T_1,T_2))$ when $T_1$ and $T_2$ are drawn from any exchangeable, sampling-consistent distribution [Reference Bernstein, Ho, Long, Steel, St. John and Sullivant2]. A lower bound on the order of $\Omega(\sqrt{n})$ has been conjectured for all exchangeable, sampling-consistent distributions as well, but this remains an open problem. Our hope in pursuing this project is that developing a better understanding of the set of all exchangeable sampling-consistent distributions might shed light on this conjecture.

In this paper we study the structure of exchangeable, sampling-consistent distributions on leaf-labelled, rooted binary trees. We introduce the notion of a polytope of exchangeable and finitely sampling-consistent distributions. We use it to study the set of exchangeable and sampling-consistent distributions on trees, and get some characterizations for trees with a small number of leaves. We show that the set of all exchangeable and sampling-consistent distributions on four-leaf trees comes from the $\beta$ -splitting model that was first introduced by Aldous in [Reference Aldous, Aldous, Pemantle and Springer1]. We have not been able to find a similar characterization for exchangeable and sampling-consistent distributions on five-leaf trees, but we describe some of the vertices of the polytope of exchangeable and finitely sampling-consistent distributions. We also introduce a new exchangeable and sampling-consistent model on trees, called the multinomial model, and show that every sampling-consistent and exchangeable distribution can be realized as a convex combination of limits of sequences of multinomial distributions.

2. Exchangeability and finite sampling consistency

In this section we describe how the set of exchangeable distributions relates to the set of all distributions on leaf-labelled, rooted binary trees. We then introduce a notion of finite sampling consistency and discuss how it relates to traditional sampling consistency.

Recall that $\textrm{RB}_\textrm{L}(n)$ denotes the set of all leaf-labelled, rooted binary trees with label set [n], which we call [n]-trees, and that $|\textrm{RB}_\textrm{L}(n)| = (2n-3)!!$ . The set of all distributions on $\textrm{RB}_\textrm{L}(n)$ is the probability simplex $\Delta_{(2n-3)!!-1} \subseteq \mathbb{R}^{(2n-3)!!}$ , where the coordinates are indexed by [n]-trees. The symmetric group $S_n$ denotes the group of permutations of [n]. For each $\sigma \in S_n$ and $T \in \textrm{RB}_\textrm{L}(n)$ let $\sigma T$ denote the tree obtained by applying $\sigma$ to the leaf labels.
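As a sanity check on the count $|\textrm{RB}_\textrm{L}(n)| = (2n-3)!!$ , the following sketch (our own illustration, with trees encoded as nested tuples) enumerates all [n]-trees by attaching leaf k along every edge of each [k-1]-tree and compares the totals with the double factorial:

```python
def attach(t, leaf):
    """Yield every tree obtained by attaching `leaf` along an edge of t (or above its root)."""
    yield (t, leaf)  # new root above the old one
    if isinstance(t, tuple):
        left, right = t
        for new_left in attach(left, leaf):
            yield (new_left, right)
        for new_right in attach(right, leaf):
            yield (left, new_right)

def all_trees(n):
    """All leaf-labelled rooted binary trees on [n], built leaf by leaf."""
    trees = [1]
    for k in range(2, n + 1):
        trees = [s for t in trees for s in attach(t, k)]
    return trees

for n in range(2, 7):
    count = len(all_trees(n))
    double_fact = 1
    for j in range(2 * n - 3, 0, -2):
        double_fact *= j
    print(n, count, double_fact)  # counts agree: 1, 3, 15, 105, 945
```

The enumeration mirrors the standard counting argument: a tree on $k-1$ leaves has $2k-4$ edges plus the position above the root, so each step multiplies the count by $2k-3$ .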

Definition 2.1. A distribution p on $\textrm{RB}_\textrm{L}(n)$ is exchangeable if, for all permutations $\sigma \in S_n$ and [n]-trees $T \in \textrm{RB}_\textrm{L}(n)$ , $p(T) = p(\sigma T)$ . The set of all exchangeable distributions on $\textrm{RB}_\textrm{L}(n)$ is denoted $\mathcal{E}_n$ .

As previously mentioned, exchangeability requires that the probability of an [n]-tree under a particular distribution depend only on the shape of the tree. Thus we only need to consider distributions on the set of tree shapes. Let $\textrm{RB}_\textrm{U}(n)$ denote the set of unlabelled rooted binary trees, which we may also call trees or tree shapes. This idea is summarized in the next lemma, which is the [n]-tree analogue of [Reference Lauritzen, Rinaldo and Sadeghi12, Lemma 2].

Lemma 2.1. The set of exchangeable distributions on $\textrm{RB}_\textrm{L}(n)$ , $\mathcal{E}_n$ , is a simplex of dimension $|\textrm{RB}_\textrm{U}(n)|-1$ with coordinates indexed by tree shapes.

Proof. First, we define a distribution $p_{T} \in \mathcal{E}_n$ for each tree shape $T \in \textrm{RB}_\textrm{U}(n)$ . To do so, we let O(T) be the set of trees $T' \in \textrm{RB}_\textrm{L}(n)$ such that $\textrm{shape}(T') = T$ . For any tree $S \in \textrm{RB}_\textrm{L}(n)$ we set

\begin{equation*}p_T(S) = \begin{cases} \frac{1}{|O(T)|} & \textrm{shape}(S) = T , \\ 0 & \textrm{shape}(S) \neq T. \end{cases}\end{equation*}

Then $p_T \in \mathcal{E}_n$ since it is a probability distribution on trees, and all trees of the same shape have the same probability. We claim that $\mathcal{E}_n = \textrm{conv}\left( \{p_T \colon T \in \textrm{RB}_\textrm{U}(n) \} \right)$ , where $\textrm{conv}(A)$ denotes the convex hull of the set A. Since $p_T \in \mathcal{E}_n$ for all $T \in \textrm{RB}_\textrm{U}(n)$ , it is enough to show that any distribution $p \in \mathcal{E}_n$ can be written as a convex combination of the $p_T$ . If $p \in \mathcal{E}_n$ , then the probability of any tree $T' \in \textrm{RB}_\textrm{L}(n)$ depends only on the shape of $T^{\prime}$ not the leaf labelling, so we can write $p = \sum_{T \in \textrm{RB}_\textrm{U}(n)} p(T) p_T$ , where p(T) is $|O(T)|$ times the probability of any [n]-tree in $\textrm{RB}_\textrm{L}(n)$ with shape T. Since the original p is a probability distribution on all leaf-labelled trees, the weights in the linear combination are nonnegative and sum to 1. Lastly, we note that the vectors $p_T$ are affinely independent since there is no overlap of coordinate indices where the entries in $p_T$ are nonzero. So $\mathcal{E}_n = \textrm{conv}\left( \{p_T \colon T \in \textrm{RB}_\textrm{U}(n) \} \right)$ is a simplex and has coordinates indexed by $\textrm{RB}_\textrm{U}(n)$ .

Lemma 2.1 allows us to move from studying exchangeable distributions on leaf-labelled [n]-trees to all distributions on unlabelled trees. We will primarily focus on understanding the set of sampling-consistent distributions within $\mathcal{E}_n$ now. First, recall that for $p_m \in \mathcal{E}_m$ the marginalization or projection map $\pi_n$ gives a new distribution $p_n^m$ on $\textrm{RB}_\textrm{L}(n)$ for $n < m$ , defined for all $T \in \textrm{RB}_\textrm{L}(n)$ by

\begin{equation*}\pi_n(p_m)(T) = \sum_{\{S \in \textrm{RB}_\textrm{L}(m) | T = S|_{[n]}\}} p_m(S) .\end{equation*}

We will use this marginalization map to define a notion of finite sampling consistency.

Definition 2.2. A family of distributions $\{p_k\}_{k=n}^m$ is finitely sampling consistent or m-sampling consistent if, for each $n \leq k < m$ , $p_k = \pi_k(p_m)$ . We denote the set of all distributions in $\mathcal{E}_n$ that are m-sampling consistent by $\mathcal{E}_n^m = \pi_n(\mathcal{E}_m)$ .

It is immediate that if a distribution in $\mathcal{E}_n$ is m-sampling consistent, then for any k such that $n < k < m$ , the distribution is also k-sampling consistent. This leads to the following result.

Lemma 2.2. For all $m > k > n$ , $\mathcal{E}_n^m \subseteq \mathcal{E}_n^k$ .

A distribution in $\mathcal{E}_n$ is sampling consistent if it is part of an m-sampling-consistent family of distributions for all $m>n$ . In other words, a distribution is sampling consistent if it is in $\mathcal{E}_n^m$ for all $m>n$ . Thus, we can define the following notation for the set of exchangeable distributions on $\textrm{RB}_\textrm{L}(n)$ that are sampling consistent: $\mathcal{E}_n^\infty := \cap_{m = n}^\infty \mathcal{E}_n^m$ .

Lemma 2.3. Let $p_T \in \mathcal{E}_m$ be defined as in Lemma 2.1. Then

\begin{equation*}\mathcal{E}_n^m = \textrm{conv}\left(\{\pi_n(p_T) \colon T \in \textrm{RB}_\textrm{U}(m) \} \right).\end{equation*}

Proof. Clearly, $\textrm{conv}(\{\pi_n(p_T) \colon T \in \textrm{RB}_\textrm{U}(m) \}) \subseteq \mathcal{E}_n^m$ since $\pi_n(p_T) \in \mathcal{E}_n^m$ for all $T \in \textrm{RB}_\textrm{U}(m)$ . It is enough to show that if we have a distribution $p_n^m \in \mathcal{E}_n^m$ , then it can be written as a convex combination of the $\pi_n(p_T)$ . If $p_n^m \in \mathcal{E}_n^m$ , then there exists $p_m \in \mathcal{E}_m$ such that $\pi_n(p_m) = p_n^m$ . Since $p_m \in \mathcal{E}_m$ , we know from Lemma 2.1 that we can write $p_m = \sum_{T \in \textrm{RB}_\textrm{U}(m)} p_m(T) \cdot p_T$ . Then, evaluating $\pi_n(p_m)$ at an [n]-tree $S \in \textrm{RB}_\textrm{L}(n)$ gives

\begin{equation*}\pi_n(p_m)(S) = \sum_{\{Q \in \textrm{RB}_\textrm{L}(m) | S = Q|_{[n]}\}} \sum_{T \in \textrm{RB}_\textrm{U}(m)}p_m(T) p_T(Q) .\end{equation*}

Changing the order of summation, we have

\begin{equation*}\pi_n(p_m)(S) = \sum_{T \in \textrm{RB}_\textrm{U}(m)} p_m(T) \sum_{\{Q \in \textrm{RB}_\textrm{L}(m) | S = Q|_{[n]}\}} p_T(Q) ,\end{equation*}

but $\sum_{\{Q \in \textrm{RB}_\textrm{L}(m) | S = Q|_{[n]}\}} p_T(Q) = \pi_n(p_T)(S)$ , so we get that

\begin{equation*}\pi_n(p_m)(S) = \sum_{T \in \textrm{RB}_\textrm{U}(m)} p_m(T) (\pi_n(p_T)(S)) ,\end{equation*}

which shows that $p_n^m = \pi_n(p_m)$ can be written as a convex combination of the $\pi_n(p_T)$ .

Example 2.1. While it is the case that $\mathcal{E}_n^m = \textrm{conv}(\{\pi_n(p_T) \colon T \in \textrm{RB}_\textrm{U}(m) \})$ , not every $\pi_n(p_T)$ will be a vertex of $\mathcal{E}_n^m$ . Figure 1 illustrates this.

Figure 1. The projection of $\mathcal{E}_5^7$ onto the first two coordinates of the simplex $\mathcal{E}_5$ . The gray points correspond to the points $\pi_n(p_T)$ for $T \in \textrm{RB}_\textrm{U}(7)$ . The vertices of the simplex are labelled with the corresponding unlabelled tree. Note that the balanced tree is at the origin since we’ve projected onto the coordinates corresponding to the other two trees.

Lemma 2.3 implies that understanding how the marginalization map acts on the vertices of $\mathcal{E}_m$ will allow us to compute all of $\mathcal{E}_n^m$ . The following lemma and corollary will give us a method for calculating the vertices of $\mathcal{E}_n^m$ by computing subtree densities.

Lemma 2.4. Let $S \in \textrm{RB}_\textrm{L}(n)$ and $T \in \textrm{RB}_\textrm{U}(m)$ . Also let $c_T(S) = |\{Q \in \textrm{RB}_\textrm{L}(m) \mid S = Q|_{[n]}, \textrm{shape}(Q) = T \}|$ . Then $\pi_n(p_T)(S) = \frac{c_T(S)}{|O(T)|}$ .

Proof. By the definition of the map $\pi_n$ , $\pi_n(p_T)(S) = \sum_{\{Q \in \textrm{RB}_\textrm{L}(m) | S = Q|_{[n]}\}} p_T(Q)$ , but $p_T(Q)$ is nonzero if and only if $\textrm{shape}(Q) = T$ , in which case it is $\frac{1}{|O(T)|}$ . So the above sum becomes

\begin{equation*}\pi_n(p_T)(S) = \sum_{\{Q \in \textrm{RB}_\textrm{L}(m) \mid S = Q|_{[n]},\ \textrm{shape}(Q) = T \}} \frac{1}{|O(T)|} = \frac{c_T(S)}{|O(T)|}.\end{equation*}

Corollary 2.1. Let $S' \in \textrm{RB}_\textrm{U}(n)$ and $T \in \textrm{RB}_\textrm{U}(m)$ . Then $\pi_n(p_T)(S')$ , which denotes the sum of $\pi_n(p_T)(S)$ over all $S \in O(S')$ , is the induced subtree density of $S'$ in T. That is, for any fixed $Q \in O(T)$ ,

\begin{equation*}\pi_n(p_T)(S') = \frac{|\{ I \subseteq [m] \colon |I| = n\ \textrm{ and }\ \textrm{shape}(Q|_I) = S' \}|}{\binom{m}{n}}.\end{equation*}

Figure 2. (a) A five-leaf tree. (b) The two tree shapes for four-leaf trees.

Proof. From the previous lemma, we know that for any $S \in O(S')$ , $\pi_n(p_T)(S) = \frac{c_T(S)}{|O(T)|}$ where $c_T(S) = |\{Q \in \textrm{RB}_\textrm{L}(m) \mid S = Q|_{[n]}, \textrm{shape}(Q) = T \}|$ . Then we have

\begin{equation*}\pi_n(p_T)(S') = \sum_{S \in O(S')} \frac{c_T(S)}{|O(T)|} .\end{equation*}

So, for each labelling S of $S'$ , we are counting the fraction of labellings of T that yield S when restricted to [n]. Summing over all labellings S of $S'$ gives the total fraction of times that the shape $S'$ appears as a restriction tree of the shape T when $(m-n)$ of its leaves are marginalized out, which is exactly

\begin{equation*}\frac{|\{ I \subseteq [m] \colon |I| = n\ \textrm{ and }\ \textrm{shape}(Q|_I) = S' \}|}{\binom{m}{n}}.\end{equation*}

The following example elucidates what is meant by induced subtree density, and shows how we can explicitly calculate this quantity.

Example 2.2. We show how to find the projection of one vertex of $\mathcal{E}_5$ down to $\mathcal{E}_4$ ; $\mathcal{E}_4^5$ is the convex hull of the projections of all of the vertices of $\mathcal{E}_5$ . Begin with the tree shape T pictured in Figure 2(a). We label the leaves of T for the sake of the calculation, but it should be thought of as an unlabelled tree. We then find that the restriction to $\{1,2,3,4\}, \{1,2,3,5\}, \{1,2,4,5\}$ gives the shape $\textrm{Bal}_4$ , and the restriction to the sets $\{1,3,4,5\}, \{2,3,4,5\}$ gives the shape $\textrm{Comb}_4$ , pictured in Figure 2(b). We let the first coordinate of $\mathcal{E}_4$ be the probability of obtaining $\textrm{Comb}_4$ and the second be the probability of obtaining $\textrm{Bal}_4$ . As mentioned above, these probabilities are simply the number of times each shape appears as a restriction tree divided by the total number of restriction trees. Thus, this vertex of $\mathcal{E}_5$ gives the distribution $(2/5,3/5)$ in $\mathcal{E}_4$ .
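This computation can be reproduced mechanically. The sketch below is our own illustration: it encodes the tree of Figure 2(a) as the nested tuple ((1,2),((3,4),5)) (a labelling consistent with the restrictions listed above, assumed here since the figure is not reproduced), restricts it to every four-element leaf set, and tallies the resulting shapes:

```python
from fractions import Fraction
from itertools import combinations

def restrict(t, S):
    """Restriction tree t|_S: drop leaves outside S, suppress degree-2 vertices."""
    if not isinstance(t, tuple):
        return t if t in S else None
    kids = [c for c in (restrict(child, S) for child in t) if c is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)

def shape(t):
    """Unlabelled shape: replace leaves by 0 and order children canonically."""
    if not isinstance(t, tuple):
        return 0
    return tuple(sorted((shape(c) for c in t), key=repr))

# A labelling consistent with Figure 2(a): dropping leaf 3, 4, or 5 gives the
# balanced shape, while dropping leaf 1 or 2 gives the comb shape.
T = ((1, 2), ((3, 4), 5))
bal4 = shape(((1, 2), (3, 4)))
comb4 = shape((1, (2, (3, 4))))

counts = {comb4: 0, bal4: 0}
for S in combinations(range(1, 6), 4):
    counts[shape(restrict(T, set(S)))] += 1
total = sum(counts.values())
print(Fraction(counts[comb4], total), Fraction(counts[bal4], total))  # 2/5 3/5
```

The same loop, run over every shape in $\textrm{RB}_\textrm{U}(m)$ , computes all of the points $\pi_n(p_T)$ described in Corollary 2.1.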

We have now seen how to compute the vertices of $\mathcal{E}_n^m$ explicitly, but not every distribution $\pi_n(p_T)$ is a vertex of $\mathcal{E}_n^m$ . However, the comb tree always yields a vertex of $\mathcal{E}_n^m$ .

Lemma 2.5. For all $m \geq n$ , let $\textrm{Comb}_m \in \textrm{RB}_\textrm{U}(m)$ be the m-leaf comb tree. Then $\pi_n(p_{\textrm{Comb}_m})$ is a vertex in $\mathcal{E}_n^m$ .

Proof. The comb tree has only smaller comb trees as restriction trees, so the image of the comb distribution on m leaves under the marginalization map will be the comb distribution on n leaves. Since $p_{\textrm{Comb}_n}$ is a vertex of $\mathcal{E}_n$ and $\mathcal{E}_n^m$ is a subset of $\mathcal{E}_n$ , $p_{\textrm{Comb}_n}$ is also a vertex of $\mathcal{E}_n^m$ .

3. Examples of exchangeable and sampling-consistent distributions

In this section we discuss some of the well-known exchangeable and sampling-consistent families of distributions, particularly the Markov branching models. We also introduce a new family of exchangeable sampling-consistent tree distributions, namely the multinomial family.

3.1. Markov branching models

An important example of sampling-consistent and exchangeable distributions is the family of Markov branching models, first introduced in [Reference Aldous, Aldous, Pemantle and Springer1], which can be constructed in the following way.

Suppose that, for every integer $n \geq 2$ , we have a probability distribution $q_n = (q_n(i) \colon i = 1,2, \ldots , n-1)$ on $\{1,2,\ldots,n-1\}$ which satisfies $q_n(i) = q_n(n-i)$ . Using this family of distributions we can define a probability distribution on $\textrm{RB}_\textrm{U}(n)$ by taking the probability that i leaves fall on the left of the root-split and $n-i$ leaves fall on the right of the root-split to be $q_n(i)$ , with each choice of i labels to fall on the left having the same probability. Repeating recursively in each branch yields the probability of a rooted binary tree. Aldous called these models Markov branching models.
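The recursive construction just described can be sketched as a sampler for tree shapes. The code below is our own illustration, not the paper's; it takes a family of splitting rules q as input and, purely as an example, uses the uniform rules $q_n(i) = 1/(n-1)$ :

```python
import random

def sample_shape(n, q):
    """Sample an unlabelled tree shape from splitting rules q[n] = [q_n(1), ..., q_n(n-1)]."""
    if n == 1:
        return 0  # a single leaf
    sizes = list(range(1, n))
    i = random.choices(sizes, weights=q[n])[0]  # size of one block of the root-split
    return (sample_shape(i, q), sample_shape(n - i, q))

def num_leaves(t):
    return 1 if t == 0 else sum(num_leaves(c) for c in t)

# Illustrative symmetric splitting rules: q_n(i) = 1/(n-1) for all i.
q = {n: [1.0 / (n - 1)] * (n - 1) for n in range(2, 9)}
t = sample_shape(8, q)
print(num_leaves(t))  # 8
```

Since only the shape is sampled, the uniform assignment of the $\binom{n}{i}$ label choices plays no role here; extending the sketch to labelled trees would require distributing labels uniformly at each split.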

Haas et al. classified the sampling-consistent Markov branching models on rooted binary trees in [Reference Haas, Miermont, Pitman and Winkel11]. They showed that every sampling-consistent Markov branching model defined by the splitting rules $q_n$ , $n \geq 2$ , has an integral representation of the form

(3.1) \begin{equation} q_n(i) = a_n^{-1} \bigg(\binom{n}{i}\int_{0}^{1}x^i (1-x)^{n-i} \nu(\textrm{d}x) + nc\textbf{1}_{i=1} \bigg) ,\end{equation}

where $c \geq 0$ , $\nu$ is a symmetric measure on (0,1) such that $\int_{0}^{1}x (1-x) \nu(\textrm{d}x) < \infty$ , and $a_n$ is a normalization constant; $c\textbf{1}_{i=1}$ accounts for the comb distribution. A subclass of these models are those where the measure $\nu$ in (3.1) has the form $\nu(\textrm{d}x) = f(x)\textrm{d}x$ for a probability density function f on (0,1) that is symmetric on the interval (i.e. $f(x) = f(1-x)$ ) and where $c=0$ . These Markov branching models can be thought of as uniformly choosing n points in the interval (0,1) at random and then splitting the interval with respect to the density f. Repeating the splitting process recursively in each subinterval until each of the original n points is contained in its own subinterval gives a tree shape. This process is pictured in [Reference Aldous, Aldous, Pemantle and Springer1, Figure 6].
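This interval-splitting description can also be simulated directly. The following sketch is our own illustration; it uses a Beta(2,2) split density (so $f(x) \propto x(1-x)$ , which is symmetric) purely as an example, and suppresses splits that fail to separate the points, matching the suppression of degree-2 vertices:

```python
import random

def interval_split_shape(n, draw_x=lambda: random.betavariate(2, 2)):
    """Sample a tree shape by recursively splitting (0,1) at fractions from draw_x."""
    pts = [random.random() for _ in range(n)]  # n uniform points in (0,1)

    def split(pts, lo, hi):
        if len(pts) == 1:
            return 0
        x = lo + (hi - lo) * draw_x()  # split point, density rescaled to (lo, hi)
        left = [p for p in pts if p < x]
        right = [p for p in pts if p >= x]
        if not left:       # trivial split: recurse into the occupied subinterval
            return split(right, x, hi)
        if not right:
            return split(left, lo, x)
        return (split(left, lo, x), split(right, x, hi))

    return split(pts, 0.0, 1.0)

def num_leaves(t):
    return 1 if t == 0 else sum(num_leaves(c) for c in t)

print(num_leaves(interval_split_shape(10)))  # 10
```

The recursion terminates with probability 1, since each split separates the points with positive probability.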

One particularly important family of Markov branching distributions is the beta-splitting model. It is a Markov branching model that belongs to the subclass mentioned above where the function f in the above description has the form

\begin{equation*}f(x) = \frac{\Gamma(2\beta +2)}{\Gamma^{2}(\beta+1)}x^{\beta}(1-x)^{\beta}\end{equation*}

for $-1 < \beta < \infty$ . For the beta-splitting model we can calculate the values $q_n(i)$ explicitly in terms of $\beta$ . By plugging the beta-splitting density function f into (3.1) for $q_n(i)$ we get the following formulas:

(3.2) \begin{equation}q_n(i) = a_n^{-1} \binom{n}{i} \frac{\Gamma(\beta + i + 1)\Gamma(\beta + n -i +1)\Gamma(2\beta+2)}{\Gamma(2\beta+n+2)\Gamma^{2}(\beta+1)}\end{equation}

for $-1 < \beta < \infty$ . Note that (3.2) can be analytically continued to $-2 < \beta \leq -1$ , and so it is natural to extend the beta-splitting model to those values of $\beta$ . As $\beta$ approaches $-2$ , the beta-splitting model approaches the distribution which puts all probability on the comb tree, so we also include $\beta = -2$ in the beta-splitting model as the comb distribution.

An important note here is that for the beta-splitting model each $q_n(i)$ is actually a rational function in $\beta$ . Using properties of the gamma function one can see that the above formula simplifies to

\begin{equation*}q_n(i) = \frac{\binom{n}{i}(i+\beta)_i (n-i+\beta)_{n-i}}{(n+2\beta+1)_n - 2(n+\beta)_n} ,\end{equation*}

where $(x)_k = x(x-1)\cdots(x-k+1)$ denotes the falling factorial.

Since each $q_n(i)$ is a rational function in $\beta$ , we can see that the probability of obtaining a certain tree shape is also a rational function in $\beta$ , because the probability of obtaining that tree shape under the beta-splitting model is simply the product of the probability of all of the splits in the tree.

Example 3.1. Let $\textrm{Comb}_4$ and $\textrm{Bal}_4$ be the trees pictured in Figure 2(b). Then the probabilities of obtaining them under the beta-splitting model are

\begin{align*} p(\textrm{Comb}_4) & = 2q_4(1)=\frac{12+4\beta}{18+7\beta} , \\[3pt] p(\textrm{Bal}_4) & = q_4(2)=\frac{6 + 3\beta}{18+7\beta} .\end{align*}
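These two rational expressions can be checked numerically against the integral representation (3.2). The sketch below is our own illustration at the sample value $\beta = 1$ ; the i-independent constants in (3.2) are dropped since they cancel upon normalizing:

```python
from math import comb, gamma

def q4_weight(i, beta):
    """Unnormalized q_4(i) from (3.2); the i-independent factors cancel on normalizing."""
    return comb(4, i) * gamma(beta + i + 1) * gamma(beta + 4 - i + 1)

beta = 1.0
weights = [q4_weight(i, beta) for i in (1, 2, 3)]
total = sum(weights)
p_comb4 = 2 * weights[0] / total   # 2 * q_4(1)
p_bal4 = weights[1] / total        # q_4(2)

# Compare with the rational expressions from Example 3.1:
print(round(p_comb4, 10), round((12 + 4 * beta) / (18 + 7 * beta), 10))  # both 0.64
print(round(p_bal4, 10), round((6 + 3 * beta) / (18 + 7 * beta), 10))    # both 0.36
```

At $\beta = 0$ this recovers the Yule–Harding values $p(\textrm{Comb}_4) = 2/3$ and $p(\textrm{Bal}_4) = 1/3$ .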

This model also has a nice characterization among all of the sampling-consistent Markov branching models. In [Reference McCullagh, Pitman and Winkel14], McCullagh, Pitman, and Winkel showed that the beta-splitting models are the only sampling-consistent Markov branching models whose splitting rules admit a particular factorization.

We are interested in examining how the sampling-consistent Markov branching models, and in particular the beta-splitting model, fit inside $\mathcal{E}_n$ as a whole. These distributions are infinitely sampling consistent and so lie in $\mathcal{E}_n^\infty$ as well. A priori, it might seem that to determine the probability of a tree shape with n leaves under a Markov branching model one would need not only the distribution $q_n$ but also the distributions $q_k$ for $2 \leq k \leq n-1$ . However, for sampling-consistent Markov branching models this is not necessary. Proposition 41 of [Reference Ford8] showed that if $(q_k \mid 2 \leq k \leq n)$ are the splitting rules for a distribution in $\mathcal{E}_n^\infty$ , then in fact it must be that

\begin{equation*} q_{n-1}(i) = \frac{(n-i)q_n(i) + (i+1)q_n(i+1)}{n-2q_n(1)} .\end{equation*}

This implies that all that is needed to define a distribution in $\mathcal{E}_n^\infty$ is the first splitting rule $q_n$ , which gives the following corollary.

Corollary 3.1. The dimension of the set of all sampling-consistent Markov branching models in $\mathcal{E}_n$ is at most $\big\lceil {\frac{n-1}{2}} \big\rceil -1$ .

Proof. As explained above, a Markov branching model in $\mathcal{E}_n^\infty$ is completely determined by the distribution $q_n = (q_n(i) \colon i = 1,2, \ldots , n-1)$ , which determines all of the distributions $q_k = (q_k(i) \colon i = 1,2, \ldots , k-1)$ for $2 \leq k \leq n-1$ . Since $q_n$ must be symmetric, we immediately get that the values $q_n(1),q_n(2), \ldots, q_n({\big\lceil {\frac{n-1}{2}} \big\rceil})$ determine all of $q_n$ . Also, since $q_n$ must be a distribution, we lose one of these as a free parameter, and thus the dimension of the set of sampling-consistent Markov branching models is bounded above by $\big(\big\lceil {\frac{n-1}{2}} \big\rceil -1\big)$ .

Note that when $n = 4$ , the space of sampling-consistent Markov branching models has dimension 1. We will see in Section 4 that the set of beta-splitting models is equal to the set of sampling-consistent Markov branching models in this case.
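Ford's recursion above can likewise be sanity-checked numerically against the beta-splitting rules. The sketch below is our own illustration at the sample value $\beta = 1$ , computing $q_n(i)$ from (3.2) up to the normalizing constant:

```python
from math import comb, gamma

def q(n, i, beta):
    """Normalized beta-splitting rule q_n(i), computed from (3.2)."""
    w = [comb(n, j) * gamma(beta + j + 1) * gamma(beta + n - j + 1)
         for j in range(1, n)]
    return w[i - 1] / sum(w)

beta = 1.0
n = 5
# Ford's recursion: q_{n-1}(i) = ((n - i) q_n(i) + (i + 1) q_n(i + 1)) / (n - 2 q_n(1))
for i in range(1, n - 1):
    lhs = q(n - 1, i, beta)
    rhs = ((n - i) * q(n, i, beta) + (i + 1) * q(n, i + 1, beta)) / (n - 2 * q(n, 1, beta))
    print(i, round(lhs, 12), round(rhs, 12))  # lhs and rhs agree for each i
```

Repeating the loop for other values of n and $\beta$ gives the same agreement, consistent with $q_n$ alone determining the whole family of splitting rules.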

3.2. The multinomial model

The multinomial model associates to each tree shape $T \in \textrm{RB}_\textrm{U}(m)$ , for any $m \geq 2$ , a family of probability distributions on $\textrm{RB}_\textrm{U}(n)$ for each n. We first add an extra leaf to the root of T to obtain a new tree, which we denote by $\tilde{T}$ . We then associate to every edge e in $\tilde{T}$ a parameter $t_e \geq 0$ . This gives us a vector of parameters $t = (t_e \mid e \in E(\tilde{T}))$ of length $2m-1$ , and we assume that $\sum_e t_e = 1$ , so that these parameters give a probability distribution on the edges of $\tilde{T}$ . We will now use this probability distribution to define a set of distributions on $\textrm{RB}_\textrm{U}(n)$ for any $n \geq 2$ . Note that n and m do not have to be related to each other.

Using the distribution t, we draw a multiset A of n edges from the tree $\tilde{T}$ , where edge e occurs with probability $t_e$ . There is a natural way to take the tree $\tilde{T}$ and a multiset A of size n of its edges and construct a new tree, which we will call $\tilde{T}_A \in \textrm{RB}_\textrm{U}(n)$ . For each time that an edge e appears in A, we add a new leaf along the edge e, which gives a tree with n additional leaves. We then take $\tilde{T}_A$ to be the induced subtree on only the n leaves that come from A. Hence, the multinomial model on the tree T gives a way to produce random trees whose underlying skeleton is the tree T. For large n, the resulting random trees look like T with many extra leaves added.

The multinomial probability of observing a particular multiset of edges A is the monomial

\begin{equation*}p_A = \binom{n}{m_A}\prod_{e \in \tilde{T}} t_e^{m_A(e)} ,\end{equation*}

where $m_A(e)$ denotes the number of times that e appears in the multiset A, and $m_A = (m_A(e) \mid e \in E(\tilde{T}))$ is the vector of these multiplicities.

Letting $M_n^{\tilde{T}}$ be the set of all n-element multisets of edges of $\tilde{T}$ , we can calculate the probability of observing any particular tree shape S by

\begin{equation*}p_{\tilde{T},t}(S) = \sum_{\substack{A \in M_n^{\tilde{T}} \\ \tilde{T}_A = S}} p_A.\end{equation*}

Example 3.2. Consider the tree $\tilde{T}$ from Figure 3(b) with edge parameters $(t_1,t_2,t_3)$ . To calculate the probability of the tree, $\textrm{Bal}_5$ , in Figure 3(c) we use the formula

\begin{equation*}p_{\tilde{T},t}(\textrm{Bal}_5) = \sum_{\substack{A \in M_5^{\tilde{T}} \\ \tilde{T}_A = \textrm{Bal}_5}} p_A.\end{equation*}

The only multisets that satisfy this condition are $A_1 = \{2,2,2,3,3\}$ and $A_2 = \{2,2,3,3,3\}$ . This is because if 1 appears in a multiset A any positive number of times, the tree $\tilde{T}_A$ will have a single leaf on one side of the root and four leaves on the other side, regardless of which other edges appear in the multiset. Thus, $A_1$ and $A_2$ are the only elements of $M_5^{\tilde{T}}$ that we sum over, so

\begin{equation*}p_{\tilde{T},t}(\textrm{Bal}_5) =\binom{5}{3,2}t_2^3 t_3^2 + \binom{5}{2,3}t_2^2 t_3^3 .\end{equation*}
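This computation can be verified by brute force. The sketch below (variable names are ours) enumerates all five-element multisets on the three edge parameters and, following the case analysis above, keeps exactly those that produce $\textrm{Bal}_5$, namely $A_1$ and $A_2$.

```python
from math import comb, isclose
from itertools import combinations_with_replacement

t1, t2, t3 = 0.2, 0.3, 0.5  # an arbitrary point of the parameter simplex

# closed form from Example 3.2
closed_form = comb(5, 3) * t2**3 * t3**2 + comb(5, 2) * t2**2 * t3**3

# brute force: a multiset A yields Bal_5 exactly when edge 1 is absent
# and edges 2 and 3 each appear at least twice, i.e. A is {2,2,2,3,3}
# or {2,2,3,3,3}
total = 0.0
for A in combinations_with_replacement([1, 2, 3], 5):
    if 1 not in A and A.count(2) >= 2 and A.count(3) >= 2:
        k = A.count(2)  # multiplicity of edge 2; edge 3 appears 5 - k times
        total += comb(5, k) * t2**k * t3**(5 - k)

assert isclose(total, closed_form)
```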

The multinomial model gives a family of distributions as we let the parameter vector t range over the entire simplex. Equivalently, the model can be described as the image of the simplex under the polynomial map $p_{\tilde{T}}\colon \Delta_{|E(\tilde{T})| - 1} \to \mathcal{E}_n^\infty$, where the coordinate corresponding to $S \in \textrm{RB}_\textrm{U}(n)$ has value $p_{\tilde{T},t}(S)$. Since $\tilde{T}$ has $2m-1$ edges, the parameter space is $\Delta_{2m-2}$; as $\Delta_{2m-2}$ is a semialgebraic set and $p_{\tilde{T}}$ is a polynomial map, the multinomial model is also a semialgebraic set.

Figure 3. (a) T; (b) $\tilde{T}$ ; (c) $\textrm{Bal}_5$ .

Figure 4. This is a projection onto the first two coordinates of the simplex $\mathcal{E}_5$ . The beta-splitting model on $\textrm{RB}_\textrm{U}(5)$ is pictured in black, and the multinomial model on the two-leaf tree is pictured in gray.

Also, if we take any tree $T \in \textrm{RB}_\textrm{U}(m)$ and any subtree $T' \in \textrm{RB}_\textrm{U}(m')$ of T, then we have $\textrm{Im}(p_{\tilde{T}'}) \subseteq \textrm{Im}(p_{\tilde{T}})$. This is because if the parameters corresponding to edges that appear in $\tilde{T}$ but not in $\tilde{T}'$ are set to 0, the map $p_{\tilde{T}}$ simply becomes $p_{\tilde{T}'}$. Setting these parameters to 0 just corresponds to restricting $p_{\tilde{T}}$ to a subset of the simplex, and thus we get the image containment.

A last interesting note is that this model is perhaps similar in spirit to the W-random graphs when W is a graphon obtained from a finite graph G, as described in [13]. That construction begins with a finite graph G and uses it to define a distribution on graphs with k vertices, much as we begin with a tree T and define a distribution on trees with k leaves.

We end this section with Figure 4, which shows both the beta-splitting model and the multinomial model inside $\mathcal{E}_5$ . In the next section we will discuss the exchangeable and sampling-consistent distributions on four-leaf trees and how they relate to the models discussed in this section.

4. Distributions in $\mathcal{E}_4^\infty$

In this section we classify all of the distributions in $\mathcal{E}_4^\infty$ . In particular, we show that $\mathcal{E}_4^\infty$ is equal to the beta-splitting model.

First, we note that since there are only two distinct tree shapes with four leaves (see Figure 2(b)), the set of exchangeable distributions is just a one-dimensional simplex $\Delta_1$ in $\mathbb{R}^2$. We take coordinates $(p_1,p_2)$ on $\mathbb{R}^2$, letting the first coordinate correspond to $\textrm{Comb}_4$ and the second to $\textrm{Bal}_4$. The subset of distributions that are also sampling consistent must be some line segment within this simplex. We know from Lemma 2.5 that the comb distribution, which is (1,0) in these coordinates, is a vertex of $\mathcal{E}_4^\infty$. If we can bound the probability of obtaining $\textrm{Bal}_4$, then we will have a complete characterization of all distributions in $\mathcal{E}_{4}^{\infty}$; Theorem 2 of [4] will be the main tool to achieve this.

Theorem 4.1. (Theorem 2 of [4].) Among all trees in $\textrm{RB}_\textrm{U}(n)$, the most balanced tree is the one in which the complete symmetric tree on four leaves appears most frequently as an induced subtree.

By the most balanced tree in $\textrm{RB}_\textrm{U}(n)$ we mean the unique tree shape in $\textrm{RB}_\textrm{U}(n)$ with the property that, at every internal vertex, the numbers of leaves in the left and right subtrees below that vertex differ by at most one.

Theorem 4.2. The four-leaf beta-splitting model equals the set of all exchangeable and sampling-consistent distributions on $\textrm{RB}_\textrm{U}(4)$ .

Proof. Note that $\mathcal{E}_4^n$ only has two vertices since it is a line segment. The comb distribution (1,0) is always a vertex in $\mathcal{E}_4^n$ , by Lemma 2.5. The other vertex will be the projection of the vertex of $\mathcal{E}_n$ that places the most mass on $\textrm{Bal}_4$ . The projection of a vertex $p_T \in \mathcal{E}_n$ is

\begin{equation*}(p_1,p_2) = \frac{1}{\binom{n}{4}}(m_1,m_2),\end{equation*}

where $m_1$ is the number of four-element subsets $S \subset [n]$ such that $T|_{S} = \textrm{Comb}_4$, and $m_2$ is the number of four-element subsets $S \subset [n]$ such that $T|_{S} = \textrm{Bal}_4$. By Theorem 4.1, $m_2$ is maximized by the most balanced tree in $\textrm{RB}_\textrm{U}(n)$; we write $m_{2,n}$ for this maximum value.

The beta-splitting model on $\textrm{RB}_\textrm{U}(4)$ , on the other hand, is the line segment from (1,0) to $\big(\frac{4}{7},\frac{3}{7}\big)$ . Indeed, under the beta splitting model, the probability of $\textrm{Bal}_4$ is just

\begin{equation*} q_4(2) = \frac{\binom{4}{2} (\beta + 2)_2^2 }{ (2 \beta + 5)_4 - 2( \beta + 4)_4 } = \frac{6 \beta^4 + O(\beta^3) } { 14 \beta^4 + O(\beta^3)}.\end{equation*}

As $\beta \rightarrow \infty$ , this converges to $\tfrac{3}{7}$ . So, we will be done if we can show that

\begin{equation*}\lim_{n \rightarrow \infty} \frac{m_{2,n}}{\binom{n}{4}} = \frac{3}{7}.\end{equation*}

To prove this, we can restrict to the subsequence where the number of leaves is a power of two, since Lemma 2.2 implies that $\tfrac{m_{2,n}}{\binom{n}{4}}$ is a monotone decreasing sequence. This subsequence is easier to deal with since $m_{2,2^n}$ counts the number of four-subsets, $S \subset [2^n]$, of the leaves of the complete symmetric tree $T_{2^n}$ in $\textrm{RB}_\textrm{U}(2^n)$ such that $T_{2^n}|_{S} = \textrm{Bal}_4$. Using the recursive structure of $T_{2^n}$, we see that $m_{2,2^n} = 2m_{2,2^{n-1}} + \binom{2^{n-1}}{2}^2$. Indeed, the only ways to choose a subset S such that $T_{2^n}|_S = \textrm{Bal}_4$ are for the leaves in S to fall entirely within the left subtree, entirely within the right subtree, or to have two leaves in each of the two subtrees. The number of subsets that fall entirely within one fixed side is $m_{2,2^{n-1}}$, by definition, which accounts for the factor of 2, and the number of ways to choose two leaves from each side is $\binom{2^{n-1}}{2}^2$. This recurrence can be solved to find an explicit formula for $m_{2,2^n}$:

\begin{equation*}m_{2,2^n} = \sum_{i=1}^{n-1} 2^{n-i-1}\binom{2^i}{2}^2 .\end{equation*}

Now we can simplify $\frac{m_{2,2^n}}{\binom{2^n}{4}}$ to get

\begin{equation*}\frac{m_{2,2^n}}{\binom{2^n}{4}} = \frac{3(2^n) - 5}{7(2^n)-21} ,\end{equation*}

which converges to $\frac{3}{7}$ as n tends to infinity.
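The recurrence, its closed form, and the simplified ratio can all be checked by machine. A minimal sketch (the function name is ours):

```python
from math import comb
from fractions import Fraction

def m2(k):
    """m_{2,2^k} via the recurrence m_{2,2^k} = 2 m_{2,2^{k-1}} + C(2^{k-1},2)^2."""
    if k == 1:
        return 0  # a two-leaf tree has no four-subsets
    return 2 * m2(k - 1) + comb(2 ** (k - 1), 2) ** 2

for k in range(2, 12):
    n = 2 ** k
    # explicit formula obtained by solving the recurrence
    assert m2(k) == sum(2 ** (k - i - 1) * comb(2 ** i, 2) ** 2
                        for i in range(1, k))
    # simplified ratio, which tends to 3/7
    assert Fraction(m2(k), comb(n, 4)) == Fraction(3 * n - 5, 7 * n - 21)
```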

Note that Theorem 4.2 does not generalize to higher dimensions as the set of beta-splitting distributions is of strictly smaller dimension than the set of exchangeable sampling-consistent distributions. We explore the discrepancy between these sets in more detail in the next sections.

5. Distributions on $\mathcal{E}_5^\infty$

There are three distinct tree shapes with five leaves, so $\mathcal{E}_5$ is a two-dimensional simplex in $\mathbb{R}^3$ . For the rest of this section we will use $\textrm{Comb}_5$ , $\textrm{Gir}_5$ , and $\textrm{Bal}_5$ to represent the trees pictured in Figure 5. Specifically, let $\textrm{Comb}_5$ denote the comb tree on five leaves, $\textrm{Bal}_5$ denote the balanced tree on five leaves, and $\textrm{Gir}_5$ denote the giraffe tree on five leaves. We take coordinates $(p_1,p_2,p_3)$ on $\mathbb{R}^3$ , where $p_1,p_2,p_3$ represent the probability of obtaining $\textrm{Comb}_5$ , $\textrm{Gir}_5$ , and $\textrm{Bal}_5$ , respectively.

Figure 5. Tree shapes on five leaves.

Figure 6. The two trees from the proof of Lemma 5.2. Note that $T_0$ denotes all of the part of the tree that lies above the vertex z.

While we have not been able to give a complete description of the vertices of $\mathcal{E}_5^n$ for all n, we are able to define some tree structures in $\textrm{RB}_\textrm{U}(n)$ that do yield vertices of $\mathcal{E}_5^n$ . We have already seen that the comb tree $\textrm{Comb}_m$ always yields a vertex of $\mathcal{E}_n^m$ for all m and n. Here we provide some other examples.

Definition 5.1. For a tree $T \in \textrm{RB}_\textrm{U}(m)$ let $\textrm{comb}(T, n)$ be the tree that is obtained by creating a comb tree with n leaves and replacing one of the two leaves at the deepest level with the tree T.

Generally, if $T \in \textrm{RB}_\textrm{U}(m)$ then $\textrm{comb}(T, n)$ has $m + n - 1$ leaves. For example, $\textrm{Gir}_5 = \textrm{comb}(\textrm{Bal}_4, 2)$. Note that it does not matter which of the two deepest leaves is replaced with T, since our trees are unlabelled.

Proposition 5.1. Let $T_n = \textrm{comb}(\textrm{Gir}_5,n-4)$ . Then $\pi_5( p_{T_n})$ is a vertex in $\mathcal{E}_{5}^n$ .

Proof. First note that $T_n$ and $\textrm{Comb}_n$ are the only trees with n leaves that do not have $\textrm{Bal}_5$ as a subtree. This means that $T_n$ and $\textrm{Comb}_n$ are the only tree shapes $T \in \textrm{RB}_\textrm{U}(n)$ such that $\pi_5(p_T)$ falls on the line $p_3 = 0$, so the segment $[\pi_5(p_{T_n}), \pi_5(p_{\textrm{Comb}_n})]$ is a face of $\mathcal{E}_5^n$. Its endpoints, and in particular $\pi_5(p_{T_n})$, are therefore vertices of $\mathcal{E}_5^n$.

We now introduce another tree structure that will yield a vertex in $\mathcal{E}_5^n$ .

Definition 5.2. For two positive integers m and n let $\textrm{bicomb}(m,n)$ denote the tree made by joining a comb tree of size m and a comb tree of size n together at a new root. We call such trees bicomb trees.

For example, $\textrm{Bal}_5 = \textrm{bicomb}(2,3)$ .

Lemma 5.1. Let $T_n = \textrm{bicomb}\big(\big\lfloor \frac{n}{2} \big\rfloor, \big\lceil \frac{n}{2} \big\rceil\big)$ . Then $\pi_5( p_{T_n})$ is a vertex of $\mathcal{E}_5^n$ .

Proof. First note that for $n \geq 5$, the only trees in $\textrm{RB}_\textrm{U}(n)$ that never contain $\textrm{Gir}_5$ as a restriction tree are the comb tree and the bicomb trees. This means that they are the only trees whose projections fall on the edge $p_2 = 0$ of $\mathcal{E}_5^n$. To show that $\pi_5( p_{T_n})$ is a vertex of $\mathcal{E}_5^n$, it remains to show that $\pi_5( p_{T_n})$ is extremal on this edge. We know that the comb tree gives one of the extremal points on this edge, so the other extremal point corresponds to the bicomb tree with the highest density of $\textrm{Bal}_5$ as a restriction tree. Let $T' = \textrm{bicomb}(i,n-i)$ be a bicomb tree for some $1 \leq i \leq n-1$, and let $b_5(T')$ denote the number of times that $\textrm{Bal}_5$ occurs as a restriction tree of $T'$. From the structure of a bicomb tree we have

\begin{equation*}b_5(T') = \binom{i}{2}\binom{n-i}{3} + \binom{i}{3}\binom{n-i}{2}.\end{equation*}

This function is maximized when $i = \big\lfloor \frac{n}{2} \big\rfloor$ .
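This maximization is easy to confirm numerically. A short sketch (function name ours):

```python
from math import comb

def b5_bicomb(i, n):
    """Number of five-subsets of bicomb(i, n - i) restricting to Bal_5."""
    return comb(i, 2) * comb(n - i, 3) + comb(i, 3) * comb(n - i, 2)

for n in range(5, 60):
    best = max(b5_bicomb(i, n) for i in range(1, n))
    # the maximum is attained at the (near-)balanced split i = floor(n/2);
    # for odd n the split i = ceil(n/2) ties by symmetry
    assert b5_bicomb(n // 2, n) == best
```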

Now we will show that the projection of the most balanced tree in $\textrm{RB}_\textrm{U}(n)$ is a vertex of $\mathcal{E}_5^n$ . To do this, we prove a few lemmas about the number of $\textrm{Comb}_5$ trees that can appear as subtrees of a tree. These results follow the basic outline of [4, Lemma 9], and are in some sense an extension of those results to five-leaf trees.

For a tree $T \in \textrm{RB}_\textrm{U}(n)$ let $c_5(T)$ count the number of five-subsets, S, of the leaves of T such that $T|_S = \textrm{Comb}_5$ . Let $c_4(T)$ and $b_4(T)$ be defined similarly, but for $\textrm{Comb}_4$ and $\textrm{Bal}_4$ respectively.

Lemma 5.2. Let T be as pictured in Figure 6, and let $\textrm{T}^{\prime}$ be obtained from T by swapping the positions of $T_2$ and $T_4$ . For $i=0,1,2,3,4$ , let $n_i$ be the number of leaves of $T_i$ , so $n = \sum_{i = 0}^4 n_i$ . Without loss of generality choose $n_1 \geq n_2$ and $n_3 \geq n_4$ . If $n_1 > n_3$ and $n_2 > n_4$ then $c_5(T) \geq c_5(T')$ . Furthermore, if $n \geq 7$ , then $c_5(T) > c_5(T')$ .

Proof. Let $\Sigma_z$ denote the set of leaves of T below the vertex z; by construction, this is also the set of leaves below the vertex z in $T'$. If we take a five-subset, S, of the leaves, then $T|_S \neq T'|_S$ is only possible if $|S \cap \Sigma_z| \geq 4$: it is straightforward to check that if $S \cap \Sigma_z$ has at most three elements, then $T|_S = T'|_S$.

This means that $c_5(T) - c_5(T') = (c_5(T_z) - c_5(T^{\prime}_z)) + n_0(c_4(T_z) - c_4(T^{\prime}_z))$, where $T_z$ and $T^{\prime}_z$ denote the subtrees of T and $T^{\prime}$ below z. Note that for any tree $S \in \textrm{RB}_\textrm{U}(n)$, $\binom{n}{4} = c_4(S) + b_4(S)$, which gives $n_0(c_4(T_z) - c_4(T^{\prime}_z)) = n_0(b_4(T^{\prime}_z) - b_4(T_z))$; the quantity $b_4(T^{\prime}_z) - b_4(T_z)$ is guaranteed to be positive by [4, Lemma 9], so the term $n_0(b_4(T^{\prime}_z) - b_4(T_z))$ is nonnegative. It remains to show that $c_5(T_z) - c_5(T^{\prime}_z)$ is nonnegative. We can explicitly enumerate these quantities in the following way:

\begin{align*}c_5(T_z) & = \sum_{i=1}^{4}c_5(T_i)+ \sum_{i=1}^{4}c_4(T_i)\sum_{j=1,j\neq i}^{4}n_j + \binom{n_1}{3}n_2(n_3+n_4) + \binom{n_2}{3}n_1(n_3+n_4) \\ & \quad + \binom{n_3}{3}n_4(n_1+n_2) + \binom{n_4}{3}n_3(n_1+n_2) , \\c_5(T^{\prime}_z) & = \sum_{i=1}^{4}c_5(T_i) +\sum_{i=1}^{4}c_4(T_i)\sum_{j=1,j\neq i}^{4}n_j + \binom{n_1}{3}n_4(n_2+n_3) + \binom{n_4}{3}n_1(n_2+n_3) \\ & \quad + \binom{n_2}{3}n_3(n_1+n_4) + \binom{n_3}{3}n_2(n_1+n_4) .\end{align*}

We can simplify this to get that

\begin{equation*}c_5(T_z) - c_5(T^{\prime}_z) = \frac{1}{6} (n_1 - n_3) (n_2 - n_4) (n_1 n_3 (n_1 + n_3 - 3) + n_2 n_4 (n_2 + n_4 - 3)).\end{equation*}

Note that this quantity is nonnegative: $n_1 > n_3$ and $n_2 > n_4$ by assumption, and since $n_1 > n_3 \geq 1$ and $n_2 > n_4 \geq 1$ we have $n_1 + n_3 \geq 3$ and $n_2 + n_4 \geq 3$, so each summand in the last factor is nonnegative. Furthermore, if $n \geq 7$, then either $n_0 \geq 1$ or $\sum_{i=1}^4 n_i \geq 7$, either of which guarantees that $c_5(T) - c_5(T') > 0$.
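The factorization at the end of the proof can be verified symbolically or, as in the sketch below (function name ours), numerically over a grid of values: the sums over $c_5(T_i)$ and $c_4(T_i)$ cancel in the difference, so only the binomial terms need to be compared.

```python
from math import comb
from itertools import product

def diff_c5(n1, n2, n3, n4):
    """c_5(T_z) - c_5(T'_z): the sums over c_5(T_i) and c_4(T_i) cancel,
    leaving only the binomial terms of the explicit enumeration."""
    before = (comb(n1, 3) * n2 * (n3 + n4) + comb(n2, 3) * n1 * (n3 + n4)
              + comb(n3, 3) * n4 * (n1 + n2) + comb(n4, 3) * n3 * (n1 + n2))
    after = (comb(n1, 3) * n4 * (n2 + n3) + comb(n4, 3) * n1 * (n2 + n3)
             + comb(n2, 3) * n3 * (n1 + n4) + comb(n3, 3) * n2 * (n1 + n4))
    return before - after

for n1, n2, n3, n4 in product(range(1, 8), repeat=4):
    factored = ((n1 - n3) * (n2 - n4)
                * (n1 * n3 * (n1 + n3 - 3) + n2 * n4 * (n2 + n4 - 3)))
    assert 6 * diff_c5(n1, n2, n3, n4) == factored
```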

This lemma essentially tells us that if a tree has an unbalanced internal node of the form in Figure 6, then swapping the subtrees produces a tree in which $\textrm{Comb}_5$ appears no more (and, for $n \geq 7$, strictly less) frequently as a restriction tree. We now have another lemma following the style of [4].

Lemma 5.3. Let T and $T'$ be as pictured in Figure 7, and for $i=0,1,2$ let $n_i$ be the number of leaves of $T_i$, where $n_1 \geq n_2$. We also assume that $n_1+n_2 \geq 3$. Then $c_5(T) \geq c_5(T')$. Furthermore, if $n \geq 7$ then $c_5(T) > c_5(T')$.

Figure 7. (a) T; (b) $\textrm{T}^{\prime}$ .

Proof. By the same reasoning as in the previous lemma, we know that $c_5(T) - c_5(T') = c_5(T_z) - c_5(T^{\prime}_z) + n_0(c_4(T_z) - c_4(T^{\prime}_z))$, and the nonnegativity of the second term follows by [4, Lemma 10]. Now we can easily see that

\begin{align*} c_5(T_z) & = c_5(T_1) + c_5(T_2) + (n_2+1)c_4(T_1) + (n_1+1)c_4(T_2) + \binom{n_1}{3}n_2 + \binom{n_2}{3}n_1, \\ c_5(T^{\prime}_z) & = c_5(T_1) + c_5(T_2) + (n_2+1)c_4(T_1) + (n_1+1)c_4(T_2) + \binom{n_2}{3}n_1 ,\end{align*}

and so $c_5(T_z) - c_5(T^{\prime}_z) = \binom{n_1}{3}n_2$, which is clearly nonnegative. Note that if $n \geq 7$, then either $n_0 \geq 1$ or $n_1 \geq 3$; in both cases this guarantees that $c_5(T) - c_5(T') > 0$.

Combining these two lemmas, we get the following theorem that will immediately allow us to show that the projection of the most balanced tree in $\textrm{RB}_\textrm{U}(n)$ will always be a vertex in $\mathcal{E}_5^n$ .

Theorem 5.1. For $n \geq 7$ , the maximally balanced tree is the unique minimizer of $c_5(T)$ among all trees $T \in \textrm{RB}_\textrm{U}(n)$ .

Proof. This proof also follows the strategy of [4]. We assume that $c_5$ attains its minimum value in $\textrm{RB}_\textrm{U}(n)$ at T, but that T is not maximally balanced, and derive a contradiction. Let z be an unbalanced internal node whose children a and b are both balanced, and let $n_a$ and $n_b$ be the number of leaves of the trees rooted at a and b respectively. Since z is not balanced we have, without loss of generality, that $n_a \geq n_b +2$. If b is a leaf then, by Lemma 5.3, $c_5(T)$ is not a minimum, since $n \geq 7$. So we have that $n_b \geq 2$, and thus both a and b must be internal nodes.

We now let $v_1,v_2$ be the children of a and $v_3,v_4$ be the children of b, and take $n_i = \#L(T_{v_i})$ for $i=1,2,3,4$ ; once again, without loss of generality we assume that $n_1 \geq n_2$ and $n_3 \geq n_4$ . Since both a and b are balanced, it must be that $n_1=n_2$ or $n_1 = n_2 +1$ and $n_3 = n_4$ or $n_3 = n_4 +1$ . Then the assumption that $n_a \geq n_b + 2$ immediately gives $n_1 + n_2 = n_a \geq n_b + 2 = n_3 + n_4 +2$ . Then, by the previous assumptions we get that $n_1 > n_3$ . Now, since $c_5$ is a minimum at T and $n \geq 7$ , we can apply Lemma 5.2 to get that $n_4 \geq n_2$ . Stringing together these inequalities we get $n_1 > n_3 \geq n_4 \geq n_2$ . But since $n_1 = n_2$ or $n_1 = n_2 +1$ , the only possibility we have is that $n_1 -1 = n_2 = n_3 = n_4$ . But then we get that $n_1 + n_2 = 2n_1 -1$ and $n_3 + n_4 = 2n_1 -2$ , which contradicts the inequality $n_1 + n_2 \geq n_3 + n_4 +2$ . This tells us that any tree with at least seven leaves must be maximally balanced around every internal node if it obtains the minimum value of $c_5$ on $\textrm{RB}_\textrm{U}(n)$ . Since there is only one tree that is maximally balanced at every internal node, there is a unique minimizer of $c_5(T)$ in $\textrm{RB}_\textrm{U}(n)$ for $n \geq 7$ , which is the maximally balanced tree.

Corollary 5.1. Let $T_n$ be the maximally balanced tree in $\textrm{RB}_\textrm{U}(n)$ . Then $\pi_5(p_{T_n})$ is a vertex of $\mathcal{E}_5^n$ .

Proof. The corollary can be verified computationally for $n = 6$. For $n \geq 7$, Theorem 5.1 shows that $T_n$ is the unique tree attaining the minimum value of $c_5$ among all trees in $\textrm{RB}_\textrm{U}(n)$. Hence the supporting hyperplane $p_1 = c_5(T_n)/\binom{n}{5}$ meets $\mathcal{E}_5^n$ in the single point $\pi_5(p_{T_n})$, and thus $\pi_5(p_{T_n})$ is a vertex of $\mathcal{E}_5^n$.

We have another corollary that relates the exchangeable and sampling-consistent distributions to the $\beta$ -splitting model.

Corollary 5.2. The projection into $\mathcal{E}_5$ of the most balanced tree in $\textrm{RB}_\textrm{U}(n)$ approaches the $\beta=\infty$ point of the beta-splitting model as $n \to \infty$.

Proof. It is enough to show that the complete symmetric tree $T_{2^n} \in \textrm{RB}_\textrm{U}(2^n)$ satisfies this property. We can just count the number of times that $\textrm{Gir}_5$ and $\textrm{Bal}_5$ occur as restriction trees when we restrict to a five-subset of the leaves. We use the structure of $T_{2^n}$ to write down a simple recurrence for $g_5(T_{2^n})$ and $b_5(T_{2^n})$ , and then solve the recurrence. Since we can choose our subset to be on the right side of the root of $T_{2^n}$ , the left side of the root of $T_{2^n}$ , or to have three leaves from one side and two leaves from the other, we have

\begin{equation*}b_5(T_{2^n}) = 2b_5(T_{2^{n-1}}) + 2\binom{2^{n-1}}{3}\binom{2^{n-1}}{2}.\end{equation*}

As for $g_5$ , we can once again choose our subset to be on either the right or left side of the root of $T_{2^n}$ , or we can choose to have one leaf on one side of the tree and a four-leaf symmetric tree on the other. This can be done in just $2^{n-1}m_{2,2^{n-1}}$ ways, which implies $g_5(T_{2^n}) = 2g_5(T_{2^{n-1}})+ 2(2^{n-1}m_{2,2^{n-1}}) = 2g_5(T_{2^{n-1}}) + 2^{n}m_{2,2^{n-1}}$ . Both of these recurrences can be solved explicitly using a computer algebra system. We get

\begin{align*} b_5(T_{2^n}) & = \frac{1}{315}2^{n-2}(2^n-4)(2^n-2)(2^n-1)(7 \cdot 2^n-11) , \\[3pt] g_5(T_{2^n}) & = \frac{1}{105}2^{n-3}(2^n-4)(2^n-3)(2^n-2)(2^n-1) .\end{align*}

We can then find the probabilities $p_2$ and $p_3$ of $\textrm{Gir}_5$ and $\textrm{Bal}_5$ by simply dividing out by $\binom{2^n}{5}$ . This yields

\begin{equation*}p_3 = \frac{b_5(T_{2^n})}{\binom{2^n}{5}} = \frac{2}{3} + \frac{20}{21(2^n-3)} , \qquad p_2 = \frac{g_5(T_{2^n})}{\binom{2^n}{5}} = \frac{1}{7} . \end{equation*}

As $n \rightarrow \infty$ we have $p_3 \rightarrow \frac{2}{3}$, while $p_2 = \frac{1}{7}$ for every n.

On the other hand, we recall that the probability of obtaining a tree under the beta-splitting model is just a rational function in $\beta$ that can be explicitly calculated. We can then find the limit of these rational functions to get that the beta-splitting curve approaches the point $(p_1,p_2,p_3) = \big(\frac{4}{21},\frac{1}{7},\frac{2}{3}\big)$ as $\beta \to \infty$ as well, and so the projection of $T_{2^n}$ in $\mathcal{E}_5^{2^n}$ is approaching the $\beta=\infty$ point on the curve.
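The recurrences, closed forms, and limiting probabilities in this proof can be checked numerically. A sketch (function name ours) that iterates the three recurrences together:

```python
from math import comb
from fractions import Fraction

def counts(k):
    """(m_{2,2^k}, b_5(T_{2^k}), g_5(T_{2^k})) for the complete symmetric
    tree on 2^k leaves, computed from the recurrences in the proofs."""
    m2, b5, g5 = 0, 0, 0
    for j in range(2, k + 1):
        half = 2 ** (j - 1)            # leaves on each side of the root
        b5 = 2 * b5 + 2 * comb(half, 3) * comb(half, 2)
        g5 = 2 * g5 + 2 * half * m2    # m2 still holds m_{2,2^{j-1}} here
        m2 = 2 * m2 + comb(half, 2) ** 2
    return m2, b5, g5

for k in range(3, 12):
    n = 2 ** k
    _, b5, g5 = counts(k)
    # closed forms from the proof
    assert 315 * b5 == (n // 4) * (n - 4) * (n - 2) * (n - 1) * (7 * n - 11)
    assert 105 * g5 == (n // 8) * (n - 4) * (n - 3) * (n - 2) * (n - 1)
    # the resulting probabilities p_2 and p_3
    assert Fraction(g5, comb(n, 5)) == Fraction(1, 7)
    assert Fraction(b5, comb(n, 5)) == Fraction(2, 3) + Fraction(20, 21 * (n - 3))
```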

These are all of the tree structures in $\textrm{RB}_\textrm{U}(n)$ we have been able to find that always appear as vertices in $\mathcal{E}_5^n$. We end this section with Figure 8, which pictures all of the families of exchangeable and sampling-consistent distributions that we have discussed, and the polytopes $\mathcal{E}_5^n$ for some small values of n.

Figure 8. The multinomial model on the two-leaf tree is in gray and the $\beta$ -splitting model is the thick black curve. The thinner black lines are the boundary of $\mathcal{E}_5^n$ for $n=5,6,9,12$ .

6. Distributions on $\mathcal{E}_n^\infty$

While we are not able to get a description of the vertices of $\mathcal{E}_n^m$ for general m and n, it is possible to describe $\mathcal{E}_n^\infty$ using the multinomial model introduced in Section 3.2. In particular, this shows that the multinomial models converge as an inner limit to $\mathcal{E}_n^\infty$.

Theorem 6.1. Let $\{T_m\}_{m=n}^\infty$ be a sequence of tree shapes and let $p^{(m)} = \pi_n(p_{T_m})$ be the corresponding sequence of distributions. If $p^{(m)}$ converges to some $p \in \mathcal{E}_n^\infty$ as m goes to infinity, then there exists a sequence of multinomial distributions $\{d^{(m)}\}_{m=n}^\infty$ that also converges to p as m goes to infinity.

Proof. Define $d^{(m)}$ to be the multinomial distribution on the tree $T_m$ with the edge parameter vector $(t_e \mid e \in E(T_m))$ such that $t_e = \frac{1}{m}$ if one of the vertices in e is one of the original m leaves of $T_m$ , and $t_e = 0$ otherwise. Note that these nonzero edge parameters are bijectively associated to the leaves of $T_m$ and we may call the set of nonzero edge parameters $L(T_m)$ , meaning the leaf set of $T_m$ . To show that $d^{(m)}$ also converges to p, it is enough to show that for every tree $T \in \textrm{RB}_\textrm{U}(n)$ , $\lim_{m \to \infty}d^{(m)}(T) = \lim_{m \to \infty}p^{(m)}(T)$ .

Fix a labelling of $T_m$ and let $c_{T_m}(T)$ be the number of sets $S \subseteq [m]$ such that $\textrm{shape}(T_m|_S) = T$ . By Corollary 2.1, $p^{(m)}(T)$ is the induced subtree density of T in $T_m$ , so $p^{(m)}(T) = \frac{c_{T_m}(T)}{\binom{m}{n}}$ . Hence,

\begin{equation*}\lim_{m \to \infty}p^{(m)}(T) = \lim_{m \to \infty} \frac{c_{T_m}(T)}{\binom{m}{n}} = \lim_{m \to \infty} \frac{n!}{m^n}c_{T_m}(T) .\end{equation*}

On the other hand, let $M^{(m)} = \{A \in M_n^{T_m} \mid ({T_m})_A = T, p_A \neq 0\}$ . Then $d^{(m)}(T) = \sum_{A \in M^{(m)}} p_A$ by definition, and we note that by requiring that multisets $A \in M^{(m)}$ have $p_A \neq 0$ , $M^{(m)}$ only includes multisets whose support is contained in $L(T_m)$ . Also note that either $p_A = 0$ or

\begin{equation*}p_A = \binom{n}{m_A(t_{e_1}),m_A(t_{e_2}), \ldots , m_A(t_{e_{2m-1}})}\frac{1}{m^n}\end{equation*}

since all the edge parameters are 0 or $\frac{1}{m}$. So, to understand the quantity $d^{(m)}(T)$ it is enough to know the coefficient of $\frac{1}{m^n}$. Note that any multiset A has a naturally associated integer partition of n, formed by taking the multiplicities of each unique element that appears in it. Call this integer partition the weight of A, denoted $\textrm{wt}(A)$, and let $M_\lambda^{(m)}$ be the set of multisets in $M^{(m)}$ with weight $\lambda$. Now observe that if $A,B \in M_\lambda^{(m)}$ then $p_A = p_B$, since the value of the multinomial coefficient is totally determined by the weight and the product of the edge parameters is always $\frac{1}{m^n}$. If we let $\binom{n}{\lambda}$ denote this common multinomial coefficient, then the formula for $d^{(m)}(T)$ can be rewritten as $d^{(m)}(T) = \frac{1}{m^n}\sum_{\lambda \vdash n} \binom{n}{\lambda}|M_\lambda^{(m)}|$. We can bound the quantity $|M_\lambda^{(m)}|$ above by $l(\lambda)!\binom{m}{l(\lambda)}$, where $l(\lambda)$ is the length of the partition $\lambda$: since any $A \in M_\lambda^{(m)}$ must have $p_A \neq 0$, A must be a multiset on the leaves of $T_m$, so there are $\binom{m}{l(\lambda)}$ choices for which elements to use in the multiset, and at most $l(\lambda)!$ distinct multisets for each choice of elements. Since $l(\lambda)!\binom{m}{l(\lambda)}$ is a polynomial in m of degree $l(\lambda)$, we have

\begin{equation*}\lim_{m \to \infty} \frac{1}{m^n}\sum_{\lambda \vdash n} \binom{n}{\lambda}|M_\lambda^{(m)}| =\lim_{m \to \infty} \frac{n!}{m^n}|M_{(1,1,\ldots, 1)}^{(m)}| ,\end{equation*}

since the partition $\lambda = (1,1,\ldots , 1)$ is the only partition for which $|M_\lambda^{(m)}|$ can be of order $m^n$, and so is the only term that contributes to the limit. Now we note that the multisets $A \in M_{(1,1,\ldots, 1)}^{(m)}$ correspond exactly to subsets of the leaves of $T_m$ that yield T upon restriction, since the only edges that can appear in A are those corresponding to leaves, every leaf can be chosen at most once, and $\textrm{shape}((T_m)_A) = T$. So $|M_{(1,1,\ldots, 1)}^{(m)}| = c_{T_m}(T)$, and thus $\lim_{m \to \infty}d^{(m)}(T) = \lim_{m \to \infty} \frac{n!}{m^n}c_{T_m}(T) = \lim_{m \to \infty}p^{(m)}(T)$; since $p^{(m)}$ converges to p, it must be that $d^{(m)}$ does as well.
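The counting in this proof can be checked by brute force for small m and n. A sketch (names ours) that groups all n-element multisets on m symbols by their weight:

```python
from math import comb, factorial
from itertools import combinations_with_replacement
from collections import Counter

def weight(A):
    """wt(A): the integer partition of |A| formed by the multiplicities of A."""
    return tuple(sorted(Counter(A).values(), reverse=True))

m, n = 7, 4  # small illustrative sizes
by_weight = Counter(weight(A)
                    for A in combinations_with_replacement(range(m), n))

for lam, count in by_weight.items():
    # the bound |M_lambda| <= l(lambda)! * C(m, l(lambda)) used in the proof
    assert count <= factorial(len(lam)) * comb(m, len(lam))

# only lambda = (1,...,1) has order m^n: it counts plain n-subsets
assert by_weight[(1,) * n] == comb(m, n)
```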

Theorem 6.2. For all $n \geq 1$ , there exists a constant $C > 0$ such that, for all $m > n$ and $p \in \mathcal{E}_n^m$ , there exists $d \in \mathcal{E}_n^\infty$ such that $\max_{S \in \textrm{RB}_\textrm{U}(n)}|p(S) - d(S)| \leq \frac{C}{m}$ .

Proof. Note that if $p \in \mathcal{E}_n^m$ , then we have, for every $S \in \textrm{RB}_\textrm{U}(n)$ ,

\begin{equation*}p(S) = \sum_{T \in \textrm{RB}_\textrm{U}(m)} \lambda_T \pi_n(p_T)(S),\end{equation*}

where this combination is convex by Lemma 2.3. Let $d^T$ be the multinomial distribution on T defined just as $d^{(m)}$ was defined for $T_m$ in the previous theorem. Recall from the proof of that theorem that $d^T(S) = \frac{1}{m^n}\sum_{\lambda \vdash n} \binom{n}{\lambda}|M_\lambda^T|$, where $M_\lambda^T = \{A \in M_n^T \mid T_A = S, p_A \neq 0, \textrm{wt}(A) = \lambda \}$, and that $|M_{(1,1,\ldots,1)}^T| = c_T(S)$. Combining these facts with the definition of $\pi_n(p_T)$ and the triangle inequality gives

(6.1) \begin{equation} |\pi_n(p_T)(S)- d^T(S)| \leq \bigg|\frac{c_T(S)}{\binom{m}{n}} - \frac{n!c_T(S)}{m^n}\bigg| + \Bigg|\frac{1}{m^n} \sum_{\substack{\lambda \vdash n \\ \lambda \neq (1,1,\ldots,1)}} \binom{n}{\lambda}|M_\lambda^T| \Bigg| ;\end{equation}

we now bound each term on the right-hand side of this inequality.

To bound the first term in (6.1), note that $c_T(S)$ is a nonnegative quantity and is bounded above by $\binom{m}{n}$ . This gives the inequality

\begin{equation*} \bigg|\frac{c_T(S)}{\binom{m}{n}} - \frac{n!c_T(S)}{m^n}\bigg| \leq \bigg| 1-\frac{\frac{m!}{(m-n)!}}{m^n} \bigg| \leq \bigg|\frac{m^n - (m-n)^n}{m^n} \bigg| \leq \frac{n^2}{m}.\end{equation*}

Note that this bound does not depend on the trees T and S.

To bound the second term we again recall from the proof of the previous theorem that $|M_\lambda^T| \leq l(\lambda)!\binom{m}{l(\lambda)}$ for each partition $\lambda$ of n. Then, we have

(6.2) \begin{equation}\Bigg|\frac{1}{m^n} \sum_{\substack{\lambda \vdash n \\ \lambda \neq (1,1,\ldots,1)}} \binom{n}{\lambda}|M_\lambda^T| \Bigg| \leq\sum_{\substack{\lambda \vdash n \\ \lambda \neq (1,1,\ldots,1)}} \binom{n}{\lambda}\frac{l(\lambda)! \binom{m}{l(\lambda)}}{m^n} ,\end{equation}

but since $\lambda \neq (1,1,\ldots,1)$ , it must be that $l(\lambda) \leq n-1$ so $l(\lambda)!\binom{m}{l(\lambda)} \leq m^{n-1}$ for all the remaining partitions $\lambda$ . Applying this fact to the right-hand side of (6.2) gives the bound

\begin{equation*} \Bigg|\frac{1}{m^n} \sum_{\substack{\lambda \vdash n \\ \lambda \neq (1,1,\ldots,1)}} \binom{n}{\lambda}|M_\lambda^T| \Bigg| \leq \frac{1}{m} \sum_{\substack{\lambda \vdash n \\ \lambda \neq (1,1,\ldots,1)}} \binom{n}{\lambda} \leq \frac{\tilde{C}}{m} ,\end{equation*}

where $\tilde{C} \in \mathbb{R}$ is a constant that also does not depend on the trees T and S but only on n. Applying the bounds for each term to (6.1) and setting $C = \tilde{C} + n^2$ gives

(6.3) \begin{equation} |\pi_n(p_T)(S)- d^T(S)| \leq \frac{C}{m} ,\end{equation}

and again we note that this bound is independent of the trees T and S. We are now ready to construct a distribution $d \in \mathcal{E}_n^\infty$ that gives the desired result. From the discussion of the multinomial model, we have that each distribution $d^T \in \mathcal{E}_n^\infty$ and so, from the convexity of $\mathcal{E}_n^\infty$ , we get $d = \sum_{T \in \textrm{RB}_\textrm{U}(m)}\lambda_T d^T \in \mathcal{E}_n^\infty$ . We can now use the expression for p we began with and the bound obtained in (6.3) to get

\begin{align*}|p(S) - d(S)| \leq \sum_{T \in \textrm{RB}_\textrm{U}(m)} \lambda_T |\pi_n(p_T)(S) - d^T(S)| \leq \frac{C}{m}.\end{align*}
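The falling-factorial estimate used for the first term of (6.1) is easy to spot-check numerically:

```python
from math import perm  # perm(m, n) = m! / (m - n)!

# check |1 - m!/((m-n)! m^n)| <= n^2/m over a range of m and n
for n in range(1, 8):
    for m in range(n + 1, 300):
        assert abs(1 - perm(m, n) / m ** n) <= n ** 2 / m + 1e-12
```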

Theorem 6.1 shows that the limit of any convergent sequence $(v_m)_{m \geq 1}$ with $v_m \in V(\mathcal{E}_n^m)$ can also be realized as the limit of distributions coming from multinomial models. Theorem 6.2 shows that if a distribution in $\mathcal{E}_n$ can be extended to part of a finitely sampling-consistent family, then it can be approximated by an infinitely sampling-consistent distribution. With Theorem 6.1 and the following proposition, we will show that $\mathcal{E}_n^\infty$ is actually the closure of the convex hull of all limits of convergent sequences of vertices, and thus of limits of distributions drawn from the multinomial model. To do this we need a basic proposition from convex analysis, whose proof is included for completeness.

Proposition 6.1. Let $(P_m)_{m \geq 1}$ be a sequence of polytopes in $\mathbb{R}^n$ such that, for all $m \geq 1$ , $P_{m+1} \subseteq P_m$ . Let

\begin{equation*}P = \overline{ {\rm conv}\big(\big\{\lim_{m \to \infty} v_{i_m}^{(m)} \mid v_{i_m}^{(m)} \in V(P_m)\mbox{ and }\big(v_{i_m}^{(m)}\big)_{m \geq 1} \mbox{ converges} \big\} \big) } ,\end{equation*}

where the bar denotes the closure in the Euclidean topology. Then $P = \cap_{m=1}^\infty P_m$ .

Proof. It is straightforward to see that $P \subseteq \cap_{m=1}^\infty P_m$. To show that the sets are equal, suppose that there is $p \in (\cap_{m=1}^\infty P_m ) \setminus P$. Then the basic separation theorem of convex analysis implies there must exist an affine functional $\ell$ with $\ell(p) \leq 0$ and $\ell(w) > 0$ for all $w \in P$. Since $p \in \cap_{m=1}^\infty P_m$, for each $m \geq 1$, p can be written as $p = \sum_{j=1}^{k_m} \lambda_j v_j^{(m)}$, where the $v_j^{(m)}$ are the vertices of $P_m$ and the $\lambda_j$ are convex coefficients. Then, because $\ell(p) \leq 0$, for each m there must exist at least one vertex $v_{i_m}^{(m)}$ of $P_m$ such that $\ell(v_{i_m}^{(m)}) \leq 0$. Since all of these points lie in the compact set $P_1$, there exists a convergent subsequence $(v_{i_{m_k}}^{(m_k)})_{k \geq 1}$ with limit $v \in P$, and thus $\ell(v) > 0$. But we also have $\ell(v) = \lim_{k \to \infty} \ell\big(v_{i_{m_k}}^{(m_k)}\big) \leq 0$, which is a contradiction.

Corollary 6.1. Let $d_{T_m}^{(m)}$ denote the specific multinomial model construction on the tree $T_m \in \textrm{RB}_\textrm{U}(m)$ described in Theorem 6.1. Then

\begin{equation*}\mathcal{E}_n^\infty = \overline{{\rm conv}\big(\big\{\lim_{m \to \infty}d_{T_m}^{(m)} \mid \pi_n(p_{T_m}) \in V(\mathcal{E}_n^m) \mbox{ and } (d_{T_m}^{(m)})_{m \geq n} \mbox{ converges} \big\} \big) } .\end{equation*}

Proof. Recall that $\mathcal{E}_n^\infty = \cap_{m = n}^\infty \mathcal{E}_n^m$ ; thus, by Proposition 6.1,

\begin{equation*}\mathcal{E}_n^\infty = \overline{{\rm conv}\big(\big\{\lim_{m \to \infty}\pi_n(p_{T_m}) \mid T_m \in \textrm{RB}_\textrm{U}(m) \mbox{ and } (\pi_n(p_{T_m}))_{m \geq n} \mbox{ converges} \big\}\big) }\end{equation*}

since the vertices of $\mathcal{E}_n^m$ correspond to a subset of the points $\pi_n(p_{T_m})$ . Applying Theorem 6.1 to the sequence $(\pi_n(p_{T_m}))_{m \geq n}$ gives the result.

Corollary 6.1 shows that every exchangeable and infinitely sampling-consistent distribution is either a convex combination of limits of multinomial distributions or a limit point of such convex combinations. Understanding the structure of the multinomial models may therefore shed greater light on the structure of $\mathcal{E}_n^\infty$ as a whole. We view Theorem 6.2 and Corollary 6.1 as the rooted binary tree analogue of Theorems 3 and 4 in [6]; in essence they are finite forms of a de Finetti-type theorem for rooted binary trees. As previously mentioned, the work done in [10] and [9] establishes a more typical de Finetti theorem in the sense that it shows that every infinitely sampling-consistent sequence of distributions can be obtained by sampling from a limit object using techniques from probability theory.

We also note that the requirement that the induced subtree densities converge is quite similar to the notion of graph convergence that appears in [13], and that many of the ideas in the theory of graph limits may also be applied to trees. The well-developed theory of graph limits contains many equivalent versions of the limiting object (see [13, Theorem 11.52]). The work done in [10] and [9] makes the connection between the limiting object, a random real tree, and an infinitely sampling-consistent model. It is still unknown whether this can be connected to ideas such as tree parameters (the induced subtree density, for instance) or to metrics on finite trees, as has been done in the theory of graph limits. It seems that many of these equivalences hold, but different techniques will be required.

Acknowledgement

We wish to thank Dávid Papp for a helpful conversation regarding Proposition 6.1.

Funding Information

Benjamin Hollering and Seth Sullivant were partially supported by the US National Science Foundation (DMS 1615660).

Competing Interests

There were no competing interests to declare that arose during the preparation or publication of this article.

References

[1] Aldous, D. (1996). Probability distributions on cladograms. In Random Discrete Structures (IMA Volumes Math. Appl. Vol. 76), eds Aldous, D. and Pemantle, R., Springer, New York, pp. 1–18.
[2] Bernstein, D. I., Ho, L. S. T., Long, C., Steel, M., St. John, K. and Sullivant, S. (2015). Bounds on the expected size of the maximum agreement subtree. SIAM J. Discrete Math. 29, 2065–2074.
[3] Bryant, D., McKenzie, A. and Steel, M. (2003). The size of a maximum agreement subtree for random binary trees. In Bioconsensus (DIMACS Series in Discrete Math. Theoret. Comput. Sci. Vol. 61), American Mathematical Society, Providence, RI, pp. 55–65.
[4] Coronado, T. M., Mir, A., Rosselló, F. and Valiente, G. (2019). A balance index for phylogenetic trees based on rooted quartets. J. Math. Biol. 79, 1105–1148.
[5] de Vienne, D. M., Giraud, T. and Martin, O. (2007). A congruence index for testing topological similarity between trees. Bioinformatics 23, 3119–3124.
[6] Diaconis, P. (1977). Finite forms of de Finetti’s theorem on exchangeability. Synthese 36, 271–281.
[7] Diaconis, P. and Janson, S. (2008). Graph limits and exchangeable random graphs. Rend. Mat. Appl. 28, 33–61.
[8] Ford, D. J. (2006). Probabilities on Cladograms: Introduction to the Alpha Model. PhD thesis, Stanford University.
[9] Forman, N. (2018). Mass-structure of weighted real trees. Preprint, arXiv:1801.02700.
[10] Forman, N., Haulk, C. and Pitman, J. (2018). A representation of exchangeable hierarchies by sampling from random real trees. Prob. Theory Relat. Fields 172, 1–29.
[11] Haas, B., Miermont, G., Pitman, J. and Winkel, M. (2008). Continuum tree asymptotics of discrete fragmentations and applications to phylogenetic models. Ann. Prob. 36, 1790–1837.
[12] Lauritzen, S. L., Rinaldo, A. and Sadeghi, K. (2017). On exchangeability in network models. Preprint, arXiv:1709.03885.
[13] Lovász, L. (2012). Large Networks and Graph Limits (Colloquium Publications Vol. 60), American Mathematical Society, Providence, RI.
[14] McCullagh, P., Pitman, J. and Winkel, M. (2008). Gibbs fragmentation trees. Bernoulli 14, 988–1002.
[15] Wakeley, J. (2008). Coalescent Theory: An Introduction. W. H. Freeman, New York.
Figure 1. The projection of $\mathcal{E}_5^7$ onto the first two coordinates of the simplex $\mathcal{E}_5$. The gray points correspond to the points $\pi_n(p_T)$ for $T \in \textrm{RB}_\textrm{U}(7)$. The vertices of the simplex are labelled with the corresponding unrooted tree. Note that the balanced tree is at the origin since we have projected onto the coordinates corresponding to the other two trees.

Figure 2. (a) A five-leaf tree. (b) The two tree shapes for four-leaf trees.

Figure 3. (a) $T$; (b) $\tilde{T}$; (c) $\textrm{Bal}_5$.

Figure 4. This is a projection onto the first two coordinates of the simplex $\mathcal{E}_5$. The beta-splitting model on $\textrm{RB}_\textrm{U}(5)$ is pictured in black, and the multinomial model on the two-leaf tree is pictured in gray.

Figure 5. Tree shapes on five leaves.

Figure 6. The two trees from the proof of Lemma 5.2. Note that $T_0$ denotes the part of the tree that lies above the vertex $z$.

Figure 7. (a) $T$; (b) $T^{\prime}$.

Figure 8. The multinomial model on the two-leaf tree is in gray and the $\beta$-splitting model is the thick black curve. The thinner black lines are the boundary of $\mathcal{E}_5^n$ for $n=5,6,9,12$.