
Monte Carlo fusion

Published online by Cambridge University Press:  12 July 2019

Hongsheng Dai*
Affiliation:
University of Essex
Murray Pollock*
Affiliation:
University of Warwick
Gareth Roberts*
Affiliation:
University of Warwick
* Postal address: Department of Mathematical Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK. Email address: hdaia@essex.ac.uk
** Postal address: Department of Statistics, University of Warwick, Gibbet Hill Road, Coventry, CV4 7ES, UK.

Abstract

In this paper we propose a new theory and methodology to tackle the problem of unifying Monte Carlo samples from distributed densities into a single Monte Carlo draw from the target density. This surprisingly challenging problem arises in many settings (for instance, expert elicitation, multiview learning, distributed ‘big data’ problems, etc.), but to date the framework and methodology proposed in this paper (Monte Carlo fusion) is the first general approach which avoids any form of approximation error in obtaining the unified inference. In this paper we focus on the key theoretical underpinnings of this new methodology, and simple (direct) Monte Carlo interpretations of the theory. There is considerable scope to tailor the theory introduced in this paper to particular application settings (such as the big data setting), construct efficient parallelised schemes, understand the approximation and computational efficiencies of other such unification paradigms, and explore new theoretical and methodological directions.

Type
Research Papers
Copyright
© Applied Probability Trust 2019 

1. Introduction

A common problem arising in statistical inference is the need to unify distributed analyses and inferences on shared parameters from multiple sources into a single coherent inference. This unification (or what we term ‘fusion’) problem can arise either explicitly due to the nature of a particular application, or artificially as a consequence of the approach a practitioner takes to tackling an application.

Typically, there will exist no closed-form analytical approach to unifying distributed inferences, and so we focus on a Monte Carlo approach. Stated generally, in this paper we are interested in sampling (without error) the d-dimensional (fusion) target density

(1) $$f(x) \propto f_1(x) \cdots f_C(x),$$

where each fc(x) (c ∈ {1, …, C}) is a density (up to a multiplicative constant) representing one of the C distributed inferences we wish to unify. Each fc(x) (which we term a sub-posterior) may in practice itself be represented by a Monte Carlo sample $${\tilde f_c}(x)$$ , and in this paper we assume that we are able to sample (directly and exactly) from each fc(x).

Specific examples of this problem arising naturally in an application include expert elicitation ([Reference Berger2] and [Reference Genest and Zidek13]), in which the (distributional) views of multiple experts on a topic (or set of parameters) have to be pooled into a single view before a decision-maker can make an informed decision; and, multiview learning ([Reference Zhao, Xie, Xu and Sun29] and [Reference Li, Yang and Zhang16]) and meta-analysis ([Reference Fleiss12] and [Reference Smith, Spiegelhalter and Thomas24]), in which an interpretation could be that we are synthesising multiple inferences on a particular parameter set (computed on datasets which may or may not be of the same type), but the underlying raw data is not directly available for the unified inference. Obtaining the raw data may itself be an insoluble problem due to reasons including the nature of the original publication, data confidentiality, or simply time and storage constraints.

This fusion problem also arises artificially in a number of settings, particularly within modern statistical methodologies tackling ‘big data’. The computational cost of algorithms such as Metropolis–Hastings, an iterative algorithm requiring access to the full dataset at every iteration, scales poorly with increasing volumes of data unless a modification is found.

One common modification in light of the challenge of big data is to deploy a ‘divide-and-conquer’ approach (or more accurately termed ‘fork-and-join’ approach ([Reference Stamatakis and Aberer26])). In this setting the full dataset is artificially split into a large number of smaller data sets, inference is then conducted on each smaller data set in isolation, and the resulting inferences are unified (see, for instance, [Reference Agarwal and Duchi1], [Reference Li, Srivastava and Dunson15], [Reference Minsker, Srivastava, Lin and Dunson18], [Reference Neiswanger, Wang and Xing19], [Reference Scott23], [Reference Srivastava, Cevher, Dinh and Dunson25] and [Reference Wang and Dunson28]). The rationale for such an approach is that inference on each small data set can be conducted independently, and in parallel, and so one could exploit large clusters of computing cores to greatly reduce the elapsed time to conduct the full inference. The weakness of these approaches is that the fusion of the separately conducted inferences is inexact. It should be noted that divide-and-conquer methodologies will typically have additional constraints due to hardware concerns—such as minimising or removing any communication between computing cores to reduce the effect of latency. We focus in this paper on the general fusion problem, and so do not fully address the problem in the context of big data (to which we return in subsequent work).
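As a concrete illustration, when every sub-posterior is Gaussian the fusion target (1) is available in closed form as a precision-weighted combination; this is the special case that many approximate unification schemes implicitly rely on. A minimal numerical sketch (illustrative values only, not part of the paper's methodology):

```python
import numpy as np

# If each sub-posterior f_c is N(mu_c, sigma_c^2), the fusion target
# f ∝ f_1 ... f_C is again Gaussian, with precision-weighted parameters.
def gaussian_fusion(mus, sigma2s):
    precisions = 1.0 / np.asarray(sigma2s)
    prec = precisions.sum()                              # fused precision
    mean = (precisions * np.asarray(mus)).sum() / prec   # fused mean
    return mean, 1.0 / prec

mean, var = gaussian_fusion([0.0, 2.0], [1.0, 1.0])
# two unit-variance Gaussians centred at 0 and 2 fuse to N(1, 0.5)
```

Outside this Gaussian setting no such closed form exists, which is exactly the gap the Monte Carlo fusion methodology addresses.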

The framework and methodology we outline in this paper (Monte Carlo fusion) for sampling exactly from (1) can be viewed as a simple rejection sampling scheme on an extended space— we develop and sample from efficient proposal densities for (1), the samples from which we retain according to an appropriate acceptance probability. The mathematical complication in this paper is in computing the intractable acceptance probability—which requires the auxiliary simulation of collections of Brownian (or Ornstein–Uhlenbeck) bridges. Our fusion approach provides a principled way to understand the error in existing unification schemes, using a simple linear combination and correction of Monte Carlo samples (analogous to a traditional meta-analysis approach).

The presentation of this paper broadly follows this pedagogy. In Section 2 we present a density on an extended space which admits (1) as a marginal. In the remainder of the paper we then develop a rejection sampler for the extended density in Section 2. In Section 3 we develop general theory and methodology for sampling (1) based on a collection of independent Brownian bridges. In Section 4 we present a modification of the theory developed in Section 3 using Ornstein–Uhlenbeck bridges, resulting in sampling efficiencies for the particular (common) setting in which the fusion density is believed to be approximately Gaussian. In Section 5 we consider examples of our methodology applied to both light-tailed and heavy-tailed fusion target densities. Finally, in Section 6 we conclude by discussing the exciting new research directions possible using Monte Carlo fusion. Much of the technical detail in the paper is suppressed for ease of reading, but can be found in the appendices.

2. An extended fusion density

Consider f (x) and fc(x) as described in (1), where f (x) is integrable and we can sample from the density proportional to fc(x).

The following simple observation will form the foundation of our approach. Suppose that pc(y | x) is the transition density (with respect to the Lebesgue measure) of a Markov chain on ℝd with invariant density proportional to $f_c^2$ .

Proposition 1. The (C + 1)d-dimensional (fusion) density proportional to the integrable function

(2) $$g(x^{(1)}, \ldots, x^{(C)}, y) = \prod_{c = 1}^{C}\Big[f_c^2(x^{(c)}) \cdot p_c(y \,|\, x^{(c)}) \cdot \frac{1}{f_c(y)}\Big],$$

admits the marginal density f for y.

The proof of Proposition 1 is elementary and is omitted. The statistical interpretation of Proposition 1 is simply that, if it were possible to sample from the (C + 1)d-dimensional (fusion) density g in (2), then, as a by-product, we would obtain a draw from our fusion target density f in (1). How to sample directly from (2) is not clear, even if it were possible to simulate from pc(·| x). Our strategy instead will be to use rejection sampling. Two rejection sampling methods (which have differing efficiencies) are provided in Sections 3 and 4.

3. A fusion rejection sampler using Brownian bridges

3.1. The methodology

Consider the proposal density for the extended fusion target (2) proportional to the function

(3) $$h^{\rm bm}(x^{(1)}, \ldots, x^{(C)}, y) \propto \prod_{c = 1}^{C}\big[f_c(x^{(c)})\big]\exp\Big\{-\frac{C\|y - \bar x\|^2}{2T}\Big\},$$

where $\bar x = C^{-1}\sum_{c = 1}^{C} x^{(c)}$, and T is an arbitrary positive constant.

Simulation from the proposal h bm can be achieved directly. In particular, x(1), …, x(C) are first drawn independently from f 1, …, fC, respectively, and then y is simply a Gaussian random variable centred on $\bar x$ .
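This proposal step can be sketched in a few lines; the two one-dimensional sub-posterior samplers below are hypothetical stand-ins for f1, …, fC:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_hbm(samplers, T, d):
    """Draw (x^(1), ..., x^(C), y) from the proposal h^bm in (3):
    x^(c) ~ f_c independently, then y | x's ~ N(xbar, (T/C) I_d)."""
    xs = np.stack([s(rng) for s in samplers])   # C x d array of sub-posterior draws
    xbar = xs.mean(axis=0)
    C = len(samplers)
    y = xbar + np.sqrt(T / C) * rng.standard_normal(d)
    return xs, y

# illustrative (hypothetical) sub-posterior samplers: two unit Gaussians
samplers = [lambda g: g.standard_normal(1),
            lambda g: g.standard_normal(1) + 1.0]
xs, y = sample_hbm(samplers, T=1.0, d=1)
```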

In Proposition 1 we presented a general form of the extended fusion target, in which pc(y | x) is the transition density (with respect to the Lebesgue measure) of a Markov chain on ℝd with invariant density proportional to $f_c^2$ . In this paper we set ${p_c}(y|x) := p_{T,c}^{\rm dl}(y|x)$ , the transition density of a double Langevin diffusion for fc (i.e. the transition density of a Langevin diffusion for $f_c^2$ ) from x to y over a predefined (user-specified) time T > 0. To distinguish the resulting extended fusion target from the general case, we further denote the extended fusion target by g dl. In particular, for all 1 ≤ c ≤ C, we consider the d-dimensional (double) Langevin (DL) diffusion processes $\{X_t^{(c)},\ t \in [0,T],\ c = 1, \ldots, C\}$ , given by

(4) $${\rm d}X_t^{(c)} = \nabla A_c^{\rm dl}(X_t^{(c)})\,{\rm d}t + {\rm d}W_t^{(c)},$$

where $W_t^{(c)}$ is a d-dimensional Brownian motion, ∇ is the gradient operator over x, and

$$A_c^{\rm dl}(x) := \log f_c(x);$$

$X_t^{(c)}$ has invariant distribution proportional to $f_c^2(x)$ for any t ∈ [0, T] ([Reference Hansen14]). We also impose the following standard regularity property (where div denotes the divergence operator).

Condition 1. Define

$$\phi_c^{\rm dl}(x) := \tfrac{1}{2}\big(\|\nabla A_c^{\rm dl}(x)\|^2 + {\rm div}\,\nabla A_c^{\rm dl}(x)\big).$$

There exists a constant $\Phi_c^{\rm bm} > -\infty$ such that, for all x and each c ∈ {1, …, C}, $\phi_c^{\rm dl}(x) \ge \Phi_c^{\rm bm}$ .

Then we have the following proposition which gives a rejection sampling method for g dl(x(1), …, x (C), y).

Proposition 2. Under Condition 1, we can write

(5) $$\frac{g^{\rm dl}(x^{(1)}, \ldots, x^{(C)}, y)}{h^{\rm bm}(x^{(1)}, \ldots, x^{(C)}, y)} = \bigg[\frac{\sqrt C}{\sqrt{2\pi T}}\bigg]^C \times \rho^{\rm bm} \times Q^{\rm bm} \times \prod_{c = 1}^{C} {\rm e}^{-T\Phi_c^{\rm bm}},$$

where

$$\rho^{\rm bm} := \rho^{\rm bm}(x^{(1)}, \ldots, x^{(C)}) = {\rm e}^{-C\sigma^2/2T}, \qquad \sigma^2 = C^{-1}\sum_{c = 1}^{C} \|x^{(c)} - \bar x\|^2,$$

and

$$Q^{\rm bm} = \mathbb{E}_{\overline{\mathbb W}}(E^{\rm bm}),$$

with $\overline {\mathbb W} $ denoting the law of C Brownian bridges $$x_t^{(1)}, \ldots ,x_t^{(C)}$$ with $x_0^{(c)} = {x^{(c)}}$ and $x_T^{(c)} = y$ in the time interval [0, T] (noting that these Brownian bridges are independent conditional on the starting and ending points), and

(6) $$E^{\rm bm} := \prod_{c = 1}^{C}\Big[\exp\Big\{-\int_0^T \big(\phi_c^{\rm dl}(x_t^{(c)}) - \Phi_c^{\rm bm}\big)\,{\rm d}t\Big\}\Big].$$

Proof. From the Dacunha-Castelle representation ([8]) we have

$$p_{T,c}^{\rm dl}(y \,|\, x^{(c)}) = \frac{f_c(y)}{f_c(x^{(c)})}\,\frac{\sqrt C}{\sqrt{2\pi T}}\exp\Big\{-\frac{\|y - x^{(c)}\|^2}{2T}\Big\}\,\mathbb{E}_{\overline{\mathbb W}}\Big(\exp\Big\{-\int_0^T \phi_c^{\rm dl}(x_t^{(c)})\,{\rm d}t\Big\}\Big).$$

The result then follows from (2) and (3) by rearrangement, recalling that $\sum_{c = 1}^{C} \|y - x^{(c)}\|^2 = \sum_{c = 1}^{C} \|x^{(c)} - \bar x\|^2 + C\|y - \bar x\|^2$ .

Here ρ bm and Q bm are both necessarily bounded by 1, and, when interpreted methodologically (see the next section), they correspond to separate acceptance steps within our rejection sampling framework. An event of probability ρ bm can be simulated by direct computation, and an event of probability Q bm can be simulated using the extensive, efficient methodology on Poisson samplers (using an auxiliary diffusion bridge path-space rejection sampler, as developed in, e.g., [Reference Beskos and Roberts3], [Reference Beskos, Papaspiliopoulos and Roberts4], [Reference Beskos, Papaspiliopoulos and Roberts5], [Reference Beskos, Papaspiliopoulos, Roberts and Fearnhead6], [Reference Chen and Huang7], [Reference Dai9], [Reference Pollock20], [Reference Pollock, Johansen and Roberts21], and [Reference Pollock, Fearnhead, Johansen and Roberts22]). Note that there is a trade-off involved in the (user-specified) choice of T: for small T, ρ bm will likely be small while Q bm is large, whereas, for large T, the opposite will be true. A small value of T is usually preferred, since the diffusion bridge rejection sampling for Q bm is comparatively expensive.

The algorithm for simulating from f (by means of g) therefore proceeds as per Algorithm 1 (which we term Monte Carlo fusion).

Algorithm 1. (Monte Carlo fusion (Brownian bridge approach).)

  1. Initialise a value T > 0;

  2. For c = 1, …, C, simulate x(c) from the density fc(x) and calculate $\bar x$ ;

  3. Simulate y from the Gaussian distribution, with density proportional to $\exp( - C\parallel y - \bar x{\parallel ^2}/2T)$ ;

  4. Generate the standard uniform random variable U 1;

  5. if $\log U_1 \le - C\sigma^2/2T$ then

  6. Generate the standard uniform random variable U 2 and the independent Brownian bridges $x_t^{(c)},{\kern 1pt} c = 1, \ldots ,C$ , in [0, T], conditional on the starting point x(c) and ending point y;

  7. if

    (7) $$U_2 \leq E^{\rm bm}$$

    then

  8. Accept and output y as a sample from f (x);

    // Event (7) can be dealt with via the path-space rejection sampling methods in Beskos and Roberts (2005), Beskos et al. (2006a), Beskos et al. (2008), and Pollock et al. (2016a)

  9. else

  10. Go back to step 2;

  11. end

  12. else

  13. Go back to step 2;

  14. end
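The steps of Algorithm 1 can be sketched as follows for the Section 5.1-style target with C = 2, i.e. $f_c(x) \propto {\rm e}^{-x^4/4}$. This is an illustrative simplification only: event (7) is approximated here by a Riemann sum along a discretised Brownian bridge, whereas the exact algorithm uses path-space rejection sampling, and the constant PHI and the rejection-envelope bound are specific to this example:

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 2, 1.0    # illustrative choices (the paper's example in Section 5.1 uses C = 4)
PHI = -1.0       # lower bound of phi_dl below for C = 2 (Condition 1)

def phi_dl(x):
    # phi_c^dl = 0.5*((A')^2 + A'') for A_c^dl(x) = -x**4/(2*C), cf. Section 5.1
    return 0.5 * (4 * x**6 / C**2 - 6 * x**2 / C)

def sample_fc():
    # rejection sampler for f_c(x) ∝ exp(-x**4/(2*C)) with a N(0,1) envelope;
    # for C = 2 the envelope constant is exp(1/4)
    while True:
        x = rng.standard_normal()
        if np.log(rng.uniform()) <= -x**4 / (2 * C) + x**2 / 2 - 0.25:
            return x

def fusion_draw(n_grid=100):
    while True:
        xs = np.array([sample_fc() for _ in range(C)])        # step 2
        xbar = xs.mean()
        y = xbar + np.sqrt(T / C) * rng.standard_normal()     # step 3
        sigma2 = np.mean((xs - xbar) ** 2)
        if np.log(rng.uniform()) > -C * sigma2 / (2 * T):     # step 5: rho^bm
            continue
        dt = T / n_grid
        t = np.linspace(0.0, T, n_grid + 1)
        log_E = 0.0
        for x0 in xs:                                         # step 6: bridges
            w = np.concatenate([[0.0], np.cumsum(
                np.sqrt(dt) * rng.standard_normal(n_grid))])
            bridge = x0 + (y - x0) * t / T + w - (t / T) * w[-1]
            # Riemann-sum approximation of the exponent in E^bm (event (7))
            log_E -= np.sum(phi_dl(bridge[:-1]) - PHI) * dt
        if np.log(rng.uniform()) <= log_E:                    # step 7
            return y

ys = np.array([fusion_draw() for _ in range(20)])
```

The discretisation introduces a bias that the exact algorithm avoids; it is used here only to make the structure of the two accept/reject steps concrete.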

3.2. Practical interpretation of the algorithm

Remark 1. (Adjustment of the simple average of the sub-posterior samples.) In the above algorithm, for all c, x(c) is simulated from fc(x) as in other Monte Carlo fusion algorithms. The proposed combined value y, however, is actually generated from a Gaussian distribution with mean $\bar x$ and covariance matrix C −1T Id. Therefore, y can be viewed as the simple average of the values {x(c), c ≤ C} plus a Gaussian random error term. The algorithm indicates exactly how this simple average of the independent sub-posterior samples can be adjusted to give a draw from the target distribution. The adjustment takes the form of the accept/reject steps with acceptance probability ρ bm × Q bm.

Remark 2. (Implications for Bayesian group decision-making.) In step 5 of Algorithm 1, we can also write $\log U_1 \le - C\sigma^2/2T$ as

(8) $$\sigma^2 \leq - \frac{2T\log U_1}{C}.$$

Note that σ 2 is actually the sample variance of the simulated starting points x(c) (strictly speaking, it is the sum of the variances of each component of x(c)). Thus, in this step, the acceptance condition (8) implies that y has a reasonable probability of being accepted as a sample from f (x) only when the variance of the x(c) is small enough. This coincides with our intuition in group decision-making. For example, a small variance of {x(c), c ≤ C} (satisfying condition (8)) means that the decisions/results from each group are similar, so we can combine these decisions/results in step 7 of Algorithm 1, using (7). On the other hand, if the variance of {x(c), c ≤ C} is large (so condition (8) is not satisfied), then the {x(c), c ≤ C} provide contradictory evidence and should usually be rejected. Note that, algorithmically, this initial rejection sampling step is efficient, since early rejection avoids the need to carry out the more complicated (and computationally expensive) step 7, via condition (7).

Remark 3. (An extreme case.) A very interesting extreme case is to choose T = 0. Then the event (7) will certainly happen, but the event $U_1 \le \exp\{ - C\sigma^2/2T\}$ will occur only if x(1) = · · · = x(C) = y. Therefore, in such an extreme case, Algorithm 1 (theoretically) draws a sample y from

$$h^{\rm bm}(x^{(1)}, \ldots, x^{(C)}, y) \propto \prod_{c = 1}^{C} f_c(y),$$

which is the target distribution. In practice, however, we have to choose T > 0, since the independent sub-posterior samples x(c) are identical with probability zero.

4. A fusion rejection sampler using Ornstein–Uhlenbeck bridges

4.1. The methodology

In the previous section, the proposal density h bm uses a simple average of the sub-posterior samples x(c) as the mean of the proposal y. Another approach is to use a weighted average of the sub-posterior samples x(c) as the mean of the proposal y. For example, in a typical meta-analysis, a weighted average of the different research outputs is used as the unification mean, and an individual output with more certainty (or smaller variance) should consequently receive a larger weight ([Reference Fleiss12]).

Denote by ${\hat \mu _c}$ and ${\hat \Lambda _c}$ the mean and the inverse of the covariance matrix estimates for the distribution fc(x), respectively. We consider the more general proposal density

$$h^{\rm ou}(x^{(1)}, \ldots, x^{(C)}, y) \propto \Big[\prod_{c = 1}^{C} f_c(x^{(c)})\Big]\,{\rm etr}\Big[-\frac{1}{2}[y - \tilde x]^{\otimes 2} D\Big],$$

where the notation etr is the exponential trace function, ‘⊗’ is the Kronecker product,

$$D = \sum_{c = 1}^{C} D_c, \qquad D_c = V_c^{-1} - \hat\Lambda_c,$$
$$V_c := V_c(T) = {\rm var}\Big(\int_0^T {\rm e}^{\hat\Lambda_c(t - T)}\,{\rm d}W_t^{(c)}\Big) = \frac{\hat\Lambda_c^{-1}}{2}\big(I_d - {\rm e}^{-2\hat\Lambda_c T}\big),$$

and

(9) $$\tilde x = D^{-1}\Big\{\sum_{c = 1}^{C}\big(V_c^{-1} m_c - \hat\Lambda_c\hat\mu_c\big)\Big\}, \qquad m_c := m_c(x^{(c)}, T) = \hat\mu_c + {\rm e}^{-\hat\Lambda_c T}\big(x^{(c)} - \hat\mu_c\big).$$

It is also straightforward to simulate from h ou(·), since the x(c) are independently drawn from fc(x), and then y is generated from a Gaussian distribution with mean $\tilde x$ and covariance matrix D−1.
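This simulation step can be sketched in the scalar (d = 1) case, computing Vc, mc, D, and $\tilde x$ from the formulas above; the values of $\hat\mu_c$ and $\hat\Lambda_c$ below are hypothetical placeholders for estimates obtained in a preliminary analysis:

```python
import numpy as np

rng = np.random.default_rng(2)

def hou_proposal(xs, mu_hat, lam_hat, T):
    """Draw y from the h^ou proposal (scalar version of V_c, m_c, D, x-tilde)."""
    Vc = (1.0 - np.exp(-2 * lam_hat * T)) / (2 * lam_hat)   # V_c(T)
    mc = mu_hat + np.exp(-lam_hat * T) * (xs - mu_hat)      # m_c(x^(c), T)
    Dc = 1.0 / Vc - lam_hat                                 # D_c = V_c^-1 - Lambda_c
    D = Dc.sum()
    xtilde = (mc / Vc - lam_hat * mu_hat).sum() / D         # x-tilde from (9)
    y = xtilde + rng.standard_normal() / np.sqrt(D)         # y ~ N(xtilde, D^-1)
    return y, xtilde, D

# hypothetical sub-posterior draws and preliminary estimates for C = 3
xs = np.array([0.1, -0.2, 0.3])
y, xtilde, D = hou_proposal(xs, mu_hat=np.zeros(3), lam_hat=np.ones(3), T=1.0)
```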

Here the vector mc and the matrix Vc are, respectively, the mean and the covariance matrix of the following Ornstein–Uhlenbeck (OU) process at time T, conditional on the starting point $x_0^{(c)} = {x^{(c)}}$ ([Reference Masuda17]):

(10) $${\rm d}X_t^{(c)} = \nabla A_c^{\rm ou}(X_t^{(c)})\,{\rm d}t + {\rm d}W_t^{(c)}, \qquad X_0^{(c)} \sim f_c^2(x),$$

with

$$A_c^{{\rm{ou}}}(x) = - {{{{({{{\hat \mu}}_c} - x)}^{tr}}{{{\hat \Lambda}}_c}({{{\hat \mu}}_c} - x)} \over 2}.$$

We shall also require the following regularity property.

Condition 2. Define

(11) $$\phi_c^{\rm ou}(x) = \tfrac{1}{2}\big(\|\hat\Lambda_c(\hat\mu_c - x)\|^2 - {\rm trace}(\hat\Lambda_c)\big).$$

For any c ∈ {1, …, C}, there exists $\Phi _c^{{\rm{ou}}} > - \infty $ such that, for all x,

$$\phi _c^{{\text{dl}}}(x) - \phi _c^{{\text{ou}}}(x) \ge \Phi _c^{{\text{ou}}}.$$

We now have the following result.

Proposition 3. Define function

(12) $$\rho^{\rm ou} := \rho^{\rm ou}(x^{(1)}, \ldots, x^{(C)}) = {\rm etr}\Big\{-\frac{1}{2}\Big[HD^{-1} + \sum_{c = 1}^{C} M_{1,c}\big(m_c + M_{1,c}^{-1}M_{2,c}V_c\hat\Lambda_c\hat\mu_c\big)^{\otimes 2}\Big]\Big\},$$

with

$$\begin{aligned} M_{1,c} &= {\rm e}^{2\hat\Lambda_c T}\hat\Lambda_c - V_c^{-1}\Big(\sum_{c' = 1}^{C}\hat\Lambda_{c'}\Big)D^{-1},\\ M_{2,c} &= V_c^{-1}\Big(\sum_{c' = 1}^{C}\hat\Lambda_{c'}\Big)D^{-1} - 2\hat\Lambda_c{\rm e}^{2\hat\Lambda_c T},\\ H &= \Big(\sum_{c = 1}^{C}(m_c - V_c\hat\Lambda_c\hat\mu_c)^{\otimes 2}V_c^{-1}\Big)\Big(\sum_{c = 1}^{C}V_c^{-1}\Big) - \Big\{\sum_{c = 1}^{C}V_c^{-1}(m_c - V_c\hat\Lambda_c\hat\mu_c)\Big\}^{\otimes 2}. \end{aligned}$$

Under Condition 2, we can write

(13) $$\frac{g^{\rm dl}(x^{(1)}, \ldots, x^{(C)}, y)}{h^{\rm ou}(x^{(1)}, \ldots, x^{(C)}, y)} \propto \rho^{\rm ou}(x^{(1)}, \ldots, x^{(C)}) \times Q^{\rm ou} \times \prod_{c = 1}^{C} {\rm e}^{-T\Phi_c^{\rm ou}},$$

where

$$Q^{{\rm{ou}}} = {\mathbb{E}}_{{\overline{\mathbb{O}}}}(E^{\rm ou}),$$

with $\overline{\mathbb{O}}$ denoting the law of C OU bridges ${\boldsymbol x}^{(c)}_t$ , c = 1, · · ·, C, in time interval [0, T]. These $x_t^{(c)}$ are independent conditional on the starting point x (c) and common ending point y, and

(14) $${E^{{\rm{o}}u}}{\kern 1pt} : = {\kern 1pt} \prod\limits_{c = 1}^C [\exp \{ - \int_0^T (\phi _c^{{\rm{dl}}}(x_t^{(c)}) - \phi _c^{{\rm{ou}}}(x_t^{(c)}) - \Phi _c^{{\rm{ou}}}){\kern 1pt} {\rm{d}}t\} ].$$

Proof. See Appendix A.

Since ρ ou and Q ou in (13) are always no more than 1, we have the following rejection sampling algorithm.

Algorithm 2. (Monte Carlo fusion (Ornstein–Uhlenbeck approach).)

  1. Initialise a value T > 0 and ${\hat \mu _c},{\hat \Lambda _c}$ ;

  2. For c = 1, …, C, simulate x(c) from the density fc(x) and calculate $\tilde x$ , D;

  3. Simulate y from the Gaussian distribution, with mean $\tilde x$ and covariance matrix D−1;

  4. Generate the standard uniform random variable U 1;

  5. if U 1 ≤ ρ ou(x(1), …, x(C)) (given in (12)) then

  6. Generate the standard uniform random variable U 2 and independent OU bridges $x_t^{(c)},{\kern 1pt} c = 1, \ldots ,C$ , in [0, T], conditional on the starting point $x_{0}^{(c)}=x^{(c)}$ and ending point y;

  7. if

    (15) $${U_2} \le \exp \{ - \sum\limits_{c = 1}^C \int_0^T (\phi _c^{{\rm{dl}}}(x_t^{(c)}) - \phi _c^{{\rm{ou}}}(x_t^{(c)}) - \Phi _c^{{\rm{ou}}}){\kern 1pt} {\rm{d}}t\} $$

    // Event (15) can be dealt with via the path-space rejection sampling methods in Beskos and Roberts (2005), Beskos et al. (2006a), Beskos et al. (2008), and Pollock et al. (2016a)

    then

  8. Accept and output y as a sample from f (x);

  9. else

  10. Go back to step 2;

  11. end

  12. else

  13. Go back to step 2;

  14. end

In Algorithm 2, T is again a tuning parameter. If we choose a small value of T, the acceptance probability ρ ou(x(1), …, x(C)) will be very low: note that, when T = 0, trace(HD−1) = ∞ while M1,c and M2,c are finite matrices. This suggests choosing a reasonably large value of T. However, if we choose a large T, the acceptance probability (15) will be very small. In practice, a relatively small T is usually preferable, since the computational cost of the acceptance/rejection step (15) is much higher.

4.2. Connection with consensus Monte Carlo

In practice, one may employ an approximate version of Algorithm 2, since, for a small value of T, condition (15) in rejection step 7 can be ignored. In other words, from (13) of Proposition 3, for small T, the target g dl(·) is approximately equal to a function proportional to

$$\tilde h^{\rm ou} = h^{\rm ou} \cdot \rho^{\rm ou} = \Big[\prod_{c = 1}^{C} f_c(x^{(c)})\Big]\,{\rm etr}\Big[-\frac{1}{2}[y - \tilde x]^{\otimes 2} D\Big] \cdot \rho^{\rm ou}(x^{(1)}, \ldots, x^{(C)}).$$

Such an approximation will be very good, since Q ou, the acceptance probability (15), will be very close to 1 for small T. In other words, we simply simulate y from ${\tilde h^{{\rm{o}}u}}( \cdot )$ and accept y as an approximate sample from the target distribution f (·).

We draw x(c), c = 1, …, C, independently from each fc(x), respectively. A naive approach to combining these draws is to use the following linear combination:

(16) $$y = \Big(\sum_{c = 1}^{C}\hat\Lambda_c\Big)^{-1}\Big[\sum_{c = 1}^{C}\hat\Lambda_c x^{(c)}\Big].$$

In practice, $\hat \Lambda _c^{ - 1}$ can be obtained via preliminary analysis. Such a y can also be viewed as an approximate sample from f (x). This is termed consensus Monte Carlo in [Reference Scott23].
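The consensus combination (16) is a one-line precision-weighted average of the sub-posterior draws; a d = 1 sketch with illustrative weights:

```python
import numpy as np

def consensus_combine(xs, lam_hats):
    """Consensus Monte Carlo combination (16): a precision-weighted
    average of the sub-posterior draws (scalar sketch)."""
    lam = np.asarray(lam_hats)
    return (lam * np.asarray(xs)).sum() / lam.sum()

y = consensus_combine([0.0, 2.0], [1.0, 3.0])   # weights 1/4 and 3/4
```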

The following lemma tells us how the simulation of y from ${\tilde h^{{\rm{o}}u}}( \cdot )$ is related to the consensus Monte Carlo sample (16).

Lemma 1. If each fc(x) is a Gaussian distribution and we choose T = ∞, then simulating y from ${\tilde h^{{\rm{o}}u}}( \cdot )$ is the same as the consensus Monte Carlo (CMC) method. Both draw samples exactly from the target distribution.

Proof. See Appendix B.

This lemma tells us why CMC does not, in general, provide good results: CMC simulates y from ${\tilde h^{{\rm{o}}u}}( \cdot )$ with T = ∞, whereas T should be chosen small to achieve a better approximation, or one should even use the exact Algorithm 2 with the diffusion path-space rejection sampler.

5. Simulation studies

5.1. Distribution with light tails

We consider the target distribution $f(x) \propto {\rm e}^{-x^4/2}$. We choose C = 4 and $f_c(x) = {\rm e}^{-x^4/2C}$, c = 1, ..., C. It is easy to check that $\phi_c^{\rm dl}(x) = \tfrac{1}{2}({4x^6}/{C^2} - {6x^2}/{C})$ satisfies Condition 1. On the other hand, for any chosen values ${\hat \mu _c}$ and ${\hat \Lambda _c}$, $\phi_c^{\rm ou}(x)$ in (11) satisfies Condition 2 as well. Therefore, both Algorithm 1 and Algorithm 2 can be applied to simulate from f (x).
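The closed form of $\phi_c^{\rm dl}$ quoted here can be checked numerically against the definition $\phi_c^{\rm dl} = \tfrac12(\|\nabla A_c^{\rm dl}\|^2 + {\rm div}\,\nabla A_c^{\rm dl})$ from Condition 1; a finite-difference sketch:

```python
import numpy as np

C = 4  # as in this example

def A(x):
    # A_c^dl(x) = log f_c(x) with f_c(x) ∝ exp(-x**4/(2*C))
    return -x**4 / (2 * C)

def phi_closed(x):
    # closed form quoted in the text
    return 0.5 * (4 * x**6 / C**2 - 6 * x**2 / C)

def phi_numeric(x, h=1e-4):
    # phi_c^dl = 0.5*((A')^2 + A'') via central finite differences
    d1 = (A(x + h) - A(x - h)) / (2 * h)
    d2 = (A(x + h) - 2 * A(x) + A(x - h)) / h**2
    return 0.5 * (d1**2 + d2)

xs = np.linspace(-2.0, 2.0, 9)
err = np.max(np.abs(phi_closed(xs) - phi_numeric(xs)))
```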

We compare the following Monte Carlo (MC) methods for the estimation of the density function of f (x).

  1. Simulating MC samples directly from f (x) via simple rejection sampling with a standard Gaussian proposal.

  2. Simulating MC samples using the exact simulation method Algorithm 1, with T = 1.

  3. Simulating MC samples using the exact simulation method Algorithm 2, with T = 1.

  4. Simulating MC samples using the consensus method of [Reference Scott23].

The density curve estimation results are summarised in Figure 1. Note that all results are based on 10 000 realisations. The solid curve (simulation 1, the true fitted density curve), the dotted curve (simulation 2), and the dash–dot curve (simulation 3) are all exact algorithms and they are almost identical. The consensus method (simulation 4, the dashed curve) has very large biases. Note that both the CMC algorithm and Algorithm 2 use the same values of ${\hat \mu _c}$ and ${\hat \Lambda _c}$ based on preliminary analysis.

Figure 1: Kernel density fitting with bandwidth 0.25 for the density proportional to ${\rm e}^{-x^4/2}$, based on different MC methods: the standard exact MC (simulation 1, solid curve); Algorithm 1 (simulation 2, dash–dot curve); Algorithm 2 (simulation 3, dotted curve); the CMC algorithm (simulation 4, dashed curve).

The running times of the algorithms are presented in Table 1. Algorithm 2 uses the most system running time; this is because Algorithm 2 has a smaller acceptance probability (about 0.011 for the path-space rejection sampling step (15)) than Algorithm 1 (about 0.139 for the path-space rejection sampling step (7)).

TABLE 1: System running times in seconds for simulating 10 000 realisations.

Note that Condition 1 is usually satisfied in most applications; however, Condition 2 only holds when the target f (x) has lighter tails than Gaussian distributions. We present an example in the following section, where only Condition 1 holds.

5.2. Beta distribution

Consider as target the beta distribution with density π(u) ∝ u 4(1 − u), u ∈ [0, 1], i.e. Beta(5, 2). To use the proposed algorithms, the support of the target distribution should be the whole real line. Therefore, we use the variable transformation x = log (u/(1 − u)) and consider the target distribution as

$$f(x) \propto \Big[\frac{\exp(x)}{1 + \exp(x)}\Big]^5\Big[\frac{1}{1 + \exp(x)}\Big]^2.$$

We decompose f (x) into C = 5 components,

(17) $$f(x) \propto f_1(x) \cdots f_C(x), \qquad f_c(x) = \Big[\frac{\exp(x)}{1 + \exp(x)}\Big]\Big[\frac{1}{1 + \exp(x)}\Big]^{0.4}.$$
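The decomposition (17) can be verified numerically: C times the log of fc should differ from the log target only by a constant (here zero). A quick sketch:

```python
import numpy as np

C = 5

def f_log(x):
    # log target (up to a constant): 5*log(e^x/(1+e^x)) + 2*log(1/(1+e^x))
    return 5 * (x - np.log1p(np.exp(x))) + 2 * (-np.log1p(np.exp(x)))

def fc_log(x):
    # log sub-posterior from (17)
    return (x - np.log1p(np.exp(x))) + 0.4 * (-np.log1p(np.exp(x)))

xs = np.linspace(-3.0, 3.0, 7)
diff = f_log(xs) - C * fc_log(xs)   # should be (numerically) zero everywhere
```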

Note that, for this simple example, Condition 1 is satisfied but Condition 2 is not satisfied. Therefore, we compare the following MC methods for the estimation of the density function of Beta(5, 2).

  (a) Simulating MC samples directly from Beta(5, 2) via the simple R command rbeta.

  (b) Simulating MC samples using the exact simulation method Algorithm 1, with T = 3.

  (c) Simulating MC samples using the consensus method of [Reference Scott23], with the variable transformation and the decomposition in (17).

For simulation (c), ${\hat \mu _c}$ and ${\hat \Lambda _c}$ (also known as the weight of each consensus sample in the CMC algorithm) are respectively chosen as the estimated mean and inverse of the variance of fc(·), as suggested by [Reference Scott23].

The density curve estimation results are summarized in Figure 2. Note that all results are based on 10 000 realisations. The solid curve (simulation (a)) and the dotted curve (simulation (b)) are almost identical, since both of them are based on exact simulation methods. Again, the CMC algorithm gives very biased results.

Figure 2: Kernel density fitting with bandwidth 0.04 for Beta(5, 2), based on different MC methods: the standard exact MC (simulation (a), solid curve); Algorithm 1 (simulation (b), dotted curve); the CMC algorithm (simulation (c), dashed curve).

Remark 4. (Tuning parameters and the density decomposition.) We may choose any value of T in the proposed algorithms. However, as mentioned before, T is a tuning parameter for the efficiency of Algorithm 1, as our simulation results indeed show. The CPU running time for simulating 10 000 realisations based on Algorithm 1 for the beta example is summarised in Figure 3; the optimum value is about T = 2.

Figure 3: CPU time (in seconds) for Algorithm 1, based on different T.

We now describe, in terms of Algorithm 1, two empirical approaches to finding a good value of T in practice.

Approach 1. Although it is not easy to calculate an explicit value for the overall acceptance probability

(18) $${\underbrace {{\rm e}^{ - C{\sigma ^2}/2T}}_{{\rho ^{{\rm bm}}}}}{\underbrace {{{\rm {\mathbb E}}_{\overline {\mathbb W}}} { \left ( \prod \limits_{c = 1}^C \left [\exp \{ - \int_0^T ({\phi_c^{\rm dl}}(x_t^{(c)}) - {\Phi_c^{\rm bm}}){\rm d}t \} \right ] \right )}}_{Q^{\rm bm}}},$$

where $\sigma^2$ is the sample variance of the distributed MC samples, it can be estimated easily using a small number of preliminary simulations. Therefore, in practice, we choose a sequence of different values of T, say $T_1, T_2, \ldots, T_q$, and, for each $T_i$, we carry out a small number of preliminary simulations to estimate the above probability. The value $T_i$ which gives the largest estimated acceptance probability is then chosen.
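Approach 1 amounts to a simple grid search; a sketch is given below. Here `estimate_acceptance` is a hypothetical placeholder for whatever preliminary pilot run is used to estimate (18), and the toy stand-in used to exercise it merely mimics the qualitative shape of (18) (acceptance degraded both by small T, through $\rho^{\rm bm}$, and by large T, through $Q^{\rm bm}$).

```python
import math

def choose_T(candidate_Ts, estimate_acceptance, n_pilot=100):
    # Approach 1: run a small pilot study for each candidate T and keep
    # the value with the largest estimated overall acceptance probability.
    best_T, best_p = None, -1.0
    for T in candidate_Ts:
        p_hat = estimate_acceptance(T, n_pilot)  # hypothetical pilot run
        if p_hat > best_p:
            best_T, best_p = T, p_hat
    return best_T

# Toy stand-in for a pilot estimate of (18): exp{-a/T - b*T} with a = 1,
# b = 0.2, maximised near T = sqrt(a/b) ~ 2.24.
toy_estimate = lambda T, n: math.exp(-1.0 / T - 0.2 * T)
```

On the grid {1, 2, 3, 4} the toy surrogate selects T = 2, matching the behaviour seen in Figure 3 for the beta example.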

Approach 2. When dealing with Q bm via path-space rejection sampling, we need upper bounds $\Psi_c$ satisfying $\phi_c^{\rm dl}(x_c) \le \Psi_c$ for all $x_c$, and we want $(\Psi_c - \Phi_c^{\rm bm})T$ to be as small as possible to give efficient Poisson thinning steps in the path-space rejection sampling for diffusion bridges. Therefore, if $\Psi_c$ (or an approximate value) is available, it makes sense to choose a value of T which gives a large value for the following probability (obtained by replacing $\phi_c^{\rm dl}(x_t^{(c)})$ by $\Psi_c$ in (18)):

$${\rm e}^{-C\sigma^2/2T} \prod_{c=1}^C \Big[ \exp \Big\{ - \int_0^T (\Psi_c - \Phi_c^{\rm bm})\,{\rm d}t \Big\} \Big] = \exp \Big\{ - \frac{\sum_{c=1}^C \| x^{(c)} - \bar x \|^2}{2T} - \sum_{c=1}^C (\Psi_c - \Phi_c^{\rm bm}) T \Big\}.$$

Each preliminary simulation study simulates values $x^{(c)}$ from each $f_c(x)$ and thus gives $\bar x$. Therefore, we can find the value $T^*$ which maximizes the above formula in each preliminary study. The average of the optimal $T^*$ values over the preliminary studies may then be used as the value of T.
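The maximisation in Approach 2 is available in closed form: the displayed formula has the shape $\exp\{-a/T - bT\}$ with $a = \sum_c \|x^{(c)} - \bar x\|^2/2$ and $b = \sum_c (\Psi_c - \Phi_c^{\rm bm})$, so (assuming $b > 0$) the stationary point of $a/T + bT$ gives $T^* = \sqrt{a/b}$. The sketch below (our own, with scalar $x^{(c)}$ for simplicity) computes this per preliminary study.

```python
import math

def optimal_T(pilot_xs, bound_gaps):
    # Approach 2: maximise exp{-a/T - b*T} over T in closed form.
    # pilot_xs: one pilot draw x^(c) per sub-density (scalars here);
    # bound_gaps: the C values Psi_c - Phi_c^bm, assumed positive.
    xbar = sum(pilot_xs) / len(pilot_xs)
    a = sum((x - xbar) ** 2 for x in pilot_xs) / 2.0
    b = sum(bound_gaps)
    return math.sqrt(a / b)  # stationary point of a/T + b*T
```

Averaging `optimal_T` over several preliminary studies then gives the working value of T, as described above.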

In addition, when we split the target f into $f = f_1 \cdots f_C$, we chose $f_1 = \cdots = f_C = f^{1/C}$, since such a decomposition will always give the smallest variation for $x^{(c)}$, $c = 1, \ldots, C$, as suggested in [10].
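For instance, with a standard normal target, and assuming the usual definition $\phi_c^{\rm dl} = \tfrac{1}{2}((A_c')^2 + A_c'')$ with $f_c \propto {\rm e}^{A_c}$, this decomposition gives

$$f(x) \propto {\rm e}^{-x^2/2} \quad \Longrightarrow \quad f_c(x) = f^{1/C}(x) \propto {\rm e}^{-x^2/2C}, \qquad A_c(x) = -\frac{x^2}{2C},$$

so each sub-density is N(0, C) and

$$\phi_c^{\rm dl}(x) = \frac12 \Big( \frac{x^2}{C^2} - \frac{1}{C} \Big),$$

which is exactly the setting used in the Gaussian comparison of Section 5.3 below.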

5.3. Normal distribution: comparison of the two algorithms

We have seen in previous sections that Algorithm 2 may have limited applicability due to its complexity and the stronger condition it requires, Condition 2. However, Algorithm 2 can have a great advantage when the target distribution is very close to a Gaussian distribution, which is usually the case for Bayesian posterior densities in the big data framework. We here use a simple Gaussian example to compare the two algorithms.

We consider the standard normal target $f(x) \propto {\rm e}^{-x^2/2}$ and $\phi^{\rm dl}_c(x) = \tfrac{1}{2}({x^2}/{C^2} - {1}/{C})$. We choose the OU process with ${\hat \mu _c} = 0$ and ${\hat \Lambda _c} = 1/2C$, which gives $\phi _c^{\rm ou}(x) = \tfrac{1}{2}({x^2}/4{C^2} - 1/2C)$. The normalising constant $\Phi _c^{\rm ou} = - 1/4C$ guarantees that $\phi _c^{\rm dl}(x) - \phi _c^{\rm ou}(x)$ is bounded below. Under these settings, the acceptance probability E ou in (14) becomes

$$E^{\rm ou} := \prod_{c=1}^C \Big[ \exp \Big\{ - \int_0^T \frac{3 (x_t^{(c)})^2}{8C^2}\,{\rm d}t \Big\} \Big]$$

and, furthermore, $Q^{\rm ou} = {\mathbb E}_{\overline{\mathbb{O}}}(E^{\rm ou})$. On the other hand, if we use Algorithm 1, we can choose $\Phi _c^{\rm bm} = - 1/2C$ and the acceptance probability E bm defined in (6) becomes

$$E^{\rm bm} := \prod_{c=1}^C \Big[ \exp \Big\{ - \int_0^T \frac{(x_t^{(c)})^2}{2C^2}\,{\rm d}t \Big\} \Big]$$

and $Q^{{\rm bm}} = {\mathbb E}_{\overline {\mathbb W}}({E^{\rm bm}})$. Note that ${\overline {\mathbb W}}$ and ${\overline {\mathbb O}}$ are the probability measures induced by Brownian bridges and OU bridges, respectively, conditional on the same starting and ending points. It is clear that, for the same $x_t^{(c)}$, $E^{\rm ou} \ge E^{\rm bm}$; thus, Q ou should be greater than Q bm if the value of T is not too large. Therefore, Algorithm 2 should have higher acceptance probabilities for the complicated path-space rejection sampling for diffusions. In addition, we also found via 10 000 simulations that ρ ou is higher than ρ bm for this toy Gaussian example.

We chose T = 3 for both algorithms, which is close to the most efficient tuning parameter value for each. For Algorithm 2, the overall acceptance probability is about 0.1, with ρ ou ≈ 0.16 and Q ou ≈ 0.63. The overall acceptance probability for Algorithm 1 is about 0.07, with ρ bm ≈ 0.14 and Q bm ≈ 0.51. Indeed, Algorithm 2 gains an advantage when the target is an (approximately) Gaussian distribution. In fact, when using Algorithm 2, if we choose an OU process such that $\phi^{\rm dl}_c({\boldsymbol x}) \approx \phi^{\rm ou}_c({\boldsymbol x})$, the acceptance probability Q ou will be close to 1, which is much better than that of Algorithm 1.
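The ordering $E^{\rm ou} \ge E^{\rm bm}$ can be seen directly from the integrands: since $\phi_c^{\rm dl} - \phi_c^{\rm ou} - \Phi_c^{\rm ou} = 3x^2/8C^2 \le x^2/2C^2 = \phi_c^{\rm dl} - \Phi_c^{\rm bm}$ pointwise, the exponent in $E^{\rm ou}$ is smaller in magnitude for every path. The sketch below (our own illustration, using a crude Euler-discretised Brownian bridge and a Riemann sum; the paper itself uses exact path-space methods, not discretisation) checks this on a single simulated bridge with C = 2.

```python
import math
import random

def accept_factors(path, T, C):
    # Crude Riemann approximations of the integrals appearing in E^bm
    # and E^ou for one bridge path (one factor c of the product).
    dt = T / (len(path) - 1)
    i_bm = sum(x * x / (2 * C * C) for x in path) * dt   # phi_dl - Phi_bm
    i_ou = sum(3 * x * x / (8 * C * C) for x in path) * dt
    return math.exp(-i_bm), math.exp(-i_ou)

random.seed(0)
T, C, n_steps = 3.0, 2, 300
x0, y = random.gauss(0, 1), random.gauss(0, 1)
# Discretised Brownian bridge from x0 at time 0 to y at time T.
w = [0.0]
for _ in range(n_steps):
    w.append(w[-1] + random.gauss(0, math.sqrt(T / n_steps)))
ts = [k * T / n_steps for k in range(n_steps + 1)]
path = [x0 + (y - x0) * t / T + (w[k] - w[-1] * t / T)
        for k, t in enumerate(ts)]
e_bm, e_ou = accept_factors(path, T, C)
```

Because 3/(8C²) < 1/(2C²), the inequality `e_ou >= e_bm` holds path by path, not only on average.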

6. Conclusion

In this paper we have introduced a novel theoretical framework, and direct methodological implementation (Monte Carlo fusion), to address the common (but challenging) ‘fusion’ problem—unifying distributed analyses and inferences on shared parameters from multiple sources into a single coherent inference—by viewing it as a simple rejection sampler on an extended space (Section 2), and developing an appropriate sampling mechanism (Section 3).

Our fusion approach in this paper is not only the first to answer in a principled manner how to combine samples from multiple sources, but (as shown in Section 4) it also provides a principled approach to understanding the errors that arise in other existing unification schemes (such as those in big data divide-and-conquer approaches). The errors in existing unification schemes can be considerable, even for simple one-dimensional unification targets (such as those considered in Section 5).

Characterising the error in existing unification schemes is possible by setting the (proposal) sampling mechanisms of those schemes within our framework, and finding a representation for the remaining error (which we could remove by further acceptance or rejection). This opens interesting avenues of research in which existing unification schemes are adapted within our framework into efficient proposal mechanisms for our extended fusion target density (2).

A number of avenues to apply our work directly to interesting applications are possible. In addition to those discussed in Section 1 (namely expert elicitation, multiview learning, and meta-analysis), other application areas include Bayesian group decision theory (see Remark 2) and Bayesian sensitivity analysis. In the case of Bayesian sensitivity analysis we need to assess a large number of prior distributions, but existing methods address this by means of approximations [27].

A key avenue for (on-going) future research is to fully explore how to use the Monte Carlo fusion framework we introduce to modify, and remove approximation from, existing Monte Carlo methods (particularly within the big data setting) that use the divide-and-conquer (fork-and-join) strategies described in Section 1. In the particular setting of big data the unification of distributed inferences is only part of the problem—additional constraints are imposed due to practical computational and hardware concerns (such as avoiding as far as possible communication between computing cores). As such, key future research will focus on how to implement Monte Carlo fusion with this particular set of constraints (as direct implementation is not possible).

Another interesting big data direction would be to blend the multi-core approach of divide-and-conquer strategies with state-of-the-art single-core approaches for big data (see, for instance, [22]).

Appendix A. Proof of Proposition 3

To prove the proposition, we first introduce a lemma about a density proportional to the following function:

(19) $$\eqalign{ & \tilde g^{\rm ou}({x^{(1)}}, \ldots ,{x^{(C)}},y) \cr & \quad = \prod_{c = 1}^C \big[ {f_c}({x^{(c)}})\, p_{T,c}^{\rm ou}(y \mid {x^{(c)}}) \exp (A_c^{\rm ou}({x^{(c)}}) - A_c^{\rm ou}(y)) \big], \cr} $$

where $p_{T,c}^{{\rm ou}}(y \mid {x^{(c)}})$ is the transition density from $x^{(c)}$ at time 0 to y at time T for the OU process given by (10).

Lemma 2. (The expression for ${\tilde g^{\rm ou}}$.) The formula in (19) can be rewritten as

$$\eqalign{ & \tilde g^{\rm ou}({x^{(1)}}, \ldots ,{x^{(C)}},y) \cr & \quad \propto \Big[ \prod_{c = 1}^C {f_c}({x^{(c)}}) \Big] {\rm etr}\Big[ - \frac12 [y - \tilde x]^{\otimes 2} D \Big] \rho^{\rm ou}({x^{(1)}}, \ldots ,{x^{(C)}}), \cr} $$

where $\tilde x$ and ρ ou are given by (9) and (12), respectively.

Proof. See the supplementary file [11].

Proof of Proposition 3. If we denote by $\overline {\mathbb{DL}}$ the law of C d-dimensional Langevin diffusion bridges given in (4), with starting points $x^{(c)}$ and common ending point y, then the result of Proposition 2 can be written as

$$\frac{g^{\rm dl}({x^{(1)}}, \ldots ,{x^{(C)}},y)}{h^{\rm bm}({x^{(1)}}, \ldots ,{x^{(C)}},y)} \times {\rm d}\overline {\mathbb{DL}} (\vec x) \propto \rho^{\rm bm} \times E^{\rm bm} \times \prod_{c = 1}^C {\rm e}^{-T\Phi_c} \times {\rm d}\overline{\mathbb{W}}(\vec x),$$

where $\vec x = \{ x_t^{(c)},\ c = 1, \ldots ,C,\ t \in [0,T]\}$ denotes typical diffusion bridge paths, with starting points $x^{(c)}$ and common ending point y.

If we consider the very special case where each $f_c(x)$ is a Gaussian density $\exp(A_c^{\rm ou}(x))$, we immediately find that the target g dl becomes

$$g^{\rm ou} = \prod_{c = 1}^C \Big[ {\rm e}^{2A_c^{\rm ou}({x^{(c)}})}\, p_{T,c}^{\rm ou}(y \mid {x^{(c)}})\, \frac{1}{{\rm e}^{A_c^{\rm ou}(y)}} \Big].$$

In addition, the proposal density h bm in Proposition 2 becomes

$$h^{\rm bm}({x^{(1)}}, \ldots ,{x^{(C)}},y) = \prod_{c = 1}^C \big[ {\rm e}^{A_c^{\rm ou}({x^{(c)}})}\, {\rm e}^{-\| y - x^{(c)} \|^2/2T} \big],$$

and, furthermore, the rejection sampling ratio in Proposition 2 becomes

$$\frac{g^{\rm ou}({x^{(1)}}, \ldots ,{x^{(C)}},y)}{h^{\rm bm}({x^{(1)}}, \ldots ,{x^{(C)}},y)} \times {\rm d}\overline{\mathbb{O}}(\vec x) \propto \rho^{\rm bm} \times E_*^{\rm ou} \times {\rm d}\overline{\mathbb{W}}(\vec x), \qquad E_*^{\rm ou} := \prod_{c = 1}^C \Big[ \exp \Big\{ - \int_0^T \phi_c^{\rm ou}(x_t^{(c)})\,{\rm d}t \Big\} \Big].$$

Now we have

$$\eqalign{ \frac{g^{\rm dl}}{h^{\rm ou}}\, \frac{{\rm d}\overline{\mathbb{DL}}}{{\rm d}\overline{\mathbb{O}}} & = \frac{g^{\rm dl}}{h^{\rm bm}}\, \frac{{\rm d}\overline{\mathbb{DL}}}{{\rm d}\overline{\mathbb{W}}} \times \frac{h^{\rm bm}}{g^{\rm ou}}\, \frac{{\rm d}\overline{\mathbb{W}}}{{\rm d}\overline{\mathbb{O}}} \times \frac{g^{\rm ou}}{h^{\rm ou}} \cr & \propto \frac{E^{\rm bm}}{E_*^{\rm ou}} \times \frac{\prod_{c = 1}^C [{\rm e}^{2A_c^{\rm ou}(x^{(c)})}\, p_{T,c}^{\rm ou}(y \mid x^{(c)}) / {\rm e}^{A_c^{\rm ou}(y)}]}{\prod_{c = 1}^C f_c(x^{(c)})\, {\rm etr}[- [y - \tilde x]^{\otimes 2} D / 2]} \times \frac{\prod_{c = 1}^C f_c(x^{(c)})}{\prod_{c = 1}^C {\rm e}^{A_c^{\rm ou}(x^{(c)})}} \cr & = \frac{E^{\rm bm}}{E_*^{\rm ou}} \times \frac{\tilde g^{\rm ou}(x^{(1)}, \ldots, x^{(C)}, y)}{[\prod_{c = 1}^C f_c(x^{(c)})]\, {\rm etr}[- [y - \tilde x]^{\otimes 2} D / 2]}. \cr} $$

Lemma 2 and the expressions for E bm and $E_*^{\rm ou}$ immediately give

$$\frac{g^{\rm dl}}{h^{\rm ou}}\, \frac{{\rm d}\overline{\mathbb{DL}}}{{\rm d}\overline{\mathbb{O}}} = \frac{E^{\rm bm}}{E_*^{\rm ou}} \cdot \rho^{\rm ou} \propto E^{\rm ou} \cdot \rho^{\rm ou},$$

where E ou is given in (14).

Appendix B. Proof of Lemma 1

We consider the formula for $\rho^{\rm ou}$ in (12). If T = ∞, we have

$${m_c} = {\hat \mu _c},\quad \quad {V_c} = {{\hat \Lambda _c^{ - 1}} \over 2},$$

and

$$D = \sum_{c = 1}^C \hat \Lambda _c, \qquad M_{1,c} = {\rm e}^{2\hat \Lambda _c T} \hat \Lambda _c - 2\hat \Lambda _c, \qquad M_{2,c} = 4\hat \Lambda _c - 2\,{\rm e}^{2\hat \Lambda _c T} \hat \Lambda _c = - 2M_{1,c}.$$

Therefore,

$$\lim_{T \to \infty} M_{1,c} \big( m_c + M_{1,c}^{-1} M_{2,c} V_c \hat \Lambda _c \hat \mu _c \big)^{\otimes 2} = \lim_{T \to \infty} \Big( M_{1,c}^{1/2} \hat \mu _c + M_{1,c}^{-1/2} M_{2,c} \frac{\hat \mu _c}{2} \Big)^{\otimes 2}$$

and, furthermore, $\rho^{\rm ou}(x^{(1)}, \ldots, x^{(C)})$ becomes a value not depending on $x^{(1)}, \ldots, x^{(C)}$ at all as T → ∞. Therefore, with T = ∞, the density function ${\tilde h^{\rm ou}}(\cdot)$ becomes

$$\tilde h^{\rm ou}({x^{(1)}}, \ldots ,{x^{(C)}},y) \propto \Big[ \prod_{c = 1}^C {f_c}({x^{(c)}}) \Big] {\rm etr}\Big[ - \frac12 [y - \tilde x]^{\otimes 2} \Big( \sum_{c = 1}^C \hat \Lambda _c \Big) \Big],$$

with

$$\tilde x = \Big( \sum_{c = 1}^C \hat \Lambda _c \Big)^{-1} \Big\{ \sum_{c = 1}^C \hat \Lambda _c \hat \mu _c \Big\}.$$

Then we can generate y, with a standard Gaussian random error ε, as

$$\eqalign{ y & = \tilde x + \Big( \sum_{c = 1}^C \hat \Lambda _c \Big)^{-1/2} \varepsilon \cr & = \Big( \sum_{c = 1}^C \hat \Lambda _c \Big)^{-1} \Big\{ \sum_{c = 1}^C \hat \Lambda _c \hat \mu _c + \Big( \sum_{c = 1}^C \hat \Lambda _c \Big)^{1/2} \varepsilon \Big\} \cr & \mathop = \limits^D \Big( \sum_{c = 1}^C \hat \Lambda _c \Big)^{-1} \Big\{ \sum_{c = 1}^C \hat \Lambda _c \hat \mu _c + \sum_{c = 1}^C \hat \Lambda _c^{1/2} \varepsilon_c \Big\}, \cr} $$

where ‘$\mathop = \limits^D$’ denotes equality in distribution, and $\varepsilon_c$, $c = 1, \ldots, C$, are C independent standard normal vectors.

By noting that $\hat \mu _c + \hat \Lambda _c^{-1/2} \varepsilon_c$ has the same distribution as $x^{(c)}$ when $f_c(x)$ is a Gaussian density, the lemma is proved.
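The T = ∞ combination in this proof can be checked numerically: drawing $x^{(c)} \sim {\rm N}(\hat\mu_c, \hat\Lambda_c^{-1})$ and forming the precision-weighted average $y = (\sum_c \hat\Lambda_c)^{-1} \sum_c \hat\Lambda_c x^{(c)}$ should reproduce the fusion target $f = \prod_c f_c$, which for Gaussian $f_c$ is ${\rm N}\big((\sum_c \hat\Lambda_c)^{-1} \sum_c \hat\Lambda_c \hat\mu_c,\ (\sum_c \hat\Lambda_c)^{-1}\big)$. The means, precisions, sample size, and seed in the sketch below are our own illustrative choices.

```python
import math
import random
import statistics

random.seed(2)
C, n = 3, 40_000
mus = [0.0, 1.0, 2.0]      # illustrative means mu_c of the Gaussian f_c
lams = [1.0, 2.0, 4.0]     # illustrative precisions Lambda_c
lam_sum = sum(lams)

# T = infinity combination: x^(c) ~ N(mu_c, 1/Lambda_c), then the
# precision-weighted average y = (sum Lambda_c)^{-1} sum Lambda_c x^(c).
ys = []
for _ in range(n):
    xs = [random.gauss(m, 1.0 / math.sqrt(l)) for m, l in zip(mus, lams)]
    ys.append(sum(l * x for l, x in zip(lams, xs)) / lam_sum)

# Fusion target: N(mu_star, 1/lam_sum) with precision-weighted mean.
mu_star = sum(l * m for l, m in zip(lams, mus)) / lam_sum  # = 10/7 here
```

The sample mean and variance of `ys` match the target moments $10/7$ and $1/7$, confirming that in the Gaussian case the T = ∞ fusion draw is exact.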

Acknowledgements

The authors would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme ‘Scalable inference; statistical, algorithmic, computational aspects (SIN)’ when work on this paper was undertaken. MP and GOR were supported by the EPSRC under grant EP/K014463/1. GOR was additionally supported by the EPSRC under grants EP/K034154/1 and EP/D002060/1.

References

Agarwal, A. and Duchi, J. C. (2012). Distributed delayed stochastic optimization. In 51st IEEE Conference on Decision and Control, pp. 5451–5452.
Berger, J. O. (1980). Statistical Decision Theory and Bayesian Analysis. Springer, New York.
Beskos, A. and Roberts, G. O. (2005). Exact simulation of diffusions. Ann. Appl. Prob. 15, 2422–2444.
Beskos, A., Papaspiliopoulos, O. and Roberts, G. O. (2006b). Retrospective exact simulation of diffusion sample paths with applications. Bernoulli 12, 1077–1098.
Beskos, A., Papaspiliopoulos, O. and Roberts, G. O. (2008). A factorisation of diffusion measure and finite sample path constructions. Methodology Comput. Appl. Prob. 10, 85–104.
Beskos, A., Papaspiliopoulos, O., Roberts, G. O. and Fearnhead, P. (2006a). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). J. R. Statist. Soc. B 68, 333–382.
Chen, N. and Huang, Z. (2013). Localisation and exact simulation of Brownian motion-driven stochastic differential equations. Math. Operat. Res. 38, 591–616.
Dacunha-Castelle, D. and Florens-Zmirou, D. (1986). Estimation of the coefficients of a diffusion from discrete observations. Stochastics 19, 263–284.
Dai, H. (2014). Exact simulation for diffusion bridges: an adaptive approach. J. Appl. Prob. 51, 346–358.
Dai, H. (2017). A new rejection sampling method without using hat function. Bernoulli 23, 2434–2465.
Dai, H., Pollock, M. and Roberts, G. (2019). Monte Carlo fusion. Supplementary material. Available at http://doi.org/10.1017/jpr.2019.12.
Fleiss, J. L. (1993). Statistical basis of meta-analysis. Statist. Methods Med. Res. 2, 121–145.
Genest, C. and Zidek, J. V. (1986). Combining probability distributions: a critique and an annotated bibliography. Statist. Sci. 1, 114–148.
Hansen, N. R. (2003). Geometric ergodicity of discrete-time approximations to multivariate diffusions. Bernoulli 9, 725–743.
Li, C., Srivastava, S. and Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104, 665–680.
Li, Y., Yang, M. and Zhang, Z. (2015). Multi-view representation learning: a survey from shallow methods to deep methods. J. Latex Class Files 14, 20pp.
Masuda, H. (2004). On multidimensional Ornstein–Uhlenbeck processes driven by a general Lévy process. Bernoulli 10, 97–120.
Minsker, S., Srivastava, S., Lin, L. and Dunson, D. B. (2014). Scalable and robust Bayesian inference via the median posterior. In Proc. 31st Internat. Conference on Machine Learning, pp. 1656–1664.
Neiswanger, W., Wang, C. and Xing, E. P. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proc. 30th Conference on Uncertainty in Artificial Intelligence, pp. 623–632.
Pollock, M. (2013). Some Monte Carlo methods for jump diffusions. Doctoral thesis, University of Warwick.
Pollock, M., Johansen, A. M. and Roberts, G. O. (2016b). On the exact and ε-strong simulation of (jump) diffusions. Bernoulli 22, 794–856.
Pollock, M., Fearnhead, P., Johansen, A. M. and Roberts, G. O. (2016a). The scalable Langevin exact algorithm: Bayesian inference for big data. Submitted to J. R. Statist. Soc. B.
Scott, S. L. et al. (2016). Bayes and big data: the consensus Monte Carlo algorithm. Internat. J. Manag. Sci. Eng. Manag. 11, 78–88.
Smith, T. C., Spiegelhalter, D. J. and Thomas, A. (1995). Bayesian approaches to random-effects meta-analysis: a comparative study. Statist. Med. 14, 2685–2699.
Srivastava, S., Cevher, V., Dinh, Q. and Dunson, D. (2016). WASP: scalable Bayes via barycenters of subset posteriors. In Proc. 18th Internat. Conference on Artificial Intelligence and Statistics, pp. 912–920.
Stamatakis, A. and Aberer, A. J. (2013). Novel parallelization schemes for large-scale likelihood-based phylogenetic inference. In Proc. 2013 IEEE 27th Internat. Symposium on Parallel and Distributed Processing, pp. 1195–1204.
Tan, A., Doss, H. and Hobert, J. P. (2015). Honest importance sampling with multiple Markov chains. J. Comput. Graph. Statist. 24, 792–826.
Wang, X. and Dunson, D. B. (2014). Parallelizing MCMC via Weierstrass sampler. Preprint. Available at https://arxiv.org/abs/1312.4605v2.
Zhao, J., Xie, X., Xu, X. and Sun, S. (2017). Multi-view learning overview: recent progress and new challenges. Inf. Fusion 38, 43–54.
Figure 1: Kernel density fitting with bandwidth 0.25 for the density proportional to ${\rm e}^{-x^4/2}$, based on different MC methods: the standard exact MC (simulation 1, solid curve); Algorithm 1 (simulation 2, dash–dot curve); Algorithm 2 (simulation 3, dotted curve); the CMC algorithm (simulation 4, dashed curve).

Table 1: System running times in seconds for simulating 10 000 realisations.