1. Introduction
Consider the following compound result about asymptotic statistical inference. A community of Bayesian investigators who begin an investigation with conflicting opinions about a common family of statistical hypotheses use shared evidence to achieve a consensus about which hypothesis is the true one. Specifically, suppose the investigators agree on a partition of statistical hypotheses and share observations of an increasing sequence of random samples with respect to whichever is the true statistical hypothesis from this partition.Footnote 1 Then, under various combinations of formal conditions that we review in this essay, ex ante (i.e., before accepting the new evidence) it is practically certain that each of the investigators’ conditional probabilities approach 1 for the one true hypothesis in the partition.
The result is compound: first, individual investigators achieve asymptotic certainty about the unknown, true statistical hypothesis; second, the shared evidence leads to a consensus among the different investigators' individual degrees of belief. The initial disagreements, the disparate initial credences about different hypotheses, are resolved with increasing shared evidence. Stated in more familiar Bayesian terms, it is practically certain that the likelihood function based on the shared statistical evidence swamps differences in initial prior credences to produce a consensus among posterior credences.
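As a concrete illustration of this swamping (ours, not part of the original results), the following minimal simulation sketch has two agents with conflicting priors over an assumed three-point partition of Bernoulli hypotheses update on a shared sequence of flips; the partition, priors, true parameter, and sample sizes are illustrative assumptions.

```python
# A minimal sketch of certainty-plus-consensus: two Bayesian agents with
# conflicting priors over a common finite partition of Bernoulli hypotheses
# update on shared coin-flip data. The hypotheses, priors, true parameter,
# and sample sizes below are illustrative assumptions.
import random

hypotheses = [0.2, 0.5, 0.8]      # assumed finite partition of chances of heads
prior_1 = [0.8, 0.1, 0.1]         # agent 1 initially doubts the true value
prior_2 = [0.1, 0.1, 0.8]         # agent 2 holds a conflicting initial opinion
true_theta = 0.5                  # the "true" hypothesis generating the data

def update(prior, flip):
    """One step of Bayes's rule on a single flip (1 = heads, 0 = tails)."""
    likes = [t if flip == 1 else 1 - t for t in hypotheses]
    post = [p * l for p, l in zip(prior, likes)]
    norm = sum(post)
    return [p / norm for p in post]

random.seed(0)
p1, p2 = prior_1, prior_2
for n in range(1, 2001):
    flip = 1 if random.random() < true_theta else 0
    p1, p2 = update(p1, flip), update(p2, flip)
    if n in (10, 100, 2000):
        gap = max(abs(a - b) for a, b in zip(p1, p2))
        print(f"n={n}: P1(0.5)={p1[1]:.3f}, P2(0.5)={p2[1]:.3f}, max gap={gap:.3f}")
```

Both posteriors concentrate on the true hypothesis and their maximal disagreement shrinks, which is the compound certainty-and-consensus phenomenon at issue.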
The strategy to use asymptotics of Bayesian inference to defend against charges of excessive subjectivity is highlighted in the seminal work of Savage (1954, secs. 3.6 and 4.6) and Edwards, Lindman, and Savage (1963). Savage's (1954) results apply to a finite set of investigators who hold nonextreme views over a common finite partition of statistical hypotheses.Footnote 2 He establishes that—using a (finitely additive) weak law of large numbers—given increasing statistical evidence from a sequence of random samples, with probability approaching 1, different nonextreme personalists' conditional probabilities become ever more concentrated on the same one true statistical hypothesis from among a finite partition of rival statistical hypotheses.Footnote 3 To repeat, the result is compound. It addresses both issues of certainty and consensus among finitely many investigators over a finite partition of statistical hypotheses, assuming they share an increasing sequence of observations from random sampling.Footnote 4
Savage offers these findings as a partial defense against the accusation, voiced by frequentist statisticians of the time, that the theory of (Bayesian) personalist statistics is fraught with subjectivism and cannot serve the methodological needs of the scientific community, where objectivity is required. The central theme in Savage’s response is to understand ‘objectivity’ in terms of shared agreements about the truth, particularly when the shared agreements arise from shared statistical evidence. In summary, Savage provides sufficient conditions for when Bayesian methodology makes it ex ante almost certain that shared evidence secures this kind of objectivity for a well-defined community of investigators.
Savage (1954, 50) notes that his result about asymptotic certainty can be extended in several ways, by adapting the central limit theorem, the strong law of large numbers, and the law of the iterated logarithm to sequences of conditional probabilities generated by an increasing sequence of random samples. The last two of these laws require stronger assumptions than are needed for the finitely additive weak-law convergence result that Savage presents. Specifically, these stronger results require the assumption that (conditional) probabilities are countably additive.
Savage’s twin results have been strengthened also to include shared evidence from nonrandom samples. Consider an uncountably infinite probability space generated by increasing finite sequences of observable random variables, not necessarily forming a random sample with respect to a statistical hypothesis of interest. Rather than requiring that different agents hold nonextreme views about all possible events in the space of observables, which is mathematically impossible with real-valued probabilities once the space is uncountable, instead require that they agree with each other about which events in this uncountably infinite space of observables have probability 0. They share in a family of mutually absolutely continuous probability distributions. If the agents’ personal probabilities over these infinite spaces also are countably additive, then strong-law convergence theorems yield strengthened results about asymptotic consensus (see, e.g., Blackwell and Dubins 1962) and also about asymptotic certainty for events defined in the space of sequences of increasing shared evidence. We discuss several of these results in section 4. There we use considerations both of certainty and consensus to explicate epistemic modesty within a Bayesian framework that contrasts with a critical assessment of Bayesian theory offered by Belot (2013), whose work we next consider.
2. Orgulity as Identified by Comparing Meager Sets versus Null Sets
In a 2013 paper in this Journal that is critical of the methodological significance of some strengthened versions of Savage’s convergence result for asymptotic certainty, Belot arrives at a harsh conclusion: “The truth concerning Bayesian convergence-to-the-truth results is significantly worse than has been generally allowed—they constitute a real liability for Bayesianism by forbidding a reasonable epistemological modesty” (2013, 502). Below, we argue that this verdict is misguided. The criteria for reasonable epistemic modesty that we understand to underpin Belot’s analysis are self-defeating; hence, his argument is not compelling. When the criteria that we attribute to Belot are satisfied, they induce unreasonable epistemic apriorism regarding, for example, how sequences of observed relative frequencies behave.
What makes a (coherent) Bayesian credal state overconfident and lacking in epistemological modesty? Does the Bayesian position generally forbid “a reasonable epistemological modesty,” as Belot intimates? These questions are both interesting and imprecise. There is no doubting that the standard of mere Bayesian coherence for a credal state, as formalized in de Finetti’s (1937/1964) theory, falls short of characterizing the set of reasonable credal states. To use an old and tired example, a person who thinks each morning that it is highly probable that the world ends later that afternoon does not thereby violate the technical norms of coherence.
In order to identify a brand of unreasonableness captured in overconfident, epistemologically immodest credal states, Belot supplements Bayesian coherence with a topological standard for respecting what he calls a typical event: He defines a typical event as a topologically large event. When a coherent agent assigns probability 0 to a topologically large set, specifically when a probability null set is comeager, Belot thinks that is a warning sign of epistemological immodesty.Footnote 5 Such a Bayesian agent is practically certain that the topologically typical event does not occur. And then Bayesian conditioning (almost surely) preserves that certainty in the face of new evidence. So, the Bayesian agent is not open-minded because, in dismissing as probabilistically negligible a topologically typical event E, (almost surely) she is aware ex ante that Bayesian conditioning precludes learning that the typical event E occurs.
We understand Belot’s criticism (2013, sec. 4) to be that Bayesian convergence-to-the-truth results about hypotheses that are formulated in terms of sets of observable sequences run afoul of this concern about typical events. The strengthened convergence results allow the Bayesian agent to dismiss (ex ante) a probabilistically negligible set of sequences of observations where the convergence to the truth fails. This set has “prior” probability 0. Except, Belot complains, that failure set may be comeager in the usual topology for the sequences of observables. Hence, the failure set may be a typical event in the space of observables, about which a modest investigator should keep an open mind. But, instead, Bayes updating (almost surely) ignores these typical events by continuing to assign them probability 0, even as the evidence grows. Thus, the strengthened asymptotic certainty results that Belot criticizes do not conform to the topological standards of epistemic modesty in the sense of modesty that we understand he advocates.
Although he does not explicitly formulate criteria for immodesty, on the basis of the examples and analysis he offers, we understand Belot’s primary requirements to be these two:Footnote 6
Topological Condition 1: Do not assign probability 1 to a meager set of observables.
Also, we find that Belot argues for a more demanding standard,Footnote 7
Topological Condition 2: Assign probability 0 to each hypothesis that is a meager set in the space of sequences of observables.
Ordinary statistical models violate topological condition 1 by their unconditional probabilities, independent of whether learning is by Bayesian updating. Already, condition 1 is inconsistent with the strong laws of large numbers, including the ergodic theorem, which are asymptotic results for unconditional probabilities (see Oxtoby 1980, 85).
Here we show that topological condition 2 entails a radical probabilistic apriorism toward observed relative frequencies that has little to do with questions about Bayesian overconfidence. In particular, this topological standard requires that with probability 1, relative frequencies for an arbitrary sequence of (logically independent) events oscillate maximally. From a Bayesian point of view, almost surely new evidence leaves this extreme epistemic attitude wholly unmodified. A Bayesian agent whose credal state conforms to condition 2 knows ex ante that she is practically certain never to change her mind that the relative frequencies for a sequence of events oscillate maximally. In this sense, we find that conditions 1 and 2 are self-defeating through a lack of humility. They promote excessive apriorism with respect to ordinary properties of limiting frequencies.
The Bayesian convergence-to-the-truth results that are the subject of Belot’s complaints are formulated as probability strong laws that hold almost surely or almost everywhere. In order to make clear why we think Belot is mistaken in his verdict that these results about convergence to the truth are a liability for Bayesian theory, revisit the familiar instance of the strong law of large numbers, as reported in footnote 4.
Let 〈Ω, ℬ, P〉 be the countably additive measure space generated by all finite sequences of repeated, probabilistically independent (iid) flips of a “fair” coin. Let 1 denote a “heads” outcome and 0 a “tails” outcome for each flip. Then a point x of Ω is a denumerable sequence of zeroes and ones, $x = \langle x_1, x_2, \ldots \rangle$, with each $x_n \in \{0, 1\}$ for $n = 1, 2, \ldots$. Let $X_n$ designate the random variable corresponding to the outcome of the nth flip of the fair coin. The Borel σ-algebra ℬ is generated by rectangular events, those determined by specifying values for finitely many coordinates in Ω. The countably additive iid product fair-coin probability P is determined by

$$P(X_n = 1) = P(X_n = 0) = 1/2 \quad \text{for each } n = 1, 2, \ldots,$$

and where each finite sequence of length n is equally probable,

$$P(X_1 = x_1, \ldots, X_n = x_n) = 2^{-n}.$$
Let $L_{1/2}$ be the set of infinite sequences of 0s and 1s with limiting frequency 1/2 for each of the two digits: a set belonging to ℬ. Specifically, let $L_{1/2} = \{x : \lim_{n\to\infty} (1/n)\sum_{i=1}^{n} x_i = 1/2\}$. Then $L_{1/2} \in \mathcal{B}$. The strong law of large numbers asserts that $P(L_{1/2}) = 1$. What is excused with the strong law, what is assigned probability 0, is the null set $N = L_{1/2}^{c}$, consisting of the complement to $L_{1/2}$ among all denumerable sequences of 0s and 1s.
The null set N is large, both in cardinality and in category under the product topology for $2^\omega$. It is a set with cardinality equal to the cardinality of its complement, the continuum.Footnote 8 When $2^\omega$ is equipped with the infinite product of the discrete topology on {0, 1},Footnote 9 then the null set N is topologically large. Set N is comeager (Oxtoby 1980, 85).Footnote 10 That is, the set $L_{1/2}$ is meager and so is judged topologically “small,” or atypical. By condition 1, a Bayesian who adopts the fair-coin model for her credences is epistemologically immodest with respect to denumerable sequences of possible coin flips: the space of sequences of observations that drive the asymptotic certainty result.
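The probabilistic side of this tension is easy to display numerically. The sketch below (our illustration; the path count and lengths are arbitrary) simulates fair-coin sample paths, whose running relative frequencies settle near 1/2, the P-typical but topologically atypical behavior.

```python
# Sketch of the measure-versus-category tension: simulated fair-coin paths
# have running relative frequencies near 1/2 (P-typical behavior), even
# though the limiting-frequency-1/2 set L_1/2 is topologically meager.
import random

random.seed(1)
for path in range(3):
    heads = 0
    for n in range(1, 100001):
        heads += random.randint(0, 1)
        if n in (100, 10000, 100000):
            print(f"path {path}, n={n}: relative frequency = {heads / n:.4f}")
```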
This strong-law counterexample to condition 1 should come as no surprise in the light of the following result:
Oxtoby (1980, theorem 1.6): Each nonempty interval on the real line may be partitioned into two sets, {N, M}, where N is a Lebesgue measure null set and M, its complement, is a meager set.
Oxtoby generalizes this result with his theorem 16.5.Footnote 11 In his illustration of theorem 16.5 using the strong law of large numbers, the binary partition $\{N, L_{1/2}\}$ displays the direct conflict between the measure-theoretic and topological senses of small. Under the fair-coin model, N has probability 0, and $L_{1/2}$ is a meager set in the product topology of the discrete topology on {0,1}. The tension between the two senses of small is not over some esoteric binary partition of the space of binary sequences but applies to the event that the sequence of observed outcomes has a limiting frequency 1/2.
We exemplify the general conflict encapsulated in Oxtoby’s theorem 16.5 with the following claim, which we use to criticize condition 2. Consider the space $2^\omega$, with points the denumerable sequences of zeroes and ones, equipped with the infinite product of the discrete topology on {0,1}. Define the set of sequences $L_{\langle 0,1\rangle}$ consisting of those points x whose relative frequency does not oscillate maximally, that is, where

$$\liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} x_i > 0 \quad \text{or} \quad \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} x_i < 1.$$

The complement to $L_{\langle 0,1\rangle}$, $O_M = L_{\langle 0,1\rangle}^{c}$, is the set of binary sequences whose observed relative frequencies oscillate maximally.
Proposition 1. $L_{\langle 0,1\rangle}$ is a meager set; that is, $O_M$ is a comeager set.Footnote 12
Theorem A1 of the appendix establishes that sequences of logically independent random variables that oscillate maximally are comeager with respect to infinite product topologies on the sequence of random variables. Proposition 1 is a corollary to theorem A1 applied to binary sequences, that is, where there are only two categories for observables.
What proposition 1 establishes is that only extreme probability models of relative frequencies satisfy topological condition 2. That is, consider a measure space $\langle 2^\omega, \mathcal{B}, P\rangle$, where ℬ includes the Borel sets from $2^\omega$ and where $2^\omega$ is equipped with the infinite product of the discrete topology as above. Each probability with $P(O_M) < 1$ produces a nonnull set, $L_{\langle 0,1\rangle}$, that is meager.

Unless a probability model P for a sequence of relative frequencies assigns probability 1 to the set of sequences of observed frequencies that oscillate maximally, P assigns positive probability to a meager set of sequences, in violation of condition 2. Evidently, the standard for epistemological modesty formalized in topological condition 2, which requires meager sets of relevant events to be assigned probability 0, itself leads to probabilistic orgulity because it requires an unreasonable a priori opinion about how observed relative frequencies behave. Let P satisfy condition 2. Given evidence of a P-non-null observation o of observed relative frequencies, the resulting conditional probability leaves this extreme a priori opinion unchanged: $P(O_M \mid o) = 1$.
Familiar Bayesian models also violate the weaker topological condition 1. Consider an exchangeable probability model over $2^\omega$. Then, by de Finetti’s (1937/1964) theorem, each exchangeable probability assigns probability 1 to the set L of sequences with well-defined limiting frequencies for 0s and 1s. That is, $P(L) = 1$. But L is a subset of $L_{\langle 0,1\rangle}$; hence, L is a meager set.
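A simulation sketch of the point, with an assumed Beta(2,2) mixing measure standing in for the exchangeable model: each simulated path’s relative frequency converges to its sampled chance, so the model concentrates on the meager set L.

```python
# Sketch of de Finetti's representation for an exchangeable model: draw a
# chance theta from an assumed Beta(2,2) mixing measure, then flip iid coins
# with bias theta. Relative frequencies converge (here, to theta), so the
# exchangeable P assigns probability 1 to the meager set L.
import random

random.seed(2)
for path in range(3):
    theta = random.betavariate(2, 2)
    heads = sum(1 for _ in range(100000) if random.random() < theta)
    print(f"theta = {theta:.4f}, frequency after 100000 flips = {heads / 100000:.4f}")
```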
In summary, our understanding is that Belot applies topological conditions 1 and 2 in order to identify an epistemically immodest coherent credal state. We find that each of these two conditions is excessively restrictive and is self-defeating as a criterion for epistemic immodesty. The credences that satisfy these conditions with respect to the sets of sequences of observables that ground the almost-sure Bayesian convergence results embed extreme a priorism about, for example, the limiting behavior of observed relative frequencies.
In section 4, we argue that a better account of Bayesian epistemological modesty/immodesty uses interpersonal standards for asymptotic consensus within a community of investigators about the set of certainties that arise from an idealized sequence of observations. Belot’s approach for identifying epistemic immodesty applies topological conditions of adequacy to a standalone credence function and avoids issues of consensus. In contrast, we supplement coherence with criteria involving asymptotic consensus among a community of investigators about which certainties they might acquire based on a sequence of shared evidence.
3. But What If Probability Is Merely Finitely Additive and Not Countably Additive?
Elga (2016) responds to Belot’s criticism by focusing on the premise of countable additivity for probability, which is needed for the strong-law versions of Savage’s convergence result. The subjective theory of probability, especially as promoted by Savage (1954), Dubins and Savage (1965), and de Finetti (1974), does not mandate countable additivity for credences. This added generality is of importance for contemporary Bayesian practice, as argued in Kadane, Schervish, and Seidenfeld (1986).
As we understand Elga’s response to Belot’s criticism, it is based on an example. The example purportedly shows how, using a finitely but not countably additive probability P, Belot’s standard for being an open-minded Bayesian credal state may be satisfied without also being burdened with the immodesty of treating a comeager failure set as a P-null set, as happens when probability is countably additive. Elga argues that, in his example, the associated set of data sequences where the convergence-to-the-truth result fails with the credal state P has positive P-probability, contrary to what happens in the countably additive case. Elga asserts that in his example, the agent’s finitely additive conditional probabilities do not (almost surely) converge to the true statistical hypothesis about limiting relative frequencies; hence, such a Bayesian agent escapes Belot’s criticism, as this agent is epistemologically humble about becoming certain of the true limiting relative frequency in the observed sequence.
First and foremost, we dispute Elga’s analysis of the specific example he offers. We argue that, contrary to Elga’s assessment, his merely finitely additive probability model P satisfies a finitely additive convergence-to-the-truth theorem that is needed to defend Bayesian learning. The Bayesian agent of Elga’s example is not humble about whether (with increasing probability) she will achieve asymptotic certainty for the limiting frequency hypothesis in question: at each stage of her investigation, looking forward, she remains practically certain that her posterior probability will converge to the true limiting frequency hypothesis.
Second, the credal state P in Elga’s example fails what we call Belot’s condition 1. State P assigns probability 1 to a meager set of sequences of observations. Hence, although Elga argues that P is modest with respect to one limiting frequency hypothesis, according to condition 1 P is immodest for a different but related hypothesis about the existence of well-defined limiting frequencies.
Nonetheless, we agree with Elga (and with others who have argued the same point previously) that finitely but not countably additive probability models allow failures of the strengthened convergence results. We illustrate this point using a finitely additive probability, P′, that is a simple variant of Elga’s model P. But in our judgment, this phenomenon—where a finitely additive model P′ fails the strengthened convergence-to-the-truth result—does not provide a satisfactory rebuttal to Belot’s criticism. Belot’s criticism, which is directed at countably additive credences, is that they display Bayesian orgulity. To argue that, on the contrary, the finitely additive probability P′ assigns positive probability to a set of sequences where convergence to the truth fails does not show that such a merely finitely additive probability is reasonable.
According to the rival standards for epistemic modesty that we offer in section 4, such a finitely additive probability P′ is unreasonable on two counts simultaneously: the Bayesian agent with credence P′ knows in advance that each data sequence that might be observed will fail to induce certainty, both in the short term and in the limit. Also, P′ fails the test for reasonableness based on consensus. That is, the agent with credences fixed by P′ does not reach consensus with other members of a community of investigators who use countably additive credences and agree with P′ about which (finite) sequences of observables are probability-0 events. But the others reach consensus among themselves.
For a detailed discussion of Elga’s example, begin with a review of some relevant mathematical considerations. When probability P is defined for a measurable space, the principle of countable additivity has an equivalent form as a principle of Continuity. Let $\{E_n : n = 1, 2, \ldots\}$ be a monotone sequence of (measurable) events, where $E = \lim_{n\to\infty} E_n$ is also a (measurable) event.

Continuity: $\lim_{n\to\infty} P(E_n) = P(E)$.
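For readers who want the equivalence spelled out, here is a short derivation sketch; it assumes only finite additivity plus the displayed principles.

```latex
% Sketch: given finite additivity, countable additivity is equivalent to
% Continuity. For an increasing sequence E_1 \subseteq E_2 \subseteq \cdots
% with E = \bigcup_n E_n, decompose E into disjoint pieces and telescope:
\begin{align*}
P(E) &= P(E_1) + \sum_{n \ge 2} P(E_n \setminus E_{n-1})
        && \text{(countable additivity)} \\
     &= \lim_{n\to\infty}\Big[P(E_1) + \sum_{m=2}^{n}\big(P(E_m) - P(E_{m-1})\big)\Big]
      = \lim_{n\to\infty} P(E_n). && \text{(finite additivity)}
\end{align*}
% Conversely, applying Continuity to the increasing partial unions
% E_n = A_1 \cup \cdots \cup A_n of pairwise disjoint events A_1, A_2, \ldots
% recovers P(\bigcup_n A_n) = \sum_n P(A_n).
```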
When probabilities satisfy Continuity, the probabilities for a class of events $\mathcal{F}$ that form a field also determine uniquely the probabilities for the smallest σ-field generated by $\mathcal{F}$ (see Halmos 1950, theorem 13A). And if an event H belongs to that σ-field, then H can be approximated in probability by events from the field $\mathcal{F}$. Specifically, for each $\varepsilon > 0$ there exists an event $E \in \mathcal{F}$ such that $P(H \,\triangle\, E) < \varepsilon$ (see Halmos 1950, theorem 13D). This result has important consequences when H is a tail-field event in $2^\omega$.Footnote 13
Consider the countably additive probability P for iid flips of a fair coin and, for example, the tail-field event $L_{1/2}$ in $2^\omega$. Then, $L_{1/2}$ can be approximated ever more precisely in probability by a sequence of finite-dimensional events $\{E_n\}$, each of which is determined by a finite number of coordinates from the set of denumerable binary sequences, $2^\omega$. Choose a sequence $\{E_n\}$ with $\lim_{n\to\infty} P(L_{1/2} \,\triangle\, E_n) = 0$. That is, for each $n = 1, 2, \ldots$, each $E_n$ depends on only finitely many coordinates from $2^\omega$. With P the product measure for iid fair-coin flips and $L_{1/2}$ the tail-field event that is to be approximated, the finite-dimensional events $E_n$ may be chosen as the sets of sequences with relative frequency of ones sufficiently close to 1/2 through the first n trials. However, when Continuity fails, and P is merely finitely additive but not countably additive, then the probabilities over a field $\mathcal{F}$ may fail to determine the probabilities over the smallest σ-field generated by $\mathcal{F}$.
For example, pick two values $0 \le p \ne q \le 1$. A coherent, merely finitely additive probability $P_{p,q}$ on $2^\omega$ may assign values to each finite-dimensional event according to iid trials with constant Bernoulli probability p but assign probabilities to the tail-field events according to iid trials with constant Bernoulli probability q. Then, the strong law of large numbers does not entail the weak law of large numbers with the same values: while finite sequences of zeroes and ones follow an iid Bernoulli-p product law, with $P_{p,q}$-probability 1 the limiting relative frequency for ones is q. This phenomenon is at the heart of Elga’s example.
Let P be a merely finitely additive probability on the Borel σ-algebra of $2^\omega$ where $P = (1/2)P_{p,q} + (1/2)P_{q,p}$. Elga considers the case with $p = 9/10$ and $q = 1/10$. This finitely additive probability assigns probability 1/2 to the tail-field event $L_{1/10}$ (the set of sequences with limiting frequency 1/10) and probability 1/2 to the tail-field event $L_{9/10}$ (the set of sequences with limiting frequency 9/10). For $x \in 2^\omega$, let $I_{L_{1/10}}(x)$ be the indicator function for the event $L_{1/10}$ and $I_{L_{9/10}}(x)$ the indicator function for the event $L_{9/10}$. So, $P(I_{L_{1/10}} + I_{L_{9/10}} = 1) = 1$. Thus, we see from proposition 1 that Elga’s example stands in violation of topological condition 1, since with P-probability 1 the sequence of coin flips has a convergent limiting relative frequency. These sequences form a meager set among the set of all binary sequences.
Elga asserts that the conditional probabilities associated with the (merely) finitely additive P-distribution fail the almost-sure strong-law convergence result. Here is the argument he offers for that conclusion. Let x be an element of the set $L_{1/10}$, a sequence with limiting relative frequency 1/10, which is practically certain to occur according to the P-distribution on sequences if and only if the $P_{9/10,\,1/10}$ coin is flipped. (Otherwise, with P-probability 1, a sequence x has a limiting relative frequency 9/10, since it is then following a $P_{1/10,\,9/10}$ law.) Then, for each $\varepsilon > 0$ there exists an integer $n_\varepsilon$ such that for each $n \ge n_\varepsilon$, the observed sequence $\{X_1, \ldots, X_n\}$ has a relative frequency of ones close enough to 1/10 so that the posterior probability satisfies $P(L_{9/10} \mid X_1, \ldots, X_n) > 1 - \varepsilon$.
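A numerical sketch of this step may help. It assumes, per the construction above, that the finite-dimensional law of the $P_{9/10,\,1/10}$ coin is iid Bernoulli-9/10, that of the $P_{1/10,\,9/10}$ coin is iid Bernoulli-1/10, and that each coin carries prior weight 1/2.

```python
# Sketch of Elga's posterior computation: the posterior over the two "coins"
# is driven by their finite-dimensional iid Bernoulli laws. A history whose
# relative frequency sits near 1/10 favors the Bernoulli-1/10 finite-
# dimensional law, i.e., the P_{1/10,9/10} coin, whose tail event is L_{9/10}.
from math import exp, log

def posterior_L910(num_heads, n):
    """P(L_9/10 | h_n) from equal prior weights on the two coins (log-space)."""
    ll_tail910 = num_heads * log(1 / 10) + (n - num_heads) * log(9 / 10)
    ll_tail110 = num_heads * log(9 / 10) + (n - num_heads) * log(1 / 10)
    return 1.0 / (1.0 + exp(ll_tail110 - ll_tail910))

print(posterior_L910(10, 100))   # frequency 1/10: posterior ~1 for L_9/10
print(posterior_L910(90, 100))   # frequency 9/10: posterior ~0, i.e., L_1/10
```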
This conditional probability assigns high probability to the event $L_{9/10}$, that the limiting frequency of the sequence is 9/10, even though the sequence that generates the observations in fact has limiting frequency 1/10. In this sense, the sequence of conditional probabilities generated by x (an element of the set $L_{1/10}$) converges to the wrong tail-field event, $L_{9/10}$. Likewise, the convergence is to the wrong tail-field event, $L_{1/10}$, when the sequence is generated by an element of the set $L_{9/10}$. Elga concludes that conditional probabilities from this merely finitely additive P-model do not satisfy the (almost-sure) strong-law convergence-to-the-truth results. Then, regarding either tail-field event $L_{1/10}$ or $L_{9/10}$, the agent with conditional credences fixed by probability P is both open-minded and modest.Footnote 14 But this analysis is misleading regarding convergence to the truth because it conditions on P-null events, as we now explain.
Define the denumerable set of countably additive probabilities $\{P_n\}$ on $2^\omega$ so that $P_n$ is the iid product of a Bernoulli-p probability for the first n coordinates and is the iid product of a Bernoulli-q probability for all coordinates beginning with the $(n+1)$st position. Each $P_n$ is a countably additive probability on the measurable space $\langle 2^\omega, \mathcal{B}\rangle$. Distribution $P_n$ has a change point after the nth trial. Let the change point, N, be chosen according to a purely finitely additive probability, with $P(N = n) = 0$ for $n = 1, 2, \ldots$. Finally, let P be the induced (marginal) unconditional probability on the Borel σ-algebra of sequences of coin flips, $\langle 2^\omega, \mathcal{B}\rangle$.

As required for Elga’s construction, this finitely additive probability P behaves as $P_{p,q}$. Its distribution is the iid product of a Bernoulli-p distribution on finite-dimensional sets and is the iid product of a Bernoulli-q distribution on the tail-field events.Footnote 15 Probability P satisfies the weak law of large numbers over finite sequences with Bernoulli parameter p and satisfies the strong law of large numbers on the tail field with Bernoulli parameter q. Hence, the strong law does not entail the weak law with the same parameter value.
Given an observed history, $h_j = \{X_1 = x_1, \ldots, X_j = x_j\}$, the Bayesian agent in Elga’s example assigns a purely finitely additive conditional probability to the distribution of the change point (N) so that, with conditional probability 1, the change point is arbitrarily far off in the future: for each finite history $h_j$ and for each $n = 1, 2, \ldots$, $P(N > j + n \mid h_j) = 1$. An agent who uses Elga’s finitely additive P-model precludes learning about the change point variable, N. That agent is closed-minded in the relevant sense that, no matter what she observes, she is certain that the change point lies in the yet-to-be-observed future.
So, whenever the agent observes a finite history of coin flips with observed relative frequency of heads near to 9/10, she has high posterior probability for the tail-field event $L_{1/10}$. Likewise, whenever the agent sees a finite history of coin flips with observed relative frequency of heads near to 1/10, she has high posterior probability for the tail-field event $L_{9/10}$. And since this agent is always sure, given each finite history $h_j$, that the change point (N) is in the distant future of the sequence of coin flips, she always assigns arbitrarily high posterior probability to correctly identifying the tail-field event between $L_{1/10}$ and $L_{9/10}$.
For example, this agent assigns probability near 1 to observing indefinitely long finite histories that have observed relative frequencies that linger near 9/10 exactly when the sequence x has a limiting relative frequency of 1/10. This finitely additive credal state satisfies the conclusion of the finitely additive almost-sure convergence-to-the-truth result: almost surely, given the observed histories from a sequence x, the conditional probabilities converge to the correct indicator for the tail behavior of the relative frequencies in x.
Elga’s analysis to the contrary is based on having the agent consider conditional probabilities at histories $h_n$ that run beyond the change point. But with Elga’s finitely additive probability P-model, the agent’s credence is 0 of ever witnessing such a history. That is, Elga’s argument, whose conclusion is that the agent’s conditional probabilities converge to the wrong indicator function, requires the agent to condition on an event of P-probability 0 (i.e., that she has made finitely many observations that go past the change point in the sequence). But, at each finite stage in the history of observations, this event is part of a P-null event where a failure of the (finitely additive) almost-sure convergence to the truth is excused. Where this case differs from the countably additive one is that with the merely finitely additive probability P, the countable union of all these infinitely many P-null events (namely, that the change point has been reached by the kth observation, $k = 1, 2, \ldots$) is a certain event, since the change point is certain to arrive eventually.
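The finitely additive bookkeeping behind this contrast can be displayed in two lines; the display restates the text’s point rather than adding any assumption.

```latex
% Each change-point event {N = k} is P-null, yet their countable union is
% certain, which is exactly the failure of Continuity exploited above:
\begin{align*}
P(N = k) &= 0 \quad \text{for each } k = 1, 2, \ldots, \\
P\Big(\bigcup_{k=1}^{\infty}\{N = k\}\Big) &= 1
  \;>\; \sum_{k=1}^{\infty} P(N = k) = 0.
\end{align*}
% The P-null events where convergence-to-the-truth is excused thus exhaust
% a certain event, something countable additivity forbids.
```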
Apart from this peculiar merely finitely additive credal attitude that precludes learning about the change point N, there is something else unsettling about this Bayesian agent’s finitely additive model for coin flips. Perhaps what follows makes clearer what that problem is. Modify Elga’s model to the finitely additive probability P′ so that

$$P' = \tfrac{1}{2}P_{5/10,\,1/10} + \tfrac{1}{2}P_{5/10,\,9/10},$$
with the change point N chosen, just as before, by a purely finitely additive probability, $P'(N = n) = 0$ for $n = 1, 2, \ldots$. Then the strong-law result applies to tail-field events, and, P′-almost surely, the limiting frequency for heads is either 1/10 or 9/10, just as in Elga’s P-model. However, the two finitely additive coins, $P_{5/10,\,1/10}$ and $P_{5/10,\,9/10}$, assign the same probability to each finite history of coin flips. Letting $h_n$ denote a specific history of length n,

$$P_{5/10,\,1/10}(h_n) = P_{5/10,\,9/10}(h_n) = 2^{-n}.$$
But then

$$P'(L_{1/10} \mid h_n) = P'(L_{9/10} \mid h_n) = 1/2$$
for each possible history. That is, contrary to the strengthened convergence-to-the-truth result, in this modified P′-model, the agent is completely certain that her posterior probability for either of the two tail-field hypotheses, $L_{1/10}$ or $L_{9/10}$, is stationary at the prior value 1/2. Under the growing finite histories from each infinite sequence of coin flips, the posterior probability moves neither toward 0 nor toward 1. Within the P′-model, surely there is no convergence to the truth about these two tail-field events given increasing evidence from coin flipping.Footnote 16
Evidently, one aspect of what is unsettling about these finitely additive coin models is that the observed sequence of flips is entirely uninformative about the change point variable, N. No matter what the observed sequence, the agent’s posterior distribution for N is her prior distribution for N, which is a purely finitely additive distribution assigning 0 probability to each possible integer value for N. It is not merely that this Bayesian agent cannot learn about the value of N from finite histories. Also, two such agents who have finitely additive coin models that disagree only on the tail-field parameter cannot use the shared evidence of the finite histories to induce a consensus about the tail-field events since they are both certain that their shared evidence has yet to cross the change point. In the next section, we use these themes about certainty and consensus based on shared evidence to provide a different answer to Belot’s question about what distinguishes modest from immodest credal states.
4. On Standards for Epistemic Modesty Using Asymptotic Merging and Consensus
Peirce (1877) argues that sound methodology needs to defend a proposal for how to resolve interpersonal differences of scientific opinion. He asserts that the scientific method for resolving such disputes wins over other rivals (e.g., apriorism or the method of tenacity) by having the Truth (i.e., observable Reality) win out—by settling debates through an increasing sequence of observations from well-designed experiments. With due irony, much of Peirce’s proposal for letting Reality settle the intellectual dispute is embodied within personalist Bayesian methodology.Footnote 17 Here, we review some of those Bayesian resources regarding three aspects of immodesty.
One kind of epistemic immodesty is captured in a dogmatic credal state that is immune to revision from the pressures of new observations. Such a credal state is closed-minded. And a closely related second kind of immodesty is that two rival dogmatic positions cannot find a resolution to their epistemic conflicts through shared observations. They are persistent in their closed-mindedness. These two suggest that a credal state can be assessed for epistemic immodesty according to three considerations:
i) how large is the set of conjectures,
ii) how large is the community of rival opinions, and
iii) for which sets of sequences of shared observations does Bayesian conditionalization offer resolution to interpersonal credal conflicts by bringing the different opinions into a consensus regarding the truth.

In other words, qualitative degrees of epistemic immodesty are revealed with these three considerations, which synthesize criteria of asymptotic consensus and certainty. We discuss this sense of “immodesty” in the remainder of this section.
We use as our starting point an important result due to Blackwell and Dubins (1962) about countably additive probabilities. Let $\langle X, \mathcal{B}\rangle$ be a measurable Borel product space with the following structure. Consider a denumerable sequence of sets $X_1, X_2, \ldots$, each with an associated σ-field $\mathcal{B}_i$. Form the infinite Cartesian product $X = \times_{i=1}^{\infty} X_i$ of denumerable sequences $x = \langle x_1, x_2, \ldots\rangle$, where $x_i \in X_i$. That is, each $x_i$ is an atom of its algebra $\mathcal{B}_i$. In the usual fashion, let the measurable sets in ℬ be the σ-field generated by the measurable rectangles.

Definition: A measurable rectangle $B = \times_{i=1}^{\infty} B_i$ is one where each $B_i \in \mathcal{B}_i$ and $B_i = X_i$ for all but finitely many i.
Blackwell and Dubins (1962) consider the idealized setting where two Bayesian agents have this same measurable space of possibilities, each with her own countably additive personal probability, creating the two measure spaces $\langle X, \mathcal{B}, P_1\rangle$ and $\langle X, \mathcal{B}, P_2\rangle$. Suppose that $P_1$ and $P_2$ agree on which measurable events have probability 0 and admit (countably additive) predictive distributions, $P_i(\cdot \mid x_1, \ldots, x_n)$, for each finite history of possible observations.Footnote 18 In order to index how much these two are in probabilistic disagreement, Blackwell and Dubins adopt the total-variation distance. Define

$$\rho_n(x) = \sup_{B \in \mathcal{B}}\big|P_1(B \mid x_1, \ldots, x_n) - P_2(B \mid x_1, \ldots, x_n)\big|.$$
The index ρ is one way to quantify the degree of consensus between the two agents who share the same history of observations, $(x_1, \ldots, x_n)$. This index focuses on the greatest differences between the two agents’ conditional probabilities.
Here is the related strong-law result about asymptotic consensus (Blackwell and Dubins 1962, theorem 2):

Blackwell-Dubins Result. For $i = 1, 2$, $P_i$-almost surely, $\lim_{n\to\infty} \rho_n = 0$.
In words, the two agents are practically certain that with increasing shared evidence their conditional probabilities will merge, in the very strong sense that the greatest differences in their conditional opinions over all measurable events in ℬ will diminish to 0. And they remain practically certain of this future development for each (nonnull) observed history. Thus, this result supports a conclusion about idealized asymptotic consensus from idealized application of the scientific method that Peirce asserted he could not prove but only defend as having no equal.Footnote 19
Since, for each event in the space ℬ, the familiar strong-law convergence-to-the-truth result applies, separately, to each investigator’s opinion, the added feature of merging allows a defense against the charge of individual “immodesty” by showing that two rival opinions come into agreement about the truth, almost surely, in the strong sense provided by the ρ-index. In the setting of Blackwell and Dubins’s (1962) result, almost surely two such investigators agree that they can resolve all conflicts in their credal states over all elements of ℬ, and have their posterior probabilities almost surely concentrate on true hypotheses, by sharing increasing finite histories of observations from a sequence x. Thus, in fine Peircean style, they are not open-minded about the efficacy of the scientific method for creating consensus and certainty.
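The flavor of this merging can be illustrated with a simulation sketch for two Beta-Bernoulli agents. Note one simplification: the ρ-index is a supremum over all measurable events, whereas the sketch tracks only the one-step-ahead predictive difference, a lower bound on ρ. The two priors and the fair data-generating coin are assumptions of ours.

```python
# Sketch of Blackwell-Dubins-style merging for two Beta-Bernoulli agents who
# share evidence. Their predictive probabilities for the next flip, which
# lower-bound the total-variation index rho, come together as data accumulate.
import random

random.seed(3)
a1, b1 = 1.0, 9.0       # agent 1's Beta prior (assumed)
a2, b2 = 9.0, 1.0       # agent 2's conflicting Beta prior (assumed)
heads = tails = 0
for n in range(1, 5001):
    flip = random.randint(0, 1)                 # shared evidence: a fair coin
    heads, tails = heads + flip, tails + (1 - flip)
    pred1 = (a1 + heads) / (a1 + b1 + heads + tails)
    pred2 = (a2 + heads) / (a2 + b2 + heads + tails)
    if n in (1, 10, 100, 5000):
        print(f"n={n}: |pred1 - pred2| = {abs(pred1 - pred2):.5f}")
```

Since both Beta priors give positive density over (0, 1), the two credences are mutually absolutely continuous, as the Blackwell-Dubins conditions require.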
Schervish and Seidenfeld (1990, sec. 3) explore several variations on this theme by enlarging the set of rival credal states in order to consider larger communities than two investigators and by relaxing the sense of merging (or consensus) that is induced by shared evidence from a common measurable space 〈X, ℬ〉. They show that, depending on how large a set of different mutually absolutely continuous probabilities is considered, the character of the asymptotic merging varies. This is where topology plays a useful role in formalizing “immodesty.”
Here, we summarize three of those results. Let ℛ be the set of rival credences that conform, pairwise, to the Blackwell-Dubins conditions above. Consider three increasing classes of such communities.
1. If ℛ is a subset of a convex set of rival credences whose extreme points are compact in the discrete topology, then all of ℛ uniformly satisfies the Blackwell-Dubins merging result. That is, then merging in the sense of ρ occurs simultaneously over all of ℛ.
2. If ℛ is a subset of a convex set of rival credences whose extreme points are compact in the topology induced by ρ, then all that is assured is a weak-law merging. That is, if $\{P_n, Q_n\}$ is an arbitrary sequence of pairs from ℛ and R is an arbitrary credence from the set of rivals, then, for each $\varepsilon > 0$, $\lim_{n\to\infty} R(\{x : \rho_n(P_n, Q_n)(x) > \varepsilon\}) = 0$; that is, the merging holds in R-probability rather than R-almost surely.
3. And if ℛ is a subset of a convex set of rival credences whose extreme points are compact in the weak-star topology, then not even a weak-law merging of the kind reported in class 2 is assured.
It is not surprising, then, that as the community ℛ increases its membership, the kind of consensus that is assured—the version of community-wide probabilistic merging that results from shared evidence—becomes weaker. So, one way to assess the epistemological “immodesty” of a credal state formulated with respect to a measurable space 〈X, ℬ〉 is to identify the breadth of the community ℛ of rival credal states that admits merging through increasing shared evidence from ℬ. For example, the agent who thinks each morning that it is highly probable that the world ends later that afternoon has an immodest attitude because there is only the isolated community of like-minded pessimists who can reconcile their views with commonplace evidence that is shared with the rest of us.
When the different opinions do not satisfy the requirement of mutual absolute continuity, the previous results do not apply directly. Instead, we modify an idea from Levi (1980, sec. 13.5) so that different members of a community of investigators modify their individual credences (using convex combinations of rival credal states) in order to give other views a hearing and, in Peircean fashion, in order to allow increasing shared evidence to resolve those differences.
Let I serve as a finite or countably infinite index set, and let $\mathcal{R} = \{P_i : i \in I\}$ represent a community of investigators, each with her own countably additive credence function $P_i$ on a common measurable space 〈X, ℬ〉. It may be that, pairwise, the elements of ℛ are not even mutually absolutely continuous. In order to allow new evidence to resolve differences among the investigators’ credences for elements of ℬ (rather than trying, e.g., to preserve common judgments of conditional credal independence between pairs of elements of ℬ), each member of ℛ shifts to a credal state by taking a mixture of each of the investigators’ credal states: a “linear pooling” of those states. Specifically, for each $i \in I$, let $\{\alpha_{ij} > 0 : j \in I,\ \sum_{j \in I} \alpha_{ij} = 1\}$ serve as a set of weights that investigator i uses to create the credal state $Q_i = \sum_{j \in I} \alpha_{ij} P_j$ to replace $P_i$. It might be that, for each $i \in I$, $Q_i$ is self-centered in the following sense. Let $0 < \varepsilon < 1/2$. The $Q_i$ might be self-centered in that $\alpha_{ii} \ge 1 - \varepsilon$. Then, pairwise, the $Q_i$ satisfy the assumptions for the Blackwell-Dubins result despite being self-centered. Depending upon the size of the community ℛ, using the replacement credal states $\{Q_i\}$, results 1, 2, and 3 obtain.
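A toy numerical sketch of this repair, on an assumed four-point space with hypothetical weights: two credences that fail mutual absolute continuity become mutually absolutely continuous after self-centered linear pooling.

```python
# Sketch of the Levi-style repair by linear pooling: each investigator
# replaces P_i with a self-centered mixture Q_i with all weights positive,
# so the Q_i share null events and Blackwell-Dubins applies pairwise.
# The four-point space and the weights are illustrative assumptions.

P = [
    [0.5, 0.5, 0.0, 0.0],   # P_1 assigns probability 0 to the last two atoms
    [0.0, 0.0, 0.5, 0.5],   # P_2 assigns probability 0 to the first two atoms
]

def pool(credences, i, self_weight=0.9):
    """Investigator i's self-centered linear pool of all the credences."""
    k = len(credences)
    other = (1.0 - self_weight) / (k - 1)
    weights = [self_weight if j == i else other for j in range(k)]
    return [sum(w * p[atom] for w, p in zip(weights, credences))
            for atom in range(len(credences[0]))]

Q = [pool(P, i) for i in range(len(P))]
print(Q[0])   # [0.45, 0.45, 0.05, 0.05]: no atom is Q_1-null
print(Q[1])   # [0.05, 0.05, 0.45, 0.45]: Q_1 and Q_2 share their null events
```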
We conclude this discussion of probabilistic merging with a reminder that merely finitely additive probability models open the door to reasoning to a foregone conclusion (Kadane, Schervish, and Seidenfeld 1996): a second sharp contrast, beyond the P′ model above, with the almost-sure asymptotic merging and convergence-to-the-truth results associated with countably additive probability models. Key to these asymptotic results for countably additive probabilities is the Law of Iterated Expectations.
Let X and Y be (bounded) random variables measurable with respect to a countably additive measure space $\langle\Omega, \mathcal{B}, P\rangle$. With $\mathcal{E}[X]$ and $\mathcal{E}[X \mid Y = y]$ denoting, respectively, the expectation of X and the conditional expectation of X, given the event $\{Y = y\}$, then:

Law of Iterated Expectations: $\mathcal{E}[X] = \mathcal{E}\big[\mathcal{E}[X \mid Y]\big]$.
As Schervish, Seidenfeld, and Kadane (1984) established, each merely finitely (and not countably) additive probability defined on a σ-field of sets fails this law, even when the variable X is an indicator variable. That is, each merely finitely additive probability fails to be conglomerable in some denumerable partition, here associated with the random quantity Y. Specifically, with a merely finitely additive probability P, there exist a measurable hypothesis H, a denumerable partition of measurable events $\pi = \{E_1, E_2, \ldots\}$, and an $\varepsilon > 0$ where

$$P(H \mid E_n) \ge P(H) + \varepsilon \quad \text{for each } n = 1, 2, \ldots.$$

Then, contrary to the Law of Iterated Expectations, with expectations ℰ over the elements of π,

$$\mathcal{E}\big[\mathcal{E}[I_H \mid \pi]\big] \ge P(H) + \varepsilon > P(H) = \mathcal{E}[I_H].$$
Let the random variable Y have a denumerable sample space, $\{y_1, y_2, \ldots\}$. Associate the event $E_i$ with the outcome $\{Y = y_i\}$, so that Y generates the denumerable partition $\pi_Y = \{E_1, E_2, \ldots\}$. Then, if P is nonconglomerable for H in $\pi_Y$, P fails the Law of Iterated Expectations in $\pi_Y$.
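One standard illustration, in the spirit of the examples in Kadane, Schervish, and Seidenfeld (1996), under assumptions we supply here: X and Y are independent, each with a “uniform” purely finitely additive distribution on the positive integers.

```latex
% Sketch of reasoning to a foregone conclusion via nonconglomerability.
% Assume X and Y independent, each "uniformly" distributed on {1, 2, ...}
% by a purely finitely additive P, so that P(X = n) = P(Y = n) = 0 and
% every cofinite set of values has probability 1. Let H = {X > Y}. Then
\begin{align*}
P(H \mid Y = n) = P(X > n) = 1 \quad \text{for each } n = 1, 2, \ldots,
\end{align*}
% while, ex ante, symmetry permits P(H) = 1/2. Whatever value of Y is
% observed, updating drives the credence in H to 1: the conclusion is
% foregone before sampling, and P is nonconglomerable in the partition
% {Y = n : n = 1, 2, ...}.
```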
A set of such merely finitely additive probabilities, each of which is nonconglomerable in the same partition of the shared evidence, can display reasoning to contrary foregone conclusions both in the short run and asymptotically with increasing shared evidence. Because the investigators’ conditional probabilities for a pair of contrary hypotheses $\{H_1, H_2\}$ are nonconglomerable in the partitions of their increasing shared evidence, each investigator may become increasingly certain of a different hypothesis as a function solely of the sample size of their shared evidence, regardless of what those samples reveal. Moreover, this assured increasing divergence in their updated opinions is a fact they are aware of ex ante.
The lesson we draw is this: Bayesian agents who use merely finitely additive probabilities face a trade-off between the added flexibility in modeling that comes with relaxing the constraint of countable additivity and the added restrictions on the kinds of shared evidence necessary to achieve the desirable methodological laws about asymptotic consensus and certainty illustrated in the countably additive strong laws.
5. Summary
Savage (1954) and Blackwell and Dubins (1962) offer important results showing that Bayesian methodology uses increasing shared evidence in order to temper and to resolve interpersonal disagreements about personal probabilities. We contrast interpersonal standards of asymptotic consensus about certainties that arise from a sequence of shared evidence with Belot’s (2013) proposal to use a topological standard of “meagerness” in order to determine when a credal state is immodest, based on a standalone assessment of that credal state.
We understand Belot to endorse topological condition 1, which requires that comeager sets be assigned positive probability. Where a probability model treats a comeager set as null, that shows the model is immodest because it dismisses a topologically large set as probabilistically negligible. But, in light of the fact that the set of sequences whose frequencies oscillate maximally is comeager, we see that all the familiar probability models violate condition 1. We believe that Belot also endorses condition 2, which requires that a typical set of sequences receive a typical probability; that is, a meager set should be assigned probability 0. This topological standard entails extreme a priori credences about the behavior of observed relative frequencies: condition 2 mandates that, with probability 1, observed frequencies oscillate maximally, since otherwise some meager set receives positive probability. This creates its own kind of dogmatism, since (almost surely) the conditional probability from this model persists in assigning conditional probability 1 to the hypothesis that observed frequencies oscillate maximally.
In contrast with Belot’s approach, in section 4 we outline a different strategy for assessing epistemic modesty/immodesty, based on considerations of both asymptotic certainty and consensus among investigators who share evidence. Belot’s strategy is to impose additional requirements that, in the spirit of coherence, apply to a standalone credence function. We follow, for example, Peirce in requiring that sound scientific methodology provides investigators with the resources to resolve interpersonal disagreements through shared evidence. This consideration allows for results about conditions for asymptotic consensus among a set of investigators to serve also as a standard for their epistemic modesty regarding interpersonal disagreements.
As a separate issue, in section 3 we discuss Elga’s (2016) reply to Belot’s analysis. Elga focuses on the assumption of countable additivity in the strengthened convergence results. His rebuttal to Belot’s analysis uses a merely finitely additive probability P to illustrate that merely finitely additive conditional probabilities need not satisfy the countably additive asymptotic (strong-law) convergence results. These are the results that Belot argues reveal an immodesty in the countably additive Bayesian methodology.
We agree with Elga (as has been argued before) that the asymptotics of merely finitely additive conditional probabilities are different in kind from those of countably additive conditional probabilities. But we do not agree with Elga about which are the relevant asymptotic results in his P-model for assessing Bayesian learning of limiting frequencies. In addition, the P-model fails condition 1, which we understand is one of Belot’s standards for modesty.
As we illustrate in section 3, the conditional probabilities arising from a different (but related) merely finitely additive probability P′ fail the asymptotic certainty and consensus results that follow when either Savage’s or Blackwell and Dubins’s analysis applies. We argue that the added generality afforded by merely finitely additive probabilities over countably additive probabilities carries an extra price if merely finitely additive probabilities are to be used reasonably. They require more restrictive conditions than do countably additive probabilities, if the sequence of conditional probabilities that arise from an increasing sequence of shared evidence is to resolve interpersonal credal disagreements.
Appendix
In his classic discussion of measure and category, Oxtoby (1980, theorems 1.6 [p. 4] and 16.5 [p. 64]) establishes that, quite generally, a topological space that also carries a Borel measure can be partitioned into two sets: one is a measure 0 set, and the other, which is its complement, is a meager set. Here we show (theorem A1) that this tension between probabilistic and topological senses of being a “small” set generalizes to sequences of random variables relative to a large class of infinite product topologies. We follow that result with a corollary, namely, proposition 1 in the main text is an instance of theorem A1 for binary sequences.
Let χ be a set with topology ℑ and Borel σ-field ℬ, that is, the σ-field generated by the open sets in ℑ. Let $\chi^\infty$ be the countable product set with the product topology ℑ∞ and product σ-field ℬ∞, which is also the Borel σ-field for the product topology (because it is a countable product). Let $\langle\Omega, \mathcal{A}, P\rangle$ be a probability space, and let $X = \langle X_1, X_2, \ldots\rangle$ be a sequence of random quantities such that, for each n, $X_n$ is $\mathcal{A}$- and ℬ-measurable. Define $X : \Omega \to \chi^\infty$ by $X(\omega) = \langle X_1(\omega), X_2(\omega), \ldots\rangle$. Let $\mathcal{X}$ be the image of X, that is, the set of sample paths of X. We denote elements of $\mathcal{X}$ as $x = \langle x_1, x_2, \ldots\rangle$. The set $\mathcal{X}$ is a subset of $\chi^\infty$; therefore, we endow $\mathcal{X}$ with the subspace topology. In the remainder of this appendix, we identify certain subsets of $\mathcal{X}$ as being either meager or comeager. These results depend solely on the topology for $\mathcal{X}$, and not on the probability P. However, the probability P is needed in order to display the tension between the two rival senses of being a “small” set.
In what follows we require a degree of “logical independence” between the $X_n$’s. In particular, we need the sequence to be capable of moving to various places in $\chi^\infty$ regardless of where it has been so far.
Condition A: Specifically, for each j, let $B_j \subseteq \chi$ be a set such that $B_j$ has nonempty interior $B_j^{\circ}$. Assume that for each n, for each $x \in \mathcal{X}$, and for each j, there exists a positive integer $c(n, j, x)$ such that some sample path $y \in \mathcal{X}$ agrees with x in its first n coordinates and satisfies $y_m \in B_j^{\circ}$ for some m with $n < m \le n + c(n, j, x)$.
Condition A asserts that, no matter where the sequence of random variables has been up to time n, there is a finite time, c(n, j, x), within which it is possible for the sequence to reach the set $B_j$. For example, suppose that each $X_n$ is the average of the first n in a sequence of Bernoulli random variables and that $\{a_j\}$ is a sequence of positive real numbers whose limit is 0. If $B_j = [0, a_j)$ for even j and $B_j = (1 - a_j, 1]$ for odd j, then, independent of the particular sequence x, the longest we would have to wait to reach $B_j$ is

$$c_{n,j} = \lfloor n(1 - a_j)/a_j \rfloor + 1,$$

in order to be sure that there is a sample path that takes us from an arbitrary initial sample path of length n to $B_j$ by time $n + c_{n,j}$. Thus, $c_{n,j}$ is a worst-case bound for waiting. For some $x \in \mathcal{X}$, the minimum c(n, j, x) might be much smaller than this $c_{n,j}$. For instance, with jointly continuous random variables having strictly positive joint density over $\chi^n$ for all n, $c(n, j, x) = 1$ for all n, j, and x.
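A small numerical check of this worst-case bound; the floor-plus-one form of $c_{n,j}$ used below follows the reconstruction above, with exact rational arithmetic to respect the strict inequality defining $B_j = [0, a_j)$.

```python
# Check of the worst-case waiting bound for sample averages: starting from
# the worst prefix of n ones, count how many zeros must follow before the
# running average enters [0, a). Exact fractions avoid floating-point error.
from fractions import Fraction
from math import floor

def waiting_time(n, a):
    """Zeros needed after n ones so that the running average falls below a."""
    k = 0
    while Fraction(n, n + k) >= a:
        k += 1
    return k

for n, a in [(10, Fraction(1, 4)), (100, Fraction(1, 10)), (1000, Fraction(1, 100))]:
    bound = floor(n * (1 - a) / a) + 1
    print(f"n={n}, a={a}: waited {waiting_time(n, a)}, bound {bound}")
```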
For each $x \in \mathcal{X}$, define $h_0(x) = 0$, and for $j \ge 1$, define

$$h_j(x) = \min\{n > h_{j-1}(x) : x_n \in B_j\},$$

with the convention that $\min \emptyset = \infty$. Let $A_j = \{x \in \mathcal{X} : h_j(x) = \infty\}$, and let $A = \bigcup_{j=1}^{\infty} A_j$.

Note that A is the set of sample paths each of which fails to visit at least one of the $B_j$ sets in the order specified. Because we do not require that the sets $B_j$ are nested, it is possible that the sequence reaches $B_k$ for all $k \ne j$ without ever reaching $B_j$. Or the sequence could reach $B_j$ before reaching $B_{j-1}$ but not after.
Theorem A1. A is a meager set.
Proof. Write $A = \bigcup_{j=1}^{\infty} C_j$, where $C_j = \{x \in \mathcal{X} : h_{j-1}(x) < \infty \text{ and } h_j(x) = \infty\}$. Then A is meager if and only if $C_j$ is meager for every j. We prove that $C_j$ is meager for every j by induction.

Start with $j = 1$. We have $x \in C_1$ if and only if $h_1(x) = \infty$, that is, if and only if $x_n \notin B_1$ for every n. To see that $C_1$ is meager, notice that $C_1^{c} = \bigcup_{n=1}^{\infty} D_n$, where, for each $n \ge 1$,

$$D_n = \{x \in \mathcal{X} : x_n \in B_1\}.$$

Each $D_n$ contains a nonempty sub-basic open set $O_n$ obtained by replacing $B_1$ in the definition of each $D_n$ by its interior $B_1^{\circ}$. So $C_1^{c}$ contains the nonempty open set $O = \bigcup_{n=1}^{\infty} O_n$.

Next, we show that O is dense; hence, $C_1$ is meager as it is nowhere dense. We verify that $O \cap E \ne \emptyset$ for every nonempty basic open set E. If E is a nonempty basic open set, then there exists an integer k and there exist nonempty open subsets $E_1, \ldots, E_k$ of χ such that

$$E = \{x \in \mathcal{X} : x_i \in E_i, \ i = 1, \ldots, k\}.$$

Let $y \in E$, and let $x^k$ be the first k coordinates of y. Then, by condition A, there exist points in $\mathcal{X}$ whose first k coordinates are $x^k$ and whose $(k + m)$th coordinate lies in $B_1^{\circ}$ for some $m \le c(k, 1, y)$. Hence,

$$O \cap E \ne \emptyset.$$

Next, for $j > 1$, assume that $C_r$ is meager for all $r < j$. To complete the induction, we show that $C_j$ is meager, which follows the same reasoning as in the base case. Write

$$C_j = \bigcup_{r=j-1}^{\infty} F_r,$$

where $F_r = \{x \in C_j : h_{j-1}(x) = r\}$. It suffices to show that each $F_r$ is meager.

Notice that $F_r$ is a subset of

$$G_r = \{x \in \mathcal{X} : x_n \notin B_j \text{ for all } n > r\}.$$

It suffices to show that $G_r$ is meager.

As in the case with $j = 1$, write $G_r^{c} = \bigcup_{n=1}^{\infty} D_n$, where $D_n = \{x \in \mathcal{X} : x_{r+n} \in B_j\}$. Each $D_n$ contains a nonempty sub-basic open set $O_n$ obtained by replacing $B_j$ in the definition of each $D_n$ by its interior $B_j^{\circ}$. So $G_r^{c}$ contains a nonempty open set $O = \bigcup_{n=1}^{\infty} O_n$.

Finally, we establish that O is dense; hence, $G_r$ is meager. Reasoning as in the base case with $j = 1$, we verify that $O \cap E \ne \emptyset$ for every nonempty basic open set E. If E is a nonempty basic open set, then there exists an integer k and there exist nonempty open subsets $E_1, \ldots, E_k$ of χ such that $E = \{x \in \mathcal{X} : x_i \in E_i,\ i = 1, \ldots, k\}$. Let $y \in E$, and let $x^k$ be the first k coordinates of y. Then there exist points in $\mathcal{X}$ whose first k coordinates are $x^k$ and whose $(k + m)$th coordinate lies in $B_j^{\circ}$ for some finite m with $k + m > r$. Hence,

$$O \cap E \ne \emptyset,$$

which completes the induction. QED
Next, return to consider the sequence of random variables described earlier. Suppose that each $X_n$ is the sample average of some other sequence of random variables. That is, $X_n = (1/n)\sum_{k=1}^{n} Y_k$, where each $Y_k$ is finite. Assume that condition A obtains. Namely, assume that the dependence between the $Y_k$ is small enough so that $c(n, j, x) < \infty$ for all n, j, and x. For example, assume that there exist $c < d$, with c either finite or $-\infty$, and with d either finite or $+\infty$, such that for each $n$ and each $\varepsilon > 0$,

$$P(Y_{n+1} \in I_{c,\varepsilon} \mid Y_1 = y_1, \ldots, Y_n = y_n) > 0 \quad \text{and} \quad P(Y_{n+1} \in I_{d,\varepsilon} \mid Y_1 = y_1, \ldots, Y_n = y_n) > 0,$$

where

$$I_{c,\varepsilon} = \begin{cases} (c, c + \varepsilon) & \text{if } c \text{ is finite} \\ (-\infty, -1/\varepsilon) & \text{if } c = -\infty, \end{cases} \qquad I_{d,\varepsilon} = \begin{cases} (d - \varepsilon, d) & \text{if } d \text{ is finite} \\ (1/\varepsilon, \infty) & \text{if } d = +\infty. \end{cases}$$

Then condition A obtains for a sequence of iid random variables. It also obtains for a sequence of random variables such that $\{Y_1, \ldots, Y_n\}$ has strictly positive joint density over $(c, d)^n$ for all n. In such a case, we could let

$$B_j = (c, c + a) \ \text{for even } j, \qquad B_j = (d - a, d) \ \text{for odd } j,$$

where $a > 0$. Then A contains all sample paths for which $\liminf_{n\to\infty} X_n > c + a$, along with some sample paths for which $\liminf_{n\to\infty} X_n \le c + a$. If we repeat the construction of A for a countable collection of $a_n$ with $\lim_{n\to\infty} a_n = 0$, then the union of all of the A sets is meager. Then, the set of sample paths for which $\liminf_{n\to\infty} X_n > c$ is meager. A similar construction shows that the set of sample paths for which $\limsup_{n\to\infty} X_n < d$ is meager. Hence the union of these last two sets is meager, and the set of sample paths along which $X_n$ oscillates maximally is a comeager set.
Theorem A1 applies directly to the sequence $X = \langle X_1, X_2, \ldots\rangle$. It shows that certain sets of sample paths of this sequence are meager or comeager. If, as in the case of sample averages, each $X_n$ is a function of $\{Y_1, \ldots, Y_n\}$, we can evaluate the category of a set of sample paths of the $Y = \langle Y_1, Y_2, \ldots\rangle$ sequence. If $\langle X_1, X_2, \ldots\rangle$ is a bicontinuous function of $\langle Y_1, Y_2, \ldots\rangle$, then the two sets of sample paths are homeomorphic. In particular, this implies that the category of a set of sample paths of one sequence will be the same as the category of the corresponding set of sample paths of the other sequence: the one is meager if and only if the other is.
In the case of sample averages, we can exhibit the bicontinuous function explicitly. To be specific, let $S_0 = 0$, and for each n, define $S_n = \sum_{k=1}^{n} Y_k$. Let $X_n = S_n/n$ as above, with $x_n = s_n/n$ denoting the corresponding coordinates of a sample path. Let $\mathcal{X}$ and $\mathcal{Y}$ be the sets of sample paths of X and Y, respectively. That is, $\mathcal{X} = X(\Omega)$ and $\mathcal{Y} = Y(\Omega)$. For each $y \in \mathcal{Y}$, define

$$\phi(y) = \Big\langle y_1, \ \tfrac{1}{2}(y_1 + y_2), \ \ldots, \ \tfrac{1}{n}\textstyle\sum_{k=1}^{n} y_k, \ \ldots \Big\rangle.$$

For each $x \in \mathcal{X}$, define

$$\varphi(x) = \big\langle x_1, \ 2x_2 - x_1, \ \ldots, \ nx_n - (n-1)x_{n-1}, \ \ldots \big\rangle.$$

Then, by construction, $\varphi(\phi(y)) = y$ and $\phi(\varphi(x)) = x$. That is, $\varphi$ is the inverse of $\phi$. In order to have the category of the two sets of sample paths the same, it is sufficient that both $\phi$ and $\varphi$ are continuous. If they are continuous as functions both from and to $\Re^\infty$, then they will be continuous in their subspace topologies. It suffices to show that $\phi^{-1}(B)$ and $\varphi^{-1}(B)$ are open for each sub-basic open set B. Every sub-basic open set is of the form $B = \prod_{i=1}^{\infty} B_i$, where each $B_i = \Re$ except for one value of i, say $n_0$, for which $B_{n_0}$ is open as a subset of ℜ. Then each of $\phi^{-1}(B)$ and $\varphi^{-1}(B)$ has the form $C \times \Re \times \Re \times \cdots$, where C is an $n_0$-dimensional open subset of $\Re^{n_0}$; hence, both sets are open, and we have that $\mathcal{X}$ is homeomorphic to $\mathcal{Y}$. Proposition 1 in the main text is an instance of theorem A1 for binary sequences.
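Finally, the pair of maps just defined can be checked with a short roundtrip computation; the sample sequence below is arbitrary.

```python
# Roundtrip sketch of the homeomorphism between a sequence and its running
# averages: phi sends y to its sample averages, and psi recovers y from the
# averages by differencing, matching the maps reconstructed above.

def phi(y):
    """Running averages: x_n = (y_1 + ... + y_n) / n."""
    x, total = [], 0.0
    for n, yn in enumerate(y, start=1):
        total += yn
        x.append(total / n)
    return x

def psi(x):
    """Inverse by differencing: y_1 = x_1, y_n = n*x_n - (n-1)*x_{n-1}."""
    return [x[0]] + [n * x[n - 1] - (n - 1) * x[n - 2] for n in range(2, len(x) + 1)]

y = [1.0, 0.0, 0.0, 1.0, 1.0]
print(phi(y))         # [1.0, 0.5, 0.333..., 0.5, 0.6]
print(psi(phi(y)))    # recovers y (up to floating-point rounding)
```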