
Standards for Modest Bayesian Credences

Published online by Cambridge University Press:  01 January 2022


Abstract

Gordon Belot argues that Bayesian theory is epistemologically immodest. In response, we show that the topological conditions that underpin his criticisms of asymptotic Bayesian conditioning are self-defeating. They require extreme a priori credences regarding, for example, the limiting behavior of observed relative frequencies. We offer a different explication of Bayesian modesty using a goal of consensus: rival scientific opinions should be responsive to new facts as a way to resolve their disputes. Also we address Adam Elga’s rebuttal to Belot’s analysis, which focuses attention on the role that the assumption of countable additivity plays in Belot’s criticisms.

Copyright © The Philosophy of Science Association

1. Introduction

Consider the following compound result about asymptotic statistical inference. A community of Bayesian investigators who begin an investigation with conflicting opinions about a common family of statistical hypotheses use shared evidence to achieve a consensus about which hypothesis is the true one. Specifically, suppose the investigators agree on a partition of statistical hypotheses and share observations of an increasing sequence of random samples with respect to whichever is the true statistical hypothesis from this partition.Footnote 1 Then, under various combinations of formal conditions that we review in this essay, ex ante (i.e., before accepting the new evidence) it is practically certain that each of the investigators’ conditional probabilities approach 1 for the one true hypothesis in the partition.

The result is compound: first, individual investigators achieve asymptotic certainty about the unknown, true statistical hypothesis; second, the shared evidence leads to a consensus among the different investigators' individual degrees of belief. The initial disagreements, the disparate initial credences about different hypotheses, are resolved with increasing shared evidence. Stated in more familiar Bayesian terms, it is practically certain that the likelihood function based on the shared statistical evidence swamps differences in initial prior credences to produce a consensus among posterior credences.

The strategy to use asymptotics of Bayesian inference to defend against charges of excessive subjectivity is highlighted in the seminal work of Savage (1954, secs. 3.6 and 4.6) and Edwards, Lindman, and Savage (1963). Savage's (1954) results apply to a finite set of investigators who hold nonextreme views over a common finite partition of statistical hypotheses.Footnote 2 He establishes that—using a (finitely additive) weak law of large numbers—given increasing statistical evidence from a sequence of random samples, with probability approaching 1, different nonextreme personalists' conditional probabilities become ever more concentrated on the same one true statistical hypothesis from among a finite partition of rival statistical hypotheses.Footnote 3 To repeat, the result is compound. It addresses both issues of certainty and consensus among finitely many investigators over a finite partition of statistical hypotheses, assuming they share an increasing sequence of observations from random sampling.Footnote 4

Savage offers these findings as a partial defense against the accusation, voiced by frequentist statisticians of the time, that the theory of (Bayesian) personalist statistics is fraught with subjectivism and cannot serve the methodological needs of the scientific community, where objectivity is required. The central theme in Savage’s response is to understand ‘objectivity’ in terms of shared agreements about the truth, particularly when the shared agreements arise from shared statistical evidence. In summary, Savage provides sufficient conditions for when Bayesian methodology makes it ex ante almost certain that shared evidence secures this kind of objectivity for a well-defined community of investigators.

Savage (1954, 50) notes that his result about asymptotic certainty can be extended in several ways, by adapting the central limit theorem, the strong law of large numbers, and the law of the iterated logarithm to sequences of conditional probabilities generated by an increasing sequence of random samples. The last two of these laws require stronger assumptions than are needed for the finitely additive weak-law convergence result that Savage presents. Specifically, these stronger results require the assumption that (conditional) probabilities are countably additive.

Savage's twin results have been strengthened also to include shared evidence from nonrandom samples. Consider an uncountably infinite probability space generated by increasing finite sequences of observable random variables, not necessarily forming a random sample with respect to a statistical hypothesis of interest. Rather than requiring that different agents hold nonextreme views about all possible events in the space of observables, which is mathematically impossible with real-valued probabilities once the space is uncountable, instead require that they agree with each other about which events in this uncountably infinite space of observables have probability 0. They share in a family of mutually absolutely continuous probability distributions. If the agents' personal probabilities over these infinite spaces also are countably additive, then strong-law convergence theorems yield strengthened results about asymptotic consensus (see, e.g., Blackwell and Dubins 1962) and also about asymptotic certainty for events defined in the space of sequences of increasing shared evidence. We discuss several of these results in section 4. There we use considerations both of certainty and consensus to explicate epistemic modesty within a Bayesian framework that contrasts with a critical assessment of Bayesian theory offered by Belot (2013), whose work we next consider.

2. Orgulity as Identified by Comparing Meager Sets versus Null Sets

In a 2013 paper in this Journal, critical of the methodological significance of some of the strengthened versions of Savage's convergence result for asymptotic certainty, Belot arrives at a harsh conclusion: "The truth concerning Bayesian convergence-to-the-truth results is significantly worse than has been generally allowed—they constitute a real liability for Bayesianism by forbidding a reasonable epistemological modesty" (2013, 502). Below, we argue that this verdict is misguided. The criteria for reasonable epistemic modesty that we understand to underpin Belot's analysis are self-defeating; hence, his argument is not compelling. When the criteria that we attribute to Belot are satisfied, they induce unreasonable epistemic apriorism regarding, for example, how sequences of observed relative frequencies behave.

What makes a (coherent) Bayesian credal state overconfident and lacking in epistemological modesty? Does the Bayesian position generally forbid "a reasonable epistemological modesty," as Belot intimates? These questions are both interesting and imprecise. There is no doubting that the standard of mere Bayesian coherence for a credal state, as formalized in de Finetti's (1937/1964) theory, falls short of characterizing the set of reasonable credal states. To use an old and tired example, a person who thinks each morning that it is highly probable that the world ends later that afternoon does not thereby violate the technical norms of coherence.

In order to identify a brand of unreasonableness captured in overconfident, epistemologically immodest credal states, Belot supplements Bayesian coherence with a topological standard for respecting what he calls a typical event: He defines a typical event as a topologically large event. When a coherent agent assigns probability 0 to a topologically large set, specifically when a probability null set is comeager, Belot thinks that is a warning sign of epistemological immodesty.Footnote 5 Such a Bayesian agent is practically certain that the topologically typical event does not occur. And then Bayesian conditioning (almost surely) preserves that certainty in the face of new evidence. So, the Bayesian agent is not open-minded because, in dismissing as probabilistically negligible a topologically typical event E, (almost surely) she is aware ex ante that Bayesian conditioning precludes learning that the typical event E occurs.

We understand Belot’s criticism (Reference Belot2013, sec. 4) to be that Bayesian convergence-to-the-truth results about hypotheses that are formulated in terms of sets of observable sequences fail this concern about typical events. The strengthened convergence results allow the Bayesian agent to dismiss (ex ante) a probabilistically negligible set of sequences of observations where the convergence to the truth fails. This set has “prior” probability 0. Except, Belot complains, that failure set may be comeager in the usual topology for the sequences of observables. Hence, the failure set may be a typical event in the space of observables, about which a modest investigator should keep an open mind. But, instead, Bayes updating (almost surely) ignores these typical events by continuing to assign to them probability 0, even as the evidence grows. Thus, the strengthened asymptotic certainty results that Belot criticizes do not conform to the topological standards of epistemic modesty in the sense of modesty that we understand he advocates.

Although he does not explicitly formulate criteria for immodesty, on the basis of the examples and analysis he offers, we understand Belot’s primary requirements to be these two:Footnote 6

Topological Condition 1: Do not assign probability 1 to a meager set of observables.

Also, we find that Belot argues for a more demanding standard,Footnote 7

Topological Condition 2: Assign probability 0 to each hypothesis that is a meager set in the space of sequences of observables.

Ordinary statistical models violate topological condition 1 by their unconditional probabilities, independent of whether learning is by Bayesian updating. Already, condition 1 is inconsistent with the strong laws of large numbers, including the ergodic theorem, which are asymptotic results for unconditional probabilities (see Oxtoby 1980, 85).

Here we show that topological condition 2 entails a radical probabilistic apriorism toward observed relative frequencies that has little to do with questions about Bayesian overconfidence. In particular, this topological standard requires that with probability 1, relative frequencies for an arbitrary sequence of (logically independent) events oscillate maximally. From a Bayesian point of view, almost surely new evidence leaves this extreme epistemic attitude wholly unmodified. A Bayesian agent whose credal state conforms to condition 2 knows ex ante that she is practically certain never to change her mind that the relative frequencies for a sequence of events oscillate maximally. In this sense, we find that conditions 1 and 2 are self-defeating through a lack of humility. They promote excessive apriorism with respect to ordinary properties of limiting frequencies.

The Bayesian convergence-to-the-truth results that are the subject of Belot's complaints are formulated as probability strong laws that hold almost surely or almost everywhere. In order to make clear why we think Belot is mistaken in judging these convergence-to-the-truth results a liability for Bayesian theory, revisit the familiar instance of the strong law of large numbers, as reported in footnote 4.

Let $\langle \Omega, \mathcal{B}, P\rangle$ be the countably additive measure space generated by all finite sequences of repeated, probabilistically independent (iid) flips of a "fair" coin. Let 1 denote a "heads" outcome and 0 a "tails" outcome for each flip. Then a point x of Ω is a denumerable sequence of zeroes and ones, $x = \langle x_1, x_2, \ldots\rangle$, with each $x_n \in \{0, 1\}$ for $n = 1, 2, \ldots$. Let $X_n(x) = x_n$ designate the random variable corresponding to the outcome of the nth flip of the fair coin. The Borel σ-algebra ℬ is generated by rectangular events, those determined by specifying values for finitely many coordinates in Ω. The countably additive iid product fair-coin probability P is determined by

$P(X_n = 1) = 1/2 \quad (n = 1, 2, \ldots),$

and where each finite sequence of length n is equally probable,

$P(X_1 = x_1, \ldots, X_n = x_n) = 2^{-n}.$

Let $L_{1/2}$ be the set of infinite sequences of 0s and 1s with limiting frequency 1/2 for each of the two digits: a set belonging to ℬ. Specifically, let $S_n = \sum_{i=1}^n X_i$. Then $L_{1/2} = \{x : \lim_{n\to\infty} S_n/n = 1/2\}$. The strong law of large numbers asserts that $P(L_{1/2}) = 1$. What is excused with the strong law, what is assigned probability 0, is the null set $N = [L_{1/2}]^c$ consisting of the complement to $L_{1/2}$ among all denumerable sequences of 0s and 1s.
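To make the strong-law statement concrete, here is a minimal simulation sketch, our illustration rather than anything from the original text: along simulated fair-coin paths, the running relative frequency $S_n/n$ settles near 1/2.

```python
import random

# Minimal illustration of the strong law for the fair-coin model:
# along a simulated path x, the running average S_n / n approaches 1/2.
random.seed(2025)

heads = 0
for n in range(1, 1_000_001):
    heads += random.randint(0, 1)          # x_n: one fair flip, 1 = heads
    if n in (10, 1_000, 100_000, 1_000_000):
        print(f"n = {n:>9}: S_n/n = {heads / n:.5f}")
```

Of course, no finite simulation touches the null set N itself; the sketch only displays the typical behavior to which the strong law assigns probability 1.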

The null set N is large, both in cardinality and in category under the product topology for $2^\omega$. It is a set with cardinality equal to the cardinality of its complement, the continuum.Footnote 8 When $2^\omega$ is equipped with the infinite product of the discrete topology on {0, 1},Footnote 9 then the null set N is topologically large. Set N is comeager (Oxtoby 1980, 85).Footnote 10 That is, the set $L_{1/2}$ is meager and so is judged topologically "small," or atypical. By condition 1, a Bayesian who adopts the fair-coin model for her credences is epistemologically immodest with respect to denumerable sequences of possible coin flips: the space of sequences of observations that drive the asymptotic certainty result.

This strong-law counterexample to condition 1 should come as no surprise in the light of the following result:

Oxtoby (1980, theorem 1.6): Each nonempty interval on the real line may be partitioned into two sets, {N, M}, where N is a Lebesgue measure null set and its complement $M = N^c$ is a meager set.

Oxtoby generalizes this result with his theorem 16.5.Footnote 11 In his illustration of theorem 16.5 using the strong law of large numbers, the binary partition $\{N, L_{1/2}\}$ displays the direct conflict between the measure-theoretic and topological senses of small. Under the fair-coin model, N has probability 0, and $L_{1/2}$ is a meager set in the product topology of the discrete topology on {0, 1}. The tension between the two senses of small is not over some esoteric binary partition of the space of binary sequences but applies to the event that the sequence of observed outcomes has a limiting frequency 1/2.

We exemplify the general conflict encapsulated in Oxtoby's theorem 16.5 with the following claim, which we use to criticize condition 2. Consider the space $2^\omega$, with points $x = \langle x_1, x_2, \ldots\rangle$ of denumerable sequences of zeroes and ones, equipped with the infinite product of the discrete topology on {0, 1}. Define the set of sequences $L_{\langle 0,1\rangle}$ consisting of those points x whose relative frequency does not oscillate maximally, that is, where

$\liminf_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} x_j > 0 \quad \text{or} \quad \limsup_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} x_j < 1.$

The complement to $L_{\langle 0,1\rangle}$, $OM = [L_{\langle 0,1\rangle}]^c$, is the set of binary sequences whose observed relative frequencies oscillate maximally.

Proposition 1. $L_{\langle 0,1\rangle}$ is a meager set; that is, OM is a comeager set.Footnote 12

Theorem A1 of the appendix establishes that sequences of logically independent random variables that oscillate maximally are comeager with respect to infinite product topologies on the sequence of random variables. Proposition 1 is a corollary to theorem A1 applied to binary sequences, that is, where there are only two categories for observables.
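To see why OM is far from exotic, here is a sketch, our illustration and not the paper's, that constructs an initial segment of a maximally oscillating binary sequence: alternating blocks of 0s and 1s, each block an order of magnitude longer than the entire history before it, force the running relative frequency toward 0 and toward 1 infinitely often.

```python
# Construct an initial segment of a binary sequence whose relative
# frequencies oscillate maximally (liminf 0 and limsup 1 in the limit):
# alternate blocks of 0s and 1s, each block dwarfing all earlier flips.
def oscillating_prefix(num_blocks: int) -> list[int]:
    seq, block_len, digit = [], 1, 0
    for _ in range(num_blocks):
        seq.extend([digit] * block_len)
        block_len = 10 * len(seq)    # next block is 10x the whole history
        digit = 1 - digit            # switch between 0-blocks and 1-blocks
    return seq

x = oscillating_prefix(6)
count, low, high = 0, 1.0, 0.0
for n, bit in enumerate(x, start=1):
    count += bit
    low, high = min(low, count / n), max(high, count / n)
print(f"running frequency dips to {low:.3f} and climbs to {high:.3f}")
```

Continuing the block construction indefinitely drives the lim inf to 0 and the lim sup to 1, so the completed sequence lies in OM.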

What proposition 1 establishes is that only extreme probability models of relative frequencies satisfy topological condition 2. That is, consider a measure space $\langle 2^\omega, \mathcal{B}, P\rangle$, where ℬ includes the Borel sets from $2^\omega$ and where $2^\omega$ is equipped with the infinite product of the discrete topology as above. Each probability with $P(L_{\langle 0,1\rangle}) > 0$ produces a nonnull set that is meager.

Unless a probability model P for a sequence of relative frequencies assigns probability 1 to the set of sequences of observed frequencies that oscillate maximally, then P assigns positive probability to a meager set of sequences, in violation of condition 2. Evidently, the standard for epistemological modesty formalized in topological condition 2, which requires meager sets of relevant events to be assigned probability 0, itself leads to probabilistic orgulity because it requires an unreasonable a priori opinion about how observed relative frequencies behave. Let P satisfy condition 2. Given evidence of a P-non-null observation o of observed relative frequencies, the resulting conditional probability leaves this extreme a priori opinion unchanged: $P(OM \mid \text{observation } o) = 1$.

Familiar Bayesian models also violate the weaker topological condition 1. Consider an exchangeable probability model over $2^\omega$. Then, by de Finetti's (1937/1964) theorem, each exchangeable probability assigns probability 1 to the set L of sequences with well-defined limiting frequencies for 0s and 1s. That is, then $P\{x : \liminf_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} x_j = \limsup_{n\to\infty} \frac{1}{n}\sum_{j=1}^{n} x_j\} = 1$. But L is a subset of $L_{\langle 0,1\rangle}$; hence, L is a meager set.

In summary, our understanding is that Belot applies topological conditions 1 and 2 in order to identify an epistemically immodest coherent credal state. We find that each of these two conditions is excessively restrictive and is self-defeating as a criterion for epistemic immodesty. The credences that satisfy these conditions with respect to the sets of sequences of observables that ground the almost-sure Bayesian convergence results embed extreme a priorism about, for example, the limiting behavior of observed relative frequencies.

In section 4, we argue that a better account of Bayesian epistemological modesty/immodesty uses interpersonal standards for asymptotic consensus within a community of investigators about the set of certainties that arise from an idealized sequence of observations. Belot’s approach for identifying epistemic immodesty applies topological conditions of adequacy to a standalone credence function and avoids issues of consensus. In contrast, we supplement coherence with criteria involving asymptotic consensus among a community of investigators about which certainties they might acquire based on a sequence of shared evidence.

3. But What If Probability Is Merely Finitely Additive and Not Countably Additive?

Elga (2016) responds to Belot's criticism by focusing on the premise of countable additivity for probability, which is needed for the strong-law versions of Savage's convergence result. The subjective theory of probability, especially as promoted by Savage (1954), Dubins and Savage (1965), and de Finetti (1974), does not mandate countable additivity for credences. This added generality is of importance for contemporary Bayesian practice, as argued in Kadane, Schervish, and Seidenfeld (1986).

As we understand Elga's response to Belot's criticism, it is based on an example. The example purportedly shows how, using a finitely but not countably additive probability P, Belot's standard for an open-minded Bayesian credal state may be satisfied without the immodesty of treating a comeager failure set as a P-null set, as happens when probability is countably additive. Elga argues that, in his example, the set of data sequences where the convergence-to-the-truth result fails with the credal state P has positive P-probability, contrary to what happens in the countably additive case. Elga asserts that in his example the agent's finitely additive conditional probabilities do not (almost surely) converge to the true statistical hypothesis about limiting relative frequencies; hence, such a Bayesian agent escapes Belot's criticism, as this agent is epistemologically humble about becoming certain of the true limiting relative frequency in the observed sequence.

First and foremost, we dispute Elga’s analysis of the specific example he offers. We argue that, contrary to Elga’s assessment, his merely finitely additive probability model P satisfies a finitely additive convergence-to-the-truth theorem that is needed to defend Bayesian learning. The Bayesian agent of Elga’s example is not humble about whether (with increasing probability) she will achieve asymptotic certainty for the limiting frequency hypothesis in question: at each stage of her investigation, looking forward, she remains practically certain that her posterior probability will converge to the true limiting frequency hypothesis.

Second, the credal state P in Elga’s example fails what we call Belot’s condition 1. State P assigns probability 1 to a meager set of sequences of observations. Hence, although Elga argues that P is modest with respect to one limiting frequency hypothesis, according to condition 1 P is immodest for a different but related hypothesis about the existence of well-defined limiting frequencies.

Nonetheless, we agree with Elga (and with others who have argued the same point previously) that finitely but not countably additive probability models allow failures of the strengthened convergence results. We illustrate this point using a finitely additive probability, P′, that is a simple variant of Elga’s model P. But in our judgment, this phenomenon—where a finitely additive model P′ fails the strengthened convergence-to-the-truth result—does not provide a satisfactory rebuttal to Belot’s criticism. Belot’s criticism, which is directed at countably additive credences, is that they display Bayesian orgulity. To argue that, on the contrary, the finitely additive probability P′ assigns positive probability to a set of sequences where convergence to the truth fails does not show that such a merely finitely additive probability is reasonable.

According to the rival standards for epistemic modesty that we offer in section 4, such a finitely additive probability P′ is unreasonable on two counts simultaneously: the Bayesian agent with credence P′ knows in advance that each data sequence that might be observed will fail to induce certainty, both in the short term and in the limit. Also, P′ fails the test for reasonableness based on consensus. That is, the agent with credences fixed by P′ does not reach consensus with other members of a community of investigators who use countably additive credences and agree with P′ about which (finite) sequences of observables are probability-0 events. But the others reach consensus among themselves.

For a detailed discussion of Elga's example, begin with a review of some relevant mathematical considerations. When probability P is defined for a measurable space, the principle of countable additivity has an equivalent form as a principle of Continuity. Let $A_i$ ($i = 1, \ldots$) be a monotone sequence of (measurable) events, where $\lim_{i\to\infty} A_i = A$ is also a (measurable) event.

Continuity: $P(A) = \lim_{i\to\infty} P(A_i)$.

When probabilities satisfy Continuity, the probabilities for a class C of events that form a field also determine uniquely the probabilities for the smallest σ-field generated by C (see Halmos 1950, theorem 13A). And if an event H belongs to that σ-field, then H can be approximated in probability by events from the field C. Specifically, for each $\varepsilon > 0$ there exists a $C_\varepsilon \in C$ such that $P([H \setminus C_\varepsilon] \cup [C_\varepsilon \setminus H]) < \varepsilon$ (see Halmos 1950, theorem 13D). This result has important consequences when H is a tail-field event in $2^\omega$.Footnote 13

Consider the countably additive probability P for iid flips of a fair coin and, for example, the tail-field event $L_{1/2}$ in $2^\omega$. Then, $L_{1/2}$ can be approximated ever more precisely in probability by a sequence of finite-dimensional events $\{E_n : n = 1, \ldots\}$, each of which is determined by a finite number of coordinates from the set of denumerable binary sequences, $2^\omega$. Choose a sequence $\{\varepsilon_n > 0 : n = 1, \ldots\}$ with $\lim_{n\to\infty} \varepsilon_n = 0$. That is, for each n = 1, …, $P([L_{1/2} \setminus E_n] \cup [E_n \setminus L_{1/2}]) < \varepsilon_n$, and each $E_n$ depends on only finitely many coordinates from $2^\omega$. With P the product measure for iid fair-coin flips and $L_{1/2}$ the tail-field event that is to be approximated, the finite-dimensional events $E_n$ may be chosen as the set of sequences with relative frequency of ones sufficiently close to 1/2 through the first n trials. However, when Continuity fails, and P is merely finitely additive but not countably additive, then the probabilities over C may fail to determine the probabilities over the smallest σ-field generated by C.

For example, pick two values $0 \le p \ne q \le 1$. A coherent, merely finitely additive probability $P_{p,q}$ on $2^\omega$ may assign values to each finite-dimensional event according to iid trials with constant Bernoulli probability p but assign probabilities to the tail-field events according to iid trials with constant Bernoulli probability q. Then, the strong law of large numbers does not entail the weak law of large numbers with the same values. While finite sequences of zeroes and ones follow an iid Bernoulli-p product law, with $P_{p,q}$-probability 1 the limiting relative frequency for ones is q. This phenomenon is at the heart of Elga's example.

Let P be a merely finitely additive probability on the Borel σ-algebra of $2^\omega$ where $P(\cdot) = [P_{p,q}(\cdot) + P_{q,p}(\cdot)]/2$. Elga considers the case with $p = 1/10$ and $q = 9/10$. This finitely additive probability assigns probability 1/2 to the tail-field event $L_{1/10}$ (the set of sequences with limiting frequency 1/10) and probability 1/2 to the tail-field event $L_{9/10}$ (the set of sequences with limiting frequency 9/10). For $x \in 2^\omega$, let $I_{L_{1/10}}(x)$ be the indicator function for the event $L_{1/10}$ and $I_{L_{9/10}}(x)$ the indicator function for the event $L_{9/10}$. So, $P\{x : I_{L_{1/10}}(x) + I_{L_{9/10}}(x) = 1\} = 1$. Thus, we see from proposition 1 that Elga's example violates topological condition 1: with P-probability 1 the sequence of coin flips has a convergent limiting relative frequency, and such sequences form a meager set among the set of all binary sequences.

Elga asserts that the conditional probabilities associated with the (merely) finitely additive P-distribution fail the almost-sure strong-law convergence result. Here is the argument he offers for that conclusion. Let x be an element of the set $L_{1/10}$, a sequence with limiting relative frequency 1/10, which is practically certain to occur according to the P-distribution on sequences if and only if the $P_{9/10,1/10}$ coin is flipped. (Otherwise, with P-probability 1, a sequence x almost surely has a limiting relative frequency 9/10, since it is then following a $P_{1/10,9/10}$ law.) Then, for each $\varepsilon > 0$ there exists an integer $n_\varepsilon$ such that, for each $n > n_\varepsilon$, the observed sequence $\{X_1, \ldots, X_n\}$ has a relative frequency of ones close enough to 1/10 so that the posterior probability satisfies $P(L_{9/10} \mid X_1, \ldots, X_n) > 1 - \varepsilon$.
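The likelihood arithmetic behind this assertion can be checked in a few lines. The following sketch is our illustration, not Elga's: under the mixture P, the posterior odds between the two components given k heads in n flips are fixed by the Bernoulli(1/10) versus Bernoulli(9/10) finite-data likelihoods, and the component with Bernoulli(1/10) finite histories is the one whose tail-field event is $L_{9/10}$.

```python
from math import exp, log

def posterior_L_9_10(n: int, k: int) -> float:
    """Posterior probability of the tail event L_{9/10} under the mixture
    P = [P_{1/10,9/10} + P_{9/10,1/10}] / 2, given k heads in n flips.
    Finite histories are Bernoulli(1/10) under the first component (whose
    tail event is L_{9/10}) and Bernoulli(9/10) under the second."""
    log_lik_1 = k * log(0.1) + (n - k) * log(0.9)  # component with tail L_{9/10}
    log_lik_2 = k * log(0.9) + (n - k) * log(0.1)  # component with tail L_{1/10}
    # The equal prior weights of 1/2 cancel in the posterior odds.
    return 1.0 / (1.0 + exp(log_lik_2 - log_lik_1))

# Along data whose relative frequency of heads is 1/10 (as for a sequence
# in L_{1/10}), the posterior for L_{9/10} rushes toward 1:
for n in (10, 50, 200):
    print(f"n = {n:>3}: P(L_9/10 | data) = {posterior_L_9_10(n, n // 10):.10f}")
```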

Elga's point is that this conditional probability assigns high probability to the event $L_{9/10}$, that the limiting frequency of the sequence is 9/10, even though the sequence that generates the observations in fact has limiting frequency 1/10. In this sense, the sequence of conditional probabilities generated by x (an element of the set $L_{1/10}$) converges to the wrong tail-field event, $L_{9/10}$. Likewise, the convergence is to the wrong tail-field event, $L_{1/10}$, when the sequence is generated by an element of the set $L_{9/10}$. Elga concludes that conditional probabilities from this merely finitely additive P-model do not satisfy the (almost-sure) strong-law convergence-to-the-truth results. Then, regarding either tail-field event $L_{1/10}$ or $L_{9/10}$, the agent with conditional credences fixed by probability P is both open-minded and modest.Footnote 14 But this analysis is misleading regarding convergence to the truth because it conditions on P-null events, as we now explain.

Define the denumerable set of countably additive probabilities $\{P_n\}$ on $2^\omega$ so that $P_n$ is the iid product of a Bernoulli-p probability for the first n coordinates and is the iid product of a Bernoulli-q probability for all coordinates beginning with the $(n+1)$st position. Each $P_n$ is a countably additive probability on the measurable space $\langle 2^\omega, \mathcal{B}\rangle$. Distribution $P_n$ has a change point after the nth trial. Let the change point, $N = n$, be chosen according to a purely finitely additive probability, with $P(N = n) = 0$, $n = 1, 2, \ldots$. Finally, let P be the induced (marginal) unconditional probability on the Borel σ-algebra of sequences of coin flips, $\langle 2^\omega, \mathcal{B}\rangle$.

As required for Elga's construction, this finitely additive probability P behaves as $P_{p,q}$. Its distribution is the iid product of a Bernoulli-p distribution on finite-dimensional sets and is the iid product of a Bernoulli-q distribution on the tail-field events.Footnote 15 Probability P satisfies the weak law of large numbers over finite sequences with Bernoulli parameter p and satisfies the strong law of large numbers on the tail field with Bernoulli parameter q. Hence, the strong law does not entail the weak law with the same parameter value.
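The countably additive building blocks $P_n$ can be simulated directly. The following sketch is our illustration of how a fixed change point produces the two-sided behavior: finite prefixes look Bernoulli-p, while the long-run frequency is q.

```python
import random

# Our sketch of the countably additive building block P_n: iid
# Bernoulli(p) flips through the change point n, iid Bernoulli(q) after.
# Early relative frequencies hover near p; the limiting frequency is q.
def sample_P_n(p: float, q: float, n: int, length: int) -> list[int]:
    return [1 if random.random() < (p if i <= n else q) else 0
            for i in range(1, length + 1)]

random.seed(3)
path = sample_P_n(p=0.1, q=0.9, n=1_000, length=200_000)
print(f"frequency over first 500 flips   ~ {sum(path[:500]) / 500:.3f}")  # near p
print(f"frequency over all 200,000 flips ~ {sum(path) / len(path):.3f}")  # near q
```

Elga's merely finitely additive P arises by mixing these $P_n$ with a purely finitely additive distribution over the change point N, which no simulation can sample; the sketch displays only the two-sided behavior of each building block.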

Given an observed history, $h_j = \{X_1 = x_1, X_2 = x_2, \ldots, X_j = x_j\}$, the Bayesian agent in Elga's example assigns a purely finitely additive conditional probability to the distribution of the change point (N) so that, with conditional probability 1, the change point is arbitrarily far off in the future. For each finite history $h_j$ and for each $k = 1, 2, \ldots$, $P(N > k \mid h_j) = 1$. An agent who uses Elga's finitely additive P-model precludes learning about the change-point variable, N. That agent is closed-minded in the relevant sense that, no matter what she observes, she is certain that the change point lies in the yet-to-be-observed future.

So, whenever the agent observes a finite history of coin flips with observed relative frequency of heads near to 9/10, she has high posterior probability for the tail-field event $L_{1/10}$. Likewise, whenever the agent sees a finite history of coin flips with observed relative frequency of heads near to 1/10, she has high posterior probability for the tail-field event $L_{9/10}$. And since this agent is always sure, given each finite history $h_j$, that the change point (N) is in the distant future of the sequence of coin flips, she always assigns arbitrarily high posterior probability to correctly identifying the tail-field event between $L_{1/10}$ and $L_{9/10}$.

For example, this agent assigns probability near 1 to observing indefinitely long finite histories that have observed relative frequencies that linger near 9/10 exactly when the sequence x has a limiting relative frequency of 1/10. This finitely additive credal state satisfies the conclusion of the finitely additive almost-sure convergence-to-the-truth result: almost surely, given the observed histories from a sequence x, the conditional probabilities converge to the correct indicator for the tail behavior of the relative frequencies in x.

Elga's analysis to the contrary is based on having the agent consider conditional probabilities, $P(L_{1/10} \mid h_n)$, at histories $h_n$ that run beyond the change point. But with Elga's finitely additive P-model, the agent's credence is 0 of ever witnessing such a history. That is, Elga's argument, whose conclusion is that the agent's conditional probabilities converge to the wrong indicator function, requires the agent to condition on an event of P-probability 0 (i.e., that she has made finitely many observations that go past the change point in the sequence). But, at each finite stage in the history of observations, this event is part of a P-null event where a failure of the (finitely additive) almost-sure convergence to the truth is excused. Where this case differs from the countably additive one is that with the merely finitely additive probability P, the countable union of all these infinitely many P-null events (namely, that the change point has been reached by the kth observation, $k = 1, 2, \ldots$) is a certain event, since the change point is certain to arrive eventually.

Apart from this peculiar merely finitely additive credal attitude that precludes learning about the change point N, there is something else unsettling about this Bayesian agent’s finitely additive model for coin flips. Perhaps what follows makes clearer what that problem is. Modify Elga’s model to the finitely additive probability P′ so that

$P'(\cdot) = [P_{5/10,1/10}(\cdot) + P_{5/10,9/10}(\cdot)]/2,$

with the change point N chosen, just as before, by a purely finitely additive probability, $P'(N = n) = 0$ for $n = 1, 2, \ldots$. Then the strong-law result applies to tail-field events, and, P′-almost surely, the limiting frequency for heads is either 1/10 or 9/10, just as in Elga's P-model. However, the two finitely additive coins, $P_{5/10,1/10}$ and $P_{5/10,9/10}$, assign the same probability to each finite history of coin flips. Letting $h_n$ denote a specific history of length n,

$P_{5/10,1/10}(h_n) = P_{5/10,9/10}(h_n) = 2^{-n}.$

But then

$P'(L_{1/10} \mid h_n) = P'(L_{9/10} \mid h_n) = 1/2 = P'(L_{1/10}) = P'(L_{9/10}),$

for each possible history. That is, contrary to the strengthened convergence-to-the-truth result, in this modified P′-model the agent is completely certain that her posterior probability for either of the two tail-field hypotheses, $L_{1/10}$ or $L_{9/10}$, is stationary at the prior value 1/2. Under the growing finite histories from each infinite sequence of coin flips, the posterior probability moves neither toward 0 nor toward 1. Within the P′-model, surely there is no convergence to the truth about these two tail-field events given increasing evidence from coin flipping.Footnote 16

Evidently, one aspect of what is unsettling about these finitely additive coin models is that the observed sequence of flips is entirely uninformative about the change point variable, N. No matter what the observed sequence, the agent’s posterior distribution for N is her prior distribution for N, which is a purely finitely additive distribution assigning 0 probability to each possible integer value for N. It is not merely that this Bayesian agent cannot learn about the value of N from finite histories. Also, two such agents who have finitely additive coin models that disagree only on the tail-field parameter cannot use the shared evidence of the finite histories to induce a consensus about the tail-field events since they are both certain that their shared evidence has yet to cross the change point. In the next section, we use these themes about certainty and consensus based on shared evidence to provide a different answer to Belot’s question about what distinguishes modest from immodest credal states.

4. On Standards for Epistemic Modesty Using Asymptotic Merging and Consensus

Peirce (1877) argues that sound methodology needs to defend a proposal for how to resolve interpersonal differences of scientific opinion. He asserts that the scientific method for resolving such disputes wins over other rivals (e.g., apriorism or the method of tenacity) by having the Truth (i.e., observable Reality) win out—by settling debates through an increasing sequence of observations from well-designed experiments. With due irony, much of Peirce's proposal for letting Reality settle the intellectual dispute is embodied within personalist Bayesian methodology.Footnote 17 Here, we review some of those Bayesian resources regarding three aspects of immodesty.

One kind of epistemic immodesty is captured in a dogmatic credal state that is immune to revision from the pressures of new observations. Such a credal state is closed-minded. A closely related second kind of immodesty is that two rival dogmatic positions cannot find a resolution to their epistemic conflicts through shared observations. They are persistent in their closed-mindedness. These two kinds of immodesty suggest that a credal state can be assessed for epistemic immodesty according to three considerations:

  • i) how large is the set of conjectures,

  • ii) how large is the community of rival opinions, and

  • iii) for which sets of sequences of shared observations

does Bayesian conditionalization offer resolution to interpersonal credal conflicts by bringing the different opinions into a consensus regarding the truth. In other words, qualitative degrees of epistemic immodesty are revealed with these three considerations, which synthesize criteria of asymptotic consensus and certainty. We discuss this sense of “immodesty” in the remainder of this section.

We use as our starting point an important result due to Blackwell and Dubins (1962) about countably additive probabilities. Let $\langle X, \mathcal{B}\rangle$ be a measurable Borel product space with the following structure. Consider a denumerable sequence of sets $X_i$ ($i = 1, \ldots$), each with an associated σ-field $\mathcal{B}_i$. Form the infinite Cartesian product $X = X_1 \times X_2 \times \cdots$ of denumerable sequences $x = \langle x_1, \ldots\rangle \in X$, where $x_i \in X_i$. That is, each $x_i$ is an atom of its algebra $\mathcal{B}_i$. In the usual fashion, let the measurable sets in ℬ be the σ-field generated by the measurable rectangles.

Definition: A measurable rectangle $A = (A_1 \times A_2 \times \cdots)$ is one where $A_i \in \mathcal{B}_i$ and $A_i = X_i$ for all but finitely many i.

Blackwell and Dubins (1962) consider the idealized setting where two Bayesian agents have this same measurable space of possibilities, each with her own countably additive personal probability, creating the two measure spaces $\langle X, \mathcal{B}, P_1\rangle$ and $\langle X, \mathcal{B}, P_2\rangle$. Suppose that $P_1$ and $P_2$ agree on which measurable events have probability 0 and admit (countably additive) predictive distributions, $P_i(\cdot \mid X_1, \ldots, X_n)$ ($i = 1, 2$), for each finite history of possible observations.Footnote 18 In order to index how much these two are in probabilistic disagreement, Blackwell and Dubins adopt the total-variation distance. Define

$\rho(P_1(\cdot \mid X_1 = x_1, \ldots, X_n = x_n), P_2(\cdot \mid X_1 = x_1, \ldots, X_n = x_n)) = \sup_E |P_1(E \mid X_1 = x_1, \ldots, X_n = x_n) - P_2(E \mid X_1 = x_1, \ldots, X_n = x_n)|.$

The index ρ is one way to quantify the degree of consensus between the two agents who share the same history of observations, (x1, …, xn). This index focuses on the greatest differences between the two agents’ conditional probabilities.

Here is the related strong-law result about asymptotic consensus (Blackwell and Dubins 1962, theorem 2):

Blackwell-Dubins Result. For $i = 1, 2$, $P_i$-almost surely, $\lim_{n\to\infty} \rho(P_1(\cdot \mid X_1 = x_1, \ldots, X_n = x_n), P_2(\cdot \mid X_1 = x_1, \ldots, X_n = x_n)) = 0$.

In words, the two agents are practically certain that with increasing shared evidence their conditional probabilities will merge, in the very strong sense that the greatest differences in their conditional opinions over all measurable events in ℬ will diminish to 0. And they remain practically certain of this future development for each (nonnull) observed history. Thus, this result supports a conclusion about idealized asymptotic consensus from idealized application of the scientific method that Peirce asserted he could not prove but only defend as having no equal.Footnote 19

Since, for each event in the space ℬ, the familiar strong-law convergence-to-the-truth result applies, separately, to each investigator's opinion, the added feature of merging allows a defense against the charge of individual "immodesty" by showing that two rival opinions come into agreement about the truth, almost surely, in the strong sense provided by the ρ-index. In the setting of Blackwell and Dubins's (1962) result, almost surely two such investigators agree that they can resolve all conflicts in their credal states over all elements of ℬ, and have their posterior probabilities almost surely concentrate on true hypotheses, by sharing increasing finite histories of observations from a sequence x. Thus, in fine Peircean style, they are not open-minded about the efficacy of the scientific method for creating consensus and certainty.
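For a feel of how ρ-merging looks in the simplest conjugate setting, here is a toy sketch, our illustration only, which tracks merging on one-step-ahead predictions rather than the full supremum over ℬ: two agents with different Beta priors over a Bernoulli parameter watch the same flips, and for a binary outcome the total-variation distance between their predictive distributions is just the absolute difference of their predictive probabilities.

```python
import random

# Toy illustration of merging: two mutually absolutely continuous
# Beta-Bernoulli agents update on shared flips; the total-variation
# distance between their next-flip predictive distributions shrinks.
random.seed(7)
theta = 0.3                               # the data-generating chance
a1, b1 = 1.0, 1.0                         # agent 1: uniform Beta(1,1) prior
a2, b2 = 8.0, 2.0                         # agent 2: prior tilted toward heads

for n in range(1, 2001):
    flip = 1 if random.random() < theta else 0
    a1, b1 = a1 + flip, b1 + (1 - flip)   # conjugate Beta updating
    a2, b2 = a2 + flip, b2 + (1 - flip)
    if n in (1, 10, 100, 2000):
        p1, p2 = a1 / (a1 + b1), a2 / (a2 + b2)
        print(f"n = {n:>4}: |P1 - P2| for next flip = {abs(p1 - p2):.5f}")
```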

Schervish and Seidenfeld (1990, sec. 3) explore several variations on this theme by enlarging the set of rival credal states in order to consider larger communities than two investigators and by relaxing the sense of merging (or consensus) that is induced by shared evidence from a common measurable space $\langle X, \mathcal{B}\rangle$. They show that, depending on how large a set of different mutually absolutely continuous probabilities is considered, the character of the asymptotic merging varies. This is where topology plays a useful role in formalizing "immodesty."

Here, we summarize three of those results. Let ℛ be the set of rival credences that conform, pairwise, to the Blackwell-Dubins conditions above. Consider three increasing classes of such communities.

  1. If ℛ is a subset of a convex set of rival credences whose extreme points are compact in the discrete topology, then all of ℛ uniformly satisfies the Blackwell-Dubins merging result. That is, merging in the sense of ρ then occurs simultaneously over all of ℛ.

  2. If ℛ is a subset of a convex set of rival credences whose extreme points are compact in the topology induced by ρ, then all that is assured is a weak-law merging. That is, if $\{P_n, Q_n\}$ is an arbitrary sequence of pairs from ℛ, and R is an arbitrary credence from the set of rivals, then

    $\rho(P_n(\cdot \mid X_1, \ldots, X_n), Q_n(\cdot \mid X_1, \ldots, X_n)) \to 0$ in R-probability.
  3. And if ℛ is a subset of a convex set of rival credences whose extreme points are compact in the weak-star topology induced by ρ, then not even a weak-law merging of the kind reported in class 2 is assured.

It is not surprising, then, that as the community ℛ increases its membership, the kind of consensus that is assured—the version of community-wide probabilistic merging that results from shared evidence—becomes weaker. So, one way to assess the epistemological "immodesty" of a credal state formulated with respect to a measurable space $\langle X, \mathcal{B}\rangle$ is to identify the breadth of the community ℛ of rival credal states that admits merging through increasing shared evidence from ℬ. For example, the agent who thinks each morning that it is highly probable that the world ends later that afternoon has an immodest attitude because there is only the isolated community of like-minded pessimists who can reconcile their views with commonplace evidence that is shared with the rest of us.

When the different opinions do not satisfy the requirement of mutual absolute continuity, the previous results do not apply directly. Instead, we modify an idea from Levi (1980, sec. 13.5) so that different members of a community of investigators modify their individual credences (using convex combinations of rival credal states) in order to give other views a hearing and, in Peircean fashion, in order to allow increasing shared evidence to resolve those differences.

Let $I = \{i_1, \ldots\}$ serve as a finite or countably infinite index set, and let $\mathcal{R} = \{P_i : i \in I\}$ represent a community of investigators, each with her own countably additive credence function $P_i$ on a common measurable space $\langle X, \mathcal{B}\rangle$. It may be that, pairwise, the elements of ℛ are not even mutually absolutely continuous. In order to allow new evidence to resolve differences among the investigators' credences for elements of ℬ (rather than trying, e.g., to preserve common judgments of conditional credal independence between pairs of elements of ℬ), each member of ℛ shifts to a credal state formed by taking a mixture of the investigators' credal states: a "linear pooling" of those states. Specifically, for each $i \in I$, let $\tilde{\alpha}_i = \{\alpha_{ij} : \alpha_{ij} > 0, \sum_{j=1}^\infty \alpha_{ij} = 1\}$ serve as a set of weights that investigator i uses to create the credal state $Q_i = \sum_{j=1}^\infty \alpha_{ij} P_j$ to replace $P_i$. Each $Q_i$ may be self-centered in the following sense: for a given $\varepsilon > 0$, $Q_i$ is self-centered in that $\alpha_{ii} \ge 1 - \varepsilon$. Then, pairwise, the $Q_i$ satisfy the assumptions for the Blackwell-Dubins result despite being self-centered (a toy version of this pooling appears in the sketch below). Depending upon the size of the community ℛ, results 1, 2, and 3 obtain for the replacement credal states $\{Q_i\}$.
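Here is the toy sketch just promised, our illustration of the pooling repair on a three-atom space: each $Q_i$ keeps weight $1 - \varepsilon$ on its own $P_i$, yet assigns positive probability wherever any member of the community does, restoring the mutual absolute continuity that the Blackwell-Dubins result requires.

```python
# Self-centered linear pooling on a toy three-atom space (our sketch).
# The P_i are not mutually absolutely continuous; the pooled Q_i are.
eps = 0.05
P = [
    [0.5, 0.5, 0.0],   # P_1 treats the third atom as null
    [0.0, 0.2, 0.8],   # P_2 treats the first atom as null
    [0.3, 0.3, 0.4],   # P_3 gives every atom positive probability
]
k = len(P)
Q = []
for i in range(k):
    weights = [eps / (k - 1)] * k
    weights[i] = 1 - eps               # self-centered: alpha_ii = 1 - eps
    Q.append([sum(w * P[j][atom] for j, w in enumerate(weights))
              for atom in range(3)])

for i, q in enumerate(Q, start=1):
    print(f"Q_{i} = {[round(x, 4) for x in q]}")   # every atom now positive
```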

We conclude this discussion of probabilistic merging with a reminder that merely finitely additive probability models open the door to reasoning to a foregone conclusion (Kadane, Schervish, and Seidenfeld 1996), in sharp contrast with the almost-sure asymptotic merging and convergence-to-the-truth results associated with countably additive probability models, a contrast already displayed by the P′ model above. Key to these asymptotic results for countably additive probabilities is the Law of Iterated Expectations.

Let X and Y be (bounded) random variables measurable with respect to a countably additive measure space $\langle \Omega, \mathcal{B}, P\rangle$. With $\mathcal{E}[X]$ and $\mathcal{E}[X \mid Y = y]$ denoting, respectively, the expectation of X and the conditional expectation of X, given the event $Y = y$, then

Law of Iterated Expectations: $\mathcal{E}[X] = \mathcal{E}[\mathcal{E}[X \mid Y]]$.

As Schervish, Seidenfeld, and Kadane (1984) established, each merely finitely (and not countably) additive probability defined on a σ-field of sets fails this law, even when the variable X is an indicator variable. That is, each merely finitely additive probability fails to be conglomerable in some denumerable partition, here associated with the random quantity Y. Specifically, with a merely finitely additive probability P, there exists a measurable hypothesis H and a denumerable partition of measurable events $\pi = \{E_i : i = 1, \ldots\}$, where

$P(H) < \inf_{E_i \in \pi} P(H \mid E_i).$

Then, contrary to the Law of Iterated Expectations, with the expectation $\mathcal{E}$ taken over the events $E_i \in \pi$,

$P(H) < \mathcal{E}[P(H \mid E_i)].$

Let the random variable Y have a denumerable sample space, $Y = \{y_1, y_2, \ldots\}$. Associate the event $E_i$ with the outcome $Y = y_i$. Then, if P is nonconglomerable for H in the partition $\pi_Y$ generated by Y, P fails the Law of Iterated Expectations in $\pi_Y$.
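To make nonconglomerability concrete, consider a standard style of example; this is our illustration, not one from Schervish, Seidenfeld, and Kadane (1984). Let P make $N_1$ and $N_2$ independent and "uniform" over the positive integers in the merely finitely additive sense, so that every single integer value, and hence every finite set of values, receives probability 0. Take $H = \{N_1 > N_2\}$ and the denumerable partition $\pi = \{E_i\}$ with $E_i = \{N_2 = i\}$. Then, for every i,

$P(H \mid E_i) = P(N_1 > i) = 1,$

since the finite set $\{1, \ldots, i\}$ carries probability 0. Yet, under the usual construction, symmetry gives $P(N_1 > N_2) = P(N_2 > N_1)$, with $P(N_1 = N_2) = 0$, so that

$P(H) = 1/2 < 1 = \inf_{E_i \in \pi} P(H \mid E_i),$

a failure of conglomerability in the denumerable partition π of exactly the kind displayed above.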

A set of such merely finitely additive probabilities, each of which is nonconglomerable in the same partition of the shared evidence, can display reasoning to contrary foregone conclusions both in the short run and asymptotically with increasing shared evidence. Because the investigators’ conditional probabilities for a pair of contrary hypotheses {H 1, H 2} are nonconglomerable in the partitions of their increasing shared evidence, each investigator may become increasingly certain of a different hypothesis as a function solely of the sample size of their shared evidence, regardless of what those samples reveal. Moreover, this assured increasing divergence in their updated opinions is a fact they are aware of ex ante.

The lesson we draw is this: Bayesian agents who use merely finitely additive probabilities face a trade-off between the added flexibility in modeling that comes with relaxing the constraint of countable additivity and the added restrictions on the kinds of shared evidence necessary to achieve the desirable methodological laws about asymptotic consensus and certainty illustrated in the countably additive strong laws.

5. Summary

Savage (1954) and Blackwell and Dubins (1962) offer important results showing that Bayesian methodology uses increasing shared evidence in order to temper and to resolve interpersonal disagreements about personal probabilities. We contrast interpersonal standards of asymptotic consensus about certainties that arise from a sequence of shared evidence with Belot's (2013) proposal to use a topological standard of "meagerness" in order to determine when a credal state is immodest, based on a standalone assessment of that credal state.

We understand Belot to endorse topological condition 1, which requires that comeager sets are assigned positive probability. Where a probability model treats a comeager set as null, that shows the model is immodest because it dismisses a topologically large set as probabilistically negligible. But, in the light of the fact that the set of sequences whose frequencies oscillate maximally is comeager, we see that all the familiar probability models violate condition 1. We believe that Belot also endorses condition 2, which requires that a typical set of sequences should receive a typical probability; that is, a meager set should be assigned probability 0. This topological standard entails extreme a priori credences about the behavior of observed relative frequencies. Condition 2 mandates that, with probability 1, observed frequencies oscillate maximally in order to avoid being contained in a meager set. This creates its own kind of dogmatism, since (almost surely) the conditional probability from this model persists in assigning conditional probability 1 to the hypothesis that observed frequencies oscillate maximally.

In contrast with Belot’s approach, in section 4 we outline a different strategy for assessing epistemic modesty/immodesty, based on considerations of both asymptotic certainty and consensus among investigators who share evidence. Belot’s strategy is to impose additional requirements that, in the spirit of coherence, apply to a standalone credence function. We follow, for example, Peirce in requiring that sound scientific methodology provides investigators with the resources to resolve interpersonal disagreements through shared evidence. This consideration allows for results about conditions for asymptotic consensus among a set of investigators to serve also as a standard for their epistemic modesty regarding interpersonal disagreements.

As a separate issue, in section 3 we discuss Elga's (2016) reply to Belot's analysis. Elga focuses on the assumption of countable additivity in the strengthened convergence results. His rebuttal to Belot's analysis uses a merely finitely additive probability P to illustrate that merely finitely additive conditional probabilities need not satisfy the countably additive asymptotic (strong-law) convergence results. These are the results that Belot argues reveal an immodesty in the countably additive Bayesian methodology.

We agree with Elga (as has been argued before) that the asymptotics of merely finitely additive conditional probabilities are different in kind from those of countably additive conditional probabilities. But we do not agree with Elga about which are the relevant asymptotic results in his P-model for assessing Bayesian learning of limiting frequencies. In addition, the P-model fails condition 1, which we understand is one of Belot’s standards for modesty.

As we illustrate in section 3, the conditional probabilities arising from a different (but related) merely finitely additive probability P′ fail the asymptotic certainty and consensus results that follow when either Savage’s or Blackwell and Dubins’s analysis applies. We argue that the added generality afforded by merely finitely additive probabilities over countably additive probabilities carries an extra price if merely finitely additive probabilities are to be used reasonably. They require more restrictive conditions than do countably additive probabilities, if the sequence of conditional probabilities that arise from an increasing sequence of shared evidence is to resolve interpersonal credal disagreements.

Appendix

In his classic discussion of measure and category, Oxtoby (1980, theorems 1.6 [p. 4] and 16.5 [p. 64]) establishes that, quite generally, a topological space that also carries a Borel measure can be partitioned into two sets: one is a measure 0 set, and the other, which is its complement, is a meager set. Here we show (theorem A1) that this tension between probabilistic and topological senses of being a "small" set generalizes to sequences of random variables relative to a large class of infinite product topologies. We follow that result with a corollary, namely, proposition 1 in the main text is an instance of theorem A1 for binary sequences.

Let χ be a set with topology ℑ and Borel σ-field ℬ, that is, the σ-field generated by the open sets in ℑ. Let $\chi^\infty$ be the countable product set with the product topology $\Im^\infty$ and product σ-field $\mathcal{B}^\infty$, which is also the Borel σ-field for the product topology (because it is a countable product). Let $\langle \Omega, \mathcal{A}, P\rangle$ be a probability space, and let $\{X_n\}_{n=1}^\infty$ be a sequence of random quantities such that, for each n, $X_n : \Omega \to \chi$ is $\mathcal{A}$- and ℬ-measurable. Define $X : \Omega \to \chi^\infty$ by $X(\omega) = \langle X_1(\omega), X_2(\omega), \ldots\rangle$. Let $S_X = X(\Omega)$ be the image of X, that is, the set of sample paths of X. We denote elements of $S_X$ as $y = \langle y_1, y_2, \ldots\rangle$. The set $S_X$ is a subset of $\chi^\infty$. Therefore, we endow $S_X$ with the subspace topology. In the remainder of this appendix, we identify certain subsets of $S_X$ as being either meager or comeager. These results depend solely on the topology for the measurable space $\langle \Omega, \mathcal{A}\rangle$, and not on the probability P. However, the probability P is needed in order to display the tension between the two rival senses of being a "small" set.

In what follows we require a degree of "logical independence" between the $X_n$'s. In particular, we need the sequence $\{X_n\}_{n=1}^\infty$ to be capable of moving to various places in χ regardless of where it has been so far.

Condition A: Specifically, for each j, let $B_j$ be a set such that $B_j$ has nonempty interior $B_j^o$. Assume that for each n, for each $x = \langle x_1, \ldots, x_n\rangle \in \langle X_1, \ldots, X_n\rangle(\Omega)$, and for each j, there exists a positive integer $c(n, j, x)$ such that $\langle X_1, \ldots, X_n, \ldots, X_{n+c(n,j,x)}\rangle^{-1}(\{x\} \times \chi^{c(n,j,x)-1} \times B_j^o) \neq \emptyset$.

Condition A asserts that, no matter where the sequence of random variables has been up to time n, there is a finite time, $c(n, j, x)$, after which it is possible that the sequence reaches the set $B_j^o$. For example, suppose that each $X_n$ is the average of the first n in a sequence of Bernoulli random variables and that $\{\varepsilon_j\}_{j=1}^\infty$ is a sequence of positive real numbers whose limit is 0. If $B_j = (0, \varepsilon_j)$ for even j and $B_j = (1 - \varepsilon_j, 1)$ for odd j, then, independent of the particular sequence x, the longest we would have to wait to reach $B_j$ is

$c_{n,j} = \lceil n(1 - \varepsilon_j)/\varepsilon_j \rceil + 1,$

in order to be sure that there is a sample path that takes us from an arbitrary initial sample path of length n to $B_j$ by time $n + c_{n,j}$. Thus, $c_{n,j}$ is a worst-case bound for waiting. For some $x = \langle x_1, \ldots, x_n\rangle$, the minimum $c(n, j, x)$ might be much smaller than this $c_{n,j}$. For instance, with jointly continuous random variables with strictly positive joint density in which $\langle X_1, \ldots, X_n\rangle(\Omega) = \chi^n$ for all n, then $c(n, j, x) = 1$ for all n, j, and x.
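A quick numerical check of this worst-case bound, under our ceiling reading of the reconstructed formula: starting from the worst prefix of n ones, appending $c_{n,j}$ zeros drives the sample average below $\varepsilon_j$, so the path enters $B_j = (0, \varepsilon_j)$.

```python
from math import ceil

# Check the worst-case waiting bound: after n ones, appending
# c = ceil(n*(1 - eps)/eps) + 1 zeros pulls the average below eps.
def c_bound(n: int, eps: float) -> int:
    return ceil(n * (1 - eps) / eps) + 1

for n, eps in [(10, 0.5), (100, 0.1), (1000, 0.01)]:
    m = c_bound(n, eps)
    avg = n / (n + m)            # average after n ones then m zeros
    print(f"n={n}, eps={eps}: c={m}, new average={avg:.5f}, below eps? {avg < eps}")
```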

For each $y \in S_X$, define $\tau_0(y) = 0$, and for $j > 0$, define

$\tau_j(y) = \begin{cases} \min\{n > \tau_{j-1}(y) : y_n \in B_j\} & \text{if the minimum is finite,} \\ \infty & \text{if not.} \end{cases}$

Let $B = \{y \in S_X : \tau_j(y) < \infty \text{ for all } j\}$, and let $A = S_X \setminus B = B^c \cap S_X$.

Note that A is the set of sample paths each of which fails to visit at least one of the $B_j$ sets in the order specified. Because we do not require that the sets $B_j$ are nested, it is possible that the sequence reaches $B_k$ for all $k > j$ without ever reaching $B_j$. Or the sequence could reach $B_j$ before reaching $B_{j-1}$ but not after.

Theorem A1. A is a meager set.

Proof. Write $A = \bigcup_j C_j$, where $C_j = \{y \in S_X : \tau_j(y) = \infty\}$. Then A is meager if and only if $C_j$ is meager for every j. We prove that $C_j$ is meager for every j by induction.

Start with $j = 1$. We have $\tau_1(y) = \infty$ if and only if $y \in C_1 = \bigcap_{n=1}^\infty \{y \in S_X : y_n \in B_1^c\}$. To see that $C_1$ is meager, notice that $C_1^c = \bigcup_{n=1}^\infty D_n$, where

$D_1 = S_X \cap (B_1 \times \chi^\infty),$

and for n > 1

$D_n = S_X \cap (\chi^{n-1} \times B_1 \times \chi^\infty).$

Each $D_n$ contains a nonempty sub-basic open set $O_n$ obtained by replacing $B_1$ in the definition of each $D_n$ by its interior $B_1^o$. So $C_1^c$ contains the nonempty open set $O = \bigcup_{n=1}^\infty O_n$.

Next, we show that O is dense; hence, $C_1$ is meager, as it is nowhere dense. We verify that $O \cap E \neq \emptyset$ for every nonempty basic open set E. If E is a nonempty basic open set, then there exists an integer k and there exist nonempty open subsets $E_1, \ldots, E_k$ of χ such that

$E = S_X \cap (E_1 \times \cdots \times E_k \times \chi^\infty).$

Let $y \in E$, and let $x^k$ be the first k coordinates of y. Then there exist points in $S_X$ whose first k coordinates are $x^k$ and whose $(k + c(k, 1, x^k))$th coordinate lies in $B_1^o$. Hence,

$O \cap E \supseteq S_X \cap (E_1 \times \cdots \times E_k \times \chi^{c(k,1,x^k)-1} \times B_1^o \times \chi^\infty) \neq \emptyset.$

Next, for $j > 1$, assume that $C_r$ is meager for all $r < j$. To complete the induction, we show that $C_j$ is meager, which follows the same reasoning as in the base case. Write

$C_j = C_{j-1} \cup \bigcup_{r=j-1}^\infty F_r,$

where $F_r = \{y \in S_X : \tau_{j-1}(y) = r \text{ and } y_n \in B_j^c \text{ for all } n > r\}$. It suffices to show that each $F_r$ is meager.

Notice that $F_r$ is a subset of

$G_r = \{y \in S_X : y_r \in B_{j-1}\} \cap \{y : y_n \in B_j^c \text{ for all } n > r\}.$

It suffices to show that Gr is meager.

As in the case with $j = 1$, write $G_r^c = \{y \in S_X : y_r \in B_{j-1}^c\} \cup \bigcup_{n=r+1}^\infty D_n$, where $D_n = S_X \cap (\chi^{n-1} \times B_j \times \chi^\infty)$. Each $D_n$ contains a nonempty sub-basic open set $O_n$ obtained by replacing $B_j$ in the definition of each $D_n$ by its interior $B_j^o$. So $G_r^c$ contains a nonempty open set $O = \bigcup_{n=r+1}^\infty O_n$.

Finally, we establish that O is dense; hence, Gr is meager. Reasoning as in the base case with j=1, we verify that OE for every nonempty basic open set E. If E is a nonempty basic open set, then there exists an integer k and there exist nonempty open subsets E 1, …, Ek of χ such that E=SX(E1××Ek×χ). Let yE, and let xk be the first k coordinates of y. Then there exist points in SX whose first k coordinates are xk and whose k+c(k,j,xk) coordinate lies in Bjo. Hence,

$$O \cap E \supseteq S_X \cap (E_1 \times \cdots \times E_k \times \chi^{c(k,j,x^k)-1} \times B_j^o \times \chi^{\infty}) \neq \emptyset,$$

which completes the induction. QED

Next, return to the sequence $\{X_n\}_{n=1}^{\infty}$ of random variables described earlier. Suppose that each $X_n$ is the sample average of some other sequence of random variables. That is, $X_n = (1/n) \sum_{k=1}^{n} Y_k$, where each $Y_k$ is finite. Assume that condition A obtains; namely, assume that the dependence between the $Y_k$ is small enough that $c(n, j, x^k) < \infty$ for all $n$, $j$, and $x^k$. For example, assume that there exist $c < d$, with $c$ either finite or $c = -\infty$ and $d$ either finite or $d = \infty$, such that for each $j > 1$ and each $y \in \langle Y_1, \ldots, Y_{j-1} \rangle(\Omega)$,

$$\sup_{\omega \in A_y} Y_j(\omega) = d \quad \text{and} \quad \inf_{\omega \in A_y} Y_j(\omega) = c,$$

where

$$A_y = \{\omega : \langle Y_1(\omega), \ldots, Y_{j-1}(\omega) \rangle = y\}.$$

Then condition A obtains for a sequence of iid random variables. It also obtains for a sequence of random variables such that $\{Y_1, \ldots, Y_n\}$ has strictly positive joint density over $(c, d)^n$ for all $n$. In such a case, we could let

$$B_j = \begin{cases} [c, a] & \text{if } c \text{ is finite,} \\ (-\infty, a] & \text{if } c = -\infty, \end{cases}$$

where $c < a < d$. Then $A$ contains all sample paths for which $\liminf_{n \to \infty} X_n > a$, along with some sample paths for which $\liminf_{n \to \infty} X_n = a$. If we repeat the construction of $A$ for a countable collection of $a_n$ with $a_n \downarrow c$, then the union of all of the $A$ sets is meager. Hence the set of sample paths for which $\liminf_{n \to \infty} X_n > c$ is meager. A similar construction shows that the set of sample paths for which $\limsup_{n \to \infty} X_n < d$ is meager. Hence the union of these last two sets is meager, and the set of sample paths along which $X_n$ oscillates maximally is comeager.
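Although meager/comeager claims are topological rather than probabilistic, a maximally oscillating path is easy to construct explicitly. The sketch below (ours, for the binary case with $c = 0$ and $d = 1$) greedily extends a sequence so that its running average dips below $1/k$ and then climbs above $1 - 1/k$ for successive $k$:

```python
def oscillating_prefix(rounds):
    """Greedily build a binary prefix whose running average dips below 1/k
    and then climbs above 1 - 1/k for k = 2, ..., rounds + 1; continuing
    this forever forces liminf X_n = 0 and limsup X_n = 1."""
    seq, total = [], 0
    for k in range(2, rounds + 2):
        while seq and total / len(seq) >= 1 / k:         # append 0s: push average down
            seq.append(0)
        while not seq or total / len(seq) <= 1 - 1 / k:  # append 1s: push average up
            seq.append(1)
            total += 1
    return seq

prefix = oscillating_prefix(4)
avgs = [sum(prefix[:n]) / n for n in range(1, len(prefix) + 1)]
print(len(prefix), min(avgs), max(avgs))  # averages range from below 0.2 up to 1.0
```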

Theorem A1 applies directly to the sequence $\{X_n\}_{n=1}^{\infty}$. It shows that certain sets of sample paths of this sequence are meager or comeager. If, as in the case of sample averages, each $X_n$ is a function of $\{Y_1, \ldots, Y_n\}$, we can evaluate the category of a set of sample paths of the $\{Y_n\}_{n=1}^{\infty}$ sequence. If $\langle X_1, X_2, \ldots \rangle$ is a bicontinuous function of $\langle Y_1, Y_2, \ldots \rangle$, then the two sets of sample paths are homeomorphic. In particular, this implies that the category of a set of sample paths of one sequence is the same as the category of the corresponding set of sample paths of the other sequence: the one is meager if and only if the other is.

In the case of sample averages, we can exhibit the bicontinuous function explicitly. To be specific, let $\chi = \Re$, and for each $n$, define $X_n = (1/n) \sum_{k=1}^{n} Y_k$. Let $X = \langle X_1, X_2, \ldots \rangle$ as above, with $Y = \langle Y_1, Y_2, \ldots \rangle$. Let $S_X$ and $S_Y$ be the sets of sample paths of $X$ and $Y$, respectively. That is, $S_X = X(\Omega)$ and $S_Y = Y(\Omega)$. For each $y \in S_Y$, define

$$\phi(y) = \left(y_1, \ldots, \frac{1}{n} \sum_{k=1}^{n} y_k, \ldots \right).$$

For each $x \in S_X$, define

$$\varphi(x) = (x_1,\; 2x_2 - x_1,\; \ldots,\; n x_n - (n-1) x_{n-1},\; \ldots).$$

Then, by construction, $\phi(Y) = X$ and $\varphi(X) = Y$. That is, $\varphi : S_X \to S_Y$ is the inverse of $\phi : S_Y \to S_X$. For the categories of the two sets of sample paths to be the same, it is sufficient that both $\phi$ and $\varphi$ are continuous. If they are continuous as functions both from and to $\Re^{\infty}$, then they will be continuous in their subspace topologies. It suffices to show that $\phi^{-1}(B)$ and $\varphi^{-1}(B)$ are open for each sub-basic open set $B$. Every sub-basic open set is of the form $B = \prod_{n=1}^{\infty} B_n$, where each $B_n = \Re$ except for one value of $n = n_0$ for which $B_{n_0}$ is open as a subset of $\Re$. Then each of $\phi^{-1}(B)$ and $\varphi^{-1}(B)$ has the form $C \times \Re^{\infty}$, where $C$ is an open subset of $\Re^{n_0}$; hence, both sets are open, and we have that $S_X$ is homeomorphic to $S_Y$. Proposition 1 in the main text is an instance of theorem A1 for binary sequences.
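On finite prefixes, the two maps and their inverse relationship can be checked numerically. A minimal sketch (ours), with lists standing in for infinite sequences:

```python
def averages(y):
    """phi: partial averages of y, with x_n = (1/n) * sum_{k <= n} y_k."""
    out, total = [], 0.0
    for n, yn in enumerate(y, start=1):
        total += yn
        out.append(total / n)
    return out

def increments(x):
    """varphi: recover y from its running averages, y_n = n*x_n - (n-1)*x_{n-1}."""
    return [n * xn - (n - 1) * xp
            for n, (xn, xp) in enumerate(zip(x, [0.0] + x[:-1]), start=1)]

y = [3.0, -1.0, 4.0, 1.0, -5.0]
x = averages(y)          # [3.0, 1.0, 2.0, 1.75, 0.4]
assert all(abs(a - b) < 1e-12 for a, b in zip(increments(x), y))
```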

Footnotes

1. Let $H$ be a (simple) statistical hypothesis for the random variable $X$, i.e., one for which the conditional probability distribution $P(X \mid H)$ is well defined. Given $H$, a random sample of size $n$ from this distribution, $\{X_1, \ldots, X_n\}$, has a joint distribution that is independent and identically distributed (iid) with respect to $P(X \mid H)$: $P(X_1, \ldots, X_n \mid H) = \prod_{i=1}^{n} P(X_i \mid H)$.

2. Say that an investigator with degree of belief represented by a probability $P(\cdot)$ holds a nonextreme opinion about an event $E$ if $0 < P(E) < 1$.

3. Savage’s (1954) axiomatic theory of preference, based on postulates P1–P7, is about an idealized Bayesian agent’s static preference relation over pairs of acts—preferences at one time in the idealized agent’s life. The theory of personal probability and conditional probability that follows from P1–P7 is about an idealized agent’s epistemic state at that one time: her degrees of belief and conditional degrees of belief at that one time. More familiar in the Bayesian literature is a dynamic Bayesian rule where conditional probability models the idealized agent’s changing beliefs over time, when new evidence is accepted. For details on differences between the static and dynamic use of conditional probability, see Levi (1980, sec. 4.3).

4. We illustrate the weak and strong laws of large numbers for iid Bernoulli trials. Let $X$ be a Bernoulli variable with possible values $\{0, 1\}$, where $P(X = 1) = p$, for some $0 \le p \le 1$. Let $X_i$ ($i = 1, 2, \ldots$) be a denumerable sequence of Bernoulli variables with a common parameter $P(X_i = 1) = p$ and where trials are independent. Independence is expressed as follows. For each $n = 1, 2, \ldots$, let $S_n = \sum_{i=1}^{n} X_i$. Then $P(X_1 = x_1, \ldots, X_n = x_n) = p^{S_n}(1-p)^{n - S_n}$. The weak law of large numbers for iid Bernoulli trials asserts that for each $\varepsilon > 0$, $\lim_{n \to \infty} P(|S_n/n - p| > \varepsilon) = 0$. The strong law of large numbers asserts that $P(\lim_{n \to \infty} S_n/n = p) = 1$. If $P$ is countably additive, the strong-law version entails the weak-law version.
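A quick Monte Carlo sketch (ours; for small $n$ the probability can also be computed exactly from the binomial distribution) illustrates the weak-law statement, with the estimated probability shrinking as $n$ grows:

```python
import random

def weak_law_estimate(n, p, eps, trials=20_000, seed=0):
    """Estimate P(|S_n/n - p| > eps) for n iid Bernoulli(p) trials."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))  # S_n for one simulated run
        if abs(s / n - p) > eps:
            hits += 1
    return hits / trials

for n in (10, 100, 1000):
    # With p = 0.5 and eps = 0.05, the estimates fall toward 0 as n grows.
    print(n, weak_law_estimate(n, p=0.5, eps=0.05))
```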

5. A topologically meager set is one that is a denumerable union of nowhere dense sets. A comeager set, or a residual set, is the complement of a meager set.

6. A helpful referee suggests that Belot might instead restrict topological condition 1 to the space of chance hypotheses rather than extending it to the space of observables as we do. However, Belot’s (2013, sec. 4) criticism of Bayesian methodology depends on applying topological standards for a “typical event” to probabilities over hypotheses defined in terms of sequences of observables. There are no chance hypotheses in those cases.

7. Belot (2013, 488), remark 2, notes that when $R$ is a meager element of a measurable space 〈Ω, ℬ〉, the set of probabilities that assign $R$ probability 0 is comeager in the space of probability distributions on the same measurable space 〈Ω, ℬ〉. This points to what we label as topological condition 2. It is a higher-order application of Belot’s idea that a topologically typical set (i.e., a comeager set) should be reflected with probability 1. For condition 2, that reasoning is applied to typical sets of probabilities.

8. For each $0 \le y \le 1$ with $y \neq 1/2$, $N$ contains at least one sequence with limiting relative frequency $y$, and these are pairwise different sequences for different values of $y$.

9. The resulting product space is homeomorphic to the Cantor Space.

10. Oxtoby (1980, 99) sketches the proof of this claim. The claim follows from an elegant application of the Banach-Mazur Game. Belot’s (2013, 498) remark 5, n. 41, adapts Oxtoby’s argument, which we paraphrase as follows: Consider a point $x$ in Cantor Space. A prior $P$ is ‘open minded’ with respect to the hypothesis $x$ provided that, given any finite initial segment of $x$, $(x_1, \ldots, x_m)$, there is a finite continuation $(x_1, \ldots, x_m, x_{m+1}, \ldots, x_n)$ where $P(\{x\} \mid (x_1, \ldots, x_n)) > .50$, and there exists some other finite continuation of $(x_1, \ldots, x_m)$, $(x_{m+1}, \ldots, x_{n'})$, where $P(\{x\} \mid (x_1, \ldots, x_{n'})) < .50$. Say that hypothesis $x$ flummoxes prior $P$ provided that, for infinitely many values of $n$, $P(\{x\} \mid (x_1, \ldots, x_n)) > .50$ and for infinitely many values of $n$, $P(\{x\} \mid (x_1, \ldots, x_n)) < .50$. Then the set of sequences in Cantor Space (i.e., the set of hypotheses) that flummox an open-minded $P$ is comeager in the infinite product topology on Cantor Space. What Belot observes is a special case of proposition 1, which we introduce and discuss next.

11. Oxtoby’s (1980, 64) theorem 16.5 establishes that if the measure space 〈X, ℬ, P〉 satisfies that $P$ is nonatomic, $X$ has a metrizable topology $M_T$ with a base whose cardinality is less than the first weakly inaccessible cardinal, and the σ-field ℬ includes the Borel sets of $M_T$, then $X$ can be partitioned into a set of $P$-measure 0 and a meager set.

12. This proposition is established by Calude and Zamfirescu (1999) using an application of Oxtoby’s (1957) theorem for the Banach-Mazur Game. In the appendix, we establish the more general theorem A1 with a direct argument, which extends beyond Oxtoby’s (1957) theorem for the Banach-Mazur Game and which has proposition 1 as a corollary.

13. An event $T$ belongs to the tail field of $2^{\omega}$ provided that, when a point $x$ belongs to $T$, so too does each point $x'$ that differs from $x$ in only finitely many coordinates. It is straightforward to verify that the set of tail-field events of $2^{\omega}$ forms a field.

14. Regarding each of the two tail-field hypotheses, $L_{1/10}$ and $L_{9/10}$, the $P$-credences are open-minded, since $P(L_{1/10} \mid X_1, \ldots, X_n)$ may cross the 0.5 threshold, in either direction, as a function of finitely many future observations $\{X_{n+1}, \ldots, X_{n+k}\}$. The $P$-credences are modest, since ex ante, given that the infinite sequence $x$ belongs to $L_{1/10}$ (or to $L_{9/10}$), the agent assigns a probability greater than 0 to the respective failure set of sequences.

15. Elga follows Rao and Rao (1983, 39–40) in using the technique of Banach limits to establish the existence of a finitely additive probability corresponding to the $P_{p,q}$ distribution on repeated flips of the coin, based on the set of countably additive probabilities $\{P_n\}$. The method we use here, where the change point $N$ is incorporated explicitly as a random variable in the finitely additive joint probability model, generates all the $P_{p,q}$ distributions over sequences of repeated flips of a coin that may be obtained with Banach limits. In addition, however, it provides the machinery needed to assess the agent’s conditional credences $P(N \mid X_1, \ldots, X_n)$, which reflect the agent’s opinion about whether the sequence of coin flips has passed the change point. Elga’s reasoning ignores the fact that, for each $n = 1, \ldots$, $P(N > n \mid X_1, \ldots, X_n) = 1$, which is what the alternative analysis makes salient.

16. The $P'$-model does not contradict Savage’s (1954) finitely additive weak-law result, which we reported in sec. 1. That is, the $P'$-model does not satisfy Savage’s requirement that the rival statistical hypotheses, $P_{5/10,\,1/10}(\cdot)$ and $P_{5/10,\,9/10}(\cdot)$, have different likelihood functions given some $P'$-non-null data.

17. The irony is, of course, that Peirce objected to conceptualism (i.e., personalist probabilities) because he thought that it inappropriately combined subjective and objective senses of “probability.” See Peirce (1878).

18. Blackwell and Dubins use the concept of predictive distributions to mean those that admit regular conditional distributions with respect to the subalgebra of rectangular events. (See Breiman [1968, 77] for a definition of a regular conditional distribution.) For discussion of countably additive probabilities that do not admit regular conditional distributions, see Seidenfeld, Schervish, and Kadane (2001, 1614, corollary 1).

19. Peirce (1877, sec. 5) writes, “To satisfy our doubts, therefore, it is necessary that a method should be found by which our beliefs may be determined by nothing human, but by some external permanency—by something upon which our thinking has no effect. Some mystics imagine that they have such a method in a private inspiration from on high. But that is only a form of the method of tenacity, in which the conception of truth as something public is not yet developed. Our external permanency would not be external, in our sense, if it was restricted in its influence to one individual. It must be something which affects, or might affect, every man. And, though these affections are necessarily as various as are individual conditions, yet the method must be such that the ultimate conclusion of every man shall be the same. Such is the method of science. Its fundamental hypothesis, restated in more familiar language, is this: There are Real things, whose characters are entirely independent of our opinions about them; those Reals affect our senses according to regular laws, and, though our sensations are as different as are our relations to the objects, yet, by taking advantage of the laws of perception, we can ascertain by reasoning how things really and truly are; and any man, if he have sufficient experience and he reason enough about it, will be led to the one True conclusion. The new conception here involved is that of Reality. It may be asked how I know that there are any Reals. If this hypothesis is the sole support of my method of inquiry, my method of inquiry must not be used to support my hypothesis. The reply is this: 1. If investigation cannot be regarded as proving that there are Real things, it at least does not lead to a contrary conclusion; but the method and the conception on which it is based remain ever in harmony. No doubts of the method, therefore, necessarily arise from its practice, as is the case with all the others. 2. The feeling which gives rise to any method of fixing belief is a dissatisfaction at two repugnant propositions. But here already is a vague concession that there is some one thing which a proposition should represent. Nobody, therefore, can really doubt that there are Reals, for, if he did, doubt would not be a source of dissatisfaction. The hypothesis, therefore, is one which every mind admits. So that the social impulse does not cause men to doubt it. 3. Everybody uses the scientific method about a great many things, and only ceases to use it when he does not know how to apply it. 4. Experience of the method has not led us to doubt it, but, on the contrary, scientific investigation has had the most wonderful triumphs in the way of settling opinion. These afford the explanation of my not doubting the method or the hypothesis which it supposes; and not having any doubt, nor believing that anybody else whom I could influence has, it would be the merest babble for me to say more about it. If there be anybody with a living doubt upon the subject, let him consider it.”

References

Belot, G. 2013. “Bayesian Orgulity.” Philosophy of Science 80:483–503.
Blackwell, D., and Dubins, L. 1962. “Merging of Opinions with Increasing Information.” Annals of Mathematical Statistics 33:882–87.
Breiman, L. 1968. Probability. Reading, MA: Addison-Wesley.
Calude, C., and Zamfirescu, T. 1999. “Most Numbers Obey No Probability Laws.” Publicationes Mathematicae Debrecen 54:619–23.
de Finetti, B. 1937/1964. “Foresight: Its Logical Laws, Its Subjective Sources.” In Studies in Subjective Probability, ed. H. E. Kyburg Jr. and H. Smokler, trans. H. E. Kyburg Jr., 93–158. New York: Wiley.
de Finetti, B. 1974. Theory of Probability. Vol. 1. New York: Wiley.
Dubins, L. E., and Savage, L. J. 1965. How to Gamble If You Must: Inequalities for Stochastic Processes. New York: McGraw-Hill.
Edwards, W., Lindman, H., and Savage, L. J. 1963. “Bayesian Statistical Inference for Psychological Research.” Psychological Review 70:193–242.
Elga, A. 2016. “Bayesian Humility.” Philosophy of Science 83:305–23.
Halmos, P. 1950. Measure Theory. New York: Van Nostrand.
Kadane, J. B., Schervish, M. J., and Seidenfeld, T. 1986. “Statistical Implications of Finitely Additive Probability.” In Bayesian Inference and Decision Techniques, ed. P. K. Goel and A. Zellner, 59–76. Amsterdam: Elsevier.
Kadane, J. B., Schervish, M. J., and Seidenfeld, T. 1996. “Reasoning to a Foregone Conclusion.” Journal of the American Statistical Association 91:1228–35.
Levi, I. 1980. The Enterprise of Knowledge. Cambridge, MA: MIT Press.
Oxtoby, J. C. 1957. “The Banach-Mazur Game and Banach Category Theorem.” In Contributions to the Theory of Games, Vol. 3, ed. M. Dresher, A. W. Tucker, and P. Wolfe, 159–63. Princeton, NJ: Princeton University Press.
Oxtoby, J. C. 1980. Measure and Category. 2nd ed. Dordrecht: Springer.
Peirce, C. S. 1877. “The Fixation of Belief.” Popular Science Monthly.
Peirce, C. S. 1878. “The Probability of Induction.” Popular Science Monthly.
Rao, K. P., and Rao, M. B. 1983. Theory of Charges: A Study of Finitely Additive Measures. London: Academic Press.
Savage, L. J. 1954. The Foundations of Statistics. New York: Wiley.
Schervish, M. J., and Seidenfeld, T. 1990. “An Approach to Certainty and Consensus with Increasing Evidence.” Journal of Statistical Planning and Inference 25:401–14.
Schervish, M. J., Seidenfeld, T., and Kadane, J. B. 1984. “The Extent of Non-conglomerability of Finitely Additive Probabilities.” Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 66:205–26.
Seidenfeld, T., Schervish, M. J., and Kadane, J. B. 2001. “Improper Regular Conditional Distributions.” Annals of Probability 29:1612–24.