1. Introduction
An account of unification of phenomena by theory should make it clear what reasons, if any, we might have for preferring a theory that provides a unified account of what otherwise might seem to be independent phenomena. Is the ability of a theory to unify phenomena an epistemic virtue, relevant to the degree of confidence we can justifiably place in the theory, or is it merely a pragmatic one? If epistemic, is this based on a priori knowledge that Nature is simple, or does the ability of the theory to provide a unified account of disparate phenomena contribute to the evidential support these phenomena lend to the theory?
There is a surprisingly persistent tradition in the philosophy of science that locates the whole of the empirical support of a theory in its ability to save the phenomena. On such an account, if two theories both account for all available empirical evidence, then the choice between them, if a choice is to be made at all, must rest on extra-empirical considerations. Simplicity and unification are popular candidates for such extra-empirical considerations; these are often held to be pragmatic, or perhaps even aesthetic, virtues, rather than epistemic virtues. We should ask whether simplicity and unification can be defended from the charge of being extra-empirical virtues and regarded instead as contributory to the empirical support of the theory. It will be argued below that these qualities can play a role in the degree to which empirical evidence supports a theory.
It is the purpose of this paper to give a Bayesian account of unification that captures one interesting sense in which a theory can unify phenomena. On this account, the ability of a theory to unify phenomena consists in its ability to render what, on prior grounds, appear to be independent phenomena informationally relevant to each other. It will be shown that it is a consequence of Bayes’ theorem that this ability does, indeed, contribute to the degree of support provided to the theory by the phenomena unified by the theory; one need not base a preference for theories that unify on a prior belief in the simplicity of Nature. No claim is being made that every case that one can reasonably regard as exhibiting unification will be captured by this account. Nor is it claimed that the contribution made by unification to the degree of support that the theory enjoys exhausts what is valuable in unification; as William Harper (1989, 2002b) has argued, unification can also result in greater resiliency in the face of apparently disconfirming evidence.
A few words are in order about what is meant by a Bayesian account, in the context of this paper. Bayesianism, as it is construed here, is an approach to the construction of canons of scientific inference that takes as its central notion degree of justified belief, represented by a real-valued function Pr(· | ·) that takes propositions as its arguments and is assumed to satisfy the axioms of the probability calculus. It will not be assumed that there is a unique correct probability function; this feature of Bayesianism is sometimes misleadingly called personalism.
The canons of inference that result are meant to have a status similar to that of the canons of deductive inference. Neither is descriptive of the way in which people actually think. No human being has a deductively closed set of beliefs, and it is likely that no human being has a consistent set of beliefs. These facts are, however, to be regarded as departures from ideal rationality; if one comes to realize that one believes a set {p_1, …, p_n} of propositions but not some logical consequence q of this set, then one ought not to rest content with this state of affairs but ought to consider whether to accept q or to reject one or more of {p_1, …, p_n}. Similarly, it is doubtful that anyone has numerical degrees of belief satisfying the axioms of probability. If, however, one becomes aware that one's rankings of propositions (perhaps only qualitative) as highly probable, somewhat probable, highly improbable, etc., are inconsistent with the existence of such numerical degrees of belief, then such rankings should be regarded as failing to meet the standards of rationality. Moreover, it will not be assumed that conformity with the axioms of probability will be all that matters; some assignments of probability, on the basis of a given body of evidence, are more reasonable than others, and we will be appealing to the reader's judgments about such reasonableness. The Bayesianism of this paper is, to borrow a term from Shimony (1970), of the tempered sort.
We will want to consider probabilities of the form Pr(h | e & b), where h is a hypothesis, e some body of evidence being explicitly considered, and b an appropriate body of background information. The background b need not be the sum total of facts known to an agent at some time, and, in particular, should not include the evidence e being considered, as we will want to judge the relevance of e to h and hence will want to take b such that Pr(h | e & b) is not equal to Pr(h | b). It will be assumed that we have available to us judgments (which may be somewhat vague) about the reasonableness of assigning certain probabilities to hypotheses on the basis of bodies of knowledge that differ from the sum total of our own knowledge.
2. Examples
Two examples will be used to illustrate the notion of unification to be discussed in this paper, and to guide our Bayesian account of the epistemic value of such unification. The examples we will use are the choice between geocentric and heliocentric world systems, and Newton's inference to the inverse square law of gravitation. It will become clear that examples of this sort could be multiplied, and that the features exhibited by these examples that are relevant to our account are quite common in science.
2.1. Copernicus v. Ptolemy
As is well known, Earth-based observations concerning the positions (if not the phases) of planets can be predicted, with good accuracy, both by systems such as Ptolemy's, on which the planets orbit the Earth, and systems such as the Copernican system or the Tychonic system, on which the planets orbit the Sun.[Footnote 1] This has led to the suggestion that a body of data consisting only of geocentric positions of the planets leaves the decision between the systems a decision to be made on non-empirical grounds; simplicity is frequently invoked as a non-empirical advantage possessed by the Copernican system.
The phenomena to be saved are the apparent motions of the planets with respect to the fixed stars, as seen from Earth. Each planet traverses its path through the zodiac, with its own period. This motion is not uniform, however; the planets proceed at varying rates, at times stopping and reversing direction. In the case of the “superior” planets, Mars, Jupiter, and Saturn, these retrogressions occur only when the planet is near opposition to the Sun, with the greatest retrograde angular velocity occurring at opposition. The period between successive oppositions, and hence between successive maximally retrograde motions, is called the synodic period.
A geocentric system can recover these qualitative features of the motion of the planets, and achieve a fairly good quantitative fit, by means of a system consisting of a deferent and one epicycle for each planet.[Footnote 2] The deferent is a circle centered near the Earth. The epicycle is a circle centered on a point on the deferent which, in Ptolemy's system, moves along the deferent at an angular rate that appears constant when viewed, not from the center of the circle, but from an equant point placed on the diameter joining the Earth with the center of the deferent, at the same distance from the center as the Earth but on the opposite side of the center. The planet moves on the epicycle at a constant angular rate. Retrograde motion occurs when the swifter motion of the planet on its epicycle is directed contrary to the motion of the center of the epicycle along the deferent.
The main features of these motions can be recovered by a single circle, centered near the Sun, for each planet. The chief deviations of the apparent motion of the planets from a uniform traversal of the ecliptic, which, on the Ptolemaic system, are accounted for by the epicyclic motion, are, on the Copernican system, due to the motion of the Earth about the Sun, and, on the Tychonic system, to the motion of the Sun, carrying the planets’ orbits with it, about the Earth. On the Copernican system, apparent retrograde motion occurs, in the case of the superior planets, when the Earth catches up with and passes the planet in its orbit and, in the case of the inferior planets, when the planet passes the Earth.
A heliocentric system is, in a certain sense, simpler than the Ptolemaic, in that “[t]he motion of the Earth alone … suffices to explain so many apparent inequalities in the heavens” (Copernicus, Commentariolus, in Rosen 1959, 59). Rheticus, in his Narratio Prima, invokes the frugality of Nature in support of the Copernican hypothesis:
Mathematicians as well as physicians must agree with the statements emphasized by Galen here and there: “Nature does nothing without purpose” and “So wise is our Maker that each of his works has not one use, but two or three or often more.” Since we see that this one motion of the Earth satisfies an almost infinite number of appearances, should we not attribute to God, the creator of nature, that skill which we observe in the common makers of clocks? For they carefully avoid inserting in the mechanism any superfluous wheel or any whose function could be served better by another with a slight change of position. (Rosen 1959, 137–38)
Galileo has Salviati make an analogous claim, concerning the related point that the apparent diurnal motions of all the heavenly bodies are, on the Copernican system, attributed not to these bodies severally, but to the Earth alone:
who is going to believe that nature (which by general agreement does not act by means of many things when it can do so by means of few) has chosen to make an immense number of extremely large bodies move with inconceivable velocities, to achieve what could have been done by a moderate movement of one single body around its own center? (Galileo [1632] 1953, 135)
Both Rheticus and Galileo's Salviati are making strong claims about the nature of the world. Neither gives any argument beyond authority, Rheticus citing Galen, Salviati the authority of “general agreement” and, a few pages later, in a shrewd rhetorical move, of Aristotle.[Footnote 3] Nor can it be claimed that such a principle is justified on empirical grounds; the aspect Nature presents to our observation contains a bewildering mixture of parsimony and profligacy.
Simplicity is sometimes held to be a pragmatic virtue, rather than an epistemic one. Even if the simplicity of a theory is not grounds for believing in its approximate truth, it may be a pragmatic reason for preferring one theory over the other, because a simpler theory is easier to work with. Both the Copernican and Ptolemaic systems are capable of recovering the phenomena; to a first approximation, at least, the Copernican system does so with fewer circles, and for this reason has been regarded as simpler. That the Copernican system surpasses the Ptolemaic in this sort of simplicity, when all complications of the two systems are taken into account, has been disputed (see, e.g., Neugebauer 1957, 204; Dijksterhuis 1961, 294; Hoskin and Gingerich 1999, 88); to achieve the same precision as Ptolemy without the use of Ptolemy's equants, Copernicus had to introduce complications into the simple system outlined above. Kuhn declared that, since the Copernican system fails to surpass the Ptolemaic on grounds of either simplicity (measured by circle-count) or accuracy of predictions, there were, at the time of Copernicus, not only no epistemic grounds for preferring the Copernican system over the Ptolemaic, but also no pragmatic grounds:
But this apparent economy of the Copernican system, though it is a propaganda victory that the proponents of the new astronomy rarely failed to emphasize, is largely an illusion. … Copernicus, too, was forced to use minor epicycles and eccentrics. His full system was little if any less cumbersome than Ptolemy's had been. Both employed over thirty circles; there was little to choose between them in economy. Nor could the two systems be distinguished in accuracy. When Copernicus had finished adding circles, his cumbersome sun-centered system gave results as accurate as Ptolemy's but it did not give more accurate results. Copernicus did not solve the problem of the planets. (Kuhn 1959, 169)
The only superiority of the Copernican system over the Ptolemaic system, according to Kuhn, lay in its greater harmony and beauty:
as Copernicus himself recognized, the real appeal of sun-centered astronomy was aesthetic rather than pragmatic. To astronomers the initial choice between Copernicus’ system and Ptolemy's could only be a matter of taste, and matters of taste are the most difficult of all to define or debate. Yet, as the Copernican Revolution itself indicates, matters of taste are not negligible. (Kuhn 1959, 172)
This conclusion is, I believe, premature. Let us consider a bit more carefully the manner in which each of the two systems saves the phenomena.
As mentioned, the Ptolemaic system recovers the bulk of the variation of the angular speed of each planet via an epicycle for each planet; the Copernican system accounts for this variation, for all of the planets, by the orbital motion of the Earth. The separate epicycles of the planets are, on the Ptolemaic system, kinematically independent; there is no kinematically necessary relation between the motion of any two epicycles, or between the motion of one planet's epicycle and the motion of any other planet. By adjusting the speed of the planet on its epicycle, one could produce any number of episodes of retrograde motion per synodic period, in place of the single episode that is observed. It so happens, however, that the epicyclic motion required on the Ptolemaic system to recover the actual observed behaviour of the planets displays a curious correlation with the motion of the Sun. The period of a planet's epicyclic motion is its synodic period, if calculated, as Ptolemy does, with respect to the line joining the Earth with the center of the epicycle; this means that, with respect to a fixed reference direction, the epicyclic period of a superior planet is the same as that of the Sun, and, moreover, the radius drawn from the center of the epicycle to the planet remains parallel, at all times, to the radius drawn from the Earth to the mean sun. For the inferior planets, the period of the center of the epicycle in its journey around the deferent is equal to the Sun's period, and the radius between the Earth and center of the epicycle passes through the mean sun at all times. 
This correlation between the motions of the planets and that of the Sun was, of course, recognized by Ptolemy, who, in setting out his preliminary hypotheses for the planets, says that there are “two apparent anomalies for each planet: that anomaly which varies according to its position on the ecliptic, and that which varies according to its position relative to the Sun” (Ptolemy 1984, 442). Michael Hoskin, in The Cambridge Concise History of Astronomy, remarks that “[t]his unexplained involvement of the Sun in the geometry of the other planets was to puzzle later astronomers” (Hoskin 1999, 47); Georg Peurbach (1423–61), for one, was no heliocentrist but nevertheless observed, “It is clear that each of the six planets in its motion shares something with the Sun, and the Sun is, so to speak, the common mirror and measure for their motions” (quoted in Hoskin and Gingerich 1999, 88–89).
On the Ptolemaic system, then, the longitudinal motion of each planet (other than Mercury, for which Ptolemy had to introduce a more complicated hypothesis) is composed of two circular motions: one whose period is peculiar to the planet, and one whose period is equal to that of the Sun about the Earth. On the Tychonic system, the orbits of the planets are centered on the Sun, and so this second motion simply is the Sun's motion. On the Copernican system, the apparent motion of each planet is again composed of two circular motions: the planet's own motion, and the motion of the observer's vantage point on the Earth. The component of motion recognized by Ptolemy to be related to the Sun is, on either the Tychonic or the Copernican system, the relative motion of the Sun and Earth.
As Kepler clearly perceived, the Copernican system explains what on the Ptolemaic system is a mysterious connection between the Sun and the motion of the other planets:
For in the first place one might ask of Ptolemy how it comes about that the three eccentrics of the Sun, Venus, and Mercury have equal times of revolution? … Why do the five planets make retrogressions, whereas the luminous stars do not? … Similarly the ancients rightly wondered why the three superior planets are always in opposition to the Sun when they are at the bottom of their epicycles, but in conjunction when they are at the top. (Kepler [1596] 1981, 81)
On the Ptolemaic system, the periods of the planets and the intervals between episodes of retrograde motion are independent parameters. It so happens that the maximal retrogressions of the superior planets occur at intervals equal to their synodic periods (which are calculable from the periods of the planet and that of the Sun). On the heliocentric explanation of retrograde motion, however, fixing the periods of the planets (including the Earth) fixes also the intervals between episodes of retrogression. A system, such as Tycho's or Copernicus’, which centers the orbits of the planets on the Sun, makes one set of phenomena—the mean apparent motions of the planets, and the Sun, along the ecliptic—carry information about what, on the Ptolemaic system, are independent phenomena, the deviations of the planets from their mean apparent motions.
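The dependence described here can be made concrete with a small sketch. On a heliocentric reading, the interval between successive oppositions of a superior planet follows from the two orbital periods alone, through the standard relation 1/S = 1/T_Earth − 1/T_planet. The function name and the rounded modern period values below are illustrative assumptions, not figures taken from the text.

```python
def synodic_period(t_earth: float, t_planet: float) -> float:
    """Interval between successive oppositions (or like conjunctions).

    Assumes both bodies move uniformly on circles about the Sun, so the
    relative angular rate is the difference of the two angular rates.
    """
    if t_earth == t_planet:
        raise ValueError("equal periods: the configuration never repeats")
    return 1.0 / abs(1.0 / t_earth - 1.0 / t_planet)


# Rounded modern sidereal periods in days (illustrative values only).
EARTH = 365.256
SIDEREAL = {"Mars": 686.98, "Jupiter": 4332.6, "Saturn": 10759.2}

for name, period in SIDEREAL.items():
    interval = synodic_period(EARTH, period)
    print(f"{name}: one retrogression every {interval:.1f} days")
```

On a Ptolemaic reading, the same interval would have to be supplied as a separate parameter for each planet; here it is fixed once the periods are.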
2.2. Newton and the Inverse Square Law of Gravitation
Proposition 2 of Book 3 of Newton's Principia states:
The forces by which the primary planets are continually drawn away from rectilinear motions and are maintained in their respective orbits are directed to the sun and are inversely as the squares of their distances from its center. (Newton [1726] 1999, 802)
For the first part of this proposition, the sun-directedness of the force, Newton cites Phenomenon 5 of Book 3, namely, Kepler's Area Law. For the second part, the inverse square dependence of the force on distance from the Sun, he cites two phenomena. One is Phenomenon 4, Kepler's Harmonic Law: the periods of the planets are proportional to the 3/2 power of their mean distances from the Sun. He then adds, “But this second part of the proposition is proved with the greatest exactness from the fact that the aphelia are at rest. For the slightest departure from the ratio of the square would (by Bk. 1, Prop. 45, Corol. 1) necessarily result in a noticeable motion of the apsides in a single revolution and an immense such motion in many revolutions.”
The relevance of these two phenomena to the force law is provided, respectively, by Proposition 4 of Book 1, and Proposition 45 of the same book. The argument from the Harmonic Law to the inverse square law is the familiar one found in many physics textbooks. Proposition 4 concerns the centripetal acceleration of uniform circular motion. If the planets move in orbits that approximate uniform circular motion about the Sun, and their periods are some function T(r) of their mean distances from the Sun, their accelerations will satisfy
$$ a \propto \frac{r}{T(r)^{2}}. \tag{1} $$
This is Newton's Corollary 2 to Proposition 4. Putting T(r) ∝ r^{3/2} yields Corollary 6,
$$ a \propto \frac{1}{r^{2}}. \tag{2} $$
That is, the accelerations of the planets towards the Sun vary inversely as the square of their distances from the Sun.
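The step from the Harmonic Law to the inverse square dependence can be checked with a short numerical sketch. It takes the centripetal acceleration of uniform circular motion to be proportional to r / T² and fits the power of r that best describes the resulting accelerations. The planetary figures are rounded modern values and, like the function names, are assumptions introduced for illustration.

```python
import math

# Rounded modern values: mean distance from the Sun (AU), period (years).
# These figures are illustrative assumptions, not data from the text.
PLANETS = {
    "Mercury": (0.387, 0.2408),
    "Venus": (0.723, 0.6152),
    "Earth": (1.000, 1.0000),
    "Mars": (1.524, 1.8808),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def relative_acceleration(r: float, period: float) -> float:
    """Centripetal acceleration of uniform circular motion, up to a constant.

    a = 4 * pi^2 * r / T^2, so a is proportional to r / T^2.
    """
    return r / period**2


def fitted_exponent(planets: dict) -> float:
    """Least-squares slope of log(a) against log(r)."""
    xs = [math.log(r) for r, _ in planets.values()]
    ys = [math.log(relative_acceleration(r, t)) for r, t in planets.values()]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


print(f"fitted exponent: {fitted_exponent(PLANETS):.3f}")
```

With these inputs the fitted exponent comes out very close to −2, which is the content of the corollary.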
The argument from the quiescence of the apsides to the inverse square dependence of the force law is less familiar (see Harper 2002a for a clear exposition). Newton was able to show (Bk. 1, Props. 43, 44) that, if a body moves in an orbit under the action of a centripetal force f(r), adding an inverse cube term to the force produces a second orbit, whose distance from the center of force at any time is the same as that of the first orbit (r_2(t) = r_1(t)), and whose angular displacement is a constant multiple of that of the first (θ_2(t) = α θ_1(t)) (this is posed as an exercise problem by Goldstein 1980, 123; see also Whittaker 1944, 83; Chandrasekhar 1995, 184). Since Newton had shown that an inverse square force law produces a quiescent elliptical orbit, he is able to conclude that a force law that takes the form of an inverse square term plus an inverse cube term produces a precessing ellipse. Newton then proceeds (Prop. 45) to approximate, over a small range of r, an arbitrary force f(r) by a sum of an inverse square term and an inverse cube term by taking (what we now call) a Taylor series expansion, to the first power in r, of g(r) = r^3 f(r) around the aphelion distance a of the orbit. This approximation yields the result that for each revolution there will be an advance of the aphelion by an amount (measured in degrees) equal to
$$ p = 360\left( \sqrt{\frac{g(a)}{a\,g'(a)}} - 1 \right). \tag{3} $$
Therefore, for orbits that are approximately circular, so that this Taylor series approximation is valid, a measurement of the aphelion advance of a planet yields information about the distance dependence of the force on the planet, in the small range of distances explored by the planet.[Footnote 4]
Suppose we make the hypothesis h_PL that, to a high degree of approximation, the acceleration of all of the planets is due to a single field of force centered on the Sun, obeying some power-law:
$$ f(r) \propto r^{\lambda}. \tag{4} $$
Application of Newton's Proposition 45 to such a law gives, for an approximately circular orbit,
$$ p = 360\left( \frac{1}{\sqrt{\lambda + 3}} - 1 \right), \tag{5} $$
or,
$$ \lambda = \left( 1 + \frac{p}{360} \right)^{-2} - 3. \tag{6} $$
A quiescent orbit, for which p = 0, gives us −2 as an estimate of the parameter λ, and this in turn entails Kepler's Harmonic Law, that the periods of the planets vary as the three-halves power of their distances from the sun. On the hypothesis h_PL of a single power-law force, a measurement of the perihelion precession (or lack thereof) of each planet yields information about the exponent of the power law and thereby yields information about the dependence of their relative periods on their distances, and vice versa. What a priori are independent phenomena, namely, the approximate quiescence of the aphelia of the planets, and Kepler's Harmonic Law, are no longer independent on the supposition h_PL of a single heliocentric power-law force; if such a hypothesis is true or approximately true, each of the two phenomena strongly constrains the other.
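The mutual constraint between the two phenomena can be illustrated with a minimal sketch. It assumes Newton's result for nearly circular orbits, that a force varying as r^(n−3) gives an angle of 180/√n degrees between successive apsides, and writes n = λ + 3 for a power-law force r^λ. The function names are hypothetical and the code is an illustration under those assumptions, not a reconstruction of Newton's own calculation.

```python
import math


def precession_per_revolution(lam: float) -> float:
    """Aphelion advance in degrees per revolution for a force ~ r**lam.

    Uses the apsidal-angle result for nearly circular orbits: a force
    varying as r**(n - 3) gives 180/sqrt(n) degrees between apsides.
    Here n = lam + 3, which must be positive.
    """
    n = lam + 3.0
    if n <= 0.0:
        raise ValueError("need lam > -3")
    return 360.0 * (1.0 / math.sqrt(n) - 1.0)


def exponent_from_precession(p: float) -> float:
    """Invert the relation above: the exponent implied by an advance of p degrees."""
    if p <= -360.0:
        raise ValueError("need p > -360 degrees")
    return (1.0 + p / 360.0) ** -2 - 3.0


def harmonic_law_power(lam: float) -> float:
    """Power s in T ~ r**s for circular orbits under a force ~ r**lam.

    From a ~ r / T**2 and a ~ r**lam it follows that T ~ r**((1 - lam) / 2).
    """
    return (1.0 - lam) / 2.0


lam = exponent_from_precession(0.0)          # quiescent aphelion
print(lam, harmonic_law_power(lam))          # exponent, then power of r in T(r)
print(precession_per_revolution(-2.01))      # small departure from inverse square
```

A quiescent aphelion fixes the exponent at −2 and hence the three-halves power in the period relation, while even a one percent change in the exponent produces an advance of nearly two degrees per revolution.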
The hypothesized form of the force law can be replaced by more general families of functions. In order for the hypothesis to make our two a priori independent phenomena yield information about each other, all that is required of the hypothesized form of the force law is that the behaviour of the force law over the range of distances explored by one planet yield information about its behaviour at other planetary distances. If, on the other hand, one merely hypothesized that the force law is some function, with no preference given to any function over another—that is, if one had a prior probability measure that was sufficiently uniform over the space of possible force laws—then such a hypothesis would leave the a priori independent phenomena independent.
These two examples suggest an interesting way in which a hypothesis can unify disparate phenomena: the hypothesis can make two phenomena that in the absence of the hypothesis seem to be independent phenomena yield information about each other. In the next section we will tender an explication, in Bayesian terms, of information yielded by one proposition about another, and of unification of a body of evidence by a hypothesis. Such an account seems to capture at least part of the intuition invoked by Michael Friedman:
this is the essence of scientific explanation—science increases our understanding of the world by reducing the total number of independent phenomena that we have to accept as ultimate or given. A world with fewer independent phenomena is, other things equal, more comprehensible than one with more. (Friedman 1974, 15)
The Bayesian account of such unification will differ greatly in its details from the account given by Friedman; in particular, Friedman's notion of independent acceptability will be replaced by the notion of probabilistic independence; a unifying hypothesis will reduce, not the number of independently acceptable phenomena, but the degree of informational independence of a body of phenomena.
3. Informational Relevance and Evidential Support
We wish to give a Bayesian account of this notion of unification of disparate phenomena, which consists of the ability of the theory to make the phenomena yield information about each other. For this, we will need a Bayesian notion of the information one proposition p yields about another proposition q. We will assume a body of background knowledge b, and that we have available a probability function Pr( · | b). Learning that a proposition p is true can be either positively or negatively relevant to another proposition q; if it is positively relevant to q, then learning p furnishes a certain amount of information about whether or not q is true. Moreover, the relevance of p to q (that is, how much we learn about whether or not q is true when we learn that p is true) is the sort of thing that admits of degrees. Accordingly, we will want to define a measure of the informational relevance of p to q, on background b. We will call the degree of informational relevance of p to q, on background b, I(q, p | b). I(q, p | b) will be assumed to be definable in terms of the probability function Pr( · | b) (and, in fact, continuously definable, so that small changes in probabilities yield small changes in information). By convention, we will take this quantity to be positive when p is positively relevant to q, negative when p is negatively relevant to q, and zero when p is irrelevant to q. Furthermore, we will define our measure of informational relevance so that independent evidence is additive: when p_1 and p_2 are independent items of evidence, the information yielded about q by the conjunction of p_1 and p_2 will be simply the sum of the information yielded by p_1 and the information yielded by p_2 (see the Appendix for a precise statement of this condition). Finally, we will assume a normalization convention that agrees with the convention adopted in information theory for the special case when our information about q amounts to certainty that q obtains.
If q is one of 2^N equiprobable, mutually exclusive, and jointly exhaustive alternatives, then the information that q obtains amounts to N bits of information, and, in general, information that q obtains will count as −Log_2(Pr(q | b)) bits of information.
It is shown in the Appendix that these conditions uniquely determine a measure of degree of informational relevance: [Footnote 5]
$$ I(q, p \mid b) = \mathrm{Log}_2\!\left( \frac{\Pr(q \mid p \,\&\, b)}{\Pr(q \mid b)} \right). \tag{7} $$
It is worth noting that the informational relevance function I(q, p | b) so obtained is symmetric in its two arguments: I(q, p | b) = I(p, q | b).
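The two properties just mentioned, symmetry and the normalization convention, can be checked on a small example. The sketch below takes informational relevance to be the base-2 logarithm of Pr(q | p & b) / Pr(q | b), as stated in Section 3, and represents the background b by a finite joint distribution over "worlds." The helper names and the particular probabilities are assumptions chosen for illustration.

```python
import math


def always(world):
    """Stands in for the background b, already built into the distribution."""
    return True


def prob(joint, event, given=always):
    """Pr(event | given) for a finite joint distribution {world: probability}."""
    denom = sum(pr for w, pr in joint.items() if given(w))
    numer = sum(pr for w, pr in joint.items() if given(w) and event(w))
    return numer / denom


def relevance(joint, q, p, given=always):
    """I(q, p | b) = log2( Pr(q | p & b) / Pr(q | b) ), in bits."""
    p_and_b = lambda w: p(w) and given(w)
    return math.log2(prob(joint, q, p_and_b) / prob(joint, q, given))


# Worlds are pairs (p is true?, q is true?); the probabilities are arbitrary.
joint = {(True, True): 0.4, (True, False): 0.1,
         (False, True): 0.2, (False, False): 0.3}
p = lambda w: w[0]
q = lambda w: w[1]
print(relevance(joint, q, p), relevance(joint, p, q))  # the two values agree

# Normalization: one of 2**3 equiprobable alternatives carries 3 bits.
uniform = {k: 1 / 8 for k in range(8)}
is_three = lambda w: w == 3
print(relevance(uniform, is_three, is_three))
```

In the first example p raises the probability of q from 0.6 to 0.8, so the relevance is positive, and it is the same in either direction.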
Suppose that p_1, p_2 are independent phenomena; that is, I(p_1, p_2 | b) = 0. A hypothesis h makes p_2 yield information about p_1 if I(p_1, p_2 | h & b) > 0, and, in such a case, we can take I(p_1, p_2 | h & b) as a measure of the extent to which h makes p_2 yield information about p_1. If I(p_1, p_2 | b) ≠ 0, the excess of I(p_1, p_2 | h & b) over I(p_1, p_2 | b) measures the extent to which h makes p_2 yield information about p_1. Let us call this quantity U, as it is a measure of the extent to which h unifies the set of phenomena {p_1, p_2}.[Footnote 6]
$$ U(p_1, p_2; h \mid b) = I(p_1, p_2 \mid h \,\&\, b) - I(p_1, p_2 \mid b). \tag{8} $$
We will also want to consider bodies of phenomena {p_1, p_2, …, p_n} consisting of more than two elements. We define the n-place function I^(n), which is a measure of the degree of mutual dependence to be found in the set {p_1, p_2, …, p_n}, as the information about p_2 yielded by p_1, plus the information about p_3 yielded by the conjunction of p_1 and p_2, and so on, up to the information about p_n yielded by the conjunction of all the other members of the set. This gives
$$ I^{(n)}(p_1, p_2, \ldots, p_n \mid b) = \mathrm{Log}_2\!\left( \frac{\Pr(p_1 \,\&\, p_2 \,\&\, \cdots \,\&\, p_n \mid b)}{\Pr(p_1 \mid b)\,\Pr(p_2 \mid b) \cdots \Pr(p_n \mid b)} \right). \tag{9} $$
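It can easily be seen that this quantity is also independent of the order in which the elements are taken. We now define the quantity U^(n), which is the degree to which h unifies the set {p_1, p_2, …, p_n}:
$$ U^{(n)}(p_1, p_2, \ldots, p_n; h \mid b) = I^{(n)}(p_1, p_2, \ldots, p_n \mid h \,\&\, b) - I^{(n)}(p_1, p_2, \ldots, p_n \mid b). \tag{10} $$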
The question to be asked now is whether the ability of a hypothesis to unify a set of phenomena contributes to the degree to which the phenomena lend evidential support to the hypothesis. To ask this we will need a measure of the degree to which evidence e supports hypothesis h on background b. The quantity Pr(h | e & b)/Pr(h | b), or its logarithm, is a popular choice for such a measure, and is often referred to as the “degree of confirmation” of h by e on background b, confirmation being taken in the relative, or incremental sense, rather than the absolute; that is, to say that e confirms h, in this sense, is to say that e is positively relevant to h, and is not to say that e provides sufficient grounds for accepting h (for discussions of this distinction, see Carnap 1962, xv–xix; Salmon 1975, 3–36). The logarithm of Pr(h | e & b)/Pr(h | b) is the informational relevance I(h, e | b). This quantity is, therefore, a candidate for a measure of the degree to which a piece of evidence supports a hypothesis. Another candidate is what I. J. Good (1950; 1983) has called the “weight of evidence,”
$$ W(h, e \mid b) = \mathrm{Log}_2\!\left( \frac{\Pr(e \mid h \,\&\, b)}{\Pr(e \mid {\sim}h \,\&\, b)} \right). \tag{11} $$
It will not be necessary to adjudicate between these two candidates, as there is an interesting relation between the power of a hypothesis to unify a body of evidence and its degree of confirmation on either choice of explicatum for the latter. On either choice of measure of degree of evidential support, it can be shown that the ability of a hypothesis to unify a body of evidence contributes in a direct way to the support provided to h by the body of evidence. It follows from Bayes’ theorem that [Footnote 7]
$$ I(h, e_1 \,\&\, e_2 \mid b) = I(h, e_1 \mid b) + I(h, e_2 \mid b) + U(e_1, e_2; h \mid b). \tag{12} $$
That is: the degree of support provided to h by e_1 and e_2 taken together is the sum of three terms: the degree of support of h by e_1 alone, the degree of support of h by e_2 alone, and an additional term which is simply the degree of unification of the set {e_1, e_2} by h. An analogous result holds for larger bodies of evidence; the degree of support provided to h by the set {e_1, e_2, …, e_n} taken together is simply the sum of the degrees of support provided to h by each of the evidence items taken individually, plus an additional term which is the degree to which h unifies the body of evidence.
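This decomposition can be checked numerically. The sketch below is a minimal illustration rather than part of the argument: it builds an arbitrary joint distribution over a hypothesis h and three evidence items, and compares the support lent to h by the conjunction with the sum of the individual supports plus the degree of unification. The helper names (`prob`, `relevance`, `mutual_dependence`, `unification`) and the weights are assumptions made for the example.

```python
import math
from itertools import product


def always(world):
    """Stands in for the background b, already built into the distribution."""
    return True


def prob(joint, event, given=always):
    """Pr(event | given) for a finite joint distribution {world: probability}."""
    denom = sum(pr for w, pr in joint.items() if given(w))
    numer = sum(pr for w, pr in joint.items() if given(w) and event(w))
    return numer / denom


def relevance(joint, q, p, given=always):
    """I(q, p | b) = log2( Pr(q | p & b) / Pr(q | b) )."""
    p_and_b = lambda w: p(w) and given(w)
    return math.log2(prob(joint, q, p_and_b) / prob(joint, q, given))


def mutual_dependence(joint, items, given=always):
    """log2( Pr(p_1 & ... & p_n | b) / (Pr(p_1 | b) * ... * Pr(p_n | b)) )."""
    conj = prob(joint, lambda w: all(item(w) for item in items), given)
    return math.log2(conj / math.prod(prob(joint, item, given) for item in items))


def unification(joint, items, h, given=always):
    """Dependence among the items on h & b, minus their dependence on b alone."""
    on_h = mutual_dependence(joint, items, lambda w: h(w) and given(w))
    return on_h - mutual_dependence(joint, items, given)


# Worlds are truth-value assignments (h, e1, e2, e3); the weights are arbitrary.
worlds = list(product([True, False], repeat=4))
weights = [(7 * i) % 11 + 1 for i in range(len(worlds))]
joint = {w: wt / sum(weights) for w, wt in zip(worlds, weights)}

h = lambda w: w[0]
evidence = [lambda w, i=i: w[i] for i in (1, 2, 3)]
all_evidence = lambda w: all(e(w) for e in evidence)

lhs = relevance(joint, h, all_evidence)
rhs = sum(relevance(joint, h, e) for e in evidence) + unification(joint, evidence, h)
print(f"joint support: {lhs:.6f}  sum of parts plus unification: {rhs:.6f}")
```

Because the equality is a consequence of Bayes' theorem alone, the two printed values agree whatever positive weights are chosen.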
To understand what equation (12) means, suppose that two phenomena, e_1 and e_2, which on prior grounds are regarded as having little to do with each other, both occur. If h_1 makes e_1 informationally relevant to e_2 (that is, if the truth of h_1 would render it more probable that e_2 occurs if e_1 does) then the joint occurrence of e_1 and e_2 is more probable on the supposition of h_1 than it would be if h_1 didn't unify e_1 and e_2 in this sense. Even if the likelihoods [Footnote 8] of two hypotheses h_1 and h_2 are equal to each other on evidence e_1 taken alone, and on e_2 taken alone, if h_1 unifies the pair {e_1, e_2} by making them informationally relevant to each other, and h_2 doesn't, then the likelihood of h_1 on the evidence e_1 & e_2 is higher than that of h_2, and consequently h_1 is better supported by e_1 & e_2 than h_2 is. Furthermore, if h_1 unifies the pair {e_1, e_2} more than h_2 does by making them more relevant to each other, then h_1 is again better supported by e_1 & e_2 than h_2 is.
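A toy case may help fix ideas. In the sketch below, two exhaustive hypotheses assign the same probability to e_1 and to e_2 taken singly, but h1 makes the two items strongly dependent while h2 treats them as independent. All of the numbers, and the assumption that h1 and h2 exhaust the possibilities with equal priors, are hypothetical choices made for illustration.

```python
import math

# Hypothetical priors and likelihoods; h1 and h2 are assumed exhaustive.
PRIOR = {"h1": 0.5, "h2": 0.5}
LIKELIHOOD = {
    "h1": {"e1": 0.5, "e2": 0.5, "e1&e2": 0.45},  # e1 and e2 tightly linked
    "h2": {"e1": 0.5, "e2": 0.5, "e1&e2": 0.25},  # e1 and e2 independent
}
EVENTS = ("e1", "e2", "e1&e2")


def marginal(event: str) -> float:
    """Pr(event | b), averaging the likelihoods over the prior."""
    return sum(PRIOR[h] * LIKELIHOOD[h][event] for h in PRIOR)


def support(h: str, event: str) -> float:
    """I(h, event | b) = log2( Pr(event | h & b) / Pr(event | b) )."""
    return math.log2(LIKELIHOOD[h][event] / marginal(event))


def dependence(table: dict) -> float:
    """log2( Pr(e1 & e2) / (Pr(e1) * Pr(e2)) ) for one probability table."""
    return math.log2(table["e1&e2"] / (table["e1"] * table["e2"]))


def unification(h: str) -> float:
    """Dependence of e1 and e2 on h & b, minus their dependence on b alone."""
    return dependence(LIKELIHOOD[h]) - dependence({ev: marginal(ev) for ev in EVENTS})


for h in PRIOR:
    print(h, support(h, "e1"), support(h, "e2"),
          round(unification(h), 4), round(support(h, "e1&e2"), 4))
```

Neither item alone discriminates between the hypotheses, so each individual support is zero; the whole of the difference in support from the conjunction comes from the unification term, positive for h1 and negative for h2.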
For Good's weight of evidence, we have an analogous result,
$$W(h, e_1 \,\&\, e_2 \mid b) = W(h, e_1 \mid b) + W(h, e_2 \mid b) + U(e_1, e_2; h \mid b) - U(e_1, e_2; {\sim}h \mid b). \tag{13}$$

Here what counts is the excess of the degree of unification of the evidence by h over its degree of unification by ∼h. In this case also the corresponding result holds for larger bodies of evidence.
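A similar numerical check is possible for the weight-of-evidence version. This sketch assumes that Good's weight of evidence is the log likelihood ratio Log2(Pr(e | h & b)/Pr(e | ∼h & b)) and that the degree of unification is U(e 1, e 2; h | b) = I(e 1, e 2 | h & b) − I(e 1, e 2 | b). On that assumption the prior-dependent term I(e 1, e 2 | b) cancels in the excess, so only the two likelihood tables are needed. The tables and function names are hypothetical.

```python
from math import log2

# Hypothetical likelihood tables Pr(e1, e2 | h & b) and Pr(e1, e2 | ~h & b).
GIVEN_H = {(True, True): 0.18, (True, False): 0.02,
           (False, True): 0.02, (False, False): 0.78}
GIVEN_NOT_H = {(True, True): 0.10, (True, False): 0.20,
               (False, True): 0.10, (False, False): 0.60}

def marginals(table):
    """Return (Pr(e1), Pr(e2)) under the hypothesis the table describes."""
    pr_e1 = table[(True, True)] + table[(True, False)]
    pr_e2 = table[(True, True)] + table[(False, True)]
    return pr_e1, pr_e2

def relevance(table):
    """I(e1, e2 | hyp) = log2(Pr(e1 & e2 | hyp) / (Pr(e1 | hyp) Pr(e2 | hyp)))."""
    pr_e1, pr_e2 = marginals(table)
    return log2(table[(True, True)] / (pr_e1 * pr_e2))

def weight(pr_given_h, pr_given_not_h):
    """Weight of evidence, taken here to be a log likelihood ratio."""
    return log2(pr_given_h / pr_given_not_h)

h_e1, h_e2 = marginals(GIVEN_H)
n_e1, n_e2 = marginals(GIVEN_NOT_H)

weight_e1 = weight(h_e1, n_e1)
weight_e2 = weight(h_e2, n_e2)
weight_joint = weight(GIVEN_H[(True, True)], GIVEN_NOT_H[(True, True)])

# Excess of unification by h over unification by ~h.
excess_unification = relevance(GIVEN_H) - relevance(GIVEN_NOT_H)

print(weight_joint, weight_e1 + weight_e2 + excess_unification)
```

In this example e 1 taken alone tells slightly against h, yet the conjunction favors h because h ties the two items together more tightly than ∼h does.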
On this Bayesian account, therefore, the power of a theory to unify a body of evidence is not an extra-empirical virtue but contributes directly to the degree to which the evidence supports the theory. Note that these results do not depend on a special form of the prior probabilities and will hold for any probability assignment; in particular, it is not necessary to build a preference for unification or simplicity into the assignment of prior probabilities. Nor do we have to invoke a preference for unification as an extra, supplementary rule, beyond the usual Bayesian updating rules, or add an extra boost of confirmation to the unifying theory beyond the degree of confirmation it receives from simple Bayesian conditionalization. [Footnote 9]
4. Application to Our Examples
Let b be a body of background knowledge including qualitative information about the apparent course of the planets (including, perhaps, the fact that they undergo retrograde motion), but not including precise values of the periods of the planets, or detailed information about their retrogressions. This body of knowledge is, of course, not the body of knowledge possessed by any of the readers of this paper; the motivation for considering such a body of knowledge will be made clear shortly. Let h C be the hypothesis that some system following the outlines of the Copernican system, as sketched in Section 2.1 above, is true. Note that h C by itself makes no specific predictions as to the observed location of any planet at any time, as it contains a number of parameters—the size of the planetary orbits, their periods, and the location of the planets at some initial epoch—that must be filled in from empirical data. Similarly, let h P be the hypothesis that some Ptolemaic system is correct, again, with parameters concerning the periods of the planets, and the frequency, time, and length of their retrogressions, left unspecified.
We suppose that we have some probability distributions, conditional on background b, over the periods of the planets and the times of their retrogressions. Let p m be the statement that Mars traverses the sphere of fixed stars with a period that, within a small observational error, is equal to 1.88 years, and let r m be the statement that it retrogresses when it is near opposition to the Sun, with a period equal to 2.14 years (plus or minus a small observational error). Now, the Copernican hypothesis h C entails that p m holds if and only if r m does; hence, these two propositions, conditional on h C, have a maximal degree of informational relevance to each other:
$$I(r_m, p_m \mid h_C \,\&\, b) = -\mathrm{Log}_2\, \Pr(r_m \mid h_C \,\&\, b). \tag{14}$$

Since, by hypothesis, the background knowledge b contains only qualitative information about the motions of the planets, it is not to be expected that one could anticipate, in advance, the precise values of these parameters; hence, the quantity Pr(r m | h C & b) ought to be quite small, and I(r m, p m | h C & b) will be quite large. If, conditional on all serious rivals to h C, p m and r m are independent, then I(r m, p m | b) will be fairly small, and so U(r m, p m; h C | b) will be positive—h C unifies {p m, r m}, relative to background b.
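The link between the two Martian periods, and the size of the resulting relevance term, can be illustrated with a short calculation. The sketch assumes the standard relation between sidereal period P and synodic period S for a superior planet on circular, coplanar heliocentric orbits, 1/S = 1 − 1/P (periods in years), which is a simplification and not the paper's own formula. The value 0.01 for Pr(r m | h C & b) is purely hypothetical, and the function names are illustrative.

```python
from math import log2

def synodic_period(sidereal_years):
    """Interval between successive oppositions of a superior planet, in years.

    Assumes circular, coplanar heliocentric orbits and an Earth period of
    one year, so that 1/S = 1 - 1/P.
    """
    return 1.0 / (1.0 - 1.0 / sidereal_years)

def relevance_when_entailed(pr_q):
    """Normalization condition: I(q, p | b) = -log2 Pr(q | b) when p & b entails q."""
    return -log2(pr_q)

mars_sidereal = 1.88                         # the period stated in p_m
mars_synodic = synodic_period(mars_sidereal)

pr_r_given_hc = 0.01                         # hypothetical Pr(r_m | h_C & b)
bits = relevance_when_entailed(pr_r_given_hc)

print(round(mars_synodic, 2), round(bits, 2))
```

The smaller the prior chance of anticipating the precise retrogression period, the larger the relevance term, which is why qualitative background knowledge makes the unification substantial.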
On the other hand, if only the bare-bones Ptolemaic hypothesis h P is assumed, then p m affords little or no information about whether or not r m is true, and so a reasonable probability assignment will have them informationally independent, or nearly so, conditional on h P:
$$I(r_m, p_m \mid h_P \,\&\, b) \approx 0. \tag{15}$$


Let us assume that the evidential support lent to h C by p m on background b is approximately the same as the evidential support lent to h P by p m on the same background (and, indeed, since p m merely tells us what the period of Mars is, it would seem that the support lent to either hypothesis alone by p m is nil), and similarly, that the degree of support lent to h C by r m on background b is approximately the same as the evidential support lent to h P by r m. Then the degree of support lent to h C by the conjunction of p m and r m will be considerably greater than the degree of support lent to h P by this conjunction—since h C unifies p m and r m in the sense of making them informationally relevant to each other, they work together to support h C.
Now, of course, it doesn't follow from this that every probability assignment will have Pr(h C | p m & r m & b) greater than Pr(h P | p m & r m & b). What does follow is that any probability assignment satisfying the conditions outlined above will have Pr(h C | p m & r m & b) greater than Pr(h P | p m & r m & b) unless it also has the prior probability of h C considerably less than that of h P. An agent who attaches low prior probability to the Copernican hypothesis (perhaps on the basis of considerations of terrestrial dynamics [Footnote 10]; see Ptolemy 1984, 44–45) should nevertheless acknowledge that the ability of the Copernican hypothesis to explain the otherwise puzzling correlation between the motions of the planets and that of the Sun counts in favor of the hypothesis. Note that the procedure here is to assess the reasonableness of probabilities conditional on p m & r m & b by considering what sorts of probability assignments are reasonable on b alone, which was the motivation for introducing b. Such considerations should be available even to agents who already know p m and r m.
Since a parallel discussion can be applied as well to the other planets, the degree of support lent to the Copernican hypothesis by planetary phenomena becomes considerably stronger when the other planets are taken into account.
So far we have contrasted a bare-bones Ptolemaic hypothesis with a bare-bones Copernican hypothesis; in particular, no relation between the motion of the planet on its epicycle and the Sun was built into the Ptolemaic hypothesis h P. The system presented by Ptolemy in the Almagest, however, does contain restrictions of just this sort: for the superior planets Mars, Jupiter, and Saturn, the line drawn between the planet and the center of its epicycle remains at all times parallel to the Earth-Sun radius, whereas for the inferior planets Mercury and Venus, the line from the Earth to the center of the planet's epicycle passes through the Sun. Call this set of restrictions the condition of sun-planet parallelism. Although it is plausible that the bare-bones Ptolemaic hypothesis was considered by some astronomer at some point, it must be admitted that we have no historical record of such an episode. It seems only fair, therefore, that we consider what happens when the Copernican hypothesis h C faces as a rival, not the minimal Ptolemaic hypothesis h P, but a strengthened Ptolemaic hypothesis that includes the sun-planet parallelism condition. [Footnote 11]
Call the strengthened Ptolemaic hypothesis h SP. This hypothesis shares with the Copernican hypothesis the feature of entailing that r m obtains if and only if p m does, and hence enjoys the same evidential boost from the conjunction of r m and p m enjoyed by h C; the ratio of posterior to prior should be equal, or at least approximately equal, for the two hypotheses.
$$\frac{\Pr(h_{SP} \mid p_m \,\&\, r_m \,\&\, b)}{\Pr(h_{SP} \mid b)} \approx \frac{\Pr(h_C \mid p_m \,\&\, r_m \,\&\, b)}{\Pr(h_C \mid b)}. \tag{17}$$

Suppose, now, that we have
$$\Pr(h_C \mid b) \approx \Pr(h_P \mid b), \tag{18}$$

where h P is, as before, the minimal Ptolemaic hypothesis. Suppose, further, that we also have
$$\Pr(h_{SP} \mid b) \ll \Pr(h_P \mid b). \tag{19}$$

This seems eminently reasonable; since h SP is a strengthening of h P, we must have Pr(h SP | b) ≤ Pr(h P | b), and to have Pr(h SP | b) equal to Pr(h P | b) is to attach zero probability, on background b, to the possibility that some Ptolemaic hypothesis not satisfying the sun-planet parallelism condition is true. Moreover, it seems unreasonable to assume that the background knowledge b suffices to permit one to expect the sun-planet parallelism condition to be true; if this is right, we should take Pr(h SP | b) to be substantially less than Pr(h P | b).
From (17), (18), and (19) it follows that
$$\Pr(h_C \mid p_m \,\&\, r_m \,\&\, b) \gg \Pr(h_{SP} \mid p_m \,\&\, r_m \,\&\, b). \tag{20}$$

It seems, therefore, that the only way that a reasonable agent can avoid having Pr(h C | p m & r m & b) be much larger than Pr(h SP | p m & r m & b) is to have the prior probability of the Copernican hypothesis h C be much smaller than the prior probability of the bare-bones Ptolemaic hypothesis h P.
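The arithmetic behind this conclusion is simple enough to sketch. When two hypotheses receive the same ratio of posterior to prior from the evidence, the ratio of their posteriors equals the ratio of their priors. All numbers below are hypothetical and are chosen only to respect the stated constraints: a common boost, a Copernican prior comparable to the bare-bones Ptolemaic prior, and a strengthened Ptolemaic prior far below it.

```python
def posterior(prior, boost):
    """Posterior probability, given a prior and the ratio posterior/prior."""
    return prior * boost

# Hypothetical values, for illustration only.
boost = 20.0        # common posterior-to-prior ratio for h_C and h_SP
prior_hp = 0.02     # bare-bones Ptolemaic hypothesis h_P
prior_hc = 0.02     # Copernican hypothesis h_C, comparable to h_P
prior_hsp = 0.0005  # strengthened Ptolemaic h_SP, far less probable than h_P

post_hc = posterior(prior_hc, boost)
post_hsp = posterior(prior_hsp, boost)

# With equal boosts, the posterior ratio is just the prior ratio.
posterior_ratio = post_hc / post_hsp
prior_ratio = prior_hc / prior_hsp

print(posterior_ratio, prior_ratio)
```

The only way to shrink the posterior ratio in this setup is to lower prior_hc relative to prior_hp, which is the point made in the text.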
Now, what Ptolemy presented in the Almagest, and Copernicus in De Revolutionibus, were fully specified models of the heavens, with all parameters filled in. From such models precise predictions can be deduced. If we let H P and H C be these fully specified Ptolemaic and Copernican hypotheses, respectively, then these hypotheses actually entail the observed phenomena, and the likelihoods that appear in Bayes’ theorem will be equal to unity. But any two items of evidence e 1 and e 2, if entailed by a hypothesis H, are probabilistically independent conditional on H, and hence not informationally relevant to each other, conditional on H.
Considerations similar to those discussed above will nevertheless apply. Since the unsaturated Copernican hypothesis h C has a greater ability to unify the celestial phenomena than does the unsaturated Ptolemaic hypothesis h P, if e is a body of evidence containing facts about the mean motions of planets and their retrogressions, we will have Pr(H C | e & b) ≫ Pr(H P | e & b) unless Pr(h C | b) ≪ Pr(h P | b).
The analysis of our second example is similar. It is not as clear in this case what rival hypotheses are to be considered, but let us contrast, by way of example, a theory (call it h 1) that posits a single power-law force affecting all the planets, with a theory (call it h 2) that posits independent power-law forces acting on each of the planets. Let b be a body of knowledge containing qualitative facts about the motion of the planets but not containing numerical information sufficient to fix the values of the parameters appearing in h 1 and h 2 corresponding to the exponents of the power laws and the strengths of the force fields. According to h 1, measurement of the rate of precession of any planet furnishes information about the precession rates of the other planets and also about the relation between the periods and distances of the planets. On the assumption of a single power-law force, once we have obtained the exponent of the power-law from a measurement of precession rates, measurement of the period and orbital radius of one planet permits the orbital radii of the others to be predicted from their periods (or vice versa).
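The way a single power-law force ties precession to the period-distance relation can be made concrete. The sketch relies on two standard results for nearly circular orbits under a central force proportional to r^n, neither of which is stated in this form in the text: the angle between successive apsides is π/√(3 + n), and the periods satisfy T² ∝ r^(1 − n). The function names are illustrative, and Earth is used as the reference planet purely for convenience.

```python
from math import pi, sqrt

def apsidal_angle(n):
    """Angle between successive apsides for a nearly circular orbit, force ~ r**n.

    Valid for n > -3; the value pi corresponds to no precession.
    """
    return pi / sqrt(3.0 + n)

def exponent_from_apsidal_angle(theta):
    """Invert the relation above to recover the power-law exponent n."""
    return (pi / theta) ** 2 - 3.0

def radius_from_period(period, ref_period, ref_radius, n):
    """Circular-orbit scaling T**2 ~ r**(1 - n), solved for the radius."""
    return ref_radius * (period / ref_period) ** (2.0 / (1.0 - n))

# Quiescent apsides: the apsidal angle is pi, which fixes the exponent.
n = exponent_from_apsidal_angle(pi)

# Reference planet: Earth (1 year, 1 AU). Mars's period is taken from the text.
mars_radius = radius_from_period(1.88, 1.0, 1.0, n)

print(n, round(mars_radius, 2))
```

On the single-force hypothesis the exponent recovered from one planet's apsides is reused for every other planet; on the independent-forces hypothesis each planet would need its own exponent and strength, so no such prediction follows.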
On h 2, these quantities could be regarded as informationally independent, but they need not be—it is, after all, compatible with h 2 that the exponents of the separate power laws for each of the planets be all equal, or approximately so, and one might even attach high prior probability to this being the case. However, one who takes h 2 as a serious rival to h 1 ought to attach non-negligible probability to the possibility that the several distinct forces do not perfectly mimic a single acceleration field—if one regards it as inevitable that these distinct forces perfectly mimic the action of a single force, it is hard to understand what is meant by calling them “distinct.” Suppose, then, that one does assign a non-negligible probability to the several force laws having different values of their parameters, either in the exponent of the power law or in the strength of the force. On such a probability assignment, h 1 will do a better job of unifying the phenomena than h 2, and will have a correspondingly higher degree of support.
The inference that results from an application of Newton's Rule 1, that “No more causes of natural things should be admitted than are both true and sufficient to explain their phenomena,” is justified in this case without any commitment to a principle that “Nature does nothing in vain” (Newton [1726] 1999, 784). What counts, instead, is the fact that the supposition that these forces are the same makes inevitable a relation that would be a puzzling coincidence on the supposition of independent forces.
5. Relation to Some Other Views
5.1. Whewell on Consilience
The sorts of cases considered in this paper as examples of unification are, at the very least, reminiscent of what Whewell called consilience of inductions:
the evidence in favour of our induction is of a much higher and forcible character when it enables us to explain and determine cases of a kind different from those which were contemplated in the formation of our hypothesis. The instances in which this has occurred, indeed, impress us with a conviction that the truth of our hypothesis is certain. No accident could give rise to such an extraordinary coincidence. No false supposition could, after being adjusted to one class of phenomena, exactly represent a different class, when the agreement was unforeseen and uncontemplated. That rules springing from remote and unconnected quarters should thus leap to the same point, can only arise from that being the point where truth resides.
Accordingly the cases in which inductions from classes of facts altogether different have thus jumped together, belong only to the best established theories which the history of science contains. And as I shall have occasion to refer to this particular feature in their evidence, I will take the liberty of describing it by a particular phrase; and will term it the Consilience of Inductions. (Whewell 1847, 65; Butts (ed.), 153)
On a common reading of Whewell's account of consilience, the unifying hypothesis is a fully specified hypothesis that entails evidence from disparate domains. If this were the whole story according to Whewell, there would be little room for a close parallel between our account and Whewell's, as two evidence statements that are both entailed by a hypothesis h are informationally irrelevant to each other, conditional upon h. [Footnote 12] We should bear in mind, however, that for Whewell induction is a multi-stage process. The three steps of induction, according to Whewell, are the selection of the idea, the construction of the conception, and the determination of the magnitudes (Whewell 1847, 380; Butts (ed.), 211). For scientific theories formulated mathematically, these steps become the selection of the independent variable, the construction of the formula, and the determination of the coefficients (382; 213). Thus, according to Whewell, the induction that leads to a law such as Newton's law of gravitation includes a stage in which one is considering a formula containing parameters, called by Whewell “coefficients,” to be filled in empirically. In such a case, a consilience of inductions would occur when the values of certain parameters can be determined from two different sorts of phenomena, and the values determined from one class of phenomena agree with those determined from another. Note that Whewell speaks of a hypothesis being “adjusted to one class of phenomena” and then found to represent another class; he is clearly contemplating situations in which initially unsaturated hypotheses are considered, and the values of their parameters filled in empirically.
Malcolm Forster's interpretation of Whewell's views on consilience seems to be correct, at least where mathematically formulated theories are concerned: “the essential part of the consilience of inductions is the demonstration of a law-like connection between magnitudes determined by different colligations of facts—the ‘over-determination’ of the coefficients” (Forster 1988, 76). A hypothesis that entails such a law-like connection will render the facts in the disparate domains connected by the law informationally relevant to each other. The account of unification offered in this paper, therefore, seems to mesh nicely with Whewell's conception of consilience.
5.2. Forster and Sober
Much of what has been said in this paper is also reminiscent of Forster and Sober's (1994) account of the epistemic virtue of simpler, more unified theories. Here, however, the resemblance goes less deep. According to Forster and Sober, the virtue of simplicity and unification is to be found in the resistance of such theories to the phenomenon of overfitting, which is the tendency for the best-fit curve from an overly capacious family of functions to track observational errors instead of the systematic dependencies one is trying to capture. Provided that the statistical distribution of the parameters estimated from the data around their optimal values is at least approximately normal, the tendency towards overfitting will be roughly linear in the number of degrees of freedom of the family of functions considered, and this is why a best-fit curve from a family of functions with fewer degrees of freedom may yield a better fit to future data than the best-fit curve from a family with more. For a fixed family of functions, overfitting is diminished if observational errors are diminished, or if the body of data is enlarged. Therefore, the weight to be attached to simplicity and unification, according to Forster and Sober, should decrease as observations are made more precise or more data points are added.
Let us consider two hypotheses h 1 and h 2, which assert that the force law for the influence of the Sun on the planets is to be found in families of functions F 1 and F 2, respectively. It may happen that, if F 1 is a family of fewer degrees of freedom, then h 1 makes the quiescence of the apsides yield more information about the harmonic law than does h 2. This may seem to provide an important point of connection between the account of the virtue of unification offered in this paper and that given by Forster and Sober. There is an important difference, however; on the account given here, what counts is not the number of degrees of freedom of a family of functions, but the extent to which a hypothesis makes one set of phenomena constrain another, which in our example consists of the extent to which the rate of change of the force over the distances explored by a single planet constrains the relative strength of the force at other planetary distances. On Forster and Sober's account, the virtue of “simpler, more unified” theories lies solely in the ability of such theories to resist overfitting the data, and hence is diminished when the data are made more accurate or when the number of data points is increased. On the account given here, the ability of a theory to unify a body of phenomena lends support to the theory that goes well beyond such resistance to overfitting and persists when random errors in the data are diminished. This is not to say that the statistical considerations invoked by Forster and Sober will play no role—they will, in estimation of parameters from the data, and in error analysis. But the chief importance of theoretical unification is not to be found in such considerations.
Appendix
The conditions that our measure of informational relevance, I(q, p | b), will be assumed to satisfy are the following:
i) Continuous definability in terms of probability. I(q, p | b) is a continuous real-valued function of the values that Pr( · | b) takes on Boolean combinations of p and q.
ii) Zero point. If q is probabilistically independent of p (that is, if Pr(q | p & b) = Pr(q | b)), then I(q, p | b) = 0.
iii) Additivity of independent information. If p 1 and p 2 are probabilistically independent of each other, and remain so under conditionalization on q (that is, if Pr(p 1 & p 2 | b) = Pr(p 1 | b) Pr(p 2 | b) and Pr(p 1 & p 2 | q & b) = Pr(p 1 | q & b) Pr(p 2 | q & b)), then I(q, p 1 & p 2 | b) = I(q, p 1 | b) + I(q, p 2 | b).
iv) Normalization. If p & b entails q, then I(q, p | b) = –Log2(Pr(q | b)).
Theorem. The conditions (i)–(iv) entail that I(q, p | b) = Log2(Pr(q | p & b)/Pr(q | b)).
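As a sanity check on the converse direction, one can verify numerically that the measure named in the Theorem does satisfy conditions (ii) through (iv). The sketch below does this on one hypothetical joint distribution over q, p 1, and p 2, constructed so that p 1 and p 2 are independent both unconditionally and conditional on q. It illustrates the conditions and is not a substitute for the proof; the tables and function names are illustrative.

```python
from itertools import product
from math import log2

# Hypothetical tables Pr(p1, p2 | q) and Pr(p1, p2 | ~q), with Pr(q) = 0.5.
# They are chosen so that p1 and p2 are independent given q and also
# independent unconditionally, as condition (iii) requires.
GIVEN_Q = {(True, True): 0.48, (True, False): 0.32,
           (False, True): 0.12, (False, False): 0.08}
GIVEN_NOT_Q = {(True, True): 0.02, (True, False): 0.18,
               (False, True): 0.38, (False, False): 0.42}

# Joint distribution over worlds (q, p1, p2).
JOINT = {(q, p1, p2): 0.5 * (GIVEN_Q if q else GIVEN_NOT_Q)[(p1, p2)]
         for q in (True, False)
         for p1, p2 in product((True, False), repeat=2)}

def pr(event, given=lambda w: True):
    """Conditional probability Pr(event | given) under JOINT."""
    den = sum(x for w, x in JOINT.items() if given(w))
    num = sum(x for w, x in JOINT.items() if given(w) and event(w))
    return num / den

def info(q, p):
    """The Theorem's measure: I(q, p | b) = log2(Pr(q | p & b) / Pr(q | b))."""
    return log2(pr(q, p) / pr(q))

def is_q(w): return w[0]
def is_p1(w): return w[1]
def is_p2(w): return w[2]
def is_p1_and_p2(w): return w[1] and w[2]
def is_q_and_p1(w): return w[0] and w[1]

# (ii) Zero point: p2 is independent of p1, so I(p2, p1 | b) should vanish.
zero_point = info(is_p2, is_p1)

# (iii) Additivity of independent information.
additive_lhs = info(is_q, is_p1_and_p2)
additive_rhs = info(is_q, is_p1) + info(is_q, is_p2)

# (iv) Normalization: the proposition q & p1 entails q.
normalization_lhs = info(is_q, is_q_and_p1)
normalization_rhs = -log2(pr(is_q))

print(zero_point, additive_lhs, additive_rhs, normalization_lhs, normalization_rhs)
```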
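Proof. I(q, p | b) is to be determined by the values Pr( · | b) takes on Boolean combinations of p and q, and these values will, in turn, be determined by Pr(p | b), Pr(q | b), and Pr(q | p & b). There will, therefore, be a continuous function F, such that
$$I(q, p \mid b) = F(\Pr(p \mid b), \Pr(q \mid b), \Pr(q \mid p \,\&\, b)).$$
Writing x = Pr(p | b), y = Pr(q | b), and z = Pr(q | p & b), conditions (ii) and (iii) become: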
ii′) If z = y then F(x, y, z) = 0.
iii′) If x 3 = x 1 x 2 and z 3 = z 1 z 2 / y, then F(x 3, y, z 3) = F(x 1, y, z 1) + F(x 2, y, z 2).