1. INTRODUCTION
Groups frequently make judgements that are based on aggregating the opinions of their individual members. A panel of market analysts at Apple or Samsung may estimate the expected number of sales of a newly developed cell phone. A group of conservation biologists may assess the population size of a particular species in a specific habitat. A research group at the European Central Bank may evaluate the merits of a particular monetary policy. Generally, such problems occur in any context where groups have to combine various opinions into a single group judgement (for a review, see Clemen 1989).
Even in cases of fully shared information, the assessment of the evidence will generally vary among the agents and depend on factors such as professional training, familiarity with similar situations in the past, and personal attitude toward the results. Thus, it will not come as a surprise that the individual judgements may differ. But how shall they be aggregated?
Often, some group members are more competent than others. Recognizing these experts may then become a crucial issue for improving group performance. Research in social psychology and management science has investigated the ability of humans to properly assess the expertise of other group members in such contexts (Clemen 1989; Bonner et al. 2002; Larrick et al. 2007). Most of this research stresses that recognizing experts is no easy task: perceived and actual expertise need not agree, data are noisy, questions may be too hard, and expertise differences may be too small to be relevant (e.g. Littlepage et al. 1995). This motivates a comparison of two strategies for group judgements: (i) deferring to the agent who is perceived as most competent, and (ii) taking the straight average of the estimates (Henry 1995; Soll and Larrick 2009). The overall outcomes suggest that the straight average is often surprisingly reliable, apparently being one of those ‘fast and frugal heuristics’ (Gigerenzer and Goldstein 1996) that help boundedly rational agents to make cost-effective decisions.
On the other hand, even if not explicitly recognized as such, experts tend to exert greater influence on group judgements than non-experts (Bonner et al. 2002). This motivates a principled epistemic analysis of the potential benefits of expertise-informed group judgements. We characterize conditions under which differentially weighted averages, fed by incomplete and perhaps distorted information on individual expertise, improve group performance compared with a straight average of the individual judgements. Our paper approaches this question from an analytical perspective, that is, with the help of a statistical model. We follow the social permutation approach (e.g. Bonner 2000) and model the agents as unique entities with different abilities. This differs notably from more traditional social combination research, where individual agents are modelled as interchangeable (e.g. Davis 1973). Our main result – that individual expertise makes a robust contribution to group performance – is somewhat surprising, given the generality of our conditions, which also allow for perturbations such as individual bias or correlations among the group members. Our analytical results therefore provide theoretical support to research on the recognition of experts in groups (e.g. Baumann and Bonner 2004), and they directly relate to empirical comparisons of differentially weighted group judgements with ‘composite judgements’, such as the group mean or median (Einhorn et al. 1977; Hill 1982; Libby et al. 1987; Bonner 2004).
Our work is also related to two other research streams. First, there is a thriving epistemological literature on peer disagreement and rational consensus, where consensus is mostly reached by deference to (perceived) experts. However, this debate either focuses on social power and mutual respect relations (e.g. Lehrer and Wagner 1981), or on principled philosophical questions about resolving disagreement (e.g. Elga 2007). By means of a performance-focused mathematical model, we hope to bring this literature closer to its primary target: the truth-tracking abilities of various epistemic strategies. There is also a vast literature on group decisions, preference and judgement aggregation (e.g. List 2012), but two crucial features of our inquiry – the aggregation of numerical values and the particular role of experts – do not play a major role there.
Second, there is a fast-growing body of literature on expert judgement and forecasting, which emerged from applied mathematics and statistics and has become a flourishing interdisciplinary field. This strand of research deals with the theoretical modelling of expert judgement, most notably the (Bayesian) reconciliation of probability distributions (Lindley 1983), but it also includes more practical questions such as the comparison of calibration methods, the choice of seed variables, analyses of past uses of expert judgement (Cooke 1991), and the study of general forecasting principles, such as the benefits of opinion diversity (Armstrong 2001; Page 2007). We differ from that approach in pooling individual (frequentist) estimators instead of subjective probability distributions, but we study similar phenomena, such as the impact of in-group correlations.
Admittedly, our baseline model is very simple, but this simplicity allows us to prove a number of results regarding the behaviour of differentially weighted estimates under correlation, bias and benchmark uncertainty. Here, our paper builds on analytical work in the forecasting and social psychology literature (Bates and Granger 1969; Hogarth 1978), following the approach of Einhorn et al. (1977).
The rest of the paper is structured as follows: we begin by explaining the model and stating conditions under which differentially weighted estimates outperform the straight average (Section 2). We then show that this relation is often preserved even when bias or mutual correlations are introduced (Sections 3 and 4). Subsequently, we assess the impacts of over- and underconfidence (Section 5). Finally, we discuss our findings and wrap up our conclusions (Section 6).
2. THE MODEL AND BASELINE RESULTS
Our problem is to find a good estimate of an unknown quantity μ. For convenience, we assume without loss of generality that μ = 0.
We model the group members’ individual estimates $X_i$, $i \le n$, as independent random variables that scatter around the true value μ = 0 with variance $\sigma_i^2$. The $X_i$ are unbiased estimators of μ, that is, they have the property $\mathbb {E}[X_i]=\mu$. This baseline model is inspired by the idea that the agents try to approach the true value with a higher or lower degree of precision, but have no systematic bias in either direction. The competence of an agent is explicated as the degree of precision in estimating the true value. No further assumptions on the distributions of the $X_i$ are made – only the first and second moments are fixed.
In this model, the question of whether the recognition of individual expertise is epistemically advantageous translates into the question of which convex combination of the $X_i$, $\hat{\mu } :=\sum _{i=1}^n c_i X_i$, outperforms the straight average $\overline{\mu } := \frac{1}{n} \sum _{i=1}^n X_i$. Standardly, the quality of an estimate is assessed by its mean square error (MSE), which can be calculated as

$$\text{MSE}(\hat{\mu}) = \mathbb{E}\left[\Big(\sum_{i=1}^n c_i X_i - \mu\Big)^2\right] = \sum_{i=1}^n c_i^2 \sigma_i^2, \qquad (1)$$

which is minimized by the following assignment of the $c_i$ (cf. Lehrer and Wagner 1981: 139):

$$c_i^* = \frac{1/\sigma_i^2}{\sum_{j=1}^n 1/\sigma_j^2}. \qquad (2)$$
Thus, naming the $c_i^*$ the ‘optimal weights’ is motivated by two independent theoretical reasons:
1. As argued above, for independent and unbiased estimates $X_i$ with variance $\sigma_i^2$, the mean square error of the overall estimate is minimized by the convex combination $X = \sum_i c_i^* X_i$. Thus, for a standard loss function, the $c_i^*$ are indeed the optimal weights.
2. Even when the square loss function is replaced by a more realistic alternative (Hartmann and Sprenger 2010), the $c_i^*$ can still define the optimal convex combination of individual estimates. In that case, we require stronger distributional assumptions.
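The baseline model and the optimal weights of equation (2) can be illustrated with a small numerical sketch. The code below is purely illustrative: the variance values and the function names `mse` and `optimal_weights` are our own assumptions, not part of the paper.

```python
# Baseline model: independent, unbiased estimators X_i with variances sigma2[i].
# For a convex combination sum(c_i * X_i), MSE = sum(c_i^2 * sigma_i^2).

def mse(weights, sigma2):
    """Mean square error of sum(c_i X_i) for independent unbiased X_i."""
    return sum(c * c * s2 for c, s2 in zip(weights, sigma2))

def optimal_weights(sigma2):
    """Weights proportional to the precisions 1/sigma_i^2 (equation (2))."""
    prec = [1.0 / s2 for s2 in sigma2]
    total = sum(prec)
    return [p / total for p in prec]

sigma2 = [1.0, 4.0, 9.0]          # hypothetical individual variances
n = len(sigma2)
c_star = optimal_weights(sigma2)
equal = [1.0 / n] * n

# More competent agents (smaller variance) receive larger weights,
# and the optimal weights never do worse than the straight average.
assert c_star[0] > c_star[1] > c_star[2]
assert mse(c_star, sigma2) <= mse(equal, sigma2)
```

A useful closed form follows directly from the sketch: with the optimal weights, the resulting MSE is the reciprocal of the summed precisions.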
The problem with these optimal weights is that each agent’s individual expertise would have to be known in order to calculate them. Given all the biases that actual deliberation is loaded with, e.g. ascription of expertise due to professional reputation, age or gender, or bandwagon effects, it is unlikely that the agents succeed at unravelling the expertise of all other group members (cf. Nadeau et al. 1993; Armstrong 2001).
Therefore, we widen the scope of our inquiry:
Question: Under which conditions will differentially weighted group judgements outperform the straight average?
A first answer is given by the following result where the differential weights preserve the expertise ranking:
Theorem 1 (First Baseline Result). Let $c_1, \ldots, c_n > 0$ be the weights of the individual group members, that is, $\sum_{i=1}^n c_i = 1$. Without loss of generality, let $c_1 \le \ldots \le c_n$. Further assume that for all i > j:

$$1 \le \frac{c_i}{c_j} \le \frac{c_i^*}{c_j^*}. \qquad (3)$$
Then the differentially weighted estimator $\hat{\mu } := \sum _{i=1}^n c_i X_i$ outperforms the straight average. That is, $\text{MSE}(\hat{\mu }) \le \text{MSE}(\overline{\mu })$, with equality if and only if $c_i = 1/n$ for all $1 \le i \le n$.
This result demonstrates that relative accuracy, as measured by pairwise expertise ratios, is a good guiding principle for group judgements as long as the relative weights are not too extreme.
The following result extends this finding to a case where the benefits of differential weighting are harder to anticipate: we allow the $c_i$ to lie anywhere in the interval $[1/n, c_i^*]$ (or $[c_i^*, 1/n]$), allowing for cases where the ranking of the group members is not represented correctly. One might conjecture that this phenomenon adversely affects performance, but this is not the case:
Theorem 2 (Second Baseline Result). Let $c_1, \ldots, c_n \in [0, 1]$ such that $\sum_{i=1}^n c_i = 1$. In addition, let $c_i\in [\frac{1}{n}, c_i^*]$ (respectively $c_i \in [c_i^*, \frac{1}{n}]$) hold for all $1 \le i \le n$. Then the differentially weighted estimator $\hat{\mu } := \sum _{i=1}^n c_i X_i$ outperforms the straight average. That is, $\text{MSE}(\hat{\mu }) \le \text{MSE}(\overline{\mu })$, with equality if and only if $c_i = 1/n$ for all $1 \le i \le n$.
Note that neither baseline result implies the other. The conditions of the second result can be satisfied even when the ranking of the group members differs from their actual expertise, and a violation of the second condition (e.g. $c_i^* = 1/n$ and $c_i = 1/n + \varepsilon$) is compatible with satisfaction of the first condition. So the two results are genuinely complementary.
We have thus shown that differential weighting outperforms straight averaging under quite general constraints on the individual weights, motivating the efforts to recognize experts in practice. The next sections extend these results to the presence of correlation and bias, thereby transferring them to more realistic circumstances.
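The Second Baseline Result lends itself to a direct numerical check: any weight vector whose components lie between the equal weight 1/n and the optimal weight $c_i^*$ should weakly improve on straight averaging. A minimal sketch in pure Python, with hypothetical variances of our own choosing:

```python
def mse(weights, sigma2):
    # MSE of sum(c_i X_i) for independent, unbiased estimators
    return sum(c * c * s2 for c, s2 in zip(weights, sigma2))

sigma2 = [1.0, 2.0, 5.0, 10.0]     # assumed variances
n = len(sigma2)
prec = [1.0 / s2 for s2 in sigma2]
c_star = [p / sum(prec) for p in prec]
equal = [1.0 / n] * n

# Any convex combination of the equal weights and the optimal weights has
# each component in [1/n, c_i*] (resp. [c_i*, 1/n]), so Theorem 2 applies --
# even though for t < 1 the weights need not reflect exact expertise ratios.
for t in (0.25, 0.5, 0.75, 1.0):
    c = [(1 - t) / n + t * cs for cs in c_star]
    assert mse(c, sigma2) <= mse(equal, sigma2)
```

Along this interpolation line the MSE is in fact monotone: it is a quadratic in t whose minimum sits at the optimal weights, so moving any distance from equal weights towards them can only help.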
3. BIASED AGENTS
The first extension of our model concerns biased estimates Xi, that is, estimates that do not centre around the true value μ = 0, but around Bi ≠ 0. We still assume that agents are honestly interested in getting close to the truth, but that training, experience, risk attitude or personality structure bias their estimates into a certain direction. For example, in assessing the impact of industrial development on a natural habitat, an environmentalist will usually come up with an estimate that significantly differs from the estimate submitted by an employee of an involved corporation – even if both are intellectually honest and share the same information.
For a biased agent i, the competence/precision parameter $\sigma_i^2$ has to be re-interpreted: it should be understood as the coherence (or non-randomness) of the agent’s estimates rather than as their accuracy. This value is indicative of accuracy only if the bias $B_i$ is relatively small.
Under these circumstances, we can identify an intuitive sufficient condition for differential weighting to outperform straight averaging.
Theorem 3. Let $X_1, \ldots, X_n$ be random variables with bias $B_1, \ldots, B_n$.

(a) Suppose that the $c_i$ in the estimator $\hat{\mu }= \sum _{i=1}^n c_i X_i$ satisfy one of the conditions of the baseline results (i.e. either $1 \le c_i/c_j \le c_i^*/c_j^*$, or $c_i \in [1/n, c_i^*]$ respectively $c_i \in [c_i^*, 1/n]$). In addition, let the following inequality hold:

$$\Big(\sum_{i=1}^n c_i B_i\Big)^2 \le \Big(\frac{1}{n}\sum_{i=1}^n B_i\Big)^2. \qquad (4)$$

Then differential weighting outperforms straight averaging, that is, $\text{MSE}(\hat{\mu }) < \text{MSE}(\overline{\mu })$.

(b) Suppose the following inequality holds:

$$\Big(\sum_{i=1}^n c_i B_i\Big)^2 - \Big(\frac{1}{n}\sum_{i=1}^n B_i\Big)^2 > \frac{1}{n^2}\sum_{i=1}^n \sigma_i^2. \qquad (5)$$

Then straight averaging outperforms differential weighting, that is, $\text{MSE}(\hat{\mu }) > \text{MSE}(\overline{\mu })$.
Intuitively, condition (4) states that the differentially weighted bias is smaller than or equal to the average bias. As one would expect, this property favourably affects the performance of the differentially weighted estimator. Condition (5) states, on the other hand, that if the difference between the mean square biases of the weighted and the straight average exceeds the mean variance of the agents, then straight averaging performs better than weighted averaging.
When the group size grows very large, both parts of Theorem 3 collapse into a single condition, as long as the biases and variances are bounded. This is quite obvious since the last term of (5) is of the order $\mathcal {O}(1/n)$. Theorem 3 applies in particular in the case where agents are biased in the same direction and less biased agents make more coherent estimates (that is, estimates with smaller variance):
Corollary 1. Let $X_1, \ldots, X_n$ be random variables with bias $B_1, \ldots, B_n \ge 0$ such that $c_i \ge c_j$ implies $B_i \le B_j$ (or vice versa for $B_1, \ldots, B_n \le 0$). Then, with the same definitions as above:
• $\text{MSE}(\overline{\mu }) \ge \text{MSE}(\hat{\mu })$.
• If there is a uniform group bias, that is, $B := B_1 = \ldots = B_n$, then $\text{MSE}(\overline{\mu })-\text{MSE}(\hat{\mu })$ is independent of B.
So even if all agents have followed the same training, or have been raised in the same ideological framework, expertise recognition does not multiply that bias, but helps to increase the accuracy of the group’s judgement. In particular, if there is a uniform bias in the group, the relative advantage of differential weighting is independent of the size of the bias. All in all, these results demonstrate the importance of expertise recognition even in groups where the members share a joint bias – a finding that is especially relevant for practice.
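The second part of Corollary 1 is easy to verify numerically: with biased estimators, the MSE splits into a variance term plus the squared aggregated bias, and under a uniform bias B the aggregated bias of every weight vector equals B, so the MSE difference cannot depend on B. A sketch with hypothetical numbers (pure Python; the decomposition in `mse` follows the standard bias–variance identity):

```python
def mse(weights, sigma2, bias):
    # MSE = variance term + squared aggregated bias:
    #   sum(c_i^2 sigma_i^2) + (sum(c_i B_i))^2
    var = sum(c * c * s2 for c, s2 in zip(weights, sigma2))
    b = sum(c * bi for c, bi in zip(weights, bias))
    return var + b * b

sigma2 = [1.0, 4.0, 9.0]           # assumed variances
n = len(sigma2)
prec = [1.0 / s2 for s2 in sigma2]
c_star = [p / sum(prec) for p in prec]
equal = [1.0 / n] * n

# Uniform group bias: the advantage of differential weighting is the same
# for every value of B.
diffs = []
for B in (0.0, 1.0, 10.0):
    bias = [B] * n
    diffs.append(mse(equal, sigma2, bias) - mse(c_star, sigma2, bias))

assert max(diffs) - min(diffs) < 1e-9   # independent of B
assert diffs[0] > 0                     # weighting still helps
```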
4. INDEPENDENCE VIOLATIONS
We turn to violations of independence between the group members. Consider first the following fact that compares two groups with different degrees of correlation:
Fact 1. If $0\le \mathbb {E}\left[ X_i X_j \right]\le \mathbb {E}\left[ Y_i Y_j \right]$ for all $i\ne j \le n$, and $\mathbb {E}[X_i^2]=\mathbb {E}[Y_i^2]$ for all $i \le n$, then both straight averaging and weighted averaging applied to the $X_i$ yield a lower mean square error than the same procedures applied to the $Y_i$.
Fact 1 shows that less correlated groups perform better, ceteris paribus. For practical purposes, this suggests that heterogeneity of a group is an epistemic virtue since strong correlations between the agents are less likely to occur, making the overall result more accurate (cf. Page Reference Page2007).
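Fact 1 can be checked directly: collecting the second moments $\mathbb{E}[X_iX_j]$ in a matrix, the MSE of any weighted average is the quadratic form $\sum_{i,j} c_i c_j \mathbb{E}[X_iX_j]$, so raising the off-diagonal terms while keeping the diagonal fixed can only raise the MSE when all weights are non-negative. A sketch (pure Python; the moment values are hypothetical, and, as in the proof of Fact 1, the weights depend only on the variances):

```python
def mse_cov(weights, cov):
    # MSE of sum(c_i X_i) for unbiased X_i with E[X_i X_j] = cov[i][j]
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))

var = [1.0, 2.0, 4.0]              # common diagonal E[X_i^2] = E[Y_i^2]

def second_moments(off):
    # same variances, off-diagonal moments all equal to `off` >= 0
    return [[var[i] if i == j else off for j in range(3)] for i in range(3)]

prec = [1.0 / v for v in var]
c_star = [p / sum(prec) for p in prec]
equal = [1.0 / 3] * 3

low, high = second_moments(0.1), second_moments(0.5)
# The less correlated group does better, for both aggregation procedures:
assert mse_cov(equal, low) <= mse_cov(equal, high)
assert mse_cov(c_star, low) <= mse_cov(c_star, high)
```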
Regarding the comparison of straight and weighted averaging, we can show the following result:
Theorem 4. Let $X_1, \ldots, X_n$ be unbiased estimators, that is, $\mathbb {E}[X_i]=\mu =0$, and let the $c_i$ satisfy the conditions of one of the baseline results, with $\hat{\mu }$ defined as before. Let $I\subseteq \lbrace 1, \ldots, n\rbrace$ be a subset of the group members with the property

$$\mathbb{E}[X_i X_k] \le \mathbb{E}[X_j X_k] \quad \text{for all } i, j, k \in I \text{ such that } c_i \ge c_j \text{ and } k \ne i, j. \qquad (6)$$
(i) Correlation vs. Expertise. If I = {1, . . ., n}, then weighted averaging outperforms straight averaging, that is, $\text{MSE}(\hat{\mu }) \le \text{MSE}(\overline{\mu })$.
(ii) Correlated Subgroup. Assume that $\mathbb {E}\left[ X_i X_j \right] = 0$ for all $i \ne j$ with $i\notin I$ or $j\notin I$, and that

$$\frac{1}{|I|}\sum_{i\in I}\frac{1}{\sigma_i^2} \le \frac{1}{n}\sum_{i=1}^n \frac{1}{\sigma_i^2}. \qquad (7)$$
Then weighted averaging still outperforms straight averaging, that is, $\text{MSE}(\hat{\mu }) \le \text{MSE}(\overline{\mu })$.
To fully understand this theorem, we have to clarify the meaning of condition (6). Basically, it says that within group I, experts are less correlated with the other (sub)group members than non-experts are.
Once we have understood this condition, the rest is straightforward. Part (i) states that if I equals the entire group, then differential weighting has an edge over straight averaging. That is, the benefits of expertise recognition are not offset by the perturbations that mutual dependencies may introduce. Arguably, the generality of the result is surprising since condition (6) is quite weak. Part (ii) states that differential weighting is also superior when the correlated subgroup is uncorrelated with the rest of the group, as long as the average competence in the subgroup is lower than the overall average competence (see equation (7)).
It is a popular opinion (e.g. Surowiecki Reference Surowiecki2004) that correlation of individual judgements is one of the greatest dangers for relying on experts in a group. To some extent, this opinion is vindicated by Fact 1 in our model. However, expertise-informed group judgements may still be superior to composite judgements, as demonstrated by Theorem 4. The interplay of correlation and expertise is subtle and not amenable to broad-brush generalizations.
5. OVER- AND UNDERCONFIDENCE
We now consider a specific family of weights $c_i$ in order to study how the group members’ self-assessment of their own quality affects group performance as a whole, modelled again as unbiased estimates $X_i$ with variance $\sigma_i^2$.
Suppose that the group members have some idea of their own competence. That is, they are able to position themselves in relation to a commonly known benchmark: they are able to assess how much better or worse they expect themselves to perform compared with a default agent, modelled as an unbiased random variable with variance $s^2$. Such a scenario may be plausible when agents have a track record of their performance, or obtain performance feedback. The agents then express how much weight they should ideally get in a group of n − 1 default agents:

$$c_i = \frac{s^2}{s^2 + (n-1)\,\sigma_i^2}. \qquad (8)$$
Assume further that every agent uses the same benchmark, that these weights also determine to what extent a group member compromises his or her own position, and that decision-making takes place on the basis of the normalized $c_i$. It can then be shown (proof omitted) that the differentially weighted estimator $\hat{\mu }$ defined by equation (8) outperforms straight averaging – in fact, this is entailed by the Second Baseline Result (Theorem 2).
Here, we want to study how over- and underestimating the competence of a ‘default agent’ will affect group performance. Is it always epistemically detrimental when the agents misguess the group competence?
The answer is, perhaps surprisingly, no. To explain this result, we first observe that the less confidence we have in the group (i.e. the larger $s^2$), the more the weighted average resembles the straight average. Recalling equation (8), we note that all $c_i$ will be very close to 1, so that after normalization each agent receives a weight close to 1/n. This implies that the expertise-informed average will roughly behave like the straight average.
Conversely, if the group is perceived as competent (i.e. small $s$), then the $c_i$ will typically not be close to 1, so that the differential weights will diverge significantly from the straight average. This intuitive insight leads to the following theorem:
Theorem 5. Let $\hat{\mu }_{s^2}$ and $\hat{\mu }_{{\tilde{s}}^2}$ be two weighted, expertise-informed estimates of μ, defined according to equation (8) with benchmarks $s^2$ and ${\tilde{s}}^2$, respectively. Then $\text{MSE}(\hat{\mu }_{s^2}) \le \text{MSE}(\hat{\mu }_{{\tilde{s}}^2})$ if and only if $s^2 \le {\tilde{s}}^2$.
It can also be shown (proof omitted) that this procedure approximates the optimal weights $c_i^*$ as the perceived group competence approaches perfection, that is, as s → 0. In other words, as long as the group members judge themselves accurately, optimism with regard to the abilities of the other group members is epistemically favourable. On the other hand, overconfidence in one’s own abilities relative to the group typically deteriorates performance.
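Theorem 5 and the limiting behaviour can be illustrated with a small sketch. As a candidate closed form for the benchmark weights of equation (8) we assume $c_i = s^2/(s^2 + (n-1)\sigma_i^2)$ – the optimal weight of agent i in a group of n − 1 default agents with variance $s^2$; the variance values below are hypothetical.

```python
def self_weights(sigma2, s2):
    # Assumed closed form: agent i's ideal weight among n-1 default agents
    # with variance s2, then normalized to sum to one.
    n = len(sigma2)
    raw = [s2 / (s2 + (n - 1) * sg) for sg in sigma2]
    total = sum(raw)
    return [r / total for r in raw]

def mse(weights, sigma2):
    # MSE of sum(c_i X_i) for independent, unbiased X_i
    return sum(c * c * sg for c, sg in zip(weights, sigma2))

sigma2 = [1.0, 4.0, 9.0]           # hypothetical variances

# Theorem 5: the smaller the benchmark s^2 (the more competent the group is
# perceived to be), the smaller the MSE of the weighted estimate.
mses = [mse(self_weights(sigma2, s2), sigma2) for s2 in (0.01, 1.0, 100.0)]
assert mses[0] <= mses[1] <= mses[2]

# As s -> 0 the normalized weights approach the optimal weights c_i*.
prec = [1.0 / sg for sg in sigma2]
c_star = [p / sum(prec) for p in prec]
w_small = self_weights(sigma2, 1e-9)
assert all(abs(a - b) < 1e-6 for a, b in zip(w_small, c_star))
```

For large $s^2$ the normalized weights approach 1/n, matching the observation above that low confidence in the group pulls the weighted average towards the straight average.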
6. DISCUSSION
We have set up an estimation model of group decision-making in order to study the effects of individual expertise on the quality of a group judgement. We have shown that, in general, taking into account relative accuracy positively affects the epistemic performance of groups. Translated into our statistical model, this means that differential weighting outperforms straight averaging, even if the ranking of the experts is not represented accurately.
The result remains stable over several representative extensions of the model, such as various forms of bias, violations of independence, and over- and underconfident agents (Theorems 3–5). In particular, we demonstrated that differential weighting is superior (i) if experts are, on average, less biased; (ii) for a group of uniformly biased agents; (iii) if experts are less correlated with the rest of the group than other members. We also showed that uniform overconfidence in one’s own abilities is detrimental for group performance whereas (over)confidence in the group may be beneficial. These properties may be surprising and demonstrate the stability and robustness of expertise-informed judgements, implying that the benefits of recognizing experts may offset the practical problems linked with that process.
Our model can in principle also be used for describing how groups actually form judgements. In that case, the involved tasks should be neither too intellective (that is, having a demonstrable solution) nor too judgemental (Laughlin and Ellis 1986): in highly intellective tasks, groups will typically not perform better than the best individual (i.e. the one who has solved the task correctly). This differs from our model, where every agent has only partial knowledge of the truth. On the other hand, if the task is too judgemental, any epistemic component is removed and the individual weights may actually be based on the centrality of a judgement, such as in Hinsz’s (1999) SDS-Q scheme.
Finally, we name some distinctive traits of our model. First, unlike other models of group judgements that are detached from the group members’ individual abilities (Davis 1973; DeGroot 1974; Lehrer and Wagner 1981; Hinsz 1999), it is a genuinely epistemic model, evaluating the performance of different ways of making a group judgement. Thus, our model can be used normatively, for supporting the use of differential weights in group decisions, but also descriptively, for fitting the results of group decision processes.
Second, we did not make any specific distributional assumptions on how the agents estimate the target value. Our assumptions merely concern the first and second moments (bias and variance). We consider this parsimony a prudent choice because those distributions will vary greatly in practice, and we do not have epistemic access to them. Classical work in the social combination literature makes much more specific distributional assumptions (e.g. the multinomial distributions in Thomas and Fink (1961) and Davis (1973)), restricting the scope of that analysis.
Third, we are not aware of other analytical models that take into account important confounders such as correlation, bias and over-/underconfident agents. Thus, we conclude that our model makes a substantial contribution to understanding the epistemic benefits of expertise in group judgements.
ACKNOWLEDGEMENTS
Dominik Klein thanks the Netherlands Organisation for Scientific Research (NWO) for supporting his research through Vidi grant no. 016.094.345, held by Eric Pacuit. Jan Sprenger thanks the Netherlands Organisation for Scientific Research (NWO) for supporting his research through Veni grant no. 016.104.079 and Vidi grant no. 016.144.342.
APPENDIX. PROOF OF THE THEOREMS
We will need the following inequalities repeatedly in the subsequent proofs. Let $c_1, \ldots, c_n > 0$. Then

$$\sum_{i=1}^n \frac{1}{c_i} \ge \frac{n^2}{\sum_{i=1}^n c_i}, \qquad (9)$$

with equality if and only if $c_1 = \ldots = c_n$. Moreover,

$$\sum_{i=1}^n c_i^2 \ge \frac{1}{n}\Big(\sum_{i=1}^n c_i\Big)^2, \qquad (10)$$

again with equality if and only if $c_1 = \ldots = c_n$. Both inequalities are special cases of the Power Mean Theorem (cf. Wilf 1985: 258).
For the First Baseline Result, we need the following
Lemma 1. Let k < n and let $(c_1, \ldots, c_n)$ be a sequence such that

(1) $\sum_{i=1}^n c_i = s$ for some s > 0, and all $c_i$ are positive;

(2) $c_1 = \ldots = c_k$ and $c_{k+1} = \ldots = c_n$;

(3) $c_k \le c_{k+1}$ and $1\le \frac{c_{k+1}}{c_k}\le \frac{c_{k+1}^*}{c_k^*}$.

Further assume that $\sigma_1 \ge \ldots \ge \sigma_n$. Then

$$\sum_{i=1}^n c_i^2 \sigma_i^2 \le \frac{s^2}{n^2}\sum_{i=1}^n \sigma_i^2. \qquad (11)$$
Furthermore, we show that under the above conditions (i.e. $\sum_{i=1}^n c_i = s$), the value of the sum $\sum_{i=1}^n c_i^2\sigma_i^2$ decreases as the quotient $\frac{c_{k+1}}{c_k}$ increases.
Proof of Lemma 1. Fix r such that
• $c_i=\frac{s}{n}-\frac{r}{k}$ for i ⩽ k
• $c_i=\frac{s}{n}+\frac{r}{n-k}$ for i > k
Then we have to show that:
The above equation reduces to:
Now the left hand side of the above equation is a quadratic function in r with zeros at 0 and
Since the σi are ordered decreasingly we get
Now this is a function of the form $\frac{kx-a}{x+b}$ with a, b > 0. Since these functions are increasing for x > −b, the inequality above can be strengthened to
Recall that $\frac{c_{k+1}}{c_k}\le \frac{c_{k+1}^*}{c_k^*}=\frac{\sigma _k}{\sigma _{k+1}}=:\sigma$. Inserting this transforms the above equation into:
Our assumptions about the ci translate into
This transforms to
In particular $r < r_0$, finishing the proof of (11). For the last statement of Lemma 1, observe that the left hand side of (11) is a quadratic function with minimum at $\frac{1}{2}r_0$, and that $r\le \frac{1}{2}r_0$. $\Box$
Proof of Theorem 1. By assumption the $c_i$ are ordered increasingly, thus the $\sigma_i$ are ordered decreasingly. For a vector of weights $\mathbf {w}\in \mathbb {R}^n$ (i.e. all $w_i$ positive and $\sum_i w_i = 1$), we denote the mean square error of the estimator $\sum_i w_i X_i$ by $\Psi(\mathbf{w})$, that is,

$$\Psi(\mathbf{w}) = \sum_{i=1}^n w_i^2 \sigma_i^2. \qquad (12)$$

Thus for $\mathbf{c} = (c_1, \ldots, c_n)$ as in the theorem we have to show $\Psi(\mathbf{c}) \le \Psi(\mathbf{e})$, where $\mathbf{e}$ is the equal weight vector $(\frac{1}{n},\ldots ,\frac{1}{n})$. To this end we will construct a sequence of weight vectors $\mathbf{e} = d^0, \ldots, d^{n-1} = \mathbf{c}$ such that:
(i) each $d^i$ satisfies the assumptions of Theorem 1;

(ii) for $d^i = (d_1, \ldots, d_n)$, there is some $k \in \mathbb {N}$ such that

$d_1 = \ldots = d_k$ and $d_1 > c_1, \ldots, d_k > c_k$;

$d_j = c_j$ for $k<j\le k+i$ (where i is the index of $d^i$);

$d_{k+i+1} = \ldots = d_n$ and $d_{k+i+1} \le c_{k+i+1}, \ldots, d_n \le c_n$;

(iii) $\Psi(d^{i-1}) \ge \Psi(d^i)$.
Thus $d^{n-1} = \mathbf{c}$ and $\Psi(\mathbf{c}) \le \Psi(\mathbf{e})$ as desired. The $d^i$ are constructed inductively as follows: assume $d^{i-1} = (d'_1, \ldots, d'_n)$ has already been constructed. If i = 1, let k be the unique index such that $c_k<\frac{1}{n}$ and $c_{k+1}\ge \frac{1}{n}$. If i > 1, let k be as in the above conditions for $d^{i-1}$. First note that if k = 0, then $d'_j \le c_j$ for all j and thus $d^{i-1} = \mathbf{c}$ since both are weight vectors, and we are done. Thus assume k ⩾ 1 for the rest of the proof. With a similar argument, we can show that $k + i + 1 \le n$. Now choose the maximal $r \in \mathbb {R}$ that satisfies
By the above conditions, r ⩾ 0. Then define $d^i = (d_1, \ldots, d_n)$ by:
• $d_j = d^{\prime }_j-\frac{r}{k}$ for j ⩽ k;
• $d_j = c_j$ for $k < j \le k + i$;
• $d_j = d^{\prime }_j+\frac{r}{n-k-i-1}$ for j ⩾ k + i + 1.
To see that $d^i$ satisfies conditions (i)–(iii), first note that since r was chosen to be maximal, one of the two inequalities in (13) has to be an equality. Thus we either have $d_k = c_k$ or $d_{k+i+1} = c_{k+i+1}$, and condition (ii) is satisfied. Further note that
Using that the $c_i$ are ordered increasingly, it is easy to see that $d^i$ satisfies the assumptions of Theorem 1. Furthermore, applying the monotonicity part of Lemma 1 to the set of indices $I := \lbrace 1, \ldots, k\rbrace \cup \lbrace i + k + 1, \ldots, n\rbrace$, we get $\sum_{j \in I} d_j^2\sigma_j^2 \le \sum_{j \in I} d_j'^2\sigma_j^2$. Thus $\Psi(d^i) \le \Psi(d^{i-1})$ since $d^{i-1}$ and $d^i$ coincide outside I. This finishes the proof. $\Box$
Proof of Theorem 2. We would like to show that the mean square error of the straight average $\overline{\mu } := (1/n) \sum _{i=1}^n X_i$ exceeds the mean square error of the weighted estimate $\hat{\mu }$. The MSE difference can be calculated as
where we have made use of $\mathbb {E}[X_i \, X_j] = 0, \; \forall i \ne j$, and of $c_i^* = ( \sum _{j=1}^n \frac{\sigma _i^2 }{\sigma _j^2} )^{-1}$ (cf. equation (2)). Thus, instead of considering Δ, it suffices to show that
To this end, let $I_i := [1/n, c_i^*]$ (respectively $[c_i^*, 1/n]$) and let $\mathcal {Q}:= I_1\times \ldots \times I_n$. Then,
defines the ‘domain’ of our theorem, and it is a polygon. Moreover, since $\sum _i \frac{n^2}{c_i^*}c_i^2$ is a positive definite quadratic form in the $c_i$, we get that $\Delta'^{-1}([0, \infty))$ is convex. Thus, it suffices to show that $\Delta'$ is positive on the vertices of $\mathcal {D}$. Note that since $\lbrace x \mid \sum_i x_i = 1\rbrace$ is of dimension n − 1, the vertices of $\mathcal {D}$ are of the form $v = (c_1^*, \ldots, c_{k-1}^*, c_k, 1/n, \ldots, 1/n)$ – the ordering is assumed for convenience, and $c_k$ is defined such that $\Vert v\Vert _1 = 1$. Thus we have to show that $\Delta'(c_1^*, \ldots, c_{k-1}^*, c_k, 1/n, \ldots, 1/n) \ge 0$.
In the case k = 1, the desired inequality holds trivially since $c_k = 1 - (n-1)\cdot \frac{1}{n} = \frac{1}{n}$. Thus we assume k > 1 for the remainder of this proof. Let l denote the real number satisfying
Observe that for $c_i=\frac{1}{n}$ the corresponding summands in Δ′ vanish. Thus we have to show that
Using the definition of l from above and inequality (9) gives $\sum _{i=1}^{k-1}\frac{1}{c_i^*} \ge (k-1)^2/(\sum _{i=1}^{k-1} c_i) \ge \frac{n(k-1)}{l}$. Thus, it suffices to show
Since the ci add up to one, we can express the dependency between l and ck by
Inserting this into (14) gives
Since the first factor is always positive, it suffices to show that the factor in the square brackets, denoted by P(l), is positive for every l that can occur in our setting. We do this by a case distinction on the value of $c_k^*$.
Case 1. $c_k^* \le 1/n$. Noting $c_k \in [c_k^*,\frac{1}{n}]$ and the dependency (15) between l and $c_k$, we have to show that $P(l) \ge 0$ for all $l \in [1,\frac{k-nc_k^*}{k-1}]$. We observe that P is a polynomial of third order, with zeros of P given by P(1) = 0 and
with $r_+$ denoting the larger of these two numbers. With some algebra it also follows that $P'(1) \ge 0$ if and only if $c_k^* \le 1/n$. From the functional form of P(l) – a polynomial of the third degree with negative leading coefficient – we can then infer that l = 1 must be the middle zero of P. To prove that $P(l) \ge 0$ in the critical interval, it remains to show that the rightmost zero satisfies $r_+ \ge \frac{k-nc_k^*}{k-1}$:
completing the proof for the case $c_k^* \le 1/n$.
Case 2. $c_k^* \ge 1/n$. In this case we are dealing with the interval $l \in [\frac{k-nc_k^*}{k-1},1]$. The same calculations as above yield
in particular $r_+ < 1$. Thus l always lies between the middle and the rightmost zero of P(l), and in particular, $P(l) \ge 0$ for all $l \in [\frac{k-nc_k^*}{k-1},1]$. $\Box$
Proof of Theorem 3. Let the $X_i$ centre around $B_i > 0$. Then $\mathbb {E}[X_i - B_i] = 0$, and we observe
Analogously, we obtain
As in the proof of Theorem 2, we define $\Delta (c_1, \ldots , c_n) := \text{MSE}(\overline{\mu }) - \text{MSE}(\hat{\mu })$ as the difference in mean square error between both estimates and show that $\Delta(c_1, \ldots, c_n) \ge 0$ if equation (4) is satisfied.
By Theorem 1 and/or Theorem 2, the first line is greater than or equal to zero, and by equation (4), the second line is also non-negative. Thus $\Delta(c_1, \ldots, c_n) \ge 0$, showing the superiority of differential weighting.
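As a concrete sanity check of the invoked superiority result: for the special case of independent, unbiased estimators with inverse-variance weights (an illustrative instantiation; the paper's weights $c_i$ need not be exactly these), the weighted average's MSE never exceeds the straight average's, which is equivalent to the Cauchy–Schwarz bound $(\sum_i \sigma_i^2)(\sum_i \sigma_i^{-2}) \ge n^2$:

```python
import random

def mse_straight(sigmas):
    """MSE of the straight average of n independent unbiased estimators."""
    n = len(sigmas)
    return sum(s * s for s in sigmas) / (n * n)

def mse_inverse_variance(sigmas):
    """MSE of the inverse-variance weighted average (optimal linear weights
    under independence and unbiasedness)."""
    return 1.0 / sum(1.0 / (s * s) for s in sigmas)

random.seed(1)
for _ in range(1000):
    sigmas = [random.uniform(0.1, 5.0) for _ in range(random.randint(2, 8))]
    assert mse_inverse_variance(sigmas) <= mse_straight(sigmas) + 1e-12
```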
For the second part of the theorem, we just observe that
$\Box$
Proof of Corollary 1. It is easy to see that the conditions of the corollary satisfy the requirements of part (a) of Theorem 3. This yields the desired result for the first part of the corollary. For the second part, let the $X_i$ all centre around $B \ne 0$. Then $X_i - B$ is unbiased, and we observe
Therefore, under the conditions of the theorem,
showing that Δ only depends on the centred estimates.$\Box$
Proof of Fact 1. First we deal with straight averaging:
The proof exploits that $X_i$ and $Y_i$ have the same variance, thus $\mathbb {E}\left[ X_i^2\right]=\mathbb {E}\left[ Y_i^2 \right]$. The proof for differential weights is similar, making use of the fact that the $c_i$ are the same for $X_i$ and $Y_i$ because they only depend on the variance of the random variable.$\Box$
Proof of Theorem 4, part (i). Assume without loss of generality that $c_i \ge c_{i+1}$ for all $i < n$. Thus, our assumption on the $\mathbb {E}[X_i X_j]$ reduces to $\mathbb {E}[X_iX_k]\le \mathbb {E}[X_jX_k]$ for $i \ge j \ne k$. We first show the theorem under the assumption that all $\mathbb {E}[X_i X_j]$ with $i \ne j$ are equal, say $\mathbb {E}[X_i X_j]=\gamma$. By Theorem 1 and/or 2, it suffices to show that
Inserting $\mathbb {E}[X_i X_j]=\gamma$ this reduces to
The point $(1/n, \ldots, 1/n)$ is a global minimum of the function $f(x) = \sum_i x_i^2$ under the constraints $x_1, \ldots, x_n \ge 0$ and $\sum_i x_i = 1$. Thus we have
Observing $\sum_{i=1}^n \sum_{j=1}^n c_i c_j = (\sum_{i=1}^n c_i)^2 = 1$ and combining this equality with (17) and (18), we obtain
thus proving the statement in the case that all $\mathbb {E}[X_i X_j]$ are the same.
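The global-minimum claim used above – on the simplex, $f(x)=\sum_i x_i^2$ attains its minimum $1/n$ exactly at the uniform point – can be checked numerically (a minimal sketch; the sampling scheme is ours):

```python
import random

def sum_sq(xs):
    """f(x) = sum of squares of the coordinates."""
    return sum(x * x for x in xs)

random.seed(2)
n = 6
uniform = [1.0 / n] * n
for _ in range(1000):
    raw = [random.random() for _ in range(n)]
    total = sum(raw)
    point = [r / total for r in raw]  # random point on the simplex
    # no simplex point beats the uniform point (value 1/n)
    assert sum_sq(point) >= sum_sq(uniform) - 1e-12
```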
For the general case, let us assume that not all $c_i$ are the same (otherwise the theorem is trivially true). Thus we either have $c_1 > c_{n-1}$ or $c_2 > c_n$ since the $c_i$ are ordered decreasingly. In the following, we assume $c_2 > c_n$; the other case works with a similar argument. First observe that
Thus, we can concentrate on $\lbrace \mathbb {E}[X_i X_j] \mid i>j\rbrace$. We fix a natural number $c$ and let $S_c$ be the set of all vectors $(\mathbb {E}[X_i X_j])_{(i>j)}$ fulfilling the conditions of our theorem and satisfying $\sum _{i> j}\mathbb {E}[X_iX_j]=c$. We then consider the functional
on $S_c$. Observe that every $S_c$ contains exactly one point $e_{eq}$ where all $\mathbb {E}[X_i X_j]$ are equal. By the first part of this proof, $\tilde{\varphi }(e_{eq})$ is non-negative. Thus, it suffices to show that $e_{eq}$ is an absolute minimum of $\tilde{\varphi }$ on $S_c$. First, observe that the value of $\frac{1}{n^2} \sum _{i=1}^n \sum _{j<i}\mathbb {E}\left[ X_i X_j \right]$ is constant, equal to $\frac{c}{n^2}$, on $S_c$; thus it suffices to show that
attains its maximum on $S_c$ at $e_{eq}$.
To do so, we show the following: for every $e \in S_c$ with $e \ne e_{eq}$ there is some $e' \in S_c$ with $\varphi(e') > \varphi(e)$; in particular, $\varphi$ does not attain its maximum on $S_c$ at $e$. Thus assume that $e=(\mathbb {E}[X_i X_j])_{(i> j)}\in S_c$ is given. Since $e \ne e_{eq}$, there are indices $s > t$ and $k > l$ such that $\mathbb {E}[X_s X_t]\ne \mathbb {E}[X_k X_l]$. Furthermore, we can assume that $t \ge l$. Without loss of generality (by potentially replacing one of the two entries with $\mathbb {E}[X_sX_l]$) we can assume that either $s = k$ or $t = l$. In the following we assume $s = k$; the other case works similarly. The idea of the following construction is this: we show that moving towards a more equal distribution of the entries $\mathbb {E}[X_iX_j]$ increases $\varphi(e)$. In particular, we construct $e^{\prime }=(\mathbb {E}^{\prime }[X_i X_j])_{(i> j)}\in S_c$ as follows: in every row $r_i:=\left\langle \mathbb {E}[X_i X_1]\ldots \mathbb {E}[X_i X_{i-1}]\right\rangle$ of $e$ we replace all the entries by their arithmetic mean. Formally, for all $i$ and $j$ (the value being independent of $j$):
Trivially this operation satisfies for all i:
and thus also for the double sum:
In particular, $e'$ is in $S_c$. Furthermore, we have assumed that the $c_i$ are ordered decreasingly. Recall that $c_k > c_j$ implies $\mathbb {E}[X_i X_k]\le \mathbb {E}[X_i X_j]$ by assumption; therefore the rows $r_i$ were ordered increasingly, and thus the rows of $e' - e$:
are ordered decreasingly (since the rows of $e'$ are constant). In particular, we have for any $i$:
where the ⩽ comes from the fact that both cj and $\mathbb {E}^{\prime }[X_iX_j]-\mathbb {E}[X_iX_j]$ are decreasing in j. Summing that up over all i we get that
Thus we have $\varphi(e') \ge \varphi(e)$ as desired. Now observe that (21) for $i = s$ is the following:
with both
and
By construction we have $\mathbb {E}[X_s X_t]\ne \mathbb {E}[X_s X_l]$, thus we would have a strict inequality in the last summand (and thus in the entire sum) if we knew that $c_t \ne c_l$. Unfortunately, this is not always the case. However, we have put ourselves in a situation where applying the same construction again, with $\mathbb {E}^{\prime }[X_2X_1]$ and $\mathbb {E}^{\prime }[X_nX_1]$ replacing $\mathbb {E}[X_sX_t]$ and $\mathbb {E}[X_sX_l]$, yields the desired result (since we have assumed that $c_2 > c_n$). To see this, observe that
• $\mathbb {E}[X_2X_1]=\mathbb {E}^{\prime }[X_2X_1]$ by construction
• $\mathbb {E}^{\prime }[X_sX_1]>\mathbb {E}[X_sX_1]$ since $\mathbb {E}[X_sX_t]\ne \mathbb {E}[X_s X_l]$ and $\mathbb {E}[X_sX_1]$ is the minimal element in the row $r_s$
• $\mathbb {E}[X_2X_1]\le \mathbb {E}[X_sX_1]$ by assumption
Thus we have
By assumption we have $c_2 > c_n$, and repeating the construction from above with columns replacing rows and $\mathbb {E}^{\prime }[X_2X_1], \mathbb {E}^{\prime }[X_nX_1]$ as the two reference points yields the desired result.$\Box$
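A key step in the proof above is a rearrangement-type fact: for a decreasing weight sequence $(c_j)$ and a decreasing sequence $(d_j)$ with $\sum_j d_j = 0$ (here $d_j = \mathbb{E}^{\prime}[X_iX_j]-\mathbb{E}[X_iX_j]$), one has $\sum_j c_j d_j \ge 0$, since by Abel summation $\sum_j c_j d_j = \sum_m (c_m - c_{m+1}) D_m$ with non-negative prefix sums $D_m$. A minimal numerical sketch (function name ours):

```python
import random

def weighted_sum(c, d):
    """Inner product sum_j c_j * d_j."""
    return sum(ci * di for ci, di in zip(c, d))

random.seed(3)
for _ in range(1000):
    n = random.randint(2, 8)
    c = sorted([random.random() for _ in range(n)], reverse=True)
    d = sorted([random.gauss(0, 1) for _ in range(n)], reverse=True)
    mean = sum(d) / n
    d = [x - mean for x in d]  # decreasing sequence with zero sum
    # decreasing weights against a decreasing zero-sum sequence: >= 0
    assert weighted_sum(c, d) >= -1e-9
```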
Proof of Theorem 4, part (ii). We have to show that the statement holds if all $\mathbb {E}[X_i X_j]$ with $i \ne j \in I$ are the same. The step from this case to the general statement works as in the proof above. As in the proof of (i), it suffices to show that
Let $\overline{c}=\frac{1}{|I|}\sum _{i\in I}c_i$. By equation (10) we have
thus
with the last inequality coming from our assumption that $\overline{c}<1$.$\Box$
Proof of Theorem 5. Let the benchmark agent have standard deviation $s > 0$, that is, variance $s^2$. We will show that $\Delta(s, \sigma_1, \ldots, \sigma_n)$ – the difference in MSE between the differentially weighted average and the straight average – is strictly monotonically decreasing in the first argument. To this end, we calculate
Now we show that $\frac{\partial }{\partial s}\Delta (s,\sigma _1,\ldots ,\sigma _n)\le 0$, where $c_i'$ denotes $\frac{\partial}{\partial s} c_i$:
Since we are only interested in the sign of the first derivative and $-\frac{2}{(\sum _kc_k)^3}<0$, it suffices to show that:
We show that the terms in both brackets have the same sign.
For the first bracket we have:
which is larger than or equal to 0 if and only if $\sigma_i^2 \ge \sigma_j^2$. Similarly, we observe for the second bracket that
which allows us to conclude
Thus, both factors in (22) have the same sign, implying $\frac{\partial }{\partial s}\Delta (s,\sigma _1,\ldots ,\sigma _n)\le 0$, which is what we wanted to prove.$\Box$