Introduction
Political scientists rely on expert surveys to measure a wide array of variables—the positions of parties on policy dimensions (e.g., Benoit and Laver 2006; Bakker et al. 2015), the importance of portfolios (Druckman and Warwick 2005), the effectiveness of regional trade agreements (Gray and Slapin 2012), the preferences of bureaucracies (Clinton and Lewis 2008), and the quality of elections (Norris et al. 2013). Yet experts' ratings are rarely in perfect agreement. While scholars have explored the variation in expert placements (Hooghe et al. 2010; Martínez i Coma and Ham 2015) and have proposed solutions to anchor experts on the scales (Bakker et al. 2014), we argue that these approaches are insufficient for understanding the nature of the measurement problem in data derived from expert assessments.
If a single expert were perfectly knowledgeable, that expert's opinion might be sufficient and preferable to the opinions of multiple less well informed experts. But researchers do not know how knowledgeable experts are. The goal of an expert survey is thus to aggregate the responses of many experts, typically by taking the mean. We challenge this widely accepted form of response aggregation and demonstrate that mean expert responses may produce biased estimates of the latent concept researchers wish to measure. Confusion arises in part, we argue, because political scientists have not adequately distinguished between the tasks of inference and aggregation in expert surveys, leading to insufficient attention paid to problems surrounding aggregation. Taking the mean leads to bias because of scale truncation and central tendency bias among respondents in low-information environments. We demonstrate the problem using Monte Carlo simulations, data from a prominent expert survey, and a replication of a prominent study. We provide an easy-to-implement solution to this aggregation problem—using the median or modal expert response, rather than the mean.
Example: expert surveys on party positions
Studies using expert surveys largely rely on the mean expert placement, using standard deviations or standard errors to assess uncertainty. Yet the shape of expert placement distributions can vary drastically across the items in a survey. Political parties, for example, can have similar estimated mean positions based on very different distributions of expert placements.[1] Figure 1 shows distributions of expert responses for selected parties on an EU Integration dimension and a Left–Right Economic Policy dimension, two commonly used scales from the Chapel Hill Expert Survey (CHES), a widely used dataset (Bakker et al. 2015). On the EU Integration dimension, the mean expert placement suggests that the Dutch VVD, an economically liberal party, and the Finnish SDP, a center-left party, are moderately Euroskeptic.[2] However, whereas experts in the Netherlands strongly disagree about the position of the VVD, experts in Finland strongly agree on the position of the SDP.
The Left–Right Economic Policy dimension, which typically shows smaller variation in expert placements, reveals similar problems. Figure 1 shows the expert placements on Left–Right Economic Policy for the French Front National and the Polish Civic Platform. The Front National has a bimodal distribution; on average, experts judge it to be a center-right party. But the contrast with the Polish Civic Platform—a party with a similar average position—is striking. Experts use almost the entire scale to place the French party, while experts agree that the Polish party is center-right. These illustrations underscore that distributions of expert placements can vary drastically despite having similar means. Moreover, in the example of the Front National, the mean provides an answer that is likely wrong. Assuming all experts are equally well informed, our best guess about the party's position ought to be somewhere near 2–3 or 8–10, the regions where most expert assessments lie; it should not be a value in the middle, where the fewest experts locate the party. We explore this problem more systematically using a Monte Carlo simulation. We demonstrate that when experts do not assess positions perfectly, the mean leads to biased estimates of true positions. Researchers are better off using the median or modal response.
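To see the aggregation contrast concretely, consider the following sketch, which computes the mean, median, and modal placement for a hypothetical bimodal vector of expert ratings on a 0–10 scale; the values are purely illustrative, not actual CHES responses for any party. The mean lands in the sparsely populated middle of the scale, while the median and mode fall inside the larger cluster of placements.

```python
import numpy as np

# Hypothetical expert placements for a bimodally rated party on a 0-10 scale
# (illustrative values only, not actual CHES responses).
placements = np.array([2, 2, 3, 3, 8, 8, 9, 9, 9, 9, 10])

print("mean:  ", placements.mean())                 # ~6.5, where few experts placed the party
print("median:", np.median(placements))             # 8.0, inside the larger cluster
print("mode:  ", np.bincount(placements).argmax())  # 9, the most common placement
```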
Expert surveys and statistical inference
The statistical inference problem in expert surveys differs substantially from the frequentist notions of inference researchers typically apply when analyzing public opinion surveys (see Benoit and Laver 2006, ch. 4). In public opinion surveys, researchers wish to measure a population parameter by randomly sampling observations from that population. In contrast, the primary objective of expert surveys is not to learn about the experts, who are not chosen at random from a population. Rather, researchers wish to glean information from experts on a topic on which they have expertise. Because researchers do not necessarily know how knowledgeable their experts are, they ask many experts and aggregate their responses, hoping the aggregate response is closer to the truth. In asking for and aggregating multiple experts' responses, two problems arise: the first results from experts' differing perceptions and the second from the nature of scale truncation.
Formally, assume that an object to be rated has a true, latent position γ on some continuous scale, which researchers ask n experts to assess.[3] Typically, researchers ask these experts to make their assessment on a (Likert) scale with a limited number of response options. Each expert assessment $x_i$, where $i = 1, \ldots, n$, forms part of a vector of expert responses $X = (x_1, x_2, \ldots, x_n)'$. Let there be an expert-specific function, $g_i(\cdot)$, where $i = 1, \ldots, n$, which maps the true party position γ to the expert assessments, X. We wish to infer γ from X, and our ability to do so rests on the nature of $g_i(\cdot)$. If expert assessments were continuous, we might assume $g_i(\cdot)$ to be a linear function such that:

$$x_i = \alpha_i + \beta_i \gamma + \varepsilon_i, \quad (1)$$

where $\alpha_i$ is a shift parameter, $\beta_i$ is a stretch parameter, and $\varepsilon_i$ is noise. If all experts are perfectly informed and have no biases ($\alpha_i = 0$, $\beta_i = 1$, $\varepsilon_i = 0$ $\forall i$), then each $x_i = \gamma$. Having one expert is as good as having hundreds. However, if the experts are not all equally informed, are uniformly poorly informed, or have different biases in their perceptions of γ, they will not respond in the same manner.
Existing rescaling techniques account for differences arising from individual respondent biases and perceptions, referred to as differential-item functioning (DIF) (Aldrich and McKelvey 1977; Hare et al. 2015). These models estimate the expert-specific α and β parameters in Equation 1, but they do not account for another type of bias that arises as a result of rating items on a bounded scale. When assessing objects that lie at the extremes of the scale, the truncated nature of the scale means that experts can only make mistakes in one direction, namely towards the middle. Experts with a tendency towards making centrist placements when uninformed (central tendency bias) are doubly susceptible to this problem.[4] Even when all experts perceive the scale identically, if any noise exists, truncation bias must exist as well. Mean expert ratings will estimate objects located near the extremes as increasingly centrist as random noise increases. And among summary statistics, the mean will be most affected by random centrist placements resulting from noise. Better and worse measured objects could even change rank positions when summary statistics that are more robust to outlying centrist placements are used for aggregation.
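A minimal simulation, under assumptions of our own choosing, illustrates the truncation problem in isolation. Suppose there is no DIF (all shift parameters are 0 and all stretch parameters are 1) and rating noise is symmetric and normal. For an object whose true position sits near the top of a 0–10 scale, errors beyond the endpoint are clipped back to 10 while errors toward the middle are not, so the mean drifts toward the center as noise grows; the median is less affected. The noise levels below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
gamma = 9.0        # true position near the upper end of the 0-10 scale
n_experts = 20

for sd in (0.5, 1.5, 3.0):                                  # increasing rating noise
    ratings = gamma + rng.normal(0.0, sd, size=n_experts)   # no DIF: alpha_i = 0, beta_i = 1
    ratings = np.clip(np.rint(ratings), 0, 10)              # round and truncate to the scale
    print(f"sd={sd}: mean={ratings.mean():.2f}  median={np.median(ratings):.1f}")
```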
We could reduce truncation bias by using only the responses of better-informed experts, except that we have no good way of identifying them. The best we can do is to observe the distribution of all expert responses to assess whether poorly informed experts may exist. Increasing the number of expert responses does not provide us with greater certainty about γ, but it does allow us to get a better sense of the shape of the distribution of expert responses. We can then determine the consequences of aggregation using different summary statistics in the presence of expert disagreement.
Existing robustness and validity checks applied to expert survey responses insufficiently assess the consequences of aggregation. Most analyses focus on the mean expert placement, and may, at best, examine disagreement using the standard deviation of placements (Hooghe et al. 2010). Although recent analyses take concerns about differing expert scale perceptions into account (Clinton and Lewis 2008; Bakker et al. 2014), they do not consider truncation bias. In the next section, we explore these consequences using Monte Carlo simulations.
Monte Carlo simulation
We simulate a data generating process underlying expert assessments of parties and determine when aggregate measures best capture true positions. Our latent dimension is continuous on a given interval. We generate 100 true positions by taking draws from a uniform distribution ranging from 0 to 10. We refer to this vector of true positions as γ. In the real world, researchers design the expert survey and ask experts to make an assessment of the truth on a discrete scale, often ranging from 0 to 10: y ∈ {0, 1, …, 10}.
The simulation, which we run 1000 times, begins with “experts” reporting their perceptions of the positions on the discretized scale y. In our simulation, we draw expert j's assessment of party i as follows:

$$\mathrm{Rating}_{ij} = \alpha_j + \beta_j \gamma_i + \varepsilon_{ij}, \quad (2)$$

with each $\mathrm{Rating}_{ij}$ rounded to the nearest integer and truncated to lie between 0 and 10. The quantity $\alpha_j$ is an expert-specific shift parameter and $\beta_j$ is an expert-specific stretch parameter. Following the DIF setup, these parameters allow each expert to perceive the space differently. The error term $\varepsilon_{ij}$ means that experts assess some parties better than others. We run the simulation four times, using 5, 10, 15, and 20 experts, to examine the effect of consulting more experts. More information on the simulation parameter values is located in the Supplementary Appendix.
Having drawn expert assessments, we calculate the mean, median, and modal response for each party.[5] Because we set γ, the true party positions, we can assess how well the mean, median, and mode of the expert assessments recover the truth. We regress the truth, γ, on each of the summary statistics of the expert responses—the mean, median, or mode—for each of the 1000 simulated expert datasets. We expect an estimated regression slope of 1, representing a perfect relationship with the truth. We assess the performance of the summary statistics using OLS to mirror how expert data are typically used—as an independent variable in a regression model. Although we examine the relationship between our aggregate measure and the truth, any bias we find would also be present in the relationship between the aggregate measure and a dependent variable causally related to the truth.
After setting the true positions, the simulation steps are as follows:
1. Draw n expert assessments for each of the 100 parties using Equation 2, where n is 5, 10, 15, or 20.
2. Round the expert assessments to the nearest integer and truncate them, so that all assessments lie between 0 and 10.
3. Aggregate the expert assessments using the mean, median, and mode for each of the 100 parties.
4. Estimate three bivariate regressions of true positions on each of the aggregate measures and save the slope coefficients.
5. Repeat steps 1–4 1000 times and generate a boxplot of the 1000 slope coefficients.
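A compact sketch of this procedure is given below. The distributions used here for the expert-specific shift and stretch parameters and for the rating noise are placeholders of our own choosing; the values actually used in the simulations are reported in the Supplementary Appendix.

```python
import numpy as np

N_PARTIES, N_SIMS = 100, 1000

def modal(x):
    """Modal integer response (ties broken toward the lower value)."""
    return np.bincount(x.astype(int), minlength=11).argmax()

def slope(x, y):
    """Slope from a bivariate OLS regression of y on x."""
    return np.polyfit(x, y, 1)[0]

def simulate(n_experts, seed=1):
    rng = np.random.default_rng(seed)
    gamma = rng.uniform(0, 10, N_PARTIES)                # true positions, fixed across runs
    slopes = {"mean": [], "median": [], "mode": []}
    for _ in range(N_SIMS):
        # Placeholder DIF parameters: expert-specific shift (alpha) and stretch (beta).
        alpha = rng.normal(0.0, 0.5, n_experts)
        beta = rng.normal(1.0, 0.1, n_experts)
        noise = rng.normal(0.0, 1.5, (N_PARTIES, n_experts))
        ratings = alpha + beta * gamma[:, None] + noise  # step 1: draw expert assessments
        ratings = np.clip(np.rint(ratings), 0, 10)       # step 2: round and truncate
        aggregates = {                                   # step 3: aggregate
            "mean": ratings.mean(axis=1),
            "median": np.median(ratings, axis=1),
            "mode": np.apply_along_axis(modal, 1, ratings),
        }
        for name, agg in aggregates.items():             # step 4: regress truth on aggregate
            slopes[name].append(slope(agg, gamma))
    return {name: np.array(s) for name, s in slopes.items()}

for name, s in simulate(n_experts=10).items():           # step 5: summarize across runs
    print(f"{name:>6}: average slope across simulations = {s.mean():.3f}")
```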
We run a second set of simulations to capture the possibility that some experts perceive a party in a systematically different manner than other experts. This second simulation captures one way in which a bimodal pattern such as that seen for the Front National in Figure 1 may arise. Experts are selected at random (with probability 0.35) to view a subset of extreme parties (35 percent of parties with a position greater than 7.5 or less than 2.5) in mirror image. For example, while most experts would observe a party with a true position of 8, the randomly selected subset of experts would view a party with a true position of 2. This simulation is equivalent to a case in which experts are asked to rate a populist far-right party on a traditional left–right economic dimension. The majority of experts view the populist economic policies of a far-right party as right-wing policies, but a subset of experts recognize these policies (likely to include state intervention and subsidies) as matching notions of left-wing economic interventions. While neither group is wrong, we take as the truth the scale mapping that the majority of experts see.
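Under these assumptions, the mirror-image perception can be injected into the rating step by replacing each party's true position with the position a given expert perceives, as in the sketch below; the function name and the per-party Bernoulli selection (approximating the fixed 35 percent subset of extreme parties) are our own illustrative choices. The returned matrix would take the place of the true positions when drawing ratings in the previous sketch.

```python
import numpy as np

def perceived_positions(gamma, n_experts, rng, p_expert=0.35, p_party=0.35):
    """Positions each expert perceives, with some experts seeing mirror images.

    A random subset of experts (selected with probability p_expert) perceives a
    random subset of extreme parties (true position > 7.5 or < 2.5, each selected
    with probability p_party) at 10 - gamma rather than gamma. All other entries
    equal the true positions.
    """
    n_parties = len(gamma)
    perceived = np.tile(gamma[:, None], (1, n_experts))
    mirror_experts = rng.random(n_experts) < p_expert
    extreme = (gamma > 7.5) | (gamma < 2.5)
    mirror_parties = extreme & (rng.random(n_parties) < p_party)
    perceived[np.ix_(mirror_parties, mirror_experts)] = 10 - gamma[mirror_parties, None]
    return perceived

rng = np.random.default_rng(2)
print(perceived_positions(np.array([8.0, 1.0, 5.0]), n_experts=4, rng=rng))
```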
The results of both sets of simulations are presented in Figure 2. The top row of Figure 2 presents the results when all experts perceive the scales identically. The mean is biased and consistently underestimates the truth. Adding more experts reduces noise in the estimates but it does not reduce the bias. The median recovers the true relation nearly perfectly regardless of the number of experts asked, and increasing the number of experts reduces noise. Finally, the mode recovers the truth most accurately when we use 15 experts. Increasing the number of experts further reduces noise, but the mode starts to overestimate the truth. In the simulation where some experts view a mirror image of the truth, presented in the bottom row, the mean greatly underestimates the truth. The median and mode both perform much better, although they, too, underestimate the truth. The small subset of experts who view the truth differently has a greater impact on the mean than the median or mode.
These simulation results have important implications for the design and interpretation of expert surveys. If there is a reason to think that experts may not be fully knowledgeable, possess varied biases, or perceive scales differently, then researchers should recognize that different summary statistics can provide different answers about the nature and impact of party positions. These simulations suggest that the median and mode recover the truth better than the mean. At a minimum, the results suggest that when faced with discrepancies in expert placements, researchers should check the robustness of their results to different means of aggregation.
The CHES party expert survey further demonstrates that researchers must carefully consider their approach to aggregating expert placements. Figure 3 presents the percentage of countries in each survey wave that experience at least one rank-order shift among the parties as a result of using the mean versus the mode and the mean versus the median. Depending on the survey and dimension, between 20 percent and 50 percent of countries have at least one rank-order change in their party system as a result of using a different aggregation method.
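The rank-order comparison itself is straightforward to carry out on raw placement data. The sketch below uses made-up placements and illustrative column names, not the actual CHES variable names, to show how switching from the mean to the median or the mode can reorder parties within a country.

```python
import pandas as pd

# Made-up long-format expert placements (column names are illustrative,
# not the actual CHES variable names).
df = pd.DataFrame({
    "country": ["NL"] * 14,
    "party": ["A"] * 7 + ["B"] * 7,
    "placement": [2, 2, 2, 3, 9, 10, 10,   # party A: bimodal placements
                  4, 4, 5, 5, 4, 5, 5],    # party B: tightly clustered placements
})

agg = df.groupby(["country", "party"])["placement"].agg(
    mean="mean",
    median="median",
    mode=lambda x: x.mode().iloc[0],       # lowest modal value on ties
)

ranks = agg.groupby("country").rank(method="min")   # within-country ranks under each rule
print(ranks)
print("rank order differs (mean vs median):", bool((ranks["mean"] != ranks["median"]).any()))
```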
Application—understanding party position shifts
Response aggregation can affect the results of studies relying on expert placements. We replicate a study by Adams et al. (2014), which examines how citizens update their views of parties' policy position shifts. In doing so, we also show how a simple bootstrap can help gauge the effects of uncertainty in the expert data on our inferences. Adams et al. (2014) argue that citizens, rather than relying on party manifestos to update their information on party policy position shifts—a view with a long tradition in the extant literature—draw on a variety of information sources when updating their beliefs. Adams, Ezrow, and Somer-Topcu use expert opinions from the CHES surveys as a proxy for broad information gathering. Focusing on European integration, their empirical analysis confirms their hypothesis. The finding is an important contribution to the ongoing debate about the political sophistication of citizens.
However, using the mean to aggregate divergent expert opinions when calculating party policy shifts may impact these results. The study uses one of the better-measured items in the CHES data—party position with respect to European integration—and focuses on parties in Western Europe, where experts tend to display higher levels of agreement. Thus, if we find that expert disagreement creates problems in this case, it is likely to create issues in a large number of other studies, too.
First, we examine how robust the results are to different aggregation approaches. We also account for uncertainty in the aggregated expert responses resulting from disagreement among experts. If we were to simply rely on the median, modal, or mean response, we would be assuming that our point estimate has no associated noise. To address this issue, we conduct a non-parametric bootstrap of the expert data. We create 100 bootstrapped expert datasets by sampling with replacement from the set of expert responses for each party on the European integration dimension. From the sampled expert responses, we calculate the median or modal response to construct the relevant variable and estimate the Adams et al. (2014) model 100 times—once in each sample. Finally, we summarize the results across the 100 samples using model averaging, just as one would when imputing missing data (Blackwell et al. 2015).
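A sketch of this bootstrap step is shown below. The data structure (a mapping from parties to vectors of expert placements), the aggregation choice, and the combining rule are illustrative assumptions on our part; the model-fitting step is omitted. In the replication, each of the 100 bootstrapped position variables is plugged into the Adams et al. specification, and the resulting coefficient estimates are pooled across runs, here with a Rubin-style rule as one would use for multiply imputed data.

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_positions(expert_responses, n_boot=100, agg=np.median):
    """Bootstrap aggregated positions by resampling experts within each party.

    expert_responses: dict mapping party -> 1-D array of expert placements.
    Returns an (n_boot, n_parties) array of aggregated positions.
    """
    parties = list(expert_responses)
    positions = np.empty((n_boot, len(parties)))
    for b in range(n_boot):
        for j, party in enumerate(parties):
            x = np.asarray(expert_responses[party])
            resampled = rng.choice(x, size=len(x), replace=True)  # resample experts
            positions[b, j] = agg(resampled)
    return positions

def combine(coefs, variances):
    """Pool one coefficient across bootstrap runs (Rubin-style combining rule)."""
    m = len(coefs)
    point = np.mean(coefs)
    within = np.mean(variances)        # average squared standard error across runs
    between = np.var(coefs, ddof=1)    # variance of the point estimates across runs
    return point, np.sqrt(within + (1 + 1 / m) * between)

# Example with made-up placements for two parties
positions = bootstrap_positions({"A": [2, 2, 3, 9, 9, 10], "B": [5, 5, 6, 6, 5]}, n_boot=5)
print(positions)
```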
Table 1 presents the results. The first model replicates the Multivariate Model (3) in Adams, Ezrow, and Somer-Topcu without clustered standard errors. The second model uses clustered standard errors and therefore is an exact replication of Multivariate Model (3). Clustering has very little effect on the standard errors. We therefore proceed to estimate the other models without clustering. Using the bootstrapped median and mode, the results of Adams, Ezrow, and Somer-Topcu become much weaker (columns 3 and 4). The coefficient on the variable of interest is only 57 percent of its former size when using the bootstrapped median and only 41 percent of its reported size when using the bootstrapped mode. Neither the median nor the mode variable attains statistical significance. In the final model, we use the mean responses but estimate the model using only well-measured parties.[6] The coefficient estimate using only better-measured parties is still smaller than the original estimate using the mean and all parties, further indicating that poorly measured parties with high levels of expert disagreement are contributing to the authors' findings.
Our analysis suggests that the substantive effects presented by Adams et al. (2014) may not be as strong as the authors suggest. Their results are at least partly driven by disagreement in expert placements of parties and by the choice to aggregate these responses using the mean. Nevertheless, we would not go so far as to say that the Adams et al. (2014) results are incorrect. We account for measurement error in only one of the two variables. There is almost certainly measurement error in the variable based on Euromanifestos as well, and accounting for that measurement error could affect the coefficient on the expert survey variable. Our point is simply that disagreement among experts in rating parties can lead to incorrect inferences about the impact of party positioning when party position is used as an independent variable in regression analyses.
Discussion and conclusion
Political scientists make frequent use of expert surveys, but they have not adequately examined the consequences that a lack of expert agreement has for the aggregation of responses. Our findings have implications both for those who wish to collect new expert survey data and for those using existing data. Those running new surveys must consider the degree to which experts can assess individual items. Within party position surveys in political science, the positions of some parties, and positions on some dimensions, are easier to assess than others. In the Supplementary Appendix, we apply common measures of agreement to the CHES data to underscore the problems caused by a lack of agreement.
It may also be useful to design items in expert surveys aimed at gauging expert knowledge. Researchers could give more weight to knowledgeable experts and determine whether disagreement results from heterogeneous expert ability or a fundamental lack of agreement on where targets lie on the scale. Lastly, it might be useful to think about other types of survey designs, beyond Likert scales, that may lessen the cognitive burden placed on experts, resulting in higher levels of agreement (e.g., pairwise comparisons).
For those using existing data, we suggest that researchers examine expert agreement and reliability within the items they wish to use by drawing on well-known techniques (Finn 1970; James et al. 1984; van der Eijk 2001). If items are poorly measured, it may not be wise to use them in secondary analyses. And when disagreement among experts exists, researchers should check the robustness of their results to aggregation using the median and modal expert responses.
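As one illustration of such a check, the sketch below computes the within-group agreement index r_WG of James, Demaree, and Wolf (1984) for a single item, comparing the observed variance of expert placements with the variance expected under a uniform, no-agreement distribution over the 11 response categories of a 0–10 scale.

```python
import numpy as np

def r_wg(ratings, n_categories=11):
    """James, Demaree, and Wolf (1984) within-group agreement index for one item."""
    observed = np.var(ratings, ddof=1)       # variance of the observed expert placements
    null = (n_categories ** 2 - 1) / 12      # variance of a uniform (no-agreement) distribution
    return 1 - observed / null

print(r_wg(np.array([8, 8, 9, 9, 8])))    # high agreement: close to 1
print(r_wg(np.array([0, 2, 5, 8, 10])))   # strong disagreement: near or below 0 (often truncated to 0)
```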
Supplementary Material
To view supplementary material for this article, please visit http://dx.doi.org/10.1017/psrm.2018.52
Acknowledgments
We would like to thank James Lo, Moritz Marbach, several anonymous reviewers and the editorial team at PSRM for their helpful comments.