The conventional theory of economic voting is that voters reward or punish the incumbent government based on how the domestic economy is performing. Recently, scholars have challenged that view, arguing that voters use relative assessments to gauge government performance. From this perspective, what matters is not how well the national economy is doing per se, but rather how it performs relative to an international or historical reference point.
This article revisits prominent published works in that emerging tradition, and finds that the available evidence does not support the benchmarking hypothesis. It reaches this conclusion after taking a close look at the regression models that are typically used to test benchmarking, and shows algebraically that the way in which those models are specified invites a fundamental misreading of the evidence. Finally, the study proposes an alternative regression equation that can be used to test benchmarking, that avoids common misinterpretations and that facilitates the assessment of complex, conditional theories of relative evaluation.
Background
Economic voting is one of the most important accountability mechanisms at work in electoral democracies. The fact that voters reward or punish the incumbent government based on how the domestic economy is performingFootnote 1 is traditionally viewed as normatively desirable, for it reflects popular control of representatives.Footnote 2
Recently, a number of scholars have challenged this optimistic view, by pointing out that domestic economic growth is often a weak proxy for government performance. When the local economy moves in sync with secular trends or global shocks, governments may be rewarded or punished for events beyond their control.Footnote 3 This is especially true in integrated economies, where domestic fortunes are tightly linked to events abroad, and responsibility is blurred.Footnote 4 Democratic accountability thus requires more from voters than a simple response to local economic conditions.Footnote 5
A new and influential strand of research argues that voters do, in fact, make rational judgements about government performance, because their evaluations are relative.Footnote 6 What matters to rational voters may not be how well the national economy is doing per se, but rather how it performs compared to the economies of other countries, or relative to some historical benchmark.
Benchmarking is a powerful idea, which can be traced back to the work of Powell and Whitten.Footnote 7 As these authors point out, voters are likely to ‘evaluate government relative to some expectations about how the economy should have performed’.Footnote 8 But since expectations are difficult to measure, ‘it seems reasonable to use the international average levels of growth, inflation, and unemployment to estimate a baseline against which each country’s citizens could judge the performance of their own economy’.Footnote 9 This approach is intuitive, since ‘abundant research in other domains of social science supports the proposition that individuals are sensitive to comparative assessments’.Footnote 10
Yet there are also good reasons to doubt that voters benchmark economic performance. First, the benchmarking hypothesis is at odds with a dominant view on the cognitive limitations of ordinary voters. Indeed, a long tradition of research in political science has depicted the citizenry as poorly informedFootnote 11 and biased.Footnote 12 It is difficult to imagine how such an unsophisticated electorate could systematically and accurately compare how well the national economy is performing relative to other countries or a historical benchmark. Secondly, even if some authors posit that the media could facilitate benchmarking by making implicit comparisons in their news coverage, evidence of the underlying mechanism is rather weak. For instance, Kayser and Peress (henceforth, KP) report that high-information voters – those most exposed to the media – do not engage in more benchmarking than low-information voters.Footnote 13 Finally, some empirical studies claim that voters act based on relative economic conditions, but others find that when it comes to evaluating government performance, ‘the effect of luck is larger than the effect of competence’.Footnote 14 In short, the theoretical case for benchmarking is implausible, and the empirical record is mixed.
In this article, we show that the empirical evidence of benchmarking is extremely weak. We argue that the way in which regression models are typically specified to test benchmarking is needlessly complicated, that it invites a fundamental misreading of the evidence and that it may lead researchers astray. We propose a simpler model specification that can be used to test benchmarking, which avoids common misconceptions and carries powerful intuitions about the theory. We also show how this simple model can be enhanced to test more complicated theories, such as when voters benchmark against multiple reference points, or when the strength of benchmarking depends on the context. We revisit a prominent empirical study of Benchmarking Across Borders,Footnote 15 and conduct a faithful replication of the models reported in that article. When correctly interpreted, the results do not support the contention that voters make rational comparative evaluations.
Our findings have important implications for the field of economic voting, and for our understanding of the mechanisms that underpin democratic accountability. More generally, our article makes useful contributions to political science methodology by highlighting the shortcomings of a widely used empirical strategy, and by proposing a better way to test theories of relative evaluation.
Benchmarking vs. Conventional Economic Voting
The core intuition of benchmarking is illustrated in Figure 1, which shows how domestic growth and a reference point can affect support for the incumbent party. The solid line represents the domestic growth rate during the election year ($G_y$), and the dashed line represents the growth rate that voters use as a benchmark to evaluate the incumbent government’s performance. Depending on the analyst’s theory, the reference point could be the international growth rate ($G_i$) or the historical level of growth in the country under study ($G_h$).
The conventional view of economic voting is that votes for the incumbent ($V$) are tied to domestic growth. As we move from left to right, $G_y$ increases in Figure 1a but stays constant in 1b. Thus the conventional prediction is that votes for the incumbent will increase in 1a but stay constant in 1b.Footnote 16
In contrast, proponents of benchmarking argue that what matters to voters is not domestic growth per se, but rather the difference between domestic growth and the benchmark ($G_y - G_i$).Footnote 17 When the solid line is above the dashed line, domestic growth outperforms the benchmark, and voters should reward the incumbent. When the solid line is below the dashed line, domestic growth underperforms relative to the benchmark, and voters should punish the incumbent. As we move from left to right in Figure 1, the ‘performance gap’ or ‘competence signal’ increases in Figure 1a and decreases in 1b. Thus, benchmarking predicts that votes for the incumbent will increase in 1a and decrease in 1b.
These expectations can be restated using the language of multiple regression. When domestic growth increases and the reference point is held constant (Figure 1a), both theories predict that the incumbent’s vote share will increase. In other words, both theories predict that the marginal effect of domestic growth will be positive: $\partial V/\partial G_y > 0$. When the reference point increases and domestic growth is held constant (Figure 1b), benchmarking predicts that votes for the incumbent will decrease. In other words, benchmarking predicts that the marginal effect of the reference point will be negative: $\partial V/\partial G_i < 0$.
If we hope to discriminate between benchmarking and conventional economic voting, the main quantity of interest is the marginal effect of the reference point, since this is where benchmarking theory makes a distinctive prediction.
How do Scholars Test Benchmarking?
Conventional theories of economic voting are typically tested using models of this form:
$$V = \beta_0 + \beta_y G_y + \Omega + \nu \tag{1}$$
where $V$ is the incumbent’s vote share, $G_y$ is the domestic economic growth rate during the election year, $\Omega$ is a vector of control variables, and $\nu$ is a disturbance term. Clearly, Model 1 cannot be used to test benchmarking, since it ignores relative evaluations altogether.
In their seminal article, Powell and WhittenFootnote 18 estimate a regression equation of this form:
$$V = \lambda_0 + \lambda_{y-i}(G_y - G_i) + \Omega + \nu \tag{2}$$
with $G_i$ equal to a ‘reference point’, the international economic growth rate. Model 2 takes us very close to the benchmarking story: when the gap between $G_y$ and $G_i$ is positive, the domestic economy outperforms the global economy, and voters should reward the incumbent government for its competence.
As KP note, however, Model 2 cannot be used to distinguish between benchmarking and conventional economic voting, because it suffers from omitted variable bias.Footnote 19 Indeed, the composite variable ($G_y - G_i$) is highly correlated with the level of domestic economic growth ($G_y$).Footnote 20 As a result, we cannot parse out the effect of benchmarking from conventional economic voting, and $\lambda_{y-i}$ captures both phenomena. Model 2 is thus useful if we want to estimate something akin to the ‘total effect’ of domestic growth and benchmarking on voting behavior, but not if we wish to compare and contrast the two theories.
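The confound in Model 2 can be illustrated with a small simulation. The sketch below does not use any replication data: the data-generating process, sample size and coefficient values are assumptions chosen purely for illustration. Votes are generated from domestic growth alone (conventional economic voting, no benchmarking), yet the coefficient on the gap in a Model 2 regression still comes out positive.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: international growth plus an idiosyncratic domestic shock.
g_i = rng.normal(2.0, 1.0, n)
g_y = g_i + rng.normal(0.0, 1.5, n)

# Assumed data-generating process: conventional voting only, no benchmarking.
v = 40.0 + 1.0 * g_y + rng.normal(0.0, 2.0, n)


def ols(y, *regressors):
    """Return OLS coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Model 2: votes regressed on the gap (G_y - G_i) alone.
lambda_hat = ols(v, g_y - g_i)

# Additive specification: the reference point gets its own coefficient.
delta_hat = ols(v, g_y, g_i)

print("Model 2 coefficient on the gap:", lambda_hat[1])
print("Additive model, domestic growth:", delta_hat[1])
print("Additive model, international growth:", delta_hat[2])
```

In this simulated setting, only the additive regression reveals that the reference point has no effect of its own; the gap coefficient alone cannot distinguish the two theories.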
To solve this problem, KP introduce an additional control for the reference point:
$$V = \theta_0 + \theta_{y-i}(G_y - G_i) + \theta_i G_i + \Omega + \nu \tag{3}$$
In this model, $G_y - G_i$ represents a ‘decomposed’ or ‘local’ component of growth, whereas $G_i$ aims to control for changes in the reference point. Model 3 has had tremendous influence in the field. At the time of writing, KP’s article has been cited over 160 times, and several other researchers have adopted and adapted their empirical strategy.
A Widespread Misconception
An intuitive – but ultimately incorrect – way to interpret Model 3 would be to focus on the gap between $G_y$ and $G_i$, and to treat the $\theta_{y-i}$ coefficient as the effect of relative economic performance on votes for the incumbent.
For example, Aytaç argues that a positive estimate of $\theta_{y-i}$ provides ‘evidence for the hypothesis that voters reward (punish) incumbents on whose watch the economy performs relatively better (worse) in domestic and international comparisons’.Footnote 21 KP define ‘local growth’ as the gap between $G_y$ and $G_i$, and interpret $\theta_{y-i}$ as measuring the association between ‘an increase in local growth’ and an ‘increase in the leader party’s vote share’.Footnote 22 Goplerud and Schleiter follow in those footsteps, and discuss $\theta_{y-i}$ as the effect of some ‘benchmarked’ or ‘local’ component of growth on voting behavior.Footnote 23 Using data on the American states, Ebeid and Rodden also interpret $\theta_{y-i}$ as the effect of ‘relative state conditions’ on votes.Footnote 24
If the domestic economy outperforms a reference point, it may be reasonable for voters to infer that the government is doing good work. In that spirit, Leigh treats $\theta_{y-i}$ as the effect of ‘government competence’ on votes for the incumbent.Footnote 25 In the American context, Wolfers considers the gap between state- and national-level economic growth, and calls $\theta_{y-i}$ the ‘effect of competence’.Footnote 26
Interpreting $\theta_{y-i}$ as the effect of relative economic performance on votes for the incumbent appeals to common sense, but it is a mistake. The root of the problem lies in the fact that $G_i$ appears twice on the right-hand side of Equation 3. This redundancy changes the substantive meaning of our regression coefficients.
To see how, take the partial derivative of Equation 3 with respect to $G_y$, and find the marginal effect of domestic growth:
$$\frac{\partial V}{\partial G_y} = \theta_{y-i} \tag{4}$$
This simple exercise demonstrates that the coefficient associated with $G_y - G_i$ is exactly equivalent to the marginal effect of $G_y$. Contrary to intuition, $\theta_{y-i}$ does not measure the effect of relative economic performance on votes for the incumbent. Since $\theta_{y-i}$ is the marginal effect of domestic growth, finding a positive coefficient for ‘benchmarked’ or ‘local’ growth is actually supportive of conventional economic voting.
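The same result can be seen by expanding the composite term. The derivation below is a sketch that assumes Model 3 has the form shown on its first line, with $\theta_0$ denoting the intercept (a label introduced here for illustration):

```latex
\begin{align*}
V &= \theta_0 + \theta_{y-i}(G_y - G_i) + \theta_i G_i + \Omega + \nu \\
  &= \theta_0 + \theta_{y-i} G_y + (\theta_i - \theta_{y-i}) G_i + \Omega + \nu, \\[4pt]
\frac{\partial V}{\partial G_y} &= \theta_{y-i}, \qquad
\frac{\partial V}{\partial G_i} = \theta_i - \theta_{y-i}.
\end{align*}
```

Collecting terms shows that $G_i$ enters with coefficient $\theta_i - \theta_{y-i}$, which is the quantity that a discriminating test of benchmarking should examine.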
Tests of benchmarking based on Equation 3 have been repeatedly misinterpreted in prestigious scientific journals, by leading scholars of economic voting. The inclusion of duplicate regressors on the right-hand side of Equation 3 has been a source of widespread confusion in the economic voting literature.Footnote 27 To put this confusion to rest, we need a simpler, more direct test of benchmarking.
A Simpler Test of Benchmarking
From Figure 1, we learned that both benchmarking and conventional economic voting predict that the marginal effect of domestic growth should be positive. In contrast, only benchmarking predicts that the marginal effect of international growth should be negative. The simplest and most direct way to test those predictions is to estimate a model of this form:
$$V = \delta_0 + \delta_y G_y + \delta_i G_i + \Omega + \nu \tag{5}$$
Since Model 3 includes redundant regressors, it carries no more information than the simpler Model 5. In fact, Models 3 and 5 are perfectly equivalent from a logical standpoint, and they produce identical numerical results: the marginal effect of domestic growth, the marginal effect of international growth,Footnote 28 the intercept, the control variables’ coefficients, the residuals and all fit statistics are always the same in both models. In the online appendix, we present side-by-side estimates using Models 3 and 5 to illustrate this point numerically.
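The equivalence can also be checked numerically. The following is a minimal sketch on simulated data (not KP's replication files); the variable names, the single control variable and all parameter values are illustrative assumptions. It fits both specifications by ordinary least squares and compares coefficients and residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulated, hypothetical data (not KP's replication files).
g_i = rng.normal(2.0, 1.0, n)
g_y = 0.6 * g_i + rng.normal(1.0, 1.5, n)
control = rng.normal(0.0, 1.0, n)
v = 35.0 + 0.8 * g_y - 0.1 * g_i + 0.5 * control + rng.normal(0.0, 3.0, n)


def fit(y, X):
    """OLS via least squares; returns coefficients and residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef


ones = np.ones(n)
# Model 3 columns: intercept, (G_y - G_i), G_i, control
theta, resid3 = fit(v, np.column_stack([ones, g_y - g_i, g_i, control]))
# Model 5 columns: intercept, G_y, G_i, control
delta, resid5 = fit(v, np.column_stack([ones, g_y, g_i, control]))

print("theta_{y-i}:", theta[1], "| delta_y:", delta[1])
print("theta_i - theta_{y-i}:", theta[2] - theta[1], "| delta_i:", delta[2])
print("max residual difference:", np.max(np.abs(resid3 - resid5)))
```

Because the two design matrices span the same column space, the fitted values are the same and the coefficients map onto one another exactly, up to floating-point error.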
Yet even if the two models are formally equivalent, the simpler specification has major advantages in terms of transparency, presentation and interpretation.
First, the correct interpretation of KP’s Model 3 is highly counterintuitive: the coefficient that they call ‘Local Component of Growth’ in their regression tables ($\theta_{y-i}$) does not measure the effect of the local economy’s relative performance on votes for the incumbent. As we showed above, this has been a major source of confusion in the field, and benchmarking results have been repeatedly misinterpreted in print. Our simpler specification avoids this problem.Footnote 29
Secondly, Equation 5 directly translates the theoretical intuitions conveyed by Figure 1, and it immediately reveals the relevant test statistics. Recall that the discriminating test of benchmarking is that international growth should have a negative marginal effect on votes for the incumbent. In Model 5, the marginal effect of international growth is the $\delta_i$ coefficient, and we can simply look at its p-value. In Model 3, the marginal effect of international growth is a linear combination of coefficients ($\partial V/\partial G_i = \theta_i - \theta_{y-i}$), and we must conduct an extra Wald test to know if that combination is negative and statistically significant.
Finally, as we show below, our simpler specification offers a solid foundation on which we can build empirical tests for theories of benchmarking where voters compare multiple reference points, or where the strength of benchmarking is context dependent.
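The extra step that Model 3 requires can be sketched as follows. The example uses simulated data and classical (homoskedastic) OLS standard errors, which is a simplifying assumption; the tables in this article report robust standard errors. It computes the standard error of the linear combination $\theta_i - \theta_{y-i}$ from the coefficient covariance matrix, and compares it with the standard error of the single coefficient on international growth in the additive model.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Simulated, hypothetical data.
g_i = rng.normal(2.0, 1.0, n)
g_y = 0.5 * g_i + rng.normal(1.0, 1.5, n)
v = 38.0 + 0.9 * g_y + rng.normal(0.0, 3.0, n)


def ols_with_cov(y, X):
    """Return OLS coefficients and the classical covariance matrix."""
    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return coef, sigma2 * xtx_inv


ones = np.ones(n)
theta, cov3 = ols_with_cov(v, np.column_stack([ones, g_y - g_i, g_i]))
delta, cov5 = ols_with_cov(v, np.column_stack([ones, g_y, g_i]))

# Model 3: dV/dG_i = theta_i - theta_{y-i}, written as the contrast c'theta.
c = np.array([0.0, -1.0, 1.0])
effect3 = c @ theta
se3 = np.sqrt(c @ cov3 @ c)

# Model 5: the same quantity is a single coefficient.
effect5 = delta[2]
se5 = np.sqrt(cov5[2, 2])

print(f"Model 3 contrast: {effect3:.4f} (SE {se3:.4f}), t = {effect3 / se3:.2f}")
print(f"Model 5 delta_i : {effect5:.4f} (SE {se5:.4f}), t = {effect5 / se5:.2f}")
```

Both routes give the same estimate and standard error; the additive specification simply makes them visible in the regression table without a separate contrast computation.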
Replication: Benchmarking Across Borders
As we explained above, the key quantity of interest for tests of benchmarking is the marginal effect of international growth (holding domestic growth constant). Unfortunately, KP do not consistently report the statistics that are needed to test if that quantity is distinguishable from zero.Footnote 30 As a result, readers cannot assess the strength of the evidence simply based on the findings printed in Benchmarking Across Borders.
To see if KP’s data support their theory, we re-estimated all of their models using the authors’ replication files, and we computed all the quantities of interest.Footnote 31 Table 1 shows the results for four models,Footnote 32 estimated using KP’s preferred measure of international growth (an index constructed via principal component analysis).
Note: robust standard errors in parentheses. *p<0.1, **p<0.05, ***p<0.01
Baseline Specification
In Column 1 of Table 1, we see that the $G_y$ coefficient is positive. This is consistent with both benchmarking and conventional economic voting. The $G_i$ coefficient is negative and statistically significant. This is consistent with benchmarking.Footnote 33 However, those results are not credible, because the model in Column 1 is severely underspecified: it includes neither control variables nor a lagged dependent variable.
Controls, Lags and Fixed Effects
Ensuring that results are robust to the inclusion of controls and a lagged dependent variable is a minimum standard for most modern research on economic voting. In Column 2 of Table 1, we follow KP and add the same control variables as in their article; Column 3 includes the incumbent’s vote share in the previous election, and Column 4 includes both a lagged dependent variable and country fixed effects.
The three new models are consistent with conventional economic voting: the marginal effects of domestic growth in Columns 2 to 4 are all positive and statistically significant. However, none of the three models supports benchmarking: the marginal effects of international growth in Columns 2 to 4 are all indistinguishable from zero. As soon as we introduce control variables, a lagged dependent variable or country fixed effects – widely recognized best practices in the field – the evidence of benchmarking evaporates.
Alternative Measures of International Growth
The models in Table 1 were all estimated using an index of international growth constructed by principal component analysis. This is KP’s preferred measure of $G_i$, but the authors also consider two alternatives: a trade-weighted average of growth rates around the world, and the international median.
In Benchmarking Across Borders, the choice between those three measures is rather inconsequential, because KP conclude that the evidence supports benchmarking, regardless of how they measure $G_i$. Substantively, the authors take this to mean that ‘voters respond to their country’s deviation from various measures of average international performance’.Footnote 34 Moreover, KP do not offer a real theoretical defense of their preferred measure, and fit statistics do not give us strong reasons to favor one measure of $G_i$ over another.Footnote 35
Nevertheless, access to these two alternative measures of international growth is useful, because it allows us to probe the sensitivity of benchmarking tests to how we measure the reference point. In the online appendix, we replicate the eight regression models that KP estimated using aggregate-level data and their two alternative measures of international growth. None of those eight models shows evidence of benchmarking: the marginal effect of international growth is never distinguishable from zero.
Individual-Level Survey Data
Moving beyond aggregate-level data, KP also study benchmarking using individual-level surveys. Once again, their empirical specification resembles Model 3, and the quantity of interest is the marginal effect of international growth. In the online appendix, we replicate all twelve of KP’s individual-level models. None of those models allows us to reject the null of ‘no benchmarking’.
Three More Empirical Claims
In the online appendix, we consider three more empirical claims from the original article: (1) a statistically insignificant estimate of $\theta_i$ constitutes evidence of ‘full benchmarking’, (2) the substantive effect of decomposed growth is more important than the substantive effect of domestic growth and (3) at several points in time, the magnitude of the benchmarked economic vote is greater than the magnitude of the non-benchmarked economic vote. Our assessment is that these claims do not add credence to the theory.
Do Voters Benchmark Economic Performance?
In their article, KP ‘argue that previous research has fundamentally misunderstood and hence incorrectly estimated how economic assessments are made’.Footnote 36 They contend that ‘voters respond more to national deviations from an international average rate of growth than to the growth rate itself’.Footnote 37 They claim that their empirical analysis reveals ‘strong evidence of cross-national benchmarking on economic growth both at the aggregate and at the individual level, across time periods, and across subsamples’.Footnote 38 Finally, after conducting extensive robustness checks, they conclude that their ‘main results are not altered’,Footnote 39 and that the evidence is ‘clearly inconsistent with no benchmarking’.Footnote 40
We re-evaluated benchmarking on KP’s own terms, using their original data, logically equivalent statistical models, the same null hypothesis testing framework and an evaluation criterion that they explicitly endorsed.Footnote 41 Yet our substantive conclusions are strikingly different.
When models include control variables, a lagged dependent variable or country fixed effects, we cannot reject the null of ‘no benchmarking’. When we use alternative measures of international growth, we cannot reject the null of ‘no benchmarking’. When we test the theory using individual-level survey data, we cannot reject the null of ‘no benchmarking’. In fact, out of the twenty-four regression models that we replicated, only one model – without controls or lagged dependent variable – supports the theory. In the other twenty-three tests, the critical quantity of interest does not cross (or even approach) conventional thresholds of statistical significance.Footnote 42 Put simply, the evidence in Benchmarking Across Borders amounts to little more than a null result.
How to Test Benchmarking With Multiple Reference Points
The models considered above show little evidence of benchmarking. This surprising result could be an artifact of several factors. For instance, our models may be too simple to capture the complex processes at work, or KP’s dataset may be too small to conduct well-powered tests. In this section, we consider how to adapt our barebones empirical framework to the more complex case in which voters compare domestic economic performance to multiple reference points. Then, we illustrate by studying a larger dataset drawn from a more recent study of benchmarking.
AytaçFootnote 43 develops a reference point theory that is highly reminiscent of KP’s, but which makes two important substantive changes. First, the author argues that voters use two reference points to assess their government’s performance: the level of international growth ($G_i$) and their own country’s historical level of growth ($G_h$).
Secondly, Aytaç points out that these reference points could be compared to two alternative measures of the incumbent’s performance: domestic growth during the election year ($G_y$) or domestic growth during the incumbent’s full term in office ($G_t$). The term-based measure is preferable if we adopt a rational voter model, since such voters can extract more information about the quality of government by observing performance over a longer period. The election year measure is preferable if we take the view – dominant in political psychology – that voters are cognitively limited, myopic and that they use end heuristics when engaging in retrospective evaluations.Footnote 44 Here, we remain agnostic and estimate models using both measures.
In our framework, testing theories of benchmarking with multiple reference points is straightforward: we simply introduce the new reference point variable additively in Model 5. Again, there is evidence of benchmarking if the marginal effect of domestic growth is positive, and if the marginal effects of the benchmarks are negative.
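Written out, the additive specification with two reference points could take the following form. The coefficient label $\delta_h$ for historical growth and the intercept label $\delta_0$ are notational assumptions made here, extending the labels used for Model 5; $G_t$ can replace $G_y$ when the term-based measure of domestic growth is used.

```latex
\begin{align*}
V &= \delta_0 + \delta_y G_y + \delta_i G_i + \delta_h G_h + \Omega + \nu, \\[4pt]
&\text{conventional voting and benchmarking: } \delta_y > 0, \\
&\text{international benchmarking: } \delta_i < 0, \qquad
 \text{historical benchmarking: } \delta_h < 0.
\end{align*}
```

Each prediction corresponds to a single coefficient, so each can be read directly from a regression table.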
In Table 2, we illustrate this by estimating six models using Aytaç’s replication data.Footnote 45 In a first set of three models, we compare international and historical growth to domestic growth in the election year. In a second set of three models, we compare international and historical growth to the average domestic growth rate during the incumbent’s full term in office. We include the same control variables as Aytaç.
Note: OLS regressions with country-clustered standard errors. Robust standard errors in parentheses. *p<0.1, **p<0.05, ***p<0.01
All six of the models in Table 2 show evidence of conventional economic voting: the coefficients for domestic economic growth ($G_y$ or $G_t$) are all positive and statistically significant. In contrast, none of the models allows us to reject the null hypothesis of ‘no international benchmarking’: the $G_i$ coefficient is never statistically significant at the $\alpha = 0.1$ level.
The two right-most models in Table 2 show evidence of historical benchmarking: the $G_h$ coefficient is negative and statistically significant. However, it is important to point out that those models rely on a highly unconventional assumption. Indeed, they assume that voters have long enough memories to accurately compare the average level of growth during the incumbent’s full term in office to the average level of growth during the previous government’s term. This assumption clashes with common wisdom in the field of economic voting, in which ‘virtually all macro-studies assume a short lag, generally of one year’.Footnote 46 Most studies of benchmarking also use short-term measures of domestic growth.Footnote 47
In sum, the results in Table 2 offer strong support for the conventional theory of economic voting, but evidence of benchmarking is mixed. None of the models allows us to confidently reject the absence of international benchmarking, and the only models that support historical benchmarking require us to jettison the widespread assumption that voters are myopic.
How to Test Conditional Theories of Benchmarking
The regression models that we have studied so far were relatively underspecified. Indeed, one of the major contributions of Powell and WhittenFootnote 48 was to point out that the level of economic voting depends on the institutional context (for example, clarity of responsibility). Similarly, there are good reasons to think that benchmarking will vary across populations: some voters – such as those with high information – might engage in more relative economic evaluations than others.
If benchmarking is truly conditional, then the ‘pooled’ models that we estimated above would be inappropriate, and our null results would not be surprising. For this reason, it is extremely important to develop regression models capable of testing conditional benchmarking hypotheses. Again, this is very easy to do in our simple empirical framework.
We use the same starting point as before (Figure 1). Benchmarking predicts that the marginal effect of domestic growth should be positive, and that the marginal effect of the reference point should be negative. If a moderating variable M increases (decreases) the salience of the reference point, then the marginal effect of domestic growth should be more (less) positive, and the marginal effect of the reference point should be more (less) negative where M is high.
This idea can be captured by a simple extension of Model 5:
$$V = \delta_0 + \delta_y G_y + \delta_i G_i + \delta_m M + \delta_{ym} G_y M + \delta_{im} G_i M + \Omega + \nu \tag{6}$$
where M stands for a variable that moderates comparative economic assessments.
As usual, a positive marginal effect of domestic growth ($\delta_y + \delta_{ym} M > 0$) would be consistent with both conventional economic voting and benchmarking. A negative marginal effect of international growth ($\delta_i + \delta_{im} M < 0$) would be consistent with benchmarking. The slopes of those marginal effects ($\delta_{ym}$ and $\delta_{im}$) measure the extent to which $M$ moderates relative economic assessments.
To illustrate how one can apply Model 6, we revisit a secondary set of tests from Aytaç,Footnote 49 where the author studies if benchmarking is more prevalent in countries with high trade intensity, GDP per capita or average level of schooling.Footnote 50 We assess the moderating effect of all three variables,Footnote 51 include the same control variables as in Table 2, and use Aytaç’s two alternative measures of domestic growth ($G_y$ and $G_t$). The full regression results are reported in the online appendix.
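The conditional marginal effect and its standard error can be computed along the following lines. This is a sketch on simulated data with classical OLS standard errors; the moderator, the data-generating process and the column ordering of the design matrix are all assumptions made for illustration, and the main-effect term for $M$ is included as is standard in interaction models.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulated, hypothetical data; M is a moderator scaled to [0, 1].
g_i = rng.normal(2.0, 1.0, n)
g_y = 0.5 * g_i + rng.normal(1.0, 1.5, n)
m = rng.uniform(0.0, 1.0, n)

# Assumed data-generating process: no benchmarking at M = 0,
# strong benchmarking at M = 1 (marginal effect of G_i equals -1).
v = (40.0 + 0.5 * g_y + 1.0 * m + 0.5 * g_y * m - 1.0 * g_i * m
     + rng.normal(0.0, 2.0, n))

# Design matrix columns: 1, G_y, G_i, M, G_y*M, G_i*M
X = np.column_stack([np.ones(n), g_y, g_i, m, g_y * m, g_i * m])
xtx_inv = np.linalg.inv(X.T @ X)
coef = xtx_inv @ X.T @ v
resid = v - X @ coef
cov = (resid @ resid / (n - X.shape[1])) * xtx_inv  # classical covariance


def marginal_effect_gi(m0):
    """Estimate and standard error of dV/dG_i = delta_i + delta_im * m0."""
    c = np.array([0.0, 0.0, 1.0, 0.0, 0.0, m0])
    return c @ coef, np.sqrt(c @ cov @ c)


for m0 in (0.0, 0.5, 1.0):
    est, se = marginal_effect_gi(m0)
    print(f"M = {m0:.1f}: dV/dG_i = {est:+.3f} (SE {se:.3f})")
```

Evaluating the function over a grid of moderator values and plotting the estimates with confidence bands yields a marginal effects plot of the kind used to assess conditional benchmarking.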
Figure 2 shows the estimated marginal effect of international growth in six models. None of the marginal effects is clearly negative, and most lines are nearly flat. These results, estimated using a dataset that is over twice the size of KP’s, offer no evidence of international benchmarking, and no evidence that trade, income or education increase the salience of comparative economic assessments.
Conclusion
In this article, we reinterpreted the theory of benchmarking and explained that, all else equal, it predicts that votes for the incumbent should be positively related to domestic growth, but negatively related to reference points. By recasting the theory’s predictions in terms of the marginal effects of domestic growth and the reference points, we showed that benchmarking could be tested using a simpler linear model that excludes duplicate regressors, immediately produces the relevant discriminating statistics and greatly facilitates interpretation.
We reanalyzed data from prominent studies that have claimed to present evidence clearly supportive of benchmarking. Across a range of models, we found robust evidence that domestic growth affects voting behavior, but very little sign of benchmarking. We therefore conclude that benchmarking is an interesting hypothesis, but that it is not supported by the available evidence.
These results should not be interpreted as a wholesale rejection of the benchmarking hypothesis. Indeed, it seems reasonable to expect that some populations may be more responsive to international or historical comparisons than others. For example, voters may be particularly attuned to the economic performance of neighboring countries or rivals.Footnote 52 At the individual level, some types of voters (for example, politically sophisticated ones) may also be more prone to compare domestic with global economic performance or present economic conditions with previous ones. The idea that some voters evaluate economic performance in relative terms has some intuitive appeal, but the idea that most voters systematically and accurately compare with an ‘objective’ benchmark seems rather implausible given citizens’ cognitive limitations.
Perhaps most importantly, we have shown that there are great risks in using composite measures to test theories of relative evaluation. We have demonstrated that there is a straightforward way to test the benchmarking hypothesis, which is to avoid composite variables, and to simply enter each term additively in the regression equation. Using such an approach, we can formulate clear tests of the benchmarking hypotheses: all else equal, the benchmark should have a negative marginal effect on support for the incumbent party (or other relevant dependent variables). This simple empirical framework can also be extended in straightforward fashion to test theories of benchmarking with multiple reference points or context conditionality. We hope to have provided clear guidelines for further research on this complex and important question.
Supplementary Material
The data, replication instructions, and the data’s codebook can be found in Harvard Dataverse at: http://dx.doi.org/10.7910/DVN/OCNIWD and online appendices at: https://doi.org/10.1017/S0007123418000236
Acknowledgments
We thank Erdem Aytaç, Timothy Hellwig, Michael Lewis-Beck, seminar participants at Texas A&M University, and our regular lunch companions Chez Valère.
Financial Support
This work was supported by the Social Sciences and Humanities Research Council of Canada, and the Fonds de Recherche Société et Culture du Québec.
Related Article
A comment by Mark A. Kayser and Michael Peress, “Benchmarking Across Borders: An Update and Response”, is published in the British Journal of Political Science and can be found here: https://doi.org/10.1017/S0007123418000625