1. Introduction
List experiments (also known as the item count technique (ICT)) are a widely used survey technique designed to elicit true preferences on sensitive topics that are vulnerable to social desirability bias (Rosenfeld et al. 2016). They work as follows: respondents are divided into control and treatment groups. The control group is shown J non-sensitive statements and asked to indicate how many are true. The treatment group is shown J + 1 statements, where the J statements are the same as the control group's and the +1 is a sensitive item that may elicit a socially desirable response if asked directly. The difference in the mean number of true statements between the control and treatment groups, referred to as the difference-in-means estimator, is interpreted as the percentage of the population for whom the sensitive statement is true. The technique has been used to estimate the prevalence of a wide range of socially sensitive attitudes and behaviors, from ethnic prejudice to sexual practices and voting behavior.[1]
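To make the difference-in-means estimator concrete, the following minimal sketch simulates a list experiment in Python; all quantities (J = 4, item probabilities, a 10 percent prevalence of the sensitive item) are illustrative assumptions, not values from any study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # respondents per arm (illustrative)

# Control arm: count of "true" answers to J = 4 non-sensitive items,
# each assumed true with probability 0.45.
control = rng.binomial(4, 0.45, size=n)

# Treatment arm: the same J items plus one sensitive item assumed
# true for 10 percent of respondents.
treatment = rng.binomial(4, 0.45, size=n) + rng.binomial(1, 0.10, size=n)

# Difference-in-means estimate of the sensitive item's prevalence,
# with the usual independent-samples standard error.
tau_hat = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / n + control.var(ddof=1) / n)
print(f"estimated prevalence: {tau_hat:.3f} (SE {se:.3f})")
```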
List experiments are subject to both strategic and non-strategic respondent error (Ahlquist 2018). Strategic errors arise when respondents lie to conceal their position on the sensitive item, which is revealed when they report all or none of the statements to be true. To prevent these ceiling and floor effects, best practice calls for including one relatively rare and one relatively common item (Blair and Imai 2012; Glynn 2013). Non-strategic error includes coding errors and poor-quality responses that arise when respondents misunderstand or rush through the list experiment. As Ahlquist (2018) notes, previous work on ICT has generally disregarded the implications of non-strategic errors.
We draw attention to a potential non-strategic error that emerges from the differential list lengths in typical ICT designs: the higher number of statements in the J + 1 treatment group relative to the J control group may artificially inflate the count of "true" statements in the treatment group if respondents resort to satisficing, for example by selecting the perceived middle point (Krosnick 1999). Despite the potential for this error, only a few studies (Holbrook and Krosnick 2010; Ahlquist et al. 2014; Kiewiet de Jong and Nickerson 2014) have directly examined the effect of ICT design on responses. They take the following approach: a placebo statement describing an exceedingly rare or impossible event is added to an alternative control group. Since the statement should be false for all respondents, the mean of the J + 1 alternative control group should be the same as that of the J control group. Any significant difference in means therefore reflects bias from the standard ICT design.
Individually, the studies are inconclusive. Holbrook and Krosnick (2010) find a difference in means that suggests inflation; the effect, however, does not reach statistical significance in a two-tailed t-test. Ahlquist et al. (2014) likewise find evidence of inflation, this time at statistically significant levels.[2] Kiewiet de Jong and Nickerson (2014) explicitly look for inflation or deflation; they find "little evidence of an upward bias in estimates" (p. 662). None of the studies reports strong evidence of heterogeneous effects.
This paper brings a representative sample with substantial statistical power to bear on the question of non-strategic bias in ICT design.[3] As in the previous studies, we use a placebo statement to identify the potential effects of differential list lengths. We find strong evidence of mechanical inflation, though only among the subgroup with relatively low levels of educational attainment. This finding is consistent with previous research showing that response quality varies with cognitive ability and education levels (see Krosnick 1991). Because list experiments require greater attention to detail and concentration than conventional questions, this subgroup may have an increased propensity to resort to satisficing (Kramon and Weghorst 2012), which in turn can drive mechanical inflation.
We also conduct a meta-analysis of previous work, finding inflation to be more likely than not and roughly the size of many reported treatment effects, at around 7–8 percent; details are in Section 4 of the supplementary materials. Moreover, we reanalyze data from Ahlquist et al. (2014) for heterogeneous effects and find, consistent with our study, evidence of inflation among the subgroup with relatively low levels of educational attainment.[4]
Our findings have important implications for list experiment best practices. They suggest that the conventional J/J + 1 design is vulnerable to bias toward positive findings, at least in contexts where some respondents have low levels of formal education or are especially prone to satisficing. To protect against this bias, we recommend including a placebo statement in the control group so that both lists have equal length (J + 1/J + 1). The placebo statement should be false for all or nearly all respondents, and should not be so disruptive that it triggers low-quality responses to the remaining list items.[5] Ultimately, the inclusion of a placebo statement is a costless preventative measure that neither increases cognitive demands nor alters the interpretation of the experiment, but does protect against the observed mechanical bias among vulnerable subgroups.
2. Data and survey design
The data described below come from a list experiment embedded in a survey on social attitudes and community relations in Singapore, conducted from 2 March to 18 April 2019. The survey was administered in person by a multi-ethnic team of enumerators made up of local university students, either on weekdays (6–8pm) or weekends (10am–6pm). Most respondents took between 5 and 10 minutes to complete the questionnaire, which consisted of closed-ended questions. Buildings were randomly selected to approximate a representative sample of the resident population. The response rate was 38.6 percent, marginally above the typical rate of surveys carried out by official institutions in Singapore.[6] In total, the dataset contains 1,278 observations. Full details of the survey methodology are included in the supplementary materials.[7]
The list experiment was designed to estimate mechanical inflation. The four-item control group received four neutral statements, while the five-item placebo group received the same four neutral statements, plus a (necessarily false) placebo statement.
All groups received the same instructions: "Look at the following statements below. Can you tell us how many statements are true for you? Please don't tick individual statements, just tell us the total number" [emphasis in the original]. The four neutral statements were chosen using the generally accepted criteria for list experiments: a natural fit with the context of the survey, low correlation both with one another and with broader socio-economic characteristics, and resistance to ceiling and floor effects.
The placebo statement was designed to be plausible but false for all respondents: "I have been invited to have dinner with PM Lee at Sri Temasek next week."[8] This is the equivalent of being invited to dinner with the President of the United States at the White House, or some other equally improbable event. Hence, we assume that it is false for all respondents and easily recognized as such.
3. Results
Table 1 provides a summary of the overall findings. For the whole sample, the mean number of reported true statements is higher for the placebo group (1.89) than for the standard control group (1.77). The magnitude is substantial: the 0.12 gap suggests that the inclusion of the +1 placebo statement induces roughly 12 percent of respondents in the five-item placebo group to report one more true statement than their counterparts in the four-item control group. Figure 1 in the supplementary materials provides the frequency distributions for the four-item and five-item groups. Few respondents in either group report 0 or all statements to be true, which suggests that the presence of a clearly false placebo statement does not push respondents toward extreme counts.[9]
Note to Table 1: Reported p-values are from one-sided difference-in-means t-tests between the four-item control and five-item placebo groups. Political knowledge: "1" if the respondent knows the electoral district in which they reside, "0" otherwise.
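As an illustration of the test behind these p-values, the sketch below compares simulated four-item and five-item counts with a one-sided t-test; the data and the choice of a Welch (unequal-variance) test are our assumptions, since the paper does not specify whether variances were pooled.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated item counts (illustrative only): the placebo arm inflates
# the count by one for roughly 12 percent of respondents.
four_item = rng.binomial(4, 0.44, size=600)
five_item = rng.binomial(4, 0.44, size=600) + rng.binomial(1, 0.12, size=600)

# One-sided test: H1 is that the five-item placebo mean is greater.
t, p = stats.ttest_ind(five_item, four_item,
                       equal_var=False, alternative="greater")
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```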
Table 1 also reports the mean number of true statements by subgroup along the dimensions of political knowledge, educational attainment, household income, and age.[10] We opt for simple categories to facilitate comparisons: respondents are coded as having high political knowledge when they can correctly name their electoral district; household income is split above and below 3,500 Singapore dollars per month (roughly the bottom third); and age is split above and below 60 years.
The findings suggest that the treatment effect of the placebo statement is highly heterogeneous: for the politically knowledgeable, the relatively educated, and the middle and upper income groups, the difference in means between the four-item control and five-item placebo groups is insignificant, meaning that the inclusion of the placebo statement does not inflate the reported number of true statements. By contrast, the difference in means is statistically significant and substantively meaningful among the counterpart subgroups. This provides a strong initial indication of which respondent types are most vulnerable to mechanically inflating their count of true statements in conventional list experiments.
To check the robustness of these findings in a different context, we examine data from Ahlquist et al. (2014), which are available online at the Harvard Dataverse.[11] That study likewise uses a standard four-item control group and a five-item placebo group, in which the extra placebo statement is necessarily false for all respondents. The 3,000 responses were collected via an online survey in the United States. The results of the replication study are broadly in line with our general conclusions. The mean item count in the five-item placebo group is 0.07 points higher than in the four-item control group; the difference reaches conventional levels of statistical significance. Furthermore, the elderly and those with lower levels of formal education are more likely to mechanically increase their reported number of true statements in response to the placebo statement, supporting our finding of heterogeneous treatment effects. The effect of income, however, is inconclusive. Details of the replication study and further discussion can be found in Section 3.3 of the supplementary materials.
We return to our dataset to examine the heterogeneous treatment effects more precisely. Since formal education, age, and income may themselves be correlated, we estimate an OLS regression model using the following specification, originally from Holbrook and Krosnick (2010) and subsequently adopted by Imai (2011) and Blair and Imai (2012):

LIST_i = α + X_i′β + δ·PLACEBO_i + (PLACEBO_i × X_i)′γ + ε_i

where LIST_i is the number of statements reported as true in the list experiment, X_i is a vector of sociodemographic variables, and PLACEBO_i is a dummy that takes the value "1" if the respondent was part of the five-item placebo group and "0" if part of the four-item control group. γ is the vector of our coefficients of interest: we expect it to be significant for the variables identified in Table 1.
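A sketch of this specification using Python's statsmodels, on synthetic stand-in data (all variable names and data-generating values below are our assumptions, not the survey's codebook), with district fixed effects and district-clustered standard errors as reported in Table 2:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data for 1,200 respondents across 18 districts.
rng = np.random.default_rng(0)
n = 1200
df = pd.DataFrame({
    "placebo": rng.integers(0, 2, n),        # 1 = five-item placebo group
    "education": rng.integers(6, 19, n),     # years of schooling
    "age61": rng.integers(0, 2, n),          # 1 = 61 years or older
    "hh_income": rng.gamma(2.0, 2.5, n),     # thousands of S$ per month
    "pol_knowledge": rng.integers(0, 2, n),  # 1 = names own district
    "district": rng.integers(0, 18, n),
})
# Outcome: four neutral items, plus education-dependent mechanical
# inflation in the placebo arm (assumed for illustration).
p_inflate = np.clip(0.30 - 0.015 * df["education"].to_numpy(), 0, 1)
df["list_count"] = (rng.binomial(4, 0.45, n)
                    + df["placebo"] * rng.binomial(1, p_inflate, n))

# placebo * (...) expands to main effects plus PLACEBO x X interactions,
# i.e., the delta and gamma terms of the specification above.
model = smf.ols(
    "list_count ~ placebo * (education + age61 + hh_income + pol_knowledge)"
    " + C(district)",
    data=df,
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["district"]})
print(result.summary())
```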
Table 2 reports the results. Panel A captures the interaction between individual characteristics and the placebo statement, which can be read as the propensity to inflate the number of "true" statements in the five-item placebo list. Panel B captures the baseline relationship, i.e., the correlation between individual characteristics and the number of "true" statements in the four-item control list. For example, column 1 in Panel B indicates that an elderly respondent in the four-item group reports on average 0.071 fewer items than a younger counterpart in the same group, though the difference does not reach conventional levels of statistical significance. Column 1 in Panel A indicates that an elderly respondent in the five-item group reports on average 0.181 more items than a younger counterpart in the five-item group. Note that the baseline for Panel A (five-item placebo group) is 1.89 items, i.e., 0.12 higher than the Panel B (four-item control group) baseline of 1.77.
***Significant at the 1 percent level; **at the 5 percent level; *at the 10 percent level.
Standard errors (in parentheses) clustered at the electoral district level (18 districts). Dependent variable: number of "true" items in the list experiment. District fixed effects included in all specifications. Education: years of schooling. 61+ years old: dummy for being 61 years of age or older. Hhd. income: monthly household income, in thousands of Singapore dollars. Political knowledge: "1" when the respondent correctly names the electoral district in which they reside; "0" otherwise. Controls: gender, ethnicity, apartment size. PLACEBO: dummy for being in the five-item placebo group. All coefficients reported in Panel A are interactions of PLACEBO × variable.
Specifications (1)–(4) confirm the unconditional results of Table 1 using fixed effects and clustered standard errors: age, education, political knowledge, and income are associated with mechanical inflation, although only education reaches conventional levels of statistical significance. The results also suggest that, on average, respondents with only a primary school education report 0.32 more true items when presented with the placebo statement, whereas the predicted difference between responses to the four-item and five-item lists is a negligible 0.02 points for those with a college degree.[12]
Specifications (5)–(9) add further sociodemographic controls (gender, ethnicity, and apartment size); the earlier findings are robust to their inclusion. Specification (9), which includes all controls and variables of interest, shows that educational attainment is the strongest predictor of inflating the number of true statements in response to the placebo statement.
Other variables (especially income and political knowledge) likely lose their significance due to limited power and multicollinearity. Finally, note that the R² values are generally quite low: this is evidence that, as intended by design, agreement with the statements in our list experiment is randomly distributed across the population and hard to predict from observables.[13]
To illustrate the effects of education and income on the propensity to inflate item counts, we predict the number of "true" statements using specification (9) from Table 2 and present the results through a smooth polynomial fit. Figure 1 shows the results, with the left panels showing responses from the five-item placebo group and the right panels responses from the four-item control group.
With the full set of controls, a respondent with primary school education or below (six years of schooling) is likely to report around 0.3 more true items on average when presented with the five-item list than when presented with the four-item list, a difference that disappears for those with a bachelor's degree or higher (18+ years of schooling) (panels a and b). A similar effect appears for respondents in the lowest income groups, who report on average 0.2 more true items with the five-item list than with the four-item list, an effect that vanishes for higher income groups (panels c and d). This suggests mechanical inflation in the lower socioeconomic strata.
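Reusing df and result from the regression sketch above, a figure in this spirit can be produced by predicting item counts and smoothing them by arm and years of schooling; the lowess smoother and plotting layout here stand in for the paper's polynomial fit and are our assumptions.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

df["predicted"] = result.predict(df)

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
for ax, arm, label in zip(axes, (1, 0),
                          ("five-item placebo", "four-item control")):
    sub = df[df["placebo"] == arm]
    # Lowess returns (x, fitted) pairs sorted by x.
    smooth = sm.nonparametric.lowess(sub["predicted"], sub["education"])
    ax.plot(smooth[:, 0], smooth[:, 1])
    ax.set_title(label)
    ax.set_xlabel("years of schooling")
axes[0].set_ylabel("predicted item count")
plt.tight_layout()
plt.show()
```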
4. Conclusion
This paper uses original data to provide evidence of mechanical inflation in conventional list experiments. It finds evidence of heterogeneous effects, with inflation most pronounced among respondents with low educational attainment, who may be most inclined toward satisficing. We find additional evidence for this conclusion in a replication exercise using data from Ahlquist et al. (2014). Moreover, we conduct a meta-analysis using results from Ahlquist et al. (2014), Holbrook and Krosnick (2010), and Kiewiet de Jong and Nickerson (2014). This shows inflation to be more likely than not and roughly the size of many reported treatment effects; that is, around 0.074 points when pooling all studies together and weighting by the number of observations.[14]
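Concretely, with d_s the difference in means and n_s the number of observations in study s, the pooled estimate implied by this weighting is the observation-weighted average

d̄ = (Σ_s n_s · d_s) / (Σ_s n_s) ≈ 0.074.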
The findings have clear implications. Studies that rely on list experiments in contexts where low educational attainment is widespread may report artificially inflated treatment effects that lead to invalid conclusions. By contrast, studies using convenience samples that over-represent young and educated respondents are comparatively less vulnerable, though they may still be problematic if respondents resort to satisficing, for example when incentives for providing accurate responses are inadequate or when the questionnaire is particularly long or cognitively demanding.
We suggest a simple preventative solution. Inclusion of a placebo statement in the control group equalizes the control and treatment list lengths, thereby preventing artificial inflation of the treatment group when respondents resort to satisficing. The placebo statement should: (i) be false for all or nearly all respondents and be easily recognized as such; (ii) be orthogonal to other items in the list to avoid interactions that may themselves introduce bias; and (iii) not be so outlandish or disruptive that respondent seriousness declines, which may increase the risk of extreme responses like “all” or “none”. When samples are sufficiently large, requirement (ii) can be confirmed by randomly alternating between different placebo statements and ensuring there is no difference in means.
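For large samples, this check on requirement (ii) amounts to testing that mean counts do not differ across randomly assigned placebo variants; a minimal sketch with simulated data (three hypothetical variants) using a one-way ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated counts under three candidate placebo statements, each
# randomly assigned to 400 respondents (illustrative data).
counts_by_variant = [rng.binomial(4, 0.45, 400) for _ in range(3)]

# One-way ANOVA across variants; a large p-value is consistent with
# the placebo statements being interchangeable.
f, p = stats.f_oneway(*counts_by_variant)
print(f"F = {f:.2f}, p = {p:.3f}")
```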
Placebo statements are essentially costless, as they do not alter the mechanics, cognitive demands, or interpretation of list experiments. Given their potential benefits, we see no reason to forgo them in any setting, but they are especially valuable in contexts where educational attainment is low, or in instruments that are unusually vulnerable to satisficing.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/psrm.2020.18.