Lee and Schwarz (L&S) provided a theoretical account of grounded procedures, based on purportedly robust cleansing effects. Although acknowledging numerous failed replications of cleansing effects, L&S argued that several successful replications make it difficult to dismiss cleansing effects offhand. Here, we investigate whether the results of successful replications of the cleansing effects may in fact be consistent with the failed replications. We conclude that – based on the evidence they present – there is no support for the replicability of cleansing effects in the first place and thus no need to develop a theoretical account of grounded procedures.
Throughout the target article, L&S presented a selection of 23 effects in total: 14 non-significant and 8 statistically significant (the results of one study were not available). To critically appraise the evidence presented by L&S, we identified and coded the exact p-values reported for all the presented focal effects from the replication studies (data and R code are available at http://osf.io/c7ehk/). If the replication studies presented by L&S tapped into a genuine effect, the distribution of significant p-values would be expected to be right-skewed (i.e., indicative of evidential value): under a true effect, low p-values (e.g., 0.01) are more likely than high, “just-significant” p-values (e.g., 0.04), and this holds regardless of the level of statistical power. Using p-curve analysis, the degree of right skew can be used to test whether selective reporting can be ruled out as the sole explanation of the observed findings (Simonsohn, Nelson, & Simmons, 2014b). As shown in Figure 1, the distribution of significant p-values instead has a strong left skew, a distributional shape expected only under widespread selective reporting in primary studies or strong publication bias. The p-curve analysis indicated that the set of significant replication effects lacks evidential value, z = 2.79, p = 0.997. Direct replications of those seven successful replications are thus not expected to find an effect.
Figure 1. Distribution of p-values from the successful replications of cleansing effects.
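To make the p-curve logic concrete, consider a minimal R sketch. This is illustrative only, not the analysis code from our OSF page; the effect size, sample sizes, and number of simulations are arbitrary assumptions chosen for the demonstration.

```r
# Why significant p-values are right-skewed under a true effect: simulate
# two-group t-tests and keep only the significant results. Under the null,
# significant p-values are uniform on (0, .05); under a true effect, they
# pile up near zero. A left-skewed p-curve therefore signals selection.
set.seed(42)

sim_sig_p <- function(d, n_per_group, n_sim = 1e4) {
  p <- replicate(n_sim, {
    x <- rnorm(n_per_group, mean = 0)
    y <- rnorm(n_per_group, mean = d)
    t.test(x, y)$p.value
  })
  p[p < .05]  # keep only the "successful" (significant) results
}

p_true <- sim_sig_p(d = 0.5, n_per_group = 50)  # genuine effect
p_null <- sim_sig_p(d = 0.0, n_per_group = 50)  # no effect

mean(p_true < .025)  # well above .5: right skew under a true effect
mean(p_null < .025)  # close to .5: flat curve under the null
```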
We also assessed the chance of conducting 22 independent replication studies and finding seven significant effects yielding the observed or a more deviant pattern of p-values (median p-value closer to 0.05, or greater left skew). To do so, we carried out a Monte Carlo simulation, systematically varying the effect size from d = 0.1 to 0.6 in steps of 0.1, fully crossed with the set of sample sizes employed in the given replication designs (from 28 to 727). We simulated 10,000 sets of 22 replication studies for each combination of effect size and N. We then calculated the cumulative probability of observing seven or more significant effects for which the median p-value was the same as or higher than the median of the observed p-value distribution (Mdn_p = 0.04). In the simulation, the probability of observing such a pattern of high, significant p-values was only 0.00015. Based on 10^7 simulations, this pattern was unlikely even under the null hypothesis, with a probability of 0.0000017 (about 2 in a million). The probabilities of observing a set of significant p-values with the same or a higher degree of left skew were yet an order of magnitude smaller (see our OSF page). Four of the seven successfully replicated effects also had other issues, such as the undisclosed use of a one-tailed test and multiple testing without proper control of the error rate, making it even less likely that cleansing effects are replicable.
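The following R sketch shows the core of this simulation logic in simplified form. The full code, using the actual replication sample sizes (28 to 727) and the full grid of effect sizes, is on our OSF page; the per-study sample sizes below are illustrative stand-ins.

```r
# Simplified sketch of the Monte Carlo check: how often do 22 studies yield
# seven or more significant effects whose median p-value is >= .04?
set.seed(42)

one_set <- function(d, n_group_sizes) {
  p <- vapply(n_group_sizes, function(n) {
    x <- rnorm(n, mean = 0); y <- rnorm(n, mean = d)
    t.test(x, y)$p.value
  }, numeric(1))
  sig <- p[p < .05]
  # TRUE if the set mimics the observed pattern of high, significant p-values
  length(sig) >= 7 && median(sig) >= .04
}

n_studies <- 22
n_illustrative <- sample(30:350, n_studies)  # stand-in for the real Ns

# Probability of the observed-or-worse pattern for one effect size:
mean(replicate(1e4, one_set(d = 0.3, n_group_sizes = n_illustrative)))
```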
Are cleansing effects real? We don't know. L&S tried to unravel the purportedly contradictory results of replication studies using a meta-analysis, one that included a majority of successful replications (9 out of 17). They described finding an overall effect in general and an effect for successful replications in particular, even after accounting for publication bias. Their analytic approach, however, can be expected to indicate an underlying cleansing effect even if none exists. Both of their bias-tackling workhorses, fail-safe N and trim-and-fill, are known to rest on untenable assumptions and have long been considered outdated (see Becker, 2005b; Ferguson & Heene, 2012; Stanley & Doucouliagos, 2014). Their third method, the examination of the normal-quantile plot, is neither a formal bias-detection nor a bias-correction technique. Simulations show that under publication bias, the false-positive rate of the methods used by L&S approaches 100% as the number of included effects increases (Carter, Schönbrodt, Gervais, & Hilgard, 2019). Selective reporting, for which we found indications, further amplifies the effect of publication bias (Friese & Frankenbach, 2019). The analytic workflow employed by L&S thus renders cleansing effects hardly falsifiable. To examine one possible cause of the lack of evidence, we gathered information on the validity of measurement for the 23 effects included by L&S (i.e., whether the measure had been previously validated, whether its factor structure was examined either in the study itself or in an independent validation study, and whether any evidence of construct validity existed). For the focal variables, we were not able to find any evidence of validity, with only a single article reporting Cronbach's alphas.
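A short R sketch, assuming the metafor package and hypothetical simulated data, illustrates why trim-and-fill cannot rescue this workflow: when only significant, directionally consistent studies get "published", both the naive meta-analysis and the trim-and-fill adjustment typically recover a significant pooled effect even though the true effect is zero.

```r
# Hedged illustration of trim-and-fill's failure under publication bias.
library(metafor)
set.seed(42)

sim_biased_studies <- function(k_published, n_per_group = 30) {
  yi <- vi <- numeric(0)
  while (length(yi) < k_published) {
    x <- rnorm(n_per_group); y <- rnorm(n_per_group)  # true effect is zero
    d <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)
    if (t.test(y, x)$p.value < .05 && d > 0) {
      # only significant, positive studies get "published"
      yi <- c(yi, d)
      vi <- c(vi, 2 / n_per_group + d^2 / (4 * n_per_group))  # approx. SMD variance
    }
  }
  data.frame(yi, vi)
}

dat <- sim_biased_studies(k_published = 20)
res <- rma(yi, vi, data = dat)  # naive random-effects meta-analysis
trimfill(res)                   # trim-and-fill "correction"
# Both typically remain clearly significant despite a true effect of zero.
```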
To justify the need for an explanation, the literature on cleansing effects first needs to be subjected to a more severe test. A quantitative synthesis should examine patterns consistent with selective reporting and check the integrity of the statistics reported in primary studies by looking for inconsistencies. Publication bias tests should not be relied upon – they address a hypothesis that is known to be false (Morey, 2013). State-of-the-art correction methods such as regression-based estimators (Stanley & Doucouliagos, 2014) and especially multiple-parameter selection models (McShane, Böckenholt, & Hansen, 2016) should be employed by default. The specific implementation of bias correction depends on the analytical context, but for an example of such a workflow, see IJzerman et al. (2020) and Sparacio et al. (2021).
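For completeness, a minimal sketch of the recommended estimators, again assuming metafor and reusing the simulated data frame `dat` from the sketch above (effect sizes `yi`, sampling variances `vi`); it is not a substitute for the context-sensitive workflow described in the papers cited.

```r
# Bias-adjusted estimators recommended above, applied to biased null data.
library(metafor)

# PET: meta-regression of effect sizes on their standard errors
# (Stanley & Doucouliagos, 2014); the intercept estimates the effect of a
# hypothetical study with SE = 0.
pet <- rma(yi, vi, mods = ~ sqrt(vi), data = dat)
coef(summary(pet))["intrcpt", ]

# PEESE variant (variances as the moderator), conventionally consulted
# when the PET intercept is significant:
peese <- rma(yi, vi, mods = ~ vi, data = dat)

# Step-function selection model, in the spirit of the multiple-parameter
# selection approaches (McShane, Böckenholt, & Hansen, 2016):
sel <- selmodel(rma(yi, vi, data = dat), type = "stepfun", steps = c(0.025))
summary(sel)
```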
Short of solid evidence, we recommend that the research program on cleansing effects proceed by establishing explananda prior to explanations. The first stage in establishing explananda, we feel, is developing reliable tools for measurement and manipulation.
Financial support
The preparation of this work was partly funded by the Slovak Research and Development Agency under grants APVV-17-0418 and APVV-18-0140, by grant PRIMUS/20/HUM/009 awarded to Ivan Ropovik, and by a French National Research Agency “Investissements d'avenir” program grant (ANR-15-IDEX-02) awarded to Hans IJzerman.
Conflict of interest
The third author, IJzerman, is friends with Lee.