The article by Henrich et al. is a valuable contribution that goes beyond prior critiques of the deplorable lack of representativeness of a large proportion of the participant samples used in the behavioral sciences. The cogency of the argumentation, and the breadth and detail of the empirical documentation provided, are impressive. My commentary therefore does not challenge the main thesis proposed by Henrich et al.; instead, its purpose is to supplement their argument and broaden its scope.
An important, although perhaps self-evident, observation is that the authors' thesis concerning WEIRD samples would be even more useful (perhaps considerably more so) had they at least mentioned and briefly outlined some other factors – often closely, and sometimes unavoidably, associated with the research designs using WEIRD samples – which may even more detrimentally affect the generalizability (external validity) of the results than does the lack of WEIRD samples' representativeness.
An abbreviated list of such factors will have to suffice here: unrepresentative sets of independent variables; artificiality of research settings; a limited number of tasks (often a single task) through which the independent variables are presented; and reliance on a single data-collection method (such as questionnaires, surveys, or rating scales) – and therefore on a single dependent measure (or an uninformatively correlated set of measures) that is often qualitatively different from the one to which generalization is sought in the “real world.” These factors are highly relevant for a more complete understanding of the issues in some of the areas discussed in the target article, especially fairness and cooperation, punishment of “excessive” cooperators, personal choice, the “fundamental attribution error,” and moral reasoning.
Moreover, one must worry about the (statistical) interaction of the effect of WEIRD samples' uniqueness (extremity, non-modal character) with the effects of these additional factors (e.g., the frequently highly artificial tasks), such that the overall result (especially when interactions are of a multiplicative form) would be even more misleading with regard to some real-world criterion and domain of desired application than is the case on the basis of WEIRD samples' “differentness” alone. On the other hand, if, for example, a greater variety of tasks were used, the presently observed differences between WEIRD and various non-WEIRD samples might in some cases disappear. One simply cannot predict what would happen without doing the research.
The above family of methodological observations has its root in the pioneering work of Campbell and colleagues (e.g., Campbell & Stanley 1963; Webb et al. 1966). Among the subsequent empirical demonstrations of some of the underlying principles were the studies by Ebbesen and Konečni: for example, of decisions under risk (in automobile driving; e.g., Ebbesen et al. 1977; Konečni et al. 1976) and of key decisions by judges, prosecutors, and other participants in the criminal justice system (Konečni & Ebbesen 1982b). An important aspect of this work has been the mustering of theoretical and empirical support for the idea of validated simulations in behavioral science (Konečni & Ebbesen 1992).
Among the judicial decisions studied in this research program were the setting of bail and, especially, the sentencing of felons (e.g., Ebbesen & Konečni 1975; Konečni & Ebbesen 1982a). This work utilized both WEIRD and non-WEIRD samples (as in the fourth “telescoping contrast” in Henrich et al.; see sect. 6) and supports the target article's skepticism. Moreover, a more general, but logical, extension is to question the applicability of WEIRD-based findings regarding aggressiveness, retribution, fairness and equity, and moral reasoning in general (cf. sect. 4.4) to international law. Here the most troubling possibility is the deliberate or unconscious incorporation of WEIRD-based findings into the normative expectations held by international bodies in “cognitively distant” war-torn areas – such as in Rwanda by the United Nations Assistance Mission for Rwanda and the International Criminal Tribunal for Rwanda. What must be very carefully taken into account are not only the enormous complexities of ancient tribal relations, but also those stemming from massive religious conversions by some of the warring parties under an external oppressor (as in Bosnia and Herzegovina, another internationally adjudicated conflict).
In sum, there is far more to external validity than the unrepresentativeness of samples. The only truly solid reason to trust an experimental simulation (especially one that potentially involves enormous human costs) is to have had it validated by means of careful successive approximations to the real world, each step moving closer to the actual real-world phenomenon – not just with different participant samples, but also guided by a multi-method × multi-dependent-measure matrix (Konečni & Ebbesen 1992).
Some additional observations are in order. Just as Nature Genetics requires all empirical papers to include data from two independent samples (target article, sect. 6.2, para. 3), the Journal of Personality and Social Psychology, for example, might begin to require not just the use of at least two different methods in the laboratory, but also both laboratory and field research – before researchers move away from psychology freshmen. If this were required, it seems likely that some “cute,” supposedly counterintuitive, task-specific effects (including in the area of heuristics and biases) would not be replicated even with different WEIRD samples. I am not as favorably disposed as Henrich et al. apparently are to Mook's (1983) idea that the use of WEIRD samples is justified “when seeking existential proofs” (sect. 7.1.6, para. 1); nor to the authors' admittedly clever idea of setting up research facilities in bus terminals and airports to capture non-university participants (sect. 7.3, para. 6) – if the same old suspect methods, such as “reactive” questionnaires and games with trivial pay-offs, would continue to be used.
Henrich et al. believe that behavioral scientists' tendency to claim “universality” for data obtained with WEIRD participants may in part be due to so many researchers themselves being WEIRD (sect. 7.1.1, para. 8). This fact may also be partly responsible for researchers' relative reluctance to worry adequately about external validity and about the effects of complex higher-order interactions among type of participants, methods, and settings. A sustained interest in such interactions may require a contextual (“field-dependent”) worldview and a holistic reasoning style that is (according to Henrich et al.) less utilized by WEIRD people, who favor analytical reasoning.