1. Introduction
The incentives, review processes, and norms of peer review provide the foundation for creating and sustaining the piecemeal contribution and critique of knowledge in scientific communities (Ziman 1969; Merton and Zuckerman 1971; Lee 2012). In light of its central role in locally determining the content of science (Hull 1988, xii), scientists and social scientists have taken it upon themselves to undertake "hypothesis-testing research on peer review and editing practices" (Fletcher and Fletcher 1997, 38). There is a growing industry of social scientific research on editorial peer review, about a third of which comes from psychology and another third from medicine (Weller 2001, 10).
Of the empirical research available on peer review, one of the "most basic, broadly supported, and damning" aspects is the failure of independent expert reviewers to achieve acceptable levels of agreement in reviews for journals and grant proposals across the physical, social, and human sciences as well as the humanities (Marsh, Jayasinghe, and Bond 2008, 161). I will review this literature and discuss the reflexive felicity of psychometric researchers imposing on themselves as peer reviewers the same standards that they would impose on the content of their research: in particular, high interrater reliabilities.
I will then argue that equating low interrater reliabilities with the invalidity of peer review as a test overlooks the ways in which low interrater reliabilities might reflect reasonable forms of disagreement among reviewers. Although this research focuses on the acceptance of papers and grant proposals as opposed to the acceptance of theories, I will argue that Kuhnian observations about how the definitions of epistemic values underdetermine their interpretation and application suggest new empirical hypotheses and philosophical questions about the kinds of reviewer disagreement we would expect to find. The extent to which these might account for and rationalize low interrater reliability rates remains an open empirical and philosophical question. Still, low interrater reliability rates remain problematic insofar as they cause individual peer review outcomes to result from the "luck of the reviewer draw" (Cole, Cole, and Simon 1981, 885). To close, I will discuss some of the discipline-wide communication structures that can help accommodate low interrater reliability rates. This discussion brings to light less obvious ways in which peer review constitutes a social epistemic feature of the production and communication of knowledge.
2. Interrater Reliability of Expert Reviewers
High correlations between mean reviewer recommendations and final decisions by editors and grant panels suggest that reviewer recommendations are taken very seriously (Cole et al. 1981; Bakanic, McPhail, and Simon 1987; Marsh and Ball 1989; Hargens and Herting 2006). Measures of the intraclass correlation between ratings for two reviewers on a single submission, or the single-rater reliability of reviewers, have been found to be quite low. Table 1 presents results from studies on single-rater reliability rates for grant review across disciplines.
Table 1. Single-Rater Reliability for Grant Review

|  | Single-Rater Reliability |
|---|---|
| National Science Foundation: |  |
| Chemical dynamics | .25 |
| Solid-state physics | .32 |
| Economics | .37 |
| Australian Research Council: |  |
| Social science and humanities | .21 |
| Physical sciences | .19 |

Sources.—Cicchetti (1991) and Jayasinghe, Marsh, and Bond (2003).
The finding that reliability measures for reviews in the physical sciences were not better than those in the social sciences and humanities is quite surprising since one might expect less consensus in disciplines with less developed research paradigms (Beyer 1978; Lindsey 1978). Single-rater reliabilities are also comparably low for reviews of manuscripts submitted to top journals such as American Sociological Review, Physiological Zoology, Law and Society Review, and Personality and Social Psychology Bulletin (Hargens and Herting 1990b). Even more interesting is research demonstrating low interrater reliabilities for specific evaluative criteria, as presented in table 2.
Table 2. Single-Rater Reliability for Specific Evaluative Criteria

|  | Single-Rater Reliability |
|---|---|
| Australian Research Council: |  |
| Originality | .17 |
| Methodology | .15 |
| Scientific/theoretical merit | .16 |
| Educational Psychology: |  |
| Significance | .12 |
| Research design | .23 |
| Clarity of problem, hypothesis, assumptions | .22 |
| Journal of Personality and Social Psychology: |  |
| Importance of contribution | .28 |
| Design and analysis | .19 |

Sources.—Scott (1974), Marsh and Ball (1989), and Jayasinghe (2003).
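The single-rater reliabilities reported in tables 1 and 2 are intraclass correlations under a one-way random-effects model. The following is a minimal sketch of how such a coefficient can be computed; the data are hypothetical and for illustration only, not the cited studies' code or ratings.

```python
# A minimal sketch of the single-rater intraclass correlation, ICC(1,1):
# the expected correlation between two ratings of the same submission
# under a one-way random-effects model. Hypothetical data, for illustration.
import numpy as np

def single_rater_reliability(ratings):
    """ratings: (n_submissions, k_raters) array of scores on a common scale."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    row_means = ratings.mean(axis=1)
    # Between-submission and within-submission mean squares (one-way ANOVA).
    ms_between = k * np.sum((row_means - ratings.mean()) ** 2) / (n - 1)
    ms_within = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Five hypothetical proposals, each scored on a 1-7 scale by two reviewers.
scores = [[5, 4], [6, 5], [2, 4], [3, 3], [7, 4]]
print(round(single_rater_reliability(scores), 2))  # a modest value, ~.35
```

On this reading, a value near .2, as in the tables above, means that only about a fifth of the variance in any single rating reflects genuine differences between submissions rather than differences among raters and error.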
3. Psychometric Assumptions
Psychometric approaches to studying peer review construe the low agreement between expert reviewers as deeply problematic.Footnote 1 This contention rests on a few key psychometric assumptions. The most fundamental assumption is that submissions have a latent overall quality value along a single dimension of evaluation. Indeed, this assumption is built into measures of interrater reliability, which quantify disagreement along a single, ordinal point scale (used in the evaluation of grant proposals) or with coefficients that require assigning arbitrary scores to recommendation categories (e.g., "accept," "revise and resubmit," "reject") or to distances between categories (used in the evaluation of journal submissions; Hargens and Herting 1990b, 1). The assumption that there is such a dimension of value is commonplace within psychometrics, which relies heavily on measuring hypothetical entities, or constructs, such as intelligence or creativity along a single dimension (Rust and Golombok 2009, 31).
This makes the role of expert reviewers that of identifying the latent quality value of a submission along this single dimension of evaluation with a high degree of reliability and, thereby, high interrater reliability (Hargens and Herting 1990a, 92). How high is high? Within the field of psychometrics, different types of tests are held to different standards of interrater reliability (Rust and Golombok 2009, 75–76), as shown in table 3. Some psychometrically oriented researchers suggest that levels of interrater reliability for expert reviewers should be about 0.8 (or even 0.9; Marsh et al. 2008, 162). Unfortunately, interrater reliability for expert reviewers is perilously close to the rates found for Rorschach inkblot tests.
Table 3. Single-Rater Reliability for Psychometric Tests

|  | Single-Rater Reliability |
|---|---|
| Intelligence tests | >.9 |
| Personality tests | >.7 |
| Essay marking | ∼.6 |
| Rorschach inkblot test | ∼.2 |

Source.—Rust and Golombok (2009).
From a psychometric perspective, if we assume that the raters are not in need of retraining, a test with too low a level of interrater reliability is considered invalid—that is, it cannot be said to measure what it purports to measure (Rust and Golombok 2009, 72). According to psychology's own disciplinary standards for valid testing, peer review is a "poor" evaluation tool (Bornstein 1991, 444–45; Suls and Martin 2009, 44). This makes the "critical direction for future research" that of improving "the reliability of peer reviews" (Jayasinghe, Marsh, and Bond 2003, 299). Along these lines, Jayasinghe et al. found that some of the variance in ratings resulted from biases related to characteristics of the reviewer: North American reviewers tended to give higher review ratings than Australian ones, reviewers nominated by the researcher gave higher ratings than those nominated by the grant panel, and scientists who reviewed fewer proposals gave higher ratings than those who reviewed more proposals (Jayasinghe et al. 2003). However, these statistically significant biases do not account for very much of the variance in reviewer ratings. Increasing the number of reviewers per proposal (to an average of 4.1 for the social sciences and humanities and 4.2 for the sciences) increased reliability measures to ∼0.47 (Jayasinghe et al. 2003). However, this measure is still low, falling between the rates found for essay graders and Rorschach inkblot tests. The obvious empirical conundrum is to figure out what can account for the rest of the variance in reviewer ratings.
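The gain from pooling reviewers is roughly what classical test theory would predict. As a back-of-the-envelope check (an illustration via the standard Spearman–Brown prophecy formula, not necessarily the multilevel estimator Jayasinghe et al. used), averaging k reviewers whose single-rater reliability is r_1 yields a composite reliability of

$$r_k = \frac{k\,r_1}{1 + (k - 1)\,r_1} \approx \frac{4 \times 0.19}{1 + 3 \times 0.19} \approx 0.48,$$

which is close to the reported ∼0.47. On the same arithmetic, reaching the psychometric benchmark of 0.8 with r_1 ≈ 0.19 would require roughly 17 reviewers per proposal.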
4. Interrater Reliability and Normatively Appropriate Disagreement
There is a reflexive felicity in psychometric researchers imposing on themselves as peer reviewers the same methodological standards (i.e., high interrater reliabilities) that they impose on the content of their research.Footnote 2 However, equating low interrater reliabilities with the invalidity of peer review as a test overlooks the ways in which low interrater reliabilities might reflect reasonable forms of disagreement among reviewers. When we shift focus from the numerical representation of a reviewer's assessment to the content on which such assessments are grounded, we can identify cases in which interrater disagreement reflects normatively appropriate differences in subspecialization, as well as normatively appropriate differences in the interpretation and application of evaluative criteria.
Differences in subspecialization and expertise can lead to low interrater reliabilities. Editors might choose reviewers to evaluate different aspects of a submission according to their subspecialization or expertise. For example, some reviewers might be sought for their theoretical expertise, while others might be sought for their technical expertise in, say, statistics, modeling, or a special sampling technique (Hargens and Herting 1990a, 94; Bailar 1991). Additional reviewers might be sought to review the domain-specific application of those techniques. In cases in which quality along these different aspects diverges, we would not expect high interreviewer reliability scores (Hargens and Herting 1990a, 94). It is normatively appropriate for editors and grant panels to rely on differences in reviewer expertise in the evaluation of a submission. Note that, in these cases, the discrepancy between reviewer ratings does not reflect disagreement about the same content, since the reviewers are evaluating different aspects of the research.
There are other cases that can involve more direct disagreement between reviewers. Reviewers can disagree about the proper interpretation and application of evaluative criteria. This possibility may have been overlooked because of long-standing work suggesting expert agreement about evaluative criteria within and across disciplines. Studies of editors of journals in physics, chemistry, sociology, political science, and psychology found strong agreement within disciplines about the relative importance of different criteria in the evaluation of manuscripts (Beyer 1978; Gottfredson 1978). And surveys of editors of top physical, human, and social science journals indicate agreement most especially about the importance of the significance, soundness, and novelty of submitted manuscripts (Frank 1996). Editor opinions about the relevant criteria of evaluation are important since, in 92.5% of cases, reviewers receive forms with instructions about evaluating manuscripts along these dimensions.
However, there are reasons to think that interdisciplinary and disciplinary agreement about evaluative criteria lies only on the surface. Lamont's interviews of interdisciplinary grant panelists show that disciplines attach different meanings to evaluative criteria such as originality and significance (Lamont 2009). Quantitative sociological research on discipline-specific publication biases corroborates her insights about how differently these criteria are interpreted and applied across disciplines. Consider, for example, the ubiquitous quest for novelty. In medicine, the interest in novelty is expressed as the preference for results of randomized, controlled trials that favor a new therapy as opposed to the standard therapy (Hopewell et al. 2009). In contrast, for social and behavioral scientists, the emphasis on novelty gets expressed as a preference for new effects over replications or failures to replicate an existing effect (Neuliep and Crandall 1990), regardless of whether these effects constitute an "improvement" in normative outcome.Footnote 3
We have Kuhnian reasons to think experts within disciplines might disagree about how best to interpret and apply evaluative criteria. Recall Kuhn's observation that how different scientific values are applied in the evaluation of competing theories is underdetermined by their definitions and characterizations (1977, 322). Likewise, evaluative criteria in peer review are not sufficiently characterized to determine how they are interpreted and applied in the evaluation of papers and projects. Just as two scientists who agree about the importance of accuracy can disagree about which theory is more accurate, two expert reviewers who agree on the importance of novelty can disagree about whether a peer's paper or project is novel. This is because scientists and expert reviewers can come to different antecedent judgments about the significant phenomena or respects in which a theory or submission is thought to be accurate or novel. These Kuhnian considerations challenge the ideal that peer review is impartial in the sense that reviewers see the relationship of evaluative criteria to submission content in identical ways (Lee and Schunn 2011; Lee et al., forthcoming). This is a basic theoretical problem about value-based evaluations that applies not just in the interdisciplinary contexts Lamont studies but in disciplinary contexts as well.
5. Empirical and Normative Questions: Kuhnian Considerations
An empirical hypothesis we might propose in light of the Kuhnian considerations just raised is that experts can have diverging evaluations about how significant, sound, or novel a submitted paper or project is because they make different antecedent judgments about the relevant respects in which a submission must fulfill these criteria.Footnote 4 Current empirical research corroborates this kind of hypothesis. Quantitatively, if the hypothesis were true, we would expect low interrater reliabilities along evaluative dimensions, as researchers have discovered (Scott 1974; Marsh and Ball 1989; Jayasinghe 2003). Qualitatively, if the hypothesis were true, we would expect reviewers to focus on different aspects of a submission in the content of their reviews: their focus on different features of the work, by the Gricean maxim of relevance (Grice 1989), would suggest that they take different aspects of the work to be most relevant in evaluations of quality. Qualitative research corroborates the suggestion that reviewers focus on different aspects of research. An analysis of reviewer comments from more than 400 reviews of 153 manuscripts submitted to American Psychological Association journals across a range of subdisciplines found that narrative comments offered by pairs of reviewers rarely had critical points in common, either in agreement or in disagreement. Instead, critiques focused on different facets of the paper (Fiske and Fogg 1990). A different study found that comments from reviewers who recommended rejecting papers that went on to become Citation Classics or win Nobel Prizes claimed that the manuscripts failed to be novel or significant in what the reviewers took to be the relevant ways (Campanario 1995).
This last example forcefully raises the normative question of whether disagreements about the interpretation or application of evaluative terms should always be counted as appropriate. In times of revolutionary science, these forms of disagreement may be normatively appropriate in the sense that they are reasonable in light of available evidence and methods. As Kuhn observed, the available evidence for competing theories during scientific revolutions is mixed, where each theory has its own successes and failures (1977, 323). Members of opposing camps prefer one theory or approach because they identify as most significant the specific advantages of their theory and the specific problems undermining the competing one, although there are no evidential or methodological means at the time to establish which aspect is most relevant or crucial.
In light of Kuhn's observations, we would expect reviewers in different camps—with different beliefs about what constitute the most significant advantages or disadvantages of competing theories—to have diverging opinions about how significant, sound, or novel a submitted paper or project is. As a result, we would expect reviewers in different camps to arrive at reasonable disagreements about the quality of a particular submission.Footnote 5 If editors were to choose expert reviewers from competing camps and mix their evaluations with those of neutral referees, we would expect low correlations between reviewer ratings (Hargens and Herting 1990a, 94).
However, in periods of normal science, it is unclear whether disagreements about what features of a submission should be counted as most relevant are reasonable in these ways. Philosophical analysis of peer reviews should be undertaken to evaluate this question. By making this suggestion, I am not defending the claim that peer review, as it is currently practiced, functions as it should. Nor am I denying that normatively less compelling factors might contribute to low interrater reliability measures.Footnote 6 I am simply suggesting new lines of empirical and philosophical inquiry motivated by Kuhnian considerations.
Further empirical and philosophical analysis should be undertaken to measure the extent to which the variance in reviewer ratings can be accounted for by reasonable and unreasonable disagreements of various kinds. Until these empirical-cum-philosophical analyses are done, it will remain unclear to what extent low interrater reliability measures represent reasonable disagreement rather than arbitrary differences between reviewers.
Psychometrically oriented researchers might suggest an alternative research program that would accommodate reasonable disagreement among reviewers while preserving the idea that low interrater reliability measures (of some kind) render peer review a poor or invalid test for assessing the quality of submissions. Under this refined research program, the task would be to evaluate how well peer review functions by measuring interrater reliability among editors rather than among reviewers. After all, it would be reasonable for editors to improve the quality of pooled reviews by choosing reviewers with diverging expertise and antecedent judgments about the significant respects in which a submission should be understood as novel, sound, or significant. This shifts the locus of relevant expert agreement to the editorial rather than the reviewer level.
Unfortunately, there is little to no research on intereditor reliability rates. Note, however, that Kuhnian concerns recur at the editorial level: editors could disagree with each other about the relevant respects in which a submission should be considered novel, sound, or significant. Along these lines, sociologists Daryl Chubin and Edward Hackett suggest that the editor's task is its own kind of "Rohrschach [sic] test," where "both the article and the referee's interpretation are for the [editors] to weigh or discard as they see fit" (Chubin and Hackett 1990, 112).
6. Social Solutions to the Luck of the Reviewer Draw
Regardless of whether we discover reasonable forms of disagreement among reviewers, decisions to accept or reject submissions must be made. Even if the considerations raised by disagreeing reviewers are not arbitrary, low interrater reliabilities can make peer review outcomes an arbitrary result of which reviewer perspectives are brought to bear. Jayasinghe et al. found that the decision to fund a grant submission "was based substantially on chance" since few grant proposals were far from the cutoff value for funding when 95% confidence intervals were constructed for each proposal (Marsh et al. 2008, 162). Cole et al. found that the mean ratings of their newly formed panel of expert reviewers differed enough from the mean ratings of the actual National Science Foundation reviewers that, for about a quarter of the proposals, the funding decisions would have been reversed. They concluded that actual outcomes depend to a large extent on the "luck of the reviewer draw" (1981, 885).
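To see how low reliability translates into the luck of the draw, consider a minimal simulation sketch. The parameters are hypothetical and of my own choosing, not the cited studies' data or methods: proposals facing a fixed funding cutoff are scored by two independently drawn panels whose ratings have a single-rater reliability near .2. With settings like these, the two panels typically disagree on a sizable minority of funding decisions, in the spirit of Cole et al.'s finding.

```python
# A minimal simulation sketch (hypothetical parameters, not the cited studies'
# data) of the "luck of the reviewer draw": given a single-rater reliability
# near .2 and a fixed funding rate, how often do two independently drawn
# reviewer panels reach different decisions about the same proposals?
import numpy as np

rng = np.random.default_rng(0)
n_proposals, k_reviewers, icc, fund_rate = 200, 4, 0.2, 0.3

# One-way random-effects model: a rating is true proposal quality plus
# reviewer noise, with var(quality) / var(rating) equal to the ICC.
quality = rng.normal(0.0, np.sqrt(icc), n_proposals)
noise_sd = np.sqrt(1.0 - icc)

def panel_decision():
    """Average k reviewers per proposal and fund the top fund_rate fraction."""
    ratings = quality[:, None] + rng.normal(0.0, noise_sd, (n_proposals, k_reviewers))
    means = ratings.mean(axis=1)
    return means >= np.quantile(means, 1.0 - fund_rate)

first_panel, second_panel = panel_decision(), panel_decision()
# Fraction of proposals whose funding decision flips between the two panels.
print("reversed decisions:", np.mean(first_panel != second_panel))
```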
These observations raise important questions about how discipline-wide publication venues should be structured to accommodate these kinds of problems. Hargens and Herting argue that the number of prestigious outlets for publication within a discipline, as well as the acceptance thresholds set for these outlets, play an overlooked but crucial role in addressing low interrater reliability rates (1990a, 102–3). Disciplines with very few "core" journals serving as high-prestige outlets (e.g., Astrophysical Journal in astronomy and astrophysics and Physical Review in physics) are more vulnerable to the possibility of relegating important work to less prestigious and less visible journals as a result of the luck of the reviewer draw. However, these disciplinary journals "minimize this threat" by accepting the great majority of submissions (75%–90%). In contrast, disciplines like psychology and philosophy, with many core journals, allow more chances for important work to find a prestigious venue through an iterative process of submission and review. These journals can afford to have substantially higher rejection rates.
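A toy calculation (an illustration of my own, not Hargens and Herting's) makes the structural point concrete: if a paper's submissions to n prestigious outlets were judged independently, each with acceptance probability p, the chance of its eventually landing in some core journal would be

$$1 - (1 - p)^n.$$

A discipline with a single core journal must keep p high (say, .8) to avoid misplacing good work, whereas a discipline with five core journals and a per-journal acceptance rate of only .2 still offers a cumulative chance of roughly $1 - (1 - 0.2)^5 \approx 0.67$. The independence assumption is, of course, an idealization, since the same reviewers and reviewer norms recur across venues.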
The obvious empirical question is whether these considerations can rationalize the large and stable differences observed in acceptance rates in the social sciences and humanities (about 10%–30%) versus the physical sciences (about 60%–80%; Zuckerman and Merton 1971; Hargens 1988, 1990). Structural accommodations of this kind might be constrained by discipline-specific goals, norms about whether to risk accepting bad research versus rejecting good research (Cole 1992), and norms about whether time and future research (as opposed to peer review) should serve as the central filter for assessing quality (Ziman 1969).
Peer review clearly constitutes a social epistemic feature of the production and dissemination of scientific knowledge. It relies on members of knowledge communities to serve as gatekeepers in the funding and propagation of research. It calls on shared norms cultivated by the community. And it relies on institutions such as journal editorial boards, conference organizers, and grant agencies to articulate and enforce such norms. However, in light of research on low interrater reliabilities and the role that discipline-wide communication structures can play in addressing the luck of the reviewer draw, it is clear that we should also analyze and evaluate peer review's social epistemic function within larger communication structures to identify how those structures can accommodate reviewer disagreement.
7. Conclusion
Reflecting on the various ways in which epistemic values can be interpreted and applied, Kuhn suggested that "essential aspects of the process generally known as verification will be understood only by recourse to the features with respect to which" researchers "may differ while still remaining scientists" (1977, 334). It remains an open empirical and philosophical question whether the same can be said of peer review, namely, that the essential aspects of the process known as expert peer review should be understood by recourse to the features with respect to which reviewers may differ while still remaining experts in their field. Further inquiry into this philosophical and empirical question should be undertaken, with sensitivity to how reasonable and unreasonable disagreement can be accommodated in discipline-wide communication structures.