I. Introduction
Recent research on wine judging raises questions about both the reliability and consensus of wine quality judgments made by experienced judges in blind tastings (e.g., Gawel and Godden, 2008; Hodgson, 2008, 2009a, 2009b). Reliability, an intraindividual notion, concerns the similarity of repeat judgments of the same wine by an individual judge, while consensus, an interindividual notion, concerns the similarity of the judgments of a particular wine between or among two or more independent judges. Both reliability and consensus are necessary requirements for expertise in wine judging. Stated simply, the basic issues are the extent to which individual wine judges repeat their own judgments (which I label “expertise within”) and the extent to which different wine judges agree in their judgments (which I label “expertise between”).
Although the body of research on the reliability and consensus of experienced wine judges is small (and mostly recent), both the “within” and “between” aspects of judgment variability have been the subject of extensive research across many professional fields over many decades. In this paper, I review the few wine studies that exist and compare their results with those of a much larger sample of carefully controlled experimental studies that examine reliability and consensus in the fields of medicine, clinical psychology, business, auditing, personnel management, and meteorology. All the studies that I review quantify reliability and consensus using correlational measures. Reliability for each individual judge is measured as the correlation between repeat judgments of identical stimuli on two different occasions. Consensus is measured as the correlation between the judgments of identical stimuli by each pair of judges. Correlational measures are by far the most common way of quantifying reliability and consensus, and they offer the advantage of greater comparability of the levels of reliability and consensus across individuals, judgment tasks, and, ultimately, professional fields. I exploit this advantage by contrasting the level of reliability and consensus found in wine judging with that found in other fields.
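To make these two correlational measures concrete, the following minimal Python sketch computes intrajudge reliability as the correlation between each judge's first and repeat ratings of the same wines, and interjudge consensus as the correlation between the ratings of each pair of judges. It is illustrative only: the ratings are simulated and are not drawn from any of the studies reviewed here.

```python
# Minimal sketch of the two correlational measures (simulated ratings).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_judges, n_wines = 5, 20

# Hypothetical quality ratings from two blind tastings of the same wines.
ratings_t1 = rng.normal(14, 2, size=(n_judges, n_wines))                 # first tasting
ratings_t2 = ratings_t1 + rng.normal(0, 1.5, size=(n_judges, n_wines))   # repeat tasting

# Intrajudge reliability: for each judge, correlate first and repeat ratings.
reliability = [np.corrcoef(ratings_t1[j], ratings_t2[j])[0, 1]
               for j in range(n_judges)]

# Interjudge consensus: for each pair of judges, correlate their (first) ratings.
consensus = [np.corrcoef(ratings_t1[a], ratings_t1[b])[0, 1]
             for a, b in combinations(range(n_judges), 2)]

print("mean reliability:", round(float(np.mean(reliability)), 2))
print("mean consensus:  ", round(float(np.mean(consensus)), 2))
```

Averaging the per-judge correlations gives the mean reliability figures, and averaging the per-pair correlations gives the mean consensus figures, of the kind summarized later in Tables 1 and 2.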
Of course, professional judges—in any field—cannot be expected to achieve perfect reliability or perfect consensus, especially the latter, for many reasons. These include varying levels of attention to the task and motivation to perform well, differential ability and experience, focusing on different aspects of the phenomenon of interest, the absence of objectively “correct” or “best” answers in many settings, and the evolving and dynamic nature of the phenomenon being judged (Shanteau, 2001). As a result, it is difficult to make statements about the level of reliability and consensus that one should expect to find in a particular field. In controlled experimental settings, however, many factors that might naturally degrade reliability and consensus will likely operate to a lesser extent, so we might be seeing the various types of judges “at their best” in the research results. In any event, we can get a clear sense of the relative extent of reliability and consensus across fields, and that sense can help to inform our understanding of these critical aspects of wine judging.
Section II of the paper considers the roles that reliability and consensus play in the evaluation of professional judgment. It also describes the positive relationship that exists between reliability and accuracy and between consensus and accuracy, in settings where “correct answers” are available and thus accuracy can be measured. Section III consists of two parts. The first part summarizes the results of wine studies that examine intrajudge reliability and compares the results to those of 41 studies conducted in these other six fields. The second part summarizes the results of wine studies that examine interjudge consensus and compares the results to those of 46 studies conducted in these same fields. In addition, results from several studies that examine both reliability and consensus in the same study and with the same judges are briefly presented. Section IV presents a discussion and conclusion.
II. Accuracy, Reliability, and Consensus
The type of setting addressed here is that in which one or more individuals in a professional field, who might be “experts” in varying degrees, make professional judgments concerning specialized aspects of their field, and then communicate recommendations based on those judgments to people who use them as critical inputs in their decision making. Examples abound in such fields as medicine, business, and consumer decision making. Judgments and recommendations made in these settings can be highly consequential to both those who provide them (because of their effect on reputation-building) and those who receive them (because of their effect on decisions).
Those on the receiving end seek confidence in the recommendations they receive and, therefore, are interested in the quality of the professional judgments on which those recommendations are based. Ideally, the quality of professional judgments would be revealed by their accuracy, that is, their correspondence with an objectively measured external criterion that is independent of the professional and the judgments he or she makes. In many settings, however, an independent external criterion does not exist (or will not be known for a long time), and therefore judgment accuracy cannot be evaluated. In those settings, attention naturally turns to surrogate evaluation criteria such as intrajudge reliability and interjudge consensus, criteria that are necessary but not sufficient for establishing expertise (or at least for establishing that such judgments are “good enough” for practical purposes). Instead of being three separate features of professional judgment, however, accuracy, reliability, and consensus are closely related both theoretically and empirically, as explained below.
Researchers across many fields consider reliability a more fundamental requirement for expertise than consensus. Cicchetti (2004b) and Hodgson (2008, 2009b), for example, adopt this view in the field of wine judging: “What do we expect from expert wine judges? Above all, we expect [reliability], for if a judge cannot closely replicate a decision for an identical wine served under identical circumstances, of what value is his/her recommendation?” (Hodgson, 2009b, 241). A similar view prevails in medicine. In a setting involving judgments of disease severity, Einhorn (1974, 563) states, “With regard to intrajudge reliability, it should be obvious that unless the expert can reproduce his [judgments], there is little more that can be said in defense of his expertise.” Similarly, in a setting involving the evaluation of coronary angiograms, Detre, Wright, Murphy, and Takaro (1975, 985) state, “Although high intra- and interobserver agreement does not assure that the observer is right in his judgment, it is certain that he could hardly be right if he disagrees often with himself.” Thus, intrajudge reliability is typically regarded as the most important requirement for expertise when the absence of objectively correct answers prevents a definitive determination of judgment accuracy.
It must be recognized, however, that reliability remains an important requirement for expertise even when objectively correct answers are available, and therefore judgment accuracy can be assessed, because of the positive relationship between reliability and accuracy. Theoretical work establishes that intrajudge reliability places an upper limit on the level of accuracy that can be achieved (e.g., Ghiselli, 1964; Lord and Novick, 1968). This fact is captured by Goldberg's (1970, 423) description of intrajudge reliability issues in the field of clinical psychology: “He ‘has his days’: Boredom, fatigue, illness, situational and interpersonal distractions all plague him, with the result that his repeated judgments of the exact same stimulus configuration are not identical. He is subject to all those human frailties which lower the reliability of his judgments below unity. And, if the judge's reliability is less than unity, there must be error in his judgments—error which can serve no other purpose than to attenuate his accuracy.” Thus, intrajudge reliability is a necessary requirement for expertise both when the accuracy of professional judgment cannot be assessed and when it can.
It is worth noting that test-retest reliability is not the only type of intrajudge reliability that has been studied by judgment researchers. The other principal type, often called “linear consistency,” concerns the extent to which a linear regression model estimated from the relationship between an individual's judgments and a set of underlying information items can reproduce the individual's judgments. This type of intrajudge reliability is one determinant of the ability of a linear regression model of the individual to produce accurate predictions of an external criterion. Linear-consistency and test-retest reliability are related (see Cooksey, 1996, 205–208) in that linear-consistency reliability is a function of test-retest reliability and the extent to which the individual's linear regression model captures the underlying judgment process, that is, the extent to which the individual's judgment process reflects the linearity and additivity assumptions that underlie regression (Lee and Yates, 1992). Because linear-consistency reliability confounds the effects of test-retest reliability with the effects of systematic departures from linearity and additivity, test-retest reliability is the more fundamental of the two.
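The distinction can be seen in a brief sketch, which assumes simulated data and an invented linear judgment policy purely for illustration: test-retest reliability correlates a judge's ratings across two occasions, while linear consistency correlates the ratings with the fitted values of a regression of those ratings on the underlying information items.

```python
# Illustrative contrast between test-retest reliability and linear consistency
# (simulated cues and judgments; the "policy" weights are assumed).
import numpy as np

rng = np.random.default_rng(1)
n_wines, n_cues = 40, 4

cues = rng.normal(size=(n_wines, n_cues))      # information items the judge sees
weights = np.array([0.6, 0.3, 0.1, 0.0])       # hypothetical judgment policy

judgments_t1 = cues @ weights + rng.normal(0, 0.5, n_wines)   # first occasion
judgments_t2 = cues @ weights + rng.normal(0, 0.5, n_wines)   # repeat occasion

# Test-retest reliability: correlation between the two occasions.
test_retest = np.corrcoef(judgments_t1, judgments_t2)[0, 1]

# Linear consistency: correlation between the judgments and the fitted values of
# a linear regression of the judgments on the cues (the judge's "policy model").
X = np.column_stack([np.ones(n_wines), cues])
beta, *_ = np.linalg.lstsq(X, judgments_t1, rcond=None)
linear_consistency = np.corrcoef(judgments_t1, X @ beta)[0, 1]

print(f"test-retest reliability:          {test_retest:.2f}")
print(f"linear consistency (multiple R):  {linear_consistency:.2f}")
```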
Although reliability is widely considered a more fundamental requirement of professional judgment than is consensus, as observed earlier, consensus is nevertheless extremely important. This is especially true in settings where correct answers do not exist (or will not be known within a reasonable period). Decisions must be made and actions must be taken even though the “correctness” of those decisions might never be known. Because agreement among the independent judgments of competent professionals is often an indispensable input to decisions and actions, consensus has emerged as an important criterion for evaluating judgment. As Hodgson (2008, 106) puts it in the wine context, “good judges agree with each other.”
To the extent that interjudge agreement is considered a desirable feature of wine judging, it follows that ways of increasing such agreement are likely to be of interest. Indeed, Cicchetti (2004b, 221), in his discussion of research designs and data-analytic strategies for improving blind wine tastings, says “the goal is to reduce, as much as is possible, the extent of inter-judge variability in the evaluation of any given wine.” Cicchetti goes further, however, making a bold suggestion that reducing interjudge variability should “[increase] the validity or accuracy of blind wine tasting” (221).
The idea that reducing interjudge variability (i.e., increasing consensus) will result in increased accuracy has been tested empirically by Ashton (1985) in two important business settings where correct answers exist. One setting involves sales predictions (a continuous judgment variable), in which Time, Inc., executives make quarterly predictions, over fourteen years, of the annual number of advertising pages that will be sold by Time magazine. The second setting involves predictions by independent auditors (CPAs) of whether a sample of business firms will or will not continue as “going concerns” (a dichotomous judgment variable) for the coming year. In both settings, a strong positive relationship is found between consensus and accuracy. Ashton (1985, 185) concludes: “If an individual's predictions agree strongly with those of others in a group, then that individual will tend to be among the most accurate in the group. This conclusion also holds for pairs of individuals; that is, pairs who agree better also tend to be more accurate than other pairs. Similarly, individuals and pairs that exhibit low consensus tend to be less accurate than those exhibiting high consensus.”
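The sketch below illustrates, in generic form, how such a consensus-accuracy relationship can be examined when correct answers exist. It uses simulated predictions and outcomes and is not a reconstruction of Ashton's (1985) procedure: each judge's accuracy is taken as the correlation of his or her predictions with the realized outcomes, and each judge's consensus as the mean correlation with the other judges' predictions.

```python
# Generic sketch (simulated data, not Ashton's actual procedure) of testing
# whether judges who agree more with others also tend to be more accurate.
import numpy as np

rng = np.random.default_rng(2)
n_judges, n_cases = 10, 30

truth = rng.normal(size=n_cases)                 # realized outcomes
skill = rng.uniform(0.2, 0.9, n_judges)          # hypothetical judge "signal" levels
predictions = np.array([s * truth + rng.normal(0, 1 - s, n_cases) for s in skill])

# Per-judge accuracy: correlation of predictions with the outcomes.
accuracy = np.array([np.corrcoef(p, truth)[0, 1] for p in predictions])

# Per-judge consensus: mean correlation with every other judge.
consensus = np.array([
    np.mean([np.corrcoef(predictions[j], predictions[k])[0, 1]
             for k in range(n_judges) if k != j])
    for j in range(n_judges)
])

# Across judges, higher consensus should go with higher accuracy.
print("consensus-accuracy correlation:",
      round(float(np.corrcoef(consensus, accuracy)[0, 1]), 2))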
Ashton's (1985) finding of a strong positive relation between consensus and accuracy is bolstered by the results of Detre et al. (1975), who find a strong positive relationship between consensus and reliability. In a medical setting involving evaluations of coronary angiograms, these researchers document considerable variability in both reliability and consensus and, more important for present purposes, a clear relationship between the reliability of individual judges and how often they agree with other judges.
Despite results such as those of Ashton (1985) and Detre et al. (1975), consensus is sometimes viewed as a problematic criterion for evaluating judgment. Although it is difficult to dispute the notion that a professional judge should not “disagree with himself,” it is often pointed out that even complete agreement among judges does not guarantee accuracy and that the lone dissenter among many judges could, in fact, be correct. As Einhorn (1974, 570) states, “the history of science is replete with oddballs who did not agree with anyone, yet, were proved to be correct by subsequent events.” Einhorn also observes, however, that the later discovery that the oddball was correct requires that a criterion other than consensus eventually become available, which will not be the case in many important judgment settings.
Perhaps a more troublesome aspect of consensus as a criterion for evaluating judgment is its potential dampening effect on learning: “Disagreements are often the route by which experts increase understanding of their field. By seeking out areas of disagreement between one another, experts explore the limits of their own knowledge and stretch their range of competency” (Weiss and Shanteau, 2004, 231). Thus, to the extent that agreement becomes the standard, the benefits of disagreement, alternative viewpoints, devil's advocates, and so on may be lost, and learning may suffer. These potential drawbacks notwithstanding, the practical necessity for timely decisions and actions, together with the positive relationships among consensus, reliability, and accuracy revealed by research, firmly establishes consensus as an important criterion for evaluating judgment.
III. Results
A. Judgment Reliability: Expertise Within?
Correlational studies of the intrajudge reliability of experienced wine judges have been reported by Brien, May, and Mayo (1987), Gawel and Godden (2008), Gawel, Royal, and Leske (2002), and Lawless, Liu, and Goldwyn (1997). Each study involves several judges who, in blind tastings, independently rate a number of wines and later re-rate those same wines. The researchers determine, separately for each judge, the correlation between the judge's first and second ratings. The results are summarized in Table 1, Panel A.
Table 1. Summary of Studies Investigating Judgment Reliability
Brien et al. (1987) describe the results of four studies in which either 24 or 48 different wines were tasted—and re-tasted the same day or the following day—by either six or eight experienced judges. Intrajudge reliability varies greatly, ranging from .16 to 1.00. On average, reliability is fairly high, with mean reliability across the four studies ranging from .45 to .74. Note that the mean reliabilities in Studies 2 and 5, in which the repeat tastings occurred the same day (.73 and .74), are considerably greater than those in Studies 3 and 4, in which the repeat tastings occurred one day later (.45 and .54).
Lawless et al. (1997) report a study in which four panels of judges tasted—and re-tasted less than an hour later—14 different wines. Three of the panels were experienced wine tasters (Panels CB, G, and PB), while the fourth panel was described as wine consumers (Panel C). The range of intrajudge reliability across the individual judges is −.03 to .85, while the range of mean reliability across the individuals in the four panels is .31 to .61. The consumer panel produces lower reliabilities than the three experienced panels (a mean of .31 for the former and means of .53, .61, and .42 for the latter).
A particularly interesting aspect of the Lawless et al. (1997) results is that the reliability of the mean ratings of the individuals in each panel is much higher than the mean reliability of the panel's individual members. To illustrate, consider Panel CB, which has six members. The mean reliability reported for Panel CB in Table 1 (.53) is the mean of the six judges’ individual reliability values, consistent with the notion that reliability is an intraindividual phenomenon. In addition to quantifying these six individual reliability values, Lawless et al. also calculate the mean of the six judges’ ratings of each wine, on both the initial and repeat tastings, and then determine the correlation between these mean ratings. The resulting correlation (.90) is much higher than the mean of the six judges’ individual reliability values (.53). The superiority of mean, or composite, judgments vis-à-vis those of the average individual in the composite has been demonstrated in many settings, including wine judging (Ashton, 2011).
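The reason that mean (composite) ratings are more reliable than the average individual rating is that independent idiosyncratic errors tend to cancel when ratings are averaged across judges. The sketch below shows the effect; the ratings are simulated, and the panel size and wine count are chosen only for illustration.

```python
# Sketch (simulated ratings) of why the reliability of a panel's mean rating
# exceeds the mean reliability of its individual members.
import numpy as np

rng = np.random.default_rng(3)
n_judges, n_wines = 6, 14

true_quality = rng.normal(size=n_wines)
# Each judge's rating = true quality plus independent idiosyncratic error.
t1 = true_quality + rng.normal(0, 1.0, size=(n_judges, n_wines))   # initial tasting
t2 = true_quality + rng.normal(0, 1.0, size=(n_judges, n_wines))   # repeat tasting

# Mean of the individual judges' test-retest reliabilities.
individual = np.mean([np.corrcoef(t1[j], t2[j])[0, 1] for j in range(n_judges)])

# Reliability of the panel's mean rating: average over judges first, then correlate.
composite = np.corrcoef(t1.mean(axis=0), t2.mean(axis=0))[0, 1]

print(f"mean individual reliability: {individual:.2f}")
print(f"reliability of panel mean:   {composite:.2f}")
```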
Gawel et al. (2002) report a study in which 42 experienced judges tasted a wine that had been aged in four different types of oak. Instead of rating overall quality, however, the judges rated the intensity of eight different characteristics of the wine (e.g., spice, butter, and texture). Only average reliabilities across the eight characteristics are reported. Again, there is tremendous variability across judges, with a mean reliability of .46. Gawel et al. (2002) also refer to unpublished data from 225 experienced tasters that reveal a mean intrajudge reliability of .40, although they provide no further information.
Gawel and Godden (2008) report results from tastings involving 571 experienced judges who tasted an average of 23 reds and 23 whites, with duplicates tasted two or three days later. Again, great variability across judges is evident, with mean intrajudge reliability of .45 for the reds and .35 for the whites. When the reliability of three-judge panels was evaluated, it was found to be substantially greater than the mean reliability of the individual judges—consistent with the earlier results of Lawless et al. (1997).
Finally, Hodgson (2008) reports some fascinating results from the California State Fair Wine Competitions of 2005 to 2008. Hodgson's results concern four triplicate samples that were judged by 16 panels of four judges each. Both of the repeat samples were tasted in the same tasting flight and were poured from the same bottle as the original sample. As Hodgson (2008, 106) explains: “The overriding principle was to design the experiment to maximize the probability in favor of the judges’ ability to replicate their scores.” Unlike in earlier studies, a correlational measure was not used in this study to quantify judge reliability; instead, the judges awarded medals to each wine (Gold, Silver, Bronze, or No Award), and the reported results concern the judges’ ability to replicate their own awards. The key finding is that the judges awarded the same medal only about 18 percent of the time—and this usually occurred for wines that received No Award. Moreover, in many instances a judge awarded Gold to one of the triplicates and Bronze (or No Award) to another.
Mean reliability across all the wine studies in Table 1 is .50. How does this compare to judgment reliability in other fields? In an earlier paper, I analyzed published research on the reliability of professional judgment in the fields of meteorology, medicine, clinical psychology, personnel management, business, and auditing (Ashton, 2000). Fifty studies across these six fields were identified, 41 of which measured reliability as the correlation between repeat judgments of identical stimuli by each judge. All 41 correlational studies focus on professional judges who make a series of judgments in the domain of their everyday experience (as opposed to, say, college students responding to abstract and unfamiliar tasks to fulfill a course requirement).
The meteorological studies concerned forecasts of atmospheric events such as microbursts and hail. The medical studies involved professionals such as pathologists and radiologists evaluating the severity of conditions such as gastric ulcers and Hodgkin's disease. The clinical psychology studies concerned the evaluation of traits such as intelligence and sociability. The personnel management studies concerned the evaluation of various dimensions of work-related behaviors, typically for selection or promotion purposes. The business studies concerned financial analysis and taxation. Several studies involved the professional field of auditing. Because the nature of professional judgment in auditing may be unfamiliar to readers of this journal, the Appendix provides a brief explanation of the critical importance of judgment in auditing.
Judgment reliability varied substantially across individual judges in these studies. The mean reliability that emerged in each of the six fields is reported in Table 1, Panel B. Mean reliability ranges from .91 in meteorology to .70 in clinical psychology—vis-à-vis a mean of .50 for the wine studies. (I defer until Section IV a consideration of why reliability in wine judging might reasonably be expected to be lower than in other fields.)
My earlier analysis (Ashton, 2000) identified three features of the overall body of results that may provide useful perspective in the wine context. First, reliability decreased with greater time between the original judgment and the repeat judgment, which is also seen in the Brien et al. (1987) study of wine judging. Second, group discussion among two or more individual judges had the effect of increasing reliability; a similar effect is seen in the superior reliability of the judge panels in Gawel and Godden (2008) and Lawless et al. (1997). Finally, reliability was inversely related to the difficulty of the judgment task; this, too, has its counterpart in studies of wine judging—for example, the clear tendency for reliability to be greater for wines at each end of the quality scale than for those in the middle (e.g., Hodgson, 2008).
B. Judgment Consensus: Expertise Between?
Correlational studies of the interjudge consensus of experienced wine judges have been reported by Ashton (2011), Baker and Amerine (1953), Brien et al. (1987), Cicchetti (2006a, 2006b), and Hodgson (2009a). Each study involves several judges who, in blind tastings, independently rate a number of wines. The researchers determine, for each pair of judges, the correlation between their ratings. The results are summarized in Table 2, Panel A.
Table 2. Summary of Studies Investigating Judgment Consensus
Baker and Amerine (1953; cited in Brien et al., 1987) report results from five experienced judges who evaluated 13 reds and 17 whites over multiple sessions, with four or five wines per session. The results reveal greater mean consensus for the whites (.58) than for the reds (.39). Substantial variability in consensus exists across pairs of judges, ranging from .44 to .75 for the whites and from .07 to .90 for the reds.
One of Brien et al.'s (1987) four reliability studies (described above) also examined consensus. Interjudge correlations are reported for both the first occasion on which the wines were tasted and the second occasion (later the same day). Mean consensus is lower in the repeat tasting than in the first (.37 vs. .45), and the range of consensus across judges is wider (−.40 to .84 vs. −.09 to .79).
Other evidence on the consensus of wine judgments comes from two analyses of the famous 1976 Paris tasting of California and French wines that revolutionized the wine world. Eleven experts (nine of them French) tasted ten reds (six California and four French) and ten whites (again, six California and four French). Although much has been written about who “won” the tasting (e.g., Ashenfelter and Quandt, 1999; Cicchetti, 2004a; Hulkower, 2009; Lindley, 2006; Quandt, 2006, 2007), my concern here is the extent to which the 11 judges agreed in their judgments of the wines. Cicchetti (2004a, 2006b), using the intraclass correlation coefficient as the measure of judge consensus, finds an overall consensus level of .22 for the reds and .36 for the whites. Ashton (2011), using the Pearson correlation as the measure of consensus, reports similar results: mean consensus of .16 for the reds and .44 for the whites. Both analyses report substantial variability in consensus across pairs of judges.
Hodgson (2009a) analyzed 4,167 wines that were entered in 13 major U.S. wine competitions in 2003. Several of his results speak to the degree of consensus in wine quality judgments across the competitions. First, 106 of the 375 wines that were entered in five competitions received Gold medals in one competition, but only 20 of these 106 received a second Gold medal and only six of these 20 received a third. None of the 375 received Gold medals in more than three competitions. Second, only 132 of the 3,347 wines that were entered in two or more competitions received the same medal in all competitions entered (and this almost always occurred in just two competitions). Finally, of the 2,440 wines that were entered in more than three competitions, 1,142 received at least one Gold; however, 957 of these 1,142 failed to receive any medal in at least one competition.
Hodgson (2009a) developed a correlational measure of consensus by first assigning numerical scores to the various medals and then computing correlations between the scores received by wines in each pair of competitions. With 13 competitions, there are 78 such pairwise measures. The mean correlation is .11, with a range of −.02 to .33. This clearly reflects poor consensus across the competitions, and most of the consensus that existed concerned wines awarded Bronze medals or No Awards. Hodgson (2009a, 5) concluded that “wine judges concur in what they do not like but are uncertain about what they do,” consistent with his earlier findings (Hodgson, 2008) concerning intrajudge reliability.
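A minimal sketch of this kind of approach appears below. The medal-to-score mapping and the medal data are assumptions for illustration only; Hodgson's actual scoring scheme and data are not reproduced here.

```python
# Sketch of a medal-based consensus measure across competitions: convert each
# wine's medal in each competition to a numeric score, then correlate scores
# between pairs of competitions. The 4/3/2/1 mapping and the medals below are
# illustrative assumptions, not Hodgson's data.
import numpy as np
from itertools import combinations

score = {"Gold": 4, "Silver": 3, "Bronze": 2, "No Award": 1}

# Hypothetical medals: wines (rows) entered in three competitions (columns).
medals = [
    ["Gold",     "No Award", "Bronze"],
    ["Bronze",   "Bronze",   "No Award"],
    ["No Award", "Silver",   "Gold"],
    ["Silver",   "No Award", "No Award"],
    ["Gold",     "Bronze",   "Silver"],
]
scores = np.array([[score[m] for m in row] for row in medals], dtype=float)

for a, b in combinations(range(scores.shape[1]), 2):
    r = np.corrcoef(scores[:, a], scores[:, b])[0, 1]
    print(f"competitions {a + 1} and {b + 1}: r = {r:.2f}")
```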
Lawless et al. (1997), whose intrajudge reliability results are reported in Table 1, also examined consensus. They did so, however, by focusing on the mean judgments of each of the four panels, not on the judgments of the individual members. As noted earlier, mean judgments result in inflated reliability values—and the same is true for consensus values. Lawless et al. found that the three experienced panels agreed much more with one another (correlations of .66, .75, and .77) than with the consumer panel (correlations of .33, .44, and .46).
Finally, Quandt (2006) summarizes some consensus results from 92 tastings conducted by the eight members of the Liquid Assets Wine Group. Instead of pairwise correlations among tasters, however, Quandt reports Kendall's coefficient of concordance (W), a measure of the overall concordance among the judges’ ratings. Kendall's W is statistically significant at the .05 (.10) level for 49 percent (57 percent) of the tastings, indicating that “substantial agreement existed among judges more than half the time” (Quandt, 2006, 16).
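For readers unfamiliar with the statistic, the sketch below computes Kendall's W for a single hypothetical tasting, together with the usual chi-square significance test. The rankings are invented and ties are ignored for simplicity.

```python
# Sketch of Kendall's coefficient of concordance W for one tasting: m judges
# rank n wines; W ranges from 0 (no agreement) to 1 (identical rankings).
# Assumes no tied ranks; ranks below are hypothetical.
import numpy as np
from scipy import stats

ranks = np.array([            # rows: judges, columns: wines (rank 1 = best)
    [1, 2, 3, 4, 5, 6, 7, 8],
    [2, 1, 3, 5, 4, 6, 8, 7],
    [1, 3, 2, 4, 6, 5, 7, 8],
])
m, n = ranks.shape

rank_sums = ranks.sum(axis=0)
S = np.sum((rank_sums - rank_sums.mean()) ** 2)   # spread of the rank sums
W = 12 * S / (m ** 2 * (n ** 3 - n))

# Significance test: chi-square approximation with n - 1 degrees of freedom.
chi2 = m * (n - 1) * W
p_value = stats.chi2.sf(chi2, df=n - 1)
print(f"W = {W:.2f}, p = {p_value:.3f}")
```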
Mean consensus across all the wine studies in Table 2 is .34, substantially below mean reliability across wine studies of .50. As is the case with reliability, it is of interest to compare the level of consensus found in wine judging to that found in other fields. To my knowledge, there is no comprehensive review of consensus studies comparable to Ashton's (2000) review of reliability studies. However, my recent search of the literature identified 46 studies across the same six professional fields included in Ashton (2000) that report consensus results using a correlational measure. The types of judgments examined in each field are the same as those described above for the reliability studies, with the exception of studies in business; the eight consensus studies in business settings examine a wider range of issues than do the three reliability studies (including sales predictions, actuarial judgments, and predictions of stock prices). As in the reliability studies in Ashton (2000), all the consensus studies focus on professional judges who make a series of judgments in the domain of their everyday experience.
Table 2, Panel B, reports the mean consensus that emerged in each of the six fields. Mean consensus ranges from .75 in meteorology to .37 in clinical psychology—vis-à-vis a mean of .34 for the wine studies. (I defer until Section IV a consideration of why consensus in wine judging might reasonably be expected to be lower than in other fields.) Comparing mean reliability (Table 1) and mean consensus (Table 2) within fields reveals that consensus is lower than reliability in all fields, often substantially so, indicating that judges in all fields agree more with themselves than with others.
It should be recognized, however, that the mean within-field reliability and consensus results reported in Tables 1 and 2 are not completely comparable because the set of reliability studies in Table 1 differs somewhat from the set of consensus studies in Table 2 (i.e., some studies examine only reliability, some examine only consensus, and some examine both). Fortunately, many of these studies evaluate both reliability and consensus in the same study and with the same judges, allowing a direct comparison of reliability and consensus. The results, shown in Table 3, confirm that mean reliability is substantially greater than mean consensus in all fields. The difference between mean reliability and mean consensus ranges from .12 (.89 vs. .77) in meteorology to .36 (.73 vs. .37) in clinical psychology. (This compares to a difference of .32 (.73 vs. .41) in the single wine study that examines both reliability and consensus.) Finally, examination of the mean within-study levels of reliability and consensus in the 25 non-wine studies reveals no case in which mean consensus exceeds mean reliability.
Table 3. Summary of Studies Investigating Both Reliability and Consensus in the Same Study
IV. Discussion and Conclusion
Both intrajudge reliability and interjudge consensus of experienced wine judges are found to be substantially below reliability and consensus in other fields. Quantified in correlational terms, mean reliability across published wine studies is .50 while mean consensus is .34. Moreover, reliability and consensus vary widely across studies (and across individual judges in a single study), with some judges performing well and others performing poorly. Two questions immediately arise: (1) Why are the mean levels of reliability and consensus so much lower among experienced wine judges than among judges in other fields? (2) What accounts for the great variability in reliability and consensus across wine judges?
On the first question, it is easy to imagine valid reasons that reliability and consensus in wine judging would be lower than in many other fields. At the risk of stating the obvious, foremost among them is that wine judging is inherently more subjective. Whereas professional judgment in meteorology, medicine, or business, for example, is based largely on relatively objective inputs (such as barometric pressure, x-ray results, and economic data), wine judging involves the senses of sight, smell, and taste. Thus, wine judging is not simply a matter of passively receiving some objective facts about bouquet, clarity, finish, and so forth and then weighting and combining these facts into an overall judgment of quality (which itself possesses a sizable subjective component).
The second question—concerning variability in reliability and consensus across wine judges—is more difficult. In general, however, a useful way of understanding the sources of differential performance across judges—in any field—is to focus on features of the judge, features of the judgment task, and the “interaction” between features of the judge and features of the task (Fischhoff, 1982). By interaction, I mean the extent to which there is a “match” (or a “mismatch”) between judge and task.
Considering features of the judge (and assuming that the judge is motivated to perform well), voluminous research establishes that differences in ability, experience, and knowledge result in differential judgment performance (e.g., Ashton, 1999; Einhorn and Hogarth, 1981; Schmidt and Hunter, 1992). In the wine context, differences in preferences (which may result in part from differences in experience and knowledge and in part from past emotional associations involving particular wines) must also be considered, as must biological characteristics, such as differential sensitivity to smells and tastes (e.g., Bartoshuk, 1993; Goode, 2008). Considering features of the judgment task, a multitude of factors are involved in the task of blind wine tasting that might influence the overall results, including those with respect to intrajudge reliability and interjudge consensus (e.g., Amerine and Roessler, 1983; Goldwyn and Lawless, 1991). Examples include the types of wines tasted (and the range of types, if more than one), the number of wines of each type, the order in which they are tasted, the number of tasting flights, and the time between flights.
Such features of the judge and the judgment task surely account for much of the variability in reliability and consensus found across experienced wine judges. However, acquiring a deep understanding of differential performance across wine judges is likely to be more complex than identifying isolated features of judges and judgment tasks that are relevant. The extent to which relevant features of the judge are consonant with relevant features of the task (i.e., the extent to which there is a “match” between judge and task) is likely to be important as well.
To illustrate, imagine a blind tasting of red Bordeaux and red Burgundy involving four judges. Judges 1 and 2 have an affinity for Bordeaux, but not for Burgundy. Such affinity could be the result of greater experience or knowledge, stronger emotional associations, or heightened sensitivity and discriminability with respect to the smells and tastes of Bordeaux. In contrast, Judges 3 and 4 have the opposite affinity—for Burgundy, but not for Bordeaux. I conjecture that Judges 1 and 2 will exhibit greater intrajudge reliability when they taste Bordeaux (a match between judge and task) than when they taste Burgundy (a mismatch between judge and task) and that Judges 3 and 4 will exhibit the opposite pattern of results. Similarly, I conjecture that the Judge 1/Judge 2 pair and the Judge 3/Judge 4 pair will exhibit greater interjudge consensus than will the remaining four pairwise combinations of judges. The key point in this stylized example is that differential levels of reliability and consensus will not be determined solely by features of either the judge or the task in isolation but also by the extent to which those features match one another.
The empirical validity (and practical usefulness) of the various judge and task features mentioned above, as well as the notion that the “match” between judge and task is important in understanding the performance of experienced wine judges, can only be settled by research. Existing studies on the reliability and consensus of experienced wine judges were not designed or conducted in a way that allows the sources of differential performance to be understood. I hope the results reviewed in this paper will provide a benchmark for future studies that take a systematic approach to understanding why reliability and consensus in wine judging are lower than in other fields and the sources of differential performance across experienced judges.
Appendix
Several of the reliability and consensus studies in Tables 1–3 concern judgments made in the field of auditing, a field that might be unfamiliar to readers of this journal. This Appendix provides some perspective on the critical role that professional judgment plays in auditing.
Briefly stated, auditing provides independent assurance concerning important disclosures provided by business organizations whose ownership shares are publicly held. Such organizations are required to disclose to current and potential investors and creditors substantial information about their past financial performance and current financial condition. Because this information is generated and disclosed by managers of the organization itself, who have strong incentives to portray the results favorably, and because external parties have limited access to such information via other channels, regulatory bodies in both the public and private sectors require the information to be examined by a firm of auditors, or certified public accountants (CPAs), who are independent of the reporting organization.
Auditors examine the reporting organization's financial disclosures and underlying systems and records to judge whether the disclosures are fairly presented in accordance with measurement and disclosure standards adopted by government agencies (e.g., the U.S. Securities and Exchange Commission) and the financial community more generally. Auditors collect and evaluate information that bears on this issue, and they use it as input to several component judgments that, when aggregated, suggest whether the organization's claim of fair presentation is likely to be tenable.
Audit judgments fall into two broad categories—investigation and reporting. Investigation judgments concern (1) the likelihood that errors or irregularities have occurred in the organization's processing of financial information and that the organization's own controls would have prevented or detected them, (2) the extent to which errors or irregularities that may have occurred and not been detected are important enough to require close scrutiny by the auditor, and (3) the extent to which evidence collection should be expanded in response to ongoing findings from the audit. Reporting judgments concern how best to fulfill the auditors’ obligation to report (to the public) the results of their investigation. Auditors’ most important reporting options are the standard and modified reports. A standard report provides assurance that the organization's financial disclosures are indeed fairly presented in accordance with accepted measurement and disclosure standards. A modified report, in contrast, signals that the organization's claim of fair presentation is unlikely to be tenable, and it provides an explanation of the circumstances or events that call fair presentation into question.
Auditors’ investigation and reporting judgments are made in a setting that imposes significant costs on the various parties from legal, economic, and regulatory sources. Investors, creditors, suppliers, employees, and others can be harmed if auditors fail to detect errors or irregularities or fail to provide adequate disclosure of an organization's financial problems. The organization itself can be harmed if auditors mistakenly believe that they have found errors or irregularities or report that the organization has not provided adequate disclosure when, in fact, it has. Thus, the field of auditing is ripe for the study of professional expertise.