I. Introduction
Judges confer medals, ribbons, scores, ranks, and other awards on wines entered in dozens of wine competitions each year. Diverse literature implies that those awards are the observable results of an unseen or latent mixture of judges’ consensus, idiosyncratic, and random decisions about quality or preference. Further, experiments with blind replicates in wine competitions show that the random component of judges’ decisions is material, variable, and nuanced.
The awards noted above are usually conferred using the sums of scores or the sums of ranks assigned by a small number of judges. Those methods are easy to use and communicate. Some competitions and researchers use or are examining Borda counts, Shapley values, and preference models (see also Cao and Stokes, 2017; Ginsburgh and Zhang, 2012). Regardless of which of those five methods is employed, it yields an aggregation that is based on judges’ observed ratings but ignores the latent randomness that is a foundation of those ratings. Considering that foundation would often lead to different awards. Disentangling the latent consensus, idiosyncratic, and random components of judges’ ratings can yield awards that are closer to a mode, mean, or maximum likelihood of consensus. Disentangling the latent components of judges’ ratings can also yield useful information about the judges and the wines.
Section II begins with analysis of the distribution of the ratings assigned to blind replicates. The results are then employed in Section III to show that aggregations of ratings are conditional and unlikely to yield a mode, mean, or maximum likelihood of consensus. Then, building on the literature and work of others, a model is proposed and tested in Section IV that uncovers the latent consensus, idiosyncratic, and random components of judges’ ratings. Using the Stellenbosch data published in Cicchetti (2014) as an example, the exact p-value for the null hypothesis that the model obtained a random result is <0.001. Conclusions and discussion follow in Section V.
Before moving forward, the author's anecdotal experience is that many Master Sommeliers, Masters of Wine, Wine and Spirit Education Trust (WSET) certificate holders, and other wine professionals express disdain for scoring wines and quantitative analyses of those scores. They assert that wines and tasters are too complex and too idiosyncratic for scores to convey much useful information. Nevertheless, every year, and most often judged by wine professionals, dozens of state fair, county fair, magazine, newspaper, and other wine competitions and reviews confer ribbons, medals, awards, ranks, and scores. All of those designations can be expressed as ranks or scores. This article is an effort to analyze the designations awarded to wines while keeping complexity and idiosyncrasy in sight.
II. The Probability Distribution of an Observed Rating
Hodgson (2008), Ashton (2012), Hodgson and Cao (2014), and Cicchetti (2014) show that a wine judge with near-perfect consistency, one who assigns the same rating to the same wine every time, is rare. Bodington (2017b) finds that wine judges tend to assign closer ratings to replicates than is likely due to chance alone. He also concludes that the distribution of ratings assigned to blind replicates is determined by judges’ capabilities, the mechanics of the tasting protocol, and the difference between the replicate and the other wines in the flight.
The findings summarized above are expressed by the probability mass function (PMF) in Equation (1). The probability of an observed score ($f$) for a particular replicate wine ($i$, of $W$ wines in total) for a particular judge ($j$, of $J$ judges in total) is an exponential function of the observed score ($s_{j,i}$), a modal score parameter ($\hat{s}_{j,im}$), the standard deviation of the judge's scores on all the wines in a tasting ($\sigma_j$), and a dispersion parameter ($0 \le \hat{\theta}_j \le 1$). The PMF in Equation (1) expresses a discrete, unimodal, and bounded distribution. For $\hat{\theta}_j = 0$ and $s_{j,i} = \hat{s}_{j,im}$, the probability of $s_{j,i}$ is unity. That is perfect consistency, meaning that a judge assigns the same score to the same wine every time. For $\hat{\theta}_j = 1$, the PMF describes a distribution in which the probability of every $s_{j,i}$ is the same. In that case, there is no consistency, and a judge assigns scores as if they were drawn from a uniform random distribution:
The distance ($d$) defined in Equation (1B) is the square of the standardized difference between the observed and modal scores. First, considering the distance ($s_{j,i} - \hat{s}_{j,im}$) alone can lead to a mirage of consistency. Some judges spread their scores more broadly over the allowed range than others. For example, at Stellenbosch, $\sigma_7 = 1.6$ and $\sigma_5 = 11.4$. Thus, a judge with a narrow spread on replicates can appear to be highly consistent even if all of his or her scores are also assigned randomly within a narrow range. Dividing the distance by $\sigma_j$ standardizes the difference and thus favors judges who assign scores to replicates within a narrower range than the scores that each judge assigns to all the wines. An additional benefit of standardizing is that the distance becomes unit-less, so $\hat{\theta}_j$ can be compared across tastings that have different score ranges. Finally, the constant ($C_j$) in Equation (1C) normalizes the results of the exponential function in Equation (1A) so that the sum of probabilities equals unity.
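The displayed Equations (1A)–(1C) are not reproduced in this excerpt. A plausible reconstruction from the definitions above is given below; the exponential kernel $\hat{\theta}^{\,d}$ and the summation over the allowed score range are inferred from the prose, so the published layout may differ:

```latex
% Hedged reconstruction of Equations (1A)-(1C) from the surrounding prose.
\begin{align*}
f\!\left(s_{j,i} \mid \hat{\theta}_j, \hat{s}_{j,im}\right)
  &= \frac{\hat{\theta}_j^{\,d_{j,i}}}{C_j}, \tag{1A}\\
d_{j,i} &= \left(\frac{s_{j,i} - \hat{s}_{j,im}}{\sigma_j}\right)^{\!2}, \tag{1B}\\
C_j &= \sum_{s\,=\,s_{\min}}^{s_{\max}} \hat{\theta}_j^{\,\left((s - \hat{s}_{j,im})/\sigma_j\right)^2}. \tag{1C}
\end{align*}
```

This form reproduces the limiting cases described in the text: for $\hat{\theta}_j = 0$ all probability mass collapses onto $\hat{s}_{j,im}$, and for $\hat{\theta}_j = 1$ the distribution is uniform over the allowed scores.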
A test and example of Equation (1) appear in Figure 1. Stellenbosch Judge #7 assigns scores of (83, 84, 85) to replicates of Sauvignon Blanc and $\hat{\theta}_7 = 0.13$, Judge #3 assigns (78, 82, 85) and $\hat{\theta}_3 = 0.14$, and Judge #4 assigns (65, 72, 86) and $\hat{\theta}_4 = 0.74$. The maximum likelihood estimates (MLEs) of the parameters in Equation (1) for all 15 judges appear in Table 1.
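As an illustration, the PMF can be sketched numerically. The snippet below is a minimal sketch, not the author's MATLAB code; it assumes the exponential form $f \propto \hat{\theta}^{\,d}$ described above and a hypothetical 50–100 score range (51 allowed integer scores, consistent with the $1/51$ that appears later in Section IV):

```python
import numpy as np

SCORES = np.arange(50, 101)  # assumed allowed score range: 51 integer scores

def replicate_pmf(theta, s_mode, sigma, scores=SCORES):
    """Sketch of the Equation (1) PMF: probability of each allowed score,
    given dispersion theta, modal score s_mode, and the standard deviation
    sigma of the judge's scores on all wines in the tasting."""
    d = ((scores - s_mode) / sigma) ** 2  # squared standardized distance (Eq. 1B)
    weights = theta ** d                  # exponential kernel (Eq. 1A)
    return weights / weights.sum()        # normalize so the PMF sums to one (Eq. 1C)

# Judge #7 (theta = 0.13, sigma = 1.6, modal score taken as 84) concentrates
# nearly all probability mass within a few points of the mode.
p7 = replicate_pmf(0.13, 84, 1.6)
print(p7.max())
```

With $\hat{\theta} = 1$ the same function returns a uniform distribution over the 51 allowed scores, matching the limiting case described in the text.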
III. Start Over
The results in Figure 1 show that the scores that judges assign to a wine are not identically distributed. The results in Figure 1 for Judge #4 show that one draw from the distribution of scores for that judge is unlikely to be even close to the expected value of scores. More important, the result of a function with a stochastic input is also stochastic. Sums of scores, sums of ranks, Shapley values, Borda counts, and preference-model results are therefore conditional, and they depend on one instance of stochastic ratings. Especially for the small sample sizes that are typical of wine competitions, the relationship between results for one instance of ratings and the mode, mean, or maximum likelihood of results for the potential range of instances is thus unknown. Without further investigation, little can be said about the interpretation and reliability of conditional results that are aggregated while ignoring the implications of Figure 1. The appearance of a reliable consensus within observed ratings, whether based on sums of scores, Borda, or any other metric, may be an illusion.
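The conditionality can be made concrete with a small simulation. The flight, judges, and parameter values below are hypothetical, and the PMF is the exponential form from Section II; the point is only that a single instance of stochastic scores can reverse a sum-of-scores order:

```python
import numpy as np

rng = np.random.default_rng(0)
SCORES = np.arange(50, 101)  # assumed allowed score range

def pmf(theta, s_mode, sigma):
    """Exponential PMF sketch from Section II."""
    weights = theta ** (((SCORES - s_mode) / sigma) ** 2)
    return weights / weights.sum()

# Hypothetical flight of two wines and three judges: (theta, modal score for
# wine A, modal score for wine B, sigma). Wine A has the higher modal scores,
# but the third judge is high-randomness.
judges = [
    (0.1, 85, 80, 5.0),
    (0.1, 84, 81, 5.0),
    (0.8, 86, 79, 8.0),
]

n_draws, flips = 2000, 0
for _ in range(n_draws):
    sum_a = sum(rng.choice(SCORES, p=pmf(t, a, s)) for t, a, b, s in judges)
    sum_b = sum(rng.choice(SCORES, p=pmf(t, b, s)) for t, a, b, s in judges)
    if sum_b >= sum_a:  # this instance contradicts the modal (consensus) order
        flips += 1
print(f"sum-of-scores order reversed or tied in {flips / n_draws:.0%} of draws")
```

Any one competition observes a single draw, so the awarded order can disagree with the order implied by the judges' modal scores.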
As a test and example, based on simple sums of scores, the aggregate order of the quality ratings assigned by 15 judges to the Stellenbosch Sauvignon Blanc is (8T, 5T, 6, 1, 7, 3, 4, 2T). Wine #1 is ranked fourth, wine #2T is ranked last, and wine #8T is ranked highest. “T” indicates that the wine is a member of the blind triplicate; thus, the sums of scores imply that the same wine from the same bottle ranks as both highest and lowest quality. Using the PMF in Equation (1), the expected value of a judge's score on a wine appears in Equation (2). According to the expected values of the sums of scores, the order of quality rating for the flight of Sauvignon Blanc is (6, 2T, 5T, 8T, 1, 7, 3, 4). Note that the triplicate wines now correctly group together. MATLAB code written by the author for those results is available on request. In concept, the reason for the change in order is that low-randomness judges (such as Judge #7) have more influence on differences between expected values than high-randomness judges (such as Judge #4). That effect also applies to the orders implied by sums of ranks, Shapley values, Borda counts, and preference-model results:
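Equation (2) is not reproduced in this excerpt. Under the PMF above it is presumably the usual expectation, $E[s_{j,i}] = \sum_s s \cdot f(s \mid \hat{\theta}_j, \hat{s}_{j,im})$, which can be sketched as follows (the 50–100 score range is an assumption):

```python
import numpy as np

SCORES = np.arange(50, 101)  # assumed allowed score range

def expected_score(theta, s_mode, sigma, scores=SCORES):
    """Expected value of a judge's score under the Equation (1) PMF."""
    weights = theta ** (((scores - s_mode) / sigma) ** 2)
    pmf = weights / weights.sum()
    return float(np.sum(scores * pmf))

# A low-randomness judge's expected score stays near the modal score, while a
# high-randomness judge's expected score is pulled toward the range midpoint,
# which is why low-randomness judges dominate differences in expected sums.
print(expected_score(0.13, 84, 1.6), expected_score(0.74, 84, 11.4))
```
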
Although the example above shows that the order implied by the expected values of ratings is not the same as the order implied by the observed instance of ratings, it also shows the difficulty of analyzing wines without replicates. The order of the wines without replicates (6, 1, 7, 3, 4) does not change, because, with only one rating by each judge, there are not enough data to support PMFs for those wines. Even with blind triplicates, three points are meager support for estimates of $\hat{s}_{j,im}$ and $\hat{\theta}_j$. Bodington (2015a, 2015b) addresses that difficulty by parsing observed ratings in a mixture model with nonrandom and random components, and the PMF for the random component is parameterized a priori as a uniform random distribution. Cicchetti (2017) tests the hypothesis that judges who assign consistent scores to replicates also assign scores to nonreplicates that are consistent with the consensus of scores assigned to those wines by other judges. This article aims in the next section to build on that work and to recognize that although all judges’ scores are stochastic to some extent, few judges assign scores as if drawn from a uniform random distribution, most do better, and some are as accurate as Judge #7.
IV. Disentangling Consensus, Idiosyncratic, and Random Ratings
This section moves forward in three steps. It begins with a review of consumer-choice literature and efforts to disentangle consumers’ consensus and idiosyncratic preferences. That review provides useful background and definitions of consensus and idiosyncrasy, but it also shows that often-employed utility models have limited application to wine-tasting results. Second, this section presents a review of preference models, showing that preference models have been employed to evaluate the results of taste tests since the 1970s and that such models have easy application to wine-tasting results. Building on that work and Sections II and III, the third step proposes and tests a model of judges’ aggregate consensus, idiosyncratic, and random wine ratings.
A. Consumer Choice, Consensus, and Idiosyncrasy
The literature on consumer choice and heterogeneity is wide and deep. Greene and Hensher (2010), Keane and Wasi (2013), Train (2002), and Yue et al. (2015) provide recent reviews of the methods and literature.
Many published evaluations of consumer choice employ utility theory, and many of those express utility in the general form $U_{j,i} = c_i + \beta_i A_{j,i} + \varepsilon_{j,i}$, where the utility of product $i$ to consumer $j$ ($U_{j,i}$) is a product-specific intrinsic utility ($c_i$), plus the utility due to a vector of unit values ($\beta_i$) multiplied by a vector of observable product and consumer attributes ($A_{j,i}$), plus a product- and consumer-specific idiosyncratic utility ($\varepsilon_{j,i}$). In application to wine-tasting results, research to date indicates that no covariates $A_{j,i}$ are observable and reliable predictors of judges’ scores or rank assignments. In particular, Frost and Nobel (2002) review the literature, quantify the wine knowledge and sensory expertise of 57 tasters, and then obtain those tasters’ hedonic ratings on 14 sensory properties and their preferences for 12 red wines. They conclude that, with the possible exception of preferences for vanilla/oak and against leather/sour flavors, expressions of preference “could not be modeled well from the sensory properties” (283). Rather than employing hedonic assessments of sensory properties, Cortez, Cerdeira, Almeida, Matos, and Reis (2009) and Nachev and Hogan (2013) employ 11 laboratory-determined physicochemical properties of wine and machine-learning methods to predict the mean scores assigned by experienced wine judges. For the one tasting that they both evaluate, they obtain accuracies up to approximately ±20%. Further research appears necessary to prove the broad application and to improve the accuracy of such analysis, and extensive physicochemical data are rarely available. Frost and Nobel also find that sensory expertise and wine knowledge are independently distributed and that no significant differences in preference “[are] found across groups based on performance in the wine knowledge test or overall expertise” (2002, 284).
Mantonakis, Rodero, Lesschaeve, and Hastie (2009, 1311) find that “high knowledge” wine tasters are more prone than “low knowledge” tasters to primacy and recency biases. Ashton (2014) compares the scores assigned by novices and wine professionals to wines from California and New Jersey. He finds that the results “do not support the idea that professionals and novices differ in their appreciation for New Jersey vs. California wines” (310). Bodington (2017a) shows that female and male tasters assign about the same scores and ranks to the same wines. In sum, that literature implies that no observable attributes of either wines or judges are good predictors of the rating that a judge assigns to a wine. Consequently, in application to wine-tasting results and until future research uncovers useful covariates and methods, $U_{j,i} = c_i + \beta_i A_{j,i} + \varepsilon_{j,i}$ reduces to $U_{j,i} = c_i + \varepsilon_{j,i}$.
The intercept $c_i$ is the intrinsic utility of a product and, all other things equal, reflects consumers’ consensus about preference for or the quality of the product under consideration. For $c_i > c_k$, product $i$ is preferred to or is higher quality than product $k$. The literature on idiosyncratic utility $\varepsilon_{j,i}$ offers many examples. Bayer, Ferreira, and McMillan (2003) examine a real-estate market, including the idiosyncratic utility of home $i$ to buyer $k$. Mc Breen, Goffette-Nagot, and Jensen (2009) evaluate a market for rental housing and find that “idiosyncratic tastes give some monopoly power” to landlords (p. 5). Hastings, Kane, and Staiger (2006) analyze school choice, including the “idiosyncratic preference of a student” for school $i$ (p. 11). Rhee, de Palma, and Thisse (1998) evaluate the so-called first-mover advantage and consumers’ “unobservable … idiosyncratic preferences,” and find that being a first mover is a disadvantage when consumers have sufficiently large idiosyncratic preferences (p. 15). Rajan and Sinha (2008) evaluate hypothetical product pricing in a duopoly and find an inverse relationship between price competition and the strength of “idiosyncratic” reactions to the good (p. 3). All of those authors define and model idiosyncratic preference and quality ratings as having a distribution around $c_i$. Mc Breen et al. (2009) assume that $\varepsilon_{j,i}$ has a normal distribution. Bayer et al. (2003) and Hastings et al. (2006) assume that $\varepsilon_{j,i}$ has an extreme value distribution. Rajan and Sinha (2008) assume that $\varepsilon_{j,i}$ has a double exponential distribution, and Rhee et al. (1998) treat the difference between idiosyncratic preferences for two products as a random disturbance with a logistic distribution.
Although the notions of latent consensus $c_i$ and idiosyncratic preferences $\varepsilon_{j,i}$ cited above do apply to wine-tasting results, the methodologies do not. All of those methods rely on latent utility and functions that are continuous and unbounded. In a wine tasting, judges’ scores and ranks are explicit and observable measures of utility. Judges assign scores from a bounded line, and they assign ranks from a discrete, ordered, and bounded set. When ties are not allowed, sampling is without replacement. There is no support for assuming that idiosyncratic assignments have a normal, extreme-value, double-exponential, or logistic distribution. Although examining consensus and idiosyncratic ratings is common, a different methodology is necessary for examining the results of wine tastings.
B. Rank-Preference Model Applications to Taste Tests
Wine tastings, and many other applications, involve a set of objects that can be expressed as an object vector $o = (o_A, o_B, o_C, \ldots)$. Judges consider the objects and then assign to each a rating that is an assessment of absolute quality or relative preference. Those ratings can be expressed, for each judge, as a score vector $s_j = (s_A, s_B, s_C, \ldots)$ and/or a rank vector $r_j = (r_A, r_B, r_C, \ldots)$. So-called rank-preference models are employed to examine the relationships, such as the consensus about order of quality or preference, between the vectors of judges’ ratings. In contrast to the linear-utility models summarized in Section IV.A, rank-preference models can be tailored to discrete, ordered, and bounded ratings that are assigned with or without replacement. The works of Marden (1995) and Alvo and Yu (2014) are widely cited texts concerning these models.
Rank-preference models have been applied to taste tests of breakfast foods (Green and Rao, 1972), snap beans (Plackett, 1975), crackers (Critchlow, 1980), soft drinks (Bockenholt, 1992), animal feed (Marden, 1995), cheese snacks (Vigneau, Courcoux, and Semenou, 1999), salad dressings (Vargo, 1989; Theusen, 2007), an unidentified food (Cleaver and Wedel, 2001), sushi (Chen, 2014), and, recently, wine (Bodington, 2015a, 2015b, 2017a). A generalized Mallows (1957) preference model is proposed in Equation (3) because it employs scores and is a simple variation of the exponential PMF already explained in Equation (1). In Equation (3), the probability of one judge's score vector ($f'$) is the product of the probabilities that the judge assigns each score to each wine in that vector ($f'^{\,i}$). With two important exceptions, $f'^{\,i}$ on the right-hand side is defined the same as it is in Equations (1A) through (1C). The exceptions are that the parameter $\hat{s}_{ic}$ is the judges’ consensus score for the subject wine, and the parameter $\hat{\theta}_i$ expresses dispersion about that consensus due to idiosyncratic assignments of scores,
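Equation (3) itself is not reproduced in this excerpt. From the description above, a plausible reconstruction (the exact notation is an assumption) is:

```latex
% Hedged reconstruction of Equation (3) from the surrounding prose.
f'\!\left(s_j \mid \hat{\theta}_1, \ldots, \hat{\theta}_W,
       \hat{s}_{1c}, \ldots, \hat{s}_{Wc}\right)
  = \prod_{i=1}^{W} f'^{\,i}\!\left(s_{j,i} \mid \hat{\theta}_i, \hat{s}_{ic}\right) \tag{3}
```

where each $f'^{\,i}$ has the same exponential form as Equations (1A)–(1C), with the consensus score $\hat{s}_{ic}$ in place of $\hat{s}_{j,im}$ and the idiosyncratic dispersion $\hat{\theta}_i$ in place of $\hat{\theta}_j$.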
This article has now defined two PMFs. Equation (1) is a PMF for the probability that a judge assigns a particular score to a particular wine. Assuming a vector of such scores for every judge, Equation (3) is then a PMF for the probability of one judge's score vector within the distribution of all the judges’ score vectors. Those PMFs are combined into a conditional-probability model of consensus, idiosyncratic, and random assignments below.
C. Consensus, Idiosyncrasy, and Randomness
A likelihood function that expresses the aggregate of judges’ latent consensus, idiosyncratic, and random assignments of scores appears in Equation (4). The log likelihood ($\mathcal{L}$) of the observed scores is the log sum of the probability of each judge's score on each wine, $f'^{\,i}(s_{j,i} \mid \hat{\theta}_i, \hat{s}_{ic})$, multiplied by the probability of observing that score, $f(s_{j,i} \mid \hat{\theta}_j, \hat{s}_{j,im})$. Equations (1) and (3) are combined in Equation (4) to express a conditional probability. MLEs of $\hat{s}_{ic}$ yield the judges’ consensus scores, MLEs of $\hat{\theta}_i$ yield the dispersion in judges’ scores due to idiosyncratic differences between judges, and MLEs of $\hat{\theta}_j$ yield the dispersion in each judge's scores due to individual underlying randomness:
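Equation (4) is not reproduced in this excerpt. The description above suggests the following form (a reconstruction, not the published equation):

```latex
% Hedged reconstruction of Equation (4): log sum of the conditional
% probabilities described in the text.
\mathcal{L} = \sum_{j=1}^{J} \sum_{i=1}^{W}
  \ln\!\left[\, f'^{\,i}\!\left(s_{j,i} \mid \hat{\theta}_i, \hat{s}_{ic}\right)
  \cdot f\!\left(s_{j,i} \mid \hat{\theta}_j, \hat{s}_{j,im}\right) \right] \tag{4}
```
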
Section III concludes by noting that, with only one rating from each judge, there are not enough data to support estimates of the parameters in $f(s_{j,i} \mid \hat{\theta}_j, \hat{s}_{j,im})$ for wines without replicates. Even when there are blind triplicates, again, three observations provide meager support for estimates of two parameters. The solution proposed below employs all of a judge's scores to estimate $\hat{\theta}_j$.
Cicchetti (2017) hypothesizes that “those wine tasters who agreed reliably with their own previous evaluations of the same wine would also agree reliably with other tasters. Conversely, those tasters who disagreed with their previous evaluations of the same wines would also disagree substantially with the evaluations of other tasters.” Here, that idea is restated as the hypothesis that $\hat{\theta}_j$, the underlying randomness in a judge's ratings, is positively correlated with the difference between a judge's observed scores and the all-judge-aggregate consensus scores $(s_{j,i} - \hat{s}_{ic})$ on all wines. The PMF $f(s_{j,i} \mid \hat{\theta}_j, \hat{s}_{j,im})$ in Equation (4) is thus restated here as $f(s_{j,i} \mid \hat{\theta}_j, \hat{s}_{ic})$. This approach uses all the data, preserves degrees of freedom, and yields an estimate of $\hat{\theta}_j$ for each judge.
Results for the Sauvignon Blanc data appear in Table 2. Focusing on Judge #7, who has the narrowest distribution of scores in Figure 1, Table 2A shows that $\hat{\theta}_7 = 0.31$. That finding is consistent with the stand-alone analysis of triplicates in Section II: Judge #7 is among the most accurate judges. Similar findings apply to Judge #4. Judge #4 has the broadest distribution of scores in Figure 1, $\hat{\theta}_4 = 0.77$ in Table 2A, and Judge #4 is among the less-accurate judges. The order of quality implied by the consensus scores $\hat{s}_{ic}$ in Table 2B is (7, 8T, 5T, 6, 2T, 3, 4, 1). Note that the triplicates nearly group together even though no information in Equation (4) identifies them as the same wine.
If judges assign ratings as if they are drawn from a uniform random distribution, the asymptotic log likelihood according to Equation (4) is $\mathcal{L} = J \cdot W \cdot \ln(1/S \cdot 1/S)$, and $15 \cdot 8 \cdot \ln(1/51 \cdot 1/51) = -943.6$. MLEs of the parameters shown in Table 2 for the Sauvignon Blanc data yield $\mathcal{L} = -803.5$. A chi-square test of the likelihood-ratio test statistic for the null hypothesis that those two likelihoods are the same has a p-value < 0.001. However, Pearson (1900, 166) recommends using what is now called an exact distribution if the chi-square distribution “is a bad fit” to the exact distribution. That is a risk with the small sample sizes that are typical of wine tastings. An exact random distribution of $\mathcal{L}$ is calculated using 1,000 sets of scores drawn from a uniform random distribution. Using that distribution, the exact p-value for a test of the null hypothesis that the findings above are a random result is also < 0.001.
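The asymptotic benchmark is straightforward to verify. A short check, using the values stated in the text (15 judges, 8 wines, and an assumed 51 allowed scores):

```python
import math

# Log likelihood under Equation (4) if every judge scored every wine
# uniformly at random: each of the two PMFs assigns probability 1/S
# to every score, so each of the J*W terms contributes ln(1/S * 1/S).
J, W, S = 15, 8, 51
L_random = J * W * math.log((1 / S) * (1 / S))
print(round(L_random, 1))  # -943.6, matching the value reported in the text
```
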
Finally, are wine judges consistent in their inconsistency? Cicchetti (2017) concludes that replicates are “moderately confirmative” predictors of consistency in nonreplicate scores for one flight but “minimally confirmative” for another. That question is addressed here by comparing estimates of dispersion in replicates alone, using Equation (1) and $\hat{\theta}_j$ in Table 1, to estimates of aggregate dispersion due to randomness, using Equation (4) and $\hat{\theta}_j$ in Table 2B. A scatterplot of the result appears in Figure 2. The slope of a least-squares line through the scatter is 0.34, but the fit is weak: $R^2 = 0.29$ and the correlation coefficient is 0.54. At minimum, in agreement with Cicchetti, the results show that dispersion in a judge's scores on replicates may not have robust implications about the consistency of scores on other wines.
V. Conclusion and Discussion
Judges confer medals, ribbons, scores, and other awards on wines entered in dozens of wine competitions each year. Section II shows that those ratings are usually more accurate than entirely random, yet still stochastic. Section III shows that sums of scores, sums of ranks, Borda count, Shapley value, and preference-model results are conditional results. Using the notion of a conditional probability, a model is proposed and tested in Section IV that yields information about judges’ latent consensus, idiosyncratic, and random expressions of quality or preference. Using data for a tasting of eight Sauvignon Blanc wines that contain a blind triplicate, the conditional-probability model detects the similarity between the triplicates, and the model results also show that the scores that a judge assigns to replicates may not be a robust guide to the accuracy of the scores that the judge assigns to other wines.
These findings are based on one model and one set of data. Tests of other models and tests using other data appear worthwhile. Other models could have different PMFs. In particular, methods of estimating PMFs that express the stochastic nature of the scores that judges assign need to be improved. The model proposed above applies to scores assigned with replacement, but another model could be developed for application to ranks assigned without replacement. Tests using other data would illuminate the general applicability and usefulness of the proposed and other models. The results may lead to more robust methods of assigning awards to entries in wine competitions and to better methods of assessing the capabilities of wine judges.