I. Introduction
Rank preference and mixture models have been applied to taste tests of snap beans (Plackett, 1975), cheese snacks (Vigneau et al., 1999), crackers (Critchlow, 1980), salad dressings (Theusen, 2007), soft drinks (Bockenholt, 1992), sushi (Chen, 2014), animal feed (Marden, 1995), an unidentified food (Cleaver and Wedel, 2001), and, recently, wine. Regarding wine, Bodington (2012) posited that observed wine-tasting results may have a mixture distribution with random, common-preference, and idiosyncratic-preference mixture components. Cao (2014) applied a mixture model with random-ranking and consensus-ranking components to the results of the 2009 California State Fair Commercial Wine Competition. Bodington (2015) applied a mixture of rank-preference models to the ranks assigned by experienced tasters during a blind tasting of Pinot Gris, and the results implied that common-preference agreement among tasters exceeded the random expectation of illusory agreement.
This article seeks to test and broaden the application of rank-preference and mixture models to wine-tasting results expressed as numerical scores rather than ranks, and to results that include ties between scores. The 1976 Judgment of Paris (Paris) and the 2012 Judgment of Princeton (Princeton; together, the Judgments) are well-known and much-analyzed wine tastings that provide data and context for a replicable test of a mixture of rank-preference models modified to handle scores, with ties, that are converted to ranks. The judges' scores in Paris and Princeton had many ties.
The mixture of Plackett-Luce models applied to ranked data for a tasting of Pinot Gris in Bodington (2015) is summarized in Section II. Transforming numerical scores into ranks and choice axioms are discussed in Section III, and an addition to the mixture model to handle ties between numerical scores appears in Section IV. The resulting model satisfies the Luce and independence from irrelevant alternatives (IIA) choice axioms, it considers ties between scores, and it can be applied to the small sample sizes associated with most tastings. Next, in Section V, the model is tested on hypothetical data, on the Pinot Gris tasting results, and with a Monte Carlo simulation. In Section VI, the mixture model is then applied to judges' scores for Paris and Princeton. The mixture model results yield estimates of potential Type I error, the proportion of tasters' scores that appear to be assigned randomly, and a preference order based on nonrandom preferences that tasters hold in common. That preference order is highly but not perfectly correlated with the implications of rank-sum methods applied to the Judgments, and, unlike rank-sum methods, it also complies with the Luce choice axiom and IIA. Conclusions follow in Section VII.
II. Mixture Model for Ranked Wines
A mixture distribution is the result of combining the distributions of two or more random variables. The distribution of the mixture is observable, while the underlying component distributions may be unobservable or latent. A mixture model is a mathematical expression of the latent distributions and their observable combination. See McLachlan and Peel (2000), Mengersen et al. (2011), and the references therein. As a starting point, notation and a mixture model for ranked wine-tasting results are summarized below.
The names of wines (each name $w$ with a total of $W$ wines) assessed by a taster (each taster $t$ with a total of $T$ tasters) are listed in an object vector ${\bf o} = \left( {o_1, \; o_2, \; o_3, \; \ldots, \; o_W} \right)$, and the respective scores assigned to the wines by each taster are listed in a score vector ${\bi x}_t = \left( {x_{t,1}, \; x_{t,2}, \; x_{t,3}, \; \ldots, \; x_{t,W}} \right)$. When those scores are assigned a relative rank, or when a taster assigns ranks rather than scores, the result is a rank vector ${\bi r}_t = \left( {r_{t,1}, \; r_{t,2}, \; r_{t,3}, \; \ldots, \; r_{t,W}} \right)$. Arranging the objects from most-preferred to least-preferred yields an order vector ${\bi y}_t = \left( {y_{t,1}, \; y_{t,2}, \; y_{t,3}, \; \ldots, \; y_{t,W}} \right)$. Following Marden (1995) and others, ties between scores are indicated by their average rank and an overline over the tied objects. Adapting notation from Kidwell et al. (2008), ${\squf}$ symbolizes a one-unit interval between observed scores in a complete order vector ${\bi y}_t^c$. For example, assuming an object vector with four wines ${\bf o} = \left( {A,B,C,D} \right)$ and that a taster assigns scores ${\bi x}_t = (10, 15, 6, 10)$, the rank vector is ${\bi r}_t = (2.5, 1, 4, 2.5)$, the order vector is ${\bi y}_t = \left( {B\overline {AD} C} \right)$, and the complete order vector is ${\bi y}_t^c = \left( {B{\squf} \; {\squf} \; {\squf} \; {\squf} \; \overline {AD} {\squf} \; {\squf} \; {\squf} \; C} \right)$.
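The score-to-rank transformation in this example can be sketched as follows. This is a minimal Python illustration, not the author's code; the helper name `scores_to_ranks` is hypothetical.

```python
def scores_to_ranks(scores):
    """Convert numerical scores to ranks (1 = most preferred);
    tied scores receive their average rank, as in Marden (1995)."""
    # Indices sorted from highest score (best) to lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Find the run of tied scores starting at this position.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Average of the 1-based positions covered by the tie.
        avg_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks

# x_t = (10, 15, 6, 10) for wines (A, B, C, D):
print(scores_to_ranks([10, 15, 6, 10]))  # → [2.5, 1.0, 4.0, 2.5]
```

The output reproduces the rank vector ${\bi r}_t = (2.5, 1, 4, 2.5)$ from the example above.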
Cao (2014) employed a mixture model with two latent classes of taster: those who appear to assign ranks randomly and those who appear to have consensus. Bodington (2015) analyzed two similar classes, employed a Plackett-Luce probability mass function (PMF, $f_t\left( {{\bi y}_t \vert {\bi \rho}} \right)$) for the latent common-preference class of tasters, employed the mixture model in Equations (1) and (2), and tested that model on a blind tasting of Pinot Gris. The Plackett-Luce PMF appears in Equation (1), where $\rho_i$ is the probability that wine $i$ is selected as most preferred. See also Luce (1977) and Marden (1995). Next, a mixture model with two classes of taster, expressing the probability of a taster's observed order vector ($f_t^{\prime} \left( {{\bi y}_t {\vert}{\hat{\bi \rho}}, {\hat{\bi \pi}}} \right)$), appears in Equation (2). In Equation (2), the probability of a taster's order vector ${\bi y}_t$ equals the probability that a taster assigns ranks randomly ($\pi_r$) times the random-ranking PMF, plus the probability that a taster assigns ranks in accordance with common preferences ($\pi_p$) times the Plackett-Luce PMF. The $\pi$ are known as mixture component weights or mixing proportions, and a hat ($\hat{\ }$) indicates that the value of a parameter in the mixture model must be estimated.
$$f_t\left( {{\bi y}_t \vert {\bi \rho}} \right) = \prod_{i=1}^{W-1} \frac{\rho_{y_{t,i}}}{\sum_{j=i}^{W} \rho_{y_{t,j}}} \tag{1}$$

$$f_t^{\prime}\left( {{\bi y}_t \vert {\hat{\bi \rho}}, {\hat{\bi \pi}}} \right) = \hat{\pi}_r \, \frac{1}{W!} + \hat{\pi}_p \, f_t\left( {{\bi y}_t \vert {\hat{\bi \rho}}} \right) \tag{2}$$

$$\hat{\pi}_r + \hat{\pi}_p = 1, \qquad \sum_{i=1}^{W} \hat{\rho}_i = 1$$
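Equations (1) and (2) can be evaluated numerically as follows. This is a minimal Python sketch with hypothetical function names, not the author's implementation.

```python
from math import factorial

def plackett_luce_pmf(order, rho):
    """Eq. (1): Plackett-Luce probability of an order vector.
    order: wine indices from most- to least-preferred; rho: support probabilities."""
    p = 1.0
    remaining = list(order)
    for _ in range(len(order) - 1):
        # Probability the most-preferred remaining wine is chosen next.
        p *= rho[remaining[0]] / sum(rho[j] for j in remaining)
        remaining.pop(0)
    return p

def mixture_pmf(order, rho, pi_random, pi_pref):
    """Eq. (2): two-class mixture of a uniform random-ranking PMF (1/W!)
    and the Plackett-Luce common-preference PMF."""
    W = len(order)
    return pi_random * (1.0 / factorial(W)) + pi_pref * plackett_luce_pmf(order, rho)

# Two wines with rho = (0.5, 0.5): every order has probability 1/2 in both classes.
print(round(mixture_pmf([0, 1], [0.5, 0.5], 0.3, 0.7), 6))  # 0.5
```

With equal support probabilities, both mixture components assign the same probability, so the mixture weight has no effect in this degenerate case.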
Several simple examples of mixture model results are worked by hand in Appendix A. As a check, those results also match the results of Equations (1) and (2). The examples in Appendix A involve two wines and two to four tasters. Examples with more wines and tasters become intractable to solve by hand. For three wines and three tasters, the number of order-vector combinations is $(3!)^3 = 216$. For ten wines and nine tasters, as there were in Paris and Princeton, the number of combinations is $(10!)^9 = 1.09 \times 10^{59}$.
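The combination counts above are easy to verify. A small Python check (the helper name is hypothetical):

```python
from math import factorial

def order_vector_combinations(wines, tasters):
    """Number of joint order-vector combinations across tasters: (W!)^T."""
    return factorial(wines) ** tasters

print(order_vector_combinations(3, 3))   # 216
print(order_vector_combinations(10, 9))  # a 60-digit integer, about 1.09e59
```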
Appendix A also provides examples of an important qualification regarding Bodington (2015) and this article. What are labeled here for convenience as assignments that appear to be random may actually themselves be a mixture of random assignments and idiosyncratic assignments that fit together to yield the characteristic flat shape of a uniform random distribution. As noted in the Conclusion, separating and identifying random and idiosyncratic assignments is work for the future. Neither of those constitutes the common-preference assignments that are the focus of wine makers or the determinants of a tasting group's aggregate preference order.
Note again that the mixture model above applies to a tasting in which the protocol did not allow ties and tasters assigned ranks ${\bi r}_t$ rather than scores ${\bi x}_t$. Section III addresses transforming scores into ranks, and an addition in Section IV enables the model to handle ties between scores.
III. Transforming Scores into Ranks and Transitivity
Transforming numerical scores into ranks by sequentially ordering the scores is a customary practice. In application to the Judgments, Ashenfelter and Quandt (1999), Ginsburgh and Zang (2012), Quandt (2006, 2012), and Ward (2012) employed that transformation. Two aspects of that transformation and transitivity are discussed below.
First, although the transformation above does preserve transitivity, it may lose information. Continuing the example ${\bi x}_t = (10, 15, 6, 10)$ from Section II, the ranking transformation yields ${\bi y}_t = \left( {B\overline {AD} C} \right)$ but loses the extra information in ${\bi y}_t^c = \left( {B{\squf} \; {\squf} \; {\squf} \; {\squf} \; \overline {AD} {\squf} \; {\squf} \; {\squf} \; C} \right)$. Not one judge in Paris or Princeton ranked her or his wines with ten consecutive scores. Intervals between scores of two and three were common, and most judges scored at least one wine next to an interval of four points. For example, in Paris, Pierre Brejoux's complete order vector was ${\bi y}_{PB}^c = \left( {CB{\squf} F{\squf} IAD{\squf} G{\squf} HE{\squf} {\squf} {\squf} {\squf} J} \right)$.
Are the intervals between those scores random? Are they artifacts of flawed experimental design? Or, are the intervals information about the strengths of relative preferences?
Turning to the question of randomness, the skewness ($\gamma_1$) of tasters' scores in the Judgments shows that the intervals between scores do not appear to be random. Only two judges had symmetric distributions of scores with $\gamma_1 = 0$. Thirteen of the judges skewed right with $\gamma_1 > 0$, and the other 21 skewed left with $\gamma_1 < 0$. If that asymmetry in scores were caused by random intervals ${\squf}$, then the expectation of skewness would be zero: $E(\gamma_1) = 0$. The mean skewness for the $9 \times 4 = 36$ observations in the Judgments was $-0.17$, and the variance in that skewness was 0.27. The Student's t-statistic for the hypothesis that $E(\gamma_1) = 0$ is $\left( {0 - \left( { - 0.17} \right)} \right)/\left( {\sqrt {0.27} / \sqrt {36}} \right) = 1.96$. That t-statistic has a one-tailed p-value of approximately 0.03. On that basis, the null hypothesis that the intervals ${\squf}$ are random would be rejected for a large sample size but is marginal for a sample with 36 observations.
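The t-statistic above can be checked by direct arithmetic. A brief Python sketch of the calculation reported in the text:

```python
from math import sqrt

# Reported values: mean skewness -0.17, variance 0.27, n = 36 observations.
mean_skew, var_skew, n = -0.17, 0.27, 36

# Student's t for H0: E(gamma_1) = 0, using the standard error sqrt(var)/sqrt(n).
t_stat = (0 - mean_skew) / (sqrt(var_skew) / sqrt(n))
print(round(t_stat, 2))  # 1.96
```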
Next, flawed design can induce bias in any experiment. See Ashton (2014), Filipello (1955, 1956, 1957), Filipello and Berg (1958), Mantonakis et al. (2009), and Masson and Aurier (2015) for bias induced in wine-tasting results by the number of wines to be judged, the tasting protocol, and tasters' expectations. See Cicchetti (2014) for a comparison of ranking and scoring the same wine. Another aspect of experimental design is the general guidance sometimes given to tasters about which characteristics in wine warrant certain score totals. For example, the University of California at Davis and Jancis Robinson employ 20-point scales that assign levels of quality to point ranges. When judges score using a numerical point scale, do they assign scores independently according to each wine's quality? Or is the scale merely a bound on judges' assessments of relative preference? Tasting experience implies that some judges assign a score to one wine according to a general zone of quality and then score the remaining wines “around” that anchor. Is the left skew found above an artifact of judges' expectations that the anchor score is 15 to 17 out of 20, leaving more room for lower scores than for higher scores? This author is not aware of research that tests the reliability of score scales, how they may actually be used, and whether they induce any bias. See some of those issues discussed in Quandt (2012, p. 153) and Ward (2012, p. 159). On that basis, the notion that the experimental design of a tasting may induce intervals ${\squf}$ and bias in scores cannot be dismissed.
The possibility remains that differential intervals between numerical scores are information about the strengths of relative preferences. The probabilities in Plackett-Luce have been modified (from $\rho_i$ to $\rho'_{t,i}$) by others to be functions of, or to make inferences about, additional information. In an application to betting on horse races, Benter (1994) added an exponent to capture information about the observed chance that a long-shot horse could win. In an application to election results, Gormley and Murphy (2007) made $\rho'_{t,i}$ an exponential function of each voter's notional distance from a candidate on various issues. In another analysis of election results, Gormley and Murphy (2008) made $\rho'_{t,i}$ a logistic function of the observed characteristics of voters. Applying those ideas to wine-tasting results seems perilous. Wine-tasting sample sizes are small, and there are few objectively measurable covariates. Moreover, disentangling strengths in relative preference from bias induced by experimental design is likely to be difficult to do with statistical significance, and it risks confusing the search for preferences that wine judges hold in common.
In sum, transforming each judge's scores into ranks by merely ordering the scores does preserve the transitivity of preference for each judge. While that transformation may also lead to a loss of information, the value of that lost information appears to be both speculative and intractable. Further analysis of that hypothesis is left as work for the future.
The second aspect of transitivity to be addressed here concerns comparing the ranks assigned by different judges to measure their aggregate, group, or social preference. Arrow's (1963) impossibility theorem concerning social choice and Luce's (1977) choice axiom both have implications, based on transitivity, for aggregating wine tasters' scores or ranks into measures of which wine or country “won,” a preference order, and which wine or country “lost.”
Arrow's theorem is considered here first. Ashenfelter and Quandt (1999), Ginsburgh and Zang (2012), Hulkower (2009), Quandt (2006, 2012), and Ward (2012) all compared the wines in the Judgments using various sums of judges' ranks (hereafter referred to as rank-sums methods). Quandt (2012, p. 153) explains that rank-sums methods violate a rule of choice logic sometimes known as independence from irrelevant alternatives (again, IIA). Quandt provides an example showing that excluding wine G from the white wines tasted in Princeton yields a different aggregate preference order for those that remain; ${\bf y} = \left( {ADGBEIHFJC} \right)$ changes to ${\bf y} = \left( {\overline {AD} BEIHFJC} \right)$. A simpler example involves just two tasters. For ${\bf o} = (A, B)$, ${\bi r}_1 = (1, 2)$, and ${\bi r}_2 = (2, 1)$, the rank-sums aggregate preference order is the tie ${\bf y} = \left( {\overline {AB}} \right)$. Adding a wine C, for ${\bi r}_1 = (1, 2, 3)$ and ${\bi r}_2 = (3, 1, 2)$, the rank sums are 4 for A, 3 for B, and 5 for C, so the aggregate preference order is ${\bf y} = \left( {BAC} \right)$. Adding wine C thus changed the aggregate preference for A and B from indifference to preference. That example fails transitivity and demonstrates Arrow's general possibility theorem: there is no method of combining ranked individual expressions of preference into an aggregate that does not have logical inconsistencies. Restated without the double negative, every method of aggregating expressions of individual ranked preference into a measure of social preference has logical flaws (see also Marden, 1995, p. 134). IIA is one of the four criteria for Arrow's theorem, and it requires that aggregate social preference for an option A over an option B be independent of, and not changed by, individuals' preferences for A and B compared to an option C.
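The rank-sums example above can be reproduced directly. A minimal Python sketch; the helper name `rank_sum_order` is hypothetical.

```python
def rank_sum_order(rank_vectors):
    """Aggregate preference by summing each wine's ranks across tasters;
    a lower sum means more preferred. Returns (order of wine indices, sums)."""
    W = len(rank_vectors[0])
    sums = [sum(r[i] for r in rank_vectors) for i in range(W)]
    return sorted(range(W), key=lambda i: sums[i]), sums

# Two wines A and B: r_1 = (1, 2), r_2 = (2, 1).
tie_case = rank_sum_order([(1, 2), (2, 1)])
# Add wine C: r_1 = (1, 2, 3), r_2 = (3, 1, 2).
three_case = rank_sum_order([(1, 2, 3), (3, 1, 2)])

print(tie_case[1])    # [3, 3]  -> A and B tie
print(three_case[1])  # [4, 3, 5] -> the A-B tie is broken by adding C
```

The "irrelevant" wine C changes the aggregate relation between A and B, which is exactly the IIA violation described in the text.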
Luce examined choice and IIA from a probabilistic perspective. See Luce (1977) for a formal statement and discussion of the Luce choice axiom (LCA). Applying the LCA to the simple example above, consider an urn containing equal numbers of balls marked A and B. Relative preference is the relative likelihood of drawing either A or B. With equal numbers of A and B balls, $\rho_A = \rho_B$, and the preference order is thus the same tie as above, ${\bf y} = \left( {\overline {AB}} \right)$. Now, add just one ball marked C to the urn. In that case, $\rho_A = \rho_B > \rho_C$, thus ${\bf y} = \left( {\overline {AB} C} \right)$. Next, add many more balls marked C such that $\rho_C > \rho_A = \rho_B$, and now ${\bf y} = \left( {C\overline {AB}} \right)$. In both cases, transitivity is preserved, and IIA is obtained. In the formal statement of the LCA, the aggregate relative preference for two objects depends on the ratio of their probabilities alone. In Luce (1977, p. 216), this is known as the constant ratio rule. The Plackett-Luce PMF employed here in Equation (1) is consistent with the LCA (see Marden, 1995, p. 134; Plackett, 1975).
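The constant ratio rule can be illustrated numerically. A small Python sketch with assumed urn counts (10 A balls, 10 B balls, 1 C ball; the counts and the helper name are illustrative assumptions):

```python
def choice_prob(weights, item, choice_set):
    """Luce choice rule: P(item | set) = w_item / sum of weights over the set."""
    return weights[item] / sum(weights[k] for k in choice_set)

# Assumed urn: equal numbers of A and B balls, then one C ball added.
w = {"A": 10, "B": 10, "C": 1}

# Constant ratio rule: P(A)/P(B) is unchanged whether or not C is in the set.
r_without = choice_prob(w, "A", {"A", "B"}) / choice_prob(w, "B", {"A", "B"})
r_with = choice_prob(w, "A", {"A", "B", "C"}) / choice_prob(w, "B", {"A", "B", "C"})
print(r_without, r_with)  # both 1.0
```

Adding C rescales both probabilities by the same factor, so their ratio, and hence the aggregate relation between A and B, is unaffected.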
IV. Tied Scores
Ties are common in wine tastings. Every participant in Paris and Princeton assigned the same numerical score to at least two wines. Some of the judges assigned the same score to three or four wines. A taster may assign the same numerical score to two or more wines when he or she cannot distinguish between the wines or decides that, all things considered, two or more wines deserve the same score. Quoting Quandt (2006, p. 9), “the option of using tied ranks enables tasters to avoid hard choices.” An interpretation of ties from Kidwell et al. (2008) is that a taster needs more time or information than is available to actually differentiate between the tied wines.
Ties are often evaluated as the mean or expectation of the like objects. Spearman's rank correlation coefficient is calculated using the mean rank of tied objects. Kendall's rank correlation coefficient, the Mann-Whitney U test, and the Wilcoxon rank-sum test also employ the mean rank of tied objects. In a rank-preference model, it is convenient to treat the probability of an order vector containing ties as the expectation of the probabilities of the order vectors that are the permutations of the tie (each permutation $m$ is ${\bi y}_{t,m}$, with a total of $tp$ tie permutations). For example, the probability of ${\bi y}_t = \left( {B\overline {AD} C} \right)$ is the mean of the probabilities of ${\bi y}_{t,1} = (BADC)$ and ${\bi y}_{t,2} = (BDAC)$. See Critchlow (1980, p. 73), Kidwell et al. (2008, p. 1356), and Marden (1995, pp. 261, 269). The general form of that expectation appears in Equation (3).
$$f_t^{\prime\prime}\left( {{\bi y}_t \vert {\hat{\bi \rho}}, {\hat{\bi \pi}}} \right) = \frac{1}{tp} \sum_{m=1}^{tp} f_t^{\prime}\left( {{\bi y}_{t,m} \vert {\hat{\bi \rho}}, {\hat{\bi \pi}}} \right) \tag{3}$$
Although Equation (3) appears straightforward, it may not be easy in practice. In addition to the four wines that Odette Kahn scored in Paris as 12, she also assigned a score of 2 to two other wines. The number of her tie permutations tp is thus 4! × 2! = 48. In Princeton, Daniele Muelders assigned a score of 12 to four wines and a score of 15 to another four wines. The total of Muelders's tie permutations tp is 4! × 4! = 576.
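Equation (3) and the tie-permutation counts above can be sketched as follows. The helpers `tie_permutations` and `tie_expectation` are hypothetical, and the probability function passed in is a toy uniform PMF rather than the fitted mixture.

```python
from itertools import permutations
from math import factorial

def tie_permutations(tied_groups):
    """Expand an order vector with ties (groups listed from most- to
    least-preferred) into all tp strict order vectors."""
    orders = [[]]
    for group in tied_groups:
        orders = [prefix + list(p) for prefix in orders for p in permutations(group)]
    return orders

def tie_expectation(prob_fn, tied_groups):
    """Eq. (3): mean of the probabilities of the tie permutations."""
    orders = tie_permutations(tied_groups)
    return sum(prob_fn(o) for o in orders) / len(orders)

# y_t = (B, A~D, C) expands to (BADC) and (BDAC); tp = 2.
print(len(tie_permutations([["B"], ["A", "D"], ["C"]])))  # 2
# Odette Kahn's ties in Paris: a group of four and a group of two scores.
print(factorial(4) * factorial(2))  # 48
# Daniele Muelders's ties in Princeton: two groups of four scores.
print(factorial(4) * factorial(4))  # 576
# With a toy uniform PMF over the 4! strict orders, the expectation is 1/24.
print(round(tie_expectation(lambda o: 1 / 24, [["B"], ["A", "D"], ["C"]]), 6))  # 0.041667
```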
Specific tests of the mixture model in Section V, including ties under Equation (3), show that the model is an accurate predictor of both observed rank densities and the mixture weights for random-behavior and common-preference score assignments.
V. Estimates, Tests, and Type I Error
The expectation maximization (EM) algorithm is a widely employed method of estimating the unknown parameters in mixture models. See Dempster et al. (1977), McLachlan and Peel (2000), Mengersen et al. (2011), and the references therein. In sum, EM iterates to climb a likelihood function. MATLAB code written by the author for the EM algorithm employed here, including an integrated maximum likelihood estimator (MLE), is available on request. Several tests of the mixture model in Equations (1) through (3), solved using the EM algorithm, are summarized below.
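The author's MATLAB implementation is not reproduced here; the following is a minimal Python sketch of EM for the two-class mixture in Equation (2). The M-step for the Plackett-Luce weights uses one Hunter-style minorization-maximization update per EM iteration, which is a standard choice but an assumption here, not necessarily the author's method.

```python
from math import factorial

def plackett_luce(order, rho):
    """Plackett-Luce probability of a strict order (most- to least-preferred)."""
    p, rem = 1.0, list(order)
    while len(rem) > 1:
        p *= rho[rem[0]] / sum(rho[j] for j in rem)
        rem.pop(0)
    return p

def em_mixture(orders, n_wines, iters=200):
    """EM for the mixture of a uniform random-ranking class (weight pi_r)
    and a Plackett-Luce common-preference class. Returns (pi_r, rho)."""
    W = n_wines
    pi_r, rho = 0.5, [1.0 / W] * W
    uniform = 1.0 / factorial(W)
    for _ in range(iters):
        # E-step: responsibility that each order came from the random class.
        resp = []
        for y in orders:
            a = pi_r * uniform
            b = (1.0 - pi_r) * plackett_luce(y, rho)
            resp.append(a / (a + b))
        # M-step: update the mixing proportion...
        pi_r = sum(resp) / len(orders)
        # ...and take one MM step for rho, weighted by 1 - responsibility.
        wins, denom = [0.0] * W, [0.0] * W
        for y, r in zip(orders, resp):
            w, rem = 1.0 - r, list(y)
            while len(rem) > 1:
                wins[rem[0]] += w
                stage = sum(rho[k] for k in rem)
                for k in rem:
                    denom[k] += w / stage
                rem.pop(0)
        rho = [wins[i] / denom[i] if denom[i] > 0 else rho[i] for i in range(W)]
        total = sum(rho)
        rho = [r / total for r in rho]
    return pi_r, rho

# Toy data: 12 identical rankings of three wines plus all 3! permutations once.
from itertools import permutations
orders = [(0, 1, 2)] * 12 + [list(p) for p in permutations(range(3))]
pi_r_hat, rho_hat = em_mixture(orders, 3)
print(round(pi_r_hat, 2), [round(r, 2) for r in rho_hat])
```

On this toy data the estimated random-class weight lands near the share of scattered rankings, and the largest estimated support probability belongs to the unanimously preferred wine.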
Test 1: A hypothetical 18 tasters rank six wines; six of those combine to have a random expectation, and 12 assign the same ranks to the same wines. The mixture weight for the random class should be 6/18 = 0.33. The EM solution does yield $\widehat{\pi_r} = 0.33$, and the estimates of $\widehat{\rho_i}$ imply the correct preference order.
Test 2: Again, 18 tasters rank six wines. Twelve of those combine to have a random expectation, and six assign the same ranks to the same wines. The mixture weight for the random class $\widehat{\pi_r}$ should be 12/18 = 0.67. The EM solution does yield $\widehat{\pi_r} = 0.67$, and the estimates of $\widehat{\rho_i}$ again imply the correct preference order. As an additional test, the model does replicate the ranked Pinot Gris results referenced above.
Test 3: The same data in Test 1 are evaluated now as scores rather than ranks. While the random-class weight and the density of the most-preferred wine should be the same as in Test 1, the order should reverse because unity is the most-preferred rank but the least-preferred score. The EM solution does yield those results. Further, after changing three tasters' scores to prefer wine E the most and another three to prefer wine F the most, the EM solution does yield approximately $\widehat{\rho_i} = 0.5$ for the two wines that are tied for most preferred.
Test 4: The data from Test 3 are evaluated again here, except that, for the eighteenth taster alone, the scores on the most- and second-most-favored wines are tied. The total number of order vectors should be 17 + 2! = 19. Compared to the results for Test 3, the probability $\widehat{\rho_i}$ for the first-place wine should go down because that wine is now tied for second place for one taster. This is a test of Equation (3) for ties, and the EM solution does yield those results.
Test 5: In each Judgment, nine judges assigned scores between zero and 20 to each of ten wines. Accordingly, in this fifth test, each of nine tasters randomly assigns a score between zero and 20 to each of ten wines in a Monte Carlo simulation with 1,000 iterations. These draws include random ties and random intervals ${\squf}$ between scores. For ten wines, the expected value of the probability that each wine is most preferred, $E\left( {\widehat{\rho_i}} \right)$, should be 1/10. That and other EM solution results are summarized in Table 1. Standard deviations (SD) are also reported. The expected results are obtained; none of the $E\left( {\widehat{\rho_i}} \right)$ are significantly different from 0.10, and their sum is close to unity. In addition, the expectation of the mixture weight for the class of tasters that appear to assign scores randomly, $E\left( {\widehat{\pi_r}} \right)$, is 0.882. This implies that the probability of Type I error, false-positive illusory agreement on common preference, is approximately 1.000 − 0.882 = 0.118. See further discussion of Type I error and the tractable examples worked in Appendix A.
Table 1 Monte Carlo Simulation Results, 1,000 Iterations
VI. Application to Paris and Princeton
The title of this article portends a test of a mixture of rank-preference models using the scores that judges assigned in Paris and Princeton. Using the model presented and tested above, this section now presents that test.
A. Paris 1976
Eleven wine judges met at the Intercontinental Hotel in Paris, France, on May 24, 1976. Ten white wines and ten red wines were decanted into identical plain bottles. For each group of white and red wines, the tasting order was determined by drawing numbers from a hat. Each judge had two wine glasses, one for wine and the other for water. The protocol was step-by-step sequential; each judge tasted and scored a wine before the next wine was poured. Ten white wines were poured, then, following a break, ten reds. Each judge scored each wine on a scale of zero to 20, and the “official” ranking was based on the sums of the nine French judges' scores. See Taber (2005); the author thanks both Mr. Taber and Mr. Spurrier for independently confirming this description of the protocol.
Among other sources, the judges' scores are available in Hulkower (2009, tables 3 and 8) and Lindley (2006). Spurrier and many others calculated overall preference using the total of scores for each wine. Ashenfelter and Quandt (1999) compared the red wines using a rank-sums method. Quandt (2006) then compared the white wines using rank sums. Cicchetti (2006) evaluated both red and white wines using intraclass correlation coefficients to identify two classes of judges: those who ranked consistently with each other and those whose ranks appeared to be inconsistent. Hulkower (2009) reviewed the literature to date and presented another comparison using rank sums. Ginsburgh and Zang (2012) compared the red wines using rank sums and a game-theory-based measure of relative influence known as the Shapley value. See Olkin et al. (2015) for a recent survey of rank sums and other methods of analyzing wine-tasting data.
Mixture model results for Paris appear in Tables 2A and 2B. In addition to a preference order based on $\widehat{\rho_i}$, each table shows a preference order according to rank sums; most of the evaluations referenced above calculate rank sums and consider them superior to a sum of scores. For the white wines in Table 2A, the preference order implied by the mixture model is very close to that implied by rank sums. The correlation coefficient between the two is 0.94. The $\widehat{\rho_i}$ imply very strong agreement among judges on the first- and last-place wines and much similarity in between. The estimate $\widehat{\pi_r} = 0.335$ implies that three of the nine judges appear to have assigned scores randomly. The likelihood ratio statistic (LRS), compared to the null-hypothesis random-ranking Monte Carlo log-likelihood result in Table 1, is $-2 \times \left( {\left( { - 125.19} \right) - \left( { - 102.85} \right)} \right) = 44.7$, and that LRS has a chi-square p-value < 0.001.
Table 2A White Wines, 1976 Judgment of Paris
Judges' scores and rank-sums preference order from Borda results in Hulkower (2009).
Table 2B Red Wines, 1976 Judgment of Paris
Judges' scores and rank-sums preference order from Borda results in Hulkower (2009).
Results for the red wines in Paris, in Table 2B, are similar to those for the white wines; however, the proportion of apparently random scoring as measured by $\widehat{\pi_r}$ increased. The preference orders implied by the mixture model and rank sums are again similar, and their correlation coefficient is 0.86. There is much general concordance but specific agreement on only the second-place wine. The estimate of $\widehat{\pi_r}$ implies that four of the judges, one more than for the white wines, appear to have assigned scores randomly. That increase in the number of random-scoring judges could be due to palate fatigue, the characteristics of the red wines themselves, the possibility that announcement of results for the first flight and discussion among judges altered expectations, or other factors. The p-value of the LRS remains < 0.001.
Finally, concerning Paris, the analysis above is intended only as a test of the mixture model on a famously available and much-analyzed set of data. The author does not intend to put forth any opinion about which wine or which country actually won or lost in Paris. Filipello (1955, 1956, 1957) and Filipello and Berg (1958) conducted various wine taste tests using sequential protocols and found evidence of primacy bias. Mantonakis et al. (2009) found evidence of both primacy and recency bias with an end-of-sequence protocol, even among “high-knowledge” wine tasters. De Bruin (2005) examined singing and figure-skating competition results and found position bias in both step-by-step and end-of-sequence sequential judging protocols. A step-by-step sequential protocol was employed in Paris, and the sequence of the wines has never been disclosed.
B. Princeton 2012
Nine judges met on June 8, 2012, at Prospect House on the campus of Princeton University. Ten white wines and ten red wines were bagged and kept out of tasters' sight in another room. For each group of white and red wines, the tasting order was determined by drawing letters from a hat. In sharp contrast to the sequential protocol in Paris, in Princeton there were ten glasses in front of each judge, and each judge could taste and re-taste the wines in any order. Ten white wines were tasted and scored in a first flight, then ten red wines in a second. Water was available. Each judge scored each wine on a scale of zero to 20. See a description of the protocol and judges' scores in Ashenfelter and Storchmann (2012) and Taber (2012). Bodington (2012) showed that the “open” protocol employed in Princeton, in contrast to the sequential protocol employed in Paris, is not prone to sequential or flight-position bias.
Mixture model results for Princeton, and again the preference orders implied by rank sums, appear in Tables 3A and 3B. Quandt (2012) compared the wines using rank sums. Ward (2012) calculated, and offered caution about comparing, the raw-score, rank, centered, standardized, heterogeneous, and Friedman's test statistics. Ginsburgh and Zang (2012), as they did for Paris, compared the red wines using rank sums and Shapley values.
Table 3A White Wines, 2012 Judgment of Princeton
Rank-sums preference order from Quandt (2012).
Table 3B Red Wines, 2012 Judgment of Princeton
Rank-sums preference order from Quandt (Reference Quandt2012).
For the white wines in Princeton, the results in Table 3A show that the preference orders implied by the mixture model and by rank sums are similar; the correlation coefficient between the two orders is 0.86. The estimates of $\widehat{\rho_i}$ for the common-preference class of taster imply a strong preference for the first-place Clos des Mouches and a strong preference against the last-place Ventimiglia. Those findings are consistent with those in Quandt (Reference Quandt2012) and Ward (Reference Ward2012). In addition to preference-order implications that are similar to those of others, the mixture model implies substantial randomness in tasters' scores: the random mixture weight is $\widehat{\pi_r} = 0.670$, and the LRS is 0.2.
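The correlations between preference orders reported here can be checked with a standard rank-correlation computation. A minimal sketch follows; it assumes the reported coefficients are Spearman rank correlations on the two orderings, and the wine labels in the usage example are placeholders, not the Judgments' actual orders.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two preference orders.

    Each argument is a list of the same wine labels, from most to
    least preferred. Assumes no ties within an order."""
    n = len(order_a)
    # Position of each wine in the second order.
    rank_b = {wine: i for i, wine in enumerate(order_b)}
    # Sum of squared rank differences.
    d2 = sum((i - rank_b[wine]) ** 2 for i, wine in enumerate(order_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Placeholder labels: identical orders give 1.0, reversed orders give -1.0.
print(spearman_rho(["W1", "W2", "W3", "W4"], ["W1", "W2", "W3", "W4"]))
print(spearman_rho(["W1", "W2", "W3", "W4"], ["W4", "W3", "W2", "W1"]))
```

Applied to a mixture-model order and a rank-sums order over the same ten wines, the function returns the kind of coefficient quoted above.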
Results for the red wines in Princeton, shown in Table 3B, are similar to those for the white wines. There is general concordance between the mixture model and rank-sums results, and the correlation coefficient between the two orders is 0.78. The estimates of $\widehat{\rho_i}$ for the common-preference class of taster imply a strong preference for Château Mouton Rothschild and a strong preference against Four JG's. Again, those findings are consistent with those in Quandt (Reference Quandt2012) and Ward (Reference Ward2012). As for the white wines, in addition to preference-order implications that are similar to those of others, the mixture model results imply substantial randomness in tasters' scores: $\widehat{\pi_r} = 0.778$, and the LRS is 4.4. Again, as it did in Paris, $\widehat{\pi_r}$ increased for the second wine flight, perhaps due to palate fatigue or other factors.
Standing back from the specific results and comparisons above, the test of the mixture model in Equations (1) through (3) on the Paris and Princeton data has several implications. First, although the mixture model satisfies the Luce choice axiom, which rank-sums methods do not, its findings about aggregate preference order are similar, but not identical, to those of rank-sums methods. Second, through $\widehat{\pi_r}$, the mixture model implies and quantifies an increase in apparently random score assignments between the first and second wine flights that may be due to palate fatigue or other factors.
VII. Conclusion
A mixture of rank-preference models, including a Plackett-Luce PMF, was tested on the scores assigned by judges in the 1976 Judgment of Paris and the 2012 Judgment of Princeton. The aggregate-preference order implied by that mixture model complies with the Luce choice and IIA axioms, and it is also generally consistent with other published results. The mixture model has the added benefits of an estimate of Type I error, an estimate of the proportion of judges' scores that appear to reflect random-scoring behavior, and an estimate of judges' nonrandom common-preference order.
Although rank-preference and mixture models have been applied to taste tests of beans, cheese, crackers, salad dressings, soft drinks, sushi, and animal feed, application of the models to the unique challenges of wine-tasting results remains at an early stage. An analysis of alternatives to Plackett-Luce seems worthwhile. Restating the mixture to allow different proportions of common-preference agreement on different wines, changing $\pi_p$ to $\pi_{p,i}$, seems realistic. While the mixture model presented above does parse observed tasting results into random and common-preference components, tasters' nonrandom but idiosyncratic preferences may account for much of the variance in observed scores. Some tasters like more fruit than others, some like more acid; there are many such examples. Future work is needed to identify and quantify those idiosyncratic preferences. Finally, the fundamental question of whether a zero-to-20 point-score protocol itself induces ties and bias in scores may be worth testing.
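The wine-specific restatement suggested above can be sketched as follows. The notation is illustrative, following the two-component form summarized in Section II rather than reproducing Equations (1) through (3): $x_t$ denotes taster $t$'s observed ranking, $P_r$ the random component, and $P_p$ the common-preference (Plackett-Luce) component.

```latex
% Current two-component mixture for taster t's observed ranking x_t:
P(x_t) = \pi_r\,P_r(x_t) + \pi_p\,P_p(x_t),
\qquad \pi_r + \pi_p = 1.
% The restatement replaces the single common-preference weight \pi_p
% with a wine-specific weight \pi_{p,i}, so that common-preference
% agreement can be strong on some wines i and weak on others:
\pi_p \;\longrightarrow\; \pi_{p,i}.
```

The cost of the restatement is additional parameters to estimate from the same small sample, which may matter for tastings with only nine or eleven judges.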
Appendix A: Examples of Mixture Model Results and Type I Error
Several simple examples of mixture model results and Type I error appear below. Examples with two wines and two to four tasters are tractable to work out by hand as shown. As a check, the EM solutions to Equations (1) and (2) do match the results below.
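The appendix checks can also be replicated numerically. Below is a minimal EM sketch for the simplest two-wine case, assuming the two-component form summarized in Section II: a uniform random component (each of the two orderings has probability 1/2) mixed with a Plackett-Luce component. The function and variable names are illustrative and this is not a reproduction of the paper's Equations (1) and (2).

```python
def em_random_pl_mixture(rankings, iters=200):
    """EM for a two-component mixture over rankings of two wines, A and B.

    rankings: list of 0/1 flags, 1 if a taster ranked wine A first.
    Returns (pi_r, rho_a): the estimated random-component weight and
    the Plackett-Luce probability that wine A is preferred to wine B."""
    pi_r = 0.5   # initial weight on the random component
    rho_a = 0.5  # initial P-L probability that A beats B
    for _ in range(iters):
        # E-step: responsibility of the random component for each taster.
        resp = []
        for a_first in rankings:
            p_pl = rho_a if a_first else 1.0 - rho_a
            p_rand = 0.5
            resp.append(pi_r * p_rand / (pi_r * p_rand + (1.0 - pi_r) * p_pl))
        # M-step: update the mixture weight and the P-L strength.
        pi_r = sum(resp) / len(resp)
        w = [1.0 - r for r in resp]  # common-preference responsibilities
        den = sum(w)
        num = sum(wi for wi, a in zip(w, rankings) if a)
        rho_a = num / den if den > 0 else 0.5
    return pi_r, rho_a

# Four tasters who all rank A first: EM drives pi_r toward zero
# and the P-L preference for A toward one.
print(em_random_pl_mixture([1, 1, 1, 1]))
```

With perfectly split data such as `[1, 0, 1, 0]`, the likelihood cannot distinguish random scoring from an even-strength Plackett-Luce component, which mirrors the Type I error discussion in the examples.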
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921015331721-0410:S1931436115000188:S1931436115000188_tabU1.gif?pub-status=live)