
Mathematical techniques and the number of groups

Published online by Cambridge University Press:  08 April 2008

Michael Lavine
Affiliation:
Department of Statistical Science, Duke University, Durham, NC 27708. michael@stat.duke.edu, http://fds.duke.edu/db/aas/stat/faculty/michael

Abstract

Cluster analysis, factor analysis, and multidimensional scaling are not good guides to the number of groups in a data set. In fact, the number of groups may not be a well-defined concept.

Type
Open Peer Commentary
Copyright
Copyright © Cambridge University Press 2008

As a statistician I will comment only on the statistical aspects of Erickson's article, notably section 4.2.2, "Mathematical techniques." I agree with Erickson's general sentiment that cluster analysis, factor analysis, and multidimensional scaling are unreliable indicators of the number of groups of neurons, or of any other objects to which we apply the techniques. It is well known, for example, that the apparent number of groups found by cluster analysis can depend strongly on details of the clustering algorithm that are unrelated to the science in question. One must decide whether to analyze the original data, standardized data (each feature rescaled to mean 0 and variance 1), or correlations. One must also decide how to measure the distance between clusters. Should it be Euclidean distance, or something else? Should it use average linkage, complete linkage, or single linkage? These choices all give different views of the data. The choice matters, but none is guaranteed to be the best or the only reasonable way of looking at the data.

To illustrate, I refer to two figures from The Elements of Statistical Learning (Hastie, Tibshirani & Friedman 2001). Figure 1 (Figure 14.5 of Hastie et al. 2001) shows the results of clustering with and without standardizing first. The results are quite different, and in this case the unstandardized result is correct. Figure 2 (Figure 14.13 of Hastie et al. 2001) shows three cluster analyses of one data set; the analyses differ in whether they use average, complete, or single linkage to measure distance between clusters. Again, the results are quite different. The point is not to say which analysis is best; it is that the results depend on seemingly innocuous choices and that no single analysis gives a complete picture of the data.

Figure 1. Simulated data: On the left, K-means clustering (with K=2) has been applied to the raw data. The two [shades] indicate the cluster memberships. On the right, the features were first standardized before clustering. This is equivalent to using feature weights 1/[2·Var(X_j)]. The standardization has obscured the two well-separated groups. Note that each plot uses the same units in the horizontal and vertical axes. (From Hastie et al. 2001. With kind permission of Springer Science and Business Media.)

Figure 2 (Lavine). Dendrograms from agglomerative hierarchical clustering of human tumor microarray data. (From Hastie et al. 2001. With kind permission of Springer Science and Business Media.)
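The standardization effect behind Figure 1 can be sketched with a small numerical example (mine, not from the original figure; the four data points and the separation measure are constructed for illustration). Two groups are cleanly separated along one feature; standardizing each feature to unit variance shrinks that separation until a split on the noise feature looks exactly as good as the true grouping:

```python
import numpy as np

# Two well-separated groups along feature 1; feature 2 is small-scale noise.
X = np.array([[-10.0, -1.0],
              [-10.0,  1.0],
              [ 10.0, -1.0],
              [ 10.0,  1.0]])
true_split = np.array([0, 0, 1, 1])   # the real groups (split on feature 1)
wrong_split = np.array([0, 1, 0, 1])  # a split on the noise feature

def separation(P, labels):
    """Ratio of between-centroid distance to mean within-group distance."""
    c = [P[labels == k].mean(axis=0) for k in (0, 1)]
    between = np.linalg.norm(c[0] - c[1])
    within = np.mean([np.linalg.norm(p - c[l]) for p, l in zip(P, labels)])
    return between / within

# Standardize each feature to mean 0, variance 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(separation(X, true_split))   # 20.0: the groups are obvious in the raw data
print(separation(Z, true_split))   # 2.0: standardizing shrinks the separation
print(separation(Z, wrong_split))  # 2.0: the wrong split now looks equally good
```

After standardizing, the true grouping and a grouping on pure noise are indistinguishable by this measure, which is the sense in which "the standardization has obscured the two well-separated groups."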

Factor analysis and multidimensional scaling are subject to some of the same vagaries. For example, answers depend on whether we analyze raw data, rescaled data, or correlations, on whether we assume normality in factor analysis, and on what loss function we use in multidimensional scaling. Each gives a different view of the data; none is necessarily the best. One analysis may appear to show four well-distinguished groups, whereas another may appear to show eight, three, or none at all. Faith in the results of any particular analysis may often prove unfounded. The number of groups in multidimensional data can depend on who is looking, with what techniques, and for what purpose.
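The dependence on raw versus rescaled data is easy to see even before any method is applied, because the dissimilarities fed into a technique such as multidimensional scaling already encode the scaling choice. In this hypothetical three-object sketch (my illustration; the coordinates and the 5:1 rescaling are assumptions), rescaling one feature changes which object is nearest to object A:

```python
import numpy as np

# Three objects; feature 1 is on a scale roughly 5x larger than feature 2.
pts = np.array([[0.0, 0.0],   # A
                [5.0, 1.0],   # B
                [1.0, 3.0]])  # C

def nearest_to_A(P):
    """Return the label of the object closest to A in Euclidean distance."""
    d = np.linalg.norm(P[1:] - P[0], axis=1)
    return "B" if d[0] < d[1] else "C"

print(nearest_to_A(pts))                         # 'C' on the raw scale
print(nearest_to_A(pts / np.array([5.0, 1.0])))  # 'B' after rescaling feature 1
```

Since the distance matrix is the input to multidimensional scaling, the two choices would lead to qualitatively different configurations, and potentially to different apparent groupings, before any question of loss function arises.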

References

Hastie, T., Tibshirani, R. & Friedman, J. (2001) The elements of statistical learning: Data mining, inference, and prediction. Springer.