Introduction
A vast collection of crop-related global germplasm includes traditional landraces, modern cultivars and wild relatives. However, only a fraction of these germplasm collections can be protected and maintained in gene banks. Frankel and Brown (1984) introduced the concept of a core collection as a subset of a larger germplasm collection that represents its genetic and phenotypic diversity. Existing methodologies for the development of a core set are based on either qualitative or quantitative data. Occasionally, quantitative data are transformed into qualitative data, or vice versa, to avoid the difficulty of handling mixture data (Kim et al., 2007).
Various clustering methodologies are applied to obtain homogeneous strata based on either qualitative or quantitative data. However, the results have been shown to depend strongly on the clustering methodology, and mostly heuristic methods (Kim et al., 2007) have been followed to determine homogeneous strata in a germplasm collection. The factors that need to be addressed in the development of a core set include the size of the core set, the formation of homogeneous strata and the sampling strategy (van Hintum, 1999).
In the past, many studies have described the development of core sets from a large collection of germplasm, namely the development of a core set from the United States Department of Agriculture (USDA) rice germplasm (Yan et al., 2007) and a rice mini-core from the USDA core collection (Agrama et al., 2009) using PowerCore software (RDA-Genebank Information Center; http://www.genebank.go.kr/eng/PowerCore/PowerCore_Software.zip) (Kim et al., 2007). Gangopadhyay et al. (2010) used the principal component score strategy to develop a core set of brinjal germplasm. Sharma et al. (2010) evaluated a sorghum mini-core from a core collection of landrace accessions to identify the sources of grain mold and downy mildew resistance. Yu et al. (2012) developed a core set of cotton germplasm with a genome-wide coverage of marker data. Wen et al. (2012) investigated how the tropical maize race Tuxpeno could be exploited in future maize improvement using genome-wide single nucleotide polymorphisms (SNPs). Gibert and Cortes (1997) presented the properties and details of distance matrices obtained by weighting qualitative and quantitative variables for cluster analysis. However, weighting distance matrices from different sources is problematic because an objective choice of the weighting parameters is often difficult.
Crossa and Franco (2004) reviewed genomic classification techniques as well as statistical models based on mixed distribution models. Doring et al. (2004) proposed a fuzzy clustering procedure for mixture data. Sarkar et al. (2011) compared the performance of different clustering procedures based on mixture data. However, the identification of the optimum number of clusters with full utilization of mixture data for the development of a core set remains a challenge.
To our knowledge, most of the existing methodologies for the development of a core set are based on either qualitative or quantitative traits. Moreover, the identification of a suitable distance measure, clustering methodology, number of clusters, allocation strategy and evaluation criteria for the development of a core set based on the mixture data of germplasm is yet to be fully explored. Therefore, in the present study, a systematic approach is proposed for the development of a core set of germplasm using mixture data. The approach is illustrated on rice germplasm having both quantitative phenotypic data and qualitative SNP genotyping data.
Materials and methods
The identification of a core collection is a two-step procedure in which the accessions are initially classified into homogeneous strata and then a fraction of accessions from each stratum are selected for core collection by using an appropriate sampling or allocation strategy. To enable clustering techniques to handle a mixture of qualitative and quantitative data, first, distances are calculated separately for qualitative and quantitative data using relevant measures. Then, these distance matrices are directly combined and used as inputs for cluster analysis.
In this study, a dataset comprising 219 salt-tolerant rice germplasm accessions having 14 agronomic/phenotypic characteristics and 2915 genome-wide SNPs (coded as 0 and 2 for dominant and recessive homozygotes, respectively, and 1 for a heterozygote for each individual) was considered.
Distance measures
The following three distance measures were considered to determine the distances between the accessions based on quantitative data:
(1) Distance based on the average of the range-standardized absolute difference:
$$\begin{eqnarray} A _{1} = \frac {1}{ p }{ \sum _{ k = 1}^{ p } }\,\frac {\vert x _{ ik } - x _{ jk }\vert }{ r _{ k }}, \end{eqnarray}$$
where $$x _{ ik } $$ and $$x _{ jk } $$ are the values of the kth quantitative variable for the ith and jth accessions; $$r _{ k } $$ is the range of the kth variable; and p is the total number of quantitative variables (Gower, 1971).
(2) Distance based on Pearson's correlation:
$$\begin{eqnarray} A _{2} = (1 - r _{ ij }^{2}), \end{eqnarray}$$
where $$r _{ ij } $$ is the product moment correlation (similarity) between the ith and jth accessions, thus dissimilarity = 1 − similarity.
(3) Rescaled distance based on the standardized score:
$$\begin{eqnarray} A _{3} = \frac {1}{max\,( d _{ ij }^{\ast })}{ \sum _{ k = 1}^{ p } }\,\left [\frac { x _{ ik } - x _{ jk }}{ \sigma _{ k }}\right ]^{2}, \end{eqnarray}$$
where $$\sigma _{ k } $$ is the standard deviation of the kth variable and max $$( d _{ ij }^{\ast }) $$ is the maximum, over all pairs of accessions in the entire dataset, of the unscaled distance $$d _{ ij }^{\ast } = { \sum _{ k = 1}^{ p } }\,[( x _{ ik } - x _{ jk })/ \sigma _{ k }]^{2} $$ .
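The three quantitative measures can be illustrated with a minimal Python sketch. It is not the authors' R code; the function names and the toy data are hypothetical, and two details are assumptions: the standard deviation is taken as the sample standard deviation, and no variable is constant across accessions (otherwise a range or standard deviation of zero would cause a division by zero).

```python
import math
import statistics


def column_ranges(data):
    """Range (max - min) of each quantitative variable across all accessions."""
    return [max(col) - min(col) for col in zip(*data)]


def range_standardized_distance(x_i, x_j, ranges):
    """A1: average over variables of |x_ik - x_jk| / r_k."""
    p = len(ranges)
    return sum(abs(a - b) / r for a, b, r in zip(x_i, x_j, ranges)) / p


def correlation_distance(x_i, x_j):
    """A2: 1 - r_ij^2, with r_ij the Pearson correlation between two accessions."""
    p = len(x_i)
    mean_i = sum(x_i) / p
    mean_j = sum(x_j) / p
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(x_i, x_j))
    var_i = sum((a - mean_i) ** 2 for a in x_i)
    var_j = sum((b - mean_j) ** 2 for b in x_j)
    r = cov / math.sqrt(var_i * var_j)
    return 1.0 - r ** 2


def rescaled_standardized_distances(data):
    """A3 for every pair: summed squared standardized differences / their maximum."""
    n = len(data)
    # Assumption: sigma_k is the sample standard deviation of variable k.
    sigmas = [statistics.stdev(col) for col in zip(*data)]
    raw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(((a - b) / s) ** 2 for a, b, s in zip(data[i], data[j], sigmas))
            raw[i][j] = raw[j][i] = d
    largest = max(max(row) for row in raw)
    return [[d / largest for d in row] for row in raw]


# Toy data: 3 accessions (rows) x 3 quantitative traits (columns), made up.
toy = [[10.0, 2.0, 5.0], [20.0, 4.0, 5.5], [30.0, 3.0, 7.0]]
toy_ranges = column_ranges(toy)
a1_01 = range_standardized_distance(toy[0], toy[1], toy_ranges)
a3_all = rescaled_standardized_distances(toy)
```

Because A 1 and A 3 are scaled by the range and by the maximum pairwise value, respectively, both stay within [0, 1], which is what later allows them to be added to the qualitative distances.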
The following two different distance measures were considered to determine the distance between the accessions based on qualitative data:
(1) Distance based on the average mismatch:
$$\begin{eqnarray} B _{1} = \frac {1}{ m }{ \sum _{ k = 1}^{ m } }\, d _{ k }, \end{eqnarray}$$
where $$d _{ k } = 0 $$ if $$y _{ ik } = y _{ jk } $$ , else $$d _{ k } = 1 $$ (Gower, 1971).
(2) Rescaled distance based on the average absolute difference:
$$\begin{eqnarray} B _{2} = \frac {3}{2}\times \frac {\frac {1}{ m }{ \sum _{ k = 1}^{ m } }\,\vert y _{ ik } - y _{ jk }\vert }{1 + \frac {1}{ m }{ \sum _{ k = 1}^{ m } }\,\vert y _{ ik } - y _{ jk }\vert }, \end{eqnarray}$$
where $$y _{ ik } $$ and $$y _{ jk } $$ are the values of the kth qualitative variable for the ith and jth accessions and m is the total number of qualitative variables. The distance B 2 is a modified version of the measure of Munneke et al. (2005). The modification ensures that the value of B 2 lies in the range [0, 1].
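The two qualitative measures can be sketched in a few lines of Python for SNP genotypes coded 0, 1 and 2. The function names and the example genotypes are hypothetical and serve only as an illustration of the formulas above.

```python
def average_mismatch(y_i, y_j):
    """B1: proportion of markers at which the two accessions differ."""
    m = len(y_i)
    return sum(1 for a, b in zip(y_i, y_j) if a != b) / m


def rescaled_abs_difference(y_i, y_j):
    """B2: (3/2) * mean|y_ik - y_jk| / (1 + mean|y_ik - y_jk|)."""
    m = len(y_i)
    mean_abs = sum(abs(a - b) for a, b in zip(y_i, y_j)) / m
    return 1.5 * mean_abs / (1.0 + mean_abs)


# Two made-up accessions genotyped at four SNPs (0/2 homozygotes, 1 heterozygote).
acc_a = [0, 0, 2, 1]
acc_b = [2, 0, 2, 2]
b1 = average_mismatch(acc_a, acc_b)          # 2 of 4 markers differ
b2 = rescaled_abs_difference(acc_a, acc_b)   # mean |difference| is 0.75
```

With this coding the mean absolute difference is at most 2, so the ratio is at most 2/3; the factor 3/2 is what brings the upper bound of B 2 to 1.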
The elements of the three quantitative distance matrices (A 1–A 3) and the two qualitative distance matrices (B 1 and B 2) all lie between 0 and 1. Thus, the various combined distance matrices for mixture data are computed by summing the distance matrices corresponding to the quantitative and qualitative data, which are defined as follows:
$$\begin{eqnarray} A _{ p } B _{ q } = ( a _{ pij } + b _{ qij }),\quad p = 1, 2, 3;\quad q = 1, 2, \end{eqnarray}$$
where $$( a _{ pij }) $$ and $$( b _{ qij }) $$ represent the ijth elements of the matrices A p and B q, respectively. This gives six combined measures: A 1 B 1, A 2 B 1, A 3 B 1, A 1 B 2, A 2 B 2 and A 3 B 2. These qualitative, quantitative and combined distance matrices are used as inputs for cluster analysis. In this study, seven (five hierarchical and two partitioning) different clustering procedures, namely single linkage, complete linkage, unweighted pair-group method with arithmetic mean, weighted average, Ward's method, k-means and partitioning around the medoids, were considered to find the optimum number of homogeneous clusters.
Here, the approach of Monti et al. (2003) was followed for assessing the stability of clusters by bootstrapping, and 1000 bootstrap samples were drawn from the distance matrices for each cluster number $$k \in \{2,\ldots ,10\} $$ . A consensus clustering result was obtained by taking the ratio of the number of times any two observations are found together in the same cluster to the total number of times they are selected together in the bootstrap samples. As each clustering procedure yields different cluster memberships of individuals, the consensus clustering results obtained from different clustering procedures are merged to obtain a merged-consensus clustering result (Simpson et al., 2010). The merged-consensus clustering result is obtained by taking the average of the consensus clustering results for a particular cluster number k. Due to the absence of any a priori information on the clustering pattern, equal weights are given to each consensus clustering result, i.e. equal importance is given to each of the clustering procedures.
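The consensus and merged-consensus steps described above can be sketched as follows. This is an illustrative Python outline, not the 'clusterCons' implementation used in the study: the function names are hypothetical, each resampling run is assumed to be supplied as a dictionary mapping the index of a sampled accession to its cluster label, and a pair that is never sampled together is assigned a consensus value of 0 by assumption.

```python
def consensus_matrix(n_items, runs):
    """Consensus M(i, j): times i and j share a cluster / times both were sampled."""
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in runs:
        items = sorted(labels)
        for pos, i in enumerate(items):
            for j in items[pos + 1:]:
                sampled[i][j] += 1
                sampled[j][i] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
                    together[j][i] += 1
    matrix = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        matrix[i][i] = 1.0
        for j in range(n_items):
            if i != j and sampled[i][j] > 0:
                matrix[i][j] = together[i][j] / sampled[i][j]
    return matrix


def merged_consensus(matrices):
    """Equal-weight element-wise average of consensus matrices from several methods."""
    n = len(matrices[0])
    count = len(matrices)
    return [[sum(m[i][j] for m in matrices) / count for j in range(n)]
            for i in range(n)]


# Three made-up resampling runs on three accessions; the last run omits accession 2.
example_runs = [{0: 0, 1: 0, 2: 1}, {0: 0, 1: 1, 2: 1}, {0: 0, 1: 0}]
example_consensus = consensus_matrix(3, example_runs)
```

In practice one consensus matrix would be built per clustering procedure and per value of k, and `merged_consensus` would then average the matrices that share the same k.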
Optimum number of clusters
Mostly, a priori information is used to determine the number of clusters into which the accessions are classified, but in the absence of such information, the optimum number of clusters has to be identified from the data. Identifying the optimal number of clusters is one of the most challenging issues and is essential for effective and efficient clustering (Everitt, 1979). The optimal number of clusters (k) is estimated as the value of k at which the change in the area under the cumulative distribution function (CDF) (ΔK), calculated across a range of possible values of k, is largest. Let M denote a merged-consensus clustering result of order N × N. Then, an empirical CDF, defined over the range [0, 1], is given by:
$$\begin{eqnarray} CDF( c ) = \frac {{ \sum _{ i < j } }\,1\{ M ( i , j ) \leq c \}}{ N ( N - 1)/2}, \end{eqnarray}$$
where $$1\{\ldots \} $$ denotes the indicator function and M(i, j) is the (i, j)th entry of the merged-consensus matrix M. The area under the CDF corresponding to M is computed using the formula:
$$\begin{eqnarray} A ( k ) = { \sum _{ i = 2}^{ m } }\,[ x _{ i } - x _{ i - 1}]\, CDF( x _{ i }), \end{eqnarray}$$
where $$\{ x _{1}, x _{2},\ldots , x _{ m }\} $$ is the ordered set of entries of the merged-consensus matrix M, with $$m = N ( N - 1)/2 $$ (Monti et al., 2003).
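The area under the empirical CDF and the resulting ΔK values can be sketched as below, following the definitions of Monti et al. (2003) for the CDF and its area. The function names are hypothetical, and the indexing of ΔK (the area itself for the smallest k, and the relative change from the previous k otherwise) is an assumption, since several conventions are in use.

```python
from bisect import bisect_right


def consensus_entries(M):
    """Sorted upper-triangle entries x_1 <= ... <= x_m of a consensus matrix."""
    n = len(M)
    return sorted(M[i][j] for i in range(n) for j in range(i + 1, n))


def cdf_area(M):
    """Sum over i = 2..m of (x_i - x_{i-1}) * CDF(x_i)."""
    x = consensus_entries(M)
    m = len(x)
    area = 0.0
    for idx in range(1, m):
        # CDF(x_i): share of entries that are less than or equal to x_i.
        cdf_value = bisect_right(x, x[idx]) / m
        area += (x[idx] - x[idx - 1]) * cdf_value
    return area


def delta_k(areas):
    """areas maps k to its CDF area; returns the assumed Delta-K for each k."""
    ks = sorted(areas)
    delta = {ks[0]: areas[ks[0]]}
    for prev, cur in zip(ks, ks[1:]):
        delta[cur] = (areas[cur] - areas[prev]) / areas[prev]
    return delta


# Made-up areas for k = 2, 3, 4; the k with the largest Delta-K is selected.
example_delta = delta_k({2: 0.5, 3: 0.8, 4: 0.84})
best_k = max(example_delta, key=example_delta.get)
```

A consensus matrix whose entries are all 0 or 1 (perfectly stable clusters) gives the largest possible area for its proportion of co-clustered pairs, so gains in area as k increases indicate more stable partitions.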
Cluster robustness
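After determining the optimal number of clusters, the best-fitted clustering pattern of germplasm is determined based on cluster robustness. The robustness of a cluster under any clustering procedure is calculated by taking the average of the merged-consensus result over those pairs of individuals falling in the same group using the formula (Simpson et al., 2010):
$$\begin{eqnarray} R ( I _{ l }) = \frac {1}{ N _{ l }( N _{ l } - 1)/2}{ \sum _{ i , j \in I _{ l },\, i < j } }\, M ( i , j ), \end{eqnarray}$$
where $$I _{ l } $$ is the set of accessions in the lth cluster, $$N _{ l } $$ is the number of accessions in that cluster and M(i, j) is the (i, j)th entry of the merged-consensus matrix M.
The average cluster robustness value is then calculated across the k clusters for each clustering algorithm, and the algorithm with the highest average is taken as the one best fitted to the data.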
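Cluster robustness, as described verbally above, can be sketched in Python as the mean merged-consensus value over all pairs of accessions assigned to the same cluster. The function names are hypothetical, and skipping clusters with a single member (which have no pairs) is an assumption of this sketch.

```python
from itertools import combinations


def cluster_robustness(M, labels):
    """Mean of M[i][j] over all pairs (i, j) that share a cluster label."""
    members = {}
    for idx, label in enumerate(labels):
        members.setdefault(label, []).append(idx)
    robustness = {}
    for label, items in members.items():
        pairs = list(combinations(items, 2))
        if pairs:  # Assumption: singleton clusters are skipped.
            robustness[label] = sum(M[i][j] for i, j in pairs) / len(pairs)
    return robustness


def average_robustness(M, labels):
    """Average of the per-cluster robustness values for one clustering result."""
    values = list(cluster_robustness(M, labels).values())
    return sum(values) / len(values)


# Made-up merged-consensus matrix for four accessions in two clusters.
example_M = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
example_labels = [0, 0, 1, 1]
example_average = average_robustness(example_M, example_labels)
```

Computing `average_robustness` once per clustering algorithm, with the same merged-consensus matrix and each algorithm's own labels, gives the values that are compared to select the best-fitted algorithm.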
Allocation methods
The second and final stage for the development of a core set is to select the accessions from homogeneous groups based on a suitable sampling or allocation strategy. van Hintum et al. (2000) and Hu et al. (2000) used different sampling strategies, namely proportional allocation (P strategy), log frequency allocation (L strategy), constant allocation (C strategy) and simple random sampling (R strategy) for the identification of a core set. During this stage, the accessions from the identified robust clusters are sampled by using the following three different allocation methods:
(1) Proportional allocation
$$\begin{eqnarray} n _{ i } = \left [ n \times \frac { N _{ i }}{{ \sum _{ i = 1}^{ g } }\, N _{ i }}\right ] \end{eqnarray}$$
(2) Log-proportional allocation
$$\begin{eqnarray} n _{ i } = \left [ n \times \frac {log( N _{ i })}{{ \sum _{ i = 1}^{ g } }\,log( N _{ i })}\right ], \end{eqnarray}$$
where $$n _{ i } $$ is the number of accessions selected for the core set from the ith cluster; $$N _{ i } $$ is the number of accessions in the ith cluster; n is the size of the core set; g is the total number of clusters; and the brackets ‘[ ]’ represent the nearest integer function.
(3) Random allocation of single entry (RASE). Here, no optimal number of clusters is determined and the accessions are grouped into a number of clusters equal to the size of the core set. A single entry is then selected from each cluster to construct the core set.
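The first two allocation rules can be illustrated with a short Python sketch. The function names and the three cluster sizes are hypothetical (only the totals of 219 accessions and a core set of 44 come from this study), and rounding halves upward is an assumed reading of the nearest integer function.

```python
import math


def nearest_integer(value):
    """Nearest integer, with halves rounded up (an assumed convention)."""
    return math.floor(value + 0.5)


def proportional_allocation(cluster_sizes, core_size):
    """n_i = [n * N_i / sum(N_i)] for each cluster."""
    total = sum(cluster_sizes)
    return [nearest_integer(core_size * size / total) for size in cluster_sizes]


def log_proportional_allocation(cluster_sizes, core_size):
    """n_i = [n * log(N_i) / sum(log(N_i))] for each cluster."""
    logs = [math.log(size) for size in cluster_sizes]
    total = sum(logs)
    return [nearest_integer(core_size * value / total) for value in logs]


# Hypothetical split of 219 accessions into three clusters; core set of 44.
sizes = [100, 80, 39]
proportional = proportional_allocation(sizes, 44)          # [20, 16, 8]
log_proportional = log_proportional_allocation(sizes, 44)  # [16, 15, 13]
```

The example shows the main difference between the two rules: log-proportional allocation draws relatively more accessions from the smallest cluster. Because each share is rounded separately, the allocated numbers need not always sum exactly to n, and a final adjustment may be required.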
Evaluation of a core set
For quantitative data, the efficiency of methodologies for the identification of a core set is evaluated by using different indices, namely mean difference (MD), variance difference (VD), variable rate (VR) and coincidence rate (CR) (Hu et al., 2000). For qualitative data, the aforementioned methodologies are evaluated using the index average polymorphic information content difference (APICD), expressed as a percentage, which is given by:
$$\begin{eqnarray} APICD = \frac {\vert \overline{ P _{e}} - \overline{ P _{c}}\vert }{\overline{ P _{e}}}\times 100, \end{eqnarray}$$
where $$\overline{ P _{c}} $$ and $$\overline{ P _{e}} $$ are the average polymorphic information content of the core set and the entire set, respectively.
A combined evaluation index (CEI) is proposed by combining the above-mentioned five indices to evaluate the diversity of the core set based on mixture data. The CEI is given by:
$$\begin{eqnarray} CEI = w _{1} M _{1} + w _{2} M _{2}, \end{eqnarray}$$
where $$M _{1} = [(100 - MD) + (100 - VD) + (100 - VR_{t}) + CR]/4 $$ , with $$VR_{t} = \vert 100 - VR\vert $$ ; $$M _{2} = (100 - APICD) $$ ; $$w _{1} = ( N _{quant}/ N _{T}) $$ and $$w _{2} = ( N _{qual}/ N _{T}) $$ . Here, $$N _{quant} $$ , $$N _{qual} $$ and $$N _{T} $$ are the numbers of quantitative, qualitative and total variables, respectively, with $$w _{1} + w _{2} = 1 $$ and $$N _{T} = N _{quant} + N _{qual} $$ . The CEI represents the percentage of resemblance between the core set and the entire set. The value of the CEI ranges between 0 and 100, with a value of 100 corresponding to the best representativeness of the entire population. The difference between the CEI values under the proportional allocation, log-proportional allocation and RASE methods for all the distance measures is tested by using a large sample z-test.
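The CEI calculation can be sketched as below. The sketch assumes that the index is the weighted sum w1·M1 + w2·M2, which is what the constraint w1 + w2 = 1 and the 0 to 100 range suggest; the function name and the index values in the example are hypothetical, and all five input indices are taken to be percentages.

```python
def combined_evaluation_index(md, vd, vr, cr, apicd, n_quant, n_qual):
    """CEI as the assumed weighted sum of M1 (quantitative) and M2 (qualitative)."""
    vr_t = abs(100.0 - vr)
    m1 = ((100.0 - md) + (100.0 - vd) + (100.0 - vr_t) + cr) / 4.0
    m2 = 100.0 - apicd
    n_total = n_quant + n_qual
    w1 = n_quant / n_total  # weight of the quantitative part
    w2 = n_qual / n_total   # weight of the qualitative part
    return w1 * m1 + w2 * m2


# Made-up index values with one quantitative and one qualitative variable.
example_cei = combined_evaluation_index(
    md=10.0, vd=20.0, vr=110.0, cr=90.0, apicd=5.0, n_quant=1, n_qual=1
)
```

With weights proportional to the number of variables, a dataset of 14 quantitative traits and 2915 SNPs gives w1 = 14/2929, or about 0.005, so under this weighting the CEI is driven almost entirely by the qualitative component M2.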
All the required coding was done in R software. For the consensus and merged-consensus clustering results, the ‘clusterCons’ package was used (Simpson, 2010), and to sample the accessions for the core set, the ‘ccChooser’ package (Studnicki and Debski, 2012) in R software was used.
Results
The consensus clustering methodology was applied on the qualitative and quantitative data separately using the three distance measures for quantitative data (A 1–A 3) and two distance measures for qualitative data (B 1 and B 2). In Fig. 1, the values of ΔK were plotted against the cluster number (k). As shown in Fig. 1, the data consisted of three and four groups based on the quantitative and qualitative distance measures, respectively. Thus, the use of qualitative or quantitative data alone may result in widely different core sets. Moreover, dropping or transforming (qualitative to quantitative and vice versa) any kind of the variables from the analysis may result in a potential loss of information.
The plot of ΔK values against the number of clusters formed by the combined qualitative and quantitative data is shown in Fig. 2. It can be observed that for all the combined distance measures, the peak value of ΔK for the number of clusters was found to be equal to 3. So, the problem of choosing the number of clusters while using qualitative or quantitative data alone, in the case of disagreement, can be resolved by considering the combined distance measures based on mixture data.
The average cluster robustness values for the different clustering methodologies and combined distance measures are given in Table 1. For a given combined distance measure, the clustering methodology with the highest average cluster robustness value was chosen for applying the sampling strategy in the development of a core set. It was found that the k-means clustering algorithm was suitable for grouping germplasm based on the combined distance measures A 1 B 1, A 3 B 1 and A 1 B 2, whereas the complete linkage clustering algorithm was suitable for grouping germplasm based on the combined distance measures A 2 B 1, A 2 B 2 and A 3 B 2.
To sample the accessions, three different allocation methods, namely proportional allocation, log-proportional allocation and RASE, were adopted. For the first two allocation methods, accessions were selected from the three clusters identified under each combined distance measure to develop a core set with 20% of germplasm from the entire collection. In contrast, the random sampling of a single entry from each of the 44 clusters (approximately 20% of the total number of germplasm) was done to develop a core set by ignoring the optimal number of clusters using the RASE method. To evaluate the efficiency of the procedures to identify the core set, 500 independent core collections were simulated under each sampling strategy.
The mean values of the CEI, over 500 independent simulation runs, under the proportional, log-proportional allocation and RASE methods are presented in Table 2. The absolute differences in CEI values between the proportional, log-proportional allocation and RASE methods were statistically tested and are given in Table 2. From Table 2, it is evident that the differences in CEI values between the proportional and log-proportional methods and between the proportional and RASE methods were significantly higher and hence the proportional allocation method was best among the three allocation methods for the identification of a diverse core set irrespective of the distance measures used. In addition, the differences in CEI values between the combined distance measures and the qualitative/quantitative distance measures under the proportional, log-proportional and RASE methods are given in Table 3. For the proportional allocation method, the value of the CEI was highest for A 1 B 2 among the combined distance measures. However, the CEI values of the distance measure A 1 B 2 were significantly different from those of A 1 to A 3, and at the same time they were not significantly different from the CEI values of B 1 and B 2 (Table 3). In contrast, for the log-proportional allocation method, the CEI value of A 1 B 2 was highest among the CEI values of all the distance measures (Table 2) and significantly different from the rest (Table 3). Moreover, a core set was constructed through heuristic methods using PowerCore (RDA-Genebank Information Center; http://www.genebank.go.kr/eng/PowerCore/PowerCore_Software.zip). The CEI value of the core set constructed by PowerCore was found to be 89.19, which was the lowest among all the combined distance measures under the proportional and log-proportional allocation methods.
Discussion
The use of qualitative and quantitative data separately to classify germplasm collections may result in different numbers of groups and, hence, different grouping patterns under each clustering methodology. Therefore, it is difficult to generalize the clustering patterns obtained from the analysis of qualitative and quantitative data separately. Moreover, dropping or transforming any type of data that has been generated by spending time and money may lead to loss of information. So, combining qualitative and quantitative information through distance indices is a beneficial way to handle such data. In the present study, six different combined distance measures were proposed and evaluated. Prior to combining the qualitative and quantitative distance matrices, the elements of each of these matrices were set on a uniform scale, i.e. ranging between 0 and 1. Occasionally, the core set is identified based on the degree of correspondence between the clustering patterns obtained from the analysis of qualitative and quantitative data separately. In addition, the classification depends on the clustering methodology used in grouping the data. Therefore, it is also important to combine qualitative and quantitative data at an early stage of the analysis to draw valid inferences. Moreover, many clustering algorithms have been used over time to generate a core set by applying a suitable allocation strategy. Odong et al. (2011) advocated the use of traditional clustering approaches over model-based clustering approaches to develop a core set, particularly for simple sequence repeat marker data.
However, the development of a core set based on phenotypic and SNP genotyping data together, by adopting a suitable procedure involving an appropriate combined distance measure, clustering methodology, number of clusters, allocation method and evaluation strategy, has rarely been reported. Hence, the present study was undertaken to find an end-to-end solution for the identification of a core set.
In the present study, the consensus and merged-consensus clustering results were used to identify the optimum number of groups. Although there was a disagreement between the optimum number of clusters based on the quantitative and qualitative data (i.e. three and four), the number of clusters for mixture data was found to be 3. With regard to the selection of the best-fitted clustering algorithm, the k-means clustering algorithm gave the highest average cluster robustness values for the combined distance measures A 1 B 1, A 3 B 1 and A 1 B 2. In contrast, the complete linkage clustering algorithm gave the highest average cluster robustness value for the combined distance measures A 2 B 1, A 2 B 2 and A 3 B 2 (Table 1).
Odong et al. (2013) reviewed different criteria, under different circumstances, for the evaluation of a core set. Frequently, it may be difficult to comment on the diversity of a core collection based on an individual index. To avoid such confusion, a combined measure may help in drawing valid conclusions. Thus, a CEI involving MD, VD, VR, CR and APICD is used to evaluate the diversity in a core set. A comparison among the CEI values under all the combined distance measures indicates the superiority of the proportional allocation method over the log-proportional and RASE methods. Classifying germplasm to the lowest level, i.e. by taking the number of clusters equal to the size of the core set, followed by the application of the RASE method for selecting accessions, does not provide any gain over the proportional allocation and log-proportional allocation methods for all the combined distance measures, barring a few exceptions (Table 2). Moreover, clustering germplasm with the number of clusters equal to the size of the core set leads to a violation of natural grouping in the clustering methodology. In addition, this will lead to bias in the selection of germplasm, as the probability of an accession being chosen from a smaller cluster is higher than that from a larger cluster. Hence, the identification of the optimum number of clusters based on sound statistical techniques followed by the selection of accessions based on the proportional allocation method is advisable for developing a diverse core set. Furthermore, Table 3 reveals that even though, in a majority of the cases, the combined distance measures performed better than the individual measures, under the proportional allocation method there were a few cases, such as A 1 B 2 versus B 2, A 2 B 2 versus A 2 and A 3 B 2 versus B 2, where the combined distance measures did not outperform the individual distance measures.
Similarly, under the log-proportional allocation method, the combined distance measures A 2 B 1 and A 2 B 2 did not outperform the individual measure A 2. Hence, it cannot be concluded that combined distance measures will always perform better than individual distance measures. However, in the present study, the combined distance measure A 1 B 2 performed best among the combined measures. Furthermore, the efficiency of the approach is established by a comparison with PowerCore (RDA-Genebank Information Center; http://www.genebank.go.kr/eng/PowerCore/PowerCore_Software.zip). This indicates the advantage of using the proposed approach for mixed data. Hence, the combined measure A 1 B 2 using the k-means clustering algorithm along with the proportional allocation method to sample accessions can be preferred for the identification of a core set from a collection of rice germplasm.
Acknowledgements
The authors wish to express their gratitude to the referees and the editor for important comments and suggestions, which improved the paper substantially. Mr Sarkar acknowledges the receipt of a fellowship from PG School, IARI, New Delhi during his Ph.D. study. The authors also wish to acknowledge the World Bank Funded – National Agricultural Innovation Project (NAIP), ICAR Grants NAIP/Comp-4/C4/C-30033/2008-09.