Introduction
Core collections are sampled and utilized for a variety of applications in crop improvement programmes. The diversity and size of a core collection play a crucial role in its effective utilization. A good core collection should represent maximum diversity in a minimum number of entries, without containing similar accessions (Krishnan et al., 2014). An array of sampling methodologies is in use, including stratum-based methods, genetic distance sampling (Jansen and van Hintum, 2007), the maximization method (Schoen and Brown, 1993), Core Hunter (De Beukelaer et al., 2012), genetic distance optimization (Odong et al., 2011), groupwise sampling (Guruprasad et al., 2014) and the Similarity Elimination (SimEli) method (Krishnan et al., 2014). All of these methodologies sample either a diverse or a representative core collection (Odong et al., 2013). A diverse core collection retains maximum diversity in a minimum number of entries, whereas a representative core collection preserves the genetic structure of the whole collection.
Since the advent of the core collection, the size of the core subset has been determined based on the size of the whole collection. It is assumed that the size of the core collection is proportional to the diversity of the whole collection (Bhattacharjee et al., 2006). The majority of studies have sampled 5–20% of entries irrespective of the diversity of the whole collection (Reddy et al., 2005). In some cases, core collections of different sizes (e.g. 10, 20 and 30%) were sampled and the best-performing core collection among them was selected (Wang et al., 2007). In this study, we precisely estimate the size of the core collection based on the diversity of the whole collection using the SimEli method. This approach was developed from our previously reported SimEli methodology (Krishnan et al., 2014). We therefore refer the reader to the SimEli article as a prerequisite for a proper understanding of this work.
Experimental
All computations were carried out in R (R Development Core Team, 2013) using either appropriate packages or our custom scripts. A genotypic dataset of 1014 coconut accessions profiled using 30 simple sequence repeat (SSR) markers was utilized in this study (Odong et al., 2011; Krishnan et al., 2014). SSR marker allele data were converted into allele frequencies and used to calculate modified Rogers' genetic distance. The SimEli method accepts any pairwise genetic distance between accessions in the whole collection and involves two steps: (1) selection criterion: the pair of accessions with the least distance is identified and (2) elimination criterion: one accession of the pair is eliminated according to one of several elimination criteria. In this study, the ‘accession to rest of the accession’ distance was used as the elimination criterion. Pairwise and mean genetic distances, and allele retention, were measured for each elimination cycle in the SimEli method (Fig. 1). We repeated this elimination cycle until all the accessions in the whole collection were eliminated. Our aim was to monitor the rate of change in diversity measures during the elimination process, and to determine the precise size of the core collection at which the diversity of the whole collection is retained in a minimum number of entries.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921031636382-0692:S1479262114000902:S1479262114000902_fig1g.gif?pub-status=live)
Fig. 1 Rate of change in allele retention, and pairwise and mean genetic distances during the elimination cycle. Green line represents pairwise genetic distance, black line represents mean genetic distance and blue line represents the number of alleles retained during each elimination cycle.
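The two SimEli steps described above can be illustrated with a minimal Python sketch. It is not the authors' R implementation. It assumes the commonly used form of modified Rogers' distance, the square root of the summed squared allele-frequency differences divided by twice the number of loci. It also assumes that, within the closest pair, the accession with the smaller mean distance to the remaining accessions is the one eliminated. The function names `modified_rogers` and `simeli_order` are hypothetical.

```python
import numpy as np

def modified_rogers(freqs, n_loci):
    """Pairwise modified Rogers' distance (assumed standard form).

    freqs: (accessions x alleles) matrix of allele frequencies,
           with the alleles of all loci concatenated column-wise.
    """
    gram = freqs @ freqs.T
    sq_norm = np.diag(gram)
    # Squared Euclidean distance via the Gram matrix (memory friendly).
    sq = sq_norm[:, None] + sq_norm[None, :] - 2.0 * gram
    return np.sqrt(np.clip(sq, 0.0, None) / (2.0 * n_loci))

def simeli_order(dist):
    """Return accession indices in the order they are eliminated."""
    d = np.array(dist, dtype=float)
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    alive = list(range(d.shape[0]))
    order = []
    while len(alive) > 1:
        sub = d[np.ix_(alive, alive)]
        # Step 1 (selection): the closest remaining pair.
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        # Step 2 (elimination): mean distance of each accession to the rest.
        finite = np.where(np.isinf(sub), 0.0, sub)
        mean_rest = finite.sum(axis=1) / (len(alive) - 1)
        # Assumption: drop the pair member that is closer to the rest.
        drop = i if mean_rest[i] <= mean_rest[j] else j
        order.append(alive.pop(int(drop)))
    order.append(alive.pop())            # last remaining accession
    return order
```

Recomputing the sub-matrix in every cycle keeps the sketch short but is slow for large collections; an implementation intended for 1014 accessions would update the distance sums incrementally.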
Results and discussion
A total of 173 alleles were recorded in the whole collection, and all the alleles were retained until the size of the core collection reached 532 accessions. One allele was lost when the 532nd accession was removed, and the rest of the alleles were retained until the size of the core collection was reduced to 266 accessions (26.23% of the whole collection). The eliminated allele is a very rare allele, recorded in only one of the 1014 coconut accessions. It was probably recorded owing to a scoring error or non-specific amplification. Below 266 accessions, allele retention decreased progressively with each subsequent elimination cycle (Fig. 1). Alleles with very low frequency were eliminated first, followed by rare alleles and, finally, common alleles (Fig. 2). Our results support the hypothesis that markers with rare and very rare alleles contribute far less to the genetic distance than markers with more common alleles. Moreover, the utility of these very rare alleles in crop improvement programmes is debatable (Odong et al., 2011; Zhang et al., 2011).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241024014551-98748-mediumThumb-gif-S1479262114000902_fig2g.jpg?pub-status=live)
Fig. 2 Rate of change in allele retention. A total of 173 alleles in the whole collection were grouped into ten quantiles with an equal number of alleles. Retention of alleles in the ten quantiles was measured during the elimination cycle.
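The grouping used in Fig. 2 can be approximated as follows. This is a sketch under two assumptions not stated in the caption: that alleles are ranked by their frequency in the whole collection, and that the ten groups are only near-equal in size, since 173 is not divisible by ten. The function name `frequency_quantile_groups` is hypothetical.

```python
import numpy as np

def frequency_quantile_groups(allele_freq, n_groups=10):
    """Split allele indices into near-equal groups ordered by frequency.

    allele_freq: 1-D array with the frequency of each allele in the
                 whole collection.
    Returns a list of index lists, from rarest to most common alleles.
    """
    ranked = np.argsort(allele_freq, kind="stable")   # rarest first
    return [chunk.tolist() for chunk in np.array_split(ranked, n_groups)]
```

With 173 alleles and ten groups, this yields three groups of 18 alleles and seven groups of 17. Retention within a group can then be tracked by counting how many of its alleles are still present after each elimination cycle.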
During the initial elimination cycles, the mean genetic distance of the core collection decreased steadily owing to the elimination of similar diverse accessions. These accessions are genetically distant from the rest of the collection but are likely to be genetically close to each other. After these similar diverse entries had been eliminated, the mean genetic distance increased steadily until the size of the core collection reached 19 accessions. The rate of change in pairwise genetic distance was greatest between the initial and final elimination cycles, indicating that the majority of the accessions were 0.4–0.6 genetic distance apart.
On the basis of these observations, core collection size can be precisely estimated by monitoring the change in allele retention, and in mean and pairwise genetic distances. To achieve maximum allelic richness, the size of the coconut core collection can be set to 266 (26.23%), at which all the SSR alleles in the whole collection were retained in the core collection. As discussed previously (Krishnan et al., 2014), the allelic richness of the core collection can be increased by using expected heterozygosity (Hₑ) as the elimination criterion. For sampling distant entries, pairwise and mean genetic distances can be used instead of allelic richness to determine the size of the core collection. The choice of criterion should therefore depend on the objective of the core collection, such as whether to sample a collection with high allelic richness or with high pairwise genetic distance among the entries. The presize approach can be efficiently utilized to study the contribution of trait variations or alleles to the diversity of the whole collection and to precisely estimate the size of the core collection.
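The size estimate based on allele retention can be sketched in Python as below. The sketch assumes an elimination order is already available (for example, from a SimEli-style procedure) and that an allele counts as retained while at least one remaining accession carries it. The function names and the `tolerated_loss` parameter are hypothetical; the parameter is included only to show how a very rare, possibly erroneous allele could be disregarded.

```python
import numpy as np

def alleles_retained_curve(freqs, elimination_order):
    """Number of alleles still present after each elimination.

    freqs: (accessions x alleles) allele-frequency matrix.
    Returns a list where entry k is the allele count after k eliminations.
    """
    present = (np.asarray(freqs) > 0).astype(int)
    carriers = present.sum(axis=0)        # accessions carrying each allele
    curve = [int((carriers > 0).sum())]
    for idx in elimination_order:
        carriers = carriers - present[idx]
        curve.append(int((carriers > 0).sum()))
    return curve

def smallest_core_size(curve, n_accessions, tolerated_loss=0):
    """Smallest core size retaining all but `tolerated_loss` alleles."""
    target = curve[0] - tolerated_loss
    # The curve never increases, so take the last step that meets the target.
    last_ok = max(k for k, count in enumerate(curve) if count >= target)
    return n_accessions - last_ok
```

As a check on the reported proportion, `round(100 * 266 / 1014, 2)` gives 26.23, matching the percentage stated for the 266-accession core collection.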
Acknowledgements
The authors thank the reviewers for their constructive and insightful comments. They thank the Generation Challenge Program for providing the coconut dataset in the public domain (http://gcpcr.grinfo.net). The authors also thank all the researchers involved in the generation of the dataset. They acknowledge the use of the adegenet, cluster, ggplot2, Hmisc and reshape2 R packages.