Introduction
Seed banks are aimed to maintain genetic diversity and to have ways for monitoring and evaluating populations (Govindaraj et al., Reference Govindaraj, Vetriventhan and Srinivasan2015). As seed banks grow in the quantities of accessions they handle, a necessity to select manageable subsets emerges. That motivated to define a core collection as a subset of accessions that represents, with minimum redundancy, the genetic diversity of the original collection (Frankel and Brown, Reference Frankel, Brown, Holden and Williams1984).
Several analytical strategies have been proposed for core subset selection, which use phenotypic, geographic and genotypic datasets. Those methods can be classified into two types. The stratified sampling strategy, which involves the conformation of groups by cluster analysis, and then sampling those clusters (Franco et al., Reference Franco, Crossa, Villaseñor, Taba and Eberhart1998), and direct optimization of some objective function applied to the whole collection, an approach that has demonstrated to be effective (Schoen and Brown, Reference Schoen and Brown1993; Thachuk et al., Reference Thachuk, Crossa, Franco, Dreisigacker, Warburton and Davenport2009).
In the realm of direct optimization, Core Hunter (Thachuk et al., Reference Thachuk, Crossa, Franco, Dreisigacker, Warburton and Davenport2009) was presented as a tool with an advanced stochastic local search (SLS) algorithm for selecting core subsets based on molecular markers. It aims to maximize an objective function that may contain several criteria. It uses an advanced SLS algorithm: replica exchange Montecarlo (REMC) (Geyer, Reference Geyer and Keramidas1991; Kimura and Taki, Reference Kimura, Taki, Vichneetsky and Miller1991; Iba, Reference Iba2001). It performs replica subsetting, introduces perturbation and does replica exchanges with selection by the Metropolis criterion. This approach was tested with two molecular datasets of maize, comprising a maximum of 521 markers and 209 alleles, and it demonstrated to outperform MSTRAT (Gouesnard et al., Reference Gouesnard, Bataillon, Decoux, Rozale, Schoen and David2001), the D-method (Franco et al., Reference Franco, Crossa, Taba and Shands2005) and Power Core (Kim et al., Reference Kim, Chung, Cho, Ma, Chandrabalan, Gwag, Kim, Cho and Pak2007).
An innovation of the Core Hunter approach was presented with the Mixed Replica Algorithm (MixRep) (De Beukelaer et al., Reference De Beukelaer, Smýkal, Davenport and Fack2012), which runs heterogeneous replicas. It was tested with plant datasets containing molecular marker data, with a maximum number of alleles of 282 and a maximum number of samples of 4429. It was shown that the MixRep gives cores with equal or higher diversity scores than REMC, and often outperforming it in computing speed. However, REMC and MixRep have not yet been tested with a large dataset of accessions and genetic markers, with a design suitable for hypothesis testing.
A large set of accessions characterized by many genetic markers is the collection of landraces introduced into Mexico from Europe. From them, 8616 have been characterized by Dart-seq technology, with availability of 20,526 quality single-nucleotide polymorphism (SNPs), as a part of the CIMMYT Seeds of Discovery initiative (Vikram et al., Reference Vikram, Franco, Burgueño-Ferreira, Li, Sehgal, Saint Pierre, Ortiz, Sneller, Tattaris, Guzman, Sansaloni, Ellis, Fuentes-Davila, Reynolds, Sonder, Singh, Payne, Wenzl, Sharma, Bains, Singh, Crossa and Singh2016). Data are publicly available in the CIMMYT web page (http://www.cimmyt.org/) (Singh et al., Reference Singh, Sansaloni, Petroli, Ellis and Kilian2014). With this collection, core reference subsets have been assembled from the 7986 hexaploid accessions, by using a combination of SNP alleles, categorical and continuous phenotypic variables (Vikram et al., Reference Vikram, Franco, Burgueño-Ferreira, Li, Sehgal, Saint Pierre, Ortiz, Sneller, Tattaris, Guzman, Sansaloni, Ellis, Fuentes-Davila, Reynolds, Sonder, Singh, Payne, Wenzl, Sharma, Bains, Singh, Crossa and Singh2016).
The objective of this work is to compare among eight simple heuristic optimization methods, plus REMC, MixRep and random sampling, for core subset selection from the large collection of Mexican wheat landraces characterized by SNP markers.
Materials and methods
Data from the collection of Mexican wheat landraces were downloaded from the web page http://www.cimmyt.org/ (Singh et al., Reference Singh, Sansaloni, Petroli, Ellis and Kilian2014). The subset of 7986 hexaploid accessions was selected, which correspond to the ones used by Vikram et al. (Reference Vikram, Franco, Burgueño-Ferreira, Li, Sehgal, Saint Pierre, Ortiz, Sneller, Tattaris, Guzman, Sansaloni, Ellis, Fuentes-Davila, Reynolds, Sonder, Singh, Payne, Wenzl, Sharma, Bains, Singh, Crossa and Singh2016) in a study of genetic diversity of Creole wheats. Data are binary, with 0 denoting absence and 1 presence of a given allele. For homozygous loci, the binary notation was kept, whereas for heterozygous loci 0.5 was assigned to each allele, to be consistent with an allele frequency scale. For each direct optimization method tested and four objective functions (which we name ‘conditions’), three replications were performed, each one based on a random sample of 1500 loci (3000 SNP alleles) extracted from the total of SNP 20,526 loci available in the dataset. Data filtering and random loci selection were performed with the aid of the language and environment for statistical computing R (R Core Team, 2016).
To perform optimizations for core subset selection, the software Core Hunter 2.0 (De Beukelaer et al., Reference De Beukelaer, Smýkal, Davenport and Fack2012) was used through its R implementation, which has 13 search methods available. From them, two methods were excluded due to their high demand of time and system resources, which make them unsuitable for large datasets: exhaustive search and sequential backward selection. The following search strategies were tested: standard local search (Local), deterministic LR greedy search (LR), LR search with random first pair (LRSemi), parallel mixed replica search (MixRep), heuristic steepest descent (MSTRAT), random core set (Rand), Replica Exchange Monte Carlo (REMC), sequential forward selection (SFS), SFS with random first pair (SFSSemi), steepest descent search (Steepest) and tabu search (Tabu). For each assay, a core subset of 798 accessions was obtained, which represents nearly 10% of the collection.
For each tested method, four sets of weighting coefficients were used to include the following two criteria: modified Rogers Distance (MR) and Shannon diversity index (SH). The four objective functions were defined with 70% MR and 30% SH (default values in the software), 0% MR and 100% SH, 100% MR and 0% SH, and 50% MR and 50% SH. Since each of the four conditions was tested with three random marker sets, a total of 12 core subset selections were performed for each method. Four criteria were used to evaluate each core subset and the computer process: mean modified Rogers Distance between all pairs of accessions (MR), Shannon diversity index (SH), allele richness (AR) and computing time. Boxplots, multidimensional scaling graphics, hierarchical clustering, correlograms as well as Tukey's HSD (honest significant difference) tests were used to compare among the different methods.
The modified Roger's distance (Goodman and Stuber, Reference Goodman and Stuber1983) is essentially Euclidean, and it is defined as follows:
where p ij and q ij are allele frequencies of the jth allele at the ith locus in two accessions under consideration, k i is the number of alleles at the ith locus and m refers to the number of loci.
The so-called Shannon diversity index is originally the Shannon entropy (Shannon, Reference Shannon1948; Reyes-Valdés, Reference Reyes-Valdes and Kantartzi2013), and has the following expression for the ith locus:
For optimization and evaluation, SH was calculated as the sum of SH i across all loci for each sample of SNP markers. AR was calculated as the relative number of alleles in a core subset, compared with the number of alleles in the whole collection for the reference set of SNP markers. The values of AR are represented as percentages.
The optimization methods for core subset selection were run in a dual 2.5 GHz Intel Core i5 MacBook Pro processor, with 4 GB of RAM.
Results
Comparative scores
Results of the evaluations of the different search strategies through three random samples of 1500 SNP loci are summarized in Tables 1 and 2. When the objective function contained only SH (Table 1), the average value of this criterion was greatest for the LR, LRSemi and SFS methods, all of them being consistently best across the three replications (see online Supplementary material, Table S1). For AR, which was not included in the objective function, LR, LRSemi, SFS, SFSSemi and MixRep were the best performing, with each of them being consistently best in two of the three replications. For MR, which was not included in the objective function, the LR, LRSemi and SFS methods showed the highest average, and performed best in two of the three replications, with the Local method giving the best score for Replication 2. When the objective function contained only MR (Table 1), the average value of this criterion was the highest for the LR, LRSemi and SFS methods, while they were matched by SFSSemi in Replication 1. For AR, the LR, LRSemi, SFS and SFSSemi methods showed the highest averages. For SH, which was not included in the objective function, LR, LRSemi and SFS showed the best averages, with the three of them being consistent across the three replications.
The best score values are marked in bold.
MR, mean Modified Rogers distance; SH, Shannon diversity index; AR, Allele richness; ObFun, Objective function; ObFunSt, Standardized objective function.
The best score values are marked in bold.
MR, mean Modified Rogers distance; SH, Shannon diversity index; AR, allele richness; ObFun, objective function; ObFunSt, standardized objective function.
In the software default option, which assigns weights of 70 and 30% to MR and SH, respectively (Table 2), the LR, LRSemi and SFS methods showed the best average scores for the value of the objective function. For Replication 3, those methods were matched by SFSSemi. For AR, the LR, LRSemi and SFSSemi methods outperformed all other strategies, with the three of them being consistent across the three replications. When MR and SH were equally weighted (Table 2), the LR and LRSemi methods showed the best averages for the objective function, being matched by SFS and SFSSemi in two replications. For AR, the LR, LRSemi, SFS and SFSSemi methods had the highest average scores, although MixRep performed best for AR in the third replication.
The LR and LRSemi methods had the best scores for the objective function in the 12 assays, while they were matched by SFS in 11 of 12 assays. Although AR was not part of the objective functions, LR and LRSemi performed best for this criterion in 10 of 12 assays. In general, the best performing methods for the criteria contained in the objective functions were LR, LRSemi and SFS, with the fastest one of them being LRSemi. However, SFSSemi, which showed high scores too, is approximately twice as fast as LRSemi.
The MR and SH values of the whole collection, reported in Tables 1 and 2, cannot be considered as the theoretical maxima for core subsets, because they do not necessarily grow with the number of accessions. In fact, they are surpassed by most of the core subsets. The only parameter that reach its maximum in the whole set of accessions is AR, with 100%.
To investigate the origin of collateral effects of optimization, where criteria appeared relatively maximized even when they were not part of the objective function, the 132 core subsets generated in this research were used to estimate correlations among the three criteria employed here. The three correlations were statistically significant (P < 2.2 × 10−16): MR with SH, r = 0.96; MR with AR, r = 0.82; and SH with AR, r = 0.87. These results explain the collateral effects of the different objective functions defined for the set of comparisons.
Boxplots for the standardized value of the objective function reached by each method within each condition, are depicted in Fig. 1. A similar pattern can be observed for the conditions 1 MR + 0 SH, 0.7 MR + 0.3 SH, and 0.5 MR + 0.5 SH, with the LR, LRSemi, SFS and SFSSemi methods standing on top of the plot, MixRep appearing close below them, the Local method showing up at the middle, and the remaining methods appearing as a group at the bottom of the plot. However, this pattern is different for the condition 0 MR + 1 SH, i.e. when only the Shannon diversity was used as a maximization criterion. In this situation, Local, LR, LRSemi, MixRep, SFS and SFSSemi appear at the top, REMC appears at the middle and the remaining methods at the bottom.
The analysis of variance for a model containing the terms method, condition and method × condition gave highly significant results for the three sources of variation, always with P < 2.2 × 10−16. For this analysis, the response variable, i.e. the objective function, was transformed to its square root, to approach a normal distribution for the ANOVA residuals. In Table 3, the average values of the objective function and they square roots, are shown for each method, along with the grouping for a Tukey's HSD test performed with an alpha value of 0.05. The group composed by the LR, LRSemi, SFS and SFSSemi methods showed the maximum efficiency, with LR and LRSemi having identical results for the average objective function. In second place appears the MixRep method, showing a behavior above the average of all strategies. In third place appears the Local method, with a behavior slightly less than average. Finally, in fourth place appears the group composed by REMC, MSTRAT, Tabu, Steepest and Rand. As expected, the Rand method had the lowest observed average score. Among the members of the first statistical group, the fastest method was SFSSemi. However, between the two methods that always performed best, the fastest was LRSemi. The MixRep method, which conforms the second statistical group, was faster than all members of the first group; in fact, it was 13.5 times faster than LRSemi. For the remaining two groups, the Local strategy was the fastest, after discarding the Rand method, which is not an optimization per se.
ObFun, objective function; SqrObFun, square root of objective function.
Attributes and coincidences of the core subsets
Figure 2 represents a core subset generated by the LRSemi strategy, which had the best scores along with LR in all assays, but with a higher speed. For this representation, the core subset was generated with an objective function that assigns the same weights to MR and SH. The multidimensional scale plot generated with 1500 SNP loci in Fig. 2(a), depicts all the accessions in the collection, with the members of the core subset in red asterisks. Although this is a bi-dimensional representation of a highly dimensional space, it shows that the members of the core subset represent well the space of genetic variation. A different approach for graphic representation is depicted in Fig. 2(b), through an UPGMA hierarchical cluster plot. Although a large amount of wheat lines is represented in this plot, it can be noted that all groups tend to be represented in the core subset.
To evaluate the coincidence between the core subsets generated by the various strategies evaluated through this research, the number of common accessions was quantified for the subsets selected by the different methods in Replication 1. For instance, when the objective function contained both criteria, MR and SH, equally weighted (see online Supplementary material, Table S2), the LR and LRSemi methods had a coincidence of 100% between them, and composed a group with SFS and SFSSemi with coincidences of 99.62% and above. The MixRep strategy generated a core set with coincidences above 55% with the best four methods. The Local strategy had coincidences above 16% with the members of the group composed by LR, LRSemi, SFS, SFSSemi and MixRep. The group that showed the lowest scores, composed by REMC, Steepest, MSTRAT, Tabu and Rand, had a maximum coincidence of 11.4% with all methods. Correlograms that represent the coincidences among all methods for Replication 1, under all conditions, are depicted in Fig. 3. A consistent pattern emerged for all conditions, although MixRep was remarkably coincident (>80%) with LR, LRSemi, SFS and SFSSemi, when SH was the only component of the objective function (Fig. 3(a)), an attribute consistent with the boxplot in the left upper corner of Fig. 1. The general grouping that emerges from core subset coincidences resembles the structure given by the Tukey's HSD test.
A core set with 1133 lines was generated by LRSemi with the first sample of 1500 SNP loci, assigning equal weights to MR and SH, to compare it with the one reported by Vikram et al. (Reference Vikram, Franco, Burgueño-Ferreira, Li, Sehgal, Saint Pierre, Ortiz, Sneller, Tattaris, Guzman, Sansaloni, Ellis, Fuentes-Davila, Reynolds, Sonder, Singh, Payne, Wenzl, Sharma, Bains, Singh, Crossa and Singh2016). A total of 171 out of 1133 accessions were common between both cores, i.e. a coincidence of 15%. Although this value is low, we must consider that in the core subset generated in the cited work, they used a combination of SNP markers and categorical and continuous phenotypic traits. Furthermore, in that paper the SNP information was not used directly, but reduced to 2000 principal components, and then six principal axes of a hierarchical multiple-factor analysis (HMFA) were selected to represent genotypic and phenotypic variance in proportions of 75 and 25%, respectively. They followed a stratified sampling strategy with the number of accessions per group estimated by the D-method. The scores of the published core for the reference set of 1500 SNP loci were MR = 0.322, SH = 7.595 and AR = 97.59%. For the core generated by LRSemi, the scores were MR = 0.353, SH = 7.627 and AR = 99.09%. Although the core subset selected by LRSemi showed better MR, SH and AR values, we must take into account that both cores were selected under different criteria.
Discussion
The LR and LRSemi methods performed consistently well in all assays, and they worked well even for the criteria not included in the objective function, due to correlations among the respective parameters. They performed identically, but with LRSemi being considerably faster. Close to them, forming a single statistical group, appear SFS and SFSSemi as efficient algorithms to maximize the objective function, with SFSSemi being the faster. The MixRep method, although in the second statistical group, could be considered as a good choice only when computer time is an important limitation, also it is among the best strategies when the objective function is formed only by SH. When time is not extremely critical, LRSemi can be considered as a standard for the criteria evaluated in this work, because it consistently showed a good performance and moderate speed. The Local strategy, although a very fast method, does not appear to be a good maximization choice, because it had a less than average efficiency scores. Finally, at least under the characteristics of the data herein analysed, the REMC, MSTRAT, Tabu and Steepest strategies, do not seem to be a good choice, because they form a single statistical group with the use of a random extraction. It is worth remarking that the best performing methods collaterally aided to maximize criteria not included on the objective function.
The LR search, which was always consistently at the top of efficiency, is a greedy deterministic algorithm that does not take any random decision. It starts with the empty solution and iterates according to pre-determined parameters l and r, until the desired core size has been reached. In the implementation for Core Hunter 2.0, the conditions of the search are set in l = 2 and r = 1. In the LRSemi strategy, which performed as good as LR, the first pair of accessions is chosen randomly, thus being a semi-deterministic algorithm. As we have seen in the Results section, these two methods performed always on top of efficiency for the objective function. In spite of the high dimensionality of the genetic marker data, the graphic analysis offers an indication of a good coverage of the core set selected by LRSemi in the space of all accessions. Even when LRSemi is not totally deterministic, it showed all times a 100% coincidence with LR. On the other hand, the structure of the coincidence matrices for all methods reinforces the statistical grouping found by the Tukey's HSD test, being a good indication of the consistency of our results. Then, we must conclude that the group formed by the strategies LR, LRSemi, SFS and SFSSemi is a set of effective and consistent methods for selecting core subsets, at least from large collections of highly homozygous lines, genotyped with a high coverage by biallelic markers, as is the case of SNPs.
Small variations in MR, SH and AR are typical when comparing core selection methods (see for example De Beukelaer et al., Reference Franco, Crossa, Villaseñor, Taba and Eberhart2012); however, our design allowed discrimination among several strategies, even when the size of the core was large. Also, one must consider that small variations, e.g. in AR, can have a considerable impact on the richness of the subset. For example, a difference in 3% for AR between two methods can involve a difference up to 90 alleles between the two respective cores for the number of SNPs considered in this work.
To the best of our knowledge, this comparison of optimization approaches for core subset selection with genetic marker data is so far the one performed with the largest dataset, both in number of accessions and marker alleles. In fact, each of the three assays used herein, were performed on data sets with 23,958,000 points, while the largest reported in De Beukelaer et al. (Reference De Beukelaer, Smýkal, Davenport and Fack2012) contained 200,364, from the flax data set comprising 708 samples and 282 alleles. In that work, it was observed that REMC was outperformed by simpler methods in several experiments. In fact, REMC never outperformed LR, which is consistent with our results. For the larger pea dataset REMC and MSTRAT resulted in worse scores than Local search. They also found the Local search to be faster than the REMC method. When Local, MSTRAT, LR and REMC were compared for minimum distance, LR showed leading results too. Although those results are coincident with our findings, the comparisons between Local, MSTRAT, LR and REMC were not as contrasting in De Beukelaer et al. (Reference De Beukelaer, Smýkal, Davenport and Fack2012) as in this work, probably because of the much larger data set we used. In addition, in this work MixRep was for the first time systematically compared with the simple heuristics. Furthermore, our replication schema allowed us to perform statistical comparisons between methods.
The core subset of 1133 accessions that we selected by the LRSemi strategy, to make a comparison with the one selected by Vikram et al. (Reference Vikram, Franco, Burgueño-Ferreira, Li, Sehgal, Saint Pierre, Ortiz, Sneller, Tattaris, Guzman, Sansaloni, Ellis, Fuentes-Davila, Reynolds, Sonder, Singh, Payne, Wenzl, Sharma, Bains, Singh, Crossa and Singh2016), showed the efficiency of LRSemi, albeit the selection criteria were different.
One of the justifications for the development of the MixRep method, pointed out by in De Beukelaer et al. (Reference De Beukelaer, Smýkal, Davenport and Fack2012), is that LR becomes slow when run for relatively large datasets. However, as shown by our results, LRSemi performed as good as LR in large datasets, and took an average of 139 min to select core subsets with a selection pressure of 10% from tables of 7986 accessions with 1500 SNP loci, being a reasonable amount of computing time. As is the case of MixRep, the LR and LRSemi methods can take objective functions with other criteria, like minimum distance.
The increasing computing power is making feasible to efficiently apply direct heuristic algorithms for core subset selection, which can be efficient even for large data sets, and that can outperform complex strategies aimed to expedite the computing process. The LRSemi and SFSSemi methods are examples of simple heuristic algorithms that can be effectively used to develop core subsets from large collections, characterized by many marker loci.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S1479262117000247.
Acknowledgements
This research was funded by Universidad Autónoma Agraria Antonio Narro. We thank Consejo Nacional de Ciencia y Teconología of Mexico for granting a graduate scholarship to C.L. Acuña-Matamoros, and the MasAgro and CIMMYT Seeds of Discovery initiative for allowing access to the wheat SNP data.