Cassava is the sixth most important staple food crop after wheat, rice, maize, potato and barley, feeding more than 800 million people in the poorest tropical countries worldwide (Srinivas and Anantharuman, 2000). Among the starchy staples, it has 40% more carbohydrate than rice and 25% more than maize, making it the cheapest source of calories for human nutrition and animal feeding. The part most commonly used for food is the starchy root, but the leaves and shoots are high in protein and are also consumed by humans and used as animal feed. Although native to the Neotropics (Olsen and Schaal, 2001), cassava is cultivated in the tropical and subtropical regions of Africa, Asia and Latin America. Africa alone produces over 50% of the world's cassava (FAO, 2009). The crop is also used by industries for starch-based products, such as alcohol (Sriroth et al., 2000; Tonukari, 2004).
The Genetic Resources Center of the International Institute of Tropical Agriculture (IITA) maintains in trust 2544 accessions from 28 countries (Table 1) as a field collection, consisting of landraces and breeding lines. There are obvious gaps in the collection, which lacks representation from many cassava-growing countries, and efforts are underway to collect or assemble new germplasm from several countries. However, the maintenance of a vegetatively propagated crop is laborious and expensive, posing a severe limitation on the size of the collection. Furthermore, the available diversity has not been adequately used in improvement programmes, as is also evident in many other crops, such as maize (Dowswell et al., 1996), groundnut (Jiang and Duan, 1998), chickpea (Upadhyaya et al., 2006) and pearl millet (Bhattacharjee et al., 2007). One of the strategies suggested for increased utilization and better management of plant germplasm is to establish smaller subsets derived from original collections, called active working collections (Harlan, 1972) or core collections (Frankel, 1984).
The establishment of a core collection depends on the size of the original collection, the quality of the characterization data, the stratification of the original collection and the sampling strategy used to select samples from each group (Cochran, 1977). Ideally, a core collection should contain 10% of the accessions and represent 70% of the genetic diversity of the original collection (Brown, 1989a, b). This also holds for clonal crops; however, the size of the core collection need not be fixed at 10%, and a higher or lower percentage may be considered (Brown, 1995). A good core collection should therefore represent the maximum diversity with a minimum of redundancy, and be small enough for easy management (Brown, 1989a).
A cassava core collection consisting of 630 accessions has already been established by Hershey et al. (1994) at the Centro Internacional de Agricultura Tropical (CIAT). Another core collection was established by Cordeiro et al. (1995) from the collection maintained at Empresa Brasileira de Pesquisa Agropecuária (EMBRAPA). However, both represented germplasm accessions mainly from Latin America, the primary origin of this crop. Considering the importance of cassava in Africa and the collection maintained at IITA, this study was carried out to establish a core collection for the increased use of germplasm accessions in the improvement programmes.
Materials and methods
Plant material
IITA presently maintains 2544 cassava accessions in its field bank, collected/assembled from 28 countries. The collection mainly consists of landraces and breeding lines. These accessions were characterized at two locations in Nigeria, IITA–Ibadan (latitude 7.493793, longitude 3.901932) and IITA–Ubiaja (latitude 6.655986, longitude 6.383279). The locations are characteristic of cassava-growing areas and are also notable for the expression of certain agro-morphological traits, such as flowering. A total of 1766 accessions from Ibadan and 1890 accessions from Ubiaja were considered for statistical analysis after removing all accessions that had more than 25% missing information across the traits under study. There were 1503 accessions common to both locations; an additional 263 accessions were characterized only at Ibadan and 387 only at Ubiaja. This difference in sample size between the two locations was due to poor germination and the unavailability of enough stem cuttings for those accessions at the time of planting. The accessions at both locations were planted using a ridge-and-furrow system. Each ridge represented an accession, with five plants per ridge and no replication. The distances maintained were 1 m between two ridges and 0.5 m between two plants within a ridge. At Ibadan, three checks (4(2)1425, 30572 and TMe-419) were randomly planted after every 29 accessions. At Ubiaja, only one check (TMe-1) was used after every 35 accessions. The experiment was planned based on an augmented design wherein each accession was planted only once and the check was repeated after a constant number of entries within a block. No fertilizer was applied at either location.
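The 25% missing-data screen described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the accession names and trait values are hypothetical, missing values are represented as None, and the function name keep_accession is not taken from the study.

```python
def keep_accession(traits, max_missing=0.25):
    """Return True if no more than `max_missing` of the trait values are missing."""
    missing = sum(1 for value in traits if value is None)
    return missing / len(traits) <= max_missing


# Hypothetical records: accession name -> trait values (None = not recorded).
records = {
    "ACC-A": [1.2, 3, "green", 0.5],      # 0 of 4 missing -> retained
    "ACC-B": [None, 3, "purple", 0.7],    # 1 of 4 missing (25%) -> retained
    "ACC-C": [None, None, "green", 0.4],  # 2 of 4 missing (50%) -> removed
}

retained = [name for name, traits in records.items() if keep_accession(traits)]
print(retained)
```

An accession with exactly 25% missing values is retained here, because the text removes only those with more than 25% missing information.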
Data were recorded in Ibadan and Ubiaja (numbers for Ubiaja in parentheses) on 40 (36) agro-morphological traits, consisting of eight continuous and 32 (28) discrete variables, the latter comprising 14 (12) ordinal and 18 (16) nominal variables (Supplementary material, available online only at http://journals.cambridge.org) (Fukuda et al., 2010). The Ubiaja location was mainly included to characterize the flowering components of the accessions under study. Ubiaja was identified as a unique location where cassava genotypes, including landraces, flower freely (Dixon et al., 2001).
Establishment of a core collection
The statistical analysis for establishing the core collection was carried out in sequential stages, considering the nature of the available characterization data and the difference in the number of accessions assessed at the two locations. The objective was to select a subset of accessions as diverse as possible, using the relevant information from the available variables (continuous, nominal and ordinal), and to include the genotype by environment (G × E) interaction in the data analysis. The analysis combined five methods: (1) the hierarchical multiple factor analysis (HMFA) (Le Dien and Pagès, 2003), which accommodated different kinds of variables (categorical and continuous) to generate new coordinates per accession using principal axes and allowed a balanced effect of the different variables on the established groups or clusters; (2) the three-way analysis (Basford and McLachlan, 1985; Franco et al., 2003), which considered the effect of G × E interaction in the clustering process; (3) the mixture of normal distributions method (Basford and McLachlan, 1985), applied for clustering to the coordinates obtained from HMFA and three-way analysis; (4) the linear discriminant function (Mardia et al., 1979), which measured the distance between an accession and each basic group generated from the accessions common to both locations, and was applied to all additional accessions evaluated in one location but not in the other, so that these could be assigned to a basic group and all accessions, with all available information, could be included in establishing the core collection; and (5) the D allocation method (Franco et al., 2005, 2006), used to select accessions from each cluster/group in proportion to within-cluster diversity.
The first stage of the analysis consisted of steps to select a ‘basic core collection’. This involved classifying the 1503 accessions common to both locations into groups/clusters using a numerical classification methodology that combined the three-way HMFA analysis with the mixture of normal distributions clustering method (Basford and McLachlan, 1985). The three-way method (Basford and McLachlan, 1985; Franco et al., 2003) allowed the inclusion of the differential effect of locations/environments in the classification by adding information from each location to the data matrix, thus including the effect of G × E interaction in the formation of groups/clusters (Franco et al., 1999, 2003). The HMFA method (Le Dien and Pagès, 2003) allowed the mixture of different types of variables and a 50:50 balance between the effects of continuous and discrete variables on the clustering process (Franco et al., 2010). All the variables, continuous and discrete, were transformed into principal axes explaining more than 90% of the variability (inertia) present in the original data set. The mixture of normal distributions clustering method was then applied to the principal axes scores to obtain the grouping of the 1503 accessions, and entries were selected from each group/cluster in proportion to within-cluster diversity, measured by Gower's distance (Gower, 1971). A re-sampling (bootstrapping) process was then carried out, wherein 1000 candidate basic cores (each consisting of 301 accessions) were randomly and independently selected using stratified random sampling. Finally, the candidate basic core collection with maximum genetic diversity among the 1000 re-sampling results was selected.
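The re-sampling step can be illustrated with a minimal Python sketch, assuming that diversity is scored as the mean pairwise Gower distance of a candidate. The toy clusters, the per-cluster sample sizes, the trait ranges and the function names are hypothetical; ordinal scores are treated here like range-scaled numeric values, which is one common convention and not necessarily the one used in the study.

```python
import random
from itertools import combinations


def gower(a, b, kinds, ranges):
    """Gower distance between two accessions described by mixed trait types."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":   # continuous or ordinal score: range-scaled difference
            total += abs(x - y) / rng if rng else 0.0
        else:               # nominal trait: simple mismatch (0 or 1)
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


def mean_distance(rows, kinds, ranges):
    """Average Gower distance over all pairs of accessions in `rows`."""
    pairs = list(combinations(rows, 2))
    return sum(gower(a, b, kinds, ranges) for a, b in pairs) / len(pairs)


def best_candidate(clusters, sizes, kinds, ranges, n_draws=1000, seed=0):
    """Draw stratified random candidates and keep the most diverse one."""
    rng = random.Random(seed)
    best, best_div = None, -1.0
    for _ in range(n_draws):
        candidate = []
        for members, size in zip(clusters, sizes):
            candidate.extend(rng.sample(members, size))  # sample within each cluster
        div = mean_distance(candidate, kinds, ranges)
        if div > best_div:
            best, best_div = candidate, div
    return best, best_div


# Hypothetical toy data: (numeric trait, nominal trait) per accession.
clusters = [
    [(1.0, "green"), (1.5, "green"), (2.0, "purple")],
    [(8.0, "purple"), (9.0, "green"), (10.0, "purple"), (9.5, "purple")],
]
sizes = [1, 2]            # accessions to draw from each cluster
kinds = ["num", "nom"]
ranges = [9.0, None]      # range of the numeric trait over the whole collection

core, diversity = best_candidate(clusters, sizes, kinds, ranges)
print(len(core), round(diversity, 3))
```

Fixing the seed makes the draw reproducible; in practice the per-cluster sizes would come from the allocation step rather than being set by hand.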
The second stage involved the classification of accessions ‘only from Ibadan (263)’ and ‘only from Ubiaja (387)’ into the groups/clusters created in the first stage. This was obtained by calculating the linear discriminant distances between each new accession and each previously created group/cluster. Each new accession (from both Ibadan and Ubiaja) was then assigned to these groups/clusters based on the minimum distance between the accession and the group. The third stage consisted of augmenting the basic core collection (301 entries) with accessions selected from both locations (51 from Ibadan and 76 from Ubiaja) in such a way that the final core collection represented maximum diversity present in the original collection. Further, a stratified random re-sampling was carried out to obtain 1000 independent candidates consisting of the basic core augmented with all the new accessions. The most diverse candidate core was then selected to establish the final core collection.
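The minimum-distance assignment in the second stage can be sketched as follows. This is a deliberate simplification and not the authors' procedure: the linear discriminant distance uses the full pooled within-group covariance matrix, whereas the sketch below scales each axis by its pooled within-group variance only (covariances ignored). The coordinates and function names are hypothetical.

```python
def group_means(groups):
    """Mean of each axis for every group."""
    return [[sum(col) / len(col) for col in zip(*rows)] for rows in groups]


def pooled_variances(groups, means):
    """Pooled within-group variance per axis (off-diagonal covariances ignored)."""
    n_axes = len(means[0])
    dof = sum(len(rows) for rows in groups) - len(groups)
    pooled = []
    for j in range(n_axes):
        ss = sum((row[j] - m[j]) ** 2
                 for rows, m in zip(groups, means) for row in rows)
        pooled.append(ss / dof)
    return pooled


def assign(new_row, means, variances):
    """Index of the group with the smallest variance-scaled squared distance."""
    dists = [sum((x - m) ** 2 / v for x, m, v in zip(new_row, mean, variances))
             for mean in means]
    return min(range(len(dists)), key=dists.__getitem__)


# Hypothetical principal-axis scores for two previously formed groups.
groups = [
    [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)],
    [(5.0, 5.0), (6.0, 6.0), (5.0, 6.0), (6.0, 5.0)],
]
means = group_means(groups)
variances = pooled_variances(groups, means)

print(assign((0.8, 0.2), means, variances))  # new accession near the first group
print(assign((5.1, 6.2), means, variances))  # new accession near the second group
```

With correlated axes the full covariance matrix would matter; principal-axis scores are uncorrelated over the whole data set, but not necessarily within groups, so the diagonal form is an approximation.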
Validating the core subset
The representativeness of the established core collection was determined following three criteria: (1) the effect of variables of different types in the clustering process and the selection of entries, (2) control of the G × E interaction, and (3) comparison of the diversity between the final core and the entire collection. The effects of each variable on classification were studied by using independent F tests for continuous and log-transformed ordinal variables, and independent χ2 tests of independence for nominal variables, based on the null hypothesis that the groups/clusters are independent of the variables under study. The effect of the G × E interaction was evaluated by calculating the means of continuous variables per group across different locations/environments, and by using the Kendall coefficient of concordance (Conover, 1971) to test the concordance of groups across locations for all continuous and ordinal variables, with the null hypothesis of ‘no concordance’ (that is, presence of G × E interaction) tested against the alternative of ‘concordance’ (that is, absence of G × E interaction). The variance components attributable to differences ‘between groups’, ‘between locations’ and the ‘G × E interaction’ were calculated and compared using the continuous and log-transformed ordinal variables. The representativeness of the core to the entire collection was validated by comparing the means and variances, using 99% confidence intervals for continuous variables and the percentage of range recovery for ordinal and nominal variables. The phenotypic diversity of the entire collection and of the core collection, estimated using Gower's distance, was also compared.
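As an illustration of the concordance check, Kendall's coefficient W can be computed from the rankings of group means at each location as W = 12S / (m²(n³ − n)), where m is the number of rankings (locations), n the number of ranked items (groups) and S the sum of squared deviations of the rank totals from their mean. The sketch below assumes no tied values and uses hypothetical group means; it omits the tie correction and the significance test.

```python
def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def kendall_w(table):
    """Kendall's W for m rankings (rows) of n items (columns), no tie correction."""
    m, n = len(table), len(table[0])
    rank_rows = [ranks(row) for row in table]
    totals = [sum(col) for col in zip(*rank_rows)]   # rank total per item
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical means of one trait for four groups at the two locations.
ibadan = [21.0, 35.5, 28.2, 40.1]
ubiaja = [19.5, 30.1, 33.0, 38.7]   # groups 2 and 3 swap order here

print(kendall_w([ibadan, ibadan]))  # identical rankings
print(kendall_w([ibadan, ubiaja]))  # one adjacent swap
```

Values of W close to 1 indicate that the groups keep the same order across locations, which is the situation interpreted in the text as absence of G × E interaction.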
Results
The core collection
The approach followed in the present study to establish the cassava core collection resulted in the selection of 428 accessions from the international collection maintained at IITA. The core reflected the predominance of accessions from Nigeria (45.5%), Ghana (11.2%), Bénin (9.5%), Togo (6.0%) and Guinea and Cameroon (5.8%). This is in agreement with the international cassava collection maintained at IITA that represents mainly germplasm collected from West Africa (Table 1). The core subset also represented accessions from other African regions, such as DR Congo, Kenya and Tanzania, as well as a few from Brazil (Table 1).
Selection of accessions for the core collection
Table 2 presents the number of accessions selected from each cluster to constitute the core collection. The number of accessions selected from a cluster was proportional to the average Gower distance between accessions within that cluster. Larger clusters, such as Group 1, with a higher number of accessions at each location, showed lower distances between accessions (N = 808, Gower's distance (G-dist) = 0.27, n-core = 55), while smaller clusters, such as Group 4, with fewer accessions, showed relatively high phenotypic distances (N = 86, G-dist = 0.31, n-core = 59). This illustrates how the number of accessions selected from each cluster followed the phenotypic diversity within it rather than its size. The proportionally smaller representation of the larger clusters in the core (Table 2) results from the lower diversity within those groups, since larger groups may contain more redundant (similar) accessions.
G-dist, Gower's distances; N, number of accessions in the collection; Ni, number of accessions only in Ibadan location; Nu, number of accessions only in Ubiaja location; n, number of accessions selected from each group generated through clustering of combined data; ni, number of accessions selected from each group generated through clustering of accessions characterized only in Ibadan location; nu, number of accessions selected from each group generated through clustering of accessions characterized only in Ubiaja location; n-core, number of accessions in the core subset.
a Number of accessions in the augmented data set containing the basic core plus the ‘only in’ accessions.
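The diversity-proportional allocation discussed above can be sketched as follows, assuming that the sample size of each cluster is proportional to its mean within-cluster Gower distance and that fractional quotas are rounded by the largest-remainder rule. The rounding rule, the third distance value (0.22) and the total of 20 are hypothetical choices for illustration; only the values 0.27 and 0.31 come from the text.

```python
def allocate(mean_distances, core_size):
    """Split core_size among clusters in proportion to mean within-cluster distance."""
    total = sum(mean_distances)
    quotas = [core_size * d / total for d in mean_distances]
    counts = [int(q) for q in quotas]   # whole part of each quota
    # Hand out the places lost to truncation, largest fractional remainder first.
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in by_remainder[: core_size - sum(counts)]:
        counts[i] += 1
    return counts


# Mean Gower distances: 0.27 and 0.31 as reported for Groups 1 and 4; 0.22 is made up.
print(allocate([0.27, 0.31, 0.22], 20))
```

Cluster size does not enter this calculation, which is why a large but uniform cluster can contribute no more accessions than a small but diverse one; a practical implementation would also cap each count at the cluster size.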
Effect of morphological variables on the classification and selection of the final core collection
The 40 agro-morphological variables (continuous, ordinal and nominal) were transformed into principal axes coordinates explaining the total variation (Fig. 1). A comparison of the effects of continuous and ordinal versus nominal traits on the first 50 principal axes, which explained about 90% of the total variation, showed a 50:50 contribution of the two sets of variables to the axes (Fig. 1). Further analysis of the effect of the variables on the clustering of accessions into different groups indicated that all 16 continuous variables, 25 of the 26 ordinal variables and 33 of the 34 nominal variables were significant for the Ibadan location, while 15 of the 16 continuous variables, 17 of the 25 ordinal variables and 28 of the 34 nominal variables were significant (P < 0.05) for the Ubiaja location (data not shown).
Effect of G×E interaction on the classification method
The characterization of the entire cassava collection was carried out in two locations using un-replicated layouts, and data analysis was carried out by combining information from both locations following a sequential strategy based on five major concepts. Since most of the variables used in the present study (particularly the continuous and the ordinal variables, as they represent quantitative genetic characteristics) may be under the influence of the G × E interaction, the three-way analysis was used. Kendall's coefficient of concordance, calculated for all continuous and ordinal variables across locations, showed that 18 out of 20 coefficients were greater than 0.5 and 11 out of 20 were greater than 0.75 (high concordance) for the Ibadan location. Similarly, 11 (P < 0.10) and nine (P < 0.05) out of 20 Kendall coefficients were significant for the Ubiaja location (data partially shown in Table 3). Table 3 also shows the results for the 13 (out of 20) variables with significant F-test values at both locations, of which only two were non-significant (P > 0.10) for the Kendall coefficients. Further, the estimation of variance components using a mixed model for groups, location and G × E interaction showed that the percentage contribution of the G × E interaction was low (minimum 0.001%, maximum 2.88%) (Table 3). These results indicate that the strategy used in the present study to reduce the effect of G × E interaction in the classification of accessions was effective for most of the continuous and ordinal variables.
a T2_02, fruit set level; T3_02, petiole length; T3_04, distribution of anthocyanin pigmentation; T4_01, storage root peduncle; T4_07, storage root length; T4_10, fresh weight roots; T4_11, fresh weight shoots; T4_18, harvest index; T4_19, plant weight; T4_20, root number per plant; T4_21, root weight per plant; T4_22, root weight per plant; T4_23, average plant weight
Representativeness of the core to the entire collection
A core collection is defined as a subset of accessions that maintains maximum diversity present in the entire collection with a minimum number of redundant accessions (Frankel and Brown, 1984). To evaluate the representativeness of the established core collection, the means and variances for the continuous and log-transformed ordinal variables were compared between the entire and the core collections. In addition, for the ordinal and nominal variables, the ranges were also compared. The results showed that there were no significant differences between the entire collection and the core for the means, variances and range of the variables under study. The core subset recovered 81% of the means and 63% of the variances for the continuous variables. Similarly, the range recovery was 94% for the nominal variables and 85% for the ordinal variables (data not shown). The core also retained the overall phenotypic diversity present in the entire collection (Table 4) based on Gower's distance, a measure that summarizes the morphological diversity. In fact, the core collection showed a gain (increase) of 15% in the average of distances between accessions.
Maximum, minimum and mean values for Gower distances; percentage of increase in average diversity obtained in the core subset (gain %); lower (Lb) and upper (Ub) bounds of the 99% confidence interval for the core mean value.
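The range recovery and diversity gain used in this comparison are simple ratios, sketched below under stated assumptions. The scores, categories and mean distances are hypothetical (chosen so that the gain works out to 15%, matching the reported figure but not the actual Table 4 values), and the exact definitions used in the study may differ, for example in how nominal categories are counted.

```python
def range_recovery(entire, core):
    """Percentage of the entire collection's range of a numeric/ordinal trait kept in the core."""
    full = max(entire) - min(entire)
    return 100.0 * (max(core) - min(core)) / full


def category_recovery(entire, core):
    """Percentage of the nominal categories of the entire collection present in the core."""
    return 100.0 * len(set(core) & set(entire)) / len(set(entire))


def diversity_gain(mean_dist_entire, mean_dist_core):
    """Percentage increase in average pairwise distance from entire collection to core."""
    return 100.0 * (mean_dist_core - mean_dist_entire) / mean_dist_entire


# Hypothetical ordinal scores and nominal categories.
entire_scores = [1, 2, 3, 4, 5, 5, 3]
core_scores = [2, 3, 5]
entire_colours = ["green", "purple", "red", "green", "cream"]
core_colours = ["green", "purple", "cream"]

print(range_recovery(entire_scores, core_scores))       # 75.0
print(category_recovery(entire_colours, core_colours))  # 75.0
print(round(diversity_gain(0.20, 0.23), 1))             # 15.0
```

A positive gain means that accessions in the core are, on average, further apart than accessions in the entire collection, as expected when redundant accessions are left out.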
The list of cassava accessions included in the core collection with details on passport and characterization data is available at www.iita.org/genetic-resources-center.
Discussion
Efficiency of the strategy
Establishing a core collection representing the maximum diversity present in the entire cassava collection maintained at IITA posed several challenges: the mixture of different types of measured variables, whose effects on the grouping of accessions had to be balanced; the un-replicated experimental layouts in different environments, implying the presence of G × E interaction; the absence of the same control checks in both locations; and the classification of accessions into groups of different sizes with variable within-group diversity. These were addressed by combining five different statistical methods. The results showed that the strategy was highly successful in selecting a representative subset from the entire collection: the HMFA allowed the mixture of variables and balanced their effects on the classification; the three-way approach combined with the mixture of normal distributions clustering method was useful in controlling the G × E interaction and generating clusters across locations; the D method selected samples from each group/cluster in proportion to the within-cluster diversity; and the 1000 independent stratified random samples allowed the most diverse candidate core collection to be chosen. Together, these steps produced a final core collection that represented not only the means and variances but also the range of the variables of the entire collection.
Characteristics of the core collection
Some of the countries are under-represented in the collection held at the IITA genebank, and consequently their representation in the core collection was also low. In general, the cassava germplasm collection at IITA shows wide gaps in its coverage of different geographical regions. The most obvious reason for this limited representation is IITA's focus on Africa. IITA primarily maintains the cassava collection of African origin, while CIAT, another Consultative Group on International Agricultural Research (CGIAR) centre, based in Colombia, maintains a large diversity of cassava from Latin American countries. The unequal representation of landraces is also linked to the opportunities for collection in various countries during the past decades. The international collection maintained at IITA, however, represents the West African collection to a greater extent, with a good number of accessions from Nigeria, which is considered to be the secondary centre of diversity for this crop (Lebot, 2009) and is the largest producer of cassava in the world (FAO, 2009). This is also reflected in the established core collection, which includes about 46% of accessions from Nigeria (Table 1).
The established cassava core collection can therefore be efficiently used as a reference for further improvement programmes looking at sources of desirable traits for resistance to various biotic and abiotic constraints, drought tolerance, quality traits or photoperiod sensitivity in a cost-effective manner. Several core subsets of smaller size, focusing on different traits of interest important for cassava breeding programmes, could also be established as new challenges arise. If necessary, additional sources of desirable traits can be obtained from the reserve collection in a much timelier manner by referring to the clusters. The established core collection will also provide a guideline for the gene bank manager while acquiring new accessions in the collection. The core collection, being dynamic in nature, will need to be revised periodically when additional accessions and related information become available.
Acknowledgements
We thank the Global Crop Diversity Trust for funding this research and the field technical staff who contributed to data recording and data entry.