Introduction
Seeds of Theobroma cacao L. (cacao) provide cocoa mass and cocoa butterfat, the raw materials of the multibillion-dollar confectionery industry. Cacao germplasm is conserved as live plants in situ or ex situ because this outcrossing tropical tree crop has recalcitrant seeds (Toxopeus, 1985). Of the 50 ex situ field genebanks (Motilal and Butler, 2003), only two are universal collections: Centro Agronomico Tropical de Investigacion y Enseñanza, Turrialba (CATIE) in Costa Rica and the International Cocoa Genebank, Trinidad (ICG,T) in Trinidad. The latter, managed by the Cocoa Research Unit (CRU) of the University of the West Indies, is the largest and most diverse public domain collection. The ICG,T contains germplasm from multiple expeditions, beginning in 1930, from Amazonian South America, Central America and the West Indies (Kennedy and Mooleedhar, 1993). Details of the ICG,T have been documented in Kennedy and Mooleedhar (1993), Bekele and Bekele (1996), Iwaro et al. (2003), Motilal and Butler (2003), Sounigo et al. (2005), Motilal et al. (2011) and at http://sta.uwi.edu/cru. The ICG,T contains grafted and rooted cuttings representing approximately 0.18% Criollo, 30.2% Forastero, 37.0% Refractario, 15.8% Trinitario and 16.8% unknown accessions.
Mislabelling errors within field germplasm collections have been recognized (http://cropgenebank.sgrp.cgiar.org/index.php?option=com_content&view=article&id=549&Itemid=744). Mislabelling is a major hindrance in the conservation, dissemination and efficient use of crop germplasm (Hurka et al., 2004) including grape (Leão et al., 2009), lettuce (van Treuren et al., 2010) and cacao (Motilal and Butler, 2003; Irish et al., 2010; Motilal et al., 2011). Mislabelled plants that are phenotypically similar but genetically dissimilar will inflate genetic variance rather than phenotypic variance. The ICG,T has a high safety duplication level, with a maximum of 16 clones of an accession in each plot. Mislabelling exists primarily as (a) an admixture of trees from various accessions, which may or may not include the expected accession, or (b) a uniform plot in which all trees belong to another accession. Mislabelling occurs as a result of (a) inadvertent budwood collection, (b) clerical errors in the transcription of plant tags and map records, (c) incorrect replacement of labels on field trees and (d) inadvertent planting. In grafted plants, overtopping of the scion by rootstock material will also lead to mislabelling if the scion insert is weak, is broken off or dies back.
Plot admixture in the ICG,T has previously been addressed (Motilal et al., 2011) but the assignation of a tree to a given accession nomenclature is pending. Cacao accessions can be identified from phenotypic examination (Engels et al., 1980; Bekele and Bekele, 1996; Bekele and Butler, 2000; Bekele et al., 2006). Johnson et al. (2007) advocated the use of field guides in identifying cacao accessions. DNA fingerprinting techniques (e.g. microsatellite or single nucleotide polymorphism markers), however, are efficient, accurate and unambiguous means of plant identification. This study therefore employed microsatellite markers on a subset of the collection to (a) determine the percentage of incorrectly named trees (homonymous error), (b) determine the level of duplication within the ICG,T (synonymous error), (c) determine the correct population clustering and hence (d) improve the management strategy for the collection.
Materials and methods
Plant material
Healthy leaves (flush–mature) from 484 samples of 387 cacao accessions (17% of ICG,T accessions) were opportunistically harvested to facilitate the identification of at least one tree of an accession. The samples were constituted as (a) a single tree from 301 accessions, (b) three accessions with two trunks sampled from the same tree number, (c) 76 accessions with two sampled trees and (d) nine accessions with three sampled trees. Sets (b) and (c) contained a common accession. In addition, leaves from a reference set of 26 accessions were retained to act as a pool of distinctive alleles. These reference accessions were composed of two Upper Amazon Forastero accessions from Peru; 15 Criollo accessions from Belize (10), CATIE (2) and Honduras (3); six Lower Amazon Forastero accessions from Brazil (5) and the USDA Tropical Agriculture Research Station cacao germplasm collection in Puerto Rico (1); two Trinitario clones (ICS 97 and MXC 67) from Trinidad and a reference IMC 67 tree from La Reunion Estate of the Ministry of Food Production, Land and Marine Affairs of Trinidad and Tobago. The complete list of accessions can be found in Supplementary Table S1 (available online only at http://journals.cambridge.org).
DNA extraction and quantification
Total genomic DNA was extracted from leaves following a procedure similar to that described in Motilal et al. (2010). Maceration was performed with a 120 V FastPrep instrument (Qbiogene, Inc., Carlsbad, CA, USA) using lysing matrix A. DNA was maintained in sterile deionized water or Tris–EDTA buffer and stored at −20°C. Stock DNA solutions were assayed with either (a) PicoGreen® (Molecular Probes, Eugene, OR, USA) in a Fluoroskan Ascent system (Labsystems, Helsinki, Finland), (b) Hoechst dye in a TKO fluorometer or (c) a NanoDrop 8000 spectrophotometer, according to the manufacturer's recommendations. Working solutions were prepared at ~0.1 ng/μl of total DNA.
PCR amplification
Twenty-six microsatellite primer pairs (Supplementary Table S2, available online only at http://journals.cambridge.org) were used to generate independent DNA polymorphisms. Characteristics of these primers can be found online at www.ebi.ac.uk and in Lanaud et al. (1999), Pugh et al. (2004) and Saunders et al. (2004). Microsatellite amplification was as described in Motilal et al. (2009). The Taq polymerase employed was Eppendorf HotMasterMix (Brinkmann Instruments Inc., Westbury, NY, USA) or AmpliTaq Gold DNA polymerase (Applied Biosystems, Foster City, CA, USA).
Electrophoresis
Fragment lengths of amplified loci were sized on an 8000 or 8800 capillary electrophoresis system (Beckman Coulter, Inc., Brea, CA, USA) using an internal 400 bp DNA Size Standard Kit as a reference, according to the manufacturer's instructions (Beckman Coulter, Inc.). Binning was performed as described earlier (Motilal et al., 2009).
Multilocus matching
The allelic dataset (4% missing data; dataset I) was checked for binning errors with the Excel Microsatellite Toolkit v.3.1.1 add-in (Park, 2001). The multilocus microsatellite profiles were subjected to all possible pairwise matching in CERVUS v.3.0.3 (Kalinowski et al., 2007), with a mismatch flexibility of three loci and a minimum of 20 matching loci. Trees with the same accession name but different multilocus profiles were deemed homonyms. Trees with different accession names but equivalent multilocus profiles were deemed synonyms. Synonymous accessions were replaced with their appropriate single consensus entry. Homonymous accessions were recoded and kept as separate entries. Multilocus profiles in the new dataset (dataset II; 415 individuals, 26 loci, 2.6% missing data; three samples duplicated as internal checks) were then matched manually against the reference tree microsatellite profiles that had been compiled in the CRU/USDA fingerprinting project. An adjusted dataset was created to align allele bins between the aforementioned project and the present study, and matching accessions were determined for a mismatch flexibility of two loci with a minimum of 13 matching loci in CERVUS v.3.0.3 (Kalinowski et al., 2007).
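The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the threshold logic (a minimum number of matching loci and a maximum number of mismatching loci), not the algorithm implemented in CERVUS. The function names, the data layout (one pair of allele sizes per locus, `None` for missing data) and the decision to skip loci with missing data are all assumptions made for the example.

```python
from itertools import combinations


def compare_profiles(a, b):
    """Return (matching, mismatching) locus counts for two multilocus profiles.

    Each profile is a list with one genotype per locus; a genotype is a pair
    of allele sizes, or None when the locus is missing (assumed convention).
    """
    matching = mismatching = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # assumption: loci with missing data are not compared
        if sorted(ga) == sorted(gb):
            matching += 1
        else:
            mismatching += 1
    return matching, mismatching


def declare_matches(profiles, min_matching=20, max_mismatching=3):
    """Yield pairs of sample names whose profiles satisfy both thresholds."""
    for (name_a, prof_a), (name_b, prof_b) in combinations(profiles.items(), 2):
        matching, mismatching = compare_profiles(prof_a, prof_b)
        if matching >= min_matching and mismatching <= max_mismatching:
            yield name_a, name_b


def homonyms_and_synonyms(profiles, accession_of, **thresholds):
    """Split sample pairs into homonyms and synonyms.

    Homonyms: same accession name, profiles not declared a match.
    Synonyms: different accession names, profiles declared a match.
    """
    matched = set(declare_matches(profiles, **thresholds))
    synonyms = sorted(
        pair for pair in matched if accession_of[pair[0]] != accession_of[pair[1]]
    )
    homonyms = sorted(
        (a, b)
        for a, b in combinations(profiles, 2)
        if accession_of[a] == accession_of[b] and (a, b) not in matched
    )
    return homonyms, synonyms
```

With the defaults shown, the call mirrors the first matching step (three mismatching loci tolerated, at least 20 matching loci); passing `min_matching=13, max_mismatching=2` would mirror the second.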
Microsatellite loci and dataset II
Probabilities of identity (Waits et al., 2001) of the 26 loci were calculated using the software GIMLET (Valière, 2002). Descriptive statistics for these loci were determined on dataset II (415 samples) with GenAlEx v.6.1 (Peakall and Smouse, 2006). Pairwise genetic distances among all individuals were calculated and the standardized distance (Nei, 1972, 1978) was then used in a principal coordinate analysis with this software. The eigenvectors were graphed with SigmaPlot 2002 v.8.0 (SPSS, Inc., 1986–2001).
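For readers unfamiliar with the probability of identity, the standard theoretical expressions for a single locus with allele frequencies p_i are PID = Σ p_i⁴ + Σ_{i<j} (2 p_i p_j)² for unrelated individuals and PIDsib = 0.25 + 0.5 Σ p_i² + 0.5 (Σ p_i²)² − 0.25 Σ p_i⁴ for siblings. The sketch below evaluates both and combines loci by multiplication, which assumes the loci are independent. The function names are hypothetical, and the sketch is not a reproduction of GIMLET's output (which may, for example, apply sample-size corrections).

```python
from itertools import combinations
from math import prod


def pid(freqs):
    """Theoretical probability of identity for unrelated individuals."""
    homozygotes = sum(p ** 4 for p in freqs)
    heterozygotes = sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return homozygotes + heterozygotes


def pid_sib(freqs):
    """Probability of identity among full siblings."""
    s2 = sum(p ** 2 for p in freqs)
    s4 = sum(p ** 4 for p in freqs)
    return 0.25 + 0.5 * s2 + 0.5 * s2 ** 2 - 0.25 * s4


def combined(per_locus_values):
    """Multilocus probability, assuming independent loci."""
    return prod(per_locus_values)
```

For a locus with two equally frequent alleles, `pid([0.5, 0.5])` gives 0.375 and `pid_sib([0.5, 0.5])` gives 0.59375, which shows why PIDsib is the more conservative measure.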
Population assignment
Population assignment analysis was conducted on dataset II (containing three known duplicated samples) with STRUCTURE v.2.3 (Pritchard et al., 2000). A burn-in period of 200,000 runs followed by 500,000 Markov Chain Monte Carlo (MCMC) runs was employed under an admixture model with independent allele frequencies. Alpha was inferred in the model. Population groups from K = 2 to K = 12 were assessed with 50 independent replicates each. The results of the STRUCTURE output were taken into Structure Harvester v.0.6.8 (Earl and von Holdt, 2011) to obtain (a) the minimum number of populations as determined from the method of Evanno et al. (2005) and (b) formatted files for alignment in CLUMPP (Jakobsson and Rosenberg, 2007). The Q-matrices were aligned by permutation with the Large K Greedy algorithm under a random matrix option in CLUMPP (Jakobsson and Rosenberg, 2007).
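The Evanno statistic can be illustrated with a short sketch. ΔK is the absolute second-order difference of the log-likelihood L(K), divided by the standard deviation of L(K) across replicates. The version below takes the second difference of the replicate means, which is an assumption about how the replicates are summarised, and the function name and input layout are hypothetical. It is not the Structure Harvester code.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} from {K: [Ln P(D) of each replicate run]}.

    K values lacking a neighbour on either side, or with zero spread
    among replicates, are skipped because delta K is undefined there.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        spread = stdev(lnp_by_k[k])
        if spread > 0:
            result[k] = second_difference / spread
    return result
```

The K with the largest ΔK is then taken as the uppermost level of structure, for example with `max(result, key=result.get)`.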
The lnPr results from the original STRUCTURE runs were tabulated and sorted, and a trimmed mean was calculated after removing the highest and lowest values. The best-fit number of populations was assessed using the turning points from plots of change in lnPr versus change in K. The lowest K value that best fitted the data was chosen as the number of effective populations. At this K value, the run with the least negative lnPr was chosen to represent membership plots and group contributions (Q values).
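The trimmed-mean step lends itself to a small sketch. The functions below drop the single highest and single lowest lnPr value at each K, average the rest, and report the change in trimmed mean between successive K values. Identifying the turning point from the resulting plot remains a visual judgement and is not automated here. The function names and input layout are assumptions made for illustration.

```python
def trimmed_mean(values):
    """Mean after removing the single highest and single lowest value."""
    if len(values) < 3:
        raise ValueError("need at least three replicate values")
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)


def lnpr_changes(lnpr_by_k):
    """Return {K: trimmed mean at K minus trimmed mean at K - 1}."""
    trimmed = {k: trimmed_mean(values) for k, values in lnpr_by_k.items()}
    return {
        k: trimmed[k] - trimmed[k - 1]
        for k in sorted(trimmed)
        if k - 1 in trimmed
    }
```

Trimming limits the influence of occasional runs that converge to a much poorer likelihood than the rest of the replicates.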
For each K, the 50 independent runs were examined for individuals with at least 5% Criollo ancestry. An ANOVA was carried out with the Group Differences Program v.3.0 (Chang, 2001). Duncan's multiple range test as implemented in DSAASTAT v.1.1 (Onofri, 2007) was used to distinguish the K groups from each other.
Mislabelling from population assignment
A threshold value of Q ≥ 0.85 was employed as the group membership inclusion criterion. Individuals with Q < 0.85 were considered as ambiguous individuals and treated as mislabelled samples. Mislabelled accessions were also identified by running STRUCTURE v.2.3 (Pritchard et al., 2000) on independent datasets of each accession group. Individuals were also partitioned into their appropriate populations based on the results of the 50 replicate analyses. Substructures within these groups without admixed individuals were run as required. Model parameters were similar to those used before except that K was set from 1 to n, where n = 5 or higher as the dataset required (maximum = 12), and 30 iterations were made for each K value. A correlation model (Falush et al., 2003) under these parameters was further employed for datasets of Amelonado and Refractario individuals. The best-fit K value was chosen as before. Comparisons of membership assignment from these three approaches were then reviewed, and a reduced dataset was obtained with each subpopulation containing only true members as identified by the inclusion criteria.
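The inclusion criterion amounts to a simple rule on each individual's membership coefficients. A minimal sketch follows, in which the function name and the dictionary layout (cluster label mapped to Q value) are assumptions for illustration.

```python
def assign_by_q(q_values, threshold=0.85):
    """Return the cluster whose Q value meets the threshold, else None.

    q_values maps cluster label -> membership coefficient for one
    individual. None marks an ambiguous (admixed) individual.
    """
    cluster, q = max(q_values.items(), key=lambda item: item[1])
    return cluster if q >= threshold else None
```

For example, an individual with Q values of 0.91 and 0.09 is assigned to its majority cluster, whereas one with 0.60 and 0.40 is returned as `None` and would be treated as ambiguous.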
The population data of Motamayor et al. (2008) were reduced to a dataset with individuals with high coefficients of membership for the pure Amelonado, IMC, PA and NA populations involved in the present study. Seventeen loci were common to the present study, and these loci were retained for the Motamayor et al. (2008) reference dataset. Individuals with more than ten missing data points were removed. Allele sizes were aligned to those of the present study. One locus was removed due to difficulty in alignment. Mislabelled accessions in the present study, which fell as pure samples into the aforementioned population groups, were assessed for match declaration with CERVUS v.3.0.3 (Kalinowski et al., 2007). Match declarations were guided at a minimum of 13 matching loci and a mismatch of two loci.
Results
Homonymous error and synonymous redundancy
Tree mislabelling as homonyms was present in 17 of the 88 accessions with replicate samples (Supplementary Table S3, available online only at http://journals.cambridge.org). Synonymous cases were present for 29 distinct pairs of matched accessions from 388 accessions managed by the CRU (Supplementary Table S3, available online only at http://journals.cambridge.org). This represented error rates of 19.3% homonymy and 7.5% synonymy at the accession level. Of the 208 accessions with accepted true-type reference trees, 82.7% were matched in the current dataset, yielding a 17.3% mislabelling error.
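The rates quoted above follow directly from the counts given in the text, as the short check below shows. The helper name is hypothetical, and only figures stated in this paragraph are used.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Homonymy: 17 of the 88 accessions with replicate samples.
homonymy_rate = percent(17, 88)

# Synonymy: 29 matched pairs among 388 accessions.
synonymy_rate = percent(29, 388)

# Mislabelling against reference trees: complement of the 82.7% matched.
reference_mislabelling = round(100 - 82.7, 1)
```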
Microsatellite loci and dataset II
From the 415 samples of dataset II, the microsatellite loci detected 5–15 alleles per locus, with a range of 0.193–0.462 for the fixation index and a range of 0.3643–0.6954 for the PIDsib (Supplementary Table S2, available online only at http://journals.cambridge.org). The combined probability from all 26 loci was 4.097 × 10⁻¹⁰ and the probability ranged between 4.73 × 10⁻⁷ and 5.31 × 10⁻¹⁴ for matching an individual. The four most informative loci (unordered) were Y16996, Y16988, AJ271942 and AJ566565 according to both the PIDsib and Shannon's information index. The fifth most informative locus was Y16995 or AJ271944 for these two respective measures.
Multidimensional scaling revealed a clustering of individuals, with a clear separation of the Criollo and Amelonado samples from all other accessions (Fig. 1). The first three axes explained 83.22% of the total variation, with the first two axes explaining 59.35% of the variation. The NA, PA and IMC accessions tended to cluster together. The reference accession U 1 was in close proximity to the SCA accessions (Fig. 1).
Population structure
The 415 individual representative samples (inclusive of the reference accessions) could be fitted into three groups (Criollo–Amelonado–Trinitario, Forastero and Refractario) by the method of Evanno et al. (2005). With the alternative graphing method described here, the dataset could be assigned to four populations, with subclustering into eight or ten populations (Supplementary Fig. S1, available online only at http://journals.cambridge.org). As the dataset was partitioned, several events were noticed. First, the duplicated internal checks were consistently assigned across K assignations. Second, the Forastero group began to be partitioned at K = 4 (individual plots) or 5 (CLUMPP alignment), as the French Guiana and PA accessions separated out. The French Guiana group (ELP and GU accessions) and the PA clustering were separated from each other at K = 10 when representative individual plots were examined, but remained clustered according to the CLUMPP alignment. Third, at K = 4, the reference Amelonado accessions, together with the Trinitario accessions, separated from the Criollo group. The Amelonado and Trinitario accessions remained clustered together at all K groups assessed. Fourth, the SCA group separated out at K = 7 from the other Forastero accessions. Fifth, as K = 8 moved to K = 10, three samples (NA 471 Field 6A B86 T9 = Field 4A D412 T1; EET 400 Field 6B F455 T6 and CRUZ 7/8 Field 6B B83 T1 = T9) were further subdivided. Lastly, the number of accessions with Criollo ancestry decreased progressively with increasing K, differing significantly (P < 0.05) up to K = 5 but remaining relatively constant thereafter (Fig. 2). Criollo individuals appeared admixed at K = 2, 4 and 5 according to the CLUMPP permuted matrix.
Generally, individuals were either admixed (96 samples, 23.1%) or they fell into one of eight main groups: Amelonado (75 samples), Criollo (16 samples), French Guiana (four samples), IMC (14 samples), NA (24 samples), PA (18 samples), Refractario (165 samples) and SCA (three samples). The last was composed of the two SCA samples (SCA 3 and SCA 6) and the U 1 reference accession.
The material with Amelonado ancestry could be partitioned into two or three main clusters under the independent or the correlated allele model, respectively. However, the increased partitioning under the correlated model did not coincide with any biological clustering and resulted in several admixed individuals. The material with Amelonado ancestry was therefore separated into two subclusters, consisting of the reference Amelonado accessions in one group and all other accessions with Amelonado ancestry in the other group.
The Refractario accessions were clustered into two main groups (B and O) from the dataset of 415 individuals. The exclusion of non-Refractario accessions revealed that each Refractario cluster was composed of two subpopulations under both the independent and correlated allele models (Fig. 3). Cluster B was composed of OB1 (B and SJ accessions) and OB3 (JA, LV, LX, LZ, SLA, SLC and SJ accessions). Cluster O was composed of OB2 (AM, CL, CLM and LP accessions) and OB4 (MOQ accessions) (Supplementary Table S1, available online only at http://journals.cambridge.org). STRUCTURE analysis of a dataset of only SLA and SLC accessions revealed that these two accessions stayed as one cluster. In contrast, a dataset of CLM and CLEM accessions was clearly separated into these two accessions.
Typing trees
The percentage of true-type trees in the accession groups ranged from 32% (AM) to 100% (CRU) in the 16 groups that were assessed (Fig. 4). The distribution of true-type trees did not differ significantly among accession groups (χ² = 12.77; df = 15; P = 0.62). Of the 401 samples from the ICG,T (Fields 4A, 5A, 5B, 6A and 6B), 158 samples were misidentified, giving an estimated 39.4% error rate. Approximately 34% of the Refractario accessions in these fields were misidentified.
Several mislabelled or non-reference trees were matched to their appropriate nomenclature or ancestry (Table 1). Amelonado ancestry was evident in many mislabelled accessions, particularly AM (16), CL (11) and MOQ (11) as shown in Supplementary Table S1 (available online only at http://journals.cambridge.org). Accessions with primarily Amelonado–Criollo ancestry included MXC 67 UWI Field 12 x3y6, PENTAGONA 1 Field 6B F491 T5, PENTAGONA 2 Field 6B F492 T8, RIM 113 Field 4A T2, RIM 117 Field 4A T1 and TRD 66 Field 4A A50 T1. The SPEC accessions (SPEC 138/11 Field 6B C141 T1, SPEC 184/2 Field 6B D194 T1 and SPEC 194/44 Field 6B D195 T2) were of IMC–SCA ancestry, except for the mislabelled SPEC 194/48 Field 6B D219 T9, which grouped with Amelonado accessions. The accession CLEM /S-62-1 Field 5B I745 T2 had contributions from the SCA, Refractario Cluster B and NA accession groups. Mixed ancestry was also present in FSC 13 Field 4A C321 T1 (IMC–Amelonado), H 1 (IMC–NA), ICS 39 Field 4A C305 T4 (IMC–Amelonado–Criollo), LCT EEN 162 /S-1010 Field 4A A60 T1 (NA–IMC–PA), MATINA 1/7 Field 6B D236 T12 (IMC–Criollo–Amelonado) and MATINA 1/7 Field 6B D236 T15 (French Guiana–NA). Further details on accession composition can be found in Supplementary Table S1 (available online only at http://journals.cambridge.org).
a F4A, F5B, F6A, F6B = Field 4A, 5B, 6A, 6B, respectively.
b Accession group, AML = AMELONADO, putative accession match is given in parentheses.
Discussion
The population structure of a subset of the ICG,T was documented in a previous study (Motilal et al., 2011) which estimated that the collection contained on average 25% mixed plots. From the present study, a 39.4% misidentification rate was estimated. The estimate is in agreement with previous studies on this genebank which employed dominant markers (Christopher et al., 1999; Sounigo et al., 2001), or the same marker system but on only Upper Amazon Forastero accessions (Zhang et al., 2009a). Aikpokpodion et al. (2010) determined a 46.4% error rate in a Nigerian field genebank. A prior conservative mislabelling estimate of 24.7% across international cacao genebanks (Motilal et al., 2011) can be revised upwards to 29.8% mislabelling. The synonymous error rate was estimated here at 7.5% from 388 accessions, which was within the modelled synonymy estimate of 14.4% of 2000 accessions (Motilal et al., 2011). Both true-type and off-type trees should be documented in the field with appropriate labels and CRU should add this information to its database. Off-type trees should be renamed and retained until all the trees in the genebank are fingerprinted. A decision to remove off-type trees can then be considered. Homonymous cases should be retained provided that they remain unique cases. Accessions arising out of homonymous identification and with a safety duplication of less than four trees should be clonally propagated and maintained in the field genebank.
Synonymy will inflate the safety duplication level of some accessions while concomitantly decreasing the safety duplication level in other accessions. Removal of extraneous trees should only be undertaken if there is an excess of duplicated accessions. New unique accessions can then be introduced so that a greater number of accessions can be maintained on the same area of land.
The Refractario accessions were grouped into OB1 (B and SJ), OB2 (AM, CL, CLM and LP), OB3 (JA, LV, LX, LZ, SLA, SLC and SJ) and OB4 (MOQ) subclusters. Subclusters OB1 and OB3 formed a larger cluster, as did OB2 and OB4. The results obtained are in agreement with the Refractarios being derived from multiple closely related parents (Zhang et al., 2008). Moreover, the grouping presented here suggested that the Refractarios had a narrower origin than was traditionally expected (Pound, 1938, 1943; Toxopeus, 1985; Bartley, 2005). Cacao breeders seeking to exploit the variability within Refractario are advised to select parents from different subclusters. The SLA and SLC accession groups were not separated from each other. These accessions were collected from trees A and C from the farm Santa Lucia (Bartley, 2000), which would be consistent with the SLA and SLC nomenclature present in the genebank. Full phenotypic evaluation of these two groups is recommended and, if they are similar, they should be merged into a single SL accession group.
The approach to population clustering indicated that the method of inferring K, described in this paper, can adequately detect the true population structure when compared with that of Evanno et al. (2005). A low number (10) of iterations is usually employed (Kaeuffer et al., 2007; Efombagn et al., 2008; Motamayor et al., 2008; Schmidt et al., 2009; Zhang et al., 2009a; Aradhya et al., 2010). A larger number of iterations was employed in Aikpokpodion et al. (2010), Motilal et al. (2010) and the present study. This may be a better approach to obtaining a normal sample size of iterations but is hindered by the length of time required by the software, especially on larger datasets. Further, submitting all the runs to CLUMPP may result in biologically invalid results as evidenced by the hybrid Criollo nature at K = 2, 4 or 5 under the Large K Greedy algorithm. The use of selected consistent representative runs per required K is therefore supported (Zhang et al., 2009a; Motilal et al., 2010; Aikpokpodion et al., 2010).
Using separate runs that employed putative clusters to decide on subclustering (Pritchard et al., 2000; Dawson and Belkhir, 2009) was a valuable corroborating tool. A methodological tool employed in the present study was the inclusion of known samples. Here, a known homozygous population (Criollo) was used to track the population structure. In addition, duplicated samples were used as independent unknowns and acted as spiked samples. These two inclusions supported the consistency and biological interpretation of the population subdivision. The SCA and U accessions partitioned away from other accessions into the same subcluster. This agreed with the Contamana group of Motamayor et al. (2008) and their collection history (Bartley, 2005). The inferred population structure in the present study is therefore reliable. The close grouping of PA and French Guiana accessions is consistent with earlier research (Sounigo et al., 2005), with the PA and French Guiana groups suggested to be derived from the human selection of Lower Amazon Forastero material (Bartley, 2005). The absence of French Guiana accessions in Zhang et al. (2009a) precluded a similar assessment but did indicate a Lower Amazon Forastero profile for the PA group. The results supported the proposition that attention should be paid to sample composition effects when inferring structure relationships (Motilal et al., 2010).
The choice of K will influence the interpretation of the results. Criollo ancestry was highly influenced by K (Fig. 2). At a choice of K = 3 (method of Evanno et al., 2005) or K = 4, the number of individuals with Criollo ancestry was probably overestimated. The fit of the genetic data to the finalized population structure should therefore be accepted only after probing for substructure in the entire dataset and in putative homogeneous clusters. This study has demonstrated that three accession groups (MXC, PENTAGONA and STAHEL) traditionally assigned to the Criollo group in the ICG,T must be reassigned to the Trinitario (MXC and PENTAGONA) and Forastero (STAHEL) groups. A similar result was found by Motilal et al. (2010).
In cacao field genebanks, an accession is a clone arising from budwood or seed that may then be vegetatively propagated to exist as a single tree or more than one tree. Users of a collection often assume that the multiple trees of an accession are indeed of the same genetic identity. However, this has been shown not to be the case (Zhang et al., 2009b; Irish et al., 2010; Motilal et al., 2011; and references therein). Determination of homonymies and synonymies is therefore useful in determining the proper accession nomenclature or accession group. We recommend that duplicated samples, appropriate reference samples and proper compilation of the STRUCTURE runs be used when elucidating population structure. Only then will the elucidation of identities become reliable enough to enable the adoption of correct management strategies in field genebanks.
Acknowledgements
Thanks to Ms Alisha Omar-Ali for assisting with DNA extractions and to Mr Stephen Pinney for his assistance with the electrophoresis work. Ms Zainab Ali and Mr Kasey Gordon are thanked for their assistance with data entry. Two anonymous reviewers are thanked for critiquing the manuscript. This research was made possible in part by a grant from the Government of Trinidad and Tobago Research Development Fund.