Introduction
Thirty years ago, Botstein et al. (1980) introduced the method of constructing genetic maps with DNA markers known as restriction fragment length polymorphisms (RFLPs). This development revolutionized genetic mapping and the analysis of diversity. Subsequent methodological advances, such as the development of simple sequence repeat (SSR) markers, random amplification of polymorphic DNA (RAPD) (Williams et al., 1990) and amplified fragment length polymorphisms (AFLPs) (Zabeau and Voss, 1993), were enabled by the development of the polymerase chain reaction. The development of single nucleotide polymorphism (SNP)-based markers brought a new level of resolution to the analysis of genetic diversity and, for most applications, superseded other genetic marker categories. More recently, DNA sequencing of partial or complete genomes from multiple individuals has expanded our understanding of the range of intraspecific genetic variation encountered in higher plants (Fu and Dooner, 2002; Yang and Bennetzen, 2009). With the rapid decline in the cost of DNA sequencing and new technological developments, it is certain that genome sequencing of germplasm collections will become accessible, eliminating biases present in existing genotyping methodologies, although it will also impose a significant data analysis overhead, necessitating increased investment in bioinformatics. The proposed 1001 Arabidopsis genomes project (http://1001genomes.org/about.html) is a sign of things to come. Beyond DNA sequence, there is renewed interest in epigenetic marks, such as cytosine methylation, that decorate DNA and chromatin and potentially influence the phenotype. This paper discusses the impact of these developments on the analysis of genetic diversity.
Intraspecific diversity and the phenotype
Genomic sequencing of diverse genotypes in several plant species has demonstrated that, in addition to SNP and SSR polymorphisms, extensive intraspecific differences include large insertions/deletions (indels), frequently composed of highly repetitive sequences such as retrotransposons and DNA transposons (Wang and Dooner, 2006), and in some cases also genes (Beló et al., 2009; Springer et al., 2009). For example, the complement of disease resistance genes may differ between accessions (Chin et al., 2001; Yahiaoui et al., 2009). Sequences that do not code for proteins may nevertheless affect the phenotype, by supplying enhancers or promoters to nearby genes, or by coding for small RNAs, which affect the expression of other genes by a variety of mechanisms (Chen, 2009). Pseudogenes, which in maize are frequently generated by Helitron transposons, are sometimes transcribed in the sense or antisense direction, which also affects the gene expression phenotype (Yang and Bennetzen, 2009).
If these types of polymorphisms are in linkage disequilibrium (LD) with the genetic markers used for germplasm characterization (predominantly SNPs and SSRs), then no information other than the marker genotype is needed to reflect correctly the underlying genetic relationships of accessions. However, if LD between the markers used for germplasm fingerprinting and genic or non-genic large indel polymorphisms breaks down rapidly, direct genotyping of these differences may be necessary, by DNA sequencing or by other methods such as array comparative genomic hybridization (Beló et al., 2009; Springer et al., 2009). This is likely to be the case for variants that arose recently on the background of a pre-existing haplotype pattern.
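The point about recent variants can be made concrete with the r² measure of LD. The following is a minimal sketch, assuming phased haplotypes coded 0/1 at two biallelic loci; the function name and the toy haplotypes are hypothetical and are not taken from any of the cited studies.

```python
def ld_r_squared(marker, variant):
    """Return r^2 between two biallelic loci, each coded 0/1 per haplotype."""
    if len(marker) != len(variant) or not marker:
        raise ValueError("need two equal-length, non-empty haplotype vectors")
    n = len(marker)
    p_a = sum(marker) / n    # frequency of allele 1 at the marker
    p_b = sum(variant) / n   # frequency of allele 1 at the second locus
    # Frequency of haplotypes carrying allele 1 at both loci
    p_ab = sum(1 for a, b in zip(marker, variant) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b     # coefficient of linkage disequilibrium, D
    return d * d / denom


# Hypothetical data: eight haplotypes, one SNP marker, two indels.
snp          = [0, 0, 0, 0, 1, 1, 1, 1]
old_indel    = [0, 0, 0, 0, 1, 1, 1, 1]  # as old as the SNP haplotype
recent_indel = [0, 0, 0, 0, 0, 0, 0, 1]  # arose on a single SNP-1 haplotype

print(round(ld_r_squared(snp, old_indel), 3))     # 1.0
print(round(ld_r_squared(snp, recent_indel), 3))  # 0.143
```

In this toy case the recent indel occurs only on haplotypes carrying SNP allele 1, yet the SNP genotype predicts its presence poorly (r² = 1/7), which is the situation in which direct genotyping of the indel would be needed.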
An important issue not always appreciated in the context of germplasm analysis is the prevalence of ascertainment bias, which occurs when polymorphic loci are identified (ascertained) in one collection of germplasm but used to evaluate diversity in another (Clark et al., 2005). For example, a collection of SNP loci identified in a set of cultivated lines will not correctly represent the polymorphic loci present in unadapted accessions, leading to incorrect estimates of genetic distances in the latter set of germplasm. Many polymorphic loci in the non-adapted accessions will not be represented in the SNP collection developed from adapted germplasm and, in turn, some alleles common in adapted material may be rare in non-elite accessions. As a result, genetic distances determined in the ascertainment population may be inflated relative to those in the non-ascertained population (Fig. 1). It is difficult to identify a priori an appropriate collection of germplasm for ascertainment (marker discovery), given the unbalanced representation of different types of germplasm in many collections. Perhaps the most appropriate unbiased methodology for germplasm fingerprinting is genotyping by genomic sequencing.
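The effect of ascertainment on distance estimates can be illustrated with a small, deterministic toy example. This is only a sketch under stated assumptions: the two panels, the 0/1 genotype coding and the function names are hypothetical, and distance is taken simply as the mean number of differing loci per pair of lines.

```python
from itertools import combinations


def polymorphic_loci(panel):
    """Indices of loci at which the lines in a panel are not all identical."""
    n_loci = len(next(iter(panel.values())))
    return [i for i in range(n_loci)
            if len({genotype[i] for genotype in panel.values()}) > 1]


def mean_pairwise_differences(panel, loci):
    """Mean number of differing loci per pair of lines, over the given loci."""
    pairs = list(combinations(panel.values(), 2))
    total = sum(sum(a[i] != b[i] for i in loci) for a, b in pairs)
    return total / len(pairs)


# Hypothetical genotypes at eight loci (columns), coded 0/1.
adapted = {
    "A1": [1, 0, 1, 0, 0, 0, 0, 0],
    "A2": [0, 1, 0, 1, 0, 0, 0, 0],
    "A3": [0, 0, 1, 1, 0, 0, 0, 0],
}
unadapted = {
    "U1": [0, 0, 1, 0, 1, 0, 1, 0],
    "U2": [0, 0, 0, 1, 0, 1, 1, 0],
    "U3": [0, 0, 1, 1, 1, 1, 0, 1],
}

all_loci = list(range(8))
ascertained = polymorphic_loci(adapted)  # markers discovered in adapted lines

# All loci, as complete sequencing would reveal them
print(round(mean_pairwise_differences(adapted, all_loci), 2))       # 2.67
print(round(mean_pairwise_differences(unadapted, all_loci), 2))     # 4.0
# Only markers ascertained in the adapted panel
print(round(mean_pairwise_differences(adapted, ascertained), 2))    # 2.67
print(round(mean_pairwise_differences(unadapted, ascertained), 2))  # 1.33
```

With all loci scored, the unadapted panel is the more diverse of the two; with markers ascertained in the adapted panel, the ranking is reversed, because loci polymorphic only in the unadapted lines are never scored.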
Sequencing technology is rapidly approaching the stage at which it will become a cost-effective tool for genotyping (Edwards and Batley, 2009; Varshney et al., 2009). A number of accessions will be sequenced simultaneously in each lane of the instrument, after appropriate encoding of each sample. Depending on the size of the genome, some form of reduced representation analysis (Yuan et al., 2003) will probably be necessary to focus the effort on the non-repetitive fraction of the genome.
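One way in which pooled accessions might be separated after sequencing is sketched below. It assumes that the encoding takes the form of a short, sample-specific barcode at the start of each read, that all barcodes have the same length, and that only exact matches are accepted; the barcodes, reads and function name are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign reads to accessions by exact match of a leading barcode.

    Assumes equal-length barcodes, so that no barcode is a prefix of another.
    Reads without a recognized barcode are collected under "unassigned".
    """
    by_accession = defaultdict(list)
    for read in reads:
        for barcode, accession in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode and keep only the genomic part of the read
                by_accession[accession].append(read[len(barcode):])
                break
        else:
            by_accession["unassigned"].append(read)
    return dict(by_accession)


# Hypothetical barcodes and reads
barcodes = {"ACGT": "accession_1", "TGCA": "accession_2"}
reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTAAGCTT", "GGGGCCCCAA"]

result = demultiplex(reads, barcodes)
print(result["accession_1"])  # ['TTGACC', 'AAGCTT']
print(result["accession_2"])  # ['GGATCC']
print(result["unassigned"])   # ['GGGGCCCCAA']
```

A practical pipeline would also need to tolerate sequencing errors within the barcode, which this sketch deliberately ignores.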
Perspective on epigenotyping of germplasm
It is well established that epigenetic variation encoded by DNA base modifications such as 5-methylcytosine (m5C) affects phenotype in animals and plants (Peaston and Whitelaw, 2006; Henderson and Jacobsen, 2007; Chandler and Alleman, 2008). Some of the epialleles in plants are remarkably stable and affect important plant characteristics (Cubas et al., 1999). It is therefore reasonable to propose that a complete characterization of a germplasm accession or a breeding stock should involve the description not only of the genotype but also of the epigenotype. It has recently been demonstrated that recursive selection for a yield component in canola results in plants that are genetically identical but can be distinguished by DNA methylation differences and exhibit significant differences in yield (Hauben et al., 2009). The tools for comprehensive epigenotyping are available and involve chemical deamination of unmethylated cytosine to uracil, which leaves m5C intact, followed by DNA sequencing, enabling single-base resolution across the whole genome, albeit at considerable expense (Lister and Ecker, 2009; Lister et al., 2009; Wang et al., 2009). High-throughput sequencing technology, especially rapidly developing single-molecule sequencing (Edwards and Batley, 2009), promises to enable comprehensive epigenotyping of germplasm collections in the coming years.
Currently, several options exist for epigenotyping a subset of the genome, for example by excluding the repetitive fraction of the genome (Peterson et al., 2002) or by capturing specific sequences of interest (Hodges et al., 2009).
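The logic of epigenotyping by chemical deamination followed by sequencing can be sketched in silico. The example assumes the standard bisulfite chemistry, in which unmethylated cytosine is deaminated to uracil (and read as T after amplification) while 5-methylcytosine is protected and still reads as C. It also assumes complete conversion, a single strand and an error-free read already aligned to the reference; the sequences and function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate conversion: unmethylated C reads as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference, converted_read):
    """Return reference cytosine positions that still read as C."""
    return {
        i
        for i, (ref, obs) in enumerate(zip(reference, converted_read))
        if ref == "C" and obs == "C"
    }


# Hypothetical reference with cytosines at positions 1, 4, 5 and 9,
# of which positions 1 and 5 are methylated.
reference = "ACGTCCGATC"
methylated = {1, 5}

read = bisulfite_convert(reference, methylated)
print(read)                                       # ACGTTCGATT
print(sorted(call_methylation(reference, read)))  # [1, 5]
```

Real data additionally require alignment of converted reads, handling of both strands and estimates of the conversion rate, none of which is attempted here.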
Conclusions
Rapid technological developments are changing our understanding of genetic diversity by allowing increasingly dense genotyping and the identification of types of genetic polymorphism that were previously not easily accessible to molecular analysis. In the next few years, another step change will occur with the availability of inexpensive genomic sequencing and the development of tools for direct probing of the epigenetic layer of information (Flusberg et al., 2010). These developments will further our understanding of the relationship between haplotypes defined at the sequence level and phenotypic expression, through the use of association mapping and genomic prediction techniques. To fully exploit these developments, we need to better understand the extent of linkage disequilibrium in the germplasm of interest.
Acknowledgements
I appreciate many discussions with Scott Tingey and with all of my professional colleagues at DuPont/Pioneer Hi-Bred Int.