Introduction
Taxonomy, the science of classification, plays an essential role in genebank documentation. The Linnaean taxonomy of plants provides the first key to users seeking material of a particular species or group of species for use in their scientific or breeding programme, but it also determines the protocols to be applied in the maintenance of the material by the genebank curators. A proper classification and naming of the genebank material is thus a prerequisite for its utilization and handling.
In any attempt to bring together information about germplasm holdings from numerous sources, the problem of harmonising or standardising scientific nomenclature between the data sources arises. This is true for both taxon-level data sources (for example, when compiling a continental flora from numerous country floras that use different taxonomy and nomenclature) and accession-level data sources (e.g., in the development of crop-specific databases from accession data provided by individual genebanks that use different classification systems) and is due to the use of diverse nomenclatures in different collections (Knüpffer, 2009).
The main problem regarding the use of taxonomic classification in genebanks is that scientists disagree on which taxonomic system, and which nomenclature, to use. New scientific insights urge taxonomists to create new and better classifications, but not all colleagues will agree, and not all users will adopt the new and possibly better systems. A recent example is the renaming of tomato from Lycopersicon esculentum to Solanum lycopersicum (Spooner et al., 1993), resulting in one-fourth of the tomato accessions in European genebanks being called Solanum, and the rest Lycopersicon. The resulting problems for genebank documentation were also noted by van Veller et al. (2008). Other well-known examples of complexity due to synonymy involve the genus Aegilops that, according to some taxonomists, had to be included in Triticum (Bowden, 1959), a view with which a few others did not agree (Gupta and Baum, 1986). The continuing regrouping of wild potato species in the genus Solanum (e.g., Spooner and van den Berg, 2004) is another example. The result is ‘confusion’ amongst users and curators. This confusion is further increased by the frequent occurrence of spelling mistakes and plain errors in the use of the nomenclature, such as the use of a family name instead of the genus name.
These problems with taxonomy in the context of plant genetic resources (PGR) became very visible in 2003 when the passport data of the germplasm collections in Europe were combined into a single system: EURISCO. EURISCO is a web-based catalogue that provides information about ex situ plant collections maintained in Europe (EURISCO, 2010). As of 1 March 2010, a total of 1,119,348 accessions from 39 European National Inventories covering 304 individual genebank collections were documented in this system. EURISCO includes five fields for taxonomic information: a field for the genus, one for the species and one for the infraspecific name, called here sub-taxon, plus two fields for the author citations for the latter two. The list of multi-crop passport descriptors of the Food and Agriculture Organization (FAO)/International Plant Genetic Resources Institute (IPGRI), on which EURISCO was based, describes the genus name as ‘Genus name for taxon, in Latin. Initial uppercase letter required’, the species as ‘Specific epithet portion of the scientific name, in Latin, in lowercase letters. Following abbreviation is allowed: ‘sp.’’ and, finally, the sub-taxon as ‘Subtaxa can be used to store any additional taxonomic identifier, in Latin. Following abbreviations are allowed: ‘subsp.’ (for subspecies); ‘convar.’ (for convariety); ‘var.’ (for variety); ‘f.’ (for form)’ (FAO/IPGRI, 2001). The genus name is one of four mandatory descriptors in EURISCO: a record with an empty genus field is rejected, whereas a record with any non-empty value can be accepted. The content of this field, like that of the other four taxonomic fields, is not checked against controlled vocabularies that would provide correct spelling and grouping of synonyms under a preferred scientific name.
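The descriptor definitions quoted above can be read as simple format rules. The following is a minimal sketch of such a check, not part of EURISCO itself: the function name `check_taxon_fields` and the regular expressions are assumptions made for illustration, and hybrid names are deliberately not handled.

```python
import re

# Rank abbreviations allowed in the sub-taxon field, as quoted from the descriptor list.
ALLOWED_RANKS = ("subsp.", "convar.", "var.", "f.")


def check_taxon_fields(genus, species, subtaxon=""):
    """Return a list of format problems; an empty list means the fields look well-formed.

    Hypothetical helper: only the format is checked, not whether the name exists.
    """
    problems = []
    if not genus:
        problems.append("genus is empty (record would be rejected)")
    elif not re.fullmatch(r"[A-Z][a-z-]+", genus):
        problems.append("genus should be one word with an initial capital")
    if species and species != "sp." and not re.fullmatch(r"[a-z-]+", species):
        problems.append("species should be a lowercase epithet or 'sp.'")
    if subtaxon:
        tokens = subtaxon.split()
        # Expect alternating "rank epithet" pairs, e.g. "convar. botrytis var. italica".
        ranks = tokens[0::2]
        if len(tokens) % 2 or any(rank not in ALLOWED_RANKS for rank in ranks):
            problems.append("subtaxon should be rank abbreviation + epithet pairs")
    return problems


print(check_taxon_fields("Vicia L.", "faba"))
```

A check of this kind only flags formatting deviations; it cannot detect misspellings or synonyms, which require a reference vocabulary.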
The present study is part of an effort to improve the searchability of EURISCO by standardising the scientific names and enhancing their quality, which should eventually lead to a tool for EURISCO that maps the most frequently occurring scientific names to preferred ones. In addition, it aims at providing a consistent classification of EURISCO accessions into ‘crops’ or ‘crop groups’ as a prerequisite for proper handling of characterization and evaluation data, and for compatibility with GENESYS (a worldwide PGR information system under development) in this respect. Finally, it also aims at classifying the material documented in EURISCO into Annex-1 crops and non-Annex-1 crops. This classification is based on the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA) (FAO, 2002), a legally binding instrument aiming at the conservation and sustainable use of Plant Genetic Resources for Food and Agriculture and the fair and equitable sharing of benefits derived from their use. The crops covered by this Treaty are listed in its Annex 1.
EURISCO is far from being complete. The draft Report on the State of the World's PGRFA (FAO, 2010) estimates the number of PGR accessions in Europe at 1,735,407. A list of over one million European genebank accessions in EURISCO can, however, give a good overview of the situation regarding the taxonomic classification in European genebanks: the lingua franca or Babylonian confusion?
Material and methods
Data
The complete content of EURISCO was made available to the authors on request on 27 January 2010 by Milko Skofic of Bioversity International (Rome, Italy) as a zipped comma-separated file. It included 1,049,460 accessions from 36 national inventories. These data were loaded into Excel; all manipulations and calculations were done in Excel 2007, when necessary using Visual Basic for Applications (VBA).
As an external reference for the taxonomic nomenclature, the taxonomy of the Germplasm Resources Information Network (GRIN) of the United States Department of Agriculture National Plant Germplasm System was used (GRIN-Tax, 2010). This well-curated system of taxa and synonyms is the most authoritative and most complete system for cultivated and other economically important plant taxa available, and it is used as the taxonomic reference in GRIN. There are a number of other taxonomic databases that could have been used as a checklist, e.g. ‘Mansfeld's World Database of Agricultural and Horticultural Crops’ (IPK, 2010) based on the book edition (Hanelt and IPK, 2001); however, GRIN-Tax was considered very appropriate for the purpose. Mansfeld's database deals only with cultivated species (except for ornamental and forestry plants), but contains some species that are not documented in GRIN-Tax, and also includes some synonyms not found in GRIN-Tax. Other extensive online lists of scientific plant names, such as the International Plant Names Index (IPNI, 2010) or the Integrated Taxonomic Information System (ITIS, 2010), do not have a particular focus on economic plants and have therefore been consulted in individual cases only, to check the spelling of scientific names of wild plants.
The GRIN-Tax genus and taxon tables were downloaded on 5 March 2010 and loaded into Excel. The genus table contained 27,855 records with 25,934 distinct genera, some occurring more than once with different author citations and some with multiple entries to accommodate associated infrageneric names. The taxon table contained 95,089 taxa, each with its name, author citation and a link to the ‘preferred taxon name’, the GRIN-accepted name for that taxon. Some genera in the taxon list were listed without species but with the addition ‘sp.’, and this designation sometimes existed together with other named species; however, not all genera were listed this way.
Given the importance of the ITPGRFA, the content of EURISCO was matched with the crops listed in its Annex 1. Since Annex 1 is not a clean list as it does not include the scientific names of all genera and species, and since it uses terms such as ‘Artocarpus, Breadfruit only’ and ‘genus Solanum Section melongena included’, a list of taxa related to each of the Annex 1 Crops was created. This list includes those taxa of genera completely covered by Annex 1, and an additional list with genus–species combinations in the cases where particular species of certain genera were either included or excluded.
Data processing
From the 1,049,460 accessions in EURISCO, all 44,584 distinct genus–species–subtaxon combinations were extracted with their frequencies using custom-made VBA procedures.
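The extraction itself was done with VBA procedures that are not reproduced in the paper. As a rough illustration of the same idea in Python, the sketch below counts accessions per distinct triplet; the record layout with the keys `genus`, `species` and `subtaxon` is an assumption, as is the toy data.

```python
from collections import Counter


def distinct_combinations(records):
    """Count accessions per (genus, species, subtaxon) triplet."""
    return Counter((r["genus"], r["species"], r["subtaxon"]) for r in records)


# Toy records standing in for EURISCO passport data (illustrative only).
records = [
    {"genus": "Triticum", "species": "aestivum", "subtaxon": ""},
    {"genus": "Triticum", "species": "aestivum", "subtaxon": ""},
    {"genus": "AEGILOPS", "species": "", "subtaxon": ""},
]
combos = distinct_combinations(records)
for triplet, frequency in combos.most_common():
    print(triplet, frequency)
```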
These genus–species–subtaxon combinations were ‘cleaned’ in a four-step procedure:
1. The format was corrected in terms of case; all fields were transformed to lower case, except for the genus with an initial capital. For example, the genus name ‘AEGILOPS’ was replaced with ‘Aegilops’.
2. The structure of the genus and species fields was corrected by deleting everything after a space, unless this field contained an ‘x’ (or ‘ × ’) preceding a hybrid genus or species name. This usually implied the removal of author citations or other undesirable additions. For example, ‘Vicia L.’ became ‘Vicia’.
3. In cases where the species field was empty, ‘sp.’ was added.
4. The genera and most frequent genus–species combinations were checked against names occurring in GRIN Taxonomy with the Taxonomic Nomenclature Checker (Bioversity, 2010a), and the most obvious mistakes were corrected. For example, ‘Phaselous’ was replaced with ‘Phaseolus’.
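The first three cleaning steps are mechanical and can be sketched as follows. This is an illustration under stated assumptions, not the authors' VBA code: a field containing a standalone hybrid marker is left intact apart from whitespace, and each word of the genus field is capitalised so that a leading ‘x’ becomes ‘X’. The names `strip_additions` and `clean_triplet` are hypothetical.

```python
HYBRID_MARKS = {"x", "×"}


def strip_additions(field):
    """Step 2: keep only the first word, unless the field holds a hybrid marker."""
    tokens = field.split()
    if not tokens:
        return ""
    if any(token.lower() in HYBRID_MARKS for token in tokens):
        return " ".join(tokens)  # assumption: hybrid names are kept whole
    return tokens[0]


def clean_triplet(genus, species, subtaxon):
    """Apply steps 1 to 3 to one genus-species-subtaxon combination."""
    # Steps 1 and 2 for the genus: strip additions, then initial capital per word.
    genus = " ".join(word.capitalize() for word in strip_additions(genus).split())
    # Steps 1 to 3 for the species: strip additions, lower case, default to 'sp.'.
    species = strip_additions(species).lower() or "sp."
    # Step 1 for the subtaxon: lower case, whitespace normalised.
    subtaxon = " ".join(subtaxon.split()).lower()
    return genus, species, subtaxon


print(clean_triplet("AEGILOPS", "", ""))
print(clean_triplet("Vicia L.", "faba L.", ""))
```

Step 4, the correction of misspellings, is not covered here because it relied on an external checker and manual judgement.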
The genus–species–subtaxon combinations on the resulting list were matched with the GRIN-Tax data. Cases without a match were inspected manually, and the most frequent non-matching cases were corrected if the deviation from GRIN-Tax was obvious. Hybrid genera and species in particular required much attention in this process. For example, after the first three steps of cleaning, the hybrid genus ‘X Triticosecale’ occurred in the list as ‘X Triticosecale’, ‘Triticosecale’, ‘Triticale’, ‘Triticocecale’, ‘Xtriticosecale’ and ‘Xtriticale’. In the cases of non-matching hybrid genera, it was checked whether the genus name without the preceding ‘X’ could be found in GRIN-Tax. If this was the case, the genus name was replaced accordingly, e.g. ‘x Sorghum’ was replaced with ‘Sorghum’. Finally, it appeared that not all generic names in the GRIN-Tax genus name list appeared in the taxon list; therefore, the names were additionally matched against the genus name list.
The match with the taxa of Annex 1 of the ITPGRFA could easily be made based on the list that was compiled with the genera and species included in this Annex 1. However, the creation of this list was not straightforward. For example, the genus Aegilops is not explicitly mentioned, whereas ‘Wheat – Triticum et al. including Agropyron, Elymus and Secale’ is. In this case, Aegilops was considered part of ‘Triticum et al.’
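The fallback for non-matching hybrid genera can be sketched as below. The function `match_genus` and the small reference set are hypothetical; in practice the set would be filled from the GRIN-Tax genus table, and unmatched names would go to manual inspection as described.

```python
def match_genus(genus, grin_genera):
    """Return the matching reference genus, or None if manual inspection is needed."""
    if genus in grin_genera:
        return genus
    tokens = genus.split()
    # Fallback: drop a leading hybrid marker and try again.
    if len(tokens) == 2 and tokens[0].lower() in ("x", "×") and tokens[1] in grin_genera:
        return tokens[1]
    return None


# Toy reference set (illustrative only, not the GRIN-Tax genus table).
grin_genera = {"Sorghum", "Triticum"}
print(match_genus("x Sorghum", grin_genera))
print(match_genus("Triticocecale", grin_genera))
```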
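One way to represent such a compiled list is a genus-level table plus species-level inclusions and exclusions. The sketch below assumes this layout; the function `annex1_crop` and the few entries shown are illustrative and do not reproduce the list compiled for the study.

```python
def annex1_crop(genus, species, whole_genera, included, excluded):
    """Return the Annex 1 crop for a taxon, or None if it is not covered."""
    if (genus, species) in excluded:
        return None
    if (genus, species) in included:
        return included[(genus, species)]
    return whole_genera.get(genus)


# Illustrative entries only (hypothetical excerpt of a compiled list).
whole_genera = {"Triticum": "Wheat", "Aegilops": "Wheat", "Zea": "Maize"}
included = {("Artocarpus", "altilis"): "Breadfruit"}
excluded = {("Zea", "perennis")}

print(annex1_crop("Aegilops", "tauschii", whole_genera, included, excluded))
print(annex1_crop("Zea", "perennis", whole_genera, included, excluded))
```

Checking exclusions first keeps the genus-level table simple: a genus can be listed as covered even when individual species are not.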
Results
The downloaded EURISCO dataset contained 1,049,460 accessions, with 5,385 distinct genus names, 34,668 distinct genus–species combinations and 44,584 genus–species–subtaxon combinations. After cleaning, as described above, these numbers had decreased to 5,264 genus names, 33,463 genus–species combinations and 42,661 genus–species–subtaxon combinations. A match with the taxa in GRIN-Tax, where the genus, species and subtaxon names were simply concatenated with a space in between, showed that 37% of the uncleaned taxa and 41% of the cleaned taxa matched, respectively corresponding to 57 and 76% of the accessions.
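The difference between the share of matching names and the share of matching accessions follows from weighting each name by its frequency. A minimal sketch, assuming the triplet counts are held in a dictionary and that empty parts are skipped when concatenating (the function `match_rates` and the toy data are hypothetical):

```python
def match_rates(combo_counts, reference_names):
    """Return (share of distinct names matched, share of accessions matched)."""
    matched = [
        combo for combo in combo_counts
        if " ".join(part for part in combo if part) in reference_names
    ]
    name_share = len(matched) / len(combo_counts)
    accession_share = sum(combo_counts[c] for c in matched) / sum(combo_counts.values())
    return name_share, accession_share


# Toy counts: two of three names match, but they hold nine of ten accessions.
combo_counts = {
    ("Triticum", "aestivum", ""): 8,
    ("Triticum", "aestivum", "subsp. aestivum"): 1,
    ("Triticum", "vulgare", ""): 1,
}
reference_names = {"Triticum aestivum", "Triticum aestivum subsp. aestivum"}
name_share, accession_share = match_rates(combo_counts, reference_names)
print(f"{name_share:.0%} of names, {accession_share:.0%} of accessions")
```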
The cleaning was an exercise that could be performed automatically (with a VBA script), whereas the correction of spelling errors required time and some knowledge of taxonomy. Frequently occurring spelling errors were based on:
doubling of consonants (e.g. Ocimmum should be Ocimum)
the use of ‘i’ instead of ‘y’ (e.g. Pinus silvestris should be P. sylvestris, Poligonum should be Polygonum) and other single letter alterations, apparently most frequently occurring due to misinterpretation of Latin letters in genebanks with a working language using the Cyrillic alphabet
the use of the ending ‘i’ instead of ‘ii’ or vice versa (e.g. Helianthus maximiliani should be H. maximilianii, Aegilops vavilovi should be Ae. vavilovii, but Abutilon theophrastii should become A. theophrasti)
the use of the ending ‘ae’ instead of ‘aea’ (e.g. Althae should be Althaea)
the use of the wrong gender in species epithets, e.g. ‘um’ or ‘a’ instead of ‘us’ (e.g. Cucumis sativum should be C. sativus)
the use of the Latin ending ‘um’ instead of the Greek ‘on’ or vice versa (e.g. Agropyrum should be Agropyron).
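Most of the error types listed above differ from the correct name by one or two letters, so candidate corrections can be suggested by approximate string matching. The sketch below uses Python's standard `difflib` module for this; it is not the method used in the study, where corrections were made with the Taxonomic Nomenclature Checker and by hand, and the cutoff of 0.75 is an arbitrary assumption. Suggestions of this kind still need review by someone with taxonomic knowledge.

```python
import difflib


def suggest(name, reference_names, cutoff=0.75):
    """Return reference names that closely resemble `name`, best match first."""
    return difflib.get_close_matches(name, reference_names, n=3, cutoff=cutoff)


# Toy reference list (illustrative only).
reference_names = ["Phaseolus", "Ocimum", "Agropyron", "Pisum"]
for misspelled in ("Phaselous", "Ocimmum", "Agropyrum"):
    print(misspelled, "->", suggest(misspelled, reference_names))
```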
Sometimes, more structural changes were required such as the correction of the probably erroneous genus name for cotton Gossypeae, that was either the misspelled tribe name Gossypieae that should not have been used here, or simply the misspelled proper genus name Gossypium. It might also be based on a locally used taxonomic system, but in any case Gossypea was not used apart from one genebank where the 6,181 accessions with that name are maintained.
At the end of the cleaning and correcting process, the genus name of 3.7%, the species name of 15.8% and the subtaxon of 8.9% of the accessions had changed. In total, the names of 24.8% of the accessions were corrected.
The distribution of accessions by genus was highly uneven, as could be expected. Fifty percent of the EURISCO accessions belong to only ten genera (Triticum 16%, Hordeum 9%, Zea 4%, Phaseolus 4%, Avena 3%, Pisum 3%, Solanum 3%, Vicia 2%, Vitis 2% and Malus 2%); the 60 largest genera with respect to number of accessions in EURISCO are listed in Table 1. With only 191 genera, 95% of the EURISCO accessions can be covered; all these genera are accepted in GRIN-Tax except for the genus ‘Melo’ (usually included in Cucumis) with 316 accessions (from three east European countries) in position 139. This implies that the remaining 5,073 genera, or 96% of all genera in EURISCO, cover only 5% of the accessions; 1,655 of these are represented by a single accession.
Note to Table 1: Since the total number of accessions is 1,049,460, these 60 genera cover 89.4% of the accessions in EURISCO, with a range of 16.0% belonging to the genus Triticum to 0.1% of the genus Juglans.
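Coverage figures such as ‘191 genera cover 95% of the accessions’ come from sorting the genera by size and accumulating their counts. A minimal sketch, with a hypothetical function `genera_needed` and toy counts:

```python
def genera_needed(genus_counts, target=0.95):
    """Return how many of the largest genera are needed to cover `target` of all accessions."""
    total = sum(genus_counts.values())
    covered = 0
    ranked = sorted(genus_counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(genus_counts)


# Toy distribution (illustrative only): three of five genera hold 96% of accessions.
genus_counts = {"A": 50, "B": 30, "C": 16, "D": 2, "E": 2}
print(genera_needed(genus_counts, target=0.95))
```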
Obviously, all accessions had a genus name, since it is a mandatory field (three accessions were of the genus ‘Mixture’). The number of accessions without a species name was 86,989, or 8.3% of the EURISCO accessions.
The frequency distribution of species names was uneven, similar to that of genera. The top ten genus–species combinations comprised 40% of the accessions, with Triticum aestivum in the lead with 12% of the accessions followed by Hordeum vulgare (8%) and Zea mays (4%). To cover 95% of the accessions, 2,412 genus–species combinations were required. Of these, 258 are not known in GRIN-Tax, corresponding to only 9,696 accessions. The most frequent unknown combinations are ‘Sorghum hirse’ (1,742 accessions from one country), Fragaria ananassa (832 accessions from two countries referring to Fragaria × ananassa) and Brassica capitata (668 accessions from a number of national inventories belonging to Brassica oleracea).
Matching the taxa to GRIN-Tax showed that out of the 5,264 genus names, 186 are not found in GRIN-Tax, representing only 352 accessions (an average of 1.9 accessions/genus). At the species level, out of the 33,463 distinct genus–species combinations, fewer than half, 16,457 combinations, were in GRIN-Tax (of which 6% consisted of a genus name only). However, these 49% of the names represented 96.8% of the accessions. Similar to the genus names, the 14,867 species names not known in GRIN-Tax represented on average only 2.1 accessions per species. The genera with the largest number of species that could not be matched with GRIN-Tax are: Eucalyptus (368 accessions in 344 species), Silene (303/156), Acacia (270/232), Carex (211/150) and Senecio (201/159). The majority of these species names (972 represented by 1,258 accessions) are found in a single genebank from the UK, the Millennium Seed Bank at Kew, which focuses on wild species from various regions of the world and aims at covering half of the known plant species worldwide (Ian Thomas, pers. commun.). At the subtaxon level, the number of names that could be matched was low; of the 33.3% of accessions that had a sub-specific epithet only 12.5% could be matched.
To determine the applicability of the ‘preferred taxon name’ concept for the improvement of the access to EURISCO, the largest part of the taxon name that could be matched to GRIN-Tax was determined for each accession, and the corresponding ‘preferred taxon name’ was determined. In 41% of the names, corresponding to 76% of the accessions, the complete taxon name, i.e. the combination of genus, species and subtaxon as far as available, was found in GRIN-Tax. When only the largest part that could be matched was considered, 99% of the accessions could be matched with at least the genus name. This concerned 17,821 distinct taxa, some consisting of only a genus name (6%), most of a genus–species combination (84%), or of a complete triplet including a sub-specific epithet (9% of the names). Since for each of the taxa appearing in GRIN-Tax also a ‘preferred taxon name’ was listed in GRIN-Tax, it was possible to replace the taxon name with the preferred taxon name; this decreased the total number of distinct taxa in EURISCO by only 8% to 16,380. However, it could be observed that for some of the larger agronomically important taxa, the ‘preferred taxon name’ brought together some important synonyms. This is illustrated in Table 2 for the taxon ‘T. aestivum subsp. aestivum’.
Note to Table 2: Author citations according to GRIN-Tax.
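The ‘largest matchable part’ approach described above can be sketched as a lookup that drops the subtaxon, and then the species, until a name is found. The function `preferred_name` and the lookup table are hypothetical; the table is a toy example built around the tomato renaming mentioned in the Introduction and is not copied from GRIN-Tax. Skipping ‘sp.’ when building the name is also an assumption.

```python
def preferred_name(genus, species, subtaxon, preferred):
    """Map a cleaned triplet to a preferred name via its longest matchable part."""
    parts = [p for p in (genus, species, subtaxon) if p and p != "sp."]
    while parts:
        name = " ".join(parts)
        if name in preferred:
            return preferred[name]
        parts.pop()  # drop the last part and retry with a shorter name
    return None


# Toy synonym table: name -> preferred name (illustrative only).
preferred = {
    "Solanum": "Solanum",
    "Solanum lycopersicum": "Solanum lycopersicum",
    "Lycopersicon esculentum": "Solanum lycopersicum",
}
print(preferred_name("Lycopersicon", "esculentum", "var. unknownii", preferred))
print(preferred_name("Solanum", "sp.", "", preferred))
```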
If the cleaned EURISCO taxon list was matched with the species of Annex 1 of the ITPGRFA, it could be shown that 66.7% of the accessions in EURISCO belong to species occurring on Annex 1. The list of the 25 largest crops of Annex 1 with respect to number of accessions in EURISCO is provided in Table 3.
Note to Table 3: The crop names are the names used in the International Treaty on Plant Genetic Resources for Food and Agriculture. The material on this list covers 66.7% of the accessions in EURISCO.
Discussion
Taxonomy in genebanks is considered a problematic area by many genebank staff. For example, in the 1998 publication about the creation of the European Brassica Central Crop Database (Boukema et al., 1998), a database that aimed at combining passport data of all accessions in European Brassica collections, the authors list the names under which they received the broccoli accessions: B. oleracea botrytis italica, B. botrytis italica, B. oleracea botrytis cymosa, B. oleracea convar. botrytis var. italica, B. oleracea italica and B. oleracea var. italica. When the EURISCO database was created in 2003, this became very visible; a search of EURISCO was and still is quite difficult since the desired accession might appear under a number of different names, a highly undesirable situation that needs to be resolved.
In the analysis described in this paper, it appears, however, that the problem is not as big as it might seem. The distribution of accessions over taxa is highly uneven, which implies that in order to improve the situation for most of the accessions, attention needs to be given only to a limited number of taxa. Furthermore, GRIN-Tax provides a freely accessible system of synonymy pointing to ‘preferred taxa’, which is also implemented in the Taxonomic Nomenclature Checker (Bioversity, 2010a) where large lists of names can be checked for synonyms. This system could be used as a reference for searches in EURISCO that would allow ‘translating’ misspelled names and synonyms into a preferred name, thus avoiding the problems caused by the use of different classification systems and the occurrence of spelling errors. Names not found in GRIN-Tax can also be checked against the Mansfeld Database (IPK, 2010) using the Taxonomic Nomenclature Checker (Bioversity, 2010b). In this way, 352 genus–species combinations (corresponding to 1,683 accessions in EURISCO) that could not be found in GRIN-Tax could be matched directly, or after correcting obvious spelling errors, with names occurring in the Mansfeld Database. Other available online nomenclature checkers, such as TAXAMATCH (2010), which allows fuzzy matching (both phonetic and non-phonetic) of lists of scientific names with various lists of organism names (Rees, 2008), have not been used so far.
An important observation in the study was the low quality of a considerable part of the taxonomic data; errors of all imaginable types could be observed. EURISCO might play a role in the reduction of such errors; identifying the errors and giving feedback to the data donors is expected to act as an incentive to correct mistakes and to, perhaps, adopt standard nomenclature. Clear recommendations regarding the formatting of the names of problematic taxonomic groups, such as hybrid taxa (with hybrid names such as xTriticosecale or Allium x proliferum or hybrid formulas like Citrus aurantium x Fortunella japonica) could also help in improving the standardization and quality of taxonomic names in genebanks.
The newest ‘Report on the State of the World's PGRFA’ estimates the number of PGR accessions in Europe at 1,735,407 (FAO, 2010). The true number is likely to be lower, since this report is largely based on the FAO database (WIEWS, 2010) that, due to inherent curation problems, includes material that is either not publicly available or does not exist anymore. Over 60% of these listed accessions are included in EURISCO. An important known omission in EURISCO is France, which only included 3,589 of its 249,389 accessions (estimation of the previous Report on the State of the World's PGR (FAO, 1996)). Harmonization of taxon names allowed the creation of a ‘cleaned’ overview of the content of EURISCO (Table 1), and thus an overview of Europe's PGR. This overview shows a remarkable distribution over crop groups. The small grains, with 31% of the accessions, clearly dominate; however, this dominance is not as large as could be expected, given the ease of conservation and their importance in scientific research. The position of Zea in third place is quite noteworthy, since it is a cross-pollinated, large-seeded genus and thus difficult to maintain in PGR collections. Also, the dominant position of the legumes is remarkable, with Phaseolus at the fifth place, Pisum at the seventh place and Vicia at the ninth, together covering 9.7% of the European accessions. Also notable is the fact that the fruit tree genera Prunus and Malus have many accessions (52,179).
Acknowledgements
The authors acknowledge Milko Skofic and Sonia Dias, both working for EURISCO at Bioversity International, Rome, for their support and their comments on the paper, Dag Terje Filip Endresen (Nordic Genetic Resources Centre, Alnarp, Sweden) and Renato di Giovanni (CRIA – Centro de Referência em Informação Ambiental, Campinas, Brazil) for providing useful information, and John Wiersema (USDA) and an anonymous reviewer for their helpful comments on an earlier version of this manuscript.