Introduction
Crop plants are a major source of human and animal nutrition (Grusak and DellaPenna, 1999), playing a key role as renewable resources and providing basic ingredients for the chemical and pharmaceutical industries (Metzger and Bornscheuer, 2006; Tilman et al., 2006). The collection, maintenance and characterization of crop plants and crop wild relatives represent an important contribution to the preservation of biological diversity. Genebanks in particular play an important role in the long-term conservation of plant genetic resources for food and agriculture (PGRFA) (Hoisington et al., 1999). About 1800 collections conserve PGRFA around the world, of which 625 collections comprising more than two million accessions are maintained in Europe (Engels and Maggioni, 2012).
Within the framework of the European Cooperative Programme for Plant Genetic Resources (ECPGR), there is an ongoing initiative aimed at providing information about European plant collections via a central entry point – the European Search Catalogue for Plant Genetic Resources (EURISCO) (Weise et al., 2017). EURISCO provides information on more than two million accessions, maintained in almost 400 germplasm collections in 43 countries.
The germplasm accessions documented in EURISCO presently comprise 6393 genera and 43,160 species, including synonyms, name variants and misspellings. The dataset is regularly updated; however, the correctness of scientific plant names poses one of the most significant challenges to ensuring efficient search functionality (van Hintum and Knüpffer, 2010).
This issue can be illustrated using tomato as an example: both Lycopersicon esculentum Mill. and Solanum lycopersicum L. are accepted scientific names following different taxonomic conventions. Under these two names, EURISCO lists 9653 and 8158 accessions, respectively. In addition, various synonyms exist in both cases. A user, however, is interested in receiving all these hits together. To search jointly over the various accepted names and synonyms, or to detect errors such as typos, the taxonomic information in EURISCO must be checked against controlled vocabularies.
In this paper, recent developments of the taxonomic search functionality of EURISCO are presented, which aim to significantly increase the number of relevant hits.
Implementation
In order to significantly improve the search functionality of EURISCO with regard to the taxonomic names of the documented germplasm accessions, two aims needed to be reached: (i) The heterogeneous taxonomic terms provided to EURISCO needed to be mapped internally onto controlled vocabularies, and (ii) the search interface needed to be reworked, allowing users to obtain complete results by entering accepted taxonomic names, synonym names or even misspelled names.
Database structures were developed to support the mapping of the taxonomic terms provided to EURISCO onto taxonomic repositories. Two repositories, which reflect different taxonomic opinions, served as a starting point – the GRIN Taxonomy (Wiersema, 1995) and Mansfeld's World Database of Agricultural and Horticultural Crops, which is based on Mansfeld's Encyclopedia of Agricultural and Horticultural Crops (Hanelt, 2001). Procedures were developed to perform the mapping automatically for every new dataset received from EURISCO's data providers. For performance reasons, the pre-calculated mapping information is stored in the EURISCO database alongside the provided data. The provided data themselves are currently not modified.
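The mapping step described above can be illustrated with a minimal Python sketch. It is not the EURISCO implementation: the dictionary `REPOSITORY`, the helper names and the normalisation rule are assumptions made for illustration, and the single vocabulary entry reflects only one hypothetical taxonomic opinion. The sketch shows how provided names could be mapped once onto a controlled vocabulary and stored next to the original data, which remain untouched.

```python
# Hypothetical, simplified controlled vocabulary: accepted name -> synonyms.
# The entry is illustrative and not copied from GRIN Taxonomy or Mansfeld.
REPOSITORY = {
    "Solanum lycopersicum L.": ["Lycopersicon esculentum Mill."],
}


def normalise(name: str) -> str:
    """Collapse whitespace and ignore case so trivial variants compare equal."""
    return " ".join(name.split()).lower()


def build_lookup(repository):
    """Map every normalised accepted name or synonym to its accepted name."""
    lookup = {}
    for accepted, synonyms in repository.items():
        lookup[normalise(accepted)] = accepted
        for synonym in synonyms:
            lookup[normalise(synonym)] = accepted
    return lookup


def precompute_mapping(provided_names, repository):
    """Return {provided name: accepted name or None}; the input is not modified."""
    lookup = build_lookup(repository)
    return {name: lookup.get(normalise(name)) for name in provided_names}


# Names as a data provider might deliver them (assumed examples).
provided = [
    "Lycopersicon  esculentum Mill.",  # extra whitespace
    "solanum lycopersicum L.",         # different capitalisation
    "Unknownus plantus",               # not in the vocabulary
]
mapping = precompute_mapping(provided, REPOSITORY)
```

Names that cannot be mapped are kept with the value `None`, so they can be flagged as problem cases and reported back to the data provider.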
Based on the enriched database, it was possible to rework the search interface of EURISCO. It now enables users to search for material of a taxon via its synonyms. While the user fills in the fields of the search interface, string comparisons are performed against the taxonomic information available in EURISCO and suggestions are presented to the user (Fig. 1(a)). After filling in the mandatory fields, the user can explicitly choose whether to search over all available synonyms or to continue with the exact search string only (Fig. 1(b)). The results of a taxonomy search are then displayed on a report page. Besides the exact search string, all available synonyms are shown and can be selected or deselected according to the user's preferences (Fig. 1(c)). After a National Inventory is selected, all accessions matching the search criteria are listed, and it is visually indicated whether each match is exact or non-exact.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20191224102606243-0216:S1479262119000339:S1479262119000339_fig1.png?pub-status=live)
Fig. 1. The taxonomic search interface of EURISCO. (a) Suggestions are presented while typing. (b) The user indicates whether to search over synonyms or to use the exact search string only. (c) The search results are reported to the user. Instead of only 9653 accessions, which would be exact matches for the search string Lycopersicon esculentum Mill., 20,305 accessions are found in total, comprising both various known synonyms and scientific names with typos based on a similarity estimation.
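The choice between a synonym search and an exact search, and the marking of hits as exact or non-exact, can be sketched as follows. This is a simplified illustration under stated assumptions, not EURISCO's code: `Accession`, `SYNONYM_GROUPS`, `expand` and `search` are hypothetical names, and the accession numbers are invented.

```python
from collections import namedtuple

# Hypothetical record type; real accessions carry many more fields.
Accession = namedtuple("Accession", ["number", "scientific_name"])

# Hypothetical synonym groups; names within one group are searched jointly.
SYNONYM_GROUPS = [
    {"Lycopersicon esculentum Mill.", "Solanum lycopersicum L."},
]


def expand(search_name, groups):
    """Return the search name plus all names sharing a synonym group with it."""
    names = {search_name}
    for group in groups:
        if search_name in group:
            names |= group
    return names


def search(accessions, search_name, use_synonyms=True, groups=SYNONYM_GROUPS):
    """Return (accession, is_exact) pairs for all matching accessions."""
    names = expand(search_name, groups) if use_synonyms else {search_name}
    return [
        (acc, acc.scientific_name == search_name)
        for acc in accessions
        if acc.scientific_name in names
    ]


# Invented sample data.
accessions = [
    Accession("ACC 1", "Lycopersicon esculentum Mill."),
    Accession("ACC 2", "Solanum lycopersicum L."),
    Accession("ACC 3", "Hordeum vulgare L."),
]
hits = search(accessions, "Lycopersicon esculentum Mill.")
exact_only = search(accessions, "Lycopersicon esculentum Mill.", use_synonyms=False)
```

With synonyms enabled, both tomato accessions are returned and only the first is flagged as an exact match; with `use_synonyms=False`, only the accession carrying the literal search string is found.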
While implementing the taxonomy extension of the EURISCO search interface, much emphasis was placed on performance tuning, because EURISCO documents more than two million accessions, which must be searchable within seconds through a web interface.
In addition, text indices were used to enable efficient fuzzy searches. This allows users to find accessions whose scientific names contain errors. For this purpose, the similarity of the database entries with the search term is automatically estimated in the background. In the case of tomato for instance, this means that searching for Lycopersicon esculentum Mill. also returns hits with incorrect species names, such as esculantum, esculenthum or escylentum.
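The idea of estimating the similarity between database entries and the search term can be sketched in Python. Note that this swaps in `difflib.SequenceMatcher` from the Python standard library in place of the database text indices mentioned above, whose configuration is not described here; the function names and the threshold of 0.9 are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_hits(search_term, names, threshold=0.9):
    """Return the names whose similarity to the search term reaches the threshold."""
    return [name for name in names if similarity(search_term, name) >= threshold]


# The misspelled epithets are those mentioned in the text; the list is illustrative.
names = [
    "Lycopersicon esculentum Mill.",
    "Lycopersicon esculantum Mill.",
    "Lycopersicon esculenthum Mill.",
    "Lycopersicon escylentum Mill.",
    "Hordeum vulgare L.",
]
hits = fuzzy_hits("Lycopersicon esculentum Mill.", names)
```

In this sketch, the correctly spelled name and the three misspelled variants pass the threshold, whereas the unrelated barley name does not. A lower threshold would tolerate more errors at the cost of more false-positive hits.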
Besides the improvement of the search interface, the new developments also support the EURISCO data providers in checking new or updated information. The mappings provide a tool for the data providers to identify problem cases, such as accepted/non-accepted taxonomic names, typos, etc., and will successively improve the quality of taxonomic information in EURISCO.
Conclusion
The search for taxonomic information in EURISCO has been improved significantly. The revised search interface now supports the search for taxonomic data via synonyms, based on two internationally accepted repositories. The developed system is open and can be extended by additional taxonomic repositories in the future.
In addition, a fuzzy search is supported, which makes it possible to include erroneous data (e.g. spelling mistakes) in the search results. It should be noted that, by its nature, a fuzzy search cannot completely exclude false-positive or false-negative hits. The displayed hits can only be as good as the underlying data. However, user feedback is incorporated into the continuous improvement of the mapping and search results.
Acknowledgements
The project ‘EURISCO taxonomy’ was funded by ECPGR thanks to a contribution from the German Federal Ministry of Food and Agriculture.
We would like to thank Mark Timothy Rabanus-Wallace for language support and Annedore Söchting for intensive software testing.