Introduction
Genebanks generally store their data in straightforward custom-built databases. These databases always contain passport data, defining the identity and origin of the material. Other information stored may include characterization and evaluation (C&E) data, describing the phenotype of the accessions, logistic/management data about where the material is physically stored, germination test results, distribution data, etc. The computerized documentation usually stops there. The information should be made available through a website, enabling users to select and request accessions.
Due to advances in molecular biology, which resulted in technology allowing high-throughput sequencing, and the large-scale application thereof, it has become apparent that genebank data and this type of genomics data will have to be connected. Genebanks are important suppliers of genetic resources to the genomics research community, and access to the resulting information will allow traditional genebank users to better select genetic material for their breeding and scientific programmes (McCouch et al., 2012; van Treuren and van Hintum, 2014).
A thus far unresolved question concerns this interaction: how will genebanks use and provide access to genomics data, and how will genomics information resources give access to genebank data and material? Herein, we discuss a possible solution in which curated passport and C&E data are made available online on the genebank side, and sequences and their annotations on the genomics data provider side, including public resources such as NCBI databases (Anguita et al., 2013) and UniProt (Redaschi and Uniprot Consortium, 2009). This approach will allow these data to be interconnected: a genebank can provide access to the annotated sequences or search for allelic variants for specific genomic regions within its genebank material; a genomic database can give access to the details about the origin of the accessions, or the phenotypic data available at the genebank. This interconnection is based on semantic web technology, an established framework of the World Wide Web Consortium (W3C, 2013). It is fairly simple to implement in a well-organized genebank.
The technology
The principle of the semantic web technology is that information is presented in a way that allows computers to find it, interpret it and link it to other information. Or in the words of the W3C: ‘The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.’ Implementation procedures are relatively simple and involve two steps: (1) the standardized definition of terms, i.e. an ontology; (2) the use of these terms when presenting information on the Internet. The Crop Ontology (GCP, 2013; Matteis et al., 2013), giving access to a variety of crop-specific and generic germplasm ontologies, proved suitable for genebank purposes.
The easiest method to add semantics is to embed the information in the webpage. Each page starts with a reference to the ontologies used in the tags (the XML namespaces, for example xmlns:pp="http://www.cropontology.org/rdf/CO_020:"). Descriptors from this ontology are embedded in the webpage using the tags (Pohorec et al., 2013; Fig. 1). These tags are no more than the codes of the descriptors in the ontology embedded in the proper formatting. For example, the tags <span property="pp:0000024">CGN</span> and <span property="pp:0000043">CGN05176</span> indicate that the page concerns accession CGN05176 maintained by the Centre for Genetic Resources, The Netherlands (CGN). This method is suitable for making simple data available. For more complex data types, development of an application programming interface is the preferred method (W3C, 2008).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20241024014523-26437-mediumThumb-gif-S1479262114000689_fig1g.jpg?pub-status=live)
Fig. 1 Embedding of ontologies within a genebank accession result page. Each page starts with a reference to the ontology used in the tags, and descriptors from these ontologies are embedded in the webpage using the tags. These tags are no more than the codes of the descriptors from the ontology embedded in the proper formatting. In this example, we highlight the reference to the FAO/Bioversity Multi-Crop Passport Descriptor (CO_020) and the property ‘donor accession number’ (0000024). A semantic web-aware browser can interpret these tags automatically.
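To make the embedding concrete, the following is a minimal sketch in Python of how a program could read such tags back out of a page. It uses only the standard-library HTML parser rather than a full RDFa processor, and it assumes that annotated elements are not nested inside one another. The sample page, the class name PropertyExtractor and the helper expand are illustrative assumptions; only the namespace and the two descriptor codes are taken from the example in the text.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect namespace prefixes and (property, text) pairs from a page.

    Simplification: annotated elements are assumed not to be nested.
    """

    def __init__(self):
        super().__init__()
        self.prefixes = {}   # prefix -> ontology base URI
        self.pairs = []      # (property code, text content)
        self._open = None    # (tag, property) of the element being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Namespace declarations such as xmlns:pp="..."
        for name, value in attrs.items():
            if name.startswith("xmlns:") and value:
                self.prefixes[name.split(":", 1)[1]] = value
        # Start collecting text when an element carries a property attribute
        if self._open is None and attrs.get("property"):
            self._open = (tag, attrs["property"])
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            self.pairs.append((self._open[1], "".join(self._text).strip()))
            self._open = None


def expand(code, prefixes):
    """Turn a prefixed code such as pp:0000024 into a full descriptor URI."""
    prefix, _, local = code.partition(":")
    return prefixes.get(prefix, prefix + ":") + local


# Illustrative page built from the tags quoted in the text
PAGE = """
<html xmlns:pp="http://www.cropontology.org/rdf/CO_020:">
  <body>
    <p><span property="pp:0000024">CGN</span>
       <span property="pp:0000043">CGN05176</span></p>
  </body>
</html>
"""

parser = PropertyExtractor()
parser.feed(PAGE)
for code, value in parser.pairs:
    print(expand(code, parser.prefixes), "=", value)
```

A production system would more likely rely on a dedicated RDFa parser; the sketch is only meant to show that the annotations are ordinary attributes that any client can extract.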
The outlined approach only requires the mentioned elements and an aggregator that knows where to find the relevant information, preferably via a centralized service that allows registration of information sources. The aggregator can be developed as a stand-alone tool or integrated in a website.
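The role of the registration service and the aggregator can be sketched as follows. This is an assumption-laden illustration in Python, not a description of an existing service: the class Registry, the function aggregate, the source names and all record values other than the CGN05176 example are hypothetical, and the sources are in-memory functions standing in for remote, semantically annotated web pages.

```python
from typing import Callable, Dict, List, Tuple

Record = List[Tuple[str, str]]        # (descriptor code, value) pairs
Lookup = Callable[[str], Record]      # accession number -> record


class Registry:
    """Central service where information sources register themselves."""

    def __init__(self) -> None:
        self._sources: Dict[str, Lookup] = {}

    def register(self, name: str, lookup: Lookup) -> None:
        self._sources[name] = lookup

    def sources(self) -> Dict[str, Lookup]:
        return dict(self._sources)


def aggregate(registry: Registry, accession: str) -> Dict[str, Record]:
    """Ask every registered source about one accession; keep non-empty answers."""
    found: Dict[str, Record] = {}
    for name, lookup in registry.sources().items():
        record = lookup(accession)
        if record:
            found[name] = record
    return found


# Hypothetical in-memory sources (stand-ins for annotated web resources)
def genebank_source(accession: str) -> Record:
    data = {"CGN05176": [("pp:0000024", "CGN"), ("pp:0000043", "CGN05176")]}
    return data.get(accession, [])


def breeding_source(accession: str) -> Record:
    data = {"CGN05176": [("trait:EXAMPLE", "hypothetical phenotype score")]}
    return data.get(accession, [])


registry = Registry()
registry.register("genebank", genebank_source)
registry.register("breeding_db", breeding_source)

result = aggregate(registry, "CGN05176")
for source, record in result.items():
    print(source, record)
```

Keeping the registry separate from the aggregator means new information sources can be added without changing the aggregator itself, whether it runs as a stand-alone tool or inside a website.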
Use cases
To illustrate the potential of the semantic web technology, two use cases will be presented. The first involves a proof-of-principle for combining C&E data from a breeding database and a genebank database; the second involves the concept for a genebank website to present the allelic variants of a given gene available in its collection.
Prototype: combining C&E data
Information from a breeding database (containing ~omics data for a specific crop) can be combined with information from genebank records automatically by an aggregator. A prototype aggregator (SemGem, 2013) was built to integrate data from the EU-SOL BreeDB (2013) with those of the CGN. The EU-SOL record is retrieved, and the embedded Crop Ontology tags (passport and Solanaceae phenotype ontology) are extracted, including the reference to the genebank record. Information from this genebank record is then extracted and presented in a table. This prototype shows the feasibility of such integration, provided that both data sources use semantic annotations. A more advanced integration of information is outlined in the next example.
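The flow of the prototype (read the annotations of the breeding record, follow the reference to the genebank record, present both in one table) might look roughly like the Python sketch below. It is not the SemGem implementation. The two passport codes follow the example tags given earlier in the text; the function names, the trait and passport codes marked EXAMPLE, and all values other than CGN and CGN05176 are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Annotation = Tuple[str, str]      # (descriptor code, value)
AccessionKey = Tuple[str, str]    # (institute code, accession number)

# Descriptor codes as used in the example tags in the text
INSTITUTE = "pp:0000024"
ACCESSION = "pp:0000043"


def linked_accession(annotations: List[Annotation]) -> Optional[AccessionKey]:
    """Return the genebank reference if both identifying descriptors are present."""
    values = dict(annotations)
    institute, accession = values.get(INSTITUTE), values.get(ACCESSION)
    if institute and accession:
        return (institute, accession)
    return None


def combine(breeding: List[Annotation],
            genebank: Dict[AccessionKey, List[Annotation]]
            ) -> List[Tuple[str, str, str]]:
    """Merge a breeding record with the genebank record it refers to."""
    rows = [("breeding database", code, value) for code, value in breeding]
    key = linked_accession(breeding)
    if key is not None:
        rows.extend(("genebank", code, value)
                    for code, value in genebank.get(key, []))
    return rows


def as_table(rows: List[Tuple[str, str, str]]) -> str:
    """Render (source, descriptor, value) rows as fixed-width text."""
    return "\n".join(f"{source:<18} {code:<20} {value}"
                     for source, code, value in rows)


# Hypothetical annotations extracted from a breeding database page
breeding_record = [
    (INSTITUTE, "CGN"),
    (ACCESSION, "CGN05176"),
    ("trait:EXAMPLE_1", "hypothetical phenotype score"),
]

# Hypothetical genebank records, keyed by institute and accession number
genebank_records = {
    ("CGN", "CGN05176"): [("passport:EXAMPLE_2", "hypothetical passport value")],
}

rows = combine(breeding_record, genebank_records)
print(as_table(rows))
```

If the breeding record lacks either identifying descriptor, no link can be followed and only the breeding data are shown, which reflects the condition that both sources must be semantically annotated.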
Concept: allelic variants in the genebank
A genebank website allows the user to enter the name of a gene, and the interface returns a list of accessions with different allelic variants for this gene, including allele name, known function and information about the phenotype (Fig. 2). Following this step, resources providing allelic variant data regarding that gene generated in one of the many genome (re-)sequencing initiatives (e.g. Lam et al., 2010; Finkers et al., 2012; Xu et al., 2012) are aggregated from all available up-to-date resources on the Internet. Based on the resulting information, haplotypes are reconstructed for the queried gene and the germplasm will be organized in the identified haplotype groups. For each haplotype group, C&E data will be retrieved from the genebank database and presented in combination with the haplotype groups. This information should assist the user in selecting a germplasm panel suitable for his/her purposes.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20160921031600540-0381:S1479262114000689:S1479262114000689_fig2g.gif?pub-status=live)
Fig. 2 Steps involved in the conceptual interface for presenting a user with a panel of accessions representing the complete allelic diversity of a given gene.
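The grouping and presentation steps of this concept can be illustrated with a deliberately simplified Python sketch. It assumes that variant calls for the queried gene have already been aggregated, that each accession carries a single allele per position, and that accessions with identical alleles at all variant positions form one haplotype group; real haplotype reconstruction is considerably more involved. The accession names, positions, alleles and trait values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

Calls = Dict[int, str]   # variant position within the gene -> allele


def group_by_haplotype(variants: Dict[str, Calls]) -> Dict[str, List[str]]:
    """Group accessions whose alleles agree at every variant position.

    Missing calls are written as 'N' and therefore form their own groups.
    """
    positions = sorted({pos for calls in variants.values() for pos in calls})
    groups: Dict[str, List[str]] = defaultdict(list)
    for accession in sorted(variants):
        calls = variants[accession]
        haplotype = "".join(calls.get(pos, "N") for pos in positions)
        groups[haplotype].append(accession)
    return dict(groups)


def summarize(groups: Dict[str, List[str]],
              trait: Dict[str, float]) -> Dict[str, Dict[str, object]]:
    """Attach a simple C&E summary (mean of one trait) to each haplotype group."""
    summary: Dict[str, Dict[str, object]] = {}
    for haplotype, accessions in groups.items():
        scores = [trait[a] for a in accessions if a in trait]
        mean: Optional[float] = sum(scores) / len(scores) if scores else None
        summary[haplotype] = {
            "accessions": accessions,
            "n_scored": len(scores),
            "mean": mean,
        }
    return summary


# Hypothetical variant calls for one gene and hypothetical trait scores
variants = {
    "ACC_A": {101: "A", 250: "G"},
    "ACC_B": {101: "A", 250: "G"},
    "ACC_C": {101: "T", 250: "G"},
    "ACC_D": {101: "T", 250: "C"},
}
trait_scores = {"ACC_A": 2.0, "ACC_B": 4.0, "ACC_C": 5.0}

groups = group_by_haplotype(variants)
summary = summarize(groups, trait_scores)
for haplotype, info in summary.items():
    print(haplotype, info)
```

Selecting one accession from each group would then give a small panel covering the allelic diversity found for the gene, which is the outcome the interface is intended to support.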
Conclusions
To exploit the information generated by the genomics community, genebanks should link their data to those in genomics databases, rather than aiming to incorporate these information resources. The enabling technology to do so is available. Ontologies, defining the common terms and standards, play a crucial role in this effort and need to be adopted and improved by the genebank community. Implementation of the ‘allelic variant’ concept will be the next step towards interoperable genebanks.