Genomic Rosetta Stone
From Genomic Standards Consortium
The vision for the Genomic Rosetta Stone (GRS) project
The GSC is creating a mapping of identifiers describing complete genomes across a wide range of relevant databases so that information about genomes and the organisms from which they derive can be more easily integrated. This mapping will include as many genomic databases as possible (see below for the list of contributing projects). The development of this "Genomic Rosetta Stone" (GRS) is core to the aim of auto-populating the Genome Catalogue with metadata harvested from other sources. It is also a necessary project if the GSC is to work towards a single, global list of genomes as described in the 3rd GSC workshop report (see GSC Publications). The GRS will not deal with gene identifiers, only with identifiers that have a one-to-one mapping with INSDC Genome Project Identifiers (GPIDs).
It is clear that in addition to mapping identifiers for genomes and metagenomes, linking to identifiers in a variety of other types of databases would be useful. Among such resources are culture collections. For this reason, the Straininfo.net portal has been included in this project from the start. Furthermore, several potential descriptors that the community might like to see in MIGS actually belong in other authoritative databases. Optimal growth temperature (OGT) is one of the most widely used ‘ecological parameters’ in comparative genomic studies and yet is not included in MIGS because it is a descriptor perhaps best curated and maintained within a specialist database (e.g. the Prokaryotic Growth Temperature Database (PGTdb); http://pgtdb.csie.ncu.edu.tw/). Other key sources of data will be found in culture collections, organismal databases, and the new generation of online resources like ‘mashups’ which harvest information from a variety of other sources ‘on-the-fly’ (e.g. “ispecies”; http://darwin.zoology.gla.ac.uk/~rpage/ispecies/) and Wikis (e.g. Wikispecies; http://species.wikimedia.org/wiki/Main_Page).
Our end goal is to make this physical mapping available in multiple formats (e.g. relational schema / spreadsheet / webservices) such that it can be consumed in ways that facilitate the discovery of genomic information on the web, comparative genomic studies, and the population of databases with hyperlinks and metadata.
[Minimal standards for encoding information on biological material (came out of CABRI project)]: http://www.cabri.org/guidelines/gl-framed.html
Overview of Implementation Strategy and Component parts
This project will have several component parts that must all work together to produce the full vision. The key components, described at the highest levels, are given below, along with the specific project each one maps to in parentheses.
(1) A group of target databases to include in the mapping (See the Federation of Databases below)
(2) A way to store and maintain the mapping of IDs and in particular manage keeping it up-to-date (Currently investigating the NCBI LinkOUT)
(3) A centralized tool (website) that consumes the mapping and makes use of it (Resolver)
(4) A way for databases to consume some or all of the mapping created in (3) for display/use locally
A Federation of Databases
The GRS is based on a federated set of database developers working together to make the mapping. The collaborators listed below are already involved in this project. All contributors are expected to map their own local identifiers to the INSDC Genome Project Identifiers (INSDC GPID). The project is always looking to include new collaborators in the federation. To join this effort, please contact "curator at ceh.ac.uk".
Further information about the current status of the mapping to INSDC GPID is contained on each of the collaborator pages.
- Genome Catalogue
- Genomes Online Database GOLD
- Straininfo.net
- RDP
- Genome Reviews
- SEED
- Genome Atlas
- SILVA
- IMG
- CMR
- Gemina
The mapping and how to maintain it
(Currently exploring LinkOut as an option - this section is to be expanded - see GRS Resolver for more detail)
The following page is a record of the discussion with the LinkOut team regarding genomic identifiers
Genomic_Rosetta_Stone_and_LinkOut
A GRS Resolver
To make best use of the Genomic Rosetta Stone, we aim to engineer a web-based resolution service to produce lists of links to all databases in which an instance of a particular genome occurs. This tool will function much like a ‘currency converter’; entering an identifier will return synonymous identifiers for a particular (set of) genome(s) or metagenome(s). We will make the mapping and code freely available.
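The ‘currency converter’ behaviour described above can be sketched in a few lines. This is only an illustration, assuming the mapping is held in memory as groups of synonymous identifiers; the names GROUPS, build_index and resolve are hypothetical and are not part of any GRS release. The identifier values are taken from the example XML results page further down this page.

```python
from collections import defaultdict

# Hypothetical in-memory mapping: each group holds the identifiers that
# different databases assign to the same genome. Values are copied from the
# example XML results page; the data structure itself is an assumption.
GROUPS = [
    {
        "GOLDSTAMP": "Gc00481",
        "GCAT_ID": "000001_GCAT",
        "TAXON_ID": "393305",
        "INSDC_Project": "190",
        "IMG_OID": "640069335",
        "STRAININFO_ID": "719455",
    },
]


def build_index(groups):
    """Index every (identifier type, value) pair to the groups containing it."""
    index = defaultdict(list)
    for group in groups:
        for id_type, value in group.items():
            index[(id_type, value)].append(group)
    return index


def resolve(index, id_type, value):
    """Return the synonymous identifiers for each genome matching the input."""
    results = []
    for group in index.get((id_type, value), []):
        # Leave out the identifier that was used for the lookup itself.
        synonyms = {t: v for t, v in group.items() if (t, v) != (id_type, value)}
        results.append(synonyms)
    return results


INDEX = build_index(GROUPS)

if __name__ == "__main__":
    for synonyms in resolve(INDEX, "GOLDSTAMP", "Gc00481"):
        for id_type, value in sorted(synonyms.items()):
            print(f"{id_type}: {value}")
```

Because a lookup returns a list of groups, the same sketch covers the "(set of) genome(s)" case where one identifier (for example a taxon identifier) matches more than one genome.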
Detailed overview of GRS project
For further information, including information on LINKOUT see: GRS Resolver
A GRS Client Tool
James Cole and colleagues at Michigan State University are developing a client tool. Jim presented on it at the 5th GSC workshop. He is planning to make this tool distributable for use by other databases/websites (users).
Past and Current Issues
Past
- PID is not currently contained in sequence files, no way to link to sequence - this is now happening, for instance SILVA can parse INSDC GPIDs from EMBL 16S files (and does)
Ongoing
- Definition of a genome project / sequence -
- taxa (genomes) without PIDs (e.g. phage, plasmids do not have public INSDC GPIDs - what do we do?)
- Mapping of replicons to project (PID) - many plasmids are broad host range
Related ID Mapping Resources
Protein Information Resource - ID Mapping
Background Information about permanent unique identifiers
The broader need for permanent unique identifiers, and the options available (LSID, ARK, DOI), are discussed on this page: permanent unique identifiers
History and Roadmap of the GRS project
The issue of a lack of a single, integrated list of genomes and metagenomes has been flagged up since the first GSC meeting in Sept 2005.
To make this list, we need universal, permanent unique identifiers and a good definition of 'project'.
NCBI and EMBL agree to put all their genomes in "Read only" form into the Genome Catalogue for encouraging further annotation with MIGS/MIMS-compliant information
Genome Catalogue project realizes that importing from these two sources and GOLD is not possible due to the lack of unified IDs.
Genome Catalogue project temporarily adopts GOLD identifiers (GOLD_stamps) and GOLD entries as its primary source of genomes.
Initiates links to StrainInfo as the first 'non-genomic' database; StrainInfo was already mapping to GOLD and had expertise in building a portal to many culture collections
This wiki page launched and additional key databases contacted (early 2007); most contributors keen on participation; most starting to map to INSDC GPIDs; none with bespoke web services for providing a mapping to the future resolver
Pilot version of a "Resolver" launched to illustrate the functionality (Genome Catalogue: http://gensc.org/gsc/gcat/xtr/rosetta-stone)
Tanya developed a work package: Work Package - Genomic Identifier Mapping
The 4th GSC workshop in June 2007 contains a session dedicated to discussions of how to proceed with the development of the Genomic Rosetta Stone: GSC Meetings
StrainInfo maps to Silva (and GOLD) using bidirectional mappings
The 5th GSC workshop in Dec 2007 also contained a session from which the special issue of OMICS arose.
This webpage significantly updated, this history and roadmap added, and the project carried forward with a new linkage to LinkOut
ROADMAP
- make sure LINKOUT provides the infrastructure we need (looks promising)
- get databases to register (RDP, StrainInfo first two done)
- start harvesting the mapping with eutils and display in (new version) of Resolver
- develop a client tool for running on local databases (this must be tightly linked to the Resolver) (Requirement: access Resolver with web services and display all or some of the total mapping in a very user-friendly way; could have a range of added-value functionality)
- write up policy for contributors in federation
- get StrainInfo description of bidirectional mapping onto the web (SILVA and GOLD used it, others could too) - this is the solution for mapping the much larger databases (StrainInfo >700,000 entries, SILVA >500,000 entries) where no central mapping is currently available
Conceptual schema [Image:http://gensc.org/gc_wiki/images/9/97/Figure1_knuckles_and_nodes.jpg]
Enriched information network [Image:http://gensc.org/gc_wiki/images/7/7c/Resolvers.jpg]
Technical implementation of the resolvers described in the GRS OMICS contribution
GRS resolver
A technical description can be found here.
Example XML results page
<ResultSet>
<Result>
<group>
<identifier type="GOLDSTAMP">Gc00481</identifier>
<identifier type="GOLDSTAMP_OLD">Gi00703</identifier>
<identifier type="GREENGENES_ID"/>
<identifier type="GCAT_ID">000001_GCAT</identifier>
<identifier type="TAXON_ID">393305</identifier>
<identifier type="ENTREZ_PID">190</identifier>
<identifier type="IMG_OID">640069335</identifier>
<identifier type="RDP_ID">190</identifier>
<identifier type="INSDC_Project">190</identifier>
<identifier type="STRAININFO_ID">719455</identifier>
</group>
</Result>
</ResultSet>
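A client consuming such a results page might read it as follows. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name parse_result_set is hypothetical, and the assumption that an empty identifier element (such as GREENGENES_ID above) means "no mapping known" is ours rather than something the technical description states.

```python
import xml.etree.ElementTree as ET

# The example results page shown above, embedded so the sketch is runnable.
EXAMPLE = """<ResultSet>
<Result>
<group>
<identifier type="GOLDSTAMP">Gc00481</identifier>
<identifier type="GOLDSTAMP_OLD">Gi00703</identifier>
<identifier type="GREENGENES_ID"/>
<identifier type="GCAT_ID">000001_GCAT</identifier>
<identifier type="TAXON_ID">393305</identifier>
<identifier type="ENTREZ_PID">190</identifier>
<identifier type="IMG_OID">640069335</identifier>
<identifier type="RDP_ID">190</identifier>
<identifier type="INSDC_Project">190</identifier>
<identifier type="STRAININFO_ID">719455</identifier>
</group>
</Result>
</ResultSet>"""


def parse_result_set(xml_text):
    """Return one dict per <group>, mapping identifier type to value."""
    root = ET.fromstring(xml_text)
    groups = []
    for group in root.iter("group"):
        identifiers = {}
        for element in group.findall("identifier"):
            # Assumption: an empty element means no mapping is known, so skip it.
            if element.text and element.text.strip():
                identifiers[element.get("type")] = element.text.strip()
        groups.append(identifiers)
    return groups


if __name__ == "__main__":
    for identifiers in parse_result_set(EXAMPLE):
        for id_type, value in identifiers.items():
            print(f"{id_type}: {value}")
```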
StrainInfo.net CIDs
For a tight coupling of biological information, each object in the information network should be uniquely defined in a global context. Where no globally unique identifiers are in place they will need to be introduced as unambiguous pointers for the objects in the network. Illustrated by the P. putida F1 strain example, where F1 is an identifier for at least 24 microbial strains, this is most notably the case for the subset of organismal objects that consists of all publicly available microbial strains. With the introduction of unique culture identifiers during the indexing process of the strain objects kept in a global network of culture collections, the StrainInfo.net bioportal might provide the answer to this requirement.
In order to fill the gap where no globally unique identifiers were in place to refer to micro-organisms in a global information network, the StrainInfo.net bioportal is assigning globally unique culture identifiers (CID) to each strain number that is discovered when indexing the online catalogues of public culture collections, using an automated accumulative learning process that is described in detail by Dawyndt et al. (2005). In many respects introducing these identifiers is the easiest part of the process. The remaining challenge is to resolve all the strain numbers that occur in legacy data sets to their corresponding culture identifiers, thus establishing a tight information network around everything that is known about a micro-organism.
The StrainInfo.net bioportal aims to provide its users with as much information on a given micro-organism as possible. One way to achieve this goal is to associate the globally unique culture identifiers assigned to each microbial culture to a series of identifiers assigned by third party information sources that provide additional information on that organism. By mapping culture identifiers to external identifiers, the information content of the bioportal itself remains fairly lightweight, whereas synchronisation requirements are reduced to a strict minimum.
StrainInfo.net URL templates
The StrainInfo.net bioportal supports a series of permanent URL templates that allow third party information providers to embellish their resources by hooking up with the information on micro-organisms that is integrated within the bioportal. All URL templates return relevant information, formatted as HTML documents that are retrieved through the HTTP protocol. The URL templates that are implemented by the StrainInfo.net bioportal use the following basename: http://www.straininfo.ugent.be/. All URL templates mentioned implicitly use this basename as a prefix. Note that correct URL encoding needs to be followed when passing parameters to the templates, in which spaces are replaced by %20 or plus signs (+). Most contemporary browsers however perform this encoding automatically. What follows is a wrap-up of the different mappings to information about micro-organisms, including some comprehensive examples.
REFERENCES BASED ON AN ORGANISM IDENTIFIER
The StrainInfo.net bioportal assigns unique and persistent culture identifiers to all micro-organisms that were deposited in a culture collection. It also guarantees to provide persistent support for resolving these culture identifiers. By making use of culture identifiers, unambiguous organism references are expressed using the /culture/<culture identifier> URL template. Table 1 shows for example that the reference /culture/268202 establishes a direct mapping to the P. putida F1 strain, thereby resolving the ambiguity involved in the use of the F1 strain number. Organism references based on culture identifiers are promoted as the primary means for the realization of tight bidirectional mappings between the StrainInfo.net bioportal and any third party information provider that wants to reference information on microbial cultures.
Although the use of culture identifiers for organism referencing is undoubtedly the best way forward, we still have to overcome the problem of legacy data wherein (possibly ambiguous) strain numbers are used. As a convenience we also provide a URL template to resolve strain numbers to their corresponding culture identifiers. The permanent URL template /strainnumber/<strain number> can be used as a reference to the organism identified by a given strain number. Note that the web page that is returned for this URL template is the result of a query, and may change over time when more information becomes known to the bioportal.
The URL template is provided with an optional queryoption parameter that influences the semantics of the request. This parameter takes one of the following values:
- exact: (default value when no queryoption parameter is passed) leads to a perfect search against the strain numbers in the Integrated Strain Database. Unique strain numbers (e.g. BCRC 17059) are directly resolved and automatically redirected to the online record of the corresponding BRC catalogue, while ambiguous strain numbers (e.g. F1) result in a list of possible matches.
- contains: returns all strains in the Integrated Strain Database that are identified by a strain number that contains the given search term as a substring. For instance /strainnumber/F1?queryoption=contains will return a list of all strains with a strain number that contains F1 or any orthographic variant: NF1706, F1, MBF130, F12, etc.
- number: only the number component is extracted from the alphanumeric strain number, after which all strains with a strain number that has the same number component are matched in the Integrated Strain Database. For instance /strainnumber/BCRC+17059?queryoption=number will return a list of all strains with a strain number that has 17059 as its number part: BCRC 17059, DSM 17059, LMG 17059, etc.
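The difference between the three queryoption values can be illustrated locally on a handful of strain numbers. The sketch below is not the bioportal's implementation: the function names are hypothetical, the "number component" is assumed to be the first run of digits in the strain number, and the matching of orthographic variants mentioned for contains is not modelled.

```python
import re


def number_part(strain_number):
    """Assumption: the number component is the first run of digits."""
    match = re.search(r"\d+", strain_number)
    return match.group(0) if match else None


def match_strain_numbers(term, strain_numbers, queryoption="exact"):
    """Mimic the described semantics of queryoption on a local list."""
    if queryoption == "exact":
        return [s for s in strain_numbers if s == term]
    if queryoption == "contains":
        return [s for s in strain_numbers if term in s]
    if queryoption == "number":
        wanted = number_part(term)
        if wanted is None:
            return []
        return [s for s in strain_numbers if number_part(s) == wanted]
    raise ValueError(f"unknown queryoption: {queryoption!r}")


# Strain numbers taken from the examples in the text above.
STRAINS = ["BCRC 17059", "DSM 17059", "LMG 17059", "NF1706", "F1", "MBF130", "F12"]

if __name__ == "__main__":
    print(match_strain_numbers("F1", STRAINS, "exact"))
    print(match_strain_numbers("F1", STRAINS, "contains"))
    print(match_strain_numbers("BCRC 17059", STRAINS, "number"))
```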
REFERENCES BASED ON A TAXONOMIC NAME
Requesting information from the StrainInfo.net bioportal based on a given taxonomic name results in a list of organisms that were identified as belonging to the taxon by at least one of the BRCs that have a culture of the organism in their holdings. It should be noted here that it is well possible that different BRCs might list the same strain as a different species. This route for accessing organismal information is implemented through the /taxon/<taxonomic name> URL template.
Two optional Boolean parameters are used in combination with the /taxon/ URL template and must be separated by an ampersand (&) when combined:
- typestrain: has false as its default value. When the value true is passed, searches are narrowed by restricting the results to type strains only. As each taxon has a unique type strain, this parameter can be used to uniquely identify a micro-organism.
- subtaxa: has true as a default value, which widens searches by including all subtaxa of the given taxon in the request. If searches need to be restricted only to the name of the given taxon, the value false should be passed for this parameter.
In order to generate a list of all organisms available in public BRCs that were identified as Saccharomyces cerevisiae, the following URL may be used: /taxon/Saccharomyces+cerevisiae
To get the type strain of Pseudomonas putida without making a reference based on specific strain numbers or culture identifiers, the following URL may be used: /taxon/Pseudomonas+putida?typestrain=true&subtaxa=false
Finally, all type strains of the genus Bacillus can be listed using the following URL: /taxon/Bacillus?typestrain=true
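For third parties that generate such links programmatically, the /taxon/ template and its two optional parameters could be assembled as sketched below. The helper name taxon_url is hypothetical; the basename is the one given in the introduction to the URL templates, and the encoding relies on Python's standard urllib.parse functions (quote_plus replaces spaces by plus signs). Parameters left as None are simply omitted, so the bioportal's documented defaults apply.

```python
from urllib.parse import quote_plus, urlencode

# Basename quoted in the text (trailing slash dropped, templates start with "/").
BASE = "http://www.straininfo.ugent.be"


def taxon_url(name, typestrain=None, subtaxa=None):
    """Build a /taxon/<taxonomic name> URL; None means 'use the default'."""
    params = {}
    if typestrain is not None:
        params["typestrain"] = "true" if typestrain else "false"
    if subtaxa is not None:
        params["subtaxa"] = "true" if subtaxa else "false"
    url = f"{BASE}/taxon/{quote_plus(name)}"
    if params:
        # urlencode joins the parameters with an ampersand, as required.
        url += "?" + urlencode(params)
    return url


if __name__ == "__main__":
    print(taxon_url("Saccharomyces cerevisiae"))
    print(taxon_url("Pseudomonas putida", typestrain=True, subtaxa=False))
    print(taxon_url("Bacillus", typestrain=True))
```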
REFERENCES FROM EXTERNAL IDENTIFIERS
As described under StrainInfo.net CIDs above, the bioportal associates its culture identifiers with identifiers assigned by third party information sources. As all relationships expressed as mappings from culture identifiers to external identifiers are inherently bidirectional in nature, either direction of a relationship must be available when making implementations that consume the mapping. By making these mappings directly accessible from the StrainInfo.net bioportal, it is possible to minimize the integration effort of external information providers that otherwise need to be provided with a copy of the mapping. The StrainInfo.net bioportal provides these mappings through the /from/<namespace>/<identifier> URL template. The variable components of this URL template are both mandatory, and their values may be expanded as the set of external identifiers known by the hub grows over time. The namespace parameter indicates the information provider authorized for the assignment of a series of external identifiers that are mapped onto the culture identifiers supplied by the StrainInfo.net bioportal. The identifier parameter provides the alphanumeric identifier assigned by the information provider.
As an example, the /from/GOLD/Gc00572 URL will return information about the organism from which the genome identified by goldstamp Gc00572 was derived. The mapping between genome sequences and the micro-organisms from which they were derived is originally expressed using goldstamps, due to the fact that the relationships are directly extracted from the GOLD database. By making use of the functionality of the GRS resolver, however, the StrainInfo.net bioportal can also offer the same relationship expressed using other genome identifiers. Information about the same organism as in the example above could thus be accessed through the GCAT identifier instead of the goldstamp, resulting in a call to the following URL: /from/GCAT/00139_GCAT. This way, information providers can refer to the organism from which a complete genome was derived without even knowing an identifier for that organism. They simply need to use whatever genome sequence identifiers they have at their disposal.
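An information provider could generate such references from whatever genome identifier it holds, along the lines of the sketch below. The helper name from_url is hypothetical, the basename is the one given in the introduction to the URL templates, and the only check made is that both components are present, since the text states that both are mandatory. No request is sent; the sketch only builds the URL.

```python
from urllib.parse import quote

# Basename quoted in the text (trailing slash dropped, templates start with "/").
BASE = "http://www.straininfo.ugent.be"


def from_url(namespace, identifier):
    """Build a /from/<namespace>/<identifier> URL for an external identifier."""
    if not namespace or not identifier:
        raise ValueError("both namespace and identifier are mandatory")
    # safe="" also percent-encodes "/" so a value cannot add path segments.
    return f"{BASE}/from/{quote(namespace, safe='')}/{quote(identifier, safe='')}"


if __name__ == "__main__":
    # The two references from the text, which point at the same organism.
    print(from_url("GOLD", "Gc00572"))
    print(from_url("GCAT", "00139_GCAT"))
```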
MAPPINGS
.doc file:
Media:GRSIdentifierMappings.doc
Overview of the namespaces and their identifiers included in the information domain envisioned in the paper.