towards a richer set of information to describe our complete genome collection

MIGS/MIMS for 16S

From Genomic Standards Consortium


Introduction & Proposal of MIENS

With the MIGS/MIMS specifications, the Genomic Standards Consortium has finished the groundwork to enrich our genome and metagenome collections with contextual data (see the paper in Nature Biotech). It is now time to consider whether these standards could be applied 'as is' in the short term, and 'with modification/extension' in the longer term, to ribosomal RNA (16S/18S & 23S/28S) sequences and finally to any genetic marker used in molecular ecology.

This page is an attempt to leverage existing interest in the rRNA community (submitters and data providers) to enrich ribosomal RNA sequences (16S/18S & 23S/28S) with more contextual data. We would like to propose MIENS, the Minimum Information about an ENvironmental Sequence, as a natural extension to MIGS and MIMS. Supplementing our sequence collection with more contextual data is the key to being able to retrieve, for example, all 16S sequences related to specific habitat parameters (e.g. soil, marine, freshwater, contaminated, temperature, salinity, oxygen etc.).

The two key aspects that need to be tackled by MIENS are:

  • Which additional contextual data fields for environmental sequences are most relevant for the users?
  • How can sequence submission to the INSDC effectively be handled?

This page also outlines a focused, short-term roadmap specifically for rRNA studies that are being generated now and submitted to the INSDC (GenBank/EMBL/DDBJ). It tries to add environmental contextual data to the general description of rRNA sequences by referencing past and current efforts to do so.

The high-level issues associated with extending MIGS/MIMS and GCDML to cover rRNA and other gene sequences (in the wider context of 'biodiversity data') are discussed on the GCDML sub-page: Major_Requirements_for_16S_and_Biodiversity

The page is also meant to prepare a special session on MIGS/MIMS/GCDML for ribosomal RNAs at the next GSC meeting, scheduled for October 2008.

Which additional contextual data fields are most relevant for the users?

A minimal amount of contextual (meta)data needs to be deposited for every culture or sample destined for sequencing of phylogenetic marker genes. Following the MIGS/MIMS standards, at least the GPS position (longitude, latitude), depth/altitude and time of sampling are mandatory.
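The mandatory items named above can be checked mechanically before submission. The following is a minimal sketch, not part of any GSC tool: the field names (`latitude`, `longitude`, `collection_time`, `depth`, `altitude`) and the sample values are assumptions made for illustration only.

```python
# Hypothetical field names; they are assumptions, not an agreed MIENS vocabulary.
REQUIRED = ("latitude", "longitude", "collection_time")
VERTICAL = ("depth", "altitude")  # at least one of these must be present


def missing_fields(record):
    """Return the names of mandatory fields that are absent or empty."""
    def is_empty(key):
        value = record.get(key)
        # A value of 0 (e.g. depth at the surface) counts as present.
        return value is None or value == ""

    missing = [key for key in REQUIRED if is_empty(key)]
    if all(is_empty(key) for key in VERTICAL):
        missing.append("depth/altitude")
    return missing


# Invented sample record with an empty sampling time.
sample = {"latitude": 60.27, "longitude": 5.22, "depth": 1.0, "collection_time": ""}
print(missing_fields(sample))  # → ['collection_time']
```

A check of this kind could run in a submission tool before export, so that incomplete records are flagged to the submitter rather than rejected later.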

Suggested Minimal List of Contextual Data Fields for Environmental Sequences (MIENS)

List of environmental and other parameters and their occurrence in the Feature Table of INSDC, the ARB/SILVA database, and the MIGS/MIMS specification. This list only includes fields NOT specified by the INSDC, which are hence candidates for an extension to the INSDC specification.
| Data Item | Description | INSDC | ARB/SILVA | MIGS/MIMS |
| --- | --- | --- | --- | --- |
| altitude | The altitude of the sampling location above sea level | - | + | + |
| chlorophyll | Chlorophyll concentration in the environment at time of sampling | - | + | + |
| collection_time | Time that the sample was collected, in hours and minutes | - | + | + |
| depth | Depth below surface where the sample was collected | - | + | + |
| dissolved_oxygen | Dissolved oxygen concentration in the environment at time of sampling | - | + | + |
| doc | Dissolved organic carbon concentration in the environment at time of sampling | - | + | + |
| geodetic_datum | Geodetic datum, e.g. WGS 84 | - | + | - |
| habitat | Description of the habitat, like marine, freshwater etc. | - | + | + |
| lat_lon_details | Details of the measurement of geographic coordinates, like: Was latitude and longitude measured by GPS, derived from map, retrieved from literature? | - | + | - |
| metagenomic | Identifies sequences from a culture-independent genomic analysis of an environmental sample submitted as part of a whole genome shotgun project | - | - | + |
| nitrate | Nitrate concentration in the environment at time of sampling | - | + | + |
| pH | pH value in the environment at time of sampling | - | + | + |
| phosphate | Phosphate concentration in the environment at time of sampling | - | + | + |
| poc | Particulate organic carbon concentration in the environment at time of sampling | - | + | + |
| project_name | Name of the sequencing project | - | + | + |
| salinity | Salinity of the environment at time of sampling | - | + | + |
| sample_identifier | A unique identifier (ID) given to the sample that allows samples and contextual data to be cross-referenced | - | + | - |
| sample_material | Describes the sample material that was collected, e.g. water, sediment, biofilm, vent fluid etc. | - | + | - |
| sample_volume | Volume of the sample that was collected | - | + | + |
| silicate | Silicate concentration in the environment at time of sampling | - | + | + |
| temperature | Temperature in the environment at time of sampling | - | + | + |

The SILVA rRNA project is currently conducting a community survey to learn which of these fields are most relevant to users.

An extended list of environmental fields and other parameters is currently being compiled for further discussion, based on the MIGS/MIMS specifications.
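Once sequences carry fields like these, the kind of retrieval mentioned in the introduction (all 16S sequences from a given habitat or temperature range) becomes a simple filter. The sketch below assumes records are held as plain dictionaries keyed by the field names in the list above; the accession strings and values are invented placeholders, and `select` is a hypothetical helper, not an ARB or SILVA function.

```python
def select(records, habitat=None, temperature_range=None):
    """Return records matching a habitat label and/or a temperature window."""
    hits = []
    for rec in records:
        if habitat is not None and rec.get("habitat") != habitat:
            continue
        if temperature_range is not None:
            low, high = temperature_range
            temp = rec.get("temperature")
            # Records without a temperature cannot satisfy a temperature query.
            if temp is None or not (low <= temp <= high):
                continue
        hits.append(rec)
    return hits


# Invented example records; "EX..." accessions are placeholders.
records = [
    {"accession": "EX000001", "habitat": "marine", "temperature": 4.0},
    {"accession": "EX000002", "habitat": "marine", "temperature": 18.5},
    {"accession": "EX000003", "habitat": "soil"},
]

cold_marine = select(records, habitat="marine", temperature_range=(0, 10))
print([rec["accession"] for rec in cold_marine])  # → ['EX000001']
```

The third record shows why coverage matters: a sequence without a temperature value silently drops out of any temperature-based query.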

The Habitat Field

Controlled vocabulary descriptions of habitats like marine, freshwater, contaminated etc. can now be covered through the use of the terms from the Environment Ontology (EnvO) project.

The Ribosomal Database Project is currently running a user survey on habitat terms that are most important to users based on the GSC's Habitat-Lite project (a list of <30 high-level descriptors of habitat derived from EnvO).

An 'ideal' rRNA submission template

Ideally, this community can offer a template for very rich submission of data to GenBank/EMBL/DDBJ. If we can get the community to adopt it, the major rRNA software tools (e.g. ARB) and databases (SILVA and RDP) would be able to harvest these data and build features such as more powerful search or sorting/grouping options.

With release 93 of the SILVA rRNA databases, about 17 additional contextual fields have been introduced; users can easily fill them in using the ARB software package. The additional fields are documented in the environmental parameters section of the SILVA fields description.

The table above and the SILVA survey are based on these fields, which were initially selected taking into account the MIGS/MIMS specifications.

A suggestion for an integrated workflow that easily merges sequence data with contextual data can be found here: SILVA Metadata workflow

The last 3 slides in James Cole's talk at the 5th GSC Workshop contain 16S entries in GenBank. The first is truly minimal, and the next two have progressively more information but could still be richer: http://gensc.org/gc_wiki/index.php/Image:Cole_gsc_dec_07.ppt

How can sequence submission to the INSDC effectively be handled?

Although contextual data that go beyond what the INSDC captures can already be stored in local databases using the ARB/SILVA system, no standards for the submission of these data to the INSDC have been defined so far.

A clear, universal standard for reporting contextual data beyond the existing INSDC fields is urgently needed, independent of the software used to process the ribosomal RNA data. This is a prerequisite for consuming such information when downloading and processing sequences from the INSDC.

This means that the easiest and quickest way to enrich our public sequence databases is to start defining the rRNA reporting requirements and to ask people to make compliant submissions to the INSDC.


This is now possible through two main mechanisms:

  • conformance to optional qualifiers in the INSDC feature tables (e.g. /lat_lon) and
  • the use of the community defined COMMENT BLOCK.

This can include structured data and db_xrefs to link out to other databases where the data might be more elaborately displayed/searched etc.
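To make the two mechanisms concrete, here is a hedged sketch of how a submission tool might emit them. `format_lat_lon` follows the "degrees N/S degrees E/W" style of the INSDC /lat_lon qualifier; the number of decimal places is an assumption. `build_comment_block` borrows the START/END delimiter pattern from the HIV example further down this page; the tag name `MIENSData` and the field names are hypothetical, since no such community block has been defined.

```python
def format_lat_lon(lat, lon, places=2):
    """Format decimal degrees in the style of the INSDC /lat_lon qualifier."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.{places}f} {ns} {abs(lon):.{places}f} {ew}"


def build_comment_block(fields, tag="MIENSData"):
    """Serialize key=value pairs between ##<tag>-START## and ##<tag>-END##.

    The default tag is a hypothetical placeholder, not a registered
    community name.
    """
    for key, value in fields.items():
        # ';' and '=' are the separators, so they must not occur in keys,
        # and ';' must not occur in values.
        if ";" in key or "=" in key or ";" in str(value):
            raise ValueError(f"separator character in field {key!r}")
    body = "; ".join(f"{key}={value}" for key, value in fields.items())
    return f"##{tag}-START## {body}; ##{tag}-END##"


print(format_lat_lon(47.94, -28.12))
# → 47.94 N 28.12 W
print(build_comment_block({"habitat": "marine", "depth": "10 m"}))
# → ##MIENSData-START## habitat=marine; depth=10 m; ##MIENSData-END##
```

Rejecting separator characters up front keeps the block trivially parseable, at the cost of disallowing free text containing semicolons.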


To proceed we need to take the rRNA reporting requirements and


Examples of this type of submission already exist for the HIV community (see below).


Defining community-based COMMENT BLOCK in INSDC docs

Below is an explanation from Ilene about how the HIV community is using the COMMENT BLOCK option and how we would proceed if the GSC wants to do something similar:




If GSC or another community were to adopt this, we would need a community name (like HIVDatabase) and a list of potential fields. Let me know if you want me to write something up or is this enough information.


We have come up with a solution for incorporating extra information into Genbank submissions.  We are going to allow different user communities to provide additional source information which will be incorporated into a structured COMMENT in the GenBank flat file.  An example for this can be found in DQ526029 http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&val=108736317 


This information was submitted as:



Information in this section is searchable in HIV Database at Los Alamos
(www.hiv.lanl.gov). ##HIVData-START## Funding=OTHER; Sequence
Name=B1PBL10; Patient code=B1; Subtype=B; Sample tissue=PBMC; Risk
Factor=Mother-->Baby; Note=Baby of patient M1, sequenced at birth;
Note=Env; Patient sex=Male; Patient age=0; Note=; Days from
seroconversion=0; Days post-infection=0; Patient health
status=asymptomatic; CD4 count=3305; Sample country=US; Sample city=Los
Angeles; Sample date=6-2-97; Infection Country=US; Infection city=Los
Angeles;DBID=1560819839;##HIVData-END##" ,



The HIVData-START and HIVData-END tags appear at the
beginning and end of the comment as a delimiter to allow for easy parsing. 

If this community would like to use this tabular format to show additional source information in a GenBank record, please let me know and I will be happy to work with you to get this info into GenBank.

Best regards,

Ilene Karsch Mizrachi, PhD
GenBank Coordinator
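The START/END delimiters described in the letter above lend themselves to a small parser. The sketch below is one possible approach, not GenBank's or the HIV Database's own code: it assumes fields are separated by `;`, keys and values by the first `=`, and that values contain no semicolons. The `comment` string is a shortened excerpt of the example shown above.

```python
import re


def parse_comment_block(text, tag="HIVData"):
    """Extract key=value pairs between ##<tag>-START## and ##<tag>-END##.

    Keys may repeat (e.g. "Note" in the example record), so each key maps
    to a list of values.
    """
    pattern = rf"##{re.escape(tag)}-START##(.*?)##{re.escape(tag)}-END##"
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return {}
    # Flat-file comments are line-wrapped, so collapse all whitespace runs.
    body = " ".join(match.group(1).split())
    fields = {}
    for item in body.split(";"):
        if "=" not in item:
            continue  # skip the empty segment after the final ';'
        key, value = item.split("=", 1)
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Shortened excerpt of the record quoted above.
comment = """Information in this section is searchable in HIV Database at Los Alamos
(www.hiv.lanl.gov). ##HIVData-START## Funding=OTHER; Sequence
Name=B1PBL10; Patient code=B1; Subtype=B; Note=Env; Note=; Sample city=Los
Angeles;DBID=1560819839;##HIVData-END##"""

fields = parse_comment_block(comment)
print(fields["Sequence Name"])  # → ['B1PBL10']
print(fields["Sample city"])    # → ['Los Angeles']
```

Collapsing whitespace before splitting matters here, because keys such as "Sequence Name" and values such as "Los Angeles" are broken across lines in the flat file.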


A to do list

This is largely a similar path to the development of MIGS/MIMS, except that we didn't have an initial checklist to work from, just ideas from the community. Now, with rRNA, a first place to start is MIGS/MIMS in the context of GCDML. In fact, the general progression of any standardization project is to 1. define the problem, 2. form the community to tackle it, 3. develop a checklist (scope), 4. progress to implementation.

Standardization of rRNA submissions should, in principle, go faster than for genomes/metagenomes, because the problem is largely defined and a potential community to tackle it is likely available in the form of the GSC.


1. compare MIGS/MIMS to minimum reporting ideals for ribosomal RNA (literally, review the MIGS/MIMS checklist in light of rRNA sequences, adding a new 'report type' of rRNA and seeing if we can manage it - how complete is it?)

2. define minimum additional requirements (e.g. primers used, project ID, physical-chemical parameters etc.)

3. incorporate minimum reporting requirements in GCDML

4. define how we can structure such data for submission to the INSDC using a structured COMMENT BLOCK (of course can be held in GCDML, exchanged in GCDML)

5. generate a set of templates and tools for submissions that can be used NOW (we expect them to continually improve until full compliance with a future reporting requirement is achieved)

6. make some compliant submissions, advertise them as examples of improved reporting conformance

7. improve databases in order to consume/display such information (for example to read new fields about habitat, and perhaps give searching/sorting options)

8. generate user-friendly tools that aid users in achieving compliance (e.g. a richer SILVA ARB exporter to allow standardized submission through SEQUIN is being developed)


Examples

Many of us are making large submissions in the next few months, and if we could adhere to the principles we set out here we would have some good examples for validating the process and 'completing the circle'. In other words, we need examples of richly annotated rRNA sequences submitted to the INSDC that over time are harvested by the key rRNA databases and therefore become available to serve as good templates for future submissions. If the number of richly annotated sequences becomes significant, tools and databases will start to make increasing use of the data. This is when the value of this process will become evident.

This is a project for the long term and there will be a tremendous amount of legacy data, but it should be worth it for the future, especially given the inclusion of key players from the INSDC and the rRNA community in the GSC and the huge amount of work that has gone into MIGS/MIMS.


Case study: Bergen Experiment 16S clone libraries submissions to EMBL

The Bergen Experiment 16S case study involves the submission of >3,000 16S sequences to EMBL (Webin). All of the 16S data was generated from water samples taken in Bergen that were accompanied by rich experimental and environmental data (biogeochemical measurements).


The Original Call for richer reporting of 16S sequences

There has been a long-standing general call from many in the 16S community for richer reporting of sequences submitted to the INSDC. A few years ago, the JGI (Phil Hugenholtz) initiated a survey of fields, in particular habitat, for the richer submission of 16S sequences: http://www.jgi.doe.gov/16s/

The intention of this important survey was to select additional descriptors for submission to the INSDC. Since then, the INSDC has devised a mechanism by which such rich fields (of key=value type) can be placed 'legally' into a formal INSDC document. This mechanism is the comment block, which is described above.


The GSC drew on this survey in principle, along with other sources collecting metadata to describe genomes, to help define the scope and content of the MIGS/MIMS specification.


Now that the GSC has formed and MIGS/MIMS has been published, it is time to re-evaluate, and hopefully push forward on, richer reporting of rRNA sequences as well, especially since the INSDC comment block could be available for genomes/metagenomes and rRNA sequences. The MIGS/MIMS publication in Nat Biotech highlights how important these 'halos' of rRNA sequences are for the analysis of genomes/metagenomes. There is also a vast and rapidly growing collection of rRNA sequences that could, in its own right, be greatly enriched in value by access to more, and better structured, metadata.
