MIENS
From Genomic Standards Consortium
Introduction & proposal of MIENS
With the publication of the MIGS/MIMS specification, the Genomic Standards Consortium (GSC) has finished the groundwork to enrich our genome and metagenome collections with additional contextual data (see the publication in Nature Biotech). It is now possible to extend and adapt this specification to any genetic marker sequence retrieved from the environment.
To move forward and leverage existing interest in the community, a proposal for MIENS, the Minimum Information about an ENvironmental Sequence, was accepted as a natural extension to MIGS and MIMS at the 6th GSC meeting in October 2008.
MIENS is meant to be fully compliant with the attributes already included in the MIGS/MIMS specification, but it also adds contextual data fields that are needed to enrich our ever-growing set of marker gene sequences.
It was also decided at GSC 6 that the ribosomal RNA sequence collections (16S/18S & 23S/28S) will serve as the first use case.
If we, as a community of researchers, supplement our sequence collections with more contextual data, we will be able to retrieve, search and analyse these invaluable and ever-growing datasets in unprecedented detail, for example by selecting all 16S sequences related to specific environmental parameters (e.g. location, habitat, temperature, salinity, oxygen concentration).
There are five key aspects that need to be tackled by the MIENS working group:
- Identify which INSDC/MIGS/MIMS contextual data attributes for environmental sequences are most relevant for the community
- Identify additional contextual data to be covered by MIENS
  -> Generate a checklist for MIENS indicating the significance of the attributes
- Formalize the MIENS checklist as a community (and publish it): define field names and field descriptions
- Collaborate with INSDC to define the mode of sequence & contextual data submission
- Provide tools for effective sequence & contextual data submission
The MIENS working group
A working group is currently being formed to move MIENS forward and to prepare a set of additional fields to be decided on at the upcoming GSC meetings. Currently the working group consists of the following members:
- Frank Oliver Glöckner, MPI-Bremen (chair, Silva)
- Renzo Kottmann, MPI-Bremen, GCDML
- Wolfgang Hankeln, MPI-Bremen
- Pelin Yilmaz, MPI-Bremen
- Jörg Peplies, Ribocon GmbH
- Peter Dawyndt, Ghent University, StrainInfo.net
- Linda Amaral Zettler, Woods Hole, ICoMM (VAMPS/Microbis)
- James Cole, Michigan State University, RDP
- Wolfgang Ludwig, Technical University Munich (ARB)
- Dawn Field, CEH Oxford, MIGS/MIMS and L4 Time-series
The high-level issues associated with extending MIGS/MIMS and GCDML to cover rRNA and other gene sequences (in the wider context of 'biodiversity data') are discussed on the GCDML sub-page: Major_Requirements_for_16S_and_Biodiversity
Which additional contextual data fields are most relevant for the users?
A minimal amount of contextual (meta)data needs to be deposited for every culture or sample destined for sequencing of phylogenetic marker genes. Following the MIGS/MIMS standards, at least the GPS position (longitude, latitude), depth/altitude and time of sampling are mandatory.
Table 1: Suggested minimal list of contextual data fields for environmental sequences (MIENS)
| Data Item | Description | INSDC | ARB/SILVA | MIGS/MIMS | MIENS |
|---|---|---|---|---|---|
| altitude | The altitude of sampling location above sea level | - | + | + | + |
| chlorophyll | Chlorophyll concentration in the environment at time of sampling | - | + | + | + |
| collection_time | Time that the sample was collected in hours and minutes | - | + | + | + |
| depth | Depth below surface where the sample was collected | - | + | + | + |
| dissolved_oxygen | Dissolved oxygen concentration in the environment at time of sampling | - | + | + | + |
| doc | Dissolved organic carbon concentration in the environment at time of sampling | - | + | + | + |
| geodetic_datum | Geodetic datum e.g. WGS 84 | - | + | - | + |
| habitat | Description of the habitat, like marine, freshwater etc. | - | + | + | + |
| lat_lon_details | Details of the measurement of geographic coordinates, like: Was latitude and longitude measured by GPS, derived from map, retrieved from literature? | - | + | - | + |
| metagenomic | Identifies sequences from a culture-independent genomic analysis of an environmental sample submitted as part of a whole genome shotgun project | - | - | + | + |
| nitrate | Nitrate concentration in the environment at time of sampling | - | + | + | + |
| pH | pH value in the environment at time of sampling | - | + | + | + |
| phosphate | Phosphate concentration in the environment at time of sampling | - | + | + | + |
| poc | Particulate Organic Carbon concentration in the environment at time of sampling | - | + | + | + |
| project_name | Name of the sequencing project | - | + | + | + |
| salinity | Salinity in the environment at time of sampling | - | + | + | + |
| sample_identifier | A unique identifier (ID) given to the sample that allows samples and contextual data to be cross-referenced | - | + | - | + |
| sample_material | Describes the sample material that was collected, e.g. water, sediment, biofilm, vent fluid etc. | - | + | - | + |
| sample_volume | Volume of the sample that was collected | - | + | + | + |
| silicate | Silicate concentration in the environment at time of sampling | - | + | + | + |
| temperature | Temperature in the environment at time of sampling | - | + | + | + |
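
As a rough illustration of how a tool might use the checklist in Table 1, the following Python sketch compares one sample record against the Table 1 field names. The field names are taken from the table; the function name `check_record`, the dictionary layout of a record, and the example values are assumptions made for illustration only and are not part of any MIENS specification.

```python
# Field names copied from Table 1; everything else is illustrative.
MIENS_TABLE1_FIELDS = frozenset({
    "altitude", "chlorophyll", "collection_time", "depth",
    "dissolved_oxygen", "doc", "geodetic_datum", "habitat",
    "lat_lon_details", "metagenomic", "nitrate", "pH", "phosphate",
    "poc", "project_name", "salinity", "sample_identifier",
    "sample_material", "sample_volume", "silicate", "temperature",
})


def check_record(record):
    """Return (missing, unknown) field names for one sample record.

    missing: Table 1 fields that have no value in the record.
    unknown: record keys that are not Table 1 field names.
    """
    # A field counts as filled unless its value is None or an empty string.
    filled = {key for key, value in record.items() if value not in (None, "")}
    missing = sorted(MIENS_TABLE1_FIELDS - filled)
    unknown = sorted(set(record) - MIENS_TABLE1_FIELDS)
    return missing, unknown


# Hypothetical record, not real sample data.
sample = {"depth": 10, "habitat": "marine", "temperature": 12.5, "colour": "blue"}
missing, unknown = check_record(sample)
print(len(missing), "Table 1 fields not filled; unrecognised keys:", unknown)
```

Reporting unrecognised keys separately from missing ones is a design choice here: it lets a submitter spot misspelled field names without treating every extra field as an error.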
Results from the community surveys for contextual data fields for environmental sequences
The SILVA rRNA project conducted a community survey from May to October 2008 to get an overview of the fields that are most relevant for the users.
216 responses from 26 different countries were acquired within a six-month period, of which 182 were complete.
The responses are classified as "relevant", "neutral", "non-relevant" and "unknown response" for each suggested field. "Unknown response" was added as a result of the 34 unfinished responses.
In a similar survey conducted by Philip Hugenholtz between May 2005 and August 2005, comparable results were obtained.
486 responses were obtained from 78 different locations within six months with a maximum of 169 responses on one data field.
The relevance results shown in the diagram for the selected fields are based on a total of 169 responses. "Unknown response" was calculated by subtracting the number of answers for a given data field from this total of 169.
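The "unknown response" calculation described above is a simple subtraction, sketched below in Python. The total of 169 comes from the text; the function name and the per-category counts in the usage line are hypothetical and do not reproduce actual survey results.

```python
TOTAL_RESPONSES = 169  # total used for the Hugenholtz survey results, as stated above


def unknown_responses(relevant, neutral, non_relevant, total=TOTAL_RESPONSES):
    """Number of respondents who gave no answer for one data field."""
    answered = relevant + neutral + non_relevant
    if answered > total:
        raise ValueError("more answers than total responses")
    return total - answered


# Hypothetical counts for a single field (not survey data).
print(unknown_responses(relevant=100, neutral=40, non_relevant=20))  # prints 9
```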
Table 2: An extended list of contextual data fields compiled based on the results of the surveys
An extended list of contextual data fields and other parameters has been compiled for further discussion, based on the MIGS/MIMS specifications and the results of the SILVA and Hugenholtz surveys.
| Data Item | Description | ARB-SILVA Survey | Hugenholtz Survey | Respondent-Suggested |
|---|---|---|---|---|
| agricultural_use | Indicating whether the sampling location was impacted by agriculture (soil samples) | - | + | + |
| altitude | The altitude of sampling location above sea level | + | - | - |
| ammonium | Ammonium concentration in the environment at time of sampling | - | - | + |
| bacterial_abundance | Bacterial cell count at the sampling location at the time of sampling | - | - | + |
| chlorophyll | Chlorophyll concentration in the environment at time of sampling | + | - | - |
| cloning_vector | Type and name of cloning vector used for constructing clone library | - | + | + |
| collection_time | Time that the sample was collected in hours and minutes | + | - | - |
| depth | Depth below surface where the sample was collected | + | - | - |
| dissolved_oxygen | Dissolved oxygen concentration in the environment at time of sampling | + | - | - |
| extraction_method | Description of the nucleic acid extraction method used, e.g. phenol-chloroform, salting out, kit | - | + | + |
| doc | Dissolved organic carbon concentration in the environment at time of sampling | + | + | - |
| geodetic_datum | The type of datum that describes the size and shape of the earth and the origin, and orientation of the coordinate system used to map the earth | + | - | - |
| vegetation | Dominant plant species at the sample location, at the time of sampling (soil samples) | - | + | + |
| habitat | Description of the habitat, like marine, freshwater etc. | + | + | + |
| host_anatomical_site | Anatomical site of host from which the sample is derived (host-associated samples) | - | + | + |
| host_association_type | Association type of sample with the host, e.g. parasitic, symbiotic (host-associated samples) | - | + | + |
| host_species | Scientific name of the organism which the sequenced sample was associated with (host-associated samples) | - | + | + |
| irradiation | Irradiance (W/m2/nm) at the sampling location, at the time of sampling | - | - | + |
| lat_lon_details | Details of the measurement of geographic coordinates, like: Was latitude and longitude measured by GPS, derived from map, retrieved from literature? | + | - | - |
| metagenomic | Identifies sequences from a culture-independent genomic analysis of an environmental sample submitted as part of a whole genome shotgun project | + | - | - |
| moisture | Quantity of water contained in the material on a volumetric or gravimetric basis (soil samples) | - | + | + |
| nitrate | Nitrate concentration in the environment at time of sampling | + | - | - |
| pcr_annealing_temperature | Annealing temperature used for PCR amplification of rRNA sequence | - | + | + |
| pcr_cycle_number | Number of PCR cycles used for PCR amplification of rRNA sequence | - | + | + |
| pH | pH value in the environment at time of sampling | + | + | - |
| phosphate | Phosphate concentration in the environment at time of sampling | + | - | - |
| poc | Particulate Organic Carbon concentration in the environment at time of sampling | + | - | - |
| pollutants | If sequence originates from a contaminated environment, a list of major contaminants with their concentrations at the time of sampling | - | - | + |
| project_name | Name of the sequencing project | + | - | - |
| quality_check_clone | Method used to check for chimera presence in clone libraries | - | - | + |
| quality_check_sequence | Software or cut-off value used for assessing sequence quality | - | - | + |
| salinity | Salinity in the environment at time of sampling | + | + | - |
| sample_identifier | A unique identifier (ID) given to the sample that allows samples and contextual data to be cross-referenced | + | - | - |
| sample_material | Describes the sample material that was collected, e.g. water, sediment, biofilm, vent fluid | + | - | + |
| sample_quantity | Size (mass or volume) of the sample that was collected | + | + | - |
| sample_treatment_preservation | Description of methods applied to the sample after it was taken from the environment, e.g. fixation, de-aggregation, filtration, enrichment steps | - | + | + |
| sampling_technique | Apparatus used to sample, e.g. push-core, ROV | - | - | + |
| sequencing_tech | Sequencing method used, e.g. Sanger, pyrosequencing, ABI SOLiD | - | - | + |
| sequencing_template | Source of sequencing template, e.g. clone library, DGGE band, metagenomic clone | - | - | + |
| silicate | Silicate concentration in the environment at time of sampling | + | - | - |
| size_filtration | If sample was filtered, the upper and lower size cut-off values (water samples) | - | - | + |
| sulfide | Reduced sulfur compounds (sulfide, polysulfide…) concentration in the environment at time of sampling | - | - | + |
| temperature | Temperature in the environment at time of sampling | + | - | - |
The slides of the survey result talk can be downloaded here: http://gensc.org/gc_wiki/images/3/3a/Survey_results.pdf
Table 3: Current implementation of contextual data fields in the ARB/SILVA workflow
| Fields/Names ver 1.0 (Note: only fill appropriate fields!) | Description of fields | Units to be used (if applicable) | INSDC status | MIGS/MIMS status |
|---|---|---|---|---|
| clone-lib | clone library (ID) from which the sequence was obtained | | + | - |
| strain | strain (ID) from which the sequence was obtained | | + | - |
| fingerprint_lib | fingerprint library (ID) from which the sequence was obtained | | - | - |
| altitude | the altitude of sampling location above sea level | m | - | + |
| chlorophyll | chlorophyll concentration in the environment at time of sampling | mg/m3 | - | + |
| collected-by | name of the person who collected the sample | | + | - |
| collection_time | time when the sample was collected in hours and minutes | HH:MM | - | + |
| collection-date | date when the sample was collected (you must use format 23-Mar-2005, Mar-2005, or 2005) | DD-MMM-YYYY | + | - |
| country | geographical origin of sample | | + | - |
| culture-collection | identifier and institution code of the microbial or viral culture or stored cell-line from which the sequence was obtained | | + | + |
| depth | depth of the water column or sediment from where the sample was collected | m | - | + |
| dissolved_oxygen | dissolved oxygen concentration in the environment at time of sampling | ml/l | - | + |
| DOC | dissolved organic carbon concentration in the environment at time of sampling | mg/l | - | + |
| environmental-sample | Definition: identifies sequences derived by direct molecular isolation from a bulk environmental DNA sample (by PCR with or without subsequent cloning of the product, DGGE, or other anonymous methods) with no reliable identification of the source organism. Environmental samples include clinical samples, gut contents, and other sequences from anonymous organisms that may be associated with a particular host. They do not include endosymbionts that can be reliably recovered from a particular host, organisms from a readily identifiable but uncultured field sample (e.g., many cyanobacteria), or phytoplasmas that can be reliably recovered from diseased plants (even though these cannot be grown in axenic culture). | TRUE/FALSE | + | + |
| fwd-pcr-primer-seq | sequence of the forward primer used in PCR reaction | | + | - |
| geodetic_datum | latitude and longitude values can be based on several different geodetic systems or datums (the most common is the WGS 84 used by all GPS equipment) | | - | - |
| habitat | description of the habitat like marine, freshwater etc. | | - | + |
| haplotype | haplotype of the organism | | + | + |
| identified-by | name of the taxonomist who identified the specimen | | + | - |
| isolation-source | describes the physical, environmental and/or local geographical source of the biological sample from which the sequence was derived | | + | - |
| lab-host | laboratory host used to propagate the organism from which the sequence was derived | | + | - |
| lat_lon_details | details of the measurement of geographic coordinates, like: Was latitude and longitude measured by GPS, derived from map, retrieved from literature? | | - | - |
| lat-lon | latitude and longitude of location where sample was collected, mandatory format is decimal degrees N/S E/W | dd.dd N/S dd.dd W/E | + | + |
| metagenomic | identifies a sequence from a culture-independent genomic analysis of an environmental sample, submitted as part of a whole genome shotgun project | TRUE/FALSE | + | + |
| nitrate | nitrate concentration in the environment at sampling time | micromol/l | - | + |
| organism | name of species e.g. Escherichia coli | | + | - |
| pH | pH measurement value in the environment at sampling time | | - | + |
| phosphate | phosphate concentration in the environment at sampling time | micromol/l | - | + |
| plasmid-name | name of the plasmid which was used for cloning | | + | - |
| project_name | name of the sequencing project | | - | + |
| POC | particulate organic carbon concentration in the environment at time of sampling | mg/l | - | + |
| rev-pcr-primer-seq | sequence of the reverse primer used in PCR reaction | | + | - |
| salinity | salinity of a water sample at sampling time | PSU | - | + |
| sample_identifier | your label (ID) of the environmental sample | | - | - |
| sample_material | kind of sample material (water, sediment, biofilm, vent fluid etc.) | | - | - |
| sample_volume | exact volume of a water sample | ml | - | + |
| silicate | silicate concentration in the environment at sampling time | mg/l | - | + |
| specific-host | if the sequence originates from an organism that exists in a symbiotic, parasitic, or other special relationship with some second organism, use this modifier to identify the name of the host species | | + | + |
| specimen-voucher | identifier of the physical specimen from which the sequence was obtained (mandatory format is "institution code:collection code:specimen_id") | | + | + |
| sub-species | subspecies of organism from which sequence was obtained | | + | + |
| temperature | exact temperature at sampling site (e.g. water temperature) at sampling time | degree Celsius | - | + |
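
Two of the fields in Table 3 prescribe a fixed format: lat-lon ("dd.dd N/S dd.dd W/E") and collection-date (23-Mar-2005, Mar-2005, or 2005). The Python sketch below shows one way such values could be checked before submission. The function names and the regular expression are assumptions for illustration; the sketch does not reproduce the validation performed by ARB/SILVA or the INSDC, and the month-name check relies on an English locale because it uses `%b`.

```python
import re
from datetime import datetime

# "dd.dd N/S dd.dd W/E": decimal degrees followed by a hemisphere letter.
LAT_LON = re.compile(r"^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$")

# The three collection-date forms listed in Table 3.
DATE_FORMATS = ("%d-%b-%Y", "%b-%Y", "%Y")


def parse_lat_lon(value):
    """Return (latitude, longitude) as signed decimal degrees."""
    match = LAT_LON.match(value.strip())
    if match is None:
        raise ValueError(f"not in 'dd.dd N/S dd.dd W/E' form: {value!r}")
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        raise ValueError(f"coordinates out of range: {value!r}")
    return lat, lon


def is_valid_collection_date(value):
    """True if the value matches one of the three accepted date forms."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


print(parse_lat_lon("53.08 N 8.80 E"))
print(is_valid_collection_date("Mar-2005"))
```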
The Habitat Field
Controlled vocabulary descriptions of habitats like marine, freshwater, contaminated etc. can now be covered through the use of the terms from the Environment Ontology (EnvO) project. A high-level short list of terms (derived from EnvO) is also available in Habitat-Lite.
The Ribosomal Database Project is currently running a user survey on habitat terms that are most important to users based on the GSC's Habitat-Lite project (a list of <30 high-level descriptors of habitat derived from EnvO).
How can sequence submission to the INSDC effectively be handled?
Although contextual data that go beyond what the INSDC captures can already be stored in local databases, e.g. using the ARB/SILVA system, no standards for the submission of these data to the INSDC have been defined so far.
A clear, universal standard for the reporting requirements of contextual data beyond existing INSDC fields is urgently needed, independent of the software used to process the ribosomal RNA data. This is a prerequisite for finally being able to consume such information when downloading and processing sequences from the INSDC.
This means that the easiest and quickest way to enrich our public sequence databases is to start defining the rRNA reporting requirements and ask people to make compliant submissions to the INSDC.
This is possible through two main mechanisms:
- submissions that conform to optional qualifiers in the INSDC feature tables (e.g. /lat_lon)
- submission of additional MIGS/MIMS/MIENS compliant fields using a defined COMMENT BLOCK.
This can include structured data and db_xrefs to link out to other databases where the data might be more elaborately displayed/searched etc.
To proceed, we need to take the MIGS/MIMS/MIENS reporting requirements and define the structure for the COMMENT BLOCK.
Examples of this type of submission already exist for the HIV community (see below for examples).
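To make the COMMENT BLOCK idea more concrete, the Python sketch below serializes a set of contextual data fields as key=value pairs between START and END delimiter tags, following the pattern of the HIVData example quoted later on this page. The community name `MIENSData`, the function name, and the example values are placeholders; no such block has been defined by the GSC or the INSDC.

```python
def build_comment_block(fields, community="MIENSData"):
    """Serialize fields as 'key=value;' pairs between START/END tags.

    The community name is a placeholder modelled on 'HIVData'.
    """
    parts = []
    for key, value in fields.items():
        text = str(value)
        # ';' separates pairs and '=' separates key from value,
        # so neither may appear where it would break parsing.
        if ";" in key or "=" in key or ";" in text:
            raise ValueError(f"cannot serialize field {key!r}")
        parts.append(f"{key}={text};")
    return f"##{community}-START## " + " ".join(parts) + f" ##{community}-END##"


# Hypothetical values, for illustration only.
print(build_comment_block({"habitat": "marine", "depth": 10}))
```

Rejecting values that contain the separator character is the simplest option; an agreed escaping rule would be an alternative if such values turn out to be needed.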
Defining community-based COMMENT BLOCK in INSDC docs
Below is an explanation from Ilene about how the HIV community is using the COMMENT BLOCK option and how we would proceed if the GSC wants to do something similar:
If GSC or another community were to adopt this, we would need a community name (like HIVDatabase) and a list of potential fields. Let me know if you want me to write something up or is this enough information.
We have come up with a solution for incorporating extra information into GenBank submissions. We are going to allow different user communities to provide additional source information which will be incorporated into a structured COMMENT in the GenBank flat file. An example for this can be found in DQ526029 http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&val=108736317
This information was submitted as:
Information in this section is searchable in HIV Database at Los Alamos (www.hiv.lanl.gov).
##HIVData-START## Funding=OTHER; Sequence Name=B1PBL10; Patient code=B1; Subtype=B; Sample tissue=PBMC; Risk Factor=Mother-->Baby; Note=Baby of patient M1, sequenced at birth; Note=Env; Patient sex=Male; Patient age=0; Note=; Days from seroconversion=0; Days post-infection=0; Patient health status=asymptomatic; CD4 count=3305; Sample country=US; Sample city=Los Angeles; Sample date=6-2-97; Infection Country=US; Infection city=Los Angeles;DBID=1560819839; ##HIVData-END##
The HIVData-START and HIVData-END tags appear at the beginning and end of the comment as a delimiter to allow for easy parsing. If this community would like to use this tabular format to show additional source information in a GenBank record, please let me know and I will be happy to work with you to get this info into GenBank.
Best regards,
Ilene Karsch Mizrachi, PhD
GenBank Coordinator
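The delimiter tags are what make such a comment easy to parse. As a minimal sketch, assuming the key=value; layout seen in the HIVData example above, the following Python function extracts the pairs from a block. The function name is hypothetical, and the sketch is not the parser used by GenBank or the HIV Database.

```python
import re


def parse_comment_block(text, community="HIVData"):
    """Return the (key, value) pairs found between the START and END tags.

    A list of pairs is returned rather than a dict because keys can
    repeat, as 'Note' does in the HIVData example.
    """
    tag = re.escape(community)
    pattern = re.compile(rf"##{tag}-START##(.*?)##{tag}-END##", re.DOTALL)
    match = pattern.search(text)
    if match is None:
        return []
    pairs = []
    for item in match.group(1).split(";"):
        if "=" not in item:
            continue  # skip the empty remainder after the last ';'
        key, value = item.split("=", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


# Shortened excerpt of the example above.
excerpt = "##HIVData-START## Subtype=B; Note=Env; Note=; Sample country=US; ##HIVData-END##"
for key, value in parse_comment_block(excerpt):
    print(key, "->", value)
```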
An 'ideal' rRNA submission template
Ideally, this community can offer a template for very rich submission of data to GenBank/EMBL/DDBJ. If we can get the community to adopt it, the major rRNA software tools (e.g. ARB) and databases (SILVA and RDP) would be able to harvest these data and build features such as more powerful search or sorting/grouping options.
With release 93 of the SILVA rRNA databases about 17 additional contextual fields have been introduced and can be easily filled in by the users using the ARB software package. The additional fields are documented in the environmental parameters section of the SILVA fields description.
To easily merge sequence data with contextual data a suggestion for an integrated work flow can be found here: SILVA Metadata workflow
The last 3 slides in James Cole's talk at the 5th GSC Workshop contain 16S entries in GenBank. The first is truly minimal and the next two have progressively more information but could still be richer: http://gensc.org/gc_wiki/index.php/Image:Cole_gsc_dec_07.ppt
A to do list
This is largely a similar path to the development of MIGS/MIMS, except that there we didn't have an initial checklist to work from, just ideas from the community. Now, with rRNA, a first place to start is MIGS/MIMS in the context of GCDML. In fact, the general progression of any standardization project is to:
1. define the problem
2. form the community to tackle it
3. develop a checklist (scope)
4. progress to implementation
Potential standardization of rRNA submissions should, in principle, go faster than for genomes/metagenomes, because the problem is largely defined and a potential community to tackle the problem is likely available in the form of the GSC. The concrete steps are:
1. compare MIGS/MIMS to minimum reporting ideals for ribosomal RNA (literally, review the MIGS/MIMS checklist in light of rRNA sequences, adding a new 'report type' of rRNA and seeing if we can manage it - how complete is it?)
2. working group weighs existing INSDC/MIGS/MIMS fields and/or suggests new contextual data fields for environmental sequences based on community surveys
3. implementation: definition of field names and description of fields
4. incorporate minimum reporting requirements in GCDML
5. generate a checklist indicating the significance of the field and decide on additional fields by GSC community
6. define how we can structure such data for submission to the INSDC using a structured COMMENT BLOCK (of course can be held in GCDML, exchanged in GCDML)
7. generate user-friendly templates and tools that aid users in achieving compliance to allow standardized submission through INSDC submission tools
8. make some compliant submissions, advertise them as examples of improved reporting conformance
9. improve databases in order to consume/display such information (for example to read new fields about habitat, and perhaps give searching/sorting options)
Examples
Many of us are making large submissions in the next few months, and if we could adhere to the principles we set out here we would have some good examples for validating the process and 'completing the circle'. In other words, we need examples of richly annotated rRNA sequences submitted to the INSDC that over time are harvested by the key rRNA databases and therefore become available to serve as further good templates for future submissions. If the number of richly annotated sequences becomes significant, tools and databases will start to make increasing use of the data. This is when the value of this process will become evident.
This is a project for the long term and there will be a tremendous amount of legacy data, but it should be worth it for the future, especially given the inclusion of key players from the INSDC and the rRNA community in the GSC and the huge amount of work that has gone into MIGS/MIMS.
Case study: Bergen Experiment 16S clone libraries submissions to EMBL
The Bergen Experiment 16S case study involves the submission of >3,000 16S sequences to EMBL (Webin). All of the 16S data was generated from water samples taken in Bergen that were accompanied by rich experimental and environmental data (biogeochemical measurements).
The Original Call for richer reporting of 16S sequences
There has been a long-standing general call from many in the 16S community for richer reporting of sequences submitted to the INSDC. A few years ago, the JGI (Phil Hugenholtz) initiated a survey of fields, in particular habitat, for the richer submission of 16S sequences: http://www.jgi.doe.gov/16s/
The intention of this important survey was to select additional descriptors for submission to the INSDC. Since then, the INSDC has devised a mechanism by which such rich fields (of key=value type) can be placed 'legally' into a formal INSDC document. This mechanism is the comment block, which is described above.
The GSC drew on this survey in principle, along with other sources collecting metadata to describe genomes, to help define the scope and content of the MIGS/MIMS specification.
Now that the GSC has formed and MIGS/MIMS has been published, it is time to re-evaluate, and hopefully push forward on, richer reporting of rRNA sequences as well, especially since the INSDC comment block could be available for genomes/metagenomes and rRNA sequences. The MIGS/MIMS publication in Nat Biotech highlights how important these 'halos' of rRNA sequences are for the analysis of genomes/metagenomes. There is also a vast and rapidly growing collection of rRNA sequences that, in its own right, could be greatly enriched in value by access to more, and better structured, metadata.