
Major Requirements for 16S and Biodiversity

From Genomic Standards Consortium



Extending GCDML for Biodiversity - High-Level Requirements

5th GSC Workshop

Cambridge, England, 14 December 2007

This document is the result of a discussion among Philip Goldstein, Jim Cole and Tim Booth on 14 December 2007 at the 5th GSC Workshop. Comments from Jim on an earlier draft have been incorporated below. This draft is by Philip Goldstein.

This document originates from two interests that often converge in practice, though they can be treated as separate functions: 1) exchanging 16S data, and 2) managing information about biodiversity derived from collections of gene segments, of which 16S is often a major example.

Though 16S is the original example of a single gene or gene segment considered here, these requirements are intended to cover other genes and segments as well.

It is noted that there may be significant overlap between requirements to manage biodiversity data and requirements for genomic/metagenomic data and analysis. The initial priority here is to explore biodiversity. Comments related to genomics/metagenomics are welcome.


Note the following assumptions:

Assumption #1: The primary subject for the following notes is microbial 16S/18S, but all topics in these notes apply to any gene or portion of a gene used for biodiversity studies. Any reference to 16S below applies equally to other genes, unless otherwise noted.

Assumption #2: The data set referred to as "Biodiversity" or "a Biodiversity Study" is a set of multiple sequence records that are comparable, but distinct along one or more relevant dimensions, such as location, taxonomy, environmental conditions, time, etc.

Assumption #3: A Minimum Information set is not yet defined for this information. Assume that the MIGS/MIMS checklist will be extended in parallel with this GCDML extension. MIGS/MIMS will then validate the Minimum Information for this GCDML extension. Comments on Minimum Information are requested. Apologies if any of the following capabilities are already in GCDML and we have overlooked them.

Extensions required for GCDML:

Requirement #1: Identify each gene (required), segment (optional?) and alignment (debatable) and associate the sequence record with all applicable GCDML sequence annotation.
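As a rough illustration of Requirement #1, the following Python sketch models a sequence record in which the gene is mandatory while the segment and alignment stay optional. The class and field names are hypothetical and are not GCDML element names; the annotation field simply stands in for whatever existing GCDML sequence annotation applies.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SequenceRecord:
    """Hypothetical record layout; names are illustrative, not taken from GCDML."""
    record_id: str
    gene: str                            # required
    segment: Optional[str] = None        # optional, e.g. a variable region
    alignment_id: Optional[str] = None   # debatable, so kept optional here
    annotation: dict = field(default_factory=dict)  # stand-in for linked annotation

    def __post_init__(self):
        # Reject records that do not identify a gene.
        if not self.gene.strip():
            raise ValueError("gene is required")


rec = SequenceRecord("r1", "16S rRNA", segment="V4",
                     annotation={"habitat": "soil"})
```

Whether the segment and alignment should become mandatory is left open here, as it is in the requirement itself.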

Requirement #2: Define a working-set data structure that refers to the set of multiple instances upon which biodiversity analyses will be performed. A working set is analogous to a "playlist" in that it is a user-defined list of references to entities in a database. The act of defining a working set does not alter the source database.

Requirement #3: Maintain multiple working sets.

Requirement #4: Give each working set a unique name or ID and a means of persistent storage, so processes and results associated with each working set can be referenced and repeated as desired.

Requirement #5: Maintain linkage to all data associated with each sequence in each working set throughout all processes (that is, retain links to geography, environmental data, etc).
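Requirements #2 to #5 above can be pictured with a small Python sketch. It assumes a toy in-memory source database and JSON files as the means of persistent storage; the names SOURCE_DB, WorkingSet and WorkingSetStore, and all sample values, are invented for illustration and are not part of GCDML.

```python
import json
import tempfile
from dataclasses import dataclass, field, asdict
from pathlib import Path

# Hypothetical source database: record ID -> sequence-associated data.
SOURCE_DB = {
    "seq1": {"gene": "16S", "location": "Cambridge", "depth_m": 0.5},
    "seq2": {"gene": "16S", "location": "Plymouth", "depth_m": 10.0},
    "seq3": {"gene": "18S", "location": "Plymouth", "depth_m": 10.0},
}


@dataclass
class WorkingSet:
    set_id: str                                     # unique name or ID (#4)
    record_ids: list = field(default_factory=list)  # references only, like a playlist (#2)

    def resolve(self, db):
        """Follow each reference back to its full source record (#5)."""
        return {rid: db[rid] for rid in self.record_ids}


class WorkingSetStore:
    """Keeps several working sets (#3), one JSON file each (#4)."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, ws):
        path = self.directory / f"{ws.set_id}.json"
        path.write_text(json.dumps(asdict(ws)))

    def load(self, set_id):
        data = json.loads((self.directory / f"{set_id}.json").read_text())
        return WorkingSet(**data)

    def list_ids(self):
        return sorted(p.stem for p in self.directory.glob("*.json"))


store = WorkingSetStore(tempfile.mkdtemp())
store.save(WorkingSet("plymouth-all", ["seq2", "seq3"]))
store.save(WorkingSet("all-16S", ["seq1", "seq2"]))

reloaded = store.load("all-16S")
linked = reloaded.resolve(SOURCE_DB)  # geography etc. still reachable
```

Because a working set stores only record IDs, defining or deleting one never touches the source database, and the associated data are looked up when needed rather than copied.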

Requirement #6: Define data structures to capture analytical methods at a chosen level of detail (relevant level of detail not yet determined). Record methods, tools, software, parameters, iterations, run-time messages, etc. This requirement parallels the products of other genome analyses addressed in GCDML, but it is not expected to fall within the minimum information in initial implementations.
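One possible shape for such a structure is sketched below in Python. The level of detail is an assumption, since the relevant level has not been determined, and the class name, field names and tool name are hypothetical.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class AnalysisRun:
    """Hypothetical provenance record for one analysis of one working set."""
    working_set_id: str                              # which working set was analysed
    method: str
    tool: str
    tool_version: str
    parameters: dict = field(default_factory=dict)
    iterations: int = 1
    messages: list = field(default_factory=list)     # run-time messages

    def log(self, message):
        self.messages.append(message)


run = AnalysisRun(
    working_set_id="all-16S",
    method="neighbour-joining tree",
    tool="example-tree-builder",   # invented tool name, for illustration only
    tool_version="0.1",
    parameters={"distance_model": "Jukes-Cantor", "bootstrap": True},
    iterations=100,
)
run.log("2 sequences aligned")

record = asdict(run)  # plain dict, ready to serialise alongside the working set
```

Keeping the working set ID inside the record is what would allow a run to be repeated later against the same set of sequences.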

Requirement #7: Define or reuse data structures to capture results of processing and analysis such as phylogenetic trees, genetic and/or geographic distance matrices, biodiversity/richness indicators, etc. This concept still needs to be prioritised in relation to what to include and at what level of detail.
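To make two of these result types concrete, here is a small Python sketch that builds a geographic distance matrix and two simple diversity indicators. The choice of the haversine great-circle distance and the Shannon index is ours, made only for illustration; the site coordinates and taxon labels are invented, and none of the names come from GCDML.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_matrix(sites):
    """Nested dict: site ID -> site ID -> distance in km."""
    ids = sorted(sites)
    return {i: {j: haversine_km(sites[i], sites[j]) for j in ids} for i in ids}


def richness(taxa):
    """Number of distinct taxa observed."""
    return len(set(taxa))


def shannon_index(taxa):
    """Shannon diversity H = -sum(p * ln p) over observed taxon proportions."""
    counts = Counter(taxa)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Approximate, illustrative coordinates and taxon assignments.
sites = {"cambridge": (52.20, 0.12), "plymouth": (50.37, -4.14)}
matrix = distance_matrix(sites)
taxa = ["Bacillus", "Bacillus", "Vibrio", "Pseudomonas"]
summary = {"richness": richness(taxa), "shannon": shannon_index(taxa)}
```

Phylogenetic trees would need a richer structure than this and are deliberately left out of the sketch.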


Extending MIGS/MIMS and GCDML to 16S sequences - some specific considerations

In addition to the high-level, longer-term goals outlined above, there are shorter-term goals on which the GSC community should be able to make quick progress. The specific issue of generating richer submissions of 16S genes to GenBank/EMBL/DDBJ in the context of MIGS/MIMS and GCDML is considered on this page: MIGS/MIMS_for_16S
