GSC Case Studies
From Genomic Standards Consortium
The issues involved in defining the MIGS/MIMS specification can be best explored by working with 'real-world' examples. We are therefore working with members of past, ongoing, and future genome projects to describe particular genomes, groups of genomes, or metagenomes.
The case studies below (6 to 8) informed the development of MIGS version 1.2:
Case Study 9. The Megx/Genomes Mapserver database. Thierry Lombardot, Renzo Kottmann, Frank Oliver Glöckner
The Megx/Genomes Mapserver database (http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D390) has been developed to combine genomic sequence data with ecological and environmental data and is OGC compliant. Megx aims to implement MIGS/MIMS when this standard becomes stable (end 2007, v2.0). The information content of MIGS/MIMS will be consumed from the Genome Catalogue rather than through direct data capture. Consumption would be through Web services. The purpose of consuming the information is to integrate it with internal data stores and so provide improved queries, sorting, searching, and mining through the available interfaces.
Case Study 8. The Alpine Microbial Observatory Database. Rob Guralnick, Philip Goldstein, University of Colorado
The Alpine Microbial Observatory (http://amo.colorado.edu/) uses a relational, spatially-enabled database to tie sequence information to environmental data. The AMO database carries the x, y, z, and t measurements that are core to MIGS/MIMS. A mapping of MIGS/MIMS to the current schema is being undertaken by Philip and is helping to inform the future specification. Many elements in MIGS/MIMS map to existing tables in the AMO database, while some elements would require a new AMO table. The current version of MIMS could be captured adequately in AMO's 'biogeochemical' variables table, through which any measured variable can be coded and georeferenced. Mapping of the schema, together with selected enhancements, will enable AMO to import and export data compatible with the MIGS/MIMS specification. This is a case of proactive implementation of MIGS/MIMS.
This work is proceeding through a series of telecons and exchange of respective schema information between AMO and other GSC participants.
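A generic measured-variable table of the kind described for AMO might look something like the sketch below. It is an illustration only: the table name `biogeochemical_variable`, its columns, and the sample values are hypothetical and are not taken from the AMO schema. The sketch uses Python's built-in `sqlite3` module and stores each measurement as one row tagged with x, y, z, and t.

```python
import sqlite3

# Hypothetical schema (not the real AMO schema): one row per measured
# variable, georeferenced by x, y, z and t.
SCHEMA = """
CREATE TABLE biogeochemical_variable (
    sample_id TEXT NOT NULL,
    variable  TEXT NOT NULL,   -- name of the measured variable
    value     REAL NOT NULL,
    unit      TEXT NOT NULL,
    x         REAL NOT NULL,   -- assumed: longitude
    y         REAL NOT NULL,   -- assumed: latitude
    z         REAL NOT NULL,   -- assumed: elevation or depth
    t         TEXT NOT NULL    -- assumed: ISO 8601 timestamp
)
"""


def build_example_db():
    """Create an in-memory database holding two made-up measurements."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        ("S1", "pH", 5.2, "pH units", -105.6, 40.05, 3500.0, "2006-07-01T10:00:00"),
        ("S1", "soil_temperature", 4.5, "degC", -105.6, 40.05, 3500.0, "2006-07-01T10:00:00"),
    ]
    conn.executemany(
        "INSERT INTO biogeochemical_variable VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    return conn


def variables_for_sample(conn, sample_id):
    """Return (variable, value, unit) tuples for one sample, sorted by name."""
    cur = conn.execute(
        "SELECT variable, value, unit FROM biogeochemical_variable "
        "WHERE sample_id = ? ORDER BY variable",
        (sample_id,),
    )
    return cur.fetchall()
```

Because every measurement is a row rather than a column, a new MIMS variable could be recorded without altering the table, which is the property the paragraph above attributes to the 'biogeochemical' variables table.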
Case Study 7. CAMERA. Sam Angiuoli, Nelson Axelrod, Aaron Gussman, Saul Kravitz, Leonid Kagan, Kevin Li, JCVI
What is CAMERA? (http://camera.calit2.net/about-camera/index.php) Available metagenomic data sets (http://camera.calit2.net/about-camera/full_datasets.php)
The CAMERA (http://camera.calit2.net/) project is looking to MIMS to define a standard for the content and format of metadata for metagenomic sequencing projects that would serve as the interchange format for import and export of such metadata to/from repositories. MIMS could also set minimum expectations for metadata content. To date, CAMERA has imported metagenomic sequence from four sources, each with its own distinctive issues. In particular, each project required extensive interactions to understand the format and semantics of provided metadata.
CAMERA's perspective on MIGS/MIMS is driven by the practical aspects of metadata interchange -- can existing metadata be captured within MIMS? We will present case studies based on samples from the Global Ocean Sampling project (JCVI), the SDSU Marine Viromes project, and the MIT/C-MORE HOT/ALOHA project.
This work is proceeding through a series of GSC telecons. A Change Log of specific changes to MIGS v1.1 is being produced for the 4th GSC Workshop (See GSC Meetings) for ratification as MIGS/MIMS v1.2.
Case Study 6. Ocean acidification mesocosm study, metagenomic libraries and 454 data. Jack Gilbert, Ian Joint, Plymouth Marine Laboratory
A mesocosm experiment was performed in Bergen, Norway in May 2006. This experiment was part of a national collaboration to investigate the impact of ocean acidification on the marine microbial community. A large quantity of metadata was accumulated for each sample over the 21-day experiment. We have constructed large-insert metagenomic fosmid libraries for key dates within the experiment, and have produced a ~35 Mbp pyrosequence metagenome for each of an acidified and a non-acidified sample. These data, combined with a full range of physical, chemical, biogeochemical, and flow-cytometry data, 16S/18S gene libraries, SIP analysis, phytoplankton counts, etc., have produced the most extensive dataset yet on the changes to community structure due to ocean acidification. A standard for handling the huge quantity of sequence data in the context of the available metadata is essential if the dataset is to be interrogated with significant rigour.
The case studies below (1-5) informed the development of the original MIGS discussion document (version 0.9):
Case Study 1. Thauera sp. Gareth Wilson & Andy Whiteley, CEH Oxford
This eubacterial genome has recently been selected for sequencing because it has been shown to be the dominant organism in a phenol degradation bioreactor from an industrial waste water treatment plant.
The description of this isolate is currently limited because this organism does not yet have a taxid (there are no representative sequences in GenBank), the isolation method has not been published, the strain is not deposited in a culture collection, and the sequencing project has not yet begun.
Key information to capture is that this organism is aerobic (it is the first in this group, identified on the basis of 16S sequences, not to be an anaerobe). Its environmental context is also particularly relevant because the interesting phenotype known to date is the ability to degrade phenol. It is important to note that the waste water treatment plant is industrial not municipal (for chemical composition). The information for "geographic location" is confounded by a confidentiality agreement with the industrial supplier of the bioreactor sample, and is therefore very broadly defined as "unspecified waste water treatment plant in the north of England".
Case Study 2. Environmental plasmids. Adrian Tett, Andy Lilley, Sarah Turner, CEH Oxford
These plasmid genomes have been selected for sequencing because of their ecological relevance. Key phenotypes that are important to record are their ability to promote plant growth, their resistance to mercury, and the fact that pQBR103 was captured in vivo using a genetically modified Pseudomonad (field trial). These are examples of genomes that are too dynamic to be represented by a 'type' strain. The 'core' genes or backbone shared between them have not yet been identified, and the plasmids can range in size from 150 kb to over 400 kb. They show extensive genomic diversity in nature and yet are persistently found over several years at the same geographic location.
Taxonomy is unconventional for plasmids because as a group they do not share a common ancestry; rather, the taxonomy of the host is used in the NCBI taxonomy. This means that the taxonomy reported is that of the host from which the plasmid was isolated. There could potentially be many hosts, and often the host range is unknown.
Attempting to describe these plasmids also highlights the fact that several of the ecological attributes actually describe the host, not the plasmid. This can make a specification difficult to apply without further refinement of attribute definitions or the ability to make attributes repeatable and qualified as belonging to 'host', 'plasmid', or 'both'. Plasmids with a broad host range should also be described, as they will present particular challenges.
Describing these plasmids highlights the importance of the host as 'environmental context' and 'primary habitat'. Suggested attributes included 'HostRange'. Since plasmids can be submitted to culture collections in different potential hosts, a field "AvailableIn" along with a field "Description" was suggested for "StrainCollection" to better capture the details of the source. Also suggested was the addition of an abstract repeatable field "OtherTypingMethod" to capture genetic groupings under (or instead of) the rank of species. For example, plasmids in this group are usually described according to their RFLP profile as belonging to Groups I-V.
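The idea of repeatable attributes qualified as belonging to 'host', 'plasmid', or 'both' can be sketched as a small data structure. This is only one possible reading of the suggestion: the class names, the `applies_to` qualifier, and the placeholder values are hypothetical, and only the attribute names 'HostRange' and 'OtherTypingMethod' come from the discussion above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class AppliesTo(Enum):
    """Hypothetical qualifier saying what an attribute value describes."""
    HOST = "host"
    PLASMID = "plasmid"
    BOTH = "both"


@dataclass
class QualifiedAttribute:
    name: str
    value: str
    applies_to: AppliesTo


@dataclass
class PlasmidDescription:
    name: str
    attributes: List[QualifiedAttribute] = field(default_factory=list)

    def add(self, name: str, value: str, applies_to: AppliesTo) -> None:
        """Attributes are repeatable, so the same name may be added many times."""
        self.attributes.append(QualifiedAttribute(name, value, applies_to))

    def values(self, name: str, applies_to: Optional[AppliesTo] = None) -> List[str]:
        """Return all values for an attribute, optionally filtered by qualifier.

        A value qualified as BOTH matches a request for HOST or PLASMID.
        """
        return [
            a.value
            for a in self.attributes
            if a.name == name
            and (applies_to is None or a.applies_to in (applies_to, AppliesTo.BOTH))
        ]
```

With such a structure, a plasmid with several known hosts would simply carry several 'HostRange' entries, and a habitat that describes the host rather than the plasmid would be marked as such instead of being attributed to the plasmid itself.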
Case Study 3. Baculoviruses. Sarah Turner, CEH Oxford
Baculoviruses are ecologically important viruses that cause disease in insects. They are sometimes used as biocontrol agents. There are now over 25 complete baculovirus genomes in GenBank. These are large double-stranded DNA viruses that can vary in size from 120 kb to 180 kb. Large numbers of genomes are often available for viruses, and ecological information is of prime importance. Sequencing is simple, but comparison is difficult unless one has access to epidemiological data.
Key descriptors to capture include geographic origin and host range. Isolates are named after the host from which they were originally isolated. Therefore viruses that have distinct names can often be part of the same genetic complex. Understanding the host range, in addition to genetic relationships, is essential. This is a frequent issue in the study of viruses. The difficulty of archiving environmental samples extends to viruses. There are paradigm viruses, but not 'type' strains. Acal is the paradigm virus for baculoviruses. The abstract repeatable attribute 'environmental condition' was useful in capturing the host environment (e.g. the stage in the insect lifecycle in which the virus was isolated).
These genomes are not from a clonal population (or at least this cannot be assured) and hence they are 'artificial', or composite, genomes that reflect the isolation and sequencing process rather than being an exact copy of a natural isolate. It was useful to capture this in the abstract repeatable concept 'phenotype'.
It seems clear that the information to be collected should be found in most full-length primary genome reports. As such, these genome descriptions could be published with primary genome reports as a supplementary table. Three genomes were described to compare how well the information in a published primary genome report compared to information that could be provided by the generators of the genome. It was found that probably 90% of equivalent fields could be gleaned from the third-party paper in this example, but the reader was already an expert on this group of viruses. Items missing from the genome paper included whether or not the isolate had been plaque purified and additional contact details.
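The comparison described above, how many fields can be filled from a published paper versus by the genome's generators, amounts to a completeness measure. A minimal sketch follows, assuming a description is held as a plain dictionary. The field names and placeholder values are hypothetical and are not drawn from the MIGS specification.

```python
def is_filled(value):
    """A field counts as filled if it is not None and not an empty string."""
    return value is not None and str(value).strip() != ""


def completeness(record, required_fields):
    """Return the fraction of required fields that are filled in `record`."""
    if not required_fields:
        raise ValueError("required_fields must not be empty")
    filled = sum(1 for name in required_fields if is_filled(record.get(name)))
    return filled / len(required_fields)


def missing_fields(record, required_fields):
    """Return the required fields that are absent or empty, in order."""
    return [name for name in required_fields if not is_filled(record.get(name))]


# Hypothetical field list and a record as it might be gleaned from a paper.
REQUIRED = ["geographic_origin", "host_range", "plaque_purified", "contact"]
from_paper = {
    "geographic_origin": "placeholder location",
    "host_range": "placeholder host",
    "contact": "placeholder contact",
}
```

Here `completeness(from_paper, REQUIRED)` gives 0.75 and `missing_fields(from_paper, REQUIRED)` reports the plaque purification field, mirroring the kind of gap noted in the case study.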
Case Study 4. Rabbit haemorrhagic disease viruses (RHDV). Naomi Forrester, CEH Oxford
This virus causes lethal outbreaks of haemorrhagic disease in wild and domestic rabbits, but domestic rabbits are far more susceptible. The virus was discovered in 1984, but antibodies to RHDV have been shown to be present in rabbits back to the 1950s. This virus is now represented by a total of 25 full-length genome sequences; 19 of these have been determined by Naomi Forrester in a study of population-level diversity and 6 are in GenBank. This virus cannot be cultured in the lab, and therefore isolates are akin to other environmental samples that are unique in time and place.
Naomi is interested in phylogeographic studies of this virus and therefore needs more information on geographic origin and environmental context. In this case, environmental context is the host from which the isolate was collected. While the host is always a 'rabbit', it is essential for Naomi's work to know whether the rabbit was "wild or domestic" and "healthy or dead". Naomi is interested not only in the 'origins' of complete genomes but also in the large number of partial sequences in GenBank.
The current specification seemed adequate to describe the salient features of these genomes. Also key is the ability to capture whether an RNA virus is positive or negative stranded. This is currently captured in the NCBI taxonomy, but not in the GenBank genome annotation (which captures strand and nucleic acid type e.g. ssRNA not ssRNA-/+).
Case Study 5. Various pathogenic bacteria. Salmonella typhi: Nick Thomson, Sanger
The Sanger Centre is responsible for sequencing a large number of pathogenic bacteria. As a first example, the genome of Salmonella typhi was described. It was selected for sequencing because it is a multi-drug resistant clinical isolate. The attributes "DepthofCoverage" and "ErrorRate" were suggested to describe the quality of the sequence. "Submitter" was felt to be essential in the Contacts section because only the person who submits a genome to the database is able to change it in the future. This person is often not the corresponding author. The abstract repeatable concept "OtherTypingMethod" was useful in capturing typing methods that are commonly used to describe pathogens, for example LPS and flagellar phenotypes. While it would be no problem to give a general description of the annotation method, as this is already published elsewhere, it was stressed that final annotation is based on the combined weight of many different factors as judged by one or more expert annotators, and therefore cannot be described exhaustively, nor is it exactly repeatable.
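The suggested quality and contact attributes could be held together in a small validated record, sketched below. The case study does not define units or value ranges, so the interpretation here (average fold coverage, and error rate as a fraction of bases between 0 and 1) is an assumption, as are the class and field names other than the three attribute names quoted above.

```python
from dataclasses import dataclass


@dataclass
class SequencingSummary:
    """Hypothetical record for the suggested quality and contact attributes."""
    depth_of_coverage: float   # "DepthofCoverage"; assumed: average fold coverage
    error_rate: float          # "ErrorRate"; assumed: errors per base, 0 to 1
    submitter: str             # "Submitter"; the person able to update the entry
    corresponding_author: str  # often a different person from the submitter

    def __post_init__(self):
        # Basic sanity checks under the assumptions stated above.
        if self.depth_of_coverage <= 0:
            raise ValueError("depth_of_coverage must be positive")
        if not 0.0 <= self.error_rate <= 1.0:
            raise ValueError("error_rate must lie between 0 and 1")
        if not self.submitter.strip():
            raise ValueError("submitter is required")
```

Keeping the submitter as a required field separate from the corresponding author reflects the point made above that the two are often different people.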
Planned Case Studies:
Thermotoga maritima, Karen Nelson, TIGR
Geobacter sulfurreducens, Barbara Methe, TIGR
Broad host range plasmids, Chris Thomas, University of Birmingham
Bacteriophage of marine prokaryote, Nick Mann
Various Synechococcus, Dave Scanlan and Martin Ostrowski
Metagenomic data, Victor Markowitz
Phage data in EBI Genome Reviews, Peter Sterk