
MIGS Change Log

From Genomic Standards Consortium

Everything with a date reflects revisions in the latest version of the MIGS checklist and the schema (and hence the Genome Catalogue).

The latest views of the MIGS specification are available here: MIGS/MIMS


The list of proposed changes is found here: Proposed Changes



May 29 2007


Created 1.1.1 to help incremental migration to 1.2. This is a test example showing how we can feed telecon changes into new versions that are committed to the SVN.


SVN Branch: http://gensc.svn.sourceforge.net/viewvc/gensc/schema/branches/MIGS-MIMS/migs-mims-v1.1.1.xsd?view=log


structuring MIGS in a more obvious way


  • genome_catalogue -> parent to migs - this is implementation specific and not part of migs
  • gdml -> made new root to schema - 'genomic data mark up language' - according to proposal from MPI Bremen


element name changes


  • habitat_type -> habitat - just a simplification
  • ocean_water -> water body - more generic and appropriate for MIMS (CAMERA, MPI Bremen)




May 2007


To help multiple groups make suggestions, and in response to Renzo's suggestion, Tanya implemented a schema SVN with branches for various groups to experiment with schema modifications. Made branches for releases and committed MIGS 1.1.


URL: http://gensc.svn.sourceforge.net/viewvc/gensc/schema/




Feb 2007


Started telecons with CAMERA, MPI Bremen and the AMO groups on implementing MIMS; these also included talks with RSBI and FuGE.




Jan 2007


XIDS added to test version of schema. This will allow better migration of content between versions of the schema and better use / display of elements and their definitions in the Genome Catalogue.




Major changes following the 3rd GSC workshop - creation of MIGS 1.1 from 1.0:

  • major revision to all aspects of the specification including dropped fields, renamed fields, re-organization of fields within Organism, Phenotype, and Sample processing, improved definitions, changes to how the fields are applied to taxa, and more fields made repeatable
  • input fields in the XML schema further restricted - as many as possible to a CV - for the sake of future validation (marked now in MIGS specification)

All changes are detailed below

Nov 13, 2006


Made Repeatable:


  • complete genetic lineage -> for instance, to capture when a bacterial isolate is known as a serovar and a biovar; this field will need post-processing to be parsed correctly


Added enumerations:


  • complete_genetic_lineage -> serotype 0:8, biotype 1B, "not distinct below taxid level"
  • encoded traits -> virulence (for a virulence plasmid)
  • culture collection -> NTCC
  • biotic relationship -> commensal capable of causing disease
  • habitat_type -> host associated, human host associated
  • Alphabetized enumerations when appropriate


Added:


  • accession -> to link genomes to DDBJ / EMBL / GenBank accessions
  • decimal -> added as an option to both lat and long


Changed Application to taxa:

  • added '000001' to extensions for now (only metagenomes)
  • added '000000' to sources to hide it for now
  • propagation -> not to apply to bacteria and archaea




Nov 9, 2006


  • specific_host -> added group attribute to this parent element (000110)


Started to extend schema to allow import of data from other sources:

The intention is to help place MIGS in context as an extension of information already collected in INSDC genome annotation files and the INSDC's Genome Projects database (see the GSC Roadmap).


Nov 7, 2006

Added:

  • parent element: other_sources
  • ncbi_genome_projects_prokaryotes -> complete representation of information for bacteria and archaea from NCBI genome projects database
  • gold -> complete representation of downloadable gold information (minus gold_stamp) which is in genome catalogue section



Nov 1, 2006


Added to genome catalogue section of XML schema to import from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/lproks_0.txt):


  • (project id and taxid already in this section)
  • ncbi_organism_name -> to the 'genome catalogue' section of the XML schema to allow import of data from NCBI
  • ncbi_status -> to the 'genome catalogue' section of the XML schema to allow import of data from NCBI, enumerated with: incomplete, assembly, complete
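
An enumerated field such as ncbi_status can also be checked before import. The following is a minimal sketch, not part of the schema: the helper name `is_valid_ncbi_status` is hypothetical, and only the three terms come from the list above. The comparison is exact (case-sensitive), which is an assumption about how strict the import should be.

```python
# Controlled vocabulary for ncbi_status, as enumerated in the change log.
NCBI_STATUS_TERMS = ("incomplete", "assembly", "complete")


def is_valid_ncbi_status(value):
    """Return True if `value` exactly matches one of the enumerated terms."""
    return value in NCBI_STATUS_TERMS


# Hypothetical input values; "draft" is not in the vocabulary.
checks = {term: is_valid_ncbi_status(term) for term in ("complete", "draft")}
```

A stricter or more lenient policy (for example, trimming whitespace first) would be a separate design decision.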


Added to genome catalogue section of XML schema to import from GOLD (http://www.genomesonline.org/):

  • gold_stamp -> the id for GOLD to allow hyperlinks to GOLD entries online




Sept 28, 2006



Renamed:

  • direction-longitude -> renamed to direction_longitude, longitude_restriction renamed to direction_longitude_restriction, latitude_restriction renamed to direction_latitude_restriction (same with _attributes)
  • finishing strategy (the parent element) -> renamed to finishing to avoid duplication of element name with child element of same name
  • source_material_id -> renamed to source_material_identifier for completeness
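
Renames like these could be applied to an existing XML report with a small mapping table. The following is a minimal sketch, assuming reports are plain XML without namespaces; the function name `rename_elements` and the sample report are hypothetical, and only the old and new element names are taken from this entry. The finishing strategy rename is left out because its exact XML element name is not given here.

```python
import xml.etree.ElementTree as ET

# Old -> new element names, taken from the "Renamed" list in this entry.
RENAMES = {
    "direction-longitude": "direction_longitude",
    "longitude_restriction": "direction_longitude_restriction",
    "latitude_restriction": "direction_latitude_restriction",
    "source_material_id": "source_material_identifier",
}


def rename_elements(root, renames=RENAMES):
    """Rename matching elements in place and return how many were changed."""
    changed = 0
    for element in root.iter():
        new_tag = renames.get(element.tag)
        if new_tag is not None:
            element.tag = new_tag
            changed += 1
    return changed


# Hypothetical sample report; the real schema layout may differ.
sample = ET.fromstring(
    "<report>"
    "<source_material_id>ABC 123</source_material_id>"
    "<direction-longitude>west</direction-longitude>"
    "</report>"
)
count = rename_elements(sample)
renamed_tags = [child.tag for child in sample]
```

Keeping the mapping as data rather than code makes it easy to extend when later versions rename further elements.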


Controlled vocabularies:

  • fold coverage -> added 1x-14x as a start


Dropped:

  • Optional Study and Assay - to remove any potential conflicts with MIGS 1.1

Moved:

  • Whether normally pathogenic or not -> moved from phenotype to Organism

Sept 25, 2006


Coded the MIGS table with:

  • 1 = yes, 2 = discuss further, 3 = drop, 4 = EMBL to propose as future INSDC qualifier, CV = Controlled Vocabulary


Moved:

  • ploidy level -> moved above number of replicons, apply it as eXtra to bacteria
  • presence of extrachromosomal elements -> moved from phenotype to below number of replicons
  • trophic level -> moved from phenotype to end of Organism
  • isolation and Growth conditions -> moved from Phenotype to Sample Processing
  • biotic relationship (previously relationship with host) -> moved from Phenotype to Organism
  • MIMS -> moved from Study to a new branch of the schema called extensions; added Study under MIMS


Changes in group attribute (application to different taxa):

  • estimated size -> now applied to all draft genomes ** example of a field dependent on another field - how to implement this?


Renamed:

  • number of chromosomes -> number of replicons; added to definition: always refer to haploid chromosome number, this now includes genetic elements like plasmids
  • reproductive mode -> propagation - to include things like incompatibility group for plasmids
  • references for the... -> name was too long, now reference for biomaterial; updated the definition as follows: If the primary genome publication is not the first isolation of the biomaterial sequenced, please enter the published reference that describes the isolation of the biological material used to generate the genomic / metagenomic sequence. Please identify this publication by PMID or DOI. Also made this a repeatable field as requested.
  • host -> specific host; also wanted to capture taxid (or unknown, or from the non-living environment - as in a pathogen collected from a swipe of a hospital bed, or an airborne virus) and whether this is a laboratory host or natural host and so this element was split into two input fields: specific host taxid and specific host classification
  • growth conditions -> renamed to isolation and growth conditions and will now take any number of SOP's (as links, PMID, DOI's)
  • relationship to host -> renamed biotic relationship; to be a CV of terms like free-living, pathogen, commensal, symbiont etc; moved from Phenotype to Organism, add CV terms
  • presence of extrachromosomal elements -> renamed to extrachromosomal elements

Dropped:

  • Specific source of sample -> this is now captured in specific host above - it was originally for viruses in case not isolated from specific host per se (e.g. from environment)
  • isolation conditions -> this originally referred to reference for isolation or enough information to isolate the organism - now redundant with reference for biomaterial and isolation and growth conditions
  • dna extraction -> nucleic acid extraction


Made Repeatable:

  • assembly -> made repeatable, but should it be tied to the sequencing method used if more than one; currently capturing this in CV with mixed sequencing types; also made sequencing repeatable
  • library construction -> many sequences are now finished using more than 1 library


Controlled vocabularies:

  • taxonomic group -> removed Organelle and replaced with mitochondrion, chloroplast, nucleomorph
  • taxonomic group -> removed prokaryote and replaced with bacteria and archaea as CV terms. Added comment to schema: "After Pace 2006 the GSC has decided not to allow the term prokaryote to mean a combination of the domains bacteria and archaea. Therefore they are both used although at this point the information collected for both is identical (i.e. not separate forms or separate group attributes in the MIGS XML schema implementation)"
  • taxonomic group -> added phage as CV term - we've collapsed down phage into viruses after the workshop, but will add phage as a type of taxonomic group within the MIGS specification (e.g. identical information is currently collected about viruses and phage)
  • sequencing method -> issue raised of whether this should be a repeatable field, but most common mix is "dideoxysequencing and pyrosequencing" so these were added as CV terms: dideoxysequencing and pyrosequencing for closure, mix of dideoxysequencing and pyrosequencing
  • specific host classification -> added the terms natural host, laboratory host, natural and laboratory host, environmental source, unknown
  • encoded traits -> added the terms mercury resistance, xenobiotic degradation, antibiotic resistance, converting genes
  • resource -> added ATCC, DSMZ, CCAP as examples of culture collections
  • propagation -> added lytic, lysogenic, incompatibility group, sexual, asexual
  • nucleic acid extraction -> added Wheatcroft and Williams (e.g. 000021_GCAT)


Sept 18, 2006


Dropped from field to future controlled vocabulary of a different field:

  • converting genes -> for phage, place under encoded traits (merge phage back into viruses) - added to definition only at this point
  • incompatibility -> for plasmids, put under propagation (previously reproductive strategy) - added to definition only at this point


Added:

  • library construction -> vector - type of vector used; will be selected from a controlled vocabulary of vectors; added pUC19 as first example (used in 00001_GCat, a plasmid) but currently only applied to metagenomes


Dropped:

  • is this a model organism -> dropped because definition of a model organism too ambiguous
  • access to the isolate sequenced -> dropped because the INSDC cannot record legal status of isolates (incompatibility with INSDC)
  • environment -> dropped, habitat to be selected from a CV (ontology)



Sept 17, 2006

Added to XML schema:



Sept 11, 2006

Added to XML schema during 3rd GSC workshop:

  • genome report title -> to help easily identify reports, in specific response to the lack of a way to identify metagenomic data sets (e.g. lack taxid or organism name)

Sept 9, 2006


Added prior to workshop:

  • latitude/longitude -> degrees, minutes, seconds, direction (north, south, east, west)
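
A degrees, minutes, seconds and direction reading maps onto a signed decimal value (the decimal option listed under Nov 13, 2006) with standard arithmetic. The following is a minimal sketch; `dms_to_decimal` is a hypothetical helper and not part of the schema, and the sample coordinates are made up for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, direction):
    """Convert degrees/minutes/seconds plus a direction to signed decimal degrees."""
    direction = direction.lower()
    if direction not in ("north", "south", "east", "west"):
        raise ValueError(f"unknown direction: {direction!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in [0, 60)")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by the usual convention.
    return -value if direction in ("south", "west") else value


# Made-up example coordinates.
latitude = dms_to_decimal(53, 30, 0, "north")
longitude = dms_to_decimal(8, 15, 0, "west")
```

The sign convention (south and west negative) is the common one for decimal degrees; a consumer of the data should confirm it matches the target database.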


Group Attribute changed:

  • genome project id -> applied not to viruses (although held internally at NCBI) but to metagenomics (E, B, and A)



Sept 5, 2006


Added Extension to XML schema:

  • Minimal Information about a Metagenomic Sequence (MIMS)



Notes for future updates:


Validation:

  • source_material_identifiers - currently set at maxOccurs=2, but this needs to be unbounded; in fact all fields marked M will become minOccurs = at least 1. The schema must be edited heavily to allow proper validation. Right now it is quite generous in the types of input that can be entered.
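
The occurrence constraints described in this note can be illustrated outside the schema. The following is a minimal sketch, assuming a flat report in which each field is a direct child element; the `check_occurrences` helper, the rule table and the sample documents are hypothetical. In XSD itself the corresponding attributes are `minOccurs` and `maxOccurs`.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: field name -> (minimum, maximum); None means unbounded.
RULES = {
    "source_material_identifiers": (1, None),
}


def check_occurrences(parent, rules=RULES):
    """Return a list of human-readable violations for the children of `parent`."""
    problems = []
    for name, (minimum, maximum) in rules.items():
        found = len(parent.findall(name))
        if found < minimum:
            problems.append(f"{name}: expected at least {minimum}, found {found}")
        if maximum is not None and found > maximum:
            problems.append(f"{name}: expected at most {maximum}, found {found}")
    return problems


# Hypothetical sample reports.
valid = ET.fromstring(
    "<report>"
    "<source_material_identifiers>A</source_material_identifiers>"
    "<source_material_identifiers>B</source_material_identifiers>"
    "<source_material_identifiers>C</source_material_identifiers>"
    "</report>"
)
missing = ET.fromstring("<report/>")

valid_problems = check_occurrences(valid)
missing_problems = check_occurrences(missing)
```

With an unbounded maximum, three identifiers pass, while a report with none is flagged because the field is treated as mandatory.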