MIxS Compliance and Implementation

What is metadata?

Metadata is ‘data’ about data. In practical terms, metadata is the information describing a sampling event and subsequent sequencing efforts.

Why use metadata standards?

Using metadata standards to annotate the data describing the sample, sampling environment and sequencing methodology will vastly improve our ability to mine and integrate our sequence data collection for knowledge- and application-driven research. Collecting and reporting a common, minimal set of metadata across different projects will foster data comparison and analysis. Combining studies in a standard way will allow for more powerful analyses of data.

Compliance with the core MIxS is straightforward: it consists of only 11 metadata items, which can be filled in quickly prior to sequence submission to public databases.

minimal_mixs

Below are examples of MIxS-compliant metadata lists for a genome sequence, a metagenomic sample, and a marker gene survey. They have varying degrees of detail, but ultimately what makes them MIxS-compliant are the common items marked in bold red font.

migs

Genome sequencing of Sediminibacterium sp. – note the use of conditional metadata items from the MIGS checklist

mims

A metagenome (WGS) sequencing sample from sea water – here the sample is extensively characterized using parameters from the environmental package “water”

mimarks

Marker gene survey on dsrA sequences and the accompanying MIMARKS-survey metadata – note the use of MIMARKS checklist conditional mandatory metadata items

Contacts

For help with any compliance or curation issues, or to suggest improvements to MIxS, you can write us a ticket at:

MIxS checklist trac

For help with compliance and implementation of MIxS standards in your own systems, the Compliance and Interoperability Group can be contacted at:

gensc-developers[at]lists.gensc.org

Adopters

Despite the relative simplicity of MIxS checklists, it may still not be trivial to prepare the right data in the right format. To assist submitters further, we compiled a list of databases and tools that support MIxS.

Databases

The INSDC databases

The International Nucleotide Sequence Database Collaboration (INSDC; NCBI/GenBank, EBI-ENA, and the DNA Data Bank of Japan) partners have recognised MIxS and have reserved an official keyword for compliant INSDC sequence records in the form of “GSC:MIxS;{specific_checklist_name}”.
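As a minimal sketch of the keyword pattern described above, the following Python helpers build and parse a string of the form “GSC:MIxS;{specific_checklist_name}”. The function names are hypothetical, and the checklist name used in the usage note is an assumption taken from the MIMARKS-survey example on this page; consult the INSDC databases for the exact checklist names they accept.

```python
# Illustrative helpers only; names here are hypothetical, not an INSDC API.
KEYWORD_PREFIX = "GSC:MIxS"


def make_mixs_keyword(checklist_name: str) -> str:
    """Build a keyword of the form GSC:MIxS;{specific_checklist_name}."""
    name = checklist_name.strip()
    if not name or ";" in name:
        raise ValueError("checklist name must be non-empty and contain no ';'")
    return f"{KEYWORD_PREFIX};{name}"


def parse_mixs_keyword(keyword: str):
    """Return the checklist name, or None if the keyword does not match."""
    prefix, sep, name = keyword.strip().partition(";")
    if prefix != KEYWORD_PREFIX or not sep or not name:
        return None
    return name
```

For example, `make_mixs_keyword("MIMARKS-survey")` yields `GSC:MIxS;MIMARKS-survey`, and `parse_mixs_keyword` recovers the checklist name from such a string.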

NCBI offers customizable templates to download for all MIxS checklists and the environmental packages in BioSample submissions. The BioSample concept fits very well with MIxS metadata, as our focus is also on the sample.

ENA also offers customizable MIxS templates for download in their login-based submission system. The “Submitting Environmental Sequences” and “MIxS” pages detail the submission process further.

The GOLD Database

The Genomes Online Database (GOLD) displays a wide range of metadata for complete and ongoing genome and metagenome projects. It now also accepts submission of new entries and metadata.

MG-RAST

MG-RAST has implemented the use of MIxS by using simple spreadsheets to capture metadata, with a minimal number of required fields (in red in the spreadsheets) and a number of optional fields. The spreadsheet is separated into multiple tabs representing the different metadata categories. A more detailed explanation can be found in the MG-RAST blog.
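The idea of a small set of required fields alongside optional ones can be checked before upload. The sketch below assumes one spreadsheet tab has been exported as tab-separated text; the field names in `REQUIRED_FIELDS` are hypothetical placeholders, not the actual MG-RAST column names, which should be taken from the spreadsheet itself.

```python
import csv
import io

# Hypothetical required columns; replace with the fields marked as
# required in the actual spreadsheet.
REQUIRED_FIELDS = ["sample_name", "collection_date", "latitude", "longitude"]


def missing_required(row, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one row."""
    return [f for f in required if not (row.get(f) or "").strip()]


def check_sheet(tsv_text, required=REQUIRED_FIELDS):
    """Map line number (header is line 1) to its missing required fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = {}
    for line_no, row in enumerate(reader, start=2):
        missing = missing_required(row, required)
        if missing:
            problems[line_no] = missing
    return problems
```

Optional columns are ignored by the check, so a row is reported only when a required value is empty or its column is missing.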

Tools

ISA infrastructure

The Investigation/Study/Assay (ISA) Infrastructure is a freely available software suite that:

  1. assists in the curation, reporting and local management of experimental metadata (i.e. sample characteristics, technologies used, type of measurements) from studies employing one or a combination of technologies;
  2. empowers communities to take up GSC community-defined standards: minimum information checklists (MIGS/MIMS/MIMARKS) and ontologies (e.g., EnvO, OBI, etc.);
  3. facilitates submission to international public repositories of genomics studies (e.g. ENA and SRA databases), but also of transcriptomics and proteomics studies (ArrayExpress and Pride).

The Java-based ISA software components and a relational database are based on the ISA-Tab format and designed for local use. They can work independently or as a unified system:

  • ISAcreatorConfig, for curators or power users to regulate the fields displayed in the ISAcreator; i.e., declaring certain fields mandatory or mandating the use of a specific set of ontology terms (accessed via BioPortal and OLS public portal).
    • Download ISA creator configuration files for the MIMARKS environmental packages from sourceforge. Please note these are alpha versions to be further evaluated and these should not be used in a production environment. MIGS/MIMS configuration files will follow.
  • ISAcreator, a ‘user-friendly’ editor with which experimentalists can construct reports, edit experimental metadata and ultimately validate it based on the configuration specified;
  • The BioInvestigation Index, a relational database for storing and browsing the experimental metadata;
  • ISAconverter, to transform ISA-Tab metadata into SRA-XML (used by ENA and SRA databases), but also into MAGE-Tab (used by ArrayExpress), and Pride XML (used by Pride).
  • rISA (under development), a package for R which allows you to load ISA-Tab files and run existing analysis functions, such as those from Bioconductor, on the data files referenced within the ISA-Tab.

MetaBar

MetaBar (http://www.megx.net/metabar) is a spreadsheet and web-based software tool designed to assist users in the consistent acquisition, electronic storage and submission of contextual data associated with their samples. A preconfigured Microsoft Excel® spreadsheet is used to initiate structured contextual data storage in the field or laboratory. Each sample is given a unique identifier. To enter and update the data at any stage, the sheets can be uploaded to the MetaBar database server. For sample labeling, the identifiers can be printed as barcodes. An intuitive web interface provides quick access to the contextual data in the MetaBar database as well as user and project management capabilities. Export functions facilitate contextual and sequence data submission to the International Nucleotide Sequence Database Collaboration (INSDC) databases. MetaBar requests and stores contextual data in compliance with the MIGS/MIMS/MIMARKS specifications defined by the Genomic Standards Consortium.

EpiCollect implementation of MIMARKS

EpiCollect.net (http://www.epicollect.net/) provides a web application for the generation of forms and freely hosted project websites (using Google’s AppEngine) for many kinds of mobile data collection projects. The GSC is in the process of developing a demonstration project website for the capture of MIMARKS data.

CDinFusion

CDinFusion (Contextual Data and FASTA infusion) is a submission preparation tool for the integration of contextual data (CD) with sequence data. The software enriches uploaded multi-FASTA files with contextual data in compliance with the Genomic Standards Consortium (GSC) specifications MIGS/MIMS/MIMARKS. The generated contextual-data-enriched files can be used for submission to the databases of the International Nucleotide Sequence Database Collaboration (INSDC). The tool aims to offer scientists in all disciplines of the life sciences software to increase the quantity and quality of contextual data in the INSDC databases. CDinFusion has been developed by the Microbial Genomics Group at the Max Planck Institute for Marine Microbiology Bremen. It can be accessed at http://www.megx.net/cdinfusion.
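To illustrate the general idea of enriching FASTA records with contextual data, the sketch below appends key/value pairs to every header line of a multi-FASTA text. This is not CDinFusion's code or its output format: the function name, the bracketed `[key=value]` layout and the example field are assumptions chosen only for illustration.

```python
def enrich_fasta(fasta_text, contextual_data):
    """Append [key=value] tags to each FASTA header line.

    The tag layout is a hypothetical choice for this sketch; sequence
    lines are passed through unchanged.
    """
    tags = " ".join(f"[{key}={value}]" for key, value in contextual_data.items())
    enriched = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and tags:
            enriched.append(f"{line} {tags}")
        else:
            enriched.append(line)
    return "\n".join(enriched) + "\n"
```

A call such as `enrich_fasta(fasta_text, {"lat_lon": "54.18 N 7.90 E"})` would tag every record in the file with the same sampling location, which matches the common case where all sequences in one file come from one sample.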

QIITA

Qiita (canonically pronounced cheetah) is an entirely open-source microbiome storage and analysis resource that can run on everything from your laptop to a supercomputer. It is built on top of the widely used QIIME package, and enables the exploration of -omics data. The resource (http://qiita.microbio.me/) currently supports the MIMARKS specification, allowing users to generate and validate MIMARKS-compliant templates. These templates can be viewed and completed in the users’ spreadsheet editor of choice (e.g. Microsoft Excel). The Qiita web-platform also offers an ontology lookup and georeferencing tool to aid users when completing the MIMARKS templates. Additional tools for processing and analyzing MIMARKS-compliant microbial communities using this platform will be made available to the public on an ongoing basis.

RDP Google Sheets and SRA services

RDP’s Google Sheets assist researchers by providing easy, online-accessible data entry and storage for metadata conforming to the MIxS and MIMARKS specifications for all 14 current environments. After you collect your metadata, you can export your MIMARKS-compliant data by selecting the menu item “MIMARKS Export” and choosing your desired output: WebIN or Sequin. The RDP SRA Prepkit is no longer available. Please use the new SRA prep/submission tools hosted by ENA or NCBI to complete data submission. RDP users should contact RDP Staff if they have questions or need assistance in preparing the metadata documents required for submission to the Sequence Read Archive (SRA).

EBI Metagenomics Portal

The EBI Metagenomics service is an automated pipeline for the analysis and archiving of metagenomic data that aims to provide insights into the functional and metabolic potential of a sample. Until October 2012, the EBI Metagenomics service offered a manually-assisted submission route, with help available to ensure data and metadata formatting complied with the Sequence Read Archive (SRA) data schema and the Genomic Standards Consortium (GSC) sample metadata guidelines respectively, allowing harmonisation of analysis efforts across the wider genomics community. From October 2012, submitters of metagenomic datasets are encouraged to make use of ENA’s SRA Webin submission service, which supports all of the MIxS checklists.

Others

The SIGS Journal

The “Standards in Genomic Sciences” (SIGS) journal is the first journal to require MIGS for the publication of all genome papers. SIGS has published over 50 MIGS-compliant genome reports.