
Bergen Experiment 16S

From Genomic Standards Consortium

Bergen Experiment Case Study

The Bergen Experiment 16S case study involves the submission of more than 3,000 16S sequences to EMBL via Webin. All of the 16S data were generated from water samples taken in Bergen, which were accompanied by rich experimental and environmental data (biogeochemical measurements).


This case study (submission) is being led by Tim Booth and Dawn Field on behalf of the data generators Anna Oliver, Lindsay Newbold and Andrew Whiteley, in collaboration with the wider Microbial Metagenomics Consortium (http://www.genomics.ceh.ac.uk/mm), which collected additional relevant data, including all the biogeochemical data held in the associated barcoding database.


The process

This is the process by which this case study proceeded:


  • We started by putting all the 16S sequences into a database linked to rich metadata about the experiment (the "Bergen database", linked to an instance of the HandleBar database)


  • Once this was done, we knew that automatic extraction of all relevant data, whether for submission or for future 'trackback' to a public (local) database, would be easy


  • We sought guidance on 'best practice' submission and used the following:


    • Initial discussion with the generators of the data


Sources of best practice and guidance

    • Jim Cole's presentation at the 5th GSC workshop on low- and high-quality 16S submissions and the general lack of information in submissions. In particular, we referred to the 'best practice' instance (slide 10), which serves as the initial benchmark for our submission: http://gensc.org/gc_wiki/index.php/Image:Cole_gsc_dec_07.ppt



Options for managing the submission - should we use GCDML?

Two options for submitting:


1) Export the data in a MIGS/MIMS format using GCDML and then work out how to map it to an INSDC file (a combination of mandatory features, optional qualifiers for /source and a COMMENT block)

2) Export directly into the 16S EMBL document template
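
The mapping step in option 1 can be pictured as sorting each metadata field into one of two piles: fields that have an INSDC /source qualifier, and everything else, which would go into the comment block. The sketch below is only an illustration of that idea. The local field names, the `QUALIFIER_MAP` table and the `phosphate_umol_per_l` entry are hypothetical, not the actual Bergen database schema or a GCDML export.

```python
# Hypothetical local field names (left) mapped to INSDC /source qualifiers (right).
# The left-hand names are assumptions for illustration only.
QUALIFIER_MAP = {
    "country": "country",
    "position": "lat_lon",
    "sampled_by": "collected_by",
    "sampling_date": "collection_date",
    "habitat": "isolation_source",
}

# In the EMBL flat file, qualifier lines start at column 22.
FT_INDENT = "FT" + " " * 19


def split_metadata(record):
    """Split a record into (/source qualifiers, leftovers for the comment block)."""
    qualifiers, leftovers = {}, {}
    for field, value in record.items():
        if field in QUALIFIER_MAP:
            qualifiers[QUALIFIER_MAP[field]] = value
        else:
            leftovers[field] = value
    return qualifiers, leftovers


def format_qualifiers(qualifiers):
    """Render qualifiers as FT lines (long values are not wrapped in this sketch)."""
    return [f'{FT_INDENT}/{name}="{value}"' for name, value in qualifiers.items()]


record = {
    "country": "Norway:Bergen",
    "position": "60.27 N 5.22 E",
    "sampled_by": "Andrew Whiteley",
    "sampling_date": "07-May-2006",
    "phosphate_umol_per_l": "<measured value>",  # placeholder, no real data
}
qualifiers, leftovers = split_metadata(record)
for line in format_qualifiers(qualifiers):
    print(line)
print("For the comment block:", leftovers)
```

Keeping the mapping in a single table means that a field can be promoted from the comment block to a proper qualifier by adding one entry.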


The first option is highly preferable because:

1. contributes to GCDML development

2. gives a stable 'container gold standard for data annotation'


Trade-offs:

1. It will require waiting for GCDML to be ready to use

2. It requires a parser to output valid EMBL docs

3. Only a few things are unique to each sequence submission; most are shared across sequences


Because most fields do not change (point 3), the bulk of the EMBL submission only needs to be created once. There is no need to export it from the database in any automated way: you just type it into Webin once (see below). It therefore makes sense to do option 2 first and get the format right before considering option 1, since no extra coding or custom export is involved.
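
Webin builds the bulk template itself, so no code is needed for the submission. Purely to illustrate the split described in point 3, the sketch below fixes the shared fields in a template and varies only a few per-sequence placeholders. The placeholder names (`$clone`, `$sample_id`, `$length`) and the second row are hypothetical, and the template is a cut-down fragment, not a complete EMBL entry.

```python
from string import Template

# Shared fields are written once; only the $-placeholders change per sequence.
# Placeholder names are illustrative assumptions, not Webin's own template syntax.
ENTRY_TEMPLATE = Template(
    "DE   Uncultured marine bacterium $clone partial 16S rRNA gene\n"
    "FT   source          1..$length\n"
    'FT                   /organism="uncultured marine bacterium"\n'
    'FT                   /country="Norway:Bergen"\n'
    'FT                   /isolate="Sample $sample_id"\n'
    "FT   rRNA            1..$length\n"
)


def render_entries(rows):
    """Fill the shared template once per sequence-specific row."""
    return [ENTRY_TEMPLATE.substitute(row) for row in rows]


rows = [
    {"clone": "16_01_00A01", "sample_id": "01-007377", "length": 1427},
    # Hypothetical second row, for illustration only.
    {"clone": "EXAMPLE_CLONE", "sample_id": "EXAMPLE-ID", "length": 1400},
]
for entry in render_entries(rows):
    print(entry)
```

`Template.substitute` raises a `KeyError` if a row is missing a field, which is a convenient check that every sequence carries its full set of unique values.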

The EMBL submission process for bulk sequences is as follows:

  • Create a single submission based on the first sequence.
  • Circulate this until we are happy
  • Send it to EMBL noting that this is the first of a bulk upload
  • Correct any issues noted by EMBL admins
  • Send bulk submission as directed by EMBL


Draft submission process through Webin to make a template

Below is an email exchange between Tim and EMBL data submissions (datasubs@ebi.ac.uk):

From: datasubs@ebi.ac.uk
To: tbooth@ceh.ac.uk
Subject: Re: Submission of ~1800 sequences from a 16S study. (faruque) (SUB#413449)
Date: Thu, 24 Jan 2008 17:23:02 GMT


Dear Colleague,

> A slightly long-winded query for you, I'm afraid, but please bear with
> me.
>
> My questions are:
>
> 1) The Webin submission guidelines link to documentation on the familiar
> EMBL flat file format. If I try to create files in this format (minus
> the internally-generated fields like accession and date) for my
> sequences, am I getting close to something I could actually submit or am
> I barking up the wrong tree?

Webin produces EMBL-format flatfiles better than other methods (eg Sequin is a
standalone submission tool that the NCBI make, unfortunately it permits
submission even if they fail every test available to Sequin, eg2 Artemis is a
very good annotation tool for prokaryotic sequences, but users often create fake
qualifiers such as /colour which cannot be used by embl).
Where sequences are > 10 kbp we understand Webin would be too slow to reenter
annotation and so it permits users to upload a preformatted file of annotation.

> 2) If I start a bulk submission by putting in a representative sequence
> through Webin, can I make a first stab, then get that submission back in
> EMBL format, for comparison with sequences already in the database, and
> revise it before actually getting a template and submitting anything for
> real?

Yes, you enter your example sequence in Webin. On the summary page it will show
you a flatfile view of the entry that you are creating. Once complete you
submit it to our system where a curator will review your entry. This gives
chance for them to solicit further information (eg "Your paper title indicates
that this is a study of HIV sequences occurring in different places, do you have
lat_lon coordinates that you can include in your entry?").
Once the example entry is perfected and a Webin bulk form has been made you will
be sent an email with the URL and also a copy of the template for your review.

> 3) If I have a 16S sequence that I have identified by a similarity
> search, do I give the name of the bacterium based on my identification
> or do I give it as 'uncultured marine bacterium' because this was the
> best identification I had prior to sequencing and searching in-silico?

If the source organism was not isolated it should be entered as a
/environmental_sample and be allocated to a subset of organism names (eg
'uncultured marine bacterium').
Please see
<http://www.ncbi.nlm.nih.gov/Taxonomy/protected/home/index.cgi?chapter=edspolicy>
for a deeper explanation.

If the organism was isolated in pure culture it would receive an informal
taxonomic identification that can only extend to the Genus level, eg if the
sequence of isolate AM-2008-123 matched Escherichia coli it would be entered as
organism "Escherichia sp. AM-2008-123"


> 4) Should I be submitting my sequences as part of a project or not? I
> see you have projects for genomes and metagenomes but being 16S only
> this is essentially a biodiversity assay.

The projects are currently genome and metagenome and do not currently embrace
biodiversity studies.

Regarding the sampling meta data - we would be happy to receive a reasonably
preformatted CC block of tag value pairs in addition to ensuring that all
available embl qualifiers were used where available. I think Webin may remove
preformatting when you provide a CC block using 'Feature not in list' but that
can be fixed when you interact with the curator.


I hope I've covered everything.

The current version can be seen with Webin ID "Hx1201256064". You need a password to get in, but you can then edit the submission and see the EMBL-format file on the last page before actually submitting it.
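
The "CC block of tag value pairs" mentioned in the reply could be produced along the following lines. This is a minimal sketch: the "tag: value" layout and the tag names are our assumptions, since the email does not specify a format. It relies only on the flat-file convention of a two-letter line code followed by three spaces, with lines kept to 80 characters.

```python
import textwrap

CC_PREFIX = "CC   "  # two-letter line code plus three spaces


def format_cc_block(pairs, width=80):
    """Render tag-value pairs as CC comment lines, wrapping long values."""
    lines = []
    for tag, value in pairs.items():
        text = f"{tag}: {value}"
        # Leave room for the prefix so no line exceeds `width` characters.
        wrapped = textwrap.wrap(text, width=width - len(CC_PREFIX)) or [""]
        lines.extend(CC_PREFIX + part for part in wrapped)
    return lines


# Tag names are hypothetical; the values are taken from the draft entry.
sample_metadata = {
    "sample_id": "01-007377",
    "filter": "0.22 Durapore, GFA prefilter",
    "treatment": "seawater in mesocosm bag enriched with CO2",
}
for line in format_cc_block(sample_metadata):
    print(line)
```

Because Webin may strip preformatting, as noted in the reply, a block like this would probably need to be checked with the curator after submission.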



The current draft template

This is a first-pass template exported from Webin. It must now be heavily modified, in collaboration with all involved, to make it as rich as possible. We have two key sets of decisions to make:

1) What data should we include? For example, can we use the /bio_material qualifier for an ID pointing to the barcode database, or do we 'flatten out' all this information (e.g. the biogeochemical data) and put it directly into the file?

2) How do we structure it in the document? (Various options are described above.)



ID   XXX; XXX; linear; genomic DNA; XXX; XXX; 1427 BP.
XX  
ST * draft  
XX  
AC   ;
XX  
DE   Uncultured marine somethingbacterium 16_01_00A01 partial 16S rRNA gene,
DE   Espeland, Raunefjord, Norway
XX  
KW   .
XX   
OS   uncultured marine bacterium
OC   Bacteria; environmental samples.
XX   
RN   [1]
RP   1-1427 
RA   Booth T.G.;
RT   ;
RL   Submitted (22-FEB-2008) to the EMBL/GenBank/DDBJ databases.
RL   Booth T.G., CEH Oxford, Centre for Ecology and Hydrology, Mansfield Road,
RL   Oxford, OX1 3SR, UNITED KINGDOM.
XX 
RN   [2]
RA   Joint I.;
RT   "Tentative title of paper";
RL   Unpublished.
XX 
FH   Key             Location/Qualifiers
FH   
FT   source          1..1427
FT                   /organism="uncultured marine bacterium"
FT                   /db_xref="taxon:56765"
FT                   /mol_type="genomic DNA"
FT                   /isolate="Sample 01-007377, 0.22 Durapore, GFA prefilter"
FT                   /environmental_sample
FT                   /note="Note goes here"
FT                   /country="Norway:Bergen"
FT                   /isolation_source="seawater in mesocosm bag enriched with
FT                   CO2"
FT                   /lat_lon="60.27 N 5.22 E"
FT                   /collected_by="Andrew Whiteley"
FT                   /collection_date="07-May-2006"
CC                   Next line added manually
FT                   /bio_material="nebc.nox.ac.uk:mm_barcode:01-007377"
FT   rRNA            1..1427
FT                   /gene="16S rRNA"
FT                   /product="16S ribosomal RNA"
XX   
SQ   Sequence 1427 BP; 368 A; 322 C; 428 G; 309 T; 0 other 
     gagtttgctc atggctcaga acgaacgctg gcggcaggcc taacacatgc aagtcgagcg        60
     ctaccttcgg gtggagcggc ggacgggtta gtaacgcgtg ggaatatacc cagttctaag       120
     gaatagccac tggaaacggt gagtaatacc ttatacgccc ttcgggggaa agatttatcg       180
     gaattggatt agcccgcgtt agattagata gttggtgggg taatggccta ccaagtctac       240
     gatctatagc tggtttgaga ggatgatcag caacactggg actgagacac ggcccagact       300
     cctacgggag gcagcagtgg ggaatcttag acaatgggcg caagcctgat ctagccatgc       360
     cgcgtgagtg atgaaggccc tagggtcgta aagctctttc aactgtgaag ataatgacgg       420
     tagcagtaga agaaaccccg gctaactccg tgccagcagc cgcggtaata cggagggggt       480
     tagcgttgtt cggaattact gggcgtaaag cgtacgcagg cggattaata agttagaggt       540
     gaaatcccag ggctcaaccc tggaactgcc tttaaaactg ttagtcttga gatcgagaga       600
     ggtgagtgga attccaagtg tagaggtgaa attcgtagat atttggagga acaccagtgg       660
     cgaaggcggc tcactggctc gatactgacg ctgaggtacg aaagtgtggg gagcaaacag       720
     gattagatac cctggtagtc cacaccgtaa acgatgaatg ccagacgtca gcaagcatgc       780
     ttgttggtgt cacacctaac ggattaagca ttccgcctgg ggagtacggt cgcaagatta       840
     aaactcaaag gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgacg       900
     caacgcgcag aaccttacca acccttgaca tacttgtcgc ggattccaga gatggattcc       960
     ttcagttcgg ctggacaatg tacaggtgct gcatggctgt cgtcagctcg tgtcgtgaga      1020
     tgttcggtta agtccggcaa cgagcgcaac ccacgtcctt agttaccagc atttagttgg      1080
     gtaccctaag gagactgccg gtgataagcc ggaggaaggt gtggacgacg tcaagtcatc      1140
     atggccctta cgggttgggc tacacacgtg ctacaatggc atctacagtg agttaatctc      1200
     caaaagatgt ctcagttcgg attggggtct gcaactcgac cccatgaagt tggaatcgct      1260
     agtaatcgcg gaacagcatg ccgcggtgaa tacgttcccg ggccttgtac acaccgcccg      1320
     tcacaccatg ggaattgggt ctacccgaag gtggtgcgcc aactatttat aggggcagcc      1380
     aaccacggta ggttcagtga ctggggtgaa gtcgtaacaa ggtaacc                    1427
//


(end)
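
When the remaining submission files are written out, the sequence section of each entry has to follow the same layout as the draft above: an SQ header with base counts, then rows of 60 bases in groups of 10 with a right-aligned running position. A minimal sketch of that formatting is shown below. The function name `format_sq_block` is ours, and the trailing semicolon after "other" follows the usual EMBL convention rather than the draft.

```python
from collections import Counter


def format_sq_block(sequence):
    """Format a nucleotide sequence as EMBL SQ lines followed by the '//' terminator."""
    seq = sequence.lower()
    counts = Counter(seq)
    other = len(seq) - sum(counts[base] for base in "acgt")
    header = (
        f"SQ   Sequence {len(seq)} BP; {counts['a']} A; {counts['c']} C; "
        f"{counts['g']} G; {counts['t']} T; {other} other;"
    )
    lines = [header]
    for start in range(0, len(seq), 60):
        chunk = seq[start:start + 60]
        groups = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        end = start + len(chunk)
        # 5 leading spaces + 66 columns of bases + 9 columns of position = 80.
        lines.append(f"     {groups:<66}{end:>9}")
    lines.append("//")
    return lines


# First 60 bases of the draft entry above.
first_row = "gagtttgctcatggctcagaacgaacgctggcggcaggcctaacacatgcaagtcgagcg"
for line in format_sq_block(first_row):
    print(line)
```

As a quick consistency check on the draft, the counts in its SQ header (368 + 322 + 428 + 309) add up to the stated length of 1427 BP.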

Next steps

This is an ongoing process. Our next steps are to:


1. Have a local meeting to improve the 'template'

2. Put it up for consultation and get it ratified by the larger group

3. Write out all the submission files

4. Submit

5. For future submissions, consider GCDML: would we need to write a parser to create EMBL docs compliant with the rich template we create, or could we put the XML in the comment field? (Tim's query)
