Bergen Experiment 16S
From Genomic Standards Consortium
Bergen Experiment Case Study
The Bergen Experiment 16S case study involves the submission of >3,000 16S sequences to EMBL (Webin). All of the 16S data were generated from water samples taken in Bergen and are accompanied by rich experimental and environmental data (biogeochemical measurements).
This case study (submission) is being led by Tim Booth and Dawn Field on behalf of the data generators Anna Oliver, Lindsay Newbold and Andrew Whiteley, in collaboration with the wider Microbial Metagenomics Consortium (http://www.genomics.ceh.ac.uk/mm), which collected additional relevant data, including all the biogeochemical data held in the associated barcoding database.
The process
This is the process by which this case study proceeded:
- started by putting all the 16S sequences into a database linked to rich metadata about the experiment (the "Bergen database", linked to an instance of the HandleBar database)
- once this was done, we knew that automatic extraction of all relevant data, whether for submission or for future 'trackback' to a public (local) database, would be easy
- sought guidance on 'best practice' submission and used the following:
- initial discussion with the generators of the data
Sources of best practice and guidance
- Jim Cole's presentation at the 5th GSC workshop on low- and high-quality 16S submissions and the general lack of information in submissions. In particular, we referred to the 'best practice' instance (slide 10), which serves as the initial benchmark for our submission: http://gensc.org/gc_wiki/index.php/Image:Cole_gsc_dec_07.ppt
- The INSDC feature table (as HTML) to understand where to put elements in the structured final EMBL doc: http://www.insdc.org/files/documents/feature_table.html#_Toc180488147
- MIGS/MIMS checklist, specifically the online overview (http://gensc.sourceforge.net/docs/migsmims/) and the publication
- discussions with members of the GSC (See this thread which leads to this page: http://gensc.org/gc_wiki/index.php/Major_Requirements_for_16S_and_Biodiversity)
Options for managing the submission - should we use GCDML
Two options for submitting:
1) export the data in a MIGS/MIMS format using GCDML and then figure out how to map it to the INSDC file (a combination of mandatory features, optional qualifiers for /source and a COMMENT BLOCK)
2) export directly into 16S EMBL doc template
The first option is highly preferable because:
1. contributes to GCDML development
2. gives a stable 'container gold standard for data annotation'
Tradeoffs:
1. will require waiting for GCDML to be ready to use
2. requires a parser to output valid EMBL docs
3. only a few things are unique to a sequence submission; most are shared across sequences
Because most fields do not change (point 3), the bulk of the EMBL submission only needs to be created once. There is no need to export it from the database in any automated way; it is simply typed into Webin once (see below). It therefore makes sense to pursue option 2 first and get the format right before considering option 1, since no extra coding or custom export is involved.
The EMBL submission process for bulk sequences is as follows:
- Create a single submission based on the first sequence.
- Circulate this until we are happy
- Send it to EMBL noting that this is the first of a bulk upload
- Correct any issues noted by EMBL admins
- Send bulk submission as directed by EMBL
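If the per-sequence fields were ever exported programmatically rather than typed into the Webin bulk form, the split between shared and per-sequence content (point 3 above) might be sketched as follows. This is only an illustration: the abbreviated template, the function name render_entry and the record keys are assumptions, not part of Webin or of the Bergen database.

```python
from string import Template

# Shared part of the entry: written once, identical for every sequence.
# Abbreviated and illustrative; this is not a complete EMBL entry.
SHARED_TEMPLATE = Template(
    "DE   Uncultured marine bacterium $clone_id partial 16S rRNA gene\n"
    "FT   source          1..$length\n"
    'FT                   /organism="uncultured marine bacterium"\n'
    "FT                   /environmental_sample\n"
    'FT                   /isolate="$isolate"\n'
    'FT                   /collection_date="$collection_date"\n'
)


def render_entry(record):
    """Fill in the per-sequence placeholders; the rest comes from the template."""
    # The sequence length is derived, so it never has to be entered by hand.
    fields = dict(record, length=len(record["sequence"]))
    return SHARED_TEMPLATE.substitute(fields)


# Hypothetical record layout; the sequence is truncated for brevity.
records = [
    {
        "clone_id": "16_01_00A01",
        "isolate": "Sample 01-007377",
        "collection_date": "07-May-2006",
        "sequence": "gagtttgctc",
    },
]

entries = [render_entry(r) for r in records]
print(entries[0])
```

Only the values that differ between sequences live in the records; everything else is agreed once in the template, which mirrors how the Webin bulk form is described above.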
Draft submission process through Webin to make a template
Below is an email exchange between Tim and the EMBL data submissions team (datasubs):
From: datasubs@ebi.ac.uk
To: tbooth@ceh.ac.uk
Subject: Re: Submission of ~1800 sequences from a 16S study. (faruque) (SUB#413449)
Date: Thu, 24 Jan 2008 17:23:02 GMT

Dear Colleague,

> A slightly long-winded query for you, I'm afraid, but please bear with
> me.
>
> My questions are:
>
> 1) The Webin submission guidelines link to documentation on the familiar
> EMBL flat file format. If I try to create files in this format (minus
> the internally-generated fields like accession and date) for my
> sequences, am I getting close to something I could actually submit or am
> I barking up the wrong tree?

Webin produces EMBL-format flatfiles better than other methods (eg Sequin is a standalone submission tool that the NCBI make, unfortunately it permits submission even if they fail every test available to Sequin, eg2 Artemis is a very good annotation tool for prokaryotic sequences, but users often create fake qualifiers such as /colour which cannot be used by embl). Where sequences are > 10 kbp we understand Webin would be too slow to reenter annotation and so it permits users to upload a preformatted file of annotation.

> 2) If I start a bulk submission by putting in a representative sequence
> through Webin, can I make a first stab, then get that submission back in
> EMBL format, for comparison with sequences already in the database, and
> revise it before actually getting a template and submitting anything for
> real?

Yes, you enter your example sequence in Webin. On the summary page it will show you a flatfile view of the entry that you are creating. Once complete you submit it to our system where a curator will review your entry. This gives chance for them to solicit further information (eg "Your paper title indicates that this is a study of HIV sequences occurring in different places, do you have lat_lon coordinates that you can include in your entry?").

Once the example entry is perfected and a Webin bulk form has been made you will be sent an email with the URL and also a copy of the template for your review.

> 3) If I have a 16S sequence that I have identified by a similarity
> search, do I give the name of the bacterium based on my identification
> or do I give it as 'uncultured marine bacterium' because this was the
> best identification I had prior to sequencing and searching in-silico?

If the source organism was not isolated it should be entered as a /environmental_sample and be allocated to a subset of organism names (eg 'uncultured marine bacterium'). Please see <http://www.ncbi.nlm.nih.gov/Taxonomy/protected/home/index.cgi?chapter=edspolicy> for a deeper explanation. If the organism was isolated in pure culture it would receive an informal taxonomic identification that can only extend to the Genus level, eg if the sequence of isolate AM-2008-123 matched Escherichia coli it would be entered as organism "Escherichia sp. AM-2008-123"

> 4) Should I be submitting my sequences as part of a project or not? I
> see you have projects for genomes and metagenomes but being 16S only
> this is essentially a biodiversity assay.

The projects are currently genome and metagenome and do not currently embrace biodiversity studies. Regarding the sampling meta data - we would be happy to receive a reasonably preformatted CC block of tag value pairs in addition to ensuring that all available embl qualifiers were used where available. I think Webin may remove preformatting when you provide a CC block using 'Feature not in list' but that can be fixed when you interact with the curator.

I hope I've covered everything.
The current version can be seen with Webin ID "Hx1201256064". You need a password to get in, but you can then edit the submission and see the EMBL-format file on the last page before actually submitting it.
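The curator's suggestion of "a reasonably preformatted CC block of tag value pairs" could be produced with something like the sketch below. The tag names, the " : " separator and the 80-character limit are assumptions made for illustration; the exact layout would need to be agreed with the EMBL curator.

```python
def format_cc_block(metadata, width=80):
    """Render tag-value pairs as EMBL comment (CC) lines, one pair per line."""
    if not metadata:
        return ""
    prefix = "CC   "  # EMBL line code followed by three spaces
    # Pad every tag to the longest one so the values line up in a column.
    tag_width = max(len(tag) for tag in metadata)
    lines = []
    for tag, value in metadata.items():
        line = f"{prefix}{tag.ljust(tag_width)} : {value}"
        if len(line) > width:
            # Long values are rejected rather than wrapped in this sketch.
            raise ValueError(f"CC line for {tag!r} exceeds {width} characters")
        lines.append(line)
    return "\n".join(lines)


# Hypothetical tag names; the values are taken from the draft template.
sample_metadata = {
    "mesocosm_treatment": "CO2 enriched",
    "filter": "0.22 Durapore, GFA prefilter",
}

print(format_cc_block(sample_metadata))
```

Aligning the values in a fixed column keeps the block readable as plain text, which matters if Webin strips other preformatting as the curator warns.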
The current draft template
This is a first-pass template exported from Webin. It must now be heavily modified, in collaboration with everyone involved, to make it as rich as possible. We have two key sets of decisions to make:
1) What data to include? For example, can we use the /bio_material qualifier to hold an ID pointing to the barcode databases, or do we 'flatten out' all this information (e.g. biogeochem data) and put it directly into the file?
2) How do we structure it in the doc? (Various options are described above.)
ID   XXX; XXX; linear; genomic DNA; XXX; XXX; 1427 BP.
XX
ST * draft
XX
AC   ;
XX
DE   Uncultured marine somethingbacterium 16_01_00A01 partial 16S rRNA gene,
DE   Espeland, Raunefjord, Norway
XX
KW   .
XX
OS   uncultured marine bacterium
OC   Bacteria; environmental samples.
XX
RN   [1]
RP   1-1427
RA   Booth T.G.;
RT   ;
RL   Submitted (22-FEB-2008) to the EMBL/GenBank/DDBJ databases.
RL   Booth T.G., CEH Oxford, Centre for Ecology and Hydrology, Mansfield Road,
RL   Oxford, OX1 3SR, UNITED KINGDOM.
XX
RN   [2]
RA   Joint I.;
RT   "Tentative title of paper";
RL   Unpublished.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1427
FT                   /organism="uncultured marine bacterium"
FT                   /db_xref="taxon:56765"
FT                   /mol_type="genomic DNA"
FT                   /isolate="Sample 01-007377, 0.22 Durapore, GFA prefilter"
FT                   /environmental_sample
FT                   /note="Note goes here"
FT                   /country="Norway:Bergen"
FT                   /isolation_source="seawater in mesocosm bag enriched with
FT                   CO2"
FT                   /lat_lon="60.27 N 5.22 E"
FT                   /collected_by="Andrew Whiteley"
FT                   /collection_date="07-May-2006"
CC   Next line added manually
FT                   /bio_material="nebc.nox.ac.uk:mm_barcode:01-007377"
FT   rRNA            1..1427
FT                   /gene="16S rRNA"
FT                   /product="16S ribosomal RNA"
XX
SQ   Sequence 1427 BP; 368 A; 322 C; 428 G; 309 T; 0 other
gagtttgctc atggctcaga acgaacgctg gcggcaggcc taacacatgc aagtcgagcg 60
ctaccttcgg gtggagcggc ggacgggtta gtaacgcgtg ggaatatacc cagttctaag 120
gaatagccac tggaaacggt gagtaatacc ttatacgccc ttcgggggaa agatttatcg 180
gaattggatt agcccgcgtt agattagata gttggtgggg taatggccta ccaagtctac 240
gatctatagc tggtttgaga ggatgatcag caacactggg actgagacac ggcccagact 300
cctacgggag gcagcagtgg ggaatcttag acaatgggcg caagcctgat ctagccatgc 360
cgcgtgagtg atgaaggccc tagggtcgta aagctctttc aactgtgaag ataatgacgg 420
tagcagtaga agaaaccccg gctaactccg tgccagcagc cgcggtaata cggagggggt 480
tagcgttgtt cggaattact gggcgtaaag cgtacgcagg cggattaata agttagaggt 540
gaaatcccag ggctcaaccc tggaactgcc tttaaaactg ttagtcttga gatcgagaga 600
ggtgagtgga attccaagtg tagaggtgaa attcgtagat atttggagga acaccagtgg 660
cgaaggcggc tcactggctc gatactgacg ctgaggtacg aaagtgtggg gagcaaacag 720
gattagatac cctggtagtc cacaccgtaa acgatgaatg ccagacgtca gcaagcatgc 780
ttgttggtgt cacacctaac ggattaagca ttccgcctgg ggagtacggt cgcaagatta 840
aaactcaaag gaattgacgg gggcccgcac aagcggtgga gcatgtggtt taattcgacg 900
caacgcgcag aaccttacca acccttgaca tacttgtcgc ggattccaga gatggattcc 960
ttcagttcgg ctggacaatg tacaggtgct gcatggctgt cgtcagctcg tgtcgtgaga 1020
tgttcggtta agtccggcaa cgagcgcaac ccacgtcctt agttaccagc atttagttgg 1080
gtaccctaag gagactgccg gtgataagcc ggaggaaggt gtggacgacg tcaagtcatc 1140
atggccctta cgggttgggc tacacacgtg ctacaatggc atctacagtg agttaatctc 1200
caaaagatgt ctcagttcgg attggggtct gcaactcgac cccatgaagt tggaatcgct 1260
agtaatcgcg gaacagcatg ccgcggtgaa tacgttcccg ggccttgtac acaccgcccg 1320
tcacaccatg ggaattgggt ctacccgaag gtggtgcgcc aactatttat aggggcagcc 1380
aaccacggta ggttcagtga ctggggtgaa gtcgtaacaa ggtaacc 1427
//
(end)
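Before writing out several thousand entries from a template like this one, it may be worth checking that the counts declared on each SQ line agree with the sequence that follows. The sketch below is one possible check, not part of Webin; the function name check_sq_header is an assumption, and the toy example uses only the first 20 bases of the sequence above.

```python
import re
from collections import Counter


def check_sq_header(sq_line, sequence_lines):
    """Return {name: (declared, observed)} for every count that disagrees."""
    # Pull "1427 BP", "368 A", ... "0 other" out of the SQ header line.
    declared = {
        name: int(count)
        for count, name in re.findall(r"(\d+)\s+(BP|A|C|G|T|other)\b", sq_line)
    }
    # Drop the spacing and the trailing position numbers from each sequence line.
    bases = "".join(re.sub(r"[\s\d]", "", line) for line in sequence_lines).upper()
    counts = Counter(bases)
    observed = {"BP": len(bases)}
    for base in "ACGT":
        observed[base] = counts[base]
    observed["other"] = len(bases) - sum(counts[b] for b in "ACGT")
    return {
        name: (declared.get(name), observed[name])
        for name in observed
        if declared.get(name) != observed[name]
    }


# Toy entry: the first 20 bases of the template sequence.
toy_lines = ["gagtttgctc atggctcaga 20"]
good_header = "SQ   Sequence 20 BP; 4 A; 4 C; 6 G; 6 T; 0 other"
bad_header = "SQ   Sequence 20 BP; 5 A; 4 C; 6 G; 5 T; 0 other"

print(check_sq_header(good_header, toy_lines))  # {} means the header is consistent
print(check_sq_header(bad_header, toy_lines))
```

An empty result means the header and the sequence agree. In the draft template the declared base counts (368 + 322 + 428 + 309) do add up to the stated 1427 BP.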
Next steps
This is an ongoing process. Our next steps are to:
1. Have a local meeting to improve the 'template'
2. Put it up for consultation and get it ratified by the larger group
3. Write out all the submission files
4. Submit
5. For future submissions, consider GCDML: would we need to write a parser to create EMBL docs compliant with the rich template we create, or could we put the XML in the comment field? (Tim's query)