Upload to SRA/MicrobiomeDB


Our lab developed and maintains MicrobiomeDB, a web-based platform for sharing, integrating and querying 16S microbiome experiments. We can load your published data into our database. To speed this up, please prepare your raw data and mapping file for our data loading team as described here. Currently, we can accept legacy data from 454 ‘pyrosequencing’, as well as newer 16S rRNA gene sequencing data from the Illumina MiSeq platform. The protocol below outlines steps for preparing MiSeq data.

The format required for MicrobiomeDB is the same as that required for the Sequence Read Archive (SRA). There are specific instructions for how to upload your reads to SRA after formatting at the end of this protocol.

Before starting

  • You’ll need access to your raw 16S rRNA gene sequence data, in the form of one forward.fastq.gz and one reverse.fastq.gz
  • Sign up for a Google account (alternatively, Microsoft Excel is fine)
  • Get access to a compute cluster (or local workstation) with QIIME installed
  • Sign up for an account on microbiomeDB.org (it’s free and always will be!)
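
Before going further, it can help to confirm that the two compressed files downloaded intact. The following is a minimal sketch, assuming a bash-like shell and the standard gzip tool; check_gz is a hypothetical helper name and is not part of QIIME.

```shell
# Hypothetical helper: test each gzip file given as an argument.
# Prints one status line per file and returns non-zero if any file is damaged.
check_gz() {
    local rc=0 f
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK      $f"
        else
            echo "BROKEN  $f"
            rc=1
        fi
    done
    return "$rc"
}
```

For example: check_gz forward.fastq.gz reverse.fastq.gz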

Prepare your fastq files

Activate qiime1 and unzip your fastq files

source activate qiime1
gunzip *fastq.gz

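Depending on the exact format of your fastq files, you may need to remove the ‘+’ from the header lines. Write the output to a new file and then rename it, since redirecting straight back onto the input file would empty it before perl reads it. The command below assumes the standard four lines per read; repeat it for reverse.fastq if you have paired-end reads.

perl -pe 's/\+// if $. % 4 == 1' forward.fastq > forward.tmp.fastq && mv forward.tmp.fastq forward.fastq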
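
After editing the fastq files, a quick sanity check is to confirm that each file still has four lines per read and that the forward and reverse files contain the same number of reads. This is only a sketch, assuming uncompressed fastq files with exactly four lines per record; count_reads and check_pairs are hypothetical helper names.

```shell
# Hypothetical helper: print the number of reads in an uncompressed fastq file.
# Fails if the file is unreadable or its line count is not a multiple of 4.
count_reads() {
    local lines
    if [ ! -r "$1" ]; then
        echo "error: cannot read $1" >&2
        return 1
    fi
    lines=$(wc -l < "$1" | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "error: $1 has $lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Hypothetical helper: compare read counts of a forward and a reverse file.
check_pairs() {
    local fwd rev
    fwd=$(count_reads "$1") || return 1
    rev=$(count_reads "$2") || return 1
    if [ "$fwd" -ne "$rev" ]; then
        echo "error: $fwd forward reads but $rev reverse reads" >&2
        return 1
    fi
    echo "$fwd read pairs"
}
```

For example: check_pairs forward.fastq reverse.fastq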

Extract barcodes

# -r is only used for paired-end reads.
# -l specifies the barcode length; see the documentation for guidance on your particular barcodes.
extract_barcodes.py \
-f forward.fastq \
-r reverse.fastq \
-l 16 \
-o barcodes.fastq \
-c barcode_in_label

Demultiplex libraries

If using paired-end reads, perform the following steps for both forward and reverse reads.

Important: The output files will be named the same (sampleID.fastq) for both forward and reverse, so be sure to make a separate folder for forward and reverse files

split_libraries_fastq.py \
-i forward.fastq \
-o forward/split_libraries \
-m metadata_file.txt \
-b barcodes.fastq \
--barcode_type 16 \
--store_demultiplexed_fastq #Without this flag the output is only a demultiplexed fasta file
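
To keep the forward and reverse runs consistent, the command can be generated once per read direction. The sketch below only prints the two commands so they can be reviewed first; it assumes the input files are named forward.fastq and reverse.fastq and reuses the options shown above unchanged. demux_command is a hypothetical helper name.

```shell
# Hypothetical helper: print the split_libraries_fastq.py command for one
# read direction ("forward" or "reverse"). Nothing is executed here.
demux_command() {
    local direction=$1
    printf 'split_libraries_fastq.py -i %s.fastq -o %s/split_libraries -m metadata_file.txt -b barcodes.fastq --barcode_type 16 --store_demultiplexed_fastq\n' \
        "$direction" "$direction"
}

# Print one command per direction; each writes to its own output folder.
for direction in forward reverse; do
    demux_command "$direction"
done
```

Once the printed commands look right, they can be run by piping the output of the loop to sh in an environment where QIIME is active.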

Rename your files so that the run ID/name and the read direction are indicated in the name (in this example, the run name was ‘miseq12’ and these were forward reads).

mv seqs.fastq miseq12_forward_seqs.fastq
mv seqs.fna miseq12_forward.fna
mv histograms.txt miseq12_forward_histograms.txt
mv split_library_log.txt miseq12_forward_split_library_log.txt

Then split the demultiplexed fastq into one file per sample. Point -i at the renamed file, and adjust the -i and -o paths to match the directory you run the command from.

split_sequence_file_on_sample_ids.py \
-i miseq12_forward_seqs.fastq \
-o forward/split_libraries/forward_fastqs \
--file_type fastq

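In the case of paired-end 16S sequencing data, the forward and reverse files for each sample will have the same name. They must have unique names in order to upload to SRA/MicrobiomeDB, so add "_F" or "_R" as a suffix to each file, as appropriate. The loops below match *.fastq.gz, so if your per-sample files are still uncompressed, run gzip *.fastq in each folder first.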

In your forward fastq folder do:

for file in *.fastq.gz; do mv "$file" "${file%.fastq.gz}_F.fastq.gz"; done

In your reverse fastq folder do:

for file in *.fastq.gz; do mv "$file" "${file%.fastq.gz}_R.fastq.gz"; done

Now you can put all of these files in a single folder. You are ready to upload.
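
When combining the two folders, it is worth guarding against one file silently overwriting another with the same name. Here is a minimal sketch, assuming the files have already been given their _F and _R suffixes; collect_fastqs is a hypothetical helper name, and it copies rather than moves so the originals stay in place.

```shell
# Hypothetical helper: copy every *.fastq.gz from the source folders into one
# destination folder, stopping if a file of the same name is already there.
# Usage: collect_fastqs DEST SRC_DIR [SRC_DIR ...]
collect_fastqs() {
    local dest dir f target
    dest=$1
    shift
    mkdir -p "$dest"
    for dir in "$@"; do
        for f in "$dir"/*.fastq.gz; do
            [ -e "$f" ] || continue   # the glob matched nothing in this folder
            target="$dest/$(basename "$f")"
            if [ -e "$target" ]; then
                echo "error: $target already exists" >&2
                return 1
            fi
            cp "$f" "$target"
        done
    done
}
```

For example, with hypothetical folder names: collect_fastqs upload forward_fastqs reverse_fastqs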

Prepare metadata

MicrobiomeDB allows users to analyze and visualize microbiome data based on any user-provided metadata. This metadata needs to be provided as a Google Sheet. Each column of your spreadsheet should represent a metadata variable (e.g., Age, Sex, Treatment, etc.), and each row should be identified by a unique sample ID that matches the one used in naming the fastq files above. If you need help naming your column headers using appropriate terminology, you may find it useful to view our terminology list on MicrobiomeDB.
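
One way to check that the sample IDs in the spreadsheet match the file names is to compare them on the command line. The sketch below rests on several assumptions: the sheet has been exported as a tab-separated file with a header row and the sample ID in the first column, and each sample has a file named sampleID_F.fastq.gz in the upload folder. missing_samples is a hypothetical helper name.

```shell
# Hypothetical helper: print every sample ID from the metadata file that has
# no matching forward fastq file in the given folder.
# Usage: missing_samples METADATA_TSV FASTQ_DIR
missing_samples() {
    local metadata=$1 fastq_dir=$2
    # Skip the header row, drop Windows line endings, keep the first column.
    tail -n +2 "$metadata" | tr -d '\r' | cut -f1 | while IFS= read -r id; do
        [ -n "$id" ] || continue
        [ -e "$fastq_dir/${id}_F.fastq.gz" ] || echo "$id"
    done
}
```

For example, with hypothetical names: missing_samples metadata.tsv upload. No output means every sample ID has a matching forward file.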

Upload to SRA

Make sure you are logged in to your NCBI account, or create a new one.

Go to the Portal Wizard (yup... not a joke) at https://submit.ncbi.nlm.nih.gov/subs/sra/ and click 'New Submission'. Fill out the forms, including specifying the sample type.

For 16S projects choose: Sample type: Genome, metagenome or marker sequences (MIxS compliant) -> Specimen Marker Sequences (MIMARKS) -> host-associated

Fill out the required info using the templates provided. Note that the sample_names in both the Bioproject and SRA metadata must match.

To upload files:

  1. We recommend using file transfer software such as FileZilla (instructions here: https://www.ncbi.nlm.nih.gov/sra/docs/submitfiles/).
  2. Connect to subftp@ftp-private.ncbi.nlm.nih.gov (Address: ftp-private.ncbi.nlm.nih.gov; Username: subftp; Password: provided).
  3. You will get "Error: Failed to retrieve directory listing". In the Remote site box, manually navigate to your account folder (provided in the instructions, unique for each user), e.g. /uploads/unique_user_folderID
  4. Create a new folder with a unique name and enter it.
  5. Put your fastq.gz files in here.

You will need to wait up to 10 minutes after all of your files have transferred before the submission form recognizes that you have all of the files. After 10 minutes, refresh the preload folder page and you should see your folder with the correct number of files in it.
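
To know what the "correct number of files" is, you can list the local upload folder before transferring. This is a small sketch, assuming all files to upload sit in one folder and end in .fastq.gz; upload_manifest is a hypothetical helper name.

```shell
# Hypothetical helper: print each fastq.gz file name with its size in bytes,
# followed by the total number of files.
# Usage: upload_manifest UPLOAD_DIR
upload_manifest() {
    local dir=$1 count=0 f
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue   # the glob matched nothing
        count=$((count + 1))
        printf '%s\t%s\n' "$(basename "$f")" "$(wc -c < "$f" | tr -d ' ')"
    done
    echo "total files: $count"
}
```

The total can then be compared with the file count shown for your folder on the preload folder page.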

Now you can Autofinish your submission.