So you’re ready to submit a manuscript describing your hard work…congratulations! If you’re reading this page, then chances are you have RNAseq data that you want to post to an approved data repository so that you can get an accession number to include in your manuscript submission. This is the first step in ensuring that the broader research community can access the raw and processed data from any sequencing experiment(s) described in your publication. The Sequence Read Archive (SRA) is the de facto home for all sequencing-related studies, but it can be confusing to navigate. Thankfully, the Gene Expression Omnibus (GEO) will accept deposits of gene expression profiling data from RNAseq experiments and will pass this data on to the SRA on your behalf. This is definitely the way to go, since the process of submitting data to GEO is pretty straightforward. The steps below will walk you through this process. GEO’s own instructions are available here, but the steps on this page are meant as a simpler walkthrough.
What you’ll need
- an account on myNCBI. If you don’t already have one, sign up here
- the free FTP client, FileZilla
- access to your raw .fastq files from your sequencing experiment. Keep each fastq file in the compressed (.gz) form at all times.
- access to a table of ‘processed’ gene expression data. This will need to be either raw counts or normalized data produced by DESeq2 or edgeR (or both files).
Complete metadata spreadsheet
This is the most time-consuming part of the whole process, but also the most important. Download this example Excel file from one of our own RNAseq studies, and replace its contents with the appropriate information from your own experiment. This document should describe each sample, how it was handled, and how your samples relate to the raw fastq files. The more ‘metadata’ you can include here, the more useful you make this data to outside investigators, so don’t skimp on the details.
To complete this form you will need to generate a unique checksum for each file that you intend to transfer to GEO (both the fastq files and any processed data files). Checksums are used to confirm that a file was not corrupted during transfer. They are easy to generate, but do require that you open a command line program (e.g. Terminal on a Mac). Once you have a terminal window open, navigate to the directory on your computer where you have all the files you plan to transfer.
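Type the md5 command followed by the name of your file to generate the 32-character checksum, then copy and paste the result into your Excel submission form. Here’s an example of what this looks like (the -r flag prints the checksum first, followed by the file name; on Linux, md5sum gives a similar layout). Note that shasum would not work here, since by default it produces a 40-character SHA-1 checksum rather than a 32-character MD5 checksum.
md5 -r GEOsubmission.xlsx
39eae0dbcc54367c588d2815876747f4 GEOsubmission.xlsx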
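If you have many files, typing one command per file gets tedious. The following is a minimal Python sketch, not part of GEO’s own tooling, that computes the MD5 checksum of every file in a folder using only the standard library. The function names md5_of_file and md5_table are made up for this illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the 32-character hexadecimal MD5 digest of one file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large fastq.gz files are never loaded into memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_table(folder):
    """Map each file name directly inside `folder` to its MD5 checksum."""
    return {
        path.name: md5_of_file(path)
        for path in sorted(Path(folder).iterdir())
        if path.is_file()  # skip sub-directories
    }
```

Calling md5_table(".") from the directory that holds your files returns a dictionary of file names and checksums that you can copy into the spreadsheet. It is worth spot-checking one value against the command line result before trusting the rest.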
Preparing to transfer
- Create a folder anywhere on your computer and give it the same name as your myNCBI username (for me this is ‘danielbeiting’).
- Transfer all your raw and processed files, and your metadata spreadsheet completed in step 1, to this folder.
- If the total size of this folder is larger than 1 TB, you will need to email GEO with a list of your file names and their checksums, so that they can manage server space accordingly.
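To find out whether your folder crosses the 1 TB threshold mentioned above, you can add up the file sizes. Here is a rough Python sketch; it assumes that 1 TB means 10^12 bytes (check with GEO if you are close to the limit), and the names folder_size_bytes and needs_advance_notice are hypothetical.

```python
import os

ONE_TB = 10 ** 12  # assumption: 1 TB interpreted as 10^12 bytes


def folder_size_bytes(folder):
    """Sum the sizes of all files under `folder`, including sub-folders."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def needs_advance_notice(folder, limit=ONE_TB):
    """Return True when the folder is larger than the limit."""
    return folder_size_bytes(folder) > limit
```

For example, needs_advance_notice("danielbeiting") would report whether the folder named after my username is large enough that GEO should be contacted first.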
Transfer your data
- Launch FileZilla and connect to the GEO storage server using the following credentials:
- Host: ftp-private.ncbi.nlm.nih.gov
- username: geoftp
- password: this will be specific to your myNCBI account. To get your user-specific password, make sure you’re logged into your myNCBI account on the web, and then go to this webpage. Scroll down and you will see your login credentials. You’ll need to enter this info into FileZilla to connect.
Once the transfer is complete, let GEO know the following:
- Your GEO account username. If depositing to my account, use ‘danielbeiting’
- Names of the directory and files deposited
- Your chosen date for public release. This can be immediate, or up to a maximum of 3 years away, and you can always extend it later.
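If you would rather script the upload than use FileZilla, the transfer step can be sketched with Python’s standard ftplib module. This is only an illustration under several assumptions: that the server accepts a plain FTP login with the credentials described above, that creating a directory named after your local folder is the expected layout, and that the directory does not already exist on the server. The function names upload_folder and connect_and_upload are made up here. The list of file names it returns is handy when reporting the names of the directory and files deposited.

```python
import os
from ftplib import FTP

GEO_HOST = "ftp-private.ncbi.nlm.nih.gov"  # host given in the steps above
GEO_USER = "geoftp"                        # username given in the steps above


def upload_folder(ftp, local_folder):
    """Upload every file in `local_folder` to a same-named remote directory.

    `ftp` is an already logged-in ftplib.FTP connection.
    Returns the list of uploaded file names.
    """
    remote_dir = os.path.basename(os.path.normpath(local_folder))
    ftp.mkd(remote_dir)  # assumption: the directory does not exist yet
    ftp.cwd(remote_dir)
    uploaded = []
    for name in sorted(os.listdir(local_folder)):
        path = os.path.join(local_folder, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                ftp.storbinary(f"STOR {name}", handle)
            uploaded.append(name)
    return uploaded


def connect_and_upload(local_folder, password):
    """Log in with your account-specific password and upload the folder."""
    with FTP(GEO_HOST) as ftp:
        ftp.login(user=GEO_USER, passwd=password)
        return upload_folder(ftp, local_folder)
```

Keeping the login separate from the upload logic makes it possible to try upload_folder against a stand-in connection before touching the real server.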