Accessing sequence data


You will often need to access sequence data, either from public repositories like the Sequence Read Archive (SRA) or from your own recent sequencing runs on Illumina's BaseSpace cloud service. The steps below outline how to do this from the command line, so that data downloads directly to our Linux server for convenient access to our software.

What you’ll need

  • A free BaseSpace account with Illumina
    • If we set up an account for you, your username will be your Penn email and your password will be [FirstInitial][LastInitial]_Illumina123 (you are welcome to change this password after you have logged in)
  • An account on our Linux server - you only need this, of course, if you intend to download data to our server 🙂
  • Grabseqs software - if you want to download fastq files from sources other than Basespace, including the Sequence Read Archive (SRA) or MG-RAST

Getting data from SRA

If you just want to get data directly from Illumina's Basespace, you can skip this step!

#navigate to the folder where you want the files to be downloaded
grabseqs sra -t 24 -m metadata.csv -r 3 PRJ#######
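
A minimal sketch of the step above, using the same flags as the command shown. The folder name `sra_downloads` and the `DOWNLOAD_DIR` and `ACCESSION` variables are hypothetical (not from the original); the accession is left for you to supply. The sketch only calls grabseqs if it is on your PATH and an accession has been set.

```shell
# Hypothetical folder name; replace with wherever you want the fastq files.
download_dir="${DOWNLOAD_DIR:-sra_downloads}"
# Left empty on purpose; set ACCESSION to your real project accession.
accession="${ACCESSION:-}"

# Create the folder if needed and move into it, so files land there.
mkdir -p "$download_dir"
cd "$download_dir" || exit 1

if command -v grabseqs >/dev/null 2>&1 && [ -n "$accession" ]; then
    # Same flags as the command above.
    grabseqs sra -t 24 -m metadata.csv -r 3 "$accession"
else
    echo "grabseqs not found or ACCESSION not set; nothing downloaded" >&2
fi
```

For example, you might run it as `ACCESSION=<your accession> sh download_sra.sh`, where `download_sra.sh` is whatever name you saved the sketch under.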

Accessing data on Basespace

  • Connect to our Linux server using your terminal, or by launching RStudio Server in the browser and opening its terminal window.
  • Use the Illumina BaseSpace client tools and the bs auth command to authenticate
bs auth
# You should see a message similar to the one below. Open a new browser tab and navigate to the URL you see in your terminal. You may be instructed to log into BaseSpace.
Please go to this URL to authenticate:
# Once you've done that, you should see a message in your terminal that welcomes you.
Welcome, Daniel Beiting
  • Once authenticated, you will be able to download data directly from BaseSpace to our Linux server.
  • Navigate to your sequencing project on Basespace and locate the project ID in your browser URL
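
As a small sketch of pulling the project ID out of a copied browser URL: the URL below is a hypothetical example, and the sketch assumes the ID is the run of digits that follows `/projects/`. Check this against what your own browser shows.

```shell
# Hypothetical URL copied from the browser address bar (assumed layout).
project_url="https://basespace.illumina.com/projects/177233056/about"

# Keep only the digits that follow "/projects/".
project_id=$(printf '%s\n' "$project_url" | sed -n 's#.*/projects/\([0-9][0-9]*\).*#\1#p')

echo "Project ID: $project_id"
```

If the URL does not contain `/projects/` followed by digits, `project_id` is left empty, which is a cue to look at the URL by eye instead.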

Downloading fastq files from Basespace

Use the project ID number and the bs download function to get the data.

#specify where you want the data to download using the -o option
#NOTE: you need to have privileges to write to the folder you choose
bs download project -i 177233056 -o /publicData/myProjectFolder/
  • Once the download begins, you will be able to monitor progress in the terminal window.
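
Because the download needs write privileges on the target folder, a quick check beforehand can save a failed run. This is a minimal sketch: `can_write_to` is a hypothetical helper (not part of the BaseSpace tools), and the folder path is the example used above.

```shell
# Hypothetical helper: succeeds only if the folder exists and is writable by you.
can_write_to() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Example output folder from the download command above.
out_dir="/publicData/myProjectFolder/"

if can_write_to "$out_dir"; then
    echo "OK: $out_dir is writable"
else
    echo "Cannot write to $out_dir; pick another folder or ask for access" >&2
fi
```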

Downloading raw .bcl files from BaseSpace

In some cases you will want unprocessed .bcl files that Illumina's BaseSpace has not converted to fastq. This is often useful for single cell sequencing experiments. In these cases, you'll download the run, rather than the project.

#specify where you want the data to download using the -o option
sudo bs download run -i 177233056 -o /publicData/myRunFolder/

Downloading large studies from SRA or ENA

An alternative to Grabseqs that works really well for downloading very large studies is this NextFlow workflow written by Rich Demko. To use this, you need to do just a little bit of configuring:

  • Other options exist (SRA_metadata, the SRAdb R package, a new SRAdbV2 package that is under development, and various tutorials), but these all require that you download all SRA metadata to query offline. Not easy or quick!
  • First, install the NextFlow workflow management system on your computer
  • Create a simple tsv file that lists the samples you want to download from SRA/ENA (e.g. ERR######, SAMN########, etc). This file should consist of a single column of IDs with the header ‘run_accession’. See the ‘samples.tsv’ attachment below as an example.
  • samples.tsv (1.9KB)
  • Then, create a nextflow.config file (you can use the one below as a start). You’ll need to open this config file with a simple text editor and replace ‘test.tsv’ with the path and name of your tsv file from above. You also need to replace ‘output’ with the path to a folder (must already exist) that you’d like your fastq files downloaded to.
  • nextflow.config (0.2KB)
  • Now you’re ready to run with a single command that calls the containerized workflow from GitHub
nextflow -C nextflow.config run -r e026e4c8b83e43032c9bd250f1b3be94d0cc0f9b
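
The samples file described in the steps above can be sketched as follows. The three accessions are made-up placeholders, not real runs, and the file is written to the current folder under the name `samples.tsv` to match the example attachment.

```shell
# Write a one-column file with the required 'run_accession' header.
# The IDs below are made-up placeholders; substitute your own accessions.
printf '%s\n' \
    "run_accession" \
    "ERR0000001" \
    "ERR0000002" \
    "ERR0000003" > samples.tsv

# Sanity check: the first line should be the header described above.
head -n 1 samples.tsv
```

Remember to point your nextflow.config at this file, as described in the steps above.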