Preprocessing with Cell Ranger

Intro

This protocol walks through the steps involved in preprocessing raw Illumina data generated from a 10x Genomics experiment.

Before starting

The instructions below assume the following:

  • You have your .bcl files from Illumina's BaseSpace. If not, see our protocol for getting your data downloaded.
  • You have Cell Ranger installed on a computer with the recommended compute resources.
  • Cell Ranger is in your $PATH.
  • You have a .csv file (e.g. exported from Excel) that maps sample names to their index barcodes.
  • You have access to the raw .bcl files from a single-cell sequencing experiment.
  • You have some other basic Linux software installed on the same computer (also in your $PATH), including wget for downloading files.
  • You have organized your working directory as /data/[userName]/[experimentName]/[runName]/
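
A quick way to confirm the assumptions above is to check that each tool is visible in your $PATH. The snippet below is a minimal sketch; check_tools is a helper name made up for this example and is not part of Cell Ranger.

```shell
# Report which of the named command-line tools are missing from $PATH.
# Returns 0 if every tool is found, 1 otherwise.
# (check_tools is a hypothetical helper written for this protocol.)
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools used in the steps below.
check_tools cellranger wget gunzip \
    || echo "Install the missing tools or add them to \$PATH before continuing."
```

If anything is reported as MISSING, fix that first; the later steps will fail without it.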

Prepare your reference genome file

mouse and human

  • If your cells came from mice or humans, then you’re in luck! You can download a prebuilt reference from the 10x Genomics website. Alternatively, if you’re working on our Linux machine, you can find these references and more in /data/reference_db/10X

other organisms

  • If your cells are not mouse or human, then you’ll need to do a bit of work to get your reference file set-up.
  • Begin by downloading the reference genome FASTA file and .gtf file for your organism of interest. You can browse available files at the Ensembl FTP site.
  • Once you’ve identified the FASTA (.fa) and .gtf files you need, download them using wget:
# using rat as an example
# first the genome fasta file
wget ftp://ftp.ensembl.org/pub/release-99/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz
# then the .gtf file
wget ftp://ftp.ensembl.org/pub/release-99/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.99.chr.gtf.gz
  • Unzip both files using gunzip
gunzip Rattus_norvegicus.Rnor_6.0.99.chr.gtf.gz # gtf file from Ensembl
gunzip Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz # genome fasta from Ensembl
  • Use Cell Ranger’s mkgtf function to filter your original GTF file so that only genes annotated as protein coding remain.
# arguments, in order: the gtf file from Ensembl, the output name for the filtered gtf,
# and the attribute to filter on (here, protein coding genes).
# comments must not follow the trailing backslashes, or the command will break
cellranger mkgtf \
Rattus_norvegicus.Rnor_6.0.99.chr.gtf \
Rattus_norvegicus.Rnor_6.0.99.chr.filtered.gtf \
--attribute=gene_biotype:protein_coding
  • Use Cell Ranger’s mkref function to build an index for read alignment. mkref builds this index with the splice-aware aligner STAR.
# --genome:   output folder name
# --fasta:    reference fasta file
# --genes:    filtered gtf file from above
# --nthreads: number of threads to use (set to 24 on our linux machine)
cellranger mkref \
--genome=Rattus_norvegicus_genome_forRNA \
--fasta=Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa \
--genes=Rattus_norvegicus.Rnor_6.0.99.chr.filtered.gtf \
--nthreads=24
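
When the FASTA and GTF come from outside 10x, one thing worth checking before building the index is that every sequence name used in the GTF also exists in the FASTA. The following is a minimal sketch of such a check; missing_contigs is a hypothetical helper written for this protocol, not a Cell Ranger command.

```shell
# Print every sequence name that appears in the GTF but not in the FASTA.
# $1 = genome FASTA, $2 = GTF. Prints nothing if all names match.
# (missing_contigs is a hypothetical helper, not part of Cell Ranger.)
missing_contigs() {
    awk '
        # First file (FASTA): remember the name after ">" on each header line.
        NR == FNR { if (/^>/) { name = substr($1, 2); seen[name] = 1 }; next }
        # Second file (GTF): skip comment lines.
        /^#/ { next }
        # Report each unknown sequence name once.
        !($1 in seen) && !($1 in reported) { print $1; reported[$1] = 1 }
    ' "$1" "$2"
}

# Run the check on the rat files from above, if they are present.
fasta=Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa
gtf=Rattus_norvegicus.Rnor_6.0.99.chr.filtered.gtf
if [ -f "$fasta" ] && [ -f "$gtf" ]; then
    missing_contigs "$fasta" "$gtf"
fi
```

No output means every sequence named in the GTF was found in the FASTA. Any names that are printed should be resolved first.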

Prepare fastq files

  • Navigate to /data/[userName]/[experimentName]
  • Use Cell Ranger’s mkfastq function to convert the bcl files from your Illumina sequencing run to fastq files. mkfastq is a wrapper around Illumina’s bcl2fastq program with a few extra features: it handles 10x indices for demultiplexing samples, and it generates some quality control metrics that are specific to the 10x platform.
# --id:  name you're giving to the fastq group
# --run: the folder with your BCL files
# --csv: csv file that contains the barcode assignments and sample names
cellranger mkfastq --id=fastqGroup \
                   --run=/data/userName/BCL \
                   --csv=/data/userName/experimentName/demux.csv
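
For reference, the csv passed to mkfastq can use a simple three-column layout: Lane, Sample, Index. The sketch below writes a small example file and checks that every row has three fields. The sample names and index set names here are made up for illustration; use the ones from your own experiment.

```shell
# Write a hypothetical samplesheet in the simple three-column layout.
# sampleA/sampleB and the index set names are placeholders, not real assignments.
cat > demux.csv <<'EOF'
Lane,Sample,Index
1,sampleA,SI-GA-A1
1,sampleB,SI-GA-B1
EOF

# Show the header, then flag any row that does not have exactly three fields.
head -n 1 demux.csv
awk -F',' 'NF != 3 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
           END { exit bad }' demux.csv \
    && echo "demux.csv looks well-formed"
```

The names in the Sample column are the ones you will later pass to cellranger count, so it pays to keep them short and free of spaces.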

Generate counts

process one sample at a time

  • Use Cell Ranger’s count function to align the sequencing reads in your FASTQ files to your reference transcriptome. count also generates a .cloupe file for visualization and analysis in Loupe Browser, along with a number of other outputs compatible with publicly available tools for further analysis.
# --id:            name for the output
# --fastqs:        path to the fastq files, which should end in the serial number for the flow cell used.
#                  Be aware which folder you are in when you run this command!
# --sample:        name of the sample to be processed. must match the name you gave in your csv file!
# --transcriptome: path to your reference transcriptome (prebuilt, or created with mkref above).
#                  This example uses the prebuilt mouse reference
cellranger count --id=outputName \
                 --fastqs=/data/userName/experimentName/fastqGroup/outs/fastq_path/serialNumber/ \
                 --sample=sampleName \
                 --transcriptome=/data/reference_db/10X/refdata-cellranger-mm10-3.0.0
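
If you have several samples to process one at a time, a small loop saves retyping. The sketch below is a dry run: it only prints one cellranger count command per sample so you can check them before running anything. The sample names are hypothetical, print_count_cmd is a helper made up for this example, and reusing the sample name as the --id is a choice made for this sketch rather than a requirement.

```shell
# Dry run: print one `cellranger count` command per sample instead of running it.
# Paths are the placeholder paths used elsewhere in this protocol.
fastq_dir=/data/userName/experimentName/fastqGroup/outs/fastq_path/serialNumber/
transcriptome=/data/reference_db/10X/refdata-cellranger-mm10-3.0.0

# $1 = sample name; it is reused as the --id output name (an assumption of this sketch).
print_count_cmd() {
    printf 'cellranger count --id=%s --fastqs=%s --sample=%s --transcriptome=%s\n' \
        "$1" "$fastq_dir" "$1" "$transcriptome"
}

# sampleA and sampleB are hypothetical; use the names from your own csv file.
for sample in sampleA sampleB; do
    print_count_cmd "$sample"
done
```

Once the printed commands look right, they can be run one after another, for example by piping the loop's output to sh.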

process multiple samples

  • This is useful if you have multiple samples, or if you used feature barcoding to multiplex multiple samples in a single well of a 10x chip.

📌 There are no --sample or --fastqs arguments when processing multiple samples. Instead, you supply two .csv files: one describing your libraries (--libraries) and a second for your feature reference (--feature-ref).

cellranger count --id=Count_output_combined2 \
                 --libraries=/data/userName/libraries.csv \
                 --feature-ref=/data/userName/featurereference.csv \
                 --transcriptome=/data/reference_db/10X/refdata-cellranger-mm10-3.0.0
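
As a rough guide to what the two .csv files contain, the sketch below writes a small example of each. The column headers follow the layout Cell Ranger expects for these files; every value in the rows (sample names, feature id and name, pattern, and barcode sequence) is a placeholder invented for illustration and must be replaced with the values for your own libraries and reagents.

```shell
# libraries.csv: one row per library, giving its fastq folder, sample name and library type.
# The sample names are hypothetical and must match the names used when making the fastq files.
cat > libraries.csv <<'EOF'
fastqs,sample,library_type
/data/userName/experimentName/fastqGroup/outs/fastq_path/serialNumber/,sampleA_gex,Gene Expression
/data/userName/experimentName/fastqGroup/outs/fastq_path/serialNumber/,sampleA_hashtag,Antibody Capture
EOF

# featurereference.csv: one row per feature barcode.
# The pattern and sequence below are placeholders; take the real values from your reagent's documentation.
cat > featurereference.csv <<'EOF'
id,name,read,pattern,sequence,feature_type
hashtag1,Hashtag_1,R2,5PNNNNNNNNNN(BC),ACGTACGTACGTACG,Antibody Capture
EOF

# Show the header line of each file as a quick check.
head -n 1 libraries.csv featurereference.csv
```

Because the fastq locations and sample names live in libraries.csv, this is where to look first if count cannot find your reads.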