- Calculate summary statistics
- Check data quality with fastqc or fastp
- Summarize QC results with MultiQC
- OPTIONAL: check for contamination with fastqc_screen
- OPTIONAL: removing host reads
Calculate summary statistics
Use seqkit to quickly count reads in any fastq file or groups of fastq files
seqkit stats --threads 24 *.gz
Check data quality with fastqc or fastp
fastqc is routinely used as a command-line program for assessing the quality of reads in raw fastq files. The program can be run on your laptop, but it will be much easier/faster to run on a server with more compute power.
Begin by using fastqc to check the quality of each of your fastq files.
# navigate to the folder with your raw fastq files # run fastqc on all files, putting the outputs into a new folder called 'fastqc' # note that you must create this output folder first mkdir fastqc fastqc *.gz -t 24 -o fastqc
When you view the fastqc report, you may notice that some categories of quality metrics are flagged with a warning or a 'fail', indicating that some samples may have issues. Keep in mind that fastqc knows nothing about the kind of sequencing application that was carried out (e.g. RNA-seq, ATAC-seq, WGS), which can have a major impact on many aspects of the raw reads. For example, RNA-seq data will consist of reads that are duplicated within a single sample, in some cases thousands of times. This is totally fine! After all, RNA-seq is measuring gene expression, and we know that abundantly expressed transcripts will generate many identical reads.
Summarize QC results with MultiQC
MultiQC is a fantastic piece of software for aggregating and summarizing the outputs from many different kinds of bioinformatics programs in one convenient and interactive html file. In this case, we’ll use it to summarize the output from fastqc.
# change to the directory with your fastqc outputs # run multiqc with the -d command to tell multiqc to look in all folders (data, analysis and qa) to find log files multiqc -d .
You should now see a multiqc_report.html file in your project directory. You can copy this file from our server to your local computer using an FTP client (e.g. FileZilla), then double click and explore!
OPTIONAL: check for contamination with fastqc_screen
Often times there are questions about whether there may be reads, other than those from the intended sample source, present in a data file. If you suspect contamination of a particular kind (e.g. other host, plasmid, rRNA, or some common bacterium used in lab), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes.
You can read more about fastq_screen in this recent paper
fastq_screen uses bowtie2 for aligning reads to the references, so we’ve provided a set of reference genomes on our server to which you can easily compare.
We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes. This information is pretty clearly outlined in the fastq_screen configuration file found at
# make sure you're in the directory where your fastq files are fastq_screen --threads 24 --outdir fastq_screen *gz
DO NOT modify the fastq_screen.conf file. If you want to use fastq_screen against a bowtie2 reference genome that is not listed below, please contact us for help
The reference genomes listed below have already been added to the configuration file.
- Mouse (Mus musculus)
- Dog (Canis familiaris)
- Cow (Bos taurus)
- Horse (Equus caballus)
- Pig (Sus scrofa)
- Chicken (Gallus gallus)
- Fruitfly (Drosophila melanogaster)
- Yeast (Saccharomyces cerevisiae)
- E. coli (strain K12)
- Staph (Staphyloccous aureus strain NCTC 8325)
- Lambda phage (Enterobacteriophage lambda)
OPTIONAL: removing host reads
Depending on the results you get with
fastq_screen above, you may want to filter reads based on alignment to a particular reference genome of interest. This is particularly useful for removing host reads contaminating a metagenomic sample, for example. To do this, you can use the
--filter options for fastq_screen. See the documentation for fastq_screen to understand how to properly use
First, tag each read in each fastq with the genome to which it aligns (from the available references described above)
fastq_screen --tag sampleX.fastq.gz
Next, filter based on tags that were assigned above
fastq_screen --filter 1000 sampleX.fastq.gz
Sometimes, you won't care to screen your raw reads against multiple genomes. Instead, you'll know exactly which 'contaminating' species needs to be removed (e.g. removing human reads from shotgun metagenomic data). In these cases, an alternative to using
fastq_screen for filtering out reads is to use KneadData, which leverages bowtie2 to align reads and keep only those that do NOT map to a particular genome reference.
# here's how we remove human host reads from our fastq files kneaddata --input SRR8668774.fastq.gz \ #the data you want to 'knead' --reference-db /data/reference_db/Homo_sapiens/Ensembl/GRCh38 \ #path to bowtie2 index that will be used to remove host reads --trimmomatic /usr/local/bin/Trimmomatic-0.39 \ #kneaddata needs the path to trimmomatic -o SRR8668774_dehost \ #specify an output directory for kneaddata to write to -t 24