FastQC and MultiQC: checking quality of raw reads

Check data quality with fastqc

fastqc is a widely used command-line program for assessing the quality of reads in raw fastq files. The program can be run on your laptop, but it will usually be easier and faster to run on a server with more compute power.

Begin by using fastqc to check the quality of each of your fastq files.

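# navigate to the folder with your raw fastq files
# create the output folder first, since fastqc will not create it for you
mkdir -p fastqc
# run fastqc on all files, putting the outputs into the new folder called 'fastqc'
# -t 24 processes up to 24 files at once; lower this to match the cores available to you
fastqc *.gz -t 24 -o fastqc
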
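Before moving on, it can help to confirm that every raw file actually produced a report. The sketch below is one way to do that, assuming fastqc's usual output naming (sample.fastq.gz or sample.fq.gz becomes sample_fastqc.html). The helper name missing_fastqc_reports is hypothetical and not part of fastqc.

```shell
# List raw files that have no matching fastqc report.
# Assumption: fastqc names its report <sample>_fastqc.html, where <sample>
# is the input file name with .gz and then .fastq or .fq removed.
# 'missing_fastqc_reports' is a hypothetical helper, not a fastqc command.
missing_fastqc_reports() {
    rawdir="${1:-.}"        # folder with the raw fastq files
    outdir="${2:-fastqc}"   # folder given to fastqc with -o
    for fq in "$rawdir"/*.gz; do
        [ -e "$fq" ] || continue        # skip if the glob matched nothing
        base=$(basename "$fq" .gz)      # sample.fastq.gz -> sample.fastq
        base=${base%.fastq}             # sample.fastq    -> sample
        base=${base%.fq}                # sample.fq       -> sample
        [ -e "$outdir/${base}_fastqc.html" ] || echo "$fq"
    done
}

# usage, from the folder with your raw fastq files:
#   missing_fastqc_reports . fastqc
```

If the function prints nothing, every .gz file in the folder has a report. Any file it does print is worth rerunning through fastqc on its own to see the error message.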
When you view the fastqc reports, you may notice that some quality metrics are flagged with a warning or a 'fail', suggesting that some samples may have issues. Keep in mind that fastqc knows nothing about the kind of sequencing application (e.g. RNA-seq, ATAC-seq, WGS) that was carried out, and the application has a major impact on many aspects of the raw reads. For example, RNA-seq data will contain reads that are duplicated within a single sample, in some cases thousands of times, so the sequence duplication metric is often flagged. This is totally fine! After all, RNA-seq measures gene expression, and we know that abundantly expressed transcripts will generate many identical reads.
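To see at a glance which metrics are flagged across many samples, the flags can be tallied from the summary.txt file that fastqc writes for each sample. The sketch below assumes the reports have been unpacked (by running fastqc with --extract, or by unzipping each *_fastqc.zip), so that each sample has a folder named sample_fastqc containing a tab-separated summary.txt (status, metric name, input file). The function name summarize_fastqc_flags is hypothetical.

```shell
# Count how many samples were flagged WARN or FAIL for each fastqc metric.
# Assumption: each report has been unpacked to <outdir>/<sample>_fastqc/summary.txt,
# with tab-separated columns: status, metric name, input file name.
# 'summarize_fastqc_flags' is a hypothetical helper, not a fastqc command.
summarize_fastqc_flags() {
    outdir="${1:-fastqc}"
    tab=$(printf '\t')
    cat "$outdir"/*_fastqc/summary.txt |
        awk -F '\t' '
            $1 == "WARN" || $1 == "FAIL" { count[$2 "\t" $1]++ }
            END { for (k in count) print count[k] "\t" k }
        ' |
        sort -t "$tab" -k2,2 -k3,3   # order by metric name, then status
}

# usage, from the folder that contains the 'fastqc' output folder:
#   summarize_fastqc_flags fastqc
```

Each output line gives the number of samples, the metric and the flag. A metric that is flagged in nearly every sample often reflects the sequencing application rather than a problem with individual samples, while a flag in only one or two samples deserves a closer look.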

Summarize QC results with MultiQC

MultiQC is a fantastic piece of software for aggregating and summarizing the outputs from many different kinds of bioinformatics programs in one convenient and interactive html file. In this case, we’ll use it to summarize the output from fastqc.

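# change to your project directory (the one that contains your fastqc outputs)
# run multiqc on the current directory (.); it searches all subfolders (e.g. data, analysis and qa) for log files
# the -d option prepends each log's directory to its sample name, which keeps samples from different folders distinct
multiqc -d .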

You should now see a multiqc_report.html file in your project directory. You can copy this file from our server to your local computer using an FTP client (e.g. FileZilla), then double-click it to open it in your web browser and explore!