fastq_screen: QC analysis of sequencing data

Check for contamination with fastq_screen

It is common to ask whether a data file contains reads from sources other than the intended sample. If you suspect contamination of a particular kind (e.g. another host, a plasmid, rRNA, or a bacterium commonly used in the lab), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes. Although this has nothing to do with Sourmash or MinHash sketches per se, it can be a useful way to confirm findings from Sourmash, or to check for organisms not well represented in the SBT reference databases provided above.

You can read more about fastq_screen in this recent paper

fastq_screen uses bowtie2 to align reads to the references, so we’ve provided a set of bowtie2 reference genomes on our server that you can easily screen your reads against.

We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes. This information is pretty clearly outlined in the fastq_screen configuration file found at /usr/local/bin/fastq_screen/fastq_screen.conf
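To see which references are configured without opening the file in an editor, you can list the database entries from the command line. This is a minimal sketch, assuming the configuration file follows the usual fastq_screen layout in which each reference is declared on a line starting with DATABASE, followed by a name and an index path. list_screen_databases is a hypothetical helper name used here for illustration; it is not part of fastq_screen.

```shell
# Hypothetical helper (not part of fastq_screen): print the name of each
# reference declared in a fastq_screen configuration file, in file order.
# Assumes each reference is declared as: DATABASE <name> <index path>
list_screen_databases() {
    awk '$1 == "DATABASE" { print $2 }' "$1"
}

# Path taken from the text above. This only reads the file, it never edits it.
CONF=/usr/local/bin/fastq_screen/fastq_screen.conf
if [ -r "$CONF" ]; then
    list_screen_databases "$CONF"
fi
```

Commented-out entries are skipped because their first field does not equal DATABASE. The order printed is the order in which the references appear in the file.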

# make sure you're in the directory where your fastq files are
fastq_screen --threads 24 --outdir fastq_screen *gz  
DO NOT modify the fastq_screen.conf file. If you want to use fastq_screen against a bowtie2 reference genome that is not listed below, please contact us for help.
The reference genomes listed below have already been added to the configuration file.
  • Human
  • Mouse (Mus musculus)
  • Dog (Canis familiaris)
  • Cow (Bos taurus)
  • Horse (Equus caballus)
  • Pig (Sus scrofa)
  • Chicken (Gallus gallus)
  • Fruitfly (Drosophila melanogaster)
  • Yeast (Saccharomyces cerevisiae)
  • E. coli (strain K12)
  • Staph (Staphylococcus aureus strain NCTC 8325)
  • Lambda phage (Enterobacteriophage lambda)
  • PhiX
  • Contaminants
  • plasmids/vectors
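
Once the screen has finished, the per-sample text reports can be condensed into one "percent of reads mapped" figure per reference. The sketch below rests on several assumptions that you should check against your own output: that reports are written to the fastq_screen output directory used above with names ending in _screen.txt, that they are tab-delimited, that the column header line starts with Genome, and that the fourth column holds the percentage of unmapped reads. summarize_screen is a hypothetical helper name, not a fastq_screen command.

```shell
# Hypothetical helper: for one fastq_screen text report, print each genome
# name and the percentage of subsampled reads that mapped to it.
# Assumes a tab-delimited table whose 4th column is %Unmapped.
summarize_screen() {
    awk -F '\t' '
        /^#/ || /^%/ || NF < 4 { next }   # skip version line, footer, blank lines
        $1 == "Genome"         { next }   # skip the column header
        { printf "%s\t%.2f\n", $1, 100 - $4 }
    ' "$1"
}

# Summarize every report in the output directory used above, if any exist.
for report in fastq_screen/*_screen.txt; do
    [ -e "$report" ] || continue
    echo "== $report"
    summarize_screen "$report"
done
```

A reference with a high mapped percentage that is not your intended sample source is a candidate for the filtering step described in the next section.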

Filtering reads with fastq_screen

Depending on the results you get with fastq_screen above, you may want to filter reads based on alignment to a particular reference genome of interest. This is particularly useful for removing host reads that contaminate a metagenomic sample, for example. To do this, you can use the --tag and --filter options for fastq_screen. See the documentation for fastq_screen to understand how to properly use --tag and --filter.

First, tag each read in each fastq file with the genome to which it aligns (from the available references described above):

fastq_screen --tag sampleX.fastq.gz

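Next, filter based on the tags that were assigned above. The value passed to --filter is a string with one character per reference genome, in the order the genomes appear in the configuration file; the fastq_screen documentation explains what each character means.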

fastq_screen --filter 1000 sampleX.fastq.gz
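
Because the filter string is positional, it is easy to miscount when many references are configured. The sketch below builds the string from the configuration file instead of by hand. It assumes the DATABASE line layout of the fastq_screen configuration file and the filter codes as described in the fastq_screen documentation (0 for "read does not map", - for "do not filter on this genome"); confirm both against the documentation for your installed version. make_filter_string is a hypothetical helper, and the reference name Human is an assumption about how that genome is named in the configuration file.

```shell
# Hypothetical helper: build a --filter string with one character per
# configured genome. The target genome gets the given code; every other
# genome gets "-" (do not filter on this genome).
# Returns non-zero if the target name is not found in the configuration.
make_filter_string() {
    conf=$1
    target=$2
    code=$3
    awk -v target="$target" -v code="$code" '
        $1 == "DATABASE" {
            out = out (($2 == target) ? code : "-")
            found += ($2 == target)
        }
        END { if (!found) exit 1; print out }
    ' "$conf"
}

CONF=/usr/local/bin/fastq_screen/fastq_screen.conf
if [ -r "$CONF" ]; then
    # Code 0 at the Human position selects reads that did not map to Human.
    make_filter_string "$CONF" Human 0 ||
        echo "reference name not found in $CONF" >&2
fi
```

The printed string can then be passed as the --filter value in place of a hand-typed one.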