Check for contamination with fastq_screen
Often there are questions about whether a data file contains reads from something other than the intended sample source. If you suspect a particular kind of contamination (e.g. another host, a plasmid, rRNA, or a common lab bacterium), you can run fastq_screen to check a subsample of reads from your raw fastq file against a set of reference genomes. Although this has nothing to do with Sourmash or MinHash sketches per se, it can be a useful way to confirm findings from Sourmash, or to check for organisms that are not well represented in the SBT reference databases provided above.
You can read more about fastq_screen in this recent paper
fastq_screen aligns reads to the references with bowtie2, and we’ve provided a set of reference genomes on our server that you can compare against.
We’ve taken care of configuring fastq_screen so that it knows where to find bowtie2 and where to look for the reference genomes. Both locations are listed in the fastq_screen configuration file at /usr/local/bin/fastq_screen/fastq_screen.conf
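To see which references are configured, you can pull the reference names out of the configuration file. The sketch below assumes the usual fastq_screen.conf layout, in which each reference is declared on a line of the form `DATABASE <name> <path to bowtie2 index>`. The helper name `list_screen_databases` is ours for illustration and is not part of fastq_screen.

```shell
# Hypothetical helper: print the name of each configured reference genome,
# one per line, in the order it appears in the configuration file.
#   $1 = configuration file (defaults to the path used on our server)
list_screen_databases() {
    # Field 1 must be exactly DATABASE, so commented-out entries
    # such as "#DATABASE ..." or "# DATABASE ..." are skipped.
    awk '$1 == "DATABASE" { print $2 }' \
        "${1:-/usr/local/bin/fastq_screen/fastq_screen.conf}"
}

# Usage on the server (not run here):
#   list_screen_databases
```

The order of the names matters later, because the filter string is written against that same order.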
# make sure you're in the directory where your fastq files are
fastq_screen --threads 24 --outdir fastq_screen *gz
Filtering reads with fastq_screen
Depending on the results you get with fastq_screen above, you may want to filter reads based on alignment to a particular reference genome of interest. This is particularly useful for removing host reads contaminating a metagenomic sample, for example. To do this, you can use the --tag and --filter options for fastq_screen. See the documentation for fastq_screen to understand how to properly use --tag and --filter.
First, tag each read in the fastq file with the genome(s) to which it aligns (from the available references described above):
fastq_screen --tag sampleX.fastq.gz
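Next, filter based on the tags assigned above. The filter string takes one character per reference genome, in the order the genomes are listed in the configuration file; see the fastq_screen documentation for the meaning of each character: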
fastq_screen --filter 1000 sampleX.fastq.gz
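Because the filter string is positional, it is easy to get wrong by hand. As a rough sketch, a small helper can build it from the configuration file. This assumes, per our reading of the fastq_screen documentation, that the string has one character per DATABASE entry in configuration-file order, that `0` selects reads that do not map to a genome, and that `-` applies no filter to a genome. Please check these meanings against the documentation for your installed version. The function name `make_filter_string` and the genome name `Human` are hypothetical.

```shell
# Hypothetical helper: build a --filter string for one genome of interest.
#   $1 = genome name as written on its DATABASE line
#   $2 = filter character for that genome (e.g. 0, assumed "does not map")
#   $3 = configuration file (defaults to the path used on our server)
# Every other genome gets "-" (assumed to mean "no filter on this genome").
# Returns non-zero and prints nothing if the genome name is not configured.
make_filter_string() {
    awk -v target="$1" -v code="$2" '
        $1 == "DATABASE" {
            s = s ($2 == target ? code : "-")
            found += ($2 == target)
        }
        END { if (!found) exit 1; print s }
    ' "${3:-/usr/local/bin/fastq_screen/fastq_screen.conf}"
}

# Example (not run here): keep reads that do NOT map to a reference
# named "Human", ignoring all other references.
#   fastq_screen --filter "$(make_filter_string Human 0)" sampleX.fastq.gz
```

Generating the string from the configuration file keeps it in step with the reference list if genomes are added or reordered.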