- Before starting
- Prepare sample ‘sketches’
- Compare sketches to each other
- Visualize sample relatedness
- Search a signature against a database of signatures
- Profile a metagenome sample against a large reference database
- Interpreting results
- Collating Sourmash results into a single table
Before starting
The following protocol outlines the use of Sourmash for interpreting microbial whole genome sequence (WGS) or metagenomic data. Sourmash computes a fingerprint or ‘sketch’ from your data using minHash, and enables comparison of sketches from isolates (to understand strain relatedness, for example). Similarly, a sketch can be compared against a large databases of sketches, stored as a .zip database for rapid searching, from Genbank or the Genome Taxonomy Database (GTDB) in order to help assign identity to a sample. These databases of microbial genomes from all of Genbank are available directly on our server for convenience.
If you want to learn more about how Sourmash works, you could start with the original Sourmash paper, and a recent follow-up paper. Also check out C. Titus Brown’s blog posts on the topic here and here. Also take a look at Adam Phillippy’s original Mash paper.
Note: Sourmash is a kmer-based method, which means you have to choose a kmer size. In general smaller kmers (e.g. k=31) allow for higher sensitivity but may result in some false positives, while larger kmer sizes (e.g. k=51) will tend to give higher specificity at the expense of false negatives.
Collectively, the GenBank references on our server contain over 1.2 million microbial genomes (including viral and fungal) from NBCI’s Genbank, including:
- 1,148,011 bacterial genomes
- 47,952 viral genomes
- 8,750 archaeal genomes
- 10,286 fungal genomes
- 1193 protist genomes
The Genome Taxonomy Database (GTDB) files include data for 402,709 prokaryotic genomes organized into 85,205 species clusters. This is most comprehensive and well-curated collection of bacterial genomes, and therefore is often the database of choice when using Sourmash gather
Prepare sample ‘sketches’
Here, we’ll use sourmash’s sketch
function to prepare a sketch of each fastq file in our directory. The --scaled
option applies a 1000:1 compression ratio, which retains the ability to detect regions of similarity in the 10kb range.
# for a single sample of single end data
sourmash sketch dna \
SAMPLE.fastq.gz \
-p k=21,k=31,k=51,scaled=1000,abund \
--merge SAMPLE \
-o SAMPLE.sig.gz
# for a single sample of paired end data
sourmash sketch dna \
SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz \
-p k=21,k=31,k=51,scaled=1000,abund \
--merge SAMPLE \
-o SAMPLE.sig.gz
# looping for single end data for multiple samples
for FASTQ in *.fastq.gz
do
SAMPLE=${FASTQ//_001.fastq.gz}
sourmash sketch dna $FASTQ -p k=21,k=31,k=51,scaled=1000,abund --merge $SAMPLE -o ${SAMPLE}.sig.gz
echo "done sketching from $SAMPLE!"
done
# looping for paired end data for multiple samples
for READ1 in *R1_001.fastq.gz
do
READ2=${READ1//1_001.fastq.gz/2_001.fastq.gz}
SAMPLE=${READ1//_R1_001.fastq.gz}
sourmash sketch dna $READ1 $READ2 -p k=21,k=31,k=51,scaled=1000,abund --merge $SAMPLE -o ${SAMPLE}.sig.gz
echo "done sketching from $SAMPLE!"
done
Compare sketches to each other
The Sourmash compare
command computes the jaccard index between two or more sketches generated above.
sourmash compare *.sig.gz -k 31 -o cmp
Visualize sample relatedness
Sourmash plot
prepares a graphic showing the distance between samples
sourmash plot --pdf --labels cmp
Search a signature against a database of signatures
Sourmash search
is useful if you want to identify an organism of unknown identity using a large database of signatures from known microorganisms. For example, perhaps you have sequence data for a cultured isolate and you’re interested in figuring out what this cool new bug might be. For this, we take advantage of the pre-built Genbank references described above.
# for one or many signatures
sourmash search \
SAMPLE.sig.gz \ #can replace with *.sig.gz to search all samples in a directory at once
/data/reference_db/sourmash_refs/genbank-2022.03-bacteria-k31.zip \
-n 20 #optional. shows top 20 hits
Profile a metagenome sample against a large reference database
Sourmash gather
profiles a metagenome signature using a database of microbial signatures. However, this function is not only useful for profiling metagenomic data, it can also be used to address questions about whether a preparation used for WGS was contaminated with another organism. Perhaps Sanger sequencing of 16S rDNA gave murky results, or restreaking an isolate showed more than one distinct colony morphology. In either case, the Sourmash gather
function allows you to ask exactly which organisms might be present in a sample based only on the sequence data. Like the search function above, gather needs a reference database to search against.
for SIG in *sig.gz
do
SAMPLE=${SIG//.sig.gz}
sourmash gather $SIG /data/reference_db/sourmash/gtdb-rs214-reps.k31.zip -o ${SAMPLE}_gather_gtdbrs214_reps.csv
echo "done running gather on the sketch from $SAMPLE!"
done
You can repeat the code above for k=51, and/or for other reference databases as well.
Now that you have profiled the sample, you can summarize the taxonomic breakdown of the sample using a simple ‘kraken-like’ report format. By default, the reference databases used in the gather step above have no information about taxonomy…they’re just genomes. So, we have to use the ‘lineages.csv’ file provided by the Sourmash developers to get this taxonomic info.
# code below is for k=31. Depending on your analysis goals, you may want to repeat with k=51 if you want high specificity (at the expense of lower sensitivity)
for OUT in *_gather_gtdbrs214_reps.csv
do
SAMPLE=${OUT//_gather_gtdbrs214_reps.csv}
sourmash tax metagenome -g $OUT -t gtdb-rs214.lineages.csv.gz -F kreport -o $SAMPLE
done
GTDB is great for bacteria, but doesn’t have any genomes from viral or eukaryotic microbes. To get information about these classes or organisms, you can use the gather
and tax
functions above, but swap out the GTDB database for genbank references for fungi, viruses, and protozoa (these are also located in /data/reference_db/sourmash
).
Interpreting results
The output file produced by sourmash tax metagenome
has columns defined as:
intersect_bp
: the approximate number of base pairs in common between the query and the matchf_orig_query
: fraction of original query; the fraction of the original query that is contained within the match. This is the number you’d get by runningsourmash search --containment <metagenome> <match>
.f_match
: fraction of match; the fraction of the match that is contained within the query. This is how much of the match is in the remaining query metagenome, after all of the previous matches have been removed.f_unique_to_query
: fraction unique to query; the fraction of the query that uniquely overlaps with the match. This is the fraction of the database match that is unique with respect to the original query. It will always decrease as you get more matches. The sum off_unique_to_query
across all rows is what is reported in by gather as the fraction of query k-mers hit by the recovered matches (unweighted) and should never be greater than 1!f_unique_weighted
: fraction unique to query weighted by abundance; fraction unique to query, weighted by abundance in the queryn_unique_weighted_found
- the summed abundances of the found hashes at each stepsum_weighted_found
- the running total ofn_unique_weighted_found
, cumulative at this steptotal_weighted_hashes
- the sum total of all hash abundances
Collating Sourmash results into a single table
The code above produces a tabular output file for each sample, which is great if you only have a single sample. However, in most cases we have multiple samples and would like to compare microbial composition across the different samples or groups of samples. To do this, we need to get all of our data into a single dataframe with taxa as rows and samples as columns, where the values are effectively kmer counts. The following code is in R, not the terminal.
library(tidyverse)
#grab all the sourmash outputs that you want to merge together
files <- list.files(pattern = "*protozoa_k31.csv")
# reading multiple files simultaneously with the readr package now collates them together...how awesome!
sourmash_taxonomy_results <- read_delim(files)
# reformat
gather_table <- sourmash_taxonomy_results %>%
select(query_name, name, n_unique_weighted_found) %>% # select only the columns that have information we need
pivot_wider(id_cols = name, names_from = query_name, values_from = n_unique_weighted_found) %>% # transform to wide format
replace(is.na(.), 0) %>% # replace any NAs with 0
column_to_rownames("name") # move the metagenome sample name to a rowname
# if any samples returned no hits with sourmash (e.g. negative controls) you can add these data in as column of zeros as follows:
gather_table <- add_column(gather_table, extract_neg_control_S8 = 0, library_neg_control_S9 = 0, filter_neg_control_S1_R1 = 0)
# write collated data out to a csv file
write.csv(gather_table, "protozoa.csv", row.names=TRUE)