Sourmash: alignment-free analysis of metagenomes

Sourmash: alignment-free analysis of metagenomes

Before starting

The following protocol outlines the use of Sourmash for interpreting microbial whole genome sequence (WGS) or metagenomic data. Sourmash computes a fingerprint or ‘sketch’ from your data using minHash, and enables comparison of sketches from isolates (to understand strain relatedness, for example). Similarly, a sketch can be compared against a large databases of sketches, stored as a .zip database for rapid searching, from Genbank or the Genome Taxonomy Database (GTDB) in order to help assign identity to a sample. These databases of microbial genomes from all of Genbank are available directly on our server for convenience.

If you want to learn more about how Sourmash works, you could start with the original Sourmash paper, and a recent follow-up paper. Also check out C. Titus Brown’s blog posts on the topic here and here. Also take a look at Adam Phillippy’s original Mash paper.

Note: Sourmash is a kmer-based method, which means you have to choose a kmer size. In general smaller kmers (e.g. k=31) allow for higher sensitivity but may result in some false positives, while larger kmer sizes (e.g. k=51) will tend to give higher specificity at the expense of false negatives.

Collectively, the GenBank references on our server contain over 1.2 million microbial genomes (including viral and fungal) from NBCI’s Genbank, including:

  • 1,148,011 bacterial genomes
  • 47,952 viral genomes
  • 8,750 archaeal genomes
  • 10,286 fungal genomes
  • 1193 protist genomes

The Genome Taxonomy Database (GTDB) files include data for 402,709 prokaryotic genomes organized into 85,205 species clusters. This is most comprehensive and well-curated collection of bacterial genomes, and therefore is often the database of choice when using Sourmash gather

Prepare sample ‘sketches’

Here, we’ll use sourmash’s sketch function to prepare a sketch of each fastq file in our directory. The --scaled option applies a 1000:1 compression ratio, which retains the ability to detect regions of similarity in the 10kb range.

# for a single sample of single end data
sourmash sketch dna \
SAMPLE.fastq.gz \
-p k=21,k=31,k=51,scaled=1000,abund \
--merge SAMPLE \
-o SAMPLE.sig.gz

# for a single sample of paired end data
sourmash sketch dna \
SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz \
-p k=21,k=31,k=51,scaled=1000,abund \
--merge SAMPLE \
-o SAMPLE.sig.gz

# looping for single end data for multiple samples
for FASTQ in *.fastq.gz
    sourmash sketch dna $FASTQ -p k=21,k=31,k=51,scaled=1000,abund --merge $SAMPLE -o ${SAMPLE}.sig.gz
		echo "done sketching from $SAMPLE!"

# looping for paired end data for multiple samples
for READ1 in *R1_001.fastq.gz
    sourmash sketch dna $READ1 $READ2 -p k=21,k=31,k=51,scaled=1000,abund --merge $SAMPLE -o ${SAMPLE}.sig.gz
		echo "done sketching from $SAMPLE!"

Compare sketches to each other

The Sourmash compare command computes the jaccard index between two or more sketches generated above.

sourmash compare *.sig.gz -k 31 -o cmp

Visualize sample relatedness

Sourmash plot prepares a graphic showing the distance between samples

sourmash plot --pdf --labels cmp

Search a signature against a database of signatures

Sourmash search is useful if you want to identify an organism of unknown identity using a large database of signatures from known microorganisms. For example, perhaps you have sequence data for a cultured isolate and you’re interested in figuring out what this cool new bug might be. For this, we take advantage of the pre-built Genbank references described above.

# for one or many signatures

sourmash search \
SAMPLE.sig.gz \ #can replace with *.sig.gz to search all samples in a directory at once
/data/reference_db/sourmash_refs/ \
-n 20 #optional. shows top 20 hits

Profile a metagenome sample against a large reference database

Sourmash gather profiles a metagenome signature using a database of microbial signatures. However, this function is not only useful for profiling metagenomic data, it can also be used to address questions about whether a preparation used for WGS was contaminated with another organism. Perhaps Sanger sequencing of 16S rDNA gave murky results, or restreaking an isolate showed more than one distinct colony morphology. In either case, the Sourmash gather function allows you to ask exactly which organisms might be present in a sample based only on the sequence data. Like the search function above, gather needs a reference database to search against.

for SIG in *sig.gz
    sourmash gather $SIG /data/reference_db/sourmash/ -o ${SAMPLE}_gather_gtdbrs214_reps.csv
		echo "done running gather on the sketch from $SAMPLE!"

You can repeat the code above for k=51, and/or for other reference databases as well.

Now that you have profiled the sample, you can summarize the taxonomic breakdown of the sample using a simple ‘kraken-like’ report format. By default, the reference databases used in the gather step above have no information about taxonomy…they’re just genomes. So, we have to use the ‘lineages.csv’ file provided by the Sourmash developers to get this taxonomic info.

# code below is for k=31. Depending on your analysis goals, you may want to repeat with k=51 if you want high specificity (at the expense of lower sensitivity)

for OUT in *_gather_gtdbrs214_reps.csv
    sourmash tax metagenome -g $OUT -t gtdb-rs214.lineages.csv.gz -F kreport -o $SAMPLE

GTDB is great for bacteria, but doesn’t have any genomes from viral or eukaryotic microbes. To get information about these classes or organisms, you can use the gather and tax functions above, but swap out the GTDB database for genbank references for fungi, viruses, and protozoa (these are also located in /data/reference_db/sourmash).

Interpreting results

The output file produced by sourmash tax metagenome has columns defined as:

  • intersect_bp: the approximate number of base pairs in common between the query and the match
  • f_orig_query: fraction of original query; the fraction of the original query that is contained within the match. This is the number you’d get by running sourmash search --containment <metagenome> <match>.
  • f_match: fraction of match; the fraction of the match that is contained within the query. This is how much of the match is in the remaining query metagenome, after all of the previous matches have been removed.
  • f_unique_to_query: fraction unique to query; the fraction of the query that uniquely overlaps with the match. This is the fraction of the database match that is unique with respect to the original query. It will always decrease as you get more matches. The sum of f_unique_to_query across all rows is what is reported in by gather as the fraction of query k-mers hit by the recovered matches (unweighted) and should never be greater than 1!
  • f_unique_weighted: fraction unique to query weighted by abundance; fraction unique to query, weighted by abundance in the query
  • n_unique_weighted_found - the summed abundances of the found hashes at each step
  • sum_weighted_found - the running total of n_unique_weighted_found, cumulative at this step
  • total_weighted_hashes - the sum total of all hash abundances

Collating Sourmash results into a single table

The code above produces a tabular output file for each sample, which is great if you only have a single sample. However, in most cases we have multiple samples and would like to compare microbial composition across the different samples or groups of samples. To do this, we need to get all of our data into a single dataframe with taxa as rows and samples as columns, where the values are effectively kmer counts. The following code is in R, not the terminal.


#grab all the sourmash outputs that you want to merge together
files <- list.files(pattern = "*protozoa_k31.csv")
# reading multiple files simultaneously with the readr package now collates them awesome!
sourmash_taxonomy_results <- read_delim(files)

# reformat 
gather_table <- sourmash_taxonomy_results %>%   
	select(query_name, name, n_unique_weighted_found) %>% # select only the columns that have information we need  
	pivot_wider(id_cols = name, names_from = query_name, values_from = n_unique_weighted_found) %>% # transform to wide format  
	replace(, 0) %>% # replace any NAs with 0   
	column_to_rownames("name") # move the metagenome sample name to a rowname

# if any samples returned no hits with sourmash (e.g. negative controls) you can add these data in as column of zeros as follows:
gather_table <- add_column(gather_table, extract_neg_control_S8 = 0, library_neg_control_S9 = 0, filter_neg_control_S1_R1 = 0)

# write collated data out to a csv file
write.csv(gather_table, "protozoa.csv", row.names=TRUE)