Welcome to the insect R package, a tool for assigning taxon IDs to amplicon libraries using informatic sequence classification trees. The insect learning algorithm takes a set of reference sequences (obtained from GenBank and/or other sources) to build a classification tree, which is then used to assign taxonomic IDs to a set of query sequences (e.g. those generated from an NGS platform such as Illumina MiSeq). The learning and classification functions are best suited to computing environments with multiple processors and access to large amounts of memory; however, most modest-sized datasets can be processed on standard personal computers if time is available. While not a prerequisite, insect is designed to be used in conjunction with the ape package (Paradis et al., 2004; Paradis, 2012), which features memory-efficient binary formats for DNA and amino acid sequences (“DNAbin” and “AAbin” object types), and the dada2 package (Callahan et al., 2016) which contains essential functions for de-noising high-throughput sequencing data and other important pre-processing steps.
The insect package can be used to analyze environmental DNA (eDNA) meta-barcode libraries as well as single-source NGS/Sanger amplicon sequences.
The most time-consuming and memory-intensive stage of the insect work-flow generally involves training the classifier. For example, the COI classifier used in this tutorial, which was built from the MIDORI UNIQUE reference trainingset consisting of nearly a million COI barcode sequences, took around three days to run on a 24 x multithread. The insect classification trees are amplicon specific, so a unique tree is generally required for each primer set. However, trees are already available for some of the more commonly used barcoding primers here:
Marker | Target | Primers | Source | Version | Date | Download |
---|---|---|---|---|---|---|
12S | Fish | MiFishUF/MiFishUR (Miya et al 2015) | GenBank | 1 | 20181111 | RDS (9MB) |
16S | Marine crustaceans | Crust16S_F/Crust16S_R (Berry et al 2017) | GenBank | 4 | 20180626 | RDS (7.1 MB) |
16S | Marine fish | Fish16sF/16s2R (Berry et al 2017; Deagle et al 2007) | GenBank | 4 | 20180627 | RDS (6.8MB) |
18S | Marine eukaryotes | 18S_1F/18S_400R (Pochon et al 2017) | SILVA_132_LSUParc, GenBank | 5 | 20180709 | RDS (11.8 MB) |
18S | Marine eukaryotes | 18S_V4F/18S_V4R (Stat et al 2017) | GenBank | 4 | 20180525 | RDS (11.5 MB) |
23S | Algae | p23SrV_f1/p23SrV_r1 (Sherwood & Presting 2007) | SILVA_132_LSUParc | 1 | 20180715 | RDS (26.9MB) |
COI | Metazoans | mlCOIintF/jgHCO2198 (Leray et al 2013) | Midori, GenBank | 5 | 20181124 | RDS (140 MB) |
ITS2 | Cnidarians and sponges | scl58SF/scl28SR (Wilkinson et al in prep) | GenBank | 5 | 20180920 | RDS (6.6 MB) |
The insect package also includes functions for downloading, trimming and filtering reference datasets (including a “virtual PCR” tool and an annotation quality filter), building a hierarchical taxonomy database, and training the classifier; however, these methods are beyond the scope of this introductory tutorial. New classification trees and updates are frequently added to the collection, so please feel free to suggest a barcoding primer set with which to train a classifier and we will endeavor to add it to the list.
To learn a classification tree, a reference sequence dataset is first
obtained from GenBank and/or other sources from which barcode sequences
with accurate taxon IDs are available. These sequences are filtered to
remove any with obvious taxonomic labeling issues, and trimmed to retain
only the region of interest using the virtualPCR
function
(sequences that do not span the entire amplicon region are removed). the
learn
function then splits the training sequences into two
subsets, maximizing the dissimilarity between the two groups. A profile
hidden Markov model is then derived for each group (see Durbin et
al. (1998) for a detailed description of these models). The partitioning
and model training procedure then continues recursively, splitting the
reference sequences into smaller and smaller subsets while adding new
nodes and models to the classification tree.
Once the classifier has been trained, query sequences obtained from the specified primer set can be assigned taxonomic IDs along with probabilistic confidence values. The classification algorithm works as follows: starting from the root node of the classification tree, the likelihood of the query sequence (the full probability of the sequence given a particular model) is computed for each of the models at the two immediate child nodes using the forward algorithm (see Durbin et al. (1998)). The competing likelihood values are then compared by computing their Akaike weights (see Johnson and Omland, 2004). If one model is overwhelmingly more likely to have produced the sequence than the other, that child node is selected and the classification is updated to reflect the lowest common taxonomic rank of the training sequences belonging to the node.
This procedure is repeated recursively, descending down the tree
until either an inconclusive result is returned from a model comparison
test (i.e. the Akaike weight is lower than a pre-defined threshold, e.g.
0.9), or a terminal leaf node is reached, at which point a species-level
ID is generally returned. The classify
function outputs the
taxon name, rank and ID number (i.e. NCBI taxon ID, WoRMS aphia ID, or
other identifier depending on the taxonomy database used in the training
step), along with the final Akaike weight value, which can be
interpreted as a confidence score (between 0 and 1, with values close to
1 indicating high confidence). Note that the default behavior is for the
Akaike weight to ‘decay’ as it moves down the tree, by computing the
cumulative product of all preceding Akaike weight values. This is
perhaps an overly conservative approach, but it minimizes the chance of
mis-classifying or over-classifying the query sequences.
This tutorial demonstrates the insect work-flow using an example dataset of COI sequences derived from autonomous reef monitoring structures (ARMS) in Ofu, American Samoa, amplified using the metazoan COI barcoding primers mlCOIintF and jgHCO2198 (GGWACWGGWTGAACWGTWTAYCCYCC and TAIACYTCIGGRTGICCRAARAAYCA, respectively; Leray et al. (2013)).
The dataset was first de-noised and tabulated using the DADA2 pipeline, to produce a table of chimera-free amplicon sequence variants (ASVs). The most abundant 16 ASVs are included in the insect package as an example dataset (see below).
If using other tools for de-noising, trimming, merging, etc, the data
can be read in using either the readFASTA
or
readFASTQ
functions to produce a “DNAbin” object compatible
with the insect classifier.
First, make sure that the devtools, ape and seqinr packages are installed and up to date. Then install and load the latest development version of the insect package from GitHub as follows:
devtools::install_github("shaunpwilkinson/insect")
library(insect)
The COI classifier was trained on the MIDORI UNIQUE 20180221 dataset, and uses information from both the DNA read and the translated amino acid sequence (EBI5 invertebrate mitochondrial translation table) to assign taxonomy to query sequences. You can download the classifier from the link in the table above; the file is quite large (~ 140 MB), so make sure there is a good internet connection available.
Alternatively, the classifier can be downloaded to the current working directory as follows:
download.file("https://www.dropbox.com/s/dvnrhnfmo727774/classifier.rds?dl=1",
destfile = "classifier.rds", mode = "wb")
To assign taxon IDs to the table of amplicon sequence variants (ASVs)
produced by DADA2, we first extract the sequences stored as column names
in the seqtab.nochim
matrix (see the DADA2 tutorial
for instructions on how to produce this table).
Once the sequences are stored in memory as a “DNAbin” object, we can optionally nullify the column names in the table to avoid flooding the console with long sequence strings:
## read in the example seqtab.nochim ASV table
data(samoa)
## get sequences from table column names
x <- char2dna(colnames(samoa))
## name the sequences sequentially
names(x) <- paste0("ASV", seq_along(x))
## optionally remove column names that can flood the console when printed
colnames(samoa) <- NULL
The next step is to load the classifier. Note that this ‘insect’ class object is just a large dendrogram with additional attributes for classifying sequences including profile HMMs and taxonomic information:
classifier <- readRDS("classifier.rds")
classifier
names(attributes(classifier))
#> 'dendrogram' with 2 branches and 113833 members total, at height 70
#> [1] "k" "height" "midpoint" "members" "class"
#> [6] "taxonomy" "clade" "frame" "remainder" "sequences"
#> [11] "minscore" "seqlengths" "pointers" "key" "kmers"
#> [16] "minlength" "maxlength" "model" "trainingset" "alternative"
#> [21] "nunique" "ntotal" "taxID" "numcode"
The final step is to assign taxon IDs and confidence values to each
ASV. The classify
function may take a minute or so to
process these sequences, since it uses a computationally intensive
dynamic programming algorithm to find the likelihood values of each
sequence given the models at each node of the tree. The exception is
when ping
is not FALSE
and there is an exact
match or high similarity
between the query sequence and at least one of the sequences in the
training dataset (e.g. >= 99% if ping = 0.99
). In this
case the function simply returns the common ancestor of the matching
sequences without a confidence score. To stay on the safe side, we will
keep ping = 1
(i.e. only sequences with 100% identity are
considered matches).
The classify
function can also be run in parallel by
setting the cores
argument to 2 or more depending on the
number available (setting cores = "autodetect"
will
automatically run on one less than the number available). If choosing
this option for the large COI classifier, please ensure that there is at
least 2 GB of available RAM per processor. Classification times can
vary, and depend on several factors including the number of unique
sequences in the dataset, the size of the classifier, the length of the
input sequences, the processing speed, number of processors used, etc.
The average time to ID COI sequences using the classifier above is
approximately 3 - 4 seconds per unique sequence per processor. For
example, a dataset containing 1000 unique sequences would take around an
hour to process on a single processor, half an hour on two, and so
on.
In the following code we set tabulize = FALSE
(the
default setting) since the DADA2 output table already contains the
sequence counts and we are only classifying the unique sequence
variants. In the case where a list of sequences containing replicates is
to be processed, users can prefix the sequence names with their
respective sample names, delimited with an underscore
(e.g. “sample001_sequence001”) and set tabulize = TRUE
. In
this case, the classify
function will automatically count
and dereplicate the sequences, producing an output table with one column
of sequence counts for each sample.
longDF <- classify(x, classifier, threshold = 0.8)
For DADA2 users, the ASV abundance table can now be transposed and appended to the table of taxonomic information if required:
longDF <- cbind(longDF, t(samoa))
representative | taxID | taxon | rank | score | kingdom | phylum | class | order | family | genus | species | OFU04A-100_S64 | OFU04A-1_S13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ASV1 | 2806 | Florideophyceae | class | 0.9981 | Florideophyceae | 156 | 5600 | ||||||
ASV2 | 6379 | Chaetopterus | genus | 1.0000 | Metazoa | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus | 4496 | 0 | |
ASV3 | 2806 | Florideophyceae | class | 0.9989 | Florideophyceae | 28 | 3267 | ||||||
ASV4 | 2172821 | Multicrustacea | superclass | 1.0000 | Metazoa | Arthropoda | 3203 | 0 | |||||
ASV5 | 131567 | cellular organisms | no rank | 0.9952 | 3024 | 0 | |||||||
ASV6 | 2806 | Florideophyceae | class | 0.9981 | Florideophyceae | 20 | 2409 | ||||||
ASV7 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | 2379 | 0 | ||
ASV8 | 116571 | Podoplea | superorder | 0.9995 | Metazoa | Arthropoda | Hexanauplia | 2156 | 104 | ||||
ASV9 | 2806 | Florideophyceae | class | 0.9482 | Florideophyceae | 0 | 2149 | ||||||
ASV10 | 1 | root | no rank | NA | 2091 | 0 | |||||||
ASV11 | 115834 | Hesionidae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Hesionidae | 1905 | 6 | ||
ASV12 | 1443949 | Corallinophycidae | subclass | 0.9910 | Florideophyceae | 87 | 1757 | ||||||
ASV13 | 33213 | Bilateria | no rank | 1.0000 | Metazoa | 27 | 1800 | ||||||
ASV14 | 131567 | cellular organisms | no rank | 0.9952 | 1729 | 9 | |||||||
ASV15 | 2806 | Florideophyceae | class | 0.9993 | Florideophyceae | 0 | 1725 | ||||||
ASV16 | 39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | 1481 | 0 |
Any sequences that return exact hits with at least one training
sequence (or near matches if ping = 0.99
or similar) are
assigned a score of NA
. For hybrid DNA/AA classifiers such
as the COI version used above, non-translatable sequences are also
automatically assigned a score of NA
, as is the case for
ASV10 in the table above.
For a more succinct output we can aggregate the table to only include one row for each unique taxon as follows:
taxa <- aggregate(longDF[3:12], longDF["taxID"], head, 1)
counts <- aggregate(longDF[13:ncol(longDF)], longDF["taxID"], sum)
shortDF <- merge(taxa, counts, by = "taxID")
taxID | taxon | rank | score | kingdom | phylum | class | order | family | genus | species | OFU04A-100_S64 | OFU04A-1_S13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | root | no rank | NA | 2091 | 0 | |||||||
2806 | Florideophyceae | class | 0.9981 | Florideophyceae | 204 | 15150 | ||||||
6379 | Chaetopterus | genus | 1.0000 | Metazoa | Annelida | Polychaeta | Spionida | Chaetopteridae | Chaetopterus | 4496 | 0 | |
33213 | Bilateria | no rank | 1.0000 | Metazoa | 27 | 1800 | ||||||
39820 | Nereididae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Nereididae | 3860 | 0 | ||
115834 | Hesionidae | family | 1.0000 | Metazoa | Annelida | Polychaeta | Phyllodocida | Hesionidae | 1905 | 6 | ||
116571 | Podoplea | superorder | 0.9995 | Metazoa | Arthropoda | Hexanauplia | 2156 | 104 | ||||
131567 | cellular organisms | no rank | 0.9952 | 4753 | 9 | |||||||
1443949 | Corallinophycidae | subclass | 0.9910 | Florideophyceae | 87 | 1757 | ||||||
2172821 | Multicrustacea | superclass | 1.0000 | Metazoa | Arthropoda | 3203 | 0 |
As shown in the above example, many of the sequences return fairly
uninformative taxon IDs (e.g. ‘cellular organisms’). This is a fairly
typical feature of eDNA datasets that can contain a large number of
novel sequences that are dissimilar to anything recorded in the
reference database(s). Note that query sequences with high similarity to
reference sequences can also occasionally produce uninformative
classifications due to inconclusive model comparison tests at top-level
nodes. This may be circumvented by reducing the threshold
parameter or setting decay = FALSE
; however, users are
advised against the excessive relaxation of these parameters since it
may increase the chance of returning erroneous classifications (these
tend to be very rare when using the default values). Further testing and
optimization may help to address some of these best-practice
considerations, and will be a focus of future research.
This introduction to the insect package has outlined the steps involved in taxonomic identification of amplicon sequence variants (ASVs) using a pre-built classification tree. The next tutorial will deal with downloading and curating a primer-specific local sequence database and using it to build a classification tree.
Please feel free to email the author directly with any feedback or questions at shaunpwilkinson AT gmail DOT com. Bug reports can also be directed to the GitHub issues page.
This software was developed with funding from a Rutherford Foundation Postdoctoral Research Fellowship from the Royal Society of New Zealand. Unpublished COI data care of Molly Timmers (NOAA).
Callahan,B.J. et al. (2016) DADA2: High-resolution sample inference from illumina amplicon data. Nature Methods, 13, 581–583.
Durbin,R. et al. (1998) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge.
Johnson,J.B. and Omland,K.S. (2004) Model selection in ecology and evolution. Trends in Ecology and Evolution, 19, 101–108.
Leray,M. et al. (2013) A new versatile primer set targeting a short fragment of the mitochondrial COI region for metabarcoding metazoan diversity: application for characterizing coral reef fish gut contents. Frontiers in Zoology, 10, 34.
Paradis,E. (2012) Analysis of Phylogenetics and Evolution with R. Second Edition. Springer, New York.
Paradis,E. et al. (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289–290.