Title: | Informatic Sequence Classification Trees |
---|---|
Description: | Provides tools for probabilistic taxon assignment with informatic sequence classification trees. See Wilkinson et al (2018) <doi:10.7287/peerj.preprints.26812v1>. |
Authors: | Shaun Wilkinson [aut, cre] |
Maintainer: | Shaun Wilkinson <[email protected]> |
License: | GPL-3 |
Version: | 1.4.0.9000 |
Built: | 2024-11-04 03:18:52 UTC |
Source: | https://github.com/shaunpwilkinson/insect |
This function takes a reference sequence database and allocates each sequence to either a query set (a.k.a. test set) or a training set, in order to cross validate a supervised taxon classifier. The method is based on that of Edgar (2018), but uses recursive divisive clustering and retains all sequences rather than discarding those that violate the top-hit identity constraint.
allocateCVI(x, threshold = 0.9, allocate = "max", ...)
allocateCVI(x, threshold = 0.9, allocate = "max", ...)
x |
a set of reference sequences. Can be a "DNAbin" object or a named vector of upper-case DNA character strings. |
threshold |
numeric between 0 and 1 giving the identity threshold for sequence allocation. |
allocate |
character giving the method to use to allocate eligible sequences to the query set. Options are "max" (default) which chooses the largest node from each pair in order to maximize the size of the query set, or "sample", which randomly chooses one node from each eligible pair. |
... |
further arguments to pass to "kmeans" |
a logical vector the same length as the input object, indicating which sequences should be allocated to the query set
Shaun Wilkinson
Edgar RC (2018) Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ 6:e4652. DOI 10.7717/peerj.4652
data(whales) allocateCVI(whales)
data(whales) allocateCVI(whales)
"classify"
assigns taxon IDs to DNA sequences using an
informatic sequence classification tree.
classify( x, tree, threshold = 0.8, decay = FALSE, ping = 0.98, mincount = 5, offset = 0, ranks = c("kingdom", "phylum", "class", "order", "family", "genus", "species"), species = "ping100", tabulize = FALSE, metadata = FALSE, cores = 1 )
classify( x, tree, threshold = 0.8, decay = FALSE, ping = 0.98, mincount = 5, offset = 0, ranks = c("kingdom", "phylum", "class", "order", "family", "genus", "species"), species = "ping100", tabulize = FALSE, metadata = FALSE, cores = 1 )
x |
a sequence or set of sequences. Can be a "DNAbin" or "AAbin" object or a named vector of upper-case DNA character strings. |
tree |
an object of class |
threshold |
numeric between 0 and 1 giving the minimum Akaike weight for the recursive classification procedure to continue toward the leaves of the tree. Defaults to 0.8. |
decay |
logical indicating whether the decision to terminate the classification process should be made based on decaying Akaike weights (at each node, the Akaike weight of the selected model is multiplied by the Akaike weight of the selected model at the parent node) or whether each Akaike weight should be calculated independently of that of the parent node. Defaults to FALSE (the latter). |
ping |
logical or numeric (between 0 and 1) indicating whether
a nearest neighbor search should
be carried out, and if so,
what the minimum distance to the nearest neighbor
should be for the the recursive classification algorithm to be skipped.
If TRUE and the query sequence is identical to
at least one of the training sequences used to learn the tree,
the common ancestor of the matching training sequences is returned
with an score of NA.
If a value between 0 and 1 is provided, the common ancestor of the
training sequences with similarity greater than or equal to 'ping'
is returned, again with a score of NA.
If |
mincount |
integer, the minimum number of training sequences belonging to a selected child node for the classification to progress. Defaults to 5. |
offset |
log-odds score offset parameter governing whether the minimum score is met at each node. Defaults to 0. Values above 0 increase precision (fewer type I errors), values below 0 increase recall (fewer type II errors). |
ranks |
character vector giving the taxonomic ranks to be
included in the output table. Must be a valid rank from the
taxonomy database attributed to the classification tree
( |
species |
character string, indicating whether to include all
species-level classifications in the output (species = 'all'),
only those generated by exact matching ("ping100"; the default setting),
only those generated by exact matching or near-neighbor searching
(species = 'ping'). If |
tabulize |
logical indicating whether sequence counts should be attached to the output table. If TRUE, the output table will have one row for each unique sequence, and columns will include counts for each sample (where samples names precede sequence identifiers in the input object; see details below). |
metadata |
logical indicating whether to include additional columns containing the paths, individual node scores and reasons for termination. Defaults to FALSE. Included for advanced use and debugging. |
cores |
integer giving the number of processors for multithreading (defaults to 1).
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running |
This function requires a pre-computed classification tree
of class "insect", which is a dendrogram object with additional attributes
(see learn
for details).
Query sequences obtained from the same primer set used to construct
the tree are classified to produce taxonomic
IDs with an associated degree of confidence.
The classification algorithm works as follows:
starting from the root node of the tree,
the log-likelihood of the query sequence
(the log-probability of the sequence given a particular model)
is computed for each of the models occupying the two child nodes using the
forward algorithm (see Durbin et al. (1998)).
The competing likelihood values are then compared by computing
their Akaike weights (Johnson and Omland, 2004).
If one model is overwhelmingly more likely to have produced
the sequence than the other,
that child node is chosen and the classification is updated
to reflect the taxonomic ID stored at the node.
This classification procedure is repeated, continuing down the
tree until either an inconclusive result is returned by a
model comparison test (i.e. the Akaike weight is lower than
a pre-defined threshold, e.g. 0.9),
or a terminal leaf node is reached,
at which point a species-level classification is generally returned.
The function outputs a table with one row for each input sequence
Output table fields include "name" (the unique sequence identifier),
"taxID" (the taxonomic identification number from the taxonomy database),
"taxon" (the name of the taxon),
"rank" (the rank of the taxon, e.g. species, genus family, etc),
and "score" (the Akaike weight from the model selection procedure).
Note that the default behavior is for the Akaike weight to ‘decay’
as it moves down the tree, by computing the cumulative product of
all preceding Akaike weight values.
This minimizes the chance of type I taxon ID errors (overclassifications and misclassifications).
The output table also includes the higher taxonomic ranks specified in the
ranks
argument, and if metadata = TRUE
additional columns
are included called "path"
(the path of the sequence through the classification tree), "scores" (the
scores at each node through the tree, UTF-8-encoded),
and "reason" outlining why the recursive classification procedure was
terminated:
0 reached leaf node
1 failed to meet minimum score threshold at inner node
2 failed to meet minimum score of training sequences at inner node
3 sequence length shorter than minimum length of training sequences at inner node
4 sequence length exceeded maximum length of training sequences at inner node
5 nearest neighbor in training set does not belong to selected node (obsolete)
6 node is supported by too few sequences
7 reserved
8 sequence could not be translated (amino acids only)
9 translated sequence contains stop codon(s) (amino acids only)
Additional columns detailing the nearest neighbor search include "NNtaxID", "NNtaxon", "NNrank", and "NNdistance".
a data.frame.
Shaun Wilkinson
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Johnson JB, Omland KS (2004) Model selection in ecology and evolution. Trends in Ecology and Evolution. 19, 101-108.
data(whales) data(whale_taxonomy) ## use all sequences except first one to train the classifier set.seed(999) tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2) ## find predicted lineage for first sequence classify(whales[1], tree) ## compare with actual lineage taxID <- as.integer(gsub(".+\\|", "", names(whales)[1])) get_lineage(taxID, whale_taxonomy)
data(whales) data(whale_taxonomy) ## use all sequences except first one to train the classifier set.seed(999) tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2) ## find predicted lineage for first sequence classify(whales[1], tree) ## compare with actual lineage taxID <- as.integer(gsub(".+\\|", "", names(whales)[1])) get_lineage(taxID, whale_taxonomy)
These functions convert DNA and amino acid sequences in "DNAbin" or "AAbin" format to concatenated character strings, and vice versa.
dna2char(x) aa2char(x) char2dna(z, simplify = FALSE) char2aa(z, simplify = FALSE)
dna2char(x) aa2char(x) char2dna(z, simplify = FALSE) char2aa(z, simplify = FALSE)
x |
a "DNAbin" or "AAbin" object. |
z |
a vector of concatenated strings representing DNA or amino acid sequences in upper case. |
simplify |
logical indicating whether length-one "DNAbin" or "AAbin" objects should be simplified to vectors. Defaults to FALSE. |
These functions are used to convert concatenated character strings
(e.g. "TAACGC") to binary format and vice versa.
To convert DNAbin and AAbin objects to non-concatenated
character objects (e.g. c("T", "A", "A", "C", "G", "C")
)
refer to the ape
package functions
as.character.DNAbin
and
as.character.AAbin
.
Likewise the ape
package functions
as.DNAbin
and as.AAbin
are used to convert non-concatenated character
objects to binary format.
dna2char
and aa2char
return vectors of upper case
character strings.
char2dna
and char2aa
return "DNAbin" and "AAbin" objects,
respectively. These will be lists unless the input object
has length one and simplify = TRUE, in which case the returned object
will be a vector.
Shaun Wilkinson
Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.
Paradis E (2007) A bit-level coding scheme for nucleotides. http://ape-package.ird.fr/misc/BitLevelCodingScheme.html.
Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.
char2dna("TAACGC") char2aa("VGAHAGEY") dna2char(char2dna("TAACGC")) aa2char(char2aa("VGAHAGEY")) char2dna(list(seq1 = "TAACGC", seq2 = "ATTGCG")) char2aa(list(seq1 = "VGAHAGEY", seq2 = "VNVDEV"))
char2dna("TAACGC") char2aa("VGAHAGEY") dna2char(char2dna("TAACGC")) aa2char(char2aa("VGAHAGEY")) char2dna(list(seq1 = "TAACGC", seq2 = "ATTGCG")) char2aa(list(seq1 = "VGAHAGEY", seq2 = "VNVDEV"))
This function is used to demultiplex FASTQ files containing sequence reads with index and primer sequences still attached. The function trims the tags and primers, and exports two FASTQ files for each forward-reverse index combination.
demultiplex( R1, R2 = NULL, tags, up, down, destdir = "demux", adapter1 = NULL, adapter2 = NULL )
demultiplex( R1, R2 = NULL, tags, up, down, destdir = "demux", adapter1 = NULL, adapter2 = NULL )
R1 |
character string giving the path to the first FASTQ file |
R2 |
character string giving the path to the second FASTQ file. Set to NULL for single-end reads. |
tags |
named character vector specifying the unique forward:reverse tag combinations used in the run. Each tag combination should be entered as an upper case character string delimited by a colon (e.g. "ATCGACAC:ATGCACTG") and named according to the unique sample ID. |
up , down
|
upper case character strings giving the forward and reverse primer sequences. |
destdir |
character string giving the path to the directory where the new FASTQ files should be written. |
adapter1 |
the forward flowcell adapter sequence to check and trim (set NULL to ignore). For standard Illumina MiSeq forward adapter set to "AATGATACGGCGACCACCGAGATCTACAC" (paired end sequencing only). |
adapter2 |
the reverse flowcell adapter sequence to check and trim (set NULL to ignore). For standard Illumina MiSeq reverse adapter set to "CAAGCAGAAGACGGCATACGAGAT" (single or paired end sequencing). |
NULL (invisibly)
Shaun Wilkinson
This function is used to convert an oligonucleotide sequence into a regular expression that can be used to query a sequence dataset. This is generally used for finding and removing primer, adapter and/or index sequences.
disambiguate(z)
disambiguate(z)
z |
a concatenated string representing a DNA oligonucleotide sequence, possibly with IUPAC ambiguity codes (all in upper case). |
a regular expression.
Shaun Wilkinson
disambiguate("GGWACWGGWTGAACWGTWTAYCCYCC")
disambiguate("GGWACWGGWTGAACWGTWTAYCCYCC")
These functions are used to compress and decompress profile hidden Markov models for DNA to improve memory efficiency.
encodePHMM(x) decodePHMM(z)
encodePHMM(x) decodePHMM(z)
x |
an object of class "PHMM" |
z |
a raw vector in the encodePHMM schema. |
Profile HMMs used in tree-based classification usually include many parameters, and hence large trees with many PHMMs can occupy a lot of memory. Hence a basic encoding system was devised to store the emission and transition probabilities in raw-byte format to three (nearly four) decimal places. This does not seem to significantly affect the accuracy of likelihood scoring, and has a moderate impact on classification speed, but can reduce the memory allocation requirements for large trees by up to 95 percent.
encodePHMM returns a raw vector. decodePHMM
returns
an object of class "PHMM" (see Durbin et al (1998) and
the aphid
package for more details
on profile hidden Markov models).
Shaun Wilkinson
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
## generate a simple classification tree with two child nodes data(whales) data(whale_taxonomy) tree <- learn(whales, db = whale_taxonomy, recursive = FALSE) ## extract the omnibus profile HMM from the root node PHMM0 <- decodePHMM(attr(tree, "model")) ## extract the profile HMM from the first child node PHMM1 <- decodePHMM(attr(tree[[1]], "model"))
## generate a simple classification tree with two child nodes data(whales) data(whale_taxonomy) tree <- learn(whales, db = whale_taxonomy, recursive = FALSE) ## extract the omnibus profile HMM from the root node PHMM0 <- decodePHMM(attr(tree, "model")) ## extract the profile HMM from the first child node PHMM1 <- decodePHMM(attr(tree[[1]], "model"))
This function is used to grow an existing classification tree, typically using more relaxed parameter settings than those used when the tree was created, or if fine-scale control over the tree-learning operation is required.
expand( tree, clades = "0", refine = "Viterbi", iterations = 50, nstart = 20, minK = 2, maxK = 2, minscore = 0.9, probs = 0.5, retry = TRUE, resize = TRUE, maxsize = 1000, recursive = TRUE, cores = 1, quiet = FALSE, verbose = FALSE, ... )
expand( tree, clades = "0", refine = "Viterbi", iterations = 50, nstart = 20, minK = 2, maxK = 2, minscore = 0.9, probs = 0.5, retry = TRUE, resize = TRUE, maxsize = 1000, recursive = TRUE, cores = 1, quiet = FALSE, verbose = FALSE, ... )
tree |
an object of class |
clades |
a vector of character strings giving the binary indices matching the labels of the nodes that are to be expanded. Defaults to "0", meaning all subclades are expanded. See below for further details on clade indexing. |
refine |
character string giving the iterative model refinement
method to be used in the partitioning process. Valid options are
|
iterations |
integer giving the maximum number of training-classification
iterations to be used in the splitting process.
Note that this is not necessarily the same as the number of Viterbi training
or Baum Welch iterations to be used in model training, which can be set
using the argument |
nstart |
integer. The number of random starting sets to be chosen for initial k-means assignment of sequences to groups. Defaults to 20. |
minK |
integer. The minimum number of furications allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifuricating). |
maxK |
integer. The maximum number of furications allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifuricating). |
minscore |
numeric between 0 and 1. The minimum acceptable value
for the nth percentile of Akaike weights (where n is
the value given in |
probs |
numeric between 0 and 1. The percentile of Akaike weights
to test against the minimum score threshold given in |
retry |
logical indicating whether failure to split a node based on the criteria outlined in 'minscore' and 'probs' should prompt a second attempt with different initial groupings. These groupings are based on maximum kmer frequencies rather than k-means division, which can give suboptimal groupings when the cluster sizes are different (due to the up-weighting of larger clusters in the k-means algorithm). |
resize |
logical indicating whether the models should be free to
change size during the training process or if the number of modules
should be fixed. Defaults to TRUE. Only applicable if
|
maxsize |
integer giving the upper bound on the number of modules in the PHMMs. If NULL, no maximum size is enforced. |
recursive |
logical indicating whether the splitting process
should continue recursively until the discrimination criteria
are not met (TRUE; default), or whether a single split should
take place at each of the nodes specified in |
cores |
integer giving the number processors for multithreading. Defaults to 1.
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
e.g. by running |
quiet |
logical indicating whether feedback should be printed to the console. |
verbose |
logical indicating whether extra feedback should be printed to the console, including progress at each split. |
... |
further arguments to be passed on to |
The clade indexing system used here is based on character strings, where "0" refers to the root node, "01" is the first child node, "02" is the second child node, "011" is the first child node of the first child node, etc. Note that this means each node cannot have more than 9 child nodes.
an object of class "insect"
.
Shaun Wilkinson
data(whales) data(whale_taxonomy) ## split the first node set.seed(123) tree <- learn(whales, db = whale_taxonomy, recursive = FALSE) ## expand only the first clade tree <- expand(tree, clades = "1")
data(whales) data(whale_taxonomy) ## split the first node set.seed(123) tree <- learn(whales, db = whale_taxonomy, recursive = FALSE) ## expand only the first clade tree <- expand(tree, clades = "1")
This function derives the full lineage of a taxon ID number from a given taxonomy database.
get_lineage(taxIDs, db, simplify = TRUE, numbers = FALSE, cores = 1)
get_lineage(taxIDs, db, simplify = TRUE, numbers = FALSE, cores = 1)
taxIDs |
integer or vector of integers giving the taxonomic ID number(s). |
db |
a taxonomy database (a data.frame object).
See |
simplify |
logical indicating whether a single lineage derived from a length-one input should be simplified from a list to a named character vector. Defaults to TRUE. |
numbers |
logical indicating whether the output string(s) should be comprised of the taxonomic ID numbers rather than taxon names. Defaults to FALSE. |
cores |
integer giving the number of processors for multithreading (Defaults to 1).
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running |
the full lineage as a named character vector, or list of named character vectors if the length of the input object is > 1 or simplify = FALSE. "names" attributes are taxonomic ranks.
Shaun Wilkinson
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Research 40, D136-D143. doi:10.1093/nar/gkr1178.
https://www.ncbi.nlm.nih.gov/taxonomy/
data(whales) data(whale_taxonomy) taxIDs <- as.integer(gsub(".+\\|", "", names(whales)[1:2])) get_lineage(taxIDs, db = whale_taxonomy)
data(whales) data(whale_taxonomy) taxIDs <- as.integer(gsub(".+\\|", "", names(whales)[1:2])) get_lineage(taxIDs, db = whale_taxonomy)
This function returns the unique ID for a specified taxon name by looking up a taxonomy database.
get_taxID(lineage, db, multimatch = NA)
get_taxID(lineage, db, multimatch = NA)
lineage |
character vector of taxon names or semicolon-delimited lineage strings. |
db |
a valid taxonomy database (as a data.frame object).
See |
multimatch |
integer giving the value to return if the query matches multiple entries. Defaults to NA_integer_. |
An integer giving the unique taxon ID, or NA if the taxon is not found in the database.
Shaun Wilkinson
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Research 40, D136-D143. doi:10.1093/nar/gkr1178.
https://www.ncbi.nlm.nih.gov/taxonomy/
data(whale_taxonomy) get_taxID("Odontoceti", db = whale_taxonomy)
data(whale_taxonomy) get_taxID("Odontoceti", db = whale_taxonomy)
This function converts DNA or amino acid sequences to 128-bit MD5 hash values for efficient duplicate identification and dereplication.
hash(x)
hash(x)
x |
a sequence or list of sequences, either in character string, character vector, or raw byte format (eg DNAbin or AAbin objects). |
This function uses the md5
function from the openSSL library
(https://www.openssl.org/)
to digest sequences to 128-bit hashes.
These can be compared using base functions such
as duplicated
and unique
, for fast identification and
management of duplicate sequences in large datasets.
a character vector.
Shaun Wilkinson
Ooms J (2017) openssl: toolkit for encryption, signatures and certificates based on OpenSSL. R package version 0.9.7. https://CRAN.R-project.org/package=openssl
data(whales) hashes <- hash(whales) sum(duplicated(hashes))
data(whales) hashes <- hash(whales) sum(duplicated(hashes))
R package 'insect' contains functions to create and use classification trees for DNA meta-barcoding. This enables users to quickly and accurately assign the taxonomic identities of large numbers of DNA barcodes or other informative sequences generated on one of the standard high-throughput sequencing platforms.
This function joins two or more DNAbin
objects, retaining any
attributes whose lengths match those of the input objects (e.g. "species",
"lineage" and/or "taxID" attributes).
join(...)
join(...)
... |
|
an object of class DNAbin
.
Shaun Wilkinson
data(whales) whales1 <- whales[1:10] whales2 <- whales[11:19] join(whales1, whales2)
data(whales) whales1 <- whales[1:10] whales2 <- whales[11:19] join(whales1, whales2)
This function learns a classification tree from a reference sequence database using a recursive partitioning procedure.
learn( x, db = NULL, model = NULL, refine = "Viterbi", iterations = 50, nstart = 20, minK = 2, maxK = 2, minscore = 0.95, probs = 0.5, retry = FALSE, resize = TRUE, maxsize = 1000, recursive = TRUE, cores = 1, quiet = FALSE, verbose = FALSE, numcode = NULL, frame = NULL, ... )
learn( x, db = NULL, model = NULL, refine = "Viterbi", iterations = 50, nstart = 20, minK = 2, maxK = 2, minscore = 0.95, probs = 0.5, retry = FALSE, resize = TRUE, maxsize = 1000, recursive = TRUE, cores = 1, quiet = FALSE, verbose = FALSE, numcode = NULL, frame = NULL, ... )
x |
a reference database of class |
db |
a heirarchical taxonomy database in the form of a data.frame.
Cannot be NULL unless training data is in RDP format
(containing semicolon delimited lineage strings).
The object should have
four columns, labeled "taxID", "parent_taxID", "rank" and "name".
The first two should be numeric, and all ID numbers in the
"parent_taxID" column should link to those in the "taxID" column.
This excludes the first row,
which should have |
model |
an optional object of class |
refine |
character string giving the iterative model refinement
method to be used in the partitioning process. Valid options are
|
iterations |
integer giving the maximum number of training-classification
iterations to be used in the splitting process.
Note that this is not necessarily the same as the number of Viterbi training
or Baum Welch iterations to be used in model training, which can be set
using the argument |
nstart |
integer. The number of random starting sets to be chosen for initial k-means assignment of sequences to groups. Defaults to 20. |
minK |
integer. The minimum number of furications allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifuricating). |
maxK |
integer. The maximum number of furications allowed at each inner node of the tree. Defaults to 2 (all inner nodes are bifuricating). |
minscore |
numeric between 0 and 1. The minimum acceptable value
for the nth percentile of Akaike weights (where n is
the value given in |
probs |
numeric between 0 and 1. The percentile of Akaike weights
to test against the minimum score threshold given in |
retry |
logical indicating whether failure to split a node based on the criteria outlined in 'minscore' and 'probs' should prompt a second attempt with different initial groupings. These groupings are based on maximum kmer frequencies rather than k-means division, which can give suboptimal groupings when the cluster sizes are different (due to the up-weighting of larger clusters in the k-means algorithm). |
resize |
logical indicating whether the models should be free to
change size during the training process or if the number of modules
should be fixed. Defaults to TRUE. Only applicable if
|
maxsize |
integer giving the upper bound on the number of modules in the PHMMs. If NULL, no maximum size is enforced. |
recursive |
logical indicating whether the splitting process should continue recursively until the discrimination criteria are not met (TRUE; default), or whether a single split should take place at the root node. |
cores |
integer giving the number processors for multithreading. Defaults to 1.
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
e.g. by running |
quiet |
logical indicating whether feedback should be printed to the console. |
verbose |
logical indicating whether extra feedback should be printed to the console, including progress at each split. |
numcode , frame
|
passed to |
... |
further arguments to be passed on to |
The "insect" object type is a dendrogram
with several additional attributes stored at each node.
These include:
"clade" the index of the node (see further details below);
"sequences" the indices of the sequences in the reference
database used to create the object;
"taxID" the taxonomic identifier of the lowest common taxon
of the sequences belonging to the node (linking to "db"
);
"minscore" the lowest likelihood among the training sequences given
the profile HMM stored at the node;
"minlength" the minimum length of the sequences belonging to the node;
"maxlength" the maximum length of the sequences belonging to the node;
"model" the profile HMM derived from the sequence subset belonging to the node;
"nunique" the number of unique sequences belonging to the node;
"ntotal" the total number of sequences belonging to the node (including duplicates);
"key" the hash key used for exact sequence matching
(bypasses the classification procedure if an exact match is found; root node only);
"taxonomy" the taxonomy database containing the taxon ID numbers (root node only).
The clade indexing system used here is based on character strings, where "0" refers to the root node, "01" is the first child node, "02" is the second child node, "011" is the first child node of the first child node, etc. The leading zero may be omitted for brevity. Note that each inner node can not have more than 9 child nodes.
an object of class "insect"
.
Shaun Wilkinson
Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology, 5, 21.
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Gerstein M, Sonnhammer ELL, Chothia C (1994) Volume changes in protein evolution. Journal of Molecular Biology, 236, 1067-1078.
Juang B-H, Rabiner LR (1990) The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.
data(whales) data(whale_taxonomy) ## use all sequences except first one to train the classifier set.seed(999) tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2) ## find predicted lineage for first sequence classify(whales[1], tree) ## compare with actual lineage taxID <- as.integer(gsub(".+\\|", "", names(whales)[1])) get_lineage(taxID, whale_taxonomy)
data(whales) data(whale_taxonomy) ## use all sequences except first one to train the classifier set.seed(999) tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2) ## find predicted lineage for first sequence classify(whales[1], tree) ## compare with actual lineage taxID <- as.integer(gsub(".+\\|", "", names(whales)[1])) get_lineage(taxID, whale_taxonomy)
These functions provide additional methods to manipulate objects of class
"DNAbin"
and "AAbin"
to supplement those available in the
ape
package.
## S3 method for class 'DNAbin' duplicated(x, incomparables = FALSE, pointers = TRUE, ...) ## S3 method for class 'DNAbin' unique(x, incomparables = FALSE, attrs = TRUE, drop = FALSE, ...) ## S3 method for class 'DNAbin' subset(x, subset, attrs = TRUE, drop = FALSE, ...) ## S3 method for class 'AAbin' duplicated(x, incomparables = FALSE, pointers = TRUE, ...) ## S3 method for class 'AAbin' unique(x, incomparables = FALSE, attrs = TRUE, drop = FALSE, ...) ## S3 method for class 'AAbin' subset(x, subset, attrs = TRUE, drop = FALSE, ...)
## S3 method for class 'DNAbin' duplicated(x, incomparables = FALSE, pointers = TRUE, ...) ## S3 method for class 'DNAbin' unique(x, incomparables = FALSE, attrs = TRUE, drop = FALSE, ...) ## S3 method for class 'DNAbin' subset(x, subset, attrs = TRUE, drop = FALSE, ...) ## S3 method for class 'AAbin' duplicated(x, incomparables = FALSE, pointers = TRUE, ...) ## S3 method for class 'AAbin' unique(x, incomparables = FALSE, attrs = TRUE, drop = FALSE, ...) ## S3 method for class 'AAbin' subset(x, subset, attrs = TRUE, drop = FALSE, ...)
x |
a |
incomparables |
placeholder, not currently functional. |
pointers |
logical indicating whether the re-replication index key
should be returned as a |
... |
further arguments to be passed between methods. |
attrs |
logical indicating whether the attributes of the input object whose length match the object length (or number of rows if it is a matrix) should be retained and subsetted along with the object. This is useful if the input object has species, lineage and/or taxon ID metadata that need to be retained following the duplicate analysis. |
drop |
logical; indicates whether the input matrix (assuming one is
passed) should be reduced to a vector if subset to a single sequence.
Defaults to FALSE in keeping with the style of the |
subset |
logical vector giving the elements or rows to be kept. |
unique
and subset
return a
DNAbin
or AAbin
object. duplicated
returns a logical vector.
Shaun Wilkinson
Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.
Paradis E (2007) A bit-level coding scheme for nucleotides. http://ape-package.ird.fr/misc/BitLevelCodingScheme.html.
Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.
data(whales) duplicates <- duplicated.DNAbin(whales, point = TRUE) attr(duplicates, "pointers") ## returned indices show that the last sequence is ## identical to the second one. ## subset the reference sequence database to only include unques whales <- subset.DNAbin(whales, subset = !duplicates) ## this gives the same result as unique.DNAbin(whales)
data(whales) duplicates <- duplicated.DNAbin(whales, point = TRUE) attr(duplicates, "pointers") ## returned indices show that the last sequence is ## identical to the second one. ## subset the reference sequence database to only include unques whales <- subset.DNAbin(whales, subset = !duplicates) ## this gives the same result as unique.DNAbin(whales)
This function prunes the taxon database, removing specified taxa as desired to improve speed and memory efficiency.
prune_taxonomy(db, taxIDs, keep = TRUE)
prune_taxonomy(db, taxIDs, keep = TRUE)
db |
a valid taxonomy database, e.g. obtained by running the
|
taxIDs |
the names or taxon ID numbers of the taxa to be retained
(or discarded if |
keep |
logical, indicates whether the specified taxa should be kept and the rest of the database removed or vice versa. Defaults to TRUE. |
TBA
a data.frame with the same column names as the input object ("taxID", "parent_taxID", "rank", "name").
Shaun Wilkinson
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Research 40, D136-D143. doi:10.1093/nar/gkr1178.
https://www.ncbi.nlm.nih.gov/taxonomy/
## remove Odontoceti suborder from cetacean taxonomy database data(whale_taxonomy) prune_taxonomy(whale_taxonomy, taxIDs = 9722, keep = FALSE)
## remove Odontoceti suborder from cetacean taxonomy database data(whale_taxonomy) prune_taxonomy(whale_taxonomy, taxIDs = 9722, keep = FALSE)
This function evaluates a DNA reference database (a "DNAbin" object) and removes any sequences whose taxonomic metadata appear to be inconsistent with those of their most closely related sequences.
purge(x, db, level = "order", confidence = 0.8, cores = 1, quiet = FALSE, ...)
purge(x, db, level = "order", confidence = 0.8, cores = 1, quiet = FALSE, ...)
x |
a DNAbin list object whose names include taxonomic identification numbers
(see |
db |
a valid taxonomy database containing the taxonomic identification numbers
included in the "names" attribute of the primary input object (a data.frame object;
see |
level |
character string giving the taxonomic level at which heterogeneity within a cluster will flag a sequence as potentially erroneous. This should be a recognized rank within the taxonomy database. |
confidence |
numeric, the minimum confidence value for a sequence to be purged.
For example, if |
cores |
integer giving the number of processors for multithreading. Defaults to 1.
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running |
quiet |
logical indicating whether progress should be printed to the console. |
... |
further arguments to pass to |
This function first clusters the sequence dataset into operational
taxonomic units (OTUs) based on a given genetic similarity threshold
using the otu
function from the kmer
package.
Each cluster is then checked for taxonomic homogeneity at a given rank,
and any sequences that appear out of place are removed.
The criteria for sequence removal are that at least two other independent
studies should contradict the taxonomic metadata attributed to the sequence.
a "DNAbin" object.
Shaun Wilkinson
data(whales) data(whale_taxonomy) whales <- purge(whales, db = whale_taxonomy, level = "species", threshold = 0.97, method = "farthest")
data(whales) data(whale_taxonomy) whales <- purge(whales, db = whale_taxonomy, level = "species", threshold = 0.97, method = "farthest")
This function performs several quality checks for FASTQ input files, removing any sequences that do not conform to the specified quality standards. This includes an average quality score assessment, size selection, singleton removal (or an alternative minimum count) and ambiguous base-call filtering.
qfilter( x, minqual = 30, maxambigs = 0, mincount = 2, minlength = 50, maxlength = 500 )
qfilter( x, minqual = 30, maxambigs = 0, mincount = 2, minlength = 50, maxlength = 500 )
x |
a vector of concatenated strings representing DNA sequences
(in upper case) or a DNAbin list object with quality attributes.
This argument will usually be produced by |
minqual |
integer, the minimum average quality score for a sequence to pass the filter. Defaults to 30. |
maxambigs |
integer, the maximum number of ambiguities for a sequence to pass the filter. Defaults to 0. |
mincount |
integer, the minimum acceptable number of occurrences of a sequence for it to pass the filter. Defaults to 2 (removes singletons). |
minlength |
integer, the minimum acceptable sequence length. Defaults to 50. |
maxlength |
integer, the maximum acceptable sequence length. Defaults to 500. |
an object of the same type as the primary input argument (i.e. a "DNAbin" object if x is a "DNAbin" object, or a vector of concatenated character strings otherwise).
Shaun Wilkinson
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq")) ## trim primers from sequences mlCOIintF <- "GGWACWGGWTGAACWGTWTAYCCYCC" jgHCO2198 <- "TAIACYTCIGGRTGICCRAARAAYCA" x <- trim(x, up = mlCOIintF, down = jgHCO2198) ## filter sequences to remove singletons, low quality & short/long reads x <- qfilter(x, minlength = 250, maxlength = 350)
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq")) ## trim primers from sequences mlCOIintF <- "GGWACWGGWTGAACWGTWTAYCCYCC" jgHCO2198 <- "TAIACYTCIGGRTGICCRAARAAYCA" x <- trim(x, up = mlCOIintF, down = jgHCO2198) ## filter sequences to remove singletons, low quality & short/long reads x <- qfilter(x, minlength = 250, maxlength = 350)
This function reverse complements a DNA sequence or vector of DNA sequences that are stored as character strings.
rc(z)
rc(z)
z |
a vector of DNA sequences in upper case character string format. |
This function accepts only DNA sequences in concatenated character
string format, see complement
in the ape
package for "DNAbin" input objects, and comp
in the
seqinr
package for when the input object is a character
vector.
a vector of DNA sequences as upper case character strings.
Shaun Wilkinson
rc("TATTG")
rc("TATTG")
Text parsing functions for reading sequences in the FASTA or FASTQ format into R.
readFASTQ(file = file.choose(), bin = TRUE) readFASTA( file = file.choose(), bin = TRUE, residues = "DNA", alignment = FALSE )
readFASTQ(file = file.choose(), bin = TRUE) readFASTA( file = file.choose(), bin = TRUE, residues = "DNA", alignment = FALSE )
file |
the name of the FASTA or FASTQ file from which the sequences are to be read. |
bin |
logical indicating whether the returned object should be in binary/raw byte format (i.e. "DNAbin" or "AAbin" objects for nucleotide and amino acid sequences, respectively). If FALSE a vector of named character strings is returned. |
residues |
character string indicating whether the sequences to
be read are composed of nucleotides ("DNA"; default) or amino acids ("AA").
Only required for |
alignment |
logical indicating whether the sequences represent
an alignment to be parsed as a matrix.
Only applies to |
The FASTQ convention is somewhat ambiguous with several slightly different interpretations appearing in the literature. For now, this function supports the Illumina convention for FASTQ files, where each sequence and its associated metadata occupies four line of the text file as follows : (1) the run and cluster metadata preceded by an @ symbol; (2) the sequence itself in capitals without spaces; (3) a single "+" symbol; and (4) the Phred quality scores from 0 to 93 represented as ASCII symbols. For more information on this convention see the Illumina help page here .
For optimal memory efficiency and compatibility with other functions,
it is recommended to store sequences in raw byte format
as either DNAbin or AAbin objects.
For FASTQ files when bin = TRUE, a vector of quality scores
(also in raw-byte format) is attributed to each sequence.
These can be converted back to numeric quality scores with as.integer
.
For FASTQ files when bin = FALSE the function returns a vector with each
sequence as a concatenated string with a similarly concatenated quality attribute
comprised of the same ASCII metacharacters used in the FASTQ coding scheme.
This function can take a while to process larger FASTQ files, a multithreading option may be available in a future version.
Either a vector of character strings (if bin = FALSE), or a list of raw ("DNAbin" or "AAbin") vectors, with each element having a "quality" attribute.
Shaun Wilkinson
Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, Knight R, Mills DA, Caporaso JG (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods, 1, 57-59.
Illumina help page: https://help.basespace.illumina.com/articles/descriptive/fastq-files/
writeFASTQ
and writeFASTA
for writing sequences to text in the FASTA or FASTQ format.
See also read.dna
in the ape
package.
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq"))
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq"))
These functions are used to extract only the unique sequences from a set of DNA reads, with the ability to rebuild the original sequence set at a later time.
dereplicate(x, cores = 1) rereplicate(x)
dereplicate(x, cores = 1) rereplicate(x)
x |
a list of sequences in |
cores |
integer giving the number of processors for multithreading (defaults to 1).
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
e.g. by running |
either a DNAbin/AAbin object, or a vector of concatenated upper-case character strings, depending on the input object.
Shaun Wilkinson
data(whales) tmp <- dereplicate(whales) whales <- rereplicate(tmp)
data(whales) tmp <- dereplicate(whales) whales <- rereplicate(tmp)
This matrix contains counts of COI amplicon sequence variants (ASV) extracted from autonomous reef monitoring structures (ARMS) in Ofu, American Samoa. Unpublished data courtesy of Molly Timmers (NOAA).
samoa
samoa
a 2 x 16 integer matrix containing abundance counts of 16 ASVs from two sites. This table contains the first 16 rows of the 'seqtab.nochim' output from the DADA2 pipeline (https://benjjneb.github.io/dada2/tutorial.html). The column names contain the DNA sequences of the ASVs, and row names correspond with site codes.
searchGB
queries GenBank using the
Entrez search utilities, and downloads the matching sequences
and/or their accession numbers. A vector of
accession numbers can be passed in lieu of a query, in which case the function
downloads the matching sequences from GenBank.
Internet connectivity is required.
searchGB( query = NULL, accession = NULL, sequences = TRUE, bin = TRUE, db = "nucleotide", taxIDs = TRUE, prompt = TRUE, contact = NULL, quiet = FALSE )
searchGB( query = NULL, accession = NULL, sequences = TRUE, bin = TRUE, db = "nucleotide", taxIDs = TRUE, prompt = TRUE, contact = NULL, quiet = FALSE )
query |
an Entrez search query. For help compiling Entrez queries see https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Entrez_Searching_Options and https://www.ncbi.nlm.nih.gov/books/NBK49540/. |
accession |
an optional vector of GenBank accession numbers to be input in place of a search query. If both query and accession arguments are provided the function returns an error. Currently, a maximum of 200 accession numbers can be processed at a time. |
sequences |
logical. Should the sequences be returned or only the
GenBank accession numbers? Note that taxon IDs
are not returned if |
bin |
logical indicating whether the returned sequences should be in raw-byte format ("DNAbin" or "AAbin" object type) or as a vector of named character strings. Defaults to TRUE. |
db |
the NCBI database from which to download the sequences and/or accession names. Accepted options are "nucleotide" (default) and "protein". |
taxIDs |
logical indicating whether the NCBI taxon ID numbers should be appended to the names of the output object (delimited by a "|" character). Defaults to TRUE. |
prompt |
logical indicating whether to check with the user before downloading sequences. |
contact |
an optional character string with the users email address. This is added to the E-utilities URL and may be used by NCBI to contact the user if the application causes unintended issues. |
quiet |
logical indicating whether the progress should be printed to the console. |
This function uses the Entrez e-utilities API to search and download sequences from GenBank. Occasionally users may encounter an unknown non-reproducible error and appears to be related to database records being updated in GenBank. This can generally be remedied by re-running the function. If problems persist please feel free to raise an issue on the package bug-reports page at <http://github.com/shaunpwilkinson/insect/issues>.
a list of sequences as either a DNAbin
or AAbin
object (depending on "db"
),
or a named vector of character strings (if bin = FALSE
).
Shaun Wilkinson
NCBI Resource Coordinators (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 41 (Database issue): D8–D20.
read.GenBank
(ape)
for an alternative means of downloading DNA sequences from GenBank
using accession numbers.
## Query the GenBank database for Eukaryote mitochondrial 16S DNA sequences ## between 100 and 300 base pairs in length that were modified between ## the years 1999 and 2000. query <- "Eukaryota[ORGN]+AND+16S[TITL]+AND+100:300[SLEN]+AND+1999:2000[MDAT]" x <- searchGB(query, prompt = FALSE)
## Query the GenBank database for Eukaryote mitochondrial 16S DNA sequences ## between 100 and 300 base pairs in length that were modified between ## the years 1999 and 2000. query <- "Eukaryota[ORGN]+AND+16S[TITL]+AND+100:300[SLEN]+AND+1999:2000[MDAT]" x <- searchGB(query, prompt = FALSE)
This function uses the Viterbi algorithm to semi-globally align a motif to a DNA or AA sequence, and removes all nucleotides to the left and/or right of the motif.
shave(x, motif, direction = "both", cores = 1, ...)
shave(x, motif, direction = "both", cores = 1, ...)
x |
an object of class |
motif |
a |
direction |
character string indicating the direction of the shave. Options are "forward" (shaves everything to the right of the motif), "backward" (shaves everything to the left of the motif) or "both" (retains the motif region only). |
cores |
integer giving the number of processors for multithreading.
Defaults to 1, and reverts to 1 if |
... |
further arguments to be passed to |
This functions finds the optimal semiglobal alignment (a.k.a. "glocal"
alignment or global alignment with free end gaps) between a
sequence "x"
and a shorter sequence "motif"
, returning
the motif region of x along with the nucleotides to the left or right
if direction
is set to "reverse"
or "forward"
,
respectively.
an object of class DNAbin
or AAbin
(depending on the input object).
Shaun Wilkinson
data(whales) motif = char2dna("AAGTGTAGCATCACTTATTGATCCAAATT") shave(whales, motif = motif, direction = "both")
data(whales) motif = char2dna("AAGTGTAGCATCACTTATTGATCCAAATT") shave(whales, motif = motif, direction = "both")
This function aligns forward and reverse reads generated from a 2x amplicon sequencing platform, and produces a consensus sequence with maximum base-call quality scores attached as "quality" attributes.
stitch(x, y, up = NULL, down = NULL, mindiff = 6, minoverlap = 16)
stitch(x, y, up = NULL, down = NULL, mindiff = 6, minoverlap = 16)
x , y
|
DNAbin objects with quality attributes (see |
up , down
|
forward and reverse primer sequences (either as concatenated character strings or "DNAbin" objects). Either both or neither should be provided (not one or the other). |
mindiff |
the minimum difference in quality between two different base calls for the higher quality call to be added to the consensus alignment. In cases where the quality differences are less than this threshold, the ambiguity code "N" is added to the consensus sequence. Defaults to 6. |
minoverlap |
integer giving the minimum number of nucleotides that should be shared between the forward and reverse sequence alignments. Defaults to 16. |
a "DNAbin" object or a vector of concatenated character strings, depending on the input.
Shaun Wilkinson
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample1_read1.fastq"), bin = FALSE) y <- readFASTQ(paste0(td, "/COI_sample1_read2.fastq"), bin = FALSE) z <- stitch(x, y) z[1] attr(z, "quality")[1]
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample1_read1.fastq"), bin = FALSE) y <- readFASTQ(paste0(td, "/COI_sample1_read2.fastq"), bin = FALSE) z <- stitch(x, y) z[1] attr(z, "quality")[1]
This function downloads an up-to-date copy of the taxonomy database.
taxonomy(db = "NCBI", synonyms = FALSE)
taxonomy(db = "NCBI", synonyms = FALSE)
db |
character string specifying which taxonomy database to download. Currently only "NCBI" is supported. |
synonyms |
logical indicating whether synonyms should be included. Note that this increases the size of the returned object by around 10%. |
This function downloads the specified taxonomy database as a data.frame
object with the following columns:
"taxID", "parent_taxID", "rank", "name".
As of early 2018 the zip archive to download the NCBI taxonomy database
is approximately 40Mb in size, and the output dataframe object is around
200Mb in memory. Once downloaded, the dataframe can be pruned
for increased speed and memory efficiency using the function
prune_taxonomy
.
a dataframe with the following elements: "taxID", "parent_taxID", "rank", "name".
Shaun Wilkinson
Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Research 40, D136-D143. doi:10.1093/nar/gkr1178.
https://www.ncbi.nlm.nih.gov/taxonomy/
# db <- taxonomy()
# db <- taxonomy()
This function trims the primer and/or index sequence(s) from a set of DNA sequences.
trim(x, up, down = NULL)
trim(x, up, down = NULL)
x |
a "DNAbin" object or a vector of concatenated upper-case character strings, containing the sequences to be trimmed. |
up , down
|
"DNAbin" objects or vectors of concatenated upper-case character strings, representing the primer sequences. |
Any sequences not containing the primer(s) in either direction are discarded. Hence this function can also be used to de-multiplex sequences and remove indices, though it will generally be faster to do this on the sequencing platform prior to exporting the FASTQ files.
a "DNAbin" object or a vector of concatenated character strings, depending on the input.
Shaun Wilkinson
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq")) ## trim primers from sequences mlCOIintF <- "GGWACWGGWTGAACWGTWTAYCCYCC" jgHCO2198 <- "TAIACYTCIGGRTGICCRAARAAYCA" x <- trim(x, up = mlCOIintF, down = jgHCO2198)
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq")) ## trim primers from sequences mlCOIintF <- "GGWACWGGWTGAACWGTWTAYCCYCC" jgHCO2198 <- "TAIACYTCIGGRTGICCRAARAAYCA" x <- trim(x, up = mlCOIintF, down = jgHCO2198)
This function queries a list of DNA sequences with a virtual probe (either a sequence or a profile hidden Markov model) and returns only the sequences and regions that are of sufficient similarity based on log-odds alignment scoring.
virtualFISH( x, probe, minscore = 100, minamplen = 50, maxamplen = 500, up = NULL, down = NULL, rcdown = TRUE, minfsc = 60, minrsc = 60, maxNs = 0.02, cores = 1, quiet = FALSE )
virtualFISH( x, probe, minscore = 100, minamplen = 50, maxamplen = 500, up = NULL, down = NULL, rcdown = TRUE, minfsc = 60, minrsc = 60, maxNs = 0.02, cores = 1, quiet = FALSE )
x |
a list of DNA sequences in |
probe |
a DNA sequence ("DNAbin" object) or profile hidden Markov model ("PHMM" object) to be used as the virtual hybridization probe. |
minscore |
numeric; the minimum specificity (log-odds score for the optimal alignment) between the query sequence and the probe for the former to be retained in the output object. |
minamplen , maxamplen
|
integers giving the minimum and maximum acceptable amplicon lengths. |
up , down
|
optional objects of class |
rcdown |
logical indicating whether the reverse primer should be
reverse-complemented prior to aligning with the input sequences.
Should be set to TRUE if |
minfsc |
numeric, giving the minimum specificity(log-odds score for the optimal alignment) between the forward primer and a sequence for that sequence to be retained. |
minrsc |
numeric, the minimum specificity (log-odds score for the optimal alignment) between the reverse primer (if provided) and a sequence for that sequence to be retained. |
maxNs |
numeric giving the maximum acceptable proportion of the ambiguous residue "N" within the output sequences. Defaults to 0.02. |
cores |
integer giving the number of processors for multithreading.
Defaults to 1, and reverts to 1 if |
quiet |
logical indicating whether progress should be printed to the console. |
This function is generally used when filtering/trimming a
local sequence database,
to mop up any high-scoring sequences with partial/missing primer
bind sites that were not included in the output of the
virtualPCR
.
For example, this includes sequences that were generated using the same
primer set as used in the virtual PCR, and whose primer binding sites
were trimmed prior to deposition in the sequence database.
Unlike the virtualPCR function, there is
no option to retain the primer-bind sites in the returned sequences.
a list of trimmed sequences, returned as an object of class
DNAbin
.
Shaun Wilkinson
## ensure whale sequences are globally alignable data(whales) model <- aphid::derivePHMM(whales) z <- virtualFISH(whales, probe = model)
## ensure whale sequences are globally alignable data(whales) model <- aphid::derivePHMM(whales) z <- virtualFISH(whales, probe = model)
virtualPCR
queries a list of DNA sequences with virtual primers
(either sequences or profile hidden Markov models) and returns only
the sequences that contain regions of sufficient similarity based on
log-odds alignment scoring.
virtualPCR( x, up, down = NULL, rcdown = TRUE, trimprimers = FALSE, minfsc = 50, minrsc = 50, minamplen = 50, maxamplen = 2000, maxNs = 0.02, partialbind = TRUE, cores = 1, quiet = FALSE )
virtualPCR( x, up, down = NULL, rcdown = TRUE, trimprimers = FALSE, minfsc = 50, minrsc = 50, minamplen = 50, maxamplen = 2000, maxNs = 0.02, partialbind = TRUE, cores = 1, quiet = FALSE )
x |
a list of DNA sequences in |
up |
an object of class |
down |
an optional argument the same type as |
rcdown |
logical indicating whether the reverse primer should be
reverse-complemented prior to aligning with the input sequences. Set
to TRUE only if |
trimprimers |
logical indicating whether the primer-binding sites should be removed from the sequences in the returned list. |
minfsc |
numeric, giving the minimum specificity(log-odds score for the optimal alignment) between the forward primer and a sequence for that sequence to be retained. |
minrsc |
numeric, the minimum specificity (log-odds score for the optimal alignment) between the reverse primer (if provided) and a sequence for that sequence to be retained. |
minamplen , maxamplen
|
integers giving the minimum and maximum acceptable amplicon lengths. Sequences are discarded if the number of base pairs between the primer-binding sites falls outside of these limits. |
maxNs |
numeric giving the maximum acceptable proportion of the ambiguous residue "N" within the output sequences. Defaults to 0.02. |
partialbind |
logical indicating whether partial primer matching is accepted. Defaults to TRUE. |
cores |
integer giving the number of processors for multithreading.
Defaults to 1, and reverts to 1 if x is not a list.
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running |
quiet |
logical indicating whether progress should be printed to the console. |
a list of trimmed sequences, an object of class
DNAbin
.
Shaun Wilkinson
## trim whale sequences using a new set of inner primers inner_for <- "CGGTTGGGGTGACCTCGGAGTA" inner_rev <- "GCTGTTATCCCTAGGGTAA" whales_short <- virtualPCR(whales, up = inner_for, down = inner_rev, trimprimers = TRUE)
## trim whale sequences using a new set of inner primers inner_for <- "CGGTTGGGGTGACCTCGGAGTA" inner_rev <- "GCTGTTATCCCTAGGGTAA" whales_short <- virtualPCR(whales, up = inner_for, down = inner_rev, trimprimers = TRUE)
A copy of the NCBI taxonomy reference database, subsetted to include only the
cetacean taxa in the whales
dataset.
whale_taxonomy
whale_taxonomy
A data.frame object with 72 rows and four columns, labeled as follows:
the NCBI unique taxon identifier (integer).
the NCBI unique taxon identifier of the immediate parent taxon (integer).
The taxonomic rank (i.e. species, genus, etc; character).
The scientific name of the taxon (character).
The database was accessed from
ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz on 17 June 2018
using the taxonomy
function, and pruned using
prune_taxonomy
with
taxIDs = as.integer(gsub(".+\|", "", names(whales)))
.
https://www.ncbi.nlm.nih.gov/taxonomy/
A dataset containing 19 mitochondrial 16S rDNA sequences from 18 cetacean species, downloaded from GenBank on 27 March 2018.
whales
whales
A "DNAbin" list object containing 19 cetacean mitochondrial sequences
in raw-byte format,
averaging 130 nucleotides in length. The sequences are named with the GenBank
accession numbers followed by a "|" symbol, followed by their NCBI taxonomy ID
numbers.
The sequences were downloaded using the
searchGB
function on 17 June 2018
(query term: "cetacea[ORGN]+AND+16S+rRNA[GENE]"), and trimmed using the
virtualPCR
function with the primers 16Smam1 and 16Smam2
(CGGTTGGGGTGACCTCGGA and GCTGTTATCCCTAGGGTAACT, respectively; Taylor 1996).
https://www.ncbi.nlm.nih.gov/genbank/
Taylor PG (1996) Reproducibility of ancient DNA sequences from extinct Pleistocene fauna. Molecular Biology and Evolution, 13, 283-285.
These functions take a list of DNA or amino acid sequences in
DNAbin
or AAbin
format
and outputs a text file to a specified directory.
writeFASTQ(x, file = "", compress = FALSE, append = FALSE) writeFASTA(x, file = "", compress = FALSE, append = FALSE, wrap = NULL)
writeFASTQ(x, file = "", compress = FALSE, append = FALSE) writeFASTA(x, file = "", compress = FALSE, append = FALSE, wrap = NULL)
x |
a list of sequences in |
file |
character string giving a valid file path to output the text to. If file = "" (default setting) the text file is written to the console. |
compress |
logical indicating whether the output file should be gzipped. |
append |
logical indicating whether the output should be appended to the file. |
wrap |
integer giving the maximum number of characters on each sequence line. Defaults to NULL (no wrapping). |
NULL (invisibly).
Shaun Wilkinson
Illumina help page: https://help.basespace.illumina.com/articles/descriptive/fastq-files/
readFASTQ
for reading FASTQ files into R,
and write.dna
in the ape package
for writing DNA to text in FASTA and other formats.
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq")) ## trim primers from sequences mlCOIintF <- "GGWACWGGWTGAACWGTWTAYCCYCC" jgHCO2198 <- "TAIACYTCIGGRTGICCRAARAAYCA" x <- trim(x, up = mlCOIintF, down = jgHCO2198) ## quality filter with size selection and singleton removal x <- qfilter(x, minlength = 250, maxlength = 350) ## output filtered FASTQ file writeFASTQ(x, file = paste0(td, "/COI_sample2_filtered.fastq")) writeFASTA(x, file = paste0(td, "/COI_sample2_filtered.fasta"))
## download and extract example FASTQ file to temporary directory td <- tempdir() URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1" dest <- paste0(td, "/insect_tutorial1_files.zip") download.file(URL, destfile = dest, mode = "wb") unzip(dest, exdir = td) x <- readFASTQ(paste0(td, "/COI_sample2.fastq")) ## trim primers from sequences mlCOIintF <- "GGWACWGGWTGAACWGTWTAYCCYCC" jgHCO2198 <- "TAIACYTCIGGRTGICCRAARAAYCA" x <- trim(x, up = mlCOIintF, down = jgHCO2198) ## quality filter with size selection and singleton removal x <- qfilter(x, minlength = 250, maxlength = 350) ## output filtered FASTQ file writeFASTQ(x, file = paste0(td, "/COI_sample2_filtered.fastq")) writeFASTA(x, file = paste0(td, "/COI_sample2_filtered.fasta"))