match-utils {Biostrings} | R Documentation |
In this man page we define precisely and illustrate what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition of a "match" is central to most pattern matching functions available in this package: unless specified otherwise, most of them will adhere to the definition provided here.
neditStartingAt, neditEndingAt, isMatchingStartingAt and isMatchingEndingAt
are low-level functions that implement some basic concepts. Once these
concepts are understood, we can use them to provide a simple and concise
definition of a "match".
Other utility functions related to pattern matching are described here:
the mismatch function for getting the positions of the mismatching
letters of a given pattern relative to its matches in a given subject,
the nmatch and nmismatch functions for getting the number of
matching and mismatching letters produced by the mismatch function,
and the coverage function that can be used to get the "coverage" of
a subject by a given pattern or set of patterns.
neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE)
neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE)
neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE)

isMatchingStartingAt(pattern, subject, starting.at=1, max.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingEndingAt(pattern, subject, ending.at=1, max.mismatch=0, with.indels=FALSE, fixed=TRUE)
isMatchingAt(pattern, subject, at=1, max.mismatch=0, with.indels=FALSE, fixed=TRUE)

mismatch(pattern, x, fixed=TRUE)
nmatch(pattern, x, fixed=TRUE)
nmismatch(pattern, x, fixed=TRUE)

## S4 method for signature 'MIndex':
coverage(x, start=NA, end=NA)

## S4 method for signature 'XStringViews':
coverage(x, start=NA, end=NA, weight=1L)

## S4 method for signature 'MaskedXString':
coverage(x, start=NA, end=NA, weight=1L)
pattern |
The pattern string. |
subject |
An XString object (or character vector) containing the subject sequence. |
starting.at, ending.at, at |
An integer vector specifying the starting (for starting.at and at) or
ending (for ending.at) positions of the pattern relative to the subject.
|
max.mismatch |
See details below. |
with.indels |
See details below. |
fixed |
Only with a DNAString or RNAString subject can a fixed value other than
the default (TRUE) be used.
With fixed=FALSE, ambiguities (i.e. letters from the IUPAC Extended
Genetic Alphabet (see IUPAC_CODE_MAP) that are not from the base alphabet)
in the pattern and in the subject are interpreted as wildcards, i.e. they
match any letter that they stand for.
fixed can also be a character vector, a subset of c("pattern", "subject").
fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default).
An empty vector is equivalent to fixed=FALSE.
With fixed="subject", ambiguities in the pattern only are interpreted as
wildcards.
With fixed="pattern", ambiguities in the subject only are interpreted as
wildcards.
|
x |
An XStringViews object for mismatch (typically, one returned by
matchPattern(pattern, subject)).
Typically an XStringViews or MIndex object for coverage, but IRanges,
MaskCollection and MaskedXString objects are accepted too.
|
start, end |
Two single integers specifying where to start and end the extraction of the
coverage in x.
|
weight |
An integer vector specifying how much each element in x counts.
|
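To make the wildcard rule for the fixed argument concrete, here is a minimal base R sketch of the case where only ambiguities in the pattern are treated as wildcards (the fixed="subject" case). It does not use Biostrings: the iupac vector is a hand-written subset of the IUPAC codes and letter_matches is a hypothetical helper introduced for illustration only, assuming the subject letter is a plain base.

```r
## Hand-written subset of the IUPAC codes (illustration only):
## each code maps to the base letters it stands for.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "AG", Y = "CT", N = "ACGT")

## Hypothetical helper: does pattern letter p match subject base s?
## With fixed = TRUE letters are compared literally; with fixed = FALSE
## an ambiguity in the pattern matches any base it stands for.
letter_matches <- function(p, s, fixed = TRUE) {
  if (fixed) return(p == s)
  grepl(s, iupac[[p]], fixed = TRUE)
}

letter_matches("N", "G")                 # FALSE: N is taken literally
letter_matches("N", "G", fixed = FALSE)  # TRUE: N stands for A, C, G or T
letter_matches("R", "T", fixed = FALSE)  # FALSE: R stands for A or G only
```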
A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. Two distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the two strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (aka Levenshtein distance): the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
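The two distances can be illustrated on plain character strings with base R alone. This is only a sketch of the definitions above, not the Biostrings implementation: n_mismatching_letters and edit_distance are hypothetical helper names, and the edit distance is delegated to utils::adist(), which computes the Levenshtein distance with unit costs by default.

```r
## Distance 1: number of mismatching letters.
## Only defined when both strings have the same length.
n_mismatching_letters <- function(p, s) {
  stopifnot(nchar(p) == nchar(s))
  p_letters <- strsplit(p, "", fixed = TRUE)[[1]]
  s_letters <- strsplit(s, "", fixed = TRUE)[[1]]
  sum(p_letters != s_letters)
}

## Distance 2: edit (Levenshtein) distance, computed by utils::adist().
## The two strings may have different lengths.
edit_distance <- function(p, s) {
  as.integer(utils::adist(p, s))
}

n_mismatching_letters("ACGT", "ACTT")  # 1: only the third letter differs
edit_distance("ACGT", "AGT")           # 1: delete the C
```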
The neditStartingAt (and neditEndingAt) function implements these two
distances.
If with.indels is FALSE (the default), then the first distance is used,
i.e. neditStartingAt returns the "number of mismatching letters" between
the pattern P and the substring S' of S starting at the positions
specified in starting.at (note that neditStartingAt and neditEndingAt are
vectorized, so long vectors of integers can be passed through the
starting.at or ending.at arguments).
If with.indels is TRUE, then the "edit distance" is used: for each
position specified in starting.at, P is compared to all the substrings S'
of S starting at this position and the smallest distance is returned.
Note that this distance is guaranteed to be reached for a substring of
length < 2*length(P) so, in practice, P only needs to be compared to a
small number of substrings for every starting position.
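The behaviour described above can be sketched in base R on plain character strings. This is an illustration of the definition, not the package's implementation: nedit_starting_at is a hypothetical name, pattern letters that fall outside the subject are assumed to count as mismatches (an assumption inferred from the at=0:5 example on this page), and with with.indels=TRUE the sketch simply tries every substring starting at the given position instead of limiting itself to short ones.

```r
## Hypothetical base R sketch of the "nedit starting at" computation.
nedit_starting_at <- function(pattern, subject, starting.at = 1L,
                              with.indels = FALSE) {
  np <- nchar(pattern)
  ns <- nchar(subject)
  p_letters <- strsplit(pattern, "", fixed = TRUE)[[1]]
  s_letters <- strsplit(subject, "", fixed = TRUE)[[1]]
  vapply(starting.at, function(start) {
    if (!with.indels) {
      ## Same-length comparison, letter by letter.
      pos <- start + seq_len(np) - 1L
      in_bounds <- pos >= 1L & pos <= ns
      ## Assumption: out-of-bounds pattern letters count as mismatches.
      matched <- in_bounds
      matched[in_bounds] <- p_letters[in_bounds] == s_letters[pos[in_bounds]]
      sum(!matched)
    } else {
      ## Smallest edit distance over all substrings starting at 'start'
      ## (the empty substring is included as a candidate).
      stopifnot(start >= 1L, start <= ns)
      candidates <- c("", substring(subject, start, start:ns))
      as.integer(min(utils::adist(pattern, candidates)))
    }
  }, integer(1))
}

## Vectorized over the starting positions:
nedit_starting_at("AT", "GTATA", 0:5)        # 2 1 2 0 2 1
nedit_starting_at("AT", "GTATA", 0:5) <= 1   # 3 matches with max.mismatch=1

## Allowing indels can only lower the distance:
nedit_starting_at("ACGT", "AACTGG", 2L)                      # 2
nedit_starting_at("ACGT", "AACTGG", 2L, with.indels = TRUE)  # 1
```

The comparison against 1 in the second call mirrors how isMatchingStartingAt is defined in terms of neditStartingAt and max.mismatch.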
neditStartingAt and neditEndingAt: an integer vector of the same length as
starting.at (or ending.at).

isMatchingStartingAt(...) and isMatchingEndingAt(...): the logical vector
defined by neditStartingAt(...) <= max.mismatch or
neditEndingAt(...) <= max.mismatch, respectively.

neditAt and isMatchingAt are convenience wrappers for neditStartingAt and
isMatchingStartingAt, respectively.

mismatch: a list of integer vectors.

nmismatch: an integer vector containing the length of the vectors produced
by mismatch.

coverage: an XRleInteger object indicating the coverage of x in the
interval specified by the start and end arguments.
An integer value called the "coverage" can be associated with each position
in x, indicating how many times this position is covered by the views
or matches stored in x. For example, if x is an XStringViews object, the
coverage of a given position in x is the number of views it belongs to.
If x is an MIndex object, the coverage of a given position in x is the
number of matches (or hits) it belongs to.
Note that the positions in the returned XRleInteger object are to be
interpreted as relative to the interval specified by the start and end
arguments.
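As a rough illustration of this definition, the following base R sketch counts, for each position of an interval, how many views contain it. It is not the coverage method itself: coverage_sketch is a hypothetical helper, the views are given as plain vectors of start and end positions, and the result is an ordinary integer vector rather than a run-length encoded object.

```r
## Hypothetical helper: number of views covering each position
## of the interval [start, end].
coverage_sketch <- function(view_starts, view_ends, start, end) {
  positions <- seq(start, end)
  vapply(positions, function(pos) {
    sum(view_starts <= pos & pos <= view_ends)
  }, integer(1))
}

## Two views, [1, 3] and [2, 5], on a subject of length 6:
coverage_sketch(c(1L, 2L), c(3L, 5L), start = 1L, end = 6L)  # 1 2 2 1 1 0

## Restricting the interval: the first element now refers to position 3.
coverage_sketch(c(1L, 2L), c(3L, 5L), start = 3L, end = 5L)  # 2 1 1
```

The second call shows why the returned positions have to be read relative to the requested interval.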
matchPattern, matchPDict, IUPAC_CODE_MAP, XString-class, XStringViews-class,
MIndex-class, coverage, IRanges-class, MaskCollection-class,
MaskedXString-class, align-utils
## ---------------------------------------------------------------------
## neditAt() / isMatchingAt()
## ---------------------------------------------------------------------
subject <- DNAString("GTATA")

## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
neditAt("AT", subject, at=3)
isMatchingAt("AT", subject, at=3)

## ... but not at position 1
neditAt("AT", subject)
isMatchingAt("AT", subject)

## ... unless we allow 1 mismatching letter (inexact match)
isMatchingAt("AT", subject, max.mismatch=1)

## Here we look at 6 different starting positions and find 3 matches if
## we allow 1 mismatching letter
isMatchingAt("AT", subject, at=0:5, max.mismatch=1)

## No match
neditAt("NT", subject, at=1:4)
isMatchingAt("NT", subject, at=1:4)

## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
neditAt("NT", subject, at=1:4, fixed=FALSE)
isMatchingAt("NT", subject, at=1:4, fixed=FALSE)

## max.mismatch != 0 and fixed=FALSE can be used together
neditAt("NCA", subject, at=0:5, fixed=FALSE)
isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)

some_starts <- c(10:-10, NA, 6)
subject <- DNAString("ACGTGCA")
is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
some_starts[is_matching]

## ---------------------------------------------------------------------
## mismatch() / nmismatch()
## ---------------------------------------------------------------------
m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
mismatch("NCA", m)
nmismatch("NCA", m)

## ---------------------------------------------------------------------
## coverage()
## ---------------------------------------------------------------------
coverage(m)

## See ?matchPDict for examples of using coverage() on an MIndex object...