matchPattern {Biostrings} | R Documentation |
Generic that finds all matches of a pattern in a BString.
matchPattern(pattern, subject, algorithm="auto", mismatch=0, fixed=TRUE)
countPattern(pattern, subject, algorithm="auto", mismatch=0, fixed=TRUE)
mismatch(pattern, x, fixed=TRUE)
pattern |
The pattern string. |
subject |
A BString (or derived) object containing the subject string, or a BStringViews object. |
algorithm |
One of the following: "auto", "naive-exact",
"naive-fuzzy", "boyer-moore" or "shift-or".
|
mismatch |
The number of mismatches allowed. If non-zero, a fuzzy string searching algorithm is used for matching. |
fixed |
A value other than the default (TRUE) can be used only with a
DNAString or RNAString subject.
With fixed=FALSE, ambiguities (i.e. letters from the IUPAC Extended
Genetic Alphabet (see IUPAC_CODE_MAP) that are not from the
base alphabet) in both the pattern and the subject are interpreted as
wildcards, i.e. they match any letter that they stand for.
fixed can also be a character vector that is a subset
of c("pattern", "subject").
fixed=c("pattern", "subject") is equivalent to fixed=TRUE
(the default).
An empty vector is equivalent to fixed=FALSE.
With fixed="subject", only ambiguities in the pattern
are interpreted as wildcards.
With fixed="pattern", only ambiguities in the subject
are interpreted as wildcards.
|
x |
A BStringViews object (typically, one returned
by matchPattern(pattern, subject)).
|
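The effect of the fixed argument described above can be illustrated with a small base-R sketch. It is not the Biostrings implementation: sketch_match and expand are hypothetical helpers written for this illustration, the iupac table is typed in by hand rather than taken from IUPAC_CODE_MAP, and the rule that two letters match when the sets of bases they stand for overlap is an assumption of the sketch.

```r
# Hand-written copy of the IUPAC extended genetic alphabet (assumption:
# same letter-to-bases mapping as IUPAC_CODE_MAP).
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           M = "AC", R = "AG", W = "AT", S = "CG", Y = "CT", K = "GT",
           V = "ACG", H = "ACT", D = "AGT", B = "CGT", N = "ACGT")

# Letters a symbol may stand for: only itself when its side is fixed,
# otherwise its IUPAC expansion.
expand <- function(ch, is_fixed) {
  if (is_fixed) ch else strsplit(iupac[[ch]], "")[[1]]
}

# Hypothetical helper: start positions of all matches of pattern in subject.
sketch_match <- function(pattern, subject, fixed = c("pattern", "subject")) {
  if (isTRUE(fixed)) fixed <- c("pattern", "subject")
  if (identical(fixed, FALSE)) fixed <- character(0)
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_starts <- length(s) - length(p) + 1
  if (n_starts < 1) return(integer(0))
  hits <- integer(0)
  for (start in seq_len(n_starts)) {
    ok <- TRUE
    for (j in seq_along(p)) {
      P <- expand(p[j], "pattern" %in% fixed)
      S <- expand(s[start + j - 1], "subject" %in% fixed)
      # Sketch assumption: letters match when their base sets overlap.
      if (length(intersect(P, S)) == 0) { ok <- FALSE; break }
    }
    if (ok) hits <- c(hits, start)
  }
  hits
}

sketch_match("GCNNNAT", "AAGCGCGATATG")                    # no match
sketch_match("GCNNNAT", "AAGCGCGATATG", fixed = FALSE)     # 3 5
sketch_match("GCNNNAT", "AAGCGCGATATG", fixed = "subject") # 3 5
sketch_match("ACG", "ANG", fixed = "pattern")              # 1
```

With fixed="subject" the subject letters are taken literally and only the N letters of the pattern act as wildcards, which is why the third call finds the same two positions as fixed=FALSE on a subject without ambiguities.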
Available algorithms are: "naive exact", "naive fuzzy",
"Boyer-Moore-like" and "shift-or". Not all of them can be
used in all situations: restrictions depend on the length of
the pattern, the class of the subject and the values of
mismatch and fixed.
When two different algorithms can be used for a given task,
choosing one or the other affects only the performance,
not the result, so strictly speaking there is no "wrong choice".
In short, it is better to just use algorithm="auto"
(the default): this way matchPattern will choose the algorithm
that is best suited for the task.
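The point that the algorithm changes only the speed, not the result, can be checked on a toy scale with the base-R sketch below. Both functions are hypothetical stand-ins written for this illustration and are not the code behind algorithm="naive-exact" or algorithm="shift-or"; the bit-parallel one uses the shift-and formulation (the non-complemented variant of shift-or) and, as an assumption of the sketch, is limited to patterns of at most 30 letters.

```r
# Naive exact scan: compare the pattern with every window of the subject.
naive_exact <- function(pattern, subject) {
  m <- nchar(pattern); n <- nchar(subject)
  if (m == 0 || m > n) return(integer(0))
  starts <- seq_len(n - m + 1)
  starts[substring(subject, starts, starts + m - 1) == pattern]
}

# Bit-parallel scan (shift-and). Bit j-1 of state is set when the first j
# pattern letters match the subject text ending at the current position.
shift_and <- function(pattern, subject) {
  m <- nchar(pattern)
  stopifnot(m >= 1, m <= 30)  # state must fit in a 32-bit integer
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  # masks[[letter]] has bit j-1 set wherever the pattern has that letter.
  masks <- list()
  for (j in seq_len(m)) {
    prev <- if (is.null(masks[[p[j]]])) 0L else masks[[p[j]]]
    masks[[p[j]]] <- bitwOr(prev, bitwShiftL(1L, j - 1L))
  }
  accept <- bitwShiftL(1L, m - 1L)  # bit that signals a full match
  state <- 0L
  hits <- integer(0)
  for (i in seq_along(s)) {
    mask <- masks[[s[i]]]
    if (is.null(mask)) mask <- 0L   # letter absent from the pattern
    state <- bitwAnd(bitwOr(bitwShiftL(state, 1L), 1L), mask)
    if (bitwAnd(state, accept) != 0L) hits <- c(hits, i - m + 1L)
  }
  hits
}

subject <- "AAGCGCGATATG"
naive_exact("GCG", subject)  # 3 5
shift_and("GCG", subject)    # 3 5
```

The two scans visit the subject differently but report the same start positions, including overlapping matches.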
A BStringViews object for matchPattern.
A single integer for countPattern.
A list of integer vectors for mismatch.
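How the three return values relate to each other can be sketched in base R. The helpers fuzzy_starts and mismatch_positions are hypothetical and only mimic the shape of the results; the sketch assumes that mismatch positions are reported relative to the pattern, and it returns plain start positions where matchPattern would return a BStringViews object.

```r
# Hypothetical helper: positions within the pattern that differ from the
# subject window beginning at start.
mismatch_positions <- function(pattern, subject, start) {
  p <- strsplit(pattern, "")[[1]]
  w <- strsplit(substr(subject, start, start + length(p) - 1), "")[[1]]
  which(p != w)
}

# Hypothetical helper: starts of all windows with at most k mismatches.
fuzzy_starts <- function(pattern, subject, k = 0) {
  n_starts <- nchar(subject) - nchar(pattern) + 1
  if (n_starts < 1) return(integer(0))
  starts <- seq_len(n_starts)
  n_mis <- vapply(starts,
                  function(s) length(mismatch_positions(pattern, subject, s)),
                  integer(1))
  starts[n_mis <= k]
}

subject <- "AAGCGCGATATG"
pattern <- "GCGAT"

starts <- fuzzy_starts(pattern, subject, k = 2)  # analogous to matchPattern
n_matches <- length(starts)                      # analogous to countPattern
mis <- lapply(starts,                            # analogous to mismatch
              function(s) mismatch_positions(pattern, subject, s))

starts     # 3 5 7
n_matches  # 3
mis[[1]]   # 4 5
```

The count is simply the number of matches found, and the list of mismatch positions has one integer vector per match, empty for an exact match.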
matchLRPatterns, matchProbePair, mask, alphabetFrequency, IUPAC_CODE_MAP, BStringViews-class, DNAString-class
## A simple fuzzy matching example with a short subject
x <- DNAString("AAGCGCGATATG")
m1 <- matchPattern("GCNNNAT", x)
m1
m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
m2
as.matrix(m2)

## With DNA sequence of yeast chromosome number 1
data(yeastSEQCHR1)
yeast1 <- DNAString(yeastSEQCHR1)
PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
match2.PpiI <- matchPattern(PpiI, yeast1, mismatch=1, fixed=FALSE)

## With a genome containing isolated Ns
library(BSgenome.Celegans.UCSC.ce2)
chrII <- Celegans[["chrII"]]
alphabetFrequency(chrII)
matchPattern("N", chrII)
matchPattern("TGGGTGTCTTT", chrII) # no match
matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

## Using wildcards ("N") in the pattern on a genome containing N-blocks
library(BSgenome.Dmelanogaster.FlyBase.r51)
chrX <- Dmelanogaster[["X"]]
noN_chrX <- mask(chrX, "N")
mask(noN_chrX) # See the N-blocks?
matchPattern("TTTATGNTTGGTA", noN_chrX, fixed=FALSE)
## Can also be achieved with
matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")