matchPattern {Biostrings}    R Documentation

String searching functions

Description

Generic functions that find (matchPattern) or count (countPattern) all matches of a pattern in a BString (or derived) object.

Usage

  matchPattern(pattern, subject, algorithm="auto", mismatch=0, fixed=TRUE)
  countPattern(pattern, subject, algorithm="auto", mismatch=0, fixed=TRUE)
  mismatch(pattern, x, fixed=TRUE)

Arguments

pattern The pattern string.
subject A BString (or derived) object containing the subject string, or a BStringViews object.
algorithm One of the following: "auto", "naive-exact", "naive-fuzzy", "boyer-moore" or "shift-or".
mismatch The number of mismatches allowed. If non-zero, a fuzzy string searching algorithm is used for matching.
fixed A value other than the default (TRUE) can be used only with a DNAString or RNAString subject.
With fixed=FALSE, ambiguities in the pattern _and_ in the subject are interpreted as wildcards, i.e. they match any letter that they stand for. Ambiguities are the letters of the IUPAC Extended Genetic Alphabet (see IUPAC_CODE_MAP) that are not in the base alphabet.
fixed can also be a character vector that is a subset of c("pattern", "subject"). fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default). An empty vector is equivalent to fixed=FALSE. With fixed="subject", only ambiguities in the pattern are interpreted as wildcards. With fixed="pattern", only ambiguities in the subject are interpreted as wildcards.
x A BStringViews object (typically, one returned by matchPattern(pattern, subject)).
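
The effect of fixed on a single letter comparison can be pictured with a base R sketch. This is not the package's implementation: the names iupac, letter_match and naive_match are hypothetical, the code map is abbreviated, the helper accepts only the character-vector form of fixed, and the rule used when an ambiguity faces another ambiguity is an assumption made for this illustration.

```r
## Illustrative sketch only, not Biostrings code. All names are hypothetical.
## Abbreviated IUPAC map: each letter -> the base letters it stands for.
iupac <- list(A="A", C="C", G="G", T="T",
              R=c("A", "G"), Y=c("C", "T"),
              N=c("A", "C", "G", "T"))

## Does pattern letter p match subject letter s?
## 'fixed' is the subset of c("pattern", "subject") that is taken literally.
letter_match <- function(p, s, fixed=c("pattern", "subject")) {
  P <- iupac[[p]]
  S <- iupac[[s]]
  wild_p <- !("pattern" %in% fixed)
  wild_s <- !("subject" %in% fixed)
  if (wild_p && wild_s) return(length(intersect(P, S)) > 0)  # assumption
  if (wild_p) return(all(S %in% P))  # pattern ambiguity covers the subject letter
  if (wild_s) return(all(P %in% S))  # subject ambiguity covers the pattern letter
  p == s                             # both sides literal
}

## Naive scan: start positions where every pattern letter matches.
naive_match <- function(pattern, subject, fixed=c("pattern", "subject")) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n_windows <- length(s) - length(p) + 1
  if (n_windows < 1) return(integer(0))
  hits <- vapply(seq_len(n_windows), function(i) {
    all(mapply(letter_match, p, s[i + seq_along(p) - 1],
               MoreArgs=list(fixed=fixed)))
  }, logical(1))
  which(hits)
}

naive_match("GCNNNAT", "AAGCGCGATATG")                      # integer(0): N is literal
naive_match("GCNNNAT", "AAGCGCGATATG", fixed=character(0))  # 3 5: N is a wildcard
```

Because the subject here contains no ambiguity letters, fixed="subject" gives the same start positions as the empty vector, while fixed="pattern" gives none.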

Details

Available algorithms are: "naive exact", "naive fuzzy", "Boyer-Moore-like" and "shift-or". Not all of them can be used in all situations: restrictions depend on the length of the pattern, the class of the subject and the values of mismatch and fixed.

When two different algorithms can be used for a given task, the choice affects only the performance, not the result, so strictly speaking there is no "wrong choice". In short, it is better to use algorithm="auto" (the default), which lets matchPattern choose the algorithm best suited to the task.
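
The point that the algorithm changes the cost but not the answer can be illustrated in base R with two exact-search strategies. This is a sketch under stated assumptions, not the package's code: find_naive and find_horspool are hypothetical names, Boyer-Moore-Horspool is used here as a simplified stand-in for a "Boyer-Moore-like" search, and letters are assumed to be plain ASCII.

```r
## Illustrative sketch only, not Biostrings code. Function names are hypothetical.

## Strategy 1: compare the pattern against every window of the subject.
find_naive <- function(pattern, subject) {
  p <- utf8ToInt(pattern)
  s <- utf8ToInt(subject)
  m <- length(p); n <- length(s)
  if (m == 0 || m > n) return(integer(0))
  starts <- seq_len(n - m + 1)
  starts[vapply(starts, function(i) all(s[i:(i + m - 1)] == p), logical(1))]
}

## Strategy 2: Boyer-Moore-Horspool, which skips ahead using a shift table
## built from the pattern (assumes ASCII letters, codes below 256).
find_horspool <- function(pattern, subject) {
  p <- utf8ToInt(pattern)
  s <- utf8ToInt(subject)
  m <- length(p); n <- length(s)
  if (m == 0 || m > n) return(integer(0))
  shift <- rep(m, 256)                     # default: jump a full pattern length
  for (j in seq_len(m - 1)) shift[p[j] + 1] <- m - j
  hits <- integer(0)
  i <- 1L
  while (i <= n - m + 1) {
    if (all(s[i:(i + m - 1)] == p)) hits <- c(hits, i)
    i <- i + shift[s[i + m - 1] + 1]       # shift by the window's last letter
  }
  hits
}

identical(find_naive("GCG", "AAGCGCGATATG"),
          find_horspool("GCG", "AAGCGCGATATG"))  # TRUE: both find starts 3 and 5
```

The second function inspects fewer windows on most inputs, yet both return the same start positions, which is the sense in which the choice affects only performance.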

Value

A BStringViews object for matchPattern.
A single integer for countPattern.
A list of integer vectors for mismatch.
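
The "list of integer vectors" shape, and the role of the mismatch argument, can be pictured with a base R sketch. It is only an illustration: mismatch_positions and fuzzy_starts are hypothetical helpers that work on plain strings and start positions rather than on a BStringViews object, comparison is literal (the fixed=TRUE case), and reporting positions relative to the pattern is an assumption.

```r
## Illustrative sketch only, not Biostrings code. Names are hypothetical.
## For each window start, return the pattern positions that differ from
## the subject (literal comparison).
mismatch_positions <- function(pattern, subject, starts) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  lapply(starts, function(i) which(p != s[i + seq_along(p) - 1]))
}

## Keep the windows with at most max_mismatch differing letters.
fuzzy_starts <- function(pattern, subject, max_mismatch=0) {
  n_windows <- nchar(subject) - nchar(pattern) + 1
  if (n_windows < 1) return(integer(0))
  starts <- seq_len(n_windows)
  mm <- mismatch_positions(pattern, subject, starts)
  starts[lengths(mm) <= max_mismatch]
}

starts <- fuzzy_starts("GCGAT", "AAGCGCGATATG", max_mismatch=2)
starts                                                # 3 5 7
mismatch_positions("GCGAT", "AAGCGCGATATG", starts)   # one integer vector per window
```

With max_mismatch=0 only the exact occurrence at position 5 remains, and its vector of mismatch positions is empty.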

See Also

matchLRPatterns, matchProbePair, mask, alphabetFrequency, IUPAC_CODE_MAP, BStringViews-class, DNAString-class

Examples

  ## A simple wildcard matching example with a short subject
  x <- DNAString("AAGCGCGATATG")
  m1 <- matchPattern("GCNNNAT", x)
  m1
  m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
  m2
  as.matrix(m2)

  ## With DNA sequence of yeast chromosome number 1
  data(yeastSEQCHR1)
  yeast1 <- DNAString(yeastSEQCHR1)
  PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
  match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
  match2.PpiI <- matchPattern(PpiI, yeast1, mismatch=1, fixed=FALSE)

  ## With a genome containing isolated Ns
  library(BSgenome.Celegans.UCSC.ce2)
  chrII <- Celegans[["chrII"]]
  alphabetFrequency(chrII)
  matchPattern("N", chrII)
  matchPattern("TGGGTGTCTTT", chrII) # no match
  matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

  ## Using wildcards ("N") in the pattern on a genome containing N-blocks
  library(BSgenome.Dmelanogaster.FlyBase.r51)
  chrX <- Dmelanogaster[["X"]]
  noN_chrX <- mask(chrX, "N")
  mask(noN_chrX) # See the N-blocks?
  matchPattern("TTTATGNTTGGTA", noN_chrX, fixed=FALSE)
  ## Can also be achieved with
  matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")

[Package Biostrings version 2.6.6 Index]