match-utils {Biostrings}    R Documentation

Utility functions related to pattern matching

Description

This man page precisely defines and illustrates what a "match" of a pattern P in a subject S is in the context of the Biostrings package. This definition of a "match" is central to most pattern matching functions available in this package: unless specified otherwise, they adhere to the definition provided here.

neditStartingAt, neditEndingAt, isMatchingStartingAt and isMatchingEndingAt are low-level functions that implement some basic concepts. Once these concepts are understood, we can use them to provide a simple and concise definition of a "match".

Other utility functions related to pattern matching are described here: the mismatch function for getting the positions of the mismatching letters of a given pattern relative to its matches in a given subject, the nmatch and nmismatch functions for getting the number of matching and mismatching letters produced by the mismatch function, and the coverage function that can be used to get the "coverage" of a subject by a given pattern or set of patterns.

Usage

  neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE)
  neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE)
  neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE)

  isMatchingStartingAt(pattern, subject, starting.at=1,
                  max.mismatch=0, with.indels=FALSE, fixed=TRUE)
  isMatchingEndingAt(pattern, subject, ending.at=1,
                  max.mismatch=0, with.indels=FALSE, fixed=TRUE)
  isMatchingAt(pattern, subject, at=1,
                  max.mismatch=0, with.indels=FALSE, fixed=TRUE)

  mismatch(pattern, x, fixed=TRUE)
  nmatch(pattern, x, fixed=TRUE)
  nmismatch(pattern, x, fixed=TRUE)
  ## S4 method for signature 'MIndex':
  coverage(x, start=NA, end=NA)
  ## S4 method for signature 'XStringViews':
  coverage(x, start=NA, end=NA, weight=1L)
  ## S4 method for signature 'MaskedXString':
  coverage(x, start=NA, end=NA, weight=1L)

Arguments

pattern The pattern string.
subject An XString object (or character vector) containing the subject sequence.
starting.at, ending.at, at An integer vector specifying the starting (for starting.at and at) or ending (for ending.at) positions of the pattern relative to the subject.
max.mismatch See details below.
with.indels See details below.
fixed Only with a DNAString or RNAString subject can a fixed value other than the default (TRUE) be used.
With fixed=FALSE, ambiguities (i.e. letters from the IUPAC Extended Genetic Alphabet (see IUPAC_CODE_MAP) that are not from the base alphabet) in the pattern _and_ in the subject are interpreted as wildcards i.e. they match any letter that they stand for.
fixed can also be a character vector, a subset of c("pattern", "subject"). fixed=c("pattern", "subject") is equivalent to fixed=TRUE (the default). An empty vector is equivalent to fixed=FALSE. With fixed="subject", ambiguities in the pattern only are interpreted as wildcards. With fixed="pattern", ambiguities in the subject only are interpreted as wildcards.
x An XStringViews object for mismatch (typically, one returned by matchPattern(pattern, subject)).
Typically an XStringViews or MIndex object for coverage but IRanges, MaskCollection and MaskedXString objects are accepted too.
start, end Two single integers specifying where to start and end the extraction of the coverage in x.
weight An integer vector specifying how much each element in x counts.
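
To make the effect of the fixed argument concrete, here is a minimal base R sketch of wildcard matching for the simplest case only: ambiguities in the pattern are treated as wildcards and the subject contains base letters only. The names iupac, letter_matches and n_mismatch_wild are hypothetical helpers introduced for illustration, and iupac is a hand-copied subset of the table that Biostrings provides as IUPAC_CODE_MAP. This is not how the package implements matching.

```r
# Hand-copied subset of the IUPAC extended genetic alphabet (assumption:
# a local stand-in for IUPAC_CODE_MAP, which holds the full table).
iupac <- c(A = "A", C = "C", G = "G", T = "T", R = "AG", Y = "CT", N = "ACGT")

# TRUE if pattern letter `p` stands for base letter `s`.
letter_matches <- function(p, s) grepl(s, iupac[[p]], fixed = TRUE)

# Number of mismatching letters between two strings of equal length when
# ambiguities in the pattern are interpreted as wildcards.
n_mismatch_wild <- function(P, S) {
  p <- strsplit(P, "")[[1]]
  s <- strsplit(S, "")[[1]]
  stopifnot(length(p) == length(s))
  sum(!mapply(letter_matches, p, s))
}

n_mismatch_wild("NT", "GT")  # N stands for G, T equals T
n_mismatch_wild("NT", "TA")  # N stands for T, but T differs from A
```

With this sketch, "NT" has 0 mismatches against "GT" and 1 mismatch against "TA".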

Details

A "match" of pattern P in subject S is a substring S' of S that is considered similar enough to P according to some distance (or metric) specified by the user. Two distances are supported by most pattern matching functions in the Biostrings package. The first (and simplest) one is the "number of mismatching letters". It is defined only when the two strings to compare have the same length, so when this distance is used, only matches that have the same number of letters as P are considered. The second one is the "edit distance" (aka Levenshtein distance): it is the minimum number of operations needed to transform P into S', where an operation is an insertion, deletion, or substitution of a single letter. When this metric is used, matches can have a different number of letters than P.
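
The two distances can be illustrated with plain character strings in base R. This is only a sketch of the definitions above, not the Biostrings implementation: n_mismatch is a hypothetical helper, and adist() is the generalized Levenshtein distance function from the utils package.

```r
# Number of mismatching letters: defined only for strings of equal length.
# (Hypothetical helper, not a Biostrings function.)
n_mismatch <- function(p, s) {
  stopifnot(nchar(p) == nchar(s))
  sum(strsplit(p, "")[[1]] != strsplit(s, "")[[1]])
}

n_mismatch("ACGT", "ACCT")        # the strings differ at position 3 only

# Edit distance: the strings may have different lengths.
as.integer(adist("ACGT", "ACT"))  # deleting "G" turns "ACGT" into "ACT"
```

Both calls return 1: one substitution in the first case, one deletion in the second.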

The neditStartingAt and neditEndingAt functions implement these two distances. If with.indels is FALSE (the default), the first distance is used, i.e. neditStartingAt returns the "number of mismatching letters" between the pattern P and the substring S' of S starting at each of the positions specified in starting.at (note that neditStartingAt and neditEndingAt are vectorized, so long vectors of integers can be passed through the starting.at or ending.at arguments). If with.indels is TRUE, the "edit distance" is used: for each position specified in starting.at, P is compared to all the substrings S' of S starting at this position and the smallest distance is returned. Note that this distance is guaranteed to be reached for a substring of length < 2*length(P) so, in practice, P only needs to be compared to a small number of substrings for every starting position.
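
The with.indels=TRUE case described above can be sketched naively in base R: enumerate every non-empty substring of S that starts at a given position and keep the smallest edit distance to P. The function nedit_starting_at below is a hypothetical helper that assumes a single start position lying within S; it follows the textual definition literally and says nothing about how the package computes the result or handles out-of-bounds positions.

```r
# Smallest edit distance between P and any non-empty substring of S that
# starts at position `start` (1-based, assumed to lie within S).
nedit_starting_at <- function(P, S, start) {
  ends <- start:nchar(S)
  candidates <- substring(S, start, ends)  # all substrings starting at `start`
  min(adist(P, candidates))
}

S <- "TTACTGG"
# Substrings starting at position 3: "A", "AC", "ACT", "ACTG", "ACTGG".
nedit_starting_at("ACGT", S, 3)

# Without indels, P is compared only to the same-length substring "ACTG".
sum(strsplit("ACGT", "")[[1]] != strsplit(substring(S, 3, 6), "")[[1]])
```

Here the edit distance is 1 (reached for "ACT"), while the number of mismatching letters against "ACTG" is 2, which shows how allowing indels can lower the distance.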

Value

neditStartingAt and neditEndingAt: an integer vector of the same length as starting.at (or ending.at).
isMatchingStartingAt(...) and isMatchingEndingAt(...): the logical vector defined by neditStartingAt(...) <= max.mismatch or neditEndingAt(...) <= max.mismatch, respectively.
neditAt and isMatchingAt are convenience wrappers for neditStartingAt and isMatchingStartingAt, respectively.
mismatch: a list of integer vectors.
nmismatch: an integer vector containing the lengths of the vectors produced by mismatch.
coverage: an XRleInteger object indicating the coverage of x in the interval specified by the start and end arguments. An integer value called the "coverage" can be associated with each position in x, indicating how many times this position is covered by the views or matches stored in x. For example, if x is an XStringViews object, the coverage of a given position in x is the number of views it belongs to. If x is an MIndex object, the coverage of a given position in x is the number of matches (or hits) it belongs to. Note that the positions in the returned XRleInteger object are to be interpreted as relative to the interval specified by the start and end arguments.
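
The notion of coverage can be sketched in base R with plain integer vectors. The function coverage_sketch is a hypothetical helper that takes the start and end positions of some views and counts, for each position in a requested interval, how many views contain it. Base R's rle() is used here only as a loose analogue of a run-length encoded result; it is not the XRleInteger class.

```r
# For each position in from:to, count the views [starts[i], ends[i]]
# that contain it.
coverage_sketch <- function(starts, ends, from, to) {
  positions <- from:to
  vapply(positions, function(i) sum(starts <= i & i <= ends), integer(1))
}

# Three overlapping views: [1,4], [3,5] and [4,6], looked at over 1:7.
cov <- coverage_sketch(c(1, 3, 4), c(4, 5, 6), from = 1, to = 7)
cov       # 1 1 2 3 2 1 0
rle(cov)  # run-length encoded form of the same vector
```

Position 4 belongs to all three views and so has coverage 3, while position 7 belongs to none. As in the description above, element k of the result refers to position from + k - 1, i.e. positions are relative to the requested interval.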

See Also

matchPattern, matchPDict, IUPAC_CODE_MAP, XString-class, XStringViews-class, MIndex-class, coverage, IRanges-class, MaskCollection-class, MaskedXString-class, align-utils

Examples

  ## ---------------------------------------------------------------------
  ## neditAt() / isMatchingAt()
  ## ---------------------------------------------------------------------
  subject <- DNAString("GTATA")

  ## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
  neditAt("AT", subject, at=3)
  isMatchingAt("AT", subject, at=3)

  ## ... but not at position 1
  neditAt("AT", subject)
  isMatchingAt("AT", subject)

  ## ... unless we allow 1 mismatching letter (inexact match)
  isMatchingAt("AT", subject, max.mismatch=1)

  ## Here we look at 6 different starting positions and find 3 matches if
  ## we allow 1 mismatching letter
  isMatchingAt("AT", subject, at=0:5, max.mismatch=1)

  ## No match
  neditAt("NT", subject, at=1:4)
  isMatchingAt("NT", subject, at=1:4)

  ## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
  neditAt("NT", subject, at=1:4, fixed=FALSE)
  isMatchingAt("NT", subject, at=1:4, fixed=FALSE)

  ## max.mismatch != 0 and fixed=FALSE can be used together
  neditAt("NCA", subject, at=0:5, fixed=FALSE)
  isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)

  some_starts <- c(10:-10, NA, 6)
  subject <- DNAString("ACGTGCA")
  is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
  some_starts[is_matching]

  ## ---------------------------------------------------------------------
  ## mismatch() / nmismatch()
  ## ---------------------------------------------------------------------
  m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
  mismatch("NCA", m)
  nmismatch("NCA", m)

  ## ---------------------------------------------------------------------
  ## coverage()
  ## ---------------------------------------------------------------------
  coverage(m)

  ## See ?matchPDict for examples of using coverage() on an MIndex object...

[Package Biostrings version 2.10.22 Index]