alphabetFrequency {Biostrings}R Documentation

Function to calculate the frequency of letters in a biological sequence and related functions

Description

Given a biological sequence, the alphabetFrequency function will calculate the frequency of each letter in the (base) alphabet, the dinucleotideFrequency function the frequency of all possible dinucleotides and the trinucleotideFrequency function the frequency of all possible trinucleotides.

More generally, the oligonucleotideFrequency function will calculate the frequency of all possible oligonucleotides of a given length (called the "width" in this particular context).

In this man page we call "DNA input" a DNAString object, or a DNAStringSet object, or an XStringViews object with a DNAString subject, or a MaskedDNAString object. Similarly we call "RNA input" an RNAString object, or an RNAStringSet object, or an XStringViews object with an RNAString subject, or a MaskedRNAString object.

Usage

  alphabetFrequency(x, baseOnly=FALSE, freq=FALSE, ...)
  hasOnlyBaseLetters(x)
  uniqueLetters(x)

  dinucleotideFrequency(x, freq=FALSE, fast.moving.side="right",
                        as.matrix=FALSE, with.labels=TRUE, ...)
  trinucleotideFrequency(x, freq=FALSE, fast.moving.side="right",
                         as.array=FALSE, with.labels=TRUE, ...)
  oligonucleotideFrequency(x, width, freq=FALSE, fast.moving.side="right",
                           as.array=FALSE, with.labels=TRUE, ...)
  oligonucleotideTransitions(x, left=1, right=1, freq=FALSE)

  ## Some related utility functions
  strrev(x)
  mkAllStrings(alphabet, width, fast.moving.side="right")

Arguments

x An XString, XStringSet, XStringViews or MaskedXString object for the *Frequency and uniqueLetters functions.
"DNA or RNA input" for hasOnlyBaseLetters.
A character vector for strrev.
baseOnly TRUE or FALSE. If TRUE, the returned vector only contains frequencies for the letters in the "base" alphabet i.e. "A", "C", "G", "T" if x is a "DNA input", and "A", "C", "G", "U" if x is "RNA input". When x is a BString object (or an XStringViews object with a BString subject, or a BStringSet object), then the baseOnly argument is ignored.
freq If TRUE then frequencies are reported, otherwise counts.
... Further arguments to be passed to or from other methods. For the XStringViews and XStringSet methods, the collapse argument is accepted.
fast.moving.side Which side of the strings should move fastest?
as.matrix If TRUE then return a numeric matrix, otherwise a numeric vector with no dim attribute.
as.array If TRUE then return a numeric array, otherwise a numeric vector with no dim attribute.
with.labels If TRUE then return a named vector (or array).
width The number of nucleotides per oligonucleotide for oligonucleotideFrequency. The number of letters per string for mkAllStrings.
left, right The number of nucleotides per oligonucleotide for the rows and columns respectively in the transition matrix created by oligonucleotideTransitions.
alphabet The alphabet to use to make the strings.

Details

alphabetFrequency and oligonucleotideFrequency are generic functions defined in the Biostrings package with methods defined for BString, DNAString, RNAString, XStringViews and XStringSet objects.

Value

All the *Frequency functions return an integer vector if freq is FALSE (default), otherwise a double vector. If as.matrix or as.array is TRUE, this vector is formatted as a matrix or an array.
For alphabetFrequency: if x is a "DNA or RNA input", then the returned vector is named with the letters in the alphabet (unless with.labels is FALSE). If the baseOnly argument is TRUE, then the returned vector has only 5 elements: 4 elements corresponding to the 4 nucleotides + the 'other' element.
dinucleotideFrequency (resp. trinucleotideFrequency and oligonucleotideFrequency) only works on "DNA or RNA input" and returns a vector named with all the possible dinucleotides (resp. trinucleotides or oligonucleotides).
If x is a multiple sequence input (i.e. an XStringViews or XStringSet object), then the returned object is a matrix (or a list) with the same number of rows (or elements) as x unless collapse=TRUE is specified. In that case the returned vector (or array) contains the frequencies cumulated across all sequences in x.
hasOnlyBaseLetters returns TRUE or FALSE indicating whether or not x contains only base letters (i.e. As, Cs, Gs and Ts for "DNA input" and As, Cs, Gs and Us for "RNA input").
uniqueLetters returns a vector of 1-letter or empty strings. The empty string is used to represent the nul character if x happens to contain any. Note that this can only happen if XString base subtype of x is BString.

Author(s)

H. Pages

See Also

countPDict, XString-class, XStringSet-class, XStringViews-class, MaskedXString-class, reverse,XString-method, rev, strsplit, GENETIC_CODE, AMINO_ACID_CODE

Examples

  data(yeastSEQCHR1)
  yeast1 <- DNAString(yeastSEQCHR1)

  alphabetFrequency(yeast1)
  alphabetFrequency(yeast1, baseOnly=TRUE)
  hasOnlyBaseLetters(yeast1)
  uniqueLetters(yeast1)

  dinucleotideFrequency(yeast1)
  trinucleotideFrequency(yeast1)
  oligonucleotideFrequency(yeast1, 4)

  ## With a multiple sequence input
  library(drosophila2probe)
  x <- DNAStringSet(drosophila2probe$sequence)
  alphabetFrequency(x[1:50], baseOnly=TRUE)
  alphabetFrequency(x, baseOnly=TRUE, collapse=TRUE)

  ## Get the less and most represented 6-mers
  f6 <- oligonucleotideFrequency(yeast1, 6)
  f6[f6 == min(f6)]
  f6[f6 == max(f6)]

  ## Get the result as an array
  tri <- trinucleotideFrequency(yeast1, as.array=TRUE)
  tri["A", "A", "C"] # == trinucleotideFrequency(yeast1)["AAC"]
  tri["T", , ] # frequencies of trinucleotides starting with a "T"

  ## Get nucleotide transition matrices for yeast1
  oligonucleotideTransitions(yeast1)
  oligonucleotideTransitions(yeast1, 2, freq=TRUE)

  ## Note that when dropping the dimensions of the 'tri' array, elements
  ## in the resulting vector are ordered as if they were obtained with
  ## 'fast.moving.side="left"':
  triL <- trinucleotideFrequency(yeast1, fast.moving.side="left")
  all(as.vector(tri) == triL) # TRUE

  ## Convert the trinucleotide frequency into the amino acid frequency based on
  ## translation
  tri1 <- trinucleotideFrequency(yeast1)
  names(tri1) <- GENETIC_CODE[names(tri1)]
  sapply(split(tri1, names(tri1)), sum) # 12512 occurrences of the stop codon

  ## When the returned vector is very long (e.g. width >= 10), using
  ## 'with.labels=FALSE' will improve the performance considerably (100x, 1000x
  ## or more):
  f12 <- oligonucleotideFrequency(yeast1, 12, with.labels=FALSE) # very fast!

  ## Some related utility functions
  dict1 <- mkAllStrings(LETTERS[1:3], 4)
  dict2 <- mkAllStrings(LETTERS[1:3], 4, fast.moving.side="left")
  identical(strrev(dict1), dict2) # TRUE 

[Package Biostrings version 2.10.22 Index]