alphabetFrequency {Biostrings} | R Documentation |
Given a biological sequence, the alphabetFrequency
function will
calculate the frequency of each letter in the (base) alphabet,
the dinucleotideFrequency
function the frequency of all possible
dinucleotides and the trinucleotideFrequency
function the frequency
of all possible trinucleotides.
More generally, the oligonucleotideFrequency
function will
calculate the frequency of all possible oligonucleotides of a given
length (called the "width" in this particular context).
In this man page we call "DNA input" a DNAString object, or a DNAStringSet object, or an XStringViews object with a DNAString subject, or a MaskedDNAString object. Similarly we call "RNA input" an RNAString object, or an RNAStringSet object, or an XStringViews object with an RNAString subject, or a MaskedRNAString object.
alphabetFrequency(x, baseOnly=FALSE, freq=FALSE, ...) hasOnlyBaseLetters(x) uniqueLetters(x) dinucleotideFrequency(x, freq=FALSE, fast.moving.side="right", as.matrix=FALSE, with.labels=TRUE, ...) trinucleotideFrequency(x, freq=FALSE, fast.moving.side="right", as.array=FALSE, with.labels=TRUE, ...) oligonucleotideFrequency(x, width, freq=FALSE, fast.moving.side="right", as.array=FALSE, with.labels=TRUE, ...) oligonucleotideTransitions(x, left=1, right=1, freq=FALSE) ## Some related utility functions strrev(x) mkAllStrings(alphabet, width, fast.moving.side="right")
x |
An XString, XStringSet, XStringViews
or MaskedXString object for the *Frequency and
uniqueLetters functions.
"DNA or RNA input" for hasOnlyBaseLetters .
A character vector for strrev .
|
baseOnly |
TRUE or FALSE .
If TRUE , the returned vector only contains frequencies for the
letters in the "base" alphabet i.e. "A", "C", "G", "T" if x
is a "DNA input", and "A", "C", "G", "U" if x is "RNA input".
When x is a BString object (or an XStringViews
object with a BString subject, or a BStringSet object),
then the baseOnly argument is ignored.
|
freq |
If TRUE then frequencies are reported, otherwise counts.
|
... |
Further arguments to be passed to or from other methods.
For the XStringViews and XStringSet methods,
the collapse argument is accepted.
|
fast.moving.side |
Which side of the strings should move fastest? |
as.matrix |
If TRUE then return a numeric matrix, otherwise a numeric
vector with no dim attribute.
|
as.array |
If TRUE then return a numeric array, otherwise a numeric
vector with no dim attribute.
|
with.labels |
If TRUE then return a named vector (or array).
|
width |
The number of nucleotides per oligonucleotide for
oligonucleotideFrequency .
The number of letters per string for mkAllStrings .
|
left, right |
The number of nucleotides per oligonucleotide for the rows and columns
respectively in the transition matrix created by oligonucleotideTransitions .
|
alphabet |
The alphabet to use to make the strings. |
alphabetFrequency
and oligonucleotideFrequency
are generic
functions defined in the Biostrings package with methods defined for
BString, DNAString, RNAString, XStringViews
and XStringSet objects.
All the *Frequency
functions return an integer vector if freq
is FALSE
(default), otherwise a double vector.
If as.matrix
or as.array
is TRUE
, this vector is
formatted as a matrix or an array.
For alphabetFrequency
: if x
is a "DNA or RNA input", then the
returned vector is named with the letters in the alphabet (unless
with.labels
is FALSE
).
If the baseOnly
argument is TRUE
, then the returned vector
has only 5 elements: 4 elements corresponding to the 4 nucleotides
+ the 'other' element.
dinucleotideFrequency
(resp. trinucleotideFrequency
and
oligonucleotideFrequency
) only works on "DNA or RNA input"
and returns a vector named with all the possible dinucleotides
(resp. trinucleotides or oligonucleotides).
If x
is a multiple sequence input (i.e. an XStringViews or
XStringSet object), then the returned object is a matrix (or a list)
with the same number of rows (or elements) as x
unless collapse=TRUE
is specified. In that case the returned vector (or array) contains the frequencies
cumulated across all sequences in x
.
hasOnlyBaseLetters
returns TRUE
or FALSE
indicating
whether or not x
contains only base letters (i.e. As, Cs, Gs and Ts
for "DNA input" and As, Cs, Gs and Us for "RNA input").
uniqueLetters
returns a vector of 1-letter or empty strings. The empty
string is used to represent the nul character if x
happens to contain
any. Note that this can only happen if XString base subtype of x
is BString.
H. Pages
countPDict
,
XString-class,
XStringSet-class,
XStringViews-class,
MaskedXString-class,
reverse,XString-method
,
rev
,
strsplit
,
GENETIC_CODE
,
AMINO_ACID_CODE
data(yeastSEQCHR1) yeast1 <- DNAString(yeastSEQCHR1) alphabetFrequency(yeast1) alphabetFrequency(yeast1, baseOnly=TRUE) hasOnlyBaseLetters(yeast1) uniqueLetters(yeast1) dinucleotideFrequency(yeast1) trinucleotideFrequency(yeast1) oligonucleotideFrequency(yeast1, 4) ## With a multiple sequence input library(drosophila2probe) x <- DNAStringSet(drosophila2probe$sequence) alphabetFrequency(x[1:50], baseOnly=TRUE) alphabetFrequency(x, baseOnly=TRUE, collapse=TRUE) ## Get the less and most represented 6-mers f6 <- oligonucleotideFrequency(yeast1, 6) f6[f6 == min(f6)] f6[f6 == max(f6)] ## Get the result as an array tri <- trinucleotideFrequency(yeast1, as.array=TRUE) tri["A", "A", "C"] # == trinucleotideFrequency(yeast1)["AAC"] tri["T", , ] # frequencies of trinucleotides starting with a "T" ## Get nucleotide transition matrices for yeast1 oligonucleotideTransitions(yeast1) oligonucleotideTransitions(yeast1, 2, freq=TRUE) ## Note that when dropping the dimensions of the 'tri' array, elements ## in the resulting vector are ordered as if they were obtained with ## 'fast.moving.side="left"': triL <- trinucleotideFrequency(yeast1, fast.moving.side="left") all(as.vector(tri) == triL) # TRUE ## Convert the trinucleotide frequency into the amino acid frequency based on ## translation tri1 <- trinucleotideFrequency(yeast1) names(tri1) <- GENETIC_CODE[names(tri1)] sapply(split(tri1, names(tri1)), sum) # 12512 occurrences of the stop codon ## When the returned vector is very long (e.g. width >= 10), using ## 'with.labels=FALSE' will improve the performance considerably (100x, 1000x ## or more): f12 <- oligonucleotideFrequency(yeast1, 12, with.labels=FALSE) # very fast! ## Some related utility functions dict1 <- mkAllStrings(LETTERS[1:3], 4) dict2 <- mkAllStrings(LETTERS[1:3], 4, fast.moving.side="left") identical(strrev(dict1), dict2) # TRUE