getSeq {BSgenome} | R Documentation |
A convenience function for extracting a set of sequences (or subsequences) from a BSgenome object.
getSeq(bsgenome, names, start=NA, end=NA, width=NA, as.character=TRUE)
bsgenome |
A BSgenome object.
See the available.genomes function for how to install
a genome.
|
names |
The names of the sequences to extract from bsgenome .
If missing, then seqnames(bsgenome) is used.
See ?seqnames and ?mseqnames to get
the list of single sequences and multiple sequences (respectively)
contained in bsgenome .
Here is how the lookup between the names passed to the names
argument and the sequences in bsgenome is performed.
For each name in names :
(1) if bsgenome contains a single sequence with that name
then this sequence is returned;
(2) otherwise the names of all the elements in all the multiple
sequences are searched: name is treated as a regular
expression and grep is used for this search.
If exactly one sequence is found, then it's returned, otherwise an
error is raised.
|
start, end, width |
Specify these arguments only if you don't want to extract the
entire sequences.
Then the subsequences specified by start , end
and width (single integers or NAs) will be extracted
by a call to subseq before they are
returned by getSeq .
|
as.character |
TRUE or FALSE . Should the extracted sequences
be returned in a standard character vector?
|
A standard character vector when as.character=TRUE
.
Note that when as.character=TRUE
, then the masks that
are defined on top of the sequences to extract are ignored if
any (see ?`MaskedXString-class`
for more information about masked sequences).
A DNAString or MaskedDNAString
object when as.character=FALSE
.
Note that as.character=FALSE
is not supported when more
than one sequence name is supplied.
Be aware that using as.character=TRUE
can be very inefficient
when the returned character vector contains very long strings
(> 1 million letters) or is itself a long vector (> 10000 strings).
getSeq
is much more efficient when used with
as.character=FALSE
but this works only for extracting
one sequence at a time for now.
H. Pages; improvements suggested by Matt Settles
available.genomes
,
BSgenome-class,
seqnames
,
mseqnames
,
grep
,
subseq
,
DNAString
,
MaskedDNAString
,
[[,BSgenome-method
# Load the Caenorhabditis elegans genome (UCSC Release ce2): library(BSgenome.Celegans.UCSC.ce2) # Look at the index of sequences: Celegans # Get chromosome V as a DNAString object: getSeq(Celegans, "chrV", as.character=FALSE) # which is in fact the same as doing: Celegans$chrV # Never try this: #getSeq(Celegans, "chrV") # or this (even worse): #getSeq(Celegans) # Get the first 20 bases of each chromosome: getSeq(Celegans, end=20) # Get the last 20 bases of each chromosome: getSeq(Celegans, start=-20) # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000 # multiple sequence) as a character string: s1 <- getSeq(Celegans, "NM_058280_up_1000") # or a DNAString object (more efficient): s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE) getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1 # TRUE getSeq(Celegans, "NM_058280_up_5000", start=-1000, as.character=FALSE) == s2 # TRUE