BSgenome-class {BSgenome} | R Documentation |
A container for the complete genome sequence of a given species.
In the code snippets below, x is a BSgenome object and name is the name of a sequence (character string).
organism(x):
Return the target organism for this genome, e.g. "Homo sapiens", "Mus musculus", "Caenorhabditis elegans", etc.
species(x):
Return the target species for this genome, e.g. "Human", "Mouse", "C. elegans", etc.
provider(x):
Return the provider of this genome, e.g. "UCSC", "BDGP", "FlyBase", etc.
providerVersion(x):
Return the provider-side version of this genome. For example, UCSC uses versions "hg18", "hg17", etc. for the different Builds of the Human genome.
releaseDate(x):
Return the release date of this genome, e.g. "Mar. 2006".
releaseName(x):
Return the release name of this genome, which is generally made of the name of the organization that assembled it plus its Build version. For example, UCSC uses "hg18" for the version of the Human genome corresponding to Build 36.1 from NCBI, hence the release name for this genome is "NCBI Build 36.1".
sourceUrl(x):
Return the source URL, i.e. the permanent URL of the place where the FASTA files used to produce the sequences contained in x can be found (and downloaded).
seqnames(x):
Return the index of the single sequences contained in x. Each single sequence is stored in a BString (or derived) object and comes from a source file (FASTA) with a single record. The names returned by seqnames(x) usually reflect the names of those source files, but a common prefix or suffix may have been removed to keep them as short as possible.
mseqnames(x):
Return the index of the multiple sequences contained in x. Each multiple sequence is stored in a BStringViews object and comes from a source file (FASTA) with multiple records. The names returned by mseqnames(x) usually reflect the names of those source files, but a common prefix or suffix may have been removed to keep them as short as possible.
names(x):
Return the index of all sequences contained in x. This is the same as c(seqnames(x), mseqnames(x)).
length(x):
Return the length of x, i.e. the number of all sequences that it contains. This is the same as length(names(x)).
x[[name]]:
Return the sequence (single or multiple) named name. No sequence is actually loaded into memory until this is explicitly requested with a call to x[[name]] or x$name.
x$name:
Same as x[[name]], but name is not evaluated and therefore must be a literal character string or a name (possibly backtick-quoted).
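The difference in how name is handled by x[[name]] and x$name mirrors the behavior of base R's own [[ and $ operators. The following is a minimal sketch that uses a plain list as a stand-in; the list genome and its contents are hypothetical toy data, not a BSgenome object.

```r
# A plain list standing in for a BSgenome object (hypothetical toy data).
genome <- list(chrI = "ACGT", chrII = "GGCC")

name <- "chrII"

# [[ evaluates its argument, so the variable `name` is looked up first.
via_brackets <- genome[[name]]

# $ does not evaluate its argument: this looks for an element literally
# called "name". The toy list has no such element, so the result is NULL.
via_dollar <- genome$name

# With $, the sequence name is written literally (optionally backtick-quoted).
via_literal <- genome$`chrII`

print(via_brackets)
print(is.null(via_dollar))
print(via_literal)
```

A plain list returns NULL for an unknown element; a BSgenome object may react differently to an unknown sequence name, which this sketch does not attempt to show.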
In the code snippets below, x is a BSgenome object and name is the name of a sequence (character string).
unload(x, name):
Try to free the memory occupied by a loaded sequence by removing the first reference to this sequence. This first reference is a hidden reference that is created behind the scenes by x[[name]] or x$name. See below for an example of how to make proper use of unload().
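The idea of a hidden first reference can be pictured with a toy model written in base R. This is only a sketch under stated assumptions: make_container, its get/unload/loaded members, and the loader function are all hypothetical names invented for illustration, and nothing here is BSgenome's actual implementation.

```r
# Toy model of a lazily loading container (hypothetical; not BSgenome's code).
make_container <- function(loaders) {
  cache <- new.env(parent = emptyenv())  # holds the hidden references
  list(
    get = function(name) {
      if (!exists(name, envir = cache, inherits = FALSE)) {
        # First access: run the loader and keep a hidden reference.
        assign(name, loaders[[name]](), envir = cache)
      }
      get(name, envir = cache, inherits = FALSE)
    },
    unload = function(name) {
      # Drop the hidden reference. The memory can be reclaimed by gc()
      # only once no other object refers to the sequence.
      if (exists(name, envir = cache, inherits = FALSE)) {
        rm(list = name, envir = cache)
      }
      invisible(NULL)
    },
    loaded = function() sort(ls(cache))
  )
}

n_loads <- 0
toy <- make_container(list(
  chrV = function() {
    n_loads <<- n_loads + 1  # count how often the "file" is read
    raw(1e6)                 # stand-in for a large sequence
  }
))

s1 <- toy$get("chrV")  # loads and caches
s2 <- toy$get("chrV")  # served from the cache, no second load
loaded_before <- toy$loaded()

rm(s1, s2)             # remove the user-visible references first
toy$unload("chrV")     # then remove the hidden one
loaded_after <- toy$loaded()
invisible(gc())        # the data is now unreachable and can be collected
```

In this model the second call to get is cheap because the cached object is returned, and the data only becomes collectable after both the user-visible references and the cached one are gone, which is the ordering that unload() is described as supporting.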
H. Pages
available.genomes, BString, DNAString, BStringViews, getSeq, matchPattern, rm, gc
library(BSgenome.Celegans.UCSC.ce2)  # This doesn't load the chromosome
                                     # sequences into memory.
length(Celegans)                     # Number of sequences in this genome.
Celegans                             # Displays a summary of the sequences
                                     # provided in this genome.
seqnames(Celegans)                   # Index of single sequences.
class(Celegans$chrI)                 # A DNAString instance.
mseqnames(Celegans)                  # Index of multiple sequences.
class(Celegans$upstream1000)         # A BStringViews instance.
desc(Celegans$upstream1000)[1:4]     # Character vector containing the
                                     # description line found in the FASTA
                                     # file for the first 4 FASTA records.

## Some important considerations about memory usage:
mem0 <- gc()["Vcells", "(Mb)"]       # Current amount of data in memory (in
                                     # Mb).
Celegans[["chrV"]]                   # Loads chromosome V into memory (hence
                                     # takes a long time).
gc()["Vcells", "(Mb)"] - mem0        # Chromosome V occupies 20Mb of memory.
Celegans[["chrV"]]                   # Much faster (sequence is already in
                                     # memory, hence it's not loaded again).
Celegans$chrV                        # Equivalent to Celegans[["chrV"]].
class(Celegans$chrV)                 # Chromosome V (like any other
                                     # chromosome sequence) is a DNAString
                                     # object.
nchar(Celegans$chrV)                 # It has 20922231 letters (nucleotides).
x <- Celegans$chrV                   # Very fast because a BString object
                                     # doesn't contain the sequence, only a
                                     # pointer to the sequence, hence chrV
                                     # seq is not duplicated in memory. But
                                     # we now have 2 objects pointing to the
                                     # same place in memory.
y <- substr(x, 10, 100)              # A 3rd object pointing to chrV seq.

## We must remove all references to chrV seq if we want the 20Mb of memory
## used by it to be freed (note that it can be hard to keep track of all the
## references to a given sequence).
## IMPORTANT: The 1st reference to this seq (Celegans$chrV) should be removed
## last. This is achieved with unload(). All other references are removed by
## just removing the referencing object.
rm(x)
rm(y)
unload(Celegans, "chrV")
gc()["Vcells", "(Mb)"]