BSgenome-class {BSgenome}    R Documentation

The BSgenome class

Description

A container for the complete genome sequence of a given species.

Accessor methods

In the code snippets below, x is a BSgenome object and name is the name of a sequence (character-string).

organism(x): Return the target organism for this genome, e.g. "Homo sapiens", "Mus musculus", "Caenorhabditis elegans", etc.
species(x): Return the target species for this genome, e.g. "Human", "Mouse", "C. elegans", etc.
provider(x): Return the provider of this genome, e.g. "UCSC", "BDGP", "FlyBase", etc.
providerVersion(x): Return the provider-side version of this genome. For example, UCSC uses versions "hg18", "hg17", etc. for the different Builds of the Human genome.
releaseDate(x): Return the release date of this genome e.g. "Mar. 2006".
releaseName(x): Return the release name of this genome, which is generally made of the name of the organization that assembled it plus its Build version. For example, UCSC uses "hg18" for the version of the Human genome corresponding to Build 36.1 from NCBI; hence the release name for this genome is "NCBI Build 36.1".
sourceUrl(x): Return the source URL i.e. the permanent URL to the place where the FASTA files used to produce the sequences contained in x can be found (and downloaded).
seqnames(x): Return the index of the single sequences contained in x. Each single sequence is stored in a BString (or derived) object and comes from a source file (FASTA) with a single record. The names returned by seqnames(x) usually reflect the names of those source files, but a common prefix or suffix may have been removed in order to keep them as short as possible.
mseqnames(x): Return the index of the multiple sequences contained in x. Each multiple sequence is stored in a BStringViews object and comes from a source file (FASTA) with multiple records. The names returned by mseqnames(x) usually reflect the names of those source files, but a common prefix or suffix may have been removed in order to keep them as short as possible.
names(x): Return the index of all sequences contained in x. This is the same as c(seqnames(x), mseqnames(x)).
length(x): Return the length of x, i.e., the number of all sequences that it contains. This is the same as length(names(x)).
x[[name]]: Return the sequence (single or multiple) named name. No sequence is actually loaded into memory until this is explicitly requested with a call to x[[name]] or x$name.
x$name: Same as x[[name]] but name is not evaluated and therefore must be a literal character string or a name (possibly backtick quoted).
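
The difference between x[[name]] and x$name follows base R's own evaluation rules, so it can be sketched without loading a genome. In the example below, a plain named list stands in for a BSgenome object; the list genome, its contents, and the variable name are illustrative assumptions and not part of the BSgenome API. A real BSgenome object may signal an error where the plain list returns NULL.

```r
# A plain named list standing in for a BSgenome object (illustrative only).
genome <- list(chrI = "ACGT", chrII = "GGCC", upstream1000 = c("AAT", "CCG"))

name <- "chrII"

# [[ evaluates its argument, so a variable holding the name works.
by_variable <- genome[[name]]

# $ does not evaluate its argument: it looks for an element literally
# called "name". The list has no such element, so the result is NULL.
by_dollar <- genome$name

# With $, the sequence name must be written literally,
# optionally backtick-quoted.
by_literal  <- genome$chrII
by_backtick <- genome$`chrII`

# length() counts all elements, just as length(names(genome)) does.
n_sequences <- length(genome)
```

When the sequence name is held in a variable (for instance inside a loop over names(x)), x[[name]] is the form to use.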

Other functions and generics

In the code snippets below, x is a BSgenome object and name is the name of a sequence (character-string).

unload(x, name): Try to free the memory occupied by a loaded sequence by removing the first reference to this sequence. This first reference is a hidden reference that x[[name]] or x$name creates behind the scenes. See the Examples section below for how to make proper use of unload().
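
The idea of a hidden first reference can be sketched in base R with an environment used as a cache. This is a hypothetical illustration of the bookkeeping only, not the BSgenome implementation: make_lazy_store, its get/unload/loaded functions, and the toy loader are all assumptions introduced here, and the short character string merely stands in for a large sequence.

```r
# Hypothetical lazy-loading container; NOT how BSgenome is implemented.
make_lazy_store <- function(loaders) {
  cache <- new.env()  # holds the hidden first reference to each loaded item
  list(
    get = function(name) {
      if (!exists(name, envir = cache, inherits = FALSE)) {
        # Load on first access and keep a hidden reference in the cache.
        assign(name, loaders[[name]](), envir = cache)
      }
      base::get(name, envir = cache, inherits = FALSE)
    },
    unload = function(name) {
      if (exists(name, envir = cache, inherits = FALSE)) {
        rm(list = name, envir = cache)  # drop the hidden reference
      }
      invisible(NULL)
    },
    loaded = function() ls(cache)
  )
}

load_count <- 0
store <- make_lazy_store(list(
  chrV = function() {
    load_count <<- load_count + 1  # count how often the loader really runs
    "ACGTACGT"                     # toy stand-in for a chromosome sequence
  }
))

before <- store$loaded()   # nothing is loaded yet
s1 <- store$get("chrV")    # first access runs the loader
s2 <- store$get("chrV")    # second access reuses the cached value
rm(s1, s2)                 # remove the user-visible references first
store$unload("chrV")       # then drop the hidden reference
after <- store$loaded()    # the cache is empty again
```

The order of the last steps mirrors the recommendation in the Examples section: user-created references are removed with rm() first, and the hidden reference is dropped last.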

Author(s)

H. Pages

See Also

available.genomes, BString, DNAString, BStringViews, getSeq, matchPattern, rm, gc

Examples

  library(BSgenome.Celegans.UCSC.ce2)   # This doesn't load the chromosome 
                                        # sequences into memory.
  length(Celegans)                      # Number of sequences in this genome.
  Celegans                              # Displays a summary of the sequences
                                        # provided in this genome.
  seqnames(Celegans)                    # Index of single sequences.
  class(Celegans$chrI)                  # A DNAString instance.
  mseqnames(Celegans)                   # Index of multiple sequences.
  class(Celegans$upstream1000)          # A BStringViews instance.
  desc(Celegans$upstream1000)[1:4]      # Character vector containing the
                                        # description line found in the FASTA
                                        # file for the first 4 FASTA records.

  ## Some important considerations about memory usage:
  mem0 <- gc()["Vcells", "(Mb)"]        # Current amount of data in memory (in
                                        # Mb).
  Celegans[["chrV"]]                    # Loads chromosome V into memory (hence
                                        # takes a long time).
  gc()["Vcells", "(Mb)"] - mem0         # Chromosome V occupies 20Mb of memory.
  Celegans[["chrV"]]                    # Much faster (sequence is already in
                                        # memory, hence it's not loaded again).
  Celegans$chrV                         # Equivalent to Celegans[["chrV"]].
  class(Celegans$chrV)                  # Chromosome V (like any other
                                        # chromosome sequence) is a DNAString
                                        # object.
  nchar(Celegans$chrV)                  # It has 20922231 letters (nucleotides).
  x <- Celegans$chrV                    # Very fast because a BString object
                                        # doesn't contain the sequence, only a
                                        # pointer to the sequence, hence chrV
                                        # seq is not duplicated in memory. But
                                        # we now have 2 objects pointing to the
                                        # same place in memory.
  y <- substr(x, 10, 100)               # A 3rd object pointing to chrV seq.
  
  ## We must remove all references to chrV seq if we want the 20Mb of memory
  ## used by it to be freed (note that it can be hard to keep track of all the
  ## references to a given sequence).
  ## IMPORTANT: The 1st reference to this seq (Celegans$chrV) should be removed
  ## last. This is achieved with unload(). All other references are removed by
  ## just removing the referencing object.
  rm(x)
  rm(y)
  unload(Celegans, "chrV")
  gc()["Vcells", "(Mb)"]

[Package BSgenome version 1.6.2 Index]