getSeq {BSgenome}    R Documentation

getSeq

Description

A convenience function for extracting a set of sequences (or subsequences) from a BSgenome object.

Usage

  getSeq(bsgenome, names, start=NA, end=NA, width=NA, as.character=TRUE)

Arguments

bsgenome: A BSgenome object. See the available.genomes function for how to install a genome.
names: The names of the sequences to extract from bsgenome. If missing, seqnames(bsgenome) is used.
See ?seqnames and ?mseqnames to get the list of single sequences and multiple sequences (respectively) contained in bsgenome.
The names passed to the names argument are matched against the sequences in bsgenome as follows. For each name in names: (1) if bsgenome contains a single sequence with that name, that sequence is returned; (2) otherwise name is treated as a regular expression and grep is used to search the names of all the elements in all the multiple sequences. If exactly one element matches, it is returned; otherwise an error is raised.
start, end, width: Specify these arguments only if you don't want to extract the entire sequences. The subsequences specified by start, end and width (each a single integer or NA) are then extracted by a call to subseq before getSeq returns them.
as.character: TRUE or FALSE. Should the extracted sequences be returned as a standard character vector?
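
The lookup rule described for the names argument can be sketched in base R as follows. This is only an illustration of the two-step rule, not the code used by getSeq: lookup_name is a hypothetical helper, and the sequence names below are made up.

```r
# Hypothetical helper (not part of BSgenome): resolve one name against the
# single-sequence names first, then against multiple-sequence element names.
lookup_name <- function(name, single_names, multi_element_names) {
  # (1) An exact match against the single sequences wins.
  if (name %in% single_names)
    return(name)
  # (2) Otherwise treat the name as a regular expression.
  hits <- grep(name, multi_element_names, value = TRUE)
  if (length(hits) != 1L)
    stop("'", name, "' matched ", length(hits), " sequences; expected exactly 1")
  hits
}

# Toy names, made up for illustration.
single_names <- c("chrI", "chrII", "chrV")
multi_names  <- c("NM_058280_up_1000", "NM_058280_up_2000", "NM_058260_up_1000")

r1 <- lookup_name("chrV", single_names, multi_names)            # exact match
r2 <- lookup_name("NM_058280_up_1", single_names, multi_names)  # unique regex hit
```

In this sketch a pattern such as "NM_058280" matches two element names and therefore raises an error, which mirrors the "exactly one sequence" requirement stated above.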

Value

A standard character vector when as.character=TRUE. In that case, any masks defined on top of the sequences to extract are ignored (see ?`MaskedXString-class` for more information about masked sequences).
A DNAString or MaskedDNAString object when as.character=FALSE. Note that as.character=FALSE is not supported when more than one sequence name is supplied.

Note

Be aware that using as.character=TRUE can be very inefficient when the returned character vector contains very long strings (> 1 million letters) or is itself a long vector (> 10000 strings).

getSeq is much more efficient with as.character=FALSE, but for now this works only when extracting one sequence at a time.

Author(s)

H. Pages; improvements suggested by Matt Settles

See Also

available.genomes, BSgenome-class, seqnames, mseqnames, grep, subseq, DNAString, MaskedDNAString, [[,BSgenome-method

Examples

  # Load the Caenorhabditis elegans genome (UCSC Release ce2):
  library(BSgenome.Celegans.UCSC.ce2)

  # Look at the index of sequences:
  Celegans

  # Get chromosome V as a DNAString object:
  getSeq(Celegans, "chrV", as.character=FALSE)
  # which is in fact the same as doing:
  Celegans$chrV

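  # Avoid this: with the default as.character=TRUE, the whole chromosome
  # is returned as one very long character string:
  #getSeq(Celegans, "chrV")
  # or this (even worse), which does the same for every chromosome:
  #getSeq(Celegans)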

  # Get the first 20 bases of each chromosome:
  getSeq(Celegans, end=20)

  # Get the last 20 bases of each chromosome:
  getSeq(Celegans, start=-20)
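
  # Illustration only, on a made-up 10-letter sequence (assumes the
  # Biostrings package, which BSgenome depends on): start, end and width
  # are passed to subseq, so a negative start counts from the end.
  library(Biostrings)
  toy <- DNAString("ACGTACGTAC")
  first3 <- as.character(subseq(toy, end=3))             # "ACG"
  last4  <- as.character(subseq(toy, start=-4))          # "GTAC"
  mid3   <- as.character(subseq(toy, start=2, width=3))  # "CGT"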

  # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
  # multiple sequence) as a character string:
  s1 <- getSeq(Celegans, "NM_058280_up_1000")
  # or a DNAString object (more efficient):
  s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE)

  getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1  # TRUE

  getSeq(Celegans, "NM_058280_up_5000",
         start=-1000, as.character=FALSE) == s2  # TRUE

[Package BSgenome version 1.10.5]