cosmo {cosmo}R Documentation

Constrained motif detection main function

Description

cosmo searches a set of unaligned DNA sequences for a shared motif that may, for example, represent a common transcription factor binding site. The algorithm is similar to MEME, but also allows the user to specify a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may include bounds on the information content across certain regions of the unknown motif, for example, and can often be formulated on the basis of prior knowledge about the structure of the transcription factor in question.

Usage

  cosmo(seqs="browse",constraints="None", minW=6, maxW=15,
    models = "ZOOPS", revComp = TRUE, minSites = NULL, maxSites = NULL, 
    starts = 5, approx = "over", cutFac = 5, wCrit = "bic", 
    wFold = 5, wTrunc = 100, modCrit = "lik", modFold = 5, modTrunc = 100, 
    conCrit = "likCV", conFold = 5, conTrunc = 90, intCrit = "lik", 
    intFold = 5, intTrunc = 100, maxIntensity = FALSE, lstarts = FALSE, 
    backSeqs = NULL, backFold = 5, bfile = NULL, transMat = NULL,
    order = NULL, maxOrder=6, silent = FALSE) 

Arguments

seqs This argument specifies the sequences to be analyzed. If seqs == "browse", a browser appears that allows the user to select a file that contains the sequences in FASTA format. If seqs is another character string, it is assumed to give the path to a FASTA file containing the sequences of interest. Lastly, seqs may be a list with each element representing a sequence in the form of a single string such as "ACGTAGCTAG" ("seq" entry) and a description ("desc" entry).
constraints These are the constraints that are to be imposed on the unknown motif. If constraints == "None", cosmo() will be run without constraints. If constraints == "GUI" and the cosmoGUI package has been installed, a GUI will pop up that allows the user to interactively create a set of constraints, either from scratch or on the basis of several templates of interest. If constraints is another character string, it is assumed to give the path to a file that contains the constraint definitions in the standard text format (see http://cosmoweb.berkeley.edu/constraints.html). Lastly, constraints may be an object of class constraintSet or a list of such objects that defines the constraints of interest.
minW numeric indicating the minimum motif width to consider
maxW numeric indicating the maximum motif width to consider
models character a vector containing the different models to be considered for the distribution of motif occurrences ("OOPS", "ZOOPS", and "TCM"). The One-Occurrence-Per-Sequence (OOPS) model assumes that each sequence contains exactly one occurrence of the motif. The Zero-or-One-Occurrences-Per-Sequence model allows zero or one occurrences of the motif on a given sequence. The Two-Compoment-Mixture (TCM) model allows an arbitrary number of motif occurrences on a given seqence.
revComp logical indicating whether motifs are allowed to occur in the reverse complement orientation.
minSites numerical The minimum number of motif occurrences in the input sequences (default: 2)
maxSites numerical The maximum number of motif occurrences in the input sequences (default: MIN(5*number of sequences, 50))
starts numerical number of starting values to use for each optimization
approx approximation for TCM likelihood; one of "over", "cut", "exact"
cutFac numerical if TCM model is approximated by over or cut models, subsequences are of length cutFac * motif width
wCrit Criterion for choosing the motif width. This can be either "lik" for the likelihood, "aic" for Akaike's Information Criterion, "bic" for the Bayesian Information Criterion, "eval" for the E-value of the alignment of the predicted motif sites, or "likCV" for likelihood-based cross-validation.
wFold numerical cross-validation fold for selecting motif width
wTrunc numerical truncate loss-function for selecting motif width to this percentile (1-100)
modCrit Criterion for choosing the model type. This can be either "lik" for the likelihood, "aic" for Akaike's Information Criterion, "bic" for the Bayesian Information Criterion, "eval" for the E-value of the alignment of the predicted motif sites, or "likCV" for likelihood-based cross-validation.
modFold numerical cross-validation fold for selecting the model type
modTrunc numerical truncate loss-function for selecting model type to this percentile (1-100)
conCrit Criterion for choosing the constraint set. This can be either "lik" for the likelihood, "eval" for the E-value of the alignment of the predicted motif sites, "likCV" for likelihood-based cross-validation, or "pwmCV" for cross-validation based on the Euclidean norm between two position weight matrices.
conFold numerical cross-validation fold for selecting the constraint set (likelihood cross-validation only).
conTrunc numericaltruncate loss-function for selecting constraint set to this percentile (1-100)
intCrit Criterion for estimating the intensity parameter in the ZOOPS or TCM model. This can be either "lik" for the likelihood, "aic" for Akaike's Information Criterion, "bic" for the Bayesian Information Criterion, or "eval" for the E-value of the alignment of the predicted motif sites.
intFold numerical cross-validation fold for selecting the intensity parameter
intTrunc numerical truncate loss-function for selecting intensity parameter to this percentile (1-100)
maxIntensity logical maximize likelihood function with respect to intensity parameter (in ZOOPS or TCM model) instead of using profiling approach?
lstarts logical should likelihood-based starting values be used rather than E-value-based starting values?
backSeqs This argument specifies the sequences that are to be used to estimate the background Markov model. If backseqs == NULL, the background model is estimated from the sequences supplied in the seqs argument. If backSeqs == "browse", a browser appears that allows the user to select a file that contains the sequences in FASTA format. If backSeqs is another character string, it is assumed to give the path to a FASTA file containing the sequences of interest. Lastly, backSeqs may be a list with each element representing a sequence in the form of a single string such as "ACGTAGCTAG" ("seq" entry) and a description ("desc" entry).
backFold numerical cross-validation fold for selecting order of background Markov model.
bfile character The name of a MEME-style background file for specifying the background Markov model. Such a file lists the frequencies of all tuples of all possible tuples of length up to order + 1. See the help file on the function bfile2tmat() for an example.
transMat The transition matrix to use for the background Markov model. This is a list of matrices, with the first matrix given the transition probabilities for the 0th order Markov model, the second matrix giving the transition probabilities for a 1st order Markov model, and so on. The entry in cell(i,j) of a k-th order transition matrix gives the probability of observing the nucleotide in column j given that the previous k nucleotides are equal to those in row i. Type 'data(transMats)' to look at an example. The function bgModel can be used to obtain a transition matrix from a set of sequences that can be used for this argument. The function bfile2tmat may be used to obtain a transition matrix from a MEME-style background file.
order numerical order of Markov background model
maxOrder numerical maximum order to consider for Markov background model
silent logical suppress output?

Details

Value

An object of class cosmo, returning all the results of the motif detection analysis.

Note

Author(s)

Oliver Bembom, bembom@berkeley.edu, Fabian Gallusser, fgallusser@berkeley.edu

References

Oliver Bembom, Sunduz Keles, and Mark J. van der Laan, "Supervised Detection of Conserved Motifs in DNA Sequences with cosmo" (2007). Statistical Applications in Genetics and Molecular Biology: Vol. 6 : Iss. 1, Article 8. http://www.bepress.com/sagmb/vol6/iss1/art8

See Also

bgModel, bfile2tmat

Examples

## initialize constraint set
## consisting of three intervals
## 1st and 3rd intervals are 3bp long
## middle interval is variable lenght
conSet <- makeConSet(numInt=3, type=c("B","V","B"),length=c(3,NA,3))

## construct two bound constraints
boundCon1 <- makeBoundCon(lower=1.0, upper=2.0)
boundCon2 <- makeBoundCon(lower=0.0, upper=1.0)

## construct palindromic constraint
## require intervals 1 and 3 to be palindromes
## to within 0.05 tolerance
palCon1 <- makePalCon(int1=1, int2=3, errBnd=0.05)

## add constraints to initial constraint set
constraint <- list(boundCon1, boundCon2, palCon1)
int <- list(1, 2, NA)
conSet <- addCon(conSet=conSet, constraint=constraint, int=int)

## path to example sequence file in FASTA format
seqFile <- system.file("Exfiles","seq.fasta",package="cosmo")

## search for motifs of width 8
## assume zero or one occurrences of motif per sequence (ZOOPS)
res <- cosmo(seqs=seqFile, constraints=conSet, minW=8, maxW=8, models="ZOOPS")
plot(res)

[Package cosmo version 1.4.0 Index]