GeneSelection {CMA}    R Documentation

General interface for variable selection with various methods

Description

For each of the learning data sets defined by the argument learningsets, this method either ranks the genes from the most relevant to the least relevant using one of various 'filter' criteria, or provides a sparse collection of variables (Lasso, ElasticNet, Boosting). The results are typically used for variable selection in the classification procedure that follows.

For S4 class information, see GeneSelection-methods.

Usage

GeneSelection(X, y, f, learningsets, method = c("t.test", "welch.test", "wilcox.test", "f.test", "kruskal.test", "limma", "rfe", "rf", "lasso", "elasticnet", "boosting", "golub"), scheme, trace = TRUE, ...)

Arguments

X Gene expression data. Can be one of the following:
  • A matrix. Rows correspond to observations, columns to variables.
  • A data.frame, when f is not missing (see below).
  • An object of class ExpressionSet.
y Class labels. Can be one of the following:
  • A numeric vector.
  • A factor.
  • A character if X is an ExpressionSet.
  • missing, if X is a data.frame and a proper formula f is provided.
f A two-sided formula, if X is a data.frame. The left part corresponds to the class labels, the right part to the variables.
learningsets An object of class learningsets. May be missing, in which case the complete dataset is used as the learning set.
method A character specifying the method to be used:
t.test
two-sample t.test (equal variances for both classes assumed).
welch.test
Welch modification of the t.test (unequal variances for both classes).
wilcox.test
Wilcoxon rank sum test.
f.test
F test for the linear hypothesis that the mean is the same for all classes. Usually used for the multiclass scheme; equivalent to method = "t.test" in the two-class case.
kruskal.test
Multiclass generalization of the Wilcoxon rank sum test, and the nonparametric counterpart of the F test.
limma
'Moderated t' statistic for the two-class case and 'moderated F' statistic for the multiclass case, described in Smyth et al. (2003). Requires the package limma.
rfe
One-step Recursive Feature Elimination, based on the Support Vector Machine. The method is described in Guyon et al. (2002). Requires the package e1071. Take care that appropriate hyperparameters are passed via the ... argument.
rf
Random Forest variable importance measure. Requires the package randomForest.
lasso
L1-penalized logistic regression, which leads to sparsity with respect to the variables used. Calls the function LassoCMA, which requires the package glmpath. Warning: take care that appropriate hyperparameters are passed via the ... argument.
elasticnet
Penalized logistic regression with both L1 and L2 penalties, claimed by Zhou and Hastie (2004) to select 'variable groups'. Calls the function ElasticNetCMA, which requires the package glmpath. Warning: take care that appropriate hyperparameters are passed via the ... argument.
boosting
Componentwise boosting (Buehlmann and Yu, 2003) has been shown to mimic the LASSO (Efron et al., 2004; Buehlmann and Yu, 2006). Calls the function compBoostCMA. Take care that appropriate hyperparameters are passed via the ... argument.
golub
The (theoretically unfounded) variable selection criterion used by Golub et al. (1999), see golub.

scheme The scheme to be used in the case of a non-binary response. Must be one of "pairwise", "one-vs-all" or "multiclass". The last option only makes sense if method is one of f.test, limma, rf or boosting, which can be applied directly to the multiclass case.
trace Should the progress be traced? Default is TRUE.
... Further arguments passed to the function performing variable selection, see method.
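
To make the idea of a 'filter' criterion concrete, the following base R sketch ranks simulated variables by the absolute value of an equal-variance two-sample t statistic, and checks that the one-way F statistic equals the squared t statistic in the two-class case. It is only an illustration under stated assumptions: the data are simulated, all object names (X, y, tstat, fstat, ranking) are hypothetical, and the code is not the implementation used inside GeneSelection.

```r
# Sketch only: simulated data, not CMA's internal code.
set.seed(1)
n <- 20                                        # observations
p <- 50                                        # variables ("genes")
X <- matrix(rnorm(n * p), nrow = n, ncol = p)  # rows = observations, columns = variables
y <- factor(rep(c("0", "1"), each = n / 2))    # two classes
X[y == "1", 1:3] <- X[y == "1", 1:3] + 3       # shift the first three variables in class "1"

# Equal-variance two-sample t statistic for every column
tstat <- apply(X, 2, function(x) t.test(x ~ y, var.equal = TRUE)$statistic)

# One-way ANOVA F statistic for every column
fstat <- apply(X, 2, function(x) anova(lm(x ~ y))[1, "F value"])

# With two classes, F equals t^2, so both criteria give the same ordering
stopifnot(isTRUE(all.equal(unname(tstat)^2, unname(fstat))))

# Order the variables from most to least relevant
ranking <- order(abs(tstat), decreasing = TRUE)
head(ranking, 5)
```

Sorting by the absolute statistic treats up- and down-regulated variables alike; this is a choice made for the sketch, not a statement about how GeneSelection orders its results.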

Value

An object of class genesel.

Note

Most of the methods described above are only suited to the binary classification case. The only ones that can be used without restriction in the multiclass case are f.test, limma, rf and boosting (the methods named under the scheme argument for the "multiclass" option).

For the rest, pairwise or one-vs-all schemes are used.
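
As a rough illustration of what the pairwise and one-vs-all schemes mean, the base R sketch below reduces a three-class response to a set of binary problems. The toy labels and the names one_vs_all, pairs and pairwise are hypothetical; the page does not describe how GeneSelection builds or combines these subproblems internally, so this shows only the general idea.

```r
# Sketch only: toy labels, not CMA's internal code.
y <- factor(c("a", "a", "b", "b", "c", "c"))

# one-vs-all: one binary label vector per class (class k versus all others)
one_vs_all <- lapply(levels(y), function(k) as.integer(y == k))
names(one_vs_all) <- levels(y)

# pairwise: one subproblem per unordered pair of classes,
# restricted to the observations that belong to either class
pairs <- combn(levels(y), 2, simplify = FALSE)
pairwise <- lapply(pairs, function(pr) {
  keep <- which(y %in% pr)
  list(classes = pr,
       index   = keep,
       label   = as.integer(y[keep] == pr[2]))
})

length(one_vs_all)  # one problem per class
length(pairwise)    # one problem per pair of classes
```

Each resulting binary problem could then be handled by a two-class criterion such as the t test.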

Author(s)

Martin Slawski martin.slawski@campus.lmu.de

Anne-Laure Boulesteix http://www.slcmsr.net/boulesteix

References

Smyth, G. K., Yang, Y.-H., Speed, T. P. (2003). Statistical issues in microarray data analysis. Methods in Molecular Biology, 224, 111-136.

Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002). Gene Selection for Cancer Classification using support vector machines. Machine Learning, 46, 389-422.

Zhou, H., Hastie, T. (2004). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2), 301-320.

Buehlmann, P., Yu, B. (2003). Boosting with the L2 loss: Regression and Classification. Journal of the American Statistical Association, 98, 324-339.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics, 32, 407-499.

Buehlmann, P., Yu, B. (2006). Sparse Boosting. Journal of Machine Learning Research, 7, 1001-1024.

See Also

filter, GenerateLearningsets, tune, classification

Examples

### load the package and the Golub AML/ALL data
library(CMA)
data(golub)
### extract class labels (first column)
golubY <- golub[,1]
### extract gene expression for all genes (remaining columns)
golubX <- as.matrix(golub[,-1])
### Generate five different learningsets
set.seed(111)
five <- GenerateLearningsets(y=golubY, method = "CV", fold = 5, strat = TRUE)
### simple t-test:
selttest <- GeneSelection(golubX, golubY, learningsets = five, method = "t.test")
### show result:
show(selttest)
toplist(selttest, k = 10, iter = 1)
plot(selttest, iter = 1)

[Package CMA version 1.0.0 Index]