samplesize {OCplus} | R Documentation |
This function tabulates the false discovery rate (FDR) for selecting differentially expressed genes as a function of sample size and cutoff level. Optionally, the same information is displayed as a plot.
samplesize(n = seq(5, 50, by = 5), p0 = 0.99, sigma = 1, D, F0, F1, paired = FALSE, crit, crit.style = c("top percentage", "cutoff"), plot = TRUE, local.show = FALSE, nplot = 100, ylim = c(0, 1), main, legend.show = FALSE, grid.show = FALSE, ...)
n |
sample size (as subjects per group) |
p0 |
the proportion of non-differentially expressed genes |
sigma |
the standard deviation for the log expression values |
D |
assumed average log fold change (in units of sigma), by default 1; this is a shortcut for specifying a simple symmetrical alternative hypothesis through F1. |
F0 |
the distribution of the log2 expression values under the null hypothesis; by default, this is normal with mean zero and standard deviation sigma, but mixtures of normals can be specified, see Details and Examples. |
F1 |
the distribution of the log2 expression values under the alternative hypothesis; by default, this is an equal mixture of two normals with means D and -D and standard deviation sigma; mixtures of normals are again possible, see Details and Examples. |
paired |
logical value indicating whether this is the independent sample case (default) or the paired sample case. |
crit |
a vector of cutoff values for selecting differentially expressed genes; the interpretation depends on crit.style. |
crit.style |
indicates how differentially expressed genes are selected: either by a fixed cutoff level for the absolute value of the t-statistic or as a fixed percentage of the absolute largest t-statistics. |
plot |
logical value indicating whether to draw the plot |
local.show |
logical value indicating whether to show local or global false discovery rate (default: global). |
nplot |
number of points that are evaluated for the curves |
ylim |
the usual limits on the vertical axis |
main |
the main title of the plot |
legend.show |
logical value indicating whether to show a legend for the types of gene selection in the plot |
grid.show |
logical value indicating whether to draw grid lines showing the sample sizes n to be tabulated in the plot |
... |
the usual graphical parameters, passed to plot |
This function plots the FDR as a function of the sample size when comparing the expression of multiple genes between two groups of subjects. It is based on a model assuming that a proportion p0 of genes is not differentially expressed (regulated) between groups, and that the remaining proportion 1-p0 is. The logarithmized gene expression values of regulated and non-regulated genes are assumed to be generated by mixtures of normal distributions; these mixtures can be specified through the parameters F0, F1 or D, and sigma; please see TOC for details on the model and the specification of the mixtures. By default, the null distribution of the log expression values is a normal centered on zero, and the alternative is an equal mixture of normals centered at +D and -D.
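The default model described above can be made concrete with a small Monte Carlo sketch in base R. This is a simulation stand-in for illustration only, not the computation performed by samplesize; the helper name simulate_fdr and its arguments n_genes and top are assumptions introduced here.

```r
# Monte Carlo sketch of the default two-group mixture model (base R only).
# simulate_fdr, n_genes and top are illustrative names, not part of OCplus.
simulate_fdr <- function(n, p0 = 0.99, D = 1, sigma = 1,
                         n_genes = 10000, top = 0.01) {
  # TRUE for genes that are not differentially expressed
  is_null <- runif(n_genes) < p0
  # True mean difference: 0 under the null, +D or -D (in units of sigma) otherwise
  sign_alt <- sample(c(-1, 1), n_genes, replace = TRUE)
  delta <- ifelse(is_null, 0, sign_alt * D * sigma)

  # One row per gene, n subjects per group; row i of y has mean delta[i]
  x <- matrix(rnorm(n_genes * n, mean = 0, sd = sigma), nrow = n_genes)
  y <- matrix(rnorm(n_genes * n, mean = delta, sd = sigma), nrow = n_genes)

  # Equal-variance two-sample t-statistic for every gene
  mx <- rowMeans(x)
  my <- rowMeans(y)
  vx <- rowSums((x - mx)^2) / (n - 1)
  vy <- rowSums((y - my)^2) / (n - 1)
  tstat <- (my - mx) / sqrt((vx + vy) / 2 * (2 / n))

  # Declare the top fraction of |t| as differentially expressed
  k <- ceiling(top * n_genes)
  selected <- order(abs(tstat), decreasing = TRUE)[seq_len(k)]

  # Empirical FDR: share of selected genes that are truly null
  mean(is_null[selected])
}

set.seed(1)
sizes <- c(5, 10, 20, 40)
fdr_by_n <- sapply(sizes, simulate_fdr)
names(fdr_by_n) <- paste0("n=", sizes)
fdr_by_n
```

The simulated FDR should fall as the number of subjects per group grows, which is the relationship the function tabulates and plots.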
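The list of nominally differentially expressed genes can be selected in two ways: by a fixed cutoff level for the absolute value of the t-statistic (crit.style="cutoff"), or as a fixed percentage of the genes with the absolute largest t-statistics (crit.style="top percentage").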
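The two selection rules can be sketched on a plain vector of t-statistics. The helpers select_cutoff and select_top are hypothetical names, and details such as the strict inequality and rounding up with ceiling are assumptions, not taken from the package source.

```r
# Two ways to pick genes from a vector of t-statistics (base R sketch).
# select_cutoff and select_top are hypothetical helpers, not OCplus functions.

# "cutoff" style: keep every gene whose |t| exceeds a fixed critical value
select_cutoff <- function(tstat, crit) {
  which(abs(tstat) > crit)
}

# "top percentage" style: keep the given fraction of genes with largest |t|
select_top <- function(tstat, crit) {
  k <- ceiling(crit * length(tstat))
  order(abs(tstat), decreasing = TRUE)[seq_len(k)]
}

tstat <- c(-4.2, 0.3, 2.5, -1.1, 3.7, 0.8, -2.9, 1.6, 0.1, -0.5)
by_cutoff <- select_cutoff(tstat, crit = 2)    # indices of genes with |t| > 2
by_top    <- select_top(tstat, crit = 0.2)     # indices of the top 20 percent by |t|
```

With a cutoff, the length of the gene list varies with the data; with a top percentage, the list length is fixed in advance and the implied cutoff varies instead.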
Multiple critical values correspond to multiple curves, each labeled by the critical value, but only one value can be specified for the proportion of non-regulated genes p0 and the standard deviation sigma.
A matrix with rows corresponding to elements of n and columns corresponding to the specified critical values is returned. The matrix has the attribute param that contains the specified arguments, see Examples.
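A short sketch of inspecting the returned object, assuming the Bioconductor package OCplus is installed (the call is skipped otherwise). The particular values chosen for n and crit are arbitrary illustrations.

```r
# Inspect the return value without plotting; requires OCplus to be installed.
if (requireNamespace("OCplus", quietly = TRUE)) {
  ret <- OCplus::samplesize(n = c(10, 20, 40), crit = c(0.03, 0.01),
                            plot = FALSE)
  print(dim(ret))             # one row per sample size, one column per critical value
  print(ret)                  # FDR for each combination
  str(attr(ret, "param"))     # the arguments used for the calculation
}
```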
Both the curve labels and the legend may be squashed if the plotting device is too small. Increasing the size of the device and re-plotting should improve readability.
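One way to get a larger plotting area is to draw into a file device with explicit dimensions. The sketch below assumes OCplus is installed; the file name and the 10 x 7 inch size are arbitrary choices, not recommendations from the package authors.

```r
# Plot into a PDF device that is wider than the 7 x 7 inch default.
if (requireNamespace("OCplus", quietly = TRUE)) {
  out_file <- file.path(tempdir(), "samplesize_fdr.pdf")
  pdf(out_file, width = 10, height = 7)
  OCplus::samplesize(crit = c(0.03, 0.02, 0.01), legend.show = TRUE)
  dev.off()   # close the device so the file is written completely
}
```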
Y. Pawitan and A. Ploner
Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A (2005) False Discovery Rate, Sensitivity and Sample Size for Microarray Studies. Bioinformatics, 21, 3017-3024.
Jung SH (2005) Sample size for FDR-control in microarray data analysis. Bioinformatics, 21, 3097-3104.
# Default assumes a proportion of 0.01 regulated genes equally split
# between two-fold up- and down-regulated
# We select the top 1, 2, 3 percent absolute largest t-statistics
samplesize(crit=c(0.03,0.02, 0.01))

# Same model, but using a hard cutoff for the t-statistics
samplesize(crit=2:4, crit.style="cutoff")

# Paired test of the same size has slightly better FDR (as expected)
samplesize(paired=TRUE)

# Compare the effect of p0 and effect size
par(mfrow=c(2,2))
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=1)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=2)
samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=2)

# An asymmetric alternative distribution: 20 percent of the regulated genes
# are expected to be (at least) four-fold up regulated
# NB, no graphical output
ret = samplesize(F1=list(D=c(-1,1,2), p=c(2,2,1)), p0=0.95, crit=0.05, plot=FALSE)
ret
# Look at the parameters
attr(ret, "param")

# A wide null distribution that allows to disregard genes with small effect
# Here: |log2 fold change| < 0.25, i.e. fold change of less than 19 percent
samplesize(F0=list(D=c(-0.25,0,0.25)), grid=TRUE)

# This is close to Example 3 in Jung's paper (see References):
# p0=0.99 and sensitivity=0.6, so we want a rejection rate of
# around 0.006 from the top list.
# Here we require around 40 arrays/group, compared to
# around 37 in Jung's paper, most likely because we use
# the t-distribution instead of normal. Jung's alternative
# is only one-sided, so the exact correspondence is
#
samplesize(p0=0.99,crit.style="top", crit=0.006, F1=list(D=1, p=1), grid=TRUE)
abline(h=0.01)
# The result is very close to the symmetric alternatives:
samplesize(p0=0.99,crit=0.006, D=1, grid=TRUE, ylim=c(0,0.9))