flowClust {flowClust}    R Documentation

Robust Model-based Clustering for Flow Cytometry

Description

This function performs automated clustering for identifying cell populations in flow cytometry data. The approach is based on the t mixture model with the Box-Cox transformation, which provides a unified framework to handle outlier identification and data transformation simultaneously.

Usage

flowClust(x, expName="Flow Experiment", varNames=NULL, K, B=500, 
          tol=1e-5, nu=4, lambda=1, trans=TRUE, min.count=10, 
          max.count=10, min=NULL, max=NULL, level=0.9, 
          u.cutoff=NULL, z.cutoff=0, randomStart=0, 
          B.init=B, tol.init=1e-2, seed=1, criterion="BIC")

Arguments

x A numeric vector, matrix, data frame of observations, or object of class flowFrame. Rows correspond to observations and columns correspond to variables.
expName A character string giving the name of the experiment.
varNames A character vector specifying the variables (columns) to be included in clustering. When it is left unspecified, all the variables will be used.
K An integer vector indicating the numbers of clusters.
B The maximum number of EM iterations.
tol The tolerance used to assess the convergence of the EM.
nu The degrees of freedom used for the t distribution. Default is 4. If nu=Inf, a Gaussian distribution will be used.
lambda The initial transformation to be applied to the data.
trans A logical value indicating whether the Box-Cox transformation parameter is estimated from the data.
min.count An integer specifying the threshold count for filtering data points from below. The default is 10, meaning that if 10 or more data points are smaller than or equal to min, they will be excluded from the analysis. If min is NULL, the minimum of the data for each variable will be used. To suppress filtering, set min.count to -1.
max.count An integer specifying the threshold count for filtering data points from above. Interpretation is similar to that of min.count.
min The lower boundary set for data filtering. Note that it is a vector of length equal to the number of variables (columns), so a different value can be set for each variable.
max The upper boundary set for data filtering. Interpretation is similar to that of min.
level A numeric value between 0 and 1 specifying the threshold quantile level used to call a point an outlier. The default is 0.9, meaning that any point outside the 90% quantile region will be called an outlier.
u.cutoff Another criterion used to identify outliers. If this is NULL, then level will be used. Otherwise, this specifies the threshold (e.g., 0.5) for u, a quantity used to measure the degree of “outlyingness” based on the Mahalanobis distance. Please refer to Lo et al. (2008) for more details.
z.cutoff A numeric value between 0 and 1 underlying a criterion which may be used together with level/u.cutoff to identify outliers. A point with the probability of assignment z (i.e., the posterior probability that a data point belongs to the cluster assigned) smaller than z.cutoff will be called an outlier. The default is 0, meaning that assignment will be made no matter how small the associated probability is, and outliers will be identified solely based on the rule set by level or u.cutoff.
randomStart A numeric value indicating how many times a random partition of the data is generated for initialization. For instance, if randomStart is 10, 10 random partitions of the data will be generated, each of which is followed by a short EM run. The partition leading to the highest likelihood value will be adopted as the initial partition for the eventual long EM run. The default is 0, meaning that this initialization strategy is not applied and hierarchical clustering is used instead.
B.init The maximum number of EM iterations following each random partition in random initialization.
tol.init The tolerance used as the stopping criterion for the short EM runs in random initialization.
seed An integer giving the seed number used when randomStart>0.
criterion A character string stating the criterion used to choose the best model. May take either "BIC" or "ICL". This argument is only relevant when length(K)>1.
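
The filtering rule described for min and min.count can be sketched in base R as follows. This is one reading of the argument descriptions above, not the package's internal code, which may differ in detail. The function name filter_below and its arguments lower and min_count are hypothetical stand-ins for min and min.count (renamed to avoid masking base::min).

```r
# Hypothetical helper: returns TRUE for rows that survive filtering from below.
filter_below <- function(y, lower = NULL, min_count = 10) {
  y <- as.matrix(y)
  # With no boundary supplied, fall back to the per-variable minimum.
  if (is.null(lower)) lower <- apply(y, 2, min)
  keep <- rep(TRUE, nrow(y))
  if (min_count > -1) {                         # -1 suppresses filtering
    at_or_below <- sweep(y, 2, lower, "<=")     # logical matrix, one column per variable
    for (j in seq_len(ncol(y))) {
      # Only drop boundary points when enough of them pile up.
      if (sum(at_or_below[, j]) >= min_count) {
        keep <- keep & !at_or_below[, j]
      }
    }
  }
  keep
}

# Toy data: variable "a" has 12 values piled up at its floor, "b" has none.
y <- cbind(a = c(rep(0, 12), 1:18), b = 1:30 + 0.5)
keep <- filter_below(y)
keep_all <- filter_below(y, min_count = -1)
```

In this toy data set the 12 observations sitting at the floor of variable a are dropped, while the single smallest value of b is kept because it falls short of the threshold count. Filtering from above with max and max.count would mirror this logic with ">=".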

Details

Estimation of the unknown parameters (including the Box-Cox parameter) is done via an Expectation-Maximization (EM) algorithm. At each EM iteration, Brent's algorithm is used to find the optimal value of the Box-Cox transformation parameter. Conditional on the transformation parameter, all other estimates can be obtained in closed form. Please refer to Lo et al. (2008) for more details.
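
The idea of optimizing the transformation parameter in one dimension while the remaining estimates are available in closed form can be illustrated with a deliberately simplified sketch. It assumes a single Gaussian component and the standard Box-Cox transformation for positive data; the package fits a t mixture and its exact transformation may differ. Base R's optimize(), which combines golden section search with successive parabolic interpolation (a Brent-type method), stands in for the optimizer. The names box_cox and profile_loglik are hypothetical.

```r
# Standard Box-Cox transform for positive data; lambda = 0 is the log limit.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for ONE Gaussian component (constants dropped).
# Given lambda, the mean and variance have closed-form estimates.
profile_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  sigma2 <- mean((z - mean(z))^2)
  # The second term is the log-Jacobian of the transformation.
  -length(y) / 2 * log(sigma2) + (lambda - 1) * sum(log(y))
}

set.seed(1)
y <- exp(rnorm(500, mean = 2, sd = 0.5))   # log-normal data, so lambda near 0 is expected

# One-dimensional search over lambda.
fit <- optimize(profile_loglik, interval = c(-2, 2), y = y, maximum = TRUE)
lambda_hat <- fit$maximum
```

In the full algorithm a search of this kind would be repeated at each EM iteration, with the likelihood taken over all clusters rather than a single component.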

The flowClust package makes extensive use of the GSL as well as BLAS. If an optimized BLAS library is provided when compiling the package, the flowClust package will be able to run multi-threaded processes.

Various operations have been defined for the object returned from flowClust. These include:
Subsetting operations: %in%, Subset and split
Slot retrieval operations: ruleOutliers, Map, criterion, posterior, importance, uncertainty and getEstimates
Graphical operations: plot, density and hist

In addition, to facilitate the integration with the flowCore package for processing flow cytometry data, the flowClust operation can be done through a method pair (tmixFilter and filter) such that various methods defined in flowCore can be applied on the object created from the filtering operation.

Value

If K is of length 1, the function returns an object of class flowClust containing the following slots, where K is the number of clusters, N is the number of observations and P is the number of variables:

expName Content of the expName argument.
varNames Content of the varNames argument if provided; otherwise generated from the data if available.
K An integer showing the number of clusters.
w A vector of length K, containing the estimates of the K cluster proportions.
mu A matrix of size K x P, containing the estimates of the K mean vectors.
sigma An array of dimension K x P x P, containing the estimates of the K covariance matrices.
lambda The Box-Cox transformation parameter estimate.
nu The degrees of freedom used for the t distribution.
z A matrix of size N x K, containing the posterior probabilities of cluster memberships. The probabilities in each row sum up to one.
u A matrix of size N x K, containing the “weights” (the contribution for computing cluster mean and covariance matrix) of each data point in each cluster. Since this quantity decreases monotonically with the Mahalanobis distance, it can also be interpreted as the level of “outlyingness” of a data point. Note that, when nu=Inf, this slot is used to store the Mahalanobis distances instead.
label A vector of size N, showing the cluster membership according to the initial partition (i.e., hierarchical clustering if randomStart is 0). Filtered observations will be labelled as NA. Unassigned observations (which may occur since only 1500 observations at maximum are taken for hierarchical clustering) will be labelled as 0.
uncertainty A vector of size N, containing the uncertainty about the cluster assignment. Uncertainty is defined as 1 minus the posterior probability that a data point belongs to the cluster to which it is assigned.
ruleOutliers A numeric vector of size 3, storing the rule used to call outliers. The first element is 0 if the criterion is set by the level argument, or 1 if it is set by u.cutoff. The second element copies the content of either the level or u.cutoff argument. The third element copies the content of the z.cutoff argument. For instance, if points are called outliers when they lie outside the 90% quantile region or have assignment probabilities less than 0.5, then ruleOutliers is c(0, 0.9, 0.5). If points are called outliers only if their “weights” in the assigned clusters are less than 0.5 regardless of the assignment probabilities, then ruleOutliers becomes c(1, 0.5, 0).
flagOutliers A logical vector of size N, showing whether each data point is called an outlier or not based on the rule defined by level/u.cutoff and z.cutoff.
rm.min Number of points filtered from below.
rm.max Number of points filtered from above.
logLike The log-likelihood of the fitted mixture model.
BIC The Bayesian Information Criterion for the fitted mixture model.
ICL The Integrated Completed Likelihood for the fitted mixture model.
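
To make the relationship between z, u, uncertainty, ruleOutliers and flagOutliers concrete, the following base R sketch recomputes them for a toy two-cluster model with known parameters. It rests on assumptions not stated on this page: the weight takes the standard t-mixture form (nu + P) / (nu + d^2), where d^2 is the squared Mahalanobis distance, and the quantile region is derived from the fact that d^2 / P follows an F(P, nu) distribution under a multivariate t. The parameter values are made up for illustration and no transformation is applied.

```r
nu <- 4; P <- 2; K <- 2
w     <- c(0.5, 0.5)                       # cluster proportions
mu    <- rbind(c(0, 0), c(5, 5))           # K x P mean vectors
Sigma <- list(diag(2), diag(2))            # K covariance matrices
x <- rbind(c(0.1, -0.2), c(4.5, 5.5), c(2.4, 2.4), c(12, -3))
N <- nrow(x)

# Squared Mahalanobis distance of every point to every cluster (N x K).
d2 <- sapply(1:K, function(k) mahalanobis(x, mu[k, ], Sigma[[k]]))

# Weighted t densities up to a constant shared by all clusters, then posteriors z.
dens <- sapply(1:K, function(k)
  w[k] * (1 + d2[, k] / nu)^(-(nu + P) / 2) / sqrt(det(Sigma[[k]])))
z <- dens / rowSums(dens)                  # each row sums to one

# Weights u: decrease monotonically with the Mahalanobis distance.
u <- (nu + P) / (nu + d2)

# Assignment to the most probable cluster and the associated uncertainty.
assigned    <- max.col(z, ties.method = "first")
idx         <- cbind(seq_len(N), assigned)
uncertainty <- 1 - z[idx]

# Rule c(0, 0.9, 0.5): outside the 90% quantile region OR z below 0.5.
ruleOutliers <- c(0, 0.9, 0.5)
u_threshold  <- (nu + P) / (nu + P * qf(ruleOutliers[2], P, nu))
flagOutliers <- (u[idx] < u_threshold) | (z[idx] < ruleOutliers[3])
```

Here the first two points sit close to a cluster centre and are kept, the third lies between the clusters and the fourth far from both, so both of those fall outside the 90% quantile region of their assigned cluster and are flagged.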

If K has a length >1, the function returns an object of class flowClustList. Its data part is a list with the same length as K, each element of which is a flowClust object corresponding to a specific number of clusters. In addition, the resultant flowClustList object contains the following slots:

index An integer giving the index of the list element corresponding to the best model as selected by criterion.
criterion The criterion used to choose the best model – either "BIC" or "ICL".

Note that when a flowClustList object is used in place of a flowClust object, in most cases the list element corresponding to the best model will be extracted and passed to the method/function call.
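
How an index might be chosen when K has length greater than 1 can be sketched as below. The log-likelihood values are made up purely for illustration. The sketch assumes the "larger is better" sign convention BIC = 2 * logLike - npar * log(N) and a parameter count for an unconstrained mixture (proportions, means, covariance matrices, plus one transformation parameter when trans is TRUE); both are assumptions rather than statements about the package's internals. The function name bic_sketch is hypothetical.

```r
# Hypothetical helper: BIC under the larger-is-better convention.
bic_sketch <- function(logLike, K, P, N, trans = TRUE) {
  n_par <- (K - 1) +               # free cluster proportions
    K * P +                        # mean vectors
    K * P * (P + 1) / 2 +          # symmetric covariance matrices
    as.integer(trans)              # Box-Cox parameter, if estimated
  2 * logLike - n_par * log(N)
}

K_values <- 1:4
logLike  <- c(-5200, -4900, -4750, -4740)   # made-up values, not real output
N <- 1000; P <- 2

bic <- mapply(bic_sketch, logLike = logLike, K = K_values,
              MoreArgs = list(P = P, N = N))

# Position within K_values of the preferred model, analogous to the index slot.
best_index <- which.max(bic)
best_K     <- K_values[best_index]
```

With these illustrative numbers the gain in log-likelihood from three to four clusters is too small to offset the penalty for the extra parameters, so the third model is selected.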

Author(s)

Raphael Gottardo <raph@stat.ubc.ca>, Kenneth Lo <c.lo@stat.ubc.ca>

References

Lo, K., Brinkman, R. R. and Gottardo, R. (2008) Automated Gating of Flow Cytometry Data via Robust Model-based Clustering. Cytometry A 73, 321-332.

See Also

summary, plot, density, hist, Subset, split, ruleOutliers, Map, SimulateMixture

Examples

data(rituximab)

### cluster the data using FSC.H and SSC.H
res1 <- flowClust(rituximab, varNames=c("FSC.H", "SSC.H"), K=1)

### remove outliers before proceeding to the second stage
# %in% operator returns a logical vector indicating whether each
# of the observations lies within the cluster boundary or not
rituximab2 <- rituximab[rituximab %in% res1,]
# a shorthand for the above line
rituximab2 <- rituximab[res1,]
# this can also be done using the Subset method
rituximab2 <- Subset(rituximab, res1)

### cluster the data using FL1.H and FL3.H (with 3 clusters)
res2 <- flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=3)
show(res2)
summary(res2)

# to demonstrate the use of the split method
split(rituximab2, res2)
split(rituximab2, res2, population=list(sc1=c(1,2), sc2=3))

# to show the cluster assignment of observations
table(Map(res2))

# to show the cluster centres (i.e., the mean parameter estimates
# transformed back to the original scale)
getEstimates(res2)$locations

### demonstrate the use of various plotting methods
# a scatterplot
plot(res2, data=rituximab2, level=0.8)
# a contour / image plot
res2.den <- density(res2, data=rituximab2)
plot(res2.den)
plot(res2.den, type="image", nlevels=100)
# a histogram (1-D density) plot
hist(res2, data=rituximab2, subset="FL1.H")

# the following line illustrates how to select a subset of data 
# to perform cluster analysis through the min and max arguments;
# also note the use of level to specify a rule to call outliers
# other than the default
flowClust(rituximab2, varNames=c("FL1.H", "FL3.H"), K=3, B=100, 
    min=c(0,0), max=c(400,800), level=0.95, z.cutoff=0.5)

[Package flowClust version 1.8.1 Index]