convert {GeneticsBase} | R Documentation |
Efficiently convert strings of characters into integer codes.
convert(source, levels, byrow=FALSE, aslist=FALSE)
source |
Vector of character strings |
levels |
Vector of characters used to determine levels |
byrow |
Boolean. If FALSE (the default), return a matrix with one column per string. If TRUE, return a matrix with one row per string. |
aslist |
Boolean, return matrix (FALSE) or list of vectors (TRUE). |
This function efficiently converts character strings into vectors of
integer codes, one code per character. Its primary purpose is to allow
translation of genotypes stored as character vectors, one character
per genotype, to a factor-coded matrix. The equivalent code using
factor is considerably slower (roughly 4.7 minutes versus 17 seconds
in the timings noted there), as shown by the last section of the
example below.
The levels argument should be a vector of 1-character strings.
This vector determines the translation: each character is replaced by
the index of its match in levels. Characters not present in levels
are converted to NA.
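The lookup described above can be mimicked in base R with strsplit and match. This is only a sketch of the documented semantics, not the package's implementation, and the helper name translate_one is hypothetical.

```r
# Hypothetical helper (not part of GeneticsBase): translate one string
# into integer codes, mimicking the lookup convert() is documented to do.
translate_one <- function(s, levels) {
  chars <- strsplit(s, split = "")[[1]]  # split into single characters
  match(chars, levels)                   # index in levels, NA if absent
}

codes <- translate_one("chrx", c("c", "h", "r"))
print(codes)  # 'x' is not in levels, so its code is NA
```

Because match returns the position of the first match, duplicated entries in levels would all map to the earliest index in this sketch.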
If aslist=TRUE, the return value is a list of vectors, each
containing the translation of the corresponding input string.
If aslist=FALSE (the default), the return value is a matrix, and
byrow controls whether each string is converted into a column
(byrow=FALSE, the default) or a row (byrow=TRUE).
When byrow=FALSE, each element of the source vector is converted to
a column, and the number of rows is the number of characters in the
longest element of the source vector. Shorter strings are padded
with NA.
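When byrow=TRUE, the matrix is created with one row per element of
the source vector; the number of columns is the number of characters
in the longest element, and shorter strings are padded with NA.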
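The padding and orientation rules can be illustrated with a small base R sketch. It assumes the behaviour described above and is not the package's code; the helper name to_code_matrix is hypothetical.

```r
# Hypothetical helper (not part of GeneticsBase): build the padded code
# matrix from base R pieces, following the documented layout rules.
to_code_matrix <- function(source, levels, byrow = FALSE) {
  # One integer vector per input string
  codes <- lapply(strsplit(source, split = ""), match, table = levels)
  longest <- max(lengths(codes))
  # Extending a vector's length pads it with NA
  padded <- lapply(codes, function(v) { length(v) <- longest; v })
  # One column per string; transpose to get one row per string
  mat <- matrix(unlist(padded), nrow = longest)
  if (byrow) t(mat) else mat
}

m <- to_code_matrix(c("abc", "a"), c("a", "b", "c"))
print(m)       # 3 rows, 2 columns; second column padded with NA
mt <- to_code_matrix(c("abc", "a"), c("a", "b", "c"), byrow = TRUE)
print(dim(mt)) # 2 rows, 3 columns
```

Unlike convert, this sketch does not carry the names of the source vector over to the matrix.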
Only the first character of each element of levels is used.
Any other characters are ignored.
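A short base R illustration of this rule follows. It assumes that truncating each level to its first character reproduces the documented behaviour; it is not taken from the package source, and the variable names are made up for the example.

```r
# Assumption: only the first character of each level takes part in the
# lookup, so longer level strings behave like their first letter.
levels_in   <- c("common", "het", "rare")
levels_used <- substr(levels_in, 1, 1)   # keep the first character only
print(levels_used)

# The truncated levels then drive the usual index lookup
codes <- match(strsplit("chr", split = "")[[1]], levels_used)
print(codes)
```

Two levels that share a first character would become indistinguishable under this rule, so 1-character levels are the safer choice.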
Gregory R. Warnes warnes@bst.rochester.edu and Nitin Jain nitin.jain@pfizer.com
###
# Toy Genetics Example
###
# 'c' = 'homozygote common allele'
# 'h' = 'heterozygote'
# 'r' = 'homozygote rare allele'
marker.data <- c( m1='cchchrcr', m2='chccccrr')
marker.data
convert(marker.data, c('c','h','r'))

###
# simple test example
###
source <- c(one='abcabcabc', two='abc','ggg',buckle='aaa',my='bbb', 'shoe '='bgb ')
levels <- c('a','b','c','d')
convert(source,levels)
convert(source,levels,aslist=TRUE)
convert(source,levels,byrow=TRUE)

###
# compare efficiency with equivalent code using 'factor'
###
## Not run: 
makestr <- function(n) paste(sample(letters, size=n, replace=TRUE), sep='', collapse='')

# Evaluate expr and return the elapsed time in seconds.
# Fixing the units keeps the two timings comparable.
timeit <- function( expr ) {
  start <- Sys.time()
  expr
  end <- Sys.time()
  return( as.numeric(difftime(end, start, units='secs')) )
}

# Step 1: create a large set of character strings
x <- unlist(lapply(1:100000, function(x) makestr(1000)))

# Step 2: Time convert (~17 sec on Intel Xeon 3.0 GHz, 32 GB RAM)
newtime <- timeit( yn <- convert(x, letters) )
newtime

# old method (~4.7 min on Intel Xeon 3.0 GHz, 32 GB RAM)
oldmethod <- function(x) {
  yo <- factor(unlist(strsplit(x, split='')),levels=letters)
  attr(yo,'dim') <- c(nchar(x[1]), length(x))
  class(yo) <- 'matrix'
  yo
}
oldtime <- timeit( oldmethod(x) )
oldtime

# time difference (seconds)
oldtime - newtime
## End(Not run)