convert {GeneticsBase}    R Documentation

Efficiently convert strings of characters into integer codes

Description

Efficiently convert strings of characters into integer codes.

Usage

convert(source, levels, byrow=FALSE, aslist=FALSE)

Arguments

source Vector of character strings
levels Vector of 1-character strings that define the levels. The position of a character in this vector is its integer code.
byrow Boolean. If FALSE (the default), return a matrix with one column per string. If TRUE, return a matrix with one row per string.
aslist Boolean. If FALSE (the default), return a matrix. If TRUE, return a list of vectors.

Details

This function efficiently converts character strings into vectors of integer codes, one code per character. Its primary purpose is to translate genotypes stored as character strings, one character per genotype, into a factor-coded matrix. The equivalent code using factor is considerably slower (roughly 4.7 minutes versus 17 seconds in the timings quoted in the last section of the examples below).

The levels argument should be a vector of 1-character strings. This vector determines the translation: each character of a source string is replaced by the index of the matching element of levels. Characters not present in levels are converted to NA.
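
The translation rule can be mimicked in base R with match, which may be useful for checking results by hand. The following is a minimal sketch of the documented behaviour only, not the package's implementation; the helper name translate_one is hypothetical.

```r
# Hypothetical helper illustrating the documented rule; not part of GeneticsBase.
translate_one <- function(string, levels) {
  chars <- strsplit(string, split = "")[[1]]  # one element per character
  match(chars, levels)                        # index in 'levels', NA if absent
}

# 'x' is not in the levels, so it maps to NA
translate_one("chrx", c("c", "h", "r"))       # 1 2 3 NA
```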

Value

If aslist=TRUE, the return value is a list of vectors. Each vector contains the translation of the corresponding input string.
If aslist=FALSE (the default), the return value is a matrix. byrow controls whether each string is converted into a column (byrow=FALSE, the default) or a row (byrow=TRUE).
When byrow=FALSE, each element of the source vector is converted to a column, and the number of rows is the number of characters in the longest element of the source vector. Shorter strings are padded with NA.
When byrow=TRUE, each element of the source vector is converted to a row, and the number of columns is the number of characters in the longest element of the source vector. Shorter strings are again padded with NA.
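
The matrix layout and NA padding described here can be reproduced in base R. This is a rough sketch under the assumption that it mirrors the documented shape of the result; the function name to_matrix is hypothetical, and the sketch makes no claim about dimnames or about how convert is actually implemented.

```r
# Hypothetical base-R sketch of the documented matrix layout; not the package code.
to_matrix <- function(source, levels, byrow = FALSE) {
  # Translate every string into a vector of integer codes
  codes <- lapply(strsplit(source, split = ""), match, table = levels)
  longest <- max(lengths(codes))
  # Indexing past the end of a vector yields NA, which pads short strings
  padded <- lapply(codes, function(v) v[seq_len(longest)])
  # Fill column-wise: one column per string
  m <- matrix(unlist(padded), nrow = longest, ncol = length(codes))
  if (byrow) t(m) else m
}

m <- to_matrix(c("abc", "ab"), c("a", "b", "c"))
dim(m)   # 3 2: rows = longest string, columns = number of strings
dim(to_matrix(c("abc", "ab"), c("a", "b", "c"), byrow = TRUE))   # 2 3
```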

Note

Only the first character of each element of levels is used. Any other characters are ignored.
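
Because longer entries are silently truncated, it can be worth truncating explicitly before the call so that the levels actually in effect are visible, and so that entries sharing a first character are noticed. A small base-R sketch, with illustrative variable names (lv, effective) that are not part of the package:

```r
lv <- c("cat", "hat", "rat")        # illustrative multi-character levels
effective <- substr(lv, 1, 1)       # first characters only: "c" "h" "r"
effective

# Entries that share a first character become indistinguishable after truncation;
# anyDuplicated returns 0 when there are none.
anyDuplicated(effective)                    # 0
anyDuplicated(substr(c("ca", "cb"), 1, 1))  # 2
```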

Author(s)

Gregory R. Warnes warnes@bst.rochester.edu and Nitin Jain nitin.jain@pfizer.com

See Also

factor, as.factor

Examples

###
# Toy Genetics Example
###
# 'c' = 'homozygote common allele'
# 'h' = 'heterozygote'
# 'r' = 'homozygote rare allele'
marker.data <- c( m1='cchchrcr', m2='chccccrr')
marker.data

convert(marker.data, c('c','h','r'))

###
# simple test example
###
source <- c(one='abcabcabc', two='abc','ggg',buckle='aaa',my='bbb',
            'shoe  '='bgb  ')
levels <- c('a','b','c','d')

convert(source,levels)
convert(source,levels,aslist=TRUE)
convert(source,levels,byrow=TRUE)

###
# compare efficiency with equivalent code using 'factor'
###
## Not run: 
makestr <- function(n)
  paste(sample(letters, size=n, replace=TRUE), sep='', collapse='')

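timeit <- function( expr )
  {
    start <- Sys.time()
    expr   # force evaluation of the lazily passed expression
    end <- Sys.time()
    # fix the units to seconds so that timings are directly comparable
    return( as.numeric(difftime(end, start, units='secs')) )
  }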

# Step 1: create a large set of character strings
x <- unlist(lapply(1:100000, function(x) makestr(1000)))

# Step 2: Time convert  (~17 sec on Intel Xeon 3.0 GHz, 32 GB RAM)
newtime <- timeit( yn <- convert(x, letters) )
newtime

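# old method  (~4.7 min on Intel Xeon 3.0 GHz, 32 GB RAM)
# assumes all strings in x have the same length
oldmethod <- function(x)
  {
    y1 <- factor(unlist(strsplit(x, split='')),levels=letters)
    attr(y1,'dim') <- c(nchar(x[1]), length(x))
    class(y1) <- 'matrix'
    y1   # return the coded matrix rather than the value of the assignment
  }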

oldtime <- timeit( oldmethod(x) )
oldtime

# time difference
oldtime - newtime
## End(Not run)


[Package GeneticsBase version 1.8.0 Index]