read.snps.long {snpMatrix} | R Documentation |
Reads SNP data when organized in free format
as one call per line. Other than the one
call per line requirement, there is considerable flexibility. Multiple
input files can be read, the input fields can be in any order on the
line, and irrelevant fields can be skipped. The samples and SNPs
to be read must be pre-specified, and define rows and columns of an
output object of class "snp.matrix".
read.snps.long(files, sample.id = NULL, snp.id = NULL, female = NULL,
               fields = c(sample = 1, snp = 2, genotype = 3, confidence = 4),
               codes = c("0", "1", "2"), threshold = 0.9, lower = TRUE,
               sep = " ", comment = "#", skip = 0,
               simplify = c(FALSE, FALSE), verbose = FALSE, every = 1000)
files |
A character vector giving the names of the input files |
sample.id |
A character vector giving the identifiers of the samples to be read |
snp.id |
A character vector giving the names of the SNPs to be read |
female |
If the SNPs are on the X chromosome and the data are to
be read as such, this logical vector (of the same length as
sample.id) should specify whether each sample was from a
female subject |
fields |
An integer vector with named elements specifying the
positions of the required fields in the input record. The fields are
identified by the names sample and snp for the sample
and SNP identifier fields, confidence for a call confidence
score (if present) and either genotype if genotype calls
occur as a single field, or allele1 and allele2 if the
two alleles are coded in different fields |
codes |
Either the single string "nucleotide", denoting coding
in terms of nucleotides (A, C, G or T, case insensitive),
or a character vector giving genotype or allele codes (see below) |
threshold |
A numerical value for the calling threshold on the confidence score |
lower |
If TRUE, then threshold represents a lower
bound. Otherwise it is an upper bound |
sep |
The delimiting character separating fields in the input record |
comment |
A character denoting that any remaining input on a line is to be ignored |
skip |
An integer value specifying how many lines are to be skipped at the beginning of each data file |
simplify |
If TRUE, sample and SNP identifying strings
will be shortened by removal of any common leading or trailing
sequences when they are used as row and column names of the output
snp.matrix |
verbose |
If TRUE, a progress report is generated each time the
number of lines given by every has been read |
every |
The number of lines of data read between progress reports (see verbose) |
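As an illustration of how the arguments fit together, the following minimal sketch writes a toy long-format file with base R and then reads it back. The sample and SNP names, the genotype codes and the confidence scores are all made up for the example, and the call to read.snps.long is only attempted if the snpMatrix package happens to be installed.

```r
# Hypothetical toy data: 2 samples x 3 SNPs, one call per line.
# Fields on each line: sample, snp, genotype code, confidence score.
samples <- c("S1", "S2")
snps <- c("rs1", "rs2", "rs3")

# expand.grid varies its first argument fastest, so SNP varies fastest here
long <- expand.grid(snp = snps, sample = samples, stringsAsFactors = FALSE)
long$genotype <- c("0", "1", "2", "1", "1", "0")
long$confidence <- c(0.99, 0.95, 0.50, 0.97, 0.92, 0.99)  # S1/rs3 is below 0.9

# Write space-separated records in the order sample, snp, genotype, confidence
infile <- tempfile(fileext = ".txt")
writeLines(paste(long$sample, long$snp, long$genotype, long$confidence), infile)

# Only attempt the read if snpMatrix is available on this system
if (requireNamespace("snpMatrix", quietly = TRUE)) {
  geno <- snpMatrix::read.snps.long(
    infile, sample.id = samples, snp.id = snps,
    fields = c(sample = 1, snp = 2, genotype = 3, confidence = 4),
    codes = c("0", "1", "2"), threshold = 0.9, lower = TRUE, sep = " ")
  print(geno)
}
```

The sample.id and snp.id vectors are given in the same order in which they vary in the file, as the function requires.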
If nucleotide coding is not used, the codes argument should be a
character array giving the valid codes. For genotype coding of
autosomal SNPs, this should be an array of length 3 giving the codes
for the three genotypes, in the order homozygous (AA), heterozygous
(AB), homozygous (BB). All other codes will be treated as "no call".
The default codes are "0", "1", "2". For X SNPs, males are assumed to
be coded as homozygous, unless an additional two codes are supplied
(representing the AY and BY genotypes). For allele coding, the codes
array should be of length 2 and should specify the codes for the two
alleles. Again, any other code is treated as "missing" and, for X
SNPs, males should be coded either as homozygous or by omission of
the second allele.
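The coding rule for autosomal genotype codes can be pictured with a few lines of base R. This is only a sketch that mimics the documented behaviour, not the package's internal implementation; the code strings "AA", "AB", "BB" and "NN" and the field positions are hypothetical.

```r
# Hypothetical genotype codes, in the documented order: hom (AA), het (AB), hom (BB)
codes <- c("AA", "AB", "BB")

# Calls as they might appear in the genotype field; "NN" is not a listed code
calls <- c("AA", "BB", "AB", "NN", "AA")

# Position of each call in codes; unlisted codes give NA, i.e. "no call"
index <- match(calls, codes)
index

# Argument vectors for the two input styles (field positions are assumptions)
genotype_fields <- c(sample = 1, snp = 2, genotype = 3)
allele_fields   <- c(sample = 1, snp = 2, allele1 = 3, allele2 = 4)
allele_codes    <- c("A", "B")  # length 2: one code per allele
```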
Although the function allows for reading of data for the X chromosome
directly into an object of class "X.snp.matrix", it will often be
preferable to read such data as a "snp.matrix" (i.e. as autosomal) and
to coerce it to an object of type "X.snp.matrix" later using
as(..., "X.snp.matrix") or new("X.snp.matrix", ..., female=...). If
sex is coded NA for any subject the latter course must be followed,
since NAs are not accepted in the female argument.
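A quick base R check can show in advance whether the female argument is usable. The sketch below assumes a hypothetical vector of sex codes ("F", "M" or NA); it only decides which route applies and does not call the package.

```r
# Hypothetical sex codes for four samples; NA marks unknown sex
sex <- c("F", "M", NA, "F")
female <- sex == "F"  # TRUE, FALSE, NA, TRUE

# NAs are not accepted in the female argument, so if any are present the data
# should be read as a "snp.matrix" and coerced to "X.snp.matrix" afterwards
read_as_x_directly <- !anyNA(female)
read_as_x_directly
```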
The vectors sample.id and snp.id must be in the same order as they
vary on the input file(s) and this ordering must be consistent.
However, there is no requirement that either SNP or sample should
vary fastest; this is detected from the input.
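One way to obtain identifier vectors in the required order is to take them from the data in order of first appearance. The following is a minimal base R sketch under the assumption of a small, hypothetical set of space-separated records; the check for which identifier varies fastest is a simple illustration and not the detection logic used by the function.

```r
# Hypothetical long-format records: sample, snp, genotype
records <- c("S1 rs1 0", "S1 rs2 1", "S2 rs1 2", "S2 rs2 1")
dat <- read.table(text = records,
                  col.names = c("sample", "snp", "genotype"),
                  colClasses = "character")

# unique() keeps the order of first appearance, matching the file order
sample.id <- unique(dat$sample)
snp.id <- unique(dat$snp)

# If the first two records share a sample, the SNP is varying fastest
# (assumes at least two records and more than one SNP per sample)
snp_varies_fastest <- dat$sample[1] == dat$sample[2]
```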
Each file may represent a separate sample or SNP, in which case the
appropriate .id argument can be omitted and row or column names
taken from the file names.
An object of class "snp.matrix" or "X.snp.matrix".
The function will read gzipped files.
This function has replaced an earlier version which was much less
flexible. Because not all features have been fully tested, the older
version has been retained as read.snps.long.old.
Every combination of sample and SNP listed in the sample.id and snp.id
arguments must be present in the input file(s). Otherwise the function
will search for any missing observation until reaching the end of the
data, ignoring everything else on the way.
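Because a missing combination causes the rest of the data to be skipped, it may be worth checking for gaps before reading. The sketch below does this in base R on a hypothetical set of records in which one call has been left out deliberately; the identifiers are assumptions made for the example.

```r
# Hypothetical records; the S2/rs2 call is deliberately absent
dat <- read.table(text = c("S1 rs1 0", "S1 rs2 1", "S2 rs1 2"),
                  col.names = c("sample", "snp", "genotype"),
                  colClasses = "character")
sample.id <- c("S1", "S2")
snp.id <- c("rs1", "rs2")

# Every sample/SNP pair that the function would expect to find
expected <- expand.grid(snp = snp.id, sample = sample.id,
                        stringsAsFactors = FALSE)
expected_keys <- paste(expected$sample, expected$snp)

# Pairs actually present in the data
observed_keys <- paste(dat$sample, dat$snp)

# Expected pairs with no matching record
missing_keys <- setdiff(expected_keys, observed_keys)
missing_keys
```

An empty result would indicate that every listed combination is present at least once.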
David Clayton david.clayton@cimr.cam.ac.uk
read.HapMap.data, read.snps.pedfile, read.snps.chiamo, read.plink,
snp.matrix-class, X.snp.matrix-class