readFastq {ShortRead}    R Documentation

Read FASTQ-formatted files into compact R representations

Description

readFastq reads all FASTQ-formatted files in the directory dirPath whose file names match pattern, returning a compact internal representation of the sequences and quality scores in the files. Methods read all matching files into a single R object; a typical use is to restrict input to a single FASTQ file.

Usage


readFastq(dirPath, pattern=character(0), ...)

Arguments

dirPath A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of FASTQ files to be read.
pattern The (grep-style) pattern describing file names to be read. The default (character(0)) results in (attempted) input of all files in the directory.
... Additional arguments, perhaps used by methods.

Details

The fastq format is not precisely defined. The basic definition used here parses the following four lines as a single record:

    @HWI-EAS88_1_1_1_1001_499
    GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
    +HWI-EAS88_1_1_1_1001_499
    ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS

The first and third lines are identifiers, preceded by @ and + respectively (the identifiers are identical, in the case of Solexa). The second line is an upper-case sequence of nucleotides. The parser recognizes the IUPAC-standard alphabet (hence ambiguous nucleotides), coercing . to - to represent missing values. The final line is an ASCII-encoded representation of quality scores, with one ASCII character per nucleotide.
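The record layout described above can be sketched in base R. This is only an illustration under stated assumptions, not the parser used by readFastq: the helper name parse_record and the toy record are hypothetical, and the alphabet check is a simplified stand-in for the IUPAC alphabet in Biostrings.

```r
## Hypothetical helper (not part of ShortRead): split one four-line
## FASTQ record into its identifier, sequence, and quality string.
parse_record <- function(lines) {
  stopifnot(
    length(lines) == 4L,
    startsWith(lines[1], "@"),           # line 1: '@' + identifier
    startsWith(lines[3], "+"),           # line 3: '+' + identifier
    ## simplified IUPAC check: nucleotide codes plus '.' and '-'
    grepl("^[ACGTMRWSYKVHDBN.-]*$", lines[2]),
    nchar(lines[2]) == nchar(lines[4])   # one quality character per base
  )
  list(
    id      = substring(lines[1], 2L),
    ## coerce '.' to '-' to represent missing values
    sread   = gsub(".", "-", lines[2], fixed = TRUE),
    quality = lines[4]
  )
}

## Toy record, made up for illustration
rec <- parse_record(c("@read_1", "AC.TN", "+read_1", "]]Y]V"))
rec$id
rec$sread
```

The toy record contains a . so that the coercion to - is visible; a record with mismatched sequence and quality lengths would stop with an error in this sketch.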

The encoding implicit in Solexa-derived fastq files is that each character code corresponds to a score equal to the ASCII character value minus 64 (e.g., ASCII @ is decimal 64, and corresponds to a Solexa quality score of 0). This is different from BioPerl, for instance, which recovers quality scores by subtracting 33 from the ASCII character value (so that, for instance, !, with decimal value 33, encodes value 0).
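The two offsets described above can be compared with a few lines of base R. This is a minimal sketch for illustration only; decode_quality is a hypothetical helper, not a ShortRead function, and it simply subtracts a fixed offset from each ASCII code.

```r
## Hypothetical helper (not part of ShortRead): convert an ASCII-encoded
## quality string to integer scores by subtracting a fixed offset.
decode_quality <- function(qual, offset = 64L) {
  stopifnot(is.character(qual), length(qual) == 1L)
  utf8ToInt(qual) - offset
}

decode_quality("@")               # Solexa convention: '@' (64) gives 0
decode_quality("!", offset = 33L) # BioPerl convention: '!' (33) gives 0
decode_quality("]]Y")             # ']' is 93, 'Y' is 89: 29 29 25
```

Decoding the same string with the wrong offset shifts every score by 31, so the convention used by the source of the file needs to be known before scores are interpreted.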

The BioPerl description of fastq asserts that the first character of line 4 is a !, but the current parser does not support this convention.

Value

A single R object (e.g., ShortReadQ) containing the sequences and qualities from all files in dirPath matching pattern. There is no guarantee of the order in which files are read.

Author(s)

Martin Morgan

See Also

The IUPAC alphabet in Biostrings.

http://www.bioperl.org/wiki/FASTQ_sequence_format for the BioPerl definition of fastq.

Solexa documentation 'Data analysis - documentation : Pipeline output and visualisation'.

Examples

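## List the methods defined on the readFastq generic
showMethods("readFastq")

## Read a single FASTQ file from the example data's analysis directory
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
sread(rfq)    # nucleotide sequences
id(rfq)       # read identifiers
quality(rfq)  # ASCII-encoded quality scores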

## SolexaPath method 'knows' where FASTQ files are placed
rfq1 <- readFastq(sp, pattern="s_1_sequence.txt")
rfq1

[Package ShortRead version 1.0.7 Index]