
Fastq parsing


Summary: This example shows how to parse and write FASTQ-formatted files with BioFSharp.

This module parses FASTQ data in the original four-lines-per-entry layout into the following record type:

/// FastqItem record contains the header, sequence, quality header and quality sequence of one entry
type FastqItem<'a,'b> = {
    Header          : string
    Sequence        : 'a
    QualityHeader   : string
    QualitySequence : 'b
}

To use the parser you need to supply two converter functions: one for the sequence line and one for the quality line. The module contains an example of each, but you may need to write your own.
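To make the role of the two converters concrete, here is a minimal, self-contained sketch of how a four-line entry could be mapped onto the record. It is not BioFSharp's implementation: `parseEntries` is a hypothetical helper, the record type is redeclared locally so the snippet runs without the library, and the header lines are kept verbatim, which may differ from what the library does.

```fsharp
// Local copy of the record type so the sketch runs without BioFSharp.
type FastqItem<'a,'b> = {
    Header          : string
    Sequence        : 'a
    QualityHeader   : string
    QualitySequence : 'b
}

/// Hypothetical helper (an assumption, not part of BioFSharp):
/// groups lines into chunks of four and applies the two converters.
let parseEntries (converter: string -> 'a)
                 (qualityConverter: string -> 'b)
                 (lines: seq<string>) =
    lines
    |> Seq.chunkBySize 4
    // Ignore a trailing, incomplete entry.
    |> Seq.filter (fun chunk -> chunk.Length = 4)
    |> Seq.map (fun chunk ->
        { Header          = chunk.[0]
          Sequence        = converter chunk.[1]
          QualityHeader   = chunk.[2]
          QualitySequence = qualityConverter chunk.[3] })

// Sequence converter: keep the raw characters.
let toChars (s: string) = s.ToCharArray()

// Quality converter: Phred+33 decoding.
let toPhred (q: string) = q.ToCharArray() |> Array.map (fun c -> int c - 33)

let example =
    [ "@read1"; "GATTACA"; "+"; "IIIIIII" ]
    |> parseEntries toChars toPhred
    |> Seq.toList
```

Because both converters are ordinary functions, the same parsing logic can produce character arrays, typed biological sequences, or any other representation without changing the parser itself.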

If your quality sequences can contain any of the following characters:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

that is, the Sanger format, which encodes Phred quality scores from 0 to 93 as ASCII characters 33 to 126, then you can use our conversion function:

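/// Get the Phred quality scores of a Sanger-encoded (ASCII offset 33) quality string
let qualityConvertFn (qualityString: string) =
    qualityString.ToCharArray()
    // Subtract the ASCII offset 33 to recover each score
    |> Array.map (fun c -> int c - 33)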
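As a quick check of the offset arithmetic, the following sketch decodes a short quality string and converts a score into the error probability it stands for, using the standard Phred relation P = 10^(-Q/10). The names `phred33` and `errorProbability` are illustrative and not part of BioFSharp.

```fsharp
/// Decode a Sanger (Phred+33) quality string into Phred scores.
let phred33 (quality: string) =
    quality.ToCharArray()
    |> Array.map (fun c -> int c - 33)

/// Error probability implied by a Phred score: P = 10^(-Q/10).
let errorProbability (q: int) =
    10.0 ** (-(float q) / 10.0)

// '!' is ASCII 33, '5' is 53, 'I' is 73 and '~' is 126,
// so the decoded scores are 0, 20, 40 and 93.
let scores = phred33 "!5I~"

// A score of 20 corresponds to an error probability of about 0.01.
let p20 = errorProbability 20
```

The lowest and highest characters of the range map to 0 and 93, matching the bounds stated above.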

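You can then read your FASTQ file with FastQ.fromFile, which takes the sequence converter, the quality converter and the file path, and returns a sequence of FastqItem records: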

open BioFSharp
open BioFSharp.IO

let yourFastqFile = (__SOURCE_DIRECTORY__ + "/data/FastQtest.fastq")

let FastQSequence = 
    FastQ.fromFile BioArray.ofAminoAcidString qualityConvertFn yourFastqFile