GFF3 parsing

Summary: This example shows how to parse and write GFF3-formatted files with BioFSharp.

Generic Feature Format Version 3

In GFF3 files, every line represents one genomic feature described by nine tab-delimited fields. Field 9 can hold an unlimited number of key-value pairs. Multiple features can be linked into one genomic unit using the 'Parent' tag.

The following is an example GFF file (a modified version of saccharomyces_cerevisiae.gff):

##gff-version 3
# date Mon Feb  7 19:35:06 2005
chrI  SGD  gene  335  649  .  +  .  ID=YAL069W;Name=YAL069W;Ontology_term=GO:0000004,GO:0005554,GO:0008372;Note=Hypothetical%20ORF;dbxref=SGD:S000002143;orf_classification=Dubious
chrI  SGD  CDS  335  649  .  +  0  Parent=YAL069W;Name=YAL069W;Ontology_term=GO:0000004,GO:0005554,GO:0008372;Note=Hypothetical%20ORF;dbxref=SGD:S000002143;orf_classification=Dubious
###
##FASTA
>chrI
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
TGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCAC
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGG
TCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAAT
ATTGTATAACTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
AATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAACAATAATACATAAACATATTG

Directives (marked with "##[...]") provide additional information such as the GFF version, which has to be declared in the first line of each file ("##gff-version 3[...]"). Comment lines start with a single "#[...]". Sequences in FastA format can be attached at the end of the file; this has to be announced by a "##FASTA" directive line.
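The line types described above can be told apart with a few string operations. The following is a minimal sketch using only the .NET standard library; the type RawLine and the functions classifyLine and parseAttributes are hypothetical names introduced for illustration and are not part of BioFSharp. Handling of the lines that follow a "##FASTA" directive is omitted.

```fsharp
open System

// Hypothetical types for illustration; these are not BioFSharp's GFFLine cases.
type RawLine =
    | DirectiveLine of string
    | CommentLine of string
    | EntryFields of string []
    | InvalidLine of string

// Classify a single line by its prefix and, for feature lines, its field count.
let classifyLine (line: string) : RawLine =
    if line.StartsWith("##", StringComparison.Ordinal) then DirectiveLine line
    elif line.StartsWith("#", StringComparison.Ordinal) then CommentLine line
    else
        let fields = line.Split([| '\t' |])
        if fields.Length = 9 then EntryFields fields else InvalidLine line

// Split field 9 into a key -> values map; multiple values are comma-separated.
let parseAttributes (field9: string) : Map<string, string list> =
    field9.Split([| ';' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun pair ->
        match pair.Split([| '=' |], 2) with
        | [| key; value |] -> Some (key, List.ofArray (value.Split([| ',' |])))
        | _ -> None)
    |> Map.ofArray

let exampleLine = "chrI\tSGD\tCDS\t335\t649\t.\t+\t0\tParent=YAL069W;Name=YAL069W"

// Look up the Parent tag of the example feature line.
let parentOfExample =
    match classifyLine exampleLine with
    | EntryFields fields -> Map.tryFind "Parent" (parseAttributes fields.[8])
    | _ -> None
```

For real files, the reader functions of BioFSharp shown in the next section should be preferred; the sketch only illustrates the structure of the format.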

For further information visit GFF3-Specifications.

Reading GFF3 files

To read a GFF3 file, pass a file path and, optionally, a FastA converter. For further information about FastA, check the FastA section or visit API Reference - FastA.

open BioFSharp
open BioFSharp.IO

//path of the input file
let filepathGFF = (__SOURCE_DIRECTORY__ + "/data/gff3Example.gff")

//reads the file into a seq of GFFLines
//If the file contains no FastA sequence, you can use GFF3.fromFileWithoutFasta filepathGFF directly.
let features = GFF3.fromFile BioFSharp.BioArray.ofNucleotideString filepathGFF 
seq
[Directive "##gff-version 3"; Comment "# date Mon Feb  7 19:35:06 2005";
 GFFEntryLine
   { Seqid = "chrI"
     Source = "SGD"
     Feature = "gene"
     StartPos = 335
     EndPos = 649
     Score = nan
     Strand = '+'
     Phase = -1
     Attributes =
      map
        [("ID", ["YAL069W"]); ("Name", ["YAL069W"]);
         ("Note", ["Hypothetical%20ORF"]);
         ("Ontology_term", ["GO:0000004"; "GO:0005554"; "GO:0008372"]);
         ("dbxref", ["SGD:S000002143"]);
         ("orf_classification", ["Dubious"])]
     Supplement = [|"No supplement"|] };
 GFFEntryLine{
     ...
]

How to use GFF3SanityCheck

The GFF3SanityCheck prints whether your GFF3 file is valid. It reports all errors it finds, together with the lines in which they occurred. In contrast to GFF2, field 3 (type, feature or method) of a GFF3 entry is restricted to terms defined by the Sequence Ontology (SO), so the validator can also check whether each entry uses a valid SO term. New versions of the SO are available on SourceForge (https://sourceforge.net/projects/song/files/SO_Feature_Annotation).

//to validate the GFF file without SO term verification, pass only the filepath
let featuresSanityCheck = GFF3.sanityCheck filepathGFF
//path, name and version of the 'Sequence Ontology terms' file
let filepathSO_Terms = (__SOURCE_DIRECTORY__ + "/data/Sequence_Ontology_Terms_2_5_3.txt")
//to validate the GFF file including SO term verification, pass the SO terms filepath and the GFF filepath
let featuresSanityCheckWithSOTerm = GFF3.sanityCheckWithSOTerm filepathSO_Terms filepathGFF
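To give an idea of what such a validation involves, here is a minimal sketch of a line-based check that collects error messages with line indices. It is not the implementation behind GFF3.sanityCheck; the function validateLine and the small allowedTypes set (a stand-in for a full SO term list) are assumptions made for illustration.

```fsharp
open System

// Hypothetical stand-in for the terms loaded from a Sequence Ontology file.
let allowedTypes = set [ "gene"; "CDS"; "mRNA"; "exon" ]

// Returns the error messages for one line; an empty list means no error was found.
let validateLine (index: int) (line: string) : string list =
    if line.StartsWith("#", StringComparison.Ordinal) then
        [] // directives and comments are not checked in this sketch
    else
        let fields = line.Split([| '\t' |])
        if fields.Length <> 9 then
            [ sprintf "line %i: expected 9 fields, found %i" index fields.Length ]
        else
            [ // field 3 has to be a known feature type
              if not (allowedTypes.Contains fields.[2]) then
                  yield sprintf "line %i: '%s' is not a known feature type" index fields.[2]
              // fields 4 and 5 have to be integers with start <= end
              match Int32.TryParse(fields.[3]), Int32.TryParse(fields.[4]) with
              | (true, startPos), (true, endPos) when startPos <= endPos -> ()
              | _ -> yield sprintf "line %i: invalid start/end coordinates" index ]

// Two example lines: the first is well-formed, the second has two problems.
let exampleErrors =
    [ "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W"
      "chrI\tSGD\tbanana\t649\t335\t.\t+\t.\tID=x" ]
    |> List.mapi (fun i line -> validateLine (i + 1) line)
    |> List.concat
```

Collecting messages in a list instead of printing them directly keeps the check easy to test; the library functions above print their findings instead.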

How to use GFF3RelationshipSearch

You can also run a simple search for "Parent - child of" relationships. It returns all genomic features that contain the search term in their ID/Id or Parent field.

///Term to search for:
let searchterm = "YAL069W"
///Searches for features that are related to the searchterm
let gffExampleSearch = GFF3.relationshipSearch features searchterm
seq
[{ Seqid = "chrI"
   Source = "SGD"
   Feature = "gene"
   StartPos = 335
   EndPos = 649
   Score = nan
   Strand = '+'
   Phase = -1
   Attributes =
    map
      [("ID", ["YAL069W"]); ("Name", ["YAL069W"]);
       ("Note", ["Hypothetical%20ORF"]);
       ("Ontology_term", ["GO:0000004"; "GO:0005554"; "GO:0008372"]);
       ("dbxref", ["SGD:S000002143"]); ("orf_classification", ["Dubious"])]
   Supplement = [|"No supplement"|] };
 { Seqid = "chrI"
   Source = "SGD"
   Feature = "CDS"
   StartPos = 335
   EndPos = 649
   Score = nan
   Strand = '+'
   Phase = 0
   Attributes =
    map
      [("Name", ["YAL069W"]); ("Note", ["Hypothetical%20ORF"]);
       ("Ontology_term", ["GO:0000004"; "GO:0005554"; "GO:0008372"]);
       ("Parent", ["YAL069W"]); ("dbxref", ["SGD:S000002143"]);
       ("orf_classification", ["Dubious"])]
   Supplement = [|"No supplement"|] }]
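The idea behind such a search can be sketched as a filter over the attribute maps. The record type SimpleEntry and the function relatedTo below are hypothetical and much simpler than BioFSharp's GFFEntry and GFF3.relationshipSearch; the sketch assumes an exact match of the search term against the values of the ID, Id or Parent tag.

```fsharp
// Hypothetical, reduced entry type; BioFSharp's GFFEntry has more fields.
type SimpleEntry =
    { Feature: string
      Attributes: Map<string, string list> }

// Keep the entries whose ID, Id or Parent tag lists the search term.
let relatedTo (term: string) (entries: SimpleEntry seq) : SimpleEntry seq =
    entries
    |> Seq.filter (fun entry ->
        [ "ID"; "Id"; "Parent" ]
        |> List.exists (fun key ->
            match Map.tryFind key entry.Attributes with
            | Some values -> List.contains term values
            | None -> false))

let exampleEntries =
    [ { Feature = "gene"; Attributes = Map.ofList [ ("ID", [ "YAL069W" ]) ] }
      { Feature = "CDS"; Attributes = Map.ofList [ ("Parent", [ "YAL069W" ]) ] }
      { Feature = "gene"; Attributes = Map.ofList [ ("ID", [ "otherGene" ]) ] } ]

// Feature types of all entries related to "YAL069W".
let relatedFeatures =
    relatedTo "YAL069W" exampleEntries
    |> Seq.map (fun entry -> entry.Feature)
    |> List.ofSeq
```

Here relatedFeatures contains the gene and its CDS child, but not the unrelated third entry.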

Writing GFF3 files

To write a sequence of GFFLine<_> to a file, use the following function. If FastA sequences are included, they are appended by the FastA writer described in the API Reference - FastA.

Note: The order of the key-value pairs in field 9 (attributes) may change.
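A likely reason for the changed order is that the attributes are held in an F# Map, as the printed records above suggest, and a Map enumerates its entries sorted by key rather than in insertion order. The following sketch illustrates the effect; writeField9 is a hypothetical helper and not the serializer used by GFF3.write.

```fsharp
// Hypothetical helper: serialize an attribute map back into a field 9 string.
let writeField9 (attributes: Map<string, string list>) : string =
    attributes
    |> Map.toSeq // yields the pairs sorted by key (ordinal string comparison)
    |> Seq.map (fun (key, values) -> key + "=" + String.concat "," values)
    |> String.concat ";"

// The keys are inserted in the order Parent, Name, dbxref ...
let exampleAttributes =
    Map.ofList
        [ ("Parent", [ "YAL069W" ])
          ("Name", [ "YAL069W" ])
          ("dbxref", [ "SGD:S000002143" ]) ]

// ... but are written in key order: Name, Parent, dbxref.
let exampleField9 = writeField9 exampleAttributes
```

Because the comparison is ordinal, upper-case keys such as "Parent" sort before lower-case keys such as "dbxref", which matches the order seen in the printed CDS entry above.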

///Takes a FastA converter, a destination filepath and a seq<GFFLine<'a>>, and writes the lines to a .gff file. Hint: Use converter = id if no FastA sequence is included.
let gffExampleWrite = GFF3.write BioItem.symbol (__SOURCE_DIRECTORY__ + "/data/gffExampleWrite.gff") features

Example: Sequence of CDS

If the GFF file includes a FastA sequence, you can look up the sequence of a CDS feature using the following function.

let firstCDS = 
    //get GFFEntries
    let filteredGFFEntries = 
        features 
        |> Seq.choose (fun x ->    
            match x with
            | GFF3.GFFEntryLine x -> Some x
            | _ -> None)

    //get all CDS features
    let filteredCDSFeatures =
        filteredGFFEntries
        |> Seq.filter (fun x -> x.Feature = "CDS")

    filteredCDSFeatures |> Seq.head
let firstCDSSequence = GFF3.getSequence firstCDS features
seq [A; T; G; A; ...]
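The lookup relies on the coordinate convention of GFF3: start and end positions are 1-based and inclusive, whereas F# arrays are 0-based. A minimal sketch of the index arithmetic, assuming a forward-strand feature and using the hypothetical helper sliceFeature on a made-up ten-letter sequence (this is not the implementation of GFF3.getSequence, and reverse-strand handling is omitted):

```fsharp
// GFF3 positions are 1-based and inclusive, so both bounds are shifted by one.
let sliceFeature (startPos: int) (endPos: int) (sequence: 'a []) : 'a [] =
    sequence.[startPos - 1 .. endPos - 1]

// Made-up sequence for illustration; positions 3 to 8 cover "ATGAAA".
let exampleChromosome = "TTATGAAACC".ToCharArray()

let exampleCds = System.String(sliceFeature 3 8 exampleChromosome)
```

Applied to the example file, the same arithmetic with StartPos = 335 and EndPos = 649 selects the stretch of chrI that begins with A, T, G, A, in agreement with the output shown above.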