GFF3 parsing
Summary: This example shows how to parse and write GFF3-formatted files with BioFSharp.
Generic Feature Format Version 3
In GFF3 files, every line represents one genomic feature with nine tab-delimited fields. Field 9 can hold any number of key-value pairs. Multiple features can be linked to genomic units using the 'Parent' tag.
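To make the nine-field layout concrete, here is a minimal sketch that splits a single feature line by hand. The record `SketchFeature`, the functions `parseAttributes` and `tryParseFeature`, and the sample line are hypothetical and only for illustration. They are not part of BioFSharp, and the sketch does not handle the URL escaping that GFF3 allows.

```fsharp
// Hypothetical record for illustration; not BioFSharp's own GFF3 entry type.
type SketchFeature =
    { Seqid      : string
      Source     : string
      Feature    : string
      StartPos   : int
      EndPos     : int
      Score      : string
      Strand     : string
      Phase      : string
      Attributes : Map<string, string list> }

// Splits field 9 ("key=v1,v2;key2=v3") into a map from key to list of values.
let parseAttributes (field9 : string) : Map<string, string list> =
    field9.Split([| ';' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.choose (fun pair ->
        match pair.Split([| '=' |], 2) with
        | [| key; value |] -> Some (key.Trim(), List.ofArray (value.Split(',')))
        | _ -> None)
    |> Map.ofArray

// Returns None unless the line has exactly nine tab-delimited fields.
// Note: `int` raises an exception if the start or end field is not a number.
let tryParseFeature (line : string) : SketchFeature option =
    match line.Split('\t') with
    | [| seqid; source; feature; startPos; endPos; score; strand; phase; attributes |] ->
        Some { Seqid = seqid; Source = source; Feature = feature
               StartPos = int startPos; EndPos = int endPos
               Score = score; Strand = strand; Phase = phase
               Attributes = parseAttributes attributes }
    | _ -> None

// Illustrative line, not taken from the example file.
let sampleLine = "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W;Name=YAL069W;Note=a,b"
let sampleFeature = tryParseFeature sampleLine
```

Storing the attribute values as lists reflects that a single key in field 9 may carry several comma-separated values.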
In the following you can see a GFF file example (modified version of saccharomyces_cerevisiae.gff):
Directives (lines starting with "##") provide additional information such as the GFF version. The version directive ("##gff-version 3[...]") has to be the first line of every file. Comment lines start with a single "#". Sequences in FastA format may be appended at the end of the file, in which case they have to be announced by a "##FASTA" directive line.
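The line kinds described above can be told apart by their prefix alone. The following is a minimal sketch of such a classification. The type `SketchLineKind` and the functions `classifyLine` and `hasVersionHeader` are hypothetical names introduced here, not BioFSharp's API, and the sample lines are made up.

```fsharp
// Hypothetical line classification for illustration; not BioFSharp's own type.
type SketchLineKind =
    | Directive of string     // text after the leading "##"
    | Comment of string       // text after the leading "#"
    | FastaStart              // the "##FASTA" directive
    | FeatureLine of string   // anything else is treated as a feature line

let private startsWith (prefix : string) (line : string) =
    line.StartsWith(prefix, System.StringComparison.Ordinal)

// Order matters: "##FASTA" is tested before "##", and "##" before "#".
let classifyLine (line : string) : SketchLineKind =
    if startsWith "##FASTA" line then FastaStart
    elif startsWith "##" line then Directive (line.Substring 2)
    elif startsWith "#" line then Comment (line.Substring 1)
    else FeatureLine line

// Checks that the first line is the version directive.
let hasVersionHeader (lines : seq<string>) : bool =
    match Seq.tryHead lines with
    | Some first -> startsWith "##gff-version 3" first
    | None -> false

// Made-up sample lines.
let sketchLines =
    [ "##gff-version 3"
      "# a comment"
      "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W"
      "##FASTA" ]

let sketchKinds = sketchLines |> List.map classifyLine
```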
For further information visit GFF3-Specifications.
Reading GFF3 files
To read a GFF3 file, you have to pass a file path and, optionally, a FastA converter. For further information about FastA, check the FastA section or visit API Reference - FastA.
open BioFSharp
open BioFSharp.IO
//path of the input file
let filepathGFF = (__SOURCE_DIRECTORY__ + "/data/gff3Example.gff")
//reads from file to seq of GFFLines
//If no FASTA sequence is included, you can use GFF3.fromFileWithoutFasta [filepathGFF] directly.
let features = GFF3.fromFile BioFSharp.BioArray.ofNucleotideString filepathGFF
How to use GFF3SanityCheck
The GFF3SanityCheck prints whether your GFF3 file is valid. It reports all detected errors together with the lines in which they occurred. In contrast to GFF2, field 3 (type, feature or method) of a GFF3 entry is restricted to terms defined by the Sequence Ontology (SO), so this validator can also check whether the entry is a valid SO term. You can find new versions of the SO at https://sourceforge.net/projects/song/files/SO_Feature_Annotation.
//to validate the GFF file without SOTerm verification use this function and only insert the filepath
let featuresSanityCheck = GFF3.sanityCheck filepathGFF
//path, name and version of the 'Sequence Ontology terms'-file
let filepathSO_Terms = (__SOURCE_DIRECTORY__ + "/data/Sequence_Ontology_Terms_2_5_3.txt")
//to validate the GFF file including SO term verification, pass the path of the SO terms file followed by the GFF filepath
let featuresSanityCheckWithSOTerm = GFF3.sanityCheckWithSOTerm filepathSO_Terms filepathGFF
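The idea behind the SO term check can be sketched independently of the library. The function `findInvalidTypes` below is hypothetical and is not how `GFF3.sanityCheckWithSOTerm` is implemented. It assumes the allowed terms are already available as a set of names, and it ignores SO accession numbers and the format of the SO terms file.

```fsharp
// Reports (line number, message) pairs for feature lines whose field 3
// is not in the given set of allowed terms. Stops at the "##FASTA" directive.
let findInvalidTypes (soTerms : Set<string>) (lines : seq<string>) : (int * string) list =
    lines
    |> Seq.mapi (fun index line -> index + 1, line)
    |> Seq.takeWhile (fun (_, line) ->
        not (line.StartsWith("##FASTA", System.StringComparison.Ordinal)))
    // skip blank lines, directives and comments
    |> Seq.filter (fun (_, line) ->
        line.Trim() <> "" && not (line.StartsWith("#", System.StringComparison.Ordinal)))
    |> Seq.choose (fun (lineNumber, line) ->
        let fields = line.Split('\t')
        if fields.Length <> 9 then
            Some (lineNumber, "expected nine tab-delimited fields")
        elif not (Set.contains fields.[2] soTerms) then
            Some (lineNumber, sprintf "'%s' is not a known SO term" fields.[2])
        else
            None)
    |> List.ofSeq

// Made-up term set and lines for illustration.
let knownTerms = set [ "gene"; "CDS"; "mRNA" ]

let linesToCheck =
    [ "##gff-version 3"
      "chrI\tSGD\tgene\t335\t649\t.\t+\t.\tID=YAL069W"
      "chrI\tSGD\tgenee\t335\t649\t.\t+\t.\tID=typo"
      "##FASTA"
      ">chrI"
      "ACGT" ]

let problems = findInvalidTypes knownTerms linesToCheck
```

Keeping the line number next to each message mirrors the behaviour described above, where errors are reported with the lines in which they occur.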
How to use GFF3RelationshipSearch
You can also run a simple search for "Parent - child of" relationships. It returns all genomic features that contain the search term in their ID/Id or Parent field.
///Term to search for:
let searchterm = "YAL069W"
///with this function you can search features which are related to the searchterm
let gffExampleSearch = GFF3.relationshipSearch features searchterm
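What such a search does can be sketched on plain records. The record `SketchEntry` and the functions `hasTerm` and `relatedTo` are hypothetical and only illustrate the matching rule described above. They are not the implementation of `GFF3.relationshipSearch`, and the entries are made up.

```fsharp
// Hypothetical, reduced entry type for illustration only.
type SketchEntry =
    { Feature    : string
      Attributes : Map<string, string list> }

// True if the attribute `key` exists and one of its values equals `term`.
let hasTerm (key : string) (term : string) (entry : SketchEntry) : bool =
    match Map.tryFind key entry.Attributes with
    | Some values -> List.contains term values
    | None -> false

// Keeps every entry that carries the term as ID/Id (the feature itself)
// or as Parent (a direct child of that feature).
let relatedTo (term : string) (entries : seq<SketchEntry>) : SketchEntry list =
    entries
    |> Seq.filter (fun entry ->
        hasTerm "ID" term entry || hasTerm "Id" term entry || hasTerm "Parent" term entry)
    |> List.ofSeq

// Made-up entries.
let sketchEntries =
    [ { Feature = "gene"; Attributes = Map.ofList [ "ID", [ "YAL069W" ] ] }
      { Feature = "CDS";  Attributes = Map.ofList [ "Parent", [ "YAL069W" ] ] }
      { Feature = "gene"; Attributes = Map.ofList [ "ID", [ "YAL068C" ] ] } ]

let sketchHits = relatedTo "YAL069W" sketchEntries
```

This sketch only follows one level of the hierarchy: grandchildren whose Parent is the CDS rather than the gene would not be found.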
Writing GFF3 files
To write a sequence of GFFLine<_> to a file, use the following function. If FastA sequences are included, they are appended by a FastA writer described in the API Reference - FastA.
Note: The order of the key-value pairs in field 9 (attributes) may change.
///Takes a seq<GFF<'a>>, a FASTA converter and a destination filepath and writes it into a .gff. Hint: Use converter = id if no FastA sequence is included.
let gffExampleWrite = GFF3.write BioItem.symbol (__SOURCE_DIRECTORY__ + "/data/gffExampleWrite.gff") features
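One way a changed attribute order can arise is when the key-value pairs are held in a sorted map before being written back. The sketch below assumes an F# `Map<string, string list>` as the container. That is an assumption made for illustration, not a statement about BioFSharp's internals, and `formatAttributes` is a hypothetical helper.

```fsharp
// Serializes attributes back into the field 9 syntax "key=v1,v2;key2=v3".
// An F# Map enumerates its entries in key order, so the original order is lost.
let formatAttributes (attributes : Map<string, string list>) : string =
    attributes
    |> Map.toList
    |> List.map (fun (key, values) -> key + "=" + String.concat "," values)
    |> String.concat ";"

// The pairs are supplied as Name, then ID ...
let sketchAttributes =
    Map.ofList [ "Name", [ "YAL069W" ]
                 "ID",   [ "YAL069W" ] ]

// ... but are written as ID, then Name.
let sketchField9 = formatAttributes sketchAttributes
```

The content of field 9 is unchanged by this round trip. Only the position of the pairs differs, which is harmless for parsers that look attributes up by key.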
Example: Sequence of CDS
If FastA sequences are included in the file, you can look up the sequence of a CDS feature using the following function.
let firstCDS =
    //get GFFEntries
    let filteredGFFEntries =
        features
        |> Seq.choose (fun x ->
            match x with
            | GFF3.GFFEntryLine x -> Some x
            | _ -> None)
    //get all CDS features
    let filteredCDSFeatures =
        filteredGFFEntries
        |> Seq.filter (fun x -> x.Feature = "CDS")
    filteredCDSFeatures |> Seq.head
let firstCDSSequence = GFF3.getSequence firstCDS features
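The coordinate arithmetic behind such a lookup can be sketched on a plain string. GFF3 start and end positions are 1-based and inclusive, so they have to be shifted before indexing into a 0-based string. The function `trySubsequence` is hypothetical, ignores the strand, and is not the implementation of `GFF3.getSequence`.

```fsharp
// Extracts the region [startPos, endPos] (1-based, inclusive) from a sequence.
// Returns None if the coordinates do not fit the sequence.
let trySubsequence (startPos : int) (endPos : int) (sequence : string) : string option =
    if startPos >= 1 && startPos <= endPos && endPos <= sequence.Length then
        // shift the start to a 0-based index; the length includes both ends
        Some (sequence.Substring(startPos - 1, endPos - startPos + 1))
    else
        None

// Made-up sequence: positions 2 to 4 of "ACGTAC" are C, G and T.
let sketchRegion = trySubsequence 2 4 "ACGTAC"
```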