GenBank parsing

Summary: This example shows how to parse and write genbank formatted files with BioFSharp

GenBank format

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). A GenBank file contains various information about a sequence and its features.

The file structure of a GenBank (.gb) flat file can be split into 4 subsections:

Section	Description
Meta	Contains meta information about the annotated sequence, the file itself and the organism the sequence is found in
References	A collection of publications about the annotated sequence(or a subsequence of it) and information associated with them
Features	A collection of features and their position within the sequence
Origin	The annotated sequence itself

With the origins section being optional and the features section usually being by far the largest amongst them. For more information about the sections and their formatting, have a look at this annotated sample record over at NCBI.

This file will also be used for the purpose of this tutorial and in plain text looks like this:

LOCUS       SCU49845                5028 bp    DNA     linear   PLN 14-JUL-2016 
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p     
            (AXL2) and Rev7p (REV7) genes, complete cds.                        
ACCESSION   U49845                                                              
VERSION     U49845.1                                                            
KEYWORDS    .                                                                   
SOURCE      Saccharomyces cerevisiae (baker's yeast)                            
    ORGANISM  Saccharomyces cerevisiae                                            
            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;            
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;             
            Saccharomyces.                                                      
REFERENCE   1  (bases 1 to 5028)                                                
    AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.                        
    TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel    
            plasma membrane glycoprotein                                        
    JOURNAL   Genes Dev. 10 (7), 777-793 (1996)                                   
    PUBMED   8846915                                                             
REFERENCE   2  (bases 1 to 5028)                                                
    AUTHORS   Roemer,T.                                                           
    TITLE     Direct Submission                                                   
    JOURNAL   Submitted (22-FEB-1996) Biology, Yale University, New Haven, CT     
            06520, USA                                                          
FEATURES             Location/Qualifiers                                        
        source          1..5028                                                    
                        /organism="Saccharomyces cerevisiae"                       
                        /mol_type="genomic DNA"                                    
                        /db_xref="taxon:4932"                                      
                        /chromosome="IX"                                           
        mRNA            <1..>206                                                   
                        /product="TCP1-beta"                                       
        CDS             <1..206                                                    
                        /codon_start=3                                             
                        /product="TCP1-beta"                                       
                        /protein_id="AAA98665.1"                                   
                        /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA 
                        AEVLLRVDNIIRARPRTANRQHM"                                   
        gene            <687..>3158                                                
                        /gene="AXL2"                                               
        mRNA            <687..>3158                                                
                        /gene="AXL2"                                               
                        /product="Axl2p"                                           
        CDS             687..3158                                                  
                        /gene="AXL2"                                               
                        /note="plasma membrane glycoprotein"                       
                        /codon_start=1                                             
                        /product="Axl2p"                                           
                        /protein_id="AAA98666.1"                                   
                        /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF 
                        TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN 
                        VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE 
                        VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE 
                        TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV 
                        YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG 
                        DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ 
                        DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA 
                        NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA 
                        CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN 
                        NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ 
                        SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS 
                        YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK 
                        HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL 
                        VDFSNKSNVNVGQVKDIHGRIPEML"                                 
        gene            complement(<3300..>4037)                                   
                        /gene="REV7"                                               
        mRNA            complement(<3300..>4037)                                   
                        /gene="REV7"                                               
                        /product="Rev7p"                                           
        CDS             complement(3300..4037)                                     
                        /gene="REV7"                                               
                        /codon_start=1                                             
                        /product="Rev7p"                                           
                        /protein_id="AAA98667.1"                                   
                        /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ 
                        FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD 
                        KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR 
                        RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK 
                        LISGDDKILNGVYSQYEEGESIFGSLF"                               
ORIGIN                                                                          
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg     
        61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct     
        121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa     
        181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg     
        241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa     
        301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa     
        361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat     
        421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga     
        481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc     
        541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga     
        601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta     
        661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag     
        721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa     
        781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata     
        841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga     
        901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac     
        961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg     
        1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc     
        1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa     
        1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca     
        1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac     
        1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa     
        1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag     
        1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct     
        1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac     
        1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa     
        1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc     
        1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata     
        1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca     
        1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc     
        1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc     
        1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca     
        1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc     
        1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg     
        2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt     
        2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc     
        2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg     
        2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca     
        2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata     
        2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg     
        2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga     
        2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt     
        2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat     
        2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt     
        2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc     
        2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag     
        2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta     
        2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa     
        2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact     
        2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt     
        3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa     
        3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag     
        3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct     
        3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt     
        3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact     
        3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa     
        3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg     
        3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt     
        3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc     
        3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca     
        3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc     
        3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc     
        3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat     
        3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa     
        3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga     
        3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat     
        3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc     
        4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc     
        4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa     
        4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg     
        4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc     
        4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt     
        4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg     
        4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg     
        4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt     
        4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt     
        4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat     
        4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc     
        4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct     
        4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta     
        4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac     
        4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct     
        4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct     
        4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc

Reading GenBank files

The type equivalent for a GenBank file in BioFSharp is a dictionary, mapping string keys to the GenBankItem<'a> type, where 'a is the type of the origin sequence in the file. More information about the type modelling can be found in our API reference. The default reader fromFile simply takes the origin sequence as a sequence of chars.

open BioFSharp
open BioFSharp.IO

///Path of the example file
let exampleFilePath = __SOURCE_DIRECTORY__ + @"\data\sequence.gb"

///Parsed Example File 
let parsedGBFile = GenBank.Read.fromFile exampleFilePath

dict
[("LOCUS",
  Value
    "SCU49845                5028 bp    DNA     linear   PLN 14-JUL-2016");
 ("DEFINITION",
  Value
    "Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p(AXL2) and Rev7p (REV7) genes, complete cds.");
 ("ACCESSION", Value "U49845"); ("VERSION", Value "U49845.1");
 ("KEYWORDS", Value ".");
 ("SOURCE", Value "Saccharomyces cerevisiae (baker's yeast)");
 ("ORGANISM",
 ...
]

You can also use converter functions for the origin sequence, which makes it easier to use them for other BioFSharp workflows. There are multiple prebuilt converters contained in the OriginConverters module for reading and writing. For example, the following code will parse the sequence as a BioSeq containing nucleotides

let converter = GenBank.OriginConverters.Input.nucleotideConverter

let parsedGBFile' = GenBank.Read.fromFileWithOriginConverter converter exampleFilePath

This makes it easy to perform additional tasks with the origin sequence:

let origin = GenBank.getOrigin parsedGBFile'

seq [G; A; T; C; ...]

///Transcribed origin sequence
let rnaSeq = origin |> BioSeq.transcribeCodingStrand

seq [G; A; U; C; ...]

///Translated origin sequence
let protein = rnaSeq |> BioSeq.translate 0

seq [Asp; Pro; Pro; Tyr; ...]

Writing GenBank files

Just as in the Read module, there is a default writer for a GenBank dictionary that contains a sequence of chars, as a writer taking a custom converter function. Just specify the output path, and you are ready to go.

let outputPath1 = __SOURCE_DIRECTORY__ + @"\data\sequenceTestWrite1.gb"
let outputPath2 = __SOURCE_DIRECTORY__ + @"\data\sequenceTestWrite2.gb"
let outputConverter = GenBank.OriginConverters.Output.bioItemConverter

parsedGBFile |> GenBank.Write.toFile outputPath1

parsedGBFile' |> GenBank.Write.toFileWithOriginConverter outputPath2 outputConverter

namespace BioFSharp

namespace BioFSharp.IO

val exampleFilePath: string
Path of the example file

val parsedGBFile: System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<char seq>>
Parsed Example File

module GenBank from BioFSharp.IO
<summary> functions for reading and writing GenBank files </summary>

module Read from BioFSharp.IO.GenBank
<summary> functions for parsing a GenBank file. </summary>

val fromFile: path: string -> System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<char seq>>
<summary> Returns a dictionary containing GenBank items, that represents the GenBank file at the input path </summary>

val converter: (char seq -> BioSeq.BioSeq<Nucleotides.Nucleotide>)

module OriginConverters from BioFSharp.IO.GenBank
<summary> contains prebuilt converters for origin sequences in a gb file for both reading and writing </summary>

module Input from BioFSharp.IO.GenBank.OriginConverters
<summary> contains a collection of prebuilt converters for parsing specific origin sequences </summary>

val nucleotideConverter: sequence: char seq -> BioSeq.BioSeq<Nucleotides.Nucleotide>
<summary> converts the origin sequence into a BioSeq of nucleotides </summary>

val parsedGBFile': System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<BioSeq.BioSeq<Nucleotides.Nucleotide>>>

val fromFileWithOriginConverter: originConverter: (char seq -> 'a) -> path: string -> System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<'a>>
<summary> Returns a dictionary containing GenBank items, that represents the GenBank file at the input path taking a converter function for the origin sequence of the file </summary>

val origin: BioSeq.BioSeq<Nucleotides.Nucleotide>

val getOrigin: gb: System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<'a>> -> 'a
<summary> Returns the Origin of a GenBank file representation </summary>

val rnaSeq: BioSeq.BioSeq<Nucleotides.Nucleotide>
Transcribed origin sequence

Multiple items
module BioSeq from BioFSharp.BioCollectionsExtensions

--------------------
module BioSeq from BioFSharp
<summary> This module contains the BioSeq type and its according functions. The BioSeq type is a sequence of objects using the IBioItem interface </summary>

val transcribeCodingStrand: nucs: Nucleotides.Nucleotide seq -> BioSeq.BioSeq<Nucleotides.Nucleotide>
<summary> Transcribe a given DNA coding strand (5'-----3') </summary>

val protein: BioSeq.BioSeq<AminoAcids.AminoAcid>
Translated origin sequence

val translate: nucleotideOffset: int -> rnaSeq: Nucleotides.Nucleotide seq -> BioSeq.BioSeq<AminoAcids.AminoAcid>
<summary> translates nucleotide sequence to aminoacid sequence </summary>

val outputPath1: string

val outputPath2: string

val outputConverter: (BioSeq.BioSeq<#IBioItem> -> char seq)

module Output from BioFSharp.IO.GenBank.OriginConverters
<summary> contains a collection of prebuilt converters for writing specific origin sequences </summary>

val bioItemConverter: sequence: BioSeq.BioSeq<#IBioItem> -> char seq
<summary> converts the BioSeq to the 1 letter code representing the contained items </summary>

module Write from BioFSharp.IO.GenBank
<summary> Functions for writing a GenBank file </summary>

val toFile: path: string -> gb: System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<#(char seq)>> -> unit
<summary> creates a GenBank file at the specified path </summary>

val toFileWithOriginConverter: path: string -> originConverter: ('a -> char seq) -> gb: System.Collections.Generic.Dictionary<string,GenBank.GenBankItem<'a>> -> unit
<summary> creates a GenBank file at the specified path, taking a converter function for the origin sequence of the file </summary>