Header menu logo BioFSharp

BioCollections

Summary: This example shows how to use collections of biological items in BioFSharp

Analogous to the build-in collections BioFSharp provides BioSeq, BioList and BioArray for individual collection specific optimized operations. The easiest way to create them are the ofBioItemString -functions

open BioFSharp

let s1 = "PEPTIDE" |> BioSeq.ofAminoAcidString 
let s2 = "PEPTIDE" |> BioList.ofAminoAcidString 
let s3 = "TAGCAT"  |> BioArray.ofNucleotideString 

s1, s2, s3

(Microsoft.FSharp.Collections.IEnumerator+mkSeq@177[BioFSharp.AminoAcids+AminoAcid], [Pro; Glu; Pro; ... ], BioFSharp.Nucleotides+Nucleotide[])
Item1
1  PEPTIDE
Item2
1  PEPTIDE
Item3
1  TAGCAT

Nucleotides

Nucleotides1

Figure 1: Selection of covered nucleotide operations (A) Biological principle. (B) Workflow with BioSeq. (C) Other covered functionalities.

Let's imagine you have a given gene sequence and want to find out what the according protein might look like.

let myGene = BioArray.ofNucleotideString "ATGGCTAGATCGATCGATCGGCTAACGTAA"

myGene

         1  ATGGCTAGAT CGATCGATCG GCTAACGTAA

Yikes! Unfortunately we got the 5'-3' coding strand. For proper transcription we should get the complementary strand first:

let myProperGene = BioArray.complement myGene

myProperGene

         1  TACCGATCTA GCTAGCTAGC CGATTGCATT

Now let's transcribe and translate it:

let myTranslatedGene = 
    myProperGene
    |> BioArray.transcribeTemplateStrand
    |> BioArray.translate 0

myTranslatedGene

         1  MARSIDRLT*

Of course, if your input sequence originates from the coding strand, you can directly transcribe it to mRNA since the only difference between the coding strand and the mRNA is the replacement of 'T' by 'U' (Figure 1B)

let myTranslatedGeneFromCodingStrand = 
    myGene
    |> BioArray.transcribeCodingStrand
    |> BioArray.translate 0

myTranslatedGeneFromCodingStrand

         1  MARSIDRLT*

Other Nucleotide conversion operations are also covered:

let mySmallGene = BioSeq.ofNucleotideString  "ATGTTCCGAT"

mySmallGene

         1  ATGTTCCGAT

BioSeq.reverse mySmallGene 

         1  TAGCCTTGTA

BioSeq.complement mySmallGene

         1  TACAAGGCTA

BioSeq.reverseComplement mySmallGene

         1  ATCGGAACAT

AminoAcids

Basics

Some functions which might be needed regularly are defined to work with nucleotides and amino acids:

let myPeptide = "PEPTIDE" |> BioSeq.ofAminoAcidString 

myPeptide

         1  PEPTIDE

myPeptide 
|> BioSeq.toFormula 
|> Formula.toString 
C34.00 H51.00 N7.00 O14.00
BioSeq.toAverageMass myPeptide 

781.8103169999999

Digestion

BioFSharp also comes equipped with a set of tools aimed at cutting apart amino acid sequences. To demonstrate the usage, we'll throw some trypsin at the small RuBisCO subunit of Arabidopos thaliana:
In the first step, we define our input sequence and the protease we want to use.

let RBCS = 
    """MASSMLSSATMVASPAQATMVAPFNGLKSSAAFPATRKANNDITSITSNGGRVNCMQVWP
    PIGKKKFETLSYLPDLTDSELAKEVDYLIRNKWIPCVEFELEHGFVYREHGNSPGYYDGR
    YWTMWKLPLFGCTDSAQVLKEVEECKKEYPNAFIRIIGFDNTRQVQCISFIAYKPPSFT""" 
    |> BioArray.ofAminoAcidString

RBCS

         1  MASSMLSSAT MVASPAQATM VAPFNGLKSS AAFPATRKAN NDITSITSNG GRVNCMQVWP
        61  PIGKKKFETL SYLPDLTDSE LAKEVDYLIR NKWIPCVEFE LEHGFVYREH GNSPGYYDGR
       121  YWTMWKLPLF GCTDSAQVLK EVEECKKEYP NAFIRIIGFD NTRQVQCISF IAYKPPSFT

let trypsin = Digestion.Table.getProteaseBy "Trypsin"

let digestedRBCS = Digestion.BioArray.digest trypsin 0 RBCS 

digestedRBCS
|> Seq.head

{ ProteinID = 0\n MissCleavages = 0\n CleavageStart = 0\n CleavageEnd = 27\n PepSequence =\n [Met; Ala; Ser; Ser; Met; Leu; Ser; Ser; Ala; Thr; Met; Val; Ala; Ser; Pro;\n Ala; Gln; Ala; Thr; Met; Val; Ala; Pro; Phe; Asn; Gly; Leu; Lys] }
ProteinID
0
MissCleavages
0
CleavageStart
0
CleavageEnd
27
PepSequence
         1  MASSMLSSAT MVASPAQATM VAPFNGLK

In reality, proteases don't always completely cut the protein down. Instead, some sites stay intact and should be considered for in silico analysis. This can easily be done with the concernMissCleavages function. It takes the minimum and maximum amount of misscleavages you want to have and also the digested protein. As a result you get all possible combinations arising from this information.

let digestedRBCS' = Digestion.BioArray.concernMissCleavages 0 2 digestedRBCS

digestedRBCS
|> Seq.item 1

{ ProteinID = 0\n MissCleavages = 0\n CleavageStart = 28\n CleavageEnd = 36\n PepSequence = [Ser; Ser; Ala; Ala; Phe; Pro; Ala; Thr; Arg] }
ProteinID
0
MissCleavages
0
CleavageStart
28
CleavageEnd
36
PepSequence
         1  SSAAFPATR

val s1: obj
val s2: obj
val s3: obj
val myGene: obj
val myProperGene: obj
val myTranslatedGene: obj
val myTranslatedGeneFromCodingStrand: obj
val mySmallGene: obj
val myPeptide: obj
val RBCS: obj
val trypsin: obj
val digestedRBCS: obj seq
module Seq from Microsoft.FSharp.Collections
val head: source: 'T seq -> 'T
val digestedRBCS': obj
val item: index: int -> source: 'T seq -> 'T

Type something to start searching.