BioFSharp


Pairwise Alignment

📂View module documentation

In this short tutorial, the usage of the pairwise alignment implementation is demonstrated. For global alignments, the NeedlemanWunsch algorithm is used. For local alignments, the SmithWaterman algorithm is used.

For both implementations, the gapvalues are evaluated using the affine gapscoremodel.

Aligning aminoacid- and nucleotide-sequences

For defining the scores of matching and missmatching characters, the scoring function is defined. In the case of amino acid or nucleotide sequence alignments, the integrated substitution-matrices can be used.

1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
9: 
open BioFSharp
open BioFSharp.Algorithm
open PairwiseAlignment

//For aminoacids
let aaScoring = ScoringMatrix.getScoringMatrixAminoAcid ScoringMatrix.ScoringMatrixAminoAcid.BLOSUM62

//For nucleotides
let nucScoring = ScoringMatrix.getScoringMatrixNucleotide  ScoringMatrix.ScoringMatrixNucleotide.EDNA

In order to align two sequences, not only the values for substitutions, but also the costs for gaps are needed. In this implementation, an affine gap-penalty is realized. An affine gap penalty weights continuations of gaps different than the opening of a gap. The scoring function (in this case a substitution matrix) and the two gap penalty values are combined into the Costs type.

 1: 
 2: 
 3: 
 4: 
 5: 
 6: 
 7: 
 8: 
 9: 
10: 
11: 
12: 
13: 
//For aminoacids
let aaCosts = {
    Open = -5
    Continuation = -2
    Similarity = aaScoring 
    }

//For nucleotides
let nucCosts = {
    Open = -2
    Continuation = -1
    Similarity = nucScoring 
    }

The alignment functions use Arrays as input. The Elements can be of any type, but require type equality. Also they need to have type equality with the input of the scoring function. Both the global and local alignment algorithms take the same parameters (costs,firstSequence,secondSequence) and return the same format.

1: 
2: 
3: 
4: 
5: 
//For aminoacids
let aaSeq1 = "ACDM" |> BioArray.ofAminoAcidSymbolString
let aaSeq2 = "MAACEDM" |> BioArray.ofAminoAcidSymbolString

let globalAAAlignment = NeedlemanWunsch.runAminoAcidSymbol aaCosts aaSeq1 aaSeq2
{ MetaData = 12
  Sequences = [[-; -; A; C; -; D; M]; [M; A; A; C; E; D; M]] }
1: 
let localAAAlignment = SmithWaterman.runAminoAcidSymbol aaCosts aaSeq1 aaSeq2
{ MetaData = 19
  Sequences = [[A; C; -; D; M]; [A; C; E; D; M]] }
1: 
2: 
3: 
4: 
5: 
//For nucleotides
let nucSeq1 = "ATGA" |> BioArray.ofNucleotideString
let nucSeq2 = "BATVAWG" |> BioArray.ofNucleotideString

let globalNucAlignment = NeedlemanWunsch.runNucleotide nucCosts nucSeq1 nucSeq2
{ MetaData = 9
  Sequences = [[Gap; A; T; G; A; Gap; Gap]; [B; A; T; V; A; W; G]] }
1: 
let localNucAlignment = SmithWaterman.runNucleotide nucCosts nucSeq1 nucSeq2
{ MetaData = 11
  Sequences = [[A; T; G; A]; [A; T; V; A]] }

Aligning anything else

This implementation was aimed to be as generic as possible. To achieve this, the scoring function can be designed at will, the only constraints being the need for two input variables and the type equality.
Also besides the alignment functions which only take BioItems as input and represent the gaps by the appropriate gaps of type BioItem. There is also a generic alignment function runGeneric which returns lists of options, where None represents gaps. Therefore any input type can be used, given it matches the cost function.

For example, one could use a simple if .. then .. else equality function to match nucleotides

 1: 
 2: 
 3: 
 4: 
 5: 
 6: 
 7: 
 8: 
 9: 
10: 
11: 
let scoring a b = 
    if a = b then 2
    else -2

let costs = {
    Open = -2
    Continuation = -1
    Similarity = scoring
    }

let globalAlignmentNuc = NeedlemanWunsch.runGeneric costs nucSeq1 nucSeq2
{ MetaData = -1
  Sequences =
             [[None; Some A; Some T; Some G; Some A; None; None];
              [Some B; Some A; Some T; Some V; Some A; Some W; Some G]] }
1: 
let localAlignmentNuc = SmithWaterman.runGeneric costs nucSeq1 nucSeq2
{ MetaData = 1
  Sequences =
             [[Some A; Some T; Some G; Some A]; [Some A; Some T; Some V; Some A]] }

or also Integers:

1: 
2: 
3: 
4: 
let intSeq1 = [|1;2;3;4;5|]
let intSeq2 = [|1;1;2;4;6;7;|]

let globalAlignmentInt = NeedlemanWunsch.runGeneric costs intSeq1 intSeq2
{ MetaData = -2
  Sequences =
             [[Some 1; None; Some 2; Some 3; Some 4; Some 5; None];
              [Some 1; Some 1; Some 2; None; Some 4; Some 6; Some 7]] }
1: 
let localAlignmentInt = SmithWaterman.runGeneric costs intSeq1 intSeq2
{ MetaData = 0
  Sequences = [[Some 1; Some 2; Some 3; Some 4]; [Some 1; Some 2; None; Some 4]] }
namespace BioFSharp
namespace BioFSharp.Algorithm
module PairwiseAlignment

from BioFSharp.Algorithm
val aaScoring : (AminoAcidSymbols.AminoAcidSymbol -> AminoAcidSymbols.AminoAcidSymbol -> int)
module ScoringMatrix

from BioFSharp.Algorithm
val getScoringMatrixAminoAcid : scoringMatrixType:ScoringMatrix.ScoringMatrixAminoAcid -> (AminoAcidSymbols.AminoAcidSymbol -> AminoAcidSymbols.AminoAcidSymbol -> int)
type ScoringMatrixAminoAcid =
  | BLOSUM45
  | BLOSUM50
  | BLOSUM62
  | BLOSUM80
  | PAM30
  | PAM70
  | PAM250
    static member toFileName : (ScoringMatrixAminoAcid -> string)
union case ScoringMatrix.ScoringMatrixAminoAcid.BLOSUM62: ScoringMatrix.ScoringMatrixAminoAcid
val nucScoring : (Nucleotides.Nucleotide -> Nucleotides.Nucleotide -> int)
val getScoringMatrixNucleotide : scoringMatrixType:ScoringMatrix.ScoringMatrixNucleotide -> (Nucleotides.Nucleotide -> Nucleotides.Nucleotide -> int)
type ScoringMatrixNucleotide =
  | EDNA
  | Default
    static member toFileName : (ScoringMatrixNucleotide -> string)
union case ScoringMatrix.ScoringMatrixNucleotide.EDNA: ScoringMatrix.ScoringMatrixNucleotide
val aaCosts : Costs<AminoAcidSymbols.AminoAcidSymbol>
val nucCosts : Costs<Nucleotides.Nucleotide>
val aaSeq1 : BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol>
module BioArray

from BioFSharp
val ofAminoAcidSymbolString : s:#seq<char> -> BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol>
val aaSeq2 : BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol>
val globalAAAlignment : Alignment.Alignment<AminoAcidSymbols.AminoAcidSymbol list,Score>
module NeedlemanWunsch

from BioFSharp.Algorithm.PairwiseAlignment
val runAminoAcidSymbol : costs:Costs<AminoAcidSymbols.AminoAcidSymbol> -> fstSeq:BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol> -> sndSeq:BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol> -> Alignment.Alignment<AminoAcidSymbols.AminoAcidSymbol list,Score>
val localAAAlignment : Alignment.Alignment<AminoAcidSymbols.AminoAcidSymbol list,Score>
module SmithWaterman

from BioFSharp.Algorithm.PairwiseAlignment
val nucSeq1 : BioArray.BioArray<Nucleotides.Nucleotide>
val ofNucleotideString : s:#seq<char> -> BioArray.BioArray<Nucleotides.Nucleotide>
val nucSeq2 : BioArray.BioArray<Nucleotides.Nucleotide>
val globalNucAlignment : Alignment.Alignment<Nucleotides.Nucleotide list,Score>
val runNucleotide : costs:Costs<Nucleotides.Nucleotide> -> fstSeq:BioArray.BioArray<Nucleotides.Nucleotide> -> sndSeq:BioArray.BioArray<Nucleotides.Nucleotide> -> Alignment.Alignment<Nucleotides.Nucleotide list,Score>
val localNucAlignment : Alignment.Alignment<Nucleotides.Nucleotide list,Score>
val scoring : a:'a -> b:'a -> int (requires equality)
val a : 'a (requires equality)
val b : 'a (requires equality)
val costs : Costs<'a> (requires equality)
val globalAlignmentNuc : Alignment.Alignment<Nucleotides.Nucleotide option list,Score>
val runGeneric : costs:Costs<'a> -> fstSeq:'a [] -> sndSeq:'a [] -> Alignment.Alignment<'a option list,Score>
val localAlignmentNuc : Alignment.Alignment<Nucleotides.Nucleotide option list,Score>
val intSeq1 : int []
val intSeq2 : int []
val globalAlignmentInt : Alignment.Alignment<int option list,Score>
val localAlignmentInt : Alignment.Alignment<int option list,Score>
Fork me on GitHub