Header menu logo BioFSharp

Pairwise Alignment

BinderScriptNotebook

Summary: This example shows how to perform sequence alignments in BioFSharp

In this short tutorial, the usage of the pairwise alignment implementation is demonstrated. For global alignments, the NeedlemanWunsch algorithm is used. For local alignments, the SmithWaterman algorithm is used.

For both implementations, the gapvalues are evaluated using the affine gapscoremodel.

Aligning aminoacid and nucleotide sequences

For defining the scores of matching and missmatching characters, the scoring function is defined. In the case of amino acid or nucleotide sequence alignments, the integrated substitution-matrices can be used.

open BioFSharp
open BioFSharp.Algorithm
open PairwiseAlignment

//For aminoacids
let aaScoring = ScoringMatrix.getScoringMatrixAminoAcid ScoringMatrix.ScoringMatrixAminoAcid.BLOSUM62

//For nucleotides
let nucScoring = ScoringMatrix.getScoringMatrixNucleotide  ScoringMatrix.ScoringMatrixNucleotide.EDNA

In order to align two sequences, not only the values for substitutions, but also the costs for gaps are needed. In this implementation, an affine gap-penalty is realized. An affine gap penalty weights continuations of gaps different than the opening of a gap. The scoring function (in this case a substitution matrix) and the two gap penalty values are combined into the Costs type.

//For aminoacids
let aaCosts = {
    Open = -5
    Continuation = -2
    Similarity = aaScoring 
    }

//For nucleotides
let nucCosts = {
    Open = -2
    Continuation = -1
    Similarity = nucScoring 
    }

The alignment functions use Arrays as input. The Elements can be of any type, but require type equality. Also they need to have type equality with the input of the scoring function. Both the global and local alignment algorithms take the same parameters (costs,firstSequence,secondSequence) and return the same format.

//For aminoacids
let aaSeq1 = "ACDM" |> BioArray.ofAminoAcidSymbolString
let aaSeq2 = "MAACEDM" |> BioArray.ofAminoAcidSymbolString

NeedlemanWunsch.runAminoAcidSymbol aaCosts aaSeq1 aaSeq2
{ MetaData = 12
  Sequences = [[-; -; A; C; -; D; M]; [M; A; A; C; E; D; M]] }
SmithWaterman.runAminoAcidSymbol aaCosts aaSeq1 aaSeq2
{ MetaData = 19
  Sequences = [[A; C; -; D; M]; [A; C; E; D; M]] }
//For nucleotides
let nucSeq1 = "ATGA" |> BioArray.ofNucleotideString
let nucSeq2 = "BATVAWG" |> BioArray.ofNucleotideString

NeedlemanWunsch.runNucleotide nucCosts nucSeq1 nucSeq2
{ MetaData = 9
  Sequences = [[Gap; A; T; G; A; Gap; Gap]; [B; A; T; V; A; W; G]] }
SmithWaterman.runNucleotide nucCosts nucSeq1 nucSeq2
{ MetaData = 11
  Sequences = [[A; T; G; A]; [A; T; V; A]] }

Aligning anything else

This implementation was aimed to be as generic as possible. To achieve this, the scoring function can be designed at will, the only constraints being the need for two input variables and the type equality.
Also besides the alignment functions which only take BioItems as input and represent the gaps by the appropriate gaps of type BioItem. There is also a generic alignment function runGeneric which returns lists of options, where None represents gaps. Therefore any input type can be used, given it matches the cost function.

For example, one could use a simple if .. then .. else equality function to match nucleotides

let scoring a b = 
    if a = b then 2
    else -2

let costs = {
    Open = -2
    Continuation = -1
    Similarity = scoring
    }

NeedlemanWunsch.runGeneric costs nucSeq1 nucSeq2
{ MetaData = -1
  Sequences =
   [[None; Some A; Some T; Some G; Some A; None; None];
    [Some B; Some A; Some T; Some V; Some A; Some W; Some G]] }
SmithWaterman.runGeneric costs nucSeq1 nucSeq2
{ MetaData = 1
  Sequences =
   [[Some A; Some T; Some G; Some A]; [Some A; Some T; Some V; Some A]] }

or also Integers:

let intSeq1 = [|1;2;3;4;5|]
let intSeq2 = [|1;1;2;4;6;7;|]

NeedlemanWunsch.runGeneric costs intSeq1 intSeq2
{ MetaData = -2
  Sequences =
   [[Some 1; None; Some 2; Some 3; Some 4; Some 5; None];
    [Some 1; Some 1; Some 2; None; Some 4; Some 6; Some 7]] }
SmithWaterman.runGeneric costs intSeq1 intSeq2
{ MetaData = 0
  Sequences =
   [[Some 1; Some 2; Some 3; Some 4]; [Some 1; Some 2; None; Some 4]] }

Multiple sequence alignment with ClustalOmega

Clustal Omega is a multiple sequence alignment (MSA) tool. This tutorial describes using it in F# via the ClustalOWrapper. For some more indepth information about which parameters to choose for your goal, also check out the official tutorial.

Aligning sequences from files

The first step is to create a wrapper-object. As optional input it takes a path to the clustalo executable you want to use. You have to fill this argument if you work with a precompiled verion or on linux.

you will have to install clustal omega yourself to use this wrapper.

open BioFSharp.IO
open ClustalOWrapper
open BioFSharp
let cw = ClustalOWrapper("path/where/you/extracted/clustal-omega/clustalo.exe")

The general structure of arguments the wrapper takes was kept the same as in the command line tool. In general, you need an inputPath, an outputPath and parameters. As there are several inputs possible, you have to choose what it is. As we want to align a normal sequence we just pick SequenceFile.

let inputPath = Input.SequenceFile (__SOURCE_DIRECTORY__ + @"\data\Chlamy_Cp.fastA")

let outputPath = __SOURCE_DIRECTORY__ + @"\data\Chlamy_Cp.aln"

As additional parameters go, we'll restrict input to FastA format and the output to Clustal format. Also we will use the Force parameter to force the overwrite of a possilby already existing file with the name ChlamyCp.aln.

//Input has to be in FastA format
let inputModifier = Parameters.ClustalParams.Input [Parameters.InputCustom.Format Parameters.FileFormat.FastA]
//Output has to be in Clustal format
let outputModifier = Parameters.ClustalParams.Output [Parameters.OutputCustom.Format Parameters.FileFormat.Clustal]
//Forces overwriting
let forceModifier = Parameters.ClustalParams.Miscallaneous [Parameters.MiscallaneousCustom.Force]
//Perform alignment
cw.AlignFromFile(inputPath,outputPath,[inputModifier;outputModifier;forceModifier])

Aligning sequences directly in F# Interactive

With the AlignSequences method, one can also directly align sequences with the clustal tool and also directly receive the alignment directly in F#.

As input, it takes a collection of TaggedSequences, and again a set of parameters. The default options can be used by not using any additional parameters.

let sequences = [
    TaggedSequence.create "pep1" ("AAGECGEK")
    TaggedSequence.create "pep2" ("AAGEGEK")
    TaggedSequence.create "pep3" ("AAAGECGEK")
    TaggedSequence.create "pep4" ("AAGECGEL")
]
cw.AlignSequences(sequences,Seq.empty)
{ MetaData = { Header = seq ['C'; 'L'; 'U'; 'S'; ...]
ConservationInfo = seq [' '; '*'; '*'; '*'; ...] }
    Sequences =
     [{ Tag = "pep1"
        Sequence = seq ['-'; 'A'; 'A'; 'G'; ...] };
      { Tag = "pep2"
        Sequence = seq ['-'; 'A'; 'A'; 'G'; ...] };
      { Tag = "pep3"
        Sequence = seq ['A'; 'A'; 'A'; 'G'; ...] };
      { Tag = "pep4"
        Sequence = seq ['-'; 'A'; 'A'; 'G'; ...] }] }
namespace BioFSharp
namespace BioFSharp.Algorithm
module PairwiseAlignment from BioFSharp.Algorithm
<summary> Contains functions for evaluating the best possible alignments for 2 Sequences </summary>
val aaScoring: (AminoAcidSymbols.AminoAcidSymbol -> AminoAcidSymbols.AminoAcidSymbol -> int)
module ScoringMatrix from BioFSharp.Algorithm
<summary> Contains functions for using the included similarity matrices. These assign a score to every pair of aminoacids/nucleotides and therefore rate the probability of their substitution. The Scoring Matrices are generally used for alignments. </summary>
val getScoringMatrixAminoAcid: scoringMatrixType: ScoringMatrix.ScoringMatrixAminoAcid -> (AminoAcidSymbols.AminoAcidSymbol -> AminoAcidSymbols.AminoAcidSymbol -> int)
<summary> creates a scoring function for amino acids out of a scoring matrix </summary>
type ScoringMatrixAminoAcid = | BLOSUM45 | BLOSUM50 | BLOSUM62 | BLOSUM80 | PAM30 | PAM70 | PAM250 static member toFileName: (ScoringMatrixAminoAcid -> string)
<summary> Union case of implemented amino acid scoring matrices with the given reference to its place in the library. Use the "getScoringMatrixAminoAcid" function to obtain a simple mapping function for every amino acid pair </summary>
union case ScoringMatrix.ScoringMatrixAminoAcid.BLOSUM62: ScoringMatrix.ScoringMatrixAminoAcid
val nucScoring: (Nucleotides.Nucleotide -> Nucleotides.Nucleotide -> int)
val getScoringMatrixNucleotide: scoringMatrixType: ScoringMatrix.ScoringMatrixNucleotide -> (Nucleotides.Nucleotide -> Nucleotides.Nucleotide -> int)
<summary> creates a scoring function for nucleotides out of a scoring matrix </summary>
type ScoringMatrixNucleotide = | EDNA | Default static member toFileName: (ScoringMatrixNucleotide -> string)
<summary> Union case of implemented nucleotide scoring matrices with the given reference to its place in the library. Use the "getScoringMatrixNucleotide" function to obtain a simple mapping function for every nucleotide pair </summary>
union case ScoringMatrix.ScoringMatrixNucleotide.EDNA: ScoringMatrix.ScoringMatrixNucleotide
val aaCosts: Costs<AminoAcidSymbols.AminoAcidSymbol>
val nucCosts: Costs<Nucleotides.Nucleotide>
val aaSeq1: BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol>
Multiple items
module BioArray from BioFSharp.BioCollectionsExtensions

--------------------
module BioArray from BioFSharp
<summary> This module contains the BioArray type and its according functions. The BioArray type is an array of objects using the IBioItem interface </summary>
val ofAminoAcidSymbolString: s: #(char seq) -> BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol>
<summary> Generates amino acid symbol sequence of one-letter-code raw string </summary>
val aaSeq2: BioArray.BioArray<AminoAcidSymbols.AminoAcidSymbol>
val nucSeq1: BioArray.BioArray<Nucleotides.Nucleotide>
val ofNucleotideString: s: #(char seq) -> BioArray.BioArray<Nucleotides.Nucleotide>
<summary> Generates nucleotide sequence of one-letter-code raw string </summary>
val nucSeq2: BioArray.BioArray<Nucleotides.Nucleotide>
val scoring: a: 'a -> b: 'a -> int (requires equality)
val a: 'a (requires equality)
val b: 'a (requires equality)
val costs: Costs<'a> (requires equality)
val intSeq1: int array
val intSeq2: int array
namespace BioFSharp.IO
module ClustalOWrapper from BioFSharp.IO
<summary> Wrapper and its helpers for Clustal Omega multiple alignment tools </summary>
val cw: ClustalOWrapper
Multiple items
module ClustalOWrapper from BioFSharp.IO
<summary> Wrapper and its helpers for Clustal Omega multiple alignment tools </summary>

--------------------
type ClustalOWrapper = new: ?rootPath: string -> ClustalOWrapper member AlignFromFile: inputPath: Input * outputPath: string * parameters: ClustalParams seq * ?name: string -> unit member AlignSequences: input: TaggedSequence<string,char> seq * parameters: ClustalParams seq * ?name: string -> Alignment<TaggedSequence<string,char>,AlignmentInfo>
<summary> A wrapper to perform Clustal Omega alignment tasks </summary>

--------------------
new: ?rootPath: string -> ClustalOWrapper
val inputPath: Input
type Input = | SequenceFile of string | TwoProfiles of string * string | SequenceFileAndProfile of string * string | SequenceFileAndHMM of string * string
<summary> Specify the type of input and assign file path </summary>
union case Input.SequenceFile: string -> Input
<summary> Use this option to make a multiple alignment from a set of sequences. A sequence file must contain more than one sequence (at least two sequences). </summary>
val outputPath: string
val inputModifier: Parameters.ClustalParams
module Parameters from BioFSharp.IO.ClustalOWrapper
<summary> Contains modifier parameter type for Clustal Omega wrapper </summary>
type ClustalParams = | Input of InputCustom seq | Output of OutputCustom seq | Clustering of ClusteringCustom seq | Iteration of IterationCustom seq | Limits of LimitsCustom seq | Miscallaneous of MiscallaneousCustom seq
<summary> Collection of parameters for specifying clustalo alignment </summary>
union case Parameters.ClustalParams.Input: Parameters.InputCustom seq -> Parameters.ClustalParams
<summary> Specify input parameters </summary>
type InputCustom = | Format of FileFormat | Dealign | IsProfile | SeqType of SeqType
<summary> Optional modifiers for input </summary>
union case Parameters.InputCustom.Format: Parameters.FileFormat -> Parameters.InputCustom
<summary> Forced sequence input file format (default: auto) </summary>
type FileFormat = | FastA | Clustal | MSF | Phylip | Selex | Stockholm | Vienna
<summary> Input file format </summary>
union case Parameters.FileFormat.FastA: Parameters.FileFormat
<summary> FastA file format </summary>
val outputModifier: Parameters.ClustalParams
union case Parameters.ClustalParams.Output: Parameters.OutputCustom seq -> Parameters.ClustalParams
<summary> Specify output parameters </summary>
type OutputCustom = | Format of FileFormat | ResidueNumber | Wrap of int | OutputOrderAsTree
<summary> Optional modifiers for input </summary>
union case Parameters.OutputCustom.Format: Parameters.FileFormat -> Parameters.OutputCustom
<summary> MSA output file format (default: fasta) </summary>
union case Parameters.FileFormat.Clustal: Parameters.FileFormat
<summary> Clustal file format </summary>
val forceModifier: Parameters.ClustalParams
union case Parameters.ClustalParams.Miscallaneous: Parameters.MiscallaneousCustom seq -> Parameters.ClustalParams
<summary> Specify miscallaneous parameters </summary>
type MiscallaneousCustom = | Auto | Threads of int | Log of string | VerboseLevel of int | Version | LongVersion | Force
<summary> Optional, miscallaneous modifiers </summary>
union case Parameters.MiscallaneousCustom.Force: Parameters.MiscallaneousCustom
<summary> Force file overwriting </summary>
member ClustalOWrapper.AlignFromFile: inputPath: Input * outputPath: string * parameters: Parameters.ClustalParams seq * ?name: string -> unit
val sequences: TaggedSequence<string,char> list
type TaggedSequence<'T,'S> = { Tag: 'T Sequence: 'S seq } static member create: tag: 'T -> sequence: 'S seq -> TaggedSequence<'T,'S> static member mapSequence: mapping: ('S seq -> 'M seq) -> ts: TaggedSequence<'T,'S> -> TaggedSequence<'T,'M> static member mapTag: mapping: ('T -> 'U) -> ts: TaggedSequence<'T,'S> -> TaggedSequence<'U,'S>
<summary> Record of a sequence and its tag </summary>
static member TaggedSequence.create: tag: 'T -> sequence: 'S seq -> TaggedSequence<'T,'S>
member ClustalOWrapper.AlignSequences: input: TaggedSequence<string,char> seq * parameters: Parameters.ClustalParams seq * ?name: string -> Alignment.Alignment<TaggedSequence<string,char>,Clustal.AlignmentInfo>
module Seq from Microsoft.FSharp.Collections
val empty<'T> : 'T seq

Type something to start searching.