Phylogenetic tree reconstruction based on evolutionary distance

Interested? Contact muehlhaus@bio.uni-kl.de or schneike@bio.uni-kl.de

Introduction
Aim for this project
Coding clues
References
Additional information
- Testing
- Blog post

Introduction

Phylogenetic trees

Phylogenetic trees are diagrams that visualize inferred evolutionary relationships between a set of organisms. Consider this tree diagram:

         ____ A
       _|
      | |____ B
______|
      |______ C

Ancestors are shown as nodes on the tree, while the leaves represent the respective organisms. It tells you that A and B share a common ancestor, and that that ancestor shares a common ancestor with C. In other words, A and B are closer related to each other, than each of them to C.

A and B form a clade together with their common ancestor (also known as a monophyletic group) - a group of organisms that includes a single ancestor and all of its descendents that represent unbroken lines of evolutionary descent.

But based on what information do we construct such trees? There are different classes of approaches to this problem, but to stay beginner-friendy, only distance-based methods will be discussed in the scope of this project. For sake of completeness, other approaches include parsimony, maximum likelihood and Bayesian approaches to searching the possible tree space.

The first step in any (distance-based) phylogenetic tree reconstruction is the selection of the characteristic to infer evolutionary relationships from, and subsequently the determination of the phylogenetic distance between the organisms of interest based on that characteristic.

Evolutionary Distance of DNA sequences

Any kind of characteristic of organisms can be used to try to infer phylogenetic relationships - like for example beaks of Darwin finches - but DNA sequences have proven to be incredibliy helpful to reconstruct phylogenetic trees, as the nucleotide alphabet is relatively simple and sequencing data has reached unparalleled throughput and accuracy. Likewise, there are a wide range of sophisticated methods to calculate phylogenetic distance based on DNA sequences.

In the scope of this project, you will take a step back and look at some classic evolutionary models that can be used to model phylogenetic distance based on DNA sequences.

A few important bits of jargon for the following chapters:

DNA substitution mutations can be classified by 2 types:
- Transitions are interchanges of two-ring purines (A <> G) or of one-ring pyrimidines (C <> T)
- Transversions are interchanges of purine for pyrimidine bases or vice versa (A <> T | A <> C | G <> T | G <> C)

Proportional distance

The pairwise proportional distance (or p distance) is the classical 'naive' approach to estimate pairwise distances between two sequences.

It is simply the ratio between substitution sites and the length of the sequences. Note that - as in all distances you will implement - both sequences have to be of the same length or have to be aligned before calculating distances. It is obtained by dividing the amount of substitutions by the total amount of compared nucleotides:

\(p = \frac{d}{L},\)

where

\(d = s + v\)

\(s : Transitions\)

\(v : Transversions\)

\(L : length\)

What are the advantages and weaknesses of this approach? Inform yourself.

Distance Corrections based on evolutionary Models

To overcome the shortcomings of the simple proportional distance, many evolutionary models for biological sequences are available. Most of the time, sequence evolution by mutation is described as a stochastic process modelled by continuous-time Markov chains, with the alphabet (for DNA: A, C, G, T) as possible states.

For any sequence position, these models are then defining a substitution probability matrix based that acts as the stochastic matrix of the markov chain.

A basic understanding of markov chains may be beneficial, but not necessary. It is just important that these models define probabilities for transitions and transversions, and can be solved for a corrected pairwise distance that suffices the model substitution criteria.

The Jukes-Cantor Model

The Jukes-Cantor model is the simplest form of these kinds of models. It makes no difference between transitions and transversions, meaning that all substitutions have the same substitution rate (\(\alpha\)). Also, all sites are modelled independently.

The substitution matrix is:

\[\begin{matrix} & \begin{matrix}A & & C & & G & & T\end{matrix} \\\\ \begin{matrix}A\\\\C\\\\G\\\\T\end{matrix} & \begin{pmatrix} -3\alpha&\alpha&\alpha&\alpha\\\\ \alpha&-3\alpha&\alpha&\alpha\\\\ \alpha&\alpha&-3\alpha&\alpha\\\\ \alpha&\alpha&\alpha&-3\alpha \end{pmatrix}\\\\ \end{matrix}\]

\(3\alpha t\) mutations would be expected during a time \(t\) for each sequence site on each sequence, leading to a correction factor for the proportional distance \(d_{JC}\) :

\[d_{JC}=-\frac{3}{4}ln(1-\frac{4}{3}p)\]

What are the advantages and weaknesses of this approach? Inform yourself.

The Kimura 2-Parameter Model

The substitution matrix is:

\[\begin{matrix} & \begin{matrix}A & & & & C & & & & G & & & & T\end{matrix} \\\\ \begin{matrix}A\\\\C\\\\G\\\\T\end{matrix} & \begin{pmatrix} -2\beta-\alpha&\beta&\alpha&\beta\\\\ \beta&-2\beta-\alpha&\beta&\alpha\\\\ \alpha&\beta&-2\beta-\alpha&\beta\\\\ \beta&\alpha&\beta&-2\beta-\alpha \end{pmatrix}\\\\ \end{matrix}\]

It results in a corrected distance \(d_{K2P}\):

\(d_{K2P}=-\frac{1}{2}ln(1-2P-Q)-\frac{1}{4}ln(1-2Q)\),

where

\(P=\frac{s}{L}\)

\(Q=\frac{v}{L}\)

What are the advantages and weaknesses of this approach? Inform yourself.

Aim for this project

You understand the following evolutionary distance models and are able to explain the differences between them (required in your final report) Also, you implement them in F# for the BioFSharp library:
- Pairwise p Distance
- JC69 evolutionary distance based on the model by Jukes and Cantor
- K81 evolutionary distance based on the Kimura two-parameter model
As a demonstration of your implementations, as well as to show the differences between these models, You construct at least 10 adequate test sequences of equal length, and construct phyologenetic trees from them. Investigate the most interesting and obvious differences, and relate them to the different model assumptions.
Finally, you choose adequate sequences of at least 6 organisms, perform a multiple alignment for them and repeat above process for real-world sequences.
You present above results in a blog post resembling your final report.
Bonus: You implement a visualization method for phylogenetic trees for Cyjs.NET.

Coding clues

Before you start

(re)familiarize yourself with the basics behind phylogenetic trees.
(re)familiarize yourself with F# function signatures and the basics of F# programming.
A basic understanding of Markov chains is beneficial, but not necessary.

Scripting environment and necessary libraries

create a F# script (.fsx), load and open the following libraries:
- FSharpAux
- FSharpAux.IO
- FSharp.Stats
- BioFSharp
- BioFSharp.IO
- Plotly.NET

the top of your script file should look like this:

#r "nuget: FSharpAux, 1.1.0"
#r "nuget: FSharpAux.IO, 1.1.0"
#r "nuget: FSharp.Stats, 0.4.3" 
#r "nuget: BioFSharp, 2.0.0-preview.1"
#r "nuget: BioFSharp.IO, 2.0.0-preview.1"
#r "nuget: Plotly.NET, 2.0.0-preview.16"

open FSharpAux
open FSharpAux.IO
open BioFSharp
open BioFSharp.IO
open FSharp.Stats
open Plotly.NET

General coding advice

All pairwise distance functions should be able to operate on either BioArray, BioList, or BioSeq. You can use the fact that all of these sequence types are implementing IEnumerable and can only contain nucleotides.

To perform hierarchical clustering to reconstruct the phylogenetic trees, use the respective module from the BioFSharp library (PhylogeneticTree functions) and work with TaggedSequences:

open BioFSharp

let yourDistance (seqA: seq<#IBioItem>) (seqB:seq<#IBioItem>) = 42. //do real distance calculations here

//reconstruct a phylogenetic tree from tagged sequences
let myTree =
    PhylogeneticTree.ofTaggedBioSequences
        yourDistance // your distance function for either p, JC69, or K81 distance
        yourSequences // your adequate nucleotide test sequences as tagged sequences

to perform Multiple sequence alignment between your real world sequence examples, use the respective functions from BioFSharp: For that, you have to first install clustal omega, a very nice multiple sequence aligment tool. You can use it directly in F# interactive via BioFSharp's ClustalOWrapper:

open ClustalOWrapper
let cw = ClustalOWrapper("path/where/you/extracted/clustal-omega-1.2.2-win64/clustalo.exe") // replace with real path from your machine here!

let sequences = 
    [
    TaggedSequence.create "seq1" ("ATGAAAAA")
    TaggedSequence.create "seq2" ("ATGAAACA")
    TaggedSequence.create "seq3" ("ATGAAAAAAT")
    TaggedSequence.create "seq4" ("ATGGAAAA")
    ]

let alignedSequences = 
    cw.AlignSequences(sequences,Seq.empty)

There are two ways of handling the gaps produced by alignments: Complete-Deletion and Pairwise Deletion inform yourself about them.
Hint1: The output of above alignment contains conservation information about the alignment. Hint2: You might want to add an additional parameter to your distance function modelling this behaviour. For that, use a Discriminated Union type.

Here is a suggestion for the general workflow for the real-world sequences:

Suggested workflow

Read your fasta formatted sequences

Perform multiple sequence alignment \(A\)

for each implemented distance function \(dist\):

\(\quad\)reconstruct phylogenetic tree for \(A\) with \(dist\) function

\(\quad\)Write as newick format

\(\quad\)Visualize tree, for example on http://etetoolkit.org/treeview/

References

Nei, M. & Zhang J. Evolutionary Distance: Estimation 2006 https://doi.org/10.1038/npg.els.0005108

https://en.wikipedia.org/wiki/Models_of_DNA_evolution

https://www.cs.rice.edu/~nakhleh/COMP571/Slides/Phylogenetics-DistanceMethods-Full.pdf

https://www.megasoftware.net/web_help_10/index.htm#t=Models_for_estimating_distances.htm

https://www.megasoftware.net/mega1_manual/Distance.html

Additional information

Testing

the MEGA software suit contains many evolutionary distance functions. You can use it as reference implementation to check wether your functions return correct results. To do that, download the GUI version, install it, and follow these steps:
- Click the Distances button and choose pairwise distance:
- Select a fasta file containing the sequences of interest, and choose the reference model from this dialogue:

Blog post

Don’t forget to describe the limits/weaknesses of the models in your blog post.
How to handle gaps in aligned sequences?
Provide reasoning why you chose your real world sequences

namespace FSharpAux

namespace FSharpAux.IO

namespace BioFSharp

namespace BioFSharp.IO

Multiple items
namespace FSharp

--------------------
namespace Microsoft.FSharp

namespace FSharp.Stats

namespace Plotly

namespace Plotly.NET

val yourSequences : seq<TaggedSequence<string,Nucleotides.Nucleotide>>

Multiple items
val seq : sequence:seq<'T> -> seq<'T>
<summary>Builds a sequence using sequence expression syntax</summary>
<param name="sequence">The input sequence.</param>
<returns>The result sequence.</returns>

--------------------
type seq<'T> = System.Collections.Generic.IEnumerable<'T>
<summary>An abbreviation for the CLI type <see cref="T:System.Collections.Generic.IEnumerable`1" /></summary>
<remarks> See the <see cref="T:Microsoft.FSharp.Collections.SeqModule" /> module for further operations related to sequences. See also <a href="https://docs.microsoft.com/dotnet/fsharp/language-reference/sequences">F# Language Guide - Sequences</a>. </remarks>

type TaggedSequence<'T,'S> = { Tag: 'T Sequence: seq<'S> } static member create : tag:'T -> sequence:seq<'S> -> TaggedSequence<'T,'S> static member mapSequence : mapping:(seq<'S> -> seq<'M>) -> ts:TaggedSequence<'T,'S> -> TaggedSequence<'T,'M> static member mapTag : mapping:('T -> 'U) -> ts:TaggedSequence<'T,'S> -> TaggedSequence<'U,'S>
<summary> Record of a sequence and its tag </summary>

Multiple items
val string : value:'T -> string
<summary>Converts the argument to a string using <c>ToString</c>.</summary>
<remarks>For standard integer and floating point values the and any type that implements <c>IFormattable</c><c>ToString</c> conversion uses <c>CultureInfo.InvariantCulture</c>. </remarks>
<param name="value">The input value.</param>
<returns>The converted string.</returns>

--------------------
type string = System.String
<summary>An abbreviation for the CLI type <see cref="T:System.String" />.</summary>
<category>Basic Types</category>

module Nucleotides from BioFSharp
<summary> Contains the Nucleotide type and its according functions. </summary>

type Nucleotide = | A | T | G | C | U | I | Gap | Ter | R | Y ... interface IBioItem static member op_Explicit : value:#IBioItem -> byte + 1 overload
<summary> Nucleotide Codes </summary>

val yourDistance : seqA:seq<#IBioItem> -> seqB:seq<#IBioItem> -> float

val seqA : seq<#IBioItem>

type IBioItem = abstract member Formula : Formula abstract member Name : string abstract member Symbol : char abstract member isGap : bool abstract member isTerminator : bool
<summary> Marker interface for BioItem base. </summary>

val seqB : seq<#IBioItem>

val myTree : PhylogeneticTree<TaggedSequence<string,Nucleotides.Nucleotide> * float>

Multiple items
module PhylogeneticTree from BioFSharp

--------------------
type PhylogeneticTree<'T> = | Branch of 'T * List<PhylogeneticTree<'T>> | Leaf of 'T
<summary> Recursive representation of a phylogenetic tree </summary>

val ofTaggedBioSequences : distanceFunction:(seq<'a> -> seq<'a> -> float) -> sequences:seq<TaggedSequence<string,'a>> -> PhylogeneticTree<TaggedSequence<string,'a> * float> (requires 'a :> IBioItem)
<summary>Performs hierarchical clustering of the input TaggedSequences using the provided distance function. Returns the result as a Phylogenetic tree.</summary>
<parameter name="distanceFunction">a function that determines the distance between two sequences e.g. evolutionary distance based on a substitution model</parameter>
<parameter name="sequences">the input TaggedSequences</parameter>

module ClustalOWrapper from BioFSharp.IO
<summary> Wrapper and its helpers for Clustal Omega multiple alignment tools </summary>

val cw : ClustalOWrapper

Multiple items
module ClustalOWrapper from BioFSharp.IO
<summary> Wrapper and its helpers for Clustal Omega multiple alignment tools </summary>

--------------------
type ClustalOWrapper = new : ?rootPath:string -> ClustalOWrapper member AlignFromFile : inputPath:Input * outputPath:string * parameters:seq<ClustalParams> * ?name:string -> unit member AlignSequences : input:seq<TaggedSequence<string,char>> * parameters:seq<ClustalParams> * ?name:string -> Alignment<TaggedSequence<string,char>,AlignmentInfo>
<summary> A wrapper to perform Clustal Omega alignment tasks </summary>

--------------------
new : ?rootPath:string -> ClustalOWrapper

val sequences : TaggedSequence<string,char> list

static member TaggedSequence.create : tag:'T -> sequence:seq<'S> -> TaggedSequence<'T,'S>

val alignedSequences : Alignment.Alignment<TaggedSequence<string,char>,Clustal.AlignmentInfo>

member ClustalOWrapper.AlignSequences : input:seq<TaggedSequence<string,char>> * parameters:seq<Parameters.ClustalParams> * ?name:string -> Alignment.Alignment<TaggedSequence<string,char>,Clustal.AlignmentInfo>

Multiple items
module Seq from Plotly.NET

--------------------
module Seq from FSharp.Stats
<summary> Module to compute common statistical measure </summary>

--------------------
module Seq from FSharpAux

--------------------
module Seq from Microsoft.FSharp.Collections
<summary>Contains operations for working with values of type <see cref="T:Microsoft.FSharp.Collections.seq`1" />.</summary>

--------------------
type Seq = static member CSV : separator:string -> header:bool -> flatten:bool -> data:seq<'a> -> seq<string> static member CSVwith : valFunc:('a -> ('a -> obj) []) -> strFunc:(string -> bool -> obj -> obj -> string) -> separator:string -> header:bool -> flatten:bool -> data:seq<'a> -> seq<string> static member fromFile : filePath:string -> seq<string> static member fromFileWithCsvSchema : filePath:string * separator:char * firstLineHasHeader:bool * ?skipLines:int * ?skipLinesBeforeHeader:int * ?schemaMode:SchemaModes -> seq<'schema> static member fromFileWithSep : separator:char -> filePath:string -> seq<string []> static member stringFunction : separator:string -> flatten:bool -> input:'a -> (obj -> string) static member valueFunction : dataEntry:'a -> ('a -> obj) [] static member write : path:string -> data:seq<'a> -> unit static member writeOrAppend : path:string -> data:seq<'a> -> unit

val empty<'T> : seq<'T>
<summary>Creates an empty sequence.</summary>
<returns>An empty sequence.</returns>