Sequencing by Hybridization as an Eulerian Path Problem

Interested? Contact muehlhaus@bio.uni-kl.de or ottj@rhrk.uni-kl.de

Table of contents

Introduction

Sequencing by Hybridization (SBH) involves building a miniature DNA array (or DNA chip), that contains thousands of short DNA fragments (probes). Each of these probes gives information about the presence of a known, but short, sequence in the unknown DNA sequence. All these pieces of information together should reveal the identity of the target DNA sequence. Given a short probe (an 8- to 30-nucleotide single-stranded synthetic DNA fragment) and a single-stranded target DNA fragment, the target will hybridize with the probe if the probe is a substring of the target's complement. When the probe and the target are mixed together, they form a weak chemical bond and stick together. For example, a probe ACCGTGGA will hybridize to a target CCCTGGCACCTA since it is complementary to the substring TGGCACCT of the target.

Given an unknown DNA sequence, an array provides information about all strings of length l (l-mer) that the sequence contains, but does not provide information about their positions in the sequence. For example, the 8-mer composition of CCCTGGCACCTA is {CCCTGGCA, CCTGGCAC, CTGGCACC, TGGCACCT,GGCACCTA} The reduction of the SBH problem to an Eulerian Path problem is to construct a graph whose edges, rather than vertices, correspond to those l-mers, and then to find a path in this graph visiting every edge exactly once. Paths visiting ALL EDGES correspond to sequence reconstructions.

Aim for this project

  1. Blog post introducing the method, its applications, and limitations.
  2. Implement a function which creates a directed graph from a given set of l-mers
  3. Implement Hierholzer's algorithm for finding Eulerian paths in directed graphs

Coding clues

Setting up the environment

  • Create a F# script file (.fsx) and paste the following text at the top of your file:
#r "nuget: FSharpAux, 1.1.0"
#r "nuget: BioFSharp, 2.0.0-preview.1"
#r "nuget: FSharp.FGL, 0.0.2"
#r "nuget: Cyjs.NET, 0.0.4"

open FSharpAux
open BioFSharp
open FSharp.FGL
open FSharp.FGL.Directed
open Cyjs.NET

Creating an Eulerian Graph of l-mers with FSharp.FGL

  • Vertices correspond to (l-1)-mers
  • Edges correspond to l-mers from the spectrum
  • All functions should operate on either BioArray, BioList or BioSeq

  • Start with any Nucleotide array

let sampleSequence: array<Nucleotide> = 
    "ATGGCGTGCA"
    |> BioArray.ofNucleotideString

  • Compute the l-mers of the Nucleotide array (in this case 3-mers)

let lMers3: array<array<Nucleotide>> = 
    [|[|A;T;G|]; [|T;G;G|]; [|T;G;C|]; [|G;T;G|]; [|G;G;C|]; [|G;C;A|]; [|G;C;G|]; [|C;G;T|]|]

  • Initialize a directed graph

let graph: Graph<int,array<Nucleotide>,array<Nucleotide>> = Graph.empty

  • Create a list with vertices based on the l-mers and add them to the graph

let vertices: list<int*array<Nucleotide>> =
    [(1, [|A; T|]); (2, [|T; G|]); (3, [|G; T|]); (4, [|G; G|]); (5, [|G; C|]);(6, [|C; G|]); (7, [|C; A|])]

let graphWithVertices: Graph<int,array<Nucleotide>,array<Nucleotide>> =
    Vertices.addMany vertices graph

  • Create a list with edges based in the vertices and l-mers and add them to the graph

let edges: list<int*int*array<Nucleotide>>=
    [(1, 2, [|A; T; G|]); (2, 4, [|T; G; G|]); (2, 5, [|T; G; C|]); (3, 2, [|G; T; G|]);
    (4, 5, [|G; G; C|]); (5, 7, [|G; C; A|]); (5, 6, [|G; C; G|]); (6, 3, [|C; G; T|])]

let graphWithEdges: Graph<int,array<Nucleotide>,array<Nucleotide>> =
    Edges.addMany edges graphWithVertices

  • You can visualize the graph using Cyjs.NET

let inline toCyJS (g : Graph<'Vertex,array<Nucleotide>,array<Nucleotide>>) =
    let vertices = 
        g
        |> Vertices.toVertexList
        |> List.map (fun (id,name) ->
            Elements.node (string id) [CyParam.label (name |> BioArray.toString)]
        )

    let edges =
        g
        |> Edges.toEdgeList
        |> List.map (fun (v1,v2,weight) -> 
            Elements.edge (string v1 + string v2) (string v1) (string v2) [CyParam.weight (weight |> BioArray.toString)]
        )

    CyGraph.initEmpty ()
    |> CyGraph.withElements vertices
    |> CyGraph.withElements edges
    |> CyGraph.withLayout (Layout.initBreadthfirst id)
    |> CyGraph.withStyle "node" [CyParam.content =. CyParam.label]
    |> CyGraph.withStyle "edge"     
                [
                    CyParam.Curve.style "bezier"
                    CyParam.opacity 0.666
                    CyParam.width <=. (CyParam.weight, 70, 100, 5, 5)
                    CyParam.Target.Arrow.shape "triangle"
                    CyParam.Line.color =. CyParam.color
                    CyParam.Target.Arrow.color =. CyParam.color
                    CyParam.Source.Arrow.color =. CyParam.color
                ]

graphWithEdges
|> toCyJS
No value returned by any evaluator
  • This graph is semibalanced (|indegree - outdegree| = 1). If a graph has an Eulerian path starting at vertex s and ending at vertex t, then all its vertices are balanced, with the possible exception of s and t, which may be semibalanced.

  • The Eulerian path problem can be reduced to the Eulerian cycle problem by adding an edge between two semibalanced vertices.

Hierholzer's algorithm

  • Choose any starting vertex, and follow a path along the unused edges from that vertex until you return to it. You will alway return to the starting vertex in a balanced Eulerian graph. Since every vertex has indegreee = outdegree, there is always an unused edge to leave the current vertex. The path found by doing this is a closed tour, starting and ending at the same vertex, but not necessarily covering all vertices and edges.

  • As long as there exists a vertex in the current closed tour that has unused edges, you can start finding a new closed tour and join it with the previously found tour.
  • Since we assume the original graph is connected, repeating the previous step will cover all edges of the graph.
  • Implement the algorithm, so that it works on a FSharp.FGL.Graph

Working with FSharp.FGL.Graphs

  • Returns all outward edges of vertex 1
Graph.getContext 1 graphWithEdges
|> Vertices.outwardEdges
[(1, 2, [|A; T; G|])]

  • Returns all inward edges of vertex 2

Graph.getContext 2 graphWithEdges
|> Vertices.inwardEdges
[(1, 2, [|A; T; G|]); (3, 2, [|G; T; G|])]

  • Returns all inward edges for all vertices in the graph

Graph.mapContexts Vertices.inwardEdges graphWithEdges
map
  [(1, []); (2, [(1, 2, [|A; T; G|]); (3, 2, [|G; T; G|])]);
   (3, [(6, 3, [|C; G; T|])]); (4, [(2, 4, [|T; G; G|])]);
   (5, [(2, 5, [|T; G; C|]); (4, 5, [|G; G; C|])]); (6, [(5, 6, [|G; C; G|])]);
   (7, [(5, 7, [|G; C; A|])])]

  • Returns all outward edges for all vertices in the graph

Graph.mapContexts Vertices.outwardEdges graphWithEdges
map
  [(1, [(1, 2, [|A; T; G|])]); (2, [(2, 4, [|T; G; G|]); (2, 5, [|T; G; C|])]);
   (3, [(3, 2, [|G; T; G|])]); (4, [(4, 5, [|G; G; C|])]);
   (5, [(5, 6, [|G; C; G|]); (5, 7, [|G; C; A|])]); (6, [(6, 3, [|C; G; T|])]);
   (7, [])]

  • Returns the adjacency matrix representation of the graph

Graph.toAdjacencyMatrix graphWithEdges

References

Additional information

Testing

  • Implement a function that returns an array of l-mers for any given DNA-Sequence
  • Generate an Eulerian Graph based on the returned set of l-mers
  • Compare the reconstructed sequence to the original sequence
  • Optional: Implement a function that performs the steps above for a large number of sequences with varying l-mer lengths

Blog post

  • Introduction about SBH, and how it is used today
  • Describe the limits and weaknesses of the approach in your blog post
  • Compare Hierholzer's algorithm to Fleury's algorithm
namespace FSharpAux
namespace BioFSharp
Multiple items
namespace FSharp

--------------------
namespace Microsoft.FSharp
namespace FSharp.FGL
namespace FSharp.FGL.Directed
namespace Cyjs
namespace Cyjs.NET
module Nucleotides from BioFSharp
<summary> Contains the Nucleotide type and its according functions. </summary>
val sampleSequence : Nucleotide array
type 'T array = 'T []
<summary>Single dimensional, zero-based arrays, written <c>int[]</c>, <c>string[]</c> etc.</summary>
<remarks>Use the values in the <see cref="T:Microsoft.FSharp.Collections.ArrayModule" /> module to manipulate values of this type, or the notation <c>arr.[x]</c> to get/set array values.</remarks>
<category>Basic Types</category>
type Nucleotide = | A | T | G | C | U | I | Gap | Ter | R | Y ... interface IBioItem static member op_Explicit : value:#IBioItem -> byte + 1 overload
<summary> Nucleotide Codes </summary>
Multiple items
module BioArray from BioFSharp.BioCollectionsExtensions

--------------------
module BioArray from BioFSharp
<summary> This module contains the BioArray type and its according functions. The BioArray type is an array of objects using the IBioItem interface </summary>
val ofNucleotideString : s:#seq<char> -> BioArray.BioArray<Nucleotide>
<summary> Generates nucleotide sequence of one-letter-code raw string </summary>
val lMers3 : Nucleotide array array
union case Nucleotide.A: Nucleotide
<summary> A : Adenine </summary>
union case Nucleotide.T: Nucleotide
<summary> T : Thymidine (only DNA) </summary>
union case Nucleotide.G: Nucleotide
<summary> G : Guanine </summary>
union case Nucleotide.C: Nucleotide
<summary> C : Cytosine </summary>
val graph : Graph<int,Nucleotide array,Nucleotide array>
Multiple items
module Graph from FSharp.FGL.Directed

--------------------
module Graph from FSharp.FGL
<summary> General functions for both directed and undirected graphs </summary>

--------------------
module Graph from FSharpAux

--------------------
type Graph<'Vertex,'Label,'Edge (requires comparison)> = Map<'Vertex,MContext<'Vertex,'Label,'Edge>>
<summary> Map of Vertices as keys and MContexts as values </summary>
Multiple items
val int : value:'T -> int (requires member op_Explicit)
<summary>Converts the argument to signed 32-bit integer. This is a direct conversion for all primitive numeric types. For strings, the input is converted using <c>Int32.Parse()</c> with InvariantCulture settings. Otherwise the operation requires an appropriate static conversion method on the input type.</summary>
<param name="value">The input value.</param>
<returns>The converted int</returns>


--------------------
[<Struct>] type int = int32
<summary>An abbreviation for the CLI type <see cref="T:System.Int32" />.</summary>
<category>Basic Types</category>


--------------------
type int<'Measure> = int
<summary>The type of 32-bit signed integer numbers, annotated with a unit of measure. The unit of measure is erased in compiled code and when values of this type are analyzed using reflection. The type is representationally equivalent to <see cref="T:System.Int32" />.</summary>
<category>Basic Types with Units of Measure</category>
val empty : Graph<'Vertex,'Label,'Edge> (requires comparison)
<summary> Creates a new, empty graph. </summary>
val vertices : (int * Nucleotide array) list
type 'T list = List<'T>
<summary>The type of immutable singly-linked lists. </summary>
<remarks>See the <see cref="T:Microsoft.FSharp.Collections.ListModule" /> module for further operations related to lists. Use the constructors <c>[]</c> and <c>::</c> (infix) to create values of this type, or the notation <c>[1; 2; 3]</c>. Use the values in the <c>List</c> module to manipulate values of this type, or pattern match against the values directly. See also <a href="https://docs.microsoft.com/dotnet/fsharp/language-reference/lists">F# Language Guide - Lists</a>. </remarks>
val graphWithVertices : Graph<int,Nucleotide array,Nucleotide array>
Multiple items
module Vertices from FSharp.FGL.Directed
<summary> Functions for vertices of directed Graphs </summary>

--------------------
module Vertices from FSharp.FGL
<summary> Functions for vertices of both directed and undirected graphs </summary>
val addMany : vertices:LVertex<'Vertex,'Label> list -> g:Graph<'Vertex,'Label,'Edge> -> Graph<'Vertex,'Label,'Edge> (requires comparison)
<summary> Adds a list of labeled vertices to the graph. </summary>
val edges : (int * int * Nucleotide array) list
val graphWithEdges : Graph<int,Nucleotide array,Nucleotide array>
module Edges from FSharp.FGL.Directed
<summary> Functions for edges of directed Graphs </summary>
val addMany : edges:LEdge<'Vertex,'Edge> list -> g:Graph<'Vertex,'Label,'Edge> -> Graph<'Vertex,'Label,'Edge> (requires comparison)
<summary> Adds a list of labeled, directed edges to the graph. </summary>
val toCyJS : g:Graph<'Vertex,Nucleotide array,Nucleotide array> -> CyGraph.CyGraph (requires comparison)
val g : Graph<'Vertex,Nucleotide array,Nucleotide array> (requires comparison)
val vertices : Elements.Node list
val toVertexList : g:Graph<'Vertex,'Label,'Edge> -> LVertex<'Vertex,'Label> list (requires comparison)
<summary> Creates a list of all vertices and their labels. </summary>
Multiple items
module List from FSharpAux

--------------------
module List from Microsoft.FSharp.Collections
<summary>Contains operations for working with values of type <see cref="T:Microsoft.FSharp.Collections.list`1" />.</summary>
<namespacedoc><summary>Operations for collections such as lists, arrays, sets, maps and sequences. See also <a href="https://docs.microsoft.com/dotnet/fsharp/language-reference/fsharp-collection-types">F# Collection Types</a> in the F# Language Guide. </summary></namespacedoc>


--------------------
type List<'T> = | ( [] ) | ( :: ) of Head: 'T * Tail: 'T list interface IReadOnlyList<'T> interface IReadOnlyCollection<'T> interface IEnumerable interface IEnumerable<'T> member GetReverseIndex : rank:int * offset:int -> int member GetSlice : startIndex:int option * endIndex:int option -> 'T list static member Cons : head:'T * tail:'T list -> 'T list member Head : 'T member IsEmpty : bool member Item : index:int -> 'T with get ...
<summary>The type of immutable singly-linked lists.</summary>
<remarks>Use the constructors <c>[]</c> and <c>::</c> (infix) to create values of this type, or the notation <c>[1;2;3]</c>. Use the values in the <c>List</c> module to manipulate values of this type, or pattern match against the values directly. </remarks>
<exclude />
val map : mapping:('T -> 'U) -> list:'T list -> 'U list
<summary>Builds a new collection whose elements are the results of applying the given function to each of the elements of the collection.</summary>
<param name="mapping">The function to transform elements from the input list.</param>
<param name="list">The input list.</param>
<returns>The list of transformed elements.</returns>
val id : 'Vertex (requires comparison)
val name : Nucleotide array
Multiple items
module Elements from Cyjs.NET

--------------------
module Elements from BioFSharp
<summary> Contains chemical elements represented as a mixture of their stable isotopes and functionality for building them </summary>
val node : id:string -> dataAttributes:CyParam.CyStyleParam list -> Elements.Node
Multiple items
val string : value:'T -> string
<summary>Converts the argument to a string using <c>ToString</c>.</summary>
<remarks>For standard integer and floating point values the and any type that implements <c>IFormattable</c><c>ToString</c> conversion uses <c>CultureInfo.InvariantCulture</c>. </remarks>
<param name="value">The input value.</param>
<returns>The converted string.</returns>


--------------------
type string = System.String
<summary>An abbreviation for the CLI type <see cref="T:System.String" />.</summary>
<category>Basic Types</category>
module CyParam from Cyjs.NET
val label : v:'a -> CyParam.CyStyleParam
val toString : bs:BioArray.BioArray<#IBioItem> -> string
<summary> Returns string of one-letter-code </summary>
val edges : Elements.Edge list
val toEdgeList : g:Graph<'Vertex,'Label,'Edge> -> LEdge<'Vertex,'Edge> list (requires comparison)
<summary> Creates a list of all edges and their labels. </summary>
val v1 : 'Vertex (requires comparison)
val v2 : 'Vertex (requires comparison)
val weight : Nucleotide array
val edge : id:string -> sourceId:string -> targetId:string -> dataAttributes:CyParam.CyStyleParam list -> Elements.Edge
val weight : v:'a -> CyParam.CyStyleParam
module CyGraph from Cyjs.NET
val initEmpty : unit -> CytoscapeModel.Cytoscape
val withElements : elems:seq<CytoscapeModel.Element> -> cy:CyGraph.CyGraph -> CyGraph.CyGraph
val withLayout : ly:Layout -> cy:CyGraph.CyGraph -> CyGraph.CyGraph
Multiple items
module Layout from Cyjs.NET

--------------------
type Layout = inherit DynamicObj new : name:string -> Layout member name : string
<summary> Layout type inherits from dynamic object </summary>

--------------------
new : name:string -> Layout
val initBreadthfirst : applyOption:(Layout -> Layout) -> Layout
<summary> initializes a layout of type "breadthfirst" applying the givin layout option function. The "breadthfirst" layout puts nodes in a hierarchy, based on a breadthfirst traversal of the graph. It is best suited to trees and forests in its default top-down mode, and it is best suited to DAGs in its circle mode. </summary>
val id : x:'T -> 'T
<summary>The identity function</summary>
<param name="x">The input value.</param>
<returns>The same value.</returns>
val withStyle : selector:string -> cyStyles:seq<CyParam.CyStyleParam> -> cy:CyGraph.CyGraph -> CyGraph.CyGraph
val content : v:'a -> CyParam.CyStyleParam
module Curve from Cyjs.NET.CyParam
val style : v:'a -> CyParam.CyStyleParam
val opacity : v:'a -> CyParam.CyStyleParam
val width : v:'a -> CyParam.CyStyleParam
module Target from Cyjs.NET.CyParam
module Arrow from Cyjs.NET.CyParam.Target
val shape : v:'a -> CyParam.CyStyleParam
module Line from Cyjs.NET.CyParam
val color : v:'a -> CyParam.CyStyleParam
module Source from Cyjs.NET.CyParam
module Arrow from Cyjs.NET.CyParam.Source
val withSize : width:int * height:int -> cy:CyGraph.CyGraph -> CyGraph.CyGraph
module HTML from Cyjs.NET
<summary> HTML template for Cytoscape </summary>
val toEmbeddedHTML : cy:CytoscapeModel.Cytoscape -> string
<summary> Converts a CyGraph to it HTML representation and embeds it into a html page. </summary>
val getContext : v:'Vertex -> g:Graph<'Vertex,'Label,'Edge> -> Context<'Vertex,'Label,'Edge> (requires comparison)
<summary> Lookup a context in the graph. Raising KeyNotFoundException if no binding exists in the graph. </summary>
val outwardEdges : Adj<'Vertex,'Edge> * 'Vertex * 'Label * Adj<'Vertex,'Edge> -> LEdge<'Vertex,'Edge> list (requires comparison)
<summary> Returns all edges pointing away from the given vertex </summary>
val inwardEdges : Adj<'Vertex,'Edge> * 'Vertex * 'Label * Adj<'Vertex,'Edge> -> LEdge<'Vertex,'Edge> list (requires comparison)
<summary> Returns all edges pointing to the given vertex </summary>
val mapContexts : mapping:(Context<'Vertex,'Label,'Edge> -> 'T) -> g:Graph<'Vertex,'Label,'Edge> -> Map<'Vertex,'T> (requires comparison)
<summary> Maps contexts of the graph. </summary>
val toAdjacencyMatrix : g:Graph<'Vertex,'Label,'Edge> -> 'Edge [] [] (requires comparison)
<summary> Transforms a graph into an adjacency matrix of its edges. </summary>