BioFSharp


String matching algorithms

📂View module documentation

String matching algorithms are concerned with finding a single or multiple matches of a query pattern within a source. The sub-modules of the BioFSharp.Algorithm.StringMatching module are organized as following:

  • AlgorithmName.findAll : returns the (zero-based) starting position of all matches of the query pattern in the source as an integer list

  • AlgorithmName.find : returns the (zero-based) starting position of the first match of the query pattern in the source as an integer

  • AlgorithmName.findFrom : returns the (zero-based) starting position of the first match of the query pattern in the source as an integer, starting from a specified (zero-based) position in the source

All algorithms contained in this library are implemented as generic as possible regarding the data type that the input source and query pattern can contain, being only restricted by implementing the IEquatable interface, as there must be a way to assure that two equal elements are indeed equal. However, the nature of the specific algorithm may impose additional restrictions, for example concerning the size of a single comparable item.

The type of query pattern and source has been constrained to be an array<'a> because many index accessions are made during the search.

Runtimes are provided using Bachmann-Landau notations:

  • n : length of the source
  • m : length of the query pattern
  • k : size of the Alphabet

The following query pattern and source will be used in all further examples demonstrating how to use the specific module.

1: 
2: 
3: 
4: 
5: 
///source containing the query pattern at positions 0,10 and 17 
let source = "AAABBBBBBBAAABBBBAAA" |> BioArray.ofAminoAcidString

///query pattern to find in the source
let queryPattern = "AAA"|> BioArray.ofAminoAcidString


The Naive Algorithm

The naive approach to the string matching problem is walking through the source starting from the beginning and checking at each position if the resulting substring equals the query pattern. While being inefficient, it may be beneficial to use it in cases where the speed advantage of another algorithm is neglegible or does not outhweigh the additional setup needed (for example if your source and query pattern are really short)

Usage

No additional setup needed. Just use query pattern and source as input parameters for the respective function

1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
//returns [0;10;17]
Naive.findAll queryPattern source

//returns 0
Naive.find queryPattern source

//returns 17
Naive.findFrom 12 queryPattern source

Runtime

  • Preprocessing: None

  • Matching time: O(nm)



Knuth Morris Pratt

The KMP algorithm makes use of previous match information to determine an amount of skips that can be made until the next position in the source gets examined as a possible match. To achieve that, a prefix table (or failure function) of the pattern needs to be computed, which determines the amount of skippable elements depending on the previous (partial) match.

Usage

 1: 
 2: 
 3: 
 4: 
 5: 
 6: 
 7: 
 8: 
 9: 
10: 
11: 
///prefix table computed from the query pattern
let prefixTable = KnuthMorrisPratt.createPrefixTable queryPattern

//returns [0;10;17]
KnuthMorrisPratt.findAll prefixTable queryPattern source

//returns 0
KnuthMorrisPratt.find prefixTable queryPattern source

//returns 17
KnuthMorrisPratt.findFrom prefixTable queryPattern 12 source

Runtime

  • Preprocessing: Θ(m)

  • Matching time: average: Θ(n)

  • Worst case(if the prefix table results in no possible skips): Θ(mn)



Rabin Karp

The RK Algorithm saves time by not comparing every single element of pattern and a substring of the same length from the source. Instead, it compares hashed versions of them. The RabinKarp module contains two submodules:

  • RKStandard: uses the build in .NET hash function hash

  • CP: uses a cyclic polynomial hash, which is way faster than the built in hash function. Elements must be castable to uint64 , as they get hashed as a 64 bit unsigned integer and updating the hash uses bitwise operators. Using large query patterns is problematic, as the combined hash may exceed the size of an uint64.

Usage

RKStandard

1: 
2: 
3: 
4: 
5: 
6: 
7: 
8: 
//returns [0;10;17]
RabinKarp.RKStandard.findAll queryPattern source

//returns 0
RabinKarp.RKStandard.find queryPattern source

//returns 17
RabinKarp.RKStandard.findFrom 12 queryPattern source

CP

As the AminoAcid type cannot be casted to an uint64, a conversion to their one character code is needed

 1: 
 2: 
 3: 
 4: 
 5: 
 6: 
 7: 
 8: 
 9: 
10: 
11: 
12: 
13: 
14: 
///source converted to char array 
let source' = source |> Array.map (fun x -> (x :> IBioItem).Symbol)

///query pattern converted to char array
let queryPattern' = queryPattern |> Array.map (fun x -> (x :> IBioItem).Symbol)

//returns [0;10;17]
RabinKarp.CP.findAll queryPattern' source'

//returns 0
RabinKarp.CP.find queryPattern' source'

//returns 17
RabinKarp.CP.findFrom 12 queryPattern' source'

Using your own hash functions

The RabinKarp module provides generic findAll, find and findFrom functions that take the following additional parameters:

  • blockHash: a function that hashes an entire array (used to hash the pattern and the first substring of the source)

  • updateHash: a function that removes an element from an existing hash and adds a new one

Just use functional composition to build your own findX functions. In fact the two provided algorithms are built the same way:

1: 
2: 
3: 
///RabinKarp.CP.findAll built by functional composition
let inline findAll (pattern : array<'a>) (s : array<'a>) = 
    RabinKarp.findAllGeneric (RabinKarp.CP.updateHash pattern.Length) (RabinKarp.CP.blockHash) pattern s 

Runtime

  • Preprocessing: Θ(m)
  • Matching time: average Θ(n + m)
  • Worst case: Θ((n−m)m)


Boyer Moore

coming soon...

Speed Comparison

Coming soon...

namespace System
namespace BioFSharp
namespace BioFSharp.IO
namespace FSharpAux
namespace FSharpAux.IO
module AminoAcids

from BioFSharp
module Nucleotides

from BioFSharp
module AminoAcidSymbols

from BioFSharp
namespace BioFSharp.Algorithm
module StringMatching

from BioFSharp.Algorithm
val source : BioArray.BioArray<AminoAcid>


source containing the query pattern at positions 0,10 and 17
module BioArray

from BioFSharp
val ofAminoAcidString : s:#seq<char> -> BioArray.BioArray<AminoAcid>
val queryPattern : BioArray.BioArray<AminoAcid>


query pattern to find in the source
module Naive

from BioFSharp.Algorithm.StringMatching
val findAll : pattern:'a array -> s:'a array -> int list (requires equality)
val find : pattern:'a array -> s:'a array -> int (requires equality)
val findFrom : startPos:int -> pattern:'a array -> s:'a array -> int (requires equality)
val prefixTable : int []


prefix table computed from the query pattern
module KnuthMorrisPratt

from BioFSharp.Algorithm.StringMatching
val createPrefixTable : pattern:'a array -> int [] (requires equality)
val findAll : prefixTable:int [] -> pattern:'a array -> s:'a array -> int list (requires equality)
val find : prefixTable:int [] -> pattern:'a array -> s:'a array -> int (requires equality)
val findFrom : prefixTable:int [] -> pattern:'a array -> startPos:int -> s:'a array -> int (requires equality)
module RabinKarp

from BioFSharp.Algorithm.StringMatching
module RKStandard

from BioFSharp.Algorithm.StringMatching.RabinKarp
val source' : char []


source converted to char array
type Array =
  member Clone : unit -> obj
  member CopyTo : array:Array * index:int -> unit + 1 overload
  member GetEnumerator : unit -> IEnumerator
  member GetLength : dimension:int -> int
  member GetLongLength : dimension:int -> int64
  member GetLowerBound : dimension:int -> int
  member GetUpperBound : dimension:int -> int
  member GetValue : [<ParamArray>] indices:int[] -> obj + 7 overloads
  member Initialize : unit -> unit
  member IsFixedSize : bool
  ...
val map : mapping:('T -> 'U) -> array:'T [] -> 'U []
val x : AminoAcid
type IBioItem =
  interface
    abstract member Formula : Formula
    abstract member Name : string
    abstract member Symbol : char
    abstract member isGap : bool
    abstract member isTerminator : bool
  end
val queryPattern' : char []


query pattern converted to char array
module CP

from BioFSharp.Algorithm.StringMatching.RabinKarp
val findAll : pattern:'a array -> s:'a array -> int list (requires member op_Explicit and equality)
val find : pattern:'a array -> s:'a array -> int (requires member op_Explicit and equality)
val findFrom : startPos:int -> pattern:'a array -> s:'a array -> int (requires member op_Explicit and equality)
val findAll : pattern:'a array -> s:'a array -> int list (requires member op_Explicit and equality)


RabinKarp.CP.findAll built by functional composition
val pattern : 'a array (requires member op_Explicit and equality)
type 'T array = 'T []
val s : 'a array (requires member op_Explicit and equality)
val findAllGeneric : updateHash:('c -> 'a -> 'a -> 'c) -> blockHash:('a array -> 'c) -> pattern:'a array -> s:'a array -> int list (requires equality and equality)
val updateHash : pLength:int32 -> hs:uint64 -> inchar:'b -> outchar:'c -> uint64 (requires member op_Explicit and member op_Explicit)
val blockHash : pattern:'a array -> uint64 (requires member op_Explicit)
Fork me on GitHub