Header menu logo BioFSharp

Pattern query algorithms

BinderScriptNotebook

Summary: This example shows how to use different pattern query algorithms in BioFSharp

String matching algorithms are concerned with finding a single or multiple matches of a query pattern within a source. The sub-modules of the BioFSharp.Algorithm.StringMatching module are organized as following:

All algorithms contained in this library are implemented as generic as possible regarding the data type that the input source and query pattern can contain, being only restricted by implementing the IEquatable interface, as there must be a way to assure that two equal elements are indeed equal. However, the nature of the specific algorithm may impose additional restrictions, for example concerning the size of a single comparable item.

The type of query pattern and source has been constrained to be an array<'a> because many index accessions are made during the search.

Runtimes are provided using Bachmann-Landau notations:

The following query pattern and source will be used in all further examples demonstrating how to use the specific module.

open BioFSharp

///source containing the query pattern at positions 0,10 and 17 
let source = "AAABBBBBBBAAABBBBAAA" |> BioArray.ofAminoAcidString

///query pattern to find in the source
let queryPattern = "AAA"|> BioArray.ofAminoAcidString

The Naive Algorithm

The naive approach to the string matching problem is walking through the source starting from the beginning and checking at each position if the resulting substring equals the query pattern. While being inefficient, it may be beneficial to use it in cases where the speed advantage of another algorithm is neglegible or does not outhweigh the additional setup needed (for example if your source and query pattern are really short)

Usage

No additional setup needed. Just use query pattern and source as input parameters for the respective function

open BioFSharp.Algorithm.StringMatching

Naive.findAll queryPattern source
Warning: Output, it-value and value references require --eval
Naive.find queryPattern source
Warning: Output, it-value and value references require --eval
Naive.findFrom 12 queryPattern source
Warning: Output, it-value and value references require --eval

Runtime

Knuth Morris Pratt

The KMP algorithm makes use of previous match information to determine an amount of skips that can be made until the next position in the source gets examined as a possible match. To achieve that, a prefix table (or failure function) of the pattern needs to be computed, which determines the amount of skippable elements depending on the previous (partial) match.

Usage

///prefix table computed from the query pattern
let prefixTable = KnuthMorrisPratt.createPrefixTable queryPattern

KnuthMorrisPratt.findAll prefixTable queryPattern source
Warning: Output, it-value and value references require --eval
KnuthMorrisPratt.find prefixTable queryPattern source
Warning: Output, it-value and value references require --eval
KnuthMorrisPratt.findFrom prefixTable queryPattern 12 source
Warning: Output, it-value and value references require --eval

Runtime

Rabin Karp

The RK Algorithm saves time by not comparing every single element of pattern and a substring of the same length from the source. Instead, it compares hashed versions of them. The RabinKarp module contains two submodules:

Usage

RKStandard

RabinKarp.RKStandard.findAll queryPattern source
Warning: Output, it-value and value references require --eval
RabinKarp.RKStandard.find queryPattern source
Warning: Output, it-value and value references require --eval
RabinKarp.RKStandard.findFrom 12 queryPattern source
Warning: Output, it-value and value references require --eval

CP

As the AminoAcid type cannot be casted to an uint64, a conversion to their one character code is needed

///source converted to char array 
let source' = source |> Array.map (fun x -> (x :> IBioItem).Symbol)

///query pattern converted to char array
let queryPattern' = queryPattern |> Array.map (fun x -> (x :> IBioItem).Symbol)

RabinKarp.CP.findAll queryPattern' source'
Warning: Output, it-value and value references require --eval
RabinKarp.CP.find queryPattern' source'
Warning: Output, it-value and value references require --eval
RabinKarp.CP.findFrom 12 queryPattern' source'
Warning: Output, it-value and value references require --eval

Using your own hash functions

The RabinKarp module provides generic findAll, find and findFrom functions that take the following additional parameters:

Just use functional composition to build your own findX functions. In fact the two provided algorithms are built the same way:

///RabinKarp.CP.findAll built by functional composition
let inline findAll (pattern : array<'a>) (s : array<'a>) = 
    RabinKarp.findAllGeneric (RabinKarp.CP.updateHash pattern.Length) (RabinKarp.CP.blockHash) pattern s 

Runtime

Boyer Moore

coming soon...

Speed Comparison

Coming soon...

namespace BioFSharp
val source: BioArray.BioArray<AminoAcids.AminoAcid>
source containing the query pattern at positions 0,10 and 17
Multiple items
module BioArray from BioFSharp.BioCollectionsExtensions

--------------------
module BioArray from BioFSharp
<summary> This module contains the BioArray type and its according functions. The BioArray type is an array of objects using the IBioItem interface </summary>
val ofAminoAcidString: s: #(char seq) -> BioArray.BioArray<AminoAcids.AminoAcid>
<summary> Generates amino acid sequence of one-letter-code raw string </summary>
val queryPattern: BioArray.BioArray<AminoAcids.AminoAcid>
query pattern to find in the source
namespace BioFSharp.Algorithm
module StringMatching from BioFSharp.Algorithm
<summary> A collection of different string matching algorithms </summary>
module Naive from BioFSharp.Algorithm.StringMatching
<summary> A collection of naive string matching algorithms </summary>
val findAll: pattern: 'a array -> s: 'a array -> int list (requires equality)
<summary> finds all matches of a query pattern in a source </summary>
val find: pattern: 'a array -> s: 'a array -> int (requires equality)
<summary> finds the first match of a query pattern in a source </summary>
val findFrom: startPos: int -> pattern: 'a array -> s: 'a array -> int (requires equality)
<summary> finds the first match of a query pattern in a source starting from a specific position in the source </summary>
val prefixTable: int array
prefix table computed from the query pattern
module KnuthMorrisPratt from BioFSharp.Algorithm.StringMatching
<summary> A collection of Knuth-Morris-Pratt string matching algorithms </summary>
val createPrefixTable: pattern: 'a array -> int array (requires equality)
<summary> creates a prefix table for a query pattern </summary>
val findAll: prefixTable: int array -> pattern: 'a array -> s: 'a array -> int list (requires equality)
<summary> finds all matches of a query pattern in a source using a given prefix table </summary>
val find: prefixTable: int array -> pattern: 'a array -> s: 'a array -> int (requires equality)
<summary> finds the first match of a query pattern in a source using a given prefix table </summary>
val findFrom: prefixTable: int array -> pattern: 'a array -> startPos: int -> s: 'a array -> int (requires equality)
<summary> finds the first match of a query pattern in a source starting from a specific position in the source using a given prefix table </summary>
module RabinKarp from BioFSharp.Algorithm.StringMatching
<summary> A collection of Rabin-Karp string matching algorithms and hash functions </summary>
module RKStandard from BioFSharp.Algorithm.StringMatching.RabinKarp
<summary> A collection of Rabin-Karp string matching algorithms using the built-in hash function of f# </summary>
val source': char array
source converted to char array
module Array from Microsoft.FSharp.Collections
val map: mapping: ('T -> 'U) -> array: 'T array -> 'U array
val x: AminoAcids.AminoAcid
type IBioItem = abstract Formula: Formula abstract Name: string abstract Symbol: char abstract isGap: bool abstract isTerminator: bool
<summary> Marker interface for BioItem base. </summary>
val queryPattern': char array
query pattern converted to char array
module CP from BioFSharp.Algorithm.StringMatching.RabinKarp
<summary> A collection of Rabin-Karp string matching algorithms using the cyclic polynomial (CP) hash </summary>
val findAll: pattern: 'a array -> s: 'a array -> int list (requires member op_Explicit and equality)
<summary> find all matches of a query pattern in a source </summary>
val find: pattern: 'a array -> s: 'a array -> int (requires member op_Explicit and equality)
<summary> find the first match of a query pattern in a source </summary>
val findFrom: startPos: int -> pattern: 'a array -> s: 'a array -> int (requires member op_Explicit and equality)
<summary> find the first match of a query pattern in a source starting from a specific position in the source </summary>
val findAll: pattern: 'a array -> s: 'a array -> int list (requires member op_Explicit and equality)
RabinKarp.CP.findAll built by functional composition
val pattern: 'a array (requires member op_Explicit and equality)
type 'T array = 'T array
'a
val s: 'a array (requires member op_Explicit and equality)
val findAllGeneric: updateHash: ('c -> 'a -> 'a -> 'c) -> blockHash: ('a array -> 'c) -> pattern: 'a array -> s: 'a array -> int list (requires equality and equality)
<summary> takes an updateHash and blockHash function to find all matches of a query pattern in a source </summary>
val updateHash: pLength: int32 -> hs: uint64 -> inchar: 'a -> outchar: 'b -> uint64 (requires member op_Explicit and member op_Explicit)
<summary> updates an existing hashvalue </summary>
property System.Array.Length: int with get
<summary>Gets the total number of elements in all the dimensions of the <see cref="T:System.Array" />.</summary>
<exception cref="T:System.OverflowException">The array is multidimensional and contains more than <see cref="F:System.Int32.MaxValue">Int32.MaxValue</see> elements.</exception>
<returns>The total number of elements in all the dimensions of the <see cref="T:System.Array" />; zero if there are no elements in the array.</returns>
val blockHash: pattern: 'a array -> uint64 (requires member op_Explicit)
<summary> hashes a pattern </summary>

Type something to start searching.