Pattern query algorithms
Summary: This example shows how to use different pattern query algorithms in BioFSharp
String matching algorithms are concerned with finding a single or multiple matches of a query pattern within a source. The sub-modules of the BioFSharp.Algorithm.StringMatching
module are organized as following:
AlgorithmName.findAll : returns the (zero-based) starting position of all matches of the query pattern in the source as an integer list
AlgorithmName.find : returns the (zero-based) starting position of the first match of the query pattern in the source as an integer
AlgorithmName.findFrom : returns the (zero-based) starting position of the first match of the query pattern in the source as an integer, starting from a specified (zero-based) position in the source
All algorithms contained in this library are implemented as generic as possible regarding the data type that the input source and query pattern can contain, being only restricted by implementing the IEquatable
interface, as there must be a way to assure that two equal elements are indeed equal.
However, the nature of the specific algorithm may impose additional restrictions, for example concerning the size of a single comparable item.
The type of query pattern and source has been constrained to be an array<'a>
because many index accessions are made during the search.
Runtimes are provided using Bachmann-Landau notations:
- n : length of the source
- m : length of the query pattern
- k : size of the Alphabet
The following query pattern and source will be used in all further examples demonstrating how to use the specific module.
open BioFSharp
///source containing the query pattern at positions 0,10 and 17
let source = "AAABBBBBBBAAABBBBAAA" |> BioArray.ofAminoAcidString
///query pattern to find in the source
let queryPattern = "AAA"|> BioArray.ofAminoAcidString
The Naive Algorithm
The naive approach to the string matching problem is walking through the source starting from the beginning and checking at each position if the resulting substring equals the query pattern. While being inefficient, it may be beneficial to use it in cases where the speed advantage of another algorithm is neglegible or does not outhweigh the additional setup needed (for example if your source and query pattern are really short)
Usage
No additional setup needed. Just use query pattern and source as input parameters for the respective function
open BioFSharp.Algorithm.StringMatching
Naive.findAll queryPattern source
Warning: Output, it-value and value references require --eval
Naive.find queryPattern source
Warning: Output, it-value and value references require --eval
Naive.findFrom 12 queryPattern source
Warning: Output, it-value and value references require --eval
Runtime
Preprocessing: None
Matching time: O(nm)
Knuth Morris Pratt
The KMP algorithm makes use of previous match information to determine an amount of skips that can be made until the next position in the source gets examined as a possible match. To achieve that, a prefix table (or failure function) of the pattern needs to be computed, which determines the amount of skippable elements depending on the previous (partial) match.
Usage
///prefix table computed from the query pattern
let prefixTable = KnuthMorrisPratt.createPrefixTable queryPattern
KnuthMorrisPratt.findAll prefixTable queryPattern source
Warning: Output, it-value and value references require --eval
KnuthMorrisPratt.find prefixTable queryPattern source
Warning: Output, it-value and value references require --eval
KnuthMorrisPratt.findFrom prefixTable queryPattern 12 source
Warning: Output, it-value and value references require --eval
Runtime
Preprocessing: Θ(m)
Matching time: average: Θ(n)
Worst case(if the prefix table results in no possible skips): Θ(mn)
Rabin Karp
The RK Algorithm saves time by not comparing every single element of pattern and a substring of the same length from the source. Instead, it compares hashed versions of them. The RabinKarp module contains two submodules:
RKStandard: uses the build in .NET hash function
hash
-
CP: uses a cyclic polynomial hash, which is way faster than the built in hash function. Elements must be castable to
uint64
, as they get hashed as a 64 bit unsigned integer and updating the hash uses bitwise operators. Using large query patterns is problematic, as the combined hash may exceed the size of anuint64
.
Usage
RKStandard
RabinKarp.RKStandard.findAll queryPattern source
Warning: Output, it-value and value references require --eval
RabinKarp.RKStandard.find queryPattern source
Warning: Output, it-value and value references require --eval
RabinKarp.RKStandard.findFrom 12 queryPattern source
Warning: Output, it-value and value references require --eval
CP
As the AminoAcid type cannot be casted to an uint64
, a conversion to their one character code is needed
///source converted to char array
let source' = source |> Array.map (fun x -> (x :> IBioItem).Symbol)
///query pattern converted to char array
let queryPattern' = queryPattern |> Array.map (fun x -> (x :> IBioItem).Symbol)
RabinKarp.CP.findAll queryPattern' source'
Warning: Output, it-value and value references require --eval
RabinKarp.CP.find queryPattern' source'
Warning: Output, it-value and value references require --eval
RabinKarp.CP.findFrom 12 queryPattern' source'
Warning: Output, it-value and value references require --eval
Using your own hash functions
The RabinKarp
module provides generic findAll
, find
and findFrom
functions that take the following additional parameters:
blockHash
: a function that hashes an entire array (used to hash the pattern and the first substring of the source)updateHash
: a function that removes an element from an existing hash and adds a new one
Just use functional composition to build your own findX functions. In fact the two provided algorithms are built the same way:
///RabinKarp.CP.findAll built by functional composition
let inline findAll (pattern : array<'a>) (s : array<'a>) =
RabinKarp.findAllGeneric (RabinKarp.CP.updateHash pattern.Length) (RabinKarp.CP.blockHash) pattern s
Runtime
- Preprocessing: Θ(m)
- Matching time: average Θ(n + m)
- Worst case: Θ((n−m)m)
Boyer Moore
coming soon...
Speed Comparison
Coming soon...
source containing the query pattern at positions 0,10 and 17
module BioArray from BioFSharp.BioCollectionsExtensions
--------------------
module BioArray from BioFSharp
<summary> This module contains the BioArray type and its according functions. The BioArray type is an array of objects using the IBioItem interface </summary>
<summary> Generates amino acid sequence of one-letter-code raw string </summary>
query pattern to find in the source
<summary> A collection of different string matching algorithms </summary>
<summary> A collection of naive string matching algorithms </summary>
<summary> finds all matches of a query pattern in a source </summary>
<summary> finds the first match of a query pattern in a source </summary>
<summary> finds the first match of a query pattern in a source starting from a specific position in the source </summary>
prefix table computed from the query pattern
<summary> A collection of Knuth-Morris-Pratt string matching algorithms </summary>
<summary> creates a prefix table for a query pattern </summary>
<summary> finds all matches of a query pattern in a source using a given prefix table </summary>
<summary> finds the first match of a query pattern in a source using a given prefix table </summary>
<summary> finds the first match of a query pattern in a source starting from a specific position in the source using a given prefix table </summary>
<summary> A collection of Rabin-Karp string matching algorithms and hash functions </summary>
<summary> A collection of Rabin-Karp string matching algorithms using the built-in hash function of f# </summary>
source converted to char array
<summary> Marker interface for BioItem base. </summary>
query pattern converted to char array
<summary> A collection of Rabin-Karp string matching algorithms using the cyclic polynomial (CP) hash </summary>
<summary> find all matches of a query pattern in a source </summary>
<summary> find the first match of a query pattern in a source </summary>
<summary> find the first match of a query pattern in a source starting from a specific position in the source </summary>
RabinKarp.CP.findAll built by functional composition
<summary> takes an updateHash and blockHash function to find all matches of a query pattern in a source </summary>
<summary> updates an existing hashvalue </summary>
<summary>Gets the total number of elements in all the dimensions of the <see cref="T:System.Array" />.</summary>
<exception cref="T:System.OverflowException">The array is multidimensional and contains more than <see cref="F:System.Int32.MaxValue">Int32.MaxValue</see> elements.</exception>
<returns>The total number of elements in all the dimensions of the <see cref="T:System.Array" />; zero if there are no elements in the array.</returns>
<summary> hashes a pattern </summary>