String matching algorithms |
📂View module documentation |
String matching algorithms are concerned with finding a single or multiple matches of a query pattern within a source. The sub-modules of the BioFSharp.Algorithm.StringMatching
module are organized as following:
AlgorithmName.findAll : returns the (zero-based) starting position of all matches of the query pattern in the source as an integer list
AlgorithmName.find : returns the (zero-based) starting position of the first match of the query pattern in the source as an integer
AlgorithmName.findFrom : returns the (zero-based) starting position of the first match of the query pattern in the source as an integer, starting from a specified (zero-based) position in the source
All algorithms contained in this library are implemented as generic as possible regarding the data type that the input source and query pattern can contain, being only restricted by implementing the IEquatable
interface, as there must be a way to assure that two equal elements are indeed equal.
However, the nature of the specific algorithm may impose additional restrictions, for example concerning the size of a single comparable item.
The type of query pattern and source has been constrained to be an array<'a>
because many index accessions are made during the search.
Runtimes are provided using Bachmann-Landau notations:
- n : length of the source
- m : length of the query pattern
- k : size of the Alphabet
The following query pattern and source will be used in all further examples demonstrating how to use the specific module.
1: 2: 3: 4: 5: |
|
The Naive Algorithm
The naive approach to the string matching problem is walking through the source starting from the beginning and checking at each position if the resulting substring equals the query pattern. While being inefficient, it may be beneficial to use it in cases where the speed advantage of another algorithm is neglegible or does not outhweigh the additional setup needed (for example if your source and query pattern are really short)
Usage
No additional setup needed. Just use query pattern and source as input parameters for the respective function
1: 2: 3: 4: 5: 6: 7: 8: |
|
Runtime
Preprocessing: None
Matching time: O(nm)
Knuth Morris Pratt
The KMP algorithm makes use of previous match information to determine an amount of skips that can be made until the next position in the source gets examined as a possible match. To achieve that, a prefix table (or failure function) of the pattern needs to be computed, which determines the amount of skippable elements depending on the previous (partial) match.
Usage
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: |
|
Runtime
Preprocessing: Θ(m)
Matching time: average: Θ(n)
Worst case(if the prefix table results in no possible skips): Θ(mn)
Rabin Karp
The RK Algorithm saves time by not comparing every single element of pattern and a substring of the same length from the source. Instead, it compares hashed versions of them. The RabinKarp module contains two submodules:
RKStandard: uses the build in .NET hash function
hash
-
CP: uses a cyclic polynomial hash, which is way faster than the built in hash function. Elements must be castable to
uint64
, as they get hashed as a 64 bit unsigned integer and updating the hash uses bitwise operators. Using large query patterns is problematic, as the combined hash may exceed the size of anuint64
.
Usage
RKStandard
1: 2: 3: 4: 5: 6: 7: 8: |
|
CP
As the AminoAcid type cannot be casted to an uint64
, a conversion to their one character code is needed
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: |
|
Using your own hash functions
The RabinKarp
module provides generic findAll
, find
and findFrom
functions that take the following additional parameters:
blockHash
: a function that hashes an entire array (used to hash the pattern and the first substring of the source)updateHash
: a function that removes an element from an existing hash and adds a new one
Just use functional composition to build your own findX functions. In fact the two provided algorithms are built the same way:
1: 2: 3: |
|
Runtime
- Preprocessing: Θ(m)
- Matching time: average Θ(n + m)
- Worst case: Θ((n−m)m)
Boyer Moore
coming soon...
Speed Comparison
Coming soon...
from BioFSharp
from BioFSharp
from BioFSharp
from BioFSharp.Algorithm
source containing the query pattern at positions 0,10 and 17
from BioFSharp
query pattern to find in the source
from BioFSharp.Algorithm.StringMatching
prefix table computed from the query pattern
from BioFSharp.Algorithm.StringMatching
from BioFSharp.Algorithm.StringMatching
from BioFSharp.Algorithm.StringMatching.RabinKarp
source converted to char array
member Clone : unit -> obj
member CopyTo : array:Array * index:int -> unit + 1 overload
member GetEnumerator : unit -> IEnumerator
member GetLength : dimension:int -> int
member GetLongLength : dimension:int -> int64
member GetLowerBound : dimension:int -> int
member GetUpperBound : dimension:int -> int
member GetValue : [<ParamArray>] indices:int[] -> obj + 7 overloads
member Initialize : unit -> unit
member IsFixedSize : bool
...
interface
abstract member Formula : Formula
abstract member Name : string
abstract member Symbol : char
abstract member isGap : bool
abstract member isTerminator : bool
end
query pattern converted to char array
from BioFSharp.Algorithm.StringMatching.RabinKarp
RabinKarp.CP.findAll built by functional composition