GeneralizedSuffixArray Type

Given a text string, the substring starting at position i and until the end of text is called the suffix starting at position i. For a text string with N characters there are N suffixes. This class gives an efficient algorithm to sort all N suffixes lexicographically, and to determine the length of the longest common prefix of each suffix with its lexicographical predecessor. This is generally done in O(N) time.

First of all, we need an O(N) algorithm to calculate the suffix array in O(N) time. This algorithm is implemented according to J. Kärkkäinen and P. Sanders: "Simple linear work suffix array construction.", Proc. 30th International Colloquium on Automata, Languages and Programming, volume 2719 of Lecture Notes in Computer Science, pages 943-955, Berlin, 2003, Springer-Verlag."

Secondly, an O(N) algorithm is neccessary to calculated the length of longest common prefix of two adjacent suffixes. This is implemented according to T. Kasai, G. Lee, H. Arimura, S. Arikawa1, and K. Park: "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications", Proc. 12th Annual Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 2089, pp. 181–192. Springer, Berlin (2001).

See also the very nice paper by Michael Arnold and Enno Ohlebusch, 'Linear Time Algorithms for Generalizations of the Longest Common Substring Problem', Algorithmica (2011) 60; 806-818; DOI: 10.1007/s00453-009-9369-1. The code here was partially adopted from the C++ sources from the web site of the authors at http://www.uni-ulm.de/in/theo/research/sequana.html.

Constructors

Constructor Description

GeneralizedSuffixArray(textWithPadding, textLength, numberOfWords, wordStartPositions, alphabetSize)

Full Usage: GeneralizedSuffixArray(textWithPadding, textLength, numberOfWords, wordStartPositions, alphabetSize)

Parameters:
    textWithPadding : int[] - The text. The character of this texts are integers. Each unique element of the original text (or list of objects) is mapped to an unique integer value, while preserving the lexicographical order of the original characters in the integer values.
    textLength : int - Number of characters of the text. This usually is smaller than the length of text array.
    numberOfWords : int - The number of words in the original text, if the original text was separated into words. Otherwise, this parameter should be 1.
    wordStartPositions : int[] - The start positions of the words in the text array, if the original text was separated into words. Otherwise, this parameter is ignored.
    alphabetSize : int - Size of the alphabet. This is the number of unique characters (or objects) of the original text. If the text was separated into words, the numberOfWords is added, since each separator is a unique character.

Constructs an instance of the GeneralizedSuffixArray class.

textWithPadding : int[]

The text. The character of this texts are integers. Each unique element of the original text (or list of objects) is mapped to an unique integer value, while preserving the lexicographical order of the original characters in the integer values.

textLength : int

Number of characters of the text. This usually is smaller than the length of text array.

numberOfWords : int

The number of words in the original text, if the original text was separated into words. Otherwise, this parameter should be 1.

wordStartPositions : int[]

The start positions of the words in the text array, if the original text was separated into words. Otherwise, this parameter is ignored.

alphabetSize : int

Size of the alphabet. This is the number of unique characters (or objects) of the original text. If the text was separated into words, the numberOfWords is added, since each separator is a unique character.

GeneralizedSuffixArray(integerText)

Full Usage: GeneralizedSuffixArray(integerText)

Parameters:

Constructs a new instance of the GeneralizedSuffixArray class from IntegerText.

integerText : IntegerText

The integer text data.

Instance members

Instance member Description

this.InverseSuffixArray

Full Usage: this.InverseSuffixArray

Returns: int[]

Maps the suffix that starts at position i in the text to the lexicographical order position of this suffix, which is the value of the i-th element of this array.

Returns: int[]

this.LCPArray

Full Usage: this.LCPArray

Returns: int[]

Stores the length of the Longest Common Prefix of the lexicographically i-th suffix and its lexicographical predecessor (the lexicographically (i-1)-th suffix). The element at index 0 is always 0.

Returns: int[]

this.LCPSArray

Full Usage: this.LCPSArray

Returns: int[]

Returns: int[]

this.MaximumLcp

Full Usage: this.MaximumLcp

Returns: int

Maximum of all values in the GeneralizedSuffixArray._LCP array.

Returns: int

this.NumberOfWords

Full Usage: this.NumberOfWords

Returns: int

Number of words, if the text was separated into individual words. Otherwise, this field is equal to one.

Returns: int

this.SuffixArray

Full Usage: this.SuffixArray

Returns: int[]

Maps the lexicographical order position i of a suffix to the starting position of the suffix in the text, which is the value of the i-th element of this array.

Returns: int[]

this.WordIndices

Full Usage: this.WordIndices

Returns: int[]

Maps the lexicographical order position i of a suffix to the index of the word, in which this suffix starts. This means, that for instance the value of the i-th element contains the index of the word, in which the lexicographically i-th suffix that starts at position GeneralizedSuffixArray._suffixArray

Returns: int[]

this.WordStartPositions

Full Usage: this.WordStartPositions

Returns: int[]

Start positions of the words in which the original text was separated in the array GeneralizedSuffixArray._text.

Returns: int[]

Static members

Static member Description

GeneralizedSuffixArray.FromSeparateWords(words, useSortedMapping)

Full Usage: GeneralizedSuffixArray.FromSeparateWords(words, useSortedMapping)

Parameters:
    words : IEnumerable<string> - The list of 'words'.
    useSortedMapping : bool - If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.

Returns: GeneralizedSuffixArray The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

Constructs the generalized suffix array from separate 'words'. Each list in the parameter words is treated as 'word'. The elements are treated as characters. The elements are converted to an integer alphabet by means of the IntegerText class.

words : IEnumerable<string>

The list of 'words'.

useSortedMapping : bool

If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.

Returns: GeneralizedSuffixArray

The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

GeneralizedSuffixArray.FromSeparateWords(words, useSortedMapping)

Full Usage: GeneralizedSuffixArray.FromSeparateWords(words, useSortedMapping)

Parameters:
    words : IEnumerable<IEnumerable<'T>> - The list of 'words'.
    useSortedMapping : bool - If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.

Returns: GeneralizedSuffixArray The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

Constructs the generalized suffix array from separate 'words'. Each list in the parameter words is treated as 'word'. The elements are treated as characters. The elements are converted to an integer alphabet by means of the IntegerText class.

words : IEnumerable<IEnumerable<'T>>

The list of 'words'.

useSortedMapping : bool

If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.

Returns: GeneralizedSuffixArray

The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

GeneralizedSuffixArray.FromWords(words, withSeparators, useSortedMapping, customSortingComparer)

Full Usage: GeneralizedSuffixArray.FromWords(words, withSeparators, useSortedMapping, customSortingComparer)

Parameters:
    words : IEnumerable<string> - The list of 'words'.
    withSeparators : bool - If set to true, the words were treated as individual words, and will be separated by special separator elements. If set to false, the converted text will contain the concenated 'words' without separator elements. If set to false, they will be concenated to form one single word.
    useSortedMapping : bool - If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.
    customSortingComparer : IComparer<char> - If useSortedMapping is true, you can here provide a custom comparer for the elements of type T. Otherwise, if you want to use the default comparer, leave this parameter null.

Returns: GeneralizedSuffixArray The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

Constructs the generalized suffix array from separate 'words'. Each list in the parameter words is treated as 'word'. The elements are treated as characters. The elements are converted to an integer alphabet by means of the IntegerText class.

words : IEnumerable<string>

The list of 'words'.

withSeparators : bool

If set to true, the words were treated as individual words, and will be separated by special separator elements. If set to false, the converted text will contain the concenated 'words' without separator elements. If set to false, they will be concenated to form one single word.

useSortedMapping : bool

If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.

customSortingComparer : IComparer<char>

If useSortedMapping is true, you can here provide a custom comparer for the elements of type T. Otherwise, if you want to use the default comparer, leave this parameter null.

Returns: GeneralizedSuffixArray

The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

GeneralizedSuffixArray.FromWords(words, withSeparators, useSortedMapping, customSortingComparer)

Full Usage: GeneralizedSuffixArray.FromWords(words, withSeparators, useSortedMapping, customSortingComparer)

Parameters:
    words : IEnumerable<IEnumerable<'T>> - The list of 'words'.
    withSeparators : bool - If set to true, the words were treated as individual words, and will be separated by special separator elements. If set to false, the converted text will contain the concenated 'words' without separator elements. If set to false, they will be concenated to form one single word.
    useSortedMapping : bool - If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.
    customSortingComparer : IComparer<'T> - If useSortedMapping is true, you can here provide a custom comparer for the elements of type T. Otherwise, if you want to use the default comparer, leave this parameter null.

Returns: GeneralizedSuffixArray The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.

Constructs the generalized suffix array from separate 'words'. Each list in the parameter words is treated as 'word'. The elements are treated as characters. The elements are converted to an integer alphabet by means of the IntegerText class.

words : IEnumerable<IEnumerable<'T>>

The list of 'words'.

withSeparators : bool

If set to true, the words were treated as individual words, and will be separated by special separator elements. If set to false, the converted text will contain the concenated 'words' without separator elements. If set to false, they will be concenated to form one single word.

useSortedMapping : bool

If this parameter is true, a sorted mapping of the elements T to integers will be used. The type T then has to implement IComparable. If this parameter is false, a unsorted HashSet will be used to make a unique mapping of the elements to integers.

customSortingComparer : IComparer<'T>

If useSortedMapping is true, you can here provide a custom comparer for the elements of type T. Otherwise, if you want to use the default comparer, leave this parameter null.

Returns: GeneralizedSuffixArray

The generalized suffix array. Since each list in words is treated as separate word, the generalized suffix array is prepared to search for the longest common substring in these words.