Problem Description:
“Design a program to gather statistics of all strings P of length k in a string S of length n.”
The problem requires to design a program that will deliver useful statistics – count numbers and positions for a set of substrings in P having length k, stored in a string S of length n. The string S can be a literature book, an encyclopedia, or a data set obtained from one million users of a social network. String S should be composed of a finite set of symbols called Alphabet. For instance, The alphabet for DNA code is {A,C,T,G}. For RNA it is {A,G,C,U}. The alphabet for dictionary searching or bibliographical searching contains all the letters in English language. Depending on the application, the alphabet may contain small and capital letter both for case-sensitive searching. It can also include special characters and punctuation marks used in natural languages. A fully developed version of the program may contain a graphical user interface to communicate with the user. It may prompt user for searching a specific set of substrings P and allow selecting a source containing the string S. The allowed length or allowed value for k is limited from 1 to 4 for testing this program. The testing program aims for efficiency in searching short substrings in a large string and multiple values of k in a set of substrings.
There are a number of algorithms that can perform the search required for this particular problem and they vary in performance and complexity. Complexity can affect the performance of an algorithm. A suitable algorithm is needed that can locate all the matched substrings and gives correct answer for at least non overlapping short substrings. As the problem domain (i. e. alphabet, string length) gets bigger and requires greater resource, as they do in real life systems, the program should be scalable accordingly to give results. Complexity of the algorithm which is chosen to design this program is vital. The overall performance and complexity of the algorithm are two of the most concerning issues related to substring searching.
Designing the program requires clear understanding of Object Oriented principles and UML notations. To understand how to design the objects, successful implementation of selected algorithm is necessary.
Communicating the design is one of the most important tasks of the assignment. Primary effort should be invested in describing the design in a way that is understandable, coherent, relevant to the problem domain, and worthy of being regarded as a useful blueprint that can be implemented by programmers.
Design Description:
Aho-Corasick algorithm has reputation of satisfying qualities of being an ideal string matching algorithm since its inception in 1977. We can conclude that it is arguably one of the most efficient string matching algorithms in terms of both complexity and performance. As pointed out in the paper entitled “String Matching: An Aid to Bibliographic Search” by Alfred V. Aho and Margaret J. Corasick, their proposed algorithm is efficient when we are trying to gather statistics of substring set P that has a large number of occurrences within String S. [1]
To implement the algorithm, it is a prerequisite to understand the algorithm’s data structures. We will decompose the algorithm in several parts and analyze each of them so that the searching technique as a single unit can be understood. The proposed program uses as creator a Finite State Pattern Matching Machine; it is constructed from the set of substrings, or patterns, or keywords (the terms can be used interchangeably, in this document substring is used) that we are searching for. It is a data structure that can be regarded as a variant of Trie. The Machine takes as input the String (or Text) and locates all substrings in S that are also a member of the set of substrings, P.
The algorithm has three main steps or functions –
- goto (g): When a node matches, it acts as transition to next node.
- fail (f): This is used when a character is not matching.
- Output: When a match is found with a substring.
Figure 1: The three functions of Pattern Matching Algorithm
The functions g, f, and output are valid for a set of substrings if a substring p ends at position i of a string S and S=upv where the length of up is i where u is the suffix and v is the prefix of the string. To construct a goto function a goto graph is prepared from the set of substrings. Fig. 1(a) depicts such a goto graph. It starts with a vertex that represents state 0. Then each substring is added to the graph, taking one character at a time by adding directed paths from last vertex. New vertices and edges are added in a way so that a collection of nodes and edges spells out a substring. Starting at state 0, the algorithm will emit the state where the desired substring is matched following the path. For example, let us consider that in our test program the collection of substrings is {he,she,his, and hers}. We begin with state 0 and make a path to spell “he”, as depicted in figure 2(a). We see that starting from state 0 to state 2 we obtain “He”, as a result “He” is associated with state 2. The output function will return 2 when the machine finds “He”. Similarly, it will return 5,7,9 for “She”, “His” and “Hers” respectively. We complete the goto function by adding a loop from state 0 to state 0 to the Figure 2(d). Figure 1(a) represents a completed goto function for the set of substrings.
Figure 2: goto function construction steps (a) “He” (b) “She” (c) “His” (d) “Hers”.
Now let us examine what the algorithm does when it searches the string “ushers” and matches substrings he, his, her, and hers. First it is in state 0, it inputs ‘u’ which is not found in the finite state machine/goto function, so it stays in state 0. Next s comes in and it locates s in g(0,s)=3. So 3 is printed. The current state is 3 at the particular moment. Then it finds h from this position because g(3,h)=4 and so, 4 is printed. Current state is 4. Similarly, it locates next input from the string S, e, in state 5. At this stage, it cannot find next input r and returns fail=g(5,r). Output function is not called in this situation. Instead, the machine generates failure function f(5)=2 and enters state 2. This is called a failure transition. It is a standout feature which eliminates the need for starting the search from the beginning of S again.
So far it is clear how the goto function and output function is working. To emphasize on understanding how exactly the fail function is working further elaboration is necessary.
The fail function is constructed from the goto function. As part of the construction, a decomposition of the goto graph is needed in terms of Depth. At state 0, the beginning of the finite state pattern matching machine, the depth is 0. For every state in-depth d, the fail function is calculated for depths less than d. The function is calculated for d=1, 2, 3, and so on. At d=1, failure function f(1)=0. The fail function for depth 0 is undefined. In the “ushers” example, there is no failure until in state 5 (d=3). the fail function is calculated for all states of depth 2 (d-1=3-1=2). At depth 2 it locates next matching input symbol, generates f(5)=2, and enters state 2. If the function had not found at depth 2 it would scan depth 1 and upon match would generate a corresponding state to enter. Following the notation f(s)=s`, where s=current state and s`=transition state, let us consider the construction of fail function from the goto function in figure 1(a).
At first for d=0, f(0)=undefined. For d=1, f(1)=f(3)=0, because 0 is the state of the depth previous (d-1) to current depth 2. Likewise, at d=2, for computing f(2) we set f(1)=0 and since g(0,e)=0, we set f(2)=0. For computing f(4), we see f(3)=0, and since g(0,h)=1, we set f(4)=1. Likewise, we set f(6)=0, because the previous state to 6 is 1, f(1)=0, from state 1 to state 6 the edge gives us “i”. At the previous state’s goto function g(0,i) returns 0. So f(6) is 0. This means we are always tracking connecting state in previous depth to get a matching input symbol until we reach depth 0. Moreover, when fail functions are constructed; outputs are also merged with states. The fail function f(5) points us to state 2. At state 2 we already have the output “He”, which is merged with f(5) to have the output {he, she}; as depicted in figure 1(c).
Another important feature of fail function is to avoid making unnecessary transitions. From figure 1(a) we see that g(4,e)=5. If it was not 5, we would calculate f(3)=0 and g(0,h)=1 and thus f(4)=1. But if the machine already knew next symbol is not e, it could skip transiting to state 1 to calculate its goto function. For figure 1(a), it can input the next symbol from state 1 then, if it was i then it would transit to state 6. But if “his” was not a substring in the set, then it could skip going to state 1 in the first place and to state 0 directly. To avoid these occasional unnecessary transitions, a generalization of next function introduced by Knuth-Morris-Pratt algorithm is applied to this algorithm. It ensures that there is exactly one failure transition per input symbol.
The program may be composed of three primary classes or more depending on the degree of its features and program framework. The aim is to devise a program to obtain location and counts of all matching substrings from the input string, S. The algorithm provides us with the end location of a matching substring. Below is the pseudo code for the main components of the program –
/*
* Construction of the Deterministic Finite Automaton (DFA) 14 // Acts as Creator
*
* INPUT: d - the dictionary of words.
* OUTPUT: The DFA which accepts all the words in the dictionary d.
*/
DFA dfaConstruction(Dictionary d) { // d contains list of substrings
DFA dfa = new DFA(); // DFA contains all states, transitions
String w;
State state, nextState;
state = dfa.newState();
dfa.setStartState(state);
while ((w = d.remove()) ! = null) { // removes a word from dictionary
state = dfa.getStartState();
for (int i = 0; i < w.length(); i++) {
nextState = dfa.getTransition(state, w.charAt(i)) // selects next state.
if (! nextState.isValid()) { // true if next state is vald, false otherwise
nextState = dfa.newState(); // creates new state in DFA
dfa.addTransition(state s, w.charAt(i), nextState);} //starting at state s, char at i, to nextState
state = nextState;}
dfa.addAcceptState(state);}
return dfa;
}
/*
* Construction of Failure Function 14*
* INPUT: dfa - the DFA to construct the failure function for.
* OUTPUT: The failure function.
*/
FailureFunction failureConstruction(DFA dfa) {
FailureFunction f = new FailureFunction(); // Failure function class
Queue q = new Queue(); // FIFO queue class
State state, nextState, s;
char ch;
q.add(dfa.getStartState());
f.setFailure(dfa.getStartState(), null);
while (! q.isEmpty()) { // returns true if queue is empty
state = q.remove(); // removes a state from the front of a queue
for (i = 0; i < dfa.getAlphabetLength(); i++) {
ch = dfa.getAlphabetSymbol(i);
nextState = dfa.getTransition(state, ch);
if (nextState.isValid()) {
s = f.getFailure(state); // gets next move is state fails
while ((s ! = null) && ! dfa.getTransition(s,ch).isValid()) {
s = f.getFailure(s);
}
if (s ! = null) {
f.setFailure(nextState, dfa.getTransition(s,ch));
} else {
f.setFailure(nextState, dfa.getStartState()); // sets dfa.getStartStare if nextState fails
}
if (f.getFailure(nextState).isAcceptState()) { // returns true if nextState is accepted
dfa.addAcceptState(nextState);
}
q.add(nextState); // adds a state to end of a queue
} } }
/*
* Aho-Corasick Search Algorithm 14
*
* INPUT: dict - dictionary of words to search for.
* doc - the document to search.
* OUTPUT: The results found*/
Results ahoCorasickSearch(Dictionary dict, Document doc) { // Document: a text file class
DFA dfa = dfaConstruction(dict);
FailureFunction f = failureConstruction(dfa);
State state, nextState;
char ch;
Results r = new Results(); // Results class contains all the results
state = dfa.getStartState();
while (! doc.eof()) { // doc.eof true if end of file is reached
ch = doc.getChar(); // get next character from doc
nextState = dfa.getTransition(state,ch);
if (! nextState.isValid()) {
nextState = f.getFailure(state);
while ((nextState ! = null) && ! dfa.getTransition(nextState,ch).isValid()) {
nextState = f.getFailure(nextState);
}
if (nextState ! = null) {
nextState = dfa.getTransition(s,ch);
} else {
nextState = dfa.getStartState();
}
}
if (nextState.isAcceptState()) {
r.add(doc.getPosition()); // returns current position in doc
}
state = nextState;
}
return r;
}
This program is implemented for calculating the suffix location, not the prefix. The prefix can be determined by changing the failure function. We can also calculate counts by making simple alteration to the algorithm, such as, adding a method that would take as arguments the current substring and string. For example:-
int countSubstring(String subStr, String str) // Java &
std::cout << countSubstring("the three truths", "th") // C++ 10
- Major Decisions:
The selection of algorithm is a major decision to make before preparing the design document. Upon analyzing a number of candidate algorithm, decided that the best options are the Aho-Corasick algorithm and the Rabin-Karp algorithm to solve this particular problem – both of which, should be mentioned, are inferior to single pattern searching algorithms such as Knuth–Morris–Pratt algorithm and Boyer–Moore string search algorithm. A comparison between popular multiple string matching algorithms is displayed in the table 1. 11
Aho-Corasick is an algorithm of choice for multiple pattern searching because of its almost sheer speed and efficiency. It is preferred even if there is a large number of substring mismatches and also if there are varying lengths for the substrings.
We assume all the substrings have a fixed length k. To search for m number of substrings is for a single pattern search taking O(n) time for each substrings, the total time complexity is O(nm). In comparison, Aho-Corasick algorithm can find m substrings in O(n+m) time. When we are working with small alphabet size, small number of substrings or string length the comparison is negligible but for a greater size of the problem domain, it draws comparison important enough to notice. Different substring search algorithms are appropriate in different applications depending on the purpose of the searching, software architecture being used and so on. The attractiveness of Aho-Corasick lies in its asymptotic runtime performance: O(n). It does not matter how big the substring is or how many times they occur, the performance of this algorithm will always decline linearly.
4. Evaluation:
The algorithm has been implemented in C++ 12 and Java 13 for the substring set: {he,she,his,hers}. C++ program is built-in conventional procedural method; it uses three functions that are similar to the pseudo codes in the previous section. It returns the starting and ending position of substrings for S=”ahishershis”.
Test run using maven built Java project shows that construction of the pattern matching machine takes 59ms and it can read as the input String S “A Ghost Story of Christmas” by Charles Dickens (177.8 Kb text file that contains approximately 28,944 words) in between 427ms and 437ms. The program can output suffix positions for S=”ushers”.
If we have 100 million substrings with average length of 10 characters (10 bytes) and additional data structure use of 4 bytes assigned to each substring, then we will need an array of 100+4=104 million bytes or 104Mb to store the list of substrings. From 100 million substrings 10 bytes each we can build the trie with no more than 1000 million nodes. For each node of the trie, it should at least keep 1 byte for a label (a letter) and 4 bytes (a word) for a pointer to next node, and also 1 bit for a boolean to mark a node; in total occupying 5 bytes. As a result, the minimum size of a trie mapping 100 million substrings with 10 characters each; we need minimum 5000 million bytes or 5 Gb of memory. With this estimation of memory allocation, we will need 7 Gb memory. This amount of memory is moderately available in today’s personal computers. However, if we need to examine very large string in bioinformatics research, the requirements can be 3-5 times more. We can conclude that it is possible to analyze very large strings by using this algorithm, if the users have state-of-the-art hardware system and the coding is efficient.
According to search results, there have been numerous efforts to improve Aho-Corasick algorithm, such as by using multicore processors, using threading to concurrently process multiple sets of strings, rearranging states etc. This algorithm requires great focus on memory management because the trie resides in memory. We should also give importance to the time the program takes to make each transition. For large S or P this can be crucial because if it is not constant then the performance can be hampered greatly. A graphical user interface can be implemented where the user can enter multiple substrings for searching and also upload the source document that is to be searched. The GUI should have high cohesion with underlying objects processing the results.
- References:
- Efficient String Matching: An Aid to Bibliographic Search by Alfred V. Aho and Margaret J. Corasick
- Exact String Matching Algorithms
- Suffix Tree Searcher: Exploration of Common Substrings in Large DNA Sequence Sets
- General Purpose Implementation of Advanced Algorithms
- An implementation of Aho-Corasick Automata for Java
- Comparison Between Different Substring Search Algorithms
- High-performance Pattern Matching in Java
- Data Structures, Algorithms, & Applications in Java Suffix Tree
- Java Implementation of the Aho-Corasick Algorithm for Efficient String Matching
- Rosetta Code – Count Occurrences of a Substring
- Multiple Pattern String Matching Methodologies: A Comparative Analysis
- An Implementation of Aho-Corasick Algorithm in C++
- Source Code Distribution for an Implementation of the Aho-Corasick Automaton
- Dictionary Matching Automata : The Aho Corasick Algorithm
Leave a Reply