The pattern matching problem is to find all occurrences of a given pattern in an input text. We consider the case where the pattern is a stochastic regular language, in which each pattern string has its own probability. The goal is to find all matches, reported as (start, end) index pairs in the text, whose probability exceeds a given threshold. Pattern matching is frequently applied to streaming data, where finding the start index of a match is often very challenging. We design an efficient algorithm for the stochastic pattern matching problem over streaming data. The algorithm is based on transforming the pattern, given as a probabilistic finite automaton (PFA), into a weighted automaton, and on a constant bound on the number of backtracks required to find a start index while reading the streaming input. We also employ heuristics that reduce the number of backtracks, which improves the practical runtime of our algorithm. We establish a tight theoretical runtime bound for the proposed algorithm and experimentally demonstrate its practical performance. Finally, we show a possible application of our algorithm to another stochastic pattern matching problem, in which we search for the maximum-probability substring of a text that is a superstring of a specified string.
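To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it omits the weighted-automaton transformation and the bounded backtracking, and simply restarts the PFA at every text position. The function name `match_above_threshold`, the dictionary encoding of the PFA, and the toy automaton for the language a+b are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy PFA (an assumption for illustration) for strings a...ab:
# P("ab") = 0.5, P("aab") = 0.25, P("aaab") = 0.125, and so on.
INITIAL = {0: 1.0}                       # state -> initial probability
TRANSITIONS = {                          # (state, symbol) -> [(target, prob)]
    (0, "a"): [(1, 1.0)],
    (1, "a"): [(1, 0.5)],
    (1, "b"): [(2, 0.5)],
}
FINAL = {2: 1.0}                         # state -> final (stopping) probability


def match_above_threshold(text, initial, transitions, final, threshold):
    """Return (start, end, prob) for every substring text[start:end]
    whose probability under the PFA is strictly larger than threshold."""
    matches = []
    for start in range(len(text)):
        weights = dict(initial)          # probability mass per current state
        for end in range(start, len(text)):
            nxt = defaultdict(float)
            for state, w in weights.items():
                for target, p in transitions.get((state, text[end]), ()):
                    nxt[target] += w * p
            if not nxt:                  # no run survives this symbol
                break
            weights = nxt
            prob = sum(w * final.get(s, 0.0) for s, w in weights.items())
            if prob > threshold:
                matches.append((start, end + 1, prob))
            # The surviving mass never grows, so it bounds every longer match.
            if sum(weights.values()) <= threshold:
                break
    return matches


print(match_above_threshold("xaabab", INITIAL, TRANSITIONS, FINAL, 0.2))
# [(1, 4, 0.25), (2, 4, 0.5), (4, 6, 0.5)]
```

This sketch rescans the text from each start position, which is quadratic in the worst case and unsuitable for a true stream. It is meant only to pin down what a reported match and its probability are.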
|Title of host publication||Implementation and Application of Automata - 23rd International Conference, CIAA 2018, Proceedings|
|Number of pages||12|
|Publication status||Published - 2018|
|Event||23rd International Conference on Implementation and Application of Automata, CIAA 2018 - Charlottetown, Canada (Duration: 2018 Jul 30 → 2018 Aug 2)|
|Name||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
Bibliographical note
Publisher Copyright: © 2018, Springer International Publishing AG, part of Springer Nature.
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Computer Science (all)