We present techniques to parallelize membership tests for Deterministic Finite Automata (DFAs). Our method searches arbitrary regular expressions by matching multiple bytes in parallel using speculation. We partition the input string into chunks, match the chunks in parallel, and combine the matching results. Our parallel matching algorithm exploits structural DFA properties to minimize the speculative overhead. Unlike previous approaches, our speculation is failure-free, i.e., (1) sequential semantics are maintained, and (2) speed-downs are avoided altogether. On architectures with a SIMD gather operation for indexed memory loads, our matching operation is fully vectorized. The proposed load-balancing scheme uses an off-line profiling step to determine the matching capacity of each participating processor. Based on these matching capacities, DFA matches are load-balanced on inhomogeneous parallel architectures such as cloud computing environments. We evaluated our speculative DFA membership test for a representative set of benchmarks from the Perl-compatible Regular Expression (PCRE) library and the PROSITE protein database. Evaluation was conducted on a 4-CPU (40-core) shared-memory node of the Intel Academic Program Manycore Testing Lab (Intel MTL), on the Intel AVX2 SDE simulator for 8-way fully vectorized SIMD execution, and on a 20-node (288-core) cluster on the Amazon EC2 computing cloud. Obtained speedups are on the order of O(1 + (|P| − 1)/(|Q| · γ)), where |P| denotes the number of processors or SIMD units, |Q| denotes the number of DFA states, and 0 < γ ≤ 1 represents a statically computed DFA property. For all observed cases, we found that γ lies between 0.02 and 0.47. On a 40-core MTL node, actual speedups range from 2.3× to 38.× for up to 512 DFA states for PCRE, and from 1.3× to 19.9× for up to 1,288 DFA states for PROSITE. Speedups on the EC2 computing cloud range from 5.0× to 65.8× for PCRE, and from 5.0× to 138.5× for PROSITE.
Speedups of our C-based DFA matcher over the Perl-based ScanProsite scan tool range from 559.3× to 15,079.7× on a 40-core MTL node. We show the scalability of our approach for input sizes of up to 10 GB. © 2013 Springer Science+Business Media New York.
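The partition, match, and combine steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the transition-table layout (`delta[state][symbol]`), and the equal-sized chunking are assumptions made for illustration. It speculates naively over all |Q| start states for every chunk but the first, and it does not attempt the structural reduction of the speculative set, the SIMD gather vectorization, or the capacity-based load balancing. Python threads are used only to show the structure; they are not expected to yield a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def match_chunk(delta, chunk, start_states):
    """Run the DFA over `chunk` once per candidate start state.

    Returns a dict mapping each candidate start state to the state
    reached after consuming the whole chunk.
    """
    reached = {}
    for q in start_states:
        state = q
        for symbol in chunk:
            state = delta[state][symbol]
        reached[q] = state
    return reached


def sequential_match(delta, start, accepting, data):
    """Reference membership test: a single sequential pass."""
    state = start
    for symbol in data:
        state = delta[state][symbol]
    return state in accepting


def speculative_match(delta, start, accepting, data, num_chunks):
    """Chunked membership test with naive speculation (illustrative only)."""
    n = len(data)
    # Assumption: equal-sized chunks; capacity-aware sizing is omitted.
    bounds = [n * i // num_chunks for i in range(num_chunks + 1)]
    chunks = [data[bounds[i]:bounds[i + 1]] for i in range(num_chunks)]
    all_states = range(len(delta))

    with ThreadPoolExecutor() as pool:
        # The first chunk has a known start state, so no speculation is needed.
        futures = [pool.submit(match_chunk, delta, chunks[0], [start])]
        # Every other chunk is matched for each possible start state.
        futures += [pool.submit(match_chunk, delta, c, all_states)
                    for c in chunks[1:]]
        maps = [f.result() for f in futures]

    # Combine: thread the actual state through the per-chunk mappings.
    state = start
    for mapping in maps:
        state = mapping[state]
    return state in accepting


# Hypothetical example DFA over {a, b}: accepts strings with an even number of 'a'.
EVEN_A = [{"a": 1, "b": 0}, {"a": 0, "b": 1}]
```

Because each chunk's result is a mapping from every candidate start state to an end state, the combine step always recovers the sequential result, which is the sense in which such speculation cannot fail. The cost is the extra work per chunk, which is what the |Q| · γ term in the speedup bound accounts for.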
Bibliographical note
Funding Information:
Acknowledgments: This research was partially supported by the National Research Foundation of Korea (NRF) grants funded by the Korean government (MEST) (Grant Nos. 2010-0005234, 2012R1A1A2044562 and 2012K2A1A9054713), through the Global Ph.D. Fellowship Program 2011 of the NRF (Grant No. 2010-0008582), and by the Intel Academic Program Manycore Testing Lab.
All Science Journal Classification (ASJC) codes
- Theoretical Computer Science
- Information Systems