String matching is the problem of finding all occurrences of a character pattern in a text. A general query entails finding the locations of multiple terms with “DON'T CARE” symbols in a text. Such queries are in common use in libraries, medical, and legal information services. We introduce special heuristics to the Knuth-Morris-Pratt algorithm to reduce the time and space required to perform the string matching. We compare our hardware-based approach to the software approaches embodied in the UNIX1 System grep and fgrep commands. Our simulation results show that the hardware approach can provide a 25-500 fold performance improvement depending on the complexity of the query and that it is fast enough even in the presence of variable length DON’T CARES to keep up with a 20 million character/second disk. Our approach compares favorably to other hardware designs in speed and space. The proposed hardware implementation requires 10 kbytes of one cycle static memory, 28 single character comparators, four 16 bit adders, and control logic for four finite state machines and a term matcher controller. After that, additional hardware produces negligible performance improvements for queries with up to 80 terms, about half of which have variable length DON’T CARES.
ASJC Scopus subject areas
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics