The thick blue line in the performance graphs (not reproduced here) is the C implementation of Thompson's algorithm given above. Some might argue that this test is unfair to the backtracking implementations, since it focuses on an uncommon corner case. This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time or more on some inputs, the decision should be easy.
Also, while examples as dramatic as this one rarely occur in practice, less dramatic ones do occur. Examples include using (.*)(.*)(.*)(.*)(.*) to split five space-separated fields, or using alternations where the common cases are not listed first. As a result, programmers often learn which constructs are expensive and avoid them, or they turn to so-called optimizers. Using Thompson's NFA simulation does not require such adaptation: there are no expensive regular expressions.

For example, here is the NFA we used earlier for abab|abbb, with state numbers added (diagram not reproduced here). Thompson's NFA simulation is essentially executing the equivalent DFA: each List of NFA states corresponds to a single DFA state, and the step function computes the next such list from the current one. Rather than throw away this work after each step, we could cache the Lists in spare memory, avoiding the cost of repeating the computation in the future and essentially computing the equivalent DFA as it is needed.
This section presents the implementation of such an approach. Starting with the NFA implementation from the previous section, we need to add less than 100 lines to build a DFA implementation.
To implement the cache, we first introduce a new data type that represents a DFA state, sketched below. A DState is the cached copy of the list l. The nextstate function computes, records, and returns the next state for a given state and character. All the DStates that have been computed need to be saved in a structure that lets us look up a DState by its List.
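A reconstruction of the shape of this type and of nextstate, assuming the List type, the step function, and the scratch list l1 from the NFA simulation in the previous section (dstate is sketched below):

    /*
     * A DState caches one list of NFA states, together with the
     * precomputed next state for each possible input byte.
     * The left and right pointers place it in a lookup tree.
     */
    typedef struct DState DState;
    struct DState
    {
        List l;              /* cached copy of an NFA state list */
        DState *next[256];   /* next DState for each input byte, or NULL */
        DState *left;        /* binary tree links, keyed by l */
        DState *right;
    };

    /* Compute, record, and return the next state for d on byte c. */
    DState*
    nextstate(DState *d, int c)
    {
        step(&d->l, c, &l1);              /* one step of the NFA simulation */
        return d->next[c] = dstate(&l1);  /* cache the resulting DState */
    }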
To do this, we arrange them in a binary tree using the sorted List as the key. The dstate function returns the DState for a given List, allocating one if necessary; it is sketched below, together with the match loop that uses the cache. An alternative would be to compute the entire DFA at once. Doing so would make match a little faster by removing the conditional branch, but at the cost of increased startup time and memory use.
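A reconstruction under the same assumptions (listcmp and ptrcmp are the list- and pointer-comparison helpers, and ismatch tests a list for the matching state, all from the NFA section):

    #include <stdlib.h>
    #include <string.h>

    static DState *alldstates;   /* root of the tree of cached states */

    /* Return the cached DState for list l, allocating one if necessary. */
    DState*
    dstate(List *l)
    {
        int i;
        DState **dp, *d;

        /* Sort the list so that equal sets compare equal. */
        qsort(l->s, l->n, sizeof l->s[0], ptrcmp);

        /* Look in the tree for an existing DState. */
        dp = &alldstates;
        while((d = *dp) != NULL){
            i = listcmp(l, &d->l);
            if(i < 0)
                dp = &d->left;
            else if(i > 0)
                dp = &d->right;
            else
                return d;
        }

        /* Allocate and initialize a new DState, storing its list after it. */
        d = malloc(sizeof *d + l->n*sizeof l->s[0]);
        memset(d, 0, sizeof *d);
        d->l.s = (State**)(d+1);
        memmove(d->l.s, l->s, l->n*sizeof l->s[0]);
        d->l.n = l->n;

        *dp = d;   /* insert into the tree */
        return d;
    }

    /* Run the cached DFA over s, filling in missing transitions on demand. */
    int
    match(DState *start, char *s)
    {
        DState *d, *next;
        int c;

        d = start;
        for(; *s; s++){
            c = *s & 0xFF;
            if((next = d->next[c]) == NULL)
                next = nextstate(d, c);   /* not cached yet: compute it */
            d = next;
        }
        return ismatch(&d->l);
    }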
One might also worry about bounding the amount of memory used by the on-the-fly DFA construction. Since the DStates are only a cache of the step function, the implementation of dstate could choose to throw away the entire DFA so far if the cache grew too large.
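Concretely, the guard might look like this at the top of dstate (a sketch; MAXCACHED, ncachedstates, and freedstates are assumed names, not from the article):

    /* At the top of dstate: if the cache has grown too large,
       discard it entirely and start filling it again. */
    if(ncachedstates >= MAXCACHED){
        freedstates(alldstates);   /* free every DState in the tree */
        alldstates = NULL;
        ncachedstates = 0;
    }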
This cache replacement policy only requires a few extra lines of code in dstate and in nextstate, plus around 50 lines of code for memory management. An implementation is available online.

NFAs derived from regular expressions tend to exhibit good locality: they visit the same states and follow the same transition arrows over and over when run on most texts.
This makes the caching worthwhile: the first time an arrow is followed, the next state must be computed as in the NFA simulation, but future traversals of the arrow are just a single memory access. Real DFA-based implementations can make use of additional optimizations to run even faster. (A companion article, not yet written, will explore DFA-based regular expression implementations in more detail.)

Regular expression usage in real programs is somewhat more complicated than what the regular expression implementations described above can handle.
This section briefly describes the common complications; full treatment of any of these is beyond the scope of this introductory article. Character classes. Character classes can be expanded into alternations during compilation, though it is more efficient to add a new kind of NFA node to represent them explicitly. POSIX defines special character classes like [[:upper:]] that change meaning depending on the current locale, but the hard part of accommodating these is determining their meaning, not encoding that meaning into an NFA.
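For example, one way to represent such a node explicitly (an illustrative sketch, not the article's code) is to give the NFA node a membership bitmap consulted during the simulation, so that [a-z] costs one node rather than a 26-way alternation:

    enum { CharClass = 257 };   /* hypothetical new node kind */

    typedef struct ClassState ClassState;
    struct ClassState
    {
        int c;                      /* CharClass, or a literal byte */
        unsigned char cls[256/8];   /* membership bits, one per byte value */
        ClassState *out;            /* next state on a match */
    };

    /* Does state s accept input byte b? */
    int
    accepts(ClassState *s, int b)
    {
        if(s->c == CharClass)
            return (s->cls[b/8] >> (b%8)) & 1;
        return s->c == b;
    }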
Escape sequences. Real regular expression syntaxes need to handle escape sequences, both as a way to match metacharacters literally (\(, \\, and so on) and as a way to write otherwise hard-to-type characters such as \n.

Counted repetition. Counted repetition operators such as e{3}, e{3,5}, and e{3,} can be handled by expanding the repetition during compilation: e{3} becomes eee, and e{3,5} becomes eeee?e?.

Submatch extraction. When regular expressions are used for splitting or parsing strings, it is useful to be able to find out which sections of the input string were matched by each subexpression. For example, a Perl programmer might capture a date and a time with parenthesized subexpressions and then refer to the captured pieces as $1 and $2.
The extraction of submatch boundaries has been mostly ignored by computer science theorists, and it is perhaps the most compelling argument for using recursive backtracking. However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance.
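The usual adaptation, sketched in outline here (Thread, MAXSUB, and record are illustrative names, not the article's code), is to have each thread of the NFA simulation carry its own copy of the submatch positions, duplicated whenever a thread splits:

    enum { MAXSUB = 20 };   /* toy limit: ten capturing pairs */

    typedef struct State State;   /* NFA state, as in the earlier sections */

    typedef struct Thread Thread;
    struct Thread
    {
        State *state;             /* where this thread is in the NFA */
        const char *sub[MAXSUB];  /* boundary positions recorded so far */
    };

    /*
     * When a thread crosses the node marking subexpression slot i,
     * record the current input position p in that thread's copy only.
     * Passing and returning Thread by value performs the duplication.
     */
    Thread
    record(Thread t, int i, const char *p)
    {
        t.sub[i] = p;
        return t;
    }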
The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed.

Unanchored matches. This article has assumed that regular expressions are matched against an entire input string. In practice, one often wishes to find a substring of the input that matches the regular expression. Unix tools traditionally return the longest matching substring that starts at the leftmost possible point in the input.
An unanchored search for e is a special case of submatch extraction: it is like searching for .*(e).* where the first .* is constrained to match as short a string as possible.

Non-greedy operators. In traditional Unix regular expressions, the repetition operators ?, *, and + are defined to match as much of the string as possible while still allowing the entire regular expression to match. These operators are now called greedy. Perl introduced ??, *?, and +? as non-greedy versions, which match as little of the string as possible. By definition, whether an operator is greedy cannot affect whether a regular expression matches a particular string as a whole; it only affects the choice of submatch boundaries. The backtracking algorithm admits a simple implementation of non-greedy operators: try the shorter match before the longer one.
For example, in a standard backtracking implementation, e? first tries using e and then tries not using it; e?? uses the other order. The submatch-tracking variants of Thompson's algorithm can be adapted to accommodate non-greedy operators.

Assertions. The traditional metacharacters ^ and $ can be viewed as assertions about the text around them, as can Perl's word-boundary assertion \b. Perl also generalized the idea to arbitrary conditions called lookahead assertions: (?=re) asserts that the text after the current input position matches re, without actually advancing the input position, while (?!re) asserts that the text does not match re.
The lookbehind assertions (?<=re) and (?<!re) are similar but make assertions about the text before the current input position. The generalized assertions are harder to accommodate but in principle could be encoded in the NFA.

Backreferences. As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. Specifically, the problem is NP-complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million-dollar prize.
The simplest, most effective strategy for backreferences, taken by the original awk and egrep, is not to implement them. This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions.
Even so, it would be reasonable to use Thompson's NFA simulation for most regular expressions, and only bring out backtracking when it is needed. A particularly clever implementation could combine the two, resorting to backtracking only to accommodate the backreferences.
Backtracking with memoization. Perl's approach of using memoization to avoid exponential blowup during backtracking when possible is a good one. At least in theory, it should make Perl's regular expressions behave more like an NFA and less like backtracking. Memoization does not completely solve the problem, though: the memoization itself requires a memory footprint roughly equal to the size of the text times the size of the regular expression.
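To make that footprint concrete, here is a toy memoized backtracking matcher (an illustration, not Perl's code) for a pattern language with just literal bytes, '.', and '*': the memo has one entry per (pattern position, text position) pair, which is exactly text size times pattern size.

    #include <string.h>

    /* memo[i][j]: 0 = unknown, 1 = match, 2 = fail, for re+i vs. text+j.
       Toy limits: pattern shorter than 64 bytes, text shorter than 256. */
    static char memo[64][256];

    static int
    m(const char *re, int i, const char *text, int j)
    {
        int r;

        if(memo[i][j])
            return memo[i][j] == 1;
        if(re[i] == '\0')
            r = (text[j] == '\0');
        else if(re[i+1] == '*')
            /* try zero occurrences, then one more if the head matches */
            r = m(re, i+2, text, j) ||
                (text[j] != '\0' && (re[i] == '.' || re[i] == text[j]) &&
                 m(re, i, text, j+1));
        else
            r = text[j] != '\0' && (re[i] == '.' || re[i] == text[j]) &&
                m(re, i+1, text, j+1);
        memo[i][j] = r ? 1 : 2;
        return r;
    }

    int
    matchmemo(const char *re, const char *text)
    {
        memset(memo, 0, sizeof memo);
        return m(re, 0, text, 0);
    }

Each (i, j) pair is computed at most once, so the exponential blowup is gone, but the table grows with the input, and the recursion depth still does too, which leads to the next problem.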
Memoization also does not address the issue of the stack space used by backtracking, which is linear in the size of the text: matching long strings typically causes a backtracking implementation to run out of stack space.

Character sets. Modern regular expression implementations must deal with large non-ASCII character sets such as Unicode. The Plan 9 regular expression library incorporates Unicode by running an NFA with a single Unicode character as the input character for each step. That library separates the running of the NFA from decoding the input, so that the same regular expression matching code is used for both UTF-8 and wide-character inputs.
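A sketch of that separation (illustrative; decoderune is an assumed name): the matcher asks a decoder for the next character, so substituting a wide-character decoder leaves the NFA code untouched.

    /* Decode one UTF-8 encoded character starting at *sp, advancing
       *sp past it. Minimal sketch: assumes well-formed input. */
    int
    decoderune(const char **sp)
    {
        const unsigned char *p = (const unsigned char*)*sp;
        int r, n, i;

        if(p[0] < 0x80)      { r = p[0];        n = 1; }
        else if(p[0] < 0xE0) { r = p[0] & 0x1F; n = 2; }
        else if(p[0] < 0xF0) { r = p[0] & 0x0F; n = 3; }
        else                 { r = p[0] & 0x07; n = 4; }
        for(i = 1; i < n; i++)
            r = (r << 6) | (p[i] & 0x3F);
        *sp += n;
        return r;
    }

The byte-at-a-time loop shown earlier then becomes a character-at-a-time loop: decode one character, hand it to the NFA step function, and repeat.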
History and references. Michael Rabin and Dana Scott introduced non-deterministic finite automata and the concept of non-determinism in 1959, showing that NFAs can be simulated by (potentially much larger) DFAs in which each DFA state corresponds to a set of NFA states. They won the Turing Award in 1976, in large part for the introduction of the concept of non-determinism in that paper. R. McNaughton and H. Yamada [4] and Ken Thompson [9] are commonly credited with giving the first constructions to convert regular expressions into NFAs, even though neither paper mentions the then-nascent concept of an NFA.
The approach used above, mimicking Thompson, encodes the choices with explicit choice nodes (the Split nodes above) and unlabeled arrows. An alternative approach, the one most commonly credited to McNaughton and Yamada, is to avoid unlabeled arrows, instead allowing NFA states to have multiple outgoing arrows with the same label. McIlroy [3] gives a particularly elegant implementation of this approach in Haskell.
A copy of the editor (QED) can be found in archived CTSS sources [5]. L. Peter Deutsch and Butler Lampson [1] developed the first QED, but Thompson's reimplementation was the first to use regular expressions. Thompson's paper marked the beginning of a long line of regular expression implementations. Thompson chose not to use his algorithm when implementing the text editor ed, which appeared in First Edition Unix (1971), or in its descendant grep, which first appeared in the Fourth Edition (1973). Instead, these venerable Unix tools used recursive backtracking!
Backtracking was justifiable because the regular expression syntax was quite limited: it omitted grouping parentheses and the |, ?, and + operators.