> {-# OPTIONS -fwarn-incomplete-patterns -fwarn-tabs #-}
> {-# LANGUAGE FlexibleInstances, ScopedTypeVariables #-}
Regular Expressions
> module RegExp where
> import Prelude hiding(either)
> import Data.Set (Set)
> import qualified Data.Set as Set (singleton, fromList)
> import Data.Map (Map)
> import qualified Data.Map as Map
> import Control.Applicative(Alternative(..))
> import Control.Monad (ap, liftM2)
> import Test.HUnit hiding (State)
> import Test.QuickCheck
> import Test.QuickCheck.Function
> main :: IO ()
> main = return ()
Regular expressions are a specialized language for representing string-matching patterns. Regular expressions were invented by the mathematician Stephen Kleene, one of the pioneers that created the foundations of the theory of computation (with Goedel, Turing, Church, and Post). Kleene invented regular expressions to represent the sets of possible behaviors of the abstract finite computation devices called finite-state machines. In Kleene's original formulation, regular expressions were were built from individual letters with three operators: concatenation, representing one pattern followed by another; alternation (also called union) denoted by |
, representing two alternative patterns; and closure (also called Kleene-Star), denoted by *
, to represent zero or more repetitions of a pattern. By convention, the empty string represents a special regular expression, the empty regular expression, which matches the empty string.
For example, the regular expression a(bc|d)*e
matches all strings that start with a
, then have some number of bc
or d
characters (possibly none), and end with e
. Some such strings include ae
, abce
, abcde
, adbcde
, abcbcdbce
. In the 1970s, computer scientists at Bell Labs were working on the first software tools for text processing, and they adapted and generalized Kleene's idea in several software tools, starting with grep, for searching and editing text. Regular expressions have many practical uses, mainly in pattern matching, which has applications in everything from compilers to searching databases.
In this problem, we consider regular expression evaluators, that is, programs that determine whether a string is in the language denoted by the regular expression. This process is also called regular expression matching.
We can represent regular expressions using the following datatype in Haskell.
> data RegExp = Char (Set Char) -- single literal character
> | Alt RegExp RegExp -- r1 | r2 (alternation)
> | Seq RegExp RegExp -- r1 r2 (concatenation)
> | Star RegExp -- r* (Kleene star)
> | Empty -- ε, accepts empty string
> | Void -- ∅, always fails
> | Mark String RegExp -- (for marked subexpressions, see (b) below)
> deriving (Show, Eq)
Note that the [Char] constructor takes a set of characters as argument. It matches any single character from this set.
While doing the exercise, you may find useful to implement your own instance of the [Show] class for [RegExp]s instead of using the one that arising from the deriving
annotation above. (In particular, you might want to print the set of characters associated with a Char
in a more compact form than the one given by the default show
instance for Set
.)
We can define regular expressions for specific character classes:
> char :: Char -> RegExp
> char = Char . Set.singleton
> chars :: String -> RegExp
> chars = Char . Set.fromList
> lower, upper, letter, digit, punc, white, anyc, anyc':: RegExp
> lower = chars ['a' .. 'z']
> upper = chars ['A' .. 'Z']
> digit = chars ['0' .. '9']
> punc = chars "<>!/.*()?@"
> white = chars " \n\r\t"
Or the union of all of the above classes:
> anyc' = lower `Alt` upper `Alt` digit `Alt` punc `Alt` white
A more convenient definition (for debugging purposes, since it will print more compactly) is the following:
> anyc = chars $ ['a' .. 'z']
> ++ ['A' .. 'Z']
> ++ ['0' .. '9']
> ++ "<>!/.*()?@"
> ++ "\n\r\t"
A regular expression for letters will be also useful:
> letter = chars $ ['A' .. 'Z'] ++ ['a' .. 'z']
Or we can define regular expressions that match specific words.
> word :: String -> RegExp
> word = foldr (Seq . char) Empty
> cis552 :: RegExp
> cis552 = word "cis552"
The Star
operator corresponds to 0 or more occurrences of a pattern. For example, this regular expression accepts any string that begins and ends with the tags <b>
and </b>
.
> boldHtml :: RegExp
> boldHtml = word "<b>" `Seq` Star anyc `Seq` word "</b>"
We can use Star
and Seq
to define the plus
operator, which corresponds to one or more occurrences of a pattern.
> plus :: RegExp -> RegExp
> plus pat = pat `Seq` Star pat
> -- (a)
First, we'll write a straightforward (and probably inefficient) operation that determines whether a particular string is part of the regular language accepted by the given regular expression. (For now, just ignore 'Mark' constructors in the input regular expressions.)
Begin by implementing the following two helper functions. (You may find that the list monad is useful for defining and using these functions.)
> -- all decompositions of a string into two different pieces
> -- split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
> split :: [a] -> [([a], [a])]
> split = error "split: unimplemented"
> -- all decompositions of a string into multi-part (nonempty) pieces
> -- parts "abc" = [["abc"],["a","bc"], ["ab","c"], ["a","b","c"]]
> parts :: [a] -> [[[a]]]
> parts = error "parts: unimplemented"
Now, use split
and parts
to determine whether the RegExp
matches the given input string, your implementation should simply explore all possibilities. For example, to determine whether the concatenation pattern Seq r1 r2
matches an input string, use split
to compute all possible ways of splitting the string into two parts and see whether r1
and r2` match the two parts.
> accept :: RegExp -> String -> Bool
>
> accept (Mark _ r) s = accept r s
> accept _ _ = error "accept: finish me"
> testAccept :: Test
> testAccept = TestList [
> not (accept Void "a") ~? "nothing is void",
> not (accept Void "") ~? "really, nothing is void",
> accept Empty "" ~? "accept Empty true",
> not (accept Empty "a") ~? "not accept Empty",
> accept lower "a" ~? "accept lower",
> not (accept lower "A") ~? "not accept lower",
> accept boldHtml "<b>cis552</b>!</b>" ~? "cis552!",
> not (accept boldHtml "<b>cis552</b>!</b") ~? "no trailing" ]
> -- (b)
Backtracking is not the most efficient implementation of regular expressions, but it is the most easily extensible.
One extension is support for marked subexpressions. For this problem, rewrite accept
so that it returns all strings that are matched by the marked subexpressions in the regular expression.
For example, we can mark the part of the regular expression between the tags:
> boldHtmlPat :: RegExp
> boldHtmlPat = word "<b>" `Seq` Mark "<b>" (Star anyc) `Seq` word "</b>"
Or mark sequences of letters that correspond to the first and last names:
> namePat :: RegExp
> namePat = Mark "first" (plus letter) `Seq` Star white `Seq` Mark "last" (plus letter)
Or mark any number of sequences of lowercase letters:
> wordsPat :: RegExp
> wordsPat = Star (Mark "word" (plus lower) `Seq` Star white)
Then, patAccept
below returns not only whether the pattern matches, but also the parts of the string that correspond to the marks (in order).
> testPat :: Test
> testPat = TestList [
> patAccept boldHtmlPat "<b>cis552" ~?= Nothing,
> patAccept boldHtmlPat "<b>cis552!</b>" ~?=
> Just (Map.fromList [("<b>",["cis552!"])]),
> patAccept boldHtmlPat "<b>cis552</b>!</b>" ~?=
> Just (Map.fromList [("<b>",["cis552</b>!"])]),
> patAccept namePat "Haskell Curry" ~?=
> Just (Map.fromList [("first",["Haskell"]),("last",["Curry"])]),
> patAccept wordsPat "a b c d e" ~?=
> Just (Map.fromList [("word",["a","b","c","d","e"])])
> ]
> type Match = Map String [String]
> patAccept :: RegExp -> String -> Maybe Match
> patAccept = error "patAccept: unimplemented"
> -- (c)
You may have noticed by now that this implementation of Regular Expression matching is really slow. Let's fix that.
The textbook way to implement regular expression matching is to first translate the regular expression into a finite-state machine and then apply the finite-state matching to the string.
However, there's a more direct, elegant, but not so well-known alternative, the method of derivatives due to Janusz A. Brzozowski. This method is described in more detail here.
The basic idea is that, given a regular expression and the first character in the input string to match, you can compute a new regular expressions, which must match the remaining string in order for the original RegExp
to match the entire input string. This new regular expression is called the derivative of the original.
We can use this idea to implement regular expression matching by repeatedly calculating the derivatives for each character in the string. If the final result is a regular expression that accepts the empty string, then the original regular expression would have matched the string. In other words:
> match :: RegExp -> String -> Bool
> match r s = nullable (foldl deriv r s)
Your job is to implement nullable
and deriv
to complete this implementation.
> -- | `nullable r` return `True` when `r` matches the empty string
> nullable :: RegExp -> Bool
> nullable _ = error "nullable: unimplemented"
> -- | Takes a regular expression `r` and a character `c`,
> -- and computes a new regular expression that accepts word `w` if `cw` is
> -- accepted by `r`.
> deriv :: RegExp -> Char -> RegExp
> deriv = error "deriv: unimplemented"
For example, if r
is the literal character c
, then the derivative of r
is Empty
, the regular expression that only accepts the empty string. In the case of Seq
, you need to think about the case where the first regular expression could accept the empty string. In that case, the derivative should include the possibility that it could be skipped, and the character consumed by the second regexp.
Note that Haskell's lazy evaluation avoids the evaluation of the whole regular expression. The expression has only to be evaluated as much as nullable
needs to calculate an answer.
> -- (d)
We can optimize our match function even further through the use of "smart constructors" for the regular expression type. These smart constructor recognize simplifications that can be made when ever a regular expression is put together.
> rStar :: RegExp -> RegExp
> rStar (Star x) = Star x -- two iterations is the same as one
> rStar Empty = Empty -- iterating the empty string is the empty string
> rStar Void = Empty -- zero or more occurrences of void is empty
> rStar r = Star r -- no optimization
We can quickCheck our optimizations using accept
. In each of our three optimization cases, the rhs and lhs of the definition above should accept the same strings.
We'll define this as a property, testing the regexps on strings that contain arbitrary sequences of a's, b's, c's, and d's.
> (%==%) :: RegExp -> RegExp -> Property
> r1 %==% r2 = forAll (resize 10 (listOf (elements "abcd"))) $
> \s -> match r1 s == match r2 s
> prop_rStar :: RegExp -> Property
> prop_rStar r = prop_rStar1 r .&&. prop_rStar2 .&&. prop_rStar3 where
> prop_rStar1 r = rStar (Star r) %==% Star (Star r)
> prop_rStar2 = rStar Empty %==% Star Empty
> prop_rStar3 = rStar Void %==% Star Empty
Complete an Arbitrary
instance for regular expressions to test this property above. Your regexps should only contain the characters "abcd" and need not contained marked subexpressions.
> instance Arbitrary RegExp where
> arbitrary = undefined
> shrink = undefined
Now design and test optimizations for sequencing and alternation:
> rSeq :: RegExp -> RegExp -> RegExp
> rSeq = undefined
> rAlt :: RegExp -> RegExp -> RegExp
> rAlt = undefined
Finally, modify your definition of match
above to take advantage of these optimizations.