> {-# OPTIONS -Wincomplete-patterns #-}
> {-# LANGUAGE FlexibleInstances #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE AllowAmbiguousTypes #-}
Regular Expressions
> import Data.Set (Set)
> import qualified Data.Set as Set (singleton, union, toList, fromList, size, elems)
> import Data.Map (Map)
> import qualified Data.Map as Map
>
> import Data.List (foldl', reverse, nub)
> import qualified Data.Maybe as Maybe
Regular expressions are a specialized language for representing string-matching patterns. Regular expressions were invented by the mathematician Stephen Kleene, one of the pioneers that created the foundations of the theory of computation (with Goedel, Turing, Church, and Post). Kleene invented regular expressions to represent the sets of possible behaviors of the abstract finite computation devices called finite-state machines. In Kleene's original formulation, regular expressions were were built from individual letters with three operators: concatenation, representing one pattern followed by another; alternation (also called union) denoted by |
, representing two alternative patterns; and closure (also called Kleene-Star), denoted by *
, to represent zero or more repetitions of a pattern. By convention, the empty string represents a special regular expression, the empty regular expression, which matches the empty string.
For example, the regular expression a(bc|d)*e
matches all strings that start with a
, then have some number of bc
or d
characters (possibly none), and end with e
. Some such strings include ae
, abce
, abcde
, adbcde
, abcbcdbce
. In the 1970s, computer scientists at Bell Labs were working on the first software tools for text processing, and they adapted and generalized Kleene's idea in several software tools, starting with grep, for searching and editing text. Regular expressions have many practical uses, mainly in pattern matching, which has applications in everything from compilers to searching databases.
In this problem, we consider regular expression evaluators, that is, programs that determine whether a string is in the language denoted by the regular expression. This process is also called regular expression matching.
We can represent regular expressions using the following datatype in Haskell.
> data RegExp = Char (Set Char) -- single literal character
> -- matches any character in the (nonempty) set
> | Alt RegExp RegExp -- r1|r2 (alternation)
> | Seq RegExp RegExp -- r1 r2 (concatenation)
> | Star RegExp -- r* (Kleene star)
> | Empty -- ε, accepts empty string
> | Void -- ∅, always fails
> | Mark String [String] RegExp
> -- (for marked subexpressions, see below)
> deriving (Eq)
Your goal in this problem is to define four different evaluators.
The functions accept
and match
use two different algorithms to determine whether the given String
is accepted by the given RegExp
. Then, patAccept
and patMatch
extend each of these algorithms with marked subexpressions, and returns additional information about the matched string if it is accepted.
Characters and Character sets
The first data constructor of the RegExp
datatype defines a regular expression that matches a single character from a specified set. For example, if we want to define a regexp that only matches the single character a
, we could write:
When working with regexps, we can define some shorthand character classes to make our lives easier, such as sets of digits, whitespace, lowercase, uppercase characters.
> -- | Digits, usually written \d in regexp libraries
> digit :: Set Char
> digit = Set.fromList ['0' .. '9']
> -- | lowercase \l
> lower :: Set Char
> lower = Set.fromList ['a' .. 'z']
>
> -- | uppercase \u
> upper :: Set Char
> upper = Set.fromList ['A' .. 'Z']
> -- | The union of all of the above, plus punctuation
> -- but not including the newline characters
> -- similar to '.' in regexp libraries
> anyc :: Set Char
> anyc = word <> Set.fromList " \t!@#$%^&*()_+{}|:\"<>?~`-=[]\\;',./"
Examples and derived forms
A name is an upper case letter followed by a sequence of lowercase letters of any length. In a regular-expression library, we might write it as [A-Z][a-z]*
. (The *
post-fix operator corresponds to 0 or more occurrences of a pattern.)
Using our datatype above, we can express this regexp as:
> testName :: Test
> testName = "name" ~:
> TestList[ accept name "Stephanie" ~? "a good name"
> , not (accept name "stephanie") ~? "must be capitalized"
> , not (accept name "Ste7anie") ~? "no extra symbols"
> , not (accept name "Steph Annie") ~? "not even spaces"
> ]
We can also define regexps that only accept specific words. For example, given this function
> string :: String -> RegExp
> string "" = Empty
> string s = foldr1 Seq (map (Char . Set.singleton) s)
we can define this RegExp to recognize the name of our favorite course.
We can also use Star
and Seq
to define the plus
operator, which corresponds to one or more occurrences of a pattern.
For example, this regular expression accepts any non-empty string that is surrounded by the tags <b>
and </b>
.
Before you go further, note that we have provided you with an instance of the Show
type class for regexps that displays regexps succinctly (i.e. it tries to use abbreviations such as and for character classes). Try displaying the RegExp
s above and make sure that you understand the connection between their representation and how they are displayed by ghci.
ghci> boldHtml
RegExp Acceptance
First, we'll write a straightforward (and extremely inefficient) operation, called accept
, that determines whether a particular string is part of the regular language accepted by the given regular expression.
Begin by implementing the following two helper functions. (To practice with the list monad, you must use a list comprehension in each of these implementation.)
> -- | all decompositions of a string into two different pieces
> -- split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
> split :: [a] -> [([a], [a])]
> split = error "split: unimplemented"
> -- | all decompositions of a string into multi-part (nonempty) pieces
> -- parts "abc" == [["abc"],["a","bc"], ["ab","c"], ["a","b","c"]]
> parts :: [a] -> [[[a]]]
> parts = error "parts: unimplemented"
Don't forget to write some quickCheck properties so that you can test your functions! For example, here's one that makes sure that we get the right number of pairs from split.
You can do better, though!
Note: we have to be careful when randomly testing parts; this is an exponential algorithm after all. The resize
function allows us to bound the size of the lists that are produced so that they don't get too big. If you find that prop_partsLength
takes too long, you can decrease the argument to resize
.
> prop_partsLength :: Property
> prop_partsLength = forAll smallLists $ \ l ->
> length (parts l) == 2 ^ predNat (length l) where
> predNat n = if n <= 0 then 0 else n - 1
Now, use split
and parts
to determine whether a RegExp
matches the given input string. Below, your implementation should simply explore all possibilities. For example, to determine whether the concatenation pattern Seq r1 r2
matches an input string, use split
to compute all possible ways of splitting the string into two parts and see whether r1
and r2` match the two parts. Again, use list comprehensions as part of the design of your implementation. (For now, just ignore 'Mark' constructors in the input regular expressions.)
> -- | Decide whether the given regexp matches the given string
> accept :: RegExp -> String -> Bool
> accept (Mark _ _ r) s = accept r s
> accept _ _ = error "accept: finish me"
> testAccept :: Test
> testAccept = "accept" ~: TestList [
> not (accept Void "a") ~? "nothing is void",
> not (accept Void "") ~? "really, nothing is void",
> accept Empty "" ~? "accept Empty true",
> not (accept Empty "a") ~? "not accept Empty",
> accept (Char lower) "a" ~? "accept lower",
> not (accept (Char lower) "A") ~? "not accept lower",
> accept boldHtml "<b>cis552</b>!</b>" ~? "cis552!",
> not (accept boldHtml "<b>cis552</b>!</b") ~? "no trailing" ]
QuickChecking accept
How can we use quick check to test our implementation of the accept
function?
One idea is that we can use regular expressions in reverse---instead of using them to identify strings from a language, we can instead use them to generate strings from a language.
So, given a regexp, the accept
function is correct if it returns true for all strings generated from the language of the regexp.
> prop_accept :: RegExp -> Property
> prop_accept r =
> case genRegExpString r of
> Just g -> forAll g $ accept r -- if we can generate a random string,
> -- then it should be accepted
> Nothing -> property $ isVoid r -- otherwise, we should have a RegExp
> -- equivalent to 'Void'
Write a function that can generate a String that is accepted by the given RegExp. Note that this function is partial; some RegExps accept no strings and denote the empty language.
NOTE: in the case of Star
, your function does not need to generate all strings accepted by the RegExps. Instead, put a bound on the number of iterations in your result. In particular, iterating more than twice could cause a large blow-up when quickChecking prop_accept
.
> -- | Create a generator for the strings accepted by this RegExp (if any)
> genRegExpString :: RegExp -> Maybe (Gen String)
> genRegExpString r = undefined
> -- | Is this the regexp that never matches any string?
> -- (It may include marks (see next problem), because this regexp will never
> -- match anything.)
> isVoid :: RegExp -> Bool
> isVoid Void = True
> isVoid (Seq r1 r2) = isVoid r1 || isVoid r2
> isVoid (Alt r1 r2) = isVoid r1 && isVoid r2
> isVoid (Star r) = False
> isVoid (Mark _ _ r) = isVoid r
> isVoid _ = False
Now complete an Arbitrary
instance for regular expressions to test this property above. Your regexps should only contain the characters "abcd" and need not contain marked subexpressions. Note that you should avoid generating 'Char' with an empty set.
Make sure that you also implement the 'shrink' function too.
Marked Subexpressions
Backtracking is not the most efficient implementation of regular expressions, but it is easy to extend.
One extension is support for marked subexpressions. For this part, you will rewrite accept
, as patAccept
so that it returns strings that are matched by the marked subexpressions in the regular expression.
For example, we can mark the part of the regular expression between the tags:
> boldHtmlPattern :: RegExp
> boldHtmlPattern = string "<b>" `Seq` mark "bold" (plus (Char anyc)) `Seq` string "</b>"
Or mark sequences of letters that correspond to the first and last names:
> namePattern :: RegExp
> namePattern = mark "first" name `Seq` Star (Char white) `Seq` mark "last" name
Or mark any number of sequences of lowercase letters:
> stringsPattern :: RegExp
> stringsPattern = Star (mark "string" (plus (Char lower))
> `Seq` Star (Char white))
Then, patAccept
below returns not only whether the pattern matches, but also all parts of the string that correspond to the marks.
> testPat :: Test
> testPat = "testPat" ~: TestList [
> patAccept boldHtmlPattern "<b>cis552" ~?= Nothing,
> patAccept boldHtmlPattern "<b>cis552!</b>" ~?=
> Just (Map.fromList [("bold",["cis552!"])]),
> patAccept boldHtmlPattern "<b>cis552</b>!</b>" ~?=
> Just (Map.fromList [("bold",["cis552</b>!"])]),
> patAccept namePattern "Haskell Curry" ~?=
> Just (Map.fromList [("first",["Haskell"]),("last",["Curry"])]),
> patAccept stringsPattern "a b c d e" ~?=
> Just (Map.fromList [("string",["a","b","c","d","e"])])
> ]
Note that the above examples use a "smart constructor" for marking capture groups. The data constructor for Mark
ed subexpressions includes a list of strings that we don't need now; it's needed for the last problem. You can just make it nil for now and ignore it for this part.
Now implement the patAccept
function. Note that the function returns a Maybe
--- look for at least one place to use monad do-notation in its definition. (There won't be many.)
> -- | If the regexp matches the string, return a dictionary from
> -- named marks to the captured substrings
> patAccept :: RegExp -> String -> Maybe (Map String [String])
> patAccept = error "patAccept: unimplemented"
Simplifying Regular Expressions
You may notice that when working with regular expressions, there are several ways to describe the same set of strings. For example, the regular expression
a**|a**
which we would encode as
matches exaclt the same set of strings as a*
, i.e. strings composed of any number of a's.
However, using the first regexp to test for acceptance takes a lot more time than the second, especially when the string doesn't match. For example, I found this comparison on my laptop:
*RegExp> accept astar "aaaaaaaaaaaaaaab"
False
(5.87 secs, 4,278,663,696 bytes)
*RegExp> accept (Star a) "aaaaaaaaaaaaaaab"
False
(0.07 secs, 50,671,104 bytes)
We can optimize code that works with RegExp through the use of "smart constructors". These smart constructor recognize simplifications that can be made when ever a regular expression is put together. Suppose we have "smart" variants of the star and alt regular expression constructors. Then we can form the regexp in the same way, but (sometimes) get better performance.
*RegExp> let astar' = star (star a) `alt` star (star a)
(0.00 secs, 68,904 bytes)
*RegExp> accept astar' "aaaaaaaaaaaaaaab"
False
(0.07 secs, 50,670,984 bytes)
For example, here is a definition of smart constructor for star
. This construct looks for simplifications that it can apply while constructing the output.
> star :: RegExp -> RegExp
> star r1 | isEmpty r1 = Empty
> -- iterating the empty string is the empty string
> star r1 | isVoid r1 = Empty
> -- zero or more occurrences of void is empty
> star (Star r) = Star r
> -- two iterations is the same as one
> star r = Star r
> -- no optimization
> -- | Is this the regexp that accepts *only* the empty string
> -- (It cannot include marks because we want to eliminate it
> -- during optimization above.)
> isEmpty :: RegExp -> Bool
> isEmpty Empty = True
> isEmpty (Seq r1 r2) = isEmpty r1 && isEmpty r2
> isEmpty (Alt r1 r2) = isEmpty r1 && isEmpty r2
> isEmpty (Star r) = isEmpty r
> isEmpty _ = False
How do we know that our definition of star
is really smart? We want to be sure that the regexp that it produces matches the same strings as its input.
We'll compare the optimized version with the original to make sure that they match the same strings. (We'll also make this a conditional property to make sure that we only do the test when the smart constructor actually modifies the string.)
> prop_star :: RegExp -> Property
> prop_star r = sr /= Star r ==> sr %==% Star r where
> sr = star r
We can quickCheck our optimizations using accept
and the following property. This property tests pairs of regexps on strings and ensures that either they are both accepted or both rejected. To make this property more efficient, it randomly selects strings that are either accepted by the first regexp, accepted by the second, or that contain arbitrary sequences of a's, b's, c's, and d's.
> -- | Property to determine whether two regexps accept the same language of
> -- strings
> (%==%) :: RegExp -> RegExp -> Property
> r1 %==% r2 = forAll (genString r1 r2) $
> \s -> accept r1 s == accept r2 s
> where
> genString :: RegExp -> RegExp -> Gen String
> genString r1 r2 = oneof $ Maybe.catMaybes
> [ genRegExpString r1
> , genRegExpString r2
> , Just $ resize 10 (listOf (elements "abcd"))
> ]
Now design and test similar optimizations for sequencing and alternation. (For inspiration, read the wikipedia page about Kleene Algebra.)
> -- | Smart constructor for `Seq`
> seq :: RegExp -> RegExp -> RegExp
> seq = undefined
>
> prop_seq :: RegExp -> RegExp -> Property
> prop_seq r1 r2 = rs /= Seq r1 r2 ==> rs %==% Seq r1 r2 where
> rs = seq r1 r2
> -- | Smart constructor for `Alt`
> alt :: RegExp -> RegExp -> RegExp
> alt = undefined
>
> prop_alt :: RegExp -> RegExp -> Property
> prop_alt r1 r2 = rs /= Alt r1 r2 ==> rs %==% Alt r1 r2 where
> rs = alt r1 r2
Regular Expression Derivatives
You may have noticed by now that this implementation of Regular Expression matching is really slow. Let's fix that.
The textbook way to implement regular expression matching is to first translate the regular expression into a finite-state machine and then apply the finite-state matching to the string.
However, there's a more direct, elegant, but not so well-known alternative, the method of derivatives due to Janusz A. Brzozowski. This method is described in more detail here.
The basic idea is that, given a regular expression and the first character in the input string to match, you can compute a new regular expressions, which must match the remaining string in order for the original RegExp
to match the entire input string. This new regular expression is called the derivative of the original.
We can use this idea to implement regular expression matching by repeatedly calculating the derivatives for each character in the string. If the final result is a regular expression that accepts the empty string, then the original regular expression would have matched the string. In other words:
> -- | Determine whether the given regexp matches the given String
> match :: RegExp -> String -> Bool
> match r s = nullable (foldl' deriv r s)
Your job is to implement nullable
and deriv
to complete this implementation. Once again, you can ignore marked sub-expressions. (We'll return to them below.)
> -- | `nullable r` return `True` when `r` could match the empty string
> nullable :: RegExp -> Bool
> nullable (Mark _ _ r) = nullable r
> nullable _ = error "nullable: unimplemented"
> -- | Takes a regular expression `r` and a character `c`,
> -- and computes a new regular expression that accepts word `w`
> -- if `cw` is accepted by `r`. Make sure to use the smart constructors
> -- above when you construct the new `RegExp`
> deriv :: RegExp -> Char -> RegExp
> deriv (Mark _ _ r) c = deriv r c -- ignore marks completely
> deriv _ _ = error "deriv: unimplemented"
For example, if r
is the literal character c
, then the derivative of r
is Empty
, the regular expression that only accepts the empty string. In the case of Seq
, you need to think about the case where the first regular expression could accept the empty string. In that case, the derivative should include the possibility that it could be skipped, and the character consumed by the second regexp.
Note that Haskell's lazy evaluation avoids the evaluation of the whole regular expression. The expression has only to be evaluated as much as nullable
needs to calculate an answer.
Derivatives with Marked Sub-expressions
So far, the code above didn't include marked sub-expressions. We can extend our derivative-based matching routine by not dropping the marks, but by keeping them around in the derived regexp itself.
With this strategy, we will do the same thing as above, but this time, after repeatedly calculating the derivative with respect to each character in the String, we will use the extract
function to determine, not just whether the resulting regexp is nullable, but what strings it has matched along the way.
> -- | matching using derivatives, with marked subexpressions
> patMatch :: RegExp -> String -> Maybe (Map String [String])
> patMatch r w = extract (foldl' markedDeriv r w)
> -- | Determine whether the regexp matches the empty string, and, if so,
> -- extract any saved matches.
> extract :: RegExp -> Maybe (Map String [String])
> extract _ = undefined
> -- | Calculate the derivative of the regular expression, storing each character
> -- in the mark data structure.
> markedDeriv :: RegExp -> Char -> RegExp
> markedDeriv _ _ = undefined
> testPatMatch :: Test
> testPatMatch = "patMatch" ~: TestList [
> patMatch boldHtmlPattern "<b>cis552" ~?= Nothing,
> patMatch boldHtmlPattern "<b>cis552!</b>" ~?=
> Just (Map.fromList [("bold",["cis552!"])]),
> patMatch boldHtmlPattern "<b>cis552</b>!</b>" ~?=
> Just (Map.fromList [("bold",["cis552</b>!"])]),
> patMatch namePattern "Haskell Curry" ~?=
> Just (Map.fromList [("first",["Haskell"]),("last",["Curry"])]),
> patMatch stringsPattern "a b c d e" ~?=
> Just (Map.fromList [("string",["a","b","c","d","e"])])
> ]
Showing Regular Expressions
The following code is meant to help you display regular expressions. You do not need to edit it.
> -- | display a set of characters succinctly, and escape all special characters
> showCharSet :: Set Char -> ShowS
> showCharSet s
> | s == lower = showString "\\l"
> | s == upper = showString "\\u"
> | s == digit = showString "\\d"
> | s == white = showString "\\s"
> | s == word = showString "\\w"
> | s == anyc = showString "."
> | Set.size s == 1 = showString (concatMap escape (Set.elems s))
> | otherwise = showString "[" .
> showString (concatMap escape (Set.elems s)) . showString "]"
> -- | Add slashes in front of standard regexp characters
> escape :: Char -> String
> escape c
> | c `elem` "[\\^$.|?*+()" = ['\\', c]
> | otherwise = [c]
> -- | display a mark in a regexp
> showMark :: String -> [String] -> String -> String
> showMark n []
> = showString n
> showMark n [k]
> = showString n . showString "=" . showString k
> showMark n ks
> = showString n . showString "=" . shows ks
> -- | Display a regexp, using precedence to reduce the number of required
> -- parentheses in the output.
> instance Show RegExp where
> -- special case for '+'
> showsPrec p (Seq r1 (Star r2)) | r1 == r2 =
> showParen (p > 6) $ showsPrec 6 r1 . showString "+"
> showsPrec p (Char s) = showCharSet s
> showsPrec p (Alt r1 r2) =
> showParen (p > 7) $ showsPrec 7 r1 . showString "|" . showsPrec 7 r2
> showsPrec p (Seq r1 r2) =
> showParen (p > 10) $ showsPrec 10 r1 . showsPrec 10 r2
> showsPrec p (Star r) =
> showParen (p > 6) $ showsPrec 6 r . showString "*"
> showsPrec p Empty = showString "ε"
> showsPrec p Void = showString "∅"
> showsPrec p (Mark m k r) =
> showParen (p > 5) $
> showString "?P<" . showMark m k . showString ">" . showsPrec 5 r