> {-# OPTIONS -Wincomplete-patterns #-}
> {-# LANGUAGE FlexibleInstances #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> {-# LANGUAGE AllowAmbiguousTypes #-}

Regular Expressions

> module RegExp where

> import Prelude hiding (seq)

> import Data.Set (Set)
> import qualified Data.Set as Set (singleton, union, toList, fromList, size, elems)

> import Data.Map (Map)
> import qualified Data.Map as Map
> 
> import Data.List (foldl', reverse, nub)
> import qualified Data.Maybe as Maybe

> import Control.Applicative(Alternative(..))
> import Control.Monad (ap, liftM2)

> import Test.HUnit hiding (State)
> import Test.QuickCheck
> import Test.QuickCheck.Function

> main :: IO ()
> main = return ()

Regular expressions are a specialized language for representing string-matching patterns. Regular expressions were invented by the mathematician Stephen Kleene, one of the pioneers that created the foundations of the theory of computation (with Goedel, Turing, Church, and Post). Kleene invented regular expressions to represent the sets of possible behaviors of the abstract finite computation devices called finite-state machines. In Kleene's original formulation, regular expressions were were built from individual letters with three operators: concatenation, representing one pattern followed by another; alternation (also called union) denoted by |, representing two alternative patterns; and closure (also called Kleene-Star), denoted by *, to represent zero or more repetitions of a pattern. By convention, the empty string represents a special regular expression, the empty regular expression, which matches the empty string.

For example, the regular expression a(bc|d)*e matches all strings that start with a, then have some number of bc or d characters (possibly none), and end with e. Some such strings include ae, abce, abcde, adbcde, abcbcdbce. In the 1970s, computer scientists at Bell Labs were working on the first software tools for text processing, and they adapted and generalized Kleene's idea in several software tools, starting with grep, for searching and editing text. Regular expressions have many practical uses, mainly in pattern matching, which has applications in everything from compilers to searching databases.

In this problem, we consider regular expression evaluators, that is, programs that determine whether a string is in the language denoted by the regular expression. This process is also called regular expression matching.

We can represent regular expressions using the following datatype in Haskell.

> data RegExp = Char (Set Char)    -- single literal character
>                                  -- matches any character in the (nonempty) set
>             | Alt RegExp RegExp  -- r1|r2     (alternation)
>             | Seq RegExp RegExp  -- r1 r2     (concatenation)
>             | Star RegExp        -- r*        (Kleene star)
>             | Empty              -- ε, accepts empty string
>             | Void               -- ∅, always fails
>             | Mark String [String] RegExp
>                 -- (for marked subexpressions, see below)
>   deriving (Eq)

Your goal in this problem is to define four different evaluators.

accept    :: RegExp -> String -> Bool
patAccept :: RegExp -> String -> Maybe (Map String [String])

match     :: RegExp -> String -> Bool
patMatch  :: RegExp -> String -> Maybe (Map String [String])

The functions accept and match use two different algorithms to determine whether the given String is accepted by the given RegExp. Then, patAccept and patMatch extend each of these algorithms with marked subexpressions, and returns additional information about the matched string if it is accepted.

Characters and Character sets

The first data constructor of the RegExp datatype defines a regular expression that matches a single character from a specified set. For example, if we want to define a regexp that only matches the single character a, we could write:

> -- | accept a single character only
> char :: Char -> RegExp
> char = Char . Set.singleton

> -- | accept only the character 'a'
> a :: RegExp
> a = char 'a'

When working with regexps, we can define some shorthand character classes to make our lives easier, such as sets of digits, whitespace, lowercase, uppercase characters.

> -- | Digits, usually written \d in regexp libraries
> digit :: Set Char
> digit  = Set.fromList ['0' .. '9']

> -- | Whitespace \s
> white :: Set Char
> white  = Set.fromList " \n\r\t"

> -- | lowercase \l
> lower :: Set Char
> lower  = Set.fromList ['a' .. 'z']
> 
> -- | uppercase \u
> upper :: Set Char
> upper  = Set.fromList ['A' .. 'Z']

> -- | Word characters \w
> word :: Set Char
> word = upper <> lower <> digit <> Set.fromList "_"

> -- | The union of all of the above, plus punctuation
> -- but not including the newline characters
> -- similar to '.' in regexp libraries
> anyc :: Set Char
> anyc = word <> Set.fromList " \t!@#$%^&*()_+{}|:\"<>?~`-=[]\\;',./"

> ----------------------------------------------------

Examples and derived forms

A name is an upper case letter followed by a sequence of lowercase letters of any length. In a regular-expression library, we might write it as [A-Z][a-z]*. (The * post-fix operator corresponds to 0 or more occurrences of a pattern.)

Using our datatype above, we can express this regexp as:

> name :: RegExp
> name = Char upper `Seq` Star (Char lower)

> testName :: Test
> testName = "name" ~:
>   TestList[ accept name "Stephanie"         ~? "a good name"
>           , not (accept name "stephanie")   ~? "must be capitalized"
>           , not (accept name "Ste7anie")    ~? "no extra symbols"
>           , not (accept name "Steph Annie") ~? "not even spaces"
>           ]

We can also define regexps that only accept specific words. For example, given this function

> string :: String -> RegExp
> string "" = Empty
> string s  = foldr1 Seq (map (Char . Set.singleton) s)

we can define this RegExp to recognize the name of our favorite course.

> cis552 :: RegExp
> cis552 = string "cis552"

We can also use Star and Seq to define the plus operator, which corresponds to one or more occurrences of a pattern.

> plus :: RegExp -> RegExp
> plus pat = pat `Seq` Star pat

For example, this regular expression accepts any non-empty string that is surrounded by the tags <b> and </b>.

> boldHtml :: RegExp
> boldHtml = string "<b>" `Seq` plus (Char anyc) `Seq` string "</b>"

Before you go further, note that we have provided you with an instance of the Show type class for regexps that displays regexps succinctly (i.e. it tries to use abbreviations such as and for character classes). Try displaying the RegExps above and make sure that you understand the connection between their representation and how they are displayed by ghci.

  ghci> boldHtml

> ----------------------------------------------------

RegExp Acceptance

First, we'll write a straightforward (and extremely inefficient) operation, called accept, that determines whether a particular string is part of the regular language accepted by the given regular expression.

Begin by implementing the following two helper functions. (To practice with the list monad, you must use a list comprehension in each of these implementation.)

> -- | all decompositions of a string into two different pieces
> --     split "abc" == [("","abc"),("a","bc"),("ab","c"),("abc","")]
> split :: [a] -> [([a], [a])]
> split = error "split: unimplemented"

> -- | all decompositions of a string into multi-part (nonempty) pieces
> -- parts "abc" == [["abc"],["a","bc"], ["ab","c"], ["a","b","c"]]
> parts :: [a] -> [[[a]]]
> parts = error "parts: unimplemented"

Don't forget to write some quickCheck properties so that you can test your functions! For example, here's one that makes sure that we get the right number of pairs from split.

> prop_splitLength :: Eq a => [a] -> Bool
> prop_splitLength l = length (split l) == length l + 1

You can do better, though!

> -- Add a QuickCheck property for split

Note: we have to be careful when randomly testing parts; this is an exponential algorithm after all. The resize function allows us to bound the size of the lists that are produced so that they don't get too big. If you find that prop_partsLength takes too long, you can decrease the argument to resize.

> smallLists :: Gen [String]
> smallLists = resize 15 arbitrary

> prop_partsLength :: Property
> prop_partsLength = forAll smallLists $ \ l ->
>   length (parts l) == 2 ^ predNat (length l) where
>     predNat n = if n <= 0 then 0 else n - 1

> -- Add a QuickCheck property for parts

Now, use split and parts to determine whether a RegExp matches the given input string. Below, your implementation should simply explore all possibilities. For example, to determine whether the concatenation pattern Seq r1 r2 matches an input string, use split to compute all possible ways of splitting the string into two parts and see whether r1 and r2` match the two parts. Again, use list comprehensions as part of the design of your implementation. (For now, just ignore 'Mark' constructors in the input regular expressions.)

> -- | Decide whether the given regexp matches the given string
> accept :: RegExp -> String -> Bool
> accept (Mark _ _ r)  s = accept r s
> accept _           _ = error "accept: finish me"

> testAccept :: Test
> testAccept = "accept" ~: TestList [
>    not (accept Void "a") ~? "nothing is void",
>    not (accept Void "") ~? "really, nothing is void",
>    accept Empty "" ~? "accept Empty true",
>    not (accept Empty "a") ~? "not accept Empty",
>    accept (Char lower) "a" ~? "accept lower",
>    not (accept (Char lower) "A") ~? "not accept lower",
>    accept boldHtml "<b>cis552</b>!</b>" ~? "cis552!",
>    not (accept boldHtml "<b>cis552</b>!</b") ~? "no trailing" ]

> ----------------------------------------------------

QuickChecking `accept`

How can we use quick check to test our implementation of the accept function?

One idea is that we can use regular expressions in reverse---instead of using them to identify strings from a language, we can instead use them to generate strings from a language.

So, given a regexp, the accept function is correct if it returns true for all strings generated from the language of the regexp.

> prop_accept :: RegExp -> Property
> prop_accept r = 
>   case genRegExpString r of
>     Just g  -> forAll g $ accept r    -- if we can generate a random string,
>                                       -- then it should be accepted
>     Nothing -> property $ isVoid r    -- otherwise, we should have a RegExp
>                                       -- equivalent to 'Void'

Write a function that can generate a String that is accepted by the given RegExp. Note that this function is partial; some RegExps accept no strings and denote the empty language.

NOTE: in the case of Star, your function does not need to generate all strings accepted by the RegExps. Instead, put a bound on the number of iterations in your result. In particular, iterating more than twice could cause a large blow-up when quickChecking prop_accept.

> -- | Create a generator for the strings accepted by this RegExp (if any)
> genRegExpString :: RegExp -> Maybe (Gen String)
> genRegExpString r = undefined

> -- | Is this the regexp that never matches any string?
> -- (It may include marks (see next problem), because this regexp will never
> -- match anything.)
> isVoid :: RegExp -> Bool
> isVoid Void          = True
> isVoid (Seq r1 r2)   = isVoid r1 || isVoid r2
> isVoid (Alt r1 r2)   = isVoid r1 && isVoid r2
> isVoid (Star r)      = False
> isVoid (Mark _ _ r)  = isVoid r
> isVoid _             = False

Now complete an Arbitrary instance for regular expressions to test this property above. Your regexps should only contain the characters "abcd" and need not contain marked subexpressions. Note that you should avoid generating 'Char' with an empty set.

Make sure that you also implement the 'shrink' function too.

> instance Arbitrary RegExp where
>    arbitrary = undefined
>    shrink = undefined

> ----------------------------------------------------

Marked Subexpressions

Backtracking is not the most efficient implementation of regular expressions, but it is easy to extend.

One extension is support for marked subexpressions. For this part, you will rewrite accept, as patAccept so that it returns strings that are matched by the marked subexpressions in the regular expression.

For example, we can mark the part of the regular expression between the tags:

> boldHtmlPattern :: RegExp
> boldHtmlPattern = string "<b>" `Seq` mark "bold" (plus (Char anyc)) `Seq` string "</b>"

Or mark sequences of letters that correspond to the first and last names:

> namePattern :: RegExp
> namePattern = mark "first" name  `Seq` Star (Char white) `Seq` mark "last" name

Or mark any number of sequences of lowercase letters:

> stringsPattern :: RegExp
> stringsPattern = Star (mark "string" (plus (Char lower))
>                    `Seq` Star (Char white))

Then, patAccept below returns not only whether the pattern matches, but also all parts of the string that correspond to the marks.

> testPat :: Test
> testPat = "testPat" ~: TestList [
>     patAccept boldHtmlPattern "<b>cis552" ~?= Nothing,
>     patAccept boldHtmlPattern "<b>cis552!</b>" ~?=
>         Just (Map.fromList [("bold",["cis552!"])]),
>     patAccept boldHtmlPattern "<b>cis552</b>!</b>" ~?=
>         Just (Map.fromList [("bold",["cis552</b>!"])]),
>     patAccept namePattern "Haskell  Curry" ~?=
>         Just (Map.fromList [("first",["Haskell"]),("last",["Curry"])]),
>     patAccept stringsPattern "a    b c   d e" ~?=
>         Just (Map.fromList [("string",["a","b","c","d","e"])])
>   ]

Note that the above examples use a "smart constructor" for marking capture groups. The data constructor for Marked subexpressions includes a list of strings that we don't need now; it's needed for the last problem. You can just make it nil for now and ignore it for this part.

> -- Mark a subexpression for capture
> mark :: String -> RegExp -> RegExp
> mark s = Mark s []

Now implement the patAccept function. Note that the function returns a Maybe --- look for at least one place to use monad do-notation in its definition. (There won't be many.)

> -- | If the regexp matches the string, return a dictionary from
> -- named marks to the captured substrings
> patAccept :: RegExp -> String -> Maybe (Map String [String])
> patAccept = error "patAccept: unimplemented"

> ----------------------------------------------------

Simplifying Regular Expressions

You may notice that when working with regular expressions, there are several ways to describe the same set of strings. For example, the regular expression

a**|a**

which we would encode as

> astar :: RegExp
> astar = Star (Star a) `Alt` Star (Star a)

matches exaclt the same set of strings as a*, i.e. strings composed of any number of a's.

However, using the first regexp to test for acceptance takes a lot more time than the second, especially when the string doesn't match. For example, I found this comparison on my laptop:

 *RegExp> accept astar "aaaaaaaaaaaaaaab"
 False
 (5.87 secs, 4,278,663,696 bytes)
 *RegExp> accept (Star a) "aaaaaaaaaaaaaaab"
 False
 (0.07 secs, 50,671,104 bytes)

We can optimize code that works with RegExp through the use of "smart constructors". These smart constructor recognize simplifications that can be made when ever a regular expression is put together. Suppose we have "smart" variants of the star and alt regular expression constructors. Then we can form the regexp in the same way, but (sometimes) get better performance.

 *RegExp> let astar' = star (star a) `alt` star (star a)
 (0.00 secs, 68,904 bytes)
 *RegExp> accept astar' "aaaaaaaaaaaaaaab"
 False
 (0.07 secs, 50,670,984 bytes)

For example, here is a definition of smart constructor for star. This construct looks for simplifications that it can apply while constructing the output.

> star :: RegExp -> RegExp
> star r1 | isEmpty r1 = Empty
>   -- iterating the empty string is the empty string
> star r1 | isVoid r1  = Empty
>   -- zero or more occurrences of void is empty
> star (Star r) = Star r
>   -- two iterations is the same as one
> star r        = Star r
>   -- no optimization

> -- | Is this the regexp that accepts *only* the empty string
> -- (It cannot include marks because we want to eliminate it
> -- during optimization above.)
> isEmpty :: RegExp -> Bool
> isEmpty Empty         = True
> isEmpty (Seq r1 r2)   = isEmpty r1 && isEmpty r2
> isEmpty (Alt r1 r2)   = isEmpty r1 && isEmpty r2
> isEmpty (Star r)      = isEmpty r
> isEmpty _             = False

How do we know that our definition of star is really smart? We want to be sure that the regexp that it produces matches the same strings as its input.

We'll compare the optimized version with the original to make sure that they match the same strings. (We'll also make this a conditional property to make sure that we only do the test when the smart constructor actually modifies the string.)

> prop_star :: RegExp -> Property
> prop_star r = sr /= Star r ==> sr %==% Star r where
>    sr = star r

We can quickCheck our optimizations using accept and the following property. This property tests pairs of regexps on strings and ensures that either they are both accepted or both rejected. To make this property more efficient, it randomly selects strings that are either accepted by the first regexp, accepted by the second, or that contain arbitrary sequences of a's, b's, c's, and d's.

> -- | Property to determine whether two regexps accept the same language of
> -- strings
> (%==%) :: RegExp -> RegExp -> Property
> r1 %==% r2 = forAll (genString r1 r2) $
>                  \s -> accept r1 s == accept r2 s 
>   where
>     genString :: RegExp -> RegExp -> Gen String
>     genString r1 r2 = oneof $ Maybe.catMaybes
>       [ genRegExpString r1
>       , genRegExpString r2
>       , Just $ resize 10 (listOf (elements "abcd"))
>       ]

Now design and test similar optimizations for sequencing and alternation. (For inspiration, read the wikipedia page about Kleene Algebra.)

> -- | Smart constructor for `Seq`
> seq :: RegExp -> RegExp -> RegExp
> seq = undefined
> 
> prop_seq :: RegExp -> RegExp -> Property
> prop_seq r1 r2 = rs /= Seq r1 r2 ==> rs %==% Seq r1 r2 where
>      rs = seq r1 r2

> -- | Smart constructor for `Alt`
> alt :: RegExp -> RegExp -> RegExp
> alt = undefined
> 
> prop_alt :: RegExp -> RegExp -> Property
> prop_alt r1 r2 = rs /= Alt r1 r2 ==> rs %==% Alt r1 r2 where
>      rs = alt r1 r2

> ----------------------------------------------------

Regular Expression Derivatives

You may have noticed by now that this implementation of Regular Expression matching is really slow. Let's fix that.

The textbook way to implement regular expression matching is to first translate the regular expression into a finite-state machine and then apply the finite-state matching to the string.

However, there's a more direct, elegant, but not so well-known alternative, the method of derivatives due to Janusz A. Brzozowski. This method is described in more detail here.

The basic idea is that, given a regular expression and the first character in the input string to match, you can compute a new regular expressions, which must match the remaining string in order for the original RegExp to match the entire input string. This new regular expression is called the derivative of the original.

We can use this idea to implement regular expression matching by repeatedly calculating the derivatives for each character in the string. If the final result is a regular expression that accepts the empty string, then the original regular expression would have matched the string. In other words:

> -- | Determine whether the given regexp matches the given String
> match :: RegExp -> String -> Bool
> match r s = nullable (foldl' deriv r s)

Your job is to implement nullable and deriv to complete this implementation. Once again, you can ignore marked sub-expressions. (We'll return to them below.)

> -- | `nullable r` return `True` when `r` could match the empty string
> nullable :: RegExp -> Bool
> nullable (Mark _ _ r)  = nullable r 
> nullable _ = error "nullable: unimplemented"

> -- |  Takes a regular expression `r` and a character `c`,
> -- and computes a new regular expression that accepts word `w`
> -- if `cw` is accepted by `r`. Make sure to use the smart constructors
> -- above when you construct the new `RegExp`
> deriv :: RegExp -> Char -> RegExp
> deriv (Mark _ _ r) c = deriv r c   -- ignore marks completely
> deriv _ _ = error "deriv: unimplemented"

For example, if r is the literal character c, then the derivative of r is Empty, the regular expression that only accepts the empty string. In the case of Seq, you need to think about the case where the first regular expression could accept the empty string. In that case, the derivative should include the possibility that it could be skipped, and the character consumed by the second regexp.

Note that Haskell's lazy evaluation avoids the evaluation of the whole regular expression. The expression has only to be evaluated as much as nullable needs to calculate an answer.

> -- Don't forget to test 'match'

> ----------------------------------------------------

Derivatives with Marked Sub-expressions

So far, the code above didn't include marked sub-expressions. We can extend our derivative-based matching routine by not dropping the marks, but by keeping them around in the derived regexp itself.

With this strategy, we will do the same thing as above, but this time, after repeatedly calculating the derivative with respect to each character in the String, we will use the extract function to determine, not just whether the resulting regexp is nullable, but what strings it has matched along the way.

> -- | matching using derivatives, with marked subexpressions
> patMatch :: RegExp -> String -> Maybe (Map String [String])
> patMatch r w = extract (foldl' markedDeriv r w)

> -- | Determine whether the regexp matches the empty string, and, if so,
> -- extract any saved matches.
> extract :: RegExp -> Maybe (Map String [String])
> extract _ = undefined

> -- | Calculate the derivative of the regular expression, storing each character
> -- in the mark data structure.
> markedDeriv :: RegExp -> Char -> RegExp
> markedDeriv _ _ = undefined

> testPatMatch :: Test
> testPatMatch = "patMatch" ~: TestList [
>     patMatch boldHtmlPattern "<b>cis552" ~?= Nothing,
>     patMatch boldHtmlPattern "<b>cis552!</b>" ~?=
>         Just (Map.fromList [("bold",["cis552!"])]),
>     patMatch boldHtmlPattern "<b>cis552</b>!</b>" ~?=
>         Just (Map.fromList [("bold",["cis552</b>!"])]),
>     patMatch namePattern "Haskell  Curry" ~?=
>         Just (Map.fromList [("first",["Haskell"]),("last",["Curry"])]),
>     patMatch stringsPattern "a    b c   d e" ~?=
>         Just (Map.fromList [("string",["a","b","c","d","e"])])
>   ]

> -------------------------------------------------

Showing Regular Expressions

The following code is meant to help you display regular expressions. You do not need to edit it.

> -- | display a set of characters succinctly, and escape all special characters
> showCharSet :: Set Char -> ShowS
> showCharSet s 
>    | s == lower = showString "\\l" 
>    | s == upper = showString "\\u" 
>    | s == digit = showString "\\d" 
>    | s == white = showString "\\s" 
>    | s == word  = showString "\\w"
>    | s == anyc  = showString "."
>    | Set.size s == 1  = showString (concatMap escape (Set.elems s))
>    | otherwise        = showString "[" .
>          showString (concatMap escape (Set.elems s)) . showString "]"

> -- | Add slashes in front of standard regexp characters
> escape :: Char -> String
> escape c
>    | c `elem` "[\\^$.|?*+()" = ['\\', c]
>    | otherwise               = [c]

> -- | display a mark in a regexp
> showMark :: String -> [String] -> String -> String
> showMark n []
>   = showString n
> showMark n [k]
>   = showString n . showString "=" . showString k
> showMark n ks
>   = showString n . showString "=" . shows ks

> -- | Display a regexp, using precedence to reduce the number of required
> -- parentheses in the output.
> instance Show RegExp where
>    -- special case for '+'
>    showsPrec p (Seq r1 (Star r2)) | r1 == r2 =
>      showParen (p > 6) $ showsPrec 6 r1 . showString "+"
>    showsPrec p (Char s)    = showCharSet s
>    showsPrec p (Alt r1 r2) =
>      showParen (p > 7) $ showsPrec 7 r1 . showString "|" . showsPrec 7 r2
>    showsPrec p (Seq r1 r2) =
>      showParen (p > 10) $ showsPrec 10 r1 . showsPrec 10 r2
>    showsPrec p (Star r)    =
>      showParen (p > 6) $ showsPrec 6 r . showString "*"
>    showsPrec p Empty       = showString "ε"
>    showsPrec p Void        = showString "∅"
>    showsPrec p (Mark m k r)  =
>       showParen (p > 5) $ 
>         showString "?P<" . showMark m k  . showString ">" . showsPrec 5 r

CIS 552: Advanced Programming

Fall 2019

Regular Expressions

Characters and Character sets

Examples and derived forms

RegExp Acceptance

QuickChecking `accept`

Marked Subexpressions

Simplifying Regular Expressions

Regular Expression Derivatives

Derivatives with Marked Sub-expressions

Showing Regular Expressions

CIS 552: Advanced Programming

Fall 2019

Regular Expressions

Characters and Character sets

Examples and derived forms

RegExp Acceptance

QuickChecking accept

Marked Subexpressions

Simplifying Regular Expressions

Regular Expression Derivatives

Derivatives with Marked Sub-expressions

Showing Regular Expressions

QuickChecking `accept`