undefined
.
Eventually, the complete
version will be made available.
Parsing with Applicative Functors
> import Prelude hiding (filter)
> import Data.Char
> import Text.Read (readMaybe)
> import Control.Applicative
What is a Parser?
"A parser for things
-- Graham Hutton
Is a function from strings
To lists of pairs
Of things and strings."
A parser is a piece of software that takes a raw String
(or sequence of bytes) and returns some structured object -- for example, a list of options, an XML tree or JSON object, a program's Abstract Syntax Tree, and so on. Parsing is one of the most basic computational tasks. For example:
- Shell Scripts (command-line options)
- Web Browsers (duh!)
- Games (level descriptors)
- Routers (packets)
- etc.
(Indeed I defy you to find any serious system that does not do some parsing somewhere!)
The simplest way to think of a parser is as a function -- i.e., its type should be roughly this:
Composing Parsers
The usual way to build a parser is by specifying a grammar and using a parser generator (e.g., yacc, bison, antlr) to create the actual parsing function. Despite its advantages, one major limitation of the grammar-based approach is its lack of modularity. For example, suppose we have two kinds of primitive values, Thingy
and Whatsit
.
Thingy : ...rule... { ...action... } ;
Whatsit : ...rule... { ...action... } ;
If we want a parser for sequences of Thingy
and Whatsit
we have to painstakingly duplicate the rules:
Thingies : Thingy Thingies { ... }
EmptyThingy { ... } ;
Whatsits : Whatsit Whatsits { ... }
EmptyWhatsit { ... } ;
That is, the languages in which parsers are usually described are lacking in features for modularity and reuse.
In this lecture, we will see how to compose mini-parsers for sub-values to get bigger parsers for complex values.
To do so, we will generalize the above parser type a little bit, by noting that a (sub-)parser need not (indeed, in general will not) consume all of its input, in which case we need to have the parser return the unconsumed part of its input:
Of course, it would be silly to have different types for parsers for different kinds of structured objects, so we parameterize the Parser
type over the type of structured object that it returns:
One last generalization is to allow a parser to return a list of possible parse results, where the empty list corresponds to a failure to parse:
As the last step, let's wrap this type definition up as a newtype
and define a record component to let us conveniently extract the parser:
This type definition will make sure that we keep parsers distinct from other values of this type and, more importantly, will allow us to make parsers an instance of one or more typeclasses, if this turns out to be convenient (see below!).
Below, we will define a number of operators on the Parser
type, which will allow us to build up descriptions of parsers compositionally. The actual parsing happens when we use a parser by applying it to an input string, using doParse
.
Now, the parser type might remind you of something else... Remember this?
Indeed, a parser, like a state transformer, is a monad! There are good definitions of the return
and (>>=)
functions.
However, most of the time, we don't need the full monadic structure for parsing. Just deriving the applicative operators for this type will allow us to parse any context-free grammar. So in today's lecture, keep your eye out for applicative structure for this type.
Now all we have to do is build some parsers!
We'll start with some primitive definitions, and then generalize.
Parsing a Single Character
Here's a very simple character parser that returns the first Char
from a (nonempty) string. Recall the parser type:
So we need a function that pattern matches its argument, and pulls out the first character of the string, should it exist.
Try it out!
See if you can modify the above to produce a parser that looks at the first char of a (nonempty) string and interprets it as an int. (Hint: remember the readMaybe
function.)
ghci> doParse oneDigit "1"
[(1,"")]
ghci> doParse oneDigit "12"
[(1,"2")]
ghci> doParse oneDigit "hey!"
[]
And here's a parser that looks at the first char of a string and interprets it as the unary negation operator, if it is '-'
, and an identity function if it is '+'
.
> oneOp :: Parser (Int -> Int)
> oneOp = P $ \s -> case s of
> ('-' : cs) -> [ (negate, cs) ]
> ('+' : cs) -> [ (id, cs) ]
> _ -> []
Can we generalize this pattern? What if we pass in a function that specifies whether the character is of interest? The satisfy
function constructs a parser that succeeds if the first character satisfies the predicate.
SPOILER SPACE
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Here's how I implemented satisfy
, taking advantage of the list monad.
> satisfy' :: (Char -> Bool) -> Parser Char
> satisfy' f = P $ \s -> [ (c,cs) | (c,cs) <- doParse get s, f c ]
With this implementation, we can see that we can generalize again, so that it works for any parser, not just get
...
> filter :: (a -> Bool) -> Parser a -> Parser a
> filter f p = P $ \s -> [ (c,cs) | (c, cs) <- doParse p s, f c ]
Parser is a Functor
The name filter
is directly inspired by the filter
function for lists. And indeed, just like we can think of [a]
as a way to get values of type a
, we can likewise think of Parser a
as a way to potentially get a value of type a
.
So, are there other list-like operations that our parsers should support?
Of course! Like lists, the type constructor Parser
is a functor.
With get
, satisfy
, filter
, and fmap
, we now have a small library to build new (single character) parsers.
For example, we can write some simple parsers for particular sorts of characters. The following definitions parse alphabetic and numeric characters respectively.
And now we can use fmap
to rewrite oneDigit
:
> oneDigit' :: Parser Int
> oneDigit' = cvt <$> digitChar where -- fmap!
> cvt :: Char -> Int
> cvt c = ord c - ord '0'
Finally, finish this parser that should parse just one specific Char
:
Parser Composition
What if we want to parse more than one character from the input?
Using get
we can write a composite parser that returns a pair of the first two Char
values from the front of the input string:
> twoChar0 :: Parser (Char, Char)
> twoChar0 = P $ \s -> do (c1, cs) <- doParse get s
> (c2, cs') <- doParse get cs
> return ((c1,c2), cs')
More generally, we can write a parser combinator that takes two parsers and returns a new parser that uses first one and then the other and returns the pair of resulting values...
and use that to rewrite twoChar
more elegantly like this:
Parser is an Applicative Functor
Suppose we want to parse two characters, where the first should be a sign and the second a digit?
We've already defined single character parsers that should help. We just need to put them together.
And we put them together in a way that looks a bit like fmap
above. However, instead of passing in the function as a parameter, we get it via parsing.
> signedDigit0 :: Parser Int
> signedDigit0 = P $ \ s -> do (f, cs) <- doParse oneOp s
> (x, cs') <- doParse oneDigit cs
> return (f x, cs')
Can we generalize this pattern? What is the type when oneOp
and oneDigit
are arguments to the combinator?
Does this type look familiar?
Whoa! That is the type of the (<*>)
operator from the Applicative
class. What does this combinator do? It grabs all function values out of the first parser and then grabs all of the arguments (using the remaining part of the string) from the second parser, and then return all of the applications.
What about pure
?
The definition of pure
is very simple -- we can let the types guide us. This parser produces a specific character without consuming any of the input string.
So we can put these two definitions together in our class instance.
Let's go back and reimplement our examples with the applicative combinators:
ghci> doParse twoChar "hey!"
ghci> doParse twoChar ""
ghci> doParse signedDigit "-1"
ghci> doParse signedDigit "+3"
Now we're picking up speed. First, we can use our combinators to rewrite our more general pairing parser (pairP
) like this:
Or, more idiomatically, we can replace pure f <*>
with f <$>
. (The hlint
tool will suggest this rewrite to you.)
We can even dip into the Control.Applicative
library and write pairP
even more succinctly using this liftA2
combinator:
And, Control.Applicative
gives us even more options for constructing parsers. For example, it also includes a definition of liftA3
.
The *>
and <*
operators are also defined in Control.Applicative
. The first is the Applicative
analogue of the (>>)
operator for Monads
.
-- sequence actions, discarding the value of the first action
(*>) :: Applicative f => f a -> f b -> f b
The second is the dual to the first---it keeps the first result but discards the second.
Here's an example of a parser that uses both operators. When we parse something surrounded by parentheses, don't want to keep either the opening or closing characters.
> -- | Parse something surrounded by parentheses
> parenP :: Parser a -> Parser a
> parenP p = char '(' *> p <* char ')'
Monadic Parsing
Although we aren't going to emphasize it in this lecture, the Parser
type is also a Monad
. Just like State
and list, we can make Parser
an instance of the Monad
type class. To make sure that you get practice with the applicative operators, such as <*>
, we won't do that here. However, for practice, see if you can figure out an appropriate definition of (>>=)
.
Recursive Parsing
However, to parse more interesting things, we need to add some kind of recursion to our combinators. For example, it's all very well to parse individual characters (as in char
above), but it would a lot more fun if we could recognize whole String
s.
Let's try to write it!
> string :: String -> Parser String
> string "" = pure ""
> string (x:xs) = liftA2 (:) (char x) (string xs)
Much better!
For fun, try to write string
using foldr
for the list recursion.
Furthermore, we can use natural number recursion to write a parser that grabs n
characters from the input:
> grabn :: Int -> Parser String
> grabn n = if n <= 0 then pure "" else liftA2 (:) get (grabn (n-1))
Nondeterministic Choice
The Applicative
operators give us sequential composition of parsers (i.e. run one parser then another). But what about parallel composition?
Let's write a combinator that takes two sub-parsers and nondeterministically chooses between them.
How to write it? Well, we want to return a successful parse if either parser succeeds. Since our parsers return multiple values, we can simply return the union of all the results!
We can use the above combinator to build a parser that returns either an alphabetic or a numeric character
With chooseP
, if both parsers succeed, then we get back all the results. For example, with this parser
we will get back two results if both parses are possible,
and only one if the other is not possible:
Even with just these rudimentary parsers that we have at our disposal, we can start doing some interesting things. For example, here is a little calculator. First, we parse arithmetic operations as follows:
> intOp = plus `chooseP` minus `chooseP` times `chooseP` divide
> where plus = char '+' *> pure (+)
> minus = char '-' *> pure (-)
> times = char '*' *> pure (*)
> divide = char '/' *> pure div
(Can you guess the type of the above parser? Ask ghci if you are unsure.) Then we parse simple expressions by parsing a digit or parsing a digit followed by an operator and another calculation.
> infixAp :: Applicative f => f a -> f (a -> b -> c) -> f b -> f c
> infixAp = liftA3 (\i1 o i2 -> i1 `o` i2)
This parser, when run, will perform both parsing and calculation. It will also produce all of the intermediate results.
ghci> doParse calc "8/2"
[(8,"/2"),(4,"")]
ghci> doParse calc "8+2+3"
[(8,"+2+3"),(10,"+3"),(13,"")]
Parsing With a List of Sub-Parsers
Let's write a combinator that takes a parser p
that returns an a
and constructs a parser that recognizes a sequence of strings (each recognized by p
) and returns a list of a
values. That is, it keeps grabbing a
values as long as it can and returns them as [a]
.
We can do this by writing a parser that either parses one thing (if possible) and then calls itself recursively or succeeds without consuming any input. In either case, the result is a list. Fill in the first part of the definition below.
Beware: This parser can yield a lot of results!
ghci> doParse (manyP oneDigit) "12345a"
[([1,2,3,4,5],"a"),([1,2,3,4],"5a"),([1,2,3],"45a"),([1,2],"345a"),([1],"2345a"),([],"12345a")]
What happens if we swap the order of the arguments to chooseP
?
Deterministic Choice
Often, what we want is a single result, not a long list of potential results. To get this, we want a deterministic choice combinator. This combinator returns no more than one result — i.e., it runs the choice parser but discards extra results.
> chooseFirstP :: Parser a -> Parser a -> Parser a
> chooseFirstP p1 p2 = P $ \s -> take 1 (doParse (p1 `chooseP` p2) s)
Note that this parser is extremely sensitive to the order of the arguments. If p1
produces output, we will never try to parse with p2
.
We can use deterministic choice and failure together to make the Parser
type an instance of the Alternative
type class from Control.Applicative.
The Alternative
type class has two methods:
where empty
is an applicative computation with zero results, and (<|>)
, a "choice" operator that combines two computations. The Alternative
type class laws require the choice operator to be associative but it need not be commutative (and it isn't here).
The empty
computation should be an identity for the choice operator. In other words we should have
empty <|> a === a
and
a <|> empty === a
For parsers, this means that we need to have a failure parser that never parses anything (i.e. one that always returns []
):
Putting these two definitions together gives us the Alternative instance.
> instance Alternative Parser where
> empty = failP -- always fail
> (<|>) = chooseFirstP -- try the left parser, if that fails then try the right
The Alternative
type class automatically gives definitions for functions many
and some
, defined in terms of (<|>)
.
The many
operation corresponds to running the applicative computation zero or more times, whereas some
runs the computation one or more times. Both return their results in a list.
some :: Alternative f => f a -> f [a] --- result list is guaranteed to be nonempty
some v = (:) <$> v <*> many v
For parsing, the many
combinator returns a single, maximal sequence produced by iterating the given parser, zero or more times
ghci> doParse (many digitChar) "12345a"
[("12345","a")]
ghci> doParse (many digitChar) ""
[("","")]
ghci> doParse (some digitChar) "12345a"
[("12345","a")]
ghci> doParse (some digitChar) ""
[]
This sequence is maximal because the definition of many
tries some v
before returning []
. If the definition had been the other way around, then the result would always be the empty list (because pure []
always succeeds).
Let's use some
to write a parser that will return an entire natural number (not just a single digit.)
> oneNat :: Parser Int
> oneNat = fmap read (some digitChar) -- know that read will succeed because input is all digits
Challenge (will not be on the quiz): use the Alternative
operators to implement a parser that parses zero or more occurrences of p
, separated by sep
.
ghci > doParse (sepBy oneNat (char ',')) "1,12,0,3"
[([1,12,0,3],"")]
ghci> doParse (sepBy oneNat (char ',')) "1"
[([1],"")]
ghci > doParse (sepBy oneNat (char ',')) "1,12,0,"
[([1,12,0],",")]
ghci > doParse (sepBy oneNat (char '8')) "888"
[([888],"")]
ghci > doParse (sepBy (char '8') (char '8')) "888"
[("88","")]
ghci > doParse (sepBy oneNat (char ',')) ""
[([],"")]
Parsing Arithmetic Expressions
Now let's use the above to build a small calculator that parses and evaluates arithmetic expressions. In essence, an expression is either a binary operand applied to two sub-expressions or else an integer. We can state this as:
This works pretty well...
But things get a bit strange with minus:
Huh? Well, if you look back at the code, you'll realize the above was parsed as
because in each binExp
we require the left operand to be an integer. In other words, we are assuming that each operator is right associative hence the above result. Making this parser left associative is harder than it looks — we can't just swap oneNat
and 'calc1' above. (Why not?)
Furthermore, things also get a bit strange with multiplication:
This string is parsed as:
But the rules of precedence state that multiplication should bind tighter that addition. So even if we solve the associativity problem we'll still need to be careful.
Precedence
We can introduce precedence into our parsing by stratifying the parser into different levels. Here, let's split our binary operations into addition-like and multiplication-like ones.
Now, we can stratify our language into mutually recursive sub-languages, where each top-level expression is parsed first as an addition expression (addE
) starting with a multiplication expressions (mulE
). Multiplication expressions must then start with a basic factors: either natural numbers or arbitrary expressions inside parentheses.
Now our parser is still right associative, but multiplication binds tighter than addition.
Do you understand why the first parse returned 121
?
Parsing Pattern: Associativity via Chaining
But we're still not done: we need to fix the associativity problem.
Ugh! I hope you understand why: it's because the above was parsed as 10 - (1 - 1)
(right associative) and not (10 - 1) - 1
(left associative). You might be tempted to fix that simply by flipping the order in infixAp
, thus
but this would be disastrous. Can you see why? The parser for addE
directly (recursively) calls itself without consuming any input! Thus, it goes off the deep end and never comes back.
Let's take a closer look at what is going on with our current definitions. In essence, an addE
is of the form:
mulE + ( mulE + ( mulE + ... mulE ))
That is, we keep chaining together mulE
values and adding them for as long as we can. Similarly a mulE
is of the form
factorE * ( factorE * ( factorE * ... factorE ))
where we keep chaining factorE
values and multiplying them for as long as we can.
Instead, we want to parse the input as starting with a multiplication expression followed by any number of addition operators and multiplication expressions. We can temporarily store the operators and expressions in a list of pairs. Then, we'll foldl
over this list, using each operator to combine the current result with the next number.
> addE1 :: Parser Int
> addE1 = process <$> first <*> rest where
>
> process :: Int -> [(IntOp, Int)] -> Int
> process = foldl comb
>
> comb :: Int -> (IntOp, Int) -> Int
> comb x (op,y) = x `op` y
> -- parse any number of `addOp`s followed
> -- by a multiplication expression
> -- return the result in a list of tuples
> rest :: Parser [(IntOp, Int)]
> rest = many ((,) <$> addOp <*> mulE1)
> mulE1 :: Parser Int
> mulE1 = foldl comb <$> factorE1 <*> rest where
> comb x (op,y) = x `op` y
> rest = many ((,) <$> mulOp <*> factorE1)
The above is indeed left associative:
Also, it is very easy to spot and bottle the chaining computation pattern: the only differences are the base parser (mulE1
vs factorE1
) and the binary operation (addOp
vs mulOp
). We simply make those parameters to our chain-left combinator:
> chainl1 :: Parser Int -> Parser IntOp -> Parser Int
> p `chainl1` pop = foldl comb <$> p <*> rest where
> comb x (op,y) = x `op` y
> rest = many ((,) <$> pop <*> p)
after which we can rewrite the grammar in three lines:
> addE2 = mulE2 `chainl1` addOp
> mulE2 = factorE2 `chainl1` mulOp
> factorE2 = parenP addE2 <|> oneNat
ghci> doParse addE2 "10-1-1"
[(8,"")]
ghci> doParse addE2 "10*2+1"
[(21,"")]
ghci> doParse addE2 "10+2*1"
[(12,"")]
Of course, we can generalize chainl1
even further so that it is not specialized to parsing Int
expressions. Try to update the type above so that it is more polymorphic.
This concludes our exploration of applicative parsing, but what we've covered is really just the tip of an iceberg. Though parsing is a very old problem, studied since the dawn of computing, algebraic structures in Haskell bring a fresh perspective that has now been transferred from Haskell to many other languages.