Eventually, the complete version will be made available.
In-class exercise: XML parsing
In today's exercise you will use the definitions from the Parsers lecture to build a simple parser for XML data.
This exercise is based on the following definitions from the Parsers
lecture. Make sure that you have downloaded the solution.
> import Parsers (Parser, satisfy, char, string, doParse, filter)
> -- Note that this line imports these functions as well as the Parser instances
> -- for the Functor, Applicative and Alternative classes.
Note, you can also use the files distributed with hw06 by replacing the above line with the following two:
Your goal: produce this structured data from a string
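> -- | A simplified datatype for storing XML
> data SimpleXML =
>     PCDATA String
>   | Element ElementName [SimpleXML]
>   deriving Show
>
> -- | Element names are assumed to be plain strings, consistent with the sample output below
> type ElementName = String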
First: the characters /, <, and > are not allowed to appear in tags or PCDATA. Let's define a function that recognizes them.
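One possible definition is sketched below. The name reserved is an assumption (the original stub is not shown here); the function simply tests membership in the three-character set named above.

```haskell
-- | True exactly for the characters that may not appear in tags or PCDATA.
-- The name `reserved` is hypothetical; use whatever name your stub has.
reserved :: Char -> Bool
reserved c = c `elem` "/<>"
```

With this definition, reserved '<' is True and reserved 'a' is False.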
Use this definition to parse a maximal nonempty sequence of nonreserved characters:
Xml> doParse text "skhdjf"
[("skhdjf","")]
Xml> doParse text "akj<skdfsdhf"
[("akj","<skdfsdhf")]
Xml> doParse text ""
[]
Then use this definition to parse nonreserved characters into XML.
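The two parsers described above could be sketched as follows. To keep the sketch runnable on its own, it defines a minimal stand-in for the lecture's Parser type; that stand-in, the constructor name P, the helper reserved, and the extra Eq instance are all assumptions, not the contents of the Parsers module. In the exercise itself you would import Parsers and keep only the last three definitions.

```haskell
import Control.Applicative (Alternative (..))

-- Minimal stand-in for the lecture's Parser (an assumption, see above).
newtype Parser a = P {doParse :: String -> [(a, String)]}

instance Functor Parser where
  fmap f (P p) = P $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure x = P $ \s -> [(x, s)]
  P pf <*> P pa = P $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Alternative Parser where
  empty = P (const [])
  P p <|> P q = P $ \s -> take 1 (p s ++ q s) -- left-biased choice

-- Consume one character if it satisfies the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = P $ \s -> case s of
  (c : cs) | ok c -> [(c, cs)]
  _ -> []

type ElementName = String

-- Eq is derived only so that results can be compared in tests.
data SimpleXML
  = PCDATA String
  | Element ElementName [SimpleXML]
  deriving (Show, Eq)

-- Characters that may not appear in tags or PCDATA (hypothetical name).
reserved :: Char -> Bool
reserved c = c `elem` "/<>"

-- A maximal nonempty run of nonreserved characters.
-- `some` fails on zero matches, which gives the [] result on empty input.
text :: Parser String
text = some (satisfy (not . reserved))

-- Wrap the parsed text in the PCDATA constructor.
pcdata :: Parser SimpleXML
pcdata = PCDATA <$> text
```

Under these assumptions, doParse text "akj<skdfsdhf" yields [("akj","<skdfsdhf")] and doParse text "" yields [], matching the session above.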
Parse an empty element, like "<br/>"
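A sketch of one way to recognize an empty element is given below. The name emptyContainer is an assumption, as are the stand-in Parser type, the helper reserved, and the local definitions of char and string (the real ones come from the Parsers module). The idea is to match '<', read the element name as a run of nonreserved characters, and then require the literal "/>".

```haskell
import Control.Applicative (Alternative (..))

-- Minimal stand-in for the lecture's Parser (an assumption).
newtype Parser a = P {doParse :: String -> [(a, String)]}

instance Functor Parser where
  fmap f (P p) = P $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure x = P $ \s -> [(x, s)]
  P pf <*> P pa = P $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Alternative Parser where
  empty = P (const [])
  P p <|> P q = P $ \s -> take 1 (p s ++ q s) -- left-biased choice

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = P $ \s -> case s of
  (c : cs) | ok c -> [(c, cs)]
  _ -> []

-- Match one specific character, or one specific string.
char :: Char -> Parser Char
char c = satisfy (== c)

string :: String -> Parser String
string = traverse char

type ElementName = String

data SimpleXML
  = PCDATA String
  | Element ElementName [SimpleXML]
  deriving (Show, Eq) -- Eq only for comparing results

reserved :: Char -> Bool
reserved c = c `elem` "/<>"

text :: Parser String
text = some (satisfy (not . reserved))

-- '<', then the element name, then the literal "/>"; no children.
emptyContainer :: Parser SimpleXML
emptyContainer = (\name -> Element name []) <$> (char '<' *> text <* string "/>")
```

Under these assumptions, doParse emptyContainer "<br/>" yields [(Element "br" [],"")], while "<br>" is rejected because the "/>" suffix is missing.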
Parse a container element: this consists of an open tag, a (potentially empty) sequence of content parsed by p, and a closing tag. For example, container pcdata should recognize <br></br> or <title>A midsummer night's dream</title>. You do NOT need to make sure that the closing tag matches the open tag.
Xml> doParse (container pcdata) "<br></br>"
[(Element "br" [],"")]
Xml> doParse (container pcdata) "<title>A midsummer night's dream</title>"
[(Element "title" [PCDATA "A midsummer night's dream"],"")]
-- This should also work, even though the tag is wrong
Xml> doParse (container pcdata) "<title>A midsummer night's dream</br>"
[(Element "title" [PCDATA "A midsummer night's dream"],"")]
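The behaviour shown in this session could be obtained with a sketch like the following. It is one possible solution, not the official one; the stand-in Parser type, the helpers reserved, openTag and closeTag, and the local char and string are assumptions made so the block runs on its own. The closing tag's name is parsed and then discarded, which is why a mismatched tag is still accepted.

```haskell
import Control.Applicative (Alternative (..))

-- Minimal stand-in for the lecture's Parser (an assumption).
newtype Parser a = P {doParse :: String -> [(a, String)]}

instance Functor Parser where
  fmap f (P p) = P $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure x = P $ \s -> [(x, s)]
  P pf <*> P pa = P $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Alternative Parser where
  empty = P (const [])
  P p <|> P q = P $ \s -> take 1 (p s ++ q s) -- left-biased choice

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = P $ \s -> case s of
  (c : cs) | ok c -> [(c, cs)]
  _ -> []

char :: Char -> Parser Char
char c = satisfy (== c)

string :: String -> Parser String
string = traverse char

type ElementName = String

data SimpleXML
  = PCDATA String
  | Element ElementName [SimpleXML]
  deriving (Show, Eq) -- Eq only for comparing results

reserved :: Char -> Bool
reserved c = c `elem` "/<>"

text :: Parser String
text = some (satisfy (not . reserved))

pcdata :: Parser SimpleXML
pcdata = PCDATA <$> text

-- Open tag, zero or more items parsed by p, then any closing tag.
-- The closing tag's name is read but thrown away by (<*).
container :: Parser SimpleXML -> Parser SimpleXML
container p = Element <$> openTag <*> many p <* closeTag
  where
    openTag = char '<' *> text <* char '>'
    closeTag = string "</" *> text <* char '>'
```

Because pcdata always consumes at least one character, many pcdata terminates and returns an empty list for <br></br>.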
Now put the above together to construct a parser for simple XML data:
Xml> doParse xml "<body>a</body>"
[(Element "body" [PCDATA "a"],"")]
Xml> doParse xml "<body><h1>A Midsummer Night's Dream</h1><h2>Dramatis Personae</h2>THESEUS, Duke of Athens.<br/>EGEUS, father to Hermia.<br/></body>"
[(Element "body" [Element "h1" [PCDATA "A Midsummer Night's Dream"],Element "h2" [PCDATA "Dramatis Personae"],PCDATA "THESEUS, Duke of Athens.",Element "br" [],PCDATA "EGEUS, father to Hermia.",Element "br" []],"")]
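Putting the pieces together might look like the sketch below. Everything other than the names Parser, satisfy, char, string, doParse, pcdata, container and xml is an assumption: the stand-in Parser type replaces the Parsers module so the block runs alone, and emptyContainer and reserved are hypothetical names. The parser xml tries a container whose content is itself xml, then an empty element, then plain text.

```haskell
import Control.Applicative (Alternative (..))

-- Minimal stand-in for the lecture's Parser (an assumption).
newtype Parser a = P {doParse :: String -> [(a, String)]}

instance Functor Parser where
  fmap f (P p) = P $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure x = P $ \s -> [(x, s)]
  P pf <*> P pa = P $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Alternative Parser where
  empty = P (const [])
  P p <|> P q = P $ \s -> take 1 (p s ++ q s) -- left-biased choice

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = P $ \s -> case s of
  (c : cs) | ok c -> [(c, cs)]
  _ -> []

char :: Char -> Parser Char
char c = satisfy (== c)

string :: String -> Parser String
string = traverse char

type ElementName = String

data SimpleXML
  = PCDATA String
  | Element ElementName [SimpleXML]
  deriving (Show, Eq) -- Eq only for comparing results

reserved :: Char -> Bool
reserved c = c `elem` "/<>"

text :: Parser String
text = some (satisfy (not . reserved))

pcdata :: Parser SimpleXML
pcdata = PCDATA <$> text

emptyContainer :: Parser SimpleXML
emptyContainer = (\name -> Element name []) <$> (char '<' *> text <* string "/>")

container :: Parser SimpleXML -> Parser SimpleXML
container p = Element <$> openTag <*> many p <* closeTag
  where
    openTag = char '<' *> text <* char '>'
    closeTag = string "</" *> text <* char '>'

-- Recursive: the content of a container is again a sequence of xml items.
xml :: Parser SimpleXML
xml = container xml <|> emptyContainer <|> pcdata
```

The recursion is safe here because every alternative consumes at least one character before xml is tried again, so many xml cannot loop on unchanged input.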
Now let's try it on something a little bigger. How about sample.html from hw02?
> -- | Run a parser on a particular input file
> -- (openFile, hGetContents and ReadMode come from System.IO)
> parseFromFile :: Parser a -> String -> IO [(a,String)]
> parseFromFile parser filename = do
>   handle <- openFile filename ReadMode
>   str <- hGetContents handle
>   return $ doParse parser str
Challenge: rewrite container so that it only succeeds when the closing tag matches the opening tag.
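One way to approach the challenge with only the Functor, Applicative and Alternative instances is to parse both tag names, keep them together with the content, and then discard results whose names differ. The sketch below does this with a hypothetical filterP, assumed to behave like the filter imported from Parsers (keep only results that satisfy a predicate). The names matchingContainer and filterP, the stand-in Parser type, and the helper reserved are all assumptions; treat this as one possible solution rather than the intended one.

```haskell
import Control.Applicative (Alternative (..))

-- Minimal stand-in for the lecture's Parser (an assumption).
newtype Parser a = P {doParse :: String -> [(a, String)]}

instance Functor Parser where
  fmap f (P p) = P $ \s -> [(f a, rest) | (a, rest) <- p s]

instance Applicative Parser where
  pure x = P $ \s -> [(x, s)]
  P pf <*> P pa = P $ \s -> [(f a, s2) | (f, s1) <- pf s, (a, s2) <- pa s1]

instance Alternative Parser where
  empty = P (const [])
  P p <|> P q = P $ \s -> take 1 (p s ++ q s) -- left-biased choice

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = P $ \s -> case s of
  (c : cs) | ok c -> [(c, cs)]
  _ -> []

char :: Char -> Parser Char
char c = satisfy (== c)

string :: String -> Parser String
string = traverse char

-- Hypothetical stand-in for the lecture's filter: drop results failing the test.
filterP :: (a -> Bool) -> Parser a -> Parser a
filterP ok (P p) = P $ \s -> [(a, rest) | (a, rest) <- p s, ok a]

type ElementName = String

data SimpleXML
  = PCDATA String
  | Element ElementName [SimpleXML]
  deriving (Show, Eq) -- Eq only for comparing results

reserved :: Char -> Bool
reserved c = c `elem` "/<>"

text :: Parser String
text = some (satisfy (not . reserved))

pcdata :: Parser SimpleXML
pcdata = PCDATA <$> text

-- Like container, but succeeds only when both tag names are equal.
matchingContainer :: Parser SimpleXML -> Parser SimpleXML
matchingContainer p = build <$> filterP tagsAgree parts
  where
    parts = (,,) <$> openTag <*> many p <*> closeTag
    tagsAgree (open, _, close) = open == close
    build (open, body, _) = Element open body
    openTag = char '<' *> text <* char '>'
    closeTag = string "</" *> text <* char '>'
```

Under these assumptions, doParse (matchingContainer pcdata) "<title>x</title>" yields [(Element "title" [PCDATA "x"],"")], while "<title>x</br>" yields []. If the lecture's Parser also has a Monad instance, binding the open tag's name and then requiring string on that exact name is an alternative design.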