undefined
.
CIS 552 students should be able to access this code through
github. Eventually, the
completed version will be available.
In class exercise: XML parsing
In today's exercise you will use the definitions from the Parsers
lecture to build a simple parser for XML
data.
This exercise is based on definitions from the Parsers
lecture, summarized by the module ParserCombinators
. You may modify the import statement below to bring more functions into scope, but you should not modify the ParserCombinators
library.
Note that the import below makes the listed functions available, in addition to the instances for Parser
for the Functor
, Applicative
and Alternative
classes. However, it does not import the P
data constructor so you should think of Parser a
as an abstract type.
You may import more operators from the ParserCombinators
and Control.Applicative
libraries if it is helpful for you. However, you should not modify the ParserCombinators
module itself.
> module Xml where
> import Control.Applicative (Alternative(..))
> import System.IO
> import Prelude hiding (filter)
> import ParserCombinators (Parser, doParse, satisfy, char, string, filter)
Your goal: produce this structured data from a string
> -- | A simplified datatype for storing XML
> data SimpleXML =
> PCDATA String
> | Element ElementName [SimpleXML]
> deriving Show
> type ElementName = String
First: the characters /
, <
, and >
are not allowed to appear in tags or PCDATA. Let's define a function that recognizes them.
> reserved :: Char -> Bool
> reserved c = c `elem` ['/', '<', '>']
Use this definition to parse a maximal nonempty sequence of nonreserved characters: (HINT: check out operations related to the Alternative
type class.)
> text :: Parser String
> text = undefined
Xml> doParse text "skhdjf"
Just ("skhdjf","")
Xml> doParse text "akj<skdfsdhf"
Just ("akj","<skdfsdhf")
Xml> doParse text ""
Nothing
Now use this definition to parse nonreserved characters into XML.
> pcdata :: Parser SimpleXML
> pcdata = undefined
Xml> doParse pcdata "akj<skdfsdhf"
Just (PCDATA "akj","<skdfsdhf")
Next, parse an empty element, like "<br/>"
> emptyContainer :: Parser SimpleXML
> emptyContainer = undefined
Xml> doParse emptyContainer "<br/>sdfsdf"
Just (Element "br" [],"sdfsdf")
Parse a container element: this consists of an open tag, a potentially empty sequence of content parsed by p
, and a closing tag. For example, container pcdata
should recognize <br></br>
or <title>A midsummer night's dream</title>
(and more examples below). You do NOT need to make sure that the closing tag matches the open tag.
> container :: Parser SimpleXML -> Parser SimpleXML
> container p = undefined
Xml> doParse (container pcdata) "<br></br>"
Just (Element "br" [],"")
Xml> doParse (container pcdata) "<title>A midsummer night's dream</title>"
Just (Element "title" [PCDATA "A midsummer night's dream"],"")
Xml> doParse (container emptyContainer) "<text><br/><br/></text>"
Just (Element "text" [Element "br" [], Element "br" []], "")
-- This should also work, even though the tag is wrong
Xml> doParse (container pcdata) "<title>A midsummer night's dream</br>"
Just (Element "title" [PCDATA "A midsummer night's dream"],"")
Now put the above together to construct a parser for simple XML data:
> xml :: Parser SimpleXML
> xml = undefined
Xml> doParse xml "<body>a</body>"
Just (Element "body" [PCDATA "a"],"")
Xml> doParse xml "<body><h1>A Midsummer Night's Dream</h1><h2>Dramatis Personae</h2>THESEUS, Duke of Athens.<br/>EGEUS, father to Hermia.<br/></body>"
Just (Element "body" [Element "h1" [PCDATA "A Midsummer Night's Dream"],Element "h2" [PCDATA "Dramatis Personae"],PCDATA "THESEUS, Duke of Athens.",Element "br" [],PCDATA "EGEUS, father to Hermia.",Element "br" []],"")
Xml> doParse xml "cis552"
Just (PCDATA "cis552", "")
Xml> doParse xml "<br/>"
Just (Element "br" [], "")
Now let's try it on something a little bigger. How about dream.html
from hw02?
> -- | Run a parser on a particular input file
> parseFromFile :: Parser a -> String -> IO (Maybe (a,String))
> parseFromFile parser filename = do
> handle <- openFile filename ReadMode
> str <- hGetContents handle
> return $ doParse parser str
Xml> parseFromFile xml "dream.html"
Challenge: rewrite container so that it only succeeds when the closing tag matches the opening tag.
> container2 :: Parser SimpleXML -> Parser SimpleXML
> container2 p = undefined
Xml> doParse (container2 pcdata) "<title>A midsummer night's dream</title>"
Just (Element "title" [PCDATA "A midsummer night's dream"],"")
Xml> doParse (container2 pcdata) "<title>A midsummer night's dream</br>"
Nothing