Parser Combinators Made Simple
Parsing theory has been around for quite a long time, but it is often thought of as magic by the swarms of people who haven’t bothered to read about it, and see how plain and dry it actually is. Algorithms for parsing LR(k) grammars (meaning Left-to-right, Right-most derivation, k tokens lookahead) for instance, normally just traverse a state machine that was computed before hand (either by hand, or by using a parser generator such as bison or yacc). Sure, there are many things to trip on, tedious to track down ambiguities, and other issues, but the general theory of parsing has remained unchanged for years–one might say, it is a solved problem.1
When learning about parsing for the first time though, the idea of a recursive descent parser is often taught first. Recursive descent parsers, are relatively simple to reason about, to write and to shoot yourself in the foot with. A simple LL(1) parser (meaning Left-to-right, Left-most derivation, 1 token lookahead), for instance, can‘t parse [left-recursive grammars][7], which is the most natural way to write certain types of grammars2. Typically, when writing a recursive descent parser, the author takes the grammar and produces a function for each production (non- terminal). Each function then reads a token and recurses to the other non- terminals in the grammar reachable from the current production. And, eventually, at the end of the function the sub parts will be combined in such a way that a parse tree will be created.
This sounds boring and tedious, and in fact is. However, there is a useful technique for creating these types of parsers that was developed some time ago3, which involves composing a small set of functions into more meaningful, more advanced parsers. They still suffer from the same problems as your typical recursive descent parser (as presented), but with some other trickery can be made to overcome those deficiencies (we won’t discuss that in this article).
In order to build a parser from the ground up, we need to think about what a parser actually is. In some sense, it is really just a function that takes an input string and produces some result. That result, in order to make any progress should contain the leftover string after consuming some part of it, or in the case of error (i.e. incorrect input), return some value indicating failure. In Python, a natural way to encode both results would be to use (_"matched string"_, _"leftover string"_)
or None
. For sanity‘s sake, let us refer to functions which match this criteria as parser functions.
To start off, we’ll write a useful parser function, which at first glance seems pointless, anychar
. anychar
matches any (no trickery here!) character so long as there is at least one character left in the input string. (Note: we‘ll use the variable strn
to always refer to the input string, which represents the string left to parse.)
def anychar(strn):
if strn == "":
return None
return (strn[0], strn[1:])
It is easy to see that the result of this parser function matches our encoding. If there are no characters left in strn
, then we return the failure condition, None
, otherwise we return a tuple of what we parsed, and the rest of the string which we didn’t parse.
It becomes more useful when we pair anychar
with a test against the character it consumes. Enter chartest
, which is a function that creates another parser function, given a predicate (i.e. a function which returns True
or False
).
def chartest(pred):
def _(strn):
c = anychar(strn)
if c and pred(c[0]):
return (c[0], c[1])
return None
return _
In order to use chartest
, we pass it a predicate, like so:
>>> chartest(lambda x: x == 'a')('abc')
('a', 'bc')
To see what happened, remember that chartest
creates a new parser function. With that, we just call the new parser function with the rest of the input string 'abc'
. The result indicates success, because an 'a'
was discovered as the first character. If we were unsuccessful, just like in anychar
, instead of ('a', 'bc')
, we‘d have seen None
returned.
It is a bit verbose to always create a lambda
to match a single character, so matchchr
gets a target character and calls chartest
for us. (Remember, calling chartest
creates a new parser function, this is an important thing to note.)
def matchchr(targetchr):
return chartest(lambda x: x == targetchr)
Now we can match single characters against our input stream, which is a great starting point, but hardly makes for an easy to use library. One limitation is that there is no way to specify more than one character as a possible match, such as ”all alpha numeric“–for that, we use oneof
.
def oneof(target):
chars = set([x for x in target])
return chartest(lambda x: x in chars)
oneof
creates a new test function to pass to chartest, which instead of testing if a character is equal to a single target character, checks to see if the character is in the set of characters we’re looking for. Some useful definitions follow, which make parser functions using oneof
, and a set of characters.
alpha = oneof('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
loweralpha = oneof('abcdefghijklmnopqrstuvwxyz')
upperalpha = oneof('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
digit = oneof('0123456789')
hexdigit = oneof('0123456789abcdefABCDEF')
whitespace = oneof(' \t\n\r')
While matching a single character is useful, it would be much more useful if we could match a token, like ”while
,“ or ”Content-Type.“ Not to worry, matchstr
produces a parser function that will combine the previously created matchchr
for each character in the target string. It looks a bit complicated, so we‘ll go through it step by step.
def matchstr(target):
if not target:
return lambda strn: ("", strn)
def _(strn):
c = matchchr(target[0])(strn)
if c:
cs = matchstr(target[1:])(c[1])
if cs:
return (c[0] + cs[0], cs[1])
return None
return _
target
, just like targetchr
in matchchr
is the string we’re eventually trying to match in full. If target
is empty, then our parser function is simple–it doesn‘t advance the input string, and doesn’t consume anything.
Why don‘t we return None
here? Well, if our target is empty, we’re not asking matchstr
to do any work at all, so there isn‘t a failure (indicated by None
). It, however, also makes for a great base case to the recursion that follows.
If there is a target string to match against, we attempt to match the first character within it. If that succeeds, we shorten the target string and recurse. We eventually return a combination of the result of matchchr
and the result of the recursive call to matchstr
. Take a minute to look over this and ensure you understand it–it’s actually pretty straightforward assuming understanding of the previous functions.
Let‘s take a look at how we use it:
>>> matchwhile = matchstr('while')
>>> matchwhile('while True:')
('while', ' True:')
As you can see, we used matchstr
to build a parser function which matches the string ”while
“–simple enough.
Ok, so what if we want to parse more complicated things, like say, the rest of the input string from the ”while True:“ example? We need some ways to combine these parser functions to make them more useful, otherwise, all we did was create the equivalent of:
if strn.startswith("while"):
return (strn[0:5], strn[5:])
Which, in Python, would be much more efficient!4
Another parser function that we need to make this whole thing useful is optional
. optional
takes as an argument a parser function, and returns a new parser function that succeeds even if the original parser function does not. Essentially, if there is a failure, it returns the original input string.
def optional(parserfn):
def _(strn):
c = parserfn(strn)
if c:
return c
return ('', strn)
return _
If we make matchwhile
, from above, optional we get this:
>>> optional_matchwhile = optional(matchwhile)
>>> optional_matchwhile('foo')
('', 'foo')
Without optional
, attempting to call matchwhile
on the input string 'foo'
would have resulted in None
, the failure condition.
The presence of optional
also leads us to repeat
and repeat0
which are mutually exclusive. repeat
will attempt to match the parser function at least once, with no boundary. repeat0
will match the parser function zero or more times:
def repeat(parser):
def _(strn):
c = parser(strn)
if c:
cs = repeat0(parser)(c[1])
return (c[0] + cs[0], cs[1])
return None
return _
def repeat0(parser):
return optional(repeat(parser))
Again, like optional
, repeat
and repeat0
build parser functions from existing ones. This is very much a common pattern when building parsers of this type.
The implementation of repeat0
and repeat
is quite clever. Note that zero or more is the same as optionally one or more. The implementation of both follows from that realization. repeat
first attempts to call the passed in parser function. If it succeeds it calls repeat0
on the rest of the input string after calling parser
the first time. If repeat0
succeeds, which it always will given optional
, we combine the results and return.
>>> optrepeat_while = repeat0(matchwhile)
>>> optrepeat_while('whilewhilewhile')
('whilewhilewhile', '')
>>> optrepeat_while('foo')
('', 'foo')
>>> repeat_while = repeat(matchwhile)
>>> repeat_while('foo')
None
>>> repeat_while('while foo')
('while', ' foo')
We still need the ability to do alternation, like ”while“ or ”if.“ For that we introduce alt
.
def alt(*parsers):
def _(strn):
for p in parsers:
result = p(strn)
if result:
return result
return _
This is really simple. We take a list of parser functions and try them one by one, from left to right, until we find one that passes.
>>> iforwhileorfor = alt(matchstr('if'), matchstr('while'), matchstr('for'))
>>> iforwhileorfor('if')
('if', '')
>>> iforwhileorfor('while')
('while', '')
>>> iforwhileorfor('for')
('for', '')
>>> iforwhileorfor('foof')
None
Alternation is important, but it is maybe even more important to ensure that many parser functions pass in order, a sequence of parser functions if you will. It is this operator that allows us to do something like whilestmt = sequence(whileToken, conditional, colonToken, codeBlock)
.
def sequence(*parsers):
def _(strn):
parsed = ''
rest = strn
for p in parsers:
result = p(rest)
if result:
rest = result[1]
parsed += result[0]
else:
return None
return (parsed, rest)
return _
Assuming simplified definitions of the supporting rules, our whileStmt
example looks something like this:
>>> whileToken = matchstr("while")
>>> conditional = oneof("><=")
>>> colonToken = matchchr(":")
>>> codeBlock = alt(matchstr("if"), matchstr("for"))
>>> whileStmt = sequence(whileToken, conditional, colonToken, codeBlock)
>>> whileStmt('while<:if')
('while<:if', '')
>>> whileStmt('while>:if')
('while>:if', '')
>>> whileStmt('while>:for')
('while>:for', '')
>>> whileStmt('while:for')
None
sequence
looks complicated, but is rather simple. It is basically reduce
, combining the results of each parser into the results of all of the parser function outputs together.
That’s all we really need to construct more interesting parsers, so we‘ll now construct a simplified parser for JSON.5
We’ll start with some utility functions:
def betweenchrs(parser, left="(", right=")"):
def _(strn):
lres = matchchr(left)(strn)
if lres:
pres = parser(lres[1])
if pres:
rres = matchchr(right)(pres[1])
if rres:
return (left + pres[0] + right, rres[1])
return None
return _
betweenparens = lambda p: betweenchrs(p, left="(", right=")")
betweenbrackets = lambda p: betweenchrs(p, left="[", right="]")
betweencurlies = lambda p: betweenchrs(p, left="{", right="}")
betweenchrs
lets us easily create a parser function which attempts to parse, using parser
, only if it is between left
and right
. This is useful in JSON, because of its list and dictionary data types, which are delimited by []
and {}
respectively.
Strings in JSON are composed of a series of characters between "
‘s. But, if you want to actually use a "
within the string, you can do that by preceding it with a \
. We can make a parser function that satisfies these rules rather easily, making use of anychar
.
def charorquoted(strn):
c = anychar(strn)
if c[0] == '"':
return None
elif c[0] == '\\':
c2 = chartest(lambda x: x in ('\\', '"'))(c[1])
if c2:
return (c[0] + c2[0], c2[1])
else:
return c
In the case that we find a ’”‘ character without a ’' character preceding it, it is a failure.
Whitespace doesn‘t much matter between tokens in JSON, so let us define something that ultimately ignores it. ignorews
uses repeat0
to strip the preceding whitespace, calls the parser function given using the left over input string. If the parser function passes, it calls repeat0
again against whitespace and ultimately returns the passed in parser function’s parsed result and the ignored whitespace‘s left over input string. That’s a mouthful, but it‘s fairly easy to understand:
def ignorews(p):
def _(strn):
w = repeat0(whitespace)(strn)
if w:
pres = p(w[1])
if pres:
w2 = repeat0(whitespace)(pres[1])
if w2:
return (pres[0], w2[1])
return None
return _
anint
, astring
, acolon
and acomma
are just helper functions which do exactly what they describe. We’re simplifying this implementation, as it is for demonstration purposes, so we‘re not taking into consideration floating point numbers, or integers specified using hexidecimal and other formulations of numbers.
anint = sequence(optional(matchchr("-")), repeat(digit))
astring = betweenchrs(repeat0(charorquoted), left='"', right='"')
acolon = matchchr(':')
acomma = matchchr(',')
When we define dictionaries and lists, we run into a problem. Both lists and dictionaries can contain lists and dictionaries (as well as numbers and strings of course), which represents a problem for when we define these functions (we can’t recursively define something that doesn‘t already exist!). One solution to this problem is to use a mutable object which acts as a “forward reference.” When we finish defining the pieces we need, we update the forward reference, and then all is well.
For both aesthetic, and practical reasons, we’ll use a class instance which overrides __call__
, and __ilshift__
, which will allow us to use an instance of Forward
as a parser function, and <<=
(from ilshift) as a way to update the parser function that's contained within the reference.
class Forward(object):
def __init__(self):
self.p = None
def __call__(self, *args, **kwargs):
return self.p(*args, **kwargs)
def __ilshift__(self, p):
self.p = p
return self
To use a Forward
, we simply create an instance of Forward
and assign it to a variable, just as if we were creating a parser function. Forward
is like a promise. “If you act like a parser function for me for a little bit, I promise I‘ll actually turn you into one later.” If the promise is kept, and the Forward
is updated, parsing will proceed as if nothing was ever not specified to begin with.
avalue = Forward()
akey = ignorews(alt(anint, astring))
akeyvaluepair = sequence(akey, acolon, avalue)
Both lists and dictionaries have items that are separated by comma. commaseparated
is essentially repeat0
except that it ensures a comma appears after each item, except in the last item.
def commaseparated(parser):
def _(strn):
r = repeat0(sequence(parser, acomma))(strn)
if r:
r2 = parser(r[1])
if r2:
return (r[0] + r2[0], r2[1])
elif r:
return r
return None
return _
Now that we have all the pieces specified, we put a commaseparated
key value pair between curly braces to parse a dictionary, and a commaseparated
value parser function between square brackets to parse a list.
adict = betweencurlies(commaseparated(akeyvaluepair))
alist = betweenbrackets(commaseparated(avalue))
We still have our promise to keep for avalue
, and with the definitions of alist
and adict
, we now can. avalue
, as in JSON, should either be a number, a string, a list or a dictionary, and whitespace is ignored.
avalue <<= alt(*map(ignorews, [adict, alist, anint, astring]))
To achieve alternation, we make use of alt
, but before we do that, we wrap each parser function contained in avalue
in an ignorews
parser function builder to satisify that requirement. Finally, we shift the newly created parser function into the forward reference for avalue
.
To parse a top level JSON document, we look for either a list, or a dictionary. The parser function to do that is quite easy to specify.
json = alt(adict, alist)
Last, but certainly not least, let’s actually use what we‘ve constructed:
>>> json('''{"hello": {1: "how are you?"}, "i is": "fine", "how": "are you?", 1: ["these", "values", "work", 2]}''')
('{"hello":{1:"how are you?"},"i is":"fine","how":"are you?",1:["these","values","work",2]}', '')
Success!
While we haven’t shown how to formulate an LL(k) grammar, or even talked about what that actually is formally, we have shown that with a few simple functions that build parser functions, we can build, quite easily, parsing functions which parse complicated things. However, we‘ve only shown that the input is valid, actually constructing a parse tree, or acting upon it, is left as an exercise to the reader.
Source code for all this is here
—2011-04-18