Parsing theory has been around for quite a long time, but it is often thought of as magic by the swarms of people who haven’t bothered to read about it, and see how plain and dry it actually is. Algorithms for parsing [LR(k)][1] grammars (meaning Left-to-right, Right-most derivation, k tokens lookahead) for instance, normally just traverse a state machine that was computed before hand (either by hand, or by using a parser generator such as [bison][2] or [yacc][3]). Sure, there are many things to trip on, tedious to track down ambiguities, and other issues, but the general theory of parsing has remained unchanged for years–one might say, it is a solved problem.1
When learning about parsing for the first time though, the idea of a [recursive descent parser][5] is often taught first. Recursive descent parsers, are relatively simple to reason about, to write and to shoot yourself in the foot with. A simple [LL(1)][6] parser (meaning Left-to-right, Left-most derivation, 1 token lookahead), for instance, can’t parse [left-recursive grammars][7], which is the most natural way to write certain types of grammars2. Typically, when writing a recursive descent parser, the author takes the grammar and produces a function for each production (non- terminal). Each function then reads a token and recurses to the other non- terminals in the grammar reachable from the current production. And, eventually, at the end of the function the sub parts will be combined in such a way that a [parse tree][9] will be created.
This sounds boring and tedious, and in fact is. However, there is a useful technique for creating these types of parsers that was developed some time ago3, which involves composing a small set of functions into more meaningful, more advanced parsers. They still suffer from the same problems as your typical recursive descent parser (as presented), but with some other trickery can be made to overcome those deficiencies (we won’t discuss that in this article).
In order to build a parser from the ground up, we need to think about what a
parser actually is. In some sense, it is really just a function that takes an
input string and produces some result. That result, in order to make any
progress should contain the leftover string after consuming some part of it,
or in the case of error (i.e. incorrect input), return some value indicating
failure. In Python, a natural way to encode both results would be to use
(_"matched string"_, _"leftover string"_) or None. For sanity’s sake, let
us refer to functions which match this criteria as parser functions.
To start off, we’ll write a useful parser function, which at first glance
seems pointless, anychar. anychar matches any (no trickery here!)
character so long as there is at least one character left in the input string.
(Note: we’ll use the variable strn to always refer to the input string,
which represents the string left to parse.)
def anychar(strn):
if strn == "":
return None
return (strn[0], strn[1:])
It is easy to see that the result of this parser function matches our
encoding. If there are no characters left in strn, then we return the
failure condition, None, otherwise we return a tuple of what we parsed, and
the rest of the string which we didn’t parse.
It becomes more useful when we pair anychar with a test against the
character it consumes. Enter chartest, which is a function that creates
another parser function, given a predicate (i.e. a function which returns
True or False).
def chartest(pred):
def _(strn):
c = anychar(strn)
if c and pred(c[0]):
return (c[0], c[1])
return None
return _
In order to use chartest, we pass it a predicate, like so:
>>> chartest(lambda x: x == 'a')('abc')
('a', 'bc')
To see what happened, remember that chartest creates a new parser
function. With that, we just call the new parser function with the rest of the
input string 'abc'. The result indicates success, because an 'a' was
discovered as the first character. If we were unsuccessful, just like in
anychar, instead of ('a', 'bc'), we’d have seen None returned.
It is a bit verbose to always create a lambda to match a single character,
so matchchr gets a target character and calls chartest for us. (Remember,
calling chartest creates a new parser function, this is an important thing
to note.)
def matchchr(targetchr):
return chartest(lambda x: x == targetchr)
Now we can match single characters against our input stream, which is a great
starting point, but hardly makes for an easy to use library. One limitation is
that there is no way to specify more than one character as a possible match,
such as “all alpha numeric”–for that, we use oneof.
def oneof(target):
chars = set([x for x in target])
return chartest(lambda x: x in chars)
oneof creates a new test function to pass to chartest, which instead of
testing if a character is equal to a single target character, checks to see if
the character is in the set of characters we’re looking for. Some useful
definitions follow, which make parser functions using oneof, and a set of
characters.
alpha = oneof('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
loweralpha = oneof('abcdefghijklmnopqrstuvwxyz')
upperalpha = oneof('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
digit = oneof('0123456789')
hexdigit = oneof('0123456789abcdefABCDEF')
whitespace = oneof(' \t\n\r')
While matching a single character is useful, it would be much more useful if
we could match a token, like “while,” or “Content-Type.” Not to worry,
matchstr produces a parser function that will combine the previously created
matchchr for each character in the target string. It looks a bit
complicated, so we’ll go through it step by step.
def matchstr(target):
if not target:
return lambda strn: ("", strn)
def _(strn):
c = matchchr(target[0])(strn)
if c:
cs = matchstr(target[1:])(c[1])
if cs:
return (c[0] + cs[0], cs[1])
return None
return _
target, just like targetchr in matchchr is the string we’re eventually
trying to match in full. If target is empty, then our parser function is
simple–it doesn’t advance the input string, and doesn’t consume anything.
Why don’t we return None here? Well, if our target is empty, we’re not
asking matchstr to do any work at all, so there isn’t a failure (indicated
by None). It, however, also makes for a great base case to the recursion
that follows.
If there is a target string to match against, we attempt to match the first
character within it. If that succeeds, we shorten the target string and
recurse. We eventually return a combination of the result of matchchr and
the result of the recursive call to matchstr. Take a minute to look over
this and ensure you understand it–it’s actually pretty straightforward
assuming understanding of the previous functions.
Let’s take a look at how we use it:
>>> matchwhile = matchstr('while')
>>> matchwhile('while True:')
('while', ' True:')
As you can see, we used matchstr to build a parser function which matches
the string “while”–simple enough.
Ok, so what if we want to parse more complicated things, like say, the rest of the input string from the “while True:” example? We need some ways to combine these parser functions to make them more useful, otherwise, all we did was create the equivalent of:
if strn.startswith("while"):
return (strn[0:5], strn[5:])
Which, in Python, would be much more efficient!4
Another parser function that we need to make this whole thing useful is
optional. optional takes as an argument a parser function, and returns a
new parser function that succeeds even if the original parser function does
not. Essentially, if there is a failure, it returns the original input string.
def optional(parserfn):
def _(strn):
c = parserfn(strn)
if c:
return c
return ('', strn)
return _
If we make matchwhile, from above, optional we get this:
>>> optional_matchwhile = optional(matchwhile)
>>> optional_matchwhile('foo')
('', 'foo')
Without optional, attempting to call matchwhile on the input string
'foo' would have resulted in None, the failure condition.
The presence of optional also leads us to repeat and repeat0 which are
mutually exclusive. repeat will attempt to match the parser function at
least once, with no boundary. repeat0 will match the parser function zero or
more times:
def repeat(parser):
def _(strn):
c = parser(strn)
if c:
cs = repeat0(parser)(c[1])
return (c[0] + cs[0], cs[1])
return None
return _
def repeat0(parser):
return optional(repeat(parser))
Again, like optional, repeat and repeat0 build parser functions from
existing ones. This is very much a common pattern when building parsers of
this type.
The implementation of repeat0 and repeat is quite clever. Note that zero
or more is the same as optionally one or more. The implementation of both
follows from that realization. repeat first attempts to call the passed in
parser function. If it succeeds it calls repeat0 on the rest of the input
string after calling parser the first time. If repeat0 succeeds, which it
always will given optional, we combine the results and return.
>>> optrepeat_while = repeat0(matchwhile)
>>> optrepeat_while('whilewhilewhile')
('whilewhilewhile', '')
>>> optrepeat_while('foo')
('', 'foo')
>>> repeat_while = repeat(matchwhile)
>>> repeat_while('foo')
None
>>> repeat_while('while foo')
('while', ' foo')
We still need the ability to do alternation, like “while” or “if.” For that
we introduce alt.
def alt(*parsers):
def _(strn):
for p in parsers:
result = p(strn)
if result:
return result
return _
This is really simple. We take a list of parser functions and try them one by one, from left to right, until we find one that passes.
>>> iforwhileorfor = alt(matchstr('if'), matchstr('while'), matchstr('for'))
>>> iforwhileorfor('if')
('if', '')
>>> iforwhileorfor('while')
('while', '')
>>> iforwhileorfor('for')
('for', '')
>>> iforwhileorfor('foof')
None
Alternation is important, but it is maybe even more important to ensure that
many parser functions pass in order, a sequence of parser functions if you
will. It is this operator that allows us to do something like whilestmt =
sequence(whileToken, conditional, colonToken, codeBlock).
def sequence(*parsers):
def _(strn):
parsed = ''
rest = strn
for p in parsers:
result = p(rest)
if result:
rest = result[1]
parsed += result[0]
else:
return None
return (parsed, rest)
return _
Assuming simplified definitions of the supporting rules, our whileStmt
example looks something like this:
>>> whileToken = matchstr("while")
>>> conditional = oneof("><=")
>>> colonToken = matchchr(":")
>>> codeBlock = alt(matchstr("if"), matchstr("for"))
>>> whileStmt = sequence(whileToken, conditional, colonToken, codeBlock)
>>> whileStmt('while<:if')
('while<:if', '')
>>> whileStmt('while>:if')
('while>:if', '')
>>> whileStmt('while>:for')
('while>:for', '')
>>> whileStmt('while:for')
None
sequence looks complicated, but is rather simple. It is basically reduce,
combining the results of each parser into the results of all of the parser
function outputs together.
That’s all we really need to construct more interesting parsers, so we’ll now construct a simplified parser for JSON.5
We’ll start with some utility functions:
def betweenchrs(parser, left="(", right=")"):
def _(strn):
lres = matchchr(left)(strn)
if lres:
pres = parser(lres[1])
if pres:
rres = matchchr(right)(pres[1])
if rres:
return (left + pres[0] + right, rres[1])
return None
return _
betweenparens = lambda p: betweenchrs(p, left="(", right=")")
betweenbrackets = lambda p: betweenchrs(p, left="[", right="]")
betweencurlies = lambda p: betweenchrs(p, left="{", right="}")
betweenchrs lets us easily create a parser function which attempts to parse,
using parser, only if it is between left and right. This is useful in
JSON, because of its list and dictionary data types, which are delimited by
[] and {} respectively.
Strings in JSON are composed of a series of characters between "’s. But, if
you want to actually use a " within the string, you can do that by preceding
it with a \. We can make a parser function that satisfies these rules rather
easily, making use of anychar.
def charorquoted(strn):
c = anychar(strn)
if c[0] == '"':
return None
elif c[0] == '\\':
c2 = chartest(lambda x: x in ('\\', '"'))(c[1])
if c2:
return (c[0] + c2[0], c2[1])
else:
return c
In the case that we find a ‘“‘ character without a ‘\’ character preceding it, it is a failure.
Whitespace doesn’t much matter between tokens in JSON, so let us define
something that ultimately ignores it. ignorews uses repeat0 to strip the
preceding whitespace, calls the parser function given using the left over
input string. If the parser function passes, it calls repeat0 again against
whitespace and ultimately returns the passed in parser function’s parsed
result and the ignored whitespace’s left over input string. That’s a mouthful,
but it’s fairly easy to understand:
def ignorews(p):
def _(strn):
w = repeat0(whitespace)(strn)
if w:
pres = p(w[1])
if pres:
w2 = repeat0(whitespace)(pres[1])
if w2:
return (pres[0], w2[1])
return None
return _
anint, astring, acolon and acomma are just helper functions which do
exactly what they describe. We’re simplifying this implementation, as it is
for demonstration purposes, so we’re not taking into consideration floating
point numbers, or integers specified using hexidecimal and other formulations
of numbers.
anint = sequence(optional(matchchr("-")), repeat(digit))
astring = betweenchrs(repeat0(charorquoted), left='"', right='"')
acolon = matchchr(':')
acomma = matchchr(',')
When we define dictionaries and lists, we run into a problem. Both lists and dictionaries can contain lists and dictionaries (as well as numbers and strings of course), which represents a problem for when we define these functions (we can’t recursively define something that doesn’t already exist!). One solution to this problem is to use a mutable object which acts as a “forward reference.” When we finish defining the pieces we need, we update the forward reference, and then all is well.
For both aesthetic, and practical reasons, we’ll use a class instance which
overrides __call__, and __ilshift__, which will allow us to use an
instance of Forward as a parser function, and <<= (from ilshift) as a
way to update the parser function that’s contained within the reference.
class Forward(object):
def __init__(self):
self.p = None
def __call__(self, *args, **kwargs):
return self.p(*args, **kwargs)
def __ilshift__(self, p):
self.p = p
return self
To use a Forward, we simply create an instance of Forward and assign it to
a variable, just as if we were creating a parser function. Forward is like a
promise. “If you act like a parser function for me for a little bit, I promise
I’ll actually turn you into one later.” If the promise is kept, and the
Forward is updated, parsing will proceed as if nothing was ever not
specified to begin with.
avalue = Forward()
akey = ignorews(alt(anint, astring))
akeyvaluepair = sequence(akey, acolon, avalue)
Both lists and dictionaries have items that are separated by comma.
commaseparated is essentially repeat0 except that it ensures a comma
appears after each item, except in the last item.
def commaseparated(parser):
def _(strn):
r = repeat0(sequence(parser, acomma))(strn)
if r:
r2 = parser(r[1])
if r2:
return (r[0] + r2[0], r2[1])
elif r:
return r
return None
return _
Now that we have all the pieces specified, we put a commaseparated key value
pair between curly braces to parse a dictionary, and a commaseparated value
parser function between square brackets to parse a list.
adict = betweencurlies(commaseparated(akeyvaluepair))
alist = betweenbrackets(commaseparated(avalue))
We still have our promise to keep for avalue, and with the definitions of
alist and adict, we now can. avalue, as in JSON, should either be a
number, a string, a list or a dictionary, and whitespace is ignored.
avalue <<= alt(*map(ignorews, [adict, alist, anint, astring]))
To achieve alternation, we make use of alt, but before we do that, we wrap
each parser function contained in avalue in an ignorews parser function
builder to satisify that requirement. Finally, we shift the newly created
parser function into the forward reference for avalue.
To parse a top level JSON document, we look for either a list, or a dictionary. The parser function to do that is quite easy to specify.
json = alt(adict, alist)
Last, but certainly not least, let’s actually use what we’ve constructed:
>>> json('''{"hello": {1: "how are you?"}, "i is": "fine", "how": "are you?", 1: ["these", "values", "work", 2]}''')
('{"hello":{1:"how are you?"},"i is":"fine","how":"are you?",1:["these","values","work",2]}', '')
Success!
While we haven’t shown how to formulate an LL(k) grammar, or even talked about what that actually is formally, we have shown that with a few simple functions that build parser functions, we can build, quite easily, parsing functions which parse complicated things. However, we’ve only shown that the input is valid, actually constructing a parse tree, or acting upon it, is left as an exercise to the reader.
Source code for all this is [here][13]
-
I’m not sure if parsing is really “solved,” but the algorithms we have work well enough in practice that there isn’t a ton of interesting new research going on around it. Yacc is Dead, for instance used the results of a paper from 1964, but came under fire. Can parsing be made trivially easy? Maybe, but it’s likely that ambiguity will be somewhere–which would be the unsolved in parsing. ↩
-
See left recursion. All joking aside, left recursion occurs in a grammar when a non-terminal rule is used recursively and appears on the left. For example:
expr = expr + tail. ↩ -
Graham Hutton. Proceedings of the 1989 Glasgow Workshop on Functional Programming (Fraserburgh, Scotland), Springer-Verlag Series of Workshops in Computing, Springer-Verlag, Berlin, 1990. ↩
-
The purpose of this article isn’t to describe an efficient parsing technique for Python, but to rather demonstrate a useful technique that could be adapted and built upon, to build an efficient parsing framework for any language that supports first class functions. There’s a follow up article to this which shows just how much of this can actually be abstracted out in order to make it even more simple. ↩
-
We limit our parser to strings, integers, dictionary and lists. The complete specification appears at http://json.org ↩
[1]: http://en.wikipedia.org/wiki/LR_parser
[2]: http://www.gnu.org/software/bison/
[3]: http://en.wikipedia.org/wiki/Yacc
[5]: http://en.wikipedia.org/wiki/Recursive_descent_parser
[6]: http://en.wikipedia.org/wiki/LL_parser
[7]: http://en.wikipedia.org/wiki/Left_recursion
[8]: #note-leftrecursion
[9]: http://en.wikipedia.org/wiki/Parse_tree
[13]: http://files.sigusr2.net/parser_functions.py
[14]: http://arxiv.org/abs/1010.5023
[15]: http://portal.acm.org/citation.cfm?id=321249
[16]: http://research.swtch.com/2010/12/yacc-is-not-dead.html
[21]: http://json.org