Summary
If you have ever found yourself frustrated by a complicated regular expression or wondered how you can build your own dialect of Python, then you need a parser. In this episode, Dave Beazley and Erik Rose talk about what parsers are, how some of them work, and what you can do with them.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
- Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
- Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
- Your host as usual is Tobias Macey and today I’m interviewing Erik Rose and Dave Beazley about what parsing is, why you might want to use it, and how their respective libraries Parsimonious and PLY make it easy.
Interview
- Introductions
- How did you get introduced to Python?
- Can you each start by talking a bit about your respective libraries and what problem you were trying to solve when they were first created?
- In what ways does a full-fledged parser differ from what a regular expression engine is capable of?
- What are some of the different high-level approaches to building a parser and when might you want to choose one over the others?
- I’m sure that when most people hear the term parsing they associate it with reading in a data interchange format such as JSON or CSV. What are some of the more interesting or broadly applicable uses of parsing that might not be as obvious?
- One term that kept coming up while I was doing research for this interview was “Grammars”. How would you explain that concept for someone who is unfamiliar with it?
- Once an input has been parsed, what does the resulting data look like and how would a developer interact with it to do something useful?
- For someone who wants to build their own domain specific language (DSL) what are some of the considerations that they should be aware of to create the grammar?
- What are some of the most interesting or innovative uses of parsers that you have seen?
Keep In Touch
Picks
- Tobias
- Erik
- Dave
Links
- Python Cookbook
- Python Essential Reference
- Fathom
- SWIG
- Windows Scripting Host
- PEG (Bryan Ford)
- Parsing Techniques by Grune and Jacobs
- The Dragon Book
- Stack Overflow HTML regex parsing
- Earley parsing
- SPARK
- Hy-lang
- Trampolining
- Lisp
- NLTK
- SLY
- DXR
- LLVM
- Numba
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your next app or experimenting with something you hear about on the show. You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. And to help other people find the show, you can leave a review on iTunes or Google Play Music, and you can also tell your friends and coworkers or share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Erik Rose and Dave Beazley about what parsing is, why you might want to use it, and how their respective libraries Parsimonious and PLY make it easy. So could you each introduce yourself? How about Dave, you go first? Alright. Well, I'm Dave Beazley. I'm sitting here in Chicago. People probably know me from the Python Essential Reference book and the Python Cookbook.
And, Erik, how about yourself?
[00:01:11] Unknown:
Well, let's see. I'm Erik, and I sit in my glacially carved fortress in North Carolina, where I work remotely as a language wonk for Mozilla. I'm really interested in the intersection of formal languages and human cognition and have been for some time. At Mozilla, I make a little tool called DXR, which is a search and analysis utility for large code bases in various languages. I work on Fathom, which is a language for teaching web browsers how to understand web pages the way humans do. And on the side, I make an eclectic assortment of Python libraries.
[00:01:44] Unknown:
Starting with you again, Dave. How did you get introduced to Python?
[00:01:47] Unknown:
I actually got introduced to Python through parsing, believe it or not, which is sort of a weird story. I came into Python through scientific computing. I had taken a compilers course as a grad student, and I got this idea that, boy, it'd be kind of cool if I could build my own version of MATLAB or something to control this big physics code. So I went off and wrote my own scripting language, then went back to grad school and was immediately scorned: "You idiot, why did you make another programming language? Why don't you use something that exists?" So I went around and looked at different things like Tcl and Perl and discovered Python in the process, and was actually working on the SWIG project at the time. For people who have heard of SWIG, it's basically a parser that takes C and C++ header files and makes extension modules out of them. That's where I discovered Python. I was working on parsing from the beginning there.
[00:02:43] Unknown:
And, Erik, how about you? Oh, well, I also got into Python through kind of a weird series of events. Through not really knowing how to interview, I ended up writing VBScript code over at Penn State. I had been a collector of languages for some years, with a couple dozen to my name, and I said, I'm going to give VBScript a chance. I gave it about a year, and then I finally realized that you can't actually write anything in VBScript, so I started looking for escape hatches. Fortunately, there's this wonderful Microsoft artifact called the Windows Scripting Host, which sits around and runs VBScript and also JScript and other languages like Python, and you can bridge between languages on Windows Scripting Host. So I was able to take Python, cram it in as a sidecar under the VBScript project, and escape from the VBScript wonderland, where you can have classes but not subclasses, and other horrible things. That's how I got into Python, and I've been on it ever since. I jettisoned VBScript, good riddance, and here I still am.
[00:03:43] Unknown:
So can you start by talking a bit about your respective libraries and the problems that you were trying to solve when they were first created? So Dave, how about you go first again?
[00:03:52] Unknown:
Alright. So, yeah, I have this PLY library, and there's maybe a bit of an interesting story on that. As I noted, I originally created the SWIG package for doing C and C++ extensions, and one of the questions I always got with that package was why it was written in C++. The SWIG package is written in C++, the parser is written in C++, and so forth. The reason is that's what I had learned in graduate school: I wrote a compiler using Lex and Yacc, which are these Unix tools, and that's what I knew how to write things with, so that's what I used. But as a Python programmer, people started asking me, why don't you just write this in Python? Why are you punishing yourself by writing this tool in C++?
And the answer is that at the time, keeping in mind this would have been like '99, year 2000, there actually were not a large number of parsing tools available in Python. There certainly wasn't anything comparable to the Unix Lex/Yacc toolkit, and I honestly didn't know how I would undertake such a project without that. Then, through kind of a weird turn of events, I had an opportunity to teach the compilers course at the university. It turns out the prior professor didn't get tenure and disappeared, so this opening opened up and I took the compilers course over. As part of that, I said, you know what, I'm going to do this in Python. Let's find out what it's like to write a compiler in Python. I ended up having to write a Lex/Yacc tool over spring break. So it turns out that the original version of PLY was written in about ten days over spring break for the purpose of teaching a class. Since then, it's been about sixteen years of debugging.
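The Lex half of the Lex/Yacc pair Dave recreates in PLY is essentially a list of (token name, regular expression) rules tried in order. This is not PLY's actual API, just a minimal stdlib sketch of that idea, with an illustrative toy expression language:

```python
import re

# A lex-style specification: named token rules, tried in order.
# Token names and the toy language here are illustrative only.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    """Yield (token_name, lexeme) pairs, dropping whitespace."""
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("x + 42 * y")))
```

A real tool like PLY layers line tracking, error rules, and lexer states on top of this basic rule-table idea.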
Yeah, we'll talk about that later. But that's where it came from. It was like, I need to have a tool, there is no tool, so let's just do something and kind of see what we can get out of it. That's actually one of the things that I admire most about PLY: you basically put it through the worst debugging wringer imaginable, which is class after class of students, and so it has this fantastic error behavior and error output. My own library, Parsimonious,
[00:06:10] Unknown:
got started while I was working on the Firefox support website. For historical reasons, we had a love for the MediaWiki grammar; we wrote all of our pages in what was basically stock MediaWiki, and we had a MediaWiki library off to the side that David Kramer had ported over from the original PHP. It was like a straight transliteration; it still had dollar signs in the comments. The problem came when we wanted to shoehorn some of our own control structures into that existing MediaWiki syntax. We wanted to make notations so that if someone came in on Firefox 3 and looked at the page, it would say one thing, but if they were on Firefox 4, they'd see a slightly different set of instructions, let's say. And trying to cram that into the existing MediaWiki port code was just offensively ugly. The way MediaWiki worked at the time was by running something like 41 regex find-and-replace routines, one after another after another, sticking tokens in there, running some other replaces, then trying to go replace those other tokens, all while trying to keep these different replaces from stepping on each other's toes. On top of that, there were 2,000 imperative lines of code driving all this stuff, and loops, and horrible things like that. So I set out, kind of as a research project and kind of as a practical thing, to see if we could come up with (a) a MediaWiki grammar in a formal sense and (b) a toolkit to run it on. I went shopping through all these Python parsing libraries, and one of the things that drove me to write my own was that MediaWiki is not an LR(1) grammar, which is the technical way of saying it. The basic meaning is that you can't quite tell what you're parsing until you look ahead more than one step. In a well-designed language like, well, Python, for example, you don't have to look ahead that far.
If you say def, you know it's going to be a function. If you say class, it's probably going to be a class. MediaWiki, not so much. Take the link syntax: you start with the square bracket, and then maybe a URL, and then maybe a title, and then you wait for that end square bracket. But, hey, you know what? It's not there. We just keep on trucking, and it turns out that's not a link after all, because the end square bracket didn't come when we expected it to. But it was, again, more than one step ahead that we had to look to find that out. So Parsimonious was written to solve this problem. It was written to support extensible grammars.
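The look-ahead problem Erik describes (is this `[` a link, or just text?) can be sketched with a parser that optimistically tries the link rule and backtracks when the closing bracket never arrives. The bracket syntax below is a simplified stand-in for real MediaWiki, not its actual grammar:

```python
def parse_link(text, pos):
    """Try to parse '[target title]' starting at pos.
    Return (node, new_pos) on success, or None to signal backtracking."""
    if pos >= len(text) or text[pos] != "[":
        return None
    end = text.find("]", pos)
    if end == -1:          # no closing bracket: not a link after all
        return None        # caller backtracks and treats '[' as plain text
    inner = text[pos + 1:end].split(None, 1)
    target = inner[0] if inner else ""
    title = inner[1] if len(inner) > 1 else target
    return ({"target": target, "title": title}, end + 1)

def parse_inline(text):
    """Ordered choice: try the link rule first, else consume one plain char."""
    nodes, pos = [], 0
    while pos < len(text):
        result = parse_link(text, pos)
        if result:
            node, pos = result
            nodes.append(node)
        else:
            nodes.append(text[pos])   # fall back: plain text
            pos += 1
    return nodes

print(parse_inline("[http://x Example] and [broken"))
```

The unclosed `[broken` quietly degrades to plain characters, which is exactly the "it wasn't a link after all" behavior a one-token-lookahead parser can't express.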
Parsimonious uses some theory called parsing expression grammars. That's the kind of grammars it supports, and those were introduced by a guy named Bryan Ford in 2004. The nice thing about them is they're composable, so we could start with a grammar that represented MediaWiki and then later plug in our little Firefox-specific modules and still have a valid grammar. I also wanted to modernize things. PyParsing at the time was what I considered the standard thing, and it's a single 3,000-line module that, if I recall right, didn't really have any tests at the time. And it's very hard to read a PyParsing grammar, at least to my eyes. Parsimonious grammars look pretty much like what you'd find on a man page. So that pretty much covers my motivations.
[00:09:13] Unknown:
I was going to say, it's really interesting: it's almost like you have the super modern and the super old here at the same time. The parsing algorithm in PLY is straight out of 1970. It's the LALR(1) parsing algorithm, and I was basically trying to recreate exactly what is found in the Unix tools. Was that for knowledge portability, so your students could run off and grab Lex and Yacc and feel at home?
[00:09:44] Unknown:
Partly, yeah. And, well, there's a lot written about Lex and Yacc.
[00:09:48] Unknown:
They're sort of a known quantity; you can buy books about those. Sure, I've got one sitting here. And from a pedagogical point of view, it was like, we're doing Lex and Yacc, not Yacc++ or whatever else you might want to use. So it was always meant to be just a really traditional look at things. And actually, for most programming languages, you'll find some implementation of Lex and Yacc sitting around somewhere. I just happened to be the unlucky one to do it for Python. I don't know how that fell in my lap. You did a good job, I think. It's really hard to implement. Well, I mentioned it took like sixteen years to debug this thing. There is actually not a lot of good information on how to implement the LALR(1) parsing algorithm. There are a lot of books that tell you how to use it, but how to actually put it into code was really hard to track down.
[00:10:46] Unknown:
There's a book I have in my hands right now that I was using when I designed Parsimonious, back when I was shopping around for different things. Have you ever heard of Parsing Techniques by Grune and Jacobs?
[00:10:56] Unknown:
I don't know, not off the top of my head. Fantastic book. It's like where weird parsing algorithms go to die.
[00:11:02] Unknown:
Oh, okay. It's, yeah, 700 pages long, and it's just: here they all are, guys, and here's how to write them, and here are their theoretical behaviors, and
[00:11:13] Unknown:
big-O notations, and everything you could possibly want. It's a great book. See, I tried to implement it originally out of the infamous Dragon Book. Oh, yeah. Oh, god. Yeah. For people listening who are new to parsing, just go look up the Dragon Book. This is the book you use if you want to weed out half of your grad students. You can dissuade someone from writing a compiler. Well, no, they'll just leave grad school; just set them loose on the Dragon Book. But I tried to implement it out of that, and I'm just not smart enough to figure it out. I ended up buying another used textbook, I think from like 1978 or something. It was some very old book, and it actually presented the algorithm in a very clear way, so I ended up using that. But the Dragon Book just exploded my head on the algorithm. I don't know that there are many people walking around who've actually had to put that algorithm into code. It's a very obscure thing.
[00:12:11] Unknown:
For people who are fairly new to the idea of parsing, what are some of the ways that a full-fledged parser differs from what a typical engineer would be more familiar with in terms of regular expression engines?
[00:12:22] Unknown:
Well, I think the biggest way that it differs is that a vanilla regular expression engine can't deal with infinite nesting, arbitrary nesting. You know, good luck writing a vanilla regex that'll do, you know, arbitrarily nested sets of parentheses or arbitrarily nested Python lists.
[00:12:37] Unknown:
Or HTML.
[00:12:39] Unknown:
Sure, that's a good example. You don't parse HTML with a regex?
[00:12:44] Unknown:
Not if you value your sanity.
[00:12:46] Unknown:
Just think of that famous Stack Overflow answer.
[00:12:51] Unknown:
You have to be careful to distinguish: do you mean regexes as implemented, you know, PCRE, whatever it is? Or do you mean the academic idea of regular expressions, which totally can't do nesting? Now there are, you know,
[00:13:04] Unknown:
Perl, and maybe even Python, abominations that stretch regexes so they can deal with nesting. That's a thing. Yeah, I'm talking more specifically about the implementations of regex engines that an average engineer might have been exposed to, and the moment when trying to figure out how to use a regular expression to handle, as you said, arbitrarily nested syntax makes their head explode, and they want to turn to something that is more capable of actually solving that problem. So it was basically just differentiating between what a full-fledged parser like PLY or Parsimonious is able to do versus what they might be attempting with regular expressions.
[00:13:40] Unknown:
I generally say there are two things that should make you reach for an actual parser. One, you need to parse something you can't represent with regexes, like arbitrary nesting. And two, your head is exploding because your regex soup has gotten too big to deal with, even with, you know, ignored whitespace and comments.
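Erik's nesting point is easy to demonstrate. A classical regular expression is a finite-state machine with no counter, so it cannot match arbitrarily deep balanced parentheses; a recursive parser can, because the call stack does the counting. A minimal sketch:

```python
def parse_parens(text, pos=0, depth=0):
    """Recursively scan for balanced parentheses.
    Returns the position after the scanned region, or None on imbalance."""
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            pos = parse_parens(text, pos + 1, depth + 1)  # recurse: nesting
            if pos is None:
                return None
        elif ch == ")":
            return pos + 1 if depth > 0 else None  # unmatched ')'
        else:
            pos += 1   # any other character is plain text
    return pos if depth == 0 else None  # leftover '(' means imbalance

def is_balanced(text):
    return parse_parens(text) is not None

print(is_balanced("a(b(c)d)e"), is_balanced("(()"))
```

The recursion depth tracks the nesting depth, which is exactly the unbounded state a vanilla regex engine lacks.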
[00:13:55] Unknown:
Yeah, I would jump in there too. For me, my background was not in programming languages, and in my experience of parsing, there's this evolution. It's like, oh, I can parse command line arguments. Okay, that's not too bad usually; you split things up into a list. Then maybe you can parse a line of text where things are separated by whitespace or something. Then you move up and do things like XML, and you're like, well, maybe I could parse that. And then if you get into a real programming language, like, oh, I'm going to write a parser for Python, all of a sudden things start getting really hairy really fast. You realize that whatever you learned parsing command line arguments is not going to cut it here, and that is where you start turning to these parsing tools. Yeah, when you find yourself coming up with all this imperative code wrapping around your salad of regexes, then it's time to think, you know, this would be a lot easier to read and less buggy if it were in some kind of kit that lets me give it a grammar. Yeah, I think these parsing toolkits are often where a lot of students first encounter a real need to use tooling for stuff. Thinking back on my own background, for a lot of things it was always like, oh yeah, I could code that: I can write something for the network, I can do this. But then you get into compiler writing and you realize, oh man, this is at a whole other level of complexity where, if you don't have the tooling, you're just dead.
What do you mean by tooling? Like Lex and Yacc tools, like the parsing tools? Yeah, it's like if you don't have the tool, you're just... The first step is to write the tool, then. Well, at the time I came into Python, I was writing a C++ parser. If I had to write a C++ parser by hand, like a recursive descent parser, that would be one of the most horrible things ever, probably. It was already bad with the tooling, but without it, it would have been terrible.
[00:15:58] Unknown:
That answer your question, Tobias?
[00:16:00] Unknown:
Yes, it does. So, Erik, you mentioned the idea of being able to build a grammar for solving the particular problem set that you might be faced with. I'm wondering if you can dig a bit deeper into the technical definition of grammars, how you might typically define a grammar, and how that gets consumed by the parsing tool to generate the desired output.
[00:16:23] Unknown:
Well, I think of a grammar as just a formal way of describing the set of valid things in a given language. A grammar describes what is valid Python, or what is valid in, you know, maybe a little query language you're designing, down to the point where, if you try to import some garbage messed-up module in Python, you'll get a syntax error right at import time. That's the sort of thing a grammar buys for us. Okay, I don't get syntax errors anymore. Now when I run it, it might still be buggy; I might get name errors; I might be talking about symbols that haven't been defined. But at least I can get it through the initial read-in, and okay, it's a module, and I know where to look: this looks like a function, this looks like a variable. Now away we go. I feel like that was fuzzy, so maybe Dave can defuzz it for me. I guess I would agree with that. A grammar is really defining the structure of your program, or your input: is this a valid input? And so, extracting a bit from that, it's basically the idea of setting the sort of tokens or identifying markers
[00:17:21] Unknown:
that will allow the parser to say that this is the start of a particular type of input. This is the end of a particular type of input. I can now extract everything between them
[00:17:29] Unknown:
to be this particular type of instance within the syntax tree that I'm generating. That's a really good way to say it, Tobias, actually. Parsers are, among other things, input categorizers: hey, this is a function, and this is an assignment statement, and this is something else. And then you've got your structure: okay, this one contains that one, and this one contains those other ones. At a high level, they're really kind of categorizers.
[00:17:53] Unknown:
Yeah, I think that's a good description. Like, in the Python sense, if you're assigning a variable or something, the grammar would define what can appear on the right-hand side of an equals sign. Not everything can appear over there; you can't put, like, a while loop or something. So a grammar would specify that. Yeah, you can say def whatever, but you can't say def def.
[00:18:14] Unknown:
At least not if you don't wanna break your instance of Python.
[00:18:17] Unknown:
Right. I don't think you can do that at all. I think it'll scream immediately. Now you've got me wanting to try it. Def def. Yeah, it doesn't like that. Alright.
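Dave's "def def" experiment is easy to reproduce without risking a live interpreter session: Python's own parser is exposed through the stdlib `ast` module, and a grammar violation surfaces as a `SyntaxError` at parse time, before any code runs.

```python
import ast

# Valid per the grammar: parses into a module node without running anything.
ast.parse("def greet(): return 'hi'")

# 'def' is a keyword, so it can't be used as a function name:
try:
    ast.parse("def def(): pass")
except SyntaxError as err:
    print("rejected by the grammar:", err.msg)
```

This is exactly the "syntax errors at import time" check Erik describes: the grammar filters out invalid programs before name errors or logic bugs even come into play.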
[00:18:23] Unknown:
Well, at least we've got that going for us.
[00:18:26] Unknown:
The one thing with grammars, and I'm actually curious to know how Erik's tool handles this, is dealing with ambiguity in the grammar. Going back to this categorization idea, you may end up with syntax that falls into more than one category, and you get an ambiguous parsing situation. One of the things with these tools, oftentimes, is how they deal with that kind of situation. I think having a tool with more lookahead tends to deal with that better. I never thought of it like that.
[00:18:58] Unknown:
I suppose it would give you one more piece of signal. Yeah. One example
[00:19:02] Unknown:
that I have seen of something like that, where there is ambiguity in terms of what a given token might represent, is Xonsh, the tool by Anthony Scopatz, which is a hybrid of Python and Bash. It has to figure out, okay, when you say ls, do you mean ls the shell command or ls this variable that you assigned? I know that he's using PLY as the actual parser for that, so he's created some additional syntax for being able to disambiguate some of those signals. So that's an interesting point you made there, Dave: adding that lookahead might provide additional context that you can then use to interpret the original meaning of the first token that you encounter.
[00:19:38] Unknown:
So one of the things that differentiates PEGs from a lot of other methods is that they're really good at disambiguating things. A PEG always comes up with an unambiguous parse, and that really comes down to its "or" operator, its alternation operator. Reading EBNF, you'll see something, vertical bar, something, vertical bar. Well, PEG kind of tilts that 45 degrees and uses slashes as the convention, and the reason it changes the look of the operator is that it always runs from left to right. First, try the leftmost one; if that doesn't work, try the next one; if that doesn't work, try the next one. And thus it deals with ambiguity: it gives you the leftmost happy recognition of the text you give it. And when you're reading in the input for the PEG parser, are you reading in the entirety
[00:20:25] Unknown:
of a block of text before it resolves the parsing step? Or are there particular checkpoints within a given parse where you can say, okay, I've definitely closed this loop, now I can free up that block of memory and continue parsing? Bringing that forward, is there the risk of hitting a memory limit while consuming the entirety of the input before you can generate the data structure that gets output from that parsing step?
[00:20:51] Unknown:
Tobias, you have asked a tremendously penetrating question. The weakness of PEG parsing, since it's not from the seventies when we were short on RAM, is that it goes ahead and blows memory proportional to the grammar size times the input length, at maximum. And that's the case where you don't close any of those loops, as you said: you're reading along, you're reading along, you get to the second-to-last character, and oops, the last one doesn't match what I thought I was doing, so rewind the whole thing and start over again. That can totally happen in a vanilla PEG. Now, I was going to say it's still a research problem, but it's a solved research problem. You can embed in PEG grammars, though not yet in Parsimonious, things called cuts, which mean: okay, if I got this far, I am committed to such-and-such production. I'm committed to this being a while loop, for example, and so I can throw away all those partial parses I've cached along the way for speed, should I have to rewind, and continue on with my while-loop parsing. Incidentally, that's also a great place to hang your error reporting. Okay, I've committed to this being a while loop, so if something makes it not a while loop, I can say: hey, I was totally committed to making a while loop out of this, and I was expecting one of these things, but I didn't get one of these things, so go fix your program. Yeah, I'm not that familiar
[00:22:07] Unknown:
with PEG parsing, to be honest. How does it compare to something like Earley parsing?
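The caching Erik alludes to is packrat parsing: every (rule, position) result is memoized, so backtracking never reparses the same region, at the cost of the grammar-size-times-input-length memory he describes. A toy sketch with a hypothetical two-rule grammar (this is not Parsimonious's API):

```python
def packrat_parse(text):
    """Recognize 'digits' or 'digits , digits' with packrat memoization."""
    memo = {}   # (rule name, position) -> result; this table is the
                # grammar-size x input-length memory cost described above

    def apply_rule(rule, pos):
        key = (rule.__name__, pos)
        if key not in memo:
            memo[key] = rule(pos)   # parse once; backtracking reuses it
        return memo[key]

    def digits(pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return end if end > pos else None   # None means "no match here"

    def pair(pos):
        # Ordered choice: try  digits "," digits  first, else just digits.
        mid = apply_rule(digits, pos)
        if mid is not None and mid < len(text) and text[mid] == ",":
            rest = apply_rule(digits, mid + 1)
            if rest is not None:
                return rest
        return mid   # fall back; the memoized digits match is reused, not reparsed

    return pair(0), memo

end, memo = packrat_parse("12,34")
print(end, sorted(memo))
```

A "cut" would additionally let the parser discard memo entries once it commits to a production; that refinement is left out of this sketch.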
[00:22:10] Unknown:
Earley was the other candidate,
[00:22:13] Unknown:
which I am now really rusty on. So if you can generalize what Earley was again, before I find it in my Parsing Techniques book, I hope I'll answer your question. Well, I'm not an expert on Earley parsing either, except I will say that one of the big inspirations for PLY, or at least the way that it works, was another Python programmer, John Aycock, who I think is a professor up in Canada somewhere right now. He had this toolkit, SPARK, which worked in much the same way PLY does, except it used Earley parsing as its algorithm. My impression is that it could basically parse anything; it would resolve this ambiguity problem, but it had, like, these tremendous
[00:22:54] Unknown:
scaling issues. I'm probably getting this wrong, but it was like scaling n-cubed, where n was the number of input tokens or something; it could potentially have a huge memory footprint or performance hit. Well, that must be one place where they differ, because PEG tops out a little better than that. Really, it's linear: linear in grammar size times input length. So, yeah, it's not nearly as bad as n-cubed. Okay, keep in mind I'm trying to remember from like fifteen, twenty years ago, but I remember it had very different scaling properties. And when I was doing Parsimonious, when I was shopping around, I remember looking at SPARK and pulling it down and looking at the Earley parser in there. And, looking at my piece of notebook paper, I don't remember why I didn't choose it. It might have been something cosmetic, even. Who knows? I'm kind of wondering whether SPARK was the inspiration for ultimately getting, like, Python decorators or something. One of the most genius things about it that I remember was the horrible abuse of docstrings.
[00:23:49] Unknown:
So John, you know, got up there at PyCon, the Python conference, and he's like, yeah, I embedded the grammar into docstrings.
[00:23:55] Unknown:
Oh, yeah, I copied it off of him. I went to the Python
[00:24:00] Unknown:
conference, I saw that, and I was like, that is just so diabolically awesome, I stole it. I took the idea straight into PLY, and this predated even function attributes and other stuff in Python. I think maybe the motivation for function attributes, and maybe decorators and stuff later on, was to get people to stop embedding hacks into docstrings. Once you unleash that on the world, it's like, oh yeah, we could abuse docstrings and put all sorts of cool non-documentation stuff in there. You want to know something else funny? I think I've gotten at least two different pull requests for Parsimonious
[00:24:37] Unknown:
to start sticking bits of grammar into docstrings of the visitor procedures. It's not a bad thing unless, of course, you want to get two different results out of a parse. You know, you get this tree, and then I wanna render text out of it and also HTML out of it. Well, then you're kind of hard-coupled together, and it's no fun anymore.
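The text-versus-HTML point in miniature: one parse tree, two independent renderers, which is exactly what you give up if the grammar is welded to a single visitor. The tuple-based node format is made up for illustration; it is not parsimonious's actual tree type:

```python
# One tree, two renderings. Nodes are ('kind', payload) tuples, where
# payload is either a string (for leaves) or a list of children.

tree = ('doc', [('bold', [('text', 'hello')]), ('text', ' world')])

def render_text(node):
    kind, payload = node
    if kind == 'text':
        return payload
    return ''.join(render_text(child) for child in payload)

def render_html(node):
    kind, payload = node
    if kind == 'text':
        return payload
    inner = ''.join(render_html(child) for child in payload)
    return f'<b>{inner}</b>' if kind == 'bold' else inner

print(render_text(tree))  # hello world
print(render_html(tree))  # <b>hello</b> world
```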
[00:24:57] Unknown:
So once a grammar has been consumed along with the input to the parser, what does the data structure look like after the parsing run has been completed? It's just a nested list. It's a tree of some kind. Mhmm. Do you wanna hear about the individual libraries, or just kind of in general that it's gonna be a tree of some kind? Yeah. In general, but also, I guess, more specifically, once it has been parsed and you do have that data structure, what are some of the ways that you might work with it to transform that parsed input into some form of logical output, whether it's rendering it as text or, in particular for things like where you're trying to build a DSL on top of an existing programming language, being able to render that parse tree into logical operations.
[00:25:39] Unknown:
Oh, this is where all the fun begins. Yeah, I guess there's a lot of things you could do. I mean, you could do type checking. This would be more like program validation stuff. You know, just because you've parsed it doesn't mean it's a valid program. So you can get into all sorts of, you know, program-checking things. It could take you in the direction of, like, linter tools, you know, PyChecker-type tools, that sort of stuff. You could do optimizations. You could do tree transforms. Hey, these are adding two constants together. Heck with that. Merge them and just put a single constant in there. You could flatten it back out again if, say, you're trying to pretty print a program. Maybe you're trying to change something from wiki syntax over to HTML or something like that. Yeah. So it sounds like at this point, it's largely up to the individual user to determine how best to actually use that data structure. And from that point on, it's in others' hands and best left to other libraries or techniques.
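The constant-merging transform just mentioned, sketched over a hypothetical nested-tuple AST (not tied to either library's node types):

```python
# Fold ('+', const, const) subtrees into a single constant, leaving
# everything else structurally intact.

def fold(node):
    """Recursively collapse additions of two integer constants."""
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if op == '+' and isinstance(left, int) and isinstance(right, int):
        return left + right
    return (op, left, right)

tree = ('+', ('+', 2, 3), ('*', 'x', ('+', 1, 1)))
print(fold(tree))  # ('+', 5, ('*', 'x', 2))
```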
[00:26:31] Unknown:
But what can't you do with a nested list? What can't you represent with a nested list, really? I mean, just look at any Lisp. Right? There's a Turing-complete whatever-you-want in nested lists. Enjoy yourself.
[00:26:41] Unknown:
Yeah. Speaking of which, there's actually a Lisp that was implemented on top of the Python VM called Hy that I did an episode about a while back, and it was pretty interesting hearing about some of the things that they're doing with it. Like, I know that the creator of the project actually managed to use Lisp to backport the yield from capabilities from Python 3 to Python 2.7. Okay. Yeah.
[00:27:04] Unknown:
Yeah. That is a nice hack. Yeah. I mean, the only thing with the Lisp people, though, is if you really wanna get their ire up, you know, talk about the problem of recursively
[00:27:16] Unknown:
navigating a nested list if it gets too deep. That's one thing in Python. Python's not very good at recursion. If you make a giant parse tree and it's too deep, you could blow the recursion limit. Right. Actually, I plan to play with this. I haven't done it yet. But parsimonious, being a recursive descent parser, can run into those recursion limits. I mean, you'd have to be parsing something fairly pathological, but, hey, who knows? It could happen. And one way around that in Python is you do some kind of trampolining, where instead of functions calling functions, your function goes one level deep, and it kind of returns the next function to call to its caller, and then the caller just calls that thing you returned. You just kinda bounce around in this one loop. And a nice side effect of that, besides not blowing the recursion limit, is it also gives you speedups under something like PyPy, which has a JIT that only kicks in after you've spun around a flat loop a hundred times. Like, parsimonious under PyPy doesn't go that much faster: 2x, 3x speedup. You turn this thing into a trampoline loop, and you get your 10x speedup. Wow. Interesting. Okay. Yeah. I did something like that in the cookbook. The other way of traversing
[00:28:17] Unknown:
parse trees is to use variations of the visitor pattern. Sure. Which is, yeah, parsimonious just throws one of those into the box. So I got this idea in the cookbook. It's like, oh, I wonder if I could write, like, a nonrecursive visitor pattern using generators. Sure. This is kind of the same idea, and it's just a completely insane recipe. I mean, people will know it if they find it, but it's, yeah, a nonrecursive visitor pattern with generators. And it works. I feel a little bit sorry for anybody who would just wander into that recipe. Actually, at one point the plan with the publisher was to put, like, difficulty grades on the different recipes, you know, kinda like ski slopes. You know, like, you have the beginner slope and the intermediate slope and the advanced slope. And it just sort of never happened. So none of the recipes in there are actually marked with any difficulty levels. Like, I feel sorry for anybody who would just wander into that one. Would it have been a double black flaming skull slope? Oh, yeah. Yeah. It's insane. But it's kinda cool. I think it may use the trampolining thing or some variation of that. Yeah. I mean, any variation of holding your own stack on the heap would suffice. Yeah. It's too bad I didn't write it now, because then I could have dropped async and await on it or something. That would have made it really confusing for somebody.
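A simplified sketch of that recipe's idea: visit methods yield their children instead of recursing, and a driver keeps the stack on the heap, so even an absurdly deep tree can't blow Python's recursion limit. The node class and driver here are illustrative, not the Cookbook's actual code:

```python
# Generator-based, non-recursive tree traversal: the stack lives on the
# heap as a list of suspended generators instead of on the C call stack.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def visit(node, results):
    # A generator "visitor": yields each child for the driver to handle,
    # then resumes once that child's result has been recorded.
    total = node.value
    for child in node.children:
        yield child
        total += results[child]
    results[node] = total

def traverse(root):
    results = {}
    stack = [visit(root, results)]
    while stack:
        try:
            child = next(stack[-1])       # ask the top generator for work
            stack.append(visit(child, results))
        except StopIteration:
            stack.pop()                   # that subtree is fully summed
    return results[root]

# A degenerate, very deep chain: 50,001 nodes each holding the value 1.
deep = Node(1)
for _ in range(50_000):
    deep = Node(1, [deep])
print(traverse(deep))  # 50001 -- plain recursion would have died long ago
```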
[00:29:35] Unknown:
So for people who are interested in using a parser or developing a grammar for their own domain specific language or even an alternative language implementation on top of something like Python, what are some of the considerations that they should be aware of while creating that grammar?
[00:29:49] Unknown:
I think they should probably think about whether their grammar is like a programming language or whether it's more like a natural language. Maybe that's not a very precise thing, but a tool like yacc is very much geared at programming languages. You know? Does it look like C or Java or something like that? That's what it's really good at. If you get into more, like, I don't know, natural language things, like parsing English or something, it's probably not a good tool for that.
[00:30:15] Unknown:
Right. Yeah. Well, if somebody is building a DSL, I would hope that they're not trying for an overly natural language type tool. But I know that, for instance, the Gherkin syntax from Cucumber, while it is using English terminology, there is a certain structure to it. So I imagine that you could potentially use something like ply or parsimonious to actually turn that into something that's able to be interpreted by a computer program.
[00:30:39] Unknown:
One thing I consider is, do you need to run in a RAM-constrained environment? Does your compiler need to run on an embedded system or something like that? If so, watch your lookahead. You know? Make sure you design your language such that you don't need to look ahead more than, you know, one token. Then you can use an LALR(1) parser like lex or yacc or like ply. Another consideration is to try to disambiguate your constructs as early as possible, so that as you're parsing, you don't have a lot of rewinds. You don't have a lot of wasted work from your parser going back and trying to interpret something in a different way. Which is basically the same thing when you write a regex. Right? If you're trying to write an efficient regex, you kinda put the high-entropy stuff up first and say, well, you know, it's not one of these character classes, so it can't possibly be this next token that's after my alternation. And the other thing I like to think of is trying to have a good convention for white space. Once you've designed your language and you're trying to actually lay the grammar out in a formal syntax, you know, what do you do about white space? There are some parsing kits which will just ignore all white space for you, which is fine if you're writing kind of a run-of-the-mill C-like programming language. But if white space means something, I don't know, kind of not like in Python, but where you're actually caring about white space to convey some semantics, you may wanna preserve it in your grammar. And, well, with PEGs, I like to hang the white space off the right side of my leaf nodes, and then at the root node put a little white space absorber just to the left, and that makes it all work perfectly.
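What "one token of lookahead" looks like in a hand-written parser: a toy grammar of comma-separated signed integers, where every branch is decided by peeking at a single next token and nothing is ever rewound. (The grammar and tokenizer are made up for illustration.)

```python
import re

def tokenize(text):
    # Tokens: integer literals, signs, and commas. Spaces are dropped.
    return re.findall(r'\d+|[-+,]', text.replace(' ', ''))

def parse_int_list(text):
    tokens = tokenize(text)
    pos = 0

    def peek():                        # the single token of lookahead
        return tokens[pos] if pos < len(tokens) else None

    def parse_int():
        nonlocal pos
        sign = 1
        if peek() in ('+', '-'):       # one peek decides the branch
            sign = -1 if tokens[pos] == '-' else 1
            pos += 1
        value = sign * int(tokens[pos])
        pos += 1
        return value

    values = [parse_int()]
    while peek() == ',':               # one peek decides whether to loop
        pos += 1
        values.append(parse_int())
    return values

print(parse_int_list('1, -2, +30'))  # [1, -2, 30]
```

Because no decision ever needs a second token, the parser sweeps left to right in one pass, which is exactly the property that keeps LALR(1)-class tools cheap on memory.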
[00:32:06] Unknown:
And what are some of the most interesting or innovative or downright scary uses of parsers that either of you have seen, whether it's ply or parsimonious or any parser from another language tool chain as well?
[00:32:19] Unknown:
I mean, one interesting use I have seen is using parsing toolkits to parse network protocols. I've seen that done with OMeta. Yeah. You could potentially do something like that with ply, actually. You can often describe, like, a network protocol in terms of a grammar. Why not? A string is a string. And then run it through something like ply. That actually works reasonably well for something like an LALR(1) parser, because it's basically just one token of lookahead, and it's just kinda sweeping through your data. So I've certainly seen that.
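The "a string is a string" point in miniature: an HTTP/1.1 request line treated as a sentence in a tiny grammar. This particular production is regular, so a single regex sweep stands in for the parser; a protocol with real nesting would want ply-style rules instead:

```python
import re

# request-line = method SP request-target SP "HTTP/" digit "." digit
REQUEST_LINE = re.compile(
    r'^(?P<method>[A-Z]+) '
    r'(?P<target>\S+) '
    r'HTTP/(?P<major>\d)\.(?P<minor>\d)$'
)

def parse_request_line(line):
    m = REQUEST_LINE.match(line)
    if m is None:
        raise ValueError(f'malformed request line: {line!r}')
    return (m['method'], m['target'], (int(m['major']), int(m['minor'])))

print(parse_request_line('GET /index.html HTTP/1.1'))
# ('GET', '/index.html', (1, 1))
```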
[00:32:52] Unknown:
I mean, I've seen the inverse. I've seen the thing turned upside down and made into a generator. There was a guy who went to music school the same time I did at Penn State, and he turned out a thing that created novel 12-tone music pieces from a generative grammar. I think my favorite kind of parsing bonanza is Lisp, which barely needs a parser at all. It practically is a parse tree on the screen. And, of course, the power you get out of that is you get all the synergies from your program being the same as your data structures, and one of those is your program becomes easily self-modifying, right, these wonderful Lisp macros that take in something that looks like Lisp, that might be a DSL, and just kind of tweak it and turn it around and turn it into valid Lisp, and that goes on and you make machine code out of that. It's much more trivial than trying to, you know, read Python code, where I have to lex it, and then I have to get it all structured again, and then I can do my job. I think parsers can often show up in really unusual places too. I had coffee with somebody, I guess, a year ago. It was somebody who had taken a course with me, and we'd done some parsing stuff. He's like, Dave, I need to buy you coffee. So we had coffee, and he told me about how what he learned doing the parsing stuff helped him write code that was used to save the healthcare.gov website.
[00:34:08] Unknown:
I was like, really? You know, not remembering all the details, but one of the issues with that health care site is that, I guess, they had all these entities trying to talk to each other, and it was just this communication nightmare of stuff. And he said he was using some parsing tools to, like, basically grab data and, you know, sort of massage the data into standard formats and other things. So, you know, it's not even programming related. It's using some kind of parsing tool to parse, like, health insurance policy data or something. So it kinda blew my mind. Never would have occurred to me that what you learn doing programming language stuff would apply to that. But kinda being a programming language wonk, I do fall into the, hey, parse it and you're done,
[00:34:54] Unknown:
rut. But parsers are a big part of natural language processing as well. I mean, I'm not sure how part-of-speech tagging works, but it must be something like running parsers across sentences, and then they fail, and then you make a different guess about what kind of part of speech that must be. Well, I know too that a lot of times when you're doing part-of-speech tagging, the first thing you'll do is run it through a stemmer, where you're stemming the words and figuring out the word roots. Basically, we have the beginnings of words that are similar, with different endings depending on whether it's a plural or singular form or things like that. Run and running and ran. Right. And once it's been stemmed, then a lot of times you'll be able to figure out the part-of-speech tagging. You'll use sort of the grammar syntax of the language where, for instance, in English, the verb goes before the noun. And so then you say, okay, if this happens and then this happens, then this is probably the subject of the sentence, and then also use punctuation to determine where the sentence boundaries are. That's essentially where the parser comes into play, where you say, okay, this is punctuation, so that means that this prior token has this particular meaning, or this was recognized as a verb, so that means that the next thing is a noun, stuff like that. Although, of course, if you prematurely stem your words, you lose a piece of signal. Like, you know, I don't know. Let's say,
[00:36:04] Unknown:
how does English work again?
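The stemming step from the exchange above, as a crude suffix stripper. This is nothing like a real Porter stemmer, and note how an irregular form like "ran" defeats suffix stripping entirely, which is part of the signal-loss point:

```python
# A toy stemmer: strip the longest matching suffix, but only if a stem
# of at least three characters remains. 'ning' is listed before 'ing'
# so that doubled-consonant forms like 'running' map to 'run', not 'runn'.

SUFFIXES = ('ning', 'ing', 'ed', 'es', 's')

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

for w in ('run', 'running', 'runs', 'ran'):
    print(w, '->', crude_stem(w))  # run, run, run -- but 'ran' stays 'ran'
```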
[00:36:06] Unknown:
Exactly. And my recollection of taking a natural language processing class, it was just like, this is insane. I mean, I love collecting some of these sentences, you know, that I see every now and then. There's one that made the rounds. It was, like, a picture of this giant boat, and on top of the boat were, like, other boats. You know, it was this stack of boats and stuff, and there was this sentence. It was like, a ship-shipping ship shipping shipping ships. Oh. It was some crazy thing like that, and it's just like, oh, god, I had to parse that? Here's one for you. Have you ever heard this one? Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. Yeah. I heard that one.
[00:36:46] Unknown:
That's a valid English sentence. Yeah.
[00:36:50] Unknown:
I sort of wonder, you know, I'm not an AI person, and I know, doing the natural language stuff, there's also a whole statistical side of that. There's one side where it's like, oh, we're gonna parse it according to a grammar. We're gonna build structure out of it, almost like what we do with programming languages. But then you have the other side where it's like, we're just gonna do statistics on it. And that probably takes you more into things like sentiment analysis and other things where they just do, like, a bunch of statistics on the text. You know, it's like, is the text positive or is it negative? Are they talking about terrorists or whatever? You know, that's not really my area. I don't know how much people are doing, like, the structural side of that. A couple of interesting tools along those lines are spaCy,
[00:37:33] Unknown:
from Explosion AI by Matthew Honnibal, and Gensim by Radim Řehůřek and RaRe Technologies. I actually did episodes about both of those tools. But, yeah, it's definitely interesting hearing more about the idea of natural language processing. And like you said, the sort of conjunction of parsing and statistics to be able to infer meaning from what is, at some level, structured, but also inherently free-form in terms of how we communicate using natural language.
[00:38:01] Unknown:
And speaking of that overlap, for our Python-using listeners, first of all, if you haven't looked at NLTK and you're interested in natural language processing, go look at NLTK right now. It's fantastic. There's a great book on it. It's free and everything. And it's full of parsing classes, including a probabilistic context-free grammar parser. That is definitely interesting.
[00:38:20] Unknown:
I've been aware of NLTK. I haven't really had much cause to do a lot of work with it, but, yeah, it's definitely a well-regarded project and one that has been sort of a cornerstone of the Python community for a while now. So definitely deserving of further interest and investigation for anybody who's out there and working in the space. Yeah. I'd second that as well. So are there any other topics about parsing or either of your respective libraries that you think we should talk about?
[00:38:46] Unknown:
Well, I have a couple of comments. The first is more of a story. I recently got a bug report for ply. Came out of the blue in October. Somebody found this critical bug in, like, an absolutely essential part of ply that had never been found, given that ply has been out there, like, 17 years. I'm looking at this bug, and I'm like, how can it be that nobody has ever found this bug? And the bug came in the form of a pull request on GitHub. And for the person who submitted it, that was the only thing that he'd ever submitted on GitHub, like, in two years. It was just this one pull request, this one day, on ply, for this insanely deep, horrible bug, and it just kinda blew my mind. The only reason I mention it is that, you know, sometimes it's easy to get into this game of, like, your score on GitHub. You know, like, how many pull requests, how many bug reports you've submitted, and so forth. And this one person, with, like, the one pull request in two years, fixed this amazingly hard bug. I just thought that was amazing. So if anybody's out there wondering, like, oh, I wonder if my GitHub score is high enough: just ignore that. That's all I would say. Imaginary Internet points are not going to make you a happy person. Good advice. They're not going to be fulfilling in the long run. The only other thing I would say is I've actually rewritten ply into this new li— oh, yeah, I have this new library called sly. It's so secret that nobody knows about it. But, no, when I did ply, you know, in 2001, there were, like, all these missing Python features. There were no generators, no decorators, no closures. I mean, frankly, it was just this horrible environment. And ply has long sort of suffered by not being able to take advantage of those.
So the sly project is basically a reimplementation of ply, assuming that all versions of Python prior to 3.6 don't exist. So it's basically just, like, free license to use every insane advanced feature out there. It kinda takes a different approach, uses a lot of things with metaclasses and other exotic techniques. But if you wanna go on the bleeding edge, I would look at that. That's definitely interesting. I'll have to hunt that down and take a peek at it. It's definitely great when you have the opportunity
[00:41:00] Unknown:
to start fresh from lessons learned, but using a more advanced set of tools to be able to rebuild the thing that you've been struggling with for a long time. Yeah. It's kind of fun, actually. I've been kind of in a reinvention
[00:41:12] Unknown:
mood lately, and it's like, I'm gonna redo this thing for Python 3.6 and, like, forget about backwards compatibility.
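A toy of the metaclass technique mentioned for sly. This is purely an illustration of the general idea, not sly's actual implementation (sly's real API uses an `@_()` decorator, which the hypothetical `rule` decorator below loosely imitates):

```python
# The metaclass watches the class body being defined and gathers every
# method tagged with a grammar production into a rule table, in
# definition order.

def rule(production):
    """Decorator that tags an action method with its grammar production."""
    def mark(func):
        func.production = production
        return func
    return mark

class GrammarMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        cls.rules = [
            (obj.production, attr)
            for attr, obj in namespace.items()
            if callable(obj) and hasattr(obj, 'production')
        ]
        return cls

class Calc(metaclass=GrammarMeta):
    @rule('expr : expr PLUS term')
    def expr_plus(self, p):
        return p[1] + p[3]

    @rule('expr : term')
    def expr_term(self, p):
        return p[1]

print(Calc.rules)
# [('expr : expr PLUS term', 'expr_plus'), ('expr : term', 'expr_term')]
```

The payoff over the docstring hack is that the grammar rides along as ordinary metadata, and the class machinery can validate the whole rule table at class-creation time.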
[00:41:20] Unknown:
Just sort of see where you can push it. So, Erik, do you have anything else to add before we start to close out the show? Well, I just encourage anybody with the
[00:41:28] Unknown:
slightest interest to build yourself a parser or use an off-the-shelf kit, because it's a gateway drug to so many interesting computer science subfields. Natural language processing: you could bust out NLTK and play with their stuff. You can do programming languages, I mean, or write compilers and optimizers. And even, and this is something I seem to do on a repeat-offender basis, just create custom little query languages. DXR has a little Google-like query language. It's an easy thing to do, and, you know, you tend to avoid the sorts of corner cases you run into when you're trying to use, you know,
[00:42:02] Unknown:
splits on white space and then do more splits and then regexes. And then you realize you might need quotation marks in something, and so you slap on some kind of escaping thing. And before long, you've made a mess. So parsers for the win. Talking more about this stuff and thinking about it, I can see a few pieces in my day to day where I'm doing a lot of work with SaltStack and configuration management, where having a parser in place for being able to interpret the input to SaltStack that then gets regenerated out into another tool chain would be useful. You know, one of the more ugly things is I actually wrote, what would you call it, a somewhat recursive function to take a Python data structure and then generate a valid Erlang configuration file from it, which was not a fun thing to do.
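The kind of recursive emitter just described might look something like this: a hypothetical walker from a Python dict to Erlang-flavored proplist syntax. The output is a rough approximation for illustration, not a faithful rendering of any real Erlang config format:

```python
# Recursively emit Erlang-ish terms from Python values: dicts become
# proplists of {key, Value} tuples, lists become lists, strings become
# quoted strings, and booleans become the atoms true/false.

def to_erlang(value):
    if isinstance(value, dict):
        pairs = ', '.join(f'{{{k}, {to_erlang(v)}}}' for k, v in value.items())
        return f'[{pairs}]'
    if isinstance(value, list):
        return '[' + ', '.join(to_erlang(v) for v in value) + ']'
    if isinstance(value, str):
        return f'"{value}"'
    if isinstance(value, bool):  # checked before the numeric fallthrough
        return 'true' if value else 'false'
    return str(value)

config = {'log_level': 'info', 'ports': [5672, 5671], 'ssl': False}
print(to_erlang(config))
# [{log_level, "info"}, {ports, [5672, 5671]}, {ssl, false}]
```

A parser-driven pipeline would replace this ad hoc walker with a parse tree in the middle, so the same input could be emitted into several formats.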
[00:42:46] Unknown:
Yeah. Alright.
[00:42:48] Unknown:
Definitely a job for a parser. Yeah. I think another fun thing to look at, if you wanna go in a different direction, is LLVM. Sure. As well. I mean, one thing you can sort of piggyback on top of is the work that the Numba project has done. Amazing. If you haven't seen that, I mean, it can take, like, Python functions and take them down to machine code using LLVM. But they've actually, like, if you download Anaconda or something, you can actually get the Python libraries that will let you interface with LLVM, and you can do your own thing, which is kind of insane and awesome at the same time. Like, make your own just-in-time compilers and other things. So, you know, that might be something to kinda look at too, just to sort of try to piggyback off that project. I mean, nobody ever thinks of it, but with the built-in ctypes module in the standard lib, you can jump to arbitrary, you know, buffers of machine instructions that you pull together on your own. I think that's actually how they do it. I'm not exactly sure on Numba. I know you can do it with LLVM, where it's like, you can make a fragment of machine code just sitting there in RAM, and then you can have ctypes call it. Yeah. Why not? And it works. It's cool.
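The ctypes trick just described, sketched for x86-64 Linux only. The machine code is architecture-specific (and some hardened systems forbid writable-and-executable memory), so everywhere else this sketch simply skips itself:

```python
# Put raw machine instructions into an executable buffer and have
# ctypes call into it. x86-64 Linux only; elsewhere it is skipped.

import ctypes
import mmap
import platform
import sys

def jit_return_42():
    # x86-64: mov eax, 42 ; ret
    code = b'\xb8\x2a\x00\x00\x00\xc3'
    buf = mmap.mmap(-1, len(code),
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(code)
    address = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    func = ctypes.CFUNCTYPE(ctypes.c_int)(address)
    return func()

if sys.platform == 'linux' and platform.machine() == 'x86_64':
    result = jit_return_42()  # expected: 42
else:
    result = None  # unsupported platform for this particular byte string
print(result)
```

Real JITs like Numba lean on LLVM to generate those bytes instead of hand-assembling them, but the final call-into-a-buffer step is the same idea.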
[00:43:57] Unknown:
It's Yeah. And on the front of building your own JIT, now that Pyjion is becoming part of the core language, it's actually something that more people may be inclined to do. It is? I believe it got included in 3.6. Last time I spoke with Brett Cannon, there was, like, a frame hook or something added. Yeah. Clearly, I have to re-up. Is Pyjion the Microsoft thing? And the Dropbox one, what is that? Pyston, that's the Dropbox one? Pyston is the Dropbox thing, which is a just-in-time compiled Python language VM implementation.
Pyjion is the Microsoft project that is an add-on to the CPython runtime that allows you to plug in a JIT, and it has a reference implementation that plugs the CoreCLR JIT into the CPython runtime.
[00:44:45] Unknown:
That could be an interesting thing to play with. Yeah. I was looking at that thinking, I wonder what kind of evil thing I could do with that frame hook. Yeah. That's probably a topic for another day.
[00:44:55] Unknown:
Right. So for people who want to keep in touch and follow what you guys are up to, I'll ask you to send me your preferred contact methods, and I'll put those in the show notes. And with that, I'll move us into the picks. My pick today is a terminal program that I started using recently called Terminix. I had been using Terminator for a while for its ability to arbitrarily split screens both horizontally and vertically in whatever configuration you choose. And I was becoming stymied by the inability to have an effective find capability where it actually highlights the word that you're looking for. And so through a bit more searching, I found Terminix, which has both of those things as well as a couple of other niceties, including the ability to have it run in a drop-down mode similar to the Quake terminal. So I've been using it for a while. It seems to do everything I want, and it's definitely been pretty pleasant to use. It's a bit of an upgrade from Terminator, which is another great project. So for you on Linux, I definitely recommend checking it out. And with that, what do you have for us, Erik? Well, my pick is somewhat
[00:45:51] Unknown:
unusual. It's a game from 1997. It's called Riven, and it's a sequel to Myst. Great game. Played it back in '97. You know, I got it for Christmas when I was about 2 and a half years old or something. I played this thing, and I really enjoyed it. Well, I came back 10 years later and played it again. And, boy, it is a work of art. It is more like a virtual theme park or an epic novel than a computer game. It's pretty much like Myst, except its puzzles are more naturally part of its story. They're not these arbitrary Rubik's cubes that get in your way. Yeah. And it's a much larger game as well, so there's more to explore. It is humongous. The whole thing was rendered in Softimage, which is the same piece of software they used to do the effects in Jurassic Park. And it was rendered as one huge world. It's not like Myst, where you're bumping from self-contained age to self-contained age all the time. And so as a result, this thing would take hours just to load, even on their SGI workstations. In fact, Riven is actually a lost piece of art. They're not able to render it anymore. They're not able to run Softimage or whatever it is. They can no longer render it. So all we have today is this set of dithered 8-bit renderings at, like, just under 640 by 480. It's really kind of sad, but you can totally play it today under ScummVM if you can get a hold of the original disks. Or, a little easier, it's available on iOS, and there's even an Android version coming. And then there are these crazy people who are recreating Riven as a labor of love, not just as a kinda clicky-through slideshow, but as a 3D real-time rendered thing using the Unreal 4 engine. And they have the full blessing of Cyan, who owns the
[00:47:22] Unknown:
IP on it. So I'm excited to walk around in that when it's finished. That's pretty cool. Dave, do you have any picks for us? I'm gonna have to figure out some way to split terminals and games here. Well, maybe a bit of both. I guess over the last year, I've been fooling around with this iTerm2 terminal on the Mac. And one of the fun things about that is it has the ability to display inline images straight into, like, a terminal text session. And so I've been using that a fair bit over the last year for things like conference presentations and doing things like injecting impossible font renderings.
So I gave a talk in Chicago where I was doing some stuff in the terminal where there were, like, fonts that were impossible. People were looking at me. Like, I had a portion of it where I was using the Python interactive REPL, and then I put this thing up, and it had, like, curly braces rotated 45 degrees. Oh. Going off at an angle somewhere. And I didn't really say anything about it. I just kinda let it fly, you know, like nothing was going on. And what was not realized was that was actually an inline PNG image of a font rotated 45 degrees. So people kinda came up afterwards. They're like, what were you doing in the terminal? And I made up some story. Like, well, I'm using some lesser-known VT codes, you know, for the advanced DEC terminal. Not saying, like, well, actually, those were all inline images displayed in iTerm.
So I've been having a bit of fun with that. I guess on the game front, I've been wasting way too much time landing Kerbals lately. So if you're into the Kerbal Space Program game, that's probably single-handedly killed a lot of my productivity over the last month or so. It's a lot of fun, though. I don't know.
[00:49:10] Unknown:
Alright. Well, I really appreciate both of you taking time out of your day to talk to me and share your knowledge of and interest in parsers. I've definitely learned a lot from it, and I'm gonna have to take some deeper looks at how I might use it in my day to day. And I'm sure that a number of other people are going to be inspired to do the same. So thank you both for your time. It's been a pleasure, Tobias, and likewise, Dave. Alright. Thanks a lot. It was fun.
Introduction to Parsing with Erik Rose and Dave Beazley
Dave Beazley's Journey into Python and Parsing
Erik Rose's Introduction to Python and His Work at Mozilla
The Creation and Evolution of Ply and Parsimonious
Differences Between Parsers and Regular Expressions
Understanding Grammars and Their Role in Parsing
Handling Ambiguity in Parsing
Post-Parsing: Working with Parse Trees
Considerations for Creating Grammars
Interesting and Unusual Uses of Parsers
Natural Language Processing and Parsing
Rewriting Ply: The Sly Project
Closing Thoughts and Contact Information