Summary
Programming languages are a powerful tool and can be used to create all manner of applications; however, sometimes their syntax is more cumbersome than necessary. For some industries or subject areas there is already an agreed upon set of concepts that can be used to express your logic. For those cases you can create a Domain Specific Language, or DSL, to make it easier to write programs that can express the necessary logic with a custom syntax. In this episode Igor Dejanović shares his work on textX and how you can use it to build your own DSLs with Python. He explains his motivations for creating it, how it compares to other tools in the Python ecosystem for building parsers, and how you can use it to build your own custom languages.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Igor Dejanović about textX, a meta-language for building domain specific languages in Python
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what a domain specific language is and some examples of when you might need one?
- What is textX and what was your motivation for creating it?
- There are a number of other libraries in the Python ecosystem for building parsers, and for creating DSLs. What are the features of textX that might lead someone to choose it over the other options?
- What are some of the challenges that face language designers when constructing the syntax of their DSL?
- Beyond being able to parse and process an arbitrary syntax, there are other concerns for consumers of the definition in terms of tooling. How does textX provide support to those end users?
- How is textX implemented?
- How has the design or goals of textX changed since you first began working on it?
- What is the workflow for someone using textX to build their own DSL?
- Once they have defined the grammar, how do they distribute the generated interpreter for others to use?
- What are some of the common challenges that users of textX face when trying to define their DSL?
- What are some of the cases where a PEG parser is unable to unambiguously process a defined grammar?
- What are some of the most interesting/innovative/unexpected ways that you have seen textX used?
- What have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building and maintaining textX and its associated projects?
- While preparing for this interview I noticed that you have another parser library in the form of Parglare. How has your experience working with textX informed your designs of that project?
- What lessons have you taken back from Parglare into textX?
- When is textX the wrong choice, and someone might be better served by another DSL library, different style of parser, or just hand-crafting a simple parser with a regex?
- What do you have planned for the future of textX?
Keep In Touch
- Website
- igordejanovic on GitHub
- @dejanovicigor on Twitter
Picks
- Tobias
- Igor
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- textX
- U of Novi Sad
- Serbia
- DSL course
- Secondary Notation
- Django
- Xtext
- Eclipse
- PLY
- SLY
- PyParsing
- Lark
- PEG Grammar
- Language Workbench
- Language Server Protocol
- Visual Studio Code
- textX-LS
- Arpeggio Parser
- Context-Free Grammar
- pyTabs
- Guitar Tablatures
- Parglare
- GLR parsing
- TEP 1
- Evennia
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey. And today, I'm interviewing Igor Dejanović about textX, a metalanguage for building domain specific languages in Python. So Igor, can you start by introducing yourself?
[00:01:29] Unknown:
Hi, Tobias. Thanks for having me. Sure. I'm Igor Dejanović. I work as a professor at the University of Novi Sad, teaching several courses in software engineering. The most relevant for this podcast is probably the course on DSLs, and that's one of the reasons why textX exists, actually.
[00:01:51] Unknown:
Alright. And do you remember how you first got introduced to Python? Well, I remember it was relatively late,
[00:01:57] Unknown:
since Python has been around since the nineties. I used different languages back then, and I even missed the opportunity to use Python in the days when I did a lot of sysadmin stuff, where Python is used a lot. I think I first tried to pick up Python in 2008, but I remember that I was put off by semantic whitespace at that time, because I didn't actually have experience with any language that did something like that. So I remember I backed off a little bit, and then tried Python again, I think in 2009. And I decided that time to give it a few days.
Let's try it for a couple of days and see how things will go. And I remember that, day after day, I actually started to enjoy even the whitespace stuff. The nesting with whitespace started to feel very, very logical to me. And later on, I learned there is actually a name for that in the DSL literature. It's called secondary notation: the part of the language syntax that does not have specific semantics, but that you can use freely. That's whitespace, for example, in a textual language, or in a graphical language it's colors and shapes and positions. And actually, if you have a lot of secondary notation, you get all sorts of readability issues, because people tend to develop their own styles. Python did quite well here, because reducing secondary notation actually improves readability.
So, yeah, I actually love it a lot now.
[00:03:34] Unknown:
And then in terms of the context of domain specific languages, can you give a bit of a description about what they are and some of the cases where you might need to build one versus just using a general purpose programming language?
[00:03:48] Unknown:
A DSL is a language tailored and constrained to a particular domain. It is at the right level of abstraction, which enables its users, the domain experts, a higher level of expressiveness by removing unnecessary information that is part of common understanding. What do I mean by that? For example, imagine two lawyers. If they're talking about some legal issue, because they are operating in the same domain, they can remove all unnecessary information that is common understanding. So their expressiveness is higher; they can use shorter forms to convey information between each other. But if they are about to explain something to a person that is outside of the legal domain, they will have to be much more verbose, because they do not share that common understanding.
Besides using the concepts of a domain, DSLs also use a concrete notation that is used in the given domain, thus ideally making the domain expert capable of specifying the solutions on their own. Of course, in practice that is not always achieved, but even if the domain expert is not using the DSL directly, it is much easier to communicate with the developer when they're looking at a notation familiar to them. DSLs come in different forms and shapes. We have, for example, internal and external DSLs. Internal DSLs are those built inside the host language.
For example, if you use some clever features of a language, you can make something that looks like a different language but is actually interpreted or compiled by the same compiler. Some languages are more capable in that direction. For example, Lisp is well known for being very capable when it comes to DSLs, and Lispers usually create DSLs all the time. Then we have some of the more contemporary languages; Ruby, for example, is also very popular for building internal DSLs. Even in Python, which is not very capable of building internal DSLs, we see internal DSLs all the time. For example, take Django as a web application framework. In Django you have the definition of a data model: you create a class that extends the model class, and then you specify some class attributes as instances of fields, and then Django is capable, from that description, of dynamically generating all kinds of stuff, like, for example, an object relational mapper, an SQL schema, or an admin interface for CRUD operations.
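The declarative-class pattern Igor describes can be sketched in plain Python with a metaclass. This is not Django's actual implementation, just a minimal illustration of the internal-DSL idea: class attributes describe the data model, and the framework derives other artifacts (here, a toy SQL schema) from that description. All names here (Field, ModelMeta, sql_schema) are hypothetical.

```python
class Field:
    """Placeholder standing in for something like Django's models.CharField."""
    def __init__(self, kind):
        self.kind = kind

class ModelMeta(type):
    def __new__(mcs, name, bases, namespace):
        # Collect class attributes that are Field instances, Django-style.
        fields = {k: v for k, v in namespace.items() if isinstance(v, Field)}
        cls = super().__new__(mcs, name, bases, namespace)
        cls._fields = fields
        return cls

class Model(metaclass=ModelMeta):
    @classmethod
    def sql_schema(cls):
        # From the declarative description, derive e.g. an SQL schema.
        cols = ", ".join(f"{n} {f.kind}" for n, f in cls._fields.items())
        return f"CREATE TABLE {cls.__name__.lower()} ({cols});"

class Article(Model):
    title = Field("VARCHAR(100)")
    body = Field("TEXT")

print(Article.sql_schema())
# CREATE TABLE article (title VARCHAR(100), body TEXT);
```

The class definition reads like a small declarative language, yet it is ordinary Python compiled by the ordinary interpreter, which is exactly what makes it an internal DSL.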
On the other hand, external DSLs are full blown languages on their own. They have their own syntax. We have to build a compiler or interpreter for them, so they are much harder to build and maintain. And then, of course, we have other categorizations, for example by concrete syntax: we have textual or graphical or some other notation like tabular. Those languages are all DSLs, just with a different interface to the user. And why should we use DSLs? Well, first of all, when you are constrained in what you can say and you use only the domain concepts, you can express the solution in a much more condensed form, so you're more expressive, because the commonly understood stuff is hidden from you. It's built into the tooling platform, into the compiler, and that enables us to be more productive.
There are some case studies. For example, there is a case study by MetaCase done at Nokia, I think maybe 10 years ago, on the development of mobile applications. They measured a productivity boost by a factor of 10. But in general, what I can observe in practice is that you should achieve at least a factor of 5 if you implement the DSL in the right way. Another appealing reason for using DSLs, besides the productivity boost, is that your solution, your knowledge from some domain, is stored and specified in a DSL which is independent from the underlying technology.
So it will evolve not with the technology itself but with the domain. What that means is that your knowledge is preserved and can be easily transferred to another target platform. And, what is also important, because it is at the right level of abstraction familiar to the domain expert, the specification of a solution also serves as up to date documentation of the system.
[00:08:41] Unknown:
And so textX is a toolchain for defining your own DSLs that you can incorporate into Python and other language projects. I'm wondering if you can give a bit more description of the textX project, some of the motivation behind creating it, and some of its origin story. Well,
[00:09:00] Unknown:
textX is actually a DSL and a tool for building DSLs. So it's a metalanguage. As for the motivation, early in my career I got introduced to model driven engineering, and DSLs are, let's say, a different flavor of it; they overlap a lot. So I quickly got into DSL stuff through a project called Xtext. It is a Java based project, and I think it was somewhere around 2005 or 2006 when I played with and used Xtext. And I always wanted something similar to Xtext, but in Python. And I wanted something for the DSL course, so it should be lightweight, something easy to use.
And Xtext was a little bit heavier because it is Java, it is Eclipse based, and it is a lot harder to learn. So I wanted something easier for a student to get started with. That's the motivation. And I think I started developing it somewhere around maybe 2005, I think. 2015, sorry. So that's the time I decided to sit down and actually implement it, because I realized there was nothing similar to it in the Python world at that time. And as you mentioned,
[00:10:23] Unknown:
there are a number of other libraries that do exist in Python for being able to build DSLs or write parsers. But what are some of the capabilities of textX that make it stand out, that might cause somebody to choose it over some of the other available options, and maybe some of the characteristics of the overall space of DSLs in Python that make something like textX useful and necessary?
[00:10:46] Unknown:
Yeah. There are great options in the Python world. We're actually lucky to have many parsing options. There is, for example, PLY and SLY and pyparsing and Parsimonious and Lark. Even ANTLR, for example, which is a Java tool but can produce a parser for Python. But all those tools are, I would say, more like classical parsing tools, where you get much more involved when you're building a DSL: besides the grammar, you have to describe the actions, which are used to transform the parse tree to something else, or they can do that on the fly without building a parse tree. So you have to do a lot more to maintain your language.
But textX is built on an idea that stems right from Xtext, where through the grammar you actually describe both the parser, or the syntax of the language, and what we call the metamodel of the language, or the structure of the language. So in a way it is constrained. That's why I like to say it's a DSL for building DSLs. You're constrained, but through that constraint you actually are much more productive, and you can say more with less, because all you need is, more or less, the grammar itself, and you can start using your language.
textX will, out of the box, create all the necessary elements of your language dynamically. It will create classes that correlate with your grammar rules. It will create a parser that will parse the textual representation of your model and instantiate objects of your dynamically created classes. And that all happens at runtime, just by reading the grammar. Your language is much easier to maintain.
[00:12:34] Unknown:
And for people who are building these DSLs, how does the actual definition of the grammar end up hooking into the behavior, given that, as you've mentioned, with the direct parsing tools you have to be much more manual and explicit about it? Well,
[00:12:49] Unknown:
when you're using classical parsing, you make a grammar, and then you will either get the parse tree or you'll get some nested list, for example. That's one way to transform the parsed content into some data structure. But then you will need to transform that into some other form to be usable in further processing. And you have all sorts of things: if it is some real language, you have, for example, reference resolving, or you have to define, for example, parent child relationships between elements. That's all built into textX.
So, by just giving textX a grammar, you are not getting the parse tree, you are actually getting a nice graph of plain Python objects, and they are connected by reference resolving. So it's not a tree, it's a graph, and you can use it straight away. It's a graph of Python objects. And you can plug into the creation: you can define something called, in textX, object processors. An object processor is a callable in Python that is called to either check or transform the object that is being created.
So in that way, you can implement additional semantic checks, or you can, on the fly, change the object being created. That is similar to, for example, the semantic actions you write in classical parsing. But this is an optional thing that you can add to introduce additional semantics or additional transformation of the objects.
[00:14:28] Unknown:
And so for those semantics, is that something like being able to say that this particular token is a keyword, whereas this other type of token is
[00:14:46] Unknown:
If you're asking about tokenization of the input: textX is based on PEG grammars and a recursive descent parsing technique called packrat parsing. So it will distinguish between, for example, the name of a function and some keyword, because it has unlimited lookahead and can resolve that ambiguity.
[00:15:09] Unknown:
For people who are defining the DSLs, what are some of the challenges that they face in just constructing the syntax of their target language? And what are some of the types of inspiration that they might look to for determining how it's going to look, and the user experience of the people who are going to be using that language and the DSL parser and logic that are going to be generated and built with textX?
[00:15:36] Unknown:
Well, there are all sorts of challenges during language design. First of all, when building domain specific languages, we must ensure that the domain is covered correctly. What that means is that you have built the correct concepts and relationships into your language, that there are no concepts left out, and that you don't have any additional concepts that are not relevant for the domain. What usually happens in practice, and is a danger for the DSL developer, is that you start with a DSL for some domain, but then it's very tempting to keep adding stuff to that language.
And many times, a DSL ends up being a GPL, a general purpose language. That's one consideration. The other is defining the proper syntax for the language. When we are talking about DSLs in general, we are not talking only about textual languages. The concrete syntax can be anything: it can be a graphical representation, it can be tabular. So the concrete syntax is actually the interface to the user. It is what the user will see and feel and interact with during the usage of the language. So it must be very nice, it must be easy to use, it must be very intuitive. To be intuitive, it must correlate to the existing language in the domain; we're just formalizing what users are probably already using. Those are considerations regarding the syntax. And of course, if we are building textual syntaxes, we have to consider all the technical stuff, like parsing issues, left recursion, and ambiguities. And at the end comes the semantics. We usually describe semantics by interpreting or compiling our language to something else. So choosing the right execution style, whether we should interpret our language or compile it, makes a difference.
And of course, it's probably less important, but we have to take care about the runtime performance of our language. All these decisions will influence how efficient our language will be in practice.
[00:17:50] Unknown:
And on the point of runtime performance, what are some of the capabilities of Python that will lead someone to use it as the host language for a DSL? And what are some of the cases where somebody might want to use a DSL that's using a different host language that's more optimized for particular latencies or particular target environments?
[00:18:13] Unknown:
Well, it depends how critical the runtime performance is for your use case. I usually look at two kinds of performance: one is runtime, and one is development time, or maintenance time. So if it's more important to you to quickly develop your language and to easily maintain it, then Python is a really good choice. Its dynamic nature gives you a quick turnaround and you can experiment easily, so no wonder it's used for prototyping and for things that are, you know, write once and throw away. If your system is critical and really needs better runtime performance, Python is probably not a good option. There are other languages that could serve as a better host. But still, you can use textX or similar tools built on Python to produce or generate code for some other runtime platforms.
So you can see there are different aspects: what you are using for developing the language, and what you are using at runtime. The runtime can be different. You can easily generate, for example, Rust code from your models using textX. There is no constraint in that regard.
[00:19:29] Unknown:
And as far as the overall end user experience of working with the DSLs that are being built using textX, what are some of the associated needs as far as tooling or the overall ecosystem of building and working in that environment? And what are some of the additional associated projects of textX, or capabilities built into it, that help in that overall process? Yes. Well, tooling is very important when talking about DSLs. And,
[00:19:58] Unknown:
given all the benefits you get from DSLs, the main reason why people didn't use DSLs so much in the past is probably the tooling support, because it's not easy to build a DSL from scratch and to maintain it. There is actually a class of software tools made specifically for building DSLs, and they are called language workbenches. textX is not a language workbench; it's a simpler tool. A language workbench is an integrated environment for building and evolving languages, so it's much more complex. But for the DSL part, there is a textx command that you get when you install the library. So when you install the library in a Python virtual environment, you get the textx command, which can be used to check your model, or to visualize your model or metamodel, for example. So you can, for example, generate a nice diagram of your grammar.
It is a class-diagram-like view that describes the structure of the language. Or you can use the textx command, for example, to start a project, to make an initial outline of the project. And there are other useful tools. For example, there is support for the Language Server Protocol and Visual Studio Code integration. It is a project called textX-LS that Daniel Elero is working on. He was working on a master's thesis regarding the Language Server Protocol for textX based languages, and after his master's thesis finished, he continued to work on it. And now we have a second version of that project that is developing very nicely.
So anyone who wants to try textX should check out that project. It is under the same GitHub organization as
[00:22:03] Unknown:
textX itself. Yeah. Being able to have that syntax highlighting and the language server support in the development environments will certainly reduce the burden for people who want to be able to take advantage of the DSL, without just looking at a blank wall of text and not really having any indicators of what the different tokens are and what their meaning might be in relation to each other. Definitely.
[00:22:26] Unknown:
That's the first thing that should be done when you're producing tooling for your language: syntax highlighting and code completion and code navigation. You should help your users to easily navigate around and to get help from the environment. And textX-LS is a project exactly for that. For any language you develop using textX, it can automatically generate the Visual Studio Code integration for that language. So out of the box, you get syntax highlighting for your language, which you can further configure if you're not satisfied with the initial results. And it is planned to support all styles of IDE support, like navigation and completion. So it's still not fully finished, but it's very usable at the moment and can be tried out. Digging deeper into textX
[00:23:18] Unknown:
itself, can you talk through how it's implemented and some of the ways that the structure of the project and its overall goals have evolved since you first began working on it? Well, it's built on top of a parser called Arpeggio. It's a parser,
[00:23:33] Unknown:
it's a PEG parser I started developing in 2009. It's probably the first real project that I did in Python. So when I decided to write textX, I decided, okay, I will use PEG parsing, because I knew Arpeggio very well, and I realized that I would probably need to tweak it along the way to support all the features I want to have in textX. And that was a good decision, I think, because along the way I did have to tune a few things in Arpeggio itself to help develop some textX features more easily. So basically, Arpeggio is doing the parsing; textX is just a layer above Arpeggio. How it works is that, if you open the textX source, you will see that there is a textX grammar language defined in Arpeggio syntax. And when the grammar is parsed, there is a visitor that will build the metamodel and another parser out of the grammar. That other parser is an Arpeggio parser for your new language.
And the metamodel is the object holding all the information about your language. All the concepts, all the relationships, everything is contained in that object. And that metamodel object is actually used as the API entry point for further parsing. You create the metamodel and you call metamodel.model_from_file or model_from_str, depending on whether you want to parse a file or a string. And the Arpeggio parser built dynamically for your language is accompanied by a visitor that will transform the parse tree to the object graph corresponding to your grammar. And that design didn't change much from the beginning; the core design remained the same. But the grammar language itself evolved over time. It started as the Xtext language; my idea at the time was to just make an Xtext implementation in Python. But actually, it grew over time and added some additional shortcuts in the grammar language itself, and some ways to more easily specify some stuff. For example, there is something called a repetition modifier in textX.
When you want to match zero or more things, or one or more things, you can attach a syntactic addition to the plus sign or to the asterisk sign and say, okay, match zero or more elements or objects, but they should be separated by something, and you just add the separator. In classical parsing, what would you do? You would have to do that manually; you would have to say match this, and then a comma and this, zero or more times. But in textX, it's much shorter to write. There are also, for example, rule modifiers that don't exist in Xtext but do exist in textX. And there is a relatively recent addition of unordered choice: when you have a sequence of things that you want to match, you can say, okay, match this sequence in any order.
textX, or Arpeggio beneath it, will match all those elements in whatever order they appear, which is very handy for some languages that define, for example, keywords that can be written in any order. So, more or less, that's about the design itself. So the core remained
[00:27:10] Unknown:
pretty much the same through all this time. And for people who are using textX for building their own languages, you mentioned a little bit about the need for having the grammar definition and then being able to parse the written language of the end user and to generate the concrete model from that. But what is the overall end to end workflow for somebody who is defining a new language with textX and then distributing it to end users for them to actually make use of it and develop within it?
[00:27:47] Unknown:
Well, the workflow can be different depending on how complex your DSL is. You can start very simple. You can define your language embedded in a Python module: you just write a string with a little grammar, then call one function that will transform that string into the metamodel, and then you can use the metamodel. So it's just a few lines of code if your language is very simple. But if you are developing something more complex, then you can build a whole language project. And there is now support for that in an additional project called textX-dev, which can be installed together with textX, either by using pip install textX with dev as an optional dependency, or by directly installing textX-dev. When you install that project, it adds an additional startproject command to textx. It's similar to, for example, how Django would create a new project. So you type textx startproject, answer several questions, and the initial project is generated.
In that project, you have a grammar file where you should go to define your grammar. The project also has registration already built in, so the project will be registered with textX. It is done through the setuptools extension point mechanism. So languages can be extendable; they are, in a way, like plugins for textX. You can use textx to list languages. And the generators for the languages are also registered with textX in setup.py, so you can list generators as well. So in that case the workflow is: start a project with textx startproject, and then play with creating the grammar. Usually, I tend to first open a blank file and try to write some model in it, to see how I would like to express some solution.
Then I write that solution, that model, and in parallel I develop a grammar for it. I usually have a small unit test that I run constantly to see if everything works, or I just use, for example, the textx command line to check whether the grammar is okay, and then I iterate: I extend the model, add some new things to it, then extend the grammar and see if everything parses. When I'm done with the syntax part — when I'm satisfied with how the model and the grammar look — then I design the semantics. I build a compiler for it, either using some template engine, or I make a little interpreter for the language. At the end of the process, you can pack that up, make a package of it, and release it on PyPI, for example.
So the user can just install that language. And if the user would like to have IDE support, you can use textX-LS to build a Visual Studio Code plugin with syntax highlighting and so on, and then you can distribute your language through that plugin. So those are the options, and it's very flexible: you can use it very simply, or as a full-blown language project.
[00:31:14] Unknown:
For the languages that you're defining, that brings up an interesting thought as far as how you would provide things like unit testing capabilities for the people who are writing in the language, to ensure that what they're building is going to parse properly or function as intended. I'm curious what your experience has been as far as how frequently people will actually go that extra mile to build additional ecosystem tooling for their languages, the overall need for it, and the points at which it hits the tipping point of complexity where that's even necessary?
[00:31:49] Unknown:
Well, it all depends on who the end users are. If, for example, the end users are people who are not that technically savvy, it's probably a good investment to make good tooling support. And for testing, I think it's generally always good to write tests when you're developing your language. I usually cover all the open source projects I work on with pytest tests with good coverage. I think that's very important, besides the documentation. I generally feel more confident when doing larger refactorings or changing the language.
I want to be sure that the assumptions I had before are not broken. So I think it's worthwhile to put some additional work into making proper testing.
[00:32:44] Unknown:
As far as the specifics of the parsing implementation, I know you mentioned that you're using a PEG parser with some customization. What are some of the cases where somebody might run up against the limitations of a defined grammar and its concrete implementations, and would be better served with a different parsing approach?
[00:33:08] Unknown:
Well, PEG parsers are really nice for their simplicity. It's kind of what you would probably end up with if you tried to build a parser manually — you would probably arrive at recursive descent. It's easy to understand, so PEG parsers are really easy to debug. But they have this difference compared to context-free grammars: their choice is ordered. The alternatives are ordered. By that I mean, when you have several alternatives to match at some point, you're telling the parser: try this; if it does not succeed, try the other one; and do this until you find something that succeeds.
So in a way PEGs are more imperative, in comparison to context-free grammars, which are more declarative. You would just say: this non-terminal is this, or this, or this — I don't care in what order; it's just what I declare. And the problem with PEG is that it will always be unambiguous. That might sound good, but in practice it is not always, because it hides the ambiguity in the language. It will just go from left to right and pick the first match from the ordered choice, and that is the way it resolves ambiguity. But it's not always what you want.
And you will not get any warning — the grammar is very hard to analyze for those things. For example, the typical problem you have with PEGs: imagine you first try to match a, and if that doesn't succeed, you match a and then b. You can see that this second alternative will never succeed, or rather never be reached. If you find a in the input, it will be matched by the first choice, so the a with b afterwards will never be reached. That can introduce various problems in practice. And the most difficult problems come when you reorder the ordered choice: you are actually changing the language, but it's hard to see how. So in big grammars that can be problematic.
You don't get any analysis from the tool. With the other parsing approaches, which are based on CFGs and do some preprocessing and grammar analysis, you do get some help — for example, shift/reduce conflicts that tell you that at some point you have either an ambiguity or you need more lookahead to resolve something. So PEGs are easy to debug and easy to understand, but they do have their own problems.
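The "a versus a-then-b" pitfall described above can be demonstrated with a toy matcher. This is plain Python, not the textX/Arpeggio API; all function names are made up:

```python
# Minimal sketch of PEG ordered choice and its unreachable-alternative trap.
def match_a(text, pos):
    """Match the single token 'a'; return new position or None."""
    return pos + 1 if text[pos:pos + 1] == "a" else None

def match_ab(text, pos):
    """Match 'a' followed by 'b'; return new position or None."""
    return pos + 2 if text[pos:pos + 2] == "ab" else None

def ordered_choice(text, pos, alternatives):
    # PEG semantics: try alternatives in order; the first success wins.
    for alt in alternatives:
        end = alt(text, pos)
        if end is not None:
            return end
    return None

# "a / ab": the second alternative can never be reached, because "a"
# always matches first -- the hidden-ambiguity problem described above.
print(ordered_choice("ab", 0, [match_a, match_ab]))   # -> 1 (only "a")
# Reordering the choice silently changes the language:
print(ordered_choice("ab", 0, [match_ab, match_a]))   # -> 2 (whole "ab")
```

No tool warns about either outcome here, which is exactly the analysis gap the answer describes.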
[00:36:08] Unknown:
For people who are using textX and building their own languages, what are some of the common challenges that they run into, either in terms of overcoming those limitations of the PEG grammar, or just the overall process of building the DSL and making it available to their end users for doing the work that the DSL is intended for?
[00:36:34] Unknown:
So, if I understood the question — from my experience, and the problem with open source projects is that you don't always get full feedback from the users, but I do have a lot of feedback from my students — it usually goes relatively smoothly. When there is good documentation and good examples, they will just read through that and generally understand it very well, so they don't have many problems with it. Sometimes they initially have a sort of fear of parsing in general, probably because they were previously exposed to old classic tools like Flex and Yacc and similar, so they consider parsing very hard and hard to understand. But I think that fear is very quickly overcome when they start to work with tools that are easier to understand and use.
[00:37:38] Unknown:
And as far as projects that you have seen built with textX, or that you've built yourself, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:37:48] Unknown:
Well, again, most of the projects I see developed are from my students. There are several projects listed on the textX front page under "who is using", but users usually don't reach out that much. So I encourage users of textX who are listening to this podcast to drop me a line about what they are using textX for — I always like to hear about that. But of the other projects, probably the most interesting was one done by several students: a language for describing guitar tablatures. They call the project PyTabs; it's on GitHub. Guitar tablature is a way to write pieces of music for guitar, but for folks who, for example, don't have formal education and don't know how to read notes. It's a very easy format to understand — it actually depicts the neck of the guitar in ASCII art, where you see six strings running horizontally.
And on each string there is a number that says which fret you have to press when you play that note. They managed to parse that with textX, and the grammar is actually very elegant. If you think about it, it's like a two-dimensional language: you're not only parsing horizontally but vertically as well, because you have to correlate the different strings at the same position. And the interpreter for that language plays the music — they designed a language whose semantics is playing the music described by the user. For me, that was a very innovative and interesting way of using textX.
[00:39:37] Unknown:
Yeah. That's really cool. And as far as your experience of building TextX and maintaining it and continuing to use it as a teaching tool, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:53] Unknown:
Well, maintenance of open source projects in general, I've learned, is time consuming and not very easy to do, especially when the project starts to get some traction. When you're maintaining a project, you have a lot of work just organizing things: making sure every issue is commented on, every pull request is reviewed, and the release process is done properly with the right versioning and so on. Since many open source projects are done on a voluntary basis in free time, it's somewhat hard to do. So, I think it was a year and a half ago, I got a really great contribution from Pierre Bayerl.
It was an implementation of custom scoping support. textX always had this reference-resolving feature — let me quickly describe what it is. When you are parsing something, at some place you say: here I want to match the name of some object I defined somewhere else, and textX at that place will resolve it to a proper Python reference, so you don't have to do that yourself. That's why you end up with a graph of Python objects, not a tree. Scoping was done using a global scope: textX in the older versions, or by default, will search for that type of object globally. And that's not always what you want. So Pierre added support for custom scopes: you can define a custom scope provider where you specify, in Python, the actual algorithm for how that object is to be found. Another piece of that pull request was support for multi-model and multi-metamodel, so you can, for example, have several different grammars and build a model that references things from another model in another language.
You can even reference things that are outside of textX. For example, you can reference a specific node in a JSON file, or a specific node in an XML file. That support is really cool. He did that a year and a half ago, sent a pull request, and we had a really great collaboration on that pull request. When we merged it to master, I asked Pierre to join the project to help maintain it, and I'm really happy he accepted. So he is now co-maintaining the project with me. It's much easier when you have a co-maintainer, because we can discuss design decisions, and sometimes I don't have time to look at some pull request or some issue, and sometimes Pierre doesn't. So it's much easier when there are more people.
[00:42:46] Unknown:
One of the other interesting things that I found out while I was doing the research for this conversation is that in addition to textX and the Arpeggio parser that it's using, you've also built another parser using a different type of grammar support, called parglare. I'm wondering what your motivation was for creating that, some of the ways that your experience with Arpeggio and textX fed into the work you did there, and some of the ways that the work you're doing on parglare has informed decisions about how you approach things with textX and Arpeggio?
[00:43:24] Unknown:
Well, it actually started from problems in PEG parsing that I realized at that time. For example, one problem I already talked about: the parsing is always unambiguous, which is not what you always want — there are hidden ambiguities. The other thing, generally related to all top-down parsers, is that they don't accept left recursion, left-recursive rules. And sometimes a grammar is most naturally described using left-recursive rules. For example, if you're building something that is heavily expression oriented, it's much easier to encode it naturally.
For example, if you're building expressions for arithmetic operations, you can easily say: expression is expression plus expression, or expression minus expression, and so on. With top-down parsing you must avoid the left recursion, so you encode those rules differently, which is not very natural. So I wanted to experiment with another parsing approach. My idea was to offer an additional backend for textX instead of Arpeggio — as an option, you could plug in some other parser. So I built parglare to experiment with LR parsers, bottom-up parsers. And I was especially interested in general parsing, so parglare also implements GLR parsing. Later on I realized that trying to put two different parsing styles into the textX project would be very complicated.
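The left-recursion restriction mentioned here can be illustrated in plain Python (a toy sketch, not any of these libraries' APIs). A top-down procedure for a rule like "E: E '+' num" would call itself at the same input position and recurse forever, so PEG-style grammars rewrite it into the iterative form "E: num ('+' num)*":

```python
# Left-recursion-free form of an additive expression rule, as a
# hand-rolled top-down parser/evaluator.
import re

def parse_expr(text):
    """Parse "num ('+' num)*" and evaluate it left to right."""
    tokens = re.findall(r"\d+|\+", text)
    value = int(tokens[0])                 # first operand
    for i in range(1, len(tokens), 2):     # then repeated "+ num" pairs
        assert tokens[i] == "+"
        value += int(tokens[i + 1])
    return value

print(parse_expr("1 + 2 + 3"))  # -> 6
```

Bottom-up LR/GLR parsers, as in parglare, accept the left-recursive rule directly, which is why the grammar can stay in its natural form there.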
So I decided not to do that, but the parglare project itself developed quite nicely, and I really liked some of the results I got, especially with GLR parsing. And I like the way you can use context-free grammars to declaratively express your language. One lesson I learned from working on parglare is that sometimes it's really nice to have something easy to start with, like a PEG parser, especially for students who are learning parsing. But sometimes you need more power, like a bottom-up parser with a declarative specification of the language and full general parsing like GLR, which can accept any context-free grammar, even an ambiguous one, and in case of ambiguity can produce a parse forest — all the possible solutions for your input. That's especially important, for example, for natural language processing, where you by default have ambiguity in your language.
[00:46:09] Unknown:
And then for people who are making the decision of what to use, what are the cases where textX is the wrong choice and they might be better served by either using a different DSL library, or using a simple parsing library and doing the manual resolution of how that logic is supposed to work, or just using a simple regex for the smaller cases?
[00:46:35] Unknown:
Well, if you're trying to parse something more complex, it's generally not a very wise choice to just use regexes. So I generally recommend always using some parsing library, because even if you think you can easily handcraft your parser, there are all sorts of edge cases that are already handled in a parsing library, and there is good error reporting and things like that. But handcrafting a parser can give you additional control. So if you want real, total control over the parsing process, or for example if you want to learn parsing in depth, then you can go with a handcrafted parser. As for different libraries: textX is not a great choice if you really want to influence the outcome — what you are transforming your input into — or if you want the best possible runtime performance, or if you want to parse a stream of tokens as they arrive. textX is not a suitable parser for that.
Or, for example, if your input is naturally ambiguous — like natural language, or some language that is ambiguous — you cannot use a PEG parser in that case. So generally, if you need full control, or you want to produce something that does not correspond directly to your grammar — let me give you an example: if you are building, again, an expression-based language, maybe you want to evaluate the expression on the fly. If you want that, then textX is not a good choice, because you will always end up with an object graph and you will have to transform that graph into the result of the expression.
[00:48:24] Unknown:
And as you continue working with textX and using it for your own purposes and for your teaching, what are some of the new capabilities or features or just overall improvements that you have planned for it, or associated projects that you have in mind to build?
[00:48:34] Unknown:
Well, first of all, one thing we discussed recently was to drop Python 2 support from textX and Arpeggio. They're still compatible with Python 2, and because of that we cannot move on with some Python 3-only features. For example, one thing I would really like to see in textX is type hinting, so we can provide stricter checking of types in the library itself. And there is also one bigger feature we have been planning for maybe a year now: a small DSL for custom scoping providers.
That is the part that Pierre was working on. Right now you describe scope providers with Python functions, and the idea was to create a very small and simple DSL for describing scoping rules that you can embed in the grammar itself. So at the place where a reference is used, you can write an expression that tells textX how to resolve the reference. Because we were discussing that across several issues, we made a document in the wiki — TEP-1, a textX enhancement proposal — where we collected all the ideas about that DSL. That's probably something we should work on at some point in the future when we find some time.
[00:50:09] Unknown:
Well, for anybody who wants to get in touch with you or follow along with the work that you're doing or contribute to your work on textX and your other libraries, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the wemake-python-styleguide project. It's a set of plugins for providing fairly strict linting of your projects, and I've started using it on some of the new projects I'm doing lately at my job. There are a few things that I've had to turn off, but all in all it has been doing a good job of catching some of the silly errors that I add in as I'm working, without me having to go through the whole, you know, read-evaluate-print loop — I can just see in my editor that a mistake was made and fix it in that context. So I've been enjoying that, and I recommend it to anybody who's starting a new project or wants to add some new linting rules for their work. And with that, I'll pass it to you, Igor.
[00:51:14] Unknown:
Well, my pick would be something I rediscovered lately. It is called interactive fiction. It's a genre of crossover from literature to gaming. I remember I played those games back in the eighties, but at the time they were called text adventures. So recently I tried to see what happened with that genre — I thought it was completely gone. It was very popular in the eighties, from Infocom, I think the company was called, and nowadays it's not visible, at least on the surface of the internet. So I dug deeper and found out that the community around interactive fiction is actually very alive. There is a site, for example, called the Interactive Fiction Database, where you can find titles that are published even in recent years. There are authors actively working on new titles, and, what is interesting, there are authoring tools that are actively developed. One is called TADS, and the other is Inform 7, developed by Graham Nelson, a British mathematician.
And Inform 7 is very interesting. It's a kind of DSL, but based on natural language, so when you read the description of a game, it's like reading what you'd imagine if somebody described it to you in plain English. Here is a little quote from the Inform 7 website: it's "a tool for writers intrigued by computing and computer programmers intrigued by writing. Perhaps these are not so very different pursuits, in their rewards and pleasures."
[00:52:46] Unknown:
So it's very interesting, and I encourage anyone with an interest in reading novels and solving puzzles to try out interactive fiction. Yeah. There's another category of that as well, with multi-user interactive fiction, and I did an interview with the maintainer of a library called Evennia a while ago, so I'll add a link to that in the show notes as well. Oh, that sounds great. So with that, I would like to thank you for taking the time today to join me and discuss the work that you've been doing with textX and Arpeggio and parglare.
Definitely a very interesting problem domain, and something that can provide a lot of utility to people who are struggling with trying to build their own DSLs or make Python work the way that they want it to syntactically. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Hey, it was my pleasure. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Igor Dejanović
Understanding Domain Specific Languages (DSLs)
Introduction to TextX
Runtime Performance and Use Cases
Workflow for Defining a New Language with TextX
Parsing Implementation and Challenges
Interesting Projects and Use Cases
Exploring Parglare and Other Parsing Approaches
Future Plans and Improvements for TextX
Closing Remarks and Picks