Summary
Programmers love to automate tedious processes, including refactoring code. To support the creation of code modifications for Python projects, Jimmy Lai created LibCST. It provides a richly typed, high-level API for creating and manipulating concrete syntax trees of your source code. In this episode Jimmy Lai and Zsolt Dollenstein explain how it works, some of the linting and automatic code modification utilities that you can build with it, and how to get started with using it to maintain your own Python projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Zsolt Dollenstein and Jimmy Lai about LibCST, a concrete syntax tree parser and serializer library for Python
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what LibCST is and the story behind it?
- How does a concrete syntax tree differ from an abstract syntax tree?
- What are some of the situations where the preservation of the exact structure is necessary?
- There are a few other libraries in Python for creating concrete syntax trees. What was missing in the available options that made it necessary to create LibCST?
- What are the use cases that LibCST is focused on supporting?
- Can you describe how LibCST is implemented?
- How have the design and goals of the project changed or evolved since you started working on it?
- How might I use LibCST for something like restructuring a set of modules to move a function definition while maintaining proper imports?
- How do the capabilities of LibCST for codemodding compare to the Rope framework?
- What are some other workflows that someone might build with LibCST?
- What are some of the ways that LibCST is being used in your own work?
- What are the most interesting, innovative, or unexpected ways that you have seen LibCST used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on LibCST?
- When is LibCST the wrong choice?
- What do you have planned for the future of LibCST?
Keep In Touch
Picks
- Tobias
- Zsolt
- Jimmy
- Paying down technical debt
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- LibCST
- Carta
- lib2to3
- Abstract Syntax Tree
- Concrete Syntax Tree
- Pyre
- Parso
- Cython
- mypyc
- Rope
- Flake8
- Pylint
- ESLint
- Fixit
- MonkeyType
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40-gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Zsolt Dollenstein and Jimmy Lai about LibCST, a concrete syntax tree parser and serializer library for Python. So, Zsolt, can you start by introducing yourself?
[00:01:10] Unknown:
So my name is Zsolt. I've been working for Meta slash Facebook for the past, like, 6 years. I've been maintaining LibCST since just when the pandemic started to hit Europe. So this is like a pandemic project for me, like 2020-ish. I have an 8-year-old son. I'm big into video games.
[00:01:30] Unknown:
Yeah. That's me. And, Jimmy, how about you? I'm Jimmy. So I was an engineer at Instagram, where I worked on LibCST since the beginning. And right now, I'm an engineer at Carta. So, yeah, I'm a big Python fan. I have been using Python for, like, 10 years, and I like to share what I learned about Python at local meetups and Python conferences.
[00:01:59] Unknown:
And going back to you, Zsolt, do you remember how you first got introduced to Python? I had to dig deep into this question. So, like, I remember at my previous company, they threw me in the deep end; they had some massive Django Python 2 codebase. They told me to do some, like, minor infrastructure features. And that's when I realized that C++ is not the best language in the world; instead, it's Python. And, Jimmy, do you remember how you got introduced to Python?
[00:02:25] Unknown:
I think it goes back to when I was at school, when I was trying to learn data mining and data analysis. Python was used in the course. So I liked that a lot of libraries are provided. It makes data mining or machine learning very easy.
[00:02:44] Unknown:
Alright. And that brings us now to the LibCST project. So can you describe a bit about what it is, some of the story behind how it came to be, and sort of what was missing in the ecosystem for syntax tree parsers that made this a necessary project?
[00:03:00] Unknown:
At Facebook slash Meta, we have a lot of large codebases that are multiple millions of lines of code. At Instagram, we have a monolith that is a few million lines of code. So we have a lot of different problems to solve at large scale. In terms of linting or codemodding, we have a lot of challenges. So in the beginning, we tried to look for some open source tool to help us build customized linters and also codemods. And we didn't find a good one, so we decided to build LibCST. In terms of codemods, Facebook has an open source codemod library that allows you to modify code or match some code patterns using regex.
But regex is not able to help us modify the code or match more complex code patterns, so we couldn't use regex. And later, we also tried to look into the available open source syntax tree parsers. But the built-in syntax tree parser in Python doesn't provide concrete formatting information like whitespace, and those are needed for codemods, because you will want to convert the source code to a syntax tree, modify the syntax tree, and convert it back to updated source code. And there are some available parsers and codemod tools, but they were not very convenient for us to use. So we ended up deciding to build our own concrete syntax tree implementation, LibCST. I have my own take on this. I wasn't there when the project originally started, but, like, I started,
[00:04:50] Unknown:
you know, maintaining this as kind of, like, a power user. So the reason I flocked to LibCST personally is because, like, I tried lib2to3, which is another concrete syntax tree implementation that Black, the popular formatter, actually uses. And it's just like, I mean, it gets the job done, but it's so low level. Right? So, like, you have to keep in mind how the CPython grammar is implemented and then, like, fiddle with low-level CST nodes there, which is great for a formatter like Black, which basically only manipulates whitespace. Not so great for, you know, 99% of the other use cases where you want to write codemods. The major things in LibCST, I feel like, are, like, a relatively high-level API where you can reason about your code as you would, you know, normally, as a normal human being. So you have this high-level API that's also type safe. I mean, it's not impossible, but it's hard to shoot yourself in the foot compared to an untyped
[00:05:48] Unknown:
CST library like lib2to3 or other ones as well. I don't want to, like, pick specifically on lib2to3 because it's a great library. So those are the 2 main selling points of LibCST for me. A lot of people might be familiar with the idea of the abstract syntax tree, and I know that the concrete syntax tree is a bit above and beyond that because it preserves some of the extra information like comments and whitespace that aren't preserved in the abstract syntax tree. And I'm wondering if you can describe some of the situations where that maintenance of the extra information is necessary or useful as compared to use cases where an abstract syntax tree would be the preferable approach?
[00:06:26] Unknown:
Yeah. I mean, so I get this question a lot from people who are not familiar with this space. The most concise way I can describe this is: the AST is useful for machines and for people who want to reason about, like, the semantics of the code, like what the code is intended to do. But the CST is useful for, like, presenting the code to humans. And so, yeah, as you mentioned, like, the CST includes a bunch of additional stuff compared to the AST, like, I don't know, whitespace, comments, extra non-semantic commas and parentheses, and these kinds of things. That's the main difference. And in terms of, like, the situations where preserving, like, the syntactic trivia would be important, I guess it's where you have to interact with humans, because you want to present the code in a particular way to these humans. Like, I don't know, maybe you want to keep existing formatting choices, or, I don't know, you want to order some things in a certain way and stuff like that. I think from the use case of linting and codemodding, we
[00:07:31] Unknown:
do need to preserve that formatting information. For linting, a very common feature of a linter is to ignore some lint errors added in comments. So in order to skip some specific errors, we need to be able to parse and analyze the comments to figure out if there's an ignore comment. Then your linter framework can ignore some errors and not show them to the user. And for codemods, preserving the formatting information is very important, because if you don't do that, when you just modify a small piece of the code, you may end up reformatting the entire file. And that generated diff will be hard for the user to understand, because they cannot see what exactly was changed by the codemod logic.
So it's important to preserve the formatting information.
[00:08:31] Unknown:
Yeah. Exactly. And just to underline this, a lot of our codemods that are shipped with LibCST are about manipulating type hints and lint suppressions and comments and these types of things.
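The trivia loss the guests describe is easy to see with the standard library alone. A minimal sketch (pure stdlib, Python 3.9+; the sample source line is invented for illustration):

```python
import ast

source = "x = [1, 2, 3]  # keep this comment\n"

# Round-tripping through the stdlib AST drops the comment and any
# formatting quirks, because the AST stores no syntactic trivia.
regenerated = ast.unparse(ast.parse(source))
print(regenerated)  # -> x = [1, 2, 3]
```

A codemod built on `ast` would therefore reformat everything it touches; a CST round-trip is designed to emit the original bytes for unchanged nodes.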
[00:08:44] Unknown:
In terms of the overall ecosystem in Python for concrete syntax tree parsers, there are a few of them around, and I'm wondering what it is that LibCST has over and above them that made it a useful exercise to build it, versus just taking one of the existing projects and extending it, either upstreaming the changes or forking it and maintaining your own version. And sort of what are some of the core architectural requirements that you wanted or needed that made it worth the effort of starting a completely new project?
[00:09:17] Unknown:
I am not aware of any CST implementation that's type safe apart from LibCST. So I'm a fan of type hints in general. So I might be extremely biased here, but, like, that's a killer feature, like, in and of itself. The other thing that LibCST provides that others don't is, as I mentioned, like, a high-level API where you can manipulate CST nodes that are kind of logical. Validation: it's very hard to accidentally produce invalid code that has syntax errors, for example. As part of writing codemods in LibCST, you generally will eventually write code that, you know, emits some Python code. And one of the very big disadvantages of a CST library compared to an AST library is that it includes all of this syntactic trivia.
When you're constructing CST nodes, there are tons and tons of parameters that you have to, like, just remember. And LibCST provides an API, I think, that is pretty convenient to use, because there are some sane defaults where, I don't know, like, if you have a list, you can forget about making sure that the commas are in the right places and whatnot, because LibCST will figure that out for you, and things like that. So, like, some of these features are implemented in some CST libraries, but none of them, like, combine all of them. I want to add something. So LibCST,
[00:10:36] Unknown:
it started as a concrete syntax tree parser. But as we continued to develop it, it's not just a concrete syntax tree parser anymore. It also provides convenient APIs for traversing and modifying the tree. We have the transformer API and the matcher API, which allow you to describe a pattern or a shape of a subtree you want to match. It also provides some static analyses that add metadata. That way, you can build higher-level linters and codemods more easily. We provide position data that is useful for linters, and scope analysis that links assignments and accesses, which allows you to link different nodes in the tree so you know they are referring to the same variables.
We also provide qualified name information, and we also integrate with the Pyre type checker, which provides us the inferred type for a given node. And those are building blocks that let you build more powerful codemods and linters. Some of the stuff that Jimmy mentioned
[00:11:53] Unknown:
is provided by this metadata framework that we have in LibCST, which basically lets you say, oh, I have this CST node, because while visiting the whole document, I happened to, you know, pick this because it's an interesting import, for example. And then this metadata API allows you to query interesting things about that particular CST node. Like, for example, as Jimmy mentioned, position: where is it in the original document? Or with the Pyre integration, you can ask for the type of this symbol that's being imported, for example. And there are, like, I don't know, 6 or 7 different metadata providers, and you can also implement your own. So, like, with this framework, it's an extremely powerful framework, you can do some, like, surprisingly complex analysis of the source code that you're transforming at the moment. For example, the most complex, like, analysis that I know of is implemented by the scope provider within LibCST. And it allows you to figure out which symbol is exactly defined where, and where it's being accessed.
And it implements all sorts of, like, weirdness that you as a Python user might not know about, like, for example, name shadowing rules and whatnot, especially with the new, like, walrus operator, which became a bit tricky to implement. So it's a reasonably complex analysis that kind of approaches what the type checker can do. It will never reach what the type checker can actually do. But, like, you can, you know, use LibCST not just for codemods, but for, like, reasonably complex analysis like this as well, using this amazing metadata framework.
[00:13:21] Unknown:
Digging into LibCST itself, can you talk through how it's implemented and some of the internal architecture and design choices that you made when you were structuring it?
[00:13:31] Unknown:
So we didn't build everything from scratch. We tried to reuse an available parser based on pgen2; what we use is a fork of the Parso library. What this library provides is that it allows us to define the grammar and the conversion functions. So what we needed to do was take the grammar defined by the Python documentation, the grammar of Python, and build a lot of different conversion functions to process the output from the parser. So our main logic is focused on how we want to preserve all those parsed values, especially the formatting values, and how we want to define our nodes in the concrete syntax tree. So we needed to carefully design the values, and how those formatting values should be preserved.
We tried to carefully review different cases of whitespace, commas, and parentheses, think about their use cases, and then decide which nodes should own them. Yeah. So this was the early work of the implementation of LibCST. We needed to implement a lot of conversion functions, and we added a lot of tests. And, of course, we also made it strongly typed, so it's very easy and convenient to use. And in terms of some of the APIs, like the transformer and visitor pattern, you are able to register a callback function, like visit simple statement.
By registering a function like this, you can have the function called during the tree traversal. And in order to provide strong type annotation support, we automatically generate the type annotations for all those callback function hooks. So we actually use LibCST codemods within LibCST to generate a lot of those type hints and function definitions.
[00:15:50] Unknown:
From, like, a bird's-eye view, LibCST has roughly, let's say, 4 big components. One being, kind of, like, the parser: stuff that takes Python code and converts it to something that you as a LibCST user can use. So that's the parser. The second one is basically just a bunch of classes that describe the CST itself. So, like, things like here's how an if statement looks, here's how a while loop looks, stuff like that. So those are all just, like, a bunch of classes somewhere in the API. And the parser outputs these objects, basically. So that's the 2nd component. The 3rd big component is what Jimmy mentioned, all these visitor and pattern matching and traversal APIs, which are generated from the second component. So these, like, class definitions and whatnot, they're generated using LibCST, as Jimmy mentioned, which was a bit confusing for me originally, but now I kinda sort of maybe understand. The last big component is the whole, like, CLI framework on top of all of this. So essentially, the way LibCST is intended to be used is, like, there is already a codemod for your use case; you just have to invoke it. Like, there's a CLI interface so you can parameterize it.
It can handle, like, big projects with, like, Python roots and multiple files. And all of these things are in this whole, like, 4th component.
[00:17:10] Unknown:
In terms of the evolution of the project, I'm curious, what are some of the ways that the overall scope and design and goals of the system have changed or evolved since you first began working on it? Yeah. I think in the beginning, we focused on
[00:17:25] Unknown:
building the concrete syntax tree, building the parser, and also building the basic visitor and transformer API. This API also exists in the Python AST library, so we tried to use the same pattern. And after we had those ready and started to use them in our later use cases, we found that sometimes when you have a lot of callback functions registered in an implementation, it can be very messy and complex. So we decided to spend some effort to make traversal and modification easier. That's why we built the matcher pattern. With matchers, we can describe the shape of a subtree in a way that is easier for the code reader and easier to maintain.
In the beginning, we also tried to build some basic static analyses, like the position metadata for our linter. The linter needs to report the line number of a lint warning to the IDE, so the IDE can show the error inline. And later, when we started to build a lot of linters, we ran into some tasks where the authors of those linters wanted more powerful static analysis. So we started to build more static analyses, including scope analysis and the qualified name provider. So, basically, I would say we evolve the library based on our clients.
So we, as an infrastructure team, try to provide this as a tool for all the other engineers at Meta. Some engineers have different use cases, and based on their asks we prioritize
[00:19:26] Unknown:
our work. One of those engineers was me. I started using LibCST in a surprising way. I didn't use it for codemods; I used it for code analysis. And so, I mean, technically, when I started working on the project, this was already a goal for me. So it wasn't really a change. But in terms of the project, I was focusing a lot on the analysis capabilities of LibCST. So that's how I got onboarded, actually. And then over the course of the past 2 years, what has become apparent is that LibCST is a really, really powerful API that is very pleasant to use. And lots of people who want to manipulate code want to use it, but sometimes they can't, because, I mean, it's not the fastest piece of code around town, because that was never a design goal. Like, being fast was always secondary to being expressive and powerful.
And so LibCST was, and still is, used the majority of the time in offline mode. Like, I have these 100,000 files; apply this codemod to them, please. And then it's okay if it takes, I don't know, a minute more. But sometimes, especially more recently, people want to use LibCST in more interactive sessions, like IDEs. Like, I want to remove all my unused imports, but I don't want it to take a minute, please. And so, like, these people cannot use LibCST or more complex codemods today, because it's just too slow. Like, just to paint a picture: IDE operations need to complete within 300 milliseconds.
Codemods can take multiple seconds easily. That's just a no-no. So that's something I see nowadays that is changing. Like, the shift in terms of goals: performance
[00:21:11] Unknown:
is slowly shifting upwards in in the prior to this. That's what I'm trying to say, I guess. And I'm curious if you've had any attempts at just running it through Cython to see if that improves some of the operation speed at all or if it's just because of the nature of the changes that it's operating and just, like, the the IO that it has to do with the files that causes it to slow it down a lot?
[00:21:32] Unknown:
Yeah. Honestly, I haven't, but it's on my to-do list. And not Cython, but mypyc, actually. Because the codebase is already well typed, I think running mypyc should be fairly straightforward on it. But, then again, I've said the same thing about Black, and it was a year-and-a-half project. But, yeah, we'll see. That's one of the opportunities we're exploring. The other thing I have personally explored, actually, is because we needed to add support for Python 3.10, and the parser that LibCST used to be based on was Parso. Parso does not support the 3.10 grammar because it's an entirely different type of grammar. Not sure if we need to go into the specifics here. But, like, basically, Parso doesn't support all the features of Python 3.10.
And for a codemod tool that's being applied to code that can potentially include 3.10 syntax features, we need it to support 3.10. I went ahead and rewrote the parser from scratch in Rust. And yeah. I know. Yes. In Rust. Part of the reasoning behind the choice of Rust was because it gives us, in the future, the opportunity to optimize the parsing side of things, which admittedly is not the slowest part of LibCST. Generally, the slowest parts of LibCST are visitation and transforming. So parsing is, I don't know, about 300 milliseconds for a typical document. That's not terrible.
But because performance is slowly shifting upwards in terms of the priorities, the choice was made to write this in Rust. So in the future, if this becomes the bottleneck, then we'll, you know, have leverage. That's how I rationalize it, then.
[00:23:13] Unknown:
Or just because you wanted a new project to keep yourself busy with.
[00:23:18] Unknown:
I mean, if I'm completely honest, this was a hobby project. Like, it was just on my spare time, and then I realized, oh, shit, actually, we need this for some business cases. And then I was the furthest along, so we just went ahead and did it my way. But, honestly, I probably wouldn't rewrite it in Python even if I were given the chance, because what I realized is that one of the high-level design goals for LibCST is it being well typed. And the biggest gaping hole in terms of the type safety within LibCST is the parsing, because Parso is untyped. And also it's very, very hard to express the types of transformations that the parsing, you know, functions need to express in Python's type system.
It was just, like, not done. So it's completely dynamic. And then we, like, assume that the types are correct on the output side of the parser. And that turns out, of course, to be incorrect in some cases. And there are several bugs in LibCST today that are caused by the parser returning slightly different types. Like, they make sense, but they're not the ones that are being advertised. So then, I don't know, sometimes nodes have an extra comma attribute even though you're not supposed to put commas there, but the type allows you to. And I consider that a bug. And the reason I bring this up is because during the Rust rewrite of the parser, I bumped into this all the time, because Rust's type system does not allow you to mess around. So I had to figure out what is the, you know, correct type to return and implement it, file a bug. And then, eventually, when the old parser infrastructure is turned off completely, these bugs will be killed for good.
[00:24:58] Unknown:
Rust is quickly becoming the companion language for Python, slowly replacing C for a lot of the native code that you're plugging into Python libraries. And Yeah. Close to hardware code. Yeah. And so in terms of actually using LibCST, I'm wondering if you can talk through some example workflows of saying, I have this codemod that I want to do, maybe some examples of the types of codemods that you would use it for, and then the process of maybe interactively exploring what are the changes that I need to make, and then being able to turn that into a module and make it part of your standard CI suite. To make sure that, you know, we never wanna have this specific code pattern, we always wanna make sure that this function name is actually translated to this, or that there are no function signatures that have 3 parameters because it's actually supposed to have 4 now. And one of the challenges about answering this question is there's lots and lots of internal tooling at Meta
[00:25:53] Unknown:
that makes most of these answers, like, easy internally, but there's no real good open source solution. And I fully admit that part of the weakness around LibCST is all the open source tooling. This is great, but now I want to enforce these rules in my codebase; there are no tools that I'm aware of that allow you to do this painlessly the way I'm used to doing it internally at the company. So that's something that we might explore in the future. But at a high level, assuming you want to, like, run a codemod that already exists and you don't have to write it yourself, you discovered it by, I don't know, browsing documentation, or a colleague tipped you off or something. What you do is you just, like, take your Python source code, you just run the LibCST CLI, choosing your codemod, and then maybe passing in some parameters, seeing your files change, making sure that the changes make sense, and then automating it. It's kinda like writing a script that runs as part of your CI, and then either flags pull requests or just automatically checks in code. We do both approaches sometimes internally.
Some of our codemods are actually run on the entire codebase every day. Like, formatting is one of these, but also removing unused imports and stuff like that. They're just, like, being run, producing pull requests, being also accepted automatically, and just, like, good to go. Does that answer the question?
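A sketch of that CLI workflow (command names from LibCST's `libcst.tool` interface; exact codemod module paths may vary by version):

```shell
# One-time setup: write a .libcst.codemod.yaml config at the project root.
python -m libcst.tool initialize .

# Discover the codemods that ship with the library.
python -m libcst.tool list

# Apply one across the whole tree; untouched spans keep their formatting.
python -m libcst.tool codemod remove_unused_imports.RemoveUnusedImportsCommand .
```

From there, wiring the last command into a CI job or a scheduled script is ordinary shell automation.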
[00:27:18] Unknown:
Yeah. And then, for example, if I have a codemod where I want to restructure a set of modules, so I've got a function definition that I actually want to move into a different lib module, and then I wanna make sure that all of the imports across the project are updated to use that new module location. What would be the process for actually
[00:27:38] Unknown:
writing a codemod to do that? For this particular one, there exists a codemod, so you don't have to write it yourself. You just, like, call libcst.tool something something rename, and then pass in the old name and new name, pass in all your files, and make sure that the Python root is set correctly so that the fully qualified names are set. But if you wanted to write a new codemod from scratch, my process for this is starting always with the visitation API, and so figuring out what are the patterns I need to match on and what do I need to change them to. Myself, I look at the matcher API first, because it's extremely expressive. So you just put some decorators on your visitation function that will make sure your visitation function only gets called when the node matches this particular pattern. For example, I don't know, it's within an else branch of an if with a particular condition.
So you can put some crazy things in there in a decorator, and then your function will only get called in those conditions. And then you basically say, okay, now I'm in the code that I care about, I transform it. Like, let's say, for your example, renaming imports and symbols, you would be interested in visiting all the Name CST nodes. And then in your visitation function, you say, oh, I'm in a Name node; let's check if the name of this node is equal to the old name. If it is, then emit a different Name node with the new name. And then the catch here is you want to do this in an intelligent way. So not just, like, you know, control-C, control-V, but more like you want to figure out, oh, if this name actually refers to the same thing, even though the name is different because it's imported as an alias, for example. Then suddenly, like, you need to be aware of not just checking the value inside the CST node, but using the metadata APIs we talked about earlier. There's a metadata provider called qualified name provider, or fully qualified name provider. And you just say, oh, fully qualified name provider, please give me the fully qualified name for this Name node. And then you compare that to your old name string.
[00:29:46] Unknown:
As far as being able to do refactorings like that, sort of restructuring modules or ensuring that function signatures are updated everywhere, I'm wondering if you can compare and contrast what LibCST offers as a foundational library for building these types of structures versus what something like Rope might have out of the box.
[00:30:07] Unknown:
I think LibCST tries to provide both low-level APIs for providing the concrete syntax tree and for traversing the syntax tree easily. So if your use case requires collecting some knowledge by analyzing the syntax tree, then you will probably want to use LibCST. But if your use case is very concrete, where you just want to rename some function or move some functions around for common refactoring needs, I think Rope already provides a lot of features to support that. So I would say if you just want to do some common and simple refactoring, you can start with both. But if your use case requires understanding the syntax tree by doing some custom analysis, or the codemods you want to do are very customized, then I think LibCST can help more in those cases.
[00:31:07] Unknown:
If rope supports your use case, it's probably fine. Although that's the same for LibCST. If there is a code mod for your use case for LibCST, that's probably also fine. The question is when none of these exist, like, there's no existing code mod for your use case, then good luck with rope. I would definitely use LibCST.
[00:31:25] Unknown:
Yeah. Absolutely. And beyond just the code modding use case that we've been discussing, you also mentioned that your original reason for getting involved in LibCST was to add in some of these linting capabilities. And so I'm wondering if you can talk through some of the other types of workflows that you might build on top of LibCST, and for linting in particular, why you might use LibCST in place of, or in addition to, something like Flake8 or Pylint?
[00:31:53] Unknown:
Flake8 is a pretty common open source linter tool, and a lot of people use it. But 1 feature we have found it's missing is autofix. So if you compare this linter with ESLint, you can see that some of the rules from ESLint can provide an autofix. And we think autofix is useful and important, especially in a large organization like Meta. You have hundreds or thousands of engineers. If the linter just surfaces an error to them and they need to manually fix it, then we think having engineers manually work on those trivial fixes is a waste of their time. We want them to be able to focus more on developing products. So we want to provide autofix for them. So what we have been doing after building LibCST is we built a linter framework.
It's also open sourced. It's called Fixit. And in Fixit, you can easily define and implement a linter rule that provides autofix. We actually try to provide autofix for some of the lint rules provided in Flake8. So yeah. For example, if you look at Fixit, you can see there are more than 20 rules we open sourced. And internally, there are a lot more custom rules we provide. I think, overall, there are more than a 100 rules, and a lot of those rules can provide autofix. For example, there's a rule called no string annotation. So in an annotation, you may use a string quoted type, and we think that is less readable.
The IDE may just highlight it as a string. But if you don't use the quotes and make it a real type, the IDE can provide better syntax highlighting, which is more readable. So once you define something like that as a best practice and get consensus from everyone, you can start building the tooling using Fixit. And the autofix we provide from this tool is to automatically remove the quotes from the string type annotation. So what the end user sees in their IDE is a button called fix it. We built a VS Code plugin that automatically calls our Fixit linter, and whenever there's an autofix provided, it will show a button. So the engineer can just 1 click to fix it. And we think this can save their time and get them focused more on business development.
[00:34:40] Unknown:
Yeah. So, essentially, like, the TLDR is Flake8 started as, hey, there are some bad patterns, let's find them, which is a valid approach to linting. But it's completely different to the Fixit or LibCST approach, where it was a code mod first, like, we should fix it first. And then, oh, you can also flag them to users in their IDE. So it was, like, kind of the other way around. Because I believe Flake8 does have some kind of auto fixes. They're just kinda hard to implement.
[00:35:06] Unknown:
1 of the interesting challenges in these kinds of situations is that Flake8 is a valuable tool that a lot of people have invested time into, adding different plugins and additional linting rules to be able to surface those errors. LibCST is a project that's able to identify those errors and automatically fix them, but doesn't have the ecosystem around it yet. And I'm wondering if there's any opportunity for being able to automatically take some of the rules that are defined for Flake8 and translate them into a syntax that LibCST can understand, so that you can use Fixit to automatically resolve them, and just what that process will look like for being able to build up the capabilities around LibCST and Fixit to get up to the level of what the Flake8 ecosystem has already built up. Yeah. You know what? Great minds think alike. I think that's exactly what Jimmy was alluding to when he said sometimes Fixit takes Flake8 rules
[00:36:01] Unknown:
and just, like, provides auto fixes for them. I think that's exactly what's happening there. But, Jimmy, correct me if I'm wrong. Yeah. So
[00:36:07] Unknown:
because Flake8 provides a very different API for implementing the linter rules and traversing the syntax tree. So what we ended up doing in Fixit is we are calling into Flake8 to reuse some of the existing rules. But in order to provide autofix, we still need to redo the implementation using LibCST for finding the problematic code pattern, and on top of that, we add the autofix logic. But I think what you described would be ideal. That is, we could automatically convert, like, a
[00:36:47] Unknown:
Flake8 pattern finder into a LibCST finder, but that doesn't exist yet. Pull requests are accepted, though. I was gonna say, sounds like a community contribution waiting to happen.
[00:36:58] Unknown:
Exactly.
[00:36:59] Unknown:
And so digging a bit more into the Fixit project, which is actually how I initially came across LibCST, I'm wondering if you can talk through just some of the structure of that and maybe some of the ways that you're using both LibCST and Fixit in your own work.
[00:37:13] Unknown:
So before we had Fixit, we tried to use the AST to develop linters, and we built them as a Flake8 plugin. We started with an AST visitor to traverse the code and find patterns. And over time, the visitor just became very complex, because a lot of different rules tried to implement different logic on the same visitor, and that made it hard to maintain. So when we got a chance to implement a new framework, besides solving the autofix problem, we also wanted to make the linter rules easy to implement and maintain. So each linter rule can implement its own independent visitor.
But with this, if you have a lot of linter rules, let's say a 1, 000 rules, and you need to traverse the tree a 1, 000 times, then it will become slow. So in order to be able to traverse fast and also keep the code separate for each rule, we designed a batchable visitor, which allows us to combine all the different callback functions registered from the different linter rules. We batch them together, and we apply them in just 1 single syntax tree traversal. Given a source code, we can just parse it as a syntax tree and traverse it once. And while running all those different callback functions from different rules, we allow them to have separate context for their logic.
So this is 1 design we had for the efficiency of the linter run. And we also try to make it very easy to test. So in the linter, you can just define some test cases by providing code examples. We use automatic document generation that takes those code examples and generates them into the docs pages you can see. And those code examples are also used in automatic unit testing, so we wrote a framework for that. So when you implement a rule, you can provide some valid code examples and some invalid examples. For each invalid code example, you will also want to mention what are the expected linter errors you want to show to the user, and those will be used as examples of the lint error for the end user.
It will also be automatically run as unit tests. And we found this approach makes it very easy to develop the linter. So with those in place, we were able to develop a lot of linter rules and run them efficiently.
[00:40:21] Unknown:
And in terms of the applications of LibCST, now that it's been available for a while and open to the community, I'm wondering what are some of the most interesting or innovative ways that you've seen it
[00:40:33] Unknown:
used. I have several. This one is new to Jimmy too. Did you know that basically, like, most of the Google Cloud Python libraries ship with LibCST code mods to convert between major versions of their API? I think that was pretty cool. I discovered this a couple of weeks ago when I was debugging why the hell LibCST is downloaded so often. The second one, which was kind of, like, surprising to me, is some crazy person decided to write a Python indexer that emits basically the information that's required to do kind of a GitHub like jump to definition and find references, using LibCST.
This crazy person also works at Meta, and it is now powering basically all of our IDE services in the browser and the new VS Code. And then the third 1 is less crazy, but it was surprising to me that people wanted to use LibCST in an IDE setting where they wanted to refactor, you know, just code, remove unused imports and whatnot.
[00:41:38] Unknown:
Yeah. I don't have as unexpected or surprising use cases, but for interesting use cases, I think LibCST has been used to solve large scale tech debt problems, like cleaning up unused code and also converting your code base to adopt asyncio. You will need to convert functions from non-async to async functions, and for each async function call, you need to add await. Those seem easy, but in a large code base, it's a lot of work. I have seen some people do that and share a blog post talking about how they used LibCST for their asyncio adoption. It's also used for type annotation adoption.
So a large code base may miss a lot of type annotations. We want to automatically add some type annotations rather than manually do all the work. And MonkeyType is an open source tool, open sourced by Instagram. And it actually uses LibCST for inserting the missing types that are collected by running the program.
[00:42:49] Unknown:
I have seen a lot of people use MonkeyType as well. Can I also mention a crazy use case that I wish existed, but it doesn't? This has been my pet peeve ever since I knew about LibCST: I wish Black were implemented in terms of LibCST. That's my goal in life now, to make that happen. I mean, even the original maintainers of LibCST, including Jimmy, were like, why would you do this? But, like, LibCST is way better suited to Black, because you can directly manipulate whitespace. That's kind of the entire point of the project. And there are some downsides to the LibCST approach, because whitespace might get attached to completely different parts of the syntax tree even though they're syntactically right next to each other, especially at, like, boundaries between indentation blocks and whatnot.
But I still think Black would be better off if it were implemented in terms of LibCST, or LibCST would be better off if it exposed an API that made Black's implementation easier.
[00:43:47] Unknown:
In your own experience of building and working on LibCST and building other projects on top of it, what are some of the most interesting or unexpected or challenging lessons that you've each learned?
[00:43:58] Unknown:
Code quality is important and hard.
[00:44:00] Unknown:
Way harder than I thought. And just rewrite everything in Rust.
[00:44:06] Unknown:
No. No. Don't in fact, no. Just don't don't rewrite stuff in Rust without a good reason. No.
[00:44:12] Unknown:
I think working on LibCST and making it open source was a new experience for me, because I think it's very different from an internal project. For internal projects, we can try to develop fast, and we don't need to do a lot of work on documentation. But when we decided to make it open source, we thought documentation was important. So we try to use auto generated documents and try to add a lot of documentation to the major classes in their docstrings. And, also, we try to increase the test coverage to make sure we have good practices across the entire project. We also brought in the popular linter tools and set up CI on GitHub to validate our changes using pull requests, while internally we are using a very different set of tools. And we also tried to design a logo for our project. Internally, we have some designers, so by brainstorming with them, we came up with the final logo. What we tried to represent is a tree with a lot of leaves, which is the syntax tree with a lot of nodes.
Yeah. I think, overall, it's interesting
[00:45:45] Unknown:
and a good learning experience for me. I mean, maybe we shouldn't include the part about, hey, internally documentation was not important, but open sourcing made us realize that it's actually important. But I do definitely agree with Jimmy's experience. Like, I've been a maintainer for quite a while, so I know the open source experience. But writing something that's mainly driven by internal use cases, while also making sure that it's not just usable, but a pleasant user experience for open source users, is a major challenge. And it's, like, a lot of effort. It's very hard to justify internally, like, why are you working on this? Well, because of all the open source users. But there are these 10 other more important things that you could be working on. And so instead, what ends up happening is you still deeply care about the project, so you do the work over the weekends and whenever you have time, and you're trying to squeeze it in. And then, like, the end result is kind of subpar.
So I don't have a good solution to this, but it's definitely a lesson I learned. I mean, I knew that open sourcing something from an internal code base was a lot of work, but it's just, like, way more than I expected. It's a lot of continuous work that doesn't get easily recognized. Even from the open source community, I don't think I've seen anyone say, hey, good job, LibCST maintainers. And I don't expect it; that's not how open source works. It just, like, makes it even harder to justify all the work that's required to make this pleasant.
[00:47:13] Unknown:
Well, consider this interview your good job. Thanks. And so for people who are looking for a way to be able to parse and maybe code mod their projects, or they're looking for a way to be able to do round trip parsing and regeneration of their code, what are cases where LibCST is the wrong choice, and maybe they're better suited with a different concrete syntax tree, or maybe they just need an abstract syntax tree, or just something like rope, and they actually don't need a parser and syntax tree analyzer at all? Wrong choice? What do you mean? I don't understand the question.
[00:47:46] Unknown:
LibCST is always the right choice. No, in all seriousness, if you're looking to write a code mod, LibCST is a really good choice. There are very, very few cases where I wouldn't recommend LibCST if you're writing a code mod. Having said that, like, if you just want to make sure that the code is syntactically valid and it's never going to be read or used by a human, then the complexity of a concrete syntax tree is usually not worth your time. You're usually better off either using regular expressions or, if you're doing something more complex, using an AST. And, you know, like, if there's an off the shelf product that does what you need, then just use it. You don't need to reimplement your logic with LibCST just because it's cool. Although that's a fun weekend project, and you can learn a lot. In general, I think, like, LibCST hits the mark on a lot of points when you're writing a code mod. So, generally, I would err on the side of using it rather than not.
[00:48:39] Unknown:
Yeah. And I think I am aware of some cases that cannot be supported by LibCST. Imagine in your IDE, you want to build some code automation based on the draft code the user has. And that means the code is not complete yet. It's not valid Python code, and LibCST cannot parse that code. So if you want to build some very smart automation that can autocomplete the Python code in an IDE, you will need to find a different solution.
[00:49:14] Unknown:
Yeah. Today, that's entirely true, and that might change in the future if I get my way. Speaking of the future, what are some of the things you have planned for the near to medium term of the LibCST project, and maybe also for Fixit if you wanna dig into that? I can talk about LibCST, not so much Fixit, because LibCST is my current focus.
[00:49:32] Unknown:
Obviously, we want to keep up with new language features. Python is evolving quite fast. Like, the once a year major release usually introduces a new language feature, and we do want to be on top of that. So LibCST already supports 3.11, I believe. So that's a given. The second big thing is I want to make sure that the traversal of nodes in LibCST is fast enough so it can support more interactive use cases, with the caveat that Jimmy just mentioned. So, like, even if it's fast, we will still only be able to transform, analyze, and modify valid Python code.
So in fact, that's the next priority item. Now that the rewrite of LibCST's parser is slowly but surely wrapping up, we will have a chance to change the parser so it can accept invalid Python code and then have some boundaries where it says, okay, yeah, this expression is invalid, but I can still parse the rest of the document, and you can still operate on that. So that's something that we might be looking at. No promises. We might be looking at it in the future.
[00:50:35] Unknown:
Are there any other aspects of the libcst project or some of the ways that it's being used both in your own work and in the community and just sort of code modding and concrete syntax trees in general that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:50] Unknown:
I think as Zsolt mentioned, we still need a lot more tools built around LibCST in the open source world in order to make it easier to use. For the other popular tools like Black or Flake8, I have seen that other people build a lot of tools to make them very easy to use. Like, they have GitHub Actions, so you can easily enable Black or mypy on your GitHub projects, but LibCST is missing those. And at my work at Carta, I actually also try to use LibCST, and we are using GitHub for our code development. So I have been trying to develop things around GitHub and LibCST and Fixit too.
So I have been trying to build a codemod service that automatically creates a GitHub pull request. And, hopefully, in the future, some of my work can make the open source ecosystem more useful.
[00:51:58] Unknown:
You should definitely take a look at 1 of the picks I have picked for later. It's called AutoTransform. And if you remember Codemod Service from internally, that's the open source version. And the guy who wrote Codemod Service writes that, so highly recommend. It is growing Ellipsis integration as we speak.
[00:52:19] Unknown:
Alright. Well, speaking of the picks, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, now I will move us into the picks. So to get us started, I'm gonna pick the Osprey Manta backpack. I got 1 recently, and it's definitely a major upgrade over some of the other packs that I've used. I particularly like the sort of support structure that it has against the back that keeps the bag off of your back, so you get nice airflow and you don't get all sweaty when you're out and about hiking or biking. So I recommend that if you're looking for a pack to wear when you're out and being active. And so with that, I'll pass it to you, Zsolt. Do you have any picks this week? Yeah. But they are unfortunately boring and related to code.
[00:53:01] Unknown:
So AutoTransform is 1 of mine. Just a quick shout out to Nathan, who's working on it. Just Google AutoTransform GitHub, and you'll find it. It's a service that automates your code mods for you. So it's, like, a way to, once you have a code mod that transforms your little source file into the desired state, this service will let you apply that across all your hundreds and thousands of files in an automated way and not lose your sanity. The other pick is, go to glean.software in your browser. It's a cool little database project that allows you to store facts about any kind of source code. And I'll just leave it at that. You can discover it for yourself.
[00:53:41] Unknown:
Alright. How about you, Jimmy? Do you have any picks this week?
[00:53:44] Unknown:
I don't have a specific pick, but I would like to share something that is related to LibCST and code mods. So I think after working on a lot of different projects and in different companies, I have seen that tech debt is a problem that we always face. And it's hard to make the decision whether we should fix that tech debt right now, because there are always more important or urgent product development requirements. I think that's 1 of the motivations: we want to build the automation to fix them. But not all the tech debt can be auto fixed. There's still a lot of more complex tech debt that requires engineers to work on it. So we want to make sure that when engineers proactively work on tech debt, their contribution will be recognized.
So what I have been trying to do in my previous few projects is to analyze the commits to the code base to figure out what kind of contribution the developer makes to solve some of the tech debt. For example, maybe they add missing types, or they clean up some unused code. With an automatic analysis, we can summarize their contributions, and we try to use those summarized contributions to provide numbers as a reference to recognize their work. So that's a philosophy or a strategy I have for large scale tech debt problems.
[00:55:30] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you've been doing on LibCST. It's definitely a very interesting project and 1 that I'm glad to see as part of the ecosystem. So I appreciate all of the time and energy that the both of you and the rest of the contributors to the project have put into it, and I hope you each enjoy the rest of your day.
[00:55:49] Unknown:
Thank you for all the work you put into the Python community at large. I've been an on and off listener to the podcast, by the way. Good job in general. Thank you. Yeah. Thank you for inviting us. I'm so happy to share our experience with more people.
[00:56:06] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Zsolt and Jimmy's Backgrounds
First Encounters with Python
Overview of LibCST
Concrete Syntax Tree vs Abstract Syntax Tree
LibCST's Unique Features
Implementation and Architecture of LibCST
Evolution and Goals of LibCST
Performance and Optimization
Using LibCST for Code Mods
Linting with LibCST and Fixit
Structure and Use Cases of Fixit
Lessons Learned from LibCST
When Not to Use LibCST
Future Plans for LibCST
Closing Remarks