Summary
Programmers love to automate tedious processes, including refactoring code. To support the creation of code modifications for Python projects, Jimmy Lai created LibCST. It provides a richly typed, high-level API for creating and manipulating concrete syntax trees of your source code. In this episode Jimmy Lai and Zsolt Dollenstein explain how it works, some of the linting and automatic code modification utilities that you can build with it, and how to get started with using it to maintain your own Python projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Zsolt Dollenstein and Jimmy Lai about LibCST, a concrete syntax tree parser and serializer library for Python
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what LibCST is and the story behind it?
- How does a concrete syntax tree differ from an abstract syntax tree?
- What are some of the situations where the preservation of the exact structure is necessary?
- There are a few other libraries in Python for creating concrete syntax trees. What was missing in the available options that made it necessary to create LibCST?
- What are the use cases that LibCST is focused on supporting?
- Can you describe how LibCST is implemented?
- How have the design and goals of the project changed or evolved since you started working on it?
- How might I use LibCST for something like restructuring a set of modules to move a function definition while maintaining proper imports?
- How do the capabilities of LibCST for codemodding compare to the Rope framework?
- What are some other workflows that someone might build with LibCST?
- What are some of the ways that LibCST is being used in your own work?
- What are the most interesting, innovative, or unexpected ways that you have seen LibCST used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on LibCST?
- When is LibCST the wrong choice?
- What do you have planned for the future of LibCST?
Keep In Touch
Picks
- Tobias
- Zsolt
- Jimmy
- Paying down technical debt
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- LibCST
- Carta
- lib2to3
- Abstract Syntax Tree
- Concrete Syntax Tree
- Pyre
- Parso
- Cython
- mypyc
- Rope
- Flake8
- Pylint
- ESLint
- Fixit
- MonkeyType
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40-gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Zsolt Dollenstein and Jimmy Lai about LibCST, a concrete syntax tree parser and serializer library for Python. So, Zsolt, can you start by introducing yourself?
[00:01:10] Unknown:
So my name is Zsolt. I've been working for Meta slash Facebook for the past, like, 6 years. I've been maintaining LibCST since just when the pandemic started to hit Europe. So this is like a pandemic project for me, like 2020-ish. I have an 8-year-old son. I'm big into video games.
[00:01:30] Unknown:
Yeah. That's me. And, Jimmy, how about you? I'm Jimmy. So I was an engineer at Instagram, where I worked on LibCST since the beginning. And right now, I'm an engineer at Carta. So, yeah, I'm a big Python fan. I have been using Python for, like, 10 years, and I like to share what I learned about Python at local meetups and Python conferences.
[00:01:59] Unknown:
And going back to you, Zsolt, do you remember how you first got introduced to Python? I had to dig deep into this question. So, like, I remember at my previous company, they threw me in the deep end; they had some massive Django Python 2 codebase. They told me to do some, like, minor infrastructure features. And that's when I realized that C++ is not the best language in the world; instead, it's Python. And, Jimmy, do you remember how you got introduced to Python?
[00:02:25] Unknown:
I think it goes back to when I was at school, when I was trying to learn data mining and data analysis. Python was used in the course. So I liked that a lot of libraries are provided. It makes data mining or machine learning very easy.
[00:02:44] Unknown:
Alright. And that brings us now to the LibCST project. So can you describe a bit about what it is, some of the story behind how it came to be, and sort of what was missing in the ecosystem for syntax tree parsers that made this a necessary project?
[00:03:00] Unknown:
At Facebook slash Meta, we have a lot of large codebases that are multiple millions of lines of code. At Instagram, we have a monolith that is a few million lines of code. So we have a lot of different problems to solve at large scale. In terms of linting or codemodding, we have a lot of challenges. So in the beginning, we tried to look for some open source tool to help us build customized linters and also codemods. And we didn't find a good one, so we decided to build LibCST. In terms of codemods, Facebook has an open source codemod library that allows you to modify code or match some code patterns using regex.
But regex is not able to help us modify the code or match more complex code patterns, so we couldn't use regex. And later, we also tried to look into the available open source syntax tree parsers. But the built-in syntax tree parser in Python doesn't provide concrete formatting information like whitespace, and those are needed for codemods, because you will want to convert the source code to a syntax tree, modify the syntax tree, and convert it back to updated source code. And there are some available parsers and codemod tools, but they were not very convenient for us to use. So we ended up deciding to build our own concrete syntax tree implementation, LibCST. I have my own take on this. I wasn't there when the project originally started, but, like, I started,
[00:04:50] Unknown:
you know, maintaining this as kind of, like, a power user. So the reason I flocked to LibCST personally is because, like, I tried lib2to3, which is another concrete syntax tree implementation that Black, the popular formatter, actually uses. And it's just like, I mean, it gets the job done, but it's so low level. Right? So, like, you have to keep in mind how the CPython grammar is implemented and then, like, fiddle with low-level CST nodes there, which is great for a formatter like Black, which basically only manipulates whitespace. Not so great for, you know, 99% of the other use cases where you want to write codemods. The major things in LibCST, I feel like, are, like, a relatively high-level API where you can reason about your code as you would, you know, normally, as a normal human being. So you have this high-level API that's also type safe. I mean, it's not impossible, but it's hard to shoot yourself in the foot compared to an untyped
[00:05:48] Unknown:
CST library like lib2to3 or other ones as well. I don't want to, like, pick specifically on lib2to3 because it's a great library. So those are the 2 main selling points of LibCST for me. A lot of people might be familiar with the idea of the abstract syntax tree, and I know that the concrete syntax tree is a bit above and beyond that because it preserves some of the extra information like comments and whitespace that aren't preserved in the abstract syntax tree. And I'm wondering if you can describe some of the situations where that maintenance of the extra information is necessary or useful as compared to use cases where an abstract syntax tree would be the preferable approach?
[00:06:26] Unknown:
Yeah. I mean, so I get this question a lot from people who are not familiar with this space. The most concise way I can describe this is: the AST is useful for machines and for people who want to reason about, like, the semantics of the code, like what the code is intended to do. But the CST is useful for, like, presenting the code to humans. And so, yeah, as you mentioned, like, the CST includes a bunch of additional stuff compared to the AST, like, I don't know, whitespace, comments, extra non-semantic commas and parentheses, and these kinds of things. That's the main difference. And in terms of, like, the situations where preserving, like, the syntactic trivia would be important, I guess it's where you have to interact with humans, because you want to present the code in a particular way to these humans. Like, I don't know, maybe you want to keep existing formatting choices, or, I don't know, you want to order some things in a certain way and stuff like that. I think from the use case of linting and codemodding, we
[00:07:31] Unknown:
do need to preserve that formatting information. For linting, a very common feature of a linter is to ignore some lint errors added in comments. So in order to skip some specific errors, we need to be able to parse and analyze the comments to figure out if there's an ignore comment. Then your linter framework can ignore some errors and not show them to the user. And for codemods, preserving the formatting information is very important, because if you don't do that, when you just modify a small piece of the code, you may end up reformatting the entire file. And that generated diff will be hard for the user to understand, because they cannot see what exactly was changed by the codemod logic.
So it's important to preserve the formatting information.
[00:08:31] Unknown:
Yeah. Exactly. And just to underline this, a lot of our codemods that are shipped with LibCST are about manipulating type hints and lint suppressions and comments and these types of things.
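The trivia loss the guests describe is easy to see with the standard library alone. A minimal sketch (pure stdlib, Python 3.9+; the sample source line is invented for illustration):

```python
import ast

source = "x = [1, 2, 3]  # keep this comment\n"

# Round-tripping through the stdlib AST drops the comment and any
# formatting quirks, because the AST stores no syntactic trivia.
regenerated = ast.unparse(ast.parse(source))
print(regenerated)  # -> x = [1, 2, 3]
```

A codemod built on `ast` would therefore reformat everything it touches; a CST round-trip is designed to emit the original bytes for unchanged nodes.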
[00:08:44] Unknown:
In terms of the overall ecosystem in Python for concrete syntax tree parsers, there are a few of them around, and I'm wondering what it is that LibCST has over and above them that made it a useful exercise to build it, versus just taking one of the existing projects and extending it, either upstreaming the changes or forking it and maintaining your own version. And sort of what are some of the core architectural requirements that you wanted or needed that made it worth the effort of starting a completely new project?
[00:09:17] Unknown:
I am not aware of any CST implementation that's type safe apart from LibCST. So I'm a fan of type hints in general. So I might be extremely biased here, but, like, that's a killer feature, like, in and of itself. The other thing that LibCST provides that others don't is, as I mentioned, like, a high-level API where you can manipulate CST nodes that are kind of logical. Validation: it's very hard to accidentally produce invalid code that has syntax errors, for example. As part of writing codemods in LibCST, you generally will eventually write code that, you know, emits some Python code. And one of the very big disadvantages of a CST library compared to an AST library is that it includes all of this syntactic trivia.
When you're constructing CST nodes, there are tons and tons of parameters that you have to, like, just remember. And LibCST provides an API, I think, that is pretty convenient to use, because there are some sane defaults where, I don't know, like, if you have a list, you can forget about making sure that the commas are in the right places and whatnot, because LibCST will figure that out for you, and things like that. So, like, some of these features are implemented in some CST libraries, but none of them, like, combine all of them. I want to add something. So LibCST,
[00:10:36] Unknown:
it started as a concrete syntax tree parser. But as we continued to develop it, it's not just a concrete syntax tree parser anymore. It also provides convenient APIs for traversing and modifying the tree. We have the transformer API and the matcher API, which allow you to describe a pattern or a shape of a subtree you want to match. It also provides some static analyses that add metadata. That way, you can build higher-level linters and codemods more easily. We provide position data that is useful for linters, and scope analysis that links assignments and accesses, which allows you to link different nodes in the tree so you know they are referring to the same variables.
We also provide qualified name information, and we also integrate with the Pyre type checker, which provides us the inferred type for a given node. And those are building blocks that let you build more powerful codemods and linters. Some of the stuff that Jimmy mentioned
[00:11:53] Unknown:
is provided by this metadata framework that we have in LibCST, which basically lets you say, oh, I have this CST node, because while visiting the whole document, I happened to, you know, pick this because it's an interesting import, for example. And then this metadata API allows you to query interesting things about that particular CST node. Like, for example, as Jimmy mentioned, position: where is it in the original document? Or with the Pyre integration, you can ask for the type of this symbol that's being imported, for example. And there are, like, I don't know, 6 or 7 different metadata providers, and you can also implement your own. So, like, with this framework, it's an extremely powerful framework, you can do some, like, surprisingly complex analysis of the source code that you're transforming at the moment. For example, the most complex, like, analysis that I know of is implemented by the scope provider within LibCST. And it allows you to figure out which symbol is exactly defined where, and where it's being accessed.
And it implements all sorts of, like, weirdness that you as a Python user might not know about, like, for example, name shadowing rules and whatnot, especially with the new, like, walrus operator, which became a bit tricky to implement. So it's a reasonably complex analysis that kind of approaches what the type checker can do. It will never reach what the type checker can actually do. But, like, you can, you know, use LibCST not just for codemods, but for, like, reasonably complex analysis like this as well, using this amazing metadata framework.
[00:13:21] Unknown:
Digging into LibCST itself, can you talk through how it's implemented and some of the internal architecture and design choices that you made when you were structuring it?
[00:13:31] Unknown:
So we didn't build everything from scratch. We tried to reuse an available parser based on pgen2; what we use is a fork of the Parso library. What this library provides is that it allows us to define the grammar and the conversion functions. So what we needed to do was take the grammar defined by the Python documentation, the grammar of Python, and build a lot of different conversion functions to process the output from the parser. So our main logic is focused on how we want to preserve all those parsed values, especially the formatting values, and how we want to define our nodes in the concrete syntax tree. So we needed to carefully design the values, and how those formatting values should be preserved.
We tried to carefully review different cases of whitespace, commas, and parentheses, think about their use cases, and then decide which nodes should own them. Yeah. So this was the early work of the implementation of LibCST. We needed to implement a lot of conversion functions, and we added a lot of tests. And, of course, we also made it strongly typed, so it's very easy and convenient to use. And in terms of some of the APIs, like the transformer and visitor pattern, you are able to register a callback function, like visit simple statement.
By registering a function like this, you can have the function called during the tree traversal. And in order to provide strong type annotation support, we automatically generate the type annotations for all those callback function hooks. So we actually use LibCST codemods within LibCST to generate a lot of those type hints and function definitions.
[00:15:50] Unknown:
From, like, a bird's-eye view, LibCST has roughly, let's say, 4 big components. One being, kind of, like, the parser: stuff that takes Python code and converts it to something that you as a LibCST user can use. So that's the parser. The second one is basically just a bunch of classes that describe the CST itself. So, like, things like here's how an if statement looks, here's how a while loop looks, stuff like that. So those are all just, like, a bunch of classes somewhere in the API. And the parser outputs these objects, basically. So that's the 2nd component. The 3rd big component is what Jimmy mentioned, all these visitor and pattern matching and traversal APIs, which are generated from the second component. So these, like, class definitions and whatnot, they're generated using LibCST, as Jimmy mentioned, which was a bit confusing for me originally, but now I kinda sort of maybe understand. The last big component is the whole, like, CLI framework on top of all of this. So essentially, the way LibCST is intended to be used is, like, there is already a codemod for your use case; you just have to invoke it. Like, there's a CLI interface so you can parameterize it.
It can handle, like, big projects with, like, Python roots and multiple files. And all of these things are in this whole, like, 4th component.
[00:17:10] Unknown:
In terms of the evolution of the project, I'm curious, what are some of the ways that the overall scope and design and goals of the system have changed or evolved since you first began working on it? Yeah. I think in the beginning, we focused on
[00:17:25] Unknown:
building the concrete syntax tree, building the parser, and also building the basic visitor and transformer API. This API also exists in the Python AST library, so we tried to use the same pattern. And after we had those ready and started to use them in our later use cases, we found that sometimes when you have a lot of callback functions registered in an implementation, it can be very messy and complex. So we decided to spend some effort to make traversal and modification easier. That's why we built the matcher pattern. With matchers, we can describe the shape of a subtree in a way that is easier for the code reader and easier to maintain.
In the beginning, we also tried to build some basic static analyses, like the position metadata for our linter. The linter needs to report the line number of a lint warning to the IDE, so the IDE can show the error inline. And later, when we started to build a lot of linters, we ran into some tasks where the authors of those linters wanted more powerful static analysis. So we started to build more static analyses, including scope analysis and the qualified name provider. So, basically, I would say we evolve the library based on our clients.
So we, as an infrastructure team, try to provide this as a tool for all the other engineers at Meta. Some engineers have different use cases, and based on their asks we prioritize
[00:19:26] Unknown:
our work. One of those engineers was me. I started using LibCST in a surprising way. I didn't use it for codemods; I used it for code analysis. And so, I mean, technically, when I started working on the project, this was already a goal for me. So it wasn't really a change. But in terms of the project, I was focusing a lot on the analysis capabilities of LibCST. So that's how I got onboarded, actually. And then over the course of the past 2 years, what has become apparent is that LibCST is a really, really powerful API that is very pleasant to use. And lots of people who want to manipulate code want to use it, but sometimes they can't, because, I mean, it's not the fastest piece of code around town, because that was never a design goal. Like, being fast was always secondary to being expressive and powerful.
And so LibCST was, and still is, used the majority of the time in offline mode. Like, I have these 100,000 files; apply this codemod to them, please. And then it's okay if it takes, I don't know, a minute more. But sometimes, especially more recently, people want to use LibCST in more interactive sessions, like IDEs. Like, I want to remove all my unused imports, but I don't want it to take a minute, please. And so, like, these people cannot use LibCST or more complex codemods today, because it's just too slow. Like, just to paint a picture: IDE operations need to complete within 300 milliseconds.
Codemods can take multiple seconds easily. That's just a no-no. So that's something I see nowadays that is changing. Like, the shift in terms of goals: performance
[00:21:11] Unknown:
is slowly shifting upwards in in the prior to this. That's what I'm trying to say, I guess. And I'm curious if you've had any attempts at just running it through Cython to see if that improves some of the operation speed at all or if it's just because of the nature of the changes that it's operating and just, like, the the IO that it has to do with the files that causes it to slow it down a lot?
[00:21:32] Unknown:
Yeah. Honestly, I haven't, but it's on my to-do list. And not Cython, but mypyc, actually. Because the codebase is already well typed, I think running mypyc should be fairly straightforward on it. But, then again, I've said the same thing about Black, and it was a year-and-a-half project. But, yeah, we'll see. That's one of the opportunities we're exploring. The other thing I have personally explored, actually, is because we needed to add support for Python 3.10, and the parser that LibCST used to be based on was Parso. Parso does not support the 3.10 grammar because it's an entirely different type of grammar. Not sure if we need to go into the specifics here. But, like, basically, Parso doesn't support all the features of Python 3.10.
And for a codemod tool that's being applied to code that can potentially include 3.10 syntax features, we need it to support 3.10. I went ahead and rewrote the parser from scratch in Rust. And yeah. I know. Yes. In Rust. Part of the reasoning behind the choice of Rust was because it gives us, in the future, the opportunity to optimize the parsing side of things, which admittedly is not the slowest part of LibCST. Generally, the slowest parts of LibCST are visitation and transforming. So parsing is, I don't know, about 300 milliseconds for a typical document. That's not terrible.
But because performance is slowly shifting upwards in terms of the priorities, the choice was made to write this in Rust. So in the future, if this becomes the bottleneck, then we'll, you know, have leverage. That's how I rationalize it, then.
[00:23:13] Unknown:
Or just because you wanted a new project to keep yourself busy with.
[00:23:18] Unknown:
I mean, if I'm completely honest, this was a hobby project. Like, it was just on my spare time, and then I realized, oh, shit, actually, we need this for some business cases. And then I was the furthest along, so we just went ahead and did it my way. But, honestly, I probably wouldn't rewrite it in Python even if I were given the chance, because what I realized is that one of the high-level design goals for LibCST is it being well typed. And the biggest gaping hole in terms of the type safety within LibCST is the parsing, because Parso is untyped. And also it's very, very hard to express the types of transformations that the parsing, you know, functions need to express in Python's type system.
It was just, like, not done. So it's completely dynamic. And then we, like, assume that the types are correct on the output side of the parser. And that turns out, of course, to be incorrect in some cases. And there are several bugs in LibCST today that are caused by the parser returning slightly different types. Like, they make sense, but they're not the ones that are being advertised. So then, I don't know, sometimes nodes have an extra comma attribute even though you're not supposed to put commas there, but the type allows you to. And I consider that a bug. And the reason I bring this up is because during the Rust rewrite of the parser, I bumped into this all the time, because Rust's type system does not allow you to mess around. So I had to figure out what is the, you know, correct type to return and implement it, file a bug. And then, eventually, when the old parser infrastructure is turned off completely, these bugs will be killed for good.
[00:24:58] Unknown:
Rust is quickly becoming the companion language for Python, slowly replacing C for a lot of the native code that you're plugging into Python libraries. And Yeah. Close to hardware code. Yeah. And so in terms of actually using LibCST, I'm wondering if you can talk through some example workflows of saying, I have this codemod that I want to do, maybe some examples of the types of codemods that you would use it for, and then the process of maybe interactively exploring what are the changes that I need to make, and then being able to turn that into a module and make it part of your standard CI suite. To make sure that, you know, we never wanna have this specific code pattern, we always wanna make sure that this function name is actually translated to this, or that there are no function signatures that have 3 parameters because it's actually supposed to have 4 now. And one of the challenges about answering this question is there's lots and lots of internal tooling at Meta
[00:25:53] Unknown:
that makes most of these answers, like, easy internally, but there's no real good open source solution. And I fully admit that part of the weakness around LibCST is all the open source tooling. This is great, but now I want to enforce these rules in my codebase; there are no tools that I'm aware of that allow you to do this painlessly the way I'm used to doing it internally at the company. So that's something that we might explore in the future. But at a high level, assuming you want to, like, run a codemod that already exists and you don't have to write it yourself, you discovered it by, I don't know, browsing documentation, or a colleague tipped you off or something. What you do is you just, like, take your Python source code, you just run the LibCST CLI, choosing your codemod, and then maybe passing in some parameters, seeing your files change, making sure that the changes make sense, and then automating it. It's kinda like writing a script that runs as part of your CI, and then either flags pull requests or just automatically checks in code. We do both approaches sometimes internally.
Some of our codemods are actually run on the entire codebase every day. Like, formatting is one of these, but also removing unused imports and stuff like that. They're just, like, being run, producing pull requests, being also accepted automatically, and just, like, good to go. Does that answer the question?
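A sketch of that CLI workflow (command names from LibCST's `libcst.tool` interface; exact codemod module paths may vary by version):

```shell
# One-time setup: write a .libcst.codemod.yaml config at the project root.
python -m libcst.tool initialize .

# Discover the codemods that ship with the library.
python -m libcst.tool list

# Apply one across the whole tree; untouched spans keep their formatting.
python -m libcst.tool codemod remove_unused_imports.RemoveUnusedImportsCommand .
```

From there, wiring the last command into a CI job or a scheduled script is ordinary shell automation.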
[00:27:18] Unknown:
Yeah. And then, for example, if I have a codemod where I want to restructure a set of modules, so I've got a function definition that I actually want to move into a different lib module, and then I wanna make sure that all of the imports across the project are updated to use that new module location. What would be the process for actually
[00:27:38] Unknown:
writing a codemod to do that? For this particular one, there exists a codemod, so you don't have to write it yourself. You just, like, call libcst.tool something something rename, and then pass in the old name and new name, pass in all your files, and make sure that the Python root is set correctly so that the fully qualified names are set. But if you wanted to write a new codemod from scratch, my process for this is starting always with the visitation API, and so figuring out what are the patterns I need to match on and what do I need to change them to. Myself, I look at the matcher API first, because it's extremely expressive. So you just put some decorators on your visitation function that will make sure your visitation function only gets called when the node matches this particular pattern. For example, I don't know, it's within an else branch of an if with a particular condition.
So you can put some crazy things in there in a decorator, and then your function will only get called in those conditions. And then you basically say, okay, now I'm in the code that I care about, I transform it. Like, let's say, for your example, renaming imports and symbols, you would be interested in visiting all the Name CST nodes. And then in your visitation function, you say, oh, I'm in a Name node; let's check if the name of this node is equal to the old name. If it is, then emit a different Name node with the new name. And then the catch here is you want to do this in an intelligent way. So not just, like, you know, control-C, control-V, but more like you want to figure out, oh, if this name actually refers to the same thing, even though the name is different because it's imported as an alias, for example. Then suddenly, like, you need to be aware of not just checking the value inside the CST node, but using the metadata APIs we talked about earlier. There's a metadata provider called qualified name provider, or fully qualified name provider. And you just say, oh, fully qualified name provider, please give me the fully qualified name for this Name node. And then you compare that to your old name string.
[00:29:46] Unknown:
As far as being able to do refactorings like that, sort of restructuring modules or ensuring that function signatures are updated everywhere, I'm wondering if you can compare and contrast what LibCST offers as a foundational library for building these types of structures versus what something like Rope might have out of the box.
[00:30:07] Unknown:
I think LibCST tries to provide both low-level APIs for providing the concrete syntax tree and for traversing the syntax tree easily. So if your use case requires collecting some knowledge by analyzing the syntax tree, then you will probably want to use LibCST. But if your use case is very concrete, where you just want to rename some function or move some functions around for common refactoring needs, I think Rope already provides a lot of features to support that. So I would say if you just want to do some common and simple refactoring, you can start with both. But if your use case requires understanding the syntax tree by doing some custom analysis, or the codemods you want to do are very customized, then I think LibCST can help more in those cases.
[00:31:07] Unknown:
If rope supports your use case, it's probably fine. Although that's the same for LibCST. If there is a code mod for your use case for LibCST, that's probably also fine. The question is when none of these exist, like, there's no existing code mod for your use case, then good luck with rope. I would definitely use LibCST.
[00:31:25] Unknown:
Yeah. Absolutely. And beyond just the code modding use case that we've been discussing, you also mentioned that your original reason for getting involved in LibCST was to add in some of these linting capabilities. And so I'm wondering if you can talk through some of the other types of workflows that you might build on top of LibCST, and for linting in particular, why you might use LibCST in place of, or in addition to, something like Flake8 or Pylint?
[00:31:53] Unknown:
Flake8 is a pretty common open source linter tool, and a lot of people use it. But 1 feature we have found it's missing is autofix. So if you compare this linter with ESLint, you can see that some of the rules from ESLint can provide an autofix. And we think autofix is useful and important, especially in a large organization like Meta. You have hundreds or thousands of engineers. If the linter just surfaces an error to them and they need to manually fix it, then we think having engineers manually work on those trivial fixes is a waste of their time. We want them to be able to focus more on developing products. So we want to provide autofix for them. So what we have been doing after building LibCST is we built a linter framework.
It's also open sourced. It's called Fixit. And in Fixit, you can easily define and implement a linter rule that provides autofix. We actually try to provide autofix for some of the lint rules provided in Flake8. So yeah. For example, if you look at Fixit, you can see there are more than 20 rules we open sourced. And internally, there are a lot more custom rules we provide. I think, overall, there are more than a 100 rules, and a lot of those rules can provide autofix. For example, there's a rule called no string annotation. So in an annotation, you may use a string quoted type, and we think that is less readable.
The IDE may just highlight it as a string. But if you don't use the quotes and make it a real type, the IDE can provide better syntax highlighting, which is more readable. So once you define something like that as a best practice and get consensus from everyone, you can start building the tooling using Fixit. And the autofix we provide from this tool is to automatically remove the quotes from the string type annotation. So what the end user sees in their IDE is a button called fix it. We built a VS Code plugin that automatically calls our Fixit linter, and whenever there's an autofix provided, it will show a button. So the engineer can just 1 click to fix it. And we think this can save their time and get them focused more on business development.
[00:34:40] Unknown:
Yeah. So, essentially, like, the TLDR is Flake8 started as, hey, there are some bad patterns, let's find them, which is a valid approach to linting. But it's completely different to the Fixit or LibCST approach, where it was a code mod first, like, we should fix it first. And then, oh, you can also flag them to users in their IDE. So it was, like, kind of the other way around. Because I believe Flake8 does have some kind of auto fixes. They're just kinda hard to implement.
[00:35:06] Unknown:
1 of the interesting challenges in these kinds of situations is that Flake8 is a valuable tool that a lot of people have invested time into, adding different plugins and additional linting rules to be able to surface those errors. LibCST is a project that's able to identify those errors and automatically fix them, but doesn't have the ecosystem around it yet. And I'm wondering if there's any opportunity for being able to automatically take some of the rules that are defined for Flake8 and translate them into a syntax that LibCST can understand, so that you can use Fixit to automatically resolve them, and just what that process will look like for being able to build up the capabilities around LibCST and Fixit to get up to the level of what the Flake8 ecosystem has already built up. Yeah. You know what? Great minds think alike. I think that's exactly what Jimmy was alluding to when he said sometimes Fixit takes Flake8 rules
[00:36:01] Unknown:
and just, like, provides auto fixes for them. I think that's exactly what's happening there. But, Jimmy, correct me if I'm wrong. Yeah. So
[00:36:07] Unknown:
because Flake8 provides a very different API for implementing the linter rules and traversing the syntax tree. So what we ended up doing in Fixit is we are calling into Flake8 to reuse some of the existing rules. But in order to provide autofix, we still need to redo the implementation using LibCST for finding the problematic code pattern, and on top of that, we add the autofix logic. But I think what you described would be ideal. That is, we could automatically convert, like, a
[00:36:47] Unknown:
Flake8 pattern finder into a LibCST finder, but that doesn't exist yet. Pull requests are accepted, though. I was gonna say, sounds like a community contribution waiting to happen.
[00:36:58] Unknown:
Exactly.
[00:36:59] Unknown:
And so digging a bit more into the Fixit project, which is actually how I initially came across LibCST, I'm wondering if you can talk through just some of the structure of that and maybe some of the ways that you're using both LibCST and Fixit in your own work.
[00:37:13] Unknown:
So before we had Fixit, we tried to use the AST to develop linters, and we built them as a Flake8 plugin. We started with an AST visitor to traverse the code and find patterns. And over time, the visitor just became very complex, because a lot of different rules tried to implement different logic on the same visitor, and that made it hard to maintain. So when we got a chance to implement a new framework, besides solving the autofix problem, we also wanted to make the linter rules easy to implement and maintain. So each linter rule can implement its own independent visitor.
But with this, if you have a lot of linter rules, let's say a 1, 000 rules, and you need to traverse the tree a 1, 000 times, then it will become slow. So in order to be able to traverse fast and also keep the code separate for each rule, we designed a batchable visitor, which allows us to combine all the different callback functions registered from the different linter rules. We batch them together, and we apply them in just 1 single syntax tree traversal. Given a source code, we can just parse it as a syntax tree and traverse it once. And while running all those different callback functions from different rules, we allow them to have separate context for their logic.
So this is 1 design we had for the efficiency of the linter run. And we also try to make it very easy to test. So in the linter, you can just define some test cases by providing code examples. We use automatic document generation that takes those code examples and generates them into the docs pages you can see. And those code examples are also used in automatic unit testing, so we wrote a framework for that. So when you implement a rule, you can provide some valid code examples and some invalid examples. For each invalid code example, you will also want to mention what are the expected linter errors you want to show to the user, and those will be used as examples of the lint error for the end user.
It will also be automatically run as unit tests. And we found this approach makes it very easy to develop the linter. So with those in place, we were able to develop a lot of linter rules and run them efficiently.
[00:40:21] Unknown:
And in terms of the applications of LibCST, now that it's been available for a while and open to the community, I'm wondering what are some of the most interesting or innovative ways that you've seen it
[00:40:33] Unknown:
used. I have several. This one is new to Jimmy too. Did you know that basically, like, most of the Google Cloud Python libraries ship with LibCST code mods to convert between major versions of their API? I think that was pretty cool. I discovered this a couple of weeks ago when I was debugging why the hell LibCST is downloaded so often. The second one, which was kind of, like, surprising to me, is some crazy person decided to write a Python indexer that emits basically the information that's required to do kind of a GitHub like jump to definition and find references, using LibCST.
This crazy person also works at Meta, and it is now powering basically all of our IDE services in the browser and the new VS Code. And then the third 1 is less crazy, but it was surprising to me that people wanted to use LibCST in an IDE setting where they wanted to refactor, you know, just code, remove unused imports and whatnot.
[00:41:38] Unknown:
Yeah. I don't have as unexpected or surprising use cases, but for interesting use cases, I think LibCST has been used to solve large scale tech debt problems, like cleaning up unused code and also converting your code base to adopt asyncio. You will need to convert functions from non-async to async functions, and for each async function call, you need to add await. Those seem easy, but in a large code base, it's a lot of work. I have seen some people do that and share a blog post talking about how they used LibCST for their asyncio adoption. It's also used for type annotation adoption.
So a large code base may miss a lot of type annotations. We want to automatically add some type annotations rather than manually do all the work. And MonkeyType is an open source tool, open sourced by Instagram. And it actually uses LibCST for inserting the missing types that are collected by running the program.
[00:42:49] Unknown:
I have seen a lot of people use MonkeyType as well. Can I also mention a crazy use case that I wish existed, but it doesn't? This has been my pet peeve ever since I knew about LibCST: I wish Black were implemented in terms of LibCST. That's my goal in life now, to make that happen. I mean, even the original maintainers of LibCST, including Jimmy, were like, why would you do this? But, like, LibCST is way better suited to Black, because you can directly manipulate whitespace. That's kind of the entire point of the project. And there are some downsides to the LibCST approach, because whitespace might get attached to completely different parts of the syntax tree even though they're syntactically right next to each other, especially at, like, boundaries between indentation blocks and whatnot.
But I still think Black would be better off if it were implemented in terms of LibCST, or LibCST would be better off if it exposed an API that made Black's implementation easier.
[00:43:47] Unknown:
In your own experience of building and working on LibCST and building other projects on top of it, what are some of the most interesting or unexpected or challenging lessons that you've each learned?
[00:43:58] Unknown:
Code quality is important and hard.
[00:44:00] Unknown:
Way harder than I thought. And just rewrite everything in Rust.
[00:44:06] Unknown:
No. No. Don't in fact, no. Just don't don't rewrite stuff in Rust without a good reason. No.
[00:44:12] Unknown:
I think working on LibCST and making it open source was a new experience for me, because I think it's very different from an internal project. For internal projects, we can try to develop fast, and we don't need to do a lot of work on documentation. But when we decided to make it open source, we thought documentation was important. So we try to use auto generated documents and try to add a lot of documentation to the major classes in their docstrings. And, also, we try to increase the test coverage to make sure we have good practices across the entire project. We also brought in the popular linter tools and set up CI on GitHub to validate our changes using pull requests, while internally we are using a very different set of tools. And we also tried to design a logo for our project. Internally, we have some designers, so by brainstorming with them, we came up with the final logo. What we tried to represent is a tree with a lot of leaves, which is the syntax tree with a lot of nodes.
Yeah. I think, overall, it's interesting
[00:45:45] Unknown:
and a good learning experience for me. I mean, maybe we shouldn't include the part about, hey, internally documentation was not important, but open sourcing made us realize that it's actually important. But I do definitely agree with Jimmy's experience. Like, I've been a maintainer for quite a while, so I know the open source experience. But writing something that's mainly driven by internal use cases, while also making sure that it's not just usable, but a pleasant user experience for open source users, is a major challenge. And it's, like, a lot of effort. It's very hard to justify internally, like, why are you working on this? Well, because of all the open source users. But there are these 10 other more important things that you could be working on. And so instead, what ends up happening is you still deeply care about the project, so you do the work over the weekends and whenever you have time, and you're trying to squeeze it in. And then, like, the end result is kind of subpar.
So I don't have a good solution to this, but it's definitely a lesson I learned. I mean, I knew that open sourcing something from an internal code base was a lot of work, but it's just, like, way more than I expected. It's a lot of continuous work that doesn't get easily recognized. Even from the open source community, I don't think I've seen anyone say, hey, good job, LibCST maintainers. And I don't expect it; that's not how open source works. It just, like, makes it even harder to justify all the work that's required to make this pleasant.
[00:47:13] Unknown:
Well, consider this interview your good job. Thanks. And so for people who are looking for a way to be able to parse and maybe code mod their projects, or they're looking for a way to be able to do round trip parsing and regeneration of their code, what are cases where LibCST is the wrong choice, and maybe they're better suited with a different concrete syntax tree, or maybe they just need an abstract syntax tree, or just something like rope, and they actually don't need a parser and syntax tree analyzer at all? Wrong choice? What do you mean? I don't understand the question.
[00:47:46] Unknown:
LibCST is always the right choice. No, in all seriousness, if you're looking to write a code mod, LibCST is a really good choice. There are very, very few cases where I wouldn't recommend LibCST if you're writing a code mod. Having said that, like, if you just want to make sure that the code is syntactically valid and it's never going to be read or used by a human, then the complexity of a concrete syntax tree is usually not worth your time. You're usually better off either using regular expressions or, if you're doing something more complex, using an AST. And, you know, like, if there's an off the shelf product that does what you need, then just use it. You don't need to reimplement your logic with LibCST just because it's cool. Although that's a fun weekend project, and you can learn a lot. In general, I think, like, LibCST hits the mark on a lot of points when you're writing a code mod. So, generally, I would err on the side of using it rather than not.
[00:48:39] Unknown:
Yeah. And I think I am aware of some cases that cannot be supported by LibCST. Imagine in your IDE, you want to build some code automation based on the draft code the user has. And that means the code is not complete yet. It's not valid Python code, and LibCST cannot parse that code. So if you want to build some very smart automation that can autocomplete the Python code in an IDE, you will need to find a different solution.
[00:49:14] Unknown:
Yeah. Today, that's entirely true, and that might change in the future if I get my way. Speaking of the future, what are some of the things you have planned for the near to medium term of the LibCST project, and maybe also for Fixit if you wanna dig into that? I can talk about LibCST, not so much Fixit, because LibCST is my current focus.
[00:49:32] Unknown:
Obviously, we want to keep up with new language features. Python is evolving quite fast. Like, the once a year major release usually introduces a new language feature, and we do want to be on top of that. So LibCST already supports 3.11, I believe. So that's a given. The second big thing is I want to make sure that the traversal of nodes in LibCST is fast enough so it can support more interactive use cases, with the caveat that Jimmy just mentioned. So, like, even if it's fast, we will still only be able to transform, analyze, and modify valid Python code.
So in fact, that's the next priority item. Now that the rewrite of LibCST's parser is slowly but surely wrapping up, we will have a chance to change the parser so it can accept invalid Python code and then have some boundaries where it says, okay, yeah, this expression is invalid, but I can still parse the rest of the document, and you can still operate on that. So that's something that we might be looking at. No promises. We might be looking at it in the future.
[00:50:35] Unknown:
Are there any other aspects of the libcst project or some of the ways that it's being used both in your own work and in the community and just sort of code modding and concrete syntax trees in general that we didn't discuss yet that you'd like to cover before we close out the show?
[00:50:50] Unknown:
I think as Zsolt mentioned, we still need a lot more tools built around LibCST in the open source world in order to make it easier to use. For the other popular tools like Black or Flake8, I have seen that other people build a lot of tools to make them very easy to use. Like, they have GitHub Actions, so you can easily enable Black or mypy on your GitHub projects, but LibCST is missing those. And at my work at Carta, I actually also try to use LibCST, and we are using GitHub for our code development. So I have been trying to develop things around GitHub and LibCST and Fixit too.
So I have been trying to build a codemod service that automatically creates a GitHub pull request. And, hopefully, in the future, some of my work can make the open source ecosystem more useful.
[00:51:58] Unknown:
You should definitely take a look at 1 of the picks I have picked for later. It's called AutoTransform. And if you remember Codemod Service from internally, that's the open source version. And the guy who wrote Codemod Service writes that, so highly recommend. It is growing Ellipsis integration as we speak.
[00:52:19] Unknown:
Alright. Well, speaking of the picks, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, now I will move us into the picks. So to get us started, I'm gonna pick the Osprey Manta backpack. I got 1 recently, and it's definitely a major upgrade over some of the other packs that I've used. I particularly like the sort of support structure that it has against the back that keeps the bag off of your back, so you get nice airflow and you don't get all sweaty when you're out and about hiking or biking. So I recommend that if you're looking for a pack to wear when you're out and being active. And so with that, I'll pass it to you, Zsolt. Do you have any picks this week? Yeah. But they are unfortunately boring and related to code.
[00:53:01] Unknown:
So AutoTransform is 1 of mine. Just a quick shout out to Nathan, who's working on it. Just Google AutoTransform GitHub, and you'll find it. It's a service that automates your code mods for you. So it's, like, a way to, once you have a code mod that transforms your little source file into the desired state, this service will let you apply that across all your hundreds and thousands of files in an automated way and not lose your sanity. The other pick is, go to glean.software in your browser. It's a cool little database project that allows you to store facts about any kind of source code. And I'll just leave it at that. You can discover it for yourself.
[00:53:41] Unknown:
Alright. How about you, Jimmy? Do you have any picks this week?
[00:53:44] Unknown:
I don't have a specific pick, but I would like to share something that is related to LibCST and code mods. So I think after working on a lot of different projects and in different companies, I have seen that tech debt is a problem that we always face. And it's hard to make the decision whether we should fix that tech debt right now, because there are always more important or urgent product development requirements. I think that's 1 of the motivations: we want to build the automation to fix them. But not all the tech debt can be auto fixed. There's still a lot of more complex tech debt that requires engineers to work on it. So we want to make sure that when engineers proactively work on tech debt, their contribution will be recognized.
So what I have been trying to do in my previous few projects is to analyze the commits to the code base to figure out what kind of contribution the developer makes to solve some of the tech debt. For example, maybe they add missing types, or they clean up some unused code. With an automatic analysis, we can summarize their contributions, and we try to use those summarized contributions to provide numbers as a reference to recognize their work. So that's a philosophy or a strategy I have for large scale tech debt problems.
[00:55:30] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you've been doing on LibCST. It's definitely a very interesting project and 1 that I'm glad to see as part of the ecosystem. So I appreciate all of the time and energy that the both of you and the rest of the contributors to the project have put into it, and I hope you each enjoy the rest of your day.
[00:55:49] Unknown:
Thank you for all the work you put into the Python community at large. I've been an on and off listener to the podcast, by the way. Good job in general. Thank you. Yeah. Thank you for inviting us. I'm so happy to share our experience with more people.
[00:56:06] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Zsolt and Jimmy's Backgrounds
First Encounters with Python
Overview of LibCST
Concrete Syntax Tree vs Abstract Syntax Tree
LibCST's Unique Features
Implementation and Architecture of LibCST
Evolution and Goals of LibCST
Performance and Optimization
Using LibCST for Code Mods
Linting with LibCST and Fixit
Structure and Use Cases of Fixit
Lessons Learned from LibCST
When Not to Use LibCST
Future Plans for LibCST
Closing Remarks