Make Your Code More Readable With The Magic Of Refactoring Using Sourcery

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.

With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host as usual is Tobias Macey. And today, I'm interviewing Nick Thapen and Brendan Maginnis about Sourcery, an advanced refactoring engine that cleans up your code as you work. So, Nick, can you start by introducing yourself?

I'm Nick. So Brendan and I are both sort of technical, so we both have a background in software engineering.

We actually met in our first software job back in 2007 at the university. So I had a background there, working first on this language called IBM RPG, which is actually green screen programming for IBM mainframes, then got into Delphi, then got into Java doing some sort of more enterprise web stuff.

And then I kind of left there in 2013 to do a master's in AI. I got interested in machine learning. And after that, worked at Imperial College in London on quite an interesting project. It was, like, doing Twitter mining and then trying to analyze people's tweets to see if they were talking about symptoms of disease. It was, like, a kind of biosurveillance program, kind of before all this started.

And the main machine learning there was kind of telling the difference between things like Bieber fever and actual fever. It's like trying to determine whether people are actually talking about symptoms of illness or not.

And then that kind of got canned. There's been a lot more interest in it in the past year, certainly. But at that point, there was not so much interest.

And then went to work with Brendan on Sourcery.

And Brendan, how about yourself?

Thank you for having me on the show, and Nick as well. Yeah. So, like Nick says, my first programming position, I met him back in 2007. I did pretty much the same things, RPG, Delphi, Java, and I stuck around for a little bit longer and I introduced Scala to the company.

But after a couple of years of programming, I got really, really into code quality and really obsessed with writing high quality code and in particular taking code that already exists and refactoring it to make it much better and easier to work with. So this all led into this Scala project where I convinced the management that we could rewrite a lot of the system using Scala and start from scratch, and spent about 3 years on that and finally managed to deliver that with a team of people behind me, which was a great success, especially on a personal level.

And then I was coming up to my 10 year anniversary and wanted out. Didn't want to reach 10 years, so I left. Ended up joining Nick for 6 months on his Twitter analytics project before getting really excited about machine learning. And that's when I got into Python and doing deep learning.

And after a while, I was like, okay. Let's apply machine learning to refactoring. And so Sourcery was born.

And going back to you, Nick, do you remember how you first got introduced to Python?

During the master's, we did a bit for machine learning projects, and I thought it was really cool. And then when it came to the Twitter analytics, it was kind of a natural language to use with, like, PyTorch, NLP stuff. So that was where I got my first kind of intro to it back in, I guess, it was 2014 ish. And then coming to Sourcery, we decided to do everything in Python and focus on Python.

And, Brendan, do you remember how you first got introduced to Python?

It was in the project at Imperial College London with Nick, but I was only there very briefly, so it was more of a dabble at that point. It was only at the end of that, where I took some time off just to do my own thing, that I started reading all of these research papers about using reinforcement learning to play Atari and using supervised learning to classify pictures and text.

And so I spent ages just going through all of these research papers and reimplementing them in TensorFlow. And that was how I learned Python, really, just actually as a side product of reimplementing these research papers to try and understand machine learning. And it was during that period that I realized, well, actually Python's a really nice, simple, easy language. I really like it. All of the code's very clean. It's easy to read, and the libraries are really well written. And so we decided to use it for Sourcery, ultimately.

Yeah. Now I have to go back to Java. I'm like

And so before we get too much into Sourcery itself, the term refactoring has come up a few times, and there are a few different ways to think about it and some different framings for it. So for the purposes of this conversation and what you're doing at Sourcery, can you just give a bit of a 30,000 foot view of refactoring and some of the types of refactoring that we're talking about?

Refactoring is restructuring and changing the code without changing what it does. That's kind of the basics of it. So it might be changing how the logic works. It might be moving things around to be in different classes. It might be renaming variables and functions. Keeping all the functionality exactly the same while improving the quality is how we kinda think about it. So, you know, if you have a test beforehand, it should still pass afterwards.

And the user will have no idea that the code has been refactored. Everything will be exactly the same. But the benefit of it is that the code becomes easier to read. And the side effect of that is it becomes easier to maintain that code base going forwards. So it's easier to add new features because everything is simpler. It's easier to track down bugs. And the byproduct of both those things is we can develop code more rapidly. It's always exciting to develop code fast. So morale improves in a team. It's more fun to develop in a nice clean code base.

So refactoring actually leads to this, like, happy flow state of programming where things are a bit easier. Things are in the right place. There's no duplication of code across the code base.
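As a small illustration of "the test should still pass afterwards", here is a sketch with made-up function names (not taken from Sourcery): two versions of the same function and one set of checks that is run against both.

```python
def total_before(prices):
    # Original version: index-based loop with manual accumulation.
    total = 0
    for i in range(len(prices)):
        if prices[i] > 0:
            total = total + prices[i]
    return total


def total_after(prices):
    # Refactored version: same behaviour, written with sum() and a generator.
    return sum(price for price in prices if price > 0)


def check(total):
    # One set of checks, run unchanged against either version.
    assert total([]) == 0
    assert total([1, 2, 3]) == 6
    assert total([5, -2, 0, 4]) == 9


check(total_before)
check(total_after)
```

The checks know nothing about how the function is written, which is what makes them useful as a safety net while the code is being restructured.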

Refactoring has been a practice in software for quite a while now. There are a number of different tools for different refactorings. I'm wondering if you can give a bit of an overview about what it is that you're building at Sourcery and some of the motivation for building a new system for being able to perform these types of refactorings and sort of what was lacking in the ecosystem that Sourcery brings to the table?

We like to think of Sourcery as an automated pair programmer that sits there reading the code that you're writing. And as you're writing it, it understands it and suggests these refactorings to you in real time. So you've got your code open in your IDE. PyCharm or VS Code are the 2 main ones that we support, and you're working on a file of code. And it's understanding those functions in your code. And for each of them, if it finds an improvement, it'll just offer a little highlight that you can hover over and see a description of that refactoring, an English description and also a code diff. And you can apply that refactoring, and your code will be updated in place. And then you can carry on working with your new improved code.

So as well as PyCharm and VS Code, we also have a language server protocol implementation, which allows Sourcery to be used in Vim and Sublime. And Sourcery can also be used as a code review tool. So we have an integration for GitHub, where it scans pull requests and offers suggestions to improve those pull requests. Or we have a command line interface which allows Sourcery to be used as a pre-commit hook or in a standard CI pipeline, doing the code review before a human needs to get involved, making those simple improvements to the code base automatically.

As for the contrast with what's out there, there's various classes of things. So, say, in PyCharm, it helps you do refactorings. You can kind of select a bit of code and say extract method on this. If I know what I wanna do, there are these IDEs that can kind of help me do that and streamline that process.

If you don't know what to do, it's not sort of suggesting those to you. So Sourcery is kind of suggesting fixes to you. You don't have to go in there and be like, I wanna do some refactoring now. It's part of your natural flow. It's suggesting little improvements.

The other class of tools out there is things like linting tools or formatting. And formatting is just sort of improving the formatting. We do that as well. Linting is often sort of picking out little errors, but not telling you how to fix them. So either that becomes just this blizzard of things you ignore, or you kind of keep on top of it, but you still have to manually go in and fix all these things.

So the idea for Sourcery was, you know, it's automatic. Whenever we notice something, we suggest an automatic fix for it. It's kind of speeding you up.

If I think back to when I was learning to program, I didn't know the term refactoring. I didn't have the concept of rewriting code to improve it initially. Only after a year or 2 did I start to understand that this was a possibility. And then I had to go out and learn how to do it. So I read Kent Beck books, Martin Fowler books, went out and manually learned how to do all these things. And I found it fascinating, but it took a lot of time.

And with a tool like Sourcery, it can not only give you these refactoring suggestions, but teach you how to improve your code as you're going as well.

So you don't need to know, oh, actually, I can do this because it will tell you.

You mentioned things like linting and doing things like extract method. And as you said, there are things like linters or there are tools like PyCharm or Rope that allow you to manually say, I want to do this when you know that that's something that you want to do. So Sourcery fits in this category of automated code review, automated assistance kind of tools. And there are a number of other projects that have come out in the past few years to offer similar kinds of approaches. So I'm thinking of things like Kite. I know that there are some other sort of automated review tools and automated sort of code completion where they will scan multiple open source repositories and use some of the code patterns that they find there to suggest structures where they'll hook on different sorts of keywords or logical structures and say, you know, here's a snippet that you might want to use to be able to perform this action that you're trying to achieve. I'm wondering if you can give a bit of a comparison of how Sourcery fits within this broader ecosystem of services that people might use to act as sort of automatic assistants while they're programming and some of the differences in terms of the goals and priorities of what you're building with Sourcery versus what is available with these other systems?

There's a few code completion tools. Like you mentioned, there's Kite.

We're aware of another couple for Python, Codota and TabNine. And all of those help as you're writing the code in the first place. So you're writing a line of code, and they'll complete that line for you. So it's like you're in Gmail, you're writing a sentence, and Gmail suggests the rest of the sentence to you. These tools do exactly the same except with source code.

The key difference between Sourcery and these code completion tools is they help you write the line of code that you're working on. Sourcery understands the code that you've already written and rewrites it to improve the structure and improve the code quality. So as opposed to being during the code writing, it's after you've written the code, and it gives you the high quality code that lets you write code faster in future or detect bugs quicker in future. It improves the readability of the code that you've already written. The code completion tools actually don't care about the quality of the code that you're writing. They just care about what is the most likely rest of the line of code that you're working on.

Yeah. Because I think there's a study that shows you spend maybe 5% of your time actually typing when you're programming and, like, 70% of your time reading and trying to understand it. So we're coming from the point of view of not trying to speed up your typing, but trying to improve the overall readability of your code so you can go and make changes more quickly later. That readability is where we're really focusing.

There are also classes of tools, particularly in Python, but also in other languages,

that serve to add things like constraints to what the function is trying to do. So, like, contract driven programming, and then there's type inference and things like that. And I'm curious how those types of information are able to feed into the refactoring decisions that you're making with Sourcery?

The one that we actually use within the Sourcery code base is mypy, which I personally really love. I became quite fond of strong typing when I was doing Scala back in the day. I wanted to find something similar for Python, and mypy is really, really exceptional.

So in terms of the code out there, the majority of Python code doesn't actually use type hints and type information at the moment. So we tend to make the assumption that most of the code that we analyze won't have that information available to us. So we try to analyze it expecting that not to be there. However, we do do some type inference within Sourcery that helps us do certain types of refactorings.

Yeah. So if your code has the type hints in it, we'll be able to suggest more things than if it doesn't, for example. Adding that typing to our code base internally has definitely caught loads of issues.

Another interesting aspect of what you're building at Sourcery is that you both admitted that Python is one of the languages you came to later in your career, having a much broader background in a number of other languages. And I'm curious why you chose to focus on Python as both the tool that you're using to implement Sourcery, but also the primary language that you're focusing on as far as the refactorings that you're providing and what your motivation is for investing in Python as a language and for building your business on it.

So interestingly enough, it wasn't actually Python which was the first target for Sourcery. It was always written in Python, but the first target was actually Clojure, which is a Lisp for the JVM. I was quite keen on Clojure before coming to Sourcery. And the first version of Sourcery was actually a pure deep learning solution. So it took in Clojure code and it spat out Clojure code at the end, and the goal was to spit out improved source code that was in Clojure.

What I expected was it would actually be quite easy to do this, and I was completely wrong. So with the source code that it spat out, it was very difficult to make sure that it was actually syntactically correct. And then after a while, I managed to get that working, and then I realized actually it might not be semantically correct. It might not actually run. It might not do anything sensible. And then the third thing with refactoring is it actually has to have exactly the same meaning as the source code that is initially passed in. And that turned out to be just completely impossible.

So during this whole period of trying to do it with a pure machine learning approach, I was talking all the time to Nick, and we came up with an entirely different approach that actually doesn't use machine learning at the moment. Because we were gonna be writing this all in Python, we wanted to be able to refactor our own code and dogfood the whole thing. And so that was how we ended up using Python in v2 of Sourcery.

And, actually, it's turned out to be a really excellent choice for a number of reasons. Firstly, Python is a nice simple language to work with. There's fewer constructs to deal with than in other programming languages. So that makes it easier to do all of our analysis and refactoring. And then it's a very, very popular language. I think it's now the 2nd most popular, and it's the fastest growing language out there.

And another thing that's really great is a lot of people starting out on their programming experience choose Python. And probably that's where we give the most value, to people who are learning how to program. They don't know about refactoring. They don't know how to write high quality code. And so when they use a tool like Sourcery, it really levels them up more rapidly, turns them into a good programmer much more quickly.

Taking a step back a little bit in terms of your journey of starting Sourcery, what was it that convinced you that this was a viable business opportunity and that it was something that you wanted to invest your time and money and energy into pursuing?

The first genesis was, you know, the company we worked at had, like, this legacy code base stretching back. It started in the eighties and had been going for 20, 25 years by the time we got there. It's constantly dealing with bugs. The issue list was enormous. It was really hard to get anything done. It was just sort of not a great quality code base. Everyone's a bit sad about it.

So that was the first genesis, you know, there has to be a better way of improving code quality maybe than kind of doing it completely manually. It just takes too long.

And then it was kind of that wave of interest in deep learning and machine learning starting in the 2010s. We got really into that. And then as I did my master's, and Brendan got more into code quality, it was sort of this spark of an idea. Maybe there's a way to turn this machine learning back on the process of writing quality code. Because while code's underlying everything, the process of writing code is still entirely manual. You know, it's still sort of this craft. So it'd be awesome if there were

really great tools to help you.

On the business side, it's only in the last year that we've really started to try and think about building a user base and commercializing the product. And I was honestly incredibly naive about the whole thing. I expected it to be very easy. I thought it'd be like, okay. We'll build a tool and then we'll have a business.

Yeah. Before we build our first prototype and then, yeah, it'd be magically

Yeah. It'd be a successful business overnight. And how wrong we were at that time. Yeah. It's been a real learning experience in the last year of just how little we knew about building a business.

Yeah. Just as with any project you undertake, at the surface level, you think, oh, this is simple. You know, you do this, this, and this, and then it's done. And then you actually start to dig into it, you know, whether it's business or programming, and then you realize, oh, there's a whole can of worms in here that I didn't realize was lurking under the surface.

Yeah. Absolutely.

So many worms. Yeah.

Yeah. I have a tendency to be just extremely optimistic about the outcome of things, which is probably the reason that Sourcery exists in the first place. If I had any understanding of how difficult it would be, then I probably would never have started in the first place. So it's a good thing, actually.

Digging a bit more into the refactorings themselves, you mentioned a couple of things like extracting a method or renaming variables. But I'm wondering if you can just spend a bit more time talking about the types of refactoring that you're able to automate and some of the structures and inputs that you use for identifying opportunities for those refactorings?

The first thing to say is that Sourcery is currently limited to reading an individual function or method, analyzing that, and suggesting refactorings within that. So the scope is very narrow at the moment. In the future, we're gonna scale up to understanding classes and modules and refactoring those, but everything that we suggest at the moment is

at the method level.

The code is converted to an abstract syntax tree using kind of some stuff built on top of Python's ast module. And then we've got a load of patterns we're looking for, basically small atomic changes we can make to the code, each of which will not change the meaning. So things like moving a statement, or combining 2 if statements, or dropping an else with just a pass in it, or changing a for loop into a list comprehension.
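To make the idea of looking for a pattern in the syntax tree concrete, here is a minimal sketch using only the standard library's ast module. It is not Sourcery's implementation; `find_append_loops` and `SOURCE` are hypothetical names for the example. It flags for loops whose whole body is a single `name.append(...)` call, which are candidates for a list comprehension.

```python
import ast

SOURCE = """
def squares(numbers):
    result = []
    for n in numbers:
        result.append(n * n)
    return result
"""


def find_append_loops(source):
    """Return line numbers of for loops whose whole body is `name.append(...)`."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        # Only plain for loops with a single statement and no else branch.
        if not isinstance(node, ast.For) or node.orelse or len(node.body) != 1:
            continue
        stmt = node.body[0]
        if (
            isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Call)
            and isinstance(stmt.value.func, ast.Attribute)
            and stmt.value.func.attr == "append"
            and isinstance(stmt.value.func.value, ast.Name)
        ):
            matches.append(node.lineno)
    return matches


print(find_append_loops(SOURCE))
```

A real tool would need more checks before rewriting anything, for example that the list was created empty just before the loop and is not used in between. This sketch only shows the detection step.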

We then do static analysis on that syntax tree to see which lines of code depend on each other, and so we know kind of which changes are safe to make. And so we chain together individual little safe changes to make a bigger change that we can suggest.

This is kind of guided by a load of code metrics, so there's lots of code metrics out there. There's cyclomatic complexity, cognitive complexity, number of lines, and sort of elements of duplication. So we kind of have an idea of how good the code is according to those metrics when you start, then we try to chain these little changes to kinda get to a better place, try to put out a nicer bit of code based on the structure.
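As a rough illustration of scoring code before and after a change, the sketch below measures maximum nesting depth with the ast module. This is a deliberately simple stand-in, not cyclomatic or cognitive complexity and not the metric Sourcery uses; `max_nesting`, `NESTED`, and `FLAT` are names invented for the example.

```python
import ast

# Node types that open a new level of nesting in this simplified metric.
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)

NESTED = """
def discount(customer, total):
    if customer is not None:
        if customer.is_member:
            if total > 100:
                return 10
    return 0
"""

FLAT = """
def discount(customer, total):
    if customer is not None and customer.is_member and total > 100:
        return 10
    return 0
"""


def max_nesting(node, depth=0):
    """Return the deepest level of nested control-flow blocks under `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting(child, child_depth))
    return deepest


before = max_nesting(ast.parse(NESTED))
after = max_nesting(ast.parse(FLAT))
print(before, after)
```

Combining the three nested if statements lowers the score, so a search over small safe changes could treat that step as an improvement. Note that an elif chain also counts as nested in this crude version.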

The first big example was this refactoring exercise called the Gilded Rose kata, which is like this horrible nested spaghetti code of conditionals. And our very first prototype with Sourcery was trying to sort of straighten that out by making these small changes, because we did it manually first. We kind of manually refactored it, looked at the tiniest changes we could make that would be refactorings, and tried to incorporate those.

Some of the main things we do are untangling complex conditionals, suggesting comprehensions, suggesting using built-in functions like any, all, enumerate, min, max.
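For instance, a loop with a flag variable can often be replaced by the built-in any(). The two functions below are an illustrative before and after with made-up names, not output copied from Sourcery.

```python
def has_negative_before(numbers):
    # Loop with a flag variable and an early break.
    found = False
    for number in numbers:
        if number < 0:
            found = True
            break
    return found


def has_negative_after(numbers):
    # The same check expressed with the built-in any().
    return any(number < 0 for number in numbers)
```

Both versions stop at the first negative number, so the behaviour is the same while the intent is visible in one line.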

We find duplicate code across conditionals. So if there's the same line of code on both sides of an if and else statement, we would kind of hoist it out of that if-else statement.
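A small sketch of that hoisting step, with hypothetical function names: the `log.append(label)` line appears in both branches, so it can move below the conditional without changing what the function does.

```python
def describe_before(temperature, log):
    # The same statement is repeated at the end of both branches.
    if temperature > 30:
        label = "hot"
        log.append(label)
    else:
        label = "mild"
        log.append(label)
    return label


def describe_after(temperature, log):
    # The repeated statement is hoisted out below the if-else.
    if temperature > 30:
        label = "hot"
    else:
        label = "mild"
    log.append(label)
    return label
```

The hoist is only safe because the duplicated line is the last statement in each branch and nothing after it in the branch depends on its position.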

We can move code closer to where it's used. Say, if variables are declared quite far away from where they're used, then we can move them closer to the place that they're used.

And most recently, Nick's been working on method extraction. So if there's the same few lines of code happening in a couple of places within another method, then we'll suggest extracting that out into a new method.
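Here is what that kind of extraction might look like on a toy example. The names (`report_before`, `_slug`, `report_after`) are invented for the sketch and the transformation was done by hand, not by Sourcery.

```python
def report_before(first, second):
    # The same two clean-up lines appear twice, once per argument.
    cleaned_first = first.strip().lower()
    cleaned_first = cleaned_first.replace(" ", "_")
    cleaned_second = second.strip().lower()
    cleaned_second = cleaned_second.replace(" ", "_")
    return f"{cleaned_first}-{cleaned_second}"


def _slug(text):
    # Extracted helper holding the repeated lines.
    cleaned = text.strip().lower()
    return cleaned.replace(" ", "_")


def report_after(first, second):
    # Each repeated block is replaced by a call to the helper.
    return f"{_slug(first)}-{_slug(second)}"
```

The differing parts of the duplicated blocks (here, which string is being cleaned) become the parameters of the extracted function.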

That's really the direction that we're going. So if we can identify duplicate code within a single method, we can start to do it at higher levels of abstraction. So within a class, and then we can do extract method within a class. And that's really where we're going. The class level refactorings are where we're very excited.

I guess the reason that I'm really excited about the class level stuff is, on a personal level, when I'm writing code, I tend to write code that is already well written from the point of view of Sourcery, partly because Nick and I have implemented the whole thing. We understand how to write exactly the sort of code that Sourcery likes. So on a personal level, we don't get that many suggestions.

But when I'm writing at a class level, there's always more to take into account. Is there duplicate code? Can this be extracted out? Sometimes it's not easy to identify that as you're writing the code. Or would this code be better pulled out of this method into another one or moved into a different place?

So as we go higher up the scope of what Sourcery can do, we start to, I think, appeal to more advanced Python developers. So a lot of the people who really like Sourcery are those junior to intermediate developers who are still learning about code quality and how to write high quality code. Obviously, Nick and I consider ourselves pretty good at Python by now, so we don't get as many of those suggestions. And so we find that with other advanced Python developers, they don't get so many suggestions. But if we can start suggesting refactorings at a class level, I think it'd be really powerful.

So as somebody who has been using Python for a while, I like to consider myself as advanced as well, but, you know, there's always a little bit of hubris involved there.

Yeah.

You were mentioning that you're looking largely at sort of the function level. And I'm curious how you handle things like very imperative code where everything is just procedural within a module. There aren't any functions available. What types of refactorings or suggestions are you able to make in that kind of context?

Yeah. So I guess if there's no functions, we kind of almost analyze it as if the entire thing was one function. You know? It's just one big script.

It's definitely more difficult to analyze. You can still sort of see, oh, there's a for loop I can change to a comprehension or whatever. But, say, it's not as easy as when the code is already nicely split into functions for you.

The sort of things that we can do there are around moving code closer to where it's used. And ultimately, as you start to do that, then the code starts to get grouped together.

And another goal that we have in the medium term is to make code self-similar. And what I mean by that is you want the same pattern appearing across the function that you're writing. So if you've got 3 lines of code and then a similar 3 lines of code later on, you want them in the same order so that as you read it, it's easier to understand each time. So as we move towards that level of refactoring, once you've got that, then you can start doing things like, okay. This code is similar here and here. Well, then we can actually do an extract method again.

So the whole thesis of the way that we've designed Sourcery is we're building this library of very small refactorings that compose together to build much larger refactorings. So even though a lot of the refactorings we talk about are quite small, when you combine them together, you can get really powerful things.

And really, if you think about how to do refactoring as a developer, that's the way you do it. You don't just go, like, wholesale rewrite it from scratch. You make a small change, check the tests still pass, make another small change, and keep going in that routine. So we're taking that intuition and trying to do it programmatically.

We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email?

Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all the other details that eat up your time?

Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14 day trial and make your life a lot easier.

Going back to the types of refactorings that I'd be interested in, just looking at some of the code that I've been working on recently, I'm thinking about, okay, my linter is complaining that I've got too many imports from this module, so I want to just rename all of the uses of this class to be namespaced. So rather than saying, you know, bar, I wanna say foo.bar and change my from foo import bar to just import foo and be able to do that all throughout the entire module rather than having to go through and do that manually because it's tedious and time consuming.

Or I've decided that I don't like the naming that I've used for a particular module space, so I want to rename the directory and automatically have all of the imports changed to match across my project. I'm curious sort of what the challenges or complexities are in terms of your ability to be able to create those types of changes within a code base?

So tackling the second one first,

we would find it very hard to suggest something like that because naming is not something that we really want to take control of. It's unlikely that we're going to look at your code base and say, actually, I think this module should be named something else. That's something that's very hard for a machine to do better than a human, I think.

In terms of the first one, if you've got too many imports from the same module, we could easily do something like that.

Yeah. We haven't up to now because we've been focused on the functions, but, yeah, that is part of the file, information that we're reading, and we could suggest things like that.

Yeah. The challenge with all of these things is suggesting changes that people want to make to their code base most of the time. So we don't want to be suggesting things where people are like, okay. I can see that, but, actually, I disagree with it. And they're rejecting the suggestion, like, 50% of the time. That's no good. We want a tool where 95% of the time, whatever we suggest to you, you look at it and go, that's definitely an improvement to my code base. I'm gonna accept it.

So in the example of the imports, it's like, how many imports is the point where you go, right, 10 is too many. I'm gonna start suggesting that I namespace all of these. Okay. Is it 7? Is it 8?
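The import question can be sketched as a small check built on the ast module. This is a hypothetical illustration, since the speakers say Sourcery does not do this yet, and the threshold `MAX_NAMES_PER_MODULE` is an arbitrary value chosen for the example, which is exactly the kind of judgment call being described.

```python
import ast
from collections import Counter

SOURCE = """
from os.path import basename, dirname, exists, join, splitext
from math import sqrt
"""

# Hypothetical cut-off: more names than this from one module gets flagged.
MAX_NAMES_PER_MODULE = 3


def modules_to_namespace(source, limit=MAX_NAMES_PER_MODULE):
    """Return modules from which more than `limit` names are imported."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        # `node.module` is None for relative imports like `from . import x`.
        if isinstance(node, ast.ImportFrom) and node.module:
            counts[node.module] += len(node.names)
    return sorted(module for module, count in counts.items() if count > limit)


print(modules_to_namespace(SOURCE))
```

For a flagged module, the manual change described in the question would be switching to `import os.path` and writing `os.path.join(...)` at each use. Moving the limit up or down changes how often the suggestion fires, and so how often people are likely to accept it.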

So, obviously, we just have to make a call on many of these things. And there's constants throughout our code base where it's like, okay, if there's over 3 lines of code, then we'll suggest to extract a method. But, yeah, it's very much our intuition of what good quality code is, trying to match that up with what we believe other people think is good quality code as well, or most of the time believe is good quality code.

There are a couple of directions I wanna go here. So first, another type of refactoring that I'd be interested in understanding sort of the scope of the problem and how you might try to address it is as you expand to analyze more of the project. So beyond just the individual file level, start to look at the module level or at the project level, being able to suggest, okay, I see that you're using this pattern in 3 different files. How about we extract that out to a helper module that we put in the lib directory? Or you're passing a lot of data around into functions, maybe you should convert this into a class and make this an attribute on the class object. I'm curious sort of what is involved in being able to perform those types of suggestions or if that's something that you feel is in scope for what you're trying to build with Sourcery.

So the first one's definitely in scope because we have a partial solution to it.

So if you're in PyCharm or VS Code, you can right click a directory and say scan for duplicates. And we'll look for any code like that, 3 or more lines of code that is very similar. And then you'll be able to see, okay, in file a, b, and c, I've got this duplicate code.
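A very simplified version of such a scan can be written with a sliding window over the lines of each file. This sketch only finds exact matches after stripping whitespace, whereas the feature described above looks for code that is very similar, so treat it as an illustration of the idea only. `find_duplicate_blocks`, `WINDOW`, and the file contents are made up for the example.

```python
from collections import defaultdict

# Hypothetical minimum block size, matching the "3 or more lines" mentioned.
WINDOW = 3

FILES = {
    "a.py": "x = load()\nx = clean(x)\nx = sort(x)\nsave(x)\n",
    "b.py": "print('start')\nx = load()\nx = clean(x)\nx = sort(x)\n",
}


def find_duplicate_blocks(files, window=WINDOW):
    """Map each repeated run of `window` lines to the (file, line) places it starts."""
    seen = defaultdict(list)
    for name, source in files.items():
        lines = [line.strip() for line in source.splitlines()]
        for start in range(len(lines) - window + 1):
            block = tuple(lines[start:start + window])
            if all(block):  # skip windows that contain blank lines
                seen[block].append((name, start + 1))
    return {block: places for block, places in seen.items() if len(places) > 1}


for block, places in find_duplicate_blocks(FILES).items():
    print(places, block)
```

Like the feature described, this only reports where the duplicates are. Deciding where shared code should live is left to the person reading the report.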

But at that point, we won't suggest

actually

the extracts

function or extract method.

And the challenge there is where should that go? Where does that code go?

And

I think that's an extremely difficult problem.

Even as a professional programmer, I spend a lot of time thinking where is the best place to put this code? Should it go in a utility class?

Should it be part of this

class that I'm working on? Do I need an inheritance tree?

There's various different possible options that you could choose.

And so at the moment, we're just choosing to

identify that problem for you. You can scan for that duplicate code and then you take the next step of

deciding where it goes.

So one additional thing that we do need to do is suggest to the user: actually, this is what the extracted function looks like. You choose where it goes. This is what you need to replace the code that you're extracting with: this function call with these parameters.

Yeah. It's interesting. We hadn't really thought about, like, integrating with other linters, but there are problems around that. So for example, there's this linting error that comes up if you have a return and then an else, because you could drop the else. And so we don't particularly agree with that in our code base, so we've disabled that linting error. But then it turned out that one of our refactorings can lead to a situation in which this linting error will be triggered once you accept the refactoring.
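For readers who have not met this kind of lint rule, the two shapes look like this. Pylint's `no-else-return` check is one rule of this kind; the speakers do not say which linter they mean, and the function names are made up for illustration.

```python
def sign_with_else(n: int) -> str:
    if n < 0:
        return "negative"
    else:  # a "return then else" rule flags this branch as unnecessary
        return "non-negative"


def sign_without_else(n: int) -> str:
    if n < 0:
        return "negative"
    # Same behavior: this line is only reached when the condition was false.
    return "non-negative"
```

Both functions behave identically, which is why whether to flag the first form is a matter of style and why teams disagree about it.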

And we really don't wanna be introducing linting errors into someone's code base. So we wanna be basically taking what's broadly accepted as linting errors and making sure we don't introduce them, making sure we fix them, and also making sure we have configurations so that, you know, it can play nicely with what your linting setup is.

But we hadn't sort of considered, and I'm not sure we would, taking in the linting errors as kind of an input to Sourcery.

I think this is a good opportunity to discuss a bit more about how Sourcery itself is implemented

and just some of the

design considerations

that you take in and maybe

who you

use as sort of your

apocryphal user to understand

how to design the interface, how to design the interactions

of the overall product?

Yeah. So the way we implemented it was

inside out, really.

We didn't care about the user interface at all in the beginning. It was all about can we take some source code and refactor it. So that was the goal all along.

And

as Nick mentioned earlier, we

take

the code and turn it into an abstract syntax tree.

And

we keep a whole load of extra information on there around code formatting.

Because when we output the code at the end, we want it to have the same formatting

that

came in at the beginning.

And then it goes through a whole range of analysis.

So it's things like: which variables does this statement depend on from earlier in the function? And things like: what's the control flow, what's the next statement for each existing statement, how many substatements are within each statement?
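As a rough illustration of the "which variables does this statement depend on" analysis, here is a sketch built on Python's standard `ast` module. Sourcery uses its own AST library, so this is only an analogy: the helper names are invented, and the sketch looks only at top-level statements and ignores control flow, attributes, and augmented assignment.

```python
import ast


def reads_and_writes(stmt):
    """Return the sets of plain names a statement reads and writes."""
    reads, writes = set(), set()
    for node in ast.walk(stmt):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                reads.add(node.id)
            else:  # ast.Store or ast.Del
                writes.add(node.id)
    return reads, writes


def dependencies(source):
    """For each top-level statement index, the earlier statements that last wrote a name it reads."""
    last_writer = {}
    deps = {}
    for index, stmt in enumerate(ast.parse(source).body):
        reads, writes = reads_and_writes(stmt)
        deps[index] = {last_writer[name] for name in reads if name in last_writer}
        for name in writes:
            last_writer[name] = index
    return deps


example = "a = 1\nb = 2\nc = a + b\nprint(c)\n"
deps = dependencies(example)
print(deps)
```

Here statement 2 (`c = a + b`) depends on statements 0 and 1, and statement 3 depends on statement 2. This is a small version of the dependency graph mentioned in the conversation.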

And I guess a lot of the analysis that compilers do is stuff that we've kind of brought in. They sort of build a dependency graph as Brendan was saying.

So once the analysis has happened, we then do the refactoring matching. So we look for specific patterns in the code.

And

if we find those patterns, we then use the analysis to see if we can perform the refactoring.

So you may have a pattern in the code,

but there's a call that changes some state elsewhere in the program. That means that you can't do that refactoring pattern.

So that's where the analysis comes in useful,

ruling out code changes that wouldn't be refactorings.
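One way to picture analysis ruling out a code change is a conservative check before reordering two statements. The sketch below uses the standard `ast` module and invented helper names. It refuses the change whenever either statement might change state elsewhere (any call, attribute write, or subscript write) or when the statements touch each other's variables. It is an assumption about what such a check could look like, not Sourcery's actual rule.

```python
import ast


def may_have_side_effects(stmt):
    """Conservatively treat calls and attribute/subscript writes as possible state changes."""
    for node in ast.walk(stmt):
        if isinstance(node, ast.Call):
            return True
        if isinstance(node, (ast.Attribute, ast.Subscript)) and isinstance(
            node.ctx, (ast.Store, ast.Del)
        ):
            return True
    return False


def _names(stmt, ctx):
    """Plain variable names used in the given context (ast.Load or ast.Store)."""
    return {
        node.id
        for node in ast.walk(stmt)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ctx)
    }


def safe_to_swap(first, second):
    """True only if swapping two adjacent statements cannot change behavior under these rules."""
    a = ast.parse(first).body[0]
    b = ast.parse(second).body[0]
    if may_have_side_effects(a) or may_have_side_effects(b):
        return False
    a_reads, a_writes = _names(a, ast.Load), _names(a, ast.Store)
    b_reads, b_writes = _names(b, ast.Load), _names(b, ast.Store)
    return not (a_writes & (b_reads | b_writes) or b_writes & a_reads)


print(safe_to_swap("x = 1", "y = 2"))        # independent assignments
print(safe_to_swap("x = 1", "y = x + 1"))    # second reads what first writes
print(safe_to_swap("x = load()", "y = 2"))   # a call might change state elsewhere
```

The check errs on the side of saying no. That matches the priority the speakers describe: a refactoring that is skipped costs less than one that changes behavior.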

So we build up a list of possible refactorings at each step, and these go into a search engine that searches for the best possible overall refactoring. So each step has a list of refactorings that can be done. The search engine chooses between them based on the code quality that will be output.
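A minimal sketch of a quality-driven search of this kind follows, including the repeat-until-no-improvement loop. Everything in it is a stand-in: the proposers are toy string rewrites rather than real refactorings, the `quality` score simply prefers shorter code, and the greedy strategy is an assumption, since the conversation does not say how Sourcery's search engine explores the options.

```python
def quality(code):
    """Hypothetical quality score: shorter code scores higher."""
    return -len(code)


def drop_redundant_comparison(code):
    """Toy proposer: suggest removing ' == True'."""
    if " == True" in code:
        yield code.replace(" == True", "", 1)


def collapse_double_not(code):
    """Toy proposer: suggest removing 'not not '."""
    if "not not " in code:
        yield code.replace("not not ", "", 1)


def best_refactoring(code, proposers, score=quality):
    """Greedily apply the best-scoring proposal until nothing improves the score."""
    while True:
        candidates = [new for propose in proposers for new in propose(code)]
        better = [c for c in candidates if score(c) > score(code)]
        if not better:
            return code
        code = max(better, key=score)


result = best_refactoring(
    "if not not ready == True: go()",
    [drop_redundant_comparison, collapse_double_not],
)
print(result)
```

Because each accepted step must strictly improve the score, the loop always terminates.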

Then the new code is run through the whole process again. Step by step, it looks for more refactorings that can improve the code until it comes up with a final output, which gives the best possible code quality score. And then that is suggested as the final refactoring to the user.

I guess in terms of user interface, we kind of went with what the IDEs naturally do. They have this concept of diagnostics. If you take LSP, it has this concept of diagnostics where it finds a problem and underlines it, and there's this concept of quick fixes.
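For context, this is roughly what a diagnostic and a quick fix look like as Language Server Protocol messages, written here as Python dictionaries. The field names follow the LSP specification. The file URI, server name, message text, and replacement code are hypothetical, and this is not a capture of what Sourcery actually sends.

```python
import json

uri = "file:///project/example.py"  # hypothetical file
span = {
    "start": {"line": 3, "character": 4},
    "end": {"line": 6, "character": 24},
}

# A diagnostic marks a range in the file; the editor underlines it.
diagnostic = {
    "range": span,
    "severity": 4,  # 4 is Hint in the LSP DiagnosticSeverity enumeration
    "source": "example-refactoring-server",  # hypothetical server name
    "message": "Convert for loop into list comprehension",
}

# A code action of kind "quickfix" carries the edit that resolves the diagnostic.
quick_fix = {
    "title": "Apply refactoring",
    "kind": "quickfix",
    "diagnostics": [diagnostic],
    "edit": {
        "changes": {
            uri: [{"range": span, "newText": "squares = [n * n for n in numbers]"}]
        }
    },
}

print(json.dumps(quick_fix, indent=2))
```

A client that shows diagnostics but does not support code actions can display the underline and message but has no way to apply the `edit`.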

So that's how we've implemented it. So if we find a suggestion, it kind of underlines it, and then we provide a quick fix. And it's the same in VS Code and PyCharm. It's very similar, and it kind of fits in with what other tools already give you. So it's kind of as expected to a developer.

And digging a bit more into the editor integration,

I'm interested in understanding

sort of how much variation there is in terms of capabilities

with the editors that you're targeting

and how you manage sort of

maintaining

sort of feature parity across the different environments and experiences

and just some of the

challenges or complexities that are involved in trying to work within these editors.

And you mentioned that you have a language server protocol implementation, and just what your views are on sort of the overall benefits of that development to the overall development ecosystem.

So we started with PyCharm, which had its own sort of API, and you write the plugin in Java or Kotlin. So that plugin's kind of a bit thicker. It has to have more in it. And then we wrote an LSP implementation for VS Code.

And, ideally, we'd like to have everything in LSP because it's sort of very nice. It means we've actually been able to roll out a bit of support elsewhere. We've got Vim and Sublime at the moment, just because of plugging in the LSP.

So at the moment, we kind of think of LSP as our main ideal implementation, and then we kind of have to write the extra bit in PyCharm to maintain parity, as you were saying. And, usually, that's fairly possible. Like, it's quite a rich API in PyCharm, but it's a bit of a hassle. But, unfortunately, there isn't really a good way of using the LSP implementation with PyCharm at the moment, it seems.

There are a couple of things that live outside the LSP spec as well. So in both PyCharm and VS Code, if you want to scan a whole folder for refactorings or scan for these duplicate code blocks, then that's outside of LSP. You can't do it. So we had to manually put that in on top of the LSP implementation.

Another thing that's probably gonna be quite painful for us going forward is that LSP doesn't allow you to ask the user for input. So a good example of this is with our extract method. When we extract the method, we just have to generate a name for it. The best thing to do would be, at the point where the user says, yes, I want to do this, to immediately pop up a box asking what do you want to call it, and then apply that. There's no way of doing that in the LSP protocol at the moment.

Obviously, the great thing about LSP is we've got a free Sublime and Vim implementation.

Sadly, not all of the LSP clients are as good as the VS Code one. So for instance, Jupyter Notebook has an LSP implementation, but it doesn't support code actions, which is what we use to actually apply the changes to your code. So I got the Jupyter Notebook stuff almost working. You could see the suggested code change, you could hover over it, but then when you actually went to apply it, there was no way of doing it. You couldn't change the code. So we don't actually have a Jupyter Notebook implementation because of that, which is a shame.

I've been excited to see the development of LSP because as an Emacs user, anytime things like Sourcery or Kite or any of these other new services come about, they say, hey, we support your editor. Except that one, because it's very old and weird and hard to figure out.

But Emacs actually has 2 different, really good LSP clients. So I haven't done it yet, but I'm interested in trying out Sourcery and trying to integrate it with Emacs using the LSP protocol. So I was excited when I saw that as an option.

Yeah. It's probably really easy. So the reason that we've got Vim is because I use Vim. So I obviously had to find an LSP implementation, and it was literally putting 10 lines of configuration into a file and then Sourcery started working. So you can probably just go and look at the Vim implementation in our docs, copy most of that configuration, and put it in some Emacs file somewhere, and you'll be good to go.

Going back a bit to the core implementation of Sourcery, I'm interested too in

some of the

specific patterns

or libraries or tools that you found useful in building it and some of the ways that the overall system has

changed or evolved in terms of the scope and goals that you have for the project?

Yeah. So I guess to start with the last one, it has been this evolution. I guess the first iteration was this ML solution for Clojure, and version 2 was Python working on Python. We actually initially had it as like a cloud service, so it'd send your code to our cloud, and we'd analyze it and send it back. We got very, very quick feedback on it from lots of people that, no, we can't do that. We can't have our code leave. We can't possibly use it. Which is absolutely fair. So we had to completely change it and run everything locally on the user's machine, which is how we currently do it. So the binary runs on your machine, and it just talks to that. That was the biggest change in kind of scope as we went on. Obviously, adding new editors and adding a GitHub integration was also part of it.

Don't know if you wanna talk about patterns and libraries, Brendan.

In terms of how we initially implemented it, so we talked about using an AST. So we used the astroid library for the core of Sourcery, and that's the library that's used inside Pylint. So it's very well used, and it's got loads of functionality. And we made use of it for probably a year, year and a half.

I guess we were using rope as well, weren't we?

We did make use of rope as well. Both of them turned out to be limited in different ways. So the key problem with astroid was that it didn't record any of the formatting of the code. So when we output code, it would look different to the way it came in. And, additionally, it didn't record any of the comments in the code either. So on all refactorings, all of the comments in your code would just suddenly disappear, which is obviously not a good user interface at all.
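The same effect is easy to reproduce with Python's standard `ast` module, used here as a stand-in for astroid. Parsing keeps only the structure of the code, so printing it back out (with `ast.unparse`, available from Python 3.9) loses the comments and the author's spacing. The sample function is made up for illustration.

```python
import ast

source = (
    "def total(xs):\n"
    "    # sum the values\n"
    "    return sum( xs )   # spacing and comments are the author's\n"
)

# Parse to a syntax tree, then print the tree back out as source code.
round_tripped = ast.unparse(ast.parse(source))
print(round_tripped)
```

The printed function still behaves the same, but both comments are gone and `sum( xs )` has been normalized to `sum(xs)`. That is the loss that motivated an AST which also records formatting.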

So I ended up writing a new AST library from scratch with those specific requirements in mind. And to be honest, it was only from using astroid that we could get Sourcery off the ground. And then you learn, oh, there's all of these additional requirements that we have.

In terms of rope, we found that some of the refactorings that we wanted to reuse from it were not actually correct. So it would misunderstand certain things around keyword arguments and things like this that led it to breaking the code base. And really, that's one of our number one requirements for Sourcery: more than people accepting the refactoring is that it's actually a refactoring. The suggestion that we make does not change the functionality of your code, because we want people to be able to trust it. When you first start using Sourcery, you're obviously gonna review every code change and understand what the suggestion is. But ultimately, you wanna be able to say, I know that Sourcery is correct. I trust it. I'm going to accept this refactoring.

So we had to throw out rope. And so even though we didn't go in with the not invented here mentality, almost everything now is invented here.

In terms of the actual

design

of the refactoring engine, as we call it,

it's very much

built on

the intuition that I gained from working in Scala.

So it's very, very functional.

There's almost no state throughout the whole thing.

It's very, very easy to test.

Everything is nicely separated. We have a whole module that's full of analysis. We have a whole module that's full of refactorings. We have a whole module that's full of understanding the code, turning it into an AST, and printing it out again.

Everything's very, very nice and clean.

And in terms of the refactorings themselves, we use dependency injection to get the analysis that you want. So for a refactoring, we just have: this is a refactoring proposer. And then below that, you just list out, as class-level variables, I want these 3 pieces of analysis, and then they'll be available for you when you run that refactoring, which makes it very, very easy to write new refactorings. You don't have to think about imperatively getting this piece of analysis from somewhere else, or what's the ordering that I do these analyses in. We build a tree of all the analysis that needs to be run, and we run it in the optimal order and then inject it into the refactoring proposers, which then output the proposals that are fed into the search engine.
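Since this is hard to follow in words, here is a small sketch of the general shape: proposers declare the analyses they need as class-level attributes, the analyses declare their own prerequisites, and a runner orders everything and hands the results over. All class and function names are invented, and the ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) as an assumption. The conversation does not describe Sourcery's internals in this much detail.

```python
from graphlib import TopologicalSorter


class LineCount:
    """Hypothetical analysis with no prerequisites."""
    requires = ()

    def run(self, code, results):
        return len(code.splitlines())


class IsLong:
    """Hypothetical analysis that depends on LineCount."""
    requires = (LineCount,)

    def run(self, code, results):
        return results[LineCount] > 3


class SuggestExtractMethod:
    """Hypothetical proposer: it only declares what it needs."""
    analyses = (IsLong,)

    def propose(self, code, results):
        return ["extract method"] if results[IsLong] else []


def run_proposers(code, proposers):
    # Collect every analysis that is needed, directly or indirectly.
    graph = {}
    pending = [analysis for proposer in proposers for analysis in proposer.analyses]
    while pending:
        analysis = pending.pop()
        if analysis not in graph:
            graph[analysis] = set(analysis.requires)
            pending.extend(analysis.requires)
    # Run each analysis once, prerequisites first, then inject the results.
    results = {}
    for analysis in TopologicalSorter(graph).static_order():
        results[analysis] = analysis().run(code, results)
    return [p for proposer in proposers for p in proposer().propose(code, results)]


print(run_proposers("a\nb\nc\nd\n", [SuggestExtractMethod]))
print(run_proposers("a\n", [SuggestExtractMethod]))
```

The proposer never asks for `LineCount` and never decides when anything runs. That is the convenience being described: a new refactoring only has to name the analyses it wants.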

That was probably extremely complicated and needs a diagram. It's difficult to explain these things in words.

Yeah. The joys of doing a podcast about software.

Yeah. Exactly.

And so I'm definitely interested in sort of what the opportunity is for being able to integrate some of the rest of the ecosystem of sort of code quality and developer tools in ways that can

feed into Sourcery to inform or augment the types of refactorings

that you're offering or sort of ways to offer customizations

to people to be able to

specify their preferred code styles. So, you know, I prefer methods named using these patterns, or I prefer to

refactor

using this particular

structure of conditionals or whatever that might be, and sort of how that factors into your

opinionated approach of these are the refactorings that everybody's going to want versus, you know, these are refactorings that are useful, but not everybody wants to have and just being able to set up those toggles of, I want these types of suggestions. I don't want these types of suggestions and maybe use these linting or type inference tools to be able to trigger or inform the types of refactorings that I want to perform.

At the moment, we have gone with this kind of opinionated approach. We have talked about could we integrate Black, could we integrate other formatting and, like, get a preferred formatting? A lot of it is, like, the interfaces for these things are not super clear. Like, we briefly looked into, could we integrate mypy to use the power of their type inference, and we thought that seemed too difficult at the time. So maybe that is something we will explore in future. We have, at the moment, got the ability to switch refactorings on and off in a kind of a configuration file. So, yeah, I think that's probably something we need to make more powerful in future, perhaps when we come to look at things like more of a team offering, so maybe a team can set up how they want their code base to be structured in a way.

Yeah. I think that, as you mentioned, the interfaces

to all these different tools and the ways that they approach their analysis

is still too

bespoke across them. So there are some sort of common patterns, but there's no one interface to say, I want you to be able to feed me all of the information that you're feeding to the end user.

And, you know, maybe that's an opportunity for something like the LSP to act as the focal point that everybody can coalesce on. This is the interface. This is how we all interoperate together.

Yeah. If there was something like that, that'd be awesome.

Yeah. Definitely. Seems like as an industry, there's a lot of

movement happening where interfaces and APIs and patterns are coalescing in sort of different subcommunities, so things like NumPy being standardized as an API for other libraries to use,

you know, data streaming

ecosystem. They're standardizing around sort of the open streaming format,

you know, in the data lineage ecosystem.

There's the open lineage spec that's happening. And so a lot of these different tools and communities are saying, okay. You know, we have all these different ways of doing things, but we don't have any way for being able to interoperate without a lot of manual coding and intervention. So I'd be interested to see

what happens in the code quality and developer tooling space to see if there's an opportunity to coalesce on: this is the way that we all interoperate. You can all do your own things, but this is the API so that we can all be able to build on top of each other rather than sort of working in our own little corners.

That would be really fantastic if you could just go to mypy, pass in a node in the abstract syntax tree and say, tell me the type information about this. Or another tool could hook into Sourcery and ask, tell me what other nodes in the abstract syntax tree this depends on. It would be really fantastic.

Yeah. I can see how that would be excellent. It's just, at the moment, it's sort of like parsing the output file and, like, yeah. There's a long road to that point, but perhaps we can get there as an industry.

Yeah. I mean, part of the issue is everyone's got their own abstract syntax tree. And for this particular example,

in a lot of cases, it's very much tightly coupled all the way from

the user

interface all down to the deep workings. And so actually splitting that up is a choice that you have to make.

In our case, it's relatively loosely coupled, but only because I had to rip out astroid and put in our own one. When you've gotta do that, you've gotta simplify the interfaces all across the board.

Yeah. It definitely seems like there's an opportunity for a similar movement to what happened with the Sans-I/O approach to network protocols, maybe being able to sort of abstract out the AST parsing and then adding a way for being able to hook in and add your own additional metadata that you can pass around, and just sort of removing the sort of AST protocol implementation from the business logic that you actually care about.

Yeah. That would be brilliant.

Well, in terms of your overall experience of building Sourcery and creating this technology and turning it into a business and trying to gain users and grow adoption, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I think from an implementation point of view, the code formatting has been remarkably hard. It's not something that I ever expected to be an issue at all. I expected just to be able to pass in code, refactor it, and pass it out. But as soon as you start to think about it, people are not gonna want that. If I suggest a refactoring to your code base that you then have to go and manually reformat to match the rest of the code around it, that's just not gonna work.

And it's very, very hard to solve that problem. There are some other libraries out there that have come out recently that try and do a similar sort of thing, and there's always new bugs that come up with it. Like, a very, very simple example of it is: do I indent with 2 spaces or do I indent with 4 spaces? That's, like, the absolute basics, but you have to get that right. You have to indent the same.
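Even the "absolute basics" case takes some care. As a deliberately naive sketch, a tool could guess the indentation width from the smallest run of leading spaces in the file. The function name and heuristic are assumptions for illustration only: it ignores tabs and would be fooled by continuation lines, which is part of why the problem is harder than it looks.

```python
def detect_indent(source, default=4):
    """Guess the indent width as the smallest leading-space run on any indented code line."""
    widths = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        # Skip blank lines, comment-only lines, and lines with no leading spaces.
        if stripped and not stripped.startswith("#") and stripped != line:
            widths.append(len(line) - len(stripped))
    return min(widths) if widths else default


two_space = "def f(x):\n  if x:\n    return 1\n  return 0\n"
four_space = "def f(x):\n    if x:\n        return 1\n    return 0\n"
print(detect_indent(two_space), detect_indent(four_space))
```

Generated code would then be written with the detected width so that it matches the lines around it.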

I guess one of our big sort of pivots we had to do was, we thought, oh, we'll do it in the cloud. That'll be easy. That makes sense. Then people were, like, really concerned about code privacy. So we had to pivot there and do everything on the user's machine. And I guess an interesting thing is, like, the amount of work we've had to do to make sure that our refactorings are actual refactorings. It's sort of more than half of our tests are, like, don't do the refactoring in this case, in this case, in this case, in all these edge cases. And we also actually run Sourcery over a load of open source libraries and check their tests all pass afterwards. That's kind of our backstop, which works pretty well. Our analysis section of our code base has sort of had to grow and grow and grow and grow and grow.

One thing that we haven't really talked about yet is sort of what's involved in somebody actually getting started with Sourcery, you know, setting it up, and then also what is involved in using it in a team format and some of the benefits that it might provide in that context?

Yeah. So you can just go to our website and you can sign up. Then you go to your editor of choice, and the 2 easiest ones are PyCharm and VS Code, and you go into their marketplace, search for Sourcery and install the plugin. And then you go back to the website, copy a token, and paste it into the Sourcery configuration screen in your plugin, and then it's going. It'll start analyzing your code immediately and suggesting refactorings to you. And actually, just before that step, it shows a little demonstration install file that teaches you how Sourcery actually works.

In a team environment, and also if you're just using GitHub on your own, you can install the GitHub bot and that's even easier. You just find Sourcery in the GitHub marketplace, you click install, and then you choose which of your repos you want to add it to. And you just add it to whichever Python repos that you like. And then you do a pull request, and Sourcery will analyze it for you and give you the feedback on that.

Plus, if you just star our repo, it'll still find your most popular Python repo and refactor the entirety, if you just wanna give it a super, super quick test in about 10 seconds.

The other option for teams is, if you don't have GitHub and you have another CI tool, use our command line interface, which you can install through PyPI. You just run 2 commands in the command line interface, which are all fully documented, and it will scan the files that have changed in the commit and output the suggestions, and it can also fail the build if you choose to do it like that. And you can do that as a pre-commit hook as well. There's a pre-commit hook using the pre-commit library that you can install in 3 lines of configuration.

And for people who are interested

in being able to

simplify some of their refactorings

and streamline some of their development,

What are the cases where Sourcery is the wrong choice?

It won't fix bugs. That's the first thing, you know. We try to leave the code doing exactly the same as it did when it started, so that includes any bugs. And I guess we've gone really hard on readability, so all our refactorings will try and improve the readability. Some of them may improve performance, but it's just kind of a side effect. So if you're really trying to optimize for performance, Sourcery is probably the wrong choice. And, certainly, we only cover some IDEs, as we mentioned. I guess Jupyter Notebooks is the big one we want to cover in the future, but don't cover now, for the data scientists. And, also, I guess, this kind of big thing of, if you're moving code between classes and restructuring your code base on that higher level, we're definitely not there yet. At the moment, it's this sort of lower level structure, readability of the code.

And as you continue to build the product and grow the business,

what are some of the plans that you have for the near to medium term future of the product?

The big one is kind of analysis at the class level, which I think we've alluded to, as in being able to scan your whole class, your whole module, find duplicated code, and automatically extract that for you. Because a lot of what you're doing when you're manually refactoring is this process of sort of splitting methods up, finding and removing duplication. That's the next really exciting thing we're planning on doing, as well as incorporating some machine learning for doing things like automatic function naming, or automatic variable naming when extracting variables.

I think, more longer term, I'm like a complete convert to functional programming style. So, ultimately, I would like Sourcery to push you towards writing functional style code. And, really, that's actually a design decision up front. But it's possible to write a class so that the majority of the code within a class is functional and just the interface code, the code in the external interface, contains side-effecting code.
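A small sketch of that class design: the work is done by pure functions, and a single interface method performs the reading and writing. The class and its methods are invented for illustration and are not taken from Sourcery.

```python
import io


class WordCounter:
    """Hypothetical example: pure helpers plus one thin side-effecting method."""

    @staticmethod
    def count_words(text):
        # Pure: the result depends only on the argument.
        counts = {}
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    @staticmethod
    def format_report(counts):
        # Pure: builds a string, touches nothing else.
        return "\n".join(f"{word}: {n}" for word, n in sorted(counts.items()))

    def run(self, reader, writer):
        # The only side-effecting code: read input, write output.
        writer.write(self.format_report(self.count_words(reader.read())))


out = io.StringIO()
WordCounter().run(io.StringIO("b a b"), out)
print(out.getvalue())
```

The two pure helpers can be tested, reordered, or inlined without worrying about hidden state.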

So ultimately, I want to push Sourcery to be able to do that. And one of the benefits of that for us is that functional code is much easier to analyze. It's much easier to refactor and change the ordering of statements and things like that. So if we can help people write functional code, then we can actually do more refactorings as well. So it's a virtuous cycle. It's actually gonna take us a while to get to that point, but that's, like, the goal. So we wanna bring functional programming to Python.

For anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the movie The Croods: A New Age. Watched that with my family a couple weeks ago, and it was absolutely hilarious. It was probably the most I've laughed at a movie in recent memory. So definitely recommend watching that regardless of whether you have kids. So with that, I'll pass it to you, Nick. Do you have any picks this week?

Me and my wife have been kind of binging on a lot of series recently, and one we really, really enjoyed was The Magicians. So I think here it's on Amazon Prime. It's kind of a combination of Buffy the Vampire Slayer and Narnia. It was really funny and absorbing, and it kept us busy for quite some time.

And, Brendan, do you have any picks this week?

Yeah. So I've just finished reading David Copperfield by Charles Dickens,

and I've never read any books more than 100 years old before. The reason I got into it is because I'm a massive fan of Armando Iannucci, who has made some hilarious programs. Particularly if you're British, you'll know of the Alan Partridge show, The Day Today, Brass Eye. If you're American, you probably have heard of Veep. And he made a version of David Copperfield a couple of years ago. And I decided before I was gonna watch it, I was gonna read the book. And it's an absolutely amazing book. So well written, so emotionally involving that I actually had to stop reading at times because I couldn't cope with it. The really great thing about it is it talks about London from 150 years ago. So I know what London is like now, but you get to imagine London in the past and just the way things have changed. But the way things are still the same is really amazing. I really loved it.

Well, thank you both for taking the time today to join me and share the work that you're doing at Sourcery. It's definitely a very interesting project and one that I'll have to check out for myself, to see what kinds of fix-ups it can offer for my code. So I appreciate all the time and energy you're putting into helping make developers more productive and code cleaner and easier to maintain. So thank you for that, and I hope you enjoy the rest of your day.

Thanks very much. It's been great to be on.

Yeah. Thank you for having us. Been brilliant.

Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at data engineering podcast dot com for the latest on modern data management.

And visit the site of pythonpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes.

And if you've learned something or tried out a project from the show, then tell us about it. Email host at podcastinit.com

with your story.

To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

The Python Podcast.init

Summary

Announcements

Interview

Keep In Touch

Picks

Closing Announcements

Links
