Summary
The CPython implementation has grown and evolved significantly over the past ~25 years. In that time there have been many other projects to create compatible runtimes for your Python code. One of the challenges for these other projects is the lack of a fully documented specification of how and why everything works the way that it does. In the most recent Python language summit Mark Shannon proposed implementing a formal specification for CPython, and in this episode he shares his reasoning for why that would be helpful and what is involved in making it a reality.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to pythonpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s pythonpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
- Your host as usual is Tobias Macey and today I’m interviewing Mark Shannon about his efforts to create a formal specification for the CPython interpreter.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing the current state of how the Python language and the CPython runtime are defined?
- What is your motivation in advocating for a specification?
- After ~25 years of the language, why is now the time to pursue this effort?
- How does the history of the language and the scope of the ecosystem and community impact the effort required to make this a reality?
- What is involved in creating the specification and where would it be located once complete?
- What are some examples of languages that are formally specified?
- What are the possible benefits of creating a specification for the CPython virtual machine?
- What is the distinction between a specification for the VM as opposed to a specification for the language?
- What are some potential downsides to having a (semi-)formal specification become part of the definition of the interpreter?
- Can you describe the process of doing the work to create the specification?
- How are you approaching the actual definition of the specification (e.g. prose vs programmatic)?
- What are the tradeoffs of prose vs. an executable specification (e.g. TLA+, Alloy)?
- How does this work tie into your goals of improving the speed of the CPython interpreter?
- What are some of the most interesting, unexpected, or challenging aspects of your efforts to bring this specification to CPython?
- How can the community contribute to this effort?
Keep In Touch
- markshannon on GitHub
- Website
Picks
- Tobias
- Mark
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- CPython
- PyPy
- PEP 380 yield from
- Language Summit
- RustPython
- Jython
- C++
- ML programming language
- Java
- Python Formal Semantics git repository
- CPython PEG Parser Episode with Pablo Galindo and Lysandros Nikolaou
- IETF RFCs
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's l i n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Mark Shannon about his efforts to create a formal specification for the CPython interpreter. So Mark, can you start by introducing yourself?
[00:01:05] Unknown:
Yeah. I'm Mark Shannon, a Python core developer. I've been using Python since 2005-ish. I started using it when I was doing my master's degree on building a C compiler for a stack machine. And it's one of those things where you just need to do little tasks like generate tables of data and all sorts of things. And, you know, you just find that C or Java is just a pain. And something like PHP is just even worse. And then I just came across Python, and it was just perfect. So I've been using it ever since.
[00:01:37] Unknown:
And so can you start by describing the current state of how the Python language and the CPython runtime are defined?
[00:01:44] Unknown:
If you're the PyPy people, effectively, you have to just assume that CPython is the definition of the language and just try and be bug-for-bug compatible. There is a sort of specification for users. So there's the object model, how the syntax works, etcetera, which is very useful if you're learning the language. Well, probably not great for initially learning the language, but once you've learned it, it's useful to look stuff up and ask, what's this supposed to do? But a detailed specification for continued CPython development, and for development of other potential virtual machines like PyPy, is sort of lacking.
[00:02:22] Unknown:
And so you mentioned how CPython is the reference implementation, and everything has to define itself in terms of whatever CPython happens to be doing. And what that is can change from version to version. And so I'm curious if you can discuss what the motivation is for actually trying to advocate for a more formalized specification of the language, and whether the specification is then tied to CPython itself or if it is a body apart from that, and all implementations of Python should be trying to adhere to that?
[00:03:04] Unknown:
I think initially it would just be for CPython, but there's no reason why we can't differentiate what's sort of CPython specific and what's more generally just Python. The motivation really is this sort of evolution of the language and the runtime. If we talk about changing things, it's, you know, just looking at patches of C code or sort of informal discussions about what we're gonna do, how the language will change if you add a new feature. It's very hard to see the sort of odd corner cases or to work out the sort of long term ramifications of things. So I'm gonna use PEP 380 as an example. So PEP 380, the title is "Syntax for Delegating to a Subgenerator", but you probably know it as just yield from; it's just the yield from keyword.
Now that's a nice little feature. It's pretty well defined in the PEP. It's got a big long chunk of code showing sort of equivalent behavior. But nonetheless that produced quite a few odd corner case bugs, and in fact it has taken quite a long time to knock out all these obscure little bugs to do with handling throw and various other little corner cases. Most of which almost no one would notice, but they sort of just crop up occasionally. And those sort of bugs, I don't know if we would have got rid of them if we'd had a more formal spec, but I think it's likely that we would have seen those upfront.
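To make the yield from discussion concrete, here is a minimal sketch of the kind of delegation PEP 380 describes, including the throw forwarding that Mark mentions as a source of corner cases. The names `Cancelled`, `inner`, and `outer` are hypothetical and exist only for this illustration; the sketch shows the ordinary behavior, not any of the historical bugs.

```python
class Cancelled(Exception):
    """Hypothetical exception used only for this illustration."""


def inner():
    try:
        yield 1
        yield 2
    except Cancelled:
        # throw() called on the outer generator is forwarded here by `yield from`.
        yield "inner handled Cancelled"
    return "inner result"


def outer():
    # `yield from` forwards next(), send() and throw() to inner(),
    # and evaluates to the value inner() returns.
    result = yield from inner()
    yield f"outer got: {result}"


gen = outer()
first = next(gen)                 # value produced by inner()
handled = gen.throw(Cancelled())  # exception is delivered inside inner()
last = next(gen)                  # inner() returns, outer() resumes
print(first, handled, last, sep="\n")
```

Even in this small case there are three generators' worth of state to reason about (running, suspended, finished), which gives a sense of why the interactions are easy to get subtly wrong.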
[00:04:24] Unknown:
Python itself has been around for on the order of 25 years at this point. I'm curious why you see now as being a good time to pursue the effort of formalizing the language and the runtime?
[00:04:37] Unknown:
So I think it's basically now is a good time because I can't do it 20 years ago or 10 years ago. It's the old saying, what's the best time to plant a tree? You know, it's 20 years ago, but if you can't do that, do it now. So I think it's a long term useful thing, and would be nice if we'd had it 5, 10 years ago, but we didn't. So let's do it now.
[00:04:56] Unknown:
And so because the language does have all of this history accrued at this point, and there is this entire ecosystem that has grown up around it in terms of libraries and packages and other implementations of the runtime. I'm curious how that overall scope and weight impacts the overall effort required to actually bring a specification to fruition?
[00:05:19] Unknown:
If it were to be a complete, pretty formal specification, that would make it a huge job. But I still think a specification of, like, the core language itself would be useful, even if we're a bit fuzzy about some of the interactions with the C API, or things at that sort of level are vague. And the other thing is it doesn't have to... I mean, it can grow. If the spec's kind of almost there for, say, a new language feature or interaction with some important library, then when specifying changes to those things, it might be useful to add those things to the specification as part of that. The nature of open source stuff is, you know, if there's something there that's almost there, people are motivated to sort of push it a little bit further to get what they want. So somebody has to, like, do enough of the work to get things going.
[00:06:09] Unknown:
You brought up the topic of the specification during the language summit in the lead up to what was supposed to be PyCon this year. I'm wondering if you can summarize some of the reactions of the other folks who are engaged in that conversation.
[00:06:23] Unknown:
Yeah, I think general skepticism. I think people need convincing there's any purpose in it. A few people sort of saw value in it and were interested in helping. It's not obvious what the value is. I think it's a useful way to talk about changes to the implementation: is the implementation doing the right thing? If you're not dealing with actually implementing Python, it's probably not of that much value.
[00:06:50] Unknown:
And you mentioned, for instance, the folks who are building PyPy and the fact that they have to try and be both feature and bug compatible with CPython to ensure that the programs people write to run on this other execution engine will function as intended. And I'm wondering if you can just talk through some of the potential benefits of what the specification might provide for people who are working on the CPython virtual machine and people who are working on other implementations such as PyPy or RustPython or Jython.
[00:07:22] Unknown:
Developing PyPy has found a lot of bugs in CPython, where you experiment with, well, what happens if you do this? And the CPython behavior is somewhat odd. Sometimes we kinda have to specify that sort of behavior for historical reasons. Other times it's been changed and got fixed. So I think it's then about having a model to sort of say, well, this is how it should work. So, you know, here's what we thought the model was and here's what CPython does. Is the model wrong or is CPython wrong? As opposed to, you know, just having a bunch of test cases and saying this is what it should do. Obviously, the test cases should correspond to the model, which raises an interesting question: how do you actually verify that a model matches the tests? But I think we're gonna come on to that later.
[00:08:08] Unknown:
In terms of the actual act of specifying a language, I'm wondering if you can give some context of other ecosystems and other run times that do have a more formalized specification and what you see as being the overall impact in the direction that that language has taken or the value that it provides to those ecosystems?
[00:08:29] Unknown:
Python itself does have a reasonable sort of specification. It's not really a semantics, though, in the sense of operational or denotational semantics. So the idea is to have a model of, like, an abstract machine upon which the language runs. The idea is you specify the translation to that machine and the operation of the machine; that's sort of operational semantics. Java has a virtual machine specification and a language specification, which isn't quite an abstract machine specification. It's more detailed in parts about how it actually operates in the physical world. But for C++ there is an abstract machine. Your C++ program will be translated to machine code, obviously, but it's as if it were translated onto this abstract machine and running on that. C++ is an extremely complicated language, so how well it maps is another matter.
ML, which is not machine learning but ML the programming language from the 1970s, did actually have a full formal specification. But I'm not sure how much value that added, because I think one of the things about formal specifications is you can determine properties of a system from them. So they're very useful if you want to specify some sort of state machine model or something, and you want to say things about the properties of your system, like it won't deadlock or something like that. Whereas I think for a language specification, that's not really that useful. What you really want is just, is my implementation doing the right thing? Do we even agree on what it's supposed to do?
So I think a slightly less formal specification is what we want there. So, other languages: I've covered ML, Java, and C++. No doubt there are others, but I couldn't tell you off the top of my head.
[00:10:14] Unknown:
One of the things that you mentioned there is the intent of a particular operation, and that's always one of the hardest things to be able to understand as somebody coming into a code base who wasn't there at its creation, because that information can easily be lost. It just lives in somebody's head for the duration of them actually working on it, and they themselves will likely forget about it if they try to come back to the same piece of code. And so given that, I'm wondering what your thoughts are on the viability of being able to try to reconstruct what the intent is of various behaviors in CPython, given that the people who implemented it either might not be part of the community anymore, or they might just not even remember what the purpose of that was at the time of creation.
[00:11:00] Unknown:
I guess it's kinda hard to know what people were thinking when they wrote code. You only have the code to go by. But it's reasonable to assume that any weird peculiar corner cases, unless they're documented specifically, are just likely to be errors rather than deliberate choices if they conflict with the sort of general idea of things. And there's also the problem, of course, that code is written with two things in mind: to do the right thing, but also, because it's a programming language, to do it as quickly as possible. So those two aims, clarity and precision on one side and performance on the other, are often in conflict. And often these corner cases are where things have been made faster and they're made faster incorrectly.
This can happen fairly easily, unfortunately, especially because it's all written in C. So having a specification where you say, right, this piece of syntax translates into this sequence of operations and these operations individually do these things, with no interest in performance whatsoever, we just don't care. So we can make that entirely based around clarity. And I think that helps a lot. It's much easier to describe things by saying, this pushes this thing onto a stack. Everyone knows what a stack is. But if it's written in C, you have to worry about buffer overflows, pointer arithmetic, and all sorts of stuff. And the general intent often gets lost in the details.
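As a rough sketch of what "described purely for clarity" could look like, the two operations below are written with no regard for speed. The operation names `push` and `binary_add` are hypothetical, chosen for this illustration, and are not taken from Mark's repository or from CPython's bytecode.

```python
def push(stack, value):
    """PUSH: place a value on top of the stack."""
    stack.append(value)


def binary_add(stack):
    """BINARY_ADD: pop the right operand, then the left, and push their sum."""
    right = stack.pop()
    left = stack.pop()
    stack.append(left + right)


stack = []
push(stack, 2)
push(stack, 3)
binary_add(stack)
print(stack)
```

Nothing here mentions buffer sizes, pointers, or reference counts, so the intent of each operation is the only thing on the page.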
[00:12:21] Unknown:
That also brings up the point of the specification being tied to the virtual machine itself of CPython versus the language, which is defined by the syntax for which you can use the grammar that's embedded in CPython to understand what is the actual allowable structure, but that, again, doesn't necessarily convey the appropriate semantics. So I'm wondering if you can maybe just draw the line between creating a specification for the CPython VM versus creating the specification for Python the language.
[00:12:53] Unknown:
The idea here is that we have an abstract machine rather than a virtual machine. So a virtual machine, despite the name, is a real machine; you can actually run code on it. I mean, the CPython virtual machine, the Java virtual machine, whatever virtual machine you choose will run programs, it will produce real output, it'll heat your computer up, and hopefully do something useful while doing so. Whereas an abstract machine is really just a pen and paper exercise. You'd probably describe it in the way you would describe a program, in terms of data structures and how things change those data structures.
But they don't need to be restricted by practical or finite machines. You don't have to worry about what happens if someone keeps multiplying integers together until they run out of memory. You don't have to worry about whether the stack overflows; you don't have to worry about any of that sort of stuff. There's nothing stopping you using algorithms that potentially wouldn't terminate or are crazily inefficient, as long as the meaning is clear, because their actual run time isn't really an issue at all. So this is the idea of abstract versus virtual machine. Once you have an abstract machine, it's a much easier thing to define a language on because you're not having to define all these little details that are largely irrelevant. You can just say things like: there is as much memory as you need, and it's automatically managed. That's all you need to say. You don't say anything about how it works, any timeliness guarantees, or any of these things that you might need to worry about for a real machine.
[00:14:25] Unknown:
In terms of the actual effort to define the specification, I'm wondering if you can just outline the approach to actually committing that to paper and writing it down and what's involved in identifying the areas that need to be specified. And then once it is created, where that specification might live to ensure that it is accessible to people at the time that they need it.
[00:14:51] Unknown:
I'm currently working on a Git repo. I'll put a link in the notes. What you wanna do is say, okay, here's a Python program, what will it do? And I'll be able to describe to you, using various elements of these semantics, how it would work. There are various stages in translating a Python program into some sort of machine description, whether it's an abstract machine or a virtual machine. First we have to tokenize it, parse it, and so on. The specification of that I think is pretty well done, because the grammar itself is a fairly formal specification of the translation from source code to AST. So picking up from the AST, if we can describe the translation from AST into a series of abstract machine operations and then describe the abstract machine operations, we have, at that point, a more or less end-to-end description of how Python source code runs and what it does.
So there are basically two parts to that. There's the translation, how the AST maps to a sequence of operations, and what those operations do. So I guess there are three parts, because there's what the operations do and what the operations operate on. There's a machine model, which says there are some threads and each thread has a stack, and the stack's made up of frames, and the frames have local variables, etcetera, etcetera. And you'd define a start state, and then you'd say, well, this program translates to this sequence of operations, and you can specify each operation as, here's what it does to the machine state, and the machine state also defines what operation's gonna happen next. That sort of defines an operational semantics.
So this is useful for someone implementing the virtual machine. In terms of actually trying to understand what's going on with your Python program, I would say it's next to useless because it's heavily recursive and it's quite complicated, so it's not really gonna be terribly enlightening. However, it is a useful basis for discussing how a new programming language feature would work or how the current ones are supposed to work. Because, you know, each of those pieces is reasonably well defined even though combined they're quite complicated. But if you're focusing on a narrow part of the language, then I think it should be comprehensible in a way that the actual source code isn't, because it's just too big and too complicated.
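The three parts described here (translation from AST, the operations, and the machine state they act on) can be sketched for a tiny expression-only subset. This is an illustration under stated assumptions, not the model from Mark's repository: the `Frame` class, the operation names, and the `translate`, `step`, and `run` helpers are all hypothetical, and only constants, local names, `+`, and `*` are handled. Parsing uses the standard library `ast` module.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Machine state for this sketch: local variables plus an operand stack."""
    local_vars: dict
    stack: list = field(default_factory=list)
    pc: int = 0  # index of the next operation to execute


def translate(node):
    """Part 1: map an expression AST to a flat list of (operation, argument)."""
    if isinstance(node, ast.Expression):
        return translate(node.body)
    if isinstance(node, ast.Constant):
        return [("PUSH_CONST", node.value)]
    if isinstance(node, ast.Name):
        return [("LOAD_LOCAL", node.id)]
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mult)):
        op = "ADD" if isinstance(node.op, ast.Add) else "MULTIPLY"
        return translate(node.left) + translate(node.right) + [(op, None)]
    raise NotImplementedError(type(node).__name__)


def step(frame, ops):
    """Part 2: say what one operation does to the machine state."""
    name, arg = ops[frame.pc]
    frame.pc += 1  # the state also determines which operation runs next
    if name == "PUSH_CONST":
        frame.stack.append(arg)
    elif name == "LOAD_LOCAL":
        frame.stack.append(frame.local_vars[arg])
    else:
        right = frame.stack.pop()
        left = frame.stack.pop()
        frame.stack.append(left + right if name == "ADD" else left * right)


def run(source, local_vars):
    """Start state: an empty stack, pc at 0, and the given local variables."""
    ops = translate(ast.parse(source, mode="eval"))
    frame = Frame(local_vars=dict(local_vars))
    while frame.pc < len(ops):
        step(frame, ops)
    return ops, frame.stack.pop()


ops, result = run("x + 2 * y", {"x": 1, "y": 5})
print(ops)
print(result)
```

A real model would need threads, nested frames, exceptions, and much more, but the shape is the same: a start state, a translation, and a per-operation description of how the state changes.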
[00:17:23] Unknown:
Particularly for people who are working on the CPython runtime, if you're deep in the guts of the interpreter and you're trying to figure out how you wanna approach a particular problem, is the formal specification something that might be embedded into the comments where there's a link to say, this is where the specification is for this particular section of the code so that it is collocated and easy to find? Or is it just something that will live independently as its own body of text and you just know that if you're trying to figure something out, you reference that to understand whether that area of the interpreter is covered in specification?
[00:17:59] Unknown:
I think the first thing is to get something up to the state of actually being useful, to the point where someone at some point will say, actually, this is a useful thing to describe this other thing, you know, a new feature or whatever. Or just a bug; that might be a likely one as well. And at that point, maybe we consider moving it under the Python organization, but I'm not gonna suggest that now. It's far from ready for that, and I think just having it on a personal repo makes it more accessible, and people can play with it, fork it, whatever.
[00:18:36] Unknown:
In terms of once we get to a state where this specification is adopted, if that day does come, and then somebody wants to add some new capability to the language where, right now, we have the PEP process to define what it is we're trying to do, some examples of how this might be implemented, perhaps a reference implementation to look at. Would they then also need to append to the specification itself to incorporate that new information at the time of submitting the PEP? Or is it something that you think might happen after it's been accepted, then they need to go in and update the specification? Just curious what role this might have in terms of additional work for upgrading the language or adding new capabilities?
[00:19:21] Unknown:
A PEP for a new feature for the language should be well specified. It should specify the behavior of the new feature in detail. So I would hope that a formal specification would actually make it easier to specify those. New features are often described in terms of equivalent existing Python code, but that only works if the language as it exists can support that feature, which isn't always the case. It's very hard to describe, you know, generators in terms of Python without generators. I mean, they add a fundamental capability to the language.
But they could be described in terms of extending the formal specification because, for example, for generators you'd extend the machine model and add a few operations for yield and then the converse. So I think that it shouldn't make it harder to write a well specified PEP. Hopefully, it would make it easier, and it might also offer a framework for those who aren't familiar with specifying things reasonably formally to build on, because there are already specifications of language features that they're familiar with.
[00:20:37] Unknown:
In the future world where we have a formal specification and it has been adopted as part of the Python language and community, what are some of the possible benefits that you see coming out of that for people who are using or working on Python?
[00:20:53] Unknown:
For users of Python, I would say probably very little. I'd imagine the prose documentation would be far more useful. For those people writing the prose documentation, at least it gives them a more formal thing to base their documentation on. Because if you're documenting a fairly small feature of Python, you wanna write that in prose that fits in with the prose around the other language features that sit around it. Whereas a specification doesn't need to really work that way. It just needs to be precise. It doesn't need to be particularly readable or accessible.
But it's there as a reference for those people wanting to make more accessible documentation. The Python language documentation is pretty good. I don't think there's a great use for these more formal specs for most users of the language. I think it's implementers of the language that it really has value for.
[00:21:47] Unknown:
And what are some of the potential downsides of having this specification?
[00:21:51] Unknown:
It's more work than anything, isn't it? So errors are a problem. You know, if there are errors in the specification, then that can be misleading. That's obviously bad. Yeah. So I think it takes more work and, like anything we do, it's likely to introduce errors. I mean, the only way to not have any errors in anything is not to do anything. Those are the obvious problems. I mean, for a new feature, I would say a specification is less likely to have errors than code. For well-tested older features, the code tends to have had the bugs knocked out of it. It's hard to test a specification. One possible thing to do with a specification is write an implementation of the language based on the specification, which follows the design of the abstract machine model. You can write it in Python or some other high-level language; Python is the obvious candidate.
And performance just really isn't an issue in the slightest. So it should be an interesting exercise to do that to cross-check the specification against the language. And that might be more useful from a sort of educational point of view. But yeah, I think basically the problem is it's more work and it has the potential for errors.
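The cross-checking idea can be sketched as follows: write a deliberately slow evaluator from a small "specification" table, run the same programs through the real interpreter, and report any disagreement. Everything named here (`SPEC_BINARY_OPS`, `spec_eval`, `cross_check`) is hypothetical and covers only integer constants with four binary operators; it is not how Mark's project is structured.

```python
import ast
import operator

# Hypothetical "specification": each supported operator and its meaning.
SPEC_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def spec_eval(node):
    """Evaluate an expression AST by following the table above."""
    if isinstance(node, ast.Expression):
        return spec_eval(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        meaning = SPEC_BINARY_OPS[type(node.op)]
        return meaning(spec_eval(node.left), spec_eval(node.right))
    raise NotImplementedError(type(node).__name__)


def cross_check(sources):
    """Return the sources where the model and the running interpreter disagree."""
    mismatches = []
    for source in sources:
        expected = eval(source)  # behavior of the interpreter running this script
        actual = spec_eval(ast.parse(source, mode="eval"))
        if expected != actual:
            mismatches.append((source, expected, actual))
    return mismatches


mismatches = cross_check(["1 + 2 * 3", "7 // 2", "10 - 4 - 3", "(0 - 7) // 2"])
print(mismatches)
```

Because this toy model borrows the host interpreter's arithmetic through the `operator` module, it mostly checks structure, such as precedence and associativity, rather than the arithmetic itself. A fuller model would define those meanings independently so that a disagreement could point at either side.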
[00:23:09] Unknown:
Another possible approach to formally specifying a given run time is to use an executable version of that with something like TLA+ or Alloy. And I'm wondering what your thoughts were on the potential value of that versus the amount of extra effort that's necessary to be able to actually get something like that correct.
[00:23:27] Unknown:
Well, as I say, it's a lot of extra effort, but I think the main problem with it is accessibility. It basically requires people to learn what is effectively another programming language. And I'm not sure how useful it is, because I think the point of these specs is that you can prove things about them. Now this is very useful for certain systems that should have certain properties, for example freedom from deadlock, and various other properties you may want. You want your avionics software to have certain properties, like not dropping your plane out of the sky, but it's much harder to define the properties of a programming language that you want to prove.
I mean, it would be nice to prove things like it doesn't underflow the stack or it doesn't have memory errors or it doesn't crash in these various ways, but that's really what you wanna prove of the actual implementations, not the specification. So if you can, like, run some sort of model checker or a verifier on your C code, then that would be great. But I don't think these formal specification languages would be that useful. And I think also, mainly, it just excludes people. I think for individual operations in the machine model, it should be sufficiently simple that the prose is unambiguous. And that way it's understandable.
[00:24:46] Unknown:
And another effort that you are trying to take on right now is proposing some means of improving the overall execution speed of Python programs by speeding up the interpreter, and your current goal is by a factor of 5. And I'm wondering how the efforts of formalizing the CPython runtime tie into your goals of improving the overall execution speed and some of the work that you have planned to make that a reality?
[00:25:14] Unknown:
Can I first say that that's a very long term goal? So before anyone gets too excited. In terms of this helping, I'm not sure it does that much, because speeding up CPython has to remain compatible with CPython. It's not just the language semantics that we may or may not have defined. It's all the other features: the C API, and certain behaviors that are not necessarily part of the language but are expected, in terms of, say, immediacy of garbage collection. We have a reference counting garbage collector, which has pros and cons relative to the sort of tracing garbage collector that Node and the JVM use. But one of its pros is its very prompt reclamation of large, short-lived objects.
That's a feature of CPython we'd probably want to keep or, if we don't wanna keep it, you know, we'd wanna formally document the change, but that's independent of improving performance. So it might help in terms of discussing why optimizations are valid or coming up with ideas for optimization, because it gives you a mental model of how Python is being executed that's independent from the C code, which is very low level and quite bulky. So it might give you a clearer way of saying, well, actually, if we represent this abstract structure in this other way instead of how it's currently represented, it would run faster.
So in that sense, it might be useful. Although, to be honest, I think the way it's actually come about is more the other way around. In that I have been thinking about how to improve performance of CPython, and that has led me into thinking, well, what are the fundamentals here? And then having worked some of those out, thinking, well actually it's useful to document that. So that's kind of how the formal semantics actually came out of my performance work, not the other way around.
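The prompt reclamation that comes with reference counting can be observed with a small sketch. It assumes it is run on CPython; on an implementation with a tracing collector, such as PyPy, the "reclaimed" event may arrive later or in a different position. The `Buffer` class is a hypothetical stand-in for a large, short-lived object, and `weakref.finalize` from the standard library is used to record when the object is actually reclaimed.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a large, short-lived object."""

    def __init__(self, size):
        self.data = bytearray(size)


events = []
buf = Buffer(1_000_000)
# weakref.finalize calls events.append("reclaimed") when `buf` is reclaimed.
weakref.finalize(buf, events.append, "reclaimed")

events.append("before del")
del buf  # drops the last reference to the Buffer
events.append("after del")
print(events)
```

On CPython the list comes out as "before del", "reclaimed", "after del": the object is freed at the `del` statement itself rather than at some later collection. This ordering is the kind of behavior that programs can come to depend on even though it is not part of the language definition.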
[00:27:15] Unknown:
And is this effort of formalizing the specification something that you're currently just doing by yourself? Or are there other people who are contributing their time and efforts to understanding some of the intent of various aspects of the runtime and contributing to the specification itself?
[00:27:35] Unknown:
It's currently just me. Informally, a few people have mentioned they might be interested. I've sort of wanted to get it to a stage where it was at least covering enough of the language to make sense for people to add to it. So, for example, I'm obviously not planning on excluding coroutines long term, but in my immediate work of trying to get some sort of specification for a chunk of the language, I've chosen to leave those out in order to get something coherent. That's pretty much done and should be out by the time this podcast is released. There will be a GitHub thing. I mean, if you wish to collaborate, then issues and pull requests on GitHub are always welcome.
[00:28:20] Unknown:
For documents that intend to create some formal specification, there's usually a specific set of grammars or terminology that is used to try and avoid any ambiguous language. I'm wondering if you have any sort of glossary or defined means of referring to various aspects of the runtime, or things such as with RFCs, where there are specific meanings around should or must, and what your thoughts are on the level of detail and level of rigor that's necessary for this document?
[00:28:53] Unknown:
Not really, so far. As for the whole should/must thing, I think that's sort of for formal English. Now obviously, this will be based on formal English, but the idea of having this sort of abstract machine is that it operates by performing a sequence of operations, and that's possibly a more mathematical construction. So I think we're generally into less ambiguous language because of that. That's not to say that there are no ambiguities in the specification I've written so far. But, yeah, I think the more mathematical style of English rather than the sort of legal style probably helps.
Maybe that's just my personal biases.
[00:29:36] Unknown:
In your work of trying to define the specification and determine the goals and potential benefits of that, and the actual work of committing it to paper, I'm wondering what you have found to be some of the most interesting or unexpected or challenging aspects of that work.
[00:29:53] Unknown:
To define Python, you also need to define what the objects do. So Python itself has an object model. You know, if you add two things together, the definition of it is fairly simple, but it doesn't actually say what anything does. It calls the __add__ method, and if that returns NotImplemented, it calls the __radd__ method on the right-hand side, with some fiddling around with subtyping to make sure that that works properly. But it doesn't actually tell you what anything does. It doesn't say what happens if you add two integers or add a floating point to an int or anything like that. So there's that sort of object model part of the language, and I've really barely started on that.
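The dispatch described above can be written out as a simplified sketch. This is an illustration of the documented `__add__` / `__radd__` protocol, not CPython's actual implementation; `binary_add` and the `Meters` class are hypothetical names introduced for the example.

```python
def binary_add(lhs, rhs):
    """Simplified sketch of how `lhs + rhs` is dispatched (not CPython's real code)."""
    lhs_type, rhs_type = type(lhs), type(rhs)
    lhs_add = getattr(lhs_type, "__add__", None)
    rhs_radd = getattr(rhs_type, "__radd__", None)

    # The "fiddling around with subtyping": if the right operand is an instance
    # of a proper subclass that overrides the reflected method, it is tried first.
    rhs_first = (
        rhs_type is not lhs_type
        and issubclass(rhs_type, lhs_type)
        and rhs_radd is not None
        and rhs_radd is not getattr(lhs_type, "__radd__", None)
    )

    attempts = []
    if rhs_first:
        attempts.append((rhs_radd, rhs, lhs))
    if lhs_add is not None:
        attempts.append((lhs_add, lhs, rhs))
    # The reflected method is only consulted when the operand types differ.
    if not rhs_first and rhs_radd is not None and rhs_type is not lhs_type:
        attempts.append((rhs_radd, rhs, lhs))

    for method, self_arg, other in attempts:
        result = method(self_arg, other)
        if result is not NotImplemented:
            return result
    raise TypeError(
        f"unsupported operand type(s) for +: "
        f"{lhs_type.__name__!r} and {rhs_type.__name__!r}"
    )


class Meters:
    """Hypothetical class used only to exercise the dispatch."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, (int, float)):
            return Meters(other + self.value)
        return NotImplemented


total = binary_add(1, Meters(2))  # int.__add__ declines, so Meters.__radd__ runs
print(total.value)
```

Note that the sketch only says which methods are called and in what order. What `int.__add__` or `Meters.__radd__` actually compute is a separate question, which is the gap in the object model being pointed out here.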
But that also brings you to another key point, which is why this is a sort of CPython semantics rather than just a Python semantics: what happens when you're interacting with compiled C code, NumPy, things like that? Because these also have all these operations. The reason for CPython is that extension classes and extension objects are first-class members of CPython. There's no real difference between a NumPy array and a built-in list or integer. There might be some subtle differences in terms of performance, because integers and lists are just fundamental to the virtual machine. But in terms of their capabilities, there is no difference. So you need to define how the VM transfers information to these opaque blobs, what it gets back, and what that means.
And there's also the other side of the interaction with, let's say, C code. I mean, it doesn't have to be written in C; it could be Rust or whatever. But there's sort of machine-level code, and there's also how this machine-level code interacts with the abstract machine itself. How does it start? So, like, what are the start states of the abstract machine? It's all very well saying it's got, you know, a stack. So let's describe addition. You evaluate the left-hand side of the addition and you push that onto some stack. You evaluate the right-hand side and you push that onto the stack. And then addition pops the two off, adds them together, and pushes the result back on the stack. But what stack?
How did that stack get there in the first place? What's the starting state of the stack? Is it like a stack per thread? And so on and so forth. So there's a lot to how the abstract machine and its internal state are all put together: what the starting state is, what the end state is. When you think of a Python program, or any program, you think of how it operates. Usually you haven't put a lot of thought into how it sort of starts and finishes. It's interesting because if you look at how a compiled program works, if you write a small C program, there's a lot of code in that small C program, low-level machine code, and a lot of that has to do with loading the program up and setting up its state before it can actually do its job. And it's a similar thing with the abstract machine.
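The addition example and the question of start and end states can be made concrete with a toy stack machine. This is a sketch under stated assumptions: the `Frame` class, the instruction names `PUSH`, `ADD`, and `RETURN`, and the `run` function are all invented for illustration and are not CPython's bytecode or the specification's actual abstract machine.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical machine state: instructions, a program counter, a value stack."""

    code: list
    pc: int = 0  # start state: execution begins at the first instruction
    stack: list = field(default_factory=list)  # start state: an empty stack


def run(frame):
    """Execute until RETURN; the end state is the returned value."""
    while frame.pc < len(frame.code):
        op, arg = frame.code[frame.pc]
        frame.pc += 1
        if op == "PUSH":
            frame.stack.append(arg)
        elif op == "ADD":
            right = frame.stack.pop()  # the right-hand side was pushed last
            left = frame.stack.pop()
            frame.stack.append(left + right)
        elif op == "RETURN":
            return frame.stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    raise RuntimeError("code ended without RETURN")


# 2 + 3: push the left operand, push the right operand, add, return the result.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("RETURN", None)]
result = run(Frame(program))
print(result)
```

Even in this toy, the questions raised above have to be answered explicitly: the stack exists because a `Frame` was constructed, it starts empty, and there is one per frame. A real specification has to make the same kind of decisions for threads, frames, and interpreter startup.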
[00:32:48] Unknown:
In your exploration of CPython, either as a contributor or with your work of trying to define some of the semantics behind it, what are some of the more interesting or unexpected behaviors that you've uncovered, or any particular quirks that you think are worth calling out?
[00:33:05] Unknown:
I think with CPython, something interesting is the difference, or lack of difference, between mappings and sequences. So conceptually there is a difference. At the most abstract, a sequence is just a mapping of a contiguous range of integers to some set of values, and a mapping is just a mapping of anything to anything. But obviously they tend to be implemented quite differently internally, because sequences are often implemented with a sort of linear layout in memory, so it's much more efficient. So the CPython virtual machine sort of distinguishes between the two. Python doesn't really distinguish between the two. If you have __getitem__, you're a mapping or a sequence depending on what you take: whether you just take a range of integers or any value.
So CPython has different implementations for the different things, sort of like different function pointers internally. So this sort of thing occasionally leaks out a bit. So that's one example.
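A small sketch of how a single `__getitem__` can act as both a mapping and a sequence at the Python level. The `Squares` class is a hypothetical example. At the C level, CPython type objects carry separate slots for the two protocols (`mp_subscript` for mappings and `sq_item` for sequences), which is presumably the kind of "different function pointers" meant here.

```python
class Squares:
    """Defines only __getitem__; nothing declares it a mapping or a sequence."""

    def __getitem__(self, key):
        if isinstance(key, int):
            if 0 <= key < 4:
                return key * key
            raise IndexError(key)  # signals the end of the sequence
        return f"looked up {key!r}"


s = Squares()

# Used like a mapping: any object is accepted as a key.
as_mapping = s["name"]

# Used like a sequence: with no __iter__ defined, iteration falls back to
# calling __getitem__ with 0, 1, 2, ... until IndexError is raised.
as_sequence = list(s)

print(as_mapping)
print(as_sequence)
```

The same method serves both uses, and it is only the fallback iteration protocol that treats the object as a sequence, with no declaration from the class itself.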
[00:34:04] Unknown:
And I'm curious to see how things will play out, whether it ends up being ultimately accepted as part of the core of Python itself or if it continues to be a side project for you. So we'll definitely be keeping an eye on that. But for anybody who does want to get in touch with you or follow along with the work that you're doing or contribute to this effort, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose American Gods, both the book and the TV series, where the book is by Neil Gaiman and the TV series, I think, is on Starz. Been watching the first 2 seasons that have come out so far, and I think they did a really good job of bringing the story to the screen. So I look forward to continuing that. It is not suitable for children, but it is a good story nonetheless. So definitely recommend that if you're looking for something to keep yourself entertained. And with that, I'll pass it to you, Mark. Do you have any picks this week?
[00:34:55] Unknown:
Yeah. So the first one is a book called Roadside Picnic. It's actually a 1970s Russian sci-fi book. It's a little unusual for a sci-fi book. The sci-fi element is rather mysterious and serves as a backdrop for an interesting story. The authors are from 1970s Russia, obviously, or the Soviet Union as it was then. And there was some veiled criticism of the then Soviet Union, which was only possible in sci-fi, which makes it interesting. But it's also an excellent book and well worth reading regardless of that background. And the other thing I want to recommend is something completely different, for anyone who's got a VR headset, and if you haven't, you might wanna get one, since we're basically not allowed outside for the next few months: a game called In Death, which is much less seminal, but good fun.
[00:35:42] Unknown:
Thank you very much for taking the time today to join me and discuss your work of trying to provide a specification for CPython. It's definitely an interesting project and one that, as I said, I'll be following closely. So thank you for your time, and I hope you enjoy the rest of your day. Thank you. Goodbye. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email host at podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Current State of Python and CPython Runtime
Motivation for a Formal Specification
Challenges and Benefits of Specification
Examples from Other Languages
Reconstructing Intent and Handling Legacy Code
Approach to Writing the Specification
Integration with PEP Process
Future Benefits of a Formal Specification
Efforts to Improve Python Performance
Current Contributors and Collaboration
Challenges in Defining the Specification
Interesting Behaviors in CPython
Closing Remarks and Picks