Summary
One of the most common causes of bugs is incorrect data being passed throughout your program. Pydantic is a library that provides runtime checking and validation of the information that you rely on in your code. In this episode Samuel Colvin explains why he created it, the interesting and useful ways that it can be used, and how to integrate it into your own projects. If you are tired of unhelpful errors due to bad data then listen now and try it out today.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show because you love Python and want to keep your skills up to date. Machine learning is finding its way into every aspect of software engineering. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. Podcast.__init__ is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to pythonpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
- Your host as usual is Tobias Macey and today I’m interviewing Samuel Colvin about Pydantic, a library for enforcing type hints at runtime
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Pydantic is and what motivated you to create it?
- What are the main use cases that benefit from Pydantic?
- There are a number of libraries in the Python ecosystem to handle various conventions or "best practices" for settings management. How does Pydantic fit in that category and why might someone choose to use it over the other options?
- There are also a number of libraries for defining data schemas or validation such as Marshmallow and Cerberus. How does Pydantic compare to the available options for those cases?
- What are some of the challenges, whether technical or conceptual, that you face in building a library to address both of these areas?
- The 3.7 release of Python added built-in support for dataclasses as a means of building containers for data with type annotations. What are the tradeoffs of Pydantic vs the built-in dataclass functionality?
- How much overhead does Pydantic add for doing runtime validation of the modelled data?
- In the documentation there is a nuanced point that you make about parsing vs validation and your choices as to what to support in pydantic. Why is that a necessary distinction to make?
- What are the limitations in terms of usage that you are accepting by choosing to allow for implicit conversion or potentially silent loss of precision in the parsed data?
- What are the benefits of punting on the strict validation of data out of the box?
- What has been your design philosophy for constructing the user facing API?
- How is Pydantic implemented and how has the overall architecture evolved since you first began working on it?
- What have you found to be the most challenging aspects of building a library for managing the consistency of data structures in a dynamic language?
- What are some of the strengths and weaknesses of Python’s type system?
- What is the workflow for a developer who is using Pydantic in their code?
- What are some of the pitfalls or edge cases that they might run into?
- What is involved in integrating with other libraries/frameworks such as Django for web development or Dagster for building data pipelines?
- What are some of the more advanced capabilities or use cases of Pydantic that are less obvious?
- What are some of the features or capabilities of Pydantic that are often overlooked which you think should be used more frequently?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Pydantic used?
- What are some of the most interesting, challenging, or unexpected lessons that you have learned through your work on or with Pydantic?
- When is Pydantic the wrong choice?
- What do you have planned for the future of the project?
Keep In Touch
- samuelcolvin on GitHub
- Website
- @samuel_colvin on Twitter
Picks
- Tobias
- Samuel
- Flash Boys by Michael Lewis
- Algorithms To Live By by Brian Christian and Tom Griffiths
- ngrok
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Pydantic
- Matlab
- C#
- FastAPI
- Marshmallow
- Cerberus
- 12 Factor App
- Django
- Python Type Hints
- Cython
- MyPy
- Duck Typing
- Haskell
- Higher Order Types
- PyCharm Pydantic Plugin
- Django Rest Framework
- Avro
- Parquet
- Dagster
- Starlette
- Flask
- Ludwig
- DeepPavlov
- fastMRI
- ReAgent
- Pynt
- Open Source Has Failed article
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI and CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L I N O D E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
You listen to this show because you love Python and want to keep your skills up to date. Machine learning is finding its way into every aspect of software engineering. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their machine learning engineering career track program. In this online, project-based course, every student is paired with a machine learning expert who provides unlimited 1-to-1 mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the life cycle of a deep learning prototype.
Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. Podcast.__init__ is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes, and there's no obligation. Go to pythonpodcast.com/springboard and apply today. Make sure to use the code AISPRINGBOARD when you enroll. Your host as usual is Tobias Macey. And today, I'm interviewing Samuel Colvin about Pydantic, a library for enforcing type hints at runtime. So Samuel, can you start by introducing yourself?
[00:02:10] Unknown:
Hi. I'm Samuel. I am a software developer. I split my time usually between TutorCruncher, a SaaS company I've been working on for many years, and Nel Health, which is a very exciting health tech company. We do blood and genetic testing to give people actual health data, and then I also spend too much of my time doing open source. And do you remember how you first got introduced to Python? I do. I got offered an internship when I was at university by a company that used Python quite a lot. I actually didn't do the internship, but I played with Python and got hooked. I've done quite a lot of developing in other languages, in MATLAB and C# and Julia and Rust, and, obviously, like everyone, JavaScript, but I've always come back to Python.
[00:02:49] Unknown:
And so a few years ago, you started the project. And I'm wondering if you can give a bit of a description about what that project is and what it is that motivated you to create it in the first place. Yeah. I can remember the problem I was trying to solve. I was trying to parse a dictionary of HTTP request headers
[00:03:06] Unknown:
into a kind of class, a bit like a data class with some properties that had type annotations. And I was really frustrated that all of the validation libraries that I could find didn't respect or care about type annotations. In fact, they directly conflicted with them. So I guess I started digging, found you can get access to the annotations, and went from there. And a few days later, I released version 0.0.1, and some people used it. It got quite popular on Hacker News right then, and the rest is history.
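The kind of model described here can be sketched like this — the header names and field types are illustrative, not taken from the episode:

```python
# A sketch of the original use case: parsing a dict of "dumb" string data
# (e.g. HTTP headers) into a class with type-annotated fields.
from pydantic import BaseModel, ValidationError

class RequestHeaders(BaseModel):
    content_length: int  # the raw header value arrives as a string
    host: str

raw = {"content_length": "123", "host": "example.com"}
headers = RequestHeaders(**raw)
print(headers.content_length)  # 123 -- a real int, coerced from the string

try:
    RequestHeaders(content_length="not-a-number", host="example.com")
except ValidationError as exc:
    # the error points at the offending field
    print("validation failed:", exc.errors()[0]["loc"])
```

Unlike a plain class or a hand-rolled parser, the annotations drive the validation, so the model definition doubles as documentation and IDE hints.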
[00:03:36] Unknown:
And so in terms of the main use cases that benefit from Pydantic, you mentioned being able to parse a dictionary of headers into a class object. But what are some of the other ways that it's being used?
[00:03:49] Unknown:
So the fun and the challenging bit of Pydantic is that it's used in quite a lot of different situations. So it's used for settings management. You can think of it a bit like the settings.py file in Django, or in any server project where you would have settings like DSNs for connecting to databases, and what port, and a thousand other different settings, but it's also used for kind of API form data validation. And then I think people use it at library boundaries, to confirm when people are using an external library that they are passing that library the correct arguments, and then it's also used by data scientists in a data processing pipeline kind of scenario. So the range of different ways in which it's used has been really interesting, but it's also been, in certain situations, a bit confusing, I think, for developers who assume that everyone else is doing with it what they're doing. About a year ago, Sebastian Ramirez, who I think you interviewed a couple of weeks ago, started FastAPI, which uses Pydantic, and that's where the library really took off in terms of its popularity.
But what's been really interesting is since then, its usage has really exploded, not just with FastAPI. So Pydantic now has just a bit over a million downloads a month, so quite a lot more than FastAPI and the other libraries that use it. So it's obviously being used, I suspect, somewhere by big corporations who are running CI thousands of times a day, which is why it's being downloaded so much.
[00:05:14] Unknown:
And in terms of the use cases that you're seeing for it, you know, the web API is one sort of obvious avenue for it. And then you also mentioned it being used in the data science and data engineering contexts. Are there any ways that it's being employed that you found to be particularly surprising, or any types of feature requests that you're getting for given contexts that you have either been sort of surprised and delighted by, or had to actively turn down in terms of trying to avoid feature creep? Yeah. The feature creep
[00:05:47] Unknown:
issue is definitely challenging, because of the demand from people wanting to use it in lots of different situations. There's an issue around strictness in how you validate or parse data, which I think we'll come on to in a bit, which is definitely problematic, because different people have very strong and different ideas about how it should work in those regards. I guess when I first started using it, I was always thinking about kind of dumb data, so JSON-type inputs: strings, mostly strings, but also obviously floats and ints. But, obviously, a lot of the time, it's used in contexts where the inputs can be quite complex Python objects.
And so a lot of its usage has, like, expanded to
[00:06:25] Unknown:
do validation of those complex objects. That definitely wasn't how it first started. And there are a number of other libraries that exist in the Python ecosystem for being able to do various things like settings management, which Pydantic is focusing on, and that try to enforce different sorts of best practices. And then there is also another suite of libraries for being able to do things like define data schemas or define validation logic for input data structures, such as Marshmallow and Cerberus. I'm wondering if you can give a bit of an idea of the comparison of how Pydantic fits in both of those different ecosystems, and some of the reasons that people might want to choose Pydantic over the other available options?
[00:07:07] Unknown:
So looking at the settings case, first of all, I'm a big fan of the 12-factor app approach to building applications, and so a Pydantic BaseSettings class will automatically read and infer environment variables, or environment variables defined in a .env file. So if you've ever used Django, you'll have a settings file, but a great number of your default settings will have os.getenv around them to allow you to overwrite that default. That's automatic in Pydantic, but, of course, because it also coerces types, it would automatically coerce a string to, say, an int or a path or whatever else you might want it to be. The main difference, of course, is that it uses type hints to define what your type should be. So you don't have to set that out twice or conflict with type hints. I also think that tests need to be a first-class citizen when you think about settings. So a settings class can also take arguments when you initialize it, which will overwrite the defaults, but also environment variables, making it really easy to use in a testing situation.
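The behaviour described — class-level defaults, overridden by environment variables, overridden in turn by init-time arguments, with values coerced to the annotated types — can be sketched in plain Python. This is only an illustration of the idea, not Pydantic's actual BaseSettings implementation, and the setting names are made up:

```python
# A stdlib-only sketch of the idea behind a settings class:
# default < environment variable < explicit keyword argument,
# with every value coerced to its annotated type.
import os

class Settings:
    port: int = 8000
    debug: bool = False
    db_dsn: str = "postgres://localhost/app"

    def __init__(self, **overrides):
        for name, typ in self.__annotations__.items():
            # kwargs win over env vars, which win over class defaults
            value = overrides.get(name, os.environ.get(name.upper(), getattr(self, name)))
            if typ is bool and isinstance(value, str):
                value = value.lower() in ("1", "true", "yes")
            else:
                value = typ(value)  # e.g. the string "9000" becomes the int 9000
            setattr(self, name, value)

os.environ["PORT"] = "9000"
s = Settings(debug=True)   # kwarg override, handy in tests
print(s.port, s.debug)     # 9000 True
```

Pydantic's BaseSettings gives you this plus .env file support, nested models, and proper error reporting, all driven by the same type hints.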
And then lastly, Pydantic isn't tied to any particular framework or ecosystem. So it can be used by tools like FastAPI, but it's not tied to them. So it can be used in any library you like. In terms of comparing it to tools like Marshmallow and Cerberus, as I say, the first thing you'll notice is that type hints are used. The second thing will be speed. Pydantic is about two and a half times faster than Marshmallow and, from our benchmarks, 26 times faster than Cerberus. I think Cerberus must have some problem there, because I've asked a question on Stack Overflow, and I think on their repo, about why it's so slow and have never got an answer. So it seems that Cerberus in particular is very slow. But Pydantic is faster than all of the other libraries that we've benchmarked, or on a par with the fastest ones. It's compiled with Cython, and there are binaries available for Linux, Mac, and Windows from PyPI, which is one of the reasons it's so fast. But even without that, it's among the fastest of the validation libraries. As you will learn when you start using Pydantic, we lean towards coercion over strictness, unlike other libraries, which is confusing in some scenarios, but mostly it's useful. If you think about parsing, like, GET arguments, the values there will always be strings. If you had some value, let's say age, that you wanted to be an int or a timedelta or a datetime, you need to have that coercion. There's no way that you could do strict validation that the input value must be an int, say, when the inputs have to be strings. And lastly, I'd say I hope that we manage to be friendly and helpful on GitHub.
Unlike lots of projects, Pydantic allows people to ask questions on GitHub, and I try to always be as helpful as I can, which, to put it mildly, isn't true of all projects.
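The GET-argument case mentioned above looks something like this — query-string values always arrive as strings, and Pydantic coerces them to the annotated types (the parameter names here are hypothetical):

```python
# Query-string parameters are always strings; coercion turns them into
# real typed values. The field names are illustrative.
from datetime import datetime
from pydantic import BaseModel

class SearchParams(BaseModel):
    age: int
    since: datetime

params = SearchParams(**{"age": "30", "since": "2020-05-01T12:00:00"})
print(params.age, params.since.year)  # 30 2020
```

A strict validator would have to reject every one of these inputs, since none of them is "really" an int or a datetime — which is exactly why coercion is the useful default here.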
[00:10:04] Unknown:
And because of the fact that you are addressing these two different use cases, of being able to be used for settings validation as well as being able to handle validation of input and output data, or data as it traverses an application, what are some of the tensions or challenges that you face, whether technical or conceptual, or just in terms of requests from the community, in terms of being able to build a library that addresses these different use cases? So I said earlier that Pydantic
[00:10:32] Unknown:
leans towards coercion over strictness. So if you pass it a string of a number, so a string of the characters 123, into a field which is an integer, then it will convert that string to an integer. That occasionally confuses and frustrates people, and they say, oh, it should be strict and refuse that string as the input to an int. If you then say, well, what would happen if I pass you a string for a path field, for example? They would say, oh, well, it should definitely do coercion then. And the conversation goes on, and what you realize is that the person you're speaking to is thinking in the JSON world, where you have the seven types of data in JSON, and they assume that there should be no coercion between them, but there should be coercion from those JSON types to higher types like, I don't know, UUIDs or paths. So one of the problems I've seen, and it's quite understandable, is people assuming that other people's usage is similar to theirs. And, therefore, they might assume everyone's using it for an API where everything's JSON, where, actually, it's being used in lots of other contexts. Or they might assume that the data being passed in is lots of different Python objects and that JSON is not relevant. That's been a problem both technically, in how strict Pydantic should be, but also conceptually, to explain to people that it's used for lots of different things, which is great, but it can also lead to slight confusion.
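That JSON-types-versus-higher-types distinction can be illustrated with a hypothetical config field — a plain JSON-style string is coerced to a pathlib.Path:

```python
# JSON has no Path type, so a string is the only way a path can arrive;
# coercion to the "higher" type is what almost everyone wants here.
from pathlib import Path
from pydantic import BaseModel

class AppConfig(BaseModel):
    log_file: Path  # a plain string from JSON or an env var becomes a Path

cfg = AppConfig(log_file="/var/log/app.log")
print(type(cfg.log_file).__name__, cfg.log_file.name)
```

The same people who want `"123"` rejected for an int field usually want exactly this coercion for a Path field — which is the tension described above.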
[00:11:52] Unknown:
In terms of the actual usage of Pydantic, from looking at the documentation, it's largely based around building up these class objects that act as containers for the different data fields. And in a lot of ways, it's very similar to the data classes that were added in Python 3.7. So I'm wondering what you see as being the trade-offs of using the built-in data classes versus Pydantic, and what the challenges were that you faced when data classes were added as a first-class concern, given that it seems that you started the project before they were part of the mainline Python. Yeah.
[00:12:28] Unknown:
Pydantic was first released for Python 3.6, I think, which didn't have data classes, and we had to support them later. So Pydantic has its own version of data classes, which are really just validation on top of standard data classes. So the data class you get when you initialize a data class that uses Pydantic's decorator will effectively be a completely vanilla data class, just with the validation having gone on. More generally, the number one trade-off, of course, is that data classes don't do any data validation. So it might say foo needs to be an int, but if you pass it a string or a nested dictionary of UUIDs, data classes isn't gonna care and isn't gonna do anything, and so you're relying on mypy or static type analysis to check that that's actually true, where, obviously, Pydantic does the runtime type checking. The second big difference is that data classes use the pattern, or arguably anti-pattern, of generating a bunch of Python code and then calling eval on that to create the data class, which is quite slow and prevents you from doing things like compiling with Cython. So Pydantic data classes don't do any of that eval stuff, and so they can work with Cython. The only case where that happens implicitly is when we're using our own variant of standard library data classes, where, of course, we have to use the standard initialization of a data class once we've done the validation. And then the third big difference is that, of course, Pydantic has lots of other tools on top of it, whether that be parsing JSON, validators, serialization to things like JSON, nested data structures; all that stuff is available in Pydantic. That's not gonna be available in the standard library data classes, which are kind of like a building block. So,
[00:14:10] Unknown:
simpler, and they're great. And I use them quite a lot, but they're not right in every scenario. So because Pydantic is a library, and also because it's doing this runtime type checking as opposed to just the ahead-of-time validation that you might get from a linter or something like mypy, what are some of the points of overhead or the potential complications that get added by using Pydantic in place of the built-in capabilities or doing this ahead-of-time checking? Well, it's gonna be a lot slower, of course, to call a function where you go through the whole of data validation before
[00:14:42] Unknown:
you call a function. That's unavoidable. There are cases where it's better to use the kind of duck typing and catch-the-error approach rather than doing the validation first, but, of course, there are cases where you need validation. Compared to handwritten validation, because of the compiling, Pydantic is generally on a par, maybe slightly faster or slightly slower than handwritten validators. So it is, of course, slower than not doing validation, but if you're gonna do validation, Pydantic is pushing towards the fastest you can do it in Python, I think. And as far as the conversation that we started of
[00:15:18] Unknown:
strict validation versus type coercion, I know that you added some explicit points in the documentation to be able to call that out to avoid confusion, because of some conversations that came up earlier in the life cycle of the project. And I'm wondering if you can talk a bit more about some of the nuances of that strictness versus coercion and some of the ways that it manifests, and some of the limitations that you see in terms of the use cases by explicitly not supporting that strict validation in favor of being able to do the coercion?
[00:15:51] Unknown:
Yeah. I mean, Pydantic started off trying to be fast and trying to be simple, and so I took the approach that if I want something to be an integer, then the simplest thing to do is to call the int built-in on that value. And if it succeeds, then you know you've got an integer. And if it fails, then you take that error and you use that as a basis for the exception you're gonna throw. And in most cases, we still do that. And that does mean that, for example, if you pass it a string, that will be parsed to an int. Or if you pass something to a list, it'll just call the list built-in and give you back a list. There are cases where that makes no sense. So, for example, virtually everything can be cast to a string, because it has the string method on it. And so it wouldn't make sense just to call string on everything and say, if it can be passed to a string, then it is a string. It wouldn't make any sense, if you pass a list of integers to a field you expect to be a string, for you just to get back the string representation of that list. But there were some other weird ones that people came up with early on and pointed out were mistakes, which we fixed. So, for example, before, because we were just calling list on something to see if it's a list, and then calling, for example, int on something to see if it's an int, if you passed the string 123 to something that was meant to be a list of ints, you got back the list 1, 2, 3, which was very confusing. So there are cases where we've gradually moved to be slightly stricter in the right places. I think that's the right approach to take. There's still, to some degree, an open — well, there is an open issue literally, but it's also an open issue in my head — about whether we should have a completely strict mode. But from what I've seen, when you explain to someone what a completely strict mode means, things like a path has to be an instance of Path, not a string, that generally isn't what people want.
So I think we will avoid going completely strict and instead move towards slowly making things stricter in the right ways, but only have one mode. And if you really want to be stricter, you can use validators to do that yourself. In terms of cases where that means it doesn't work: if you had a kind of testing situation where you wanted to test the output of a webhook or of an API, Pydantic wouldn't be the right tool, because it will lean towards doing coercion instead of just checking. There are some cases where you could lose data. For example, if you had a float of 3.1415 and you pass that to an int, you'll silently lose data, because it would convert that to the int 3.
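For the per-field strictness mentioned here, Pydantic ships Strict* types (StrictInt, StrictStr, and so on) alongside custom validators — a sketch with hypothetical field names:

```python
# Opting into strictness per field rather than via a global strict mode.
from pydantic import BaseModel, StrictInt, ValidationError

class Event(BaseModel):
    count: int          # lenient: the string "3" is coerced to 3
    user_id: StrictInt  # strict: the value must already be an int

print(Event(count="3", user_id=7).count)  # 3

try:
    Event(count="3", user_id="7")  # the string is rejected for the strict field
except ValidationError:
    print("user_id must be a real int")
```

This keeps the convenient out-of-the-box coercion while letting you tighten exactly the fields where silent conversion would be a bug.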
But most of the problems, I think, really come from developer confusion, that they assume it would be strict, and that's where they kind of end up getting confused and getting into trouble. And then I get a slightly irate issue saying, why the hell does it work this way? And for those cases where there is the potential for data loss, I know that in some ways it's
[00:18:30] Unknown:
potentially impossible to be able to determine ahead of time if there is any sort of lossy conversion that's going to happen. But have you explored the possibility of adding in some types of warning, or being able to capture hooks of, this is going to lose data, and then being able to raise errors during the development cycle, or anything along those lines for giving people that option of lossy versus lossless conversions?
[00:18:54] Unknown:
Well, we thought about it; the strict mode would, I think, be the route there. So you would enable the strict version of the data class during testing, and then you would know that in testing it was working. That's just a lot more work and not something that I think people actually want in the end. If you look at libraries like, I don't wanna pick on Cerberus, but I know that you have to explicitly set up any coercion that you want, which is in some ways more explicit, and explicit is better than implicit, but it does also mean that every single place you want to use some higher-order objects that will require some coercion, you need to explicitly set that, whereas Pydantic tries really hard to just work out of the box in the most likely scenario. I mean, I also built it for me, for projects I needed, and so I built it the way I wanted. And to a degree, unless people pay me, they can kinda have it the way I want it, to be a bit blunt about it.
[00:19:42] Unknown:
And that brings up an interesting point too about the API design of the library and how you approach the overall philosophy of the structure of the project and the interfaces that you expose to users. And I'm wondering what your prior experience has been in terms of building out projects that are more widely used, and what your thoughts are in terms of how to design that API in a way that is easy to adopt as well as being appropriately expressive for the problem space that you're working in? I think it's a difficult and nuanced problem. I think that humility is quite a useful attribute of a developer, and probably an under-considered one. Trying to remember what it was like when you first started developing, and remembering how little you understood, is really valuable when you're building open source code that will be used a lot by junior developers.
[00:20:28] Unknown:
And, I mean, Sebastian Ramirez, who built FastAPI and who's helped me a lot on Pydantic, is the master of that. He seems great at creating open source projects and getting them to grow, and doing it in a way which arguably isn't even always the technically perfect way, but which allows the most people to use his code and to get something done. Then there are, like, more practical things you can do. I know Sebastian was talking on your podcast previously about things like setting every keyword argument explicitly in the public-facing API, so that with IDE type hints, you can kind of get there without having to use the documentation.
Lots of projects don't do that, and it's really frustrating, so I definitely try and get things like that right and make it easy to use. I suppose my overall approach is, as long as it's fast and it's easy to get started with, I'm happy to skip some edge cases and allow people to fix them themselves or use another tool. I guess it's better to be right for the majority of people and usable than to do everything.
[00:21:27] Unknown:
And in terms of how Pydantic itself is implemented, can you talk through the overall design of the project and some of the ways that it has evolved since you first began working on it? Well, the biggest change is that we got to version 1,
[00:21:41] Unknown:
late last year. There were a few understandable complaints from big projects using Pydantic that it was a moving target. Pre version 1, I was quite keen on breaking things to make it better, and that understandably frustrated people who had it as one of many dependencies. So we got to version 1, and I've tried really hard to avoid any backwards-incompatible changes since then. The main other change in Pydantic over the years has been the fact that there are now four main interfaces to Pydantic. So there's the BaseModel approach, which is the primary one. Then there's Pydantic data classes, which I talked about earlier. Then you have parse_obj_as, which allows you to parse or validate any object you like. So you give it the type and you give it the raw data, and it will either succeed or fail. And then most recently, in version 1.5, we released validate_arguments, which is a function decorator that allows you to validate the arguments of any function. And then Sebastian added, back when he was first working on FastAPI, schema generation using JSON Schema, which was another big step forward for Pydantic. In terms of other changes, David Montague worked a lot on getting Pydantic to compile with Cython, which has made a big difference to performance. That was one of the big changes we made last year. And for the actual internal architecture of the project,
[00:23:01] Unknown:
I'm wondering how the actual class definitions are structured to be able to do things like gain access to the type attributes of the arguments to the class, or of the fields within the class. And also, in terms of the typing capabilities in Python itself, I'm wondering what you have found to be some of the well considered aspects of it, and any challenges that you've faced in terms of shortcomings of the typing system. Because I know that whenever this conversation comes up, people will invariably look to things like Haskell for some of the more complex and elegant ways to handle complex types. But I'm wondering what your overall thoughts are on typing in Python and the ways that you approach leveraging it within the project. So the first bit of that question is super easy. In the metaclass,
[00:23:47] Unknown:
it's really easy to get access to the annotations on a class and then use that as a basis for your validation. The tough bit is introspecting the types to work out what they are, and then developing code to suitably parse or validate data to check whether or not it's compliant with that. That's been the tough bit. I think that somewhere, although I was looking for it before this and couldn't find it, Guido has said that he explicitly doesn't think that type annotations are designed for runtime type checking. I think originally the position was: they're here, do whatever you like with them. And then later on it was clarified that they weren't for runtime type checking, and that kind of shows when you try and start introspecting types. They're hellishly complicated to introspect. That was a lot of the work early on in Pydantic, and still to this day one of the frustrations is that, even if you just think about a sequence of integers, you have Collection, Sequence, List, Set, FrozenSet, Iterable.
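The two halves he describes can be sketched with the standard library alone, no Pydantic required; the class here is purely illustrative:

```python
from typing import Iterable, List, Set, get_args, get_origin

# The easy half: a class's annotations are just a dict, available to a
# metaclass (or anything else) without any magic.
class User:
    id: int
    tags: List[str]

print(User.__annotations__)  # {'id': <class 'int'>, 'tags': typing.List[str]}

# The hard half: introspecting parameterised types to work out what they
# mean, here for several of the many "array of integers" spellings.
for tp in (List[int], Set[int], Iterable[int]):
    print(get_origin(tp), get_args(tp))
```

`get_origin` and `get_args` were only added in Python 3.8; earlier versions had to poke at private attributes like `__origin__`, which is part of why this was painful for libraries like Pydantic.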
There are many different versions of what you might generically call an array of integers. So that's been problematic and goes on being a problem; not all of that is completely solved in Pydantic. But most of the time, it just works. So for a developer who's interested in getting started with using it or integrating it into an existing project, what is the overall workflow of adding in those type definitions, and some of the different capabilities that are exposed by bringing Pydantic into that project? So getting started with Pydantic should be as simple as pip install pydantic. That will give you a compiled binary on whatever operating system you're using; then have a look through the docs and get started. I think it's almost, in some ways, too easy to get started, because everyone already has an idea about how type hints work. That's probably one of the reasons that people find the pitfalls: they haven't gone through the docs. Quite understandably, like the rest of us, they just start going and end up running into problems. Those type hints are awesome because they then work with your IDE, with mypy, with your own intuition, with PyCharm or whatever IDE you're using. In particular, PyCharm has a plugin specifically for Pydantic built by the community, which is awesome, and which makes usage of PyCharm with Pydantic even easier and even better. But, yeah, the use of type hints avoids you having to learn another schema micro-language for defining your models. You can just use standard Python and off you go. And what are some of the edge cases that developers
[00:26:21] Unknown:
often run into, aside from the confusion about strictness versus coercion of the typing information?
[00:26:28] Unknown:
One of the useful divisions, I think, is that when you're defining your model, it can't look at the rest of the world. So if, let's say, you're creating a user with an email address and you need that email address to be unique, you can't do the check that the email address is unique within a validator in general, because that validator can't be asynchronous. And even if you're using a synchronous database lookup, it's generally bad practice to put that inside the validator. So using Pydantic gives you a very good division between checking that the data is consistent and then checking that the data works for the rest of the world. But that's something that quite often leads to confusion, and there's a bit of tooling needed around how to raise validation errors after you've done the initial validation, which I'm gonna work on in future. There are some differences with data classes, even if you use the Pydantic dataclasses, mostly around implicit constraints of data classes: you can't have extra arguments to them, which occasionally people have problems with. And then there's a kind of complex question about whether to pass around Pydantic models, or dictionaries created from Pydantic models, or data classes you create, or how you then pass your data along. If it's as simple as accessing an attribute of a model and saving that to a database, that's fine. But obviously people have complex processing workflows, and people have lots of different solutions for that. And Pydantic works well with pickle, so you can just pickle your models, but even that occasionally leads to problems.
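That division might look like the following sketch; the model, the function, and the stand-in "database" set are all hypothetical, written against the Pydantic v1 API discussed in this episode:

```python
from pydantic import BaseModel, ValidationError

class UserIn(BaseModel):
    name: str
    email: str

# Hypothetical stand-in for a database; in real code this check would be a
# (possibly async) query, which is exactly why it doesn't belong in a validator.
EXISTING_EMAILS = {"sam@example.com"}

def create_user(raw: dict) -> UserIn:
    user = UserIn(**raw)               # step 1: is the data internally consistent?
    if user.email in EXISTING_EMAILS:  # step 2: does it work for the rest of the world?
        raise ValueError(f"email {user.email!r} already registered")
    return user

print(create_user({"name": "Anna", "email": "anna@example.com"}))

try:
    create_user({"name": "Bob"})       # fails step 1: missing field
except ValidationError as exc:
    print("invalid shape:", exc.errors()[0]["loc"])

try:
    create_user({"name": "Sam", "email": "sam@example.com"})  # fails step 2
except ValueError as exc:
    print(exc)
```

Keeping the "rest of the world" check in a plain function outside the model also sidesteps the async limitation on validators he mentions.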
[00:28:01] Unknown:
And for the capability of being able to access the different attributes and pass along the data, what are some of the best practices that you have found to be useful, in terms of whether you pass the exact model or just a representation of the dataset and then do the coercion back and forth throughout? I don't think there's a single good answer. I do occasionally end up with data classes that shadow models,
[00:28:26] Unknown:
or at least parts of models, that I need in a particular context. Sometimes it is as simple as calling .dict() on a model and getting back that dictionary and using that. But in general, I think the best practice would be to continue passing the model around and using it directly, rather than converting it to a dictionary too early. And another element
[00:28:49] Unknown:
of using Pydantic within these applications, particularly using the model terminology, is that people might get confused, for those who are used to working with ORMs in different web frameworks such as Django. And I'm wondering what the options are for integrating Pydantic with some of the other elements of the ecosystem, starting with things like ORMs, using it within Django or with things like SQLAlchemy for
[00:29:17] Unknown:
doing the validation of the data as it's flowing into the application and then being able to easily convert it into a database object? Yeah. I'm gonna put Django to one side, because I think the best and the worst of Django is that it's batteries included. And if you're going to use Django, you're probably best using vanilla Django and Django REST framework and leaving Pydantic out of it. I'm sure that there are some cases where people do build stuff with Django but use Pydantic, but I definitely haven't. SQLAlchemy and other ORMs are a more nuanced question. Pydantic has an ORM mode, which basically allows it to inspect the attributes of an ORM class and build a Pydantic model from there. I personally am not particularly pro ORMs. I would much rather write my queries in raw SQL or whatever than have this ORM step in between. So I haven't done that much recently with ORMs in Pydantic, and I would say ORMs are quite often a mistake full stop, so I don't put that much effort into using them myself. But I know many people do, and you do find yourself having to define your data twice, in the form of a Pydantic model and a SQLAlchemy model. I think that's just unavoidable, at least in most cases; there are no doubt exceptions where you could generate one from the other. But I would say unless you had thousands of different tables and therefore models, it wouldn't be worth it. And talking about it in the context of databases also brings up the interesting question of the ability
[00:30:43] Unknown:
to define Pydantic classes that relate specifically to other classes, so that you can do things like joining across different data objects, or specify relations of those different objects as data flows throughout your application. And I'm wondering what you have found in terms of some of the advanced usage capabilities of Pydantic that
[00:31:04] Unknown:
are not necessarily obvious at first blush. One of the things I've been amazed by is how advanced many people's usage of Pydantic has been, in ways which I definitely haven't explored myself. Pydantic doesn't have, and it's an open issue, a way of avoiding recursive links between models. So if, for example, you have a user linked to a pet and then you have the pet linked back to the user, that can lead to recursive problems in Pydantic. It's an open issue; I'd love someone to come along and help with that. But in short, I haven't solved that, because I suspect that models like that are an anti pattern in the first place, and so they just shouldn't exist. I would much prefer myself to keep my models individual and not connected, and have an integer field if there's a foreign key, for example, rather than having some implicit link to another model that automatically does a query that I can't see. Because I think that complexity gets you into hot water when it's not really saving you enough time. In terms of other advanced usages or advanced features of Pydantic, the generic models built by David Montague are scarily powerful. If you think about generics in the context of Python typing, generic models are like that, but with validation on top. So you can define a model that has some generic type or types associated with it, and Pydantic will then go and do the validation based on dynamic types within the definition of the model. Another of the powerful features of Pydantic is custom types with custom validation and custom schemas, which I don't think people are aware of enough and which don't get used enough. Probably the documentation could use some work there.
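The flat, unconnected style he prefers might look like this minimal sketch; the model names are invented for illustration:

```python
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str

class Pet(BaseModel):
    id: int
    name: str
    owner_id: int  # a plain integer foreign key, not a nested User model

# No recursive User <-> Pet link: resolving owner_id to a User is an explicit
# query in application code, not an implicit one hidden inside the model.
pet = Pet(id=1, name="Rex", owner_id="42")  # the string "42" is coerced to 42
print(pet.owner_id)
```

The trade-off is that you write the lookup yourself, but there is no hidden query and no recursion for Pydantic to untangle.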
One of the things that I often find myself suggesting as solutions to people who ask questions is custom base models with custom config, and even modified methods like dict or json on a model, which gets around lots of people's requests for more features. And then the validation error system in Pydantic is quite complex. It allows things like custom translations and customized messages on errors, which again probably aren't documented well enough, because people don't seem to be aware of that and the power of what it can do. And another interesting capability that you touched on briefly earlier is the
[00:33:17] Unknown:
way to be able to generate things like a JSON schema from a Pydantic model. And I'm wondering if there is the capability to do something like that in reverse, where you have a schema definition and then generate the corresponding model object for validating other instances
[00:33:37] Unknown:
of that schema? There are. There are some third party projects, I think at least two out there, that generate either Python code or Python models. The problem with that is that things like mypy and static typing and your own intuition aren't going to work, because the model isn't defined in code somewhere. So in general, I would say that the Python code should be your single source of truth about the definition of what your data should look like, and then you should generate the schema from there. But, yeah, there are tools that can generate Python code to represent a schema, and I'm sure there are contexts in which they're useful. I haven't used them
[00:34:11] Unknown:
myself. And I'm wondering too if you have explored the space of generating other types of schema objects beyond just JSON Schema. So, for instance, in the data engineering context, being able to use a Pydantic model to create instances of Avro objects or Parquet rows, things like that? I haven't myself; I haven't worked on that. I would have said that the best approach would be to work from the current JSON schema dict and go about generating them from there, but I haven't had any experience myself. And then as far as integrating Pydantic with other frameworks, because of the fact that it is largely just vanilla Python, it seems like it's fairly straightforward for things like using it in the settings module of a Django project. But what have you found to be some of the useful tips in terms of integrating with things like maybe the Dagster project for ETL workflows, or Pyramid or Flask among the web frameworks, or frameworks of other types where people might be trying to use the data validation capabilities? I've used it quite a lot with libraries like Starlette. Obviously, it's a cornerstone of FastAPI. I know people use it quite a lot for settings management in Flask; I think there are some libraries around that. I haven't used Flask for a few years, so I haven't been working with that. It's become more and more a cornerstone
[00:35:32] Unknown:
of the data pipelines used in machine learning projects. It's kind of been amazing to see. It was an application I had never thought of before, but now you see all of these quite popular machine learning packages, like DeepPavlov and Transformers from Hugging Face, using Pydantic for both settings management and for parsing the data before doing the machine learning, which has been really interesting. I haven't had much experience with them myself, but the kind of amazing thing about building Pydantic has been seeing other people pick it up and run with it and do stuff I never thought of. And what are some of the other interesting or innovative or unexpected ways that you have seen Pydantic used that you have been particularly
[00:36:14] Unknown:
surprised or,
[00:36:15] Unknown:
impressed by? Probably the most surprising thing for me has been big companies you would have heard of, like Microsoft, IBM, AWS, the NSA, Uber, and Salesforce, using Pydantic, which is never something I would have expected when I first hacked something together and released it on PyPI. In terms of particular projects, Facebook have their fastMRI project for making MRI scanning faster, and their ReAgent machine learning library, that use Pydantic. Microsoft use Pydantic through FastAPI for core Windows and Office services, which is amazing. There's a Mexican neobank called Cuenca, I hope I pronounced that right, who use Pydantic for their interbank transfer validation. The Molecular Sciences Software Institute use Pydantic a lot; as far as I know, they're using it for their COVID response. And there are lots of other projects in academia and in industry. Each of those projects is cool, but the most gratifying thing for me has been seeing the sheer number and diversity of different projects that have used it in ways I wouldn't have thought of. And particularly for some of the scientific contexts, have you found people using Pydantic alongside
[00:37:22] Unknown:
things like Pint for handling unit conversions, and incorporating that validation or transform logic within the Pydantic models? One of the
[00:37:33] Unknown:
frustrations I find writing open source code is that you can see the open source tip of the usage iceberg, but you can't see the closed source usage. And so it frustrates me. It's tantalizing to be able to see some of it and guess what other people are doing with it, but not to be able to go in and actually understand what people are doing with it. So the short answer is I just don't know, because only some of those projects are open. I'm sure if I spent some weeks digging away, I would find out what people are doing with it, and I'm sure that would be very useful for me in terms of how I develop it further, but I haven't done that. I would say a lot of the usage is closed source, and so you just don't know. But it's interesting, because you get occasional intuition about what people are doing: I had a credit checking agency point out a bug in a regex the other day, and it had never occurred to me that a company like that would be using it. So you occasionally get a hint at what's underneath the waterline of that iceberg, but mostly I don't know. And in terms of your own experience of building and growing the Pydantic project, what are some of the most interesting or challenging or unexpected lessons that you've learned in that process? It's been really fun working with some big companies. I would do that again, either commercially or just for free, out of curiosity to see how large organizations use it. There's a strange paradox when you start writing open source: on day one, you desperately hope some other people will install it and use it. And be careful what you wish for, because three years later I now spend one to two hours a day working on Pydantic, mostly just answering issues, which definitely wasn't what I planned to do, but it's been really interesting.
I talked earlier about humility and remembering what it was like not to know how to do things. Answering a lot of, quote, dumb, unquote, questions from people who are relatively new to Python has been a good experience in reminding me how much I know, in a sense, and how lucky I am to be able to write code like I can. And so I do my best to give back and not to get frustrated by people asking questions that I think they could have worked out with a few minutes on Google. Some of the other challenging aspects
[00:39:37] Unknown:
of open source maintainership are also things like knowing when to say no, the bus factor of the project, and figuring out what the succession path for maintainership is if you decide to step away from the project, or you're no longer actively using it and you start to want to spend your time on other things. And I'm wondering what your thoughts are on your own personal approach to that, or any other thoughts that you have in this sort of discussion of maintaining open source projects? Yeah. I think it's really hard, and I wouldn't deny that the bus factor is quite a problem with Pydantic right now, given that I do the majority of the work. It is really gratifying to see other people,
[00:40:18] Unknown:
particularly Sebastian Ramirez and David Montague, but lots of other people as well, contributing to it and taking time to answer questions. But I think it's an outstanding big problem. I think that we do a lot of patting ourselves on the back and saying how great open source is, but I was reading an essay by John Mark, which I'll leave a link to, that claims, perhaps slightly hyperbolically, that open source has failed. I wouldn't go quite that far, but I definitely don't think it's quite as rosy as it should be. And I definitely see the problem that all these big and very profitable companies use a library like Pydantic, and of course many more, but don't contribute financially. And that then leads to a bus factor with projects like this, and even more so with FastAPI, if you look. I think Sebastian is amazing, and he's done lots, and it's got incredibly popular very quickly. But I don't think that any of the organizations that use it have succession planning, or have thought about what would happen if those libraries stop being maintained. I think the good thing is, at least with Pydantic, it's relatively stable. I have lots of interesting things I wanna do in v2 that I will, one day, get around to working on. But without that, it's not like it's gonna suddenly crash or fall to the ground if it doesn't have any work on it for a few weeks or a few months. And in terms
[00:41:31] Unknown:
of selecting the sort of libraries to use within an application, if people are looking at Pydantic, what are the cases where it's the wrong choice and they might be better suited using maybe built in data classes or something,
[00:41:43] Unknown:
like Marshmallow or some other settings management library? I'd say the first case is that it's not a substitute for strictly typed compiled languages like C++ or Rust or Go. You see occasional questions where I wonder whether the real answer is: you shouldn't be using Python. Pydantic helps get around some of those problems, but it's not gonna be a substitute for those languages. I also think that often the old fashioned Python approach of duck typing and catching the error, easier to ask forgiveness than permission, just try it, works well. And if you end up validating every single input in Python, it gets really, really slow. So often, if you are reasonably sure about the inputs and there's no particular security concern, just calling it and seeing what happens is a better approach. It's also obviously wrong, as I said earlier, in the validation context for, say, a testing case where you want to confirm that a webhook or an API is giving you the right data; Pydantic is not the right tool for that, because it leans, and always will, towards coercion over strictness. But if you are determined to write Python and you wanna do validation, I think Pydantic is probably the best tool in most scenarios.
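The duck typing approach he contrasts with validating everything can be sketched in plain Python; the function names here are invented for illustration:

```python
# EAFP ("easier to ask forgiveness than permission"): just try the operation
# and handle the failure, rather than validating every input first.
def total_eafp(values):
    try:
        return sum(int(v) for v in values)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"bad input: {exc}") from exc

# The validate-everything alternative walks the data an extra time, which is
# the kind of overhead that makes exhaustive validation in Python slow.
def total_validated(values):
    for v in values:
        if not isinstance(v, (int, str)):
            raise ValueError(f"unsupported type: {type(v).__name__}")
    return sum(int(v) for v in values)

print(total_eafp(["1", "2", 3]))  # 6
```

For trusted inputs the EAFP version does strictly less work; validation libraries earn their cost at the boundaries, where the data is untrusted.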
[00:42:50] Unknown:
That's obviously my biased opinion. And as you continue to work on and maintain the project, what are some of the things that you have planned for the future of it, and maybe some of the ways that you're planning on using it in your own work? Yeah. I'm heading towards v2 now, and I've got quite a lot of big features that I wanna work on in v2, some stuff I wanna break to get right.
[00:43:10] Unknown:
So currently you can think of validators like a list of functions: each function is called in turn, and the output from one is given as the argument to the next, unless an error occurs, of course. That's slightly slow and somewhat confusing. I would much prefer validators to work a bit like middleware. So it's a function stack, and each function calls the next function along, but from the outside that just looks like one function. That should be faster in the common case where you're just doing one piece of validation, but also much more powerful, because you could do your checks before or after calling the next function without having to mess with the order of those validators.
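That middleware idea can be sketched in plain Python. To be clear, this is not Pydantic's actual implementation or API, just the shape of the design being described, with invented names:

```python
# Each "validator" takes the next function in the stack and returns a single
# callable, so it can act before or after delegating, like HTTP middleware.
def strip_whitespace(next_validator):
    def run(value):
        return next_validator(value.strip())   # work *before* the next step
    return run

def require_non_empty(next_validator):
    def run(value):
        result = next_validator(value)
        if not result:                         # work *after* the next step
            raise ValueError("value must not be empty")
        return result
    return run

def build_stack(*middleware):
    handler = lambda value: value              # innermost handler: identity
    for layer in reversed(middleware):
        handler = layer(handler)
    return handler                             # from outside: just one function

validate = build_stack(strip_whitespace, require_non_empty)
print(validate("  hello  "))  # hello
```

Each layer chooses whether to transform before delegating or check after, so ordering concerns live inside each validator rather than in a global list.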
I also think one of the things Marshmallow does well is that it makes serialization a kind of first class concern. So it's not just about the parsing stage, but it's also about the output stage, and Pydantic has quite a few things I wanna work on in that regard: input and output aliases; computed fields, so fields on a model that don't come explicitly from an input but instead are computed, either eagerly or lazily, based on other fields; and a whole bunch of other stuff. Have a look at the v2
[00:44:16] Unknown:
milestone on GitHub, and I'd love some feedback. And are there any other aspects of your work on Pydantic, or the overall field of data validation and type checking, or any ways that you're looking for help from the community, that we didn't discuss and that you'd like to cover before we close out the show? I would say that I've got quite a lot of issues with a label called feedback. I love people's feedback. Obviously, there's a risk of just bike shedding, but getting people's input is really useful before you release a feature, rather than them being furious once you've done it. So I don't need people to write code or submit issues; just a bit of feedback, a plus one here or there, makes it much easier to work out what people are looking for. Other than that, nothing in particular. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose something that I had forgotten about for a while and remembered during a conversation with my family the other day: devil sticks, also known as juggling sticks or flower sticks, as a way to pass the time and have something to keep your idle hands busy. So if you're looking for something to do or play with, I'll definitely recommend checking that out. And so with that, I'll pass it to you, Samuel. Do you have any picks this week? I do.
[00:45:29] Unknown:
In terms of books, I would thoroughly recommend Flash Boys by Michael Lewis. I know it's not that recent, but it's an awesome book. In fact, everything by Michael Lewis; I'm a massive fanboy of his. Then, in a more computing specific context, there's Algorithms to Live By by Brian Christian and Tom Griffiths, which is an awesome book; it's much better than its title suggests. In terms of TV, I really enjoyed Sex Education on Netflix. If you're bored at home at the moment, I would thoroughly recommend it. It's extremely funny. And then in terms of tech, I recently found ngrok.com, which creates a tunnel from a port on your local machine to the public internet, which is awesomely helpful when developing and you want
[00:46:08] Unknown:
to show something to someone, or if you want to have an HTTPS connection to a local port. That was a really nice find, a really useful tool. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Pydantic. It's definitely a very interesting project and one that I'm planning to make use of in some of my work. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Thank you very much. You too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Samuel Colvin
Samuel Colvin's Background
Introduction to Pydantic
Use Cases for Pydantic
Feature Requests and Challenges
Comparison with Other Libraries
Technical and Conceptual Challenges
Data Classes vs Pydantic
Overhead and Complications
Strictness vs Coercion
Handling Data Loss
API Design and Philosophy
Implementation and Evolution
Typing in Python
Getting Started with Pydantic
Edge Cases and Best Practices
Integration with ORMs and Frameworks
Advanced Usage and Features
Schema Generation
Integration with Other Frameworks
Interesting Use Cases
Lessons from Open Source
Maintaining Open Source Projects
When Not to Use Pydantic
Future Plans for Pydantic
Community Feedback
Picks and Recommendations
Closing Remarks