Summary
Building any software project is going to require relying on dependencies that you and your team didn’t write or maintain, and many of those will have dependencies of their own. This has led to a wide variety of potential and actual issues ranging from developer ergonomics to application security. In order to provide a higher degree of confidence in the optimal combinations of direct and transitive dependencies, a team at Red Hat started Project Thoth. In this episode Fridolín Pokorný explains how the Thoth resolver uses multiple signals to find the best combination of dependency versions to ensure compatibility and avoid known security issues.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error-handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open-source, low-code templates to help you quickly build solutions with the tools you already use. Go to pythonpodcast.com/shipyard to get started automating with a free developer plan today!
- Your host as usual is Tobias Macey and today I’m interviewing Fridolín Pokorný about Project Thoth, a resolver service that computes the optimal combination of versions for your dependencies
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Project Thoth is and the story behind it?
- What are some examples of the types of problems that can be introduced by mismanaged dependency versions?
- The Python ecosystem has seen a number of dependency management tools introduced recently. What are the capabilities that Thoth offers that make it stand out?
- How does it compare to e.g. pip, Poetry, pip-tools, etc.?
- How do those other tools approach resolution of dependencies?
- Can you describe how Thoth is implemented?
- How have the scope and design of the project evolved since it was started?
- What are the sources of information that it relies on for generating the possible solution space?
- What are the algorithms that it relies on for finding an optimal combination of packages?
- Can you describe how Thoth fits into the workflow of a developer while selecting a set of dependencies and keeping them up to date over the life of a project?
- What are the opportunities for expanding Thoth’s application to other language ecosystems?
- What are the interfaces available for extending or integrating with Thoth?
- What are the most interesting, innovative, or unexpected ways that you have seen Thoth used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Thoth?
- When is Thoth the wrong choice?
- What do you have planned for the future of Thoth?
Keep In Touch
Picks
- Tobias
- Fridolin
Links
- Red Hat
- Project Thoth
- Thamos CLI
- PyPA Advisory Database
- Project2Vec
- Thoth Prescriptions
- Thoth: Egyptian God
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover.
Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service, and don't forget to thank them for their continued support of this show. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? SelectStar is a data discovery platform that can show you where your data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With SelectStar's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host, as usual, is Tobias Macey, and today I'm interviewing Fridolín Pokorný about Project Thoth, a resolver service that computes the optimal combination of versions for your dependencies. So, Fridolín, can you start by introducing yourself?
[00:01:57] Unknown:
Thanks for having me. I work as a software engineer in Red Hat's Office of the CTO, in a group that is called the Emerging Technologies Group. And more specifically, I work on a project that is called Thoth, and it's a project made up of multiple parts. I think the most interesting one for this talk is Thoth's resolver, which can resolve application dependencies, similar to pip, but in a more
[00:02:24] Unknown:
clever fashion, let's say. And do you remember how you first got introduced to Python?
[00:02:29] Unknown:
It was back in 2012, I think, during my university studies. And I was impressed by Python because it's really like pseudocode. You just write it on a whiteboard while you're discussing a problem, and then you type it into a file, and it suddenly works. It's like magic.
[00:02:52] Unknown:
And so in terms of Project Thoth that we're discussing today, I'm wondering if you can give a bit more context around what it is that you've built there, some of the story behind how it came to be, and what the overall goals are.
[00:03:05] Unknown:
The project started in 2018 as a research project in the AI division at Red Hat. We wanted to create a system that would help developers create healthy applications when developing Python applications. There were two main use cases. One was developing Python applications locally: when people use pip, it's sometimes not that clear which versions of libraries should be used and if they are compatible with each other. But there was also another use case, like running Python in containers in a cluster, where you have containerized environments and would like to utilize the software environment that is provided by the containerized environment, and make sure that the software layer of an application works well with the hardware, so you get the best possible performance, and also security of the application that you are developing, and things like that.
[00:04:02] Unknown:
As far as the types of problems that you have seen, and you mentioned pip doesn't always give a very good or clear answer as to what the actual versions of the transitive dependencies are, what are some of the types of problems that can be introduced by having versions that aren't optimally determined, or that aren't locked to a specific version or a specific release for those non-primary dependencies?
[00:04:30] Unknown:
So we spotted multiple issues when it comes to library incompatibilities. For example, we had an issue when installing Pillow together with NumPy, where API calls were not respected properly, and that made the application crash. Or there were issues with Python interpreter versions: some libraries were not ready for more recent Python interpreters, and that caused issues. Besides this Python layer, you can find other issues when using, for example, TensorFlow. If you use TensorFlow together with a GPU, you need to make sure that everything fits correctly, so you have a proper CUDA version installed, things like that, for example the right glibc version. And the only way to make this work is to basically browse the documentation of the library and make sure that the proper CUDA version is installed.
And this causes headaches for developers, because it's something that could be answered, let's say, automatically during the resolution process. And the resolver can say that this combination of libraries in your environment will not work properly.
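The failure mode described above can be sketched in a few lines: a toy resolver enumerates version combinations and rejects any pairing flagged in a knowledge base of known incompatibilities. The package names, versions, and the `known_bad` table here are illustrative assumptions, not Thoth's actual data or API.

```python
from itertools import product

# Hypothetical, simplified package index: package -> available versions.
index = {
    "pillow": ["8.0", "9.0"],
    "numpy": ["1.19", "1.21"],
}

# Hypothetical knowledge base of combinations known to be broken,
# e.g. from observed API or ABI incompatibilities.
known_bad = {("pillow", "8.0", "numpy", "1.21")}

def resolve(index, known_bad):
    """Return all fully pinned combinations that avoid known-bad pairs."""
    names = sorted(index)
    valid = []
    for combo in product(*(index[n] for n in names)):
        pins = dict(zip(names, combo))
        broken = any(
            (a, pins[a], b, pins[b]) in known_bad
            for a in pins for b in pins if a != b
        )
        if not broken:
            valid.append(pins)
    return valid

solutions = resolve(index, known_bad)  # 3 of the 4 combinations survive
```

A real resolver also has to honor version-range constraints and transitive dependencies, but the core idea of pruning the solution space with aggregated knowledge is the same.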
[00:05:49] Unknown:
As far as the overall ecosystem in Python for being able to manage the dependencies, pip has been maybe the longest running one that's still in use, and there are a number that have grown up alongside it, such as Poetry and pip-tools. And I'm wondering if you can just talk to some of the ways that what you're doing with Thoth is either an improvement on those tools or works alongside them, and, specifically in terms of the dependency resolution, some of the ways that Thoth approaches it in a different manner than what those tools provide.
[00:06:26] Unknown:
If you take a look at pipenv, Poetry, pip-tools, these resolvers all run locally. So developers install applications locally, and the actual resolution process is done on client machines. The design of Thoth is different. Thoth's tooling aggregates information about the client's environment, where clients run their applications, and sends this information to Thoth's back end. And the back end, based on pre-aggregated knowledge about the dependencies, can resolve application dependencies specifically for users, for these clients. You can see it as a guidance service in this case, because it does not solely resolve these applications just by looking at the dependency graph, but it also considers additional knowledge about these dependencies.
It can be, for example, the quality of these dependencies, like security aspects. Does a given dependency have some vulnerability? Or do these dependencies work well together? Or what about communities? Are these packages backed by a good community, by a healthy community, so that I can trust these open source packages? If I'm using, for example, recurrent neural networks, the resolution process can be different for my use case, and the resolver can behave differently. So it's considering software and hardware information aggregated by the client's tooling and also doing some kind of contextual resolution.
And based on static source code analysis, the resolver can guide which packages and package versions a user should use.
[00:08:14] Unknown:
As far as the implementation of Toth itself, I'm wondering if you can talk to some of the various components in the overall ecosystem of what project Toth encapsulates? And then in terms of the actual resolver itself, some of the ways that that is architected and implemented to be able to provide this resolution as a service?
[00:08:36] Unknown:
Client tools include the Thamos CLI, which is a command line interface for contacting Thoth. It does the mentioned aggregation of information about the environment and sends it to the back end, as well as consuming results that are computed by the back end. We have also jupyterlab-requirements. This is an extension to Jupyter Notebooks that can contact Thoth as well. And besides that, it embeds information about packages into Jupyter Notebooks. So if you share a Jupyter Notebook, then you have the information to make the packages reproducible, and you know what packages were used during the resolution, during the application development.
We also provide bots that can be used by the community. These bots perform releases of Python modules, but also update repositories with bot recommendations. And these are the client tools that can be used. All of them contact the back end, and the actual resolution is implemented in a component that is called adviser. It's a stochastic resolver that implements temporal difference learning. In that case, the whole dependency graph is explored to see what the possibilities are for resolving application dependencies, and the resolver learns how to find the best possible combination of dependencies.
Then there are other components that make sure that the data stored on the back end is fresh and available to be used during the actual resolution process.
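As a rough illustration of the idea, and not Thoth's actual adviser implementation, here is a toy sketch of temporal-difference-style learning over a tiny dependency search space: the resolver repeatedly samples version combinations, observes a score, and nudges per-choice value estimates toward the observed reward. The packages, the `score` function, and all constants are invented for the example.

```python
import random

random.seed(0)

# Toy search space: each package has candidate versions.
candidates = {"a": ["1.0", "2.0"], "b": ["1.0", "2.0"]}

# Hypothetical hidden score of a fully pinned combination (higher is
# better); in Thoth this would come from aggregated knowledge about
# packages, not from a lookup like this.
def score(pins):
    return 1.0 if (pins["a"], pins["b"]) == ("2.0", "1.0") else 0.1

# Value estimate for each (package, version) choice, learned over time.
values = {(p, v): 0.0 for p, vs in candidates.items() for v in vs}
alpha, epsilon = 0.1, 0.2  # learning rate and exploration rate

for episode in range(500):
    pins = {}
    for pkg, vs in candidates.items():
        if random.random() < epsilon:  # explore a random version
            pins[pkg] = random.choice(vs)
        else:  # exploit the current value estimates
            pins[pkg] = max(vs, key=lambda v: values[(pkg, v)])
    reward = score(pins)
    for pkg, v in pins.items():  # TD-style update toward the reward
        values[(pkg, v)] += alpha * (reward - values[(pkg, v)])

# After training, pick the highest-valued version of each package.
best = {p: max(vs, key=lambda v: values[(p, v)]) for p, vs in candidates.items()}
```

The real adviser walks an actual dependency graph and scores states with many signals, but the explore-then-reinforce loop captures the "stochastic resolver" idea described above.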
[00:10:22] Unknown:
As far as the goals of the project, you mentioned that it started off as a research exercise and has now grown into a full blown engineering project. I'm wondering if you could just talk to some of the ways that it went through that evolution and how the initial ideas of what was trying to be achieved with that research project has evolved into the current scope of what you're trying to build with Toth?
[00:10:47] Unknown:
We had pretty quick development, and I personally liked it, because it was, like, let's create a prototype, let's prove it works, and if it works, let's try to incorporate it into the system. I think we took many blind paths when it comes to implementation. It was sometimes completely rewritten. I think we switched databases twice. Originally, we used a graph database, but the problem that we were trying to solve proved not to require a graph database. And we now use a relational database that proved to be scalable for this problem and proved to work.
[00:11:30] Unknown:
As far as the pieces of information that the resolver is using for determining what the overall possible solution space looks like and then figuring out the optimal result, I'm wondering what are the core pieces of information that it's relying on? You mentioned sending along some information about the environment where the application is running as well as the set of dependencies that you're trying to work with. So I'm just curious if you can talk to how that possible solution space is constructed and how the Thoth resolver is able to explore that space and find the optimal result.
[00:12:03] Unknown:
Thoth's resolver combines multiple sources. For example, we use the advisory database that is maintained by the Python Packaging Authority, and it stores information about vulnerabilities in Python packages. We also use scans of container images as produced by Quay, and these include information about vulnerabilities in container images on the RPM level, for example. Then the background data aggregation logic also gets information from GitHub. So, for example, it takes a look at contributors, how many contributors a project has, or if it is forked, or if it is archived, so users and consumers of these projects are warned that the project is archived.
We also compute popularity. So based on stars and the number of forks, we say that a given project is popular or not that popular, and whether the user is possibly consuming some not-that-well-known package in the ecosystem. We also take a look at how these projects are updated, so if they have a healthy community, if the project has recent commits, and things like that. We also take a look at PyPI, which hosts these packages. We take a look at downloads and maintainers. So if a project has a low number of maintainers, it might warn users that the project might become unmaintained and lose contributions over time. We also take a look at the Open Source Security Foundation. They compute something that is called Security Scorecards.
They check if the community, for example, signs releases of packages, or if they use CI checks, so pull requests are properly reviewed and changes go properly into the code base.
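To show how signals like these might be combined, here is a deliberately simplified health-score sketch. The weights, thresholds, and signal names are assumptions made for illustration; this is not the formula Thoth or OpenSSF actually uses.

```python
# Hypothetical project-health signals, similar in spirit to what Thoth
# aggregates from GitHub, PyPI, and OpenSSF Security Scorecards.
def health_score(stars, forks, maintainers, archived, signed_releases):
    """Combine signals into a single 0..1 score; the weights below are
    illustrative assumptions, not Thoth's actual scoring."""
    if archived:
        return 0.0  # archived projects get a hard warning
    popularity = min(1.0, (stars + 2 * forks) / 10_000)
    maintenance = min(1.0, maintainers / 5)
    security = 1.0 if signed_releases else 0.5
    return round(0.4 * popularity + 0.3 * maintenance + 0.3 * security, 3)

print(health_score(stars=8000, forks=1500, maintainers=3,
                   archived=False, signed_releases=True))  # -> 0.88
```

A resolver can then prefer, or warn about, candidate packages based on a score like this alongside the hard compatibility constraints.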
[00:14:15] Unknown:
As far as the actual process of starting to work with Thoth: say I have an existing project, I have my dependencies tracked in a requirements.txt file or a pyproject.toml, and now I want to take advantage of Thoth's resolver to make sure that I am not subjecting myself to security issues or that I'm not inadvertently introducing some dependency conflicts. And so I want to actually, you know, send my requirements to Thoth and make sure that I can improve the overall dependency list from that and, you know, generate a lock file that I can actually use for production?
[00:14:51] Unknown:
The best way to start is to use the Thamos CLI in this case. You can install it using pip install thamos, and in that case, you will have the thamos command available in your environment. It will show you a help message if you run thamos --help. The most interesting or the most important command could be thamos add, which adds dependencies to your project. Thamos itself consumes formats that are similar to Pipenv's. So if you use Pipenv, with a Pipfile and Pipfile.lock, then the workflow naturally integrates. But Thamos can also consume requirements files.
So once you add requirements to your project, then you can call thamos config. It generates a configuration file and automatically detects what is present in your environment. In this configuration file, you can find information about the software that is present in the environment. For example, if you use CUDA, the tooling will store CUDA information in the configuration file, but also information about the operating system and the Python version that you are using. After this initial setup, you can call thamos advise, which will contact the back end with the provided information and will wait for the back end to respond.
Once the back end computes recommendations and provides a lock file for you with all of the dependencies pinned to a specific version, then Thamos obtains this information from the back end and can install your dependencies using thamos install. The overall workflow is similar to Pipenv's, but that special thamos advise command is somewhat different from pipenv lock.
[00:16:44] Unknown:
I know that on the landing page for Thoth, it also makes a mention of being a recommendation engine for your Python projects. And I'm wondering if you can talk to some of the types of recommendations that it can make, what kinds of information you need to feed it to be able to generate some of those recommendations, and how that might differ from just using a list of dependencies that I wanna get a lock file for.
[00:17:08] Unknown:
The resolver itself works on requirements files, and also on constraints that we provide to the resolution process, along with information about environments that is stored in the configuration file. So it's very similar to what pip can do. When you would like to explore what the possibilities are for developing, say, a web application, you would like to get information about what libraries are out there and what libraries you can use. We have developed a model that is called project2vec. It basically creates a vectorized representation of Python projects that are available on PyPI.
And in that case, you can see clusters of packages. So let's say one cluster keeps packages that are related to web applications, and another cluster can store information about which packages are related to machine learning. And a nice thing about this model is that you can query it, because these projects are represented using a vectorized representation. In that case, you can create your own vector, and that vector can encode that you are interested in a web application that runs on Python 3.8 and Python 3.9. The vector then forms a mask that can be used to query the larger project2vec representation.
This model is under development, and it's not available on Thoth endpoints yet.
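The mask-query idea can be illustrated with a toy model, assuming simple binary feature vectors rather than the real learned representation; the feature names and project tags below are invented for the example.

```python
# A toy sketch of the idea behind project2vec: represent each project as
# a binary feature vector, then query with a mask vector. The features
# and project tags here are illustrative assumptions, not the real model.
FEATURES = ["web", "ml", "py38", "py39"]

projects = {
    "flask":      {"web", "py38", "py39"},
    "django":     {"web", "py38", "py39"},
    "tensorflow": {"ml", "py38"},
}

def vectorize(tags):
    """Encode a set of tags as a 0/1 vector over the known features."""
    return [1 if f in tags else 0 for f in FEATURES]

def query(mask_tags):
    """Return projects whose vectors match every bit set in the mask."""
    mask = vectorize(mask_tags)
    return sorted(
        name for name, tags in projects.items()
        if all(not m or v for m, v in zip(mask, vectorize(tags)))
    )

# "I want a web framework that supports Python 3.9."
result = query({"web", "py39"})  # -> ["django", "flask"]
```

The real model uses a richer vector space, but the mechanism of encoding a query as a vector mask over project vectors is the same.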
[00:18:45] Unknown:
Hey. Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share Python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open source low code templates to help you quickly build solutions with the tools you already use. Go to pythonpodcast.com/shipyard today to get started automating with a free developer plan.
What are the interfaces available for people who want to, you know, customize its behavior or be able to contribute some additional functionality to it? The whole project is open source, so you can find the sources
[00:19:45] Unknown:
on GitHub, in the thoth-station organization. We also provide a README on how you can set up your own Thoth locally, with some limitations or some restrictions. And if you would like to contribute directly to what the recommendation engine can provide to users, there's an interface that is called prescriptions. Prescriptions are basically YAML files that are stored in a Git repository, and this Git repository acts as an open database that is consumed by the resolver. These prescriptions state information about packages and projects. So if you would like to provide some information about projects to users of Thoth, you can state it in a prescription in a YAML file.
And you can also state these library incompatibilities in prescriptions. So prescriptions really act as a database that can be consumed by the resolver to provide high quality resolution. Prescriptions provide a way to state that some libraries have overpinning issues or underpinning issues or do not perform well in certain environments, and some background data aggregation creates prescriptions automatically and stores them in the prescriptions repository.
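To give a feel for the format, here is a hypothetical sketch of what a prescription entry might look like; the unit name, match fields, and message are invented for illustration, and the actual schema is defined in the thoth-station prescriptions repository.

```yaml
# Hypothetical sketch of a prescription; see the thoth-station
# prescriptions repository for the real schema and unit types.
units:
  wraps:
    - name: ExampleIncompatibilityWrap
      match:
        state:
          resolved_dependencies:
            - name: examplepkg
              version: "<2.0"
      run:
        justification:
          - type: WARNING
            message: "examplepkg<2.0 is known to misbehave in this environment"
```

Because the files live in a Git repository, contributing new knowledge to the resolver is a matter of opening a pull request against that database.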
[00:21:12] Unknown:
In terms of the actual functionality of Thoth, there's nothing that seems to be inherently coupled to the Python ecosystem. And I'm curious what you see as the opportunity and the work necessary to be able to extend it into other language communities, for being able to resolve, say, Node dependencies or Java dependencies and being able to, you know, bring some of this functionality to other language environments?
[00:21:37] Unknown:
It's an excellent question. I think some parts of the infrastructure are not tightly coupled to Python, so they can be reused, like the queuing system that is implemented, or the API service that accepts inputs from users. So the infrastructure could be extended. Then there are some aspects of each language ecosystem, such as Node.js, or Golang, or Java, that would deserve some kind of spike to check if all these ideas are suitable also for those language ecosystems. But I think at least some design patterns can apply also to other language
[00:22:21] Unknown:
ecosystems. As far as the applications of Thoth, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:22:29] Unknown:
Some teams use editable installations, and these editable installations do not fit well into the cloud-based resolution process, because they require certain specific environments set up on client machines. The resolver is designed to work solely on packages that are released on Python package indices. So that's one of the things that we found challenging in the team. We have discussions about whether editable installations should be supported or not.
[00:23:06] Unknown:
And in your own experience of working on Thoth and helping to evolve the project and working with end users of it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:23:17] Unknown:
I think one piece of advice that I can give is to set up quality checks as soon as possible, because technical debt is carried in the source code. So it's really good to set up some mypy checks or pre-commit checks to keep a repository healthy. We also found out that a graph database is not always the right choice for graph-related problems; graph databases have their own specific use cases, but that was not suitable for Thoth. We also found that having a good cluster setup and a stable cluster is very valuable, because, especially in the beginning, we were having issues with the clusters, but that's improved a lot over time. And now the OpenShift cluster that we are using is very stable, and some components could not be developed properly without this stability.
Also, naming is hard. We found different issues with Thoth's name and its pronunciation, especially if you are in a team that is distributed worldwide. And, yeah, Python is a cool language and has a very nice community. And it has one nice aspect: if you spot a not-that-performant part of your application, you can still optimize it. And that's also the case for Thoth, where the whole application is developed in Python, but some very core parts of the resolver are implemented in C/C++, and that helped to gain performance.
[00:24:58] Unknown:
As far as the naming aspect, one of the things I forgot to ask earlier is where the actual name of the project originated. Sort of, what inspired you to name it after the Egyptian god of knowledge?
[00:25:09] Unknown:
It was basically given by one of the directors, who came up with the idea for Thoth. And we really liked it, and thoth was not taken on PyPI, so we were very happy about that namespace and having Thoth's name. But we also have other services that are named after Egyptian gods. And as I said, naming is sometimes hard, and this brought smiles to the team, and it worked at the end of the day.
[00:25:41] Unknown:
And so for people who are interested in being able to improve the, I guess, quality of their dependency resolution, what are the cases where Thoth is the wrong choice and they're better suited just sticking with, you know, pip or Poetry or Pipenv or something like that? Thoth can resolve application dependencies of
[00:25:59] Unknown:
open source projects. So if you consume open source packages, Thoth can be nice to use. But if you use packages that are hosted privately, internally on a private package index, then Thoth doesn't know where these packages are hosted, cannot access them, cannot explore what the dependencies of these projects are, and cannot give guidance on these projects. So Thoth is a service that is suitable for packages that are hosted on PyPI or other publicly hosted indices.
[00:26:34] Unknown:
As you continue to build and iterate on the Thoth project and its ecosystem, what are some of the things you have planned for the near to medium term? We want to improve our UI, as Thoth also provides a user interface if you want to browse and search the results of
[00:26:51] Unknown:
advise. We also plan to integrate another component that is called license solver. It basically checks what licenses a package has and spots any discrepancies between the declared licenses, so users can really be warned about issues in licensing. We also plan to adopt the Nix store once it gets more mature in the Python ecosystem. And also, we would like to improve our community adoption. So if you have any feedback, feel free to try it out and send feedback to us.
[00:27:30] Unknown:
On the note of contribution, are there any particular areas of input that you're looking for or particular skills or backgrounds that would be helpful?
[00:27:39] Unknown:
I think extending Thoth's database with these prescriptions would be very helpful, because the more prescriptions and the more information about the resolved packages Thoth has, the better recommendations it can give. So if anyone spots an issue with a package or finds out that some packages deserve informational notes or warnings to users, it's very valuable to have this knowledge accumulated in one central place, so the whole Python community can benefit from it.
[00:28:10] Unknown:
Are there any other aspects of the Thoth project or the overall space of dependency resolution and dependency management that we didn't discuss yet that you'd like to cover before we close out the show?
[00:28:22] Unknown:
I think dependency resolution is not that trivial a task, and having information about which dependencies should be included in an application, what their quality is, how to consume them, and how to set up environments, that's something valuable. And if we have that knowledge accumulated, that would be great. So we extend the resolution process to not just checking what version ranges are acceptable for the application requirements that I have, but also including information about the actual packages, or some type of metadata about packages that are consumed and installed on clients' machines.
[00:29:05] Unknown:
Awesome. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and contribute to the Thoth project, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a musical group that I came across recently called Brass Against. They seem to mostly do cover songs, but they're a heavy metal group that has a horn section, hence the name. So it's interesting hearing them do covers of things like Tool songs and System of a Down songs with a horn section as a backing ensemble. So definitely worth checking out; a rather different style than I've heard most other places. So with that, I'll pass it to you, Fridolín. Do you have any picks this week?
[00:29:45] Unknown:
You can check out another project that was created within Thoth, and it's called micro-pipenv. It's a tool that extends pip, and it can install dependencies that are stated in Pipenv, Poetry, or requirements.txt files. And in that case, it can install dependencies very quickly, so you don't need to run Pipenv or Poetry, especially when you are installing dependencies in containerized environments, or when you just want to try and install dependencies into an environment from the files produced by these tools.
[00:30:23] Unknown:
Very cool. Definitely have to take a look at that. Well, thank you very much for taking the time today to join me and share the work that you're doing on Project Thoth. Very interesting project, and it's great to see more investment in the ecosystem and the community and helping people find the optimal set of dependencies for their projects. So I appreciate all the time and energy that you and your coworkers are putting into that, and I hope you enjoy the rest of your day. Thank you. Thanks for having me, and have a great day. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Fridolín Pokorný
Origins and Goals of Project Thoth
Challenges with Dependency Management
Components and Architecture of Thoth
Information Sources for the Thoth Resolver
Using Thoth for Dependency Resolution
Recommendation Engine and Future Plans
Extending Thoth to Other Languages
Lessons Learned and Technical Challenges
Naming and Community Contributions
Future Developments and Community Involvement
Closing Remarks and Picks