Summary
Building any software project is going to require relying on dependencies that you and your team didn’t write or maintain, and many of those will have dependencies of their own. This has led to a wide variety of potential and actual issues ranging from developer ergonomics to application security. In order to provide a higher degree of confidence in the optimal combinations of direct and transitive dependencies, a team at Red Hat started Project Thoth. In this episode Fridolín Pokorný explains how the Thoth resolver uses multiple signals to find the best combination of dependency versions to ensure compatibility and avoid known security issues.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error-handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open-source, low-code templates to help you quickly build solutions with the tools you already use. Go to pythonpodcast.com/shipyard to get started automating with a free developer plan today!
- Your host as usual is Tobias Macey and today I’m interviewing Fridolín Pokorný about Project Thoth, a resolver service that computes the optimal combination of versions for your dependencies
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Project Thoth is and the story behind it?
- What are some examples of the types of problems that can be introduced by mismanaged dependency versions?
- The Python ecosystem has seen a number of dependency management tools introduced recently. What are the capabilities that Thoth offers that make it stand out?
- How does it compare to e.g. pip, Poetry, pip-tools, etc.?
- How do those other tools approach resolution of dependencies?
- Can you describe how Thoth is implemented?
- How have the scope and design of the project evolved since it was started?
- What are the sources of information that it relies on for generating the possible solution space?
- What are the algorithms that it relies on for finding an optimal combination of packages?
- Can you describe how Thoth fits into the workflow of a developer while selecting a set of dependencies and keeping them up to date over the life of a project?
- What are the opportunities for expanding Thoth’s application to other language ecosystems?
- What are the interfaces available for extending or integrating with Thoth?
- What are the most interesting, innovative, or unexpected ways that you have seen Thoth used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Thoth?
- When is Thoth the wrong choice?
- What do you have planned for the future of Thoth?
Keep In Touch
Picks
- Tobias
- Fridolin
Links
- Red Hat
- Project Thoth
- Thamos CLI
- PyPA Advisory Database
- Project2Vec
- Thoth Prescriptions
- Thoth: Egyptian God
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, and dedicated CPU and GPU instances. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover.
Go to pythonpodcast.com/linode today to get a $100 credit to try out their new database service, and don't forget to thank them for their continued support of this show. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? SelectStar is a data discovery platform that can show you where your data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use. With SelectStar's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets.
Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host, as usual, is Tobias Macey, and today I'm interviewing Fridolín Pokorný about Project Thoth, a resolver service that computes the optimal combination of versions for your dependencies. So, Fridolín, can you start by introducing yourself?
[00:01:57] Unknown:
Thanks for having me. I work as a software engineer in Red Hat's Office of the CTO, in a group that is called the Emerging Technologies Group. And more specifically, I work on a project that is called Thoth, and it's a project made up of multiple parts. I think the most interesting one for this talk is Thoth's resolver, which can resolve application dependencies, similar to pip, but in a more
[00:02:24] Unknown:
clever fashion, let's say. And do you remember how you first got introduced to Python?
[00:02:29] Unknown:
It was back in 2012, I think, during my university studies. And I was impressed by Python because it's really like pseudocode. You just write it on a whiteboard while you're discussing a problem, and then you type it into a file, and it suddenly works. It's like magic.
[00:02:52] Unknown:
And so in terms of Project Thoth that we're discussing today, I'm wondering if you can give a bit more context around what it is that you've built there, some of the story behind how it came to be, and what the overall goals are.
[00:03:05] Unknown:
The project started in 2018 as a research project in the AI division at Red Hat. We wanted to create a system that would help developers create healthy applications when developing Python applications. There were two main use cases. One was developing Python applications locally: when people use pip, it's sometimes not that clear which versions of libraries should be used and if they are compatible with each other. But there was also another use case, like running Python in containers in a cluster, where you have containerized environments and would like to utilize the software environment that is provided by the containerized environment, and make sure that the software layer of an application works well with the hardware, so you get the best possible performance, and also security of the application that you are developing, and things like that.
[00:04:02] Unknown:
As far as the types of problems that you have seen, and you mentioned pip doesn't always give a very good or clear answer as to what the actual versions of the transitive dependencies are, what are some of the types of problems that can be introduced by having versions that aren't optimally determined, or that aren't locked to a specific version or a specific release for those non-primary dependencies?
[00:04:30] Unknown:
So we spotted multiple issues when it comes to library incompatibilities. For example, we had an issue when installing Pillow together with NumPy, where API calls were not respected properly, and that made the application crash. Or there were issues with Python interpreter versions: some libraries were not ready for more recent Python interpreters, and that caused issues. Besides this Python layer, you can find other issues when using, for example, TensorFlow. If you use TensorFlow together with a GPU, you need to make sure that everything fits correctly, so you have a proper CUDA version installed, things like that, for example the right glibc version. And the only way to make this work is to basically browse the documentation of the library and make sure that the proper CUDA version is installed.
And this causes headaches for developers, because it's something that could be answered, let's say, automatically during the resolution process. And the resolver can say that this combination of libraries in your environment will not work properly.
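The failure mode described above can be sketched in a few lines: a toy resolver enumerates version combinations and rejects any pairing flagged in a knowledge base of known incompatibilities. The package names, versions, and the `known_bad` table here are illustrative assumptions, not Thoth's actual data or API.

```python
from itertools import product

# Hypothetical, simplified package index: package -> available versions.
index = {
    "pillow": ["8.0", "9.0"],
    "numpy": ["1.19", "1.21"],
}

# Hypothetical knowledge base of combinations known to be broken,
# e.g. from observed API or ABI incompatibilities.
known_bad = {("pillow", "8.0", "numpy", "1.21")}

def resolve(index, known_bad):
    """Return all fully pinned combinations that avoid known-bad pairs."""
    names = sorted(index)
    valid = []
    for combo in product(*(index[n] for n in names)):
        pins = dict(zip(names, combo))
        broken = any(
            (a, pins[a], b, pins[b]) in known_bad
            for a in pins for b in pins if a != b
        )
        if not broken:
            valid.append(pins)
    return valid

solutions = resolve(index, known_bad)  # 3 of the 4 combinations survive
```

A real resolver also has to honor version-range constraints and transitive dependencies, but the core idea of pruning the solution space with aggregated knowledge is the same.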
[00:05:49] Unknown:
As far as the overall ecosystem in Python for being able to manage the dependencies, pip has been maybe the longest running one that's still in use, and there are a number that have grown up alongside it, such as Poetry and pip-tools. And I'm wondering if you can just talk to some of the ways that what you're doing with Thoth is either an improvement on those tools or works alongside them, and, specifically in terms of the dependency resolution, some of the ways that Thoth approaches it in a different manner than what those tools provide.
[00:06:26] Unknown:
If you take a look at pipenv, Poetry, pip-tools, these resolvers all run locally. So developers install applications locally, and the actual resolution process is done on client machines. The design of Thoth is different. Thoth's tooling aggregates information about the client's environment, where clients run their applications, and sends this information to Thoth's back end. And the back end, based on pre-aggregated knowledge about the dependencies, can resolve application dependencies specifically for users, for these clients. You can see it as a guidance service in this case, because it does not solely resolve these applications just by looking at the dependency graph, but it also considers additional knowledge about these dependencies.
It can be, for example, the quality of these dependencies, like security aspects. Does a given dependency have some vulnerability? Or do these dependencies work well together? Or what about communities? Are these packages backed by a good community, by a healthy community, so that I can trust these open source packages? If I'm using, for example, recurrent neural networks, the resolution process can be different for my use case, and the resolver can behave differently. So it's considering software and hardware information aggregated by the client's tooling and also doing some kind of contextual resolution.
And based on static source code analysis, the resolver can guide which packages and package versions a user should use.
[00:08:14] Unknown:
As far as the implementation of Toth itself, I'm wondering if you can talk to some of the various components in the overall ecosystem of what project Toth encapsulates? And then in terms of the actual resolver itself, some of the ways that that is architected and implemented to be able to provide this resolution as a service?
[00:08:36] Unknown:
Client tools include the Thamos CLI, which is a command line interface for contacting Thoth. It does the mentioned aggregation of information about the environment and sends it to the back end, as well as consuming results that are computed by the back end. We have also jupyterlab-requirements. This is an extension to Jupyter Notebooks that can contact Thoth as well. And besides that, it embeds information about packages into Jupyter Notebooks. So if you share a Jupyter Notebook, then you have the information to make the packages reproducible, and you know what packages were used during the resolution, during the application development.
We also provide bots that can be used by the community. These bots perform releases of Python modules, but also update repositories with bot recommendations. And these are the client tools that can be used. All of them contact the back end, and the actual resolution is implemented in a component that is called adviser. It's a stochastic resolver that implements temporal difference learning. In that case, the whole dependency graph is explored to see what the possibilities are for resolving application dependencies, and the resolver learns how to find the best possible combination of dependencies.
Then there are other components that make sure that the data stored on the back end is fresh and available to be used during the actual resolution process.
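As a rough illustration of the idea, and not Thoth's actual adviser implementation, here is a toy sketch of temporal-difference-style learning over a tiny dependency search space: the resolver repeatedly samples version combinations, observes a score, and nudges per-choice value estimates toward the observed reward. The packages, the `score` function, and all constants are invented for the example.

```python
import random

random.seed(0)

# Toy search space: each package has candidate versions.
candidates = {"a": ["1.0", "2.0"], "b": ["1.0", "2.0"]}

# Hypothetical hidden score of a fully pinned combination (higher is
# better); in Thoth this would come from aggregated knowledge about
# packages, not from a lookup like this.
def score(pins):
    return 1.0 if (pins["a"], pins["b"]) == ("2.0", "1.0") else 0.1

# Value estimate for each (package, version) choice, learned over time.
values = {(p, v): 0.0 for p, vs in candidates.items() for v in vs}
alpha, epsilon = 0.1, 0.2  # learning rate and exploration rate

for episode in range(500):
    pins = {}
    for pkg, vs in candidates.items():
        if random.random() < epsilon:  # explore a random version
            pins[pkg] = random.choice(vs)
        else:  # exploit the current value estimates
            pins[pkg] = max(vs, key=lambda v: values[(pkg, v)])
    reward = score(pins)
    for pkg, v in pins.items():  # TD-style update toward the reward
        values[(pkg, v)] += alpha * (reward - values[(pkg, v)])

# After training, pick the highest-valued version of each package.
best = {p: max(vs, key=lambda v: values[(p, v)]) for p, vs in candidates.items()}
```

The real adviser walks an actual dependency graph and scores states with many signals, but the explore-then-reinforce loop captures the "stochastic resolver" idea described above.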
[00:10:22] Unknown:
As far as the goals of the project, you mentioned that it started off as a research exercise and has now grown into a full blown engineering project. I'm wondering if you could just talk to some of the ways that it went through that evolution and how the initial ideas of what was trying to be achieved with that research project has evolved into the current scope of what you're trying to build with Toth?
[00:10:47] Unknown:
We had pretty quick development, and I personally liked it, because it was, like, let's create a prototype, let's prove it works, and if it works, let's try to incorporate it into the system. I think we took many blind paths when it comes to implementation. It was sometimes completely rewritten. I think we switched databases twice. Originally, we used a graph database, but the problem that we were trying to solve proved not to require a graph database. And we now use a relational database that proved to be scalable for this problem and proved to work.
[00:11:30] Unknown:
As far as the pieces of information that the resolver is using for determining what the overall possible solution space looks like and then figuring out the optimal result, I'm wondering what are the core pieces of information that it's relying on? You mentioned sending along some information about the environment where the application is running as well as the set of dependencies that you're trying to work with. So I'm just curious if you can talk to how that possible solution space is constructed and how the Thoth resolver is able to explore that space and find the optimal result.
[00:12:03] Unknown:
Thoth's resolver combines multiple sources. For example, we use the advisory database that is maintained by the Python Packaging Authority, and it stores information about vulnerabilities in Python packages. We also use scans of container images as produced by Quay, and these include information about vulnerabilities in container images on the RPM level, for example. Then the background data aggregation logic also gets information from GitHub. So, for example, it takes a look at contributors, how many contributors a project has, or if it is forked, or if it is archived, so users and consumers of these projects are warned that the project is archived.
We also compute popularity. So based on stars and the number of forks, we say that a given project is popular or not that popular, and whether the user is possibly consuming some not-that-well-known package in the ecosystem. We also take a look at how these projects are updated, so if they have a healthy community, if the project has recent commits, and things like that. We also take a look at PyPI, which hosts these packages. We take a look at downloads and maintainers. So if a project has a low number of maintainers, it might warn users that the project might become unmaintained and lose contributions over time. We also take a look at the Open Source Security Foundation. They compute something that is called Security Scorecards.
They check if the community, for example, signs releases of packages, or if they use CI checks, so pull requests are properly reviewed and changes go properly into the code base.
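To show how signals like these might be combined, here is a deliberately simplified health-score sketch. The weights, thresholds, and signal names are assumptions made for illustration; this is not the formula Thoth or OpenSSF actually uses.

```python
# Hypothetical project-health signals, similar in spirit to what Thoth
# aggregates from GitHub, PyPI, and OpenSSF Security Scorecards.
def health_score(stars, forks, maintainers, archived, signed_releases):
    """Combine signals into a single 0..1 score; the weights below are
    illustrative assumptions, not Thoth's actual scoring."""
    if archived:
        return 0.0  # archived projects get a hard warning
    popularity = min(1.0, (stars + 2 * forks) / 10_000)
    maintenance = min(1.0, maintainers / 5)
    security = 1.0 if signed_releases else 0.5
    return round(0.4 * popularity + 0.3 * maintenance + 0.3 * security, 3)

print(health_score(stars=8000, forks=1500, maintainers=3,
                   archived=False, signed_releases=True))  # -> 0.88
```

A resolver can then prefer, or warn about, candidate packages based on a score like this alongside the hard compatibility constraints.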
[00:14:15] Unknown:
As far as the actual process of starting to work with Thoth: say I have an existing project, I have my dependencies tracked in a requirements.txt file or a pyproject.toml, and now I want to take advantage of Thoth's resolver to make sure that I am not subjecting myself to security issues or that I'm not inadvertently introducing some dependency conflicts. And so I want to actually, you know, send my requirements to Thoth and make sure that I can improve the overall dependency list from that and, you know, generate a lock file that I can actually use for production?
[00:14:51] Unknown:
The best way to start is to use the Thamos CLI in this case. You can install it using pip install thamos, and in that case, you will have the thamos command available in your environment. It will show you a help message if you run thamos --help. The most interesting or the most important command could be thamos add, which adds dependencies to your project. Thamos itself consumes formats that are similar to Pipenv's. So if you use Pipenv, with a Pipfile and Pipfile.lock, then the workflow naturally integrates. But Thamos can also consume requirements files.
So once you add requirements to your project, then you can call thamos config. It generates a configuration file and automatically detects what is present in your environment. In this configuration file, you can find information about the software that is present in the environment. For example, if you use CUDA, the tooling will store CUDA information in the configuration file, but also information about the operating system and the Python version that you are using. After this initial setup, you can call thamos advise, which will contact the back end with the provided information and will wait for the back end to respond.
Once the back end computes recommendations and provides a lock file for you with all of the dependencies pinned to a specific version, then Thamos obtains this information from the back end and can install your dependencies using thamos install. The overall workflow is similar to Pipenv's, but that special thamos advise command is somewhat different from pipenv lock.
[00:16:44] Unknown:
I know that on the landing page for Thoth, it also makes a mention of being a recommendation engine for your Python projects. And I'm wondering if you can talk to some of the types of recommendations that it can make, what kinds of information you need to feed it to be able to generate some of those recommendations, and how that might differ from just using a list of dependencies that I wanna get a lock file for.
[00:17:08] Unknown:
The resolver itself works on requirements files, and also on constraints that we provide to the resolution process, along with information about environments that is stored in the configuration file. So it's very similar to what pip can do. When you would like to explore what the possibilities are for developing, say, a web application, you would like to get information about what libraries are out there and what libraries you can use. We have developed a model that is called project2vec. It basically creates a vectorized representation of Python projects that are available on PyPI.
And in that case, you can see clusters of packages. So let's say one cluster keeps packages that are related to web applications, and another cluster can store information about which packages are related to machine learning. And a nice thing about this model is that you can query it, because these projects are represented using a vectorized representation. In that case, you can create your own vector, and that vector can encode that you are interested in a web application that runs on Python 3.8 and Python 3.9. The vector then forms a mask that can be used to query the larger project2vec representation.
This model is under development, and it's not available on Thoth endpoints yet.
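The mask-query idea can be illustrated with a toy model, assuming simple binary feature vectors rather than the real learned representation; the feature names and project tags below are invented for the example.

```python
# A toy sketch of the idea behind project2vec: represent each project as
# a binary feature vector, then query with a mask vector. The features
# and project tags here are illustrative assumptions, not the real model.
FEATURES = ["web", "ml", "py38", "py39"]

projects = {
    "flask":      {"web", "py38", "py39"},
    "django":     {"web", "py38", "py39"},
    "tensorflow": {"ml", "py38"},
}

def vectorize(tags):
    """Encode a set of tags as a 0/1 vector over the known features."""
    return [1 if f in tags else 0 for f in FEATURES]

def query(mask_tags):
    """Return projects whose vectors match every bit set in the mask."""
    mask = vectorize(mask_tags)
    return sorted(
        name for name, tags in projects.items()
        if all(not m or v for m, v in zip(mask, vectorize(tags)))
    )

# "I want a web framework that supports Python 3.9."
result = query({"web", "py39"})  # -> ["django", "flask"]
```

The real model uses a richer vector space, but the mechanism of encoding a query as a vector mask over project vectors is the same.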
[00:18:45] Unknown:
Hey. Need to automate your Python code in the cloud? Want to avoid the hassle of setting up and maintaining infrastructure? Shipyard is the premier orchestration platform built to help you quickly launch, monitor, and share Python workflows in a matter of minutes with 0 changes to your code. Shipyard provides powerful features like webhooks, error handling, monitoring, automatic containerization, syncing with GitHub, and more. Plus, it comes with over 70 open source low code templates to help you quickly build solutions with the tools you already use. Go to pythonpodcast.com/shipyard today to get started automating with a free developer plan.
What are the interfaces available for people who want to, you know, customize its behavior or be able to contribute some additional functionality to it? The whole project is open source, so you can find the sources
[00:19:45] Unknown:
on GitHub, in the thoth-station organization. We also provide a README on how you can set up your own Thoth locally, with some limitations or some restrictions. And if you would like to contribute directly to what the recommendation engine can provide to users, there's an interface that is called prescriptions. Prescriptions are basically YAML files that are stored in a Git repository, and this Git repository acts as an open database that is consumed by the resolver. These prescriptions state information about packages and projects. So if you would like to provide some information about projects to users of Thoth, you can state it in a prescription in a YAML file.
And you can also state these library incompatibilities in prescriptions. So prescriptions really act as a database that can be consumed by the resolver to provide high quality resolution. Prescriptions provide a way to state that some libraries have overpinning issues or underpinning issues or do not perform well in certain environments, and some background data aggregation creates prescriptions automatically and stores them in the prescriptions repository.
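To give a feel for the format, here is a hypothetical sketch of what a prescription entry might look like; the unit name, match fields, and message are invented for illustration, and the actual schema is defined in the thoth-station prescriptions repository.

```yaml
# Hypothetical sketch of a prescription; see the thoth-station
# prescriptions repository for the real schema and unit types.
units:
  wraps:
    - name: ExampleIncompatibilityWrap
      match:
        state:
          resolved_dependencies:
            - name: examplepkg
              version: "<2.0"
      run:
        justification:
          - type: WARNING
            message: "examplepkg<2.0 is known to misbehave in this environment"
```

Because the files live in a Git repository, contributing new knowledge to the resolver is a matter of opening a pull request against that database.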
[00:21:12] Unknown:
In terms of the actual functionality of Thoth, there's nothing that seems to be inherently coupled to the Python ecosystem. And I'm curious what you see as the opportunity and the work necessary to be able to extend it into other language communities, for being able to resolve, say, Node dependencies or Java dependencies and being able to, you know, bring some of this functionality to other language environments?
[00:21:37] Unknown:
It's an excellent question. I think some parts of the infrastructure are not tightly coupled to Python, so they can be reused, like the queuing system that is implemented, or the API service that accepts inputs from users. So the infrastructure could be extended. Then there are some aspects of each language ecosystem, such as Node.js, or Golang, or Java, that would deserve some kind of spike to check if all these ideas are suitable also for those language ecosystems. But I think at least some design patterns can apply also to other language
[00:22:21] Unknown:
ecosystems. As far as the applications of Thoth, I'm wondering what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:22:29] Unknown:
Some teams use editable installations, and these editable installations do not fit well into the cloud-based resolution process, because they require certain specific environments set up on client machines. The resolver is designed to work solely on packages that are released on Python package indices. So that's one of the things that we found challenging in the team. We have discussions about whether editable installations should be supported or not.
[00:23:06] Unknown:
And in your own experience of working on Thoth and helping to evolve the project and working with end users of it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:23:17] Unknown:
I think one piece of advice that I can give is to set up quality checks as soon as possible, because technical debt is carried in the source code. So it's really good to set up some mypy checks or pre-commit checks to keep a repository healthy. We also found out that a graph database is not always the right choice for graph-related problems; graph databases have their own specific use cases, but that was not suitable for Thoth. We also found that having a good cluster setup and a stable cluster is very valuable, because, especially in the beginning, we were having issues with the clusters, but that's improved a lot over time. And now the OpenShift cluster that we are using is very stable, and some components could not be developed properly without this stability.
Also, naming is hard. We found different issues with Thoth's name and its pronunciation, especially if you are in a team that is distributed worldwide. And, yeah, Python is a cool language and has a very nice community. And it has one nice aspect: if you spot a not-that-performant part of your application, you can still optimize it. And that's also the case for Thoth, where the whole application is developed in Python, but some very core parts of the resolver are implemented in C/C++, and that helped to gain performance.
[00:24:58] Unknown:
As far as the naming aspect, one of the things I forgot to ask earlier is where the actual name of the project originated. Sort of, what inspired you to name it after the Egyptian god of knowledge?
[00:25:09] Unknown:
It was basically given by one of the directors, who came up with the idea for Thoth. And we really liked it, and thoth was not taken on PyPI, so we were very happy about that namespace and having Thoth's name. But we also have other services that are named after Egyptian gods. And as I said, naming is sometimes hard, and this brought smiles to the team, and it worked at the end of the day.
[00:25:41] Unknown:
And so for people who are interested in being able to improve the, I guess, quality of their dependency resolution, what are the cases where Thoth is the wrong choice and they're better suited just sticking with, you know, pip or Poetry or Pipenv or something like that? Thoth can resolve application dependencies of
[00:25:59] Unknown:
open source projects. So if you consume open source packages, Thoth can be nice to use. But if you use packages that are hosted privately, internally on a private package index, then Thoth doesn't know where these packages are hosted, cannot access them, cannot explore what the dependencies of these projects are, and cannot give guidance on these projects. So Thoth is a service that is suitable for packages that are hosted on PyPI or other publicly hosted indices.
[00:26:34] Unknown:
As you continue to build and iterate on the Thoth project and its ecosystem, what are some of the things you have planned for the near to medium term? We want to improve our UI, as Thoth also provides a user interface if you want to browse and search the results of
[00:26:51] Unknown:
advise. We also plan to integrate another component that is called license solver. It basically checks what licenses a package has and spots any discrepancies between the declared licenses, so users can really be warned about issues in licensing. We also plan to adopt the Nix store once it gets more mature in the Python ecosystem. And also, we would like to improve our community adoption. So if you have any feedback, feel free to try it out and send feedback to us.
[00:27:30] Unknown:
On the note of contribution, are there any particular areas of input that you're looking for or particular skills or backgrounds that would be helpful?
[00:27:39] Unknown:
I think extending Thoth's database with these prescriptions would be very helpful, because the more prescriptions and the more information about the resolved packages Thoth has, the better recommendations it can give. So if anyone spots an issue with a package or finds out that some packages deserve informational notes or warnings to users, it's very valuable to have this knowledge accumulated in one central place, so the whole Python community can benefit from it.
[00:28:10] Unknown:
Are there any other aspects of the Thoth project or the overall space of dependency resolution and dependency management that we didn't discuss yet that you'd like to cover before we close out the show?
[00:28:22] Unknown:
I think dependency resolution is not that trivial a task, and having information about which dependencies should be included in an application, what their quality is, how to consume them, and how to set up environments, that's something valuable. And if we have that knowledge accumulated, that would be great. So we extend the resolution process to not just checking what version ranges are acceptable for the application requirements that I have, but also including information about the actual packages, or some type of metadata about packages that are consumed and installed on clients' machines.
[00:29:05] Unknown:
Awesome. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing and contribute to the Thoth project, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a musical group that I came across recently called Brass Against. They seem to mostly do cover songs, but they're a heavy metal group that has a horn section, hence the name. So it's interesting hearing them do covers of things like Tool songs and System of a Down songs with a horn section as a backing ensemble. So definitely worth checking out; a rather different style than I've heard most other places. So with that, I'll pass it to you, Fridolín. Do you have any picks this week?
[00:29:45] Unknown:
You can check out another project that was created within Thoth, and it's called micro-pipenv. It's a tool that extends pip, and it can install dependencies that are stated in Pipenv, Poetry, or requirements.txt files. And in that case, it can install dependencies very quickly, so you don't need to run Pipenv or Poetry, especially when you are installing dependencies in containerized environments, or when you just want to try and install dependencies into an environment from the files produced by these tools.
[00:30:23] Unknown:
Very cool. Definitely have to take a look at that. Well, thank you very much for taking the time today to join me and share the work that you're doing on Project Thoth. Very interesting project, and it's great to see more investment in the ecosystem and the community and helping people find the optimal set of dependencies for their projects. So I appreciate all the time and energy that you and your coworkers are putting into that, and I hope you enjoy the rest of your day. Thank you. Thanks for having me, and have a great day. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Interview with Fridolín Pokorný
Origins and Goals of Project Thoth
Challenges with Dependency Management
Components and Architecture of Thoth
Information Sources for the Thoth Resolver
Using Thoth for Dependency Resolution
Recommendation Engine and Future Plans
Extending Thoth to Other Languages
Lessons Learned and Technical Challenges
Naming and Community Contributions
Future Developments and Community Involvement
Closing Remarks and Picks