Summary
One of the biggest pain points when working with data is dealing with the boilerplate code needed to load it into a usable format. Intake encapsulates all of that and puts it behind a single API. In this episode Martin Durant explains how to use Intake's data catalogs to encapsulate source information, how it simplifies data science workflows, and how to incorporate it into your projects. It is a lightweight way to enable collaboration between data engineers and data scientists in the PyData ecosystem.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Martin Durant about Intake, a lightweight package for finding, investigating, loading and disseminating data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Intake is and the story behind its creation?
- Can you outline some of the other projects and products that intersect with the functionality of Intake and describe where it fits in terms of use case and capabilities? (e.g. Quilt Data, Arrow, Data Retriever)
- Can you describe the workflows for using Intake, both from the data scientist and the data engineer perspective?
- One of the persistent challenges in working with data is that of cataloging and discovery of what already exists. In what ways does Intake address that problem?
- Does it have any facilities for capturing and exposing data lineage?
- For someone who needs to customize their usage of Intake, what are the extension points and what is involved in building a plugin?
- Can you describe how Intake is implemented and how it has evolved since it first started?
- What are some of the most challenging, complex, or novel aspects of the Intake implementation?
- Intake focuses primarily on integrating with the PyData ecosystem (e.g. NumPy, Pandas, SciPy, etc.). What are some other communities that are, or could be, benefiting from the work being done on Intake?
- What are some of the assumptions that are baked into Intake that would need to be modified to make it more broadly applicable?
- What are some of the assumptions that were made going into this project that have needed to be reconsidered after digging deeper into the problem space?
- What are some of the most interesting/unexpected/innovative ways that you have seen Intake leveraged?
- What are your plans for the future of Intake?
Keep In Touch
- martindurant on GitHub
- Website
- @martin_durant_ on Twitter
Picks
- Tobias
- Ubersuggest SEO tool
Links
- Intake
- Anaconda
- Dask
- Fast Parquet
- IDL
- Space Telescope Institute
- Blaze
- Quilt Data
- Arrow
- Data Retriever
- Parquet
- DataFrame
- Apache Spark
- Dremio
- Dat Project – distributed peer-to-peer data sharing
- GeoPandas
- XArray
- Solr
- Streamz
- PyViz
- S3FS
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you get everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or building your CI pipeline, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:36] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing Martin Durant about Intake, a lightweight package for finding, investigating, loading, and disseminating data. So, Martin, could you start by introducing yourself? Hello. My name is Martin Durant. I work for Anaconda
[00:01:51] Unknown:
and have done for nearly 4 years now. I work in the open source part of the company, and I'm involved in packages such as Dask and filesystem things, and file format loaders like fastparquet. But I've been on the team making Intake for a bit over a year, and Intake is now my main thing. So I'm the lead developer of Intake. And do you remember how you first got introduced to Python? I do. So, it's been a while. In my former life, I used to be an astronomer. And when I started my postgrad studies, the people around me were using a thing called IDL to do the majority of their data analysis and plotting, that kind of thing. But my supervisor was using FORTRAN for much of that. So he told me, go off and use whatever you like. And after struggling with IDL for 6 months or a year or something like that, I heard of this thing called Python, which the Space Telescope Institute was getting interested in at the time. So I thought I'd give it a go. I sat down, I went all the way through the introductory tutorial on the Python site, and I never looked back. Well, I won't say how many years ago that was, but it was a few.
[00:03:06] Unknown:
And so as you said, now you have become the lead on the intake project. So can you start by giving a bit of an overview about what the project is and the story behind why it was created? Yep. Intake
[00:03:18] Unknown:
has been around as an idea, a sort of concept that something like this should exist, for a while. And in fact, there are lots of packages out there that do some part of the work that Intake hopes to cover. But the specific idea of Intake came about, I believe, a year and a half ago, something like that, maybe a bit more. And it was an idea that was very loosely born out of this loose Blaze concept that had been around at Continuum and then Anaconda for a long time, the sort of overall library that's going to bring together all of your data needs. The idea for Intake, though, was to be much more specific than any of that: to be at the cataloging and loading side, not to deal with many more complex things that you might want to do with data, just give you a simple interface that is a layer over many other Python libraries, do the minimum necessary to let you get your data, and then do all of your analysis and everything else in the way that suits you. So that was the initial concept.
Python is, of course, very rich. There are lots of these libraries out there for loading from data format X, and Intake doesn't want to redo that work. It's very much in the vein of the PyData ecosystem: make use of what's out there, and try and work well with all of them. So that's the idea. And it actually started, I think, around January of last year; we started to actually have some working code at that time. And we released it in the summer, as we announced it in blog posts, and I started to go to conferences, that kind of thing, to talk about it. And,
[00:05:01] Unknown:
now it's thoroughly usable, and I encourage people to try it out. If you're listening to this, then I hope you find it interesting. And as you mentioned, there are a number of other libraries, both in the Python ecosystem and elsewhere, that handle some of this data loading challenge. But as you said, Intake is more for providing an abstraction and sort of unifying layer among them. But I'm wondering if you can talk a bit about some of the other projects that overlap some of the functionality of Intake in terms of things like the data loading or the data cataloging? So I'm thinking of projects like Quilt Data
[00:05:36] Unknown:
or the Arrow project or Data Retriever. Aside from "this is interesting" and general comments, the most common concrete thing that people have said to me when they've heard me talk about Intake is: hey, my company has something like that. We tried to write a platform for doing just this thing, but we only got partway. And it does some of it, and then it got abandoned because the developer left or something like that. So it's very common. There are lots of proto-Intakes out there, ones that never even got released.
And, hopefully, Intake will be the thing that unifies all of these things. It's very much designed with simplicity, in the code and in use, in mind, and some of them are actually available. So, of the ones you mentioned, I'll talk to Quilt and Data Retriever. The latter I actually hadn't heard about before coming to this conversation. I had to look it up, and it's very nice and simple. It gives you downloads of data files. There is a small, limited number of formats that it supports, and it does that one thing, and, though I haven't actually tried it, I don't doubt that it does it quite well. Quilt, on the other hand, is quite different. It is a hosting environment. You give your data over to Quilt. They manage it, and then you get an API to their system, which allows you to see what data you have and access whichever of those datasets you want. Intake, I started off by saying, wants to be simple. It wants to do the minimum amount of work. It doesn't host your data. Your data can live wherever you want it to live. So that could be on shared disk drives. It could be on GitHub. It could be on Amazon S3. It could be in a number of places, including your local disk, but that's just one of many. And the catalogs themselves can live in all of these different places. So if you want to use Intake to share data, you can just put your catalog into, for example, a gist on GitHub or a repo, point people at that, and you don't need infrastructure.
You don't need a server. You don't need to set up anything. You can just load it and go. So it really is going for that simplicity of use. In the Data Retriever case, that is also functionality that is covered by Intake. Intake has a caching system, optional, and it's the catalog author that gets to decide whether a particular dataset ought to be cached or not. And caching can mean downloading the files that hold the data, or it can mean accessing some data resource, whether it's a database or whatever it may be, and forming that into a local version. So, for example, for tabular data, it would be Parquet files. And that's an interesting thing that we've started with from the very first step with Intake: we want to be able to support not just data frames, not just arrays. We want to really give a uniform experience across all the different kinds of data that are out there, all the different places that data may live, whether it's services, whether it's remote storage, but still give a very simple experience to the end user. Arrow, I think we can talk about separately. It's not the space that Arrow is in. Arrow is in the in-memory data format space. It does care about loading data to some extent for a couple of formats. Maybe down the line, that will grow, and it does have some hooks for remote file systems, particularly HDFS.
Maybe some more coming. But for the time being, it's really about representation of data in memory and passing data between processes. So it's sidelong to what Intake does. It's not really the same territory, I don't think. And so
[00:09:22] Unknown:
intake is built to be able to simplify a lot of the work flow for data scientists and data engineers and reduce issues of code duplication that are required for being able to obtain and prepare the data for analysis. I'm curious if you can talk through just the overall workflow of using intake from the perspective of both a data scientist who is consuming it for building these analyses and a data engineer who is likely, defining and maintaining the catalogs and the data and the data sources. That idea of, separation of concerns between
[00:09:57] Unknown:
the owner of the data and the user of the data is really core to how we designed Intake. We want the user, and the users will be more numerous, so often they're the ones that shout the loudest, but we want their experience to be simple. We want them to be able to find some abstract cataloging thing, where they don't need to worry about how those catalogs are created and how they reference one another, to be able to search through those, know the metadata of the datasets themselves, and use a single method to get a handle to the data, whether it's in memory or some Dask thing or some Spark thing or whatever it is that they're used to analyzing it with. It should be a one-liner, once they've found the dataset of interest, to be able to work with that. And then Intake is done, and they can get on with the actual work that they're paid to do. So from their point of view, they shouldn't really notice Intake that much. They may be using our GUI. They may be using search functionality, that kind of thing. But the less interaction that they have with Intake that nevertheless lets them get on with their job, the better it is from the end user's point of view. From the engineer's point of view, or the data owner's, they create the catalogs or cataloging systems. Often, they will be the same people that curate the data, so they decide this particular dataset is the authoritative version of whatever it is that it's supposed to be. And the time that they are doing that curation is when you want to capture as much metadata about it as you can. That process is very important, and Intake provides, in the base case at least, a very simple way of writing down all of that metadata. You can just include it into a YAML file as a dictionary-like thing and put everything into that metadata block that you, as the catalog author, find interesting. But you also provide the set of arguments to the particular loader. We call these data drivers. The particular set of arguments which will load that data in the way that it ought to be loaded. So you might have a choice of different packages to load it with. You might have a choice of various specific options that make sense for that data type. Maybe only some columns of a particular dataset are of interest if you think about this data in one particular context; you can have multiple versions of that, maybe.
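To make the workflow just described concrete, here is a minimal sketch of a catalog entry and the corresponding one-liner load. It assumes the Intake 1.x-style YAML catalog format and the built-in `csv` driver; the dataset, file names, and metadata are invented for illustration, not taken from the episode.

```python
# A minimal sketch (assumes intake, pandas, and dask are installed): write a tiny
# CSV, describe it in a YAML catalog, then load it back through the catalog API.
import intake
import pandas as pd

# Hypothetical example data, written locally so the sketch is self-contained.
pd.DataFrame({"region": ["east", "west"], "total": [10, 20]}).to_csv("sales.csv", index=False)

catalog_yaml = """
sources:
  sales:
    description: Example sales table (hypothetical dataset)
    driver: csv
    args:
      urlpath: "{{ CATALOG_DIR }}/sales.csv"
    metadata:
      tags: [sales, example]
"""
with open("catalog.yml", "w") as f:
    f.write(catalog_yaml)

cat = intake.open_catalog("catalog.yml")   # a catalog is itself a data source
print(list(cat))                           # ['sales']
print(cat.sales.description)               # the metadata travels with the entry
df = cat.sales.read()                      # the "one-liner" load into a DataFrame
print(df.head())
```

In this split, the data engineer maintains the YAML (pointing at S3, a database, or wherever the data really lives), while the data scientist only ever sees `open_catalog` and `read`.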
You could present some specific options to the end user. Do you want the production set? Do you want the latest leading-edge set, or something like that, or just a sandbox version? All of those things you could encode into a catalog. You can also structure these catalogs so that they reference one another in a hierarchical model. And that's, again, a clue to the user on how to go about finding the correct data that they need. The best situation from the data engineer's point of view is that the data users know what data is out there. So it's a way of doing this dissemination piece just by, in the simplest case, putting YAML files in the correct position. It can be more complicated than that too if you need something more specific. And then the user, because they've used Intake before, knows that the same three methods of interest will do everything that they need. They shouldn't need to come back constantly to the data engineer saying, hey, I need this latest bit, hey, how do I load this thing? Or go to their colleagues saying, do you have that little piece of code that loads the current dataset that I need to work on? So, by making clear the different roles around this,
[00:13:55] Unknown:
data story, you save everybody a lot of work. And as far as the metadata that you can embed in the catalog definition, I'm assuming that that's also passed on to the end user for being able to inquire about some of the different attributes of the data, such as what you were saying as far as the different columns that are of interest or maybe having a last generated time stamp so that somebody can inquire about the freshness of the data that they're analyzing without having to go back to the catalog owner to ask that question? Yes. Although all of it is optional.
[00:14:28] Unknown:
You could have an empty metadata section if you chose to make a catalog that way. That's obviously not recommended; the value added is being able to include the information in this. Some specific things come from the dataset itself, so the number of columns that are available is usually something, depending on the data format, that the data itself can tell you. So that's already often encoded in some kind of metadata. For Parquet, for example, there's a special part of the file that has a list of the columns that you expect to get there. For other things, like, say, JSON, you might not know what columns exist. There's no specific metadata location in common JSON. Maybe you have a JSON API description or something. So there are lots of possibilities, and the design of how you get this encoded into metadata is one of the most important considerations when you write a data driver. There will be some common things.
Yep. People will want to know time stamps. They will want to know tags, for example; that is a common use case. So, category of data, for example: if you want to find all of the datasets that are in the sales category, then you should be able to search "sales" and it will find all of them, because it's there in the metadata. And that's already possible now. Time stamp is actually a bit more subtle because, commonly, "latest" is the time stamp, because you're loading from some remote location, and the data in that location can be updated in place, which is very useful. But then the question of versioning is another interesting topic that maybe we can talk about later, especially if your data source is actually some database system, and instead of having a new version of files, you have new entries in there. You have individual updates to individual rows, and you're not normally, in the database case, talking about versions of data. Usually, you are getting the latest.
Maybe you have some, snapshotting capability.
[00:16:40] Unknown:
And going back to the catalog definition again, you mentioned the ability to have them defined hierarchically, and I know that in terms of data, particularly as you start to accrue a larger volume and variety of it, it can be difficult to understand just what data is even available, what exists, and what the overall discovery and cataloging capabilities are. And so I'm wondering how intake addresses some of those problems. And at a high level, it seems that you could have 1 catalog definition that's just a sort of meta definition that defines all of the other catalogs that exist that just gets updated as you add new data sources. But I'm wondering if you can just talk through the overall experience of the discovery process for working with intake and understanding what data exists within an organization?
[00:17:29] Unknown:
In the Intake model, a catalog object is itself a data source. So any catalog can refer to any other catalog as one of its entries, and that can be, most commonly, just a URL, or it can be something more complicated. But you can imagine a simple case in which, let's say, you have your data on S3. You could have your catalog definitions also on S3, as a particular bucket and key, as your master catalog. And in it, as you say, this could just be a YAML file, and each of the YAML blocks within each of your entries says: this is a catalog, and it's at this other URL. From the user's point of view, they would load the first one of these, or maybe their system administrator has provided it as one of the automatic lookups that Intake does for you at import time, and then it's just there. And you can do a list of your master catalog, and you will see it has a certain number of entries and that those entries are themselves catalogs. You can do a search method which will walk down into the hierarchy looking for particular terms in the metadata and description of all of those datasets. Or, something that has now been heavily developed and is becoming a pretty pleasant experience: we have a GUI, which you can launch either in a notebook or as a stand-alone application, with which you can click down the tree of your catalogs and see what entries are in there. And for each of those entries, you can see their description and metadata in front of you. You don't need to type anything to do that. And, also coming soon, and this is already partly implemented, you can do some on-the-fly analysis and visualization of those datasets within the interface itself. That obviously will work better for certain sizes of datasets than others, and we need to care then about where the execution is happening if you're doing some aggregated visualization, or whether you actually just want to see the data. You still have to load it in some process somewhere to display it. So it does open a new set of questions. But in general, the idea is that whatever execution environment you are using anyway is the thing that's going to get you some quick look at your data, perhaps with a great deal of subsampling, maybe only displaying 1 in 1,000 rows or something like that. That is all possible within the interface, or it will be very soon.
So then the user of the data can know pretty rapidly if that's actually the correct dataset that they want for the job in hand. There are some additional niceties about this. Catalogs are hierarchical, but they're not exclusive. It's not like a directory hierarchy, so there's not only one parent per catalog. You can cross-reference them. You can have multiple entries pointing to the same set of files, or the same data service, but with different options, where you can then describe what those different options are. And, again, following those descriptions and metadata, the user can pick the one that's correct for them. I already mentioned parameters that you expose to the user so that they can make some choices at load time if they need to. Those parameters can include credentials, which is a fairly important story within a commercial institution, to have the correct identity attached to each data access. Intake doesn't store those credentials, but it provides ways to pass those credentials on. And I've wandered a little far from hierarchies now. But, yeah, the basic idea is that you can click through your hierarchy, you can search through your hierarchy, or you can do all of these things in code if you want to. So we expose as many different ways to do these things as possible.
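Here is a rough sketch of the nesting and search ideas above. It assumes the `yaml_file_cat` driver name that Intake 1.x uses for catalog-in-catalog entries, and a `search` method doing a text match over entry names, descriptions, and metadata; the catalogs and datasets are made up for illustration.

```python
# A sketch of one catalog referring to another; everything is written locally here,
# but the `path` could equally be an S3 URL, a GitHub raw URL, a gist, and so on.
import pathlib
import intake

pathlib.Path("sales_cat.yml").write_text("""
sources:
  monthly_sales:
    description: Monthly sales figures (hypothetical dataset)
    driver: csv
    args:
      urlpath: "{{ CATALOG_DIR }}/sales.csv"
    metadata:
      tags: [sales]
""")

pathlib.Path("master.yml").write_text("""
sources:
  sales_catalog:
    description: All sales-related datasets
    driver: yaml_file_cat
    args:
      path: "{{ CATALOG_DIR }}/sales_cat.yml"
""")

cat = intake.open_catalog("master.yml")
print(list(cat))                 # ['sales_catalog'], an entry that is itself a catalog
print(list(cat.sales_catalog))   # ['monthly_sales']
hits = cat.search("sales")       # text search down the hierarchy
print(list(hits))
```

The graphical browser mentioned in the conversation is exposed in recent Intake 1.x releases as `intake.gui`, which walks the same hierarchy interactively.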
[00:21:24] Unknown:
And so another challenge that comes up often when dealing with data is the idea of capturing the lineage or provenance, and that can change as some of the sources change or the manipulations that generate a derived data source change, and that ties into the versioning aspect that you mentioned as well. So I'm wondering if you can just talk through a bit about the data lineage capabilities of Intake, and also some of the story around data versioning and catalog versioning.
[00:21:54] Unknown:
For the time being, Intake is something that reads your data and doesn't write it. So you wouldn't use Intake to derive a new dataset from a previous one. You would use Intake to load it and then manipulate it however you want to. We are developing hooks to be able to edit catalogs in place so that you can add new entries into them, and that would be the opportunity to add extra information about the current session, of how it was generated. That kind of thing often makes sense in the context of a larger processing framework. So, you may be aware that Anaconda has a for-fee enterprise platform within which, whenever you're in the platform, you're running within some container, and that container knows various pieces of information about who the user is and what the resources are, what the environment is, that kind of thing. So if you were saving data into an Intake catalog from such a system, then you would have a lot of information about how it was generated, especially if it's a job that is run regularly in an express, declarative way as opposed to at an interactive prompt. But Python being Python, you can do anything at an interactive prompt, right? So the more constrained your execution environment, the more information you can have, and then it's up to the administrator of the data system to capture that information at the time that data is ingested. Intake may eventually go down this road of being more of a data pipeline type system, in which it actually does analyses for you and executes jobs, that kind of thing. For the time being, we're of the opinion that we want to do the one thing, which is loading, extremely well. To go to both sides of that story, the reading and the writing, and become some kind of Dremio-like system or whatever (there's a whole load of those kinds of systems around too) would actually make Intake less good at the particular thing that it's aimed at. So we're very hesitant to try and do too much with Intake, certainly in the near to medium future.
The only exception to this rule is that there is a functionality in Intake called persist, which is: read the data from a remote source, typically a database or something where your query would take a long time to process, and then write that table or array or whatever it may be to a local format as is. And then, because we've made the data ourselves, we can capture all these things about where it came from and the time that it was executed, and be able to refer back to the original data, so that at a later stage you can either automatically say this data is now stale, we need to refresh it, or the user can say, I would like the most recent version of this data.
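A minimal sketch of that persist mechanism, reusing the hypothetical `catalog.yml` from the earlier sketch. The calls follow the Intake 1.x persist API as I understand it, so treat the details, especially the `ttl` argument, as assumptions rather than a definitive recipe.

```python
import intake

cat = intake.open_catalog("catalog.yml")   # hypothetical catalog from the earlier sketch
source = cat.sales                         # in practice this would be a slow remote or database source

local = source.persist()   # materialise a local copy; DataFrame sources are written as Parquet
df = local.read()          # later reads come from the persisted copy, not the original

# The persisted copy records where it came from, so it can be checked for staleness
# or refreshed later; e.g. source.persist(ttl=600) is meant to mark it stale after
# 600 seconds (argument name assumed from the Intake 1.x documentation).
```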
So that's one side of it. When it comes to working with the versions of the data itself, though, for reading, that is something that is currently under development, and it will depend on how you want to store your data. There are many different places that your data can live, and each of those may have a different concept of what data versioning means. Probably the first one of these that we will expose directly to the user will be Git, because Git is the best-known versioning thing that's out there, and Git makes a very natural storage place for your catalogs. Catalogs can be text files, as I've said a few times; YAML is the standard format that we use. So you can be versioning your catalogs using Git anyway. But if you have a reference to a repo and you regard that repo as a cacheable thing, then Intake can take care of cloning that repo, offering the set of versions, maybe just tags (this is something to be decided), and exposing those to the user so that they can get the latest or they can go back in time. There are a number of things like that that I would regard as caching mechanisms. That is, you can get a local copy of some data that lives elsewhere. Docker is another obvious one that uses versions of that sort. One we already have implemented, which is implemented but experimental, is Dat. Dat is a much bigger data concern, but one in which you have absolute hashes to every version of your data, so that you can very much reproduce the data at a particular point in time. So Intake can interact with that. You can say my data lives at this Dat hash, and at the time of access, Intake will call Dat and have it do that referencing for you. So, again, it's not Intake that's keeping hold of the versions.
It's an external provider of versioning information. Intake maybe will have the ability to do snapshots itself, in the same way that currently we can persist data from remote and we can cache data from remote; maybe we will eventually version those things so you can have snapshots of them locally. Exactly how that will be done and what the experience will be, that's something to be decided. But it's probably something that will appear in the next 6 months, I would reckon. And so
[00:27:18] Unknown:
can you describe a bit now about how intake itself is actually implemented and some of the extension points that it provides for somebody who wants to customize for different data sources or different, container formats that it'll support? Yeah. We've,
[00:27:34] Unknown:
tried to make Intake approachable to the developer. We have a simple class structure, so that if you want to implement something like a data driver, or indeed a catalog, which is a specific type of data driver, then you need to just derive from a particular class and implement a small number of methods, I think like 5 or 6 methods. And these classes, in a simple case, for something fairly straightforward like a text file maybe, end up being pretty short. They're basically one page of code. I have a tutorial lying around that was using the GitHub API as a data source. So the parameters that you would pass to it would be the repo that you're interested in and the type of issues, whether open or closed (it was the issues for this; it was just a throwaway example). Those are the parameters that it took. It connected to GitHub, and it would give you back all of those issues, with time stamps and author and whatever else, as a data frame in that case. And the code to do that and make it look like any other Intake data source is maybe 50 lines or something like that, probably less.
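For a sense of what such a driver looks like, here is a rough sketch along the lines of the GitHub-issues tutorial mentioned here. It follows the Intake 1.x plugin pattern of subclassing `intake.source.base.DataSource`, but the class and its details (method bodies, `Schema` fields, the use of `requests`) are illustrative assumptions rather than the tutorial's actual code.

```python
import pandas as pd
import requests
from intake.source.base import DataSource, Schema


class GithubIssuesSource(DataSource):
    """Hypothetical driver that presents a repo's issues as a DataFrame."""

    name = "github_issues"      # the driver name a catalog entry would reference
    container = "dataframe"     # what read() hands back
    version = "0.0.1"
    partition_access = False

    def __init__(self, repo, state="open", metadata=None):
        self.repo = repo        # e.g. "intake/intake"
        self.state = state      # "open" or "closed"
        self._df = None
        super().__init__(metadata=metadata)

    def _fetch(self):
        # Fetch once and cache, so that schema discovery and read share the data.
        if self._df is None:
            url = f"https://api.github.com/repos/{self.repo}/issues"
            resp = requests.get(url, params={"state": self.state})
            resp.raise_for_status()
            self._df = pd.DataFrame(resp.json())
        return self._df

    def _get_schema(self):
        df = self._fetch()
        return Schema(
            datashape=None,
            dtype=df.dtypes.to_dict(),
            shape=df.shape,
            npartitions=1,
            extra_metadata={},
        )

    def _get_partition(self, i):
        return self._fetch()

    def read(self):
        return self._fetch()

    def _close(self):
        self._df = None
```

Once registered through Intake's driver entry points, a catalog entry could then refer to it as `driver: github_issues` with `repo` and `state` as its arguments, and it would behave like any other Intake data source.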
That's 50 lines including class declaration and documentation and that kind of thing. There are some other extension points in there. So you mentioned container formats. The containers are, again, a specialized kind of data source. We haven't mentioned the Intake server at all. It is something that might be very useful to people, but it does really complicate the set of things that Intake can do. For this conversation, I'll just say that there is an Intake server, which you can just run as a Python process. You don't need to have any special infrastructure for it. And it can serve catalogs. So you can connect to this server, and it gives back a list of data entries or sets of catalogs.
And the second use for the server is that you can actually stream data from it. So it may be that the server lives behind a firewall, or it has some special credentials access, so that it can see data that the user on a different machine can't. And then you can use the server to stream that data over a single socket. So that's a very nice convenience. But to be able to do that, you need to agree on what that data type is. So for basic things like arrays and data frames, those are plumbed into Intake. So if you expose a data frame from the server and the client wants to be able to load it, proxying through the server, then it gets chunks of that data frame, and it doesn't need to know what the original format was. Perhaps it doesn't have that driver, and that's the reason that you want to proxy: the server has the driver, but the client doesn't. Those few types, like data frame and array, we call those containers. And the container is also the thing that's responsible for persisting: if you do a persist, then a data frame gets written as a set of Parquet files. And where that is defined is that, because it's a data frame, the definition of the data frame container has a persist method, and that defines how it gets written to disk locally.
And it turns it into Parquet for all cases. Writing a new container is similarly pretty easy. So we have a couple. We have GeoPandas, a specialized data frame in which you have a notion of geometry, and xarray, which is related sets of numerical arrays where you have coordinates and metadata and perhaps some hierarchical structure to them, in a sort of NetCDF-like model of data format. So those were new, and they live outside of the main Intake repo, but you can use them easily in conjunction with Intake. If you had some other type of data, and we're looking, for example, at streaming data, at fitted machine learning models, these are interesting things that you might want to be able to pass from one place to another without having to do the basic load yourself.
Those are all candidates that would be fairly simple to implement. It's just a matter of time, or somebody getting around to doing it. In practice, data frames and arrays and free-form text-like things such as JSON fill the needs of the great majority of users, certainly in the business case. But the fact that there are lots of different types of data out there is something that is often glossed over in other data access conversations out there. I don't just want 20 different ways to load data frames. Loading from XML is not very different from loading from JSON. So in my mind, those 2 are the same thing, really. But there are lots of fundamentally different types of data in existence that
[00:33:04] Unknown:
often you just don't have a convenient way of loading at all. And as far as the overall implementation of Intake, you mentioned that it's a fairly young project. So I'm curious what some of the assumptions are that are sort of baked into the origins of Intake and how it's designed, and some of the ways that the architecture has evolved, and maybe some of the ways that those assumptions have been challenged or updated in the process? It's a tricky thing to ask a software author what their assumptions are.
[00:33:35] Unknown:
It's easier to see from the outside, of course. We did start off with a concrete set of data sources that we were interested in, which was partly driven by an external client of the company. So they had their own interests, so we implemented those first. And some of them just don't get used, or at least don't get used as far as I know. There may be people out there that do and are happy with them; you could say perhaps there are absolutely no problems with them, and that's why I haven't heard from them. But I suspect that they don't get used very much. So, Solr, for example, is a queryable, database-ish service typically used for system logging on some cluster. So there are doubtless lots of Solr systems out there, but probably not that many people using Intake to derive data frames from Solr queries, because you would typically use it with a graphical front end where you want to do anomaly detection or something like that. So there are a few of those that we initially made and then realized that these are probably not the most important ones to be working on. I think what I said just before about the untouched data types that are out there: there's a huge number of them, and there's a huge amount of data in them.
And if we just follow on trying to make data frame like things, then we'll be missing out on those, and the people that want to be able to load from those things will also be missing out. So I think that was the biggest pivot: to realize, and we had talked about this early on, that we needed to really extend the set of data types that Intake can interact with. So at the moment, I've started a scratch repo on streaming data. There are not many frameworks in the Python ecosystem that deal with really real-time data. There is one that I'm now developing called Streamz, with a zed at the end, and it seems like a fairly good candidate for being able to do this.
Since I now develop for it, I can customize it to be friendly to intake, of course. That that's convenient. But exactly how the user experience of that will be is not yet entirely clear, and that is the kind of thing that makes me wonder whether there are some things in intake that, you know, we're we're a bit too set in our ways, concrete files, concrete versions of data. But now we have something that's truly dynamic. What do we need to change to to make that work?
[00:36:24] Unknown:
And as you mentioned early on, intake focuses primarily on integrating with the PyData ecosystem and solving a lot of the problems that developers in that space are encountering. And I'm wondering what are some of the other types of communities that either are currently or could be benefiting from the work being done on intake? I think I'd start off by saying that the pydata
[00:36:47] Unknown:
ecosystem is truly enormous. It encompasses basically anybody who's using NumPy or Pandas or scikit-learn, or even related things like TensorFlow. Those are all PyData ecosystem players. Some are more deeply integrated with other tools than others, but that's the nature of things. Python makes it very easy to play along with all of the different libraries that are out there. It's about finding the users rather than the code ecosystems, I think. If all of the people that currently use Python for data analysis, if a significant fraction of those would be using Intake to catalog those data systems and to access their data, then that would be an enormous number of people. And I'd say, let us get to that point before we try and say who else might be benefiting from it. We're not particularly at that stage. I think you were going to ask me this later, but cross-platform interoperability is something that we think about, and we try and make it possible in some cases, but Intake itself remains implemented in Python, and it will be for the time being. So all that we can promise is that where there is a platform-independent description of a dataset, and there are quite a few of these different ways of doing it, Intake is capable of interoperating with them, and ideally of expressing its own catalogs in those formats as well. Both of those, particularly the first option of being able to read their data, are things that there are already several examples of, such as STAC and THREDDS, which are scientific
data description standards, and Intake can read from those catalogs and expose their data as xarray objects. But being able to go the other way, taking Intake catalogs and generating those platform-independent prescriptions, is also something that is being worked on already. And if that's there, then you don't any more need to ask the question of whether Intake should also be implemented in R, for example. Maybe it should. Maybe somewhere
[00:39:05] Unknown:
a couple of years from now, that that can be something that we think about, but, definitely not for the time being. Yeah. It's definitely valuable to have an understanding of the scope that you're aiming for and the limitations that you're willing to accept in order to serve your target community well, rather than trying to be everything to everyone and ultimately being useless to everyone. Yeah. And, obviously,
[00:39:27] Unknown:
that we have a bit of bias at Anaconda. We support and we develop for R too, but Anaconda is, unofficially perhaps, Python first. And we have a much bigger name in the Python world, and we develop a lot of Python-oriented open source products, projects I should say, I guess, if it's open source. So there's the convenience that I know the people that develop PyViz. So if I need some visualization stuff tailored to Intake's needs, then that's a very easy conversation for me to have. That doesn't mean that I want to convert the whole world away from, I don't know, plot layout or whatever. It's just that we can move a lot faster within that. And in terms of the overall work
[00:40:15] Unknown:
of building and maintaining the intake project, what have you found to be some of the most challenging or complex or sort of, unexpected aspects of building and maintaining it? Personally, the biggest
[00:40:29] Unknown:
problem, the hardest thing really, has been the messaging from the start. This is a tool that people don't necessarily realize would be useful for them until they've seen it demonstrated, until they've seen some real concrete examples, until lots of other people have created Intake catalogs of their data as an easy way to share it. I can give talks, I can talk about Intake, and people generally find those quite interesting, and it's a novel idea for them. But actually getting people on board and using it requires crossing a certain hurdle, a certain amount of effort, for some institution or individual data-handling person to make that effort to actually come over and use Intake. It's very much a different experience to when I was writing fastparquet, or when I was writing s3fs for accessing S3. Those were needs that people already had, and they were waiting for a library that fulfilled that function for them. And then it came, and people used it, and they were very grateful. In this case, it is about finding out exactly what people's pain point is so that I can address it: hey, is this something that you're struggling with? Are you aware that Intake can do this for you? Intake has perhaps already too many facets, too many functionality pieces within it. So the very front page of the Intake documentation has, like, the 4 main use cases. Are you one of these people? 4 is already quite a lot. It's much easier when you can say: this is what you're trying to do; this is the thing that solves it for you. Getting that tone right is definitely far and away the thing that I have found the hardest.
So I appreciate conversations like this to try and make it as clear as possible what it is that we're after, why we're working on intake.
[00:42:32] Unknown:
And as far as your experience of using intake on your own and helping some of your customers and community members leverage intake, what have you found to be some of the most interesting or unexpected or innovative ways that you've seen it leveraged?
[00:42:47] Unknown:
So I know a fair number of scientists, particularly in the atmospheric community, that routinely use Intake in their deployments. And a lot of data is referenced in those catalogs, and they use it every day in a very open way, so anybody can access these catalogs and know what those datasets are. There is one particular project to take that set of Intake catalogs, and it is itself a GitHub repo. So you can submit to this GitHub repo. They have CI on it, which takes all of these prescriptions and checks that you can actually load all of those datasets. And then it takes them and renders it all into a very handy website. That's another way, an unofficial way if you like, to browse through Intake catalogs: there is this one particular project within which they render everything into static HTML so that you can just click through the whole thing. So that's pretty cool. And then a different collaborator, this is at Brookhaven, I believe. So, scientists again, but of a different type, in which they've taken an existing experimental data repo. That's implemented in Mongo, MongoDB, and it keeps track of every single experiment that is done within this particular group, a truly massive amount of information that is in there. And they've implemented an Intake hook to this so that you access the server, and it looks like a catalog from the user's point of view.
They submit some basic query parameters. And the server, when it gives you the catalog back, what it's actually done is it's taken this query and it's talked to Mongo. And from Mongo, it's got back a list of entities. These are all datasets. And then it's packaged that into something that, to the user, looks like an Intake catalog, so that they can just load the individual datasets from a source. That's the kind of next-order Intake thing. We're no longer talking about just files. Writing things into YAML files is a really quick way, very convenient, to get going. But, actually, you can do this on a much bigger scale without having to write a lot of code at all, without having to institute complicated cluster topologies or anything like that. So there's a couple of examples that have impressed me. I try and maintain, well, there is an examples page on the Intake docs, so I try and maintain links to all of these different things there. People who are building, like I mentioned, STAC and THREDDS, and CMIP is another one of these scientific standards. People who maintain those, and you can see all the different datasets that get added to these collections.
And I'm not a scientist in that field. I have no idea what these things are. I just know that they look really complicated and there's a massive amount of stuff there. So, yeah, I find those things very impressive, and it is really quite gratifying to see other people who are working within this Intake landscape.
[00:45:58] Unknown:
And are there any other aspects of the intake project or the challenges of loading and processing data that we didn't discuss yet that you'd like to cover before we close out the show? It's a pretty open space.
[00:46:11] Unknown:
So Intake is young. We're very much open to having people propose things that we haven't thought of. We have, you know, at Anaconda, we have certain customers. Someone like me, I have particular experience with what I've done with data and the open source projects that I've worked on, but this is really just the edge or tip of the iceberg, if you like. So there are so many things out there. Come and talk to us: hey, this is a kind of data, or this is a data curation system, that we're interested in; how would I go about writing an implementation for Intake to be able to interface with this thing? And that includes, for example, we started off with Data Retriever, this tidy little library that loads some stuff for you in a very particular way. Intake could be a layer over that too. Hey, here is one of your catalogs that is referenced from your master catalog; it just happens to be all of your datasets that live in Data Retriever. You have another entry that sits next to it: these are all of your tables that live on your Spark cluster, for example. You can totally do that. So Intake really wants to be this very light and simple layer over everything else that might exist, and what we need is users and use cases above all. For the actual aspects of getting things into memory or loading things in chunks into a cluster, the kind of thing that Dask does, those are details I find interesting, and there are so many different ways of doing it and so many different ways of storing your data.
Usually, a lot of that code already exists somewhere. So usually, it's a case of writing something that can call that code in a structured way that makes sense from the Intake point of view. It's usually quite easy. And for anybody who does want to follow along with the work that you're doing or provide feedback or suggestions
[00:48:12] Unknown:
for Intake, or follow it along as it continues to grow and mature, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a tool called Ubersuggest that is just a handy little website for doing some easy SEO analysis, for things like keywords and backlinks for your website, which is something that I've been trying to work on for the podcast. So I found that to be useful. And so with that, I'll pass it to you, Martin. Do you have any picks this week? I haven't actually had time to pick anything. Well, I want to thank you for taking the time today to join me and discuss the work that you're doing on Intake. It's definitely an interesting tool, and it's great to see another entry into trying to make it easy for people to have an opinionated and standardized method for accessing and analyzing their data. So I look forward to experimenting with that on my own, and I hope you enjoy the rest of your day. I wonder, before you go, how did you come across Data Retriever?
As I said, I'd never heard of it. So, you wanted to write about Intake and you did some Google search, I suppose, or did you already know about it? Data Retriever I actually had on a previous episode. So I'll add a link to that in the show notes. And I think I just saw it in one of the Python newsletters that went out. I think it was maybe a couple of years ago at this point. Mhmm. Okay. Alright. Thank you. Thank you for your time.
Introduction and Episode Overview
Interview with Martin Durant
Martin Durant's Background and Introduction to Python
Overview of the Intake Project
Comparison with Other Data Loading Libraries
Workflow for Data Scientists and Data Engineers
Metadata and Cataloging in Intake
Data Discovery and Hierarchical Catalogs
Data Lineage and Versioning
Implementation and Extension of Intake
Assumptions and Evolution of Intake
Communities Benefiting from Intake
Challenges in Building and Maintaining Intake
Interesting Use Cases of Intake
Open Space for Development and User Contributions
Contact Information and Closing Remarks