Summary
One of the biggest pain points when working with data is dealing with the boilerplate code needed to load it into a usable format. Intake encapsulates all of that and puts it behind a single API. In this episode Martin Durant explains how to use Intake's data catalogs to encapsulate source information, how it simplifies data science workflows, and how to incorporate it into your projects. It is a lightweight way to enable collaboration between data engineers and data scientists in the PyData ecosystem.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Martin Durant about Intake, a lightweight package for finding, investigating, loading and disseminating data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Intake is and the story behind its creation?
- Can you outline some of the other projects and products that intersect with the functionality of Intake and describe where it fits in terms of use case and capabilities? (e.g. Quilt Data, Arrow, Data Retriever)
- Can you describe the workflows for using Intake, both from the data scientist and the data engineer perspective?
- One of the persistent challenges in working with data is that of cataloging and discovery of what already exists. In what ways does Intake address that problem?
- Does it have any facilities for capturing and exposing data lineage?
- For someone who needs to customize their usage of Intake, what are the extension points and what is involved in building a plugin?
- Can you describe how Intake is implemented and how it has evolved since it first started?
- What are some of the most challenging, complex, or novel aspects of the Intake implementation?
- Intake focuses primarily on integrating with the PyData ecosystem (e.g. NumPy, Pandas, SciPy, etc.). What are some other communities that are, or could be, benefiting from the work being done on Intake?
- What are some of the assumptions that are baked into Intake that would need to be modified to make it more broadly applicable?
- What are some of the assumptions that were made going into this project that have needed to be reconsidered after digging deeper into the problem space?
- What are some of the most interesting/unexpected/innovative ways that you have seen Intake leveraged?
- What are your plans for the future of Intake?
Keep In Touch
- martindurant on GitHub
- Website
- @martin_durant_ on Twitter
Picks
- Tobias
- Ubersuggest SEO tool
Links
- Intake
- Anaconda
- Dask
- Fast Parquet
- IDL
- Space Telescope Institute
- Blaze
- Quilt Data
- Arrow
- Data Retriever
- Parquet
- DataFrame
- Apache Spark
- Dremio
- Dat Project – distributed peer-to-peer data sharing
- GeoPandas
- XArray
- Solr
- Streamz
- PyViz
- S3FS
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you get everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or building your CI pipeline, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:36] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing Martin Durant about Intake, a lightweight package for finding, investigating, loading, and disseminating data. So, Martin, could you start by introducing yourself? Hello. My name is Martin Durant. I work for Anaconda
[00:01:51] Unknown:
and have done for nearly 4 years now. I work in the open source part of the company, and I'm involved in packages such as Dask and filesystem things, and file format loaders like fastparquet. But I've been on the team making Intake for a bit over a year, and Intake is now my main thing. So I'm the lead developer of Intake. And do you remember how you first got introduced to Python? I do. So, it's been a while. In my former life, I used to be an astronomer. And when I started my postgrad studies, the people around me were using a thing called IDL to do the majority of their data analysis and plotting, that kind of thing. But my supervisor was using FORTRAN for much of that. So he told me, go off and use whatever you like. And after struggling with IDL for 6 months or a year or something like that, I heard of this thing called Python, which the Space Telescope Institute was getting interested in at the time. So I thought I'd give it a go. I sat down, I went all the way through the introductory tutorial on the Python site, and I never looked back. Well, I won't say how many years ago that was, but it was a few.
[00:03:06] Unknown:
And so as you said, now you have become the lead on the intake project. So can you start by giving a bit of an overview about what the project is and the story behind why it was created? Yep. Intake
[00:03:18] Unknown:
has been around as an idea, a sort of concept that something like this should exist, for a while. And in fact, there are lots of packages out there that do some part of the work that Intake hopes to cover. But the specific idea of Intake came about, I believe, a year and a half ago, something like that, maybe a bit more. And it was an idea that was very loosely born out of this loose Blaze concept that had been around at Continuum and then Anaconda for a long time, the sort of overall library that's going to bring together all of your data needs. The idea for Intake, though, was to be much more specific than any of that: to be at the cataloging and loading side, not to deal with many more complex things that you might want to do with data, just give you a simple interface that is a layer over many other Python libraries, do the minimum necessary to let you get your data, and then do all of your analysis and everything else in the way that suits you. So that was the initial concept.
Python is, of course, very rich. There are lots of these libraries out there for loading from data format X, and Intake doesn't want to redo that work. It's very much in the vein of the PyData ecosystem: make use of what's out there, and try and work well with all of them. So that's the idea. And it actually started, I think, around January of last year; we started to actually have some working code at that time. And we released it in the summer, as we announced it in blog posts, and I started to go to conferences, that kind of thing, to talk about it. And,
[00:05:01] Unknown:
now it's thoroughly usable, and I encourage people to try it out. If you're listening to this, then I hope you find it interesting. And as you mentioned, there are a number of other libraries, both in the Python ecosystem and elsewhere, that handle some of this data loading challenge. But as you said, Intake is more for providing an abstraction and sort of unifying layer among them. But I'm wondering if you can talk a bit about some of the other projects that overlap some of the functionality of Intake in terms of things like the data loading or the data cataloging? So I'm thinking of projects like Quilt Data
[00:05:36] Unknown:
or the Arrow project or Data Retriever. Aside from "this is interesting" and general comments, the most common concrete thing that people have said to me when they've heard me talk about Intake is: hey, my company has something like that. We tried to write a platform for doing just this thing, but we only got partway. And it does some of it, and then it got abandoned because the developer left or something like that. So it's very common. There are lots of proto-Intakes out there, ones that never even got released.
And, hopefully, Intake will be the thing that unifies all of these things. It's very much designed with simplicity, in the code and in use, in mind, and some of them are actually available. So, of the ones you mentioned, I'll talk to Quilt and Data Retriever. The latter I actually hadn't heard about before coming to this conversation. I had to look it up, and it's very nice and simple. It gives you downloads of data files. There is a small, limited number of formats that it supports, and it does that one thing, and, though I haven't actually tried it, I don't doubt that it does it quite well. Quilt, on the other hand, is quite different. It is a hosting environment. You give your data over to Quilt. They manage it, and then you get an API to their system, which allows you to see what data you have and access whichever of those datasets you want. Intake, I started off by saying, wants to be simple. It wants to do the minimum amount of work. It doesn't host your data. Your data can live wherever you want it to live. So that could be on shared disk drives. It could be on GitHub. It could be on Amazon S3. It could be in a number of places, including your local disk, but that's just one of many. And the catalogs themselves can live in all of these different places. So if you want to use Intake to share data, you can just put your catalog into, for example, a gist on GitHub or a repo, point people at that, and you don't need infrastructure.
You don't need a server. You don't need to set up anything. You can just load it and go. So it really is going for that simplicity of use. In the Data Retriever case, that is also functionality that is covered by Intake. Intake has a caching system, optional, and it's the catalog author that gets to decide whether a particular dataset ought to be cached or not. And caching can mean downloading the files that hold the data, or it can mean accessing some data resource, whether it's a database or whatever it may be, and forming that into a local version. So, for example, for tabular data, it would be Parquet files. And that's an interesting thing that we've started with from the very first step with Intake: we want to be able to support not just data frames, not just arrays. We want to really give a uniform experience across all the different kinds of data that are out there, all the different places that data may live, whether it's services, whether it's remote storage, but still give a very simple experience to the end user. Arrow, I think we can talk about separately. It's not the space that Arrow is in. Arrow is in the in-memory data format space. It does care about loading data to some extent for a couple of formats. Maybe down the line, that will grow, and it does have some hooks for remote file systems, particularly HDFS.
Maybe some more coming. But for the time being, it's really about representation of data in memory and passing data between processes. So it's sidelong to what Intake does. It's not really the same territory, I don't think. And so
[00:09:22] Unknown:
intake is built to be able to simplify a lot of the work flow for data scientists and data engineers and reduce issues of code duplication that are required for being able to obtain and prepare the data for analysis. I'm curious if you can talk through just the overall workflow of using intake from the perspective of both a data scientist who is consuming it for building these analyses and a data engineer who is likely, defining and maintaining the catalogs and the data and the data sources. That idea of, separation of concerns between
[00:09:57] Unknown:
the owner of the data and the user of the data is really core to how we designed Intake. We want the user, and the users will be more numerous, so often they're the ones that shout the loudest, but we want their experience to be simple. We want them to be able to find some abstract cataloging thing, where they don't need to worry about how those catalogs are created and how they reference one another, to be able to search through those, know the metadata of the datasets themselves, and use a single method to get a handle to the data, whether it's in memory or some Dask thing or some Spark thing or whatever it is that they're used to analyzing it with. It should be a one-liner, once they've found the dataset of interest, to be able to work with that. And then Intake is done, and they can get on with the actual work that they're paid to do. So from their point of view, they shouldn't really notice Intake that much. They may be using our GUI. They may be using search functionality, that kind of thing. But the less interaction that they have with Intake that nevertheless lets them get on with their job, the better it is from the end user's point of view. From the engineer's point of view, or the data owner's, they create the catalogs or cataloging systems. Often, they will be the same people that curate the data, so they decide this particular dataset is the authoritative version of whatever it is that it's supposed to be. And the time that they are doing that curation is when you want to capture as much metadata about it as you can. That process is very important, and Intake provides, in the base case at least, a very simple way of writing down all of that metadata. You can just include it into a YAML file as a dictionary-like thing and put everything into that metadata block that you, as the catalog author, find interesting. But you also provide the set of arguments to the particular loader. We call these data drivers. The particular set of arguments which will load that data in the way that it ought to be loaded. So you might have a choice of different packages to load it with. You might have a choice of various specific options that make sense for that data type. Maybe only some columns of a particular dataset are of interest if you think about this data in one particular context; you can have multiple versions of that, maybe.
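To make the workflow just described concrete, here is a minimal sketch of a catalog entry and the corresponding one-liner load. It assumes the Intake 1.x-style YAML catalog format and the built-in `csv` driver; the dataset, file names, and metadata are invented for illustration, not taken from the episode.

```python
# A minimal sketch (assumes intake, pandas, and dask are installed): write a tiny
# CSV, describe it in a YAML catalog, then load it back through the catalog API.
import intake
import pandas as pd

# Hypothetical example data, written locally so the sketch is self-contained.
pd.DataFrame({"region": ["east", "west"], "total": [10, 20]}).to_csv("sales.csv", index=False)

catalog_yaml = """
sources:
  sales:
    description: Example sales table (hypothetical dataset)
    driver: csv
    args:
      urlpath: "{{ CATALOG_DIR }}/sales.csv"
    metadata:
      tags: [sales, example]
"""
with open("catalog.yml", "w") as f:
    f.write(catalog_yaml)

cat = intake.open_catalog("catalog.yml")   # a catalog is itself a data source
print(list(cat))                           # ['sales']
print(cat.sales.description)               # the metadata travels with the entry
df = cat.sales.read()                      # the "one-liner" load into a DataFrame
print(df.head())
```

In this split, the data engineer maintains the YAML (pointing at S3, a database, or wherever the data really lives), while the data scientist only ever sees `open_catalog` and `read`.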
You could present some specific options to the end user. Do you want the production set? Do you want the latest leading-edge set, or something like that, or just a sandbox version? All of those things you could encode into a catalog. You can also structure these catalogs so that they reference one another in a hierarchical model. And that's, again, a clue to the user on how to go about finding the correct data that they need. The best situation from the data engineer's point of view is that the data users know what data is out there. So it's a way of doing this dissemination piece just by, in the simplest case, putting YAML files in the correct position. It can be more complicated than that too if you need something more specific. And then the user, because they've used Intake before, knows that the same three methods of interest will do everything that they need. They shouldn't need to come back constantly to the data engineer saying, hey, I need this latest bit, hey, how do I load this thing? Or go to their colleagues saying, do you have that little piece of code that loads the current dataset that I need to work on? So, by making clear the different roles around this,
[00:13:55] Unknown:
data story, you save everybody a lot of work. And as far as the metadata that you can embed in the catalog definition, I'm assuming that that's also passed on to the end user for being able to inquire about some of the different attributes of the data, such as what you were saying as far as the different columns that are of interest or maybe having a last generated time stamp so that somebody can inquire about the freshness of the data that they're analyzing without having to go back to the catalog owner to ask that question? Yes. Although all of it is optional.
[00:14:28] Unknown:
You could have an empty metadata section if you chose to make a catalog that way. That's obviously not recommended; the value added is being able to include the information in this. Some specific things come from the dataset itself, so the number of columns that are available is usually something, depending on the data format, that the data itself can tell you. So that's already often encoded in some kind of metadata. For Parquet, for example, there's a special part of the file that has a list of the columns that you expect to get there. For other things, like, say, JSON, you might not know what columns exist. There's no specific metadata location in common JSON. Maybe you have a JSON API description or something. So there are lots of possibilities, and the design of how you get this encoded into metadata is one of the most important considerations when you write a data driver. There will be some common things.
Yep. People will want to know time stamps. They will want to know tags, for example; that is a common use case. So, category of data, for example: if you want to find all of the datasets that are in the sales category, then you should be able to search "sales" and it will find all of them, because it's there in the metadata. And that's already possible now. Time stamp is actually a bit more subtle because, commonly, "latest" is the time stamp, because you're loading from some remote location, and the data in that location can be updated in place, which is very useful. But then the question of versioning is another interesting topic that maybe we can talk about later, especially if your data source is actually some database system, and instead of having a new version of files, you have new entries in there. You have individual updates to individual rows, and you're not normally, in the database case, talking about versions of data. Usually, you are getting the latest.
Maybe you have some, snapshotting capability.
[00:16:40] Unknown:
And going back to the catalog definition again, you mentioned the ability to have them defined hierarchically, and I know that in terms of data, particularly as you start to accrue a larger volume and variety of it, it can be difficult to understand just what data is even available, what exists, and what the overall discovery and cataloging capabilities are. And so I'm wondering how intake addresses some of those problems. And at a high level, it seems that you could have 1 catalog definition that's just a sort of meta definition that defines all of the other catalogs that exist that just gets updated as you add new data sources. But I'm wondering if you can just talk through the overall experience of the discovery process for working with intake and understanding what data exists within an organization?
[00:17:29] Unknown:
In the Intake model, a catalog object is itself a data source. So any catalog can refer to any other catalog as one of its entries, and that can be, most commonly, just a URL, or it can be something more complicated. But you can imagine a simple case in which, let's say, you have your data on S3. You could have your catalog definitions also on S3, as a particular bucket and key, as your master catalog. And in it, as you say, this could just be a YAML file, and each of the YAML blocks within each of your entries says: this is a catalog, and it's at this other URL. From the user's point of view, they would load the first one of these, or maybe their system administrator has provided it as one of the automatic lookups that Intake does for you at import time, and then it's just there. And you can do a list of your master catalog, and you will see it has a certain number of entries and that those entries are themselves catalogs. You can do a search method which will walk down into the hierarchy looking for particular terms in the metadata and description of all of those datasets. Or, something that has now been heavily developed and is becoming a pretty pleasant experience: we have a GUI, which you can launch either in a notebook or as a stand-alone application, with which you can click down the tree of your catalogs and see what entries are in there. And for each of those entries, you can see their description and metadata in front of you. You don't need to type anything to do that. And, also coming soon, and this is already partly implemented, you can do some on-the-fly analysis and visualization of those datasets within the interface itself. That obviously will work better for certain sizes of datasets than others, and we need to care then about where the execution is happening if you're doing some aggregated visualization, or whether you actually just want to see the data. You still have to load it in some process somewhere to display it. So it does open a new set of questions. But in general, the idea is that whatever execution environment you are using anyway is the thing that's going to get you some quick look at your data, perhaps with a great deal of subsampling, maybe only displaying 1 in 1,000 rows or something like that. That is all possible within the interface, or it will be very soon.
So then the user of the data can know pretty rapidly if that's actually the correct dataset that they want for the job in hand. There are some additional niceties about this. Catalogs are hierarchical, but they're not exclusive. It's not like a directory hierarchy, so there's not only one parent per catalog. You can cross-reference them. You can have multiple entries pointing to the same set of files, or the same data service, but with different options, where you can then describe what those different options are. And, again, following those descriptions and metadata, the user can pick the one that's correct for them. I already mentioned parameters that you expose to the user so that they can make some choices at load time if they need to. Those parameters can include credentials, which is a fairly important story within a commercial institution, to have the correct identity attached to each data access. Intake doesn't store those credentials, but it provides ways to pass those credentials on. And I've wandered a little far from hierarchies now. But, yeah, the basic idea is that you can click through your hierarchy, you can search through your hierarchy, or you can do all of these things in code if you want to. So we expose as many different ways to do these things as possible.
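Here is a rough sketch of the nesting and search ideas above. It assumes the `yaml_file_cat` driver name that Intake 1.x uses for catalog-in-catalog entries, and a `search` method doing a text match over entry names, descriptions, and metadata; the catalogs and datasets are made up for illustration.

```python
# A sketch of one catalog referring to another; everything is written locally here,
# but the `path` could equally be an S3 URL, a GitHub raw URL, a gist, and so on.
import pathlib
import intake

pathlib.Path("sales_cat.yml").write_text("""
sources:
  monthly_sales:
    description: Monthly sales figures (hypothetical dataset)
    driver: csv
    args:
      urlpath: "{{ CATALOG_DIR }}/sales.csv"
    metadata:
      tags: [sales]
""")

pathlib.Path("master.yml").write_text("""
sources:
  sales_catalog:
    description: All sales-related datasets
    driver: yaml_file_cat
    args:
      path: "{{ CATALOG_DIR }}/sales_cat.yml"
""")

cat = intake.open_catalog("master.yml")
print(list(cat))                 # ['sales_catalog'], an entry that is itself a catalog
print(list(cat.sales_catalog))   # ['monthly_sales']
hits = cat.search("sales")       # text search down the hierarchy
print(list(hits))
```

The graphical browser mentioned in the conversation is exposed in recent Intake 1.x releases as `intake.gui`, which walks the same hierarchy interactively.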
[00:21:24] Unknown:
And so another challenge that comes up often when dealing with data is the idea of capturing the lineage or provenance, and that can change as some of the sources change or the manipulations that generate a derived data source change, and that ties into the versioning aspect that you mentioned as well. So I'm wondering if you can just talk through a bit about the data lineage capabilities of Intake, and also some of the story around data versioning and catalog versioning.
[00:21:54] Unknown:
For the time being, Intake is something that reads your data and doesn't write it. So you wouldn't use Intake to derive a new dataset from a previous one. You would use Intake to load it and then manipulate it however you want to. We are developing hooks to be able to edit catalogs in place so that you can add new entries into them, and that would be the opportunity to add extra information about the current session, of how it was generated. That kind of thing often makes sense in the context of a larger processing framework. So, you may be aware that Anaconda has a for-fee enterprise platform within which, whenever you're in the platform, you're running within some container, and that container knows various pieces of information about who the user is and what the resources are, what the environment is, that kind of thing. So if you were saving data into an Intake catalog from such a system, then you would have a lot of information about how it was generated, especially if it's a job that is run regularly in an express, declarative way as opposed to at an interactive prompt. But Python being Python, you can do anything at an interactive prompt, right? So the more constrained your execution environment, the more information you can have, and then it's up to the administrator of the data system to capture that information at the time that data is ingested. Intake may eventually go down this road of being more of a data pipeline type system, in which it actually does analyses for you and executes jobs, that kind of thing. For the time being, we're of the opinion that we want to do the one thing, which is loading, extremely well. To go to both sides of that story, the reading and the writing, and become some kind of Dremio-like system or whatever (there's a whole load of those kinds of systems around too) would actually make Intake less good at the particular thing that it's aimed at. So we're very hesitant to try and do too much with Intake, certainly in the near to medium future.
The only exception to this rule is that there is a functionality in Intake called persist, which is: read the data from a remote source, typically a database or something where your query would take a long time to process, and then write that table or array or whatever it may be to a local format as is. And then, because we've made the data ourselves, we can capture all these things about where it came from and the time that it was executed, and be able to refer back to the original data, so that at a later stage you can either automatically say this data is now stale, we need to refresh it, or the user can say, I would like the most recent version of this data.
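A minimal sketch of that persist mechanism, reusing the hypothetical `catalog.yml` from the earlier sketch. The calls follow the Intake 1.x persist API as I understand it, so treat the details, especially the `ttl` argument, as assumptions rather than a definitive recipe.

```python
import intake

cat = intake.open_catalog("catalog.yml")   # hypothetical catalog from the earlier sketch
source = cat.sales                         # in practice this would be a slow remote or database source

local = source.persist()   # materialise a local copy; DataFrame sources are written as Parquet
df = local.read()          # later reads come from the persisted copy, not the original

# The persisted copy records where it came from, so it can be checked for staleness
# or refreshed later; e.g. source.persist(ttl=600) is meant to mark it stale after
# 600 seconds (argument name assumed from the Intake 1.x documentation).
```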
So that's one side of it. When it comes to working with the versions of the data itself, though, for reading, that is something that is currently under development, and it will depend on how you want to store your data. There are many different places that your data can live, and each of those may have a different concept of what data versioning means. Probably the first one of these that we will expose directly to the user will be Git, because Git is the best-known versioning thing that's out there, and Git makes a very natural storage place for your catalogs. Catalogs can be text files, as I've said a few times; YAML is the standard format that we use. So you can be versioning your catalogs using Git anyway. But if you have a reference to a repo and you regard that repo as a cacheable thing, then Intake can take care of cloning that repo, offering the set of versions, maybe just tags (this is something to be decided), and exposing those to the user so that they can get the latest or they can go back in time. There are a number of things like that that I would regard as caching mechanisms. That is, you can get a local copy of some data that lives elsewhere. Docker is another obvious one that uses versions of that sort. One we already have implemented, which is implemented but experimental, is Dat. Dat is a much bigger data concern, but one in which you have absolute hashes to every version of your data, so that you can very much reproduce the data at a particular point in time. So Intake can interact with that. You can say my data lives at this Dat hash, and at the time of access, Intake will call Dat and have it do that referencing for you. So, again, it's not Intake that's keeping hold of the versions.
It's an external provider of versioning information. Intake maybe will have the ability to do snapshots itself, in the same way that currently we can persist data from remote and we can cache data from remote; maybe we will eventually version those things so you can have snapshots of them locally. Exactly how that will be done and what the experience will be, that's something to be decided. But it's probably something that will appear in the next 6 months, I would reckon. And so
[00:27:18] Unknown:
can you describe a bit now about how intake itself is actually implemented and some of the extension points that it provides for somebody who wants to customize for different data sources or different, container formats that it'll support? Yeah. We've,
[00:27:34] Unknown:
tried to make Intake approachable to the developer. We have a simple class structure, so that if you want to implement something like a data driver, or indeed a catalog, which is a specific type of data driver, then you need to just derive from a particular class and implement a small number of methods, I think like 5 or 6 methods. And these classes, in a simple case, for something fairly straightforward like a text file maybe, end up being pretty short. They're basically one page of code. I have a tutorial lying around that was using the GitHub API as a data source. So the parameters that you would pass to it would be the repo that you're interested in and the type of issues, whether open or closed (it was the issues for this; it was just a throwaway example). Those are the parameters that it took. It connected to GitHub, and it would give you back all of those issues, with time stamps and author and whatever else, as a data frame in that case. And the code to do that and make it look like any other Intake data source is maybe 50 lines or something like that, probably less.
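For a sense of what such a driver looks like, here is a rough sketch along the lines of the GitHub-issues tutorial mentioned here. It follows the Intake 1.x plugin pattern of subclassing `intake.source.base.DataSource`, but the class and its details (method bodies, `Schema` fields, the use of `requests`) are illustrative assumptions rather than the tutorial's actual code.

```python
import pandas as pd
import requests
from intake.source.base import DataSource, Schema


class GithubIssuesSource(DataSource):
    """Hypothetical driver that presents a repo's issues as a DataFrame."""

    name = "github_issues"      # the driver name a catalog entry would reference
    container = "dataframe"     # what read() hands back
    version = "0.0.1"
    partition_access = False

    def __init__(self, repo, state="open", metadata=None):
        self.repo = repo        # e.g. "intake/intake"
        self.state = state      # "open" or "closed"
        self._df = None
        super().__init__(metadata=metadata)

    def _fetch(self):
        # Fetch once and cache, so that schema discovery and read share the data.
        if self._df is None:
            url = f"https://api.github.com/repos/{self.repo}/issues"
            resp = requests.get(url, params={"state": self.state})
            resp.raise_for_status()
            self._df = pd.DataFrame(resp.json())
        return self._df

    def _get_schema(self):
        df = self._fetch()
        return Schema(
            datashape=None,
            dtype=df.dtypes.to_dict(),
            shape=df.shape,
            npartitions=1,
            extra_metadata={},
        )

    def _get_partition(self, i):
        return self._fetch()

    def read(self):
        return self._fetch()

    def _close(self):
        self._df = None
```

Once registered through Intake's driver entry points, a catalog entry could then refer to it as `driver: github_issues` with `repo` and `state` as its arguments, and it would behave like any other Intake data source.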
That's 50 lines including class declaration and documentation and that kind of thing. There are some other extension points in there. So you mentioned container formats. The containers are, again, a specialized kind of data source. We haven't mentioned the Intake server at all. It is something that might be very useful to people, but it does really complicate the set of things that Intake can do. For this conversation, I'll just say that there is an Intake server, which you can just run as a Python process. You don't need to have any special infrastructure for it. And it can serve catalogs. So you can connect to this server, and it gives back a list of data entries or sets of catalogs.
And the second use for the server is that you can actually stream data from it. So it may be that the server lives behind a firewall, or it has some special credentials access, so that it can see data that the user on a different machine can't. And then you can use the server to stream that data over a single socket. So that's a very nice convenience. But to be able to do that, you need to agree on what that data type is. So for basic things like arrays and data frames, those are plumbed into Intake. So if you expose a data frame from the server and the client wants to be able to load it, proxying through the server, then it gets chunks of that data frame, and it doesn't need to know what the original format was. Perhaps it doesn't have that driver, and that's the reason that you want to proxy: the server has the driver, but the client doesn't. Those few types, like data frame and array, we call those containers. And the container is also the thing that's responsible for persisting: if you do a persist, then a data frame gets written as a set of Parquet files. And where that is defined is that, because it's a data frame, the definition of the data frame container has a persist method, and that defines how it gets written to disk locally.
And it turns it into Parquet for all cases. Writing a new container is similarly pretty easy. So we have a couple. We have GeoPandas, a specialized data frame in which you have a notion of geometry, and xarray, which is related sets of numerical arrays where you have coordinates and metadata and perhaps some hierarchical structure to them, in a sort of NetCDF-like model of data format. So those were new, and they live outside of the main Intake repo, but you can use them easily in conjunction with Intake. If you had some other type of data, and we're looking, for example, at streaming data, at fitted machine learning models, these are interesting things that you might want to be able to pass from one place to another without having to do the basic load yourself.
Those are all candidates that would be fairly simple to implement. It's just a matter of time, or somebody getting around to doing it. In practice, data frames and arrays and free-form text-like things such as JSON fill the needs of the great majority of users, certainly in the business case. But the fact that there are lots of different types of data out there is something that is often glossed over in other data access conversations out there. I don't just want 20 different ways to load data frames. Loading from XML is not very different from loading from JSON. So in my mind, those 2 are the same thing, really. But there are lots of fundamentally different types of data in existence that
[00:33:04] Unknown:
often you just don't have a convenient way of loading at all. And as far as the overall implementation of Intake, you mentioned that it's a fairly young project. So I'm curious what some of the assumptions are that are sort of baked into the origins of Intake and how it's designed, and some of the ways that the architecture has evolved, and maybe some of the ways that those assumptions have been challenged or updated in the process? It's a tricky thing to ask a software author what their assumptions are.
[00:33:35] Unknown:
It's easier to see from the outside, of course. We did start off with a concrete set of data sources that we were interested in, which was partly driven by an external client of the company. So they had their own interests, so we implemented those first. And some of them just don't get used, or at least don't get used as far as I know. There may be people out there that do and are happy with them; you could say perhaps there are absolutely no problems with them, and that's why I haven't heard from them. But I suspect that they don't get used very much. So, Solr, for example, is a queryable, database-ish service typically used for system logging on some cluster. So there are doubtless lots of Solr systems out there, but probably not that many people using Intake to derive data frames from Solr queries, because you would typically use it with a graphical front end where you want to do anomaly detection or something like that. So there are a few of those that we initially made and then realized that these are probably not the most important ones to be working on. I think what I said just before about the untouched data types that are out there: there's a huge number of them, and there's a huge amount of data in them.
And if we just follow on trying to make data frame like things, then we'll be missing out on those, and the people that want to be able to load from those things will also be missing out. So I think that was the biggest pivot: to realize, and we had talked about this early on, that we needed to really extend the set of data types that Intake can interact with. So at the moment, I've started a scratch repo on streaming data. There are not many frameworks in the Python ecosystem that deal with really real-time data. There is one that I'm now developing called Streamz, with a zed at the end, and it seems like a fairly good candidate for being able to do this.
Since I now develop for it, I can customize it to be friendly to intake, of course. That that's convenient. But exactly how the user experience of that will be is not yet entirely clear, and that is the kind of thing that makes me wonder whether there are some things in intake that, you know, we're we're a bit too set in our ways, concrete files, concrete versions of data. But now we have something that's truly dynamic. What do we need to change to to make that work?
[00:36:24] Unknown:
And as you mentioned early on, intake focuses primarily on integrating with the PyData ecosystem and solving a lot of the problems that developers in that space are encountering. And I'm wondering what are some of the other types of communities that either are currently or could be benefiting from the work being done on intake? I think I'd start off by saying that the pydata
[00:36:47] Unknown:
ecosystem is truly enormous. It encompasses basically anybody who's using NumPy or Pandas or scikit-learn, or even related things like TensorFlow. Those are all PyData ecosystem players. Some are more deeply integrated with other tools than others, but that's the nature of things. Python makes it very easy to play along with all of the different libraries that are out there. It's about finding the users rather than the code ecosystems, I think. If all of the people that currently use Python for data analysis, if a significant fraction of those would be using Intake to catalog those data systems and to access their data, then that would be an enormous number of people. And I'd say, let us get to that point before we try and say who else might be benefiting from it. We're not particularly at that stage. I think you were going to ask me this later, but cross-platform interoperability is something that we think about, and we try and make it possible in some cases, but Intake itself remains implemented in Python, and it will be for the time being. So all that we can promise is that where there is a platform-independent description of a dataset, and there are quite a few of these different ways of doing it, Intake is capable of interoperating with them, and ideally of expressing its own catalogs in those formats as well. Both of those, particularly the first option of being able to read their data, are things that there are already several examples of, such as STAC and THREDDS, which are scientific
data description standards, and Intake can read from those catalogs and expose their data as xarray objects. But being able to go the other way, taking Intake catalogs and generating those platform-independent prescriptions, is also something that is being worked on already. And if that's there, then you don't any more need to ask the question of whether Intake should also be implemented in R, for example. Maybe it should. Maybe somewhere
[00:39:05] Unknown:
a couple of years from now, that that can be something that we think about, but, definitely not for the time being. Yeah. It's definitely valuable to have an understanding of the scope that you're aiming for and the limitations that you're willing to accept in order to serve your target community well, rather than trying to be everything to everyone and ultimately being useless to everyone. Yeah. And, obviously,
[00:39:27] Unknown:
that we have a bit of bias at Anaconda. We support and we develop for R too, but Anaconda is, unofficially perhaps, Python first. And we have a much bigger name in the Python world, and we develop a lot of Python-oriented open source products, projects I should say, I guess, if it's open source. So there's the convenience that I know the people that develop PyViz. So if I need some visualization stuff tailored to Intake's needs, then that's a very easy conversation for me to have. That doesn't mean that I want to convert the whole world away from, I don't know, plot layout or whatever. It's just that we can move a lot faster within that. And in terms of the overall work
[00:40:15] Unknown:
of building and maintaining the intake project, what have you found to be some of the most challenging or complex or sort of, unexpected aspects of building and maintaining it? Personally, the biggest
[00:40:29] Unknown:
problem, the hardest thing really, has been the messaging from the start. This is a tool that people don't necessarily realize would be useful for them until they've seen it demonstrated, until they've seen some real concrete examples, until lots of other people have created Intake catalogs of their data as an easy way to share it. I can give talks, I can talk about Intake, and people generally find those quite interesting, and it's a novel idea for them. But actually getting people on board and using it requires crossing a certain hurdle, a certain amount of effort, for some institution or individual data-handling person to make that effort to actually come over and use Intake. It's very much a different experience to when I was writing fastparquet, or when I was writing s3fs for accessing S3. Those were needs that people already had, and they were waiting for a library that fulfilled that function for them. And then it came, and people used it, and they were very grateful. In this case, it is about finding out exactly what people's pain point is so that I can address it: hey, is this something that you're struggling with? Are you aware that Intake can do this for you? Intake has perhaps already too many facets, too many functionality pieces within it. So the very front page of the Intake documentation has, like, the 4 main use cases. Are you one of these people? 4 is already quite a lot. It's much easier when you can say: this is what you're trying to do; this is the thing that solves it for you. Getting that tone right is definitely far and away the thing that I have found the hardest.
So I appreciate conversations like this to try and make it as clear as possible what it is that we're after, why we're working on intake.
[00:42:32] Unknown:
And as far as your experience of using intake on your own and helping some of your customers and community members leverage intake, what have you found to be some of the most interesting or unexpected or innovative ways that you've seen it leveraged?
[00:42:47] Unknown:
So I know a fair number of scientists, particularly in the atmospheric community, that routinely use Intake in their deployments. And a lot of data is referenced in those catalogs, and they use it every day in a very open way, so anybody can access these catalogs and know what those datasets are. There is one particular project to take that set of Intake catalogs, and it is itself a GitHub repo. So you can submit to this GitHub repo. They have CI on it, which takes all of these prescriptions and checks that you can actually load all of those datasets. And then it takes them and renders it all into a very handy website. That's another way, an unofficial way if you like, to browse through Intake catalogs: there is this one particular project within which they render everything into static HTML so that you can just click through the whole thing. So that's pretty cool. And then a different collaborator, this is at Brookhaven, I believe. So, scientists again, but of a different type, in which they've taken an existing experimental data repo. That's implemented in Mongo, MongoDB, and it keeps track of every single experiment that is done within this particular group, a truly massive amount of information that is in there. And they've implemented an Intake hook to this so that you access the server, and it looks like a catalog from the user's point of view.
They submit some basic query parameters. And the server, when it gives you the catalog back, what it's actually done is it's taken this query and it's talked to Mongo. And from Mongo, it's got back a list of entities. These are all datasets. And then it's packaged that into something that, to the user, looks like an Intake catalog, so that they can just load the individual datasets from a source. That's the kind of next-order Intake thing. We're no longer talking about just files. Writing things into YAML files is a really quick way, very convenient, to get going. But, actually, you can do this on a much bigger scale without having to write a lot of code at all, without having to institute complicated cluster topologies or anything like that. So there's a couple of examples that have impressed me. I try and maintain, well, there is an examples page on the Intake docs, so I try and maintain links to all of these different things there. People who are building, like I mentioned, STAC and THREDDS, and CMIP is another one of these scientific standards. People who maintain those, and you can see all the different datasets that get added to these collections.
And I'm not a scientist in that field. I have no idea what these things are. I just know that they look really complicated and there's a massive amount of stuff there. So, yeah, I find those things very impressive, and it is really quite gratifying to see other people who are working within this Intake landscape.
[00:45:58] Unknown:
And are there any other aspects of the intake project or the challenges of loading and processing data that we didn't discuss yet that you'd like to cover before we close out the show? It's a pretty open space.
[00:46:11] Unknown:
So Intake is young. We're very much open to having people propose things that we haven't thought of. We have, you know, at Anaconda, we have certain customers. Someone like me, I have particular experience with what I've done with data and the open source projects that I've worked on, but this is really just the edge or tip of the iceberg, if you like. So there are so many things out there. Come and talk to us: hey, this is a kind of data, or this is a data curation system, that we're interested in; how would I go about writing an implementation for Intake to be able to interface with this thing? And that includes, for example, we started off with Data Retriever, this tidy little library that loads some stuff for you in a very particular way. Intake could be a layer over that too. Hey, here is one of your catalogs that is referenced from your master catalog; it just happens to be all of your datasets that live in Data Retriever. You have another entry that sits next to it: these are all of your tables that live on your Spark cluster, for example. You can totally do that. So Intake really wants to be this very light and simple layer over everything else that might exist, and what we need is users and use cases above all. For the actual aspects of getting things into memory or loading things in chunks into a cluster, the kind of thing that Dask does, those are details I find interesting, and there are so many different ways of doing it and so many different ways of storing your data.
Usually, a lot of that code already exists somewhere. So usually, it's a case of writing something that can call that code in a structured way that makes sense from the Intake point of view. It's usually quite easy. And for anybody who does want to follow along with the work that you're doing or provide feedback or suggestions
[00:48:12] Unknown:
for Intake, or follow it along as it continues to grow and mature, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a tool called Ubersuggest that is just a handy little website for doing some easy SEO analysis, for things like keywords and backlinks for your website, which is something that I've been trying to work on for the podcast. So I found that to be useful. And so with that, I'll pass it to you, Martin. Do you have any picks this week? I haven't actually had time to pick anything. Well, I want to thank you for taking the time today to join me and discuss the work that you're doing on Intake. It's definitely an interesting tool, and it's great to see another entry into trying to make it easy for people to have an opinionated and standardized method for accessing and analyzing their data. So I look forward to experimenting with that on my own, and I hope you enjoy the rest of your day. I wonder, before you go, how did you come across Data Retriever?
As I said, I'd never heard of it. So, you wanted to write about Intake and you did some Google search, I suppose, or did you already know about it? Data Retriever I actually had on a previous episode. So I'll add a link to that in the show notes. And I think I just saw it in one of the Python newsletters that went out. I think it was maybe a couple of years ago at this point. Mhmm. Okay. Alright. Thank you. Thank you for your time.
Introduction and Episode Overview
Interview with Martin Durant
Martin Durant's Background and Introduction to Python
Overview of the Intake Project
Comparison with Other Data Loading Libraries
Workflow for Data Scientists and Data Engineers
Metadata and Cataloging in Intake
Data Discovery and Hierarchical Catalogs
Data Lineage and Versioning
Implementation and Extension of Intake
Assumptions and Evolution of Intake
Communities Benefiting from Intake
Challenges in Building and Maintaining Intake
Interesting Use Cases of Intake
Open Space for Development and User Contributions
Contact Information and Closing Remarks