Summary
Analyzing and interpreting data is a large portion of the work involved in scientific research. Getting to that point can be a lot of work on its own because of all of the steps required to download, clean, and organize the data prior to analysis. This week Henry Senyondo talks about the work he is doing with Data Retriever to make data preparation as easy as retriever install.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
- Visit the site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Henry Senyondo about Data Retriever, the package manager for public data sets.
Interview
- Introductions
- How did you get introduced to Python?
- Can you explain what Data Retriever is and the problem that it was built to solve?
- Are there limitations as to the types of data that can be managed by Data Retriever?
- What kinds of datasets are currently available and who are the target users?
- What is involved in preparing a new dataset to be available for installation?
- How much of the logic for installing the data is shared between the R and Python implementations of Data Retriever and how do you ensure that the two packages evolve in parallel?
- How is the project designed and what are some of the most difficult technical aspects of building it?
- What is in store for the future of Data Retriever?
Keep In Touch
- Github
- @henrykironde on Twitter
Picks
- Tobias
- Henry
Links
- Weecology Lab
- University of Florida
- Data Retriever
- LG
- R
- Julia
- Open Knowledge Foundation
- Frictionless Data Format
- Data Weaver
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or experimenting with something that you hear about on the show. You can visit the site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Henry Senyondo about Data Retriever, the package manager for public datasets. So, Henry, could you please introduce yourself?
[00:01:00] Unknown:
My name is Henry Senyondo. I'm a software developer at the Weecology Lab at the University of Florida. I maintain the Data Retriever and a few packages that wrap around it.
[00:01:13] Unknown:
And do you remember how you first got introduced to Python?
[00:01:16] Unknown:
I was first introduced to Python during graduate school at KAIST in South Korea. I was in the natural language processing lab, and we wrote a few scripts to clean up and annotate data; the best way was to use Python. After that, I worked at LG for a year on voice recognition for the smart TV, and I used a lot of Python for scripting.
[00:01:45] Unknown:
And could you briefly explain what Data Retriever is and how you first got involved with the project?
[00:01:51] Unknown:
The Data Retriever is a data management platform that enhances research productivity. Researchers spend a lot of time trying to clean up, standardize, collect, and sometimes even search for datasets. If we could reduce the amount of time scientists spend cleaning up and standardizing data, they could spend more time focused on research, so as to come up with solutions to the problems that we have. We came up with the idea of reducing that time by creating a platform that could take publicly available datasets, clean them, standardize them, and get them ready for analysis.
The Data Retriever is written in Python, and it has a few other libraries that wrap around it, like the R package, and we're building a Julia package.
[00:02:46] Unknown:
Most people are going to be familiar with package managers from their operating system, from the Python package repository, or from other languages. So what are the similarities, and in what ways does managing data diverge from a package management perspective?
[00:03:03] Unknown:
Package management is the kind of system that helps us reduce the amount of time it takes to install packages, right? The Data Retriever plays a similar role in data science, because a lot of data is supplied to the public in its raw form, and most of it has a lot of errors and abnormalities. With one line of code, the Data Retriever can act as a package manager by searching for the data for you. It looks up the information about a dataset so you can read through it and decide which dataset to use for your analysis. With one line of code, you can get that data onto your system. It supports various database management systems, because we know you may want to keep your data in Postgres or in SQLite; we support several database management systems, and you can easily install the data, in a clean and ready-to-analyze form, into any of them.
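To make the "one line of code" workflow concrete, here is a minimal sketch of installing a public dataset with the retriever's Python API. The dataset name and keyword arguments are illustrative assumptions rather than a verbatim transcription of the current API:

```python
# A minimal sketch, assuming the retriever package's Python API;
# the dataset name and keyword argument are illustrative.
import retriever

# List the datasets the retriever knows how to fetch and clean.
print(retriever.dataset_names())

# Download, clean, and load one dataset into a local SQLite file.
# Other engines (Postgres, MySQL, CSV, JSON, ...) follow the same pattern.
retriever.install_sqlite("iris", file="iris.db")
```

The equivalent command-line form (something like retriever install sqlite iris) is the single line being described here.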
[00:04:12] Unknown:
And one of the things that's often difficult with software packages is the secondary and tertiary dependencies that need to be installed. But from reading through the documentation, it looks like that's not really something that leaks into the Data Retriever situation, because you're just dealing with single sets of data and not necessarily having to pull in dependent information to enrich that data. Is that correct?
[00:04:38] Unknown:
That is correct up to a certain point, because the Data Retriever also uses other libraries. Python has got this awesome package management system whereby you can describe the required dependencies for the retriever that enable us to preprocess data. For example, the JSON engine, MySQL engine, and Postgres engine all have certain drivers that the retriever depends on, so we have to make sure that those dependencies are working perfectly.
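As an illustration of declaring per-engine drivers as optional dependencies, here is a hypothetical setup.py fragment using the setuptools "extras" mechanism; the package and driver names are invented for illustration, not taken from the retriever's actual packaging:

```python
# A hypothetical setup.py fragment; package and driver names are
# illustrative, showing how engine drivers can be optional "extras".
from setuptools import setup

setup(
    name="example-retriever",
    version="0.1",
    install_requires=["requests"],        # core dependency, always installed
    extras_require={
        "postgres": ["psycopg2-binary"],  # driver for a Postgres engine
        "mysql": ["pymysql"],             # driver for a MySQL engine
    },
)
```

With this layout, a user would opt into a backend with something like pip install example-retriever[postgres].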
[00:05:14] Unknown:
And are there limitations as to the kinds of datasets that can be packaged and installed by Data Retriever?
[00:05:22] Unknown:
Currently we handle tabular data, which is delimited data, and we also handle spatial data; those are the main kinds of datasets we're looking at. When it comes to tabular data, we've handled most of the processing needed for a clean, analyzable dataset. For spatial data, there is a lot of development happening in that environment, and we're trying to scale that up to a good amount of preprocessing. When it comes to domain, we handle datasets from various domains.
Initially we started with ecological datasets, but currently we support all datasets. As long as it's a dataset and it's defined clearly by a given standard, the retriever will install it for you in the specified engine, that is, in the specified database management system.
[00:06:27] Unknown:
And are the majority of the datasets that are currently packaged by Data Retriever within a certain bound in terms of their size? And what are some of the challenges with datasets as they start to grow beyond a certain point?
[00:06:40] Unknown:
The challenge is speed. We do have some engines that work well with larger data, but as datasets get larger, the time required to download and install them obviously grows. Currently, though, we are performing at a good, optimal standard.
[00:07:00] Unknown:
And are the target systems generally people's individual laptops or desktop computers, or do you also have datasets targeting a more distributed or larger compute cluster for analyzing the information?
[00:07:15] Unknown:
At this point, most datasets are not that huge, but the Data Retriever is size independent. It can handle any size, as long as your laptop or computing machine can handle it, because it's just a pipeline. We have designed the pipeline to have no limit, because we are not processing whole datasets in memory.
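As a concrete illustration of the "pipeline, not in memory" point, here is a minimal sketch of chunked loading, where rows stream from a CSV file into a database without the whole file ever being held in memory; the file names, table schema, and chunk size are arbitrary assumptions:

```python
# A minimal sketch of streaming CSV rows into SQLite in chunks;
# file names, schema, and chunk size are illustrative.
import csv
import sqlite3

def stream_rows(path, chunk_size=10_000):
    """Yield lists of rows so only one chunk is in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

conn = sqlite3.connect("observations.db")
conn.execute("CREATE TABLE IF NOT EXISTS obs (site_id TEXT, species TEXT)")
for chunk in stream_rows("observations.csv"):
    conn.executemany("INSERT INTO obs VALUES (?, ?)", chunk)
conn.commit()
```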
[00:07:50] Unknown:
And for somebody who is working on building a set of scripts, or on the preparation involved in installing a dataset, what are the steps involved? From what I was reading, it looks like you're using a standardized data format to simplify that process.
[00:08:04] Unknown:
Yeah. The concept of standardization is very crucial when it comes to using datasets. If we have two researchers who use different standards for labeling, categorizing, or defining their datasets, then we have difficulty reusing those datasets. Luckily, the Open Knowledge Foundation has created a specification for data packages. This helps us standardize how you describe your data, and with this specification, people can plug your data into most of their products or software. The packages are in JSON format.
If you have described your JSON data package clearly, the Data Retriever will take in the data package in its raw form. However, there are datasets that need preprocessing, and for those we create different kinds of files to support that preprocessing. We create Python files from the descriptions specified in the data package, do some further preprocessing, and when that's done, the dataset is ready to be used.
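To illustrate the kind of JSON descriptor being discussed, here is a minimal, hypothetical data package in the spirit of the Frictionless Data specification; the dataset name, URL, and fields are invented for illustration and this is not an actual Data Retriever script:

```python
import json

# A minimal, hypothetical data package in the spirit of the
# Frictionless Data specification; dataset name, URL, and field
# names are invented for illustration.
data_package = {
    "name": "example-bird-counts",
    "title": "Example Bird Counts",
    "resources": [
        {
            "name": "counts",
            "url": "https://example.org/counts.csv",
            "schema": {
                "fields": [
                    {"name": "site_id", "type": "integer"},
                    {"name": "species", "type": "string"},
                    {"name": "count", "type": "integer"},
                ]
            },
        }
    ],
}

print(json.dumps(data_package, indent=2))
```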
[00:09:27] Unknown:
When I was reading through the documentation, and as you mentioned, there are the R and Julia wrappers around Data Retriever. So I'm curious how much of the logic is contained within the Python library, and how much of it needs to be shared with those additional language implementations?
[00:09:42] Unknown:
The core functionality is developed in the Data Retriever, the Python tool. We have the R Data Retriever, which wraps around the core library, the core API of the Data Retriever in Python, and we're also developing a Julia wrapper. All of these are wrappers around the core functionality of the Data Retriever.
[00:10:13] Unknown:
So I'm wondering if you can dig into how the project is actually designed, the way it applies the specification for how the data is intended to be installed, and how you're able to support multiple different destination targets for the source data?
[00:10:27] Unknown:
We basically use most of the common programming design patterns, like composition and inheritance. We have engines, which are specifications for how to preprocess the data for a given backend, and the engines are kind of like plugins: if there's an engine you want supported, we can easily plug it in by changing a few specifications. Then we have the core part, the main class that handles all these engines and describes their schemas. We also have the scripts: a script class is populated from the JSON data package, and it defines where to install the dataset and which engine to install it with. So if a user plugs in a new JSON data package, that package is read and initialized as a script. The script communicates with the core part of the retriever, which determines the engine that is supposed to be used, and at that moment it defines the kind of schema used for that engine; Postgres, MySQL, and SQLite all have different schemas. That's how the whole software is set up.
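To illustrate the composition-and-inheritance design Henry describes, here is a simplified, hypothetical sketch of the engine-as-plugin idea; the class and method names are invented for illustration and do not mirror the retriever's actual internals:

```python
# A simplified, hypothetical sketch of the engine-as-plugin design;
# class and method names are invented, not the retriever's internals.
import sqlite3

TYPE_MAP = {"integer": "INTEGER", "string": "TEXT", "number": "REAL"}

class Engine:
    """Base class: each engine maps a generic schema onto its backend."""
    def create_table(self, name, fields):
        raise NotImplementedError

    def insert_rows(self, name, rows):
        raise NotImplementedError

class SQLiteEngine(Engine):
    def __init__(self, path):
        self.conn = sqlite3.connect(path)

    def create_table(self, name, fields):
        cols = ", ".join(f"{col} {sql_type}" for col, sql_type in fields)
        self.conn.execute(f"CREATE TABLE IF NOT EXISTS {name} ({cols})")

    def insert_rows(self, name, rows):
        marks = ", ".join("?" for _ in rows[0])
        self.conn.executemany(f"INSERT INTO {name} VALUES ({marks})", rows)
        self.conn.commit()

class Script:
    """Populated from a JSON data package; installs the dataset
    through whichever engine the user selected."""
    def __init__(self, package):
        resource = package["resources"][0]
        self.table = resource["name"]
        self.fields = [(f["name"], TYPE_MAP[f["type"]])
                       for f in resource["schema"]["fields"]]

    def install(self, engine, rows):
        engine.create_table(self.table, self.fields)
        engine.insert_rows(self.table, rows)
```

Under these assumptions, Script(data_package).install(SQLiteEngine("data.db"), rows) would load a dataset, and supporting a new backend means adding one Engine subclass.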
What have been some of the most difficult aspects of building and maintaining Data Retriever? Most of the complexity comes from supporting various platforms, because we have to support people who are using Mac, people who are using Windows, and people who are using Linux and other operating systems.
On top of that, we have system and Python dependencies, where somebody may be running a Python 3 program or a Python 2 program, and all these systems process data differently. We also have complexity in supporting backward compatibility, because when we release new software, many things have changed, and we need to keep users of the previous version functional. The other complexity we have always found is keeping up with dependencies, because people update their dependencies' versions and functionality, and this usually breaks the tool. I think those are the main problems that we face.
[00:13:56] Unknown:
And is there a sizable community around Data Retriever, or is it still in the growth phase, where you're trying to get people interested in it and engaged in contributing new datasets to grow its overall utility?
[00:14:08] Unknown:
Yeah, I have had quite a few people tell me they've tried the tool at a few of the conferences that I've been to. We're still developing, trying to enable more datasets, and trying to get users to see the benefits of using the Data Retriever. We're also enhancing it by adding more functionality and scaling up the spatial preprocessing. I think after some time, people are going to come on board and use the Data Retriever more often. What are some of the features that you have planned for the future that you're hoping to implement? We are trying to include what we call the Data Weaver. This is a tool that would help in integrating datasets, because most researchers don't use individual datasets on their own; they take datasets from other researchers and integrate them to come up with a new dataset, and we're developing the tool for that. Then we're also developing a Julia package for the retriever, so as to support people who are using other programming languages.
I think that's all that is going to be updated soon.
[00:15:29] Unknown:
When you were originally starting the project, was Python the only language that you considered implementing it in, or were you looking at other possibilities as well?
[00:15:40] Unknown:
Python is one of the most important languages when it comes to data science, and it's one of those languages that processes data very fast. There were other options for developing this tool, but given the environment that we're trying to serve, Python is really important. I think that's the best choice we made.
[00:16:10] Unknown:
Okay. For anybody who is interested in learning more about Data Retriever and following the work that you're up to, I'll have you add your preferred contact information to the show notes. And with that, I'll move us to the picks. For my pick, I'm going to choose a Bluetooth receiver that I picked up a little while ago for listening to podcasts while I'm commuting, and a set of headphones that I got to go with it. I was looking at some of the different Bluetooth receivers available, including some of the ones that are built to clip onto your shirt, and ended up finding this one, which was a little cheaper than most of those and supported newer versions of the spec. The only problem was that it was designed for use in cars, but I also found a little pocket clip that fits nicely onto it, so I can just clip it to my shirt while I'm commuting.
I'll add links to all of those in the show notes. And with that, I'll pass it to you. Do you have any picks for us this week, Henry?
[00:17:02] Unknown:
The pick for the listeners is a movie from India that I always watch. I watched it last weekend; it's called 3 Idiots. I hope you enjoy that movie.
[00:17:11] Unknown:
Okay, great. Well, I appreciate you taking the time out of your day to tell me about the work you're doing with Data Retriever. It definitely seems like a very valuable addition to the scientific community, and I look forward to seeing where you take it in the future. Thank you very much, and thank you for the time. Have a great night.
Introduction and Guest Introduction
Henry's Journey with Python
Overview of DataRetriever
Data Management and Package Management
Types of Datasets and Scalability
Target Systems and Data Size
Standardization and Data Preparation
Project Design and Architecture
Challenges in Development and Maintenance
Community and Future Features
Choice of Python and Final Thoughts
Picks and Recommendations