Summary
The ecosystem of tools and libraries in Python for data manipulation and analytics is truly impressive, and continues to grow. There are, however, gaps in their utility that can be filled by the capabilities of a data warehouse. In this episode Robert Hodges discusses how the PyData suite of tools can be paired with a data warehouse for an analytics pipeline that is more robust than either can provide on their own. This is a great introduction to what differentiates a data warehouse from a relational database and ways that you can think differently about running your analytical workloads for larger volumes of data.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Taking a look at recent trends in the data science and analytics landscape, it’s becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL Intersection in Analytics at mode.com/init. Specifically, we’re going to be focusing on their similarities, rather than their differences.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host as usual is Tobias Macey and today I’m interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses
Interview
- Introductions
- How did you get introduced to Python?
- To start with, can you give a quick overview of what a data warehouse is and how it differs from a "regular" database for anyone who isn’t familiar with them?
- What are the cases where a data warehouse would be preferable and when are they the wrong choice?
- What capabilities does a data warehouse add to the PyData ecosystem?
- For someone who doesn’t yet have a warehouse, what are some of the differentiating factors among the systems that are available?
- Once you have a data warehouse deployed, how does it get populated and how does Python fit into that workflow?
- For an analyst or data scientist, how might they interact with the data warehouse and what tools would they use to do so?
- What are some potential bottlenecks when dealing in Python with the volumes of data that a warehouse can contain?
- What are some ways that you have found to scale beyond those bottlenecks?
- How does the data warehouse fit into the workflow for a machine learning or artificial intelligence project?
- What are some of the limitations of data warehouses in the context of the Python ecosystem?
- What are some of the trends that you see going forward for the integration of the PyData stack with data warehouses?
- What are some challenges that you anticipate the industry running into in the process?
- What are some useful references that you would recommend for anyone who wants to dig deeper into this topic?
Keep In Touch
Picks
- Tobias
- Foundations Of Architecting Data Solutions: Managing Successful Data Projects by Ted Malaska & Jonathan Seidman
- Robert
- Reading old academic papers such as CStore
- Python Machine Learning by Sebastian Raschka
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Altinity
- Clickhouse
- MySQL
- Data Warehouse
- Column Oriented Database
- SIMD == Single Instruction Multiple Data
- PostgreSQL
- Microsoft SQL Server
- Pandas
- NumPy
- Tensorflow
- Jupyter
- Data Sampling
- Dask
- Ray
- Map/Reduce
- Vertica
- Sharding
- Hadoop
- SnowflakeDB
- Delta Lake
- BigQuery
- RedShift
- Snowflake Data Sharing
- OracleDB
- Kubernetes
- DBT
- CSV
- Parquet
- Kafka
- UC Davis
- Web Scraping
- Clickhouse Python Driver
- SQLAlchemy
- Materialized View
- PyTorch
- scikit-learn
- Spark
- BigQuery ML
- Apache Arrow
- Wes McKinney
- User Defined Function
- KDB
- CStore Paper by Dr. Michael Stonebraker, et al
- Kinetica
- MapD/OmniSci
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
Taking a look at recent trends in the data science and analytics landscape, it's becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL intersection in analytics at mode.com/init. Specifically, we'll be focusing on their similarities rather than their differences. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.
We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events and take advantage of our partner discounts when you register. Your host as usual is Tobias Macey, and today I'm interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses. So, Robert, can you start by introducing yourself? Hi, my name is Robert Hodges, and I'm CEO of Altinity. We offer commercial
[00:02:15] Unknown:
support and software for ClickHouse, which is a popular open source data warehouse. Beyond that, I have a pretty long background in databases. I started working on databases in 1983 with a system called M204, and ClickHouse is now database number 20, I think, because I'm kind of losing count. Yeah, after the first few, I'm sure you've probably stopped bothering to keep exact track. Well, every time I count, I keep finding ones that I forgot about. And it's definitely been very variable. For example, I worked for over 10 years with MySQL, and then there were others like DB2 where I used it for a day and decided I hated it, so it still counts. And also, "database" is in some ways a bit of a nebulous definition, where if you squint enough, different things can be considered databases that you might not think of at first blush. Absolutely. And I think what's interesting over the last, say, 10 to 12 years is that the range of things that we properly consider databases has increased enormously.
As people deal with different kinds of data, different amounts of data, and different problems that they're trying to solve when they analyze that data, those have all driven a plethora of different approaches to databases. And do you remember how you first got introduced to Python? You know, that's a really good question. I think what it was, and I can't remember where it happened, is that a long time ago I read an article that said Python was so beautifully designed that you would be up and running and productive in about four hours. And so I thought, okay, that sounds good, and I went and tried it. And it did seem pretty easy to use, but I didn't program with it at the time. I got introduced to it at an industrial level at VMware about two and a half years ago, where I was working on a project doing tools for one of the VMware Cloud products.
And I was sort of the architect for a group that was developing open source Python APIs, and that was where I really learned how to use Python well: how to properly structure modules, write documentation, set up setup.py, things like that. So that was really my full education in it. And so for this conversation, can you start by giving a quick overview of what a data warehouse is and some of the ways that it differs from a, quote, unquote, regular database that somebody might think of, for anybody who isn't familiar with this space? Sure. So I think it helps to start by defining what we mean by a regular database, because, as I've referred to and as I think many people know, there are many types of databases. When people think about a database, they commonly think about something like MySQL, which stores data in tables: each table has a set of columns, and the data are actually stored as rows. This is the same data structure that you see when you work in Excel. Generally speaking, we call that kind of database a row-oriented store, which means that when you look at how the data are stored, on disk or on SSD as the case may be, as well as the way that they're accessed, we go get them by rows. This type of database was one of the earliest that was developed, and it's very well optimized for things like managing data in a bank or managing sessions on a website, because there are a lot of little updates, and we can go to the row, change it, put it back, and so on and so forth. Now, that kind of database has a problem, which is that as you get to very, very large amounts of data, the fact that it's efficient for update leads to a trade-off: it's not particularly efficient for reading. One of the reasons is that if you had, for example, a table in your database that contained taxi rides, and one of those fields was the number of people that rode in the taxi, and you wanted to just take the average, you would end up reading and pulling in every single row just to get that one little field. So as a result, data warehouses began to develop in the nineties, and they took a different approach to the way they organize data, which is that they organized it in columns. So you could have your billion-row table, and your field containing the number of people in the taxi would be stored in a single column. The idea is that the data warehouse is, in this and many other ways, optimized to read data very, very efficiently.
So if you want to do that average, you're going to read down this column very quickly. And on top of that, the data warehouse typically has further optimizations. That array of data, if you will, is going to be compressed, and the compression algorithms can be extremely efficient. For example, you can get 200-fold compression on one of those columns, depending on what the values are. So it compresses down to a very small amount of data, plus we use very sophisticated processing techniques so that we can take advantage of what are called SIMD instructions, single instruction, multiple data, that allow us to operate on multiple values at a time when these things are loaded into CPU registers.
So this type of database, which is optimized for read, is now generally what we know of when we talk about data warehouses.
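To make the row-versus-column point concrete, here is a toy Python sketch, not how any particular database lays out its pages, that contrasts averaging a field stored as one value per row object against averaging the same field stored as a single contiguous column. The column scan touches only the bytes it needs and vectorizes the way SIMD-friendly engines do; the table and field names are made up.

```python
import sys
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # hypothetical number of taxi rides

# "Row-oriented" layout: each ride is an object carrying every field,
# so averaging one field still walks every row object.
passenger_counts = rng.integers(1, 7, size=n)
rows = [{"fare": 12.5, "distance": 3.1, "passengers": int(c)} for c in passenger_counts]
row_avg = sum(r["passengers"] for r in rows) / len(rows)

# "Column-oriented" layout: the field is one contiguous array, so the
# scan reads only that column and NumPy can vectorize the reduction.
col = np.asarray(passenger_counts, dtype=np.int8)
col_avg = col.mean()

print(f"row-layout average:    {row_avg:.3f}")
print(f"column-layout average: {col_avg:.3f}")
print(f"bytes per value in the column: {col.itemsize}")
print(f"bytes for one row object:      {sys.getsizeof(rows[0])}")
```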
[00:07:24] Unknown:
And there are some cases where people might end up using a row-oriented store as a data warehouse just because that's what they have close to hand. I know I've seen it with things like Microsoft SQL Server, and I've heard of people using Postgres for that purpose if they have a small enough data set that they're able to perform their analytics on it, but the pattern in which they're using it is more read-oriented than write-oriented, so that they can run their analytics off of it without necessarily interrupting their application databases.
[00:07:56] Unknown:
Exactly. And in fact, what I started to notice, perhaps 12 years ago when I was working with customers, was that, well, I ran a business that was actually focused on, as it turned out, clustering for MySQL databases, and we had people that ran pretty big MySQL installations. What you would see in their datasets was that they might have 200 tables in a database, but when you actually went and looked at how big the tables were, there were probably one or two that turned out to be extremely large. Sometimes they would contain hundreds of millions of rows, and then the rest of the tables would tail off really quickly, because they would be things like customers, currency values, or things like that, which turned out to be small numbers. So what has happened is that people have done exactly as you described. They start out using MySQL, Postgres, or Microsoft SQL Server. They get to the point where they have about 100 or 200 million rows, and that's the point where it becomes very, very difficult to do these read operations on the row-oriented database.
Not only does it disrupt the operational work that's going on, for example, trying to update the tables, but it also becomes extremely slow. And the difference in performance can be staggering. When you have 200 million rows in a MySQL table, the difference between the performance there and running in a data warehouse will sometimes be a factor of 100, even more in some cases.
[00:09:15] Unknown:
And on the other side of what we're talking about today, there's the PyData ecosystem, which encompasses a large number of packages, at the core of which most people would consider things such as NumPy, pandas, maybe the Jupyter notebook format, and probably scikit-learn or some of the machine learning libraries. And I'm wondering, what are some of the cases where those libraries on their own aren't necessarily sufficient for processing data efficiently, or where you might need to lean on something such as a data warehouse in conjunction with the PyData stack to achieve a particular project? Yeah, that's a great question. First of all, I want to say that these tools, the whole PyData
[00:10:01] Unknown:
ecosystem, are really inspiring, because there's such a wealth of operations that you can do on matrix- and vector-oriented data, plus they're coupled with really great machine learning and visualization tools. So it's a really great ecosystem, but the single biggest thing that I hear about is people complaining, hey, my data doesn't fit into memory anymore. And so what you end up seeing, at least in my experience, is that when you're running models in Python, you'll end up sampling because you actually can't hold all the data in memory. You just load a sample, say 10% or 1% or something like that, and whatever operation you're doing, training your model or running your model, you'll do on a small fraction of the data, and the basic reason is you can't put it all in memory and process it at once. And I think this is happening for two reasons. One is that Python, bless its heart, does not have a particularly efficient way of representing data in memory. This is actually common to other systems like Java; it's not a Python problem per se, but when you materialize objects in memory, they're not necessarily using memory layout as efficiently as they could. The second thing, and I think this is maybe more important, is that by default, pandas wants to have everything in memory at once. So if you have a matrix or a pandas DataFrame, it's going to want to have the whole thing in memory, and there's not a notion, at least by default, of being able to stream through data. And so that can also be a problem when you're trying to deal with large amounts of information.
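Here is a minimal sketch of the two workarounds described above, down-sampling and streaming through the data in chunks, using pandas; the file name and column are hypothetical.

```python
import pandas as pd

# Option 1: down-sample -- only works if the full file still fits in memory
# long enough to be loaded once.
sample = pd.read_csv("rides.csv").sample(frac=0.01, random_state=42)

# Option 2: stream -- read fixed-size chunks so memory use stays bounded,
# accumulating whatever running statistics the analysis needs.
total, count = 0, 0
for chunk in pd.read_csv("rides.csv", chunksize=1_000_000):
    total += chunk["passenger_count"].sum()
    count += len(chunk)

print("sampled rows:", len(sample))
print("streamed average:", total / count)
```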
[00:11:38] Unknown:
Yeah. For a lot of these out-of-core needs, people might start to lean on things such as Dask or the Ray project to scale across multiple machines and multiple cores, both for the memory issues that you were laying out as well as for embarrassingly parallel computations. And I'm curious how that plays into the overall use case for data warehouses, some of the other ways that those broader elements in the ecosystem can help in terms of managing larger data volumes, and also where the data warehouse can benefit. Yeah, I think that's a great question, because of what you see happening with projects
[00:12:17] Unknown:
like Dask. At one level, you could say, hey, this is Python kind of rediscovering or reimplementing MapReduce, and what I see happening in the PyData ecosystem is that people are beginning to replicate in Python things that are already solved in data warehouses. I'll give you a couple of examples. Data warehouses like Vertica or ClickHouse, the one that I work on, are very good at spreading data across multiple nodes. They have the ability to break the data up, we call this sharding, into pieces that can be spread across nodes, and then run the query in parts and join or aggregate the results together before they return. Another thing we can do is replicate the data. So as the number of users on your system extends upward and you're beginning to ask a lot more questions concurrently, the ability to have replicas allows you to scale performance very efficiently, because different queries can run against different replicas.
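For a sense of what that shard-then-merge pattern looks like on the Python side, here is a minimal Dask sketch (the Parquet path and column are hypothetical): each partition is aggregated independently and the partial results are combined, much like a distributed warehouse fanning a query out across shards.

```python
import dask.dataframe as dd

# A directory of Parquet files that is too large for one machine's memory.
# Dask builds one task per partition instead of loading everything at once.
rides = dd.read_parquet("data/rides/*.parquet")

# The mean is computed as per-partition partial sums and counts that are
# then merged -- conceptually the same plan a sharded warehouse would run.
avg_riders = rides["passenger_count"].mean().compute()
print(avg_riders)
```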
So I think what's happening is that as these systems like Dask start to emerge, it's time to ask the question: do we want to implement this entire thing in Python, or do we want to actually go look at the SQL data warehouses, see what they've got there, and maybe begin to shift the boundaries a little bit
[00:13:52] Unknown:
between what we do inside the database and what we do in Python. And in terms of the actual data warehouses themselves, you've already mentioned ClickHouse and Vertica, and there are a number of other ones out there such as Snowflake, which is a more recent addition, and then there are also various
[00:14:08] Unknown:
things that you could potentially consider in terms of data warehousing that border the line with data lakes, such as the Hadoop ecosystem or the things the Databricks folks are doing with Delta Lake. And I'm wondering, what are some of the main axes along which the different data warehouses try to differentiate themselves, and some of the considerations that somebody should be thinking about as they're trying to select a platform to move forward with? Sure. I think we can definitely divide the market up into some important segments. For example, you mentioned Snowflake; that's a cloud data warehouse, and there's a family of data warehouses that are really cloud native. They may be tied to a particular cloud implementation, which would be the case with things like BigQuery on Google Cloud or Redshift on Amazon, or Snowflake, which can span clouds but depends on the underlying cloud capabilities to work. Those data warehouses have great capabilities; they're well funded, and they have very full, very good SQL implementations.
One of the things that Snowflake and BigQuery in particular do very efficiently is decouple the processing from the storage. For example, one of the things I like about how Snowflake handles this is that they spin up what are called virtual data warehouses, which are the compute part. Each business unit gets their own virtual data warehouse, and they can go to town on the underlying data that they're reading without interfering with each other at all. So that's one class of data warehouse that I think is really important to look at. And where people tend to make choices in that direction, probably the single biggest factor is whether your business has just decided, hey, we're going to be on Google. If you are, then there's probably a pretty strong argument for looking very closely at BigQuery, because you're already there and your data is stored in object storage. So that's one big class. I think another important class is the traditional data warehouses, which Vertica, Microsoft SQL Server, and Oracle all play into. I think the most interesting one is Vertica. That was a very innovative column store, and they're doing some interesting things in terms of separating compute and storage; they've now got a cloud version called Eon. So that's another one to look at. They have good capabilities; I think the trade-off there is that they tend to be expensive to operate, and it's not just because they're proprietary software with expensive licensing, they also tend to require pretty good hardware to run on. And then you have data warehouses like ClickHouse.
ClickHouse is kind of an interesting case. It's open source, Apache licensed, and it's more like MySQL in the sense that, even though it's a column store with all these properties that make it run very fast for reads, it's very simple to operate. It's also very portable across a lot of different environments. For example, we run ClickHouse on everything from bare metal, and we have a lot of people who still run their own data centers or lease space to run their equipment, to Kubernetes, all the way to VMs running in the cloud. So these are three different classes of data warehouse, and I think depending on your use case,
[00:17:30] Unknown:
you know, where your price points are and what it is you're looking to do, each of them has virtues and also drawbacks. And I'll also just point out that you and one of your colleagues were on my other podcast talking a bit more deeply about ClickHouse, and I've also covered a number of these other technologies on that show as well, for anybody who wants to go further down the rabbit hole. Exactly, and it's a great rabbit hole. We like living down there, though it's nice to come out into the open air for a little while this afternoon. And so once somebody has selected a data warehouse, they've got it deployed, and now they're looking to actually populate it with data and start integrating it into their overall workflow, what are some of the ways that the PyData ecosystem can help facilitate that process of getting data into it, making sure that the data is appropriately cleaned and the schema matches, and also ensuring that the schema is optimized for the types of analytics that you might be doing, along with some of the other considerations for the
[00:18:36] Unknown:
implementation of the data warehouse? Yeah, that's an interesting question. I can give an example from our work on ClickHouse, which I think illustrates some of where Python fits in. A lot of data in the real world is in CSV, comma-separated values. And it turns out that pandas has a good ability to read CSV; it's relatively tolerant of different formats. So one of the ways that I've actually seen customers using Python to help with ingest is that they will use pandas to read the CSV files, clean them up, and write them to Parquet, and then we have the ability to ingest Parquet into ClickHouse. So that's an example of where Python is there to help with the data cleanup up front. More generally, and this is not true of everybody, but in the systems where we see very large amounts of data being ingested, I think what happens is that Python kind of stays out of the way, because, for example, about half of the people that we work with ingest data from Kafka.
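A sketch of the pandas-to-Parquet preparation step mentioned a moment ago, assuming a hypothetical raw CSV export: pandas does the tolerant parse and light cleanup, and Parquet is what gets handed to the warehouse (writing Parquet from pandas needs pyarrow or fastparquet installed).

```python
import pandas as pd

# Tolerant CSV parse: coerce timestamps and treat common null markers as NaN.
df = pd.read_csv(
    "raw_rides.csv",
    parse_dates=["pickup_time"],
    na_values=["", "\\N", "NULL"],
)

# Light cleanup before the data ever reaches the warehouse.
df = df.dropna(subset=["pickup_time"])
df["passenger_count"] = df["passenger_count"].fillna(1).astype("int8")

# Columnar, typed, and well compressed -- ready for the warehouse to ingest.
df.to_parquet("rides.parquet", index=False)
```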
ClickHouse, like some other data warehouses, can actually read Kafka queues directly, so that's an example where you don't actually need an intermediary. And if you want to do cleanup, you'll wait until the data is in the warehouse and then often use SQL to clean it up. That's a very common usage pattern. The other way that data gets loaded is that you read it from object store. For example, if you use Redshift, it has this great command called COPY, which is used to read files, but it's been adapted in Redshift so that it can read data directly out of S3. So where the PyData ecosystem would fit in is the stuff up front, for example, scraping the data off other online systems. I did an application with students from the University of California, Davis, where they built scrapers that would go search for prices for web resources like EC2 on Amazon. They'd search for them on the web, put them into S3, and then we would read them into the data warehouse from there. So if PyData is involved, it's really at the front end of collecting the data and putting it into object store. Those are two big ways that data gets into these
[00:20:53] Unknown:
systems at scale, and Python is sort of helping along the way. And on the other end, once the data has already been loaded and you're looking to do some sort of analysis, or maybe even train some machine learning models based on the data that's there, what are some of the ways that an analyst or a data scientist might interact with the data warehouse?
[00:21:11] Unknown:
Yeah. In this particular case, I can really only speak to examples from ClickHouse, but I think these are relevant for other databases too. One of the simple ways that you can interact with the data warehouse is to just fire up the Python client and build a simple application on top of it. When I start to analyze data, I typically start in a Jupyter notebook, and there are two drivers that I can use. One is called clickhouse-driver; that's a standard database client. You make a connection, you call a method where you supply a SQL command, you get back a result set, and it's usually one or two lines of code to pop that into pandas. So that's one way that you can get it. Another way is that there's a SQLAlchemy driver.
SQLAlchemy is implemented pretty widely across databases. There's a %sql magic function that will give you a result set, which again you can pop into pandas and begin operating on. So those are pretty typical ways to get data out of the data warehouse. I think where it gets interesting is where you start to explore: can the data warehouse do more for me than just dump raw data? Can I actually do some of the things that I want to do in pandas in the data warehouse instead, and save myself some time?
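As a rough sketch of the clickhouse-driver path (the table and columns are hypothetical), the aggregation is pushed down to the warehouse and only the small summary result is pulled into pandas:

```python
import pandas as pd
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client(host="localhost")

# The warehouse scans the columns and does the grouping; Python only sees
# the already-aggregated rows.
rows = client.execute(
    "SELECT toDate(pickup_time) AS day, avg(passenger_count) AS avg_riders "
    "FROM rides GROUP BY day ORDER BY day"
)

df = pd.DataFrame(rows, columns=["day", "avg_riders"])
print(df.head())
```

The SQLAlchemy route mentioned above works along the same lines: the result set returned by the %sql magic can be converted to a DataFrame in a line or two.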
So that's where you begin to dig in:
[00:22:38] Unknown:
okay, what can I do in the data warehouse that's actually going to save me time in Python? And one of the potential issues that you might deal with, if you're just using the database as a dumb storage layer and pulling all the data back out into Python, is running into the memory bottlenecks that we referenced earlier. So what are some of the strategies for working around those bottlenecks, particularly if you're dealing with large volumes of data, and what are some ways that the data warehouse can help supplement the work that you would be doing within Python? Yeah, I think when you start to frame it as, how do I work with the memory I have available, at that point you're asking the right question, because
[00:23:21] Unknown:
this is a problem that databases have fundamentally been occupied with for at least as long as I've been working with them, which is to say decades. What the data warehouse does is basically allow you to access data that is vastly larger than the amount of memory that you have, and there are at least three different ways that you can think about using this. For example, if you need to do queries that only need certain columns, the data warehouse is going to answer those queries for you very efficiently. So instead of thinking in terms of having everything in a pandas DataFrame that contains all of the data and all the rows, just think in terms of having only the columns that you're working with, because you can get them very quickly out of the database, and if you need more, you can go back and ask for it. Thinking in terms of, let's isolate the columns we're working with, bring them in, and then operate on them in pandas, that's one thing we can do. Another thing that data warehouses can do very efficiently is downsampling.
SQL has this great feature called the materialized view, and the idea with a materialized view is that it is a different representation of the data, designed to be either smaller or more efficient than the original source data because of the way it's sorted or stored. So, for example, if you're sampling devices or collecting prices in the market, this is essentially going to be time series data. What you can do with a materialized view is downsample it, so that instead of getting a data point for every measurement, you reduce it to time segments, like 15-minute segments. This vastly reduces the amount of data that you collect. Moreover, with a materialized view, the database will basically do this downsampling at the time you ingest the data. It does it once, so that if you then go to the view and ask for the data, you're going to get it back really fast, and you're going to get a much, much smaller amount of data that you can then operate on. I think what happens in pandas, if you just work off files, which is the way some of these pipelines work, is that you end up answering these questions again and again and again. So that's another important way that the data warehouses can help.
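As a rough illustration of that downsampling pattern in ClickHouse (the readings table, its columns, and the engine choice are all assumptions made for this sketch), a materialized view can roll raw readings up into 15-minute buckets as they are inserted:

```python
from clickhouse_driver import Client

client = Client(host="localhost")

# The view is populated at ingest time: every insert into `readings` is
# aggregated into 15-minute buckets, so later queries read far fewer rows.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS readings_15m
    ENGINE = SummingMergeTree()
    ORDER BY (sensor_id, bucket)
    AS SELECT
        sensor_id,
        toStartOfFifteenMinutes(ts) AS bucket,
        count() AS samples,
        sum(value) AS value_sum
    FROM readings
    GROUP BY sensor_id, bucket
""")

# Downstream analysis queries the compact rollup instead of the raw data.
rows = client.execute(
    "SELECT bucket, value_sum / samples AS avg_value "
    "FROM readings_15m WHERE sensor_id = 42 ORDER BY bucket"
)
```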
[00:25:47] Unknown:
And another Python tool that can help in terms of creating and maintaining those materialized views is the data build tool, or dbt for short, which helps ensure that you have some measure of testing and consistency in processing the source data and creating the materialized views from it. And then, once you have a materialized view, or even if you're just dealing with a table without doing any additional processing on it, one of the ways that it can help, related to what you were saying about pulling in windows of time segments, is that you can let the database handle the cursor and feed you chunks at a time, instead of having to implement that logic yourself in your Python code to say, okay, I'm only going to process this amount, then I need to shift this out and pull in the next bit. You can actually use the database as your working memory, pull in the piece that you care about for that particular computation, and then put it back into the data warehouse. Absolutely.
[00:26:44] Unknown:
Correct. And in fact, I mentioned that there are three ways that data warehouses can help; that gets to the third. Particularly as your data sets grow very large, you can think of the raw data as existing in a window. For example, when you're collecting web logs, or perhaps temperature sensor data, the data that's newest is the most interesting, and what tends to happen is that after a certain period of time the data becomes less interesting to keep around. So the data warehouse can do what you're describing at a very large scale, in the sense that you can put a time to live on data. This is a common feature in many data warehouses, so that your raw data will actually time out after some period of days or weeks, or whatever you choose, and it just goes away, and the database does this automatically. So it's maintaining a window of the raw data that you can jump in and look at without having to do anything special, like complex pipelines or a lot of logic. The other thing is that with materialized views and other techniques, you can downsample, and you can keep those aggregates around for much longer periods of time. So when you combine these, the database itself is holding your data in an optimized way that, in effect, creates these different granularities of data that you can look at. And then finally, to the point that you were making, database connectivity APIs are very good at streaming data. For example, with the ClickHouse streaming APIs, or the wire protocol, you tend to get data in chunks. So if you write your Python code well, you can get a piece at a time, process it, throw the memory away, get the next piece, and so on and so forth. This kind of buffering of data is something that the connectivity APIs have been doing since the late eighties; the databases are really good and very well optimized for this problem.
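For the streaming point, clickhouse-driver exposes that chunked wire protocol through execute_iter, so a large result can be processed a block at a time without ever materializing it all in Python (the table and settings here are illustrative):

```python
from clickhouse_driver import Client

client = Client(host="localhost")

total, count = 0, 0
# execute_iter yields rows as the server streams result blocks, so memory
# stays bounded no matter how large the underlying table is.
for row in client.execute_iter(
    "SELECT passenger_count FROM rides",
    settings={"max_block_size": 100_000},
):
    total += row[0]
    count += 1

print(total / count)
```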
[00:28:50] Unknown:
And another case where the data scientist or the analyst might be leaning on Python is to interact with tools such as TensorFlow or PyTorch for building machine learning models, which generally require a fairly substantial amount of training data. And I'm wondering how the data warehouse fits into that flow, versus some of the other ways that they might be consuming the training data for building those models, whether in deep learning neural networks
[00:29:10] Unknown:
or in some of the more traditional machine learning algorithms that they might be leaning on scikit-learn for? I think that's the place where we actually see the biggest gap in the technology right now. What is happening right now, if you look at the state of the art, particularly among the people that I speak with who in some cases deal with very large datasets, is that they're typically pulling the data out into Spark and doing their machine learning there, or they're pulling it out and doing it in TensorFlow. So you're basically taking the raw data, or maybe downsampled aggregates that are in the data warehouse, and you're just copying them out. You're running your machine learning on it, for example training models, in which case you would just drop the data after you're done, or you're executing models, in which case you would score the data and maybe put it back in the database.
So there's still a pretty big gap there. As we go forward, I'm seeing two things which are really pretty interesting. One is that you're beginning to see data warehouses actually put at least basic machine learning into the warehouse itself. Google BigQuery is doing that: databases traditionally have a CREATE TABLE command, and they now have a CREATE MODEL command, so you can begin to run basic machine learning models there. The other thing that's happening is that we're starting to see the emergence of other ways of sharing data between data warehouses and machine learning systems, and these are represented in projects like Arrow, which is a columnar memory format that is pretty accessible from Python. That's actually a project that's being driven by Wes McKinney, who's also the author of pandas. And the idea there is that we're going to have common memory formats that make transfer of data simpler, but also open up the possibility that we can communicate through things like shared memory as opposed to TCP/IP streaming to move the data. So those are a couple of interesting things that are happening, but it's still very, very basic, and I think there's a lot more that we need to do to move that forward.
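A hedged sketch of the copy-out-and-score-back workflow described above, with hypothetical table and column names: pull a training set into pandas, fit a scikit-learn model outside the warehouse, then write scores back through the same driver.

```python
import pandas as pd
from clickhouse_driver import Client
from sklearn.linear_model import LogisticRegression

client = Client(host="localhost")

# 1. Copy a (possibly down-sampled) training set out of the warehouse.
train = pd.DataFrame(
    client.execute("SELECT feature_a, feature_b, label FROM training_set LIMIT 1000000"),
    columns=["feature_a", "feature_b", "label"],
)

# 2. Train outside the database.
model = LogisticRegression().fit(train[["feature_a", "feature_b"]], train["label"])

# 3. Score fresh rows and write the results back.
to_score = pd.DataFrame(
    client.execute("SELECT id, feature_a, feature_b FROM to_score"),
    columns=["id", "feature_a", "feature_b"],
)
to_score["score"] = model.predict_proba(to_score[["feature_a", "feature_b"]])[:, 1]

client.execute(
    "INSERT INTO scores (id, score) VALUES",
    list(to_score[["id", "score"]].itertuples(index=False, name=None)),
)
```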
[00:31:13] Unknown:
Another interesting development is things like what Microsoft is doing with embedding the Python runtime into Microsoft SQL Server, to try to bring the compute directly in line with the data so that you don't have to deal with the wire transfer and serialization and deserialization
[00:31:29] Unknown:
itself. Right, that's a really interesting development. In fact, that's one of the reasons why I'm very interested in Arrow. We call that kind of operation a UDF, a user-defined function, and databases like Postgres have had this for a long time: the ability to hook in, for example, C routines, but also Java and Python and other things like that. The problem that I see with this is that if you do it in a naive way, where for every row you go and call Python, it's just horribly inefficient. The way that data warehouses operate on data efficiently is that everything's an array; they basically break up the array into pieces, farm it out onto all the cores that are available, and just go screaming through this data to process it as quickly as possible. If you have to turn around for every value and call something in Python, that doesn't work. I haven't used the Microsoft SQL Server capabilities, but I think what we need is something that allows you to deal with data at the level of blocks. And that's where I think something like Arrow comes in, something that combines the capabilities of Arrow, where you can actually share the data formats, with the way that, for example, ClickHouse processes materialized views: when we populate a materialized view, we don't just do one row at a time, we do thousands or hundreds of thousands of rows at a time. So you have to think in terms of processes that allow you to do these operations on very, very large amounts of data using a streaming processing model that lets you get to the data really quickly. And so, in addition to things like machine learning where we don't have an excellent solution yet, what are some of the other limitations of data warehouses in the context of the Python data ecosystem, as far as being able to run analysis or some of the other data science workflows that somebody might be trying to perform? Well, I think the biggest thing I see is that there's just such a richness of things that you can do in NumPy and pandas that still aren't fully supported in data warehouses. Most mature data warehouses will do a pivot, but ClickHouse, for example, doesn't do a full pivot the way that pandas does it. Pandas makes this so easy to do, and pandas, and Python in general, has a wealth of statistical functions.
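As a quick example of the kind of reshaping that is one line in pandas but considerably more awkward in many SQL dialects, here is a pivot_table over some hypothetical ride data:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["2019-09-01", "2019-09-01", "2019-09-02", "2019-09-02"],
    "payment_type": ["card", "cash", "card", "cash"],
    "fare": [12.5, 8.0, 15.0, 9.5],
})

# Rows become a day x payment_type matrix of average fares in one call.
pivot = df.pivot_table(index="day", columns="payment_type", values="fare", aggfunc="mean")
print(pivot)
```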
There are certain databases, for example KDB, a column-oriented store, or sort of a data warehouse, that are renowned for having a very rich set of functions. But in general, databases don't have the richness of statistical functions that you find in the PyData ecosystem. And I think that creates a problem, because if you look at what people are doing with machine learning and data science, it's fundamentally driven by statistics.
So if you don't have those statistics in SQL, you can't farm work out to SQL very easily. You either have to have user-defined functions, or people add them themselves, and as I mentioned, there are a lot of inefficiencies there. So I think this is a big gap, and this is where we need to look over, see the good things that are being done in Python, and actually pull more of that into the database to make the SQL implementation richer.
[00:34:59] Unknown:
And in terms of overall trends in the industry and in the Python community, what are some of the things that you see going forward that you're excited about, as far as the tighter integration of the PyData stack and data warehouses,
[00:35:13] Unknown:
and any challenges that you anticipate us running into as we try to proceed along that path? Yeah, I think moving models into the database is a very interesting development. I'm a little bit skeptical that databases are going to be able to do this as well as Python does, and the reason is that if you go to scikit-learn and look at the models, they have a raft of parameters that you actually need to be able to twist and turn so the model is not overfitted, not underfitted, and works effectively.
The database implementations that I've seen have very limited parameterization. For example, with BigQuery, as far as I can tell, it just kind of figures it out. You give it the data to train the model and it does it, but you don't actually know quite what it did to train the model, and you don't have full access to the hyperparameters that you would want to adjust to make the model work well. I think that's a problem, and it's not clear to me that data warehouses are going to solve it. One of the really interesting questions that we all have to work on is how we can take advantage of the fact that there are very powerful machine learning and deep learning systems that exist outside the database. How can we combine them? So instead of thinking about the data being pushed into the database, think instead about how these systems can work together.
For example, one of the ClickHouse users that I know, their biggest desire is to be able to take a row of data, pass it to a model for scoring, and write that data back into the database all in a single operation. It's problems like that we have to focus on to really join these two worlds together, and we have to think about how to do it at scale. And I think there are some very interesting things we can do. If you have data spread across many nodes, in many ways this is not unlike what we had in the Hadoop file system, where you had data spread across a bunch of disk spindles and you could send the processing down to go grab the data off those disks. So I think there are some things we can do, but there's definitely some work to do to implement those ideas and really be able to join these two worlds together in an efficient way. For anybody who wants to dig deeper into this space, what are some references that you have found useful that you'd recommend?
That's an interesting question. I'm sort of into academic papers. If you want to understand what a data warehouse is and you're reasonably familiar with data, I think one of the quickest ways to get up to speed is to go read the C-Store paper by Mike Stonebraker and a bunch of database stars. There are about 14 different people on the paper, but it basically described what later became Vertica. It was a column store built on things that had already been done in the nineties, and it introduced a bunch of other interesting things. I think that gives you a notion of how data warehouses work and the kinds of things that they can do. Beyond that, I think the simplest thing is to go try them out. ClickHouse, for example, is very easy to use; if you're running on Ubuntu, you say apt install clickhouse-server and it comes down. There are other systems that you can try out too. For example, Redshift on Amazon was a complete groundbreaker in terms of ease of use, just being able to click a few buttons and have a data warehouse spin up. BigQuery is the same way, and if you're on Azure, you can do similar things with the Microsoft equivalent based on Microsoft SQL Server. So I think just going and trying this stuff is probably the best thing to do, and just begin to understand how you can actually use these systems. They're very accessible at this point. And are there any other aspects
[00:39:31] Unknown:
of the overall space of the PyData ecosystem and data warehouses in general that we didn't discuss yet that you'd like to cover before we close out the show? I think another place where there's an interesting long-term integration
[00:39:43] Unknown:
is how we deal with data warehouses and GPU integration. For example, one of the reasons you go to TensorFlow is that it has a much more efficient processing model, with way more compute than we can get to on conventional hosts. The data warehouses today are optimized for I/O; they don't necessarily have GPU integration. So I think that's another interesting case: if data warehouses begin to be able to take advantage of GPUs, that may open up some other interesting opportunities for adjusting the split of where processing happens. So that's another thing that we're definitely looking at with a lot of interest. Yeah, I know that there's been some movement in that space with things like Kinetica,
[00:40:27] Unknown:
which is a GPU-powered database, but I haven't looked closely at it myself to be able to give any deep details on that. Right. There's a database called MapD, which I believe is now called OmniSci.
[00:40:37] Unknown:
That's a really interesting space. And I do want to say that I've been sort of focused on moving things into the database, but PyData people can also learn from what's going on in the data warehouse, because I think what's happening in the PyData space is that people are beginning to recapitulate solutions to problems that data warehouses already solved a long time ago: how to distribute data, how to turn it into an embarrassingly parallel problem so you can get results very quickly. On the flip side, I think people in databases need to be looking very, very hard at the ease of use and the wealth of operations that you can perform using the PyData ecosystem modules, you know, NumPy, pandas, Seaborn,
[00:41:28] Unknown:
scikit-learn. There's just so much stuff there that I think this is something we can really learn from. And for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so, with that, I'll move us into the picks. This week, I'm going to choose a book that I've been reading called Foundations of Architecting Data Solutions.
[00:41:48] Unknown:
It's an O'Reilly book, and it's been great for getting a high-level overview of the things that you need to be thinking about if you're trying to plan out a new data infrastructure. So I definitely recommend that for somebody who's interested in getting more into this space. And with that, I'll pass it to you, Robert. Do you have any picks this week? I think my pick is going back and reading old papers. As I say, the C-Store paper, which I reread preparing for this show, I love reading that paper; it's just a really, really great thing to read. Beyond that, the books that have really interested me are things like Python Machine Learning by Sebastian Raschka. It's now in its second edition, and I just got it a little while ago. It's not something you can read all at once; I just keep going back to it, and whenever I have time, I go look at what he has, work the exercises, and just try to keep learning more and more about how to deal with data in Python. Well, I appreciate you taking the time today to join me and share your interest and experience
[00:42:43] Unknown:
in the intersection of data warehouses and the PyData ecosystem. It's definitely an interesting intersection and an area that I am excited to see more development in. So thank you for your time and all of your efforts on that front, and I hope you enjoy the rest of your day. Yeah, thank you, Tobias. It's always great being on your show. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
The Intersection of SQL and Python in Data Analytics
Interview with Robert Hodges
Understanding Data Warehouses vs. Regular Databases
Challenges in the PyData Ecosystem
Choosing the Right Data Warehouse
Integrating PyData with Data Warehouses
Analyzing Data with Python and Data Warehouses
Machine Learning and Data Warehouses
Limitations and Future Trends
Closing Remarks and Picks