Summary
Testing is a critical activity in all software projects, but one that is often neglected in data pipelines. The complexities introduced by the inherent statefulness of the problem domain and the interdependencies between systems combine to make pipeline testing difficult to manage. To make this endeavor more manageable, Abe Gong and James Campbell have created Great Expectations. In this episode they discuss how you can use the project to create tests in the exploratory phase of building a pipeline and leverage those tests to monitor your systems in production. They also discuss how Great Expectations works, the difficulties associated with pipeline testing and managing the associated technical debt, and their future plans for the project.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- Finding a bug in production is never a fun experience, especially when your users find it first. Airbrake error monitoring ensures that you will always be the first to know so you can deploy a fix before anyone is impacted. With open source agents for Python 2 and 3 it’s easy to get started, and the automatic aggregations, contextual information, and deployment tracking ensure that you don’t waste time pinpointing what went wrong. Go to podcastinit.com/airbrake today to sign up and get your first 30 days free, and 50% off 3 months of the Startup plan.
- To get worry-free releases download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- Your host as usual is Tobias Macey and today I’m interviewing James Campbell and Abe Gong about Great Expectations, a tool for testing the data in your analytics pipelines
Interview
- Introduction
- How did you first get introduced to Python?
- What is Great Expectations and what was your motivation for starting it?
- What are some of the complexities associated with testing analytics pipelines?
- What types of tests can be executed to ensure data integrity and accuracy?
- What are some examples of the potential impact of pipeline debt?
- What is Great Expectations and how does it simplify the process of building and executing pipeline tests?
- What are some examples of the types of tests that can be built with Great Expectations?
- For someone getting started with Great Expectations what does the workflow look like?
- What was your reason for using Python for building it?
- How does the choice of language benefit or hinder the contexts in which Great Expectations can be used?
- What are some cases where Great Expectations would not be usable or useful?
- What have been some of the most challenging aspects of building and using Great Expectations?
- What are your hopes for Great Expectations going forward?
Contact Info
Picks
- Tobias
- James
- Unplug and spend some time away from the computer
- Abe
Links
- Superconductive Health
- Laboratory for Analytical Sciences
- Great Expectations
- Medium Post
- DAG (Directed Acyclic Graph)
- SLA (Service Level Agreement)
- Integration Testing
- Data Engineering
- Histogram
- Pandas
- SQLAlchemy
- Tutorial Videos
- Jupyter Notebooks
- Dataframe
- Airflow
- Luigi
- Spark
- Oozie
- Azkaban
- JSON
- XML
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200 gigabit network, all controlled by a brand new API, you've got everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. Finding a bug in production is never a fun experience, especially when your users find it first. Airbrake error monitoring ensures that you'll always be the first to know so you can deploy a fix before anyone is impacted.
With open source agents for Python 2 and 3, it's easy to get started, and the automatic aggregations, contextual information, and deployment tracking ensure that you don't waste time pinpointing what went wrong. Go to podcastinit.com/airbrake today to sign up and get your first 30 days free and 50% off 3 months of the startup plan. To get worry-free releases, download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration, it's even easier to deploy and scale your build agents.
Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons, and visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey, and today, I'm interviewing James Campbell and Abe Gong about Great Expectations, a tool for testing the data in your analytics pipelines. So, Abe, could you start by introducing yourself?
[00:01:49] Unknown:
Sure. I am a data scientist, data engineer, currently the CEO at a new company called Superconductive Health. I'm excited to be here. And, James, how about yourself?
[00:01:58] Unknown:
Hi. I am a researcher at a government-sponsored lab called the Laboratory for Analytical Sciences. I would say that I'm very interested in the engineering aspects of data science problems in particular. I have gotten to work across a pretty wide range of problems, mostly in the national security space,
[00:02:16] Unknown:
from cybersecurity to political modeling. And, Abe, going back to you, do you remember how you first got introduced to Python? I first got introduced to Python in grad school at the University of Michigan. My dissertation was mining the political blogosphere for civility, which was kind of a boring topic six years ago. It's more interesting now. And I needed a lot of glue code just to hold everything together. Python was the best language then, and I still think the best language now, for doing that kind of thing. And, James, how about you? Do you remember how you first got introduced to Python?
[00:02:54] Unknown:
I don't remember what the first project that I worked on in Python was. But I would definitely say that I came to Python because of its libraries. I think that's been true for a lot of the analytics projects where I've worked, that the language has sort of been secondary to the ecosystem that was available for working with it. And so for me, I think the availability of libraries like Pandas and Scikit really opened up the use of Python in a variety of projects.
[00:03:24] Unknown:
This is the xkcd import antigravity thing? Exactly.
[00:03:29] Unknown:
And as I mentioned at the start, we're talking about the project that you have both been collaborating on and recently released called Great Expectations. So I'm wondering if you can just give a high level overview of what the project does and what your inspiration was for creating it. Sure. You know, it's a fun question, because Abe and I
[00:03:50] Unknown:
have known each other for a long time, but we've never actually worked together before this project on anything super substantive. And I would say we sort of came at it from slightly different directions, but really had this moment of realization that we had very overlapping goals and interests. For me, I was leading an analytics team where we were developing a variety of analytics that we were sharing with other teams in our organization. And what I realized is we often had built our models around some pretty deep assumptions about the kinds of data that were going to be used with them. But when we shared the models, it could be difficult for someone else to really understand in a deep way all of the assumptions that we had baked in. And so I was looking for a way to make it easier for people to express the assumptions that they had about data and really clarify those expectations in a way that could be shared across teams and make that collaboration happen more smoothly.
When Abe and I were talking about things, I think he was approaching it maybe from a different perspective. Abe, I should let you describe that.
[00:05:01] Unknown:
Sure. So I was coming off of a stint as the chief data officer at a really exciting medical technology startup, and we had been doing these large-scale, ongoing data integrations against all of the insurance companies, well, the major insurance companies in the US. And insurance companies are not famous for having really great data systems or data quality, and we just spent a ton of time trying to make these systems flow together: normalize the data and make sense of it, be able to run consistent queries across lots of platforms. It was just a massive headache. And I realized that we needed this abstraction to describe what it is that we expect from our upstream data providers at each point. So, the exact same set of concerns that James is describing as kind of analytic integrity, understanding what the data is going to be used for.
For me, they were really important from a productivity standpoint, to understand how do we not waste weeks each time we do an integration? How do we make sure that it just works each time? And what are some of the complexities
[00:06:05] Unknown:
that are associated with being able to test analytics pipelines and ensure that the integrity of the data remains solid as new systems are built or as the source data maybe evolves?
[00:06:17] Unknown:
I love that question, because I think the greatest complexity in pipelines is that they are incredibly interdependent. That goes a lot deeper than the level of a data source or different data sources that we're merging together. There's a lot of interdependence that comes with teams and people, because of the active selection that individuals are making at different points in terms of what data to collect and introduce and how it's going to be modeled at different places. When we have that interdependence, what we really want is a lot of change, because that's the way that we're learning and improving our systems and models. So we want there to be new enrichments in the pipeline. We want corrections to upstream models. We want new data sources and all of those things. But at the same time, when those things happen, they can violate some of the assumptions that a downstream system may have had, and in ways that are really difficult to predict.
So I think it's very similar to a lot of other engineering challenges. It's just that in many cases we don't have tools that are as robust for dealing with those very data-centric problems.
[00:07:28] Unknown:
Let me piggyback on that. I'll just give a specific example. I think logging is a really good example of this, and it ties into this idea that James has tossed out, that data systems change and you want them to change. I do a lot of work with companies that have logged a bunch of interesting data, and often it's there for debugging purposes, for the engineering team to make sure the software works right. Sometimes it's there for operational purposes. Like, you have accounting as a sort of log data that can be used for other things. And a lot of the time when you dive in on analytics and machine learning, you're repurposing that log data for something else. Right? You're using it to train models or get a window into past user behavior. So this data that was originally intended for the purpose of debugging later on becomes useful for all kinds of other stuff. And when you go from the original intended use case to all of these other use cases, that's cool because you can make really good use of the data exhaust, but it's also often just really complicated and risky from an inter-team perspective, because the upstream team that's putting all this data in the logs might not even know all the downstream uses that it's being put towards.
And it's not as if you want to bind the upstream team into never making any changes. But, of course, anytime they make a change, there's a risk that they're gonna mess up something downstream. Yeah. And both of those responses give me a couple of different
[00:08:51] Unknown:
follow-ons. So one question I have, James, from what you were saying, is have you found that the variability in the source data tends to be greater when you're working with teams that are all contained within a single organization versus when you're consuming data from some external API or data provider?
[00:09:12] Unknown:
Well, I definitely think that when you're consuming data from an external API, the more formal relationship that maybe exists to that API is designed to make the changes a little bit more understandable. But I don't know that that necessarily means that the actual downstream system that you're building will have less change. Because there may be new versions of that API, or, more to the point, as you encounter new kinds of problems or, like Abe was mentioning, new abilities to repurpose data, what you often find is that you're still going to end up having to adapt what it is that you're doing with the API or how you're interacting with it. That said, I think there's another kind of case that we're really targeting with Great Expectations, which is at the very beginning of a project as you're evaluating what's out there.
I think there's a kind of norming process that a group goes through, where the datasets that each team will provide are getting a little bit more solidified. I think of an example recently that happened at the lab, where there's a team that is beginning the process of building a model related to the veracity of different news sources. They're considering a really wide range of possible features about news sources and a wide range of kinds of models that they'll use. But as they're doing that, they need to interact across different people, in this case people who have different language skills and are working in different cultures in terms of the media environments that they're evaluating, to generate datasets, or even the processes that they will use to generate those datasets.
And that has to be able to evolve very quickly.
[00:11:02] Unknown:
And, Abe, with what you were saying about logging data, that's something that I'm very close to, because I work as a systems and operations engineer. So I've actually been working on building and maintaining a log aggregation system for the past couple of years now. And one of the things that I find to be particularly complex about that type of pipeline is that there isn't necessarily a consistent structure across all of the different systems that you're consuming logs from, and yet you still need to be able to unify them all within one storage engine to be able to correlate some of the information across multiple systems. And, also, as you mentioned, there's a likelihood that either the volume is going to change or the structure of the logs is going to change between different software releases, and new information may be added, potentially even sensitive information. So all of those things add a lot of complexity to managing the system that's actually consuming that information, storing it, and trying to gain some useful information out of it. So I don't know if there are any particular types of tests contained within, or supportable by, Great Expectations to bring some manner of sanity and data integrity to systems like that. And then, from what you were saying, James, ways that Great Expectations can be used to ensure that the source data that you're consuming matches an expected schema or format.
[00:12:29] Unknown:
Yeah. I mean, the question that I hear you asking there is, where in data science and data engineering workflows is Great Expectations starting to fit in the best? And I hear you bring up logging as one specific example of that. There are a lot of places you can use it. The place that I see most people gravitating most naturally towards is bringing their upstream data dependencies under test. So the mental model that I have is this directed acyclic graph, right, a DAG with data flowing through it from upstream to downstream. Like most data systems, after they've worked for a year or two, they become this complex web of A flowing into B, flowing into C, flowing into D, all the way through. What I find is that most teams, at least, think they understand the systems that they build and maintain, and mostly they do.
And the place where the most fear and uncertainty creep in is at the intersection between teams. So as people gain the capability to add these kinds of pipeline tests to the data pipes they're maintaining, they'll naturally put them in on the stuff that's flowing into their system, basically as a defensive measure. So that if the format or the structure or the content and distribution of the data that they're consuming changes, they'll learn about that right away. Right? They won't learn about it later when it shows up and breaks the dashboard or starts corrupting some downstream model. They learn about it as soon as it hits their system, and then you can go and have a conversation with the upstream team and react to that. We are starting to have some interesting conversations with teams that want to basically get a step ahead of that and use expectations as a way to set up almost an SLA for data quality between teams, so that upstream team A and downstream team B can agree on what the data should look like. And then team A can essentially use that specification of expectation tests.
They can use it basically as a set of unit tests, or integration tests rather, before they publish new systems. And in that way, team B never gets corrupt data that they have to go back and ask team A about. It's just always right. I think, Abe, that point about the way that the integration happens and the way that Great Expectations
[00:14:41] Unknown:
facilitates the clarification of the roles there is really where I see a lot of value, especially, Tobias, as you were describing even in that log case. Anytime you have a log processing pipeline, as you mentioned, when there are different formats, there's got to be some parsing that happens that is intelligent enough to be aware of the different potential formats. And probably it's going to extract certain key fields and structure the log data in different ways. With Great Expectations, you can separate the parsing logic from your expectations about that parsing logic in a really useful way, so that you can have a clear specification of what you want and expect to be present, one that you can really zero in on because you've been very precise about just what it is. So rather than saying there's a parsing error, or even if you have a great set of parsing logic that'll throw very intelligent errors, it's still a part of that processing pipeline under that approach. Whereas with Great Expectations, you're separating that out into an explicit testing pipeline, or testing portion of your pipeline, that you can use to zero in more directly on where there's a problem, if there is one. Going back slightly to the concept of why
[00:16:01] Unknown:
data pipelines are so difficult to test and manage as they evolve, too, is that with software systems, you can bring in unit tests and run integration tests with some measure of isolation between various systems. But once you have data flowing continuously through a system, there are bound to be anomalies, and so that brings in the need to actually run those tests in production contexts as well. So that adds another dependency to how the systems are all related to each other. And because data pipelines usually sit in the middle of a number of different systems, or a complex web of systems, it increases the dimensionality of concerns that you need to be thinking about as you're writing those types of tests. Yeah. I think that's exactly right.
[00:16:48] Unknown:
One of the key insights that has led Great Expectations to be structured the way it is, is basically that in data pipelines, just unit testing the code isn't enough, because the output is the joint product of the input data and the code itself. In a lot of software systems, you have pretty complete control over the code, and that's sort of the only thing that's really changing. So as long as you test the code, you can be confident that you're testing everything that's important. That's just not true in most data systems. Right? The data itself can change, and code that worked really well on data last week might not work on different data this week. I also think there's a really interesting aspect,
[00:17:28] Unknown:
you know, that ties into the issues that we're seeing at a strategic level with cybersecurity, where we realize you can't just have a perimeter security approach to really defending a network. I think there's a really strong analogy to data pipelines. I mean, you can't just test the data when it comes into a system and then expect that, because you've built your pipeline properly and tested the actual pipeline code, no anomalies will be present. Even more, once there is an anomaly, you really want to be able to zero in on what it is, where it came from, and how it has propagated throughout the rest of the system, which is why it's useful to have that defense-in-depth approach where you're looking throughout the system at many different points to understand how the data are interacting and changing.
[00:18:15] Unknown:
Kudos for saying data are as opposed to data is. And so we've been talking a lot about the ways that data systems can evolve and grow this organic complexity that makes it difficult to reason about how the data is flowing or manage how it's being processed. So what are some of the ways that Great Expectations can be used to pay down that pipeline debt, and what are the types of tests that can be run by Great Expectations to control and manage that complexity?
[00:18:52] Unknown:
So the core abstraction in Great Expectations is an expectation. And the idea behind expectations is that they are a very simple, direct, declarative language for what data should look like. So some of them are things like expect column to exist, so, in this table, a column named x must exist, or expect column to be of type. But we also have things that get into a lot more detail. So things like expect column to be JSON-parseable, or expect it to match a certain datetime format. And we also have some really interesting things that allow you to do essentially histogram comparison. So, expect column, this expectation name is super long, I always forget exactly what it is, but expect column KL divergence to be between. Bootstrapped? Yeah. Yeah. Bootstrapped.
Yeah. We have these distributional expectations. And conceptually, what they do is they let you take a histogram of either a categorical or a continuous variable and verify that future data is close to that distribution. So it's everything from structure to schemas to the actual content of those variables.
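[Editor's note] To make the expectation vocabulary Abe describes a bit more concrete, here is a minimal sketch against a Pandas-backed Great Expectations dataset. The file, column names, and thresholds are invented for illustration, and the partition helper and distributional expectation keywords are approximate; they have shifted somewhat between releases of the library.

```python
import great_expectations as ge
from great_expectations.dataset.util import continuous_partition_data

# Wrap a CSV in a Great Expectations dataset; it behaves like a Pandas
# DataFrame but also exposes expect_* methods. (File name is hypothetical.)
events = ge.read_csv("events.csv")

# Structural expectations: the column must exist and be non-null.
events.expect_column_to_exist("user_id")
events.expect_column_values_to_not_be_null("user_id")

# Content expectations: categorical membership and numeric bounds.
events.expect_column_values_to_be_in_set("status", ["active", "churned", "trial"])
events.expect_column_mean_to_be_between("session_minutes", 1, 120)

# A distributional expectation: build a reference histogram (partition) from
# the current data and require that future data stay close to it, measured
# by KL divergence. Helper and keyword names may differ by version.
partition = continuous_partition_data(events["session_minutes"])
events.expect_column_kl_divergence_to_be_less_than(
    "session_minutes", partition_object=partition, threshold=0.1
)
```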
[00:20:01] Unknown:
And I think that's actually really where people start to have that moment of realizing what Great Expectations is giving you, especially at a level that's a bit richer than just a schema. A schema is great for being able to describe the types of data and the structure that should be there. To some extent, schemas can also put bounds on data and so forth. I think those are really powerful concepts, to be sure. But where Great Expectations differs from that is that it also allows you to have, like Abe was mentioning, what we call aggregate expectations, where instead of looking at a row level, you're looking at an aggregate characteristic of a dataset, which may of course be a little batch if you're flowing data through a system. It's also a very extensible system. So the abstraction that Abe mentioned, the expectation, is something where we've chosen a language that is very intentionally verbose and precise. Abe mentioned expect column values to be in set, for example; that's the actual name of the expectation that we use, and you provide a set of values that you expect the column values to be drawn from. But you can extend that easily.
In one of the projects that we're working on at my lab, actually, someone wanted to express an expectation about the number of words that were in the sentences being parsed. For her particular analytic pipeline, that was an important concept. And so she was able to, in just a couple of lines of code, add that as a kind of expectation that she could make. Now, that certainly could have been present as a traditional check in the pipeline itself. But by breaking it out like that, she is allowing herself to get much more precise diagnostic information about how frequently that expectation may be violated in the data that she's already gotten. But then, importantly, as new data starts to arrive, the extent to which that new data is conforming
[00:22:03] Unknown:
to the previous datasets that she's been looking at. And that concept of statefulness within these tests is one of the other things that makes it so difficult to manage. And so I'm wondering how you have architected Great Expectations to handle that statefulness within these aggregate tests, so that you can, one, make sure that you don't exceed system bounds in terms of memory or disk space as you are collecting those aggregates, but also how you're able to ensure that you don't accidentally drop data, or make sure that you're maintaining appropriate context, particularly as you start to scale and distribute some of these pipelines?
[00:22:45] Unknown:
That's a good question, and I think our answer right now is probably a little disappointing, which is we've mostly punted on those issues. The first release of Great Expectations was pure Python, and it was just Python and Pandas. We didn't even support SQL when it was launched. So you had to load everything in as dataframes. You had to process it right there in memory as dataframes. And that addresses a lot of use cases. But, yeah, there are certainly a bunch of issues about, like, if you're IO-constrained or if you're running on really large datasets.
That's just not really the way you want to deploy. We've now added on, and this was always the plan, core support for SQL through SQLAlchemy, so you can now bind expectations directly to most SQL data systems. We're still working on building out the full library of expectations, but we'll get there. And we actually see a pretty direct path forward to things like Spark and other computation platforms. So for the moment, we haven't really hit optimization and really high-degree scalability very hard. But there are a bunch of things that we can do within the framework that I think will support that really nicely
[00:23:52] Unknown:
when that's the thing that the community is really clamoring for. I think that's right. I do think that there are some nearer optimizations that are relevant, things that Abe was mentioning. There are also things that are potentially interesting at a higher level, like looking at the composition of expectations. When we view expectations as a sort of grammar, then we can think of how we might optimize sets of expectations, or the evaluation of sets of expectations together. But that said, I also think that there's just something to be said for the simplicity of the approach we've taken right now, because we really are trying to push this to be something where you're thinking of it as an evaluation that happens at batch time. So we're not imagining, typically, that you're going to evaluate expectations on your entire data center.
Rather, as batches of data are coming in, or as batches of data are flowing between systems, the evaluation can happen at those points. Like Abe mentioned, we're at this point trying to listen to feedback from people who are already starting to use Great Expectations in a wider variety of contexts and let that guide where we need to go in terms of optimization. And I think, too, that this space is largely
[00:25:17] Unknown:
unexplored, so any benefit that can be provided, even for people who are doing work on single systems or systems that don't necessarily run across distributed clusters, still provides an immense amount of value where there was a large gap in the availability of tools to help manage that complexity.
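[Editor's note] Picking back up the SQLAlchemy support Abe mentioned a moment ago, here is a rough sketch of binding expectations to a database table instead of an in-memory dataframe. The class and argument names follow the early SqlAlchemyDataset interface and may differ in other releases; the connection string, table, and columns are placeholders.

```python
import sqlalchemy as sa
from great_expectations.dataset import SqlAlchemyDataset

# Connect to an existing database (connection string is a placeholder).
engine = sa.create_engine("postgresql://user:pass@localhost:5432/analytics")

# Bind a Great Expectations dataset to a table; the expectations are then
# evaluated by the database rather than by loading data into memory.
orders = SqlAlchemyDataset(table_name="orders", engine=engine)

orders.expect_column_to_exist("order_id")
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("total_amount", min_value=0, max_value=100000)
```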
[00:25:38] Unknown:
I think that's right. In fact, I almost think that that's where there's a greater risk, that people are delaying the introduction of tests because they're saying to themselves, oh, this is still small. And so they're not investing upfront in addressing that kind of a gap. But I think that's actually a risky strategy. I think it's, in many cases, better to go ahead and be very explicit about the data expectations as you're building the system so that you can grow into that more effectively.
[00:26:08] Unknown:
And for somebody who is first getting started with Great Expectations, what does the workflow look like for
[00:26:15] Unknown:
starting to build up your suite of tests and then deploying it within the execution framework that the data is flowing through? James has some great videos on this, where he walks you through the process. There are a bunch of different ways that you can approach it, but the one that we see most people using, at least to start, is using Python notebooks, or Jupyter notebooks, and diving in and using Great Expectations to instantiate a dataframe. So it's a one-line great_expectations.read_csv, just like you would do in Pandas. And in fact, if you use that entry point, we've made it so that the dataframe that you get back from Great Expectations is subclassed from a Pandas dataframe. So it has all of the normal functions that you're used to there. Like, you're just in a Pandas notebook data exploration mode. The difference is that when you dive in, you get this whole other set of methods on that object that are all things like expect column to exist, expect column mean to be between, expect column values to be parseable by dateutil, and so on. And you can invoke those as you go along. For many of those, you can actually use the function as a quick data check, for things like expect column values to be increasing. I actually find our method to be more convenient than anything that Pandas gives you, so it's a quick way to just verify the data looks like it should. You run the function. It gives you back true or false. If it's false, it gives you some summary statistics to describe why that expectation was violated. And then as you're doing that, Great Expectations is, behind the scenes, building up a log of all the expectations you run.
And when you're done with your session, you can export that as a JSON file that has this map of everything that you expect future data to do. So that's what we see as the entry point for most people doing exploration, building up a library of expectations.
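[Editor's note] A compressed version of the notebook workflow Abe just walked through might look like the following. The dataset and column names are invented, and the method for exporting the accumulated expectations has been renamed across releases; get_expectations_config was the early name, so treat it as an assumption for your version.

```python
import json
import great_expectations as ge

# One-line entry point: the object returned subclasses a Pandas DataFrame,
# so normal exploration (head, describe, plotting) still works.
df = ge.read_csv("visits.csv")
df.head()

# Each expect_* call doubles as a quick data check: it returns a result with
# a success flag and summary statistics about any violations.
result = df.expect_column_values_to_not_be_null("visit_id")
print(result["success"])

df.expect_column_values_to_be_increasing("visit_date")
df.expect_column_values_to_be_in_set("channel", ["web", "mobile", "email"])

# Behind the scenes, the dataset keeps a log of every expectation declared in
# this session; export it as JSON to capture what future data should look like.
config = df.get_expectations_config()
with open("visits_expectations.json", "w") as f:
    json.dump(config, f, indent=2)
```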
[00:28:03] Unknown:
That's absolutely right. And I think, for me, Great Expectations aims a little bit at having an opinion on this matter. We're not entirely agnostic about the way that you proceed in building a data pipeline or a modeling system. And so I think we are intentionally guiding people toward that very exploratory workflow, where they are interacting with a dataset and very explicitly declaring their expectations about it and getting direct feedback on whether those expectations are met, so that when you're beginning the process of actually writing the code that you're going to keep for your pipeline or for your model development process, you've really gone through a thorough exploratory workflow.
And you understand the relationships between the datasets that you're going to be using. And we're trying very explicitly to guide you toward that kind of a workflow.
[00:29:03] Unknown:
Let me piggyback on that and say, when we say guide you toward that type of a workflow, I think it's important to call out how this is different from a typical exploratory workflow. When I used to dive in, before using expectations as a language, I would bring up my notebook, dig into the data, produce some charts and graphs, and mostly, after I spent a couple of hours doing that, what I would have is a notebook full of scratch code that I would throw away, and a bunch of ideas in my head about how the data worked that would fade over time. And there was really no permanent asset from that that was useful in the future. You can do some work and make notebooks more useful for that, but because they're hard to call from production and, for various reasons, they're just not usually a great fit for really documenting your data well. Anyway, that I think of as sort of the before and the after with Great Expectations: you have this nice set of declarations. Right? If you export that file, it's going to be this JSON object that's kind of big and hairy. But if you scan through the main pieces of it, it's describing exactly what that data object should look like. Like, here are the columns that exist. Here are the types that they have. Here are the ranges or the regexes that they match.
Here are histograms for them. And that's the sort of knowledge that is documented and reproducible for anybody who wants to come back to it later. And it's machine-readable. So if you want to run a future dataset
[00:30:30] Unknown:
against those expectations and validate that future dataset, you can verify that the future data conforms to the expectations you had in the past. I expect we're just preaching to the choir here, but I really think that captures it really well, that notion that if the before is that you're throwing away a lot of your code that was kind of proto-pipeline code, the after is that when you start the process of writing your production code, you are bringing along with you all of that configuration for the expectations that has been built up by Great Expectations watching the process of you making those expectations explicit in the notebook.
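[Editor's note] To ground the "before and after" James and Abe describe, here is a hedged sketch of the validation half: a new batch of data is checked against the JSON expectations captured during exploration. File names are placeholders, and the validate keyword has gone by slightly different names (expectations_config, expectation_suite) in different releases.

```python
import json
import great_expectations as ge

# Load the machine-readable expectations exported from the exploratory session.
with open("visits_expectations.json") as f:
    expectations = json.load(f)

# Read the new batch and validate it against the saved expectations.
new_batch = ge.read_csv("visits_2018_06.csv")
report = new_batch.validate(expectations_config=expectations)

# The report lists each expectation with a success flag and summary details,
# so a failure points directly at which assumption the new data violated.
failures = [r for r in report["results"] if not r["success"]]
print(f"{len(failures)} expectations violated")
```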
[00:31:09] Unknown:
And the fact that you have made the naming for so many of the different built-in tests so long and descriptive, I imagine, also helps with being able to reconstruct the context in which you were working when you have extracted those expectations and you're trying to then convert them into a production system, because you don't have to try and unpack some terse variable names or terse function declarations to figure out, okay, what is it that I was actually trying to test? Those function names are already very descriptive of the exact steps that were happening.
[00:31:46] Unknown:
Exactly right. You really are building the documentation while you're doing the exploration.
[00:31:52] Unknown:
And we've been talking a lot about the integration with Pandas and some of the numerical Python stack. One thing that I'm curious about is the types of production systems that you're able to integrate with. So I imagine things like Airflow and Luigi would be a natural fit. But is it also possible to integrate with things such as Spark or some of the projects in the Hadoop and Java ecosystem, like Oozie or Azkaban?
[00:32:19] Unknown:
So, specifically with respect to Spark, like Abe mentioned earlier, that's still in the to-be-seen, or still-to-come, feature section. But with respect to other kinds of job-running workflows, Oozie and whatnot, we give you a great_expectations executable as a part of the package that can take any dataset, well, I should say, at this point it will take an instantiated dataset, basically a CSV file or a connection to a data table that's already supported, so some SQL-supporting store. And it will take a fully formed Great Expectations configuration file and validate that. So, like you mentioned, at that point it becomes easy to plug it into other workflows, but the constraint is the evaluation context that's available to you, you know, where that data is located.
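[Editor's note] As one way to plug that validation step into a job-running workflow like the ones Tobias mentions, here is a hedged sketch of a task body that an Airflow or Luigi task could call. It reuses the Python API rather than the bundled command-line entry point, since the CLI flags vary by version; the file paths and the fail-loudly policy are assumptions, not part of the library.

```python
import json
import sys
import great_expectations as ge

def validate_batch(csv_path: str, expectations_path: str) -> bool:
    """Validate one incoming batch against a saved expectations config.

    Returns True when every expectation passes, so a scheduler task can
    fail loudly (and stop downstream steps) on bad data.
    """
    with open(expectations_path) as f:
        expectations = json.load(f)

    batch = ge.read_csv(csv_path)
    report = batch.validate(expectations_config=expectations)
    return report["success"]

if __name__ == "__main__":
    # Example invocation from a pipeline task; paths are placeholders.
    ok = validate_batch("incoming/visits_latest.csv", "visits_expectations.json")
    if not ok:
        sys.exit("Upstream data violated expectations; halting the pipeline.")
```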
[00:33:18] Unknown:
And if you were to start the project over again, do you think that you would still choose Python as the language for implementation,
[00:33:25] Unknown:
or do you think that there are some contexts in which you are not able to use Great Expectations because of that language choice, such as the JVM-type systems that we were mentioning? We would absolutely, hands down, start with Python again. I think Python is just such a ubiquitous glue language that it's a good fit to begin with. And then when you take into account how much work goes on in Pandas, I think there's just nowhere else where we could get as much reach and as much impact as quickly as starting in Python.
[00:33:51] Unknown:
Agreed. And have you found that there are any cases where Great Expectations would not be particularly
[00:33:59] Unknown:
usable or useful for being able to build up a suite of tests or validations within a system? I've got a few cases. Some of these are things that we want to tackle. So we are in the process of building out the full suite of expectations to cover SQL, but that's not done yet. And we do get some complaints and grumbling about that. So we know that that needs to get done. It will get done. Another one that we've talked about quite a bit, and are, I would say, still coming to a perspective on how we're going to deal with it, is nested data formats like JSON and XML. As currently written, Great Expectations basically defaults to tabular data. You can kind of force other stuff into it, but that's not what it's really best at. We're trying to figure out how you extend that to cover nested data types as well, because there are a lot of data pipelines that transition back and forth between the two. We were talking about logs earlier, and logs often have this characteristic where you have JSON objects or embedded strings that are kind of semi-tabular, but not really. And so transitioning back and forth between those formats and tabular formats would be really valuable. We don't yet have a perfect way of doing that. We have a lot of ideas.
[00:35:15] Unknown:
And what have you found to be some of the most challenging aspects of building and using Great Expectations
[00:35:22] Unknown:
and also promoting it? I think some of the most challenging things that we have come across are just the scope of the ambition, in some sense. I mean, we're talking about doing something that we both really think of as having a place in most every data-pipeline-related project that we work on. And I think it can be easy to look at the process of getting optimization right and working across all of the data contexts in which this would be relevant and, as a small open source project, get overwhelmed by that. But I think we've instead taken the approach of making incremental improvements based on the things that we and other people who are using the project have as their next step. Abe and I are very much a distributed team, and keeping the coordination going has certainly been one of the really fun but interesting challenges of building Great Expectations.
[00:36:26] Unknown:
Yeah. Let me echo that. When you think of what we're trying to do in terms of scope, we've got a system that on the low end is trying to do typing and schemas and kind of data cleaning stuff, so rules-oriented data cleaning. And at the more complicated end, it's trying to do anomaly detection on all kinds of different types and different data systems. So just getting that right, there's a lot that goes into it. When I look at how this project has proceeded compared to a lot of other data engineering and data science work I've done, it's been very methodical. James and I, and other people as they've joined the team, tend to really hash things out and talk them through carefully and think about lots of edge cases.
We do a lot of this in writing in GitHub issues, partly to leave a paper trail and partly just to make sure we're covering things. And I think that that has been the right way to approach it. We've definitely favored architecture and very careful naming and consistency and those kinds of things in this project over slapping a lot of stuff together to get to shiny things quickly. So it's been a very deliberate project, but I think that that's right. And I think that we've arrived at a place where it is useful enough that people can start to see that, and we're getting lots of good adoption and more stories coming back from the field. So I'm excited to see it gain momentum. But like James said, there are just so many details to get right. A thing that we haven't completely figured out is typing. Right? How do you have one consistent language that describes type in Pandas and in every dialect of SQL and in Spark? Oh, and also we want to do nested data objects like XML and JSON. How do you really think about typing in that world? At some level, we have to get it right, but it's complicated. There are a lot of moving parts. You know, I think that we've talked about this notion that Great Expectations is really valuable
[00:38:22] Unknown:
precisely when it helps to encourage and facilitate communication and clarity of handoff between different teams. But that really is also one of the big challenges. Right? I mean, those things are problems precisely because people have different backgrounds and different ways of wanting to interact. You mentioned the extremely verbose names that we have for the expectations, and I think that's a great example of that. It's something where people from a software development background kind of cringe a little bit and say, wow, that's so long and verbose. On the other hand, people from a deep data subject-matter-expertise background maybe look at the statistical tests that we use and say, what does that mean? And so we're really working very hard to try and balance that, and we have this mantra that we keep bandying back and forth, that explicit is better than implicit. And I think that's really a catchphrase of the project, where we are trying to encourage and facilitate communication, which means we have to work really hard on that. And it's something that is certainly evolving, and I don't know that we've gotten it right and we're done. But I think we're doing a good job, at least, of paying attention to the right stakeholders that we hope will become good users for the project.
[00:39:38] Unknown:
And we've spoken a bit about some of the particular features that you have in mind for future releases of Great Expectations.
[00:39:47] Unknown:
But in the broader sense, what are your hopes for the project going forward, and what are some ways that the community can help you out with that? Well, I think one of the things that we're hoping to see is, and we've already been really grateful to see, community engagement and pull requests from people who have a feature that they'd like to see. In terms of the design aspects, we've hit on a few of them in terms of expanding the kinds of data sources that can integrate directly with Great Expectations. There are certainly some additional parts of the expectation vocabulary that we'd like to see implemented. We've expanded into multi-column expectations, and we want to increase that vocabulary to make the full range of things that we hope somebody would want to do available. But I also think, maybe one level back, I'd like to see Great Expectations grow into something where it can be used itself as a part of the analytics value chain. So, for example, I'd like to see an ability to have templates of expectations that become a sort of descriptor for data systems, so that it would be easier to identify where a dataset has come from based on the features, if you will, that the expectations produce when evaluated.
That's certainly more of a longer-term thing. In the shorter term, I think it's just that we'd like to see continued engagement from folks in the community.
[00:41:25] Unknown:
Yeah. Let me just piggyback on engagement. Specifically, we'd love to hear, honestly, more negative feedback. Like, hey, this is not working for me. This doesn't do this. I have too much of a hard time doing that. Because if we can learn about that, then some of those things will be solvable. Some of those things are actually things that we can engineer for. We would love to hear success stories from people who are using this and finding new ways to take this tool and put it into practice. Most of my favorite experiences with Great Expectations have been of that type. And then, for those who are ready, the coalition of the willing, if they want to add pull requests, that would be awesome. There are quite a few fairly incremental things that would add a lot to the code base, especially things like building out tests, or rather expectations, that exist in Pandas but don't yet exist in SQL. That's actually probably a good place to start. So, yeah, all of those things, more stories, and maybe a little bit of code would be awesome things to get back from the community. And are there any other aspects of the project
[00:42:26] Unknown:
or pipeline testing or pipeline debt in general that you think we should discuss before we start to close out the show? Not that I can think of. James, do you have any? I'm just wondering if we hit enough on the
[00:42:38] Unknown:
aspect of the complexity carrying capacity. Maybe I'll say, Abe and I sometimes talk about what you get by using Great Expectations as an increase in the complexity carrying capacity, which is a bit of a jargony way of saying the ability that a team has to deal with the interactions that are required as a data pipeline grows. Tobias, you sort of mentioned that notion that there's this n-squared complexity that comes when you add a new person into a team, and eventually there are just more and more interactions that have to happen. And I think what we're really hoping is that we can make a difference in the entire trajectory of the way that the cost of dealing with complexity moves forward for teams, so that it's not a matter of putting out fires or changing pipelines all of the time when something breaks, but rather that we can have a lot more proactive and well-understood processes for handoff between teams, and that that can really make the process of adding the kind of complexity that we want, that enrichment,
[00:43:53] Unknown:
easier. Yeah. James, that's a really good point. Just to say the same thing from a different direction, every other branch of software engineering long ago decided that automated testing was the way to go. That, like, if you wanted your code base to get above a certain level of complexity, you had to use automated tests, or there was just no way to hold all the pieces together. And if you didn't, right, if you neglected your tests and took on that technical debt, then it wouldn't take long before you'd find yourself in this place of just spending all your time running around putting out fires, unable to refactor, unable to move forward very quickly on features. I've worked in code bases like that, and it's not fun. It's just not the right way to do it. So I think the core insight behind Great Expectations is that data pipelines have not really had a very good notion of automated testing, and expectations
[00:44:43] Unknown:
provide, as far as I know, the first really coherent framework for doing that. It's definitely a very important topic, and it's great to see that there's been a lot more conversation happening around the concept of building up those guarantees of how the different components of data systems are going to interact, because there has historically been a bit of a lack in ensuring that there are certain contracts that are maintained between those systems, to be able to grow and scale them as they gain these additional layers of complexity. And I think that, from the perspective of data engineering, it's been my experience talking to people that there isn't as much of a focus on automation and the idea of pets versus cattle in terms of the underlying infrastructure as there has been with the systems administration and cloud automation kind of stuff. So it's good to see that there is more of that discussion happening in general across the discipline of data engineering as it becomes a more mature component
[00:45:44] Unknown:
of data teams and companies in general. Yeah. That's a good point. Agreed. My wheels are turning in my head on pets versus cattle and how that applies here. That's an interesting bridge.
[00:45:55] Unknown:
Alright. Well, I appreciate both of you taking the time to talk to me about this. I'll have you both add your preferred contact information to the show notes for anybody who wants to follow the work that you're up to or get in touch with you. And with that, I'll move us into the picks. This week, I'm going to choose the Fitbit Versa. I picked it up a few days ago because I was interested in being able to do sleep tracking and get some insights into that kind of stuff. And so far, it's been very interesting and enlightening. It's a great device, seems to be very easy to use, and there is actually an SDK for being able to write your own applications for it, so I'm interested to see what kinds of new developments will happen for that. And with that, I'll pass it to you, James. Do you have any picks for us this week? Alright. Well, I will
[00:46:41] Unknown:
pick the chance to kind of get away and unplug every now and then. I think that's one of the things that helps manage the complexity of real life. My fun this weekend was getting to head out on the lake in a canoe with my two girls. And I encourage everyone to take some time to unplug from all the data engineering that they have to do and get outside and enjoy nature. Abe, do you have any picks for us this week? Yeah. Let me put in a plug for my company. That's not usually the hat I wear, but we're Superconductive
[00:47:16] Unknown:
Health, and we're focused on building the boring stuff. So we're a data engineering team. We're essentially a boutique consulting company focused on ETL, data normalization, data cleaning, data quality, exactly the sorts of things that you run into over and over again if you're building expectations. And we're trying to do all of that better, faster, and cheaper. So if that's the kind of thing that your listeners are running into, we'd love to help out. Let me also put in a plug for a book that I read about a year ago that has just kind of bubbled up to the top of my mind several times recently. It's called Slack, and the subtitle is Getting Past Burnout, Busywork, and the Myth of Total Efficiency. It's one of my favorite business books. It talks about why, actually, James, this goes along with what you said, why you have to have spare time and how that really plays out from an organizational perspective. Why having some slack capacity
[00:48:07] Unknown:
is just totally crucial. So, highly recommended. Alright. Tobias, if it's okay, I have to give you one more pick, and you can decide if I get to. We've been talking a lot about data engineering, because it's the thing that Great Expectations is offering. But if there's a pick of an idea that I have, I think it's that in a lot of the dialogue that I see right now in data science, we're focused on the very, very last mile, as it were, that last analytics step that takes all the datasets that have been curated and developed and chosen and cultivated over time and produces some really amazing results from them, which is phenomenal. But I think it's important when we think about these kinds of projects to realize where the entire value chain has come from, and that as we build that last analytics layer, our greatest new models, and start deploying those to great effect, we have to remember the foresight that came long before in terms of building the underlying collections that have made that possible. There's a project I'm working on right now related to machine translation where we are getting some really incredible results, but they are results that would only have been possible because of the foresight of some other folks, years in advance, to start structuring data in a new way that made it possible for us to really quickly get that last mile. And that kind of data engineering work that went well before us is just incredibly valuable, and it creates value for a long time. So I want to just plant that reminder to be grateful for all the data engineers in your life.
[00:49:49] Unknown:
Shoulders of
[00:49:50] Unknown:
giants. Yeah. Alright. Maybe go with Kine. Alright. Well, thank you both for taking the time to join me today and discuss the work that you're doing with great expectations, and thank you for your efforts in building and growing that library framework. It's definitely something that is going to provide a lot of value for a lot of people. So thank you for that and I hope you enjoy the rest of your day. Alright, you too. Thank you so much for reaching out to us.
Introduction to Guests and Great Expectations
Origins and Inspiration Behind Great Expectations
Complexities in Testing Analytics Pipelines
Use Cases and Benefits of Great Expectations
Core Features and Expectations
Getting Started with Great Expectations
Integration with Production Systems
Challenges and Future Directions
Closing Thoughts and Picks