Summary
Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don’t all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Your host as usual is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration and visualization
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of what Superset is and what it might be used for?
- What problem were you trying to solve when you created it?
- What tools or platforms did you consider before deciding to build something new?
- There are a few different ways that someone might categorize Superset, such as business intelligence, data exploration, dashboarding, data visualization. How would you characterize it and how it fits in the current state of the industry and ecosystem?
- What are some of the lessons that you have learned from your work on Airflow that you applied to Superset?
- Can you give an overview of how Superset is implemented?
- How have the goals, design and architecture evolved since you first began working on it?
- Given its origin as a hackathon project the choice of Python seems natural. What are some of the challenges that choice has posed over the life of the project?
- If you were to start the whole project over today what might you do differently?
- Can you describe what’s involved in getting started with a new setup of Superset?
- What are the available interfaces and integration points for someone who wants to extend it or add new functionality?
- What are some of the most often overlooked, misunderstood, or underused capabilities of Superset?
- One of the perennial challenges with a tool that allows users to build data visualizations is the potential to build dashboards or charts that are visually appealing but ultimately meaningless or wrong. How much guidance does Superset provide in helping to select a useful representation of the data?
- In addition to being the original author and a project maintainer you have also started a company to offer Superset as a service. What are your goals with that business and what is the opportunity that it provides?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Superset used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Superset project and community?
- When is Superset the wrong choice?
- What do you have planned for the future of Superset and Preset?
Keep In Touch
- @mistercrunch on Twitter
- mistercrunch on GitHub
Picks
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Superset
- Preset
- Airflow
- AirBnB
- Lyft
- Django
- Flask
- CRUD == Create, Read, Update, Delete
- Business Intelligence
- Apache Druid
- Presto
- Trino (formerly known as Presto SQL)
- Redash
- Looker
- Metabase
- Flask App Builder
- React Redux
- Typescript
- GraphQL
- Celery
- Redis
- RabbitMQ
- S3
- AirBnB Superset Blog Post
- D3
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'd like to welcome back Max Beauchemin to talk about Superset, an open source platform for data exploration and visualization.
[00:01:05] Unknown:
So Max, can you start by introducing yourself? Well, first, thank you for having me on the show again. It's a pleasure to be here. Quick intro for me. So I'm Max. I'm the original creator of Apache Airflow and Apache Superset. I started these projects while I was at Airbnb about 5 years ago or so. Since then, I spent some time at Airbnb and at Lyft, and more recently, I started a company. So now I'm the CEO and founder of a company called Preset. So we're at preset.io. We commercialize Superset in some way. So we contribute heavily to Apache Superset, and we participate in making the open source project great.
[00:01:41] Unknown:
And then we also offer a kinda hassle free hosted experience around it, with bells and whistles and kinda crust around this open core. For people who didn't listen to the previous interview that I did with you where we talked about Airflow, can you remind us how you got introduced to Python?
[00:01:57] Unknown:
Yes. So I was thinking about that this morning, trying to recollect, like, when I was first introduced to Python, but I think that was around, like, 2006 or 2007, so pretty early on, looking at Django. So as a software engineer, I've been building websites here and there, you know, since the beginning of the Internet. And looking at Django, I got pretty excited by the admin functionality and just, like, the breadth of the framework. I thought it was super powerful. So I remember at the time, I started listening to a lot of Django related podcasts. There was a little bit of a golden era around that time around Python and Django, and I thought it was super exciting to be able to build a website kinda quickly, taking advantage of the sheer breadth of the framework itself, and more specifically, the admin functionality.
[00:02:44] Unknown:
Your history with Django is interesting, given that both the web UI for Airflow and Superset itself are actually based on Flask. So maybe we can dig a bit more into your migration from Django to Flask.
[00:02:56] Unknown:
It's probably, like, the interest of, you know, playing with new toys, I think. Right? And so I guess that's kind of a trend in software too. So while Django, I think, was a super rational choice, it's interesting that at the beginning of a project, you're like, oh, what can I play with? Right? What are the libraries I wanna use? And sometimes it's, like, semi rational. I think Flask is a great choice. I was interested to look into, at that time, there was Flask Admin, Flask App Builder, and other kinda CRUD systems. And that was really my main selection criteria. I wanted to be able to not have to build essentially these CRUD web forms. For the audience, CRUD stands for create, read, update, delete. Right? As you define an application like, you know, Airflow or Superset, you're gonna have essentially this entity relationship type diagram idea, right, where, say, for Superset, we have, you know, charts and dashboards and users and roles.
And, you know, these web frameworks are great in the sense that they kind of auto generate a lot of these web forms that are really painful to write from scratch if you're gonna do it. So that really helped me, in both these projects, to get to an MVP kinda quickly by not having to build any of these web forms. And so digging a bit more into Superset, can you give a bit more of a background about what it is and some of the ways that it's being used? As much as I'm not a big fan of the term business intelligence, it just feels like a dated kind of term. Right? The business intelligence industry, you know, is probably more than 20 years old at this point. But I think it describes fairly well what Superset does, in terms of, you know, Superset is a place where people can connect to their data, explore data, put together dashboards, and share them. Right? So it's really a tool for organizations to serve all their data consumption, exploration, and dashboarding needs.
Superset also has a SQL IDE built in. So now we see the rise of, you know, data analysts, data scientists, engineers, people who just know SQL. SQL has very much affirmed itself as the lingua franca of data too. So we also offer this powerful SQL IDE. So it's very much a holistic solution for your team, for your organization
[00:05:11] Unknown:
to really dig in and go and understand their data and collaborate with data. When you first created it, I know that it came about as part of a hackathon, but what was the problem that you were trying to solve at the time? And what are some of the tools or platforms that you had worked with before
[00:05:28] Unknown:
that led you to the decision that there was a need or an opportunity for something new to fulfill this particular use case? So thinking about, like, my journey in business intelligence, I started my career in the early 2000s. And very quickly, I became a business intelligence engineer, a data architect. Right? So I did a fair amount of ETL, and just, you know, using these tools and making them available at the companies where I worked for people internally to self serve with the data that we were kinda preparing in the backroom for them. So I used extensively things like Microsoft SQL Server, Analysis Services, you know, Excel, Business Objects, MicroStrategy, and all sorts of other packages and solutions over the years.
These solutions were kinda as enterprise-y and as kinda big desktop type applications as they come. So looking at the context in which Superset was created more directly, picture, you know, I think it's summer 2015, a 3 day hackathon at Airbnb. The premise there originally, the scope, was much smaller. The idea was not to build a business intelligence solution over 3 days. That would have been kinda crazy. Maybe not completely unrealistic now looking back, but the prompt was, we were at the time POC-ing an Apache Druid cluster. So Apache Druid being an in memory real time database, and that's fairly early in the history of Druid. Right? At the time, there was no web UI in front of Druid, and I thought it was just this really interesting database because it was so fast and so fresh in a sense, like, you know, real time data, very low latency, really fast in memory type queries.
So I was just excited to go and build a data exploration front end for Druid. So over that period of 3 days, you know, I was able to come up with a certain number of visualizations and kind of a very simple explorer where you can pick your metrics, pick your dimensions, and apply some filters. So pretty basic. Druid did not have a SQL interface. It had this proprietary kinda API to it, and there was no web UI. So I thought it was new and innovative. I was also trying to recreate, I was out of Facebook at the time, and Facebook had this really interesting set of internal tools. Right? Some of these tools were part of the inspiration for Airflow and for Superset. And more specifically for Superset, internally at Facebook, there was something called Scuba that everybody loved and used super extensively. It's probably the secret behind some of the successes at Facebook. Right? Like, Facebook being so data driven, and I would say agile with data, was in part because of this awesome system called Scuba that was very similar to Druid in terms of the back end, and the front end that I built on top of Druid at the time was inspired by the Scuba front end in some ways too. Right? So a mix of inspiration from my experience in business intelligence and, you know, my experience using tools like Scuba at Facebook. So that was very much the prompt. At the time too at Airbnb, we were investing heavily in a Presto and Hive cluster, so, like, a big kind of data lake, a big HDFS cluster, a big investment in Hive and Presto.
The tools that we had at Airbnb at the time were mostly just, like, a limited Tableau license, and Tableau did not really work well with Presto at the time, so there was an opportunity to extend Superset very quickly. You know, it worked with Druid in particular, and I extended it to work with SQL in general, and that enabled easy direct access to visualization against Presto without having to request a license or install a desktop application or these sorts of things. Right? So that made for kind of easy access. And a lot of people wanted more access to data internally at Airbnb at the time, and I think building these tools could just kinda enable people to get quick access without having to worry too much and just go straight to it. You know, so it grew from that point. We open sourced it fairly quickly after this, and the rest is history.
[00:09:45] Unknown:
So you mentioned a bit about your background with business intelligence and some of the ways that Superset is used for these types of workflows of being able to build a dashboard and present information to end users to be able to understand some of the business metrics and be able to dig in and explore it a little bit. And business intelligence as a product category is fairly old at this point in terms of, you know, the overall scale of computing. It's been around since at least the nineties and has gone through a number of different iterations and generational shifts. And there are a number of other tools out there right now that have their own particular focus, thinking in terms of things like Looker or Metabase and Redash. And I'm wondering if you can just give your impression about where Superset fits in the overall matrix of business intelligence and data exploration tools and data visualization and dashboarding tools, and sort of how the broader ecosystem looks, and sort of why somebody might wanna choose Superset over the other options. You know, digging into business intelligence as a product category,
[00:10:48] Unknown:
and I think it's very unique in the sense that BI has always been, and still is today in many ways, kinda everything for everyone. Right? So it's like all of your data needs are served by this BI platform and solution. I think that's driven by the fact that the buyers in the space and the people in general just want a single solution for their organization where they can take care of all of their data needs. There's also, like, a growing aspect of data work, which is, like, making it collaborative. Right? So you'll have, like, a lot of different tools. We talk to companies that have, you know, 7 or 8 BI tools that they've accumulated over the years just to try to satisfy, you know, different needs and different personas. But I think there's still very much this will of having one tool that everyone in the organization can use and collaborate on. Right? Really often, you'll have certain personas that are more kind of content creators that wanna make data available to more business users or people that are more operational. So I think there's that need of a single platform that does it all, and I think it drives some of the way that the products are built in the space.
There's much more, I think, in your question to dig into. So there's one question, which is, where does Superset fit into this? So, personally, I'm passionate about the idea of, like, catering to a wide spectrum of levels of sophistication with data. Right? So to have a platform where you have a SQL IDE, you have the slice and dice UI, you have the dashboards. So depending on your use case and how deep you wanna go, you'll find the right level. In some cases, you want kind of an easy, intuitive, just easy to use type interface, but sometimes you need more power, and you go into a SQL IDE and, presumably, a level deeper into, say, a notebook, where you might wanna do things that are a lot less intuitive or less accessible to people in general, but much more powerful, in the sense that if you need to do a simple model or do some forecasting or do some custom visualization work because you're trying to get to something very kinda uncanned, right, something that's not cookie cutter, it's good to be able to go up and down this ladder of levels of complexity.
So I think Superset does that well. Now in terms of positioning, where does Superset fit in this big kind of competitive market? So one thing that's for sure is, like, Superset's cloud native. Right? So we were born in the Kubernetes and Docker era, which is very different from some of the old school vendors that are still kinda tangled in Windows Server and desktop applications. So that's a clear thing from day one that we've had. I think the things that are most interesting in Superset are the extensibility and integratability, like the way that Superset is integrated with all databases. And the fact that it's extensible, because we have this open source story, it becomes very natural for us to have a lot of points of extensibility.
So to dig into that, just kinda listing out some of the things that come with that: first, it's like a very awesome REST API. Right? So everything you can do in Superset as a user by clicking around, or most everything, you can do through a REST API. You can automate, so you can do things programmatically. Right? So for the people who know Airflow too, that's kind of the premise there too, to be able to programmatically do the things that users do. The sky is the limit in terms of automation. If you wanna automate workflows and things in and around Superset, you know, you can do that with the REST API.
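To make the automation idea concrete, here is a small sketch built on the standard library only. The `/api/v1` endpoint paths and payload shape follow Superset's REST API conventions, but the base URL and credentials are assumptions, and the exact routes should be verified against the OpenAPI docs for the version you run.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8088"  # assumed local Superset instance

def build_login_request(username: str, password: str) -> request.Request:
    """Build (but don't send) the JSON login call that returns a JWT."""
    payload = json.dumps({
        "username": username,
        "password": password,
        "provider": "db",
        "refresh": True,
    }).encode()
    return request.Request(
        f"{BASE_URL}/api/v1/security/login",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_list_charts_request(access_token: str) -> request.Request:
    """Build the authenticated call that lists saved charts."""
    return request.Request(
        f"{BASE_URL}/api/v1/chart/",
        headers={"Authorization": f"Bearer {access_token}"},
    )

# Actually sending it would look like:
#   resp = request.urlopen(build_login_request("admin", "admin"))
#   token = json.load(resp)["access_token"]
```

Splitting the request construction from the sending keeps the sketch testable without a live server; in a real script you would feed the token from the login response into the subsequent calls.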
We also have, like, a really good plugin framework. So on the Preset blog, we have some pretty good posts around how to get started in writing a new visualization plugin for Superset. So I think we're starting to see kind of a plugin ecosystem grow, and that's something we really wanna invest in. Right? So that we can have people that are maybe from different fields, like, someone from genomics or from microbiology doing, like, very specific data visualization for DNA or genomes and things like that, and contributing that back to the community. We're very excited about that.
Talking a little bit more about the open source angle. So for people who want a lot of control over how they run their software and how they scale it, places like Airbnb, Dropbox, Netflix, Tesla, right, that have very large Superset clusters, and they want to control that experience internally. You know, they can do that because of the flexibility of open source. One part is, like, say, the security layer. In Superset, there's an abstraction there called the security manager, where you can really define, like, how you do things like authentication and authorization, who has access to what in terms of, like, data access policy, or in terms of what they can do, in terms of permissions in the application.
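As a sketch of that security manager extension point: in `superset_config.py` you can point Superset at a subclass of `SupersetSecurityManager`. The class and the `CUSTOM_SECURITY_MANAGER` setting exist in Superset, and `has_access` comes from the underlying Flask App Builder security manager, but the override below is purely illustrative and `external_policy_denies` is a hypothetical helper, not a real API.

```python
# superset_config.py fragment (sketch): swap in a custom security manager.
from superset.security import SupersetSecurityManager

class CustomSecurityManager(SupersetSecurityManager):
    # Hypothetical override: consult an external policy service before
    # falling back to Superset's built-in role/permission checks.
    def has_access(self, permission_name, view_name):
        if external_policy_denies(view_name):  # hypothetical helper
            return False
        return super().has_access(permission_name, view_name)

CUSTOM_SECURITY_MANAGER = CustomSecurityManager
```

Because this is a configuration fragment that Superset loads at startup, there is nothing to run standalone; the point is that authentication and authorization logic lives in code you fully control.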
So lots of power there that comes from the fact that it's programmatic, extensible, was born on the cloud. Right? So you can really define how you wanna run that software in your organization. And if you're not interested in that as much, you can have the benefit of having a commercial open source vendor. If you want someone else to run the software for you, you know, there are companies like Preset that can just kinda roll it out and have, like, a hassle free, get started in 2 minutes experience too, if you prefer that. And you could always go to the no lock in kinda running it on your own down the line if you decide so. That's one thing, you know, that's inherent to open source, talking about the no lock in part of open source. I think people are fed up with these big vendor contracts, right, that are looking to land and expand, and, you know, they're kinda extracting as much money as they can from your organization. So with open source and commercial open source vendors, it kinda keeps all of us honest in terms of providing value, you know, and people, like, keeping their freedom in terms of how they wanna run and operate that software.
[00:17:02] Unknown:
Yeah. The open source and extensibility aspects are some of the things that I personally find interesting about Superset. And digging a bit more into some of the architectural aspects, you were mentioning some of the plugin interface and the extensibility of the access and authorization layer. I'm wondering if you can just give a bit more of a broad view about how Superset is actually implemented and some of the ways that the overall goals and design and architecture of the system have evolved since that first day that you began it at the hackathon.
[00:17:31] Unknown:
Now I'm gonna start digging into the archaeology, right, of the software and the different layers, you know, over the past 5 years. One thing that's really interesting to note is the fact that Superset was born a Python project. I know we're on a Python podcast, but over time, it has evolved to become much more of a JavaScript slash TypeScript React project. You know, we're building user experiences, and I think very naturally, we tend to evolve in the direction of adding just a really clear, solid, like, Python API in the back end, and for everything else to become more and more of just a single page app written in JavaScript. It's still very much served by a Python layer, by Flask currently. So Flask is the web framework.
So from day 0, I was talking about my interest in, like, kind of trying new things when I start projects. Right? So we use something called Flask App Builder. So the Flask ecosystem is the opposite of, like, a monolithic framework. Right? It's a decoupled, lightweight framework where you'll have all sorts of, like, small packages like Flask-Login. I forget all of the different, like, Flask extensions that there are, but there's a huge network of extensions in the Flask world. Flask App Builder takes an opinionated approach to this and brings back a collection of these plugins into a more, call it, recomposed framework. Right? So it's like a monolith assembled from its pieces, with more assumptions and therefore more guarantees, right, that come with a stronger set of assumptions.
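As a concrete illustration of the CRUD scaffolding being described, here is a minimal, hypothetical Flask App Builder sketch. The `Dashboard` model is illustrative rather than Superset's actual model, and exact FAB APIs may differ by version.

```python
# Hypothetical sketch: with Flask App Builder, defining a model plus a
# ModelView is enough for the framework to auto-generate the list/add/
# edit/delete web forms that would otherwise be painful to hand-write.
from flask_appbuilder import Model, ModelView
from flask_appbuilder.models.sqla.interface import SQLAInterface
from sqlalchemy import Column, Integer, String

class Dashboard(Model):  # illustrative entity, not Superset's real model
    id = Column(Integer, primary_key=True)
    title = Column(String(250), nullable=False)

class DashboardModelView(ModelView):
    datamodel = SQLAInterface(Dashboard)
    list_columns = ["title"]

# Registered on an app's AppBuilder instance, e.g.:
# appbuilder.add_view(DashboardModelView, "Dashboards", category="Manage")
```

The auto-generated forms are exactly the "entity relationship" views mentioned earlier: one model and one view class per entity, and the framework handles the rest.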
But, yeah, so Flask App Builder over time allowed us to get these CRUD forms very quickly and get started quickly with some MVC code. And I think we've been very much, like, removing a lot of that scaffolding over time. So it's a good approach to, like, software development. Right? Like, you probably wanna get to an MVP fairly quickly, prove some value, and over time, remove the training wheels and the guardrails and kind of all that scaffold that we needed at the beginning. So that's part of the evolution of the software. Within the front end, things have been moving extremely fast. I mean, people know, it's almost a gag at this point, you know, how fast the front end ecosystem is evolving. Right? So it's kind of a moving target.
But I would say things that we've settled on early on are things like React Redux. We're moving more towards functional components in React, using more hooks as opposed to Redux. We're also moving towards TypeScript. We also moved from Bootstrap, kind of a component framework that's very well known, into something called Ant Design over the past year or 2. So I think on the front end side of things, things are moving extremely fast, and we're staying kind of fresh on top without going crazy and going into, like, the latest, you know, like, today's fashion. So we're probably, like, adapting to kind of this year's trend in fashion on the front end side of things.
It's been interesting to try to keep up, right, with things like testing libraries, you know, going in and out of fashion. And, you know, we went from Selenium to Cypress, and from things like Enzyme to React Testing Library, I think, is the latest one we're moving to. So really very much, like, keeping up with the pace of evolution on the front end
[00:20:52] Unknown:
there. You mentioned the kind of evolution of it being a primarily Python app to now being an application that is Python on the back end, but very heavily JavaScript for the actual main user experience, but that there are these APIs for being able to populate and automate a lot of the user experience for building out these dashboards and being able to do things like propagate changes from a QA to a production environment. I'm wondering if you can just talk through a bit more about how you feel now about Python as the kind of core building block of the application, and if you were to start over today, what are some of the things that you might do differently, either in terms of language choice or system architecture?
[00:21:34] Unknown:
There's 2 main things I would say that make us kinda reevaluate Python over time. One is the GraphQL kind of prominence over time. Right? So I think, like, the GraphQL story in Python is a little bit lagging. It's not as great as in the Node ecosystem. And then everything around sockets and WebSockets. So I think if you're gonna pick a solution that has GraphQL and WebSockets as a premise, you know, I don't think Python's a great choice nowadays. It's feasible. It's doable, and maybe I'm not in tune with the latest, like, maybe it's actually a great choice nowadays, but it seems like from that perspective, you know, I would pick a Node app for some of this. We're looking to use some WebSockets as we build more collaborative type features inside Superset.
There are some places where WebSockets are just much better than using polling. So we're looking to bring some optional components inside the Superset architecture that would run on Node, for instance. So if you do want certain features, or for some features to run better at scale, you would have to kinda enable an extra service, a Node service, that's optional. So, for people who know Airflow and Superset, you know, it was always important to me to allow people to just kinda pip install Airflow, pip install Superset, to get into a demo environment where they can at least, like, start doing some work. This demo environment might not, you know, scale to thousands of users, but at least it allows people to get started quickly. And we know that time to value in software everywhere, probably, you know, just in the human experience, is really important.
Right? So we really care about time to value. Talking about that, you know, and getting into, you mentioned the word architecture somewhere in your question, so I'll drill into that a tiny bit. To run a Superset cluster at scale, right, is a fairly hard thing, because when you start digging into the architecture, you know, sure, you can pip install Superset and go navigate to, you know, localhost port 8000, you know, and then click around and try things. But if you're trying to serve a lot of users, then we have all of these architectural components that you can switch on. Right? So things like an async back end, where we have async workers using the Celery framework in Python, which I'm sure some people in your audience are familiar with. So it's not required for you to have Celery to run Superset. But if you run Superset at scale, you should certainly have the async service switched on, which requires a message queue, probably something like Redis or Rabbit.
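A sketch of switching on that async backend: in `superset_config.py` you provide a Celery configuration class. The pattern below follows Superset's documented shape and Celery's own config keys, with Redis assumed as both broker and result backend; the URLs and tuning values are placeholders for a local setup.

```python
# superset_config.py fragment (sketch): enable the Celery async backend.
class CeleryConfig:
    broker_url = "redis://localhost:6379/0"      # message queue (assumed)
    imports = ("superset.sql_lab",)              # task modules to load
    result_backend = "redis://localhost:6379/1"  # where results land
    worker_prefetch_multiplier = 1               # one task per worker fetch
    task_acks_late = False

CELERY_CONFIG = CeleryConfig
```

With this in place, a separate `celery worker` process picks up long-running queries instead of tying up the web workers, which is exactly the "switch it on when you need scale" incremental approach described here.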
If you also wanna provide a great experience, it's great to enable the different caching layers that we have in Superset. So then you're relying again, probably, on a Redis of some kind. You know, the approach to architecture is kinda incremental in some sense, right, where an organization might run a very basic install of Superset, and as their requirements grow, the cost of operating that software and the complexity of the architecture gets a little bit pricier. Right? So that's definitely part of the value proposition around the commercial open source company, where if you want to just try Superset and maybe not worry about running it at scale and being on call for it, then you can elect to just have this, like, you know, hosted experience where it just works.
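And a sketch of enabling one of those caching layers: `CACHE_CONFIG` in `superset_config.py` is handed to Flask-Caching, so the keys follow that library's conventions. The Redis host, port, and timeout below are assumptions for a local setup, and the valid `CACHE_TYPE` strings vary by Flask-Caching version.

```python
# superset_config.py fragment (sketch): a Redis-backed cache for chart data.
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",       # "redis" in older Flask-Caching
    "CACHE_DEFAULT_TIMEOUT": 300,     # seconds before entries expire
    "CACHE_KEY_PREFIX": "superset_",  # namespace keys in a shared Redis
    "CACHE_REDIS_HOST": "localhost",
    "CACHE_REDIS_PORT": 6379,
    "CACHE_REDIS_DB": 2,
}
```

This is the incremental pattern in miniature: the basic install works with no cache at all, and you bolt on Redis only once repeated dashboard queries start to hurt.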
[00:25:05] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14 day trial and make your life a lot easier. One of the interesting things about Superset in general, as I was looking through the documentation to explore using it for some of my own work in my day job, is that with a lot of pieces of software, they say, you know, download the software or put it on disk, you know, and then you have the binary, and then you just drop a text file, whether it's YAML or INI or TOML, and then you start it up. But with Superset, the config file is actually a Python module, which I found interesting. I'm wondering if you can maybe just talk a little bit about that and then go from there into some of the extensibility and interfaces that are available for people who want to build their own integrations or extend the capabilities of the software.
[00:26:23] Unknown:
Yeah. You know, there's an underlying, like, deeper philosophical question there too, you know, when you think about immutable configuration, or, you know, this trend of, like, should a pipeline be defined as YAML or as code. Right? And, you know, I know where I stand on this. I think, like, I really like infrastructure as code, configuration as code, you know, pipeline as code. And then, you know, I think on top of that, you can build a more kind of semantic, static approach to it with things like YAML files for things that don't need to be as dynamic.
I think for certain configuration things, you need these, like, much more complex hooks that need to have logic in them. Right? An example is we have a configuration hook that just, like, gives you a handle on the Flask app and allows you to mutate it in some way. So if you wanna tweak your headers, or if you want to just kinda mutate that Flask app object or extend it further through a configuration hook, you can do this. If we didn't have that, we would have to have YAML files that come with documentation that is, like, you know, a 100 pages long. Right? Every single option of configurability in the Flask app, we would have to expose as something static, and that just gets kind of cumbersome and hard to manage.
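To make the idea concrete, here is a minimal sketch of what a `superset_config.py` along these lines might look like. `FLASK_APP_MUTATOR` is the hook being described (Superset calls it with the live Flask app); the `X-Org-Policy` header is a made-up example of the kind of tweak that would be awkward to express in static YAML.

```python
# superset_config.py -- Superset imports this module at startup,
# so configuration is plain Python rather than a static text format.

ROW_LIMIT = 5000  # a simple static setting, for contrast


def FLASK_APP_MUTATOR(app):
    """Hook that Superset calls with the live Flask app object.

    Anything you can do to a Flask app -- register handlers, tweak
    headers, wire in extensions -- can be done here with real logic.
    """

    @app.after_request
    def add_policy_header(response):
        # Hypothetical example: stamp every response with a header.
        response.headers["X-Org-Policy"] = "internal-only"
        return response
```

Because the file is just Python, the hook can contain arbitrary logic, which is exactly the flexibility a static format can't give you without enumerating every option.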
There are a bunch of other instances of things where we have these really useful hooks, things that can kind of mutate the database connection, or the database query for that matter, you know, kind of just in time too. So you can start doing much more complex things. We also have some hooks around feature flags, where you can say, maybe sometimes you do have static feature flags, but sometimes you want to say, like, oh, I want 10% of my users to see this, or I want a certain set or group of people to have a feature flag on or off. And how would you define that in static configuration? It would be pretty tricky to do that. You would have to have very complex semantics in there. So I believe in, like, configuration as code in many instances. It's not always the best, but it's certainly convenient for maintainers to expose a lot of flexibility without getting tangled in the details.
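The percentage-rollout idea above can be sketched with a deterministic hash bucket. The hook name `GET_FEATURE_FLAGS_FUNC` and the explicit `username` parameter here are assumptions for illustration (in Superset the hook reads the current user from the request context), and the `NEW_EXPLORER` flag is made up.

```python
import hashlib


def in_rollout(username, feature, percent):
    """Deterministically bucket a user into a percentage rollout.

    Hashing (feature, user) gives each user a stable bucket in 0-99,
    so the same user always gets the same answer for a given flag.
    """
    digest = hashlib.sha256(f"{feature}:{username}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


# Sketch of a dynamic feature-flag hook (hypothetical signature):
# Superset would call this with the static flag dict and let the
# function mutate it per user.
def GET_FEATURE_FLAGS_FUNC(flags, username="anonymous"):
    flags = dict(flags)
    # Enable the (hypothetical) NEW_EXPLORER flag for ~10% of users.
    flags["NEW_EXPLORER"] = in_rollout(username, "NEW_EXPLORER", 10)
    return flags
```

This is the kind of logic that static YAML can't express: the result depends on who is asking.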
In terms of what's configurable, I don't even know at this point. There's a lot of configuration hooks. I think Airflow is similar too. You know, it's part of the burden of when you really take ownership over a piece of open source software: you kind of have to sort through all of these options, right, and kinda understand what your choices are. And sometimes it gets fairly intricate. Essentially, what is exposed to you, as someone who's ramping up on the software and wants to operate that software, is all of the options that everyone in the community ever wanted to perhaps configure differently. Right? Maybe someone at Netflix was, like, for us, we need to have this option. Right? We need to have this flexibility.
So, you know, we try, as maintainers, to put sensible defaults in place, but I think it's extremely challenging, because people want to operate the software in intricate ways. That's part of the value proposition of open source, and sometimes that turns into, you know, a certain amount of, I wouldn't call it headache, but a mountain of complexity that you're not really aware of. As you start running the software, 1 day you're probably like, oh, you know, I wonder if there's a way for me to do x. And then sure enough, there's a configuration flag that, you know, exists that does that, that may be documented or not. Right? So I think, you know, a lesson maybe for me and for open source maintainers is we should be better at this. Right?
So I think it's important to maybe have different operating modes and, you know, configuration sets that are more comprehensive and well documented, and really having sensible defaults there is super key. Yeah. I definitely think that your point of configuration as code is valuable because it does
[00:30:23] Unknown:
put the power in the lap of the end user without having to own all the complexity as the maintainer. Because as the maintainer, somebody says, oh, I want it to do this, and you're using a static configuration file. Well, then all of a sudden, you have to either say, no. That's never gonna happen, or, okay. Fine. I'll add this flag. And as with everything in open source, it's free as in puppy, where you say yes once, and then you have to own it for the rest of its life.
[00:30:49] Unknown:
Yeah. It's like when you're holding a hammer, everything looks like a nail too. So every time someone shows up and is like, hey, I need that hook, or I've got this special need, I'm like, yep, configuration hook. Let's just, you know but yeah. So there comes a burden with that. Yeah. And then as a consumer of the open source, if you see there are 5,000 flags that you can set and you maybe understand half of them,
[00:31:10] Unknown:
there's a lot of cognitive barrier to actually adopting that software, because you say, oh, well, now I have to understand this vast matrix of possibilities, and then you end up building a whole other piece of software on your side just to generate the config that you care about. So you have this kind of double ended pyramid, with complexity in the middle and, you know, simplicity on the edges, and it's easier to kind of flip that, where you have simplicity in the core, and then you expand on the potential complexity that the end user is able to adopt, without having this initial barrier of you have to understand these 5,000 toggles and all the different ways that they're going to interact with the software.
[00:31:47] Unknown:
Certainly, it is a challenge for people operating the software. 1 big inherent problem that comes with this too is the matrix that you mentioned. Right? All the potential combinations of flags being turned on and off become, you know, infinitely complex and impossible to test for. Right? You can build a very complex, big build matrix that tries to run tests with all the feature flags on and off, but you won't be able to test all the combinations of feature flags; it grows exponentially very quickly. That matrix becomes massive.
It certainly is a challenge. I was talking about configuration sets that are, like maybe there are a 1,000 flags that you can intricately combine. But in reality, when you think about the different operating modes of the software, it probably gets a lot more simple. You know, you wanna run, like, a really massive instance of Superset that serves 10,000 people, versus something that you just run on your laptop, versus a small team. So we could be better, I think, at creating these little kind of recipes or presets. So I think people need these sets of configuration that make sense together. We've seen that emerge in places like, you know, the Docker Helm chart kinda combo, or Kubernetes operators, where people will publish these different sets or, call it, opinionated operating modes for a given platform or piece of software. Yeah. Having a strong opinion loosely held is definitely 1 of the
[00:33:17] Unknown:
undervalued aspects of a good open source maintainer.
[00:33:20] Unknown:
Yeah. It certainly is a challenge too. I feel like, as maintainers, we drift with the wave of the community too. So we're trying to please people, and we don't always have time to come up with, you know, a strong opinion. So sometimes we'll drift for a little while, and then, you know, maybe we'll take a stance when things become a little too mushy, unclear, or fuzzy over time. But it certainly is an art to try to get all that feedback and those contributions into a place that somehow works for people.
[00:33:51] Unknown:
For a brief digression on the point of having opinions about things, data visualization as a problem domain is something that is often fraught with misunderstandings or the potential to misrepresent information. And so if you kind of don't have any guardrails on a tool and let somebody go nuts on building whatever visualization they want, then it can be easy to accidentally end up in a situation where you have useful data, but the way you try to represent it is either wrong or misleading or just confusing. And I'm curious if there are any pieces of guidance within the superset platform to help people to build meaningful representations of the information that they're exploring.
[00:34:33] Unknown:
It extends to the whole, what I call the analytics process. Right? Like, from instrumentation to data collection and, you know, cleansing, you know, data modeling, data engineering, kinda in general, all the way to visualization. So there's a lot of things that can go wrong in the analytics process for data to be, you know, not trustworthy. And it starts with instrumentation. You can get your instrumentation wrong. You can get your data collection, kind of bringing that into the warehouse, wrong too. You can get your modeling and cleansing and, you know, kind of ETL process wrong as well. And then, you know, we're talking about the layer of, like, can you misrepresent the data?
So 1 thing that's important and good, I think, is to start with the premise that you have a table that has a set of dimensions and metrics that is, like, somewhat trustworthy. Right? So for us in Superset, it's hard to know, on that layer, everything that happened before up until that point. So I would say the contract between, say, the backend or the data warehouse and the visualization tool, in this case Superset, is these data structures that are mostly tabular in general. You can also do some data transformation inside Superset. Right? And a good part of that contract, you know, can be established in Superset. But really often, what happens is people prepare these flat, tabular data structures that, you know, are the place from which you're gonna build your visualization.
So if that's pretty comprehensive and well made, and you have well documented fields and kind of simple data structures, like actions of users in time with really good dimensions and metrics, then it becomes a little bit harder to mess up the visualization from that layer onwards. Now, in terms of, like, doing the wrong type of visualization for the information you're trying to convey, there's a lot of pitfalls there. There's a lot of material out there. You can point to some of, like, Stephen Few's work in terms of, like, you know, what is the grammar of data visualization and how to use the right chart type to represent the right information. I think there's an opportunity for tools like Superset to educate in the process and suggest.
It's hard for software to understand intent in some ways, and to kinda capture things and tell the user that something is wrong. Currently, we assume that the users know. But I would like, I think, to have a little bit more guardrails and guidance in terms of, like, helping users achieve their visualization goals, whether it's to create a specific chart or a dashboard. We've been talking and thinking about, you know, wizard-type experiences. I know people don't really like wizards as a user experience construct, but something like that might help guide users to create the thing that they're actually trying to create.
It's challenging because we have different personas coming to tools like Superset with different sets of intentions and assumptions. So I think it's a challenging problem, but I think the software in the space is generally becoming better. Most importantly, I think that information workers and data professionals are becoming more data educated. Right? So you could almost, like, extend this question: you can use a spreadsheet to represent that information. Is it the role of Google Sheets or Microsoft Excel to make sure you do the right thing when you use a spreadsheet?
No. Not really. Right? They do very little in that regard. It's like, oh, this sum is actually not your assets and liabilities total on your balance sheet. Right? They don't know.
[00:38:07] Unknown:
So I think there's a balance to be struck there. Yeah. I definitely agree that there's the potential to maybe educate users in the software, but it's more important to just educate everyone more broadly about fluency in data and how to use it and how to represent it. And going back to your point too about the different locations within the data life cycle where things might go wrong, it's also worth the effort of ensuring that you have an accurate semantic representation of the data, so that you have, you know, metadata management that lets you understand where this information came from, how it was transformed, and what it actually means in context, rather than just treating it as a scalar or, you know, taking it at face value. Right. I know recently Airbnb contributed some metadata management type things to Superset. 1 is, like, certified metrics and datasets.
[00:38:55] Unknown:
So people might have different definitions of what a certified dataset might be, but at least there's a way in Superset to kind of flag something, for administrators or people that have access to do that, or through an API. Right? Like, through the REST API, you can go and say these, like, 30 datasets have been delivered by the data engineering team and are therefore, you know, certified and can be used in a trusted fashion, where, you know, things like SLAs or just raw, you know, pure data quality assertions are made to ensure that the dataset is reliable.
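As a sketch of what that API-driven certification flow might look like, the snippet below builds the certification metadata that gets stored on a dataset. The key names (`certification`, `certified_by`, `details`) follow Superset's certification metadata convention as I understand it, and the endpoint path in the comment is an assumption; verify both against the API docs for your Superset version.

```python
import json


def certification_extra(certified_by, details=""):
    """Build the `extra` JSON blob that marks a dataset as certified.

    Returned as a JSON string because the dataset's `extra` field
    stores serialized JSON.
    """
    return json.dumps(
        {"certification": {"certified_by": certified_by, "details": details}}
    )


# A pipeline could then PUT this to the dataset endpoint (assumed:
# /api/v1/dataset/<id>) for each dataset the data engineering team
# has delivered, e.g.:
#   {"extra": certification_extra("Data Engineering", "Passes dbt tests")}
```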
[00:39:27] Unknown:
So in terms of your experience of building and using and working with end users of Superset, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:39:38] Unknown:
1 answer here is, you know, of course, Airbnb is where Superset was born. Right? And they have gone through, I would say, multiple cycles with their data team at this point, where it's been 5 years and there's been just, like, awesome data talent there and a large data team building all sorts of cool stuff internally at Airbnb. So they really pushed the limit in terms of customizing Superset in intricate ways. I would direct the listeners towards a recent Airbnb blog post around how they use Superset and how they've changed it and mutated it to support some of their internal use cases. I think that's a really inspiring post, showing that if you want to take ownership over an open source solution and, you know, build in and around it, it can be done and done well.
In terms of the more common use cases, there are things around the REST API: building charts and dashboards dynamically, refreshing the cache as part of data pipelines. So, say, if you have a user table that you process every day through batch ETL and something like Airflow, you can add a step at the end that says, I'm gonna call Superset and tell it to flush the cache and rewarm the cache for all of the charts and dashboards that use this dataset. So that makes it such that the data is always fresh and the cache hit rate is super high. Right? So you're not hitting the analytics database as much as a result.
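A final pipeline task along those lines might look like the sketch below. The host, the bearer-token handling, and the exact endpoint (`PUT /api/v1/chart/warm_up_cache`, which exists in recent Superset versions) are assumptions to check against your deployment's API documentation.

```python
import json
import urllib.request

SUPERSET_URL = "https://superset.example.com"  # hypothetical host


def build_warm_up_calls(dashboard_id, chart_ids):
    """One request body per chart on the dashboard.

    The payload shape mirrors Superset's chart warm-up endpoint as I
    understand it -- verify against your version's API docs.
    """
    return [
        {"chart_id": cid, "dashboard_id": dashboard_id} for cid in chart_ids
    ]


def warm_up_cache(token, dashboard_id, chart_ids):
    """Rewarm the cache for every chart on a dashboard after an ETL run."""
    for body in build_warm_up_calls(dashboard_id, chart_ids):
        req = urllib.request.Request(
            f"{SUPERSET_URL}/api/v1/chart/warm_up_cache",
            data=json.dumps(body).encode(),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            method="PUT",
        )
        urllib.request.urlopen(req)  # network call -- hits the live API
```

Dropped in as the last task of an Airflow DAG, this is what keeps the cache hit rate high right after the underlying table is rebuilt.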
I think the area that I'm most excited about is the visualization plugin ecosystem, right? To have more people contribute more plugins over time. And we're working to make it easier for people to go into kind of an app store. Not that we want to sell plugins, but just to discover the ecosystem of plugins. Right? And have assurances around a set of certified plugins. So core plugins that ship with Superset, certified plugins that you can install and use and play with that we have made sure come from, kind of, certified sources, and then probably an easy way for people to go and try the long tail of other experimental plugins that people in the community are building.
You know, we've seen kind of a renaissance of data visualization with D3, right, over the past, like, decade or so, and there's been just awesome, creative, beautiful, inspiring work in the data visualization space. And we'd love for Superset to be able to showcase that in an environment where you can compose them into dashboards and play with them. It already knows how to connect to your data and can show you all the knobs and buttons of said visualization. So that's 1 of the interesting areas for me, you know, and the potential for people to contribute and extend the power of Superset.
[00:42:22] Unknown:
For people who are looking at Superset and trying to figure out what to use for their dashboarding and data exploration and business intelligence platform, what are the cases where Superset is the wrong choice? If what you want is kind of a quick solution
[00:42:37] Unknown:
that you can just kinda, you know, install and use at scale, I think, you know, the cost of ownership of open source and the time to value, if you're just gonna do it on your own, is prohibitive in some ways. So this is where companies like Preset come in, where if all you want is a hosted solution that just works, and maybe you care about the no-lock-in guarantee that comes from open source, you can kinda get the best of both worlds by going with a commercial open source company like Preset. Right? So we guarantee that people can move in and out of the software, kinda import and export their assets, if they elect to run the software on their own in the future. So we don't fork, and we just provide the best experience. You know, we provide value. It's just that you can get up and running very quickly.
Trying to think if there are other shortcomings. So currently, Superset does not have a notebook-type layer. You know, we're assuming that if you have people that wanna use notebooks, they do so, you know, on their laptops or on a hosted notebook solution that you might have inside your organization. So we're considering, like, adding this layer in some way to Superset or Preset in the future.
[00:43:51] Unknown:
Are there any other aspects of the Superset project or your work at Preset or your overall experience of working in the data ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:44:07] Unknown:
I would say, in terms of, like, database connectivity and data munging and processing with libraries like Pandas and everything around this ecosystem. And I sure thought that we were gonna see similar things develop in the Node ecosystem faster, or in other languages. And so far, I think what we see is that Python is kinda the home of data, data processing, data connectivity, you know, ML in a lot of ways. So it's really interesting to see this dominance kinda assert itself over time, maybe even over JVM based languages, and for that to stay true over time. So I'm not sure how that's gonna evolve and where we're going, but right now, I think the data world is still, like, very much finding a home in Python.
[00:44:50] Unknown:
Yeah. I definitely agree with that. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the tool SOPS that I started using recently. It's a tool from Mozilla for being able to easily store encrypted values in your source repository, so things like configuration, database passwords, etcetera, and it has a very useful and easy workflow. So I've been enjoying experimenting with that for some of my personal projects. And with that, I'll pass it to you, Max. Do you have any picks this week? I stumbled upon this documentary on Frank Zappa that I thought was really interesting. So Frank Zappa is an
[00:45:30] Unknown:
American composer, but also just kind of a rock icon, amazing guitar player, and just, like, a really creative, inspiring artist. So if people are interested in rock and roll and weirdness, I think that's an interesting thing to watch. Another pick is a book, a little bit more serious and related to some of the things that are more top of mind professionally. The full title is Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. That's just been a great read for me, thinking a lot about, like, software development velocity, thinking about DevOps, thinking about, like, releasing high quality software with confidence. That's, like, super top of mind. I had some good, like, validation of intuition reading through this book. So highly recommended for
[00:46:21] Unknown:
people who are interested in the software development life cycle. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Superset. It's definitely a very interesting and useful tool, and I'm excited to see all the progress that has happened over the past few years and some of the acceleration that's happened recently. So I definitely look forward to digging into it and starting to use it for some of my own work at my day job. And thank you again for all of that, and I hope you have a good rest of your day. Thank you very much. It's a pleasure to be on the show. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Welcome
Maxime Beauchemin's Background and Projects
Introduction to Python and Django
Transition from Django to Flask
Overview of Superset
Origins of Superset
Business Intelligence Landscape
Superset's Architecture and Evolution
Python's Role in Superset
Configuration and Extensibility
Data Visualization Best Practices
Use Cases and Innovations
When Not to Use Superset
Future of Data and Python