Summary
Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don’t all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Your host as usual is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration and visualization
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of what Superset is and what it might be used for?
- What problem were you trying to solve when you created it?
- What tools or platforms did you consider before deciding to build something new?
- There are a few different ways that someone might categorize Superset, such as business intelligence, data exploration, dashboarding, data visualization. How would you characterize it and how it fits in the current state of the industry and ecosystem?
- What are some of the lessons that you have learned from your work on Airflow that you applied to Superset?
- Can you give an overview of how Superset is implemented?
- How have the goals, design and architecture evolved since you first began working on it?
- Given its origin as a hackathon project the choice of Python seems natural. What are some of the challenges that choice has posed over the life of the project?
- If you were to start the whole project over today what might you do differently?
- Can you describe what’s involved in getting started with a new setup of Superset?
- What are the available interfaces and integration points for someone who wants to extend it or add new functionality?
- What are some of the most often overlooked, misunderstood, or underused capabilities of Superset?
- One of the perennial challenges with a tool that allows users to build data visualizations is the potential to build dashboards or charts that are visually appealing but ultimately meaningless or wrong. How much guidance does Superset provide in helping to select a useful representation of the data?
- In addition to being the original author and a project maintainer you have also started a company to offer Superset as a service. What are your goals with that business and what is the opportunity that it provides?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Superset used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Superset project and community?
- When is Superset the wrong choice?
- What do you have planned for the future of Superset and Preset?
Keep In Touch
- @mistercrunch on Twitter
- mistercrunch on GitHub
Picks
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Superset
- Preset
- Airflow
- AirBnB
- Lyft
- Django
- Flask
- CRUD == Create, Read, Update, Delete
- Business Intelligence
- Apache Druid
- Presto
- Trino (formerly known as Presto SQL)
- Redash
- Looker
- Metabase
- Flask App Builder
- React Redux
- Typescript
- GraphQL
- Celery
- Redis
- RabbitMQ
- S3
- AirBnB Superset Blog Post
- D3
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'd like to welcome back Max Beauchemin to talk about Superset, an open source platform for data exploration and visualization.
[00:01:05] Unknown:
So Max, can you start by introducing yourself? Well, first, thank you for having me on the show again. It's a pleasure to be here. Quick intro for me. So I'm Max. I'm the original creator of Apache Airflow and Apache Superset. I started these projects while I was at Airbnb about 5 years ago or so. Since then, I spent some time at Airbnb and at Lyft, and more recently, I started a company. So now I'm the CEO and founder of a company called Preset. So we're at preset.io. We commercialize Superset in some way. So we contribute heavily to Apache Superset, and we participate in making the open source project great.
[00:01:41] Unknown:
And then we also offer a kinda hassle free hosted experience around it, with bells and whistles and kinda crust around this open core. For people who didn't listen to the previous interview that I did with you where we talked about Airflow, can you remind us how you got introduced to Python?
[00:01:57] Unknown:
Yes. So I was thinking about that this morning, trying to recollect, like, when I was first introduced to Python, but I think that was around, like, 2006 or 2007, so pretty early on, looking at Django. So as a software engineer, I've been building websites here and there, you know, since the beginning of the Internet. And looking at Django, I got pretty excited by the admin functionality and just, like, the breadth of the framework. I thought it was super powerful. So I remember at the time, I started listening to a lot of Django related podcasts. There was a little bit of a golden era around that time around Python and Django, and I thought it was super exciting to be able to build a website kinda quickly, taking advantage of the sheer breadth of the framework itself, and more specifically, the admin functionality.
[00:02:44] Unknown:
Your history with Django is interesting, given that both the web UI for Airflow and Superset itself are actually based on Flask. So maybe we can dig a bit more into your migration from Django to Flask.
[00:02:56] Unknown:
It's probably, like, the interest of, you know, playing with new toys, I think. Right? And so I guess that's kind of a trend in software too. So while Django, I think, was a super rational choice, it's interesting that at the beginning of a project, you're like, oh, what can I play with? Right? What are the libraries I wanna use? And sometimes it's, like, semi rational. I think Flask is a great choice. I was interested to look into, at that time, there was Flask Admin, Flask App Builder, and other kinda CRUD systems. And that was really my main selection criteria. I wanted to be able to not have to build essentially these CRUD web forms. For the audience, CRUD stands for create, read, update, delete. Right? As you define an application like, you know, Airflow or Superset, you're gonna have essentially this entity relationship type diagram idea, right, where, say, for Superset, we have, you know, charts and dashboards and users and roles.
And, you know, these web frameworks are great in the sense that they kind of auto generate a lot of these web forms that are really painful to write from scratch if you're gonna do it. So that really helped me, in both these projects, to get to an MVP kinda quickly by not having to build any of these web forms. And so digging a bit more into Superset, can you give a bit more of a background about what it is and some of the ways that it's being used? As much as I'm not a big fan of the term business intelligence, it just feels like a dated kind of term. Right? The business intelligence industry, you know, is probably more than 20 years old at this point. But I think it describes fairly well what Superset does, in terms of, you know, Superset is a place where people can connect to their data, explore data, put together dashboards, and share them. Right? So it's really a tool for organizations to serve all their data consumption, exploration, and dashboarding needs.
Superset also has a SQL IDE built in. So now we see the rise of, you know, data analysts, data scientists, engineers, people who just know SQL. SQL has very much affirmed itself as the lingua franca of data too. So we also offer this powerful SQL IDE. So it's very much a holistic solution for your team, for your organization
[00:05:11] Unknown:
to really dig in and go and understand their data and collaborate with data. When you first created it, I know that it came about as part of a hackathon, but what was the problem that you were trying to solve at the time? And what are some of the tools or platforms that you had worked with before
[00:05:28] Unknown:
that led you to the decision that there was a need or an opportunity for something new to fulfill this particular use case? So thinking about, like, my journey in business intelligence, I started my career in the early 2000s. And very quickly, I became a business intelligence engineer, a data architect. Right? So I did a fair amount of ETL, and just, you know, using these tools and making them available at the companies where I worked for people internally to self serve with the data that we were kinda preparing in the backroom for them. So I used extensively things like Microsoft SQL Server, Analysis Services, you know, Excel, Business Objects, MicroStrategy, and all sorts of other packages and solutions over the years.
These solutions were kinda as enterprise-y and as kinda big desktop type applications as they come. So looking at the context in which Superset was created more directly, picture, you know, I think it's summer 2015, a 3 day hackathon at Airbnb. The premise there originally, the scope, was much smaller. The idea was not to build a business intelligence solution over 3 days. That would have been kinda crazy. Maybe not completely unrealistic now looking back, but the prompt was, we were at the time POC-ing an Apache Druid cluster. So Apache Druid being an in memory real time database, and that's fairly early in the history of Druid. Right? At the time, there was no web UI in front of Druid, and I thought it was just this really interesting database because it was so fast and so fresh in a sense, like, you know, real time data, very low latency, really fast in memory type queries.
So I was just excited to go and build a data exploration front end for Druid. So over that period of 3 days, you know, I was able to come up with a certain number of visualizations and kind of a very simple explorer where you can pick your metrics, pick your dimensions, and apply some filters. So pretty basic. Druid did not have a SQL interface. It had this proprietary kinda API to it, and there was no web UI. So I thought it was new and innovative. I was also trying to recreate, I was out of Facebook at the time, and Facebook had this really interesting set of internal tools. Right? Some of these tools were part of the inspiration for Airflow and for Superset. And more specifically for Superset, internally at Facebook, there was something called Scuba that everybody loved and used super extensively. It's probably the secret behind some of the successes at Facebook. Right? Like, Facebook being so data driven, and I would say agile with data, was in part because of this awesome system called Scuba that was very similar to Druid in terms of the back end, and the front end that I built on top of Druid at the time was inspired by the Scuba front end in some ways too. Right? So a mix of inspiration from my experience in business intelligence and, you know, my experience using tools like Scuba at Facebook. So that was very much the prompt. At the time too at Airbnb, we were investing heavily in a Presto and Hive cluster, so, like, a big kind of data lake, a big HDFS cluster, a big investment in Hive and Presto.
The tools that we had at Airbnb at the time were mostly just, like, a limited Tableau license, and Tableau did not really work well with Presto at the time, so there was an opportunity to extend Superset very quickly. You know, it worked with Druid in particular, and I extended it to work with SQL in general, and that enabled easy direct access to visualization against Presto without having to request a license or install a desktop application or these sorts of things. Right? So that made for kind of easy access. And a lot of people wanted more access to data internally at Airbnb at the time, and I think building these tools could just kinda enable people to get quick access without having to worry too much and just go straight to it. You know, so it grew from that point. We open sourced it fairly quickly after this, and the rest is history.
[00:09:45] Unknown:
So you mentioned a bit about your background with business intelligence and some of the ways that Superset is used for these types of workflows of being able to build a dashboard and present information to end users to be able to understand some of the business metrics and be able to dig in and explore it a little bit. And business intelligence as a product category is fairly old at this point in terms of, you know, the overall scale of computing. It's been around since at least the nineties and has gone through a number of different iterations and generational shifts. And there are a number of other tools out there right now that have their own particular focus, thinking in terms of things like Looker or Metabase and Redash. And I'm wondering if you can just give your impression about where Superset fits in the overall matrix of business intelligence and data exploration tools and data visualization and dashboarding tools, and sort of how the broader ecosystem looks, and sort of why somebody might wanna choose Superset over the other options. You know, digging into business intelligence as a product category,
[00:10:48] Unknown:
and I think it's very unique in the sense that BI has always been, and still is today in many ways, kinda everything for everyone. Right? So it's like all of your data needs are served by this BI platform and solution. I think that's driven by the fact that the buyers in the space and the people in general just want a single solution for their organization where they can take care of all of their data needs. There's also, like, a growing aspect of data work, which is, like, making it collaborative. Right? So you'll have, like, a lot of different tools. We talk to companies that have, you know, 7 or 8 BI tools that they've accumulated over the years just to try to satisfy, you know, different needs and different personas. But I think there's still very much this will of having one tool that everyone in the organization can use and collaborate on. Right? Really often, you'll have certain personas that are more kind of content creators that wanna make data available to more business users or people that are more operational. So I think there's that need of a single platform that does it all, and I think it drives some of the way that the products are built in the space.
There's much more, I think, in your question to dig into. So there's one question, which is, where does Superset fit into this? So, personally, I'm passionate about the idea of, like, catering to a wide spectrum of levels of sophistication with data. Right? So to have a platform where you have a SQL IDE, you have the slice and dice UI, you have the dashboards. So depending on your use case and how deep you wanna go, you'll find the right level. In some cases, you want kind of an easy, intuitive, just easy to use type interface, but sometimes you need more power, and you go into a SQL IDE and, presumably, a level deeper into, say, a notebook, where you might wanna do things that are a lot less intuitive or less accessible to people in general, but much more powerful, in the sense that if you need to do a simple model or do some forecasting or do some custom visualization work because you're trying to get to something very kinda uncanned, right, something that's not cookie cutter, it's good to be able to go up and down this ladder of levels of complexity.
So I think Superset does that well. Now in terms of positioning, where does Superset fit in this big kind of competitive market? So one thing that's for sure is, like, Superset's cloud native. Right? So we were born in the Kubernetes and Docker era, which is very different from some of the old school vendors that are still kinda tangled in Windows Server and desktop applications. So that's a clear thing from day one that we've had. I think the things that are most interesting in Superset are the extensibility and integratability, like the way that Superset is integrated with all databases. And the fact that it's extensible, because we have this open source story, it becomes very natural for us to have a lot of points of extensibility.
So to dig into that, just kinda listing out some of the things that come with that: first, it's like a very awesome REST API. Right? So everything you can do in Superset as a user by clicking around, or most everything, you can do through a REST API. You can automate, so you can do things programmatically. Right? So for the people who know Airflow too, that's kind of the premise there too, to be able to programmatically do the things that users do. The sky is the limit in terms of automation. If you wanna automate workflows and things in and around Superset, you know, you can do that with the REST API.
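To make the automation idea concrete, here is a small sketch built on the standard library only. The `/api/v1` endpoint paths and payload shape follow Superset's REST API conventions, but the base URL and credentials are assumptions, and the exact routes should be verified against the OpenAPI docs for the version you run.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8088"  # assumed local Superset instance

def build_login_request(username: str, password: str) -> request.Request:
    """Build (but don't send) the JSON login call that returns a JWT."""
    payload = json.dumps({
        "username": username,
        "password": password,
        "provider": "db",
        "refresh": True,
    }).encode()
    return request.Request(
        f"{BASE_URL}/api/v1/security/login",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def build_list_charts_request(access_token: str) -> request.Request:
    """Build the authenticated call that lists saved charts."""
    return request.Request(
        f"{BASE_URL}/api/v1/chart/",
        headers={"Authorization": f"Bearer {access_token}"},
    )

# Actually sending it would look like:
#   resp = request.urlopen(build_login_request("admin", "admin"))
#   token = json.load(resp)["access_token"]
```

Splitting the request construction from the sending keeps the sketch testable without a live server; in a real script you would feed the token from the login response into the subsequent calls.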
We also have, like, a really good plugin framework. So on the Preset blog, we have some pretty good posts around how to get started in writing a new visualization plugin for Superset. So I think we're starting to see kind of a plugin ecosystem grow, and that's something we really wanna invest in. Right? So that we can have people that are maybe from different fields, like, someone from genomics or from microbiology doing, like, very specific data visualization for DNA or genomes and things like that, and contributing that back to the community. We're very excited about that.
Talking a little bit more about the open source angle. So for people who want a lot of control over how they run their software and how they scale it, places like Airbnb, Dropbox, Netflix, Tesla, right, that have very large Superset clusters, and they want to control that experience internally. You know, they can do that because of the flexibility of open source. One part is, like, say, the security layer. In Superset, there's an abstraction there called the security manager, where you can really define, like, how you do things like authentication and authorization, who has access to what in terms of, like, data access policy, or in terms of what they can do, in terms of permissions in the application.
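As a sketch of that security manager extension point: in `superset_config.py` you can point Superset at a subclass of `SupersetSecurityManager`. The class and the `CUSTOM_SECURITY_MANAGER` setting exist in Superset, and `has_access` comes from the underlying Flask App Builder security manager, but the override below is purely illustrative and `external_policy_denies` is a hypothetical helper, not a real API.

```python
# superset_config.py fragment (sketch): swap in a custom security manager.
from superset.security import SupersetSecurityManager

class CustomSecurityManager(SupersetSecurityManager):
    # Hypothetical override: consult an external policy service before
    # falling back to Superset's built-in role/permission checks.
    def has_access(self, permission_name, view_name):
        if external_policy_denies(view_name):  # hypothetical helper
            return False
        return super().has_access(permission_name, view_name)

CUSTOM_SECURITY_MANAGER = CustomSecurityManager
```

Because this is a configuration fragment that Superset loads at startup, there is nothing to run standalone; the point is that authentication and authorization logic lives in code you fully control.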
So lots of power there that comes from the fact that it's programmatic, extensible, was born on the cloud. Right? So you can really define how you wanna run that software in your organization. And if you're not interested in that as much, you can have the benefit of having a commercial open source vendor. If you want someone else to run the software for you, you know, there are companies like Preset that can just kinda roll it out and have, like, a hassle free, get started in 2 minutes experience too, if you prefer that. And you could always go to the no lock in kinda running it on your own down the line if you decide so. That's one thing, you know, that's inherent to open source, talking about the no lock in part of open source. I think people are fed up with these big vendor contracts, right, that are looking to land and expand, and, you know, they're kinda extracting as much money as they can from your organization. So with open source and commercial open source vendors, it kinda keeps all of us honest in terms of providing value, you know, and people, like, keeping their freedom in terms of how they wanna run and operate that software.
[00:17:02] Unknown:
Yeah. The open source and extensibility aspects are some of the things that I personally find interesting about Superset. And digging a bit more into some of the architectural aspects, you were mentioning some of the plugin interface and the extensibility of the access and authorization layer. I'm wondering if you can just give a bit more of a broad view about how Superset is actually implemented and some of the ways that the overall goals and design and architecture of the system have evolved since that first day that you began it at the hackathon.
[00:17:31] Unknown:
Now I'm gonna start digging into the archaeology, right, of the software and the different layers, you know, over the past 5 years. One thing that's really interesting to note is the fact that Superset was born a Python project. I know we're on a Python podcast, but over time, it has evolved to become much more of a JavaScript slash TypeScript React project. You know, we're building user experiences, and I think very naturally, we tend to evolve in the direction of adding just a really clear, solid, like, Python API in the back end, and for everything else to become more and more of just a single page app written in JavaScript. It's still very much served by a Python layer, by Flask currently. So Flask is the web framework.
So from day 0, I was talking about my interest in, like, kind of trying new things when I start projects. Right? So we use something called Flask App Builder. So the Flask ecosystem is the opposite of, like, a monolithic framework. Right? It's a decoupled, lightweight framework where you'll have all sorts of, like, small packages like Flask-Login. I forget all of the different, like, Flask extensions that there are, but there's a huge network of extensions in the Flask world. Flask App Builder takes an opinionated approach to this and brings back a collection of these plugins into a more, call it, recomposed framework. Right? So it's like a monolith assembled from its pieces, with more assumptions and therefore more guarantees, right, that come with a stronger set of assumptions.
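As a concrete illustration of the CRUD scaffolding being described, here is a minimal, hypothetical Flask App Builder sketch. The `Dashboard` model is illustrative rather than Superset's actual model, and exact FAB APIs may differ by version.

```python
# Hypothetical sketch: with Flask App Builder, defining a model plus a
# ModelView is enough for the framework to auto-generate the list/add/
# edit/delete web forms that would otherwise be painful to hand-write.
from flask_appbuilder import Model, ModelView
from flask_appbuilder.models.sqla.interface import SQLAInterface
from sqlalchemy import Column, Integer, String

class Dashboard(Model):  # illustrative entity, not Superset's real model
    id = Column(Integer, primary_key=True)
    title = Column(String(250), nullable=False)

class DashboardModelView(ModelView):
    datamodel = SQLAInterface(Dashboard)
    list_columns = ["title"]

# Registered on an app's AppBuilder instance, e.g.:
# appbuilder.add_view(DashboardModelView, "Dashboards", category="Manage")
```

The auto-generated forms are exactly the "entity relationship" views mentioned earlier: one model and one view class per entity, and the framework handles the rest.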
But, yeah, so Flask App Builder over time allowed us to get these CRUD forms very quickly and get started quickly with some MVC code. And I think we've been very much, like, removing a lot of that scaffolding over time. So it's a good approach to, like, software development. Right? Like, you probably wanna get to an MVP fairly quickly, prove some value, and over time, remove the training wheels and the guardrails and kind of all that scaffold that we needed at the beginning. So that's part of the evolution of the software. Within the front end, things have been moving extremely fast. I mean, people know, it's almost a gag at this point, you know, how fast the front end ecosystem is evolving. Right? So it's kind of a moving target.
But I would say things that we've settled on early on are things like React Redux. We're moving more towards functional components in React, using more hooks as opposed to Redux. We're also moving towards TypeScript. We also moved from Bootstrap, kind of a component framework that's very well known, into something called Ant Design over the past year or 2. So I think on the front end side of things, things are moving extremely fast, and we're staying kind of fresh on top without going crazy and going into, like, the latest, you know, like, today's fashion. So we're probably, like, adapting to kind of this year's trend in fashion on the front end side of things.
It's been interesting to try to keep up, right, with things like testing libraries, you know, going in and out of fashion. And, you know, we went from Selenium to Cypress, and from things like Enzyme to React Testing Library, I think, is the latest one we're moving to. So really very much, like, keeping up with the pace of evolution on the front end
[00:20:52] Unknown:
there. You mentioned the kind of evolution of it being a primarily Python app to now being an application that is Python on the back end, but very heavily JavaScript for the actual main user experience, but that there are these APIs for being able to populate and automate a lot of the user experience for building out these dashboards and being able to do things like propagate changes from a QA to a production environment. I'm wondering if you can just talk through a bit more about how you feel now about Python as the kind of core building block of the application, and if you were to start over today, what are some of the things that you might do differently, either in terms of language choice or system architecture?
[00:21:34] Unknown:
There's 2 main things I would say that make us kinda reevaluate Python over time. One is the GraphQL kind of prominence over time. Right? So I think, like, the GraphQL story in Python is a little bit lagging. It's not as great as in the Node ecosystem. And then everything around sockets and WebSockets. So I think if you're gonna pick a solution that has GraphQL and WebSockets as a premise, you know, I don't think Python's a great choice nowadays. It's feasible. It's doable, and maybe I'm not in tune with the latest, like, maybe it's actually a great choice nowadays, but it seems like from that perspective, you know, I would pick a Node app for some of this. We're looking to use some WebSockets as we build more collaborative type features inside Superset.
There are some places where WebSockets are just much better than using polling. So we're looking to bring some optional components inside the Superset architecture that would run on Node, for instance. So if you do want certain features, or for some features to run better at scale, you would have to kinda enable an extra service, a Node service, that's optional. So, for people who know Airflow and Superset, you know, it was always important to me to allow people to just kinda pip install Airflow, pip install Superset, to get into a demo environment where they can at least, like, start doing some work. This demo environment might not, you know, scale to thousands of users, but at least it allows people to get started quickly. And we know that time to value in software everywhere, probably, you know, just in the human experience, is really important.
Right? So we really care about time to value. Talking about that, you know, and getting into, you mentioned the word architecture somewhere in your question, so I'll drill into that a tiny bit. To run a Superset cluster at scale, right, is a fairly hard thing, because when you start digging into the architecture, you know, sure, you can pip install Superset and go navigate to, you know, localhost port 8000, you know, and then click around and try things. But if you're trying to serve a lot of users, then we have all of these architectural components that you can switch on. Right? So things like an async back end, where we have async workers using the Celery framework in Python, which I'm sure some people in your audience are familiar with. So it's not required for you to have Celery to run Superset. But if you run Superset at scale, you should certainly have the async service switched on, which requires a message queue, probably something like Redis or Rabbit.
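A sketch of switching on that async backend: in `superset_config.py` you provide a Celery configuration class. The pattern below follows Superset's documented shape and Celery's own config keys, with Redis assumed as both broker and result backend; the URLs and tuning values are placeholders for a local setup.

```python
# superset_config.py fragment (sketch): enable the Celery async backend.
class CeleryConfig:
    broker_url = "redis://localhost:6379/0"      # message queue (assumed)
    imports = ("superset.sql_lab",)              # task modules to load
    result_backend = "redis://localhost:6379/1"  # where results land
    worker_prefetch_multiplier = 1               # one task per worker fetch
    task_acks_late = False

CELERY_CONFIG = CeleryConfig
```

With this in place, a separate `celery worker` process picks up long-running queries instead of tying up the web workers, which is exactly the "switch it on when you need scale" incremental approach described here.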
If you also wanna provide a great experience, it's great to enable the different caching layers that we have in Superset. So then you're relying again, probably, on a Redis of some kind. You know, the approach to architecture is kinda incremental in some sense, right, where an organization might run a very basic install of Superset, and as their requirements grow, the cost of operating that software and the complexity of the architecture gets a little bit pricier. Right? So that's definitely part of the value proposition around the commercial open source company, where if you want to just try Superset and maybe not worry about running it at scale and being on call for it, then you can elect to just have this, like, you know, hosted experience where it just works.
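And a sketch of enabling one of those caching layers: `CACHE_CONFIG` in `superset_config.py` is handed to Flask-Caching, so the keys follow that library's conventions. The Redis host, port, and timeout below are assumptions for a local setup, and the valid `CACHE_TYPE` strings vary by Flask-Caching version.

```python
# superset_config.py fragment (sketch): a Redis-backed cache for chart data.
CACHE_CONFIG = {
    "CACHE_TYPE": "RedisCache",       # "redis" in older Flask-Caching
    "CACHE_DEFAULT_TIMEOUT": 300,     # seconds before entries expire
    "CACHE_KEY_PREFIX": "superset_",  # namespace keys in a shared Redis
    "CACHE_REDIS_HOST": "localhost",
    "CACHE_REDIS_PORT": 6379,
    "CACHE_REDIS_DB": 2,
}
```

This is the incremental pattern in miniature: the basic install works with no cache at all, and you bolt on Redis only once repeated dashboard queries start to hurt.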
[00:25:05] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14 day trial and make your life a lot easier. One of the interesting things about Superset in general, as I was looking through the documentation to explore using it for some of my own work in my day job, is that with a lot of pieces of software, they say, you know, download the software or put it on disk, you know, and then you have the binary, and then you just drop a text file, whether it's YAML or INI or TOML, and then you start it up. But with Superset, the config file is actually a Python module, which I found interesting. I'm wondering if you can maybe just talk a little bit about that and then go from there into some of the extensibility and interfaces that are available for people who want to build their own integrations or extend the capabilities of the software.
[00:26:23] Unknown:
Yeah. You know, there's an underlying, like, deeper philosophical question there too, you know, when you think about immutable configuration, or, you know, this trend of, like, should a pipeline be defined as YAML or as code. Right? And, you know, I know where I stand on this. I think, like, I really like infrastructure as code, configuration as code, you know, pipeline as code. And then, you know, I think on top of that, you can build a more kind of semantic, static approach to it with things like YAML files for things that don't need to be as dynamic.
I think for certain configuration things, you need these, like, much more complex hooks that need to have logic in them. Right? An example is we have a configuration hook that just, like, gives you a handle on the Flask app and allows you to mutate it in some way. So if you wanna tweak your headers, or if you want to just kinda mutate that Flask app object or extend it further through a configuration hook, you can do this. If we didn't have that, we would have to have YAML files that come with documentation that is, like, you know, a 100 pages long. Right? Every single option of configurability in the Flask app, we would have to expose as something static, and that just gets kind of cumbersome and hard to manage.
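To make the idea concrete, here is a minimal sketch of what a `superset_config.py` along these lines might look like. `FLASK_APP_MUTATOR` is the hook being described (Superset calls it with the live Flask app); the `X-Org-Policy` header is a made-up example of the kind of tweak that would be awkward to express in static YAML.

```python
# superset_config.py -- Superset imports this module at startup,
# so configuration is plain Python rather than a static text format.

ROW_LIMIT = 5000  # a simple static setting, for contrast


def FLASK_APP_MUTATOR(app):
    """Hook that Superset calls with the live Flask app object.

    Anything you can do to a Flask app -- register handlers, tweak
    headers, wire in extensions -- can be done here with real logic.
    """

    @app.after_request
    def add_policy_header(response):
        # Hypothetical example: stamp every response with a header.
        response.headers["X-Org-Policy"] = "internal-only"
        return response
```

Because the file is just Python, the hook can contain arbitrary logic, which is exactly the flexibility a static format can't give you without enumerating every option.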
There are a bunch of other instances of things where we have these really useful hooks, things that can kind of mutate the database connection, or the database query for that matter, you know, kind of just in time too. So you can start doing much more complex things. We also have some hooks around feature flags, where you can say, maybe sometimes you do have static feature flags, but sometimes you want to say, like, oh, I want 10% of my users to see this, or I want a certain set or group of people to have a feature flag on or off. And how would you define that in static configuration? It would be pretty tricky to do that. You would have to have very complex semantics in there. So I believe in, like, configuration as code in many instances. It's not always the best, but it's certainly convenient for maintainers to expose a lot of flexibility without getting tangled in the details.
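The percentage-rollout idea above can be sketched with a deterministic hash bucket. The hook name `GET_FEATURE_FLAGS_FUNC` and the explicit `username` parameter here are assumptions for illustration (in Superset the hook reads the current user from the request context), and the `NEW_EXPLORER` flag is made up.

```python
import hashlib


def in_rollout(username, feature, percent):
    """Deterministically bucket a user into a percentage rollout.

    Hashing (feature, user) gives each user a stable bucket in 0-99,
    so the same user always gets the same answer for a given flag.
    """
    digest = hashlib.sha256(f"{feature}:{username}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


# Sketch of a dynamic feature-flag hook (hypothetical signature):
# Superset would call this with the static flag dict and let the
# function mutate it per user.
def GET_FEATURE_FLAGS_FUNC(flags, username="anonymous"):
    flags = dict(flags)
    # Enable the (hypothetical) NEW_EXPLORER flag for ~10% of users.
    flags["NEW_EXPLORER"] = in_rollout(username, "NEW_EXPLORER", 10)
    return flags
```

This is the kind of logic that static YAML can't express: the result depends on who is asking.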
In terms of what's configurable, I don't even know at this point. There's a lot of configuration hooks. I think Airflow is similar too. You know, it's part of the burden of when you really take ownership over a piece of open source software: you kind of have to sort through all of these options, right, and kinda understand what your choices are. And sometimes it gets fairly intricate. Essentially, what is exposed to you, as someone who's ramping up on the software and wants to operate that software, is all of the options that everyone in the community ever wanted to perhaps configure differently. Right? Maybe someone at Netflix was, like, for us, we need to have this option. Right? We need to have this flexibility.
So, you know, we try, as maintainers, to put sensible defaults in place, but I think it's extremely challenging, because people want to operate the software in intricate ways. That's part of the value proposition of open source, and sometimes that turns into, you know, a certain amount of, I wouldn't call it headache, but a mountain of complexity that you're not really aware of. As you start running the software, 1 day you're probably like, oh, you know, I wonder if there's a way for me to do x. And then sure enough, there's a configuration flag that, you know, exists that does that, that may be documented or not. Right? So I think, you know, a lesson maybe for me and for open source maintainers is we should be better at this. Right?
So I think it's important to maybe have different operating modes and, you know, configuration sets that are more comprehensive and well documented, and really having sensible defaults there is super key. Yeah. I definitely think that your point of configuration as code is valuable because it does
[00:30:23] Unknown:
put the power in the lap of the end user without having to own all the complexity as the maintainer. Because as the maintainer, somebody says, oh, I want it to do this, and you're using a static configuration file. Well, then all of a sudden, you have to either say, no. That's never gonna happen, or, okay. Fine. I'll add this flag. And as with everything in open source, it's free as in puppy, where you say yes once, and then you have to own it for the rest of its life.
[00:30:49] Unknown:
Yeah. It's like when you're holding a hammer, everything looks like a nail too. So every time someone shows up and is like, hey, I need that hook, or I've got this special need, I'm like, yep, configuration hook. Let's just, you know but yeah. So there comes a burden with that. Yeah. And then as a consumer of the open source, if you see there are 5,000 flags that you can set and you maybe understand half of them,
[00:31:10] Unknown:
there's a lot of cognitive barrier to actually adopting that software, because you say, oh, well, now I have to understand this vast matrix of possibilities, and then you end up building a whole other piece of software on your side just to generate the config that you care about. So you have this kind of double ended pyramid, with complexity in the middle and, you know, simplicity on the edges, and it's easier to kind of flip that, where you have simplicity in the core, and then you expand on the potential complexity that the end user is able to adopt, without having this initial barrier of you have to understand these 5,000 toggles and all the different ways that they're going to interact with the software.
[00:31:47] Unknown:
Certainly, it is a challenge for people operating the software. 1 big inherent problem that comes with this too is the matrix that you mentioned. Right? All the potential combinations of flags being turned on and off become, you know, infinitely complex and impossible to test for. Right? You can build a very complex, big build matrix that tries to run tests with all the feature flags on and off, but you won't be able to test all the combinations of feature flags; it grows exponentially very quickly. That matrix becomes massive.
It certainly is a challenge. I was talking about configuration sets that are, like maybe there are a 1,000 flags that you can intricately combine. But in reality, when you think about the different operating modes of the software, it probably gets a lot more simple. You know, you wanna run, like, a really massive instance of Superset that serves 10,000 people, versus something that you just run on your laptop, versus a small team. So we could be better, I think, at creating these little kind of recipes or presets. So I think people need these sets of configuration that make sense together. We've seen that emerge in places like, you know, the Docker Helm chart kinda combo, or Kubernetes operators, where people will publish these different sets or, call it, opinionated operating modes for a given platform or piece of software. Yeah. Having a strong opinion loosely held is definitely 1 of the
[00:33:17] Unknown:
undervalued aspects of a good open source maintainer.
[00:33:20] Unknown:
Yeah. It certainly is a challenge too. I feel like, as maintainers, we drift with the wave of the community too. So we're trying to please people, and we don't always have time to come up with, you know, a strong opinion. So sometimes we'll drift for a little while, and then, you know, maybe we'll take a stance when things become a little too mushy, unclear, or fuzzy over time. But it certainly is an art to try to get all that feedback and those contributions into a place that somehow works for people.
[00:33:51] Unknown:
For a brief digression on the point of having opinions about things, data visualization as a problem domain is something that is often fraught with misunderstandings or the potential to misrepresent information. And so if you kind of don't have any guardrails on a tool and let somebody go nuts on building whatever visualization they want, then it can be easy to accidentally end up in a situation where you have useful data, but the way you try to represent it is either wrong or misleading or just confusing. And I'm curious if there are any pieces of guidance within the superset platform to help people to build meaningful representations of the information that they're exploring.
[00:34:33] Unknown:
It extends to the whole, what I call the analytics process. Right? Like, from instrumentation to data collection and, you know, cleansing, you know, data modeling, data engineering, kinda in general, all the way to visualization. So there's a lot of things that can go wrong in the analytics process for data to be, you know, not trustworthy. And it starts with instrumentation. You can get your instrumentation wrong. You can get your data collection, kind of bringing that into the warehouse, wrong too. You can get your modeling and cleansing and, you know, kind of ETL process wrong as well. And then, you know, we're talking about the layer of, like, can you misrepresent the data?
So 1 thing that's important and good, I think, is to start with the premise that you have a table that has a set of dimensions and metrics that is, like, somewhat trustworthy. Right? So for us in Superset, it's hard to know, on that layer, everything that happened before up until that point. So I would say the contract between, say, the backend or the data warehouse and the visualization tool, in this case Superset, is these data structures that are mostly tabular in general. You can also do some data transformation inside Superset. Right? And a good part of that contract, you know, can be established in Superset. But really often, what happens is people prepare these flat, tabular data structures that, you know, are the place from which you're gonna build your visualization.
So if that's pretty comprehensive and well made, and you have well documented fields and kind of simple data structures, like actions of users in time with really good dimensions and metrics, then it becomes a little bit harder to mess up the visualization from that layer onwards. Now, in terms of, like, doing the wrong type of visualization for the information you're trying to convey, there's a lot of pitfalls there. There's a lot of material out there. You can point to some of, like, Stephen Few's work in terms of, like, you know, what is the grammar of data visualization and how to use the right chart type to represent the right information. I think there's an opportunity for tools like Superset to educate in the process and suggest.
It's hard for software to understand intent in some ways, and to kinda capture things and tell the user that something is wrong. Currently, we assume that the users know. But I would like, I think, to have a little bit more guardrails and guidance in terms of, like, helping users achieve their visualization goals, whether it's to create a specific chart or a dashboard. We've been talking and thinking about, you know, wizard-type experiences. I know people don't really like wizards as a user experience construct, but something like that might help guide users to create the thing that they're actually trying to create.
It's challenging because we have different personas coming to tools like Superset with different sets of intentions and assumptions. So I think it's a challenging problem, but I think the software in the space is generally becoming better. Most importantly, I think that information workers and data professionals are becoming more data educated. Right? So you could almost, like, extend this question: you can use a spreadsheet to represent that information. Is it the role of Google Sheets or Microsoft Excel to make sure you do the right thing when you use a spreadsheet?
No. Not really. Right? They do very little in that regard. It's like, oh, this sum is actually not your assets and liabilities total on your balance sheet. Right? They don't know.
[00:38:07] Unknown:
So I think there's a balance to be struck there. Yeah. I definitely agree that there's the potential to maybe educate users in the software, but it's more important to just educate everyone more broadly about fluency in data and how to use it and how to represent it. And going back to your point too about the different locations within the data life cycle where things might go wrong, it's also worth the effort of ensuring that you have an accurate semantic representation of the data, so that you have, you know, metadata management that lets you understand where this information came from, how it was transformed, and what it actually means in context, rather than just treating it as a scalar or, you know, taking it at face value. Right. I know recently Airbnb contributed some metadata management type things to Superset. 1 is, like, certified metrics and datasets.
[00:38:55] Unknown:
So people might have different definitions of what a certified dataset might be, but at least there's a way in Superset to kind of flag something, for administrators or people that have access to do that, or through an API. Right? Like, through the REST API, you can go and say these, like, 30 datasets have been delivered by the data engineering team and are therefore, you know, certified and can be used in a trusted fashion, where, you know, things like SLAs or just raw, you know, pure data quality assertions are made to ensure that the dataset is reliable.
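As a sketch of what that API-driven certification flow might look like, the snippet below builds the certification metadata that gets stored on a dataset. The key names (`certification`, `certified_by`, `details`) follow Superset's certification metadata convention as I understand it, and the endpoint path in the comment is an assumption; verify both against the API docs for your Superset version.

```python
import json


def certification_extra(certified_by, details=""):
    """Build the `extra` JSON blob that marks a dataset as certified.

    Returned as a JSON string because the dataset's `extra` field
    stores serialized JSON.
    """
    return json.dumps(
        {"certification": {"certified_by": certified_by, "details": details}}
    )


# A pipeline could then PUT this to the dataset endpoint (assumed:
# /api/v1/dataset/<id>) for each dataset the data engineering team
# has delivered, e.g.:
#   {"extra": certification_extra("Data Engineering", "Passes dbt tests")}
```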
[00:39:27] Unknown:
So in terms of your experience of building and using and working with end users of Superset, what are some of the most interesting or innovative or unexpected ways that you've seen it used?
[00:39:38] Unknown:
1 answer here is, you know, of course, Airbnb is where Superset was born. Right? And they have gone through, I would say, multiple cycles with their data team at this point, where it's been 5 years and there's been just, like, awesome data talent there and a large data team building all sorts of cool stuff internally at Airbnb. So they really pushed the limit in terms of customizing Superset in intricate ways. I would direct the listeners towards a recent Airbnb blog post around how they use Superset and how they've changed it and mutated it to support some of their internal use cases. I think that's a really inspiring post, showing that if you want to take ownership over an open source solution and, you know, build in and around it, it can be done and done well.
In terms of the more common use cases, there are things around the REST API: building charts and dashboards dynamically, refreshing the cache as part of data pipelines. So, say, if you have a user table that you process every day through batch ETL and something like Airflow, you can add a step at the end that says, I'm gonna call Superset and tell it to flush the cache and rewarm the cache for all of the charts and dashboards that use this dataset. So that makes it such that the data is always fresh and the cache hit rate is super high. Right? So you're not hitting the analytics database as much as a result.
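A final pipeline task along those lines might look like the sketch below. The host, the bearer-token handling, and the exact endpoint (`PUT /api/v1/chart/warm_up_cache`, which exists in recent Superset versions) are assumptions to check against your deployment's API documentation.

```python
import json
import urllib.request

SUPERSET_URL = "https://superset.example.com"  # hypothetical host


def build_warm_up_calls(dashboard_id, chart_ids):
    """One request body per chart on the dashboard.

    The payload shape mirrors Superset's chart warm-up endpoint as I
    understand it -- verify against your version's API docs.
    """
    return [
        {"chart_id": cid, "dashboard_id": dashboard_id} for cid in chart_ids
    ]


def warm_up_cache(token, dashboard_id, chart_ids):
    """Rewarm the cache for every chart on a dashboard after an ETL run."""
    for body in build_warm_up_calls(dashboard_id, chart_ids):
        req = urllib.request.Request(
            f"{SUPERSET_URL}/api/v1/chart/warm_up_cache",
            data=json.dumps(body).encode(),
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
            method="PUT",
        )
        urllib.request.urlopen(req)  # network call -- hits the live API
```

Dropped in as the last task of an Airflow DAG, this is what keeps the cache hit rate high right after the underlying table is rebuilt.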
I think the area that I'm most excited about is the visualization plugin ecosystem, right? To have more people contribute more plugins over time. And we're working to make it easier for people to go into kind of an app store. Not that we want to sell plugins, but just to discover the ecosystem of plugins. Right? And have assurances around a set of certified plugins. So core plugins that ship with Superset, certified plugins that you can install and use and play with that we have made sure come from, kind of, certified sources, and then probably an easy way for people to go and try the long tail of other experimental plugins that people in the community are building.
You know, we've seen kind of a renaissance of data visualization with D3, right, over the past, like, decade or so, and there's been just awesome, creative, beautiful, inspiring work in the data visualization space. And we'd love for Superset to be able to showcase that in an environment where you can compose them into dashboards and play with them. It already knows how to connect to your data and can show you all the knobs and buttons of said visualization. So that's 1 of the interesting areas for me, you know, and the potential for people to contribute and extend the power of Superset.
[00:42:22] Unknown:
For people who are looking at Superset and trying to figure out what to use for their dashboarding and data exploration and business intelligence platform, what are the cases where Superset is the wrong choice? If what you want is kind of a quick solution
[00:42:37] Unknown:
that you can just kinda, you know, install and use at scale, I think, you know, the cost of ownership of open source and the time to value, if you're just gonna do it on your own, is prohibitive in some ways. So this is where companies like Preset come in, where if all you want is a hosted solution that just works, and maybe you care about the no-lock-in guarantee that comes from open source, you can kinda get the best of both worlds by going with a commercial open source company like Preset. Right? So we guarantee that people can move in and out of the software, kinda import and export their assets, if they elect to run the software on their own in the future. So we don't fork, and we just provide the best experience. You know, we provide value. It's just that you can get up and running very quickly.
Trying to think if there are other shortcomings. So currently, Superset does not have a notebook-type layer. You know, we're assuming that if you have people that wanna use notebooks, they do so, you know, on their laptops or on a hosted notebook solution that you might have inside your organization. So we're considering, like, adding this layer in some way to Superset or Preset in the future.
[00:43:51] Unknown:
Are there any other aspects of the Superset project or your work at Preset or your overall experience of working in the data ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:44:07] Unknown:
I would say, in terms of, like, database connectivity and data munging and processing with libraries like Pandas and everything around this ecosystem. And I sure thought that we were gonna see similar things develop in the Node ecosystem faster, or in other languages. And so far, I think what we see is that Python is kinda the home of data, data processing, data connectivity, you know, ML in a lot of ways. So it's really interesting to see this dominance kinda assert itself over time, maybe even over JVM based languages, and for that to stay true over time. So I'm not sure how that's gonna evolve and where we're going, but right now, I think the data world is still, like, very much finding a home in Python.
[00:44:50] Unknown:
Yeah. I definitely agree with that. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the tool SOPS that I started using recently. It's a tool from Mozilla for being able to easily store encrypted values in your source repository, so things like configuration, database passwords, etcetera, and it has a very useful and easy workflow. So I've been enjoying experimenting with that for some of my personal projects. And with that, I'll pass it to you, Max. Do you have any picks this week? I stumbled upon this documentary on Frank Zappa that I thought was really interesting. So Frank Zappa is an
[00:45:30] Unknown:
American composer, but also just kind of a rock icon, amazing guitar player, and just, like, a really creative, inspiring artist. So if people are interested in rock and roll and weirdness, I think that's an interesting thing to watch. Another pick is a book, a little bit more serious and related to some of the things that are more top of mind professionally. The full title is Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. That's just been a great read for me, thinking a lot about, like, software development velocity, thinking about DevOps, thinking about, like, releasing high quality software with confidence. That's, like, super top of mind. I had some good, like, validation of intuition reading through this book. So highly recommended for
[00:46:21] Unknown:
people who are interested in the software development life cycle. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Superset. It's definitely a very interesting and useful tool, and I'm excited to see all the progress that has happened over the past few years and some of the acceleration that's happened recently. So I definitely look forward to digging into it and starting to use it for some of my own work at my day job. And thank you again for all of that, and I hope you have a good rest of your day. Thank you very much. It's a pleasure to be on the show. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Welcome
Maxime Beauchemin's Background and Projects
Introduction to Python and Django
Transition from Django to Flask
Overview of Superset
Origins of Superset
Business Intelligence Landscape
Superset's Architecture and Evolution
Python's Role in Superset
Configuration and Extensibility
Data Visualization Best Practices
Use Cases and Innovations
When Not to Use Superset
Future of Data and Python