Summary
When working with data it’s important to understand whether it is correct. If there is a time dimension, then it can be difficult to know when variation is normal. Anomaly detection is a useful tool to address these challenges, but a difficult one to do well. In this episode Smit Shah and Sayan Chakraborty share the work they have done on Luminaire to make anomaly detection easier to work with. They explain the complexities inherent to working with time series data, the strategies that they have incorporated into Luminaire, and how they are using it in their data pipelines to identify errors early. If you are working with any kind of time series then it’s worth giving Luminaire a look.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1:1 video calls, a tuition-back guarantee that means you don’t pay until you get a job, resume preparation, and interview assistance there’s no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show. Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level.
- Your host as usual is Tobias Macey and today I’m interviewing Smit Shah and Sayan Chakraborty about Luminaire, a machine learning based package for anomaly detection on timeseries data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Luminaire is and how the project got started?
- Where does the name come from?
- How does Luminaire compare to other frameworks for working with timeseries data such as Prophet?
- What are the main use cases that Luminaire is powering at Zillow?
- What are some of the complexities inherent to anomaly detection that are non-obvious at first glance?
- How are you addressing those challenges in Luminaire?
- Can you describe how Luminaire is implemented?
- How has the design of the project evolved since it was first started?
- What was the motivation for releasing Luminaire as open source?
- For someone who is using Luminaire, what is the process for training and deploying a model with it?
- What are some common ways that it is used within a larger system?
- How do sustained anomalies such as the current pandemic affect the work of identifying other sources of meaningful outliers?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Luminaire being used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and using Luminaire?
- When is Luminaire the wrong choice?
- What do you have planned for the future of the project?
Keep In Touch
- Smit
- shahsmit14 on GitHub
- Sayan
- Website
- @tweettosayan on Twitter
Picks
- Tobias
- Smit
- Sayan
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Luminaire
- Zillow
- Anomaly Detection
- Facebook Prophet
- IEEE Big Data Conference
- Unsupervised Learning
- ARIMA (Autoregressive Integrated Moving Average) Model
- Airflow
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
[00:00:19] Unknown:
When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
[00:00:43] Unknown:
Go to pythonpodcast.com/linode,
[00:00:46] Unknown:
that's l I n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Smit Shah and Sayan Chakraborty about Luminaire, a machine learning based package for anomaly detection on time series data. So, Smit, can you start by introducing yourself?
[00:01:09] Unknown:
Hi. My name is Smit Shah, and I've been working as a senior data engineer at Zillow for almost 4 years. I'm mostly involved in building data related products at Zillow.
[00:01:21] Unknown:
And, Sayan, how about you?
[00:01:23] Unknown:
Hi. My name is Sayan Chakraborty. I work as a senior applied scientist on the Zillow AI team. I mostly work on, like, anomaly detection and help build the anomaly detection and, in turn, the machine learning methods inside Luminaire.
[00:01:38] Unknown:
And going back to you, Smit, do you remember how you first got introduced to Python?
[00:01:42] Unknown:
Actually, when I joined Zillow in 2016, that was the very first time I got into Python, as most of the data teams were using Python. So I started, like, learning more about Python and using it. It's a really easy and convenient language to use, and I really love it. Just before Python, I was more into Java and Objective C programming.
[00:02:06] Unknown:
And, Sayan, do you remember how you first got introduced to Python?
[00:02:09] Unknown:
Yeah. So I'm from a stats background. So I used to code in R. And I remember, like, during the 3rd year of my PhD program, I decided to learn a new language, and Python was picking up, and I thought it would be a good candidate. Now I'm full on Python. I don't think I've written any code in R in the last 2 years.
[00:02:29] Unknown:
Yeah. That's definitely 1 of the interesting sort of ongoing holy wars in tech, Python versus R when it comes to stats. I'll leave that at the sidelines for now, but I think it's funny that you started in R and have since come over to Python. And so I'm wondering if you can just start by giving a bit of an overview about what the Luminaire project is and some of the origin story of how it got started and why you decided that you needed to build out this library from scratch?
[00:02:55] Unknown:
So Luminaire is a tool for detecting anomalies, specifically for time series data, and we do it for both batch and streaming use cases. The place where Luminaire stands out is, like, automating the whole modeling process and somehow, like, democratizing anomaly detection across the community. Because, like, anomaly detection is a hard problem, and not everyone comes with the domain expertise or, like, the ML expertise needed for doing anomaly detection. That's where Luminaire comes in to automate the process.
[00:03:27] Unknown:
And what's the story behind the name?
[00:03:30] Unknown:
So the dictionary definition of Luminaire is a complete electric light unit. And 1 of Zillow's core values is turn on the lights. So as Sayan mentioned, we wanted to, like, democratize anomaly detection, and we also wanted to bring visibility to the teams about various anomalies or data health issues in their data. And that's how we came up with the name Luminaire.
[00:03:55] Unknown:
And so there are a number of other projects that are intended for working with time series data even just within the Python ecosystem, 1 of the most notable ones recently being Prophet. But I'm wondering if you can just give a bit of an overview about where Luminaire fits within that ecosystem of time series frameworks and libraries, and what are the use cases that it's well suited for versus when you might want to use something else?
[00:04:19] Unknown:
Luminaire is built for anomaly detection, so it's focused on classifying anomalies. So when you are talking about, like, other forecasting packages, those kind of depend on how much signal you're getting from the data. But anomalies are anomalies. They can show up whether you have signal in the data or you don't have any good signal in the data. So that's where Luminaire comes in. We have built the library in such a way that it's focused on detecting anomalies for time series data and brings automation for when the user wants automation.
And whoever wants, like, to play with the model, we also open up the configuration for, like, doing all the changes during the modeling. So it kind of also works well in terms of, like, building an explainable model for anomaly detection. And specifically, like, for forecasting, we have actually seen, like, whenever you have good signal, Luminaire can be used well as a forecasting tool, and it performs pretty well whenever you have good signal in the time series data. We actually recently submitted a paper to the IEEE Big Data conference this year, which got accepted. And we showed some benchmarks where Luminaire is outperforming many other competing forecasting or anomaly detection methods in different scenarios.
That ranges from, like, Luminol, ADTK, even Auto Remind, also Prophet as well.
[00:05:54] Unknown:
Because of the fact that there are so many other libraries out there for time series, what was the motivation for actually starting an entirely new project versus just adding new capabilities onto an existing library or just wrapping an existing one?
[00:06:16] Unknown:
We kind of wanted the best of both worlds. Like, either we wanted to have more control over the model building process, or we wanted to build an anomaly detection tool for those users who want more automation. And Luminaire is good in both worlds. So when we were building this anomaly detection tool, we looked for several existing solutions, and we found, like, there is no such tool that is solving this problem in a robust and reliable way. Because, as I mentioned before, like, most of the tools which are more sophisticated for dealing with time series data are focused on forecasting, not on anomaly detection. And those tools which are focused on anomaly detection have very basic modeling capabilities. So that is where Luminaire comes in, which is kind of combining these 2 capabilities into, like, a more powerful, more sophisticated anomaly detection platform.
[00:07:18] Unknown:
Yeah. Let me explain, like, how this whole Luminaire project actually got started from inception. So at Zillow, since we are a data company, the company identified there was a need for having a centralized data quality team. And that's where our core team got formed. And what we also observed was there were no standards or formal processes that everyone was following in general to detect general data health issues for their metrics. And that's where we started creating standards and processes to help these teams out. So that's where the very first utility function in Python was created internally, which was helping teams to generate data health metrics like volume, availability, completeness, or even comparison.
And later on, teams were interested in now getting alerts on top of these metrics. So we started building deterministic anomaly alerts on top of this. But later on, we also found that we had a lot of time series use cases within a lot of our teams. And that's where the need for doing time series anomaly detection came in. And that was the first place where we added this utility function within our core package to detect time series anomalies. And we started with the off the shelf ARIMA model. But from that point onwards, we then saw, like, okay, these are, like, 2 different use cases. And that's where we split the project into, like, Luminaire deterministic checks, and this Luminaire, which we have open sourced right now, which is the core time series anomaly detection.
And from there onwards, we started adding more models, more sophisticated models, to support just time series anomaly detection and, overall, trying to democratize self serve anomaly detection at scale. And over this period, like, our core team, which I would like to also mention, was not just the 2 of us, but also, like, Anna Swigert, she is our manager, and then Rui Yang, Kumar Sultani, and Kyle Buckingham, who were there from the initial days of building these packages.
[00:09:37] Unknown:
So I'd like to also thank them. And in terms of the ways that it's being used now, you mentioned the data quality aspects of maintaining your data pipeline. And, Sayan, you also mentioned actually using it for some of the forecasting capabilities. I'm wondering if you can just discuss some of the different ways that it has grown to be used throughout Zillow within the data team, but also maybe some use cases outside of just the specific data pipeline and data analysis life cycle?
[00:10:03] Unknown:
Yeah. Sure. So within Zillow, like, we currently don't use it as a forecasting tool, but use it as an anomaly detection platform. So in general, like, we have lots of, like, internal and external services, and we process an enormous amount of data every day. So there are data producers who generate, like, batch or streaming data that gets consumed by the different downstream services. And the producers want to make sure the data they are generating is good quality data, as well as, like, the downstream teams who are consuming this data and creating maybe business metrics or generating features for their ML systems. Like, for example, like, Zestimate, Zillow Offers, the recommender systems we have in Zillow. All of those teams, they want to make sure, like, the data they are ingesting is good enough. So that's where Luminaire comes in. So Luminaire intervenes at different parts of this process pipeline and makes sure the data flow, and the data that is going from 1 place to another, is good quality and does not have anomalies.
So we do different checks, like checks for volumes, like availability, nulls, completeness, and so on and so forth, to make sure the data within Zillow that's being ingested into the services is healthy enough.
[00:11:24] Unknown:
I just wanted to add, like, that's where we were saying we were trying to bring these standards and processes to Zillow and trying to guide teams on, like, what data health means, what quality means. So that's where we started encouraging teams to focus not just on building the data pipelines. That's not their end goal. It is also making sure that within your pipeline process, you are outputting healthy data. And that's where, as Sayan mentioned, like, we act as an intermediate intervening process where teams leverage this tool.
[00:12:05] Unknown:
Particularly for things like data quality, or if you're in the operational use case and you're doing anomaly detection on maybe system metrics, it can be very easy to accidentally get to the point where you're generating too much noise, because there's a, you know, certain variance in the signal. And so it can be hard to determine if something is meaningful or not. And I'm wondering if you can maybe dig into some of the complexities that are inherent to anomaly detection that are not obvious at first glance and that are difficult to overcome, or that are important for avoiding the case of creating so much noise that people will start to ignore the types of anomalies that are being detected?
[00:12:46] Unknown:
Anomaly detection is a challenging problem indeed. Specifically, like, if you are building an anomaly detection model for a given problem, like a given dataset, you can keep on optimizing it, because you can ingest more data, more information about the data, and you can keep on optimizing your model so that it works best for that dataset. But from a tool or a service perspective, when you're making, like, an anomaly detection service, that is a challenging problem.
Anomaly detection is, like, an unsupervised problem, so it does not really come with labeled data. So that is, like, 1 tricky problem: understanding the performance of the anomaly detector, like how well it is performing versus, like, how badly it is performing. And, also, since Luminaire is a time series anomaly detection tool, it faces the problem of handling nonstationarity, which is like a never ending problem for time series data. And, also, batch and streaming time series anomaly detection are 2 very different problems. We have observed, like, from our past experience, when you start aggregating the data or, like, you start seeing the time series data over different frequencies, the behavior of the data changes a lot. So these are the different issues that you have to keep in mind when you're building time series anomaly detection, or anomaly detection in general, as a service.
And from the actionability point of view, that is also, like, a very important aspect. Because, like, anomaly detection comes with an error rate, because it's a probabilistic solution. So you have to understand, like, if your model misses or fails sending an alert, or there is some issue in the model or in the pipeline and the alert does not reach the end user, what is the cost of that? So that is a very important problem to handle. And, also, like, the time sensitivity, because in many cases, mostly in the streaming use cases, detecting anomalies in time is a very important
[00:15:01] Unknown:
problem. And so in terms of the actual design of Luminaire, how are you addressing some of those problems of being able to solve for the general case while also being able to provide some escape hatches or tuning capabilities for being able to identify some of these special cases or make it fit a particular use case and just managing the flexibility and the breadth of the overall problem space?
[00:15:27] Unknown:
Yeah. So Luminaire is an anomaly detection tool which is supposed to work for wide ranges of problems. So we take different measures. And since, internally, it uses machine learning, we take standard techniques followed in the machine learning literature to process the data, model the data, and then use it for training. So from the beginning, like, we start with data cleaning. We check for nonstationarity and do all the adjustments that you need to do for modeling or dealing with the time series data. And then we get signals from the data itself. And as I mentioned before, since we have less control over the externalities, we use the history of the data in order to incorporate all the information we need in order to build the model for anomaly detection. We check, for example, temporal correlations, like periodic patterns, or sometimes we have seen use cases where data has, like, local nonlinearity.
Those are the things that are incorporated during the model building process. And finally, like, all the steps require some actions and some decisions that the users need to make. That is where Luminaire stands out from the other systems, and Luminaire has a built in configuration optimization feature as well, where the user just comes in and says, okay, like, I would like to monitor my time series. And for a given problem, we optimize the configuration based on the dataset, and that actually brings a lot of automation to the process. This is, like, 1 side of the thing. And another side is, like, how to deal with streaming solutions. Because streaming anomaly detection, or specifically the streaming data that I mentioned, like, behaves differently, we have a different solution. Like, instead of, like, doing, like, a predictive or uncertainty based modeling, we do some sort of, like, baseline matching or density matching, where we do, like, data checks over sequential windows in order to see, like, whether there is an anomaly or not.
So, yeah, these are the typical steps or measures we take to build Luminaire as a successful anomaly detection tool.
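As a rough illustration of the automated configuration and profiling steps Sayan describes, here is a minimal sketch following the batch workflow in Luminaire's README (module paths and argument names are taken from the project documentation and may differ across versions; the daily frequency and CSV input are assumptions):

```python
import pandas as pd
from luminaire.optimization.hyperparameter_optimization import HyperparameterOptimization
from luminaire.exploration.data_exploration import DataExploration

# Assumed input: a daily metric with a datetime index and one value column
df = pd.read_csv("metric.csv", index_col=0, parse_dates=True)

# Let Luminaire search for a configuration suited to this particular series
hopt_obj = HyperparameterOptimization(freq="D")
opt_config = hopt_obj.run(data=df)

# Profile and pre-process the history (imputation, change point adjustment,
# and so on) using the optimized configuration, before any model is trained
de_obj = DataExploration(freq="D", **opt_config)
training_data, pre_prc = de_obj.profile(df)
```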
[00:17:48] Unknown:
And on the subject of the windowing, that can be another challenging optimization problem: determining what the appropriate sizes are for those windows, ensuring that you're actually, you know, choosing the proper bucket for determining whether something is anomalous, and how much information you need to be able to compute that. And then also cases such as seasonality, where something might be anomalous within the previous time window on the order of days or weeks, but on an annual basis it's actually entirely normal. And I'm curious how you handle some of those types of problems.
[00:18:24] Unknown:
For the windowing aspect, this is a very important question, because sometimes, like, anomaly detection, specifically, like, this kind of problem, is, like, context based. So sometimes, let's say, in the middle of the night, you have less traffic, or some data showing very low volume, high variability. So we make sure we consider the seasonalities of this pattern. Like, if it is a repeating pattern, then we consider that by incorporating it into the model. And, also, in terms of determining the size of the window, we take measures where either the user can pick the size of the window, if the user knows what would be the optimal size for their problem, or we are actually working on, like, bringing the automation of measuring the same problem over different window sizes. That means you're basically seeing the problem over different contexts.
We have seen in many use cases that that seems to work pretty well. It's not really implemented inside Luminaire right now, like, changing the window size for a given solution, but we are planning to build it in the future.
[00:19:35] Unknown:
And, also, like, that is definitely a tough problem. Right now, in terms of streaming, if you're receiving data, let's say, every second or every minute, we kind of expect the users, as of what we support right now, to give us that kind of information. And, yeah, that definitely is challenging. But at least for the batch style of processing, that becomes much easier. Like, if the frequency of your data is, like, every day or every week or every month, or even, like, every hour, that is where it becomes a little bit easier to automate that process.
[00:20:22] Unknown:
And in terms of the actual design and implementation of Luminaire, can you dig into some of the internals of the project and some of the ways that the design and goals of the project have evolved since you first began working on it? Yeah. So it initially started from, like, just implementing a basic type of model.
[00:20:42] Unknown:
And we just had the training data, did some basic cleaning, processed that, and did anomaly detection. And we've seen, like, there are many caveats in dealing with, like, anomaly detection in time series data. Specifically, you have to deal with missing data, or you have some change points in the data, which is a very serious problem in time series modeling. So we take different measures for detecting those and making the data ready before it goes to the model building process. Luminaire has 3 main components, which can be used independently or sequentially in order to perform a complete end to end anomaly detection. So that starts with, like, the data preparation and profiling. So in the preparation and profiling part, you can prepare the data to be ready for modeling.
And you can also do some exploration, where you can see the historical patterns and what has changed in the data. So processing in the sense, like, that Luminaire detects change points, and also, like, Luminaire detects trend turnings, which are very useful and, like, sometimes interesting to see in many use cases. Data imputation, if there is missing data, or, like, doing any other adjustment if there is any change point observed. And in terms of the modeling, internally in Luminaire, like, we have different types of models. Some models focus on the forecasting capabilities, compared to some models focusing on, like, the variational and uncertainty patterns where the data has very little signal.
And on top of all of this, we have an optimizer that can optimize the choices given a problem. So if you have data, the Luminaire optimizer can run different scenarios and check which specific configuration fits best for a given problem, and it can suggest the best optimized configuration to you. And on top of that, like, whoever uses Luminaire, they can run a scheduling engine, like, in terms of scheduling the training and the scoring process. Because specifically for time series modeling, you need a very periodic structure of training and scoring, because you don't want to use a very old model to score newer time points. So you have to make sure you are always generating new and recent models at a specific frequency.
And for streaming use cases, it is like a trade off between efficiency versus speed, because in streaming use cases, you want to process the data fast and you want to send the result to the user in a timely fashion. So as I mentioned before, we do, like, a baseline comparison, like a volume comparison or distribution comparison approach, where we compare different time windows, and we take a baseline time window and we do the processing. And, similarly, we have a training and scoring schedule for that. And the scoring process is very lightweight, where the most recent model can be used to pull the relevant baseline and can be used to score that specific window to see whether any problem is there or not.
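For the streaming side Sayan describes, the project README documents a window density model that does this sequential-window baseline comparison. A hedged sketch of that path (class and method names follow the docs at the time of writing; `df` is assumed to be a high-frequency pandas DataFrame of history and `recent_window_df` the window to check):

```python
from luminaire.exploration.data_exploration import DataExploration
from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel

# Start from the default window-density configuration
config = WindowDensityHyperParams().params

# Profile the streaming history into sequential windows
de_obj = DataExploration(**config)
data, pre_prc = de_obj.stream_profile(df=df)
config.update(pre_prc)

# Train a baseline over the windowed history, then score a recent window
wdm_obj = WindowDensityModel(hyper_params=config)
success, training_end, model = wdm_obj.train(data=data, **pre_prc)
scores = model.score(recent_window_df)
```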
[00:24:06] Unknown:
And I'm wondering what the motivation was for releasing the project as open source, and any extra effort that was necessary to maybe remove any assumptions about the way that Luminaire was being deployed or used based on how it was being employed at Zillow, and how to make it more general purpose and accessible for people outside of the company?
[00:24:29] Unknown:
So we actually looked for several solutions out there, like, first, when we were trying to solve this problem. And there was no tool we found that solves the problem the way we want. Because, as you mentioned before, actually, the time series model comes with several challenges, dealing with the seasonalities, or dealing with streaming versus batch data, and so on. So that encouraged us to build our own solution. And we wanted to open source Luminaire because, like, whatever we have built, we wanted to contribute back to the community. Because, like, since we have already invented the wheel, we didn't want the wheel to be reinvented, because we are solving a very common problem. Anomaly detection is not a problem unique to Zillow. Even outside, many people are trying to solve the same problem.
And, basically, like, this is an industry standard as well. Whenever you open source something, instead of different people working on the project independently, people can build something on top of each other, on top of your solution, so that we get incremental improvement and, overall, the whole industry benefits from it. And, finally, like, open sourcing helps incorporate a lot of brains. And if we want to have, like, a high quality solution, this is, like, a good way to go.
[00:25:57] Unknown:
Also, 1 of the problems in the initial time that we found was, when we started providing these utility functions to the teams, initially, they had to select what models to run from the suite that we were providing. And within that same model, they also had to specify what parameters they have to set. So this was 1 of the bigger challenges that we found, not just, like, for the ML teams, but even for the data teams. Because we wanted even, like, any generalized data team to benefit from all the sophisticated ML systems.
So that's where we started bringing more models to the suite and making it more sophisticated. And on top of that, we also added the layer of AutoML. So users now don't even have to figure out what models they have to select; that kind of becomes 1 of the parameters to Luminaire. So I would say, like, that is, like, 1 of the key things that has helped a lot of the teams at Zillow right now in solving a lot of the time series problems. Because teams are also not just onboarding just 1 time series that they are interested in. They are onboarding, like, hundreds or thousands of metrics that they care about. And imagine the time a team has to spend figuring out, for this 1 metric, what model should I select and what parameters should I set, and scaling that to thousands of metrics. So that's where we are, like, trying to solve that bigger problem.
And that was the main motivation also to make it open source, because, as Sayan also mentioned, like, nothing we saw was available to solve that use case.
[00:27:47] Unknown:
Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1 on 1 video calls, a tuition back guarantee that means you don't pay until you get a job, resume preparation, and interview assistance, there's no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show.
Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level. For individuals or teams who are adopting Luminaire, can you talk through the overall workflow of introducing it to an ecosystem or a particular application, and what's involved in actually getting it set up, training the model, and getting it deployed? So,
[00:28:51] Unknown:
ideally, what this open source package is providing you is the brains, which is the models. Now for teams, if they wanna leverage it, their main input is, first of all, the metric that they care about. And that metric basically involves, like, let's say, 2 main columns. 1 is your time column and 1 is your actual metric column. So that kind of becomes the initial input to Luminaire. And then you train it using that historical data. So make sure you provide enough history so you get a proper trained model. And now you have this trained model which is getting outputted. Now you wanna store that model somewhere, and that can be any of your desired storage formats. You can either go with any kind of file storage, or you can even store the object in a database if you like. So that was the main reason for, like, decoupling the model from the storage and everything around it. Once you have this model, now it's time for scoring. So let's say you have a metric which is generated every day or every hour. You can have your scheduling system run at that interval, pull that trained model to score that specific metric, and determine what the score of that metric is. And that's not where the process ends for the user, because scoring will output a few metrics about the scored results.
From there, you also have to figure out, like, for your stakeholders, what is the sensitivity that they are interested in. So if it is, like, highly sensitive, like, only if it is highly anomalous do you wanna get alerted. Or if it is, like, say, 95% anomalous or 99.9% anomalous. So, like, anomalous is different for different users. So that's where users just have to set up what threshold makes more sense to them. And that's how you leverage the output to figure out if this point that we just scored right now is an anomaly or not. So that's how you design your basic training, deploying of your model, and the scoring process.
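Putting Smit's description together, here is a minimal sketch of that train, store, and score loop, continuing the `opt_config`, `training_data`, and `pre_prc` variables from the profiling sketch earlier. `LADStructuralModel` and the `AnomalyProbability` output field follow the project README (in practice the optimizer also suggests which model class to use); the pickle file storage and the 0.99 threshold are illustrative choices, not part of the library:

```python
import pickle
from luminaire.model.lad_structural import LADStructuralModel

# Train a structural model on the profiled history
model_obj = LADStructuralModel(hyper_params=opt_config, freq="D")
success, model_date, trained_model = model_obj.train(data=training_data, **pre_prc)

# Persist the trained model object; any storage works, since the model is
# decoupled from storage (a file here, a database blob in production)
with open("metric_model.pkl", "wb") as f:
    pickle.dump(trained_model, f)

# Later, on the scoring schedule: reload the model and score the newest point
with open("metric_model.pkl", "rb") as f:
    model = pickle.load(f)
result = model.score(1042.0, "2020-06-09")  # (observed value, timestamp)

# Alert only when the score clears the stakeholders' chosen sensitivity
if result.get("AnomalyProbability", 0.0) > 0.99:
    print("Anomaly detected:", result)
```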
[00:31:08] Unknown:
And are there any particular common patterns that you've seen either internal to Zillow or from users of the open source project as to ways that it's being integrated and some of the sort of main use cases that people have been able to benefit from?
[00:31:24] Unknown:
So this model that we provided, and the example that I shared, was just for 1 metric. Okay? And in a larger system, and even within Zillow as well, we have a lot of teams who want to leverage this. And they don't wanna leverage it for just 1 metric as such. They wanna leverage it for multiple of their metrics. So 1 thing that we have developed, like, internally, and what we recommend to the listeners as well, is creating a process where your input is not just 1 metric, but it can be multiple metrics. Let's say, for example, you have some, like, page views by devices of your website. Okay? And you can have multiple devices over there. So you wanna create a process where you view that query or that dataset as an input, and you wanna, like, divide each of these metrics into its own single mini process, then train and store the model object, and then do the same process even for scoring.
So that's where we recommend users to create some kind of mapping mechanism. So, for example, 1 of the dimensions might be a device, like iOS, and 1 of your metrics is page views. So you can create a mapping for that specific metric and assign some kind of unique key, and that's how you can scale your system. So that's 1 of the aspects of scaling. And another thing that we have also seen is the code that is written is pure Python. So if you have a lot of these metrics that you wanna run in parallel, you can leverage some kind of distributed processing.
So, internally, we are using this package with Spark. That helps us train all these individual metrics in your dataset much faster, and score them faster. So it is highly recommended to leverage that kind of distributed processing. And another thing that we have seen is, if you have a metric or a dataset that you have generated, there might not be just 1 team that is interested in that metric. We have seen a lot of other product users or business users are interested in those same metrics. Let's say, again, take the example of, let's say, page views by some devices. Okay.
So there might be a lot of business stakeholders who care about that metric, and they wanna be notified if there are some kind of anomalies. So if each team does that same training and scoring process for the same metric, you are kind of duplicating all those resources. So what we have done is we have just, like, created a process where a user can come in and say, okay, this is that job which is running, and this is that metric that I'm also interested in. And I would like to get alerted if the sensitivity reaches a certain level.
So that way, what we are doing is we just have 1 process which does the training and scoring, but we have another process which maps the result of the scored value to the subscribers and notifies them accordingly. And we have seen that has helped a lot of our teams and stakeholders. So definitely something we would recommend. And 1 last thing is, when we initially started Luminaire, in order to onboard jobs, it was more, like, config driven. Like, people had to create some kind of config form, which they pushed to our repo. And a lot of the nontechnical users were finding it difficult. So we started creating a self-service UI, and we integrated that with our data catalog system.
So there, like, a user can easily come and onboard any kind of job process they wanna do, and also easily specify the alerting thresholds. And we automatically just take care of all the downstream processes, like all the scheduling processes, all the orchestration around it, Airflow, running the jobs in PySpark. So decoupling all these processes and making it self serve has definitely benefited a lot of teams at Zillow, and it's something we would recommend all our listeners to also try to do.
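As a hedged sketch of the distributed pattern Smit describes: parallelize per-metric training with PySpark, keying each series by a unique metric identifier. The `train_one_metric` helper and the example DataFrames are hypothetical stand-ins for the optimize, profile, and train steps shown earlier; only the parallelize/map structure is the point:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("luminaire-batch-training").getOrCreate()

def train_one_metric(metric_key, series_df):
    # Hypothetical wrapper around the optimize -> profile -> train steps above.
    # Returns (metric_key, trained_model) so each result can be stored by key.
    ...

# One (unique key, time series) pair per metric, e.g. page views split by device
metric_pairs = [
    ("page_views:ios", df_ios),          # df_ios, df_android: assumed inputs
    ("page_views:android", df_android),
]

trained_models = (
    spark.sparkContext
    .parallelize(metric_pairs)
    .map(lambda pair: train_one_metric(*pair))
    .collect()
)
```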
[00:35:58] Unknown:
1 of the other interesting aspects of things like anomaly detection and working with time series data is situations where you have some protracted anomalies, such as the current situation with the pandemic, where everything is thrown out of whack and things that might have worked well for your steady state are now difficult to predict because of the constantly changing environment. And I'm wondering how that affects the work of people who are using Luminaire, or who are trying to identify the sources of the outliers in the anomalies that are detected, and just sort of the overall impact of things like the current pandemic on people who are trying to use time series for building meaningful signals that instruct the way that they want to drive their business or manage their data pipelines or things like that?
[00:36:47] Unknown:
I'll repeat what I said before. So anomaly detection is, like, a very contextual problem. So in the context of the current situation, like the pandemic, like, we are kind of living in an anomalous state right now. But the nice thing about, like, the time series models, whether you're using Luminaire or any other tools, is, like, you can tweak them to bring more contextual information into the models. So, for example, you can work on narrower time windows, which is the case for many metrics. Like, for example, like, many operational metrics, they don't depend on, like, long term history, and you can treat them on a local history and detect any problems in that.
But in general, like, the business metrics or other metrics which have a longer context, a longer history, those kind of get more impacted. So in general, like, time series models are very fast to adapt. And when you want an anomaly in the context of a local outlier, like, you're trying to find, like, something that happened today which is independent of the overall context of this pandemic, time series models are pretty good at that. But in general, like, when you are dealing with data of such a different pattern, it's ideal to observe it from different perspectives, involving, like, varying time windows, and understanding the relationship between different time series. Somehow correlating the time series, or correlating the outliers, sometimes helps. And, also, like, doing some multivariate processing of the time series, where you can relate 1 problem with another, helps a lot.
So even though you have such big impactful externalities, such as COVID or, like, this pandemic, I would say, like, yeah, I mean, taking these measures helps a lot.
[00:38:52] Unknown:
And in terms of the specific projects that you've seen built with Luminaire, either internally at Zillow or out in the community, what are some of the most interesting or innovative or unexpected ways that you're seeing it used?
[00:39:05] Unknown:
So in general, any anomaly detection tools are designed to work for, like, detecting anomalies in the data. But, even internally, like, we have seen use cases where teams are using Luminaire to detect, like, different quality metrics related to modeling. That kind of takes Luminaire from being, like, just a data quality tool to, like, a tool that monitors, like, a whole ML product. So, for example, like, we have teams who are using this for tracking model drifts, or any changes in the code that might have introduced bugs which have changed the outcome of the model radically.
So that's where Luminaire comes in. And not only, like, tracking the input side of things, but also tracking the output side of things. And, also, like, there are situations where Luminaire has been used for tracking slow drift, where you see, like, model performance going down for some specific reason, some slow temporal change in the data which causes, like, a slow drift, not a steady jump or drop. So these are, like, some of the interesting cases where we hadn't seen anomaly detection tools being used before, but Luminaire is being used in these cases.
[00:40:28] Unknown:
In terms of the external users, it's been pretty recent that we open sourced the project. So we have not, like, gotten much, like, feedback from the outside world yet on how they have benefited. That's where we are, like, trying to spread the word and see if someone has used it and what they found. Yeah. That is something where we are happy to learn more from them as well.
[00:40:53] Unknown:
And in terms of your own experience of working on the Luminaire project and using it internally for your own projects, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:06] Unknown:
There are several aspects to answering this question. Like, first, from a system point of view, like, as I discussed before, like, 1 of the challenges of dealing with, not only anomaly detection problems, but specifically time series problems, is, like, the temporal dependency. Like, time is itself a factor in it. So it is very important to understand this time factor, not only from a statistical perspective, but also from a system perspective. Because that's where we make sure, like, we are processing the right anomalies and we are sending the optimal number of alerts to the user.
So from the system perspective, it is very important to find the sweet spot for scheduling the training process. Because you want to train your model frequently, but not so frequently that it will increase the cost of resources. But, also, like, you don't want to train it very infrequently, so that the model starts degrading. And, also, we have a concept of model TTL, like, time to live, where we publish a model that comes with an expiry. That means, like, if you have published a model, it will expire at some point. That is deliberately done, because we don't want to use a very old model to score a recent data point.
So these are the different things from the system side of things you need to understand when you're dealing with time series data. Now, from the alerting side of things, like, from the user's point of view, we have seen challenges in, like, setting up the post processing, like, how do we process the output from Luminaire? So in the context of a user, sometimes we send an alert which the user finds kind of obvious, because the context the model is trained on is different from the context the user is looking at the model in. For example, like, you are seeing a metric be continuously at 100 for, like, the last few months, and suddenly you see it be 101.
And at that time, like, you will get an alert. But from the user's perspective, the user might think, okay, I don't really care about 100 being 101. I want to know, like, if it goes to 100,000. So these are the different challenges we have seen, like, for someone coming from a non ML background, to see this kind of problem from a different angle. And from a probabilistic perspective, like, anomaly detection is a probabilistic classification, which comes with an error rate. So it's very important to understand as well, like, what metrics you want to monitor. As Smit mentioned before, like, there are users, like, who come with a task with lots of metrics, and it is very important to understand which are the important metrics to monitor. Because if you onboard very noisy metrics, or very unimportant metrics that usually generate noise, you tend to get more alerts from those metrics rather than from the stable ones, which you think are more important to monitor.
So for this reason, we, like, internally, like, have smart alerting processes where we do some alert throttling, muting, and grouping, so that we make sure, like, we send the right amount of alerts, so that it does not really create alert fatigue where the user starts ignoring the alerts, but also, like, it does not suppress an important alert.
[00:44:38] Unknown:
I would just like to add 1 thing over here that we learned from the initial stages. Initially, we were just sending out alerts for that 1 point itself. So what was happening is, like, users were not always having the context of, like, why this point is treated as an anomaly. So they always wanted to know the previous trend along with it. So that was, like, 1 of our interesting and challenging things. So what we started to do is we started showing not just the recent data point that we scored, but also, like, the last bit of history of what was observed.
So showing that context, and showing the anomaly along with it, helped users a lot to immediately make the decision: oh, okay, this is why it is an anomaly. Like, sometimes the significant dip or drop might not be obvious, but then it becomes very obvious when you show them the current point and some history with it.
[00:45:44] Unknown:
For people who are evaluating Luminaire, or trying to understand if anomaly detection is the right tool for their particular problem, what are the cases where Luminaire is the wrong choice?
[00:45:55] Unknown:
When I introduced Luminaire, we mentioned, like, Luminaire is an anomaly detection tool which works for wide ranges of time series data, and it also brings automation. That means we take historical information patterns or, like, different structures in the time series to understand what is an anomaly versus what is not. But if a user has more information about the data, like, more information about the externals of the data or, like, more context on the data, then building a more feature based model will make more sense. So in that context, like, for example, if the user knows, like, there will be a major release which might increase the traffic by March, then incorporating this information would reduce an alert, which is a good thing, because the user already knows, if the alert is coming, why the alert is coming. This is, like, a very important aspect: like, if you have more information on the data, then build a feature based model.
And, also, it's important to understand whether you have the resources to build a feature based model. If you have the resources, and if you have more context on the data, I would say it's a better option to go beyond Luminaire.
[00:47:07] Unknown:
This also kind of ties in with the challenging part of it as well. Like, sometimes, if people are alerted, they want to get more context on why something is an anomaly. And that's where, like, their teams have to later on dig into it. And that is more, like, outside of the scope of the package right now, but that is something teams should think more about. Like, if something is flagged as an anomaly, trying to give them more context, if possible, on why we think this is an anomaly. Like, not just specific to the models or the characteristics, but whether there are any external reasons that might have caused it. And that's 1 of the very challenging problems.
[00:47:48] Unknown:
As you look to the future of the project, what are some of the plans that you have for the near to medium term or any areas of contribution that you're looking for help from the community?
[00:47:58] Unknown:
Mostly, we are now, like, working on the streaming anomaly detection model, and at this moment it's at a very initial stage. We have open sourced the first version of it, but we would like to build and bring more automation and more sophisticated features into it, to bring more end to end processing, like, where the user would need less configuration. Also, like, from the context of where Luminaire is a wrong choice, as Smit mentioned, giving the context of an alert would bring more insight on why someone is getting an alert. So from that perspective, diving in and doing some root cause detection, like, doing some data driven context extraction, is a very important part. And we actually published a paper last year about root cause detection and how we are planning to do that inside Luminaire.
So that is 1 key part. And in terms of improving the existing anomaly detection model, we are planning to incorporate more sophisticated models into the system. And, currently, we do optimization, but we are planning to do some voting or, like, some sort of bagging approach, in order to identify if something is anomalous versus nonanomalous. Because, like, having an ensemble approach to dealing with the classification model would bring more reliability. So that is something we are planning for the future.
[00:49:26] Unknown:
Are there any other aspects of the Luminaire project or the problem space of anomaly detection and dealing with time series data that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:37] Unknown:
So right now, what we internally kind of do is, like, have a fixed frequency when the training runs. Okay? But that is also not a scalable solution. You actually wanna trigger tuning when your model is actually degrading. So you wanna tune your model based on your past scoring results, taking that into effect. And only if it degrades would you retune it. Yeah.
[00:50:06] Unknown:
So, basically, this is like an internal tool that is not in the open source project right now. So we talked about the automation part, but this is more of a self awareness part, where it reduces the maintenance cost as well. Because when you onboard an anomaly detection model and you want to monitor something, you have to continuously check whether the model degrades and when you want to retune. Because you don't want to retune at every stage, as that kind of makes things less reliable and also, like, increases the computation cost, because, like, the tuning part is pretty expensive. So what we do is, like, whenever we score, we store some model performance metrics. And every time the retraining schedule comes around, we check whether the model performance starts degrading, with different voting methods. Like, we take different measures of measuring a model. And if we see, like, okay, this model should be retuned, we trigger a retuning. So that means, like, it's kind of a complete loop of a fully automated ML system.
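Since this self-aware retraining loop is internal to Zillow rather than part of the open source package, the following is purely a hypothetical sketch of the shape Sayan describes: record a lightweight performance signal at scoring time, then check for degradation at each retraining schedule before paying for an expensive retune. The in-memory store and the alert-rate proxy are illustrative assumptions:

```python
from collections import defaultdict, deque

# Hypothetical in-memory store of recent anomaly probabilities per metric;
# a real system would persist these alongside the scored results
score_history = defaultdict(lambda: deque(maxlen=100))

def record_score(metric_key, anomaly_probability):
    """Store a lightweight performance record every time a point is scored."""
    score_history[metric_key].append(anomaly_probability)

def should_retune(metric_key, alert_threshold=0.99, degradation_rate=0.3):
    """At the retraining schedule, decide whether the model needs retuning.

    Retuning is expensive, so trigger it only when a crude degradation proxy
    (here: the fraction of recent points flagged anomalous) crosses a limit.
    """
    recent = score_history[metric_key]
    if not recent:
        return False
    alert_rate = sum(p > alert_threshold for p in recent) / len(recent)
    return alert_rate > degradation_rate
```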
[00:51:14] Unknown:
That's very cool. So for anybody who wants to follow along with either of you and get in touch, I'll have you each add your preferred contact information to the show notes. And so with that, I'm going to move us into the picks. And this week, I'm going to choose a tool called FlakeHell, which is a wrapper around the flake8 linting utility, which gives you the ability to maintain your configuration in the pyproject.toml, as well as better maintenance of plugins, and determining which errors you want to have reported, and in which contexts or for particular paths. So just a great convenience utility on top of flake8. So definitely recommend checking that out for maintaining code quality. And so with that, I'll pass it to you, Smit. Do you have any picks this week?
[00:51:57] Unknown:
Like, over, like, the last couple months, I have been, like, looking into this tool called Apache Ranger, which is an open source tool. And it kind of provides you a way to manage or control data authorization. And the reason I find it very interesting is because, as companies are becoming more data driven, a lot of data is getting generated every day, like, every minute, actually. And now a lot of teams also need access to this data. So how do you control that? That's 1 of the interesting projects I found, and that will be my pick. And, Sayan, do you have any picks this week?
[00:52:41] Unknown:
So I would like to pick a book I recently read: Prediction Machines: The Simple Economics of Artificial Intelligence. I found this book interesting because, like, it talks about machine learning and AI and predictive modeling from a broader aspect. So specifically those who are working on the very technical side of machine learning, like the machine learning practitioners or even the data engineers, this book should be very interesting for them, because it gives a broader picture, like, from a business and strategy point of view. So I highly recommend everyone read this book.
[00:53:17] Unknown:
Well, thank you both very much for taking the time today to join me and discuss the work that you've done on the Luminaire project. It's definitely a very interesting tool and 1 that I plan to take a look at myself. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias, for having us on your show.
[00:53:35] Unknown:
Thanks, Tobias, for having us on the show and for all your time. Thank you.
[00:53:42] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Guests' Background and Introduction to Python
Overview of Luminaire and Its Origin
Luminaire's Unique Features and Use Cases
Motivation Behind Luminaire
Development and Evolution of Luminaire
Current Uses of Luminaire at Zillow
Challenges in Anomaly Detection
Design and Implementation of Luminaire
Handling Windowing and Seasonality
Internal Design and Components of Luminaire
Open Sourcing Luminaire
Adopting Luminaire: Workflow and Integration
Scaling and Use Cases of Luminaire
Impact of the Pandemic on Anomaly Detection
Interesting Use Cases of Luminaire
Lessons Learned from Developing Luminaire
When Luminaire is Not the Right Choice
Future Plans for Luminaire
Closing Remarks and Picks