Summary
When working with data it’s important to understand whether it is correct. If there is a time dimension, then it can be difficult to know when variation is normal. Anomaly detection is a useful tool to address these challenges, but a difficult one to do well. In this episode Smit Shah and Sayan Chakraborty share the work they have done on Luminaire to make anomaly detection easier to work with. They explain the complexities inherent to working with time series data, the strategies that they have incorporated into Luminaire, and how they are using it in their data pipelines to identify errors early. If you are working with any kind of time series then it’s worth giving Luminaire a look.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1:1 video calls, a tuition-back guarantee that means you don’t pay until you get a job, resume preparation, and interview assistance there’s no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show. Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level.
- Your host as usual is Tobias Macey and today I’m interviewing Smit Shah and Sayan Chakraborty about Luminaire, a machine learning based package for anomaly detection on timeseries data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Luminaire is and how the project got started?
- Where does the name come from?
- How does Luminaire compare to other frameworks for working with timeseries data such as Prophet?
- What are the main use cases that Luminaire is powering at Zillow?
- What are some of the complexities inherent to anomaly detection that are non-obvious at first glance?
- How are you addressing those challenges in Luminaire?
- Can you describe how Luminaire is implemented?
- How has the design of the project evolved since it was first started?
- What was the motivation for releasing Luminaire as open source?
- For someone who is using Luminaire, what is the process for training and deploying a model with it?
- What are some common ways that it is used within a larger system?
- How do sustained anomalies such as the current pandemic affect the work of identifying other sources of meaningful outliers?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Luminaire being used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and using Luminaire?
- When is Luminaire the wrong choice?
- What do you have planned for the future of the project?
Keep In Touch
- Smit
- shahsmit14 on GitHub
- Sayan
- Website
- @tweettosayan on Twitter
Picks
- Tobias
- Smit
- Sayan
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Luminaire
- Zillow
- Anomaly Detection
- Facebook Prophet
- IEEE Big Data Conference
- Unsupervised Learning
- ARIMA (Autoregressive Integrated Moving Average) Model
- Airflow
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
[00:00:19] Unknown:
When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
[00:00:43] Unknown:
Go to pythonpodcast.com/linode,
[00:00:46] Unknown:
that's l I n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Smit Shah and Sayan Chakraborty about Luminaire, a machine learning based package for anomaly detection on time series data. So, Smit, can you start by introducing yourself?
[00:01:09] Unknown:
Hi. My name is Smit Shah, and I've been working as a senior data engineer at Zillow for almost 4 years. I'm mostly involved in building data related products at Zillow.
[00:01:21] Unknown:
And, Sayan, how about you?
[00:01:23] Unknown:
Hi. My name is Sayan Chakraborty. I work as a senior applied scientist on the Zillow AI team. I mostly work on, like, anomaly detection and help build the anomaly detection and, in turn, the machine learning methods inside Luminaire.
[00:01:38] Unknown:
And going back to you, Smit, do you remember how you first got introduced to Python?
[00:01:42] Unknown:
Actually, when I joined Zillow in 2016, that was the very first time I got into Python, as most of the data teams were using Python. So I started, like, learning more about Python and using it. It's a really easy and convenient language to use, and I really love it. Just before Python, I was more into Java and Objective C programming.
[00:02:06] Unknown:
And, Sayan, do you remember how you first got introduced to Python?
[00:02:09] Unknown:
Yeah. So I'm from a stats background. So I used to code in R. And I remember, like, during the 3rd year of my PhD program, I decided to learn a new language, and Python was picking up, and I thought it would be a good candidate. Now I'm full on Python. I don't think I've written any code in R in the last 2 years.
[00:02:29] Unknown:
Yeah. That's definitely 1 of the interesting sort of ongoing holy wars in tech, Python versus R when it comes to stats. I'll leave that at the sidelines for now, but I think it's funny that you started in R and have since come over to Python. And so I'm wondering if you can just start by giving a bit of an overview about what the Luminaire project is and some of the origin story of how it got started and why you decided that you needed to build out this library from scratch?
[00:02:55] Unknown:
So Luminaire is a tool for detecting anomalies, specifically for time series data, and we do it for both batch and streaming use cases. The place where Luminaire stands out is, like, automating the whole modeling process and somehow, like, democratizing anomaly detection across the community. Because, like, anomaly detection is a hard problem, and not everyone comes with the domain expertise or, like, the ML expertise needed for doing anomaly detection. That's where Luminaire comes in to automate the process.
[00:03:27] Unknown:
And what's the story behind the name?
[00:03:30] Unknown:
So the dictionary definition of Luminaire is a complete electric light unit. And 1 of Zillow's core values is turn on the lights. So as Sayan mentioned, we wanted to, like, democratize anomaly detection, and we also wanted to bring visibility to the teams about various anomalies or data health issues in their data. And that's how we came up with the name Luminaire.
[00:03:55] Unknown:
And so there are a number of other projects that are intended for working with time series data even just within the Python ecosystem, 1 of the most notable ones recently being Prophet. But I'm wondering if you can just give a bit of an overview about where Luminaire fits within that ecosystem of time series frameworks and libraries, and what are the use cases that it's well suited for versus when you might want to use something else?
[00:04:19] Unknown:
Luminaire is built for anomaly detection, so it's focused on classifying anomalies. So when you are talking about, like, other forecasting packages, those kind of depend on how much signal you're getting from the data. But anomalies are anomalies. They can show up whether you have signal in the data or you don't have any good signal in the data. So that's where Luminaire comes in. We have built the library in such a way that it's focused on detecting anomalies for time series data and brings automation for when the user wants automation.
And whoever wants, like, to play with the model, we also open up the configuration for, like, doing all the changes during the modeling. So it kind of also works well in terms of, like, building an explainable model for anomaly detection. And specifically, like, for forecasting, we have actually seen, like, whenever you have good signal, Luminaire can be used well as a forecasting tool, and it performs pretty well whenever you have good signal in the time series data. We actually recently submitted a paper to the IEEE Big Data conference this year, which got accepted. And we showed some benchmarks where Luminaire is outperforming many other competing forecasting or anomaly detection methods in different scenarios.
That ranges from, like, Luminol, ADTK, even Auto Remind, also Prophet as well.
[00:05:54] Unknown:
Because of the fact that there are so many other libraries out there for time series, what was the motivation for actually starting an entirely new project versus just adding new capabilities onto an existing library or just wrapping an existing one?
[00:06:16] Unknown:
We kind of wanted the best of both worlds. Like, either we wanted to have more control over the model building process, or we wanted to build an anomaly detection tool for those users who want more automation. And Luminaire is good in both worlds. So when we were building this anomaly detection tool, we looked for several existing solutions, and we found, like, there is no such tool that is solving this problem in a robust and reliable way. Because, as I mentioned before, like, most of the tools which are more sophisticated for dealing with time series data are focused on forecasting, not on anomaly detection. And those tools which are focused on anomaly detection have very basic modeling capabilities. So that is where Luminaire comes in, which is kind of combining these 2 capabilities into, like, a more powerful, more sophisticated anomaly detection platform.
[00:07:18] Unknown:
Yeah. Let me explain, like, how this whole Luminaire project actually got started from inception. So at Zillow, since we are a data company, the company identified there was a need for having a centralized data quality team. And that's where our core team got formed. And what we also observed was there were no standards or formal processes that everyone was following in general to detect general data health issues for their metrics. And that's where we started creating standards and processes to help these teams out. So that's where the very first utility function in Python was created internally, which was helping teams to generate data health metrics like volume, availability, completeness, or even comparison.
And later on, teams were interested in now getting alerts on top of these metrics. So we started building deterministic anomaly alerts on top of this. But later on, we also found that we had a lot of time series use cases within a lot of our teams. And that's where the need for doing time series anomaly detection came in. And that was the first place where we added this utility function within our core package to detect time series anomalies. And we started with the off the shelf ARIMA model. But from that point onwards, we then saw, like, okay, these are, like, 2 different use cases. And that's where we split the project into, like, Luminaire deterministic checks, and this Luminaire, which we have open sourced right now, which is the core time series anomaly detection.
And from there onwards, we started adding more models, more sophisticated models, to support just time series anomaly detection and, overall, trying to democratize self serve anomaly detection at scale. And over this period, like, our core team, which I would like to also mention, was not just the 2 of us, but also, like, Anna Swigert, she is our manager, and then Rui Yang, Kumar Sultani, and Kyle Buckingham, who were there from the initial days of building these packages.
[00:09:37] Unknown:
So I'd like to also thank them. And in terms of the ways that it's being used now, you mentioned the data quality aspects of maintaining your data pipeline. And, Sayan, you also mentioned actually using it for some of the forecasting capabilities. I'm wondering if you can just discuss some of the different ways that it has grown to be used throughout Zillow within the data team, but also maybe some use cases outside of just the specific data pipeline and data analysis life cycle?
[00:10:03] Unknown:
Yeah. Sure. So within Zillow, like, we currently don't use it as a forecasting tool, but use it as an anomaly detection platform. So in general, like, we have lots of, like, internal and external services, and we process an enormous amount of data every day. So there are data producers who generate, like, batch or streaming data that gets consumed by the different downstream services. And the producers want to make sure the data they are generating is good quality data, as well as, like, the downstream teams who are consuming this data and creating maybe business metrics or generating features for their ML systems. Like, for example, like, Zestimate, Zillow Offers, the recommender systems we have in Zillow. All of those teams, they want to make sure, like, the data they are ingesting is good enough. So that's where Luminaire comes in. So Luminaire intervenes at different parts of this process pipeline and makes sure the data flow, and the data that is going from 1 place to another, is good quality and does not have anomalies.
So we do different checks, like checks for volumes, like availability, nulls, completeness, and so on and so forth, to make sure the data within Zillow that's being ingested into the services is healthy enough.
[00:11:24] Unknown:
I just wanted to add, like, that's where we were saying we were trying to bring these standards and processes to Zillow and trying to guide teams on, like, what data health means, what quality means. So that's where we started encouraging teams to focus not just on building the data pipelines. That's not their end goal. It is also making sure that within your pipeline process, you are outputting healthy data. And that's where, as Sayan mentioned, like, we act as an intermediate intervening process where teams leverage this tool.
[00:12:05] Unknown:
Particularly for things like data quality, or if you're in the operational use case and you're doing anomaly detection on maybe system metrics, it can be very easy to accidentally get to the point where you're generating too much noise, because there's a, you know, certain variance in the signal. And so it can be hard to determine if something is meaningful or not. And I'm wondering if you can maybe dig into some of the complexities that are inherent to anomaly detection that are not obvious at first glance and that are difficult to overcome, or that are important for avoiding the case of creating so much noise that people will start to ignore the types of anomalies that are being detected?
[00:12:46] Unknown:
Anomaly detection is a challenging problem indeed. Specifically, like, if you are building an anomaly detection model for a given problem, like a given dataset, you can keep on optimizing it, because you can ingest more data, more information about the data, and you can keep on optimizing your model so that it works best for that dataset. But from a tool or a service perspective, when you're making, like, an anomaly detection service, that is a challenging problem.
Anomaly detection is, like, an unsupervised problem, so it does not really come with labeled data. So that is, like, 1 tricky problem: understanding the performance of the anomaly detector, like how well it is performing versus, like, how badly it is performing. And, also, since Luminaire is a time series anomaly detection tool, it faces the problem of handling nonstationarity, which is like a never ending problem for time series data. And, also, batch and streaming time series anomaly detection are 2 very different problems. We have observed, like, from our past experience, when you start aggregating the data or, like, you start seeing the time series data over different frequencies, the behavior of the data changes a lot. So these are the different issues that you have to keep in mind when you're building time series anomaly detection, or anomaly detection in general, as a service.
And from the actionability point of view, that is also, like, a very important aspect. Because, like, anomaly detection comes with an error rate, because it's a probabilistic solution. So you have to understand, like, if your model misses or fails sending an alert, or there is some issue in the model or in the pipeline and the alert does not reach the end user, what is the cost of that? So that is a very important problem to handle. And, also, like, the time sensitivity, because in many cases, mostly in the streaming use cases, detecting anomalies in time is a very important
[00:15:01] Unknown:
problem. And so in terms of the actual design of Luminaire, how are you addressing some of those problems of being able to solve for the general case while also being able to provide some escape hatches or tuning capabilities for being able to identify some of these special cases or make it fit a particular use case and just managing the flexibility and the breadth of the overall problem space?
[00:15:27] Unknown:
Yeah. So Luminaire is an anomaly detection tool which is supposed to work for wide ranges of problems. So we take different measures. And since, internally, it uses machine learning, we take standard techniques followed in the machine learning literature to process the data, model the data, and then use it for training. So from the beginning, like, we start with data cleaning. We check for nonstationarity and do all the adjustments that you need to do for modeling or dealing with the time series data. And then we get signals from the data itself. And as I mentioned before, since we have less control over the externalities, we use the history of the data in order to incorporate all the information we need in order to build the model for anomaly detection. We check, for example, temporal correlations, like periodic patterns, or sometimes we have seen use cases where data has, like, local nonlinearity.
Those are the things that are incorporated during the model building process. And finally, like, all the steps require some actions and some decisions that the users need to make. That is where Luminaire stands out from the other systems, and Luminaire has a built in configuration optimization feature as well, where the user just comes in and says, okay, like, I would like to monitor my time series. And for a given problem, we optimize the configuration based on the dataset, and that actually brings a lot of automation to the process. This is, like, 1 side of the thing. And another side is, like, how to deal with streaming solutions. Because streaming anomaly detection, or specifically the streaming data that I mentioned, like, behaves differently, we have a different solution. Like, instead of, like, doing, like, a predictive or uncertainty based modeling, we do some sort of, like, baseline matching or density matching, where we do, like, data checks over sequential windows in order to see, like, whether there is an anomaly or not.
So, yeah, these are the typical steps or measures we take to build Luminaire as a successful anomaly detection tool.
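As a rough illustration of the automated configuration and profiling steps Sayan describes, here is a minimal sketch following the batch workflow in Luminaire's README (module paths and argument names are taken from the project documentation and may differ across versions; the daily frequency and CSV input are assumptions):

```python
import pandas as pd
from luminaire.optimization.hyperparameter_optimization import HyperparameterOptimization
from luminaire.exploration.data_exploration import DataExploration

# Assumed input: a daily metric with a datetime index and one value column
df = pd.read_csv("metric.csv", index_col=0, parse_dates=True)

# Let Luminaire search for a configuration suited to this particular series
hopt_obj = HyperparameterOptimization(freq="D")
opt_config = hopt_obj.run(data=df)

# Profile and pre-process the history (imputation, change point adjustment,
# and so on) using the optimized configuration, before any model is trained
de_obj = DataExploration(freq="D", **opt_config)
training_data, pre_prc = de_obj.profile(df)
```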
[00:17:48] Unknown:
And on the subject of the windowing, that can be another challenging optimization problem: determining what the appropriate sizes are for those windows, ensuring that you're actually, you know, choosing the proper bucket for determining whether something is anomalous, and how much information you need to be able to compute that. And then also cases such as seasonality, where something might be anomalous within the previous time window on the order of days or weeks, but on an annual basis it's actually entirely normal. And I'm curious how you handle some of those types of problems.
[00:18:24] Unknown:
For the windowing aspect, this is a very important question, because sometimes, like, anomaly detection, specifically, like, this kind of problem, is, like, context based. So sometimes, let's say, in the middle of the night, you have less traffic, or some data showing very low volume, high variability. So we make sure we consider the seasonalities of this pattern. Like, if it is a repeating pattern, then we consider that by incorporating it into the model. And, also, in terms of determining the size of the window, we take measures where either the user can pick the size of the window, if the user knows what would be the optimal size for their problem, or we are actually working on, like, bringing the automation of measuring the same problem over different window sizes. That means you're basically seeing the problem over different contexts.
We have seen in many use cases that that seems to work pretty well. It's not really implemented inside Luminaire right now, like, changing the window size for a given solution, but we are planning to build it in the future.
[00:19:35] Unknown:
And, also, like, that is definitely a tough problem. Right now, in terms of streaming, if you're receiving data, let's say, every second or every minute, we kind of expect the users, as of what we support right now, to give us that kind of information. And, yeah, that definitely is challenging. But at least for the batch style of processing, that becomes much easier. Like, if the frequency of your data is, like, every day or every week or every month, or even, like, every hour, that is where it becomes a little bit easier to automate that process.
[00:20:22] Unknown:
And in terms of the actual design and implementation of Luminaire, can you dig into some of the internals of the project and some of the ways that the design and goals of the project have evolved since you first began working on it? Yeah. So it initially started from, like, just implementing a basic type of model.
[00:20:42] Unknown:
And we just had the training data, did some basic cleaning, processed that, and did anomaly detection. And we've seen, like, there are many caveats in dealing with, like, anomaly detection in time series data. Specifically, you have to deal with missing data, or you have some change points in the data, which is a very serious problem in time series modeling. So we take different measures for detecting those and making the data ready before it goes to the model building process. Luminaire has 3 main components, which can be used independently or sequentially in order to perform a complete end to end anomaly detection. So that starts with, like, the data preparation and profiling. So in the preparation and profiling part, you can prepare the data to be ready for modeling.
And you can also do some exploration, where you can see the historical patterns and what has changed in the data. So processing in the sense, like, that Luminaire detects change points, and also, like, Luminaire detects trend turnings, which are very useful and, like, sometimes interesting to see in many use cases. Data imputation, if there is missing data, or, like, doing any other adjustment if there is any change point observed. And in terms of the modeling, internally in Luminaire, like, we have different types of models. Some models focus on the forecasting capabilities, compared to some models focusing on, like, the variational and uncertainty patterns where the data has very little signal.
And on top of all of this, we have an optimizer that can optimize the choices given a problem. So if you have data, the Luminaire optimizer can run different scenarios and check which specific configuration fits best for a given problem, and it can suggest the best optimized configuration to you. And on top of that, like, whoever uses Luminaire, they can run a scheduling engine, like, in terms of scheduling the training and the scoring process. Because specifically for time series modeling, you need a very periodic structure of training and scoring, because you don't want to use a very old model to score newer time points. So you have to make sure you are always generating new and recent models at a specific frequency.
And for streaming use cases, it is like a trade off between efficiency versus speed, because in streaming use cases, you want to process the data fast and you want to send the result to the user in a timely fashion. So as I mentioned before, we do, like, a baseline comparison, like a volume comparison or distribution comparison approach, where we compare different time windows, and we take a baseline time window and we do the processing. And, similarly, we have a training and scoring schedule for that. And the scoring process is very lightweight, where the most recent model can be used to pull the relevant baseline and can be used to score that specific window to see whether any problem is there or not.
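For the streaming side Sayan describes, the project README documents a window density model that does this sequential-window baseline comparison. A hedged sketch of that path (class and method names follow the docs at the time of writing; `df` is assumed to be a high-frequency pandas DataFrame of history and `recent_window_df` the window to check):

```python
from luminaire.exploration.data_exploration import DataExploration
from luminaire.model.window_density import WindowDensityHyperParams, WindowDensityModel

# Start from the default window-density configuration
config = WindowDensityHyperParams().params

# Profile the streaming history into sequential windows
de_obj = DataExploration(**config)
data, pre_prc = de_obj.stream_profile(df=df)
config.update(pre_prc)

# Train a baseline over the windowed history, then score a recent window
wdm_obj = WindowDensityModel(hyper_params=config)
success, training_end, model = wdm_obj.train(data=data, **pre_prc)
scores = model.score(recent_window_df)
```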
[00:24:06] Unknown:
And I'm wondering what the motivation was for releasing the project as open source, and any extra effort that was necessary to maybe remove any assumptions about the way that Luminaire was being deployed or used based on how it was being employed at Zillow, and how to make it more general purpose and accessible for people outside of the company?
[00:24:29] Unknown:
So we actually looked for several solutions out there, like, first, when we were trying to solve this problem. And there was no tool we found that solves the problem the way we want. Because, as you mentioned before, actually, the time series model comes with several challenges, dealing with the seasonalities, or dealing with streaming versus batch data, and so on. So that encouraged us to build our own solution. And we wanted to open source Luminaire because, like, whatever we have built, we wanted to contribute back to the community. Because, like, since we have already invented the wheel, we didn't want the wheel to be reinvented, because we are solving a very common problem. Anomaly detection is not a problem unique to Zillow. Even outside, many people are trying to solve the same problem.
And, basically, like, this is an industry standard as well. Whenever you open source something, instead of different people working on the project independently, people can build something on top of each other, on top of your solution, so that we get incremental improvement and, overall, the whole industry benefits from it. And, finally, like, open sourcing helps incorporate a lot of brains. And if we want to have, like, a high quality solution, this is, like, a good way to go.
[00:25:57] Unknown:
Also, 1 of the problems in the initial time that we found was, when we started providing these utility functions to the teams, initially, they had to select what models to run from the suite that we were providing. And within that same model, they also had to specify what parameters they have to set. So this was 1 of the bigger challenges that we found, not just, like, for the ML teams, but even for the data teams. Because we wanted even, like, any generalized data team to benefit from all the sophisticated ML systems.
So that's where we started bringing more models to the suite and making it more sophisticated. And on top of that, we also added the layer of AutoML. So users now don't even have to figure out what models they have to select; that kind of becomes 1 of the parameters to Luminaire. So I would say, like, that is, like, 1 of the key things that has helped a lot of the teams at Zillow right now in solving a lot of the time series problems. Because teams are also not just onboarding just 1 time series that they are interested in. They are onboarding, like, hundreds or thousands of metrics that they care about. And imagine the time a team has to spend figuring out, for this 1 metric, what model should I select and what parameters should I set, and scaling that to thousands of metrics. So that's where we are, like, trying to solve that bigger problem.
And that was the main motivation also to make it open source, because, as Sayan also mentioned, like, nothing we saw was available to solve that use case.
[00:27:47] Unknown:
Python has become the default language for working with data, whether as a data scientist, data engineer, data analyst, or machine learning engineer. Springboard has launched their School of Data to help you get a career in the field through a comprehensive set of programs that are 100% online and tailored to fit your busy schedule. With a network of expert mentors who are available to coach you during weekly 1 on 1 video calls, a tuition back guarantee that means you don't pay until you get a job, resume preparation, and interview assistance, there's no reason to wait. Springboard is offering up to 20 scholarships of $500 towards the tuition cost, exclusively to listeners of this show.
Go to pythonpodcast.com/springboard today to learn more and give your career a boost to the next level. For individuals or teams who are adopting Luminaire, can you talk through the overall workflow of introducing it to an ecosystem or a particular application, and what's involved in actually getting it set up, training the model, and getting it deployed? So,
[00:28:51] Unknown:
ideally, what this open source package is providing you is the brains, which is the models. Now for teams, if they wanna leverage it, their main input is, first of all, the metric that they care about. And that metric basically involves, like, let's say, 2 main columns. 1 is your time column and 1 is your actual metric column. So that kind of becomes the initial input to Luminaire. And then you train it using that historical data. So make sure you provide enough history so you get a proper trained model. And now you have this trained model which is getting outputted. Now you wanna store that model somewhere, and that can be any of your desired storage formats. You can either go with any kind of file storage, or you can even store the object in a database if you like. So that was the main reason for, like, decoupling the model from the storage and everything around it. Once you have this model, now it's time for scoring. So let's say you have a metric which is generated every day or every hour. You can have your scheduling system run at that interval, pull that trained model to score that specific metric, and determine what the score of that metric is. And that's not where the process ends for the user, because scoring will output a few metrics about the scored results.
From there, you also have to figure out, like, for your stakeholders, what is the sensitivity that they are interested in. So if it is, like, highly sensitive, like, only if it is highly anomalous do you wanna get alerted. Or if it is, like, say, 95% anomalous or 99.9% anomalous. So, like, anomalous is different for different users. So that's where users just have to set up what threshold makes more sense to them. And that's how you leverage the output to figure out if this point that we just scored right now is an anomaly or not. So that's how you design your basic training, deploying of your model, and the scoring process.
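Putting Smit's description together, here is a minimal sketch of that train, store, and score loop, continuing the `opt_config`, `training_data`, and `pre_prc` variables from the profiling sketch earlier. `LADStructuralModel` and the `AnomalyProbability` output field follow the project README (in practice the optimizer also suggests which model class to use); the pickle file storage and the 0.99 threshold are illustrative choices, not part of the library:

```python
import pickle
from luminaire.model.lad_structural import LADStructuralModel

# Train a structural model on the profiled history
model_obj = LADStructuralModel(hyper_params=opt_config, freq="D")
success, model_date, trained_model = model_obj.train(data=training_data, **pre_prc)

# Persist the trained model object; any storage works, since the model is
# decoupled from storage (a file here, a database blob in production)
with open("metric_model.pkl", "wb") as f:
    pickle.dump(trained_model, f)

# Later, on the scoring schedule: reload the model and score the newest point
with open("metric_model.pkl", "rb") as f:
    model = pickle.load(f)
result = model.score(1042.0, "2020-06-09")  # (observed value, timestamp)

# Alert only when the score clears the stakeholders' chosen sensitivity
if result.get("AnomalyProbability", 0.0) > 0.99:
    print("Anomaly detected:", result)
```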
[00:31:08] Unknown:
And are there any particular common patterns that you've seen either internal to Zillow or from users of the open source project as to ways that it's being integrated and some of the sort of main use cases that people have been able to benefit from?
[00:31:24] Unknown:
So this model that we provided, and the example that I shared, was just for 1 metric. Okay? And in a larger system, and even within Zillow as well, we have a lot of teams who want to leverage this. And they don't wanna leverage it for just 1 metric as such. They wanna leverage it for multiple of their metrics. So 1 thing that we have developed, like, internally, and what we recommend to the listeners as well, is creating a process where your input is not just 1 metric, but it can be multiple metrics. Let's say, for example, you have some, like, page views by devices of your website. Okay? And you can have multiple devices over there. So you wanna create a process where you view that query or that dataset as an input, and you wanna, like, divide each of these metrics into its own single mini process, then train and store the model object, and then do the same process even for scoring.
So that's where we recommend users to create some kind of mapping mechanism. So, for example, 1 of the dimensions might be a device, like iOS, and 1 of your metrics is page views. So you can create a mapping for that specific metric and assign some kind of unique key, and that's how you can scale your system. So that's 1 of the aspects of scaling. And another thing that we have also seen is the code that is written is pure Python. So if you have a lot of these metrics that you wanna run in parallel, you can leverage some kind of distributed processing.
So, internally, we are using this package with Spark. That helps us train all these individual metrics in your dataset much faster, and score them faster. So it is highly recommended to leverage that kind of distributed processing. And another thing that we have seen is, if you have a metric or a dataset that you have generated, there might not be just 1 team that is interested in that metric. We have seen a lot of other product users or business users are interested in those same metrics. Let's say, again, take the example of, let's say, page views by some devices. Okay.
So there might be a lot of business stakeholders who care about that metric, and they wanna be notified if there are some kind of anomalies. So if each team does that same training and scoring process for the same metric, you are kind of duplicating all those resources. So what we have done is we have just, like, created a process where a user can come in and say, okay, this is that job which is running, and this is that metric that I'm also interested in. And I would like to get alerted if the sensitivity reaches a certain level.
So that way, what we are doing is we just have 1 process which does the training and scoring, but we have another process which maps the result of the scored value to the subscribers and notifies them accordingly. And we have seen that has helped a lot of our teams and stakeholders. So definitely something we would recommend. And 1 last thing is, when we initially started Luminaire, in order to onboard jobs, it was more, like, config driven. Like, people had to create some kind of config form, which they pushed to our repo. And a lot of the nontechnical users were finding it difficult. So we started creating a self-service UI, and we integrated that with our data catalog system.
So there, like, a user can easily come and onboard any kind of job process they wanna do, and also easily specify the alerting thresholds. And we automatically just take care of all the downstream processes, like all the scheduling processes, all the orchestration around it, Airflow, running the jobs in PySpark. So decoupling all these processes and making it self serve has definitely benefited a lot of teams at Zillow, and it's something we would recommend all our listeners to also try to do.
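As a hedged sketch of the distributed pattern Smit describes: parallelize per-metric training with PySpark, keying each series by a unique metric identifier. The `train_one_metric` helper and the example DataFrames are hypothetical stand-ins for the optimize, profile, and train steps shown earlier; only the parallelize/map structure is the point:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("luminaire-batch-training").getOrCreate()

def train_one_metric(metric_key, series_df):
    # Hypothetical wrapper around the optimize -> profile -> train steps above.
    # Returns (metric_key, trained_model) so each result can be stored by key.
    ...

# One (unique key, time series) pair per metric, e.g. page views split by device
metric_pairs = [
    ("page_views:ios", df_ios),          # df_ios, df_android: assumed inputs
    ("page_views:android", df_android),
]

trained_models = (
    spark.sparkContext
    .parallelize(metric_pairs)
    .map(lambda pair: train_one_metric(*pair))
    .collect()
)
```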
[00:35:58] Unknown:
1 of the other interesting aspects of things like anomaly detection and working with time series data is situations where you have some protracted anomalies, such as the current situation with the pandemic, where everything is thrown out of whack and things that might have worked well for your steady state are now difficult to predict because of the constantly changing environment. And I'm wondering how that affects the work of people who are using Luminaire, or who are trying to identify the sources of the outliers in the anomalies that are detected, and just sort of the overall impact of things like the current pandemic on people who are trying to use time series for building meaningful signals that instruct the way that they want to drive their business or manage their data pipelines or things like that?
[00:36:47] Unknown:
I'll repeat what I said before. So anomaly detection is, like, a very contextual problem. So in the context of the current situation, like the pandemic, like, we are kind of living in an anomalous state right now. But the nice thing about, like, the time series models, whether you're using Luminaire or any other tools, is, like, you can tweak them to bring more contextual information into the models. So, for example, you can work on narrower time windows, which is the case for many metrics. Like, for example, like, many operational metrics, they don't depend on, like, long term history, and you can treat them on a local history and detect any problems in that.
But in general, like, the business metrics or other metrics which have a longer context, a longer history, those kind of get more impacted. So in general, like, time series models are very fast to adapt. And when you want an anomaly in the context of a local outlier, like, you're trying to find, like, something that happened today which is independent of the overall context of this pandemic, time series models are pretty good at that. But in general, like, when you are dealing with data of such a different pattern, it's ideal to observe it from different perspectives, involving, like, varying time windows, and understanding the relationship between different time series. Somehow correlating the time series, or correlating the outliers, sometimes helps. And, also, like, doing some multivariate processing of the time series, where you can relate 1 problem with another, helps a lot.
So even though you have such big impactful externalities, such as COVID or, like, this pandemic, I would say, like, yeah, I mean, taking these measures helps a lot.
[00:38:52] Unknown:
And in terms of the specific projects that you've seen built with Luminaire, either internally at Zillow or out in the community, what are some of the most interesting or innovative or unexpected ways that you're seeing it used?
[00:39:05] Unknown:
So in general, any anomaly detection tools are designed to work for, like, detecting anomalies in the data. But, even internally, like, we have seen use cases where teams are using Luminaire to detect, like, different quality metrics related to modeling. That kind of takes Luminaire from being, like, just a data quality tool to, like, a tool that monitors, like, a whole ML product. So, for example, like, we have teams who are using this for tracking model drifts, or any changes in the code that might have introduced bugs which have changed the outcome of the model radically.
So that's where Luminaire comes in. And not only, like, tracking the input side of things, but also tracking the output side of things. And, also, like, there are situations where Luminaire has been used for tracking slow drift, where you see, like, model performance going down for some specific reason, some slow temporal change in the data which causes, like, a slow drift, not a steady jump or drop. So these are, like, some of the interesting cases where we hadn't seen anomaly detection tools being used before, but Luminaire is being used in these cases.
[00:40:28] Unknown:
In terms of the external users, it's been pretty recent that we open sourced the project. So we have not, like, gotten much, like, feedback from the outside world yet on how they have benefited. That's where we are, like, trying to spread the word and see if someone has used it and what they found. Yeah. That is something where we are happy to learn more from them as well.
[00:40:53] Unknown:
And in terms of your own experience of working on the Luminaire project and using it internally for your own projects, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:41:06] Unknown:
There are several aspects to answering this question. Like, first, from a system point of view, like, as I discussed before, like, 1 of the challenges of dealing with, not only anomaly detection problems, but specifically time series problems, is, like, the temporal dependency. Like, time is itself a factor in it. So it is very important to understand this time factor, not only from a statistical perspective, but also from a system perspective. Because that's where we make sure, like, we are processing the right anomalies and we are sending the optimal number of alerts to the user.
So from the system perspective, it is very important to find the sweet spot for scheduling the training process. Because you want to train your model frequently, but not so frequently that it will increase the cost of resources. But, also, like, you don't want to train it very infrequently, so that the model starts degrading. And, also, we have a concept of model TTL, like, time to live, where we publish a model that comes with an expiry. That means, like, if you have published a model, it will expire at some point. That is deliberately done, because we don't want to use a very old model to score a recent data point.
So these are the different things from the system side of things you need to understand when you're dealing with time series data. Now, from the alerting side of things, like, from the user's point of view, we have seen challenges in, like, setting up the post processing, like, how do we process the output from Luminaire? So in the context of a user, sometimes we send an alert which the user finds kind of obvious, because the context the model is trained on is different from the context the user is looking at the model in. For example, like, you are seeing a metric be continuously at 100 for, like, the last few months, and suddenly you see it be 101.
And at that time, like, you will get an alert. But from the user's perspective, the user might think, okay, I don't really care about 100 being 101. I want to know, like, if it goes to 100,000. So these are the different challenges we have seen, like, for someone coming from a non ML background, to see this kind of problem from a different angle. And from a probabilistic perspective, like, anomaly detection is a probabilistic classification, which comes with an error rate. So it's very important to understand as well, like, what metrics you want to monitor. As Smit mentioned before, like, there are users, like, who come with a task with lots of metrics, and it is very important to understand which are the important metrics to monitor. Because if you onboard very noisy metrics, or very unimportant metrics that usually generate noise, you tend to get more alerts from those metrics rather than from the stable ones, which you think are more important to monitor.
So for this reason, we, like, internally, like, have smart alerting processes where we do some alert throttling, muting, and grouping, so that we make sure, like, we send the right amount of alerts, so that it does not really create alert fatigue where the user starts ignoring the alerts, but also, like, it does not suppress an important alert.
[00:44:38] Unknown:
I would just like to add 1 thing over here that we learned from the initial stages. Initially, we were just sending out alerts for that 1 point itself. So what was happening is, like, users were not always having the context of, like, why this point is treated as an anomaly. So they always wanted to know the previous trend along with it. So that was, like, 1 of our interesting and challenging things. So what we started to do is we started showing not just the recent data point that we scored, but also, like, the last bit of history of what was observed.
So showing that context, and showing the anomaly along with it, helped users a lot to immediately make the decision: oh, okay, this is why it is an anomaly. Like, sometimes the significant dip or drop might not be obvious, but then it becomes very obvious when you show them the current point and some history with it.
[00:45:44] Unknown:
For people who are evaluating Luminaire, or trying to understand if anomaly detection is the right tool for their particular problem, what are the cases where Luminaire is the wrong choice?
[00:45:55] Unknown:
When I introduced Luminaire, we mentioned, like, Luminaire is an anomaly detection tool which works for wide ranges of time series data, and it also brings automation. That means we take historical information patterns or, like, different structures in the time series to understand what is an anomaly versus what is not. But if a user has more information about the data, like, more information about the externals of the data or, like, more context on the data, then building a more feature based model will make more sense. So in that context, like, for example, if the user knows, like, there will be a major release which might increase the traffic by March, then incorporating this information would reduce an alert, which is a good thing, because the user already knows, if the alert is coming, why the alert is coming. This is, like, a very important aspect: like, if you have more information on the data, then build a feature based model.
And, also, it's important to understand whether you have the resources to build a feature based model. If you have the resources, and if you have more context on the data, I would say it's a better option to go beyond Luminaire.
[00:47:07] Unknown:
This also kind of ties in with the challenging part of it as well. Like, sometimes, if people are alerted, they want to get more context on why something is an anomaly. And that's where, like, their teams have to later on dig into it. And that is more, like, outside of the scope of the package right now, but that is something teams should think more about. Like, if something is flagged as an anomaly, trying to give them more context, if possible, on why we think this is an anomaly. Like, not just specific to the models or the characteristics, but whether there are any external reasons that might have caused it. And that's 1 of the very challenging problems.
[00:47:48] Unknown:
As you look to the future of the project, what are some of the plans that you have for the near to medium term or any areas of contribution that you're looking for help from the community?
[00:47:58] Unknown:
Mostly, we are now, like, working on the streaming anomaly detection model, and at this moment it's at a very initial stage. We have open sourced the first version of it, but we would like to build and bring more automation and more sophisticated features into it, to bring more end to end processing, like, where the user would need less configuration. Also, like, from the context of where Luminaire is a wrong choice, as Smit mentioned, giving the context of an alert would bring more insight on why someone is getting an alert. So from that perspective, diving in and doing some root cause detection, like, doing some data driven context extraction, is a very important part. And we actually published a paper last year about root cause detection and how we are planning to do that inside Luminaire.
So that is 1 key part. And in terms of improving the existing anomaly detection model, we are planning to incorporate more sophisticated models into the system. And, currently, we do optimization, but we are planning to do some voting or, like, some sort of bagging approach, in order to identify if something is anomalous versus nonanomalous. Because, like, having an ensemble approach to dealing with the classification model would bring more reliability. So that is something we are planning for the future.
[00:49:26] Unknown:
Are there any other aspects of the Luminaire project or the problem space of anomaly detection and dealing with time series data that we didn't discuss yet that you'd like to cover before we close out the show?
[00:49:37] Unknown:
So right now, what we internally kind of do is, like, have a fixed frequency when the training runs. Okay? But that is also not a scalable solution. You actually wanna trigger tuning when your model is actually degrading. So you wanna tune your model based on your past scoring results, taking that into effect. And only if it degrades would you retune it. Yeah.
[00:50:06] Unknown:
So, basically, this is like an internal tool that is not in the open source project right now. So we talked about the automation part, but this is more of a self awareness part, where it reduces the maintenance cost as well. Because when you onboard an anomaly detection model and you want to monitor something, you have to continuously check whether the model degrades and when you want to retune. Because you don't want to retune at every stage, as that kind of makes things less reliable and also, like, increases the computation cost, because, like, the tuning part is pretty expensive. So what we do is, like, whenever we score, we store some model performance metrics. And every time the retraining schedule comes around, we check whether the model performance starts degrading, with different voting methods. Like, we take different measures of measuring a model. And if we see, like, okay, this model should be retuned, we trigger a retuning. So that means, like, it's kind of a complete loop of a fully automated ML system.
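Since this self-aware retraining loop is internal to Zillow rather than part of the open source package, the following is purely a hypothetical sketch of the shape Sayan describes: record a lightweight performance signal at scoring time, then check for degradation at each retraining schedule before paying for an expensive retune. The in-memory store and the alert-rate proxy are illustrative assumptions:

```python
from collections import defaultdict, deque

# Hypothetical in-memory store of recent anomaly probabilities per metric;
# a real system would persist these alongside the scored results
score_history = defaultdict(lambda: deque(maxlen=100))

def record_score(metric_key, anomaly_probability):
    """Store a lightweight performance record every time a point is scored."""
    score_history[metric_key].append(anomaly_probability)

def should_retune(metric_key, alert_threshold=0.99, degradation_rate=0.3):
    """At the retraining schedule, decide whether the model needs retuning.

    Retuning is expensive, so trigger it only when a crude degradation proxy
    (here: the fraction of recent points flagged anomalous) crosses a limit.
    """
    recent = score_history[metric_key]
    if not recent:
        return False
    alert_rate = sum(p > alert_threshold for p in recent) / len(recent)
    return alert_rate > degradation_rate
```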
[00:51:14] Unknown:
That's very cool. So for anybody who wants to follow along with either of you and get in touch, I'll have you each add your preferred contact information to the show notes. And so with that, I'm going to move us into the picks. And this week, I'm going to choose a tool called FlakeHell, which is a wrapper around the flake8 linting utility, which gives you the ability to maintain your configuration in the pyproject.toml, as well as better maintenance of plugins, and determining which errors you want to have reported, and in which contexts or for particular paths. So just a great convenience utility on top of flake8. So definitely recommend checking that out for maintaining code quality. And so with that, I'll pass it to you, Smit. Do you have any picks this week?
[00:51:57] Unknown:
Like, over, like, the last couple months, I have been, like, looking into this tool called Apache Ranger, which is an open source tool. And it kind of provides you a way to manage or control data authorization. And the reason I find it very interesting is because, as companies are becoming more data driven, a lot of data is getting generated every day, like, every minute, actually. And now a lot of teams also need access to this data. So how do you control that? That's 1 of the interesting projects I found, and that will be my pick. And, Sayan, do you have any picks this week?
[00:52:41] Unknown:
So I would like to pick a book I recently read: Prediction Machines: The Simple Economics of Artificial Intelligence. I found this book interesting because, like, it talks about machine learning and AI and predictive modeling from a broader aspect. So specifically those who are working on the very technical side of machine learning, like the machine learning practitioners or even the data engineers, this book should be very interesting for them, because it gives a broader picture, like, from a business and strategy point of view. So I highly recommend everyone read this book.
[00:53:17] Unknown:
Well, thank you both very much for taking the time today to join me and discuss the work that you've done on the Luminaire project. It's definitely a very interesting tool and 1 that I plan to take a look at myself. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you, Tobias, for having us on your show.
[00:53:35] Unknown:
Thanks, Tobias, for having us on the show and for all your time. Thank you.
[00:53:42] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Guests' Background and Introduction to Python
Overview of Luminaire and Its Origin
Luminaire's Unique Features and Use Cases
Motivation Behind Luminaire
Development and Evolution of Luminaire
Current Uses of Luminaire at Zillow
Challenges in Anomaly Detection
Design and Implementation of Luminaire
Handling Windowing and Seasonality
Internal Design and Components of Luminaire
Open Sourcing Luminaire
Adopting Luminaire: Workflow and Integration
Scaling and Use Cases of Luminaire
Impact of the Pandemic on Anomaly Detection
Interesting Use Cases of Luminaire
Lessons Learned from Developing Luminaire
When Luminaire is Not the Right Choice
Future Plans for Luminaire
Closing Remarks and Picks