Summary
Feature engineering is the process of transforming and combining input variables into more meaningful signals for the problem that you are trying to solve, and nearly every machine learning project has to start with it. This work often leads to duplicating code from previous projects, or to technical debt in the form of poorly maintained feature pipelines. To make the practice more manageable, Soledad Galli created the feature-engine library. In this episode she explains how it has helped her and others build reusable transformations that can be applied in a composable manner in your scikit-learn projects. She also discusses the importance of understanding the data that you are working with and the domain in which your model will be used to ensure that you are selecting the right features.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Soledad Galli about feature-engine, a Python library to engineer features for use in machine learning models
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what feature-engine is and the story behind it?
- What are the complexities that are inherent to feature engineering?
- What are the problems that are introduced due to incidental complexity and technical debt?
- What was missing in the available set of libraries/frameworks/toolkits for feature engineering that you are solving for with feature-engine?
- What are some examples of the types of domain knowledge that are needed to effectively build features for an ML model?
- Given the fact that features are constructed through methods such as normalizing data distributions, imputing missing values, combining attributes, etc. what are some of the potential risks that are introduced by incorrectly applied transformations or invalid assumptions about the impact of these manipulations?
- Can you describe how feature-engine is implemented?
- How have the design and goals of the project changed or evolved since you started working on it?
- What (if any) difference exists in the feature engineering process for frameworks like scikit-learn as compared to deep learning approaches using PyTorch, Tensorflow, etc.?
- Can you describe the workflow of identifying and generating useful features during model development?
- What are the tools that are available for testing and debugging of the feature pipelines?
- What do you see as the potential benefits or drawbacks of integrating feature-engine with a feature store such as Feast or Tecton?
- What are the most interesting, innovative, or unexpected ways that you have seen feature-engine used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature-engine?
- When is feature-engine the wrong choice?
- What do you have planned for the future of feature-engine?
Keep In Touch
- @Soledad_Galli on Twitter
- solegalli on GitHub
Picks
- Tobias
- Soledad
- The Social Dilemma
- Don’t Be Evil by Rana Foroohar
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- feature-engine
- Feature Engineering
- Python Feature Engineering Cookbook
- scikit-learn
- Feature Stores
- Pandas
- PyTorch
- Tensorflow
- Feast
- Tecton
- Kaggle
- Dask
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Soledad Galli about feature engine, a Python library to engineer features for use in machine learning models. So, Soledad, can you start by introducing yourself?
[00:01:08] Unknown:
Hi. Thanks. I'm Sole. I'm a data scientist, an instructor of machine learning courses, and an open source Python developer.
[00:01:16] Unknown:
And do you remember how you first got introduced to Python?
[00:01:19] Unknown:
Yes. I think I do. It happened about 4 to 5 years ago, when I decided to leave science and academia and retrain myself as a data scientist. So when I was looking into what was needed to become a data scientist, I quickly learned that I needed to learn either R or Python, and ideally both. So that was probably the first time in my life that I heard about Python. I started studying R because at the time most of the online courses were based on R, and also because R is more of a statistical software, so it was more familiar to what I was used to back in the day. But then I landed my first job at a consulting company.
And I was encouraged to learn Python because they thought that it was more widely used and also more useful. So while I was waiting for my first assignment in this organization, which in consulting jargon is called being on the bench, I was trying to quickly get up to speed with Python through books and online courses.
[00:02:22] Unknown:
And so that brings us to the feature engine project that you created, and I'm wondering if you can talk through some of the capabilities that you're providing with that and some of the story behind how you ended up building this project.
[00:02:34] Unknown:
Yes. So feature engine is an open source Python library that works extremely similarly to scikit learn, in the sense that the classes that we have in feature engine use the method fit to learn parameters from the data and then the method transform to go ahead and transform the data. Feature engine hosts a lot of transformations to, for example, impute missing data, or to transform categorical variables that are strings into numbers, or to apply the most widely used transformations of variables. And it also includes some methods to select features that are currently not available in other libraries.
So the idea of feature engine came up because I worked in various companies and on various projects, and I realized that I was actually coding and recoding the same transformations over and over again, project after project. And I thought at the time that it would be great if we actually had a library like scikit learn, where you can pick and choose whatever machine learning model you want to try on your data, but instead of picking and choosing models, we could pick and choose the data transformations to apply to the features. So I think this was 1 of the main drivers that got me into going ahead and designing feature engine.
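As a rough sketch of the fit/transform pattern described here: the class and module names below follow feature-engine's documentation for recent releases, but older versions organized the modules differently, so treat the exact import path as an assumption and the data as purely illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# A toy dataset with missing values in two numerical columns.
data = pd.DataFrame({
    "age": [25, None, 40, 31, None, 52],
    "income": [30_000, 45_000, None, 52_000, 61_000, None],
    "city": ["London", "Paris", "London", "Madrid", "Paris", "London"],
})

X_train, X_test = train_test_split(data, test_size=0.3, random_state=0)

# fit() learns the parameters (here, the median of each listed variable)...
imputer = MeanMedianImputer(imputation_method="median", variables=["age", "income"])
imputer.fit(X_train)

# ...and stores them inside the transformer, so transform() can reuse them
# on new data, e.g. the test set, exactly as learned from the training set.
print(imputer.imputer_dict_)          # learned medians per variable
X_test_t = imputer.transform(X_test)  # returns a pandas DataFrame
```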
[00:04:02] Unknown:
As far as the overall process of feature engineering, I know that it's necessary for being able to have some sort of aggregate information that is fed into the machine learning model, because the base data that you're working with is not necessarily going to have the right set of inputs or signal to be able to generate useful predictions. And I'm wondering if you can just talk through some of the complexities and challenges that are inherent to the process of feature engineering and the amount of domain knowledge and familiarity with the data that's necessary to be able to do that effectively?
[00:04:38] Unknown:
You actually mentioned 1 of the main challenges, in my opinion, regarding feature engineering, which is the amount of domain knowledge and familiarity with the data that you need to have. Because to extract real value from the features, you are much better placed if you actually know what your data is, what each variable is telling you, what information the variables are giving you. And my experience working in organizations is that trying to understand what each variable is is sometimes really hard. And even the people that give you the data don't really know. So, like, trying to get a dictionary of the variables is really hard. So you end up working a little bit in the dark, and I find that is a problem for a few reasons. First of all, if you don't really know what the variable is, then you can't really make an educated decision on how you're going to transform it, or what you can do with your variable.
Can I use this variable to begin with? And while that sounds perhaps not so relevant for model performance, if we're going to use these models to serve people, it is actually very crucial that we know what we're feeding into our models, because our models are going to make predictions with that feature and we're going to make decisions at the back of those predictions. So I think, yes, understanding what your variables are, what you're going to give to your model, is 1 main complexity or 1 of the biggest challenges, and it's not perhaps entirely tied to how I'm going to transform that feature; it's more of a prerequisite, if you want.
Some of the other challenges, on more of the technical side, have to do with time. Feature engineering is very time consuming. Getting to know your data is very time consuming. The quality of the data sometimes is not good, so you spend a lot of time trying to get the data into a shape in which you can actually start working with it. It is repetitive, as I said previously: you find yourself doing the same transformation over and over, variable after variable, feature after feature, and project after project. And something that is perhaps more important is that your colleague is also doing the same data transformations in their project as well.
And this ends up with different versions of the same code that produce the same functionality. And this brings us to the problem of reproducibility and traceability. If we both coded the same functionality with different lines of code, then we have different versions of the same functionality, and then we don't know which 1 is the last version. So how are we going to maintain that as a team? So I think it's kind of important to start creating joint libraries, or using open source as well, where we can actually, as a team, tap into the same functionality and build on the same functionality, instead of having different sources that are then very difficult to maintain, and where it is very difficult to document which 1 was the 1 that had the bug and which 1 didn't.
When we create features, sometimes, depending on our application, we need to store the features. And this is particularly relevant when we have huge streams of data coming in, because our models are often not going to use that live stream to make our predictions. We want to actually reduce the dimension of that enormous amount of data into something that we can actually store and quickly read in order to feed to our models. So this is a problem that some feature stores are starting to provide solutions for. And then I think there is the problem of explainability, like I was kind of mentioning at the beginning.
Can I use this feature to begin with? And then when I create my new features or when I transform my features, for example when I'm encoding categorical variables, do I understand the output of the transformation, and can I track it back to the original variable? And this is particularly important when we're going to use models in organizations that are highly regulated, or when we're going to make predictions that are going to affect the lives of people, or not the lives, but their well-being and these kinds of things. Because first of all, we need to be able to explain why we make our decisions, and these decisions are at the back of the model. So we need to be able to explain what the models are telling us. But more importantly, we also want our models to be fair.
So I think this is perhaps 1 of the most important issues in my opinion.
[00:09:42] Unknown:
As far as the overall process of feature engineering, you mentioned that before you created the library, you found yourself repeating a lot of the same operations. And I know that it also becomes complex when you're trying to collaborate with multiple data scientists on a team, because everybody might have their own different way of achieving the same outputs of the transformation, or they might be doing what they think are the same transformations but in slightly different ways, so they get diverging results. And I'm wondering what are some of the complexities and issues of technical debt that you've encountered in the process of feature engineering and model development, and just some of the incidental complexity that comes in when you're trying to explore particularly complicated datasets or maybe complex domains?
[00:10:28] Unknown:
Yes. I think in terms of collaborating with some of my colleagues, the thing that was perhaps most problematic is that we are all working in Jupyter Notebooks, because it is extremely convenient to kind of see what you're doing. You transform the data, and you can immediately visualize what you're doing. But then tracking or versioning a Jupyter notebook is really hard. And when you run it over and over, you kind of lose which 1 is the last version, which 1 is the new version. So I think this is 1 of the main problems, the lack of traceability, almost the impossibility of having some sort of versioning guaranteed in a Jupyter notebook.
That's, yeah, 1 of the problems in the way that we actually work. Incidental complexity, I think I had more experience with that in feature engine than with the Jupyter notebooks. The problem is that you end up copying code and then you have, like, 20 versions of the same thing. But with feature engine, what happened was that I was learning, or I was becoming, a Python developer when I started the project, and actually this was as well 1 of the main motivations to develop feature engine, to become a better Python developer. So what I did back in the day was to put all the classes in the same Python script.
This was the first version of the package back in the day. But then I had the fortune to have a very good colleague, Chris, who is an amazing Python developer, and I really admire him. And he volunteered to have a look at feature engine. And his feedback was that the code was alright, but he wanted to chop this code into pieces so that it would be easier to maintain. And then I did that, but that was a lot of work that I had to do just to restructure the project. And then, in order not to affect how users use the package, I needed to start making modifications in the init file so that I could keep the imports as much as possible as in the original version, not to have changes that break the functionality between versions. So I think, going back to the question, 1 of the main problems of incidental complexity or technical debt is the fact that you may need to do massive refurbishments at some point.
Otherwise, it becomes very hard to maintain, or you may end up having to introduce changes that break the flow between 1 version and the other. And that affects user experience, if you want, because they need to relearn how to use the library, and then newer versions will not do what they are used to. So, yeah, I think the biggest problem is maintaining the code moving forward. And in my opinion, you want to try and remove incidental complexity or technical debt as soon as you see that it's piling up, because otherwise it becomes so ingrained that it becomes really difficult to remove.
[00:13:48] Unknown:
As far as the ecosystem of tools for doing feature engineering and building these pipelines, I'm wondering what you have seen as far as what is present in the ecosystem, and what did you see as being missing when you decided to build your own library, and maybe some of the capabilities that you've borrowed from other approaches and some of the ideas that you have built into feature engine that you think should be more widespread across these tools?
[00:14:18] Unknown:
Back in the day, when we launched feature engine for the first time, I think that the main library that we were using for feature engineering was pandas, because most of us would work with data frames, and pandas has the beauty that you can transform your data and then visualize it at the back of it. So it's very, very convenient. So we were all doing that, but then the problem is that pandas doesn't really store the parameters that you learn, and you need to pretty much code any transformation by hand. And then if you want to apply the same transformation to a lot of variables, very often you find yourself doing these horrible loops.
So I think back in the day, basically, a library for feature engineering, ideally with the same functionality as scikit learn, was missing. At the same time that we launched feature engine, scikit learn started to accommodate some of those transformations within the API, so this was amazing. But scikit learn has a little bit of a different design, because in scikit learn the classes will transform the entire dataset. And if you want to narrow down the transformations to a certain group of variables, which is very often what we do when we transform variables, you need to use another class.
And also, scikit learn returns NumPy arrays. And NumPy arrays you need to transform back to data frames if you want to do visualizations, for example. So I found that it was not extremely convenient. So I thought that something that, you know, takes in a data frame, returns a data frame, and also allows you to apply transformations to specific variable groups was much needed, or at least I was looking forward to something like this. And then also, we needed a library that included a great variety of feature transformations, something where we can apply not only just the mainstream data imputation techniques, but also some data transformations that are used, for example, in data science competitions and are a little bit more novel and different.
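A short sketch of the "data frame in, data frame out, restricted to selected variables" point made here. The encoder class and its encoding_method and variables parameters are taken from feature-engine's documentation, but the exact module path is version dependent, and the column names are entirely hypothetical.

```python
import pandas as pd
from feature_engine.encoding import OrdinalEncoder

df = pd.DataFrame({
    "colour": ["blue", "red", "blue", "green"],
    "size":   ["S", "M", "L", "M"],
    "price":  [10.0, 12.5, 9.0, 11.0],
})

# Only the listed categorical variables are encoded; 'price' passes through
# untouched, and the result is still a pandas DataFrame, so you can keep
# exploring it or feed it straight into the next transformer.
enc = OrdinalEncoder(encoding_method="arbitrary", variables=["colour", "size"])
enc.fit(df)
df_t = enc.transform(df)

print(type(df_t))          # <class 'pandas.core.frame.DataFrame'>
print(enc.encoder_dict_)   # learned string-to-integer mappings per variable
```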
[00:16:38] Unknown:
And then as far as the actual work of doing the feature engineering, I'm wondering what are some of the types of domain knowledge that are necessary to be able to understand what features are going to be useful in the model and just some of the process of identifying which variables to combine, what types of transformations are necessary, and just some of the workflow of actually building a set of features to input into the actual model training process?
[00:17:07] Unknown:
I can probably give a few examples from the domains I've worked in. I think if you've been for a while in 1 space and you actually understand your data and its behavior as well, then you are better placed to produce features that will actually help you and your models make more accurate and fair predictions. Just to pick an example, if we're talking about credit data: credit agencies, for example, collect a lot of information from financial institutions, which have to provide data to the credit agencies in order to then be able to receive data from these credit agencies. And the data has the form of the balances in bank accounts, or the payments to credit cards or to loans, and that kind of thing. And that is month after month, customer per customer. And you wouldn't actually use that in your models, but credit agencies have been working on these variables for so long that they now understand what aggregated views of this data they can provide to their customers. For example, the number of payments to credit cards in the last 3 months or in the last 6 months, or the maximum debt, or the minimum number of payments, or whether the customer has defaulted in the last 3 months, or something like that. So these are things that you actually know after working in the field for some time. So if you work giving loans, for example, you begin to know the variables that correlate more with the likelihood that the customer will actually fulfill their commitment to repay the loan, and then you understand that, for example, the income to debt ratio, or what we otherwise call the disposable income, is quite an important variable. So then you have 2 variables and you know that you need to combine them to calculate the disposable income. Or the total debt that the customer has is perhaps a good indication, and then you know that you need to sum the debt across all the variables.
If we're talking about insurance, for example, sometimes when a claim comes in and the insurance people fill in the forms, we have these forms where they have to tick boxes on a picture of the car. So you have a picture of the car, and then the person that is handling the claim needs to tick, like, the door or the roof or the light, all the pieces where the customer is saying that they had an accident or that things have been broken. And that then comes to us as binary variables, and then we have a broken light or a broken mirror, and then you can add all those up to create a picture of the total damage, for example. Those are some examples of things that you get to know when you are familiar with the data and also with how the data is collected.
Something else that I found interesting, for example, is that when we're creating models for insurance claims, you take the car fuel as a variable, and eventually we started having cars that were actually electric. So we started having electric cars coming into our models. And it's like, what is our model going to do with this? We have no clue, because we don't have this in our historic data. This is just a very new thing. But then you need to know that this now exists, so then you can make a decision about how you're going to process, or maybe not process, these types of applications.
These are some of the examples.
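As a concrete sketch of the kind of aggregated credit features described above: the column names and numbers are entirely hypothetical, and the aggregations (payments in the last 3 months, maximum debt, a debt-to-income style ratio) are only meant to illustrate the idea of collapsing a monthly, per-customer stream into model-ready features.

```python
import pandas as pd

# Hypothetical monthly, per-customer records from a credit data feed.
payments = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month":       ["2021-07", "2021-08", "2021-09"] * 2,
    "cc_payment":  [120, 0, 200, 80, 90, 0],
    "balance":     [1500, 1620, 1450, 300, 260, 310],
    "income":      [2500, 2500, 2500, 1800, 1800, 1800],
})

# Aggregate the raw monthly stream into per-customer features such as the
# number of credit card payments made, the maximum balance carried, and a
# simple debt-to-income ratio combining two of the variables.
features = payments.groupby("customer_id").agg(
    n_payments_3m=("cc_payment", lambda s: (s > 0).sum()),
    max_balance_3m=("balance", "max"),
    mean_income=("income", "mean"),
)
features["debt_to_income"] = features["max_balance_3m"] / features["mean_income"]
print(features)
```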
[00:20:46] Unknown:
Because of the fact that you are performing all these transformations on the input data and you're using your particular domain knowledge and your understanding of the problem domain and the solution that you're trying to provide, you know, where you might be imputing missing values or removing missing data or combining 2 different variables to create a new input, I'm wondering what are some of the potential risks that you are taking on by doing all these different transformations and, you know, the potential for invalid assumptions about the impact that those might have on the resulting model and just some of the ways that you can identify those risks in the process of building up the transformations and then be able to revisit them as you iterate on the model development process?
[00:21:34] Unknown:
I would make a distinction here between 2 things. So first, when we talk about meeting the assumptions, I think this is extremely important when we're going to use those variables in statistical tests, for example. Because when we apply a statistical test, these tests very often make a lot of assumptions about how the data needs to look and how the data was originally collected. And if our data does not fulfill those assumptions, then the answers that those tests give us may or may not be accurate. So they may or may not be reflecting whether the difference that we see is actually a difference. So then we can't really be confident when we derive conclusions at the back of those statistical tests.
And moving on to the field of machine learning, I think if we don't use optimal, so to call them, transformations, the worst thing that can happen is that the model loses some performance. And how bad that is, I think, kind of depends on what the model does and who the model is serving. I think it is important to know that if we're going to train a linear model, our variables need to be linearly related to the target, and if they are not, we're better off using a model that doesn't make that assumption. So I think it's important to know what the model is assuming, and then whether the variables are actually fulfilling that, because if not, as I said, the model will not perform as well.
And having a model that doesn't perform well because of feature engineering is a little bit of a missed opportunity, because we could be doing other transformations that improve the performance. And the performance of the model will translate into various things depending on what we're trying to use that model for. For example, it could translate into customer satisfaction. It could translate into revenue for the company. So in those cases, I would argue that lower performance is not so dramatic. But in other cases, for example, when we're building models to assess health, to predict the prognosis of a disease, or when we are building models to decide who is going to receive a visa and who is not.
Or models that qualify how good a teacher is, or who gets access to a university. Then I think it becomes more important to try and understand whether our model is performing well, and if not, why not, and particularly whether our model is being fair. I think more and more we're starting to hear stories about biased models that end up negatively affecting some sectors of society because they were not trained on the correct variables. So we're not using the variables that we should be using, or the variables are not really good proxies to approximate what we actually want to approximate with that model.
So going back to how we know about that, I think once we train our models, it is very important to understand which feature is driving the decision and why that feature is driving the decision, which boils back to the point of understanding what my feature is telling me and whether I can use that feature or not. And then when we evaluate model performance, I think it is kind of important to try and see if our model is being fair for all the sectors of society, or all the sectors of my customer base that I am serving, or if the model is favoring some groups over others, for example.
This you do with a lot of research at the back of the models and the predictions that these models produce.
[00:25:45] Unknown:
Digging now into the feature engine project itself, can you talk through some of the ways that it's implemented and the overall design goals that you had in mind as you were iterating on the implementation of it, and some of the ways that it has changed and evolved since you first began working on it? So the intention of feature engine is to work as much as possible like scikit learn. So in fact,
[00:26:09] Unknown:
we inherit from 2 of the main transformer classes in scikit learn, which serve as the skeleton of the class and provide much of the scikit learn functionality already under the hood, in the sense of how you set your parameters and then how you retrieve the parameters of the class. And then the important bits that we mostly work on are the fit method and the transform method. So in the fit method, we have all the functionality that will learn the parameters for the feature transformation, for example simple things like the mean or the median to impute, or the mappings from strings to numbers. But when we're doing feature selection, then we have a whole logic to run models, evaluate features, and then store the features with high performance.
And then in the transform method, we basically transform the data based on the parameters that we learned during fit. So I think the main implementation of feature engine is very straightforward. Originally, I kind of envisioned feature engine to fill this gap that I was feeling was there when I was working in an organization. So I wanted a library that already had all of this functionality built in and that also stored the parameters within the class, just like scikit learn in that sense. When you apply a model from scikit learn, it will learn coefficients, it will learn the divisions of the decision tree, or how close the different observations are if you're doing nearest neighbors, and it will store all that information within the class.
Feature engine classes do exactly the same. They will store all the information. And, originally, I was focused on building classes that help transform data in a way that your features at the end of the day are still explainable and interpretable by a person. So you can apply a transformation and you know exactly what you're doing and you can also revert back to the original data. So if tomorrow you need to explain why your model made that decision, you can go back to the original data also with feature engine and try to understand that feature. So they are human readable.
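A minimal sketch of the transformer pattern described here, written against plain scikit-learn base classes rather than feature engine's actual internals (the discussion suggests feature engine inherits from scikit-learn transformer classes in a similar way); the class and attribute names are illustrative only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MedianImputerSketch(BaseEstimator, TransformerMixin):
    """Illustrative transformer: fit() learns medians, transform() applies them."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn the parameters from the training data and store them
        # inside the class, as described in the interview.
        self.imputer_dict_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # Apply the stored parameters to new data, returning a DataFrame.
        X = X.copy()
        for var, value in self.imputer_dict_.items():
            X[var] = X[var].fillna(value)
        return X

# Usage: the learned medians travel with the fitted object, so the
# training and test sets are imputed with exactly the same values.
train = pd.DataFrame({"age": [20, None, 40], "salary": [1.0, 2.0, None]})
test = pd.DataFrame({"age": [None, 35], "salary": [None, 3.0]})
imp = MedianImputerSketch(variables=["age", "salary"]).fit(train)
print(imp.transform(test))
```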
So that was the original vision. But as feature engine becomes adopted by the community, I've noted that people want to also have in feature engine functionality that they use, for example, in data science competitions. I think they kind of want to speed up the way they transform data using these transformations that perhaps do not return the most understandable feature or feature that makes business sense, but that they would use in the data science competition or something like this. So I think we're kind of steering a little bit into that direction because users want it, and I find it very exciting that people want to actually use feature engine. So I think we're departing a little bit from the original idea of producing features that are fully explainable into producing features that are used and accepted by the community.
But if we think that they are not super interpretable, or that this poses a risk when you're using the feature to serve people, we make that absolutely clear in the documentation, or at least this is what I would like to do. Like, I want it to be really clear that it might be risky.
[00:29:44] Unknown:
As far as the feature engineering process, you mentioned that feature engine is working to be very compatible with scikit learn. And I'm wondering what you have seen as far as either the requirements or just the overall workflow of building features for scikit learn as your machine learning toolkit versus using maybe a deep learning approach with tools like PyTorch or TensorFlow, and if there is any difference, or sort of what those divergences might be depending on the style of machine learning that you're doing. Two main differences, perhaps, in my opinion. The first 1 is
[00:30:22] Unknown:
explainability versus not so much explainability. As in, when you work with scikit learn, I could argue that you can still interpret the decisions that the models make. So you can interpret the decisions of linear models. You can certainly interpret the decisions of tree based algorithms and nearest neighbors. So you can try and understand why the model made that decision, provided you understand what your features are. So I think, yeah, this is, and most likely is going to remain, the main choice for building models in organizations that actually need to justify why they make decisions.
And the other main difference is that using deep learning with libraries like PyTorch, TensorFlow, and so on makes sense when you have a huge amount of data. So deep learning becomes competitive when you have an enormous amount of data. If you have little data, I don't think it will beat an off the shelf algorithm like a boosted machine. And then, again, the models in scikit learn are not really ready to cope with the ginormous amount of data that deep learning is ready to cope with. So I think they serve different purposes.
[00:31:47] Unknown:
As far as the workflow of building the features and training the model and then maybe revisiting models that you've worked on to either tune them for better performance along whatever axes you're trying to optimize for, or if you are, you know, joining a team that has a set of models with the feature transformation pipeline in place. I'm wondering what are some of the ways that you have found useful for being able to build the feature engineering pipeline and determine which transformations to create and then maybe embedding some of the reasoning and context behind those choices in the code that you're writing so that people who are either revisiting it, you know, maybe your future self or people who are new to the team or new to the project to be able to help them know why certain transformations are being made without having to do all of that exploration and gaining of domain knowledge to be able to be effective?
[00:32:47] Unknown:
Yes. That's a good 1. And I'm not really sure I personally have made a lot of progress on that front, if I have to be fair. I think the workflow normally goes like this: you get the data, you try to understand what the data is, what the variables are, whether you have redundant data, how good the quality of the data is. So there's a very big portion of data exploration where you try to understand: do I have duplication? Are my variables correlated? That kind of thing. Then there's probably some sort of iteration between building a model, deriving feature importance, trying to understand which are the most important features, and whether a data transformation that maybe you did is actually making the feature more or less predictive, or how it is affecting the performance of the model. So there's a little bit of a back and forth between learning the variable, transforming the variable, seeing the impact of that transformation on the variable itself and then on the performance of the model, and then, if you're not happy with that, reiterating.
And documenting that is a little bit hard because how do you do it? Like, you run the notebook over and over and then you overwrite yourself and sometimes you don't even know. Like, if you pick up the notebook a year later, you may not even remember what happened before the version that you have in front of your eyes. So I think it might be important to kind of try and write in the notebook as much as possible. I try to do that and I've seen that my colleagues do that as well. Try and document why you made those decisions.
As we move forward, some of the things that I'm trying right now is to build different pipelines, so that you already have the different pipelines stored in your notebook with the different versions of the feature transformations. So then you can see the output of 1 pipeline and compare it exactly with the other. And this is 1 of the advantages that you have now, because there are so many feature transformation techniques in open source packages that you can accommodate them all in a pipeline, whereas before, when you were using pandas, you probably couldn't. So there is that advantage a little bit now.
But, yeah, more than that, I'm not too sure what you can actually do, or at least I have not.
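A sketch of the "several pipelines side by side" idea mentioned above: the transformers and model are arbitrary scikit-learn examples chosen only to make the snippet self-contained, and the data is synthetic, so treat the whole thing as illustrative rather than a recommended recipe.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with some missing values in column 'a'.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.exponential(size=200)})
X.loc[::17, "a"] = np.nan
y = (X["b"] + rng.normal(size=200) > 1).astype(int)

# Two alternative feature engineering pipelines kept in the same notebook,
# so their cross-validated scores can be compared directly.
pipe_mean = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe_median = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])

for name, pipe in [("mean+scale", pipe_mean), ("median", pipe_median)]:
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(name, round(score, 3))
```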
[00:35:14] Unknown:
I'm curious if there's the capacity or opportunity to be able to add metadata into the individual transformations that you're creating so that you can embed some of that reasoning into the function call, so that it can maybe generate a set of metadata in the resulting model or just within the code itself, to be able to say, I'm imputing the median value for these numerics because it's useful to be able to normalize the distribution for this reason, or maybe embedding some of that context into the actual feature transformation itself so that it is more
[00:35:50] Unknown:
sort of self documenting and discoverable as you revisit it or as new people come across the code? That's actually a great idea. And I think we're trying to do that as part of feature engine's documentation. Now, in the last pull request, we're trying to adopt a little bit of scikit learn's way of presenting documentation, where you have the API reference that describes exactly what the API does, and then you have a user guide where you give information to the user. In this sense, we say, you know, this transformation works well for these types of variables and these types of models, but if you have this other type of model, then maybe you want to try this other transformation. So all of that would be documented there as part of the package, so it will be accessible for everybody.
And I think this is a little bit how things are done in open source. I don't think you put the information within the class, but you do a whole lot of documentation to actually help users use your package as well as possible. So that's it. In terms of whether you can create classes with metadata, I'm not sure; I have not done it. What I think some users do is add the metadata directly in the data frame. For example, if you're trying to simulate different scenarios, they will add different variables with the scenario hard coded. For example, variable a could be that we're launching an advertisement campaign, or variable c that we're mimicking a recession, to say something.
So then you basically have the metadata in your data frame, and then you build your models on the variables, except now with the metadata. So that's another way of simulating scenarios. Well, I guess it reflects some decisions and simulations in how you evaluate your model. But, yeah, it wouldn't store how you made your decisions. Not yet. I think you're left with taking notes.
[00:37:48] Unknown:
In the same vein, I'm also interested in understanding the process for being able to test and debug the feature transformations and the feature pipelines that you're building and being able to validate the correctness or utility of the features that are generated as a result of the pipelines and transformations that you're building?
[00:38:08] Unknown:
This is another very important aspect of creating feature engineering pipelines. I think 1 advantage of using open source is that the classes that come within these packages are tested themselves. So, I mean, it is not 100% bulletproof, but very often what the class is intended to do is indeed what the class is doing. But going back to developing your own classes when you're working on your own projects, I think when we're building a pipeline it's important to have bite sized chunks or pieces of functionality and then to test those pieces of functionality individually. So, like, if I'm transforming variable a in way x, then I need to have a class and I need to test that class. So I need to be sure that my class is transforming the variable exactly as I expect.
And then ideally, I want to have that for all the classes and all the individual transformations that I have across the pipeline. So it's much better to have it that way than to just test the pipeline as a whole, like, input comes in, prediction comes out, because then it is harder to debug. Like, if you're not obtaining the output that you expected after it went through the entire pipeline, where is the error? So I think it is important to have individual tests for every individual transformation, and then, of course, a final test: input comes in, prediction comes out, is this what I expected? In my opinion, this is the best way forward to creating a pipeline that is easy to test and then easy to debug.
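A sketch of the kind of per-transformation unit test described above, using pytest conventions; the function under test and its column names are hypothetical, but the same idea applies to any small piece of a feature pipeline.

```python
import pandas as pd
import pandas.testing as pdt

def add_debt_to_income(df: pd.DataFrame) -> pd.DataFrame:
    """One small, testable piece of a feature pipeline (hypothetical)."""
    out = df.copy()
    out["debt_to_income"] = out["total_debt"] / out["income"]
    return out

def test_add_debt_to_income():
    df = pd.DataFrame({"total_debt": [100.0, 0.0], "income": [200.0, 50.0]})
    result = add_debt_to_income(df)
    expected = pd.Series([0.5, 0.0], name="debt_to_income")
    # The transformation does exactly what we expect, nothing more:
    pdt.assert_series_equal(result["debt_to_income"], expected)
    assert list(result.columns) == ["total_debt", "income", "debt_to_income"]
```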
[00:39:51] Unknown:
And another aspect of feature engineering that you mentioned earlier is the growing use of feature stores for being able to store a definition of a feature and have it computed at sort of query time by the machine learning system that is either training or inferring information. And I know that there's the open source Feast project, a number of others, and then there are a number of different commercial options, including things like Tecton. And I'm wondering what you see as some of the potential for being able to use feature engine in conjunction with these feature stores, either to publish the transformation pipelines to those stores, or just some of the potential interplay between those 2 stages of feature development?
[00:40:35] Unknown:
This is an interesting question as well. It's an aspect that I have not myself explored much. I am aware that these kinds of feature stores are particularly useful when you have a huge amount of data coming in through, for example, your organization, and then you need to kind of store and have that data ready to be used in your machine learning models. And what could happen here is that the data comes in and then you have pipelines to create features, so that the features that you can actually use from that enormous amount of data are ready in a store, and then maybe you trigger some feature transformations once per day or once per hour, depending on how frequently the data loads come in.
So in that sense, I think this is particularly useful to leverage the power of that way of receiving data, from wherever you're receiving data, maybe wearable devices. But I don't think you can integrate feature engine today with these stores, because feature engine is designed to work with data frames. That was the original vision, and I think this is what drives most of the value of using feature engine. But if you have an enormous amount of data, you don't really want to put it into a data frame to do your transformations. So for the time being, I don't think you could integrate them. I think these platforms have their own ways of creating pipelines so that they can help the user create these features fast, given the architecture and the nature of the data.
Looking forward, if there is enough appetite to use or extend the functionality of feature engine to big data, we could think about it. Some people have asked me if we can extend this to PySpark. At the moment, we don't have the capacity. So, yeah, maybe in the future.
[00:42:32] Unknown:
In your experience of building the feature engine project and using it in your own work and sharing it with the community, what are some of the most interesting or innovative or unexpected ways that you've seen it used? As a maintainer
[00:42:46] Unknown:
or at least as a maintainer of a library that is fairly new, you don't get to see a lot of how people actually use your package. I know that it's being used because you can detect how many times it's been downloaded. So at the moment we have something like 40,000 downloads per month, I think. And I can also see the visits to the documentation. So I know that it's been quite extensively used, but unfortunately I don't really know on what projects. I would love to hear more from the users. I can guess that some people in the finance sector are using feature engine, because we have a recent contribution from 1 user who actually came up with a selector that would be very useful for finance. Because in finance, they use the population stability index, for example, to determine if the distribution of a feature holds over time. And if not, you can actually not use that feature, based on current financial regulation.
Yeah, he thought that feature would be very useful for his sector, and he's actually developing that functionality himself. So I know that it's being used in finance. I also know that it's being used in Kaggle, because some other users have requested other features and they link to a video given by a Kaggle master, for example, or they link to an explanation from a person that has done that transformation in Kaggle. So these are the most exotic examples that I came across so far.
[00:44:17] Unknown:
And in your experience of building and using feature engine, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:25] Unknown:
I didn't really know what to expect as a maintainer, and I think I'm learning as I go. I'm a little bit freestyling. When I got the first pull request with the first contribution, I got so, so excited. And then it's like, oh my god, how do I reply to this? What should I say? What should I not say? I had to learn basically how to be a maintainer: how to interact with people, what things to accept and what things not to accept, how to communicate what could be accepted and what would not be accepted into the package, how to engage with people in order to get something that was good for them and for the package as well. So all of that I needed to learn. Then I didn't know how often I should engage with the contributors, but more than how often I should engage, what I didn't know, and I actually still don't know, is how other more experienced maintainers interact with their communities, because I just thought, you know, the way they interact may help me learn and keep an active and thriving community in feature engine. So I think there was a bit of a learning curve there. And 1 unexpected finding is, for example, that I'm actually quite an engaged maintainer, because whenever I get a pull request, I reply in 1 or 2 days, for example.
Whereas I have made pull requests to other libraries that are way more popular than feature engine, and months later I haven't even received a thank you note. As a contributor, that is so frustrating, because you put so much of your free time into developing something that you think is good for everybody, and then not receiving a reply is a bit frustrating. So I try not to do that, and I try to reply to every pull request. And then if I make suggestions for changes, I always try to give reasons for why I think this. And some other unexpected feedback was that some contributors have told me that they learned a lot by making their contribution, based on the suggestions that I've given.
And that was a bit unexpected, because I've never considered myself like a super awesome Python developer. So when someone tells me, look, I've learned a lot, I find it quite rewarding. I certainly learned a lot from some contributions. So overall, something else that was unexpected is how exciting it is to create and maintain a Python package. It's way more exciting than I originally expected.
[00:47:19] Unknown:
For people who are working on building out their feature engineering pipelines, and they want to be able to bring in more reusability, what are some of the cases where feature engine is the wrong choice?
[00:47:31] Unknown:
I think feature engine is steered toward working with data frames. It's steered to receiving data in a data frame, transforming the data, and then returning a data frame, so you can actually do the transformations and continue with your data analysis as you go along, and you can do a little bit of both, 1 at the back of the other. And then, because it fits nicely into a scikit learn pipeline and it shares scikit learn functionality, you can actually deploy that pipeline at the end. So I think feature engine is suitable for people working on projects where the datasets have a size that would fit in a data frame, where the features need to be understandable and explainable, and where we're going to build off the shelf machine learning models like those that we find in scikit learn.
If our data is super unstructured, we don't have data frames, and we're also not going to use off the shelf algorithms, then I think it's not what it's designed for.
[00:48:39] Unknown:
And as you continue to work on and use the feature engine project, what are some of the things that you have planned for the near to medium term, and what are some of the types of help or contributions from the community that you're looking for?
[00:48:52] Unknown:
I think the main things that we want to include in the short term: the first 1 is that we want to adopt scikit learn's way of presenting documentation, where you have 1 interface that describes what the API is doing, and then you have another interface, which is the user guide, where you give a lot of information about the technique, not just what it's doing, but why it's doing what it's doing, when you should use it, when you should not use it, and then provide some examples and hopefully some references as well. So a little bit of both: providing the functionality, but also providing the context on why you should use it. So a little bit of an education project.
So that's number 1. The other thing that is coming very soon is that we're also going to include functionality to work with time, to extract features from datetime variables. And the next step for us would be to expand feature engine's functionality to also work with time series. At the moment, feature engine works only on tabular data, which is also what scikit learn does, but we want to expand this to be able to preprocess variables that come as time series. And at the back of this, something that we need to discuss is whether we need to move away from pandas and also adopt other frameworks that will allow us to handle bigger datasets, like, for example, Dask.
Some of the contributions that we're looking for, or hoping for: to be honest, any contribution, and I think this is true for any open source package. Any contribution, no matter how small, helps, starting from feedback. Like, is this useful or is it not? That is super helpful. People perhaps sharing the use of the package, or sharing, whenever they can, some projects on how they are using feature engine, fixes to the documentation, suggestions for new classes or new transformations that they would like to see in the package. Of course, code contributions are more than welcome.
Contributions to the documentation as well. This is really important, something that is always overlooked. I think there is this general belief that contributing to software is about producing code with new functionality and it's not just that. That I learned when I started developing feature engine. I put so much more work into creating docs than into creating code. So help with creating documentation is actually
[00:51:29] Unknown:
greatly appreciated as well. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the recent Dune movie that came out. I really enjoyed reading the book series, so definitely recommend that as well. But whether you have or haven't read the books, I definitely recommend the movie. It's very well done. Definitely looking forward to the next 1 that's supposed to come out in a couple of years. So definitely give that a watch if you get the chance. And so with that, I'll pass it to you, Sole. Do you have any picks this week?
[00:52:00] Unknown:
Yes. Since we're talking so much about how important it is to understand what we feed into the models to try and make fair algorithms, I think 1 movie that is very related to the topic is The Social Dilemma. And 1 book that also taps into the same issues around the use of big data and algorithms is Don't Be Evil by Rana Foroohar.
[00:52:26] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on feature engine and your experience of working in the space of machine learning and trying to make feature engineering a more tractable problem. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Episode
Guest Introduction: Soledad Ghali
Journey into Python and Data Science
Overview of Feature Engine
Challenges in Feature Engineering
Technical Debt in Feature Engineering
Ecosystem of Feature Engineering Tools
Domain Knowledge in Feature Engineering
Risks in Feature Transformations
Implementation and Design of Feature Engine
Feature Engineering for Different Machine Learning Approaches
Building and Documenting Feature Pipelines
Testing and Debugging Feature Transformations
Feature Stores and Feature Engine
Community Use Cases and Contributions
Future Plans for Feature Engine
Conclusion and Picks