Summary
Feature engineering is the process of transforming and combining input variables into more meaningful signals for the problem that you are trying to solve, and nearly every machine learning project has to start with it. This work often leads to duplicating code from previous projects, or to technical debt in the form of poorly maintained feature pipelines. To make the practice more manageable, Soledad Galli created the feature-engine library. In this episode she explains how it has helped her and others build reusable transformations that can be applied in a composable manner in your scikit-learn projects. She also discusses the importance of understanding the data that you are working with and the domain in which your model will be used to ensure that you are selecting the right features.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Soledad Galli about feature-engine, a Python library to engineer features for use in machine learning models
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what feature-engine is and the story behind it?
- What are the complexities that are inherent to feature engineering?
- What are the problems that are introduced due to incidental complexity and technical debt?
- What was missing in the available set of libraries/frameworks/toolkits for feature engineering that you are solving for with feature-engine?
- What are some examples of the types of domain knowledge that are needed to effectively build features for an ML model?
- Given the fact that features are constructed through methods such as normalizing data distributions, imputing missing values, combining attributes, etc. what are some of the potential risks that are introduced by incorrectly applied transformations or invalid assumptions about the impact of these manipulations?
- Can you describe how feature-engine is implemented?
- How have the design and goals of the project changed or evolved since you started working on it?
- What (if any) difference exists in the feature engineering process for frameworks like scikit-learn as compared to deep learning approaches using PyTorch, Tensorflow, etc.?
- Can you describe the workflow of identifying and generating useful features during model development?
- What are the tools that are available for testing and debugging of the feature pipelines?
- What do you see as the potential benefits or drawbacks of integrating feature-engine with a feature store such as Feast or Tecton?
- What are the most interesting, innovative, or unexpected ways that you have seen feature-engine used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature-engine?
- When is feature-engine the wrong choice?
- What do you have planned for the future of feature-engine?
Keep In Touch
- @Soledad_Galli on Twitter
- solegalli on GitHub
Picks
- Tobias
- Soledad
- The Social Dilemma
- Don’t Be Evil by Rana Foroohar
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Links
- feature-engine
- Feature Engineering
- Python Feature Engineering Cookbook
- scikit-learn
- Feature Stores
- Pandas
- PyTorch
- Tensorflow
- Feast
- Tecton
- Kaggle
- Dask
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Soledad Galli about feature engine, a Python library to engineer features for use in machine learning models. So, Soledad, can you start by introducing yourself?
[00:01:08] Unknown:
Hi. Thanks. I'm Sole. I'm a data scientist, an instructor of machine learning courses, and an open source Python developer.
[00:01:16] Unknown:
And do you remember how you first got introduced to Python?
[00:01:19] Unknown:
Yes. I think I do. It happened about 4 to 5 years ago, when I decided to leave science and academia and retrain myself as a data scientist. So when I was looking into what was needed to become a data scientist, I quickly learned that I needed to learn either R or Python, and ideally both. So that was probably the first time in my life that I heard about Python. I started studying R because at the time most of the online courses were based on R, and also because R is more of a statistical software, so it was more familiar to what I was used to back in the day. But then I landed my first job at a consulting company.
And I was encouraged to learn Python because they thought that it was more widely used and also more useful. So while I was waiting for my first assignment in this organization, which in consulting jargon is called being on the bench, I was trying to quickly get up to speed with Python through books and online courses.
[00:02:22] Unknown:
And so that brings us to the feature engine project that you created, and I'm wondering if you can talk through some of the capabilities that you're providing with that and some of the story behind how you ended up building this project.
[00:02:34] Unknown:
Yes. So feature engine is an open source Python library that works extremely similarly to scikit learn, in the sense that the classes that we have in feature engine use the method fit to learn parameters from the data and then the method transform to go ahead and transform the data. Feature engine hosts a lot of transformations to, for example, impute missing data, or to transform categorical variables that are strings into numbers, or to apply the most widely used transformations of variables. And it also includes some methods to select features that are currently not available in other libraries.
So the idea of feature engine came up because I worked in various companies and on various projects, and I realized that I was actually coding and recoding the same transformations over and over again, project after project. And I thought at the time that it would be great if we actually had a library like scikit learn, where you can pick and choose whatever machine learning model you want to try on your data, but instead of picking and choosing models, we could pick and choose the data transformations to apply to the features. So I think this was 1 of the main drivers that got me into going ahead and designing feature engine.
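As a rough sketch of the fit/transform pattern described here: the class and module names below follow feature-engine's documentation for recent releases, but older versions organized the modules differently, so treat the exact import path as an assumption and the data as purely illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# A toy dataset with missing values in two numerical columns.
data = pd.DataFrame({
    "age": [25, None, 40, 31, None, 52],
    "income": [30_000, 45_000, None, 52_000, 61_000, None],
    "city": ["London", "Paris", "London", "Madrid", "Paris", "London"],
})

X_train, X_test = train_test_split(data, test_size=0.3, random_state=0)

# fit() learns the parameters (here, the median of each listed variable)...
imputer = MeanMedianImputer(imputation_method="median", variables=["age", "income"])
imputer.fit(X_train)

# ...and stores them inside the transformer, so transform() can reuse them
# on new data, e.g. the test set, exactly as learned from the training set.
print(imputer.imputer_dict_)          # learned medians per variable
X_test_t = imputer.transform(X_test)  # returns a pandas DataFrame
```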
[00:04:02] Unknown:
As far as the overall process of feature engineering, I know that it's necessary for being able to have some sort of aggregate information that is fed into the machine learning model, because the base data that you're working with is not necessarily going to have the right set of inputs or signal to be able to generate useful predictions. And I'm wondering if you can just talk through some of the complexities and challenges that are inherent to the process of feature engineering and the amount of domain knowledge and familiarity with the data that's necessary to be able to do that effectively?
[00:04:38] Unknown:
You actually mentioned 1 of the main challenges, in my opinion, regarding feature engineering, which is the amount of domain knowledge and familiarity with the data that you need to have. Because to extract real value from the features, you are much better placed if you actually know what your data is, what each variable is telling you, what information the variables are giving you. And my experience working in organizations is that trying to understand what each variable is is sometimes really hard. And even the people that give you the data don't really know. So, like, trying to get a dictionary of the variables is really hard. So you end up working a little bit in the dark, and I find that is a problem for a few reasons. First of all, if you don't really know what the variable is, then you can't really make an educated decision on how you're going to transform it, or what you can do with your variable.
Can I use this variable to begin with? And while that sounds perhaps not so relevant for model performance, if we're going to use these models to serve people, it is actually very crucial that we know what we're feeding into our models, because our models are going to make predictions with that feature and we're going to make decisions at the back of those predictions. So I think, yes, understanding what your variables are, what you're going to give to your model, is 1 main complexity or 1 of the biggest challenges, and it's not perhaps entirely tied to how I'm going to transform that feature; it's more of a prerequisite, if you want.
Some of the other challenges, on more of the technical side, have to do with time. Feature engineering is very time consuming. Getting to know your data is very time consuming. The quality of the data sometimes is not good, so you spend a lot of time trying to get the data into a shape in which you can actually start working with it. It is repetitive, as I said previously: you find yourself doing the same transformation over and over, variable after variable, feature after feature, and project after project. And something that is perhaps more important is that your colleague is also doing the same data transformations in their project as well.
And this ends up with different versions of the same code that produce the same functionality. And this brings us to the problem of reproducibility and traceability. If we both coded the same functionality with different lines of code, then we have different versions of the same functionality, and then we don't know which 1 is the last version. So how are we going to maintain that as a team? So I think it's kind of important to start creating joint libraries, or using open source as well, where we can actually, as a team, tap into the same functionality and build on the same functionality, instead of having different sources that are then very difficult to maintain, and where it is very difficult to document which 1 was the 1 that had the bug and which 1 didn't.
When we create features, sometimes, depending on our application, we need to store the features. And this is particularly relevant when we have huge streams of data coming in, because our models are often not going to use that live stream to make our predictions. We want to actually reduce the dimension of that enormous amount of data into something that we can actually store and quickly read in order to feed to our models. So this is a problem that some feature stores are starting to provide solutions for. And then I think there is the problem of explainability, like I was kind of mentioning at the beginning.
Can I use this feature to begin with? And then when I create my new features or when I transform my features, for example when I'm encoding categorical variables, do I understand the output of the transformation, and can I track it back to the original variable? And this is particularly important when we're going to use models in organizations that are highly regulated, or when we're going to make predictions that are going to affect the lives of people, or not the lives, but their well-being and these kinds of things. Because first of all, we need to be able to explain why we make our decisions, and these decisions are at the back of the model. So we need to be able to explain what the models are telling us. But more importantly, we also want our models to be fair.
So I think this is perhaps 1 of the most important issues in my opinion.
[00:09:42] Unknown:
As far as the overall process of feature engineering, you mentioned that before you created the library, you found yourself repeating a lot of the same operations. And I know that it also becomes complex when you're trying to collaborate with multiple data scientists on a team, because everybody might have their own different way of achieving the same outputs of the transformation, or they might be doing what they think are the same transformations but in slightly different ways, so they get diverging results. And I'm wondering what are some of the complexities and issues of technical debt that you've encountered in the process of feature engineering and model development, and just some of the incidental complexity that comes in when you're trying to explore particularly complicated datasets or maybe complex domains?
[00:10:28] Unknown:
Yes. I think in terms of collaborating with some of my colleagues, the thing that was perhaps most problematic is that we are all working in Jupyter Notebooks, because it is extremely convenient to kind of see what you're doing. You transform the data, and you can immediately visualize what you're doing. But then tracking or versioning a Jupyter notebook is really hard. And when you run it over and over, you kind of lose which 1 is the last version, which 1 is the new version. So I think this is 1 of the main problems, the lack of traceability, almost the impossibility of having some sort of versioning guaranteed in a Jupyter notebook.
That's, yeah, 1 of the problems in the way that we actually work. Incidental complexity, I think I had more experience with that in feature engine than with the Jupyter notebooks. The problem is that you end up copying code and then you have, like, 20 versions of the same thing. But with feature engine, what happened was that I was learning, or I was becoming, a Python developer when I started the project, and actually this was as well 1 of the main motivations to develop feature engine, to become a better Python developer. So what I did back in the day was to put all the classes in the same Python script.
This was the first version of the package back in the day. But then I had the fortune to have a very good colleague, Chris, who is an amazing Python developer, and I really admire him. And he volunteered to have a look at feature engine. And his feedback was that the code was alright, but he wanted to chop this code into pieces so that it would be easier to maintain. And then I did that, but that was a lot of work that I had to do just to restructure the project. And then, in order not to affect how users use the package, I needed to start making modifications in the init file so that I could keep the imports as much as possible as in the original version, not to have changes that break the functionality between versions. So I think, going back to the question, 1 of the main problems of incidental complexity or technical debt is the fact that you may need to do massive refurbishments at some point.
Otherwise, it becomes very hard to maintain, or you may end up having to introduce changes that break the flow between 1 version and the other. And that affects user experience, if you want, because they need to relearn how to use the library, and then newer versions will not do what they are used to. So, yeah, I think the biggest problem is maintaining the code moving forward. And in my opinion, you want to try and remove incidental complexity or technical debt as soon as you see that it's piling up, because otherwise it becomes so ingrained that it becomes really difficult to remove.
[00:13:48] Unknown:
As far as the ecosystem of tools for doing feature engineering and building these pipelines, I'm wondering what you have seen as far as what is present in the ecosystem, and what did you see as being missing when you decided to build your own library, and maybe some of the capabilities that you've borrowed from other approaches and some of the ideas that you have built into feature engine that you think should be more widespread across these tools?
[00:14:18] Unknown:
Back in the day, when we launched feature engine for the first time, I think that the main library that we were using for feature engineering was pandas, because most of us would work with data frames, and pandas has the beauty that you can transform your data and then visualize it at the back of it. So it's very, very convenient. So we were all doing that, but then the problem is that pandas doesn't really store the parameters that you learn, and you need to pretty much code any transformation by hand. And then if you want to apply the same transformation to a lot of variables, very often you find yourself doing these horrible loops.
So I think back in the day, basically, a library for feature engineering, ideally with the same functionality as scikit learn, was missing. At the same time that we launched feature engine, scikit learn started to accommodate some of those transformations within the API, so this was amazing. But scikit learn has a little bit of a different design, because in scikit learn the classes will transform the entire dataset. And if you want to narrow down the transformations to a certain group of variables, which is very often what we do when we transform variables, you need to use another class.
And also, scikit learn returns NumPy arrays. And NumPy arrays you need to transform back to data frames if you want to do visualizations, for example. So I found that it was not extremely convenient. So I thought that something that, you know, takes in a data frame, returns a data frame, and also allows you to apply transformations to specific variable groups was much needed, or at least I was looking forward to something like this. And then also, we needed a library that included a great variety of feature transformations, something where we can apply not only just the mainstream data imputation techniques, but also some data transformations that are used, for example, in data science competitions and are a little bit more novel and different.
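A short sketch of the "data frame in, data frame out, restricted to selected variables" point made here. The encoder class and its encoding_method and variables parameters are taken from feature-engine's documentation, but the exact module path is version dependent, and the column names are entirely hypothetical.

```python
import pandas as pd
from feature_engine.encoding import OrdinalEncoder

df = pd.DataFrame({
    "colour": ["blue", "red", "blue", "green"],
    "size":   ["S", "M", "L", "M"],
    "price":  [10.0, 12.5, 9.0, 11.0],
})

# Only the listed categorical variables are encoded; 'price' passes through
# untouched, and the result is still a pandas DataFrame, so you can keep
# exploring it or feed it straight into the next transformer.
enc = OrdinalEncoder(encoding_method="arbitrary", variables=["colour", "size"])
enc.fit(df)
df_t = enc.transform(df)

print(type(df_t))          # <class 'pandas.core.frame.DataFrame'>
print(enc.encoder_dict_)   # learned string-to-integer mappings per variable
```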
[00:16:38] Unknown:
And then as far as the actual work of doing the feature engineering, I'm wondering what are some of the types of domain knowledge that are necessary to be able to understand what features are going to be useful in the model and just some of the process of identifying which variables to combine, what types of transformations are necessary, and just some of the workflow of actually building a set of features to input into the actual model training process?
[00:17:07] Unknown:
I can probably give a few examples from the domains I've worked in. I think if you've been for a while in 1 space and you actually understand your data and its behavior as well, then you are better placed to produce features that will actually help you and your models make more accurate and fair predictions. Just to pick an example, if we're talking about credit data: credit agencies, for example, collect a lot of information from financial institutions, which have to provide data to the credit agencies in order to then be able to receive data from these credit agencies. And the data has the form of the balances in bank accounts, or the payments to credit cards or to loans, and that kind of thing. And that is month after month, customer per customer. And you wouldn't actually use that in your models, but credit agencies have been working on these variables for so long that they now understand what aggregated views of this data they can provide to their customers. For example, the number of payments to credit cards in the last 3 months or in the last 6 months, or the maximum debt, or the minimum number of payments, or whether the customer has defaulted in the last 3 months, or something like that. So these are things that you actually know after working in the field for some time. So if you work giving loans, for example, you begin to know the variables that correlate more with the likelihood that the customer will actually fulfill their commitment to repay the loan, and then you understand that, for example, the income to debt ratio, or what we otherwise call the disposable income, is quite an important variable. So then you have 2 variables and you know that you need to combine them to calculate the disposable income. Or the total debt that the customer has is perhaps a good indication, and then you know that you need to sum the debt across all the variables.
If we're talking about insurance, for example, sometimes when a claim comes in and the insurance people fill in the forms, we have these forms where they have to tick boxes on a picture of the car. So you have a picture of the car, and then the person that is handling the claim needs to tick, like, the door or the roof or the light, all the pieces where the customer is saying that they had an accident or that things have been broken. And that then comes to us as binary variables, and then we have a broken light or a broken mirror, and then you can add all those up to create a picture of the total damage, for example. Those are some examples of things that you get to know when you are familiar with the data and also with how the data is collected.
Something else that I found interesting, for example, is that when we're creating models for insurance claims, you take the car fuel as a variable, and eventually we started having cars that were actually electric. So we started having electric cars coming into our models. And it's like, what is our model going to do with this? We have no clue, because we don't have this in our historic data. This is just a very new thing. But then you need to know that this now exists, so then you can make a decision about how you're going to process, or maybe not process, these types of applications.
These are some of the examples.
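As a concrete sketch of the kind of aggregated credit features described above: the column names and numbers are entirely hypothetical, and the aggregations (payments in the last 3 months, maximum debt, a debt-to-income style ratio) are only meant to illustrate the idea of collapsing a monthly, per-customer stream into model-ready features.

```python
import pandas as pd

# Hypothetical monthly, per-customer records from a credit data feed.
payments = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "month":       ["2021-07", "2021-08", "2021-09"] * 2,
    "cc_payment":  [120, 0, 200, 80, 90, 0],
    "balance":     [1500, 1620, 1450, 300, 260, 310],
    "income":      [2500, 2500, 2500, 1800, 1800, 1800],
})

# Aggregate the raw monthly stream into per-customer features such as the
# number of credit card payments made, the maximum balance carried, and a
# simple debt-to-income ratio combining two of the variables.
features = payments.groupby("customer_id").agg(
    n_payments_3m=("cc_payment", lambda s: (s > 0).sum()),
    max_balance_3m=("balance", "max"),
    mean_income=("income", "mean"),
)
features["debt_to_income"] = features["max_balance_3m"] / features["mean_income"]
print(features)
```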
[00:20:46] Unknown:
Because of the fact that you are performing all these transformations on the input data and you're using your particular domain knowledge and your understanding of the problem domain and the solution that you're trying to provide, you know, where you might be imputing missing values or removing missing data or combining 2 different variables to create a new input, I'm wondering what are some of the potential risks that you are taking on by doing all these different transformations and, you know, the potential for invalid assumptions about the impact that those might have on the resulting model and just some of the ways that you can identify those risks in the process of building up the transformations and then be able to revisit them as you iterate on the model development process?
[00:21:34] Unknown:
I would make a distinction here between 2 things. So first, when we talk about meeting the assumptions, I think this is extremely important when we're going to use those variables in statistical tests, for example. Because when we apply a statistical test, these tests very often make a lot of assumptions about how the data needs to look and how the data was originally collected. And if our data does not fulfill those assumptions, then the answers that those tests give us may or may not be accurate. So they may or may not be reflecting whether the difference that we see is actually a difference. So then we can't really be confident when we derive conclusions at the back of those statistical tests.
And moving on to the field of machine learning, I think if we don't use optimal, so to call them, transformations, the worst thing that can happen is that the model loses some performance. And how bad that is, I think, kind of depends on what the model does and who the model is serving. I think it is important to know that if we're going to train a linear model, our variables need to be linearly related to the target, and if they are not, we're better off using a model that doesn't make that assumption. So I think it's important to know what the model is assuming, and then whether the variables are actually fulfilling that, because if not, as I said, the model will not perform as well.
And having a model that doesn't perform well because of feature engineering is a little bit of a missed opportunity, because we could be doing other transformations that improve the performance. And the performance of the model will translate into various things depending on what we're trying to use that model for. For example, it could translate into customer satisfaction. It could translate into revenue for the company. So in those cases, I would argue that lower performance is not so dramatic. But in other cases, for example, when we're building models to assess health, to predict the prognosis of a disease, or when we are building models to decide who is going to receive a visa and who is not.
Or models that qualify how good a teacher is, or who gets access to a university. Then I think it becomes more important to try and understand whether our model is performing well, and if not, why not, and particularly whether our model is being fair. I think more and more we're starting to hear stories about biased models that end up negatively affecting some sectors of society because they were not trained on the correct variables. So we're not using the variables that we should be using, or the variables are not really good proxies to approximate what we actually want to approximate with that model.
So going back to how we know about that, I think once we train our models, it is very important to understand which feature is driving the decision and why that feature is driving the decision, which boils back to the point of understanding what my feature is telling me and whether I can use that feature or not. And then when we evaluate model performance, I think it is kind of important to try and see if our model is being fair for all the sectors of society, or all the sectors of my customer base that I am serving, or if the model is favoring some groups over others, for example.
This you do with a lot of research at the back of the models and the predictions that these models produce.
[00:25:45] Unknown:
Digging now into the feature engine project itself, can you talk through some of the ways that it's implemented and the overall design goals that you had in mind as you were iterating on the implementation of it, and some of the ways that it has changed and evolved since you first began working on it? So the intention of feature engine is to work as much as possible like scikit learn. So in fact,
[00:26:09] Unknown:
we inherit from 2 of the main transformer classes in scikit learn, which serve as the skeleton of the class and provide much of the scikit learn functionality already under the hood, in the sense of how you set your parameters and then how you retrieve the parameters of the class. And then the important bits that we mostly work on are the fit method and the transform method. So in the fit method, we have all the functionality that will learn the parameters for the feature transformation, for example simple things like the mean or the median to impute, or the mappings from strings to numbers. But when we're doing feature selection, then we have a whole logic to run models, evaluate features, and then store the features with high performance.
And then in the transform method, we basically transform the data based on the parameters that we learned during fit. So I think the main implementation of feature engine is very straightforward. Originally, I kind of envisioned feature engine to fill this gap that I was feeling was there when I was working in an organization. So I wanted a library that already had all of this functionality built in and that also stored the parameters within the class, just like scikit learn in that sense. When you apply a model from scikit learn, it will learn coefficients, it will learn the divisions of the decision tree, or how close the different observations are if you're doing nearest neighbors, and it will store all that information within the class.
Feature engine classes do exactly the same. They will store all the information. And, originally, I was focused on building classes that help transform data in a way that your features at the end of the day are still explainable and interpretable by a person. So you can apply a transformation and you know exactly what you're doing and you can also revert back to the original data. So if tomorrow you need to explain why your model made that decision, you can go back to the original data also with feature engine and try to understand that feature. So they are human readable.
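A minimal sketch of the transformer pattern described here, written against plain scikit-learn base classes rather than feature engine's actual internals (the discussion suggests feature engine inherits from scikit-learn transformer classes in a similar way); the class and attribute names are illustrative only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MedianImputerSketch(BaseEstimator, TransformerMixin):
    """Illustrative transformer: fit() learns medians, transform() applies them."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn the parameters from the training data and store them
        # inside the class, as described in the interview.
        self.imputer_dict_ = X[self.variables].median().to_dict()
        return self

    def transform(self, X):
        # Apply the stored parameters to new data, returning a DataFrame.
        X = X.copy()
        for var, value in self.imputer_dict_.items():
            X[var] = X[var].fillna(value)
        return X

# Usage: the learned medians travel with the fitted object, so the
# training and test sets are imputed with exactly the same values.
train = pd.DataFrame({"age": [20, None, 40], "salary": [1.0, 2.0, None]})
test = pd.DataFrame({"age": [None, 35], "salary": [None, 3.0]})
imp = MedianImputerSketch(variables=["age", "salary"]).fit(train)
print(imp.transform(test))
```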
So that was the original vision. But as feature engine becomes adopted by the community, I've noted that people want to also have in feature engine functionality that they use, for example, in data science competitions. I think they kind of want to speed up the way they transform data using these transformations that perhaps do not return the most understandable feature or feature that makes business sense, but that they would use in the data science competition or something like this. So I think we're kind of steering a little bit into that direction because users want it, and I find it very exciting that people want to actually use feature engine. So I think we're departing a little bit from the original idea of producing features that are fully explainable into producing features that are used and accepted by the community.
But if we think that they are not super interpretable, or that this poses a risk when you're using the feature to serve people, we make that absolutely clear in the documentation, or at least this is what I would like to do. Like, I want it to be really clear that it might be risky.
[00:29:44] Unknown:
As far as the feature engineering process, you mentioned that feature engine is working to be very compatible with scikit learn. And I'm wondering what you have seen as far as either the requirements or just the overall workflow of building features for scikit learn as your machine learning toolkit versus using maybe a deep learning approach with tools like PyTorch or TensorFlow, and if there is any difference, or sort of what those divergences might be depending on the style of machine learning that you're doing. Two main differences, perhaps, in my opinion. The first 1 is
[00:30:22] Unknown:
explainability versus not so much explainability. As in, when you work with scikit learn, I could argue that you can still interpret the decisions that the models make. So you can interpret the decisions of linear models. You can certainly interpret the decisions of tree based algorithms and nearest neighbors. So you can try and understand why the model made that decision, provided you understand what your features are. So I think, yeah, this is, and most likely is going to remain, the main choice for building models in organizations that actually need to justify why they make decisions.
And the other main difference is that using deep learning with libraries like PyTorch, TensorFlow, and so on makes sense when you have a huge amount of data. So deep learning becomes competitive when you have an enormous amount of data. If you have little data, I don't think it will beat an off the shelf algorithm like a boosted machine. And then, again, the models in scikit learn are not really ready to cope with the ginormous amount of data that deep learning is ready to cope with. So I think they serve different purposes.
[00:31:47] Unknown:
As far as the workflow of building the features and training the model and then maybe revisiting models that you've worked on to either tune them for better performance along whatever axes you're trying to optimize for, or if you are, you know, joining a team that has a set of models with the feature transformation pipeline in place. I'm wondering what are some of the ways that you have found useful for being able to build the feature engineering pipeline and determine which transformations to create and then maybe embedding some of the reasoning and context behind those choices in the code that you're writing so that people who are either revisiting it, you know, maybe your future self or people who are new to the team or new to the project to be able to help them know why certain transformations are being made without having to do all of that exploration and gaining of domain knowledge to be able to be effective?
[00:32:47] Unknown:
Yes. That's a good 1. And I'm not really sure I personally have made a lot of progress on that front, if I have to be fair. I think the workflow normally goes like this: you get the data, you try to understand what the data is, what the variables are, whether you have redundant data, how good the quality of the data is. So there's a very big portion of data exploration where you try to understand: do I have duplication? Are my variables correlated? That kind of thing. Then there's probably some sort of iteration between building a model, deriving feature importance, trying to understand which are the most important features, and whether a data transformation that maybe you did is actually making the feature more or less predictive, or how it is affecting the performance of the model. So there's a little bit of a back and forth between learning the variable, transforming the variable, seeing the impact of that transformation on the variable itself and then on the performance of the model, and then, if you're not happy with that, reiterating.
And documenting that is a little bit hard because how do you do it? Like, you run the notebook over and over and then you overwrite yourself and sometimes you don't even know. Like, if you pick up the notebook a year later, you may not even remember what happened before the version that you have in front of your eyes. So I think it might be important to kind of try and write in the notebook as much as possible. I try to do that and I've seen that my colleagues do that as well. Try and document why you made those decisions.
As we move forward, some of the things that I'm trying right now is to build different pipelines, so that you already have the different pipelines stored in your notebook with the different versions of the feature transformations. So then you can see the output of 1 pipeline and compare it exactly with the other. And this is 1 of the advantages that you have now, because there are so many feature transformation techniques in open source packages that you can accommodate them all in a pipeline, whereas before, when you were using pandas, you probably couldn't. So there is that advantage a little bit now.
But, yeah, more than that, I'm not too sure what you can actually do, or at least I have not.
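A sketch of the "several pipelines side by side" idea mentioned above: the transformers and model are arbitrary scikit-learn examples chosen only to make the snippet self-contained, and the data is synthetic, so treat the whole thing as illustrative rather than a recommended recipe.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with some missing values in column 'a'.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=200), "b": rng.exponential(size=200)})
X.loc[::17, "a"] = np.nan
y = (X["b"] + rng.normal(size=200) > 1).astype(int)

# Two alternative feature engineering pipelines kept in the same notebook,
# so their cross-validated scores can be compared directly.
pipe_mean = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe_median = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("model", LogisticRegression()),
])

for name, pipe in [("mean+scale", pipe_mean), ("median", pipe_median)]:
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(name, round(score, 3))
```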
[00:35:14] Unknown:
I'm curious if there's the capacity or opportunity to be able to add metadata into the individual transformations that you're creating so that you can embed some of that reasoning into the function call, so that it can maybe generate a set of metadata in the resulting model or just within the code itself, to be able to say, I'm imputing the median value for these numerics because it's useful to be able to normalize the distribution for this reason, or maybe embedding some of that context into the actual feature transformation itself so that it is more
[00:35:50] Unknown:
sort of self documenting and discoverable as you revisit it or as new people come across the code? That's actually a great idea. And I think we're trying to do that as part of feature engine's documentation. Now, in the last pull request, we're trying to adopt a little bit of scikit learn's way of presenting documentation, where you have the API reference that describes exactly what the API does, and then you have a user guide where you give information to the user. In this sense, we say, you know, this transformation works well for these types of variables and these types of models, but if you have this other type of model, then maybe you want to try this other transformation. So all of that would be documented there as part of the package, so it will be accessible for everybody.
And I think this is a little bit how things are done in open source. I don't think you put the information within the class, but you do a whole lot of documentation to actually help users use your package as well as possible. So that's it. In terms of whether you can create classes with metadata, I'm not sure; I have not done it. What I think some users do is add the metadata directly in the data frame. For example, if you're trying to simulate different scenarios, they will add different variables with the scenario hard coded. For example, variable a could be that we're launching an advertisement campaign, or variable c that we're mimicking a recession, to say something.
So then you basically have the metadata in your data frame, and then you build your models on the variables, except now with the metadata. So that's another way of simulating scenarios. Well, I guess it reflects some decisions and simulations in how you evaluate your model. But, yeah, it wouldn't store how you made your decisions. Not yet. I think you're left with taking notes.
[00:37:48] Unknown:
In the same vein, I'm also interested in understanding the process for being able to test and debug the feature transformations and the feature pipelines that you're building and being able to validate the correctness or utility of the features that are generated as a result of the pipelines and transformations that you're building?
[00:38:08] Unknown:
This is another very important aspect of creating feature engineering pipelines. I think 1 advantage of using open source is that the classes that come within these packages are tested themselves. So, I mean, it is not 100% bulletproof, but very often what the class is intended to do is indeed what the class is doing. But going back to developing your own classes when you're working on your own projects, I think when we're building a pipeline it's important to have bite sized chunks or pieces of functionality and then to test those pieces of functionality individually. So, like, if I'm transforming variable a in way x, then I need to have a class and I need to test that class. So I need to be sure that my class is transforming the variable exactly as I expect.
And then ideally, I want to have that for all the classes and all the individual transformations that I have across the pipeline. So it's much better to have it that way than to just test the pipeline as a whole, like, input comes in, prediction comes out, because then it is harder to debug. Like, if you're not obtaining the output that you expected after it went through the entire pipeline, where is the error? So I think it is important to have individual tests for every individual transformation, and then, of course, a final test: input comes in, prediction comes out, is this what I expected? In my opinion, this is the best way forward to creating a pipeline that is easy to test and then easy to debug.
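A sketch of the kind of per-transformation unit test described above, using pytest conventions; the function under test and its column names are hypothetical, but the same idea applies to any small piece of a feature pipeline.

```python
import pandas as pd
import pandas.testing as pdt

def add_debt_to_income(df: pd.DataFrame) -> pd.DataFrame:
    """One small, testable piece of a feature pipeline (hypothetical)."""
    out = df.copy()
    out["debt_to_income"] = out["total_debt"] / out["income"]
    return out

def test_add_debt_to_income():
    df = pd.DataFrame({"total_debt": [100.0, 0.0], "income": [200.0, 50.0]})
    result = add_debt_to_income(df)
    expected = pd.Series([0.5, 0.0], name="debt_to_income")
    # The transformation does exactly what we expect, nothing more:
    pdt.assert_series_equal(result["debt_to_income"], expected)
    assert list(result.columns) == ["total_debt", "income", "debt_to_income"]
```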
[00:39:51] Unknown:
And another aspect of feature engineering that you mentioned earlier is the growing use of feature stores for being able to store a definition of a feature and have it computed at sort of query time by the machine learning system that is either training or inferring information. And I know that there's the open source Feast project, a number of others, and then there are a number of different commercial options, including things like Tecton. And I'm wondering what you see as some of the potential for being able to use feature engine in conjunction with these feature stores, either to publish the transformation pipelines to those stores, or just some of the potential interplay between those 2 stages of feature development?
[00:40:35] Unknown:
This is an interesting question as well. It's an aspect that I have not myself explored much. I am aware that these kinds of feature stores are particularly useful when you have a huge amount of data coming in through, for example, your organization, and then you need to kind of store and have that data ready to be used in your machine learning models. And what could happen here is that the data comes in and then you have pipelines to create features, so that the features that you can actually use from that enormous amount of data are ready in a store, and then maybe you trigger some feature transformations once per day or once per hour, depending on how frequently the data loads come in.
So in that sense, I think this is particularly useful to leverage the power of that way of receiving data, from wherever you're receiving data, maybe wearable devices. But I don't think you can integrate feature engine today with these stores, because feature engine is designed to work with data frames. That was the original vision, and I think this is what drives most of the value of using feature engine. But if you have an enormous amount of data, you don't really want to put it into a data frame to do your transformations. So for the time being, I don't think you could integrate them. I think these platforms have their own ways of creating pipelines so that they can help the user create these features fast, given the architecture and the nature of the data.
Looking forward, if there is enough appetite to use or extend the functionality of feature engine to big data, we could think about it. Some people have asked me if we can extend this to PySpark. At the moment, we don't have the capacity. So, yeah, maybe in the future.
[00:42:32] Unknown:
In your experience of building the feature engine project and using it in your own work and sharing it with the community, what are some of the most interesting or innovative or unexpected ways that you've seen it used? As a maintainer
[00:42:46] Unknown:
or at least as a maintainer of a library that is fairly new, you don't get to see a lot of how people actually use your package. I know that it's being used because you can detect how many times it's been downloaded. So at the moment we have something like 40,000 downloads per month, I think. And I can also see the visits to the documentation. So I know that it's been quite extensively used, but unfortunately I don't really know on what projects. I would love to hear more from the users. I can guess that some people in the finance sector are using feature engine, because we have a recent contribution from 1 user who actually came up with a selector that would be very useful for finance. Because in finance, they use the population stability index, for example, to determine if the distribution of a feature holds over time. And if not, you can actually not use that feature, based on current financial regulation.
Yeah, he thought that feature would be very useful for his sector, and he's actually developing that functionality himself. So I know that it's being used in finance. I also know that it's being used in Kaggle, because some other users have requested other features and they link to a video given by a Kaggle master, for example, or they link to an explanation from a person that has done that transformation in Kaggle. So these are the most exotic examples that I came across so far.
[00:44:17] Unknown:
And in your experience of building and using feature engine, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:25] Unknown:
I didn't really know what to expect as a maintainer, and I think I'm learning as I go. I'm a little bit freestyling. When I got the first pull request with the first contribution, I got so, so excited. And then it's like, oh my god, how do I reply to this? What should I say? What should I not say? I had to learn basically how to be a maintainer: how to interact with people, what things to accept and what things not to accept, how to communicate what could be accepted and what would not be accepted into the package, how to engage with people in order to get something that was good for them and for the package as well. So all of that I needed to learn. Then I didn't know how often I should engage with the contributors, but more than how often I should engage, what I didn't know, and I actually still don't know, is how other more experienced maintainers interact with their communities, because I just thought, you know, the way they interact may help me learn and keep an active and thriving community in feature engine. So I think there was a bit of a learning curve there. And 1 unexpected finding is, for example, that I'm actually quite an engaged maintainer, because whenever I get a pull request, I reply in 1 or 2 days, for example.
Whereas I have made pull requests to other libraries that are way more popular than feature engine, and months later I haven't even received a thank you note. As a contributor, that is so frustrating, because you put so much of your free time into developing something that you think is good for everybody, and then not receiving a reply is a bit frustrating. So I try not to do that, and I try to reply to every pull request. And then if I make suggestions for changes, I always try to give reasons for why I think this. And some other unexpected feedback was that some contributors have told me that they learned a lot by making their contribution, based on the suggestions that I've given.
And that was a bit unexpected, because I've never considered myself like a super awesome Python developer. So when someone tells me, look, I've learned a lot, I find it quite rewarding. I certainly learned a lot from some contributions. So overall, something else that was unexpected is how exciting it is to create and maintain a Python package. It's way more exciting than I originally expected.
[00:47:19] Unknown:
For people who are working on building out their feature engineering pipelines, and they want to be able to bring in more reusability, what are some of the cases where feature engine is the wrong choice?
[00:47:31] Unknown:
I think feature engine is steered toward working with data frames. It's steered to receiving data in a data frame, transforming the data, and then returning a data frame, so you can actually do the transformations and continue with your data analysis as you go along, and you can do a little bit of both, 1 at the back of the other. And then, because it fits nicely into a scikit learn pipeline and it shares scikit learn functionality, you can actually deploy that pipeline at the end. So I think feature engine is suitable for people working on projects where the datasets have a size that would fit in a data frame, where the features need to be understandable and explainable, and where we're going to build off the shelf machine learning models like those that we find in scikit learn.
If our data is super unstructured, we don't have data frames, and we're also not going to use off the shelf algorithms, then I think it's not what it's designed for.
[00:48:39] Unknown:
And as you continue to work on and use the feature engine project, what are some of the things that you have planned for the near to medium term, and what are some of the types of help or contributions from the community that you're looking for?
[00:48:52] Unknown:
I think the main things that we want to include in the short term: the first 1 is that we want to adopt scikit learn's way of presenting documentation, where you have 1 interface that describes what the API is doing, and then you have another interface, which is the user guide, where you give a lot of information about the technique, not just what it's doing, but why it's doing what it's doing, when you should use it, when you should not use it, and then provide some examples and hopefully some references as well. So a little bit of both: providing the functionality, but also providing the context on why you should use it. So a little bit of an education project.
So that's number 1. The other thing that is coming very soon is that we're also going to include functionality to work with time, to extract features from datetime variables. And the next step for us would be to expand feature engine's functionality to also work with time series. At the moment, feature engine works only on tabular data, which is also what scikit learn does, but we want to expand this to be able to preprocess variables that come as time series. And at the back of this, something that we need to discuss is whether we need to move away from pandas and also adopt other frameworks that will allow us to handle bigger datasets, like, for example, Dask.
Some of the contributions that we're looking for, or hoping for: to be honest, any contribution, and I think this is true for any open source package. Any contribution, no matter how small, helps, starting from feedback. Like, is this useful or is it not? That is super helpful. People perhaps sharing the use of the package, or sharing, whenever they can, some projects on how they are using feature engine, fixes to the documentation, suggestions for new classes or new transformations that they would like to see in the package. Of course, code contributions are more than welcome.
Contributions to the documentation as well. This is really important, something that is always overlooked. I think there is this general belief that contributing to software is about producing code with new functionality and it's not just that. That I learned when I started developing feature engine. I put so much more work into creating docs than into creating code. So help with creating documentation is actually
[00:51:29] Unknown:
greatly appreciated as well. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the recent Dune movie that came out. I really enjoyed reading the book series, so definitely recommend that as well. But whether you have or haven't read the books, I definitely recommend the movie. It's very well done. Definitely looking forward to the next 1 that's supposed to come out in a couple of years. So definitely give that a watch if you get the chance. And so with that, I'll pass it to you, Sole. Do you have any picks this week?
[00:52:00] Unknown:
Yes. Since we're talking so much about how important it is to understand what we feed into the models to try and make fair algorithms, I think 1 movie that is very related to the topic is The Social Dilemma. And 1 book that also taps into the same issues around the use of big data and algorithms is Don't Be Evil by Rana Foroohar.
[00:52:26] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on feature engine and your experience of working in the space of machine learning and trying to make feature engineering a more tractable problem. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Episode
Guest Introduction: Soledad Ghali
Journey into Python and Data Science
Overview of Feature Engine
Challenges in Feature Engineering
Technical Debt in Feature Engineering
Ecosystem of Feature Engineering Tools
Domain Knowledge in Feature Engineering
Risks in Feature Transformations
Implementation and Design of Feature Engine
Feature Engineering for Different Machine Learning Approaches
Building and Documenting Feature Pipelines
Testing and Debugging Feature Transformations
Feature Stores and Feature Engine
Community Use Cases and Contributions
Future Plans for Feature Engine
Conclusion and Picks