Summary
One of the challenges of machine learning is obtaining large enough volumes of well-labelled data. An approach to mitigating the effort required to label data sets is active learning, in which the most informative examples are identified and labelled by domain experts. In this episode Tivadar Danka describes how he built modAL to bring active learning to bioinformatics. He uses it for human-in-the-loop training of models to detect cell phenotypes in massive unlabelled datasets. He explains how the library works, how he designed it to be modular for a broad set of use cases, and how you can use it to train models of your own.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- To get worry-free releases download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. And with their new Kubernetes integration it’s even easier to deploy and scale your build agents. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Tivadar Danka about modAL, a modular active learning framework for Python 3
Interview
- Introductions
- How did you get introduced to Python?
- What is active learning?
- How does it differ from other approaches to machine learning?
- What is modAL and what was your motivation for starting the project?
- For someone who is using modAL, what does a typical workflow look like to train their models?
- How do you avoid oversampling and causing the human in the loop to become overwhelmed with labeling requirements?
- What are the most challenging aspects of building and using modAL?
- What do you have planned for the future of modAL?
Keep In Touch
- @TivadarDanka on Twitter
- cosmic-cortex on GitHub
- https://www.tivadardanka.com for anything else
Picks
- Tobias
- Peter Rabbit (the movie)
- Tivadar
- Uri Alon: An Introduction to Systems Biology – Design Principles of Biological Circuits, book and online lectures
Links
- modAL homepage
- modAL on GitHub
- modAL paper
- Bioinformatics
- Hungary
- Phenotypes
- Active Learning
- Supervised Learning
- Unsupervised Learning
- Snorkel
- Active Feature-Value Acquisition
- scikit-learn
- Entropy
- PyTorch
- Tensorflow
- Keras
- Jupyter Notebooks
- Bayesian Optimization
- Hyperparameters
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you've got everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. To get worry-free releases, download GoCD, the open source continuous delivery server built by Thoughtworks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. And with their new Kubernetes integration, it's even easier to deploy and scale your build agents.
Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons, and visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey, and today I'm interviewing Tivadar Danka about modAL, a modular active learning framework for Python 3. So, Tivadar, could you start by introducing yourself?
[00:01:18] Unknown:
Yeah. So hello, everyone. I'm really glad to be here. My name is Tivadar, and I'm a postdoctoral researcher at the Hungarian Academy of Sciences. I work in a group called the Biological Image Analysis and Machine Learning Group, BIOMAG for short. Basically, I did my PhD in pure mathematics a year and a half ago, but after I finished my PhD, I decided to do something which can actually help make the world a better place. So I joined a bioinformatics group, and we are analyzing microscopic images of cells, helping drug discovery, and helping physicians and doctors create diagnostic tools. In brief, this is it. And do you remember how you first got introduced to Python? Yeah. Basically, I was introduced to Python a year and a half ago, when I moved into this field.
So as I mentioned, I became interested in bioinformatics, biology, and bioimage analysis, and I needed to pick a language. I had three options: Perl, R, or Python. I was advised against Perl; it used to be really popular in bioinformatics, but now it's not that popular. So my other choices were R and Python. I basically tried Python, and I really, really enjoyed the syntax. I really like the language; I like that I can do many things in a really easy way, not like C, which I had some prior experience with back in the day when I was a university student. So, bottom line, I chose Python, and I'm actually really glad I chose Python. Great. This is how I was introduced.
[00:03:01] Unknown:
And so in the introduction, I mentioned that the project we're talking about is a framework for active learning. So I'm wondering if you can just describe what that terminology means.
[00:03:13] Unknown:
Okay. Yeah. So active learning is actually a machine learning tool. It falls between supervised learning and unsupervised learning; it's actually a subset of semi-supervised learning. The premise is that you have a small set of labeled data and a bunch of unlabeled data, and what you would like to do is improve your understanding of your data, improve your classifier's accuracy, by using the unlabeled data. The main feature of active learning is that it actually asks you questions about the data: it asks the human experts to label the examples which it finds most useful. So, for instance, a typical scenario is that you have some kind of classifier, say you want to classify the sentiment of tweets, opinions, and you have a small set of labeled data and a billion tweets which are unclassified.
And you want to pick the unlabeled instances which would give your model the most information, and active learning is basically a set of tools trying to help you choose those instances. So it measures the informativeness of the unlabeled instances and presents the most informative ones to the oracle. I mean, it can be a human; it's not always a human. Then it includes the new labels in the next iteration: it retrains the model and asks a question again and again until the model is refined enough.
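To make that question-asking loop concrete, here is a minimal sketch of the cycle he describes, written with plain scikit-learn and least-confidence sampling. The data arrays are random placeholders, and `oracle_label` is a hypothetical stand-in for the human expert; this is an illustration of the general pattern, not modAL's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: a small labeled seed set and a large unlabeled pool.
X_labeled = np.random.rand(10, 2)
y_labeled = np.random.randint(2, size=10)
X_pool = np.random.rand(1000, 2)

model = LogisticRegression()
for _ in range(20):  # in practice, loop until the model is "refined enough"
    model.fit(X_labeled, y_labeled)
    proba = model.predict_proba(X_pool)
    # Query the pool instance the model is least certain about.
    idx = np.argmin(proba.max(axis=1))
    label = oracle_label(X_pool[idx])  # hypothetical call to the human expert
    # Include the new label in the next iteration and retrain.
    X_labeled = np.vstack([X_labeled, X_pool[idx]])
    y_labeled = np.append(y_labeled, label)
    X_pool = np.delete(X_pool, idx, axis=0)
```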
[00:04:49] Unknown:
And so, as you mentioned, having the labeled data for the training is necessary to increase the accuracy. What are some of the challenges of actually obtaining properly labeled data for these datasets, to be able to get the necessary accuracy from these machine learning models?
[00:05:12] Unknown:
For instance, I work in biology, so sometimes the greatest difficulty is that you have billions of cells for which you would like to find the phenotypes. So our basic setup is that we have a bunch of cell cultures or tissues, and we screen them with a microscope. And in these screens you have, like, a billion cells and a billion cell images, and you need to classify the cells: for instance, which is a healthy cell, which cell has a cancer mutation, and so on. And if a researcher or a doctor sits down and starts to annotate the cells, it's not that representative. You cannot really browse through a billion cells and expect to get a nice training dataset at random.
So another challenge is that not all phenotypes are discovered by the biologists, because you can have something which you just don't stumble upon, so it's not included in your training set. Your unlabeled dataset actually has labels which you didn't see, and you don't know they exist. So one of the greatest challenges for us is to find those instances of cells which can actually help you define new labels.
And this can be done by, for instance, exploiting cluster structure in the data, or doing some kind of spatially balanced sampling in the feature vector space, and so on. Another challenge is to actually explain to the biologists what kinds of questions they can ask about the data, to be honest. Okay.
[00:07:08] Unknown:
And for some datasets there are companies where you can actually crowdsource having the data labeled, whether it's images, having people look at them and say this is a car, this is a dog, etcetera, or something like sentiment analysis, having somebody go through and label whether the sentiment is positive or negative, or whether there's sarcasm. But for certain fields, particularly in the area of bioinformatics, I imagine there's some difficulty in having just any random person appropriately label the data, and I imagine that's why you need the sampling technique of surfacing only certain subsets of the data, so that somebody with domain expertise can label it appropriately. Is that accurate? Yeah. Exactly. So this domain expertise is really important. It's also that
[00:08:01] Unknown:
if I just sit down and start labeling cells, we won't get anything useful, because I'm a mathematician and a computer scientist; I don't have domain knowledge in it. The biologists do have domain knowledge, but you cannot crowdsource it, even to many biologists, because people are really specialized. If you are studying certain types of cells, certain diseases, certain markers, certain phenotypes, you can only find, like, a handful of people who are experts in that field. And they may be competing with each other, so they may not be willing to help each other out in this sense. So, yeah. And so
[00:08:45] Unknown:
the project that you built is intended to simplify the process of surfacing some of these subsets of the data for labeling. So I'm wondering if you can discuss a bit about what the project is, how it works, and your motivation for starting it in the first place. Okay. So actually, modAL was built,
[00:09:04] Unknown:
well, it grew out of a research project which I did. So, basically, I was trying to develop active learning strategies for something called active feature-value acquisition, and I came up with a lot of query strategies. Okay, so a query strategy is basically the technical term for this selecting of the most informative instance and presenting it to the biologist or so. So, basically, I came up with several query strategies, and I developed an actual framework where we can test these strategies. And I developed it in a way that it is compatible with every scikit-learn model; this framework can be used with any scikit-learn model, basically.
So I did this research project, and that was how the modAL prototype was created. I realized it may be useful for other researchers and other people who are actually applying active learning in practice, so I decided to make it a complete package, to make it so that other people can use it. Basically, this was my biggest motivation. So first it started out as something for me, a tool for myself, but basically I realized that it can actually be a tool for other people. So I created it, and I made documentation and tutorials. And one of my main aims with this is that I would like to help researchers actually develop active learning strategies.
And this is why I created modAL in such a way that if you would like to replace the underlying model or replace the underlying query strategy, you can do it in a single line of code, and you don't have to change anything else. This is why I call it modular: I designed it in such a way that even if someone is not an expert in Python or in programming, he or she can use this tool really easily to prototype active learning workflows, as the sketch below illustrates. So this was my main motivation.
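As a rough illustration of that modularity, here is a sketch assuming modAL's documented `ActiveLearner` interface, with `X_initial` and `y_initial` as placeholder seed data; swapping the estimator or the query strategy really is a one-line change.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling, entropy_sampling

# A first prototype: random forest + uncertainty sampling.
learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_initial, y_training=y_initial,  # placeholder seed data
)

# Trying a different model or strategy only changes the relevant line:
learner = ActiveLearner(
    estimator=KNeighborsClassifier(),   # swapped-in model
    query_strategy=entropy_sampling,    # swapped-in query strategy
    X_training=X_initial, y_training=y_initial,
)
```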
[00:11:14] Unknown:
And so as far as being able to obtain these labeled datasets and bring domain expertise to the overall process, I don't know if you're familiar with it, but there's another project called Snorkel that takes a slightly different approach, where the domain experts write various functions that apply labels to the data by running it through certain algorithms, whereas you're using a human in the loop to surface interesting data points and ensure that you're getting the most value for the time spent. So I'm wondering if you can discuss a bit about some of the differences in approach there, and some of the relative challenges of the human-in-the-loop approach as opposed to just having a function that can run across the entire dataset.
[00:12:02] Unknown:
Actually, I am not really aware of their approach. I haven't heard of this project you mentioned, but it sounds really interesting; I will check it out. Okay, so difficulties with the human in the loop. First of all, it's obvious, but even a domain expert can make mistakes, right? So the data you get can be noisy. This is one problem which can affect your results. Ideally it's not the case, but sometimes it is, so you have to be aware of that. Another problem is that actually selecting the specific instance which you want to query can take a lot of time. Many of these active learning algorithms are really computationally heavy, so it can take several minutes to present, like, a single unlabeled example to query.
And if you present the examples, I don't know, at random, you may get even better performance, because by the time you present one carefully queried example, if you had presented them randomly, you might have labeled, I mean, ten times as much data. This is actually a challenge which needs to be overcome. But the Snorkel approach, I'm not sure I can comment on that, because I'm not entirely sure about it. Sure. And so,
[00:13:25] Unknown:
when you're picking out these interesting data points to be labeled, what does the typical workflow look like, from start to finish, of having a desired model that you're trying to build and picking out these data points to ensure that you are getting the most value for time spent?
[00:13:49] Unknown:
Basically, you need two things in this workflow: a scikit-learn model and a query strategy, some of which are implemented in modAL. After you train your model on some kind of initial data, you can initialize an ActiveLearner object from modAL, passing in the query strategy and the model. Then you have just two simple methods: the query method, which uses the query strategy you provided and finds you the most informative instance, and the teach method, which actually teaches the model the newly acquired labels. And this is as simple as it gets; it's a few lines of code. If you want to replace the model or the query strategy, you simply need to change, like, one line, so if you would like to test out several strategies, you can do it very easily. One very important thing which is missing from modAL, and which I plan to include, is an interactive labeler.
So right now, it's basically just a high-level wrapper over scikit-learn models. But what I would like to do in the future is write a labeler which can actually help you label data interactively, preferably from a browser, maybe. But a typical active learning workflow is really easy in modAL; it's just 5 or 10 lines of code, as sketched below.
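Here is what that few-line workflow might look like, sketched against modAL's documented `query` and `teach` methods. `X_pool`, `X_initial`, `y_initial`, and `oracle_label` are assumed placeholders for the unlabeled pool, the seed data, and the human in the loop.

```python
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

learner = ActiveLearner(
    estimator=RandomForestClassifier(),
    query_strategy=uncertainty_sampling,
    X_training=X_initial, y_training=y_initial,  # placeholder seed data
)

for _ in range(10):
    # query() applies the query strategy to find the most informative instance.
    query_idx, query_instance = learner.query(X_pool)
    # The human expert supplies the label(s) (hypothetical helper).
    y_new = oracle_label(query_instance)
    # teach() retrains the model with the newly acquired labels.
    learner.teach(X_pool[query_idx], y_new)
```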
[00:15:26] Unknown:
And what are some of the metrics that you use when picking out the data points, to determine which ones will actually improve the accuracy of the model versus the ones that are within the range of accuracy that you're already working with?
[00:15:45] Unknown:
Okay, it's a really good question; this actually lies at the heart of active learning. So let me give you, like, a quick baseline and then some more sophisticated measures. The most basic measure is classifier uncertainty. If you have an already trained model, you can make predictions for your unlabeled dataset and measure the uncertainty of those predictions, and one of the simplest things you can do is pick out the instances which have the highest uncertainty. But this can fall short because, if you think about it, it's really shortsighted.
It cannot discover new phenotypes for you, and it's really biased by the particular model you have. But sometimes it's a good baseline, so it can actually give you better performance in some cases. Another very basic query strategy is the entropy of the predictions. You have the class probabilities, and basically, the higher the entropy of a prediction, the more useful that instance will be to you. This is something you can use, but it's also sort of shortsighted: it's biased toward the classifier you actually have on hand.
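Both baseline measures come straight from the model's predicted class probabilities. A small sketch, with `proba` standing in for a classifier's `predict_proba` output over the unlabeled pool:

```python
import numpy as np
from scipy.stats import entropy

proba = np.array([[0.10, 0.85, 0.05],   # a confident prediction
                  [0.40, 0.35, 0.25]])  # an uncertain prediction

# Classifier uncertainty: one minus the probability of the most likely class.
uncertainty = 1 - proba.max(axis=1)   # -> [0.15, 0.60]

# Prediction entropy: higher entropy means a more informative instance.
pred_entropy = entropy(proba.T)       # per-instance entropy of the distribution

# Either measure queries the pool instance with the highest score.
query_idx = np.argmax(pred_entropy)   # -> 1, the uncertain prediction
```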
So what can be done against this is to have several classifiers forming a committee, basically, and deciding the next query, the next instance to be labeled, by some kind of disagreement measure. So you can imagine you have, like, ten classifiers. Each of them was trained on different data, or they may even be different models: one can be a random forest classifier, one can be a k-nearest-neighbors classifier. So somehow they each capture different aspects of your data, and what you can do to counter this shortsightedness is measure the disagreement between these models.
For instance, there will be some instances where the models completely agree; there, you can suppose that your prediction is good and you don't need to improve. But there are some regions of the space where your two or three or however many classifiers actually have, like, a large disagreement, and you can query based on that. This is a little more sophisticated example of these query strategies. There are also query strategies which can actually exploit structure in the data.
For instance, you can have clusters in really high-dimensional data, and you can ask for labels from those clusters, using some kind of hierarchical clustering or so. This also helps you explore the data and overcome the shortsightedness.
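modAL ships this query-by-committee pattern as a first-class object. A hedged sketch assuming its documented `Committee` class and vote-entropy disagreement measure, with the seed arrays (`X_seed_a`, `y_seed_a`, etc.) and `X_pool` as placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from modAL.models import ActiveLearner, Committee
from modAL.disagreement import vote_entropy_sampling

# Committee members trained on (possibly different) seed data,
# each capturing different aspects of the data as described above.
members = [
    ActiveLearner(estimator=RandomForestClassifier(),
                  X_training=X_seed_a, y_training=y_seed_a),  # placeholders
    ActiveLearner(estimator=KNeighborsClassifier(),
                  X_training=X_seed_b, y_training=y_seed_b),
]

# The committee queries where its members disagree the most.
committee = Committee(learner_list=members,
                      query_strategy=vote_entropy_sampling)
query_idx, query_instance = committee.query(X_pool)
```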
[00:18:53] Unknown:
And is there any sort of general rule for how many data points you need to label to get the desired outcome with the machine learning model, or is it more a case-by-case basis, highly dependent on the data that you're actually working with? Yeah. I wish there were, but, unfortunately,
[00:19:13] Unknown:
I think there are no such rules. You cannot say, it's okay, if I sample 1,000 data points, then I will be okay. I think it's heavily dependent on your data. And so
[00:19:27] Unknown:
you mentioned that the primary framework that you're working with for modAL is scikit-learn. I'm wondering if you have any intention of extending it to support additional frameworks such as PyTorch or TensorFlow or Keras, or what the difficulty of that would be.
[00:19:46] Unknown:
It actually supports Keras, because a scikit-learn wrapper is provided in Keras. So you can take your Keras models and wrap them so that they have the scikit-learn API, and those models you can use right away. So Keras support is implemented, and I think this is one of the best features of modAL, that you can use these deep learning models through Keras. And I'm also planning to do this for PyTorch and TensorFlow, but that's for the future.
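The mechanism here is the scikit-learn wrapper that shipped with Keras at the time of this episode; newer TensorFlow releases have since moved or removed it. A sketch under that assumption, with a placeholder network and placeholder seed data:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from modAL.models import ActiveLearner

def build_model():
    # Placeholder network; any compiled Keras classifier will do.
    model = Sequential([
        Dense(32, activation='relu', input_shape=(64,)),
        Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

# The wrapper gives the Keras model the scikit-learn API,
# so modAL can drive it like any other estimator.
learner = ActiveLearner(
    estimator=KerasClassifier(build_model),
    X_training=X_initial, y_training=y_initial,  # placeholder seed data
)
```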
[00:20:18] Unknown:
But this would be really important. Yeah. And what have been some of the most difficult or challenging aspects of building and working with modAL?
[00:20:27] Unknown:
Okay. So as I mentioned, I am a mathematician, and I had basically no formal training in computer science or software development; I'm completely self-taught. So my biggest challenge was to write maintainable and readable code for others to use, improve, and contribute to. This was one of my greatest challenges, and I really focused on doing it right. Another challenge was that in order to have a general framework, you need to design a really good API; you need to design the right classes. This was also a challenge for me, because I really had no prior experience in object-oriented programming; in some sense, this is my first project. So, yeah, that was one of the greatest challenges, but I actually really enjoyed it. I really enjoyed designing the API and employing these design patterns.
So, yeah, from the algorithmic side, it was not that big of a challenge, because I speak the language of the algorithms; I'm just a newbie at speaking the language of software. But I hope I will do a nice job. Yeah. And
[00:21:46] Unknown:
you've mentioned some of the goals that you have for future versions of modAL, but are there any broader plans for future developments or improvements, or any direction that you want to take the library going forward?
[00:22:00] Unknown:
Yep. So, basically, the first item on my agenda is that I would like to include Jupyter notebooks with the examples. This would be the most important part because, as I mentioned, one of my goals is to make it usable for everyone, and Jupyter notebooks are great for teaching purposes. So this would be the immediate next job on my list. Another thing which I would really like to include is Bayesian optimization. You have probably heard about Bayesian optimization: it's an optimization technique which you use when evaluating your objective function is either really expensive or you don't have gradients. For instance, if you have a deep learning model, you want to tune the hyperparameters.
And in order to actually experiment, to try out a hyperparameter setting, sometimes you need to train it for several days. What Bayesian optimization does is build a Gaussian process regressor, or practically any kind of regressor, but mostly Gaussian process regressors, between the parameter space and the performance space. So, basically, it tries to predict the performance from the hyperparameters, and it has some kind of feedback loop which is really similar to active learning. It actually finds you those hyperparameters in the hyperparameter space which you should explore, which would give you either an improvement or really give you, like, new information.
And I actually included some kind of baseline Bayesian optimization objects in modAL, but they're still experimental right now, so I would like to improve them. For instance, I would like to be able to optimize over really weird search spaces, hyperparameter spaces. This is not strictly active learning, but it's really closely connected; the principle is the same. You have some kind of space about which you have no information, you ask questions, and you ask the next question based on the answer you got. So the feedback loop is essentially the same as with active learning.
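The experimental objects he mentions follow that same ask-and-teach loop. A hedged sketch assuming modAL's documented `BayesianOptimizer` with a maximum-expected-improvement acquisition function; `X_initial`, `y_initial`, `X_candidates`, and `evaluate_model` are hypothetical placeholders for the initial evaluations, the candidate hyperparameter grid, and the expensive training run.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from modAL.models import BayesianOptimizer
from modAL.acquisition import max_EI

# A Gaussian process regressor maps hyperparameters to performance.
optimizer = BayesianOptimizer(
    estimator=GaussianProcessRegressor(),
    query_strategy=max_EI,
    X_training=X_initial, y_training=y_initial,  # a few initial evaluations
)

for _ in range(10):
    # Ask which hyperparameters are most likely to yield an improvement...
    query_idx, query_inst = optimizer.query(X_candidates)
    # ...run the expensive evaluation there (hypothetical helper)...
    score = evaluate_model(query_inst)
    # ...and feed the result back, closing the feedback loop.
    optimizer.teach(query_inst.reshape(1, -1), np.array([score]))
```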
So those are my main goals. And there is the deep learning support: it would be really important to support PyTorch and TensorFlow because they are really important parts of this field. I mean, you cannot really avoid TensorFlow and PyTorch, so I would really like to include support for them. Currently, I'm the only developer of modAL, and I would welcome any feedback, any issues, pull requests, or contributions. So please, if you have any ideas or any suggestions, feel free to contact me or even send a pull request. I would be really happy.
[00:25:04] Unknown:
Alright. So for anybody who wants to get in touch with you or follow the work that you're up to or contribute, I'll have your contact information in the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the recent movie rendition of Peter Rabbit. I watched it with my family the other week, and it was hilarious and very entertaining and well done. So I highly recommend it for anyone. And with that, I'll pass it to you, Tivadar. Do you have any picks for us this week? Yes. So my pick is a book and
[00:25:37] Unknown:
an online course about the same subject. It's basically based on the book, which is written by a scientist called Uri Alon, and the title is An Introduction to Systems Biology: Design Principles of Biological Circuits. The reason I picked this is that it's kind of common knowledge in computer science that many algorithms were inspired by nature or biology; evolutionary optimization or even neural networks, for instance, were inspired by biology. But in some sense, the opposite is also true: what we have created has analogs in living systems. So actually, it was a really life-changing experience for me when I realized that cells, bacteria, organisms are actually programmable.
So you can actually program organisms to express, like, Boolean functions, and this book explains how you can do it. As I said, it was a fascinating and life-changing experience for me. So if you are fascinated by evolutionary algorithms or neural networks, or by the effect of biology on computer science in general, you can also try to look at it from the other perspective, and it might be really interesting to you. So this was my pick. Well, thank you very much for taking the time out of your day to discuss the work that you're doing with modAL.
[00:27:08] Unknown:
It's definitely an interesting project and one that seems to be adding a lot of value to the biological research community. So thank you for that, and I hope you enjoy the rest of your day. Okay. Thank you for the opportunity. I hope you enjoy your day as well, and I would also like to thank everyone for listening.
Introduction to Tivadar Danka and His Work
Discovering Python and Choosing a Programming Language
Understanding Active Learning in Machine Learning
Challenges in Labeling Data for Machine Learning
Overview of the modAL Project
Human-in-the-Loop Approach vs. Automated Labeling
Workflow for Active Learning with modAL
Metrics for Selecting Data Points in Active Learning
Scalability and Framework Support in modAL
Future Plans and Improvements for modAL
Contact Information and Closing Remarks