Summary
Python has become a major player in the machine learning industry, with a variety of widely used frameworks. In addition to the technical resources that make it easy to build powerful models, there is also a sizable library of educational resources to help you get up to speed. Sebastian Raschka's book Python Machine Learning has come to be widely regarded as one of the best references for newcomers to the field. In this episode he shares his experiences as an author, his views on why Python is the right language for building machine learning applications, and the insights that he has gained from teaching and contributing to the field.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Sebastian Raschka about his experiences writing the popular Python Machine Learning book
Interview
- Introductions
- How did you get introduced to Python?
- How did you get started in machine learning?
- What were the concepts that you found most difficult in your career with statistics and machine learning?
- One of your notable contributions to the field is your book "Python Machine Learning". What inspired you to write the initial version?
- How did you approach the challenge of striking the right balance of depth, breadth, and accessibility for the content?
- What was your process for determining which aspects of machine learning to include?
- You have published three editions of the book, from 2015 through December of 2019. In what ways has the book changed?
- What are the biggest changes to the ecosystem and approaches to ML in that timeframe?
- What are the fundamental challenges of developing machine learning projects that continue to present themselves?
- What new difficulties have arisen with the introduction of new technologies and the rise of deep learning?
- What are some of the ways that the Python language lends itself to analytical work?
- What are its shortcomings and how has the community worked around them?
- What do you see as the biggest risks to the popularity of Python in the data and analytics space?
- What are some of the common pitfalls that your readers and students face while learning about different aspects of machine learning?
- What are some of the industries that can benefit most from applications of machine learning?
- What are you most excited about in the applications or capabilities of machine learning?
- What are you most worried about?
Keep In Touch
Picks
- Tobias
- Sebastian
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Python Machine Learning (Packt)
- Buy On Amazon (affiliate link)
- UW Madison
- Pascal
- Delphi
- R
- Perl
- Bioinformatics
- Codecademy
- Udacity CS101
- Andrew Ng
- Coursera
- Support-Vector Machine
- Bayesian Statistics
- MATLAB
- scikit-learn
- NumPy
- Pandas
- Sebastian’s Blog
- Perceptron
- Heatmaps In R
- The Hundred Page Machine Learning Book by Andriy Burkov
- ImageNet
- Random Forest
- Logistic Regression
- XGBoost
- Theano
- Generative Adversarial Networks
- Is This Person Real / This Person Does Not Exist
- Reinforcement Learning
- AlphaGo
- AlphaStar
- Ray
- RLlib
- OpenAI
- Google DeepMind
- Google Colab
- CUDA
- Julia
- Sebastian Raschka, Joshua Patterson, and Corey Nolet (2020). Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence. Information 2020, 11, 193
- Swift Language
- Swift for TensorFlow
- Matplotlib
- Differential Privacy
- PrivacyNet
- YouTube recordings of Stat453: Introduction to Deep Learning and Generative Models (Spring 2020)
- ffmpeg-normalize
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI and CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
Your host as usual is Tobias Macey. And today, I'm interviewing Sebastian Raschka about his experiences writing the popular Python Machine Learning book and working in machine learning with Python. So Sebastian, can you start by introducing yourself?
[00:01:11] Unknown:
Yeah. Hi, everyone. My name is Sebastian Raschka. I'm originally from Germany, and I've been an assistant professor of statistics at UW Madison since 2018. Besides that, I'm still very active in the open source community. I like programming, and, as you heard from the introduction, writing books occasionally. And, yeah,
[00:01:33] Unknown:
that is, yeah, me in a nutshell. And do you remember how you first got introduced to Python?
[00:01:38] Unknown:
Yeah. That has been quite some time ago. I didn't learn it in school. In high school, we actually learned Pascal, and I remember using Delphi. But back in the day, I wasn't really paying that much attention, because computer games were pretty popular back then. I remember playing Counter-Strike and all these fun games during high school classes. But later on, during my undergrad, I taught myself R, and I learned Perl in a bioinformatics course. And for some reason, I don't know exactly what the trigger was, but Python became very popular. That must have been around 2012, 2013.
And I essentially taught it to myself. I mean, not entirely myself, but I took an online class. I forgot which one came first, but I remember taking the Codecademy interactive tutorial, and there was a new Udacity class, CS101, which was about programming a search engine. So it was learning computer science basics with Python, but then also a lot of applications, using Python to implement the search engine. And I also remember, I think his name is Steve Huffman, the person who developed the original version of Reddit; I think he came back to Reddit and is working on it again. He was also teaching a section of that course. So I found that very inspiring and very interesting, and there were great resources, which was really great. So yeah, that was my quick journey into Python. And
[00:03:13] Unknown:
in terms of your work in machine learning, how did you get interested and involved in that area of the industry?
[00:03:20] Unknown:
Yeah, that is also almost a decade ago. So how I got introduced to machine learning was by taking another class. That must have been CS 801 at Michigan State University, Statistical Pattern Classification, which was really interesting back then. I didn't really know much about machine learning when I went into that class. It was called statistical pattern classification, and I thought, okay, that might be interesting or useful for my projects. Back in the day, I was working on computational biology problems, and I thought, pattern classification, maybe that allows me to identify or predict certain patterns in my protein structure data sets. So I took that class, and it was kind of eye opening and very fascinating.
A lot of this was still very math heavy: maximum likelihood estimation, maximum a posteriori, expectation maximization algorithms, and these kinds of things. But it was a first glimpse into the world of machine learning, I would say. And from there, I was really interested and started looking for resources. I think it must have been around the time when Andrew Ng started Coursera, because I remember also taking the Coursera class after that class. It was one of the first MOOCs ever, and Coursera was brand new. That was the first course, and I read about it and was super interested. So I remember taking that class as well, which was very complementary to the class I took at the university, because the topics were quite different. I mean, it was again pattern classification and machine learning, but the methods were different. Andrew Ng's course, if I remember correctly, focused more on SVMs and neural networks, and, like I said before, the class I took was more statistics heavy, I would say Bayesian methods and things like that. But yeah, that was the story in a nutshell also. And so you mentioned that your background prior to that was largely in bioinformatics or computational biology,
[00:05:20] Unknown:
and then you were getting deeper into these areas of statistics and then some of the more advanced algorithms for being able to do some of these pattern matching capabilities. I'm wondering what you found to be some of the concepts that were either most interesting as you were getting involved in that, but also some of the ones that were most difficult to comprehend or understand where to apply them. Yeah. So I would say the most difficult part was
[00:05:45] Unknown:
when I learned about all these topics, we had these very nice datasets, toy datasets, where you had a table where the rows were your data points or data records, and the columns were the different measurements or features. But how would this map to, let's say, my protein structure problem? There was no direct application of what I learned to what I was working on. Because protein structures are three-dimensional structures, you have XYZ coordinates in a big file. So how does that translate to the nice little tables I learned about? How can I apply the machine learning methods to that? And then also, both classes were based on MATLAB, and I kind of struggled a lot with MATLAB. That was also the time when I found scikit-learn. So I did it a little bit backward: I implemented everything in Python for myself to understand things, and then I had to translate it to MATLAB. And back then all the resources were mostly MATLAB, as far as I remember. So there was this problem: how can I apply what I learned to my projects? That was, I would say, one of my biggest struggles. I had these two disjoint worlds, where I was still working on the computational biology problems in the classic way, but then I was working on machine learning problems that were actually not that closely related to my project anymore, because back in the day, I couldn't really see how I could work with the datasets.
So I was also doing a lot of fun projects on the side, like sports predictions, mostly daily fantasy sports predictions. And that is really how I learned Python and data wrangling. Because when I worked on these fantasy sports predictions, okay, this is a little tangent, but I was writing a lot of scripts to scrape websites and to, I would say, massage the data, clean the data, get the data into the right shape. And that is where I learned all about NumPy, pandas, and things like that. It also gave me more confidence in applying machine learning to my computational biology problems, because I was much more proficient in using Python and all these computing tools. One of the biggest parts of machine learning is just getting the data into the right format, or just knowing how you can apply machine learning to your data and what data you need. And I would say still today, that's one of the biggest problems. I mean, it's relatively easy when you go to Kaggle: you have a table with certain records and features as columns, and then it's really straightforward applying machine learning algorithms. You can just go to a library like scikit-learn and try out different things, see what works best. But in real life, the problem is really how you get your data into the right format.
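As a rough sketch of the workflow Sebastian describes, once the data is wrangled into that tabular shape, applying a model is only a few lines. This example is illustrative only: the file name and column names are hypothetical, but the pandas and scikit-learn calls are standard.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical CSV: rows are records, columns are features plus a label.
df = pd.read_csv("player_stats.csv").dropna()

# "Massage" the data: encode a (hypothetical) categorical column numerically.
df["position"] = df["position"].astype("category").cat.codes

X = df.drop(columns=["won_matchup"])  # hypothetical label column
y = df["won_matchup"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try an off-the-shelf model and see what works best.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```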
[00:08:27] Unknown:
And 1 of the notable contributions that you've made to the field of both machine learning and its applications within Python is the book Python Machine Learning that we mentioned at the open. And I know that it's become a very popular reference for people who are either just getting into machine learning or trying to gain a better understanding of it. And I'm wondering, what was your inspiration for
[00:08:49] Unknown:
getting involved in being an author and writing that initial version? Yeah, that is an interesting question. The answer is maybe a little bit boring: the publisher reached out to me, and I thought it might be a fun challenge. A little bit of background: as an undergrad, I wrote a very small book before, small like a booklet. It was Heatmaps in R. That was before I was working with Python; I had my R phase. Nothing against R, which is, I think, still a great language. Anyway, I was writing this book back in the day, and it was a really short book, I think 50 or 60 pages. But that was my first taste of writing books, and I kind of liked the process. I liked thinking about things, organizing things, and then writing them down. Also, when I studied, what I always did is I wrote things down in my own words, so I was always compiling reference material for myself. And writing a book is in a way very similar, except that you pay more attention to, let's say, writing proper sentences and adhering to grammar rules, and things like that. So how I got involved was, I was blogging at the time, because, as I mentioned, I was really interested in machine learning, learning all about the techniques. I set up a blog where I wrote about topics like principal component analysis, linear discriminant analysis, and general techniques. I think I also wrote an introduction to neural networks, and different topics. They were kind of disjoint, but I also had the plan at some point to write more of a series. I'm not quite sure if I remember the title correctly, but I called it, I think, introduction to neurons or neural networks, part 1, and the plan was to have a multi-part series going from a very simple computational neuron, like the perceptron, then building the knowledge up to a multilayer perceptron, and so forth. And that was about the time when the publisher reached out; they probably saw my blog posts. They asked me whether I would be interested in writing a book. I was a student back then, and it was a fun challenge. As a student, you have much more time than you have right now. So I thought, why not, writing a book is maybe cool, because it gives the whole thing a little more structure than having these blog articles, and you also get the benefit of having reviewers and working with a team. So I was thinking, maybe I should write a real book this time, compared to the small booklet, the Heatmaps in R book. And that is how I started. It's also like six years ago now. But the title was originally Python Machine Learning Essentials.
And the restriction was a page limit of 200 pages in that format. But I was compiling so much reference material while writing that we then luckily had the discussion to extend it to Python Machine Learning, what the book is right now, where they dropped the page limit, or extended it at least. And, yeah, that was how it all came together. I was already writing on my blog, the publisher reached out, and I thought it might be a fun challenge and a good opportunity as a student. And I'm wondering how you approached the challenge of determining what the right balance
[00:11:50] Unknown:
was of the overall depth of material to cover, the breadth of what to include in it, and how to keep it accessible for people at different levels of their journey into machine learning or their knowledge of Python and how it's being used in that space? Yeah, that is also a good question. So
[00:12:07] Unknown:
like I mentioned before, I liked writing things down for myself. While I was studying machine learning by myself, I compiled texts for myself from various different resources, and I simply used these as a template for writing the chapters. So I had my own notes, and they were a mix between broader concepts, basically Cliff's Notes, and then also how I would use these concepts in practice. For the book, I really just went to my personal notes and fleshed them out a little bit. There was also still a page limit. Like I mentioned before, the limit got extended from the 200 pages in the beginning, but there was still a page limit, as I remember, so I couldn't really go overboard, which is a good thing. Because, when I teach in person, I sometimes go on these tangents, maybe you notice that here on the podcast too, and then things get way too long; I tend to get distracted this way. So the page limit was actually a good thing that helped me keep the balance of depth, breadth, and accessibility. It helped me think about, okay, what is absolutely necessary? Let's say I have 25 pages per chapter, what are the most essential things I should cover? And how I usually tried to organize the book is that I first thought about the general concepts: why do I care about this topic? What is the goal of this method? What can it do? But then also really seeing how we can apply it. Even though it's only a toy dataset in the book, I usually try to have an application of each method that I explain. And having a toy dataset is actually not too bad, because in most cases you use that as a template for your real project. If you worked on a real project in the book, I think there would be too much data wrangling involved, too many distractions, because all the methods you apply are very specific, and they may not apply to your project. Though, again, I would also think it might be useful to have a book where you really have a case study. But yeah, in this book, it's basically just showing you how to apply these methods. And I hope they are useful as a template, as a cookie cutter, or
[00:14:13] Unknown:
more like your framework for applying these methods to your own projects. And I know that it has become a very popular reference because I've heard it mentioned in a few different contexts. And I imagine that some of the reason for that popularity is because of this enforced brevity and the fact that you did have to try and boil it down to the core ideas to be able to carry them across so that people are able to pick them up in a shorter period and start trying it out in their own work rather than digging through pages and pages of the theoretical underpinnings and ways that it can be applied before they get to the point of this is how you do it in your own work. Yeah, I agree. I think also,
[00:14:52] Unknown:
attention spans became much shorter over time. I mean, everyone's attention span, thanks to the internet. And there was recently also a book by Andriy Burkov called The Hundred-Page Machine Learning Book, which is also a very popular book. This one is, I think, even briefer than mine. It doesn't have the application aspect; I think it's without code, just explanation. But again, it's a very popular book because it's something digestible. So I also try to keep mine digestible, because people often want to get going. And I think it's important to get going, because that's how you keep the motivation up. When people ask me how to best learn about machine learning, I always suggest they work on a project, or get a bigger picture first, apply the methods, and then dig deeper. Because if you start from the bottom up, I think it's easy to lose motivation: you take ten math classes, it's two years later, you still haven't done any machine learning, and you're wondering where it all leads. And I think this is why in high school most of us found math very boring. It is the bottom-up approach, where you don't yet see why this is useful. But going back now, you would be very excited about the math part, because you know, okay, this is actually useful for understanding my machine learning methods. So starting with the big picture first, and then digging deeper, I think that is more motivating. And that's also what I tried to do. My book is, of course, not the most comprehensive book; there are many better theory books out there if you only want to learn about the theory. My book is more of an overview. It's a 500-page, or now 700-page, overview, but it's still an overview. It's not covering every bit of detail about each method. It's long now, but it's still a book to get you started; it's not the final destination. There are books that are way more in-depth, but, like I said, it takes a while to digest them, and they don't necessarily help you get going or start working on your project. And you published that initial book in 2015.
[00:16:59] Unknown:
And since then, you've gone through three editions, the most recent being published at the end of last year. I'm wondering what have been some of the most notable changes in the content or the overall approach that you've taken with those successive editions?
[00:17:13] Unknown:
Yeah. So, the first book, well, I wrote this in 2014, 2015. In this book, the focus was mostly on machine learning, because the title of the book was Python Machine Learning, and back then deep learning was just starting to become popular. It was around 2012, early 2013, when AlexNet, a deep convolutional neural network, had a really impressive outcome on the ImageNet competition, roughly halving the error compared to all traditional methods. That is when people started to pay more attention to deep learning, but it was still in its infancy. So the focus of this book was more on machine learning. Nowadays, so, I recently wrote an article with Joshua Patterson and Corey Nolet, who work at NVIDIA, reviewing the recent trends in machine learning. There we really made the distinction between traditional machine learning and deep learning. Machine learning is more of a summary term for the traditional methods; I would group support vector machines, random forests, and decision trees into that category, and then there's deep learning. Back in the day, when you said machine learning, you usually meant techniques such as random forests, support vector machines, logistic regression, even XGBoost, classic machine learning methods. So the focus of the book back then was mostly on machine learning, and I really still think it's useful to start with machine learning before you learn about deep learning. There was only a very small section on neural networks. The last chapter was a chapter using Theano. That was before TensorFlow; there was no TensorFlow back then. I think TensorFlow got introduced one year later. So I had this one chapter on deep learning with Theano, but the biggest focus of the book was really machine learning in Python.
And in the 2nd edition, we extended the machine learning part more into the deep learning realm. The last chapter, that was when TensorFlow was popular, I replaced with a chapter on TensorFlow. We dropped Theano completely, and then we added additional chapters on convolutional networks and recurrent neural networks. Also, Vahid Mirjalili got involved. He is a co-author; he helped a lot with the chapters, because it was also the time, 2017, when I was about to graduate, so I was super busy. He was a good friend and collaborator; back then we also worked on several research papers together, and we partnered up to write the new chapters. Yeah, that was the 2nd edition, where we extended the deep learning part, because, as in real life, deep learning became way more popular in 2017.
And we also both did a lot of research involving deep learning, so that was a good fit. Then for the 3rd edition, we went a step further. That was shortly after TensorFlow 2.0 was released, so we had to update all the latest chapters from TensorFlow 1.x to TensorFlow 2.0. There was a major change in the library: not only did they clean up the API, but they also switched from the static graph to the dynamic graph mode, which is, I would say, a very different paradigm. Before, when you used TensorFlow, in contrast to using something like NumPy, you usually had a separate graph definition step and a graph execution step. So it was not very Pythonic, I would say. You defined the graph in Python, but it looked more like a pseudo-language or a meta-language, and then you executed the graph. And that made it very hard to debug, and very clunky.
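To make the paradigm shift concrete, here is a minimal sketch of the difference. The TensorFlow 1.x style is shown in comments for contrast; the eager code below it runs as written under TensorFlow 2.x.

```python
import tensorflow as tf  # TensorFlow 2.x

# TensorFlow 1.x style: define a static graph first, then run it in a session:
#   x = tf.placeholder(tf.float32)
#   y = x * 2.0
#   with tf.Session() as sess:
#       print(sess.run(y, feed_dict={x: 3.0}))

# TensorFlow 2.x: operations execute eagerly, like NumPy.
x = tf.constant(3.0)
print((x * 2.0).numpy())  # 6.0

# Graph-style performance is still available by compiling a function.
@tf.function
def double(t):
    return t * 2.0

print(double(tf.constant(4.0)).numpy())  # 8.0
```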
And the community kind of complained about that, so they switched to this eager, or dynamic, graph mode in TensorFlow 2.0, and we rewrote the chapters to reflect that, to have all the newest features of TensorFlow. Then, again, we added two more chapters. One was on generative adversarial networks, which are maybe among the most popular deep learning methods for applications. You can do a lot of fun things, like generating certain things with that. There's a website, I think it's called This Person Does Not Exist, or something like that, where you can see some examples of generating, for example, faces entirely with deep learning, faces of people who don't exist but who look like real people. So one chapter was on generative adversarial networks, and the last chapter was on reinforcement learning. I remember in the 1st edition, I had a section in the introduction where I presented the three big fields of machine learning, which are supervised learning, unsupervised learning, and reinforcement learning. In the book itself, though, I covered supervised learning a lot, which is also the biggest area of machine learning, and unsupervised learning, but I didn't even mention reinforcement learning after the introduction. So a lot of readers wrote me: hey, where's the part about reinforcement learning? In this 3rd edition, we finally brought in a chapter about reinforcement learning as well. And maybe now it's also one of the really interesting and popular areas of research. There were lots of interesting applications recently, for example, two years ago now, when AlphaGo beat the world champion in Go, the board game, and also AlphaStar, which is an AI that can play StarCraft II. I think they didn't beat the world's best player, but they were very competitive, at least. So with that, I think reinforcement learning is also a very interesting area of research. Of course, what I just mentioned, playing Go and StarCraft, are toy problems. But you can also think of this as something where you can design robots, for example for Amazon warehouses, and, just in general, as a method that learns a series of actions. Reinforcement learning is really about learning a series of actions rather than just outputting one label prediction.
And yeah, just one last thing about reinforcement learning. I still go for walks, I mean, I try to keep the social distancing, but when I go for walks, I now see a lot of these Starship robots here on campus. These are little food delivery robots, and especially for situations like right now with the coronavirus, where you probably want to avoid getting too close to people, such delivery robots are, I think, a very nice and interesting idea.
[00:23:21] Unknown:
It's definitely an interesting approach to handling the current situation. And to your point about reinforcement learning, I know that it's gaining ground currently as well, because at least up until recently, some of the challenge of applying it in a realistic setting was the complexity of ingesting the feedback data in near real time to then retrain the supervised and unsupervised learning approaches that you deploy. You might periodically refresh the model, but it's not being done in real time with the feedback happening in that same window. And I know things like the Ray project are trying to solve some of those challenges from the computational and deployment perspective.
[00:24:11] Unknown:
Yeah, I think reinforcement learning is really an interesting topic, but it is still computationally prohibitive. There's a common joke about whether we can train a machine learning or reinforcement learning model without, let's say, boiling the ocean, because right now it's something that is kind of reserved for big companies who can do that. I would like to see methods that can also be used by an individual. Maybe at some point we develop methods that are more efficient, so that an individual, let's say with one GPU, can train interesting models efficiently. That would be great.
[00:24:49] Unknown:
And despite the evolution of machine learning and deep learning and the different capabilities that have been coming out in recent years, I'm wondering what you still see as being some of the core fundamental challenges of developing machine learning projects that continue to present themselves despite some of these advancements.
[00:25:09] Unknown:
Yeah. One problem would still be access to resources. GPUs, for example: especially if you use deep learning, you nowadays need not only one, but multiple GPUs, and I think this is one of the really big bottlenecks. Right now, the most exciting applications, but also research, come from major companies, for example Google, Facebook AI Research, OpenAI, DeepMind, and so forth. These are really big companies with a lot of resources. It's a bit hard for smaller teams, smaller companies, or academia to really have competitive methods or interesting applications, because not everyone has access to many GPUs. I have enough GPUs for the research students I'm working with, but, for example, when I'm teaching the deep learning class, we have 70 students in the class, and currently we don't have GPUs for everyone. So what I recommend is, for example, using free resources such as Google Colab, where you can use one GPU, which is maybe enough for studying or trying things out. But the students also work on class projects, where it's up to them what they work on, with my feedback, and they are usually very ambitious. They would like to do multiple things at once, and that is where the limitation comes into play: you have only one GPU, and you have to be very resourceful with that. This is, I think, one of the biggest challenges right now, that the resources have not caught up yet. Before, with classic machine learning, I would say that was not so much of a problem. Of course, if you worked with big datasets, there were Hadoop and things like Spark that were maybe sometimes required. But most of the time, I would say 90 percent of the use cases, you just needed a computer with enough memory, and memory was comparatively cheap compared to having GPUs now. So right now, one of the bottlenecks is GPUs, but the other one is the data. For deep learning, you usually need these very big datasets, and if you think about something outside text and image data, there's not much where deep learning is super useful right now. So that is also still one of the challenges: how can we make deep learning useful for non-image and non-text datasets? Recently, there has been a lot of research applying deep learning to graph data, that is, social graphs, but also graph structures such as molecules. It's also one area I'm currently working on. That kind of goes back to my earlier computational biology problems, where I mentioned I had some struggles mapping the methods onto the problem. Nowadays, with graph neural networks, or graph convolutional networks, there are hundreds of graph deep learning methods, and it is getting more interesting in that area. So yeah, those would be the major challenges: the resources for deep learning, having the right data format for machine learning and deep learning, and then also the data size for deep learning, which usually requires hundreds of thousands or millions of data points, which we often don't have.
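For readers unfamiliar with graph convolutions, the core operation is small enough to sketch by hand. This follows the common normalized-adjacency formulation, where one layer computes ReLU(D^{-1/2}(A+I)D^{-1/2} X W), rather than any specific library; the graph and feature sizes here are made up.

```python
import torch
import torch.nn.functional as F

# Toy graph with 4 nodes (think atoms in a molecule): adjacency matrix.
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 1.],
                  [0., 1., 0., 0.],
                  [0., 1., 0., 0.]])
A_hat = A + torch.eye(4)  # add self-loops

# Symmetric degree normalization: D^{-1/2} (A + I) D^{-1/2}.
d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

X = torch.randn(4, 8)   # node features, e.g. atom descriptors
W = torch.randn(8, 16)  # learnable weights of one layer

# One round of message passing: each node aggregates its neighbors' features.
H = F.relu(A_norm @ X @ W)
print(H.shape)  # torch.Size([4, 16])
```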
[00:28:15] Unknown:
And to your point about the limitations of the availability of GPUs, I'm wondering what your thoughts are on some of the recent research into changing how the algorithms are implemented so that they perform much better on CPUs alone, without GPUs. I don't remember the specifics of it, but I know that I've seen some articles about that coming out recently.
[00:28:36] Unknown:
Yeah. That is a good point. I am not an expert in this area at all, because I'm not usually working on the deployment side; I'm in academia. But as far as I know, CPUs are mostly involved for inference, and in deep learning, inference just means prediction. So I think you're right that the algorithms are more efficient on the CPU, but mostly only for the prediction. You still have to train the models, and I think model training on the GPU is still much, much faster. I remember seeing an article recently where people developed a new method for speeding things up on the CPU using caching, memoization, for example, just caching results. That was also a very interesting direction. I'm not sure where this will lead; this is very experimental. But what I can rather imagine is developing more specialized hardware. I think all the major companies are working on that: IBM, NVIDIA, Google with its tensor processing units. As far as I know, TPUs are now competitive also for training, but, for example, you can't buy TPUs yourself; you have to use Google's services. So in a way, there is no other really good hardware right now that allows you to train deep learning models very efficiently. I mean, GPUs, but then again, not everyone has access to a GPU. So what we would need is to design the algorithms such that we can just run them on a laptop, for example. So that is one aspect. And related to that, just thinking about this, another challenge is that deep learning libraries become more and more difficult to read and modify.
TensorFlow and PyTorch, for example, are very efficient libraries, but they are just using Python as a glue language, gluing together parts from CUDA and cuDNN, which are libraries for NVIDIA graphics cards, and other low-level C++ code. Which is nice, because then Python is not really a big bottleneck: you only use Python to execute lower-level, more efficient code. However, the problem with that is that it's very difficult to develop your own custom code. Let's say I want to develop a custom convolutional layer, for example. Digging into the C++ code or CUDA code is really difficult. It would be great to untangle this again a little bit, to have something that maybe doesn't require a person to understand low-level code, so you are able to develop machine learning systems using a simple programming language, but are also able to make custom modifications while still having something efficient. That is one of the challenges as well, because PyTorch is evolving, and TensorFlow as well, but the code base also becomes more and more complex. I was recently digging into, for example, just finding out how the weights are initialized, and with every iteration of the software library, it's harder to find the code that is doing a certain thing.
So that is also, I think, one challenge: the tools we develop may become too hard for researchers to modify, so you're kind of restricted to what other people have implemented for you.
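At the Python level, at least, composing custom layers out of existing autograd operations remains straightforward; the difficulty Sebastian describes starts when you need an operation that doesn't already exist as a building block. A minimal sketch, using PyTorch as an example (the layer itself is a made-up toy):

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Toy custom layer: a linear transform with a learnable output scale."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # Composed entirely from existing autograd ops, so no C++/CUDA needed;
        # a genuinely new op would require a low-level kernel instead.
        return self.scale * self.linear(x)

layer = ScaledLinear(4, 2)
print(layer(torch.randn(8, 4)).shape)  # torch.Size([8, 2])
```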
[00:31:41] Unknown:
One of the core challenges of Python in general is that it does a great job of acting as that glue layer, but whenever you need to do anything that's performance intensive, you need to implement it in a different language runtime and then add the bindings on top. And in terms of just the Python language itself, I'm wondering what you see as being the benefits of this overall approach to doing data analysis and analytical workloads, and some of the risks that you see to the popularity of Python in the data and analytics space as we continue down this road of needing much more performant algorithms to analyze the volumes of data that we're working with? So as far as that goes, I would say I'm not concerned with
[00:32:24] Unknown:
the popularity of Python. And I think it's also important to keep productivity in mind, to have a language where you don't need to relearn things. Back in the day, I looked at Julia, which I think is a great language, but you can see that there's some hurdle: people don't want to learn a new language just for the sake of learning a new language, even though Julia would make certain things more efficient. I think we are at the stage right now where we are happy with the status quo and want to focus more on developing new methods, rather than choosing a different language for doing the same thing. But again, I can imagine it might be frustrating. Tryolabs had an article on Swift for TensorFlow, which is another interesting idea: using the Swift language, which is the Apple language for iOS. I'm not that familiar with Swift, but the syntax didn't look so bad, and there are certain capabilities where you can run things natively. For example, it has much better multithreading than Python, and automatic differentiation can be done natively there, if I understood that correctly. That might enable getting better performance while having a language that is still relatively easy to read. But again, I think this will be more of a niche thing. I still think people will keep using Python, because there are a lot of benefits. That's also one of the reasons why I picked it up: you can do a lot of things with it without having to relearn a lot of tools. Just a small tangent: before I was using Python, I was using a mix of shell scripts, Perl, and R. I used shell scripts to automate the running of some command line tools, then I wrote Perl scripts to analyze, or, I would say, wrangle my data, and then I used R to do the plotting. So I had all these different tools, and they didn't connect to each other very well. I mean, you can connect, let's say, Bash and Perl pretty efficiently, but still, I was writing intermediate outputs and then reading those intermediate outputs with the next program, and stuff like that. I had to keep a lot of things in my head across all these different tools. Python just made my life easier: I could analyze the data, I could run everything there, and then I could also just visualize my results using Matplotlib. That was much more pleasant to use. So I would say, even though it was maybe not the most efficient thing, sometimes things are slow in Python, I'm still happy with it, because I feel like I'm productive. And in deep learning, you are usually doing many things in parallel anyway, and sometimes you just don't mind waiting a little bit, because you work on something else in the meantime. But, yeah, I can understand that from an engineering perspective, maybe there might be something better to look out for.
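The shell/Perl/R pipeline Sebastian describes collapses into a single short script in Python, with no intermediate files handed between tools. A minimal sketch; the input file and column names are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

# One language for the whole pipeline: read, wrangle, and plot.
df = pd.read_csv("measurements.csv")             # hypothetical input file
summary = df.groupby("sample")["value"].mean()   # hypothetical column names

summary.plot(kind="bar")
plt.ylabel("mean value")
plt.tight_layout()
plt.savefig("summary.png")  # no intermediate files between separate tools
```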
[00:35:17] Unknown:
And in terms of the feedback that you've gotten from your readers and from the students that you're teaching, what are some of the common pitfalls that they run into and some of the challenges
[00:35:28] Unknown:
that are continuing to present themselves as people try to get involved in this space of machine learning and advanced data analytics? Yeah. I think one of the main, I wouldn't call it a pitfall, but one of the main issues is the problem I had too: how to apply machine learning to your own problem. Usually you have machine learning and then you look for projects that match your method. So you have machine learning, you think, oh, this is very exciting, I want to use machine learning. But then, what can I apply it to? So you look for datasets that are available, but that isn't usually very useful. It's useful for learning, but, I would say, it's not advancing the field, because someone has done it before; you're kind of replicating things. But if you want to work on novel research projects, where you have your research project and then want to see how you can apply machine learning to it, then it becomes a little bit difficult to think about how you can get the data into the right format.
And still, when I talk to people, or collaborators, there's this difficulty of how to map your problem onto machine learning, and then also how to interpret the results. One thing is to think about what you care about most. For example, if you have some output variable and you want to understand which input variable is most correlated with it, you don't need machine learning for that; it's important to keep in mind that other methods exist. With machine learning, you usually care mostly about predictive performance. So it's important to be clear about whether you need to understand the relationship or whether you just care about getting good prediction results, and for different applications, different methods are more useful in that respect. Then, for example, once you have a machine learning model that performs very well, you can inspect it. There are different methods where you can look at what the machine learning model looks at when making a prediction. But then there's also the danger, I sometimes find, that people think that is also how humans think. What I mean is, for example, if you analyze face images and you want to predict a certain attribute, and let's say the model pays most attention to the eyes, then you say, okay, the eyes are predictive of X, Y, Z. But maybe that's not what a human looks at; it's just what the machine learning model looks at. So there's also a danger of over-interpreting, of confusing what the machine learning algorithm uses with what humans do, for example when people really want to understand what humans are attending to. There's this, I would say, disparity between these two interpretations.
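One standard way to "look at what the model looks at", as described here, is permutation importance: shuffle one feature at a time and measure how much the test score drops. A minimal sketch using scikit-learn's built-in permutation_importance on a bundled dataset; note that, exactly as Sebastian cautions, the result describes the model, not human reasoning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature and measure how much the test accuracy drops.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[i]}: {result.importances_mean[i]:.3f}")
```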
[00:38:20] Unknown:
And another aspect of the overall machine learning space, as you mentioned earlier, is the need for access to datasets that are relevant to the problem that you're trying to solve. I'm wondering what your thoughts are on the industries or problem domains that could benefit most from applications of machine learning, but might be hamstrung by a lack of access to data that they can apply to the problems they're trying to solve. Yeah, I would say one of the biggest problems, or it's more like an issue
[00:38:50] Unknown:
that we created and are now trying to solve. It's kind of related to privacy. When I grew up in Europe, or still when I go to Europe, I think people think a little bit differently there, compared to here in the US, about who should have access to my data, my information, my privacy. Most industries, I would say, rely on having some information about individuals; for example, they want to personalize things. And in other industries, I recently read about insurance companies running a pilot study where people wear wearable devices such as smartwatches, and then tracking the health data to see whether it is predictive of health in general. And then they promise to lower the premiums if you wear such a device.
And I think this is a great opportunity. But then it's always: how much do you want to share, and things like that? That's a very sensitive issue. Sharing more data is, of course, useful for machine learning, but it's the same reason we don't have glass walls in our houses, right? We don't want everything to be public, or seen by someone; it makes us uncomfortable, and there are certain things you maybe don't want to share. And I think right now, one of the issues is that people have lost trust, to some extent, if you think of Facebook and so forth.
So how can you use machine learning while respecting the privacy of individuals? I think this is maybe one of the biggest challenges right now. We want to use the data for something positive. For example, again going back to the coronavirus issue, tracking where people go may have been useful, but then again, you don't want to always be tracked. And if you set a precedent for tracking people, then what do you do with that data, or that access, in the future? So there's a really fine line between collecting and using as much data as you can, but also being ethical about it. This is a problem I don't have an answer for, but it is, I think, one of the biggest challenges when collecting
[00:41:02] Unknown:
data. And in your own research or areas that you're paying attention to, what are some of the aspects of machine learning or new areas of research that you're most excited about in terms of the applications or capabilities?
[00:41:16] Unknown:
Yeah. So personally, going back to the very beginning of our interview here, when I mentioned the computational biology aspect, where I didn't have a clear vision of how I could apply machine learning or deep learning to my computational biology projects: now, with graph neural networks, for example, I think that is a very exciting area of research, at least for me. Using data representations that are more natural, I would say, for drug molecules or potential drug candidates, we call them small molecules for some reason. So yeah, modeling small molecules and protein structures with graph structures, which can then be used with deep learning, that is what I'm currently very excited about. And you mentioned that one of the things that you're currently worried about are issues with privacy or ethics
[00:42:07] Unknown:
in the data science domain or in machine learning applications. I'm wondering if there are any other areas where machine learning is being used,
[00:42:16] Unknown:
or industry trends or challenges facing the industry as a whole, that you're most worried about. Yeah, I think the thing I'm mostly worried about is, again coming back to the data collection part, people losing trust. The problem may be that at some point we get very strict laws, and then it won't be possible to collect even data that is not capable of identifying individuals anymore. That would, I think, hamper pretty much everything around machine learning. There are many positive applications of machine learning, and I feel like, if we are not careful and things get out of hand, we may become too worried about things, and then we also hamper the good applications of machine learning. Related to that, with the tracking itself, there are methods to address it. For example, we maybe don't have to report all the data back; we can be selective about what data we report back. And then there's also a big area of research called differential privacy, where you essentially develop methods such that you can't identify individuals from a collection of data points. I think these are very important areas of research to keep in mind, and one of the things to look out for to make machine learning more trustworthy again, so that we try to be ethical about what we do with the data. And one problem is also that many companies promise to keep your data private, and that is totally reasonable, but as you've probably seen recently, there are so many data hacks and leaks. So I think just storing the data in the first place is maybe dangerous. Having data from users is a big responsibility.
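Differential privacy, mentioned above, is often introduced through the Laplace mechanism: add noise, calibrated to how much one person can change the answer, to any statistic you release. A toy sketch; the dataset and the epsilon value are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def private_count(values, predicate, epsilon=0.5):
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Adding or removing one person changes a count by at most 1, so noise
    drawn from Laplace(scale=1/epsilon) masks any individual's contribution.
    """
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 47, 33]           # toy dataset
print(private_count(ages, lambda a: a > 40))  # noisy answer near the true 3
```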
So I think it would be really good if companies were clearer about how much data they store, what they store about you, and why they have to keep a history of it, because the history of data is usually what can leak. In a series of research projects, I worked on something related to that. We call it semi-adversarial networks for protecting privacy. We started out with a project where we wanted to hide soft biometric attributes, which are attributes of a person that can aid in biometric identification.
Going one step back, biometric identification usually means matching people, identifying someone by facial images, fingerprints, or the iris of the eye. And soft biometrics are attributes that can be extracted from biometrics; for example, gender, age, and health of a person would be soft biometric attributes. For certain applications, for example if you have a passport scanner, you may want to identify a person, but when you store images in a database, let's say you have a security camera and want to match people against potentially dangerous people to make sure there's no security threat in your area, you don't need to have all the information about a person in the database. You don't need to keep the image forever, for example, and you don't need to know the age, health, or gender for that. So we developed methods for selectively hiding certain soft biometric attributes.
In the recent version, which we call PrivacyNet, we try to hide age, gender, and skin color information while keeping the image intact so that it can be used for matching purposes. The goal would be to not store more data than you really need for your application. If you're only concerned about matching, let's say, is this the same person in front of my door camera as in the passport or driver's license picture, then, if the method works with high accuracy and the matching is highly accurate, we don't need to know the age, gender, and skin color, and we don't even store that information. Then even if our database leaks, people don't learn everything about the people in the database, so we don't leak personal information, for example.
[00:46:26] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the movie Trolls World Tour. I haven't actually watched it yet, but I'm pretty excited for it, and it just came out on the day that we're recording this. Another thing that I'm excited about is the recent development of movie studios releasing their movies directly to streaming because all the movie theaters are shut down. I'm hoping that continues to be a thing even after we resume normal activities. So, if you're looking for a movie to watch, I definitely recommend at least taking a look at that one. And with that, I'll pass it to you, Sebastian. Do you have any picks this week? I would first say I definitely
[00:47:09] Unknown:
was thinking about watching a movie, so thank you for the recommendation here; I may even check that out. My pick would be a Python library: the ffmpeg-normalize library. Since we switched to online teaching, I have been recording my lectures and uploading them to YouTube; we can maybe include a link in the show notes if it's useful for our listeners. Currently, I'm teaching the Statistics 453 class, Introduction to Deep Learning and Generative Models, and I dived into all the audio and video processing for that. The problem was that I wasn't consistently close to my microphone, so the audio was sometimes loud and sometimes not so loud. I was looking into audio processing and found this really nice library, ffmpeg-normalize, which normalizes the loudness of a video to a standard recommended level. And if you have multiple clips, it will bring all the clips to the same volume, which is really nice. I think it makes the videos much easier to watch; you don't have to crank the volume control on your computer up and down. So, yeah, that would be my recommendation for today. I'll definitely have to take a look at that for my podcast as well, so thank you for the recommendation. Thank you for your time, and I appreciate all the work that you're doing to help educate people on the capabilities of machine learning and how to get involved with it. I hope you enjoy the rest of your day. Yeah. Thank you for the invitation. I really enjoyed our interview today, and I hope you also have a nice rest of the day. We'll talk to each other soon.
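For reference, ffmpeg-normalize is a command line tool that wraps ffmpeg and performs EBU R128 loudness normalization by default. A minimal sketch of batch-normalizing lecture clips from Python; the file names are hypothetical, and re-encoding the audio is assumed to be needed to write back into an MP4 container:

```python
import subprocess

# Requires ffmpeg on the PATH and `pip install ffmpeg-normalize`.
for clip in ["lecture01.mp4", "lecture02.mp4"]:  # hypothetical file names
    subprocess.run(
        ["ffmpeg-normalize", clip,
         "-c:a", "aac",  # re-encode audio so it fits the MP4 container
         "-o", clip.replace(".mp4", "_normalized.mp4")],
        check=True,
    )
```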
[00:48:42] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Guest Introduction
Sebastian's Background and Journey into Python
First Encounter with Python and Learning Resources
Introduction to Machine Learning
Challenges in Applying Machine Learning to Real-World Problems
Writing the Python Machine Learning Book
Balancing Depth and Breadth in Educational Content
Evolution of the Book Across Editions
Core Challenges in Machine Learning Projects
Advancements in CPU Performance for Machine Learning
Python's Role in Data Analysis and Machine Learning
Common Pitfalls and Challenges in Learning Machine Learning
Industries Benefiting from Machine Learning
Exciting Areas of Research in Machine Learning
Ethics and Privacy in Machine Learning
Closing Remarks and Picks