Summary
Statistical regression models are a staple of predictive forecasts in a wide range of applications. In this episode Matthew Rudd explains the various types of regression models, when to use them, and his work on the book "Regression: A Friendly Guide" to help programmers add regression techniques to their toolbox.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Matthew Rudd about the applications of statistical modeling and regression, and how to start using it for your work
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing some use cases for statistical regression?
- What was your motivation for writing a book to explain this family of algorithms to programmers?
- What are your goals for the book?
- Who is the target audience?
- What are some of the different categories of regression algorithms?
- What are some heuristics for identifying which regression to use?
- How have you approached the balance of using software principles for explaining the work of building the models with the mathematical underpinnings that make them work?
- What are some of the concepts that are most challenging for people who are first working with regression models?
- What are the most interesting, innovative, or unexpected ways that you have seen statistical regression models used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on your book?
- What are some of the resources that you recommend for folks who want to learn more about the inner workings and applications of regression models after they finish your book?
Keep In Touch
- @MatthewBRudd on Twitter
Picks
- Tobias
- The Argument podcast from the NY Times
- Matthew
Links
- Regression: A Friendly Guide
- Sewanee: The University of the South
- Sewanee Data Lab
- Mark Lutz Python books
- Elements of Statistical Learning
- Linear Regression
- Logistic Regression
- Modeling Binary Data
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host, as usual, is Tobias Macey. And today, I'm interviewing Matthew Rudd about the applications of statistical modeling and regression and how to start using it for your work. So, Matthew, can you start by introducing yourself?
[00:01:08] Unknown:
Absolutely. And thanks for having me. My name is Matthew Rudd. I teach math and statistics at a small college in Sewanee, Tennessee, that's called Sewanee: The University of the South. I'm also the department chair for math and computer science. I'm the director of the Sewanee Data Lab, which is a data science training program in the summer. We've had 1 summer so far, and we're gonna have our next round this next summer. My training's in pure mathematics, and then I got into data science and regression probably about 10 years or so ago when I started teaching statistics for the first time. I've never taken a stats course ever in my whole life.
So I learned it by teaching it, and that's sort of how I got into all of this, which we'll talk more about as we go. And do you remember how you first got introduced to Python? Yeah, actually. So I first learned Python, or I first learned about Python, I should say, about 20 years ago when it was fairly, well, it wasn't brand new, but it was not what it is now. It was a new language, kind of a niche language. And at that time, I was a grad student at the University of Utah, and they have an institute there called the Scientific Computing Institute. And they were using Python as a scripting language for a much bigger software product that would do modeling and simulations.
And so I first heard of it there, and I think I had the very first edition of Programming Python by, I think, Mark Lutz, which even then was, like, you know, how many hundreds of pages it was. It was a big, thick, you know, giant book. And so I would read some of that, but I was not directly involved in the project, so I just sort of knew about things. And I was trying to finish my PhD in math. So I kind of learned some Python then, didn't use it that much. And then, as we all know, Python has since become this sort of juggernaut in software and in science. And so about 10 years ago, I started teaching statistics, and I started using R.
And so I mostly used R for the last 10 years, I would say. And I would play with Python here and there, and I would always run into trouble with virtual environments. It was always a pain in the neck. And so I would spend just enough time to have some sort of, like, problem come up, and then I would basically say, you know, screw this. I'm gonna go back to R. Because for me, for teaching statistics and doing, you know, data analysis and regression, R is a great tool for the job. And so I sort of play with Python here and there. And I spent a couple months once trying to learn Django, but I'm not actually a website developer, so it was just sort of for fun. It's always been the same thing. But then I got this contract to write the book for Manning, and I started writing the book in R.
And at some point after about 4 or 5 chapters, they asked me to switch to Python, which makes total sense because Python books sell better than R books. So I think around June or July this last summer, I translated the first few chapters into Python, and now I'll write the rest of it in Python. So that's sort of a long answer, but I've known about Python for a long time. I'm really much more experienced with R, but I'm getting to know Python better. And it's kinda funny because after struggling with Python here and there with, you know, virtual environments and some linguistic differences, it's starting to grow on me, which kinda surprises me, which is a good thing. So I'm gonna write the rest of the book in Python. And I use Python in some of my consulting work, so it all sort of fits together nicely.
[00:04:24] Unknown:
And so as you mentioned, you're in the process of writing a book that's focused on statistical regression and making it more accessible to programmers. And I'm wondering if you can talk to some of your motivation for embarking on that project because I know that writing a book is very rarely the path to getting rich. So I'm curious what the sort of motivations are.
[00:04:45] Unknown:
That was my whole plan.
[00:04:47] Unknown:
What your motivations are for writing the book and some of your overall goals for it? I've been teaching elementary statistics for years now.
[00:04:54] Unknown:
And here at where I teach, we have a second course on statistical modeling. So I taught that course, I think, maybe 7 years ago for the first time, and that's really a class on regression, so more advanced topics. And when I started looking for a book for that class, there's really not a good book to help people go from elementary stats to, you know, more advanced statistics, like modeling and regression. There are a ton of intro stats books, and they're all kind of the same, and they're all pretty terrible in some ways because they go really fast through lots of topics, and the datasets are usually nice and small and clean.
And then there's a ton of those books, and there are a ton of graduate level textbooks on statistics, on regression, you know, whatever kind of methods you wanna learn about. But there's a huge gulf in between. Like, there just really aren't very many books that help you go from an intro stats course to, you know, more advanced stuff. And so I think that was in 2014 when I taught that class the first time. And so I started putting together notes. In fact, I used the book An Introduction to Statistical Learning, by Hastie and Tibshirani and Witten and, I'm forgetting 1, Friedman, I think. And it's sort of the baby version of Elements of Statistical Learning, which everyone talks about. And that's 1 of those books everyone talks about, and I believe that not that many people have actually read it. They talk about it, and they always reference it. Anyway, so I used ISL, the baby version, and there are some mathematical expectations for the reader that aren't well articulated and are sort of too much to expect of most undergraduates.
So at a certain point, I realized that book was too advanced for most of my students because most of my students have had elementary statistics. Maybe they've had some upper level math courses, maybe not. And some material is just not accessible to them. So the basic goal is to write a book that would help students go from an intro stats course to more advanced modeling. It's the book I wish had been around when I started learning this stuff several years ago. So that's the basic motivation.
[00:06:55] Unknown:
In terms of the applications of regression in particular, what are some of the primary use cases where it's employed and some of the applications that software engineers and mathematicians might find for it in their day to day work?
[00:07:10] Unknown:
Okay. So regression is a huge category of techniques. So there are a bunch of different use cases, and I'll sort of stick to what's gonna be in the book. So the 3 classes of regression models that I'm gonna focus on that I'm writing about are linear regression, which shows up in every ML book out there, logistic regression, which also shows up in every ML/AI book out there, and then regression for counts, which does not show up that often. The 3 models are all the same in that they are ways to essentially guess average outcomes or predict averages. So every regression method in some sense predicts the average response to a given predictor or collection of predictors or for a given, you know, class of people or individuals. So in the case of linear regression, linear regression predicts the average for a given quantitative input or categorical input. So the average life expectancy or the average income or the average, you know, literacy rate or something like that. So linear regression is for predicting the average quantitative response, something like an income, like blood pressure, stuff like that.
Logistic regression is typically for predicting whether something will be a success or a failure. It's for predicting outcomes when there are, like, good outcomes and bad outcomes, and you want the chance of 1 or the other based on some inputs. So, like, what's the chance of living if someone has a certain condition? What's the chance someone has a chronic medical condition based on various factors like age, weight, blood pressure, all kinds of things. And then regression for counts is based on, you know, estimating how many things will happen in a given unit of time or in a given unit of area. So from the high level, they're all the same. They're all ways of guessing averages based on inputs. But the kind of average you're guessing will then determine the method you're gonna use. So if you're guessing quantitative things, linear regression is the basic tool. If you're guessing outcomes, this or that, binary outcomes, logistic regression is the tool. And a count is just, you know, a non negative integer, like you're counting things 0, 1, 2, and so on. If you're dealing with counts, then you would use regression for counts. And regression for counts is typically either Poisson regression or negative binomial regression or something more complicated depending on how things are dispersed. But those are the 3 basic use cases.
And beyond that, there are, you know, tons of specialized versions of these things. They're all kind of the same, and then you tailor the method to the kind of data you've got, basically.
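To make those 3 families concrete, here is a minimal sketch of fitting each one with statsmodels. It uses synthetic data invented for illustration, not anything from the book, and the variable names are hypothetical.

```python
# Illustrative only: synthetic data, not the book's examples or datasets.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(20, 80, n)

df = pd.DataFrame({
    "age": age,
    # quantitative response, e.g. blood pressure -> linear regression
    "bp": 100 + 0.5 * age + rng.normal(0, 10, n),
    # binary response, chance rises with age -> logistic regression
    "chronic": rng.binomial(1, 1 / (1 + np.exp(-(age - 55) / 8))),
    # count response, events per unit time -> regression for counts
    "coughs": rng.poisson(np.exp(0.5 + 0.02 * age)),
})

# Linear regression: average quantitative response for a given input.
linear = smf.ols("bp ~ age", data=df).fit()

# Logistic regression: probability of a binary outcome.
logistic = smf.logit("chronic ~ age", data=df).fit(disp=False)

# Regression for counts: Poisson is the usual starting point.
counts = smf.poisson("coughs ~ age", data=df).fit(disp=False)

for model in (linear, logistic, counts):
    print(model.params)
```

All three calls have the same shape: a formula relating a response to predictors, and a fitted model whose coefficients describe how the predicted average changes with the inputs.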
[00:09:40] Unknown:
As far as the audience that you are targeting for the book, you know, as you mentioned, you are using a programmatic approach to introducing the ideas, but I also know that you take the time to explain the underlying mathematical principles. And I'm wondering sort of who the different target groups are that would benefit most from reading this book and some of the skills or knowledge that you're hoping that they take away from it. The reader I have in mind when I'm writing a book is someone who has had a first course in statistics,
[00:10:11] Unknown:
has, you know, maybe a little bit of experience with mathematics, maybe not so much, but, you know, ideally some. And they wanna go from that introductory knowledge to some more specialized advanced knowledge. As you said, I'm trying to work programmatically, systematically, incrementally, I should say, to go from, you know, what are the basic ideas of the method to what are some of the underlying mathematics. Because, like I said, there are not very many books that take people from intro stats to advanced methods. The 2 classes of books sort of are either superficial or they're too high level mathematically. And so I'm trying to sort of, again, bridge that gap. And for every different topic, I start with some examples based on data.
So here's a big data set. And if you analyze this big data set, so for example, with logistic regression, I have a dataset on chronic conditions in adults, or actually in people more generally: chronic conditions and, you know, the likelihood of chronic conditions in people of various ages. So if you take a big enough data set, you can often sort of extract the regression model in kind of a brute force way, or you can see why you would ever think to use it. So in that case, if you look at chronic condition incidence by age, the logistic curve just falls out naturally when you look at age groups. So that's a way of introducing someone to the topic and showing that this actually happens with real data. Like, you get this nice curve where, you know, people below a certain age are unlikely to have a chronic condition. People above a certain age are likely to have a chronic condition.
The curve shows up there giving you probabilities, and it does fall out naturally, like I said, from the actual real data set. And so if you start there, you can introduce the topic. And then I try to build from there and talk about the mathematics. And I have it now separated into 2 chapters per topic. The first chapter is just, you know, the regression method in action. You know, here's a data set or a few data sets. Here's what happens when you build the model, when you investigate the model. And in the second chapter on each topic, I talk about more of the theory. Because I feel strongly, in fact, that the theory is really important to understanding the output. So, you know, in working with students and faculty and reading blog posts, what you see often is people will, you know, quote unquote do regression by applying, you know, Python or R. They will build a regression model. They'll get significance codes, and then they will just go from there and say, okay. This one is significant. That one's not, and then they move on. I suspect that most people don't actually know what those significance codes are all about.
And I don't think most people know what, like, deviances are for generalized linear models, all these different things. And they're complicated, but not that complicated. And I feel like if you see some real datasets and sort of work into it, you can help bring it out sort of in a natural organic way. So there's no question that some of the mathematics is kinda complicated, but it's also important. My hope is that when a reader finishes the book, they will have seen some concrete examples that make sense and will at least have some sense, if not some expertise, in the theory that's behind all the models. Because the theory is super important. And if you violate the assumptions, then your results, you know, you can't use them so much. So that's sort of the goal.
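As a rough illustration of that brute force idea (not the book's actual code or data; the age and chronic column names are hypothetical), you can compare the observed proportion with a chronic condition in each age band against the curve a logistic fit produces:

```python
# Sketch of the "curve falls out of the data" idea described above, with
# hypothetical column names (age, chronic) standing in for a real dataset.
import pandas as pd
import statsmodels.formula.api as smf

def empirical_vs_fitted(df: pd.DataFrame) -> pd.DataFrame:
    # Observed proportion with a chronic condition in each ten-year age band.
    bands = pd.cut(df["age"], bins=range(20, 91, 10))
    observed = df.groupby(bands, observed=True)["chronic"].mean()

    # Fitted logistic curve evaluated at the band midpoints.
    model = smf.logit("chronic ~ age", data=df).fit(disp=False)
    midpoints = pd.DataFrame({"age": [band.mid for band in observed.index]})
    fitted = model.predict(midpoints)

    return pd.DataFrame({
        "observed_proportion": observed.values,
        "fitted_probability": fitted.values,
    }, index=observed.index)
```

If the data really do follow a logistic curve, the two columns track each other closely band by band, which is the kind of concrete check described above.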
[00:13:22] Unknown:
Yeah. And there's definitely a big challenge of making things too accessible because at some point, you know, anybody can take a package off the shelf and say, oh, I ran the code, and it did the thing. So now I know what I'm talking about. Or, you know, oh, I've got a model, so now I'm gonna put it into production without realizing all of the other things that go along with care and feeding of that model or the proper interpretation of the outputs. And so I'm wondering what you have seen as some of the pitfalls that people fall into when they are, you know, going down that path of saying, oh, I've just used scikit-learn, and I can build this model. And now I know this is the output of the model, so I'm going to use that for some other input that I don't fully understand or I don't necessarily know what the confidence interval means in this context. And so just some of the other conceptual traps that people fall into when they just decide to take an open source package and run with it. The first thing that comes to mind is
[00:14:18] Unknown:
lots of times people will treat, you know, response variables the wrong way. Like, lots of times people will apply linear regression when it shouldn't really be applied. So, for example, I'm thinking about there's this thing called the happiness score or the happiness index for a country. And I think it's a 10 point scale or something like that, but the score is, like, a 1, a 2, a 3 up to a 10. It's something like that. But the score is totally categorical. It's not a measurement. Right? So but once you've got quantitative scores, you can compute all kinds of things based on the scores, averages, and whatnot. And so if you have a bunch of countries and a score for each country, you can do linear regression and then say things about, you know, like, you know, what are the happiness scores of these countries? How do they depend on some predictor?
But the reality is that response variable is not quantitative, and so you shouldn't do linear regression. And it doesn't even make sense to talk about the average happiness score because they're categories, not measurements. And so this is 1 of my big pet peeves with data analysis: I mean, some people are very careful, but some people are not very careful about, you know, what kind of data do you have? How does it even make sense to summarize that data? And so what regression models even make sense to apply? That's a big 1 for me. The other biggest pitfall I've run into is people are way too quick to want to build some complicated model, either a statistical model or a machine learning model, when sometimes just basic exploration of the data is sufficient. If you have a huge data set, then sometimes you could just take the huge data set and look at averages or other summary statistics by categories or, you know, however you wanna do it. And you can get important insights without building some regression model and without throwing it into some neural network or deep learning model. You know? So I think people are just far too quick to take the latest, fanciest gadget and throw data into it and get data out of it and then wanna, you know, draw conclusions from that. This happens all the time with consulting work where, you know, people wanna know, like, you know, how does tuberculosis depend on these predictors or on coughing? And, you know, sometimes just some simple stats are sufficient, or a simple model is the best thing to use because it's simple and interpretable.
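A tiny sketch of that kind of look-first exploration, with hypothetical column names (region, happiness_score), treating the ordinal score as categories rather than as numbers to be averaged:

```python
# Minimal "look before you model" illustration; column names are hypothetical
# and this is not code from the book.
import pandas as pd

def explore(df: pd.DataFrame) -> None:
    # Distribution of the ordinal score: counts per category, not an average.
    print(df["happiness_score"].value_counts().sort_index())

    # Proportion of each score within each region; often a table like this
    # answers the question without fitting any regression model at all.
    table = pd.crosstab(df["region"], df["happiness_score"], normalize="index")
    print(table.round(2))
```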
So yeah. So I think knowing what kind of data you actually have and respecting that data, and not being too quick to go for something complicated when something simple will do, those are the 2 biggest things I can think of. I'd like to call out the fact that you made the distinction between a statistical model and a machine learning model. And I'm wondering if you can dig deeper into the semantics of that and maybe even how those 2
[00:16:51] Unknown:
ideas or applications relate to deep learning, which has, you know, obviously taken the lion's share of attention lately?
[00:16:58] Unknown:
Yeah. I think that's an important distinction. With a statistical model, basically, you sort of acknowledge that outcomes, whether it's a single number or a classification or whatever, are probabilistic. There's some randomness involved. I won't say the outcomes are random, but there is some randomness involved. So there's, you know, variation in sampling, just natural variation from person to person or from thing to thing. But those kinds of models, again, they acknowledge that there's some unpredictability, some uncertainty in outcomes, that the outcomes are, you know, mathematical random variables. And those random variables have some distributional properties, but they also have some random error, random effects built in. And then you try to build a model that accounts for those things as much as you can or estimates effects as much as you can from the data you have access to.
And so, again, linear regression, logistic regression, regression for counts, those are all statistical models that try to predict outcomes, but take into account the uncertainty and the randomness that's involved. So with the count model, for example, you might say, like, you know, the number of times a person coughs per hour is gonna have a negative binomial distribution with this average and this variance. And so how many times will a person cough the next hour? I don't know, but I know that it's probably between this and that. Right? So and then you observe things and see how well your model does with observations versus what you expected to see. I mean, there's a lot of overlap between machine learning models and statistical models. In fact, regression is, I guess, a machine learning model because you, you know, you build a model based on data, and you rely on your software to do the calculations. Those are machine learning models, but then there are machine learning models that are not so interpretable, that are more like these black boxes.
And so 1 of the nice things about regression models is that they are, like I said, interpretable. You have to check all kinds of assumptions, but you can explain why a logistic model works. You can explain what it means when you've built it in ways that you can't explain how deep learning models work. So deep learning models are, you know, notoriously not interpretable. Right? So, you know, you dump your data in, and you get out these classifications. Why does it work? I think that there are some people who've made progress on that front. Like, neural networks are essentially, you know, high order polynomial regression. Right? Great. Fine. But, I mean, that doesn't really help with the interpretation of the results.
That's really the distinction that I think is important is that, you know, some models are, you know, mathematical. They have these random variables. You're describing them. You can interpret the results. And then some models are kind of voodoo black boxes. And sometimes that's fine, and sometimes that's not fine. Right? So, you know, there are all these examples now of built in biases where machine learning models that aren't interpretable are really making, I would say, flawed decisions because the data that's put into them is flawed. And so they make judgments that aren't really appropriate and are not fair and all these kinds of things. I think it's important to know what the assumptions of any model are. You have to know about the data going into the model, and then ideally, be able to interpret the output once you get it. Because for humans, that's much more important than just, you know, approving or denying a credit request or, you know, a loan application. You know? You probably wanna know why you were approved or denied.
But then other things, you know, it doesn't matter so much. For example, like with coughing, I work with a company that is building AI cough detection tools. And, you know, knowing whether you just coughed or not, I can imagine that it doesn't matter so much why it identified that as a cough or not, as long as it's accurate. Right? So I think it depends on the context, but interpretability is important
[00:20:41] Unknown:
and all that kind of stuff. In terms of being able to get people up to speed on the programmatic and mathematical principles that go into regression, I'm curious how you've approached that balance of introducing the ideas, showing some of the code to be able to express those ideas and build a model, and then being able to provide the appropriate context and background from a mathematical theory perspective to understand how this works and why it works and, you know, how deep you want to go in the, you know, in enumeration of the proofs?
[00:21:15] Unknown:
Again, I think that the way I'm trying to do it now, and I have a lot of feedback from readers, is start with some actual datasets first and build it up slowly, and then sort of fold in the mathematical stuff after getting some concrete experience with the datasets. Where I thought you were headed with that was sort of the under the hood algorithms for fitting the models. And I talk about that some, but I have not gotten into, like, implementations of least squares fitting or anything like that. Just because there's so much other stuff to talk about. And as far as the language goes, you know, the ideas of regression are really language agnostic, whether it's R or Python or C or whatever. I'm just taking sort of a data first approach where we look at some datasets, see how the method, you know, sort of falls naturally out of what's in the data, and then how you can use the regression model when you have small datasets
[00:22:05] Unknown:
as a substitute for having lots of data. Did that answer your question? I'm not sure it did. But no, that was good. And another interesting element of writing any book that has to do with software is that at some point, it's going to be wrong, where, you know, today everything works, and then 6 months from now everything breaks because there was a change in the language, or the package that you used as the example has bumped the version, and nobody knows that they need to pin it to an earlier version. And I'm curious how you have approached some of that challenge of letting people continue to learn from and appreciate the book, you know, 5 years from now without having to worry about, oh, well, you know, I'm using this version of NumPy, which doesn't compile on my M1 MacBook now.
[00:22:49] Unknown:
I have dealt with that by ignoring it. So that's a super important point. And like I mentioned earlier, I've had frustrations with Python before, and those are the kinds of frustrations I've had where a notebook works 1 day, and then, like, 2 days later, it doesn't because, you know, matplotlib has a different version now. I'm like, what happened? And, yeah, I don't know yet what the answer to that question is. 1 of the things that I planned to do when I first, you know, proposed the topic to Manning was I was gonna write the book without any code at all, without any language at all, and make it all pseudocode.
But that didn't really work because then there was no way to say, like, oh, here's this data. Here's this plot. Here's what you get. So 1 of my frustrations with the other book, ISL, that I used was that they had results, but they didn't have the code for how they actually made the plots. And I remember thinking, like, it'd be really good to know how they actually produced that plot, what they did to get this result. And so I was gonna go down the pseudocode path. It just did not work for me. Then again, I wrote in R and had not switched to Python yet, but I don't know the best way to deal with that, to be honest.
And I still have some hope that there will be 2 versions of the book, 1 in R and 1 in Python. Because at this point, I actually have versions in both languages. Once I finish the chapter I'm on, I will no longer have versions in both languages, but I don't know. That's a tricky issue, I think. I will say for regression, the basic stuff, it's probably not as important as if I were working on a book on, I don't know, pick your, you know, like, if I were using Elm or something that's under active development, it'd be a bigger issue. But even with SciPy and statsmodels, like, there are things you can do in R that are harder to do in Python. So there's no question that some of that will change and will evolve and things will break. And, you know, let's just pretend nothing will break. How about that?
[00:24:38] Unknown:
That's always a good strategy. Yeah. I think 1 of the approaches that I've seen people use in this situation is having an example GitHub repository that has all of the code samples with the dependencies pinned so that you can say, as long as you install it with these versions, then it should work. Or using things like the Google Colab notebooks where you say, click on this link. It'll open a notebook that has the environment predefined so you don't have to worry about setting it up on your laptop, and then you can play with it. But no matter what approach you go with, there are going to be limitations, and people will end up breaking it. So at some point, you just have to abdicate responsibility and say, this worked at the time that I wrote it. Good luck.
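One small, purely illustrative step in that direction (no claim that this is what the book or its repository does) is to have each notebook print the exact versions its examples were run against, so readers can pin those same versions later:

```python
# Illustrative reproducibility note for a notebook: record the interpreter
# and library versions the examples were actually run with.
import sys
import importlib

for name in ("numpy", "pandas", "statsmodels", "matplotlib"):
    module = importlib.import_module(name)
    print(f"{name}=={module.__version__}")

print("python", sys.version.split()[0])
```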
[00:25:14] Unknown:
Exactly. And I'm hoping that the ideas are sort of general and basic enough or language agnostic enough that, you know, if someone has a hard copy of the book and they don't open it for 5 years, at least the basics will come across and still be applicable. But yeah. And the other thing is with datasets, because with datasets, you know, you have a URL or whatever, a source for the dataset. And then for some data, it's just not there 6 months later or a year later. You know? Like, that's a real problem too. Or I'm working with some people, and we're working with WHO verbal autopsy data. And our biggest challenge right now actually is just getting old survey instruments. This is not a regression problem, but it's a related issue. So, like, we have data from 2008 on verbal autopsy, and we can't find the WHO 2008 survey instrument to know what the column names mean. I mean, it's kind of a silly thing, but it's a totally important realistic thing that, you know, whether it's software or, you know, data, any of it, like, there are real challenges with getting data, updating software, all that kind of stuff. It's definitely
[00:26:19] Unknown:
a perennial problem of the age that we're in where everything is iterating and evolving so quickly that, oh, look at this new shiny thing, all these great things that we can do. But now I can't do the things that I used to be able to do 2 years ago or 5 years ago. And, you know, now I need to go hunt down an old CRT monitor with the appropriate baud rate modem with the, you know, specific version of DOS that this was written for and make sure that I have to put this entire machine in a glass case so that I can reproduce this specific piece of research that happened 20 years ago.
[00:26:48] Unknown:
Absolutely. Although, it is better than it used to be. Like, when I first learned Python, you know, there was no GitHub. There was no Stack Overflow. All these things that we use all the time now and sort of take for granted. We have different problems now, but it's nice that you can actually do things. You know? So even mathematical things, like, you know, when I was in grad school, I learned about wavelets for signal processing, but there was no wavelet tool. I mean, like, MATLAB maybe had a toolbox then, but unless you happened to be in a certain research group that was doing a certain thing, you couldn't even, like, really get your hands on an algorithm or a dataset or, you know, like doing machine learning 20 years ago, it was totally different than now. Now we have repos of datasets and examples and tutorials and,
[00:27:30] Unknown:
you know? Yeah. Now you're stuck in analysis paralysis of which framework do I wanna use. And
[00:27:36] Unknown:
Exactly. And I was just complaining about how, you know, people write blog posts and they get things wrong, but, like, at least there are blog posts now, and then you can, like, see what other people have thought about and worked on. So Yeah.
[00:27:46] Unknown:
For people who are starting to embark on the path of studying regression and figuring out how to apply it, figuring out how to prepare the data, figuring out the features that they want to incorporate into their algorithm, figuring out, you know, which category of regression they wanna build and how to approach the modeling. What are some of the concepts that you have found people stumbling over most frequently, whether they're coming from a mathematical or a programmatic background?
[00:28:14] Unknown:
I think, actually, just basic things like what makes sense for the kind of data you've got. You know, even working with, like, faculty colleagues who have survey data or other kinds of data. I mean, just basic things like, is this data categorical or quantitative? With this data, like, with regression for counts, are you counting things, or are you measuring things that could be continuous? That's number 1. And start out simple. Like, just explore your data with some basic visuals and tables. Get some proportions for your categories.
Don't immediately jump into some generalized additive model when, you know, a scatterplot is gonna tell you what you need to know. And so those are the basics. I think people just are way too quick to, you know, wanna use some fancy thing. And if you need a fancy thing, fine. But I think you should understand what it's doing and how it works. I talk to people and read, again, things people write about, you know, like, generalized additive models. People mention them all the time, and they'll say, like, oh, I built a model and have these smoothing functions, you know, but I'm not sure what to do with this thing. I'm like, well, that data you have is not, you know, quantitative. So why are you doing this thing? But they don't really know. Yeah. I think just starting with some very basic things first saves a lot of trouble
[00:29:32] Unknown:
and headache. Even when you have a certain level of expertise, it's still the stupid things that trip you up where, you know, you've been fighting with a particular bug for a day and a half and you realize that it was because you flipped the order of a sequence of letters or, you know, you forgot to put an underscore somewhere. And it's just a silly typo that has been stymieing you for the past 2 days. And working with students, you know, thinking about writing my book. I work with undergraduates all the time. We just had final exams this week,
[00:29:57] Unknown:
and I've been teaching elementary statistics, been teaching them R, and students get hung up on, I mean, things that seem simple if you've been programming for a while, but, like, they're real issues for them. I mean, like, you know, lowercase, uppercase. It's a thing to them. You know? Like, typos that they spend forever trying to find, and then it takes forever, and then they get really frustrated. And then they wanna quit because it's so frustrating. I wish I knew a better way to just tell people, like, this is what happens. You know? Whether you've been programming for a year or 20 years, like, you're gonna type something wrong, and then you'll spend an hour trying to find it. And you can't find it because you typed it wrong, so you can't search for it. So, like, that kind of stuff is crazy. And then, like, with R and Python, you know, the whole off by 1 thing, and you get used to counting 1 way in Python, and you switch to R, and you screw it up. I mean, it's the life of a programmer, I guess. Yeah. The classic joke of there are 2 hard things in computer science, naming things, cache invalidation, and off by 1 errors.
That's right. That's right. Yeah.
[00:30:56] Unknown:
And in terms of your own work that you've been doing in your consulting and some of the teaching that you do and your background in mathematics, what are some of the most interesting or innovative or unexpected ways that you've seen these statistical regression techniques
[00:31:10] Unknown:
used? First thing I would say is people should use logistic regression more instead of advanced, you know, deep learning models, and people should take the time to learn how to interpret logistic regression output. So I've worked on several projects recently where a logistic model does a great job of distinguishing between outcomes, you know, or determining what has an influence on outcomes. And 2 of them are a tuberculosis model and bad outcomes with COVID and coughing. So for this tuberculosis model, we have some data from Uganda, and I am not an epidemiologist by any means. In fact, my background is in pure mathematics, you know, not even really that applied.
But with tuberculosis, we have this data from Uganda. We built a logistic model from the data, and it turns out that the single most important predictor in the data set so they had data on things like heart rates and, you know, is the patient coughing persistently or not? BMI, weight, sex, is the person coughing up blood, weight loss, things like that. So all these things are related to tuberculosis. And from this data set, if you just build a model and interpret the results, the single most important predictor was heart rate. You know? And it turns out that, you know, someone with a heart rate over a 100, you know, is very likely to have TB. And someone with a heart rate under a 100 was likely not to have TB. You know, just based on the data, there's nothing else. And when we met with our client, who is a physician, he was shocked that that's what we came up with because he said, oh, yeah. Tachycardia is, like, 1 of the single most important things in the field to triage TB patients. You know? So it's really fun, actually, to work on models like that and to tell someone the results you got and find out the results are meaningful and that they make sense, and that it's sort of useful to just get that out of the data. Another example is with coughing and some other collaborations.
We looked at COVID patients and coughing and found out that, you know, if the person coughs a lot, that's actually a good sign. We think it's a sort of a sign of vitality. If the person coughs, you know, more than 3 or so times an hour, that's good. If they cough less than that per hour, it's probably a sign that they're gonna have a bad outcome, have to be intubated or die. And so those are simple models, but they're really useful, and you can explain them to someone. You know? So I think that, you know, people should learn more about logistic regression and use it more and more. In fact, there are tons of articles in the last few years about logistic models outperforming complicated deep learning models in medical applications. So, yeah, that's super important. And then just basic statistics.
Again, with coughing, it turns out that if you study someone's coughing patterns, it is not enough just to quote their coughs per hour. So if you wanna know how someone coughs sort of day to day, what you really wanna know are things like the average coughs per hour and the variance of the person's coughs per hour. The basic reason is, a Poisson distribution does not describe coughing well, but a negative binomial distribution does. It's a totally technical thing, or it's kind of a technical thing, but that's what you see in the data. And if people don't know about those distinctions, you know, Poisson versus negative binomial, they would never think to look into that. Right? So, yeah, people need to learn more statistics and be aware of these different methodologies, I guess.
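For readers who want to see that mean-versus-variance distinction in code, here is a minimal sketch with simulated counts (no real cough data is used), comparing an intercept-only Poisson fit to a negative binomial fit in statsmodels:

```python
# Sketch of the overdispersion check described above, using simulated counts.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
# Overdispersed counts: the variance is well above the mean, unlike a Poisson,
# whose mean and variance are equal.
counts = rng.negative_binomial(n=2, p=0.2, size=1000)
print("mean:", counts.mean(), "variance:", counts.var())

# Intercept-only fits of both count models for comparison.
X = np.ones((counts.size, 1))
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
negbin_fit = sm.NegativeBinomial(counts, X).fit(disp=False)

# The negative binomial should fit the overdispersed data noticeably better.
print("Poisson AIC:", poisson_fit.aic)
print("Negative binomial AIC:", negbin_fit.aic)
```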
[00:34:25] Unknown:
Just to be contrary and then throw out a trigger word, Bayesian versus frequentist.
[00:34:31] Unknown:
You know, I think Bayesian stuff is super important. I have never really done much Bayesian work myself, which is bad. I should admit that, but it's true. It's 1 of those things that's on my to do list. In fact, writing this book, you know, I said I have a certain audience in mind. Part of the reason for writing is so I have a chance to learn this stuff, you know, and learn it really well. But it's totally true that if you wanna learn something, you know, as an academic, the best way to learn is teach a class on it or write a book about it. So it's a way to force myself to learn these things. So maybe after this, I'll write a book on Bayesian stuff because that's super important, and I don't know enough about it. So
[00:35:07] Unknown:
So in your work of writing this book and doing the necessary work to learn the material and figure out the programmatic constructs and how to formulate the progression of the material? What are some of the most interesting or unexpected or challenging lessons that you've learned? Well, writing a book is hard.
[00:35:28] Unknown:
That's not really a new lesson, but I've been stunned at how much work it is to write, like, a paragraph or a page. Or you write a few pages or whatever, and then you realize, oh, I totally screwed up. You know, I shouldn't have said it this way. I should have, like, developed things this way instead. And so there's a lot of rewriting. So at least in my experience, writing means more time rewriting than actually writing the first time, I guess. And starting a new chapter is always like starting from scratch. It takes forever.
But then there are also moments where, like, the writing is super fun, and I'll learn something I didn't understand before, and that's really rewarding. Writing a book is really difficult, even more than I would have guessed from having heard people say that over and over again. The other thing, the other big challenge, is finding good data. I mean, people talk about data all the time and how much data is out there, but if you wanna find good datasets to explain a topic or to explore a topic, it is super challenging to do that. And if you wanna find a dataset that hasn't been beaten to death by everyone else out there, you know, then it's super hard. Because there are some datasets that we've all seen too many times, like the Titanic dataset and the Iris dataset. And, you know, these examples from the UCI machine learning repository. Those are great, but then we've seen that before. So I have spent a ton of time trying to find good datasets that are not 30 years old, that are realistic, that are, you know, interesting, that work well for the method, you know, all these kinds of things. That is really hard to do, and I wish it were easier. But there are tons of scientific journals now that require authors to share data.
But even though that's the case, if you go and try to find the data, the authors say things like either "data available upon request" or "all the data needed for the article is in the article itself," which is kind of BS. Because if you try to get the data, I mean, like, in some cases, I don't know how you would extract what you want from their article. I think it's sort of a cop out. So I wish that people were better about sharing their data and making it available publicly.
[00:37:24] Unknown:
Or it's embedded in a PDF table that you then have to be an expert in OCR to extract in a meaningful way.
[00:37:29] Unknown:
Such a nightmare for data analysis. And I think we've all heard people talk about, you know, oh, we have data on that. It's in this horrible PDF. Like like, good luck.
[00:37:40] Unknown:
All you need is an army of interns.
[00:37:42] Unknown:
Yeah. Exactly. There's a project that we had students working on last summer on carbon sequestration. So this is a project in Haiti, and 1 of the partners sent a 300 something page document, 300 pages of handwritten measurements of tree diameters and tree heights. Handwritten. 300 some pages. And then we had an intern, a poor student intern. He went through the entire document and recorded all the measurements in a spreadsheet. And then we had, you know, 50 different sheets in a spreadsheet instead of, like, a nice database with tables that you could query.
So the usual stuff.
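For what it's worth, consolidating that kind of multi-sheet workbook into one queryable table is only a few lines of pandas; the file name and column layout here are hypothetical, not the actual project files:

```python
# Small sketch of tidying a many-sheet workbook: read every sheet and stack
# the rows into one table, keeping track of which sheet each row came from.
import pandas as pd

def load_all_sheets(path: str) -> pd.DataFrame:
    # sheet_name=None returns a dict of {sheet name: DataFrame}.
    sheets = pd.read_excel(path, sheet_name=None)
    frames = []
    for name, frame in sheets.items():
        frame = frame.copy()
        frame["sheet"] = name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Example (hypothetical file name):
# trees = load_all_sheets("tree_measurements.xlsx")
```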
[00:38:22] Unknown:
Yes. The joys of data and academia and when the 2 combine.
[00:38:26] Unknown:
Yeah. Joy is 1 word for it, I guess.
[00:38:30] Unknown:
And so for people who read your book and want to learn more about regression and its applications and some of the other styles of regression models that they might want to dig into or maybe neighboring concepts. What are some of the resources that you have found useful as you're preparing the book and some resources that you recommend that folks start to look at when they want to expand their knowledge and understanding of the space?
[00:38:54] Unknown:
Okay. So for writing the book, there are several good resources on the different topics. There are good resources on linear regression. There are tons of resources on linear regression, maybe too many. And there are some good resources on logistic regression. There's a book by Collett, David Collett, I think his name is, C-o-l-l-e-t-t, with a double T. Anyway, Collett has a book on modeling binary data, which is fantastic. It's definitely geared at statisticians and mathematicians, but it's a fantastic book. Highly recommended. For linear regression, I've looked at so many books on linear regression.
There's 1 by, oh my gosh, I can't think of the name of the authors. There's 1 by 3 authors, Kutner, Nachtsheim, and Neter, I think those are the authors. That's a really fantastic book on linear regression. It's also, like, I don't know, 1200 pages long. It's a huge thick book. Those are 2 good books. There are really not very many good books on count models. There are some books by Joseph Hilbe, H-i-l-b-e, on count modeling that are okay, but they use a lot of Stata, and they're kind of hard to read, I would say. But there are also not many of those books out there, so they're helpful to read just to get a sense of what has been done. There's a book by Cameron and Trivedi that's aimed at economists. It is super academic and aimed at people who do econometrics. 1 of my goals with this book is to write an account of basic regression for counts that is accessible and digestible to most people. Right?
So I would say the things to read after reading this book would be the books by Hastie and Tibshirani. You know, like, Elements of Statistical Learning. They have a book on generalized additive models. They have a bunch of books on different topics. They're all sort of cutting edge, like the lasso and all these different things. Their books would be, like, sort of the next things to read. There's also an older book by Nelder and McCullagh on generalized linear models. It's probably 30 years old now. It's very mathematical. It would be a good next read after this too.
Yeah. Like I said earlier, there are tons of advanced books. There are tons of intro books, but not much in between, and I'm hoping to fill that gap.
[00:41:04] Unknown:
Are there any other aspects of the work that you're doing on the book or the overall space of linear regression and statistical modeling that we didn't discuss yet that you'd like to cover before we close out the show? I just wanna emphasize that I think that 1 of the things I hope will happen more
[00:41:18] Unknown:
and soon is a better integration of deep learning kinds of approaches and sort of old school approaches. I think they seem like sort of competing approaches, you know, like, just old school random variables kind of stuff and new neural network stuff. But I think that there are lots of opportunities to combine them and get the best of both worlds. I think it's maybe starting to happen more and more, but it definitely needs to happen more. So, for example, 1 thing that I think is a really interesting problem is how to combine deep learning models, like neural networks, with sort of, you know, old school regression models. So how could you take, for example, clinical data on patients and deep learning results when you're modeling things like, you know, determining disease diagnosis?
How can you make those kinds of models work together? Because you have tabular data on 1 hand, and you have, you know, audio or visual data that gets processed by a deep learning model, and you get results from that. How do you combine these 2 kinds of data together to get more robust, more useful models that maybe have an opaque sort of black box part but are still interpretable? That's an area that I think is ripe for, I don't know, disruption or whatever the word is. It's not clear at all yet how to combine tabular data and neural network models to get better insights.
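One possible shape for that combination, purely as a sketch and not a method from the book or from Matthew's projects: use a deep model only to turn the raw audio or images into features, then feed those features alongside the tabular data into an interpretable logistic regression. The embed_audio function here is a hypothetical stand-in for a pretrained network.

```python
# Sketch: combine tabular clinical data with features produced by a deep
# model, then fit an interpretable logistic regression on top of both.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def embed_audio(recordings: np.ndarray) -> np.ndarray:
    # Placeholder: in practice this would call a pretrained deep model and
    # return its embedding or predicted probabilities for each recording.
    return recordings.mean(axis=1, keepdims=True)

def fit_combined(tabular: np.ndarray, recordings: np.ndarray, y: np.ndarray):
    # Stack tabular columns and deep-model features into one design matrix.
    features = np.hstack([tabular, embed_audio(recordings)])
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return model.fit(features, y)
```

The appeal of this pattern is that the final model's coefficients can still be inspected and explained, even though one of its inputs came from a black box.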
[00:42:35] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the podcast The Argument from the New York Times. It's an enjoyable show to listen to. So in each episode, they have a particular topic, and they'll bring on experts from either side of the perspectives having to do with it and have a civil and informative debate about it. So it's just a good way to kind of educate yourself on the different ways people think about things without necessarily getting bogged down in a lot of the clickbait and sort of social media powered arguments that don't necessarily get anybody anywhere. So definitely worth taking a listen to if you're looking for a way to kind of expand your views on some things. So with that, I'll pass it to you, Matthew. Do you have any picks this week? In a previous life, I was really active in music.
[00:43:30] Unknown:
You know, it would have been great to be a musician for a career, but that didn't happen. So 1 of the bands that has been really important to me forever is Primus, which, I don't know, maybe you know, maybe you don't know. So here's my pick. A record from 2 years ago by the Claypool Lennon Delirium, Les Claypool and Sean Lennon. So Sean Lennon, John Lennon's son, and Les Claypool of Primus fame. They have a record from a couple years ago called South of Reality. And I've listened to it a ton in the last year, and my son, who's 8, he really likes it. So there's my pick.
[00:44:03] Unknown:
Alright. Yeah. Definitely. I appreciate Primus and all of Les Claypool's different endeavors. He's got a ridiculous number of side projects, all of which are interesting in their own way. So definitely, we'll second that pick. So thank you again for taking the time today to join me and share the work that you've been doing on the book and your perspectives on the utility and applications of regression and statistical modeling. Thank you again for all the time and effort you're putting in there, and I hope you enjoy the rest of your day. You too. Thank you for having me. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to the Guest: Matthew Rudd
Matthew's Journey with Python and R
Writing a Book on Statistical Regression
Applications and Use Cases of Regression
Target Audience and Goals for the Book
Common Pitfalls in Regression Modeling
Statistical Models vs. Machine Learning Models
Balancing Theory and Practice in the Book
Challenges of Writing a Technical Book
Interesting Applications of Regression Techniques
Resources for Learning More About Regression