Summary
Most programming is deterministic, relying on concrete logic to determine the way that it operates. However, there are problems that require a way to work with uncertainty. PyMC3 is a library designed for building models to predict the likelihood of certain outcomes. In this episode Thomas Wiecki explains the use cases where Bayesian statistics are necessary, how PyMC3 is designed and implemented, and some great examples of how it is being used in real projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Thomas Wiecki about PyMC3, a project for probabilistic programming in Python
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what probabilistic programming is?
- What is the PyMC3 project and how did you get involved with it?
- The opening line for the project README is packed with a slew of terms that are rather opaque to the lay-person. Can you unpack that a bit and discuss some of the ways that PyMC3 is used in real-world projects?
- How much knowledge of statistical modeling and Bayesian statistics is necessary to make effective use of PyMC3?
- Can you talk through an example use case for PyMC3 to illustrate how you would use it in a project?
- How does it compare to the way that you would approach the same problem in a deterministic or frequentist modeling framework?
- Can you describe how PyMC3 is implemented?
- There are a number of other projects that build on top of PyMC3, what are some that you find particularly interesting or noteworthy?
- What do you find to be the most useful features of PyMC3 and what are some areas that you would like to see it improved?
- What have been the most interesting/unexpected/challenging lessons that you have learned in the process of building and maintaining PyMC3?
- What is in store for the future of PyMC3?
Keep In Touch
- PyMC
- Thomas
Picks
- Tobias
- Thomas
- Hyperion by Dan Simmons
- The Mind Illuminated
Links
- PyMC3
- Quantopian
- University of Tubingen
- MatLab
- Probabilistic Modeling
- Probability Distribution
- A/B Testing
- Bayesian Statistics
- Beta Distribution
- Bernoulli Distribution
- P-Value
- Hamiltonian Monte Carlo sampling algorithm
- Metropolis Hastings Inference Algorithm
- Theano
- Bayesian Methods For Hackers by Cameron Davidson-Pilon
- Bayesian Analysis With Python by Osvaldo Martin
- Tensorflow
- MXNet deep learning framework
- Tensorflow Probability
- BAMBI package to build generalized linear models
- PMProphet PyMC3 implementation of Facebook’s Prophet for timeseries prediction
- Exoplanet
- BEAT (Bayesian Earthquake Analysis Tool)
- PyMC3 in Google Summer of Code
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or building your CI pipeline, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes.
And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:36] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing Thomas Wiecki about PyMC3, a project for probabilistic programming in Python. So, Thomas, could you start by introducing yourself?
[00:01:41] Unknown:
Sure. Yeah. So I'm Thomas Wiecki. I currently work at Quantopian, which is a Boston-based startup in the financial space, and I'm the VP of data science there. Before that, I did my undergrad in computer science at the University of Tubingen and then did my PhD at Brown University.
And do you remember how you first got introduced to Python?
Yeah. Certainly. So when I was still an undergrad, I spent a summer at MIT doing brain stuff. And in particular, we were training rats to do certain experiments.
And in particular, the rats I was training to do a discrimination task required me to lock myself into a dark room in the middle of the summer and just train these rats, basically. So that felt like a really long time, these two hours, but it left me a lot of other time to explore other things. And at that time I started using Python. Before that, I was a MATLAB user, like a lot of the field back then. So that was the early 2000s. And what I liked about MATLAB was that it really allowed you to very easily do numerical processing, have access to matrices and all these things, but it really didn't feel like a proper programming language. So building actual packages out of that was a real hassle.
And so Python, I immediately fell in love with, because it was just as easy to work from the Python shell or the IPython shell and then create plots and get NumPy and SciPy and all these other things that could get you to parity with MATLAB, but also be a natural programming language to build packages. So I really dove into that because I had the time to dedicate to it. So at the end of the summer, while those rats never learned what I tried to teach them, I came out of that being a pretty good Python programmer.
[00:03:45] Unknown:
And so the PyMC3 package says that it is usable for probabilistic programming. So can you start by explaining a bit about what that means?
[00:03:53] Unknown:
Sure. Yeah. So as the name suggests, it is about probabilistic models, and these probabilistic models you specify in computer code. And that already makes me really like it, because it allows someone with a computer science background to really dive into statistics. So I guess I should talk a little bit more about what probabilistic models are. Basically, a probabilistic model is something that is made up of probability distributions. Probabilistic programming allows you to do that quite easily, where you just have a whole bunch of probability distributions that represent how your problem is structured, and then you plug those together. And that is the common way that you do that in statistics, right? You plug probability distributions together, and then you do a whole bunch of math derivations to get at your estimator.
Now, because we are in code here, not math, we can just write a line representing, this is a normal distribution and it's the input to another normal distribution. And then the computer figures out all the math for us, so we don't have to stand in front of a whiteboard. We are inside of a Jupyter notebook and just write down the model that we want. The other really important aspect is that the types of models we're building are generative. So that means we think about the causes of our problem, and often these are unobserved.
And then we think of how these causes generate the data that we end up observing. So one example for that would be A/B tests, which are very common if you're building websites. Right? You want to know which of two versions of a website has a higher conversion rate. Maybe you want people to sign up to your website, so you have two different versions and you want to say which one has a higher conversion rate. At the core, that's a statistical problem, and you could use frequentist statistics. But here, we're going to use Bayesian statistics. And the way to think about this in a generative framework is, well, we have these two unknown, unobservable causes which produce our data. These unknown causes are the conversion rates, and these produce either a user signing up or not signing up. And on these conversion rates, which we can't observe, because we're in a Bayesian framework here, we have to place priors. That's a technical term from Bayesian statistics, and what it means is we define a probability distribution, without having seen the data, of what we believe likely values of the parameters to be. So in this case, we could say, I know ahead of time, right? If you're working in the web space, you know that conversion rates tend to be extremely low. I don't do that myself, but I know that it's less than a percent or in that area, but definitely not, say, 20%.
So the probability distribution that I would build would place a lot of probability mass on really low values, because I know ahead of time that those are going to be likely, but something like 50%, while not completely impossible, is going to have a very, very low probability. And there's definitely some leeway in how you can define these priors. Maybe you're not an expert, so you could place wider priors on that, or you could ask an expert to provide you with that insight. So okay. Now we have defined our two conversion rates. Oh, and one more thing on the priors: because it's a conversion rate, it's essentially a probability, so it has to be between 0 and 1, or 0 and 100%. And the distribution that is commonly used for that is the beta distribution.
So that's just the name of that. And we can define the parameters of the beta distribution to give a lot of probability mass to very low probability values. And then we have to link that to observed data. So we say, okay, from these unobserved causes, we generate data that we observe. And in our case, these will be binary events, right? Someone did sign up or they didn't sign up. And the probability distribution, the likelihood function for that, is called the Bernoulli, or coin flip, distribution. So that basically defines our model. We have our beta distributions as priors for our unobserved conversion rates, and then we have our likelihood function, how my data is distributed: Bernoulli. So far, I haven't seen any data yet, so that is just a model specification.
Next, once I observe data, I actually want to make inferences about these hidden causes. So what are the most likely causes to have generated that data? And again, because we are in a statistical setting, we're not satisfied with just a single answer, like the conversion rate is going to be 0.5%. Here, we also want a distribution of possible values. So it could be between 0.2 and 0.6%, but which of those values exactly, I can't say, because I have only observed limited amounts of data. And the distribution that formalizes this is called the posterior distribution.
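Editor's illustration: the model described here (a beta prior on an unobserved conversion rate, Bernoulli sign-up events, and a posterior over the rate) can be sketched in plain Python. This uses only the standard library and is not PyMC3's API. The Beta(2, 200) prior, the 0.5% "true" rate, and the 5,000 visitors are assumptions chosen for the example. For this particular model the posterior is available in closed form, because a beta prior combined with Bernoulli data gives another beta distribution, so no sampler is needed.

```python
import random

rng = random.Random(0)

# Assumed prior: Beta(2, 200) puts most of its mass on rates around 1%.
PRIOR_A, PRIOR_B = 2.0, 200.0

def simulate_visitors(rate, n, rng):
    """Generative direction: a hidden rate produces 0/1 sign-up events."""
    return [1 if rng.random() < rate else 0 for _ in range(n)]

def update_beta(a, b, observations):
    """Beta prior + Bernoulli data -> Beta posterior (conjugate update)."""
    signups = sum(observations)
    return a + signups, b + len(observations) - signups

def credible_interval(a, b, rng, draws=20000, mass=0.95):
    """Central interval estimated from sorted posterior draws."""
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int(draws * (1 - mass) / 2)]
    hi = samples[int(draws * (1 + mass) / 2) - 1]
    return lo, hi

true_rate = 0.005  # the hidden cause; unknown in a real experiment
data = simulate_visitors(true_rate, 5000, rng)
post_a, post_b = update_beta(PRIOR_A, PRIOR_B, data)
low, high = credible_interval(post_a, post_b, rng)
print(f"posterior: Beta({post_a}, {post_b}), 95% interval {low:.4f} to {high:.4f}")
```

The output is a range of plausible conversion rates rather than a single number, which is the point being made about posteriors.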
And that posterior distribution, another way to think of that is just as my belief in these underlying causes, in the conversion rate. So we start with some initial belief without having seen the data, my priors. Then I observe data, I've learned something about the world, and I update my knowledge into my posterior distribution. Once I have these posterior distributions, I can do statistics on them, like asking, okay, what is the probability that one is better than the other? And that will tell you, okay, now with a 98% probability, the second version of your website is going to be better. So immediately you know what is happening, and you can also then say, maybe it's still not high enough, so I want to observe more data. Then you update your posteriors again, and then it's at 99%. So then you say, okay, now we're going to deploy. So that is how the whole problem is structured. The difficulty that I haven't mentioned is in this updating process, converting my priors into posteriors by observing data. That is what Bayes' formula is doing.
Bayes' formula is amazing. It's actually a very simple result, and it is a very elegant formula that looks very simple. But then when you actually try to do the math of that, you very quickly run into problems for even fairly simple models. And that is what has held Bayesian statistics back for a really long time as well. We have this great framework, but we can't use it, because once we start using it, we get these nasty multidimensional integrals over infinities, and we just can't solve them. So luckily, there have been all kinds of approximate inference algorithms being developed. And what is especially powerful about them for the purposes of probabilistic programming is that these work largely automatically on whatever model you dream up. So that allows you, in a probabilistic programming framework, to write your model in code, then hit what I like to call the inference button, and just get your posterior approximation.
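Editor's illustration: the A/B comparison described above, the probability that one version beats the other and how it changes as more data arrives, can be sketched by drawing from the two posteriors and counting how often version B comes out ahead. This is plain standard-library Python, not PyMC3. The visitor and sign-up counts and the Beta(2, 200) prior are hypothetical.

```python
import random

def prob_b_beats_a(signups_a, visitors_a, signups_b, visitors_b,
                   prior=(2.0, 200.0), draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + signups_a, b0 + visitors_a - signups_a)
        rate_b = rng.betavariate(a0 + signups_b, b0 + visitors_b - signups_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical experiment: B converts a bit better than A.
p_first = prob_b_beats_a(20, 4000, 32, 4000)
# Keep collecting at the same rates, then update again.
p_more = prob_b_beats_a(40, 8000, 64, 8000)
print(f"P(B > A) after 4,000 visitors each: {p_first:.3f}")
print(f"P(B > A) after 8,000 visitors each: {p_more:.3f}")
```

With the same underlying rates, doubling the data narrows both posteriors, so the probability that B is better moves closer to 1. That is the "observe more data, then update again" loop from the interview.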
And then maybe you're not totally satisfied with the model. Some aspect of the data is not being modeled correctly. So then you go ahead and change the model, and without having to rederive anything in math, you just hit the inference button again and you get your answers. So that is, I think, how I view the benefits of the approach of probabilistic programming.
[00:11:41] Unknown:
And I know that there's sort of this dichotomy in statistics between Bayesian approaches and frequentist approaches, and it sounds like one of the reasons that the frequentist approach has gained a larger mind share is because of these issues of being able to effectively compute the probability distributions of the Bayesian model. And it seems that because we now have tools like PyMC3 and other Bayesian statistical algorithms, it's starting to gain back a bit of interest and popularity in the statistics community. So I'm curious what your experience has been on that front.
[00:12:19] Unknown:
Yeah. I think that's definitely true. So I would view frequentist statistics as more like the 20th century statistics, where, yes, we had to be able to derive these estimators using math, to get closed-form solutions, and frequentist statistics just was easier in that regard for many cases. While all the math is completely correct and it's a solid framework, it is often a framework that is not really well suited for the scientific discovery process. And one of the key reasons is that people think about frequentist statistics in a way that is actually much more compatible with Bayesian statistics. So they think that a p-value smaller than 0.05 means that the probability of a false discovery is smaller than 0.05.
But that is the Bayesian answer, not the frequentist answer. I don't want to go into too much detail. I encourage listeners to just read up on what a p-value actually means. But it comes with all these problems. Like, it's very dependent on the exact data generation and data collection process. So it's different if you say, I collected data for 15 days and got 20 subjects, or I collected data until I had 20 subjects, or all kinds of different ways that give you the same number of data points. But actually it matters for the statistical test what your protocol was. And that just is not very well suited. And now the other aspect where I would say Bayesian statistics really shines is that, because frequentist statistics relies on these mathematical derivations, building a new model, the model that you actually think matches your problem, is hard. Right? So if it's not a t-test, but something like a t-test, you would need to go back to the drawing board and rewrite the formulas.
And if you're lucky, then everything checks out. But if not, then you're basically left to your own devices. In Bayesian statistics, and with probabilistic programming, you just build whatever model you think is best suited to understand your data, and then just, yeah, hit the inference button and be on your way. And I definitely see a slow but steady move away from frequentist statistics.
[00:14:36] Unknown:
Academia in general seems to be slow on the uptake, but I think it's mainly a question of time. And so now that we've established what probabilistic programming is and some of the differences between Bayesian and frequentist approaches, I'm wondering if you can describe what the PyMC3 project is, what it offers, and how you got involved with it.
[00:14:59] Unknown:
Sure. Yeah. So when I first learned about Bayesian statistics in grad school, that was at a workshop in Amsterdam which E.J. Wagenmakers taught, and there we were using WinBUGS, which was at its time the package for that, and ahead of its time. But now it's definitely neither of those. That, like most other packages, required you to write your model in a domain-specific language which they defined. And because I was already a Python user back then, I was looking for a package that offered that in Python, and that way I found PyMC2, which allowed you to specify models directly in Python code. So that really suited me well. And I started writing, for the models I was building in grad school, a library that built on top of that. Just that being possible was extremely powerful. And that's how I got to know the authors of PyMC2 better, like Chris Fonnesbeck, and John Salvatier was really involved with that. And talking more to John, he talked about this new class of samplers called Hamiltonian Monte Carlo, which I didn't really know anything about. Other packages at that time were just using what's called Metropolis-Hastings, which is a quite easy and general inference algorithm, but it's not great. It often fails and it's quite slow. And this new class of samplers was just way better. So we talked about ways of maybe including that in PyMC2, but it just wasn't suited for that, because for these new types of samplers, you really need the gradients of your model, and the package just wasn't designed to handle that.
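Editor's illustration: to show why Metropolis-Hastings is called "easy and general", here is a minimal random-walk Metropolis sampler in standard-library Python. It is the symmetric-proposal special case of Metropolis-Hastings and only needs a function returning the log of an unnormalized density. The target (a conversion-rate posterior built from 30 sign-ups in 5,000 visitors and a Beta(2, 200) prior) and the step size are assumptions for the example. This is not how PyMC2 or PyMC3 implement their samplers.

```python
import math
import random

def metropolis(log_density, start, n_samples, step, seed=0):
    """Random-walk Metropolis: propose a nearby point, accept or stay put."""
    rng = random.Random(seed)
    x, logp = start, log_density(start)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)
    return samples

def log_posterior(rate, signups=30, visitors=5000, a=2.0, b=200.0):
    """Unnormalized log posterior: beta prior times Bernoulli likelihood."""
    if not 0.0 < rate < 1.0:
        return -math.inf
    return ((a - 1 + signups) * math.log(rate)
            + (b - 1 + visitors - signups) * math.log(1.0 - rate))

draws = metropolis(log_posterior, start=0.01, n_samples=20000, step=0.002)
kept = draws[2000:]  # discard early draws taken before the chain settles
posterior_mean = sum(kept) / len(kept)
print(f"estimated posterior mean rate: {posterior_mean:.4f}")
```

The sampler never uses a gradient, which is what makes it general. The price is that the step size has to be tuned by hand, and with many parameters a random walk explores slowly.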
So John, and I think he was the first to really make the connection, found the package Theano, which is the first of its kind to really be a deep learning library, where you do all the matrix multiplications to build your neural network model and then also compute the gradients, which you need to do your backpropagation and gradient descent to fit the neural network model. And he realized that it was possible to use that package not just for neural networks, but that it was a general computational package where you can implement all kinds of math and then get your derivatives for that. And moreover, it allows us to compile things to C and then run at machine speed, to be really fast. So he realized that you could implement a probabilistic programming framework on top of Theano, and then did a lot of the core work to really do that. And that is what resulted in PyMC3. And then later, yeah, Chris and I basically got involved in that and pushed it further.
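Editor's illustration: a bare-bones Hamiltonian Monte Carlo sampler shows where the gradients come in. The sketch below is standard-library Python for a one-dimensional toy target (a normal distribution with mean 3 and standard deviation 2, chosen arbitrarily). The gradient is written by hand here, whereas in PyMC3 that is the part Theano derives automatically from the model. The step size and number of leapfrog steps are assumptions, and this is not PyMC3's actual sampler.

```python
import math
import random

def leapfrog(x, p, grad_log_p, step, n_steps):
    """Simulate Hamiltonian dynamics; every step needs the gradient."""
    p += 0.5 * step * grad_log_p(x)          # half step for momentum
    for i in range(n_steps):
        x += step * p                        # full step for position
        if i < n_steps - 1:
            p += step * grad_log_p(x)        # full step for momentum
    p += 0.5 * step * grad_log_p(x)          # final half step
    return x, p

def hmc(log_p, grad_log_p, start, n_samples, step=0.3, n_steps=15, seed=0):
    rng = random.Random(seed)
    x, samples = start, []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)             # fresh random momentum
        x_new, p_new = leapfrog(x, p0, grad_log_p, step, n_steps)
        # Energy = negative log density plus kinetic energy.
        h_old = -log_p(x) + 0.5 * p0 * p0
        h_new = -log_p(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new                        # accept the end of the trajectory
        samples.append(x)
    return samples

def log_p(x):
    return -0.5 * ((x - 3.0) / 2.0) ** 2

def grad_log_p(x):
    return -(x - 3.0) / 4.0                  # derivative of log_p, by hand

samples = hmc(log_p, grad_log_p, start=0.0, n_samples=5000)[500:]
mean = sum(samples) / len(samples)
print(f"sample mean: {mean:.2f}")
```

Each proposal follows the gradient along a trajectory instead of taking a blind random step, which is why a framework that can differentiate arbitrary models was the missing piece.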
But yeah. So it is basically the extension of that package. The benefits are that you can write your models in Python and everything stays in Python, and you can interact with your model in Python. And it is really fast because we compile the code to C or the GPU, and it has these next-generation samplers, which are just vastly better than anything that came before.
[00:18:10] Unknown:
And I know that some of the samplers are things like Markov chain Monte Carlo or variational inference, and the opening line for the project README is actually very packed with some of these very technical terms. So it says PyMC3 is a Python package for Bayesian statistical modeling and probabilistic machine learning focusing on advanced Markov chain Monte Carlo and variational inference algorithms. So I'm wondering if you can just unpack that a bit and discuss some of the ways that PyMC3 is used in some real world projects. I know you already mentioned the possibility of being able to use it for establishing the statistical significance of things like A/B testing. But I'm wondering if you can just talk through some of the other ways that you have used it or that you've seen it used in other contexts.
[00:18:56] Unknown:
Sure. Yeah. So that has been really rewarding to see, how people really make use of the software. And it's actually not that easy to get a sense of what people use your package for. If they use it and it works, they just don't come back and tell you that they used it for a really cool purpose and that it worked well for that. In academia, we see that just because of all the papers citing PyMC3. And these come from all kinds of scientific disciplines, like a lot of astronomy stuff, ecology, quant finance, econometrics, biology, zoology, earthquake stuff. That's probably one of my favorite parts, because all my other papers are, of course, in my domain where I did my PhD, and so I can read them and understand them. But the papers that are citing PyMC3, I have no idea what is going on there. So yeah, it's being used in all kinds of different areas for all kinds of interesting purposes.
And moreover, it's also used in industry. So there are quite a few companies that we know use it in production, in all kinds of different ways, to solve all kinds of different problems, really big companies and startups. It's used for all kinds of stuff, but actually one thing that I do see often is A/B tests, interestingly. And finally, we also use it at Quantopian to make our asset allocation. So just, I guess, a word on that. We are what I like to call a crowdsourced hedge fund. So we have a huge community of quants that come to our website and develop trading algorithms to invest in the stock market. And we pick the best of those algorithms and then license them from the author to include in our fund. And part of my job is to select the best ones of those, and PyMC3 is being used in that process as well.
[00:21:03] Unknown:
And I know that a lot of the documentation, as I was reading through, is pretty heavy with different references to probability theory and statistical modeling, and [one benefit of] PyMC3 being in Python is that it helps provide a bit of a higher level abstraction over some of those approaches. So I'm curious what you have found to be the requisite amount of knowledge of statistical modeling and Bayesian statistics to be able to make effective use of PyMC3.
[00:21:19] Unknown:
Right. Yeah. So there's definitely an initial investment that has to be made. You need to know about probability distributions, just so you know, basically, what those do and how to set the right priors. Then you need to know about priors and posteriors and how to interpret them, and a little bit about the model building process, how to do that properly. And then, probably quite a few Bayesian practitioners would disagree, but I don't think you really need an in-depth knowledge of the inference algorithms, for example. The hope is, and I think we have succeeded at that to some degree, that you don't really need to know about that. It's just the inference button and you press it, and without you really knowing what's happening, it will do the right thing and give you the right answers. And that actually works well also because these new types of inference algorithms, Hamiltonian Monte Carlo samplers, when they don't work, they fail in a spectacular way. And then it is really easy to tell the user, actually, don't trust these results.
Something went really wrong there. And then, unfortunately, it can be quite difficult and require quite some in-depth knowledge to solve those kinds of problems. So when the default approach doesn't work, things get more difficult and intricate, but otherwise you get by with core knowledge of statistics and probability. And I think the best way to get started is probably to pick up a book that introduces these concepts. So there's Bayesian Methods for Hackers, which is a really fun read, which you can find online, and it's ported to PyMC3. So that is a great read. And then Osvaldo Martin has just come out with the second edition of his book, which focuses on PyMC3 and does a great job of introducing everything in a very practical and applied way. So for those types of data scientists, I would definitely recommend both.
[00:23:22] Unknown:
So can you start by just talking through an example project that would use PyMC3 and the overall workflow of setting up the model, some of the syntax? You know, I don't need you to spell it out, but just at a high level, the way that the syntax is represented, and the overall workflow of any heuristics that would be useful for figuring out how to set your priors, building the model, any sort of training process that goes on, and then evaluating the outputs.
[00:23:47] Unknown:
Sure. Yeah. So one particular use case, which was also very educational for me, was building that model for Quantopian to help with our asset allocation. And the problem is actually quite similar to the A/B testing problem. We have different algorithms and they produce daily returns. So maybe you increase your returns by 5% on the first day and then maybe go 2% down the next day. Mostly these are much smaller values, but you get the idea. So we have these return series for every single algorithm, and we want to know which one of those has the best performance-to-risk profile. So there, we have the unknown causes, which are, well, what is the mean of that returns distribution? Is it positive or negative or 0?
And we have the variance of that. So these are two parameters that we want to estimate here. And in PyMC3 code, that would be two lines. So we say, okay, we have the mean that we want to estimate but don't know, and we place a prior on that. So here I would say, okay, I already know ahead of time that the mean returns of any strategy will be very, very close to 0, right? Just 10% on average every day is absolutely impossible. So it's going to be close to 0. So I would pick a normal distribution that is centered at 0 and restricts things to a very narrow range around 0 that is possible given what we've seen. So that's the first line. And then the second line says, okay, now I need a parameter for the variance of those returns.
And there, the first thing that I realize is, okay, well, the variance must be positive. So it can't be just a normal distribution, because that also has probability mass below 0. So we choose something like the absolute value of a normal distribution, called a half-normal, or there are other choices that you could make, but those are some pretty sane starting conditions. And then I would define my likelihood function, which is how my data is distributed. So that's the third line. And that has two input parameters, and here I'm linking it to my unobservable causes. So the mean of that likelihood function is my parameter defined in the first line, and the variance is the parameter defined in the second line. So that's how the model links together. And then I pass in the data that I have observed, so that is the daily return series that I have. Now here, I have some interesting choices, actually, about what kind of likelihood I use, or how I think my observed data is distributed.
So a good default choice here would be a normal distribution, but actually it's a pretty well known fact that financial return series are not normal. They have much heavier tails, much bigger outliers than a normal distribution would have. So during something like the financial crash in 2008, 2009, you get, all of a sudden, daily moves of, like, 20%. And a normal distribution just doesn't really account for that. So you could use something else, like a Student's t distribution. Those are different choices you would have, but in the model that just changes one line, essentially, where you'd say, well, now I'm using a different likelihood function. So with three lines, I have defined my model that tells me what is the expected return and the risk, the variance of that strategy, the volatility. And then I would just run my sampler and get my posterior for the mean. Then I could take a statistic and ask, okay, what is the probability that the mean return of that strategy is positive? And so that is similar to the model that I built in the first pass. And then I would report that number and say, okay, the probability that that strategy is profitable is, say, 75%.
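Editor's illustration: the three-part returns model (a narrow normal prior on the mean, a half-normal prior on the spread, and a likelihood that can be swapped from normal to Student's t) can be sketched without a sampler by evaluating the posterior on a grid. This is standard-library Python, not PyMC3 code. The return series, prior widths, grid ranges, and the Student's t degrees of freedom are all hypothetical. The spread is parameterized as a standard deviation rather than a variance.

```python
import math

# Hypothetical daily returns as fractions (0.004 means +0.4%).
returns = [0.004, -0.002, 0.003, 0.001, -0.001,
           0.005, 0.002, -0.003, 0.004, 0.001]

def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def student_t_logpdf(x, mean, scale, nu=4.0):
    """Heavier-tailed alternative likelihood (nu=4 is an assumption)."""
    z = (x - mean) / scale
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(scale)
            - (nu + 1) / 2 * math.log1p(z * z / nu))

def prob_mean_positive(data, likelihood_logpdf, grid=80):
    """Grid approximation of P(mean return > 0 | data)."""
    # Midpoint grids: mean in (-1%, +1%), spread in (0, 2%).
    means = [-0.01 + (i + 0.5) * 0.02 / grid for i in range(grid)]
    spreads = [(j + 0.5) * 0.02 / grid for j in range(grid)]
    log_post = {}
    for mu in means:
        for sigma in spreads:
            # "Line 1": narrow normal prior on the mean, centered at 0.
            # "Line 2": half-normal prior on the spread (a normal density
            # restricted to sigma > 0; the constant factor cancels below).
            log_prior = normal_logpdf(mu, 0.0, 0.005) + normal_logpdf(sigma, 0.0, 0.01)
            # "Line 3": likelihood linking both parameters to the data.
            log_lik = sum(likelihood_logpdf(r, mu, sigma) for r in data)
            log_post[(mu, sigma)] = log_prior + log_lik
    top = max(log_post.values())
    weights = {k: math.exp(v - top) for k, v in log_post.items()}
    total = sum(weights.values())
    return sum(w for (mu, _), w in weights.items() if mu > 0) / total

p_normal = prob_mean_positive(returns, normal_logpdf)
p_student = prob_mean_positive(returns, student_t_logpdf)  # the one-line swap
print(f"P(mean > 0): normal likelihood {p_normal:.3f}, Student's t {p_student:.3f}")
```

A grid only works because there are two parameters. For larger models the sampler does this job, and the model definition is still the only thing the user writes.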
So we're operating here in very high uncertainty environments. And then that number could be used by the person selecting the strategies to make a more informed decision. And I thought, okay, that's amazing. Really cool approach, and this should make our asset allocations much better. But what I found was that number just wasn't used. Right? It was reported, and people looked at it in the beginning, but then just started ignoring it. And the reason is, well, there are several reasons. One is we're pretty bad at incorporating this uncertainty information in our decision making process, and there are no easy answers, where it says, oh, this one has 99% and this one has 50%, so obviously you should choose the one that is profitable with 99% probability. It's all in this gray area. And then you say, well, I have 10 strategies, and they have different probabilities ranging from 60 to 75%.
Should I use all of them? Should I just use the top 10? Like, these are really difficult questions that basically led people to make the decisions that they made before, which are the decisions that they think are best, but not really based on evidence or the model. And the solution that we found, which I think is profound, is pushing the model 1 step further into the decision making pipeline. And I find that is often underappreciated. So what you can do once you have your model, rather than just reporting that number, right, that probability output, you can actually generate new data, different scenarios of what the future could look like given your model and what the model has learned from the data. So something that has a very high probability of generating good returns will generate really good looking returns into the future. Something that is in between will generate mixed ones: sometimes it'll go down, sometimes it'll go up.
So you do that for all your strategies. You generate all kinds of different future scenarios, and then you plug those into an optimizer and say, okay, give me the selection of strategies that will maximize my profitability or rather minimize the risk of getting a terrible loss. But either way, you wanna optimize so that you find the best selection of algorithms with the best properties. And now all of a sudden, rather than reporting a number that people didn't look at, the output is: given all the inputs that are available, this is the best selection and how to weight them for your portfolio allocation problem. And that was what we then used to actually make these decisions. So all of a sudden, there was no more human in the loop to make these really difficult decisions, making very difficult trade offs, and that model just worked way better and was actually being used for its intended purpose. So that idea, I think, of not just making a model, and it could even be a machine learning model, but really going the extra mile and not just reporting like, oh, yeah. This is an amazing predictor or whatever, and it is that accurate.
It's not that useful. Right? In the business world, often people don't really care as much about how cool your model is. What they care about is, well, A, does it work? And B, does it make money? So to be successful as a data scientist, you really need to speak their language. And this allows you to do that, where you can show that this process works way better than what we did before, and we think it'll be 10% more profitable or something like that. So, yeah, I think that is a really critical tool for data scientists to have impact in actual business processes.
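The idea of feeding simulated futures into an optimizer can be sketched roughly as follows. This is an illustrative toy, not the Quantopian pipeline: the posterior summaries for the 3 strategies are invented, each scenario is drawn from a normal approximation, and the "optimizer" is a plain random search over portfolio weights that maximizes the average outcome in the worst 5% of scenarios (one possible way to express "minimize the risk of getting a terrible loss").

```python
import math
import random

rng = random.Random(42)

# Hypothetical posterior summaries per strategy:
# (posterior mean of daily return, posterior sd of that mean, daily volatility).
strategies = {
    "A": (0.0010, 0.0004, 0.010),
    "B": (0.0005, 0.0002, 0.004),
    "C": (0.0015, 0.0012, 0.020),
}


def simulate_scenarios(n_scenarios=1000, horizon=20):
    """One row per scenario: the return of every strategy over `horizon` days."""
    scenarios = []
    for _ in range(n_scenarios):
        row = []
        for mean_mu, sd_mu, vol in strategies.values():
            mu = rng.gauss(mean_mu, sd_mu)  # parameter uncertainty from the posterior
            # The sum of `horizon` independent normal daily returns is itself normal.
            row.append(rng.gauss(horizon * mu, vol * math.sqrt(horizon)))
        scenarios.append(row)
    return scenarios


def expected_shortfall(weights, scenarios, tail=0.05):
    """Average portfolio return over the worst `tail` fraction of scenarios."""
    outcomes = sorted(sum(w * r for w, r in zip(weights, row)) for row in scenarios)
    k = max(1, int(tail * len(outcomes)))
    return sum(outcomes[:k]) / k


def random_weights(n):
    """Uniform draw from the simplex: non-negative weights that sum to 1."""
    raw = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


def optimize(scenarios, n_candidates=300):
    """Random search: keep the candidate with the least bad tail outcome."""
    candidates = (random_weights(len(strategies)) for _ in range(n_candidates))
    return max(candidates, key=lambda w: expected_shortfall(w, scenarios))


scenarios = simulate_scenarios()
best_weights = optimize(scenarios)
for name, weight in zip(strategies, best_weights):
    print(f"strategy {name}: weight {weight:.2f}")
```

The output of such a procedure is a set of weights rather than a probability, which is the shift described above: the decision itself, not a number someone still has to interpret.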
[00:31:18] Unknown:
And so could you talk a bit now about how the PyMC3 project itself is implemented? I know that it's gone through a few different generations, going from 1 to 2 and then 2 to 3, and I know that version 4 is in the works. But I'm curious if you can just dig into some of the architecture and some of the ways that it has evolved and some of the things that you're thinking about as you look to the next version. So PyMC 1, I don't even know what oh, yeah. I think so, basically, PyMC 1 was, like, a pure Python based
[00:31:50] Unknown:
implementation of the probability distributions that Chris Fonnesbeck wrote, I think in his PhD or something. And because it was all in Python, it was really, really slow. Then he, with a few other people, started working on PyMC 2, where they found that, well, most of the time is being spent in evaluating these probability distributions, so we should make these as fast as possible, and what is faster than Python is Fortran. So they wrote all the likelihood functions in Fortran, and that would speed up the model evaluation by a lot. So it was much faster.
But the downside is that now you have this Fortran dependency and the linking between Python and Fortran, which make things really difficult to hack on and really difficult to compile and ship and take to production. So that was obviously a downside. And then, as I already mentioned, John Salvatier wrote PyMC3 and used Theano for all of that, and that just made things way easier and also way faster. So our likelihood functions are now again written in Python, but they're using Theano, so they're still being compiled on the fly to run at machine speed. So now, yeah, we're back at a pure Python based code base that is still fast and gives us gradients, all due to Theano. So it's very hackable and much more maintainable, much easier for users to install and contribute to. So, yeah, that is much better. And PyMC4 basically takes things to the next
[00:33:21] Unknown:
step forward in that it will be built on TensorFlow, but with basically a similar setup. Yeah. That was 1 of my questions, asking what would be involved in integrating with some of the other deep learning frameworks that are available now and whether that would even make sense, because I know that in Theano, 1 of the main features that you're making use of is its ability to dynamically generate and compile the C functions for speed improvements. So I'm curious if you can talk a bit about some of the benefits that are gained by going with TensorFlow over Theano or any of the other frameworks and if there's any thought to making it swappable. Yeah. That's a good question. So Theano
[00:33:58] Unknown:
is actually a great package. Like, they have been developing this for at least 10 years and worked out a lot of the kinks. It is really, really fast. It has very low Python call overhead, and the code base is quite readable. Unfortunately, they decided to discontinue maintaining it. So it's being phased out, which leaves PyMC3 sort of hanging. We actually took over maintenance of the Theano code base. So at least that way, it'll not completely decay. So for any bugs or version compatibility issues as NumPy and other packages move ahead, we will keep it updated so that PyMC3 users will still be able to use it in a modern ecosystem. But nonetheless, it's not great for the future, right, if your core dependency basically gets dropped. So for a long time, we've been thinking of all kinds of different ways of trying to find a solution. And 1 of the solutions, and there was actually a pull request for that, is to just take PyMC3 and switch out the back end from Theano to something else. That particular 1 was about TensorFlow, but you could also do it with MXNet or some other deep learning library. And that is a pretty viable approach, but unfortunately, TensorFlow in particular has quite a high Python call overhead. And the way that PyMC3 works is that only the model is written in Theano and compiled down to C. The samplers are still in Python. So every time you evaluate the probability of the model, you eat the Python call overhead. So on TensorFlow, that just added way more overhead than we had in Theano. So we would have a really big performance impact.
Then we talked to various other groups maintaining MXNet and TensorFlow. In particular, the TensorFlow Probability guys have been really helpful, and we've been talking a lot with them, and PyMC4 will be built on TensorFlow Probability, which is a package that a team at Google is writing, where they implement basically all the core functionality that you would need in a Bayesian framework. They have all the probability distributions. They have the core components of samplers. Its focus is not on usability. It's on being able to flexibly build all kinds of different models. But yeah, so it was pretty clear that what was needed is a high level interface to TensorFlow Probability.
And that is what we're doing with PyMC4. So now, basically, after replacing the Fortran computation back end with Theano, we not only get a new back end in TensorFlow, but we also get all the probability distributions and a lot of the inference algorithms already in our core dependency. And we just have to focus on making the best API we can and really focus on usability. And also Google is generous enough to support us financially. So for example, now we'll have a developer summit in Montreal with some of the Google guys. So it's going quite well. There's a lot still to be done, and the software is in pre-alpha, I would call it. But nonetheless, that approach showed that, with very few lines of code, we could actually get something that was already usable and quite performant as well. So we're quite excited about what the future holds. And in the process of these rewrites
[00:37:20] Unknown:
and upgrades to the different back ends, have you been trying to maintain API compatibility, or are you more focusing on just being able to move the package forward and take advantage of the different benefits that the back ends provide and then just relying on, users to upgrade their implementation to take advantage of whatever new interfaces you're exposing?
[00:37:41] Unknown:
Right. So we try to do that. So PyMC3 didn't really care too much about backwards compatibility with PyMC 2. And I think the user base expanded by a lot from PyMC 2, and also, yeah, it was quite easy for people to switch. Now we are much more cognizant of backwards compatibility. Unfortunately, due to quite technical reasons, maintaining the exact same interface was impossible for PyMC4, but it will look quite similar. So there is different boilerplate code for how you set up your model, but the model definition itself should look almost identical. So the hope is that it will be very easy to port. So you maybe only need to change a couple of lines that set up your model, but then the model itself can remain largely intact. Having said that, there are also some nice API
[00:38:35] Unknown:
fixes that we're introducing here, so, actually, I think the API will be better. And I know that in the context of particularly deep learning frameworks, but also some of the other machine learning approaches, there's a need for a fairly large amount of data to be able to train the model. I'm wondering how that plays out in PyMC3 and using Bayesian approaches, how it compares in terms of the need for volume of data, labeled data, and just how the overall differences manifest between a probabilistic programming approach versus some of these more unsupervised learning methods where you're just feeding it a lot of data and just tuning the hyperparameters to ensure that you get the desired outputs.
[00:39:24] Unknown:
Right. So it's definitely true that all of that flexibility and automaticity that you get from probabilistic programming comes at a cost. And that is that it is generally much slower than something like machine learning, like in scikit-learn. If you fit a logistic regression, that will be orders of magnitude faster than if you run a similar Bayesian version of that logistic regression in PyMC3, just because they have the inference algorithms particularly tuned to that 1 particular model, while we are much more of a toolbox where you can build very general models, and then it just won't be as fast. So the focus has been mostly on small to medium sized models and definitely not big data type of datasets.
And that is definitely a well recognized problem, and there are different solutions that you can take. So 1 of them is, rather than sampling, which is most of what I talked about and which gives you a very accurate approximation of your posterior distribution, you can use things like variational inference. And these scale much, much better. So rather than sampling from the posterior distribution, they fit a target distribution to your posterior distribution. So you introduce several assumptions, like that your posterior will be normal, and then it's an optimization problem rather than a sampling problem. And that will just work much, much faster and approach speeds like you see if you were to build, like, a neural network model. But, of course, that comes at the cost of a not as good approximation.
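To make the "optimization rather than sampling" point concrete, here is a small sketch of variational inference in plain Python. It assumes a deliberately simple model (unknown mean with a normal prior and unit-variance normal observations) and fits a normal approximation by stochastic gradient ascent on the evidence lower bound using the reparameterization trick. The data, learning rate, and function names are assumptions for illustration, and this is not how PyMC3's variational inference is implemented. Because the true posterior in this toy is exactly normal, the fit can be checked against the closed-form answer; in realistic models the normality assumption is where the approximation error comes from.

```python
import math
import random

rng = random.Random(0)

# Hypothetical data: 50 noisy observations of an unknown mean (true value 2.0).
data = [rng.gauss(2.0, 1.0) for _ in range(50)]
n, total = len(data), sum(data)
prior_sd = 10.0


def grad_log_joint(theta):
    """Derivative in theta of log prior N(0, prior_sd^2) plus log likelihood N(theta, 1)."""
    return -theta / prior_sd**2 + (total - n * theta)


def fit_gaussian_vi(steps=2000, lr=0.005, n_mc=20):
    """Fit q = N(m, s^2) by stochastic gradient ascent on the ELBO."""
    m, log_s = 0.0, 0.0
    for _ in range(steps):
        s = math.exp(log_s)
        grad_m, grad_log_s = 0.0, 0.0
        for _ in range(n_mc):
            eps = rng.gauss(0.0, 1.0)
            g = grad_log_joint(m + s * eps)  # reparameterization: theta = m + s * eps
            grad_m += g / n_mc
            grad_log_s += g * s * eps / n_mc
        grad_log_s += 1.0  # derivative of the Gaussian entropy with respect to log_s
        m += lr * grad_m
        log_s += lr * grad_log_s
    return m, math.exp(log_s)


m, s = fit_gaussian_vi()

# Closed-form posterior for this conjugate model, used only as a check.
post_prec = 1 / prior_sd**2 + n
exact_mean, exact_sd = total / post_prec, post_prec ** -0.5
print(f"VI:    mean={m:.3f} sd={s:.3f}")
print(f"exact: mean={exact_mean:.3f} sd={exact_sd:.3f}")
```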
Then there's also a lot of work being done in terms of the back end. So in particular, TensorFlow is well suited for very big models. And that is what I hope we'll be able to do with PyMC4. So it's possible that these smaller models will be a bit slower, although there's also a good chance that they'll be quite a bit faster. But definitely in PyMC4, I think it'll be much more feasible to build, like, really large models and train them on really large datasets as well, just because TensorFlow was really built for that and also makes really good use of GPUs or TPUs.
And we have seen quite impressive speed ups in that regard. So it's a bit too early to say whether there's actually a use case for that. Like, a lot of the models that are currently being built are these small to medium sized models. But the question is, is that just due to people wanting small to medium sized models or just because the computational framework can't handle more? So that'll be really interesting to see, and to push those boundaries. And there's definitely a good reason to assume that these big models will be very powerful, because as your data grows more and more in size, you can estimate more and more aspects of your data and model them accurately. So with few data points, your model needs to be very simple, but as you scale the dataset size, there's so much more that you can estimate from there and much more complex models that you could build on top of that. And
[00:42:31] Unknown:
with different machine learning frameworks or deep learning approaches, I know that there's occasionally the need to retrain the model from scratch if you start to, experience model drift in a production context. And I'm wondering if the same is true of pymc3 or if because of its
[00:42:47] Unknown:
Bayesian approach, it's able to evolve alongside the data. Right. So in principle, that is possible, and Bayes' formula is at the core of that. So as I said, that is just the formula to update your beliefs. So you see some data, you update your beliefs, you see some more data, you update those beliefs again. And each time you only need to incorporate the new data that you have seen in addition to what you already knew beforehand. And so that would make you think that it's very well suited for online learning. Unfortunately, once you go to a sampling approach, those kinds of things become more difficult because there is approximation error that gets introduced. So every time you do that, you introduce some approximation error, and the quality of your inference will decay over time. With variational inference, that is much easier.
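The "update, then update again" idea can be shown exactly in the one situation where no approximation is needed: a conjugate model. The sketch below assumes a Beta prior on the probability that a day is profitable and made-up win/loss observations; the function name is hypothetical. Updating on 2 batches in sequence gives the same posterior as updating on all the data at once. With sampling-based inference the intermediate posterior is only available as a set of samples, which is where the approximation error mentioned above comes in.

```python
def update_beta(alpha, beta, observations):
    """Conjugate Beta-Bernoulli update: add successes to alpha, failures to beta."""
    wins = sum(observations)
    return alpha + wins, beta + len(observations) - wins


# Made-up data: 1 means a profitable day, 0 means a losing day.
batch_1 = [1, 0, 1, 1, 0]
batch_2 = [1, 1, 0, 1]

# Start from a uniform Beta(1, 1) prior and update as each batch arrives.
alpha, beta = 1, 1
alpha, beta = update_beta(alpha, beta, batch_1)
alpha, beta = update_beta(alpha, beta, batch_2)

# Updating once with all the data gives the same posterior.
all_at_once = update_beta(1, 1, batch_1 + batch_2)

posterior_mean = alpha / (alpha + beta)
print((alpha, beta), all_at_once, round(posterior_mean, 3))
```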
And so online learning, there's more guarantees that you have in in that world. So, yes, that is definitely a possibility, but it's not quite as straightforward
[00:43:47] Unknown:
as 1 would hope. And in terms of your overall experience of building and maintaining Pymc 3, I'm wondering what you have found to be some of the most interesting or unexpected or challenging aspects.
[00:44:01] Unknown:
Yeah. So definitely, it's not the code. It is around the people. And that is definitely the most interesting aspect for me: building the community, and specifically the developer community, which has really grown. So we started with just 3 of us, and then people would show up randomly on GitHub and submit a pull request here and then never be heard of again. Or maybe they submitted a second pull request. And for a long time, that's how we went about it, and we received just random contributions from random people from the internet. And then we started to invite those people that were becoming more active to become core developers, and sort of kept lowering that bar, just because often that really activates people. So we have an internal Slack channel where people get invited once they join the development team. And by now we have 8 active people and we have monthly meetings. So just building that community has been really rewarding and makes the whole thing just much more fun. And that way, you really get to know really talented developers.
And the other thing that also is just really important is to stay on top of what users do. So we started a Discourse forum where people can ask questions, and Junpeng Lao, who's 1 of the core developers, is really active in answering all kinds of modeling questions that people have there. And that really fosters the community as well. The challenges, well, 1 of the challenges that I currently think a lot about is what things to say no to and what features not to include. So you want to merge people's pull requests because that gets them excited to contribute more. But you also don't wanna dilute the quality of your code base by merging things that maybe aren't that required. And then also you wanna go back and actually delete all the stuff that people don't really use or that is not the best. And that we started doing a little bit, but I think there's still a lot of cleanup work to be done. And it's also, of course, not the most prestigious work. So there are fewer volunteers to do that. So these are sort of the pros and cons
[00:46:13] Unknown:
of being involved, I think, in open source in general. And as you mentioned, the developer community around PyMC3 has grown in recent years, and I know that there are a number of different projects that have been built on top of it. So I'm wondering if there are any that you have found to be particularly interesting or noteworthy or any other ways that you've seen PyMC3 used that were particularly unexpected or interesting?
[00:46:36] Unknown:
Yeah. So that has also been really rewarding, seeing other people build on top of that. And 1 of my favorite packages is definitely Bambi, which is an analog to another package called brms, which is built on top of Stan. And what that allows is to build generalized linear models in a very succinct syntax. So if you know R, then it has that really nice syntax where you can say, well, y is distributed according to, like, some covariates that you can include, and then you have interaction terms and just a very simple language to specify linear models. And because in a Bayesian framework you also wanna place priors, you have to extend that syntax a little bit, but then also you wanna have hierarchical coefficients in your linear model. So you can include those as well.
So overall that is a really powerful package for 1 of the most commonly used models that people build. Another package that I really am excited about is called PM-Prophet. And that is a PyMC3 implementation of the Prophet model coming out of Facebook, which is a time series prediction model. And that is different from all kinds of other time series models like ARIMA or recurrent neural networks in that it knows the business calendar. So it knows that Christmas is a particular day and that Monday reoccurs every 7 days. So it has that logic built in and just works extremely well on these business type of time series, which is something that others haven't done before.
And then 2 other ones that I think are really interesting, which are much more domain focused. 1 is called exoplanet, which is for discovering exoplanets outside of our solar system. And I don't know anything about that, but they were basically able to just add a couple of probability distributions that people use in that field and build a package on top of that, that people then use to detect planets. I mean, how cool is that? And then another package in a similar vein is BEAT, which is the Bayesian Earthquake Analysis Tool, I think. So you can use PyMC3 to
[00:48:55] Unknown:
study earthquakes. So, yeah, I just love that kind of stuff. And in your own use of PyMC3, what have you found to be some of the most useful features, and what are some areas that you think could use some additional improvement, particularly as you're preparing for the next release? So for me, the biggest
[00:49:12] Unknown:
thing is the inference button: you build your model in Python and then usually just don't need to worry. You just call pm.sample. And also, now we changed it so that you don't have to supply anything else. So it just detects what type of model it is, then it will decide which sampler to use and automatically tune that sampler and everything. So you don't have to do anything. You just call pm.sample. And in, I would say, 90% of the cases, it will just work. Now for those cases where it doesn't work, you can be lucky and it's not that difficult to fix.
But if your model has some characteristics which lead to a really difficult posterior that is very difficult to sample from, that is a really difficult issue to solve. And that is where we enter the dark magic of model reparameterizations. So for example, in a hierarchical model there is, and I've written a blog post about this, this very dense area in the posterior that the sampler has a really hard time getting in and out of. And there's a very simple trick that you can then do to reparameterize the model. So you rewrite the model in a way that still gives rise to the same model as before, but now the geometry of the posterior is different and makes it way easier to sample from. And that is not automated. Right? You just have to know to do that, and that that is a thing that you can do in the first place. But also you can't always do that. Like, it depends on the data. So these types of issues are just really, really difficult and require expert knowledge of what can be done and what should be done.
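One common instance of this trick in hierarchical models is the "non-centered" parameterization, sketched below in plain Python. The assumption here is that this is the kind of rewrite being described; the numbers and names are invented. Instead of drawing a group-level parameter directly from a normal with scale tau, you draw a standard normal z and compute mu + tau * z. Both versions define the same distribution for theta, which the sketch checks by simulation, but a sampler working on (tau, z) sees a much friendlier geometry than one working on (tau, theta).

```python
import random
import statistics

rng = random.Random(7)
mu = 0.5  # hypothetical population mean, fixed for this sketch


def centered_draw():
    """Centered form: theta is drawn directly from N(mu, tau)."""
    tau = abs(rng.gauss(0.0, 1.0))  # group-level scale with a half-normal prior
    return rng.gauss(mu, tau)


def non_centered_draw():
    """Non-centered form: theta is a deterministic function of a standard normal."""
    tau = abs(rng.gauss(0.0, 1.0))
    z = rng.gauss(0.0, 1.0)  # independent of tau before seeing data
    return mu + tau * z


centered = [centered_draw() for _ in range(20000)]
non_centered = [non_centered_draw() for _ in range(20000)]

# Both parameterizations should agree on the distribution of theta.
print(statistics.mean(centered), statistics.mean(non_centered))
print(statistics.variance(centered), statistics.variance(non_centered))
```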
And that I think is an area ripe for improvement. And Brandon Willard, who's a PyMC3 developer as well, is working on a basic language where you can easily transform these models from 1 specification to the other. So you could imagine that you write your hierarchical model in the way that is problematic and automatically it'll get transformed to the version that you can easily sample from. And so then the user wouldn't even have to know about that or care about that. We just do it automatically behind the scenes for you and give you the best representation of that model. So that would be incredible, and Brandon is making great progress on that. So I'm really excited about what that will hold. Now that will probably not be in the upcoming release, maybe not in the 1 after that, but maybe the 1 after that. And are there any other aspects
[00:51:53] Unknown:
of pymc3 or probabilistic
[00:51:55] Unknown:
modeling that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. So 1 fun thing that is now happening, I think, in the 3rd or 4th year is Google Summer of Code. So we've always participated in that. And people have made great contributions in the past. The Gaussian process submodule, which is very flexible for building nonlinear models, nonlinear regression type models, is 1 such addition, and all our variational inference support comes from Google Summer of Code projects. So that is just extremely rewarding, and it's great for the students as well. So if someone is in school and interested and wants to participate, I'll definitely invite that. That's a great way to get involved. Other things that we've gotten there is approximate Bayesian computation, which is for models where you can't specify a likelihood function, but you can generate data from your process.
Then you can still run Bayesian inference. So, yeah, these types of things are getting worked on and there's just much more to do. So, yeah, I definitely invite everyone to get involved. For me, it's just been a lot of fun to build that package and meet a lot of really interesting people. Right. So for anybody who wants to follow along with the work that you're doing or get involved with PyMC3
[00:53:17] Unknown:
or just get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the fantastic beasts movies. So fantastic beasts and where to find them and crimes of Grindelwald. So they're set in the same universe as Harry Potter, but they take place before all of those books. And it's just, a lot of fun, very well written, enjoyable. So definitely worth a watch if you're at all into any of the Harry Potter or if you just like a good fantasy movie. And so with that, I'll pass it to you, Thomas. Do you have any picks this week? Yeah. So,
[00:53:49] Unknown:
science fiction, Hyperion by Dan Simmons. Probably 1 of my favorite books. And then if you're into meditation at all, check out, The Mind Illuminated, which is, like, a fantastic technical introduction
[00:54:02] Unknown:
to learning meditation. Great. I'll have to check those out. Well, thank you very much for taking the time today to join me and talk about PyMC3 and probabilistic modeling and Bayesian statistics. It's been very interesting, and it is an area that I've found fascinating for a while. So thank you for that, and I hope you enjoy the rest of your day. Well, it's my pleasure. Thanks so much for having me.
Introduction and Sponsor Message
Interview with Thomas Wiecki
Introduction to Probabilistic Programming
Bayesian vs Frequentist Approaches
Overview of pymc3 Project
Example Project: Asset Allocation at Quantopian
Challenges in Model Implementation
Future of pymc3 and TensorFlow Integration
Community and Contributions
Useful Features and Areas for Improvement
Google Summer of Code and Future Projects
Closing Remarks and Picks