Probabilistic Modeling In Python (And What That Even Means)

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.

With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up.

And for your tasks that need fast computation, such as training machine learning models or building your CI pipeline, they just launched dedicated CPU instances.

Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season.

We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.

And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes.

And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.

Your host as usual is Tobias Macey. And today, I'm interviewing Thomas Wiecki about PyMC3, a project for probabilistic programming in Python. So, Thomas, could you start by introducing yourself?

Sure. Yeah. So I'm Thomas Wiecki. I currently work at Quantopian, which is a Boston-based startup in the financial space, and I'm the VP of data science there. Before that, I did my undergrad in computer science at the University of Tübingen and then did my PhD at Brown University.

And do you remember how you first got introduced to Python?

Yeah. Certainly. So, when I was still an undergrad, I spent a summer at MIT doing brain stuff. In particular, we were training rats to do certain experiments. The rats I was training had to do a discrimination task that required me to lock myself into a dark room in the middle of the summer and just train these rats, basically. So that felt like a really long time, these 2 hours, but it left me a lot of other time to explore other things. And that's when I started using Python. Before that I was a MATLAB user, like a lot of the field back then. So that was the early 2000s.

And what I liked about MATLAB was that it really allowed you to very easily do numerical processing, have access to matrices and all these things, but it really didn't feel like a proper programming language. So building actual packages out of that was a real hassle.

And so Python, I immediately fell in love with because it was just as easy to work from the Python shell or the IPython shell and then create plots and get NumPy and SciPy and all these other things that could get you to parity with MATLAB, but also be a natural programming language to build packages.

So I really dove into that because I really had time to dedicate to it. So at the end of the summer, while those rats never learned what I tried to teach them, I came out of that being a pretty good Python programmer.

And so the PyMC3 package says that it is usable for probabilistic programming. So can you start by explaining a bit about what that means?

Sure. Yeah. So as the name suggests, it is about probabilistic models. And these probabilistic models you specify in computer code. And that already makes me really like it because it allows someone with a computer science background to really dive into statistics.

So I guess I should just talk a little bit more about what probabilistic models are. So, basically, a probabilistic model is something that is made up out of probability distributions.

So probabilistic programming allows you to do that quite easily, where you just have a whole bunch of probability distributions that represent how your problem is structured, and then you plug those together. And that is the common way that you do that. Right? You plug probability distributions together, that is statistics, and then you do a whole bunch of math derivations to get at your estimator.

Now because we here are in code, not math, we can just write a line representing this is a normal distribution and it's the input to another normal distribution.

And then the computer figures out all the math for us so we don't have to stand in front of a whiteboard. We are inside of a Jupyter notebook and just write down the model that we want.
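To make the idea of one normal distribution feeding another concrete, here is a minimal sketch in plain Python using only the standard library rather than PyMC3. The function name, the seed, and the specific means and standard deviations are illustrative assumptions, not values from the conversation.

```python
import random
import statistics


def sample_chained_normals(n_draws, seed=0):
    """Draw from a normal whose mean is itself drawn from another normal."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        mu = rng.gauss(0.0, 1.0)   # first distribution: an unobserved mean
        y = rng.gauss(mu, 0.5)     # second distribution: takes mu as its input
        draws.append(y)
    return draws


draws = sample_chained_normals(10_000)
print(statistics.fmean(draws), statistics.pstdev(draws))
```

Because both steps are ordinary random draws, the spread of `y` combines both sources of variation: its variance is 1.0 + 0.25.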

The other really important aspect is that the types of models we're building are generative.

So it allows us to plug these probability distributions together in computer code and not in math. And also these types of models are generative. So that means we think about the causes of our problem, and often these are unobserved. And then we think of how these causes generate the data that we end up observing.

So one example for that would be A/B tests, which are very common if you're building websites. Right? You wanna know which of those 2 websites, say, has a higher conversion rate. Maybe you want people to sign up to your website. So you have 2 different versions and you want to say which one has a higher conversion rate. So at the core, that's a statistical problem, and you could use frequentist statistics.

But in here, we're gonna use Bayesian statistics. And the way to think about this in a generative framework is, well, we have these 2 unknown, unobservable causes which produce our data. And these unknown causes are the conversion rates, and these produce either a user to sign up or not sign up. And these conversion rates, which we can't observe, because we're in a Bayesian framework here, we have to place priors on them. And that's a technical term from Bayesian statistics. And what it means is

we define a probability distribution, without having seen the data, of what we believe likely values of the parameters to be. So in this case, we could say, I know ahead of time, right? If you're working in the web space, you know that conversion rates tend to be extremely low. I don't do that myself, but I know that it's less than a percent or in that area, but definitely not, say, 20%.

So the probability distribution that I would build would place a lot of probability mass on really low values because I know ahead of time that those are going to be likely, but something like 50%, while not completely impossible, is going to have a very, very low probability.

And there's definitely some leeway in how you can define these priors. And maybe you're not an expert, so you could place wider priors on that, or you could ask an expert to provide you with that insight. So okay. Now we have defined our 2 conversion rates.

And, oh, one more thing on the priors: because it's a conversion rate, it's essentially a probability. So it has to be between 0 and 1, or 0 and 100%. And the distribution that is commonly used for that is the beta distribution.

So that's just the name of that. And we can define the parameters of the beta distribution to give a lot of probability mass to very low values. And then we have to link that to observed data.

So we say, okay, from these unobserved causes, we generate data that we observe.

And in our case, these will be binary events, right? Someone did sign up or they didn't sign up. And the probability distribution, the likelihood function for that, is called the Bernoulli, or coin-flip, distribution.

So that basically defines our model. So we have our beta distributions as priors for our unobserved conversion rates. And then we have our likelihood function, how my data is distributed: Bernoulli. So far, I haven't seen any data yet. So that is just a model specification.
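The model specification described here, a beta prior on an unobserved conversion rate and a Bernoulli likelihood for each visitor, can be sketched generatively with the standard library alone. This is not PyMC3 code. The prior parameters `alpha=1` and `beta=99` (prior mean of 1%) are an assumption chosen to put most mass on low rates, and the function name is hypothetical.

```python
import random


def simulate_signups(n_visitors, alpha=1.0, beta=99.0, seed=0):
    """Generate data the way the model says it arises: cause first, then data."""
    rng = random.Random(seed)
    rate = rng.betavariate(alpha, beta)  # unobserved cause: the conversion rate
    # Bernoulli likelihood: each visitor signs up (1) or not (0).
    signups = [1 if rng.random() < rate else 0 for _ in range(n_visitors)]
    return rate, signups


# Check what the assumed prior believes before any data is seen.
rng = random.Random(1)
prior_draws = [rng.betavariate(1.0, 99.0) for _ in range(20_000)]
prior_mean = sum(prior_draws) / len(prior_draws)
share_above_20pct = sum(d > 0.2 for d in prior_draws) / len(prior_draws)

rate, signups = simulate_signups(1_000)
print(prior_mean, share_above_20pct, rate, sum(signups))
```

Drawing from the prior like this is a quick way to see whether it encodes the belief described in the conversation: rates around a percent are plausible, rates of 20% essentially are not.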

Next, once I observe data, I actually wanna make inference about these hidden causes. So what are the most likely causes to have generated that data? And again, because we are in a statistical setting, we're not satisfied with just a single answer, like the conversion rate is going to be 0.5%.

Here, we also want a distribution of possible values. So it could be between 0.2 and 0.6%. But which of those values exactly, I can't say because I have only observed limited amounts of data. And the distribution that formalizes this is called the posterior distribution.

And that posterior distribution, another way to think of that is just as my belief in these underlying causes, in the conversion rates. So we start with some initial belief without having seen the data, my priors. Then I observe data. I've learned something about the world. I update my knowledge into my posterior distribution. Once I have these posterior distributions, I can do statistics on them, like saying, okay, which of those has the higher conversion rate, or what is the probability, rather, that one is better than the other? And that will tell you, okay, now with a 98% probability, the second version of your website is going to be better. So immediately, you know what is happening, and you can also then say, maybe it's still not high enough. So I wanna observe more data, then you update your posteriors again, and then it's at 99%. So then you say, okay, now we're gonna deploy. So that is how the whole problem is structured.

The difficulty that I haven't mentioned is in this updating process, going from priors to posteriors by observing data.
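For the beta-Bernoulli A/B model specifically, the update from prior to posterior happens to have a closed form: a Beta(alpha, beta) prior combined with s sign-ups out of n visitors gives a Beta(alpha + s, beta + n - s) posterior. The sketch below uses that shortcut instead of a general-purpose sampler, which is a simplification that works only for this kind of model. The visitor counts and prior parameters are made up for illustration.

```python
import random


def posterior_params(alpha, beta, signups, visitors):
    """Conjugate update: Beta prior + Bernoulli data -> Beta posterior."""
    return alpha + signups, beta + (visitors - signups)


def prob_b_beats_a(post_a, post_b, n_draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling from both posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        if rng.betavariate(*post_b) > rng.betavariate(*post_a):
            wins += 1
    return wins / n_draws


# Hypothetical data: 50 vs. 80 sign-ups out of 10,000 visitors each.
post_a = posterior_params(1.0, 99.0, signups=50, visitors=10_000)
post_b = posterior_params(1.0, 99.0, signups=80, visitors=10_000)
p_b_better = prob_b_beats_a(post_a, post_b)
print(p_b_better)
```

Collecting more data and calling `posterior_params` again with the larger counts corresponds to the "observe more, update, then decide" loop described above.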

That is what Bayes' formula is doing. Bayes' formula is amazing. It's a very elegant formula and looks very simple. But then when you actually try and do the math of that, you very quickly run into problems for even fairly simple models. And that is what has held Bayesian statistics back for a really long time as well. We have this great framework, but we can't use it because once we start using it, we get these nasty multidimensional integrals over infinities, and we just can't solve them. So luckily, there have been all kinds of approximate inference algorithms being developed. And what is especially powerful about them for the purposes of probabilistic programming is that these work largely automatically on whatever model you dream up. So that allows you in a probabilistic programming framework to write your model in code, then hit what I like to call the inference button and just get your posterior approximation.
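As a rough picture of what an "inference button" does, here is a minimal random-walk Metropolis sampler, the simple algorithm that comes up later in the conversation as Metropolis-Hastings. It is not the Hamiltonian Monte Carlo sampler PyMC3 uses by default. It only needs the unnormalized log posterior, so the integral in the denominator of Bayes' formula never has to be solved. The data, the prior parameters, the step size, and the burn-in fraction are all assumptions for illustration.

```python
import math
import random


def log_posterior(rate, signups, visitors, alpha=1.0, beta=99.0):
    """Unnormalized log posterior: Beta prior times Bernoulli likelihood."""
    if not 0.0 < rate < 1.0:
        return -math.inf
    return ((alpha - 1 + signups) * math.log(rate)
            + (beta - 1 + visitors - signups) * math.log(1.0 - rate))


def metropolis(signups, visitors, n_steps=20_000, step=0.001, seed=0):
    """Random-walk Metropolis over a single conversion rate."""
    rng = random.Random(seed)
    current = 0.01
    current_lp = log_posterior(current, signups, visitors)
    samples = []
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, signups, visitors)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples[n_steps // 10:]  # drop the first 10% as burn-in


samples = metropolis(signups=50, visitors=10_000)
posterior_mean = sum(samples) / len(samples)
print(posterior_mean)
```

Changing the model only means changing `log_posterior`; the sampling loop stays the same, which is the property that makes this style of inference automatic.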

And then maybe you're not totally satisfied with the model. Some aspect of the data is not being modeled correctly. So then you go ahead and change the model. And without having to rederive anything in math, you just hit the inference button again and you get your answers. So that is, I think, how I view the benefits of the approach of probabilistic programming.

And I know that there's sort of this dichotomy in statistics between Bayesian approaches and frequentist approaches, and it sounds like one of the reasons that the frequentist approach has gained larger mind share is because of these issues of being able to effectively compute the probability distributions of the Bayesian model. And it seems that because we now have tools like PyMC3 and other Bayesian statistical algorithms, it's starting to gain back a bit of interest and popularity in the statistics community. So I'm curious what your experience has been on that front.

Yeah. I think that's definitely true. So I would view frequentist as more like the 20th century statistics where, yes, we had to be able to derive these estimators using math to closed-form solutions, and frequentist just was easier in that regard for many cases.

While all the math is completely correct and it's a solid framework, it is often a framework that is not really well suited for the scientific discovery process. And one of the key reasons is that people think about frequentist statistics in a way that is actually much more compatible with Bayesian statistics.

So they think that a p-value smaller than 0.05 means that the probability of a false discovery is smaller than 0.05. But that is the Bayesian answer, not the frequentist answer. I don't wanna go into too much detail. I encourage listeners to just read up on what a p-value actually means. But it comes with all these problems, like, it's very dependent on the exact data generation and data collection process. So it's different if you say, I collected data for 15 days and got 20 subjects, or I collected data until I had 20 subjects, or all kinds of different ways that give you the same number of data points. But actually it matters for the statistical test what your protocol was. And that just is not very well suited. And

now the other aspect where I would say Bayesian statistics really shines is that because frequentist statistics relies on these mathematical derivations, it is difficult to build a new model, the model that you actually think your problem is. Right? So if it's not a t-test, but something like a t-test, you would need to go back to the drawing board and rewrite the formulas.

And if you're lucky, then everything checks out. But if not, then you're basically left to your own devices. In Bayesian statistics and with probabilistic programming, you just build whatever model you think is best suited to understand your data. And then just, yeah, hit the inference button and be on your way. And I definitely see a slow but steady move away from frequentist statistics. Academia in general seems to be slow on the uptake, but I think it's mainly a question of time.

And so now that we've established what probabilistic programming is and some of the differences between Bayesian and frequentist approaches, I'm wondering if you can describe what the PyMC3 project is and what it offers and how you got involved with it.

Sure. Yeah. So when I first learned about Bayesian statistics in grad school, that was at a workshop in Amsterdam which E.J. Wagenmakers taught, and there we were using WinBUGS, which was like the package for that, and which was at its time ahead of its time. But now it's definitely neither of those.

So that, like most other packages, required you to write your model in this domain-specific language, which they defined. And because I was already back then a Python user, I was looking for a package that was giving that, and that way found PyMC2, and that allowed you to specify models directly in Python code. So that really suited me well. And I started writing, for the models I was building in grad school, a library that built on top of that. So just that being possible was extremely powerful. And that's how I got to know the authors of PyMC2 better, like Chris Fonnesbeck, and John Salvatier was really involved with that. And talking more to John, he talked about this new class of samplers called Hamiltonian Monte Carlo, which I didn't really know anything about. Other packages at that time were just using what's called Metropolis-Hastings, which is a quite easy and general inference algorithm, but it's not great. It often fails and it's quite slow. And this new class of samplers is just way better. So we talked about ways of maybe including that in PyMC2, but it just wasn't suited for that because for these new types of samplers, you really need the gradients of your model. And the package just wasn't designed to handle that.

So John, and I think he was the first to really make the connection, found that package Theano, which is the first of its kind to really be a deep learning library where you do all the matrix multiplications to build your neural network model, and then also compute the gradients, which you need to do your backpropagation and gradient descent to fit the neural network model. And he realized that it was possible to use that package not just for neural networks, but that it was a general computational package where you can implement all kinds of math and then get your derivatives for that. And moreover, it allows us to compile things to C and then run at machine speed to be really fast. So he realized that you could implement a probabilistic programming framework on top of Theano and then did a lot of the core work to really do that. And that is what resulted in PyMC3. And then later, yeah, Chris and I basically got involved in that and pushed it further.
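The point about implementing some math and then getting derivatives for it can be illustrated with a tiny automatic differentiation sketch. This uses forward-mode dual numbers in plain Python, which is a different and much simpler technique than the graph-based differentiation and C compilation that Theano performs; the class and function names are hypothetical. It differentiates the log-density kernel of a normal distribution with respect to its mean, the kind of gradient a Hamiltonian Monte Carlo sampler needs.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to one input."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __neg__(self):
        return Dual(-self.val, -self.grad)

    def __sub__(self, other):
        return self + (-_lift(other))

    def __rsub__(self, other):
        return _lift(other) + (-self)

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def normal_logp_kernel(x, mu, sigma):
    """Log-density of Normal(mu, sigma) at x, up to an additive constant."""
    z = (x - mu) * (1.0 / sigma)
    return -0.5 * (z * z)


def grad_wrt_mu(x, mu, sigma):
    """Derivative of the log-density with respect to mu."""
    return normal_logp_kernel(x, Dual(mu, 1.0), sigma).grad


print(grad_wrt_mu(1.0, 0.2, 0.5))
```

The model function is written once as ordinary arithmetic, and the derivative comes along without any hand-written calculus; the analytic answer here is (x - mu) / sigma**2.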

But yeah. So it is basically the extension of that package. The benefits are that you can write your models in Python and everything stays in Python, and you can interact with your model in Python. And it is really fast because we compile the code to C or the GPU, and it has these next-generation samplers, which are just vastly better than anything that came before.

And I know that some of the samplers are things like Markov chain Monte Carlo or variational inference, and the opening line for the project README is actually very packed with some of these very technical terms. So it says PyMC3 is a Python package for Bayesian statistical modeling and probabilistic machine learning focusing on advanced Markov chain Monte Carlo and variational inference algorithms. So I'm wondering if you can just unpack that a bit and discuss some of the ways that PyMC3 is used in some real world projects. I know you already mentioned the possibility of being able to use it for establishing the statistical significance of things like A/B testing. But I'm wondering if you can just talk through some of the other ways that you have used it or that you've seen it used in other contexts.

Sure. Yeah. So that has been really rewarding to see, how people really make use of the software. And it's actually not that easy to get a sense of what people use your package for. If they use it and it works, they just don't come back and tell you that they used it for a really cool purpose and that it worked well for that. In academia, we see that just because of all the papers citing PyMC3. And these come from all kinds of scientific disciplines, like a lot of astronomy stuff, ecology, quant finance, econometrics, biology, zoology, earthquake stuff. It's probably one of my favorite things because all the other papers are, of course, in my domain where I did my PhD, and so I can read them and understand them, but the people that are citing PyMC3, I have no idea what is going on there. So, yeah, it's being used in all kinds of different areas for all kinds of interesting purposes.

And moreover, it's also used in industry. So there are quite a few companies that we know that use it in production in all kinds of different ways to solve all kinds of different problems, really big companies and startups. So, yeah, it's used for all kinds of stuff. But actually one thing that I do see often is A/B tests, interestingly. And finally, we also use it at Quantopian to make our asset allocation. So just, I guess, a word on that. We are, as I like to call it, a crowdsourced hedge fund. So we have a huge community of quants that come to our website and develop trading algorithms to invest in the stock market. And we pick the best of those algorithms and then license them from the author to include into our fund. And part of my job is to select the best ones of those, and PyMC3 is being used in that process as well.

And I know that a lot of the documentation as I was reading through is pretty heavy with different references to probability theory and statistical

pymc3

being in Python is that it helps provide a bit of a higher-level abstraction over some of those approaches. So I'm curious what you have found to be the requisite amount of knowledge of statistical modeling and Bayesian statistics to be able to make effective use of PyMC3?

Right. Yeah. So there's definitely an initial investment that has to be made. So you need to know about probability distributions, just so you know, basically, what those do and how to set the right priors. Then you need to know about priors and posteriors and how to interpret them, and a little bit about the model building process, how to do that properly. And then, probably quite a few Bayesian practitioners would disagree, but I don't think you really need an in-depth knowledge of the inference algorithms, for example. The hope is, and I think we have succeeded at that to some degree, that you don't really need to know about that. It's just the inference button and you press it. And without really knowing what's happening, it will do the right thing and give you the right answers. And that actually works well also because these new types of inference algorithms, Hamiltonian Monte Carlo samplers, when they don't work, they fail in a spectacular way. And it is really easy then to alert the user and say, actually, don't trust these results. Something went really wrong there. And then, unfortunately, it can be quite difficult and require quite some in-depth knowledge to solve those kinds of problems. So when the default approach doesn't work, things get more difficult and intricate, but otherwise you get by with core knowledge of statistics and probability.

And I think, yeah, the best way to get started is probably to pick up a book that introduces these concepts. So there's Bayesian Methods for Hackers, which is a really fun read, which you can find online, and it's ported to PyMC3. So that is a great read. And then Osvaldo Martin has just come out with the second edition of his book, which focuses on PyMC3 and does a great job of introducing everything in a very practical and applied way. So for those types of data scientists, I would definitely recommend both.

So can you start by just talking through an example project that would use PyMC3 and the overall workflow of setting up the model, some of the syntax, you know, I don't need you to spell it out, but at a high level, the way that the syntax is represented, and just the overall workflow and any heuristics that would be useful for figuring out how to set your priors, building the model, any sort of training process that goes on, and then evaluating the outputs?

Sure. Yeah. So one particular use case, which was also very educational for me, was building that model for Quantopian to help with our asset allocation. And the problem is actually quite similar to the A/B testing problem. We have different algorithms and they produce daily returns. So either you increase your returns by 5% on the first day and then maybe go 2% down the next day. Mostly these are much smaller values, but you get the idea.

So we have these return series for every single algorithm, and we wanna know which one of those has the best performance-to-risk profile. So there, we have the unknown causes, which are, well, what is the mean of that returns distribution? Is it positive or negative or 0? And we have the variance of that. So these are 2 parameters that we wanna estimate here. And so in PyMC3 code, that would be 2 lines. So we say, okay, we have the mean that we wanna estimate but don't know, and we place a prior on that. So here I would say, okay, I already know ahead of time that the mean returns of any strategy will be very, very close to 0, right? Just 10% on average every day is absolutely impossible.

So it's going to be close to 0. So I would pick a normal distribution that centers things very close to 0, a very narrow range around 0 that is possible given what we've seen. So that's the first line. And then the second line says, okay, now I need a parameter for the variance of those returns.

And there, the first thing that I realize is, okay, well, the variance must be positive. So it can't be just a normal distribution because that also has probability mass below 0. So we choose something like the absolute value of a normal distribution, called a half-normal, or there are other choices that you could make, but those are some pretty sane starting conditions.

And then I would define my likelihood function, which is how my data is distributed. So that's the third line. And that has 2 input parameters. And here I'm linking it to my unobservable causes. So the mean of that likelihood function is my parameter defined in the first line, and the variance is the parameter defined in the second line. So that's how the model links together. And then I pass in the data that I have observed. So that is the daily return series that I have. Now here, I have some interesting choices, actually, about what kind of likelihood I use or how I think my observed data is distributed.
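The three-line model described here can be written out as a log posterior in plain Python to show how the pieces link together. In PyMC3 the three lines would correspond to distributions such as `pm.Normal`, `pm.HalfNormal`, and a second `pm.Normal` with `observed=` data; the sketch below is a standard-library stand-in, not the model used at Quantopian. It parameterizes spread by the standard deviation rather than the variance, and the prior scales (0.01 and 0.05) are assumptions.

```python
import math


def normal_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def half_normal_logpdf(x, sigma):
    """Density of |Normal(0, sigma)|: only non-negative values are allowed."""
    if x < 0:
        return -math.inf
    return 0.5 * math.log(2 / math.pi) - math.log(sigma) - 0.5 * (x / sigma) ** 2


def log_posterior(mean, sd, returns, mean_prior_sd=0.01, sd_prior_scale=0.05):
    """Unnormalized log posterior of the daily-returns model."""
    if sd <= 0:
        return -math.inf
    lp = normal_logpdf(mean, 0.0, mean_prior_sd)             # line 1: prior on the mean
    lp += half_normal_logpdf(sd, sd_prior_scale)             # line 2: prior on the spread
    lp += sum(normal_logpdf(r, mean, sd) for r in returns)   # line 3: likelihood of the data
    return lp


# Hypothetical daily returns for one strategy.
example_returns = [0.004, -0.002, 0.001, 0.003, -0.001]
print(log_posterior(0.001, 0.003, example_returns))
```

Each parameter value gets a score that combines how plausible it was beforehand with how well it explains the observed returns; a sampler then explores that score.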

So a good default choice here would be a normal distribution, but, actually, it's a pretty well known fact that financial return series are not normal, but they have much heavier tails, much larger outliers than a normal distribution would have. So during something like the financial crash in 2008 and 2009, you get, like, 20% daily moves all of a sudden. And a normal distribution just doesn't account for that really. So you could use something else like a Student's t-distribution.
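To see how differently the two likelihoods treat an extreme day, the sketch below compares the log-density of a 20% daily drop under a normal and under a Student's t-distribution with the same location and scale. The 1% scale and the 3 degrees of freedom are assumptions picked for illustration.

```python
import math


def normal_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def student_t_logpdf(x, nu, mu, sigma):
    """Log-density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z * z / nu))


crash = -0.20  # a 20% daily move
normal_lp = normal_logpdf(crash, 0.0, 0.01)
student_lp = student_t_logpdf(crash, 3.0, 0.0, 0.01)
print(normal_lp, student_lp)
```

Under the normal likelihood such a day is so improbable that a single crash can dominate the fit, while the heavier-tailed alternative treats it as rare but possible.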

Yeah. Those are different choices you would have, but in the model that just changes 1 line, essentially, where you'd say, well, now I'm using a different likelihood function. So with 3 lines, I have defined my model that tells me what is the expected return and the risk, the variance of that strategy, the volatility. And then I would just run my sampler and get my posterior that tells me, okay, the mean. Then I could take a statistic and ask, okay, what is the probability that the mean return of that strategy is positive? And so that is similar to the model that I built in the first pass. And then I would report that number and say, okay, the probability that that strategy is profitable is, say, 75%.
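The reported number, the probability that the mean return is positive, can be approximated for this small model without a sampler by evaluating the posterior on a grid. This grid approximation is a crude stand-in for running a sampler such as the one in PyMC3. The prior scales, grid ranges, and return values are all assumptions, and additive constants in the log-densities are dropped because they cancel when normalizing.

```python
import math


def prob_mean_positive(returns, mean_prior_sd=0.01, sd_prior_scale=0.05,
                       n_mean=201, n_sd=100):
    """Grid approximation of P(mean > 0 | data) for the normal returns model."""
    n = len(returns)
    xbar = sum(returns) / n
    ss = sum((r - xbar) ** 2 for r in returns)
    mean_grid = [-0.02 + 0.04 * i / (n_mean - 1) for i in range(n_mean)]
    sd_grid = [0.001 + 0.049 * j / (n_sd - 1) for j in range(n_sd)]
    log_post = []
    for mean in mean_grid:
        for sd in sd_grid:
            lp = -0.5 * (mean / mean_prior_sd) ** 2      # normal prior on the mean
            lp += -0.5 * (sd / sd_prior_scale) ** 2      # half-normal prior on the sd
            # Normal likelihood, written with sufficient statistics.
            lp += -n * math.log(sd) - (ss + n * (xbar - mean) ** 2) / (2 * sd ** 2)
            log_post.append((mean, lp))
    top = max(lp for _, lp in log_post)  # subtract the max for numerical stability
    total = sum(math.exp(lp - top) for _, lp in log_post)
    positive = sum(math.exp(lp - top) for mean, lp in log_post if mean > 0)
    return positive / total


# Hypothetical daily returns: 60 days with a small positive average.
returns = [0.004, -0.002, 0.001, 0.003, -0.001,
           0.002, 0.0, 0.005, -0.003, 0.002] * 6
p_profitable = prob_mean_positive(returns)
print(p_profitable)
```

Grids stop being practical once a model has more than a handful of parameters, which is where general-purpose samplers take over.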

So we're operating here in very high uncertainty environments. And then that number could be used by the person selecting the strategies to make a more informed decision. And I thought, okay, that's amazing. Really cool approach that should make our asset allocations much better. But what I found was that number just wasn't used. Right? It was reported, and people looked at it in the beginning, but then just started ignoring it. And the reason is, well, there are several reasons. One is we're pretty bad at incorporating this uncertainty information in our decision making process, and there are no easy answers where it says, oh, this one has 99% and this one has 50%, so obviously you should choose the one that is profitable with 99% probability. So it's all in this gray area. And then you say, well, I have 10 strategies, and they have different probabilities ranging from 60 to 75%. Should I use all of them? Should I just use the top 10? These are really difficult questions that basically led people to make the decisions that they made before, which are the decisions that they think are best, but not really based on evidence or the model.

And the solution that we found, which I think is profound, is pushing the model one step further into the decision making pipeline. And I find that is often underappreciated. So what you can do once you have your model, rather than just reporting that number, right, that probability output, you can actually generate new data, different scenarios of what the future could look like given your model and what the model has learned on the data. So something that has a very high probability of generating good returns will generate really good looking returns into the future. Something that is in between will generate, well, sometimes it'll go down, sometimes it'll go up.

So you do that for all your strategies. You generate all kinds of different future scenarios, and then you plug those into an optimizer and say, okay, give me the selection of strategies that will maximize my profitability, or rather minimize the risk of getting a terrible loss. But either way, you wanna optimize so that you find the best selection of algorithms with the best properties. And now all of a sudden, rather than reporting a number that people didn't look at, now the output is, given all the inputs that are available, this is the best selection and how to weight them for your portfolio allocation problem.

And that was what we used then to actually make these decisions. So all of a sudden, there was no more human in the loop to make these really difficult decisions, making very difficult trade-offs, and that model just worked way better and was actually being used for its intended purpose. So that idea, I think, of not just making a model, and it could even be a machine learning model, but really going the extra mile and not just reporting, oh, yeah, this is an amazing predictor or whatever, and it is that accurate. That's not that useful. Right? In the business world, often people don't really care as much about how cool your model is. What they care about is, well, A, does it work? And B, does it make money? So as a data scientist, to be successful, you really need to speak their language. And then that allows you to do that, where you can show that this process works way better than what we did before, and we think it'll be 10% more profitable or something like that. So, yeah, I think that is a really critical tool for data scientists to have impact in actual business processes.
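The scenario-generation and optimizer step described above can be sketched roughly as follows. This is a hypothetical illustration, not Quantopian's method: a normal approximation stands in for the posterior of each strategy's mean, the optimizer is a brute-force search over equal-weight subsets (it does not choose weights), and the objective is the 5th percentile of simulated portfolio return as a proxy for "the risk of a terrible loss". All names and numbers are made up.

```python
import itertools
import math
import random
import statistics


def simulate_scenarios(returns, horizon, n_scenarios, rng):
    """Simulate total returns over `horizon` future days, many times over."""
    mean = statistics.fmean(returns)
    sd = statistics.stdev(returns)
    # Assumption: a normal approximation stands in for the posterior of the mean.
    mean_se = sd / math.sqrt(len(returns))
    scenarios = []
    for _ in range(n_scenarios):
        mu = rng.gauss(mean, mean_se)  # one plausible value of the true mean
        scenarios.append(sum(rng.gauss(mu, sd) for _ in range(horizon)))
    return scenarios


def best_selection(scenarios, quantile=0.05):
    """Brute-force search for the equal-weight subset with the best bad case."""
    names = sorted(scenarios)
    n = len(scenarios[names[0]])
    best, best_score = None, -math.inf
    for k in range(1, len(names) + 1):
        for subset in itertools.combinations(names, k):
            totals = sorted(
                sum(scenarios[name][i] for name in subset) / k for i in range(n)
            )
            score = totals[int(quantile * n)]  # portfolio return in a bad scenario
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


rng = random.Random(0)
# Hypothetical daily return histories for three made-up strategies.
history = {
    "good_a": [rng.gauss(0.002, 0.005) for _ in range(250)],
    "good_b": [rng.gauss(0.002, 0.005) for _ in range(250)],
    "bad": [rng.gauss(-0.003, 0.005) for _ in range(250)],
}
scenarios = {
    name: simulate_scenarios(series, horizon=60, n_scenarios=2000, rng=rng)
    for name, series in history.items()
}
selection, worst_case = best_selection(scenarios)
print(selection, round(worst_case, 3))
```

The output is a decision (which strategies to include) rather than a probability someone has to interpret, which is the shift the conversation describes.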

And so could you talk a bit now about how the PyMC3 project itself is implemented? I know that it's gone through a few different generations, going from 1 to 2 and then 2 to 3, and I know that version 4 is in the works. But I'm curious if you can just dig into some of the architecture and some of the ways that it has evolved and some of the things that you're thinking about as you look to the next version.

So PyMC 1, I don't even know what, oh, yeah. I think so. Basically, PyMC 1 was a pure Python-based implementation of the probability distributions that Chris Fonnesbeck wrote, I think in his PhD or something. And because it was all in Python, it was really, really slow.

Then he, with a few other people, started working on PyMC 2, where they found that, well, most of the time is being spent in evaluating these probability distributions, so we should make these as fast as possible, and faster than Python is Fortran. So they wrote all the likelihood functions in Fortran, and that would speed up the model evaluation by a lot. So it was much faster.

But the downside is that now you have this Fortran dependency and linking between Python and Fortran that make things really difficult to hack on and really difficult to compile and ship and take to production. So that was obviously a downside.

And then

as I already mentioned,

John Salvatier wrote PyMC3 and used Theano for all of that, and that just made things

way easier

and also way faster. So our likelihood functions are now again written in Python, but they're using Theano, so they're still being compiled

on the fly to run at machine speed. So now, yeah, we're back at a pure Python based code base that is still fast and gives us gradients

all due to Theano. So it's very hackable and much more maintainable,

much easier for users to install and contribute to. So, yeah, that is much better. And

PyMC4 basically takes things to the next

step forward in that it will be built on TensorFlow, but with basically a similar setup. Yeah. That was 1 of my questions was gonna be asking what would be involved in integrating with some of the other deep learning frameworks that are available now and whether that would even make sense because I know that in Theano, 1 of the main features that you're making use of is its ability to dynamically generate and compile the C functions for speed improvements. So I'm curious if you can talk a bit about some of the benefits that are gained by going with TensorFlow

over Theano or any of the other frameworks and if there's any thought to making it swappable. Yeah. That's a good question. So Theano

is actually a great package. Like, they have been developing this for

at least 10 years and worked out a lot of the kinks. It is really, really fast. It has very low Python call overhead,

and the code base is quite readable. Unfortunately,

they

decided

to discontinue

maintaining it. So it's being phased out, which leaves PyMC3 sort of hanging. We actually took over maintenance of the Theano code base. So at least that way, it'll not completely decay. So for

any bugs or version compatibility issues as NumPy and other packages move ahead, we will keep it updated so that PyMC3 users will be able to still use it in a modern ecosystem.

But nonetheless, it's not great for the future, right, if, like, your core

dependency basically gets dropped. So for a long time, we've been thinking of all kinds of different ways of trying to find a solution. And 1 of the solutions, and there was actually a pull request for that, is to just take PyMC3 and switch out the back end from Theano to something else. That particular 1 was about TensorFlow, but you could also do it with MXNet or some other deep learning library. And

that is a

pretty viable approach, but unfortunately,

TensorFlow in particular

has quite a high Python call overhead. And the way that PyMC3 works is that only the model is written in Theano

and compiled down to C. The samplers

are still in Python. So every time you evaluate the probability of the model, you eat the Python call overhead. So on TensorFlow,

that just added way more overhead than we had in Theano. So we would have a really big performance impact.
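To make the overhead argument concrete, here is a minimal sketch in plain Python. It is not PyMC3 code: the names `make_counted_logp` and `metropolis` are hypothetical, and an ordinary Python function stands in for the compiled model. The only point is that a sampler loop written in Python crosses the Python call boundary once for every log-probability evaluation, so any fixed per-call cost is paid thousands of times.

```python
import math
import random


def make_counted_logp():
    """Return a stand-in log-probability function plus a call counter.

    In PyMC3 the model's log-probability would be a compiled Theano
    function; here a plain Python function plays that role (an assumption
    made purely for illustration).
    """
    calls = {"n": 0}

    def logp(x):
        calls["n"] += 1
        return -0.5 * x * x  # unnormalized standard normal log-density

    return logp, calls


def metropolis(logp, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis sampler whose loop runs in Python."""
    rng = random.Random(seed)
    x = 0.0
    current = logp(x)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        proposed = logp(proposal)  # one Python-level call per step
        # Accept with probability min(1, exp(proposed - current)).
        if math.log(1.0 - rng.random()) < proposed - current:
            x, current = proposal, proposed
        samples.append(x)
    return samples


logp, calls = make_counted_logp()
samples = metropolis(logp, n_steps=5000)
print(calls["n"])  # 5001: one initial evaluation plus one per step
```

If each call carried, say, a fixed extra cost from the backend, that cost would be multiplied by the number of sampler steps, which is why a backend with higher call overhead hurts a Python-level sampler so much.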

Then we talked to various

other groups maintaining,

MXNet

and TensorFlow. In particular, the TensorFlow Probability guys have been

really helpful, and after talking a lot with them, PyMC4 will be built on TensorFlow Probability,

which is a package that also a team at Google is writing, where they implement basically all the core functionality that you would need in a Bayesian framework.

They have all the probability distributions.

They have the core components of samplers. Its focus is not on usability. It's on being able to flexibly

build all kinds of different models. But yeah,

so it was pretty clear that what was needed is a high level interface to TensorFlow Probability.

And that is what we're doing with PyMC4.

So now basically after replacing the computation backend from Fortran with Theano, now we can not only have a new backend in TensorFlow, but we also get all the probability distributions and a lot of the inference algorithms

already in our core dependency.

And we just have to focus on making the best API we can and really focus on usability.

And also Google is generous enough to support us financially. So for example, now we'll have a developer summit in Montreal

with some of the Google guys. So it's going

quite well. There's a lot still to be done, and the software

is in pre-alpha, I would call it. But nonetheless,

that approach showed that, like, with very few lines of code, we could actually get something that was already

usable

and quite performant as well. So we're quite excited about what the future holds. And in the process of these rewrites

and upgrades to the different back ends, have you been trying to maintain API compatibility,

or are you more focusing on just being able to move the package forward and take advantage of the different benefits that the back ends provide and then just relying on,

users

to upgrade their implementation

to take advantage of whatever new interfaces you're exposing?

Right. So we try to do that. So PyMC3 didn't really care too much about backwards compatibility with PyMC2.

And

I think the user base expanded by a lot from PyMC2

and

also

just, yeah, it was quite easy for people to switch.

Now we are much more cognizant of backwards compatibility.

Unfortunately, due to quite technical reasons,

maintaining the exact same

interface

was impossible for PyMC4, but it will look quite similar. So there is

different boilerplate code for how you set up your model, but the model definition itself

should look almost identical. So

the hope is that it will be very easy to port. So

you maybe only need to change a couple of lines

that set up your model, but then the model itself can remain largely intact. Having said that, there are also some nice API

fixes that we're introducing here so, actually, I think the API will be better. And I know that in the context of particularly deep learning frameworks, but also some of the other machine learning approaches, there's a need for a fairly large amount of data to be able to train them. So I'm wondering how that

plays out in pymc3

and using Bayesian approaches as to how it compares in terms of the need for volume of data, labeled data, and just how the, overall differences manifest between a,

probabilistic

programming approach versus some of these more unsupervised learning methods where you're just feeding it a lot of data and just tuning the hyperparameters

to ensure that you get the desired outputs.

Right. So it's definitely true that all of that flexibility

and automaticity

that you get from probabilistic programming comes at a cost. And that is that it is

generally

much slower than something like machine learning, like in scikit-learn. If you fit a logistic regression, that will be orders of magnitude faster than if you run a similar Bayesian version of that logistic regression in PyMC3,

just because they have the inference algorithms particularly tuned to that 1 particular model, while we are much more of a

toolbox where you can build very general models and then it just won't be as fast. So the focus

has been mostly on small to medium sized

models

and definitely not big data type of datasets.

And

that is definitely a well recognized problem and there are different solutions that you can take. So 1 of them is rather than sampling, which is most of what I talked about, which gives you a very accurate

approximation of your posterior distribution, you can use things like variational inference.

And these

scale much, much better. So rather than sampling from the posterior distribution, they fit a target distribution to your posterior distribution.

So you introduce several assumptions, like that your posterior will be normal, and then it's an optimization problem rather than a sampling problem. And that will just work much, much faster and approach speeds like you see if you were to build, like, a neural network model. But, of course, that comes at the cost of, like,

a not as good approximation.
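The idea of turning inference into optimization can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not PyMC3's variational inference implementation: the model (7 successes in 10 trials, with a wide normal prior on the logit scale), the function names, the fixed quadrature grid, and the finite-difference gradients are all choices made for illustration. It fits a normal distribution to the posterior by maximizing the evidence lower bound (ELBO).

```python
import math

K, N = 7, 10  # hypothetical data: 7 successes in 10 trials


def log_joint(theta, prior_sd=10.0):
    """Unnormalized log posterior of a logit-scale success parameter."""
    log_p = -math.log1p(math.exp(-theta))     # log sigmoid(theta)
    log_not_p = -math.log1p(math.exp(theta))  # log (1 - sigmoid(theta))
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return K * log_p + (N - K) * log_not_p + log_prior


# Fixed quadrature nodes for a standard normal, so the objective is deterministic.
NODES = [-4.0 + 8.0 * i / 200 for i in range(201)]
_raw = [math.exp(-0.5 * z * z) for z in NODES]
_total = sum(_raw)
WEIGHTS = [w / _total for w in _raw]


def elbo(mean, log_sd):
    """Evidence lower bound for the approximation q = Normal(mean, exp(log_sd))."""
    sd = math.exp(log_sd)
    expected = sum(w * log_joint(mean + sd * z) for z, w in zip(NODES, WEIGHTS))
    entropy = 0.5 * math.log(2.0 * math.pi * math.e) + log_sd
    return expected + entropy


def fit_normal(steps=300, lr=0.1, eps=1e-4):
    """Maximize the ELBO by gradient ascent with finite-difference gradients."""
    mean, log_sd = 0.0, 0.0
    for _ in range(steps):
        g_mean = (elbo(mean + eps, log_sd) - elbo(mean - eps, log_sd)) / (2 * eps)
        g_sd = (elbo(mean, log_sd + eps) - elbo(mean, log_sd - eps)) / (2 * eps)
        mean += lr * g_mean
        log_sd += lr * g_sd
    return mean, math.exp(log_sd)


vi_mean, vi_sd = fit_normal()
print(round(vi_mean, 2), round(vi_sd, 2))
```

The result is only as good as the assumed normal family: if the true posterior is skewed or multimodal, the fitted normal cannot capture that, which is the approximation cost mentioned above.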

Then there's also a lot of work being done in terms of the back end. So in particular TensorFlow

is well suited for very big models.

And that is what I hope we'd be able to do with pymc4.

So it's possible that these smaller models will be

a bit slower, although there's also a good chance that they'll be quite a bit faster. But definitely in PyMC4, I think it'll be much

more feasible to build, like, really large models and train them on really large datasets as well just because TensorFlow was really built for

that and also makes really good use of GPUs or TPUs.

And

we have seen quite impressive speed ups in that regard. So it's a bit too early to say whether there's actually a use case for that. Like a lot of the models that are currently being built

are these small to medium sized models. But the question is like, is that just due to people wanting small to medium sized models or just because the computational framework can't handle more? So that'll be really interesting to see and push those boundaries. And there's definitely

a good reason to assume that these big models

will be very powerful

because

as your data grows more and more in size, you can estimate

more and more aspects of your data and model them accurately. So with few data points, your model needs to be very simple, but as you scale the dataset size, there's so much more you can estimate from there and much more complex models that you could build on top of that. And

with different machine learning frameworks or deep learning approaches, I know that there's occasionally the need to retrain the model from scratch if you start to,

experience model drift in a production context. And I'm wondering if the same is true of pymc3

or if because of its

Bayesian approach, it's able to evolve alongside the data. Right. So in principle, that is possible, and Bayes' formula

is at the core of that. So as I said, that is just the formula to update your beliefs. So you see some data, you update your beliefs, you see some more data, you update those beliefs again. And each time you only need to incorporate the new data that you have seen

in addition to what you already knew beforehand.
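The belief-updating idea is easiest to see in a case where the posterior has a closed form. The sketch below is plain Python with hypothetical data and function names, using a conjugate Beta prior on a success probability; it is not PyMC3 code. It shows that updating on one batch and then reusing that posterior as the prior for the next batch gives the same answer as processing all the data at once.

```python
from fractions import Fraction


def update(prior, batch):
    """Conjugate Beta-Bernoulli update: prior and result are (alpha, beta)."""
    alpha, beta = prior
    successes = sum(batch)
    return alpha + successes, beta + len(batch) - successes


def posterior_mean(params):
    alpha, beta = params
    return Fraction(alpha, alpha + beta)


prior = (1, 1)  # uniform Beta(1, 1) prior on a success probability
first_batch = [1, 0, 1, 1]  # hypothetical observations (1 = success)
second_batch = [0, 1, 1]

# Update on the first batch, then reuse that posterior as the next prior.
sequential = update(update(prior, first_batch), second_batch)
# Update once on all the data.
all_at_once = update(prior, first_batch + second_batch)

print(sequential, all_at_once)  # (6, 3) (6, 3)
print(posterior_mean(sequential))  # 2/3
```

The equality is exact here only because the posterior stays in the same family as the prior. Once the posterior has to be approximated, each round of reuse can compound the approximation error.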

And so that would make you think that it's very well suited for online learning. Unfortunately,

once you go to a sampling approach,

those kinds of things become more difficult because there is approximation error that gets introduced. So every time you do that, you introduce some approximation error and the quality of your inference will decay over time. With variational inference, that is much easier.

And so online learning,

there's more guarantees

that you have in that world. So,

yes, that is definitely

a possibility, but it's not quite as straightforward

as 1 would hope. And in terms of your overall experience

of building and maintaining PyMC3, I'm wondering what you have found to be some of the most interesting or unexpected or challenging aspects.

Yeah. So definitely,

it's not the code. It is

around the people. And that is definitely the most interesting aspect for me is building the community and, specifically, the developer community has really grown. So we started with just,

3 of us, and then people would show up randomly on GitHub and submit a pull request here and then never be heard of again. Or maybe they submitted a second pull request. And for a long time, that's how we went about and we received just random contributions from random people from the internet. And

then we started to invite those people that were becoming more active to become core developers and sort of

keep lowering that bar just because often that really activates people. So we have an internal Slack channel where people then get invited once they join the development team. And by now we have 8

active people and we have monthly meetings. So just building that community has been really rewarding and

makes the whole thing just much more fun. And that way, you really get to know really talented

developers.

And

the other thing that also is just really important is to stay on top of

what users do. So we started a Discourse

forum where people can ask questions, and Junpeng, who's 1 of the core developers, is really active in answering all kinds of modeling questions that people have there. And that really also fosters the community. The challenges

well, 1 of the challenges that I currently think a lot about is

what things to say no to and what features not to include. So in the same vein, right, you want to merge people's pull requests because

that gets them excited to contribute more. But you also don't wanna dilute the quality of your code base by merging things that maybe aren't that required. So and then also you wanna go back and actually delete all the stuff that people don't really use or that is not the best. And that

we started doing a little bit, but I think there's still a lot of cleanup work to be done. And it's also not, of course, not the most prestigious work. So there's less volunteers

to do that. So these are sort of the pros and cons

of being involved, I think, in open source in general. And as you mentioned, the developer community around PyMC3 has grown in recent years, and I know that there are a number of different projects that have been built on top of it. So I'm wondering if there are any that you have found to be particularly interesting or noteworthy or any other ways that you've seen pymc3

used that was particularly unexpected

or interesting?

Yeah. So that has also been really rewarding is seeing other people build on top of that. And my favorite packages are definitely Bambi, which is

an analog to another package

called brms, which is built on top of Stan. And what that allows is to build generalized linear models in very succinct syntax. So if you know R, then it has that really nice syntax where you can say, well,

y is distributed according to, like, some covariates that you can include and then you have interaction terms and just a very

simple

language

to specify linear models. And because in a Bayesian framework you also wanna place priors, you have to extend that syntax a little bit, but then also you wanna have hierarchical

coefficients

in your linear model. So you can include those as well.

So overall

that is a really powerful package for 1 of the most commonly used models that people build. Another package that I really am excited about is called PM-Prophet.

And that is a PyMC3 implementation

of

the Prophet

model coming out of Facebook,

which is a time series prediction

model. And that is different from, like, all kinds of other time series models like ARIMA or recurrent neural networks in that it knows the business calendar.

So it knows that Christmas is a particular day and that

Monday reoccurs every 7 days. So it has that logic built in and

just works extremely well on these business type of time series, which is something that others haven't done before.

And then 2 other ones,

that I think are really interesting, which are much more domain focused. 1 is called exoplanet,

which is for discovering

exoplanets,

outside of our solar system.

And

I don't know anything about that, but they were basically

able to just add a couple of probability distributions

that people use in that field

and build a package on top of that, that people then use to detect planets. I mean, how cool is that?

And then another package in a similar vein is BEAT, which is Bayesian Earthquake Analysis Tool, I think. So you can

use PyMC3 to

study earthquakes. So, yeah, I just love that kind of stuff. And in your own use of PyMC3, what have you found to be some of the most useful features, and what are some areas that you think could use some additional improvement, particularly as you're preparing for the next release? So for me, the biggest

thing is,

that inference button: that you build your model

in Python and then usually just don't need to worry. You just call pm.sample.

And

also,

now we changed it so that you don't have to supply anything else. So it just detects, like, what type of model it is, then it will decide which sampler to use

and automatically tune that sampler and everything. So you don't have to do anything. You just call pm.sample. And in, I would say, 90%

of the cases,

it will just work. Now

for those cases where it doesn't work,

you can be lucky and it's not that difficult

to fix.

But if your model has some characteristics,

which lead to, like, a really

difficult

posterior that is very difficult to sample from,

that is a really difficult issue then to solve.

And that is where we enter the

dark magic of model reparameterizations.

So for example,

in a hierarchical model there is, and I've written a blog post about this, this

very dense area in the posterior

that the sampler has a really hard time getting in and out of.

And there's a very simple trick that you can then do to reparameterize the model. So you

rewrite the model in a way that still gives rise to the same model as before,

but now the geometry of the posterior is different and makes it way easier to sample from. And that is not automated. Right? You just have to

know to do that, and that that is a thing that you can do in the first place.

But also

you can't always do that. Like, it depends on the data. So these type of issues are just really, really difficult and require

expert knowledge of, like, what can be done and what should be done.
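The trick described here is commonly called a non-centered parameterization; assuming that is the one meant, the plain-Python sketch below shows the core identity it relies on. It is not PyMC3 code, and the function names and numbers are hypothetical. Drawing `x` directly from a normal with mean `mu` and spread `sigma` describes the same distribution as drawing a standard normal `z` and computing `mu + sigma * z`. In a real hierarchical model the sampler would then explore `z` instead of `x`, which is what changes the posterior geometry.

```python
import random
import statistics


def centered(mu, sigma, n, rng):
    """Draw x directly from Normal(mu, sigma)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


def non_centered(mu, sigma, n, rng):
    """Draw z from Normal(0, 1), then shift and scale: x = mu + sigma * z."""
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]


mu, sigma = 2.0, 0.5  # hypothetical group-level mean and spread
a = centered(mu, sigma, 50_000, random.Random(1))
b = non_centered(mu, sigma, 50_000, random.Random(2))

# Both parameterizations describe the same distribution for x.
print(round(statistics.mean(a), 2), round(statistics.mean(b), 2))
print(round(statistics.stdev(a), 2), round(statistics.stdev(b), 2))
```

This only demonstrates that the two forms agree when simulating forward. Whether the rewrite actually helps a sampler depends on the data, as noted above.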

And that I think is an area

ripe for improvement. And Brandon Willard, who's a PyMC3

developer

as well, is working

on

a basic language where you can easily transform

these

models from 1 specification to the other. So you could imagine that,

you write your hierarchical model in the way that is problematic and automatically it'll get transformed

to the version that you can easily sample from. And

so then, like, the user wouldn't even have to know about that or care about that. We just do it automatically behind the scenes for you and give you the best representation of that model. So that would be

incredible and and Brandon is making great progress on that. So I'm really excited about what that will hold. Now that will probably not be in the upcoming release, maybe not in the 1 after that, but maybe the 1 after that. And are there any other aspects

of pymc3

or probabilistic

modeling that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. So 1 fun thing that is now happening, I think, in the 3rd or 4th year is Google Summer of Code. So we've always participated in that. And people have made great contributions in the past, like the Gaussian process sub module,

which is very flexible to build nonlinear models,

nonlinear regression type models, is 1 such addition, or like all our variational inference support comes from Google Summer of Code projects. So that is just extremely

rewarding, and it's great for the students as well. So

if someone is interested and in school and wants to participate, I'll definitely invite that. That's a great way to get involved. Other things that we've gotten there is approximate Bayesian computation,

which is for models where you can't specify a likelihood function, but you can generate

data from your process.

Then you can still run Bayesian inference.
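A minimal sketch of that idea, in plain Python with hypothetical names and data, is rejection sampling: draw a parameter from the prior, simulate data from the process, and keep the parameter only if the simulated data is close to what was observed. This is the simplest form of approximate Bayesian computation and is not a description of how the PyMC3 implementation works. The toy process below (coin flips) does have a known likelihood, which is used only so the result can be checked: with a uniform prior and 7 successes in 10 trials, the exact posterior mean is 2/3.

```python
import random


def simulate(p, n, rng):
    """Generative process: number of successes in n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))


def abc_rejection(observed, n, draws, rng, tolerance=0):
    """Keep prior draws whose simulated data lands close to the observation."""
    accepted = []
    for _ in range(draws):
        p = rng.random()  # draw from a Uniform(0, 1) prior
        if abs(simulate(p, n, rng) - observed) <= tolerance:
            accepted.append(p)
    return accepted


rng = random.Random(42)
posterior = abc_rejection(observed=7, n=10, draws=20_000, rng=rng)
abc_mean = sum(posterior) / len(posterior)
print(len(posterior), round(abc_mean, 2))
```

Note that the likelihood is never evaluated inside `abc_rejection`; only the simulator is called. A larger `tolerance` accepts more draws but makes the approximation coarser.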

So,

yeah, these types of things

are getting worked on and there's just much more to do. So, yeah, definitely invite everyone to get involved. For me, it's just been a lot of fun to build that package and meet a lot of really interesting people. Right. So for anybody who wants to follow along with the work that you're doing or get involved with pymc3

or just get in touch, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the fantastic beasts movies.

So Fantastic Beasts and Where to Find Them and The Crimes of Grindelwald. So they're set in the same universe as Harry Potter, but they take place before all of those books. And it's just a lot of fun, very well written,

enjoyable. So definitely worth a watch if you're at all into any of the Harry Potter stories or if you just like a good fantasy movie. And so with that, I'll pass it to you, Thomas. Do you have any picks this week? Yeah. So,

science fiction, Hyperion by Dan Simmons.

Probably 1 of my favorite books. And then if you're into meditation at all, check out,

The Mind Illuminated, which is, like, a fantastic

technical introduction

to learning meditation. Great. I'll have to check those out. Well, thank you very much for taking the time today to join me and talk about PyMC3 and probabilistic

modeling and Bayesian statistics.

It's been very interesting, and it is an area that I've found fascinating for a while. So thank you for that, and I hope you enjoy the rest of your day. Well, it's my pleasure. Thanks so much for having me.

The Python Podcast.init

Summary

Announcements

Interview

Keep In Touch

Picks

Links
