Moving to MongoDB with Michael Kennedy

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or experimenting with something that you hear about on the show.

You can visit the site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes or Google Play Music. Tell your friends and coworkers, and share it on social media.

Your host, as usual, is Tobias Macey. And today, I'm interviewing Mike Kennedy about his work scaling his apps and his business. So, Mike, could you introduce yourself?

Yeah. Hi, Tobias, and hello, everyone. Thank you for having me on your show. I'm Michael Kennedy, and I've been running a couple of Python podcasts, doing some Python courses, and just generally trying to be out there in the community and doing things to help everyone. So I'm doing the Talk Python To Me podcast, the Python Bytes podcast, and, like I said, classes, to sort of follow on some of the interest there.

And in some of the more recent episodes of your podcasts, you've mentioned the work that you've been doing to migrate your applications to run on Mongo. So I'm wondering if you can just start by describing a bit about the business uses for those applications and what the initial design looked like.

Sure. So I'm a huge believer in "don't let the perfect be the enemy of the good." And so I'm always about, like, how can we get something started quickly? How can we get something that is super simple in place? And then if we need to, if the thing is successful or takes off or whatever, we can design it a different way. But let's design it first optimizing for getting something out there and seeing what the world thinks. So with that in mind, I created my original websites for the podcast, as well as for the training site, just running on SQLite.

So, I don't know how many people view SQLite, but it's built into Python. It's super easy to use. There's one fewer server to configure because it's an in-process database, and things like that. That was working pretty well, actually. I was surprised: the latency on SQLite is really, really good because there's not even, like, a network hop. It's in process. Right? So in that sense, it's really, really good. However, as it got more and more data, it started to be a problem. Like, one of my SQLite databases was over a gigabyte, and I'm like, alright. This is kinda getting ridiculous.
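As a rough illustration of what "built into Python" and "in process" mean here, the following minimal sketch uses the standard library `sqlite3` module. The table, column names, and episode titles are made up for the example and are not taken from the actual sites.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; a real site would
# pass a file path such as "podcast.db" (a hypothetical name) instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE episodes (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO episodes (id, title) VALUES (?, ?)",
    [(0, "Introducing the show"), (1, "Moving to MongoDB")],
)
conn.commit()

# The query runs inside this Python process: no server, no network hop.
titles = [row[0] for row in conn.execute("SELECT title FROM episodes ORDER BY id")]
print(titles)
```

There is nothing to install or administer, which is the appeal described above. The trade-off is that everything shares one local database file and one process.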

It's starting to cause problems, and some other reporting issues were a problem. So, basically, I decided, I've got these custom Python apps I've written. They're using SQLAlchemy to talk to SQLite. I've gotta do something different. Right? And, basically, I need to set up a proper database of some sort. So I looked around and I said, well, MySQL would be a totally decent option. I could use a hosted database somewhere, but I'm really a big fan of the document databases. So I said, alright. Well, let me do a prototype in MongoDB and see how it goes. So, I decided, alright. I'm gonna rewrite this in MongoDB. And, you know, I thought, well, maybe it'll be a couple of days' work or something like that. I think in the end, it probably took 3 or 4 days because the early versions of what I'd created with SQLAlchemy were pretty simple, but those apps had been evolving and growing in complexity for all sorts of reasons for, like, a year, year and a half after that. And so the final things were actually fairly complicated, and it took, I don't know, 3 or 4 days to get it going. I'd say maybe a third of that was to just get it working, sort of. It would seem like it was working. And then a third was performance tuning, and then a third was, wait. Why doesn't that work anymore? What happened to this? You know, the little bugs, like, the differences between a relational database and a document database, or just SQLAlchemy and what I was using to talk to MongoDB. Those differences, you know, they started showing up in the little edge cases.

And you mentioned briefly the sort of calculus that you did to determine whether or not to go to another relational store such as MySQL or PostgreSQL versus going to Mongo. But I'm particularly curious why you didn't sort of do that as a first step, at least, particularly given the fact that you were already using SQLAlchemy. And in theory, it should have just been a matter of changing out a config line, and then you could have run it against Postgres or MySQL with basically the app code unchanged.

Yeah. I totally could have. Right? That would have worked pretty well. But some of my other workflows, my sort of administration of the server... I would have to set up another server and admin it. It's running on DigitalOcean, so it's not like I could plug in RDS or something like that. You know, some of the hosted things where it just works. So you have to go set up a server and sort of set up some backup processes for that. I did consider that as an intermediate step. But in the end, I thought, well, you know, I think it's probably about the same amount of time, a little bit more, but not a huge amount, like a day or 2 more of time to just switch it over to MongoDB.

And I'm a big fan of MongoDB for keeping things simple. You know, like, these NoSQL databases, people talk about them for their performance and scalability and things like that. Right? And while that can be true, I think what's really pretty excellent about MongoDB and some of these types of databases, especially document databases, is they are very simple to evolve. Right? You add a field or even a subdocument to some part of your data record, and now it's just there. Right? There's not migrations. There's not, oh, the staging version of the site has a different schema than the other one. You didn't apply the migration, or anything like that. Right? You just evolve your app and, you know, 9 times out of 10, the database will just adapt with you because of the schemaless nature.

It's not that it's entirely schemaless. It's just that the schema is implied instead of enforced. And so you do still run into this situation where you have a new model that you're using for the data that you're writing out, but then when you go to read old data, it doesn't necessarily have the fields that you're expecting. So you do have to have some error handling for that case or some potential migration path to backfill that data.

That's an interesting point. So I do agree that that's true. I think the way you wanna look at it is, you know, most of the time, changes to databases are additive.

Right? I wanna add a new field. I wanna add a new subdocument. I wanna add a new relationship. Every now and then, you take stuff away, and then that's pretty seriously bad for how even the app runs, depending on what way you're talking to the document database. You do have to basically assume that the things could be None or missing or something else. And, you know, sometimes it's fine if it's a string or something. But if it's like a number, you're trying to treat it like a number, and now it's not even a number, that can be a problem.

I totally agree. So you do have to be a little bit careful.
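The "could be None or missing" point can be sketched with plain dictionaries, which is roughly what a driver hands back for a document. The field names (`duration_minutes`, `tags`) and the `summarize` helper are hypothetical, chosen only to show an older record that predates fields added later.

```python
# Hypothetical documents as they might come back from a document database.
# The older record predates the "duration_minutes" and "tags" fields.
old_doc = {"title": "Episode 1"}
new_doc = {"title": "Episode 90", "duration_minutes": 42, "tags": ["mongodb"]}


def summarize(doc):
    """Build a display string without assuming newer fields exist."""
    # .get() returns None (or the given default) when a field is missing.
    duration = doc.get("duration_minutes")
    tags = doc.get("tags", [])
    # Guard before treating the value as a number.
    if isinstance(duration, (int, float)):
        length = f"{duration} min"
    else:
        length = "length unknown"
    return f"{doc['title']} ({length}, {len(tags)} tags)"


old_summary = summarize(old_doc)
new_summary = summarize(new_doc)
print(old_summary)
print(new_summary)
```

The alternative to guarding every read is a one-time backfill that writes default values into the old documents, which is the migration path mentioned earlier in the conversation.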

Yeah. So a lot of times, those fields or subdocuments would basically be None in the case where it's the old data anyway. Sometimes there's a default value that's not just nothing, but more often than not, it is. So you have to have that sort of check in your code anyway. But, yeah, it is something you gotta consider. But, you know, with SQLAlchemy, it just goes, no. It's the wrong shape. Done. You know what I mean?

Yep.

And it's just full-on crash, website unavailable. Right? That sort of experience. So, you don't have to be as careful. Like, you still do have to be careful. In other places, you have to be more careful. Right? The database doesn't have your back. Right. If you put wrong stuff into it, that's you. It's on you. Right? So you have to be more careful in your code. I think you need, like, a more structured, more layered application. Right? So you've got a data access layer, and you only talk to the data access layer. And it's very careful about how it works with the database because that's the only consistency you have. And like you said, there's this implied or implicit schema, and that's the schema that is in the documents, but the database doesn't have it.

So, yeah, MongoDB and document stores in general are definitely great for the getting started aspect, particularly if you have, for instance, an open source project that you wanna use and it requires a backing store. Mongo is great because you don't have to go through the steps of bringing the schema up to the point where the app is expecting it to be, like you would with a relational store. You can just say, I'm gonna start writing documents, and it will just take whatever you've got. But it can also be deceptive, because as you're evolving the format of the data, you might end up in a situation where you have deeply nested information, and then you want to be able to access it from somewhere else. And then there's the fact that it doesn't have those relationships available, although Mongo, to be fair, does have the capability of joining between documents.

You can get into a situation where, if you didn't think ahead about how you're going to access and format the data, it can be more difficult to actually read it or manipulate it or use it for the cases that you want. So what are some of the things that people should be thinking of when they are deciding between a relational or a document store, as far as how they want to architect their application and the way that they're interfacing with that data?

Yeah. That is one of the trickiest elements of working with these document databases.

So when you work with a relational database, it's like, okay. How are we gonna design our data? What are the top-level pieces? Alright. Third normal form. Let's apply this normalization concept, and it's very clear when you have third normal form with all the relationships and things like that set up. And that's pretty well known. Well, not everybody, but almost everybody can look at a database and sort of break it up like this, or at least give it a pretty good shot. Right? There's a standard way of modeling in a relational database. In document databases, it's kind of the wild west. It's a free-for-all. Right? And the reason is you can nest these documents highly, like you say. So, like, suppose you have, like, a customer, and they have orders, and the orders have order items, and these are all types. Right? You could, like, jam that all into one document, or you could have a customer with orders and with order items as three separate tables, which is probably not the right answer. That sounds very much like just a relational database on top of a document database.

Right? You might as well go back to a relational database if you can build it that way. Or maybe you'll have, like, the customer and the order, and the order itself contains order items.

That probably is an optimal solution, but you get into situations where, like, I want to know how many times this particular item was ordered. Right? So you've gotta reach down inside. There's not, like, a table where you go look for that. You've gotta go inside the order object and look at its order items. And so that turns out to be a challenge.

Basically, the way you wanna think through how do you break these pieces apart, or how do you keep them together, is to look at the way your application works.

So you wanna look and say, what are the common query patterns that my application makes against this database?

Like, when I get an order, do I almost always want to have the order items with that? Like, would I go back and do a join almost all the time, or a lazy evaluation of that relationship in, say, SQLAlchemy or something like that? Do you usually do that? And if you do, it probably makes sense to embed those order items in the orders. Because doing the query by the primary key for the order automatically gives you, like, that pre-joined data. Like, think of these embedded elements of these documents as pre-joined data. You've already computed it. Right? So it's also the way that you get extremely strong relationships.

Right? In relational databases, you have foreign key constraints. Say, okay. I'm gonna insert this into the order items, and it has an order ID, and that's a required link back to the primary key of the order. We don't have that in MongoDB or document databases. But just the same, you cannot insert an order item into the database unless it is inside an order. Like, that is as strong a relationship as you have in relational databases.
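To make the customer, order, and order-item example concrete, here is a small sketch using plain Python data in place of real collections. All names and values (`customers`, `orders`, `sku`, `qty`) are invented for illustration. It shows the two sides discussed above: embedded items come back "pre-joined" with the order, while a cross-order question about one item means reaching inside every order.

```python
from collections import Counter

# Hypothetical collections represented as plain Python data.
# Customers are linked to orders by id (a soft, application-level link);
# order items are embedded inside each order (pre-joined data).
customers = {1: {"name": "Ada"}}
orders = [
    {"_id": 100, "customer_id": 1,
     "items": [{"sku": "py-course", "qty": 1}, {"sku": "mongo-course", "qty": 1}]},
    {"_id": 101, "customer_id": 1,
     "items": [{"sku": "py-course", "qty": 2}]},
]

# Fetching one order by its key returns its items with no extra query or join.
order = next(o for o in orders if o["_id"] == 100)
skus_in_order = [item["sku"] for item in order["items"]]

# The customer is a separate document, so following that link is a second lookup.
customer = customers[order["customer_id"]]

# Counting how often an item was ordered means reaching inside every order.
times_ordered = Counter()
for o in orders:
    for item in o["items"]:
        times_ordered[item["sku"]] += item["qty"]

print(customer["name"], skus_in_order)
print(times_ordered["py-course"])
```

In a real deployment the last loop would more likely be pushed into the database as a query or aggregation rather than run in application code. The loop here only shows where the data lives.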

So this nesting is the way that you get those guaranteed relationships, as opposed to, like, the soft, in-the-application, weak relationship from the customer object over to the order, which is not embedded. Right? So basically, the challenge is, when do you embed these objects, like order items into order, and when do you keep them separate, like customer and order? So I would say the rule of thumb that you work with is, what are your application's query patterns? Do you most of the time want the order items when you want the orders? Then embed it. If you most of the time don't, it's only gonna slow things down. It's more network traffic, and it makes the queries that you're talking about harder.

Yeah. And I would say that particularly in the case with document stores, given the fact that the document is very similar in structure and content to a Python dictionary, you can largely think of it in terms of, you know, how is my application architected, and then basically just map the database exactly to that, because whatever you read and write in your application is going to be what you're going to need in your database. So you can almost think of it as a persistent storage for the transient data types that you're using at runtime. It's almost like pickling, but it's not evil.

With all the versioning and other weirdness that pickling has, the security issues and whatnot. Yeah. So I totally agree. And that's one of the simplicities. Right? You often hear about the object-relational impedance mismatch, where, like, in memory, you have this hierarchy of things. But in the database, you shred it into a bunch of little pieces with the related tables and put it in there, and then you get it back out. You reassemble it into an object graph. You save it. You shred it back into the pieces. Right? In the document databases, it's like you can take the thing from memory and turn it into a JSON variation in the same shape, same relationships, and everything. So I think it makes programming a little more intuitive about the data layer. Right? Like, you're not going through these back and forth gyrations all the time. What you have in memory is very often what you have in the database, for the reasons you're talking about.

And bringing it back to the business need that you were trying to address in the first place, I'm curious what was happening in your day to day that was precipitating the need to be able to actually grow your data layer and evolve the application in such a way?

Well, yeah, to be fair, the thing had fully outgrown the way I was using SQLite. So, definitely, that was like, I need to go somewhere else. I totally believe that it would have been fine to go to MySQL or Postgres.

But if I was gonna move away from this simple version that I had, and I needed to because, like, just one of the databases was gigs of data, and even trying to do, like, basic reporting on it would, like, lock it up and cause, like, operational issues. So I knew I had to move it to a proper database on a dedicated database server, things like that. So then the question came down to, well, if I'm gonna put in a couple of days' effort to rework my infrastructure for my various websites, what do I want the final outcome to be? And I could put in a couple days and get to MySQL, or I could put in a couple days to get to MongoDB. And for me, operating and running MongoDB is just a nicer experience than running MySQL or other relational databases. I'm not picking on that one in particular.

And so as you were going through the process of reworking how your application talked to your data layer, how did you manage to perform the cutover in a way that it didn't cause a lot of downtime or introduce a lot of regressions into the application, while still making sure that the data that you were reading and writing out of the data store was bit for bit compatible?

Sure. So there's a couple of angles to this. So I have a fair number of unit tests that run against the websites.

So I ran those tests, of course, and got them all passing before I ever deployed anything. That was, you know, step 2, I guess. Step 1 was rewrite it. Step 2 is make sure that does pass, just run the stuff locally. Then I went through and created a proper production MongoDB database, which we can talk about later. There's actually a lot to that. There's a lot of not good choices, let's say, about the defaults, the way this thing comes out of the box, and it's a big problem, actually. So there was some work to get the MongoDB server set up and make sure all the pieces could work, and then I went through a process of exporting all the data. I had written an application that would go through the SQLAlchemy classes and relationships and all that kind of stuff. And I would run that against the current database and then transform that into what was going into MongoDB.
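The export step described here, reading relational rows and reshaping them into nested documents, might look roughly like the sketch below. This is not the actual migration tool. It reads with the standard library `sqlite3` module rather than SQLAlchemy classes so that it stays dependency-free, the `courses` and `chapters` tables are hypothetical, and the final write into MongoDB is left out.

```python
import sqlite3

# Hypothetical relational source: courses and their chapters in two tables.
src = sqlite3.connect(":memory:")
src.row_factory = sqlite3.Row
src.executescript("""
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE chapters (id INTEGER PRIMARY KEY, course_id INTEGER, title TEXT);
    INSERT INTO courses VALUES (1, 'Python Jumpstart');
    INSERT INTO chapters VALUES (10, 1, 'Welcome'), (11, 1, 'Hello world');
""")


def course_documents(conn):
    """Reshape flat rows into one document per course with embedded chapters."""
    docs = []
    for course in conn.execute("SELECT id, title FROM courses ORDER BY id"):
        chapters = conn.execute(
            "SELECT title FROM chapters WHERE course_id = ? ORDER BY id",
            (course["id"],),
        )
        docs.append({
            "_id": course["id"],
            "title": course["title"],
            # Rows from the child table become an embedded list.
            "chapters": [{"title": ch["title"]} for ch in chapters],
        })
    return docs


docs = course_documents(src)
print(docs)
```

The interesting decision is in the shape of the returned dictionaries, not in the copying: which child rows get embedded and which stay as separate documents linked by id.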

So it's not as simple as, like, export a bunch of CSVs or something and then import them into the other. Because if you're just taking the same shape and you put it in MongoDB, you're probably doing it wrong. Right? You probably should just stay in a relational database if you're just taking relational things and putting them over there. So one of the big concepts you have to go through is, like, how do I remap this data? Like, what is a nested object as a subdocument, and what is a proper external thing with a relationship and things like that? So I'd written this app to, like, do that, and I ran it and copied all of the production data over. And then I just, you know, redeployed the app, restarted uWSGI, and, you know, it started writing over there. There was possibly a little bit of data that got stuck in the old database. Right? It's probably around analytics. A lot of the other data is somewhat static, like the podcast data changes, but not very often. The courses change, but only when I push new versions.

The users come in and buy stuff or register or take action, some analytics stuff on the training stuff, but, you know, not terribly often. Right? Like, not every few seconds or anything like that, except for maybe the analytics. So then I moved the data over, then I ran one more thing that said, okay. Get the changes, you know, after this time. Right? And do one more sync from the old data that was no longer being updated because it had already been pointed over to the new database. So it was kind of a two-step process. Move the data, switch the app, and then do one more check to catch any bits that got missed.

And that's when you cross your fingers and hope the site stays up because, you know, those little edge cases.

Right. I was like, wait a minute. I didn't know they were calling it that way. That didn't work over here. There were no, like, full-on crashes, but weird stuff happened. Like, for example, I had episode 0 on the podcast, and episode 0 went away. Why did episode 0 go away? All the others are here. And I had rewritten that little query and was making sure I got an episode back. I was like, "if episode ID"... oh, wait. Episode 0 isn't true anymore. Right? You know, it wasn't necessarily a Mongo thing. It was just like, you rewrite this code, it might not be exactly the same. Right. Yeah. But yeah. It wasn't too major.
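The episode 0 story is a classic Python truthiness bug. The original code is not shown in the conversation, so the following is only one plausible reconstruction, with made-up function names and data: a guard written as `if episode_id:` silently rejects an id of `0` because `0` is falsy.

```python
episodes = {0: "Episode 0: Introducing the show", 1: "Episode 1"}


def find_episode_buggy(episode_id):
    # Bug: 0 is falsy, so a perfectly valid id of 0 is treated as "no id".
    if episode_id:
        return episodes.get(episode_id)
    return None


def find_episode_fixed(episode_id):
    # Compare against None explicitly so that 0 is accepted.
    if episode_id is not None:
        return episodes.get(episode_id)
    return None


buggy_result = find_episode_buggy(0)
fixed_result = find_episode_fixed(0)
print(buggy_result)
print(fixed_result)
```

The same trap applies to empty strings and empty lists, so an explicit `is not None` check is the safer habit whenever a falsy value is legitimate data.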

A couple of customers who had some classes sent me, like, this is not working exactly right anymore. I had to fix it, but it was all good.

Right. Yeah. Application development is easy until you have to introduce state into the equation, and then everything becomes hard.

That is for sure. Yeah. But, you know, it went okay, I think. I mean, it would have been better from a stability perspective to move to MySQL because, like you said, I could have just changed the connection string in SQLAlchemy.

But since my final goal was to move to a different kind of database, like, going through that intermediate stage would have just been more work if I felt like I could just get it done.

Yeah. And it sounds like a big part of the reason for wanting to go to the document store is because of the fact that you wanted to have that ease of evolution of the application as your business needs change, particularly as the business continues to grow and you add new offerings?

Yeah. Absolutely. So, basically, my deployment process is over SSH. I run a script. It grabs the latest version of a certain branch, restarts the worker processes, and that's it. Right? It does kick off some notifications to, like, Rollbar and things like that, but that doesn't really matter. Right? That's just a notification to me, basically. But yeah. So I don't have to go through a lot of gyrations to, like, evolve the tables. You know, like, that's the thing that you always ran into with SQLAlchemy, or, it's not their fault, any relational database. Right? But SQLAlchemy says, like, if the table doesn't match the shape of your class, you have a big problem. It's not gonna run right, basically. And so anytime the database and the data definition in your app get out of sync, you've got to really carefully manage that switchover so the database and the app are changed, like, at the same time. And now I don't have to manage that database side because it just sort of evolves to be what's defined in the app itself.

Right. Because even if the migrations work properly, but then you find a bug in the app that necessitates a rollback while you figure out how to address the issue, then all of a sudden you have to worry about making sure that your reverse migrations are working the way you want. And if it adds a column that's required and then it goes to remove it, it might cause other bugs. So yeah.

Yeah. And it's just this cascading chain, and all of that stuff happens at the worst possible time. It's not like you've had your coffee and you're kind of ready and rejuvenated Monday morning. No. This is like when you're deploying and then you pull up the website and it's like 500 server unavailable, you're like, oh, no. You know, like, you're totally at, like, max stress when you're trying to, like, solve this cascading chain of issues, which is just, you know, you just don't have to even worry about that kind of stuff for the most part. Like you said, you do have to go, like, maybe this data doesn't exist, but it doesn't completely die.

And so as you mentioned, when you first started developing these sites, it was primarily for doing the podcast, and then you have added new capabilities and new applications to support your trainings.

So talking a bit more about the trainings and the growth that you've been experiencing there, I'm curious, in particular, what are some of the aspects of teaching people Python, as opposed to other languages or other subject matter, that are unique to the particular format that you're using? Or, more specifically, how does the Python language kind of bleed into the approach that you take for that learning experience?

Well, I think there's a couple angles. First of all, like, there's so many people coming into Python right now. You know, you look at all the stats. Like, Python is increasingly heading towards the number one programming language.

There's various measures. Right? Sometimes it's 5th. Sometimes it's 1st. Sometimes it's 2nd. There's an interesting article that basically said Python is the most moved-to language, if you study GitHub and things like that. So that means there's a lot of people coming to the language, and I think there's a really careful balance you have to strike between, like, I'm teaching people who don't have experience with this stuff, and it seems really easy. Right? You look at the Python language. You're like, oh, I could learn that this afternoon. This is totally good. I don't need anything. Right? But at the same time, if you expand your area of focus a little bit to include the ecosystem, the packages, SQLAlchemy, for example, or MongoEngine for talking to Mongo in a similar way. Like, you know, every one of those that you bring in is like a full-on experience and, like, a pretty complicated thing. So I think, you know, you gotta kinda balance between not going too deep into what would be, like, simple stuff and boring people, and keeping it kind of real world. I think the real-world uses of Python are very, very interesting and very rich, mostly because of the packages.

Yeah. And also people who are using Python, as opposed to something like C++ or Java that requires a lot more overhead and investment to be able to actually become effective, might not necessarily be coming to Python purely for the purpose of starting a career in software engineering. It might just be for a simple use case of, I need to be able to automate this aspect of my job. So I need to be able to learn enough about Python to understand how to talk to an API service, for instance, or how to automate a report.

Yeah. That's a super good point. Like, I do think most people that learn Java and C++ are doing that to be developers. Not all of them, but many of them. Whereas there are a lot of people coming to Python to add, like, a superpower. I think one of the most important things that we can do with this whole "everybody should learn coding" sort of initiative around schools and stuff is everybody should have a little bit of programming skill to amplify what they actually do. Are you a biologist? Well, now your biology studies can be amazing because you can use pandas on it or whatever. You know? Are you in finance? Well, you know, put Excel down and actually automate what you're doing or something like that. Right? And I think a lot of those people are coming to Python.

Yeah. And also just as a way of being able to foster the computational thinking that goes along with the ability to tell a computer how to do what it needs to do. It's not so much necessarily, you know, using the programming in their day to day as just being able to fathom what the computer is doing when you tell it to do stuff.

Mhmm. Yeah. Could we teach less geometry and axiomatic proofs and more programmatic thinking? I would hope so. Yeah. That would be a good day to see.

Yeah. I was actually just listening to a show recently where they were talking about a study that was done that was coming to the conclusion that the sort of archetypal homework assignment is being viewed as less relevant and less important in this computational age because, you know, as has been the case for a while now, there are computers and techniques that make it easy to perform the necessary computation. And it's not so much a matter of understanding how to do every single step, because chances are you won't do that, as being able to understand why those steps need to happen and what the actual purpose of that calculation happens to be, more than just, you know, okay, sit down and do a bunch of long division because we're telling you to do a bunch of long division.

Yeah. I remember my kids' teacher in, like, 2nd grade telling them something like, it's not like you're always gonna have a calculator with you. You're gonna need to learn how to do things like long division and stuff. It's like, well, I've worn this calculator.

So maybe they will always have it with them. They might forget their shoes, but they'll have their phone.

So going back to the data layer, you started off using SQLAlchemy.

And with Mongo, there are a few different options for being able to talk to it, whether just directly using the Mongo

JSON format, or there are also some higher level APIs that provide a bit more structure in terms of that data access layer. So I'm wondering which side of the fence you landed on where you're just writing and reading dictionaries to PyMongo or if you're using something,

1 1 of the Mongo ORMs or ODMs rather. There's 2 ways to look at this.

1, I had already written things in SQLAlchemy.

So the most direct translation would be to do something in SQLAlchemy

ish

on the MongoDB side. That would be just 1 of the ODMs. So I decided to use Mongo engine, which is, as far as I can tell, a couple of order a couple of times more popular than the second most popular 1, and it has a lot of features. It has interesting features that actually MongoDB itself doesn't even have, so that that's pretty cool. But even if I hadn't started that way, I I think

you have to be extremely careful when you're working with these schema less document databases about enforcing a schema in your application layer. And 1 super easy way to do that is to use

something that maps classes that are fixed types in Python into the database, because then it's always gonna look like that class unless you change it, and there's, like, you know, the evolution stuff we talked about. But I think there's a huge amount of value of using classes

in an ODM

because it adds that structure that otherwise would be entirely missing. If you're just passing dictionaries, which is what you do with Pymongo the direct way, there's no guarantee those dictionaries are gonna have the right structure every single time. You know what I mean? Yeah. So ended up using Mongo engine, and I'm really a huge fan of Mongo Engine. Like, it has certain features that you maybe maybe don't exist in MongoDB. Maybe they don't

maybe they exist, but you don't use them because they're more complicated. So, for example,

in MongoDB, you don't have a fixed schema. You kinda do with these classes. In MongoDB,

you definitely don't have default values or required values or type checking. Right? You can't say if you have an a price. It has to be a float.

But if you have a created date, it has to be a datetime. Right? It's just, like, you know, more or less typeless JSON, BSON, dictionary type stuff that goes in there. So MongoEngine lets you have default values. It has type checking. You say this column is a string, and you try to put an integer in there, it won't take it, for example, things like that. It also lets you model the indexes and stuff in your classes. So for the most part, I'm really, really happy with MongoEngine and definitely recommend it. But there are a couple of drawbacks I ran into as well.
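To make the idea of enforcing a schema in the application layer concrete, here is a minimal sketch using only the Python standard library. It is not MongoEngine's API (MongoEngine provides its own field classes such as StringField, FloatField, and DateTimeField for this); the Course class, its fields, and the to_document helper are hypothetical names chosen for illustration.

```python
import datetime
from dataclasses import asdict, dataclass, field, fields


@dataclass
class Course:
    """Hypothetical document class; the fields are illustrative only."""

    title: str  # required: there is no default value
    price: float = 0.0  # default value
    created_date: datetime.datetime = field(
        default_factory=datetime.datetime.now  # default computed per instance
    )

    def __post_init__(self):
        # Type checking: reject any value that does not match its annotation.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )

    def to_document(self) -> dict:
        # The plain dictionary that would be handed to the database driver.
        return asdict(self)


course = Course(title="Example course", price=25.0)
document = course.to_document()
```

Because every dictionary that reaches the database goes through the class, each stored document has the same keys and types, which is the guarantee that raw dictionaries do not give you.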

So what are some of the gotchas that somebody should be keeping an eye out for when they are considering using Mongo or another document store as opposed to some of the more battle-tested relational databases?

I think there's a couple of areas. You know, just to round out the MongoEngine sharp edges, I would say the primary sharp edge there is the speed of serialization.

So when I was doing performance tuning and stuff, and I loaded up the app and got it running in Mongo, I'm like, this is slower than I really had hoped. So I added a bunch of indexes, and that's really easy to do, and that made it much better. However, there's some point where if you load up 1,000 or 10,000 documents, the actual speed of taking that from the binary format that comes from MongoDB and turning that into objects in Python, not because of Python necessarily, but because of all the type checking and assignment and the little steps that MongoEngine goes through, it's a lot slower than you'd want it to be. Right? Like, reading 100 records is fine.
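One common way to avoid turning thousands of documents into objects at once is to read them a page at a time. The sketch below shows the shape of that loop with only the standard library. The fetch_page callable is a hypothetical stand-in for a real query (with PyMongo that would typically be a find() cursor with skip() and limit() applied), and the in-memory list is fake data used only so the example runs.

```python
from typing import Callable, Iterator, List

Document = dict


def iter_pages(
    fetch_page: Callable[[int, int], List[Document]],
    page_size: int = 100,
) -> Iterator[List[Document]]:
    """Yield documents one page at a time instead of loading them all."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        if not page:
            return  # nothing left to read
        yield page
        if len(page) < page_size:
            return  # a short page means this was the last one
        skip += page_size


# In-memory stand-in for a collection (an assumption, not real data).
FAKE_COLLECTION = [{"_id": i, "title": f"episode {i}"} for i in range(250)]


def fake_fetch(skip: int, limit: int) -> List[Document]:
    # Plays the role of a skip/limit query against the database.
    return FAKE_COLLECTION[skip:skip + limit]


page_sizes = [len(page) for page in iter_pages(fake_fetch, page_size=100)]
```

With 250 fake documents and a page size of 100, the loop sees pages of 100, 100, and 50, so only one page of objects has to exist in memory at any moment.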

Reading 100,000 records, that's, like, probably seconds of serialization if they're, like, big, rich documents. So I would say be very careful of the serialization speed of MongoEngine. That's not a MongoDB thing. That's a MongoEngine thing. So maybe you would drop down to PyMongo for certain types of things if you had to do that much data at once. But probably you just do paging or do, like, some kind of analysis in the database and just get a few results back.

But in MongoDB itself, there's been a couple of decisions that the people who make this product have made over time that have really resulted in some horror stories.

And most of those problems are entirely removed. They're in the past and have nothing to do with today. But I bring them up because they're still in the sort of community mind, especially of people who haven't done a lot with MongoDB. In the early days, they were overly optimizing for performance, in my opinion, performance over safety, which is a trade-off you can make, maybe for analytics or something, but I wouldn't make it for my ecommerce store. And so by default, the drivers, the API you'd use to talk to Mongo, would not wait for acknowledgments on the write. And they had 32-bit and 64-bit versions of MongoDB, and the 32-bit versions, for various reasons, could only hold 2 gigs of data. So there were certain startups that had run the 32-bit version of MongoDB, for whatever reason, and were talking to it without write acknowledgments.

And they put more than 2 gigs of data in it, it stopped taking it, and nobody knew for quite a while until it just wasn't working, and things like that. Right? So none of those are issues anymore. There's no 32-bit version. The way the storage works has nothing to do with that anymore. You can store way more data. The write acknowledgments are required by default unless you turn them off. All those kinds of things. So there's a lot of former sharp edges that don't make any sense anymore.

Another one would be journaling. In the early days, the redundancy and data backup was meant to be done via replication. So I have, like, three MongoDB servers that are always in sync, in triplicate. If one crashes, it's fine. The other one will just pick it up. And under very rare circumstances, a certain type of crash could lose some data because it hadn't been saved to disk yet. But it's fine because there's this replica set, and it'll just pick up. But most people were actually running MongoDB as a standalone server, so this replica safety that they had assumed was there wasn't there. Right? And so then, again, that led to a bunch of Hacker News posts and all sorts of negative press about it. That again is gone. Right? There's now, like, the same type of journaling that the relational databases have, and single servers don't lose their data. Things like this. Right? So all those things are sort of what was in the past. If they had just said from the beginning, this thing's gonna have journaling, from the beginning, we're gonna have a mode that works really well as a standalone server if it's not in a replica set, things like that, there would have been so many more good things and fewer bad things said in certain circumstances.

But that said, there's still some things that you really, really have to watch out for. And I would say there are two major, major things you have to watch out for if you're gonna run MongoDB.

First, there is no authentication by default. You go and you, you know, apt install mongodb-org or whatever it is, and you get a database that, if you put it on the Internet, people can log into from anywhere on port 27017, and they could just read your data or write your data all day long. What do you think? Does that sound good?

Absolutely not. Particularly as somebody who runs operations for my living. It's, yeah, such a bad idea. Something to be avoided.

It is such a bad idea. Yeah.

And it goes a bit into the broader issue of just devices in general having default passwords that don't prompt you to change them, such as Wi-Fi routers or wireless cameras, where anybody who wants to can then access it and use it for their own nefarious deeds.

Right. Like, DDoS, for example. Yeah. So if I were king of MongoDB, and I'm not, but if I was, I would say MongoDB will not listen on anything other than localhost unless you have authentication turned on and required, period. It just won't even listen on the network. Right? But that's not the way it works now. If you install it, I think on Linux, it does only listen on localhost, or possibly if you brew install it on Mac. I'm not sure. But, you know, people go, well, it needs to listen on the network. Turn it on. Right? And then you end up with all these horror stories. Like, there's this crazy horror story of a whole bunch of MongoDB databases that got ransomwared, and all their data was gone. And there was, like, one record in the database saying, like, send us a Bitcoin to this address. And, of course, they didn't back up your data. They just erased it, right, because it's too much work to back it up.

So that's part of that story. The other one that goes along with it, and this is less bad, but it's still nontrivial, is it doesn't run encryption by default. You can turn it on, but it doesn't do encrypted communication by default. So if you're gonna talk to this thing over a network, you know, make sure it's a virtual private network or something in your data center that really locks it down, or absolutely make sure it has a username and password. You can go farther. You can restrict the IP addresses that get to that port or change the default port, all sorts of things. But at a minimum, you know, be careful of that. So I guess the rough edge that still exists very clearly is this whole authentication, encryption thing. Like, it shouldn't be able to listen on the public network if it doesn't have those things turned on, but that's how it works.

Yeah. So for anybody who is interested in running this in production, if you don't have experience doing production deployments and running things in operations, then please find somebody who does have that experience and ask them questions, because chances are they'll be happy to provide answers.
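On the client side, the authentication and encryption advice above mostly comes down to never connecting with a bare host and port. Here is a minimal sketch, standard library only, of building a MongoDB connection string that carries credentials and asks for an encrypted connection. The user name, password, host, and database are placeholders, and the server itself still has to be configured to require authentication and accept TLS. The tls=true option is the current spelling; older servers and drivers used ssl=true instead.

```python
from urllib.parse import quote_plus, urlencode


def build_mongo_uri(
    user: str,
    password: str,
    host: str,
    port: int = 27017,
    database: str = "admin",
) -> str:
    """Build a connection string with credentials and TLS requested."""
    # Credentials must be percent-encoded so characters such as '@' or '/'
    # are not mistaken for URI delimiters.
    credentials = f"{quote_plus(user)}:{quote_plus(password)}"
    options = urlencode({"authSource": "admin", "tls": "true"})
    return f"mongodb://{credentials}@{host}:{port}/{database}?{options}"


# Placeholder values (assumptions for illustration, not real credentials).
uri = build_mongo_uri("app_user", "p@ss/word", "db.internal.example")
```

In a real deployment the password would come from the environment or a secrets store rather than source code, and the resulting string is what you would hand to the driver when it connects.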

Yeah. Absolutely. There's, like, a security checklist at mongodb.org or .com, I can't remember. And there's also an option of just doing this as a hosted service. So mLab, mlab.com, they actually have a free hosted MongoDB for, like, 500 megs, and then beyond that, you can pay for, like, MongoDB as a service. I think Rackspace, ObjectRocket, might have one as well.

And MongoDB themselves also has a hosted option now, MongoDB Atlas.

That's right. They do. Yeah. And so, basically, with Atlas, you kinda give them permission to configure your AWS account. Like, here's my AWS keys. And, like, great, we'll make you three servers and set them up for you safely, automatically. Yeah. That's how that works, I think. But, yeah, there's definitely some sharp edges about hosting MongoDB. So if you haven't done it before, make sure you're a little bit careful, because these are things that are not obvious. Like, oh, wait, there's literally no account required unless I add one? I didn't know that. Well, that's why you have a Bitcoin number where your data used to be.

So for anybody who wants to learn more about using MongoDB, and in particular using it with Python, what are some resources that you recommend?

Well, there's a couple of things.

MongoDB themselves just had their yearly conference, MongoDB World. And so if you search MongoDB World 2017 videos, they have a bunch of videos and examples and stuff. Now these are not really tutorial style, in the sense that they're not like, we're gonna get you started from here to there. It's more showcasing features and whatnot. But if you're interested in, like, a lot of high-level stuff, see how other people are using it. Maybe they solved a certain problem and they talked about it there. You can watch those videos. That's pretty good.

MongoDB has a course, a free course, I think. You can go take that. It's quite academic, in my opinion. It is Python, but it feels, I don't know, it doesn't feel like you're building a real app. It feels like you're kind of poking at the database a little bit, and that's fine. And you do learn some things that way, but you can check that out. You even get certified there.

I'm working on a free course. I don't have the free course done yet, it's coming soon, but I also have a full-on MongoDB for Python Developers course at talkpython.fm. So you can check that out, and that takes you through building, like, a whole sort of simplistic web app type of thing. It goes through performance tuning, document modeling, and even all the stuff we just talked about, the sharp edges. It takes you through, like, creating a hardened production server that you can manage and back up and work with and things like that.

Alright.

Well, are there any other topics or questions you think we should cover before we start to close out the show?

You know, not really. I guess maybe let me leave everyone with this. From a developer side, not necessarily from an operations side, I think these document databases really provide a nice, simple way to work on your application. They let your application code define what the data looks like, and the database, most of the time, really easily adapts automatically to what you need. So a lot of people look at NoSQL and things like that as, I need to make this scale like crazy. There's, like, that whole silly video about web scale and whatnot. But the real advantage I think you get from these databases is you get a simpler story. We talked a bit about the migrations, the rollback, the backwards migrations, like all that. You don't have to think about that nearly as much. It doesn't entirely go away, but it goes away most of the time. And so I would say, like, you really get a simple software development story here. And in terms of the various databases, if you look at the 2017 Stack Overflow developer survey, MongoDB is by far the most wanted document, sorry, the most wanted database technology, period, by, like, a factor of almost two over any other database, including Postgres, including MySQL, including other document databases, and also one of the most loved. So MongoDB, of those various choices, is a pretty solid one.

Alright. Well, for anybody who wants to get in touch with you or follow the work that you're up to, I've got your contact information in the show notes. And with that, I'll move us to the picks. And my pick this week, I've got a couple. So one of them is Org mode for Emacs.

So I've been moving all of my note taking and to-do list tracking to that. And it's really great because it lets you easily go from writing down notes and ideas about, you know, things that you have to work on, or just trying to, you know, set reminders for yourself, and then turn those into a to-do list. Then you can also set, you know, scheduled deadlines and set reminders. And it's just a very powerful and flexible mode of keeping track of all the different things that you think about throughout the day, particularly given the fact that I do all my development in Emacs. I don't have to context switch to take down notes. I can just do it right there.

That's pretty cool that it's, like, right there.

Yeah. It's really powerful. I had avoided getting into it for a while because there's a lot of depth to it, but it's really easy to just start at the surface and get, you know, get going with it and then peel back the layers as you need them. So for anybody who either uses Emacs or has thought about using Emacs, Org mode is sort of the killer story for that. So definitely take a look.

That's cool. I don't use Emacs, so I don't have Org mode, but I have this thing called Momentum Dash, which turns my new tab into something similar. Here's the things you're supposed to be doing today. Stay on target, every time I open up a new tab.

And then my other pick this week is the podcast LeVar Burton Reads. So for anybody who isn't familiar with who that is, he started his career with a show called Reading Rainbow, where he would read books to kids on public television. And now he has moved to reading pieces of short fiction to adults on podcasts. And so each week, he picks a different short story and reads it aloud. It has great performance, great production quality, and it's just a lot of fun to be able to actually, you know, fit that into your day because it's not a full novel. It's just little bite-sized pieces, and you get to sort of experience a good performance and a good story that's curated and handpicked by somebody who is very much adept at and steeped in storytelling. So for anybody who likes a good story, I definitely recommend checking that out. And with that, I'll pass it to you. Do you have any picks for us this week, Mike?

I do.

Let me give you a couple. I tried to pick a bit of a spectrum. So let's start with something related to our topic really closely. So the first one is something called Robomongo, which has now been renamed to Robo 3T because 3T, the company, bought them. But the idea is, you know, there's a couple of ways to talk to and administer and explore MongoDB, and it often involves typing the word mongo on the command line and just doing a CLI type of thing. Like, think the equivalent of the Python REPL, but for MongoDB. And while it's pretty cool, it would be nice to have some kind of graphical interface where you can, like, see, analyze, and work with the data. But a lot of the problems with the GUIs are, like, you lose some of that fluidity, that functionality of the command line. So Robomongo basically has a little tiny shell where you type anything you want as if it were the command line, but then the results are represented in a GUI. It's glorious, and it's free. So if you work with Mongo, check out Robomongo. It works on all the platforms.

So next up is this package called Newspaper. Tobias, have you heard of Newspaper?

I have not.

So I just had a guest on Talk Python that recommended this to me, and oh my gosh, this is crazy. So let me just read you a little bit of a summary of, like, the code of how this works. So newspaper is a Python package. You pip install it. So I can say import newspaper, and I can say build from cnn.com. Then I can say print all of the articles on that home page. Or I can go to an article, and I can say, parse it. Who are the authors? What is the text? What is the top image? I can do NLP, natural language processing, on it, and it'll show me the keywords, the summary, all that stuff. And you just point it at, like, CNN, Fox, MSNBC, whatever. New York Times.

Looking at the documentation page for it, it definitely looks very cool.

It's crazy. Right? It's like a little bit of magic. Right? I just point this newspaper thing at, like, a URL, and I have, like, analytics on it. It's crazy. So if you're doing web scraping around news, check this thing out. It's awesome.

Then speaking of news, I have an article I picked for us, and that's called The Dark Secret at the Heart of AI.

So very mysterious, and the idea is basically, you know, very much in the Python space, we have tons of data science. We have machine learning. We have all this AI stuff coming, but it's gonna lead to a world that is very different from the world that you and I learned to program in. Right? Like, how do you debug a deep learning network? How do you understand why it made a decision?

What are your thoughts on this?

Yeah. From the literature that I've read and the podcasts I've listened to on the subject of deep learning and neural networks, it's just very much a black box, you know, even for things like self-driving cars, where you want to be able to have a high level of confidence that it's making the right decisions. The way that these networks are trained is you just feed it a lot of data and then you tune it for the output that it produces until it gives you what you want, but you don't actually have any understanding of the decisions that it made along the way from the raw data to the output. And so it's definitely a big issue in terms of reproducibility and just being able to grok the fundamental aspects of what the network is doing.

Yeah. Absolutely. So this is a pretty deep article by MIT Technology Review, and they have some examples. They say NVIDIA, you know, NVIDIA, the graphics card folks, recently released an automated car, and it's not like Tesla or Google or whatever. The only way they taught it to drive is they just had it observe humans a lot. It watched how the humans brake, watched how the humans steer, and then they released it on the road, and it drove around. So, you know, what if it crashes into somebody? Or what if it sits at a green light? Like, how do you deal with that? Even going farther, like, there's all sorts of decisions these machines are making. The EU is considering a law to say that getting an explanation from an automated system on why it made a decision one way or the other about you, like, you're approved to get this home loan, you're denied, is a fundamental legal right. And so how's that work with AI? It's gonna be interesting.

Yeah. Sure.

So the last one is just a fun electronic thing. So I've been trying to electrify my life a little bit, or maybe use less gas in it or something like this, some combination thereof. So in Portland, we're pretty lucky that 50 to 80% of the energy on the grid is renewable energy, just as it is. And then you can pay a little bit extra to the utilities to have 100% of the energy coming to your house be renewable. So I'm like, how do I take advantage of that and, you know, help the world a little bit? So the next thing I wanna recommend, the final one, is this thing called a Haibike, which is a really cool electric bike. You plug it in overnight, in the morning you wake up, and it goes between 40 and 80 miles on a single charge. And it'll go, like, under its own power up to, like, 20 miles an hour. It's pretty awesome. So if you're looking for something fun and a different way to get around, I've got that as a recommendation.

Great. Well, I appreciate you taking the time out of your day to talk about your experience of evolving your business and your applications along with your data layer and some of the learning that you've gotten out of that. So I appreciate that, and I hope you enjoy the rest of your day.

Yeah. Thanks so much for having me on. It was great to talk about it. And as always, fun to catch up with you, Tobias. And thanks everyone for listening.
