Summary
Most applications require data to operate on in order to function, but sometimes that data is hard to come by, so why not just make it up? Mimesis is a library for randomly generating data of different types, such as names, addresses, and credit card numbers, so that you can use it for testing, anonymizing real data, or for placeholders. This week Nikita Sobolev discusses how the project got started, the challenges that it has posed, and how you can use it in your applications.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
- To get worry-free releases download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control and monitor every step from commit to deployment in one place. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- Your host as usual is Tobias Macey and today I’m interviewing Nikita Sobolev about Mimesis, a library for quickly generating synthetic data.
Interview
- Introductions
- How did you get introduced to Python?
- What is Mimesis and how does it compare to other projects such as Faker and factory_boy?
- What was the motivation for creating it?
- One of the features that is advertised is the speed of Mimesis. What techniques are used to ensure that the data is generated quickly?
- What are the built-in mechanisms for generating data?
- What options do users have for customizing the types of data that can get generated?
- What are some of the most complicated providers to write and maintain?
- What are some of the use cases outside of unit or integration tests where Mimesis could be beneficial?
- How would you use Mimesis to anonymize data from a production environment to be used for testing?
- What are the most challenging aspects of maintaining the Mimesis project?
- What are some of the plans that you have for the future of Mimesis?
Keep In Touch
Picks
- Tobias
- Nikita
Links
- Mimesis
- Django
- Faker
- Factory Boy
- Internationalization (I18N)
- Unicode
- Enum
- Pipfile
- GeoJSON
- Mimesis Cloud
- Sanic
- GraphQL
- Impostor Syndrome
- Imposter Syndrome Disclaimer: Add this to all of your projects!
- Jacob Kaplan-Moss PyCon Keynote
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you get everything you need to scale. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute. And to get worry-free releases, download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons. And visit podcastinit.com
[00:00:55] Unknown:
to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey. And today, I'm interviewing Nikita Sobolev about Mimesis, a library for quickly generating synthetic data. So, Nikita, could you start by introducing yourself?
[00:01:09] Unknown:
Hello, Tobias. My name is Nikita, and I'm a co-developer of the Mimesis library. Mimesis is a library for generating fake and synthetic data for various purposes. The main idea of this library is to create a useful tool that is fast, that has a lot of providers and a lot of data, and that helps you with a lot of use cases along the way. This library has a very interesting history. First of all, I want to thank my friend and the original author of this library, Isaac. So thank you, Isaac. He was responsible for inviting me here in the first place. So thanks.
[00:01:57] Unknown:
And how did you first get introduced to Python?
[00:02:01] Unknown:
That's a really interesting question because it is all connected. It was, like, 8 years ago, I think. We had a university project. It was a group project. And as you know, all group projects are made by one person on the last night, so I was that person. We had some kind of project that I almost don't remember, and we had to do it really fast. There was only one requirement: it had to be in Python. We chose Django as a very productive web framework, and I really liked it because I was successful in delivering all the code in one night. So I thought, that's a great thing. Why wouldn't I use it in normal, everyday situations?
So I really liked the language and the framework, and I started to build some websites and some applications using Python and Django, as I still do today in my daily life. After that, I wanted to share my experience with Python with other people, with some newcomers, with people from other jobs and other professions. So I started to give Python courses. And Isaac, the co-developer of Mimesis, was one of my first students. Since that time, we have become friends and coworkers and maintainers of the same libraries. So it was a really nice experience. And as I said earlier, it is all connected to this topic.
[00:03:46] Unknown:
That's interesting that you initially taught him Python, and now you're working on this project together. It's a good history to it. And what was the motivation for creating Mimesis?
[00:03:56] Unknown:
I think it was just a passion for open source from both Isaac and me, because when Isaac originally started to build this library, I don't think there was a clear purpose; it was driven by passion. Then we started to figure out some things, like, oh, we can do this, we can do that, we can be better than this library, and we can work together with that library to build a nice user experience and some nice features that anyone would love to use.
[00:04:34] Unknown:
And one of the other projects that is similar, in the space of being able to generate data, generally for the purposes of testing, is Faker. And I know that there's another library named Factory Boy that Mimesis actually acts as a provider for. So I'm wondering if you can provide some comparison between the features and capabilities of Faker versus Mimesis and how those relate to other projects such as Factory Boy or any others that I'm leaving out?
[00:05:06] Unknown:
First of all, our main selling point is speed, because we use a lot of techniques to maintain good speed, and we always check our numbers before releasing new features. The second point is that we have a lot of providers that Faker doesn't have, but at the same time, Faker has some providers that we don't have. So we can say that they are complementary projects, because sometimes you can use Faker, sometimes you can use Mimesis, and you have to choose by the providers, which are the most important part, really. Because providers are also separated by language.
Some providers have 2 languages, some providers have 3 or more languages, and sometimes it is really important to find the language you need. For example, we live in Russia, and we really want to have everything in Russian. So we have Russian providers for names, Russian providers for emails, websites, and so on. And for people from other countries, it is very important too. So we try to keep the language part as complete as we can. So those are the 2 main reasons for using our library, or for choosing some providers from Faker or other libraries.
[00:06:25] Unknown:
And when you're referring to providers, this is for things like being able to create names or addresses. And also, just as a point of clarification for listeners, because "language" is a somewhat overloaded term that could refer to programming or spoken languages: in this case, we're referring to spoken languages such as Russian or German or English.
[00:06:49] Unknown:
Yes. Yes.
[00:06:51] Unknown:
And that's another interesting point too: when you are generating this fake data, particularly for purposes of testing an application, you want to be able to generate it in multiple languages to ensure that your application properly supports internationalization and localization, because different languages might have different character sets or right-to-left versus left-to-right text. So having those different language providers is definitely valuable for those purposes as well. Right?
[00:07:24] Unknown:
Yes. Sure. Because sometimes English-speaking people do not care about Unicode, for a lot of reasons, because they don't need it. And other people struggle a lot with Unicode support. We all know this problem from Python 2. Right now, it's not so painful for developers to work with strings anymore, but it was just some time ago. And to name one feature that we are really proud of, and it was our first experience with this kind of feature, it is Python 3 support. We don't support Python 2. It was a decision made by Isaac at the very beginning of the project, and it saved us a lot of time. And we have implemented a lot of features such as mypy support, type annotations, type hints, and all kinds of things like that. We did it from the very beginning, and it was really a life-changing experience for me to maintain Python 3 only libraries. So it's really good.
[00:08:19] Unknown:
Yeah. When I was digging through some of the source code, I noticed the type annotations and meant to ask you about that. So thank you for volunteering the information. And I'm curious if the type annotations and type hints have provided any benefits in terms of the actual generation of the data.
[00:08:38] Unknown:
Probably. Because at some point in time, we switched to enums instead of strings to generate data. And I think that's really beneficial to the user experience, because you don't have to memorize all these strings and stuff like that. You can use autocompletion and inspect the enum for all the different values and options. And when we switched to type annotations, we could clearly define which enums we're using in this function, which enums we're using in that method, and so on. So, yes, Python 3 helps us, and type annotations help us as well.
And going back to the topic of the speed of the library, what are some of the techniques that you're using to be able to generate the data so quickly?
First of all, we try to be as simple as possible. We don't use any kind of complicated steps or dark magic of Python. We try to write simple functions, simple methods, and simple classes to get things done.
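To make the point above about enums and type annotations concrete, here is a minimal sketch using only the standard library. The names `Gender`, `NAMES`, and `first_name` are hypothetical and chosen for illustration; they are not taken from Mimesis's actual API.

```python
import random
from enum import Enum
from typing import Optional


class Gender(Enum):
    """Hypothetical enum replacing free-form strings such as 'female'."""
    FEMALE = "female"
    MALE = "male"


# Tiny illustrative data set; a real library would load this from locale data.
NAMES = {
    Gender.FEMALE: ["Anna", "Olga", "Maria"],
    Gender.MALE: ["Ivan", "Oleg", "Pavel"],
}


def first_name(gender: Optional[Gender] = None,
               rng: Optional[random.Random] = None) -> str:
    """Return a random first name; the annotation documents the valid options."""
    rng = rng or random.Random()
    if gender is None:
        gender = rng.choice(list(Gender))
    if not isinstance(gender, Gender):
        # A plain string is rejected early instead of failing somewhere deeper.
        raise TypeError(f"gender must be a Gender member, got {gender!r}")
    return rng.choice(NAMES[gender])


print(first_name(Gender.FEMALE))
```

With an enum, an editor can autocomplete `Gender.` and a type checker such as mypy can flag `first_name("female")` before the code runs, which is the user-experience benefit described in the answer.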
On the other hand, we have integrated benchmark support, which we run as often as we can. So we measure our speed, and we try to follow an approach that I really like: make it work, make it right, make it fast. At the "make it fast" stage, we try to measure the speed, find some bottlenecks, and eliminate them.
[00:10:03] Unknown:
And have you had the need to rewrite any portions of the library in C to get any improvements in speed, or use things like async?
[00:10:10] Unknown:
No. No. We don't use any kind of async frameworks. We don't use any kind of C bindings and stuff like that, because it really complicates the setup and it really complicates everything about development. So it's not worth it, really.
[00:10:30] Unknown:
And I noticed as well that there aren't any external library dependencies in Mimesis and it all uses the built-in capabilities of Python. So do you think that that has also helped to maintain the speed of the library and the simplicity?
[00:10:42] Unknown:
Sure. Because my initial idea was to add dependencies, dependencies, and more dependencies, but Isaac said, no, let's not do that, because there are a lot of disadvantages to that. You can't maintain the library as easily as we do now. You have to keep track of dependency versions, and you have to write a lot of tests against this dependency version, that dependency version, and so on. We have all the freedom we want by staying completely free of dependencies, and it is really nice, and we don't have any regrets or second thoughts about why we don't add any dependencies. So I think it is a really nice situation, and I really like that it is this way.
[00:11:45] Unknown:
Yeah. I imagine that helps with user adoption as well, not having to determine whether there are any conflicts in the dependencies that Mimesis brings in or any licensing issues when you're interested in using it within a project.
[00:11:52] Unknown:
Yeah. That's an important part too. But I think right now we are coming to a situation where dependency conflicts are going away with the creation of Pipfile and Pipenv. So that's an improvement, I think, that we have in the Python ecosystem.
[00:12:08] Unknown:
And as far as the ways that the data is generated, do you maintain a certain corpus of text that you can sample from for mixing and matching names or textual data? Or is it all purely a generative mechanism where you take some sort of seed or input that feeds into an algorithm to randomly generate the names at request time?
[00:12:36] Unknown:
We load a JSON file at the very beginning, on creation of the first object, and then we use a cached dict to get this data when and where we need it. And we have some optimizations that make this process really fast. So that's how we do it. We don't have any kind of interesting algorithm that we can really show you. It's really simple.
[00:13:01] Unknown:
And so when you're adding, let's say, support for additional languages, you would just swap in a different JSON file? Or if a user has a requirement for a particular size or content of the text that gets generated, they would just swap in their own JSON file?
[00:13:18] Unknown:
Yeah. It's possible. You can create your own JSON file. You can create a new locale for this kind of data and so on. It's really simple.
[00:13:28] Unknown:
And when a user needs to either customize the type of data that gets generated or create a new type of data, what would the workflow look like for being able to either override or add support for those different data types?
[00:13:48] Unknown:
First of all, you can always subclass the initial provider class and change how things work for you. Secondly, you can create your own providers. We have an API for creating providers in specific areas that we don't want to cover in the core library, but we are free to provide any kind of API you need.
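The two customization routes mentioned here, subclassing an existing provider and writing a new one, might look roughly like the sketch below. All class names (`BaseProvider`, `Person`, `StagingPerson`, `Ticket`) are hypothetical stand-ins written for this example; Mimesis's real base classes and registration API are not reproduced here.

```python
import random
import string
from typing import Optional


class BaseProvider:
    """Hypothetical minimal base class holding a seedable random generator."""

    def __init__(self, seed: Optional[int] = None) -> None:
        self.random = random.Random(seed)


class Person(BaseProvider):
    """Hypothetical built-in provider."""

    def email(self) -> str:
        user = "".join(self.random.choices(string.ascii_lowercase, k=8))
        return f"{user}@example.com"


class StagingPerson(Person):
    """Route 1: subclass an existing provider and change its behaviour."""

    def email(self) -> str:
        # Reuse the parent's local part, but force an internal test domain.
        return super().email().replace("@example.com", "@staging.example.org")


class Ticket(BaseProvider):
    """Route 2: a new provider for a domain the core library does not cover."""

    def ticket_id(self) -> str:
        return f"TCK-{self.random.randint(0, 99999):05d}"


print(StagingPerson(seed=1).email())
print(Ticket(seed=1).ticket_id())
```

Keeping the random generator on the base class means every provider, built-in or custom, can be seeded the same way for reproducible test data.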
[00:14:12] Unknown:
I think these are the main ways to go if you need it. And I'm not going to ask you to enumerate the list of providers because anybody who's interested can refer to the documentation for that. But are there any that you found to be particularly difficult or complicated to write and maintain?
[00:14:29] Unknown:
Yes. There is one. I can mention the country provider, where we have a method called generate points inside. And that's really a hard thing to do, because we need to have all the country borders, we need a specific algorithm to generate points inside a country, and we need some additional parameters to generate points in some specific region or area. It's not implemented yet. We are trying to implement it, but it's really hard, so we are struggling with that.
[00:15:09] Unknown:
And for that, would you pull down some GeoJSON from an API to be able to get current data to ensure that things like county borders or city borders are up to date and properly represented?
[00:15:22] Unknown:
Yeah. We need to do something like that, or we need to maintain this data inside our library, and we need to update it on a daily basis because, you know, country borders are changing these days. Very frequently. Yes. So, yeah, that's a really hard thing to do because it depends on a lot of things, and it's really hard to do it properly.
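Since the feature discussed above is described as not yet implemented, the following is only one common way to approach "generate a point inside a border": rejection sampling over the bounding box combined with a ray-casting point-in-polygon test. It assumes a single simple polygon in flat (planar) coordinates and made-up data; real borders would need multi-polygons, holes, and spherical geometry.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_inside(polygon: List[Point], rng: random.Random,
                        max_tries: int = 10_000) -> Point:
    """Rejection sampling: draw from the bounding box until a point lands inside."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    for _ in range(max_tries):
        candidate = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(candidate, polygon):
            return candidate
    raise RuntimeError("no interior point found; is the polygon degenerate?")


# A made-up L-shaped "border", not real geographic data.
border = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)]
print(random_point_inside(border, random.Random(42)))
```

Rejection sampling is simple but gets slower as the shape fills less of its bounding box, which hints at why the speakers call a production-quality version hard.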
[00:15:48] Unknown:
And I imagine that for things like addresses, that would also be fairly difficult because of the variation in formats and mechanisms that are used within different countries to represent them?
[00:16:03] Unknown:
Yes. Because of that, we have some specific providers called, for example, the German spec provider or the Russian spec provider. In these providers, we have some specific data types and some specific formats that are used in that country only. For example, in Russia, we have things like INN, individual security numbers, and so on. These providers are not included in the default providers list because they are specific, and we don't want to pollute the namespace of our library with them. But you can find them in the mimesis/builtins section, so you can just plug one in when you need it and use it.
[00:16:49] Unknown:
And one of the main use cases that people generally think of when talking about generating fake data is for unit tests, to ensure that functions are able to take appropriate inputs and provide appropriate outputs, or for being able to seed the test database with information to make sure that everything functions as expected. But are there any other interesting use cases outside of that where Mimesis could be beneficial?
[00:17:18] Unknown:
I could mention anonymizing production data. That's a really, really useful use case. We all have stage servers, and these stage servers must copy our production setup and our production data. But on the other hand, we don't want this data leaked, we don't want this data corrupted, and we don't want our customers or users to be notified at their real emails when some developer screws up. So we use this technique in some applications. We take our production data and we modify names, emails, and some specific numbers if we do store them, and so we anonymize them. So we have all our production data with all possible kinds of errors, all possible kinds of missing fields, missing relationships, and all kinds of troubles we can possibly imagine. But on the other hand, we don't risk losing our data, and we don't risk it being leaked. So we have the benefits of both worlds, and we can use them safely.
[00:18:21] Unknown:
Yeah. That's one of the difficulties that has cropped up in a lot of the places I've worked, being able to copy data from production. Because of the volume of interactions, any error that can happen will. And when you're trying to debug something, it's often because of the data that's in that environment. So being able to pull it down to a lower environment for testing and not risk those breaches, as you mentioned, or accidentally emailing the customers is valuable. And is there any particular workflow that you've found useful for being able to override that data? Such as, do you override it at the point where you're exporting it from the database, or do you load the production data into the test environment database and then use the ORM, for instance, to be able to generate the data and override it after it's been imported?
[00:19:16] Unknown:
I don't think that importing the real data into the test database is a good idea, because when you do this, your data is already leaked if your server is compromised or things like that. So the way I do it, I export this data as comma-separated values or something similar. Then I change this data in that format inside a trusted environment. And then I upload it to the test server without any production data and without any chance of it being leaked.
[00:19:54] Unknown:
And I know that Mimesis supports creating a schema for the data. So is that something that you've found helpful when you're overriding the data in the CSV, to load the CSV into Python in memory and then apply the schema to overwrite the specific fields that you want to anonymize?
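The workflow discussed in this exchange (export to CSV, overwrite the sensitive columns in a trusted environment, upload only the result) can be sketched with the standard library alone. The column names, the `ANONYMIZERS` mapping, and the tiny name lists are assumptions made for this example; in practice the generator functions would be calls into a fake-data library.

```python
import csv
import io
import random

rng = random.Random()

FIRST_NAMES = ["Alex", "Sam", "Kim", "Lee"]
LAST_NAMES = ["Smith", "Ivanov", "Garcia", "Chen"]


def fake_name() -> str:
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


def fake_email() -> str:
    # example.com is reserved for documentation, so no real inbox is reached.
    return f"user{rng.randrange(10**6):06d}@example.com"


# A schema-like mapping: which columns to overwrite, and with which generator.
ANONYMIZERS = {"name": fake_name, "email": fake_email}


def anonymize_csv(source, target) -> None:
    """Copy rows from source to target, replacing only the sensitive columns."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column, make_value in ANONYMIZERS.items():
            if row.get(column):  # leave missing or empty values as they are
                row[column] = make_value()
        writer.writerow(row)


exported = io.StringIO(
    "id,name,email,plan\n1,Jane Roe,jane@corp.test,pro\n2,,,free\n"
)
cleaned = io.StringIO()
anonymize_csv(exported, cleaned)
print(cleaned.getvalue())
```

Empty fields are deliberately left empty so that the anonymized copy keeps the missing-data quirks of production, which is the debugging value described earlier in the conversation.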
[00:20:10] Unknown:
To be honest, I haven't used Schema. I don't find this particular feature useful for me, but I know that it is useful for many other users. So I'm just not representative in this case. I try to use raw values and raw calls to methods and so on, because then I have all the control I need and all the flexibility I can have. But in some cases, yes, Schema is very useful because it can generate a lot of data, and you can keep the configuration really simple.
[00:20:44] Unknown:
And are there any particularly interesting or unexpected uses of Mimesis that you've seen people put it to?
[00:20:53] Unknown:
I would like to mention one specific thing that we don't want Mimesis to be used for. We don't want Mimesis to be used as a tool to generate a lot of fake data that is going to be used for some kind of unlawful actions, or for anything that violates laws or some moral or ethical rules. Because I can see some use cases that violate these rules. I don't want to speak about them, but I can easily see them.
[00:21:33] Unknown:
So don't use Mimesis for writing your spam messages. Yes. And are there any aspects of building and maintaining the library that you found to be particularly challenging?
[00:21:46] Unknown:
Yeah. I would like to mention the very first part when you create a new library: choosing the name. We had 3 names. The first one was Church, and it was some kind of joke. I'm not going to judge whether it was a good one or a bad one, but the joke was: Church, generating fake data. Some people found that it was not an appropriate name for the library, so we changed it to Elizabeth. Elizabeth is the main character of the game BioShock Infinite. But then we thought that it's just a random name that doesn't show what this library is for, and we don't have any kind of relationship to this game or to this character. So we changed it one more time. And now we are really proud of the name, so we're not going to change it again.
[00:22:52] Unknown:
Yeah. It's definitely a very descriptive and fitting name for what the library is used for. So you did well on that front. Thank you. And as far as the way that the project is architected, I'm wondering if you can discuss some of the ways that it has changed and evolved from when you first started working on it and some of the motivations for any rearchitecting that may have happened.
[00:23:16] Unknown:
We really had a lot of rearchitecting during development, and it was just a natural process, because at the very beginning we had all our providers and all our core logic in a single file, because it was the easiest thing to do. And we really like this approach: make things work, and then make them right. Then we saw that these kinds of classes could be extracted, and these functions could be made utility functions, and so on. So we have changed the architecture, I think, twice. And right now, we're happy with it. It works, it is fast, and it is simple. What more can you even want from the architecture?
[00:24:03] Unknown:
And are there any particular plans or new features that you have in mind for the future of Mimesis?
[00:24:09] Unknown:
We are currently working on several features that are not included in the core library but are complementary. The first one is a server that you can access with REST or GraphQL or things like that, and get your fake data as an HTTP response. I can think of a lot of situations when I really needed that for mocking, testing, and stuff like that. On the other hand, we are working on integration with existing tools, such as Factory Boy or similar libraries. This integration has to be really easy and clean, so you can use our library to generate fake data in your unit tests and integration tests with ease. So those are our 2 main objectives right now.
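As a rough illustration of "fake data as an HTTP response", here is a minimal sketch built on the standard library's `http.server`. It is not the planned Mimesis service and does not use Sanic or GraphQL; the `/users` path, the port, and the record fields are all assumptions made for this example.

```python
import json
import random
from http.server import BaseHTTPRequestHandler, HTTPServer


def fake_user(rng: random.Random) -> dict:
    """Build one fake record; the fields are illustrative only."""
    user_id = rng.randrange(1, 10_000)
    return {"id": user_id, "email": f"user{user_id}@example.com"}


class FakeDataHandler(BaseHTTPRequestHandler):
    rng = random.Random()

    def do_GET(self) -> None:
        if self.path != "/users":
            self.send_error(404, "unknown resource")
            return
        # Every request returns a freshly generated list of three records.
        body = json.dumps([fake_user(self.rng) for _ in range(3)]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Create (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), FakeDataHandler)


# To try it locally: make_server().serve_forever(),
# then request http://127.0.0.1:8000/users from a browser or test client.
```

A front end or integration test can then point at this endpoint in place of a real backend, which is the mocking use case mentioned in the answer.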
[00:25:06] Unknown:
And I noticed when I was looking through the documentation that rather than provide any support for a given unit test framework or anything else in the core of Mimesis, you're relying on pushing that to external libraries to create those integration points, and similarly for ORMs. And from a maintenance perspective, that definitely seems like the right approach because it decreases the surface area that the core library needs to be concerned about and provides a way for the community to provide more value without
[00:25:47] Unknown:
I can call it a JavaScript approach. Yeah. I think that this is the right way to go because, as you said, it is really easy to integrate a lot of features and not pollute the core library. And it is really easy to maintain these plugins because only these plugins will know about dependency versions, about API changes, and so on. So I think this really simplifies our workflow. We can focus on the core library, or we can focus on some plugin libraries, or we can focus on some service libraries, as I said. Right now, it is called Mimesis Cloud. It is still a work in progress, but I hope it's going to be finished soon, and we will launch it as a service for developers.
[00:26:35] Unknown:
And in terms of the external libraries, I imagine that it may also pose a bit of a challenge because at that point, it somewhat solidifies the external API that Mimesis is providing. So anytime you need to modify the way that it is interfaced with, you would want to be careful not to break those other integrations. So is Mimesis at a point where the external-facing API is stable and you don't need to worry about any breaking changes unless you do another major rearchitecture and maybe a major version bump?
[00:27:15] Unknown:
It's stable. Yeah. We follow semantic versioning. Right now we are at 2.0.0 or something like that; maybe I am wrong about some minor or patch bumps. So we have 2 stable releases, and this one is not going anywhere anytime soon. So we are really fearless about any kind of API changes or breaking changes, because the API is stable. We are not going to break anything. We are just going to add some features, some providers, methods, and so on. So we are at the point where working on plugins is really easy.
[00:28:03] Unknown:
And are there any particular aspects of the library or specific integrations that you would like to see help from the community with?
[00:28:14] Unknown:
Yes. We try to build a helpful and open community around our libraries. Right now, we are trying to finish our work on Mimesis Cloud. So I think we should concentrate on this one, because we are trying to use some bleeding-edge technologies such as Sanic, asyncio, GraphQL, and Apollo for Python. I think it's going to be really interesting for a lot of people to try these technologies to write something that's going to be used by other developers, I hope. And we encourage people to try it.
[00:29:03] Unknown:
And are there any other aspects of the project that you think we should discuss before we start to close out the show?
[00:29:12] Unknown:
I think we have covered everything.
[00:29:15] Unknown:
So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes.
[00:29:23] Unknown:
Yes. I have a GitHub account, so you can freely contact me there. And I have my mail there, so you can email me privately if you want to.
[00:29:33] Unknown:
And so with that, I'll move us to the picks. And this week, I'm going to choose the movie Coco. I just watched it last night with my family, and it was really well done and very visually appealing. I thought it was a well-structured movie with a good storyline, so for any other families who are interested in watching it, or even if you're sans family and just want to watch it anyway, it's definitely worth a watch. So, yeah, I definitely recommend that. And so with that, I'll pass it to you, Nikita. Do you have anything that you want to pick this week?
[00:30:07] Unknown:
First of all, I've seen this movie too, and I think it's great, really great. So I recommend it too. And I want to recommend my article, which was really well received. It's called "I am a mediocre developer," and I will send you a link. This article is about how a lot of developers have this impostor syndrome, and a lot of developers struggle to write good software from the very beginning. The article tries to motivate people to use the right tools to achieve their goals. And it clearly states that even developers with almost 10 years of experience are scared sometimes too. They also write bad code; they also write code with bugs, with errors.
They write code that breaks in production and so on. So we shouldn't be afraid of that. We should create an environment that helps us to solve these problems.
[00:31:18] Unknown:
Yeah. It's definitely a widespread issue in terms of the ways that people think about themselves in relation to programming. So it's always great when more people share their experience and discuss it in public forums. So thank you for that. And I want to thank you as well for taking the time to join me today and talk about the work that you've done with Mimesis. It's definitely an interesting library and one that I'm interested in using for anonymizing my production data for test environments. So thank you for that, and I hope you enjoy the rest of your day.
[00:31:53] Unknown:
Thank you for inviting me. It was my first podcast, and I hope it's a nice one. And I want to thank the original author of Mimesis, Isaac, for also inviting me to this show and for coming up with this idea. So thank you.
Introduction and Sponsor Messages
Interview with Nikita Sobolev
Nikita's Introduction to Python
Teaching Python and Meeting Isaac
Motivation for Creating Mimesis
Comparison with Other Libraries
Providers and Language Support
Speed and Performance Techniques
No External Dependencies
Customizing Data Generation
Use Cases Beyond Unit Testing
Schema Support and Anonymizing Data
Challenges in Building the Library
Future Plans for Mimesis
Maintaining External Libraries and Plugins
Community Contributions
Contact Information and Picks