Synthetic Data Generation Using Mimesis with Nikita Sobolev

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

When you're ready to launch your next app, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, NodeBalancers, and a 40 gigabit network, all controlled by a brand new API, you get everything you need to scale.

Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.

And to get worry-free releases, download GoCD, the open source continuous delivery server built by ThoughtWorks. You can use their pipeline modeling and value stream map to build, control, and monitor every step from commit to deployment in one place. Go to podcastinit.com/gocd to learn more about their professional support services and enterprise add-ons.

And visit podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. Your host as usual is Tobias Macey. And today, I'm interviewing Nikita Sobolev about Mimesis, a library for quickly generating synthetic data. So, Nikita, could you start by introducing yourself?

Hello, Tobias. My name is Nikita, and I'm a co-developer of the Mimesis library. Mimesis is a library for generating fake and synthetic data for various purposes. The main idea of the library is to be a fast, useful tool with a lot of providers and a lot of data, one that helps you with many use cases along the way. This library has a very interesting history. First of all, I want to thank my friend and the original author of this library, Isaac. So thank you, Isaac. He was responsible for inviting me here in the first place.

And how did you first get introduced to Python?

That's a really interesting question because it is all connected. About eight years ago, I think, we had a university project. It was a group project, and as you know, all group projects are made by one person on the last night, so I was that person. It was some kind of project that I almost don't remember, and we had to do it really fast. There was only one requirement: it had to be in Python. We chose Django as a very productive web framework, and I really liked it because I succeeded in delivering all the code in one night. So I thought, that's a great thing, why wouldn't I use it in normal life situations? I really liked the language and the framework, and I started to build websites and applications using Python and Django, as I still do today in my daily work.

After that, I wanted to share my experience with Python with other people: newcomers and people from other jobs and professions. So I started to give Python courses. Isaac, the co-developer of Mimesis, was one of my first students. Since then, we have been friends, coworkers, and maintainers of the same libraries. So it was a really nice experience. And as I said earlier, it is all connected to this topic.

That's interesting that you initially taught him Python, and now you're working on this project together. It's a good history to it. And what was the motivation for creating Mimesis?

I think it was just a passion for open source from both Isaac and me. When Isaac originally started to build this library, I don't think there was a clear purpose; it was driven by passion. Then we started to figure things out: we can do this, we can do that, we can be better than this library, and we can work together with that library to build a nice user experience and some nice features that anyone would love to use.

And one of the other projects in the same space of generating data, generally for the purposes of testing, is Faker. And I know that there's another library named Factory Boy that Mimesis actually acts as a provider for. So I'm wondering if you can provide some comparison between the features and capabilities of Faker versus Mimesis, and how those relate to other projects such as Factory Boy or any others that I'm leaving out?

First of all, our main selling point is speed, because we use a lot of techniques to maintain good speed, and we always check our numbers before releasing new features. The second point is that we have a lot of providers that Faker doesn't have, but at the same time, Faker has some providers that we don't have. So we can say that they are complementary projects, because sometimes you can use Faker and sometimes you can use Mimesis. You have to choose the providers, and that is the most important part, really, because providers are also separated by language. Some providers have two languages, some providers have three or more languages, and sometimes it is really important to find the language you need. For example, we live in Russia, and we really want to have everything in Russian. So we have Russian providers for names, Russian providers for emails, websites, and so on. For people from other countries, it is very important too, so we try to keep the language part as complete as we can. So those are the two main reasons for using our library, or for choosing some providers from Faker or other libraries.

And when you're referring to providers, this is for things like being able to create names or addresses. And also, just as a point of clarification for listeners, because language is a somewhat overloaded term that could refer to programming or spoken languages: in this case, we're referring to spoken languages such as Russian or German or English.

Yes. Yes.

And that's another interesting point too: when you are generating this fake data, particularly for purposes of testing an application, you want to be able to generate it in multiple languages to ensure that your application properly supports internationalization and localization, because different languages might have different character sets or right-to-left versus left-to-right text. So having those different language providers is definitely valuable for those purposes as well. Right?

Yes. Sure. Because sometimes English-speaking people do not care about Unicode, for a lot of reasons, because they don't need it. And other people struggle a lot with Unicode support. We all know this problem from Python 2. Right now, it is not so painful for developers to work with strings anymore, but it was just some time ago.

And to name one feature that we are really proud of, and it was our first experience with this kind of feature: Python 3 support. We don't support Python 2. It was a decision made by Isaac at the very beginning of the project, and it saved us a lot of time. And we have implemented a lot of features such as mypy support, type annotations, type hints, and all kinds of things like that. We did it from the very beginning, and it was really a life-changing experience for me to maintain Python 3 only libraries. So it's really good.

Yeah. When I was digging through some of the source code, I noticed the type annotations and meant to ask you about that, so thank you for volunteering the information. And I'm curious if the type annotations and type hints have provided any benefits in terms of the actual generation of the data.

Probably. Because at some point, we switched to enums instead of strings to generate data. And I think that is really beneficial to the user experience, because you don't have to memorize all these strings and stuff like that. You can use autocompletion and inspect the enum for all the different values and options. And when we switched to type annotations, we could clearly define which enums we are using in this function, which enums we are using in that method, and so on. So, yes, Python 3 helps us, and type annotations help us as well.

And going back to the topic of the speed of the library, what are some of the techniques that you're using to be able to generate the data so quickly?

First of all, we try to be as simple as possible. We don't use any kind of complicated steps or dark magic of Python. We try to write simple functions, simple methods, and simple classes to get things done.

On the other hand, we have integrated benchmark support, which we run as often as we can. So we measure our speed, and we try to follow a principle that I really like: make it work, make it right, make it fast. At the make-it-fast stage, we try to measure the speed, find bottlenecks, and eliminate them.

And have you had the need to rewrite any portions of the library in C to get any improvements in speed, or use things like async?

No. No. We don't use any kind of async frameworks. We don't use any kind of C bindings and stuff like that, because it really complicates the setup, and it really complicates everything about development. So it's not worth it, really.

And I noticed as well that there aren't any external library dependencies for Mimesis and it all uses the built-in capabilities of Python. So do you think that has also helped to maintain the speed of the library and the simplicity?

Sure. My initial idea was to add dependencies, dependencies, and dependencies, but Isaac said, no, let's not do that, because there are a lot of disadvantages in that approach. You can't maintain the library as easily as we do now. You have to keep track of dependency versions, and you have to write a lot of tests to check it against this dependency version, that dependency version, and so on. We have all the freedom we want by staying completely free of dependencies, and it is really nice. We don't have any kind of regrets, any thoughts like, why don't we add some dependencies? So I think it is a really nice situation, and I really like that it is this way.

Yeah. I imagine that helps with user adoption as well, not having to determine whether there are any conflicts in the dependencies that Mimesis brings in, or any licensing issues when you're interested in using it within a project.

Yeah. That's an important part too. But I think right now, we are coming to a situation where dependency conflicts are going away with the creation of Pipfile and Pipenv. So that's an improvement, I think, that we have in the Python ecosystem.

And as far as the ways that the data is generated, do you maintain a certain corpus of text that you can sample from for mixing and matching names or textual data? Or is it all purely a generative mechanism where you take some sort of seed or input that feeds into an algorithm to randomly generate the names at request time?

We load a JSON file at the very beginning, when the first object is created, and then we use a cached dict to get this data when we need it. And we have some optimizations that make this process really fast. So that's how we do it. We don't have any kind of interesting algorithm that we can really show you. It's really simple.
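
The load-once-and-cache approach described here can be illustrated with a minimal sketch. This is not Mimesis's actual code: the file layout, the sample names, and the helpers `load_locale` and `random_name` are all assumptions made for the example, and only the Python standard library is used.

```python
import json
import random
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical locale files, written to a temp directory so the sketch is
# self-contained. A real library would ship these files with the package.
DATA_DIR = Path(tempfile.mkdtemp())
(DATA_DIR / "en.json").write_text(
    json.dumps({"names": ["Alice", "Bob", "Carol"]}), encoding="utf-8"
)
(DATA_DIR / "ru.json").write_text(
    json.dumps({"names": ["Анна", "Борис"]}, ensure_ascii=False), encoding="utf-8"
)


@lru_cache(maxsize=None)
def load_locale(locale: str) -> dict:
    """Parse the JSON file for a locale; later calls return the cached dict."""
    with open(DATA_DIR / f"{locale}.json", encoding="utf-8") as fh:
        return json.load(fh)


def random_name(locale: str, rng: random.Random) -> str:
    """Pick a name from the cached locale data (no disk access after first use)."""
    return rng.choice(load_locale(locale)["names"])


rng = random.Random(42)
print(random_name("en", rng), random_name("ru", rng))
```

Swapping in a custom data set, as discussed in the next question, would amount to pointing the loader at a different JSON file. Note that the cached dict is shared between callers, so it should be treated as read-only.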

And so when you're adding in, let's say, support for additional languages, you would just swap in a different JSON file? Or if a user has a requirement for a particular size or content of the text that gets generated, they would just swap in their own JSON file?

Yeah. It's possible. You can create your own JSON file, you can create a new locale for this kind of data, and so on. It's really simple.

And when a user needs to either customize the type of data that gets generated or create a new type of data, what would the workflow look like for being able to either override or add support for those different data types?

First of all, you can always subclass the initial provider class and change how things work for you. Secondly, you can create your own providers. We have an API for creating providers for specific areas that we don't want to cover in the core library, but we are free to provide any kind of API you need.
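
The two options mentioned here, overriding an existing provider by subclassing and writing a new provider from scratch, might look roughly like the sketch below. Every class in it (`BaseProvider`, `Person`, `ShoutingPerson`, `Ticket`) is a hypothetical stand-in written for this example, not Mimesis's real API.

```python
import random


class BaseProvider:
    """Hypothetical base class: holds a seeded random generator for subclasses."""

    def __init__(self, seed=None):
        self.random = random.Random(seed)


class Person(BaseProvider):
    """Hypothetical built-in provider."""

    NAMES = ("Alice", "Bob", "Carol")

    def name(self) -> str:
        return self.random.choice(self.NAMES)


class ShoutingPerson(Person):
    """Option 1: subclass an existing provider and change its behaviour."""

    def name(self) -> str:
        return super().name().upper()


class Ticket(BaseProvider):
    """Option 2: a new provider for data the core library does not cover."""

    def ticket_id(self) -> str:
        # Zero-padded four-digit number with a fixed prefix.
        return f"TCK-{self.random.randint(0, 9999):04d}"


print(ShoutingPerson(seed=1).name())
print(Ticket(seed=1).ticket_id())
```

Keeping the random generator on the base class means a seed passed to any provider makes its output reproducible, which is useful in tests.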

So I think these are the main ways to go if you need it.

And I'm not going to ask you to enumerate the list of providers, because anybody who's interested can refer to the documentation for that. But are there any that you found to be particularly difficult or complicated to write and maintain?

Yes, there is one. I can mention the country provider, where we have a method called generate points inside. That's a really hard thing to do, because we need to have all the country borders, we need some specific algorithm to generate points inside the country, and we need some additional parameters to generate points in some specific region or area. And it's not implemented yet. We are trying to implement it, but it's really hard, so we are struggling to do that.

And for that, would you pull down some GeoJSON from an API to be able to get current data, to ensure that things like county borders or city borders are up to date and properly represented?
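
To make the difficulty concrete, one common way to sample a point inside a border is rejection sampling: draw points in the bounding box and keep the first one that passes a point-in-polygon test. The sketch below is not Mimesis's implementation (the feature is described as not yet implemented). The function names and the L-shaped test polygon are assumptions, and coordinates are treated as a flat plane, which is only a rough approximation for real longitude and latitude data.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_inside(polygon, rng, max_tries=10_000):
    """Rejection sampling: draw from the bounding box until a point lands inside."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    for _ in range(max_tries):
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            return x, y
    raise RuntimeError("no point found; polygon may be degenerate")


# Hypothetical L-shaped "border" used only for illustration.
border = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(random_point_inside(border, random.Random(7)))
```

Real borders add the complications raised in the interview: multi-part territories, holes, very large coordinate lists, and data that has to be kept up to date.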

Yeah. We need to do something like that, or we need to maintain this data inside our library, and we would need to update it on a daily basis because, you know, country borders are changing these days.

Very frequently.

Yes. So, yeah, that's a really hard thing to do properly.

And I imagine that for things like addresses, that would also be fairly difficult because of the variation of formats and mechanisms that are used within different countries to represent them?

Yes. Because of that, we have some specific providers called, for example, the German spec provider or the Russian spec provider. In these providers, we have some specific data types and formats that are used in that country only. For example, in Russia, we have things like INN, individual security numbers, and so on. They are country-specific in a lot of areas. These providers are not included in the default providers list because they are specific, and we don't want to pollute the namespace of our library with them. But you can find them in the mimesis/builtins section, so you can just plug them in when you need them and use them.

And one of the main use cases that people generally think of when talking about generating fake data is unit tests, to ensure that functions are able to take appropriate inputs and provide appropriate outputs, or being able to seed the test database with information to make sure that everything functions as expected. But are there any other interesting use cases outside of that where Mimesis could be beneficial?

I could mention anonymizing production data. That's a really, really useful use case. We all have staging servers, and these staging servers must copy our production setup and our production data. But on the other hand, we don't want this data leaked, we don't want this data corrupted, and we don't want our customers or users to be notified at their real email addresses when some developer screws up, and so on. So we use this technique in some applications. We take our production data, we modify names, emails, and some specific numbers if we store them, and so we just anonymize them. So we have all our production data with all possible kinds of errors, all possible kinds of missing fields, missing relationships, and all the kinds of trouble we can possibly imagine. But on the other hand, we don't risk losing our data, and we don't risk it being leaked. So we get the best of both worlds, and we can use the data safely.

Yeah. That's one of the difficulties that has cropped up in a lot of the places I've worked: being able to copy data from production. Because of the volume of interactions, any error that can happen will. And when you're trying to debug something, it's often because of the data that's in that environment. So being able to pull it down to a lower environment for testing and not risk those breaches, as you mentioned, or accidentally emailing the customers is valuable. And is there any particular workflow that you found useful for being able to override that data? For instance, do you override it at the point where you're exporting it from the database, or do you load the production data into the test environment database and then use the ORM, for instance, to generate the data and override it after it's been imported?

I don't think that importing the real data into the test database is a good idea, because when you do this, your data is already leaked if your server is compromised or things like that. So, the way I do it, I export this data as comma-separated values or something similar. Then I change the data in that format inside a trusted environment. And then I upload it to the test server without any production data and without any chance of it being leaked.

And I know that Mimesis supports creating a schema for the data. So is that something that you found helpful when you're overriding the data in the CSV: loading the CSV into Python in memory and then applying the schema to overwrite the specific fields that you want to anonymize?
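
The export, anonymize, then upload workflow discussed here can be sketched with the standard `csv` module. The column names (`name`, `email`, `plan`), the replacement values, and the `anonymize_csv` helper are assumptions for illustration; a real pipeline would likely call a data generator such as Mimesis where this sketch uses a fixed list of names.

```python
import csv
import io
import random

# Hypothetical replacement values; a fake-data library would supply these.
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe"]


def anonymize_csv(src, dst, rng):
    """Copy rows from src to dst, overwriting only the sensitive columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for i, row in enumerate(reader):
        # Keep empty fields empty so the data keeps its real-world gaps.
        if row["name"]:
            row["name"] = rng.choice(FAKE_NAMES)
        if row["email"]:
            # example.com is reserved, so no real mailbox can receive mail.
            row["email"] = f"user{i}@example.com"
        writer.writerow(row)


# Stand-in for a production export; the second row has a missing email.
exported = io.StringIO(
    "id,name,email,plan\n"
    "1,Real Person,real@corp.test,pro\n"
    "2,Other Person,,free\n"
)
cleaned = io.StringIO()
anonymize_csv(exported, cleaned, random.Random(0))
print(cleaned.getvalue())
```

Non-sensitive columns and missing values pass through unchanged, which preserves the kinds of irregularities that make production data useful for debugging.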

To be honest, I haven't used the schema. I don't find this particular feature useful for me, but I know that it is useful for many other users, so I'm just not representative in this case. I try to use raw values and raw calls to methods and so on, because I have all the control I need and all the flexibility I can have. But in some cases, yes, the schema is very useful, because it can generate a lot of data and you can keep the configuration really simple.

And are there any particularly interesting or unexpected uses of Mimesis that you've seen people put it to?

I would like to mention one specific thing that we don't want Mimesis to be used for. We don't want Mimesis to be used as a tool to generate a lot of fake data that is going to be used for some kind of unlawful action, something that violates laws, or something that violates moral or ethical rules, and so on. I can see some use cases that violate these rules. I don't want to speak about them, but I can easily imagine them.

So don't use Mimesis for writing your spam messages.

Yes.

And are there any aspects of building and maintaining the library that you found to be particularly challenging?

Yeah. I would like to mention the very first part of creating a new library, which is choosing the name. And we had three names. The first one was Church, and it was some kind of joke. I'm not going to judge if it was a good one or a bad one, but the joke was: Church, generating fake data. Some people found that it was not an appropriate name for the library, so we changed it to Elizabeth. Elizabeth is the main character of the game BioShock Infinite. But then we thought that it is just a random name that does not show what the library is for, and we don't have any kind of relationship to this game or to this character. So we changed it one more time. And now we are really proud of the name, so we're not going to change it again.

Yeah. It's definitely a very descriptive and fitting name for what the library is used for, so you did well on that front.

Thank you.

And as far as the way that the project is architected, I'm wondering if you can discuss some of the ways that it has changed and evolved from when you first started working on it, and some of the motivations for any rearchitecting that may have happened.

We really had a lot of rearchitecting during development, and it was just a natural process, because at the very beginning we had all our providers and all our core logic in a single file, since that was the easiest thing to do. And we really like this approach: make things work, and then make them right. Then we saw that these kinds of classes could be extracted, these functions could be made utility functions, and so on. So we have changed the architecture, I think, twice. And right now, we're happy with it. It works, it is fast, and it is simple. What more can you want from an architecture?

And are there any particular plans or new features that you have in mind for the future of Mimesis?

We are currently working on several features that are not included in the core library but are complementary. The first one is a server that you can access with REST or GraphQL or things like that, to get your fake data as an HTTP response. I can recall a lot of situations when I really needed something like that for mocking, testing, and so on. On the other hand, we are working on integration with existing tools such as Factory Boy or similar libraries. This integration has to be really easy and clean, so you can use our library to generate fake data in your unit tests and integration tests with ease. So those are our two main objectives right now.

And I noticed when I was looking through the documentation that rather than provide any support for a given unit test framework or anything else in the core of Mimesis, you're pushing that to external libraries to create those integration points, and similarly for ORMs. From a maintenance perspective, that definitely seems like the right approach, because it decreases the surface area that the core library needs to be concerned about and provides a way for the community to provide more value without...

I can call it a JavaScript approach.

Yeah. I think that this is the right way to go because, as you said, it is really easy to integrate a lot of features without polluting the core library. And it is really easy to maintain these plugins, because only the plugins need to know about dependency versions, API changes, and so on. So I think this really simplifies our workflow. We can focus on the core library, or on some plugin libraries, or on some service libraries, as I said. Right now, the service is called Mimesis Cloud. It is still a work in progress, but I hope it's going to be finished soon, and we will launch it as a service for developers.

And in terms of the external libraries, I imagine that it may also pose a bit of a challenge, because at that point it somewhat solidifies the external API that Mimesis is providing. So anytime you need to modify the way that it is interfaced with, you would want to be careful not to break those other integrations. So is Mimesis at a point where the external-facing API is stable and you don't need to worry about any breaking changes unless you do another major rearchitecture, maybe with a major version bump?

It's stable. Yeah. We follow semantic versioning. Right now we are at 2.0.0 or something like that; maybe I am wrong about some minor or patch bumps. So we have two stable releases, and this one is not going anywhere anytime soon. So we are really fearless about API changes or breaking changes, because the API is stable. We are not going to break anything. We are just going to add some features, some providers, methods, and so on. So we are at the point where working on plugins is really easy.

And are there any particular aspects of the library or specific integrations that you would like to see help from the community with?

Yes. We try to build a helpful and open community around our libraries. Right now, we are trying to finish our work on Mimesis Cloud, so I think we should concentrate on that one, because we are trying to use some bleeding-edge technologies there, such as Sanic, asyncio, GraphQL, and Apollo for Python. I think it's going to be really interesting for a lot of people to try these technologies and to write something that's going to be used by other developers, I hope. And we encourage people to try it.

And are there any other aspects of the project that you think we should discuss before we start to close out the show?

I think we have covered everything.

So for anybody who wants to follow the work that you're up to or get in touch, I'll have you add your preferred contact information to the show notes.

Yes. I have a GitHub account, so you can freely contact me there. And my email is there, so you can email me privately if you want to.

And so with that, I'll move us to the picks. This week, I'm going to choose the movie Coco. I just watched it last night with my family, and it was really well done and very visually appealing. I thought it was a well-structured movie with a good storyline, so for any other families who are interested in watching it, or even if you're sans family and just want to watch it anyway, it's definitely worth a watch. So I definitely recommend that. And with that, I'll pass it to you, Nikita. Do you have anything that you want to pick this week?

First of all, I've seen this movie too, and I think it's great, really great. So I recommend it too. And I want to recommend my article that was really well received. It's called "I am a mediocre developer," and I will send you a link. This article is about the fact that a lot of developers have imposter syndrome, and a lot of developers struggle to write good software from the very beginning. The article tries to motivate people to use the right tools to achieve their goals. And it clearly states that even developers with almost 10 years of experience are scared sometimes too. They also write bad code, they also write code with bugs and errors, they write code that breaks in production, and so on. So we shouldn't be afraid of that. We should create an environment that helps us solve these problems.

Yeah. It's definitely a widespread issue in terms of the ways that people think about themselves in relation to programming. So it's always great when more people share their experience and discuss it in public forums. So thank you for that. And I want to thank you as well for taking the time to join me today and talk about the work that you've done with Mimesis. It's definitely an interesting library and one that I'm interested in using for anonymizing my production data for test environments. So thank you for that, and I hope you enjoy the rest of your day.

Thank you for inviting me. It was my first podcast, and I hope it was a nice one. And I want to thank the original author of Mimesis, Isaac, for inviting me to this show and for coming up with this idea. So thank you.

The Python Podcast.__init__
