Summary
Unit tests are an important tool to ensure the proper functioning of your application, but writing them can be a chore. Stephan Lukasczyk wants to reduce the monotony of the process for Python developers. As part of his PhD research he created the Pynguin project to automate the creation of unit tests. In this episode he explains the complexity involved in generating useful tests for a dynamic language, how he has designed Pynguin to address the challenges, and how you can start using it today for your own work.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
- Your host as usual is Tobias Macey and today I’m interviewing Stephan Lukasczyk about Pynguin, the PYthoN General UnIt test geNerator.
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Pynguin is and the story behind it?
- What are the problems that Pynguin is designed to solve?
- What other projects are you drawing inspiration from?
- What are some of the use cases for automatic test generation?
- How is Pynguin implemented?
- What are the challenges that the dynamic nature of Python introduces?
- What are some of the packages and libraries that have been most helpful while building Pynguin?
- Can you talk through the workflow of using Pynguin to generate tests for a project?
- What are some of the limitations on what kinds of projects Pynguin can be used for?
- What are some design or implementation strategies in the code that you are generating tests for that will help make Pynguin’s job easier?
- Once a test suite has been created, what are the next steps?
- What are some of the initial assumptions or goals of the project that have been revised or challenged once you began implementing it?
- What are the most interesting, innovative, or unexpected ways that you have seen Pynguin used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pynguin?
- When is Pynguin the wrong choice?
- What do you have planned for the future of Pynguin?
Keep In Touch
Picks
- Tobias
- Stephan
- Cycling
- Take care of your health
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Pynguin
- University of Passau
- Passau, Germany
- Evosuite
- Hypothesis
- Astor
- Walrus Operator
- MyPy
- Pytest
- unittest
- Bytecode library
- Pytype
- Monkeytype
- Atheris from Google – coverage-guided fuzzing
- Blog series about “Python behind the scenes”: Ten thousand meters by Victor Skvortsov
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40-gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL.
Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch. Your host as usual is Tobias Macey, and today I'm interviewing Stephan Lukasczyk about Pynguin, the Python general unit test generator. So, Stephan, can you start by introducing yourself?
[00:01:40] Unknown:
Yeah. First of all, thanks for having me, Tobias. I'm Stephan. I'm a PhD student at the University of Passau. Passau is a small town about 200 kilometers east of Munich in Germany, near the border between Germany, Austria, and the Czech Republic. I did my studies there in computer science, a bachelor's and a master's degree, and I'm now pursuing a PhD. And the project we're talking about, Pynguin, is my main research project here.
[00:02:06] Unknown:
Do you remember how you first got introduced to Python?
[00:02:11] Unknown:
I thought about that, and I actually cannot really narrow it down. I played around with it before I started my studies, which was maybe around 2010 or something. But I really got introduced to it during my bachelor studies and used it quite a lot, although our programming experience was mainly a Java background, from what is taught in the study curriculum. But I used Python there a lot. And when I started my PhD, I focused a lot on Python and shifted my main focus into that area, into that universe.
[00:02:50] Unknown:
Now in terms of the Pynguin project, you mentioned that it's the focus of your current research in your PhD program. I'm wondering if you can give a bit of a description of what it is that you're building, some of the story behind how you came to this particular problem space, and some of the interesting areas of research that it's uncovering as you go through your program?
[00:03:08] Unknown:
One of the things to note here is that my supervisor, Professor Gordon Fraser, is an expert in testing software, and he developed a tool for Java unit test generation called EvoSuite. Some of you might have heard that name somewhere if you're interested in testing. So we were discussing a lot of opportunities, and almost immediately my focus and interest shifted to Python, because I thought this would be an interesting world: to look at a dynamically typed language and ask how one can apply the research that's actually there, be it static or dynamic analysis or testing or whatever, a lot of which is done for statically typed languages, mainly Java.
Which of those things can be applied to Python, or to a dynamically typed language in general? And what other ideas can one come up with? After spending quite some time investigating what's actually there, what people have done, and what could be done, we came up with the idea: well, let's see if we can automatically generate tests for Python. And that's how it actually started. And yeah, now I'm here, getting interviewed in a series of podcasts where so many prominent projects have appeared, which is a great honor for me. The project started about one and a half years ago, and it's still in development. I hope I can shed some light on the project and share some insights, and maybe some of you will want to try it out and give me some feedback.
[00:04:48] Unknown:
In terms of the project itself, you mentioned that the intent is to automatically generate unit tests for Python programs. I'm wondering if you can dig a bit more into the types of tests that it's able to create, some of the limitations on the kinds of code that it can generate tests for, the overall problem statement, and how you're framing the development approach: how you determine the areas of focus as you continue to iterate and, as you mentioned, being in a heavy development cycle, how you prioritize that work?
[00:05:20] Unknown:
First of all, we want it to be as general as possible, so in theory it should be applicable to all kinds of projects that are written in Python. In practice, we cannot hold that promise. One of the problems is that as soon as you don't have type information, which is an optional feature, one I highly encourage people to use, it becomes more and more difficult. The problem here is that you have basically two things in mind. One is that you have to create inputs for your functions and methods and your code, whatever is there, which is manageable when you know what types you have to put in. So if you know that you have to put in an int or a string, you can generate a particular object of that type. But if you don't have that information present, you can just randomly guess, or maybe make an educated guess, but it's basically a guess, and the space of choices is huge. The second thing is then to bring this together into a test case with the target of reaching high code coverage, because we all know that when we do testing, coverage is one of the metrics that is at least very interesting. I mean, you cannot reason about code you don't have any coverage of; you can only reason about the covered parts of the code. So I think this is one of the biggest problems: to combine calling the right methods with the right parameter values, and to develop and evolve the test cases in a way that achieves high coverage. Another aspect that's totally out of focus up to now is that you don't only want high coverage, but you also want to reason about the correctness of the values, because that's what testing is about. And for that, we need assertions that check the results or values that we get back from functions and so on. This is the next level of complexity, a second research area, basically: how should an oracle, how should such an assertion look, so that we can reason about the code and maybe find some bugs, for example?
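To make that difference concrete, here is a minimal sketch, not Pynguin's actual code, of why the type annotation matters: with one, a generator can pick a matching value directly; without one, it can only guess among every type it supports.

```python
import random
import string

def random_int() -> int:
    return random.randint(-1000, 1000)

def random_str() -> str:
    return "".join(random.choices(string.ascii_letters, k=random.randint(0, 10)))

# Hypothetical registry of the types a generator knows how to build.
GENERATORS = {int: random_int, str: random_str}

def generate_input(annotation=None):
    """Pick an input value for one parameter.

    With an annotation we can make an informed choice; without one we can
    only guess, and the guess gets worse with every supported type.
    """
    if annotation in GENERATORS:
        return GENERATORS[annotation]()
    return random.choice(list(GENERATORS.values()))()
```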
[00:07:52] Unknown:
In terms of the research that you're doing and the approach that you're taking with Pynguin, I'm curious to get your perspective on how it compares to projects such as Hypothesis, which does property-based testing, or systems that do contract-based validation in Python, the relative trade-offs of those different approaches, and where you're going with Pynguin as it relates to those other types of testing and validation exercises?
[00:08:18] Unknown:
So Hypothesis is actually a framework with a lot of possibilities, and it's quite close to what we achieve with Pynguin. With Hypothesis, the focus, as you said, is on property-based testing, where you usually have some property in mind, or specified by requirements or whatever, and you know what you want to put into a function and what you want to get out. For example, if you have a function that adds two values, and you put in the same value twice, then you have the property that the result should be twice the value. There, you know the underlying property, and that's what you specify explicitly. What Hypothesis then does in the background is some kind of test generation in a random way, because it generates input values to explore the space of possible values and to figure out whether there is some violation of the property.
Pynguin does something similar, but you don't have to specify anything. You basically just give it a module, and maybe the dependencies of the module, and it runs on that and tries to generate inputs for all the functions you have there, in order to achieve a maximum coverage level. So we don't focus on proving or disproving properties like Hypothesis does; we just want to cover all the code in general, which is something that, as far as I know, Hypothesis does not really do. Coverage is not what it explores in the first place. I mean, if it can cover a lot of code, that's good, but it's not the main target, I guess.
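For reference, the doubling example above, written as a Hypothesis property-based test, looks roughly like this (assuming the hypothesis package is installed):

```python
from hypothesis import given
from hypothesis import strategies as st

def add(a: int, b: int) -> int:
    return a + b

# The doubling property from the example: Hypothesis generates the inputs,
# but the property itself is written by hand.
@given(st.integers())
def test_adding_a_value_to_itself_doubles_it(x: int) -> None:
    assert add(x, x) == 2 * x
```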
[00:10:01] Unknown:
Another interesting area of discussion in terms of automatically generating tests is the different styles of testing that people aim for. Some people will go for the given/when/then style, or different setups for putting the code into the expected state that you're trying to validate. I'm curious what your thoughts are on where Pynguin falls as far as the style of tests that it generates, and on being able to parameterize the setup and teardown approaches in the tests that you're creating.
[00:10:40] Unknown:
So currently, we stop quite early in this process. This was not yet our focus, but it's definitely an interesting one. What we currently do is basically the setup and the execution; the assertion part, the checking, is still under heavy construction. We are currently working on being able to generate somehow reasonable assertions and checks as well. So basically, what we have now is the setup and the calling of the functions, but not the checks. And parametrization, as you mentioned, is a very interesting topic, as is setup and teardown code, which is definitely worth future research and definitely on my agenda. I hope I can get there sometime, after solving all the other problems on the way. One of the problems that we are facing, and I also noted this before, is the type information that you have or that you don't have.
This is by far the largest obstacle to actually achieving coverage, because you don't know what to put in. But, yeah, parametrization, or setup and teardown code, all the things that you do when you write your tests manually, those are definitely things that can be explored and that at some point should be incorporated into the framework, of course.
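As an illustration of the shape being described, a hand-written pytest test with the setup/execute/check cycle, plus parametrization and a float-safe assertion, might look like this sketch:

```python
import pytest

def average(values):
    """Toy function under test."""
    return sum(values) / len(values)

# One parametrized test runs several setup/execute/check cycles.
@pytest.mark.parametrize("values, expected", [
    ([1.0, 2.0], 1.5),
    ([0.1, 0.2, 0.3], 0.2),
])
def test_average(values, expected):
    result = average(values)                   # execute
    assert result == pytest.approx(expected)   # check, float-safe
```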
[00:12:06] Unknown:
Another interesting aspect is that you seem to have settled on pytest as the framework that you're targeting for the generated tests. I'm curious what drove that decision, particularly given the fact that unittest itself is modeled after JUnit, and your adviser built EvoSuite, which targets the Java runtime. Were there any shortcuts that you would have been able to take if you had gone with unittest?
[00:12:41] Unknown:
We actually had an exporter for unittest. Inside the framework, there's no assumption about how tests are exported; it builds its internal data structures, and in the end we create an abstract syntax tree out of them, and from the AST we write the code to files. We just recently deleted the unittest exporter, for maintenance reasons, actually. When we started the project, we had some data; it was, I think, from the Python developers survey by Stack Overflow, or was it JetBrains? I'm not sure.
Nevertheless, it said that most people are using the pytest framework, are using that style, and pytest is also able to execute standard unittest-style tests. In the end, I can imagine that we will re-add the unittest exporter. The reason we deleted it was that we had issues with assertions and floats, where you cannot necessarily check for equality but have to check whether the values are close enough. That, together with other issues with the assertions, was the main cause for deleting the code recently. But we might add it back if there's any request for it.
[00:14:07] Unknown:
Digging more into Pynguin itself, can you talk through the approach that you've taken for building the project, some of the design and architectural aspects of it, and some of the challenges that you've faced because Python is so dynamic: being able to build Pynguin to generate tests for code that might have monkey patching or unknown types, and this overall problem space that you're trying to cover?
[00:14:29] Unknown:
Yeah. One thing you will notice when you look at the code is that we have a strong training background in Java, and many design decisions are heavily inspired by EvoSuite. That worked at some points, but it also caused some struggle at others. The whole internal representation was based on EvoSuite's, and we saw that this was maybe not the best design decision. We had files containing single classes, which, if you're coming from Java, is the natural way, because you have one class per file, but it's maybe not the best way to do it in Python. What I learned the hard way there is that you get into import hell, where you have circular imports if you are not careful enough, and you have to deal with all the strange details of how the import mechanism actually works when you write Python the way you would write Java.
So that's something that I learned the hard way, and you will see it in a lot of places in the code. Only recently I started to clean up whenever I touch something: turning classes that just contain static methods into top-level functions, and all the things that are basically standard if you come from a strong Python background, but that you don't necessarily do in the first place if you come from the Java world, and that you have to learn over time and adapt to. But we are improving on that, and this is something that's currently under heavy change. So every version that we release might break something if you rely on internals.
So if there's anybody out there who is developing against some of the APIs that are there, please drop me a note. I mean, I don't know of too many people who have played around with the framework, so I'm in the nice situation that I can just break things without asking people. But if there are people relying on the APIs, please tell me, because then we need to carefully discuss what to do, and not just say: look, let's refactor that, let's break everything, let's do it in a nicer way, and not care what potential users might want to have.
[00:17:01] Unknown:
As far as the actual development of Pynguin, I'm curious what you have found to be particularly useful from the broader ecosystem, whether third-party libraries or built-in modules from the standard library, and how you've gone about selecting which tools to use to build out this test generation framework.
[00:17:23] Unknown:
From the library side, I have to mention basically two libraries. One might become obsolete once we move to Python 3.9, and that's the astor library for converting an AST back to source code, which is the feature we rely on when we write out the tests in the end: we transfer our internal representation into Python AST statements and then write them out as source code. The second library, and maybe the one that saved us the most time, is the bytecode library. I don't know whether you know it; it's basically a wrapper library for dealing with the bytecode that's produced and interpreted by the Python interpreter.
It provides a nice API to, for example, add statements. The bytecode itself is immutable, but the library lets you convert the bytecode to different representations, add some statements, and bring it back into a form that you can afterwards execute. We need this feature a lot. When we started the project, we were thinking about how to measure coverage, but not only coverage, because we actually need more information. Our current focus is branch coverage, covering all the branches in the program, and we want to know not only which branches were covered, but also how close we came to the ones we missed. That's basically a concept called branch distance.
For example, if you have a statement like `if x > 42` and your current x was 22, then you know your branch distance is 20: you are basically an interval of 20 away from covering the then branch. To get this information, we rely heavily on bytecode instrumentation, after spending a lot of time on the question of whether we could instead use the tracing mechanism from the standard library, which has some serious drawbacks in my opinion; I can come to that later. The bytecode library saved us a lot of time because we could instrument the bytecode in a similar way to how one would do it in Java, where such instrumentation libraries exist as well. And maybe a note on the tracing I just mentioned: you might know that tracing only supports one tracing function at a time. When you want more of them for different aspects, you basically need to stack them together somehow, calling the inner function from the outer one, which needs some wrapping, or the functions need to behave in certain ways or be added and removed in certain ways, which is quite complicated to deal with, especially if you don't know what people are doing. And when you, for example, use coverage.py to measure coverage, it registers its own tracer.
That could just remove the tracing that you are using and disable your measurements entirely, just because you, for example, run your own unit tests to measure your coverage, which is a bit of a pity, actually. It would be great to be able to register multiple tracing functions that are called in parallel, but that's currently not supported. So we decided to do it with bytecode instrumentation, and the bytecode library saved me a lot of time there and enabled us to do this without going back to manipulating single bytes in a byte stream or something.
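A minimal sketch of the branch-distance idea (Pynguin derives this from instrumented bytecode rather than from a helper function like this one):

```python
def branch_distance_true(x: int, threshold: int = 42) -> int:
    """Distance of x from making `x > threshold` true.

    0 means the then branch was taken; otherwise the value tells the search
    how far it missed, a gradient instead of a plain hit/miss signal.
    (Real definitions usually add a small constant k so that
    x == threshold still counts as a miss.)
    """
    return 0 if x > threshold else threshold - x

assert branch_distance_true(50) == 0    # then branch covered
assert branch_distance_true(22) == 20   # the "20 away" case from above
```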
[00:21:03] Unknown:
I wonder if there have been any PEPs or efforts in the past to modify the tracing interface to add support for multiple tracers, or if that's something that you've looked into. But that definitely sounds like a shortcoming.
[00:21:14] Unknown:
I actually have not investigated whether there are any proposals on that, but I would strongly support something like it, even beyond Pynguin. Other projects might use it as well, especially since almost everybody is using coverage.py to measure coverage. They might be happy to be able to add further tracers, for example for debugging or some other introspection where you want to rely on the tracing functionality.
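In the absence of such support, stacking tracers by hand looks roughly like the following sketch (the real sys.settrace protocol has more corner cases, such as local trace functions replacing themselves, which this ignores):

```python
import sys

def make_call_logger(tag):
    """A tiny tracer following the sys.settrace protocol: it logs call
    events and asks for no per-line tracing."""
    def tracer(frame, event, arg):
        if event == "call":
            print(tag, frame.f_code.co_name)
        return None
    return tracer

def combine_tracers(*tracers):
    """Fan the single global trace hook out to several trace functions."""
    def global_trace(frame, event, arg):
        # Ask every tracer whether it wants a local tracer for this frame.
        local = [t(frame, event, arg) for t in tracers]
        local = [lt for lt in local if lt is not None]
        if not local:
            return None
        def local_trace(frame, event, arg):
            for lt in local:
                lt(frame, event, arg)
            return local_trace
        return local_trace
    return global_trace

def work():
    return sum(range(3))

sys.settrace(combine_tracers(make_call_logger("A:"), make_call_logger("B:")))
work()  # both tracers see the call event
sys.settrace(None)
```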
[00:21:50] Unknown:
The fact that you're relying so closely on the bytecode implementation, I'm assuming that that means that currently you're only able to operate when running within CPython. And I'm wondering if you've explored the possibility of working with the alternative runtimes that are out there.
[00:22:05] Unknown:
I thought about that, yes. So currently, as you said, we are relying on CPython. And since the bytecode is basically implementation specific, we cannot even tell for sure whether everything works in Python 3.9 or 3.10. I don't know what the current alpha version is; I think they might have just released the first RC. 3.10 I haven't tried; it already breaks when we install dependencies, because some of our dependencies don't yet support 3.10, so we have not even tried it. We have 3.9 running in our CI pipeline, and from our tests it looks good. But our unit tests cannot prove correctness, of course, so we don't really know. Whenever I find some time, I will do some benchmarking and comparison, maybe of 3.8 versus 3.9, to have more confidence about whether this works or whether we need to change some things. And speaking of other runtimes, that would be nice. I would particularly be interested in PyPy, the interpreter written in Python, doing just-in-time compilation and all these fancy things that are very interesting.
But this has two drawbacks for me right now. One is that we're relying on this bytecode instrumentation, which is something we would need to solve for supporting PyPy. The other is that our code these days relies on features that were introduced in 3.8, mainly the walrus operator, to name maybe the most prominent one. And as far as I know, PyPy supports only up to 3.7. I have seen that they might have some development version that supports 3.8 and that they are working on it now, but I have not tried whether this would even work at this early stage of their support for Python 3.8. But it's definitely interesting, and maybe in the future we can support it, when we can come up with a different way of instrumenting the code under execution.
[00:24:19] Unknown:
And as far as the actual use cases for something like Pynguin: most developers are probably familiar with writing their own unit tests, figuring out how to set up a function to be in the state they want to validate, and then building in the validation functions. I'm wondering what sorts of environments, use cases, or code bases are out there where something like Pynguin would be beneficial. Is it something where you would bootstrap a set of tests and then manually tune them? Or is the intent to reach a point where you can potentially do away with manual testing and just use something like Pynguin to provide the most important assertions and get at least a baseline of coverage?
[00:25:03] Unknown:
So I guess one could do both. The first is maybe the easier, or also the more reasonable, way to start: having some basic tests. I'm thinking of tests for data classes, or classes that don't rely on web APIs or databases or anything like that. For those, Pynguin might be a good start, providing you with at least a couple of tests that you don't have to write entirely by hand. When you generate tests for more mature projects, more mature code bases where you also already have tests, you could do that too; in the end, especially the assertions are something you need to inspect manually, at least in a code review or something.
One possible scenario that comes to my mind is that one could run Pynguin in a nightly build and maybe let it create merge requests automatically with new tests. Then you inspect those manually, and either you accept them and add them, or you discard them if they don't fit your needs. But, yeah, it depends largely on the project that you're working on and what you want to achieve in terms of testing. And there are a lot of possibilities for further testing. We were discussing Hypothesis before, which is something I'm currently digging deeper into, because I think their approach to dealing with input generation is very interesting as well.
So combining a lot of tools here might be a good choice.
[00:26:38] Unknown:
In terms of Pynguin itself, are you able to take advantage of an existing test suite to help inform where you might want to focus on generating new tests? Since you mentioned that you're also focusing on coverage, can you use the coverage information from your existing tests to identify code paths to focus on when generating new ones?
[00:27:06] Unknown:
Yeah, that's a very recently added feature, basically, and a feature that could also be improved, of course. What you need to know here is how Pynguin works internally. There are basically two styles of using it. One utilizes random test generation, where you randomly select the next method that you want to call and try to fulfill its parameter values. The second approach uses evolutionary algorithms, where, like in Darwin's theory of evolution, you evolve the test cases using crossover and mutation: mutating the test cases by adding statements, removing statements, replacing or modifying them, and crossing over two test cases, meaning that you take the beginning part of the one and the end of the second test and basically flip those.
Using that, especially the evolutionary style, we now have the possibility to load existing test cases, as long as they are supported by our internal representation. We don't support all the possible syntax constructs that you could use in a test case, of course. I mean, some people use loops in their test cases, which is maybe considered bad practice in general, but you can do it; such tests will not be loaded, and those constructs will not be used by Pynguin. But in theory, you can load existing test cases and let the evolution start from them, using them as a basis and letting the evolutionary algorithm try to achieve more coverage by mutating, by crossing over, by changing or adding statements, whatever else it wants to do in order to yield higher coverage results.
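A toy sketch of the crossover and mutation operators described above, treating a test case as a plain list of statement strings (Pynguin's real representation is of course much richer):

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover: head of one test case, tail of the other."""
    cut_a = random.randint(0, len(parent_a))
    cut_b = random.randint(0, len(parent_b))
    return parent_a[:cut_a] + parent_b[cut_b:], parent_b[:cut_b] + parent_a[cut_a:]

def mutate(test_case, statement_pool):
    """Add, remove, or replace one statement at random."""
    test_case = list(test_case)
    roll = random.random()
    if roll < 1 / 3 or not test_case:
        test_case.insert(random.randint(0, len(test_case)),
                         random.choice(statement_pool))
    elif roll < 2 / 3:
        test_case.pop(random.randrange(len(test_case)))
    else:
        test_case[random.randrange(len(test_case))] = random.choice(statement_pool)
    return test_case

# Hypothetical statements, just to show the mechanics.
parent_a = ["x = 0", "obj = Stack(x)", "obj.push(x)"]
parent_b = ["y = 'a'", "obj = Stack(0)", "obj.pop()"]
print(crossover(parent_a, parent_b))
```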
[00:29:01] Unknown:
And as far as actually getting started with using Pynguin on an existing project, I'm wondering if you can talk through the process of setting it up, determining useful initial target areas to start generating tests for, and whether there are tasks you'd want to do ahead of time, such as adding more type information, or whether you need a particular architectural style, say class inheritance versus composition, or a certain functional decomposition of the program, to make it easier for Pynguin to do its job?
[00:29:35] Unknown:
So type information is definitely a critical point. As far as you can, add types, add all the type information that you can, and use type-checking tools like mypy; just use them on your project and add type annotations. It not only gives you a lot more safety, because the type checker rules out certain bugs, but your editor also gives you something back, because you get better code completion, for example. And having this information is somehow crucial. I mean, Pynguin works without it, but less well than when it has the type information, because it then needs to guess all the input types. If you have them, it only has to deal with the difficulty of generating objects of certain types. So to get back to your question: when you want to get started, I would start with a small module that has only a couple of functions, or methods in a class.
What you basically need is a virtual environment where you install Pynguin, along with all the dependencies that your project has. So one approach is adding Pynguin to the virtualenv that you have for your project development, or you do it vice versa: create a new environment for Pynguin and add your project's dependencies and whatever else you want to have there. And then you can basically just invoke the tool. One problem that can occur very quickly, depending on what your code does, is unexpected side effects. For example, and this is a real-world example, there was an issue about this on the GitHub repository where somebody asked about it.
Pynguin actually executes your code, so whatever your code does can happen. The particular example we had was that after running Pynguin on their project, the person had a lot of empty folders and files in their file system with almost arbitrary names. In the end, this was because the code they ran Pynguin on used the file API, which created those folders, and from parameters and random input strings it produced those random names. So you want to be careful, because this can of course cause serious harm. I mean, if your module deletes your whole file tree and you run it as the root user, you'd better have a backup, because it will basically delete everything from your hard disk.
So what you should do is isolate it as well as possible. What I do these days is run Pynguin in a Docker container. We already provide a Dockerfile that one can start building on; you can also just install the version from the package index inside your own Docker container, together with all the other dependencies you want. Then the worst thing that can happen is that the execution breaks the container, and when you restart it, you have a clean image again. There can't be too much harm to your hard disk, unless you hit some bug in Docker, maybe, but then they might be curious to hear about it.
So isolate it whenever you can, because it can cause serious harm whenever your code can cause serious harm.
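As an illustration, the kind of small, fully annotated, side-effect-free module that makes a good first target might look like this (the command in the trailing comment is indicative only; flag spellings vary between Pynguin releases, so check `pynguin --help` for your version):

```python
# stack_ops.py -- small, fully annotated, no I/O: a friendly first target.
from typing import List, Tuple

def push(stack: List[int], value: int) -> List[int]:
    """Return a new stack with value on top."""
    return stack + [value]

def pop(stack: List[int]) -> Tuple[int, List[int]]:
    """Return the top element and the remaining stack."""
    if not stack:
        raise IndexError("pop from an empty stack")
    return stack[-1], stack[:-1]

# Inside an isolated container or virtualenv, an invocation looks roughly like:
#   pynguin --project-path . --module-name stack_ops --output-path /tmp/tests
```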
[00:33:18] Unknown:
And so as far as the execution of Pynguin: you mentioned starting with a small module that has a limited scope, but for somebody who wants general coverage of their project, is it possible to, say, start at the root module and have Pynguin walk through the project hierarchy, or do you have to execute it one module at a time? What are the targeting capabilities for running Pynguin, generating the tests, and determining where the tests will be written out to?
[00:33:50] Unknown:
So currently, it has to be run on each and every module that you want to generate tests for, and it will then generate tests, or try to, for the functions and whatever else is inside that module. That can of course involve dependencies, and the tests generated for one module can also cover parts of other modules if they are called. But you currently have to invoke it on each and every module that you want tests for. By doing that, you also get single test files for single modules.
[00:34:29] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial and make your life a lot easier.
As far as the environment setup: if the code under test has dependencies on, say, a database or some mocked API endpoints, is Pynguin able to shell out to a setup suite, or call out to your pytest setup, to bring up those systems? Or would you need to bring the system into a state where it's ready to be tested and then call Pynguin on the module with the expectation that everything is already there?
[00:35:39] Unknown:
Currently, definitely the latter. Things like databases are something we've currently excluded, and we've also excluded several other things from our focus, like APIs that you call remotely, such as web services, because you need a lot of domain knowledge to actually get your program into a reasonable state. Imagine, for example, you need JSON blobs. Generating a valid JSON blob is one problem, and not a trivial one. But generating a JSON blob that serves a call to a particular API, or that has a particular meaning to the code under test, is basically almost impossible without domain knowledge.
So we are not currently focusing on those things. I mean, I'm happy to accept contributions if somebody wants to work on that, because I think it's a great topic, and for using Pynguin in practice it's definitely worth investing in, but I might not be able to work on it during my research or my PhD. The second thing that we're currently excluding, not out of general necessity, but because it's easier for us, is everything that relies on native code modules. If you have, most prominently maybe, NumPy as a dependency, where large parts of the library are implemented in C for performance reasons, we are currently not focusing on those libraries. In theory it should work, but we don't focus on it, just because we don't want to deal with compiling the source code, or with prebuilt binaries that might or might not work depending on your system.
This is also definitely an interesting area. Basically, the whole data analysis world, coming up with inputs for a data analysis pipeline, would be very interesting. And as far as I know, Hypothesis, for example, has strategies to generate at least things like data frames in pandas. That would definitely be interesting, but it's not my main focus these days. But, yeah, if you have ideas on that, I'm happy to talk to you and to take those contributions, because I think it would only be good for Pynguin.
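For reference, the Hypothesis pandas strategies mentioned here look roughly like the following (requires Hypothesis's pandas extra, `pip install hypothesis[pandas]`; the column names are made up for the example):

```python
from hypothesis import given
from hypothesis.extra.pandas import column, data_frames

# Hypothesis generates whole pandas DataFrames as test inputs.
@given(data_frames([
    column("price", dtype=float),
    column("qty", dtype=int),
]))
def test_revenue_has_one_value_per_row(df):
    revenue = df["price"] * df["qty"]
    assert len(revenue) == len(df)
```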
[00:38:09] Unknown:
As far as the overall goals of the project and the assumptions that you had going into it, I'm wondering if you can talk through the general destination that you're aiming for, and how your goals and ideas of how Pynguin would work, or what it would be able to do, have changed or evolved as you've been building it out and exploring new areas?
[00:38:33] Unknown:
When I started, I was this guy coming from the Java world, where everything is nice and typed, and I thought: well, type annotations are not that new a feature, so maybe a lot of people have already adopted them in their projects. And the first lesson I learned was that actually finding reasonably sized projects that have type information available is a nontrivial task. There are quite a bunch of toy projects out there, I call them toy projects, projects with only a couple of functions and maybe a few hundred lines of code, where people made this effort. But for larger projects that are also widely used, it's very difficult to find type information. So this is one of the assumptions I had when we started that failed, and it also shifted what I'm working on and focusing on. The idea was basically to build a framework that can generate unit tests for almost arbitrary Python code. We figured out that there are some problems, like the missing type information, that we need to solve, or at least find partial solutions to, to allow Pynguin to reasonably generate tests. So my focus shifted more toward the type inference direction, for example: what can actually be done, and what is actually done? These days, a lot of people, especially in the research community, build tools that use some kind of machine learning or deep learning approach to predict type information, which is something we might be able to utilize if we can query such a tool as a black box and say: give me a suggestion for an input type.
And other things from the type inference world as well. So this is one of the main problems we are currently facing. And then there's still this Java-inspired design, where we had some pitfalls and learned that you cannot do things the same way in Python. An example that comes to mind is fields in classes, which are added dynamically. You can do best-effort static analysis to detect them, like searching through the constructors and whatever else one can come up with, but you will never have a full and complete solution that detects all of them. And other dynamic features, like removing fields dynamically at run time, are something that, at least as far as I know, there is no easy way to deal with. These are all things that are important, I guess, and that are used in real-world software that we have seen out there, and for which we don't have a real solution yet. So this is definitely something where we could improve, and that we hopefully will improve over time.
But at this point, Pynguin may be at too early a stage to deal with all of it, and it needs a lot of work to support more of the basic features of the language. For example, it was only recently that it gained support for generating inputs for simple collection types like lists and sets and dictionaries, which are heavily used in practice, of course, but which are maybe not that trivial to deal with when you automatically generate inputs.
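The dynamic-fields problem described above is easy to reproduce; a purely static scan of the class body cannot reliably see any of the attributes in this small example:

```python
class Config:
    def __init__(self, debug: bool) -> None:
        self.debug = debug
        if debug:
            # This field exists only on some instances.
            self.log_level = "DEBUG"

cfg = Config(debug=True)
cfg.retries = 3     # field attached from outside the class
del cfg.log_level   # field removed again at run time
```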
[00:42:08] Unknown:
Yeah. Especially given the heterogeneity of the potential values within those collections, because if you have a dict that is of type Dict[str, Any], you know, anything's game.
[00:42:24] Unknown:
That's a nice one. Any is the best type that you can have, of course; irony mode off. If you have to write Any as an annotation, you usually don't need the annotation at all, because it's basically the same. Any is a nice placeholder if you don't want to figure out what the actual type is, or if you can't for some reason, but it's something you want to avoid. As a general strategy, if you're adding type information to your project, try to avoid Any whenever you can, because it won't help you, and it will just bring you further problems down the path. You will easily reach the point where you have to add Any as a type annotation everywhere, because you did it on some method down in a corner of your project, and then everything else can just be Any, because otherwise you cannot satisfy the type checker anymore.
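A small illustration of the point: the first annotation tells the type checker (and a test generator) nothing, while the second pins down exactly what to generate and check.

```python
from typing import Any, Dict

def tally_any(counts: Dict[str, Any], key: Any) -> Any:
    # Any-typed: the checker accepts almost anything here, and a test
    # generator has to guess what key and the values should be.
    counts[key] = counts.get(key, 0) + 1
    return counts

def tally(counts: Dict[str, int], key: str) -> Dict[str, int]:
    # Concrete types: tools know to generate a str key and int counts.
    counts[key] = counts.get(key, 0) + 1
    return counts
```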
[00:43:20] Unknown:
And in terms of the type inference and type generation approach, I'm wondering if you've looked at tools such as pytype, because I know that it has an execution mode where it can statically analyze a code base and generate at least a best-guess set of type stubs for the project.
[00:43:38] Unknown:
Yeah, I'm also investigating that, and we also played around with things like MonkeyType, which is a framework that basically uses the tracing mechanism: it traces your code while it's executed, meaning it collects all the type information that it sees, like parameter and return types of functions. It can also generate stubs, and I guess it can also apply the annotations directly to the code, so it basically extracts this information into files. Around ideas like that, there are a lot of interesting approaches and a lot of tools that implement them.
The thing for me is to figure out what is promising, what might also work in our context and within our tool, and then adapt it in a way that we can use without too much overhead. The experiments I did by just calling MonkeyType as a black box revealed that this is way too heavy: there's too much overhead in the MonkeyType framework around it, and it slows down test generation too much. But the idea itself is very nice, and it maybe needs to be adapted to also extract information during the execution of a test, for further methods, for example.
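A decorator-based sketch of the run-time type collection idea (MonkeyType itself hooks the tracing mechanism instead of requiring decoration, so this shows the concept, not its API):

```python
import functools
from collections import defaultdict

observed_types = defaultdict(set)

def record_types(func):
    """Record the concrete argument and return types seen at run time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        observed_types[func.__qualname__].add(
            (tuple(type(a).__name__ for a in args), type(result).__name__)
        )
        return result
    return wrapper

@record_types
def scale(x, factor):
    return x * factor

scale(2, 3.0)
print(dict(observed_types))  # {'scale': {(('int', 'float'), 'float')}}
```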
[00:45:04] Unknown:
And in terms of the development of Pynguin, I'm curious how much you're able to use Pynguin to generate the tests for Pynguin itself, and what that cycle looks like.
[00:45:17] Unknown:
That cycle looks like failing hard with exceptions. The problem you immediately face is that the tool has the same names and imports in the same namespace as the code you want to generate tests for, which is prone to failure, I guess. An idea could be to rename all the modules dynamically when you want to do this. But, yeah, I tried it once, and the easy approach just failed, and I did not investigate any further. It would of course be nice to have, like a compiler being able to compile its own code, a test generation tool that can generate tests for itself. But we are far away from that, and there are a lot of challenges to be solved besides renaming modules.
[00:46:06] Unknown:
And in terms of the actual usage of Pynguin, either in your own research or in the community, I'm wondering what you have seen as some of the most interesting, innovative, or unexpected ways that it's been applied.
[00:46:19] Unknown:
I have not seen too much usage in practice by other projects or other people yet. I've had some reports; usually they come as bug reports or questions about things that are unclear. So I have seen it being used by some people who are creating libraries. One person was dealing with a database API wrapper, which is like what I said before: creating tests for those kinds of APIs is especially difficult, since you need certain protocols of calling methods in a particular order and so on, which needs to be added as information somehow.
But I would love to see more people trying it out on their projects and seeing what happens, and especially telling me what happens and what they experienced.
[00:47:13] Unknown:
As far as your own work on Pynguin, I'm wondering what have been some of the most interesting, unexpected, or challenging lessons that you've learned in the process?
[00:47:21] Unknown:
I learn new things almost daily. I learned the hard way about the tracing mechanism, that it would be great to have more than one tracing function, as we discussed before. The other thing that comes to mind is the fields in classes, that they are really only added dynamically and that you cannot tell for sure with static analysis or some static checks. Basically, each and every thing you can think of when it comes to the dynamic nature of the language causes a headache when you want to generate tests for it, because it's something you maybe don't have in mind in the first place when you use the language in a very basic way, where you don't rely on these features, or don't use them because you might not even know about them. There are actually too many things to name them all; I can just name these two directly off the top of my head.
[00:48:25] Unknown:
As far as people who are interested in exploring Pynguin, or who want some means of building tests for their application: what are the cases where Pynguin is the wrong choice and they'd be better off building the tests manually?
[00:48:39] Unknown:
Well, the thing is, if you have complex APIs that follow certain protocols, like database communication, or that need very particular inputs, like communicating with a web service or whatever you have, then Pynguin might be the wrong choice. It might also be the wrong choice if you are just hunting for bugs. In that case, you might be better served with a fuzzing tool, where you basically throw random input at your program and see what happens. I only recently played around a bit with the Atheris tool from Google, which they released to the public, and which is a nice fuzzing approach; there are others as well. For bug hunting, using something like that is maybe easier, and also faster, because Pynguin runs this heavyweight test generation process that's quite complex. If you just throw random inputs at your application and see whether it crashes or not, that can be achieved in a much cheaper way than what we do.
[00:49:44] Unknown:
And as you continue to work on Pynguin and continue through your PhD program, and also once you've completed it, what do you have planned for the future of the project, either in terms of new features and capabilities or research directions?
[00:49:59] Unknown:
Two things we already mentioned. One is coming up with proper assertions, because that's an essential thing if you want reasonable and usable generated tests, and you don't want to bother the user with coming up with assertions. The second is working more on the type inference side, because this information is basically crucial: having more information makes the test generation work better, and it also helps in supporting more features of the language. Those are definitely things I want to explore and work on in the next months, maybe years. And of course, there are always things that might be domain specific but interesting to a broad community. Like I mentioned before, communicating with web services using JSON blobs, where you are able to specify, for example, a grammar for how your JSON blobs have to look, or whatever else can describe this in an automatically explorable way, so that we can more easily generate valid inputs for such APIs.
These are definitely things worth investigating. And I also only recently stumbled again over a bug that I had built myself during data analysis. Being able to actually generate tests for a data analysis pipeline is something that, in my opinion, is definitely worth investigating. It's definitely not easy, but many people might profit from it, because so many people are doing data analysis in some sense these days. They all build their scripts and pipelines, maybe use some Jupyter notebooks or whatever, but mostly it's just: we write it, we look at the results, and as long as the results look good, we hope it's correct. Nobody actually tests those things, or almost nobody does.
So this is maybe also a domain that would be interesting for future research.
[00:52:10] Unknown:
And are there any other aspects of the Pynguin tool, or the overall space of unit test generation, or testing in general, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:52:27] Unknown:
There are many things in the context of test generation, like mocking, which is definitely a thing, and definitely a nontrivial one. I always struggle with Python's mocking library in the unittest API, getting the names of the mocked functions and classes right. So this is something that, in my opinion, could at some point be automated, to not bother the programmer anymore, and it's definitely worth future work. And if you want to do testing for your project, a good thing is to go out and read a lot of blog posts, Twitter feeds, and podcasts like yours that explore tools and methodology and give you hints on what you could do. And one thing I would suggest to everybody is to use as much tooling as possible to support you, especially since the language is so dynamic and allows you to do so many dirty tricks that are prone to introducing bugs.
A lot of them you can prevent by using type checking and linting and whatever else.
[00:53:37] Unknown:
Alright. Well, for anybody who wants to get in touch with you, follow along with the work that you're doing, or contribute to Pynguin, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose the Concourse framework for CI and CD. I just recently started using it at work and have been pretty impressed with the overall design of the system and the capabilities that it comes with. If you're looking for a CI pipeline, or a framework for building general, arbitrary pipelines where dependency inputs and outputs trigger different downstream tasks, it's definitely worth taking a look at. And with that, I'll pass it to you, Stephan. Do you have any picks this week?
[00:54:21] Unknown:
Yeah. First of all, because you asked how to keep in touch: there's the GitHub repository, and that's maybe the easiest way of sending feedback or bug reports or whatever you have. You can also find me on Twitter; feel free to drop me a note or a message there. And if you search for me, you will of course also find my email at the university, and I'm happy to take emails there too. But I guess the easiest, if it's related to Pynguin, how you can use it, or a bug you've found, is to use the issue tracker of the GitHub repository, because then not only me but also others may be able to help you or figure out what's going on, and you might have a shorter and faster feedback loop than contacting me directly.
Some picks from me. Besides all of this, with the COVID pandemic going on in many countries, in some it's getting better, but in others it's getting worse, do something for your own health. I recently experienced this: I'm going cycling regularly again, my body is in better shape, and I'm feeling better right now. Eat healthily, get enough sleep, and, while you have to respect the pandemic rules to not meet people or to only meet a few, stay in contact with your peers, with your family, with your friends, because otherwise you will become lonely very fast, and being alone is especially difficult in this situation.
So talk to your peers, meet them if possible, or at least do some phone calls or video calls; use whatever technology is available if you can't meet in person. And all of you, stay healthy and safe, and let's hope that we can overcome this pandemic in the near future, to be able to meet in person again, hopefully also at events, and talk personally, which is something I'm missing. Yeah. Definitely.
[00:56:23] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on Pynguin. It's definitely a very interesting project and an interesting area of research, and I look forward to seeing where you take it, and to being able to use it to simplify the work of writing tests for my own code. So thank you for all the time and effort you've put into that, and I hope you enjoy the rest of your day. Yeah. Thank you for the nice invitation, and I hope you also enjoy the rest of your day and the weekend that's coming up.
[00:56:51] Unknown:
And I hope we can meet in person somewhere down the road in the near future. Absolutely.
[00:56:59] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. Visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Stephan Lukasczyk
Overview of the Pynguin Project
Challenges in Test Generation for Python
Comparison with Hypothesis
Styles of Testing and Pynguin's Approach
Building and Designing Pynguin
Useful Libraries and Tools
Compatibility with Alternative Runtimes
Use Cases for Pynguin
Getting Started with Pynguin
Handling Dependencies and Isolating Tests
Goals and Evolution of Pynguin
Type Inference and Generation
Using Pynguin to Test Itself
Community and Practical Applications
Future Plans for Pynguin
Additional Aspects of Unit Test Generation
Contact Information and Closing Remarks