Summary
A large portion of the software industry has standardized on Git as the version control system of choice. But have you thought about all of the information that you are generating with your branches, commits, and code changes? Davide Spadini created the PyDriller framework to simplify the work of mining software repositories to perform research on the technical and social aspects of software engineering. In this episode he shares some of the insights that you can gain by exploring the history of your code, the complexities of building a framework to interact with Git, and some of the interesting ways that PyDriller can be used to inform your own development practices.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host as usual is Tobias Macey and today I’m interviewing Davide Spadini about PyDriller, a framework for mining software repositories
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what PyDriller is and how the project got started?
- How is PyDriller different from other Git frameworks?
- What kinds of information can you discover by mining a software repository?
- Where and how might the collected information be used?
- What are the limitations of the capabilities offered by Git for investigating the repository?
- What are the additional metrics that you are able to extract using PyDriller?
- Can you describe how PyDriller itself is implemented?
- How has the project evolved since you first began working on it?
- I noticed that for testing PyDriller you crafted a set of repositories to serve as test cases. What has been the most complex or challenging aspect of writing meaningful tests to ensure a reasonable coverage of this problem domain?
- What would be required to add support for other version control systems?
- How have you used PyDriller in your own research?
- What are some of the most interesting, unexpected, or innovative ways that you have seen PyDriller used?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with PyDriller?
- What do you have planned for the future of PyDriller?
Keep In Touch
- Website
- ishepard on GitHub
- @DavideSpadini on Twitter
Picks
- Tobias
- Davide
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- PyDriller
- Delft
- Git
- GitPython
- PyGit2
- RepoDriller
- Mining Software Repositories Conference
- Lizard
- Hadoop
- Mercurial
- Subversion
- CVS
- Neo4J
- GraphRepo
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try out a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode today. That's L-I-N-O-D-E, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For more opportunities to stay up to date, gain new skills, and learn from your peers, there are a growing number of virtual events that you can attend from the comfort and safety of your own home. Go to pythonpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today. Your host as usual is Tobias Macey, and today I'm interviewing Davide Spadini about PyDriller, a framework for mining software repositories.
[00:01:23] Unknown:
So Davide, can you start by introducing yourself? Yes. Thank you very much. Thank you for having me. I'm Davide. I'm a PhD student at Delft University of Technology in the Netherlands, and a part-time researcher at SIG, the Software Improvement Group, a company in Amsterdam. I actually just finished my PhD research, like, two months ago, and I'm in the thesis writing phase. And in the last three months, I've been working as a software engineering intern at Facebook. And, yeah, as you said, I'm the author of PyDriller. And do you remember how you first got introduced to Python? Oh, yeah. That's a long time ago. I think I was first introduced to Python in high school.
I did my high school in software engineering slash computer science. They taught me first Pascal and then Python, and then we moved on to Java and C. But I really started developing in Python, I think, when I was in my bachelor's, so around ten years ago. And as we mentioned, you created the PyDriller
[00:02:25] Unknown:
framework as a means of mining information from software repositories and version control systems that house them. So I'm wondering if you can give a bit more of a description about what the PyDriller project is and how it got started.
[00:02:38] Unknown:
So, well, as you said, it really is a Python framework for Git. We all know Git is very famous, very popular, very powerful. And especially in the research community, what we do with Git is often go over the history of a project, the history of, I don't know, developers and how they write files, etcetera. And to do so, it's often not enough to use Git from the terminal. It's just quite complicated, right? You know, one line of command at a time. So what we generally do is use Git from our preferred programming language, that is Python, Java, whatever people want to use. And so people created frameworks for Git in all the major programming languages.
And for Python, there is GitPython and there is also PyGit2. However, these tools are also quite complicated, as they are like a one-to-one mapping of Git, just in Python. And what PyDriller tries to do is ease some of the functions that researchers usually need. What we generally use Git for is reading information. As a researcher, as I said before, we go over the history of a project and we just extract information. We almost never commit, push, or merge. We just read. So what PyDriller does is simplify these reading operations enormously, and you can do the reading
[00:04:13] Unknown:
in just one or two lines of code. And you mentioned that there are a number of other frameworks in other languages, and I'm sure that there are probably some others in Python as well. I'm curious how the PyDriller framework differs from some of the other ones that are available and what motivated you to create it rather than use something that already existed? So the reason why I started PyDriller in the first place is, I was helping my adviser
[00:04:37] Unknown:
in teaching the MSR course, the mining software repositories course, here in Delft. And we were teaching students how to use Git and how to go over the history of a project, etcetera. And they were often coming to us and saying, okay, but is there a tool? Is there a better way to do this in Python? And we always said, well, no, there isn't. Sorry. You just have to call Git as a process from Python, or you have to rely on GitPython or PyGit2 and other tools. The tenth time the students came to us, I said, okay, well, let me build it for you. And there was a similar project written in Java called RepoDriller, and it was from a good friend of mine. So I took his idea, and I just did it in Python. And then, yeah, it just became popular and it exploded, and I put so many features on it. And I think the difference with other tools is that PyDriller does less. As I said before, PyDriller is not a one-to-one mapping of Git. So in PyDriller, you cannot, for example, push or commit or add files to a commit or all these kinds of things. For this, you have to rely on Git or other tools.
However, for reading operations, such as give me all the commits of the last year, on this branch, by this author, in PyDriller this is just one line of code. So that, I think, is the big advantage of PyDriller, and the biggest difference with other tools.
[00:06:14] Unknown:
And in terms of the types of research that you would be able to perform by mining these software repositories, what are some of the topics that you're looking to cover and the types of information that you're able to discover by digging through the history of a Git repository?
[00:06:30] Unknown:
There are so many things that we can do research on by looking at the history of a project. There is even a specific software engineering conference about it. It's called MSR, and every year it's one of the top software engineering conferences in the world. And in there, there are just MSR studies, studies that do some kind of mining. So mining software repositories is a pretty broad, pretty vast topic. It doesn't only mean mining code. We often think of a software repository as a Git repository, but that's actually not it. A software repository can be anything containing software-related information. For example, it can be emails or bug tracker tools, everything that contains some kind of software information. PyDriller only focuses on Git, but by analyzing the history of projects, we can get so much information.
For example, there is the whole field of defect prediction: trying to detect whether a file in this commit contains a defect or not, based on the history of that file or on similar kinds of code changes in the past. This is something that many people do, defect prediction. Or there is also a line of research on how to improve code review, how to allow developers to do code review in an easier way, and how to detect at code review time that there are problems. And all this is done by analyzing the history of all the code reviews, etcetera. Yeah. So pretty much everything you can extract. Git is a very big source of information.
And we have to be very careful because, you know, many people work on those projects, and they contain names and emails and this kind of thing. So we have to be quite careful about how we want to publish those data and those results. But otherwise, it contains an enormous amount of information.
[00:08:59] Unknown:
And so with all of the information that you're able to pull from Git as far as things like the authors, the change sets of the code that's being committed, all of the timing information, what are some of the ways that that collected information might be used in some of the research projects that you've been involved with to take advantage of all the information that's stored in these repositories?
[00:09:23] Unknown:
A lot of empirical studies on, for example, the quality of the code. So can we somehow identify files that have poor code quality, production files as well as test files? And we can even go deeper and analyze at method level, and maybe try to predict whether this will have some kind of impact in the future. And this is all done by analyzing the past and trying to predict the future, whether this file will contain a defect. Or, for example, we studied test smells, which are, let's say, test design issues. And there we went into the history of projects to see whether these design issues were present and what their effect was on quality.
So yeah, these are some of the studies we've done, but as I said before, the sky's the limit. I mean, the history is full of information. The whole branch of defect prediction, of code review research, is about the history of projects. So pretty much anything where we can learn from the past to try to improve the future is done using these kinds of studies.
[00:10:45] Unknown:
And in addition to what's stored in the Git repository itself, there are also these collaborative platforms such as GitLab and GitHub that layer on additional things like conversations within the code reviews and issue tracking and things like that. I'm wondering if you've done any work to integrate those additional sources of information with the content that's stored in the repository itself to be able to gain any deeper insights, or do some research that covers ground that isn't directly related to the specific code itself, but the ways that people are interacting around the code? Definitely. Definitely. That's a very good point. We generally do this, and we generally use multiple sources of information. So when we study something about code review, we generally look at the files that change during
[00:11:35] Unknown:
code review. And, I mean, a code review is just a fancy way of saying a commit that is there to be reviewed, but it's a commit. So there are files changed there. There is source code. There are diffs. So there is a lot of information that we can analyze. So how to become more effective in code review, at the end, becomes: okay, let's analyze the commit and let's see whether we can extract some kind of information, whether we can improve how we can support developers in doing code review. For example, we studied whether the position of the test files within the code review, whether a test comes before or after its production code, has an impact on the time required to do the code review. And this is all done by looking at the history and doing experiments and changing how GitHub presents the files, and in which order.
So, yeah, definitely, we look at many different sources of information. As you said before, bug tracking tools like Jira, we also look at those. Generally, each commit has a bug ID that you can refer to, so we can look at the severity, the type of the bug, etcetera. And again, then we go back in the history and see whether it had some kind of impact, and whether we can avoid it happening again in the future.
[00:13:07] Unknown:
Yeah. So definitely, yes. And in terms of the tool itself, as you said, using the command line directly can be quite cumbersome and isn't really conducive to doing any sort of in-depth analysis. And I'm wondering what are some of the additional metrics on top of Git itself that you're able to extract using PyDriller, and some of the capabilities that you've built into it? Yeah. So
[00:13:33] Unknown:
By default, PyDriller is a Git framework, so we only extract information that is present in the Git repository. However, we saw that many users were using PyDriller together with some other tool to perform some static analysis. For example, with PyDriller, you can get the source code of the file in that specific commit. And once you have the source code, maybe you want to perform some analysis on it, like calculating the complexity of the methods that changed in that commit before and after the change, etcetera. So at the end, what we decided to do was to make use of another tool, a static analysis tool called Lizard. Lizard is a tool that can calculate code metrics. So we integrated the tool inside PyDriller, and now we can, for example, for each commit, for each file changed in the commit, see the complexity before the change and after the change, and maybe compare them. So you can, as we said before, take the commit in that code review and see whether the change decreased or increased the complexity. We can do very fancy things. So, yeah, other than Git information, we now provide code metrics information.
And yeah, then we stop. Then everything else
[00:15:03] Unknown:
is then on the user side. In terms of the implementation of PyDriller, how did you go about developing it and determining the overall structure and the API to make it easy to use and accessible, particularly for people in the class who aren't necessarily expert programmers?
[00:15:21] Unknown:
Yeah. So I think, for this particular case, I tried to be as Pythonic as possible. Python is really an amazing programming language, and it's so easy to read, and the same should apply to all frameworks written in Python. So what I wanted to do was simply focus on simplicity and abstract away from the user all that Git complexity, because Git is really, really complex. PyDriller just handles that. Someone should be able to just write: for each commit in the repository, and for each file in that commit, do something. That was my main goal. Two lines of code, that should be it. And at the end, we managed, and that's what we've done. So now, with just two lines, you can do all this.
And then, well, for these kinds of research studies, we often need to customize the inputs. For example, we generally don't analyze the entire history of a project. A big project like Hadoop, with hundreds of thousands of commits, is going to be very hard. So what we generally do is customize it a lot. For example, we only get commits on a specific branch, or maybe from one date to another date, or only commits on that particular file, or only commits of that author, etcetera, etcetera. So another priority for me when I built PyDriller was, okay, it needs to be highly customizable, because in these studies we really apply a lot of filters, at commit level mainly. And this is possible in Git, but it's buried in the documentation somewhere. So in PyDriller, all this is just one line of code. You just write, for commit in the repository, and you put there all the filters you want to apply. And we have so many. And you just press enter,
[00:17:37] Unknown:
and then that's it. It does it for you. And when I was looking through the documentation, I noticed that one of the base interfaces is to open up a repository with this class object and then be able to iterate over the commits. And I'm curious how you handle things like branches, and ensuring that you're able to handle things like dead-end branches that never merge back into the trunk, or things like that. And just some of the strange complexities that can be built into the workflow of how somebody interacts with Git, and the ways that they handle things like tagging or multiple branches or merge commits versus rebasing versus squashing, and all of that. Yeah. Yeah. So
[00:18:17] Unknown:
that was a lot of issues opened on PyDriller, to be honest. When I started, I was mainly handling the simple cases. So few branches and few merge commits, and if a merge happens, it's only two branches, no more, etcetera, etcetera. And then I think the community just started to work on it, and they really helped me a lot. I mean, when I started three years ago, I thought I knew Git, and I soon discovered that I didn't. And even now, I think I still don't know that much. There are so many cases to consider, and there is, like, one issue opened, I think, every couple of months saying, yeah, but if you open a branch like this, and then you commit, then you amend, and then you rebase, and then it breaks.
And I'm like, okay, let me learn something new about Git that I didn't know before. So, overall, I think now it's pretty solid. I think we are able to handle many cases, and it should be, like, 99% of the cases that you mentioned before. But still, another thing that really simplifies PyDriller is that we only perform reading operations. We never write. Well, almost never, because I let the user check out and reset. That's not really writing, but it's, let's say, changing the state of the repository. Those are the only two kinds of operations that I let the users do. Otherwise, we never perform writing. And when you only read something, it's hard.
It's not impossible, but let's say it's hard to report something wrong. If a commit is in that branch and that branch, well, it is. If Git says that it is, it is, so I just report it. So, yeah, I think that by not changing the state of the repository, we really simplify PyDriller a lot. And overall, again, as I said before, if you find an issue, a case that I didn't cover, well, GitHub is there. Just open the issue, and I will try to read the documentation and try to learn something new. For being able to write the tests for it, I noticed that you have a zip file of some handcrafted repositories
[00:20:47] Unknown:
to be able to cover the different scenarios that you're trying to address in the tests. I'm curious what you have found to be some of the most complex or challenging aspects of being able to write meaningful tests for a framework that is trying to address something as complicated as Git, as well as the challenges of getting a repository into the right state to serve as a test case for those assertions?
[00:21:11] Unknown:
Yeah. So I think one of the most complicated things was to handle different branches and merge commits. That was definitely something that I had to put a lot of effort into. You know, you check out a new branch and you modify a file on the other branch, and then on this branch, then you merge and then you have conflicts. And PyDriller should always return, let's say, the view from Git's side. So all those particular kinds of corner-case scenarios, yeah, I had to manually create them. And thinking about a lot of them, trying at least to cover them, was quite hard. I mean, of course, you cannot cover 100% of the cases, but many of them. That was really hard. But otherwise, I think most of the handcrafted repositories I have there are pretty simple.
They are mainly, you know, for example, a repository with two branches, and that repository is used to test filtering all the commits in a specific branch. And, I mean, this is a very simple scenario, and the repository is really simple. I think it only gets complicated when we have merge commits and conflicting files, etcetera, and then it's a bit hard to understand. What was the other question? I don't remember.
[00:22:39] Unknown:
Just the issues of writing tests to be able to cover representative samples of what a Git repository can look like, as well as building a repository that represents those different states that you're trying to test? Yeah. So
[00:22:55] Unknown:
I think, again, as I said before, PyDriller is really community-driven. When I started, I think I had something like four or five repositories, and now I already have, like, twelve or even more. And the only reason why this happens is the community: someone opens an issue saying, look, if I check out this branch and I modify that file and then I check out the other branch, and then this happens and this happens, you don't cover the case and you report a wrong result. And then what I generally do is manually recreate, like, a simple-case scenario of that problem. And then I will create a test, and then this repository will become part of PyDriller, because from there, it will always stay inside. So, yeah, I think it's community-driven. As soon as I'm able to solve an issue, I will create a new test and a new repository that handles that particular problem. At the beginning, to be honest, I tried to use real-world cases.
For example, I tried to download some projects from Apache, and I was running operations on those. The problem was that even though it's a real-case scenario, you know, real data, it was really slow, and it was hard for me to find the specific cases that I wanted. So by manually creating them, I know it's not real anymore; however, I can tackle the specific cases that I really want. That's why I manually create all of them instead of just relying on existing repositories.
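A handcrafted fixture like the ones described here can also be scripted rather than stored. The sketch below, with all names invented for the example, builds a throwaway repository with one commit on each of two branches using only the standard library and the git CLI; PyDriller's actual fixtures ship as zipped repositories.

```python
# Illustrative sketch: script a tiny two-branch fixture repository
# instead of downloading a large real-world project.
import os
import subprocess
import tempfile

def make_two_branch_repo():
    """Create a tiny repo with one commit on each of two branches."""
    path = tempfile.mkdtemp(prefix="fixture-repo-")
    def git(*args):
        subprocess.run(("git",) + args, cwd=path, check=True, capture_output=True)
    git("init")
    git("config", "user.email", "fixture@example.com")
    git("config", "user.name", "Fixture Author")
    with open(os.path.join(path, "main.txt"), "w") as f:
        f.write("trunk\n")
    git("add", "main.txt")
    git("commit", "-m", "commit on default branch")
    git("checkout", "-b", "feature")  # create the second branch
    with open(os.path.join(path, "feature.txt"), "w") as f:
        f.write("branch\n")
    git("add", "feature.txt")
    git("commit", "-m", "commit on feature branch")
    return path
```

Because the repository is rebuilt from scratch on each run, a test can rely on exactly the branch layout it needs, which is the point Davide makes about tackling specific cases.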
[00:24:42] Unknown:
And PyDriller is built specifically for addressing Git repositories, but there are also other version control systems, the most notable contender being Mercurial, as well as some of the predating systems such as Subversion or CVS and some esoteric ones. I'm wondering what the level of effort would be to add support for other version control systems and what the
[00:25:06] Unknown:
overall utility or benefits would be if you were to try and address that case? So there are two problems, the first one being that someone needs to learn the new version control system. So, to be honest, in PyDriller there is a branch called, I think, Mercurial or something like this, where I already ported it for Mercurial. The problem, though, is that I'm not an expert. I was not an expert, and I'm still not an expert on that. So many things, well, I was not sure how to do them properly. And that's strange, because Mercurial is written in Python, so it should be easier than Git to work with, but it wasn't.
And that's one of the reasons why I haven't merged everything together. And this was already, like, two years ago. It's just that someone needs to learn it, and I was not there at the moment; I didn't know Mercurial that well. Because, again, this should be a framework for a version control system. So to be able to do those kinds of things, you really, really need to know the mechanics of Git itself or, in this case, Mercurial. So some things were just not right. And now Facebook uses Mercurial, so it has been some time now that I use Mercurial. I'm still not an expert, so I'm not sure whether I will actually start again and try to see how to do those kinds of things.
And the second reason why I haven't done it yet is, well, let's say, know your audience. Most people use Git, and the research studies are done on Git repositories. So me spending, I don't know, six months or something like this to port PyDriller to Mercurial, well, the gain is going to be really marginal here, because not a lot of people are going to use it. And Mercurial is mainly used in industry, and I'm pretty sure that I don't have many users that come from industry. So at the time, I said, well, let's just focus on Git. And now that it's becoming a bit more popular, I know PyDriller is used in a couple of companies. Mozilla is one of them; in some of their products they use PyDriller inside. SIG, the company where I used to work, also uses it, by the way. But still, these companies use Git. So I'm not sure whether I will actually spend time on adding Mercurial to it. But, anyway, if someone wants to do it, of course, they're more than welcome. And the kinds of operations that PyDriller does on the repository are, as I said, mainly reading. So: get the list of commits, the list of authors, and, for each commit, the diff and the source code of the files. And that should be more or less it. Those are the kinds of information that I extract from the repository.
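The read-only surface Davide lists, commits, authors, diffs, and file contents, could be captured in an abstract backend interface that a Mercurial or Subversion port would implement. This is an illustration of the idea, not PyDriller's actual internal design:

```python
# Hedged sketch of the read-only operations a PyDriller-style miner
# needs from any version control system, expressed as an abstract base
# class that a new backend (Mercurial, Subversion, ...) would implement.
from abc import ABC, abstractmethod

class VersionControlBackend(ABC):
    """Read-only operations a repository miner needs from a VCS."""

    @abstractmethod
    def commits(self):
        """Yield commit identifiers, oldest first."""

    @abstractmethod
    def author(self, commit_id):
        """Return the (name, email) of a commit's author."""

    @abstractmethod
    def diff(self, commit_id):
        """Return the textual diff introduced by a commit."""

    @abstractmethod
    def source_code(self, commit_id, path):
        """Return a file's contents as of a commit."""
```

Because nothing here mutates the repository, each backend only has to translate queries, which matches Davide's point that a read-only surface keeps ports tractable.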
So it should be doable in other version control systems, but I am not that much of an expert right now. So I'm not sure how
[00:28:22] Unknown:
complex it will really be. And you mentioned that it has started to make its way into use within industry. And I've seen, when I was looking at the dependency graph of tools that are using PyDriller, some examples of frameworks that people are using to pull out useful code metrics for determining things like development velocity, and to help provide feedback to engineering leadership to direct their colleagues in how to address some of the, you know, introductions of bugs and things like that. I'm curious what you have seen as some of the most interesting or unexpected or innovative ways that you've seen PyDriller used. Okay. So
[00:29:02] Unknown:
it's unexpected, I would say. Some time ago, I don't know how I found it, but I found a university exam asking questions about PyDriller. That was unexpected and really fun. One tool that came out just recently, I think less than a month ago, that internally uses PyDriller, is an amazing tool from students here in the Netherlands as well. And what they do is represent the repository as a graph. So each node is like a commit or an author or a file, and they build this graph using PyDriller, and they store this graph in Neo4j.
And after PyDriller is done and the analysis is done, you have this beautiful graph that you can query. And, I mean, it's visual. You have something that you can see. And the queries are super fast, and you can do amazing stuff with this. So I'm really looking forward to seeing how this tool will evolve in the future. I think this tool is called GraphRepo, and it just came out. It's really, really interesting. They published a paper about it, and internally, they use PyDriller. So I'm very happy to see that, you know, other people use PyDriller to build their own tools. And I also saw that Bugzilla internally has PyDriller as a requirement.
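The graph idea described here, commits, authors, and files as nodes with edges between them, can be sketched with a plain in-memory structure. GraphRepo itself mines with PyDriller and stores the result in Neo4j; this toy version just uses sets, and the function name and tuple shapes are invented for the example.

```python
# Illustrative sketch of a GraphRepo-style repository graph built from
# already-mined commit data: nodes are (kind, name) pairs, edges are
# (from_node, to_node) pairs.
def build_repo_graph(commits):
    """commits: iterable of (commit_hash, author_name, [filenames])."""
    nodes, edges = set(), set()
    for commit_hash, author, files in commits:
        nodes.add(("commit", commit_hash))
        nodes.add(("author", author))
        edges.add((("author", author), ("commit", commit_hash)))
        for filename in files:
            nodes.add(("file", filename))
            edges.add((("commit", commit_hash), ("file", filename)))
    return nodes, edges
```

In a real deployment the same node and edge sets would be written to a graph database so they can be queried, which is what makes the GraphRepo approach fast and visual.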
So overall, I'm really happy that people just use PyDriller to do their own research and their studies. And if we can somehow, you know, help in building better tools, then,
[00:30:51] Unknown:
then it's I'm all in. And in your experience of building the PyDriller tool and using it to help teach people about mining repositories and growing the community around it and supporting it as a maintainer, I'm wondering what you have found to be some of the most interesting or unexpected or challenging lessons that you've learned in the process.
[00:31:10] Unknown:
So there are many, there are many that I learned. This was actually my first, like, big open source project. It's not even that big, but let's say impactful open source project. So first of all, I learned how complicated Git is. I didn't know. When I presented the tool, I showed that Git has 150 primary commands, like git log, git push, git add, etcetera. And those are just the top level commands, one at a time. But then if we look at a single command like git log, it probably has something like 500 options. So, I mean, it's crazy how complicated Git is and how powerful the tool is. And it's crazy to see how people really know everything about Git. I mean, I developed a Git framework, and I'm not an expert. Not at all. It's amazing to see the community around this version control system and how well they know their system. It really amazes me every time I Google something on Stack Overflow, and they know everything about it. It's really awesome.
So that's on the Git side. On the maintainer side, well, everything was new for me. And something I'm still struggling with is understanding where the limit is. As a maintainer, you are supposed to, well, maintain the code. Right? So if someone finds a bug, you are supposed to fix it. More or less, that's the theory. So what is the line that you draw, where you say, okay, from this moment on, I will reply, okay, thanks for reporting the bug, can you please fix it? This is something I'm not sure about, you know, when I have to say it. Is the project already that big that I can just say, please, can you fix it by yourself? Can you go in the code and fix the bug? Or should I still fix it by myself and say, thank you for reporting it, I will work on it in the next couple of weeks? So I'm not sure what the role of the maintainer is here. Like, when is the point where I will just say, okay, no more bug fixing for me, if you want to fix it, you fix it by yourself? I actually Googled it, and I read articles, but there is no right and wrong, I think. I think at one point, I will just give up
[00:33:48] Unknown:
and say, okay, PyDriller is done for me, you just fix it. Yeah. It's definitely one of the big unanswered questions in open source: what is your responsibility after you release something to the world? And at what point do you say, I'm done with this, or this code is complete and it's not going to be touched anymore, or hand it off to a different maintainer who can then take it in new or different directions or just maintain the status quo? So it's definitely something that, as you said, is challenging, and there is no one solution. It's just whatever feels right for the person who is the progenitor of the code and has the ultimate responsibility for it. Yeah, exactly. Definitely.
[00:34:26] Unknown:
There is no right and wrong. I think it's 3 years now and I'm still fixing the bugs that are reported by the users. So that's the point. I think
[00:34:36] Unknown:
eventually I will just give up. And in terms of PyDriller itself, what are some of the plans that you have for it, either in terms of fixes or improvements to the code base itself, or ways that you plan on using it for conducting your own research or building additional tooling on top of it or with it? Yeah. So I think as features, what I see in the last
[00:35:00] Unknown:
year is that users ask for new filters. As I said before, PyDriller has a huge amount of filters that you can use to filter the commits coming from a repository. At the beginning, I started with just from a date to a date. That was it. That was the filter. But then I added from commit to commit, from tag to tag, only in certain branches, only the commits that touch this file, or only the commits from this author. And all this was coming from users that were just saying, can you please add a filter for author, or something like that. So from the Git side, I think I already report everything that I can: the commit hash, of course, the author, the date, the diff, and the source code, and the branches, I think, I report.
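As a sketch of how those filters look in code: the keyword names below (`since`, `to`, `only_authors`, `only_in_branch`, and the `Repository` class itself) follow PyDriller's documented API as I understand it; older releases exposed the same filters on a class called `RepositoryMining`, so check the docs for your installed version. The import is deferred so the sketch stays importable without the third-party dependency:

```python
from datetime import datetime


def commits_by_author(repo_path, author, branch=None):
    """Collect the hashes of commits matching a few of PyDriller's filters.

    Illustrative only: requires `pip install pydriller`, which is why
    the import happens inside the function rather than at module level.
    """
    from pydriller import Repository  # third-party dependency

    repo = Repository(
        repo_path,
        since=datetime(2020, 1, 1),   # from a date...
        to=datetime(2020, 12, 31),    # ...to a date
        only_authors=[author],        # only commits by this author
        only_in_branch=branch,        # optionally restrict to one branch
    )
    return [commit.hash for commit in repo.traverse_commits()]
```

Calling `commits_by_author("/path/to/repo", "alice")` would walk the repository and return the matching hashes; the other filters he mentions map to keywords like `from_commit`/`to_commit`, `from_tag`/`to_tag`, and `filepath`.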
I mean, those are, let's say, the fields of a commit. Those are what we can say about a commit. But then what we are trying to do as well, as a community, is adding new fields, new features. The last one was we added the possibility to return the commits coming from all the branches, even if they are not checked out here. It's like git log --all from the terminal, and we just did the same thing in PyDriller, so now you have the --all behavior as well. These kinds of things, I think, well, there are so many. As I said before, Git is so big that there are so many possible commands that you can pass, and I think PyDriller will definitely grow on this. And another thing we can definitely do is put more code metrics in there. Right now, PyDriller returns, I think, the complexity of the methods and the coupling, if I remember correctly, and lines of code, something like this.
But they are, yeah, they are quite simple. We have the source code before and after the commit for each file in the commit. So, I mean, we can do everything we want with it. We can see, like, the delta of the complexity. This was one of the last changes that someone made in PyDriller. We can really do a lot. Again, I think it's community driven. If people need something, they can just open an issue on PyDriller, and I will be happy to help them in trying to implement it, kind of coming to the same point as before.
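The delta-of-complexity idea he mentions can be sketched as a small helper over a PyDriller commit object. Hedged: the attribute names used here (`modified_files`, `methods`, `methods_before`, `complexity`, `filename`) follow PyDriller's documented API (in 1.x, `modified_files` was called `modifications`), and this is an illustration of the idea rather than the actual change that was merged:

```python
def complexity_delta(commit):
    """Per-file change in total cyclomatic complexity across one commit.

    `commit` is expected to look like a PyDriller Commit: it exposes
    `modified_files`, and each modified file lists its methods before
    and after the change, each method carrying a `complexity` value
    (computed by the lizard library under the hood).
    """
    deltas = {}
    for modified in commit.modified_files:
        before = sum(m.complexity for m in modified.methods_before)
        after = sum(m.complexity for m in modified.methods)
        deltas[modified.filename] = after - before
    return deltas
```

A positive delta for a file means the commit made its methods harder to follow, which is exactly the kind of signal the velocity and code-health tools built on PyDriller aggregate over time.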
[00:37:32] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a tool. I may have picked it before, and I know I did a podcast about it, but in any case: the pre-commit framework for managing pre-commit hooks in your Git repository, to handle things like running linters or executing isort before you commit, all that kind of stuff, just to help reduce the cognitive overhead of having to make sure that all of that gets run. Definitely a useful tool to help with maintaining the cleanliness and maintainability of your code, so I definitely recommend checking that out. And so with that, I'll pass it to you, Davide. Do you have any picks this week? Yeah. So I was thinking about it, and I just discovered, 2 days ago, a beautiful
[00:38:20] Unknown:
game for PlayStation. It's called Fall Guys. And this is, like, a couch game, let's say. I played it with my girlfriend. So I love playing PlayStation. She often watches me playing, because she doesn't like those FPS games. But this game is just the best for couples. We just play one match at a time, and it's really fun and a lot of laughs every time. And it's not difficult at all. And the matches are really small, like 30-second, 1-minute games. It's really, really fun. You should give it a try. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with PyDriller and your experience with mining software repositories
[00:39:01] Unknown:
and doing research on practices for writing code and evolving it over time. It's definitely a very interesting framework and an interesting problem domain that helps to inform the ways that we do software engineering. So thank you for all of your time and effort on that, and I hope you enjoy the rest of your day. Thank you very much, Tobias. You
[00:39:19] Unknown:
too.
[00:39:21] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Guest Introduction: Davide Spadini
Davide's Journey with Python
Overview of PyDriller
Research Applications of PyDriller
Integrating Collaborative Platforms
Metrics and Capabilities of PyDriller
Developing and Structuring PyDriller
Handling Git Complexities
Testing Challenges
Supporting Other Version Control Systems
Industry Adoption and Use Cases
Lessons Learned
Future Plans for PyDriller
Contact Information and Picks