Summary
Analysing networks is a growing area of research in academia and industry. Answering questions about large or complex relationships requires fast, efficient algorithms that can process the data at scale. In this episode Eugenio Angriman discusses his contributions to the NetworKit library, which provides an accessible interface to these algorithms. He shares how he is using NetworKit for his own research, the challenges of working with large and complex networks, and the kinds of questions that can be answered with data that fits on your laptop.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Eugenio Angriman about NetworKit, an open-source toolkit for large-scale network analysis
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what NetworKit is and the story behind it?
- A core focus of the project is for use with graphs containing millions to billions of nodes. What are some of the situations where you might encounter networks of that scale?
- There are a number of network analysis libraries in Python. How would you characterize NetworKit’s position in the ecosystem?
- What are the algorithmic challenges that graph structures pose when aiming for scalability and performance?
- How do you approach building efficient algorithms for complex network analysis?
- Can you describe how NetworKit is architected?
- What are the design principles that you focus on for the library?
- How have the design and goals of the project changed or evolved since you have been working on it?
- NetworKit’s code base has grown to a considerable size, and several developers have contributed to it. Are there any minimum quality requirements that new code needs to fulfill before it can be merged into NetworKit? How do you ensure that such requirements are met?
- What are some of the active areas of research for networked data analysis?
- How are you using NetworKit for your own work?
- What kind of background knowledge in graph analysis is necessary for users of NetworKit?
- What are some of the underutilized or overlooked aspects of NetworKit that you think should be highlighted?
- What are the most interesting, innovative, or unexpected ways that you have seen NetworKit used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on NetworKit?
- When is NetworKit the wrong choice?
- What do you have planned for the future of NetworKit?
Keep In Touch
Picks
- Tobias
- NetworKit
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- NetworKit
- Humboldt University Berlin
- graph-tool
- NetworkX
- Adjacency List
- Cython
- Node Embeddings
- Centrality Score
- NetworKit In The Cloud
- Gunrock
- Hornet
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Eugenio Angriman about NetworKit, an open source toolkit for large scale network analysis. So Eugenio, can you start by introducing yourself? Yeah. Thank you, Tobias, for the introduction. So,
[00:01:10] Unknown:
yeah. My name is Eugenio Angriman, and I am a PhD student at Humboldt University of Berlin. I work with NetworKit on an almost daily basis in my work, and I'm happy to be here talking about it with you. And do you remember how you first got introduced to Python? Yeah. So I was introduced to Python quite late, I have to say, when I was working on my master's thesis. So my thesis was mainly about efficient graph algorithms, and I was quite familiar with the graph algorithms, but not so much with programming. So I decided to, yeah, implement my algorithms in pure Python because I wanted to learn this new exciting programming language.
But, unfortunately, as you know, yeah, when it comes to performance, yeah, Python is not as good as C++, especially when you want to go to large networks. So, yeah, I kept using Python, but I started using the graph-tool library, which is much more efficient because it implements algorithms in C++ and exposes them to Python. And, yeah, since then, I still use Python to do data analysis or to do plots or to write utility tools that I use on a daily basis. So, yeah, I'm super happy with Python. And you mentioned that you had used graph-tool
[00:02:24] Unknown:
as you were starting to get introduced to Python, and that's a project that I had on the show recently. But I'm wondering how you ended up moving from graph-tool to NetworKit and how you ended up getting involved in the NetworKit project, and sort of what your comparison or what your motivations are for using NetworKit instead of graph-tool?
[00:02:43] Unknown:
So when I started my PhD, I started my PhD in a group that was developing NetworKit at the time. So I started using NetworKit for that main reason. And all the people I was working with were experienced with NetworKit. So they could introduce me better to the environment and how to properly contribute to this project.
[00:03:05] Unknown:
And so can you dig a bit more into the NetworKit project itself and some of the story behind sort of how it got started and some of the core focus of the project?
[00:03:15] Unknown:
So as far as I know, the NetworKit project started in 2013. So, yeah, 8 years ago as we speak. Now, unfortunately, I don't know so much about NetworKit's early years because, yeah, I started contributing to this project in 2017. Yeah. Anyway, as far as I know, yeah, 1 of the main driving forces of NetworKit was Christian Staudt, who was, at the time, 1 of the first PhD students of my supervisor. So he was working on graph algorithms. Yeah. He preferred starting to write his own library rather than using something that was already available. Since then, yeah, I have to say that NetworKit kept growing, not only thanks to the, yeah, early efforts of the founders, but also thanks to the contributions we have from external contributors, like students, researchers, and other people from other universities.
So if you look at our website, we have a long list of collaborators, and you can also check that on our GitHub page. Yeah. Without their help, yeah, NetworKit would have far fewer features than it has right now. Yeah. For example, we have many students that work with NetworKit for their master's thesis or bachelor's thesis. And once their theses were finished, we included their algorithms inside NetworKit, and so they were available to use for the open source community. We have some other code that was developed, like, by PhD students or other postdoc researchers as part of their research.
Once a paper is published or, yeah, some work is accepted for publication, then we integrate the code of the new algorithm presented back into NetworKit so it is available for everyone to use and to reproduce our results. Yeah. So maybe long story short, NetworKit was and still is a really collaborative effort that involves many people around the world to provide efficient, scalable algorithms for large scale network analysis.
[00:05:13] Unknown:
And with that emphasis on efficiency and large networks, I'm wondering what are some of the situations where you might encounter datasets of that particular scale? Because I noticed on the website, it says that it's able to process on the order of billions of nodes in a graph. And I was wondering what are some of the sources of those types of data and some of the acute challenges that you come up against as you're trying to manage efficient processing of those datasets?
[00:05:42] Unknown:
So as you can imagine, yeah, many of today's phenomena or, you know, problems can be represented as graphs. And depending on the problem or how you're trying to model it, then you might end up with graphs of different types or sizes. Maybe I can start with the most obvious example, which are social networks. So we all know that social media is used every day by a lot of people and involves billions of people around the world. So just representing interactions between people leads to graphs that are huge in their size. And beyond users, we might also add the places, events, or, yeah, pages, and many other things that make these networks even bigger. Maybe another simple example are road networks.
So, yeah, 1 can see a road as an edge and crossings as nodes. Road networks alone can also build up very large graphs. So if you just think about the road networks in the US or in Europe, yeah, we can have graphs with millions to, yeah, hundreds of millions of nodes. And a 3rd obvious example, of course, is the Internet. So, yeah, the Internet can be represented as a graph with nodes as web pages and edges as hyperlinks between pages, and representing portions of the Internet, yeah, we might end up with a network of considerable size. So these are 3
[00:07:04] Unknown:
obvious examples, but we could go on with other examples. I guess that in the interest of time, we can move on to other questions. Yeah. Yeah. There are definitely a lot of opportunities for building up network structures. And to your point about the internet as well, you know, there's web pages with links. There's also machines with network connections and, you know, even just managing to map out the sort of core backbone of the Internet with ISPs and edge providers and things like that. There's all kinds of interesting analysis there too. And then in terms of the sort of overall ecosystem of managing network analysis, particularly in Python, as we mentioned already, there's the graph-tool project. Another 1 that is pretty well known is NetworkX, which I know is implemented in pure Python, so it has some issues in terms of its scalability.
But how would you sort of characterize NetworKit's position in the overall ecosystem of Python and doing network analysis, particularly on these larger datasets?
[00:08:04] Unknown:
Yeah. So I think it's a really interesting question. And I think that the answer is also why NetworKit was born. So, yeah, I think that this ecosystem of Python libraries for network analysis has 2 main types of libraries. So as you mentioned about NetworkX, we have libraries that are implemented in pure Python, which is a really popular language. So they can count on many contributors who provide a wide range of features. Right? On the other side, we have libraries like graph-tool, which, yeah, I also used in my master's thesis, that implement their algorithms efficiently using C++ and expose them to Python. So they have a maybe more restricted set of features than libraries like NetworkX, but they provide them in a more efficient way. So with NetworKit, we are trying to fill this gap. So we want to provide a wide range of features, with new algorithms and state of the art methods for network analysis, without sacrificing performance.
So we want to provide many features, but efficiently, to the users in the Python community.
[00:09:10] Unknown:
We can dig into it a little bit later, but 1 of the interesting things to touch on is who the primary audience is for NetworKit, whether it's built for the purpose of researchers being able to do, you know, research and analysis of networks, or if it's intended more broadly for use by developers who might encounter some of these network datasets and need to incorporate graph analysis as part of a larger application, and just how that informs the interface for the library and the features that you focus on building out. Yeah. I think that NetworKit can serve successfully
[00:09:45] Unknown:
both of these kinds of audiences. So an inexperienced researcher, well, maybe a researcher that doesn't have a lot of experience with coding or computer science, can still use NetworKit, thanks to the handy Python interface, because it's really straightforward to use. So that's 1 of our targets. Another target, of course, are more experienced developers who want to develop efficient algorithms for network analysis, as we also provide NetworKit as a C++-only library. So if 1 wants to work just with C++, then they are also free to do so by developing algorithms directly using C++ and NetworKit's APIs. So we are targeting, maybe, a broad range of users, not just a particular 1. In terms of the actual
[00:10:34] Unknown:
analysis of the graphs, I'm wondering what are the algorithmic challenges that you run into and some of the issues with complexity analysis of the data that you're working with and how to properly traverse and analyze it? In my experience, yeah, I faced quite a few of these challenges.
[00:10:52] Unknown:
So, yeah, 1 is, for example, when you aim for scalability and high performance, then, yeah, you really cannot afford an algorithm that has an empirical running time complexity that is super linear. So this, as we all know, maybe just does not scale. So I say empirical and not worst case running time complexity, because, yeah, many scalable algorithms that we develop also in NetworKit have indeed a worst case time complexity that is super linear, like quadratic. But in reality, so in real world graphs, they are much faster. So this is 1 aspect about, yeah, maybe, running time complexity.
Another challenge, in my opinion, concerns memory. So 1 simply cannot afford to use too much memory. So, yeah, for algorithms that target large graphs, similarly with time, we cannot store more than linear amounts of additional memory. More than that simply does not scale. Yeah. And another aspect that I think is very important is cache locality. So many graph algorithms, for example, need to traverse the graph, which means that they access the graph data structure in a kind of unpredictable way. And this often leads to cache misses or inefficient memory accesses, and this leads to performance loss.
So we experienced this phenomenon in some of our recent papers. A considerable amount of time of our algorithms was spent just waiting for data to be loaded rather than processing the data. So even if we used a faster CPU, this would not speed our algorithms up. So the main bottleneck was really the memory bandwidth. This is a huge issue. It's also documented in the literature, I noticed, and it can probably be mitigated by using a better graph data structure or by, yeah, using some dedicated hardware to speed up the algorithms.
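The cache-locality point can be made concrete with a small sketch. One common mitigation is a CSR (compressed sparse row) layout: all neighbor lists live in one contiguous array indexed by a per-node offset array, so a traversal walks memory sequentially instead of chasing pointers. This is a minimal illustration of the idea, not NetworKit's actual data structure.

```python
# Sketch: CSR (compressed sparse row) layout for an undirected graph
# given as an edge list. Illustrative only, not NetworKit's code.
from array import array

def to_csr(n, edges):
    """Build offset/neighbor arrays: the neighbors of node u are
    neighbors[offsets[u]:offsets[u+1]], stored contiguously."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    offsets = array("q", [0] * (n + 1))          # prefix sums of degrees
    for u in range(n):
        offsets[u + 1] = offsets[u] + deg[u]
    neighbors = array("q", [0] * offsets[n])     # one flat, contiguous array
    pos = list(offsets[:n])                      # next free slot per node
    for u, v in edges:
        neighbors[pos[u]] = v; pos[u] += 1
        neighbors[pos[v]] = u; pos[v] += 1
    return offsets, neighbors

offsets, neighbors = to_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(list(neighbors[offsets[2]:offsets[3]]))    # neighbors of node 2
```

Because the neighbor data is contiguous, iterating over `neighbors[offsets[u]:offsets[u+1]]` is cache-friendly, whereas a list of per-node Python lists scatters the same data across the heap.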
[00:12:51] Unknown:
Yeah. And the structures themselves are another thing worth digging into. I'm wondering sort of what the representations are of these graphs and some of the considerations and trade offs that go into the actual storage of the data that you're analyzing: sort of things like the types of attributes that are available on the different nodes and the edges, or whether the edges have any information associated with them other than the nodes that they're connecting. There are a number of different ways that you can approach building out the actual graph structure and where you put the relevant information.
[00:13:25] Unknown:
Yeah. True. So at the moment, NetworKit uses an adjacency list based structure. So basically, a vector of vectors, which is a really simple way to store a graph in memory. But, yeah, maybe it's not the most efficient way to do that. So we are working on improving this. Concerning attributes, yeah, at the moment, NetworKit does not provide this feature. So it does not provide a way to express additional attributes for nodes or edges. The only additional thing 1 can store on edges are simply edge weights, so just a floating point number, which, of course, is something it lacks compared to other libraries. We are working on attributes as well, as a new additional feature for the future.
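The "vector of vectors" description above can be sketched in a few lines: each node owns a list of (neighbor, weight) pairs, with a single float as the only per-edge payload. The class and method names here are hypothetical, chosen only to illustrate the shape of the structure, not NetworKit's internal API.

```python
# Sketch of a "vector of vectors" weighted adjacency structure, where
# edges carry only a float weight, as described above. Hypothetical
# names; not NetworKit's internals.
class WeightedGraph:
    def __init__(self, n):
        self.adj = [[] for _ in range(n)]   # adj[u] = [(v, weight), ...]

    def add_edge(self, u, v, w=1.0):
        self.adj[u].append((v, float(w)))
        self.adj[v].append((u, float(w)))   # undirected: store both ends

    def weighted_degree(self, u):
        return sum(w for _, w in self.adj[u])

g = WeightedGraph(3)
g.add_edge(0, 1, 2.5)
g.add_edge(1, 2, 0.5)
print(g.weighted_degree(1))  # 3.0
```

Richer node or edge attributes would need a separate mapping alongside this structure, which is exactly the feature gap the answer above mentions.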
[00:14:09] Unknown:
In terms of the actual NetworKit project itself, can you dig into how it's architected and some of the design principles that you're focusing on for how to build out the library and make it maintainable
[00:14:21] Unknown:
and make it accessible for other people to contribute to? Yeah. So, yeah, NetworKit, at least the NetworKit Python package, is organized in different modules. These modules are independent. Each module basically tries to solve a specific problem or a family of problems. So for example, an obvious problem when working with graphs is reading and writing graphs. So we have the graphio module that does only that. So it reads and writes graphs according to a specific format. Other modules are, for example, the community module, which implements all the algorithms about community detection, and so on. And then, similar to graph-tool, I think, NetworKit is built on top of a C++ core library, and they communicate together using Cython.
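As a concrete illustration of what a graph I/O module's job looks like, here is a minimal round trip through a simple space-separated edge-list format, built only on the standard library. The helper names are hypothetical; NetworKit's graphio module supports many more formats than this sketch.

```python
# Sketch of the read/write responsibility of a graph I/O module, using a
# bare edge-list format. Illustrative only; not NetworKit's graphio API.
import io

def write_edgelist(edges, f):
    """Write one 'u v' pair per line."""
    for u, v in edges:
        f.write(f"{u} {v}\n")

def read_edgelist(f):
    """Parse 'u v' lines back into integer edge tuples."""
    return [tuple(map(int, line.split())) for line in f if line.strip()]

buf = io.StringIO()
write_edgelist([(0, 1), (1, 2)], buf)
buf.seek(0)
print(read_edgelist(buf))  # [(0, 1), (1, 2)]
```

Keeping parsing isolated in one module like this is what lets the rest of the library work against a single in-memory graph type regardless of the on-disk format.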
[00:15:09] Unknown:
As you mentioned, NetworKit has been around since 2013, and you've been working on it for about 4 years now. And so over that time, I'm sure that there's a certain level of maturity that it's reached. And after a while, software projects start to go through a phase where the adaptability or the capacity to change directions or morph starts to kind of solidify. And so there's a more concrete structure around it. You're still able to iterate and add new features, but the sort of overall design has become sort of what it's going to be. And I'm wondering how that factors into how you manage quality and onboard new contributors to the project?
[00:15:52] Unknown:
Yeah. As many other open source projects, in order to ensure a minimum of code quality and maintainability over the years, we require that every new NetworKit algorithm that comes from external developers, but also from us, internal developers, must be tested. So all new algorithms must be covered by the test suite. Now, the tests are executed in our continuous integration system, which checks that the code works on all platforms, so all major platforms like macOS, Linux, and Windows. On top of that, we also add static code analysis with code linters, and we treat nearly all warnings as errors. So this helps us spot common coding mistakes without having to look into them ourselves.
Then once all of this CI pipeline works correctly, we have introduced recently a mandatory review process where at least 1 of the NetworKit maintainers must approve changes before the changes are actually merged into NetworKit. This, of course, makes adding new code into NetworKit harder and takes more time. But I have to admit that over time, it improved the quality of the code base a lot, and I wouldn't go back. In terms of the actual
[00:17:15] Unknown:
research that is being done with NetworKit and just the overall space of graph structures and network data, what are some of the active areas of research that are ongoing and some of the specific areas that you're focused on during your PhD?
[00:17:31] Unknown:
So I think that there are quite a lot of research areas concerning network analysis. Maybe I can start with 1 that I'm most familiar with, which is centrality. So as some of you might already know, given a graph, 1 might want to find which are the most important nodes or edges according to some specific criteria. Now, a centrality measure basically assigns a score of importance to each vertex or edge in the graph. So 1 can easily determine which are the nodes or the edges that are the most important in the graph. So the ones that have the highest centrality score. Now unfortunately, computing these centrality measures is very expensive.
They were really not conceived with scalability in mind when they were first introduced many decades ago. So today, it's challenging to, yeah, efficiently compute those centrality measures on large scale graphs. Now the research in this area, yeah, started to take different directions to solve this problem. 1 way is to simply approximate the centrality scores instead of computing them exactly. And this already allows us to speed up this process quite a lot. Another solution is just trying to find the top k most central nodes. Usually, this has been shown to be much faster than trying to compute the centrality scores for all the nodes.
And another issue about this problem is that graphs today are changing. So new friendships are established in social networks. Roads are destroyed and built continuously. So graphs are not static in most scenarios. So another challenge here is to update those centrality scores after the graph changed. And this usually can be done much more efficiently when starting from an initial state where the centrality scores are already known, instead of doing all the computation from scratch after the graph changed. Alright. Then another research area that people are really excited about these days, I think, is node embeddings. So this is not really surprising, I think, because node embeddings basically are a bridge between graph theory and machine learning.
In machine learning, people often work with datasets where each element is basically represented as a vector of values, and these values are later processed by a machine learning model. Now representing graphs or nodes as vectors is not trivial. Node embeddings basically achieve this goal. So they are used to create a vector representation of a node that can be used later in tasks like node classification.
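The "approximate instead of exact" idea for centrality can be sketched concretely: rather than running a BFS from every node to get exact closeness centrality, run BFS only from a random sample of sources and scale up the observed distances (in the spirit of the Eppstein-Wang sampling approach). This is a conceptual illustration, not NetworKit's implementation.

```python
# Sketch: approximate closeness centrality by sampling BFS sources
# instead of using all n of them. Illustrative; not NetworKit's code.
import random
from collections import deque

def bfs_distances(adj, s):
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def approx_closeness(adj, samples, seed=0):
    n = len(adj)
    rng = random.Random(seed)
    total = [0] * n                      # summed distances to sampled sources
    for s in rng.sample(range(n), samples):
        for v, dv in bfs_distances(adj, s).items():
            total[v] += dv
    # closeness ~ (n-1)/sum_dist; estimate sum_dist from the sample
    return [samples * (n - 1) / (n * t) if t else 0.0 for t in total]

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path graph 0-1-2-3
scores = approx_closeness(adj, samples=4)      # sampling all 4: exact here
print(max(range(4), key=lambda v: scores[v]))  # a middle node wins
```

With `samples` much smaller than `n`, each estimate costs only a handful of BFS traversals instead of `n`, which is exactly the trade-off the answer above describes.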
[00:20:22] Unknown:
Those are definitely interesting areas to dig into. And I know that, in particular, machine learning on graph structures is an area that's gaining a lot of attention currently. But your point too about the evolvability of graphs, where you're not just working with a static dataset, is an interesting point as well. And I'm wondering what the, I guess, challenges are in terms of being able to diff the state of a graph, where you have a snapshot of the graph that you're doing the computation on and then being able to understand what are the changes to this graph from time a to time b, and then being able to iterate through that graph and just being able to manage the diffing of those states. Because as a graph evolves, it's not going to be, you know, the same as if you're diffing a text file where somebody appends lines, because new nodes might come in that have edges that, you know, feed into nodes that you've already computed. And I'm wondering just how you manage those differential states.
[00:21:19] Unknown:
Yeah. There are lots of techniques that you can use to manage these changes. So a common way to deal with these changes is, for example, finding which nodes are affected by the change. So when I add a new connection or new edge, maybe new shortest paths might be created, thanks to this new edge. So this new edge might change the score of some of my nodes in the graph. So a good starting point here is to understand which nodes are affected by this change and then try to reduce the computation by just focusing on those affected nodes and not processing all the other nodes that are not affected. This was already shown to be an effective way of dealing with this kind of dynamism in networks.
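The "affected nodes" idea above can be sketched for the simplest case: maintaining BFS distances from a fixed source while edges are inserted. Only nodes whose distance can actually improve are touched; everything else is skipped. This is a minimal conceptual sketch, not the dynamic algorithms shipped in NetworKit.

```python
# Sketch: after inserting an edge, update single-source BFS distances by
# relaxing only the affected nodes instead of recomputing from scratch.
from collections import deque

def insert_edge_and_update(adj, dist, u, v):
    adj[u].append(v)
    adj[v].append(u)
    # seed the queue with whichever endpoint just got a shorter distance
    q = deque()
    for a, b in ((u, v), (v, u)):
        if dist[b] + 1 < dist[a]:
            dist[a] = dist[b] + 1
            q.append(a)
    # propagate the improvement, but only through nodes that improve
    while q:
        x = q.popleft()
        for y in adj[x]:
            if dist[x] + 1 < dist[y]:
                dist[y] = dist[x] + 1
                q.append(y)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = {0: 0, 1: 1, 2: 2, 3: 3}    # BFS distances from node 0
insert_edge_and_update(adj, dist, 0, 3)
print(dist)  # {0: 0, 1: 1, 2: 2, 3: 1}
```

Here the new edge (0, 3) only improves node 3, so the update touches one node; a from-scratch BFS would have revisited all four.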
[00:22:05] Unknown:
And then another interesting element of dealing with graphs, particularly as they scale, is being able to effectively parallelize the computations and understanding where to shard the graph structure if you need to scale out across multiple machines. And I'm wondering what are some of the interesting challenges that you've run up against with that, or if you have kind of punted on that problem with NetworKit and just say we deal with structures that can fit in a single machine, or, you know, if there are any aspects of that sort of graph partitioning that you need to deal with in sort of parallelizing the algorithms.
[00:22:39] Unknown:
I didn't deal with this kind of huge graphs that you have to shard across different machines. So I only worked with shared memory algorithms, so where the graph can fit into 1 single machine. However, I also worked with distributed memory, but in a different kind of scenario. So we used multiple compute nodes, and each compute node hosts a copy of the graph. So in this way, when we have to distribute the computation on a single graph, we can not only use parallelism on a single machine to distribute the work over multiple threads, but we can also distribute the work over multiple compute nodes that basically do the same thing on the same graph. And then at the end, we gather the results from all these compute nodes and combine them together to get the results faster than working just on a single machine.
And this also worked quite well in our case. But this, again, requires the possibility of storing the graph on a single machine.
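The replication pattern described above, where every worker sees the whole graph, processes a disjoint slice of the nodes, and the partial results are gathered at the end, can be sketched with standard library threads. This is a toy single-process analogue of the multi-machine setup, not the actual distributed code.

```python
# Sketch of the replicated-graph pattern: the graph is shared read-only,
# each worker handles a disjoint slice of nodes, and partial results are
# combined at the end. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}   # shared, read-only graph

def degrees_for(nodes):
    """Per-worker partial result over its slice of nodes."""
    return {u: len(adj[u]) for u in nodes}

def parallel_degrees(workers=2):
    nodes = sorted(adj)
    chunks = [nodes[i::workers] for i in range(workers)]  # disjoint slices
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(degrees_for, chunks):
            result.update(partial)                        # gather step
    return result

print(sorted(parallel_degrees().items()))
```

Because the graph is never mutated, no locking is needed; the only coordination point is the final gather, which is what makes this pattern attractive when the graph still fits on every machine.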
[00:23:47] Unknown:
In terms of being able to use NetworKit and do some of these network analysis tasks, I'm wondering how much overall background knowledge is necessary in terms of graph theory and graph algorithms, versus somebody who just wants to pull in NetworKit and be able to say, I wanna compute the node centrality score for this node, given this graph, for, you know, purposes of building a web app or something?
[00:24:13] Unknown:
Yeah. I think, yeah, this depends on, yeah, what 1 wants to achieve. So if 1 wants just to use some basic NetworKit functions, then I think that just a basic knowledge of Python and of graph theory, like just what is a graph and how distances are computed in graphs, is already more than enough to start working with NetworKit. So NetworKit algorithms are really straightforward to use. In a few lines of code, you can read a graph, launch some algorithms, and visualize some results. So I think that, rather, to use NetworKit, 1 needs to have a background in network analysis in order to understand which kind of algorithm you want to use to solve your problem and to interpret the results that are yielded by NetworKit algorithms.
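The "few lines of code" workflow described here, read a graph, run an algorithm, look at the result, can be sketched with only the standard library (the actual NetworKit calls differ; see its documentation for the real API).

```python
# Sketch of the read/analyze/inspect loop described above, standard
# library only: parse an edge list, then find the highest-degree node.
import io
from collections import Counter

edgelist = io.StringIO("0 1\n0 2\n0 3\n2 3\n")  # 1. "read" a graph
degree = Counter()
for line in edgelist:
    u, v = line.split()
    degree[u] += 1
    degree[v] += 1
print(degree.most_common(1))                    # 2. run an "algorithm"
```

The point of the answer stands out even in this toy: running the code is easy; knowing that degree (rather than, say, betweenness) is the right measure for your question is the part that needs network analysis background.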
[00:25:03] Unknown:
And then in terms of the capabilities of the project, I'm wondering if there are any aspects of the library or its feature set that you think are underutilized or often overlooked, and just things that you think should be highlighted and mentioned for people who are starting to look at NetworKit or starting to think about ways that they can use graph structures to solve interesting problems?
[00:25:25] Unknown:
So I think that 1 aspect that is still underutilized, because it is very recent, is our new NetworKit in the Cloud system. So we presented this new system at the SIAM Conference on Computational Science and Engineering in March of this year. And, basically, we are offering now the possibility of using NetworKit inside a JupyterLab instance with just your web browser. So basically, you can use NetworKit just using your web browser, without needing to install NetworKit on your machine, because the computation is totally done in the cloud.
So, yeah, of course, the cloud resources are limited, so you cannot run expensive algorithms on large graphs. But at least this offers you the possibility of trying NetworKit without installing it on your machine.
[00:26:19] Unknown:
In terms of projects that you've seen built with NetworKit or ways that you've used it in your own research, what are some of the most interesting or innovative or unexpected ways that you've seen it employed?
[00:26:30] Unknown:
Yeah. So, yeah, I really have to thank you for this question because when I was preparing for today's episode, I did some research on Google Scholar, and I was quite pleased in finding that there are quite a lot of papers out there from other research areas that are using NetworKit in their experimental pipeline. So maybe I can try to mention a couple of them. 1 is from Johan Keutel. He's from the Beuth University of Applied Sciences in Berlin, and he joined the NetworKit Day last year. And this is an event that we organize on a regular basis to get in touch with the NetworKit community, and he presented a paper about bibliographical networks.
So he gathered datasets from various universities in Berlin, and they were about German fictional literature or international scientific bibliographies and other biographical data about the authors. And he created some networks that were connecting writers and institutions. So he created some big graphs and used NetworKit to find more insights about the historical evolution of some discipline, or which were the main characters or agents that developed some field in the literature, or which were the institutions responsible for the development or the relevance of some field. So it was a really interesting talk from him. And on our NetworKit website, we also have a page dedicated to NetworKit Day, and 1 can also download his slides from there. Yeah. This was a really interesting project, I have to say. And the second example that I found on Google Scholar is a paper that is titled Flood Impact Assessment on Road Networks and Healthcare Access at Jakarta, Indonesia.
So as some of us already know quite well, I guess, Jakarta is a huge coastal city, which unfortunately is sinking underwater like Venice. And here, the authors are trying to study the impact of floods in Jakarta, and they're using data from OpenStreetMap. So, basically, the Jakarta road network. And they also use NetworKit to analyze the road network before and after a flood. So in particular, they study the reachability between citizens and the health care facilities. Yeah. They use centrality measures such as betweenness to determine the importance of some roads before and after the flood. And this is quite useful, they showed, to predict congestion on roads and see how the flood impacts the possibility of citizens to reach health care facilities.
[00:29:24] Unknown:
And then as far as your own experience of working on NetworKit and using it for your own research, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:29:36] Unknown:
There are quite a few lessons that I learned in this process. So one of them, I think, is thinking before coding. The first solution that pops up in your mind might not be the best one or the right one for the problem that you are trying to solve. So I think it might be very tempting to start coding immediately without thinking enough. This often results in losing time because you have to fix the code later if you didn't implement the right algorithm or didn't implement it in the right way. But another cost, I think, of wrong decisions is that if your code ends up in production, then some users might rely on some functionality or some API that you later want to remove or change. And you have to take this into account, so you cannot take arbitrary decisions later, like removing functionalities or changing APIs, because this might break other people's pipelines. So it's really important to think thoroughly before introducing some important change or new algorithms because, yeah, it might be quite hard to undo these actions.
Yeah. I think that another lesson that I learned, which applies to lots of people, is not being intimidated by criticism from other developers. So many developers might raise criticism of your code, especially during a code review. I have to admit that some of my contributions to NetworKit were not so brilliant. Other more experienced developers would bluntly say, why are you doing this? This is really bad for this or that reason. And immediately, I felt like, oh, no. Now my incompetence is shown to the whole world, and I'm really ashamed of that. But the thing is that these criticisms are really, really useful because, yeah, they taught me quite a lot. So I learned quite a lot from them. Now I'm really, really thankful for those people
[00:31:36] Unknown:
who spent time criticizing my code because they taught me a lot of useful lessons for the future. Yeah. Code review is always a scary time, and there are definitely a lot of issues around things like imposter syndrome and the question of trying to remove your ego from your code, and sort of, like, your code is not you, but it's a complicated issue, particularly for people who are first getting involved in open source. And then in terms of, you know, if somebody is looking to do some network analysis and they're figuring out which library to build on top of, what are the cases where NetworKit might be the wrong choice for doing some sort of network analysis or graph structure discovery?
[00:32:16] Unknown:
Yeah. So I can think of at least 2 scenarios where NetworKit is not the right choice. One is, as we were saying before, when you are dealing with graphs that are so big that they cannot fit on a single machine and you need a distributed system. So as I said, NetworKit is a shared memory library, so we do not support distributed memory. And you want to go for other kinds of libraries if you are dealing with these kinds of graphs. And another scenario, I think, is when you want efficient algorithms that are tailored to GPUs.
So, for example, algebraic algorithms. Algebraic algorithms often rely on operations with matrices and vectors, which can often run much more efficiently on a GPU rather than on a CPU. So far, NetworKit does not exploit GPUs. So in that case, I would suggest going to other libraries such as Gunrock or Hornet because they really take advantage of the resources provided by GPUs.
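The algebraic framing mentioned here can be illustrated without a GPU: a breadth-first search step is just a matrix-vector product with the graph's adjacency matrix, which is exactly the kind of operation GPU libraries accelerate. A minimal NumPy sketch (the toy graph and function name are hypothetical, not part of any of the libraries discussed):

```python
import numpy as np

# Adjacency matrix of a small undirected path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

def bfs_levels(A, source):
    """Algebraic BFS: each level expansion is one matrix-vector product."""
    n = A.shape[0]
    frontier = np.zeros(n, dtype=int)
    frontier[source] = 1
    visited = frontier.astype(bool)
    levels = {source: 0}
    depth = 0
    while frontier.any():
        depth += 1
        # A @ frontier marks every neighbour of the current frontier.
        nxt = (A @ frontier > 0) & ~visited
        visited |= nxt
        for v in np.flatnonzero(nxt):
            levels[int(v)] = depth
        frontier = nxt.astype(int)
    return levels

print(bfs_levels(A, 0))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

Because the whole traversal reduces to repeated (sparse) matrix-vector products, frameworks that map such products onto GPU hardware can run these algorithms far faster than a scalar CPU loop.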
[00:33:21] Unknown:
As you continue to build out NetworKit and as you're working to finish out your PhD, what are your plans for the near to medium future of NetworKit or just some of the overall roadmap of the project?
[00:33:35] Unknown:
Yeah. So the wish list is quite long, but, of course, the resources are quite limited. So right now, as I said, we are working on a new and more efficient graph data structure. Right now, the graphs are stored as adjacency lists, which, yeah, we know pose some limitations in terms of memory. This is one direction we are trying to take. Another one is to add support for node and edge attributes, so one can store additional attributes inside a NetworKit graph instead of being forced to store them elsewhere. Another feature we are working on is building a new benchmarking framework for NetworKit.
Specifically, we want to answer the question: if I apply this change, which algorithms become faster or slower? Right now, we have to do this manually. We have to set up some small experimental pipeline and measure how fast an algorithm becomes after a specific change, and this is not really scalable. So we want to automate all of this process and measure how performance changes if we make a specific change in the code base.
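The memory limitation of per-node adjacency lists mentioned in the roadmap is commonly addressed with a compressed sparse row (CSR) layout, where all neighbours live in one flat array indexed by an offsets array. This is a generic sketch of that idea, not NetworKit's actual internal data structure:

```python
def to_csr(adj_lists):
    """Flatten per-node adjacency lists into CSR-style arrays.

    Returns (offsets, neighbors): node u's neighbours are
    neighbors[offsets[u]:offsets[u + 1]].
    """
    offsets = [0]
    neighbors = []
    for nbrs in adj_lists:
        neighbors.extend(sorted(nbrs))
        offsets.append(len(neighbors))
    return offsets, neighbors

# Toy undirected triangle graph: edges 0-1, 0-2, 1-2.
adj = [[1, 2], [0, 2], [0, 1]]
offsets, neighbors = to_csr(adj)
print(offsets)    # [0, 2, 4, 6]
print(neighbors)  # [1, 2, 0, 2, 0, 1]
```

Two contiguous arrays replace many small per-node lists, which cuts per-list allocation overhead and improves cache locality during neighbourhood scans, at the cost of making edge insertions more expensive.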
[00:34:50] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the author Edgar Allan Poe to feed off of my pick last week of H. P. Lovecraft. So if you're looking for an interesting read or some short stories, Edgar Allan Poe has always got some interesting stuff. So definitely worth checking out if you haven't ever read anything by him. So with that, I'll pass it to you, Eugenio. Do you have any picks this week? Yes. I have.
[00:35:21] Unknown:
So, yeah, during the recent lockdown months, I have to say that books have been a great company to me. And in particular, one of the books that really helped me in coping with this sort of forced solitude was The Spinoza Problem. This book was written by Irvin D. Yalom. Due to his ideas, Spinoza was forced to leave his Jewish community in Amsterdam and develop his new ideas alone. Yeah. His life and his revolutionary ideas, written up as a novel, were very inspiring to me. So that's a book that I strongly recommend.
[00:36:05] Unknown:
I'll definitely have to take a look at that. Well, thank you very much for taking the time today to join me and share the work that you're doing on NetworKit. It's definitely a very interesting library and interesting problem domain. So I appreciate the time and energy you're putting into that. Best of luck on your PhD, and I hope you enjoy the rest of your day. Yeah. Thank you too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Eugenio Angriman and NetworKit
Eugenio's Journey with Python and Graph Algorithms
Transition from Graph Tool to NetworKit
Applications of Large-Scale Network Analysis
NetworKit's Position in the Python Ecosystem
Algorithmic Challenges in Network Analysis
Architectural Design of NetworKit
Ensuring Code Quality and Onboarding Contributors
Active Research Areas in Network Analysis
Parallelizing Computations and Graph Partitioning
User Accessibility and Background Knowledge
Innovative Uses and Projects with NetworKit
Lessons Learned from Developing NetworKit
When NetworKit Might Not Be the Right Choice
Future Plans and Roadmap for NetworKit
Picks and Recommendations