Summary
If you are interested in a library for working with graph structures that will also help you learn more about the research and theory behind the algorithms, then look no further than graph-tool. In this episode Tiago Peixoto shares his work on graph algorithms and networked data and how he has built graph-tool to help in that research. He explains how it is implemented, how it evolved from a simple command line tool to a full-fledged library, and the benefits that he has found from building a personal project in the open.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Tiago Peixoto about graph-tool, an efficient Python module for manipulation and statistical analysis of graphs
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what graph-tool is and the story behind it?
- What are some scenarios where someone might encounter a graph-oriented data set?
- In what ways are those graphs typically represented?
- In your experience, what is the overlap of people who are working with networked data, and the use of graph-native databases? (e.g. Neo4J, DGraph, etc.)
- What kinds of analysis or manipulation might someone need to perform on a graph structure?
- There are a few different tools in Python for working with networked data. How would you characterize the current ecosystem and why someone might choose graph-tool?
- Can you describe how graph-tool is implemented?
- How have the goals and design of the package changed or evolved since you first began working on it?
- Who are your target users and what are the guiding principles that you use to inform the API design for the package?
- How much knowledge of graph theory or algorithms are required to make effective use of graph-tool?
- Can you talk through an example workflow of using graph-tool to load, process, and analyze a graph?
- What are some of the overlooked or underutilized aspects of graph-tool that you think more people should know about?
- What are some systems/applications that you have seen which would be simplified by adopting a graph model for their data?
- What is your impression of the overall awareness of the benefits of graphs for simplifying aspects of data processing and analysis?
- What are some cases where a graph structure adds unnecessary complexity?
- What are the most interesting, innovative, or unexpected ways that you have seen graph-tool used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on graph-tool?
- When is graph-tool the wrong choice?
- What do you have planned for the future of graph-tool?
Keep In Touch
Picks
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Central European University
- NetworkX
- GML
- GraphML
- Neo4J
- DGraph
- NetworKit
- igraph
- Matplotlib
- C++ Templates
- Boost Graph Library
- OpenMP
- Maximum Matching
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40-gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Tiago Peixoto about graph-tool, an efficient Python module for manipulation and statistical analysis of graphs. So, Tiago, can you start by introducing yourself?
[00:01:09] Unknown:
Sure. So thank you very much, Tobias. So, yes, I am an associate professor at the Central European University in Austria, in the Department of Network and Data Science. I'm a physicist by training and practice, and I work in a field called network science.
[00:01:26] Unknown:
And do you remember how you first got introduced to Python? Yes. I was doing my PhD.
[00:01:31] Unknown:
I was doing a lot of computation, lots of scientific simulations and scientific computations, and this is something that has many aspects. On the one hand, you want to run algorithms that are fast, but you also want to do things like plotting and high-level analysis. And at that time, Python started permeating and percolating through the scientific community because it made things much easier. It's a very expressive language, easy to write, easy to understand, and it's very appealing for scientific computing
[00:02:03] Unknown:
as well. So this is what brought me into it. And so as part of the work that you've been doing for some of your physics research, that has led you to building the graph-tool library. So I'm wondering if you can just start by giving a bit of an overview of what it is that you've built there, some of the story behind why you felt it was necessary to create it, and the overall goals that you have for the project?
[00:02:24] Unknown:
So Python has this very appealing aspect to it. It's a very clean language. It's super nice to write. But it has a drawback, which is that it's a slow language. It's an interpreted language. And for scientific computation, that's a big deal, because you want your code to run fast. You want to work with large data or do large simulations that you want to run as fast as possible. And if you just do this with Python, you have a problem. So I was working with networks. Graph-tool is a library to manipulate networks, to manipulate graph structures. I needed to work with graphs. During my PhD, that was the topic of my research, so I needed to write code to do this, and I wanted to do it in Python. But you can't just do it in Python. You have to somehow split it. So at that time, there were already other libraries.
So I started this already in 2006, so it's quite a long time ago, when I was in the middle of my PhD. And I think NetworkX, which was a very popular library to work with graphs, already existed at the time, but it did not attempt to solve the problem that I wanted, which is that NetworkX is a pure Python graph library. So it's implemented in Python, and there is no consideration at all for performance. And so it didn't work for me at all. I couldn't use it to do my research. So I decided to take this NumPy philosophy. Right? So NumPy is this very central library for numerical analysis in Python. And what they do is they just take the stuff that needs to run fast, the loops, you know, the linear algebra stuff, and they push it into C, implement it in C, and then they provide essentially a binding to these data structures and algorithms that are implemented in C. So I wanted to do exactly the same thing, but for graphs.
So graph-tool is precisely that. All the data structures, the main data structure, the graph itself, and the auxiliary stuff, and the algorithms, they're all implemented in C++. Right? So not implemented in Python. Then you have some bindings to this. Right? So whenever you run a function, graph-tool just dispatches to this underlying C++ layer, which is then very fast. But then you can do the high-level stuff in Python. So in the end, it's just like programming in pure Python when you're using graph-tool, but whenever the actual important bits, the important algorithms, run, they run in pure C++. So you get the best of both worlds. Right? You get code which is very, very fast, and you get to write the high-level code in Python. That's the idea.
[00:04:51] Unknown:
And so in terms of the usage of graph-tool, as you mentioned, it's a library for being able to do analysis of network datasets and graph analysis. But I'm wondering what are some of the scenarios or industries or research avenues where somebody is likely to encounter some of these graph datasets or want to produce them.
[00:05:13] Unknown:
Yeah. It's a very broad question. There are many, many scenarios where this can happen. I would say that whenever the data is relational, when it represents interactions between pairs of things, then you automatically have a network. So, for example, a friendship between people, say in an online social network. Right? That's a relationship between two people, two nodes in a network. Who follows whom, for example, on Twitter? If you have this kind of data, then you automatically have a graph structure. Right? You have nodes, which are the entities, and the edges or links, which are the connections between them.
Right now, for example, we are in the middle of, hopefully past the middle of, this global pandemic, and if you want to understand how this process goes about, you have to understand a dynamical process that takes place on a network. Right? You have people who can be infected with the disease, and they spread it to their friends, so it spreads over the edges of the network. There are many other situations. For example, suppose you have a website. You're Amazon. You're selling items. You want to understand what users are buying. So you can represent this as a graph where the items are nodes, and the users, your customers, are also nodes, and then you have connections between them. These are the purchases.
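To make the purchase example concrete, here is a minimal sketch in plain Python (a conceptual illustration with made-up data, not graph-tool's API) of a bipartite purchase graph, where customers and items are the two kinds of nodes and each purchase is an edge:

```python
from collections import defaultdict

# Hypothetical purchase records: (customer, item) pairs, i.e. the edges
# of a bipartite graph whose two node sets are customers and items.
purchases = [
    ("alice", "book"), ("alice", "lamp"),
    ("bob", "book"), ("bob", "kettle"),
    ("carol", "lamp"),
]

# Adjacency view of the bipartite graph: each node lists its neighbors.
bought_by_customer = defaultdict(set)  # customer -> set of items
buyers_of_item = defaultdict(set)      # item -> set of customers
for customer, item in purchases:
    bought_by_customer[customer].add(item)
    buyers_of_item[item].add(customer)

print(sorted(buyers_of_item["book"]))  # ['alice', 'bob']
```

Once the data sits in this shape, questions like "which customers are connected to this item?" become simple neighborhood lookups rather than scans over flat records.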
Then you can try to do all sorts of things. You might want to predict what kind of user will buy what kind of product. This is a recommendation system, a recommendation problem. It's inherently
[00:06:40] Unknown:
a network problem. As far as the actual representation of these network datasets, I'm wondering what the format typically is that you've been working with and some of the formats that you support within graph-tool.
[00:06:53] Unknown:
The data structure itself, you know, mathematically, these graphs are just matrices. But in reality, there's an underlying property of all these networks that we usually work with, which is that they are sparse. So if you consider all these examples I've given you, like friendships between people, items and purchases, etcetera, most possible connections do not exist. Right? We say that these datasets are sparse. The actual existing edges are a very small subset of all possibilities. And this requires certain kinds of data structures. In particular, there is one which most of these algorithms are based on, which is an adjacency list. So for each item, for each node, you just list its neighborhood, right, the edges that are incident on this node. It's a relatively simple data structure. In terms of format, there are many ways to store this in a file. The most trivial is just a list of edges. Right? You just give pairs of nodes. They tend to have very, very simple representations. Of course, it can get a little more complicated when you start putting attributes on the nodes, attributes on the edges. And there are a few standardized formats, talking now about file formats, not necessarily memory formats.
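A rough illustration of why adjacency lists suit sparse graphs (a plain-Python conceptual sketch, not graph-tool's internal C++ representation): the dense matrix stores every possible connection, while the adjacency list stores only the edges that exist.

```python
# Edge list: the most trivial representation, one node pair per edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4  # number of nodes

# Dense adjacency matrix: n*n entries, almost all zero in a sparse graph.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1  # undirected: symmetric entries

# Adjacency list: each node stores only its incident edges.
adjacency = [[] for _ in range(n)]
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

print(adjacency[2])  # neighbors of node 2: [0, 1, 3]
```

For a network with millions of nodes but only a handful of edges per node, the matrix would need trillions of entries while the adjacency list grows only with the number of actual edges.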
But for file formats, there are some variations, some standards like GraphML or GML, etcetera. So these are the standard formats which graph-tool supports. It even has its own binary format, which is meant to be able to store and load efficiently
[00:08:22] Unknown:
from disk. Another way that I've seen a lot of people interacting with graph networks and graph data structures is with an actual graph-oriented database, such as Neo4J or DGraph, and there are a number of others that have built-in support for that. I'm curious what your experience has been of the overlap, particularly in the sciences, of people who are dealing with these networked data structures and people who are actually using a graph database to assist with that and, you know, some of the ways that graph-tool might either complement or sort of obviate the need for a graph database.
[00:08:58] Unknown:
So I think these communities are disjoint to a large extent. And the reason, I think, is that graph databases are meant for applications where you only look at a very small portion of the graph at any given moment. So you have a truly large graph that probably doesn't even fit in memory, and you're just querying this graph, looking at particular pieces of it or streaming through its contents. These graph databases are used as infrastructure for, you know, websites and all sorts of things. Libraries like graph-tool and others like NetworkX, igraph, and so on are used mostly by researchers. So these are situations where you can actually load the graph. It can be large, but it actually fits in memory on your laptop or your server or your HPC cluster, so you can load the graph and then do some analysis on it. You want to run complicated algorithms on the whole network. That's typically how these things are done. So, of course, there are overlaps. I'm sure there are people that are taking parts of a graph that exist in a big database and loading them up in graph-tool to do some sort of analysis. I'm sure there are people doing this kind of stuff out there, but you don't see many people in research and academia using the database directly, because they're usually meant for storage
[00:10:14] Unknown:
and representation rather than the analysis itself. Right? So I would say this is probably the point of contact that they have. And as far as the kinds of analysis that you're going to be doing on these graphs, I'm wondering what are some of the typical operations, or the typical background that somebody needs to be able to effectively use a tool like graph-tool?
[00:10:36] Unknown:
It depends a lot. It varies tremendously, right, because you can do all sorts of things with it. And depending on what you do, it requires a different kind of background. So if you're a researcher, you know, and you want to simulate some process on a network, you can use graph-tool to do that. You might want to simulate epidemic spreading or opinion spreading on the structure of the network, and you might want to investigate what kinds of network properties or network structures influence that. For that, you need, of course, to have a certain understanding of what it is you're doing, and graph-tool allows you to implement these things. To give an example, predicting edges, like a recommendation system: you want to predict what a user is going to do, given what they have done in the past, right, in terms of purchases on the website. You can use graph-tool to do this. You might be a researcher in a drug company, and you want to predict whether, if a person takes two drugs, this is going to have an adverse reaction or a neutral reaction or a positive reaction.
This is, again, a type of edge prediction. You can do that using the algorithms that are present in graph-tool. You might want to find clusters of nodes that are similar. Right? So let's say in a social network, you have people talking to people, being friends, and you might want to say, okay, are there groups of people here who are similar? This is a task known as network clustering. So this you can do with graph-tool. You might want to find the most important nodes in the network. So, you know, if you want to vaccinate people, right, it's, again, a very timely topic.
Suppose you have a finite number of vaccines. Who should you vaccinate first? Right? This is the kind of question you can answer with network methods. Now, in terms of what background you need, it depends on the question you want to ask. So if you just want to do something simple, like load the network and do a simple visualization, you don't need much in terms of background. But if you want to go deeper into these examples I gave you, yes, you need to understand a bit of what's going on. In graph-tool, in particular, you know, there is quite a bit of documentation available on the website, in the docstrings.
And I make the effort of putting citations there, to cite the source material. So it's actually a good place to learn things. You know, if you want to know a lot more about what a particular function is doing, you can just go and read up on the literature. So I did this mostly for myself as well, because I forget things, so I put everything in the documentation. In the end, it ends up also serving as a good reference.
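For the "who to vaccinate first" question above, the simplest possible proxy for importance is degree, the number of contacts a person has. A toy sketch in plain Python with hypothetical data (graph-tool provides far more sophisticated centrality measures; this only illustrates the idea):

```python
from collections import Counter

# Hypothetical contact network: each edge is a pair of people who interact.
contacts = [
    ("ana", "ben"), ("ana", "cid"), ("ana", "dee"),
    ("ben", "cid"), ("dee", "eve"),
]

# Degree centrality: count how many edges touch each person.
degree = Counter()
for u, v in contacts:
    degree[u] += 1
    degree[v] += 1

# Prioritize highest-degree people: removing them cuts the most edges.
priority = [person for person, _ in degree.most_common()]
print(priority[0])  # "ana", with degree 3
```

Real network epidemiology uses much richer measures (betweenness, eigenvector centrality, and so on), but the workflow shape is the same: build the graph, score the nodes, rank them.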
[00:12:56] Unknown:
Yeah. It's definitely an interesting use case for the documentation in the library, being able to point to further research to give a deeper understanding. That's something that I'll have to take a deeper look at once we're done with this conversation, because I think graph structures and network datasets are an interesting area and something that I've always been kind of poking at. So it's great that you're adding that extra information and those extra pointers for people to be able to dig deeper. There's also a coupling to my own research. So I should say that my main activity is research. I'm a researcher. I develop new methodologies, and I ask scientific questions based on network data, and try to answer theoretical questions as well.
[00:13:33] Unknown:
And what ends up happening is that I develop a new model or a new method, and this goes into the library. Right? This library exists mostly for myself. I mean, I'm the primary user. Of course, I am happy to make it available for everyone and make an effort there as well, but it reflects my research activities. So you usually see new code and new methods coming up all the time. That's definitely interesting,
[00:13:58] Unknown:
sort of a living research paper in code. That's exactly right. That's awesome. And in terms of the overall space of libraries in Python for being able to deal with these network datasets, you mentioned a few. NetworkX is sort of the canonical one that most people have come across. There's graph-tool that you're building. There's another one that I've come across recently called NetworKit, and I'm sure that there are others that I'm not familiar with. I'm wondering if you can just give your perspective on the ecosystem of tools that exist for these types of operations and your thoughts on graph-tool's position within that space.
[00:14:36] Unknown:
Yeah. So the central point here is that graph-tool is meant to be fast. That's the thing that is absolutely the central goal here. Right? So as I mentioned in the beginning, when I started doing this, NetworkX already existed. It was not as well developed at the time as it is now. I think it was something like 1 year old or something. And I did notice it, but as I said, its central design is being implemented in pure Python. For me, it works more as a reference, or to do repeated analysis on small networks. It's just too slow.
So, with NetworkX, it's very easy for people to get started, and it's very comprehensive. There are lots of things there. But if you want to apply this, you know, for heavy-duty computation, it very quickly becomes too slow. Other than this, there's igraph, which is a library that I think started more or less at the same time that graph-tool started. I was not aware of it when implementing this. It has a similar idea to graph-tool. It's implemented in C, and then it has some bindings. At the time it started, I think it only had bindings to R, and only later on did they put in Python. And, although I have to be honest, I don't follow it too closely,
it feels to me like Python is not central to it. It's just one of the appendages of igraph. And there are more. There's NetworKit, which I think you mentioned. I think it has a very similar design to graph-tool. I would say that speed-wise they're basically comparable. It's not slower than graph-tool, or vice versa. But NetworKit is a much younger project, and as far as I know, it's not as comprehensive as graph-tool. To be honest, I was a bit puzzled when it came out. I was like, okay, they're doing it again. But, you know, I don't blame them for doing their own code. It's fun. I don't have a lot of experience with it; I just took a look at it. So this is more or less the landscape as far as I'm aware.
Graph-tool has this main advantage that it's very fast, and there are other things too, compared to all these other ones. For example, graph-tool has visualization. I put a lot of effort into doing good visualization code, and this is something that NetworkX, for example, lacks. NetworkX, I think, has a visualization engine that's based on Matplotlib, which is a rational choice, but the visualizations it produces are not that nice-looking. Well, of course, there's some amount of opinion here. Right? But I think it's fair to say that graph-tool has a more elaborate visualization engine, and it produces more interesting-looking visualizations for larger networks and so on. This is also true compared to igraph and the others. So I think that's a good advantage of graph-tool. On top of this, as I said, it connects to my own research, and I happen to work on, among other things, network clustering using statistical inference, which is by now the state of the art for this kind of thing. Right? This is already available in graph-tool.
It has very efficient code for this, and this is not available in any of these other libraries.
[00:17:48] Unknown:
And as far as the actual implementation of graph-tool, you mentioned that the core of it is in C and C++ for speed, with Python wrappers for it. But I'm wondering if you can just describe a bit of the architectural aspects of it and some of the design choices and your overall design philosophy that has guided your development of the project?
[00:18:07] Unknown:
The idea is to do the heavy-duty stuff in C++. So, actually, graph-tool is a relatively thin wrapper around a C++ core. I use C++ with templates, right, which allows you to write code with quite a bit of abstraction, all of which completely evaporates during compilation. Right? So you get to write less, but end up with very efficient code. Although you have to, as a trade-off, accept a rather excruciating syntax. Right? There's no comparison with Python in that regard. But if you pay this price, you get this excellent payoff, which is just very, very fast code and, yes, compact code as well to some extent. I use the Boost Graph Library, which is this excellent template meta-library for working with graphs. The Boost Graph Library itself contains many graph algorithms.
I certainly make use of this. So I base myself a lot on the BGL, the Boost Graph Library. The idea is that, you know, whenever you run a function in graph-tool, it dispatches to C++ as soon as possible. And by the moment it's running, Python doesn't exist anymore. Right? That's a central design choice here. While the CPU is hitting 100%, the Python interpreter is out of the picture. Right? You get parallelization very easily as well. So I use OpenMP a lot, which is very easy to use at the source level. So, you know, for algorithms that allow this, that are parallelizable, you can make use of all your cores, which, by the way, is something that stands out in graph-tool as well. igraph, for example, doesn't do that. It doesn't have many parallel algorithms.
This is the general philosophy. It's not revolutionary. It's the same idea that NumPy had. Right? That's exactly how they do things. They don't use C++, they use C, but it's basically the same philosophy.
[00:20:00] Unknown:
And as far as the overall goals of the project and some of the design and architectural elements, I'm wondering how those have changed or evolved since you first began working on it. Yeah. It evolved as I did. Right? Because the main target user is myself.
[00:20:14] Unknown:
The reason why I started doing this, I had no hopes of this becoming a long-term project. I just did it because when I write things, I tend to forget. And I noticed right away in my research that there were certain things that I would like to use again. Right? In fact, graph-tool started out not even as a library, but as a command line tool. That's where the name comes from. But I quickly realized that it needed to be a library, so I started separating the stuff that I intended to use in the future from what was just ad hoc for a particular project. For the things that I wanted to keep around, I started documenting them very quickly, because I tend to forget what a function does, what the parameters are, and so on. So my guideline is essentially what works for me, what I find to be clean, what I find to be a good choice. I'll give you some examples. There's lots of taste involved here, and I don't have a lot of experience with using NetworkX, because I use essentially graph-tool for everything.
But if I compare with NetworkX, they have essentially, for each variation of a given algorithm, a different function that does that variation. Whereas in graph-tool, I tend to have one function that does essentially a task, and then there are parameters that dictate how the variations are done. I find this to be cleaner and easier to keep in mind, you know, both to write the code and to use the code. I find this kind of design choice better, but I don't have a principle behind it. It just feels better like this. And how has it evolved? Well, it became cleaner over time. I think there was quite a bit of unification.
[00:21:43] Unknown:
That was in the beginning. Right? That's because I was essentially learning how to do all this, and I was correcting mistakes. Over time, this has happened less and less, and things have more or less converged into a coherent style, as far as I can tell. In terms of the actual workflow of using graph-tool and building on top of it and analyzing graphs, you mentioned that it started off as a command line tool and then morphed into a library. So I'm wondering if you can just give a bit of an overview of how much of the command line nature is still available and how much of it is driven by other Python code that you're going to write and wrap around it. That is gone completely. Those were primordial times. It's just a library. And because of this, the workflow can vary quite a bit. Right? It depends a lot on what you're doing.
[00:22:27] Unknown:
So it adapts to a very big variety of workflows. But I'll give you some examples. What I find myself doing very often is using an interactive session. So I just open an interactive Python shell. I don't use notebooks very much, but you can definitely use this in a Python notebook, and it has built-in support for this. All the visualization, for example, works perfectly fine in the Python notebook. And then you load a file or load some other data and convert it into a graph format. You run some algorithms and do some visualization in an interactive way. So this is a very typical usage. Another thing that I do very often, and I imagine other people do too, is run graph-tool on an HPC cluster to do some heavy-duty analysis, you know, thousands or millions of graphs, or graphs of huge size, things that might take days to run, etcetera.
That's the other usage pattern. So, of course, as a library, it's totally suitable for that as well. And you know you're not helping to destroy the planet, because you're actually using all these cores for actual computation, not just for the Python interpreter to be shuffling things around. Right? So it's efficient for that. It's meant to get the best of both worlds in this regard. Right? So it feels like a Python library when you're using it, but it's also very efficient. Now, to be fair here, as I mentioned, it reflects a lot the design choices of the library, in fact how I think, what I think is intuitive, but also the constraints of the C++ data structures. Right? So the graphs, for example, the nodes and edges, have representations which are less flexible than, for example, what you get in NetworkX. In NetworkX, the graphs are just dictionaries of dictionaries, and they can hold any objects. Here, of course, it doesn't make sense to develop a library like this, because it needs to run fast. So the graph data structure is based on an adjacency list, and then we have property maps that map the nodes and edges to anything you would like. This is, of course, general enough. You can do everything with it, but you have to think around these kinds of concepts.
I'm, of course, perfectly used to this, but a person that comes to it fresh might think at first that this is a bit constrained, but it's a requirement of the design choice.
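The property-map idea described above can be sketched in plain Python (this mimics the concept only; graph-tool's actual property maps are backed by C++ arrays): nodes are just integer indices, and each attribute lives in its own flat array indexed by node, rather than being stashed inside nested dictionaries.

```python
# Nodes are plain integer indices into a fixed-size graph.
n = 3
adjacency = [[1, 2], [0], [0]]  # node 0 connects to nodes 1 and 2

# NetworkX-style: attributes live inside nested dicts on the graph itself.
nested = {0: {"name": "hub"}, 1: {"name": "leaf"}, 2: {"name": "leaf"}}

# Property-map style: one flat array per attribute, indexed by node.
# Graph structure and attributes stay separate, which is what lets a
# C++ core iterate over plain contiguous arrays at full speed.
name = ["hub", "leaf", "leaf"]
weight = [0.5, 1.0, 1.0]

# Same information, different layout:
assert name[0] == nested[0]["name"]

# Aggregating an attribute over a node's neighbors is an array walk.
print(sum(weight[v] for v in adjacency[0]))  # 2.0
```

The constraint Tiago mentions is visible here: you cannot hang arbitrary objects directly off a node, but any number of typed attribute maps can sit alongside the graph.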
[00:24:33] Unknown:
As people start to get involved with using graph tool and start to build analyses on top of it, how much understanding of the sort of graph theory and some of the associated algorithms do they need to be aware of to be able to be effective with it versus being able to use your documentation and references to just discover that and explore along the way?
[00:24:55] Unknown:
It depends on what they're doing. So you can do something very simple. If you're just cleaning up some data, you can use graph-tool for that, and then you don't have, of course, to go deep into it. If you're using some more complicated algorithm, if you're trying to find the maximum matching, let's say, then you have to know what that is in order to make sense of what the output is. So it depends on what you are doing. The documentation is very extensive, and there are many examples. So I hope at least people can use it to explore and to learn things. It tries to explain in a concise way what everything is doing, and, you know, it gives references.
So even if you're not entirely familiar with the concepts right at the beginning, you should be able to get somewhere just by looking at the documentation.
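To illustrate the kind of concept he mentions, maximum matching can be sketched in a few lines of plain Python using Kuhn's augmenting-path algorithm for bipartite graphs. This is a teaching toy, not graph-tool's implementation (graph-tool builds on the Boost Graph Library for this kind of algorithm), and the small instance at the bottom is invented:

```python
def max_bipartite_matching(left, adj):
    """Size of a maximum matching in a bipartite graph, via augmenting paths.

    left: iterable of left-side vertices.
    adj:  dict mapping each left vertex to its right-side neighbours.
    """
    match_right = {}  # right vertex -> its currently matched left vertex

    def augment(u, visited):
        for v in adj.get(u, ()):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere
            if v not in match_right or augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    return sum(augment(u, set()) for u in left)

# Toy instance with edges a-1, a-2, b-1, c-2: the maximum matching has size 2
size = max_bipartite_matching(["a", "b", "c"],
                              {"a": [1, 2], "b": [1], "c": [2]})
```

Knowing what "maximum matching" means, as he says, is what lets you make sense of the output: here only two of the three left vertices can be matched, because `b` and `c` compete for distinct partners while `a` is flexible.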
[00:25:36] Unknown:
In terms of the systems and applications that you have been building with graph-tool and that you've seen people build with graph-tool, or systems that you see that are using more flat data structures or, you know, nested documents that might benefit from a graph structure, I'm just wondering if you can give some examples of ways that using graph algorithms and graph structures can simplify types of analysis that you have seen people struggle with in other formats. I think graph analysis
[00:26:07] Unknown:
allows you to see the big picture of the data. If you don't represent it like this, then you are often constrained to doing simple aggregate statistics. Right? You know? But if you represent it as a graph, you can actually see the relationships. So, for example, I'll just go back to an example I already gave, because it's actually a good one: recommendation systems. Right? You have users doing things, users buying things on your website, so customers and items. If you construct this as a graph, then you can do things that you couldn't do before. Right? If you don't construct this as a graph, you can just count things. You can count, okay, how often has this item been bought, or things like this, from which you can extract only a very basic level of understanding of what's going on. But once you start looking at these relationships, then things change quite a bit. Then you can, for example, do clustering. You can see, okay, what are the categories of items that you have? What are the categories of customers that you have? And what is their behavior? You can start to be able to predict what they're going to do. These are all things you couldn't do if you don't represent it as a graph. Now, whether this simplifies things, I'm not sure. It certainly exposes a lot of the structure.
Yeah. So I'm not sure if there is a situation where graph-tool simplifies things, but it allows you to do all sorts of things you couldn't do before.
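The user-item example above can be sketched in plain Python. The purchase log is made-up data for illustration; the point is the jump from plain counts (the aggregate view) to relational structure (the bipartite graph and its item-to-item projection, which is what clustering operates on):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical purchase log: (customer, item) pairs (invented data)
purchases = [("ana", "bread"), ("ana", "milk"),
             ("bob", "bread"), ("bob", "milk"),
             ("eve", "wine")]

# Bipartite customer-item graph, stored as two adjacency dicts
items_of = defaultdict(set)
buyers_of = defaultdict(set)
for user, item in purchases:
    items_of[user].add(item)
    buyers_of[item].add(user)

# Aggregate view: only counts, no relationships
popularity = {item: len(users) for item, users in buyers_of.items()}

# Relational view: project onto items, weighting pairs by co-purchases;
# clustering this one-mode projection groups items bought together
co_purchases = defaultdict(int)
for user, items in items_of.items():
    for a, b in combinations(sorted(items), 2):
        co_purchases[(a, b)] += 1
```

From the counts alone you only learn that bread was bought twice; the projection additionally reveals that bread and milk are bought together, which is the kind of structure a clustering algorithm can pick up.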
[00:27:20] Unknown:
And so graph-tool, as you mentioned, is a project that you've been working on for a number of years, and you've been adding new capabilities and new algorithms to it for a while. And so for somebody who is more of a newcomer to graph-tool and is just adopting it, maybe using it somewhat naively, I'm curious if there are any aspects of the library that you think are often overlooked or underutilized, or that you think deserve to be called out specifically, that might help people who are newer to the project.
[00:27:48] Unknown:
Yeah. There are a couple of things that I think are hard to explain in the documentation. Right? When people hear of the library, they go to the website and read the documentation, and there are a couple of things that are hard to express there, so maybe people don't notice them. For example, graph-tool has interactive visualizations. In an interactive session, you can load up a graph, you can draw the graph in a window, and you can actually use your mouse to move the nodes around, to rotate the graph, and do other basic manipulations. We were talking before about other libraries, and I mentioned the libraries that I know, but there are also other kinds of programs for network analysis that are visual. They're not libraries.
They're things like Cytoscape and Gephi, for example. Gephi, for example, is written in Java. It's a graphical program with a graphical user interface where you just load a network and you visualize the network. And so it's a completely different paradigm, right, than a Python library. But with graph-tool, you can actually do quite a bit of interactive visualization. Right? You're actually using your mouse to do things, to make adjustments, zoom in, zoom out, and this kind of thing. And this is very hard to convey. You cannot put this in a docstring. You can write it, and if people read it, they'll be able to find it. On the website, there are some examples, but it's hard to tell people this. But it's definitely there. Another thing, and this is actually documented, but I'm not sure to what extent people are really aware of it or use it, is that you can use graph-tool together with C++. So you can write C++ extensions. There's a problem with this paradigm of writing a Python library which dispatches to C++, which is that whenever you think of an algorithm which is not implemented in the library, of course, there is a low-level Python API to manipulate the graph, like adding edges or navigating through the nodes and edges. Of course, you can do that via Python. But if you implement an algorithm using this low-level API, it's going to be just as slow as if you implemented it in NetworkX or anything else, because you're not really doing anything in C++. But what you can do is write extensions, C++ extensions, that run just as fast as the native code in graph-tool.
This is documented, and there are some examples in the documentation about how to do this, but I think this is underutilized. I think people could make use of this. It's probably because, you know, fewer people are familiar with, and inclined to write, C++ code with templates. But if you are so inclined, you can actually do this. And this gives you quite a bit of an advantage.
[00:30:14] Unknown:
Conversely to my question about places where people should be using graphs and aren't, what are some of the cases where you've run into the situation of having a hammer and everything looking like a nail, and you've tried to do something with graph-tool that was actually much simpler if you don't try to think about it as a graph structure? Absolutely. This can happen too. Yes. Although, of course, I'm biased. But, yes, this does happen. To give an example, there are some application domains for which there are specialized tools that work better. For example,
[00:30:43] Unknown:
image processing. Right? I mean, there is no need, I think, in most cases, to think of using graphs or graph algorithms there. The guiding principle here, I mean, graphs are relatively general mathematical structures. So if you want to, you can see, you know, graphs everywhere. But there is a particular property here that is important for libraries such as graph-tool to make sense, which is the property of sparsity. Right? So the networks that we have, that we use, tend to be sparse. I think I explained this before: most edges in the network don't exist. Right? So, in other words, every node has only a small number of neighbors.
Right? If you start thinking of images and other kinds of structures, which are actually dense, as graphs, you can do it. And, in fact, this has been done. You can find papers of people doing this kind of thing, converting image analysis into a graph problem. But I think it doesn't really help. I think it adds unnecessary complexity, as I mentioned. There are more specialized tools that are better in these cases.
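The sparsity argument can be made quantitative: in a network where every node has a small, roughly constant number of neighbors, the fraction of possible edges that actually exist vanishes as the graph grows, while a fully connected structure stays at density 1. A quick sketch (the sizes are arbitrary):

```python
def density(n, m, directed=False):
    """Fraction of possible edges actually present in a simple graph
    with n nodes and m edges."""
    possible = n * (n - 1) if directed else n * (n - 1) // 2
    return m / possible

n, k = 10_000, 10                 # 10k nodes, roughly 10 neighbours each
sparse_m = n * k // 2             # undirected edge count grows linearly in n
dense_m = n * (n - 1) // 2        # complete graph: quadratic in n

sparse_density = density(n, sparse_m)  # tiny: most edges don't exist
dense_density = density(n, dense_m)    # 1.0: every possible edge exists
```

Adjacency-list libraries like graph-tool are designed around the sparse case, where storage and iteration cost scale with the number of edges that exist, not with the quadratic number that could exist.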
[00:31:49] Unknown:
In terms of projects that you've built with graph-tool or things that you've seen other people use it for, what are some of the most interesting or innovative or unexpected ways that you've seen it used? I find it being used in
[00:32:01] Unknown:
research all of the time, and this gives me tremendous joy. It's very cool when you open up a paper and there's, like, a graph, and I know this has been done with graph-tool. You can tell. Right? And this is cool because even if it's some menial task, even if they're just doing, like, a graph visualization, which is a rather basic functionality of graph-tool, I feel very happy, because I feel like I helped this research a little bit to do something. There are all sorts of things. People are using the methods to do analysis of gene sequence data, for example. It's all over the place: tissues, cells.
The last example I saw, there was a paper, I think in Science, some years ago, where they had the placement of cells in tissues, and they had, like, a lattice representation of this. They used graph-tool to analyze it. I was totally blown away. I never imagined this as a type of application.
[00:32:51] Unknown:
You mentioned a few times citations that you have within graph-tool, and I know that there's also a growing trend of having citations for open source software libraries in research. And then there's also the archetypal challenge of working in academia, of publish or perish. And I'm wondering what your experience has been as to the overall weight that your open source software contributions have in your role as somebody working in academia, versus the pressure to, you know, have published research papers, and, you know, just the overall balance that you have in terms of the effort that goes into the different avenues? Yeah. That's a very interesting question. The reason why I do graph-tool is not to gather citations, because that wouldn't have been smart.
[00:33:39] Unknown:
I do it because I get pleasure out of it. It does help my own research because it organizes what I have. It helps me organize my tools, and over time, you build this little compendium of things. But I do it because it's like a Zen garden for me. You know? I have two modes of operation. So when I'm working on research projects, I'm thinking of a new concept or some hard problem, and it's very intensive, and you're thinking a lot and trying to come up with something. But then when you're doing things, of course, when you're writing code like this, it can be like that as well, but there's also quite a bit of it which is, you know, taking care of the documentation or the docstrings, building the package, building the website. All these things, for me, are a different mode of operation. Right? It's like, as I mentioned, a little garden that you're setting up, and this is relaxing.
When I tell this to people, they think it's weird, but I don't think so. I do it because it gives me pleasure. It's a thing that I like to do. I do think it helps other people too. Right? It's not the most used graph library, but I know for a fact that other people are using it, and this is great. And this matters. I think it's important for research to have this kind of activity, even if it doesn't count directly toward your citation counts. Right? At some point, if you think too much about this, you're gonna go crazy.
[00:34:51] Unknown:
And in terms of your experience of building Graph tool and using it for your own research, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:00] Unknown:
The biggest complaint people make about graph-tool is that some people find it difficult to install. I've heard this many times. So, of course, I try to address this, but the problem here is that the library is a bit deceptive. Python users expect a library to be a text file that you download. Right? And then it just runs, because it's an interpreted language. graph-tool is not this. graph-tool is a wolf in sheep's clothing. Right? It has a Python API, but it's a C++ library. It has C++ dependencies. You need to install Boost. You need a compiler, right, if you want to compile the library.
And then, as I said, I started developing this in 2006; that's when it began. Over time, of course, macOS has become kind of a de facto standard in science, although that's not what I use. This kind of operating system doesn't have any built-in package management. So can you imagine somebody who doesn't have package management having to install by hand, you know, Boost and GCC and all this stuff? It's a nightmare. Of course, on Linux, you know, it's not an issue. Right? Maybe you have to wait a little bit for the thing to compile, but it's not a big deal. But I noticed that over time, a different, younger generation of people coming in are not used to the configure-and-make style. This is something for older people. Right? So, of course, this is not what you need to do with graph-tool. It's very easy to install graph-tool nowadays. It's packaged in Anaconda, packaged in Homebrew for macOS. If you use Debian or, you know, Ubuntu, there are packages for it. There are packages for Arch. I did make a lot of effort to do this as much as possible, but you're gonna find corner cases where, for whatever reason, people cannot use one of these packages, and they have to do it by hand. And they get surprised by the fact that it takes a long time to compile because of the template programming, which compilers are not that optimized to handle. So it takes a long time and a lot of memory to compile, etcetera. So this, I think, is something that required some effort to improve. And the state is better now, but it can still be improved. So I think this is maybe the Achilles' heel of this kind of project.
[00:37:05] Unknown:
For people who are looking for a library to be able to do analysis on graphs, or construct graphs and do some manipulation on them, what are the cases where graph-tool is the wrong choice and they might be better served with a NetworkX or a NetworKit? graph-tool does not include every possible algorithm.
[00:37:23] Unknown:
So there might be some algorithms not implemented there, just because I didn't have the time to get to them. graph-tool is essentially a one-man project, with me doing basically everything. Once in a while, people help. People have helped with this or that, with some things, over the years, but I'm essentially the one doing everything. It's fairly extensive, particularly because, besides implementing quite a bunch of things myself, I have been able to use what the Boost Graph Library had in place, which was quite a lot. So it has quite a range of algorithms, but maybe you won't find something. I think NetworkX actually has larger coverage. It has different algorithms that, because of my profile, I haven't needed to implement.
The algorithms tend to be biased toward the kind of things that I do. So it might be that somebody searches for a given algorithm that's not there, and so they have to use something else. This is definitely a possibility. Another one has to do with what I just said. It might be that there isn't a binary package for their particular, you know, operating system combination. Like, you know, I've encountered many people who have a very old laptop with a ridiculously old version of macOS that is not compatible, and they have a hard time installing. For some corner cases like this, it might be burdensome to get going. Right?
[00:38:39] Unknown:
As you continue to use graph-tool for your own work and develop it and maintain it, what are some of the things that you have planned for the near to medium term of the project?
[00:38:47] Unknown:
I'm not planning a big revolution with it. I would just continue to expand it, put more things in, maintain it, clean it up. It's connected to my research, so I do have a research agenda related to network inference and reconstruction, which is going to become part of this package for sure. There are some things I would like to do, but this depends a little bit on external tooling. I'm very interested, for example, in exploiting GPUs. Right? So, offloading some of this code to GPUs. Some of it, I think, could benefit from this. It's not very difficult to do using things like CUDA and so on, but right now, at the moment, it's a bit cumbersome because of the packages available on different systems.
But it is something I'm looking toward implementing, which I think would be quite cool. I'm also paying attention to the development of C++. C++ is a very live language. Right? It's been evolving quite a bit with these new releases that come every couple of years. They tend to help a lot with libraries such as graph-tool, which depend a lot on metaprogramming.
[00:39:50] Unknown:
This part of C++ is actually evolving quite a bit, and I'm looking forward to, you know, improvements that make compilation easier and make development easier too. Are there any other aspects of the work that you're doing on graph-tool, or the ways that you're using it in your research, or the overall space of graph algorithms and network datasets, that we didn't discuss yet that you'd like to cover before we close out the show? No. I think you covered quite a lot. I would recommend people just take a look at it if they find it interesting. Right? It's free software. You can just get it and play with it. That's surely the best way to learn about it. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or contribute to graph-tool, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a bit of a self-serving pick to announce that the book that I've been working on with O'Reilly for a while now, 97 Things Every Data Engineer Should Know, just a collection of different essays from people in the field, is finally published and released. So I'm happy to see that out the door, and I definitely want to let people know that it's there. So I'll add some links in the show notes to where you can find it. Thank you very much for taking the time today to join me and share the work that you've been doing with graph-tool. It's definitely a very interesting library, and I'm excited to learn that you have all of these different citations and references to relevant material as you explore the overall space of graph algorithms. So I'll have to take a look at that. So thank you again for all the time and effort you've put into the project, and I hope you enjoy the rest of your day. Thank you very much, Tobias, for the interest.
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Tiago Peixoto's Background and Introduction to Python
Development and Goals of Graph Tool
Applications and Use Cases of Graph Tool
Graph Databases vs. Graph Tool
Typical Operations and Background Needed for Graph Tool
Comparison with Other Python Graph Libraries
Implementation and Design Philosophy of Graph Tool
Evolution and Workflow of Graph Tool
Understanding Graph Theory for Effective Use
Simplifying Analysis with Graph Structures
Underutilized Features of Graph Tool
When Not to Use Graph Tool
Interesting Applications and Research Using Graph Tool
Impact of Open Source Contributions in Academia
Challenges and Lessons Learned
When to Choose Other Graph Libraries
Future Plans for Graph Tool
Final Thoughts and Recommendations