Summary
The foundation of every ML model is the data that it is trained on. In many cases you will be working with tabular or unstructured information, but there is a growing trend toward networked, or graph, data sets. Benedek Rozemberczki has focused his research and career on graph machine learning applications. In this episode he discusses the common sources of networked data and the challenges of working with graph data in machine learning projects, and describes the libraries that he has created to help him in his work. If you are dealing with connected data, then this interview will provide a wealth of context and resources to improve your projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
- Your host as usual is Tobias Macey and today I’m interviewing Benedek Rozemberczki about his work on machine learning for graph data, including a variety of libraries to support his efforts.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of when you might want to do machine learning on networked/graph data?
- How do networked data sets change the way that you approach machine learning tasks?
- Can you describe the current state of the ecosystem for machine learning on graphs?
- You have created a number of libraries to address different aspects of machine learning on graphs. Can you list them and share some of the stories behind their creation?
- How do the different tools relate to each other?
- Can you talk through some of the structural and user experience design principles that you lean on when building these libraries?
- When you are working with networked data sets, what is your current workflow from idea to completion?
- What are the most difficult aspects of working with networked data sets for machine learning applications?
- What are the most interesting, innovative, or unexpected ways that you have seen graph ML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on graph ML problems?
- What are some examples of when you would choose not to use some or all of your own libraries?
- What do you have planned for the future of your libraries/what new libraries do you anticipate needing to build?
Keep In Touch
- benedekrozemberczki on GitHub
- @benrozemberczki on Twitter
Picks
- Tobias
- Benedek
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Karate Club
- PyTorch Geometric Temporal
- AstraZeneca
- Budapest
- University of Edinburgh
- Matlab
- R
- Bipartite Graph
- Node Classification
- Graph Classification
- PyTorch
- PyTorch Geometric
- DGL (Deep Graph Library)
- Parametric Machine Learning
- graph-tool
- Jax
- NetworkX
- Little Ball of Fur
- GCN == Graph Convolutional Network
- NetworKit
- Gensim
- Nvidia cuGraph
- Random Walk
- scikit-learn
- MalNet
- Graph Representation Learning by William Hamilton
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
Your host as usual is Tobias Macey. And today, I'm interviewing Benedek Rozemberczki about his work on machine learning for graph data, including a variety of libraries that support his efforts. So, Benedek, can you start by introducing yourself?
[00:01:42] Unknown:
I'm Benedek Rozemberczki. I'm a machine learning engineer at AstraZeneca. I did a PhD in data science at the University of Edinburgh and did some research at Google AI. Right now, I live in Oxford, in the United Kingdom, and I was born in Budapest, Hungary. And I work mainly on graph-related machine learning problems.
[00:02:03] Unknown:
And do you remember how you first got introduced to Python?
[00:02:05] Unknown:
Yeah. It's a funny story. So originally, you know, I'm an economist, and I mainly worked with, like, MATLAB and R. Both of those languages are pretty bad with strings. And when I was in grad school, doing a master's in economics, there was a course in Python, and that meant that I moved away from those two instantly because it was way more convenient to use. Yeah.
[00:02:32] Unknown:
As we mentioned at the outset, a big part of what you're working with is graph data and doing machine learning on that. And I'm wondering if you can just start by giving a bit of an overview of when you might want to do machine learning on networked, graph-oriented data and some of the types of problem domains where that kind of data crops up.
[00:02:52] Unknown:
Yeah. So first, I will discuss, you know, the types of problems where you already have network-structured data. But I will mention this multiple times: essentially, any data that lives in a metric space can be turned into graph-structured data, which means that every machine learning problem can be formulated as a graph machine learning problem, interestingly. But usually, the types of problems that people deal with are, you know, in these typical, like, web domains. So for example, if you have a web graph, that is typically something where you can use these tools. Also in health care, for example, you can have certain types of interaction networks of molecules and proteins.
And, also, you know, molecules themselves can be represented as graphs. Other types of problems, of course, include social network data. A lot of recommender system problems can be formulated as graph problems, as you have two types of nodes with connections between them. There's a so-called bipartite graph underlying that problem. Also, in other domains, for example, spatial forecasting problems are such that you can formulate them with an underlying graph. Another thing that you can do, which I mentioned earlier, is that if you have tabular data, with some smart use of, you know, hashing or locality sensitive hashing, you can generally build a graph out of your tabular data, and then you can do graph machine learning on that.
And for the types of problems that we are usually trying to solve, I will use a social network example. One of the problems that people try to deal with is, you know, predicting properties of nodes, which can be, you know, classification of nodes or regression on nodes. So for example, age prediction in a social network, which is a problem that can be very well solved with, you know, a social network based approach. Then there's predicting edge properties. So, for example, how can I predict that two people are linked when they are not yet linked on Facebook, but we might assume that they will link or that they know each other? There's, like, a, you know, hidden connection between them.
And there are problems where you look at the whole graph level. So for example, you can ask questions like, if I have these two drugs which are represented by graphs, how can I predict that there is a certain, you know, polypharmacy side effect for them? Or you can ask questions like, if there is a call graph of, you know, a piece of software, how can I tell that this call graph belongs to a certain type of malware? A large number of problems can be formulated this way. And from an abstract point of view, you know, you can formulate these problems as graph ML problems such as node classification, edge prediction, edge classification, or graph classification. And then, you know, it boils down to these very simple tasks and the standard tools that we try to build, you know, tools that I try to build.
[00:06:01] Unknown:
And in terms of the tooling that's available, I'm wondering if you can give a bit of an overview about the state of the ecosystem for libraries or frameworks that are built to be able to work with graph and network data, and how the overall approach to machine learning shifts or differs for graph and network data as opposed to sort of relational or unstructured data?
[00:06:30] Unknown:
Yeah. So what I will start with is, you know, just highlighting a few statistical things that are interesting. So when you have graph data, one of the things that you can do is semi-supervised learning. So you can exploit the features of the neighbors that a node has, which is, like, a fantastic thing. So one of the things that it's wonderful for is, like, you know, solving problems where you have label sparsity. That's one of the things that's pretty cool, and you can do that. Another thing is that, you know, context is based on the neighbors, and you can predict things for a given data point even when you have missing features for that data point, which is also awesome.
Most of the techniques exploit, you know, that there is, like, a similarity, especially, you know, what's called spatial autocorrelation. It's like the birds of a feather phenomenon, which is also lovely. And one of the issues is that when this, you know, spatial correlation is not positive but negative, then it's very hard to apply these methods. That's, you know, like, a complicated problem. And, like, software-wise, one of the things that was kind of important for graph ML, you know, to start this, like, boom was the fact that we have a lot of good libraries for sparse linear algebra.
And then the big shift came in 2018 when people started to move towards the use of PyTorch. So in general, in the graph analytics domain, those tools that provide, you know, analytical capabilities and those which are, like, for machine learning are very different from each other, and the type of people who develop those are kind of, like, different. And, yeah, PyTorch became dominant. One of the libraries, which is, like, very nice, is developed by Matthias Fey, and it's PyTorch Geometric. It's a research-oriented machine learning library. It always incorporates the, you know, newest research. It's very well designed, with good software engineering principles, in my opinion. But, you know, it's very much research oriented. Like, it's not something that's designed with a scalability, production, deployment, you know, mindset where we have to consider these things.
Then there's another library, which, as far as I know, has some support from Amazon, DGL, which runs on PyTorch and TensorFlow so you can switch out the back ends. It's production oriented. It contains fewer methods. It also has tools for, you know, dealing with knowledge graphs. And so these tools are, you know, the main, like, tools for parametric machine learning on graphs. And there is also a library called PyKEEN, which is built on PyTorch by Charles Hoyt, who is, you know, a postdoc at Harvard. And it's a tool for, like, knowledge graph embeddings. It contains models, loss functions, tools for training and scoring.
So these are the tools for deep learning. Interestingly, you know, the analytics tools are maintained by people who are not really involved in the deep learning community. One of the prominent tools that's really nice is graph-tool, which provides, you know, basic descriptive statistics, but it's really scalable. You know, pure C++. It has a Python wrapper, and it's maintained by a guy called Tiago Peixoto. Interestingly, JAX, you know, the autodiff library, is coming up, and they have a library that's built on JAX for graph machine learning. For example, they have the basic models for graph convolutions and so on. And generally, the issue is that there is no single unified library which covers a lot of the things that we need, such as, you know, basic analytics, so-called graph kernel functions that can compare graphs pairwise for how similar they are, parametric tools such as, you know, deep learning on graphs, and other types of, you know, graph ML, like dynamic graph ML, you know, graph fingerprinting techniques. And usually, there is no good support for heterogeneity and the temporal aspect of things.
Yeah. I feel like that covers it. Yeah.
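To make the research-oriented workflow mentioned here concrete, the following is a minimal sketch of a two-layer graph convolutional network for node classification with PyTorch Geometric. The Planetoid/Cora loader, layer choices, and hyperparameters are illustrative assumptions drawn from the library's public examples, not something discussed in the episode.

```python
# Minimal sketch: two-layer GCN for node classification with PyTorch Geometric.
# The Cora citation dataset is only a stand-in; swap in your own graph data.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root="/tmp/Cora", name="Cora")  # graph with node features and labels
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # Semi-supervised: the loss only uses the small labeled training mask.
    loss = F.nll_loss(F.log_softmax(out, dim=1)[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```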
[00:10:40] Unknown:
As you were describing some of the different libraries that are out there, you called out a couple of things as being either production oriented or oriented towards research. I'm wondering if you can just give a bit more context on research versus production oriented libraries and sort of how the differences in those areas of focus manifest in the interfaces and the capabilities, and just some of the knock-on effects that they have in terms of what you're able to do once you select a given library for the work that you're doing?
[00:11:14] Unknown:
Yeah. So I would like to use an analogy from computer vision. So if you think very abstractly about these things, the node classification problem is equivalent to the segmentation problem, the supervised segmentation problem, in computer vision. The issue with graphs is that while for images you can do some very simple cropping of certain parts of the image and take them, for graphs, you have to be able to sample locally if you want to operate on a sample. And that's an issue. And a library which is, you know, designed with a production mindset should be able to do data loading that does, you know, localized sampling on a graph.
And PyTorch Geometric is not designed with that mindset. So because of that, you have to create, you know, boilerplate code to sample locally and batch it up properly. And that can be kind of tiresome, especially if you have a graph which has a temporal aspect, which has heterogeneity with respect to edges, nodes, and so on. DGL is more production oriented. It supports fewer models, but those are the type of things that will work well in practice. And you can easily build pipelines, for example, in Microsoft Azure using DGL. So one of the things that we did in AstraZeneca is that we use DGL.
It works quite nicely. Just to, like, give you a comparison, when I was in Google AI research, one of the, like, issues that we had from a very abstract point of view was exactly this: how we can sample things well when you have to keep a lot of neighborhoods in a graph in memory, or have a key-value store from which you can access them, you know, when you sample. And it's an issue. Yeah. And then as far as the actual
[00:13:23] Unknown:
area of graph machine learning, you've actually ended up building a number of libraries on your own. I'm wondering if you can just give a bit of an overview of the projects that you've created and some of the motivations for building them and the story behind how you ended up building them and releasing them as open source and maybe how they relate to each other either in terms of inspiration or sort of the lessons that you've learned across building each of the different libraries.
[00:13:51] Unknown:
Yeah. So the first library was Karate Club. The story behind Karate Club was that I had this frustration with ML research that a lot of the time you are asked to compare to certain methods, but you can't do that because the methods are not implemented and the code is not, like, open source. So I had these, like, machine learning paper Fridays when I would take a paper, implement it, and I had a lot of tools that would run from the command line. And then when I was doing my research internship in Google, people told me that you should, like, have a unified library. It could be installable via pip, and it would be a lovely thing to do. And then at the same time, what I saw is that certain tools, such as, like, these fingerprinting techniques which can learn vector representations of a whole graph, are not really available in a library, and I had a lot of tools that could do that. And then what I wanted to do is have this unified library which provides, you know, a wide range of graph ML tools that are unsupervised.
And what I wanted to do is to have something that's not dependent on the auto-differentiation libraries. So everything is implemented purely in SciPy and NumPy, which was, looking back, a choice that had, you know, consequences with respect to technical debt. Also, using NetworkX was not the best idea. At the same time, it gave a very nice coverage of these graph fingerprinting techniques. And to this day, people don't really write libraries which cover these things. Yeah. And then there was the other library, which was Little Ball of Fur. The goal of this library was to allow sampling on graphs. And the goal was that if, for example, you are a company which has a large graph, then you can downsample it and test your internal tools on that smaller version. For example, if you have a browser which is running on some, you know, graph database, then you have a smaller version and you can test with that. Or if you have a library which provides some sort of graph data, then you can use it for that. And also for, like, training a GCN, a graph convolutional network, it would be nice to have something which you can use to do batching, localized sampling, and so on. And here, one of the design ideas was that I wanted to have something where the graph operations are abstracted away. So there is a back end based design where you just pass in a graph which is in a widely used format, for example NetworkX or NetworKit, and then in the back end, the sampling happens, and it's scalable.
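As a rough illustration of the workflow described here, the sketch below downsamples a NetworkX graph with Little Ball of Fur and then fits an unsupervised Karate Club model on the result. The specific sampler and model classes, and their parameters, are assumptions based on the libraries' documented scikit-learn-style APIs rather than the exact calls from the episode.

```python
# Hedged sketch: sample a large NetworkX graph, then learn node embeddings.
# RandomWalkSampler and DeepWalk are assumed class names from the public APIs of
# littleballoffur and karateclub; any sampler/model with the same interface works.
import networkx as nx
from littleballoffur import RandomWalkSampler
from karateclub import DeepWalk

graph = nx.connected_watts_strogatz_graph(10_000, 10, 0.05)  # stand-in for a real graph

# Down-sample to a smaller subgraph for quick experimentation.
sampler = RandomWalkSampler(number_of_nodes=2_000)
small_graph = sampler.sample(graph)

# Karate Club enforces strong input assumptions (consecutive integer node IDs),
# so relabel the sampled subgraph before fitting.
small_graph = nx.convert_node_labels_to_integers(small_graph)

# Fit an unsupervised embedding model with preset hyperparameters, scikit-learn style.
model = DeepWalk(dimensions=64)
model.fit(small_graph)
embeddings = model.get_embedding()  # NumPy array, one row per node
print(embeddings.shape)
```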
Yeah. And it was a hit because it turned out that there was a need for this. We are actually using it internally in AstraZeneca to test certain, you know, tools that we have with smaller versions of graphs that have billions of edges. And the other interesting thing was that people loved it so much that there was a whole workshop at The Web Conference this year about just this library and using this library in graph mining, which was nice. And then the last one is PyTorch Geometric Temporal, which is a deep learning library. And the reason for this library was that I have friends who are researchers themselves in, you know, spatiotemporal, like, machine learning and spatiotemporal analytics, and they work on problems such as, like, you know, windmill output forecasting, which is a really challenging problem, or they work for companies that do bicycle deliveries or need tools for estimating, you know, the spread of diseases and so on. And it turned out that there is no single GPU-accelerated spatiotemporal deep learning library out there.
And it turned out that most of these methods can be implemented in PyTorch with some, you know, tricks where you can exploit a lot of, like, existing PyTorch libraries, such as PyTorch Geometric. And then we started to build this library, and the, you know, the mindset was that we want to have something that interoperates very well with PyTorch Geometric and can solve these, you know, business relevant or scientifically relevant problems. The nice idea is that you can, you know, reuse, for example, the PyTorch Geometric data object as, you know, representations
[00:18:19] Unknown:
of a single temporal snapshot, for example. And, you know, we build heavily on these very existing tools so that you can interface with them well. You mentioned that Karate Club was the first library that you ended up building, and you had some regrets as to sort of the foundational elements that you built it on top of. I'm curious how those lessons carried forward into your work on the other libraries and how you think about building and maintaining open source projects and just the overall sort of user experience and architectural design that go into it? Yeah. So with the later libraries, I kind of started to overcome some of those, you know,
[00:18:56] Unknown:
problems that I had with, like, Karate Club. The reason for the, like, problems with Karate Club is kind of the very fact that it wanted to be very, you know, wide. So certain algorithms that I implemented have very strong assumptions about the input data. And to have a unified API, it was required to enforce certain, you know, assumptions about the input data. And then later on, what I wanted to, you know, do is to have libraries where you can have a lot of, you know, corner cases of the data. For example, like, in PyTorch Geometric Temporal, you can have setups where, for example, the features on nodes are changing, but the graph is fixed.
And we wanted to have something that, you know, allows that and all of these, like, interesting corner cases. One of the, like, issues is that it's very hard to allocate time when you are working full time on a PhD and also, for example, you are working a job. And at the same time, you get this feeling that the person on the other side, you know, they are using the open source tool in their, like, full time job, and they are not really willing to contribute back. So sometimes I'm really surprised when people are, like, just helping out. So for example, Karate Club heavily depends on Gensim, and someone, just as a friendly gesture, like, updated everything for the new version of Gensim.
Then just a few days ago, we had someone who committed a new dataset for PyTorch Geometric Temporal, which is, like, the bus traffic in Uruguay, in Montevideo. And I was, like, so shocked that, you know, they take the time, they write the tests, they write proper documentation, they set up everything on Read the Docs, and I was like, whoa. This is nice. It's nice to have contributors like that. But, yeah, it's very hard to, you know, find this balance, like, how much you have to go after, like, individual needs that people have. And, yeah, the other thing that is really nice to have is when people who are researchers see that there is this, like, nexus of, I don't know, graph sampling. And then what they say is, okay, then what I will do is that I will open source my code as a part of this library, and they contribute.
[00:21:13] Unknown:
You mentioned that you recently finished your PhD program. And I'm curious how much of your research work you're able to complete using the open source libraries that you were building and just the kind of relationship between these open source projects and the PhD projects and the kind of pressure towards publication and what your experience has been in terms of how much weight the software holds in the kind of academic ecosystem?
[00:21:41] Unknown:
Yeah. So that's funny. So Karate Club and some of the other libraries ended up in my thesis. One of the external examiners was actually, like, a guy who built something on top of Karate Club for community detection, which is very nice. But, yeah, I feel like it helps you to build social capital. That's for sure. Also, the people who interviewed me for AstraZeneca, you know, they knew these tools. Some of them were, like, you know, using these tools actively. And just as tools, as a researcher, you want to build things that allow you to iterate very fast. Because of that, like, right now, when I'm writing papers which do these, like, typical tasks such as node classification or graph classification, I can reuse my own tools.
So for example, we wrote a paper on, like, combination therapy for cancer. And because it used graph structured data, it was very easy to run benchmarks with my own tools. You know? You know how to do the data engineering, and it's just, like, one afternoon and you have a whole table of results.
[00:22:50] Unknown:
In terms of working with network datasets and these graph structures, I'm wondering if you can just talk through your overall workflow of going from, I have this dataset or I need to acquire this dataset and clean it up, to having a sort of trained model that you're able to build predictions on and, you know, gain confidence and just sort of
[00:23:13] Unknown:
So one of the issues that you generally have is this question: how large is the graph? Whether it fits in memory and whether it fits in memory for a single GPU. So for example, there is this new tool called cuGraph by NVIDIA. I love that idea. So what they did is that they ported the NetworkX API. It literally looks like you're writing NetworkX code, but it runs on a GPU, and you can do your analytics. So it's a Python library on a GPU. And usually, the question is whether you can do in-memory analysis on CPU or GPU. That's the first question. Whether you have to, you know, come up with some smart way to do sampling, that's always a big question.
And also some other aspects, like, you know, whether you have available node features and edge features besides the graph itself and the target variables. So these are the kinds of things that have a very early impact on what you are going to do, so the types of features that you can use. I have this tendency that if you are doing something that's not research oriented, it should be the simplest tool that can solve that problem. Like, for node embeddings, DeepWalk is going to work. For graph embeddings, graph2vec is going to work. For, you know, graph convolutions, the simplest models give good results.
And also early on, you have to decide whether you want to use the graph for some sort of data fusion and, you know, like, do exploration on that. And finally, one of the, like, interesting things that kind of came out of working in this domain for years is that you have to, very early on, you know, measure feature importance based on spatial autocorrelation, whether it makes sense to include something, because it's not just, like, the correlation between, for example, an outcome and a feature that matters, but also, like, what happens if I consider a feature, the average of that feature in a node's neighborhood, and so on. And one of the, like, tools that's really missing, at least I feel like it is, is tools that can measure spatial autocorrelation of features.
And, you know, you have to write code manually for that when you do exploratory data analysis, for example.
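Since the interview points out that there is no ready-made tool for this, here is one way such a hand-written exploratory check often looks: a Moran's-I-style measure of spatial autocorrelation for a single node feature on a NetworkX graph. It is a generic sketch of the statistic, not code from any of the libraries discussed.

```python
# Hedged sketch: Moran's I for one node feature on an undirected NetworkX graph.
# A value near +1 suggests strong homophily ("birds of a feather"), near 0 no
# spatial structure, and negative values suggest neighbors tend to differ.
import networkx as nx
import numpy as np

def morans_i(graph: nx.Graph, feature: dict) -> float:
    nodes = list(graph.nodes())
    x = np.array([feature[n] for n in nodes], dtype=float)
    z = x - x.mean()                      # centered feature values
    index = {n: i for i, n in enumerate(nodes)}

    cross = sum(z[index[u]] * z[index[v]] for u, v in graph.edges())
    w_sum = graph.number_of_edges()       # each undirected edge has weight 1
    # Both directions of each undirected edge count in the double sum, so multiply by 2.
    return (len(nodes) / (2 * w_sum)) * (2 * cross) / (z @ z)

# Toy usage: an age-like feature on a small random graph.
g = nx.erdos_renyi_graph(200, 0.05, seed=42)
ages = {n: np.random.normal(35, 10) for n in g.nodes()}
print(morans_i(g, ages))
```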
[00:25:38] Unknown:
As you're working through these problems of building a machine learning model that is relying on these network datasets, I'm curious, what are some of the unique challenges that are posed by these connected entities and just some of the difficulties that you've had to overcome in the process of gaining expertise in working within this problem space and some of the useful tricks that you've built up over time as you've spent more time in the sort of problem domain of working on these graphs?
[00:26:11] Unknown:
Yeah. So, like, one aspect is that the meaning of, for example, distribution shift and, like, the meaning of what happens when you're in a dynamic setting is very different. So for example, it might happen, just to give you a social network example, that nodes and edges that are arriving in a dynamic setting form a completely new community where these very salient spatial autocorrelation patterns are very different, for example. That's one issue. Then another issue is that if the nodes are kind of heterogeneous and they have different data modalities, then you must have, you know, a way to integrate them.
So for example, if you have, like, a protein drug interaction network, then the types of features that you have for drugs and for proteins are going to be very different. Or just to give you an example, like, when you have, you know, multimodal nodes, when you're, for example, something like YouTube or you're working on, like, data that has that aspect, that's, like, you know, really challenging. And then, of course, like, the graph itself has certain, like, issues that you have to overcome. So in most graphs, you have this, like, phenomenon called, like, the short effective diameter, that even if you consider, like, a few steps in a neighborhood, you can reach most of the other nodes. And that's something that's surprising: originally, the graph itself is extremely sparse, but suddenly, if you have this, like, consideration of larger neighborhoods, the set of the neighbors explodes, which is, like, an issue. And then you have to come up with ways to deal with that.
And, yeah, these are general issues. And then, of course, like, you can have settings where the edges themselves don't have the same, you know, reliability. So just to give you an example, in AstraZeneca, we deal with edges of knowledge graphs that are generated by NLP, and they are not based on some ground truth. And that's, like, a completely different, you know, level of confidence that you can have for that.
[00:28:15] Unknown:
That's interesting, having to build your machine learning model on an already derived dataset that's been processed, that's been generated from another machine learning model.
[00:28:25] Unknown:
Yeah. It's troubling. And the other thing is that you have to create these ablations. Like, what is the gain that we have when we include these, you know, noisy edges, for example, like,
[00:28:37] Unknown:
And another interesting aspect of graphs is that because of the fact that they are inherently connected and that there are inherent relations in them, I'm curious what your experience has been as far as how parallelizable the machine learning execution can be and how you're able to break down the graphs into sort of discrete chunks for being able to run in parallel to try and speed up training?
[00:29:03] Unknown:
Yeah. So, usually, for how these pipelines operate, I can give you two examples. It's going to be very computer science oriented. But, like, one of the, like, algorithms that does node embeddings has the following idea: you have multiple random walk workers in parallel on a graph. They generate sequences of nodes, which are treated as sentences. And, of course, it's very easy to run that in parallel. And then what you have is that you learn the embedding based on the sequences by applying, you know, the word2vec skip-gram. And that's also something that you can run in parallel if you have, you know, a lock-free gradient descent based approach.
And that's, like, a very nice design of how you can, you know, create something that can learn this function that maps the nodes into this embedding space where nodes that are nearby are close together. And so that's one design. For another design, I would recommend a paper which we did in Google AI Research where we essentially designed this system that calculates pruned personalized PageRank weights on the edges, and that can be done in parallel. And you can, you know, sample locally in parallel. You can batch those, you know, localized samples together, and then you can train a graph convolutional network with that. And that's again something that's easy to parallelize.
The issue again is, like, how you store the whole graph in memory and how you make sure that you can do very fast neighborhood lookups.
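A rough, self-contained sketch of the first design he describes, random walks turned into "sentences" and fed to a word2vec skip-gram model, might look like the following. The walk lengths, window size, and use of Gensim are illustrative assumptions, not the exact pipeline from the episode.

```python
# Hedged sketch: DeepWalk-style node embeddings.
# 1) Run truncated random walks from every node (trivially parallelizable per walker).
# 2) Treat each walk as a "sentence" and train a skip-gram model on the corpus.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length=20):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]  # word2vec expects string tokens

graph = nx.karate_club_graph()  # toy stand-in for a large social graph
walks = [random_walk(graph, node) for node in graph.nodes() for _ in range(10)]

# sg=1 selects skip-gram; workers>1 gives the (near) lock-free parallel updates.
model = Word2Vec(sentences=walks, vector_size=64, window=5, sg=1,
                 min_count=0, workers=4, epochs=5)
print(model.wv["0"].shape)  # embedding vector for node 0
```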
[00:30:43] Unknown:
In terms of the work that you've done as far as building these libraries, but also building the models, I'm curious what are some of the sort of structural patterns and design patterns that you've leaned on to be able to keep the code sort of maintainable and interpretable by, you know, current you and future you as well as other people who are coming to your code
[00:31:09] Unknown:
for the first time? So one of the, like, assumptions was that the primary target group is, like, researchers or people who want to learn machine learning, and it's not necessarily for production, which means that you have to adhere to these, you know, research area specific standards. So for example, if you have some tool that relies on, you know, the scientific Python ecosystem, then it has to be, like, scikit-learn-like. So a very limited number of public methods. Those public methods should be used, you know, in a very consistent way. And, also, certain hyperparameters should be preset.
We don't assume that the end user understands all of the hyperparameters. Those hyperparameters should work well out of the box. So, you know, this should be this turnkey type of library that you can just import. And if you have all of these properties, then you can have this, like, design where you can just switch out the import of the model, you know, the construction of that, you know, instance, and the code still runs, and people can try out how these things work, whether one of them is better or worse. And then, also, you know, like, whenever it's possible, use the Pythonic methods.
So PyTorch Geometric Temporal has these snapshot iterators, which use a lot of these, you know, methods, and also, you know, enforcing strong input assumptions. So the boilerplate is not on your side, but on the user's side. So they have to do the data engineering, and also most of the, you know, heavy lifting should be, you know, moved out. So it's done by Gensim when it's implicit factorization. When it's large scale, you know, graph operations, it runs on NetworKit. And, generally, I was really annoyed by the fact that no one writes tests. You know, there is no continuous integration.
And, also, you have no idea what's the code coverage, class by class, in other libraries. So I try to, you know, take some best practices in that sense, but, you know, it's very hard to write good tests in machine learning. So most of the time, what we do is smoke tests for, you know, shapes and also having trivial machine learning problems where you know that the model should be able to predict something that's very trivial to predict.
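As a concrete example of the kind of smoke test he mentions, a shape check on an embedding model might look like the sketch below (pytest style). The choice of the Karate Club DeepWalk model, the random graph, and the expected dimensions are illustrative assumptions.

```python
# Hedged sketch: a "smoke test" that only checks output shapes, in the spirit
# described above; swap in whichever model class you actually maintain.
import networkx as nx
import numpy as np
from karateclub import DeepWalk  # illustrative choice of model

def test_embedding_shape():
    graph = nx.newman_watts_strogatz_graph(100, 10, 0.2)
    model = DeepWalk(dimensions=32)
    model.fit(graph)
    embedding = model.get_embedding()
    # Smoke test: one row per node, one column per embedding dimension, no NaNs.
    assert embedding.shape == (graph.number_of_nodes(), 32)
    assert not np.isnan(embedding).any()
```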
[00:33:43] Unknown:
In your experience of building these models and working with people who are using your libraries and working in the space, and with your job at AstraZeneca, I'm curious: what are some of the most interesting or innovative or unexpected ways that you've seen machine learning on graph and network datasets used?
[00:34:01] Unknown:
So one of the interesting things that I have seen is something that people in Microsoft and Facebook research together did, which was the release of a dataset called MalNet, which is this collection of malware call graphs. And what they did is just, like, classifying the malware based on the call graphs. And the fact that you can use just that very simple representation to predict this very precisely was interesting for me. Another interesting area of research is this generation of graphs based on metric spaces where you generate a similarity graph, and then you can use semi-supervised learning on that. And that seems to work very well when you have this label, you know, scarcity, or it's very expensive to get labels. And in the last two years, it's become, like, a very active research domain, which has a lot of practical applications and is not just, you know, this, like, machine learning theory research, but something that seems to, you know, scale well and has a lot of practical applications.
And then there is this, like, thing that we also do in AstraZeneca, where you have a graph which has some sort of, you know, heterogeneity. Like, there are different types of nodes, and then you can have multimodal data on those nodes. And then the question is, how do you integrate that? How do you fuse that? In your own experience of working with this graph data
[00:35:31] Unknown:
and doing research on machine learning with networked datasets, I'm curious what are some of the most interesting or unexpected or challenging lessons that you learned in the process as well as for building these open source libraries to help yourself and other people in doing research on these problem domains?
[00:35:46] Unknown:
Yeah. So one of the, like, lessons that I learned, it's not about graph ML at all, but it's that a lot of people in machine learning don't understand the algorithms themselves, which is sometimes scary to see. So I get a lot of, like, issues which are like, how can I use this to do, I don't know, this or that type of problem, which you can't solve with that tool, which is sometimes funny. Yeah. Another thing is that people are not really good at, like, you know, recombining these ideas, which was also surprising. One of the things that I realized is that people need much more, you know, hand-holding, notebooks, examples, and that is something that people love. If you have, for example, a model class, then there should be an example with that. There should be a notebook with that.
Yeah. One of the things that I realized is that I came with a background in statistics and econometrics. And because of that, I had a strong awareness of what the basic principles are, such as, like, let us say, you know, residual autocorrelation and similar statistical phenomena. And it was really surprising to see people who develop these models, and some of them are, like, very important, you know, cornerstones in this domain, and yet they didn't understand that the model fails to capture something or has a certain, you know, specification issue and so on. And that was also, like, interesting to see.
And one of the, like, last things is that, you know, we put a lot of effort into, like, beating state-of-the-art models, while it's very easy to create algorithms and tools that are very simple, easy to understand, and very hard to beat, which are scalable and work well. And, you know, these are the ideal things for actual, you know, production machine learning.
[00:37:49] Unknown:
In terms of when you're working with a networked dataset, or if you're building a machine learning model and you have the option of representing a dataset as a graph or as a maybe more relational structure, what are the cases where you would choose to change the representation to not be a graph? Or, conversely, what are the cases where a graph approach might be the superior and more easily manageable way to build these machine learning models?
[00:38:20] Unknown:
I feel like the data scarcity aspect, like the labeling aspect, is something where the graph based approach can help a lot. Where it's not going to help is when you don't have this birds of a feather phenomenon with respect to features and labels. And those are kind of the pitfalls for these methods, and people are not checking for those assumptions. For some reason, they don't understand that these techniques, most of them, assume that there is this, you know, strong spatial similarity in the feature space and the label space. So, like, just to reuse an example: predicting gender based on a social network is a really hard problem. Surprisingly hard compared to predicting the age of people. So just by averaging out the age of your friends, you can explain roughly 50 to 60% of the variance in age.
But for gender, even with something very complicated, you can barely have an area under the curve of 0.55 or something like that.
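To make the age example concrete, the homophily-based baseline he is describing is essentially "predict each person's age as the mean age of their friends." A sketch of that baseline and its explained variance is below, with synthetic, homophilous ages standing in for real social network data.

```python
# Hedged sketch: neighborhood-average baseline for age prediction on a graph.
# The caveman graph and age model are synthetic stand-ins, chosen only to be homophilous.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
graph = nx.connected_caveman_graph(50, 8)          # tight friend groups
group_age = {c: rng.uniform(18, 70) for c in range(50)}
age = {n: group_age[n // 8] + rng.normal(0, 3) for n in graph.nodes()}

# Predict each node's age as the mean age of its neighbors.
pred = {n: np.mean([age[m] for m in graph.neighbors(n)]) for n in graph.nodes()}

y = np.array([age[n] for n in graph.nodes()])
y_hat = np.array([pred[n] for n in graph.nodes()])
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"Variance in age explained by the neighbor average: {r2:.2f}")
```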
[00:39:28] Unknown:
In terms of your own libraries and tools that you've built to help yourself in working through these problems, what are the cases where you would decide not to use them and instead lean on a different set of tooling or technologies?
[00:39:42] Unknown:
So for spatiotemporal learning, PyTorch Geometric Temporal is the only tool out there, so there isn't much choice on that side. Little Ball of Fur works very well with the NetworKit-based back end, even on large problems. So when we saw in AstraZeneca that we can scale it up to the whole knowledge graph that we have and downsample that, that was really nice to see. At the same time, like, the design of Karate Club means that it's, like, research oriented, and it doesn't work well when it comes to, you know, industry-sized node embedding problems.
But, you know, as a side note, some graph fingerprinting methods that I implemented in that library, because they work in this inductive manner, can be, you know, parallelized easily, executed just on, you know, blocks of graphs. And because of that, you can do large scale machine learning with that. So the people who released the malware network dataset, which consists of, like, 2,000,000 graphs, and some of the graphs have, like, hundreds of thousands of nodes, so it's large graphs, it's like, I don't know, 10 terabytes of data, they managed to use Karate Club on that to get the graph level embeddings for each of the graphs. They had a very, like, smart way of distributing it, which was interesting to see. But what I would say is that of the tools that I created, Karate Club is purely research oriented; Little Ball of Fur is something that you could use in production, though. And PyTorch Geometric Temporal would work in production. So we were able to train forecasting models using more than, I don't know, 5,000 windmills, which are located in one of the Scandinavian countries, which is, like, nice.
[00:41:31] Unknown:
In terms of the overall space of GraphML and the sort of capabilities that it provides and the types of problem domains that it's conducive for, what are some of the sort of interesting and upcoming trends that you're keeping an eye on, and what are some of the areas that you're excited to work on either by extending your existing libraries or building out new tooling or just doing your own research in the space?
[00:41:58] Unknown:
So one of the, like, interesting problems that we are trying to tackle with the people involved in PyTorch Geometric Temporal is this issue that on most social networks, the time difference between events is not constant. In the examples that I mentioned, like the windmill output forecasting problem, the time differences are constant. But in a lot of settings, that's not true. It's very hard to handle because you have, you know, a spatial domain and a temporal domain, and you have to be able to have the right type of data structures and to define models on that which can, you know, operate on that type of data. So that's definitely a direction that we are working on. So PyTorch Geometric Temporal is, like, this community effort with people from the University of Cambridge and Oxford, and we're working on that to make that possible.
And one of the things that I want to do personally is to have other types of back ends for Little Ball of Fur, such as, you know, scikit-network, igraph, and graph-tool, these classical graph analytics libraries, as back ends. And, personally, what I would love to do is have a library which can calculate different measures of spatial autocorrelation on a graph, because it's something that I have to do. I have to write code for that, and most probably other people have the same problem. And it seems to be, you know, again, a niche area that no one tries to tackle.
[00:43:31] Unknown:
Are there any other aspects of the libraries that you're building and the work that you're doing in GraphML or the overall space of machine learning on graphs that we didn't discuss yet that you'd like to cover before we close out the show?
[00:43:43] Unknown:
Yeah. I feel like one of the interesting aspects is that it intersects a lot with, you know, geometric applications of ML. And because of that, it might be interesting to cover those more general cases and how, for example, graph ML relates to computer vision and processing of sequences. That's just something that I would like to mention. And, also, explainability on graphs seems to be a very hot topic in the last year, and there are beautiful papers and good tools to do explainability at the instance level on nodes and edges.
[00:44:22] Unknown:
And are there any particular resources or reference material that you would recommend people look to if they're trying to learn more about how to take advantage of networked datasets and build machine learning models across
[00:44:36] Unknown:
them? Yeah. Luckily, this year, two textbooks came out, and I feel like most of the models that are described in those textbooks are available in PyTorch Geometric and also in DGL. Out of these two textbooks, I would recommend the textbook by William L. Hamilton, who is a professor at Mila. If people go search for it, they will find it. It's a very generic, graduate school level textbook for graph representation learning. I would recommend it. It's, like, an easy read, fun, publicly available, but you can also order a proper hard copy if you would love to do that.
[00:45:17] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or contribute to the libraries that you maintain, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the movie Wrath of Man. It came out recently. It's a Guy Ritchie film. All of Guy Ritchie's films I've thoroughly enjoyed, just because you really have to pay attention to understand what's going on. And this is, you know, another one where there are a few interesting twists to it. So I just had fun watching that. So if you're looking for something to watch, I definitely recommend that one. And so with that, I'll pass it to you, Benedek. Do you have any picks this week? Yeah. I would like to recommend Hunt for the Wilderpeople, which is a 2016
[00:45:56] Unknown:
movie. It's by Taika Waititi, who is a director from New Zealand. It's a wonderful comedy drama movie about an actual, real rehabilitation program for children who are young offenders in New Zealand. It's a hilarious movie. And I would also love to recommend the research paper which has the title Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, which is this wonderful, pretty long research paper which, you know, unifies a lot of the machine learning that's out there. It's a pretty heavy read, but it's worthwhile.
[00:46:36] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on building libraries and doing work in the GraphML space. It's definitely a very interesting problem domain, and I'm curious to see how it continues to evolve. Definitely been seeing a lot of interest in how to be able to do machine learning on graph data. So thank you for all of your time and efforts on that, and I hope you enjoy the rest of your day. Thank
[00:47:01] Unknown:
you. Have a nice day.
[00:47:05] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Benedek Rozemberczki
Machine Learning on Graph Data
Tools and Libraries for Graph Machine Learning
Research vs Production Oriented Libraries
Benedek's Open Source Libraries
Open Source Contributions and Community
Workflow for Graph Machine Learning
Challenges in Graph Machine Learning
Design Patterns and Maintainability
Innovative Uses of Graph Machine Learning
Lessons Learned in Graph Machine Learning
Choosing Graph Representation
Future Trends and Research Directions
Intersections with Other Domains