Summary
The foundation of every ML model is the data that it is trained on. In many cases you will be working with tabular or unstructured information, but there is a growing trend toward networked, or graph, data sets. Benedek Rozemberczki has focused his research and career on graph machine learning applications. In this episode he discusses the common sources of networked data and the challenges of working with graph data in machine learning projects, and describes the libraries that he has created to help him in his work. If you are dealing with connected data, then this interview will provide a wealth of context and resources to improve your projects.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
- Your host as usual is Tobias Macey and today I’m interviewing Benedek Rozemberczki about his work on machine learning for graph data, including a variety of libraries to support his efforts.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of when you might want to do machine learning on networked/graph data?
- How do networked data sets change the way that you approach machine learning tasks?
- Can you describe the current state of the ecosystem for machine learning on graphs?
- You have created a number of libraries to address different aspects of machine learning on graphs. Can you list them and share some of the stories behind their creation?
- How do the different tools relate to each other?
- Can you talk through some of the structural and user experience design principles that you lean on when building these libraries?
- When you are working with networked data sets, what is your current workflow from idea to completion?
- What are the most difficult aspects of working with networked data sets for machine learning applications?
- What are the most interesting, innovative, or unexpected ways that you have seen graph ML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on graph ML problems?
- What are some examples of when you would choose not to use some or all of your own libraries?
- What do you have planned for the future of your libraries/what new libraries do you anticipate needing to build?
Keep In Touch
- benedekrozemberczki on GitHub
- @benrozemberczki on Twitter
Picks
- Tobias
- Benedek
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Karate Club
- PyTorch Geometric Temporal
- AstraZeneca
- Budapest
- University of Edinburgh
- Matlab
- R
- Bipartite Graph
- Node Classification
- Graph Classification
- PyTorch
- PyTorch Geometric
- DGL (Deep Graph Library)
- Parametric Machine Learning
- graph-tool
- Jax
- NetworkX
- Little Ball of Fur
- GCN == Graph Convolutional Network
- NetworKit
- Gensim
- Nvidia cuGraph
- Random Walk
- scikit-learn
- MalNet
- Graph Representation Learning by William Hamilton
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle-tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
Your host as usual is Tobias Macey. And today, I'm interviewing Benedek Rozemberczki about his work on machine learning for graph data, including a variety of libraries that support his efforts. So, Benedek, can you start by introducing yourself?
[00:01:42] Unknown:
I'm Benedek Rozemberczki. I'm a machine learning engineer at AstraZeneca. I did a PhD in data science at the University of Edinburgh and did some research at Google AI. Right now, I live in Oxford, in the United Kingdom, and I was born in Budapest, Hungary. And I work mainly on graph-related machine learning problems.
[00:02:03] Unknown:
And do you remember how you first got introduced to Python?
[00:02:05] Unknown:
Yeah. It's a funny story. So originally, you know, I'm an economist, and I mainly worked with, like, MATLAB and R. Both of those languages are pretty bad with strings. And when I was in grad school, doing a master's in economics, there was a course in Python, and that meant that I moved away from those two instantly because it was way more convenient to use. Yeah.
[00:02:32] Unknown:
As we mentioned at the outset, a big part of what you're working with is graph data and doing machine learning on that. And I'm wondering if you can just start by giving a bit of an overview of when you might want to do machine learning on networked, graph-oriented data and some of the types of problem domains where that kind of data crops up.
[00:02:52] Unknown:
Yeah. So first, I will discuss, you know, the types of problems where you already have network-structured data. But I will mention this multiple times: essentially, any data that lives in a metric space can be turned into graph-structured data, which means that every machine learning problem can be formulated as a graph machine learning problem, interestingly. But usually, the types of problems that people deal with are, you know, in these typical, like, web domains. So for example, if you have a web graph, that is typically something where you can use these tools. Also in health care, for example, you can have certain types of interaction networks of molecules and proteins.
And, also, you know, molecules themselves can be represented as graphs. Other types of problems, of course, include social network data. A lot of recommender system problems can be formulated as graph problems, as you have two types of nodes with connections between them. There's a so-called bipartite graph underlying that problem. Also, in other domains, for example, spatial forecasting problems are such that you can formulate them with an underlying graph. Another thing that you can do, which I mentioned earlier, is that if you have tabular data, with some smart use of, you know, hashing or locality sensitive hashing, you can generally build a graph out of your tabular data, and then you can do graph machine learning on that.
And for the types of problems that we are usually trying to solve, I will use a social network example. One of the problems that people try to deal with is, you know, predicting properties of nodes, which can be, you know, classification of nodes or regression on nodes. So for example, age prediction in a social network, which is a problem that can be very well solved with, you know, a social network based approach. Then there's predicting edge properties. So, for example, how can I predict that two people are linked when they are not yet linked on Facebook, but we might assume that they will link or that they know each other? There's, like, a, you know, hidden connection between them.
And there are problems where you look at the whole graph level. So for example, you can ask questions like, if I have these two drugs which are represented by graphs, how can I predict that there is a certain, you know, polypharmacy side effect for them? Or you can ask questions like, if there is a call graph of, you know, a piece of software, how can I tell that this call graph belongs to a certain type of malware? A large number of problems can be formulated this way. And from an abstract point of view, you know, you can formulate these problems as graph ML problems such as node classification, edge prediction, edge classification, or graph classification. And then, you know, it boils down to these very simple tasks and the standard tools that we try to build, you know, tools that I try to build.
[00:06:01] Unknown:
And in terms of the tooling that's available, I'm wondering if you can give a bit of an overview about the state of the ecosystem for libraries or frameworks that are built to be able to work with graph and network data, and how the overall approach to machine learning shifts or differs for graph and network data as opposed to sort of relational or unstructured data?
[00:06:30] Unknown:
Yeah. So what I will start with is, you know, just highlighting a few statistical things that are interesting. So when you have graph data, one of the things that you can do is semi-supervised learning. So you can exploit the features of the neighbors that a node has, which is, like, a fantastic thing. So one of the things that it's wonderful for is, like, you know, solving problems where you have label sparsity. That's one of the things that's pretty cool, and you can do that. Another thing is that, you know, context is based on the neighbors, and you can predict things for a given data point even when you have missing features for that data point, which is also awesome.
Most of the techniques exploit, you know, that there is, like, a similarity, especially, you know, what's called spatial autocorrelation. It's like the birds of a feather phenomenon, which is also lovely. And one of the issues is that when this, you know, spatial correlation is not positive but negative, then it's very hard to apply these methods. That's, you know, like, a complicated problem. And, like, software-wise, one of the things that was kind of important for graph ML, you know, to start this, like, boom was the fact that we have a lot of good libraries for sparse linear algebra.
And then the big shift came in 2018 when people started to move towards the use of PyTorch. So in general, in the graph analytics domain, those tools that provide, you know, analytical capabilities and those which are, like, for machine learning are very different from each other, and the type of people who develop those are kind of, like, different. And, yeah, PyTorch became dominant. One of the libraries, which is, like, very nice, is developed by Matthias Fey, and it's PyTorch Geometric. It's a research-oriented machine learning library. It always incorporates the, you know, newest research. It's very well designed, with good software engineering principles, in my opinion. But, you know, it's very much research oriented. Like, it's not something that's designed with a scalability, production, deployment, you know, mindset where we have to consider these things.
Then there's another library, which, as far as I know, has some support from Amazon, DGL, which runs on PyTorch and TensorFlow so you can switch out the back ends. It's production oriented. It contains fewer methods. It also has tools for, you know, dealing with knowledge graphs. And so these tools are, you know, the main, like, tools for parametric machine learning on graphs. And there is also a library called PyKEEN, which is built on PyTorch by Charles Hoyt, who is, you know, a postdoc at Harvard. And it's a tool for, like, knowledge graph embeddings. It contains models, loss functions, tools for training and scoring.
So these are the tools for deep learning. Interestingly, you know, the analytics tools are maintained by people who are not really involved in the deep learning community. One of the prominent tools that's really nice is graph-tool, which provides, you know, basic descriptive statistics, but it's really scalable. You know, pure C++. It has a Python wrapper, and it's maintained by a guy called Tiago Peixoto. Interestingly, JAX, you know, the autodiff library, is coming up, and they have a library that's built on JAX for graph machine learning. For example, they have the basic models for graph convolutions and so on. And generally, the issue is that there is no single unified library which covers a lot of the things that we need, such as, you know, basic analytics, so-called graph kernel functions that can compare graphs pairwise for how similar they are, parametric tools such as, you know, deep learning on graphs, and other types of, you know, graph ML, like dynamic graph ML, you know, graph fingerprinting techniques. And usually, there is no good support for heterogeneity and the temporal aspect of things.
Yeah. I feel like that covers it. Yeah.
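To make the research-oriented workflow mentioned here concrete, the following is a minimal sketch of a two-layer graph convolutional network for node classification with PyTorch Geometric. The Planetoid/Cora loader, layer choices, and hyperparameters are illustrative assumptions drawn from the library's public examples, not something discussed in the episode.

```python
# Minimal sketch: two-layer GCN for node classification with PyTorch Geometric.
# The Cora citation dataset is only a stand-in; swap in your own graph data.
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

dataset = Planetoid(root="/tmp/Cora", name="Cora")  # graph with node features and labels
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    # Semi-supervised: the loss only uses the small labeled training mask.
    loss = F.nll_loss(F.log_softmax(out, dim=1)[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```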
[00:10:40] Unknown:
As you were describing some of the different libraries that are out there, you called out a couple of things as being either production oriented or oriented towards research. I'm wondering if you can just give a bit more context on research versus production oriented libraries and sort of how the differences in those areas of focus manifest in the interfaces and the capabilities, and just some of the knock-on effects that they have in terms of what you're able to do once you select a given library for the work that you're doing?
[00:11:14] Unknown:
Yeah. So I would like to use an analogy from computer vision. So if you think very abstractly about these things, the node classification problem is equivalent to the segmentation problem, the supervised segmentation problem, in computer vision. The issue with graphs is that while for images you can do some very simple cropping of certain parts of the image and take them, for graphs, you have to be able to sample locally if you want to operate on a sample. And that's an issue. And a library which is, you know, designed with a production mindset should be able to do data loading that does, you know, localized sampling on a graph.
And PyTorch Geometric is not designed with that mindset. So because of that, you have to create, you know, boilerplate code to sample locally and batch it up properly. And that can be kind of tiresome, especially if you have a graph which has a temporal aspect, which has heterogeneity with respect to edges, nodes, and so on. DGL is more production oriented. It supports fewer models, but those are the type of things that will work well in practice. And you can easily build pipelines, for example, in Microsoft Azure using DGL. So one of the things that we did in AstraZeneca is that we use DGL.
It works quite nicely. Just to, like, give you a comparison, when I was in Google AI research, one of the, like, issues that we had from a very abstract point of view was exactly this: how we can sample things well when you have to keep a lot of neighborhoods in a graph in memory, or have a key-value store from which you can access them, you know, when you sample. And it's an issue. Yeah. And then as far as the actual
[00:13:23] Unknown:
area of graph machine learning, you've actually ended up building a number of libraries on your own. I'm wondering if you can just give a bit of an overview of the projects that you've created and some of the motivations for building them and the story behind how you ended up building them and releasing them as open source and maybe how they relate to each other either in terms of inspiration or sort of the lessons that you've learned across building each of the different libraries.
[00:13:51] Unknown:
Yeah. So the first library was Karate Club. The story behind Karate Club was that I had this frustration with ML research that a lot of the time you are asked to compare to certain methods, but you can't do that because the methods are not implemented and the code is not, like, open source. So I had these, like, machine learning paper Fridays when I would take a paper, implement it, and I had a lot of tools that would run from the command line. And then when I was doing my research internship in Google, people told me that you should, like, have a unified library. It could be installable via pip, and it would be a lovely thing to do. And then at the same time, what I saw is that certain tools, such as, like, these fingerprinting techniques which can learn vector representations of a whole graph, are not really available in a library, and I had a lot of tools that could do that. And then what I wanted to do is have this unified library which provides, you know, a wide range of graph ML tools that are unsupervised.
And what I wanted to do is to have something that's not dependent on the auto-differentiation libraries. So everything is implemented purely in SciPy and NumPy, which was, looking back, a choice that had, you know, consequences with respect to technical debt. Also, using NetworkX was not the best idea. At the same time, it gave a very nice coverage of these graph fingerprinting techniques. And to this day, people don't really write libraries which cover these things. Yeah. And then there was the other library, which was Little Ball of Fur. The goal of this library was to allow sampling on graphs. And the goal was that if, for example, you are a company which has a large graph, then you can downsample it and test your internal tools on that smaller version. For example, if you have a browser which is running on some, you know, graph database, then you have a smaller version and you can test with that. Or if you have a library which provides some sort of graph data, then you can use it for that. And also for, like, training a GCN, a graph convolutional network, it would be nice to have something which you can use to do batching, localized sampling, and so on. And here, one of the design ideas was that I wanted to have something where the graph operations are abstracted away. So there is a back end based design where you just pass in a graph which is in a widely used format, for example NetworkX or NetworKit, and then in the back end, the sampling happens, and it's scalable.
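As a rough illustration of the workflow described here, the sketch below downsamples a NetworkX graph with Little Ball of Fur and then fits an unsupervised Karate Club model on the result. The specific sampler and model classes, and their parameters, are assumptions based on the libraries' documented scikit-learn-style APIs rather than the exact calls from the episode.

```python
# Hedged sketch: sample a large NetworkX graph, then learn node embeddings.
# RandomWalkSampler and DeepWalk are assumed class names from the public APIs of
# littleballoffur and karateclub; any sampler/model with the same interface works.
import networkx as nx
from littleballoffur import RandomWalkSampler
from karateclub import DeepWalk

graph = nx.connected_watts_strogatz_graph(10_000, 10, 0.05)  # stand-in for a real graph

# Down-sample to a smaller subgraph for quick experimentation.
sampler = RandomWalkSampler(number_of_nodes=2_000)
small_graph = sampler.sample(graph)

# Karate Club enforces strong input assumptions (consecutive integer node IDs),
# so relabel the sampled subgraph before fitting.
small_graph = nx.convert_node_labels_to_integers(small_graph)

# Fit an unsupervised embedding model with preset hyperparameters, scikit-learn style.
model = DeepWalk(dimensions=64)
model.fit(small_graph)
embeddings = model.get_embedding()  # NumPy array, one row per node
print(embeddings.shape)
```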
Yeah. And it was a hit because it turned out that there was a need for this. We are actually using it internally in AstraZeneca to test certain, you know, tools that we have with smaller versions of graphs that have billions of edges. And the other interesting thing was that people loved it so much that there was a whole workshop at The Web Conference this year about just this library and using this library in graph mining, which was nice. And then the last one is PyTorch Geometric Temporal, which is a deep learning library. And the reason for this library was that I have friends who are researchers themselves in, you know, spatiotemporal, like, machine learning and spatiotemporal analytics, and they work on problems such as, like, you know, windmill output forecasting, which is a really challenging problem, or they work for companies that do bicycle deliveries or need tools for estimating, you know, the spread of diseases and so on. And it turned out that there is no single GPU-accelerated spatiotemporal deep learning library out there.
And it turned out that most of these methods can be implemented in PyTorch with some, you know, tricks where you can exploit a lot of, like, existing PyTorch libraries, such as PyTorch Geometric. And then we started to build this library, and the, you know, the mindset was that we want to have something that interoperates very well with PyTorch Geometric and can solve these, you know, business relevant or scientifically relevant problems. The nice idea is that you can, you know, reuse, for example, the PyTorch Geometric data object as, you know, representations
[00:18:19] Unknown:
of a single temporal snapshot, for example. And, you know, we build heavily on these very existing tools so that you can interface with them well. You mentioned that Karate Club was the first library that you ended up building, and you had some regrets as to sort of the foundational elements that you built it on top of. I'm curious how those lessons carried forward into your work on the other libraries and how you think about building and maintaining open source projects and just the overall sort of user experience and architectural design that go into it? Yeah. So with the later libraries, I kind of started to overcome some of those, you know,
[00:18:56] Unknown:
problems that I had with, like, Karate Club. The reason for the, like, problems with Karate Club is kind of the very fact that it wanted to be very, you know, wide. So certain algorithms that I implemented have very strong assumptions about the input data. And to have a unified API, it was required to enforce certain, you know, assumptions about the input data. And then later on, what I wanted to, you know, do is to have libraries where you can have a lot of, you know, corner cases of the data. For example, like, in PyTorch Geometric Temporal, you can have setups where, for example, the features on nodes are changing, but the graph is fixed.
And we wanted to have something that, you know, allows that and all of these, like, interesting corner cases. One of the, like, issues is that it's very hard to allocate time when you are working full time on a PhD and also, for example, you are working a job. And at the same time, you get this feeling that the person on the other side, you know, they are using the open source tool in their, like, full time job, and they are not really willing to contribute back. So sometimes I'm really surprised when people are, like, just helping out. So for example, Karate Club heavily depends on Gensim, and someone, just as a friendly gesture, like, updated everything for the new version of Gensim.
Then just a few days ago, we had someone who committed a new dataset for PyTorch Geometric Temporal, which is, like, the bus traffic in Uruguay, in Montevideo. And I was, like, so shocked that, you know, they take the time, they write the tests, they write proper documentation, they set up everything on Read the Docs, and I was like, whoa. This is nice. It's nice to have contributors like that. But, yeah, it's very hard to, you know, find this balance, like, how much you have to go after, like, individual needs that people have. And, yeah, the other thing that is really nice to have is when people who are researchers see that there is this, like, nexus of, I don't know, graph sampling. And then what they say is, okay, then what I will do is that I will open source my code as a part of this library, and they contribute.
[00:21:13] Unknown:
You mentioned that you recently finished your PhD program. And I'm curious how much of your research work you're able to complete using the open source libraries that you were building and just the kind of relationship between these open source projects and the PhD projects and the kind of pressure towards publication and what your experience has been in terms of how much weight the software holds in the kind of academic ecosystem?
[00:21:41] Unknown:
Yeah. So that's funny. So Karate Club and some of the other libraries ended up in my thesis. One of the external examiners was actually, like, a guy who built something on top of Karate Club for community detection, which is very nice. But, yeah, I feel like it helps you to build social capital. That's for sure. Also, the people who interviewed me for AstraZeneca, you know, they knew these tools. Some of them were, like, you know, using these tools actively. And just as tools, as a researcher, you want to build things that allow you to iterate very fast. Because of that, like, right now, when I'm writing papers which do these, like, typical tasks such as node classification or graph classification, I can reuse my own tools.
So for example, we wrote a paper on, like, combination therapy for cancer. And because it used graph structured data, it was very easy to run benchmarks with my own tools. You know? You know how to do the data engineering, and it's just, like, one afternoon and you have a whole table of results.
[00:22:50] Unknown:
In terms of working with network datasets and these graph structures, I'm wondering if you can just talk through your overall workflow of going from, I have this dataset or I need to acquire this dataset and clean it up, to having a sort of trained model that you're able to build predictions on and, you know, gain confidence and just sort of
[00:23:13] Unknown:
So one of the issues that you generally have is this question: how large is the graph? Whether it fits in memory and whether it fits in memory for a single GPU. So for example, there is this new tool called cuGraph by NVIDIA. I love that idea. So what they did is that they ported the NetworkX API. It literally looks like you're writing NetworkX code, but it runs on a GPU, and you can do your analytics. So it's a Python library on a GPU. And usually, the question is whether you can do in-memory analysis on CPU or GPU. That's the first question. Whether you have to, you know, come up with some smart way to do sampling, that's always a big question.
And also some other aspects, like, you know, whether you have available node features and edge features besides the graph itself and the target variables. So these are the kinds of things that have a very early impact on what you are going to do, so the types of features that you can use. I have this tendency that if you are doing something that's not research oriented, it should be the simplest tool that can solve that problem. Like, for node embeddings, DeepWalk is going to work. For graph embeddings, graph2vec is going to work. For, you know, graph convolutions, the simplest models give good results.
And also early on, you have to decide whether you want to use the graph for some sort of data fusion and, you know, like, do exploration on that. And finally, one of the, like, interesting things that kind of came out of working in this domain for years is that you have to, very early on, you know, measure feature importance based on spatial autocorrelation, whether it makes sense to include something, because it's not just, like, the correlation between, for example, an outcome and a feature that matters, but also, like, what happens if I consider a feature, the average of that feature in a node's neighborhood, and so on. And one of the, like, tools that's really missing, at least I feel like it is, is tools that can measure spatial autocorrelation of features.
And, you know, you have to write code manually for that when you do exploratory data analysis, for example.
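Since the interview points out that there is no ready-made tool for this, here is one way such a hand-written exploratory check often looks: a Moran's-I-style measure of spatial autocorrelation for a single node feature on a NetworkX graph. It is a generic sketch of the statistic, not code from any of the libraries discussed.

```python
# Hedged sketch: Moran's I for one node feature on an undirected NetworkX graph.
# A value near +1 suggests strong homophily ("birds of a feather"), near 0 no
# spatial structure, and negative values suggest neighbors tend to differ.
import networkx as nx
import numpy as np

def morans_i(graph: nx.Graph, feature: dict) -> float:
    nodes = list(graph.nodes())
    x = np.array([feature[n] for n in nodes], dtype=float)
    z = x - x.mean()                      # centered feature values
    index = {n: i for i, n in enumerate(nodes)}

    cross = sum(z[index[u]] * z[index[v]] for u, v in graph.edges())
    w_sum = graph.number_of_edges()       # each undirected edge has weight 1
    # Both directions of each undirected edge count in the double sum, so multiply by 2.
    return (len(nodes) / (2 * w_sum)) * (2 * cross) / (z @ z)

# Toy usage: an age-like feature on a small random graph.
g = nx.erdos_renyi_graph(200, 0.05, seed=42)
ages = {n: np.random.normal(35, 10) for n in g.nodes()}
print(morans_i(g, ages))
```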
[00:25:38] Unknown:
As you're working through these problems of building a machine learning model that is relying on these network datasets, I'm curious, what are some of the unique challenges that are posed by these connected entities and just some of the difficulties that you've had to overcome in the process of gaining expertise in working within this problem space and some of the useful tricks that you've built up over time as you've spent more time in the sort of problem domain of working on these graphs?
[00:26:11] Unknown:
Yeah. So, like, one aspect is that the meaning of, for example, distribution shift and, like, the meaning of what happens when you're in a dynamic setting is very different. So for example, it might happen, just to give you a social network example, that nodes and edges that are arriving in a dynamic setting form a completely new community where these very salient spatial autocorrelation patterns are very different, for example. That's one issue. Then another issue is that if the nodes are kind of heterogeneous and they have different data modalities, then you must have, you know, a way to integrate them.
So for example, if you have, like, a protein drug interaction network, then the types of features that you have for drugs and for proteins are going to be very different. Or just to give you an example, like, when you have, you know, multimodal nodes, when you're, for example, something like YouTube or you're working on, like, data that has that aspect, that's, like, you know, really challenging. And then, of course, like, the graph itself has certain, like, issues that you have to overcome. So in most graphs, you have this, like, phenomenon called, like, the short effective diameter, that even if you consider, like, a few steps in a neighborhood, you can reach most of the other nodes. And that's something that's surprising: originally, the graph itself is extremely sparse, but suddenly, if you have this, like, consideration of larger neighborhoods, the set of the neighbors explodes, which is, like, an issue. And then you have to come up with ways to deal with that.
And, yeah, these are general issues. And then, of course, like, you can have settings where the edges themselves don't have the same, you know, reliability. So just to give you an example, in AstraZeneca, we deal with edges of knowledge graphs that are generated by NLP, and they are not based on some ground truth. And that's, like, a completely different, you know, level of confidence that you can have for that.
[00:28:15] Unknown:
That's interesting, having to build your machine learning model on an already derived dataset that's been processed, that's been generated from another machine learning model.
[00:28:25] Unknown:
Yeah. It's troubling. And the other thing is that you have to create these ablations. Like, what is the gain that we have when we include these, you know, noisy edges, for example, like,
[00:28:37] Unknown:
And another interesting aspect of graphs is that because of the fact that they are inherently connected and that there are inherent relations in them, I'm curious what your experience has been as far as how parallelizable the machine learning execution can be and how you're able to break down the graphs into sort of discrete chunks for being able to run in parallel to try and speed up training?
[00:29:03] Unknown:
Yeah. So, usually, for how these pipelines operate, I can give you two examples. It's going to be very computer science oriented. But, like, one of the, like, algorithms that does node embeddings has the following idea: you have multiple random walk workers in parallel on a graph. They generate sequences of nodes, which are treated as sentences. And, of course, it's very easy to run that in parallel. And then what you have is that you learn the embedding based on the sequences by applying, you know, the word2vec skip-gram. And that's also something that you can run in parallel if you have, you know, a lock-free gradient descent based approach.
And that's, like, a very nice design of how you can, you know, create something that can learn this function that maps the nodes into this embedding space where nodes that are nearby are close together. And so that's one design. For another design, I would recommend a paper which we did in Google AI Research where we essentially designed this system that calculates pruned personalized PageRank weights on the edges, and that can be done in parallel. And you can, you know, sample locally in parallel. You can batch those, you know, localized samples together, and then you can train a graph convolutional network with that. And that's again something that's easy to parallelize.
The issue again is, like, how you store the whole graph in memory and how you make sure that you can do very fast neighborhood lookups.
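A rough, self-contained sketch of the first design he describes, random walks turned into "sentences" and fed to a word2vec skip-gram model, might look like the following. The walk lengths, window size, and use of Gensim are illustrative assumptions, not the exact pipeline from the episode.

```python
# Hedged sketch: DeepWalk-style node embeddings.
# 1) Run truncated random walks from every node (trivially parallelizable per walker).
# 2) Treat each walk as a "sentence" and train a skip-gram model on the corpus.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length=20):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]  # word2vec expects string tokens

graph = nx.karate_club_graph()  # toy stand-in for a large social graph
walks = [random_walk(graph, node) for node in graph.nodes() for _ in range(10)]

# sg=1 selects skip-gram; workers>1 gives the (near) lock-free parallel updates.
model = Word2Vec(sentences=walks, vector_size=64, window=5, sg=1,
                 min_count=0, workers=4, epochs=5)
print(model.wv["0"].shape)  # embedding vector for node 0
```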
[00:30:43] Unknown:
In terms of the work that you've done as far as building these libraries, but also building the models, I'm curious what are some of the sort of structural patterns and design patterns that you've leaned on to be able to keep the code sort of maintainable and interpretable by, you know, current you and future you as well as other people who are coming to your code
[00:31:09] Unknown:
for the first time? So one of the, like, assumptions was that the primary target group is, like, researchers or people who want to learn machine learning, and it's not necessarily for production, which means that you have to adhere to these, you know, research area specific standards. So for example, if you have some tool that relies on, you know, the scientific Python ecosystem, then it has to be, like, scikit-learn-like. So a very limited number of public methods. Those public methods should be used, you know, in a very consistent way. And, also, certain hyperparameters should be preset.
We don't assume that the end user understands all of the hyperparameters. Those hyperparameters should work well out of the box. So, you know, this should be this turnkey type of library that you can just import. And if you have all of these properties, then you can have this, like, design where you can just switch out the import of the model, you know, the construction of that, you know, instance, and the code still runs, and people can try out how these things work, whether one of them is better or worse. And then, also, you know, like, whenever it's possible, use the Pythonic methods.
So PyTorch Geometric Temporal has these snapshot iterators, which use a lot of these, you know, methods, and also, you know, enforcing strong input assumptions. So the boilerplate is not on your side, but on the user's side. So they have to do the data engineering, and also most of the, you know, heavy lifting should be, you know, moved out. So it's done by Gensim when it's implicit factorization. When it's large scale, you know, graph operations, it runs on NetworKit. And, generally, I was really annoyed by the fact that no one writes tests. You know, there is no continuous integration.
And, also, you have no idea what's the code coverage, class by class, in other libraries. So I try to, you know, take some best practices in that sense, but, you know, it's very hard to write good tests in machine learning. So most of the time, what we do is smoke tests for, you know, shapes and also having trivial machine learning problems where you know that the model should be able to predict something that's very trivial to predict.
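As a concrete example of the kind of smoke test he mentions, a shape check on an embedding model might look like the sketch below (pytest style). The choice of the Karate Club DeepWalk model, the random graph, and the expected dimensions are illustrative assumptions.

```python
# Hedged sketch: a "smoke test" that only checks output shapes, in the spirit
# described above; swap in whichever model class you actually maintain.
import networkx as nx
import numpy as np
from karateclub import DeepWalk  # illustrative choice of model

def test_embedding_shape():
    graph = nx.newman_watts_strogatz_graph(100, 10, 0.2)
    model = DeepWalk(dimensions=32)
    model.fit(graph)
    embedding = model.get_embedding()
    # Smoke test: one row per node, one column per embedding dimension, no NaNs.
    assert embedding.shape == (graph.number_of_nodes(), 32)
    assert not np.isnan(embedding).any()
```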
[00:33:43] Unknown:
In your experience of building these models and working with people who are using your libraries and working in the space, and with your job at AstraZeneca, I'm curious: what are some of the most interesting or innovative or unexpected ways that you've seen machine learning on graph and network datasets used?
[00:34:01] Unknown:
So one of the interesting things that I have seen is something that people in Microsoft and Facebook research together did, which was the release of a dataset called MalNet, which is this collection of malware call graphs. And what they did is just, like, classifying the malware based on the call graphs. And the fact that you can use just that very simple representation to predict this very precisely was interesting for me. Another interesting area of research is this generation of graphs based on metric spaces where you generate a similarity graph, and then you can use semi-supervised learning on that. And that seems to work very well when you have this label, you know, scarcity, or it's very expensive to get labels. And in the last two years, it's become, like, a very active research domain, which has a lot of practical applications and is not just, you know, this, like, machine learning theory research, but something that seems to, you know, scale well and has a lot of practical applications.
And then there is this, like, thing that we also do in AstraZeneca, where you have a graph which has some sort of, you know, heterogeneity. Like, there are different types of nodes, and then you can have multimodal data on those nodes. And then the question is, how do you integrate that? How do you fuse that? In your own experience of working with this graph data
[00:35:31] Unknown:
and doing research on machine learning with networked datasets, I'm curious what are some of the most interesting or unexpected or challenging lessons that you learned in the process as well as for building these open source libraries to help yourself and other people in doing research on these problem domains?
[00:35:46] Unknown:
Yeah. So one of the, like, lessons that I learned, it's not about graph ML at all, but it's that a lot of people in machine learning don't understand the algorithms themselves, which is sometimes scary to see. So I get a lot of, like, issues which are like, how can I use this to do, I don't know, this or that type of problem, which you can't solve with that tool, which is sometimes funny. Yeah. Another thing is that people are not really good at, like, you know, recombining these ideas, which was also surprising. One of the things that I realized is that people need much more, you know, hand-holding, notebooks, examples, and that is something that people love. If you have, for example, a model class, then there should be an example with that. There should be a notebook with that.
Yeah. One of the things that I realized is that I came with a background in statistics and econometrics. And because of that, I had a strong awareness of what the basic principles are, such as, like, let us say, you know, residual autocorrelation and similar statistical phenomena. And it was really surprising to see people who develop these models, and some of them are, like, very important, you know, cornerstones in this domain, and yet they didn't understand that the model fails to capture something or has a certain, you know, specification issue and so on. And that was also, like, interesting to see.
And one of the, like, last things is that, you know, we put a lot of effort into, like, beating state-of-the-art models, while it's very easy to create algorithms and tools that are very simple, easy to understand, and very hard to beat, which are scalable and work well. And, you know, these are the ideal things for actual, you know, production machine learning.
[00:37:49] Unknown:
In terms of when you're working with a networked dataset, or if you're building a machine learning model and you have the option of representing a dataset as a graph or as a maybe more relational structure, what are the cases where you would choose to change the representation to not be a graph? Or, conversely, what are the cases where a graph approach might be the superior and more easily manageable way to build these machine learning models?
[00:38:20] Unknown:
I feel like the data scarcity aspect, like the labeling aspect, is something where the graph based approach can help a lot. Where it's not going to help is when you don't have this birds of a feather phenomenon with respect to features and labels. And those are kind of the pitfalls for these methods, and people are not checking for those assumptions. For some reason, they don't understand that these techniques, most of them, assume that there is this, you know, strong spatial similarity in the feature space and the label space. So, like, just to reuse an example: predicting gender based on a social network is a really hard problem. Surprisingly hard compared to predicting the age of people. So just by averaging out the age of your friends, you can explain roughly 50 to 60% of the variance in age.
But for gender, even with something very complicated, you can barely have an area under the curve of 0.55 or something like that.
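To make the age example concrete, the homophily-based baseline he is describing is essentially "predict each person's age as the mean age of their friends." A sketch of that baseline and its explained variance is below, with synthetic, homophilous ages standing in for real social network data.

```python
# Hedged sketch: neighborhood-average baseline for age prediction on a graph.
# The caveman graph and age model are synthetic stand-ins, chosen only to be homophilous.
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
graph = nx.connected_caveman_graph(50, 8)          # tight friend groups
group_age = {c: rng.uniform(18, 70) for c in range(50)}
age = {n: group_age[n // 8] + rng.normal(0, 3) for n in graph.nodes()}

# Predict each node's age as the mean age of its neighbors.
pred = {n: np.mean([age[m] for m in graph.neighbors(n)]) for n in graph.nodes()}

y = np.array([age[n] for n in graph.nodes()])
y_hat = np.array([pred[n] for n in graph.nodes()])
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"Variance in age explained by the neighbor average: {r2:.2f}")
```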
[00:39:28] Unknown:
In terms of your own libraries and tools that you've built to help yourself in working through these problems, what are the cases where you would decide not to use them and instead lean on a different set of tooling or technologies?
[00:39:42] Unknown:
So for spatiotemporal learning, PyTorch Geometric Temporal is the only tool out there, so there isn't much choice on that side. Little Ball of Fur works very well with the NetworKit-based back end, even on large problems. So when we saw in AstraZeneca that we can scale it up to the whole knowledge graph that we have and downsample that, that was really nice to see. At the same time, like, the design of Karate Club means that it's, like, research oriented, and it doesn't work well when it comes to, you know, industry-sized node embedding problems.
But, you know, as a side note, some graph fingerprinting methods that I implemented in that library, because they work in this inductive manner, can be, you know, parallelized easily, executed just on, you know, blocks of graphs. And because of that, you can do large scale machine learning with that. So the people who released the malware network dataset, which consists of, like, 2,000,000 graphs, and some of the graphs have, like, hundreds of thousands of nodes, so it's large graphs, it's like, I don't know, 10 terabytes of data, they managed to use Karate Club on that to get the graph level embeddings for each of the graphs. They had a very, like, smart way of distributing it, which was interesting to see. But what I would say is that of the tools that I created, Karate Club is purely research oriented; Little Ball of Fur is something that you could use in production, though. And PyTorch Geometric Temporal would work in production. So we were able to train forecasting models using more than, I don't know, 5,000 windmills, which are located in one of the Scandinavian countries, which is, like, nice.
[00:41:31] Unknown:
In terms of the overall space of GraphML and the sort of capabilities that it provides and the types of problem domains that it's conducive for, what are some of the sort of interesting and upcoming trends that you're keeping an eye on, and what are some of the areas that you're excited to work on either by extending your existing libraries or building out new tooling or just doing your own research in the space?
[00:41:58] Unknown:
So one of the, like, interesting problems that we are trying to tackle with the people involved in PyTorch Geometric Temporal is this issue that on most social networks, the time difference between events is not constant. In the examples that I mentioned, like the windmill output forecasting problem, the time differences are constant. But in a lot of settings, that's not true. It's very hard to handle because you have, you know, a spatial domain and a temporal domain, and you have to be able to have the right type of data structures and to define models on that which can, you know, operate on that type of data. So that's definitely a direction that we are working on. So PyTorch Geometric Temporal is, like, this community effort with people from the University of Cambridge and Oxford, and we're working on that to make that possible.
And one of the things that I want to do personally is to have other types of back ends for Little Ball of Fur, such as, you know, scikit-network, igraph, and graph-tool, these classical graph analytics libraries, as back ends. And, personally, what I would love to do is have a library which can calculate different measures of spatial autocorrelation on a graph, because it's something that I have to do. I have to write code for that, and most probably other people have the same problem. And it seems to be, you know, again, a niche area that no one tries to tackle.
[00:43:31] Unknown:
Are there any other aspects of the libraries that you're building and the work that you're doing in GraphML or the overall space of machine learning on graphs that we didn't discuss yet that you'd like to cover before we close out the show?
[00:43:43] Unknown:
Yeah. I feel like one of the interesting aspects is that it intersects a lot with, you know, geometric applications of ML. And because of that, it might be interesting to cover those more general cases and how, for example, graph ML relates to computer vision and processing of sequences. That's just something that I would like to mention. And, also, explainability on graphs seems to be a very hot topic in the last year, and there are beautiful papers and good tools to do explainability at the instance level on nodes and edges.
[00:44:22] Unknown:
And are there any particular resources or reference material that you would recommend people look to if they're trying to learn more about how to take advantage of networked datasets and build machine learning models across
[00:44:36] Unknown:
them? Yeah. Luckily, this year, two textbooks came out, and I feel like most of the models that are described in those textbooks are available in PyTorch Geometric and also in DGL. Out of these two textbooks, I would recommend the textbook by William L. Hamilton, who is a professor at Mila. If people go search for it, they will find it. It's a very generic, graduate school level textbook for graph representation learning. I would recommend it. It's, like, an easy read, fun, publicly available, but you can also order a proper hard copy if you would love to do that.
[00:45:17] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing or contribute to the libraries that you maintain, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the movie Wrath of Man. It came out recently. It's a Guy Ritchie film. All of Guy Ritchie's films I've thoroughly enjoyed, just because you really have to pay attention to understand what's going on. And this is, you know, another one where there are a few interesting twists to it. So I just had fun watching that. So if you're looking for something to watch, I definitely recommend that one. And so with that, I'll pass it to you, Benedek. Do you have any picks this week? Yeah. I would like to recommend Hunt for the Wilderpeople, which is a 2016
[00:45:56] Unknown:
movie. It's by Taika Waititi, who is a director from New Zealand. It's a wonderful comedy drama movie about an actual, real rehabilitation program for children who are young offenders in New Zealand. It's a hilarious movie. And I would also love to recommend the research paper which has the title Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges, which is this wonderful, pretty long research paper which, you know, unifies a lot of the machine learning that's out there. It's a pretty heavy read, but it's worthwhile.
[00:46:36] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on building libraries and doing work in the GraphML space. It's definitely a very interesting problem domain, and I'm curious to see how it continues to evolve. Definitely been seeing a lot of interest in how to be able to do machine learning on graph data. So thank you for all of your time and efforts on that, and I hope you enjoy the rest of your day. Thank
[00:47:01] Unknown:
you. Have a nice day.
[00:47:05] Unknown:
Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Benedek Rozemberczki
Machine Learning on Graph Data
Tools and Libraries for Graph Machine Learning
Research vs Production Oriented Libraries
Benedek's Open Source Libraries
Open Source Contributions and Community
Workflow for Graph Machine Learning
Challenges in Graph Machine Learning
Design Patterns and Maintainability
Innovative Uses of Graph Machine Learning
Lessons Learned in Graph Machine Learning
Choosing Graph Representation
Future Trends and Research Directions
Intersections with Other Domains