Accelerating Drug Discovery Using Machine Learning With TorchDrug

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.

With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, NodeBalancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.

Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host as usual is Tobias Macey, and today I'm interviewing Jian Tang about TorchDrug.

So, Jian, can you start by introducing yourself?

Thanks very much for the invitation. I'm glad to be here. My name is Jian Tang. I'm an assistant professor at Mila, the Quebec AI Institute, which is a research institute focused on machine learning and artificial intelligence, founded by the Turing Award winner Yoshua Bengio. My main research interest is deep learning for graph-structured data, or in other words graph representation learning, graph neural networks, geometric deep learning, and knowledge graphs. I'm also very interested in applications like drug discovery and material discovery.

And do you remember how you first got introduced to Python?

I remember I first got in touch with Python when I was a PhD student. That was back in 2010. At that time, I mostly used programming languages like C++ and Java. But then I saw that a lot of people were starting to use Python. Many of my friends told me that Python is very easy to use, that the community is really growing, and that there are a lot of very useful tools we can leverage. So that's why I got interested in Python and started to learn it.

And so one of the particular areas of focus that you have right now is the TorchDrug project, which you mentioned is being used for some drug discovery use cases, and it seems to be a fairly well-polished project. I'm wondering if you can just give a bit of an overview about what it is that you're building with TorchDrug and some of the story behind how it got started and why you decided to focus your efforts on that particular area.

Right. So maybe I can give a little bit of my research background. As I just said, my research focuses on deep learning for graph-structured data. But in the first place, when I was a PhD student, I was not working on deep learning. I was working on machine learning, but more on classical machine learning models, like statistical topic models.

But then back in 2012 or 2013, which was the last year of my PhD, deep learning was starting to get very hot, and I was really interested in moving into deep learning. At that time, deep learning was mainly applied to computer vision, speech recognition, and natural language processing. I thought, okay, I want to work on deep learning, but I may not want to work on those applications, because I'm not as good as the people already working there. So maybe I can find a new area, a new domain, and apply deep learning techniques to those applications. And I was always interested in graph data, because I was very active in the data mining community.

At that time, people in the data mining community were really interested in networks, or graphs, because they were very interested in analyzing Twitter data. On Twitter we have a lot of networks, right? The follower and followee social network data. So I was very interested in graph data and network data. But people in the data mining community didn't know deep learning, and people who worked on deep learning didn't know graphs. They worked on computer vision, natural language processing, or speech recognition.

And that's where I found a big opportunity: connecting the two communities and starting to work on deep learning for graph-structured data, or we can say graph representation learning. So that was back in 2012, which was the last year of my PhD. After my PhD, I worked at Microsoft Research and continued to work in this area, because I felt there was huge potential.

After working for two years at Microsoft Research, I decided I wanted to go back to academia. That's why I quit my job at Microsoft Research and did a postdoc at the University of Michigan and also Carnegie Mellon University.

And then afterwards, I found a faculty position at Mila, which is where I am now.

When I started my faculty position at Mila in Canada, that was 2018, I had already worked on graph representation learning for around four years. And I was wondering what the future of graph representation learning would be. Besides applying it to data like social networks, which is also exciting, I felt like we could find something more interesting. So I started to think about what could be the key application for graph representation learning, and I realized that it is in biology, in drug discovery.

There we have a lot of graph-structured data. We have a lot of graphs, like molecules and proteins. They can be represented as graphs. And in biology we also have many different networks, like protein-protein interaction networks and interactions between proteins and drugs.

Also, when I was in high school, our teacher told us that the 21st century is going to be the century of biology, so at the bottom of my heart I was very interested in biology. Of course, later, because I work on AI, we don't believe the 21st century will be the century of biology. We believe the 21st century is going to be the century of AI, right? But then I felt that maybe the best domain I can work on is really the intersection of AI and biology. And that's why I got very interested in working on AI for drug discovery.

So my group has been focused on AI for drug discovery for the past four years, mainly developing graph machine learning techniques for drug discovery. TorchDrug basically assembles all our previous efforts on different drug discovery tasks into a unified system. So it covers many fundamental tasks of drug discovery.

For example, de novo molecule design, which means we want to develop a totally new molecule. In order to do that, there are a few very fundamental tasks. For example, given a compound or molecule, we want to predict its chemical properties or biological activity. That's one very fundamental task. The second one is de novo molecule design and optimization. Basically, we want to search for molecules with desired properties: they have high binding affinity to a protein target, can be synthesized, or have low toxicity. So this is the second very fundamental problem, called de novo molecule design and optimization.

Once we are able to identify a desired molecule, in practice we also have to care about whether this molecule can actually be synthesized. This is related to the problem called retrosynthesis planning or prediction.

That means we want to find a set of molecules, or reactants, from which we can synthesize the target molecule, the molecular structure we want, which was found in the second problem I just described. So all of these very fundamental tasks are related to de novo molecule design, which means designing a new molecular structure. Besides designing a new molecule, another solution for drug discovery is called drug repurposing.

That means, instead of trying to design a new molecule, a new drug, we can repurpose some of the existing approved drugs that are used to treat other diseases and use them to treat a new disease. Take COVID-19 as an example.

Developing a new drug takes a really long time. It's almost impossible to quickly develop a new drug. So a more practical solution is to repurpose some of the existing drugs, for example, drugs that were developed for treating SARS. Because COVID is very similar to SARS, maybe the drugs that could be used to treat SARS could be used to treat COVID as well. So that's the second type of drug discovery solution, which is trying to repurpose existing approved drugs for treating new diseases, and which could be much faster.

So that's a quick introduction to TorchDrug and also how I got into this journey.

It sounds like some of the primary goals of the project are molecule design and discovery, and being able to make the search space tractable in a reasonable amount of time. If you're trying to do that by hand or by actually synthesizing the molecules and testing them out, it's going to take a lot more time and expense. And so I'm wondering who the target users of the project are, and who the targets are in terms of the outputs of what the project is able to create?

So, basically, the goal of creating this platform is to accelerate the progress of AI for drug discovery. Our goal is to build an open source community for AI for drug discovery. I think there could be two types of users. The first type is scientists, especially people working on machine learning and AI. Right now in the AI and machine learning community there's a huge interest in working on drug discovery, but people may not have the right knowledge. They don't know which tasks they can work on, or what the most important tasks in this domain are.

So in TorchDrug we benchmark a bunch of important tasks in this domain, like the ones I just mentioned. For each task, we provide some public datasets and also the evaluation metrics. We try to minimize the domain knowledge that users need, so people working on machine learning can mainly work on designing new machine learning algorithms. So this is the first type of audience: people working on AI or machine learning. If they want to work on drug discovery, they know what they can work on, they know the important problems, and they can focus on developing new algorithms.

So that is the first type of audience. The second group is, of course, people working on drug discovery. Now we have this open source platform, which already implements the state-of-the-art machine learning algorithms for drug discovery. So if they have data, they can leverage the latest machine learning techniques for their problems. This could be used by academic researchers working on AI for drug discovery and also by industry practitioners. For example, right now we have many AI for drug discovery startups, and many of them may be able to leverage our efforts.

And then as far as the challenges and complexities that are faced by people who are working in the scientific aspects of this domain, you know, biologists and chemists and people working for pharmaceutical companies, what are some of the challenges and complexities that they face in doing drug discovery with the existing state of the art in the absence of AI, and what are some of the new capabilities that this might unlock for them?

I somewhat mentioned this already in the previous questions. For people working on artificial intelligence or machine learning, the main challenge is: wow, it's a very exciting field, but I don't know which problem I should work on. TorchDrug basically provides a set of the most important tasks in this domain, so they know what they can work on. It's the same for people like biologists or chemists. They have good knowledge about drug discovery, but they are not so familiar with the latest advances in machine learning models. They don't know which model they should use for their specific task. And TorchDrug implements the state-of-the-art machine learning algorithms or models for their tasks.

Also, I think another good thing about TorchDrug is that it supports different tasks and covers different types of data, as I mentioned in the beginning. Right now TorchDrug supports analyzing two types of data. One is molecules, small molecules, which are mainly of interest to chemists.

We are also able to analyze biomedical knowledge graphs, which mainly capture the interactions between different biological entities, for example the interactions between drugs, proteins, and diseases. These complicated interactions are mainly of interest to biologists. TorchDrug provides a solution for analyzing both the small molecules, the molecular structures, and the biological networks, which are usually analyzed or processed separately by chemists and biologists.
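To make the two kinds of data described here concrete, the following is a minimal sketch in plain Python. It is not TorchDrug's API: the `MoleculeGraph` class, the helper functions, and the entity names such as `drug_A` are all hypothetical and exist only for illustration. A small molecule is stored as a graph of atoms and bonds, and a biomedical knowledge graph is stored as a list of (head, relation, tail) triples.

```python
from dataclasses import dataclass


@dataclass
class MoleculeGraph:
    """Hypothetical container: atoms are nodes, bonds are edges."""
    atoms: list  # element symbols; the list index is the node id
    bonds: list  # (atom_i, atom_j, bond_order) tuples

    def neighbors(self, i):
        """Return the ids of the atoms bonded to atom i."""
        out = []
        for a, b, _ in self.bonds:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return out


# Ethanol with hydrogens omitted: C-C-O, two single bonds.
ethanol = MoleculeGraph(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])

# A tiny knowledge graph as (head, relation, tail) triples.
# All entity and relation names are made up for this example.
knowledge_graph = [
    ("drug_A", "binds", "protein_X"),
    ("protein_X", "associated_with", "disease_Y"),
    ("drug_B", "binds", "protein_X"),
]


def heads(triples, relation, tail):
    """All entities h such that (h, relation, tail) is in the graph."""
    return [h for h, r, t in triples if r == relation and t == tail]


print(ethanol.neighbors(1))
print(heads(knowledge_graph, "binds", "protein_X"))
```

The molecule graph has a handful of nodes, while a real knowledge graph of drugs, proteins, and diseases can have millions, which is why the two are usually handled with different tooling.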

Also, in terms of the collaboration opportunities that it opens up for the chemists and the biologists in the research arena, I'm wondering how some of these AI technologies are able to speed up the process and simplify some of the communication between people who are working in these different fields and at the two different stages of the workflow of discovering these different drugs?

TorchDrug right now is able to support analyzing both types of data. For example, to understand a molecule, we can analyze it based on its chemical structure, which is mainly what chemists do. And we can also understand a molecule according to its interactions with other entities, for example with proteins or with diseases, which is what biologists mainly care about. TorchDrug allows us to support the analysis of both types of data simultaneously.

In terms of the actual implementation of the TorchDrug project, can you talk through some of the ways that it's designed and some of the architecture and software development aspects of it?

As I said in the beginning, TorchDrug initially mainly supports the analysis of small molecules and also biomedical knowledge graphs.

Both types of data are graph data, but they are quite different. Small molecules are small graphs, which usually have around tens of atoms, or tens of nodes. But a biomedical knowledge graph can be very large. It involves the relationships between different kinds of drugs, proteins, and diseases, and the graph could have millions of nodes. From the beginning, we tried to design a system that could process both types of graphs: small molecules, which are small graphs, as well as big graphs. That's the key thing we tried to consider when we first designed the system.

As you have been exploring the overall problem space and getting more understanding of what's involved in some of the scientific aspects of drug discovery and some of the ways that it's being used by scientists and by the AI community, how has that helped to drive the particular focus of the TorchDrug project and some of the priorities as far as directions that you might want to explore with it?

Right now, TorchDrug mainly focuses on small molecules.

That is, drug discovery based on small molecules. But recently, starting from last year and this year, there's a huge interest in the AI and biology communities in developing big molecules, for example proteins like peptides or antibodies, for drug discovery. So in the future we also plan to support big molecule discovery, like protein design and protein representation learning. So that's what we are doing now and what we plan to further support in the future through TorchDrug.

The recent breakthrough by DeepMind on protein structure prediction with AlphaFold 2 has brought a lot of attention and interest to proteins, to protein understanding, protein representation learning, and protein design. That's definitely something we want to support in the next release.

In terms of the assumptions that you had going into this particular problem space of drug discovery, given your understanding of network datasets and graph learning, what are some of the assumptions that you had that have been challenged, or some of the ideas that you had as far as the potential approach or the potential capabilities that have had to be adjusted or reconsidered as you've progressed further through the implementation and engagement with the community?

So far, we usually represent small molecules as graphs.

And for proteins, we usually represent them as sequences, basically a sequence of amino acids. But there's a limitation there, because for molecules, either small molecules or proteins, a more natural representation would be their 3D structures. Molecules naturally exist as 3D structures.

So that's a big limitation of graph-based methods, because graph-based methods can only capture the relationships between the atoms. They cannot capture the geometric features of those molecules, which are very important for drug design. For example, if you want to design a small molecule, ideally the shape of the small molecule should be complementary to the shape of the protein surface. In order to model that, it's important to model their 3D structures instead of relying only on their 2D graph representation. And there's actually a trend in the community: for molecule or protein modeling, we are now going beyond 2D graph representation learning to 3D structure modeling.

In terms of the graph characteristics of these molecular structures, I'm wondering what are some of the complexities that are inherent to the problem domain where, you know, if you're doing some graph research, then maybe you're able to enforce that it is acyclic in nature, whereas with chemical structures and chemical compounds, you want to have certain cycles involved, and just being able to understand the molecular composition from a network perspective and some of the interesting problems that that brings into the overall discovery and research space.

Right. So maybe I can mention a few important complexities or challenges.

The first challenge is really the amount of labeled data. If we want to build a machine learning model, it's important to have enough training data. For example, in computer vision, natural language processing, and speech recognition, we have a lot of labeled data. But in drug discovery we actually don't have much labeled data, and that's a big challenge in general. It means we want to develop models that are able to learn from very little labeled data. Of course, this is already actively studied in the machine learning community, where it is known as few-shot learning or meta-learning. But these kinds of techniques are particularly important in drug discovery, because in this domain the amount of labeled data is really limited, and it's very expensive and time-consuming to obtain labeled data. So this is the first challenge for molecule modeling in drug discovery. The second challenge, I think, is really about how we model molecules.

We basically have to take the symmetry of the molecule into consideration.

If we represent a molecule as a 3D structure, then the representation of that 3D molecular structure should be rotation and translation invariant.

So when you develop your model, that's something you have to take into consideration. If you rotate or translate your molecule, some of its properties, for example the energy of the molecule, should remain invariant. So that's one type of symmetry we want to consider, which is invariance. But there's also another property, which is called equivariance. For example, given a molecule, if we want to predict the forces on each atom, then the forces on the atoms should be rotation and translation equivariant.

What does that mean? If you have a 3D molecular structure and you rotate your molecule, then the forces on the atoms should be rotated accordingly.

This is called rotation equivariance.

So when you try to build a model, that's something you need to take into consideration.

This is so far a very active research area in molecule modeling.
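The invariance and equivariance properties described above can be checked numerically on a toy system. The sketch below is only an illustration under stated assumptions: the "energy" is a made-up function of pairwise distances, not a physical force field and not a learned model, and the names `energy`, `forces`, and `rotate_z` are hypothetical. Because the energy depends only on distances, it should be unchanged by rotation and translation, while the forces (the negative gradient of the energy) should rotate together with the molecule and stay the same under translation.

```python
import math


def rotate_z(p, theta):
    """Rotate a 3D point or vector about the z axis by theta radians."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y, z)


def energy(coords):
    """Toy energy: sum over atom pairs of (distance - 1)^2."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            total += (math.dist(coords[i], coords[j]) - 1.0) ** 2
    return total


def forces(coords):
    """Force on each atom: the negative gradient of the toy energy."""
    out = []
    for i, pi in enumerate(coords):
        f = [0.0, 0.0, 0.0]
        for j, pj in enumerate(coords):
            if i == j:
                continue
            d = math.dist(pi, pj)
            # Gradient of (d - 1)^2 w.r.t. pi is 2 (d - 1) (pi - pj) / d.
            coef = -2.0 * (d - 1.0) / d
            for k in range(3):
                f[k] += coef * (pi[k] - pj[k])
        out.append(tuple(f))
    return out


def close(a, b, tol=1e-9):
    """Component-wise comparison of two vectors."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))


coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.8, 0.3)]
theta = 0.7
rotated = [rotate_z(p, theta) for p in coords]
shifted = [(x + 2.0, y - 1.0, z + 0.5) for x, y, z in coords]

# Invariance: the scalar energy does not change.
energy_invariant = (
    math.isclose(energy(coords), energy(rotated), abs_tol=1e-9)
    and math.isclose(energy(coords), energy(shifted), abs_tol=1e-9)
)

# Equivariance: forces of the rotated molecule equal the rotated forces.
forces_equivariant = all(
    close(f_rot, rotate_z(f, theta))
    for f_rot, f in zip(forces(rotated), forces(coords))
)

print(energy_invariant, forces_equivariant)
```

A model that predicts energies or forces for 3D structures would ideally pass the same two checks by construction rather than having to learn them from data.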

Another interesting aspect of the TorchDrug project is the fact that you are trying to make it accessible and understandable to these two different communities: the scientists, you know, chemists and biologists, and the AI community who are familiar with some of the deep learning concepts. And I'm wondering what your approach has been as far as how to design the APIs of the project and the documentation and tutorials to make it accessible to these different communities with these varying backgrounds?

That's a really good question, and I would like to thank my students for these efforts. In the beginning, we first identified a set of important tasks in drug discovery. Then for each task, we provide APIs to some important public datasets, so users don't need to deal with the data preprocessing.

For example, people working in machine learning may not have experience preprocessing molecules. We have APIs to load the datasets and preprocess them, so users do not need to take care of the data preprocessing stage.

And then we provide a bunch of different state-of-the-art algorithms, which are trained on the public datasets. At the same time, we also provide standard evaluation metrics. So once you have trained your model, we have some standard modules to evaluate the performance of your algorithms. So basically we provide open public datasets and also standard evaluation methods, so that people working in machine learning don't need to worry about the data preprocessing or the evaluation. Eventually all the data will be represented as graphs, so they can focus only on the machine learning side, basically how to develop machine learning algorithms to analyze graph data.

In terms of the workflow of somebody who's using TorchDrug to be able to build a model or do some of these search space optimization problems of de novo molecule design or the retrosynthesis capabilities, I'm wondering if you can talk through some of the overall workflow of getting started with TorchDrug, setting up the problem design, figuring out what the goal states are that you're optimizing for, and then executing and iterating on the experiment.

So usually, let's say we have a protein target. And maybe through some wet lab experiments we can obtain some initial training data. We know some molecules that are potential binders, basically positive molecules for the protein target. We also have some inactive molecules.

In this case, we can try to predict the binding affinity of a molecule. This is related to the task of molecule property prediction in TorchDrug. We have a molecule and we want to predict its chemical properties, for example whether it binds or not. Once we have a set of labeled data like this, users can train a molecule property prediction model based on the training data.

And once we have the molecule property prediction model, or binding affinity prediction model, that model can be used in the molecule design and optimization task, because in that task we need to have a reward function, which tells the algorithm whether a molecule is good or not. The essential idea of de novo molecule design and optimization is that the algorithm uses reinforcement learning. So the algorithm is basically trial and error. It tries a molecule, and then the environment tells it whether this molecule is good or not. If it's good, maybe the algorithm will further explore around that area. If it's not good, maybe the algorithm will move to another area.

So you can see that during this process, the essential part is that we need to provide a reward function to tell the algorithm whether a molecule is good or not. Good or not depends on the type of molecular property you want. Of course, the most important one is usually the binding affinity, whether this molecule can bind to the protein target or not. And that's why I said that before working on this molecule design and optimization, or molecule search, module, we have to train a machine learning predictor to predict whether a molecule has the desired property or not, for example whether the molecule can bind to the protein target or not.

In order to do this, we usually need some initial training data. We have some positive molecules, which can bind to the protein target, and we can also have some inactive molecules, which are not able to bind to the protein target. This can be done through some bioassays. We don't need to have many in the beginning, but we need some in the first place. Once we have the machine learning predictor, we can use it to search for better molecules using the molecule design and optimization module. By doing that, we are able to search for a set of potentially better molecules. For example, we can search for 1,000 molecules which could potentially be better binders.

Then we can send these 1,000 molecules back to the wet lab so we can test them again. By doing this, we get more training data. That data can be fed back to our machine learning predictor, because now the predictor has more training data, so we can refine it. The new machine learning predictor is sent back to the search module, and based on the new predictor maybe we can find better candidates. And these new candidates will be sent back to the wet lab. So by doing this, we can close the loop between the computational side and the wet lab side. And this is basically how the whole workflow functions.
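The loop described above (train a predictor, use it as the reward for a search, test the proposals in the lab, retrain) can be sketched in a few lines of plain Python. Everything here is a simplifying assumption for illustration: candidate "molecules" are just integers, `wet_lab_assay` is a synthetic stand-in for a real experiment, the predictor is a one-nearest-neighbor lookup, and the search step is a greedy ranking by predicted reward rather than the reinforcement learning approach mentioned in the conversation. None of these names come from TorchDrug.

```python
def wet_lab_assay(x):
    """Stand-in for a lab experiment: a hidden, synthetic binding score.
    Higher is better; the best candidate in this toy setup is x = 70."""
    return -abs(x - 70)


def fit_predictor(labeled):
    """Train a toy predictor: 1-nearest-neighbor over tested candidates."""
    def predict(x):
        nearest = min(labeled, key=lambda known: abs(known - x))
        return labeled[nearest]
    return predict


def search(predict, candidates, tested, k):
    """Use the predictor as a reward function: rank untested candidates
    by predicted score and return the top k proposals."""
    pool = [c for c in candidates if c not in tested]
    return sorted(pool, key=predict, reverse=True)[:k]


def closed_loop(candidates, initial, rounds=5, k=3):
    """Alternate between the computational side and the (simulated) lab."""
    labeled = {x: wet_lab_assay(x) for x in initial}  # initial assay data
    for _ in range(rounds):
        predict = fit_predictor(labeled)                  # 1. (re)train
        batch = search(predict, candidates, labeled, k)   # 2. propose
        for x in batch:                                   # 3. test in lab
            labeled[x] = wet_lab_assay(x)                 # 4. feed back
    return labeled


initial = [10, 40, 90]
result = closed_loop(range(100), initial, rounds=5, k=3)
best = max(result, key=result.get)
print(best, result[best])
```

Each round adds `k` new measurements, so the predictor in the next round is trained on more data than the one before it, which is the feedback loop between the computational side and the wet lab side.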

You mentioned the AlphaFold project. And I'm wondering, given the fact that you are trying to optimize for being able to bind to these various protein targets, what potential impact do you see from the availability of these new protein folding structures and the information that it provides to the biological community, how that might impact the overall aspect of drug discovery, and some of the ways that you might be able to integrate that information into the TorchDrug toolchain?

Yeah, that's a very good question. Usually for drug discovery there are two types of approaches. One type is called ligand-based approaches, and the other is called structure-based design approaches.

For ligand-based approaches, we usually don't make use of the structure of the protein target. We mainly use the ligand information. As I said, some ligands are positive and others are negative, so we can train the machine learning predictor based only on the molecule information. Of course, this is not ideal, because in this case we don't leverage the 3D structure of the target protein. A better solution is structure-based drug design. In this case, we leverage the 3D structure of the protein target. Basically, we try to find a good match between the protein target, I mean its 3D structure, and the small molecule.

With AlphaFold 2, we are now able to obtain the 3D structure of almost all human proteins very accurately. So I think it really makes structure-based drug design possible, because for any drug target we now know its 3D structure, and we can focus on developing techniques for structure-based drug design. Previously, for some targets, we didn't know their 3D structures, which was a big challenge. But now for all the targets we know the 3D structure, and we can just focus on designing small molecules, or even proteins, which are able to bind to the 3D structure of the drug target.

In terms of the technical aspects of the foundation that you've chosen for the TorchDrug project, I'm wondering what was the decision process and some of the aspects that you were looking for as you were deciding what to base the overall project on, and how that ultimately ended up leading you to use PyTorch versus TensorFlow or any of the other machine learning frameworks that are available?

I think that's somewhat related to the evolution of community adoption between TensorFlow and PyTorch. In the beginning, TensorFlow was very popular, basically dominating the whole community. But after PyTorch was released, I think the scientific community quickly adopted PyTorch because it's easy to use. With PyTorch, I think we are able to quickly prototype a research idea. And that's why in the scientific community PyTorch is gaining increasing popularity.

And, actually, the decision is not that difficult to make because in our group, like, based all my students, they are the new network's based on PyTorch. Doesn't actually, not a very difficult decision to make. In the PyTorch, like, framework, there are also already

quite a few frameworks that are able to deal with graphs, and our techniques are mainly based on graphs. For example, there is PyTorch Geometric, which is a PyTorch library specifically designed for graphs, and also DGL, the Deep Graph Library. Both are very popular in the graph machine learning community. But those graph-based PyTorch libraries are mainly designed for handling arbitrary graphs, not only molecular graphs. They are designed to support any kind of graph, like social networks, and they can deal with molecules as well. But for us, we really wanted to design a library specifically for drug discovery, for handling molecules, proteins, and maybe biomedical knowledge graphs. Molecules can be represented as graphs, but, actually,

there are some specific operations that are unique to molecules. That's why, in our framework, we designed some specific operators for dealing with molecules and molecular graphs. Now we are also extending that to deal with proteins as well as 3D structures. As I said in the beginning, we are now trying to move from 2D graphs to 3D structures, which is missing in many existing graph-based libraries, and we are also trying to support different kinds of operations on 3D structures.
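
As a rough illustration of the point about molecule-specific operations, here is a minimal sketch in plain Python. It is not the TorchDrug API: the `MoleculeGraph` class, its methods, and the `MAX_VALENCE` table are hypothetical names invented for this example. It treats atoms as nodes and bonds as typed edges, then adds a valence check, the kind of chemistry-aware rule a general-purpose graph library has no reason to enforce.

```python
from dataclasses import dataclass, field

# Illustrative valence limits for a few elements (an assumption of this sketch).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


@dataclass
class MoleculeGraph:
    """Hypothetical minimal container; not the TorchDrug data structure."""
    atoms: list                                  # element symbol per node
    bonds: list = field(default_factory=list)    # (i, j, bond_order) per edge

    def add_bond(self, i, j, order=1):
        self.bonds.append((i, j, order))

    def valence(self, i):
        """Sum of bond orders on atom i (hydrogens are left implicit)."""
        return sum(order for a, b, order in self.bonds if i in (a, b))

    def can_add_bond(self, i, j, order=1):
        """Molecule-specific check: both atoms must stay within their valence."""
        return all(
            self.valence(k) + order <= MAX_VALENCE[self.atoms[k]]
            for k in (i, j)
        )


# Heavy atoms of ethanol: C-C-O
ethanol = MoleculeGraph(atoms=["C", "C", "O"])
ethanol.add_bond(0, 1)
ethanol.add_bond(1, 2)

print(ethanol.valence(1))             # bond orders on the middle carbon
print(ethanol.can_add_bond(0, 2, 1))  # a single bond keeps O within valence 2
print(ethanol.can_add_bond(0, 2, 2))  # a double bond would exceed it
```

A generic graph would accept any edge between two existing nodes. The valence rule is what makes this graph a molecule, which is the kind of constraint a drug-discovery library can build in.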

On the graph aspects of it, I'm wondering if targeting small molecules helps to cut down on some of the computational cost and complexity, because the networks are relatively small. And specific to the graph library aspect of it, have there been any unique capabilities in NetworkX that you're leaning on, or have you considered using or experimented with some of the projects such as Networked or GraphLab?

We mainly focus on developing deep learning algorithms, basically graph representation learning, graph neural networks, or knowledge graph embedding techniques. There are, of course, some traditional graph-based or network-based analysis algorithms, but those mainly live in traditional network analysis software. For example, they calculate PageRank or clustering coefficients. Those are not the type of techniques we want to use for molecule modeling, because we mainly want deep learning techniques such as graph neural networks, graph representation learning, or geometric deep learning. These are a new type of technique that we want to develop for molecule modeling, and they are quite different from traditional graph analysis or network analysis software.
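
To make the contrast with PageRank-style analysis a little more concrete, the following is a minimal sketch of one round of neighborhood aggregation, the basic step that graph neural networks repeat. It is plain Python with no learned weights, and the function names `message_passing_step` and `readout` are assumptions made for this example rather than anything from TorchDrug or PyTorch Geometric. A real graph neural network would apply learned transformations before and after the aggregation.

```python
def message_passing_step(features, edges):
    """One round of sum aggregation over an undirected graph.

    Each node's new vector is its own vector plus the vectors of its neighbors.
    """
    updated = [list(f) for f in features]
    for i, j in edges:
        for k in range(len(features[i])):
            updated[i][k] += features[j][k]
            updated[j][k] += features[i][k]
    return updated


def readout(features):
    """Sum node vectors into a single graph-level vector."""
    return [sum(column) for column in zip(*features)]


# Heavy atoms of ethanol (C-C-O) with one-hot features: [is_carbon, is_oxygen]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]

hidden = message_passing_step(features, edges)
graph_vector = readout(hidden)

print(hidden)        # per-atom vectors that now encode the local neighborhood
print(graph_vector)  # one vector for the whole molecule
```

After one step, the two carbons no longer share the same vector, because only one of them is bonded to the oxygen. Stacking several such steps, with learned weights, is what lets a model distinguish atoms by their chemical environment.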

Yeah. In terms of the overall space of drug discovery research and some of the complexities that are there, I'm wondering what are some of the biggest open questions in that area and some of the ways that you're hoping to apply the capabilities of TorchDrug to simplify or speed up some of that research.

So in general, for drug discovery, the really fundamental question is to understand the interactions between, let's say, protein and ligand, and also the interactions between protein and protein, which is the key to drug design. Let's say you have a protein target. If you want to design a small molecule or a protein to bind to that target, then it's really important to understand either the interactions between protein and ligand or the interactions between protein and protein. I think understanding the interactions between the two is really the key to drug design in the future.

It's also very challenging. Compared to other problems, as we discussed, we have now made huge progress in protein structure prediction with the AlphaFold 2 system. But it's still a really challenging question to model the interactions between proteins, because the amount of labeled data is very limited in protein-protein interaction or in protein-ligand design. For protein structure prediction, we have a lot of labeled data, or training data, and that's actually one of the key reasons why AlphaFold 2 is so successful. But for protein-protein interaction or protein-ligand interaction, we don't have as much labeled data, so it's going to be a very challenging problem. I think in the future, if we can make some significant progress on either of these two problems, we can make a huge advancement

in drug design. So in the future, we will try to benchmark the two tasks. One is protein-ligand interaction, mainly to support small molecule discovery, and the other is protein-protein interaction, mainly to support protein design. We want to benchmark the two tasks and also try to push forward progress on these two fundamental and really key tasks in drug discovery.

And so in terms of your experience of building the TorchDrug project, exploring the overall space of drug discovery, and digging into the research, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in the process.

I think the most challenging thing is that the problem itself is really multidisciplinary. You need to have expertise from both machine learning and drug discovery. I feel like that's where the most challenging things are.

We also have a lot of collaborators on the biomedical side. Communication is always difficult in the beginning because we speak different languages, and we have totally different mindsets. So there's a lot of complexity there in terms of communication, but with more and more communication, things get better. I think that's in general the most challenging thing, especially for people working on AI, because we don't know what the most important problems in this domain are, and we don't know where we can get useful datasets.

Another thing is that building a system like TorchDrug has basically the same problem. We are mainly machine learning researchers focused on developing new algorithms. But TorchDrug is a machine learning system, so it's not enough to have people who only have algorithm development skills if you want to build a good machine learning system. Fortunately, in my group, we have a student who understands the machine learning algorithms and also has very strong system development capability.

So in general, I feel like in this domain the most challenging thing is always that the problem is really multidisciplinary. We need expertise from machine learning, from chemistry, from physics. And even within machine learning, we have algorithm development and system development, which are also quite different.

Yeah. It's definitely interesting, some of the human problems that crop up when you're trying to put together two different groups of people who have different interests but a combined goal, and the shared vocabulary that you have to figure out. Each domain has its specific ways of discussing things, but they don't mesh well together until you're able to get them all talking together and establish a new vocabulary.

Right. But I think in general, it's important to have a really open mindset, to appreciate the efforts made in other domains, and to be very open to learning from them. It's also important to be a quick learner, because many things are totally new for us. Pretty much all the biological knowledge is totally new to us, so we have to learn it from scratch.

Absolutely. And so for people who are interested in exploring the space of drug discovery or actually working in that field, what are some of the cases where TorchDrug might be the wrong choice and they're better served by established experimental practices for drug discovery or by different computational approaches?

For drug discovery, of course, machine learning is the trending technique right now. But before machine learning, there were also traditional approaches to drug discovery, for example those based on traditional physics-based methods. For now, TorchDrug doesn't support those techniques. We mainly support machine learning based techniques for drug discovery. So it may not be the right tool if you want to explore more traditional physics-based approaches for drug discovery.

In the meantime, the community is also trying to use neural networks or other algorithms to approach problems traditionally studied with physics-based methods, such as the one I just mentioned, the interactions between protein and ligand. That used to be mainly based on traditional physics-based methods. Now there's a lot of interest in developing neural network based approaches and different algorithms for those kinds of problems, and we are also seeing some exciting progress in that direction.

As you continue to

work on the TorchDrug project, explore the overall space of drug discovery, and build out the community around the work that you're doing, what are some of the things that you have planned for the near to medium term, and are there any particular areas of help or contribution that you're looking for?

So basically, as I said, right now TorchDrug mainly supports small molecule discovery. The really big space for therapeutic discovery is protein and antibody design, which is something that we plan to support in the near future. We already have a module for protein representation learning and protein design in our platform, but it's still under development. We hope to release that module in the near future. That's one thing.

Another one is more related to the techniques. Right now, small molecules are normally represented as graphs, and proteins are represented as sequences. In the future, we really want to go beyond the current sequence or graph representations to 3D structure representations, so we can model molecules and proteins as 3D structures and start to model their geometric features, which is really important for drug design.
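
As a small illustration of what "geometric features" can mean once a molecule has 3D coordinates, here is a minimal sketch using only the Python standard library. The helper names `pairwise_distances` and `bond_angle` and the toy coordinates are assumptions made for this example, not part of TorchDrug. It computes interatomic distances and a bond angle, two quantities that a 2D graph alone does not carry and that do not change when the whole structure is moved in space.

```python
import math


def pairwise_distances(coords):
    """Matrix of Euclidean distances between all pairs of atoms."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def bond_angle(a, b, c):
    """Angle in degrees at atom b, formed by the directions b->a and b->c."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))


# Toy three-atom structure with a right angle at the middle atom.
coords = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
distances = pairwise_distances(coords)
angle = bond_angle(*coords)

# Translating every atom by the same offset leaves the features unchanged.
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in coords]
shifted_distances = pairwise_distances(shifted)

print(distances[0][2], angle)
```

Distances and angles are a common starting point because they describe the shape itself rather than where the molecule happens to sit in the coordinate system.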

So how to combine the computational methods, I mean mainly the machine learning algorithms, with wet lab experiments in a more effective way is really important. There's huge interest in the community in using active learning algorithms for this. In machine learning, learning can be passive or active. In passive learning, you just learn from what people give you. In active learning, you learn from where you want to learn. That's a really good way to combine the computational methods, basically machine learning algorithms, with the wet lab experiments.

Based on the machine learning algorithms, we can suggest a set of drug candidates or bioassays we want to conduct, and by doing that we can obtain the kind of training data that we want. This is a way to actively decide, based on the current machine learning model, where we want to acquire knowledge. We can suggest a set of molecule candidates, drug candidates, or bioassays and let the wet labs validate those candidates to get the training data that we want. In this scenario, active learning is very important. It's a good way to effectively combine the wet lab experiments with the dry lab or computational methods.
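
The loop described here can be sketched in a few lines of plain Python. Everything in this example is a stand-in: `wet_lab_assay` fakes an experiment with a made-up formula, each candidate is reduced to a single integer descriptor, and "uncertainty" is approximated by the distance to the nearest candidate that has already been measured. A real setup would use a trained model's predictive uncertainty and actual assay results.

```python
def wet_lab_assay(x):
    """Hypothetical stand-in for a real experiment; returns a made-up activity."""
    return -(x - 7) ** 2


def uncertainty(x, labeled):
    """Proxy for model uncertainty: distance to the nearest measured candidate."""
    return min(abs(x - seen) for seen in labeled)


def active_learning(pool, initial, rounds):
    """Repeatedly pick the most uncertain candidate and send it to the 'lab'."""
    labeled = {x: wet_lab_assay(x) for x in initial}
    for _ in range(rounds):
        unlabeled = [x for x in pool if x not in labeled]
        if not unlabeled:
            break
        query = max(unlabeled, key=lambda x: uncertainty(x, labeled))
        labeled[query] = wet_lab_assay(query)  # the model chooses its own data
    return labeled


pool = list(range(11))  # candidate descriptors 0..10
results = active_learning(pool, initial=[0], rounds=3)
best = max(results, key=results.get)

print(sorted(results))  # which candidates were measured
print(best)             # best candidate found within the experiment budget
```

With a budget of three experiments, the loop spreads its queries across the pool instead of measuring neighbors of what it already knows. That is the difference between active and passive learning as described above.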

Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the rope project. It's a refactoring library for Python projects that has a lot of useful capabilities for doing ad hoc restructurings of Python projects. The API is not as accessible as I would like sometimes, but for the cases where you're trying to move a module to a different directory structure and make sure that all the imports are updated accordingly, it's one of the best ways that I've found to do that. So it's definitely worth taking a look for some of your more complicated refactorings.

And so with that, I'll pass it to you, Jian. Do you have any picks this week?

I think, in general, I hope the pandemic will be finished very soon so we can really attend conferences and see all our friends and colleagues in person.

Absolutely. Yeah, I can definitely second that sentiment. So I definitely appreciate the time you've taken today to join me and share the work that you've been doing on TorchDrug and the contributions you're making to the problem domain of drug discovery. It's definitely a very interesting project and an interesting problem domain. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day.

Thank you very much. You too.

Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
