Summary
Finding new and effective treatments for disease is a complex and time-consuming endeavor, requiring a high degree of domain knowledge and specialized equipment. Combining his expertise in machine learning and graph algorithms with his interest in drug discovery, Jian Tang created the TorchDrug project to help reduce the amount of time needed to find new candidate molecules for testing. In this episode he explains how the project is being used by machine learning researchers and biochemists to collaborate on finding effective treatments for real-world diseases.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Jian Tang about TorchDrug
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what TorchDrug is and the story behind it?
- What are the goals of the TorchDrug project?
- Who are the target users of the project?
- What are the main ways that it is being used?
- What are the challenges faced by biologists and chemists working on development and discovery of pharmaceuticals?
- What are some of the other tools/techniques that they would use (in isolation or combination with TorchDrug)?
- Can you describe how TorchDrug is implemented?
- How have you approached the design of the project and its APIs to make it accessible to engineers that don’t possess domain expertise in drug discovery research?
- How do graph structures help when modeling and experimenting with chemical structures for drug discovery?
- What are the formats and sources of data that you are working with?
- What are some of the complexities/challenges that you have had to deal with to integrate with up or downstream systems to fit into the overall research process?
- Can you talk through the workflow of using TorchDrug to build and validate a model?
- What is involved in determining and codifying a goal state for the model to optimize for?
- What are the biggest open questions in the area of drug discovery and research?
- How is TorchDrug being used to assist in the exploration of those problems?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on TorchDrug?
- When is TorchDrug the wrong choice?
- What do you have planned for the future of TorchDrug?
Keep In Touch
- tangjianpku on GitHub
- @tangjianpku on Twitter
- Website
Picks
- Tobias
- Rope refactoring library
- Jian
- Attending conferences once the pandemic is over
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- TorchDrug
- Mila
- Yoshua Bengio
- AlphaFold
- Few-shot learning
- Metalearning
- PyTorch Geometric
- Deep Graph Library
- NetworKit
- graph-tool
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey, and today I'm interviewing Jian Tang about TorchDrug.
[00:01:02] Unknown:
So, Jian, can you start by introducing yourself?
Thanks very much for the invitation. I'm glad to be here. My name is Jian Tang. I'm an assistant professor at Mila, the Quebec AI Institute, which is a research institute focused on machine learning and artificial intelligence, founded by the Turing Award winner Yoshua Bengio. My main research interest is deep learning for graph-structured data, or in other words graph representation learning, graph neural networks, geometric deep learning, and knowledge graphs. I'm also very interested in applications like drug discovery and materials discovery.
And do you remember how you first got introduced to Python?
I remember I first got in touch with Python when I was a PhD student. That was back in 2010. At that time I mainly used programming languages like C++ and Java. But I knew that a lot of people were starting to use Python. Many of my friends told me that Python is very easy to use, that the community is really growing, and that there are a lot of very useful tools we can leverage. So that's why I got interested in Python and started to learn it.
[00:02:15] Unknown:
And so one of your particular areas of focus right now is the TorchDrug project, which you mentioned is being used for some drug discovery use cases, and it seems to be a fairly well-polished project. I'm wondering if you can give a bit of an overview of what you're building with TorchDrug, some of the story behind how it got started, and why you decided to focus your efforts on that particular area.
[00:02:38] Unknown:
Right. So maybe I can give a little bit of my research background. As I just said, my research focuses on deep learning for graph-structured data. But in the first place, when I was a PhD student, I was not working on deep learning. I was working on machine learning, but more on classical machine learning models, like statistical topic models. Then back in 2012 or 2013, which was the last year of my PhD, deep learning was starting to get very hot. I was really interested in getting into deep learning and wanted to move to it. But at that time, deep learning was mainly applied to areas like computer vision, speech recognition, and natural language processing. So I thought, okay, I want to work on deep learning, but I may not want to work on those applications, because I'm not as good as those people. Maybe I can find a new area, a new domain, and apply deep learning techniques there. I was always interested in graph data, and I was very active in the data mining community.
At that time, people in the data mining community were really interested in networks or graphs, because people were very interested in analyzing Twitter data. On Twitter we have a lot of networks, right? Follower and followee social network data. So I was very interested in graph data and network data. But people in the data mining community didn't know deep learning, and people who worked on deep learning didn't know graphs. They worked on computer vision, natural language processing, or speech recognition. That's where I found a big opportunity: trying to connect the two communities and start working on deep learning for graph-structured data, or what we can call graph representation learning. That was back in 2012, the last year of my PhD. After my PhD, I worked at Microsoft Research and continued to work in this area, because I felt there was huge potential.
After working for two years at Microsoft Research, I decided I wanted to go back to academia. That's why I quit my job at Microsoft Research and did a postdoc at the University of Michigan and also Carnegie Mellon University. Afterwards, I found the faculty position at Mila, where I am now. When I started my faculty position at Mila in Canada, around 2018, I had already worked on graph representation learning for about four years. And I was wondering what the future of graph representation learning would be. Applying it to data like social networks is exciting, but I felt we could find something more interesting, so I started to think about what the key application for graph representation learning could be. I realized that it is in biology and drug discovery.
There we have a lot of graph-structured data. We have a lot of graphs: molecules and proteins can be represented as graphs. In biology we also have many different networks, like protein interaction networks and networks of interactions between proteins and drugs. Also, when I was in high school, our teacher told us that the 21st century is going to be the century of biology. In the bottom of my heart, I was very interested in biology. Of course later, because I work on AI, we don't believe the 21st century will be the century of biology. We believe it's going to be the century of AI, right? But I feel like maybe the best domain I can work on is the intersection of AI and biology. That's why I got very interested in working on AI for drug discovery. My group has been focused on AI for drug discovery for the past four years, mainly developing graph machine learning techniques for drug discovery. TorchDrug really assembles all our previous efforts on different drug discovery tasks into a unified system. It covers many fundamental tasks of drug discovery.
For example, take de novo molecule design, which means we want to develop a totally new molecule. To do that, there are a few very fundamental tasks. First, given a compound or molecule, we want to predict its chemical properties or biological activity. That's one very fundamental task. The second one is de novo molecule design and optimization. Basically, we want to search for molecules with desired properties: the molecule can bind to a protein target, it can be synthesized, or it has low toxicity. And once we are able to identify a desired molecule, in practice we also have to care about whether this molecule can actually be synthesized.
This is related to the problem called retrosynthesis planning or prediction. That means we want to find a set of molecules, or reactants, from which we can synthesize the target molecule, the molecular structure we found in the second problem I just described. So all these very fundamental tasks are related to de novo molecule design, which means designing a new molecular structure. Besides designing a new molecule, another solution for drug discovery is called drug repurposing. Instead of trying to design a new molecule, a new drug, we can take some of the existing approved drugs that are used to treat other diseases and repurpose them to treat a new disease. Take COVID-19 as an example.
Developing a new drug takes a really long time. It's almost impossible to quickly develop a new drug. So a more practical solution is to try to repurpose some of the existing drugs, for example the drugs that were developed for treating SARS. Because COVID is very similar to SARS, maybe the drugs that could be used to treat SARS could be used to treat COVID as well. So that's the second type of drug discovery solution: repurposing existing approved drugs to treat a new disease, which could be much faster.
So that's basically a quick introduction to TorchDrug, and also how I got into this journey.
[00:09:06] Unknown:
It sounds like some of the primary goals of the project are molecule design and discovery, and being able to make the search space tractable in a reasonable amount of time. If you're trying to do that by hand, or by actually synthesizing the molecules and testing them out, it's going to take a lot more time and expense. And so I'm wondering who the target users of the project are, and who the targets are in terms of the outputs of what the project is able to create.
[00:09:39] Unknown:
So, basically, the goal of creating this platform is to accelerate the progress of AI for drug discovery. Our goal is really to build an open source community for AI for drug discovery. I think there could be two types of users. The first type is scientists, especially people working on machine learning and AI. Right now in the AI and machine learning community there's a huge interest in working on drug discovery, but people may not have the right knowledge. They don't know which tasks they can work on, or which tasks are the most important in this domain. So in TorchDrug we benchmark a bunch of different important tasks in this domain, like the ones I just mentioned. For each task, we provide some public datasets and also the evaluation metrics. We try to minimize the domain knowledge users need, so people working on machine learning can mainly work on designing new machine learning algorithms. So this is the first type of audience: people working on AI or machine learning. If they want to work on drug discovery, they know what they can work on. They know the important problems, and they can focus on developing new algorithms.
So this is the first type of audience. The second group is, of course, people working on drug discovery. Now we have this open source platform, which already implements the state-of-the-art machine learning algorithms for drug discovery. So if they have data, they can leverage the latest machine learning techniques for their problems. This could be used by academic researchers working on AI for drug discovery and also by industry practitioners. For example, right now we have many AI for drug discovery startups, and many of them may be able to leverage our efforts.
[00:11:37] Unknown:
And then as far as the challenges and complexities that are faced by people who are working on the scientific aspects of this domain, biologists and chemists and people working for pharmaceutical companies, what are some of the challenges and complexities that they face in doing drug discovery with the existing state of the art in the absence of AI, and what are some of the new capabilities that this might unlock for them?
[00:12:03] Unknown:
I somewhat mentioned this already in the previous questions. For people working on artificial intelligence or machine learning, the main challenge is: wow, it's a very exciting field, but I don't know which problem I should work on. TorchDrug basically provides a set of the most important tasks in this domain, so they know what they can work on. It's the same case for biologists or chemists. They have good knowledge about drug discovery, but they are not so familiar with the latest advances in machine learning models. They don't know which model they should use for their specific task. TorchDrug implements the state-of-the-art machine learning algorithms or models for their tasks. I think another good thing about TorchDrug is that it supports different tasks, and it also covers different types of data, as I mentioned in the beginning. Right now TorchDrug supports analyzing two types of data. One is small molecules, which are mainly of interest to chemists.
We are also able to analyze biomedical knowledge graphs, which mainly capture the interactions between different biological entities, for example the interactions between drugs, proteins, and diseases. These complicated interactions are mainly of interest to biologists. TorchDrug provides a solution to analyze both the small molecules, meaning the molecular structure, and the biological networks, which are usually analyzed or processed separately by chemists and biologists.
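As an illustration of the two kinds of data described above, the following is a minimal plain-Python sketch rather than TorchDrug's own API. It stores a small molecule as a graph of atoms and bonds, and a biomedical knowledge graph as (head, relation, tail) triples. The entity names (`drug_A`, `protein_X`, `disease_1`), the relation names, and both helper functions are hypothetical placeholders, not real data or library calls.

```python
# A small molecule as a graph: atoms are nodes, bonds are edges.
# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens are left implicit.
atoms = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]


def build_adjacency(num_atoms, bonds):
    """Build an undirected adjacency list from a list of bonds."""
    adjacency = {i: [] for i in range(num_atoms)}
    for a, b, _bond_type in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


molecule_adjacency = build_adjacency(len(atoms), bonds)

# A biomedical knowledge graph as (head, relation, tail) triples.
# All names below are illustrative placeholders (assumptions), not real data.
triples = [
    ("drug_A", "binds", "protein_X"),
    ("drug_B", "binds", "protein_X"),
    ("drug_C", "binds", "protein_Y"),
    ("protein_X", "associated_with", "disease_1"),
]


def drugs_linked_to(disease, triples):
    """Return drugs that bind any protein associated with the given disease."""
    proteins = {
        head for head, relation, tail in triples
        if relation == "associated_with" and tail == disease
    }
    return sorted(
        head for head, relation, tail in triples
        if relation == "binds" and tail in proteins
    )


candidates = drugs_linked_to("disease_1", triples)
```

The molecule graph has a handful of nodes, while a real knowledge graph of this shape can hold millions of entities, which is the size gap the interview goes on to describe. The two-hop lookup in `drugs_linked_to` is the kind of question a drug repurposing analysis asks of such a graph.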
[00:13:40] Unknown:
Also, in terms of the collaboration opportunities that it opens up for chemists and biologists in the research arena, I'm wondering how some of these AI technologies are able to speed up the process and simplify some of the communication between people who are working in these different fields and at the different stages of the workflow of discovering these drugs.
[00:14:04] Unknown:
TorchDrug right now is able to support analyzing both types of data. For example, to understand a molecule, we can analyze it based on its chemical structure, which is mainly done by chemists. We can also understand a molecule according to its interactions with other entities, for example with proteins or diseases, which is what biologists mainly care about. TorchDrug allows us to support the analysis of both types of data simultaneously.
[00:14:35] Unknown:
In terms of the actual implementation of the TorchDrug project, can you talk through some of the ways that it's designed and some of the architecture and software development aspects of it?
[00:14:56] Unknown:
As I said in the beginning, TorchDrug mainly supports the analysis of small molecules and also biomedical knowledge graphs. Both types of data are graph data, but they are quite different. Small molecules are small graphs, which usually have around tens of atoms, or tens of nodes. But a biomedical knowledge graph can be very large. It involves the relationships between different kinds of drugs, proteins, and diseases, and the graph could have millions of nodes. From the beginning, we tried to design a system that could process both types of graph: small graphs like small molecules as well as big graphs. That's the key thing we tried to consider when we first designed the system.
[00:15:33] Unknown:
As you have been exploring the overall problem space and getting more understanding of what's involved in some of the scientific aspects of drug discovery, and some of the ways that it's being used by scientists and by the AI community, how has that helped to drive the particular focus of the TorchDrug project and some of the priorities as far as directions that you might want to explore with it?
[00:16:09] Unknown:
Right now TorchDrug mainly focuses on small molecules, I mean drug discovery based on small molecules. But recently, starting from last year and this year, there's a huge interest in the AI and biology communities in developing big molecules for drug discovery, for example proteins, like peptides or antibodies. So in the future we also plan to support big molecule discovery, like protein design and protein representation learning. That's basically what we are doing now and what we plan to further support in the future through TorchDrug. Recently there was the breakthrough by DeepMind on protein structure prediction with AlphaFold 2. That has really brought a lot of attention to proteins: protein understanding, protein representation learning, and protein design. That's definitely something we want to support in the next release.
[00:17:03] Unknown:
In terms of the assumptions that you had going into this particular problem space of drug discovery, given your understanding of network datasets and graph learning, what are some of the assumptions that have been challenged, or some of the ideas you had about the potential approach or capabilities that have had to be adjusted or reconsidered as you've progressed further through the implementation and engagement with the community?
[00:17:34] Unknown:
So far, we usually represent small molecules as graphs, and we usually represent proteins as sequences, basically sequences of amino acids. But there's a limitation there, because for molecules, either small molecules or proteins, a more natural representation is their 3D structure. Molecules are naturally 3D structures. That's a big limitation for graph-based methods, because graph-based methods can only capture the relationships between the atoms. They cannot capture the geometric features of those molecules, which are very important for drug design. For example, if you want to design a small molecule, ideally the shape of the small molecule should be complementary to the shape of the protein surface. To model that, it's important to model their 3D structures instead of relying only on their 2D graph representation. And there's also a trend in the community: for molecule or protein modeling, we are now going beyond 2D graph representation learning to 3D structure modeling.
[00:18:55] Unknown:
In terms of the graph characteristics of these molecular structures, I'm wondering what are some of the complexities that are inherent to the problem domain. If you're doing some graph research, then maybe you're able to enforce that the graph is acyclic in nature, whereas with chemical structures and chemical compounds, you want to have certain cycles involved. So I'm interested in being able to understand the molecular composition from a network perspective, and some of the interesting problems that brings into the overall discovery and research space.
[00:19:32] Unknown:
Right. So maybe I can mention a few important complexities or challenges. The first challenge is related to the amount of labeled data. If we want to build a machine learning model, it's important to have enough training data. For example, in computer vision, natural language processing, and speech recognition, we have a lot of labeled data. But in drug discovery we actually don't have much labeled data. That's in general a big challenge. It means we want to develop models that are able to learn from very few labeled examples. Of course, this is already actively studied in the machine learning community, where it is known as few-shot learning or meta-learning. But these kinds of techniques are particularly important in drug discovery, because in this domain the amount of labeled data is really limited, and it's very expensive and time-consuming to obtain labeled data. So this is the first challenge for molecule modeling in drug discovery. The second challenge, I think, is that when we try to model molecules, we have to take the symmetry of the molecule into consideration.
If we represent a molecule as a 3D structure, then the representation of that 3D molecular structure should be rotation and translation invariant. When you develop your model, that's something you have to take into consideration. If you rotate or translate your molecule, some of its properties, for example the energy of the molecule, should remain invariant. So that's one type of symmetry we want to consider, which is invariance. But there's also another property, which is called equivariance. For example, given a molecule, if we want to predict the forces on each atom, then the forces on the atoms should be rotation and translation equivariant.
What does that mean? If you have a 3D molecular structure and you rotate your molecule, then the forces on the atoms should be rotated accordingly. This is called rotation equivariance. When you try to build a model, that's something you need to take into consideration. This is so far a very active research area in molecule modeling.
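The two symmetry properties can be checked numerically. The sketch below is an illustration under stated assumptions, not a model from TorchDrug: the pairwise spring energy, the coordinates, and every function name are invented for the example. It shows that the energy is unchanged when the molecule is rotated and translated (invariance), while the per-atom forces rotate together with the molecule (equivariance).

```python
import math


def rotate_z(vector, angle):
    """Rotate a 3D vector around the z axis by `angle` radians."""
    x, y, z = vector
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def transform(points, angle, shift):
    """Rotate every point around z, then translate it by `shift`."""
    return [
        tuple(p + t for p, t in zip(rotate_z(point, angle), shift))
        for point in points
    ]


def energy(points):
    """Toy energy (assumption): a spring of rest length 1 between every atom pair."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distance = math.dist(points[i], points[j])
            total += 0.5 * (distance - 1.0) ** 2
    return total


def forces(points):
    """Force on each atom: the negative gradient of the toy energy."""
    result = []
    for i, pi in enumerate(points):
        force = [0.0, 0.0, 0.0]
        for j, pj in enumerate(points):
            if i == j:
                continue
            distance = math.dist(pi, pj)
            for k in range(3):
                force[k] -= (distance - 1.0) * (pi[k] - pj[k]) / distance
        result.append(tuple(force))
    return result


# Hypothetical coordinates for a three-atom "molecule".
coords = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.3, 0.9, 0.4)]
angle, shift = 0.7, (1.0, -2.0, 0.5)
moved = transform(coords, angle, shift)

# Invariance: the scalar energy does not change.
energy_matches = math.isclose(energy(coords), energy(moved), abs_tol=1e-9)

# Equivariance: forces on the moved molecule equal the rotated original forces.
rotated_forces = [rotate_z(f, angle) for f in forces(coords)]
forces_match = all(
    math.isclose(got, want, abs_tol=1e-9)
    for moved_force, rotated in zip(forces(moved), rotated_forces)
    for got, want in zip(moved_force, rotated)
)
```

Because the toy energy depends only on interatomic distances, both properties hold by construction. A learned model has no such guarantee unless its architecture or its input features are built to respect these symmetries, which is the design constraint discussed above.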
[00:21:46] Unknown:
Another interesting aspect of the TorchDrug project is the fact that you are trying to make it accessible and understandable to two different communities: the scientists, meaning chemists and biologists, and the AI community, who are familiar with deep learning concepts. I'm wondering what your approach has been as far as how to design the APIs of the project and the documentation and tutorials to make it accessible to these different communities with their varying backgrounds.
[00:22:20] Unknown:
That's a really good question, and I would like to thank my students for these efforts. In the beginning, we first identified a set of important tasks in drug discovery. Then for each task, we provide APIs to some important public datasets, so the users don't need to deal with the data preprocessing. For example, people working in machine learning may not have the experience to preprocess molecules. We have APIs to load the dataset and preprocess it, so the users do not need to take care of the data preprocessing stage. Then we provide a bunch of different state-of-the-art algorithms, which are trained on the public datasets. At the same time, we also provide standard evaluation metrics. Once you are able to train your model, we have some standard modules to evaluate the performance of your algorithms. So basically, we provide some open public datasets and also standard evaluation methods, so that people working in machine learning don't need to worry about the data preprocessing or the evaluation. Eventually, all the data will be represented as graphs, so they can focus only on the machine learning side: basically, how to develop machine learning algorithms to analyze graph data.
[00:23:33] Unknown:
In terms of the workflow of somebody who's using TorchDrug to build a model, or to do some of these search space optimization problems like de novo molecule design or the retrosynthesis capabilities, I'm wondering if you can talk through the overall workflow of getting started with TorchDrug, setting up the problem design, figuring out what the goal states are that you're optimizing for, and then executing and iterating on the experiment.
[00:24:15] Unknown:
So usually, let's say we have a protein target. Then maybe, through some wet-lab experiments, we can have some initial training data. We know some molecules that are potential binders, basically positive molecules for the protein target. We also have some inactive molecules. In this case, we can try to predict the binding affinity for a molecule. This is related to the task of molecule property prediction in TorchDrug. We have a molecule, and we want to predict its chemical properties, for example whether it binds or not. Once we have a set of data like this, the users can train a molecule property prediction model based on the training data. And once we have the molecule property prediction model, or binding affinity prediction model, that model can be used in the molecule design and optimization task, because in that task we need to have a reward function.
Basically, trying to tell the all with them, okay, whether this molecule is good or not. Because the normal molecule design and optimization, the essential idea is, like, the algorithm, we use reinforcement learning. So the algorithm basically is trying to error. Right? So you try this molecule and then the environment, we are telling you, okay, whether this molecule is good or not. Right? Maybe the algorithm will further explore around that, like, area. If it's not good, maybe the algorithm will be able to move to its other area. So then you can see, like, during this process, the essential part is, like, we need to provide a reward function, basically, to tell the algorithm whether this molecule is good or not. So good or not, basically, it depends on, like, the type of, like, molecule property you want. So, of course, the most important 1 is usually the binding of printing whether this molecule can be bind to the protein target or not. Right? And that's why I said, like, before working on this molecule design and optimization or molecule search module, we have to train a machine learning predictor basically, to predict whether molecule has designed property or not. You know, maybe, like, whether the molecule can bind to the protein target or not. In order to do this, we usually have to have some initial training data. So we have some positive molecules, which can be bind to the protein targeted. And, also, we can have some inactive molecules, which they are not able to bind to the protein targets. So this can be done through some by acids. We don't need to have many in the beginning, right? But we can have some in the first place. And then once we have the machine learning predictor and then we can use that to search for better molecules using the molecule design and optimization module. By doing that so we are able to search for maybe a set of, like, potentially bad molecules. 
For example, we can search for 1,000 molecules that could potentially be better binders.
Okay? And then these 1,000 molecules, we can send them back to the wet lab experiments so we can test them again. By doing this, we can get better training data. That data feeds back into our machine learning predictor, because now our machine learning predictor has more training data. Right? So we can refine our machine learning predictor, and then the new machine learning predictor is sent back to the search module, and based on the new predictor maybe we can find better candidates. These new candidates will be sent back to the wet lab. So, basically, by doing this, we can form a feedback loop between the computational side and the wet lab side. And this is basically how the whole workflow functions.
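The loop described above (train a predictor, search for candidates, test them in the wet lab, retrain) can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: candidates are single numbers rather than molecules, `wet_lab_assay` is a made-up stand-in for a real experiment, the predictor is a toy nearest-neighbour lookup, and the search step is random sampling ranked by the predictor rather than the reinforcement learning mentioned in the conversation. None of these names come from TorchDrug.

```python
import random


def wet_lab_assay(x):
    """Hypothetical stand-in for a wet lab measurement (higher is better)."""
    return -(x - 0.7) ** 2


def train_predictor(data):
    """Toy predictor: return the measured score of the nearest known candidate."""
    def predict(x):
        nearest = min(data, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict


def search_candidates(predict, rng, n_proposals=200, top_k=5):
    """Toy search: sample candidates, keep those the predictor scores highest."""
    proposals = [rng.random() for _ in range(n_proposals)]
    return sorted(proposals, key=predict, reverse=True)[:top_k]


def closed_loop(n_rounds=5, seed=0):
    rng = random.Random(seed)
    # A small initial set of assayed candidates, as in the conversation.
    data = [(x, wet_lab_assay(x)) for x in (0.1, 0.5, 0.9)]
    for _ in range(n_rounds):
        predict = train_predictor(data)              # refine the predictor
        candidates = search_candidates(predict, rng)  # search with it
        # "Send candidates back to the wet lab" and add the results.
        data.extend((x, wet_lab_assay(x)) for x in candidates)
    return data


data = closed_loop()
best_x, best_score = max(data, key=lambda pair: pair[1])
print(f"{len(data)} assayed candidates, best score {best_score:.4f}")
```

Each round adds newly measured candidates to the training set, so the predictor used in the next search is fit on more data, which is the feedback the speaker describes.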
[00:27:34] Unknown:
You mentioned the AlphaFold project. And I'm wondering, given the fact that you are trying to optimize for being able to bind to these various protein targets, what sort of potential impact do you see from the availability of these new protein folding structures and the information they provide to the biological community, how that might affect the overall practice of drug discovery, and some of the ways that you might be able to integrate that information
[00:28:04] Unknown:
into the TorchDrug tool chain? Yeah. So that's a very good question. Usually for drug discovery, there are two types of approaches. One type is called the ligand-based approach, and the other is called the structure-based design approach. For ligand-based approaches, we usually don't make use of the structure of the protein target. We mainly use the ligand information. As I said, some ligands are positive and others are negative, so we can train the machine learning predictor based only on the molecule information. Right? Of course, this is not ideal, because in this case we don't leverage the 3D structure of the targeted protein. A better solution is really structure-based drug design. In this case, we leverage the 3D structure of the protein target. Basically, we're trying to find a good match between the protein target, I mean its 3D structure, and the small molecule. Okay. With AlphaFold 2, we are now able to obtain the 3D structure of almost all the human proteins very accurately. So I think it really makes structure-based drug design possible, because for any drug target we now know its 3D structure, and then we can mainly focus on developing techniques for structure-based drug design. Right? Previously, for some targets, we didn't know their 3D structures, and that was a big challenge. But now for all the targets, we know their 3D structure, and we can just focus on designing the small molecules
[00:29:29] Unknown:
or even proteins, which are able to bind to the 3D structure of the drug target. In terms of the technical aspects of the foundation that you've chosen for the TorchDrug project, I'm wondering what the decision process was and what aspects you were looking for as you were deciding what to base the overall project on, and how that ultimately led you to use PyTorch versus TensorFlow or any of the other machine learning frameworks that are available? I think that's also related to the evolution of community development between TensorFlow and PyTorch. In the beginning, TensorFlow was very popular, basically dominating the whole community. Right? But then after PyTorch was released, I think the scientific community is quickly
[00:30:16] Unknown:
adopting PyTorch because it's easy to use. And I think with PyTorch we are able to rapidly prototype a research idea. That's why, in the scientific community, PyTorch is really gaining increasing popularity. And, actually, the decision was not that difficult to make, because in our group basically all my students build their neural networks with PyTorch. So it was not a very difficult decision. In the PyTorch ecosystem, there are already quite a few frameworks that are able to deal with graphs, which matters because our techniques are mainly based on graphs. For example, PyTorch Geometric, which is a PyTorch library specifically designed for graphs. Okay? And also DGL, the Deep Graph Library, which is very popular in the graph machine learning community. Okay? But remember that those graph-based PyTorch libraries are mainly designed for handling arbitrary graphs, not only molecule graphs. They are designed to support any kind of graph, like social networks. They can deal with molecules as well. But for us, we really want to design a specific library for drug discovery, for handling molecules, proteins, and maybe biomedical knowledge graphs. Also, molecules can be represented as graphs, but there are actually some specific operations which are very unique to molecules.
So that's why, in our framework, we actually designed some specific operators for dealing with molecule graphs. And now we are also extending that to deal with proteins as well as 3D structures. As I said in the beginning, we are still trying to move from 2D graphs to 3D structures, which is actually missing in many existing graph-based libraries. And now we are also trying to deal with different kinds of operations on 3D structures.
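To make the point about molecule-specific operations concrete, here is a minimal sketch of a molecule stored as a graph, with one operation that only makes sense for molecules: counting implicit hydrogens from typical valences. The `MoleculeGraph` class, its methods, and the small valence table are hypothetical and written for illustration only; they are not TorchDrug's data structures or API.

```python
from dataclasses import dataclass

# Typical valences for a few elements: an illustrative table, not exhaustive.
TYPICAL_VALENCE = {"C": 4, "N": 3, "O": 2}


@dataclass
class MoleculeGraph:
    atoms: list  # element symbol for each node
    bonds: list  # (i, j, bond_order) for each edge

    def neighbors(self, i):
        """Generic graph operation: indices of atoms bonded to atom i."""
        out = []
        for a, b, _ in self.bonds:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return out

    def implicit_hydrogens(self, i):
        """Molecule-specific operation: valence not used by explicit bonds."""
        used = sum(order for a, b, order in self.bonds if i in (a, b))
        return TYPICAL_VALENCE[self.atoms[i]] - used


# Ethanol (SMILES: CCO) with heavy atoms only.
ethanol = MoleculeGraph(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])
hydrogens = [ethanol.implicit_hydrogens(i) for i in range(len(ethanol.atoms))]
print(hydrogens)
```

A general-purpose graph library would provide something like `neighbors`, while chemistry-aware operations such as `implicit_hydrogens` are the kind of thing a domain-specific library has to add.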
[00:32:06] Unknown:
To the point of the graph aspects of it, I'm wondering if targeting small molecules helps to cut down on some of the computational cost and complexity because the size of the networks is relatively small. And if there has been any
[00:32:24] Unknown:
specific to the graph library aspect of it, if there have been any unique capabilities in NetworkX that you're leaning on, or if you've also considered using or have experimented with some of the projects such as NetworKit or GraphLab? We mainly focus on developing deep learning algorithms for graph representation learning, graph neural networks, or knowledge graph embedding techniques. There are, of course, some traditional graph-based or network-based analysis algorithms, but those are mainly traditional network analysis software. For example, they calculate PageRank or clustering coefficients. Those are not the type of techniques we want to use for molecule modeling, because we mainly want deep learning techniques such as graph neural networks, graph representation learning, or geometric deep learning. These are new types of techniques that we want to develop for molecule modeling, which are quite different from traditional graph analysis or network analysis software.
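For readers unfamiliar with graph neural networks, the core idea is that each node repeatedly updates its feature vector from its neighbors. The sketch below shows one such round with plain mean aggregation in pure Python. It is a simplification under stated assumptions: a real graph neural network layer would also apply learned weights and a nonlinearity, and the function name and toy features here are made up for illustration.

```python
def message_passing_step(features, edges):
    """One round of mean aggregation over each node's neighbors and itself."""
    n = len(features)
    # Every node starts with a self-loop so it keeps part of its own signal.
    neighbors = {i: [i] for i in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = []
    for i in range(n):
        group = [features[j] for j in neighbors[i]]
        dim = len(features[i])
        updated.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return updated


# Three atoms in a chain; features are toy one-hot vectors (carbon, oxygen).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
updated = message_passing_step(features, edges)
print(updated)
```

After one round, the middle atom's vector mixes in information from the oxygen next to it, which is the kind of learned, local representation that statistics such as PageRank do not provide.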
[00:33:17] Unknown:
Yeah. In terms of the overall space of drug discovery research and some of the complexities that are there, I'm wondering what are some of the biggest open questions in that area and some of the ways that you're hoping to be able to apply the capabilities of TorchDrug to simplify or speed up some of that research.
[00:33:39] Unknown:
So in general, for drug discovery, the really fundamental question is to understand the interactions between, let's say, protein and ligand. Right? And also the interactions between protein and protein, which is the key to drug design. Let's say you have a protein target. If you want to design a small molecule or a protein to bind to the protein target, then it's really important to understand the interactions, either between protein and ligand or between protein and protein. I think understanding the interactions between the two is really the key to drug design in the future. It's also very challenging compared to other problems. As we discussed, we have now made huge progress in protein structure prediction with the AlphaFold 2 system. Right?
But it's still a really challenging question to model the interactions between proteins, because the amount of labeled data is just very limited in protein-protein interaction or in protein-ligand design. For protein structure prediction, we still have a lot of labeled data or training data. That's actually one of the key reasons why AlphaFold 2 is very successful: we have a lot of training data to train a model. But for protein-protein interaction or protein-ligand interaction, we don't have so much labeled data. So it's going to be a very challenging problem. Okay? I think in the future, if we can make some significant progress on either of these two problems, we can really make a huge advancement in drug design. So in the future, we will try to benchmark the two tasks. One is protein-ligand interactions, which mainly support small molecule discovery, and the other is protein-protein interactions, which mainly support protein design. We want to benchmark the two tasks and also try to push forward the progress on these two fundamental, and really key, tasks in drug discovery.
[00:35:36] Unknown:
And so in terms of your experience of building the TorchDrug project and exploring the overall space of drug discovery and digging into the research, I'm wondering what are some of the most interesting or unexpected or challenging lessons that you've learned in the process. I think the most challenging thing is really that the problem itself is multidisciplinary. Right? You need to have expertise
[00:36:00] Unknown:
from both machine learning and drug discovery. I think that's really where the most challenging things are. We also have a lot of collaborators on the biomedical side. Communication is always difficult in the beginning because we speak different languages. Right? And we have totally different mindsets. So there's a lot of complexity there in terms of communication. But with more and more communication, things are getting better. I think that's in general the most challenging thing, especially for people working on AI, because we don't know what the most important problems in this domain are, and we don't know where we can get useful datasets. Right? Another thing is that building a system like TorchDrug is basically the same problem. Right? We are mainly machine learning researchers focused on developing new algorithms.
And TorchDrug is a machine learning system, so it's not enough for people to have only algorithm development skills to build a good machine learning system. Fortunately, in my group, we have a student who understands the machine learning algorithms and also has very strong system development capability. Right? In general, I feel like in this domain the most challenging thing is always that the problem is really multidisciplinary. We need to have expertise from machine learning, from chemistry, from physics. Right? And even within machine learning, we have algorithm development and system development, which are also quite different.
[00:37:33] Unknown:
Yeah. It's definitely interesting, some of the human problems that crop up when you're trying to put together two different groups of people who have different interests but a combined goal, and the shared vocabulary that you have to figure out. Each domain has its specific ways of discussing things, but they don't mesh well together until you're able to get everyone talking together and establish a new vocabulary.
[00:37:59] Unknown:
Right. But I think in general, it's important to have a really open mindset, to appreciate the efforts made in other domains, and also to be very open to learning from them. Yeah. It's also important to be a quick learner, because many things are totally new for us. Pretty much all the biological knowledge is totally new to us, so we have to learn from scratch. So I think it's really important to be a quick learner as well. Absolutely. And so for people who are interested
[00:38:30] Unknown:
in exploring the space of drug discovery or actually working in that field, what are some of the cases where TorchDrug might be the wrong choice and they're better served with the established experimental practices for drug discovery or, you know, using different computational approaches?
[00:38:48] Unknown:
For drug discovery, of course, machine learning is the trending technique right now. But before machine learning, there were also some traditional approaches to drug discovery, for example, those based on traditional physics-based methods. For now, in TorchDrug, we don't support those techniques. Right? We mainly support machine learning based techniques for drug discovery. So in that case, I think it may not be the right tool if you want to explore more traditional physics-based approaches for drug discovery. Okay. In the meantime, the community is also trying to use neural networks and different algorithms to approach problems that were traditionally studied with physics-based methods, such as the one I just mentioned, the interactions between protein and ligand. That used to be based mainly on traditional physics-based methods. Now there's a lot of interest in developing neural network based approaches and different algorithms for those kinds of problems, and we also see some exciting progress along that direction. As you continue to
[00:39:48] Unknown:
work on the TorchDrug project and continue to explore the overall space of drug discovery and build out the community around the work that you're doing, what are some of the things that you have planned for the near to medium term and any particular areas of help or contribution that you're looking for? So basically, as I said, right now TorchDrug mainly supports small molecule discovery. Now, the really big space for therapeutic discovery is really
[00:40:13] Unknown:
protein and antibody design, which is something that we plan to support in the near future. Right now, we already have a module called protein representation learning, or protein design, in our platform, but it's under development. We hope to release that module in the near future. That is one thing. Another one is more related to the techniques. Right now, small molecules are normally represented as graphs, and proteins are mainly represented as sequences. In the future, we really want to go beyond the current sequence or graph representation to a 3D structure representation.
So we can model molecules and proteins as 3D structures, and then we can start to model their geometric features, which is really important for drug design. Another thing is how to combine the computational methods, I mean, mainly the machine learning algorithms, with wet lab experiments in a more effective way, which is really important. Right? In this case, there's actually huge interest in the community in using active learning algorithms. In machine learning, there is passive learning and active learning. Passive, basically, you just learn from what people give you. Active, basically, you learn from where you want to learn. Right? So that's actually a really good way to combine the computational methods, basically the machine learning algorithms, with the wet lab experiments.
Based on the machine learning algorithms, we can suggest a set of drug candidates or bioassays we want to conduct. By doing that, we can obtain the kind of training data that we want. Right? This is a way to actively decide, based on the current machine learning model, where we want to acquire knowledge. We can suggest a set of molecule candidates, drug candidates, or bioassays, let the wet labs validate those molecule candidates, and get the training data that we want. So in this scenario, active learning is very important. It's a good way to really effectively combine the wet lab experiments with the dry lab, or computational, methods.
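One common way to decide "where we want to acquire knowledge" is uncertainty sampling: send to the wet lab the candidates the current model is least sure about. The sketch below assumes a binary binder/non-binder model that outputs a probability. The speaker does not say which acquisition strategy is used, so this choice, the `select_for_assay` function, and the example scores are all assumptions made for illustration.

```python
def select_for_assay(pool, predict_proba, batch_size):
    """Pick the candidates whose predicted binding probability is closest to 0.5."""
    return sorted(pool, key=lambda mol: abs(predict_proba(mol) - 0.5))[:batch_size]


# Hypothetical model outputs for five unlabeled candidates.
scores = {
    "mol_a": 0.95,  # confidently predicted binder
    "mol_b": 0.52,  # model is unsure
    "mol_c": 0.10,  # confidently predicted non-binder
    "mol_d": 0.45,  # model is unsure
    "mol_e": 0.70,
}

chosen = select_for_assay(list(scores), scores.get, batch_size=2)
print(chosen)
```

The selected candidates would then be assayed, and their measured labels added to the training set before the model is retrained, in contrast to passive learning where the training data is whatever happens to be available.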
[00:42:28] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose the rope project. It's a refactoring library for Python projects that has a lot of useful capabilities for doing ad hoc restructurings of Python projects. The API is not as accessible as I would like sometimes, but for the cases where you're trying to, you know, move a module to a different directory structure and make sure that all the imports are updated accordingly, it's one of the best ways that I've found to do that. So it's definitely worth taking a look for some of your more complicated refactorings.
[00:43:10] Unknown:
And so with that, I'll pass it to you, Jian. Do you have any picks this week? I think, in general, I hope the pandemic will be finished very soon so we can really attend conferences and see all our friends and colleagues physically.
[00:43:23] Unknown:
Absolutely. Yeah. I can definitely second that sentiment. So I definitely appreciate the time you've taken today to join me and share the work that you've been doing on TorchDrug and the contributions you're making to the problem domain of drug discovery. It's definitely a very interesting project and an interesting problem domain. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you very much. You too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Jian Tang's Background and Research Interests
Overview of TorchDrug Project
Target Users and Applications of TorchDrug
Implementation and Architecture of TorchDrug
Workflow and Use Cases of TorchDrug
Technical Foundations and Framework Choices
Challenges and Future Directions in Drug Discovery
Lessons Learned and Multidisciplinary Collaboration
Limitations and Alternative Approaches
Future Plans and Contributions Needed
Closing Remarks and Picks