Summary
Computers are excellent at following detailed instructions, but they have no capacity for understanding the information that they work with. Knowledge graphs are a way to approximate that capability by building connections between elements of data, allowing us to discover relationships among disparate information sources that were previously unknown. In our day-to-day work we encounter many instances of knowledge graphs, but building them has long been a difficult endeavor. In order to make this technology more accessible Tom Grek built Zincbase. In this episode he explains his motivations for starting the project, how he uses it in his daily work, and how you can use it to create your own knowledge engine and begin discovering new insights of your own.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Tom Grek about knowledge graphs, when they’re useful, and his project Zincbase that makes them easier to build
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what a knowledge graph is and some of the ways that they are used?
- How did you first get involved in the space of knowledge graphs?
- You have built the Zincbase project for building and querying knowledge graphs. What was your motivation for creating this project and what are some of the other tools that are available to perform similar tasks?
- Can you describe how Zincbase is implemented and some of the ways that it has evolved since you first began working on it?
- What are some of the assumptions that you had at the outset of the project which have been challenged or updated in the process of working on and with it?
- What are some of the common challenges when building or using knowledge graphs?
- How has the domain of knowledge graphs changed in recent years as new approaches to entity resolution and data processing have been introduced?
- Can you talk through a use case and workflow for using Zincbase to design and populate a knowledge graph?
- What are some of the ways that you are using Zincbase in your own projects?
- What have you found to be the most challenging/interesting/unexpected lessons that you have learned in the process of building and maintaining Zincbase?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Tom
Links
- Zincbase
- Commodore 64
- Electronic Engineering
- Artificial Intelligence
- Primer.ai
- Artificial General Intelligence
- MATLAB
- IPython
- NumPy
- Excel
- Jupyter
- Pandas
- Knowledge Graph
- The Matrix
- Keanu Reeves
- Ontology
- Semantic Web
- Word2Vec
- SPARQL
- Neo4j
- Graph Database
- AWS Neptune
- PostgreSQL
- Dask
- BBC Micro
- BASIC
- Prolog
- NLP
- ELMo
- BERT
- GPT-2
- Winograd Schema Challenge
- PyTorch BigGraph
- AmpliGraph
- SpaCy
- AI Winter
- PyTorch
- scikit-learn
- NetworkX
- SciPy
- CircleCI
- Read The Docs
- Project Gutenberg
- AllenNLP
- Doctest
- Reinforcement Learning
- Metacognition
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your CI/CD pipelines, they just launched dedicated CPU instances. They've also got worldwide data centers, including a new one in Toronto and one opening in Mumbai at the end of the year. So go to pythonpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to pythonpodcast.com/angel today to sign up. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host as usual is Tobias Macey. And today, I'm interviewing Tom Grek about knowledge graphs, when they're useful, and his project Zincbase that makes them easier to build. So, Tom, can you start by introducing yourself?
[00:02:05] Unknown:
Sure. I'll give you my nerd resume. So, I started coding way back when I was 5 years old on a Commodore 64. I studied electronic engineering at college. I got a master's in AI back when it was fringe and didn't really work. I've had a failed startup. I've worked all over the world. I've worked in big enterprise. I've freelanced. I've worked for the UK government's secret labs, and I'm now working as an engineer at Primer.ai, an NLP startup in San Francisco. And, just to frame the rest of this discussion, I do wanna point out that I'm a true believer in AGI, artificial general intelligence, which I think could solve all of the world's problems within our lifetime. And
[00:02:54] Unknown:
do you remember how you first got introduced to Python? I do actually.
[00:02:58] Unknown:
It was around 2011 or so. I was fed up with using MATLAB, and a friend suggested that I try IPython and NumPy. And then I discovered the whole SciPy ecosystem, and it was a love story thereafter. I think, you know, at one point in particular, I was making a lot of spreadsheets and doing financial modeling. And I switched from Excel to Jupyter and Pandas, and
[00:03:26] Unknown:
my life got a lot better at that point. I don't think I'd ever say I'm a good Pythonic Python coder, though. You know, that's a big claim. But I am a big fan of the language. Yeah. That's one of the beautiful things about it, is that you can take your own style to it. There are these vague references to things being Pythonic, but in some ways you can kind of cast that however you feel, as long as whatever you're doing with the language fits your way of thinking and is effective for the goals that you're trying to achieve. Right. And all the time, I'm learning new things about it as well and discovering things hidden in the standard library. It's fantastic. And the ecosystem and the people involved with it are the best. So before we get too deep into your project, can you start by giving an explanation of what a knowledge graph is and some of the use cases that they enable and ways that they're used that people may have come in contact with? Right. Sure.
[00:04:18] Unknown:
So in a knowledge graph, or more generically, I guess, a knowledge base, you have nodes, which are entities, so people, places, things, companies, and edges between them, which capture different relation types. So, for example, Keanu Reeves is a node. The Matrix is a node, and an edge between them could be acted in. And this fact, Keanu Reeves acted in The Matrix, has three components: a subject, a predicate or relation, and an object. And together, we call that a triple. So the core goal here is to be able to ask the knowledge base about some made up triple.
What is the probability that this triple is true? So in my project Zincbase, we include a toy dataset of triples relating to countries. For example, United States borders Mexico, France borders Spain, and so on. The knowledge graph can learn this structure. So say you've input the facts that Kenya borders Uganda and Tanzania borders Uganda, then what's the probability that Kenya borders Tanzania? Right. And if an unknown entity comes up that wasn't in your training set, you know, that's not in the knowledge base, well, Zincbase can handle that. Well, in general, knowledge graphs can handle it. So say we never trained on Mozambique, but later we find out that Mozambique borders Tanzania. Now we can ask the knowledge base whether or not Mozambique borders Uganda, and it can make that inference, put a probability on it, and it can tell you why it made that prediction. So it's fairly common to bootstrap a knowledge graph with Wikipedia and Wikidata. The nodes would be anything with a Wikipedia page, and the edges between them are essentially, you know, similar to the hyperlinks between web pages. And you would view Wikidata as the ontology behind it, supporting all of that.
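To make the triple and link-prediction idea concrete, here is a minimal sketch that stores a few "borders" triples in a NetworkX graph and scores an unseen edge with a classical neighbourhood-overlap heuristic. This is not Zincbase's actual embedding-based approach, just a stand-in for the kind of inference described above.

```python
# Minimal sketch (not Zincbase's API): triples in a NetworkX graph, plus a
# crude link-prediction score for an unseen edge.
import networkx as nx

triples = [
    ("united_states", "borders", "mexico"),
    ("france", "borders", "spain"),
    ("kenya", "borders", "uganda"),
    ("tanzania", "borders", "uganda"),
]

g = nx.Graph()  # undirected, since "borders" is symmetric
for subj, _pred, obj in triples:
    g.add_edge(subj, obj)

# How plausible is the unseen triple (kenya, borders, tanzania)?
for u, v, score in nx.jaccard_coefficient(g, [("kenya", "tanzania")]):
    print(f"{u} borders {v}? score={score:.2f}")  # shared neighbour: uganda
```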
And you can query Wikipedia as a knowledge graph. And if you type some question to Google, like who is the boss of Google, then you'll get an answer using Google's own knowledge graph. And, philosophically, the pioneers in this field like to think of the whole worldwide web as a knowledge graph. And so you get to this idea where all of the content on a web page is tagged not just with HTML, but with meaning and relations drawn from some ontology. And that's the idea behind the semantic web, right, or web 3.0, you know, which has been on the verge of being the next big thing for the last decade or so. And, you know, some people think, well, doesn't this sound like SQL? And, you know, that's right. It's like a less structured, probabilistic SQL where you get joins and regex for free, and you get support for recursive queries and graph traversal algorithms out of the box.
And it all adds up to something that looks like natural language reasoning. And the other thing I should say is, you know, modern knowledge graphs have only come about in the past couple of years. We've now got the techniques and the raw computing power to compute knowledge graph embeddings. So what's that? That's vectors of some dimension that encode everything about a node or an edge, given its position in the knowledge graph and its relations to other nodes and edges. So, linguistically, you've probably heard of word2vec. Right? Yes. Yeah. So we've been able to do this kind of thing for a while with Word2Vec. You know, the canonical example there is that you have some engine, some neural net, that's built up vectors for king, queen, man, and woman. And when you do the math, king minus man plus woman, you find that the most similar vector by cosine distance, the closest vector, is queen.
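That classic word-vector analogy can be reproduced in a couple of lines with gensim, assuming the library is installed and the pretrained GloVe vectors download on first use; this only illustrates the word2vec case, not Zincbase's graph embeddings.

```python
# The king - man + woman analogy with pretrained vectors. Assumes gensim is
# installed; the GloVe vectors download automatically on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The top hit is typically "queen".
```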
And the equivalent in knowledge graph embeddings is something like Keanu Reeves minus Neo plus Laurence Fishburne. The closest vector might be Morpheus. And one of the beautiful things about these embeddings is you can add new nodes into the knowledge graph simply by
[00:08:55] Unknown:
averaging the embeddings of similar nodes. There are a few things that you brought up in there that are definitely worth touching on. One of the things being the semantic web, which, as you said, has been on the verge of being the next big thing for quite some time, at least since the early to mid 2000s. Mhmm. And you mentioned that one of the reasons that knowledge graphs have become relevant again is because we have new computational techniques that make it viable to extract the entities from different data sources to be able to create meaningful relations, as well as the access to the necessary computing power to make it feasible.
And I know that in the early days, we had the idea of the RDF, or Resource Description Framework, triple stores that were plagued by performance issues. And I'm wondering if there are associated data storage technologies that have come about that have addressed some of those problems as well, to help make this a more viable exercise and a more viable set of tooling for being able to actually solve real world problems.
[00:10:02] Unknown:
So there's an interesting answer to that question, which is that, yes, we do have better data storage now. So we've got things like, Wikipedia uses a cluster of Spark nodes, I believe, and they have their own query language called SPARQL, a different kind of Spark, which they're able to leverage very well to do fast queries. We also have, you know, things like Neo4j, right, the graph database, and AWS has its own version called Neptune, their own graph database. The thing is, and I cannot give you the reference for this right now, but back when I was planning out the design of Zincbase, I did look into whether we should back this with a graph database or not.
And there's a paper which I read which actually showed that the performance of, you know, a basic relational database to hold triples is perfectly adequate for all this. You don't really need a graph database. So what I do now in my work is we just use Postgres. You know, it's the best and fastest database, I think, that there is. So in terms of, like, data storage, I think we've been there for a while. I think where we have new advantages in computing is, well, you know, GPUs and cluster computing, distributed computing.
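As a rough illustration of the point that a plain relational table of triples goes a long way, here is a sketch using SQLite from the standard library in place of Postgres; the schema and data are made up for the example.

```python
# Sketch of "triples in a plain relational table", using SQLite so it runs
# anywhere without a server.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
conn.execute("CREATE INDEX idx_sp ON triples (subj, pred)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("keanu_reeves", "acted_in", "the_matrix"),
        ("kenya", "borders", "uganda"),
        ("tanzania", "borders", "uganda"),
    ],
)

# A one-hop traversal is just a self-join: who shares a border with Kenya?
rows = conn.execute(
    """SELECT b.subj FROM triples a JOIN triples b
       ON a.obj = b.obj AND a.pred = b.pred
       WHERE a.subj = 'kenya' AND a.pred = 'borders' AND b.subj != 'kenya'"""
).fetchall()
print(rows)  # [('tanzania',)]
```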
And we've had a lot of advances there. You know, Spark is excellent. Dask, part of the Python ecosystem,
[00:11:44] Unknown:
is even better in my view. But yeah. So a lot of different things coming together. And in terms of your own interest and involvement in the space of knowledge graphs, I'm wondering what it is that catalyzed your interest and how you got involved in it. Sure.
[00:11:59] Unknown:
This actually started really early for me. So my school had a BBC microcomputer. Now not all of your listeners might be familiar with the BBC microcomputer, but it was amazing. In Britain in the eighties, the BBC was putting computers in schools to improve computer literacy. It was such a forward thinking program. Mostly these computers ran BASIC, right, but you could actually swap out the chip and run a Prolog interpreter instead. And it blew my 10-year-old mind that you could store facts in a computer and ask it questions and ask it to reason about those facts, and that the computer could become an expert in something.
So it started very early. And for me, you know, a machine being able to reason about its own learned model of the world, you know, that's the true goal here. And I think knowledge graphs are a key step towards that. There's definitely been a big rise in popularity
[00:13:04] Unknown:
of knowledge graphs and uses of them. I know that you mentioned Google and its knowledge graph, which, for anybody who's not familiar, if you do a search and it pops up with the sidebar of all the different rich information about different entities, whether it's a country or a person, that's being fed by their knowledge graph. And then there are businesses being built up around just ingesting data sources, constructing knowledge graphs from them, and then selling access to the resulting graph for other businesses to query and get value from. And then in terms of your own implementation of being able to build knowledge graphs with Zincbase, I'm wondering how it compares to the overall ecosystem of available tooling, and what was lacking in the projects that you looked at when you were trying to address this problem that motivated you to create your own library?
[00:13:55] Unknown:
Right. I mean, good question. I have a big problem with AI. It's brainless. Right? It's free of actual intelligence. So, I work in NLP, right, which is in the middle of a renaissance. A few years ago, for the first time, NLP actually started to, you know, like, work quite well. So you have these pretrained language models now, ELMo, ERNIE, GPT-2, XLNet. But, and this has been shown in a recent paper, they have no capability to actually reason about the world. They can give decent sounding answers to questions, you know, they can summarize text pretty well, but there's no actual understanding there. And to frame this, I should introduce the concept of a pretrained language model. It's essentially a neural network that's been fed vast gigabytes of text from books and Wikipedia and trained to predict missing words in a document given the rest of the document as context.
And you can fine tune these models and use them as a base for other tasks. And every week, we get new state of the art results out of the research, and at Primer, where we work, you know, we apply that research to useful effect. But you can see, fundamentally, this is a statistical task. It's assigning probabilities to words given other words. So for example, right, there's this research task. It's called the Winograd schema challenge. We take a pretrained language model and feed it a bunch of training data consisting of examples like: the trophy would not fit in the suitcase because it was too big. What was too big?
And then your model has to pick one of two options, whether the trophy was too big or the suitcase was too big. Now as humans, we can simulate the world. Right? We have common sense. We have a concept of a trophy, and we have a concept of a suitcase, and we can reason that it's obviously the trophy that's too big. And language models actually suck at this, and they need huge amounts of data just to be able to suck. It's not working well. So I think that to have good NLP, you know, to move beyond processing to actual understanding of natural language, you know, we need more than statistical methods.
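For a concrete sense of what "predict the missing word given the context" means, here is a small fill-mask example using the Hugging Face transformers library, assuming it is installed and that the BERT weights download on first use.

```python
# Masked-word prediction with a pretrained language model. The model is
# scoring surface plausibility; it is not reasoning about trophies and
# suitcases.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The trophy would not fit in the suitcase because it was too [MASK]."
for candidate in fill(sentence):
    print(candidate["token_str"], round(candidate["score"], 3))
```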
And, you know, over time, I've picked up things like Prolog and symbolic logic, you know, fuzzy logic, knowledge graphs, evolutionary methods, brain inspired methods, and I wanted to combine them. And, you know, regarding other tools, I haven't seen anything that's exactly like Zincbase. I try to build things that tinkerers and hackers can pick up and play with quickly. So you have, recently, I think last year, maybe even this year, Facebook released PyTorch BigGraph, which is, you know, a knowledge graph embeddings framework.
And Accenture has a project called AmpliGraph, which is excellent. And there are various Prolog and symbolic logic compilers around. And as we talked about in the graph database space, you've got these projects like Neo4j, and Spotify has its own library for doing fast indexed nearest neighbor search. And you've got things like spaCy for the information extraction tasks that you need to do in order to build the graph. You've got a lot of different components coming together. And Zincbase, honestly, is worse than all those projects individually, but its goal is to be batteries included.
A toolkit whereby you can extract structured information from unstructured text, build a knowledge graph, and make inferences using that knowledge graph. I didn't build this simply out of a desire to push AGI forward, I gotta say. I built Zincbase in part because I wanted to work on it as my day job. Right? Instead of my day job being, you know, creating bigger and better training datasets and training, you know, these ever more complicated neural models with more parameters, I wanted to spend my time building towards real intelligence. Right? So I made Zincbase in my spare time as a prototype for Primer, and I demoed it to the bigwigs. And there was enough interest and excitement that I was actually able to make this into my day job.
And the final factor, I guess, or the thing that sets Zincbase apart from other tools, is this issue of explainability or interpretability of machine learning models. Right? So I wanna pose to you a thought experiment now. At Primer, we work with governments, and we work with the intelligence community. Right? Now imagine a knowledge graph of all people, linking them to other people, to places, to events, to organisations. Now imagine I could ask that graph: what's the probability that Tom Grek has occupation terrorist?
That's scary. There's a lot of power there. And if the graph is able to make a prediction on that, should we consider that prediction actionable? You know, I think if we're going to open that particular Pandora's box, you know, we better be prepared to throw dynamite inside and blow it wide open to the point that it's completely transparent. So with Zincbase, I wanted to combine the black box neural methods, which really achieve the state of the art performance, with symbolic logic and information retrieval and the more traditional and more explainable machine learning methods.
[00:20:29] Unknown:
Yeah. Being able to understand how a certain prediction or probability is arrived at is definitely essential, particularly in the types of cases that you're referencing. But even in a mundane case, like the probability that I happen to like apples, when a system makes a prediction you want to know what information was fed into it to reach that conclusion. Because otherwise, yeah, you're definitely opening the door for a lot of abuse of the capabilities of these types of systems, particularly given the inherent biases that exist in the data that's being used to generate and train these models.
So understanding what went into it, the entire life cycle of the data that was fed into these models, and what the logic was that generated those predictions is essential. And I'm curious to hear how you have implemented Zincbase to be able to provide that type of transparency and visibility into the overall process. Yeah. That was really designed in from the start. You
[00:21:33] Unknown:
have this machine learning, neural model that is able to take a graph and reduce that graph so that each node and each relational predicate has its own embedding. And then, once you're in vector space, you can do all sorts of things with those embeddings, pretty much do what you like, and especially make inferences about those nodes and relations. Now, with Zincbase, the graph itself is kind of the first-class citizen. It is something that you can query. And, again, you know, this is part of why I wanted to combine symbolic logic with neural network and statistical methods, is that you can use Prolog. People at work think I'm crazy for building this all around Prolog. You know, it's got some associations with the 1980s and the old AI winter and so on. But I just like it. But just substitute Prolog for any kind of English-like query language.
But I wanted the graph itself to be the first-class citizen here. So once you have a prediction that you've made with Zincbase, such as, you know, did Keanu Reeves act in The Matrix, or does Mozambique border Tanzania, something like this, then you can actually explore and visualize the graph itself to find out why that might be the case.
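One way to make that "explore the graph to see why" step concrete is to pull out the local neighbourhood around the two entities in a predicted triple and plot it. This sketch uses NetworkX and matplotlib directly rather than Zincbase's own plotting utilities, and the edges are made up for the example.

```python
# Visualise the local neighbourhood around a predicted edge as a rough
# "explanation" of why the prediction might hold.
import matplotlib.pyplot as plt
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("kenya", "uganda"), ("tanzania", "uganda"),
    ("tanzania", "mozambique"), ("uganda", "south_sudan"),
])

# Union of the 1-hop neighbourhoods of the two entities in the queried triple.
neighbourhood = nx.compose(nx.ego_graph(g, "kenya", radius=1),
                           nx.ego_graph(g, "tanzania", radius=1))
nx.draw(neighbourhood, with_labels=True, node_color="lightblue")
plt.show()  # the shared neighbour (uganda) is the visual explanation
```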
[00:23:16] Unknown:
And you mentioned some of the libraries such as PyTorch and spaCy and some of the other capabilities that are part of the Python ecosystem. I'm wondering what are the pieces that you are leaning on most heavily in building Zincbase, and also some of the ways that it can be integrated into the rest of the Python ecosystem. For instance, if you wanted to incorporate it as part of a web application using Django or Flask, or being able to feed data in from something like an Airflow workflow engine or things like that? Great question. So, yeah, it's built heavily with
[00:23:51] Unknown:
Python and PyTorch, which is, am I allowed to say, vastly the superior autograd library?
[00:23:59] Unknown:
You're allowed to say whatever you want. I'm just not going to necessarily back up your claims, because I don't want us to be involved in that flame war.
[00:24:07] Unknown:
Okay. I use a few bits of scikit-learn and SciPy, and I use NetworkX for handling the graph stuff. And I like to keep things as standard as possible. You know, many people are used to working with NetworkX, so why would I deviate from that, and why would I try to build my own thing? Zincbase uses CircleCI to build the documentation and the pip wheel. It uses Read the Docs to host the documentation. So, really, I'm trying to be as standard as possible. And at work, we, in fact, use this library in production. All you need to do is essentially, you know, you have your model loaded onto a GPU on some virtualized instance, and you just need some queuing mechanism that will take the queries that are coming from your web app back end, stick them in a queue until the GPU on your inference machine is ready to process them. You put a callback there, and the inference itself is actually very
[00:25:10] Unknown:
quick. And when you first started building Zincbase, I'm curious what some of the assumptions were that you had going into it and some of the ways that those assumptions have been challenged or updated, and either validated or invalidated, in the process of working on building it, as well as using it for your own work at your job?
[00:25:29] Unknown:
You know, I'm glad you asked. One of the things that I've found, and I think it's really a ridiculous situation, is documentation. Right? In the past, I have always documented things with, you know, with Markdown, documented them in GitHub, documented them in code, or maybe in some external wiki or Confluence, tools like this. Right? And for the first time, I have released a library which not only my company is using, but I've heard from other people that they're using as well. And I had to document it properly. And I think it's fair to say that I'm unhappy with the state of documentation in Python.
And here's, you know, can you answer this question for me? Why are we using RST files instead of markdown? I don't know.
[00:26:19] Unknown:
Because somebody decided that it was superior due to its increased flexibility. But at the same time, I have found that it does provide a bit of a barrier to entry as far as just trying to fit the entirety of it in your head for being able to just write something down simply. I don't wanna have to learn two different
[00:26:36] Unknown:
types of syntax for writing documentation. Like, there's enough in this kind of space that I have to learn. I read several scientific papers every day. You know, my brain capacity is finite. So I'd love it if we were able to just standardize on, say, GitHub flavored Markdown. But anyway, you know, maybe that's just a personal preference.
[00:26:56] Unknown:
Yeah. It's another case of VHS versus Betamax, where the one that is actually technically superior doesn't really matter. It's the one that people actually use that matters at the end of the day. True. Yeah. Although, you know, Betamax,
[00:27:10] Unknown:
I miss you. So, I mean, there's a couple of other things. Right? Testing stochastic models. How do you really do that? How do you put together doctests for stochastic models? Right? You can do it. How do you put together the unit tests for stochastic models? That's difficult. And I found that everything tends to become kind of an integration test, because you have to, you know, first download the model, right, from your CI/CD and then run the tests. So, you know, I think there are practices that we could improve around that. And another thing is, at work, one of the things I've done to the project is to look at model provenance. Right? So if you have a bunch of people working on a library together, and each of them is developing models that do different things, so named entity recognition, coreference resolution, relation extraction, what tends to happen is that data preprocessing scripts get lost. Training datasets get lost, or they don't get versioned.
You know, things live in people's own Jupyter Notebooks on their laptops. Model performance statistics are not recorded properly. And I've even seen models passed around as Cythonized pickle files where nobody actually has the original source code, you know, because the guy that wrote it left the company. So all we have is this pickle. So for me, one of the big challenges with this has been building the utilities and the helper functions that make all of this really easy. And it's not rocket science. It's like one part lab notebook, one part Git, one part boto3. But iterating on machine learning models and datasets happens really, really fast.
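A rough sketch of that "one part lab notebook, one part Git, one part boto3" idea might look like the following; the bucket name, key layout, and helper function are hypothetical, and this is not Primer's or Zincbase's actual tooling.

```python
# Record where a model came from alongside its weights: git commit, dataset
# hash, and metrics, written to S3 as JSON. Illustrative only.
import hashlib
import json
import subprocess

import boto3

def record_provenance(model_name: str, version: str, dataset_path: str, metrics: dict):
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    metadata = {
        "model": model_name,
        "version": version,
        "git_commit": git_sha,
        "dataset_sha256": dataset_hash,
        "metrics": metrics,
    }
    boto3.client("s3").put_object(
        Bucket="example-model-registry",  # hypothetical bucket name
        Key=f"models/{model_name}/{version}/metadata.json",
        Body=json.dumps(metadata, indent=2).encode(),
    )
```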
So it's important, if you wanna build some tool to support that, that you don't get in the way of it and that you don't force people to learn, you know, more syntax and different paradigms. So, yeah,
[00:29:14] Unknown:
a lot of challenges in this space. And as far as the difficulties that people have in the general space of building and using knowledge graphs, I'm curious what you have found to be some of the common points of confusion or issues that they encounter, and some of the ways that you've tried to address that with Zincbase? Yeah. Great question. So scalability
[00:29:36] Unknown:
is the big problem here. So Facebook's library PyTorch BigGraph has some quite innovative methods around graph partitioning, where you can create a really big knowledge base and train it to produce these embeddings for each node or relation even though it has, you know, maybe hundreds of millions of nodes and a trillion edges. Who knows? Now I've successfully used Zincbase on a graph with a million nodes and 10 million edges. So it's a, you know, reasonably big graph, but that's not big compared to the social network or compared to a graph of, like, Netflix users and programs. And the thing you have to bear in mind is that a knowledge graph gets exponentially more useful the larger it is.
So scalability is a real challenge, and personally, I haven't yet found the need to move to, like, a distributed training schedule. I haven't built any graph that's big enough that it's needed to spread itself over multiple partitions. So these are not challenges that I have addressed in Zincbase yet, but I know that the scalability challenges have kind of been solved, so I'm comfortable with that right now. Another challenge is basically GPU utilization. So the linear algebra that you need to construct graph embeddings is not that complex. Right? So there are two things that we could do. Either we can make the math more complex, which probably we should, or we also need to get better at batch training so that we can fully take advantage of GPUs.
And I think one more problem that I'd like to mention as well is not really a technical one. It's a perceptual one. I think knowledge bases aren't sexy like self-driving cars. You know, they're not as well funded as high frequency trading, and they have some kind of, like, old-fashioned connotations. I mean, here I am. It's 2019. I'm talking about Prolog. Yeah. That already dates me somewhat. So I think I'd like this space to be more sexy. Let's put it that way. Well, I think it's starting to become that way as more large organizations
[00:32:03] Unknown:
wake up to the realities of how useful these are and start to contribute more to the research and building and utilization of this technology space. Yeah. Also, as you were saying before, some of the renaissance in knowledge graphs has come about because of the revival of natural language processing and some of these deep learning based approaches to being able to perform entity extraction and vectorization of text, as well as the broader availability of some of these textual datasets. And I know that recently there were some books and other publications that came into the public domain from, I think, the early 1900s, which will help provide a little bit more of a corpus for being able to try and build these more complex models, and the ubiquity of textual data that can be freely obtained from various Internet sources, which is what they fed into the GPT-2 model. So it's definitely an interesting time to be getting into this area, and I'm definitely interested to see some of the ways that they
[00:33:09] Unknown:
continue to be leveraged in other products and projects. Oh, yeah. And also, like, I do wanna give props to Wikipedia and Project Gutenberg. They do an amazing job, but don't get me started on how copyright runs for so long. We should have more text data available from more recent times. I'll leave it at that.
[00:33:32] Unknown:
For somebody who is interested in using Zincbase and wants to get ramped up and use it in their own projects, can you talk through an example workflow and the overall steps of being able to obtain the data, clean it, train a model, and then be able to use that for generating the embeddings for the knowledge graph, and then populate the knowledge graph and use it in
[00:33:53] Unknown:
production? Yeah. So this is really easy. So there are a couple of ways that you can do this. You can feed Zincbase a CSV file of triples that you know about, you know, just three columns, ask it to read the whole thing and build a graph, compute embeddings for that graph, and then query it in some kind of Prolog-like language. And you can also plot the graph that's being created. You can plot the embeddings and examine the natural clusters that tend to form, and you can ask it to make predictions. And, you know, this is all documented in the repo. It's like three or four simple Python commands. And the other thing, I think I said I tried to design things for the tinkerer or hacker. And you can also use Zincbase completely interactively.
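The shape of that CSV-to-graph workflow, sketched with pandas and NetworkX rather than the literal Zincbase commands (the repo's documentation has the actual three-or-four-line version), might look like this:

```python
# Read a small CSV of triples and build a graph from it. The CSV is inlined
# here so the example is self-contained.
import io

import networkx as nx
import pandas as pd

csv_of_triples = io.StringIO(
    "subject,predicate,object\n"
    "united_states,borders,mexico\n"
    "kenya,borders,uganda\n"
    "tanzania,borders,uganda\n"
)

df = pd.read_csv(csv_of_triples)
kg = nx.MultiDiGraph()
for row in df.itertuples(index=False):
    kg.add_edge(row.subject, row.object, predicate=row.predicate)

print(kg.number_of_nodes(), kg.number_of_edges())
# In Zincbase itself, the next steps would be to compute embeddings for this
# graph and query it in a Prolog-like language, e.g. how likely is
# borders(kenya, tanzania)?
```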
So you import it, you add facts manually one by one, just like I used to do in the eighties and early nineties on the BBC Micro, and then from there you can still compute the embeddings and query the graph. And then, of course, that's gonna be kind of time and resource intensive. But I want people to be excited about this and just pick up the library and play with it. And I think that's a good way to do that. And if you look at some of the other libraries in this space, you know, spaCy, we mentioned, is a very easy to use library. But if you look at something like, well, PyTorch BigGraph or the Allen Institute's AllenNLP library, they're really difficult to use. You already have to be a specialist.
And so I guess one of the things that I'd like to do is just kinda say, hey, everybody, you know, knowledge graphs are accessible. You can just pick this up, import the library, and go crazy in a couple of lines of Python. So that's kind of a design philosophy that I try to stick to. And, you know, in terms of applications for this, there is a lot that I'm really excited about. For me, I'll quite often take Zincbase as the base of something else I want to do. So it's a nice way to get triples into a graph format, you know, that's easy. And then from there you can get them into the vector space, right? And it's just a sequence of numbers. And then in a vector space, you can do, you know, pretty much any kind of machine learning task that you like. So I'll go off and prototype a new idea, you know, implement a new research paper, and not all of that makes it into the repo. You know, in AI, most experiments will fail. And I think, for other applications that I'm really excited about, there was a recent research paper, like in the last month or so, which got some media attention, where people were using, I think, universal language models to make predictions about materials science. I think they found that these models sort of capture some latent semantic relations in the materials science literature. And they were able to make predictions about, you know, I don't know, new polymers that should exist, things like that. And if a language model can do that, then a knowledge graph can do that much, much better. So I'm excited about that. I'm excited about the potential in, for example, the molecular biology and organic chemistry fields, where you can build a knowledge base of chemicals and proteins and interactions, you know, predict possible qualities of known substances or identify gaps in the knowledge base where it looks like there should be a substance or an interaction, but currently you don't know about one.
Yeah. And then also, more mundane, but I've spent time in enterprise, and large companies would go crazy for something where they can just spider over their unstructured data, their PDFs and emails and any kind of document on a corporate network, and build a structured knowledge base from it. Right? And then have that queryable in something like English. So, like, if a legal case comes up, for instance, you can say, oh, who was it that was likely to have been in a meeting with this person on a certain date in a certain office? Or a sales meeting happens, and you can ask the knowledge base whether you've likely got enough of a certain widget to meet the demand forecast for the next quarter, or what a machine's common failure modes are. Or, you know, even just, I don't know if you've experienced this in a large enterprise yourself, but, like, you wanna talk to somebody about a particular product line and you just don't know who to talk to, what person is in charge of that.
And that is the kind of question that a knowledge graph would be very, very easily able to answer. So sorry, I think I got a little bit off the subject there, but it's all documented in the repo, and it's super easy to get started. And if people wanna give it a try and give me some feedback, I would absolutely love that. Yeah. I appreciate
[00:39:26] Unknown:
the potential use cases and some of the pontification about ways and areas that Zincbase can be used, and knowledge graphs in general. Because for somebody who hasn't worked with one or isn't familiar with the space, it can be easy to say, okay, that's great, but it's really hard to actually think about different avenues where it could potentially be applied. And having feedback about some of the ways that you've imagined it being used can help inspire ideas of ways that other people can leverage it for their own use cases.
[00:40:00] Unknown:
Yeah. I think that's right. I mean, we use this library where I work at Primer now, and we serve, you know, big banks. We serve Walmart. We serve legal customers. And what these companies want is just some way to be able to organize and analyze the information that they already have and make predictions about it, explainable predictions. And one of the, I guess, failures of knowledge graphs is, well, it's so general. Like, it sounds like a great concept, but what are the specific applications of that? So I'm glad to kinda have this platform to be able to say, hey, look, there are real things that you can do with a knowledge graph. It's got real value. You know? Think about it. Just start building one and see what you can come up with. You can use Zincbase for that.
[00:40:52] Unknown:
And as far as your own experience of working on Zincbase and experimenting with the area of knowledge graphs, I'm wondering what you have found to be some of the most challenging or interesting or unexpected lessons that you've learned in the process.
[00:41:08] Unknown:
Okay. So maybe you or your listeners can help me with this, actually. So why is it so hard for a Python package to get its own version number? Right? So I'll explain. The way that models work in Zincbase is, if you try to instantiate a model, right, it looks at this CSV file, which is stored in S3, and it gets the model versions and the file names and locations of the model weights, you know, which are files also stored in S3. Now the CSV file containing all this metadata is versioned to match the Zincbase library version.
And so the version number has to be there in the code of the library somewhere. And me and everybody else who works on this library has to update that manually every time we do a new release. So I haven't found any standard way for a Python package to get its own version. Do you know what I mean? Yeah. One approach that I've seen is to actually have just a version.py in the repository
[00:42:15] Unknown:
that you use for incrementing the value. And then in your setup.py, just import that to populate the necessary field. I've seen a few different ways of doing it, and also maybe just in an __init__.py file. Or I know that there are also some libraries that exist specifically just for the purpose of incrementing your version when you do releases. So that would be great. It is a potentially complicated space. But, hey, isn't this Python? Like, there should be one obvious way of doing it. I haven't found that.
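One common version of that pattern, sketched here for illustration rather than as what Zincbase actually does, is to keep the number in a version.py and read it from setup.py:

```python
# Illustrative only, not necessarily how Zincbase handles it.
# zincbase/version.py is the single source of truth, containing only:
#     __version__ = "0.1.1"   # hypothetical number
#
# setup.py reads it without importing the package:
import os
import re

from setuptools import setup

here = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(here, "zincbase", "version.py")) as f:
    version = re.search(r'__version__\s*=\s*"(.+?)"', f.read()).group(1)

setup(name="zincbase", version=version)

# At runtime, library code can do `from zincbase.version import __version__`,
# or on Python 3.8+ use importlib.metadata.version("zincbase") for the
# installed copy.
```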
[00:42:43] Unknown:
I'll just mention a couple of other things that I found. Right? So, one thing, I just wanna give some props to doctest. Doc testing is amazing, you know, both for documentation and for testing. Love that so much. Another one is the CircleCI integration with GitHub. Very lovely, very easy to set up. And then the last point that I'd like to mention is that debugging machine learning is really, really hard. And, you know, with Zincbase, we integrate different types of machine learning or AI. So, like, for example, the Prolog engine that is contained in it is just recursive rule matching, basically. But one bug can be really hard to isolate.
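For anyone who has not used it, a doctest is just an example in a docstring that also runs as a test; for stochastic code you generally pin a seed (or assert statistical properties) to keep it deterministic. A minimal sketch:

```python
# The examples in the docstring double as tests when run through doctest.
import random

def jitter(x, seed=0):
    """Add small reproducible noise to x.

    >>> round(jitter(1.0), 3) == round(jitter(1.0), 3)
    True
    >>> jitter(1.0) != 1.0
    True
    """
    rng = random.Random(seed)  # seeded so the doctest output is stable
    return x + rng.uniform(-0.01, 0.01)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```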
So just some advice to people working in machine learning, and we've got a lot of, sort of, I guess, newer engineers joining our company now, and I'd just love to give this one piece of advice: just test everything as you go and make no assumptions about any
[00:43:58] Unknown:
of the code that you're trying to integrate. And looking forward in terms of your goals for the project and updates that you'd like to make. I'm curious what you have planned for the near to medium term. Yeah. Sure. Well, I mean, this is AI. It's probably the most exciting,
[00:44:17] Unknown:
industry in the world right now, and there are a few things that I'm really excited about. I want to incorporate agents into Zincbase, and specifically multiple agents that learn to communicate and learn to collaborate to complete tasks together. So you can imagine, you know, a couple of agents, and you set them loose on a graph, and they both start from different parts of the graph. And they find some way to communicate to each other, you know, in which direction they should be going to reason about the information in this graph. So for me, Zincbase is a little bit of a playground where I can get all of this advanced stuff in and synthesize multiple fields.
And the other one is, you know, there's been some recent research in the reinforcement learning field on how these kinds of intelligent agents can build world models. Right? And this is exactly my problem with AI. It's stupid. It's not able to build a model of the world and reason about that. And if you think about it, you know, a world model is just a kind of knowledge base where the nodes are items, you know, physical things in the real world, and the edges are interactions. Like, if I push this domino, you know, it's gonna fall and knock down the next domino. So I'm really excited about, you know, the potential for integrating learnings from reinforcement learning about creating these, I guess it's kind of like a metacognition.
So agents with metacognition, and if they have a knowledge graph that they can query about the real world and other trivia, I feel like that's gonna be extremely powerful. But, anyway, concretely, you know, I see the future of Zincbase as basically a tool for extracting structured data from unstructured text, building it into graphs, and using those graphs to make probabilistic but explainable inferences about unknown information. And, one final thing, really: if anybody would like to collaborate with me on this or just, you know, exchange emails and talk about it, I would be extremely happy with that. And also my company, Primer, is recruiting. So if you wanna work on this full time, do get in touch as well. Yeah. And I'll definitely
[00:46:45] Unknown:
have you add the best way to get in touch with you into the show notes. And before we close out the show, is there anything else as far as the areas of knowledge graphs and Zincbase that we didn't discuss yet that you'd like to cover before we close out? No. I think,
[00:47:01] Unknown:
from my side, anyway, I think that was a pretty broad and deep enough
[00:47:09] Unknown:
discussion of the considerations involved. That was, you know, a lot of fun chatting. Alright. So with that, I'll move us into the picks. And this week, I'm going to choose a recipe that I tried out recently for some banana blueberry oat bars. They ended up being quite delicious, so, definitely something worth checking out for a quick and easy breakfast. And so with that, I'll pass it to you, Tom. Do you have any picks this week?
[00:47:31] Unknown:
Okay. I'll share a recipe I recently tried as well, which is pickled habaneros. Just take some vinegar, a little bit of sugar. This is a recipe I found on a trip to Thailand that I just had, but gave it a bit of Mexican influence. Pickled habaneros, a bit of vinegar, a bit of sugar. Leave it in the fridge for a week or so, and you get this delicious
[00:47:53] Unknown:
sauce to go on rice. Excellent. I'll have to give that a try as well. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Zincbase and share your perspective on knowledge graphs in general. It's definitely a very interesting area, and it's great to be able to talk to somebody who is so deeply involved in it. So thank you for all of that, and I hope you enjoy the rest of your day. Okay. Thank you so much. That was a lot of fun.
Introduction and Sponsor Messages
Interview with Tom Grek: Introduction and Background
Understanding Knowledge Graphs
Tom's Journey into Knowledge Graphs
Challenges and Innovations in Knowledge Graphs
Building Zincbase: Motivation and Features
Scalability and Practical Applications of Knowledge Graphs
Using Zincbase: Workflow and Examples
Challenges and Lessons Learned
Future Plans for Zincbase
Closing Remarks and Picks