Summary
Computers are excellent at following detailed instructions, but they have no capacity for understanding the information that they work with. Knowledge graphs are a way to approximate that capability by building connections between elements of data, allowing us to discover relationships among disparate information sources that were previously unknown. In our day-to-day work we encounter many instances of knowledge graphs, but building them has long been a difficult endeavor. In order to make this technology more accessible Tom Grek built Zincbase. In this episode he explains his motivations for starting the project, how he uses it in his daily work, and how you can use it to create your own knowledge engine and begin discovering new insights of your own.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Tom Grek about knowledge graphs, when they’re useful, and his project Zincbase that makes them easier to build
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what a knowledge graph is and some of the ways that they are used?
- How did you first get involved in the space of knowledge graphs?
- You have built the Zincbase project for building and querying knowledge graphs. What was your motivation for creating this project and what are some of the other tools that are available to perform similar tasks?
- Can you describe how Zincbase is implemented and some of the ways that it has evolved since you first began working on it?
- What are some of the assumptions that you had at the outset of the project which have been challenged or updated in the process of working on and with it?
- What are some of the common challenges when building or using knowledge graphs?
- How has the domain of knowledge graphs changed in recent years as new approaches to entity resolution and data processing have been introduced?
- Can you talk through a use case and workflow for using Zincbase to design and populate a knowledge graph?
- What are some of the ways that you are using Zincbase in your own projects?
- What have you found to be the most challenging/interesting/unexpected lessons that you have learned in the process of building and maintaining Zincbase?
- What do you have planned for the future of the project?
Keep In Touch
Picks
- Tobias
- Tom
Links
- Zincbase
- Commodore 64
- Electronic Engineering
- Artificial Intelligence
- Primer.ai
- Artificial General Intelligence
- MATLAB
- IPython
- NumPy
- Excel
- Jupyter
- Pandas
- Knowledge Graph
- The Matrix
- Keanu Reeves
- Ontology
- Semantic Web
- Word2Vec
- SPARQL
- Neo4j
- Graph Database
- AWS Neptune
- PostgreSQL
- Dask
- BBC Micro
- BASIC
- Prolog
- NLP
- ELMo
- BERT
- GPT-2
- Winograd Schema Challenge
- PyTorch BigGraph
- AmpliGraph
- SpaCy
- AI Winter
- PyTorch
- scikit-learn
- NetworkX
- SciPy
- CircleCI
- Read The Docs
- Project Gutenberg
- AllenNLP
- Doctest
- Reinforcement Learning
- Metacognition
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your CI/CD pipelines, they just launched dedicated CPU instances. They've also got worldwide data centers, including a new one in Toronto and one opening in Mumbai at the end of the year. So go to pythonpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. And to grow your professional network and find opportunities with the startups that are changing the world, AngelList is the place to go. Go to pythonpodcast.com/angel today to sign up. You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis.
For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, and the Open Data Science Conference, with upcoming events including the O'Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to pythonpodcast.com/conferences to learn more and to take advantage of our partner discounts when you register. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. Your host as usual is Tobias Macey. And today, I'm interviewing Tom Grek about knowledge graphs, when they're useful, and his project Zincbase that makes them easier to build. So, Tom, can you start by introducing yourself?
[00:02:05] Unknown:
Sure. I'll give you my nerd resume. So, I started coding way back when I was 5 years old on a Commodore 64. I studied electronic engineering at college. I got a master's in AI back when it was fringe and didn't really work. I've had a failed startup. I've worked all over the world. I've worked in big enterprise. I've freelanced. I've worked for the UK government's secret labs, and I'm now working as an engineer at Primer.ai, an NLP startup in San Francisco. And, just to frame the rest of this discussion, I do wanna point out that I'm a true believer in AGI, artificial general intelligence, which I think could solve all of the world's problems within our lifetime. And
[00:02:54] Unknown:
do you remember how you first got introduced to Python? I do actually.
[00:02:58] Unknown:
It was around 2011 or so. I was fed up with using MATLAB, and a friend suggested that I try IPython and NumPy. And then I discovered the whole SciPy ecosystem, and it was a love story thereafter. I think, you know, at one point in particular, I was making a lot of spreadsheets and doing financial modeling. And I switched from Excel to Jupyter and Pandas, and
[00:03:26] Unknown:
my life got a lot better at that point. I don't think I'd ever say I'm a good Pythonic Python coder, though. You know, that's a big claim. But I am a big fan of the language. Yeah. That's one of the beautiful things about it, is that you can take your own style to it. There are these vague references to things being Pythonic, but in some ways you can kind of cast that however you feel, as long as whatever you're doing with the language fits your way of thinking and is effective for the goals that you're trying to achieve. Right. And all the time, I'm learning new things about it as well and discovering things hidden in the standard library. It's fantastic. And the ecosystem and the people involved with it are the best. So before we get too deep into your project, can you start by giving an explanation of what a knowledge graph is and some of the use cases that they enable and ways that they're used that people may have come in contact with? Right. Sure.
[00:04:18] Unknown:
So in a knowledge graph, or more generically, I guess, a knowledge base, you have nodes, which are entities, so people, places, things, companies, and edges between them, which capture different relation types. So, for example, Keanu Reeves is a node. The Matrix is a node, and an edge between them could be acted in. And this fact, Keanu Reeves acted in The Matrix, has three components: a subject, a predicate or relation, and an object. And together, we call that a triple. So the core goal here is to be able to ask the knowledge base about some made up triple.
What is the probability that this triple is true? So in my project Zincbase, we include a toy dataset of triples relating to countries. For example, United States borders Mexico, France borders Spain, and so on. The knowledge graph can learn this structure. So say you've input the facts that Kenya borders Uganda and Tanzania borders Uganda, then what's the probability that Kenya borders Tanzania? Right. And if an unknown entity comes up that wasn't in your training set, you know, that's not in the knowledge base, well, Zincbase can handle that. Well, in general, knowledge graphs can handle it. So say we never trained on Mozambique, but later we find out that Mozambique borders Tanzania. Now we can ask the knowledge base whether or not Mozambique borders Uganda, and it can make that inference, put a probability on it, and it can tell you why it made that prediction. So it's fairly common to bootstrap a knowledge graph with Wikipedia and Wikidata. The nodes would be anything with a Wikipedia page, and the edges between them are essentially, you know, similar to the hyperlinks between web pages. And you would view Wikidata as the ontology behind it, supporting all of that.
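To make the triple and link-prediction idea concrete, here is a minimal sketch that stores a few "borders" triples in a NetworkX graph and scores an unseen edge with a classical neighbourhood-overlap heuristic. This is not Zincbase's actual embedding-based approach, just a stand-in for the kind of inference described above.

```python
# Minimal sketch (not Zincbase's API): triples in a NetworkX graph, plus a
# crude link-prediction score for an unseen edge.
import networkx as nx

triples = [
    ("united_states", "borders", "mexico"),
    ("france", "borders", "spain"),
    ("kenya", "borders", "uganda"),
    ("tanzania", "borders", "uganda"),
]

g = nx.Graph()  # undirected, since "borders" is symmetric
for subj, _pred, obj in triples:
    g.add_edge(subj, obj)

# How plausible is the unseen triple (kenya, borders, tanzania)?
for u, v, score in nx.jaccard_coefficient(g, [("kenya", "tanzania")]):
    print(f"{u} borders {v}? score={score:.2f}")  # shared neighbour: uganda
```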
And you can query Wikipedia as a knowledge graph. And if you type some question to Google, like who is the boss of Google, then you'll get an answer using Google's own knowledge graph. And, philosophically, the pioneers in this field like to think of the whole worldwide web as a knowledge graph. And so you get to this idea where all of the content on a web page is tagged not just with HTML, but with meaning and relations drawn from some ontology. And that's the idea behind the semantic web, right, or web 3.0, you know, which has been on the verge of being the next big thing for the last decade or so. And, you know, some people think, well, doesn't this sound like SQL? And, you know, that's right. It's like a less structured, probabilistic SQL where you get joins and regex for free, and you get support for recursive queries and graph traversal algorithms out of the box.
And it all adds up to something that looks like natural language reasoning. And the other thing I should say is, you know, modern knowledge graphs have only come about in the past couple of years. We've now got the techniques and the raw computing power to compute knowledge graph embeddings. So what's that? That's vectors of some dimension that encode everything about a node or an edge, given its position in the knowledge graph and its relations to other nodes and edges. So, linguistically, you've probably heard of word2vec. Right? Yes. Yeah. So we've been able to do this kind of thing for a while with Word2Vec. You know, the canonical example there is that you have some engine, some neural net, that's built up vectors for king, queen, man, and woman. And when you do the math, king minus man plus woman, you find that the most similar vector by cosine distance, the closest vector, is queen.
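That classic word-vector analogy can be reproduced in a couple of lines with gensim, assuming the library is installed and the pretrained GloVe vectors download on first use; this only illustrates the word2vec case, not Zincbase's graph embeddings.

```python
# The king - man + woman analogy with pretrained vectors. Assumes gensim is
# installed; the GloVe vectors download automatically on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# The top hit is typically "queen".
```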
And the equivalent in knowledge graph embeddings is something like Keanu Reeves minus Neo plus Laurence Fishburne. The closest vector might be Morpheus. And one of the beautiful things about these embeddings is you can add new nodes into the knowledge graph simply by
[00:08:55] Unknown:
averaging the embeddings of similar nodes. There are a few things that you brought up in there that are definitely worth touching on. One of the things being the semantic web, which, as you said, has been on the verge of being the next big thing for quite some time, at least since the early to mid 2000s. Mhmm. And you mentioned that one of the reasons that knowledge graphs have become relevant again is because we have new computational techniques that make it viable to extract the entities from different data sources to be able to create meaningful relations, as well as the access to the necessary computing power to make it feasible.
And I know that in the early days, we had the idea of the RDF, or Resource Description Framework, triple stores that were plagued by performance issues. And I'm wondering if there are associated data storage technologies that have come about that have addressed some of those problems as well, to help make this a more viable exercise and a more viable set of tooling for being able to actually solve real world problems.
[00:10:02] Unknown:
So there's an interesting answer to that question, which is that, yes, we do have better data storage now. So we've got things like, Wikipedia uses a cluster of Spark nodes, I believe, and they have their own query language called SPARQL, a different kind of Spark, which they're able to leverage very well to do fast queries. We also have, you know, things like Neo4j, right, the graph database, and AWS has its own version called Neptune, their own graph database. The thing is, and I cannot give you the reference for this right now, but back when I was planning out the design of Zincbase, I did look into whether we should back this with a graph database or not.
And there's a paper which I read which actually showed that the performance of, you know, a basic relational database to hold triples is perfectly adequate for all this. You don't really need a graph database. So what I do now in my work is we just use Postgres. You know, it's the best and fastest database, I think, that there is. So in terms of, like, data storage, I think we've been there for a while. I think where we have new advantages in computing is, well, you know, GPUs and cluster computing, distributed computing.
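As a rough illustration of the point that a plain relational table of triples goes a long way, here is a sketch using SQLite from the standard library in place of Postgres; the schema and data are made up for the example.

```python
# Sketch of "triples in a plain relational table", using SQLite so it runs
# anywhere without a server.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
conn.execute("CREATE INDEX idx_sp ON triples (subj, pred)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("keanu_reeves", "acted_in", "the_matrix"),
        ("kenya", "borders", "uganda"),
        ("tanzania", "borders", "uganda"),
    ],
)

# A one-hop traversal is just a self-join: who shares a border with Kenya?
rows = conn.execute(
    """SELECT b.subj FROM triples a JOIN triples b
       ON a.obj = b.obj AND a.pred = b.pred
       WHERE a.subj = 'kenya' AND a.pred = 'borders' AND b.subj != 'kenya'"""
).fetchall()
print(rows)  # [('tanzania',)]
```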
And we've had a lot of advances there. You know, Spark is excellent. Dask, part of the Python ecosystem,
[00:11:44] Unknown:
is even better in my view. But yeah. So a lot of different things coming together. And in terms of your own interest and involvement in the space of knowledge graphs, I'm wondering what it is that catalyzed your interest and how you got involved in it. Sure.
[00:11:59] Unknown:
This actually started really early for me. So my school had a BBC microcomputer. Now not all of your listeners might be familiar with the BBC microcomputer, but it was amazing. In Britain in the eighties, the BBC was putting computers in schools to improve computer literacy. It was such a forward thinking program. Mostly these computers ran BASIC, right, but you could actually swap out the chip and run a Prolog interpreter instead. And it blew my 10-year-old mind that you could store facts in a computer and ask it questions and ask it to reason about those facts, and that the computer could become an expert in something.
So it started very early. And for me, you know, a machine being able to reason about its own learned model of the world, you know, that's the true goal here. And I think knowledge graphs are a key step towards that. There's definitely been a big rise in popularity
[00:13:04] Unknown:
of knowledge graphs and uses of them. I know that you mentioned Google and its knowledge graph, which, for anybody who's not familiar, if you do a search and it pops up with the sidebar of all the different rich information about different entities, whether it's a country or a person, that's being fed by their knowledge graph. And then there are businesses being built up around just ingesting data sources, constructing knowledge graphs from them, and then selling access to the resulting graph for other businesses to query and get value from. And then in terms of your own implementation of being able to build knowledge graphs with Zincbase, I'm wondering how it compares to the overall ecosystem of available tooling, and what was lacking in the projects that you looked at when you were trying to address this problem that motivated you to create your own library?
[00:13:55] Unknown:
Right. I mean, good question. I have a big problem with AI. It's brainless. Right? It's free of actual intelligence. So, I work in NLP, right, which is in the middle of a renaissance. A few years ago, for the first time, NLP actually started to, you know, like, work quite well. So you have these pretrained language models now, ELMo, ERNIE, GPT-2, XLNet. But, and this has been shown in a recent paper, they have no capability to actually reason about the world. They can give decent sounding answers to questions, you know, they can summarize text pretty well, but there's no actual understanding there. And to frame this, I should introduce the concept of a pretrained language model. It's essentially a neural network that's been fed vast gigabytes of text from books and Wikipedia and trained to predict missing words in a document given the rest of the document as context.
And you can fine tune these models and use them as a base for other tasks. And every week, we get new state of the art results out of the research, and at Primer, where we work, you know, we apply that research to useful effect. But you can see, fundamentally, this is a statistical task. It's assigning probabilities to words given other words. So for example, right, there's this research task. It's called the Winograd schema challenge. We take a pretrained language model and feed it a bunch of training data consisting of examples like: the trophy would not fit in the suitcase because it was too big. What was too big?
And then your model has to pick one of two options, whether the trophy was too big or the suitcase was too big. Now as humans, we can simulate the world. Right? We have common sense. We have a concept of a trophy, and we have a concept of a suitcase, and we can reason that it's obviously the trophy that's too big. And language models actually suck at this, and they need huge amounts of data just to be able to suck. It's not working well. So I think that to have good NLP, you know, to move beyond processing to actual understanding of natural language, you know, we need more than statistical methods.
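For a concrete sense of what "predict the missing word given the context" means, here is a small fill-mask example using the Hugging Face transformers library, assuming it is installed and that the BERT weights download on first use.

```python
# Masked-word prediction with a pretrained language model. The model is
# scoring surface plausibility; it is not reasoning about trophies and
# suitcases.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The trophy would not fit in the suitcase because it was too [MASK]."
for candidate in fill(sentence):
    print(candidate["token_str"], round(candidate["score"], 3))
```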
And, you know, over time, I've picked up things like Prolog and symbolic logic, you know, fuzzy logic, knowledge graphs, evolutionary methods, brain inspired methods, and I wanted to combine them. And, you know, regarding other tools, I haven't seen anything that's exactly like Zincbase. I try to build things that tinkerers and hackers can pick up and play with quickly. So you have, recently, I think last year, maybe even this year, Facebook released PyTorch BigGraph, which is, you know, a knowledge graph embeddings framework.
And Accenture has a project called AmpliGraph, which is excellent. And there are various Prolog and symbolic logic compilers around. And as we talked about in the graph database space, you've got these projects like Neo4j, and Spotify has its own library for doing fast indexed nearest neighbor search. And you've got things like spaCy for the information extraction tasks that you need to do in order to build the graph. You've got a lot of different components coming together. And Zincbase, honestly, is worse than all those projects individually, but its goal is to be batteries included.
A toolkit whereby you can extract structured information from unstructured text, build a knowledge graph, and make inferences using that knowledge graph. I didn't build this simply out of a desire to push AGI forward, I gotta say. I built Zincbase in part because I wanted to work on it as my day job. Right? Instead of my day job being, you know, creating bigger and better training datasets and training, you know, these ever more complicated neural models with more parameters, I wanted to spend my time building towards real intelligence. Right? So I made Zincbase in my spare time as a prototype for Primer, and I demoed it to the bigwigs. And there was enough interest and excitement that I was actually able to make this into my day job.
And the final factor, I guess, or the thing that sets Zincbase apart from other tools, is this issue of explainability or interpretability of machine learning models. Right? So I wanna pose to you a thought experiment now. At Primer, we work with governments, and we work with the intelligence community. Right? Now imagine a knowledge graph of all people, linking them to other people, to places, to events, to organisations. Now imagine I could ask that graph: what's the probability that Tom Grek has occupation terrorist?
That's scary. There's a lot of power there. And if the graph is able to make a prediction on that, should we consider that prediction actionable? You know, I think if we're going to open that particular Pandora's box, you know, we better be prepared to throw dynamite inside and blow it wide open to the point that it's completely transparent. So with Zincbase, I wanted to combine the black box neural methods, which really achieve the state of the art performance, with symbolic logic and information retrieval and the more traditional and more explainable machine learning methods.
[00:20:29] Unknown:
Yeah. Being able to understand how a certain prediction or probability is arrived at is definitely essential, particularly in the types of cases that you're referencing. But even in a mundane case, like the probability that I happen to like apples, when a system makes a prediction you want to know what information was fed into it to reach that conclusion. Because otherwise, yeah, you're definitely opening the door for a lot of abuse of the capabilities of these types of systems, particularly given the inherent biases that exist in the data that's being used to generate and train these models.
So understanding what went into it, the entire life cycle of the data that was fed into these models, and what the logic was that generated those predictions is essential. And I'm curious to hear how you have implemented Zincbase to be able to provide that type of transparency and visibility into the overall process. Yeah. That was really designed in from the start. You
[00:21:33] Unknown:
have this machine learning, neural model that is able to take a graph and reduce that graph so that each node and each relational predicate has its own embedding. And then, once you're in vector space, you can do all sorts of things with those embeddings, pretty much do what you like, and especially make inferences about those nodes and relations. Now, with Zincbase, the graph itself is kind of the first-class citizen. It is something that you can query. And, again, you know, this is part of why I wanted to combine symbolic logic with neural network and statistical methods, is that you can use Prolog. People at work think I'm crazy for building this all around Prolog. You know, it's got some associations with the 1980s and the old AI winter and so on. But I just like it. But just substitute Prolog for any kind of English-like query language.
But I wanted the graph itself to be the first-class citizen here. So once you have a prediction that you've made with Zincbase, such as, you know, did Keanu Reeves act in The Matrix, or does Mozambique border Tanzania, something like this, then you can actually explore and visualize the graph itself to find out why that might be the case.
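One way to make that "explore the graph to see why" step concrete is to pull out the local neighbourhood around the two entities in a predicted triple and plot it. This sketch uses NetworkX and matplotlib directly rather than Zincbase's own plotting utilities, and the edges are made up for the example.

```python
# Visualise the local neighbourhood around a predicted edge as a rough
# "explanation" of why the prediction might hold.
import matplotlib.pyplot as plt
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("kenya", "uganda"), ("tanzania", "uganda"),
    ("tanzania", "mozambique"), ("uganda", "south_sudan"),
])

# Union of the 1-hop neighbourhoods of the two entities in the queried triple.
neighbourhood = nx.compose(nx.ego_graph(g, "kenya", radius=1),
                           nx.ego_graph(g, "tanzania", radius=1))
nx.draw(neighbourhood, with_labels=True, node_color="lightblue")
plt.show()  # the shared neighbour (uganda) is the visual explanation
```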
[00:23:16] Unknown:
And you mentioned some of the libraries such as PyTorch and spaCy and some of the other capabilities that are part of the Python ecosystem. I'm wondering what are the pieces that you are leaning on most heavily in building Zincbase, and also some of the ways that it can be integrated into the rest of the Python ecosystem. For instance, if you wanted to incorporate it as part of a web application using Django or Flask, or being able to feed data in from something like an Airflow workflow engine or things like that? Great question. So, yeah, it's built heavily with
[00:23:51] Unknown:
Python and PyTorch, which is, am I allowed to say, vastly the superior autograd library?
[00:23:59] Unknown:
You're allowed to say whatever you want. I'm just not going to necessarily back up your claims, because I don't want us to be involved in that flame war.
[00:24:07] Unknown:
Okay. I use a few bits of scikit-learn and SciPy, and I use NetworkX for handling the graph stuff. And I like to keep things as standard as possible. You know, many people are used to working with NetworkX, so why would I deviate from that, and why would I try to build my own thing? Zincbase uses CircleCI to build the documentation and the pip wheel. It uses Read the Docs to host the documentation. So, really, I'm trying to be as standard as possible. And at work, we, in fact, use this library in production. All you need to do is essentially, you know, you have your model loaded onto a GPU on some virtualized instance, and you just need some queuing mechanism that will take the queries that are coming from your web app back end, stick them in a queue until the GPU on your inference machine is ready to process them. You put a callback there, and the inference itself is actually very
[00:25:10] Unknown:
quick. And when you first started building Zincbase, I'm curious what some of the assumptions were that you had going into it and some of the ways that those assumptions have been challenged or updated, and either validated or invalidated, in the process of working on building it, as well as using it for your own work at your job?
[00:25:29] Unknown:
You know, I'm glad you asked. One of the things that I've found, and I think it's really a ridiculous situation, is documentation. Right? In the past, I have always documented things with, you know, with Markdown, documented them in GitHub, documented them in code, or maybe in some external wiki or Confluence, tools like this. Right? And for the first time, I have released a library which not only my company is using, but I've heard from other people that they're using as well. And I had to document it properly. And I think it's fair to say that I'm unhappy with the state of documentation in Python.
And here's, you know, can you answer this question for me? Why are we using RST files instead of markdown? I don't know.
[00:26:19] Unknown:
Because somebody decided that it was superior due to its increased flexibility. But at the same time, I have found that it does provide a bit of a barrier to entry as far as just trying to fit the entirety of it in your head for being able to just write something down simply. I don't wanna have to learn two different
[00:26:36] Unknown:
types of syntax for writing documentation. Like, there's enough in this kind of space that I have to learn. I read several scientific papers every day. You know, my brain capacity is finite. So I'd love it if we were able to just standardize on, say, GitHub flavored Markdown. But anyway, you know, maybe that's just a personal preference.
[00:26:56] Unknown:
Yeah. It's another case of VHS versus Betamax, where the one that is actually technically superior doesn't really matter. It's the one that people actually use that matters at the end of the day. True. Yeah. Although, you know, Betamax,
[00:27:10] Unknown:
I miss you. So, I mean, there's a couple of other things. Right? Testing stochastic models. How do you really do that? How do you put together doctests for stochastic models? Right? You can do it. How do you put together the unit tests for stochastic models? That's difficult. And I found that everything tends to become kind of an integration test, because you have to, you know, first download the model, right, from your CI/CD and then run the tests. So, you know, I think there are practices that we could improve around that. And another thing is, at work, one of the things I've done to the project is to look at model provenance. Right? So if you have a bunch of people working on a library together, and each of them is developing models that do different things, so named entity recognition, coreference resolution, relation extraction, what tends to happen is that data preprocessing scripts get lost. Training datasets get lost, or they don't get versioned.
You know, things live in people's own Jupyter Notebooks on their laptops. Model performance statistics are not recorded properly. And I've even seen models passed around as Cythonized pickle files where nobody actually has the original source code, you know, because the guy that wrote it left the company. So all we have is this pickle. So for me, one of the big challenges with this has been building the utilities and the helper functions that make all of this really easy. And it's not rocket science. It's like one part lab notebook, one part Git, one part boto3. But iterating on machine learning models and datasets happens really, really fast.
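A rough sketch of that "one part lab notebook, one part Git, one part boto3" idea might look like the following; the bucket name, key layout, and helper function are hypothetical, and this is not Primer's or Zincbase's actual tooling.

```python
# Record where a model came from alongside its weights: git commit, dataset
# hash, and metrics, written to S3 as JSON. Illustrative only.
import hashlib
import json
import subprocess

import boto3

def record_provenance(model_name: str, version: str, dataset_path: str, metrics: dict):
    git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    with open(dataset_path, "rb") as f:
        dataset_hash = hashlib.sha256(f.read()).hexdigest()
    metadata = {
        "model": model_name,
        "version": version,
        "git_commit": git_sha,
        "dataset_sha256": dataset_hash,
        "metrics": metrics,
    }
    boto3.client("s3").put_object(
        Bucket="example-model-registry",  # hypothetical bucket name
        Key=f"models/{model_name}/{version}/metadata.json",
        Body=json.dumps(metadata, indent=2).encode(),
    )
```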
So it's important, if you wanna build some tool to support that, that you don't get in the way of it and that you don't force people to learn, you know, more syntax and different paradigms. So, yeah,
[00:29:14] Unknown:
a lot of challenges in this space. And as far as the difficulties that people have in the general space of building and using knowledge graphs, I'm curious what you have found to be some of the common points of confusion or issues that they encounter, and some of the ways that you've tried to address that with Zincbase? Yeah. Great question. So scalability
[00:29:36] Unknown:
is the big problem here. So Facebook's library PyTorch BigGraph has some quite innovative methods around graph partitioning, where you can create a really big knowledge base and train it to produce these embeddings for each node or relation even though it has, you know, maybe hundreds of millions of nodes and a trillion edges. Who knows? Now I've successfully used Zincbase on a graph with a million nodes and 10 million edges. So it's a, you know, reasonably big graph, but that's not big compared to the social network or compared to a graph of, like, Netflix users and programs. And the thing you have to bear in mind is that a knowledge graph gets exponentially more useful the larger it is.
So scalability is a real challenge, and personally, I haven't yet found the need to move to, like, a distributed training schedule. I haven't built any graph that's big enough that it's needed to spread itself over multiple partitions. So these are not challenges that I have addressed in Zincbase yet, but I know that the scalability challenges have kind of been solved, so I'm comfortable with that right now. Another challenge is basically GPU utilization. So the linear algebra that you need to construct graph embeddings is not that complex. Right? So there are two things that we could do. Either we can make the math more complex, which probably we should, or we also need to get better at batch training so that we can fully take advantage of GPUs.
And I think one more problem that I'd like to mention as well is not really a technical one. It's a perceptual one. I think knowledge bases aren't sexy like self-driving cars. You know, they're not as well funded as high frequency trading, and they have some kind of, like, old-fashioned connotations. I mean, here I am. It's 2019. I'm talking about Prolog. Yeah. That already dates me somewhat. So I think I'd like this space to be more sexy. Let's put it that way. Well, I think it's starting to become that way as more large organizations
[00:32:03] Unknown:
wake up to the realities of how useful these are and start to contribute more to the research and building and utilization of this technology space. Yeah. Also, as you were saying before, some of the renaissance in knowledge graphs has come about because of the revival of natural language processing and some of these deep learning based approaches to being able to perform entity extraction and vectorization of text, as well as the broader availability of some of these textual datasets. And I know that recently there were some books and other publications that came into the public domain from, I think, the early 1900s, which will help provide a little bit more of a corpus for being able to try and build these more complex models, and the ubiquity of textual data that can be freely obtained from various Internet sources, which is what they fed into the GPT-2 model. So it's definitely an interesting time to be getting into this area, and I'm definitely interested to see some of the ways that they
[00:33:09] Unknown:
continue to be leveraged in other products and projects. Oh, yeah. And also, like, I do wanna give props to Wikipedia and Project Gutenberg. They do an amazing job, but don't get me started on how copyright runs for so long. We should have more text data available from more recent times. I'll leave it at that.
[00:33:32] Unknown:
For somebody who is interested in using Zincbase and wants to get ramped up and use it in their own projects, can you talk through an example workflow and the overall steps of being able to obtain the data, clean it, train a model, and then be able to use that for generating the embeddings for the knowledge graph, and then populate the knowledge graph and use it in
[00:33:53] Unknown:
production? Yeah. So this is really easy. So there are a couple of ways that you can do this. You can feed Zincbase a CSV file of triples that you know about, you know, just three columns, ask it to read the whole thing and build a graph, compute embeddings for that graph, and then query it in some kind of Prolog-like language. And you can also plot the graph that's being created. You can plot the embeddings and examine the natural clusters that tend to form, and you can ask it to make predictions. And, you know, this is all documented in the repo. It's like three or four simple Python commands. And the other thing, I think I said I tried to design things for the tinkerer or hacker. And you can also use Zincbase completely interactively.
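The shape of that CSV-to-graph workflow, sketched with pandas and NetworkX rather than the literal Zincbase commands (the repo's documentation has the actual three-or-four-line version), might look like this:

```python
# Read a small CSV of triples and build a graph from it. The CSV is inlined
# here so the example is self-contained.
import io

import networkx as nx
import pandas as pd

csv_of_triples = io.StringIO(
    "subject,predicate,object\n"
    "united_states,borders,mexico\n"
    "kenya,borders,uganda\n"
    "tanzania,borders,uganda\n"
)

df = pd.read_csv(csv_of_triples)
kg = nx.MultiDiGraph()
for row in df.itertuples(index=False):
    kg.add_edge(row.subject, row.object, predicate=row.predicate)

print(kg.number_of_nodes(), kg.number_of_edges())
# In Zincbase itself, the next steps would be to compute embeddings for this
# graph and query it in a Prolog-like language, e.g. how likely is
# borders(kenya, tanzania)?
```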
So you import it, you add facts manually one by one, just like I used to do in the eighties and early nineties on the BBC Micro, and then from there you can still compute the embeddings and query the graph. And then, of course, that's gonna be kind of time and resource intensive. But I want people to be excited about this and just pick up the library and play with it. And I think that's a good way to do that. And if you look at some of the other libraries in this space, you know, spaCy, we mentioned, is a very easy to use library. But if you look at something like, well, PyTorch BigGraph or the Allen Institute's AllenNLP library, they're really difficult to use. You already have to be a specialist.
And so I guess one of the things that I'd like to do is just kinda say, hey, everybody, you know, knowledge graphs are accessible. You can just pick this up, import the library, and go crazy in a couple of lines of Python. So that's kind of a design philosophy that I try to stick to. And, you know, in terms of applications for this, there is a lot that I'm really excited about. For me, I'll quite often take Zincbase as the base of something else I want to do. So it's a nice way to get triples into a graph format, you know, that's easy. And then from there you can get them into the vector space, right? And it's just a sequence of numbers. And then in a vector space, you can do, you know, pretty much any kind of machine learning task that you like. So I'll go off and prototype a new idea, you know, implement a new research paper, and not all of that makes it into the repo. You know, in AI, most experiments will fail. And I think, for other applications that I'm really excited about, there was a recent research paper, like in the last month or so, which got some media attention, where people were using, I think, universal language models to make predictions about materials science. I think they found that these models sort of capture some latent semantic relations in the materials science literature. And they were able to make predictions about, you know, I don't know, new polymers that should exist, things like that. And if a language model can do that, then a knowledge graph can do that much, much better. So I'm excited about that. I'm excited about the potential in, for example, the molecular biology and organic chemistry fields, where you can build a knowledge base of chemicals and proteins and interactions, you know, predict possible qualities of known substances or identify gaps in the knowledge base where it looks like there should be a substance or an interaction, but currently you don't know about one.
Yeah. And then also, more mundane, but I've spent time in enterprise, and large companies would go crazy for something where they can just spider over their unstructured data, their PDFs and emails and any kind of document on a corporate network, and build a structured knowledge base from it. Right? And then have that queryable in something like English. So, like, if a legal case comes up, for instance, you can say, oh, who was it that was likely to have been in a meeting with this person on a certain date in a certain office? Or a sales meeting happens, and you can ask the knowledge base whether you've likely got enough of a certain widget to meet the demand forecast for the next quarter, or what a machine's common failure modes are. Or, you know, even just, I don't know if you've experienced this in a large enterprise yourself, but, like, you wanna talk to somebody about a particular product line and you just don't know who to talk to, what person is in charge of that.
And that is the kind of question that a knowledge graph would be very, very easily able to answer. So sorry, I think I got a little bit off the subject there, but it's all documented in the repo, and it's super easy to get started. And if people wanna give it a try and give me some feedback, I would absolutely love that. Yeah. I appreciate
[00:39:26] Unknown:
the potential use cases and some of the pontification about ways and areas that Zincbase can be used, and knowledge graphs in general. Because for somebody who hasn't worked with one or isn't familiar with the space, it can be easy to say, okay, that's great, but it's really hard to actually think about different avenues where it could potentially be applied. And having feedback about some of the ways that you've imagined it being used can help inspire ideas of ways that other people can leverage it for their own use cases.
[00:40:00] Unknown:
Yeah. I think that's right. I mean, we use this library where I work at Primer now, and we serve, you know, big banks. We serve Walmart. We serve legal customers. And what these companies want is just some way to be able to organize and analyze the information that they already have and make predictions about it, explainable predictions. And one of the, I guess, failures of knowledge graphs is, well, it's so general. Like, it sounds like a great concept, but what are the specific applications of that? So I'm glad to kinda have this platform to be able to say, hey, look, there are real things that you can do with a knowledge graph. It's got real value. You know? Think about it. Just start building one and see what you can come up with. You can use Zincbase for that.
[00:40:52] Unknown:
And as far as your own experience of working on Zincbase and experimenting with the area of knowledge graphs, I'm wondering what you have found to be some of the most challenging or interesting or unexpected lessons that you've learned in the process.
[00:41:08] Unknown:
Okay. So maybe you or your listeners can help me with this, actually. So why is it so hard for a Python package to get its own version number? Right? So I'll explain. The way that models work in Zincbase is, if you try to instantiate a model, right, it looks at this CSV file, which is stored in S3, and it gets the model versions and the file names and locations of the model weights, you know, which are files also stored in S3. Now the CSV file containing all this metadata is versioned to match the Zincbase library version.
And so the version number has to be there in the code of the library somewhere. And me and everybody else who works on this library has to update that manually every time we do a new release. So I haven't found any standard way for a Python package to get its own version. Do you know what I mean? Yeah. One approach that I've seen is to actually have just a version.py in the repository
[00:42:15] Unknown:
that you use for incrementing the value. And then in your setup.py, just import that to populate the necessary field. I've seen a few different ways of doing it, and also maybe just in an __init__.py file. Or I know that there are also some libraries that exist specifically just for the purpose of incrementing your version when you do releases. So that would be great. It is a potentially complicated space. But, hey, isn't this Python? Like, there should be one obvious way of doing it. I haven't found that.
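One common version of that pattern, sketched here for illustration rather than as what Zincbase actually does, is to keep the number in a version.py and read it from setup.py:

```python
# Illustrative only, not necessarily how Zincbase handles it.
# zincbase/version.py is the single source of truth, containing only:
#     __version__ = "0.1.1"   # hypothetical number
#
# setup.py reads it without importing the package:
import os
import re

from setuptools import setup

here = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(here, "zincbase", "version.py")) as f:
    version = re.search(r'__version__\s*=\s*"(.+?)"', f.read()).group(1)

setup(name="zincbase", version=version)

# At runtime, library code can do `from zincbase.version import __version__`,
# or on Python 3.8+ use importlib.metadata.version("zincbase") for the
# installed copy.
```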
[00:42:43] Unknown:
I'll just mention a couple of other things that I found. Right? So, one thing, I just wanna give some props to doctest. Doc testing is amazing, you know, both for documentation and for testing. Love that so much. Another one is the CircleCI integration with GitHub. Very lovely, very easy to set up. And then the last point that I'd like to mention is that debugging machine learning is really, really hard. And, you know, with Zincbase, we integrate different types of machine learning or AI. So, like, for example, the Prolog engine that is contained in it is just recursive rule matching, basically. But one bug can be really hard to isolate.
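For anyone who has not used it, a doctest is just an example in a docstring that also runs as a test; for stochastic code you generally pin a seed (or assert statistical properties) to keep it deterministic. A minimal sketch:

```python
# The examples in the docstring double as tests when run through doctest.
import random

def jitter(x, seed=0):
    """Add small reproducible noise to x.

    >>> round(jitter(1.0), 3) == round(jitter(1.0), 3)
    True
    >>> jitter(1.0) != 1.0
    True
    """
    rng = random.Random(seed)  # seeded so the doctest output is stable
    return x + rng.uniform(-0.01, 0.01)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```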
So just some advice to people working in machine learning, and we've got a lot of, sort of, I guess, newer engineers joining our company now, and I'd just love to give this one piece of advice: just test everything as you go and make no assumptions about any
[00:43:58] Unknown:
of the code that you're trying to integrate. And looking forward in terms of your goals for the project and updates that you'd like to make. I'm curious what you have planned for the near to medium term. Yeah. Sure. Well, I mean, this is AI. It's probably the most exciting,
[00:44:17] Unknown:
industry in the world right now, and there are a few things that I'm really excited about. I want to incorporate agents into Zincbase, and specifically multiple agents that learn to communicate and learn to collaborate to complete tasks together. So you can imagine, you know, a couple of agents, and you set them loose on a graph, and they both start from different parts of the graph. And they find some way to communicate to each other, you know, in which direction they should be going to reason about the information in this graph. So for me, Zincbase is a little bit of a playground where I can get all of this advanced stuff in and synthesize multiple fields.
And the other one is, you know, there's been some recent research in the reinforcement learning field on how these kinds of intelligent agents can build world models. Right? And this is exactly my problem with AI. It's stupid. It's not able to build a model of the world and reason about that. And if you think about it, you know, a world model is just a kind of knowledge base where the nodes are items, you know, physical things in the real world, and the edges are interactions. Like, if I push this domino, you know, it's gonna fall and knock down the next domino. So I'm really excited about, you know, the potential for integrating learnings from reinforcement learning about creating these, I guess it's kind of like a metacognition.
So agents with metacognition, and if they have a knowledge graph that they can query about the real world and other trivia, I feel like that's gonna be extremely powerful. But, anyway, concretely, you know, I see the future of Zincbase as basically a tool for extracting structured data from unstructured text, building it into graphs, and using those graphs to make probabilistic but explainable inferences about unknown information. And, one final thing, really: if anybody would like to collaborate with me on this or just, you know, exchange emails and talk about it, I would be extremely happy with that. And also my company, Primer, is recruiting. So if you wanna work on this full time, do get in touch as well. Yeah. And I'll definitely
[00:46:45] Unknown:
have you add the best way to get in touch with you into the show notes. And before we close out the show, is there anything else as far as the areas of knowledge graphs and Zincbase that we didn't discuss yet that you'd like to cover before we close out? No. I think,
[00:47:01] Unknown:
from my side, anyway, I think that was a pretty broad and deep enough
[00:47:09] Unknown:
discussion of the considerations involved. That was, you know, a lot of fun chatting. Alright. So with that, I'll move us into the picks. And this week, I'm going to choose a recipe that I tried out recently for some banana blueberry oat bars. They ended up being quite delicious, so, definitely something worth checking out for a quick and easy breakfast. And so with that, I'll pass it to you, Tom. Do you have any picks this week?
[00:47:31] Unknown:
Okay. I'll share a recipe I recently tried as well, which is pickled habaneros. Just take some vinegar, a little bit of sugar. This is a recipe I found on a trip to Thailand that I just had, but gave it a bit of Mexican influence. Pickled habaneros, a bit of vinegar, a bit of sugar. Leave it in the fridge for a week or so, and you get this delicious
[00:47:53] Unknown:
sauce to go on rice. Excellent. I'll have to give that a try as well. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Zincbase and share your perspective on knowledge graphs in general. It's definitely a very interesting area, and it's great to be able to talk to somebody who is so deeply involved in it. So thank you for all of that, and I hope you enjoy the rest of your day. Okay. Thank you so much. That was a lot of fun.
Introduction and Sponsor Messages
Interview with Tom Grek: Introduction and Background
Understanding Knowledge Graphs
Tom's Journey into Knowledge Graphs
Challenges and Innovations in Knowledge Graphs
Building Zincbase: Motivation and Features
Scalability and Practical Applications of Knowledge Graphs
Using Zincbase: Workflow and Examples
Challenges and Lessons Learned
Future Plans for Zincbase
Closing Remarks and Picks