Summary
Being able to understand the context of a piece of text is generally thought to be the domain of human intelligence. However, topic modeling and semantic analysis can be used to allow a computer to determine whether different messages and articles are about the same thing. This week we spoke with Radim Řehůřek about his work on GenSim, which is a Python library for performing unsupervised analysis of unstructured text and applying machine learning models to the problem of natural language understanding.
Brief Introduction
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com
- Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project
- We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry’s real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
- Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
- Your hosts as usual are Tobias Macey and Chris Patti
- Today we’re interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language.
Interview with Radim Řehůřek
- Introductions
- How did you get introduced to Python? – Chris
- Can you start by giving us an explanation of topic modeling and semantic analysis? – Tobias
- What is Gensim and what inspired you to create it? – Tobias
- What facilities does Gensim provide to simplify the work of this kind of language analysis? – Tobias
- Can you describe the features that set it apart from other projects such as the NLTK or Spacy? – Tobias
- What are some of the practical applications that Gensim can be used for? – Tobias
- One of the features that stuck out to me is the fact that Gensim can process corpora on disk that would be too large to fit into memory. Can you explain some of the algorithmic work that was necessary to allow for this streaming process to be possible? – Tobias
- Given that it can handle streams of data, could it also be used in the context of something like Spark? – Tobias
- Gensim also supports unsupervised model building. What kinds of limitations does this have and when would you need a human in the loop? – Tobias
- Once a model has been trained, how does it get saved and reloaded for subsequent use? – Tobias
- What are some of the more unorthodox or interesting uses people have put Gensim to that you’ve heard about? – Chris
- In addition to your work on Gensim, and partly due to its popularity, you have started a consultancy for customers who are interested in improving their data analysis capabilities. How does that feed back into Gensim? – Tobias
- Are there any improvements in Gensim or other libraries that you have made available as a result of issues that have come up during client engagements? – Tobias
- Is it difficult to find contributors to Gensim because of its advanced nature? – Tobias
- Are there any resources you’d like to recommend our listeners explore to get a more in depth understanding of topic modeling and related techniques? – Chris
Keep In Touch
Picks
- Tobias
- Dark Matter and the Dinosaurs by Lisa Randall
- Chris
- Radim
Links
- Nadia Eghbal
- Gensim
- sqlitedict
- NLTK
- Spacy
- Latent Dirichlet Allocation (LDA)
- LSI
- Keynote in Italy on distributed processing
- Google Scholar references for Gensim
- Stylometric analysis
- On Writing Well
- Student Incubator
- Wikipedia on topic modeling
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project. We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry's real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
You can also visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch. And to help other people find the show, you can leave a review on iTunes or Google Play Music and tell your friends and coworkers. We also have a discourse community at discourse.pythonpodcast.com, where you can find out about upcoming guests, suggest questions, and propose show ideas. Your hosts, as usual, are Tobias Macey and Chris Patti. And today, we're interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language. So, Radim, could you please introduce yourself?
[00:01:19] Unknown:
Oh, yeah. Sure. So my name is Radim. Thanks for inviting me. I'm known in the Python community for these open source libraries that we have, like Gensim, one of them, smart_open, another one, and sqlitedict, and so on. The introduction from the personal perspective is I am very much grounded in research in the academic community. I got my PhD in 2010 or '11, was it, so a couple of years back now. But I always liked building practical stuff. And actually, Gensim and all the other stuff that I do is very much grounded in that. So I like to build things that work. So I left academia eventually.
It didn't fit my personality. I worked for a little bit for a search engine company and then moved on to freelancing and building my own company, which is really focused on machine learning. So that's what we do. That's what I like doing. I like this cognitive side of things and, also connected to that, computers and languages and all things connected to thinking. History is another one of my hobbies. I could talk for a long time about my hobbies. But if you're interested in Gensim, the background is really academic. It was part of my PhD thesis, building these scalable algorithms.
And there was a practical side of it. I wrote some code that actually realized these algorithms, and I released it in open source back in 2010, I think. And it just continued from there. So Gensim has actually been around for a while now, more than 6 years. And by the way, I should also mention a little plug. We're using Linode for our own servers, so good choice to stay with the sponsors. We're happy with them. I was never sure whether to pronounce it "Linode" or "Li-node", so this is a really useful podcast for me.
[00:03:01] Unknown:
Yeah. I was pronouncing it incorrectly for a little while, and they actually sent me a message to let me know that it is pronounced Linode. It's a combination of Linux and Node. Okay. Yeah. I remember a lot of people used to pronounce Linux as "Lie-nux" too. So that probably wouldn't help a lot of people.
[00:03:17] Unknown:
Alright. Useful.
[00:03:19] Unknown:
So how did you get introduced to Python?
[00:03:22] Unknown:
Python. Sure. I was actually kind of a late adopter. When I started, it was these 8-bit computers, you know, C64, that type of stuff, assembly. And I liked C and these fast languages. But when I started working for this web search company that I mentioned, it was around 2006, 2007 maybe, we started using Python. I think it was Python 2.3, 2.4 back then. So already pretty advanced, a late adopter. But, yeah, that's how I started. And at first, it was just for these little things. I didn't have much trust in it, you know. Python, this sort of non-compiled language, natural suspicion there for somebody coming from my background, but I really grew to love it. And now most of the work that we do is in Python. So Gensim is in Python, obviously, but also the commercial work. Most of our projects are in the Python ecosystem.
[00:04:17] Unknown:
So can you start by giving us an explanation of what topic modeling and semantic analysis are?
[00:04:23] Unknown:
Sure. So they're a little different. So let me start with topic modeling. The motivation here is you have a bunch of documents and you wanna gain some insight into them. So you don't wanna read all of them, obviously. And you can plug in some sort of keyword analysis, like a keyword search, the type that, I don't know, Lucene would do, you know, Solr or Elasticsearch. But that's still on the level of individual words. So people have always wanted to have a sort of high-level analytical perspective of their collections. So these documents are about legal, and these are about HR. That sort of high-level bird's-eye view.
And it turns out that it's possible to do it automatically to a large degree, and that's what topic modeling is about. It's about extracting themes, topics, from a collection of unstructured text documents. I mean, there are many ways to do it, and the particular approach that this branch of research implements, which Gensim also uses, is statistical analysis. It's based on the distributional hypothesis, a specific concept in NLP where you say that words are known by the company they keep. So just by looking at the context of words, you can kind of figure out what the similar words would be or what the patterns are in the language use. So based on that, you extract these patterns and you say, okay, maybe cat and dog and horse, these are all kinda related. So they should be part of a common topic, perhaps animals. But rather than doing this by hand, you do this with automated algorithms.
And the output is not actually that the topic is cat, dog, horse, like these words form a topic, but each word has a sort of weight towards the topic. So you can say cat is 90% animal, but maybe 10% pet or something else. So it's a sort of soft assignment towards the topics. And then once you have these topics, you can also take a document as input. So you have a company document describing whatever it is, some customer complaint, let's say. And then you say, okay, so which topics does this document belong to? And, again, it's not just that this document is about animals, but you can say this document is 10% about, I don't know, our product X and 50% about legal and 30% something else. So, again, this concept of soft assignment is pretty central to modern topic modeling. So that's topic modeling, and your second part was semantic analysis, right?
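To make the soft-assignment idea concrete, here is a toy sketch in plain Python. The topic weights are invented for illustration, and this is not Gensim's actual API or model, just the shape of the idea: each word carries weights toward topics, and a document's topic mixture aggregates its words' weights.

```python
def document_topics(words, word_topic_weights):
    """Sum per-word topic weights and normalize into a distribution."""
    totals = {}
    for word in words:
        for topic, weight in word_topic_weights.get(word, {}).items():
            totals[topic] = totals.get(topic, 0.0) + weight
    norm = sum(totals.values())
    return {topic: weight / norm for topic, weight in totals.items()}

# Made-up soft assignments: each word belongs partly to several topics.
word_topic_weights = {
    "cat":   {"animals": 0.9, "pets": 0.1},
    "dog":   {"animals": 0.8, "pets": 0.2},
    "horse": {"animals": 1.0},
}

mixture = document_topics(["cat", "dog", "horse"], word_topic_weights)
print(mixture)  # the document is mostly "animals", a little "pets"
```

A real topic model learns those per-word weights from the corpus statistics instead of having them written down by hand.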
So that's a little broader topic. So without getting too technical, it's become a bit of a buzzword, actually. Everything is semantic these days. So one answer is it doesn't mean much anymore. But in the context of NLP and machine learning, it's, again, this sort of trying to abstract away from the surface form of how people express their ideas using particular words, and moving a little higher towards what the actual concepts are, with the idea of being able to answer natural language queries. So give me all documents about animals, and you don't care if they mention dogs or cats or something else. That's the topic modeling part.
[00:07:37] Unknown:
And does sentiment analysis fit into that at all? Or is that just a completely different branch of the NLP space?
[00:07:44] Unknown:
It does fit in some ways. It's about abstracting higher-level patterns from text. So in that sense, it fits into semantic analysis. You wanna say something about the text as a whole. And in sentiment analysis, you try to derive whether the thing that is being talked about is being talked about in a positive or negative way. There was a big boom with semantic analysis. Sorry. With sentiment analysis, a few years back. I think it's kinda gone now or on its way out again. These things come in waves. So that was a big deal for the marketing industry. Yeah.
[00:08:19] Unknown:
And you mentioned a little bit about how Gensim got started. But can you explain a bit more about what it is and, some more of the backstory?
[00:08:28] Unknown:
Oh, yeah. Sure. So Gensim is "generate similar". I'm not very creative with the way I produce names. So the name of our company is RaRe Technologies because, like, I didn't know what name to put there. So the "Ra" is for Radim, the first two letters, and the "Re" is the first two letters of my family name. So there you go. So "generate similar". And the way it came about is, again, it was very practical. So I was working on a project, a commercial project, for a library. And this library was meant to give some insight into collections of mathematical articles. So this was a library that specialized in handling articles, published peer-reviewed articles in mathematics, and other supplementary materials.
And they wanted this high-level overview and also finding similar documents which are about similar topics. So maybe not necessarily using the same words, but on a similar theme. And back then, this topic modeling was really big, and I looked into how to do it. And my intention wasn't to implement things from scratch. That's always the last option. It's always a lot of work to do this stuff. But it just turned out that there were no good implementations for doing this. So if I wanted to try this, I basically had to implement my own. I mean, there are many libraries that do the low-level stuff. For example, latent semantic analysis, which is one technique used for topic modeling: there was an implementation of singular value decomposition in some Fortran libraries, which were efficient. But it was such a pain. You know, working with these libraries, you had to recompile the Fortran code, and then there were some special limits. So if you wanted a bigger matrix, you had to change some constants and recompile it again. And just plugging into that from any other program was, let's say, non-trivial.
But they were very efficient. And on the other hand, there were a lot of implementations which were very academic, stuff that was, you know, written in really pure Python or things like that, or Java, and they were super slow. So that was not useful either. So that was the pain point that made me implement this myself. It wasn't Gensim to start with. I just did a few scripts to compute the singular value decomposition. But later, I found it's really useful across many different tasks and industries. And I figured I would just release it as open source, see if other people like it, and they did. And then I added more algorithms. So for topic modeling, there's also latent Dirichlet allocation, and now word2vec for unsupervised analysis and so on. So a lot of different stuff, and distributed versions. So it kinda grew from there, but the initial push came from this digital library project. This was around, by the way, I think, 2009. Yeah, around there, 2008, 2009.
[00:11:15] Unknown:
And what are some of the facilities that Gensim provides to simplify the work of that kind of language analysis?
[00:11:22] Unknown:
Well, it implements these low-level routines for doing it, basically. That's it. And I think what really makes Gensim stand apart is this practical focus. So I'm a practical person, and I think the software reflects that. So there's a lot of libraries that do singular value decomposition, but none of them really do it as well and as easy to use as Gensim. And by that, I mean it accepts large streams of documents, because that's one issue I had with many libraries: everything had to be loaded into memory, you know, and then you did some decomposition on it or some processing. But with large collections, this is not really an option. You can't just load everything into memory. It's just too much data. And sometimes you don't even have the data to start with. You need to update the model as new data comes in. So this kind of stream processing was there from the start. Really important. And what does it bring?
Well, I'm hoping it brings a lot. I mean, we have a community of users for Gensim. We have a mailing list, and then we have Twitter and so on, where people discuss various things and the projects where they use it, and it's hopefully useful. Our focus on efficiency, I think, pays off. And the fact that we also try to implement some of the more current techniques, let's say, is also appreciated.
[00:12:39] Unknown:
Yeah. The unsupervised aspect of Gensim is one of the things that initially made me interested in researching it and having you come on the show to talk about it, because it definitely seems like a lot of the other tools in the space are much more human-in-the-loop. Whereas being able to feed Gensim a corpus of text and have it generate the models without necessarily having to do any manual tuning or manual setup for it was pretty appealing. And it definitely seems like it would simplify its use, particularly once it's being operationalized.
So I'm wondering if you can dig into a little bit what sets Gensim apart from some of the other projects that are similar to it, such as natural language toolkit or more recently spaCy?
[00:13:21] Unknown:
Well, it's a completely different space. So NLTK and spaCy, they're focused on the NLP part. So you have text, and you wanna assign, I don't know, part of speech, or tokenize stuff, or extract entities and so on. So Gensim actually starts where these tools end. If you wanna train a word2vec model, it already expects you have sentences, where a sentence is a list of tokens. So some other tool already needs to have tokenized your input or kinda prepared it. So they're not in the same space. So that's the setting apart. I think what really sets it apart is the ease of use. That's what we try to do with Gensim. So the interfaces are really simple.
We try to put in sane defaults. You know, a lot of the times, these academic algorithms have a lot of stuff to tune, a lot of internal parameters and little tweaks and knobs you can turn. So we tried to put some sane defaults in there. And also support, which is kind of a big deal coming from the academic community. Well, you probably know how it works. You get money for some project, and then, you know, the project ends and everything is kinda abandoned. You hope you get some publications out of it, because that's your KPI. That's your metric of success. But the software languishes, and the quality of the software reflects that. So with Gensim, you know, it's been going on for, what, now 7, 8 years.
And we wrote thousands and thousands of replies on the mailing list, and all sorts of stuff, workshops, to help people use it and so on. So I think that's another big part, not the technical or academic side of the algorithms, but also the fact that we're trying to make it useful and make sure people get some sort of feedback for what they're trying to do.
[00:15:08] Unknown:
Yeah. Particularly for tools in spaces as complex as natural language processing, having some measure of support behind the tool, I imagine, is an immeasurable boon to people who are trying to work with Gensim. And if I heard you correctly, it sounds like for generating the corpus for Gensim, it might need to be preprocessed using something like NLTK or spaCy, or can you just feed it raw text?
[00:15:34] Unknown:
Yeah. Yeah. Some preprocessing. I mean, we have some very simple functions for preprocessing. So if you just wanna test something, we just tokenize using regular expressions. But if you have a specific application in mind, let's say you're a company and you wanna build some machine learning product using NLP, then you definitely should look into more advanced ways of preprocessing. And by the way, a little plug for spaCy here, the good people at spaCy who we're in touch with, they actually did that with word2vec. So they did some specific type of preprocessing, I think recognizing entities, something like that, or noun phrases, I don't remember. And then running Gensim on top of that to build the word2vec model out of these phrases. And they have a little demo on their website, which is interesting. So you can enter not just a word, but a phrase, and it gives you similar phrases as the output. So that sort of symbiosis is there.
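As a rough illustration of the quick regex-based tokenization mentioned above, here is a sketch in the same spirit. It is not Gensim's actual preprocessing function, just a minimal stand-in:

```python
import re

def simple_tokenize(text, min_len=2):
    """Lowercase the text, pull out alphabetic runs, and drop very short tokens."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if len(tok) >= min_len]

# A "corpus" for something like word2vec is then just a sequence of token lists:
sentences = [simple_tokenize(line) for line in ["The cat sat.", "A dog barked!"]]
print(sentences)  # [['the', 'cat', 'sat'], ['dog', 'barked']]
```

For a real product you would swap this for a proper tokenizer with entity or noun-phrase recognition, as in the spaCy example he describes.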
[00:16:27] Unknown:
Yeah. So what are some of the practical applications that Gensim can be used for?
[00:16:33] Unknown:
Practical. So okay. So there's a lot of academic, let's say, citations, where people try it for a lot of different things. I used to follow this in the beginning. You know? It was exciting the first few times Gensim was used, and I didn't know the people or what they were trying to do. But now there are too many. It's more than 300 academic citations. So that's one side of it. But if you say practical and you mean commercial, people use it a lot in the e-discovery space. And there's a little overlap with what we're doing as a consulting company. So what I'm saying now comes partly from Gensim and partly from what we build for our clients. But in the legal space, they have these e-discovery systems where they need to analyze a specific type of document. They have a specific process to it. And they actually use latent semantic analysis a lot there. So having a sane, scalable, and fast implementation that is easy to use was a big deal there. There's a lot of use in the publishing industry. We had quite a few clients there. So they have a lot of data, like articles and large collections, a company like Hearst, which is a US company. And they just need to get some insight into what's in there so they can label it, they can search through it in an intelligent way. Basically, anywhere where you need similarity.
Okay? Gensim: generate similar. So the "sim" part is similarities. And you can see that's really useful pretty much across the board. Patent search, a lot of people use it for that, some patent agencies. Customer surveys: at Autodesk, we built a system where they analyze requests which are coming from the clients, to see what they're about and where they should be routed to. Sports Authority used it for a similar purpose, to analyze e-commerce data. Pretty much anywhere where you need similarity on a semantic level.
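The "sim" in Gensim is easiest to see with a tiny similarity function. The sketch below computes cosine similarity over raw bag-of-words counts; the point of topic models like LSA is to compute this same kind of similarity in topic space instead, so that documents can match without sharing any words. This is illustrative code, not Gensim's API:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two token lists, using bag-of-words counts."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

score = cosine_similarity(["patent", "search", "engine"], ["patent", "search", "tool"])
print(round(score, 3))  # 0.667
```

In topic space, "engine" and "tool" would contribute to the score too, instead of counting as a pure mismatch.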
[00:18:23] Unknown:
And so I imagine that having this built into something like a company knowledge base would also be useful for being able to surface information that isn't necessarily readily apparent, but might be useful to employees at different stages in their careers?
[00:18:38] Unknown:
Yeah. Yeah. That's a great point. Yeah. This type of analysis. And also a lot of projects we do are actually connected to the HR industry. So people need to extract some information, let's say, from CVs and, on the other hand, from job adverts, from job openings, and match the two. And, again, not on a level of words or just, let's say, skills, Java here, Java there, that's a match. But extracting a little more context and semantics out of it so you can really find the right candidates.
[00:19:07] Unknown:
Right. So you can say, okay, this person has experience in this particular industry using a certain set of tools that might match the requirements of the job description, rather than just doing keyword bingo like some recruiters are wanting to do? I wouldn't call it bingo. You know, each industry, it's easy to say from the outside that it doesn't make sense, but each industry has its,
[00:19:27] Unknown:
let's say, restrictions and even its own vocabulary. I already mentioned the legal space, where they don't call it classification, they call it prediction, very specific terminology. And in HR, it's the same thing. So they really focus on the keywords, but that's because that's how they're evaluated. They have a high need for high-precision results. They don't care as much about recall. Usually, there's a lot of applications, let's say, to choose from, but they really care that the ones they do offer, that they do extract, are really the ones they need. So precision, much more important than recall. By the way, are you familiar? I'm talking to you like you know what precision and recall is. But are you familiar with these, let's say, information retrieval terms?
If you could just explain them for the benefit of our listeners, that'd be great. Sure. So recall is basically how complete your results are. If you miss a lot of stuff, you will have low recall, so higher is better. And precision is, of what you actually do return, how much of that is correct. So that's like the correctness of what you return. So completeness versus correctness, and it's often a trade-off. So you can't be, like, 100% accurate and also cover 100% of what needs to be covered. Often there's a trade-off, so you maybe return more things so you have a higher recall, but some of the things you return will be incorrect, so you will have a lower precision. That's the trade-off there. And for the HR industry, the trade-off very much favors precision.
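These definitions translate directly into a few lines of Python. The data here is made up purely to show the trade-off he describes:

```python
def precision_recall(returned, relevant):
    """Precision: fraction of returned items that are relevant.
    Recall: fraction of relevant items that were actually returned."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Return few items, but all correct: high precision, low recall,
# the trade-off the HR industry favors.
p, r = precision_recall(returned=["a", "b"], relevant=["a", "b", "c", "d"])
print(p, r)  # 1.0 0.5
```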
[00:20:54] Unknown:
And you mentioned earlier that one of the features of Gensim is its ability to stream data from disk that would potentially be too large to fit into memory. I'm wondering if you can explain some of the algorithmic work that was necessary to allow for that streaming process to be possible. And I know that when reading through the docs, it mentioned that it can actually update the model in place, incrementally, based on the additional data that gets fed into it.
[00:21:18] Unknown:
Yeah. Yeah. Good question. This was a big part of Gensim. And actually, it's not just from disk. Just to add to what you said: the streaming can happen from anywhere. That's part of the beauty of it. So we wrote it in a way that we don't expect files on disk. We just expect a stream of, let's say, documents or some sort of examples or events coming in. So we also accept data being streamed from S3, from Amazon Web Services. So that's over the network. Or you can even generate the data on the fly. So it's not stored anywhere. It's just generated as you need it to be generated, and you send it to some model. And then the model does something with that. Yeah. But to answer your question, how did we do it? So back then, Python introduced this concept of generators. So it was already supported on the syntactic level in Python. So that was actually a big part of it, because it was really easy to express iteration over stuff in Python. It was conceptually simple. It was clean syntax, easy to explain, easy to implement. So that was a big part of how the Gensim interface was designed.
And on the other side of that, on the, let's say, theoretical side, that was my PhD thesis. So how? If you want the short answer, by being careful about how you access the data. So obviously, if something is streamed, you can't just shuffle it or just jump to the middle and then jump back to the beginning. You really have to process it one thing at a time, because it's not just that it's streamed, but it's also one pass. That means once you process a document, let's say, you just forget it and it's gone. You can't go back to it. So a lot of the algorithms in Gensim are single pass, not just streamed, but also single pass. And that's also very useful in many applications where there's too much data to be stored. So you just have to process it on the fly, and then you forget it. You throw it away. You update the model. And on the academic side, how is it done? Well, with specific algorithms that can actually work in this streamed manner and that don't need multiple passes over the data. Everything they do, they construct all the necessary structures as they go along and update those structures on the fly.
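The generator-based streaming he describes can be sketched like this. It is illustrative, not Gensim's actual interface; note that an open file handle is itself such a lazy line-by-line iterable, which is why the same pattern works for data on disk, on S3, or generated on the fly:

```python
def stream_corpus(lines):
    """Yield one tokenized document at a time: only the current document is
    ever in memory, and a plain generator like this is naturally single pass."""
    for line in lines:
        yield line.lower().split()

# The consumer sees one document, updates its state, and moves on.
# Once a document is consumed, the generator cannot go back to it.
docs = stream_corpus(["The cat sat", "A dog barked"])
print(next(docs))  # ['the', 'cat', 'sat']
print(next(docs))  # ['a', 'dog', 'barked']
```

Passing `open("corpus.txt")` instead of the in-memory list would stream a file of any size with the same constant memory footprint.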
[00:23:31] Unknown:
And does this impose any kinds of limitations on
[00:23:34] Unknown:
what kinds of applications it can be used for, or maybe some of the accuracy that you can expect from Gensim when you're using it in this streaming fashion? Oh, yeah. For sure. Especially with small datasets. So if you have small datasets, you know, and everything's on disk, then you don't really care about streaming or one pass. So it's just optional, this one-pass thing. In algorithms like LDA, let's say, which is one of the topic modeling algorithms, and the same for LSI, you can say, okay, here's the dataset, and I only want you to run once over it and give me the model. And you can do that. But you can also say passes equals 5, and then it will do 5 passes over the entire dataset, which helps it converge maybe better or do something extra. This is not an issue with large datasets, where usually you do the other thing: you subsample. So the normal way to handle a lot of data is you just skip some data points, not go over them multiple times. But for small datasets, the multiple passes can help, definitely.
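The subsampling he mentions for large datasets is equally easy to express as a one-pass operation over a stream. This is a sketch; the keep probability and seed are arbitrary choices for illustration:

```python
import random

def subsample(stream, keep_prob, seed=42):
    """One-pass subsampling: keep each item with probability keep_prob and
    skip the rest, without ever needing to know the stream's total length."""
    rng = random.Random(seed)
    for item in stream:
        if rng.random() < keep_prob:
            yield item

kept = list(subsample(range(1000), keep_prob=0.1))
print(len(kept))  # roughly 100 of the 1000 items survive
```

Because the decision is made item by item, this composes directly with a streamed corpus: the skipped documents are never materialized at all.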
[00:24:28] Unknown:
Can Gensim potentially be used in the context of some of the modern streaming frameworks like Spark or Storm?
[00:24:34] Unknown:
Sure. That's the short answer. It can. I mean, Spark specifically is a batch system, or was conceived as a replacement for the Hadoop world, so that's also a batch processing system. They actually added Spark Streaming later on, kind of bolted it on. It's in there now. And I think some people actually already did what you're suggesting. I think there was an implementation of LDA on top of Spark. So they used Gensim, the Python processing, to do these updates, and then used Spark as the framework to run it on. Yeah. Spark really handles the streaming by batching. They kinda took the opposite direction to most other streaming engines. But it's possible. And I think some people tried the same thing with word2vec. The trouble is there's a lot of overhead in these systems. And if you're after performance, it often happens that if you spread things out to many machines and use some sort of distributed framework, you lose something in terms of efficiency. And the break-even point, I think I did a keynote in Italy last year, where we actually evaluated this: how many computers do you need before it starts making sense to distribute like this? And the result was something like 8. So you really need a non-trivial number of computers before distributed computing starts paying off, or a non-trivial amount of data, let's say. If you really have just a few, you know, hundred-gigabyte dataset, it's not really worth doing this. You can just compute on a single machine and it's gonna be more efficient. The efficiency you win that way just makes up for what you would gain by distributing it.
[00:26:11] Unknown:
So given that Gensim supports unsupervised model building, what kinds of limitations does that impose, and in what situations do you need a human in the loop to improve accuracy or the overall capability of the model?
[00:26:25] Unknown:
Yeah. Unsupervised learning is a specific field. It's more about gaining insight into your data, so it leads to different kinds of applications, I would say. It's not that you choose supervised or unsupervised depending on which one is easier; it depends more on what you're trying to do. With unsupervised learning, the goal is always to cluster something and see how things fit together, so you do that when you don't know much about your data. But if you have a specific task, like you want to classify something, or you know very clearly where you want to be going, then it makes more sense to use supervised models.
[00:27:02] Unknown:
And once a model has been trained and generated,
[00:27:06] Unknown:
how does it get saved and reloaded for subsequent use? So Gensim is a Python library, and it's actually interesting: it started by just using pickle, which is a super simple serialization protocol built into Python. But because people were using it in many different applications, some of them on large data and some of them in production environments, you get to see a lot of these engineering requests for stuff being more efficient. A lot of the time you train a model and then you just use it; it's a very common pattern in all of machine learning. So the model is kind of read-only after you're done training it. You deploy it to production and it does some classifications. So what we actually started doing is we automatically look inside the object that you want to store. You train a model, it's a Python object, and you store that to disk. But inside that object, we look for large structures, such as large NumPy arrays, and we store those separately, as binary files outside of the pickle.
And the reason for that is once we load the model, we can actually say: instead of loading that into RAM, just load it as a memory-mapped file. We just allocate the virtual address space and map it directly to the file on disk in read-only mode, and then multiple processes can share the same memory. So for example, if you have a word2vec model that you want to use on a server to answer queries, you can actually load it, say, 20 times in 20 different processes, on different servers or however you use it, and say: use memory mapping for this model. And it will just share this memory across all the processes, which saves you 19 copies of the word2vec model, which can be large. So you can easily do it on a single server, and it just improves the efficiency. Also related to that: you asked me how Gensim started and why Python. When I started with Gensim, there was this new library called NumPy.
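The memory-mapping trick is easy to demonstrate with NumPy alone. This is a minimal sketch of the underlying mechanism, not Gensim's actual save/load code (Gensim exposes it through an `mmap` argument on `load`); the file name below is made up for the example:

```python
import os
import tempfile
import numpy as np

# Save a "large" array as a binary .npy file on disk.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.arange(12, dtype=np.float32).reshape(3, 4))

# Each server process would do this: open the file as a read-only
# memory map. The OS maps the file into virtual address space, and the
# physical pages are shared across all processes via the page cache,
# so N processes pay for roughly one copy of the data.
shared = np.load(path, mmap_mode="r")
print(shared.shape)         # (3, 4)
print(float(shared[1, 2]))  # 6.0
```

Because the map is read-only, this is safe for the train-once, serve-many pattern described above; any attempt to write to `shared` raises an error instead of silently diverging between processes.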
Maybe you heard of it. It had just come out recently, the year before or something like that. But I took a chance on it because it was really easy to use and performant. I think that was the tipping point for Gensim not being written in C++ or C, which I probably would have gone for otherwise. But NumPy was around, and it had all the things I was looking for, because it was really plugging into these low-level libraries. So it was efficient; it was calling the Fortran code and the C code that's really efficient. But at the same time, it has really simple interfaces for interacting with these arrays from Python. So that was a big part of it, and it was a good choice. Same with Python. I'm happy I chose Python and NumPy for Gensim. It just turns out Python is a good environment. I didn't know that back then; it was a lucky choice, let's say. I'd been using Python for what, 2 or 3 years at that point, so I was a novice, even if I didn't think so at the time. But both NumPy and Python turned out to be good choices. NumPy is really, really cool, and it lets us do many things almost for free inside Gensim and our other libraries.
[00:30:21] Unknown:
And I imagine too that hitching your cart to the Python horse possibly helped increase the popularity of Gensim, given how popular Python has become in the data science and machine learning communities.
[00:30:36] Unknown:
Python became popular because there are a lot of projects in it. It's all the libraries like scikit-learn and NumPy which make it popular. There's this quote that each successful open source project is built on the ashes of somebody's academic career, which I think is fairly accurate. A lot of man-hours went into these libraries, and a lot of smart people work on them and do amazing stuff. So that's part of it. And this pragmatic approach is what really keeps me in the Python community. There are a lot of things I maybe don't like so much, but what keeps me there are the people, who are really knowledgeable. It's not a crowd that tries this and that and then moves on, but really people who have seen it all, right back from the sixties, from the Fortran libraries, and they know why things are the way they are, and that not everything is easy to change on the spot.
You can call it baggage from some point of view. There's a lot of baggage and compatibility to be careful about, but it also gives you a sense of stability. Performance still matters, and building practical things matters. So that's my take on it.
[00:31:43] Unknown:
What are some of the more unorthodox or interesting
[00:31:46] Unknown:
uses people have put Gensim to that you've heard about? Well, in the beginning, like I said, I followed the academic literature. The best place to look for novelties is publications, because in academia it's publish or perish, as I'm sure you know, so people always have to look for novel approaches and novel applications. So I used to do that in the beginning; I used to look at how people use Gensim. There was a lot of stuff. There's actually a list of these publications on Google Scholar. One interesting thing is detecting extremist videos on YouTube.
So, mining YouTube to discover extremist videos, users, and hidden communities. I thought that was interesting: how did they apply it to that? Or using computational linguistics to understand near-death experiences; you know, that's a headline that catches your attention. But basically, underneath, it's always the same: there's some text and you need to find some similarity or cluster something. It's a pretty general-purpose tool, so it's used across many industries. Assessing software bugs, comparative analysis of software architecture, so it's actually applied to analyzing software as well. In the medical industry, quite a few: assessing risk in medicine, comparing quality. "Less bang for the buck, a natural experiment in NIH funding," I remember that one. They were comparing the quality and quantity of medical studies when they're subsidized or not subsidized, so checking whether getting money from the government improves the quality of the research. And if I remember correctly (I hope I'm not misquoting here), they found no correlation, which was interesting.
So so basically, all these type of things people try Gensim and other statistical, tools for.
[00:33:28] Unknown:
That is really interesting. In particular, the one about finding extremist videos on YouTube I find really interesting. I was wondering about people in the intelligence community being able to use tools like Gensim to do analysis on various corpora of text. So it is kind of in the vein of what I thought it might be, in the sense that you can use it to identify: does this sample or set of samples match heuristics that we tend to think identify it as a certain kind, or as being from a certain author. That's very interesting.
[00:34:07] Unknown:
We could talk about that for a long time. One of our projects was for a government agency, which shall remain nameless, but it was exactly this authorship detection or identification. So given some text documents which are anonymous, you don't know who wrote them, but you have to say which ones were written by which people, so maybe group them together. Or, even more interestingly, when a person tries to cover their identity, they still use some pattern words; there's still stuff that they're not aware of. And you can identify them based on this fingerprint, which is subconscious. So this is the field of stylometric analysis. It's not directly connected to topic modeling, but it's in the same area: NLP and extracting patterns from usage of language.
Yeah. So that's another big field we could talk about for a long time; we have experience with it as well.
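As a toy illustration of the "subconscious fingerprint" idea: stylometric systems often compare relative frequencies of common function words, which are hard to fake deliberately. The word list, texts, and distance measure below are made up purely for illustration; real stylometry uses far richer features and proper statistical models.

```python
# Function words ("the", "of", ...) carry little topical meaning, which
# is exactly why their usage rates are a stable, hard-to-fake signal.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    total = max(len(tokens), 1)
    return [tokens.count(w) / total for w in FUNCTION_WORDS]

def distance(fp_a, fp_b):
    """L1 distance between two frequency profiles (smaller = closer)."""
    return sum(abs(a - b) for a, b in zip(fp_a, fp_b))

known = fingerprint("the cat sat in the hat and the cat ran to the mat")
anon = fingerprint("the dog sat in the log and the dog ran to the bog")
other = fingerprint("quantum computing promises exponential speedups someday")

# The anonymous text reuses the known author's function-word habits,
# so its profile sits much closer to the known profile.
assert distance(known, anon) < distance(known, other)
```

The author changed every content word between the two texts, yet the function-word profile gives them away; that is the fingerprint Radim describes, in miniature.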
[00:35:02] Unknown:
I'll bet. It reminds me, real quick, I just finished reading a book called On Writing Well. And one of the author's main points is essentially that you can't change your essential self when you write; you're speaking in your own voice, so don't struggle against that. Be authentic in your writing. It just struck me, as you were saying that, that it kind of proves what he was saying. Right? That even if you try to mask who you are, there are aspects of your personality that are going to come through whether you mean to or not. And this tool kind of proves that out.
[00:35:37] Unknown:
Yeah. Well, "proves" is a strong word. These are indications, and the result is some sort of statistical significance or some sort of confidence. It's not like you can prove anything. But yes, there are certain parts which, unless you really know what the algorithm is looking for, you can't mask effectively. If you just put a comma here or there, or change some words to synonyms, that's not going to be enough. There's pretty advanced research in this field, which is quite interesting, with applications to security, like you say. The devil is always in the details. But this particular field also expands: it's not just seeing who the author is, but you can also extract some sort of socioeconomic indicators. You can say how old the person might be, or what their gender was, stuff like that, which again has implications not just for the security industry but also for marketing: saying stuff about people, maybe how rich they are, how well educated, which has correlations and so on. So all of these algorithms have good applications in commerce as well.
[00:36:41] Unknown:
And as you mentioned during our conversation, you started a consultancy as somewhat of an outgrowth of your work with Gensim. So I'm wondering how that feeds back into the Gensim project.
[00:36:52] Unknown:
Good question. I started consulting because I like building stuff. It wasn't necessarily connected to Gensim, and it still isn't. We use a lot of different tools, Python and non-Python, Gensim and non-Gensim. And there's really a huge demand, if there's someone listening who's considering a career in, let's say, data science, as it's called, or just machine learning: there's a huge demand for people who actually know how to execute on stuff, not just talk about it or write down formulas, but actually get into it and build stuff that does something. There's way more demand than there is supply, really on the practical side of it. And that's how I started with freelancing. That was back when I was living in Thailand, in 2009 or '10; I started doing it remotely. There was always a lot of work, and people appreciated it, so they kept coming. So I hired more people, and it sort of grew from there. That's the consulting side of it; that's what we're doing now. We're also giving trainings to companies: on-site trainings in Python and machine learning and NLP and deep learning, the now-popular wave of neural-network-based algorithms, things like that. And we really specialize in machine learning. I think I already mentioned this is where our heart is. We don't build web interfaces or front-end systems; we really focus on the back-end engines for recommendation, classification, analysis, anomaly detection, these sorts of things. So it's not just Gensim. And to answer your question about how this plugs back into Gensim: there's one interesting initiative we started just a couple of months back, which we call the student incubator, a program where students can apply.
So we have several partner universities across the world, as a company, where we help students with their theses and give them interesting stuff to work on. In return, they learn some practical skills and become more useful in practice, not just on the academic side. And this incubator program is an extension of that. People from around the world can apply, and they get mentorship from us and work on projects. Not commercial projects, obviously; those have legal restrictions and so on. But they work on open source projects, where they learn how to collaborate using a version control system, what tools to use, what approaches are good, and also how to express themselves in blog posts: motivation about what they're doing, why somebody should care, and things like that, which are super useful skills no matter where you end up in your career. So that's how we're giving back. This student incubator doesn't really bring us anything; it's a cost for us. It costs money to do the mentoring and prepare the projects and do the code reviews and all that stuff. But we feel that the open source community is giving us a lot, so we'd like to give back in this way.
[00:39:50] Unknown:
So are there any other libraries that have come about as a result of issues that happened during client engagements or any particular patches or improvements to Gensim?
[00:40:00] Unknown:
Oh, yeah. For sure. All of them. Since Gensim is an open source project, people just scratch their itch: they submit stuff they want included, and we do the same. When we have commercial projects that need something optimized, or a wrapper for another library, let's say Mallet and so on, or detecting phrases, we often do that. And if the legal part of it is okay, if we can actually release that work as open source, then we like to contribute it to Gensim or other projects. We have this smart_open project, which came about because we wanted to read files from S3. A lot of companies store data in the cloud; let's say it's text data, and it's often large. And it's not really easy to read, because with the standard libraries, and this is a technical point, you have to read the entire thing into memory, or copy it over, and stuff like that. There's no good way to stream it directly from S3, say line by line. If you write in Python, for line in open(x), it's actually doing some buffering; it's intelligent, it doesn't load the entire file. There was nothing like that for reading files from S3, so we built a library for it. Also for HDFS. Same thing.
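The pattern smart_open generalizes to S3 and HDFS can be imitated over a local file with nothing but the standard library. This sketch only illustrates the buffered, line-by-line iteration Radim describes, not any actual S3 protocol handling; with the real library the loop body would be identical but the path would be an s3:// URL:

```python
import tempfile

# Write a small stand-in file to disk for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first line\nsecond line\nthird line\n")
    path = f.name

# Python's built-in open() already streams: each iteration pulls the
# next line through an internal buffer instead of loading the whole
# file into memory. smart_open's contribution is extending this same
# for-line-in-handle idiom to remote stores.
lines = []
with open(path) as handle:
    for line in handle:
        lines.append(line.rstrip("\n"))

print(lines)  # ['first line', 'second line', 'third line']
```

Because the consumer only ever holds one line (plus a small buffer) in memory, the same loop works whether the file is a few kilobytes or hundreds of gigabytes, which is exactly the property needed for streaming corpora.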
[00:41:17] Unknown:
Given the advanced nature of the subject that Gensim covers, is it difficult to find contributors to the project?
[00:41:25] Unknown:
Oh, we're not really looking, so that's one answer. But people contribute. They use it, and they do stuff. What's difficult is finding contributors who contribute stuff that is of production quality, because we like to keep Gensim from being just a collection of random academic code that maybe works, or maybe breaks in corner cases and so on. The core principles are that it's streamed, it's efficient, and it's robust. So, finding contributors who can really live up to that bar of stuff being useful is hard. And we assist them. I mean, in the pull requests, if you look at Gensim, we have a lot of discussions on how things can be done differently and so on. But it requires some experience with building stuff, so it's not always easy. And that's partly why we're collaborating with the students. For example, we have a student now who's part of the Google Summer of Code. We're a mentoring organization for the Google Summer of Code, and we got a student through it. And they have more time, because you get resources to really figure out how to do things properly. That always helps.
[00:42:36] Unknown:
So are there any resources you'd like to recommend our listeners explore to get a more in-depth understanding of topic modeling and related techniques?
[00:42:45] Unknown:
Sure. It depends what level you're at or where you're coming from, but to get the basics, Wikipedia probably has good examples and a basic intro. And for something Gensim-specific, we're actually starting to push more documentation. We're adding tutorials using IPython notebooks, or Jupyter notebooks as they're called now. So that's also a good place to read through and see how things are actually executed concretely. There are a couple of good newsletters where people share novel things being produced. How useful that is, I'm not sure. If you're an expert and you know everything there is to know, then yeah, following the latest news makes sense. But really, if somebody's getting started, then the best way, in my opinion, is to just do basic stuff, see how things fit together, and go from there.
[00:43:38] Unknown:
That's great. I really liked the tutorials that you folks already have; of them, the Wikipedia tutorial was one of the ones that actually gave me a sense of what Gensim does and is useful for, as someone who has zero experience with machine learning and topic modeling and that whole sphere. So that's great. It's great that you folks are planning on doing more, so that people outside the field can really make use of Gensim in interesting ways.
[00:44:06] Unknown:
Yeah, I appreciate it. Thanks. You know, if open source is rarely a priority for people, then documentation of open source is even less of a priority. So it's not always such a happy relationship there. But yeah, the documentation could be improved, and we're still improving it. Having students helps, I would say.
[00:44:32] Unknown:
So are there any questions or topics that we didn't cover that you think we should have?
[00:44:37] Unknown:
Oh, lots of them. I'm interested in history. We didn't talk about history at all, the Bronze Age and all that stuff. No, but about Gensim, I think that's fine. I'd like to reiterate that we're really grateful for the open source community, and that whole movement around it these days is, I think, a fascinating thing, though in a way also dangerous, because people learn that everything is for free. You know, at least some people. So we didn't even touch on the topic of making money through open source. That's an interesting topic, because a lot of people try it. We tried it. spaCy tried it. It doesn't really work. First of all, these libraries are too low-level; the business value is too far away for them to be sold like that. Same with machine learning in the cloud; I see a lot of those services popping up. And again, it's hard to imagine who would need such a service, because the value really is not in the algorithm, or even the implementation as such. It saves you some time, but the value is in knowing how to apply it, what sort of data fits, how to structure the whole process, and so on. So making money off Gensim or spaCy or scikit-learn, these low-level libraries, building blocks, is not as straightforward as some people think.
[00:45:58] Unknown:
I'm sure. But I would think that the fact that you produce Gensim would help your consulting business. Right? I mean, I know for a fact that if I were looking to build an application and I said, ah, this problem really lends itself to topic modeling, you folks would certainly be very high up on my list of people to contact.
[00:46:18] Unknown:
Thank you. Thank you. I appreciate that. It's true, it doesn't hurt, for sure. But, you know, I've been doing this for what, 15 years now. So I have a lot of experience with various parts of machine learning, not just topic modeling, but all sorts of stuff. But yeah, it certainly helps to have open source that is somewhat popular, that people recognize.
[00:46:39] Unknown:
Money and open source is definitely a long-running and complex topic. We've touched on it in a few of our episodes, and as yet it is an unsolved problem. Though there is a woman, Nadia Eghbal, who's been doing a lot of research and discussion about this. And apparently she actually just recently joined GitHub to try to do some of the work from the inside there, to
[00:47:01] Unknown:
figure out ways to make open source a bit more sustainable. So I'm interested to see how her work pans out. Oh, okay. Yeah, I hadn't heard of that. That's interesting. I'd be curious too. But everywhere I follow this, like Django, they're really struggling. Django is used all around the world; it's the poster child for web development, at least in Python. Pinterest and Disqus and all these big sites are using it, but it's struggling to raise what I would see as pitiful amounts to sustain its development. I mean, Elasticsearch managed the transition from being open source to having lots of money by going to venture capital. But yeah, it's a different game.
[00:47:42] Unknown:
Yeah. It seems like the data management space, in terms of databases, is generally a little bit easier to commercialize, whereas something that sits more in the middle of the stack is easier to overlook for some reason.
[00:47:57] Unknown:
Well, the middle of the stack is part of it, but there's also the fact that it's still more academic. I mean, machine learning is a very hot topic now, and it's becoming practical. Companies are really using it, not just for marketing, but really using it, or starting to use it. But the fact is, it's kind of harder to sell. Something like Elasticsearch, that's basically the sixties from the research perspective. Okay, you have an inverted index and you look for keywords. It's still super challenging to do the engineering part, don't get me wrong, the scaling and the availability and all that stuff. But it's not really rocket science from the science perspective, let's say. Whereas the stuff that is not from the sixties, like deep learning, which appeared just in this decade, the past 5 or 6 years pretty much, is not as mature, and companies don't really understand it or its value. So part of the selling is actually educating the market, which is not an easy proposition. You have to teach your customers why they should want what you're offering. So it's a much steeper hill.
[00:49:01] Unknown:
So for anybody who wants to follow you and keep in touch with what you're up to, what would be the best way for them to do that?
[00:49:09] Unknown:
Best way? Well, I always appreciate handwritten notes, but I'm also on Twitter, so that's easy. I'm on email too, so I guess I shouldn't be difficult to find. I mean, if you Google my name, I think I'm on the first page, so hopefully people will manage. And for Gensim: if you're interested in contributing, the project is hosted on GitHub, and has been for the past 5 years since we moved there. So feel free to open issues or pull requests there, and we have a mailing list where you can discuss more open-ended ideas, or ask for help, and so on. Noobs are welcome. We have a lot of noobs there, so don't feel shy.
[00:49:54] Unknown:
Alright. So with that, I'll move us on into the picks. My pick this week is a book that I just finished listening to called Dark Matter and the Dinosaurs by Lisa Randall. It's a well-done accounting of some of the cosmological research that's been going on into dark matter. She ties it into the dinosaurs by positing that dark matter in the galaxy is a potential cause for what set the Chicxulub meteor on a collision course with the Earth. And in the process of explaining that whole story, she exposes a lot of the research going into dark matter and various cosmological phenomena associated with it. So it's definitely worth a listen for anybody who's interested in understanding more of the science that's going on there. And with that, I'll pass it on to you, Chris. Thanks, Tobias. It sounds really cool. I'll have to add it to my Audible queue.
[00:50:44] Unknown:
So my only pick today is a tool called m-cli. Basically, what this thing is is a command line utility for twiddling the myriad aspects of your macOS install that are sometimes hidden, or not even available through the GUI. As a for-instance, you can turn Gatekeeper on or off. For those people who use a Mac, that's the annoying thing that says, you've just downloaded an application from the Internet, are you sure you want to open it? Or in some cases, depending upon how your admin has things set, you don't get to open it at all. So it's very, very handy, and very easy to use, especially for command-line-native people like me who would prefer to be using a CLI whenever we possibly can. It's great stuff.
That's what I have for picks today. Radim, what do you have for us?
[00:51:39] Unknown:
Well, I didn't have anything, but now that I think of it, both of yours are interesting, but mine would probably be a book that I started reading recently. It's not connected to tech, or at least not obviously connected, although everything is connected if you look deep enough. This one is by Eric Cline, and the name is 1177 B.C.: The Year Civilization Collapsed. It's about the collapse of the Late Bronze Age world, which is fascinating from the historical perspective. But this guy, first of all, writes in an entertaining way, and he also draws many parallels with what's happening now, which is always a bit of a slippery slope, but he does it in a tasteful way, and it's really thought-provoking. It's a book that I will finish for sure, let's say. So 1177 B.C.: The Year Civilization Collapsed would be my recommendation, my pick.
[00:52:33] Unknown:
Very cool. I'll definitely have to take a look at that, and I'm sure I'll have to recommend it to my mother, who is very much into history as well. So, yeah, thank you very much for joining us today and telling us more about Gensim and the work that you're doing with RaRe Technologies. It's definitely a very interesting space, and one that I intend to do some more research into. And yeah, I really appreciate your time. Brilliant. Thank you very much. If you have any questions,
[00:52:56] Unknown:
I'll be happy to hear them. Enjoy the rest of your day. Thanks. Bye bye, guys.
Introduction and Sponsors
Interview with Radim Řehůřek
Radim's Background and Journey
Introduction to Topic Modeling and Semantic Analysis
Gensim's Origin and Development
Practical Applications of Gensim
Gensim's Technical Details and Streaming Capabilities
Supervised vs Unsupervised Learning
Unorthodox Uses of Gensim
Consulting and Community Contributions
Resources for Learning Topic Modeling
Open Source Sustainability
Contact and Contribution Information
Picks and Recommendations