Summary
The state of the art in natural language processing is a constantly moving target. With the rise of deep learning, previously cutting-edge techniques have given way to robust language models. Through it all, the team at Explosion AI have built a strong presence with the trifecta of spaCy, Thinc, and Prodigy, supporting fast and flexible data labeling to feed deep learning models, as well as performant and scalable text processing. In this episode founder and open source author Matthew Honnibal shares his experience growing a business around cutting-edge open source libraries for the machine learning development process.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, node balancers, a 40 Gbit/s public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they’ve got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. And now, the events are coming to you, with no travel necessary! We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East which has also gone virtual starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host as usual is Tobias Macey and today I’m interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on spaCy
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by giving an overview of your mission at Explosion?
- We spoke previously about your work on spaCy. What has changed in the past 3 1/2 years?
- How have recent innovations in language models such as BERT and GPT-2 influenced the direction or implementation of the project?
- When I last looked, spaCy only supported English and German, but you have added several new languages. What are the most challenging aspects of building the additional models?
- What would be required for supporting symbolic or right-to-left languages?
- How has the ecosystem for language processing in Python shifted or evolved since you first introduced spaCy?
- Another project that you have released is Prodigy to support labelling of datasets. Can you talk through the motivation for creating it and describe the workflow for someone using it?
- What was lacking in the other annotation tools that you have worked with that you are trying to solve for in Prodigy?
- What are some of the most challenging or problematic aspects of labelling data sets for use in machine learning projects?
- What is a typical scale of data that can be reasonably handled by an individual or small team working with Prodigy?
- At what point do you find that it makes sense to use a labeling service rather than generating the labels yourself?
- Your most recent project is Thinc for building and using deep learning models. What was the motivation for creating it and what problem does it solve in the ecosystem?
- How does its design and usage compare to other deep learning frameworks such as PyTorch and TensorFlow?
- How does it compare to projects such as Keras that abstract across those frameworks?
- How do the spaCy, Prodigy, and Thinc libraries work together?
- What are some of the biggest challenges that you are facing in building open source tools to meet the needs of data scientists and machine learning engineers?
- What are some of the most interesting or impressive projects that you have seen built with the tools your team is creating?
- What do you have planned for the future of Explosion, SpaCy, Prodigy, and Thinc?
Keep In Touch
Picks
- Tobias
- Onward movie
- Matthew
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Explosion AI
- spaCy
- Thinc
- Prodigy
- Natural Language Processing
- Perl
- NLTK
- GPU == Graphics Processing Unit
- TPU == Tensor Processing Unit
- Transfer Learning
- Airflow
- Luigi
- Perceptron
- PyTorch
- TensorFlow
- Functional Programming
- MXNet
- Keras
- CUDA
- C Language
- Continuous Integration
- Blackstone
- Allen AI Institute
- SciSpaCy
- Holmes
- sense2vec
- FastAPI
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, node balancers, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, all controlled by a convenient API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI/CD pipelines, they've got dedicated CPU and GPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on great conferences. And now the events are coming to you, with no travel necessary. We have partnered with organizations such as ODSC and Data Council. Upcoming events include the Observe 20/20 virtual conference on April 6th and ODSC East, which has also gone virtual, starting April 16th. Go to pythonpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today. Your host as usual is Tobias Macey. And today, I'm interviewing Matthew Honnibal about the Thinc and Prodigy tools and an update on spaCy. So, Matthew, can you start by introducing yourself? Hi, Tobias. So,
[00:01:47] Unknown:
thanks for having me again. I'm the creator of the spaCy natural language processing library. It's a popular tool for working with text in Python, so it's often used for information extraction projects and, you know, also data science projects to understand text at scale. And I'm the cofounder of our company, Explosion AI. We also make an annotation tool called Prodigy. And we've recently updated and released the machine learning component of spaCy as its own library, Thinc, as well, which is another thing that I'm excited to talk to you about today.
[00:02:28] Unknown:
And can you share how you first got introduced to Python?
[00:02:40] Unknown:
Sure. So like a lot of people, I came to problems that I wanted to solve with programming before I came to decisions about languages or, you know, those sorts of technical things. So, basically, I started out in linguistics, and I was doing research. And I wanted to process volumes of text to answer questions about grammar or to, you know, basically work through the linguistic theory that I was working with. And so then it just sort of started from there, and I started writing small scripts and everything. And I actually first started out with Perl, but I quite quickly switched across to Python. So this was in around 2004, 2005.
And since then, I've, you know, really worked with Python for pretty much my whole career, except that eventually I realized that I wanted to write programs that were faster and, in particular, programs which were more memory efficient, so that I could write, you know, basically concise data structures that would work well with the problems that I was working with. And so then I started working with Cython, and I've found that a really good compromise, because for some problems it just is a lot easier if you can sit down and plan out the memory ahead of time and sort of reason about how much you can hold in memory. And that very much informed how spaCy was written, because the library is really implemented in Cython rather than in Python directly.
[00:04:00] Unknown:
And at the time when we spoke, the Natural Language Toolkit was still sort of the de facto standard for anybody who wanted to do any sort of natural language processing. But these days, most of the time when I see references to people doing any sort of NLP, spaCy has become the more prominent library for that. So I'm curious what your sense of that has been as the creator and maintainer of spaCy, how things have progressed over the past few years in terms of the level of popularity and adoption for your library.
[00:04:29] Unknown:
So NLTK is still an extremely popular and useful library, and they really do different things. So I would, you know, never wanna say that, oh, well, there's only one way to do it and that the other tools are, like, deprecated or something. There's still a lot of functionality in NLTK, and there are use cases where people find its approach useful — you don't have to initialize as much, like, you know, load a large model into memory, and it's basically got all of these utility functions. So it's still certainly a very popular tool.
But I've been pleased to see that a lot of people have been finding spaCy useful, especially for models that are pretrained, sort of pretrained processing pipelines, and also data structures for working with different annotations. So one thing that, you know, I think spaCy is quite good at is if you've got annotations for, like, entities in text and you wanna do things like get relations between them, or you wanna find other parts of speech, or you wanna retokenize the text, the object-oriented interface in spaCy makes it quite easy to interact with those annotation layers together, and so people are finding that quite useful. Also processing pipelines.
So being able to string together rule-based matching with entity recognition, and then apply some other rules on top, and then get out the document at the end, is something that I think spaCy is quite strong at as well. And so that's why people are using it for these processing pipelines. And spaCy also has a little bit more of an industrial use case focus. So it's more oriented towards production use cases. And so I think there's a lot of companies who've basically been looking for a tool that has that kind of focus, rather than one which is more oriented towards teaching or research.
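For a concrete picture of the Doc-object workflow described here, a minimal sketch, assuming the small English pipeline has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

# Load a pretrained processing pipeline.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Explosion AI was founded in Berlin by Matthew Honnibal and Ines Montani.")

# Entities, part-of-speech tags, and the dependency parse all live on the
# same Doc object, so the different annotation layers can be used together.
for ent in doc.ents:
    print(ent.text, ent.label_)
for token in doc:
    print(token.text, token.pos_, token.head.text)
```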
[00:06:12] Unknown:
For the work that you're doing at Explosion, you mentioned that you founded the company around the same time that we talked, as a follow-on from your work on spaCy. So I'm wondering if you can give a bit of an overview of the mission for that company and highlight some of the different projects that you've been working on there. Yeah. So
[00:06:29] Unknown:
when we first started out with Explosion, we did some consulting projects for 6 or 7 months. We refer to this, you know, together with my cofounder Ines, as raising a client round. It was really a good way to basically understand what sort of problems people had with NLP and figure out what we wanted to do next. And so then the product that we ended up releasing was this annotation tool, Prodigy, and that's been going very well since, and that's been, you know, really funding our activities in the company. So the way that we see things is that one of the needs that people have for machine learning technologies is to be able to develop them closely themselves. Around when we were founding Explosion, there were a lot of people thinking that AI technologies like NLP would be something that you consumed as a cloud service, and you would really have very few developers working closely with these technologies.
And our bet was different. Our bet was that this is a bit more like web development, in that to really make use of the technology effectively in projects and products, people would need to work with it closely. And there would be a lot of developers wanting to understand how all of these technologies fit together. And so open source and sort of self-run technologies would be the way that people wanted to build their projects. And I think that that's largely true. That's largely the way that people have been working with AI — using open source libraries or, you know, at least self-hosted technologies that they can really understand in detail.
And, you know, so we wanted to basically have a tech stack that fit along with that sort of viewpoint, so that people could run their projects more — you know, basically move faster and try things out. And so going back to spaCy,
[00:08:06] Unknown:
in the last 3 and a half years, there have been a lot of new innovations and shifts in direction for the study and usage of natural language, with models such as BERT and GPT-2 coming out. And I'm curious how that has influenced the direction or the implementation of spaCy itself, and any other product developments or project updates that you think are worth noting that have happened in that time frame? Yeah. So that's definitely been a very exciting thing that's been happening with natural language processing. So,
[00:08:38] Unknown:
essentially, all of these models give the ability to have very accurate models through language model pretraining. So a problem for natural language processing technologies has always been this problem that's broadly called the knowledge acquisition bottleneck. And that's that there's so much knowledge that's kind of in the background about language that a model that has to work with language has to understand about the world in order to get any specific application done. So let's say you wanna do something, you know, reasonably boring, like extract certain financial figures from some document.
And so it's, I don't know, profit and loss statements from, like, you know, company filings or something. There are all of these other words and all of these background things that the model has to sort of understand something about in order to figure out which sentences are of interest and which sentences are providing that information. And if you have a person that you're teaching to do this, along with all of their knowledge about the world and general intelligence, they have this knowledge of language that means that they only have to learn a little bit about the task before they can do it very accurately. Whereas if you have a model that has to see all of these words for the first time, you need an enormous number of examples to teach it this boring task. It would be like having a new employee, and instead of just teaching them what to do, you have to teach them English as well. And that's, you know, obviously a huge learning curve, where you want to be able to import the general capability of English and just teach your task on top. And that's always been a well identified problem for natural language processing models. And finally, over the last couple of years, we've really had a big breakthrough in how this is done. So these models basically start off learning to predict the next word, or some task similar to it, from large bodies of text, and you can start off with that knowledge and apply it to some specific task. Now the challenge at the moment is that these models have largely been developed by research labs where compute costs are completely not a consideration.
And they've especially been developed to favor GPU and TPU devices. So this means that if you just run these models straight from research at the moment, they're really quite expensive to run. And so if you wanna be processing large volumes of text and you wanna run the processing several times because you wanna keep experimenting, the costs of running those models start to add up very quickly. You also have problems with serving them, because if you've got a model that requires a GPU device, the latency for using it effectively starts to get quite significant, because you need to batch up a lot of examples.
So it's been very exciting to see these breakthroughs happen, but the challenge has been basically adjusting, you know, the architectures that we have and finding the right compromise between models which are still cheap to run and models which are still low enough latency, while being able to take advantage of the high accuracy from these new techniques.
[00:11:37] Unknown:
And these problems are constantly evolving, and, you know, there's more and more work coming out recently about making these models smaller and more efficient as well. And another element of the large models that I've seen referenced is doing things like transfer learning, for being able to take the existing models and then swap out a couple of the layers to make them fit your specific use case. Is that something that spaCy is used for in that context as well, or is that sort of outside of the scope? So we have a command, spacy pretrain, where you can, you know, run language model pretraining
[00:12:11] Unknown:
even, you know, basically from scratch. But you can also — especially with Thinc, it's quite easy to sort of plug these layers together and to take advantage of these sorts of technologies. And in spaCy, the reason that we redesigned Thinc was really to take advantage of this type of model better. So one of the challenges is, you know, basically what's introduced by the new transformer architectures and the new ways of doing machine learning. When spaCy was first developed, I thought carefully about what the right level of abstraction was to present to developers, so that they could take advantage of natural language processing technologies — basically, which bits of complexity to shield off from them and which bits to present as the decisions that they would be making. So the level of abstraction that was most sensible when I was designing the library was to think at the component level and say, alright, well, this is a named entity recognizer, and it does this task of assigning labels to text, and then you can combine it with, like, a tagger, or you can combine it with, you know, a parser. And then these are things which will analyze the text, and then you'll get back a Doc object, and you'll basically work with the Doc object from there. Now with the neural network technologies, and in particular with the transfer learning technologies, the level of abstraction that's most handy for developers to work with is a little bit different, because you wanna be able to take these models and basically be thinking about the tensors, and saying, alright, well, I'll feed this word representation out into this layer, and I'll share that information with this other layer. And that's really the level that developers want to be working at now, because, you know, the knowledge of these models is pretty detailed in the community — there's a lot of people who understand these things pretty well — and so the abstraction is different. And so this is something that we've basically wanted to adjust in the library, and, you know, make it easier to work at that sort of level, while still making sure that the library does the job that it originally did of working at the pipeline layer as well. Another development that has happened since we last talked is the fact that you've added support for a number of other languages, whereas at the time, I believe it was only English and German. And I'm curious
[00:14:22] Unknown:
what the most challenging aspects of building those additional models for different languages have been, and any challenges that you see in terms of being able to support things like symbolic languages, like Japanese or Korean, or right-to-left languages, such as Arabic? Yeah. So in terms of supporting more languages, I would say that the two big challenges are DevOps
[00:14:43] Unknown:
and data. So the DevOps challenge is simply that, as we've added more languages, the operational complexity of training all of the models has grown, along with the automation required to have those jobs complete well — you know, having pipelines for all these things so that, reliably and with low manual effort, we can get all of those artifacts built and tested for each release. And that was something that took longer than I thought it might. So, you know, the training jobs take a fair bit of time, and then for each individual training job, you need to be able to resume it and stuff. So I tried out a number of technologies like Airflow, Luigi, and things, and ended up with, you know, basically a setup that works well for us. But this was definitely a challenge, and that was a thing that took a fair bit of time, setting up all of these things, partly because, you know, these tasks aren't ones which I had so much deep expertise in. So it was a bit of a learning curve for me there as well. And then the other one is just the data resources. So we wanna make sure that for all of these languages, when we produce a model, it's, you know, basically useful to people, and that the models don't just sort of exist for the sake of it. So that's been something that's been difficult, especially with different corpora having inconsistent licensing and stuff. You know, we wanna make sure that the models that we produce are available for commercial use for people, and also that the data is good enough that it's, you know, something that's actually useful.
So over time, one of the things that's changed since we last talked is the Universal Dependencies corpora have gotten a lot better and gotten, you know, pretty consistent. And so that's something that we've been able to take advantage of to produce
[00:16:20] Unknown:
some more of these models. And for being able to build out these models, as you said, one of the challenges is having the appropriate corpora. And I imagine that another aspect is being able to label it effectively and find pre-labeled datasets, which I'm sure is where some of the inspiration for your Prodigy tool came from. So I'm wondering if you can just talk through a bit of the motivation for creating that and describe the use cases that it enables and the workflow for somebody using it. Yeah.
[00:16:47] Unknown:
So definitely, questions about labeling data were, you know, some of the problems around this. When we were doing the consulting, this was definitely something that teams were struggling with. So probably the most important thing that we thought we could offer that was a little bit different from, or lacking in, people's process was, I guess you could say, more of an agile methodology for data labeling. So the naive view of labeling data is, well, you just sort of decide what the labels should be, and then you tell somebody to apply that labeling scheme, and then the problem is just this grunt-work task of getting the things done. And for some tasks, it looks a little bit like that — some image tasks are a little bit more like that. But certainly for language, as soon as you come up with any labeling scheme and you start applying it to text, you very quickly realize that there's all these edge cases, and it's kind of edge cases all the way down. And even more importantly, you need to realize that there are ways that you can adjust the labeling scheme that will hit a better compromise between what will be useful for your model, or what will be useful for your end goal, and what will be easy to annotate and what the model will be able to, you know, annotate effectively.
So, you know, the other day, we were working on a little demo of Prodigy, and we were annotating instances of, like, ingredients in cooking discussions, because we wanted to see — alright, well, there are all these trends in what sort of food people are using, especially for home cooks and things. So we wanted to see, alright, well, can we find some of those in the frequencies with which different ingredients are mentioned? And so this sounds, you know, simple enough, but then you quickly realize that there's not really a clear distinction between what's an ingredient and what's a finished product, because sometimes you might have something like, I don't know, chicken fillets, which could be an ingredient in a recipe, or it could be the recipe itself.
And there's all sorts of other examples like this where you're kind of not sure of the boundaries. And so always, you're making these decisions when you're annotating any project. And that means that you have to basically take a pretty flexible view of what you're doing. It means that you have to be able to start and stop the annotations and look at the data and have a basically integrated process. So that's really what we did with Prodigy. We made sure that it was a tool that was fully scriptable, so that you can really have control of the annotation process yourself, and you'll be able to build out different capabilities and build whatever automation you need as well. So you can drive it from Python. And if you can basically write a function in Python that generates the data, then that will be something that you can quite easily put in your little function and then have thrown up in a web browser for you to click through, and then it'll be saved in a local database. And you can make different choices if you wanna scale out. For instance, you can have the database saved to, like, you know, a MySQL instance instead of a local SQLite file.
You can host the application in different ways. But at the core of it is a tool for any data scientist working individually. It's a really quick way to be able to build out these experiments yourself and try different things, so that you don't have this stumbling block of, as soon as you need some small amount of annotation, you hit a sort of process block and, you know, have to do something different, or have to go to your team and, you know, basically apply for funding to throw it out to an external labeling service. Instead, it's just something that you can do flexibly yourself.
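As a rough illustration of that scriptability, a custom recipe can look something like the sketch below, based on Prodigy's documented recipe API (the recipe name, label, and file paths here are invented for the example):

```python
import spacy
import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens

@prodigy.recipe("ingredient-ner")
def ingredient_ner(dataset: str, source: str):
    nlp = spacy.blank("en")
    stream = JSONL(source)            # lines like {"text": "..."}
    stream = add_tokens(nlp, stream)  # the manual NER interface needs tokens
    return {
        "dataset": dataset,           # annotations are saved to this dataset
        "stream": stream,
        "view_id": "ner_manual",      # built-in span-highlighting UI
        "config": {"labels": ["INGREDIENT"]},
    }
```

Started with something like `prodigy ingredient-ner my_dataset ./cooking.jsonl -F recipe.py`, the annotations land in the local SQLite database by default, as described above.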
[00:20:12] Unknown:
And I know that there are some other labeling tools out there. I'm curious what you saw as being some of the lacking features or capabilities in the available market that
[00:20:23] Unknown:
necessitated building out Prodigy as an alternative to them? So the number one thing was really the design of it as a developer tool, and a scriptable developer tool. Because when we talked to people about their experience doing annotation and using annotation tooling, almost all of them had built annotation tools in house. And that was something that was worth thinking about. It's like, okay, so if this is a type of problem where people are very frequently motivated to write their own tools, why would that be? And, you know, the simple answer is, well, nobody's come up with, you know, just the right annotation tool that everybody needs. And I don't think it's quite that. I think it's that, you know, the needs are quite flexible, and people want to have control of the process because that's kind of efficient, and different teams' needs are different.
So we wanted to make sure that it was something that people could really work with, and that they could work with as developers. So, you know, the scriptability, the fact that people can interact with it programmatically and have it self-hosted, is something that we really wanted to build into the design. Most of the other things are designed as web applications, and programming against a web application that you're not hosting is always gonna be kinda limited. The other problem is data privacy. So the vast majority of our users really don't want to — and often simply can't — upload their text into some cloud service. And this makes a lot of sense to me. You know? Like, if I've got text in a platform that's private to me, I don't think that that vendor should be sending that data to, you know, some external third parties. And, you know, since then, the regulation has also caught up with this view that I have of how things should work, and I think that that's great. And I think that the US is catching up with this as well, and we'll have, you know, rules that are more standardized, like what already works in Europe too. Another thing that I saw that was appealing about Prodigy is the fact that it supports multiple different types of data for labeling. So it has capabilities for text, so that you can do things like named entity recognition, like you were referring to earlier — being able to say, in the example you gave, this is an ingredient versus this is a finished product. But it also has support for labeling of images and segmentation of those images, to say, you know, this is a rectangular area, this is a polygonal area, and this is the label associated with it. And then you also have support for some other data types as well. So I'm curious what you have found to be some of the challenges of building a tool that supports those different data types, and some of the value that you've seen come out of it? Yeah. So basically, just trying to make the right compromises between, you know, what people need in different use cases, while still not, you know, diffusing too much and being less useful for any particular use case. So, you know, to be clear, I do think that there'll always be other tooling that people want to use as well. And I think that some of the worst things that computational tools — or actually tools in general — can do is to try to be the one-stop shop for all use cases in all situations. I think, you know, it's important to do the job that you set out for yourself well, but a lot of people have found it useful to have this variety of capabilities in Prodigy, so that you don't have to have very different workflows just because you now have an image task as opposed to a text task. We've also introduced nice audio support recently as well, and, you know, Ines has had some fun building that out and getting an interface that's helpful for that. So one of the challenges has been designing workflows for things that we don't do ourselves so often. You know, we still don't do a lot of image work in terms of our actual projects, and we don't have as deep an expertise there. So making sure that we're, you know, basically doing something that's helpful to people without having as close a connection to it is something that we've had to think carefully about.
And then, you know, different data sizes and stuff. So, you know, obviously, the size of the input for something like video, image, or audio is quite different from text. And so we had to make some adjustments in the way that the database works and stuff to make sure that those are well accommodated. And you mentioned too some of the challenge of being able to allocate funding for working with an external labeling service, versus having the capacity
[00:24:21] Unknown:
for doing your own labeling, at least on a small to medium scale. And I'm curious what you see as being a reasonable scale of data that can be handled by an individual or a small team, and at what point you think it's necessary to start working with some of these labeling services to handle larger scale or more fine-grained labeling for the data that you need to use for building your models or building your products? Well, so I think that
[00:24:54] Unknown:
you definitely can get something to production working with, you know, basically just sort of ad hoc resources — yourself, or your team, or, like, maybe some interns or some other junior people around the place. And so it depends on the tasks, and it depends on how much data is needed to get the models trained, because there's no one answer for this. Different tasks have different complexities and things. So I would say that the number of examples that you need per model is dropping all the time, because the transfer learning technology is very good. So I would say that, you know, especially now, you need less data than ever.
And I would always be focused on — if you find that you're needing, you know, hundreds of hours of annotation, then rather than just saying, well, that's life, that's just how much we need, I would always be suggesting that you look at ways that you can redesign the models, because it may be that something's wrong with the way that you're actually defining the problem. So I'll give you an example of this. One of the examples that I use in some of my talks is: imagine that what you wanted to do is extract information from crime reports, and you wanted to fill out this database of, you know, the victim's name, a perpetrator's name if it's there, a location where the event happened, the event type, or something. So one way of doing this is very directly, and you might say, alright, I'm going to do this as a labeling task where I label this span of text, John Smith, as victim, and then this other span of text, Kings Cross, as, you know, location of crime or something. And so, you know, that's definitely a way that you will be able to train the model, but you're coupling two pieces of information — well, actually, coupling three pieces of information. The sentence is about, you know, the event type, crime; John Smith is a person; and John Smith's role in that event is the role of victim. So if you factor those three pieces of information out, you can often need far less data, because the decision of, alright, that's a person versus not a person is, you know, basically easier, and it doesn't require as much information about the whole rest of the sentence. Similarly, the information of, you know, is this sentence about a crime or is it not about a crime — that's one bit of information that you can annotate over the whole sentence. And so if you annotate these separately and you train the models separately, you can often need far less data. And so you'll have some situations where people are finding that the model isn't converging well, and their first instinct is to either try a different architecture or to, you know, annotate more data, when by far the best lever to pull is one which people don't really have practice pulling, because it's not one which you'll have gotten from shared tasks or from writing papers and things. And that lever is: how can I redesign the task? How can I find a different way to either need less for the application, or to just structure the models differently so that they attack different parts of the task and define things differently? How can I say, alright, well, what if I did this as a sentence labeling task rather than labeling the words in the sentence?
Would my application be able to deal with that slightly less precise piece of information? If so, maybe you'll find that the model converges far faster and far better.
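To make that factoring concrete, here is a hypothetical sketch of the two decoupled annotation streams, loosely following Prodigy's JSONL conventions (the text and labels are invented):

```python
# One stream marks generic PERSON spans, without encoding the event or role.
span_task = {
    "text": "John Smith was assaulted at Kings Cross.",
    "spans": [{"start": 0, "end": 10, "label": "PERSON"}],
}

# A separate stream labels whole sentences as crime events or not.
sentence_task = {
    "text": "John Smith was assaulted at Kings Cross.",
    "label": "CRIME_EVENT",
    "answer": "accept",
}
```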
So to answer the question about when I would actually switch to a labeling service: I think, actually, I would use a labeling service basically never. And the alternative, after you're past the prototype stage, is that I would actually be hiring people to do the work in house — and, you know, they can be remote employees or, like, people on freelance contracts. But I would always want them to be specific people that I can talk to, under the supervision of the project. Because after you get past the prototyping stage, the task of labeling data isn't this discrete event where you do it once, get back this batch of data, and then the project just kinda shuts down. It'll be something where you constantly want this feed of data and feed of examples, so that you can keep monitoring the model and keep improving it over time. And you don't wanna have it as these, like, discrete contracts where the data is going to be different each time you go back to the service, because you're getting it done by different people with different standards, potentially with different pricing. It's, you know, basically something that you wanna have consistent control over over time, because your needs will change as well. You're going to find that, oh, okay, I wanna adjust slightly the way that the data is annotated, because there's this problem in the application or in the model that needs to be solved. And
[00:29:34] Unknown:
the third component that we mentioned at the opening, and that ties into this whole ecosystem that you're building out, is the Thinc project, which you mentioned was extracted from the spaCy project originally. I'm wondering if you can just talk a bit more about the motivation for releasing it as its own library and some of the primary problems that you're aiming to solve with it within the ecosystem of machine learning and data science?
[00:30:00] Unknown:
Yeah. So spaCy always kind of came with its own machine learning implementations. Initially, it was, you know, basically a pretty simple linear model that was optimized to work with very sparse features, using the averaged perceptron algorithm. For these linear models, it was pretty common in NLP for, you know, basically everybody to implement them themselves — most other parsers would have their own linear model implementations lurking within them. So I did it the same way, and I found that, you know, basically a helpful way to keep the model efficient and working well. And then over time, as the neural network models came in, I had already been implementing neural network code. Right about when PyTorch came out, I was basically done with the models we were experimenting with for spaCy 2. So if PyTorch had come out earlier than that, you know, before I'd basically done all of that work, there's probably every chance we would have just used PyTorch from the start. But one of the advantages that we saw in spaCy 2 of sticking with our own implementation was that we could, one, make the library a little bit smaller, because we only had to implement the models that we needed, and we didn't have to drag in these whole huge binaries from an external library. And we were able to make sure that we didn't have a dependency on a specific version of PyTorch, because we knew that the library would evolve quickly. And we wanted to make sure that people never hit a situation where they had two projects, and spaCy needed a particular range of versions of PyTorch, and then their other code needed a different range of versions of PyTorch, so that they had this version lock. And so that was always something that we were conscious of as well. So, you know, over time, we have kept using our own implementations. But as I mentioned, more and more people have wanted to interact with the machine learning layer underneath — people need to be able to define their own models and bring their own models into, you know, spaCy and Prodigy. And so what we decided to do, rather than standardizing on PyTorch directly, was to sort of adapt Thinc into a library that could sit as a wrapper around different machine learning solutions underneath. So in addition to Thinc's own implementations of things, you can use it as just sort of an interface layer above PyTorch. You can really easily define any PyTorch model that you want, and then Thinc is just the interface layer that interacts with spaCy. So that was what we set out to do. And the way that we approached that was to really think about, you know, what's the sort of lightest weight or most minimal interface that is necessary for this type of deep learning library. And we ended up with a functional programming inspired design that features a very minimal interface of, like, a single model class.
And the actual work of the layers is done in function definitions. And instead of bringing in a definition of, you know, some sort of autograd mechanism or tape-based differentiation like you have in PyTorch, there's just this convention of callback mechanisms. And then the different relationships between layers — like, say, a feed-forward relationship, or concatenation, or subtraction or something — are all handled by higher-order functions. So we ended up with a design that's really quite lightweight and minimal, and the library itself is quite small and easy to read. And this means that you can really bring any model that you want from another library, whether PyTorch, MXNet, or TensorFlow, and plug it into Thinc and into spaCy.
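A minimal sketch of that callback convention, following Thinc's documented Model API (the "double" layer here is a toy example made up for illustration):

```python
from thinc.api import Model, Relu, Softmax, chain

def double_forward(model: Model, X, is_train: bool):
    # The forward pass returns the output plus a callback that backpropagates.
    def backprop(dY):
        return dY * 2.0  # gradient of Y = 2 * X with respect to X
    return X * 2.0, backprop

double = Model("double", double_forward)

# Relationships between layers are handled by higher-order functions:
model = chain(Relu(nO=32), double, Softmax())
```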
We also built out a few other features that we felt would be very helpful, and the main one addresses a problem that I think is kind of underrated in machine learning: the problem of configuration. So we always had this problem in spaCy of how to pass configuration through a tree of objects. So one way to do it is that you pass into some, you know, component a whole tree of configuration that defines that model, and then maybe the components of the model, etcetera. So you have this blob of configuration you pass top-down into something. But this means that as soon as that component has pluggable sub-pieces, it can never know, you know, what features or what configuration options its individual parts will need. So if I wanna, say, configure a parser or a tagger, and I wanna give flexibility over the model, or allow people to change individual pieces of that model, then I have to pass all of this opaque blob of configuration forward. And then those functions that are being configured probably have defaults and things. So you end up with this problem of different defaults being set, and you can very often have problems where you think that you've overridden the default and you haven't.
So instead of passing the configuration top-down like that, we have a way of defining the configuration bottom-up, through the config file, and letting the tree of objects be defined and sort of brought in from that. And we've found that really helpful in keeping the code clean and helping to manage this problem of defaults. And then finally, we've got typing. So Python 3 has type declarations, and we've really made good use of these in Thinc. It's the first time we've really seen really full support for NumPy arrays and things in, you know, a PyData ecosystem — so that you can get static type errors for something like indexing an array in a way that's invalid because it's, you know, a three-dimensional array and you used too many indices or something.
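As a rough sketch of that bottom-up config pattern, using Thinc's documented registry and Config APIs (the "my_mlp.v1" name and its hidden_width setting are invented for this example):

```python
from thinc.api import Config, Model, Relu, Softmax, chain, registry

@registry.layers("my_mlp.v1")
def make_mlp(hidden_width: int) -> Model:
    # Each registered function declares exactly the settings it needs, so the
    # tree of objects is built from the leaves up -- no opaque blob of
    # configuration passed top-down, and no hidden defaults being overridden.
    return chain(Relu(nO=hidden_width), Softmax())

CONFIG = """
[model]
@layers = "my_mlp.v1"
hidden_width = 64
"""

config = Config().from_str(CONFIG)
model = registry.resolve(config)["model"]
```

The array typing lives in thinc.types, with types like Floats2d that a static checker can use to catch the kind of dimension mistakes described above.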
[00:35:45] Unknown:
So, yeah, I think that that's an exciting feature that will be very helpful to people as well. And one of the things that I thought about as I was looking at Thinc is, because of the fact that it acts as a high level wrapper for multiple different frameworks for doing deep learning, it puts it in some sense in the same space as the Keras project. And I'm curious what your sense is of the comparison
[00:36:11] Unknown:
of Thinc versus Keras, in terms of acting as that wrapping layer. Yeah. So I really found the functional programming style in Keras, you know, very interesting when I first saw it. And it was definitely something that helped inspire the approach that we took in Thinc. But over time, the sort of focus of Keras has shifted a little bit. And, you know, it is really part of TensorFlow these days — it's sort of the main interface into TensorFlow that people use, and it's really a high level API for TensorFlow that's quite coupled into the TensorFlow ecosystem. So I would say that the focus is a little bit different with Thinc, and we've also benefited from coming in a little bit later and being able to come up with a design that's a little bit tighter and that will be able to maintain a little bit more consistency over time. So, you know, we really hope that we will not have to make breaking changes over time, and that we are able to basically keep the design quite concise and coherent.
So I would say that the use case is a little bit different, and, you know, Keras is not so much a wrapper around different things as, you know, a key part of TensorFlow specifically.
[00:37:21] Unknown:
And then, because Thinc also acts as its own framework for building these neural nets and doing deep learning, I'm curious what you have found to be the strategy that you use in terms of determining when to do something entirely in Thinc, versus when to incorporate either PyTorch or TensorFlow or some of those other frameworks into the network as a component of the Thinc project, rather than doing it entirely without those frameworks?
[00:37:51] Unknown:
So our goal is to avoid having, in the long term, models which have a strict dependency on PyTorch or TensorFlow in the core pipeline APIs in spaCy, because I think that this does make things sort of operationally simpler for most spaCy users. So the way that I see it is that PyTorch in particular is a really excellent compiler of these architectures — basically, you can implement things in a very sort of neutral way without worrying about the performance details, and PyTorch will pretty much always do a pretty good job of that. And so it's a lot easier to get to, you know, basically a good performance level without having to manage the specifics of the computation, and in particular the specifics of the device. But that said, if you take any specific architecture and you implement something in CUDA or C yourself, you can usually at least match the performance that you would get from something like PyTorch. So the way that I would do it is that when I'm experimenting with something and I wanna, say, try out a GRU — well, I might not have a GRU implementation in Thinc, so I can, you know, just plug in the PyTorch one.
And there's no performance penalty for doing that. There's no overhead in translating the tensors from CuPy to PyTorch — it uses the DLPack format. And when you have PyTorch, we have a setting that lets CuPy allocate memory via PyTorch as well, so you've only got one memory pool. So there's really no disadvantage to that; you just have to have PyTorch installed. So there'll be all sorts of architectures where it's either easier to implement it initially in PyTorch, or, you know, there's already PyTorch code for it, where I would be using that directly. And then eventually, if we want to provide that to spaCy users, or if I feel like I can do a little bit better than PyTorch in optimizing that specific architecture, I would switch that over to Thinc.
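A rough sketch of what that looks like with Thinc's PyTorchWrapper (a Linear module is used here for simplicity, since modules that return tuples, like a GRU, need custom conversion functions):

```python
import numpy
import torch.nn
from thinc.api import PyTorchWrapper, Relu, chain

# A single-tensor-in, single-tensor-out PyTorch module drops straight in.
model = chain(Relu(nO=64), PyTorchWrapper(torch.nn.Linear(64, 32)))

X = numpy.zeros((8, 16), dtype="f")
model.initialize(X=X)  # infers the missing input dimensions
Y = model.predict(X)   # an (8, 32) array
```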
The other thing is that sometimes you have these sort of high-level building blocks of models, and some of that composition is actually easier to do in Thinc, by just thinking about them as different sort of functions that you plug together. So in particular, I'm used to the way of writing things in Thinc, and I find it, you know, kinda concise for defining models and trying different things out. But for different components, you know, maybe there'll be something where it's easier to have a PyTorch wrapper around it. Now, users of spaCy and users of Prodigy will almost always be more familiar with PyTorch, and they'll want to work more directly in PyTorch — especially initially, they'll want to have that as the development framework, and then they can just use a thin wrapper from Thinc around it. And, you know, over time, maybe they'll decide to do some specific thing in Thinc rather than doing it in PyTorch directly. But the aim is to let people work with the frameworks that they wanna work with. And I imagine that most developers will want to work with, you know, PyTorch — that's a pretty standard technology in machine learning.
[00:40:50] Unknown:
So all of these tools that we've been talking about are open source, and they're something that you're working on as a core element of your business. And I'm curious what you have found to be some of the biggest challenges in terms of building and maintaining these tools to meet the needs of data scientists and machine learning engineers, and the approach that you're taking to making them sustainable.
[00:41:11] Unknown:
So one thing that's definitely difficult about this is that the technologies are changing so quickly, both in terms of the research underneath it and also the software ecosystems around things. We have to strike the right balance between maintaining a good amount of backwards compatibility and stability for people, while also moving quickly enough to take advantage of new opportunities from technology and new integrations with things, and basically providing a better experience and improvements to data scientists. So I would say that that's definitely something that's, you know, challenging about this type of work — to be pushing ourselves to deliver the best quality software that we can. You know, certainly, there's this constant background thing of, I don't know, the continuous integration systems changing underneath us or something. So at some point, we implement everything and get set up with, you know, Travis and AppVeyor and CircleCI.
And then, okay, Azure Pipelines comes out, and we see that, okay, that's a better option, so we migrate our things there. And different wheel formats and different build tools and things. So there's this sort of background level of, you know, basically the boring problems of the technology stack around us changing and improving, and different libraries and all of these things. So that's definitely something that occupies a surprising amount of time — just all of the rest of these, like, ecosystem things and interactions with, you know, all of the other software that you wouldn't think of as sort of core parts of solving the problem, but it's definitely things that need to be done to keep delivering high quality software. And then as far
[00:42:48] Unknown:
as the most interesting or impressive projects that you've seen, I'm curious what you have found to be notable and worth calling out, either things that your team is creating with the tools that you're building or things that you have seen built with those tools that you're releasing?
[00:43:06] Unknown:
So we're always really blown away by seeing all of the things that people are building with spaCy. And this is definitely something that's, you know, constantly increasing and improving. So we have a collection of these projects on the website, called the spaCy Universe. Two which I wanna call out in particular are spaCy pipelines for specific types of text. So one is Blackstone, which is a spaCy pipeline for legal text processing, and another one is SciSpacy, which is a spaCy pipeline for biomedical text processing, developed by the Allen Institute for AI. Another project that I think is super cool is this information extraction system called Holmes, based on predicate logic. So that's something that I've always wanted to dig into a bit more. It was developed within a company and then, you know, basically kindly open sourced. So it's really quite a substantial project that I think is definitely cool. Another one which we developed internally that people might wanna check out: we have this project called sense2vec, where we train word vectors based on, you know, text that's been preprocessed with spaCy — so noun phrases have been merged into one token, or entities have been merged into one token.
And early this year, we ran spaCy over all the text in Reddit — so this was, you know, several billions of words, from 2010 to 2020. And we used this to get the entities and basically make vectors. And then we precomputed similarities for those vectors on GPU, so that you can get pretty much instant nearest neighbor queries across all of these, you know, terms. So you can find similarities across entities and things, which is quite cool to play with.
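A quick sketch of playing with those vectors, using the sense2vec package's documented API (the vector path is a placeholder for wherever the pretrained Reddit vectors are downloaded):

```python
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_vectors")

# Keys combine the merged phrase with its sense tag.
query = "natural_language_processing|NOUN"
if query in s2v:
    # Nearest-neighbour queries run over the precomputed similarities.
    print(s2v.most_similar(query, n=5))
```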
[00:44:50] Unknown:
As a contributor and maintainer of all these projects, and as somebody who is running a business that relies on them, what have been some of the most interesting or unexpected or challenging lessons that you've learned over the past few years?
[00:45:02] Unknown:
So one of the things that's, you know, definitely important is the way that the projects are documented and communicated. And, you know, this also stems back to initial design decisions as well — basically making sure that things are consistent in the project and consistent in the libraries. And I think that this really makes things more useful and open to a wider audience. So this was something in particular that improved a lot with my collaboration with Ines. She's been really a driving force in getting the guides and, like, the level of explanation that we deliver to basically a higher level. And I think that that's something that's really been setting apart some of our projects as well. And we saw this in particular when we went back and did Thinc — there were so many things where, you know, we felt like we'd done this before: setting up these libraries, setting up a tool which people would find useful, and, you know, ways of doing the documentation, things that people would need.
So we've learned a lot from the types of questions that people have, and the types of API design decisions that will, you know, lead us into maintenance problems or, you know, be confusing to people. And we've been able to head off some of those things at the start with Thinc, which we've been pleased about. So those are all things which, you know, I definitely think that we've learned as well. And also just the ways of setting up the testing, and making sure that the code is well tested and testable, to avoid some of these bugs in the first place. And
[00:46:26] Unknown:
as you look to the future of the Explosion company and the projects that you're building there, what do you have planned, and what are you most excited for?
[00:46:34] Unknown:
So one of the things that we've been working on for a long time, and are excited to finally get out, would be an extension to Prodigy called Prodigy Teams, which has more of a sort of team management interface and, you know, has a hosted component where you can allocate work to individual annotators, start and stop annotation tasks, and things. So that's something that we've been working hard on. Because part of it is a more managed architecture, and there's a more intricate web app behind it, it's taking us a little longer to develop. But it's been going well, and we've been really excited to get that out to people. And, yeah, that's basically the main thing that we've got in the pipeline, as well as getting spaCy 3 out, which will make it much easier to use transformer models and really allow you to bring your own model — you know, basically make it easier to interact with those technologies in spaCy. Are there any aspects of your work at Explosion
[00:47:28] Unknown:
or on the spaCy and Prodigy and Thinc tools that we didn't discuss yet, or anything else in the space of natural language processing and deep learning that we didn't discuss that you'd like to cover before we close out the show? Yeah. So one of the things that's been different since we last spoke is we've managed to grow a small but extremely effective team that we've been working with. So
[00:47:49] Unknown:
spaCy's core maintainers now also include Sofie and Adriane. So we've got a team of now four people working on it full time, in addition to myself and my cofounder, Ines, and that's really been helpful. And then we also have Sebastián Ramírez, who's joined the company. We started working with him because we started using his open source library, FastAPI, which I think is a really great tool that, you know, any Python developer who needs to write REST APIs should check out. And so he's been working on Prodigy Teams, and we've got another developer, Justin, who's in the US, who's working on Teams as well. So, yeah, we've now got a few developers working on this. On the scale of things, it's still quite a small team, but, you know, we really feel quite blessed to be working with people who are very effective and all work quite independently. And, you know, I feel like it's a very fun collaboration that we have. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And this week, I'm going to choose the movie Onward that I watched recently with my family. It's a movie aimed at younger kids, but it's great for the whole family. Hilarious,
[00:49:03] Unknown:
really interesting storyline. Just had a great time watching that. So if you're looking for something to watch with the whole family, I'll definitely recommend it. And with that, I'll pass it to you, Matthew. Do you have any picks this week?
[00:49:14] Unknown:
So outside of the, you know, AI space, a lot of my time over the last few weeks has been spent following the coronavirus pandemic. So I'm sure by the time you listen to this, anything that I say will be different. But I guess, just stay safe with that. And, you know, in terms of picks for things within the ecosystem, or recommendations to make, one project that I think is really cool that people might check out is this library called Ray, which is developed by some people originally from a lab at Berkeley. And I think it's a really cool way to, you know, basically write distributed applications for machine learning in Python.
So it's still quite young, but I think they've got a nice design, and it's something that, you know, I think will continue to be popular and is one to check out. Yeah. I'll definitely second that one. And the original creators of the library have also founded a company called Anyscale
[00:50:10] Unknown:
to try and accelerate the development of that framework and turn it into a viable business. So definitely something to keep an eye on there. Mhmm. Well, thank you very much for taking the time today to join me and share the work that you're doing at Explosion on all of your different projects. Definitely a lot of interesting tools that contribute a lot to the ecosystem. So I appreciate all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks. You too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes.
And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Matthew Honnibal
Matthew's Introduction to Python
spaCy vs. NLTK
Explosion AI and Prodigy
Impact of BERT and GPT-2 on spaCy
Transfer Learning and spaCy
Supporting Multiple Languages in spaCy
Prodigy Tool and Data Labeling
Challenges of Data Labeling
Scaling Data Labeling
Thinc Project Overview
Thinc vs. Keras
Building and Maintaining Open Source Tools
Notable Projects Built with spaCy
Lessons Learned and Future Plans
Team Growth and Collaboration
Closing Remarks and Picks