Summary
Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages, the capabilities of existing models are severely limited. In order to help overcome this limitation, Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Polyglot is and your reasons for starting the project?
- What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
- A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
- What is involved in adding a new language to Polyglot?
- Which families of languages are the most challenging to support?
- What types of operations are supported and how consistently are they supported across languages?
- How is Polyglot implemented?
- Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
- How much domain knowledge is required to be able to effectively use Polyglot within an application?
- What are some of the most interesting or unique uses of Polyglot that you have seen?
- What have been some of the most complex or challenging aspects of building Polyglot?
- What do you have planned for the future of Polyglot?
- What are some areas of NLP research that you are excited for?
Keep In Touch
Picks
- Tobias
- Rami
- The Wizard and the Prophet: Two Remarkable Scientists and Their Dueling Visions to Shape Tomorrow’s World by Charles C. Mann
Links
- Polyglot
- Polyglot-NER
- Jordan
- NLP (Natural Language Processing)
- Stony Brook University
- Arabic
- Sentiment Analysis
- Assembly Language
- C
- .NET
- Stack Overflow
- Deep Learning
- Word Embedding
- Wikipedia
- Word2Vec
- NLTK (Python Natural Language Toolkit)
- SpaCy
- Gensim
- Morphology
- Morpheme
- Transfer Learning
- Read The Docs
- BERT (Bidirectional Encoder Representations from Transformers)
- FastText
- data.world
- Quilt package management for data
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so check out Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. Go to pythonpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers.
Clubhouse lets you craft a workflow that fits your style, including per team tasks, cross project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And keep the conversation going at pythonpodcast.com/chat.
[00:01:26] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages. So, Rami, could you start by introducing yourself? Hi. This is Rami. I work currently
[00:01:41] Unknown:
at Google, in language understanding, and Polyglot was my PhD project before joining Google. Before that, I grew up in Jordan and finished computer engineering there, then worked at a Turkish university teaching computer science. After that, I got a scholarship to do my PhD in machine learning and natural language processing at Stony Brook University, State University of New York. During my PhD, I was interested in understanding language, and I still remember to this day when I joined my lab and was talking to my advisor. He pointed me to some tools they had developed for English sentiment analysis and showed me how complex their system was: it consisted of several pieces and modules that you have to hook up together and work on. And then my adviser switched to asking me, can you do it in your native language, which is Arabic?
After a semester of working on Arabic sentiment analysis, I failed miserably, and I realized that if I want to solve it for Arabic, I should solve it for all languages. And I had better think about a different strategy for how to tackle a large number of languages instead of one at a time. Because once I'm done with the first one, I will be asked for the second one and the third one, and definitely at some point my skills in that language will be too limited to do anything useful. So we should let the machine learn, not ourselves.
[00:03:22] Unknown:
And do you remember how you first got introduced to Python?
[00:03:26] Unknown:
So Python is not something my university used to teach. There was a lot of emphasis, as electrical engineering and computer engineering, on assembly and C. But the computer science side used to be the .NET languages like C# and VB, Visual Basic, and so on. I was interested in Python because I found the syntax really easy to remember and write. And I'm kind of a lazy programmer, and most of my projects were things I wanted to build, not necessarily the kind of jobs I had to do. So it was, you know, quite a productive environment to work in. So every time I came up with an idea or something, Python would be my first language. So I learned it, you know, by myself.
I used to hang out in the Python IRC channel. And every time I hit a bug, and these are the days before Stack Overflow, I would just be nagging the people there. They would be annoyed, but they would answer me eventually. And little by little, there was never a single point where I had to master the language, but over time, because it's my default language and my comfort zone, I started mastering it more and more.
[00:04:45] Unknown:
And so you were mentioning that in the process of your research, you were tasked with trying to add support for natural language processing for Spanish and Arabic, and that led you to wanting to build a solution that would work for all languages. So I'm wondering if you can describe a bit about the Polyglot project itself and, some of your experience of getting that started and how it works.
[00:05:11] Unknown:
So, the Polyglot project, the aim of it is to elevate the status of NLP projects and packages, in the sense that when people develop something called NLP, natural language processing, they should be working with the mindset that their system should be as applicable as possible to all languages. And that's really a hard task. That's a research task. So, you know, the current status, that many packages do not support many languages, comes from the fact that it's not an easy thing to do. But I believe compromises have to be made to actually put coverage of languages as a first priority over everything else. So Polyglot was initiated from the fact that there was a lot of progress in deep learning around the time I started the project, where people historically used to calculate features from languages manually. So they would look at casing and character n-grams, and they would look at annotations like verb, subject, and so on. And these annotations and features are not consistent across languages and, you know, need a lot of domain knowledge in each language. So a lot of progress happened in deep learning in the sense that we can learn feature representations of words without actually knowing the language, and that comes just from the corpus statistics of the language.
So when I realized, oh, I can learn features of every word in every language automatically, we call it word embeddings. And a word embedding means a vector that represents a word. Imagine a simple case where you have a 3D room and I'm telling you to put the words in the space according to their meaning. So in one corner you will find, like, places, and in another corner you will find names. But in another corner you will find verbs, and the verbs that describe walking are closer to each other than the verbs that describe drinking. If I tell you to do it as a human, you will really spend a lot of time, not sure how to place the words and whether the space will be enough, given it's only 3 dimensions. What our machine learning is trying to do is literally place these words at points in the space, but the dimension usually is way higher. We are talking about between 50 dimensions and 500 dimensions. And when we look at the nearest neighbors for every word in that space, we find they correspond really well to our understanding of the meaning of these words. So you will find Obama closer to Bush, and you will find Apple capitalized closer to Google but apple lowercase closer to orange, dog close to cat, coffee close to tea, and so on. So, given this technology, I decided, okay, why not develop these word vectors for all languages?
One challenge to do that is, where is the corpus that could cover a hundred languages? And that by itself is a really tricky business. So every time there is a problem where you cannot get resources for multiple languages, usually my first answer was Wikipedia. First, the corpus is available, you know, to download easily. But not only that, there is a significant volume of text in each of the languages supported by Polyglot. So the first step was to learn word representations, which we call word embeddings, for each language, and that meant taking the state-of-the-art methods at the time and running them on the Wikipedia corpus for each language, and that ends up with a dictionary where you have, for every word in that language, a vector representing that word.
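To make that recipe concrete, here is a minimal sketch of training per-language word embeddings on a Wikipedia-derived text file, using today's gensim Word2Vec API rather than Polyglot's original 2013 training code; the file name and query word are hypothetical.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One plain-text file per language, extracted from a Wikipedia dump,
# one sentence per line. The file name is hypothetical.
sentences = LineSentence("arwiki_sentences.txt")

model = Word2Vec(
    sentences,
    vector_size=100,  # gensim >= 4.0; the episode mentions 50-500 dimensions
    window=5,         # context window around each word
    min_count=5,      # drop very rare words
    workers=4,
)

# Nearest neighbors in the learned space tend to track meaning:
# country names cluster together, verbs of motion cluster together, etc.
print(model.wv.most_similar("ملك", topn=5))  # hypothetical query word ("king" in Arabic)
```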
[00:08:49] Unknown:
So, it sounds like Wikipedia is essentially your digital Rosetta Stone for being able to establish some sort of commonality between the languages and the features that you're trying to optimize for in being able to generate these word vectors to be able to establish an appropriate vector space for doing analysis and, entity extraction from any sort of arbitrary corpus of text within a given language?
[00:09:19] Unknown:
Yes. I mean, if we go through every task that Polyglot achieves, for a lot of them, not only were the features extracted from Wikipedia, but we also utilized the structure of Wikipedia to actually help us with the annotations. So, I'm not sure if you want me to delve deep into these now or we get into them later.
[00:09:40] Unknown:
And so for Polyglot itself, what are some of the main use cases that it enables which would be either impractical or impossible with something such as the NLTK or spaCy libraries or Gensim that don't have as broad language support?
[00:09:59] Unknown:
I mean, basically, the distinguishing thing in Polyglot is the number of languages we support. That came at a cost in terms of quality. So if your only concern is English, the other libraries offer better quality, because the technology they use depends heavily on resources only available for English. While developing Polyglot, we had to rely on heuristics and a lot of tricks and hacks to get annotations and pseudo-annotations to develop tools for other languages, but that came at a cost of quality. So if you are a person who wants to work on a language that these libraries do not support, Polyglot will be your first start.
If you are interested in the semantic meaning of a corpus in other languages, the word embeddings are a good start. And, yep.
[00:10:56] Unknown:
And you mentioned that there's a significant amount of effort involved in being able to add support for any given language to a natural language tool such as spaCy or NLTK. So I'm wondering if you can dig a bit more into the types of difficulties that you would face when adding support for a new language as a sort of incremental step, versus the approach that you're taking with Polyglot of doing it with a deep learning approach?
[00:11:26] Unknown:
So let me explain. Let's suppose we want to develop an application called named entity recognition, and the goal here is to extract from a document the entities that were mentioned, like locations, persons, organizations. In a typical setup that NLTK or spaCy or Stanford NLP would take, first, you have to develop annotations. So you would go after a corpus, and the corpus will not be annotated. So you will have humans annotating where the entities show up, and that involves, like, tens of thousands of entities that have to appear. So you are going over, like, you know, hundreds of thousands of sentences. Second thing, you will go and develop features for that task. And usually, for named entity recognition, people would look at character n-grams and casing, which is not consistent across languages.
But you still will rely on them. Then, given the features you extracted and the supervised data you have, you will train a model, you know, to learn named entity recognition. In the Polyglot world, the approach was quite different. We learned features for these languages using Wikipedia. So if you have a new Wikipedia for a new language, you just have to run a tool like Word2Vec, and that will embed your words into vectors. For annotations, given we didn't have the resources and these resources are not gonna scale for our languages, we relied on the link structure of Wikipedia. So what does that mean?
In the first couple of paragraphs of a Wikipedia article, if an entity gets mentioned, usually it gets annotated and linked to its own page. So we know the set of pages in Wikipedia that correspond to entities. So if a mention points to a page that is already an entity, we will call that mention, that phrase, an entity phrase. So, automatically, that allows us to have a huge amount of supervised, labeled corpus. Now, that doesn't come with the highest quality we would expect, because Wikipedia asks its writers to only annotate the first mention but not the later mentions in the document. So in our paper working on that problem, we designed a couple of heuristics to accommodate that Wikipedia style. First, we have higher confidence in the positive labels than the negative labels. So if something is mentioned as an entity, we're quite confident. If something was not marked as an entity, we have less confidence, because it could just not have been annotated according to the standard. And, second, we usually focus on the first couple of paragraphs of each article but not the later ones. So that ends up with a huge amount of annotations.
Given the features we can automatically extract from the text, and the annotations, we trained a deep learning model, a feed-forward network, to actually predict these annotations.
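As a rough illustration of the link-structure heuristic described above (not the project's actual pipeline), the sketch below treats a wiki link as a positive entity mention whenever its target page is known to be an entity; the ENTITY_PAGES dictionary is a hypothetical stand-in for the Freebase-derived page types.

```python
import re

# Hypothetical mapping from entity pages to types; the real project
# derived page types from Freebase rather than a hand-written dict.
ENTITY_PAGES = {
    "Barack Obama": "PERSON",
    "Google": "ORG",
    "Amman": "LOC",
}

# Matches [[Target]] and [[Target|surface text]] wiki links.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def weak_entity_labels(wikitext):
    """Yield (surface_text, label) for links that point to known entity pages."""
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        label = ENTITY_PAGES.get(target)
        # Positive labels are trusted; missing links are only weak negatives,
        # since Wikipedia style links just the first mention.
        if label:
            yield surface, label

paragraph = "[[Barack Obama|Obama]] visited [[Amman]] to meet engineers from [[Google]]."
print(list(weak_entity_labels(paragraph)))
# [('Obama', 'PERSON'), ('Amman', 'LOC'), ('Google', 'ORG')]
```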
[00:14:32] Unknown:
And for somebody who's interested in adding an additional corpus for a given language to Polyglot, to more easily fit the particular problem space or set of topics that they're going to be working with within their project, what's involved in being able to either update the model or retrain Polyglot to handle some of these additional entities? And how much upfront work is necessary as far as labeling or feature extraction for Polyglot to be able to take advantage of the corpus that they're introducing?
[00:15:10] Unknown:
That's a great question. So, basically, the work on Polyglot started in 2013, and since then we are more than 5 years down the line. The technology has evolved so significantly that there are new methods to develop word embeddings that definitely outperform the way we have been doing it in Polyglot over Wikipedia. My first go-to is Word2Vec, which is available through Gensim, to develop the word embeddings. And, basically, why am I advising that? Because the methods have been constantly improving, and you are aware of the machine learning and deep learning revolution happening in the last 6 to 7 years. So the technology is consistently changing, even up to the last 2 weeks, where a team here at Google also published embeddings with a new method, over so many languages, called BERT.
So things keep moving fast. So if you want to add a new language, I would say the first thing I would focus on is developing word embeddings for that language. Two things I would focus on: first, finding the corpus that is closest to the domain you care about. So if your domain looks like news, it would be great if you collect a corpus of news. If your domain looks like Wikipedia, then you are good to go. So, first thing, collect a corpus that looks as close as possible to the domain you care about. Second thing, use one of the state-of-the-art tools to develop word embeddings, using Word2Vec or BERT or FastText.
Gensim encompasses several of them, but not all of them, and the other tools are also available in Python. Once you have your word embeddings or phrase embeddings, now you come to annotations. That is a common question I get about Polyglot: oh, Polyglot annotates organizations, persons, and locations, but I need more. And the problem with a lot of NLP applications is just that the annotations are hard to get. While in Polyglot we relied on the Wikipedia taxonomy, there is no reason not to do it with a larger taxonomy. So when we went to Wikipedia and tagged the annotations of each page, we only concentrated on organizations, persons, and locations.
So what you really need to do is re-annotate the text according to a wider definition of pages. So any mention that points to pages beyond these 3 types, you will also introduce as a new label. So, basically, what really needs to happen is re-extracting the corpus from Wikipedia. On our page, we already released the corpus we extracted, but that doesn't include the taxonomy. And in the past 3 years, we relied on Freebase to give us annotations over Wikipedia pages to classify them, and Freebase is not available anymore. So I'm not sure if Wikipedia nowadays has its own taxonomy on top of its pages, and how things go. It used to be also really hard to extract clean text from Wikipedia because of the wiki markup.
But over the last 3 years, they also developed tools to make it easier and easier to do that. So on one side, it became easier to get clean text out of Wikipedia, but on the other side, we lost Freebase as a source of taxonomy over Wikipedia.
[00:18:50] Unknown:
And are there any particular families of languages that are more challenging to support than others whether because of the ways that their text is encoded or the grammatical structures or the way that they refer to entities within the language or the semantics of how the language is constructed?
[00:19:12] Unknown:
Absolutely. So there are different sets of complexity that come for different reasons. One of the complexities is tokenization. For tokenization in East Asian languages, if you look at the Polyglot code, we end up using a different tokenizer than the one we use for the rest. And that comes from the fact that, you know, Chinese script and Japanese and so on, especially Japanese because of the mix of 3 different writing systems, end up really hard for tokenizers to get right. So that's one issue.
Another issue comes from languages like Turkish, Arabic, and Finnish, which have really sophisticated morphology, and their morphology ends up creating so many words that are just compound words. So what do I mean by that? Usually, in Polyglot, we concentrate on the top 50,000 to 100,000 words. How do we select the top 100,000 words? The most frequent ones. And usually in English, the most frequent 100,000 words cover almost 98% of the text. When it comes to a language like Arabic or Turkish or Finnish, with the most frequent 100,000 words, you barely cover 90% of the corpus.
And that leads to degradation: if a word is not in your dictionary, it's out of vocabulary, and out-of-vocabulary words get represented with one single vector, which means you don't give the classifier enough information to actually make a good decision. So languages that have complex morphology end up with a lot of sparsity, and languages where tokenization is not as easy also suffer. Now, both issues could actually be solved using a different approach to building NLP applications: instead of treating text as a sequence of words, why not rely on characters or bytes?
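One concrete remedy for the out-of-vocabulary problem described here is subword (character n-gram) embeddings. This was not part of the original Polyglot; the small gensim FastText sketch below only illustrates the idea of composing a vector for an unseen word, using a hypothetical toy corpus.

```python
from gensim.models import FastText

# Tiny hypothetical corpus with Turkish-style inflected/compound forms.
toy_corpus = [
    ["evlerimizde", "oturuyorduk"],
    ["evde", "oturmak", "istiyorum"],
    ["arabalarimizla", "geldik"],
]

model = FastText(
    toy_corpus,
    vector_size=32,  # small, for illustration only
    min_count=1,
    min_n=3,         # character n-gram range used for subwords
    max_n=6,
)

# "evlerinizde" never appeared in training, but FastText composes a vector
# from its character n-grams instead of mapping it to one shared OOV vector.
print(model.wv["evlerinizde"].shape)  # (32,)
```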
So if Rami were to rebuild Polyglot in 2018, what he would do is build representations of language from the characters up, and that would allow you to represent languages with complex morphology like Arabic and Turkish. Also, you don't need a tokenizer or normalizer. And that's a great blessing, because normalizers and tokenizers do not work consistently and robustly across languages. And every library has a really different
[00:21:58] Unknown:
scheme of tokenization depending on the language. In terms of the types of operations that Polyglot supports, I know that it is able to do, as you mentioned, tokenization and entity recognition, and being able to extract the different morphemes within the text. And when I was looking at the documentation, it looked like there were varying levels of support across languages for some of those different operations. So I'm wondering if you can talk a bit about the challenges of supporting some of those different processes that you might want to run against the given languages and how the nature of the language itself may complicate being able to support it as thoroughly as some of the other languages?
[00:22:42] Unknown:
So, basically, we do have varying support for every prediction task. For part-of-speech tagging, we have support for 40 languages, and for named entity recognition we also have 40. But for morpheme analysis, we have more, and embeddings and tokenization we have for all of them. And by all of them, I mean, like, the 100 languages we support. The varying support comes from the varying level of annotations we can acquire. As I mentioned earlier, for named entity recognition, we relied on Wikipedia to extract mentions.
But for part-of-speech tagging, we relied on the NLP community to get annotated corpora, and these corpora only cover 40 languages. They don't cover more, and they are expensive to collect. Morpheme analysis is an unsupervised task, in the sense that it doesn't need an annotated corpus, so it was easy to push it to all languages. So, really, the bottleneck for developing new languages comes from: do you have annotations for the task you need? And, again, annotations are costly, in the sense that either you buy them by making a Mechanical Turk task where you ask people to annotate the corpus for you, or you do it manually, or, as I did in my PhD, you do a lot of clever data mining tricks that almost get you there without actual human intervention.
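For reference, here is a short sketch of what that per-task API looks like, with attribute names as they appear in the Polyglot documentation. The per-language models have to be downloaded first (see the downloader sketch further down), so treat this as an outline to check against the current docs rather than a guaranteed snippet.

```python
from polyglot.text import Text

# Attribute names as shown in the Polyglot documentation; verify against
# the current docs, and download the per-language models first.
text = Text("President Obama met engineers from Google in Amman.")

print(text.language.code)  # language identification, e.g. 'en'
print(text.words)          # tokenization
print(text.pos_tags)       # part-of-speech tags (supported for a subset of languages)
print(text.entities)       # named entity chunks (I-PER / I-ORG / I-LOC style)

word = text.words[1]
print(word.morphemes)      # unsupervised morpheme segmentation
```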
[00:24:10] Unknown:
And so I'm wondering if you can talk a bit more about the way that Polyglot itself is implemented in terms of the software design and architecture and some of the, major libraries or services that you leverage to be able to put it together?
[00:24:28] Unknown:
Yes. So, I mean, the design of Polyglot emphasizes simplicity, just to encompass a lot of tasks and a lot of languages. The library actually consists, in my mind, of 2 parts: the software and the models. And the models, you know, take a lot of storage, so it was hard to push binary blobs into GitHub to store them. So we decided to store them on a server, and we forked the NLTK downloader package to actually serve a Polyglot index of resources and download according to language or according to task. And these usually used to be hosted either on Google Cloud or on my university server.
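For reference, a sketch of fetching models through that forked downloader, using package names of the form shown in the Polyglot documentation; verify the exact names against the current index.

```python
from polyglot.downloader import downloader

# Package names follow a "<task><version>.<language>" pattern in the
# Polyglot index (e.g. embeddings2.en, ner2.en); check the index for
# the exact names.
downloader.download("embeddings2.en")  # English word embeddings
downloader.download("ner2.en")         # English named entity recognition model

# Equivalent from the shell:
#   polyglot download embeddings2.en ner2.en
```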
So that's one part. For the software part, Polyglot has dependencies on several libraries for tokenization and language identification. The tokenization comes from the ICU library, and language identification comes from Google's CLD, the C++ library they had, and Polyglot depends on that library. For the morpheme analysis, I used the work of a research group and used their Python library to actually train a morpheme analyzer for each language, given the corpora I have. Then I uploaded the models.
So, basically, Polyglot is a collection of other people's efforts plus the effort I did in my PhD. In terms of software design, I would say the greatest value of Polyglot is the pedagogical value, in the sense that I see the largest usage is from students who are taking a machine learning and NLP course for the first time. I get a lot of requests and questions from them because they find it kind of transparent. It's quite easy to go through the code and just figure out what is going on and modify it according to your class project. And
[00:27:02] Unknown:
is there any capacity for being able to integrate Polyglot with some of the other tools in the ecosystem, such as spaCy or Gensim, to leverage the additional algorithms or capabilities of those libraries once you've done the part-of-speech tagging or tokenization using Polyglot within the languages that are supported?
[00:27:31] Unknown:
I mean, some libraries approached us to use the embedding resources we have, and we were welcoming in general. I personally don't have the capacity, given my life and work schedule, to do more significant development other than maintenance.
[00:27:52] Unknown:
And for somebody who's building an application or a project and using Polyglot, is there any particular domain knowledge in terms of natural language understanding or natural language processing that's required to be able to use Polyglot effectively within that application? Or is it fairly easy to get started with a surface understanding of how the API operates and be able to leverage the capabilities of Polyglot to gain some capacity for being able to process free-form text?
[00:28:25] Unknown:
So the API, I think, will be straightforward to use, and I don't think you really need to know much about how it's working, because, you know, from language identification to tokenization to morpheme analysis and semantic similarity of words, all of these you could just use as the API offers them. One thing you need to be aware of is the way these models are trained: your expectations of what the models will do should be highly influenced by understanding what they were trained on, how they were trained, and under which conditions. A machine learning model will not do something it was not trained to do. So, again, for named entity recognition, for example, we trained it for 3 entity types, and people ask for more entities. And for more entities, you need to retrain the model; the model is not going to achieve that otherwise.
Also, it was trained on labels that were extracted automatically. So the quality of the classification will not match the state of the art in English offered by the Stanford package, for example, where a task was trained using a different research mechanism. And if the quality of the annotations being produced, or the results, are a little bit surprising, I advise people to just skim the paper on how we actually implemented these things, because given the way we implemented it, it's fair for the model to behave that way, but it may not match what you expect the model to do, if that makes sense. So, again, the goal of Polyglot is to push the community toward larger-scale coverage of languages, and we had to make compromises along the way. Some of them came down to not collecting human-annotated datasets for each task, for each language, because we thought that was not feasible. And
[00:30:24] Unknown:
are there any particularly interesting or unexpected or unique uses of Polyglot that you have seen? I'm actually surprised by
[00:30:32] Unknown:
the attention I get. I mean, my expectation was that this was a project to put all my PhD effort into and package it in a presentable way. But over time, I was surprised by the attention it got and the number of people using it, especially the number of startups. Again, there is a lot of demand for machine learning and AI to be integrated into so many applications, so I guess that pushes toward that. I'm delighted by the number of researchers and students who approach me from all around the world, and the nice words I receive from people I would never meet, in remote areas of the world, just make me, you know, quite happy. I still remember people approaching me from Indonesia, Bangladesh, Ethiopia, Angola. And, you know, in all honesty, I felt like that was quite a reward. Most of what I see otherwise is small teams trying to get coverage in new markets.
So they like Polyglot because it gives them a unified interface for tokenization and language identification across languages, and an easy API. I found that many schools, when they teach NLP to undergrads, tell them about the package, or the students find it themselves and find it easy to integrate into their projects. And in the process
[00:31:56] Unknown:
of researching and creating and maintaining Polyglot, I'm wondering what you have found to be some of the most interesting or unexpected lessons learned and some of the biggest challenges that you faced along the way? So there are things I didn't anticipate when I was writing the code. One thing is the huge demand for Windows support.
[00:32:18] Unknown:
I have historically been using Linux since, like, 2006, when I was an undergrad. So in my mind, when I was developing the package, there were things that felt to me like obvious defaults, like, you know, path separators and so on. And I said it clearly in the documentation: this is only supported on Linux. But if you look at what people usually request, and the questions, pretty much more than two thirds of them are like, we need to run this on Windows. And I don't have a machine that has Windows, and I don't even have the experience to develop Python on Windows. So that's kind of a weakness on my side, but I'm not sure if the community as a whole has tools that make it easy. I know that Python has tools to convert Python 2 code to Python 3, but are there tools that will package your software such that it's tested and won't break on any platform? At least, you know, semantically it's not gonna break; at least the way you constructed paths and so on will all align, you called all the right functions, and you didn't use lower-level libraries that will break portability.
So that's something I didn't anticipate. And it was great to have a website for the documentation, Read the Docs, but I think its integration with GitHub is not optimal, in the sense that many people don't find the documentation; they find the GitHub repository first. So many of their concerns are already answered in the documentation, and that's maybe something I should look into.
[00:33:56] Unknown:
And now that you're working full time with Natural Language Understanding, I'm wondering how much time you have to be able to dedicate to any further development or future enhancements to Polyglot and what you have planned for the project in the future?
[00:34:14] Unknown:
That's a great question. So I'm, quite honestly, busy with the research I do. I still do research in natural language processing. My focus is not necessarily on multilingual anymore, but I am quite aware of the challenges in multilingual work. So I work hard on developing NLP technology that avoids all the bottlenecks that come with multilingual processing. Basically, our recent effort in developing byte language models and byte-level processing is all for the purpose of making it easy to deploy to different languages and different genres without the need for normalization or tokenization, which is a big hurdle in so many applications. So if we can just read the bytes and do processing without even observing the words, that would simplify a lot of things. So I would say I only have time for maintenance of the package. But when it comes to feature development and improvement, I would think it's maybe time for a Polyglot 2 or 3, where we actually do all languages from the byte level, develop embeddings from the byte level, not from the word level, and do more semantic tasks.
In the sense, develop vector embeddings for words and phrases, but without the tokenization and normalization. And I think that would actually be more robust for more applications and more people, and trained beyond just Wikipedia. So if I have the time, and maybe the volunteers, I would be happy to guide them through the process of building on the same basis, keeping the same kind of package, but adding more embeddings that take advantage of the recent developments in natural language processing and deep learning. And, basically, my biggest goal is to simplify support for different spellings and different genres. Like, you know, imagine a tokenizer for English. How would it work on Twitter? Twitter is English, but not real English. So I think, you know, what is the definition of a language? What is the definition of a dialect? What is the definition of a genre? These are all blurry lines. I mean, just take Brazilian, sorry,
Portuguese, for example. You have Brazilian Portuguese and you have Portuguese Portuguese, and you have the Angolan one. And the vocabulary is a little bit different. The punctuation may not be consistent. So there are a lot of changes across countries. And instead of encoding them in software, I think the future is just to let the machine learn all of these. So I would say my biggest dream for Polyglot is to publish a new Polyglot that does byte-level processing and make it the foundation for everything we did before. And are there any particular areas of NLP research that are
[00:37:27] Unknown:
currently in sort of the earlier experimental phases or anything in particular with a natural language understanding that you're excited for?
[00:37:35] Unknown:
Yes. Absolutely. So, recently, we just got a paper accepted at AAAI on byte-level language modeling. There is another effort at Google called BERT, which is a bidirectional language model that supports so many languages. I think my colleagues here already published these embeddings, vector representations for more than a hundred languages, without a classifier on top of them, just the embedding of a phrase for each of these languages. So I would think that, given the amount of new hardware coming and the acceleration of hardware we are seeing nowadays, NLP research will be exciting to move forward to less preprocessing, less intervention, and letting the machine do more of the work of processing the text.
So there used to be an era where people developed word embeddings and word vectors, and I think the future will be no preprocessing, no tokenization, not even observing the words, just going directly from a sequence of characters to a representation in the space. And I think that will be quite exciting. Recently, people have achieved superhuman results in question answering. So I'm looking forward to the new challenges that will be proposed that will push our research forward.
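As a small illustration of what "just read the bytes" means in practice: any script reduces to a sequence of integers from 0 to 255 that a model can consume directly, with no tokenizer, normalizer, or per-language vocabulary. This shows only the input encoding, not the model from the paper.

```python
samples = [
    "Hello, world!",   # English
    "مرحبا بالعالم",    # Arabic
    "こんにちは世界",     # Japanese
]

for text in samples:
    byte_ids = list(text.encode("utf-8"))  # integers in 0..255, no tokenizer or normalizer
    print(f"{len(byte_ids):3d} bytes <- {text!r}")
```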
[00:39:00] Unknown:
And are there any other aspects
[00:39:03] Unknown:
of natural language processing and multilingual support and the Polyglot project that we didn't discuss yet, which you think we should cover before we close out the show? Hopefully, I've given people the right expectations, because a lot of what I receive is people expecting things that were developed in 2013 to work the same way in 2018. I wish I had more time to develop these things, but given that it's mainstream nowadays to develop embeddings for all languages, given, for example, my colleagues publishing embeddings for a hundred languages, I think the concern I had in 2013, that the community is not doing NLP for all languages, is over time becoming a non-issue, because deep learning is advancing fundamentally so that it's actually easier to develop for new languages. The one thing that will make it absolutely easy is just a lot of computation that allows us to process at the character level. And I think maybe one thing we can tell the community is, like, you know, a lot of people approach me saying, I have a Ukrainian corpus that has different annotations, and so on. I wish there were an easy way, like GitHub, where people publish their annotated datasets.
Because, like, there will be another student like Rami who's sitting there thinking, you know what? Here is the GitHub of datasets, and these are annotated. And I'm gonna train my classifier on all of them and put them in my package. And now we have a package that does a hundred tasks on a hundred languages. But it's just a problem sometimes, like, where do you find it? There's no consistent place. I think that would be a nice thing to have. I know Google published a dataset search engine, but maybe that's not what we need. Maybe we need more of a community-based GitHub but for datasets. A couple of places that you might look at on that front are: there's
[00:40:56] Unknown:
a company called Quilt that has an open source platform for data collaboration, and there's also, it might not be quite the right fit, but there's also the data.world company for hosting public datasets and being able to collaborate on those. And so I'll add links to those in the show notes, as well as links to interviews I did with some of the people behind those projects on my data engineering podcast. So, for anybody who's curious to learn more about those, I'll add those links to the show notes. And for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the Duolingo platform, which is a great and free site for being able to learn new languages. I've been using it to practice some German, and it's an enjoyable interface. It makes it very fun and easy to pick up a few minutes a day, so definitely worth taking a look at if you're interested in learning any new languages. And so with that, I'll pass it to you, Rami. Do you have any picks? Nowadays, I'm reading
[00:42:05] Unknown:
a book called The Wizard and the Prophet, and it's about the two camps of how people look at the future of Earth. You know, the Wizards are the scientists who say we can increase the capacity of Earth with our innovation. And the Prophets say, well, disasters are coming because we are exceeding what the Earth can actually sustain, and therefore we need to consume less and grow less. So it's a fantastic read, and without, you know, a heated discussion on politics, it just gives you the stories of the people behind this and their
[00:42:44] Unknown:
thoughts on how the future is going to go for us. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing with Polyglot. It's definitely a very interesting project, and it's always great seeing people expand the availability of tools for working with multiple languages. It's definitely important as more and more people come online and get involved in the modern era. So I appreciate your work on that, and I hope you enjoy the rest of your day. Thank you. Thank you so much.
Introduction and Sponsor Messages
Interview with Rami Al-Rfou: Introduction and Background
The Genesis of Polyglot
Polyglot's Unique Approach and Use Cases
Challenges in Multilingual NLP
Technical Implementation of Polyglot
Lessons Learned and Future Directions
Future of NLP Research
Closing Remarks and Picks