Summary
Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages, the capabilities of existing models are severely limited. In order to help overcome this limitation, Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Polyglot is and your reasons for starting the project?
- What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
- A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
- What is involved in adding a new language to Polyglot?
- Which families of languages are the most challenging to support?
- What types of operations are supported and how consistently are they supported across languages?
- How is Polyglot implemented?
- Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
- How much domain knowledge is required to be able to effectively use Polyglot within an application?
- What are some of the most interesting or unique uses of Polyglot that you have seen?
- What have been some of the most complex or challenging aspects of building Polyglot?
- What do you have planned for the future of Polyglot?
- What are some areas of NLP research that you are excited for?
Keep In Touch
Picks
- Tobias
- Rami
- The Wizard and the Prophet: Two Remarkable Scientists and Their Dueling Visions to Shape Tomorrow’s World by Charles C. Mann
Links
- Polyglot
- Polyglot-NER
- Jordan
- NLP (Natural Language Processing)
- Stony Brook University
- Arabic
- Sentiment Analysis
- Assembly Language
- C
- .NET
- Stack Overflow
- Deep Learning
- Word Embedding
- Wikipedia
- Word2Vec
- NLTK (Python Natural Language Toolkit)
- SpaCy
- Gensim
- Morphology
- Morpheme
- Transfer Learning
- Read The Docs
- BERT (Bidirectional Encoder Representations from Transformers)
- FastText
- data.world
- Quilt package management for data
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so check out Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. Go to pythonpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers.
Clubhouse lets you craft a workflow that fits your style, including per team tasks, cross project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And keep the conversation going at pythonpodcast.com/chat.
[00:01:26] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages. So, Rami, could you start by introducing yourself? Hi. This is Rami. I work currently
[00:01:41] Unknown:
at Google, in language understanding, and Polyglot was my PhD project before joining Google. Before that, I grew up in Jordan and finished computer engineering there, then worked at a Turkish university teaching computer science. After that, I got a scholarship to do my PhD in machine learning and natural language processing at Stony Brook University, State University of New York. During my PhD, I was interested in understanding language, and I still remember to this day when I joined my lab and was talking to my advisor. He pointed me to some tools they had developed for English sentiment analysis and showed me how complex their system was: it consisted of several pieces and modules that you have to hook up together and work on. And then my adviser switched to asking me, can you do it in your native language, which is Arabic?
After a semester of working on Arabic sentiment analysis, I failed miserably, and I realized that if I want to solve it for Arabic, I should solve it for all languages. And I had better think about a different strategy for how to tackle a large number of languages instead of one at a time. Because once I'm done with the first one, I will be asked for the second one and the third one, and definitely at some point my skills in that language will be too limited to do anything useful. So we should let the machine learn, not ourselves.
[00:03:22] Unknown:
And do you remember how you first got introduced to Python?
[00:03:26] Unknown:
So Python is not something my university used to teach. There was a lot of emphasis, as electrical engineering and computer engineering, on assembly and C. But the computer science side used to be the .NET languages like C# and VB, Visual Basic, and so on. I was interested in Python because I found the syntax really easy to remember and write. And I'm kind of a lazy programmer, and most of my projects were things I wanted to build, not necessarily the kind of jobs I had to do. So it was, you know, quite a productive environment to work in. So every time I came up with an idea or something, Python would be my first language. So I learned it, you know, by myself.
I used to hang out in the Python IRC channel. And every time I hit a bug, and these are the days before Stack Overflow, I would just be nagging the people there. They would be annoyed, but they would answer me eventually. And little by little, there was never a single point where I had to master the language, but over time, because it's my default language and my comfort zone, I started mastering it more and more.
[00:04:45] Unknown:
And so you were mentioning that in the process of your research, you were tasked with trying to add support for natural language processing for Spanish and Arabic, and that led you to wanting to build a solution that would work for all languages. So I'm wondering if you can describe a bit about the Polyglot project itself and, some of your experience of getting that started and how it works.
[00:05:11] Unknown:
So, the Polyglot project, the aim of it is to elevate the status of NLP projects and packages, in the sense that when people develop something called NLP, natural language processing, they should be working with the mindset that their system should be as applicable as possible to all languages. And that's really a hard task. That's a research task. So, you know, the current status, that many packages do not support many languages, comes from the fact that it's not an easy thing to do. But I believe compromises have to be made to actually put coverage of languages as a first priority over everything else. So Polyglot was initiated from the fact that there was a lot of progress in deep learning around the time I started the project, where people historically used to calculate features from languages manually. So they would look at casing and character n-grams, and they would look at annotations like verb, subject, and so on. And these annotations and features are not consistent across languages and, you know, need a lot of domain knowledge in each language. So a lot of progress happened in deep learning in the sense that we can learn feature representations of words without actually knowing the language, and that comes just from the corpus statistics of the language.
So when I realized, oh, I can learn features of every word in every language automatically, we call it word embeddings. And a word embedding means a vector that represents a word. Imagine a simple case where you have a 3D room and I'm telling you to put the words in the space according to their meaning. So in one corner you will find, like, places, and in another corner you will find names. But in another corner you will find verbs, and the verbs that describe walking are closer to each other than the verbs that describe drinking. If I tell you to do it as a human, you will really spend a lot of time, not sure how to place the words and whether the space will be enough, given it's only 3 dimensions. What our machine learning is trying to do is literally place these words at points in the space, but the dimension usually is way higher. We are talking about between 50 dimensions and 500 dimensions. And when we look at the nearest neighbors for every word in that space, we find they correspond really well to our understanding of the meaning of these words. So you will find Obama closer to Bush, and you will find Apple capitalized closer to Google but apple lowercase closer to orange, dog close to cat, coffee close to tea, and so on. So, given this technology, I decided, okay, why not develop these word vectors for all languages?
One challenge to do that is, where is the corpus that could cover a hundred languages? And that by itself is a really tricky business. So every time there is a problem where you cannot get resources for multiple languages, usually my first answer was Wikipedia. First, the corpus is available, you know, to download easily. But not only that, there is a significant volume of text in each of the languages supported by Polyglot. So the first step was to learn word representations, which we call word embeddings, for each language, and that meant taking the state-of-the-art methods at the time and running them on the Wikipedia corpus for each language, and that ends up with a dictionary where you have, for every word in that language, a vector representing that word.
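To make that recipe concrete, here is a minimal sketch of training per-language word embeddings on a Wikipedia-derived text file, using today's gensim Word2Vec API rather than Polyglot's original 2013 training code; the file name and query word are hypothetical.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One plain-text file per language, extracted from a Wikipedia dump,
# one sentence per line. The file name is hypothetical.
sentences = LineSentence("arwiki_sentences.txt")

model = Word2Vec(
    sentences,
    vector_size=100,  # gensim >= 4.0; the episode mentions 50-500 dimensions
    window=5,         # context window around each word
    min_count=5,      # drop very rare words
    workers=4,
)

# Nearest neighbors in the learned space tend to track meaning:
# country names cluster together, verbs of motion cluster together, etc.
print(model.wv.most_similar("ملك", topn=5))  # hypothetical query word ("king" in Arabic)
```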
[00:08:49] Unknown:
So, it sounds like Wikipedia is essentially your digital Rosetta Stone for being able to establish some sort of commonality between the languages and the features that you're trying to optimize for in being able to generate these word vectors to be able to establish an appropriate vector space for doing analysis and, entity extraction from any sort of arbitrary corpus of text within a given language?
[00:09:19] Unknown:
Yes. I mean, if we go through every task that Polyglot achieves, for a lot of them, not only were the features extracted from Wikipedia, but we also utilized the structure of Wikipedia to actually help us with the annotations. So, I'm not sure if you want me to delve deep into these now or we get into them later.
[00:09:40] Unknown:
And so for Polyglot itself, what are some of the main use cases that it enables which would be either impractical or impossible with something such as the NLTK or spaCy libraries or Gensim that don't have as broad language support?
[00:09:59] Unknown:
I mean, basically, the distinguishing thing in Polyglot is the number of languages we support. That came at a cost in terms of quality. So if your only concern is English, the other libraries offer better quality, because the technology they use depends heavily on resources only available for English. While developing Polyglot, we had to rely on heuristics and a lot of tricks and hacks to get annotations and pseudo-annotations to develop tools for other languages, but that came at a cost of quality. So if you are a person who wants to work on a language that these libraries do not support, Polyglot will be your first start.
If you are interested in the semantic meaning of a corpus in other languages, the word embeddings are a good start. And, yep.
[00:10:56] Unknown:
And you mentioned that there's a significant amount of effort involved in being able to add support for any given language to a natural language tool such as spaCy or NLTK. So I'm wondering if you can dig a bit more into the types of difficulties that you would face when adding support for a new language as a sort of incremental step, versus the approach that you're taking with Polyglot of doing it with a deep learning approach?
[00:11:26] Unknown:
So let me explain. Let's suppose we want to develop an application called named entity recognition, and the goal here is to extract from a document the entities that were mentioned, like locations, persons, organizations. In a typical setup that NLTK or spaCy or Stanford NLP would take, first, you have to develop annotations. So you would go after a corpus, and the corpus will not be annotated. So you will have humans annotating where the entities show up, and that involves, like, tens of thousands of entities that have to appear. So you are going over, like, you know, hundreds of thousands of sentences. Second thing, you will go and develop features for that task. And usually, for named entity recognition, people would look at character n-grams and casing, which is not consistent across languages.
But you still will rely on them. Then, given the features you extracted and the supervised data you have, you will train a model, you know, to learn named entity recognition. In the Polyglot world, the approach was quite different. We learned features for these languages using Wikipedia. So if you have a new Wikipedia for a new language, you just have to run a tool like Word2Vec, and that will embed your words into vectors. For annotations, given we didn't have the resources and these resources are not gonna scale for our languages, we relied on the link structure of Wikipedia. So what does that mean?
In the first couple of paragraphs of a Wikipedia article, if an entity gets mentioned, usually it gets annotated and linked to its own page. So we know the set of pages in Wikipedia that correspond to entities. So if a mention points to a page that is already an entity, we will call that mention, that phrase, an entity phrase. So, automatically, that allows us to have a huge amount of supervised, labeled corpus. Now, that doesn't come with the highest quality we would expect, because Wikipedia asks its writers to only annotate the first mention but not the later mentions in the document. So in our paper working on that problem, we designed a couple of heuristics to accommodate that Wikipedia style. First, we have higher confidence in the positive labels than the negative labels. So if something is mentioned as an entity, we're quite confident. If something was not marked as an entity, we have less confidence, because it could just not have been annotated according to the standard. And, second, we usually focus on the first couple of paragraphs of each article but not the later ones. So that ends up with a huge amount of annotations.
Given the features we can automatically extract from the text, and the annotations, we trained a deep learning model, a feed-forward network, to actually predict these annotations.
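As a rough illustration of the link-structure heuristic described above (not the project's actual pipeline), the sketch below treats a wiki link as a positive entity mention whenever its target page is known to be an entity; the ENTITY_PAGES dictionary is a hypothetical stand-in for the Freebase-derived page types.

```python
import re

# Hypothetical mapping from entity pages to types; the real project
# derived page types from Freebase rather than a hand-written dict.
ENTITY_PAGES = {
    "Barack Obama": "PERSON",
    "Google": "ORG",
    "Amman": "LOC",
}

# Matches [[Target]] and [[Target|surface text]] wiki links.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def weak_entity_labels(wikitext):
    """Yield (surface_text, label) for links that point to known entity pages."""
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        label = ENTITY_PAGES.get(target)
        # Positive labels are trusted; missing links are only weak negatives,
        # since Wikipedia style links just the first mention.
        if label:
            yield surface, label

paragraph = "[[Barack Obama|Obama]] visited [[Amman]] to meet engineers from [[Google]]."
print(list(weak_entity_labels(paragraph)))
# [('Obama', 'PERSON'), ('Amman', 'LOC'), ('Google', 'ORG')]
```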
[00:14:32] Unknown:
And for somebody who's interested in adding an additional corpus for a given language to Polyglot, to more easily fit the particular problem space or set of topics that they're going to be working with within their project, what's involved in being able to either update the model or retrain Polyglot to handle some of these additional entities? And how much upfront work is necessary as far as labeling or feature extraction for Polyglot to be able to take advantage of the corpus that they're introducing?
[00:15:10] Unknown:
That's a great question. So, basically, the work on Polyglot started in 2013, and since then we are more than 5 years down the line. The technology has evolved so significantly that there are new methods to develop word embeddings that definitely outperform the way we have been doing it in Polyglot over Wikipedia. My first go-to is Word2Vec, which is available through Gensim, to develop the word embeddings. And, basically, why am I advising that? Because the methods have been constantly improving, and you are aware of the machine learning and deep learning revolution happening in the last 6 to 7 years. So the technology is consistently changing, even up to the last 2 weeks, where a team here at Google also published embeddings with a new method, over so many languages, called BERT.
So things keep moving fast. So if you want to add a new language, I would say the first thing I would focus on is developing word embeddings for that language. Two things I would focus on: first, finding the corpus that is closest to the domain you care about. So if your domain looks like news, it would be great if you collect a corpus of news. If your domain looks like Wikipedia, then you are good to go. So, first thing, collect a corpus that looks as close as possible to the domain you care about. Second thing, use one of the state-of-the-art tools to develop word embeddings, using Word2Vec or BERT or FastText.
Gensim encompasses several of them, but not all of them, and the other tools are also available in Python. Once you have your word embeddings or phrase embeddings, now you come to annotations. That is a common question I get about Polyglot: oh, Polyglot annotates organizations, persons, and locations, but I need more. And the problem with a lot of NLP applications is just that the annotations are hard to get. While in Polyglot we relied on the Wikipedia taxonomy, there is no reason not to do it with a larger taxonomy. So when we went to Wikipedia and tagged the annotations of each page, we only concentrated on organizations, persons, and locations.
So what you really need to do is re-annotate the text according to a wider definition of pages. So any mention that points to pages beyond these 3 types, you will also introduce as a new label. So, basically, what really needs to happen is re-extracting the corpus from Wikipedia. On our page, we already released the corpus we extracted, but that doesn't include the taxonomy. And in the past 3 years, we relied on Freebase to give us annotations over Wikipedia pages to classify them, and Freebase is not available anymore. So I'm not sure if Wikipedia nowadays has its own taxonomy on top of its pages, and how things go. It used to be also really hard to extract clean text from Wikipedia because of the wiki markup.
But over the last 3 years, they also developed tools to make it easier and easier to do that. So on one side, it became easier to get clean text out of Wikipedia, but on the other side, we lost Freebase as a source of taxonomy over Wikipedia.
[00:18:50] Unknown:
And are there any particular families of languages that are more challenging to support than others whether because of the ways that their text is encoded or the grammatical structures or the way that they refer to entities within the language or the semantics of how the language is constructed?
[00:19:12] Unknown:
Absolutely. So there are different sets of complexity that come for different reasons. One of the complexities is tokenization. For tokenization in East Asian languages, if you look at the Polyglot code, we end up using a different tokenizer than the one we use for the rest. And that comes from the fact that, you know, Chinese script and Japanese and so on, especially Japanese because of the mix of 3 different writing systems, end up really hard for tokenizers to get right. So that's one issue.
Another issue comes from languages like Turkish, Arabic, and Finnish, which have really sophisticated morphology, and their morphology ends up creating so many words that are just compound words. So what do I mean by that? Usually, in Polyglot, we concentrate on the top 50,000 to 100,000 words. How do we select the top 100,000 words? The most frequent ones. And usually in English, the most frequent 100,000 words cover almost 98% of the text. When it comes to a language like Arabic or Turkish or Finnish, with the most frequent 100,000 words, you barely cover 90% of the corpus.
And that leads to degradation: if a word is not in your dictionary, it's out of vocabulary, and out-of-vocabulary words get represented with one single vector, which means you don't give the classifier enough information to actually make a good decision. So languages that have complex morphology end up with a lot of sparsity, and languages where tokenization is not as easy also suffer. Now, both issues could actually be solved using a different approach to building NLP applications: instead of treating text as a sequence of words, why not rely on characters or bytes?
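One concrete remedy for the out-of-vocabulary problem described here is subword (character n-gram) embeddings. This was not part of the original Polyglot; the small gensim FastText sketch below only illustrates the idea of composing a vector for an unseen word, using a hypothetical toy corpus.

```python
from gensim.models import FastText

# Tiny hypothetical corpus with Turkish-style inflected/compound forms.
toy_corpus = [
    ["evlerimizde", "oturuyorduk"],
    ["evde", "oturmak", "istiyorum"],
    ["arabalarimizla", "geldik"],
]

model = FastText(
    toy_corpus,
    vector_size=32,  # small, for illustration only
    min_count=1,
    min_n=3,         # character n-gram range used for subwords
    max_n=6,
)

# "evlerinizde" never appeared in training, but FastText composes a vector
# from its character n-grams instead of mapping it to one shared OOV vector.
print(model.wv["evlerinizde"].shape)  # (32,)
```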
So if Rami were to rebuild Polyglot in 2018, what he would do is build representations of language from the characters up, and that would allow you to represent languages with complex morphology like Arabic and Turkish. Also, you don't need a tokenizer or normalizer. And that's a great blessing, because normalizers and tokenizers do not work consistently and robustly across languages. And every library has a really different
[00:21:58] Unknown:
scheme of tokenization depending on the language. In terms of the types of operations that Polyglot supports, I know that it is able to do, as you mentioned, tokenization and entity recognition, and being able to extract the different morphemes within the text. And when I was looking at the documentation, it looked like there were varying levels of support across languages for some of those different operations. So I'm wondering if you can talk a bit about the challenges of supporting some of those different processes that you might want to run against the given languages and how the nature of the language itself may complicate being able to support it as thoroughly as some of the other languages?
[00:22:42] Unknown:
So, basically, we do have varying support for every prediction task. For part-of-speech tagging, we have support for 40 languages, and for named entity recognition we also have 40. But for morpheme analysis, we have more, and embeddings and tokenization we have for all of them. And by all of them, I mean, like, the 100 languages we support. The varying support comes from the varying level of annotations we can acquire. As I mentioned earlier, for named entity recognition, we relied on Wikipedia to extract mentions.
But for part-of-speech tagging, we relied on the NLP community to get annotated corpora, and these corpora only cover 40 languages. They don't cover more, and they are expensive to collect. Morpheme analysis is an unsupervised task, in the sense that it doesn't need an annotated corpus, so it was easy to push it to all languages. So, really, the bottleneck for developing new languages comes from: do you have annotations for the task you need? And, again, annotations are costly, in the sense that either you buy them by making a Mechanical Turk task where you ask people to annotate the corpus for you, or you do it manually, or, as I did in my PhD, you do a lot of clever data mining tricks that almost get you there without actual human intervention.
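For reference, here is a short sketch of what that per-task API looks like, with attribute names as they appear in the Polyglot documentation. The per-language models have to be downloaded first (see the downloader sketch further down), so treat this as an outline to check against the current docs rather than a guaranteed snippet.

```python
from polyglot.text import Text

# Attribute names as shown in the Polyglot documentation; verify against
# the current docs, and download the per-language models first.
text = Text("President Obama met engineers from Google in Amman.")

print(text.language.code)  # language identification, e.g. 'en'
print(text.words)          # tokenization
print(text.pos_tags)       # part-of-speech tags (supported for a subset of languages)
print(text.entities)       # named entity chunks (I-PER / I-ORG / I-LOC style)

word = text.words[1]
print(word.morphemes)      # unsupervised morpheme segmentation
```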
[00:24:10] Unknown:
And so I'm wondering if you can talk a bit more about the way that Polyglot itself is implemented in terms of the software design and architecture and some of the, major libraries or services that you leverage to be able to put it together?
[00:24:28] Unknown:
Yes. So, I mean, the design of Polyglot emphasizes simplicity, just to encompass a lot of tasks and a lot of languages. The library actually consists, in my mind, of 2 parts: the software and the models. And the models, you know, take a lot of storage, so it was hard to push binary blobs into GitHub to store them. So we decided to store them on a server, and we forked the NLTK downloader package to actually serve a Polyglot index of resources and download according to language or according to task. And these usually used to be hosted either on Google Cloud or on my university server.
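For reference, a sketch of fetching models through that forked downloader, using package names of the form shown in the Polyglot documentation; verify the exact names against the current index.

```python
from polyglot.downloader import downloader

# Package names follow a "<task><version>.<language>" pattern in the
# Polyglot index (e.g. embeddings2.en, ner2.en); check the index for
# the exact names.
downloader.download("embeddings2.en")  # English word embeddings
downloader.download("ner2.en")         # English named entity recognition model

# Equivalent from the shell:
#   polyglot download embeddings2.en ner2.en
```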
So that's one part. For the software part, Polyglot has dependencies on several libraries for tokenization and language identification. The tokenization comes from the ICU library, and language identification comes from Google's CLD, the C++ library they had, and Polyglot depends on that library. For the morpheme analysis, I used the work of a research group and used their Python library to actually train a morpheme analyzer for each language, given the corpora I have. Then I uploaded the models.
So, basically, Polyglot is a collection of other people's efforts plus the effort I did in my PhD. In terms of software design, I would say the greatest value of Polyglot is the pedagogical value, in the sense that I see the largest usage is from students who are taking a machine learning and NLP course for the first time. I get a lot of requests and questions from them because they find it kind of transparent. It's quite easy to go through the code and just figure out what is going on and modify it according to your class project. And
[00:27:02] Unknown:
is there any capacity for being able to integrate Polyglot with some of the other tools in the ecosystem, such as spaCy or Gensim, to leverage the additional algorithms or capabilities of those libraries once you've done the part-of-speech tagging or tokenization using Polyglot within the languages that are supported?
[00:27:31] Unknown:
I mean, some libraries approached us to use the embedding resources we have, and we were welcoming in general. I personally don't have the capacity, given my life and work schedule, to do more significant development other than maintenance.
[00:27:52] Unknown:
And for somebody who's building an application or a project and using Polyglot, is there any particular domain knowledge in terms of natural language understanding or natural language processing that's required to be able to use Polyglot effectively within that application? Or is it fairly easy to get started with a surface understanding of how the API operates and be able to leverage the capabilities of Polyglot to gain some capacity for being able to process free-form text?
[00:28:25] Unknown:
So the API, I think, will be straightforward to use, and I don't think you really need to know much about how it's working, because, you know, from language identification to tokenization to morpheme analysis and semantic similarity of words, all of these you could just use as the API offers them. One thing you need to be aware of is the way these models are trained: your expectations of what the models will do should be highly influenced by understanding what they were trained on, how they were trained, and under which conditions. A machine learning model will not do something it was not trained to do. So, again, for named entity recognition, for example, we trained it for 3 entity types, and people ask for more entities. And for more entities, you need to retrain the model; the model is not going to achieve that otherwise.
Also, it was trained on labels that were extracted automatically. So the quality of the classification will not match the state of the art in English offered by the Stanford package, for example, where a task was trained using a different research mechanism. And if the quality of the annotations being produced, or the results, are a little bit surprising, I advise people to just skim the paper on how we actually implemented these things, because given the way we implemented it, it's fair for the model to behave that way, but it may not match what you expect the model to do, if that makes sense. So, again, the goal of Polyglot is to push the community toward larger-scale coverage of languages, and we had to make compromises along the way. Some of them came down to not collecting human-annotated datasets for each task, for each language, because we thought that was not feasible. And
[00:30:24] Unknown:
are there any particularly interesting or unexpected or unique uses of Polyglot that you have seen? I'm actually surprised by
[00:30:32] Unknown:
the attention I get. I mean, my expectation was that this was a project to put all my PhD effort into and package it in a presentable way. But over time, I was surprised by the attention it got and the number of people using it, especially the number of startups. Again, there is a lot of demand for machine learning and AI to be integrated into so many applications, so I guess that pushes toward that. I'm delighted by the number of researchers and students who approach me from all around the world, and the nice words I receive from people I would never meet, in remote areas of the world, just make me, you know, quite happy. I still remember people approaching me from Indonesia, Bangladesh, Ethiopia, Angola. And, you know, in all honesty, I felt like that was quite a reward. Most of what I see otherwise is small teams trying to get coverage in new markets.
So they like Polyglot because it gives them a unified interface for tokenization and language identification across languages, and an easy API. I found that many schools, when they teach NLP to undergrads, tell them about the package, or the students find it themselves and find it easy to integrate into their projects. And in the process
[00:31:56] Unknown:
of researching and creating and maintaining Polyglot, I'm wondering what you have found to be some of the most interesting or unexpected lessons learned and some of the biggest challenges that you faced along the way? So there are things I didn't anticipate when I was writing the code. One thing is the huge demand for Windows support.
[00:32:18] Unknown:
I have historically been using Linux since, like, 2006, when I was an undergrad. So in my mind, when I was developing the package, there were things that felt to me like obvious defaults, like, you know, path separators and so on. And I said it clearly in the documentation: this is only supported on Linux. But if you look at what people usually request, and the questions, pretty much more than two thirds of them are like, we need to run this on Windows. And I don't have a machine that has Windows, and I don't even have the experience to develop Python on Windows. So that's kind of a weakness on my side, but I'm not sure if the community as a whole has tools that make it easy. I know that Python has tools to convert Python 2 code to Python 3, but are there tools that will package your software such that it's tested and won't break on any platform? At least, you know, semantically it's not gonna break; at least the way you constructed paths and so on will all align, you called all the right functions, and you didn't use lower-level libraries that will break portability.
So that's something I didn't anticipate. And it was great to have a website for the documentation, Read the Docs, but I think its integration with GitHub is not optimal, in the sense that many people don't find the documentation; they find the GitHub repository first. So many of their concerns are already answered in the documentation, and that's maybe something I should look into.
[00:33:56] Unknown:
And now that you're working full time with Natural Language Understanding, I'm wondering how much time you have to be able to dedicate to any further development or future enhancements to Polyglot and what you have planned for the project in the future?
[00:34:14] Unknown:
That's a great question. So I'm, quite honestly, busy with the research I do. I still do research in natural language processing. My focus is not necessarily on multilingual anymore, but I am quite aware of the challenges in multilingual work. So I work hard on developing NLP technology that avoids all the bottlenecks that come with multilingual processing. Basically, our recent effort in developing byte language models and byte-level processing is all for the purpose of making it easy to deploy to different languages and different genres without the need for normalization or tokenization, which is a big hurdle in so many applications. So if we can just read the bytes and do processing without even observing the words, that would simplify a lot of things. So I would say I only have time for maintenance of the package. But when it comes to feature development and improvement, I would think it's maybe time for a Polyglot 2 or 3, where we actually do all languages from the byte level, develop embeddings from the byte level, not from the word level, and do more semantic tasks.
In the sense, develop vector embeddings for words and phrases, but without the tokenization and normalization. And I think that would actually be more robust for more applications and more people, and trained beyond just Wikipedia. So if I have the time, and maybe the volunteers, I would be happy to guide them through the process of building on the same basis, keeping the same kind of package, but adding more embeddings that take advantage of the recent developments in natural language processing and deep learning. And, basically, my biggest goal is to simplify support for different spellings and different genres. Like, you know, imagine a tokenizer for English. How would it work on Twitter? Twitter is English, but not real English. So I think, you know, what is the definition of a language? What is the definition of a dialect? What is the definition of a genre? These are all blurry lines. I mean, just take Brazilian, sorry,
Portuguese, for example. You have Brazilian Portuguese and you have Portuguese Portuguese, and you have the Angolan one. And the vocabulary is a little bit different. The punctuation may not be consistent. So there are a lot of changes across countries. And instead of encoding them in software, I think the future is just to let the machine learn all of these. So I would say my biggest dream for Polyglot is to publish a new Polyglot that does byte-level processing and make it the foundation for everything we did before. And are there any particular areas of NLP research that are
[00:37:27] Unknown:
currently in sort of the earlier experimental phases or anything in particular with a natural language understanding that you're excited for?
[00:37:35] Unknown:
Yes. Absolutely. So, recently, we just got a paper accepted at AAAI on byte-level language modeling. There is another effort at Google called BERT, which is a bidirectional language model that supports so many languages. I think my colleagues here already published these embeddings, vector representations for more than a hundred languages, without a classifier on top of them, just the embedding of a phrase for each of these languages. So I would think that, given the amount of new hardware coming and the acceleration of hardware we are seeing nowadays, NLP research will be exciting to move forward to less preprocessing, less intervention, and letting the machine do more of the work of processing the text.
So there used to be an era where people developed word embeddings and word vectors, and I think the future will be no preprocessing, no tokenization, not even observing the words, just going directly from a sequence of characters to a representation in the space. And I think that will be quite exciting. Recently, people have achieved superhuman results in question answering. So I'm looking forward to the new challenges that will be proposed that will push our research forward.
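As a small illustration of what "just read the bytes" means in practice: any script reduces to a sequence of integers from 0 to 255 that a model can consume directly, with no tokenizer, normalizer, or per-language vocabulary. This shows only the input encoding, not the model from the paper.

```python
samples = [
    "Hello, world!",   # English
    "مرحبا بالعالم",    # Arabic
    "こんにちは世界",     # Japanese
]

for text in samples:
    byte_ids = list(text.encode("utf-8"))  # integers in 0..255, no tokenizer or normalizer
    print(f"{len(byte_ids):3d} bytes <- {text!r}")
```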
[00:39:00] Unknown:
And are there any other aspects
[00:39:03] Unknown:
of natural language processing and multilingual support and the Polyglot project that we didn't discuss yet, which you think we should cover before we close out the show? Hopefully, I've given people the right expectations, because a lot of what I receive is people expecting things that were developed in 2013 to work the same way in 2018. I wish I had more time to develop these things, but given that it's mainstream nowadays to develop embeddings for all languages, given, for example, my colleagues publishing embeddings for a hundred languages, I think the concern I had in 2013, that the community is not doing NLP for all languages, is over time becoming a non-issue, because deep learning is advancing fundamentally so that it's actually easier to develop for new languages. The one thing that will make it absolutely easy is just a lot of computation that allows us to process at the character level. And I think maybe one thing we can tell the community is, like, you know, a lot of people approach me saying, I have a Ukrainian corpus that has different annotations, and so on. I wish there were an easy way, like GitHub, where people publish their annotated datasets.
Because, like, there will be another student like Rami who's sitting there thinking, you know what? Here is the GitHub of datasets, and these are annotated. And I'm gonna train my classifier on all of them and put them in my package. And now we have a package that does a hundred tasks on a hundred languages. But it's just a problem sometimes, like, where do you find it? There's no consistent place. I think that would be a nice thing to have. I know Google published a dataset search engine, but maybe that's not what we need. Maybe we need more of a community-based GitHub but for datasets. A couple of places that you might look at on that front are: there's
[00:40:56] Unknown:
a company called Quilt that has an open source platform for data collaboration, and there's also, it might not be quite the right fit, but there's also the data.world company for hosting public datasets and being able to collaborate on those. And so I'll add links to those in the show notes, as well as links to interviews I did with some of the people behind those projects on my data engineering podcast. So, for anybody who's curious to learn more about those, I'll add those links to the show notes. And for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the Duolingo platform, which is a great and free site for being able to learn new languages. I've been using it to practice some German, and it's an enjoyable interface. It makes it very fun and easy to pick up a few minutes a day, so definitely worth taking a look at if you're interested in learning any new languages. And so with that, I'll pass it to you, Rami. Do you have any picks? Nowadays, I'm reading
[00:42:05] Unknown:
a book called The Wizard and the Prophet, and it's about the two camps of how people look at the future of Earth. You know, the Wizards are the scientists who say we can increase the capacity of Earth with our innovation. And the Prophets say, well, disasters are coming because we are exceeding what the Earth can actually sustain, and therefore we need to consume less and grow less. So it's a fantastic read, and without, you know, a heated discussion on politics, it just gives you the stories of the people behind this and their
[00:42:44] Unknown:
thoughts on how the future is going to go for us. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you're doing with Polyglot. It's definitely a very interesting project, and it's always great seeing people expand the availability of tools for working with multiple languages. It's definitely important as more and more people come online and get involved in the modern era. So I appreciate your work on that, and I hope you enjoy the rest of your day. Thank you. Thank you so much.
Introduction and Sponsor Messages
Interview with Rami Al-Rfou: Introduction and Background
The Genesis of Polyglot
Polyglot's Unique Approach and Use Cases
Challenges in Multilingual NLP
Technical Implementation of Polyglot
Lessons Learned and Future Directions
Future of NLP Research
Closing Remarks and Picks