Summary
Natural language processing is a powerful tool for extracting insights from large volumes of text. With the growth of the internet and social platforms, and the increasing number of people and communities conducting their professional and personal activities online, the opportunities for NLP to create amazing insights and experiences are endless. In order to work with such a large and growing corpus it has become necessary to move beyond purely statistical methods and embrace the capabilities of deep learning, and transfer learning in particular. In this episode Paul Azunre shares his journey into the application and implementation of transfer learning for natural language processing. This is a fascinating look at the possibilities of emerging machine learning techniques for transforming the ways that we interact with technology.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to pythonpodcast.com/census today to get a free 14-day trial.
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
- Your host as usual is Tobias Macey and today I’m interviewing Paul Azunre about using transfer learning for natural language processing
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what transfer learning is?
- How is transfer learning being applied to natural language processing?
- What motivated you to write a book about the application of transfer learning to NLP?
- What are some of the applications of NLP that are impractical or intractable without transfer learning?
- At a high level, what are the steps for building a new language model via transfer learning?
- There have been a number of base models created recently, such as BERT and ERNIE, ELMo, GPT-3, etc. What are the factors that need to be considered when selecting which model to build from?
- If there are multiple models that contain the seeds for different aspects of the end goal that you are trying to obtain, what is the feasibility of extracting the relevant capabilities from each of them and combining them in the final model?
- What are some of the tools or frameworks that you have found most useful while working with NLP and transfer learning?
- How would you characterize the current state of the ecosystem for transfer learning and deep learning techniques applied to NLP problems?
- What are the most interesting, innovative, or unexpected applications of transfer learning with NLP that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?
- When is transfer learning the wrong choice for an NLP project?
- What are the trends or techniques that you are most excited for?
Keep In Touch
Picks
- Tobias
- Paul
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Transfer Learning for Natural Language Processing by Paul Azunre (affiliate link). Use the code podinit21 at checkout for 35% off all books at Manning!
- Low Resource Languages
- Fortran
- C++
- MATLAB
- MIT 6.003
- Transfer Learning
- Computer Vision
- Deep Neural Network
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- GLUE == General Language Understanding Evaluation
- NLP SuperGLUE
- NLP Encoder
- Named Entity Recognition
- ImageNet
- Mathematical Optimization
- Gradient Descent
- Yonder AI
- ELMo language model from Allen NLP
- Ghana
- ArXiv
- BERT language model
- TF-IDF == Term Frequency – Inverse Document Frequency
- Word2Vec
- GPT-3
- Ghana NLP
- Automatic Speech Recognition
- ULMFiT
- Keras
- TensorFlow
- Hugging Face Transformers
- Multi-Task Learning
- Fast.ai
- OpenAI
- AWS SageMaker
- Kaggle Kernels
- Colab Notebooks
- Azure ML Studio
- BLEU Score
- Khaya application
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to podcast dot in it, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to python podcast.com/linode, that's l I n o d e, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at pythonpodcast.com/hightouch.
Your host as usual is Tobias Macey. And today, I'm interviewing Paul Azunre about using transfer learning for natural language processing. Everyone. My
[00:01:42] Unknown:
name is Paul Azunre, and I'm here to talk to you today about natural language processing. I work in the field, in the areas of NLP for low resource languages, and this is where my interest in the problem comes from. And do you remember how you first got introduced to Python? Yeah. That's a funny story. I actually didn't use Python until very recently. My dissertation was written mostly in Fortran and some C++, and I used a lot of MATLAB in my undergraduate studies. But during grad school, I was slowly introduced to Python because I was teaching a class, 6.003, at MIT. I was a TA, and the professor wanted to use Python.
So this is how I first got introduced to it. And then recently, of course, you start working on machine learning. You can't avoid it. So
[00:02:33] Unknown:
And so now that brings us a bit to what you're working on now in terms of transfer learning and natural language processing. Before we get too much into that, can you start by giving a bit of an overview about what transfer learning even is?
[00:02:46] Unknown:
So transfer learning is the idea that you do not have to start learning things from scratch anytime you need to solve an NLP or computer vision problem, or any machine learning problem for that matter. If you are starting from scratch, you are likely to need a lot of data and a lot of resources to achieve a certain level of performance. However, with transfer learning, you try to leverage existing knowledge that you might have in order to simplify a task, which is very similar to how human beings learn. And you could say human learning is an inspiration for this field, because a human being never learns from scratch; they try to make associations with other things they know, and this helps them to tackle new challenges with very little training comparatively.
[00:03:34] Unknown:
And so in terms of its application to natural language processing, I know that it started off a little bit in terms of computer vision research and particularly with the introduction of deep neural nets and deep learning, you know, a few years ago. And I'm wondering how it's being applied to natural language processing and what the sort of translation process looks like from the computer vision transfer learning approaches to how it's being applied in natural language use cases?
[00:04:03] Unknown:
Many people would argue that computer vision was the inspiration for transfer learning in NLP. So this method has matured, you know, over the past couple decades and was being used heavily in computer vision, not so much in NLP. In NLP, the way it manifests itself is a little different in terms of the kind of deep learning architectures that are used heavily. So in computer vision, it is convolutional neural networks, and so the way you train it is very specific to that. Right? Because the layers change from very general to less general as you go through the stack. In natural language processing, we have a lot more use of recurrent neural networks and now transformer architectures.
It looks a little different, but a lot of the key ideas are the same. So you start with a pre trained model that was trained on a very general data set that somehow captures the kind of knowledge the model needs to have in order to solve problems in this, you know, domain. So in the case of computer vision, this was the ImageNet dataset. In the case of natural language processing, there are a few options recently, but the most prominent 1 is probably GLUE or SuperGLUE. It includes tasks such as the common sense reasoning tasks. Like, is this grammatically correct? Is this question similar to this other question?
You know, a pronoun in a sentence, which 1 of these possible options does it refer to? So these are very general skills that the model learns to do based on training on this general big data set. And then this is followed by a fine tuning step where this general knowledge is adapted. So in computer vision, you may download the model that was trained somewhere, a very large model, and only fine tune the last couple layers in it. The same way in the case of NLP, you may download this model and fine tune the last few layers of the encoder in addition to any specific new layers you have for a new task you are trying to solve, a very specific task like named entity recognition or so on. So just to summarize, the idea is that transfer learning for NLP is referred to as the ImageNet moment of NLP, to draw the similarity. So there are a lot of similarities, but the implementation details, when you get really in the weeds, are different, but the high level is very similar. So in terms of your actual
[00:06:38] Unknown:
exposure to transfer learning and natural language processing, as I was researching this interview, I noticed that for your PhD, you were focusing on things like optimization problems and how that applies to different areas of optics. And I'm wondering how you went from there to your current focus of working with natural language processing and transfer learning applications and also a book that you're writing on the topic and just sort of what was your motivation for going down this path and, you know, acquiring the level of expertise that you're currently at? My journey started,
[00:07:10] Unknown:
as you correctly pointed out, in mathematical optimization, which is a field that is related to machine learning, but has a very different focus. So I would say, at least, there is a lot more theory that's been done that's very mathematically robust and well understood. So there are spaces where this problem is set in a very theoretical way, and there's a lot of activity around that. There's also an applied side to it, which I would say is not nearly as widespread and applied in industry as machine learning is. And so in that sense, there's more stress on theory, at least the way I was introduced to optimization, than machine learning.
However, machine learning is an optimization problem. Right? You are trying to fit a model to some data, which you formulate as an optimization, you know, problem. Stochastic gradient descent is just 1 example of an optimization algorithm. So there is a similarity between these fields that helps you move, or if you like, transfer, between 1 field and the other. The way it happened for me was, I would say, pure coincidence. I was working on an applied optimization problem in optics and using these algorithms to build good solar panels. And then, you know, I tried to start a business in that space. At the time, it wasn't like a good time in terms of the economics to do work in that area.
So I had to shift my focus. At some point, you know, I got a job. At some point, I got laid off from the job. And I landed at a startup that was trying to fight disinformation, through somebody I had met through my entrepreneurial experience. I was, like, the first employee of this company. This company, called Yonder, it's still around; it fights disinformation. My job there was to work on the DARPA projects. And these DARPA projects were related to this disinformation problem in the sense that we were working on NLP and trying to detect nefarious information, nefarious activity, by analyzing content on social media and stuff like that. So this is how I first got into NLP. Now in that context, we were using transfer learning around the time that the first model for transfer learning in NLP that many point to was coming out, which is ELMo. Around the same time, we were doing something very similar in a different context. We were using simulated data to reduce the requirement for actual training data we needed to solve a particular problem. So we would simulate the data. We would train our model on the simulated data first. This would get us maybe 70% of the way. And then we would fine tune on, like, a few hundred examples for each category of actual human labeled data. Whereas, if we didn't do that, we may need tens of thousands, which we didn't have the time or the money to do. So this is how I got introduced to the problem.
At the same time, you know, I'm a person from Ghana. Ghana is a country in West Africa. And a lot of Ghanaian languages, I would say even African languages, are considered to be low resource in the sense that there are not enough, you know, tools or data to train models, these kinds of models, to solve tasks in those languages. Translation tools don't exist and so on and so forth. And, of course, it helps to leverage transfer learning in this space because you are able to take maybe an English model and adapt it somehow to your language. And this makes it easier to solve the task than, you know, if you didn't do that. So I started thinking about this, and once I started doing this, I couldn't find references that in my opinion had the right balance between application and theory. Right? So the field is very theoretical.
It's still empirical, but it gets very detailed and specialized in the papers on arXiv. Right? And so you don't really need to understand 100% of the BERT paper to apply BERT. Right? You may need to understand some key concepts, and then that allows you to start making an impact. So this was the idea behind writing this book: building intuition for the problem, understanding the key ideas behind the problem without getting into, you know, proving things and the mathematics and notation and all of that. Code examples that work rather than, you know, very well understood theoretical examples. Right? So that I could just take that code and change the path, you know, and tune some parameters, and I already have something I can start doing real things with. Right? So this is the purpose for the book. It's a more applied perspective, a perspective that makes it more appealing to somebody without very deep, you know, expertise in machine learning and theoretical academic machine learning to start using these tools to make an impact. You were describing how the
[00:12:04] Unknown:
applications of transfer learning, 1 case is for these low resource languages, as you put it, you know, languages that don't have enough readily available textual data for being able to train models on, or languages that only have a very small population of people that speak them. So, you know, maybe languages in terms of some, like, Native American tribes that only have some of the elders who are still speaking it and that were never really primarily written languages. And I'm wondering, in addition to that, what are some of the other applications of natural language processing that are either difficult and impractical or intractable without using transfer learning as the basis to work through them, as opposed to using some of the more, quote, unquote, traditional approaches to natural language, so doing things like Word2Vec or, you know, TF-IDF and things like that. So I think language generation
[00:12:57] Unknown:
is probably a major example, a major answer. So you've probably heard of GPT-3. Yes. So its ability to, say, write poetry. I did an experiment where I wrote some poetry, and I put it on my social media. I got a lot of very positive comments. But it was just GPT-3, you know, I guess, replicating some patterns of data it was trained on. Alright. So this is an example. I mean, is that useful? Somebody could use that, you know, in a creative sense to, like, stimulate ideas. I don't know. Somebody could use it for writing or, like, just, you know, I guess, generating ideas. More practical applications of the same technology, like, for instance, generating code from a mere description of it. Because GPT-3 was trained on basically a lot of code examples, let's say, in Python.
You can ask it to, you know, generate a Python function to featurize something using Word2Vec. And, surprisingly, it's able to do pretty well. In many cases, generate very good functional code, especially if it is fine tuned for that task. Right? Not the general model, but fine tuned for that specific task. I would say that's something that almost feels like sci fi. Right? We need to stress, it's not completely eliminating the human from the picture. It's just giving the human a multiplier effect on their productivity. It's not gonna write it correctly a 100% of the time.
Right? Probably that code is going to just be a template that you start from. So maybe instead of copying it from Stack Overflow, which you used to do, now you start with GPT-3, which just kind of gives it to you with fewer clicks. And I don't know. Maybe you can voice search it or something, just to multiply your productivity. It's not eliminating you. I need to stress that because some people are afraid of that. We probably can't even predict what people are going to do with it tomorrow. At the beginning of this year, in my work at the organization called Ghana NLP, where we are trying to do this stuff for Ghanaian languages, we couldn't have imagined 6 months ago what we are able to do today with voice.
So voice, automatic speech recognition. Automatic speech recognition is when it transcribes the speech. Right? Nothing like this exists for Ghanaian languages. There are some academic examples in the papers somewhere, but there isn't, like, a well deployed method that's used. We were able to build it with just a couple hours of speech, using a pre trained model that was pre trained on a lot of different languages to learn some common things across them, and then add a couple hours of speech in our language. A couple hours. 1 person can do that. If they're sufficiently motivated to put their own language on the map, they just need a couple hours. Think about it. So I expect that, you know, within 6 months, all of a sudden, something that we couldn't even imagine was possible will become quite possible for basically almost all the languages out there. I think that's pretty dramatic.
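To make the workflow described here more concrete, below is a minimal, hypothetical sketch of adapting a multilingual pretrained speech model to a new language with only a small amount of transcribed audio, using the Hugging Face Transformers library. The XLSR checkpoint, the vocab.json character vocabulary, and the training details are illustrative assumptions, not the exact setup Ghana NLP used.

```python
# Hypothetical sketch: fine-tune a multilingual pretrained speech model on a
# couple of hours of transcribed audio in a new language.
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

# A character vocabulary (vocab.json) built from your own transcripts.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0, do_normalize=True
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the model that was pretrained on many languages and attach a fresh CTC
# head sized for the new language's vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(tokenizer),
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
)

# Freeze the convolutional feature encoder; only the transformer layers and the
# new head are updated by the small labelled dataset.
model.freeze_feature_encoder()
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)
# ...a standard training loop over (audio, transcript) batches would go here...
```

The point of the sketch is the ratio: almost all of the parameters arrive pretrained, and the couple of hours of labelled speech only have to teach the model the new language's characters.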
[00:15:52] Unknown:
So maybe in 5 years, we'll actually have a real live Babelfish that we don't necessarily have to stick in our ear, but we can actually use, you know, in practical conversations. I know that there are some sort of proof of concept implementations of that that Google has for being able to, you know, do live transcription from, you know, 1 language, translate it, and then, you know, text to speech back to the other person to be able to have sort of a halting conversation. But, you know, maybe in 5 years' time, we'll actually be able to have that be more fluid with more languages supported. Have you tried Google Translate lately? Only the textual version, but not on any sort of actual phrases.
[00:16:26] Unknown:
The speech recognition is pretty good already. For the large languages, already, I would say we are there. I mean, you have to do a couple clicks. It's not like a, you know, well integrated, smooth workflow. So maybe that's what we are waiting for, like, some engineering solution like that. And there are challenges there that I probably don't even know all of them, just about, you know, there's noise in the environment. So, like, the model we built, if you're trying to teach a person how to speak the language correctly, it's a great tool for that, because it will teach you how to enunciate. Otherwise, it won't recognize it correctly. Because the person who recorded it was a linguist who was trying to do everything correctly. But come on. We all know that's not how we actually speak in practice. If you want to build a system that we use while we are playing video games, we are not gonna sit there and enunciate. Right? So we need this noisier, more robust, you know, more real world model, transfer learned maybe from this more formal model.
And then, you know, test it
[00:17:21] Unknown:
in real settings because there are challenges that come with that. And we are doing some of that, by the way. Digging more into the actual practical aspects of applying transfer learning to NLP problems, can you just talk through the sort of high level steps that are involved in taking a base language model and translating it to be applicable to a particular problem domain that you're trying to solve for? So I would say there are 2 main steps, and this
[00:17:48] Unknown:
was captured very well by a method known as ULMFiT in transfer learning for NLP. And there are 2 main steps here. So the first step, of course, when you say base model, just so that our listeners understand what that means, that's the model that was pre trained on a large corpus of, say, multiple languages, or a large corpus in a particular language, that we are now going to adapt to a new problem. So usually, the first step is fine tuning that base model on a corpus of data to adapt it to the new data distribution, if that makes sense. So the data that the base model was trained on was data that was supposed to capture all the possibilities out there. Right? In terms of what it might have to recognize and do.
So it may be able to understand all business sectors, but at the expense of, you know, loss in accuracy, slight loss in accuracy. You can take that model and adapt it to, say, financial data or text from a financial domain or financial news to make that model more specific to your use case. That usually helps. The second step is fine tuning the downstream task layers. Just an additional layer you place on the base model that then maps, you know, the vectors coming out from the base model into, say, a classification problem, like a softmax, where it's picking between a couple of categories.
Or you're mapping it into some other activation function to do a regression problem. Right? So that's like a downstream task or task specific fine tuning step, where usually you have a small amount of supervised data or labeled data that then,
[00:19:31] Unknown:
you know, give you very good performance for your specific task. Yeah. It definitely does. And then in terms of the specifics of actually taking 1 of these pretrained models, and earlier, you're mentioning sort of removing a couple of layers of the network and then rebuilding additional layers on top of it to fit your particular domain. I'm just wondering, getting into the bits and bytes of it, what does that actually look like where you have this binary object that's this pretrained model? What's actually involved in being able to then say, you know, back out the last 2 or 3 layers of your neural net and then replace it with these other network layers, you know, in terms of, like, actually implementing that. What does that look like as far as what is the binary object of the model that you're working with? How are you able to tell it, you know, these are the layers that I want you to remove. And then from, you know, a machine learning perspective, I know that a number of people who work in this space will understand sort of building these additional neural nets. You know, it's essentially the same as working from scratch. It's just that it's a much shallower network that you're appending to the end of an existing network. But, you know, what are the steps of unwinding and rewinding those different layers?
[00:20:35] Unknown:
I think it depends very much on the framework you are using to do it. So if you are using, like, Keras or TensorFlow, there's a process, and we have examples in the book that take you through that. But basically, you know, in TensorFlow, you can take your original model description, load the weights, which will put the weights into it, and then remap some of the outputs to new outputs. Right? So if it was a base model, it was probably producing a 768 dimensional feature vector. Right? So you can take that and create a new model, which, you know, maps that to the softmax for your classification problem. That's pretty much all. And at that point, you have to call some functions to freeze the layers you don't wanna train and unfreeze the layers you do want to train. In tools such as Hugging Face Transformers, it's even easier than that. So, you know, you load a class that represents a model, like BERT for language modeling.
I don't remember the name exactly, but BERT for language modeling, that's 1 class they have. So you load the pre trained model into it, and then you can save that model, say, locally. Right? In whatever form. They even have ways of saving it as a TensorFlow saved model. You can save it as a PyTorch model. You can save it as a TensorFlow model. Right? Then you can take another class, let's say, BERT for sequence classification, which now has the new architecture you're going to use, like classification. Right? So you don't have to do any mapping. You just use a different class. You tell it how many outputs you want and what the activation function should be. And it has been built to understand that, okay, I'm loading a base model into a task specific model. So the base model goes in the encoder, and the rest is, you know, the new stuff you have to train.
Right? And so, again, in that space, you can either use the TensorFlow format to freeze and unfreeze if you are using the TensorFlow back end, or you can use the PyTorch syntax to freeze and unfreeze weights as needed. So there's some coding.
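As a concrete illustration of the freeze-and-fine-tune pattern just described, here is a minimal sketch using the Hugging Face Transformers PyTorch classes. The bert-base-uncased checkpoint, the 3-label setup, and the example sentence are placeholder assumptions, not anything specific from the episode.

```python
# Hypothetical sketch: load a pretrained encoder into a task-specific class,
# freeze the encoder, train only the new head, then unfreeze top layers.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# The task-specific class attaches a fresh, randomly initialised
# classification head on top of the pretrained encoder.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Freeze the encoder so that, at first, only the new classifier layer trains.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

batch = tokenizer(["the market rallied after the announcement"], return_tensors="pt")
labels = torch.tensor([2])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# Later, unfreeze the top couple of encoder layers and continue fine tuning
# with a lower learning rate, as described above.
for param in model.bert.encoder.layer[-2:].parameters():
    param.requires_grad = True
```

In Keras the same idea is expressed by setting layer.trainable to False on the layers you want to keep fixed before compiling the model.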
[00:22:52] Unknown:
We've all been asked to help with an ad hoc request for data by the sales and marketing team, then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV file via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, and many more. Go to python podcast.com/census today to get a free 14 day trial and make your life a lot easier.
And we've been talking about a number of different models that are fairly well known. So there's BERT, there's ERNIE, there's ELMo, GPT-3. I'm sure that there are a number of others out there. I'm just wondering, as you say, you know, I have this particular problem that I'm trying to solve for. You know? I want to be able to build a, you know, language understanding model, or I wanna be able to do sentiment analysis, but I wanna be able to do it on a, you know, low resource language. And so how do I then determine what is the best base pretrained model to work from? Or, you know, maybe I want to do some type of economic modeling using news sources from Yahoo News or something. How do I go, you know, across these different problem domains and understand what are the best pretrained models to work from? And then, you know, once I say, okay, I've got this model, or maybe there are 2 or 3 different models that have different aspects of what I want to be able to build from, are there ways to be able to synthesize across those models to be able to make use of some of those different capabilities?
[00:24:31] Unknown:
So I think the answer, of course, depends on the problem statement. You gave several different ones. So if you are interested in classification, you're definitely going to try BERT, and you are probably going to try some smaller versions of BERT, like DistilBERT and TinyBERT, versions of BERT that were made smaller and faster, maybe at a small loss in performance. Right? Models like BERT, these transformer based models, are very good at classification problems. I guess with some experience, if you, say, read a reference which kind of tries to cover some of the strengths and weaknesses of the models, and it doesn't have to be a lot. Like, the final chapter of my book tries to do that. There are many other references that are good. Maybe even, like, 4 pages, like a recent survey of recent architectures and why they were built and what their strengths are. So it's gonna be good for classification. And sentiment analysis is a type of classification. Let's say you're interested in language generation.
Right? That immediately means something like GPT-3 or 1 of its alternatives out there, a generative model. Right? The generative models, you can also use them for classification. Right? They do produce a set of vectors that are useful for classification. In fact, when GPT-3 came out, it beat all the records in that domain, but over time, other models took over. So it does seem like the generative models are better for generating language, per se. So if you are trying to write poetry or, you know, generate a blog or generate code, right, you definitely need a generative model. That's where you're gonna start. If you are thinking about a different language, like a low resource language for which there's no data, then you need a multilingual variant of 1 of these models. So a lot of them have been pretrained on a lot of languages simultaneously in order to learn the cross lingual commonalities and to generalize to new languages. Right? So, of course, if you are interested in a new language, you will start with a multilingual model, because the pure English 1 probably doesn't contain as much useful information for you.
There are a lot of considerations to be had. Like, some models were built for long text. Right? So a lot of these models have a fixed input size, let's say 512 tokens. Right? Which a lot of the time is okay because, I mean, people usually put the most important stuff at the beginning. Right? And so that's already a good way to, like, even reduce your data. But sometimes, for some applications, that's not true. Sometimes you are looking for something that might be at the end. So if you are truncating, that's not working for you. So then you'll start looking at, okay, you know, Longformer, Reformer, BigBird. These are just some examples of models that were built for that specific use case.
Obviously, there are voice models and there are text models. Right? So there's that dimension. Some of them are cross lingual. Sorry, cross modal, like the text to image models now from OpenAI. So the answer is gonna depend very much on the particular problem you want to solve, but I would say these are like the general kind of strokes.
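In practice, much of the selection discussed here comes down to which pretrained checkpoint you point the library at. A hedged illustration with the Transformers Auto classes follows; the checkpoint names are common public examples chosen for illustration, not recommendations from the episode.

```python
# Hypothetical illustration: switching between the model families discussed
# above is largely a matter of loading a different pretrained checkpoint.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForCausalLM,
)

# Classification on English text: BERT, or a smaller distilled variant.
clf_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Classification on a low resource language: start from a multilingual checkpoint.
multi_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# Language generation: a generative (causal) model; GPT-2 stands in here
# because GPT-3 itself is only reachable through a hosted API.
gen_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Long documents: an architecture built for longer input windows.
long_model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)
```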
[00:27:38] Unknown:
If you have something where, you know, maybe I want to be able to do, you know, classification, and then from that classification, do some sort of generative text. You know, I want to classify the general sentiments around the economic recovery and then be able to generate a summary of information from what I was just, you know, synthesizing. Is there a way to be able to combine those multiple different pretrained models into a new model for being able to combine those types of use cases?
[00:28:07] Unknown:
People like to group this transfer learning into various groups depending on whether you're training on a new data distribution, a new language, or a new task. And so the task area is called multitask learning, and that's where you're learning on multiple tasks simultaneously. And there are various ways of doing that. You can, you know, do them 1 after the other. Sometimes there are ways of concatenating them into the same feature vector. You can encourage the systems to work well together by constructing an appropriate loss function and then minimizing that loss function.
So there are various different strategies for this. But, yes, multitask learning is how you would accomplish this.
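For the multitask setup sketched here, a minimal hypothetical example in PyTorch is shown below: one shared pretrained encoder, two task heads, and a weighted combined loss. The encoder name, head sizes, loss weights, and the second "task" are all illustrative assumptions.

```python
# Hypothetical sketch of multi-task learning: a shared encoder feeds two task
# heads, and a single weighted loss trains everything together.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name="distilbert-base-uncased",
                 num_sentiment_labels=3, token_vocab_size=30522):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Task A: sentence-level sentiment over the first token's vector.
        self.sentiment_head = nn.Linear(hidden, num_sentiment_labels)
        # Task B: a token-level head standing in for a second task.
        self.token_head = nn.Linear(hidden, token_vocab_size)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.sentiment_head(states[:, 0]), self.token_head(states)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = MultiTaskModel()
batch = tokenizer(["the recovery is gathering pace"], return_tensors="pt")
sent_logits, tok_logits = model(**batch)

# Weight the per-task losses and minimise the sum, so both heads and the
# shared encoder are trained jointly.
sent_loss = nn.functional.cross_entropy(sent_logits, torch.tensor([1]))
tok_loss = nn.functional.cross_entropy(
    tok_logits.view(-1, tok_logits.size(-1)), batch["input_ids"].view(-1)
)
total_loss = 0.7 * sent_loss + 0.3 * tok_loss
total_loss.backward()
```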
[00:28:47] Unknown:
In terms of the actual tools and frameworks that are available, you mentioned things like Keras, TensorFlow, PyTorch. Those are all pretty well known in terms of the, you know, general deep learning and machine learning frameworks. And then you also mentioned things like Hugging Face Transformers, and then there's a lot of stuff coming out of OpenAI. I'm just wondering if you can give your sense of what the current state of the ecosystem looks like for being able to use transfer learning in the NLP domain, what the current strengths are, and any areas of weakness or any gaps in the available tooling or, you know, availability of useful tutorials for being able to get started in this space?
[00:29:27] Unknown:
So I'll start with 1 that hasn't come up yet. It's fast.ai. That's an open source course Jeremy Howard and his partner developed. It's, I would say, in my personal opinion, very good for trying some very academic or otherwise inaccessible methods very quickly. So it has, like, functions for picking the best learning rates for your problem, which is something actually I haven't seen anywhere else. So, like, this method ULMFiT, in just a couple function calls, you are able to, you know, use this state of the art method right away. In terms of weaknesses, very quickly, it's very hard, in my experience, to adapt it to new, you know, very custom problems that I'm having, that they haven't written their API for, if that makes sense. So if I was building a very customized solution, at the end of the day, I'm probably going to end up in TensorFlow or PyTorch.
And the reason is deployment. On the deployment front, things like Hugging Face Transformers, for instance, provide an accelerated API that's paid, that will run fast. But if you are not using 1 of those kinds of solutions, you are going to have to use something like TensorFlow Serving or TorchServe to, you know, speed it up. Because you are probably going to be running it on maybe millions of pages or something in production, on real production grade problems, and just using the Python library is not gonna be good enough. But the Python library, Transformers, is pretty much where you're going to find the latest architectures, where you can just load them in, like, 2 lines of code and try them out and see. At least on a small data set, you can compare the different methods. And then you can export the model with PyTorch or TensorFlow and then put it in these more downstream tools. And other tools that should be mentioned: AWS SageMaker, obviously.
This is something that allows you to use, like, elastic inference accelerators, which are cheaper than GPUs for scaling, for instance. It allows you to, you know, take your, like, serving solutions, or maybe even Docker images if that's how you like to do things, and deploy them fast and at scale. Docker can be important if you have a very specific dependency, right? If you do not have access to a GPU locally, then you probably want to look at Kaggle Kernels, or you want to look at the Google tool, Colab notebooks. That will give you a free GPU, at least something to be doing research with if you don't have the resources. Azure has a solution that's similar to SageMaker, which is the Machine Learning Studio.
So I would say this pretty much covers it. Keras, of course, is important, and they're doing a lot of work in this direction now.
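The "load it in 2 lines and compare" workflow mentioned above can be as simple as the sketch below; the two sentiment checkpoints are public examples chosen for illustration.

```python
# Hypothetical quick comparison: pull down two different pretrained sentiment
# models and eyeball their outputs on a handful of samples before committing
# to a deployment path.
from transformers import pipeline

samples = [
    "The quarterly results beat expectations.",
    "The rollout was a complete disaster.",
]

for checkpoint in [
    "distilbert-base-uncased-finetuned-sst-2-english",
    "nlptown/bert-base-multilingual-uncased-sentiment",
]:
    classifier = pipeline("sentiment-analysis", model=checkpoint)
    print(checkpoint, classifier(samples))
```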
[00:32:16] Unknown:
And in terms of your experience of working in the space and researching the space and doing the research for the book? What are some of the most interesting or innovative or unexpected applications of transfer learning and NLP that you've seen?
[00:32:29] Unknown:
I think this has kind of been dripping through the conversation. I think the impact on the amount of data I need to train a model for a new language has been mind blowing to me, and just a very practical impact on something I care about. The code generation was pretty, you know, head snapping for me. It's like, I can do that. I can just tell the machine to basically outline the broad strokes of a library. And then, of course, I'm gonna have to get in there and test it, but that cuts down the development cycle so much. Most people say GPT-3 or a model like that isn't really understanding human language.
Right? But at the same time, you could argue it's passing the Turing test, because, you know, you put the text out there and people think you wrote something deep, right? So they think a human wrote it. Did it pass the Turing test? Right? So this raises questions like how important metrics are. Right? Most people train their translators for BLEU score. Right? But when you read the BLEU score papers, even the BLEU score authors outline ways in which it fails. That's what everybody does, though. Everybody trains on BLEU score because the people who first did it trained on BLEU score, and now we need to compare to their work. So we're stuck with BLEU score, but is it really the best way to measure things? I think as a field, maybe machine learning sometimes doesn't stop because of the speed at which things happen. Right? And so there's momentum and everything moves. But sometimes it may benefit us to kind of just slow down and think carefully about what we are doing rather than, you know, just scaling because it seems to be working. So let's go. Right?
I hope that falls within the purview of your question.
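For readers unfamiliar with the metric being criticised here, BLEU is an n-gram overlap score between a system translation and one or more references. A small hedged example with the sacrebleu package (made-up sentences):

```python
# Hypothetical example of computing BLEU with sacrebleu. High n-gram overlap
# can coexist with poor meaning, which is part of the criticism above.
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one reference stream

score = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {score.score:.1f}")
```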
[00:34:16] Unknown:
And then another thing that I've been seeing a lot lately is sort of a general fear, uncertainty, and doubt around the use of things like generative models for being able to create content, and sort of the deep fake issues around video and audio being applied to text, and being able to, you know, rapidly generate a bunch of false news reports or misleading information, and just sort of the general aspects of disinformation, which I know you've said that you've done some work in that space. And I'm wondering just what your thoughts are on the trade offs of these models being so powerful for being able to do beneficial work and also the fact that they're potentially being used to create disinformation and sow uncertainty and doubt?
[00:35:04] Unknown:
So 1 thing I would say, based on my work in the disinformation space, is that the most effective disinformation actors are actually not robots. They are cyborgs. Right? You really need people working really hard with the dissemination tools to, you know, make this stuff work. Right? So that inherently places some limits on how far these things can scale. At the same time, yes, it has lowered, you know, the entry barrier for some malicious actors to do terrible things. And I mean, it's a fundamental philosophical question, though, that I think we always return to. Should we innovate or should we not innovate? Because, you know, understanding the atom means we can create nuclear weapons, but it also means we can create nuclear energy. Right?
Should we not understand the atom because we are afraid of our darker nature? Right? Any problem like this is kind of like an active adversarial area where basically the good guys and the bad guys are fighting each other. At the end of the day, we hope that there is more support behind the good actors so that they are going to win. The only way to really defeat the evil is to be ahead of it, not by not playing, right? Saying, I'm not gonna participate? They'll probably figure out how to participate without you, right? So you probably should be there
[00:36:26] Unknown:
and balance it in the right direction. Yeah. I know that there was some general sort of back and forth when GPT 3 was first created, around how, first, they only released the light version of the model because they were worried about its applications, and then they kinda realized that, you know, it was going to come out 1 way or another. And so they released the full model. So it's definitely an interesting problem domain. And as you mentioned, it's very sort of existential and philosophical, but at the same time, a very practical problem. So it's interesting to see how it's playing out. Sometimes you may confuse political statements for the truth.
[00:36:59] Unknown:
There's a lot of power that the winners of this game, you know, the commercial game that's going on around all these technologies, are going to wield. And so, you know, people sometimes don't release their software because they want to keep an advantage. But, of course, they may say that they're doing it for a different reason. Right? How do you really
[00:37:18] Unknown:
know? And so in terms of your experience of working on the book and diving deeper into this particular space of transfer learning and natural language processing, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:37:33] Unknown:
In an area like this 1, it's very important to be, you know, focused on some key things you're going to tackle. Because every day, there's going to be a new paper, a new model that tries something different and something new. And it may seem that it's kind of deprecating the stuff you are working on. So there is, like, a balance of picking the problems that you think are going to, you know, stick around for a while. Right? Because you don't want it to become irrelevant tomorrow, because things move so quickly. I mean, you can make that determination if you look carefully. Like, we know BERT, for instance. Even today, there may be all these other models out there, but everyone still starts with BERT. Right? When they're working on a problem like this, BERT should be in your testing.
And a lot of companies have put resources behind making those models particularly fast to deploy and so on. And so, you know, it's very important to pick carefully what you're going to work on and not get too distracted or worried about, you know, stuff going on. There'll be a lot of stuff going on. The other part is, I guess, how little academics think about making some of these methods appealing to the wider public. There seems to be a gap between, you know, these amazing papers that are being produced, very applied, impactful papers on all this cutting edge machine learning, and making it actually easy for people to use it in production or write some code that actually uses it for some problem. I think that's an area that's not been given as much attention.
And I think there's a lot of reason to close that gap. And this is what this book tries to do. I've seen more resources like that coming out, which encourages me. But I think we still have some distance to go.
[00:39:31] Unknown:
And in terms of people who are looking at transfer learning and they're looking at natural language processing and they've got some sort of language based or textual project that they're trying to work on, what are the cases where transfer learning is the wrong approach and they might be better served with, you know, TF-IDF or Word2Vec or 1 of the other types of approaches?
[00:39:50] Unknown:
So 1 of the dangers of pre trained language models is the potential bias the model may carry that you don't know about, because you didn't train it. Right? So if you are working in an application where explainability is very important and you need to explain it, let's say you are working on detecting who is guilty based on their blood pressure or something like that. Right? I mean, if there's so much that can go wrong if you get it wrong, then, you know, you really should think about whether such a bias can have a bad impact like that. If you can build a model from scratch where you can, you know, explain all the pieces, that's ideal. And that's not always possible. Right? Some problems are way too hard.
And in that case, you should be working on some kind of explainability mechanism that explains how your model came to its decision. In some cases, your deployment strategy may not be sophisticated enough to handle some of these models. They are more expensive than some of the approaches you mentioned. You know? Like, running production grade neural networks, you know, it's not cheap, you know, if you don't have, like, a use case that's valuable enough. Right? So you may not need it. It may not be the right application if all those trade offs don't work out right for you. You may not need it if you have sufficient data out there already that you can train your own model from scratch. You know, like, training your own BERT model, I mean, it's not cheap, but for, like, a startup, it's not that expensive anymore. Right? So you may be able to train your own if you have sufficient data. Or, I guess, your problem is probably just simple enough, right? It doesn't need all of this stuff. You might be better served. Yeah.
[00:41:35] Unknown:
And as you continue to work in this space, what are some of the trends or techniques or new areas of research that you're keeping an eye on or that you're most excited for?
[00:41:44] Unknown:
The major 1 at the beginning of this year, I mentioned already, which is the focus on voice. So far, the past couple of years, it's all been text, text, text, text. Now all of a sudden, all these other modalities are being brought in. So now we've seen that the same sorts of techniques can be used in voice technologies, which is very exciting. And also, of course, the return back to computer vision. So all of this was inspired by computer vision, but it's different from computer vision. And now some of these new ideas are going back into computer vision. So now people are building image transformers, where they're trying to contextually embed objects.
Right? So train on unsupervised data where you kind of try to learn what kind of context, let's say, a camera usually appears in, or a human or glasses or headphones. Right? So that you can try to be generative, descriptive about the image, and so on. And so the use of transformers in computer vision, which is kind of the reverse direction of the ImageNet moment, is also very exciting.
[00:42:49] Unknown:
Are there any other aspects of transfer learning and natural language processing, these particular areas of research, or the work that you're doing on the book that we didn't discuss yet that you'd like to cover before we close out the show?
[00:43:00] Unknown:
I would like to stress some of the work that me and some of my colleagues are doing in, I guess, democratizing some of these technologies for African languages through the organization known as Ghana NLP. And in this context, we have done things like build new data resources for some African languages. We've done things like build translation apps, both for iOS and Android. The app is called Khaya, k h a y a. You can download it, give it a try. We started with Ghanaian languages. We have since expanded it to the rest of the African map because our users were asking for it. So we have languages like Swahili and Zulu and Wolof and Yoruba in beta testing now. We are also putting in the voice capabilities. So we recently built a Twi automatic speech recognition system. We are putting it in the apps and just trying to make sure that, you know, some of these tools that people take for granted, like Google Translate and so on, are available to, you know, say, people from Africa, which I think is important
[00:44:09] Unknown:
for all of us, for a more equitable world where we can all communicate and work together. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. This week, I'm going to choose a musical group called Infected Mushroom. They do electronic or techno or whatever you wanna call it, but just a very, you know, interesting group, a lot of, you know, variety in terms of how their different albums go. You know, some of it's entirely instrumental. Some of it has voice. So just a lot of fun for something to listen to, you know, while you're coding or zoning out or exercising, whatever it is. So definitely worth checking out if you're looking for a new group to listen to. And with that, I'll pass it to you, Paul. Do you have any picks this week? I watched that movie called Tenet, which was a very interesting concept.
[00:44:59] Unknown:
I thought the movie made it difficult to understand. But the concept of traveling backwards in time, not in 1 go, but in a linear way, where you just switch direction and you're going backwards, and then switch it again and go the other way, is kind of interesting.
[00:45:15] Unknown:
I don't think a lot of popular culture has thought about it that way. So that comes to mind. Yeah. That was definitely an interesting movie. Another guest has recommended that 1 as well. So, yeah, I watched it, and I'll have to agree with you. Definitely worth a watch, but make sure that you're paying attention because there's a lot going on. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on transfer learning and NLP, and for taking the time to write the book to make it more accessible to more people. So I appreciate all the time and energy you've put into that, and I hope you enjoy the rest of your day. Thank you so much, Tobias. Thank you for having me, and I had a great time talking to everyone. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Paul Azunre Begins
Introduction to Transfer Learning
Paul's Journey into NLP and Transfer Learning
Applications of Transfer Learning in NLP
Practical Steps for Applying Transfer Learning
Choosing the Right Pretrained Model
Combining Multiple Pretrained Models
Tools and Frameworks for Transfer Learning
Challenges and Lessons Learned
When Not to Use Transfer Learning
Future Trends in Transfer Learning
Democratizing NLP for African Languages
Closing Remarks and Picks