Summary
Machine learning is growing in popularity and capability, but for a majority of people it is still a black box that we don’t fully understand. The team at MindsDB is working to change this state of affairs by creating an open source tool that is easy to use without a background in data science. By simplifying the training and use of neural networks, and making their logic explainable, they hope to bring AI capabilities to more people and organizations. In this interview George Hosu and Jorge Torres explain how MindsDB is built, how to use it for your own purposes, and how they view the current landscape of AI technologies. This is a great episode for anyone who is interested in experimenting with machine learning and artificial intelligence. Give it a listen and then try MindsDB for yourself.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what MindsDB is and the problem that it is trying to solve?
- What was the motivation for creating the project?
- Who is the target audience for MindsDB?
- Before we go deep into MindsDB can you explain what a neural network is for anyone who isn’t familiar with the term?
- For someone who is using MindsDB can you talk through their workflow?
- What are the types of data that are supported for building predictions using MindsDB?
- How much cleaning and preparation of the data is necessary before using it to generate a model?
- What are the lower and upper bounds for volume and variety of data that can be used to build an effective model in MindsDB?
- One of the interesting and useful features of MindsDB is the built in support for explaining the decisions reached by a model. How do you approach that challenge and what are the most difficult aspects?
- Once a model is generated, what is the output format and can it be used separately from MindsDB for embedding the prediction capabilities into other scripts or services?
- How is MindsDB implemented and how has the design changed since you first began working on it?
- What are some of the assumptions that you made going into this project which have had to be modified or updated as it gained users and features?
- What are the limitations of MindsDB and what are the cases where it is necessary to pass a task on to a data scientist?
- In your experience, what are the common barriers for individuals and organizations adopting machine learning as a tool for addressing their needs?
- What have been the most challenging, complex, or unexpected aspects of designing and building MindsDB?
- What do you have planned for the future of MindsDB?
Keep In Touch
- George
- Blog
- George3d6 on GitHub
- @Cerebralab2 on Twitter
- Jorge
- MindsDB
Picks
- Tobias
- Bose QuietComfort 25 noise cancelling headphones
- George
- Jorge
Links
- MindsDB
- 3Blue1Brown – Neural Networks
- Think Bayes
- Backpropagation
- Reverse Automatic Differentiation
- Ludwig deep learning toolbox
- Lightwood
- Tensorflow
- PyTorch
- Aerospike
- scikit-learn
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or you want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI pipelines, they just launched dedicated CPU instances. In addition to that, they just launched a new data center in Toronto, and they've got one opening in Mumbai at the end of 2019.
Go to pythonpodcast.com/linode, that's l-i-n-o-d-e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system that can keep up with you, designed by software engineers for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page.
Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And you can visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And if you have any questions, comments, or suggestions, I'd love to hear them. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:58] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks. So, George, can you start by introducing yourself?
[00:02:09] Unknown:
Yeah. So I'm George, and I work mainly on MindsDB, and occasionally on some other contributing gigs in, like, the data engineering and machine learning space. There's possibly a slight chance you've read some of my articles before if you've stumbled upon them on Medium, and you can find more about me on my GitHub profile, which is george3d6.
[00:02:31] Unknown:
And, Jorge, how about yourself? Yep. I also work on MindsDB, and I've been working on it since we started a year and a half ago. You can find me on GitHub as torrmal. And I work mostly on explainability,
[00:02:50] Unknown:
just simplifying machine learning for everyone. And, George, going back to you, do you remember how you first got introduced to Python?
[00:02:57] Unknown:
So, personally, I think I got introduced to it as soon as I started getting into programming, since it's a rather popular language. If you go into many fields, you essentially stumble upon it by mistake. I actually started being interested in statistics and machine learning when I was reading a sort of introductory Python book called Think Bayes, which I would recommend to anyone. So, yeah, I think that's how I got into it, essentially. And, Jorge, do you remember how you first got introduced to Python?
[00:03:25] Unknown:
Yeah. It was, almost 12 years ago. We were doing some work at Kalshir Fin, and we needed to do some, basic machine learning, and we introduced Python to the stack back then. And so
[00:03:38] Unknown:
can one of you start by explaining what MindsDB is and the problem that it is trying to solve, and the motivation for creating it?
[00:03:46] Unknown:
So the problem, or the thing we are trying to do with MindsDB essentially, is provide a high level library for making predictions and evaluating them. We believe that we've reached a point in machine learning where in certain scenarios, a lot of scenarios, you can automate the work that traditionally a data scientist or a machine learning engineer would do. So just to give an analogy here, in the eighties or even in the nineties, a large majority of game studios might have produced their own game engines. But nowadays, 99% of studios have no game engine engineers in them at all, and instead they just have general focused programmers who are able to use various engines and libraries that handle the complex mathematics of real time rendering.
They handle cross platform compatibility, physics simulations, etcetera. So that's kind of what we're trying to do with MindsDB: give you the power of close to bleeding edge machine learning without you having to know any of the intricacies behind it. You can use it for anything from recommendation algorithms to image recognition to estimating revenue or optimizing spend. So with MindsDB we aren't aiming for that 1% of hard machine learning and statistics problems that might require a team of data scientists, but rather at what you'd call the bottom 99% of ML problems.
Problems that have already been solved in one iteration or another thousands of times, and that you could easily automate with one function call. But until now, there's really been no open source solution to do that.
[00:05:38] Unknown:
And as far as the sort of target user, is it somebody who is, as you mentioned, just a general programmer who is interested in incorporating some of the predictive capabilities of machine learning into the rest of their application? Is it a business user who has maybe a spreadsheet of information and they're trying to make some predictions based on that who might have a little bit of sort of technical acumen? Or is it just sort of anyone who is able to, come across MindsDB and understand the potential and go through the initial setup processes?
[00:06:13] Unknown:
So I think with the version we have at the moment, the main focus is on programmers that don't necessarily have any machine learning experience, but have a problem that requires some sort of machine learning, some sort of predictions. That being said, we are in the process of building tools to allow essentially anyone with a prediction problem to use MindsDB, at least to some extent. It might be a bit hard to actually deploy a model into production if you are a business person, but you should at least be able to use it to prototype some ideas. We're also targeting, for example, a data scientist who might just want a benchmark model to compare their models against, or might want a placeholder model somewhere and doesn't want to bother writing it. So that's kind of the audience we're going for at the moment, though the core of it is mainly developers, and that's partially a good thing as an early stage project because you get a lot of people that can accurately report issues and maybe look at the code and even get interested and
[00:07:20] Unknown:
contribute to it. And so before we get too much deeper into MindsDB itself, I know that it's using neural networks as the underlying architecture for these machine learning models. So I'm wondering if you can just give a bit of background about what that terminology means for somebody who isn't familiar with it. Okay. So
[00:07:40] Unknown:
I'll do my best to explain, and then Jorge can complete the picture. So, essentially, there are two ways to look at a neural network. The one that's most popular, at least as an introduction, comes from a sort of psychology or neuroscience perspective: people try to compare it to something that mimics the brain. So you have these "neurons", which are essentially just a couple of numbers and a bunch of ties to other neurons, and what you're essentially trying to do is build a network that sends impulses down the line to hopefully go from some inputs to some conclusions, some outputs.
I think a better, easier perspective is a more mathematical one, something that most people in ML nowadays would take. For that, I would like to start with something simple, such as a linear classifier. Right? So let's say that you have a very simple predictive problem: for a given piece of content, you know the number of viewers, and you want to estimate how much ad revenue you'll get two months down the line. And let's say, for the sake of argument, that this problem is linear.
So you take your number of viewers, multiply that by some number a, and you get your revenue. Right? Does that make sense? Okay. So that's the kind of equation which you can essentially model with a line, what we call a linear equation, because you only have one moving variable, a. And you can use a rather simple algorithm: you start by plotting a random line, then you see how well it fits all of your previous data points, and then you just adjust it until you get an a which is more or less the solution to the equation viewers times a equals an approximation of your ad revenue.
But most problems in the real world are not linear. Most problems in the real world are not even polynomial or exponential; they are quite complex and they require a very complex equation. So essentially, a neural network is just an equation with sometimes dozens of millions of parameters which we adjust in order to get a solution for a very complex problem, for example, feeding in a bunch of pixels. What sits at the core of a neural network is not so much the architecture but actually the algorithm that is used to train the network, to adjust all these parameters, which is what we'd call backpropagation.
So backpropagation is essentially a technique to adjust an equation based on the errors. You feed in a bunch of pixels, and what you want to get is some numbers that mean cat, and instead you get some numbers that mean elephant or random gibberish. Backpropagation, which is a subclass of a wider set of algorithms you might call reverse automatic differentiation, essentially allows you to compute the amount that you want to tweak the variables in that equation to hopefully get closer to the correct answer, which would be cat. And you apply this algorithm tens of thousands of times, and in the end, you manage to adjust this very complex equation with millions of variables, which we call a neural network, to give you the correct labels for whatever images you have. And the nice thing about neural networks, because they have so many variables, is that they can efficiently model a large host of problems.
So a neural network very similar to the one that might be able to classify images, let's say, could also be used to classify audio, and a slightly tweaked neural network could be used to translate human speech into text, or the opposite. Explaining neural networks in a quick way might be a bit hard, but I think I can recommend to anyone that wants to start getting into the field a set of videos by the YouTuber 3Blue1Brown, who usually does videos on topics related to mathematics. He actually did an amazing series on neural networks, and we'll hopefully put a link to that in the description.
And I really encourage anyone to check it out. He does a much better job at explaining it than I
[00:12:40] Unknown:
could. So I think there was something that wasn't covered in the question, which was the difference between neural networks and deep neural networks. As George was describing it, a neural network is just this equation that will try to model something, and it's inspired by how neurons work in the brain. The deep side of this is just that you pile up these layers and give depth to what you call your neural network, or your artificial neural network. And the deeper you go, the more complex the patterns that this neural network can understand.
And the hierarchy of patterns lets it classify or perform more complicated tasks. So, essentially, a deep neural network is just a neural network that has many different layers, and those layers allow you to solve much more complicated problems. That is one of the main differences.
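To make the "adjust the parameters until the equation fits" idea concrete, here is a minimal Python sketch that fits the one-parameter revenue example George describes using gradient descent, the same error-driven adjustment that backpropagation applies to networks with millions of parameters. The numbers are made up purely for illustration.

```python
# Fit revenue ≈ a * viewers by repeatedly nudging `a` against the error.
# This is the one-variable version of the adjustment loop described above.

viewers = [1200, 3400, 5600, 8100, 10500]
revenue = [24.0, 69.0, 115.0, 160.0, 212.0]   # hypothetical ad revenue per episode

a = 0.0              # start with a (bad) initial guess for the slope
learning_rate = 1e-9

for step in range(10_000):
    # Gradient of the mean squared error with respect to `a`
    grad = sum(2 * (a * v - r) * v for v, r in zip(viewers, revenue)) / len(viewers)
    a -= learning_rate * grad   # nudge `a` in the direction that reduces the error

print(f"fitted slope a is roughly {a:.5f}")          # revenue per viewer
print(f"predicted revenue for 7000 viewers: {a * 7000:.2f}")
```

A neural network does exactly this kind of update, just over millions of parameters at once, with backpropagation supplying the gradients.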
[00:13:40] Unknown:
And so for somebody who is getting started with MindsDB and wants to build a model based on some input data, I'm wondering if you can just talk through the overall workflow and the types of source data that you support.
[00:13:54] Unknown:
So at the moment, what we support in terms of data would be CSVs or any other character separated value files. We also support Excel files, and JSON as an array of objects with the keys representing the columns. We also support Pandas data frames, and MindsDB does expose an interface to build your own data source. The workflow, once you have your data, would essentially be just two functions. First, you have your original training data set, the data set on which you have some input features and an output feature.
So, for example, let's say that you are trying to predict the number of viewers that this podcast is going to get. And let's say you have something like the length of the podcast, the day of the week when you intend to publish it, the title of the podcast, and a score for the popularity of the guest. Right? And what you want to predict is the number of viewers. So what you'd essentially feed into MindsDB would be a CSV with previous data from your other episodes, with those four columns and the number of viewers, which is what we'd call the output feature.
You could just place that in a CSV or an Excel file or a pandas data frame, call the learn method of MindsDB, give it that file, and tell it the column that you want to predict, in this case the number of viewers. Then MindsDB would train a model and it will probably give you some insight about the data. So it might say something like: the day of the week the podcast gets published doesn't seem to be at all relevant for how many viewers you have, or the popularity of the guest is so strongly correlated with the number of viewers that you could just throw out the other columns.
But, you know, you can act on those insights that you get during training, or you can just leave your data as is. And then once you've trained the model, once the learn function has finished running, you can simply call the predict method, where you give your input features and it will give you the value that you want. So in this example I kind of made up, it would give you the number of viewers, and it would also give you a confidence score for that prediction, that is, how confident it is in that specific prediction given the parameters that you gave as an input. That's really about it. Most users of MindsDB will probably just use the learn and predict functions, and that's kind of the purpose of the library, to have the usage be very simple. There are a few more advanced parameters that someone might want to tweak for some other types of datasets, but let's not get into those right now.
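To make the two-call workflow concrete, here is a rough sketch in Python using the hypothetical podcast dataset from the example above. The class and argument names follow MindsDB's documented learn/predict interface from around the time of this episode, but treat the exact signatures as illustrative rather than authoritative, since they may differ between versions; the file and column names are invented for this example.

```python
# A sketch of the learn/predict workflow described above (names are illustrative).
import mindsdb

predictor = mindsdb.Predictor(name='podcast_viewers')

# 'podcasts.csv' is a hypothetical file with columns: length_minutes, publish_day,
# title, guest_popularity, and the output column we want to learn: viewers.
predictor.learn(from_data='podcasts.csv', to_predict='viewers')

# Ask for a prediction for a new, unseen episode.
result = predictor.predict(when={
    'length_minutes': 55,
    'publish_day': 'Monday',
    'guest_popularity': 8,
})

# The result includes the predicted value along with a confidence estimate.
print(result[0])
```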
[00:16:50] Unknown:
And one of the canonical quotes for machine learning in general is the idea of garbage in, garbage out. And so I'm wondering how much upfront cleaning and preparation is necessary for feeding the data into MindsDB to ensure that you're getting something useful out of it, or if MindsDB is able to
[00:17:11] Unknown:
handle some measure of automation as far as cleaning up outliers and data anomalies within the source inputs. Alright. So currently, what we've opted for is mainly an approach, as I mentioned before, where we warn the users about potential mistakes in their data. For example, if you have a column which has a particularly large number of outliers, we will give you a warning about it and tell you, hey, maybe you don't want to use this column for a prediction. If a particular column is very bad, for example if all the values are null, we might decide to just go ahead and not use it for the prediction. But in 99.9% of the cases, our approach is to use everything for the prediction, but warn the user, when they feed in the data and after we analyze it, about data that might pose a potential issue.
So either data which will essentially be inconsequential to the prediction, or a column which is essentially correlated with the prediction, or two columns which are essentially the same thing but rephrased in two different ways. Say a date time in ISO format and then the same date time as a timestamp: we will warn the user about it and tell them to maybe remove those columns. And after we've actually trained the model itself, we have an analysis phase where we try to figure out which columns the model actually cares about, which ones it doesn't, and which columns might harm the prediction. And, again, we try to tell the user about that. So that's kind of our approach there.
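As an illustration of the kind of pre-training warnings being described here (this is not MindsDB's actual implementation, just a hedged sketch of the general idea), a few lines of pandas can already flag mostly-null columns, outlier-heavy columns, and near-duplicate numeric columns:

```python
# Illustrative data-quality checks, assuming a hypothetical 'podcasts.csv' file.
import pandas as pd

def data_quality_warnings(df: pd.DataFrame, null_threshold=0.9, z_threshold=3.0):
    warnings = []
    for col in df.columns:
        series = df[col]
        # Warn when a column is almost entirely empty.
        if series.isna().mean() > null_threshold:
            warnings.append(f"'{col}' is mostly null; it will add little to the model")
        # Warn when a numeric column has a large share of extreme values.
        if pd.api.types.is_numeric_dtype(series) and series.std() > 0:
            z = (series - series.mean()).abs() / series.std()
            if (z > z_threshold).mean() > 0.05:
                warnings.append(f"'{col}' has an unusually high share of outliers")
    # Warn when two numeric columns look like the same information twice.
    corr = df.select_dtypes('number').corr().abs()
    for a in corr.columns:
        for b in corr.columns:
            if str(a) < str(b) and corr.loc[a, b] > 0.99:
                warnings.append(f"'{a}' and '{b}' appear to carry the same information")
    return warnings

print(data_quality_warnings(pd.read_csv('podcasts.csv')))
```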
That being said, we do some data normalization in order to determine a consistent data type for a column, for example. So, you know, you can have a few mistakes in your dataset and that won't mess up your whole prediction. But generally speaking, we are aiming more towards letting the user know about the mistakes and having them correct them. Because if we did that internally, we might end up being wrong, and we might actually correct something that we see as a mistake but that the user strongly wanted to keep in there. I think, adding to this, it's important to know that the purpose of MindsDB is really to
[00:19:43] Unknown:
minimize the risk of machine learning and AI becoming dangerous. And we see that as a process that will evolve as machine learning progresses. What we understand right now as a risk is people trusting machine learning blindly. So if you train a model, as you said, trash in, trash out, we want to inform you where we identify potential risks. And we also want to make sure that people that are implementing machine learning models are aware of the liabilities of what they're dealing with. If you are implementing a machine learning model to predict whether someone is going to develop, say, a chronic condition in health care, you want to understand why and when the model may fail, and when your data has biases or problems that may be significant to the predictions that you're producing.
As well, when you get a prediction, you want to understand how confident you can be about it, and these are the elements that George was just describing as to what we provide for users other than just a
[00:20:48] Unknown:
prediction. Yeah. And I think the fact that you have built in explainability as a first class concern in MindsDB, to ensure that when you do get a prediction you're able to understand why it reached that particular conclusion, is definitely valuable. It can be very useful in terms of regulatory environments, where I know that GDPR, for instance, has a line item that says that any prediction or result that is produced by machine learning or artificial intelligence needs to be explainable. So having that built into MindsDB is useful in that context, but also for somebody who's trying to get into machine learning and using MindsDB as an entry point, being able to use that as a way to backtrack and fill in their understanding of how these systems work. Just in general, having that explainability is valuable in a lot of different contexts. So I'm wondering how you have approached that in MindsDB
[00:21:48] Unknown:
and some of the ways that you have found it to be useful in your own work. Yeah. I think that right now, we're going through the stage of quality. So we are trying to analyze the data you give us along different quality dimensions. As George was describing, we analyze many, many different factors, from outliers to potential indicators of biases. And then once you train the model, we analyze the model for quality, to understand how good the model is at predicting what you want to predict. That is the stage we are at right now, and we want to nail this very well because we believe it's the entry point to explainability. Once we are able to produce reports about the quality of the data as well as the quality of the predictions for the people that are using MindsDB, then we can move to the second stage, which is trying to understand what drives a prediction, and why not another prediction.
And we believe that explainability is a road that will lead to machine learning that can itself explain why it behaves the way that it does. But as it stands now, for most problems, just understanding the quality of the data and the quality of the predictions is a good start, a good start for anyone that wants to implement machine learning to be confident about whether they can use this in a real world scenario
[00:23:12] Unknown:
or when they cannot. Yeah. And I think that too helps address some of the hesitance or nervousness that people might have about bringing machine learning into an existing application, because they might not have a good handle on why a decision is made. Whereas when you're just dealing with procedural logic, you can follow the path and reassure yourself that you know what's happening. But if you're handing it over to a complex mathematical model and not being able to trace it back to the input data and the overall flow of execution, it might just lead somebody down the path of saying, I'll just deal with more procedural logic and make it ever more complex, whereas they might be able to reduce a significant amount of effort and code in their application by using some of these advanced mathematical techniques,
[00:24:03] Unknown:
but that they might be avoiding just because they don't have enough confidence in how it actually functions. Yeah. That's right, Tobias. And there's a second thing about what we do that has to do with the risk of machine learning. The reason why we want to simplify it to the point that anyone that knows the very basics of programming can produce a machine learning model is that usually the domain experts are not data scientists. Like, if you are someone that has been working with oncology data for some time, you may be more of an oncologist than a programmer. We want to provide you the toolset to get some insights as to what machine learning can do for you, but you are still the domain expert. And one of the biggest risks today is that true machine learning experts cannot also bear the responsibility of being oncologists or manufacturing engineers.
And there is a risk of all of that domain expertise being lost if we delegate the predictive capabilities and expertise just to the machine learning experts. Because, really, it takes years to build the domain expertise to understand when things are fishy or not. And that's why we are going back to the data quality and informing people about where something is not looking correct. It usually takes a domain expert to see this and be like, okay, this makes sense with my understanding of the problem.
[00:25:25] Unknown:
And so as far as the overall development cycle for a machine learning project in an organization that does have data scientists and machine learning engineers on staff, but also has a certain amount of domain expertise required, does MindsDB then allow you to change the overall cycle, where rather than having a requirements gathering phase that you pass off to the data scientists, who then go and explore the data and make their best effort to build some sort of a model or prediction based on those inputs.
You flip it so that you have the domain expert determine the relevant data, feed that into MindsDB
[00:26:05] Unknown:
to generate an initial model, and then pass that off to the data scientists to explore and refine before you go to production? Oh, yeah. So that would be one of the hopes exactly; I was mentioning the quick prototyping thing earlier. In a large organization where you do have access to a data science team, you might still have too many problems for them to handle all of them. So MindsDB can serve as a very good filter, in terms of you as the domain expert being able to figure out which of your datasets are even worth handing over to the data science team in order to make some predictions based on them. And
[00:26:47] Unknown:
so once you have generated a model using MindsDB, I'm wondering what the output format looks like and what the next steps are as far as being able to use it either in a production context or being able to use the generated model to run the predictions either directly from MindsDB or as part of another script or application?
[00:27:07] Unknown:
So MindsDB, as it stands today, is a set of different projects. One is what we call MindsDB Native, which is a pip module that you can install, and then you can train and do everything locally. But you can also save and export these predictive models and share them with other people, as well as deploy them to what we call MindsDB Server. And then once you have it in the MindsDB Server, they can be accessed through the API. We have JavaScript SDKs, and we also have a graphical user interface for those that don't even know how to program but want to experiment with the models that someone has built, either with MindsDB Native or directly through the API on the server. We want to enable people, regardless of their technical expertise, to be able to either experiment with the models or to deploy and actually use them in production. So MindsDB as it stands now is a combination of a tool that you can run on your computer, a Python module, and a Python server that you can use to share and deploy models, as well as to expose them through the graphical user interface that we have for people that don't even use Python or don't know how to program. And in terms of how MindsDB
[00:28:23] Unknown:
is implemented, I'm wondering if you can talk through the overall architecture and design of the application and talk through how it has evolved since you first began working on it. Essentially, we provided a tool that
[00:28:36] Unknown:
was kind of a duct tape prototype, and people liked the idea of being able to produce a machine learning model with just one line of code. This was due to having a lot of friends developing projects that required some sort of machine learning, but the machine learning part was just a cog within their machine. So we produced a simple Python library that we've been evolving to the point that it is at now. And George can describe a little bit more about how that evolution has happened.
[00:29:07] Unknown:
Yeah. So, essentially, what we had as a prototype was a tool that was mainly focused on the prediction and the predictive model behind it. And what we really wanted, as we said before, was to be able to focus on the analytics beyond the prediction. There are a lot of smart people working on the how of making predictions right now, and what we really want to do is take the best algorithms out there, collect them, and focus on the why with MindsDB. So why the prediction came out that way, is your dataset good enough to make a prediction to begin with, etcetera, all the things we talked about. In order to do this, what we've done is make MindsDB more and more modular.
So the main MindsDB repository right now is essentially 90% focused on processing the data, doing analytics on that data, and then doing analytics on the model itself. And the way we think about the machine learning part is that we are able to plug in a machine learning back end. What we're working with in this case is a machine learning framework from Uber called Ludwig and our own machine learning framework called Lightwood. The former is based on TensorFlow and the latter is based on PyTorch. But we also have another machine learning model in the works; it's in the beginning phases, developed by a specific university, but I won't name names right now because it's not finalized.
And, essentially, that's one of the things that this modularity allows us to do. Another thing that we've decoupled from MindsDB is the GUI. So we have a separate GUI that's able to cleanly show all of those insights to someone that might not fancy reading them out of the command line. We also have a server component, which essentially allows you to do predictions remotely. So if you want to, say, run off a Raspberry Pi, and you can't really run the learning algorithms there but you can collect the data with the sensors there, you can just send that to the server, which will then feed it into MindsDB. So that's kind of the way the code has evolved: we've focused a lot more on the explainability part and we've tried making things more modular.
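To illustrate the modular design being described, here is a minimal, hypothetical sketch of what a pluggable model back end interface could look like. The class and method names are invented for illustration and are not MindsDB's real internals; the point is only that the data analysis layer stays fixed while the model back end (Ludwig, Lightwood, or something else) can be swapped behind a small common interface.

```python
# Hypothetical sketch of a pluggable model back end; names are illustrative only.
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    @abstractmethod
    def train(self, df, target_column: str, column_metadata: dict):
        """Fit a model given the data and the data-type analysis."""

    @abstractmethod
    def predict(self, when: dict) -> dict:
        """Return a prediction plus any backend-specific extras."""

class LightwoodLikeBackend(ModelBackend):
    def train(self, df, target_column, column_metadata):
        ...  # build and fit a PyTorch-based model here

    def predict(self, when):
        ...  # run inference, e.g. return {'prediction': ..., 'confidence': ...}

def train_with_backend(backend: ModelBackend, df, target_column, column_metadata):
    # The analysis layer decides *what* to feed the back end; the back end decides *how*.
    backend.train(df, target_column, column_metadata)
    return backend
```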
Also along the way, I think we've cleaned up the code a lot. I've been trying to encourage as many community contributions as possible. And even when people don't contribute code, I find that clean code allows people to more easily trace back their issues to maybe even a specific function or, you know, line of code that is actually causing it rather than just copy pasting the stack trace. And I think it's interesting that you
[00:32:12] Unknown:
have, as you said before, started off using Ludwig, which relies on TensorFlow, and now you're building your own back end using PyTorch. I'm wondering what you have found to be some of the comparative differences in your experience of working with those two back ends. So, we actually started off
[00:32:30] Unknown:
with our own back end, which was mainly built by Jorge before I joined. We switched to introducing Ludwig when we made the decision to separate out the machine learning back end. Ludwig came along at just about that time, so it was a perfect experiment for us to see if we could actually easily integrate MindsDB with another generic machine learning library and use that as the model back end, as we'd call it. And,
[00:33:01] Unknown:
reinforcing that point, I think that one of the elements of machine learning today is that it moves really fast. You know, PyTorch did not even exist a few years ago, and now there's a growing community around it. And we believe that there is, and should be, a Ludwig equivalent for PyTorch, which is what we built in Lightwood. But nonetheless, this will continue to evolve, and there are so many other frameworks, like Julia and you name it. What we want to prevail is, for those that want to use machine learning in production or even for experimentation, for them to not have to worry too much about what is the latest of the latest. We will make sure that we incorporate that in our modular architecture, so that you can rely on the fact that if tomorrow there's something better than Ludwig or Lightwood, whatever you call it, we will try to incorporate that into the machine learning back end. And therefore, you can rest assured that you will be using the state of the art in terms of machine learning frameworks.
Nonetheless, explainability and the other elements of machine learning that are important, and that we think are crucial to solve right now, we will continue to provide through MindsDB. So as the machine learning community and frameworks continue to grow, we will continue to support them, and we will continue to put effort into making those explainable and reliable and easy to understand and use by anyone.
[00:34:35] Unknown:
And as far as the actual algorithms that you're using within MindsDB, are you relying largely on what's built into the different back end frameworks, or are you adding your own custom built algorithms? And then as far as determining which one to apply to a given problem set, I'm wondering what the overall search function looks like as you're
[00:34:56] Unknown:
building and training the models. Okay. So what we essentially do with MindsDB is a lot of looking at the data and, based on that, trying to figure out a set of instructions for the machine learning back end itself. What we have right now is, essentially, we determine a bunch of data types and data subtypes, and then we also have some other insights based on the number of data points in a column and the quality of those data points, which we can, and in certain cases do, feed into the model back end in order to allow it to build a more well suited algorithm for the specific dataset. And as far as your experience
[00:35:46] Unknown:
of building and expanding on MindsDB, I'm curious what were some of the assumptions that you had going into the project
[00:35:53] Unknown:
and how those assumptions have been challenged and updated as you continued to gain users and build features? Well, I think the big assumption that we're betting on right now is that people will actually use the explainability part. We believe that explainability is important and crucial to be able to understand these models. We are yet to validate whether the approach that we're taking to it is the correct one. So that right now is an assumption. We have some validation, but we will continue to validate this throughout. It's not that people don't want to trust, and have data to trust, whatever they're building their machine learning capabilities on.
We just want to make sure that the approach that we're taking, analyzing quality first and then taking it from there, is the right one. Before, the big assumption that we had was: do people really want a machine learning framework where they can simply, with one line of code, produce a predictive model, and then with another line of code use it, and just be blind about how it happens on the back end? And we've found people with two different perspectives. You have the hardcore data scientist that wants to dive in as much as he or she can, and for those, MindsDB is still very opaque.
And I think that the explainability part may add value for them. But we also understand that there's another 90% of developers out there that really just want to get a solution built into the product and continue to the next task. And those have found MindsDB to have the right approach: the simplest API possible, and then if you want to dive deeper, you can, which is what we're
[00:37:33] Unknown:
going for at the moment. And in its current state, what are the primary limitations of MindsDB and the situations where it's necessary to pass a given task on to a data scientist or machine learning engineer for building a model versus, being able to do any of the preliminary work within MindsDB itself?
[00:37:53] Unknown:
So I think that the main limitation of MindsDB is the main limitation of really any machine learning model, which is to say that it only finds correlations; it can't really think about causation. That's mainly a case where sometimes you need to pass the task to a domain expert to be able to actually tell you if the model makes sense, or even whether trying to build a model in that situation would make sense. In terms of limitations for the problems we are trying to solve specifically, at the moment we still aren't able to really handle audio or video inputs.
So that would be one of them where, at the moment, you would still need to build your own model; MindsDB wouldn't be able to do that, but we are working on it. Another one of the limitations right now is very large datasets. Once you get into the hundreds of gigabytes or dozens of terabytes, MindsDB might not be ideal. But, again, we are actively working on ways to have MindsDB work on larger and larger datasets and to have it be distributed amongst multiple machines. So that's one of our focus points in the near future and has been for a little while. And as far as
[00:39:18] Unknown:
the overall work of building and growing the MindsDB project and the user base, I'm wondering what have been some of the most challenging, complex, or unexpected
[00:39:27] Unknown:
aspects of doing that work. I think for me personally, one of the most surprising aspects has been how much of a challenge it is to actually get it working on as many machines as possible, and trying to work with users to debug the various issues they have. Since it's an open source project, there's often a lot more feedback than with a closed source one, and the feedback is often of lower quality. So there's definitely a different development pathology there. That's, I think, honestly been the main one for me. Yeah. I agree with George. It is fascinating how, once you try to release something for as many platforms and as many people's computer configurations as possible,
[00:40:10] Unknown:
then we spend quite a significant amount of time trying to solve bugs where, I don't know, for some Windows user one of the dependencies doesn't work, or things like that, that consume just as much time as any of the other issues. But we put a lot of effort into making sure that those issues are solved and it continues to be available to most people. I think, from my perspective as well, when we decided to incorporate Ludwig, we got sold on the idea that they were quite a bit ahead of the back end that we were building ourselves. And in reality, you know, it's another open source project, and they also have a lot of issues.
And even though we took a leap into separating the back ends, and that's the future of MindsDB, a lot of what we thought worked also didn't work right off the bat. And it's the same thing for us: a lot of people may use MindsDB and they may find things that they expect to work but don't. So the fact that MindsDB relies on other open source projects on the back end also means that we rely on the quality of development of other people, which is outside of our control.
[00:41:23] Unknown:
A last one which I would like to add, which I actually found quite interesting because I worked for a long time as what I would describe as a data engineer, is getting performance up to par. Because it's definitely harder to think about performance, and even to test performance, when your dataset is literally anything. Right? Your inputs could be anything, your outputs could be anything. And that's quite an interesting challenge, which we are actively working on and getting better and better at. And we've actually
[00:42:07] Unknown:
sustainability of the work that you're doing and trying to make sure that you are able to, you know, make a living from the work that you're doing, and also make sure that you are able to incorporate
[00:42:20] Unknown:
community feedback? Yeah. It's the million dollar question. I think that there are successful projects out there that are open source, and those people have invented a formula that we don't have to reinvent. So if you go to mindsdb.com, it looks more like a consulting site, and people come to us with all kinds of crazy problems. Some of those problems may work right out of the box with MindsDB, and some of them may require some tweaking. And even for the ones that work out of the box, there's always the work of really understanding what it is that people want. Right now, our path to sustainability is helping people with any type of predictive problem they have, and trying to use the machine learning capabilities that we've built through MindsDB. If MindsDB doesn't work, then we improve MindsDB with that understanding.
But nonetheless, it is a win-win solution where we bring in revenue from people with real world problems, and we continue to improve MindsDB, so the learnings from that continue to go back to the community. And the thing is, the more you do machine learning, the more complicated your problems get, and the more you rely on the value of machine learning in your organization. So it is easy for companies to start with one simple question that they want to answer, and then once they answer that one, they want to answer another one and another one. We built the system open source because we really want to democratize machine learning, but we also understand that there are many other companies that are willing to pay for a service that uses this open source infrastructure that we've built. And there are many projects out there, like Red Hat and Mongo, you name it. A successful open source project always comes with a side of really understanding the customer needs and making sure that the learnings that we have from those engagements
[00:44:21] Unknown:
get back to the community. And I think, to address the community side of the question as well, one of the ways that we have tried to differentiate even from other open source projects is to make a lot of the development open. Obviously, you can't really discuss everything in public, and doing that would be pointless. A lot of projects have most of the discussions going on in a mailing list, but who's going to read the mailing list if they're not working on the project? We've tried to have all of our goals and all of our development targets open as issues on GitHub.
And whenever we make any changes to those, we talk about them in the actual issues, so there are discussions going on on GitHub. And if someone wants to engage with the project, or wants to figure out how a feature is coming along or what the quality of a certain feature is at the moment, they can always track that on GitHub. Because there are a lot of open source projects, for example Aerospike, which, don't get me wrong, is an amazing product, but the only person that can work on Aerospike is an Aerospike developer, because neither the code nor the development process is designed with outsiders in mind. The way we are trying to design MindsDB and the development process as a whole is in such a way as to make it very easy for people from the community both to contribute and to figure out what's going on. And as far as your experience
[00:45:59] Unknown:
of working with the community and with the people who have reached out to you for support, and just seeing the types of ways that MindsDB is being used, I'm wondering what are some of the more interesting or innovative ways that it's being leveraged, or things that you found surprising, and also just overall lessons that you have learned in your work of building and maintaining MindsDB that have been particularly unexpected or useful? Okay. I will start. So for me, it's been really two things. One has been
[00:46:29] Unknown:
how many simple datasets are out there that people want to run predictions on. The kind of datasets that one could easily run through a lot of generic models in maybe 10 or 20 lines of code, that people maybe haven't tried playing with before just because even the simple solutions out there, like scikit-learn, were maybe not friendly enough. And the other one has been the number of people that had interesting side projects that they wanted to run with MindsDB. For example, a software developer that I met randomly at a conference had this moonlighting job or hobby as a DJ, and he actually wanted to try and use MindsDB to figure out playlists for him. So I've seen a few very interesting projects which are definitely not in the category we'll focus on in terms of business, but I would really hope we can also help those people: the people that just want to do something fun and have some data, but maybe they're not quite up to par with the latest in machine learning. Yeah. I think that the most surprising one we've
[00:47:45] Unknown:
gotten, from my perspective, is someone that reached out to us who wanted to predict lotto numbers, as in the lottery, and he had collected incredible amounts of data from all the different lotteries around the world. It was, to me, surprising to see how much effort this person had put into the data collection. I don't know what the results were as to how accurate it was, but what I did find interesting is that there are people out there with all sorts of ideas, and it is just our mission to provide them with a toolset, no matter how crazy their idea is, so that they can make a prediction even if they don't know machine learning or don't have the time to actually build the models, or, as George mentioned, to just have a baseline benchmark.
[00:48:34] Unknown:
And so looking forward, I know that you said that some of the immediate term work is going to be focused on, adding more support for the modularity of MindsDB to be able to support different back ends. But I'm wondering in the medium to long term, what you have planned for the future of the project?
[00:48:51] Unknown:
Yeah. I think that the medium to long term of the project goes back to the purpose of MindsDB itself, which is targeting the problem of machine learning or AI becoming dangerous. And the danger that we see right now is, again, that people will blindly trust what these machine learning models are outputting, or that the domain experts get replaced. So the medium and long term planning of MindsDB will always revolve around solving the actual issues that we identify at that given time as dangerous in machine learning. In this case, it's explainability.
The second one is we don't want the domain experts to be replaced. We actually believe that the future of machine learning or AI is one in which humans and machines will continue to collaborate, and machine learning is just a tool to augment the decision capacities of the decision makers. So we are aiming to always produce these tools such that, instead of replacing a human somewhere, they're augmenting the capacity of that human. In this case, it's augmenting the capacity of the domain experts, and also making it reliable. And we will continue to make sure that in the long term, we tackle whatever problem we identify as dangerous at that time when it comes to machine learning.
[00:50:24] Unknown:
Does that answer your question? And are there any other aspects of the work that you're doing with MindsDB or the current state of machine learning that we didn't discuss yet that you'd like to cover before we close out the show? So in terms of what we are doing with
[00:50:41] Unknown:
MindsDB... wait. Let me think on this for a second.
[00:50:49] Unknown:
If the answer is no, that's fine too. I'd just like to give you guys the opportunity to call out anything else that we didn't cover that I didn't think to ask about. Yeah. I think that,
[00:50:58] Unknown:
what we want to generate is an open discussion about how machine learning has to be designed to be safe for humans and humanity in general. This is just the start of a project, and it has that general objective. But we really would like people to come back with this philosophy: whenever we design machine learning, we should design it to augment humans and to make sure that it is always with the best intention for humanity. For us, MindsDB is just a platform for this conversation to happen, and for companies to make sure that when they're implementing machine learning, it's reliable and trustworthy, and there are always humans at the end making those decisions.
So we want to make sure that as we continue to talk about MindsDB, we can continue to talk about the general objective of what we're building, rather than it being just a simple Python framework that makes things easy. Easy is important because it means domain experts can use it, and explainable because that's the problem that we see right now. But in general, we would like to invite people, when they implement and build either competitors of MindsDB or the next generation of machine learning frameworks, to think of these crucial problems, which are: how do you produce the most value to humans rather than a threat to humans, which, you know, you can argue machine learning can potentially be, in terms of jobs, and later, when it becomes smarter than us.
Unless it's embedded in collaborating with us, it can be potentially dangerous.
[00:52:48] Unknown:
And I think what I would like to add to that is that even though the open source part might seem small to begin with, I think this whole trend, which is definitely not just MindsDB, of making machine learning open source is important. Because machine learning will get to be one of those things which is present in all aspects of life. And it might seem like a small thing, given that most people that want to use machine learning out of the box will do it through a service like AWS or Google or Azure.
But I think that in the long run, it matters a lot. And just to give a historical example of that, think back to the days when most compilers were closed source, and then think of GCC, the GNU compiler essentially, and the impact that Stallman and his co-developers had in the long run by developing that, making it open source, and releasing it under a GPL license, which means it will keep being open source and modifications to it will keep being open source. So I think that even if in the short run this sort of open source model looks like it doesn't necessarily matter that much, pulling the community in an open source direction will, in the long run, make a big difference in terms of who controls this very powerful tool. Right? Like, do you go to an open source solution for your machine learning needs, or do you go to the Google God and hope that they have the right answer?
And I think one of the big parts that we want to focus on in the long run is making sure that anyone is able to use MindsDB, that we don't just have support for the best hardware, that we don't have support just for enterprise use cases, so that, you know, a 14 year old that has an interesting problem can pick it up and run it on his old laptop
[00:55:04] Unknown:
and get something interesting out of it. Alright. And for anybody who wants to get in touch with either of you, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the QuietComfort 25 headphones from Bose. I just picked up a set recently because they were on a pretty significant sale, and so far, I've been happy with them. The noise canceling is quite good, and they are a lot more comfortable than the last set of headphones I was using. So, if you're in the market for a new set of headphones, definitely worth checking them out. And so with that, I'll pass it to you, George. Do you have anything this week? There's a second project that we're working on, which is Lightwood, which is our equivalent to Ludwig
[00:55:44] Unknown:
that you can check out if you guys wanna try it. As for something completely off topic, I'll have to think about it. Let me get Jorge to answer this question and I'll do some picks. Okay. So one thing I would definitely like to recommend to everyone, and it could be considered loosely related
[00:56:01] Unknown:
to the discussion, is the neuroscience course from MIT, which is completely open and you can find it on their website or on YouTube. I've recently started going through it and have already gone through, like, 10 lectures in 2 days. It's extremely interesting, and I think anyone interested in brains or machine learning or really just how the world works should check it out. On a personal note, I always like to shout out my blog, which is blog.cerebralab.com, where I write articles. Some of them may be a bit inflammatory, so, you know, those are definitely my personal opinions and not necessarily related to MindsDB.
[00:56:44] Unknown:
Was there anything else that you wanted to call out, Jorge? Yeah. I'm definitely getting the headphones that you mentioned. But I wanted to talk about documentation. For MindsDB, we use Docusaurus, but for the Zoom project, we decided to go for MkDocs. And if anyone is building a project that requires documentation, if you use MkDocs with the Material templates, from Google's Material Design, then you probably get something very, very nice straight out of the box. That usually took us quite a while to produce with Docusaurus. So give it a try with MkDocs. Alright. Well, thank you both for taking the time today to join me and discuss the work that you've been doing with MindsDB
[00:57:28] Unknown:
and your efforts to make machine learning more accessible to more people. I definitely appreciate that, and it's something I'll be planning to toy around with on my own. So I appreciate all of your efforts, and I hope you enjoy the rest of your day. Thank you. Thank you, and thanks for having us.
Introduction to MindsDB with George Hosu and Jorge Torres
What is MindsDB?
Target Users and Use Cases
Workflow and Supported Data Types
Data Cleaning and Preparation
Explainability in Machine Learning
Implementation and Architecture
Assumptions and Challenges
Interesting Use Cases and Future Plans
Closing Remarks and Picks