Summary
Machine learning is growing in popularity and capability, but for a majority of people it is still a black box that we don’t fully understand. The team at MindsDB is working to change this state of affairs by creating an open source tool that is easy to use without a background in data science. By simplifying the training and use of neural networks, and making their logic explainable, they hope to bring AI capabilities to more people and organizations. In this interview George Hosu and Jorge Torres explain how MindsDB is built, how to use it for your own purposes, and how they view the current landscape of AI technologies. This is a great episode for anyone who is interested in experimenting with machine learning and artificial intelligence. Give it a listen and then try MindsDB for yourself.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Podcast.init listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall is the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by explaining what MindsDB is and the problem that it is trying to solve?
- What was the motivation for creating the project?
- Who is the target audience for MindsDB?
- Before we go deep into MindsDB can you explain what a neural network is for anyone who isn’t familiar with the term?
- For someone who is using MindsDB can you talk through their workflow?
- What are the types of data that are supported for building predictions using MindsDB?
- How much cleaning and preparation of the data is necessary before using it to generate a model?
- What are the lower and upper bounds for volume and variety of data that can be used to build an effective model in MindsDB?
- One of the interesting and useful features of MindsDB is the built in support for explaining the decisions reached by a model. How do you approach that challenge and what are the most difficult aspects?
- Once a model is generated, what is the output format and can it be used separately from MindsDB for embedding the prediction capabilities into other scripts or services?
- How is MindsDB implemented and how has the design changed since you first began working on it?
- What are some of the assumptions that you made going into this project which have had to be modified or updated as it gained users and features?
- What are the limitations of MindsDB and what are the cases where it is necessary to pass a task on to a data scientist?
- In your experience, what are the common barriers for individuals and organizations adopting machine learning as a tool for addressing their needs?
- What have been the most challenging, complex, or unexpected aspects of designing and building MindsDB?
- What do you have planned for the future of MindsDB?
Keep In Touch
- George
- Blog
- George3d6 on GitHub
- @Cerebralab2 on Twitter
- Jorge
- MindsDB
Picks
- Tobias
- Bose QuietComfort 25 noise cancelling headphones
- George
- Jorge
Links
- MindsDB
- 3Blue1Brown – Neural Networks
- Think Bayes
- Backpropagation
- Reverse Automatic Differentiation
- Ludwig deep learning toolbox
- Lightwood
- Tensorflow
- PyTorch
- Aerospike
- scikit-learn
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or you want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI pipelines, they just launched dedicated CPU instances. In addition to that, they just launched a new data center in Toronto, and they've got one opening in Mumbai at the end of 2019.
Go to pythonpodcast.com/linode, that's l-i-n-o-d-e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system that can keep up with you, designed by software engineers for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page.
Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And you can visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And if you have any questions, comments, or suggestions, I'd love to hear them. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:58] Unknown:
Your host as usual is Tobias Macey. And today, I'm interviewing George Hosu and Jorge Torres about MindsDB, a framework for streamlining the use of neural networks. So, George, can you start by introducing yourself?
[00:02:09] Unknown:
Yeah. So I'm George, and I work mainly on MindsDB, and occasionally on some other contributing gigs in, like, the data engineering and machine learning space. There's possibly a slight chance you've read some of my articles before if you've stumbled upon them on Medium, and you can find more about me on my GitHub profile, which is george3d6.
[00:02:31] Unknown:
And, Jorge, how about yourself? Yep. I also work on MindsDB, and I've been working on it since we started a year and a half ago. You can find me on GitHub as torrmal. And I work mostly on explainability,
[00:02:50] Unknown:
just simplifying machine learning for everyone. And, George, going back to you, do you remember how you first got introduced to Python?
[00:02:57] Unknown:
So, personally, I think I got introduced to it as soon as I started getting into programming, since it's a rather popular language. If you go into many fields, you essentially stumble upon it by mistake. I actually started being interested in statistics and machine learning when I was reading a sort of introductory Python book called Think Bayes, which I would recommend to anyone. So, yeah, I think that's how I got into it, essentially. And, Jorge, do you remember how you first got introduced to Python?
[00:03:25] Unknown:
Yeah. It was, almost 12 years ago. We were doing some work at Kalshir Fin, and we needed to do some, basic machine learning, and we introduced Python to the stack back then. And so
[00:03:38] Unknown:
can one of you start by explaining what MindsDB is and the problem that it is trying to solve, and the motivation for creating it?
[00:03:46] Unknown:
So the problem, or the thing we are trying to do with MindsDB essentially, is provide a high level library for making predictions and evaluating them. We believe that we've reached a point in machine learning where in certain scenarios, a lot of scenarios, you can automate the work that traditionally a data scientist or a machine learning engineer would do. So just to give an analogy here, in the eighties or even in the nineties, a large majority of game studios might have produced their own game engines. But nowadays, 99% of studios have no game engine engineers in them at all, and instead they just have general focused programmers who are able to use various engines and libraries that handle the complex mathematics of real time rendering.
They handle cross platform compatibility, physics simulations, etcetera. So that's kind of what we're trying to do with MindsDB: give you the power of close to bleeding edge machine learning without you having to know any of the intricacies behind it. You can use it for anything from recommendation algorithms to image recognition to estimating revenue or optimizing spend. So with MindsDB we aren't aiming for that 1% of hard machine learning and statistics problems that might require a team of data scientists, but rather at what you'd call the bottom 99% of ML problems.
Problems that have already been solved in one iteration or another thousands of times, and that you could easily automate with one function call. But until now, there's really been no open source solution to do that.
[00:05:38] Unknown:
And as far as the sort of target user, is it somebody who is, as you mentioned, just a general programmer who is interested in incorporating some of the predictive capabilities of machine learning into the rest of their application? Is it a business user who has maybe a spreadsheet of information and they're trying to make some predictions based on that who might have a little bit of sort of technical acumen? Or is it just sort of anyone who is able to, come across MindsDB and understand the potential and go through the initial setup processes?
[00:06:13] Unknown:
So I think with the version we have at the moment, the main focus is on programmers that don't necessarily have any machine learning experience, but have a problem that requires some sort of machine learning, some sort of predictions. That being said, we are in the process of building tools to allow essentially anyone with a prediction problem to use MindsDB, at least to some extent. It might be a bit hard to actually deploy a model into production if you are a business person, but you should at least be able to use it to prototype some ideas. We're also targeting, for example, a data scientist who might just want a benchmark model to compare their models against, or might want a placeholder model somewhere and doesn't want to bother writing it. So that's kind of the audience we're going for at the moment, though the core of it is mainly developers, and that's partially a good thing as an early stage project because you get a lot of people that can accurately report issues and maybe look at the code and even get interested and
[00:07:20] Unknown:
contribute to it. And so before we get too much deeper into MindsDB itself, I know that it's using neural networks as the underlying architecture for these machine learning models. So I'm wondering if you can just give a bit of background about what that terminology means for somebody who isn't familiar with it. Okay. So
[00:07:40] Unknown:
I'll do my best to explain, and then Jorge can complete the picture. So, essentially, there are two ways to look at a neural network. The one that's most popular, at least as an introduction, comes from a sort of psychology or neuroscience perspective: people try to compare it to something that mimics the brain. So you have these "neurons", which are essentially just a couple of numbers and a bunch of ties to other neurons, and what you're essentially trying to do is build a network that sends impulses down the line to hopefully go from some inputs to some conclusions, some outputs.
I think a better, easier perspective is a more mathematical one, something that most people in ML nowadays would take. For that, I would like to start with something simple, such as a linear classifier. Right? So let's say that you have a very simple predictive problem: for a given piece of content, you know the number of viewers, and you want to estimate how much ad revenue you'll get two months down the line. And let's say, for the sake of argument, that this problem is linear.
So you take your number of viewers, multiply that by some number a, and you get your revenue. Right? Does that make sense? Okay. So that's the kind of equation which you can essentially model with a line, what we call a linear equation, because you only have one moving variable, a. And you can use a rather simple algorithm: you start by plotting a random line, then you see how well it fits all of your previous data points, and then you just adjust it until you get an a which is more or less the solution to the equation viewers times a equals an approximation of your ad revenue.
But most problems in the real world are not linear. Most problems in the real world are not even polynomial or exponential; they are quite complex and they require a very complex equation. So essentially, a neural network is just an equation with sometimes dozens of millions of parameters which we adjust in order to get a solution for a very complex problem, for example, feeding in a bunch of pixels. What sits at the core of a neural network is not so much the architecture but actually the algorithm that is used to train the network, to adjust all these parameters, which is what we'd call backpropagation.
So backpropagation is essentially a technique to adjust an equation based on the errors. You feed in a bunch of pixels, and what you want to get is some numbers that mean cat, and instead you get some numbers that mean elephant or random gibberish. Backpropagation, which is a subclass of a wider set of algorithms you might call reverse automatic differentiation, essentially allows you to compute the amount that you want to tweak the variables in that equation to hopefully get closer to the correct answer, which would be cat. And you apply this algorithm tens of thousands of times, and in the end, you manage to adjust this very complex equation with millions of variables, which we call a neural network, to give you the correct labels for whatever images you have. And the nice thing about neural networks, because they have so many variables, is that they can efficiently model a large host of problems.
So a neural network very similar to the one that might be able to classify images, let's say, could also be used to classify audio, and a slightly tweaked neural network could be used to translate human speech into text, or the opposite. Explaining neural networks in a quick way might be a bit hard, but I think I can recommend to anyone that wants to start getting into the field a set of videos by the YouTuber 3Blue1Brown, who usually does videos on topics related to mathematics. He actually did an amazing series on neural networks, and we'll hopefully put a link to that in the description.
And I really encourage anyone to check it out. He does a much better job at explaining it than I
[00:12:40] Unknown:
could. So I think there was something that wasn't covered in the question, which was the difference between neural networks and deep neural networks. As George was describing it, a neural network is just this equation that will try to model something, and it's inspired by how neurons work in the brain. The deep side of this is just that you pile up these layers and give depth to what you call your neural network, or your artificial neural network. And the deeper you go, the more complex the patterns that this neural network can understand.
And the hierarchy of patterns lets it classify or perform more complicated tasks. So, essentially, a deep neural network is just a neural network that has many different layers, and those layers allow you to solve much more complicated problems. That is one of the main differences.
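To make the "adjust the parameters until the equation fits" idea concrete, here is a minimal Python sketch that fits the one-parameter revenue example George describes using gradient descent, the same error-driven adjustment that backpropagation applies to networks with millions of parameters. The numbers are made up purely for illustration.

```python
# Fit revenue ≈ a * viewers by repeatedly nudging `a` against the error.
# This is the one-variable version of the adjustment loop described above.

viewers = [1200, 3400, 5600, 8100, 10500]
revenue = [24.0, 69.0, 115.0, 160.0, 212.0]   # hypothetical ad revenue per episode

a = 0.0              # start with a (bad) initial guess for the slope
learning_rate = 1e-9

for step in range(10_000):
    # Gradient of the mean squared error with respect to `a`
    grad = sum(2 * (a * v - r) * v for v, r in zip(viewers, revenue)) / len(viewers)
    a -= learning_rate * grad   # nudge `a` in the direction that reduces the error

print(f"fitted slope a is roughly {a:.5f}")          # revenue per viewer
print(f"predicted revenue for 7000 viewers: {a * 7000:.2f}")
```

A neural network does exactly this kind of update, just over millions of parameters at once, with backpropagation supplying the gradients.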
[00:13:40] Unknown:
And so for somebody who is getting started with MindsDB and wants to build a model based on some input data, I'm wondering if you can just talk through the overall workflow and the types of source data that you support.
[00:13:54] Unknown:
So at the moment, what we support in terms of data would be CSVs or any other character separated value files. We also support Excel files, and JSON as an array of objects with the keys representing the columns. We also support Pandas data frames, and MindsDB does expose an interface to build your own data source. The workflow, once you have your data, would essentially be just two functions. First, you have your original training data set, the data set on which you have some input features and an output feature.
So, for example, let's say that you are trying to predict the number of viewers that this podcast is going to get. And let's say you have something like the length of the podcast, the day of the week when you intend to publish it, the title of the podcast, and a score for the popularity of the guest. Right? And what you want to predict is the number of viewers. So what you'd essentially feed into MindsDB would be a CSV with previous data from your other episodes, with those four columns and the number of viewers, which is what we'd call the output feature.
You could just place that in a CSV or an Excel file or a pandas data frame, call the learn method of MindsDB, give it that file, and tell it the column that you want to predict, in this case the number of viewers. Then MindsDB would train a model and it will probably give you some insight about the data. So it might say something like: the day of the week the podcast gets published doesn't seem to be at all relevant for how many viewers you have, or the popularity of the guest is so strongly correlated with the number of viewers that you could just throw out the other columns.
But, you know, you can act on those insights that you get during training, or you can just leave your data as is. And then once you've trained the model, once the learn function has finished running, you can simply call the predict method, where you give your input features and it will give you the value that you want. So in this example I kind of made up, it would give you the number of viewers, and it would also give you a confidence score for that prediction, that is, how confident it is in that specific prediction given the parameters that you gave as an input. That's really about it. Most users of MindsDB will probably just use the learn and predict functions, and that's kind of the purpose of the library, to have the usage be very simple. There are a few more advanced parameters that someone might want to tweak for some other types of datasets, but let's not get into those right now.
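To make the two-call workflow concrete, here is a rough sketch in Python using the hypothetical podcast dataset from the example above. The class and argument names follow MindsDB's documented learn/predict interface from around the time of this episode, but treat the exact signatures as illustrative rather than authoritative, since they may differ between versions; the file and column names are invented for this example.

```python
# A sketch of the learn/predict workflow described above (names are illustrative).
import mindsdb

predictor = mindsdb.Predictor(name='podcast_viewers')

# 'podcasts.csv' is a hypothetical file with columns: length_minutes, publish_day,
# title, guest_popularity, and the output column we want to learn: viewers.
predictor.learn(from_data='podcasts.csv', to_predict='viewers')

# Ask for a prediction for a new, unseen episode.
result = predictor.predict(when={
    'length_minutes': 55,
    'publish_day': 'Monday',
    'guest_popularity': 8,
})

# The result includes the predicted value along with a confidence estimate.
print(result[0])
```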
[00:16:50] Unknown:
And one of the canonical quotes for machine learning in general is the idea of garbage in, garbage out. And so I'm wondering how much upfront cleaning and preparation is necessary for feeding the data into MindsDB to ensure that you're getting something useful out of it, or if MindsDB is able to
[00:17:11] Unknown:
handle some measure of automation as far as cleaning up outliers and data anomalies within the source inputs. Alright. So currently, what we've opted for is mainly an approach, as I mentioned before, where we warn the users about potential mistakes in their data. For example, if you have a column which has a particularly large number of outliers, we will give you a warning about it and tell you, hey, maybe you don't want to use this column for a prediction. If a particular column is very bad, for example if all the values are null, we might decide to just go ahead and not use it for the prediction. But in 99.9% of the cases, our approach is to use everything for the prediction, but warn the user, when they feed in the data and after we analyze it, about data that might pose a potential issue.
So either data which will essentially be inconsequential to the prediction, or a column which is essentially correlated with the prediction, or two columns which are essentially the same thing but rephrased in two different ways. Say a date time in ISO format and then the same date time as a timestamp: we will warn the user about it and tell them to maybe remove those columns. And after we've actually trained the model itself, we have an analysis phase where we try to figure out which columns the model actually cares about, which ones it doesn't, and which columns might harm the prediction. And, again, we try to tell the user about that. So that's kind of our approach there.
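As an illustration of the kind of pre-training warnings being described here (this is not MindsDB's actual implementation, just a hedged sketch of the general idea), a few lines of pandas can already flag mostly-null columns, outlier-heavy columns, and near-duplicate numeric columns:

```python
# Illustrative data-quality checks, assuming a hypothetical 'podcasts.csv' file.
import pandas as pd

def data_quality_warnings(df: pd.DataFrame, null_threshold=0.9, z_threshold=3.0):
    warnings = []
    for col in df.columns:
        series = df[col]
        # Warn when a column is almost entirely empty.
        if series.isna().mean() > null_threshold:
            warnings.append(f"'{col}' is mostly null; it will add little to the model")
        # Warn when a numeric column has a large share of extreme values.
        if pd.api.types.is_numeric_dtype(series) and series.std() > 0:
            z = (series - series.mean()).abs() / series.std()
            if (z > z_threshold).mean() > 0.05:
                warnings.append(f"'{col}' has an unusually high share of outliers")
    # Warn when two numeric columns look like the same information twice.
    corr = df.select_dtypes('number').corr().abs()
    for a in corr.columns:
        for b in corr.columns:
            if str(a) < str(b) and corr.loc[a, b] > 0.99:
                warnings.append(f"'{a}' and '{b}' appear to carry the same information")
    return warnings

print(data_quality_warnings(pd.read_csv('podcasts.csv')))
```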
That being said, we do some data normalization in order to determine a consistent data type for a column, for example. So, you know, you can have a few mistakes in your dataset and that won't mess up your whole prediction. But generally speaking, we are aiming more towards letting the user know about the mistakes and having them correct them. Because if we did that internally, we might end up being wrong, and we might actually correct something that we see as a mistake but that the user strongly wanted to keep in there. I think, adding to this, it's important to know that the purpose of MindsDB is really to
[00:19:43] Unknown:
minimize the risk of machine learning and AI becoming dangerous. And we see that as a process that will evolve as machine learning progresses. What we understand right now as a risk is people trusting machine learning blindly. So if you train a model, as you said, trash in, trash out, we want to inform you where we identify potential risks. And we also want to make sure that people that are implementing machine learning models are aware of the liabilities of what they're dealing with. If you are implementing a machine learning model to predict whether someone is going to develop, say, a chronic condition in health care, you want to understand why and when the model may fail, and when your data has biases or problems that may be significant to the predictions that you're producing.
As well, when you get a prediction, you want to understand how confident you can be about it, and these are the elements that George was just describing as to what we provide for users other than just a
[00:20:48] Unknown:
prediction. Yeah. And I think the fact that you have built in explainability as a first class concern in MindsDB, to ensure that when you do get a prediction you're able to understand why it reached that particular conclusion, is definitely valuable. It can be very useful in terms of regulatory environments, where I know that GDPR, for instance, has a line item that says that any prediction or result that is produced by machine learning or artificial intelligence needs to be explainable. So having that built into MindsDB is useful in that context, but also for somebody who's trying to get into machine learning and using MindsDB as an entry point, being able to use that as a way to backtrack and fill in their understanding of how these systems work. Just in general, having that explainability is valuable in a lot of different contexts. So I'm wondering how you have approached that in MindsDB
[00:21:48] Unknown:
and some of the ways that you have found it to be useful in your own work. Yeah. I think that right now, we're going through the stage of quality. So we are trying to analyze the data you give us along different quality dimensions. As George was describing, we analyze many, many different factors, from outliers to potential indicators of biases. And then once you train the model, we analyze the model for quality, to understand how good the model is at predicting what you want to predict. That is the stage we are at right now, and we want to nail this very well because we believe it's the entry point to explainability. Once we are able to produce reports about the quality of the data as well as the quality of the predictions for the people that are using MindsDB, then we can move to the second stage, which is trying to understand what drives a prediction, and why not another prediction.
And we believe that explainability is a road that will lead to machine learning that can itself explain why it behaves the way that it does. But as it stands now, for most problems, just understanding the quality of the data and the quality of the predictions is a good start, a good start for anyone that wants to implement machine learning to be confident about whether they can use this in a real world scenario
[00:23:12] Unknown:
or when they cannot. Yeah. And I think that too helps address some of the hesitance or nervousness that people might have about bringing machine learning into an existing application, because they might not have a good handle on why a decision is made. Whereas when you're just dealing with procedural logic, you can follow the path and reassure yourself that you know what's happening. But if you're handing it over to a complex mathematical model and not being able to trace it back to the input data and the overall flow of execution, it might just lead somebody down the path of saying, I'll just deal with more procedural logic and make it ever more complex, whereas they might be able to reduce a significant amount of effort and code in their application by using some of these advanced mathematical techniques,
[00:24:03] Unknown:
but that they might be avoiding just because they don't have enough confidence in how it actually functions. Yeah. That's right, Tobias. And there's a second thing about what we do that has to do with the risk of machine learning. The reason why we want to simplify it to the point that anyone that knows the very basics of programming can produce a machine learning model is that usually the domain experts are not data scientists. Like, if you are someone that has been working with oncology data for some time, you may be more of an oncologist than a programmer. We want to provide you the toolset to get some insights as to what machine learning can do for you, but you are still the domain expert. And one of the biggest risks today is that true machine learning experts cannot also bear the responsibility of being oncologists or manufacturing engineers.
And there is a risk of all of that domain expertise being lost if we delegate the predictive capabilities and expertise just to the machine learning experts. Because, really, it takes years to build the domain expertise to understand when things are fishy or not. And that's why we are going back to the data quality and informing people about where something is not looking correct. It usually takes a domain expert to see this and be like, okay, this makes sense with my understanding of the problem.
[00:25:25] Unknown:
And so as far as the overall development cycle for a machine learning project in an organization that does have data scientists and machine learning engineers on staff, but also has a certain amount of domain expertise required, does MindsDB then allow you to change the overall cycle, where rather than having a requirements gathering phase that you pass off to the data scientists, who then go and explore the data and make their best effort to build some sort of a model or prediction based on those inputs.
You flip it so that you have the domain expert determine the relevant data, feed that into MindsDB
[00:26:05] Unknown:
to generate an initial model, and then pass that off to the data scientists to explore and refine before you go to production? Oh, yeah. So that would be one of the hopes exactly; I was mentioning the quick prototyping thing earlier. In a large organization where you do have access to a data science team, you might still have too many problems for them to handle all of them. So MindsDB can serve as a very good filter, in terms of you as the domain expert being able to figure out which of your datasets are even worth handing over to the data science team in order to make some predictions based on them. And
[00:26:47] Unknown:
so once you have generated a model using MindsDB, I'm wondering what the output format looks like and what the next steps are as far as being able to use it either in a production context or being able to use the generated model to run the predictions either directly from MindsDB or as part of another script or application?
[00:27:07] Unknown:
So MindsDB, as it stands today, is a set of different projects. One is what we call MindsDB Native, which is a pip module that you can install, and then you can train and do everything locally. But you can also save and export these predictive models and share them with other people, as well as deploy them to what we call MindsDB Server. And then once you have it in the MindsDB Server, they can be accessed through the API. We have JavaScript SDKs, and we also have a graphical user interface for those that don't even know how to program but want to experiment with the models that someone has built, either with MindsDB Native or directly through the API on the server. We want to enable people, regardless of their technical expertise, to be able to either experiment with the models or to deploy and actually use them in production. So MindsDB as it stands now is a combination of a tool that you can run on your computer, a Python module, and a Python server that you can use to share and deploy models, as well as to expose them through the graphical user interface that we have for people that don't even use Python or don't know how to program. And in terms of how MindsDB
[00:28:23] Unknown:
is implemented, I'm wondering if you can talk through the overall architecture and design of the application and talk through how it has evolved since you first began working on it. Essentially, we provided a tool that
[00:28:36] Unknown:
was kind of a duct tape prototype, and people liked the idea of being able to produce a machine learning model with just one line of code. This was due to having a lot of friends developing projects that required some sort of machine learning, but the machine learning part was just a cog within their machine. So we produced a simple Python library that we've been evolving to the point that it is at now. And George can describe a little bit more about how that evolution has happened.
[00:29:07] Unknown:
Yeah. So, essentially, what we had as a prototype was a tool that was mainly focused on the prediction and the predictive model behind it. And what we really wanted, as we said before, was to be able to focus on the analytics beyond the prediction. There are a lot of smart people working on the how of making predictions right now, and what we really want to do is take the best algorithms out there, collect them, and focus on the why with MindsDB. So why the prediction came out that way, is your dataset good enough to make a prediction to begin with, etcetera, all the things we talked about. In order to do this, what we've done is make MindsDB more and more modular.
So the main MindsDB repository right now is essentially 90% focused on processing the data, doing analytics on that data, and then doing analytics on the model itself. And the way we think about the machine learning part is that we are able to plug in a machine learning back end. What we're working with in this case is a machine learning framework from Uber called Ludwig and our own machine learning framework called Lightwood. The former is based on TensorFlow and the latter is based on PyTorch. But we also have another machine learning model in the works; it's in the beginning phases, developed by a specific university, but I won't name names right now because it's not finalized.
And, essentially, that's one of the things that this modularity allows us to do. Another thing that we've decoupled from MindsDB is the GUI. So we have a separate GUI that's able to cleanly show all of those insights to someone that might not fancy reading them out of the command line. We also have a server component, which essentially allows you to do predictions remotely. So if you want to, say, run off a Raspberry Pi, and you can't really run the learning algorithms there but you can collect the data with the sensors there, you can just send that to the server, which will then feed it into MindsDB. So that's kind of the way the code has evolved: we've focused a lot more on the explainability part and we've tried making things more modular.
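To illustrate the modular design being described, here is a minimal, hypothetical sketch of what a pluggable model back end interface could look like. The class and method names are invented for illustration and are not MindsDB's real internals; the point is only that the data analysis layer stays fixed while the model back end (Ludwig, Lightwood, or something else) can be swapped behind a small common interface.

```python
# Hypothetical sketch of a pluggable model back end; names are illustrative only.
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    @abstractmethod
    def train(self, df, target_column: str, column_metadata: dict):
        """Fit a model given the data and the data-type analysis."""

    @abstractmethod
    def predict(self, when: dict) -> dict:
        """Return a prediction plus any backend-specific extras."""

class LightwoodLikeBackend(ModelBackend):
    def train(self, df, target_column, column_metadata):
        ...  # build and fit a PyTorch-based model here

    def predict(self, when):
        ...  # run inference, e.g. return {'prediction': ..., 'confidence': ...}

def train_with_backend(backend: ModelBackend, df, target_column, column_metadata):
    # The analysis layer decides *what* to feed the back end; the back end decides *how*.
    backend.train(df, target_column, column_metadata)
    return backend
```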
Also along the way, I think we've cleaned up the code a lot. I've been trying to encourage as many community contributions as possible. And even when people don't contribute code, I find that clean code allows people to more easily trace back their issues to maybe even a specific function or, you know, line of code that is actually causing it rather than just copy pasting the stack trace. And I think it's interesting that you
[00:32:12] Unknown:
have, as you said before, started off using Ludwig, which relies on TensorFlow, and now you're building your own back end using PyTorch. I'm wondering what you have found to be some of the comparative differences in your experience of working with those two back ends. So, we actually started off
[00:32:30] Unknown:
with our own back end, which was mainly built by Jorge before I joined. We switched to introducing Ludwig when we made the decision to separate out the machine learning back end. Ludwig came along at just about that time, so it was a perfect experiment for us to see if we could actually easily integrate MindsDB with another generic machine learning library and use that as the model back end, as we'd call it. And,
[00:33:01] Unknown:
reinforcing that point, I think that one of the elements of machine learning today is that it moves really fast. You know, PyTorch did not even exist a few years ago, and now there's a growing community around it. And we believe that there is, and should be, a Ludwig equivalent for PyTorch, which is what we built in Lightwood. But nonetheless, this will continue to evolve, and there are so many other frameworks, like Julia and you name it. What we want to prevail is, for those that want to use machine learning in production or even for experimentation, for them to not have to worry too much about what is the latest of the latest. We will make sure that we incorporate that in our modular architecture, so that you can rely on the fact that if tomorrow there's something better than Ludwig or Lightwood, whatever you call it, we will try to incorporate that into the machine learning back end. And therefore, you can rest assured that you will be using the state of the art in terms of machine learning frameworks.
Nonetheless, explainability and the other elements of machine learning that are important, and that we think are crucial to solve right now, we will continue to provide through MindsDB. So as the machine learning community and frameworks continue to grow, we will continue to support them, and we will continue to put effort into making those explainable and reliable and easy to understand and use by anyone.
[00:34:35] Unknown:
And as far as the actual algorithms that you're using within MindsDB, are you relying largely on what's built into the different back end frameworks, or are you adding your own custom built algorithms? And then as far as determining which one to apply to a given problem set, I'm wondering what the overall search function looks like as you're
[00:34:56] Unknown:
building and training the models. Okay. So what we essentially do with MindsDB is a lot of looking at the data and, based on that, trying to figure out a set of instructions for the machine learning back end itself. What we have right now is, essentially, we determine a bunch of data types and data subtypes, and then we also have some other insights based on the number of data points in a column and the quality of those data points, which we can, and in certain cases do, feed into the model back end in order to allow it to build a more well suited algorithm for the specific dataset. And as far as your experience
[00:35:46] Unknown:
of building and expanding on MindsDB, I'm curious what were some of the assumptions that you had going into the project
[00:35:53] Unknown:
and how those assumptions have been challenged and updated as you continued to gain users and build features? Well, I think the big assumption that we're betting on right now is that people will actually use the explainability part. We believe that explainability is important and crucial to be able to understand these models. We are yet to validate whether the approach that we're taking to it is the correct one. So that right now is an assumption. We have some validation, but we will continue to validate this throughout. It's not that people don't want to trust, and have data to trust, whatever they're building their machine learning capabilities on.
We just want to make sure that the approach that we're taking, analyzing quality first and then taking it from there, is the right one. Before, the big assumption that we had was: do people really want a machine learning framework where they can simply, with one line of code, produce a predictive model, and then with another line of code use it, and just be blind about how it happens on the back end? And we've found people with two different perspectives. You have the hardcore data scientist that wants to dive in as much as he or she can, and for those, MindsDB is still very opaque.
And I think that the explainability part may add value for them. But we also understand that there's another 90% of developers out there that really just want to get a solution built into the product and continue to the next task. And those have found MindsDB to have the right approach: the simplest API possible, and then if you want to dive deeper, you can, which is what we're
[00:37:33] Unknown:
going for at the moment. And in its current state, what are the primary limitations of MindsDB and the situations where it's necessary to pass a given task on to a data scientist or machine learning engineer for building a model versus, being able to do any of the preliminary work within MindsDB itself?
[00:37:53] Unknown:
So I think that the main limitation of MindsDB is the main limitation of really any machine learning model, which is to say that it only finds correlations; it can't really think about causation. That's mainly a case where sometimes you need to pass the task to a domain expert to be able to actually tell you if the model makes sense, or even whether trying to build a model in that situation would make sense. In terms of limitations for the problems we are trying to solve specifically, at the moment we still aren't able to really handle audio or video inputs.
So that would be one of them where, at the moment, you would still need to build your own model; MindsDB wouldn't be able to do that, but we are working on it. Another one of the limitations right now is very large datasets. Once you get into the hundreds of gigabytes or dozens of terabytes, MindsDB might not be ideal. But, again, we are actively working on ways to have MindsDB work on larger and larger datasets and to have it be distributed amongst multiple machines. So that's one of our focus points in the near future and has been for a little while. And as far as
[00:39:18] Unknown:
the overall work of building and growing the MindsDB project and the user base, I'm wondering what have been some of the most challenging, complex, or unexpected
[00:39:27] Unknown:
aspects of doing that work. I think for me personally, one of the most surprising aspects has been how much of a challenge it is to actually get it working on as many machines as possible, and trying to work with users to debug the various issues they have. Since it's an open source project, there's often a lot more feedback than with a closed source one, and the feedback is often of lower quality. So there's definitely a different development pathology there. That's, I think, honestly been the main one for me. Yeah. I agree with George. It is fascinating how, once you try to release something for as many platforms and as many people's computer configurations as possible,
[00:40:10] Unknown:
then we spend quite a significant amount of time trying to solve bugs where, I don't know, for some Windows user one of the dependencies doesn't work, or things like that, that consume just as much time as any of the other issues. But we put a lot of effort into making sure that those issues are solved and it continues to be available to most people. I think, from my perspective as well, when we decided to incorporate Ludwig, we got sold on the idea that they were quite a bit ahead of the back end that we were building ourselves. And in reality, you know, it's another open source project, and they also have a lot of issues.
And even though we took a leap into separating the back ends, and that's the future of MindsDB, a lot of what we thought worked also didn't work right off the bat. And it's the same thing for us: a lot of people may use MindsDB and they may find things that they expect to work but don't. So the fact that MindsDB relies on other open source projects on the back end also means that we rely on the quality of development of other people, which is outside of our control.
[00:41:23] Unknown:
A last one which I would like to add, which I actually found quite interesting because I worked for a long time as what I would describe as a data engineer, is getting performance up to par. Because it's definitely harder to think about performance, and even to test performance, when your dataset is literally anything. Right? Your inputs could be anything, your outputs could be anything. And that's quite an interesting challenge, which we are actively working on and getting better and better at. And we've actually
[00:42:07] Unknown:
sustainability of the work that you're doing and trying to make sure that you are able to, you know, make a living from the work that you're doing, and also make sure that you are able to incorporate
[00:42:20] Unknown:
community feedback? Yeah. It's the million dollar question. I think that there are successful projects out there that are open source, and those people have invented a formula that we don't have to reinvent. So if you go to mindsdb.com, it looks more like a consulting site, and people come to us with all kinds of crazy problems. Some of those problems may work right out of the box with MindsDB, and some of them may require some tweaking. And even for the ones that work out of the box, there's always the work of really understanding what it is that people want. Right now, our path to sustainability is helping people with any type of predictive problem they have, and trying to use the machine learning capabilities that we've built through MindsDB. If MindsDB doesn't work, then we improve MindsDB with that understanding.
But nonetheless, it is a win-win solution where we bring in revenue from people with real world problems, and we continue to improve MindsDB, so the learnings from that continue to go back to the community. And the thing is, the more you do machine learning, the more complicated your problems get, and the more you rely on the value of machine learning in your organization. So it is easy for companies to start with one simple question that they want to answer, and then once they answer that one, they want to answer another one and another one. We built the system open source because we really want to democratize machine learning, but we also understand that there are many other companies that are willing to pay for a service that uses this open source infrastructure that we've built. And there are many projects out there, like Red Hat and Mongo, you name it. A successful open source project always comes with a side of really understanding the customer needs and making sure that the learnings that we have from those engagements
[00:44:21] Unknown:
get back to the community. And I think, to address the community side of the question as well, one of the ways that we have tried to differentiate even from other open source projects is to make a lot of the development open. Obviously, you can't really discuss everything in public, and doing that would be pointless. A lot of projects have most of the discussions going on in a mailing list, but who's going to read the mailing list if they're not working on the project? We've tried to have all of our goals and all of our development targets open as issues on GitHub.
And whenever we make any changes to those, we talk about them in the actual issues, so there are discussions going on on GitHub. And if someone wants to engage with the project, or wants to figure out how a feature is coming along or what the quality of a certain feature is at the moment, they can always track that on GitHub. Because there are a lot of open source projects, for example Aerospike, which, don't get me wrong, is an amazing product, but the only person that can work on Aerospike is an Aerospike developer, because neither the code nor the development process is designed with outsiders in mind. The way we are trying to design MindsDB and the development process as a whole is in such a way as to make it very easy for people from the community both to contribute and to figure out what's going on. And as far as your experience
[00:45:59] Unknown:
of working with the community and with the people who have reached out to you for support, and just seeing the types of ways that MindsDB is being used, I'm wondering what are some of the more interesting or innovative ways that it's being leveraged, or things that you found surprising, and also just overall lessons that you have learned in your work of building and maintaining MindsDB that have been particularly unexpected or useful? Okay. I will start. So for me, it's been really two things. One has been
[00:46:29] Unknown:
how many simple datasets are out there that people want to run predictions on. The kind of datasets that one could easily run through a lot of generic models in maybe 10 or 20 lines of code, that people maybe haven't tried playing with before just because even the simple solutions out there, like scikit-learn, were maybe not friendly enough. And the other one has been the number of people that had interesting side projects that they wanted to run with MindsDB. For example, a software developer that I met randomly at a conference had this moonlighting job or hobby as a DJ, and he actually wanted to try and use MindsDB to figure out playlists for him. So I've seen a few very interesting projects which are definitely not in the category we'll focus on in terms of business, but I would really hope we can also help those people: the people that just want to do something fun and have some data, but maybe they're not quite up to par with the latest in machine learning. Yeah. I think that the most surprising one we've
[00:47:45] Unknown:
gotten, from my perspective, is someone that reached out to us who wanted to predict lotto numbers, as in the lottery, and he had collected incredible amounts of data from all the different lotteries around the world. It was, to me, surprising to see how much effort this person had put into the data collection. I don't know what the results were as to how accurate it was, but what I did find interesting is that there are people out there with all sorts of ideas, and it is just our mission to provide them with a toolset, no matter how crazy their idea is, so that they can make a prediction even if they don't know machine learning or don't have the time to actually build the models, or, as George mentioned, to just have a baseline benchmark.
[00:48:34] Unknown:
And so looking forward, I know that you said that some of the immediate term work is going to be focused on, adding more support for the modularity of MindsDB to be able to support different back ends. But I'm wondering in the medium to long term, what you have planned for the future of the project?
[00:48:51] Unknown:
Yeah. I think that the medium to long term of the project goes back to the purpose of MindsDB itself, which is targeting the problem of machine learning or AI becoming dangerous. And the danger that we see right now is, again, that people will blindly trust what these machine learning models are outputting, or that the domain experts get replaced. So the medium and long term planning of MindsDB will always revolve around solving the actual issues that we identify at that given time as dangerous in machine learning. In this case, it's explainability.
The second one is we don't want the domain experts to be replaced. We actually believe that the future of machine learning or AI is one in which humans and machines will continue to collaborate, and machine learning is just a tool to augment the decision capacities of the decision makers. So we are aiming to always produce these tools such that, instead of replacing a human somewhere, they're augmenting the capacity of that human. In this case, it's augmenting the capacity of the domain experts, and also making it reliable. And we will continue to make sure that in the long term, we tackle whatever problem we identify as dangerous at that time when it comes to machine learning.
[00:50:24] Unknown:
Does that answer your question? And are there any other aspects of the work that you're doing with MindsDB or the current state of machine learning that we didn't discuss yet that you'd like to cover before we close out the show? So in terms of what we are doing with
[00:50:41] Unknown:
MindsDB... wait. Let me think on this for a second.
[00:50:49] Unknown:
If the answer is no, that's fine too. I'd just like to give you guys the opportunity to call out anything else that we didn't cover that I didn't think to ask about. Yeah. I think that,
[00:50:58] Unknown:
what we want to generate is an open discussion about how machine learning has to be designed to be safe for humans and humanity in general. This is just the start of a project, and it has that general objective. But we really would like people to come back with this philosophy: whenever we design machine learning, we should design it to augment humans and to make sure that it is always with the best intention for humanity. For us, MindsDB is just a platform for this conversation to happen, and for companies to make sure that when they're implementing machine learning, it's reliable and trustworthy, and there are always humans at the end making those decisions.
So we want to make sure that as we continue to talk about MindsDB, we can continue to talk about the general objective of what we're building, rather than it being just a simple Python framework that makes things easy. Easy is important because it means domain experts can use it, and explainable because that's the problem that we see right now. But in general, we would like to invite people, when they implement and build either competitors of MindsDB or the next generation of machine learning frameworks, to think of these crucial problems, which are: how do you produce the most value to humans rather than a threat to humans, which, you know, you can argue machine learning can potentially be, in terms of jobs, and later, when it becomes smarter than us.
Unless it's embedded in collaborating with us, it can be potentially dangerous.
[00:52:48] Unknown:
And I think what I would like to add to that is that even though the open source part might seem small to begin with, I think this whole trend, which is definitely not just MindsDB, of making machine learning open source is important. Because machine learning will get to be one of those things which is present in all aspects of life. And it might seem like a small thing, given that most people that want to use machine learning out of the box will do it through a service like AWS or Google or Azure.
But I think that in the long run, it matters a lot. And just to give a historical example of that, think back to the days when most compilers were closed source, and then think of GCC, the GNU compiler essentially, and the impact that Stallman and his co-developers had in the long run by developing that, making it open source, and releasing it under a GPL license, which means it will keep being open source and modifications to it will keep being open source. So I think that even if in the short run this sort of open source model looks like it doesn't necessarily matter that much, pulling the community in an open source direction will, in the long run, make a big difference in terms of who controls this very powerful tool. Right? Like, do you go to an open source solution for your machine learning needs, or do you go to the Google God and hope that they have the right answer?
And I think one of the big parts that we want to focus on in the long run is making sure that anyone is able to use MindsDB, that we don't just have support for the best hardware, that we don't have support just for enterprise use cases, so that, you know, a 14 year old that has an interesting problem can pick it up and run it on his old laptop
[00:55:04] Unknown:
and get something interesting out of it. Alright. And for anybody who wants to get in touch with either of you, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose the QuietComfort 25 headphones from Bose. I just picked up a set recently because they were on a pretty significant sale, and so far, I've been happy with them. The noise canceling is quite good, and they are a lot more comfortable than the last set of headphones I was using. So, if you're in the market for a new set of headphones, definitely worth checking them out. And so with that, I'll pass it to you, George. Do you have anything this week? There's a second project that we're working on, which is Lightwood, which is our equivalent to Ludwig
[00:55:44] Unknown:
that you can check out if you guys wanna try it. As for something completely off topic, I'll have to think about it. Let me get Jorge to answer this question and I'll do some picks. Okay. So one thing I would definitely like to recommend to everyone, and it could be considered loosely related
[00:56:01] Unknown:
to the discussion, is the neuroscience course from MIT, which is completely open and you can find it on their website or on YouTube. I've recently started going through it and have already gone through, like, 10 lectures in 2 days. It's extremely interesting, and I think anyone interested in brains or machine learning or really just how the world works should check it out. On a personal note, I always like to shout out my blog, which is blog.cerebralab.com, where I write articles. Some of them may be a bit inflammatory, so, you know, those are definitely my personal opinions and not necessarily related to MindsDB.
[00:56:44] Unknown:
Was there anything else that you wanted to call out, Jorge? Yeah. I'm definitely getting the headphones that you mentioned. But I wanted to talk about documentation. For MindsDB, we use Docusaurus, but for the Zoom project, we decided to go for MkDocs. And if anyone is building a project that requires documentation, if you use MkDocs with the Material templates, from Google's Material Design, then you probably get something very, very nice straight out of the box. That usually took us quite a while to produce with Docusaurus. So give it a try with MkDocs. Alright. Well, thank you both for taking the time today to join me and discuss the work that you've been doing with MindsDB
[00:57:28] Unknown:
and your efforts to make machine learning more accessible to more people. I definitely appreciate that, and it's something I'll be planning to toy around with on my own. So I appreciate all of your efforts, and I hope you enjoy the rest of your day. Thank you. Thank you, and thanks for having us.
Introduction to MindsDB with George Hosu and Jorge Torres
What is MindsDB?
Target Users and Use Cases
Workflow and Supported Data Types
Data Cleaning and Preparation
Explainability in Machine Learning
Implementation and Architecture
Assumptions and Challenges
Interesting Use Cases and Future Plans
Closing Remarks and Picks