Summary
Data mining and visualization are important skills to have in the modern era, regardless of your job responsibilities. In order to make it easier to learn and use these techniques and technologies Blaž Zupan and Janez Demšar, along with many others, have created Orange. In this episode they explain how they built a visual programming interface for creating data analysis and machine learning workflows to simplify the work of gaining insights from the myriad data sources that are available. They discuss the history of the project, how it is built, the challenges that they have faced, and how they plan on growing and improving it in the future.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 Gbit network in all of their datacenters.
- If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Blaž Zupan and Janez Demšar about Orange, a toolbox for interactive machine learning and data visualization in Python.
Interview
- Introductions
- How did you get introduced to Python?
- What is Orange and what was your motivation for building it?
- Who is the target audience for this project?
- How is the graphical interface implemented and what kinds of workflows can be implemented with the visual components?
- What are some of the most notable or interesting widgets that are available in the catalog?
- What are the limitations of the graphical interface and what options do users have when they reach those limits?
- What have been some of the most challenging aspects of building and maintaining Orange?
- What are some of the most common difficulties that you have seen when users are just getting started with data analysis and machine learning, and how does Orange help overcome those gaps in understanding?
- What are some of the most interesting or innovative uses of Orange that you are aware of?
- What are some of the projects or technologies that you consider to be your competition?
- Under what circumstances would you advise against using Orange?
- What are some widgets that you would like to see in future versions?
- What do you have planned for future releases of Orange?
Keep In Touch
- Blaž
- University Bio
- @bzupan on Twitter
- BlazZupan on GitHub
- Google Scholar
- Janez
- University Bio
- @jademsar on Twitter
- janezd on GitHub
- Google Scholar
Picks
- Tobias
- Blaž
- Janez
Links
- University of Ljubljana
- Data Explorer
- Silicon Graphics
- Visual Programming
- PyQt
- Linear Regression
- t-SNE
- K-Means
- Tcl/Tk
- NumPy
- Scikit-Learn
- SciPy
- Textable.io
- RapidMiner
- Single Cell Genomics
- Transfer Learning
- Orange Video Tutorials
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app. And now you can deliver your work to your users even faster with the newly upgraded 200 gigabit network in all of their data centers. If you're tired of cobbling together your deployment pipeline, then it's time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD, you get complete visibility into the life cycle of your software from one location. To download it now, go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind. You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email me at hosts@podcastinit.com.
To help other people find the show, please leave a review on iTunes or Google Play Music. Tell your friends and coworkers and share it on social media. Your host as usual is Tobias Macey. And today, I'm interviewing Blaž Zupan and Janez Demšar about Orange, a toolbox for interactive machine learning and data visualization in Python.
[00:01:32] Unknown:
Janez, could you start by introducing yourself? I'm Janez Demšar. I'm a professor at the University of Ljubljana. I started with AI and machine learning. Now I'm mostly working with first-year students; I teach them basic programming in Python. And with children, so that's my second hobby. I'm teaching children about computational thinking, programming, using microcontrollers, stuff like this.
[00:01:59] Unknown:
And, Blaž, how about you? Yeah. I'm also with the University of Ljubljana and also with Baylor College of Medicine. I do data mining and machine learning, and I also teach here at the university, but we also run a lot of courses for people outside universities, like in companies and high schools and so on. So we like to teach. Lately I'm all on Orange, so I spend all my time on Orange development and everything around it.
[00:02:33] Unknown:
And, Janez, do you remember how you first got introduced to Python?
[00:02:36] Unknown:
Yeah. It was actually related to Orange. Orange at that time was a collection of C++ classes. I started writing them as a student, I guess, and nobody could use them. There was no documentation. So it was just a library of C++ classes. And so I packed them into kind of command line utilities, which had a larger and larger number of options. You may think that FFmpeg is a complicated program, but Orange had really complicated command lines. And at a certain point, together, we decided we had to invent a language in which you could program processes in machine learning. And so somebody told me about Python, and I admired it as a glue language, essentially.
So it was way before NumPy and before Cython. If you wanted to do something quick in Python, you had to write the slow part in C++. There was no way around it at that time, I guess. So I used Python mostly as a glue language. I didn't appreciate how nice the language is by itself. So that was my first introduction to Python.
[00:03:41] Unknown:
And, Blaž, do you remember how you got introduced to Python?
[00:03:47] Unknown:
That would be a simple question: Janez pushed me. But the more complicated answer, I think, is that we were visiting Baylor College in Houston, and there was a bookstore just close to our hotel, and we went there. And we bought 2 books on Python on Friday. And then we were thinking what to do with it, and started to program. And, actually, he made an interface from his C++ machine learning routines in Python, and I think this was the start. And it was somewhere around '98 or '99, if I remember correctly.
[00:04:23] Unknown:
Yeah. And the story is interesting because we bought 2 out of the 3 books they had about Python. They had just 3. And I guess this was Barnes and Noble, a huge bookstore. They had just 3 books about Python at that time. Now compare it to what you have today.
[00:04:46] Unknown:
Yeah. There's definitely quite the catalog of books about Python at this point, spanning every discipline that you could imagine. And you mentioned briefly that Orange has had a very long history, but I'm wondering if you can just take some time to describe what the project is and what was your motivation for building it.
[00:04:59] Unknown:
So I still need to dig into a bit of history. I think in 1997, I met Donald Michie, when he visited the institute I was employed in. Donald Michie is one of the founders of artificial intelligence. And we talked about how it would be great to have something for machine learning like what R does. But it should be on the web. It should have a fancy interface. It should have visualization. And then Donald Michie, myself, and Nada Lavrač organized a meeting in Bled, which is, by the way, a beautiful lake resort in Slovenia.
And there were some really great names coming, like Ross Quinlan and Tom Mitchell, both of them founders of machine learning. The meeting was about building a tool for machine learning. That was in 1997. Also invited was IBM, because we thought that Java would be the programming language. And then it was interesting that nothing came out of the meeting. I think the error was that only the seniors were there, and Janez and myself were actually the only juniors. And then Janez walked into my office, like, 2 weeks later and he said, well, we should just do it. Right? And this is how it started.
So the motivation was actually to have, at that time, a library, like a Swiss knife for machine learning. And we hadn't thought about the interface, we hadn't thought about visualization, just a plain simple library that would be very simple to use.
[00:06:27] Unknown:
And given how long the project has been around for, what have been some of the challenges of evolving it as new technologies and new machine learning techniques have evolved and been introduced?
[00:06:44] Unknown:
So at a certain point, we stopped developing and implementing our own algorithms. Instead, we just used what's in scikit-learn and other libraries, and we concentrated more on visualization aspects. So we want to have algorithms that can give you nice visualizations of patterns in your data. So that's what changed. Also, we don't go much into, like, deep neural networks and those methods that cannot give you any real insight into your data. They can give you just predictions. They're good for some things. Of course, they are what moves AI now, but we are not really in this business a lot.
[00:07:23] Unknown:
And who is the target audience for Orange?
[00:07:26] Unknown:
Yeah. In the beginning, actually, the target audience was researchers in machine learning because we were, you know, researchers. And actually not just researchers but educators, because we still teach machine learning and data science. And we assumed that the audience enjoyed programming and would learn Python and Orange easily. Right? But then, actually, around 2000, we worked with medical doctors and mined data from clinical medicine, and we noticed there was little excitement. You know, if I had a physician and I said, oh, you know, my area under the ROC curve with cross-validation is 0.9.
Right? There would be no excitement about that. Even if we showed them what the classification trees or coefficients in logistic regression look like, there would be no excitement at all. And then we figured out at that time that what they really wanted was to visualize the data. They would like to talk with us, but through visualization of data, models, interactions. For instance, in 2000, there was no tool in which you could show a classification tree, and then they would say, oh, show me the particular branch. What are the patients that are in this particular branch? Right? You would need to go into the program and you would need to script, and this is not what we wanted. And we actually figured out that visual interactions are so much more powerful.
That means visualizations. And then not only visualizations of the data, but something that you can browse and touch and select and then mine further. And that actually goes back to 1993. I had a class in high performance computing with a tool called Data Explorer, an IBM tool, and something similar was also made by Silicon Graphics, and it used visual programming. Visual programming was, like, you could design the workflow, how you would analyze an image and how you would actually construct an image, and then change the parameters easily and so on. This was fantastic. In 1993, this tool existed. It doesn't exist anymore. And then in 2003, we said, okay.
That's the proper way to address it. So we need a visual programming framework where you construct data analysis workflows very easily, just by putting the blocks together. We call them widgets. And we need heavy visualization. Heavy means that everything needs to be visualized, not only data, but the models. And we need interaction. So in any kind of visualization, you can touch any element and then you can find out what particular data is actually associated with that. So that means, right, that our audience changed from machine learning researchers and teachers and university students to just about anybody that has data.
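Editor's sketch: the idea that every element of a visualization can be traced back to the data behind it can be illustrated in a few lines of plain Python. This is not Orange code; the function names `histogram` and `rows_behind`, and the patient records, are made up for illustration. Each bar simply keeps the indices of the rows it represents, so "touching" a bar yields a subset that could be mined further.

```python
from collections import defaultdict


def histogram(rows, column, bin_width):
    """Group row indices by bin, so every bar remembers the rows behind it."""
    bars = defaultdict(list)
    for index, row in enumerate(rows):
        bars[int(row[column] // bin_width)].append(index)
    return dict(bars)


def rows_behind(rows, bars, bar):
    """Return the rows represented by one bar, e.g. after the user clicks it."""
    return [rows[index] for index in bars.get(bar, [])]


# Made-up records, purely for illustration.
patients = [
    {"age": 34, "outcome": "recovered"},
    {"age": 67, "outcome": "readmitted"},
    {"age": 71, "outcome": "recovered"},
    {"age": 45, "outcome": "recovered"},
]

bars = histogram(patients, "age", bin_width=20)
selected = rows_behind(patients, bars, bar=3)  # the bar covering ages 60 to 79
print(selected)
```

The same bookkeeping would apply to a tree branch or a scatter plot region: the visual element stores references to rows, not just an aggregate count.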
[00:10:25] Unknown:
And you mentioned that there's a strong focus on the graphical capabilities of Orange. So I'm wondering how that interface is implemented and the kinds of workflows that can be built with the different components and widgets that you have in the library.
[00:10:41] Unknown:
So, technically, we use PyQt, which is a great tool for making GUIs. And what the user can do is put together components of this workflow. Like, I'm reading the data. I'm showing the data table. I'm showing the scatter plot. Now the data points I select in the scatter plot should go into, I don't know, clustering. I'm gonna use k-means with such and such parameters to choose the optimal k, and the clustering we have should go into those and those widgets. So the user can compose a schema like this, which is essentially programming, except that you don't write any textual code. So it's much easier for new users who maybe don't know any programming and who also may not know much about data analysis, because it's so intuitive. It's like a child playing with Lego bricks, just putting stuff together, and the user sees what works and what doesn't. So that's the way you use Orange.
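Editor's sketch: for readers who think in code, the workflow described above (load data, select points in a scatter plot, send the selection to k-means) corresponds roughly to the following script. It is plain Python rather than Orange, the data points are invented, the `select` and `k_means` helpers are hypothetical, and seeding the centroids with the first k points is a simplifying assumption. Choosing the best k is left out.

```python
from math import dist


def select(points, x_min, x_max):
    """Mimic a scatter plot selection: keep points whose x lies in a range."""
    return [p for p in points if x_min <= p[0] <= x_max]


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


data = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9),
        (0.1, 0.3), (4.8, 5.2), (20.0, 20.0)]

selected = select(data, 0.0, 10.0)          # the "scatter plot selection" step
labels, centroids = k_means(selected, k=2)  # the "clustering" step
print(labels, centroids)
```

Each function plays the role of one widget, and the variables passed between them play the role of the connections drawn on the canvas.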
[00:11:42] Unknown:
So what are some of the most notable or interesting widgets and capabilities that you have in the current incarnation of the catalog for the project?
[00:11:51] Unknown:
Although there are many, and some of them are fancy in interactions, like clustering and t-SNE and so on, I really like a very simple one, and this is called the Paint Data widget. It's a widget where I can paint the data. I can take a brush or a pen, and I can paint 2-dimensional data. And that works beautifully when we teach machine learning. So I can paint the data to show that linear regression with polynomial expansion overfits. I can paint the data for clustering.
I can show what regularization is like. I can show, actually, how to trick k-means such that it doesn't guess the right clusters, or maybe that it does. Right? So I like that widget a lot, but it really shines in combination with the other ones. Right? So I would have several windows open. In one, I would paint the data, and in the other one, I would show the results of the clustering. And that would all happen instantaneously. So I have another one. I thought that Janez was gonna mention nomograms or something here.
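Editor's sketch: the overfitting demonstration mentioned above can be reproduced without any GUI. The "painted" points below are invented (a straight line plus alternating noise), and the example uses the extreme case of polynomial expansion, a degree n-1 polynomial that passes through all n training points, computed by Lagrange interpolation. That choice is an assumption made to keep the example dependency-free; it is not how Orange fits polynomial regression.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# "Painted" training data: y = x with noise of +0.5, -0.5, +0.5, ...
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
# Held-out points on the noise-free line y = x.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = list(test_x)

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

for name, model in (("line", line), ("polynomial", poly)):
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The polynomial reproduces the training points exactly, noise included, and pays for it on the held-out points, where the simple line does better.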
[00:13:05] Unknown:
I don't have favorite widgets, so I cannot. My view of widgets is so different. I implemented lots of them. So my favorite widgets are those with the nicest code, and users don't see it, but that's what drives me.
[00:13:29] Unknown:
So we have different drivers, Janez and I. He's more on the code side, and I'm on the presentation layer, actually, which is great. Working with somebody that's not completely a clone of yourself is really complementary and great. So maybe I mention another one. Can I? Sure. Yeah. So lately we are playing with image analytics, and this is one that I enjoy a lot because it's so simple. So the idea is that there's a reader for images. So you have images in some folder on your desktop or wherever. You load the images, and there's another widget called image embedder. That pushes the images through a deep neural network, and then each image is described with a vector, taken from the penultimate layer of the neural network. And then you can do things like classification or clustering. So you can cluster the images, and then there's another widget called image viewer. So you would actually go from reading the images, embedding them in a vector space, clustering them, and then you can tap on the branches of the clustering and see which images are similar to each other. And that works really great. So thanks to the developments in deep learning, of course.
[00:14:41] Unknown:
And, Janez, are there any widgets whose implementation you're particularly proud of that you'd like to call out?
[00:14:52] Unknown:
So what I like is that in PyQt, you have this model-view paradigm. And I think that our widgets, at least the new ones and those that we renovate, refactor, really make use of this paradigm a lot. We implemented a layer on top of Qt, which helps us develop widgets quickly. So we don't just add Qt components like you would do usually. We call a function which adds the component, but then also synchronizes what's in the component. Like a combo box: it's synchronized with an attribute in a certain Python class that represents the widget. So in this way, we have very little overhead to maintain the user interface. So widgets can be really small because of this extra layer.
So those widgets that I really like are the new widgets that use the model-view paradigm, and they can be much easier to maintain. We can add new features quite quickly in this way.
[00:15:52] Unknown:
Janez is also often very modest about the framework that he has created. So what I would like to expose: if you write a new widget, widget meaning a component that gets, let's say, some data and does something, like build a neural network, and then outputs something, in this case the classifier, it's very simple to build it. So the library for development comes with a library of different controls that you can use. And these controls you can set such that they initialize themselves based on the data on the input. Right? So every time the data going to that widget changes, the widget knows what the latest setting was with that particular dataset. So I think this is a really nice piece of code that is dedicated to Orange, but it helps you to define the user interfaces in a very simple way, such that everything works nicely and the settings are data dependent. And Orange, when you save the workflow, saves these settings. When you open the workflow, the settings are back again. So it helps a lot with the reproducibility of the data analysis. Right? Any workflow that you store is gonna behave exactly the same way when you open it again.
[00:17:32] Unknown:
So, essentially, if you have a new machine learning algorithm that you implemented, or a new visualization method, whatever, if you have it in Python or in C++ and you call it through some interface, it's really easy to add a new widget to Orange. So anybody can just add a new add-on package to PyPI and pack their methods in, like, maybe 10 or 20 lines of code. You can add a new widget with your method and push it as an add-on to Orange.
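Editor's sketch: the two ideas described above, controls that stay synchronized with an attribute of the widget class and settings that are saved and restored with the workflow, can be illustrated with a toy in plain Python. None of this is Orange's actual API. `Control`, `ClusteringWidget`, `save_settings`, and `restore_settings` are hypothetical names, and there is no real GUI involved.

```python
class Control:
    """Stand-in for a GUI control (say, a spin box) bound to one widget attribute."""

    def __init__(self, widget, attribute, callback=None):
        self.widget = widget
        self.attribute = attribute
        self.callback = callback

    @property
    def value(self):
        # The control keeps no copy of its own; it reads the widget attribute.
        return getattr(self.widget, self.attribute)

    def user_changed(self, value):
        # Simulate the user editing the control: write through, then notify.
        setattr(self.widget, self.attribute, value)
        if self.callback is not None:
            self.callback()


class ClusteringWidget:
    """Hypothetical widget whose only setting is the number of clusters."""

    settings = ("k",)

    def __init__(self):
        self.k = 3
        self.commits = []
        self.k_control = Control(self, "k", callback=self.commit)

    def commit(self):
        # A real widget would rerun its method and send results downstream.
        self.commits.append(self.k)

    def save_settings(self):
        return {name: getattr(self, name) for name in self.settings}

    def restore_settings(self, saved):
        for name, value in saved.items():
            setattr(self, name, value)


widget = ClusteringWidget()
widget.k_control.user_changed(5)   # the user turns the spin box to 5
saved = widget.save_settings()     # stored together with the workflow

reopened = ClusteringWidget()
reopened.restore_settings(saved)   # the same setting is back after reopening
print(reopened.k, reopened.k_control.value)
```

Because the control reads and writes the attribute directly, the widget code only ever deals with `self.k`, which is one way to keep widget implementations short.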
[00:17:59] Unknown:
And for the data interchange between the different stages of the visual flow, how do you manage sending the data from the outputs of 1 step to the inputs of the other in a way that is as efficient as possible?
[00:18:14] Unknown:
Yeah. So each widget declares what its outputs are. So it's a name of an output and a type, and other widgets declare inputs. And so there's a piece of Orange called the signal manager. If you connect the widgets, it's just gonna send the data object from one widget to another. Most signals are just tables, NumPy arrays. So it doesn't involve any overhead. All processing that happens, if a widget needs to do something with a table or construct a new table, is in the widget. But this communication is really nothing: no processing, no memory gets consumed there.
[00:18:57] Unknown:
It's nice because this decouples the complexity of data analysis. So when you develop Orange, you don't develop Orange. You're actually thinking only about the widget. Right? You're thinking, okay, what is this component gonna work on, what is it gonna output, and how is it gonna present the information from the input, like in any kind of visualization. So you can develop a widget completely independently of everything else. And then, for me, it's still shocking when I use Orange. It's shocking how many combinations are possible, where I can use something that was developed maybe half a year ago, and the combination would be completely new. Right? I would use something in a completely different way than it was supposed to be used. Like, for instance, we built a Silhouette widget for trying to score data instances on how well they fit a particular cluster. Right? And then we found out that this widget is actually excellent for finding outliers.
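Editor's sketch: the pieces described above (widgets declaring named, typed inputs and outputs, a signal manager passing objects along without copying, and a silhouette score reused to spot outliers) fit into one small toy. This is plain Python, not Orange's implementation. All class names are hypothetical, the points are invented, and the silhouette function assumes at least two clusters and distinct points.

```python
from math import dist


class SignalManager:
    """Routes values along links; the same object is handed over, never copied."""

    def __init__(self):
        self.links = []

    def connect(self, source, output, sink, input_name):
        if not issubclass(source.outputs[output], sink.inputs[input_name]):
            raise TypeError(f"{output!r} cannot feed {input_name!r}")
        self.links.append((source, output, sink, input_name))

    def deliver(self, source, output, value):
        for src, out, sink, input_name in self.links:
            if src is source and out == output:
                sink.receive(input_name, value)


class Widget:
    """Hypothetical base class: subclasses declare their inputs and outputs."""
    inputs = {}   # input name -> accepted type
    outputs = {}  # output name -> produced type

    def __init__(self, manager):
        self.manager = manager
        self.received = {}

    def receive(self, name, value):
        self.received[name] = value

    def send(self, name, value):
        self.manager.deliver(self, name, value)


def silhouette_scores(labelled):
    """Per-point silhouette for a list of (point, cluster label) pairs."""
    clusters = {label for _, label in labelled}
    scores = []
    for i, (point, own) in enumerate(labelled):
        mean_distance = {}
        for cluster in clusters:
            distances = [dist(point, other)
                         for j, (other, label) in enumerate(labelled)
                         if label == cluster and j != i]
            if distances:
                mean_distance[cluster] = sum(distances) / len(distances)
        if own not in mean_distance or len(mean_distance) < 2:
            scores.append(0.0)  # singleton cluster, or only one cluster
            continue
        a = mean_distance.pop(own)       # cohesion: own cluster
        b = min(mean_distance.values())  # separation: nearest other cluster
        scores.append((b - a) / max(a, b))
    return scores


class Source(Widget):
    outputs = {"Clustered data": list}


class Silhouette(Widget):
    inputs = {"Clustered data": list}
    outputs = {"Scores": list}

    def receive(self, name, value):
        super().receive(name, value)
        self.send("Scores", silhouette_scores(value))


class LowestScore(Widget):
    inputs = {"Scores": list}

    def receive(self, name, value):
        super().receive(name, value)
        self.outlier = min(range(len(value)), key=value.__getitem__)


clustered = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
             ((10, 10), "B"), ((10, 11), "B"), ((11, 10), "B"),
             ((5, 5), "A")]  # the last point sits between the two clusters

manager = SignalManager()
source, silhouette, picker = Source(manager), Silhouette(manager), LowestScore(manager)
manager.connect(source, "Clustered data", silhouette, "Clustered data")
manager.connect(silhouette, "Scores", picker, "Scores")
source.send("Clustered data", clustered)
print(picker.outlier)
```

Each widget here knows only its own inputs and outputs. The in-between point gets the lowest silhouette score, which is how a scoring widget can double as an outlier detector once it is wired to something that picks out low scores.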
[00:19:58] Unknown:
And when you're working with a visual language that is an abstraction of a text-oriented language that's fully Turing complete, there are often gonna be cases where you bump up against the limitations of what can be represented in the widgets. So I'm wondering, what are those limitations that are present in the graphical interface, and what options do users have when they reach those limits in order to be able to take work that they're doing and extend it beyond what they can do within Orange?
[00:20:27] Unknown:
Yeah. There is a separation between what you can do in visual languages like Orange, in workflows, and what you can do in code. So one thing that we don't have and probably never will have is loops. Any loop that happens, happens in the widget. We don't have loops in the workflows. We have been exploring this a lot, and people are asking for it, but it just doesn't fit our paradigm. Of course, you have languages with loops for children, like Scratch, which includes loops, of course. But in this workflow paradigm, we just cannot make it work. There's no obvious way to do it. So we try to avoid making things incomprehensible by just adding loops.
And, also, we don't need it. So when the user needs to do something very complicated, he has to go to code. Eventually, when the analysis becomes too complex, you have to do some coding.
[00:21:25] Unknown:
There was a design decision in Orange not to have too many widgets. So widgets, again, are components that process the data, and the idea was not to have too many of them. Right? So if you had loops, then you would have to have some data preprocessing, post-processing, and everything. So we do not have that. We do not have a widget that changes a string to something else because that would be too low level. Right? So we would rather have as few widgets as possible, but there's a threshold. Right? A widget should not be too complex, again. Right? Complexity meaning that it would have too many options, it would do too many different things. Right? So we try to balance this complexity on one side and the number of different widgets on the other side. The problem with too many widgets would be that users wouldn't actually know what to use. If you have a library of thousands of widgets, that adds a complexity layer that we would like to avoid.
[00:22:23] Unknown:
So, I have seen tools that are kind of similar to Orange, but in which signals could also be not just whole data tables, but single integers or strings. And it became more or less like coding. So it was too low level to be useful for ordinary users. So we try to stay above this. And if you want, say, widgets to communicate with low-level data, we're just not supporting it. It's not our game.
[00:22:56] Unknown:
And you've mentioned a bit about some of the underlying libraries that you use to build Orange itself and the widget catalog. I'm wondering if you can spend some more time talking about how a widget is implemented at a low level and how that implementation has changed over the different versions that you've released.
[00:23:16] Unknown:
Okay. So I think the very first version was in Tcl/Tk, but we stopped using it quite soon. It just didn't look nice. So we decided early on to use PyQt, and we are still quite happy with it. The real problem is with visualization libraries, for scatter plots and box plots and stuff like this. We just haven't found any good tools for this. So Matplotlib is great, but it doesn't play well with Qt and with interactivity. You just cannot really interact with it. So currently we use PyQtGraph.
We are not too happy with it. We are going for some JavaScript embedding into widgets. We just cannot find anything that plays well with PyQt. So that's a big problem we have. Otherwise, we are relying on NumPy, SciPy, and scikit-learn for machine learning stuff. So these are the 2 big groups of libraries. One is for visualization, and the other is for machine learning. Otherwise, we have, like, dozens of different small libraries also included.
[00:24:25] Unknown:
And for people who are just getting started with doing data analysis and machine learning, there are a lot of concepts to be able to tackle and understand. So I'm wondering what are some of the most common difficulties that you've seen when those users are first getting started in that area, and how does Orange help to overcome those gaps in understanding?
[00:24:47] Unknown:
So I think there are 3 approaches to learning about machine learning. One goes through mathematics, which is obviously okay if you're good at mathematics. Another can be through programming. Again, you need to be a computer science major for this. And the third is more intuitive. You learn through observing things, through, as Blaž said, painting some data and seeing how clustering works there and which machine learning method works there. So if you take this route, for instance if you're not a computer science or mathematics major, you're coming from, I don't know, biology, then Orange would help you to learn about data mining in a more intuitive, human-friendly way.
[00:25:26] Unknown:
I think for me, the challenge is this. So you say you're in Boston. Right? Suppose you go on the street in Boston and stop 3 random people, and you have 1 hour to teach them what machine learning is. And for me the challenge is, can I have a tool where within 1 hour I would show them what machine learning is, such that when they go home, they can do it on their own? Right? And some of the things that we do in Orange are just for supporting this kind of training. Maybe not in 1 hour. Recently we are having a lot of hands-on workshops that take 1 day. Right? And in 1 day, I think you can teach people what machine learning is, what classification is, what clustering is, what the problem with overfitting is.
We even go into regularization. Right? And Orange is really a tool where you can do that. You can actually train people within a day or maybe 2, such that they become aware of what is possible to do. I'm not saying that they become data scientists, because for that you need years of practice. Right? But I'm talking about informing people what machine learning is. Because in current society, only a tiny percentage of people actually know what machine learning is. And on the other side, we know that machine learning is all around us, and all the big and small companies are using it. So Orange is really about the democratization of data science. Can we support just about everybody with tools that they can use to play with the data?
[00:27:08] Unknown:
And there's also the case where you have people who are working in the area professionally, but they don't necessarily want to break out the big guns to work through just a, you know, proof of concept. They just wanna be able to throw some data up, you know, run it through a few different steps and see what the outcome is. And it seems that Orange is a good way to facilitate that experimentation and data exploration without having to invest a lot up front in terms of setting up the boilerplate for being able to load the data, process it through different steps, and then export that through a visualization tool.
[00:27:46] Unknown:
Yeah. It's like smoothing the learning curve. Right? I mean, think about Excel, how people accept it as a tool that virtually anyone can use. Excel is now used in primary schools and secondary schools, and everybody knows about it. Right? And we probably do not have such a tool for machine learning. Right? I think Orange and similar products are actually just ripe, right, to go to primary school and think about what data science is.
[00:28:16] Unknown:
And, also, as you said, even if you know how to handle the big guns and you just want to take a peek at your data before you start using the big guns, you can just open Orange and take a look. So it's really useful for this too.
[00:28:32] Unknown:
And there is also this element of interaction. Right? So you cannot interact with data in Matplotlib or Python. Right? It's this element where you talk with, say, the customer or a colleague or the data owner, and it becomes so much fun to actually be able to touch the data.
[00:28:57] Unknown:
What are some of the most interesting or innovative uses of Orange that you've seen? So we are working with a group of physicists and chemists who run some large synchrotrons. These are like huge microscopes, essentially. They use radiation that comes from accelerators. And one of those groups took Orange, threw all the widgets away, and composed a new set of widgets that they use for simulating the optics in these huge microscopes. So the data there are not data from, like, machine learning, from data mining. Each data point there is a photon, and they're trying to simulate what lenses would do with a beam of photons in the accelerators.
[00:29:41] Unknown:
So Janez just mentioned an example where Orange was not used as intended. Right? It was not used as a data science workflow tool, but for something else. Right?
[00:29:50] Unknown:
Yeah. They took the infrastructure that they have, that is, the canvas, the part of Orange that allows you to put widgets onto the schema and connect them. They took this part, but they replaced everything else.
[00:30:11] Unknown:
There are also other projects that use Orange in more common ways. So I'm just looking here at Textable. Textable.io is, like, an add-on for text analysis and for digital humanities. So there's a group of people developing their own widgets. They interact with our widgets, but they are also geared more towards text analysis. And maybe an anecdote, which is not exactly on the production of widgets, but on the use. Right? So I remember I was explaining image analysis to my colleague in Houston, who had some images for molecular biologists. And the task there was classification of different development stages of a certain amoeba.
And then we quit the conversation, and, I don't know, after 8 hours, Gadi calls me back. He says, oh, you know, I just went home, and I took the photos from my family photo album, and I clustered them. Right? And then he described how there are some images where he appears with his wife and with his children, in the mountains and without, and it was such a great experience. Right? So going from just showing somebody how to do image analysis, right, 8 hours later he does mining of his photo album.
[00:31:32] Unknown:
Yeah. It's definitely cool seeing the different ways that people will use the tools and technologies that we produce, often in ways that we don't necessarily expect. And one of the other things that is worth noting about Orange that I don't think we talked about yet is the fact that it has some widgets built in for being able to serve as data sources so that if you don't already have an existing dataset that you want to work with, but you want to experiment with some of these different techniques, you have the ability to just plug into an existing stream of data. So for instance, Twitter or some of the other capabilities that you have for bringing data into the tool for doing some different types of analysis.
[00:32:10] Unknown:
Yeah. Lately, this is actually where we are going. Right? So we are also working with some companies that have completely different, you know, environments and some databases where you build a widget to actually access the data, and then you build some widgets to visualize them in their own way. Right? There are widgets, like, for PubMed browsing, and there are widgets that can gather tweets, and there are widgets that can gather Facebook contacts and so on. Right? There's such a wealth of things that you can still do. So, of course, we as a group, right, we wouldn't like to do everything, and we would just like to set the environment where people can do that and then enjoy everything else. Right? So if you build a widget that can, I don't know, read all the Shakespeare books, of course, there are widgets that can then cluster the texts, that can analyze them, that can visualize the differences, and so on. So you have this component-rich environment where just adding one more widget enables the use of Orange in some completely new field. Right? That being, I don't know, economics or digital humanities or physics or molecular biology, whatever.
[00:33:39] Unknown:
So what are some of the other projects or technologies that you consider to be either competition with Orange or alternative options for people who are interested in the capabilities that Orange has?
[00:33:55] Unknown:
So maybe first, rather than competition: we, of course, gain a lot from everything that is developed in Python. Right? Because Orange is based on Python, whatever libraries are new, whatever deep learning comes out there, it's great for Orange because we can include it or we learn from that. There are also similar projects in other languages. Probably the most notable currently are RapidMiner and KNIME. I should say that both of these are actually companies. We're still a group at a university. Right?
But I wouldn't say that they are competition. They're doing something different. So let's say RapidMiner is more about building workflows that can run on servers. Right? And then, after you design the data, there is a final report. Right? And Orange is not about that. Orange is not about creating a report with a known workflow. Rather, what we really specialize in is, like, data visualization and interaction. Right? So you basically work with the data like you would tell a story. Right? You start, you visualize the data, you select, you tell why this is interesting, you mine further.
So Orange is different in that respect, and I think KNIME is somewhere in between RapidMiner and Orange in its focus. Right? But what we really especially like about Orange is this interactivity, and that takes a lot of development work. So a widget that shows just a scatter plot, right, would be very simple to do. Right? But a widget where you can select different things in the scatter plot, and then you have all the modifier keys so that you can not only select one group, but you can also define clusters, and it outputs this kind of cluster that you can analyze further, that's much more work. Right? Even for a simple thing like a scatter plot. But then consider dendrograms, trees, graphs, networks: all of them have to be interactive, and this is the specialty where I think we have a niche.
[00:36:07] Unknown:
And under what circumstances would you advise against using Orange?
[00:36:14] Unknown:
If your focus is not data visualization and interaction, then maybe. So if you would like to have really good prediction accuracy, and when you have large datasets that don't fit into memory and you'd like to use, I don't know, cloud computing, then you cannot use Orange for that. So if you're just into making a lot of predictions quickly, that's not Orange. Orange is, as Blaž said, a storytelling tool. You let the data tell a story. So in this case, you would use Orange. If you'd just like to squeeze your data, then it's not Orange.
[00:36:55] Unknown:
And you've mentioned a lot of the different types of widgets that you have currently. I'm wondering what are some of the ones that you would like to see in future versions or ones that you would like to see contributed as additional plugins for Orange?
[00:37:08] Unknown:
We are now working in some specialized fields. Like, we are building Orange for single cell genomics, since single cell genomics is almost like a brand new field that is just popping up. And the data there is very interesting and needs integration with other libraries in molecular genomics. So this is a part of what our group is working on. We're, of course, working towards some new data visualization widgets, and we would also like to find means of how to explain the results of deep learning. So, widgets that use deep learning, but that have the explanation power. Right? If I say that, I don't know, there's an image. Right? And it's an image from, let's say, clinical medicine, and there's a certain disease in that image. I would like to point out what particular part of the image is responsible for that, and I would like to do that interactively so that I can explore further what's happening.
But beyond widgets, right, I think what we're working on now is that we would like to create dashboards. So in Orange, you visually design the workflows, but then you would like to select some of the typical visualizations, and maybe settings of particular methods like clustering. Right? And you would like to organize them in a dashboard, with the constraint that the dashboard should be just as interactive as everything else is in Orange. And I think it's gonna take us a few more months to turn Orange into a dashboard creator.
[00:38:51] Unknown:
Yeah. And the widgets I'm interested in are for education. So we have an education add-on with widgets intended to teach about data mining. We use it in lectures and in the hands-on courses we give. I'd really like to expand this part because I think we can be really good at this. There are just a few widgets there. And the other part is the Internet of Things. So I'm playing with microcontrollers, and I don't think we have any widgets that can pull data from there. But this would be really interesting if we can add widgets for analyzing data that comes continuously from small microcontrollers that you put around with sensors, stuff like this.
[00:39:34] Unknown:
We are also working on widgets that use deep learning. So the idea is not to use Orange for deep learning, because usually deep learning takes a lot of data and time. You don't do that in Orange. But once you build the model, you can use something called transfer learning. Right? So you can, let's say, learn embedders from millions of images, but then you only have a small set of images that you would like to analyze. And this is what we would like Orange to do. We would allow people to read these images and then put them through embedders and then analyze them and probably explain what is on each image. So something like the analysis of smaller datasets but using already trained deep models, and deep models for images and text and graph structures, and also, for instance, chemical structures.
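The transfer learning idea described above (an embedder trained elsewhere turns each image into a vector, and a small dataset is then analyzed in that vector space) can be sketched in plain Python. This is only an illustrative sketch, not Orange's API: `embed` is a hypothetical stand-in with three hand-written features so the example runs without downloading a model, and `nearest_label` and the tiny 2x2 "images" are likewise assumptions made for the example.

```python
import math


def embed(image):
    """Hypothetical stand-in for a pre-trained embedder.

    A real embedder would be a deep network trained on millions of images.
    Here `image` is a list of rows of grayscale values in [0, 1], and the
    "embedding" is just three hand-written features.
    """
    rows, cols = len(image), len(image[0])
    total = sum(sum(row) for row in image)
    mean = total / (rows * cols)
    top = sum(sum(row) for row in image[: rows // 2])
    left = sum(sum(row[: cols // 2]) for row in image)
    if total == 0:
        return [mean, 0.5, 0.5]
    # mean brightness, share of brightness in the top half, share in the left half
    return [mean, top / total, left / total]


def nearest_label(query_vector, labelled_vectors):
    """1-nearest-neighbour lookup in embedding space.

    `labelled_vectors` is a list of (vector, label) pairs.
    """
    closest = min(labelled_vectors, key=lambda pair: math.dist(query_vector, pair[0]))
    return closest[1]


# A deliberately small labelled set: the embedder does the heavy lifting,
# so a handful of examples is enough for this toy task.
small_dataset = [
    ([[1.0, 1.0], [0.0, 0.0]], "top-lit"),
    ([[0.9, 0.8], [0.1, 0.0]], "top-lit"),
    ([[1.0, 0.0], [1.0, 0.0]], "left-lit"),
    ([[0.8, 0.1], [0.9, 0.0]], "left-lit"),
]
labelled = [(embed(image), label) for image, label in small_dataset]

new_image = [[0.7, 0.9], [0.0, 0.1]]
print(nearest_label(embed(new_image), labelled))
```

The point of the design is that only `embed` would need to change to use a real pre-trained model; the analysis on top of the embeddings (nearest neighbours here, clustering or visualization in a tool like Orange) stays cheap enough for a small dataset.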
[00:40:28] Unknown:
And beyond just the widgets, what are some of the other features that you have planned for future releases?
[00:40:33] Unknown:
So, as I said, dashboards. Dashboards are where we're gonna go next.
[00:40:39] Unknown:
And is there any particular help or certain areas where you would like contributions from people who are using Orange, whether that's feedback or contributions in terms of code or documentation or other types of input?
[00:40:56] Unknown:
I think the Python community is doing great. So many things are developed there that anything you want, somebody's working on it. So we are really happy with our choice of language, and I don't think we could have any specific wishes that are not being fulfilled right now already.
[00:41:17] Unknown:
There's also help coming from the community. So every time somebody installs Orange, so, Orange comes for free. You don't need to leave an email, but you can actually answer survey questions, and we're learning a lot from people that entered these surveys. So there are now two. One is a short one that you can answer in one minute, and one is a longer one where people can actually put some thought into it. And it's great to see that every week, right, we have somebody in the lab that summarizes all the surveys, and it's great to look at that and at what people would like to have. Of course, we are just a small group of about 10 to 20 developers. We cannot fulfill everything, and, currently, we are contacted by a number of companies that actually would like to extend Orange in particular ways, let's say, for health care or for financial markets. And this is great because it's gonna actually propel Orange through domains that we, as a single group, right, cannot address because of the lack of domain knowledge and, of course, lack of time. But essentially, what we are doing is building a tool that others can either use or can develop their own widgets for, to their own liking.
[00:42:33] Unknown:
Are there any other topics that we didn't cover yet that you think we should talk about before we start to close out the show?
[00:42:43] Unknown:
In the past year, I've been enjoying also working with Ajda Pretnar. She's actually an anthropologist. I mean, she's not a computer scientist. She's an anthropologist, she joined my lab about 2 years ago, and the plan was actually to make YouTube videos that relate to Orange and that introduce anybody to data science. And this has been great fun. And so the videos are picking up. I think there have been almost half a million views already. The videos are short, about 3 minutes each, and we keep adding them, like, at the rate of one per month, and that has been great. Also, it helps the group sometimes to focus on something that is new that we would like to present in Orange, and then we kind of polish this up. So it's like an artificial deadline for the group as well.
[00:43:33] Unknown:
And one last question is, where does the name come from?
[00:43:37] Unknown:
So we always get this question at some stage. So the initial name, when we were conceiving the machine learning library with Janez, was ML star, like a closure on machine learning. Right? So any number of components that you would combine. Right? And this is very hard. So what is that? ML star, right, how do you write this? Right? It should be MLS or something. And then I went to my wife and asked, well, we have this funny name, which is very geekish. Right? I mean, now that I think back, maybe we should have kept that one. But then she said, okay, how about blue? And I said, no, blue is melancholic. Right? And then she said, let's call it Orange.
[00:44:26] Unknown:
And looking deeper into that a little bit, it seems like it's a little bit serendipitous because a lot of the work that you're doing with machine learning is comparisons. And sometimes you're trying to compare apples and oranges. So it has another little layer there too.
[00:44:41] Unknown:
Yeah. So this, of course, I mean, when you choose a name, I think the rule is never to choose a name from any color or fruit. Right? And we picked both. I mean, I don't know, I think five years ago, we gathered, 20 of us together, and we said, okay, maybe we should change it. And then we discussed this for a week, right? What are all the derivatives of oranges that we could use? And I think now, I mean, there are so many web pages, so many references to Orange data mining that we probably cannot go and change it anymore.
So it's gonna stay like this for the future.
[00:45:27] Unknown:
Alright. Well, for anybody who wants to follow the work that either of you are up to and keep in touch, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. And for my pick this week, I'm going to choose an episode of the Data Stories podcast where they spoke to a couple of people from the New York Times Learning Lab about a project that they're working on called What's Going On in This Graph, where they take different data visualizations and graphs from past episodes of the New York Times, strip out a lot of the context, and then post them on the learning network so that students and teachers can try to ask and answer some questions about what the purpose of the graph is and what's the context around it. And then after about a week or so, they will also post the actual story that it was originally embedded in, and they have moderators facilitating the discussion. So it seems like a really interesting project and something that looks like it would be a lot of fun for students and educators to get involved with. And so with that, I'll pass it to you, Blaž. Do you have any picks for us this week?
[00:46:37] Unknown:
Yeah. I actually enjoy the podcast by Guy Raz. So I cannot name a particular one, but every one that I listened to was great. So I really like that. It gives me energy. I listen to it usually when I'm in the car after work, and it's simply great for resetting my mind and listening to how other people have struggled and, quite a number of them, succeeded.
[00:47:08] Unknown:
And, Janez, do you have any picks for us?
Yeah. It's Advent of Code. It's something that happens every December. You get a new thing to program every day. It can take you 5 minutes, or it can take you 5 hours if the problem is more difficult. Usually, it's just, like, a few minutes. But it's a nice task. I use those tasks also for students, and you can set your own challenges. So one colleague did it like this: you get 25 tasks, and he decided to do it in 25 different languages. Last year, I decided to do everything on Arduino, which is a small microcontroller with 2 kilobytes of memory. This year, I decided to learn a new language, so I'm doing it in Kotlin, which is a really nice language. I shouldn't say that here, but sometimes some programs in Kotlin are even nicer than Python. So Advent of Code is my pick.
[00:47:56] Unknown:
Alright. Well, I appreciate the both of you taking time out of your days to talk to me about the work you're doing with Orange. It's definitely a very interesting tool and one that I'm likely to start experimenting with and possibly even use to start teaching my children about data literacy and data science. So thank you for that, and I hope you enjoy the rest of your days.
[00:48:27] Unknown:
Okay. Thanks a lot for inviting us. It's been great. So I have to admit I didn't know about your podcast until you invited us, but I listened to about 20 of them already, and they're all great. So you're doing great work. So thank you a lot. Yeah. Thank you. Thank you very much. Bye.
Introduction to the Guests and Their Backgrounds
History and Motivation Behind Orange
Target Audience and Use Cases for Orange
Technical Implementation and Features of Orange
Data Interchange and Workflow Management
Educational Uses and Democratization of Data Science
Innovative Uses and Extensions of Orange
Competitors and Alternatives to Orange
Limitations and Future Plans for Orange
Community Contributions and Feedback
Origin of the Name 'Orange'