Summary
Data scientists are tasked with answering challenging questions using data that is often messy and incomplete. Anaconda is on a mission to make the lives of data professionals more manageable through creation and maintenance of high quality libraries and frameworks, the distribution of an easy to use Python distribution and package ecosystem, and high quality training material. In this episode Kevin Goldsmith, CTO of Anaconda, discusses the technical and social challenges faced by data scientists, the ways that the Python ecosystem has evolved to help address those difficulties, and how Anaconda is engaging with the community to provide high quality tools and education for this constantly changing practice.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I’m interviewing Kevin Goldsmith about Anaconda’s contributions to the Python ecosystem for data science
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Anaconda focuses on solving for?
- What was your path into the CTO position?
- From your perspective as the CTO of Anaconda, what are the biggest challenges facing data scientists today?
- What is the breakdown between technical and organizational sources for those difficulties?
- How is the Anaconda product suite architected to help address some of those problems?
- Where are you spending your focus to allow Anaconda to address the current and future needs of data scientists?
- Python has been a dominant force in the data and analytics ecosystem for several years now. What do you see as the future of the space? (e.g. monoglot vs. polyglot workflows)
- What are the most interesting, innovative, or unexpected ways that you have seen the Anaconda platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anaconda and data science tooling?
Keep In Touch
- @KevinGoldsmith on Twitter
- Website
Picks
- Tobias
- Kevin
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- Anaconda
- Spotify
- Lisp
- Scheme
- C#
- Anaconda Nucleus
- PyData
- AnacondaCon
- Grid Computing
- PyTorch
- Tensorflow
- Pyston
- Dask
- Numba
- Panel dashboard framework
- Datashader
- Jupyter
- R
- Julia
- AstroPy
- Arrow
- Data Teams by Jesse Anderson
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host as usual is Tobias Macey. And today, I'm interviewing Kevin Goldsmith about his work at Anaconda and their contributions to the Python ecosystem for data science. So Kevin, can you start by introducing yourself?
[00:01:08] Unknown:
So delighted to be on Podcast.__init__. My name is Kevin Goldsmith. I am the chief technology officer at Anaconda.
[00:01:15] Unknown:
And do you remember how you first got introduced to Python?
[00:01:17] Unknown:
Yeah. It's funny. I remember when Python was created. At the time, I was, you know, very much a C and C++ developer and thought, oh, okay. Well, that's cute. And promptly ignored it for a number of years until I was at Adobe, and Adobe had built a sort of meta build system, a build system in Python that would generate makefiles. And I had to do something for the product I was working on. That was where I first really had to actually learn Python syntax, just to make some changes there. And then I was still doing C++ for a while. I really started learning Python a few years later. I was at Spotify, and I was leading the consumer engineering team there.
And we were doing more and more, obviously, you know, Spotify is one of the world's largest kinda data companies as well as being a music company. We were starting to do a lot more work with those folks in my team, and I got much more interested in, you know, the data ecosystem and got an O'Reilly book. And, you know, step one was go install Anaconda. So I did and started learning Python and started using it for my own projects there. And I basically have ever since. You know, I've worked in lots of languages: C++ and C and Java and Lisp and Scheme and C#. I worked at Microsoft a number of years, so C#.
So I've worked in lots and lots of different languages over the years, but for pretty much the last 7 or 8 years, I've just been 100% Python for my own projects. Anything I'm coding on is almost always in Python these days. JavaScript too, but only for front end stuff.
[00:03:10] Unknown:
Yeah. It's both amazing and, at least for myself, somewhat unfortunate that there is no escaping JavaScript.
[00:03:17] Unknown:
Yeah. You know, I also remember when JavaScript was created. Right. And it's amazing to me, like, how that language has developed over the years and how much more capable it is now. But, yeah, you know, every language has its idiosyncrasies, and every language, I think, ends up becoming every other language eventually. I was lucky enough that I started with JavaScript at the very beginning because you needed to. By the time I started working in Python, it was actually a fairly comprehensive ecosystem, I think, which helped me in sort of my own adoption of it. If I'd started using it when it was first created, I don't know if I would have found it as useful.
[00:04:00] Unknown:
It's interesting too, just in the sort of language design sense, how much languages can evolve and yet still be very reflective of their core foundational elements. Python has evolved immensely from where it started, but because it built on very strong conceptual foundations, a lot of what's actually present in the language is really just syntactic sugar over the skeleton of the actual VM and the language syntax. JavaScript has evolved a lot too, and it's added things like classes, but under the covers it's still very much... there are a lot of great elements to JavaScript, I don't wanna be negative toward it, but, like, when you really dig under the covers, there are a lot of bits of it that, especially coming from a Python background, just make no sense. Right. Yeah. Of course. And they really just show that it was created, you know, in a frenzy because somebody needed a language for their browser to compete.
[00:04:49] Unknown:
There are a lot of languages with huge amounts of use that were really somebody's quick hack, mhmm, to get something working, and became entire industries unto themselves. Yes. Yeah. Python has the benefit of actually, you know, having a lot of thought put into it originally, which helps a lot. Yeah. One of my favorite sort of phrases, which I'm probably misquoting, is that any hack that is insufficiently broken will remain in production forever. Yes. Yeah. That's absolutely right.
[00:05:18] Unknown:
And so now you have ended up at Anaconda as the CTO. So I'm wondering if you can just give a bit of an overview for people who aren't familiar, what it is that Anaconda is doing and the core focus of the business and your particular path into that CTO position?
[00:05:35] Unknown:
So Anaconda, people tend to think about us primarily for our distribution of Python. We have a distribution of Python that north of 25 million people use every day. A significant percentage of those that work in the PyData ecosystem use our distribution of Python. It is built by us. It supports hardware targets that sort of core Python, native Python doesn't support. We work with a lot of hardware vendors to enable Python to work on some less well known systems that are really critical for data folks, in industry more so, but also a lot of students use that distribution as well because it also greatly simplifies some of the challenges you can have with kind of managing packages and package collisions and those kinds of things. So, a big part of this is we curate the packages that we ship. As part of our Python distribution, we provide a large set of sort of critical core PyData packages, again, that we've built so that you get all that into your kind of root Python environment. We also have a solver, and we work very hard to make sure that anything that you install will install the appropriate versions of the other packages it depends on. So, we do a lot of work to make sure that you don't break your own Python environment.
For a long time, we were sort of the easiest way to get Python onto a lot of systems, and that I think helped a lot. It was very hard to get Python onto Windows for a long time, for example, and Anaconda was one of the best ways to do that. So that's pretty much what we're primarily known for, what we call Individual Edition, the open source free Python distribution that we have. In addition to that though, we have a commercial version of that, what we call Commercial Edition. That version is really designed for companies that use Python for their business. That adds on a level of security beyond just the fact that, you know, we've built all those packages. So one of the things companies are very concerned about is security issues, ranging from where their things are being built to what's being inserted into their supply chain.
So one of the things that we've added as part of Commercial Edition is we have, I believe, and I may be wrong, but I'm fairly certain, the first signed packages, or signature verification on packages. So we build packages, we sign them, and you can verify the signatures on those packages. We shipped that earlier this year. We also do curated vulnerability things. So we take the NIST and other vulnerability streams and actually curate them and clean them up because there's a lot of noise in them. And so that's another thing companies really appreciate if they're really concerned about security around their languages.
So that's the Commercial Edition. That also obviously is licensed for commercial use, which is another important aspect of it. Beyond that, we have a few other products, one being what we call Team Edition. Team Edition is actually a packaging server. So it's really meant, again, for companies. So this allows companies to use our server to distribute Python packages, or actually beyond Python packages. It's primarily used for Python, but we support all languages, so that your company can control sort of which packages are in use within the company. You can have your local cache. It supports things like being air gapped.
So if you have a highly secure environment and you're very concerned around, again, vulnerabilities coming in from outside, this allows you to let developers within your company have access to packages that have been scanned and cleaned or whatever else. You have a lot more governance. There's another product called Enterprise Edition, which, beyond just the repository capabilities, adds on workflow capabilities as well for enterprise development teams. So beyond those, something that we launched late last year that's been growing very quickly is something we call Nucleus, which is actually a community, and it's where we have a large collection of information around Python and the PyData ecosystem.
We did our virtual AnacondaCon the last 2 years, last year and this year, and we have something now which is basically AnacondaCon on that platform, and that's just a free platform for people to engage and learn more about the PyData ecosystem and Python. So that's the set of things that people are using from Anaconda today. And my own path to CTO: I mentioned I worked at Microsoft, I worked at Adobe, in computer graphics, and then spent some time doing sort of grid computing. I spent some time doing media encoding for distribution, which led me eventually to Spotify. I was at Spotify for a while in Sweden, came back to the US, kinda did legal tech. But sort of in that time period after Spotify, data science and machine learning were obviously impacting the industry, and certainly every company I've been at since Spotify has been, in some ways, a machine learning company.
So after Spotify, I was at a legal tech platform called Avvo, where we were driven a huge amount by a very large data analytics team and a very large data science team. From there, to an edtech company, again built on data science, then to an identity verification company, Onfido, in the UK, where we were a data science platform: we were using computer vision and machine learning to do ID verification. So that became just sort of a critical and central part of it. Even though I was working in different industries, I was leading large data science teams and data engineering teams, which then led me very naturally to Anaconda, where I joined last year. So my own path to CTO: I've been a CTO for about the last 5 years, ever since I left Spotify. At Spotify, I was a VP of engineering.
Prior to that at Adobe, I was a director of engineering. So that's kind of how I ended up in the CTO role at Anaconda.
[00:12:00] Unknown:
From your perspective as the CTO for Anaconda, which is a company that is well known in the data science community for having a lot of contributions to the open source ecosystem, both in the form of the conda platform and the Anaconda distribution of Python, but also in terms of the variety of open source projects that are either created or fostered by members of the team there, what do you see as being the biggest challenges that data scientists are facing today, whether it be in terms of this technology or the organizational aspects or the sort of level of sophistication that's necessary just in general? Like, from your perspective as the CTO of this business that is so closely enmeshed in the ecosystem and the community, I guess, what are some of the problems that you are seeing on a day to day basis?
[00:12:53] Unknown:
So I think one of the things that's fascinating, specifically about the sort of PyData ecosystem or data science world, is just the speed at which it is maturing. Right? So, you know, while we've had artificial intelligence for quite a long time, it was always a very nascent kind of future. I worked on AI projects in the early nineties. But what we did then and what's done now is so radically different. This industry and this area has grown so quickly. We continue to find not only new applications of this technology, but also new ways of doing things. I think that's one of the biggest challenges to practitioners and to data scientists, just the continuous emergence of new approaches.
And one thing I also find really interesting, you know, talking to folks coming out of graduate programs today, as somebody that's been kind of in the world for a long time, is that we're actually losing a lot of the classical approaches. Because everything is moving so quickly, everyone's kind of adopting these new things or they're learning the new things, and, you know, they don't know or have forgotten about some of the classical approaches. So I think that's one of the biggest challenges for practitioners, just being aware of all the new things that are coming, all the new ways of addressing them.
This is even beyond the computer industry's own kind of normal, very fast progression. So, you have that sort of meta thing, but then you have this new field that in itself is so critical to everything that's being done. It's just moving very, very quickly. So, staying aware and staying current is one of the biggest challenges for practitioners today. That's a challenge for just the people working there. The other thing I think is a big challenge today is that we are now also starting to learn the consequences of some of the approaches that we've been using. And we don't have good solutions for these things.
From the simple stuff like good practices around confidence modeling and drift detection, to being able to actually have more observability about the models that we're building and how the models are operating, which leads to things like bias. And bias is an absolutely critical problem in the industry that we've become aware of and don't have great solutions for. And that is also a danger for us in that it may erode confidence in data science itself, which would be obviously, you know, a problem. But I think one of the things that's interesting around this as well is that it's a positive thing. You have these great tools coming out to sort of democratize data science and make it accessible to people, which is I think a really, you know, interesting and invaluable contribution. But at the same time, it's making these tools available to people that don't have the fundamental grounding in statistics or numerical theory or these kinds of things, where they're giving tools to people who don't necessarily understand how to use them well. And that I think can be also a challenge to the industry.
Making it available to more people is wonderful, but it's making it available to more people in ways that could also kind of take these unintended consequences, these unintended uses, and actually accentuate them. So that's a bit of a challenge. I think we need to find that right mixture. There are more things to do with data science than we have people to do it. And that's gonna be true for a while, for really advanced practitioners and across so many different domains, so we need something to simplify the work and make it easier and make it more accessible. But at the same time, we have to figure out how to do that in ways that also educate people about some of the challenges around this. If you just have a model generated for you, throw some data at something, get a model, and then start applying that model and not really understand what's happening under the hood, that could be very dangerous as well.
[00:16:53] Unknown:
Yeah. Definitely agree on those points. And the sort of metaphor that came to mind as you were describing that was like handing a hammer to a child, where they might build a birdhouse or they might smash their thumb, or, you know, they might break the window. And so there are a lot of things that can go right and a lot of things that can go wrong.
[00:17:15] Unknown:
Absolutely. Yeah. You're giving extremely powerful tools to people and telling them, oh, this makes everything easy, but this is really hard. Right? And, you know, the question is, how is that gonna get used?
[00:17:26] Unknown:
Yeah. And then another thing worth calling out is sort of combining your point about people coming out of school not having a thorough grounding in the fundamentals of data analytics and data science, and the issues around bias and explainability, with the immense escalation of deep learning in terms of the power and capabilities that we can leverage it for, but not necessarily always having the most judicious application of it. The first thing that you reach for now is deep learning because that has been proven to be effective in so many different ways, and then you don't actually fully understand what you're getting out the other side of it because you don't have that foundational grounding or because you didn't first do the modeling in a more sort of statistically oriented fashion, and you just went straight to, you know, the latest shiny thing coming out of PyTorch or TensorFlow.
[00:18:20] Unknown:
Yeah. Exactly. And sometimes it's the wrong tool because it's just really inefficient. Like, you're throwing, like, one of those massive dump trucks that's 3 stories tall at a problem that could be solved with a pickup. You're just wasting money in solving this problem. And sometimes you're just gonna get the wrong answers because you're just using a completely inappropriate solution for something.
[00:18:42] Unknown:
Yep. And digging a bit more into some of these challenges, we've talked about some of the technical challenges, the educational challenges. And in terms of applications of data science in organizations and in industry, what do you see as being the breakdown between the technical versus the organizational sources for those difficulties and challenges?
[00:19:05] Unknown:
One thing that's exciting is just that the technology is growing so quickly. I think we are starting to hit some of the base challenges with the Python language. Right? And we see this manifested through things like the work that Guido is doing at Microsoft to improve it, or the Pyston work around performance within the Python core. Then you have, you know, the work obviously we've done around Dask and working on improving speed that way. You have the different GPU efforts. So, there's a place where we're kind of approaching, with Python, at least some of the kind of core assumptions in the language that were great and are now starting to be limitations. But the good part is that the technology and the applications are still growing. And that's one of the reasons why we're starting to hit some of these limits.
Right now, I think we are moving very quickly, and I think we are struggling to keep up with how fast we're moving technically. So I don't see technology necessarily as the main problem. There's tons of space to innovate, and we will be able to innovate for a while, for a long while, in the data science space on the technology front. Organizationally is a challenge. So speaking as somebody who's employed data science teams, some large ones, at various companies, we still have a challenge around being able to train enough data scientists to bring people into the industry.
Right now, I mean, it's been years, and a PhD is still kind of the fundamental degree. Right? To train a software developer to be useful, like, if I was just building a web service, right, and I just want to build a Flask app or something, you can train somebody to do that, somebody with a reasonable kind of background and grounding, not necessarily in computer science or software development. You can train somebody to do that in a few months, and then they can grow and learn the field. Today, the entry degree is a master's, but still kind of a lot of PhDs.
That creates a problem. That means that the pipeline to hire sort of your data science team, that's multi year, past the undergraduate degree. So finding ways to bring more people into the field, to train more data scientists, is gonna be critical. And they're expensive, so organizations don't wanna hire that many of them. That's another challenge. Which leads to these tools that kind of try and do it for you, but maybe not in the best way. Because what they're feeling is, within the organization, there's a tremendous interest in using these tools, sometimes in using them in ways that they're not actually that useful, but they think they need them. Right? There's a lot of hype around the field.
It's hard to hire the practitioners, skilled practitioners, especially in such a new field. And so you end up with these kind of tools, you know, because you're trying to save money or you can't hire these people. I talk to my peers in other companies, and everybody's struggling to build their data engineering and data science teams. And so that leads organizations to make, I think, sometimes by necessity, somewhat unwise decisions. So I think that the challenges we have in the industry are a lot more organizational, but also kind of fundamental to having such a brand new, highly technical, highly advanced area where it is just hard to get people who know what they're doing in it. And that leads organizations to make bad choices.
So that's kind of where, if there's anything I worry about in the PyData ecosystem, beyond obviously people using it for ill intent, which is happening and has been happening for a while, it's that becoming this thing where there's a backlash against it. But that's also organizational. I see that more as an industry challenge and somewhat an academia challenge. Like, how do we do a better job kind of teaching those fundamental concepts without actually requiring somebody to go and get a PhD in computer science or civil engineering or statistics or something, you know, those kind of degrees that a lot of data scientists come into the field with.
[00:23:07] Unknown:
You mentioned that there are some educational resources that you have been building out at Anaconda to help with some of this educational component, of understanding how and where to appropriately apply different techniques and how to apply them most effectively. And then there's also the distribution aspects to make it easier for data scientists to get set up. But I'm wondering if you can just talk a bit more about sort of the holistic model that Anaconda is building to help data scientists to try and overcome some of these challenges that are facing them in industry and in academia?
[00:23:43] Unknown:
Our intent, obviously, one, is being part of a large open source ecosystem that we are very serious about and serious about supporting. Some of our education is just making people aware of what's going on in that ecosystem, especially in data science, making them aware of the different packages that they could be using, things like that. Because we have a lot of people coming to us as students, we can see what they're sort of interested in. We talk to them and they come to our conference or we meet them at conferences. And what we're hearing is, I'm trying to get into this field. So creating this platform is to help people either continue to stay current or get into the field, and to also make them aware of the whole panoply of choices that there are to help them do their work, and help them figure out what the best one for their problem set is. So we're not specifically trying to build a pipeline of data scientists.
What we're trying to do is augment the existing pipelines and give more resources to people in their journey. There are great things in the world outside of just academia that exist to help people upskill or help people learn in the field. And we're just adding to that and augmenting that and trying to do our part to support the open source PyData ecosystem as part of that.
[00:25:17] Unknown:
Another thing worth digging into that you mentioned earlier is this question of security and governance of the code that's being used for these data analysis workflows, particularly because of the risk of bias and because of the risk of the sort of algorithmic oppression that can happen if data is misapplied, and just the inconsistencies and sort of inequities in the data sources that we're working with. I'm wondering if you can just talk through some of the motivating factors and the concerns in industry and the impact that appropriate levels of security control and governance of the packages being used in these workflows can have.
[00:25:49] Unknown:
Yeah. What we've seen in the last year, supply chain attacks are nothing new, but we've definitely seen an escalation in them, especially given open source's centrality to industry, not just in the Python world, but across multiple languages. So a lot of the hacks, well known hacks, that have come out over the last couple of years have been supply chain vulnerability attacks, not just brute force getting through or social engineering getting into your environment, but actually putting Bitcoin mining into open source projects or things like that. Companies are extremely worried about this.
Not just companies, but countries, right? So, you've seen legislation in the United States now requiring a software bill of materials as part of these things. That's moving into law. That's going to affect all sorts of software companies and companies that work within the United States, which is frankly all of them. Even if they're not based in the US, they're supplying software to US companies. That's one of the places where government is stepping in because they don't believe the industry has done enough to police itself. And we probably haven't, to be honest.
So, companies are extremely concerned about this. And this is where core Python has been working to also look at signing packages and signature verification and things like that. You have that in other languages today. There are just concerns around these things and knowing the provenance of the code you're using. Then you have other problems, a little bit more traditional, things like license leakage. So I've MIT licensed my package, but I've copied significant portions of code, or one of the contributors to my project has copied code, from a GNU licensed package. And I don't realize this, but my project has just become GNU licensed, no matter what I claim. Right? Based on the license. And companies are extremely concerned about things like that. So it's certainly supply chain vulnerability, but it's also things like license leakage. You see more and more tools for code scanning. We're doing some of that as well.
It's just to try to ameliorate that problem. Anybody working in the open source world and trying to have their things used by commercial vendors has to be thinking about this or has to be concerned about it. And it is a little bit unusual for the open source world because that's not what we tend to think about when we're writing some code to share. You know, we have this cool idea and we want others to be able to use it and take advantage of it. That's not necessarily our first concern. And you could say, well, I think that might hurt the open source ecosystem if we required every open source developer to think about having to satisfy the requirements of a CSO at a multinational bank. I think that would really kinda rob the open source ecosystem of a lot of the real cool things coming out of it. So having companies that work with the open source ecosystem try and support it by adding that is important.
[00:28:56] Unknown:
Another interesting thing to dig into is that now that you have been at Anaconda for about a year and you're steeping yourself in the challenges of data scientists and the Python community and these various elements of working in and with open source and its interactions with corporations and government, I'm wondering what are your current priorities and focus for the near to medium term of Anaconda to help address these large and growing concerns that exist in the industry and in the ecosystem.
[00:29:32] Unknown:
So 1 of the reasons why Anaconda exists in a lot of ways, like, if you look at the projects that we originated at Anaconda or projects where we're still either maintainers or sole maintainers: a lot of our work is around performance, the performance of Python, specifically in the PyData ecosystem. So that's where you see things like Numba and Dask and things like that. It is around making it easier to work with your data. So that's where things like Panel or Datashader come in, those kinds of projects that we did, or back in the day our work with the Jupyter creators and supporting the Jupyter ecosystem.
We're trying to make it easier to work within the PyData ecosystem, and then we're also very focused around performance. It's 1 of the reasons why we work with companies like Intel or IBM or NVIDIA or others to work on making sure that you can use Python on the latest hardware or high performance computing environments. So, those are the things that we've traditionally looked at. The security aspect is something that we've been focused on a lot more because we were building our own distribution and we were building it in a controlled environment.
That was something we kind of naturally had already without even really thinking about it or making it a priority. But over the last, certainly couple of years, the security aspect and supply chain vulnerability has become a lot more, one, I think relevant, but then two, something that we're in a really unique position to support. And so that's been, I think, an important part of what we've been newly doing over the last couple of years. I think the other thing that we're also starting to look at: we've been really focused in this PyData ecosystem because it's where we come from with the projects that Peter and Travis originally built, these open source projects that became Anaconda.
But we've been really kind of in this PyData world. And now we're looking a little bit more at how we interface with the larger Python ecosystem, and to take some of the things that we've done and see if we can contribute to that larger ecosystem, not just in our space. And as well, to be a little bit more active with some of the other folks in the PyData ecosystem that certainly we've been supporting or working with, but not as directly. So, I think we're looking out a little bit more into the ecosystem to see where we can help given the expertise that we've built over years. And I think we're also looking at that aspect of companies that really are using Python and are concerned about these things to the point where maybe they would start moving off of Python to other things where they would be less worried about vulnerabilities, and whether we can give them a solution that makes them happy to continue to use these tools.
[00:32:34] Unknown:
Keying off of that a bit, Python has been a very dominant force in the data ecosystem and the analytics space for a number of years now. And to some extent, depending on where you're standing, has edged out R in a number of places as sort of 1 of the leading tools for doing statistical analysis. Then there's also the Julia project and the Julia language that has been gaining a lot of ground recently, and, you know, there are plenty of blog posts out there saying, oh, this might be the next Python killer. But with the advent of things like Julia and with the specialization of different, you know, new languages that are emerging and the improvements and evolution of packaging and deployment capabilities that are brought on by things like Docker and Kubernetes.
I'm wondering what you see as the future of the overall space of machine learning and data science, whether you think that it will continue to be largely Python focused or if you think that there is more of an opportunity and more of a direction towards a polyglot approach to data and analytics.
[00:33:39] Unknown:
Maybe for the Python podcast, I'm gonna say something that's gonna get me in trouble, but I very much come from a polyglot perspective. And maybe, again, I started my career before Python existed. I've worked in many, many different languages. I think the fact of the matter is that as the use cases continue to grow, as there's more and more sort of interesting verticals or new sets of problems, I don't think every language maps to every problem. I think languages are at their best when they're really opinionated and focused around a set of problems.
Like I said, I use lots of languages. I started using Python. And 1 of the reasons I continue to use Python for a lot of my problems is a lot of my projects these days are Python problems, are problems that Python itself is exceptionally good at solving. There are other problems that are C++ problems. And if I was working more on those, I'd probably be doing a lot more C++ and a lot less Python. It's just kind of where my own personal interests have led me over the last few years. I've been a lot more in this data world. If I was still doing computer graphics, I'd be writing C++. I probably wouldn't be writing Python. I might create Python APIs to, like, generate things for myself, but probably because C++ is not the right solution for that. So I believe, we are big supporters of R as well, have been for a long time.
It is part of the space. We are supporters of Julia. As a company that's been primarily focused in the PyData space, yeah, we're always gonna support whatever's in use in the community, and I don't think that Python will solve every problem. I get interested, me personally, and I'm not speaking for the company. I'm saying me personally because I also have had experience in the past building domain specific languages for specific sets of problems. I always wonder whether the chemists, or folks like the Astropy people, eventually may come up with DSLs, and maybe that runs in a Python ecosystem or maybe it runs in a kind of more native ecosystem, something closer to the hardware, to sort of support their own specific use cases.
And maybe that may be where we go, where maybe we have a Python or Julia or who knows, Go or something like that, Rust kind of base. And on top of that, the folks doing particle physics have their own domain specific language to solve their own problems built on top of that or built on top of Python. That I think gets really interesting as we grow. So maybe we don't end up with, oh, okay, Julia replaced Python, and then something's gonna replace Julia. I don't think that's really realistic. PHP is still being used. JavaScript is still being used. Visual Basic, I'm sure still has millions of developers using it. That's not kind of the way things work. We'll find solutions that Python's not great at, and we'll come up with either a new language that supports those use cases and works well with Python or a new language and maybe it'll do certain things better than Python, people will start using it for that and then they'll try and make it more like Python and ruin that language.
That's the other thing, is every language gets ruined as it tries to become the 1 solution to show you don't need every other language. You know, we've seen that happen kind of over and over again. I worry about that for Python a little bit. We tend to kind of ruin Python by trying to make it solve Java problems when Java is good at solving Java problems. Java is bad at solving Python problems, but they try and make Java solve Python problems too, which makes Java worse. Right? So, yeah, I'm very much a polyglot person.
[00:37:12] Unknown:
And continuing on that, a big part of the reason why Python gained so much dominance is not so much because everything that was being done in data was actually Python under the hood. It was actually just that Python was the best available option for gluing together all of these different workflows, where you build these bindings on top of C++ and Fortran, and then you glue everything together that way, and so it becomes sort of your interchange format. And I'm wondering what you see as some of the potential for things like the Arrow project to be a more data native interchange format that is language agnostic and some of the impact that that might have in the overall space of data and analytics.
[00:37:57] Unknown:
It's a really interesting project. In some ways, you were describing how Python got here. Yeah, you can work with Fortran or different things under the hood. I was gonna say, oh, well, you know, in some ways, you can call Python actually kind of an orchestration thing. Right? But Arrow is kind of the meta, you know, almost a meta orchestration thing. Talking about organization versus kind of practitioner challenges, 1 of the things that is becoming more of a problem, and Arrow solves this, but Arrow creates its own problems around this, is: where does data science and the data scientist's work end?
And where does data engineering pick it up from them? Where does DevOps kind of come into it? Because this is actually 1 of the things that becomes problematic. Arrow makes this easier. It also makes it harder. So we have to talk a little bit about, okay. Well, I'm a data scientist. I'm working in Python on my machine or on a cluster in an environment, you know, set up for me by my company where I have access to hardware or whatever, or in the cloud. I'm solving a problem, which is great. Now I'm a data engineer working in the same PyData ecosystem, but now I'm thinking maybe I'm doing ETL pipelines.
I may be taking a model that came from a data scientist and applying it in a production type environment, or I'm an analyst and I'm taking stuff out of the data lake and trying to do some either 1 off or daily jobs or continuous jobs. Right? We've talked about this a little bit, but this is where everything kind of gets more complicated and things start falling apart. And this is actually 1 of the kind of current challenges. At Anaconda, we work with the folks who are trying to solve this. You know, things like Dask or whatever are trying to help solve little parts of it, but we're not looking at trying to solve the bigger problem here. Arrow tries to solve the bigger problem.
I think it'll be useful and valuable, but I think we have to figure out what we expect people in these different roles to be able to do as part of their daily work. I think 1 of the challenges that we see in data science, and for practitioners, is they're not DevOps people. They don't wanna be DevOps people. They wanna be data scientists. They wanna be figuring out how to solve these big problems. Like, the number 1 complaint about data science is that 80 to 90% of data science is data preparation, which is the, like, least fun bit of it. Like, if all my work is data cleaning, very little of my work is actually doing data science, and take that to the next level of, okay, I'm trying to orchestrate all these different solutions, and now I gotta figure out, like, Terraform or something. Like, you know, this is where we have to think a little bit more about what the data roles on the team are. So you brought up a technology. I think it's a really interesting and valuable technology, but it brings me back a little bit to the people problem and sort of the skills problem.
[00:40:55] Unknown:
Yeah. Absolutely agree on the blurry lines between all of these different roles, and they just keep getting blurrier as the different tools gain new capabilities where it used to be, you know, the data engineer did the ETL jobs, and then the data scientist picked up from there. And now the data scientist has tools that, you know, automate some of the data prep or the data engineer has some tools that will let them do AutoML. And so, you know, where do you draw those lines and who is responsible for which parts of it? So definitely a big problem. Yeah. We're still figuring out what the best practices are, and that's hard when the industry itself is moving so quickly.
[00:41:31] Unknown:
We think about these cool things like Arrow, like all these cool tools. But the fact of the matter is most companies, probably, like, 85% of companies doing kind of advanced analytics, are still doing it with, you know, 800 line SQL queries written in Tableau or something. Right? Like, they're not even at a level where they're ready to take on this kind of stuff. There's still a lot of things we have to figure out there.
[00:41:49] Unknown:
To your point about this sort of need for so many different roles, there's a book that I'll call out. I did an interview with the author, Jesse Anderson. It's just called Data Teams. He's done a whole bunch of different interviews with organizations of different sizes to figure out what is the actual optimal organizational structure for data teams to be able to support all these different roles, and he broke it down into, you know, basically those 3 different teams: you need the operator to be able to do all the DevOps and get the infrastructure set up. You need the data engineer to manage the pipelines. You need the data scientist to be able to focus on the data science and not have to worry about those other 2 things. And, you know, figuring out where do you actually start building that team? Do you hire on the data scientists and make them do the data engineering and hate their job? Or do you hire on the operator and, you know, overstretch them by having to do all the data engineering too? So that's the challenge too, is figuring out who do you hire first.
[00:42:47] Unknown:
I've inherited those teams, and I've built those teams. I haven't read that book. I gotta go read that book. But I absolutely agree with that premise. You absolutely need all 3 of those roles. The way we do it is we have data infrastructure, that's the operators. We have data engineering that builds the pipelines and maintains the pipelines. And then, yeah, we have data science that is building the models and those kinds of things. That's been the way I've done it at multiple companies now. But a lot of companies will start with a lone data scientist who ends up being a data engineer until they eventually quit.
Or data engineers that can build the stuff but can't really build the models. Maybe they want to, but don't necessarily have the background to be able to do it. Yeah. And like I said, you have a lot of teams that are just analysts who figured out how to either build some sort of ETL pipeline or have a data engineer build a pipeline, but then don't have that operator who can actually help them do things in a kind of repeatable way, and have these very shaky kind of infrastructures that are incredibly error prone and very fragile.
[00:43:53] Unknown:
And so bringing it back around to Anaconda and some of the work that you're doing there and you work with the community, what are some of the most interesting or innovative or unexpected ways that you've seen the different tools and platforms and systems that Anaconda is responsible for used in the broader community?
[00:44:10] Unknown:
1 of the things that should have been obvious, but it was super eye opening and incredibly exciting for me at Anaconda: having been somebody that's spent nearly 30 years in the software industry, right, I've been working, as I said, with data science, but I've been working with software companies. Right? So, you know, Spotify, you know, music recommendation systems or personalization systems, marketplace stuff. That's kinda what I've done with it. And at Adobe, you know, computer vision or those kind of applications. And what I hadn't realized, actually, until I announced I was coming to Anaconda, was all my friends, you know, I went to an engineering school for my degree, and I'm friends with chemists, I'm friends with physicists I went to school with, who were all super excited about me going to Anaconda because they had packages in the Anaconda distribution for chemistry or for physics.
And that's actually been 1 of the things that's been really exciting to me about Anaconda. You know, because I was kind of in this world of these kinds of software world use cases, I hadn't really thought that much about it. It didn't really occur to me just how pervasive these techniques are in the larger kind of science world. So that's where I get really excited, is when I see people using it, you know, to map genes and to work on, you know, new drugs. Or, you know, the folks at the drug companies that built vaccines for coronavirus, you know, are using these tools.
Physicists trying to map the universe are using these tools. Chemists trying to figure out the way molecules interact are using these tools. That's the thing that actually just gets me so excited seeing psychiatrists using these tools, social scientists using these tools. There's all these different ways that people use these tools that I, you know, like I said, I was aware of, you know, you didn't really see outside of maybe, you know, cool New York Times visualizations or those kinds of things. You know, like, oh, yeah. That's cool. They're absolutely doing this stuff. But to actually see that every week show up in our team demos where we're working with some customer that's building insulation for buildings and using the advanced data science tools.
There's a data science team at that company doing really cool stuff and getting to see what they do is fascinating.
[00:46:46] Unknown:
Yeah. Absolutely. It's funny the number of software engineers that I've spoken with who started off as physicists. Which is actually what I originally started to go to school for before I ended up switching my head and going back for computer engineering.
[00:46:58] Unknown:
I think 1 of the other things I like specifically about working at Anaconda as a company is, obviously, a lot of people at Anaconda come out of that open source world. And a lot of the people who are contributing to the open source world are grad students in biology or chemistry or whatever. And so a lot of the software developers at Anaconda have PhDs or even were working professionals in other fields. That's amazingly cool, to like work with people who did do their PhD in particle physics or did do their PhD in computational biology. You have a chat in Slack, you're just talking about something and they just happen to know all these incredibly interesting kinda details about the subject because it just so happens that they were doing that research for a number of years.
That's 1 of the things I just appreciate about kind of being part of this, being really enmeshed in this PyData world.
[00:47:59] Unknown:
It's worth calling out that specifically the data ecosystem and the use of Python in data really brings in a lot more diversity of backgrounds and ethnicities, and puts me in mind of a story that Jacob shared at a PyCon 1 year who, for people who don't know, is 1 of the cocreators of Django and has been on the security team at Heroku for a number of years, and I forget where he is now, but he was mentioning some work that he had done with a young woman who had been doing research in, I think it was geology, and had, you know, built this interesting model and built a website to be able to interact with the work that she had done and then saying, oh, but I'm not a software engineer. I'm not a programmer. You built this amazing thing that I couldn't have thought of doing in months. And because of where you're coming from, you were just trying to solve a problem. You weren't trying to become a programmer, and so they don't identify as a programmer. They just identify as somebody who solves a problem. And so being able to empower those people to solve problems without having to go through the rigor of, you know, all of the accoutrements of software engineering is amazing.
[00:48:58] Unknown:
Yeah. Absolutely. That's absolutely true.
[00:48:59] Unknown:
So in terms of your experience of working at Anaconda and helping to drive the company forward and set their technology strategy, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:49:14] Unknown:
1 of the things that's been interesting to me and 1 of the things that's challenging is really on the open source part. So I've been part of teams that have contributed to open source. You know, they certainly benefited from the open source ecosystem, contributed to the open source ecosystem in certain ways, but were not what I would call open source companies. So I think 1 of the things that's been really challenging or interesting has just been, you know, Anaconda, we're an open source company. Right? To be honest, this was 1 of the things that attracted me to the company: we are unapologetically, incredibly seriously part of open source. We care about open source, we care about contributors to open source, we care about the open source community.
And so it's about being good stewards and trying to be good stewards. And I think, you know, we've had challenges where we haven't done our best or we haven't done what you would expect. You know, when I was at a company like Adobe or a company like Spotify, and, you know, a developer would wanna put something they're working on into open source, 1 of the things I would tell them is if you're gonna do that, you have to be serious about it. You can't just throw it out and say, hey, here it is, and then ignore it for a while. If you think of just the sheer number of projects that Anaconda has contributed into the PyData ecosystem, most of those, at this point, we've moved to community governance.
We are maintainers on a vanishingly small percentage of the projects that we originated or were part of the originating group for, because we knew we couldn't support everything all the time. And there's a point where it needed to move into community governance. That alone is, like, incredible, coming out of a very much commercial software world, trying to make that decision of, oh, okay, well, here's this thing we built, we invested in, we put a lot of energy in, and now we're just gonna give it away and we're not even gonna own it anymore, everyone else is gonna own it. And they may take it in this direction. We generated all this intellectual property that we have 100% given away. That's been an interesting part of being an open source company. And then also looking at the things where, you know, we haven't done our job as stewards and understanding, oh, okay, this is why we do it: because if we can't support this in the way people need it to be supported, they'll stop using it. And that doesn't benefit anybody. Then we've just put a lot of energy and investment into something and didn't take it seriously and people stopped using it. So, that's been a big kind of lesson and challenge for us. We want to continue to contribute and be good members of the open source community, but we have to do it in a really smart way. And we have to know when it's time to say, you know what, everybody loves this thing. It's probably better for everybody to own it instead of for us to be the ones driving it moving forward. And that's going to be the best thing for the community.
So that's been sort of a really interesting thing to do in a venture backed company because it's very different from any other venture backed company I've ever been in.
[00:52:19] Unknown:
Are there any other aspects of the work that you're doing at Anaconda or the challenges that are facing the data science and analytics ecosystem or the overall space of Python in data that we didn't discuss yet that you'd like to cover before we close out the show?
[00:52:33] Unknown:
I'm just gonna shout out to some of the stuff that we've been working on that I wanna make sure, or that hopefully, a lot of people are aware of, but there's been new releases. There was a new RC of Numba that just went out, a new release of Panel. I think Datashader went out again not too long ago. You know, Numba is the only 1, I think, where we're still sole maintainers or primary maintainers. Everywhere else, we're just part of the maintenance group. But I'm so excited to see some of these things and see them continue to evolve. So, you know, Dask is another thing. So I'll put a shout out for those.
If you haven't looked at those in your own practice, and if you're not an Anaconda user, those are all available through, you know, PyPI as well. Right? But they're really cool technologies that make doing complicated stuff easy, so worth taking a look at.
[00:53:22] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And this week, I'm going to choose a group of novels by the author China Miéville. So the 3 novels are actually an anti-trilogy in that they're all set in the same world, but not actually related in terms of their storyline. An interesting take on the idea of sort of a trilogy. So the Bas-Lag novels is what it's called. So it's Perdido Street Station, The Scar, and Iron Council. So definitely recommend reading those if you're looking for something to do at any point. And so with that, I'll pass it to you, Kevin. Do you have any picks this week?
[00:54:05] Unknown:
My pick is I got the Lego typewriter for Father's Day, and it is really cool. That does sound very cool. It's pricey, but as the advanced LEGO models go, it's not ridiculously pricey.
[00:54:21] Unknown:
It is very, very cool. Is it actually a functioning typewriter?
[00:54:24] Unknown:
If they actually gave you ink, it kind of almost could be. That's part of the thing that's really cool, is just how they've actually been able to emulate and mimic the motions, the mechanisms of a typewriter in Lego. It's awesome.
[00:54:43] Unknown:
That's very cool. Alright. Well, thank you very much for taking the time today to join me and share your perspectives as the CTO at Anaconda and the window that that gives you on the overall data science ecosystem and communities. So thank you for all the time and energy that you're putting into that, and I hope you enjoy the rest of your day. I appreciate it. Thank you. Thank you so much for having me. We really appreciate you having me on. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Guest Introduction
Kevin Goldsmith's Journey with Python
Overview of Anaconda's Work and Products
Challenges in Data Science
Technical vs Organizational Challenges in Data Science
Anaconda's Role in Supporting Data Scientists
Security and Governance in Data Science
Anaconda's Current Priorities and Focus
Future of Machine Learning and Data Science
Role of Arrow in Data Science
Innovative Uses of Anaconda's Tools
Lessons Learned at Anaconda
Recent Releases and Final Thoughts