Growing And Supporting The Data Science Community At Anaconda

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode.

With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode

Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host as usual is Tobias Macey. And today, I'm interviewing Kevin Goldsmith about his work at Anaconda and their contributions to the Python ecosystem for data science. So Kevin, can you start by introducing yourself?

So delighted to be on Podcast.__init__. My name is Kevin Goldsmith. I am the chief technology officer at Anaconda.

And do you remember how you first got introduced to Python?

Yeah. It's funny. I remember when Python was created. At the time, I was, you know, very much a C and C++ developer and thought, oh, okay, well, that's cute. And I promptly ignored it for a number of years until I was at Adobe, and Adobe had built a sort of meta build system, a build system in Python that would generate makefiles. I had to do something for the product I was working on, and that was where I first really had to learn Python syntax, just to make some changes there. And then I was still doing C++ for a while.

I really started learning Python a few years later. I was at Spotify, and I was leading the consumer engineering team there. Obviously, Spotify is one of the world's largest data companies as well as being a music company, and we were starting to do a lot more work with those folks in my team. I got much more interested in the data ecosystem and got an O'Reilly book. And, you know, step one was: go install Anaconda. So I did, and I started learning Python and using it for my own projects there, and basically have ever since. I've worked in lots of languages: C++ and C and Java and Lisp and Scheme and C#. I worked at Microsoft for a number of years, so C#. So I've worked in lots and lots of different languages over the years, but for pretty much the last seven or eight years, I've been 100% Python. My own projects, or anything I'm coding on, are almost always in Python these days.

JavaScript too, but only for front-end stuff.

Yeah. It's both amazing and, at least for myself, somewhat unfortunate that there is no escaping JavaScript.

Yeah.

You know, I also remember when JavaScript was created. Right.

And it's amazing to me how that language has developed over the years and how much more capable it is now. But, yeah, every language has its idiosyncrasies, and every language, I think, ends up becoming every other language eventually. I was lucky enough that I started with JavaScript at the very beginning, because you needed to. By the time I started working in Python, it was actually a fairly comprehensive ecosystem, I think, which helped me in my own adoption of it. If I'd started using it when it was first created, I don't know if I would have found it as useful.

It's interesting too, just in the language design sense, how much languages can evolve and yet still be very reflective of their core foundational elements. Python has evolved immensely from where it started, but because it was built on very strong conceptual foundations, a lot of what's actually present in the language is really just syntactic sugar over the skeleton of the actual VM and the language syntax. JavaScript has evolved a lot too, and it's added things like classes, but it's still very much, under the covers, just... like, there are a lot of great elements to JavaScript, and I don't want to be negative about it, but when you really dig under the covers, there are a lot of bits of it that, especially coming from a Python background, just make no sense.

Right. Yeah. Of course.

And they really just show that it was created, you know, in a frenzy because somebody needed a language to compete for their browser.

There are a lot of languages with huge amounts of use that were really somebody's quick hack to get something working and became entire industries unto themselves. Python has the benefit of actually having a lot of thought put into it originally, which helps a lot.

Yeah. One of my favorite phrases, which I'm probably misquoting, is that any hack that is insufficiently broken will remain in production forever.

Yes. Yeah. That's absolutely right.

And so now you have ended up at Anaconda as the CTO. So I'm wondering if you can just give a bit of an overview, for people who aren't familiar, of what it is that Anaconda is doing, the core focus of the business, and your particular path into that CTO position?

So, Anaconda. People tend to think about us primarily for our distribution of Python. We have a distribution of Python that north of 25,000,000 people use every day. A significant percentage of those who work in the PyData ecosystem use our distribution of Python. It is built by us. It supports hardware targets that core Python, native Python, doesn't support. We work with a lot of hardware vendors to enable Python to work on some less well-known systems that are really critical for data folks, in industry more so, but a lot of students use that distribution as well, because it also greatly simplifies some of the challenges you can have with managing packages and package collisions and those kinds of things.

So a big part of this is that we curate the packages that we ship. As part of our Python distribution, we provide a large set of critical core PyData packages, again, that we've built, so that you get all of that in your root Python environment. We also have a solver that we work very hard on to make sure that anything you install will install the appropriate versions of the other packages it depends on. So we do a lot of work to make sure that you don't break your own Python environment.

For a long time, we were the easiest way to get Python onto a lot of systems, and that, I think, helped a lot. It was very hard to get Python onto Windows for a long time, for example, and Anaconda was one of the best ways to do that. So that's pretty much what we're primarily known for: what we call Individual Edition, the free, open source Python distribution that we have.

In addition to that, though, we have a commercial version, which we call Commercial Edition. That version is really designed for companies that use Python for their business. It adds a level of security beyond just the fact that we've built all those packages. One of the things companies are very concerned about is security issues, ranging from where their things are being built to what vulnerabilities are being inserted into their supply chain. So one of the things that we've added as part of Commercial Edition is, I believe, and I may be wrong, but I'm fairly certain, the first signed packages, or signature verification on packages. We build packages, we sign them, and you can verify the signatures on those packages. We shipped that earlier this year. We also do curated vulnerability information. We take the NIST and other vulnerability streams and actually curate them and clean them up, because there's a lot of noise in them. That's another thing companies really appreciate if they're concerned about security around their languages.

So that's the Commercial Edition. It is also, obviously, licensed for commercial use, which is another important aspect of it. Beyond that, we have a few other products, one being what we call Team Edition. Team Edition is actually a packaging server, so it's really meant, again, for companies. It allows companies to use our server to distribute Python packages, or actually more than Python packages; it's primarily used for Python, but we support all languages. That way your company can control which packages are in use within the company. You can have your local cache. It supports things like being air-gapped. So if you have a highly secure environment and you're very concerned about vulnerabilities coming in from outside, this lets developers within your company have access to packages that have been scanned and cleaned, or whatever else, so you have a lot more governance.

There's another product called Enterprise Edition, which, beyond the repository capabilities, adds workflow capabilities for enterprise development teams.

Beyond those, something that we launched late last year and that's been growing very quickly is what we call Nucleus, which is actually a community, and it's where we have a large collection of information around Python and the PyData ecosystem. We did our virtual AnacondaCON the last two years, last year and this year, and we now have something that is basically AnacondaCON on that platform. It's a free platform for people to engage and learn more about the PyData ecosystem and Python. So that's the set of things

that people are using from Anaconda today.

As for my own path to CTO, I mentioned I worked at Microsoft. I worked at Adobe, in computer graphics, and then spent some time doing grid computing. I spent some time doing media encoding for distribution, which eventually led me to Spotify. I was at Spotify for a while in Sweden, then came back to the US and did legal tech. Data science and machine learning were obviously impacting the industry during this time, and every company I've been at since Spotify has been, in some ways, a machine learning company.

So after Spotify, I was at a legal tech platform called Avvo, where we were driven a huge amount by a very large data analytics team and a very large data science team. From there I went to an edtech company, again built on data science, and then to an identity verification company, Onfido, in the UK, where we were a data science platform: we were using computer vision and machine learning to do ID verification. So even though I was working in different industries, that became a critical and central part of my work. I was leading large data science teams and data engineering teams, which then led me very naturally to Anaconda, where I joined last year.

So I've been a CTO for about the last five years, ever since I left Spotify. At Spotify, I was a VP of engineering. Prior to that, at Adobe, I was a director of engineering. So that's how I ended up in the CTO role at Anaconda.

From your perspective as the CTO of Anaconda, a company that is well known in the data science community for its many contributions to the open source ecosystem, both in the form of the conda platform and the Anaconda distribution of Python and in the variety of open source projects that are either created or fostered by members of the team there, what do you see as being the biggest challenges that data scientists are facing today, whether in terms of the technology, the organizational aspects, or the level of sophistication that's necessary in general? From your perspective as the CTO of a business that is so closely enmeshed in the ecosystem and the community, what are some of the problems that you are seeing on a day-to-day basis?

So I think one of the things that's fascinating,

specifically about the PyData ecosystem or the data science world, is just the speed at which it is maturing. Right? While we've had artificial intelligence for quite a long time, it was always a very nascent kind of future. I worked on AI projects in the early nineties, but what we did then and what's done now are so radically different. This industry and this area have grown so quickly. We continue to find not only new applications of this technology, but also new ways of doing things. I think that's one of the biggest challenges to practitioners and to data scientists: just the continuous emergence of new approaches.

And one thing I also find really interesting, talking to folks coming out of graduate programs today, as somebody who's been in the world for a long time, is that we're actually losing a lot of the classical approaches. Because everything is moving so quickly, everyone's adopting or learning the new things, and they don't know, or have forgotten about, some of the classical approaches. So I think that's one of the biggest challenges for practitioners: just being aware of all the new things that are coming, all the new ways of addressing problems. This is even beyond the computer industry's own normal, very fast progression. So you have that meta thing, but then you have this new field that is itself so critical to everything that's being done, and it's moving very, very quickly. So staying aware and staying current is one of the biggest challenges for practitioners today.

That's a challenge for the people working in the field. The other thing that I think is a big challenge today is that we are now starting to learn the consequences of some of the approaches that we've been using, and we don't have good solutions for these things. That ranges from the simple stuff, like good practices around confidence modeling and drift detection, to being able to have more observability into the models that we're building and how they're operating, which leads to things like bias. Bias is an absolutely critical problem in the industry that we've become aware of and that we don't have great solutions for. And that is also a danger for us, in that it may erode confidence in data science itself, which would obviously be a problem. But I think one of the things that's interesting

around this as well is that it's a positive thing. You have these great tools coming out to democratize data science and make it accessible to people, which I think is a really interesting and invaluable contribution. But at the same time, it's making these tools available to people who don't have the fundamental grounding in statistics or numerical theory or these kinds of things, so we're giving tools to people who don't necessarily understand how to use them well. And that, I think, can also be a challenge to the industry. Making it available to more people is wonderful, but it's making it available in ways that could also take these unintended consequences, these unintended uses, and actually accentuate them. So that's a bit of a challenge.

I think we need to find the right mixture. There are more things to do with data science than we have people to do them, and that's going to be true for a while, for really advanced practitioners and across so many different domains, so we need something to simplify the work and make it easier and more accessible. But at the same time, we have to figure out how to do that in ways that also educate people about some of the challenges around this. If you just have a model generated for you, throw some data at something, get a model, and then start applying that model without really understanding what's happening under the hood, that could be very dangerous as well.

Yeah, I definitely agree on those points. The metaphor that came to mind as you were describing that was handing a hammer to a child: they might build a birdhouse, or they might smash their thumb, or they might break the window. So there are a lot of things that can go right and a lot of things that can go wrong.

Absolutely. Yeah. You're giving extremely powerful tools to people and telling them, oh, this makes everything easy, but this is really hard. Right? And the question is, how is that going to get used?

Yeah. And another thing worth calling out combines your point about people coming out of school without a thorough grounding in the fundamentals of data analytics and data science, the issues around bias and explainability, and the immense escalation of deep learning in terms of the power and capabilities that we can leverage it for, without always having the most judicious application of it. The first thing that you reach for now is deep learning, because it has been proven effective in so many different ways, and then you don't fully understand what you're getting out the other side, because you don't have that foundational grounding, or because you didn't first do the modeling in a more statistically oriented fashion and just went straight to the latest shiny thing coming out of PyTorch or TensorFlow.

Yeah. Exactly. And sometimes it's the wrong tool because it's just really inefficient. You're throwing one of those massive dump trucks that's three stories tall at a problem that could be solved with a pickup, and you're just wasting money solving this problem. And sometimes you're going to get the wrong answers because you're using a completely inappropriate solution for something.

Yep. And digging a bit more into some of these challenges: we've talked about some of the technical challenges and the educational challenges. In terms of applications of data science in organizations and in industry, what do you see as being the breakdown between the technical and the organizational sources of those difficulties and challenges?

One thing that's exciting is just that the technology is growing so quickly. I think we are starting to hit some of the base challenges with the Python language. Right? We see this manifested in things like the work that Guido is doing at Microsoft to improve it, or the Pyston work around performance within the Python core. Then you have the work we've done around Dask, working on improving speed that way. You have the different GPU efforts. So there's a place where we're approaching, with Python, at least some of the core assumptions in the language that were great and are now starting to be limitations. But the good part is that the technology and the applications are still growing, and that's one of the reasons why we're starting to hit some of these limits. Right now, I think we are moving very quickly, and we are struggling to keep up with how fast we're moving technically. So I don't see technology as necessarily the main problem. There's tons of space to innovate, and we will be able to innovate for a long while in the data science space on the technology front.

Organizationally, it is a challenge. Speaking as somebody who's employed data science teams, some large ones, at various companies, we still have a challenge around being able to train enough data scientists to bring people into the industry. For years now, the PhD has still been kind of the fundamental degree. Right? Compare that with training a software developer to be useful. If I was just building a web service and I just wanted to build a Flask app or something, you can train somebody to do that in a few months, somebody with a reasonable background and grounding, not necessarily in computer science or software development, and then they can grow and learn the field. Today, the entry degree for data science is a master's, but there are still a lot of PhDs. That creates a problem. It means that the pipeline to hire your data science team is multiple years past the undergraduate degree. So finding ways to bring more people into the field, to train more data scientists, is going to be critical.

And they're expensive, so organizations don't want to hire that many of them. That's another challenge, which leads to these tools that try to do it for you, but maybe not in the best way. Within the organization, there's a tremendous interest in using these tools, sometimes in ways where they're not actually that useful, but people think they need them. Right? There's a lot of hype around the field. It's hard to hire skilled practitioners, especially in such a new field. And so you end up with these kinds of tools because you're trying to save money or you can't hire these people. I talk to my peers in other companies, and everybody's struggling to build their data engineering and data science teams. And that leads organizations to make, I think, sometimes by necessity, somewhat unwise decisions.

So I think the challenges we have in the industry are a lot more organizational, but also fundamental to having such a brand new, highly technical, highly advanced area, where it is just hard to get people who know what they're doing. And that leads organizations to make bad choices. So that's where my worry is, if there's anything I worry about in the PyData ecosystem beyond, obviously, people using it with ill intent, which is happening and has been happening for a while, and that becoming this thing where there's a backlash against it. But that's also organizational. I see that more as an industry challenge and somewhat an academia challenge: how do we do a better job of teaching those fundamental concepts without requiring somebody to go and get a PhD in computer science or civil engineering or statistics or something, those kinds of degrees that a lot of data scientists come into the field with?

You mentioned that there are

some educational resources that you have been building out at Anaconda to help with this educational component: understanding how and where to appropriately apply different techniques and how to apply them most effectively. And then there are also the distribution aspects, to make it easier for data scientists to get set up. But I'm wondering if you can talk a bit more about the holistic model that Anaconda is building to help data scientists overcome some of these challenges that are facing them in industry and in academia?

Our intent, obviously, is first to be part of a large open source ecosystem that we are very serious about supporting. Some of our education is just making people aware of what's going on in that ecosystem, especially in data science: making them aware of the different packages that they could be using, things like that. Because we have a lot of people coming to us as students, we can see what they're interested in. We talk to them when they come to our conference, or we meet them at other conferences, and what we're hearing is, "I'm trying to get into this field." So creating this platform is to help people either continue to stay current or get into the field, and also to make them aware of the whole panoply of choices there are to help them do their work, and help them figure out which one is best for their problem set. So we're not specifically trying to build a pipeline of data scientists. What we're trying to do is augment the existing pipelines and give more resources to people on their journey. There are great things in the world outside of academia that exist to help people upskill or learn in the field, and we're just adding to and augmenting that, and trying to do our part to support the open source PyData ecosystem as part of that.

Another thing worth digging into that you mentioned earlier is this question of security and governance of the code that's being used for these data analysis workflows

and particularly because of the risk of bias, the risk of the sort of algorithmic oppression that can happen if data is misapplied, and the inconsistencies and inequities in the data sources that we're working with. I'm wondering if you can talk through some of the motivating factors and the concerns in industry, and the impact that appropriate levels of security control and governance of the packages being used in these workflows can have.

Yeah. What we've seen in the last year: supply chain attacks are nothing new, but we've definitely seen an escalation in them, especially given open source's centrality to industry, not just in the Python world but across multiple languages. A lot of the well-known hacks that have come out over the last couple of years have been supply chain vulnerability attacks: not just brute-forcing or social-engineering a way into your environment, but actually putting Bitcoin mining into open source projects, or things like that. Companies are extremely worried about this. Not just companies, but countries, right? You've seen legislation in the United States now requiring a software bill of materials as part of these things. That's moving into law. That's going to affect all software companies and companies that work within the United States, which is frankly all of them; even if they're not based in the US, they're supplying software to US companies. Government is stepping in because they don't believe the industry has done enough to police itself. And we probably haven't, to be honest.

So companies are extremely concerned about this. And this is where core Python has also been working to look at signing packages and signature verification and things like that. You have that in other languages today. There are just concerns around these things, and around knowing the provenance of the code you're using. Then you have other problems, a little bit more traditional,

things like license leakage. Say I've MIT-licensed my package, but I've copied significant portions of code, or one of the contributors to my project has copied code, from a GNU-licensed package. I don't realize this, but my project has just become GNU-licensed, no matter what I claim, right? Based on the license. Companies are extremely concerned about things like that. So it's certainly supply chain vulnerability, but it's also things like license leakage. You see more and more tools for code scanning, and we're doing some of that as well, just to try to ameliorate that problem. Anybody working in the open source world and trying to have their things used by commercial vendors has to be thinking about this, or has to be concerned about it. And it is a little bit unusual for the open source world, because that's not what we tend to think about when we're writing some code to share. We have this cool idea and we want others to be able to use it and take advantage of it. That's not necessarily our first concern.

And I think it might hurt the open source ecosystem if we required every open source developer to think about having to satisfy the requirements of a CSO at a multinational bank. I think that would really rob the open source ecosystem of a lot of the really cool things coming out of it. So having companies that work with the open source ecosystem try to support it by adding that is important.

Another interesting thing to dig into: now that you have been at Anaconda for about a year and you're steeping yourself in the challenges of data scientists and the Python community, and these various elements of working in and with open source and its interactions with corporations and government, I'm wondering what your current priorities and focus are for the near to medium term of Anaconda, to help address these large and growing concerns that exist in the industry and in the ecosystem.

So one of the reasons why Anaconda exists, in a lot of ways: if you look at the projects that we originated at Anaconda, or projects where we're still either maintainers or sole maintainers, a lot of our work is around performance, the performance of Python, specifically in the PyData ecosystem. That's where you see things like Numba and Dask. It is also around making it easier to work with your data. That's where you see things like Panel or Datashader, those kinds of projects that we did, or, back in the day, our work with the Jupyter creators and supporting the Jupyter ecosystem. We're trying to make it easier to work within the PyData ecosystem, and we're also very focused on performance. It's one of the reasons why we work with companies like Intel or IBM or NVIDIA or others to make sure that you can use Python on the latest hardware or in high-performance computing environments. So those are the things that we've traditionally looked at.

The security aspect is something that we've been focused on a lot more. Because we were building our own distribution, and building it in a controlled environment, that was something we naturally had already, without even really thinking about it or making it a priority. But over the last couple of years, security and supply chain vulnerability have become, one, a lot more relevant, I think, and two, something that we're in a really unique position to support. So that's been an important part of what's new in what we've been doing over the last couple of years.

I think the other thing that we're also starting to look at: we've been really focused on the PyData ecosystem, because it's where we come from, with the projects that Peter and Travis originally built, these open source projects that became Anaconda. So we've really been in this PyData world. Now we're looking a little bit more at how we interface with the larger Python ecosystem, to take some of the things that we've done and see if we can contribute them to that larger ecosystem, not just in our space. We also want to be a little bit more active with some of the other folks in the PyData ecosystem that we've certainly been supporting or working with, but not as directly. So I think we're looking out a little bit more into the ecosystem to see where we can help, given the expertise that we've built over the years. And I think we're also looking at companies that really are using Python and are concerned about these things, to the point where maybe they would start moving off of Python to other things where they would be less worried about vulnerabilities, and whether we can give them a solution that makes them happy to continue to use

these tools. Keying off of that a bit, Python has been a very dominant force in the data ecosystem and the analytics space for a number of years now.

And to some extent, depending on where you're standing, has edged out r in a number of places as sort of 1 of the leading tools for doing statistical analysis.

Then there's also the Julia project and the Julia language that has been gaining a lot of ground recently, and, you know, there are plenty of blog posts out there saying, oh, this might be the next Python killer.

But with the

advent of things like Julia and with the specialization of different, you know, new languages that are emerging

and

the improvements and evolution

of packaging and deployment capabilities

that are brought on by things like Docker and Kubernetes,

I'm wondering what you see

as the future of the overall space of machine learning and data science, whether you think that it will continue to be

largely Python focused or if you think that there is more of an opportunity and more of a direction towards a polyglot approach to data and analytics.

Maybe for the Python podcast, I'm gonna say something that's gonna get me in trouble, but I very much come from a polyglot perspective.

And maybe, again, I started my career before Python existed. I've worked in many, many different languages.

I think the fact of the matter is that, as the use cases continue

to grow

as there's more and more sort of interesting

verticals

or new sets of problems. I don't think every language maps to every problem. I think

languages are at their best

when they're really

opinionated and focused around a set of problems.

Like I said, I use lots of languages. I started using Python. And one of the reasons I continue to use Python for a lot of my problems is a lot of my projects these days are Python problems,

are problems that Python itself is exceptionally good at solving. There are other problems that are C++ problems. And if I was working more on those,

I'd probably be doing a lot more C++ and a lot less Python. It's just kind of where my own personal interests have led me over the last few years. I've been a lot more in this data world. If I was still doing computer graphics, I'd be writing C++. I probably wouldn't be writing Python. I might create Python APIs

to, like, generate things for myself, but probably because C++ is not the right solution for that. So I believe,

we are big supporters of R as well, have been for a long time.

It is part of the space. We are supporters of Julia. As a company that's been primarily focused in the PyData space, yeah, we're always gonna support whatever's

in use in the community, and I don't think that Python will solve every problem. I get interested,

me personally, and I'm not speaking for the company. I'm saying me personally

because I also have had experience in the past building domain specific languages

for specific sets of problems. I always wonder whether the chemists,

folks doing

like the Astropy people

eventually may come up with DSLs and maybe that runs in a Python ecosystem or maybe it runs in a kind of more native ecosystem, something closer

to the hardware

to sort of support their own specific use cases.

And maybe that may be where we go, where maybe we have a Python or Julia

or who knows, Go or something like that, Rust

kind of base. And on top of that, the folks doing particle physics have their own domain specific language to solve their own problems

built on top of that or built on top of Python.

That I think gets really interesting

as we grow. So maybe we don't end up with, oh, okay, Julia replaced Python,

and then something's gonna replace Julia. I don't think that's really realistic. PHP is still being used. JavaScript is still being used.

Visual Basic, I'm sure, still has millions of developers

using it. That's not kind of the way things work. We'll find

solutions that Python's not great at, and we'll come up with either

a new language

that supports those use cases and works well with Python or

a new language and maybe it'll do certain things better than Python, people will start using it for that and then they'll try and make it more like Python and ruin that language.

That's the other thing is every language gets ruined as it tries to become the one solution

to show you don't need every other language.

You know, we've seen that happen kind of over and over again. I worry about that for Python a little bit. We tend to kind of ruin Python

by trying to make it solve Java problems when

Java is good at solving Java problems. Java is bad at solving Python problems, but they try and make Java solve Python problems too,

which makes Java worse. Right?

So, yeah, I'm very much a polyglot person.

And continuing on that, a big part of the reason why Python gained so much dominance is

not so much because

everything that was being done in data was actually Python under the hood. It was actually just that Python was

the best available option for gluing together all of these different workflows

where you build these bindings on top of C++ and Fortran, and then you glue everything together that way, and so it becomes sort of your interchange format.

And I'm wondering what you see as some of the

potential for things like the Arrow project to be a more data native interchange format that is language agnostic and some of the impact that that might have in the overall space of data and analytics.

That's a really interesting project. In some ways, as you were describing how Python, yeah, can work with Fortran or different things under the hood, I was gonna say, oh, well, you know, in some ways, you can call Python actually kind of an orchestration thing. Right? But Arrow is kind of the meta, you know, almost a meta-orchestration thing.

Talking about organization versus kind of practitioner challenges,

one of the things that is

becoming more of a problem, and Arrow solves this, but Arrow creates its own problems around this,

is, where does data science and the data scientist's work end? And where does data engineering pick it up from them?

Where does DevOps

kind of come into it?

Because this is actually one of the things that becomes

problematic. Arrow

makes this easier. It also makes it harder.

So we have to talk a little bit about, okay. Well, I'm a data scientist. I'm working in Python on

my machine or on a cluster in an environment, you know, set up for me by my company where I have access to hardware or whatever or in the cloud.

I'm solving a problem, which is great. Now I'm a data engineer

working in the same PyData ecosystem, but now I'm thinking maybe I'm doing ETL pipelines.

I may be taking a model that came from a data scientist and applying it in a production type environment,

or I'm an analyst and I'm taking stuff out of the data lake and trying to do either some one-off or

daily jobs or continuous jobs. Right? We haven't talked about this much, but this is where everything kind of gets more complicated and things start falling apart. And this is actually one of the kind of current challenges.

At Anaconda, we work with the folks who are trying to solve this, you know, things like Dask or whatever, are trying to help

solve little parts of it, but we're not looking at trying to solve the bigger problem here. Arrow tries to solve the bigger problem.

I think it'll be useful and valuable, but I think

we have to figure out what we expect people in these different roles to be able to do

as part of their daily work. I think one of the challenges that we see

in data science, and for practitioners,

is they're not DevOps people. They don't wanna be DevOps people. They wanna be data scientists. They wanna be figuring out how to solve these big problems. Like the number one complaint about data science is that 80 to 90% of data science is data preparation,

which is the, like, least fun

bit of it. Like, if all my work is data cleaning, very little of my work is actually doing data science,

and take that to the next level of, okay, I'm trying to orchestrate all these different solutions,

and now I gotta figure out, like, Terraform or something. Like,

you know, this is where we have to think a little bit more about what the data roles and the team

are. So

you brought up a technology. I think it's a really interesting and valuable technology, but it brings me back a little bit to the people problem and sort of the skills problem.

Yeah. Absolutely agree on the blurry lines between all of these different roles, and they just keep getting blurrier as the different tools gain new capabilities where it used to be, you know, the data engineer did the ETL jobs, and then the data scientist picked up from there. And now the data scientist has tools that, you know, automate some of the data prep or the data engineer has some tools that will let them do AutoML. And so, you know, where do you draw those lines and who is responsible for which parts of it? So definitely a big problem.

Yeah. We're still figuring out what the best practices are, and that's hard when the industry itself is moving so quickly.

We think about these cool things like Arrow,

like all these cool tools.

But the fact of the matter is most companies

most, probably, like, 85%

of companies doing kind of advanced analytics are still doing it with, you know, 800-line SQL queries

written in Tableau or something. Right? Like, they're not even at a level where they're ready to take on this kind of stuff. There's still a lot of things we have to figure out there.

To your point about this sort of need for so many different roles, there's a book that I'll call out. I did an interview with the author, Jesse Anderson. It's called Data Teams. He's done a whole bunch of different interviews with organizations of different sizes to figure out what is the actual optimal

organizational structure

for data teams to be able to support all these different roles, and he broke it down into, you know, basically those three different teams: you need the operator to be able to do all the DevOps and get the infrastructure set up. You need the data engineer to manage the pipelines. You need the data scientist to be able to focus on the data science and not have to worry about those other two things. And, you know, figuring out where do you actually start building that team? Do you hire on the data scientists and make them do the data engineering and hate their job? Or do you hire on the operator and, you know, overstretch them by having to do all the data engineering too? Or

that's the challenge too is figuring out who do you hire first.

I've inherited those teams, and I've built those teams. I haven't read that book. I gotta go read that book. But I absolutely agree with that premise. You absolutely need

all three of those roles. The way we do it is we have data infrastructure,

that's the operators. We have data engineering that builds the pipelines and maintains the pipelines. And then, yeah, we have data science

that is building the models and those kinds of things. That's been the way I've done it at multiple companies now. But a lot of companies will start with a lone data scientist

who ends up being a data engineer

until they eventually quit.

Or data engineers

that can build the stuff but can't really build the models. Maybe wants to, but doesn't necessarily have the background to be able to do it.

Yeah. And like I said, and you have a lot of teams that are just analysts

who figured out how to either build some sort of ETL pipeline or have a data engineer build a pipeline, but then don't have that operator

who can actually help them

do things in a kind of repeatable way, and have these very shaky kind of infrastructures

that are incredibly error-prone and very fragile.

And so

bringing it back around to Anaconda

and some of the work that you're doing there and you work with the community,

what are some of the most interesting or innovative or unexpected ways that you've seen the different tools and platforms and systems that Anaconda is responsible for used in the broader community?

One of the things that should have been obvious, but was super eye-opening and incredibly exciting for me

at Anaconda, having been somebody that's

spent nearly 30 years in the software industry. Right? I think

I've been working, as I said, with data science, but I've been working with software

companies. Right? So,

you know, Spotify, you know, music recommendation

systems or personalization

systems,

marketplace stuff. That's kinda what I've done with it.

And at Adobe, you know, computer vision or those kind of applications.

And what I hadn't realized, actually, until I announced I was

coming to Anaconda,

was all my friends, you know, I went to an engineering school for my degree, and I'm friends with chemists, I'm friends with physicists

I went to school with, who were all super excited about me going to Anaconda because

they had packages in the Anaconda distribution

for chemistry or for physics.

And that's actually been one of the things that's been really

exciting to me about Anaconda

is, you know, because I was kind of in this world of these kinds of

software world use cases, I hadn't really thought that much. It didn't really occur to me

just how pervasive

these techniques are in the larger kind of science world.

So that's where I get really excited is when I see

people using it, you know, to map genes

and to work on,

you know, new drugs. Or, you know, the folks at the drug companies that built

vaccines for coronavirus,

you know, are using these tools.

Physicists trying to map the universe are using these tools. Chemists trying to figure out the way molecules interact are using these tools. That's the thing that actually

just gets me so excited seeing psychiatrists

using these tools, social scientists

using these tools. There's all these different ways that people

use these tools that I, you know, like I said, I was aware of,

you know, you didn't really see

outside of maybe, you know, cool New York Times visualizations

or those kinds of things. You know, like, oh, yeah. That's cool. They're absolutely doing this stuff.

But to actually see that every week

show up in our team demos where

we're working with some customer

that's building

insulation

for buildings

and using

the advanced data science tools.

There's a data science team at that company doing really cool stuff and getting to see what they do is fascinating.

Yeah. Absolutely. It's funny the number of software engineers that I've spoken with who started off as physicists.

Yeah.

Which is actually what I originally

started to go to school for before I ended up switching my head and going back for computer engineering.

I think one of the other things I like specifically about working at Anaconda as a company,

because obviously,

a lot of people at Anaconda come out of that open source world.

And a lot of the people who are contributing to the open source world are grad students in

biology or chemistry or or whatever.

And so a lot of the software developers at Anaconda have PhDs or even were working professionals in

other fields.

That's amazingly

cool to like work with

people who did do their PhD in particle physics or did do their

PhD in computational biology.

You have a chat in Slack, you're just talking about something and they just happen to know

all these incredibly

interesting kinda details about the subject because it just so happens that they were doing that research for a number of years.

That's one of the things I just appreciate about kind of being part of this, being really enmeshed in this PyData world.

It's worth calling out that specifically the data ecosystem and the use of Python in data

really brings in a lot more

diversity of backgrounds

and ethnicities

and puts me in mind of a story that Jacob

shared at a PyCon one year, who, for people who don't know, is one of the co-creators of Django and has been on the security team at Heroku for a number of years, and I forget where he is now, but he was mentioning

some work that he had done with

a young woman who had been doing research in, I think it was geology and had, you know, built this interesting model and built a website to be able to interact with the work that she had done and then saying, oh, but I'm not a software engineer. I'm not a programmer. You built this amazing thing that I couldn't have thought of doing in months. And because of where you're coming from, you were just trying to solve a problem. You weren't trying to become a programmer, and so they don't identify as a programmer. They just identify as somebody who solves a problem. And so being able to empower those people to solve problems without having to go through the rigor of, you know, all of the accoutrements of software engineering is amazing.

Yeah. Absolutely.

That's absolutely true.

So in terms of your experience

of working at Anaconda and helping to drive the company forward and set their technology strategy, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

One of the things that's been interesting to me and one of the things that's challenging

is really on the open source part. So

I've been part of teams that have contributed

to the open source. You know, certainly benefited from the open source ecosystem,

contributed to the open source ecosystem

in certain ways, but they were not what I would call open source companies.

So I think one of the things that's been really challenging

or interesting

has just been, you know, Anaconda,

we're an open source company. Right? To be honest, this was one of the things that attracted me to the company

is we are unapologetically,

incredibly seriously

part of open source. We care about open source, we care about contributors to open source, we care about the open source community.

And so

being good stewards, or trying to be good stewards. And I think, you know, we've had challenges

where we haven't done our best or we haven't done what you would expect. You know, when I was at a company like Adobe or a company

like Spotify, and we would want to, you know, a developer would wanna put something they're working on into open source.

You know, one of the things I would tell them is if you're gonna do that, you have to be serious about it. You can't just throw it out and say, hey. Here it is. And then ignore it for a while.

If you think of just the sheer number of projects

that Anaconda has contributed into the PyData ecosystem, most of which at this point, we've moved to community governance.

We are maintainers on a vanishingly small percentage

of the projects that we originated or were part of the originating group

because we knew we couldn't support everything all the time. And there's a point where it needed to move

into community governance.

That alone is, like, incredible,

coming out of a very much commercial software world,

trying to make that decision of, oh, okay, well, here's this thing we built, we invested in, we put a lot of energy in, and now we're just gonna give it away and we're not even gonna own it anymore, everyone else is gonna own it. And they may take it in this direction. We generate all this intellectual property that we have 100% given away. That's been an interesting part of being an open source company. And then also looking at the things where, you know, we haven't done our job as stewards and understanding, oh, okay, this is why we do it is because

if we can't support this

in the way people need it to be supported, they'll stop using it. And that doesn't benefit anybody. Then we've just put a lot of energy and investment into something and didn't take it seriously and people stopped using it. So, that's been a big kind of lesson and challenge for us. We want to continue to contribute

and be good members

of the open source community, but we have to do it in a really

smart way. And we have to know when it's time to say, you know what,

everybody loves this thing. It's probably better for everybody to own it instead of for us to be the ones driving it moving forward. And that's going to be the best thing for the community.

So that's been sort of a really interesting thing to do in a venture backed

company

because it's very different from any other venture backed company I've ever been in.

Are there any other aspects of the work that you're doing at Anaconda

or the challenges that are facing the data science and analytics ecosystem

or the overall space of Python in data that we didn't discuss yet that you'd like to cover before we close out the show?

I'm just gonna shout out

to some of the stuff that we've been working on that I wanna make sure or that hopefully a lot of people should be aware of, but there's been new releases. There was a new RC of Numba that just went out,

new release of Panel.

I think Datashader went out again not too long ago. You know, these things, you

know, Numba is the only one, I think, where we're still sole maintainers or primary maintainers. Everywhere else, we're just

part of the maintenance group. But I'm so excited to see some of these things

and see them continue to evolve. So, you know, Dask is another thing. So I'll put a shout out for those.

If you haven't looked at those for your own practice, and if you're not an Anaconda user, those are all available through, you know, PyPI as well. Right? But they're really cool technologies that

make doing

complicated stuff easy, so worth taking a look at.

Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And so with that, I'll move us into the picks.

And this week, I'm going to choose a group of novels by the author China Miéville. The three novels are actually an anti-trilogy in that they're all set in the same world, but not actually related in terms of their storyline. An interesting take on the idea of sort of a trilogy. They're called the Bas-Lag novels: Perdido Street Station, The Scar, and Iron Council. So definitely recommend reading those if you're looking for something to do at any point. And so with that, I'll pass it to you, Kevin. Do you have any picks this week?

My pick

is I got the Lego typewriter

for Father's Day,

and it is really cool.

That does sound very cool.

It's pricey, but as the advanced LEGO Ideas models go, it's not ridiculously pricey.

It is very, very cool.

Is it actually a functioning typewriter?

If they actually gave you

ink, it kind of almost could be. That's part of the thing that's

really cool is just

how they've actually been able to emulate and mimic the motions, the mechanisms of a typewriter

in LEGO is awesome.

That's very cool.

Alright. Well, thank you very much for taking the time today to join me and share your perspectives as the CTO at Anaconda

and the window that that gives you on the overall data science ecosystem and communities. So thank you for all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.

I appreciate it. Thank you. Thank you so much for having me. We really appreciate you having me on.

Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com

for the latest on modern data management.

And visit the site at pythonpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes.

And if you've learned something or tried out a project from the show, then tell us about it. Email host at podcastinit.com

with your story.

To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
