Summary
Science is founded on the collection and analysis of data. For disciplines that rely on data about the earth, the ability to simulate and generate that data has been growing faster than the tools for analyzing it can keep up. To help scale that analytical capacity for everyone working in the geosciences, the Pangeo project compiled a reference stack that combines powerful tools into an out-of-the-box solution that lets researchers be productive in short order. In this episode Ryan Abernathey and Joe Hamman explain what the Pangeo project really is, how they have integrated Xarray, Dask, and Jupyter to power these analytical workflows, and how it has helped to accelerate research on multidimensional geospatial datasets.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host as usual is Tobias Macey and today I’m interviewing Ryan Abernathey and Joe Hamman about Pangeo, a community platform for Big Data geoscience
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Pangeo is and the story behind it?
- What is your role in the project/community and how did you get involved?
- What are the goals of the project and community?
- What are the areas of effort and how are they organized?
- What are the scientific domains that Pangeo is focused on supporting?
- What are the primary challenges associated with data management and analysis in these scientific communities?
- What are the forms that these data take and how have they been evolving? (e.g. formats/sources)
- What are some of the challenges introduced by the widespread adoption of cloud resources and the associated architectural patterns?
- Can you describe the technical components that fall under the Pangeo umbrella?
- How do they come together to form a functional workflow for the geosciences?
- How has the scope of the Pangeo project changed or evolved since it started?
- What are the most interesting, innovative, or unexpected ways that you have seen Pangeo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pangeo?
- When is Pangeo the wrong choice?
- What do you have planned for the future of Pangeo?
Keep In Touch
- Joe
- @HammanHydro on Twitter
- Ryan
Picks
- Tobias
- Ryan
- Klara And The Sun by Kazuo Ishiguro
- Joe
- Range by David Epstein
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Pangeo
- Pangeo Forge
- CarbonPlan
- M2LInES
- LEAP
- Columbia University
- Xarray
- MIT
- MATLAB
- PHP
- Ruby
- Java
- NumPy
- SciPy
- Matplotlib
- C
- Fortran
- Perl
- Dask
- Jupyter
- IDL
- HDF5
- Unidata
- NetCDF
- CF Metadata Conventions
- Intake
- FSSpec
- Parquet
- Zarr
- Data Engineering Podcast
- Airbyte
- Fivetran
- Stitch
- TileDB
- Pythia
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use.
With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host as usual is Tobias Macey. And today, I'm interviewing Ryan Abernathey and Joe Hamman about Pangeo, a community platform for big data geoscience. So, Ryan, can you start by introducing yourself?
[00:01:56] Unknown:
So I am a Python hacker who moonlights as a professor of oceanography at Columbia University. I have been involved in both science and software for about 10 years now, and it's always, you know, evolving where I'm focused. But right now, I'm very focused and very engaged on this Pangeo project. And so we're really excited to share what we've been doing with you. And, Joe, how about yourself?
[00:02:26] Unknown:
Yeah. So my name is Joe Hamman. I'm a climate scientist and an engineer. My day job is the technology director at the nonprofit CarbonPlan. And there, we do data science and policy around the climate problem. My path to this podcast, I guess, is largely through my scientific research, and I ended up working on the Xarray project, which we'll get into later, and which turned into the Pangeo project anyway. So I started doing that also about 10 years ago.
[00:02:54] Unknown:
I have really enjoyed my time in the Python world. Yeah. Definitely interested to dig more into the history of Xarray and Pangeo and how they relate. But before we do that, Ryan, why don't you share how you first got introduced to Python?
[00:03:06] Unknown:
I went to grad school at MIT. In my work, I work with a lot of data from satellites and simulations of the ocean. So I'm an oceanographer. You know, I started grad school in 2006, and I was given MATLAB. That's what everyone was doing. You know, my background prior to that as a programmer, I knew how to code in PHP. I knew a little Ruby. I knew Java, and MATLAB was new to me. I liked some things about it, but after doing my first project, I was kinda just like, yeah, no, I need to do an open source something. And that was about 2008. And at that time, you know, scientific Python was just starting to emerge.
I was early enough that I compiled and built NumPy, you know, SciPy, Matplotlib from source, you know, before there were all these great installers like we have today. It just took off. I loved using it. I wrote all my stuff in it, and I did a bunch of research with it. But then the real turning point was when I started to get involved in community open source, which happened more around probably 2014, 2015.
[00:04:12] Unknown:
And, Joe, do you remember how you first got introduced to Python?
[00:04:15] Unknown:
Yeah. My story is not all that different from Ryan's. I was in graduate school at the University of Washington, and the lab I was in had historically used a really wide range of programming tools, including C and Fortran and Perl, all kinda wired together in a web of shell scripts. And this is how the world worked there. And so my advisor at the time was interested in focusing the group on Python and needed a test subject or a guinea pig. And he said, hey, why don't you go learn how to do this and see how it works? And he was also learning at the time. So that's how I got started. What really got me excited and curious was the community aspect of Python, and I was just like, this is so cool. There's, you know, groups of people, they don't necessarily work in the same place, but they're working on the same tool because it meets their needs, and it scratches an itch in one way or another.
And that sort of community focus of the Python world is really what drew me in. And I went from teaching myself to contributing to some packages just out of kind of curiosity and interest in how that all worked.
[00:05:18] Unknown:
And so that has led you both to the Pangeo project, which brings us here today. I'm wondering if you can give a bit of an overview about what that project is, some of its scope, and maybe the story behind it and how you each got involved with it. It all started with Xarray.
[00:05:34] Unknown:
Right? So do we need to talk a little bit about Xarray? So Xarray is a fantastic tool for data science, particularly for scientific data, which doesn't necessarily fit into that tabular data model that Pandas addresses so well. Back in around 2013 to 2014, a lot of us who were working with scientific data in Earth system science, so, like, big satellite datasets or model simulations, we all had our own little private version of something that looked like Xarray. It was basically some kind of data structure that held together many different NumPy arrays that were related to each other, maybe some metadata thrown in there. And we were writing sort of these custom wrappers, you know, custom layers to do things with these data structures.
Turns out someone who's a much better programmer than me, and definitely than I was at that time, started developing something like that and did a really, really great job. And that was Stephan Hoyer. And so, you know, basically, when I discovered Xarray, I immediately stopped working on my own little thing that did this job and started using Xarray and then contributing to Xarray. And on the Xarray mailing list, there was kinda a lot of traffic between people in our research field, you know, Earth system science, let's just call it. And we really had the idea that we should coordinate our efforts and really try and get some momentum going, because Xarray was useful, but at the same time, there was a lot more unrealized potential, particularly in scaling out Xarray to do really big, large scale data processing. And around that time, some datasets were coming out that we were measuring in petabytes. So in particular, I remember this one simulation that was run by some colleagues at NASA JPL.
It was an ocean model simulation, this gorgeous, super high resolution, high fidelity simulation of the global ocean. So we're talking about simulating the global ocean at one kilometer mesh resolution. And it dumped out over a petabyte of data onto, like, one of these supercomputers at NASA. And I remember feeling so frustrated that, like, it was pretty easy for that model to run. It wasn't trivial, but, like, it was straightforward. Like, those HPC simulation codes are designed to scale out to tens or hundreds of thousands of CPU cores. But we really had no analysis tools that could then deal with the data from that simulation and actually do science with it. And there were a lot of other people in the same boat at that time. And so we wanted to find a solution to this sort of, you know, scaling data analytics problem. We basically organized a workshop and got people together in person, back when we used to get together in person, and that's how Pangeo was born. And maybe Joe can tell the story of what happened at that meeting. There were, let's say, 20 folks at this meeting at Columbia that Ryan hosted and a bunch of us came out
[00:08:30] Unknown:
for. And we kinda laid out what are the key challenges facing the geoscience community when it comes to software and data, and looking forward, what is the world we wanna see exist? And, you know, out of that grew the mission of Pangeo and some ideas for projects that we could work on. At the core of that was really connecting Xarray and Dask to enable kind of parallel, scalable data analysis that was built on top of Xarray and making that interactive. And so we also had a focus on bringing in interactive computing using the Jupyter project. After the workshop, we had some findings. There was a website, which I think probably still exists, with a couple of ideas on it. Shortly thereafter, a proposal call came out from the National Science Foundation. We responded to that proposal call and got the project funded. Ryan and I were both on that. Ryan was the lead PI. So for 3 years, we used that funding to integrate Xarray and Dask and Jupyter into what was really the beginnings of the Pangeo project, these kind of connections that enabled interactive, scalable data analysis on the types of data that a lot of geoscientists are using.
[00:09:42] Unknown:
Yeah. And when I was starting to prepare for this podcast and digging into some of the documentation and the resources around it, as I was hearing about it from other folks that led me in this direction, it's not so much that it's a single project that you can pip install, as you would with Xarray or Dask, as it is that it's sort of this parent community that works to bring together these projects, as you said. And so there's no single, like, sort of Pangeo project per se from a software perspective, as much as it is a project to enable these types of research that you're discussing. I'm curious if you can maybe talk to some of the interesting challenges and perspective that that brings at a sort of software and community level, to say, you know, we have the Pangeo project, but there's no actual Pangeo source code that you can deal with, and just how you manage that messaging and help using the Pangeo project as this kind of umbrella to bring all these people together.
[00:10:35] Unknown:
It's a great question, and it goes to the heart of what Pangeo is. I think what we have done is bundle together a lot of different tools that in isolation may not seem as relevant to the geoscience community. By bundling them together and working to make them integrate, we have presented them as a unified solution, and I think that was really necessary because at this stage a lot of the community was thinking about transitioning from much more monolithic analysis environments like IDL or MATLAB. And a lot of scientists, when they would look at Python, they would kind of throw up their hands and say there's a million different packages. You know, what am I supposed to use? You know? And there is this sort of core consensus for scientific Python on NumPy and Matplotlib. They basically get you to MATLAB-like parity, but there's actually a lot more higher level capability that libraries like Xarray and Pandas give you. What bundling stuff together achieved was a couple of different things. One, it made it possible to fund from the National Science Foundation. The federal government spends a ton of money on software development activities, but a lot of it is really wasted on projects that don't have a lot of users or don't have a lot of impact. And at the same time, we know that the open source world, you know, is chronically underfunded relative to its impact.
By bundling stuff together, we were able to essentially market these tools as one unified solution that a funding agency could get behind and say, oh, yeah, we see how this aligns with our mission and the things we wanna support. And so they've been able to fund it. And at the same time, by bundling them together, we've also made it, I think, more accessible for institutions to, like, adopt this software stack. One good example of that is the National Center for Atmospheric Research, a really major large research lab in climate and atmospheric science. Pangeo is their in-house analysis stack, as they sort of transition away from an older tool that they used to maintain, a domain specific analysis language called NCL, the NCAR Command Language. And I don't think that could have happened if we didn't try to create this narrative of how all these tools work together to solve, like, real end to end workflows.
But the downside of this is I think there's still a lot of confusion about what Pangeo is. Like, is it a group of people? Is it a software? Is it a cloud infrastructure? The real answer is kind of all of the above.
[00:12:59] Unknown:
If I can zoom in on one thing, I think that might be another interesting way to look at what the Pangeo project has done and how to think about it. The life cycle of a scientific researcher's idea is usually to start doing some data analysis and run into a bunch of problems. Like, software doesn't solve all problems, and so you have to write some code yourself. And scientists have, for a long time, been really good at kinda hacking their way to the finish line. And what we did in Pangeo is kinda break that pattern and insert some community opportunities for getting help, collaborating with developers on all of the various touch points of the software ecosystem that scientists were using.
And rather than saying, okay, I'm gonna hack my way to the finish line, we say, okay, now we start integrating and working with the developers of these packages that might be able to help us fulfill our goal without going through that laborious hacking process. So what that's done is use case driven development of a bunch of these individual software libraries. And, overall, it's kind of the rising tide lifts all boats idea. A bunch of those use cases then get solved for the whole community. I think the community has caught on to this, that, okay, you know, I could hack my way around this, or I could open an issue on GitHub that might motivate some conversation. Maybe me, maybe someone else knows how to fix that. And, eventually, that ends up in a library like Xarray or Dask or something, and then that's a solved problem for a lot of people. So that kind of broke the normal development cycle, and it's been a flywheel effect that has extended beyond just the core climate science use cases. One of the coolest parts about what Pangeo has accomplished is rethinking how that development cycle works.
[00:14:42] Unknown:
You mentioned that the core elements of the stack are Xarray, Dask, and Jupyter, and being able to tie them all together to be able to do this multidimensional analysis in a scale out format with an appealing user interface. And given that core structure that people can build off of, I'm curious what types of extensions and kind of peripheral projects have been built around that core that may not necessarily be considered part of the broader Pangeo umbrella, but are part of that same community.
[00:15:12] Unknown:
There's an image a lot of us have in our heads of the Python ecosystem as kind of an onion, where Python is in the middle and you have some foundational libraries, say, NumPy and Jupyter. And then a layer further out on the onion might be Xarray and Pandas and a few of these higher level data analysis tools. What we found is that by better integrating all of those foundational tools, we enable the development of a bunch of domain specific tools. And so I think one of the coolest things that we've seen in the last few years is that now you have a package for each domain kind of developing that is built on top of this kind of deep stack of well connected tools.
And so that's things like a lidar processing library for looking at data coming out of a NASA satellite or a library for doing high dimensional grid analysis and that sort of thing. So the way that the ecosystem has kind of blossomed around that is, I think, a function of the fact that these foundational tools solve the core problem, and then there's just, like, the last mile problem that individuals have been able to tackle on their own. We've been trying to advance this sort of architecture for what a package in our ecosystem would do.
[00:16:23] Unknown:
We had this on our website, a couple of, like, principles that we would encourage those packages to follow. But the basic idea is, like, if we have tools that can sort of consume and produce Xarray datasets lazily, we get a lot of interoperability out of that. Traditionally, a lot of geoscience workflows have interoperated through files. So, like, you'll have a command line tool that, like, reads a file and then, like, writes a file. And these are usually NetCDF files, which is, like, this really metadata-rich format for exchanging information.
And what we're trying to do with Pangeo is to not touch the file system, but have packages that can pass information to each other at this high level, maybe, you know, annotating their results with, you know, units or extra metadata or, you know, things like that, and then be able to chain different pieces of this ecosystem together. So, like, one specific example of a package that we just came out with that I feel really exemplifies this is this tool called GCM Filters. It is a tool for doing convolution based filtering of data that lives on these complex meshes that we use in Earth system modeling. Right? So a very common, like, data analysis need is to, like, do smoothing-based convolution.
And, like, SciPy or, like, scikit-image or something, they all do this sort of stuff, but only in sort of rectangular, image based space. What we have instead in the geosciences is, like, we have to do convolution on the sphere, in, like, a curvilinear coordinate system with, like, this irregular mesh. And so by leveraging the Pangeo ecosystem, we could write code that really just targeted that key piece. It doesn't have to worry about, like, IO. It doesn't have to worry about, like, how to parallelize those operations. We just consume and produce Xarray datasets, and then it can integrate with the rest of the ecosystem. And that's what we're going for, but it's kind of hard. It's a long road and doesn't always work out. It's not always so clean and neat, and there's still plenty of confusion and duplication and inefficiency in this ecosystem. I don't wanna suggest that it's, like, solved. It's more of a vision of how we'd like things to work.
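To make the "consume and produce Xarray datasets lazily" idea concrete, here is a minimal sketch. It is not the actual GCM Filters API; the function name, variable names, and the synthetic dataset are illustrative, and it assumes NumPy, Pandas, Xarray, and Dask are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

def smooth(ds: xr.Dataset, window: int = 5) -> xr.Dataset:
    """Toy 'Xarray in, Xarray out' operator: a centered rolling mean along time.

    No file I/O happens here; if the input is Dask-backed, the result stays lazy.
    """
    return ds.rolling(time=window, center=True, min_periods=1).mean()

# Synthetic stand-in for model output, chunked so everything downstream is lazy.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(365, 90, 180))},
    coords={"time": pd.date_range("2020-01-01", periods=365)},
).chunk({"time": 30})

smoothed = smooth(ds)        # still lazy, nothing has been read or computed
result = smoothed.compute()  # triggers the (possibly distributed) computation
```

Because the operator only sees Xarray objects, the same function can be chained with any other package that follows the same convention, regardless of where the data originally came from.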
[00:18:35] Unknown:
In terms of the scientific domains that you see orienting around the Pangeo project and the Pangeo stack, I'm wondering if you can maybe categorize them a bit and talk to some of the sources of pain that they're experiencing in their, maybe, so-called native ecosystems that pushes them in the direction of investigating and adopting Pangeo to help solve their problems, and maybe some of the ways that Pangeo can act as a force multiplier by being this sort of common infrastructure that different domains can collaborate with.
[00:19:10] Unknown:
Climate science was the core one, and that's where a lot of the early motivation for use cases and applications and those pain points came from. And so there, we were working with very large, so, petabyte scale collections of multidimensional data. There were access problems to that data. It was mostly sitting behind FTP servers or HTTP servers. Individuals or individual research groups would grab some part of it. They would do their research on it. Climate science, we came in and said, okay, so let's start by making it easier to work with that data using Xarray and Dask. And since then, we really, we haven't gotten into it yet, but cloud computing has become a larger and larger part of what the Pangeo project has spent its time focused on. And that's largely because it brings to the forefront this opportunity for a data commons like environment, where data is accessible to multiple researchers or research groups.
And in the climate sciences, we're starting to see that by putting a bunch of climate data in the cloud. So Ryan has led a project to put over a petabyte of data from the Coupled Model Intercomparison Project into the public cloud, and that's available to all researchers anywhere around the world now, on both Amazon and Google Cloud. And that's really kind of changing the game for how researchers access this data. Before, you either had to work at a place that could store that much data, and there aren't very many places in the world like that, or you had to choose a small subset to work on. So today, that's a totally different way of operating on the data. In terms of the domains, I guess, who is using it and how they end up using it? You know, I think the core
[00:20:48] Unknown:
nucleus has always been, like, the stuff that Joe and I and people in our sort of narrow radius do. It's, like, analysis of high resolution Earth system observations and simulations, really focused on this data analysis problem. Right? Because, like I said earlier, there's kind of a paradox right now in Earth system science: our ability to simulate has actually outpaced our ability to understand those simulations, because those simulation codes scale out so well. And, like, we have supercomputers that run them, and, like, you can really easily run a lot of simulations about the Earth. But then you need a way to analyze the data that comes out of them to actually get to some interesting science. And so I would say the core users of Pangeo are people who are analyzing very large ensembles of Earth system simulation data.
That is probably, like, the core group. And from there it's sort of diffused out to proximate areas like remote sensing analysis, you know, analysis of satellite imagery and other measurements of the Earth from space, analysis of data that is collected from autonomous sensors and robots in the ocean or in the atmosphere, weather balloons, things like that. Anytime someone is, I think, dealing with a dataset that is too big to easily fit on their laptop or, you know, in memory on their computer, we think Pangeo can help accelerate their work, specifically by letting you use the same tools and the same code that you use on small data for big data. Like, that's the real beauty of Dask particularly, and, like, Pangeo gets that for free by using Dask: you can have code that works on small data in memory, and then you can have the same code working with a very large dataset with distributed computing. And so the transition to scaling out your research is way less painful when you're doing it that way. In the past, what people were doing is essentially batch scripts on HPC systems processing files, like, a very file centric workflow. Like, read this file, you know, produce something with it, write out another file, like, then do some kind of reduction or something like that. And the abstraction we're going for is that you don't have to think about files. You think about datasets and you think about physical dimensions and coordinates, and that is what you write your code around, not what files do I have to process.
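As a rough illustration of that "same code on small and big data" point, here is a sketch assuming Xarray and Dask (with dask.distributed) are available; the dataset, variable names, and chunk sizes are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr
from dask.distributed import Client

def monthly_anomaly(ds: xr.Dataset) -> xr.Dataset:
    """Analysis written against dimensions and coordinates ('time'), not files."""
    climatology = ds.groupby("time.month").mean("time")
    return ds.groupby("time.month") - climatology

# Synthetic stand-in for two years of daily fields.
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(730, 45, 90))},
    coords={"time": pd.date_range("2019-01-01", periods=730)},
)

# Small data: eager, NumPy-backed arrays, fine on a laptop.
anom_small = monthly_anomaly(ds)

# Big data: the exact same function on a chunked, Dask-backed dataset.
# Attaching a distributed cluster (local here, but it could be dask-jobqueue
# on HPC or a cloud cluster) scales the identical code out.
client = Client(processes=False)
anom_big = monthly_anomaly(ds.chunk({"time": 90})).compute()
```

The analysis function never changes; only the backing of the arrays does, which is the transition being described here.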
[00:23:14] Unknown:
And I'll just add, to answer the complete part of your question, which is what other scientific domains beyond climate and remote sensing in the geosciences: I think the life sciences is an area where we're seeing parts of what we're working on in Pangeo get picked up. It's not a wholesale adoption, and that's actually one of the, maybe the best parts of Pangeo, that you can pick and choose the parts that fit your applications appropriately. But we're seeing applications in neuroscience, in genomics, and in bioimaging that are all using some different parts of what we would kinda consider the Pangeo ecosystem or the Pangeo stack.
[00:23:49] Unknown:
Moving back to the file interface that you're trying to move away from, I'm curious, in terms of the source formats that you're dealing with, how you're able to help people with managing that abstraction to say, we don't want you to think about the source files, we want you to think about the dataset, but you actually still need to work with these source files to get them into the dataset to begin with, and just some of that dimension of being able to work with all of these different scientific formats that can be quite esoteric and complicated, and sort of where the multidimensional aspect of Xarray comes in. But I know, like, for instance, HDF5
[00:24:23] Unknown:
can be very complex in terms of what you stuff into it. I think it's really important for us to shout out an organization called Unidata, a file format called NetCDF, and a metadata convention called the CF Conventions. Compared to other fields, I have learned that in geoscience we're really lucky to have a great standardized file format that's really broadly used in our field and really good metadata conventions that come with that file format. So much work has already been done with geoscience data to make it FAIR, you know, findable, accessible, interoperable, reusable, right, through the use of those kinds of standards. The last mile that we've kinda had to cross with Pangeo is to just deal with the fact that, like, most datasets are distributed as many files. Right? Like, every now and then you'll find that, like, 100 gigabyte HDF file. But, like, for the most part, if you go download data from NASA, they're gonna give you, like, one, like, 50 megabyte NetCDF file per day, and there'll be, like, 10,000 of them for the dataset that you wanna work with. That's a very common, you know, access pattern.
Xarray does a great job with this. It has this magic function called open_mfdataset, open multi-file dataset. And I think for a lot of us, when we first started using Xarray, this was just, like, the candy that made us love Xarray, just the fact that we could instantly open this big collection of files and treat them as one virtual dataset. And it's a huge cognitive load lifted off your brain, to focus on the big picture. We started from a pretty good position. But, of course, then once you try and scale out, you really come right up against, like, what are the actual IO limitations of the computing hardware we're using. Right? Like, that's actually then where the bottleneck starts to lie. In a way, Xarray can make things look too easy. It can instantly open and show you all these files. But then when you actually start trying to compute stuff, you have to deal with, well, how fast can my hard disk get me data? Or this HPC file system, you know, what kind of contention is going on here as I try to work with this? Or in the cloud, can I efficiently read this format? But it's good to be up against those limits. That means that you're reaching the saturation limit of what your computing hardware can deliver.
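For readers who haven't seen it, a minimal sketch of what that looks like; the file paths are hypothetical, and this assumes Xarray with Dask and a NetCDF backend installed.

```python
import xarray as xr

# Hypothetical layout: one small NetCDF file per day, thousands of them.
# open_mfdataset lines the files up using their coordinate metadata and
# presents them as a single, lazily loaded virtual dataset.
ds = xr.open_mfdataset(
    "data/sst_*.nc",       # glob over the many per-day files
    combine="by_coords",   # use coordinates/metadata to concatenate correctly
    parallel=True,         # open the files in parallel with Dask
)
print(ds)                  # metadata only; the arrays haven't been read yet

# Work with it as one dataset; chunks are read only when a computation needs them.
weekly_mean = ds["sst"].resample(time="7D").mean()
```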
[00:26:45] Unknown:
Something that's emerged out of this is some even higher level abstractions on large, multi-file datasets. And so I think this is where kind of the catalog space comes in. And two projects we've leveraged quite a bit in the Pangeo ecosystem are Intake, which is a library for cataloging data and then loading it into Python objects, and then fsspec, which is a library for accessing local and remote data under a common API. So I think these are two things that have really kinda changed the game for us in how we think about data organization and data access. And they plug in directly with Xarray and Dask, and they work in the cloud, or they work on HPC, or they work remotely and locally.
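A rough sketch of how those two pieces get used together, assuming the intake-xarray plugin, an S3/GCS backend for fsspec, and Zarr are installed; every URL, bucket, and catalog entry below is hypothetical.

```python
import fsspec
import intake
import xarray as xr

# fsspec: one file-system API across local disk, HTTP, S3, GCS, and more.
fs = fsspec.filesystem("s3", anon=True)
print(fs.ls("some-public-bucket/climate/"))   # list remote objects like files

# The same abstraction lets Xarray open a cloud-hosted Zarr store directly.
store = fsspec.get_mapper("s3://some-public-bucket/climate/sst.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)

# Intake: the catalog describes datasets once, so users can load them without
# knowing the underlying paths, formats, or chunking.
cat = intake.open_catalog("https://example.org/catalog.yaml")
sst = cat["sea_surface_temperature"].to_dask()   # returns an xarray.Dataset
```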
[00:27:30] Unknown:
Another interesting element of this overall problem space that you've touched on a couple of times is the advent of the cloud and its increasing popularity, and the different access patterns that it enables and requires as compared to HPC, which is where a lot of this research and analysis has been done up till now. And I'm curious if you can talk to some of the challenges that this move to the cloud brings to this analytical flow and some of the ways that Pangeo is helping people make that migration off of these monolithic HPC systems, where you have to try and allocate your time share, versus the public cloud, where you just need to give them a credit card and hope that your budget doesn't run out. It's first important to say that our HPC centers that we have in science are an incredible resource
[00:28:19] Unknown:
and the people who run them are really fantastic. We would be nowhere without them. So it's not HPC or cloud. It's really HPC and cloud. HPC has a really important role to play here. I don't think that's gonna go away anytime soon. As far as cloud for us, like, we got into it by accident, basically. We submitted this proposal in 2017, and NSF asked us to trim the budget. And one thing we put in that budget was, you know, like, $50,000 to buy a couple of servers to, like, store data, because that's the kind of thing that, you know, we put in proposal budgets. We wrote back to NSF. We said, okay, we'll cut these servers, but can you give us some cloud credits? Because at that time, NSF was running this program called Big Data. They partnered with Microsoft, Google, and Amazon to, like, just directly grant cloud computing credits to researchers.
In retrospect, it was a very good marketing ploy by the cloud providers, because now we've become total cloud evangelists. So we got, like, $100,000 worth of Google Cloud credits, and we just kinda went nuts. In retrospect, it's remarkable how kind of fast and loose we played with this. We really just started spinning stuff up and trying things. But we basically stood up this cloud based JupyterHub connected to a bunch of cloud based data and basically let anyone who wanted to access it and play with it. And it was super fun and liberating to just be running and deploying our own infrastructure. Right? If you're used to working with HPC, it's very much, like, a sysadmin versus user relationship. You know? Like, you have to beg them to do things. Like, they're very conservative. They're very security, you know, aware.
Resources are limited. They're large, but they're limited. And cloud, it was just kind of like a completely new paradigm. We could do anything we wanted. We could build anything we wanted. We had a lot of money, so we weren't really worried about, like, our spend. In retrospect, now all of those things become serious concerns as the project matures and we try to, like, actually run real production infrastructure. But at the time, it was very exciting and liberating, and we learned a ton of stuff. We experimented a lot. And I do think that we helped to chart a course into cloud computing for the geosciences, and maybe science, you know, maybe a little bit more broadly than just geosciences.
Things are great in geosciences because we have very few restrictions around our data. We don't have, you know, privacy. We don't have personally identifiable information. We don't have HIPAA regulations around, you know, health data. It's basically all Creative Commons licensed, and you can just do whatever you want with it for the most part. And so we were able to just forklift data into the cloud and start computing on it, and it was awesome. One technical thing that's worth getting into is this whole question of cloud native data formats. Right? So far we've talked about HDF and NetCDF.
What we learned in 2017 is that we couldn't just put HDF data or NetCDF data into cloud object storage and compute on it in a convenient way, because we didn't have a way to, like, open those files. They're really these complex, opaque binary formats, and there's, like, a C library that has to read them. And at that time it was just baked in that it's gonna be, like, a POSIX file system where the data live. So you just have to, like, download the whole file if you wanted to even open it up. And so that didn't feel very, like, cloud native to us. What we really want is a data format that can just be accessed directly through HTTP calls, which is how object storage works, and where we can sort of get the metadata without downloading the whole file, and where we can efficiently subset or select from the data. So we experimented with a lot of different formats. You know, Parquet is sort of the canonical example of a cloud native format for tabular data.
It's not ideally suited for multidimensional arrays, and so we really put a lot of time into working on Zarr. Zarr is an open source, community driven project that's more or less API compatible with HDF. But whereas HDF stores things in one single file, Zarr just kind of explodes the dataset into many individual metadata objects, which are just JSON files, and then binary blobs of data that can be compressed and chunked and stuff like that. And so we've spent a lot of effort integrating Zarr, Dask, and Xarray to provide a real cloud native workflow, and it's awesome. If you are working with Zarr data in the cloud, using Dask to scale out on hundreds of compute nodes, you can absolutely burn through data, really just, you know, getting right up against the limitations of the hardware, saturating the network, saturating the CPU, just really processing data as fast as the computers will allow. That's quite exhilarating when you're doing that at scale.
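To make the Zarr layout a bit more tangible, here is a small sketch, assuming Xarray, Dask, and Zarr are installed; the store name and variable are illustrative, and the same calls work against object storage through an fsspec URL or mapper.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A chunked (Dask-backed) dataset; each chunk becomes one object in the store.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(365, 180, 360))},
    coords={"time": pd.date_range("2020-01-01", periods=365)},
).chunk({"time": 30})

# Writing "explodes" the dataset into a tree of small objects (Zarr v2 layout):
#   sst.zarr/.zmetadata      consolidated JSON metadata, readable in one request
#   sst.zarr/sst/.zarray     per-array JSON metadata (shape, chunks, compressor)
#   sst.zarr/sst/0.0.0 ...   compressed binary chunks, one object per chunk
ds.to_zarr("sst.zarr", mode="w", consolidated=True)

# Reading back: grab the metadata first, then only the chunks a computation touches.
ds2 = xr.open_zarr("sst.zarr", consolidated=True)
monthly = ds2["sst"].resample(time="1MS").mean().compute()
```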
[00:33:08] Unknown:
And it's part of, I think, the fun of Pangeo when users get to experience that. It's interesting hearing about this because I have another podcast about data engineering, and it's very focused on what you're talking about with Parquet and these cloud native formats and scalable compute and analysis. And it's always interesting hearing the perspective from people who are dealing with more complex datasets and data structures beyond just the tabular. And so I'm curious if there's any sort of analog to what's happening in the data ecosystem with data warehouses and the quote unquote modern data stack and incremental stream processing or, you know, managing that whole data pipeline life cycle, as it applies to the geosciences and just sort of general academic scientific exploration writ large?
[00:33:55] Unknown:
It's a great problem, because the modern data stack has so much momentum right now and it really doesn't address scientific data, which means there's this huge opportunity, as science moves into the cloud, to build out that type of tooling, but focused on this more complex data model that we deal with in scientific research. And I think that's a super exciting opportunity. You know, there are great patterns in the modern data stack, you know, but the fact is I can't put, you know, my climate model data in Snowflake. Or if I can, they just call it, you know, unstructured data. No, it's not unstructured. It's highly structured. It's just not the structure that those tools want to assume. What we are trying to build now, with funding from the NSF EarthCube program, is a framework called Pangeo Forge, which is an ETL tool focused on cloud native scientific data. You can think of it as maybe vaguely comparable to what something like Airbyte or Fivetran or Stitch does in terms of connecting data sources, you know, ingesting data into a data lake, possibly with transformations in there, and storing it in a cloud native data catalog. That's where, like, 100% of my sort of developer effort is focused right now, and we're really psyched about it. It's kinda just come alive,
[00:35:13] Unknown:
and I could probably talk about it for the next hour. But I'll leave it there for now. That's great. I'll throw one more thing in here, which is that the reason Pangeo Forge needs to exist today is largely out of the success of the workflow and the model of using Pangeo tools in the cloud. You know, we noticed a pattern, just like I was talking about earlier, how that kind of science workflow has found a bunch of detours that build general solutions to problems that we find over and over again. And as we've been developing the Pangeo cloud ecosystem, we found that one of the hardest and biggest pain points is getting data into cloud optimized and analysis ready data formats, and then pushing it and tracking it in the cloud. And so out of that need grew Pangeo Forge, and it's something that we feel like we could put a whole community effort into at this point, because it's the keys to the castle in terms of, you know, interactive big data geoscience going forward.
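For a sense of the pain point being described, here is a hand-rolled sketch of the extract-and-publish pattern that Pangeo Forge is built to automate and track. This is not Pangeo Forge's own recipe API; the source URLs, bucket, and chunk sizes are all hypothetical, and it assumes fsspec (with HTTP and cloud backends), Xarray, Dask, and Zarr.

```python
import fsspec
import xarray as xr

# A pile of archive files published one month at a time (hypothetical URLs).
source_urls = [
    f"https://data.example.org/archive/temp_2020{month:02d}.nc"
    for month in range(1, 13)
]

# Stage the remote files locally; the simplecache:: protocol downloads each
# URL to a temporary file and returns the local path.
local_files = [fsspec.open_local(f"simplecache::{url}") for url in source_urls]

# Combine into one logical dataset and rechunk for cloud access patterns.
ds = xr.open_mfdataset(local_files, combine="by_coords")
ds = ds.chunk({"time": 24})

# Publish a single analysis-ready, consolidated Zarr store to object storage.
target = fsspec.get_mapper("gs://my-hypothetical-bucket/temp-2020.zarr")
ds.to_zarr(target, mode="w", consolidated=True)
```

Doing this by hand for every dataset is exactly the repetitive, error-prone work that a recipe framework with shared infrastructure is meant to take over.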
[00:36:15] Unknown:
Spending a bit more time on the sort of cloud native storage format, I'm curious if and how much you've explored the TileDB platform for being able to handle these multidimensional data structures in this sort of matrix, vectorized sort of storage format that is designed for the cloud, and maybe some of the limitations that it has that lead you away from maybe endorsing it in a full throated manner?
[00:36:41] Unknown:
We love TileDB. It's a great tool and a great format and a great company. I think the reason we haven't done more with it is a pretty simple practical problem, which is that Xarray cannot write TileDB. Xarray is our Swiss army knife of data formats. It can read every format we know of, and it can write most formats we would want to write. And it has an extensible architecture, through entry points, that allows any third party to implement readers and writers. For whatever reason, writing a TileDB writer for Xarray has never been implemented. And therefore, we don't have the ability to create TileDB data. Almost all of our data is coming from a legacy data format. And so it's really just this one technical blocker. You know? I think to some degree, there are some dynamics where the fact that there's a big company behind TileDB might make some people in the open source community a little less likely to just roll up their sleeves and, like, do that work, kind of just hoping that, you know, with all of that funding and all of the engineers at TileDB, you know, they can do that work. Whereas with Zarr, there was no company. Now there are a few proposals that fund it, but it was more just like, well, everyone pitch in and, like, let's hack this thing together and make it work. So that's really the only reason. Now, I think TileDB, to be honest, is a technically superior format in most ways to Zarr. It's very clever. It's very high performing. So in the future, I wanna work with it more.
[00:38:13] Unknown:
Alright. Well, I consider that a challenge to the listeners.
[00:38:17] Unknown:
Well, if someone wants to implement this, like, please do it. It would be a huge service to the community.
[00:38:22] Unknown:
And so in terms of the overall sort of scope and the goals of the Pangeo project, I'm curious if you can talk to some of the ways that those have shifted from when you first began working on it to where you are today and where you're looking to go in the future.
[00:38:38] Unknown:
You know, in the early days, we had kind of three main goals: foster collaboration around open source Python for applications in the climate sciences and geosciences more broadly, support all these domain specific packages, and then improve their scalability to petabyte scale data analysis problems. Where we're at today is not that we've solved all those problems, but we've solved some of those problems. And what we're seeing, as we've kind of transformed into something that's more user focused, like, there are probably more users in Pangeo than there are developers today, which is different than the early days where most users were developers, is that Pangeo is kinda growing into more of an open science community with a lot of people that are using the tools.
And not to say that there are fewer developers, there are probably more developers, but the ratio is probably higher on the user side. When I say open science community, it's still a place where there's a lot of collaboration on how to use tools and then on developing those software tools, and also on the data side. But the focus is kind of increasingly more on the science user than on the software packages themselves, which I think is a great evolution of a maturing project, to see it move from, alright, here are, like, three software packages that we're developing, to here's a collection of tools and ideas and architectures that we can use to have an effect on the problems, the research problems, the scientific problems that we're facing. You said a keyword, which is open science,
[00:40:06] Unknown:
and this is not something we talked about, and we didn't use that word as much in 2017 as we do today. We can see huge momentum behind the open science movement from agencies like NASA and NSF. Open science, loosely, is what happens when you have that: a more collaborative approach to scientific research that I believe is very much inspired by what we have happening in open source. You know, in the open source world there's not really so much competition. I mean, there are some cases of competition, but in general it's kind of this feeling that we're all in this together. Let's try and improve this software so we can all do better work. Let's work together. Let's not duplicate efforts. Let's try and find a place for new contributors. And I think a lot of scientists would like to be working with a similar ethical framework on their research.
And so what I see for the future of Pangeo is to see if we can actually change the culture of science, where we do our work more in the open and are more open to outside collaborators getting involved and, you know, augmenting our projects. And I think that's gonna take science farther in the coming decades than the sort of more individualistic, siloed model has in the past. So it's an open question whether Pangeo can actually do this for the community, but I think we have become a concrete example of a real, live, healthy, open science community that others can look to as a model.
[00:41:48] Unknown:
In terms of the work that you've seen both done under the auspices of Pangeo and maybe as a direct result of it, both technical and scientific and in terms of the community, I'm curious if you can speak to some of the most interesting or innovative or unexpected ways that you've seen that come about. A couple of ideas that come to mind immediately are some of the operational
[00:42:11] Unknown:
uptake in Pangeo tools. So, you know, Ryan mentioned NASA, but NASA and NOAA, two of the biggest public geoscience data providers in the world, have adopted parts of the tooling and the approaches we've taken, in particular as both of those organizations are moving towards using the cloud as the place to store and distribute their data. We're seeing things like tooling being built using Xarray, or data being stored in Zarr format, or cataloged using similar tools. And I think that's a really exciting thing. I mean, in many ways, it's why, say, NASA or NSF are giving grants to organizations like ours: so that we can innovate these ideas to a place where they can be taken up by larger institutions. And so I think, you know, NASA is certainly doing this. NOAA is as well. It's a big success of the project so far. Yeah. My answer is more or less the same as Joe's. You know, I think it's been really cool and unexpected to see,
[00:43:13] Unknown:
like, also, like, organizations in Europe, like the European Space Agency, like, building operational infrastructure, production level infrastructure for data processing out of, like, these building blocks. I think many times they don't explicitly credit Pangeo, and that's fine. They're like, this processing system is built, the data is in Zarr, we're processing it with Dask, and, like, we're providing, you know, Xarray. It's clearly Pangeo, but, like, it's not credited. And that's fine, actually, and good. Like, I think there's a world in which Pangeo becomes this invisible, you know, layer, you know, supporting these tools. And if they go out and are adopted and used and help the world, like, that's kind of the best case scenario.
The other direction is adoption in biogenomics and microscopy. That's super cool to see. In your experience
[00:44:02] Unknown:
of helping to form and grow this community and the associated technologies, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:44:13] Unknown:
For me, it's been the management and fostering of a large, distributed, asynchronous community. It has taken a lot of energy. I mean, it's been incredibly fruitful, but there's also been a lot of coordination effort that has gone into that. When I started working on Xarray, it wasn't to build a community. It was in part because there was a bit of a community aspect to it. But as this kinda grew into the Pangeo project and it started attracting users and other developers, it became more and more a coordination project and less of a development project.
[00:44:44] Unknown:
I echo that. Like, trying to keep the community organized and moving forward and motivated has been a huge time sink for me. I mean, I think it's time well spent. But, you know, it really does take energy and effort. And I think sometimes we underestimate that. Like, it's easy to start projects and start things, but sometimes we underestimate the cost of keeping them going. I think that's true in software and work more generally. I would add another, like, challenge that it's always important to acknowledge. You know, we really try to make our community, like, really inclusive, through a code of conduct, through a lot of things we can do to try and welcome new users. But the fact is, you know, the open source software development world is not the most diverse place.
You know, we've tried and had some examples of success to make our community more diverse and welcome contributors from more different walks of life and more different backgrounds. But we still have a lot of work to do there as I think a lot of the, you know, scientific Python projects do. That's a perpetual challenge and something we're always trying to think about how to do better on.
[00:45:48] Unknown:
And as you continue to work on and foster that community and the associated technologies, what are some of the things you have planned for the near to medium term future of Pangeo?
[00:45:58] Unknown:
I think in the near term, Ryan talked about it earlier, but I think Pangeo Forge is a big effort right now. And I think that pushing on this data lake concept for geoscientific data in the cloud is a big aspect there. There are other areas of growth within the Pangeo project that are not necessarily connected to Ryan and I, but I think it's worth saying. I think there's this early stage shift towards more of an open science community, and there are others pushing on that. I think that's really exciting. I also see there's a group at the National Center for Atmospheric Research and at the University of Washington that have recently gotten support to do software support across the ecosystem, which is notable because, traditionally, it's really quite easy to get grants to build new things.
But NASA has a program now for supporting open source software, which is both really cool, and there's gonna be focus in the Pangeo space on that. So there's a whole mix of things. The community is really multidimensional in its own way now.
[00:46:57] Unknown:
I really think cloud can be transformative for science. Like, I think a lot of people get wrapped up in, like, the technical or, like, the cost, you know, analysis and, like, miss the broader point that, like, cloud allows us to be way more collaborative internationally, you know, around research. And I think that will have really big impacts on science. And, you know, it doesn't mean we have to use Amazon's cloud, but, like, cloud broadly defined as this open Internet environment for doing science, I really do think is gonna transform the way we work for the better. As far as Pangeo goes, I'm very interested in, like, building bridges with other languages, particularly, like, R and Julia.
As far as, like, simulation goes, I'm very bullish on Julia, like, in the future. I think in 10 years, you know, a lot of our work is probably gonna be in Julia. You know, there's interoperability today between Pangeo and Julia. Like, the Zarr format, for example, has really good Julia support. You know, you can call Python from Julia and Julia from Python. But I think just continuing to build out that bridge is really important for sort of, like, looking to the future, because I see a lot of great innovations happening in that community. You know, I've dabbled in the language myself, and I think that's gonna accelerate.
[00:48:17] Unknown:
Are there any other aspects of the Pangeo project and your work in that community and the technology stack that we didn't discuss yet that you'd like to cover before we close out the show? It's a good time to highlight the fact that there's a Discourse forum.
[00:48:33] Unknown:
You know, Pangeo is fundamentally an online community. It's not based at any one institution. It's on GitHub, and we have a Discourse forum. And so it's discourse.pangeo.io, and on GitHub at pangeo-data. I think those are the nexus of collaboration, and I encourage folks that are interested to go check those out. There's also a weekly meeting that, about half the time, is a showcase where people can check out presentations or demonstrations,
[00:48:59] Unknown:
things that people are doing in this space. I think it's also really important to shout out, like, a lot of people ask, how can I learn to use Pangeo? Which you might just translate as, how can I learn to use, like, scientific Python and these various packages? But there's a great group that's focused on the education side within Pangeo, which is called Project Pythia. This is a collaboration between some folks at NCAR and the University at Albany. And they have a great website, a really rich Jupyter Book with awesome sort of training material.
It's really a zero to 60 course in scientific Python computing in the geosciences. So if anyone is hearing this and wants to know where they can start, that's a great place to start.
[00:49:41] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And with spring coming and once I get on the other side of mud season, I'm excited to get back out and do some mountain biking. So for folks who haven't given that a shot, definitely a fun way to spend your time. So with that, I'll pass it to you, Ryan. Do you have any picks this week? A way more indoor pick, but I just read the novel Klara and the Sun by Kazuo Ishiguro.
[00:50:14] Unknown:
Really blew my mind. It's about AI. It's about artificial friends and the potential of where we might end up if we really manage to create very, very advanced AI and put it into human-like bodies. It was very thought provoking and a beautiful book, so I highly recommend it. Alright. And how about you, Joe? Oh, I'll have to see you out on the trails and on the mountain bike.
[00:50:37] Unknown:
Also looking forward to that. But last year, I finished a book called Range by David Epstein, and it's about how generalists thrive in a specialized world. And I thought it was a really interesting take on why, you know, a bunch of experiences and a more generalist approach is important in the world we live in today.
[00:50:56] Unknown:
Alright. Well, I'm definitely gonna have to take a look at both of those. So thank you for taking the time today to join me and share the work that you're each doing on Pangeo and helping to give more people access and capacity for doing analysis and data science on large geospatial datasets. I appreciate all the time and energy that you and the rest of the community have put into that, and I hope you two enjoy the rest of your day. Yeah. It was a pleasure. Thanks for having us. This was awesome. Thanks, Tobias. Thank you for listening. Don't forget to check out our other show, The Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Ryan Abernathey and Joe Hamman
Ryan's Journey with Python
Joe's Introduction to Python
Overview of the Pangeo Project
The Birth of Pangeo
Challenges and Community Building
Core Elements and Extensions of the Pangeo Stack
Scientific Domains Using Pangeo
Managing Source Formats and Data Abstraction
Cloud Computing and Its Impact on Geosciences
Modern Data Stack and Scientific Data
Exploring Cloud Native Storage Formats
Evolution and Future Goals of Pangeo
Innovative Uses and Success Stories
Lessons Learned in Community Building
Future Plans for Pangeo
Community Resources and Education
Picks and Recommendations