Summary
Science is founded on the collection and analysis of data. For disciplines that rely on data about the earth, the ability to simulate and generate that data has been growing faster than the tools for analyzing it can keep up. To help scale that analytical capacity for everyone working in the geosciences, the Pangeo project compiled a reference stack that combines powerful tools into an out-of-the-box solution that lets researchers be productive in short order. In this episode Ryan Abernathey and Joe Hamman explain what the Pangeo project really is, how they have integrated Xarray, Dask, and Jupyter to power these analytical workflows, and how it has helped to accelerate research on multidimensional geospatial datasets.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
- Your host as usual is Tobias Macey and today I’m interviewing Ryan Abernathey and Joe Hamman about Pangeo, a community platform for Big Data geoscience
Interview
- Introductions
- How did you get introduced to Python?
- Can you describe what Pangeo is and the story behind it?
- What is your role in the project/community and how did you get involved?
- What are the goals of the project and community?
- What are the areas of effort and how are they organized?
- What are the scientific domains that Pangeo is focused on supporting?
- What are the primary challenges associated with data management and analysis in these scientific communities?
- What are the forms that these data take and how have they been evolving? (e.g. formats/sources)
- What are some of the challenges introduced by the widespread adoption of cloud resources and the associated architectural patterns?
- Can you describe the technical components that fall under the Pangeo umbrella?
- How do they come together to form a functional workflow for the geosciences?
- How has the scope of the Pangeo project changed or evolved since it started?
- What are the most interesting, innovative, or unexpected ways that you have seen Pangeo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pangeo?
- When is Pangeo the wrong choice?
- What do you have planned for the future of Pangeo?
Keep In Touch
- Joe
- @HammanHydro on Twitter
- Ryan
Picks
- Tobias
- Ryan
- Klara And The Sun by Kazuo Ishiguro
- Joe
- Range by David Epstein
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Pangeo
- Pangeo Forge
- CarbonPlan
- M2LInES
- LEAP
- Columbia University
- Xarray
- MIT
- MATLAB
- PHP
- Ruby
- Java
- NumPy
- SciPy
- Matplotlib
- C
- Fortran
- Perl
- Dask
- Jupyter
- IDL
- HDF5
- Unidata
- NetCDF
- CF Metadata Conventions
- Intake
- FSSpec
- Parquet
- Zarr
- Data Engineering Podcast
- Airbyte
- Fivetran
- Stitch
- TileDB
- Pythia
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it's easy to get started with the next generation of deployment and scaling powered by the battle tested Linode platform, including simple pricing, node balancers, 40 gigabit networking, dedicated CPU and GPU instances, and worldwide data centers.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. So now your modern data stack is set up. How is everyone going to find the data they need and understand it? Select Star is a data discovery platform that automatically analyzes and documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who's using it in the company, and how they're using it, all the way down to the SQL queries. Best of all, it's simple to set up and easy for both engineering and operations teams to use.
With Select Star's data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at pythonpodcast.com/selectstar. You'll also get a swag package when you continue on a paid plan. Your host as usual is Tobias Macey. And today, I'm interviewing Ryan Abernathey and Joe Hamman about Pangeo, a community platform for big data geoscience. So, Ryan, can you start by introducing yourself?
[00:01:56] Unknown:
So I am a Python hacker who moonlights as a professor of oceanography at Columbia University. I have been involved in both science and software for about 10 years now, and it's always, you know, evolving where I'm focused. But right now, I'm very focused and very engaged on this Pangeo project. And so we're really excited to share what we've been doing with you. And, Joe, how about yourself?
[00:02:26] Unknown:
Yeah. So my name is Joe Hamman. I'm a climate scientist and an engineer. My day job is the technology director at the nonprofit CarbonPlan. And there, we do data science and policy around the climate problem. My path to this podcast, I guess, is largely through my scientific research, and I ended up working on the Xarray project, which we'll get into later, and which turned into the Pangeo project anyway. So I started doing that also about 10 years ago.
[00:02:54] Unknown:
I have really enjoyed my time in the Python world. Yeah. Definitely interested to dig more into the history of Xarray and Pangeo and how they relate. But before we do that, Ryan, why don't you share how you first got introduced to Python?
[00:03:06] Unknown:
I went to grad school at MIT. In my work, I work with a lot of data from satellites and simulations of the ocean. So I'm an oceanographer. You know, I started grad school in 2006, and I was given MATLAB. That's what everyone was doing. You know, my background prior to that as a programmer, I knew how to code in PHP. I knew a little Ruby. I knew Java, and MATLAB was new to me. I liked some things about it, but after doing my first project, I was kinda just like, yeah, no, I need to do an open source something. And that was about 2008. And at that time, you know, scientific Python was just starting to emerge.
I was early enough that I compiled and built NumPy, you know, SciPy, Matplotlib from source, you know, before there were all these great installers like we have today. It just took off. I loved using it. I wrote all my stuff in it, and I did a bunch of research with it. But then the real turning point was when I started to get involved in community open source, which happened more around probably 2014, 2015.
[00:04:12] Unknown:
And, Joe, do you remember how you first got introduced to Python?
[00:04:15] Unknown:
Yeah. My story is not all that different from Ryan's. I was in graduate school at the University of Washington, and the lab I was in had historically used a really wide range of programming tools, including C and Fortran and Perl, all kinda wired together in a web of shell scripts. And this is how the world worked there. And so my advisor at the time was interested in focusing the group on Python and needed a test subject or a guinea pig. And he said, hey, why don't you go learn how to do this and see how it works? And he was also learning at the time. So that's how I got started. What really got me excited and curious was the community aspect of Python, and I was just like, this is so cool. There's, you know, groups of people, they don't necessarily work in the same place, but they're working on the same tool because it meets their needs, and it scratches an itch in one way or another.
And that sort of community focus of the Python world is really what drew me in. And I went from teaching myself to contributing to some packages just out of kind of curiosity and interest in how that all worked.
[00:05:18] Unknown:
And so that has led you both to the Pangeo project, which brings us here today. I'm wondering if you can give a bit of an overview about what that project is, some of its scope, and maybe the story behind it and how you each got involved with it. It all started with Xarray.
[00:05:34] Unknown:
Right? So do we need to talk a little bit about Xarray? So Xarray is a fantastic tool for data science, particularly for scientific data, which doesn't necessarily fit into that tabular data model that Pandas addresses so well. Back in around 2013 to 2014, a lot of us who were working with scientific data in Earth system science, so, like, big satellite datasets or model simulations, we all had our own little private version of something that looked like Xarray. It was basically some kind of data structure that held together many different NumPy arrays that were related to each other, maybe some metadata thrown in there. And we were writing sort of these custom wrappers, you know, custom layers to do things with these data structures.
Turns out someone who's a much better programmer than me, and definitely than I was at that time, started developing something like that and did a really, really great job. And that was Stephan Hoyer. And so, you know, basically, when I discovered Xarray, I immediately stopped working on my own little thing that did this job and started using Xarray and then contributing to Xarray. And on the Xarray mailing list, there was kinda a lot of traffic between people in our research field, you know, Earth system science, let's just call it. And we really had the idea that we should coordinate our efforts and really try and get some momentum going, because Xarray was useful, but at the same time, there was a lot more unrealized potential, particularly in scaling out Xarray to do really big, large scale data processing. And around that time, some datasets were coming out that we were measuring in petabytes. So in particular, I remember this one simulation that was run by some colleagues at NASA JPL.
It was an ocean model simulation, this gorgeous, super high resolution, high fidelity simulation of the global ocean. So we're talking about simulating the global ocean at one kilometer mesh resolution. And it dumped out over a petabyte of data onto, like, one of these supercomputers at NASA. And I remember feeling so frustrated that, like, it was pretty easy for that model to run. It wasn't trivial, but, like, it was straightforward. Like, those HPC simulation codes are designed to scale out to tens or hundreds of thousands of CPU cores. But we really had no analysis tools that could then deal with the data from that simulation and actually do science with it. And there were a lot of other people in the same boat at that time. And so we wanted to find a solution to this sort of, you know, scaling data analytics problem. We basically organized a workshop and got people together in person, back when we used to get together in person, and that's how Pangeo was born. And maybe Joe can tell the story of what happened at that meeting. There were, let's say, 20 folks at this meeting at Columbia that Ryan hosted and a bunch of us came out
[00:08:30] Unknown:
for. And we kinda laid out what are the key challenges facing the geoscience community when it comes to software and data, and looking forward, what is the world we wanna see exist? And, you know, out of that grew the mission of Pangeo and some ideas for projects that we could work on. At the core of that was really connecting Xarray and Dask to enable kind of parallel, scalable data analysis that was built on top of Xarray and making that interactive. And so we also had a focus on bringing in interactive computing using the Jupyter project. After the workshop, we had some findings. There was a website, which I think probably still exists, with a couple of ideas on it. Shortly thereafter, a proposal call came out from the National Science Foundation. We responded to that proposal call and got the project funded. Ryan and I were both on that. Ryan was the lead PI. So for 3 years, we used that funding to integrate Xarray and Dask and Jupyter into what was really the beginnings of the Pangeo project, these kind of connections that enabled interactive, scalable data analysis on the types of data that a lot of geoscientists are using.
[00:09:42] Unknown:
Yeah. And when I was starting to prepare for this podcast and digging into some of the documentation and the resources around it, as I was hearing about it from other folks that led me in this direction, it's not so much that it's a single project that you can pip install, as you would with Xarray or Dask, as it is that it's sort of this parent community that works to bring together these projects, as you said. And so there's no single, like, sort of Pangeo project per se from a software perspective, as much as it is a project to enable these types of research that you're discussing. I'm curious if you can maybe talk to some of the interesting challenges and perspective that that brings at a sort of software and community level, to say, you know, we have the Pangeo project, but there's no actual Pangeo source code that you can deal with, and just how you manage that messaging and help using the Pangeo project as this kind of umbrella to bring all these people together.
[00:10:35] Unknown:
It's a great question, and it goes to the heart of what Pangeo is. I think what we have done is bundle together a lot of different tools that in isolation may not seem as relevant to the geoscience community. By bundling them together and working to make them integrate, we have presented them as a unified solution, and I think that was really necessary because at this stage a lot of the community was thinking about transitioning from much more monolithic analysis environments like IDL or MATLAB. And a lot of scientists, when they would look at Python, they would kind of throw up their hands and say there's a million different packages. You know, what am I supposed to use? You know? And there is this sort of core consensus for scientific Python on NumPy and Matplotlib. They basically get you to MATLAB-like parity, but there's actually a lot more higher level capability that libraries like Xarray and Pandas give you. What bundling stuff together achieved was a couple of different things. One, it made it possible to fund from the National Science Foundation. The federal government spends a ton of money on software development activities, but a lot of it is really wasted on projects that don't have a lot of users or don't have a lot of impact. And at the same time, we know that the open source world, you know, is chronically underfunded relative to its impact.
By bundling stuff together, we were able to essentially market these tools as one unified solution that a funding agency could get behind and say, oh, yeah, we see how this aligns with our mission and the things we wanna support. And so they've been able to fund it. And at the same time, by bundling them together, we've also made it, I think, more accessible for institutions to, like, adopt this software stack. One good example of that is the National Center for Atmospheric Research, a really major large research lab in climate and atmospheric science. Pangeo is their in-house analysis stack, as they sort of transition away from an older tool that they used to maintain, a domain specific analysis language called NCL, the NCAR Command Language. And I don't think that could have happened if we didn't try to create this narrative of how all these tools work together to solve, like, real end to end workflows.
But the downside of this is I think there's still a lot of confusion about what Pangeo is. Like, is it a group of people? Is it a software? Is it a cloud infrastructure? The real answer is kind of all of the above.
[00:12:59] Unknown:
If I can zoom in on one thing, I think that might be another interesting way to look at what the Pangeo project has done and how to think about it. The life cycle of a scientific researcher's idea is usually to start doing some data analysis and run into a bunch of problems. Like, software doesn't solve all problems, and so you have to write some code yourself. And scientists have, for a long time, been really good at kinda hacking their way to the finish line. And what we did in Pangeo is kinda break that pattern and insert some community opportunities for getting help, collaborating with developers on all of the various touch points of the software ecosystem that scientists were using.
And rather than saying, okay, I'm gonna hack my way to the finish line, we say, okay, now we start integrating and working with the developers of these packages that might be able to help us fulfill our goal without going through that laborious hacking process. So what that's done is use case driven development of a bunch of these individual software libraries. And, overall, it's kind of the rising tide lifts all boats idea. A bunch of those use cases then get solved for the whole community. I think the community has caught on to this, that, okay, you know, I could hack my way around this, or I could open an issue on GitHub that might motivate some conversation. Maybe me, maybe someone else knows how to fix that. And, eventually, that ends up in a library like Xarray or Dask or something, and then that's a solved problem for a lot of people. So that kind of broke the normal development cycle, and it's been a flywheel effect that has extended beyond just the core climate science use cases. One of the coolest parts about what Pangeo has accomplished is rethinking how that development cycle works.
[00:14:42] Unknown:
You mentioned that the core elements of the stack are Xarray, Dask, and Jupyter, and being able to tie them all together to be able to do this multidimensional analysis in a scale out format with an appealing user interface. And given that core structure that people can build off of, I'm curious what types of extensions and kind of peripheral projects have been built around that core that may not necessarily be considered part of the broader Pangeo umbrella, but are part of that same community.
[00:15:12] Unknown:
There's an image a lot of us have in our heads of the Python ecosystem as kind of an onion, where Python is in the middle and you have some foundational libraries, say, NumPy and Jupyter. And then a layer further out on the onion might be Xarray and Pandas and a few of these higher level data analysis tools. What we found is that by better integrating all of those foundational tools, we enable the development of a bunch of domain specific tools. And so I think one of the coolest things that we've seen in the last few years is that now you have a package for each domain kind of developing that is built on top of this kind of deep stack of well connected tools.
And so that's things like a lidar processing library for looking at data coming out of a NASA satellite or a library for doing high dimensional grid analysis and that sort of thing. So the way that the ecosystem has kind of blossomed around that is, I think, a function of the fact that these foundational tools solve the core problem, and then there's just, like, the last mile problem that individuals have been able to tackle on their own. We've been trying to advance this sort of architecture for what a package in our ecosystem would do.
[00:16:23] Unknown:
We had this on our website, a couple of, like, principles that we would encourage those packages to follow. But the basic idea is, like, if we have tools that can sort of consume and produce Xarray datasets lazily, we get a lot of interoperability out of that. Traditionally, a lot of geoscience workflows have interoperated through files. So, like, you'll have a command line tool that, like, reads a file and then, like, writes a file. And these are usually NetCDF files, which is, like, this really metadata-rich format for exchanging information.
And what we're trying to do with Pangeo is to not touch the file system, but have packages that can pass information to each other at this high level, maybe, you know, annotating their results with, you know, units or extra metadata or, you know, things like that, and then be able to chain different pieces of this ecosystem together. So, like, one specific example of a package that we just came out with that I feel really exemplifies this is this tool called GCM Filters. It is a tool for doing convolution based filtering of data that lives on these complex meshes that we use in Earth system modeling. Right? So a very common, like, data analysis need is to, like, do smoothing-based convolution.
And, like, SciPy or, like, scikit-image or something, they all do this sort of stuff, but only in sort of rectangular, image based space. What we have instead in the geosciences is, like, we have to do convolution on the sphere, in, like, a curvilinear coordinate system with, like, this irregular mesh. And so by leveraging the Pangeo ecosystem, we could write code that really just targeted that key piece. It doesn't have to worry about, like, IO. It doesn't have to worry about, like, how to parallelize those operations. We just consume and produce Xarray datasets, and then it can integrate with the rest of the ecosystem. And that's what we're going for, but it's kind of hard. It's a long road and doesn't always work out. It's not always so clean and neat, and there's still plenty of confusion and duplication and inefficiency in this ecosystem. I don't wanna suggest that it's, like, solved. It's more of a vision of how we'd like things to work.
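To make the "consume and produce Xarray datasets lazily" idea concrete, here is a minimal sketch. It is not the actual GCM Filters API; the function name, variable names, and the synthetic dataset are illustrative, and it assumes NumPy, Pandas, Xarray, and Dask are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

def smooth(ds: xr.Dataset, window: int = 5) -> xr.Dataset:
    """Toy 'Xarray in, Xarray out' operator: a centered rolling mean along time.

    No file I/O happens here; if the input is Dask-backed, the result stays lazy.
    """
    return ds.rolling(time=window, center=True, min_periods=1).mean()

# Synthetic stand-in for model output, chunked so everything downstream is lazy.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(365, 90, 180))},
    coords={"time": pd.date_range("2020-01-01", periods=365)},
).chunk({"time": 30})

smoothed = smooth(ds)        # still lazy, nothing has been read or computed
result = smoothed.compute()  # triggers the (possibly distributed) computation
```

Because the operator only sees Xarray objects, the same function can be chained with any other package that follows the same convention, regardless of where the data originally came from.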
[00:18:35] Unknown:
In terms of the scientific domains that you see orienting around the Pangeo project and the Pangeo stack, I'm wondering if you can maybe categorize them a bit and talk to some of the sources of pain that they're experiencing in their, maybe, so-called native ecosystems that pushes them in the direction of investigating and adopting Pangeo to help solve their problems, and maybe some of the ways that Pangeo can act as a force multiplier by being this sort of common infrastructure that different domains can collaborate with.
[00:19:10] Unknown:
Climate science was the core one, and that's where a lot of the early motivation for use cases and applications and those pain points came from. And so there, we were working with very large, so, petabyte scale collections of multidimensional data. There were access problems to that data. It was mostly sitting behind FTP servers or HTTP servers. Individuals or individual research groups would grab some part of it. They would do their research on it. Climate science, we came in and said, okay, so let's start by making it easier to work with that data using Xarray and Dask. And since then, we really, we haven't gotten into it yet, but cloud computing has become a larger and larger part of what the Pangeo project has spent its time focused on. And that's largely because it brings to the forefront this opportunity for a data commons like environment, where data is accessible to multiple researchers or research groups.
And in the climate sciences, we're starting to see that by putting a bunch of climate data in the cloud. So Ryan has led a project to put over a petabyte of data from the Coupled Model Intercomparison Project into the public cloud, and that's available to all researchers anywhere around the world now, on both Amazon and Google Cloud. And that's really kind of changing the game for how researchers access this data. Before, you either had to work at a place that could store that much data, and there aren't very many places in the world like that, or you had to choose a small subset to work on. So today, that's a totally different way of operating on the data. In terms of the domains, I guess, who is using it and how they end up using it? You know, I think the core
[00:20:48] Unknown:
nucleus has always been, like, the stuff that Joe and I and people in our sort of narrow radius do. It's, like, analysis of high resolution Earth system observations and simulations, really focused on this data analysis problem. Right? Because, like I said earlier, there's kind of a paradox right now in Earth system science: our ability to simulate has actually outpaced our ability to understand those simulations, because those simulation codes scale out so well. And, like, we have supercomputers that run them, and, like, you can really easily run a lot of simulations about the Earth. But then you need a way to analyze the data that comes out of them to actually get to some interesting science. And so I would say the core users of Pangeo are people who are analyzing very large ensembles of Earth system simulation data.
That is probably, like, the core group. And from there it's sort of diffused out to proximate areas like remote sensing analysis, you know, analysis of satellite imagery and other measurements of the Earth from space, analysis of data that is collected from autonomous sensors and robots in the ocean or in the atmosphere, weather balloons, things like that. Anytime someone is, I think, dealing with a dataset that is too big to easily fit on their laptop or, you know, in memory on their computer, we think Pangeo can help accelerate their work, specifically by letting you use the same tools and the same code that you use on small data for big data. Like, that's the real beauty of Dask particularly, and, like, Pangeo gets that for free by using Dask: you can have code that works on small data in memory, and then you can have the same code working with a very large dataset with distributed computing. And so the transition to scaling out your research is way less painful when you're doing it that way. In the past, what people were doing is essentially batch scripts on HPC systems processing files, like, a very file centric workflow. Like, read this file, you know, produce something with it, write out another file, like, then do some kind of reduction or something like that. And the abstraction we're going for is that you don't have to think about files. You think about datasets and you think about physical dimensions and coordinates, and that is what you write your code around, not what files do I have to process.
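As a rough illustration of that "same code on small and big data" point, here is a sketch assuming Xarray and Dask (with dask.distributed) are available; the dataset, variable names, and chunk sizes are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr
from dask.distributed import Client

def monthly_anomaly(ds: xr.Dataset) -> xr.Dataset:
    """Analysis written against dimensions and coordinates ('time'), not files."""
    climatology = ds.groupby("time.month").mean("time")
    return ds.groupby("time.month") - climatology

# Synthetic stand-in for two years of daily fields.
ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(730, 45, 90))},
    coords={"time": pd.date_range("2019-01-01", periods=730)},
)

# Small data: eager, NumPy-backed arrays, fine on a laptop.
anom_small = monthly_anomaly(ds)

# Big data: the exact same function on a chunked, Dask-backed dataset.
# Attaching a distributed cluster (local here, but it could be dask-jobqueue
# on HPC or a cloud cluster) scales the identical code out.
client = Client(processes=False)
anom_big = monthly_anomaly(ds.chunk({"time": 90})).compute()
```

The analysis function never changes; only the backing of the arrays does, which is the transition being described here.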
[00:23:14] Unknown:
And I'll just add, to answer the complete part of your question, which is what other scientific domains beyond climate and remote sensing in the geosciences: I think the life sciences is an area where we're seeing parts of what we're working on in Pangeo get picked up. It's not a wholesale adoption, and that's actually one of the, maybe the best parts of Pangeo, that you can pick and choose the parts that fit your applications appropriately. But we're seeing applications in neuroscience, in genomics, and in bioimaging that are all using some different parts of what we would kinda consider the Pangeo ecosystem or the Pangeo stack.
[00:23:49] Unknown:
Moving back to the file interface that you're trying to move away from, I'm curious, in terms of the source formats that you're dealing with, how you're able to help people with managing that abstraction to say, we don't want you to think about the source files, we want you to think about the dataset, but you actually still need to work with these source files to get them into the dataset to begin with, and just some of that dimension of being able to work with all of these different scientific formats that can be quite esoteric and complicated, and sort of where the multidimensional aspect of Xarray comes in. But I know, like, for instance, HDF5
[00:24:23] Unknown:
can be very complex in terms of what you stuff into it. I think it's really important for us to shout out an organization called Unidata, a file format called NetCDF, and a metadata convention called the CF Conventions. Compared to other fields, I have learned that in geoscience we're really lucky to have a great standardized file format that's really broadly used in our field and really good metadata conventions that come with that file format. So much work has already been done with geoscience data to make it FAIR, you know, findable, accessible, interoperable, reusable, right, through the use of those kinds of standards. The last mile that we've kinda had to cross with Pangeo is to just deal with the fact that, like, most datasets are distributed as many files. Right? Like, every now and then you'll find that, like, 100 gigabyte HDF file. But, like, for the most part, if you go download data from NASA, they're gonna give you, like, one, like, 50 megabyte NetCDF file per day, and there'll be, like, 10,000 of them for the dataset that you wanna work with. That's a very common, you know, access pattern.
Xarray does a great job with this. It has this magic function called open_mfdataset, open multi-file dataset. And I think for a lot of us, when we first started using Xarray, this was just, like, the candy that made us love Xarray, just the fact that we could instantly open this big collection of files and treat them as one virtual dataset. And it's a huge cognitive load lifted off your brain, to focus on the big picture. We started from a pretty good position. But, of course, then once you try and scale out, you really come right up against, like, what are the actual IO limitations of the computing hardware we're using. Right? Like, that's actually then where the bottleneck starts to lie. In a way, Xarray can make things look too easy. It can instantly open and show you all these files. But then when you actually start trying to compute stuff, you have to deal with, well, how fast can my hard disk get me data? Or this HPC file system, you know, what kind of contention is going on here as I try to work with this? Or in the cloud, can I efficiently read this format? But it's good to be up against those limits. That means that you're reaching the saturation limit of what your computing hardware can deliver.
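For readers who haven't seen it, a minimal sketch of what that looks like; the file paths are hypothetical, and this assumes Xarray with Dask and a NetCDF backend installed.

```python
import xarray as xr

# Hypothetical layout: one small NetCDF file per day, thousands of them.
# open_mfdataset lines the files up using their coordinate metadata and
# presents them as a single, lazily loaded virtual dataset.
ds = xr.open_mfdataset(
    "data/sst_*.nc",       # glob over the many per-day files
    combine="by_coords",   # use coordinates/metadata to concatenate correctly
    parallel=True,         # open the files in parallel with Dask
)
print(ds)                  # metadata only; the arrays haven't been read yet

# Work with it as one dataset; chunks are read only when a computation needs them.
weekly_mean = ds["sst"].resample(time="7D").mean()
```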
[00:26:45] Unknown:
Something that's emerged out of this is some even higher level abstractions on large, multi-file datasets. And so I think this is where kind of the catalog space comes in. And two projects we've leveraged quite a bit in the Pangeo ecosystem are Intake, which is a library for cataloging data and then loading it into Python objects, and then fsspec, which is a library for accessing local and remote data under a common API. So I think these are two things that have really kinda changed the game for us in how we think about data organization and data access. And they plug in directly with Xarray and Dask, and they work in the cloud, or they work on HPC, or they work remotely and locally.
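A rough sketch of how those two pieces get used together, assuming the intake-xarray plugin, an S3/GCS backend for fsspec, and Zarr are installed; every URL, bucket, and catalog entry below is hypothetical.

```python
import fsspec
import intake
import xarray as xr

# fsspec: one file-system API across local disk, HTTP, S3, GCS, and more.
fs = fsspec.filesystem("s3", anon=True)
print(fs.ls("some-public-bucket/climate/"))   # list remote objects like files

# The same abstraction lets Xarray open a cloud-hosted Zarr store directly.
store = fsspec.get_mapper("s3://some-public-bucket/climate/sst.zarr", anon=True)
ds = xr.open_zarr(store, consolidated=True)

# Intake: the catalog describes datasets once, so users can load them without
# knowing the underlying paths, formats, or chunking.
cat = intake.open_catalog("https://example.org/catalog.yaml")
sst = cat["sea_surface_temperature"].to_dask()   # returns an xarray.Dataset
```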
[00:27:30] Unknown:
Another interesting element of this overall problem space that you've touched on a couple of times is the advent of the cloud and its increasing popularity, and the different access patterns that it enables and requires as compared to HPC, which is where a lot of this research and analysis has been done up till now. And I'm curious if you can talk to some of the challenges that this move to the cloud brings to this analytical flow and some of the ways that Pangeo is helping people make that migration off of these monolithic HPC systems, where you have to try and allocate your time share, versus the public cloud, where you just need to give them a credit card and hope that your budget doesn't run out. It's first important to say that our HPC centers that we have in science are an incredible resource
[00:28:19] Unknown:
and the people who run them are really fantastic. We would be nowhere without them. So it's not HPC or cloud. It's really HPC and cloud. HPC has a really important role to play here. I don't think that's gonna go away anytime soon. As far as cloud for us, like, we got into it by accident, basically. We submitted this proposal in 2017, and NSF asked us to trim the budget. And one thing we put in that budget was, you know, like, $50,000 to buy a couple of servers to, like, store data, because that's the kind of thing that, you know, we put in proposal budgets. We wrote back to NSF. We said, okay, we'll cut these servers, but can you give us some cloud credits? Because at that time, NSF was running this program called Big Data. They partnered with Microsoft, Google, and Amazon to, like, just directly grant cloud computing credits to researchers.
In retrospect, it was a very good marketing ploy by the cloud providers, because now we've become total cloud evangelists. So we got, like, $100,000 worth of Google Cloud credits, and we just kinda went nuts. In retrospect, it's remarkable how kind of fast and loose we played with this. We really just started spinning stuff up and trying things. But we basically stood up this cloud based JupyterHub connected to a bunch of cloud based data and basically let anyone who wanted to access it and play with it. And it was super fun and liberating to just be running and deploying our own infrastructure. Right? If you're used to working with HPC, it's very much, like, a sysadmin versus user relationship. You know? Like, you have to beg them to do things. Like, they're very conservative. They're very security, you know, aware.
Resources are limited. They're large, but they're limited. And cloud, it was just kind of like a completely new paradigm. We could do anything we wanted. We could build anything we wanted. We had a lot of money, so we weren't really worried about, like, our spend. In retrospect, now all of those things become serious concerns as the project matures and we try to, like, actually run real production infrastructure. But at the time, it was very exciting and liberating, and we learned a ton of stuff. We experimented a lot. And I do think that we helped to chart a course into cloud computing for the geosciences, and maybe science, you know, maybe a little bit more broadly than just geosciences.
Things are great in geosciences because we have very few restrictions around our data. We don't have, you know, privacy. We don't have personally identifiable information. We don't have HIPAA regulations around, you know, health data. It's basically all Creative Commons licensed, and you can just do whatever you want with it for the most part. And so we were able to just forklift data into the cloud and start computing on it, and it was awesome. One technical thing that's worth getting into is this whole question of cloud native data formats. Right? So far we've talked about HDF and NetCDF.
What we learned in 2017 is that we couldn't just put HDF data or NetCDF data into cloud object storage and compute on it in a convenient way, because we didn't have a way to, like, open those files. They're really these complex, opaque binary formats, and there's, like, a C library that has to read them. And at that time it was just baked in that it's gonna be, like, a POSIX file system where the data live. So you just have to, like, download the whole file if you wanted to even open it up. And so that didn't feel very, like, cloud native to us. What we really want is a data format that can just be accessed directly through HTTP calls, which is how object storage works, and where we can sort of get the metadata without downloading the whole file, and where we can efficiently subset or select from the data. So we experimented with a lot of different formats. You know, Parquet is sort of the canonical example of a cloud native format for tabular data.
It's not ideally suited for multidimensional arrays, and so we really put a lot of time into working on Zarr. Zarr is an open source, community driven project that's more or less API compatible with HDF. But whereas HDF stores things in one single file, Zarr just kind of explodes the dataset into many individual metadata objects, which are just JSON files, and then binary blobs of data that can be compressed and chunked and stuff like that. And so we've spent a lot of effort integrating Zarr, Dask, and Xarray to provide a real cloud native workflow, and it's awesome. If you are working with Zarr data in the cloud, using Dask to scale out on hundreds of compute nodes, you can absolutely burn through data, really just, you know, getting right up against the limitations of the hardware, saturating the network, saturating the CPU, just really processing data as fast as the computers will allow. That's quite exhilarating when you're doing that at scale.
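To make the Zarr layout a bit more tangible, here is a small sketch, assuming Xarray, Dask, and Zarr are installed; the store name and variable are illustrative, and the same calls work against object storage through an fsspec URL or mapper.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A chunked (Dask-backed) dataset; each chunk becomes one object in the store.
ds = xr.Dataset(
    {"sst": (("time", "lat", "lon"), np.random.rand(365, 180, 360))},
    coords={"time": pd.date_range("2020-01-01", periods=365)},
).chunk({"time": 30})

# Writing "explodes" the dataset into a tree of small objects (Zarr v2 layout):
#   sst.zarr/.zmetadata      consolidated JSON metadata, readable in one request
#   sst.zarr/sst/.zarray     per-array JSON metadata (shape, chunks, compressor)
#   sst.zarr/sst/0.0.0 ...   compressed binary chunks, one object per chunk
ds.to_zarr("sst.zarr", mode="w", consolidated=True)

# Reading back: grab the metadata first, then only the chunks a computation touches.
ds2 = xr.open_zarr("sst.zarr", consolidated=True)
monthly = ds2["sst"].resample(time="1MS").mean().compute()
```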
[00:33:08] Unknown:
And it's part of, I think, the fun of Pangeo when users get to experience that. It's interesting hearing about this because I have another podcast about data engineering, and it's very focused on what you're talking about with Parquet and these cloud native formats and scalable compute and analysis. And it's always interesting hearing the perspective from people who are dealing with more complex datasets and data structures beyond just the tabular. And so I'm curious if there's any sort of analog to what's happening in the data ecosystem with data warehouses and the quote unquote modern data stack and incremental stream processing or, you know, managing that whole data pipeline life cycle, as it applies to the geosciences and just sort of general academic scientific exploration writ large?
[00:33:55] Unknown:
It's a great problem, because the modern data stack has so much momentum right now and it really doesn't address scientific data, which means there's this huge opportunity, as science moves into the cloud, to build out that type of tooling, but focused on this more complex data model that we deal with in scientific research. And I think that's a super exciting opportunity. You know, there are great patterns in the modern data stack, you know, but the fact is I can't put, you know, my climate model data in Snowflake. Or if I can, they just call it, you know, unstructured data. No, it's not unstructured. It's highly structured. It's just not the structure that those tools want to assume. What we are trying to build now, with funding from the NSF EarthCube program, is a framework called Pangeo Forge, which is an ETL tool focused on cloud native scientific data. You can think of it as maybe vaguely comparable to what something like Airbyte or Fivetran or Stitch does in terms of connecting data sources, you know, ingesting data into a data lake, possibly with transformations in there, and storing it in a cloud native data catalog. That's where, like, 100% of my sort of developer effort is focused right now, and we're really psyched about it. It's kinda just come alive,
[00:35:13] Unknown:
and I could probably talk about it for the next hour. But I'll leave it there for now. That's great. I'll throw one more thing in here, which is that the reason Pangeo Forge needs to exist today is largely out of the success of the workflow and the model of using Pangeo tools in the cloud. You know, we noticed a pattern, just like I was talking about earlier, how that kind of science workflow has found a bunch of detours that build general solutions to problems that we find over and over again. And as we've been developing the Pangeo cloud ecosystem, we found that one of the hardest and biggest pain points is getting data into cloud optimized and analysis ready data formats, and then pushing it and tracking it in the cloud. And so out of that need grew Pangeo Forge, and it's something that we feel like we could put a whole community effort into at this point, because it's the keys to the castle in terms of, you know, interactive big data geoscience going forward.
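For a sense of the pain point being described, here is a hand-rolled sketch of the extract-and-publish pattern that Pangeo Forge is built to automate and track. This is not Pangeo Forge's own recipe API; the source URLs, bucket, and chunk sizes are all hypothetical, and it assumes fsspec (with HTTP and cloud backends), Xarray, Dask, and Zarr.

```python
import fsspec
import xarray as xr

# A pile of archive files published one month at a time (hypothetical URLs).
source_urls = [
    f"https://data.example.org/archive/temp_2020{month:02d}.nc"
    for month in range(1, 13)
]

# Stage the remote files locally; the simplecache:: protocol downloads each
# URL to a temporary file and returns the local path.
local_files = [fsspec.open_local(f"simplecache::{url}") for url in source_urls]

# Combine into one logical dataset and rechunk for cloud access patterns.
ds = xr.open_mfdataset(local_files, combine="by_coords")
ds = ds.chunk({"time": 24})

# Publish a single analysis-ready, consolidated Zarr store to object storage.
target = fsspec.get_mapper("gs://my-hypothetical-bucket/temp-2020.zarr")
ds.to_zarr(target, mode="w", consolidated=True)
```

Doing this by hand for every dataset is exactly the repetitive, error-prone work that a recipe framework with shared infrastructure is meant to take over.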
[00:36:15] Unknown:
Spending a bit more time on the sort of cloud native storage format, I'm curious if and how much you've explored the TileDB platform for being able to handle these multidimensional data structures in this sort of matrix, vectorized sort of storage format that is designed for the cloud, and maybe some of the limitations that it has that lead you away from maybe endorsing it in a full throated manner?
[00:36:41] Unknown:
We love TileDB. It's a great tool and a great format and a great company. I think the reason we haven't done more with it is a pretty simple practical problem, which is that Xarray cannot write TileDB. Xarray is our Swiss army knife of data formats. It can read every format we know of, and it can write most formats we would want to write. And it has an extensible architecture, through entry points, that allows any third party to implement readers and writers. For whatever reason, writing a TileDB writer for Xarray has never been implemented. And therefore, we don't have the ability to create TileDB data. Almost all of our data is coming from a legacy data format. And so it's really just this one technical blocker. You know? I think to some degree, there are some dynamics where the fact that there's a big company behind TileDB might make some people in the open source community a little less likely to just roll up their sleeves and, like, do that work, kind of just hoping that, you know, with all of that funding and all of the engineers at TileDB, you know, they can do that work. Whereas with Zarr, there was no company. Now there are a few proposals that fund it, but it was more just like, well, everyone pitch in and, like, let's hack this thing together and make it work. So that's really the only reason. Now, I think TileDB, to be honest, is a technically superior format in most ways to Zarr. It's very clever. It's very high performing. So in the future, I wanna work with it more.
[00:38:13] Unknown:
Alright. Well, I consider that a challenge to the listeners.
[00:38:17] Unknown:
Well, if someone wants to implement this, like, please do it. It would be a huge service to the community.
[00:38:22] Unknown:
And so in terms of the overall sort of scope and the goals of the Pangeo project, I'm curious if you can talk to some of the ways that those have shifted from when you first began working on it to where you are today and where you're looking to go in the future.
[00:38:38] Unknown:
You know, in the early days, we had kind of three main goals: foster collaboration around open source Python for applications in the climate sciences and geosciences more broadly, support all these domain specific packages, and then improve their scalability to petabyte scale data analysis problems. Where we're at today is not that we've solved all those problems, but we've solved some of those problems. And what we're seeing, as we've kind of transformed into something that's more user focused, like, there are probably more users in Pangeo than there are developers today, which is different than the early days where most users were developers, is that Pangeo is kinda growing into more of an open science community with a lot of people that are using the tools.
And not to say that there are fewer developers, there are probably more developers, but the ratio is probably higher on the user side. When I say open science community, it's still a place where there's a lot of collaboration on how to use tools and then on developing those software tools, and also on the data side. But the focus is kind of increasingly more on the science user than on the software packages themselves, which I think is a great evolution of a maturing project, to see it move from, alright, here are, like, three software packages that we're developing, to here's a collection of tools and ideas and architectures that we can use to have an effect on the problems, the research problems, the scientific problems that we're facing. You said a keyword, which is open science,
[00:40:06] Unknown:
and this is not something we talked about, and we didn't use that word as much in 2017 as we do today. We can see huge momentum behind the open science movement from agencies like NASA and NSF. Open science, loosely, is what happens when you have that: a more collaborative approach to scientific research that I believe is very much inspired by what we have happening in open source. You know, in the open source world there's not really so much competition. I mean, there are some cases of competition, but in general it's kind of this feeling that we're all in this together. Let's try and improve this software so we can all do better work. Let's work together. Let's not duplicate efforts. Let's try and find a place for new contributors. And I think a lot of scientists would like to be working with a similar ethical framework on their research.
And so what I see for the future of Pangeo is to see if we can actually change the culture of science, where we do our work more in the open and are more open to outside collaborators getting involved and, you know, augmenting our projects. And I think that's gonna take science farther in the coming decades than the sort of more individualistic, siloed model has in the past. So it's an open question whether Pangeo can actually do this for the community, but I think we have become a concrete example of a real, live, healthy, open science community that others can look to as a model.
[00:41:48] Unknown:
In terms of the work that you've seen both done under the auspices of Pangeo and maybe as a direct result of it, both technical and scientific and in terms of the community, I'm curious if you can speak to some of the most interesting or innovative or unexpected ways that you've seen that come about. A couple of ideas that come to mind immediately are some of the operational
[00:42:11] Unknown:
uptake in Pangeo tools. So, you know, Ryan mentioned NASA, but NASA and NOAA, two of the biggest public geoscience data providers in the world, have adopted parts of the tooling and the approaches we've taken, in particular as both of those organizations are moving towards using the cloud as the place to store and distribute their data. We're seeing things like tooling being built using Xarray, or data being stored in Zarr format, or cataloged using similar tools. And I think that's a really exciting thing. I mean, in many ways, it's why, say, NASA or NSF are giving grants to organizations like ours: so that we can innovate these ideas to a place where they can be taken up by larger institutions. And so I think, you know, NASA is certainly doing this. NOAA is as well. It's a big success of the project so far. Yeah. My answer is more or less the same as Joe's. You know, I think it's been really cool and unexpected to see,
[00:43:13] Unknown:
like, also, like, organizations in Europe, like the European Space Agency, like, building operational infrastructure, production level infrastructure for data processing out of, like, these building blocks. I think many times they don't explicitly credit Pangeo, and that's fine. They're like, this processing system is built, the data is in Zarr, we're processing it with Dask, and, like, we're providing, you know, Xarray. It's clearly Pangeo, but, like, it's not credited. And that's fine, actually, and good. Like, I think there's a world in which Pangeo becomes this invisible, you know, layer, you know, supporting these tools. And if they go out and are adopted and used and help the world, like, that's kind of the best case scenario.
The other direction is adoption in biogenomics and microscopy. That's super cool to see. In your experience
[00:44:02] Unknown:
of helping to form and grow this community and the associated technologies, what are some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:44:13] Unknown:
For me, it's been the management and fostering of a large, distributed, asynchronous community. It has taken a lot of energy. I mean, it's been incredibly fruitful, but there's also been a lot of coordination effort that has gone into that. When I started working on Xarray, it wasn't to build a community. It was in part because there was a bit of a community aspect to it. But as this kinda grew into the Pangeo project and it started attracting users and other developers, it became more and more a coordination project and less of a development project.
[00:44:44] Unknown:
I echo that. Like, trying to keep the community organized and moving forward and motivated has been a huge time sink for me. I mean, I think it's time well spent. But, you know, it really does take energy and effort. And I think sometimes we underestimate that. Like, it's easy to start projects and start things, but sometimes we underestimate the cost of keeping them going. I think that's true in software and work more generally. I would add another, like, challenge that it's always important to acknowledge. You know, we really try to make our community, like, really inclusive, through a code of conduct, through a lot of things we can do to try and welcome new users. But the fact is, you know, the open source software development world is not the most diverse place.
You know, we've tried and had some examples of success to make our community more diverse and welcome contributors from more different walks of life and more different backgrounds. But we still have a lot of work to do there as I think a lot of the, you know, scientific Python projects do. That's a perpetual challenge and something we're always trying to think about how to do better on.
[00:45:48] Unknown:
And as you continue to work on and foster that community and the associated technologies, what are some of the things you have planned for the near to medium term future of Pangeo?
[00:45:58] Unknown:
I think in the near term, Ryan talked about it earlier, but I think Pangeo Forge is a big effort right now. And I think that pushing on this data lake concept for geoscientific data in the cloud is a big aspect there. There are other areas of growth within the Pangeo project that are not necessarily connected to Ryan and I, but I think it's worth saying. I think there's this early stage shift towards more of an open science community, and there are others pushing on that. I think that's really exciting. I also see there's a group at the National Center for Atmospheric Research and at the University of Washington that have recently gotten support to do software support across the ecosystem, which is notable because, traditionally, it's really quite easy to get grants to build new things.
But NASA has a program now for supporting open source software, which is both really cool, and there's gonna be focus in the Pangeo space on that. So there's a whole mix of things. The community is really multidimensional in its own way now.
[00:46:57] Unknown:
I really think cloud can be transformative for science. Like, I think a lot of people get wrapped up in, like, the technical or, like, the cost, you know, analysis and, like, miss the broader point that, like, cloud allows us to be way more collaborative internationally, you know, around research. And I think that will have really big impacts on science. And, you know, it doesn't mean we have to use Amazon's cloud, but, like, cloud broadly defined as this open Internet environment for doing science, I really do think is gonna transform the way we work for the better. As far as Pangeo goes, I'm very interested in, like, building bridges with other languages, particularly, like, R and Julia.
As far as, like, simulation goes, I'm very bullish on Julia, like, in the future. I think in 10 years, you know, a lot of our work is probably gonna be in Julia. You know, there's interoperability today between Pangeo and Julia. Like, the Zarr format, for example, has really good Julia support. You know, you can call Python from Julia and Julia from Python. But I think just continuing to build out that bridge is really important for sort of, like, looking to the future, because I see a lot of great innovations happening in that community. You know, I've dabbled in the language myself, and I think that's gonna accelerate.
[00:48:17] Unknown:
Are there any other aspects of the Pangeo project and your work in that community and the technology stack that we didn't discuss yet that you'd like to cover before we close out the show? It's a good time to highlight the fact that there's a Discourse forum.
[00:48:33] Unknown:
You know, Pangeo is fundamentally an online community. It's not based at any one institution. It's on GitHub, and we have a Discourse forum. And so it's discourse.pangeo.io, and on GitHub at pangeo-data. I think those are the nexus of collaboration, and I encourage folks that are interested to go check those out. There's also a weekly meeting that, about half the time, is a showcase where people can check out presentations or demonstrations,
[00:48:59] Unknown:
things that people are doing in this space. I think it's also really important to shout out, like, a lot of people ask, how can I learn to use Pangeo? Which you might just translate as, how can I learn to use, like, scientific Python and these various packages? But there's a great group that's focused on the education side within Pangeo, which is called Project Pythia. This is a collaboration between some folks at NCAR and the University at Albany. And they have a great website, a really rich Jupyter Book with awesome sort of training material.
It's really a zero to 60 course in scientific Python computing in the geosciences. So if anyone is hearing this and wants to know where they can start, that's a great place to start.
[00:49:41] Unknown:
Alright. Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And so with that, I'll move us into the picks. And with spring coming and once I get on the other side of mud season, I'm excited to get back out and do some mountain biking. So for folks who haven't given that a shot, definitely a fun way to spend your time. So with that, I'll pass it to you, Ryan. Do you have any picks this week? A way more indoor pick, but I just read the novel Klara and the Sun by Kazuo Ishiguro.
[00:50:14] Unknown:
Really blew my mind. It's about AI. It's about artificial friends and the potential of where we might end up if we really manage to create very, very advanced AI and put it into human-like bodies. It was very thought provoking and a beautiful book, so I highly recommend it. Alright. And how about you, Joe? Oh, I'll have to see you out on the trails and on the mountain bike.
[00:50:37] Unknown:
Also looking forward to that. But last year, I finished a book called Range by David Epstein, and it's about how generalists thrive in a specialized world. And I thought it was a really interesting take on why, you know, a bunch of experiences and a more generalist approach is important in the world we live in today.
[00:50:56] Unknown:
Alright. Well, I'm definitely gonna have to take a look at both of those. So thank you for taking the time today to join me and share the work that you're each doing on Pangeo and helping to give more people access and capacity for doing analysis and data science on large geospatial datasets. I appreciate all the time and energy that you and the rest of the community have put into that, and I hope you two enjoy the rest of your day. Yeah. It was a pleasure. Thanks for having us. This was awesome. Thanks, Tobias. Thank you for listening. Don't forget to check out our other show, The Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management.
And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Ryan Abernathey and Joe Hamman
Ryan's Journey with Python
Joe's Introduction to Python
Overview of the Pangeo Project
The Birth of Pangeo
Challenges and Community Building
Core Elements and Extensions of the Pangeo Stack
Scientific Domains Using Pangeo
Managing Source Formats and Data Abstraction
Cloud Computing and Its Impact on Geosciences
Modern Data Stack and Scientific Data
Exploring Cloud Native Storage Formats
Evolution and Future Goals of Pangeo
Innovative Uses and Success Stories
Lessons Learned in Community Building
Future Plans for Pangeo
Community Resources and Education
Picks and Recommendations