Summary
Astrophysics and cosmology are fields that require working with complex multidimensional data to simulate the workings of our universe. The yt project was created to make it easy, and even fun, to work with this data and produce useful visualizations. This week Nathan Goldbaum and John ZuHone share the story of how yt got started, how it works, and how it is being used right now.
Announcements
- The Open Data Science Conference is coming to Boston May 3rd-5th. Get your ticket now so you don’t miss out on your chance to learn more about the state of the art for data science and data engineering.
- Now you can get T-shirts, sweatshirts, mugs, and a tote bag to let the world know about Podcast.__init__, and you can support the show at the same time! Go to teespring.com/podcastinit and load up!
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.podcastinit.com/linode?utm_source=rss&utm_medium=rss and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
- Visit the site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Nathan Goldbaum and John Zuhone about the YT project for multi-dimensional data analysis.
Interview
- Introductions
- How did you get introduced to Python?
- What is yt and how did it get started?
- Where does the name come from?
- How does yt compare to other projects such as AstroPy for astronomical data analysis?
- What are the domains in which yt is most widely used?
- One of the main use cases of yt is for visualizing multidimensional data. What are some of the design challenges in trying to represent such complicated domains via a visual model?
- Some of the sample datasets for the examples are rather large. What are some of the biggest challenges associated with running analyses on such substantial amounts of information?
- How has the project evolved and what are some of the biggest challenges that it is facing going forward?
Contact
- John
- @astrojaz on Twitter
- Nathan
- @njgoldbaum on Twitter
Picks
- Tobias
- Nathan
- John
Links
- h5py
- Matt Turk
- Seismodome
- Computational Fluid Dynamics
- AstroPy
- SymPy
- Magnetohydrodynamics
- Numerical Relativistic Hydrodynamics
- MPI4Py
- Matplotlib
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Just a couple of announcements before we start the show. If you haven't picked up a ticket for the Open Data Science Conference in Boston, there's still time. It's happening from May 3rd through 5th, and it's a great chance to learn more about data science and data engineering. You can also now get a t-shirt and help support the show by going to teespring.com/podcastinit. You can pick up a mug and a sweatshirt while you're at it. Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or trying out something you hear about on the show.
You can visit our site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Nathan Goldbaum and John ZuHone about the yt project for multidimensional data analysis. John, could you start us off by introducing yourself?
[00:01:22] Unknown:
Hi, I'm John ZuHone. I'm an astrophysicist and a computational scientist at the Harvard-Smithsonian Center for Astrophysics in Cambridge, Massachusetts.
[00:01:33] Unknown:
Nathan, how about yourself?
[00:01:34] Unknown:
Hi, my name is Nathan Goldbaum. I'm a postdoc at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.
[00:01:43] Unknown:
And could you guys tell me how you first got introduced to Python? John, how about you go first again?
[00:01:48] Unknown:
I got introduced to Python while I was a graduate student. I was running a simulation code and I needed to parse a log file, and I came to a postdoc and said, I don't know what the best tool is to parse this log file. The tool of his choice was Python, so he started writing a Python script and explained it to me along the way, and I got hooked on it, I think, because I found it a really powerful but also really easy language to learn. And, Nathan, do you remember how you got started?
[00:02:18] Unknown:
Yeah. Before I used Python, I was mostly using IDL, which is this awful language that basically only astronomers use, maybe a few other domain scientists, but basically astronomers. It's closed source, so you have to pay a license fee, or generally your academic department pays it for you, and it's just kind of awful to use, a terrible language. So in the middle of grad school, I realized that I was using this awful proprietary language and I should probably learn something more marketable and also just less awful. At the same time, I was starting to work with big simulation datasets, and I realized that there was this library, yt, that I could use if I was using Python. So I picked up Python and yt at the same time, around 2012.
[00:03:00] Unknown:
And that leads us nicely into the question of what is YT, and how did it get started?
[00:03:06] Unknown:
Oh, yeah, sure. yt is a package that was developed initially by Matt Turk, who's also at the University of Illinois, and it's a Python package for analyzing multidimensional volumetric data. I should unpack those two terms a little bit. By multidimensional, at the moment we mainly mean three-dimensional data. So you imagine basically a simulation box where you've got two stars colliding, or the structure of the universe forming. Or you can imagine a 2D plane where you've got a shock tube, where a wind is blowing and hitting an impediment and forming a lot of interesting structure. And by volumetric data, what I mean is that we're thinking about things that are spread out in some kind of space. Normally we think about that in terms of the three spatial dimensions, but we're also moving into domains where we think about dimensions in terms of other things besides length, width, and height. Let's say that one of the dimensions along the box is
[00:04:11] Unknown:
time. I can't hear John anymore.
[00:04:14] Unknown:
Sorry, Nathan?
[00:04:16] Unknown:
You started to get a little bit choppy through there. I heard you up through saying "we're starting to get into areas where" and then you started to cut out.
[00:04:25] Unknown:
Okay. We're starting to get into areas where we don't just think of the dimensions in terms of length, width, and height; we also think of dimensions that are not spatial. Let's say you have a dataset that has a time dimension, and you treat that as a dimension that you would take a slice on, or examine a subset of. Or in astronomy we have data cubes where two of the dimensions are the two directions on the sky, and the third direction is, say, velocity along the line of sight. That's also a very common dataset. So basically, we're thinking about how to look at complex multidimensional datasets in an intuitive way, a way that scientists and physicists would prefer to think about them, as opposed to thinking about them just from a coding perspective.
[00:05:15] Unknown:
yt's bread and butter is working with really complex research datasets. These are data that are output by research codes, so the person that designed the data format might not have been trained in how to do that well, or the format might be extremely complex for other reasons. So you have this very complex dataset on disk, which might be really difficult to work with. yt can parse it and understand it and present it to you in a manner that allows you to ask questions about it in physically motivated ways. Like, what is the total mass of gas contained in this sphere? That's a really basic yt operation. Or I want to visualize a slice through this simulation, or a projection, or do some sort of more sophisticated volume rendering, that sort of thing. So it's the ability to take parsers for real-world research data formats and then ask questions about that data without having to write your own analysis software that parses the complicated format, and then on top of that implement some analysis algorithm yourself. Basically every grad student who works with these complicated research data formats ends up reimplementing the same parsers and the same analysis algorithms, when it would be much better if everybody was pooling their resources into a community code, which is sort of what yt tries to be.
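As a flavor of what that physically motivated interface looks like, here is a minimal sketch using yt's public API; the dataset path points at one of yt's downloadable sample datasets and is illustrative, since any supported format loads the same way:

```python
import yt

# "IsolatedGalaxy" is one of yt's public sample datasets; any format
# that yt has a frontend for would load through the same call.
ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")

# "What is the total mass of gas contained in this sphere?" -- a sphere
# of radius 50 kpc centered on the domain center ("c").
sp = ds.sphere("c", (50.0, "kpc"))
print(sp.quantities.total_quantity(("gas", "cell_mass")))
```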
[00:06:43] Unknown:
And was there any sort of pre-existing tooling or set of best practices that yt is replacing, from other languages or that's more specific to the scientific domain? I'll make a comment about that. So in astronomy, there's a lot of a roll-your-own attitude. As Nathan was talking about, we have these complex datasets, and they're often in very common file formats; one file format that's used a lot is HDF5. So there are some industry standards in that sense, but when people decide how they want to read that data, they very much write something on their own and special-purpose it for whatever they were thinking of at the time: okay, I need to read, say, the velocities out of this dataset, and I need to figure out the average velocity in, say, a rectangular region. And I think the philosophy behind yt is that it turns out you might write a code like that, and then you determine that you need the code to do something slightly different. So instead of a rectangular solid, maybe I need a sphere, or instead of the velocities, I need to read the temperature. The idea behind yt is to take these very common operations, generalize them, and express them in a language that allows you to ask questions of the data in sort of a physics way, for example, asking questions about spheres and rectangular solids and slices and that sort of thing, which prevents you from having to reinvent the wheel all the time.
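A hedged illustration of that generalization: the same aggregate question asked of two different geometric selections, with no bespoke file-reading code. The dataset path and field names are assumptions based on yt's sample data:

```python
import yt

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")  # sample dataset, illustrative

# Mass-weighted average velocity in a rectangular region...
reg = ds.region("c", [0.25, 0.25, 0.25], [0.75, 0.75, 0.75])
print(reg.quantities.weighted_average_quantity(("gas", "velocity_magnitude"),
                                               weight=("gas", "cell_mass")))

# ...and the same kind of question asked of a sphere, about temperature
# instead, without touching any I/O code.
sp = ds.sphere("c", (50.0, "kpc"))
print(sp.quantities.weighted_average_quantity(("gas", "temperature"),
                                              weight=("gas", "cell_mass")))
```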
[00:08:06] Unknown:
Can you explain a bit about the name and where that came from? Because having a two-letter name for a project is a little bit enigmatic.
[00:08:18] Unknown:
I think that's true. Yeah. So the original author of the code base is Matt Turk, who's my advisor at the University of Illinois for my postdoc. He wrote it for his PhD, in 2008 or so. It's sort of a whimsical name, because he didn't expect it to be used by lots of people; it just turned out to be useful, and he kept the name. It stands for "yours truly," but it's also a reference to the Neal Stephenson novel Snow Crash. That makes more sense if you realize that it was originally written to analyze outputs from a simulation code named Enzo, which Matt was also using for his PhD, and Y.T. is Enzo's helper in that novel. But, yeah, it's also a substring of Python, which is nice.
[00:09:01] Unknown:
That's a surprisingly deep level of complexity packed into those two little letters. So it sounds like one of the primary use cases for yt is analyzing astronomical data. I'm wondering how it compares to, or interfaces with, other projects in the Python ecosystem, such as AstroPy, for astronomical data analysis?
[00:09:21] Unknown:
Sure. So AstroPy is mostly there for working with data that comes from a telescope. That means doing math on the surface of a sphere, really accurate timekeeping, figuring out different coordinate systems on the sky, trying to find what objects are contained inside a circle; that's what AstroPy is there for, as well as many, many other things. yt is there for analyzing the outputs of research simulations, the sort of simulations you might run for your PhD or as part of a huge research project. A really common thing that big efforts at national labs do to push forward cosmology programs is to run really enormous cosmological simulations and then use those simulations as input for parameter estimation in the physics problem. So you have this enormous cosmology simulation you need to work with. Or there are lots of other reasons you might want to run a cosmology simulation; that's just one. So basically you end up with gigabytes and gigabytes of this complicated data
[00:10:18] Unknown:
that you need to process in some way, and yt is there as the toolkit you need in order to answer physical questions about that data. I think an important point in this context is that you often have gigabytes and gigabytes of data, and at any given time you don't need all of the data in that dataset to answer the particular question you have. The good thing about yt is that it's designed to work only with the data you're interested in, say a spatial subset of the data, or only certain physical quantities in the data. That's good because it simplifies the questions you need to ask, but also, rather importantly for these kinds of datasets, it avoids having to read the whole giant dataset into memory at once when in reality you maybe just need a small subset of it. So it's designed to make your job simpler and also make the computer's job a lot
[00:11:16] Unknown:
easier too, not using as much memory and not spending as much time as you would reading in all that data just to use a small subset of it. And the trick for that is that the I/O masking is spatial. So if you only want a certain subset of a dataset, we know where on disk that data comes from, and we only need to do I/O for that small subset. Doing that in increasingly clever ways is basically how yt can get faster.
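A small sketch of that lazy, spatially masked I/O in action. The sample dataset path is again an assumption, but the behavior described, that selection objects defer reading until a field is actually requested, is how yt's data objects work:

```python
import yt

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")  # sample dataset, illustrative

# Creating the data object is cheap: no field data has been read yet.
box = ds.box([0.4, 0.4, 0.4], [0.6, 0.6, 0.6])

# The I/O happens here, and only for the chunks on disk that intersect
# the box; the rest of the (potentially huge) dataset is never touched.
density = box["gas", "density"]
print(density.size, density.units)
```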
[00:11:43] Unknown:
And so at a high level, is it fair to say that AstroPy is more for observational analysis of data, whereas yt is more for theoretical research? That's a good way of characterizing it, I think. And aside from astronomy, are there any other domains where yt is widely used?
[00:12:02] Unknown:
This is a good segue, I guess. There are astronomical datasets beyond simulations where yt is great. John brought this up earlier: position-position-velocity data is something you get a lot in astronomy, like from a radio telescope. What a radio telescope actually sees at different places on the sky is the emission at different frequencies, and those frequencies correspond to different velocities along the line of sight. So you can map the data that comes out of a radio telescope into a position-position-velocity 3D data cube, which you can use yt to look at, doing slices or projections or different sorts of sub-selections of the data. There are other tools that do similar things, though; that sort of regular gridded data isn't where yt really shines. yt really shines with adaptive multiresolution data, data that is coarse in some places and very refined and high-resolution in others. It's tricky to do spatially aware chunking on a dataset like that, because it's structured, but structured in a very complicated way. And are there domains outside of astronomy where yt is popular or widely used? I don't know if I'd say widely used. I know there are other areas in which it's being used for certain applications; I know that Matt, who we referred to earlier, has used it for visualizing seismic data in the Seismodome project, which is something you can Google.
[00:13:28] Unknown:
I've also seen, and I don't think there was any serious analysis done on it, but brain scans have a length, width, and depth to them, and there's a yt
[00:13:39] Unknown:
volume rendering of a brain scan as well that I thought was pretty interesting. It's also used by a lot of nuclear physicists, people that are simulating nuclear reactors. There was a big effort a year or two ago to add support for unstructured mesh data, so yt now has support for second-order interpolation on various kinds of unstructured meshes, and it's being used right now to simulate molten salt reactors by one student in our research group. And I know it was used to simulate ITER, I don't remember the details, by people at the University of Wisconsin. So there's a growing community of people doing nuclear physics. We'd also like to expand to other fields. There are plenty of other people, especially people doing CFD simulations, that have data yt could work with; it's just a matter of writing readers. And also, there are a lot of things in yt that are very astronomically focused. Going forward, we're thinking about ways to keep the astronomy-focused stuff but make it possible to have a focus on other domain sciences.
So if you loaded a weather simulation, you would be able to have weather-specific analysis tools or fields. We haven't talked about fields yet, but fields are really important in yt: predefined quantities that people commonly work with in your domain and that you can access in your dataset.
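For context on what a field looks like in practice, here is a hedged sketch of defining a derived field with yt's field API; the field name, the mean molecular weight of 0.6, and the sample dataset path are all illustrative assumptions:

```python
import numpy as np
import yt
from yt.utilities.physical_constants import kboltz, mh

# A hypothetical domain-specific derived field: thermal speed of the gas,
# built from the temperature field that is read off of disk.
def _thermal_speed(field, data):
    # sqrt(3 k T / (mu m_H)), assuming a mean molecular weight of 0.6
    return np.sqrt(3.0 * kboltz * data["gas", "temperature"] / (0.6 * mh))

yt.add_field(("gas", "thermal_speed"), function=_thermal_speed,
             units="km/s", sampling_type="cell")

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")  # sample dataset, illustrative
sp = ds.sphere("c", (50.0, "kpc"))
print(sp["gas", "thermal_speed"])  # computed on demand, like any on-disk field
```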
[00:15:01] Unknown:
And one of the main use cases, or at least one of the use cases highlighted in the documentation, is the ability to visualize multidimensional data. So I'm wondering, what are some of the overall design challenges
[00:15:10] Unknown:
that are presented when trying to represent such complicated domains using a visual model? I would say that the first one is that the end user has to have a good sense of what kind of visualization they want to do. They might only care about a two-dimensional slice, for example. Or, in astronomy, we have three-dimensional simulations, but of course we don't observe things in three dimensions; on the sky, things are projected. So we have a projection operation, where you basically integrate along one dimension down to a two-dimensional map. And then we also have volume renderings, where you have a 3D rendering of different surfaces of a particular field within the domain. The thing is, these are a common set of operations that various people with various types of datasets are interested in. And the challenge is that because the datasets can be so very different, a uniform mesh or an adaptive mesh or an unstructured mesh or a set of particles, for example, we have to figure out a way to make the underlying structure of the code take these common operations, present them in the same way to every user, and still do the right thing under the hood on very different types of data. That's a big challenge, one we're still grappling with for certain types of datasets. So here's a really concrete example: I want to create a plot of a slice through my simulation, and there are a number of ways that you could construct that visualization.
[00:16:42] Unknown:
Probably the simplest one would just be creating the pixelized representation of that slice, so you have a 2D, uniform-resolution array. Just generating that slice, when it can come from particle data, unstructured mesh data, adaptive mesh refinement data, or uniform-resolution data, and architecting the code so that it can generate that result from the same interface, is the sort of complexity that John was just alluding to. And then on top of that, once you have that image, it can represent lots of physically meaningful things. You could just export the image and let people construct their own plots, but what people commonly want is a color bar, axes, a title, and annotations, because something is interesting inside the image. So we have this whole wrapper around Matplotlib called the plot window interface. The idea is that you're making a window into the data, and it's got annotated axes, so you have the spatial scale and a color bar with a label that corresponds to whatever field you're asking for. That label has units; all the axes have units. Just making it so that all of that works, makes sense, and behaves consistently across many different kinds of data and many different sorts of visualizations can get really, really complex. But it's also sort of magical when you write a three-line script and you're able to make this nice slice plot that is maybe publication quality. I don't know if it's quite there, but I think it's pretty high quality.
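The three-line script Nathan describes looks roughly like this; the width, annotation, and output filename are illustrative choices, and the plot window supplies the annotated axes, units, and labeled color bar automatically:

```python
import yt

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")  # sample dataset, illustrative

# The plot window wraps Matplotlib: annotated axes with units and a labeled
# color bar for the requested field come for free.
slc = yt.SlicePlot(ds, "z", ("gas", "density"), width=(200.0, "kpc"))
slc.annotate_timestamp()  # one of many optional annotation callbacks
slc.save("galaxy_density_slice.png")
```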
[00:18:17] Unknown:
Digging into a bit of the guts of yt, what does the overall architecture or design look like, and what are some of the biggest challenges that you have faced in the process of developing the project?
[00:18:25] Unknown:
So the repo is split into a few pieces. First there are the core data structures, the thing that represents the dataset itself. That's the thing in our documentation we refer to as ds, usually: ds = yt.load, some path to some dataset. This is a Python object that conceptually represents the dataset, and then you do things with that dataset object in order to access the data it represents. So there's that and all the things that hang off of it, and then there are the frontends. Those are the parsers for these research data formats, because yt has this concept of trying to map semantic meaning onto the data. We need to understand the data at a pretty high level: we need to know, say, that this field means the gas density field, has units of grams per centimeter cubed, and needs to be read from this place on disk. That's different for each data format, so there are a bunch of different frontends. And then next to that there are the analysis modules, which are sort of an attic of things that people have developed. We're actually trying to split those off into other repositories, because some of them were developed a long time ago and then thrown over the wall, and it's difficult to maintain stuff like that. Going forward, it's going to be mostly just the core and the readers. And then on top of that there's the field system.
Fields in yt: say you want to construct, for example, the kinetic energy, which depends on things that you might read off of disk like the mass and the velocity, so you need to construct it by doing some sort of operation. We have lots of predefined fields, and systems for defining these fields in an automatic way, because the field definitions might depend on different properties of the dataset, so they need to specialize themselves depending on what sort of data we're reading. I think that's it. Did I miss anything, John?
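To make that architecture concrete, a small sketch of the pieces Nathan names: yt.load dispatches to the right frontend, and the resulting dataset object exposes both the on-disk fields and the derived fields built on top of them (sample dataset path assumed):

```python
import yt

# yt.load inspects the file and returns the matching frontend's Dataset
# subclass; "ds" is the conventional name used throughout the docs.
ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")

print(type(ds))                   # frontend-specific Dataset subclass
print(ds.field_list[:5])          # fields actually stored on disk
print(ds.derived_field_list[:5])  # fields yt knows how to construct from them
print(ds.current_time.in_units("Myr"))
```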
[00:20:19] Unknown:
I think one of the things I'd like to highlight, since I've worked on it a lot and also find it probably one of the more fascinating parts of the code, is the unit system. Physicists work with things in many different kinds of units: time units like seconds and hours, mass units like grams or, in the case of astrophysics, solar masses. And the interesting thing about these calculations is that before yt, I basically had all of these arrays in my analysis code that I was multiplying together; say I needed to get from density and temperature, or magnetic field, to some kind of intensity of light from an astronomical object. I'd multiply all these numbers together, with a lot of different mathematical operations and physical constants, doing everything by hand, and I would have to check very meticulously, and often fail, at getting the units right. yt uses the Python package SymPy in a symbolic units implementation, where you have NumPy arrays, or single floating point numbers, that are associated with units. For example, you might have a number that's a velocity in kilometers per second, or an array that's a temperature in Kelvin, and you can multiply and add and combine them in any number of mathematical operations with physical constants, and the number you get at the end is in units that are actually correct given those operations. Take the gravitational potential, for example: it turns out that has the units of a velocity squared, so you can take its square root and convert the result into kilometers per second.
And all of this now happens fairly seamlessly. But it turns out that making a system that works, and takes care of all the intricacies of physical units, which turn out to be a lot more numerous than the average person anticipates, takes a lot of work by a lot of different people. For example, there are different electromagnetic systems of units for measuring things like current and charge and magnetic field, and they're not necessarily compatible with each other. So one of the things we had to do was figure out how to have these different unit systems live side by side, but also make appropriate translations between them, even when certain things are not necessarily even dimensionally equivalent.
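A minimal sketch of what that symbolic unit handling looks like from the user's side, including John's square-root-of-the-potential example; the numeric values are made up for illustration:

```python
import numpy as np
import yt

# Arrays and scalars carry units through arbitrary math.
v = yt.YTArray([100.0, 200.0, 300.0], "km/s")
m = yt.YTQuantity(1.0, "Msun")
ke = 0.5 * m * v ** 2
print(ke.in_units("erg"))  # composite units resolve to whatever you ask for

# A (specific) gravitational potential has dimensions of velocity squared,
# so its square root is a velocity.
phi = yt.YTQuantity(1.0e15, "erg/g")  # illustrative value
print(np.sqrt(phi).in_units("km/s"))

# Dimensionally inconsistent operations raise an error instead of
# silently producing garbage.
try:
    v + m
except Exception as err:
    print(err)
```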
[00:22:42] Unknown:
And getting that last bit right is extremely important for people that do things like magnetohydrodynamic simulations, which it turns out are kind of popular; there are lots of people that do simulations like that. So you need to get those details right, or they notice.
[00:22:57] Unknown:
Yeah, that's right. We have a couple of ways that people can get to us for user support, and I actually remember the day this came up, I think on our IRC channel: the units for the magnetic field were not being converted properly, and that was something we had to look into. But I just think it's fascinating how, as I said, it took a lot of work by a lot of different people, and now it's one of the most impressive features that I always like to show off in yt to physicists, because they're like, wow, you just multiplied those numbers and it gave, say, the pressure in the units I expected it to.
[00:23:35] Unknown:
I should say there are several other projects that do similar stuff, like Pint and AstroPy. This is a case of convergent evolution. I think Pint existed when we started working on this, but it wasn't quite what we needed; I don't remember the details. We ended up going with another library that a grad student who was participating in yt a lot a couple of years ago, named Casey Stark, wrote, which wrapped SymPy, and we absorbed it into yt and improved it. So, unfortunately, there are lots of different implementations of these units libraries, but we think ours is pretty great.
[00:24:10] Unknown:
Yeah, I'm forgetting which episode it was exactly, but I was speaking with someone else recently who was leveraging Pint, and I was asking about its relation or comparison to the units capabilities in AstroPy. One of the things they commented on is that, at face value, units are units, but once you start getting more involved with them there are so many different edge cases to consider that one units package isn't necessarily going to fit every use case.
[00:24:42] Unknown:
Right, and being able to take a copy of the units implementation that we were using and adjust it to our use case was really nice. So basically, vendoring this library into yt, absorbing it into the rest of the code base, and making it as integrated as possible has been really useful. Maybe dealing with an upstream project that might make decisions you wouldn't make isn't the best for that sort of thing. Oh, I was just going to point out that an interesting case of that is simulations
[00:25:09] Unknown:
of Einstein's theory of general relativity. Numerical relativists have a habit of saying that the Newtonian gravitational constant and the speed of light are equal to one, and they like measuring things like lengths and times in terms of the mass of the sun, for example, because everything is just simpler if you take all these pesky constants, set them equal to one, and then maybe make adjustments at the end. That's something we have a little bit of support for. I wouldn't argue that it's ideal, but it was definitely an interesting challenge: within the framework we have, how do you think about things that are fundamentally different dimensions, but give them units as if they were the same, basically? So numerical relativists have really complex data, maybe not surprisingly,
[00:25:54] Unknown:
both conceptually and also in the representation on disk, which is really complicated because they're simulating things like black holes, and trying to simulate infinities is hard.
[00:26:05] Unknown:
And you've both touched on it a little bit, but some of the sample datasets that I was looking at in your examples are fairly substantial in size. So I'm wondering, what are some of the biggest challenges associated with trying to run analysis on such large amounts of information?
[00:26:21] Unknown:
So the first problem is getting it done before the end of the universe. That can be accomplished both by making each bit of the analysis faster and by doing the analysis in parallel. yt tries to be both very fast and parallel-aware, using the library mpi4py, which is built on the Message Passing Interface. If you're on a system where you have a really enormous dataset, you'll have access to an MPI library, because MPI is really commonly used to produce simulation datasets like that. And that's another challenge: just moving data around can start to be extremely difficult. Once it gets to the terabyte scale, you don't want to move that much data across the country; it takes a very long time. So you want to keep the data wherever it was produced and then analyze it in situ, preferably in parallel. Yeah, and I think something that's important to point out is that most, maybe not all, but many of these datasets were produced using thousands of cores on a supercomputer.
[00:27:19] Unknown:
And so often, if you want to do global analysis in yt, it will take, as Nathan was referring to, doing operations in parallel, which yt supports. But again, the nice thing about yt is that if I have this giant dataset
[00:27:33] Unknown:
and I'm only curious about, say, a few parts of the simulation domain, say there's this big domain in which a few galaxy clusters formed and we only want to look at those, then yt knows how to figure out where the data from those particular parts of the simulation domain lives, and it only reads the cells or particles it needs from those locations. You can just leave the rest of the giant dataset behind. And it's not only that. John and I are working together on a research project, and during the process of it we've managed to produce 500-gigabyte datasets, I think, something along those sizes. It's not every day that you're working with data that huge, and data that huge has its own challenges, but it's often the case that you're working with a time series that can be really large. For my thesis, I publicly released this 14-terabyte dataset where each individual datum is a 5-gigabyte file, one snapshot of the evolution of a galaxy like the Milky Way. You can get lots of interesting scientific discovery if you have lots and lots of time steps, both by watching the thing as it evolves and by getting really fine time resolution for different sorts of analyses. So being able to analyze a big time series like that in parallel is really important, and yt has parallelism primitives that allow you to easily iterate over a simulation time series. That can also be more complicated for things like a cosmology simulation, where your output cadence is linked to the redshift, which doesn't map simply onto time. So to be able to analyze at certain precise redshift values, you need to have some sort of smart indexing system, which yt will also help you out with.
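A hedged sketch of those parallelism primitives: iterating over a time series under MPI with yt's piter. The output file pattern, sphere size, and fields are illustrative assumptions:

```python
# Run under MPI, e.g.: mpirun -np 8 python analyze_series.py
import yt

yt.enable_parallelism()  # harmless in serial, so the script runs either way

# Hypothetical output pattern; each match is one snapshot in the series.
ts = yt.DatasetSeries("DD????/DD????")

storage = {}
for store, ds in ts.piter(storage=storage):
    # piter hands each MPI task a subset of the snapshots.
    sp = ds.sphere("c", (100.0, "kpc"))
    mass = sp.quantities.total_quantity(("gas", "cell_mass"))
    store.result = (float(ds.current_time.in_units("Myr")),
                    float(mass.in_units("Msun")))

if yt.is_root():
    # After the loop, results from all tasks are gathered into storage.
    for _, (t_myr, mass_msun) in sorted(storage.items()):
        print(t_myr, mass_msun)
```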
[00:29:10] Unknown:
And how has the project evolved overall from when it was first created, and in the time that you two have been working on it? And what are some of the biggest challenges that you see it facing in the future?
[00:29:28] Unknown:
So I'd say the biggest challenge will be trying to expand beyond our niche. Right now, we're mostly dealing with astronomical data, astronomical simulation data, which is a community that has maybe a couple hundred, maybe as many as a thousand, researchers in it. But I think yt's potential for enhancing discovery in these sorts of simulation datasets goes way beyond astronomy. So a big challenge for us is figuring out how we as a community can incorporate many different perspectives and then break all the assumptions that we're currently making, in order to make it so that yt makes sense for working with data from many different kinds of fields. I was just going to say that one of those assumptions we're working with right now is that most of the datasets we're really good at analyzing
[00:30:15] Unknown:
are in three dimensions and in Cartesian coordinates, so x, y, and z, basically. And there are a lot of research groups, both in astrophysics and otherwise, that have coordinate systems for their datasets that are, say, spherical or cylindrical, or even more unusual coordinate transformations. When you're thinking about the operations we were talking about earlier, like making slices and projections and volume renderings, I'm under the assumption, I sort of see the world as Cartesian to first approximation, that's the way I think about the world, so it's easy for me to think about how we need to do all those operations under that kind of coordinate assumption. But trying to translate those concepts over to these other kinds of datasets is a really challenging but necessary thing to do, because there are many calculations that are simply much more convenient in those coordinate systems. And as I was referring to earlier, some datasets aren't even properly spatial, or there might even be a case where we decide we want to look at a four-dimensional dataset: we can treat a set of three-dimensional datasets as a four-dimensional dataset, where you have length, width, height, and time.
[00:31:33] Unknown:
And how has the project evolved? It started out as sort of a personal toolkit that one person was building for his own thesis, and then quickly several other people in his research group chipped in and helped out in various ways. Then more and more people that were using Enzo, the simulation code I mentioned earlier that yt was originally written to work with, got involved. Enzo is a publicly available adaptive mesh refinement hydrodynamics code, so it has the same principles of openness that yt does, and people that were working with Enzo were also interested in working with yt. So it extended to that community, and then we gained support for many different data formats beyond Enzo, and that expanded the community further. The project has evolved by supporting more and more different kinds of data, and also by bringing in as contributors the people that are actually working with that data. It's the people that know how whatever research algorithms they want yt to perform should be performed, and also the people that have crazy data that yt fails with, so we can fix bugs, or realize there are things we need to change because we're making an assumption that's broken by data that actually exists in the real world. Alright, do either of you have anything else on that particular subject that you'd like to expand upon a bit more? I guess I just want to emphasize that yt is a community, and it's mostly domain scientists that are contributing to it. There are a few people, basically me, that are actually paid to work on yt, but almost everybody else, over the whole history of the code, has contributed because their contributions solve real-world research problems. So I want to emphasize that yt is a community project,
[00:33:14] Unknown:
and we're standing on the shoulders of giants, all the people that have contributed over the years; I think we're up to more than a hundred people now. Yeah, and in fact I just want to emphasize that as well, in the sense that by working with this community and people like Nathan and Matt, who we mentioned earlier, and many others, I've become a much better programmer, software engineer, and scientist, just by interacting with all of these folks and learning better coding practices. Learning how to work as a team has helped me immensely in the job that I'm working at now; I think I've learned a lot from working with the yt team that has helped me in that area as well. And I think a really important and integral part of what makes the software package work is the openness of the community, and the fact that the people involved are not just interested in working on code, as fun as that is, but are actually using it for the sorts of problems that other people may be interested in. That's really what makes it work. And are there any other topics that you think we should cover before we start to close out the show?
I'd just like to make a comment about something kind of neat that I think is happening. Every ten years, NASA has a decadal survey, and new flagship-level missions are proposed. I'm part of a group that's proposing a new X-ray telescope for the next decadal survey, and yt is actually a very integral part of developing the science case, because we're taking 3D simulations and using yt to read them. These simulations come from a very large number of different codes, and the common thing that's able to read them all is yt. And then we have other packages that we can use with yt to create simulated X-ray observations: really cool images and other data products of what we would actually get to see in the universe, in this case galaxies, assuming we had this wonderful instrument that we're proposing. So I think it's an example of how something that started off as a tool with limited application has turned out to be a much bigger thing that's having a broader impact. Alright, Nathan, anything else?
[00:35:31] Unknown:
No, that's really awesome, John; I didn't know that. That's really cool. So for anybody who wants to follow either of you and what you're up to, I'll ask you both to either add your contact information to the show notes or send it to me so I can add it. And with that, I'll move us to the picks. My pick this week is a project I came across recently called Scout2, which is a simple command line tool that will read in your credentials for AWS and then scan through your account to see if there are any configurations that could lead to security vulnerabilities. It then gives you a nice HTML-formatted report to click around and explore. So it's an easy way to do a very quick check for any low-hanging fruit that I need to address in my AWS account, and as somebody who works in the operations space, it's definitely very useful to be able to get that kind of information quickly and easily. And with that, I'll pass it to you, Nathan. Do you have any picks for us this week? Yeah, so,
[00:36:29] Unknown:
I'm currently reading The Expanse novels, which are really awesome. If you like the TV show, you should read the novels, because they're better. John, do you have anything? This might surprise some of your listeners, but I'm actually finding that one of the best code editors I've used in a long time is Visual Studio Code from Microsoft, which I'm guessing some listeners might be falling over at this point. But it's an open source project, they've open sourced it on GitHub, and I'm finding it to be a really powerful editor. They've developed a pretty sophisticated plugin system with support for many different languages, and I'm finding it very enjoyable to use. So when I'm talking with other people who are doing Python programming, or other languages, I say, hey, this is actually a pretty cool editor that you should check out. And if it's missing some feature you like, you can actually develop it yourself and send them a pull request, which I think is a neat thing that they're doing. Alright, well, I really appreciate you both taking time out of your day to come and join me today despite the scheduling difficulties,
[00:37:30] Unknown:
and I hope you enjoy the rest of your evenings. I'm sure that the listeners will greatly enjoy hearing about the ways that YT is being used for astronomical research. Thanks for having us on. Thank you very much.
Introduction and Announcements
Guest Introductions
Introduction to YT Project
YT's Role in Astronomy and Other Domains
YT and AstroPy Comparison
Visualizing Multidimensional Data
YT's Architecture and Design
Challenges with Large Datasets
Evolution and Future of YT
YT's Broader Impact
Picks and Recommendations