Summary
Jake VanderPlas is an astronomer by training and a prolific contributor to the Python data science ecosystem. In his current role he uses Python to teach principles of data analysis and data visualization to students and researchers at the University of Washington. In this episode he discusses how he got started with Python, the challenges of teaching best practices for software engineering and reproducible analysis, and how easy-to-use tools for data visualization can help democratize access to, and understanding of, data.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 Gbit network in all of their datacenters.
- If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Jake VanderPlas about data science best practices, and applying them to academic sciences.
Interview
- Introductions
- How did you get introduced to Python?
- How has your astronomy background informed and influenced your current work?
- In your work at the University of Washington, what are some of the most common difficulties that students face when learning data science?
- How does that list differ for professional scientists who are learning how to apply data science to their work?
- Where is the tooling still lacking in terms of enabling consistent and repeatable workflows?
- One of the projects that you are spending time on now is Altair, which is a library for generating visualizations from Pandas dataframes. How does that work factor into your teaching?
- What are some of the most novel applications of data science that you have been involved with?
- What are some of the trends in data analysis that you are most excited for?
Keep In Touch
Picks
- Tobias
- Jake
- Kevin M. Kruse
- White Flight by Kevin Kruse
Links
- UW eScience Institute
- NumPy
- SciPy
- SciPy Conference
- PyCon
- Pandas
- Sloan Digital Sky Survey
- Spectroscopy
- Software Carpentry
- Data Carpentry
- Git
- Mercurial
- Matplotlib
- Altair
- Conda
- Xonsh
- Jupyter
- Jupyter Lab
- Vega
- Vega-lite
- Interactive Data Lab
- D3
- Mike Bostock
- Brian Granger
- Bokeh
- Grammar of Graphics
- ggplot2
- Holoviews
- Wikimedia
- AstroPy
- LIGO
- Wes McKinney
- Feather
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app. And now you can deliver your work to your users even faster with the newly upgraded 200 gigabit network in all of their data centers. If you're tired of cobbling together your deployment pipeline, then it's time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD, you get complete visibility into the life cycle of your software from one location. To download it now, go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind. You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email me at host@podcastinit.com.
To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Jake VanderPlas about data science best practices and applying them to academic sciences. So Jake, could you start by introducing yourself?
[00:01:31] Unknown:
Yeah. So my name is Jake. I'm an astronomer by training, but for the last 5 or so years, I've been more on the data science side of academia. I've been working at the University of Washington eScience Institute. And we have a really great team here and a really interesting goal, which has been for the last 5 years to kind of serve as glue between all sorts of different scientific departments and other research departments on campus, with these common tools that everybody seems to be using, whether it's, you know, R or Python or machine learning algorithms or data cleaning approaches and things like this. So it's been a really fun position, and I've gotten to branch out a little bit from my astronomy into areas that I never imagined I'd be working on.
[00:02:21] Unknown:
It's funny seeing the ways that your career progression, particularly in technology, can change so drastically purely through the unifying fact that you're writing software. Right. Right. I sort of got into the Python world when I was doing my PhD in astronomy. And
[00:02:39] Unknown:
the reason was I was basically using NumPy and SciPy to drive everything I was doing with astronomy data. And it turns out that those tools are pretty general. So I started going to Python meetups and Python conferences and talking with people in other fields, and really got bitten by the open source bug and started contributing back to these packages. So that's sort of how I got into things here.
[00:03:06] Unknown:
And what has been your experience seeing the difference between the scientific Python communities and the more general Python communities, particularly as it manifests between the SciPy conferences and the PyCon series? Yeah. You know, so I've been going to these
[00:03:23] Unknown:
both conference series, SciPy and PyCon, since about 2012. And early on, they were very different conferences. You know, when I started at PyCon, there was not much science or data content in it. It was a lot of web development stuff, you know, Django and a lot of the core Python contributions and, you know, having fun with generator expressions and things like that. Whereas the SciPy conference has always been about showcasing research that happens using Python, that's enabled by Python. But it seems like over the last 5 years, there's been kind of a convergence. Both conferences have moved a little closer together in that they've both embraced data science to some extent. You know, you go to SciPy and there are tutorials on pandas and on doing, you know, large database queries with pandas-style operations and things like this, which are not necessarily all that close to the physical science roots of SciPy, but are more in the data science sense of things. On the other hand, you have PyCon, which has moved into that realm as well. In 2013, 2014, actually almost every year for the past 4 years, I've been involved with the tutorial evaluations and sometimes the tutorial selections for PyCon. And I remember in 2013, there was a discussion among the committee because there were, like, 3 data science tutorials that were submitted, maybe 1 on scikit-learn and 1 on machine learning with some other tool. And there was this discussion in the committee of, like, should we really devote 2 tutorial sessions to such a niche topic like data science?
Right? And it's just funny thinking back on that, because if you look at the PyCon tutorials in 2017, like half of them were data science, maybe 3 quarters of them. So it's totally shifted over the last few years.
[00:05:32] Unknown:
Yeah. It's definitely interesting seeing the ways that the massive amounts of data being collected now are so drastically influencing the career opportunities and the ways that some of the different popular languages are being used, in ways that weren't really thought of when the languages were first being created.
[00:05:50] Unknown:
Yeah. And you see that larger trend too. Like, 5 years ago when Python was all about web stuff, all the talk was whether Ruby on Rails was going to, you know, overtake what people were using Python for. Now you don't hear much about that. But you hear more about the R versus Python thing or, you know, Julia versus Python in the sort of scientific numerics world.
[00:06:15] Unknown:
Right. And then in the systems area, there's a lot of talk about Go versus Python, largely because of the packaging story and deployment. And you mentioned that your background was originally in astronomy, and now you're working more in the data science and education space. I'm wondering how your background in astronomy informs and influences your current work.
[00:06:38] Unknown:
Yeah. Well, fortunately, I still get to work on astronomy topics a little bit here and there. I've been publishing maybe one paper a year in astronomy journals, so I've kept my toes in there. But the thing is that astronomy in a lot of ways was kind of an early adopter of these really data-driven approaches to science. You know, even going back 15 years or so, with the start of the Sloan Digital Sky Survey, there was this mode of scientific operation where what Sloan did was scan the whole sky and get kind of this big catalog of objects up in the sky, and then go back and follow up on those objects with some detailed spectroscopic observations.
And in the end, I think over the course of 10 years, they got a couple million of these spectroscopic observations. And this was really revolutionary at the time. The idea was that, if you wanted to be an astronomer studying the sky, instead of doing the classic thing of going out and applying for telescope time and then pointing that telescope where you want and getting that data, they wanted the mode of exploring the sky to be database queries. Right? You have this huge database, this huge catalog of all these interesting objects in the sky, in a public open database where you can query by position, by object type, by all sorts of different attributes. So astronomy has been doing this thing for a long time, and I feel like in some ways, the last 5 years of what you might call the data science revolution in scientific research has been a lot of other fields catching up to that notion of large public datasets and of working together in kind of an open fashion. And it's been really fun to see. And to be honest, astronomy has prepared me really well for that.
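To make that mode of working concrete, here is a rough sketch (my illustration, not something walked through in the episode) of what a database query against the public Sloan Digital Sky Survey catalog can look like from Python, using the astroquery package:

```python
# A hypothetical example: query the public SDSS database for a handful of
# high-redshift quasar spectra, using astroquery (assumed to be installed).
from astroquery.sdss import SDSS

query = """
SELECT TOP 10 ra, dec, z
FROM specObj
WHERE class = 'QSO' AND z > 2
"""
results = SDSS.query_sql(query)  # returns an astropy Table
print(results)
```

The point is the workflow Jake describes: exploring the sky becomes a query against a shared public catalog rather than a night of telescope time.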
[00:08:43] Unknown:
And in your current work at the University of Washington, you're working on educating students in some of the different best practices in data science, and my understanding as well is that you're teaching some of those same lessons to active researchers.
[00:08:55] Unknown:
Yeah. What we're doing at eScience is sort of multipronged. We're trying to reach all of campus. Right? Because there are, of course, students that need to learn these skills to be effective scientists in this era. And so one of the things we've done on the education side that's been interesting is we've been trying to influence the curricula of all these departments around campus. And as you can imagine, if you've ever been in academia, you know that it's hard to change things. There's a lot of inertia. If you say, for example, that you think astronomers need to be learning how to work with large datasets, how to write effective code, how to do software engineering, that sort of thing, you say, okay, well, let's create a graduate class for that. But graduate students are already overworked and they already have a full curriculum. So if you're gonna create a class that's about data science methods, about software engineering, about statistical methods or something like that, you need to figure out what you're gonna drop. Right? And this is really hard, because what are you gonna drop, stellar atmospheres? Are you gonna drop the course on interstellar dust? You know, because each of those courses has someone who's devoted their career to studying this thing and thinks it's the most important thing in the world. So how could you get a PhD in astronomy without having a course in that? So it's this really difficult thing. Right? Because this is a set of skills that's absolutely fundamental to doing science today.
And departments are really slow to actually add that to the curriculum. So we've been pushing on that in a number of different ways here, giving graduate students these interdisciplinary electives that they can take, things like transcriptable options where, if you take some of these extra courses, you could get a PhD in biology with an emphasis in data science. And that appears on your transcript, kind of like a minor for an undergraduate degree. So we've been working with the administration, working with different departments to try to figure out how to get these sorts of skills into the main curricula. But at the same time, we've been doing a lot on the more informal education side. Programs like Software Carpentry have been amazing. We've been a partner with Software Carpentry for a number of years now, running workshops a few times a quarter. If you're not familiar with Software Carpentry, it basically takes people from ground zero to figuring out how to use the shell, how to work with Python or with R, how to manipulate data, how to use Git and GitHub for version control, and these sorts of core skills that are so necessary these days in science. Another aspect of it that's been really fun is not so much working with the students, but working with the researchers that are already in place, whether they're postdocs or research scientists or even faculty. And we have a number of ways through the eScience Institute where we can build connections with people around campus. They range from really simple things like just having open office hours. One of our eScience data scientists or research scientists is basically on duty all through the week; each person does about 2 hours or so. And so throughout the week, anyone from around campus can come in and just ask technical questions about their research and get connected with someone who has expertise in the area they need, whether they need some sort of statistical modeling, or whether they need to apply machine learning in some way, or whether they need to scale up their data and use databases effectively, or whether they need to build a web front end for some kind of public outreach aspect of their research. So we have all these ways of connecting with people around campus. And then for me as an academic researcher, those have often led to more long term collaborations. So now, instead of just having papers in astronomy, you know, I have papers in geophysics or in transportation policy or things like that. And it's been really fun. And
[00:13:00] Unknown:
in the work that you're doing both with students and with active researchers, what have been some of the common misconceptions or stumbling blocks that they come across in the process of trying to learn and understand and apply these different statistical and data science techniques?
[00:13:17] Unknown:
So it really varies from person to person, because there's a wide range of backgrounds when it comes to computing and to statistics. And that's because these topics are just not treated as first class citizens in a lot of undergraduate and graduate curricula. So there's a huge range of what people know coming in. But the biggest challenge, as far as academic research goes, is the incentive structure for academics. You know, people are looking to basically get tenure eventually, and the incentive structure has nothing to do with writing good code or writing good tools that other people can make use of. It's all about papers and citations. Right? So if I tell someone, hey, you'll be much better off in the long run if you version control your code and document the code and write in a modular way so that the next time you come around, you can remember what you did and build on what you did, that doesn't help someone get a paper out in the next month that they need in order to appease their tenure committee. Right? So academia is in this weird place right now where it's kind of, in some ways, shot itself in the foot by developing these incentive structures that don't match what makes research most effective in the long run. So a big part of what we've been doing here, and what we've been working on with our colleagues at NYU and Berkeley, who we share a grant with, is trying to figure out how to nudge academia in the right direction and think about how we can make it so that the activities we think are valuable to science overall are recognized within the incentive structure. You know, they're recognized by hiring committees, recognized by funding committees. So that, you know, if I spend a week or 2 after publishing a paper making sure that my software is something that can be used by other researchers to reproduce my results, to extend my results, or to make their own research more effective, it's important to make sure that that's somehow recognized. Because if it's not recognized, then it's a waste of 2 weeks in terms of my academic career, even though it might be incredibly valuable in terms of the progress of science as a whole. Given the fact that there are all of these
[00:15:48] Unknown:
different tools and technologies and techniques that would benefit the researchers in the long run, and a lot of them have a fairly steep learning curve, what are some of the areas of improvement in that tooling that you think would be most beneficial, particularly to the academic sciences?
[00:16:08] Unknown:
Yeah. That's a good question, because there are a lot of barriers to entry and there are pretty steep learning curves. One thing that we teach often is Git for version control. And Git can be just so painful because there are so many things to remember. There's this whole mental model of what's going on with your repository and the remote and the staging area and the commit history and the working directory, you know. And it sometimes takes a long time to get someone familiar enough that Git is a tool that helps their workflow rather than hinders their workflow. So that's something that's pretty hard. I think it's sort of unfortunate that we, as a data science community, have landed on Git and GitHub, because I think Mercurial has a lot of the advantages of Git and makes things a little bit easier. But we've, you know, ended up in this place where Git is sort of the lingua franca of open source development and of reproducibility.
So we have to teach that if we want people to be able to work with it. So those sorts of things. And then there are other things too. Like, in the Python world, visualization is not all that easy. You know, matplotlib is a really incredible tool because it allows you to do almost anything you want as far as creating publication quality graphics. But it's not all that great a tool for, like, quick exploratory visualization. You end up needing to do a lot of, you know, tweaking of axes and figure layouts and things like this that impede your ability to quickly explore data.
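As an aside, here is a minimal illustration (mine, not from the episode) of the kind of incidental tweaking Jake is describing, even for a quick exploratory scatter plot in matplotlib:

```python
# Even a quick exploratory scatter plot involves axis labels, figure sizing,
# and layout fiddling before it reads well.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 100))  # made-up data for illustration

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(x, y, s=15, alpha=0.6)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("quick look")
fig.tight_layout()  # the kind of layout tweaking Jake mentions
plt.show()
```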
So those kinds of things are difficult, and I think there are efforts on various fronts to make these tools easier to use. Oh, the other thing I wanted to mention along those lines, the difficulties that academics have, is packaging and installation: remembering, for Python, for example, the difference between Unix paths and Python paths and Jupyter kernels and where they're pointing, and all this sort of stuff. It sometimes means that, you know, if a student comes up to me and is using a Jupyter Notebook and says, I want to install Astropy so I can use it in this notebook, there's, like, this whole backtrack we need to do of, like, what kernel are you using? What Python version are you using? Are you using pip or conda?
And how do you use pip or conda to install it in that specific environment? You know, there are conda envs and there are virtualenvs and all this sort of stuff. Like, if you take a little while to explain it and figure it out, it all makes sense. But for a new grad student coming in who just wants to run some Python code, the startup cost is pretty immense. So I think, fundamentally, what's going on there is you have tools written by, sort of, Linux hackers for Linux hackers. And now you have people coming along who have only ever double-clicked on icons in Windows that want to use this. And there's kind of a lack of common experience and common knowledge between the people who design the API and the users.
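One commonly recommended workaround for the kernel-versus-environment confusion described here (a sketch of general advice, not a transcript excerpt) is to install against the exact interpreter the notebook kernel is running:

```python
# Find out which Python interpreter this Jupyter kernel actually uses, then
# shell out to the pip tied to that same interpreter.
import sys

print(sys.executable)  # path to the kernel's Python

# In a Jupyter cell you could then run (uncommented):
# !{sys.executable} -m pip install astropy
```

Because the install goes through `sys.executable`, it lands in the environment the kernel points at, regardless of which pip happens to be first on the shell's PATH.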
And that's a real challenge. So it means that to start using these tools, you need to get people up to speed on the UNIX command line, which doesn't always feel like the best use of time for someone who's,
[00:19:54] Unknown:
you know, worried about funding next month and trying to get research out. So it can be a little bit difficult. Yeah. The UNIX command line in particular is a long road to go down, because I use it every day for my work as a systems administrator, and there are still things that are obtuse and unintuitive for me, and for people who have been doing it even longer than I have, because it's just a sort of amalgamation of different tools with a little bit of glue to, you know, bind them together. So it's interesting to see efforts such as the Xonsh shell from Anthony Scopatz, and how that could potentially play into trying to ease that on-ramp for people who want to be able to work in this environment but not get exposed to some of the sharp edges of the overall UNIX ecosystem.
[00:20:39] Unknown:
Yeah, and I like that you mentioned the Xonsh shell. That's a really nice project, and it has streamlined some of those things. Other examples, I think, of people that are trying to smooth over those rough edges are the Jupyter project, you know, bringing everything through these notebooks or through this new JupyterLab interface to sort of hide the details of what's happening for beginners. And then also the conda project, I think, has been really valuable, because it takes away some of the installation headaches that dominated the Python world in the 2000s. You know, if you tried to install SciPy from source, even on Linux, in, like, 2005, it was a pain in the butt. Now, even if you're on Windows or Mac, you can type conda install scipy, and you basically have it working. But one of the problems with that that I've been noticing in my teaching is that every time you create a new tool or a new layer that smooths out the rough edges of the layer below it, it means that everything works well 95% of the time, but the 5% of the time it breaks, your user has absolutely no vocabulary to figure out how to fix it. So the example that comes to mind, and I mentioned this before, is if you're in a Jupyter Notebook and you wanna install SciPy so that it works in your Jupyter Notebook, you need to know that the Jupyter Notebook has these kernels running under it, and the kernel points to a Python environment. And a Python environment is built by virtualenv or by conda.
And in order to install in that environment, you need to choose which installer to use, whether it's pip or conda install, and you need to make sure the environment is activated before you run it, and that the pip that you're using points to the Python version that you're using in the Jupyter kernel. So there's this whole level of abstraction that you don't want people to worry about, and they usually don't. But then when something goes wrong, they're all of a sudden confronted with this tidal wave of concepts that they need to know in order to just get SciPy working in their notebook. Right? And I see that with a lot of these tools that are trying to smooth things over. And I debate what to do about that. Right? Because on one hand, you could go the route of, like, let's just start with the command line and ignore all these tools, so that we build up the knowledge people need in order to debug everything that comes up. But then you spend 10 weeks getting people to the point where they can run, you know, print hello world in a conda environment or something. Right? On the other hand, you could try to smooth out all of those edges and get people working with their hands on data right away, maybe using a VM that's set up perfectly to run Jupyter Notebook, that they connect to over the network, where nothing's running on their own computer.
You can do that, and you can get them analyzing data right away. But then at the end of the quarter, when that VM server turns off, they are no further along than they were at the beginning, because they don't know how to use their own computer. So this is the kind of question that I really have been struggling with lately in terms of data science education: what level of abstraction do you start with with the students? How much time do you spend teaching fundamentals, like how to list directories in the Unix shell, and how much time do you spend relegating those questions to Stack Overflow and teaching them how to manipulate and visualize data? It's a hard one. The layers of abstraction problem has been plaguing software
[00:24:21] Unknown:
since there was software, basically. So, you know, there's the law of leaky abstractions, where every abstraction will leak at some point. And, you know, it's great until it isn't, and then it's horrible. We should all just be writing machine code. Right? One of the projects that you're involved with right now is Altair. You were mentioning some of the difficulties of working with interactive visualizations, particularly in the context of Matplotlib, and my understanding is that Altair is an effort in that direction, for being able to generate statistical visualizations with a strong integration with Pandas data frames. So I'm curious how the work that you're doing on that factors into your teaching, and what are some of your goals with that project overall?
[00:25:05] Unknown:
Yeah. This project has been really interesting because, on my side, it's kind of a software engineering challenge, and all the conceptual parts of it are sort of taken care of and baked in. It's based on this visualization grammar called Vega-Lite that comes out of the Interactive Data Lab here at the University of Washington. That lab is run by Jeff Heer. You may have heard of Jeff Heer because, when he was at Stanford, he and his grad student Mike Bostock created this thing called D3, which drives a lot of interactive visualization on the web these days. But if you have ever tried to use D3 in an exploratory data analysis context, it's really painful. Right? Because D3 is this incredibly imperative thing. You're, like, drawing the axes and then drawing the tick marks and then drawing the data. To create a bar chart in D3 is, like, a hundred lines of code or something like that. So D3 is awesome for these kinds of polished, finished products, for making something that's available to the world through the web, but it's not great for the exploratory process. And so when Jeff finished D3, he started working on these grammars of visualization that would kind of address that. So Vega is a declarative grammar of visualization. Instead of saying, here's some code to draw the axes and then here's some code to draw some dots in the right place on those axes to create a scatter plot, with Vega you say: here's the data, and here's how I want the data to be visualized on the screen. It gives you a grammar to specify that. Vega itself is still a little bit complicated.
So after Vega was more or less complete, the team started working on Vega-Lite. And the idea of Vega-Lite is pretty cool. You're basically taking these concepts that have come out of data visualization research over the past few years and creating a grammar that lets you express those concepts. So for example, if you wanna scatter plot some data on x versus y, fundamentally what you're doing is taking a column of data and mapping it to the x axis on the chart, and taking another column of data and mapping it to the y axis on the chart. And then the charting library, once given that, should be able to generate the points and generate the axes and all these things. So all you really need to specify is: here's some data, I want x to be this column, I want y to be that column. And then the details are taken care of. So it's a really powerful way to quickly visualize data, because you don't have to worry about all these incidental details.
And what the Altair project does is it creates a Python API around this grammar of visualization. The grammar is expressed in the form of JSON notation, and I don't like writing raw JSON dictionaries, so I'd rather use some sort of object oriented API in Python that generates those JSON structures. That's what Altair tries to do. And, yeah, it's been a really fun project, because it allows you to really quickly build up complicated visualizations from these core fundamental concepts. What I'm really excited about with Altair for data science education is the fact that, number one, it takes away the need to learn this very imperative boilerplate that you get in something like matplotlib, and lets you basically just think about the data and the visualization.
And also, by design, it enforces visualization best practices in some ways. When you're in an imperative language where you're building the plot element by element to the final piece, you have to think at each element: is this a good choice as far as what we know about effective data visualization? But in the case of Altair and Vega-Lite, those good choices are being made by the code behind the declarative specification. So you end up with really useful visualizations without having to think about the details along the way. My colleague Brian Granger, who's on the Jupyter team and teaches down at Cal Poly San Luis Obispo, has been using early versions of Altair to teach data science and teach, like, the fundamentals of data visualization.
And it seems like it's been pretty successful for him, because the API of the library kind of shifts how you think about data visualization. You're thinking in terms of things like mapping columns to encoding channels and visualization channels, and those are things that help you make visualizations more effectively no matter what tool you're using. So it's sort of like, when the API mirrors best practice, you end up in a much better situation.
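To ground the column-to-channel idea, here is a minimal sketch of the Altair style (written against the current API, whose exact spelling has evolved since the early versions discussed here; the dataset comes from the vega_datasets sample package):

```python
# Declare a scatter plot by mapping DataFrame columns to encoding channels.
import altair as alt
from vega_datasets import data

cars = data.cars()  # a pandas DataFrame of sample data

chart = alt.Chart(cars).mark_point().encode(
    x='Horsepower',        # map a column to the x channel
    y='Miles_per_Gallon',  # map another column to the y channel
    color='Origin',        # a third column drives the color channel
)

print(chart.to_json())  # the underlying Vega-Lite specification, as JSON
```

The `to_json()` call shows the division of labor Jake describes: the Python objects are just a convenient way to build up the Vega-Lite JSON specification.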
[00:30:21] Unknown:
And one of the utilities of data visualization in the context of doing data science is being able to quickly identify the overall shape of the data and identify outliers visually, rather than having to painstakingly go through the underlying data or perform statistical analysis that may or may not yield any useful results, and then being able to guide the direction in which you are querying the data and trying to process it to see what are some of the underlying answers you can gain from it. So I could definitely see how having Altair as a quick way of translating the data from the format that you already have in your Python code to a format that you can see
[00:31:04] Unknown:
visually will drastically increase the rate at which people can explore and process that underlying data. We still do have a few technical challenges that we're working out. Like, for example, the fact that the Python code spits out this JSON notation that's then read by the JavaScript Vega-Lite library in order to visualize it in your browser. Because there are these multiple steps, right, it's another layer of abstraction. And we run into issues where, if there's an error on the Python side, we know how to handle that, but if there's an error on the JavaScript side, it's sometimes hard for users to debug. You know, in the initial release of Altair, if you did something wrong, like if you specified a column that doesn't exist in your data, the output would just be blank. Right? And it gives you no hints about the fact that you used a capital letter versus a lowercase letter in your column name. So we're still working through some of the usability issues of that. The other thing is that the mode that it works in, if you have a Pandas data frame, is it serializes the data frame to JSON and then passes that whole JSON object to the browser in order to visualize it. So if you have a data frame with a million elements, what it tries to do is serialize a million rows into JSON, and you end up with this massive glob of data that you're submitting to your browser, and it can make things run pretty slowly. So the library is still young, and we're trying to iron out some of these issues that come up. But I think long term, as we work with this library with students and kind of figure out how to streamline some of those things and make the errors easier to recover from, I think it could be a really useful library.
[00:32:59] Unknown:
And one of the other libraries that I'm familiar with that seems to be targeting a similar application is Bokeh from the folks at Continuum. So I'm wondering what was lacking or difficult in that library that necessitated the work on Altair, or are they just targeting completely different use cases?
[00:33:21] Unknown:
Yeah. That's an interesting one, because Bokeh started out 5 years ago. They really wanted to do kind of, like, a ggplot-style thing for Python. I know Peter Wang, when he was starting Bokeh, was digging into the grammar of graphics and figuring out how to do those sorts of things in a more declarative manner. As Bokeh developed, it became more of an imperative engine that they wanted to build a declarative wrapper on later on. So if you use Bokeh, it's an incredible library. You know, you can do these amazing interactive visualizations online. You can target the browser. You can even do really large datasets, because they use HTML5 Canvas rather than SVG, so it allows you to handle some bigger datasets than other web visualization tools. The thing that was lacking for me was the easy API.
Bokeh, to do a lot of things, requires a lot of boilerplate code in general, in my experience, and I know people might get mad at me for saying that. But I just don't find it all that much more intuitive than a tool like Matplotlib for your daily exploratory data analysis. That said, they have been working on this recently. There's a package called HoloViews that started out as a separate visualization package, and it's now been wrapped into the Bokeh project. And it's gonna basically be Bokeh's declarative API.
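For flavor, here is a rough sketch (my illustration, not from the episode) of the HoloViews declarative style rendered through the Bokeh back end:

```python
# Declare the data relationship; HoloViews and Bokeh handle the drawing.
import pandas as pd
import holoviews as hv

hv.extension('bokeh')  # use Bokeh as the rendering back end

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [4, 2, 5, 3]})  # made-up data
scatter = hv.Scatter(df, kdims='x', vdims='y')  # say what, not how
hv.save(scatter, 'scatter.html')  # writes an interactive Bokeh plot
```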
So it's sort of the exploratory data analysis or data visualization API built on top of this really powerful back end that is the Bokeh library. I see a lot of potential there. I think that could be a really nice tool as time goes on. With Altair, we have a slightly different goal than Bokeh's goal. We want it to be an API that people can use, but a key part of Altair is this Vega-Lite back end. And Vega-Lite is more than just an API for visualization. It's a grammar of graphics serialization.
So the idea is that if you create a plot in Altair, you can serialize it to Vega-Lite. And Vega-Lite is this compact JSON notation that you can pass to anyone, and that, in principle, tells you everything you need to know to reproduce the plot. So my vision long term, what I would love to have, is a matplotlib reader for Vega-Lite. So you can create something in Altair, send it to matplotlib, load it up in matplotlib, and then start doing matplotlib's annotations on top. Similarly, I could imagine, you know, sending this Vega-Lite thing over to Bokeh or sending it over to ggplot, and having Vega-Lite be this sort of lingua franca that operates between programming languages and between packages in those languages. Now, that vision will take a long time and a lot of software engineering to realize, but we're seeing a little bit of this buy-in. For example, Wikimedia recognizes Vega-Lite for visualizations now. So you can go on Wikimedia and put a Vega-Lite specification in there, and it will render it on the web page. And then if someone comes along and they want to modify that plot, they're not trying to reproduce someone's PNG with another package in order to add a new axis label. Right? They can actually load the Vega-Lite and change the spec using a raw JSON editor. They could change it using a tool like Altair. Or, in the future, you might be able to change it with a tool like Bokeh or ggplot or matplotlib or HoloViews or any of these other ones. So our vision for that is to really unify the visualization of information across languages, using Vega-Lite as a format. I should mention, similarly, that in the newer versions of Jupyter Notebook, Vega-Lite is built in. So, you know, you no longer have to load the Vega-Lite library; Jupyter just knows how to handle Vega-Lite. So whether you're using Jupyter from R, from Python, from Julia, if you have a visualization library that knows how to output these serializations, these Vega-Lite specifications, then you can visualize it in Jupyter. You can visualize it on Wikimedia, and we're hoping other tools will adopt that as well. And it sounds like you've been involved in a lot of really
[00:37:47] Unknown:
interesting research and applications of data science. I'm wondering what are some of the most novel or noteworthy that you think are worth mentioning here?
[00:37:56] Unknown:
Yeah. So there have been a lot of interesting things. My work over the past few years has been more along the lines of the software engineering side of things. One of the things I really enjoy doing is taking algorithms and concepts that are kind of locked away in papers, in basically unusable form, and putting them in the form of software, so that I can use them in my projects and so that others can use them in theirs. And I think this is a real strength of the kind of reproducible data science workflow that's going on now. So I've been involved, for example, in the AstroPy project, which is a community collection of astronomy routines written in Python, documented well, unit tested well, maintained well. And it means that I can take algorithms that are out there in the literature, that everybody has their own homegrown version of, and create a really nicely implemented, nicely documented version in a Python package that others can then go out and use, instead of wasting cycles reinventing the wheel in their research. So in my personal work, that's been the thing that I've been pretty passionate about, doing these sorts of things. From colleagues around the world and in my institution, the one thing that I've been really excited about is seeing the uptake of tools like the Jupyter Notebook for publishing executable documents in association with scientific projects.
And the best example of that from the last couple of years, I think, is the LIGO results, these gravitational wave results. It's amazing now that, you know, it's getting to the point that observing gravitational waves from a pair of colliding black holes is commonplace enough that it doesn't even get a press release anymore. You know, there was one, I think, last week that sort of flew under the radar. But the team at LIGO is doing a great job of making the data available, and making the simple analysis code available in the form of a Jupyter notebook, with explanations and with code inputs and outputs. So you, as, you know, an average person on the Internet with Jupyter installed, can go and download this and see how they did their analysis, and see the statistics that go into extracting gravitational wave signals from this incredibly noisy data. And that sort of thing, I think, is amazing. We're seeing some of the same stuff in journalism, publishing Jupyter Notebooks that allow you to go in and check their assumptions and make sure that what they're doing is, you know, up to par. So that sort of thing in the last 5 years, I should say, since Jupyter Notebooks are only about 5 or 6 years old.
That sort of thing has been really cool to see, this ability of people to publish data and publish analysis scripts, so that the average technically literate person on the Internet can go in
[00:41:15] Unknown:
and check it and validate it. Yeah. I think that's one of the really valuable and interesting outcomes of this revolution that we've been having in terms of data science and open data: the fact that it lowers the barrier for citizen scientists to get involved. And also the advent of some of the more open journals that are making it easier to gain access to some of these publications, so that somebody who's not necessarily working in academia can actually double check some of those results and get involved in the overall scientific process, rather than having it be contained within these different universities or research institutions that, you know, have historically been harder for the common person to gain access to.
[00:42:01] Unknown:
It's been really cool to see that, and to see it enabled by, you know, Python tools and by some of the software that my friends and colleagues are spending their time on. It's great to see that impact in the real world.
[00:42:15] Unknown:
And on those lines, what are some of the other trends in data science and data analysis that you're most excited for, and seeing how they play out in the future? I've already touched on some of the things that I'm most excited about, these
[00:42:28] Unknown:
I mentioned the visualization side, like, the Vega-Lite specification being this lingua franca between languages. I think it's exciting to see that developing in other spaces too. Like, for example, Wes McKinney, who created Pandas, seems to be spending a lot of time recently on this sort of idea of translatable data structures, in the case of data frames. You know, there's this Feather format that both Python and R can read, and these kinds of ideas. I think what I wanna see in the next couple of years is people stop having these language wars. You know, I think it might be fun, but it's very unfruitful to argue about R versus Python versus whatever other tool you want. These tools are all really great and have their advantages and disadvantages.
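The Feather idea is easy to see in a couple of lines; here is a quick sketch (my example data, and it assumes the pyarrow package is installed) of writing a data frame from pandas that R's feather or arrow packages can read back:

```python
# Write a DataFrame to the cross-language Feather format, then read it back.
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.0, 2.5, 3.7]})
df.to_feather('shared.feather')              # written once from Python...
same_df = pd.read_feather('shared.feather')  # ...readable from Python or R
```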
And if we can make the tools work together, work in concert more easily, then I think we all win. Whereas if we spend our effort, like, trying to troll our users on Twitter or something like that, then it's not as useful. So that sort of more ecumenical approach to data science,
[00:43:40] Unknown:
I think, is something I'm really excited about seeing. Yeah. I can definitely agree with the fact that turning the conversation from either-or to yes-and can have much more valuable outcomes, and fosters the growth of larger communities than just the ones that are contained within a particular language or toolset. Yeah. Definitely. And are there any other topics that you think we should talk about that we didn't touch on yet? Oh, man. We,
[00:44:04] Unknown:
you know, we covered all the stuff I'm really excited about right now in the data science space.
[00:44:10] Unknown:
Yeah. I think the one thing that we didn't really cover, because it deserves its own conversation in and of itself, is the idea of repeatability within research. But as I said, that's a very long and involved topic that does not have any obvious solutions yet. But I think that a lot of the work that you and others are doing in the tooling and, sort of, pedagogical techniques is yielding fruit there, to make that reproducibility of research easier.
[00:44:37] Unknown:
Yeah. The combination of executable document formats like Jupyter Notebook and R Markdown with the building of software skills, the use of version control, and the open publishing of code, I think that's really addressing that. One thing that hasn't been really well addressed in that space, I think, is reproducibility and repeatability on the scale of, like, 5 to 10 years, versus the scale of 6 months to a year. This is something I think we, as scientists, are gonna have to address, because packages evolve quickly. Right? So a package like NumPy, you know, you can do a lot of things in it. And if you install NumPy version 1.11, you might expect that you'll be able to do the same thing 10 years down the road. But features are deprecated, and APIs change slowly over the course of years, and there are deprecation cycles, so that, you know, if you move from NumPy 1.11 to 1.13, maybe something that you use is going to be deprecated. There, you get a warning and you can fix it. But code that's just been sitting in a repository for a decade, I have no confidence that I'll be able to run that code again when I revisit it in 10 years. And that's a larger thing, and it goes beyond the packages, you know. Even if you're able to record package versions and install legacy versions of Python and NumPy and the whole tool chain, things like VMs or even Docker containers are what you build that on top of, and your foundation is shifting enough that I'm not sure we'd be able to work with it. So that's a concern to me, because we're making such progress in the reproducibility space right now on the, like, 1 to 3 year timeline, that I think we're gonna be grappling with the reproducibility
[00:46:26] Unknown:
issue on the 5 to 10 year timeline down the road. Yeah. And even just the challenges of data preservation, particularly with these large datasets: where do you store it, who pays for it. Oh, yeah. That's a very complicated discussion, and maybe one that we can find time for on another episode.
[00:46:42] Unknown:
Yeah. Well, Moore's law helps with that a little bit, you know. Like, I talked about the Sloan Digital Sky Survey that was huge 15 years ago, but it's, like, 20 terabytes now, which you can buy for a few hundred dollars. So we're getting there.
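As a concrete aside on the version-drift problem Jake raises above, here is a small sketch (my illustration, not a practice discussed in the episode) of two habits that help: stamping the exact package versions into an analysis's output, and surfacing deprecation warnings early instead of letting them accumulate:

```python
# Record the versions an analysis ran against, and fail loudly on
# deprecated API use so drift is caught now rather than a decade later.
import warnings
import sys
import numpy as np

print("python", sys.version.split()[0])
print("numpy", np.__version__)

warnings.simplefilter("error", DeprecationWarning)  # deprecations become errors
```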
[00:46:58] Unknown:
Well, I think this is probably a good place to wrap up. So for anybody who wants to get in touch with you or follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And with that, I'll move us on to the picks. Okay. And my pick this week is going to be The Redwall Cookbook, which is just what it sounds like. For anybody who's not familiar with the Redwall series, they're this collection of novels focused around a group of woodland creatures that have built an abbey, and all of the different adventures that they go on or enemies that they have to battle. It's a really fun and interesting set of novels, and every one of them has a description of these vast and ornate feasts, with very detailed discussions of the types of food. And so for a long time, readers of the series were asking for a cookbook, and the author of that series eventually did release one that is also a story in and of itself, with the recipes dispersed throughout. So it's a really fun book with some good recipes, and I'll add a link to that in the show notes. And with that, I'll pass it to you, Jake. Do you have any picks for us this week?
[00:48:00] Unknown:
Oh, man. You know, what I've been reading recently is a lot of nonfiction, and there's a historian from Princeton named Kevin Kruse whose work I really, really appreciate. He wrote a book called White Flight, and I'm reading the next one right now, though I'm trying to remember what it's called. But, yeah, he has some really great perspectives on how social trends over the last 100 years have influenced the political landscape of today. And, in light of what's going on in politics right now, I'm finding it really interesting to have that sort of historical backdrop. Definitely sounds like an interesting book. Well, I appreciate you taking the time out of your day today to join me and talk about the work that you're up to. It's definitely
[00:48:44] Unknown:
interesting and valuable, and I hope to continue following your work, because you've already produced a lot of useful tools for myself and others. And so with that, I'd like to thank you again, and I hope you enjoy the rest of your day. Yeah, thanks so much.
Introduction to Jake VanderPlas and Data Science
Scientific Python Communities vs General Python Communities
The Evolution of Data Science in Python
Astronomy's Influence on Data Science
Challenges in Teaching Data Science
Tools and Techniques for Data Science Education
Altair: Simplifying Data Visualization
Comparing Altair and Bokeh
Noteworthy Data Science Projects and Research
Future Trends in Data Science
Reproducibility in Research
Closing Remarks and Picks