Summary
There are countless tools and libraries in Python for data scientists to perform powerful analyses, but they often have a setup cost that acts as a barrier to ad-hoc exploration of data. VisiData is a command line application that eliminates the friction involved with starting the discovery process. In this episode Saul Pwanson explains his motivation for creating it, why a terminal environment is a useful place for this work, and how you can use VisiData for your own work. If you have ever avoided looking at a data set because you couldn’t be bothered with the boilerplate for a Jupyter notebook, then VisiData is the perfect addition to your toolbox.
Announcements
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host as usual is Tobias Macey and today I’m interviewing Saul Pwanson about VisiData, a terminal-oriented interactive multitool for tabular data
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what VisiData is and how the project got started?
- What are the main use cases for VisiData?
- What are some tools that it has replaced in your workflow?
- Can you talk through a typical workflow for data exploration and analysis with VisiData?
- One of the capabilities that you mention on the website is quickly opening large files. What are some strategies that you have used to enable performant access for files that might crash a typical editor (e.g. Vim, Emacs)?
- Can you describe how VisiData is implemented and how it has evolved since you started working on it (including the upcoming 2.0 release)?
- What libraries or language features have proven most useful?
- Why did you choose to implement VisiData as a terminal-only tool and what constraints does that bring with it?
- What are some of the most challenging aspects of building a terminal UI for data exploration and analysis?
- Because of its manifestation as a terminal/CLI application it relies heavily on keyboard bindings. How do you approach key assignments to ensure a consistent and intuitive user experience?
- What are some of the types of analysis that VisiData can be used for out of the box?
- What are some of the most interesting/unexpected/innovative ways that you have seen VisiData used?
- How much community adoption have you seen and how do you approach project governance as a solo developer?
- What do you have planned for the future of VisiData?
Keep In Touch
Picks
- Tobias
- Data Is Plural newsletter
- Saul
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
Links
- VisiData
- F5 Networks
- HDF5
- PyTables
- vgit
- vping
- Jeremy Singer-Vine
- data.boston.gov
- Recurse Center
- Curses
- dateutil
- decorators
- Electron
- OpenRefine
- Tmux
- VisiCalc
- Windows Subsystem For Linux
- Saul’s Lightning Talk
- The Book of VisiData
- Where In The World Is Carmen Sandiego
- Oh My Zsh
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.
And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events and take advantage of our partner discounts when you register.
Your host as usual is Tobias Macey, and today I'm interviewing Saul Pwanson about VisiData, a terminal-oriented interactive multitool for tabular data. So, Saul, can you start by introducing yourself? Yeah. My name is Saul, and I've been working in the software industry for a while, and I'm in Seattle. And do you remember how you first got introduced to Python?
[00:01:58] Unknown:
Yeah. It was, for work back in 2004. So Python 2 days. I was at a startup and they were using it for their apps, and I picked it up because that's what you do. Right? And so
[00:02:08] Unknown:
at some point, you decided that you needed to start building your own multitool for working with data, particularly in tabular formats. And I'm wondering if you can just start by describing a bit about the tool that you built in the form of VisiData and some of the backstory of how it got started.
[00:02:24] Unknown:
Yeah. So I was working at a company called F5 Networks in 2011, I think, and I built an early prototype of VisiData for them, basically. I didn't know it at the time; it was just a configuration and exploration tool for their networking hardware. But as I got to using it, I found the concept very flexible, and I kept adding more and more stuff to it. And then after I left F5, I found myself missing it. I kept wanting to use it to just look at, like, HDF5 files or just random tabular data that I had. And so when I was at another company in 2016 and that company was winding down, I was turning 40, and I looked back over my career of almost 20 years and realized that I didn't have very much to show for it. And I was like, well, I've been wanting this tool just kind of around and about, so let me just start that again. I couldn't use the code that I had written at F5, obviously. So I was like, well, if I'm gonna rewrite it, let me do it again from scratch, and I'll do it right once and for all with the tool that I wanna use. And so that's the genesis of VisiData. There are a couple interesting things in there. One, the fact that you had to rewrite this tool from scratch: I'm wondering whether you experimented
[00:03:32] Unknown:
with PyTables at all or any of the other libraries for that particular data format, and what you found lacking that you wanted to have present in VisiData.
[00:03:44] Unknown:
Yeah. So I hadn't used PyTables, but I've used a lot of other Python tools for working with HDF5 files. And just like with any complicated format, you can do anything you want with Python; Python's really great for that, but it's still kind of a hassle. You still have to write code to do it, and often what I would want to do is just open up an HDF5 file and take a quick look and see if that dataset isn't all zeros, or has the tables that I want in there. And among the tools that I had, there was one that was written in Java. And so, of course, you'd start it up and it would take, like, three seconds to start, and then it actually had a bug in it so that it would modify the data if there was anything after it; it would just kind of truncate it, for something that was supposed to be read-only. And so every kind of tool for just quick exploration
[00:04:23] Unknown:
didn't really work for that purpose. And I was really missing that really rapid just showing the data; I always wanna see it with my own eyes and then pop back up. And then for the sake of you rewriting this tool from scratch, I'm curious if there was any functionality in the original tool that you either consciously left out because it didn't suit your particular needs at the time, or any functionality that you tried to replicate but weren't quite able to match because you didn't have the necessary context that you had at the time when you were working at F5? Yeah. That's a really great question.
[00:04:56] Unknown:
I mean, I've been working on this tool now; I've worked on VisiData for a lot longer than I worked on the original prototype that I had made. But, because it was a networking company, one of the things that was in the original version that I still haven't added (and maybe someday it can be added, it wouldn't be that big of a deal) was the ability to add, like, a derivative column. Because it was live-updated from actual data on the device, if you see, like, the bytes transferred, for instance, then you could add another column that would be the derivative of bytes transferred: bytes transferred per second. And you could just kind of do that for any column you wanted to, and that was a super handy feature. But I'm finding that VisiData is used more for static data now than the original tool was at F5, and so that feature hasn't been added. I do remember that that prototype had pop-ups: like, if you want to change a field between ten values in an enumeration, there was a little modal pop-up that would show up and you could scroll down and pick the right option. It was always kind of a neat little feature, made things a little easier to use. Like, right now, if you have to add an aggregator, you look at the bottom line for the list of them, which I still don't quite like. But I decided not to add modal dialogues to VisiData: if you want to see something, you have to go to a fresh sheet, or the modality is just on the bottom status line. That was a conscious design decision. It looks flashier to have the pop-ups, but as I'm doing more design work, I see that modal dialogues just kind of get in the way. You know? You mentioned that the common use case for VisiData now is for static data. I'm wondering if it has any capacity for being able to process continuously updating information, such as the top command in Linux or the network streams that you had built the original tool for. Yeah. Absolutely. And in fact, one of the things I would love for VisiData to have more of are adapters or plugins or loaders for other things like top. Actually, top's a great example. And I've got some prototypes of something for vgit; vgit's kind of still more static data. I do have a vping, which does a kind of a combination traceroute and ping; it just kind of live-updates as it's finding the various hops and their latencies and stuff like that. But then, well, it turns out that every one of these applications is a pile of work, and I've been devoting so much to the design of the core application that I found myself not having a whole lot of time to polish the other ones to the degree that I want to. So I wish that I had the time, or could find somebody that wanted to work on that with me, but I've been focusing on the main one. I'm curious what the main use cases are for VisiData that you have found, both personally and within the community, and in particular,
[00:07:31] Unknown:
which tools it has replaced in your toolbox for things like systems administration or data analysis that you might reach for otherwise, but that VisiData is just more convenient to use? Mhmm.
[00:07:44] Unknown:
So it seems like one of the main use cases is to get a first look at downloaded data. I know that Jeremy Singer-Vine, who is a data editor at BuzzFeed, uses it all the time for exactly this purpose, you know, because you find data all the time online. It's public data, or even if it's not, you don't know if it's even useful, and you don't wanna spend a whole lot of time investing in porting it into a database, for instance, or getting it into some form. You just wanna see right now: just show me the columns and the first however many rows, and then you do a quick search or however you work. And being able to get to that instantly, as opposed to doing any amount of coding or work, is one of the huge benefits of it. And so I would say that that is the main thing that it's super useful for, you know, as an exploratory tool. And then the other thing that I find it super useful for is basically just getting data from one format and structure into another format and structure.
So you know, if you just need this one-off thing: you've got this pile of data here, whether it's an Excel spreadsheet or even just piped in from another command, and you wanna remove those four columns and add this other computed column and just save that off and pass it on. You can get that done in, like, 30 seconds, whereas doing it in any other tool is gonna take you at least minutes to get those tools put together and write the code or whatever you're gonna do, and probably more like half an hour. And so it's a lot of those one-off scripts that I think it has replaced for me.
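To make that comparison concrete, here is a sketch of the kind of one-off conversion script that VisiData replaces. Everything in it is hypothetical: the file names and column names are made up for illustration. It just drops a few columns, adds a computed column, and writes the result out in another format.

```python
# Hypothetical one-off conversion script of the sort VisiData replaces:
# drop some columns, add a computed column, save to a different format.
import csv

with open("input.csv", newline="") as src, open("output.tsv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    drop = {"internal_id", "notes", "raw_blob", "checksum"}
    fields = [f for f in reader.fieldnames if f not in drop] + ["total_cost"]
    writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    for row in reader:
        out = {k: v for k, v in row.items() if k not in drop}
        # computed column: quantity * unit_price (columns assumed to exist)
        out["total_cost"] = float(row["quantity"]) * float(row["unit_price"])
        writer.writerow(out)
```

In VisiData the same edit is a handful of keystrokes on the live sheet, with no script to write or maintain.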
[00:09:05] Unknown:
Yeah. There are definitely utilities, both in the general user-space libraries, particularly for Linux, but also in Python, that can do direct conversion between formats. But the workflow you were describing, of being able to manipulate the information before you commit it to the other format, is definitely something that would typically require a lot more effort and exploration to make sure that you're getting things right. And then once you do get it working, you're likely to use it repeatedly, but it's going to be much more static and brittle than if you were to use VisiData in a more exploratory fashion. And I know that it also supports being able to build pipelines for that repeatable use case as well. So I'm wondering if you can just talk through an overall typical workflow for data exploration and analysis, and also the sort of conversion workflow that you might use VisiData with. Yeah. Okay. So,
[00:09:54] Unknown:
a typical workflow: for instance, I might download some data from the Internet, like data.boston.gov; every city's got their own public datasets, right? Then you open it with vd, you poke around, see what's there. I like to see the scope of the data and the precision: basically, how wide does it get and how many individual pieces do you get, how clean it is and how useful it is for my purposes. And then as I'm browsing around, I find that, oh, this column seems interesting, I wonder what's in here. And because there might be, you know, a million rows in here, it's hard to see that, and often it's sorted in a certain form. And so I do a frequency analysis very often. Shift+F is, I think, one of my favorite things about it. And so, you know, you just Shift+F on any column and, maybe instantly or at least within a couple of seconds, you can see the top values in that column and the total number of values in that column, and I think it's really great for finding if there are any anomalies or outliers in the data. So you can kind of use it as a quick sanity check. It's like, oh, I see, there's really no data in that field, or seldom is there data in that field, or that's interesting: why is half the field completely empty? You know, etcetera. So that's one workflow. What happens is, once I've gotten something into a state that I want, then it's pretty easy to revisit the command log and cull it down to get only the stuff that I really want, and then save it off, and then I can do that repeatedly if the data might be updated or I want to share it with somebody else. In fact, one of the things that's been interesting is how useful the command log has been for debugging, for when people say, this is what I've been doing, this is the input data I have, and here's the command log that I've been running over it. And it's interesting how I will look at the command log and it's like, oh, I wouldn't have known that you did that exact command here, or on that row, and that changes everything. That is the way that I can figure out what's going wrong. So it's been a really useful debugging aid too, the replay. Wondering if you can talk a bit more about the command log, because as you said, it's definitely frequent that you might be running through an exploratory
[00:11:56] Unknown:
cycle and you finally get to a good state, but you don't necessarily remember all the steps that you've run through. And you may have deleted code or added new code, so you don't know exactly what the flow was. Whereas by using a more keyboard oriented tool, you can keep that history and see what the overall workflow was. So I'm curious if you can just discuss the command log and how it manifests in VisiData and some of the benefits and drawbacks that it might provide. Yeah. So
[00:12:23] Unknown:
the command log is one of those. So as far as I'm concerned, VisiData is a grand experiment. Like, I initially made it as just a browsing tool, a CSV browser as I used to call it, and then it turns out that it's a lot more broadly useful. Like, it turns out that spreadsheets and tables are a very universal construct. And so, I was at the Recurse Center in early 2017, and I was just playing around. I was like, I wonder if I can record all actions into a table itself, and I just did that, and it didn't work at first, because I was recording all the motions and everything else. I was like, oh, that's a mess. But then once I took out the ones that didn't belong in that kind of log, wow, it actually works remarkably well. And I think it's actually super handy. It's not as good, obviously, as a Python script. Like, you can only have one input field; it's kind of a rigid structure. But given the limited amount of data that's on the command log, it's remarkably flexible and powerful. I'm very surprised about that. There is one thing that I do wish that I could solve, and I mean, I know it is solvable, but I haven't managed to solve it yet, and that's that the command log records every action that you take, including all the dead ends that you wind up finding yourself in. And so those are sometimes handy to have on there so you can see what you did and what you, you know, didn't want to keep around. But when you're getting to the end and you're like, I want to do this again, you want to cull all those dead ends out, right, and just get to the place that you currently are. And so to have some kind of graph or tree that would get you from where you currently are and only show the commands that you took to get there, that's what I really want to add to it, and I just haven't gotten to that point yet.
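To make the idea tangible, here is a purely conceptual sketch of a command log: each user action is recorded as a row in a table, and replay just re-executes the rows in order. The class and field names are invented for illustration; this is not VisiData's actual log format or API.

```python
# Conceptual sketch of a command log: each action is appended as a row,
# and replay re-executes the rows in order. Not VisiData's real format.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class LogEntry:
    sheet: str        # which sheet the command ran on
    command: str      # long name of the command, e.g. "sort-asc"
    input: str = ""   # any user-supplied argument, e.g. a column or expression

@dataclass
class CommandLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, sheet: str, command: str, input: str = "") -> None:
        self.entries.append(LogEntry(sheet, command, input))

    def replay(self, dispatch: Callable[[LogEntry], None]) -> None:
        # dispatch() maps a logged entry back onto a real command implementation
        for entry in self.entries:
            dispatch(entry)

# usage sketch
log = CommandLog()
log.record("sales", "sort-asc", "date")
log.record("sales", "select-expr", "region == 'EU'")
log.replay(lambda e: print(f"replaying {e.command}({e.input!r}) on {e.sheet}"))
```

Pruning the dead ends he mentions would amount to filtering `entries` down to the commands on the path to the current sheet before replaying.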
[00:13:57] Unknown:
And another thing that you mentioned, and that I also noted in the documentation, is the case where you have a file which might have millions of lines, which would typically be either difficult or impossible to open in a more traditional terminal-oriented editor such as Vim or Emacs or the less pager. And I'm curious what types of performance strategies and techniques you've used to be able to handle files like that, and particularly
[00:14:25] Unknown:
being able to manipulate them without just exploding your memory usage? So, one thing is that it actually does explode memory usage. It doesn't explode as much as some other things, but in my mind, it actually does take a lot of memory. Beyond that, though, I think there are a couple of things that matter here. One is that there's an async-thread decorator that you can add to any function in VisiData, and that means that when you call that function, it just spawns a thread and goes off and does it asynchronously. And there's always one main interface thread that stays active and is constantly updating with the status of that. And then within that thread, you can add a little progress meter, and those are pretty easy to do. But the main thing is, because it is so easy to spawn additional threads, I do that all the time whenever I'm doing anything that might take a while, and I know that any kind of linear operation on data, because you might have millions of rows, might actually take a while. So I wanna make that spin off its own thread. And that keeps me conscious of how much time things are taking, for one thing. But I wanna say that I actually don't think VisiData is fast in itself. It's just responsive, and responsiveness matters more than speed, it turns out. I would rather spend ten seconds watching a progress meter update and make it to the end than five minutes with no progress and not knowing how long it's going to take at all. One is kind of soothing (I can take a break for ten seconds) and the other one is very frustrating and puts me on edge. It's like, do I need to do something here to kill it before it takes over my entire computer? And then the other thing that I think is important is that tools like Excel or Vim often want to own the data. They want to import it into their own structure and format, and that's what causes the thing to really blow up, because now every cell has to be stored separately in its own custom thing.
Whereas the key architectural thing that VisiData has, that makes it so flexible, is that it stores the rows natively. And so whatever I get from whatever Python library, that just becomes the row; every item is just that object. And then columns are basically lenses into those rows. So it's basically computing the cells every time. It doesn't grab the cells and put them somewhere; it just computes a cell whenever you wanna see it. And adding a column is trivial: you can add a column in constant time. That also means that saving, for instance, is comparatively slow, because it has to do the evaluation of all those columns for every single cell when it's saving. But it turns out you don't usually have to save everything, and you're not actually looking at everything; you're only looking at the first hundred rows, or these columns, or whatever.
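Here is a minimal sketch of the two ideas just described: a decorator that pushes slow work onto a background thread, and columns as lenses that compute cell values on demand from untouched row objects. All of the names are hypothetical; this is not VisiData's internal API, just an illustration of the architecture being described.

```python
# Hypothetical sketch: async-thread decorator plus rows-as-objects,
# columns-as-functions. Not VisiData's actual internals.
import threading

def async_thread(func):
    """Run the decorated function in a daemon thread so the UI stays responsive."""
    def wrapper(*args, **kwargs):
        t = threading.Thread(target=func, args=args, kwargs=kwargs, daemon=True)
        t.start()
        return t
    return wrapper

class Column:
    """A lens into a row: cell values are computed on demand, never stored."""
    def __init__(self, name, getter):
        self.name = name
        self.getter = getter

    def get_value(self, row):
        return self.getter(row)

class Sheet:
    def __init__(self, rows, columns):
        self.rows = rows        # rows are whatever objects the loader produced
        self.columns = columns  # adding a column is O(1): just append another lens

    @async_thread
    def save_tsv(self, path):
        # saving is the slow path: every cell gets evaluated here
        with open(path, "w") as out:
            out.write("\t".join(c.name for c in self.columns) + "\n")
            for row in self.rows:
                out.write("\t".join(str(c.get_value(row)) for c in self.columns) + "\n")

# usage sketch: dict rows straight from some loader, columns as lenses over them
rows = [{"name": "a", "bytes": 10240}, {"name": "b", "bytes": 20480}]
cols = [Column("name", lambda r: r["name"]),
        Column("kbytes", lambda r: r["bytes"] / 1024)]
Sheet(rows, cols).save_tsv("out.tsv").join()
```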
[00:17:05] Unknown:
So I feel like, between the threading and the rows-as-objects and columns-as-functions architecture, that's what really helps it stay focused on the user experience. Can you dig a bit more into the architecture and implementation of VisiData itself and some of the ways that it has evolved since you first started working on it? And I know that you've also got an upcoming 2.0 release, so maybe talk a bit about some of the evolution that's coming there. And then any libraries or language features that you have leaned particularly heavily on and found most useful in your work? So,
[00:17:35] Unknown:
probably the biggest evolution is that when I first was doing it, it actually started out as a single-file script. I kind of put everything into one file, and the idea was that it was a drop-in thing: you could copy it over to a server over SSH and start using it, and you wouldn't need anything but base Python. And I licensed that as MIT, and then as I started adding more modules or plugins, like internal plugins, I licensed those as GPL3. And I was trying to keep this very clean separation, because the idea was that you could then use the vdtui part, which was very similar to the thing I had at F5, for making all kinds of other apps. But then nobody really took me up on that, and it's kind of hard to use somebody's single-file library like that unless it's a super tight little library, so I kind of gave up on that and am now heading more towards a plugin-platform architecture, where VisiData the app is the thing that hosts the individual plugins that you can add. It may even wind up having a vgit application that you can use, but it's kind of turned on its head: as opposed to incorporating some particular version of the single file, you're actually using the VisiData library. And then, beyond that, it's now just a bigger open source project. We've got a whole packaging and release cycle, and I'm working with Anja, who has been very instrumental in some of the packaging and testing and documentation work that we've been doing, and it's just kind of taken off and begun gaining more traction in that sense.
And you asked about the libraries that I like, and actually one of the things that I wind up doing to keep performance good is I take very few dependencies. I feel like layers are how things get messy, and so the fewer layers you have, the better off you'll be if you can wind up coding everything in between. So for the libraries that I use: obviously, curses is essential, but that's built into the Python standard library. The Python standard library is really fantastic and everything's included, which is a super bonus. But then also the PyPI ecosystem in general is so broad that any format that I come across, HDF5 or Excel or whatever, there's a library already for it, and it's a library that you can just use in Python: you write a page of code and you've got the stuff in there. And all the loaders really are just importing those libraries, calling them, putting the rows that they return in there, and having some columns around it. It's a pretty simple concept. One thing you mentioned: you wanted to know other ways that I keep VisiData fast.
I've been very focused on making sure that it starts up very quickly. I feel like if it gets to be a half second of startup time, it just gets in the way; it feels like a certain kind of friction. And so one thing that I do a lot of is lazy importing. All these libraries, I have no idea how long they're gonna take to load or set themselves up, and I know that there are some pretty heavy ones that VisiData uses sometimes. But I don't import those unless I need to, and so when you open up an Excel file, for instance, that's when it imports the XLS library, and if you don't ever load an Excel file, it doesn't have to spend that time. So that's another one of those tricks. And before we move on, I wanted to mention the Python dateutil library. I don't take many dependencies, but that's one that I've been very happy to take, because it parses any date format that you can throw at it. Like, if it can be parsed, I feel like it will parse it. It's amazing, a best-in-class date detection and parsing tool. And another feature that I have used a lot is Python decorators. I mean, I think that's a pretty standard thing, but I use them as just a way of tagging functions; for instance, I mentioned the async-thread decorator. It's taking a pretty advanced concept sometimes and reducing it to its essence, so that I don't have to think or work hard to have those concepts work for me.
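Both of those tricks are easy to show in miniature. The sketch below uses hypothetical function names (this is not VisiData's loader API); openpyxl stands in as an example of a heavy dependency that only gets imported when a file of that type is actually opened, and dateutil's parser handles free-form date strings.

```python
# Hypothetical sketches of lazy importing and dateutil-based date parsing.

def load_xlsx(path):
    # The heavy spreadsheet library is imported only when an .xlsx file is
    # actually opened, so it never slows down startup. openpyxl is just an
    # example of such a dependency, not necessarily what VisiData uses.
    import openpyxl
    workbook = openpyxl.load_workbook(path, read_only=True)
    sheet = workbook.active
    return [list(row) for row in sheet.iter_rows(values_only=True)]

def to_date(value):
    # dateutil parses nearly any date format you throw at it.
    from dateutil import parser
    return parser.parse(value)

print(to_date("2019-08-04"))
print(to_date("Aug 4, 2019 3:15pm"))
```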
[00:21:33] Unknown:
Yeah. The decorator capabilities and the syntax that allows for them definitely make it a lot easier to organize code and concepts, when you can just drop it on top of a function definition without having to try to incorporate it into the body of the function and remember what the return types are, what the inputs are. You just say it wraps this function and then it handles it. I don't have to think about it anymore, and you can just keep it all in, like, a utilities library, for instance. Yeah. Exactly. Totally. Oh, I will say that one of the
[00:22:02] Unknown:
things that I wish decorators did support is being able to put the function, like the def with the function name and signature, on the same line as the decorator. And it's a minor point, but I use a lot of, you know, grep-type tools, and I would like to be able to know, oh, that function that I'm looking at is async-threaded, or (I have a deprecated decorator now) I wanna see that when I'm searching for functions, for instance. Well, one tip there is that if you add the
[00:22:30] Unknown:
grep -B 1 flag, then it'll show you the line that you want as well as the line just before it. Oh, that's a great tip. Yeah. Thank you. And another thing that stands out with VisiData particularly is the fact that it is entirely terminal-oriented, whereas a lot of data exploration tools will be more focused on a graphical interface or trying to embed into a Jupyter notebook for providing some sort of visualization. And I'm wondering what your motivation was for focusing so closely on the terminal interface and making it a command line client. Yeah.
[00:23:04] Unknown:
I feel like the terminal is my home. Like, I've been using terminals since the eighties, and so I'm very comfortable in that environment with those kinds of restrictions, and I'm much more comfortable with a keyboard than a mouse. But then as I'm getting older, it's becoming harder to just type verbosely, and so I really wanted individual keystrokes to do things, and I couldn't figure out how to do that in, for instance, a Jupyter notebook. The other thing is that a terminal, because it's so old and mature, let's say, is actually a universal interface. Like, any platform has an SSH client and a terminal, and the only other way that you can get that level of universality is with a web browser, like for instance an Electron client, and that was way too heavy. So I feel like the choice is between a terminal, which is very light, and an Electron client, which is very heavy; the choice for me is obvious. I want it to be a very quick in-and-out thing, and I don't know how I would do the same kinds of things as quickly in the web, or even a native app, if you have to reach for the mouse.
Like, you could do it with a given native app for sure, but then you have to make a native app for Mac, for Windows, for Linux, etcetera, and I just didn't wanna do that. And so really, you know, VisiData is only about 10,000 lines of code all told, which is actually quite a bit, but it's not that much when you compare it to comparable tools like OpenRefine or whatever. And I attribute that to being on the command line: I can do kind of the minimum necessary to get the job done and don't have to worry about all the, you know, pixel widths and stuff. It's like, no, you choose your own font, you choose your own font size and the ways you wanna interact with things. I wanna get out of the way, to be honest. It also has the benefit, as you were describing originally, of being able to just copy it over via SSH
[00:24:47] Unknown:
or now pip install it on a remote machine. You don't have to go through the hoops of trying to set up a way to have a graphical interface to that remote box. You can just copy it over, and it broadens the reach and capabilities and use cases for the tool where the only access you have is via terminal, which, as you said, despite its age, is still a fairly common experience for people who are working in technology.
[00:25:15] Unknown:
Yeah. Absolutely. And, you know, I'm working remotely now, and we do have screen sharing apps, but they don't always work that great and you sometimes have to install some other plugin or whatever. And the app that I love most, I actually want to call it out, is called tmate. It's a tmux wrapper, I guess: you install it, you type tmate, and you get an instant shell into your own machine. You just give somebody an SSH link and they can join, and people are usually amazed. And it's like, well, between that and VisiData, now we have an instant data exploration platform that we can share, with just a little bit of, you know, a chat client or whatever. That's it. And I find that to be so much more accessible than modern video chat things. Like, even us, at the start of our session here, had technical difficulties. The shell is just very reliable,
[00:26:05] Unknown:
by comparison, I think. Given that you are targeting the terminal environment, what are some of the constraints that that brings with it in terms of the capabilities that you can bring into VisiData? Yeah. And just some of the most challenging aspects of trying to build a user interface for data exploration and analysis within an environment that is so graphically constrained, particularly given the fact that you have incorporated some visualization capabilities, and just some of the ways that that manifests?
[00:26:36] Unknown:
Yeah. I have to say that I've been pleasantly surprised; I haven't felt as constrained as I thought I would. Like, once I discovered that I could use Braille characters, for instance, to do the graphing, it kind of just works. And, you know, it's not perfect, but if you want more perfect things, you should be using other tools. And in fact, that's one of the things I think is important with VisiData: it's not meant to be the be-all and end-all. It's kind of a glue technology, right? It's an adapter. And so once you figure out what you want to do, then you should go to the fancier tool and do it right. But there's no reason to do everything super right from the get-go; you just want a quick glance at it. You asked about the most challenging aspects of building a terminal UI, and I have to say that the thing that's been most challenging for me is knowing that if VisiData were on the web, it would probably be worth a lot of money, and doing it in the terminal means that I've kind of eschewed that. Like, no one really pays for terminal tools. I shouldn't say no one; I actually have several Patreon subscribers, and I'm really appreciative of them. But, you know, if you think about how VisiCalc back in the day was a fancy program that sold for hundreds of dollars, I can't imagine doing that with VisiData. That's just not how the world works. Although if it were a native app, that might actually be what it was possible to sell it for, for instance. The other thing that is more challenging than you might think: you mentioned pip installing something, and that's really great for people who already have Python and already know how to use pip. But I actually think VisiData is a pretty reasonable tool for anybody who's willing to use the keyboard, and yet installation is one of the trickier parts, right? Like, if I wanted my partner to go and install it on her computer, I have to tell her to install this, click here, download this, and people just want a single thing they can download, they can double-click on, and then go, right? And that's just not how the terminal world generally works. So I feel like installation is one of those weird things where it's a larger barrier to entry than it should be, and yet I can't find a super easy way around that. So that's just how it has to be. I'm also wondering which terminal environments it supports, because Windows is generally one that's left out of the support matrix for a command line tool. But because of the fact that Python does have the curses interface built in, or if you were relying on the prompt_toolkit
[00:28:53] Unknown:
library. I know that there is the possibility of being able to support the Windows command line. So I'm just curious what the overall support matrix is for VisiData.
[00:29:01] Unknown:
Yeah. You know, I started off on Linux because that's what I run, and then of course it turns out it works in a Mac terminal just fine, and I didn't even have access to a Windows machine, so I was like, well, Windows isn't really supported. And it just turned out that people were just running it under Windows Subsystem for Linux; WSL, that's what it's called. And it basically just worked there. And then somebody submitted a small patch, and it wasn't really that big, and it works, I think, even without WSL now on a more recent version of Windows. So I actually have never used it on Windows, but I know we've got quite a few people, some Italian open data users like Andrea Borruso, who love using it on Windows, and it works fine. So as far as I'm concerned, it works under Mac, Windows, and Linux, and seems to be fine in all of those. I wouldn't say it's what we support necessarily, but if it works, I'm not gonna say we don't support it either. Right?
[00:29:55] Unknown:
Another peculiarity of building a command line oriented client is that it is heavily keyboard driven, as you mentioned, and that means that you need to create the set of key bindings that will do whatever it is that you want it to do. So I'm wondering how you've approached the overall design and selection of key bindings so that they make semantic sense, but also so that you don't run out of key bindings in the event that there's some new capability that you want to add, because it is a limited space even when you do incorporate modifier keys. Yeah. Totally. I mean, we're running up against that now. There are few keys left, in some sense; at least, few keys that people want to use.
[00:30:42] Unknown:
So the main thing is I have to use mnemonics to make sure that things stick in my own memory. I actually have a pretty bad memory myself, and so if they don't fit my mental model, I can't remember them from one month to the next. So I've made sure that they at least make sense to me. And because I've been using terminal stuff for so long, I'm kind of tuned in to the existing text culture that is around. So a lot of the key bindings are borrowed from, like, Vim, of course, like d for delete and a for add and stuff like that. Actually, when I showed it to a guy online, he chastised me because Ctrl-Z didn't suspend VisiData. And I was like, oh, you are totally right, that needs to be fixed right away, and I did fix it right away, because you want to make sure that the things that people who are using a text client are used to will still work, and so Ctrl-C will just work.
And the other thing I think that's really important is that there are, like, layers of mnemonics. So we have a couple of modifier keys, and I wanted to keep those very simple. Like, I think that Vim is great, but it feels daunting when you see all the different possible combinations, and so I hope that when you see VisiData has exactly two prefix keys, it's like, okay, maybe we can wrap our heads around this, and then those are just modifiers on existing other commands, right? So it's kind of like layers. Another piece of this is that the column types, for instance, are all on the symbol keys. In fact, all the types are on a single row on my keyboard anyway, right? And so the date is an at sign, and converting to an int is a number sign; those are all adjacent to each other on the top row of the keyboard. And other column things are also simple. And so you don't get confused thinking, well, is an int i or whatever symbol? It's one of those ones up there; which one is it? Oh, I think it might be the one that looks like that. That helps me, anyway. And the fact that going to a new sheet is always on the shift key: you know, for me, it's like shift and sheet, I don't wanna say rhyme, but something like that, right? So Shift+F goes to the frequency sheet, stuff like that. And then, finally, there is symmetry. I think that's really important. So I reserved all of the pairs, like open paren, close paren, open brace, close brace; those were all reserved from the get-go for things that have both a front and a back. So sorting is on the square brackets: one way goes ascending and the other way goes descending. And scrolling to the next item is the greater-than sign, versus the previous item on the less-than sign. And I feel like that symmetry on those things is very useful, but then also, more largely, symmetry between commands. So, you know, the g prefix goes bigger and the z prefix is more precise. And so when you're saying, I wanna delete all the rows I've got selected, that's gd. I wanna unselect all rows, that's gu. And to me, it feels like what makes sense. Like, you maybe don't know it before you discover it, but once you discover it, then it's sticky. I feel like that's a super important thing. So, like, the VisiData interface isn't made for the first-time user; it's made for, like, the third-time user. Yeah. There's definitely a lot of peculiarities
[00:33:44] Unknown:
and sort of culture and history built up around different key bindings. And as you mentioned, Vim has its own set, Emacs has a different set of key bindings that people will be familiar with, and then there are any number of command line tools that have created this general pattern of how you would craft these key bindings. So it's definitely interesting to hear some of the history of how you have approached it because of your particular tool set choices. And like you said, anybody who's been living in the terminal long enough will find it fairly natural, and I appreciate the care that you've put into considering how you add new key bindings so that it doesn't just end up cluttered, and so that you can have some sort of mnemonic muscle memory for being able to recreate a certain workflow once you pick it up. Because with any tool, there will be periods where you put it down for a while and don't come back to it. And then when you do come back to it, you want to be able to just get right back to being productive without having to go back to try to remember what all the key commands were and look at the reference manual.
[00:34:51] Unknown:
Yeah. Totally. The other thing that we did, about a year ago or so, was institute long names. So there's the key bindings, which are as they are, and originally that was all we had: it was just, you know, Shift+F means the frequency table, and that's the only way you could access it. But now there's actually a long name for that (I think it's something like open-frequency-table, but don't quote me on that) and then you bind Shift+F to that, and so people can rearrange their keyboard if they really want to. But it also gives us the ability to add commands that aren't bound to any key. And so, you know, for instance, if you make your own command in your .visidatarc or whatever, you can make that and bind it yourself. Or if we provide a command that's rarely used: for instance, I think random rows used to be on Shift+R, but then we made Shift+R be redo in the undo/redo pair. I felt like that's a kind of symmetry of Shift+U and Shift+R, as opposed to, you know, Vim's lowercase u and Ctrl-R, which doesn't make any symmetric sense to me anyway. But anyway, I actually moved random off of Shift+R, and now there's no real good place for the random selection of rows to go. But I feel like selecting random rows is an infrequent enough operation that I put it on a long name, and then you can just press the space bar and enter a long name, and when that long name is random-rows, it goes ahead and does it. So I feel like that's another tactic: to start moving things off of the default key bindings
[00:36:07] Unknown:
into more of, like, a huge list of possible commands you could use. Yeah. That's another thing that's got a long tradition, both in Vim and Emacs and other tools: being able to have a way of opening a prompt where you can then start typing, as you said, a long-named command, or being able to start typing it and then maybe tab through to cycle through what the commands are, so that you don't have to remember all of those off the top of your head as well. Mhmm. Totally. And we do have tab completion, and so
[00:36:34] Unknown:
it works anywhere
[00:36:36] Unknown:
that we have an open prompt, including the long name thing. And then in terms of the types of analysis that VisiData can do out of the box, I know that you mentioned frequency analysis or histograms, but I'm curious what some of the other capabilities are that come natively in VisiData, and any of the interesting plugins that you or others have contributed for being able to expand the capabilities and utility of VisiData?
[00:37:00] Unknown:
Yeah. So out of the box, it can do all kinds of interesting stuff, you know: searching and filtering, bulk editing and cleaning, spot checking, finding outliers. I also use it more often than I would think for file format conversion. The ability to load any format and then save it to JSON or comma-separated values or Markdown is super handy, and gets me from here to there a lot faster than I could sometimes otherwise. Even scraping a web page for its tables is basically built in. And Jeremy Singer-Vine, for instance, has written several plugins already for the current version.
He wrote one that does row de-duplication, and a loader for the FEC (the Federal Election Commission) datasets, and you just download those and import them in your .visidatarc, and they're ready for action right away. And when I was looking at the documentation, it seems that one of the libraries that you can load into it is pandas.
[00:37:58] Unknown:
And I'm wondering if that means that you can expose all of the pandas capabilities as you're exploring these datasets, because I know that that's often the tool people will reach for the first time they're digging into a dataset, just to see the shape of it. And so I'm curious how that works into the overall use case of VisiData as the exploratory tool, and then where the boundaries are of when you might want to jump to pandas, or if you can just incorporate that whole flow together.
[00:38:25] Unknown:
Yeah. That's interesting. You know, I made a very simple adapter for pandas. It really was maybe 20 lines of code at first, just because pandas supports a lot of different loaders too, and being able to use those and browse those is super handy. But what's interesting is that pandas and VisiData don't actually play that well together. In order to do some of the things like sorting, for instance: VisiData grabs each value and sorts based on that, but you want to use pandas' built-in sort function in order to do it more efficiently, and there's just no good way to do that automatically. You have to write all the commands in a way that's compatible with pandas for pandas sheets. And that's totally doable, but it's a fair amount of work, and I haven't done it. Somebody did make some modifications to make pandas more responsive in certain cases, to make things work better, and that's totally doable; like I said, it takes a fair amount of work and it doesn't happen naturally, and so you can't just use pandas things like you'd think. You can use some of the functions that pandas has on pandas sheets, and even on non-pandas sheets if they're standalone functions, but to use a pandas DataFrame just naturally like you would, you're probably better off using it in Jupyter by itself.
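A rough illustration of why the two don't compose automatically, using a hypothetical wrapper rather than VisiData's actual pandas adapter: if every cell is fetched individually through a column lens, a generic sort pulls values one at a time instead of using the DataFrame's vectorized sort, so an adapter has to special-case operations like this to stay fast.

```python
# Hypothetical sketch of wrapping a pandas DataFrame as rows plus column
# lenses, and why sorting needs to be special-cased for pandas sheets.
import pandas as pd

df = pd.DataFrame({"city": ["Boston", "Austin", "Denver"],
                   "pop": [690000, 960000, 715000]})

# Generic approach: rows are index labels, and each cell is fetched on demand.
rows = list(df.index)
get_pop = lambda idx: df.at[idx, "pop"]

generic_order = sorted(rows, key=get_pop)         # one .at lookup per row
vectorized_order = df.sort_values("pop").index    # pandas' own bulk sort

print(list(generic_order))      # same ordering...
print(list(vectorized_order))   # ...but computed very differently at scale
```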
[00:39:45] Unknown:
And then in terms of the overall growth and adoption of VisiData, it seems that there's a decent community that's grown up around it. And I'm wondering how you approach the project governance and sustainability as a solo developer, and how you are looking to grow the community and incorporate more people into the future of VisiData?
[00:40:03] Unknown:
Yeah. Well, you know, as you're saying, I'm a solo developer. I've got a little bit of help now; like I said, Anja has been instrumental in making me at least not so alone with some of the decisions and discussions and stuff like that. There's also a #visidata channel on Freenode where several of us hang out and people talk about things. People ask questions: how is it possible to do this, etcetera, maybe you could add this kind of feature. I personally prefer a chat system like that, because I find myself doing a lot better with chat (I mean, I've been on chat for over 20 years now) than I do with email. You know, email is a lot heavier; it requires more intention and attention. In chat, I can just kind of toss off an answer and it's done.
So I'm, of course, the decider on those things, but I have to be honest, it kind of feels like I'm discovering VisiData more than creating it at this point. It's like a chunk of marble to a sculptor: it kind of tells me what it wants to become, and some things I didn't even consider, and then I look at it and I'm like, oh, why didn't I think of that already? You know? Like, the row type down in the lower left corner, where it shows you whether it's lines or columns or whatever the current row type is. For the longest time, almost until 1.0, that just said rows, and I didn't know why I even put the text there if all it was gonna do is say the same thing every time. And yet I felt strongly that it should be there. And then once I realized that it should just be the row type, I was like, oh, I don't feel like that was my creation; that is just how it has to be, if that makes sense. And then you mentioned project sustainability.
And the thing is that my energy is my most precious resource, my energy and ability to code. I have a day job, and so I come home at night and sometimes I just want to mess around with VisiData, and it's really hard to summon the energy when I don't have a very concrete use case or somebody that really cares about something. And so I have the most energy when somebody is around and is enthusiastic, and they have, like, a sample dataset and they're like, I just want to do this thing to it. It's like, oh, right, and how could we do that? It's a kind of little puzzle, you know, that you kind of put together: can I use existing commands to do this? Do they have to write a one-liner that they put in their .visidatarc, or does this require a different core piece of functionality, so that not just that case but ten other cases can be solved too? And those are the things I enjoy the most; I actually do really enjoy solving those puzzles. But then sometimes we'll have people who ask for a generic feature, and it doesn't feel very immediate, it's more abstract. Or I have a concept for something that I've been wanting for a while, and because there's nobody who really wants it right now, I'm less motivated, and I just kind of decide to do something else, you know. So what are some of the most interesting or unexpected or innovative ways that you've seen VisiData used? I feel like we have a couple of super fans, people who will use it for well more than they really should, and one of them is a guy named Christian Warden, who does a lot of Salesforce consulting and stuff. And so he's got just buckets of data and wants to move through it quickly, and he built a duplicate row finder for some dataset with Python expressions and a VisiData script, and VisiData is not made for inter-row computation. Like, you can have columns that compute within a row, no problem, but if you want to look at the previous row, it's not really meant for that. I mean, I would like to add that at some point, but I haven't figured out the really great way to do it yet. But he figured out how to pull it off, and it was an amazing beast, and it worked. I was amazed. It actually exposed a bug with computation, so that it was taking forever to run. But once he fixed that, it actually was remarkable. It's like, wow, you really have turned this into yet another Turing-complete programming environment.
So that has been kind of weird. And then also, I'm not sure if you or your listeners have seen the lightning talk that I gave a couple years ago, but I had some data that had lat/long coordinates, and I was just curious if I could plot those in my little canvas, and it turns out that plotting latitude and longitude as x-y coordinates works really well for maps. And if you've got even, like, a million points, there you go: you can see the distribution of things. And it was surprising to me that it worked as well as it did, to be honest. Like, I don't think VisiData is built for, you know, geographic information at all, and yet you can kind of pull it off. So that's been both surprising and unexpected and, yeah, kind of pleasing too.
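For the curious, the Braille plotting trick mentioned earlier comes down to packing a 2x4 grid of points into each terminal cell: the Unicode Braille block starts at U+2800 and each of the eight dots is one bit. Below is a hedged, self-contained sketch of that idea; it is not VisiData's plotting code, and the sample coordinates are arbitrary.

```python
# Sketch of plotting points with Unicode Braille characters (2 x 4 dots per cell).
# Bit masks for the eight Braille dots, indexed by (x within cell, y within cell).
DOT_BITS = [[0x01, 0x02, 0x04, 0x40],   # left column of dots, rows 0-3
            [0x08, 0x10, 0x20, 0x80]]   # right column of dots, rows 0-3

def braille_plot(points, width=40, height=20):
    """Scale (x, y) points into a width x height grid of Braille characters."""
    xs, ys = zip(*points)
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    cells = [[0] * width for _ in range(height)]
    for x, y in points:
        # normalize into "pixel" coordinates: 2 pixels per cell across, 4 down
        px = int((x - xmin) / ((xmax - xmin) or 1) * (width * 2 - 1))
        py = int((y - ymin) / ((ymax - ymin) or 1) * (height * 4 - 1))
        cells[py // 4][px // 2] |= DOT_BITS[px % 2][py % 4]
    return "\n".join("".join(chr(0x2800 + c) for c in row) for row in cells)

# longitude as x, latitude as y (note: y is not flipped here, so the output is
# mirrored vertically compared to a conventional map)
print(braille_plot([(-71.06, 42.36), (-97.74, 30.27), (-104.99, 39.74)]))
```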
[00:44:32] Unknown:
Yeah. I did see that lightning talk, and that was one of the things that I was kind of blown away by, as far as the visualization aspect of VisiData given that it's a terminal environment. And so it's interesting to hear how you just mapped the lat/long to x-y coordinates, and I'm sure that you just figured out the maximum bounds of the coordinates that you had, to figure out what the overall plane coordinates needed to be in relation to each other. So that's pretty funny. Yeah. Thank you.
[00:45:00] Unknown:
One more thing, if you don't mind, that has been kind of surprising to me is how meta the thing goes. Editing VisiData's internals using VisiData's own commands is something that's been kind of surprising for me. Just yesterday, a user asked how they could get a type column on the describe sheet, and I thought about it and I was like, you know what you can do: you can go to the columns sheet and copy it from there, and then you can paste it onto the columns sheet of the describe sheet, and it'll just work. And you couldn't possibly do that with Excel, right? And similarly, if you've got a thousand columns in whatever thing and you want to search for, or select, all the ones that begin with a certain thing and remove all those from the set, you can do that in VisiData, no problem. It just works like anything else, and I have no idea how you would do that in almost any other tool.
And so I feel like the metadata editing aspects of something has been really been surprising for me even though I've been I put it in there, but, the fact that it works as well as it does has been really kind of, interesting.
[00:46:09] Unknown:
And looking forward, what are some of the features or improvements that you have planned for the future of VisiData?
[00:46:16] Unknown:
So right now we're working on the 2.0 release, which is still a couple of months out, and the goal is to stabilize it. The current version, 1.5.2, is actually incredibly stable and really tight. Well, there's at least one bug I've seen, but that's fine, it's an edge case. It's actually pretty stable. But the API is kind of all over the place. It's not as coherent as the user interface, so one of the things we want to do is make the API stable and produce some more coherent documentation about the internals. We're calling it the Book of VisiData.
And the point of that is so that we can let it rest and work on some other things, but then let other people go wild and share their own creations, plugins or commands or loaders, without destabilizing the core goodness that is there. I'm sure there will be a 2.1 or whatever after that, but I'm really hoping that with 2.0, development can slow down and I can move on to some other projects that I have in the queue. So one of the things we've been talking about that's gotten a little traction is something I've been calling Where in the Data is Carmen Sandiego, which is a throwback to an old game from the eighties and nineties which maybe you've heard of, called Where in the World is Carmen Sandiego? I used to play it. Yep. Okay, so you're familiar. So the idea is that I want that kind of game, but with data and datasets. It'll be for hardcore data nerds. It's kind of like an escape-the-room game or a choose-your-own-adventure kind of thing where you're solving whatever crime, but you get datasets to look at, and that's how you get the actual clues and solve the actual puzzles.
I'd like to work on that, and it's probably my next project in the queue. But I don't really want to do that until we've got VisiData 2.0 locked down and feel like it's in a stable place for everybody. Are there any other aspects of VisiData
[00:48:12] Unknown:
or data exploration or any of the other accompanying projects that we didn't discuss yet that you'd like to cover before we close out the show? Not specifically, although I did want to say that I feel like we're in an age of kind of a terminal renaissance, you know, that
[00:48:32] Unknown:
we went through a period, probably in the late nineties and early 2000s, where it was more graphics all the time, and that was the obvious way up and out. The terminal has been with us throughout, and I definitely have never left it, but I feel like within the past maybe ten years or so, with projects like Oh My Zsh and tmux/tmate and many others, the terminal has been getting a resurgence. Even now when people go to data science boot camps and such, they have to run the terminal and get involved there too, because you need to be able to do that kind of stuff in order to get anywhere, to dive deeply into data work. So I feel like this is part of that, because VisiData is saying, no, wait a second, you don't have to be in the web graphics world in order to be hyper-productive. In fact, not being in that world makes it
[00:49:26] Unknown:
a lot easier for you if you can just embrace the fact that you're gonna be at the terminal using a keyboard. Yeah, I definitely appreciate the fact that there is a lot more focus being paid to making things that work in the command line and being able to stitch them together. Because, as you said, the graphical interfaces, while they are appealing, and it's easier to sell something to somebody who isn't as technical if you're in that environment, do bring a lot of extra weight and requirements to their development and maintenance, as well as, in some cases, their use, because they are much more mouse driven, and that makes it harder to have a unified flow.
[00:50:04] Unknown:
Yeah. And, you know, to be honest, the versions of iOS and Windows are gonna keep marching forward, and I have no doubt that if I made an app for the current version of either of those things, in the next three or four years it wouldn't work with the next version. And I am actually pretty confident that if I don't touch VisiData for the next four years, you'll be able to use it in the next version of Python or whatever, no problem at all. And I find that to be really
[00:50:31] Unknown:
motivating to do a good job now, because I don't have to keep writing it. I just have to do it well once. For anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And with that, I'm going to move us into the picks. This week, I'm going to choose a newsletter that I found while doing some research for this conversation called Data Is Plural, maintained by Jeremy Singer-Vine, who you mentioned a few times. It's a weekly newsletter of different interesting datasets that looks to have some fairly curious discoveries. So if you're looking for something to experiment with in VisiData, you might have some interesting finds in there. With that, I'll pass it to you, Saul. Do you have any picks this week?
[00:51:16] Unknown:
I wanted to promote tmate, which I think I already have during this episode, so definitely give tmate a look if you're a terminal user and want a multi-user experience. There are a lot of other tools in that same bin, like Mosh, the mobile shell, which I wanted to give a shout-out to for less-than-perfect network connections. There are all kinds of good tools out there, but I'm not sure I can come up with any more off the top of my head. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing with VisiData.
[00:51:45] Unknown:
It's definitely something that I'm going to be experimenting with, because I spend a fair amount of my time in the terminal and have to do a lot of exploration of random datasets, whether it's just CSVs or piping things from different bash commands. So thank you for all of your work on that tool, and I hope you enjoy the rest of your day. Awesome. Thank you very much. I hope you have a great day too. Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Message
Interview with Saul Pwanson on Visidata
Genesis and Evolution of Visidata
Main Use Cases and Functionalities of Visidata
Command Log and Performance Strategies
Terminal Interface and Key Bindings
Analysis Capabilities and Plugins
Community and Project Sustainability
Future Plans and Features for Visidata
Closing Remarks and Picks