Summary
HDF5 is a file format that supports fast and space-efficient analysis of large datasets. PyTables is a project that wraps and expands on the capabilities of HDF5 to make it easy to integrate with the larger Python data ecosystem. Francesc Alted explains how the project got started, how it works, and how it can be used for creating shareable and archivable datasets.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. Linode has announced new plans, including a 1GB for $5 plan, high-memory plans starting at 16GB for $60/mo, and an upgrade in storage from 24GB to 30GB on the 2GB for $10 plan.
- Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
- To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
- Your host as usual is Tobias Macey and today I’m interviewing Francesc Alted about PyTables
Interview
- Introductions
- How did you get introduced to Python?
- To start with, what is HDF5 and what was the problem that motivated you to wrap Python around it to create PyTables?
- Who are the most significant contributors to PyTables, and how have you interacted with them?
- How is the project architected and what are some of the design decisions that you are most proud of?
- What are some of the typical use cases for PyTables and how does it tie into the broader Python data ecosystem?
- How common is it to use an HDF5 file as a data interchange format to be shared between researchers or between languages?
- Given the ability to create custom node types, does that inhibit the ability to interact with the stored data using other libraries?
- What are some of the capabilities of HDF5 and PyTables that can’t be reasonably replicated in other data storage systems?
- One of the more intriguing capabilities that I noticed while reading the documentation is the ability to perform undo and redo operations on the data. How might that be leveraged in a real-world use case?
- What are some of the most interesting or unexpected uses of PyTables that you are aware of?
Keep In Touch
- @FrancescAlted on Twitter
- FrancescAlted on GitHub
Picks
- Tobias
- Francesc
- Blosc, a high-speed compressor especially meant for binary data
- The Lego Batman Movie
Links
- PyTables
- PyTables – Optimization
- Presentations and Videos about PyTables
- Part of the story behind PyTables
- HDF5
- Pandas
- SIMD
- NumFOCUS
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I'd like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or trying out something that you find out about on the show. Linode recently announced an update to their pricing, so you can now get a 1 gigabyte server for $5 and high memory plans starting at 16 gigabyte servers for $60 a month, and they've upgraded the storage from 24 to 30 gigabytes on the $10 plan. You can visit our site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Francesc Alted about the PyTables project. So, Francesc, could you please introduce yourself? Hello, Tobias. Yes. Of course. I am a freelance consultant
[00:01:12] Unknown:
and developer. I am a physicist by training, but a computer scientist by passion. In fact, I have spent most of my life wrestling with computers, and I have enjoyed
[00:01:24] Unknown:
it so far. And how did you first get introduced to Python?
[00:01:27] Unknown:
Yeah. Well, I heard of Python because an old friend of mine told me about the language. We were in the nineties back then, and he said to me, well, you should check out this language because it's cool, it's new. I was a user of Perl at the time, and I started to read about Python, and I got caught by Python immediately. So by the end of the nineties I was learning Python very, very hard, and then it turned out that starting in 2001 I started spending most of my professional coding time in Python. That's more or less how I got introduced.
[00:02:10] Unknown:
And PyTables in particular, I know, is a high-level wrapper around the HDF5 file format. So I'm wondering if you can share a bit about what HDF5 is and what was the problem that you were trying to tackle that motivated you to actually start creating PyTables?
[00:02:27] Unknown:
Sure. So HDF5 is a library and also a format, a file format, that allows you to store large amounts of data. I was looking for a solution, starting at the beginning of 2000, for storing large amounts of data, multidimensional data, but also data like tables. And I discovered HDF5, which allows you to store multidimensional data in a hierarchical way. I was looking for something very, very close to that, so I started to look into it and to see if there was some kind of Python wrapper for it. But the existing ones were not very interesting for me, and I needed something quite a bit faster than what there was at that time. So I started PyTables as a way to wrap HDF5. Regarding the capabilities of HDF5, in my opinion HDF5 has a couple of features that make it unique among other formats. The first one is that, as I said before, it allows you to impose a hierarchical structure on your data, which is very handy in many cases. You can just put all your data in a file, send this file to a colleague, for example, and they will see the same structure right away. The second amazing feature, in my opinion, of HDF5 is the capability to store data in chunks, in what are called chunked datasets. And that allows for a lot of things, but especially two things.
The first one is that you can extend your existing datasets in your files very easily, and not only very easily, but also very fast. So you can get a lot of performance by using chunked datasets. And the second thing that chunked datasets allow is compressing the data on the fly. For example, if you have a dataset of 1 gigabyte, you can split it in chunks of, say, 1 megabyte, and then HDF5 will compress each chunk automatically for you. You only have to specify the codec that you want to use, and HDF5 will take care of everything. That also means that, when you are storing data and you use these filters, what are called filters, for compressing the data, the data can be compressed before it reaches the disk. And that means that you are writing less actual information to disk, because the information arrives in a compressed state. And then when you are reading, HDF5 will retrieve only the interesting part, the part of the dataset that you requested; HDF5 will detect which chunks of the dataset it has to decompress and will decompress the data automatically for you. As the data is compressed, HDF5 will need to read less information from disk. And that in the end means that you can get more throughput in the long run if you are compressing the data, because the CPU can decompress data faster than it can be read from disk. So in my opinion these two, the hierarchical way to store data and also the way to structure the data in chunks, for me are the two key features of HDF5. And, of course, PyTables takes advantage of that. Mhmm.
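For readers following along, a minimal sketch of what Francesc describes here, an enlargeable, chunked dataset with on-the-fly compression, looks roughly like this in PyTables (the file name, node names, and sizes below are hypothetical):

```python
import numpy as np
import tables

with tables.open_file("sensor_data.h5", mode="w") as h5:
    # zlib is the compressor that ships with HDF5; level 5 is a middle ground.
    filters = tables.Filters(complevel=5, complib="zlib")
    # An EArray is extendable along its 0-length dimension and is stored in
    # chunks, each of which HDF5 compresses independently.
    earr = h5.create_earray(h5.root, "readings",
                            atom=tables.Float64Atom(),
                            shape=(0, 1000),          # enlargeable rows
                            filters=filters)
    # Appending extends the dataset chunk by chunk.
    earr.append(np.random.rand(100, 1000))
    earr.append(np.random.rand(100, 1000))

with tables.open_file("sensor_data.h5") as h5:
    # Reading a slice decompresses only the chunks that overlap it.
    part = h5.root.readings[50:60, :10]
```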
[00:06:06] Unknown:
And when you're talking about large datasets, exactly how large? You mentioned gigabyte-sized data, but I'm wondering what are some of the upper bounds of the actual amounts that you've seen in terms of people using HDF5 and PyTables
[00:06:21] Unknown:
for? Yeah. Well, I think there is a huge variation, of course. I mean, there are people who just store data which is 1 megabyte or 2 megabytes, and there are people who are creating PyTables files which are up to terabytes in size. So, yeah, there is no actual upper limit on that. There are other kinds of limitations, like the amount of metadata that you can store per dataset in the hierarchy, but not an actual constraint on the amount of data you can store. I don't remember exactly; maybe the largest file that I've seen in HDF5 is several terabytes, but it can be larger than that. Mhmm.
[00:07:06] Unknown:
And does the read and write performance across those large datasets remain fairly linear as you increase the amount of data that's being addressed?
[00:07:16] Unknown:
Yeah. That's a good question. I mean, the range of variation of the data that you can store is huge. So depending on how large the data that you want to store is, you may want to use one chunk size or another, because HDF5 keeps an account of the position of every chunk on disk, and that is stored in a tree in memory. And, of course, if you make your chunk size very small, this tree grows a lot, and that means that the time to access the data will slow down considerably. So you have to strike a balance between the chunk size and the total amount of data that you are storing. One of the features of PyTables that I am most proud of is precisely this capability to suggest the chunk size: PyTables chooses it for you by default, so that in case you are storing, for example, a very large dataset, like a 1 terabyte dataset, then PyTables will choose a chunk size which is larger.
For example, 10 megabytes; I don't remember the details, but let's say that it is that. Whereas, if you're storing a dataset which is a few hundred megabytes, the chunk size that is chosen automatically by PyTables is probably much less than that, probably 256 kilobytes. So this capability to automatically adjust the chunk size is something that PyTables implemented first in the HDF5 community, and since then other people, other libraries, have copied the idea. But, yeah, this is the main way to allow HDF5 to deal with small and large datasets, by computing this chunk size properly. And for this, I did a lot of experimentation.
Of course, I spent a long time checking the speed of access for writing and reading. And, yeah, after all these experiments, I came up with an algorithm to predict, more or less in a trustworthy way, which chunk size is optimal for each case.
[00:09:54] Unknown:
And for the input to that algorithm, do you request that the person using it estimate the amount of data that they're going to be working with, or is it possible to rebalance the chunk sizes after the data has already been written to disk?
[00:10:07] Unknown:
Yeah. I mean, at the beginning PyTables has no way to know how large the data is going to be, because the datasets are enlargeable. For enlargeable datasets, PyTables has this flag which asks you the number of rows, for example in tables, that you plan to store. This is a way for PyTables to know what the final size of the dataset will be. For datasets that are fixed in size, there is no need for doing that; PyTables will just take the final size, because it is known at creation time, and it will adjust for that size. So, yeah, PyTables is asking the user how large the final dataset is going to be.
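As a rough illustration of the size hint Francesc mentions, the expectedrows argument lets PyTables pick a sensible chunk size for a table (the table description and row count below are hypothetical):

```python
import tables

class Particle(tables.IsDescription):
    name = tables.StringCol(16)
    energy = tables.Float64Col()

with tables.open_file("big_table.h5", mode="w") as h5:
    # Telling PyTables we expect roughly ten million rows lets it choose a
    # larger chunk size than it would for a small table.
    tbl = h5.create_table(h5.root, "particles", Particle,
                          expectedrows=10_000_000)
    print(tbl.chunkshape)   # the chunk size PyTables computed for us
```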
[00:10:57] Unknown:
And one more question about the technical aspects of HDF5. I'm wondering if it's primarily a row-oriented or a column-oriented file format, because I know that there are different trade-offs and capabilities based on the dimensionality of how the data is addressed, particularly in terms of being able to get space and computation savings with column-oriented data, since all of the data types in a given column are known and so you can get some optimizations as far as addressing all of those in the same manner. Yeah. That's a very good question. And this is a question,
[00:11:33] Unknown:
a subject that I have been putting a lot of time into lately. HDF5 is basically agnostic about the order in which you are storing the data. HDF5 is about storing multidimensional datasets, and what you can do is specify in the chunk the shape, the dimensions of the chunk. So, for example, if you are storing a big mesh of 1,000 by 1,000 by 1,000, you can tell HDF5, well, the shape of the chunks will be, say, 32 by 32 by 32. So it's the user who specifies how they want to access the data. This distinction between column-wise and row-wise is not directly applicable to HDF5, except in one situation, which is when you are storing heterogeneous datasets in HDF5, or what are called compound types in HDF5. Compound types are basically made of simple types, types like integers, floats, strings, things like that, and you create a compound type by putting together several of these simple types. For example, if you are creating a table with three fields, integer, float, string, you will have a heterogeneous data type which is made of an integer, a float, and a string. And, of course, the way that HDF5 stores that is row-wise, because the data type is the minimal amount of data that HDF5 is going to put together. In PyTables, this specific kind of data type, the compound data type, is very, very important, and it is called PyTables for a reason, because we are using heterogeneous, or compound, data types all the time. And by default, PyTables is going to create tables that are stored row-wise. So for a long time I was wondering: instead of storing tables in terms of compound types, what about creating a single dataset for every column and then storing every column in a group, for example, which is the equivalent of a directory in the HDF5 hierarchy? In this way, we would have a table that is stored column-wise instead of row-wise. In many situations, column-wise access is much better than row-wise. It depends on the use case, but column-wise mainly has a couple of advantages over row-wise. The first advantage is that it is possible to add columns very cheaply, because it's just a matter of adding a new dataset to the group and then you have a table with a new column, whereas if you have the data stored row-wise, adding a new column to the table is going to require a complete copy.
So this is a very, very expensive operation, copying the data. The second advantage of going column-wise is that, as you are compressing the data, if you organize your data column-wise, you are saying, well, this column is going to be in one single dataset, this other column is going to go to another dataset, and then the chunks are made of elements which represent the same characteristic. And these values typically have much less variation than the values in the same row. So storing the data column-wise has the added benefit that the entropy of the values per column is much less, and then the compressors can go faster and compress better. And that also means that you can get much better performance. So having said that, I had a plan to implement column-wise tables in PyTables, and I tried to get some funding for that, but in the end I didn't succeed in getting it, and there is no support for column-wise tables in PyTables. But this doesn't mean that it is not a good idea. I think it's a good idea. The problem is a problem of open source.
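To make the row-wise versus column-wise distinction concrete, here is a small hedged sketch: a compound-type Table (row-wise, what PyTables does by default) next to the column-per-array layout Francesc describes as an idea rather than a built-in feature (the names and sizes are hypothetical):

```python
import tables

class Reading(tables.IsDescription):
    sensor_id = tables.Int32Col()
    value = tables.Float64Col()
    label = tables.StringCol(8)

with tables.open_file("layouts.h5", mode="w") as h5:
    # Row-wise: a compound type stores whole rows together.
    # This is what a PyTables Table does by default.
    h5.create_table(h5.root, "rowwise", Reading)

    # "Column-wise": one homogeneous chunked array per column, kept under a
    # common group -- adding a column later is just adding another array.
    g = h5.create_group(h5.root, "colwise")
    h5.create_carray(g, "sensor_id", atom=tables.Int32Atom(), shape=(1000,))
    h5.create_carray(g, "value", atom=tables.Float64Atom(), shape=(1000,))
```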
[00:16:09] Unknown:
And on the subject of open source, I know that it's always a challenge to attract enough attention to a given project to be able to increase the bus factor of the maintainers. So I'm wondering what the level of contributions you've seen for PyTables looks like, and what your interaction patterns look like for being able to develop the project in a cohesive manner.
[00:16:32] Unknown:
Yeah. I mean, this is the main question when you are developing open source: how you can attract developers, how you can create a business model around open source. Right? For PyTables, things were, well, at the beginning I put in a lot of time. It started as a solo project, but then I joined forces with other guys, and all of the funding came basically from us. But soon the HDF Group, which is the creator of the HDF file format, gave us some contracts to implement some features that their clients needed, so this allowed us to keep working on it at the beginning. Then we spent a lot of time developing a new feature which is, I think, one of the most important features of PyTables, which is indexing. So this is the capability, when you have a big table and you index several columns and you do queries, if the index is used in the query, then the query time will be much less. It's exactly the same concept as in relational databases. And we tried to sell this as closed source, as a professional version, so to say, of PyTables. We spent a couple of years trying to sell this special version of PyTables, but in the end we didn't sell enough licenses to keep our company alive, so we had to basically stop the PyTables development.
So after we closed this company, I decided to open up all the things that we had developed for the professional version of PyTables, which is basically the indexing capability. And then I decided to drop support of PyTables as well, because, I mean, I started PyTables in 2002, and then in 2011 it was like 9 years working on that. Oh my god. And then I decided to change subjects, and I announced that I was going to drop the maintenance of PyTables. Fortunately, open source allows this kind of thing, people saying, hey, I am a big fan of PyTables, PyTables cannot die, I am going to take the hat, the maintainer hat. And it was then that Anthony Scopatz jumped onto the scene. He took PyTables and did a lot of work on the documentation; he did a new version of the documentation.
Very, very tough work. And, yeah, he pushed PyTables a lot; I am very grateful to him. And also Antonio Valentino was doing very nice work; he started contributing things to PyTables very early, like in 2007. Other developers joined the crew as well: Ivan Vilata joined me in creating the company, and he did outstanding work on PyTables. And lately we have new people joining the crew, like Andrea Bedini, who is an Italian guy but is working in Australia, and he has made very, very nice contributions lately to PyTables.
I am also more or less back; I try to get on top of that. And Andrea, for example, organized a nice hackfest this year in Australia. We got developers from all around the world joining in Australia to create a new version of PyTables. Unfortunately, we didn't have time in just one week to finish it, but the code is there; it's advanced. Perhaps we can find the opportunity to finish it. And finally, we also had the opportunity to meet Tom Kooij. I think he's from the Netherlands, and he's also a new maintainer of PyTables. So, yeah, I am very happy that open source is a big community, and by opening your code, you make it possible for your ideas to spread much further. I don't know if that answers the question.
[00:21:14] Unknown:
That did a good job of answering the question, and it gives a lot of interesting and relevant backstory to the project as well, which is one of the most rewarding and interesting aspects of doing this show: being able to understand the story behind the project, even beyond just the technical aspects of it. And speaking of the technical aspects, I'm wondering what the underlying architecture of the project looks like and some of the design choices that you're most particularly proud of or, alternately, frustrated by?
[00:21:45] Unknown:
Yeah. That's a good question. There are many things in the design that I am proud of. Maybe one of the most basic ones, but I think it's the one that I have loved the most since the beginning, is this natural naming capability. Natural naming is a feature that allows you to access datasets in the hierarchy by just putting a dot on an object. So, for example, when you open a PyTables file, you get an object, and inside this object there is a special attribute which is called root. And root is the object that contains the hierarchy of the HDF5 file. So from root you can access, for example, a dataset which is called, say, detector, and you can say f.root.detector, and then you are accessing the dataset directly. You don't need to say something like f.get_node('/detector'). This capability of accessing the datasets just by accessing attributes on objects means that, for example, if you're using the IPython shell, you can take this object, f.root, put another dot after it, press tab twice, and you will get a list of the datasets that hang from root. Or, if you are going deeper into the hierarchy, you can specify f.root.group, which is a directory, then a dot, then hit tab twice, and you will get the list of children that hang from this group. So this allows you to navigate very easily in the hierarchy of your file. Probably this is one of the features that I am most proud of. It's very simple, but not every library has this capability. Also, of course, another capability that sets PyTables apart from other solutions is the indexing capability that I mentioned before: this capability of taking a table, creating an index on a column, and then being able to issue a query on this table.
And if the column that is indexed is participating in the query, then the time to get the result is going to be much, much shorter. When you have tables that are terabytes in size, this capability is critical. And in fact, in my opinion, this single capability is the reason why the pandas developers have chosen PyTables as the HDF5 backend for pandas. Probably you or the audience know what pandas is, but pandas is a killer application that many people are using for doing data analysis in the Python ecosystem.
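A small, hedged sketch of both features Francesc just described, natural naming and an indexed query, might look like this (the /detector/readout table and its columns are hypothetical):

```python
import tables

with tables.open_file("experiment.h5", mode="a") as f:
    # Natural naming: nodes hang off f.root as attributes, so tab
    # completion in IPython lists the children of each group.
    tbl = f.root.detector.readout    # same node as f.get_node("/detector/readout")

    # Index one column; queries whose condition uses it run much faster.
    if not tbl.cols.energy.is_indexed:
        tbl.cols.energy.create_index()

    # The condition is evaluated out of core, using the index when possible.
    hot = [row["name"] for row in tbl.where("energy > 100.0")]
```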
[00:24:59] Unknown:
Yeah. I was gonna say, I'm actually going to be interviewing one of the maintainers of that project for next week's episode. So stay tuned.
[00:25:08] Unknown:
Okay. Is it Jeff or Wes McKinney? Jeff. Jeff. Okay. Excellent. Yeah. Well, there are more things about PyTables, but I am not going to bore the audience any longer; I think these two are the most important. Regarding the things that I am not so proud of: this natural naming feature also brought some complexities in how the hierarchy is stored, or managed, internally in PyTables. It is managed by trying to emulate the hierarchy in the form of a hierarchy of objects, and sometimes the relationship between objects and how you access objects in the hierarchy is not trivial, because it's very easy to create loops between these objects.
And sometimes these loops can create memory leaks, and debugging these memory leaks was not fun at all. I spent tons of time trying to get these leaks out of PyTables. Right now, more or less, the leaks don't exist anymore, but of course the code is a little bit too messy in many places. Also, I am not very proud of the way iterators are implemented. I am a big fan of iterators, and most of the datasets in the PyTables API can be iterated. But the way I implemented the iterators was in a non-stateful manner.
That means that, depending on how you iterate the table, it can give you one result or another. So the environment in which you create or invoke the iterators is going to affect the result. This is a tricky thing that we have tried to solve, but it's not easy to fix. We would appreciate it if someone jumped in and tried to fix this iterator issue. So, yeah, these are the two things that I don't like very much.
[00:27:18] Unknown:
And what are some of the typical use cases for PyTables? And also, I'm curious how it ties into the broader Python data ecosystem. I know you mentioned that it's used as the backend for pandas to access HDF5 files, but I'm curious how it also relates to some of the other aspects of the larger ecosystem of Python for data analysis and data science.
[00:27:41] Unknown:
Probably PyTables right now is very well known because it is the HDF5 backend for pandas. Interestingly enough, pandas didn't exist when I created PyTables; PyTables was created, like, 10 years before pandas. And, of course, my intention for PyTables was to serve as a way to store mainly scientific data, but also data coming from sensors, for example. But pandas is used a lot in the financial world. Many people in the financial world are using PyTables because it is one of the backends of pandas, so as a consequence the financial world is using PyTables a lot via pandas. And in standalone mode there is also a lot of interest in using PyTables.
But, I mean, my main interest at the beginning was that people in science and other industries were going to use PyTables as well. At the beginning, especially at many universities, you know, when people are doing PhDs, they are fighting with technology and they are willing to adopt new technologies very early. So people at universities, especially PhD students, were among the first users of PyTables. For example, there was a guy at Berkeley, I think, who had a device for tracking flights, I don't know.
And all those flight tracks were stored in HDF5. And then people from industry: another guy asked me to tailor PyTables because he was using it for storing the results of a simulation program for hydrodynamics, I think, and he wanted us to tailor the behavior of PyTables specifically for that. So, yeah, at the beginning people from universities, then also from companies, especially in research. And in the later years, I would say that PyTables is mainly used in financial applications, which was absolutely not the first thing that I had in mind at the beginning. But, you know, life is like this, and you get surprises all the time.
[00:30:10] Unknown:
Yep. Absolutely. It's always amazing to see the uses that people will put something to that were beyond the initial ideas or even intended capabilities of the project at the start.
[00:30:23] Unknown:
Yeah. Yeah. Very true.
[00:30:25] Unknown:
And I know that you mentioned one of the original appeals of HDF5 as a file format was particularly for data interchange, for sharing between researchers, and also for data archival. Mhmm. I'm wondering if the particular approach to data formats that PyTables uses complicates the ability to use those resulting HDF5 files from other languages or other libraries?
[00:30:51] Unknown:
Well, I mean, this is not an easy subject. PyTables uses HDF5 to store everything except for one single thing, which is when you are trying to store data that has no simple translation to HDF5. So, for example, when you're trying to store a complex Python object, what PyTables does is serialize this object using pickle. So pickle creates a big string out of the object, and then PyTables stores this pickled string in an HDF5 attribute. If you do that and you try to open this HDF5 file from another language, for example Julia, of course Julia doesn't understand the pickle format. Right?
And Julia will not be able to get this attribute out of the HDF5 file. However, if you pay attention, you can read in the description of the format, which is appendix C or appendix B in the PyTables documentation, exactly where PyTables is going to use a nonstandard HDF5 format for storing data. So if you are conscious about that and you try to avoid storing exotic data that cannot be translated easily into HDF5, you shouldn't have any problem reading these datasets from other languages. Also, PyTables makes use of meta-information for describing things like indexes on tables, things like that. This is a specific feature of PyTables; of course HDF5 doesn't support indexing per se. So although you create a table with an index in PyTables, you will be able to read the table from, for example, the R language, but you will not be able to use the index that accelerates the query. So, yeah, PyTables adds some sugar, syntactic sugar, or more properly semantic sugar, on top of the data.
But if you want to be completely portable, you still can do that with PyTables; you just don't use the more advanced features that PyTables offers. So it's perfectly possible to create HDF5 files that can be read by other HDF5 applications in any language, no problem with that. But you have to know about the intricacies of how PyTables stores things.
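As a hedged illustration of that portability caveat, attributes holding plain scalars or strings map to native HDF5 attribute types, while an arbitrary Python object gets pickled and becomes opaque to other readers (the file and node names below are made up):

```python
import numpy as np
import tables

with tables.open_file("portable.h5", mode="w") as f:
    arr = f.create_array(f.root, "data", np.arange(10))

    # Portable: a plain string maps to a native HDF5 attribute type.
    arr.attrs.units = "meters"

    # Not portable: a dict has no HDF5 equivalent, so PyTables serializes
    # it with pickle -- Julia, R, or h5py users would only see an opaque
    # string of bytes here.
    arr.attrs.metadata = {"operator": "alice", "run": 3}
```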
[00:33:27] Unknown:
And one thing that you mentioned briefly in there that we haven't touched on yet is the querying capabilities of PyTables. And I know that in the documentation there's a bit of a tutorial on how you could map the concepts of SQL onto the querying capabilities of PyTables and HDF5. So I'm wondering if you can just briefly touch on that.
[00:33:47] Unknown:
Yeah. I mean, since the beginning we wanted PyTables to excel at storing tables and doing queries. And one of my coworkers, Ivan Vilata, had this idea of saying, well, if we want to make people aware that PyTables can do queries in a similar way to a relational database, why not do a tutorial on how to map the different ways of doing queries between relational databases, SQL basically, and PyTables. And, yeah, I totally agreed with him, and he put in some time to create this tutorial. I think it's very nice for people who are used to using SQL all the time: how these SQL queries can be translated into Python, or the PyTables-specific way. For example, the main thing that you can do in PyTables that is similar to an SQL query is the table dot, I think it's query, I think this is the method, I don't remember now exactly, where you can specify the condition and then the starting row, the end row, things like that. But the query is not the only thing; you can also do indexing, and you can also do sort bys and group bys by combining the queries with Python machinery. There is a way to combine that to reproduce the behavior of sort by or group by in SQL. And this is what is written in the tutorial, how to map or reproduce the SQL query language in PyTables and the Python machinery. I think it is a very nice tutorial.
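For reference, the query methods in current PyTables are Table.where() and Table.read_where(); a rough, hedged sketch of how a couple of SQL statements might map onto them follows (the column names and the file are hypothetical, and this paraphrases the tutorial rather than quoting it):

```python
import tables

with tables.open_file("sales.h5") as f:
    tbl = f.root.orders

    # SQL: SELECT amount FROM orders WHERE (price > 10) AND (qty < 5)
    amounts = [r["amount"] for r in tbl.where("(price > 10) & (qty < 5)")]

    # SQL: SELECT * FROM orders WHERE price > 10, restricted via start/stop
    # to searching only the first 100 000 table rows.
    subset = tbl.read_where("price > 10", start=0, stop=100_000)

    # ORDER BY can be emulated by sorting the fetched structured array
    # (or with Table.itersorted() when a completely sorted index exists).
    subset.sort(order="price")
```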
[00:35:40] Unknown:
And as you've mentioned before, the original work on PyTables was started, you know, several years ago. And in recent years in particular there's been an explosion of new data formats and serialization formats as the overall industry of data analytics has become more prevalent. So I'm wondering how the position of HDF5 and PyTables has changed in relation to those other capabilities, and what are some of the features that PyTables offers that aren't easily replicated?
[00:36:15] Unknown:
Yeah, this shows in the amount of new databases and new formats. I mean, this is not something that is only happening now; it happened all the time in the eighties, nineties, 2000s. There were a myriad of different formats. Probably nowadays this is even more visible because data is now the big player. Before, the code was probably more important than the data, because computers were not very capable of storing a lot of data, and the way to store the data was probably secondary.
Right now, in my opinion, the big change is that the data is critical. I mean, data storage: how do you store the data, which containers do you use for the data? And, of course, there are many, many different ways to approach data analysis, and we need a lot of different tools to analyze and especially to store the data. This is not going to change. HDF5, in my opinion, continues to be a very nice way to store data. It has its own use cases; you cannot go much farther than that. But, in my opinion, for storing highly multidimensional data and for storing big tabular data, it's still a very nice solution, especially because of the on-the-fly compression.
And in the case of PyTables, because it can index very large tables, terabytes in size, exceeding the limits of memory. I mean, the indexing happens on disk, right? It's like a relational database. But contrary to relational databases, tables can be stored much more efficiently in HDF5 because compression can be used, and for big data this is really, really critical. So I still see a big niche for HDF5, and PyTables in particular, for storing large amounts of data, especially in tabular form. But, of course, other solutions like Hadoop, or other ways to store data, are completely okay. It depends on your data access patterns and your needs.
[00:38:49] Unknown:
Yeah. I think that one of the unique aspects of HDF5 is still the ability to have everything self-contained in a single binary file that you can transport to different machines and different systems, versus a more network-oriented storage system such as HDFS or, you know, a Cassandra database or anything like that. That's exactly true.
[00:39:13] Unknown:
I would say more. I mean, the HDF Group, the people who are behind the HDF5 development, are very, very conscious about the long-term usability of these files. So besides this capability of being able to share your data easily, you can trust that you will be able to read HDF5 files 10 or 20 years from now. Right? Behind HDF5 is the HDF Group, which is sponsored by many public laboratories, especially NASA, for example. NASA has most of its data in HDF5 format, and NASA of course doesn't want to find that in 20 years they won't be able to read the information that has been stored for, I don't know, the Hubble Telescope, for example, or for the many satellites that are taking pictures of different aspects of Earth. Right? So this is also very, very important, and it is one of the main strong points of HDF5. I agree. Mhmm.
[00:40:23] Unknown:
One of the really interesting capabilities of PyTables that I came across while I was going through the documentation is the undo and redo options that you expose. I'm wondering what a real-world use case for that might look like. That was a requirement of a company,
[00:40:39] Unknown:
because people who do open source typically manage to survive from people who want to tailor the behavior of your libraries and pay you some sums for doing that. So a company asked us for a way to be able to create some datasets in some groups and then undo these operations later in time. I think their use case was that they were doing simulations, simulations that took a lot of time, and sometimes the simulations went the wrong way and they had to go back and start a new branch of the simulation from an earlier point. They wanted to basically undo all the modifications that they did in the original simulation branch. And, yeah, we implemented that, and we were very proud of how we implemented it, because it was basically a log of the actions. So all the actions that were created from certain checkpoints were logged into a specific table inside the HDF5 file, and then when the user wanted to go back in time and undo the actions, it was just a matter of recovering the actions in the log and then undoing the different actions. That was the use case. I don't know if this feature is used a lot in industry, but that was the main reason why we included it. It's probably not used a lot, but it's a nice feature, obviously.
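A minimal sketch of that checkpoint-and-roll-back workflow, using PyTables' undo/redo API (the group names and the scenario are invented for illustration):

```python
import tables

with tables.open_file("simulation.h5", mode="w") as f:
    f.enable_undo()                        # start logging structural actions

    f.create_group(f.root, "run_001")
    checkpoint = f.mark("before_branch")   # a named checkpoint

    f.create_group(f.root, "run_002_bad_branch")

    # The branch went wrong: roll the file back to the checkpoint.
    f.undo(checkpoint)                     # /run_002_bad_branch disappears
    f.redo()                               # ...and can be brought back
    f.disable_undo()
```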
[00:42:20] Unknown:
And you've touched on a few different interesting and unexpected uses of PyTables that you've encountered, but I'm wondering if there are any that stand out in particular to you that you'd like to share with the audience.
[00:42:32] Unknown:
Yeah. I mean, something that I didn't mention a lot, well, I did mention the compression capability, right? Somehow I got very interested in this compression stuff, and I started developing a new compressor, especially for PyTables, that was going to be very fast, although the compression ratios were not as good as, for example, zlib, which is the default in HDF5. So I started to create a codec specifically for PyTables, which I called Blosc. And Blosc is a compressor that can do multithreading.
I mean, you can use several cores at the same time for doing the compression and decompression, and it also implements specific versions of filters that existed in HDF5, like the shuffle filter. I did an implementation of that which uses the single instruction, multiple data (SIMD) instructions that are in modern Intel CPUs, like the SSE2 set of instructions or, more recently, the AVX2 set of instructions. And in combination, well, Blosc is not exactly a codec, it's a metacodec, so you can use different codecs inside Blosc. You can use, for example, LZ4, which has a lot of reputation for being very fast, and lately I also added Zstandard, which is also reaching very good decompression speeds and very nice compression ratios. So you can use all these codecs inside Blosc, plus the capability of using a very optimized shuffle.
With that, PyTables is able to compress and decompress much faster than with any other codec that you can see in other solutions, in other libraries. And lately you can use Blosc directly from HDF5, because the HDF Group implemented what they call dynamic filters, and with that capability you can use Blosc inside any HDF5 application. So I think this Blosc capability is also a very strong point of PyTables, and lately also a strong point of HDF5, in order to get the best performance possible.
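In PyTables, selecting Blosc (and an inner codec such as LZ4) is just a matter of the Filters object; a brief hedged sketch with a hypothetical file and array:

```python
import numpy as np
import tables

# shuffle=True requests the shuffle filter, which Blosc performs itself
# using its SIMD-optimized implementation when a blosc complib is chosen.
blosc_filters = tables.Filters(complevel=9, complib="blosc:lz4", shuffle=True)

with tables.open_file("compressed.h5", mode="w") as f:
    # The array is chunked and each chunk is compressed with Blosc+LZ4.
    f.create_carray(f.root, "big", obj=np.random.rand(1_000_000),
                    filters=blosc_filters)
```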
[00:44:55] Unknown:
And are there any other topics or questions that we didn't touch on yet that you think we should cover?
[00:45:02] Unknown:
I think we've touched on the main points. Yeah. I just want to say that PyTables is still a live project. It is mainly in maintenance mode; that means we don't introduce very shiny new or very interesting features, but we still maintain it. And, yeah, we want to keep PyTables safe and usable for the next 10 years.
[00:45:34] Unknown:
And one thing that we didn't mention yet is that it's currently maintained under the umbrella of the NumFOCUS organization. Oh, yeah. That's a very good point. Yeah. I forgot to mention it. Yes. So, yeah, two years ago,
[00:45:47] Unknown:
as I said before, Anthony Scopatz took charge of PyTables, and he pushed a lot for PyTables, and he got NumFOCUS to accept PyTables under its umbrella. And, yeah, that's very good news. The thing is that not many companies use this opportunity to sponsor some features or some maintenance, but I encourage anyone who is listening to this: yeah, they can put money into the NumFOCUS organization in order to fund more maintenance or more new capabilities in PyTables. It is a very nice umbrella organization, and hopefully we will receive some benefits in the long term. Thanks for the reminder.
[00:46:43] Unknown:
And for anybody who wants to follow what you're up to and keep in touch with you, I'll ask you to send me your preferred contact information and I'll include that in the show notes. And with that, I'll move us to the picks. My pick this week is a movie that I watched recently called The Accountant, with Ben Affleck and Anna Kendrick. It was a really well done movie because it was an interesting take on the typical action film, where the main character is high-functioning autistic. So he was, in some ways, similar to the idea of Rain Man as far as the mathematical aptitude, but it's taken in a very different direction, where the main character is an accountant for, I guess, the shadier aspects of civilization, but he also has a very strong moral compass and uses his access to these different organizations to help the Treasury Department prosecute a number of cases. So it was an interesting movie, had a lot of thoughtful pieces to it, and I thought it was very well done overall, so I definitely recommend watching that. And with that, I'll pass it to you. Do you have any picks for us this week, Francesc? Oh, yeah. I've just seen this new movie, The Lego Batman Movie. Wow. That was really a very nice movie. It's, I don't know, it's very intense.
[00:48:02] Unknown:
I mean, the action is nonstop, a lot of jokes. If you want to spend, like, a couple of hours laughing
[00:48:12] Unknown:
and come out feeling much better after seeing the movie, I highly recommend it. Mhmm. Yeah. That's one that's definitely on my list to watch pretty soon. I've heard good things about it already, and that just confirms even further that it's definitely something to find some time to watch soon. Pretty amazing. Alright. Well, I really appreciate you taking the time out of your day to join me and talk about PyTables and HDF5. It was quite interesting. I learned a lot in the process, and I'm sure that a number of the listeners will find it interesting as well. So thank you for your time. Thank you. Goodbye. Bye bye.
Introduction and Sponsor Message
Interview with Francesc Alted
Introduction to HDF5 and PyTables
Handling Large Datasets with HDF5
Data Storage Formats: Row vs Column Oriented
Open Source Contributions and Community
Technical Architecture and Design Choices
Use Cases and Integration with Python Ecosystem
Querying Capabilities in PyTables
Position of HDF5 in Modern Data Formats
Unique Features and Real-World Use Cases
Future of PyTables and Maintenance
Closing Remarks and Picks