Summary
Understanding what is happening in a software system can be difficult, especially when you have inconsistent log messages. Itamar Turner-Trauring created Eliot to make it possible for your project to tell you a story about how transactions flow through your program. In this week’s episode we go deep on proper logging practices, anti-patterns, and how to improve your ability to debug your software with log messages.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app. And now you can deliver your work to your users even faster with the newly upgraded 200 Gbit network in all of their datacenters.
- If you’re tired of cobbling together your deployment pipeline then it’s time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD you get complete visibility into the life-cycle of your software from one location. To download it now go to podcastinit.com/gocd. Professional support and enterprise plugins are available for added peace of mind.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing Itamar Turner-Trauring about Eliot, a library for managing complex logs across multiple processes.
Interview
- Introductions
- How did you get introduced to Python?
- What is Eliot and what problem were you trying to solve by creating it?
- How is Eliot implemented and how has the design evolved since you first started working on it?
- Why is it so important to have a standardized format for your application logs?
- What are some of the anti-patterns that you consider to be the most harmful when developers are setting up logging in their projects?
- What have been the most challenging aspects of building and maintaining Eliot?
- How does Eliot compare to some of the other third party logging libraries available such as structlog or logbook?
- What are some of the improvements or additional features that you have planned for the future of Eliot?
Keep In Touch
Picks
- Tobias
- Itamar
- Middlemarch by George Eliot
Links
- Eliot
- Zope
- PHP
- OpenTracing
- Zipkin
- Carl De Marcken
- Sentry
- Elasticsearch
- Logstash
- Kibana
- Eliot-Tree
- Daniel Lebrero
- Flocker
- Context Local Variables PEP (PEP 550)
- Flamegraph
- Brendan Gregg
- DAG
- Structlog
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports the show on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app. And now you can deliver your work to your users even faster with the newly upgraded 200 gigabit network in all of their data centers. If you're tired of cobbling together your deployment pipeline, then it's time to try out GoCD, the open source continuous delivery platform built by the people at ThoughtWorks who wrote the book about it. With GoCD, you get complete visibility into the life cycle of your software from one location. Professional support and enterprise plugins are available for added peace of mind. You can visit the site at podcastinit.com to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions, I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email me at hosts@podcastinit.com.
To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macey, and today I'm interviewing Itamar Turner-Trauring about Eliot, a library for managing complex logs across multiple processes. So, Itamar, could you please introduce yourself?
[00:01:34] Unknown:
Hi. My name is Itamar. I've been a Python programmer for quite a while, and I care a surprising amount about logging, as it turns out.
[00:01:44] Unknown:
Yeah. It's one of those things that is far too often overlooked and ends up being one of the more important pieces of your system, particularly when you need to debug it. Do you remember how you first got introduced to Python?
[00:01:58] Unknown:
Yeah. This was 1999, I think, and I was doing web development in PHP. I read some article about a web framework called Zope, Z-O-P-E, and it sounded really neat. And it was; it had a lot of features that no other system at the time had. So I learned Python by learning Zope. For a while, Zope was a really big thing in the Python community. It has, in a sense, been superseded, but I think a lot of people got into Python that way.
[00:02:35] Unknown:
Yeah. It has sort of fragmented and dissolved into the broader Python ecosystem. There are a lot of elements of it that are still used in various other projects.
[00:02:45] Unknown:
Yeah. I think Pyramid is probably the most zopey of the existing web frameworks.
[00:02:51] Unknown:
Yeah. So, as we mentioned at the start, you are the author and maintainer of the Eliot library. So I'm wondering if you can describe a bit about what Eliot is and the problem that you were trying to solve when you created it.
[00:03:05] Unknown:
So Eliot is a logging library, originally designed for distributed systems, but I've come to realize it's useful in pretty much any nontrivial application. And the issue I'm trying to solve is that there are basically two audiences for your log messages. There are human beings who are gonna be reading them, and there are automated systems that are going to be processing them in a variety of ways: maybe you extract some metrics out of them, maybe you're processing the data to show it to a human eventually. And for both these audiences, both computers and humans, logging is terrible.
The standard way we log things is bad for computers, and it's bad for humans. And so I wanted a logging system that would basically allow me to see the story: what actually happened and why. Most logging systems just give you a pile of facts. This thing happened. That thing happened. Something else happened. Look, there's an error. And really, what you want is not a pile of isolated facts. You want to know why something happened, what caused it. You want logging to be a story about what happened. And so that was my goal in creating Eliot.
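For listeners who want to see that story model concretely, here is a minimal sketch of the action-based API Eliot exposes, assuming a current Eliot release; the action_type names and the toy order function are invented for illustration:

```python
from eliot import start_action, to_file

# Send Eliot's JSON messages to a file, one JSON object per line.
to_file(open("example.log", "w"))

def place_order(order_id):
    # An action records a start message, wraps everything that happens
    # inside it as child actions, and records success or failure.
    with start_action(action_type="place_order", order_id=order_id):
        with start_action(action_type="check_inventory"):
            pass  # nested work becomes a child in the tree
        with start_action(action_type="charge_card"):
            pass  # an exception raised here would log the action as failed

place_order("o-123")
```

The resulting log is not a pile of isolated lines but a tree: place_order caused check_inventory and charge_card, and each either succeeded or failed.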
[00:04:23] Unknown:
And you mentioned that one of the original intentions was for using it in distributed systems. So how does that particular problem domain differ from the standard logging environment that most people think about, where they're just concerned about the domain of their own particular process and not necessarily how that spans machines or processes or network environments?
[00:04:47] Unknown:
Well, one part of it is that causality becomes that much more important, because there are just a lot more moving parts. But also, if you're within a single process, you can often approximate causality by just looking at log messages in order of time. You can say this happened after that, so probably this was caused by that, and often that'll be the case. Once you hit a multithreaded program, where you have different threads doing different things, tracing this causality becomes much harder just by looking at timestamps. And once you have a distributed system, across multiple machines and different processes, you can't even trust the timestamps anymore, because different machines can have different clocks. Especially when you're trying to debug one particular error: if two machines are off by a few milliseconds, which can happen, you can't tell what happened first, and so it can be really hard to figure out what caused what. So once you hit that point where you're on multiple machines, being able to trace causality in some way that doesn't rely on time becomes much more important.
I think that's probably where the original impetus for treating causality in a more structured way came from. As it turns out, tracing causality is great in single processes too.
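As a rough illustration of tracing causality across machines without relying on clocks, here is a sketch using Eliot's cross-process task API (serialize_task_id and Action.continue_task); the transport functions are hypothetical stand-ins for whatever RPC or queue you actually use:

```python
from eliot import start_action, Action

def send_to_worker(message):
    # Hypothetical transport; a real system would use an RPC call or queue.
    pass

def handle_request(payload):
    # Hypothetical worker-side logic.
    pass

def client_side():
    with start_action(action_type="http_request") as action:
        # serialize_task_id() returns an opaque identifier encoding the
        # task UUID and this action's position in the tree.
        task_id = action.serialize_task_id()
        send_to_worker({"task_id": task_id, "payload": b"..."})

def worker_side(message):
    # Messages logged inside continue_task become children of the
    # client's action, so the aggregated logs form one tree that spans
    # both processes, independent of either machine's clock.
    with Action.continue_task(task_id=message["task_id"]):
        handle_request(message["payload"])
```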
[00:06:09] Unknown:
Particularly with the preponderance and growth of distributed systems in a lot of the technology companies, particularly as they continue to scale, the idea of distributed tracing, being able to trace transactions across multiple different microservices or service-oriented architectures, has become something that's talked about fairly frequently. So how would you compare the capabilities and facilities that Eliot provides with something like the OpenTracing framework or Zipkin?
[00:06:41] Unknown:
Zipkin and similar systems are often much more focused on the communication between machines. You can use them internal to your process; I'm not sure people do. And they're coming from the perspective of people who are building really massive applications, so a lot of their focus is on performance. It's about timing, about which message is going where. Zipkin has a lot of stuff like sampling, where, I believe, it only shows you a percentage of messages, and things like that. Eliot is much more integrated into your processes.
The idea is that you actually trace the variables that you have inside your functions, so you can actually see the data that was going through the system. And it has a much more structured concept of causality. Rather than just saying this message went from here to here, the core abstraction in Eliot is an action. An action starts, it has child actions, and then it succeeds or fails. And so what Eliot gives you is this structured tree, starting from a root action: this action caused that action, which caused another action, and that action failed, then we went up a level to the next action, that one we tried again, and that time it succeeded. And then this other thing happened, and eventually the whole operation succeeded. You can actually look at that, both within your single process and across multiple processes if you have them: that tree of things that you tried and that either succeeded or failed, which is the logic of your program. Right? You're writing functions, and they return or they throw an exception. It integrates down to the level of that code, so it matches that structure.
Giving you that insight into the machine is a thing that systems like Zipkin don't really do. One of the inspirations for Eliot is a system called Frost, developed by Carl De Marcken, the chief scientist at a company I used to work for. He had this amazing demo where he had a distributed Scrabble solver, and he instrumented it with his logging system. The logging system traced the actual parameters passed to functions: for all the important functions, it would just log, I got called with these inputs, and it had this causality built in. I'm not sure if it had a concept of success or failure. He'd run the Scrabble solver and aggregate the logs, and he had a completely generic UI for looking at the logs. And using that UI, he figured out there was a performance problem: something like 40% of the CPU was being spent in one particular function when it was trying to solve Scrabble with two blank squares, which are basically wildcards; they can be any letter of the alphabet. And he did this profiling live.
If you think about a profiling system, a profiler doesn't tell you what arguments were causing this performance problem; it just tells you this function is slow. So a profiler would never have told you this. And a tracing system like Zipkin would never really have told you this either, because all it would show you is messages going back and forth between the Scrabble solvers in the cluster; it wouldn't get down to the level of the function. So seeing the power you can get from getting down to the level of the function, looking at the actual data that's getting passed through, was a big inspiration for why I made Eliot.
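A Frost-style view of function arguments can be approximated with Eliot's log_call decorator, available in newer Eliot releases; the toy scoring function below is invented for illustration:

```python
from eliot import log_call, to_file

to_file(open("solver.log", "w"))

@log_call
def score_word(word, blanks):
    # log_call records the arguments and the return value as an action,
    # so a slow call shows up in the logs with its actual inputs.
    return len(word) - blanks

score_word("quixotic", blanks=2)
```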
[00:10:22] Unknown:
So is Eliot largely implemented on top of the standard library logging framework, or is it something that was written as a standalone system that is meant to supplant the built-in logging capabilities of Python?
[00:10:33] Unknown:
It's basically meant to supplant it. The built-in Python logging is sort of standard: here's an info message, it has a string. You can sort of shove structured data in there, but it's not really what it's for. And it has no concept of context, of an action. I could have built on top of it, but it would have just been a performance bottleneck and wouldn't have added anything. So Eliot is designed to replace Python's standard logging. You can pipe messages from Python's standard logging into Eliot, and those messages will be in the right place within the action tree. So whatever is using the other logging systems, you can pipe into Eliot, and it'll know where you are in the tree of actions.
But Eliot wants to own logging. It's not a layer on top.
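That piping might look roughly like the following sketch, assuming the eliot.stdlib.EliotHandler bridge that ships with recent Eliot releases; the logger name and action type are invented:

```python
import logging

from eliot import start_action, to_file
from eliot.stdlib import EliotHandler

to_file(open("combined.log", "w"))
# Route every standard-library logging record into Eliot.
logging.getLogger().addHandler(EliotHandler())

with start_action(action_type="legacy_code_path"):
    # This stdlib message lands inside the surrounding Eliot action.
    logging.getLogger("myapp").warning("falling back to slow path")
```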
[00:11:33] Unknown:
So for somebody who has an existing application, it sounds like one way to begin introducing Eliot is to add it as sort of a wrapper and just consume the existing log messages, and maybe implement that as, like, a handler for those messages, so that you can embed those messages in the overall process flow of what Eliot is recording? Exactly. Yes. And how has the design and implementation of Eliot evolved since you first started working on it?
[00:12:00] Unknown:
So my goals for Eliot were, first, to have structured messages. At this point, that's a minimal criterion for a logging system; having just human-readable strings is not really acceptable. I wanted it to have causality, and I wanted testability. Basically, if you have a system and it crashes, like, every two days, having logs that actually tell you what you need to know is really important. So, not for all applications, but for applications where getting these logs right is really important, you want to be able to write unit tests for your logging. From the start, I wanted it to be easy to write unit tests for your logs. And basically, you can add logging assertions as annotations, with decorators, to your existing unit tests. So you don't have to write a whole new set of unit tests for logging. You just say: I've already written tests for my code, and I'm gonna add these extra assertions on the side about the logs it produces. All of these things were in from the start, but this is a new concept; there are not that many systems like it, if any. And so I had to figure out what the APIs would look like and what the idioms would be. A lot of that involved just using it in a real application and saying, oh, this is really repetitive, so we need a nicer idiom. So there was a lot of iteration on making the API simpler and more succinct, and I'm continuing to do that.
There have also been changes to Python over the years. Coroutines invalidate some of the assumptions in Eliot's implementation, namely that you can just have a thread-local context. Possibly by the time this podcast is out, I'll have an initial release that knows about coroutines. But, yeah, mostly it's been learning how this new idiom works and what the best way to structure it is, because it was a new thing, and it wasn't obvious how to implement it.
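As a rough sketch of what those test assertions can look like, here is one way to use Eliot's capture_logging decorator from eliot.testing; the traced function and its fields are invented, and the sketch inspects logger.messages directly rather than using Eliot's higher-level assertion helpers:

```python
from unittest import TestCase

from eliot import log_message
from eliot.testing import capture_logging

def do_work():
    # log_message is the newer spelling; older Eliot code used Message.log.
    log_message(message_type="work_done", items=3)

class WorkTests(TestCase):
    @capture_logging(None)  # None: no validator; still checks serializability
    def test_logs_work(self, logger):
        do_work()
        # logger is a MemoryLogger; logger.messages is a list of dicts,
        # one per logged message.
        work = [m for m in logger.messages if m.get("message_type") == "work_done"]
        self.assertEqual(work[0]["items"], 3)
```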
[00:14:13] Unknown:
So one of the key concerns that you've mentioned a couple of times is the idea of having structured logging and a somewhat standardized format for the application logs, to be able to be consumed both by humans and machines. So I'm wondering if you can elaborate a bit on why it's so important to have that standardized format for making those logs usable outside of the application itself.
[00:14:42] Unknown:
OK. So this isn't necessarily something that Eliot magically solves, because anytime you're dealing with real server-side systems these days, you're no longer dealing with a single process. Just for a simple web server, you might have a WSGI process like Gunicorn, and then NGINX, and then a database, maybe Celery, and then there are logs from your cloud provider if you're using a cloud, and your operating system logs. So you basically have all these logs, and some problems are going to span multiple processes and multiple logging systems, effectively, because you can't really assume everyone's gonna log the same way once the Linux kernel and your Python process are both things you might want to look at. To understand your system, you end up having to aggregate all these logs in some central place where they can be related to each other. On a single machine, this has gotten better: in the old days it was syslog, and with syslog, in the end, you just got a bunch of lines of text on your local file system. These days, systemd has a logging system, journald, that can actually store structured information. You can query it, you can search it, you can use structured queries; it's pretty nice. So at the level of a single machine, that's kind of better. But once you're hitting server-side applications and running more than one machine, you need to centralize everything. And if you're dealing with here's-a-line-of-text, then someone, probably the ops person running it, is gonna have to write some regular expressions to extract the timestamp, the logging level, and some text, and then everything gets pushed into a centralized system like Elasticsearch.
And so there's all this tedious setup and infrastructure work you have to do to get everything in one place. The more structured your data, the easier it is to get that set up. If it's gonna take you an extra week to do it, then you might put it off, and then you don't have the centralization. So every barrier you can take away from getting that centralized, aggregated set of logs is great.
[00:16:59] Unknown:
Yeah. That's definitely a pain point that I am all too familiar with, and one that I continue to battle with, because I am responsible for a growing number of systems at my day job. I have spent far too much time writing and debugging regular expressions to be able to parse some measure of meaning out of various different logging systems.
[00:17:20] Unknown:
Right. And so if it starts out being in JSON, it's just much easier, because then you just say: oh, this field is a timestamp, this is the logging level, these are extra fields. And it goes into something like Elasticsearch that can take structured data, and then it's just there, as opposed to parsing lines of text.
[00:17:38] Unknown:
And what are some of the most useful pieces of context that you see across some of the different projects that you've worked on, as far as being able to inject additional information into the log messages to be able to retrieve and interpret the operation of the system, and potentially even use it, as you said, to extract certain types of metrics?
[00:18:04] Unknown:
So I guess I can give an example. I was asking some Eliot users how they're using Eliot this morning. There's a company in South Africa that's using Eliot for their systems, and they talk to a lot of external systems. Every time they talk to an external system, that's another action in Eliot: we sent this message, we got a response. And every user interface interaction is the entry point, sort of the top of the tree of actions. You say, this user did this action, and then you can trace all the consequences. So you have this tree, and each tree in Eliot is what's called a task. It's a top-level action, and each one has a unique ID. And when an error occurs, they use Sentry, which is a software-as-a-service that lets you aggregate errors from your various systems.
And so they store the unique ID of the task in Sentry, and then they have some tooling that takes that unique ID, looks up all the messages, and builds up a tree. So whenever an error occurs in some random system, they know within seconds, apparently, that this user did this action, which caused these things to happen. They just see a tree of everything that happened, from the point of the user interacting with the system, through all the queries to remote systems, down to where the error was. Which sounds like it makes debugging a lot easier, as opposed to: an error happened, now I'm gonna try to figure out where all the messages related to this were, and I don't know who caused it or why.
So just being able to trace all the way from human interaction with the website down to all the resulting messages is really useful. I've seen that in more complex systems, merely tracing the network messages can be really useful. Another logging system that inspired Eliot was one built for an airline reservation system I used to work on. You'd get airline reservation messages from other airlines or from travel agents coming into the system, and that would talk to one system, which would send a message to another, and then to another. Just being able to look through all those messages, you could see how someone did a booking request, and then it did an inventory lookup, and then it did an availability call and failed there. And so you could figure out the situations that caused things. So just being able to see all the messages of a trace, even if you can't see within the process, even that alone can be really useful.
And another pattern I hadn't really expected, another use case, which I discovered in my latest job: scientific computational pipelines, and probably also, more generally, big data. In scientific publishing, if you're doing research, you want your research to be reproducible. You want to be able to do the same thing yourself over again and get the same results, and also to be able to say to someone else: you can run my code with these inputs and get the same answers. So I started using Eliot a little bit there, being able to just say: here are all the steps in the computation, here are all the inputs, here were the outputs.
So you know exactly what the parameters were, what data went in, and what data came out, which is really great for reproducibility. And it's kinda neat, because I'd never expected that as a use case.
[00:21:21] Unknown:
Yeah. Being able to track provenance is a problem that is, as you said, sort of perennial in the sciences, and also in data processing of any sort of scale, because anytime you mutate the data, you wanna make sure that the source data was valid. And so being able to have an additional trail of actions that can be collocated with the actual processing is something that I can see as being immensely valuable.
[00:21:54] Unknown:
I guess the South African company, they were also telling me that because they already have Eliot logging, they've hooked it up to their unit tests, and possibly integration tests as well. So when they have a test failure, they get a trace of everything that happened within that test. Instead of just a traceback, which is sort of a linear here's-who-called-me, with Eliot logging you get the whole tree of everywhere you put logs, which can be almost as detailed as all the function calls, if you wanna be that detailed. So when they get a test failure, instead of just getting a traceback, they get the whole tree of actions that led up to that point. So it's also useful in debugging tests.
[00:22:36] Unknown:
Yeah. I can definitely see that being useful too, for being able to encapsulate not just the specific steps that were taken, but the intent behind each of those steps, so that you can more easily understand why the different actions happen throughout the system, rather than just, as you said, seeing a linear trace of where the process flow went.
[00:23:03] Unknown:
Yeah. And with the linear trace, if you have recursion in your program, it starts getting really tricky to figure out what's going on.
[00:23:10] Unknown:
And one thing that I was curious about while I was reading through the documentation: because of the fact that you are creating these nested structures of actions, I was wondering what the actual concrete log lines look like as they're being emitted from the system. Do you emit one log line each time you call into Eliot, and then they just have some sort of embedded metadata for referencing previous messages? Or is it actually a nested structure that eventually gets emitted as one step at the end of a particular process chain?
[00:23:46] Unknown:
It is pluggable, but the standard way of logging is a file with one JSON object per line, so a fairly standard format. It's really easy to support with something like Logstash. If you're doing that, then each JSON message refers to the unique ID of the task it's part of, and it has, basically, the path to its location in the tree. So it'll say, I'm the second child of the third child of the root. Then you get a message for the start of an action, a message for any standalone messages within the action, and then a message for the end of the action, and each of those is a JSON record. So you can load it into log storage systems like Elasticsearch that expect linear, isolated messages, but within each message is sufficient information to extract the structure back out again. Jonathan Jacobs, one of Eliot's users, has written a really nice tool, eliot-tree, that renders an ASCII tree of messages in your console. But it's also human-readable, so you can visually sort it by the right column in your log-viewing UI and just read the structure, or you can write tools that parse it.
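Because each JSON record carries the task's unique ID and its path in the tree, a small script can rebuild the structure from a flat log file. The following is a sketch, assuming Eliot's standard field names (task_uuid, task_level, action_type, action_status); eliot-tree does a much fancier version of this:

```python
import json
from collections import defaultdict

def load_tasks(path):
    tasks = defaultdict(list)
    with open(path) as f:
        for line in f:
            msg = json.loads(line)
            tasks[msg["task_uuid"]].append(msg)
    # task_level is a list of ints: the path to the message's position in
    # the tree, so lexicographic order is tree order.
    for messages in tasks.values():
        messages.sort(key=lambda m: m["task_level"])
    return tasks

for task_id, messages in load_tasks("example.log").items():
    print(task_id)
    for m in messages:
        indent = "  " * len(m["task_level"])
        print(indent, m.get("action_type", m.get("message_type")),
              m.get("action_status", ""))
```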
[00:25:09] Unknown:
What have been some of the most pernicious antipatterns that you have seen in different projects, where developers are setting up logging without necessarily having any specific intent behind those messages?
[00:25:24] Unknown:
So the first one is logging levels. The way logging levels are defined in most systems is pretty terrible. There's info and warning and error, and sometimes critical, and it's never really clear what these levels mean. They mean different things to different people, so people will just pick one at random. The best solution to this I've heard is by someone named Daniel Lebrero, who works at IG.com. They came up with an alternative set of logging levels, and I don't think everyone should use these particular ones, but I think the method they used is really great. The logging levels they chose were: in a test environment, in a production environment, tell me about it tomorrow, and wake me in the middle of the night. Those are their four logging levels, and it's completely clear what each one of them means.
It's not like info or warning. It's tell me about it tomorrow, or wake me up right now even if it's 3 AM. So whenever you're writing a log message, you think about the business consequences of this action: is this a thing that someone should tell me about tomorrow, or is this a thing that should wake me up in the middle of the night? It's harder to do this in libraries, to be fair. But if you're writing application logic, you should have some sense of what the implications of this log message are, and then you can come up with logging levels that fit your particular organization.
Is this a catastrophe? Is this annoying? Is this just a thing that only needs to be in test environments for debugging there? Coming up with logging levels that map to desired consequences seems like a really great idea. Another antipattern is just the standard plain-text logging. If you're in the write-code, run-tests, poke-at-it-manually-on-your-desktop stage, that works fine. But as soon as you're writing something that's going to be running on a server in production, where the logs are going to be processed somehow, you really should be using structured data. And you don't have to use Eliot for this; with most logging systems you can, at worst, coerce them to store structured data, and these days they'll often support it natively. As I mentioned, this isn't true for everyone, but in the cases where debugging something in production is going to be both very important and very hard, writing automated tests to ensure your logs are actually running and logging the stuff you think you need them to log is very important. Because typically the point where your logging code is going to run, especially if it's the error path, only happens rarely; it's the thing you don't expect to happen. So it's very easy to get into a place where you have logging, but when that crash happens every two weeks, it turns out the logging code doesn't actually log what you thought it did. There was a bug in the logging code, and you never caught it because it's such a rare case. So there are systems where the cost of failure and debugging is high enough that writing tests for your logging code is actually very useful. And the final antipattern I've seen is logging too much.
In my initial design for Eliot, I left out logging levels, because I wasn't happy with the way logging levels were designed in most systems. You can just have a field which is the logging level; it's structured data, you can put whatever you want in it. But what didn't occur to me is that the one good thing about forcing a logging level onto every message is that every time you write a message, you have to put a tiny bit of thought into it: what are the consequences of this log message? Is it debug or not? It's very easy to just throw in lots of log messages and not think about the cost of running them in production on a critical code path.
And the issue is that, much like programming in general, you can't just focus on the product that you're creating. You have to think about the process that you use to develop software, and adding logging to a system is a process. I had a coworker, Maduri, who described the process that she used at a previous company. During development, they would just add lots and lots of log messages, and then there was an explicit step before they shipped of going back and removing logs. Part of their development process was: at this point, we're going to remove lots of logging, because we know the system is more robust now, and we know there's gonna be a performance cost. So keeping in mind this life cycle of logging, when you have to add it and when you have to remove it, and planning for that, is an important part of adding logging to performance-critical systems.
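Since Eliot has no built-in levels, the consequence-based level names described above can be modeled as an ordinary structured field. A sketch, with invented level names and message fields (log_message is the newer Eliot API; older code used Message.log):

```python
from eliot import log_message

# Level names whose operational meaning is unambiguous, per the IG story.
TELL_ME_TOMORROW = "tell_me_tomorrow"
WAKE_ME_IN_THE_NIGHT = "wake_me_in_the_middle_of_the_night"

def payment_failed(order_id, retriable):
    # The level is just another structured field Eliot serializes, so
    # downstream alerting tooling can route on it.
    log_message(
        message_type="payment_failed",
        order_id=order_id,
        level=TELL_ME_TOMORROW if retriable else WAKE_ME_IN_THE_NIGHT,
    )
```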
[00:30:25] Unknown:
One of the approaches that I've seen in another library that I thought was interesting and fairly clever was in Logbook. They have the "fingers crossed" log handler, where it will actually track the different debug or info messages that happen in the system, but it won't emit them unless you hit an exception, so that when that case does happen, you have all of the additional contextual information that led up to that point. So how would you compare that capability with the facilities that Eliot provides, in terms of its tree of events and how that plays in with the idea of log levels and filtering out the lower-level logs?
[00:31:13] Unknown:
Yeah. So that technique is often used in embedded devices. You have this isolated box somewhere that's running some software, and it doesn't have any way to talk to the external world, but sometimes your customer is gonna call you and say it broke. So what you do is you allocate a fixed buffer for logging, and it's a ring buffer, basically: when you hit the end of it, you just go back and overwrite the oldest messages. So you basically always have the last N log messages, and when the machine dies, you can go over to it and get a dump of the last thousand things that happened. That way you also don't run out of storage with your log messages. When I encountered performance problems with too much logging in Eliot, that was one of the approaches I considered, and much like Logbook has it as one of its handlers, you could add it as a destination, the equivalent concept in Eliot. And while Eliot doesn't have logging levels, it does know when it's logging an error or when it's logging a failed action; it does have those concepts, because a failed action is a failed action. So if that was a useful thing, Eliot does have the information necessary. The other reason I didn't add logging levels to Eliot originally was because I wasn't sure how they would semantically make sense in a tree. In a traditional logging system, each message is isolated, so it's very easy to say: just filter out all log messages with logging level debug.
Whereas in a tree, it wasn't clear to me that that made sense: what if a debug-level action had a child that was an error? But, and this is one of the problems with having a new model, I've since realized that my initial understanding was wrong, and it actually does make sense semantically. So at some point, if I have the time, I might introduce logging levels into Eliot. I suspect I would probably do it with only two levels, something like debug and not-debug, or production and not-production. And then more semantically rich things, like wake me up in the middle of the night, would be things that users add. Because the decision about whether this is a performance-impacting log message is actually a thing that you want the programmer thinking about every time they write a log message. But the decision of whether this is a meaningful error or not is much more difficult, and application-specific and organization-specific. In one organizational context it might be an error; in another it might not.
And I feel like, beyond the distinction of whether this is an error or not, which is very important within the logging system itself, and the distinction of whether this is expensive enough that you shouldn't log it in production, which is useful as a reminder to the programmer at a minimum, logging levels like warning versus error versus critical can't really be done generically.
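A "fingers crossed" style buffer could be built on Eliot's destination hook. The following is a sketch: add_destinations is Eliot's real API for receiving message dictionaries, but the ring-buffer flush policy here is invented for illustration:

```python
import json
from collections import deque

from eliot import add_destinations

class FingersCrossed:
    """Buffer the last N messages; dump them only when a failure appears."""

    def __init__(self, out, size=1000):
        self.buffer = deque(maxlen=size)  # ring buffer: oldest dropped first
        self.out = out

    def __call__(self, message):
        # Destinations are callables that receive each message as a dict.
        self.buffer.append(message)
        if message.get("action_status") == "failed":
            for buffered in self.buffer:
                self.out.write(json.dumps(buffered) + "\n")
            self.buffer.clear()

add_destinations(FingersCrossed(open("failures.log", "a")))
```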
[00:34:25] Unknown:
What have been some of the most challenging aspects of building and maintaining Eliot?
[00:34:31] Unknown:
So as I mentioned, it's a new model, and so figuring out what the right abstractions are, and what a good, usable API is, took time. These days, it's a thing I maintain in my spare time, although since I'm using it at work, I might get an opportunity to do some work on it as part of my job, which would be nice. So finding enough time to work on it has been occasionally difficult. And there's also just the marketing aspect. Eliot has gotten users; there are people using it in real-world production systems. The system it was originally written for, and most extensively used in, Flocker, is unfortunately a defunct project. But before the startup that created it died, we had people using it in anger, as a storage system for their container systems. So there are quite a few systems using Eliot. But because it's a new system, it's taken me a while to get to a point where I can consistently explain why it's so useful. Reading a description of it as logging for distributed systems, that's where it started, but I think that's not necessarily how I'd describe it today. And so I need to go back and revisit the documentation and try to do a better job explaining it, so that more people can get the benefit of what it can give them.
[00:35:57] Unknown:
And how would you compare it to some of the other third-party logging libraries that are available in the Python ecosystem, such as structlog, which you referenced in the documentation, and Logbook, which I referred to earlier, or any other libraries that you can think of?
[00:36:15] Unknown:
So Logbook, I haven't used it extensively; I've just glanced at it. It seems like a system that put a lot of thought into how you route your logs and how you annotate them in different parts of your system, but it doesn't give you any of the semantic richness that you get from Eliot, or really from structlog if you use it. So in the end, you have a lot of power over the code that generates the logs and routes the logs and so on, but at the end of the day, the information you get is not going to be particularly better than a logging system from 20 years ago. At its core, it's the same model. And I've only glanced at the documentation, so I might be unfair.
structlog is a much more modern system in terms of the semantic richness you can get, because from the start it's designed for outputting structured data. And it has this nice model where you take a structured record, and then you make a copy of it, and you can override fields and add fields. So you can create a sort of initial message that has a trace ID, and then all the clones of it inherit that trace ID. And so you can get much more of a sense of causality than you would if you just had text.
And so structlog, or systems like it, are sort of fundamentally better than just logging human prose. I feel like Eliot gives you a better set of abstractions: a tree of actions that can succeed or fail gives you more information about your code flow than just having a trace ID or something like that. You could build up what Eliot does on top of structlog, but then you may as well use Eliot. If any listener doesn't think Eliot works for them, I would strongly encourage them to use structlog, because it's much better than just logging strings. It's a nice library.
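The copy-and-bind model described above looks roughly like this in structlog; the field names are illustrative:

```python
import structlog

log = structlog.get_logger()
# bind() returns a new logger carrying extra fields, so a trace_id set
# once is inherited by every message logged through the bound copy.
request_log = log.bind(trace_id="abc123", user="alice")

request_log.info("lookup_started", table="users")
# Both events carry trace_id and user without repeating them per call.
request_log.info("lookup_finished", rows=1)
```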
[00:38:26] Unknown:
Are there any systems or situations that you can think of where you think that Elliot would be a poor choice for somebody who's trying to decide how to emit their logs within the system?
[00:38:38] Unknown:
There's a bunch of places where it's overkill. A lot of web applications are just: this really simple query comes in, I do a database query, I return a result. And if that's all your application does in the end, it's overkill. It's not really going to give you much. You need your business logic to hit a certain level of complexity before there's any point in using it.
And there are some technical limitations, which I've started working on. Coroutine support didn't exist at all until now; it'll be in the next release, 1.2, for asyncio. But if you're using other coroutine-based networking frameworks, Eliot could be extended to support them, and if you want that feature, I encourage you to file a ticket saying, I want coroutine support for Tornado or whatever. Ideally, Eliot would just support all coroutine-based systems, but Python is missing the facilities, so doing it in a generic way would involve something unpleasant, like traversing the stack and looking at frame objects on every log message, or something awful like that, which would just destroy performance.
So I don't wanna do it. There's some talk about adding a coroutine-friendly concept of context, similar to thread locals, to some future version of Python, which should make this a lot easier. So those are two places. And if you're in a really polyglot system, like if you're doing Python and JavaScript and Java and Ruby and so on: there's sort of an initial implementation of Eliot in JavaScript, but other than that, it's a very Python-centric system. So if you want to be using the same logging system across multiple languages, Eliot probably won't work for you. That's off the top of my head; I'm sure there are other cases where it wouldn't quite be the right thing.
[00:40:45] Unknown:
And you've mentioned some of the things that you have planned for future work on Eliot. Are there any other improvements or additional features that you've been considering?
[00:40:55] Unknown:
Yeah. So I have an experimental branch where I generate flame graphs from Eliot traces. Flame graphs are this really nice visualization; I think Brendan Gregg came up with them. I did this a little bit the last time I was using Eliot, but I never really finished the code. So you can get a basic visualization of the performance profile of a trace, and across distributed systems this gets really useful. Maybe I didn't quite mention it, but part of the information that Eliot gives you is how much time was spent in every action. I'm gonna try to make it a little bit easier to get out, but it's all there right now. As a side effect of getting this tree of actions in your logs, you also have the timing of all these actions: you can say this action with these inputs took this much time. And so there's a huge amount of performance information you can get out of the logs, because you have the causality. You can say this started here and ended here and caused these things. As your logging system is running, it's also, as a side effect, profiling your system. So better tools to extract that information would be nice. And the other thing I encountered at my job is that computational pipelines, like scientific computing and big data, aren't actually trees. They're often graphs, directed acyclic graphs.
So you might take a chunk of data, split it up into parts, run all those in parallel, and then aggregate them again, and then do another computation, and then split off a couple more computations. You can take a directed acyclic graph, a DAG, and break it up into trees, and that's one way to approach it. But it might be nice if there were some way to model these full graphs in Eliot. So trying to expand Eliot's conceptual model beyond trees, which could be a high-level abstraction, or could just be a convention for relating two trees together, that might be nice.
And in general, as I learn more about scientific computation, just seeing how Eliot can help with that. It's gonna be interesting.
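Since an action's start and end messages share a task_uuid and tree position and each carries a timestamp, per-action durations can be pulled out of the logs directly. A sketch, assuming Eliot's standard field names:

```python
import json
from collections import defaultdict

def action_durations(path):
    """Pair each action's 'started' message with its end message and
    subtract timestamps, grouping the results by action_type."""
    starts, durations = {}, defaultdict(list)
    with open(path) as f:
        for line in f:
            msg = json.loads(line)
            if "action_status" not in msg:
                continue  # standalone message, not an action boundary
            # Start and end messages of one action share a task_level
            # prefix; only the final element differs.
            key = (msg["task_uuid"], tuple(msg["task_level"][:-1]))
            if msg["action_status"] == "started":
                starts[key] = msg
            else:  # "succeeded" or "failed"
                start = starts.pop(key, None)
                if start is not None:
                    durations[start["action_type"]].append(
                        msg["timestamp"] - start["timestamp"])
    return durations

for action_type, times in action_durations("example.log").items():
    print(action_type, "worst:", max(times), "seconds")
```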
[00:43:15] Unknown:
Are there any aspects of Eliot that we missed, or any other topics that you think we should cover before we start to close out the show?
[00:43:22] Unknown:
I keep talking about causality, and this is a key point about Eliot, because stories are basically how we understand the world. Pretty much every culture has these stories about why the world exists, where people came from, why the world is the way it is. Stories seem like a key part of how human beings manage to understand an immensely complex world. And most of the stories that human beings come up with are wrong. There's fiction, which is very clearly made up; it gives you the illusion of explaining why something happened, but it's an illusion, because someone just made it up. And there are the stories we tell about causes: if you're a historian, you try to understand what caused this thing to happen. The gentry were on the rise during the 17th century, and so there was this conflict between the gentry and the aristocracy, and then you end up in civil war. Or there are people who say the gentry was declining, and they have a different story. These stories are, in the end, always partial and incomplete, but they do help us understand the world. But when you get to software: a universal Turing machine, which is in the end all a program is, is a really, really simple thing. It's a machine that takes some inputs and a program and does some things. The causality in a piece of software is completely predictable.
It would take you a really long time to reproduce it by hand, but it is a system that is completely predictable, and you can trace the causality from beginning to end if you have the right tools. This is a rare realm where we can actually say we traced why something happened. You can't say that about pretty much anything else; you can only say that in approximation, maybe, if you're lucky and you have a few thousand years of scientific progress behind you. But for a computer program, you can trace this causality, and then we just throw it all away. We just write log messages like: a thing happened, another thing happened. In this realm where we could have stories that tell us exactly why something happened, we just say it's not worth the bother. And there are performance reasons why we don't have full traces of all the memory operations of a program, so in the end you have to simplify it somewhat. But this is a realm where we can actually have really good and accurate stories of why something happened. That was why I wrote Eliot. And if Eliot dies and no one ever uses it, but that idea persists and someone else comes up with a system that lets you do that, then I think it was worth it, because we have the potential of being able to understand our software, and we just choose not to do so.
[00:46:19] Unknown:
Are there any other topics that you think we should bring up before we close out the show?
[00:46:24] Unknown:
I think that is all I have about logging, at least off the top of my head.
[00:46:31] Unknown:
I definitely agree that being able to produce meaningful and relevant stories via the systems that we're running, so that we can have a way at a human scale to understand what is happening, and at the same time have that consumable by additional machine processes to aggregate that information and produce higher-level relevant information and actionable steps, is a very valuable capability. I've looked at Eliot a couple of times and considered implementing it in some of my projects, and the more I've spoken to you this evening, the more pertinent a decision that seems to be. So it's something that I'll certainly be revisiting as I continue to work in the Python space.
[00:47:20] Unknown:
I'm glad to hear it. And I'm happy to answer questions, fix bugs, and so on.
[00:47:25] Unknown:
And on that note, for anybody who wants to follow you or get in touch or tell you about how they're using Eliot or ask any questions, I'll have you add your preferred contact information to the show notes. And with that, I'll bring us to the picks. My pick this week is a podcast that I just started listening to called Moonshot, which focuses on a lot of the different startups and technologies that are producing some of the future-looking products and ideas. I think it's definitely useful to keep things like that on your radar, so you can have some measure of where things are progressing and what are some of the future directions that we might be working on as a species going forward. They seem to do a fairly good job of addressing those different topics with people who are driving forward the state of the art in various areas, so I definitely recommend taking a look at that one. And with that, I'll pass it to you. Do you have any picks for us this week?
[00:48:23] Unknown:
Yeah. I would like to recommend the novel Middlemarch. The Eliot logging library is named after George Eliot, which was the pen name of Mary Ann Evans; she was one of the great novelists of the 19th century. Middlemarch is a really excellent novel she wrote. It's from the time when people were writing novels as serials, so it got published week by week as she wrote it, and it has this amazing structure, even as it was being published in sequence. There are a couple of jokes in the Eliot documentation that will only make sense if you've read Middlemarch, so that is another reason I recommend reading it.
[00:49:06] Unknown:
Alright. Well, I appreciate you taking the time out of your day to join me and talk about the work that you've done on Eliot. As I mentioned, it's definitely something that I'll be taking a further look at and most likely implementing in some of my projects. So I appreciate the work that you've done, and your time this evening, and I hope you enjoy the rest of your night. Thank you. Thanks for having me. Really great being here. Thanks for listening. Don't forget to go to podcastinit.com/linode to get your $20 credit, and to run your continuous delivery pipeline with GoCD, which you can download at podcastinit.com/gocd. If you enjoyed the show, then don't forget to go to podcastinit.com/itunes and leave a review.
Introduction and Sponsor Messages
Interview with Itamar Turner-Trauring
Itamar's Introduction and Background
Introduction to the Eliot Library
Challenges in Logging for Distributed Systems
Comparing Eliot with Other Logging Systems
Implementing Eliot in Existing Applications
Evolution and Design of Eliot
Importance of Structured Logging
Use Cases and Examples of Eliot
Common Logging Antipatterns
Advanced Logging Techniques
Challenges in Building and Maintaining Eliot
Comparison with Other Logging Libraries
When Not to Use Eliot
Future Plans for Eliot
Closing Thoughts on Logging and Causality
Final Remarks and Picks