Summary
Do you know what your servers are doing? If you have a metrics system in place then the answer should be “yes”. One critical aspect of that platform is the time series database that allows you to store, aggregate, analyze, and query the various signals generated by your software and hardware. As the size and complexity of your systems scale, so does the volume of data that you need to manage, which can put a strain on your metrics stack. Julien Danjou built Gnocchi during his time on the OpenStack project to provide a time oriented data store that would scale horizontally and still provide fast queries. In this episode he explains how the project got started, how it works, how it compares to the other options on the market, and how you can start using it today to get better visibility into your operations.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey and today I’m interviewing Julien Danjou about Gnocchi, an open source time series database built to handle large volumes of system metrics
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Gnocchi is and how the project got started?
- What was the motivation for moving Gnocchi out of the OpenStack organization and into its own top level project?
- The space of time series databases and metrics-as-a-service platforms is fairly crowded. What are the unique features of Gnocchi that would lead someone to deploy it in place of other options?
- What are some of the tools and platforms that are popular today which hadn’t yet gained visibility when you first began working on Gnocchi?
- How is Gnocchi architected?
- How has the design changed since you first started working on it?
- What was the motivation for implementing it in Python and would you make the same choice today?
- One of the interesting features of Gnocchi is its support of resource history. Can you describe how that operates and the types of use cases that it enables?
- Does that factor into the multi-tenant architecture?
- What are some of the drawbacks of pre-aggregating metrics as they are being written into the storage layer (e.g. loss of fidelity)?
- Is it possible to maintain the raw measures after they are processed into aggregates?
- One of the challenging aspects of building a scalable metrics platform is support for high-cardinality data. What sort of labelling and tagging of metrics and measures is available in Gnocchi?
- For someone who wants to implement Gnocchi for their system metrics, what is involved in deploying, maintaining, and upgrading it?
- What are the available integration points for extending and customizing Gnocchi?
- Once metrics have been stored, aggregated, and indexed, what are the options for querying and analyzing the collected data?
- When is Gnocchi the wrong choice?
- What do you have planned for the future of Gnocchi?
Keep In Touch
- jd on GitHub
- Website
- @juldanjou on Twitter
Picks
- Tobias
- Julien
Links
- Gnocchi
- RedHat
- OpenStack
- Object Oriented Programming
- O’Reilly
- Debian
- Ceilometer
- Prometheus
- Time Series
- MySQL
- Gerrit
- Zuul
- GitHub
- GitLab
- Graphite
- DataDog
- RabbitMQ
- InfluxDB
- Ceph
- S3
- OpenStack Swift
- Cassandra
- Honeycomb Observability Service
- AMQP
- Redis
- DSL (Domain Specific Language)
- Golang
- RBAC (Role-Based Access Control)
- CollectD
- StatsD
- Gnocchi Client
- Telegraf
- Grafana
- TimescaleDB
- OpenStack Heat
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so check out Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. Go to pythonpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers for software engineers.
Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the newsletter, and read the show notes. And keep the conversation going at pythonpodcast.com/chat.
[00:01:25] Unknown:
Registration for PyCon US, the largest annual gathering across the community, is open now. So don't forget to get your ticket, and I'll see you there. Your host as usual is Tobias Macey. And today, I'm interviewing Julien Danjou about Gnocchi, an open source time series database built to handle large volumes of system metrics. So, Julien, could you start by introducing yourself?
[00:01:45] Unknown:
Sure. So I'm a software engineer. I work at Red Hat currently, and I stumbled upon OpenStack a while ago, like, 7 years ago now. And I've been doing Python for more than 10 years. I guess more than 10 years now. I wrote a couple of books about it too, and I've been into open source for 20 years. So that's really what I do every day. And do you remember how you first got introduced to Python? I think it was, yeah, 10 years ago. I actually was doing Perl programming, which is not related at all. But I wanted to start doing object oriented programming, which is something I never got to understand in Perl. It's weird. So I just bought a book.
Like, I wanted to learn Python, so that seemed like a good idea. And I read the book, but at that time, I had, like, no project to write or anything, so I just read the book and thought it would be a good idea to start at some point. And, like, a year later, since I've been a Debian developer for more than 15 years too, I had an idea for a project, like a build daemon for the Debian packages. There was something existing in Perl. It was pretty bad, and I decided to rewrite it in Python. So that was my first project.
[00:02:58] Unknown:
And so now you've been working on the Gnocchi project. So I'm wondering if you can just describe a bit about what it is and how the project first got started. So it started, like, maybe 4 years ago. So when I started to do,
[00:03:13] Unknown:
OpenStack stuff, like, 7 years ago, the company I worked in was trying to build a public cloud, and there was nothing to bill correctly for whatever people were using in the cloud. So we had to measure the resources that were being used by the different customers, and that made us create the Ceilometer project. But the initial project was trying to do 2 things, which is what something like Prometheus does nowadays, which is collecting metrics and storing them. So Ceilometer was pretty good at collecting metrics, but it was pretty bad at storing them. To a point, it was barely usable. I mean, you could store the metrics, but you would not be able to retrieve them in a timely manner. Like, it would take 20 minutes to retrieve any kind of interesting metrics, which is not something that was usable. So we stopped focusing on the storage part of Ceilometer, and I decided to start a new project. So it was just a proof of concept at the start.
Really, just a few lines of Python to show what we could do to store metrics at scale. And in, I would say, a year, we started to really write and build what is Gnocchi today, which is a time series database. So what it does is let you store time series, which are a list of timestamps and values, for anything that you want. Usually, people use Gnocchi for metrics, for monitoring their own infrastructure, that kind of purpose. And
[00:04:45] Unknown:
when it first started, it was housed under the umbrella of the OpenStack project, but it has recently been spun out to be its own top level project. And so I'm wondering what the motivation was for that and what was involved in separating Gnocchi from OpenStack.
[00:05:04] Unknown:
Yeah. So what I learned is that when we built Gnocchi, I think from day 1, we had a different view on what we should do, in terms of the philosophy of Gnocchi. Like, most of the OpenStack projects were built with the idea that you would need OpenStack to do OpenStack. Like, any service in OpenStack would need another service in OpenStack, which sometimes makes sense, but sometimes does not. The ecosystem was very closed on itself. Like, we were not using only the OpenStack libraries that were there; we were willing to play the wider ecosystem game and to be a good Python citizen and use, like, standard libraries, and do things in a way that Gnocchi could be used for anything that is not especially OpenStack specific, even while still being an OpenStack project and an OpenStack good citizen. So we thought a bit differently about the project and the way we built it, which after a couple of years made Gnocchi able to be deployed as a standalone time series database, in the same way you could use, like, I don't know, MySQL or whatever to store your relational data. You could use Gnocchi to store your time series. So that was the first technical point that made us think we could actually move out of OpenStack. And what made us move out of OpenStack is a couple of reasons, one being we wanted to extend our reach. OpenStack was kind of a burden for us because there was a ton of bureaucracy that we didn't want to deal with, because it was just a waste of time for us.
They do have a great platform to contribute. Like, they're based on Gerrit, on Zuul for the CI, things like that, which is really powerful and really amazing once you use it every day. The problem is the barrier of entry is pretty high. So if you want to have your users contribute, basically, something new, a simple patch, it's going to be pretty hard for them to get on board with the workflow that is used on OpenStack. On the other hand, with GitHub, for example, or GitLab, everybody knows how to use that. It's pretty easy for anyone, even a random contributor, to come and say, oh, I fixed the typo here, here's a pull request. So we wanted to have that. We wanted to have a greater, wider community, which is not only an OpenStack one, so we decided to move out of the organization. And so the space of time series databases
[00:07:29] Unknown:
and metrics platforms is fairly crowded. So I'm wondering what the unique features of Gnocchi are that would lead someone to choose it over some of the other options, such as Prometheus that you mentioned, or Graphite in the Python space, or services such as Datadog? So when we started,
[00:07:49] Unknown:
the space was not that crowded, actually. 4 or 5 years ago, there were not a lot of solutions. So when we started, we had a few constraints: being in OpenStack, you can't use just any kind of technology. Like, for example, people suggested that we use a Hadoop back end to store all of the metrics we were collecting, but there was no way that any OpenStack operator was going to deploy a Hadoop cluster just for storing metrics, whereas the rest of the architecture is based on RabbitMQ and MySQL. So we had to take those constraints into account to build our solution and to not reuse something that was existing back then, but there were very few projects anyway. We built Gnocchi with these constraints, one of them being that we were in the OpenStack cloud platform solution where everything supposedly scales out. So you can add any number of nodes in your cloud and it still works. If you need more capacity, you just add more nodes, and we wanted to have that in Gnocchi to store our metrics at scale.
That's not something that you can find even today in most open source solutions like Influx or Prometheus. You can't scale out that way. I mean, you have to either shard your data or find other ways to scale out horizontally, whereas Gnocchi does that naturally. I mean, if your data storage scales, if you use something like Ceph or S3 or OpenStack Swift to store the metrics, you can scale and add any number of new nodes for Gnocchi, and it would just continue working. Another feature that we have in Gnocchi by default, and that's a trade off. I mean, if you look at all the solutions that exist out there, whether it's Influx, Prometheus, or anything based on Cassandra or things like that, they all work differently, because when you try to solve this problem of time series, you have to do trade offs. So we all make different kinds of trade offs, and the trade off that we made, which you won't find in most of the others, is that we don't store the raw data points. So everything in Gnocchi is aggregated. So if you store metrics, you have to define and to know in advance how you want to store those metrics. Do you want to keep the minimums, the maximums, the average, or do you want to keep only the 90th percentile, and for how long, and with which granularity: like, every 5 minutes, every hour? You have to know that in advance, configure it, and then Gnocchi will aggregate everything. It won't store your raw data points, which is a trade off and something that is not the case in what people usually see in time series databases.
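To make that trade-off concrete, here is a minimal Python sketch of the kind of pre-aggregation described above: raw measures are collapsed into per-granularity aggregates chosen in advance, and the raw points themselves are discarded. The function and window math are purely illustrative, not Gnocchi's actual implementation.

```python
from statistics import mean

def rollup(points, granularity, methods=("min", "max", "mean")):
    """Pre-aggregate raw (timestamp, value) points into fixed-size windows.

    Illustrative only: Gnocchi's metricd does this with NumPy and a richer
    set of aggregation methods, but the principle is the same -- one
    aggregate per window survives, the raw points do not.
    """
    buckets = {}
    for ts, value in points:
        # Timestamps are floored to the window start, which is where
        # sub-granularity precision is lost.
        buckets.setdefault(ts - ts % granularity, []).append(value)
    aggs = {"min": min, "max": max, "mean": mean}
    return {
        bucket: {m: aggs[m](values) for m in methods}
        for bucket, values in sorted(buckets.items())
    }

raw = [(0, 10.0), (2, 30.0), (4, 20.0), (5, 50.0), (9, 40.0)]
series = rollup(raw, granularity=5)
# Only two 5-second windows of aggregates remain; the five raw measures
# and their exact timestamps are gone.
```

Reading the series back later is then just a lookup of these pre-computed buckets, which is what makes queries fast at the cost of losing the raw data.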
[00:10:31] Unknown:
And as you mentioned, the raw data points might be needed for some other analysis. And when I was reading through the documentation, I noticed that the raw metrics themselves are actually staged in 1 storage area before they're then processed by the metrics daemon to perform those aggregations. So I'm curious if there are any sort of life cycle rules that you can set up to keep those raw metrics around for some period of time, or ship them off to another system easily for processing in a different way that is unaggregated.
[00:11:15] Unknown:
Right. So there's nothing to move those data points out, but we've seen people building different kinds of architecture, like sending the data points on a bus, like AMQP or something, and plugging Gnocchi and another system on top of that, to have 2 data stores doing different things. What was the other part of the question? As for life cycle, you actually can configure the life cycle via the archive policy, and you can change that. So if you want to keep your metrics for a longer period, you can do that, and you can change that dynamically. If you really want to store raw metrics, you can do that in Gnocchi, but there's a couple of, I would say, hacks that you can use. Usually, you don't poll for metrics every nanosecond. You do that every 5 seconds, or every second if you do something that is really fine grained. And you can configure archive policies which are that precise. So if you configure it to say, oh, I want to keep a data point every second, and you send a data point every second, or every 2, every 5 seconds, whatever, it will keep everything at this kind of granularity. So you will only lose the precision.
Basically, you will lose the timestamp. If it's precise to the microsecond, you will lose that precision. It will be rounded to a second, but you can do that still. And that covers most of the use cases we've seen in monitoring and keeping metrics long term. And what are some of the trade offs of having all of the
[00:12:45] Unknown:
metrics that are collected pre-aggregated before they're queried, in terms of the benefits that it provides, and some cases where it would actually be beneficial to keep those raw metrics around for maybe being able to do some
[00:12:58] Unknown:
more fine grained analysis of events as they're coming in? Yeah. So, I mean, the raw data usually is pretty cool if you want to match timestamps. Like, if you have a system that can be very precise, like, up to the micro or nanosecond, and you have events that can be correlated this way, you really want to keep that. But it's less true for metrics than it is for events, which is a different thing in a different kind of system. Then the perk of having the aggregation being done beforehand, on the fly, and stored as is, is that when you query Gnocchi, it's very fast. What it does, in summary, when you query it, is that it looks up the metric in an index, which is usually a SQL database, and then it just fetches the file on your distributed file storage and sends it back to you. So it's very fast to, like, get the data and render any kind of chart. Whereas, if you want to do some kind of dynamic aggregation, you can do that also with Gnocchi if you want to do more complicated aggregation, but usually, if you use something like Prometheus or Influx or whatever, they have to read the data, and then compute, and then reply back to you with the answer, and that sometimes takes a lot of time. And can you give a bit more detail on how the Gnocchi database
[00:14:18] Unknown:
is architected, and how that design has changed since you first began working on it? Right. So the first version of Gnocchi was just like a single
[00:14:29] Unknown:
daemon with a REST API, and everything was synchronous. So it was working, but very slow. Like, you would post your metrics, and it would write the file, compute everything, and then reply back to you: okay, it's done. So it's changed a lot since then. We have, like, 2 main components, the first of which is the API. So Gnocchi's entry point is a REST API, written in Python, that you can use to store and to read the metrics back, and communicate with it in general. That API, when you send new metrics to it, as you said, is going to store them in an incoming buffer. We usually use Redis for that, but there are a number of drivers that we can use, being any kind of storage, plain files or anything. Redis is pretty fast, so we use that mainly, and it scales pretty well. Once it's stored in the incoming buffer, there's a second daemon, which we call metricd, or metric daemon. So metricd will collect those incoming metrics, aggregate them, and then store them for archiving purposes, let's say, into the final storage.
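The write path just described (the API appends to an incoming buffer, and metricd later drains, aggregates, and archives) can be sketched with in-memory stand-ins. A real deployment would use Redis for the buffer and Ceph/S3/Swift for final storage; the function names here are invented for illustration.

```python
# In-memory stand-ins for the real backends (Redis, Ceph/S3/Swift).
incoming, storage = {}, {}

def api_post_measures(metric, measures):
    # The REST API does no aggregation: it only appends raw measures
    # to the incoming buffer, which keeps writes fast and asynchronous.
    incoming.setdefault(metric, []).extend(measures)

def metricd_process():
    # metricd runs separately: it drains the buffer, aggregates the
    # measures, and writes the result into the final (scalable) storage.
    for metric, measures in incoming.items():
        values = [v for _, v in measures]
        storage[metric] = {"mean": sum(values) / len(values)}
    incoming.clear()

api_post_measures("cpu.util", [(0, 1.0), (60, 3.0)])
metricd_process()
```

The decoupling is the point: the API stays responsive no matter how far behind aggregation is, and you can run more metricd workers to catch up.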
That final storage is supposedly a scalable storage, so we advise using Ceph, which is a distributed object storage, not a plain file system. But it also works with plain files, if you don't really want to scale and you have simple resources, or if you want to use NFS; we have some users doing that. Gnocchi also supports Amazon S3 and OpenStack Swift for object storage. So all the metrics are read from the incoming buffer and stored in there. And the last piece of the architecture is the SQL database, so that's what we call the indexer. The indexer is actually indexing the metrics. It's pretty cool to store timestamps and values, but it's not enough, because then you need a way to retrieve the kind of time series you want. Like, usually, you have a name or something to identify the time series, and you want to look it up like that when you are looking for a metric for a computer or anything. So we store all of that metadata information in a SQL database, because it's pretty fast and easy to query. So that's what we use. And your point about naming the metrics
[00:16:42] Unknown:
brings up 1 of the common issues I've seen in systems that are designed for capturing system metrics, which is the cardinality of those metrics. And some systems use a tagging nomenclature for being able to add additional metadata, to make those metrics easier to discover, analyze, and aggregate with other systems within a given environment, or for being able to investigate specific aspects of a system. So I'm wondering what type of support Gnocchi has for tagging metrics, and the overall way that those metrics are structured and searched upon.
[00:17:25] Unknown:
Right. That's something that people usually ask, like, do you support tags or things like that? And, no, we don't. If you look at systems that support those kinds of tags, it's pretty hard to support at scale. If you look at how to store, like, EAV in SQL, so, basically, attribute-value for entities, it's pretty hard to do in a scalable way. We actually tried in another project, in Ceilometer, and we just failed. Because, I mean, you can do that, but at a certain scale, it's going to be very, very slow. So we decided to not support that. What we implemented instead is a more dynamic kind of resources. So what you can do, and that's another aspect of the software, which is not strictly related to the time series, is that you can describe your, I would say, resources. Maybe not your architecture, but your resources.
So, for example, you want to measure the CPU time and the memory usage of a VM, an instance, running. What you do in Gnocchi is that you create a schema, and you say, okay, I have a set of resources which are instances. There are a few attributes, which are strings, like their names, the host they're running on, things like that, and then the metrics that we will have, like the CPU time, memory usage, etcetera. So you create this schema, and when you create a new VM, you create a new resource in Gnocchi, saying: okay, I have a new VM that just started at that time, here's its name, here's the compute node it's running on, and these are the metrics I want to store for it. And then you can use that information, which is stored in the index, to query Gnocchi and to find back the resource that you just created. Usually, I mean, these schemas are created using the REST API, so you can add any number of fields that you want. You can use strings to describe whatever you want, and you can use that to query Gnocchi and the index behind it. So it's way better and easier to make that scale, because it's not free form like tags or any kind of EAV that you would implement.
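As a rough sketch of that resource model (the field names are invented for illustration, not Gnocchi's actual schema): you declare typed attributes once, every resource conforms to them, and queries run against those structured attributes rather than free-form tags.

```python
# A "resource type" with typed attributes, declared once up front.
instance_type = {"name": str, "host": str}

# Resources conforming to the schema, each pointing at its metric ids.
resources = [
    {"type": "instance", "name": "web-1", "host": "compute-1",
     "metrics": {"cpu_time": "uuid-1"}},
    {"type": "instance", "name": "db-1", "host": "compute-2",
     "metrics": {"cpu_time": "uuid-2"}},
]

def find(resource_type, **attrs):
    # In Gnocchi the indexer answers this with a SQL query; here it is a
    # simple filter over the structured attributes.
    return [r for r in resources
            if r["type"] == resource_type
            and all(r.get(k) == v for k, v in attrs.items())]
```

Here `find("instance", host="compute-1")` returns only `web-1`, and from there you can follow the metric ids into storage, which is exactly the lookup pattern the structured schema makes cheap.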
But it also makes it very easy to find back any kind of resource that you want to store in Gnocchi, and the metrics they have, because it's really pretty easy to describe what the resources look like. The problem that you have with the free form of tags, or any kind of metadata which is not schematized, is that it's really poor as a data model. You know what? If you want your data to be usable, you have to define and use some structure. And for those resources,
[00:19:56] Unknown:
that's definitely very useful, particularly in modern cloud environments or container based environments where these resources may be coming and going very frequently or rapidly. And so I'm wondering how those resource attributes are structured: is there the capacity to query for all metrics with a given resource type attributed to them, for being able to query across multiple different VMs of a particular type, and then also to segment based on specific VMs for doing more fine grained analysis of events that happened within the lifespan of that 1 machine? And how are the metrics associated with a given resource as they're delivered into the system? Yeah. So, basically, what we did is that we defined,
[00:20:44] Unknown:
like every project has to do at some point, a DSL, so a domain specific language, that we offer and provide through the REST API and that allows us to do any kind of query. And we translate those queries to SQL at the end, because that's what we use for our index. So you can do any kind of query that you just described, like being limited to a certain time span, to a certain type of resource, to a certain type of host name. If you want to, like, find your instances running on any particular host name at that time, you can do that and run through the metrics. There's a full language that we keep extending, because people keep coming up with new use cases that we did not think about, or that the language is too limited to do. So we just add more features from time to time into the REST API to cover all of these new use cases. We also implemented a Grafana plugin, like, a year or 2 years ago now, and that also created a lot of new use cases. I mean, I think that was the first big use case we had, to display all the kinds of fancy dashboards people want. So you have to be able to run a lot of different types of queries. So we added a few extensions to the API to cover all those kinds of use cases. And in terms of the
[00:22:01] Unknown:
project itself, I'm wondering what the motivation was for implementing it in Python, and whether you would make that same choice today if you were to start over from scratch. So we never thought about,
[00:22:14] Unknown:
picking another language, because being in the OpenStack ecosystem, you have no choice. It was Python or nothing. There were people actually who tried to bring Go to the table in OpenStack, and that was a big debate. So we just had no choice. I think today, I would keep Python for its ecosystem, which is really great. Like, the code base of Gnocchi is pretty small, because we use a ton of libraries, and we actually contribute to the libraries we use, because we are very lazy and we don't want to always reinvent the wheel. And we're a small team, so we can't afford to write everything from scratch. So that's something that I do love in Python. The only downside right now for us in Python is the performance.
So it's okay, because we use NumPy, which is pretty fast. We tried to optimize it at some point with Cython, a few things, but it was not worth it, so we didn't do it. And Python, I mean, CPython at least, performance is getting better every year or so. So I don't think that's limiting us. None of our users complain about the performance because of Python. Like, most of the performance bottlenecks we had were because of what we did, and not because of the language or the Python VM itself. I'd say probably Go is a good pick right now, and there are a few projects in Go. But I'm still feeling that Python is a good pick for this kind of project. And 1 of the other aspects of the Gnocchi system
[00:23:48] Unknown:
is that it supports multi tenancy. So I'm wondering how that is implemented at the project layer, and some of the benefits that it provides for somebody who might not necessarily have
[00:24:02] Unknown:
multiple environments that they're capturing metrics from. Right. So that's 1 of the trade offs we had to make when we started to build Gnocchi. OpenStack being multi tenant, we had to find a time series database that was multi tenant, which did not exist 4 years ago, and I don't think still exists if you look at the mainstream time series databases. So the way it's implemented, it's completely on the side of the REST API, where you have a segregation of the metrics and the resources. There's an RBAC policy that is implemented that you can define, and it's actually pluggable. So a lot of the things we do in Gnocchi are pluggable, like the drivers to store the metrics or the index. And even for authentication and authorization, everything is pluggable. So you can write your own plug in in Python.
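A toy version of that pluggability (the names here are invented; Gnocchi's real plugin interface differs): authorization policies are interchangeable callables registered under a name, and the API enforces per-tenant segregation through whichever one is configured.

```python
policies = {}

def register(name):
    # Each policy plugs in under a configurable name, so an operator
    # can swap implementations without touching the API code.
    def wrap(fn):
        policies[name] = fn
        return fn
    return wrap

@register("owner_only")
def owner_only(user, metric):
    # Strict segregation: tenants only see their own metrics.
    return metric["project_id"] == user["project_id"]

@register("shared_read")
def shared_read(user, metric):
    # The owner can additionally grant read access to specific tenants,
    # like the public cloud case described above.
    return (owner_only(user, metric)
            or user["project_id"] in metric.get("readers", ()))
```

The API would then call `policies[configured_name](user, metric)` before serving any read, which is the segregation-at-the-API-layer idea in miniature.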
We have a few plug ins that support, like, basic auth or Keystone or things like that. It's pretty easy to write, not a lot of code, and you can already add support for your authentication system, and how to find your users, and how to define which metrics belong to which users, pretty easily. The upside of doing that, and that's something we did in the OpenStack ecosystem, is that then you can give permission to your users to access their metrics. So if you take a case that we know well, which is a public cloud where you have multiple tenants using the cloud: you have a central system like Ceilometer or collectd or whatever collecting your metrics and storing them into Gnocchi. But in some cases, you would like to give access to those metrics. If you don't have any multi tenant system, then you know that your database is only going to be owned by you and yourself, and you're not going to share it with your users. In the case of Gnocchi, you can give access to some metrics you collected for your users. They can get access to it, read it, and use it. And for somebody who's interested in
[00:26:01] Unknown:
deploying Gnocchi to their own system, I'm wondering what's involved in getting it set up, any capabilities built into the system to support ongoing maintenance and upgrades, and also the ways that it can be managed using some sort of configuration management platform, for being able to ensure consistent deployments across environments, or just for being able to test and validate before you go to production?
[00:26:29] Unknown:
I don't think that's a strong point for us right now. So there are, I would say, the usual suspects for deploying, like the deployment code and recipes that exist and are maintained to deploy Gnocchi, things like that. We don't provide any kind of deployment system in Gnocchi ourselves. It's pretty easy to configure, in the sense that you just have to have a gnocchi.conf file, and that just makes Gnocchi work. That file is pretty easy to write and pretty well documented. So once you have that, you can just, you know, copy paste that file on every system where you run Gnocchi, and run the API, or run metricd, or both, whatever. So we are, I would say, agnostic about what system you want to use to deploy, scale, or augment your system. We don't really care about that. We just try to provide something that works out of the box. Otherwise, it's pretty easy to upgrade, which was not always true. I mean, the first versions were pretty hard to upgrade, because we also did a few changes to the file format. So you had to convert everything, but that was, like, 2 years ago. We don't do that anymore.
We have a good file format, which is pretty efficient, so we don't want to change anything in that regard, so the upgrades are pretty seamless. Especially because all the index schema in the database is managed through the REST API by the client side; it's the user that's going to create the schema, etcetera. We don't have to care much about the SQL upgrade of the schema or whatever. I mean, it can be done via the REST API when the user wants to change anything, so the upgrades are pretty seamless.
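For a sense of what that single file covers, a gnocchi.conf might contain sections along these lines, wiring up the three architectural pieces discussed earlier (indexer, final storage, incoming buffer). The exact option names below are recalled from memory and should be checked against the documented sample configuration rather than copied as-is.

```ini
[indexer]
# SQL database used as the indexer (resource metadata, schemas).
url = mysql+pymysql://gnocchi:secret@db-host/gnocchi

[storage]
# Final aggregate storage; 'file' works for small setups,
# ceph/s3/swift for deployments that need to scale out.
driver = file
file_basepath = /var/lib/gnocchi

[incoming]
# Incoming measure buffer drained by metricd; Redis is typical.
driver = redis
redis_url = redis://localhost:6379
```

Since every daemon reads the same file, the copy-paste-to-every-node approach described above is all the distribution the configuration needs.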
[00:28:07] Unknown:
And in terms of delivering metrics to Gnocchi, I know that there's a specific client that you have implemented, but I'm wondering what other types of systems it will easily support receiving metrics from, such as StatsD or collectd, or, I think I read in the documentation that it supports the Prometheus format. Right. So, yeah, we try to be a good citizen with everything that exists.
[00:28:32] Unknown:
So I think the first one we created was the StatsD daemon, but the StatsD protocol is pretty bad. I think it works, but it's not completely supported. I'm not aware of anyone really using that, but, I mean, theoretically it works. The one plugin that is used a lot is the collectd plugin. It's a plugin for collectd that you can use to take the metrics collected by collectd and store them into Gnocchi. That is used a lot by telcos and other kinds of people who use collectd heavily. We also provide the SDK that you mentioned, the Gnocchi client. That's a Python library which only does REST calls, but it offers, you know, modules and classes that you can use in Python, which is usually easier than raw REST API calls. The collectd plugin is actually based on that. We support other kinds of bigger protocols too. We support the Influx protocol, so you can send the same kind of data you would send to InfluxDB to Gnocchi. If you take a look at the Influx ecosystem, there's a piece of software called Telegraf, which is the equivalent of collectd.
It collects metrics from your system and so on, so you can configure it to send them to Gnocchi, and Gnocchi will happily store those metrics for you. And last, we have a Prometheus integration. Prometheus has a feature where it can export its data to another system. Prometheus does not support long term retention of the metrics, and it's not something they are interested in, so they built a system to export the data so you can archive it. We have an entry point on the API that supports that protocol and can receive the metrics from Prometheus, do the aggregation we were talking about earlier, and store the metrics for long term retention.
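As a rough sketch of what "a Python library which only does REST calls" boils down to, here is a standard-library-only example of pushing measures to Gnocchi's HTTP API. The endpoint path follows Gnocchi's documented v1 API, while the host, port, and metric id are placeholders, and a real deployment would also need authentication headers.

```python
# Hedged sketch of posting measures to Gnocchi over HTTP using only the
# Python standard library. Host and metric id below are placeholders.
import json
import urllib.request
from datetime import datetime, timezone

GNOCCHI = "http://gnocchi.example.com:8041"  # placeholder endpoint


def build_measures(points):
    """Convert (datetime, value) pairs into Gnocchi's measures payload."""
    return [{"timestamp": ts.isoformat(), "value": float(v)} for ts, v in points]


def post_measures(metric_id, points):
    """POST a batch of measures to /v1/metric/<id>/measures."""
    req = urllib.request.Request(
        f"{GNOCCHI}/v1/metric/{metric_id}/measures",
        data=json.dumps(build_measures(points)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # raises on HTTP errors
        return resp.status


# Building the payload alone makes no network call:
measures = build_measures([(datetime(2019, 1, 1, tzinfo=timezone.utc), 42.0)])
```

The real gnocchiclient wraps exactly this kind of call in friendlier classes, which is why the collectd plugin is built on top of it.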
[00:30:25] Unknown:
Once the data is in Gnocchi, it's not going to do a lot of good unless you can query it and visualize it, and I know you mentioned that there's some support for dashboarding. So I'm wondering what are some of the other integration points that are available for extending and customizing Gnocchi, and the existing integrations that people can take advantage of, such as maybe Grafana, for being able to visualize the metrics across the system?
[00:30:53] Unknown:
As far as I know, the only tool that you can use to display any kind of fancy charts and all is Grafana. Our plugin is pretty complete, it's maintained, it's easy to install, and it's used by a lot of our users, and they seem to be happy with it. But you can obviously build your own system; the REST API is fully documented. And like I said, most of the things we provide in Gnocchi are based on a driver system: the storage, the authentication, the database access, and many other points are based on extensions that you can write in Python and plug into Gnocchi. So it's easy to extend and to add your own system. I mean, it's free software, so we're also open to having any kind of new extension added if you want to contribute your own.
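To make the "fully documented REST API" remark concrete, here is a hedged sketch of reading aggregated measures back out. It assumes the documented response shape of `GET /v1/metric/<id>/measures`, a JSON list of `[timestamp, granularity, value]` triples; the host and metric id are placeholders and authentication is omitted.

```python
# Hedged sketch of querying aggregated measures from Gnocchi's REST API.
# The response shape assumed here is a list of [timestamp, granularity,
# value] triples, per the Gnocchi v1 API docs; verify for your version.
import json
import urllib.request


def measures_url(base, metric_id, aggregation="mean"):
    """Build the URL for fetching aggregates of one metric."""
    return f"{base}/v1/metric/{metric_id}/measures?aggregation={aggregation}"


def to_series(triples):
    """Flatten [timestamp, granularity, value] triples to {timestamp: value}."""
    return {ts: value for ts, _granularity, value in triples}


def fetch_series(base, metric_id, aggregation="mean"):
    with urllib.request.urlopen(measures_url(base, metric_id, aggregation)) as resp:
        return to_series(json.load(resp))


# Shape of a decoded response (no network call):
sample = [["2019-01-01T00:00:00+00:00", 300.0, 42.0]]
print(to_series(sample))  # {'2019-01-01T00:00:00+00:00': 42.0}
```

A Grafana plugin or a custom dashboard is ultimately issuing requests like this under the hood.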
[00:31:43] Unknown:
And so when somebody is evaluating Gnocchi against some of the other time series systems or metrics platforms, I'm wondering when Gnocchi would be the wrong choice and they would be better served by a different platform.
[00:31:58] Unknown:
Right. So I guess if you're not interested in the trade-offs that we made, which are, like, pre-aggregating everything, storing at large scale, things like that, you should not use Gnocchi; you should use something else. But if you are interested in that, it's a pretty good platform. And, again, if you need to keep the raw data points being stored, for doing any kind of analysis over the raw storage later, something else would be a better pick. Like I said, most of the platforms and tools that you will find are making trade-offs for you. You just need to be really aware of what you need before jumping into any kind of system, because it's usually a bit too late to switch once you already have tons of data stored.
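The pre-aggregation trade-off mentioned here can be illustrated with a toy rollup: raw points are collapsed into fixed-granularity aggregates at write time, so queries stay fast but the raw samples are gone. This mimics the idea behind Gnocchi's archive policies, not its actual implementation.

```python
# Toy illustration of write-time pre-aggregation: raw (timestamp, value)
# points are bucketed by a fixed granularity and only per-bucket means
# are kept. Gnocchi's real archive policies follow the same principle.
from collections import defaultdict


def downsample_mean(points, granularity=60):
    """Aggregate (unix_ts, value) points into per-`granularity` means."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % granularity].append(value)
    return {bucket: sum(vs) / len(vs) for bucket, vs in sorted(buckets.items())}


raw = [(0, 1.0), (30, 3.0), (60, 10.0)]
print(downsample_mean(raw))  # {0: 2.0, 60: 10.0}
```

Once only `{0: 2.0, 60: 10.0}` is stored, the original three samples can never be recovered, which is exactly why Gnocchi is the wrong choice if you need raw points for later ad hoc analysis.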
[00:32:43] Unknown:
And in terms of building and maintaining the gnocchi project and the community around it, I'm wondering what you have found to be some of the most interesting and useful lessons learned as well as some of the biggest challenges that you faced in the process.
[00:32:59] Unknown:
So what we learned is that it's pretty hard to build a community that thrives and scales. I think we have a pretty good product technically, but it's very hard to get your project known and to be heard, especially in that space. I mean, it has grown from barely a couple of projects here and there to a large ecosystem of time series databases. You have the time series databases aimed at monitoring, you have TimescaleDB working on implementing time series on top of PostgreSQL, you have so many projects going on right now, some based on Cassandra or things like that. It's very hard when you build such a project to be heard and to explain to people that there is no one good project that's going to rule them all. There are different sets of trade-offs that you have to make and choose between. Like, you're not going to store all of your data in a SQL database, obviously, and you're not going to store everything in MongoDB either. Depending on the product that you have, you have to decide what kind of trade-offs you want and what solution is good for you. And it's the same thing for time series databases. At least for the general public, I would say, it's still a young subject that people are just getting into, probably because we were not storing such a large amount of metrics years ago, and we are storing more and more metrics as our systems evolve.
[00:34:29] Unknown:
So we need better and bigger systems. And what do you have planned for the future of Gnocchi in terms of any improvements or feature additions? And also, is there any particular type of contribution that would be most useful to you in achieving that vision?
[00:34:47] Unknown:
Right. So right now, we don't have a lot of, well, we do have a few feature requests, but not that many. We are still improving the API and adding new use cases for people. There are people using the API to build complex systems like billing systems, statistics systems, capacity planning, things like that. They can retrieve any kind of data they want from Gnocchi, but sometimes we can make their life easier by implementing the right call in the API, so we try to do that. We'd also love to evolve the API into something fancier. The API is a bit old; I mean, it has been version 1 of the API since the beginning, and it has a few drawbacks. We'd love to have something new with GraphQL or something like that. We started a few discussions about that a while ago, but it didn't move forward. We also want to improve our integrations with other systems. Some of those are barely used right now, and we know there might be some issues here and there, so we'd love to have more users testing them so we can make them work better. And one of the other capabilities
[00:35:58] Unknown:
that I know is not currently supported in gnocchi but has a third party
[00:36:04] Unknown:
contribution for being able to achieve it, is the idea of alerting, which can be a critical aspect of a monitoring system. So I'm wondering if you can talk a bit about that. Right. So that's the debate we developers have had for a long time. I mean, we want to support that, but do we want to build it into Gnocchi, or should it live outside of Gnocchi? Right now we don't have it, just because nobody worked on it, and it's a huge task. There is a system in OpenStack that is able to, like, trigger alarms based on metric thresholds, etcetera, which is called Aodh, spelled a-o-d-h. It's an OpenStack project, and it's used with Heat, which is another OpenStack project. Both of those projects together provide a nifty feature, which is auto scaling.
And to do auto scaling, you have to evaluate your alarms and your metrics regularly. That's what Aodh does, and it uses Gnocchi to read the metrics for that. But it's a really small project, it does not support a lot of features, and it's barely maintained upstream. So we tend to think we might need another, better solution, especially because Aodh is not very fast. It's pretty slow. I mean, it's an old architecture.
[00:37:18] Unknown:
But we never managed to work on that, so we still don't know if it's a good idea to build that into Gnocchi or alongside it. And are there any other aspects of the Gnocchi project or metrics and systems monitoring that we didn't discuss yet which you think we should cover before we close out the show? I don't think so. I think we've got pretty good coverage. So for anybody who wants to get in touch with you and follow the work that you're up to, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the picks. This week, I'm going to choose a podcast that I listen to from NPR called the Marketplace podcast, which is just a great way to keep up to date with some of the news of the day as it pertains to economics and the overall economy. So it's an interesting podcast and a good way to stay up to date. And with that, I'll pass it to you, Julien. Do you have any picks this week? I would like to, yeah, mention another product I work on, which is,
[00:38:14] Unknown:
Mergify. You can find it at mergify.io. It's an open source engine that helps you merge your
[00:38:22] Unknown:
pull requests on GitHub, and that's actually something that we started to build while moving Gnocchi to GitHub. Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been up to with Gnocchi. It's definitely an interesting project and one that I've been evaluating for my own use, so after learning a bit more about it, I may end up getting it deployed. So thank you for your work on that, thank you for your time, and I hope you enjoy the rest of your day. Thank you. You as well.
Introduction and Sponsor Messages
Interview with Julien Danjou: Introduction
The Genesis of Gnocchi
Separation from OpenStack
Unique Features of Gnocchi
Lifecycle and Storage of Metrics
Gnocchi's Architecture
Resource Attributes and Querying
Deploying and Maintaining Gnocchi
Querying and Visualizing Metrics
Building and Maintaining the Gnocchi Community
Future Plans and Contributions
Alerting and Monitoring
Closing Remarks and Picks