Summary
Routers and switches are the stitches in the invisible fabric of the internet that we all rely on. Managing that hardware has traditionally been a very manual process, but NAPALM (Network Automation and Programmability Abstraction Layer with Multivendor support) is helping to change that. This week David Barroso and Mircea Ulinic explain how Python is being used to make sure that you can watch those cat videos.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.
- Visit the site to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Your host as usual is Tobias Macey and today I’m interviewing David Barroso and Mircea Ulinic about NAPALM (Network Automation and Programmability Abstraction Layer with Multivendor support), the library for managing programmable network devices.
Interview
- Introductions
- How did you get introduced to Python?
- [david] 2012, trying to use Django 1.4 to store data I had on Confluence.
- [mircea] August 2008, when I bought Learning Python by Mark Lutz, 2nd edition.
- Can you start by explaining what NAPALM is and the problem that you were solving when you started working on it?
- [david] trying to remove all the if vendor_a do this, elif vendor_b do this other thing instead
- [mircea] only if I will feel there’s anything to add
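The "if vendor_a ... elif vendor_b" problem can be illustrated with a minimal sketch. Everything here is hypothetical: the two driver classes, the `get_ip_addresses` method, and the canned outputs are invented for illustration and are not NAPALM's actual drivers or getters. The point is only that the calling code never branches on the vendor.

```python
from typing import Dict, List

# Hypothetical raw outputs standing in for what two different platforms return.
RAW_VENDOR_A = "Interface  IP\neth0  10.0.0.1\neth1  10.0.1.1\n"
RAW_VENDOR_B = {"interfaces": [{"name": "ge-0/0/0", "address": "10.0.2.1"}]}


class VendorADriver:
    """Toy driver for a platform that only offers plain-text CLI output."""

    def get_ip_addresses(self) -> Dict[str, str]:
        result = {}
        for line in RAW_VENDOR_A.splitlines()[1:]:  # skip the header row
            name, address = line.split()
            result[name] = address
        return result


class VendorBDriver:
    """Toy driver for a platform that returns structured data."""

    def get_ip_addresses(self) -> Dict[str, str]:
        return {i["name"]: i["address"] for i in RAW_VENDOR_B["interfaces"]}


# The only place that knows which class handles which platform.
DRIVERS = {"vendor_a": VendorADriver, "vendor_b": VendorBDriver}


def all_addresses(inventory: Dict[str, str]) -> List[str]:
    """Collect every configured IP without any per-vendor branching."""
    addresses: List[str] = []
    for hostname, platform in inventory.items():
        device = DRIVERS[platform]()
        addresses.extend(device.get_ip_addresses().values())
    return sorted(addresses)
```

Calling `all_addresses({"r1": "vendor_a", "r2": "vendor_b"})` walks both toy devices through the same method and returns one flat list.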
- What led you to choose Python as the language for implementing it?
- [david] it’s what I knew best and vendors were starting to provide libraries to interact with their platforms so python seemed like a natural evolution as we could just provide an abstraction on top of those libraries that already existed.
- [mircea] I didn’t implement NAPALM. I was first a user, then a contributor, and now I’m one of the maintainers.
- When working with network equipment it is easy to apply the wrong settings and bring down a large number of systems or lock yourself out entirely. Are there any tools in NAPALM to help prevent this from happening?
- [david] We provide mechanisms to ensure proper peer reviewing; we let operators propose a configuration and get a diff. We have a rollback mechanism so if you detect an issue you can immediately roll back, and we also added support for the auto-rollback feature some vendors have.
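The propose, diff, commit, and roll back workflow can be sketched with a toy in-memory device. The method names are borrowed from NAPALM's configuration primitives (`load_replace_candidate`, `compare_config`, `commit_config`, `discard_config`, `rollback`), but the `ToyDevice` class itself is a hypothetical stand-in that keeps configuration as a list of lines and uses `difflib` for the diff; it is not how any real driver is implemented.

```python
import difflib
from typing import List, Optional


class ToyDevice:
    """In-memory stand-in for a device holding running and candidate configs."""

    def __init__(self, running: List[str]) -> None:
        self.running = list(running)
        self.candidate: Optional[List[str]] = None
        self.previous: Optional[List[str]] = None

    def load_replace_candidate(self, config: List[str]) -> None:
        # Stage a full replacement config without touching the running one.
        self.candidate = list(config)

    def compare_config(self) -> str:
        # Diff that a fellow engineer can review before anything is applied.
        if self.candidate is None:
            return ""
        diff = difflib.unified_diff(
            self.running, self.candidate,
            fromfile="running", tofile="candidate", lineterm="",
        )
        return "\n".join(diff)

    def commit_config(self) -> None:
        if self.candidate is None:
            raise RuntimeError("no candidate configuration loaded")
        self.previous = self.running  # keep a checkpoint for rollback
        self.running = self.candidate
        self.candidate = None

    def discard_config(self) -> None:
        self.candidate = None

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("nothing to roll back to")
        self.running, self.previous = self.previous, None
```

A reviewer would look at the output of `compare_config()` before anyone calls `commit_config()`; `rollback()` restores the checkpoint taken at commit time.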
- How have you architected the library to allow for easy integration of new classes of network devices?
- [david] very simple architecture. Trying to avoid complex features like abstract classes, metaprogramming or decorators. Main reason is that I figured my main user base wasn’t going to be very Python savvy so I wanted something simple. What I ended up doing was simulating interfaces with a base class that describes the supported methods and how they are supposed to behave, plus an extensive testing framework that ensures the method signatures and the behaviors match the expectations.
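One way to enforce an interface from the test suite rather than from the language is to compare method signatures with `inspect`. The sketch below assumes a toy `NetworkDriver` base class with two methods and two invented subclasses; it illustrates the idea and is not NAPALM's actual test framework.

```python
import inspect


class NetworkDriver:
    """Base class that only documents the expected API."""

    def get_bgp_neighbors(self):
        raise NotImplementedError

    def load_merge_candidate(self, filename=None, config=None):
        raise NotImplementedError


class GoodDriver(NetworkDriver):
    def get_bgp_neighbors(self):
        return {}

    def load_merge_candidate(self, filename=None, config=None):
        return None


class BadDriver(NetworkDriver):
    # The signature has drifted away from the base class.
    def load_merge_candidate(self, path):
        return None


def signature_mismatches(driver_cls, base_cls=NetworkDriver):
    """Return names of public methods whose signature differs from the base."""
    mismatches = []
    for name, base_method in inspect.getmembers(base_cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        implementation = getattr(driver_cls, name)
        if inspect.signature(implementation) != inspect.signature(base_method):
            mismatches.append(name)
    return mismatches
```

A test can then assert that `signature_mismatches(SomeDriver)` is empty for every driver, which keeps the driver classes themselves free of abstract classes or decorators.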
- Designing and building a consistent API for such a wide variety of hardware and software platforms is a daunting task. How do you determine the lowest common set of functionality that you are going to expose as part of the core library vs delegating to the underlying dependencies?
- [david] We don’t necessarily go with the lowest common denominator. Sometimes we try to emulate features. For example, if a platform doesn’t support atomic changes we might simulate it by trying to send the configuration as a block and rolling back immediately if there is an error. Obviously a feature like this is clearly documented so people are aware that this might happen. What we try to avoid though is implementing things that are very specific to a single vendor. In any case the way it has worked so far falls into two categories:
- Configuration management. These are primitives like loading a candidate configuration for merging or replacing into the device, getting a diff back, committing, discarding or rolling back configuration. These primitives were designed at the very beginning of the project based on the NETCONF protocol and they have changed very little since then. When a primitive is not natively supported by a device we try to emulate it, as with the atomicity example I gave before, or we don’t implement it at all if it’s not possible.
- The second category is what we call getters, which are methods that retrieve information from the devices. Things like interface counters, BGP neighbors, etc. These are basically community driven. Someone opens an issue on GitHub explaining the data that he or she needs, we discuss it, we define a model and then we work on it. Not all getters are supported on all platforms. People mostly implement them as they need them.
Now there is a third category though. It is actually funny, but I presented napalm for the first time a couple of years ago at NANOG64. It turns out that the day after, at the same venue, Google was presenting OpenConfig. OpenConfig is an effort to design a common set of models to operate the network. So, for example, they have models for BGP neighbors, for interfaces, VLANs, etc. Those models try to be vendor agnostic and you should, in theory, be able to use them to configure any device or to retrieve information from it consistently. The problem is that, of course, vendors are slow to implement them, they don’t even have plans for all of them or for all the platforms, etc. So the sad truth is that two years later support for OpenConfig is extremely limited. However, in the last few months I have been working on integrating napalm with OpenConfig, so now we have a beta version of napalm where you can use Python bindings that can translate native data from a device into an OpenConfig object and vice versa. That has two direct implications:
- Now we are not only operating all vendors with the same tool but we are also operating them with the same data structures. This means that I can get the configuration of a Cisco device and translate it directly to Junos configuration.
- It also means that because now we are dealing with objects, I can do smart things like having an object that represents the candidate configuration, another object that represents a certain running state, and simulate merges myself without having to rely on the device itself. I can even generate the exact commands to do the merge without having to rely on the device doing the actual merge. I can also simulate the changes offline. I don’t even need access to the device anymore; I could be building the objects from a backup or from the resulting configuration after merging different branches on GitHub.
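Simulating a merge offline can be sketched with plain nested dictionaries. This is a simplification under stated assumptions: real OpenConfig objects are generated from YANG models, whereas the `merge` and `changes` helpers and the data layout below are hypothetical and only show the idea of computing the result, and the list of changed leaves, without touching a device.

```python
import copy


def merge(running: dict, candidate: dict) -> dict:
    """Return a new dict with candidate recursively merged over running."""
    merged = copy.deepcopy(running)
    for key, value in candidate.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def changes(running: dict, merged: dict, path=()) -> list:
    """List (path, new_value) pairs for every leaf that differs after the merge."""
    result = []
    for key, value in merged.items():
        here = path + (key,)
        old = running.get(key)
        if isinstance(value, dict):
            result.extend(changes(old if isinstance(old, dict) else {}, value, here))
        elif old != value:
            result.append((here, value))
    return result


# Hypothetical state, e.g. rebuilt from a backup rather than read from a device.
running = {"interfaces": {"eth0": {"description": "uplink", "mtu": 1500}}}
candidate = {"interfaces": {"eth0": {"mtu": 9000}, "eth1": {"description": "server"}}}

merged = merge(running, candidate)
pending = changes(running, merged)
```

Each entry in `pending` is a path plus the new value, which is the information a driver would need to render the exact commands for the change.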
- I have seen a few posts recently discussing the use of NAPALM in conjunction with configuration management platforms such as SaltStack and Ansible. What are the tradeoffs of using the library directly vs integrated with these other tools?
- [david] napalm is a library in the strict sense. There is no business logic, no workflows, very little tooling embedded. Instead we try to implement as many primitives and be as flexible as possible so other tools can leverage on napalm to implement their workflows. What this means is that using napalm directly is great if you are writing a script to do backups or to solve a specific issue but if you want to build a whole framework for event driven automation or a configuration management system you are probably better off leveraging on napalm integration with salt/ansible/st2.
- I noticed in the documentation that merging configuration is supported. How do you manage conflicts and priority of nested data structures?
- [david] we try to make changes atomic. So if you make a change and trigger a conflict, or you are missing some data structure, or some configuration is invalid, the configuration won’t be applied and the user will get an error. For platforms where changes can’t be atomic we try to apply the configuration changes in bulk and revert immediately if there is an error.
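Emulating atomicity on a platform that applies configuration line by line can be sketched as "snapshot, apply in bulk, restore on error". The `NonAtomicDevice` class and its validation rule are invented for illustration; a real driver would talk to hardware and restore a saved configuration instead of a Python list.

```python
class ConfigError(Exception):
    """Raised when the toy device rejects a configuration line."""


class NonAtomicDevice:
    """Toy device that applies each line immediately, with no transactions."""

    def __init__(self, running):
        self.running = list(running)

    def send_line(self, line):
        # Hypothetical validation rule, standing in for a device-side syntax error.
        if line.startswith("invalid"):
            raise ConfigError("rejected: {}".format(line))
        self.running.append(line)


def apply_atomically(device, lines):
    """Apply all lines, or restore the previous config if any line fails."""
    snapshot = list(device.running)
    try:
        for line in lines:
            device.send_line(line)
    except ConfigError:
        device.running = snapshot  # revert immediately, then report the error
        raise
```

The caller still gets the error, as described above, but the device does not stay in a half-applied state.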
- How does declarative modeling of network devices differ from general purpose operating systems and what unique challenges do they pose?
- [david] lack of tooling like sed/awk/etc. Lots of state. Configuration is state itself and in most cases you can’t even reload it. Which means you have to type the exact commands to go from state a to state b. Like trying to configure the network stack of linux with only the iproute2 tooling available.
- What are the most technically challenging aspects of managing different network hardware programmatically?
- [david] Inconsistencies and buggy code. Not even inconsistencies across different platforms but across minor revisions of the same platforms. Small API changes that are not backwards compatible, small differences on output commands that break regular expressions and APIs that break every second call.
- What are some of the most interesting or unusual uses of NAPALM that you have seen?
- [david]
- I have seen people replacing their SNMP based monitoring system with napalm.
- I have built myself what you could call “immutable infrastructure for the network”. So for example, when you have to do a configuration change you don’t apply just that change. What you do instead is compile a full configuration for the device and fully reload the state of the device. That ensures you are always in a known state. So if a user connects to the device and makes a change outside the change control system, because you are fully deploying state you can be certain that the manual change will be wiped out. So there is no way out of the automation.
- We also have this validate functionality integrated into napalm. With this functionality you can define a desired state, for example that certain BGP neighbors have to be up and that I must be receiving N prefixes from them. Napalm can then read those rules, figure out which data to retrieve and validate that the data retrieved complies. I know some people are using this state validation instead of the traditional time series type of monitoring, where you keep retrieving data constantly and alert when you reach certain thresholds. I guess you could call this test driven monitoring?
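The "test driven monitoring" idea can be sketched as a function that checks retrieved state against declared rules. The rule format, the field names (`is_up`, `received_prefixes`, `min_received_prefixes`), and the report layout below are assumptions made for this example and do not reproduce NAPALM's actual validation file format.

```python
def validate(desired: dict, actual: dict) -> dict:
    """Compare retrieved BGP neighbor state against the desired-state rules."""
    failures = {}
    for peer, rules in desired.items():
        state = actual.get(peer)
        if state is None:
            failures[peer] = "neighbor not found"
        elif rules.get("is_up") and not state["is_up"]:
            failures[peer] = "neighbor is down"
        elif state["received_prefixes"] < rules.get("min_received_prefixes", 0):
            failures[peer] = "received {} prefixes, expected at least {}".format(
                state["received_prefixes"], rules["min_received_prefixes"]
            )
    return {"complies": not failures, "failures": failures}


# Hypothetical rules and retrieved data.
desired = {
    "10.0.0.2": {"is_up": True, "min_received_prefixes": 100},
    "10.0.0.3": {"is_up": True, "min_received_prefixes": 10},
}
actual = {
    "10.0.0.2": {"is_up": True, "received_prefixes": 250},
    "10.0.0.3": {"is_up": False, "received_prefixes": 0},
}

report = validate(desired, actual)
```

Instead of alerting on thresholds over a time series, an operator would alert whenever `report["complies"]` is false and read the reason from `report["failures"]`.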
- [mircea]
- SNMP thing
- For someone who is interested in learning more about network management, what resources do you recommend?
- [david]
- networktocode.com has some resources and labs; the Slack community behind the organization is very active as well.
- ipspace.net has some good resources as well.
- pynet.twb-tech.com is also another great place to check for courses
- O’Reilly has a book on Network Programmability and Automation which I haven’t read, but I know the authors are very good so I am confident the content will be of high quality.
- [mircea]
- I blog about NAPALM & generally networking and network automation on my personal space: mirceaulinic.net
- packetpushers.net
Keep In Touch
- David
- Mircea
- NAPALM
- @napalm_auto on Twitter
- Documentation
Picks
- Tobias
- David
- Mircea
Links
- Juniper
- Arista
- Paramiko
- netmiko
- Cisco IOS
- Vagrant
- Netconf Protocol
- BGP
- OSPF
- SNMP
- TCP
- IP
- ZTP (Zero Touch Provisioning)
- PXE (Preboot eXecution Environment) Boot
- SaltStack
- Ansible
- StackStorm
- Trigger
- NAPALM Logs
- OpenConfig
- NANOG
- YANG
- Data Plane
- NTP(network time protocol)
- SSH
- Networking Resources
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or experimenting with something that you hear about on the show. You can visit the site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes or Google Play Music. Tell your friends and coworkers and share it on social media.
Your host as usual is Tobias Macey. And today, I'm interviewing David Barroso and Mircea Ulinic about NAPALM, Network Automation and Programmability Abstraction Layer with Multivendor support, the library for managing programmable network devices. So, David, could you please start by introducing yourself?
[00:01:08] Unknown:
Sure. My name is David Barroso. I'm a network systems engineer at Fastly, where I spend my time automating networks and also trying to build networks where the application can actually decide how to route the packets themselves. And before that, I was at Spotify, which is actually where this project started.
[00:01:28] Unknown:
Hi. I'm Mircea. I'm also a network systems engineer, at Cloudflare, where I have the chance to automate one of the biggest global networks. We have a more traditional approach to networking, like 99% of the networks in this world. And, besides this, I'm a NAPALM user, contributor, and maintainer.
[00:01:49] Unknown:
And, David, can you start by, recounting how you first got introduced to Python?
[00:01:55] Unknown:
So, traditionally, I'm a network engineer, although nowadays I spend most of my time coding. Not all my time. But I think it was back in 2012. I was just a regular network engineer, and I was trying to get a hold of the inventory of my network, which was living inside just a wiki page, which meant that I couldn't actually query the information. Trying to find a device by platform or operating system or something was just a nightmare. So I was looking into moving all that data to some sort of database that I could actually query, and I just found Django, which seemed like a good fit for this. And that's pretty much how I started with Python.
[00:02:40] Unknown:
And, how about yourself?
[00:02:42] Unknown:
Yeah. My story is a bit different. I was previously a software developer, mostly oriented to the back end of websites. I began looking at Python because I had to work with CherryPy. It was back in 2007. I was with a group of friends, and we were looking at Python. We said, it's awesome. It looks very interesting, very readable, and so on. And one year later, I ordered from Amazon a book called Learning Python, written by Mark Lutz, a very well known author. It was only the 2nd edition. I still remember it had a label on it saying that it covers Python 2.3. You know? And we are speaking now about Python 3.6 and so on. After that, there were a few years when I didn't use Python at all, like 5 or 6 years, and then I started using it again.
[00:03:41] Unknown:
And so let's start by explaining what NAPALM is and the problem that you were solving when you first started working on it. So, David, I believe that you were the original author. Is that correct? Or the original coauthor, at least?
[00:03:57] Unknown:
Yeah. That's correct. So if you take a look at most networks out there, they're usually comprised of multiple types of devices, like firewalls, load balancers, routers, switches. They also run many different operating systems, depending on the vendor, depending on the platform. They may even have different versions of that operating system. And each one has a completely different way of interacting with those devices. One may have only a CLI. Another one may have a completely different CLI. Then some other may have a proprietary API or maybe use NETCONF or whatever. There are just many ways you can manipulate these devices.
So as you can imagine, if you're trying to do something as simple as figuring out which IP addresses you have configured on your network, that becomes a super hard and massive task. You have to start by figuring out which device you are trying to connect to. Is it a Juniper device? Is it a Cisco device? Is it an Arista device? Which version of the operating system is it running? Which API does it have available? So you have to start figuring out all this stuff, and it turns out that your code becomes this massive blob just trying to figure out that part. Not even trying to get the information. Just trying to figure out how to get that information.
So as I was trying to automate the network at Spotify, it quickly became apparent that this wasn't going to scale. So I started working first on building this abstraction layer with NAPALM. So when we were actually trying to solve a problem, we could focus on the problem, not on how to do that on a particular device.
[00:05:41] Unknown:
When you first started it, did you immediately decide to use Python, or did you play around with other languages for implementing it at first?
[00:05:49] Unknown:
So I decided to use Python for two reasons. The first one is because it's what I knew. As I mentioned, I was a traditional network engineer, so I wasn't really a developer. I mean, I could do, back then, some Perl, Bash, maybe some simple Java, but I wasn't really a developer. But I was a bit more proficient with Python, so it seemed like just the way to go. And the second reason is that some vendors had, by then, some libraries written in Python, so you could interact with those devices. Juniper had a library written by them. Arista had another one, and there were also Paramiko and Netmiko, which allowed you to interact with certain devices. So it felt like a natural fit. Just like, let's try to rely on those libraries and just have NAPALM build this abstraction on top of them instead of having to reinvent everything from scratch.
[00:06:56] Unknown:
And I imagine that there is also networking devices in your infrastructure that don't have any sort of programmability aspect to them. So does that pose a challenge when you're sort of architecting the overall network and trying to integrate the sort of Juniper and Arista switches and things like that with the more traditional just hardware based networking systems?
[00:07:16] Unknown:
Yeah. So there is a lot of equipment out there, especially Cisco IOS, which is notorious for that, where the only thing you have is SSH and just plain text, which means that you end up with a lot of regular expressions everywhere. But the problem is not that you have to deal with regular expressions. The problem is that they may have a minor release where they are supposedly just patching some minor bug, and it turns out that some output may change as well, and all of a sudden all your regular expressions just broke. So, yeah, NAPALM has a lot of code just trying to manage those differences, trying to figure out, oh, is this 16.5.1 or is it 16.5.2, and trying to pick the right regular expression. It's just a massive endeavor doing that.
[00:08:11] Unknown:
Yeah. Someone made a poll. And speaking about numbers, it turns out that only 15 up to 20% of the network devices in this world would actually have an API. Not only because vendors don't support it, but, historically, some didn't have one at all. They have it only in newer versions, and there's a problem in the network world that sometimes you can't upgrade your device to the new version that has the API, or it is not stable enough to make the business decision to upgrade to the new version that has the API, and so on. But this doesn't stop us. Basically, what you have to do is start a process that connects via SSH, and you issue commands like you would do when you log in on the terminal and type a command. And you read that output and parse it and present it to the user in a structured format.
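Picking a regular expression by software version and turning CLI text into structured data can be sketched as follows. The two output formats and the patterns are invented for illustration (only the version numbers 16.5.1 and 16.5.2 come from the conversation); real device output is considerably messier.

```python
import re

# Hypothetical outputs: the same command prints a different layout in two releases.
OUTPUT_OLD = "Neighbor 10.0.0.2 state Established\nNeighbor 10.0.0.3 state Idle\n"
OUTPUT_NEW = "10.0.0.2 is Established\n10.0.0.3 is Idle\n"

# One pattern per release, each exposing the same named groups.
PATTERNS = {
    "16.5.1": re.compile(r"^Neighbor (?P<peer>\S+) state (?P<state>\S+)$", re.MULTILINE),
    "16.5.2": re.compile(r"^(?P<peer>\S+) is (?P<state>\S+)$", re.MULTILINE),
}


def parse_neighbors(version: str, output: str) -> dict:
    """Return the same structure regardless of which release produced the text."""
    pattern = PATTERNS[version]
    return {
        match.group("peer"): {"is_up": match.group("state") == "Established"}
        for match in pattern.finditer(output)
    }
```

Both `parse_neighbors("16.5.1", OUTPUT_OLD)` and `parse_neighbors("16.5.2", OUTPUT_NEW)` produce the same dictionary, so callers never see the formatting difference.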
[00:09:14] Unknown:
And one of the challenges of working with the networking layer is that basically everything on top of it relies on the fact that the network is operating properly and isn't dropping packets or, you know, is just generally available. So I imagine that there's a bit of risk involved with trying to manipulate it programmatically in the event that you have some sort of bug in your scripting. So what are some of the ways that you try to mitigate that problem when you're working with NAPALM?
[00:09:41] Unknown:
So NAPALM has a couple of mechanisms to minimize problems like the one that you were describing. The first one is that you can create what we call a candidate configuration, and then you can ask NAPALM to give you a diff between the actual configuration running on the device and your candidate configuration. So you can actually introduce that as part of your change control workflow. A fellow engineer could just review that change and see, like, okay, yeah, this makes sense, this is not going to break anything. That's just like doing a pull request if you are a developer, right? Another mechanism is that if you detect that you broke something and you still have access to the device, we implement a quick rollback. So you can just tell NAPALM to roll back, and you will go to the previous known state. So that's a nice way of just trying to fix a problem. And then we are now integrating as well with a feature that some vendors have. Some of them call it commit confirm, but it's basically a sort of automatic rollback. So when you do a change on the network, the device will expect you to confirm the change within a period of time. If that period of time expires, the device will roll back itself. It will assume that you lost connectivity to it, that something is broken, so it has to roll back. So those are the mechanisms that we have within NAPALM to try to minimize or solve problems as easily as possible, as fast as possible.
[00:11:23] Unknown:
And for the built in rollback capability, is that just caching the diff from before you apply the configuration change so that it knows how to revert it back to the previous state?
[00:11:33] Unknown:
So that depends on the platform. Some platforms, for example, support that feature themselves: every time that you do a change, it actually saves a checkpoint. So if that is supported natively, we will just tell the device, okay, go to the previous state. But if that's not supported, we will just save a local copy on the device. And then it's just a matter of reloading that configuration that we saved before actually committing the new one.
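The checkpoint and the confirm-within-a-deadline behavior can be sketched together with a toy device. The `CommitConfirmDevice` class is hypothetical and takes the current time as an explicit argument so the example stays deterministic; on real hardware the timer runs on the device itself.

```python
class CommitConfirmDevice:
    """Toy device that rolls itself back unless a commit is confirmed in time."""

    def __init__(self, running):
        self.running = running
        self._checkpoint = None
        self._deadline = None

    def commit(self, new_config, now, confirm_within=60):
        # Save a checkpoint before applying, and start the confirmation timer.
        self._checkpoint = self.running
        self.running = new_config
        self._deadline = now + confirm_within

    def tick(self, now):
        """Advance the device clock; roll back if the deadline has passed."""
        if self._deadline is not None and now >= self._deadline:
            self.running = self._checkpoint
            self._checkpoint = None
            self._deadline = None

    def confirm(self, now):
        """Confirm the pending commit; returns False if it was already rolled back."""
        self.tick(now)
        if self._deadline is None:
            return False
        self._checkpoint = None
        self._deadline = None
        return True
```

If the operator loses connectivity after `commit`, no `confirm` call ever arrives and the next `tick` past the deadline restores the checkpoint.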
[00:12:03] Unknown:
And for testing the NAPALM library, I imagine that because of the reliance on the hardware platforms and the operating systems that they're running, it introduces a lot of complexity in terms of being able to accurately test everything at the unit level. So there must be a lot of integration testing involved as well. So I'm wondering what your overall test architecture looks like to be able to ensure that changes that you make don't introduce regressions.
[00:12:29] Unknown:
Yeah. So we do two different things here. If you're changing code that modifies configuration on the device, that's a bit more complex, because you want to do proper integration tests against a real device. You probably don't want to mock that because it's kind of too critical. So when we do changes to code that manipulates configuration on the devices, we use Vagrant to spin up VMs, and then we have a whole suite of configurations where we know how they look. So we know what diff we are supposed to get before committing the change and after committing the change. But it's actually manual testing, because we have everything on Travis right now, and Travis doesn't really work that well when you want to test many VMs with Vagrant and stuff like that. So, yeah, that's the configuration management part of the project. When we are dealing with retrieving data, that's actually easier to test because then you can just mock the device. Like, I know that if I type this command or if I do this API call, I'm going to get this response, this text or this JSON blob or XML, whatever. So in that case, we just have a battery of test cases. Like, okay, this is IOS 13. This is IOS 12.2. This is that. This is when you have this particular case where you don't have any BGP neighbors. So we just keep building cases and making sure that all the parsing, regular expressions, everything just works.
[00:14:08] Unknown:
And so one of the things that I was thinking about too as I was preparing for the show is wondering how working with NAPALM relates to working with purely software defined networking. Because I know that with the NAPALM library, you're interfacing with the physical switches that have their software layer exposed via the APIs. But in systems such as OpenStack, where you have entirely virtual network infrastructure as far as the interconnect between the virtual machines, is that something that NAPALM would be able to manage as well?
[00:14:38] Unknown:
So, I mean, what NAPALM does is mostly interact with the configuration of the network, right? So if you have your OpenStack infrastructure and you need to provision, I don't know, a VXLAN or a VLAN or a route or something, you could actually use NAPALM to interact with the hardware to say, like, okay, you have to configure this VLAN because I provisioned this new service. And I actually heard some people talking about using NAPALM within the OpenStack context, but I'm not entirely certain about that.
[00:15:13] Unknown:
And so as you mentioned earlier, napalm sort of abstracts the different underlying platform dependent libraries. So I'm wondering how you architected it in order to be able to easily integrate new classes of network devices and new dependent libraries while still maintaining a consistent API across the different devices?
[00:15:33] Unknown:
The approach is quite simple and probably couldn't be Pythonic. The idea is that we have a base class that I think is called right now a network driver or something like that. And that base class just defines the API. It has a set of methods. It has documentation saying, like, this is how the method is supposed to behave, this is what's supposed to happen, and this is the data that it's supposed to return. So that sort of acts as an interface, but Python doesn't have interfaces. So what we do is, for each particular device out there, we just implement a class that inherits from that base class. And then in the testing framework, we start comparing signatures. Like, for example, this method that you implemented here is supposed to match the signature of your parent class, and the output is supposed to match what the parent class would be outputting as well. So we try to enforce this interface somehow in the testing framework instead of in the Python layer, because Python doesn't really have any sort of interface feature.
We decided to go down this path instead of using, I don't know, decorators or metaclasses or abstract classes or something like that because we figured that the user base of this library wouldn't be very proficient in Python. So we wanted to make it as simple as possible, so we could actually have network engineers that might have a lower level of Python proficiency implementing new things and being able to contribute as well.
[00:17:18] Unknown:
And when you're designing the overall API and trying to figure out which methods you're going to expose as the common denominator across all the different types of devices. How did you go about determining what those methods and type signatures were going to look like in order to be able to have the broadest set of compatibility across the types of devices that you're targeting?
[00:17:40] Unknown:
So there are two parts in here as well. One is the configuration management part, and for that one, we tried to emulate what the NETCONF protocol specifies. So the NETCONF protocol is a well known standard in the network community, and it has a set of primitives, like, I don't know, a candidate configuration, a running configuration. It has certain actions that you can do. You can load the candidate configuration. You can start doing a transaction and commit the change. So for the configuration management bit, we try to emulate that part, which means that it's been quite stable since the very beginning. It was designed at the beginning, it was implemented, and we haven't changed it that much since then. And then the second part is what we call getters. Getters are just methods that connect to the device and retrieve information from the devices consistently. So you could say, like, okay, give me the BGP neighbors, and you would get the same data regardless of the vendor. So it just parses the data and structures it in a known format. And that's pretty much community driven. We had a set of methods that we were using.
And then as people start using the library, they might open an issue on GitHub saying, like, oh, yeah, I see that you're getting the BGP neighbors, but I would like to have the OSPF neighbors as well. Then we start just discussing how the data would look so it's as generic as possible, and start implementing the methods as they are actually needed. Like, we don't really implement all the methods for all the platforms. We just implement the methods where someone actually needs them.
[00:19:14] Unknown:
And the underlying libraries, I'm assuming, expose their data in different formats. So I'm assuming that you have to do some measure of data manipulation in order to be able to present it in a consistent manner across all the different devices that somebody might be querying.
[00:19:40] Unknown:
Yeah. That's correct. That's actually the main bulk of the code, just trying to get the text that a device might give you via the CLI and parsing it and returning a dictionary, or just converting a dictionary into a different dictionary, or converting this XML object that you got into whatever the method is supposed to return. So, yeah, there is a lot of data manipulation.
[00:20:05] Unknown:
Yeah. It actually depends very much on the driver, on the platform capabilities, and on what kind of API you are able to use, or whether you have an API at all. For example, if you try to automate a Juniper device, they have an API via a standardized protocol called NETCONF, which runs on top of SSH and provides XML documents. Then you can extract your data from there. But you may not have an API. And, as I said previously, you then need to spawn an SSH process that connects to the device and emulates your command line, and you extract the simple text, which you have to parse. So it depends very much on the platform you are trying to automate. And the underlying library in general is used only as transport.
[00:21:00] Unknown:
And one thing that I was just thinking about, particularly in larger networks such as the types that I'm sure you're working with, Mircea, and David, I'm assuming you do as well, is discovery of the devices that you're going to be working with. So do you just keep a database of all the different locations for the devices, or do you have some measure of discovery using network protocols such as ARP or whatever to figure out the types of devices that you're working with and where they're located within your network?
[00:21:28] Unknown:
Yeah. We actually have a database. In our case, it's pretty simple, even though the network is quite huge now. But we are growing, and we add new PoPs at a pace of about one PoP per week. We started from scratch; from the beginning it has been designed with automation in mind. And to add a new PoP, we only need to update this database. To make sure that everyone understands: by PoP I mean a point of presence, that is, a new colocation in a data center where we have network devices. But you can also use SNMP to detect what kind of device you have. You only need to provide the IP address where it should go and collect the details. And you can, for example, automatically update your database, or simply have it tell you back on the CLI: I have discovered these devices at the IP addresses you gave me, and they are, I don't know, Juniper, Arista, and so on.
[00:22:44] Unknown:
Yeah. We have a somewhat similar approach. We have a database where we have all the information, and we treat the database as the source of truth. So the device has to exist in the database even before it can be deployed on the network, because operators don't have any means to actually connect to devices and configure them. It has to be done via automation. So it's kind of a documentation-first approach in our case.
[00:23:11] Unknown:
And so when you're storing the information in the database and then the device actually does get connected to the network, do you have some sort of events that will trigger a run of a set of NAPALM commands to automatically provision the desired state for that device once it comes online?
[00:23:28] Unknown:
So there are two parts here. When you provision a new device, we use what we call ZTP, zero-touch provisioning. That's just a way of building the initial configuration dynamically, similar to PXE boot but for the network. That will give you just the credentials, the management IP, and all that stuff. Then it's just a matter of the operator doing an Ansible run, which uses NAPALM to build all the configuration and all the software packages that have to be deployed on the device, and deploys everything.
[00:24:07] Unknown:
And so that brings us to talking about integrating napalm with configuration management tooling, such as, as you mentioned, Ansible and SaltStack. And I'm curious as to the amount of support in other tools such as Chef and Puppet.
[00:24:21] Unknown:
So for Ansible, we developed the modules ourselves. Ansible has its own networking modules, although they take a completely different approach, with specific modules for each specific platform. So with Ansible, if you want to do something with the native modules on, for example, Cisco IOS, you have to use the IOS module that does that. If you want to do something similar for, I don't know, Arista, you have to take the specific Arista module that does that. So they have their own thing, but we support the NAPALM Ansible modules, which you have to install as third-party modules. Salt has native modules, which were actually developed by Mircea, so he can probably elaborate on that. I know that StackStorm has integrated modules for NAPALM as well. But I honestly don't know about Puppet and Chef.
[00:25:22] Unknown:
Me neither. I didn't hear anything, although it's definitely possible, and I expect it shouldn't be extremely hard. But I heard something else about a different tool called Trigger.
[00:25:36] Unknown:
Alright. Yeah. Yeah. Yeah. Yeah.
[00:25:39] Unknown:
About Salt: as David said, in Ansible there are also a couple of other modules provided by network vendors or third parties. But NAPALM is a very special case because it's independent and community driven, and you can do anything you want. In Salt, it's already embedded in the core. So, basically, when you install Salt you have all the network automation capabilities embedded, and even more: event-driven network automation and orchestration. The list of features is quite big now. In the next release, Nitrogen, which will happen this month, there will be even more features, and they are not strictly related to configuration management.
There is also the getters part that David presented already, where you are able to retrieve data. But Salt comes from a different perspective. If we zoom out a bit and look from 4,000 feet, you'll see that it is much more than configuration management. It's triggered configuration management, which can be triggered when your interfaces go down, when your BGP neighbors start leaking routes, and so on. That can be detected either using those getters or from other events in the network. The network world is a very special case because it's not only about configuration management.
It's often misunderstood that configuration management is everything. It is vital and extremely important, of course, to have your network consistent. But there are a lot of things happening around it that you definitely want to monitor and act on automatically. And I will introduce here the new kid of the NAPALM suite, called napalm-logs. It's still in beta release, but we are working on it now. It collects all these events from different network devices and presents them in a vendor-agnostic form, more specifically structured following public standards like OpenConfig or IETF.
And you can ingest those events in a tool like Salt or StackStorm and so on, and trigger configuration changes.
[00:28:23] Unknown:
Now that Mircea mentioned OpenConfig, I would like to talk about OpenConfig for a second. It's actually funny because I was presenting, I think it was in 2015, at NANOG 64. I was presenting NAPALM there, and the day after, Google presented another effort that they were doing in parallel that was actually similar to NAPALM, which is called OpenConfig. OpenConfig uses YANG, which is a data modeling language, to define a common set of models that are supposed to be vendor agnostic and useful both to retrieve information from the network and to configure the network without vendor-specific stuff in there. The only problem with OpenConfig is that the OpenConfig group is way faster than vendors are. So it turns out that two years later, there isn't that much OpenConfig support from the vendor side yet. There are very few models supported by very few platforms. So we recently started working, as part of NAPALM, on supporting OpenConfig models within the NAPALM context. So you can use NAPALM to retrieve information from the devices, native information for example, and map it into an OpenConfig object.
So you have the NAPALM layer doing this translation between native configuration and well-known models that are supposed to be vendor agnostic. The cool thing about this is that you get not only an abstraction of how to operate the network, but also an abstraction of how the data is supposed to work across different vendors. So you can get the configuration from, let's say, Cisco and translate it into native configuration that Juniper or Arista may understand. That's quite powerful, as you can start replacing devices without that much complexity.
[00:30:25] Unknown:
Yeah. I imagine that would be an interesting use case: to have one network switch or piece of network hardware that you want to replicate, either in a new data center or for redundancy within your existing network, and be able to just pull the configuration from one and load it onto the other, irrespective of what the actual underlying hardware and software systems are. Yeah. That's super interesting.
[00:30:50] Unknown:
And the cool thing is that because you are now mapping this native configuration, which is just text in most cases, into an object, you can start doing other cool things. You can, for example, simulate changes. Like, okay, let's imagine that I have this configuration, which might not be living anywhere. Right? Maybe it's just something that you handcrafted because you wanted to test something. So let's imagine that I have this blob of configuration. What would happen if I applied this other blob of configuration on top of it?
Or you can start doing actual simulations based on backups of your configuration. You don't even have to have a live device to do the actual diff of the configuration. You can just grab a configuration backup file or or whatever and start just, yeah, simulating changes.
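A rough illustration of working from a backup file rather than a live device. This is a plain text diff using the standard library, not the object-based simulation described above, and `config_diff` is a hypothetical helper name.

```python
import difflib


def config_diff(current, candidate):
    """Return a unified diff between two configuration blobs given as strings."""
    diff = difflib.unified_diff(
        current.splitlines(),
        candidate.splitlines(),
        fromfile="backup",
        tofile="candidate",
        lineterm="",
    )
    return "\n".join(diff)
```

Both arguments can come from files on disk, so no device connection is involved; an empty string means the two blobs are identical.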
[00:31:44] Unknown:
And one thing that I'm curious about is whether there's anything that you're losing by using NAPALM as an abstraction layer on top of the underlying hardware. I'm sure the different vendors are trying to expose certain features that are unique to them, to differentiate themselves and drive people to buy their hardware versus somebody else's. So is there a trade-off when you're using NAPALM versus trying to take advantage of whatever special capabilities the underlying piece of network equipment has?
[00:32:17] Unknown:
So NAPALM mostly deals with the configuration management plane, which means that you don't really lose any features. Well, you might lose tiny features. For example, on one platform you can add a comment in this block of configuration, and you cannot do that on another platform. You may lose small features like that, but the actual value of the device is going to be on the data plane. Like, I support this protocol, and this other platform doesn't. And because NAPALM deals with the configuration management plane, you're not going to be losing those features. Right? Because it's a completely different plane. You're still going to be able to configure that feature on the device with NAPALM.
[00:33:04] Unknown:
And going back briefly to the configuration management question, are there any trade-offs involved when you're using NAPALM embedded in one of those tools versus just using it directly, either via a script or using its command line capabilities?
[00:33:20] Unknown:
So all the features that NAPALM supports are actually implemented in the modules. I did it for Ansible, and I think it's the same for Salt. Right, Mircea? Yeah. So all the features are actually implemented, and it's just a matter of what you're trying to do. If you just need a simple script to retrieve your LLDP neighbors, for example, you may as well write it yourself, because it's going to be, I don't know, 10 lines of Python. But if you want to do something more complex, like event-driven automation, just use StackStorm or Salt or whatever. Don't try to reinvent the wheel. It just depends on your use case.
[00:34:05] Unknown:
I would say it actually has many more benefits if you use a tool like that on top of NAPALM, because you ultimately write some templates to generate your configurations. Right? And in a tool like Ansible, for example, you can use the embedded filters, the filters that are already available for Jinja. Or, I don't know, in Salt you can take abstraction to a whole new level, because you can use many other variables in the template, you can reuse thousands of functions inside the template, and so on. So I would say you have many benefits doing this, and you don't lose anything.
[00:34:46] Unknown:
Yeah. Nowadays, most people are using NAPALM via some framework like Ansible, Salt, or StackStorm. I don't think there are that many users using it directly. It can be done, but you get a lot of goodies if you just use a framework.
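For reference, the short LLDP script mentioned a moment ago, using NAPALM directly, might look something like this sketch. The `get_network_driver`, `open`, `get_lldp_neighbors`, and `close` calls are NAPALM's; the platform name, address, and credentials are placeholders, and `summarize_lldp` is a hypothetical helper.

```python
def summarize_lldp(device):
    """Return sorted (local_port, neighbor_hostname, neighbor_port) tuples.

    `device` is any object whose get_lldp_neighbors() returns the shape
    NAPALM documents: {local_port: [{"hostname": ..., "port": ...}, ...]}.
    """
    rows = []
    for local_port, neighbors in device.get_lldp_neighbors().items():
        for neighbor in neighbors:
            rows.append((local_port, neighbor["hostname"], neighbor["port"]))
    return sorted(rows)


def main():
    # Imported here so the helper above works without NAPALM installed.
    from napalm import get_network_driver

    driver = get_network_driver("eos")  # example platform name
    device = driver("192.0.2.10", "admin", "secret")  # placeholder values
    device.open()
    try:
        for row in summarize_lldp(device):
            print(*row)
    finally:
        device.close()
```

Calling `main()` requires NAPALM and a reachable device; swapping the platform name is the only change needed to point the same script at a different vendor.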
[00:35:04] Unknown:
Yeah. I use Salt pretty regularly on a day-to-day basis, so that's where my preference lies. So this is a bit of a leading question, but particularly with SaltStack, I'm assuming that the majority of the interaction that you're doing with the network devices is via the proxy minion capabilities, or is it also operating directly against the network switches via the APIs?
[00:35:25] Unknown:
Yes. At the moment, only through the Salt proxy feature, but for the next release, Oxygen, we are thinking of also having a roster, so as not to keep the proxy process always running. There are some environments where you need to push a configuration change only around once per week or per month, and it's a bit pointless to have a proxy process always running. The roster sits on top of the salt-ssh subsystem, which is basically the equivalent of Ansible, and you can leverage automation like that. But even so, I would say there are very good ways to start the proxy process only when necessary. You only need, in Salt language, a Salt state that starts the process, enables it, applies the configuration changes, and then stops it. So it's about four more lines of SLS.
[00:36:23] Unknown:
And when I was going through the documentation, I noticed that it's possible to merge the configuration that exists on the device with the desired configuration that you're applying. Having worked with Git a fair amount, I'm pretty familiar with the dreaded merge conflicts. So I'm wondering how you manage those and how you prioritize the application of the nested data structures that are going to be exposed. Do you just completely override them in one direction or the other? Or do you merge them together? Or is there some sort of tunable parameter for how that's managed?
[00:36:58] Unknown:
So when you're merging configuration with NAPALM, you are actually merging one state into another. There are no intermediate states, so there is no branching or anything. So there cannot be conflicts like you may have with Git when you start merging different branches and you have conflicting changes, because you're always going from A to B. If you want to merge different branches, what you would do is manage your configuration files with Git directly, and you would solve all the problems there. NAPALM is then just the delivery mechanism of the resulting configuration into the device.
[00:37:35] Unknown:
Yeah. I would add one more sentence here. This assumption is based on the fact that you are able to replace parts of the configuration. But there are some devices in this world that are not able to replace chunks of configuration. In that case, there are also good ways to do it. For example, you can retrieve what is configured now, then determine in your tool what needs to be added and what needs to be removed.
[00:38:09] Unknown:
Yeah. With the NAPALM framework that I mentioned before, which integrates these OpenConfig models, we can actually go one step further. As Mircea was mentioning, we rely nowadays on the devices themselves doing that merge operation. But because we are operating with actual objects when we're using the napalm-yang framework, we can compute the delta ourselves, and we can figure out the exact commands that you have to run to go from state A to state B. With that, we'll get merge support consistently across all the platforms. But as I mentioned before, it is always going to be going from one state to another without branching or anything. If you want to do this branching, you will have to solve it beforehand, using Git or something like that, because we didn't think it was worth implementing something as complex as this when there are already tools able to do that.
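The "retrieve what is configured, then work out what to add and remove" idea from the last two answers can be sketched with plain sets. This is a simplification that treats configuration as flat, unordered lines; it is not how napalm-yang computes its delta, and the `no ` removal prefix is an assumption that varies by vendor.

```python
def config_delta(current, desired):
    """Return (to_add, to_remove) as sorted lists of configuration lines."""
    current_set, desired_set = set(current), set(desired)
    to_add = sorted(desired_set - current_set)
    to_remove = sorted(current_set - desired_set)
    return to_add, to_remove


def render_commands(to_add, to_remove, negate="no "):
    """Turn a delta into commands: removals first, then additions."""
    # The "no " prefix is an assumption; removal syntax differs per vendor.
    return [negate + line for line in to_remove] + list(to_add)
```

Because the input is always "current state" and "desired state", there is exactly one delta and nothing to conflict with, which matches the A-to-B model described above.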
[00:39:09] Unknown:
And when you're working with napalm, it seems like a fair bit of what you're doing is somewhat declarative as far as applying the desired state of the device as opposed to doing it in a procedural manner. So I'm wondering what are some of the difficulties? How does that differ from managing, a general purpose operating system in a declarative model similar to what configuration management exposes?
[00:39:33] Unknown:
Well, I mean, I wish that was actually true because, unfortunately, it's not entirely true. When you're managing most platforms, what you actually have is a command line that you use to instruct the device to go to a desired state. For example, let's imagine that you are configuring the network stack of a Linux machine. What you would do is edit a bunch of files, and then restart the network service. Right? But imagine that you don't have this network service. What you have instead is just the iproute tooling. You only have the ip command. So, I want to check which IP address I have. I have this one. So now I have to remove this one and add this other one. And my routing table looks like this, so I have to remove this route and change this other one. Unfortunately, most platforms behave like that. You don't really have this declarative model. What you have is a CLI to instruct the device how to get there. So that's actually the main challenge with automation in the networking world.
Fortunately, some vendors are now starting to provide this declarative model, where you can actually send the desired state. Like, this is the config I want. Just apply it. In those cases, things are getting way, way better. But you still have the problem that most of those platforms still behave as a single service. So you cannot just reconfigure the NTP service, for example, or just the BGP neighbors. You have to say, this is my whole configuration. Reload all the services. A lot of people cannot do that because they don't have an inventory with all their parameters somewhere, so they cannot rebuild this full configuration and just reapply it. So I would say that's the actual challenge.
[00:41:24] Unknown:
If I can mention something: someone asked me at some point, "I want to have this configuration on the device, and I want to run, let's say, a state or some tool that updates the configuration. But I don't want to put this data in the source of truth. How can I do it?" Well, that's an unsolvable challenge. Put it in the source of truth.
[00:41:44] Unknown:
Yeah. So in the past, when I've been going from a non-automated environment to an automated environment, I would end up doing a lot of parsing of the configuration. You connect to the device, you parse the configuration, you try to extract the data that you're not automating yet so you can reapply it, and then you automate the bits that you are already automating. You just keep moving from non-automated blocks to automated blocks gradually until you manage to automate everything. But it's certainly a big task because you cannot really automate the network service by service. You have to either automate everything or do a lot of complex state management yourself.
[00:42:31] Unknown:
And we've covered a bit of the technical difficulties of managing these different classes of device. So I'm wondering if there are any other challenging aspects of working with network hardware in a programmatic fashion.
[00:42:44] Unknown:
Yeah. I would say it's mostly inconsistencies and buggy code, but I would leave this one to Mircea. I mean, he loves this one.
[00:42:54] Unknown:
Yeah. A lot. There are many, many challenges. I find it fun at some point, but afterwards it gets ridiculous. There is inconsistency in the way devices present the operational data, but also at the configuration level. Somehow this is natural: each vendor comes with its own methodologies and its own syntax. Otherwise, someone will go and apply the law and tell you that you have to pay some money because you used their syntax. Right? Beyond this, the APIs are buggy most of the time, or sometimes simply don't work, or are not consistent even with themselves. For example, Cisco has many, many platforms, each one having its own obscure versioning system. For example, IOS XR has an API called the XML API. On IOS XE, you have something else. On IOS, you don't have anything.
So all of these are manufactured by Cisco, and this is only about Cisco. Think also about the differences from one vendor to another. These APIs simply don't work sometimes. I have an issue right now where I simply cannot start the process for the API at all. This has been ongoing for one month. I have rebooted the system, but it still doesn't help. These are the kind of challenges we are facing, and at some point we may decide that screen scraping, although it sounds, and is, very bad and a very poor way to automate, may be more reliable than waiting for a vendor to tell you what to do with the API, which you can't even start.
[00:44:50] Unknown:
And it's not only between different platforms. Some platforms may have different data structures between minor versions. Or even the same platform might return a different data structure for the same API call depending on certain cases. For example, you have your BGP neighbors, and you want to retrieve them. If you have only one BGP neighbor, you will get just a dictionary, while if you have multiple BGP neighbors, you will get a list of BGP neighbors. So you need to start accounting for those things in your code. Like, okay, am I dealing with a list or with a dictionary? Because otherwise you are just going to break your code.
[00:45:32] Unknown:
It's even on the same platform, same version, same box. If you configure, for example, one NTP peer, it will give you back a dictionary, let's say in JSON format, which becomes a Python dictionary. But if you have two or more NTP peers, it will be a list.
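The defensive code that this dictionary-or-list behavior forces on a driver can be as small as the following sketch; `as_list` is a hypothetical helper name.

```python
def as_list(value):
    """Normalize an API field that may be missing, a single item, or a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]
```

Wrapping every such field once, at the point where the reply is read, lets the rest of the code always iterate over a list regardless of how many peers or neighbors the device happened to have.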
[00:45:50] Unknown:
And so given the fact that these APIs are so poorly maintained and poorly implemented, it seems that they're sort of a second class consideration from the networking vendors. So if the APIs aren't accessible or consistent, then what is their intended mechanism for you to be able to interface with and operate on the devices?
[00:46:11] Unknown:
I want to start by clarifying that this is not the case on all the platforms. There are some vendors, Juniper being the notable one here, that are really, really good with APIs. It's a first-class citizen. But the thing is that most vendors didn't care about automation before because their customers weren't asking for it. So SSH is everything they have, to be honest. I hope that things are starting to change now because actual operators are starting to demand automation.
But right now, I mean, they will tell you, like, yeah, this is what it is, just SSH. And then you just have to fall back to screen scraping again.
[00:46:57] Unknown:
And given the fact that napalm is integrated into a few different pieces of tooling and frameworks and seems to be fairly popular. I'm sure that there are some interesting or unusual, instances of napalm being used in the wild that you've heard about. So I'm wondering if you can both share a bit of, what you've seen that you thought was particularly noteworthy.
[00:47:18] Unknown:
Do you want to start, Mircea?
[00:47:20] Unknown:
Yeah. I will take some examples, starting with us at Cloudflare. We are using NAPALM, for example, to monitor the backbone of the Internet. We have a need to see how reliable our IP transit providers are and how much we can trust them to forward our packets. To bring more context, we don't have our own fiber or means of transport. We just pay some guys to put a fiber in our devices, and from there on they carry these packets. But very often they have packet loss, sometimes more than 20% packet loss. In that case, we need to reduce the traffic and send our data through different guys. We're using NAPALM in the very first place to retrieve the performance data and to monitor how much packet loss they have. We would basically use SNMP, but there's a different problem there. Not only does each vendor have its own SNMP implementation, but some present the data in a very poor way. For example, on Cisco IOS XR, if you specify a tag name for a probe that is longer than 13 characters, when you try to retrieve it via SNMP, it won't give you anything. And there are many other stupid things. Moving on: when monitoring this, as I said, you will reroute the traffic.
That triggers a lot of things. It means applying the configuration change on the device with NAPALM. And after this configuration change is applied, there are a couple more things: updating our internal tools, posting to our internal chats, updating a very nice map of the points of presence where we rerouted the traffic, updating our communication with customers on our status page, where we say that traffic is rerouted, sending emails, and so on. Although I'm not speaking strictly about the main NAPALM, I'm going to mention again that new kid, napalm-logs. We're starting to use it to listen to the events in our network and what's going on, and we act immediately and reliably on these events, without having any impact on our network or customers.
[00:49:55] Unknown:
I've also seen some people replacing their SNMP-based monitoring systems with the getters that we have. So instead of checking the interfaces with SNMP, they just retrieve the data with the getters. We also have this validate functionality built into NAPALM that lets you define desired state. For example, you could say: I want to make sure that I have this BGP neighbor configured, that it is up, and that it's sending me a certain amount of routes. You can define that in a YAML file with a very simple DSL to define the rules. And then NAPALM figures out how to retrieve that information and makes sure that it's in there. If a rule is not passing, it gives you a report saying, okay, there is a compliance issue, you're failing on this. I know that some people are working on replacing their time-series alerting systems with this functionality, which you could call something like test-driven monitoring, I don't know, because you're actually testing the network all the time, making sure that it is in certain states. I have also built what I like to call immutable infrastructure for the network, which means that every time I make a change on the network, regardless of how small the change is, it could be just a change of an interface description, for example, what I do is rebuild the entire configuration, send it to the device, and tell the device: this is what I want. If there is something else, just get rid of it.
With this, what you can ensure is that you are always in a known state. If an operator connected to the device and made a manual change, because you're rebuilding the state from scratch, you would get rid of that unknown state that the operator introduced.
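To make the desired-state check described above concrete, here is a small stand-alone sketch of comparing expected values against retrieved data. It only illustrates the idea; it is not NAPALM's validation code or its rule syntax, and `check_state` and the report layout are assumptions.

```python
def check_state(desired, actual):
    """Check that every key in `desired` has the same value in `actual`.

    Extra keys in `actual` are ignored; nested dictionaries are compared
    recursively. Returns {"complies": bool, "failures": {...}}.
    """
    report = {"complies": True, "failures": {}}
    for key, expected in desired.items():
        found = actual.get(key)
        if isinstance(expected, dict) and isinstance(found, dict):
            nested = check_state(expected, found)
            if not nested["complies"]:
                report["complies"] = False
                report["failures"][key] = nested["failures"]
        elif found != expected:
            report["complies"] = False
            report["failures"][key] = {"expected": expected, "actual": found}
    return report
```

In this sketch `desired` would be loaded from a rules file and `actual` from a getter, so the same rules can run repeatedly as a monitoring check.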
[00:52:09] Unknown:
Yeah. Since David was talking about the configuration, I can give one more detail, something very interesting. Our device configuration is imperative, meaning that exactly what we have in the source of truth has to be reflected on the device. But the configuration has diverged quite a lot over the years, and there are still some parts of the configuration that are not there yet, that are not yet prepared to have configuration changes applied automatically, live. Because of this, we're using the NAPALM functionality to retrieve the diff, the configuration difference. At specific intervals we run a tool that determines the diff between what we have in the source of truth and what we actually have on the device, and sends an email reminding you that, okay, this is not correct. You need either to update your source of truth or to make your configuration on the device consistent.
[00:53:24] Unknown:
And so we've been using a lot of different networking acronyms and talking about various aspects of doing network automation and managing types of networks. So for somebody who's interested in learning more about the fundamentals of network management and network operations, what are some of the resources that you would recommend that they look to?
[00:53:43] Unknown:
So, for example, networktocode.com has lots of resources and some labs. And there's actually a Slack community behind it, which is very active. Both Mircea and I are in there. We have a NAPALM channel where lots of people are asking questions all the time, helping each other, and just chatting. It's kind of fun. There is also ipspace.net, which has a lot of webinars and information about automation as well. pynet.twb-tech.com is also a great resource if you're familiar with Netmiko; he is the author of Netmiko.
And there is now also an O'Reilly book called Network Programmability and Automation, which I haven't read myself, but I know the authors are really good, so I'm confident that the content will be of high quality.
[00:54:40] Unknown:
And, Mircea, do you have any resources to recommend?
[00:54:43] Unknown:
Yeah. I would also recommend packetpushers.net. There's a lot of information there, in the podcast and so on, specifically about networking and also network automation. I sometimes blog on my own website, mirceaulinic.net, also about networking and network automation. I'm trying to post as often as possible, although I don't really manage to.
[00:55:22] Unknown:
Yeah. Keeping blogs up to date is definitely more effort than it seems when you first get started.
[00:55:30] Unknown:
Yes. Definitely.
[00:55:33] Unknown:
So for anybody who wants to follow what you're both up to and keep up to date with the work you're doing and with the state of NAPALM, your contact information is in the show notes, so I'll refer people there. And with that, I'll move us to the picks. I've got two picks this week that are germane to the topic at hand. One of them is an IETF RFC called The Twelve Networking Truths, which is actually much lighter reading than any of the other RFCs that I've perused in the past, and it's quite entertaining. So I definitely recommend people take a look at that one. And the other one is a post called Falsehoods Programmers Believe About Networking, which is just a bunch of contradictory truths, all of which are factual and all of which are quite entertaining as you read them together. So I definitely recommend people take a quick look at that as well.
And with that, I'll pass it to you. Do you have any picks for us today, David?
[00:56:25] Unknown:
Yeah. I mean, I've been reading or rather listening to the Fear Saga. It's a sci fi series, and it's it's just great. I really recommend it if you're into into sci fi. Also, I discovered recently VR. I think it's just great. I think that it's not going to be AI. It's actually going to be VR, what's going to bring humanity to an end.
[00:56:53] Unknown:
And, Mircea, do you have any picks for us this week?
[00:56:56] Unknown:
For me, the week has actually just started. I was quite busy yesterday, but I'm now reading a book that really calms me down and makes me feel good. It's called the daily zen or the zen book, and I get a lot of good advice there.
[00:57:16] Unknown:
Alright. That sounds like something I'll be needing to take a look at as well. So with that, I would like to thank you both for taking the time out of your day to join me and tell me about Napalm and the work that you guys are doing in your respective positions and how you're using Napalm to manage the networks that keep the Internet running. So I appreciate your time, and I hope you enjoy the rest of your days. Yeah. Thank you. Thanks. Thanks for invite.
[00:57:41] Unknown:
Cheers.
Introduction and Guest Introduction
Guest Backgrounds and Initial Python Experience
Introduction to Napalm
Challenges in Network Automation
Testing Napalm
Napalm and Software Defined Networking
Architecting Napalm for Compatibility
Integrating Napalm with Configuration Management Tools
OpenConfig and Napalm
Declarative vs Procedural Network Management
Challenges in Network Automation
Interesting Uses of Napalm
Resources for Learning Network Management
Picks and Recommendations