NAPALM with David Barroso and Mircea Ulinic

Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

I would like to thank everyone who supports us on Patreon. Your contributions help to make the show sustainable.

When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at www.podcastinit.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or experimenting with something that you hear about on the show.

You can visit the site at www.podcastinit.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, please leave a review on iTunes or Google Play Music. Tell your friends and coworkers and share it on social media.

Your host as usual is Tobias Macey. And today, I'm interviewing David Barroso and Mircea Ulinic about NAPALM, the Network Automation and Programmability Abstraction Layer with Multivendor support, the library for managing programmable network devices. So, David, could you please start by introducing yourself?

Sure. My name is David Barroso. I'm a network systems engineer at Fastly, where I spend my time automating networks and also trying to build networks where the application can actually decide how to route the packets themselves. And before that, I was at Spotify, which is actually where this project started.

Hi. I'm Mircea. I'm also a network systems engineer, at Cloudflare, where I have the chance to automate one of the biggest global networks. We have a more traditional approach to networking, like 99% of the networks in this world. And, besides this, I'm a NAPALM user, contributor, and maintainer.

And, David, can you start by recounting how you first got introduced to Python?

So, traditionally, I'm a network engineer, although nowadays I spend most of my time coding, not all my time. But I think it was back in 2012. I was just a regular network engineer, and I was trying to get a hold of the inventory of my network, which was living inside just a wiki page, which meant that I couldn't actually query the information. Trying to find a device by platform or operating system or something was just a nightmare. So I was looking into moving all that data to some sort of database that I could actually query, and I found Django, which seemed like a good fit for this. And that's pretty much how I started with Python.

And, Mircea, how about yourself?

Yeah, my story is a bit different. I was previously a software developer, mostly oriented to the back end of websites. I began looking at Python because I had to work with CherryPy. It was back in 2007. I was with a group of friends, and we were looking at Python, and we said it's awesome. It looks very interesting, very readable, and so on. And one year later, I ordered from Amazon a book called Learning Python, written by Mark Lutz, a very well-known author. It was only the 2nd edition. I still remember it had a label on it saying that it covers Python 2.3. And we are speaking now about Python 3.6 and so on. After that, there was a period when I didn't use Python at all, for, like, 5 or 6 years, and then I started using it again.

And so let's start by explaining what NAPALM is and the problem that you were solving when you first started working on it. So, David, I believe that you were the original author. Is that correct? Or the original coauthor, at least?

Yeah. That's correct.

So if you take a look at most networks out there, they're usually comprised of multiple types of devices, like firewalls, load balancers, routers, switches. They also run many different operating systems, depending on the vendor, depending on the platform. They may even have different versions of that operating system. And each one has a completely different way of interacting with those devices. One may have only a CLI. Another one may have a completely different CLI. Then some other may have a proprietary API or maybe use NETCONF or whatever. There are just many ways you can manipulate these devices.

So as you can imagine, if you're trying to do something as simple as figuring out which IP addresses you have configured on your network, that becomes a super hard and massive task. You have to start by figuring out which device you are trying to connect to. Is it a Juniper device? Is it a Cisco device? Is it an Arista device? Which version of the operating system is it running? Which API does it have available? So you have to start figuring out all this stuff, and it turns out that your code becomes this massive blob just trying to figure out that part. Not even trying to get the information. Just trying to figure out how to get that information.

So as I was trying to automate the network at Spotify, it quickly became apparent that this wasn't going to scale. So I started working on first building this abstraction layer with NAPALM, so that when we were actually trying to solve a problem, we could focus on the problem, not on how to do that on a particular device.

When you first started it, did you immediately decide to use Python, or did you play around with other languages for implementing it at first?

So I decided to use Python for two reasons. The first one is because it's what I knew. As I mentioned, I was a traditional network engineer, so I wasn't really a developer. I mean, I could do, back then, some Perl, Bash, maybe some simple Java, but I wasn't really a developer. I was a bit more proficient with Python, so it seemed like the way to go.

And the second reason is that some vendors had, by then, some libraries written in Python, so you could interact with those devices. Juniper had a library written by them. Arista had another one. There were also Paramiko and Netmiko, which allowed you to interact with certain devices. So it felt like a natural fit: let's rely on those libraries and have NAPALM build this abstraction on top of them instead of having to reinvent everything from scratch.

And I imagine that there are also networking devices in your infrastructure that don't have any sort of programmability aspect to them. So does that pose a challenge when you're architecting the overall network and trying to integrate the Juniper and Arista switches and things like that with the more traditional, just hardware-based networking systems?

Yeah. So there is a lot of equipment out there, especially Cisco IOS, which is notorious for that, where the only thing you have is SSH and plain text, which means that you end up with a lot of regular expressions everywhere. But the problem is not that you have to deal with regular expressions. The problem is that they may have a minor release where they are supposedly just patching some minor bug, and it turns out that some output changes as well, and all of a sudden all your regular expressions just broke.

So, yeah, NAPALM has a lot of code just trying to manage those differences, trying to figure out, oh, is this 16.5.1 or is it 16.5.2, and trying to pick the right regular expression. It's a massive endeavor doing that.

Yeah. Someone made a poll, and speaking about numbers, it turns out that only 15 to 20 percent of the network devices in this world would actually have an API. Not only because vendors don't support it, but, historically, some didn't have one at all. They have it only in newer versions, and there's a problem in the network world that you sometimes can't upgrade your device to the new version that has the API, or it is not stable enough to make the business decision to upgrade to it, and so on. But this doesn't stop us. Basically, what you have to do is start a process that connects via SSH, and you issue commands like you would when you log in on the terminal yourself and type a command. Then you read that output, parse it, and present it to the user in a structured format.
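To make the screen-scraping approach described above a little more concrete, here is a minimal sketch that picks a regular expression based on the OS version and turns CLI text into a dictionary. The command output, the patterns, and the function name `parse_interfaces` are all invented for illustration; they do not correspond to any real device or to NAPALM's actual code.

```python
import re

# Hypothetical patterns: the same information printed two different ways
# by two minor releases of an imaginary operating system.
PATTERNS = {
    "16.5.1": re.compile(r"^(?P<name>\S+)\s+is\s+(?P<state>up|down)$"),
    "16.5.2": re.compile(r"^Interface\s+(?P<name>\S+):\s+state=(?P<state>up|down)$"),
}


def parse_interfaces(version, output):
    """Return {interface_name: is_up} parsed from raw CLI text."""
    if version not in PATTERNS:
        raise ValueError(f"no parser for version {version!r}")
    pattern = PATTERNS[version]
    result = {}
    for line in output.splitlines():
        match = pattern.match(line.strip())
        if match:  # skip banners, blank lines, and anything unrecognized
            result[match.group("name")] = match.group("state") == "up"
    return result


OLD_OUTPUT = """
Ethernet1 is up
Ethernet2 is down
"""

NEW_OUTPUT = """
Interface Ethernet1: state=up
Interface Ethernet2: state=down
"""

old = parse_interfaces("16.5.1", OLD_OUTPUT)
new = parse_interfaces("16.5.2", NEW_OUTPUT)
print(old == new)  # both releases end up in the same structure
```

The caller only ever sees the dictionary, so a change in the text format between releases stays contained in the pattern table.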

And one of the challenges of working with the networking layer is that basically everything on top of it relies on the fact that the network is operating properly and isn't dropping packets or, you know, is just generally available. So I imagine that there's a bit of risk involved with trying to manipulate it programmatically in the event that you have some sort of bug in your scripting. So what are some of the ways that you try to mitigate that problem when you're working with NAPALM?

So NAPALM has a couple of mechanisms to minimize problems like the one that you were describing. The first one is that you can create what we call a candidate configuration, and then you can ask NAPALM to give you a diff between the actual configuration on the running device and your candidate configuration. You can actually introduce that as part of your change control workflow, so a fellow engineer could review that change and say, okay, yeah, this makes sense, this is not going to break anything. That's just like doing a pull request if you are a developer, right?

Another mechanism is that if you detect that you broke something and you still have access to the device, we implement a quick rollback. So you can just tell NAPALM to roll back, and you will go to the previous known state. That's a nice way of trying to fix a problem.

And then we are now integrating as well with a feature that some vendors have. Some of them call it commit confirm, but it's basically a sort of automatic rollback. When you do a change on the network, the device will expect you to confirm the change within a period of time. If that period of time expires, the device will roll itself back. It will assume that you lost connectivity to it, that something is broken, so it has to roll back. So those are the mechanisms that we have within NAPALM to try to minimize or solve problems as easily and as fast as possible.
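The candidate, diff, commit, and rollback workflow described above can be sketched with a toy in-memory device. This is only an illustration: `ToyDevice` is a made-up class, the configuration text is invented, and although the method names loosely echo NAPALM's (`compare_config`, `commit_config`, `rollback`), nothing here talks to real hardware or uses the library.

```python
import difflib


class ToyDevice:
    """In-memory stand-in for a device; not a real NAPALM driver."""

    def __init__(self, running):
        self.running = running
        self.candidate = None
        self.previous = None

    def load_candidate(self, config):
        """Stage a full replacement configuration without applying it."""
        self.candidate = config

    def compare_config(self):
        """Return a unified diff between running and candidate config."""
        if self.candidate is None:
            return ""
        diff = difflib.unified_diff(
            self.running.splitlines(),
            self.candidate.splitlines(),
            fromfile="running",
            tofile="candidate",
            lineterm="",
        )
        return "\n".join(diff)

    def commit_config(self):
        """Apply the candidate, keeping a checkpoint for rollback."""
        if self.candidate is None:
            return
        self.previous = self.running
        self.running = self.candidate
        self.candidate = None

    def rollback(self):
        """Return to the configuration saved by the last commit."""
        if self.previous is not None:
            self.running = self.previous
            self.previous = None


device = ToyDevice("hostname sw1\nvlan 10")
device.load_candidate("hostname sw1\nvlan 10\nvlan 20")
diff = device.compare_config()
print(diff)             # what a reviewer would look at before committing
device.commit_config()
device.rollback()       # something broke: go back to the previous state
```

The diff is the piece a colleague can review like a pull request; the saved checkpoint is what makes the quick rollback possible.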

And for the built-in rollback capability, is that just caching the diff from before you apply the configuration change so that it knows how to revert it back to the previous state?

So that depends on the platform. Some platforms support that feature themselves: every time that you do a change, the device actually saves a checkpoint. So if that is supported natively, we will just tell the device, okay, go to the previous state. But if that's not supported, we will save a local copy on the device, and then it's just a matter of reloading that configuration that we saved before actually committing the new one.

And for testing the NAPALM library, I imagine that because of the reliance on the hardware platforms and the operating systems that they're running, it introduces a lot of complexity in terms of being able to accurately test everything at the unit level. So there must be a lot of integration testing involved as well. So I'm wondering what your overall test architecture looks like to be able to ensure that changes that you make don't introduce regressions.

Yeah. So we do two different things here. If you're changing code that modifies configuration on the device, that's a bit more complex, because you want to do proper integration tests against a real device. You probably don't want to mock that because it's kind of too critical. So when we change code that manipulates configuration on the devices, we use Vagrant to spin up VMs, and then we have a whole suite of configurations where we know what they look like, so we know what diff we are supposed to get before committing the change and after committing the change. But it's actually manual testing, because we have everything on Travis right now, and Travis doesn't really work that well when you want to test many VMs with Vagrant and stuff like that. So, yeah, that's the configuration management part of the project.

When we are dealing with retrieving data, that's actually easier to test because then you can just mock the device. I know that if I type this command or do this API call, I'm going to get this response, this text or this JSON blob or XML, whatever. So in that case, we just have a battery of test cases. Okay, this is IOS 13. This is IOS 12.2. This is that. This is the particular case where you don't have any BGP neighbors. So we just keep building cases and making sure that all the parsing, regular expressions, everything just works.
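As a rough sketch of the mocked-device testing just described: canned responses stand in for the device, and each test case feeds one of them to the parser. The command strings, the output format, and the names `FakeDevice` and `get_bgp_neighbors` are assumptions made up for this example; NAPALM's real test suite is organized differently.

```python
# Canned responses keyed by command. In a real suite these would be files
# captured from devices; the commands and output here are made up.
CANNED = {
    "show bgp neighbors": "10.0.0.1 65001 Established\n10.0.0.2 65002 Idle\n",
    "show bgp neighbors (none)": "",
}


class FakeDevice:
    """Returns recorded text instead of talking to real hardware."""

    def __init__(self, responses):
        self.responses = responses

    def send_command(self, command):
        return self.responses[command]


def get_bgp_neighbors(device, command="show bgp neighbors"):
    """Parse 'address asn state' lines into a dictionary."""
    neighbors = {}
    for line in device.send_command(command).splitlines():
        address, asn, state = line.split()
        neighbors[address] = {
            "remote_as": int(asn),
            "is_up": state == "Established",
        }
    return neighbors


def test_two_neighbors():
    result = get_bgp_neighbors(FakeDevice(CANNED))
    assert result["10.0.0.1"] == {"remote_as": 65001, "is_up": True}
    assert result["10.0.0.2"]["is_up"] is False


def test_no_neighbors():
    # The "no BGP neighbors" corner case gets its own canned response.
    result = get_bgp_neighbors(FakeDevice(CANNED), "show bgp neighbors (none)")
    assert result == {}
```

The two functions follow the pytest naming convention, so a runner such as pytest would pick them up; adding a new platform version or corner case means adding another canned response and another test.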

And so one of the things that I was thinking about as I was preparing for the show is how working with NAPALM relates to working with purely software-defined networking. I know that with the NAPALM library, you're interfacing with the physical switches that have their software layer exposed via APIs. But in systems such as OpenStack, where you have entirely virtual network infrastructure for the interconnect between the virtual machines, is that something that NAPALM would be able to manage as well?

So, I mean, what NAPALM does is mostly interact with the configuration of the network, right? So if you have your OpenStack infrastructure and you need to provision, I don't know, a VXLAN or a VLAN or a route or something, you could actually use NAPALM to interact with the hardware to say, okay, you have to configure this VLAN because I provisioned this new service. And I actually heard some people talking about using NAPALM within the OpenStack context, but I'm not entirely certain about that.

And so as you mentioned earlier, NAPALM abstracts the different underlying platform-dependent libraries. So I'm wondering how you architected it in order to be able to easily integrate new classes of network devices and new dependent libraries while still maintaining a consistent API across the different devices.

The approach is quite simple and probably couldn't be Pythonic. The idea is that we have a base class. I think it's called NetworkDriver right now, or something like that. That base class just defines the API. It has a set of methods, and it has documentation saying how each method is supposed to behave, what's supposed to happen, and the data that it's supposed to return. So that sort of acts as an interface, but Python doesn't have interfaces. So what we do is, for each particular device out there, we implement a class that inherits from that base class. And then in the testing framework, we compare signatures. For example, a method that you implemented here is supposed to match the signature of your parent class, and the output is supposed to match what the parent class would be outputting as well. So we try to enforce this interface in the testing framework instead of at the Python layer, because Python doesn't really have any sort of interface feature.

We decided to go down this path instead of using, I don't know, decorators or metaclasses or abstract classes or something like that because we figured that the user base of this library wouldn't be very proficient in Python. So we wanted to make it as simple as possible, so we could have network engineers who might have a lower level of Python proficiency implementing new things and being able to contribute as well.
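The idea of enforcing the interface from the test suite rather than in the class machinery can be sketched with `inspect.signature`. This is a minimal illustration under assumptions: the base class below is a two-method stand-in, `GoodDriver`, `BadDriver`, and `signature_mismatches` are hypothetical names, and the project's real checks may work differently.

```python
import inspect


class NetworkDriver:
    """Base class that only documents the expected API."""

    def get_facts(self):
        """Return a dictionary of basic device facts."""
        raise NotImplementedError

    def load_merge_candidate(self, filename=None, config=None):
        """Load a candidate configuration to be merged."""
        raise NotImplementedError


class GoodDriver(NetworkDriver):
    def get_facts(self):
        return {"vendor": "Example"}

    def load_merge_candidate(self, filename=None, config=None):
        pass


class BadDriver(NetworkDriver):
    def get_facts(self, verbose):  # extra argument breaks the contract
        return {}


def signature_mismatches(driver_cls, base_cls=NetworkDriver):
    """List public methods whose signature differs from the base class."""
    mismatches = []
    for name, base_method in inspect.getmembers(base_cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        driver_method = getattr(driver_cls, name)
        if inspect.signature(driver_method) != inspect.signature(base_method):
            mismatches.append(name)
    return mismatches


good = signature_mismatches(GoodDriver)
bad = signature_mismatches(BadDriver)
print(good, bad)
```

A driver author only has to subclass and write plain methods; the stricter checking lives in the tests, which fits the stated goal of keeping the code approachable for network engineers.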

And when you were designing the overall API and trying to figure out which methods you were going to expose as the common denominator across all the different types of devices, how did you go about determining what those methods and type signatures were going to look like in order to be able to have the broadest set of compatibility across the types of devices that you're targeting?

So there are two parts here as well. One is the configuration management part, and for that one, we tried to emulate what the NETCONF protocol specifies. The NETCONF protocol is a well-known standard in the network community, and it has a set of primitives, like a candidate configuration and a running configuration. It has certain actions that you can do. You can load the candidate configuration. You can start a transaction and commit the change. So for the configuration management bit, we try to emulate that part, which means that it's been quite stable since the very beginning. It was designed at the beginning, it was implemented, and we haven't changed it that much since then.

And then the second part is what we call getters. Getters are just methods that connect to the device and consistently retrieve information from it. So you could say, okay, give me the BGP neighbors, and you would get the same data regardless of the vendor. It parses the data and structures it in a known format. And that's pretty much community driven. We had a set of methods that we were using, and then as people start using the library, they might open an issue on GitHub saying, oh, yeah, I see that you're getting the BGP neighbors, but I would like to have the OSPF neighbors as well. Then we start discussing how the data would look so it's as generic as possible, and start implementing the methods as they are actually needed. We don't really implement all the methods for all the platforms. We just implement the methods where someone actually needs them.

And the underlying libraries, I'm assuming, expose their data in different formats. So I'm assuming that you have to do some measure of data manipulation in order to be able to present it in a consistent manner across all the different devices that somebody might be querying.

Yeah. That's correct. That's actually the main bulk of the code: getting the text that a device might give you via the CLI, parsing it, and returning a dictionary, or converting a dictionary into a different dictionary, or converting this XML object that you got into whatever the method is supposed to return. So, yeah, there is a lot of data manipulation.
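As an illustration of that kind of data manipulation, the sketch below takes two invented native formats (a JSON-like dictionary and an XML document) describing the same BGP neighbor and converts both into one common dictionary. The input shapes, the output keys, and the function names are assumptions for the example, not what any vendor or NAPALM actually returns.

```python
import xml.etree.ElementTree as ET

# Two made-up native formats describing the same BGP neighbor.
VENDOR_A_DATA = {
    "peers": [{"addr": "10.0.0.1", "as": "65001", "status": "ESTABLISHED"}]
}
VENDOR_B_XML = (
    "<bgp><neighbor ip='10.0.0.1'><asn>65001</asn><up>true</up></neighbor></bgp>"
)


def normalize_vendor_a(data):
    """Convert the dictionary-based format into the common structure."""
    return {
        peer["addr"]: {
            "remote_as": int(peer["as"]),
            "is_up": peer["status"] == "ESTABLISHED",
        }
        for peer in data["peers"]
    }


def normalize_vendor_b(xml_text):
    """Convert the XML-based format into the same common structure."""
    root = ET.fromstring(xml_text)
    return {
        node.get("ip"): {
            "remote_as": int(node.findtext("asn")),
            "is_up": node.findtext("up") == "true",
        }
        for node in root.iter("neighbor")
    }


from_a = normalize_vendor_a(VENDOR_A_DATA)
from_b = normalize_vendor_b(VENDOR_B_XML)
print(from_a == from_b)  # the caller cannot tell which vendor answered
```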

Yeah. It actually depends very much on the driver, or actually on the platform capabilities and what kind of API you are able to use, or whether you have an API at all. For example, if you try to automate a Juniper device, they have an API via a standardized protocol called NETCONF, which runs on top of SSH and provides XML documents. Then you can extract your data from there. But you may not have an API. And, as I said previously, you then need to spawn an SSH process that connects to the device and emulates your command line, and you extract the plain text, which you have to parse. So it depends very much on the platform you are trying to automate. And the underlying library in general is used only as transport.

And one thing that I was just thinking about, particularly in larger networks such as the types that I'm sure you're working with, Mircea, and David, I'm assuming you do as well, is discovery of the devices that you're going to be working with. So do you just keep a database of all the different locations for the devices, or do you have some measure of discovery using network protocols such as ARP or whatever to figure out the types of devices that you're working with and where they're located within your network?

Yeah. We actually have a database. In our case, it's pretty simple, even though the network is quite huge now. But we are growing, and we add new PoPs at a pace of about one PoP per week. We started from scratch, and from the beginning it has been designed with automation in mind. To add a new PoP, we only need to update this database. To make sure that everyone understands: by PoP, I mean a point of presence. That means a new colocation in a data center where we have network devices. But you can also use SNMP to detect what kind of device you have. You only need to provide the IP addresses where to go and collect the details. And you can, for example, automatically update your database, or simply have it tell you back on the CLI: I have discovered these devices at the IP addresses you gave me, and they are, I don't know, Juniper, Arista, and so on.

Yeah. We have a somewhat similar approach.

We actually have a database where we have all the information, and we treat the database as the source of truth. So the device has to exist in the database even before it can be deployed on the network, because operators don't have any means to connect to the devices and configure them directly. It has to be done via automation. So it's kind of a documentation-first approach in our case.

And so when you're storing the information in the database and then the device actually does get connected to the network, do you have sort of events that will trigger a run of a napalm set of commands to automatically provision the desired state for that device once it comes online?

So there are two parts here. When you provision a new device, we use what we call ZTP, zero-touch provisioning. That's just a way of building the initial configuration dynamically, similar to PXE boot but for the network. That will give you the credentials and management IP and all that stuff. Then it's just a matter of the operator doing an Ansible run, which uses NAPALM to build all the configuration and all the software packages that have to be deployed on the device, and deploys everything.

And so that brings us to talking about integrating napalm with configuration management tooling, such as, as you mentioned, Ansible and SaltStack. And I'm curious as to the amount of support in other tools such as Chef and Puppet.

So for Ansible, we developed the modules ourselves. Ansible has its own networking modules, although they take a completely different approach, where they have specific modules for each specific platform. So with Ansible, if you want to do something with the native modules on Cisco IOS, for example, you have to use the IOS module that does that. If you want to do something similar for, I don't know, Arista, you have to take the specific Arista module that does that. So they have their own thing, and the NAPALM Ansible modules are something you have to install as third-party modules, but we actually support them. Salt has native modules, which were actually developed by Mircea, so he can probably elaborate on that. I know that StackStorm has integrated modules for NAPALM as well. But I honestly don't know about Puppet and Chef.

Me neither. I haven't heard anything, although it's definitely possible, and I expect it shouldn't be extremely hard. But I heard something about a different tool called Trigger.

Alright. Yeah.

About Salt: as David said, in Ansible there are also a couple of other modules provided by network vendors or third parties. But NAPALM is a very special case because it's independent, it's community driven, and you can do anything you want. In Salt, it's already embedded in the core. So, basically, when you install Salt you have all the network automation capabilities embedded, and even more: event-driven network automation and orchestration.

The list of features is quite big now. In the next release, Nitrogen, which will happen this month, there will be even more features provided, and they are not related strictly to configuration management. There is also the getters part that David presented already, where you are able to retrieve data.

But Salt comes from a different perspective. If we zoom out a bit and look from 4,000 feet, you'll see that it is much more than configuration management. It's triggered configuration management, which can be triggered when your interfaces go down, when your BGP neighbors start leaking routes, and so on, which can be detected either using those getters or from other events in the network. The network world is a very special case because it's not only about configuration management. It's often misunderstood that configuration management is everything. It is vital and extremely important, of course, to have your network consistent. But there are a lot of things happening around it that you definitely want to monitor and start acting on automatically.

And I will introduce here the new kid in the NAPALM suite, called napalm-logs. It's still in beta release, but we are working on it now. It collects all these events from different network devices and presents them in a vendor-agnostic form, more specifically structured following public standards like OpenConfig or IETF. And you can ingest those events in a tool like Salt or StackStorm and so on, and you can trigger configuration changes.

Now that Mircea mentioned OpenConfig, I would like to talk about OpenConfig for a second. It's actually funny because I was presenting, I think it was 2015, at NANOG 64. As I was presenting NAPALM there, the day after, Google presented another effort that they were doing in parallel that was similar to NAPALM, which is called OpenConfig. OpenConfig uses YANG, which is a data modeling language, to define a common set of models that are supposed to be vendor agnostic and useful both to retrieve information from the network and to configure the network agnostically, without vendor-specific stuff in there. The only problem with OpenConfig is that the OpenConfig group is way faster than vendors are. So it turns out that two years later, there isn't that much OpenConfig support from the vendor side yet. There are very few models supported by very few platforms.

So we recently started working, as part of NAPALM, on supporting OpenConfig models within the NAPALM context. You can use NAPALM to retrieve information from the devices, native information, for example, and map it into an OpenConfig object. So you have the NAPALM layer doing this translation between native configuration and well-known models that are supposed to be vendor agnostic.

The cool thing about this is that you get not only an abstraction of how to operate the network, but also an abstraction of how the data is supposed to work across different vendors. So you can get the configuration from, let's say, Cisco and translate it into native configuration that Juniper or Arista may understand. That's actually quite powerful, as you can start replacing devices without that much complexity.
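A very small sketch of that translation idea: parse one vendor's native text into a neutral model, then render the model in another vendor's syntax. Both configuration syntaxes here are invented, the "model" is a plain dictionary rather than a real OpenConfig/YANG object, and `parse_vendor_a` and `render_vendor_b` are hypothetical names.

```python
import re


def parse_vendor_a(text):
    """Parse lines like 'interface Eth1 description uplink' into a neutral model."""
    model = {}
    for line in text.splitlines():
        match = re.match(r"interface (\S+) description (.+)", line.strip())
        if match:
            model[match.group(1)] = {"description": match.group(2)}
    return model


def render_vendor_b(model):
    """Render the neutral model in a second, made-up 'set' style syntax."""
    lines = []
    for name in sorted(model):
        description = model[name]["description"]
        lines.append(f'set interface {name} description "{description}"')
    return "\n".join(lines)


NATIVE_A = """
interface Eth1 description uplink to core
interface Eth2 description spare
"""

model = parse_vendor_a(NATIVE_A)
translated = render_vendor_b(model)
print(translated)
```

Because the parser and the renderer only meet at the neutral model, supporting another platform means writing one more parser or renderer rather than one converter per pair of vendors.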

Yeah. I imagine that would be an interesting use case, to be able to have one network switch or piece of network hardware that you want to replicate in either a new data center or for redundancy within your existing network, and be able to just pull the configuration from one and load it onto the other, irrespective of what the actual underlying hardware and software systems are.

Yeah. That's super interesting.

And the cool thing is that because now you are mapping this native configuration, which is just text in most cases, into an object, you can start doing other cool things. You can, for example, simulate changes. Okay, let's imagine that I have this configuration, which might not be living anywhere, right? Maybe just something that you handcrafted because you wanted to test something. So let's imagine that I have this blob of configuration. What would happen if I applied this other blob of configuration on top of it? Or you can start doing actual simulations based on backups of your configuration. You don't even need a live device to do the actual diff of the configuration. You can just grab a configuration backup file or whatever and start simulating changes.
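Once configuration is an object rather than text, simulating a change offline is ordinary data manipulation. The sketch below assumes configurations are represented as nested dictionaries (a simplification, not the structure NAPALM or OpenConfig actually uses) and shows a recursive merge plus a listing of what would change; `merge` and `changed_paths` are illustrative helpers.

```python
import copy


def merge(base, change):
    """Return a new dict with `change` applied on top of `base`, recursively."""
    result = copy.deepcopy(base)
    for key, value in change.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def changed_paths(before, after, prefix=""):
    """List dotted paths whose value was added or modified."""
    paths = []
    for key, value in after.items():
        path = f"{prefix}{key}"
        if key not in before:
            paths.append(path)
        elif isinstance(value, dict) and isinstance(before[key], dict):
            paths.extend(changed_paths(before[key], value, path + "."))
        elif before[key] != value:
            paths.append(path)
    return paths


# A "backup" loaded from a file and a handcrafted change; no device involved.
backup = {"interfaces": {"Eth1": {"description": "uplink", "mtu": 1500}}}
change = {"interfaces": {"Eth1": {"mtu": 9000}, "Eth2": {"description": "spare"}}}

simulated = merge(backup, change)
changes = changed_paths(backup, simulated)
print(changes)
```

The backup itself is left untouched, so the same starting point can be reused to try several candidate changes.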

And one thing that I'm curious about is whether there's anything that you're losing by using NAPALM as an abstraction layer on top of the underlying hardware, because I'm sure the different vendors are trying to expose certain features that are unique to them, to differentiate themselves and drive people to want to buy their hardware versus somebody else's. So is there a trade-off when you're using NAPALM versus trying to take advantage of whatever special capabilities the underlying piece of network equipment has?

So NAPALM mostly deals with the configuration management plane, which means that you don't really lose any feature. Well, you might lose tiny features, like, on one platform you can add a comment in this block of configuration, and you cannot do that on another platform. You may lose small features like that, but the actual value of the device is going to be on the data plane. Like, oh, I support this protocol, and this other platform doesn't. And because NAPALM deals with the configuration management plane, you're not going to be losing those features, right? Because it's a completely different plane. You're still going to be able to configure that feature on the device with NAPALM.

And going back briefly to the configuration management question, are there any trade-offs involved when you're using NAPALM embedded in one of those tools versus just using it directly, either via a script or using its command line capabilities?

So all the features that NAPALM supports are actually implemented in the modules. I did it for Ansible. I think that it's the same for Salt. Right, Mircea?

Yeah.

Yeah. So all the features are actually implemented. So it's just a matter of what you're trying to do. If you just need a simple script to retrieve your LLDP neighbors, for example, you may as well write it yourself, because it's going to be, I don't know, 10 lines of Python. But if you want to do something more complex, like event-driven automation, just use StackStorm or Salt or whatever. Don't try to reinvent the wheel. It just depends on your use case.

I would say you actually have many more benefits if you use a tool like that on top of NAPALM, because you ultimately write some templates to generate your configurations, right? And in a tool like Ansible, for example, you can use the embedded filters, the filters that are already available for Jinja, for example. Or in Salt, you can take this to a whole new level, because you can use many other variables in the template, you can reuse thousands of functions inside the template, and so on. So I would say you have many benefits doing this, and you don't lose anything.

Yeah. Nowadays, most people are actually using NAPALM via some framework like Ansible, Salt, or StackStorm. I don't think that there are that many users just using it directly. It can be done, but you get a lot of goodies if you just use a framework.
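As a rough illustration of the "10 lines of Python" case mentioned above, a direct NAPALM script for LLDP neighbors could look something like the following. This is a minimal sketch, not code from the speakers: the `napalm` import path and the `eos` platform name are assumptions about the installed release, and `lldp_summary` and `main` are hypothetical helper names. It relies on `get_lldp_neighbors()` returning a dictionary keyed by local interface, with each value a list of `hostname`/`port` entries.

```python
def lldp_summary(device):
    """Return sorted (local_interface, neighbor_hostname, neighbor_port) tuples.

    `device` is any object exposing NAPALM's get_lldp_neighbors() getter.
    """
    rows = []
    for interface, entries in device.get_lldp_neighbors().items():
        for entry in entries:
            rows.append((interface, entry["hostname"], entry["port"]))
    return sorted(rows)


def main(hostname, username, password, platform="eos"):
    # Imported lazily so lldp_summary() can be exercised without NAPALM installed.
    # Assumption: a NAPALM release that exposes get_network_driver from `napalm`.
    from napalm import get_network_driver

    driver = get_network_driver(platform)
    device = driver(hostname=hostname, username=username, password=password)
    device.open()
    try:
        for interface, neighbor, port in lldp_summary(device):
            print(f"{interface} -> {neighbor} ({port})")
    finally:
        device.close()
```

For anything beyond a one-off query like this, the frameworks discussed above already wrap the same getters.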

Yeah. I use Salt pretty regularly on a day-to-day basis, so that's where my preference lies. So this is a bit of a leading question, but briefly, with SaltStack, I'm assuming that the majority of the interaction that you're doing with the network devices is via the proxy minion capabilities, or is it also operating directly against the network switches via the APIs?

Yes. At the moment, only through the Salt proxy feature, but for the next release, Oxygen, we are thinking of also having a roster, so as not to keep the proxy process always running. There are some environments where you need to push a configuration change around once per week or per month, and it's a bit pointless to have a proxy process always running. The roster sits on top of the salt-ssh subsystem, which is basically the equivalent of Ansible, and you can leverage automation like that. But even so, I would say that there are very good ways to start the proxy process only when necessary. You only have to have, in the Salt language, a Salt state that starts the process, enables it, applies the configuration changes, and then stops it. So it's about four more lines of SLS.

And when I was going through the documentation, I was noticing that it's possible to merge the configuration that exists on the device with the desired configuration that you're applying. And having worked with Git a fair amount, I'm pretty familiar with the dreaded merge conflicts. So I'm wondering how you manage those and how you prioritize the application of the nested data structures that are going to be exposed. Do you just completely override them in one direction or the other? Or do you merge them together? Or is there some sort of tunability parameter for how that's managed?

So when you're merging configuration with NAPALM, you are actually merging one state into another. There are no intermediate states, so there is no branching or anything. There cannot be conflicts like the ones you may have with Git when you start merging different branches and you have conflicting changes, because you're always going from A to B. If you want to merge different branches, what you would be doing is managing your configuration files with Git directly, so you would be solving all the problems there. NAPALM is then just the delivery mechanism of the resulting configuration into the device.

Yeah. I would add one more sentence here. This assumption is based on the fact that you are able to replace parts of the configuration. But there are some devices in this world that are actually not able to replace chunks of configuration. In that case, there are also good ways to do so. For example, you can retrieve what is configured now, then determine in your tool what needs to be added and what needs to be removed.

Yeah. With the napalm-yang framework that I mentioned before, which integrates the OpenConfig models, we can actually go one step further. As Mircea was mentioning, we rely nowadays on the devices themselves doing that merge operation. But because we are operating with actual objects when we're using the napalm-yang framework, we can compute the delta ourselves and figure out the exact commands that you have to run to go from state A to state B. So with that, we'll actually get merge support consistently across all the platforms. But as I mentioned before, it is always going to be going from one state to another, without branching or anything. If you want to do this branching, you will have to solve it beforehand using Git or something like that, because we didn't think it was worth implementing something as complex as this when there are tools already able to do that.
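The idea of retrieving what is configured now and computing what must be added or removed to go from state A to state B can be sketched in plain Python. This is only an illustration over nested dictionaries, not how napalm-yang or any device implements it; `compute_delta` and the sample interface data are hypothetical.

```python
def compute_delta(current, desired, path=()):
    """Compare two nested dicts describing device state.

    Returns three dicts keyed by the path (a tuple of keys) of each difference:
    what to add, what to remove, and what to change as (old, new) pairs.
    """
    to_add, to_remove, to_change = {}, {}, {}
    for key in sorted(set(current) | set(desired)):
        here = path + (key,)
        if key not in current:
            to_add[here] = desired[key]
        elif key not in desired:
            to_remove[here] = current[key]
        elif isinstance(current[key], dict) and isinstance(desired[key], dict):
            # Recurse into nested blocks so only the changed leaves are reported.
            add, remove, change = compute_delta(current[key], desired[key], here)
            to_add.update(add)
            to_remove.update(remove)
            to_change.update(change)
        elif current[key] != desired[key]:
            to_change[here] = (current[key], desired[key])
    return to_add, to_remove, to_change


# Hypothetical state A (on the device) and state B (desired).
state_a = {"interfaces": {"eth0": {"mtu": 1500}, "eth1": {"description": "old"}}}
state_b = {"interfaces": {"eth0": {"mtu": 9000}, "eth2": {"description": "new"}}}
additions, removals, changes = compute_delta(state_a, state_b)
```

Because there is only ever one current state and one desired state, the result is a plain delta and never a conflict, which matches the point made above about branching being left to Git.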

And when you're working with NAPALM, it seems like a fair bit of what you're doing is somewhat declarative, as far as applying the desired state of the device as opposed to doing it in a procedural manner. So I'm wondering what some of the difficulties are, and how that differs from managing a general-purpose operating system in a declarative model similar to what configuration management exposes?

Well, I wish that was actually true because, unfortunately, it's not entirely true. When you're managing most platforms, what you actually have is a command line that you use to instruct the device to go to a desired state. For example, let's imagine that you are configuring the network stack of a Linux machine. What you would be doing is editing a bunch of files and then restarting the network service, right? But imagine that you don't have this network service. What you have instead is just the iproute2 tooling. You only have the ip command, so it's: I want to check which IP address I have. I have this one, so now I have to remove this one and add this other one. And my routing table looks like this, so I have to remove this route and change this other one. Unfortunately, most platforms behave like that. You don't really have this declarative model. What you have is a CLI to instruct the device how to get there. So that's actually the main challenge with automation in the networking world.

Fortunately, some vendors are now starting to provide this declarative model, where you can actually send the desired state: this is the config I want, just apply it. In those cases, things are getting way, way better. But you still have the problem that most of those platforms still behave as a single service. So you cannot just reconfigure the NTP service, for example, or just BGP. You have to actually say: this is my whole configuration, reload all the services. A lot of people cannot do that because they don't have an inventory with all their parameters somewhere, so they cannot rebuild this full configuration and just reapply it. I would say that that's the actual challenge.

If I can mention it, someone asked me at some point: I want to have this configuration on the device, and I want to run, let's say, a state or some tool that updates the configuration, but I don't want to put this data in the source of truth. How can I do it? Well, that's an unsolvable challenge. Put it in a source of truth.

Yeah. So in the past, when I've been going from a non-automated environment to an automated environment, I would end up doing a lot of parsing of the configuration. You connect to the device, you parse the configuration, and you try to extract the data that you're not automating yet, so you can reapply it alongside the bits that you are already automating. And you just keep moving from non-automated blocks to automated blocks gradually until you manage to automate everything. But it's certainly a big task because you cannot really automate the network service by service. You have to either automate everything or handle a lot of complexity and state management yourself.

And we've covered a bit of the technical difficulties of managing these different classes of device. So I'm wondering if there are any other challenging aspects of working with network hardware in a programmatic fashion.

Yeah. I would say it's mostly inconsistencies and buggy code, but I would leave this one to Mircea. He loves this one.

Yeah. A lot. There are many, many challenges. I find it fun up to a point, but afterwards it gets ridiculous. There is inconsistency in the way each platform presents the operational data, but also at the configuration level. Somehow this is natural. Each vendor comes with its own methodologies and its own syntax. Otherwise, someone will invoke the law and tell you that you have to pay some money because you used their syntax, right? But beyond this, the APIs are buggy most of the time, and sometimes they simply don't work, or they are not even consistent with themselves. For example, Cisco has many, many platforms, each one having its own obscure versioning system. IOS XR has an API called the XML API. On IOS XE, you have something else. On IOS, you don't have anything. And all of these are manufactured by Cisco, so this is only about Cisco. Think also about going from one vendor to another. These APIs simply don't work sometimes.

I have an issue right now where I simply cannot start the process for the API at all. This has been ongoing for one month. I have rebooted the system, but it still doesn't help. These are the kinds of challenges we are facing, and at some point we may decide that screen scraping, although it sounds and is very, very bad, and is a very poor way to automate, may turn out to be more reliable than waiting for a vendor to tell you what to do with an API that you can't even start.

And it's not only between different platforms. Some platforms may have different data structures between minor versions.

Or even the same platform might return a different data structure for the same API call depending on certain cases. For example, you have your BGP neighbors, and you want to retrieve them. If you have only one BGP neighbor, you will get just a dictionary, while if you have multiple BGP neighbors, you will actually get a list of BGP neighbors. So you need to start accounting for those things in your code: okay, am I dealing with a list or with a dictionary? Because otherwise you're just going to break your code.

It's even on the same platform, same version, same box. If you configure, for example, one NTP peer, it will give you back a dictionary, or let's say an object in JSON format, which is the equivalent of a Python dictionary. But if you have two or more NTP peers, it will be a list.
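One common way to cope with an API that returns a dictionary for a single item and a list for several is to normalize the value before using it. The sketch below is generic Python and not part of any vendor library or of NAPALM; `as_list`, `peer_addresses`, and the `ntp_peer` payload shape are hypothetical.

```python
import json


def as_list(value):
    """Normalize 'one item or many' API results into a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def peer_addresses(payload):
    """Extract peer addresses regardless of how many peers the device returned."""
    return [peer["address"] for peer in as_list(payload.get("ntp_peer"))]


# Hypothetical responses from the same box with one peer and with two peers.
one_peer = json.loads('{"ntp_peer": {"address": "10.0.0.1"}}')
two_peers = json.loads(
    '{"ntp_peer": [{"address": "10.0.0.1"}, {"address": "10.0.0.2"}]}'
)
```

With the normalization in one place, the rest of the code can always iterate over a list and does not need to ask whether it is dealing with a list or a dictionary.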

And so given the fact that these APIs are so poorly maintained and poorly implemented, it seems that they're sort of a second class consideration from the networking vendors. So if the APIs aren't accessible or consistent, then what is their intended mechanism for you to be able to interface with and operate on the devices?

I want to start by clarifying that this is not the case on all the platforms. There are some vendors, Juniper being the notable one here, that are really, really good with APIs. It's a first-class citizen. But the thing is that most vendors didn't care about automation before because their customers weren't asking for it. So SSH is everything they have, to be honest. I hope that things are starting to change now because actual operators are starting to demand automation. But right now, they will tell you: yeah, this is what it is, just SSH. And then you just have to fall back to screen scraping again.

And given the fact that NAPALM is integrated into a few different pieces of tooling and frameworks and seems to be fairly popular, I'm sure that there are some interesting or unusual instances of NAPALM being used in the wild that you've heard about. So I'm wondering if you can both share a bit of what you've seen that you thought was particularly noteworthy.

Do you want to start, Mircea?

Yeah. I will take some examples, starting with us at Cloudflare. We are using NAPALM, for example, to monitor the backbone of the Internet. We have a need to see how reliable our IP transit providers are and how much we can trust them to forward our packets. To bring more context, we don't have our own fiber or means of transport. We just pay some guys to plug a fiber into our devices, and from there on, they carry the packets. But very often they have packet loss, sometimes more than 20% packet loss. In that case, we need to reduce the traffic and send our data through other providers. We're using NAPALM in the first place to retrieve the performance data and to monitor how much packet loss they have. We would normally use SNMP, but there's a different problem there. Not only does each vendor have its own SNMP implementation, but some present the data in a very poor way. For example, on Cisco IOS XR, if you specify a tag name for a probe longer than 13 characters, then when you try to retrieve it via SNMP, it won't give you anything. And there are many other stupid things.

Moving on: we monitor this and, as I said, we reroute the traffic. That triggers a lot of things. It means applying the configuration change on the device with NAPALM. Then, after this configuration change is applied, a couple of other things happen: updating our internal tools, posting to our internal chats, updating a very nice map of the points of presence where we rerouted the traffic, updating our communication with customers on our status page, where we say that traffic is rerouted, sending emails, and so on.

We are also using, although I'm not speaking strictly about the main NAPALM, and I'm going to mention it again, that new kid, napalm-logs. We're starting to use it to listen to the events in our network, to what's going on, and we act immediately and reliably on these events without having any impact on our network or customers.

I've also seen some people replacing their SNMP-based monitoring system with all the getters that we have. So instead of checking the interfaces with SNMP, they just retrieve them with the getters. We also have this validate functionality built into NAPALM that lets you define a desired state. For example, you could say: I want to make sure that I have this BGP neighbor configured, that it is up, and that it's sending me a certain number of routes. You can define that in a YAML file with a very simple DSL to define the rule. NAPALM then figures out how to retrieve that information and checks that it matches. If the rule is not passing, it gives you a report saying: okay, there is a compliance issue, you're failing on this. I know that some people are working on replacing their time-series alerting systems with this functionality. You could call it something like test-driven monitoring, I don't know, because you're actually testing the network all the time, making sure that it is in a certain state.

I have also built myself what I like to call immutable infrastructure for the network. Every time I make a change on the network, regardless of how small the change is (it could be just a change of an interface description, for example), what I do is rebuild the entire configuration, send it to the device, and tell the device: this is what I want. If there is something else, just get rid of it. With this, you can ensure that you are always in a known state. If an operator connects to the device and makes a manual change, because you're rebuilding the state from scratch, you get rid of that unknown state that the operator introduced.
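The full-replace workflow described above maps onto NAPALM's candidate configuration methods. The following is a minimal sketch under stated assumptions: `device` is an already opened NAPALM driver on a platform that supports configuration replace, `full_config` is the complete rendered configuration as a string, and `enforce_full_config` is a hypothetical helper name rather than anything the speakers showed.

```python
def enforce_full_config(device, full_config):
    """Replace the whole device configuration with `full_config`.

    Returns the diff reported by the device, or an empty string if the device
    was already in the desired state.
    """
    # Load the complete configuration as a replacement candidate, so anything
    # on the device that is not in full_config will be removed on commit.
    device.load_replace_candidate(config=full_config)
    diff = device.compare_config()
    if diff:
        # Something differs, for example a manual change by an operator.
        device.commit_config()
    else:
        # Nothing to do; drop the candidate so it does not linger on the device.
        device.discard_config()
    return diff
```

Reviewing or logging the returned diff before committing is a reasonable extra step in practice, since a replace removes everything that is not in the rendered configuration.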

Yeah. Because David was talking about the configuration, I can give one more detail, something very interesting. Our device configuration is imperative, meaning that exactly what we have in the source of truth has to be reflected on the device. But the configuration has diverged over the years, a bit... a lot. And there are still some parts of the configuration that are not there yet, that are not yet ready to have configuration changes applied automatically, live. Because of this, we're using the NAPALM functionality to retrieve the diff, the configuration difference. At specific intervals, we run a tool that determines the diff between what we have in the source of truth and what we actually have on the device, and sends an email reminding you that, okay, this is not correct: you need either to update your source of truth or to make your configuration on the device consistent with it.
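A periodic drift check of the kind described above can be approximated with the standard library. Note that this sketch swaps in Python's `difflib` on two text configurations instead of the device-side diff that NAPALM retrieves, and it only builds the reminder text rather than sending an email; `config_drift`, `drift_reminder`, and the sample configurations are hypothetical.

```python
import difflib


def config_drift(source_of_truth, running_config):
    """Return a unified diff between the intended and the actual configuration."""
    diff = difflib.unified_diff(
        source_of_truth.splitlines(),
        running_config.splitlines(),
        fromfile="source_of_truth",
        tofile="device",
        lineterm="",
    )
    return "\n".join(diff)


def drift_reminder(hostname, source_of_truth, running_config):
    """Build the reminder text, or return None when the device is consistent."""
    drift = config_drift(source_of_truth, running_config)
    if not drift:
        return None
    return (
        f"Configuration drift on {hostname}.\n"
        "Update the source of truth or fix the device:\n"
        f"{drift}"
    )


intended = "hostname r1\nntp server 10.0.0.1"
actual = "hostname r1\nntp server 10.0.0.2"
message = drift_reminder("r1", intended, actual)
```

A scheduler would call `drift_reminder` for each device at the chosen interval and hand any non-empty result to whatever notification channel is in use.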

And so we've been using a lot of different networking acronyms and talking about various aspects of doing network automation and managing types of networks. So for somebody who's interested in learning more about the fundamentals of network management and network operations, what are some of the resources that you would recommend that they look to?

So, for example, networktocode.com has lots of resources and some labs. There's actually a Slack community behind it, which is very active. Both Mircea and I are in there. We actually have a NAPALM channel where lots of people are asking questions and helping each other all the time, and just chatting, which is kind of fun. There is also ipspace.net, which has a lot of webinars and information about automation as well. pynet.twb-tech.com is also a great resource if you're familiar with Netmiko; its owner is the author of Netmiko. And there is now also an O'Reilly book called Network Programmability and Automation, which I haven't actually read myself, but I know the authors are really good, so I'm confident that the content will be of high quality.

And, Mircea, do you have any resources to recommend?

Yeah. I would also recommend packetpushers.net. There's a lot of information there, in the podcast and so on, specifically about networking and also network automation. I sometimes blog on my own website, mirceaulinic.net, also about networking and network automation. I'm trying to post as often as possible, although I don't really manage to.

Yeah. Keeping blogs up to date is definitely more effort than it seems when you first get started.

Yes.

Definitely.

So for anybody who wants to follow what you're both up to and keep up to date with the work you're doing and with the state of napalm, your contact information is in the show notes, so I'll refer people there. And with that, I'll move us to the picks.

And so I've got two picks this week that are germane to the topic at hand. One of them is an IETF RFC called The Twelve Networking Truths. That is actually much lighter reading than any of the other RFCs that I've perused in the past, and it's quite entertaining, so I definitely recommend people take a look at that one. And then the other one is a post called Falsehoods Programmers Believe About Networking, which is just a bunch of contradictory truths, all of which are factual and all of which are quite entertaining as you read them together. So I definitely recommend people take a quick look at that as well.

And with that, I'll pass it to you. Do you have any picks for us today, David?

Yeah. I've been reading, or rather listening to, the Fear Saga. It's a sci-fi series, and it's just great. I really recommend it if you're into sci-fi. Also, I recently discovered VR. I think it's just great. I think that it's not going to be AI; it's actually going to be VR that brings humanity to an end.

And, Mircea, do you have any picks for us this week?

For me, the week has actually just started. I was quite busy yesterday, but I'm now reading a book that really calms me down and makes me feel good. It's called the daily zen or the zen book, and I get a lot of good advice there.

Alright. That sounds like something I'll be needing to take a look at as well. So with that, I would like to thank you both for taking the time out of your day to join me and tell me about NAPALM, the work that you guys are doing in your respective positions, and how you're using NAPALM to manage the networks that keep the Internet running. So I appreciate your time, and I hope you enjoy the rest of your days.

Yeah. Thank you.

Thanks. Thanks for the invite.

Cheers.

The Python Podcast.init

