Duplicity with Kenneth Loafman

Hello, and welcome to podcast dot in it, the podcast about Python and the people who make it great. I would would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out linode@linode.com/podcastin

it, and get a $20 credit to try out their fast and reliable Linux virtual servers for running your app or trying out something you hear about on the show. You can visit our site at www.podcastinnit.com

to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media. Your host as usual is Tobias Macy, and today I'm interviewing Kenneth Lofman about the duplicity project. So, Kenneth, could you please introduce yourself?

Yes. Hi. This is Ken Lofman. I've been a programmer for over 50 years. I'm retired now officially.

This is my

avocation as well as my vocation.

So really that's,

about it for now. I've I've gotten experience in everything from

micro from embedded systems all the way up to supercomputers.

So it's out there. I've talked to it.

Yeah. Having been in computing for that long, I'm sure it must be pretty astounding to look at the evolution over the years because there's been such drastic shifts in capacity and capability.

So I'm sure that, you know, the more things change, the more they stay the same.

Yeah. That is absolutely the truth.

The funny part is that

there seems to be a 10 year,

roughly,

10, 12 year cycle on

terminology.

Some of the things I learned earlier,

have been resurfaced

and

resubmitted

or we bought out. I don't know what you want to call it, but under a different name.

I had forgotten what we had called it earlier, but,

not 15, 20 years ago, I had to buy a book,

$55

textbook

because somebody used the term red black tree.

It was something I was familiar with, but not under that terminology.

So,

I gotta say, I think some things get recycled

in this field. So that's just a fun observation.

And do you remember how you first got introduced to Python?

Python came about. A friend of mine and I were working on some side projects.

At the time, I was working in

c plus plus,

and

he introduced me to Python.

I got interested in it. This was

roughly

2000, 2001.

My first Python version was 1.5.2.

Very simplistic back then. Mainly, I've been doing Python since then. I've I've done some work in Java, and I've done some playing around with c plus plus 11

because of all the interesting things they have in there. But I'd say 99%

of the stuff I've done in the last

17 years has been in Python.

And the reason that I'm talking to you today is because you're the current maintainer of the duplicity project, which is a backup system written in Python. So I'm wondering if you can share a bit about the history of duplicity

and maybe describe a bit in more detail about what it actually is.

I took over duplicity

in

2007

from,

a man I mean, a man by the name of

Ben Escoda.

He was actually

a student at the time going to Stanford,

and he had written it. Had gotten tired of working on it and put out a note on,

I've forgotten, running the mailing list. It may have even been a Usenet,

list back then. But he asked for somebody to take over the maintenance,

and I did. He and

someone by the name of Jinty, j I n t y, I never found out his last name, had been working on the project together.

I took over maintenance

in 2007, and it's gone slowly since then. I found out that I didn't quite have the time I thought I had to work on this stuff. So it ended up being,

a background

project and quite often went for, a week or 2 at a time without me even having time to touch it. Especially during the,

the 2000.

We've done a couple of start up companies, and those can be rather,

rather time consuming. That's a nice way of putting it.

Yeah. It can be time consuming, and, it was supposed to be a background project that was already working fairly well. And a lot of the contributions

over the last well, actually, over the last

10 years since I've had it's been 10 years now, short 2 months. I was just looking at the, change log, and

the very first entry I have in the change log is 2007,

May 24th.

So this is 10 years minus 2 months,

roughly that I've been working on it. It is intended to be

a backup

system to support

multiple back ends and the main thing is that it supports

really dumb back ends. Right now we have

about 20 different back ends, web dev,

FTP, FTPS,

SSH, SSA, HSCP,

SFTP,

Amazon Cloud,

for a couple of versions of those, like, Backblaze,

and a few, you know, quite a few others. Most of the back ends, I didn't

write, and I barely know how they work. But the requirements for duplicity are fairly easy, and that is they must be able to list

the directory.

They must be able to put a file, to be able to get a file, and to be able to delete a file.

And we

don't ask for anything else. So all of those

protocols that I mentioned can do more than that, but we require actually less

than they can provide. The reason for the simplicity is so that you can have multiple back ends. Now, simplicity can do much, much more and can probably work better if we expanded

the possibilities of the back ends

to, for example, we could put the file out there

in a partial form and then finally rename it. That would be 1 of the handiest things to have because sometimes things fail and we don't know it. The

protocol returns a non error situation

and and the success, and we think it has succeeded and that the file is incomplete for some reason.

So that kind of thing, if we had more control over the back end, we would have, I think, a tighter situation. But the major thing is it does encryption

as it as it works. It does

what we call bandwidth efficient

incremental backups. In other words, we use the same algorithm

as parsing does. We take a look at the file and

the part that has changed, the block that has changed, gets transmitted rather than the entire file. And

an incremental is built up a set of those blocks for each file, plus any new files, plus any deletions, etcetera. So what we have is slightly more efficient way of doing

incremental backups of some fairly large files, like database files, virtual machines, that kind of thing. We can back up just the parts that's changed, and that that saves a lot of bandwidth. And, really, that's our main benefits. You can turn off you can turn off the encryption. You can turn off impression. We have a lot of options to control what's going on, how many backups you keep, how many levels of backup you keep. We we can tell you by time whether to keep a backup or by how many count how many backups you have, that kind of thing. Thing. So it's fairly mature right now. There are parts that I would like to rewrite

because of the error handling. The problem is the back the use of iterators

got quite involved in the original, and I haven't gone back to fix that.

Sometimes you get iterators

that are nested

quite deeply. And when you have an error, it's

difficult to well, most of the time, it's impossible to recover from the error. If you can retry the error and have it fit have it go through, then, yes,

that's recoverable.

But if you fail during the iteration, you've lost essentially everything,

and you you have no way of going right now of going back through and reinitializing

the various iterators that you just popped out of. So that's

that's a weakness of Python and probably a weakness of other languages that have iterators. I think

the future of duplicity is probably going to be a complete rewrite. I would still say in Python.

I might put

a few things more into

c, but right now Python is very good about its c interface. And so the time consuming items

are the encryption

and the libard and the rsync algorithm

and both of those are already external modules in CICE so we can use those and get the speed we need. Very little that Python needs outside

make it faster because we're still limited by bandwidth

and by, well, bandwidth mainly. Yeah. A lot of times when people start talking about the speed of Python, they are looking mainly at just the computational capabilities where a lot of times the actual slowest part of your program is dealing with network

and disk IO, things like that. And so I imagine that if and when you do get around to a rewrite, you could probably make some fairly good use of the async capabilities that have been introduced in the newer versions. Yeah. Right now, I haven't investigated 3,

Python 3 yet, but I understand it still has the global interpreter lock. And that's not too much of a problem, but it's,

it would be an impetus for me if I was,

trying to truly optimize things. I would probably if I've done a if I was going to do a complete rewrite, I would do it in c plus plus just knowing that I wouldn't have to worry about some interpreter lock or something like that. But I doubt that that's ever gonna happen. It's just, I believe we can get 95,

98%

of the speed we need out of Python

and judicious use of,

threading or multitasking.

And going back to your point about the fact that the back ends for duplicity are largely just sort of dumb storage. I think that that is at least a great deal of the appeal for myself because a number of the other backup solutions that you look at require a client server capability where the server is what encapsulates all of the intelligence. And so it makes it a little bit more of a setup overhead from an operational perspective. Whereas with the duplicity, you can just put it on a host, run the backup,

and then not have to worry about setting up any additional pieces, particularly if you're running the backup on ephemeral instances. So for instance, 1 of the things that I'm using at my work is setting up an instance in, e c 2, running duplicity to back up the files from a different, host. So, you know, doing a database dump and then running the backup to s 3 and terminating the instance without necessarily having to worry about keeping that around for perpetual capability. So it simplifies the overall

architecture without having to set up that client server and maintain 2 different pieces.

Yeah. There is a a lot of that kind of simplicity that is is needed. I think the only back end that actually uses,

any kind of client server is the rsync backend itself. But even that is a very simplistic use of rsync. So all the work is done inside of duplicity,

and so it's really a a copy of the file, not

through the rsync protocol, not a, rsync

to get copy of the file.

And at face value, the idea of backing up files generally seems pretty straightforward because, you know, conceptually, it's just moving 1 moving files from 1 location to another. But there's a lot of incidental complexity that comes into play when you start thinking about the long term requirements of those backups, making sure that they are valid, making sure that you can restore them. The fact that files change from backup to backup, so being able to do the incremental backups that you mentioned. I'm wondering if you can describe a bit about the internal architecture of duplicity and

maybe describe how it's able to handle the wide variety of use cases that it's put to. Okay. Internal duplicity is actually fairly simple. It,

uses

the BIVAR sync module and signature file. You find

essentially the areas of the of whatever file has changed. Now if you're presenting it with a new file,

the effect is an empty

signature chain,

and a comparison against that yields

just additions.

So the whole process works the same regardless of whether it's a new file or an old file. And then it has the setup called the back end, and the back end will only do all the,

they call the diff TAR files,

which are really just our special usage of TAR to put together

same directory structure and so forth of the data. And then we have the signature file,

which is also a TAR file,

also in the same structure.

And there's a manifest file, which helps us the manifest file is really my very harsh

manifest. Basically, it tells us

what file the volume started on and what file the the volume ended on and so and where it ended, that kind of thing. So that when we're doing recovery, we can go instead of reading a a 1000 tar files, we can go to the 1 or 2 that we need to use to to recover the file. It has exactly 1 thread

possibility, and that is the async

upload. Once you have a viftar file built, you can send the file up while, you know, we're working on the next 1. It seems to work pretty well. It's still

a case where a lot of times

the if the file is too small you'll have

volume size is too small. You'll be building the file so fast

that basically it becomes a burp and a burp of CPU

and network activity.

That's where we need to have

some optimization

because we really do need to keep things sped up, things

flowing through the network on a continuous basis if we're going to get this thing optimized. It,

like I say, it's a if you look at it internally, it's going to have some gotchas. It works pretty well. It needs work definitely needs work on error messages and error handling. Your messages are still, for the most part, tracebacks,

Python tracebacks with some cryptic little error message, which, if you don't know

simplicity, it was pretty freaky. You know? I'm sure that people get freaked out. If you don't know Python, you get this dump

and trace back. And

you say, what the heck is that? And,

hopefully, you go ask somebody

and not give up. Usually, it's it's some simple thing that can be fixed. That's really

sort of getting into the algorithmic

details and

all that. That's about really all I can say.

It's a good overview of how it works and the different pieces that put it together. For anybody who wants to dig deeper into the actual algorithmic pieces of it, I can add some references to the show notes. And if you have any particular,

sites or documentation that would help with that, I can, add that in after we're done recording.

Okay. Yeah. I think the for somebody that really wants to look into it, yeah, we can do that very easily. There's a couple of sites that especially with the Libarson

algorithm.

I remember the guy's name. His last name is Poole. Supports that, and

he's also a,

duplicity user or at least he is, enough of a user that he submits

bug reports and improvements periodically.

We have several users that are, very active improvements

to the package. Like I say, all of the 20 or so back ends were written by somebody other than myself. I wrote

1 of those back ends myself,

and that was it. The rest I have

modified after it's after having to study the the party, protocol.

Some I'd say most of the back ends have just performed. The, original author has done enough to get by.

And I would say, in general, the normal duplicity user is either using it from

a package like deja doop or from duply and never really encounters

any options outside of what he needs to get the job done. So like I said, the back ends are simplistic. They just seem to be work they seem to be working fairly well.

And you mentioned that the default configuration

will encrypt the backed up files.

So I'm wondering if you can just briefly discuss

the importance

of having those files encrypted even if you are controlling the end storage location as well. If you're physically

controlling the storage location, like you have a large array of disks or something like that, USB drives,

something like that, and you have good physical security, then there is no reason for encryption.

As soon as I can get to that collection of drives which you have, which has no encryption, I now own your data. So even if you are operating at home or in an office that is secured with security guards

and so forth, encryption keeps

anyone well, it doesn't keep anyone, but it keeps most people from getting a hold of your data even though you have you think you have physical control of it. Sending the data over network to something like 1 of the cloud services

is normally done

encrypted. So you actually have encryption on the data itself and you have encryption via encryption via SSL. Well, 1 of the things I've been doing in my since I've since I've retired is I've been taking some courses in ethical hacking. And I have learned just how pathetically

easy it is to

do a man in the middle attack on a wireless

connection.

So I realized that anybody sitting outside my house

or outside my business, if I had a powerful wireless

and was using wireless to communicate from, say, a laptop in a meeting room to an unsecured

server, they could decrypt the entire thing. It's just

pathetically,

the security is poor on wireless. It's

a little better with Ethernet.

Again, going back to the

to physical security because that's really the only security well, now go back to the old saying. The only really secure computer is a computer turned off, sitting in a vault that has no electrical access. Kind of useless data, but it's secure. So it's a hassle for a lot of people to think about doing PGP keys,

GPG keys,

remembering passwords,

that kind of thing. There are a lot of password

managers out there, that could help with that. They're good. You give them a good strong password, and you don't really have to remember all of your passwords. If you wanted

to encrypt everything with strong passwords or if you wanted to encrypt everything with strong ease that is very good. It can make it so that the only people that can, decrypt your data is, well, potentially the the NSA, but they probably have easier ways to get your data going after your backups. So I would say that not encrypting your data is probably just reckless.

Our on your system, you probably do your banking.

You probably do bill paying. You probably have credit card information.

You probably have

Social Security numbers,

things written down about your parents, things written down about your children that you don't want out. You're just gonna want known to the bad world. And if you operate over the Internet without encryption,

then you really are opening yourself up to that kind of,

that kind of issue where they just some stray person can get your data.

Given the fact that people are relying on backups for critical recovery or for, you know, making sure that if they make a mistake that they can get back to a previous known good copy of something. The actual mechanism that's doing the backup needs to be fairly reliable.

And so I'm wondering what are some of the processes

or tooling that you've got in place to ensure that you don't have regressions

or, sort of critical failures

in the backup tool itself?

I have a set

of unit tests. We currently only have about 400 or so of them. They don't cover most of the back ends because the authors of the back ends didn't write them, and I don't have access to the back end itself. That will help us keep away from a regression. Now we have had regressions. We have had cases where piece of codes you got fixed,

and then it got reverted back to its old form quite by accident. It hasn't happened more than 3 or 4 times, so I think that's a fairly decent statistic.

When it does happen, we try to fix it quickly and get a new get a patch out there, release, or something. Every time I do a release

to the trunk of the git repository

on Launchpad,

I do the I run those tests. And I run them run a complete set of tests on the matter whether I consider it a minor change or not. I've been burned too many times by saying this is too trivial a change to test. I learned that way

back a long time ago just by the simple fact of blowing a 3 line subroutine,

which was supposed to be a stub, to test another routine. I committed it, and it didn't work. And then very simply, I misspelled a variable.

So you'd think that a 3 line piece of code wouldn't be a problem, but it can be. And if you're putting it out there for people to use and if you're doing backups that are sometimes

hours and days long, you don't want to go to the very end and find out that we, the programmers, have introduced a bug. It is gonna mean that that entire backup is no good anymore. So we test I'd like to have more tests. I mean, in all honesty. We have quite a few of the functional tests are command line tests. In order to run the entire package, that helps some. I would like to have basically just a lot more testing, a lot more combinations

of things like GPG options

and,

access to a sample of all of the cloud services that we talked to so that we can actually test every back end that we put out. Right now, they go out. The author says, yeah. That works. We say, yeah. Okay. This is good. And then there's just kind of it's a trust thing on the back end. And mainly the tests that we do are for the core and do the back end testing enough to make sure that

we still call the back ends correctly. Open testing would be a a good thing for somebody to volunteer for if they're listening to this.

And what are some of the mechanisms

that Duplicity uses to prevent data corruption during the backup itself?

Okay. During the backup, we take an MD 5 sum of the volume, and we compare that to when we download the volume to make sure it hasn't changed. And

we use,

gzip or bzip depending on which,

option you choose. And that has fairly decent

checks inside of it

for corruption.

And the actual encryption itself is another check. If it decrypts

properly, then you have a good indication that it is,

that everything is good. What we also have what really is the kicker is the fact that we have the signature files.

Now those are very simplistic. Those are very kind of automatic things that happen. But when you verify

the backup, you look at the signature files, and you verify that the signature files are valid and you can do a data comparison if you want. Verification

is really the only way to make sure that your backup is secure. It's

100%

secure. Because of network errors, we won't detect that.

So the only rate I mean, sometimes network errors happen because of a faulty router, faulty configuration.

Job, network errors happen much like disk errors. 1 in a few 1, 000, 000, 000 are gonna have just natural errors that somehow are not detected. So what will happen is we will get a file on the remote system that has been corrupted somehow. It will fail its checksum and will stop. Or if you give it an option, it will try to ignore the error and continue

onwards. If it's

only a minor error, in other words, a file is corrupt

somewhere inside the tar archive, then you might get lucky, and we might be able to continue. Sometimes

a single bit error on an encrypted

zipped and or zip file can cause the rest of the file to be totally useless. So when I say verify, I mean to verify the backup, you have to verify the data in the backup, which means you download your entire backup and compare it against what you just did. And most people do not do that. So I can't think of anything else that we do, but the signature file itself

is the main thing.

If you if you verify and the signature says that the file is different than what it was, then it's like the MD 5 sum only on a very much smaller scale

block of data. So

you have a few you have a good indication that your backup is bad. You can redo it. There's another option, part 2, which will create

ECC files for you, which allow you to rebuild

the file in case of data corruption. Now that adds

a high cost. It's like a rate. It'll add, like, 20, 25%

of your backup size. It'll increase it by 20, 25%,

but you'll have files that can be recovered. Now that's assuming that the backup file and the par 2 file created for that backup file are not both corrupted. So I mean, there's worst case scenarios for everything. But for everything I would suggest

to do is to verify periodically

and to keep multiple backups. Keep my suggestion for best practices is 1 week. Every week, you do a full backup, and every day, you do incremental. That's what I've been doing for years, and then it's, get me from get me from having too many problems. Most of the problems,

in all honesty, most of the problems that you need a backup for are not machine related, are not error related. I mean, not machine error related, I should say. They're human error related. Most of the backup

requests are made or most of the restore requests, rather, are made because,

whoops, I deleted that directory. I didn't mean to. It's not a machine error. It's not a networking error or anything like that. It's just plain old human error. It's probably 95% of the reasons for backup.

And so what would you say are some of the most difficult or complex aspects of the problem space that Duplicity is dealing with?

I'd say the most complex

that we deal with

are people doing long strings of incremental backups. The way duplicity works is

every backup past the full backup is dependent upon the previous

incrementals plus the full backup in order to make the,

rebuilding

the file that has been changed throughout your backup time

between the full backup. So if you have a file or files that are constantly changing, like, say, a database file, we may back up 50 gig of it on the full backup and then 2 gig every night because that's all that's changed. But if you think, well, it's gonna take me too long to do this full backup. I'm just gonna run from a couple of months. Well, now you're relying on the integrity of the full backup and 60,

incremental backups,

all of which are required to be consistent. That can be a problem. Like I say, once

networks make errors, machines make errors, sometimes we don't catch them, especially given the fact that they sit on a remote system. The file could have gotten corrupted on the way over. The file could have gotten corrupted over there. So that's the most complex 1 I have to deal with. We have a solution where we can try to work through that. We have manual mechanisms

of rebuilding from the tar files,

decrypting it yourself,

compressing it yourself, and trying to build from the tar files. But it's problematic,

very time consuming, so it has be an extremely important file. That's the most complex 1. It's it's really dealing with the human side of it that is

not wanting to follow procedures, I guess.

Yeah. Just like with most things in computing, the the problems are dealing with people.

Yeah. We we put the backup out there so that it's there in case you do delete your file or you or in case your machine goes kabooey or what happens. But,

we can guarantee that there will be, on occasion, errors just because of the fact that you are going over a network and you are going to a foreign machine. Disk errors creep in and network errors creep in. I I don't remember the exact numbers. We're in the we're in the billions of operations,

so it's rare, very rare. But if you do a lot of work with computers, you'll get a bad file back from a from a storage provider, and nothing that you used will have reported problem.

And when I was doing the research for the show, I noticed that you've got a proposal for a new archive format to replace tar. So I'm wondering if you can just briefly describe the motivation for that and some of the design considerations that have been made.

What we need

is a good way to do an easy indexing of the files. ARM, you can go into it. And if it's local, it's very fast to do tar, you know, t f file name and,

get an index of the tar file and so forth. We can get a a list of the files,

by looking at the signature files. But what we really need is a better way to to do the indexing, and we need a better way. People have really wanted to have something that will allow them to delete

a file all the way back to the full backup. That's

something I don't know that can be done without a client server model or without having the file local. But there are the TAR file the TAR file has served us very well for now, and I am not inclined right now to change it. Proposal is good, but it does end up being,

something that other tools can't touch, and so you would have to end up writing something to recover that file. Whereas there are good tools

to handle corrupted tar files out there, but a corrupted duplicity file that was a special format would have to be something

specially written. So I'm more inclined to stay with the TAR information format right now and

work on a better complete manifest. So you could do a a list of files that you have, in your backup and a few other improvements like that, and I think you've got a complete product. So, no, I don't think we will be writing our own format. There's just too many advantages to having

something that in an emergency,

you know, in a situation where the file is corrupted or something like that, you can restore it with tools you already have or which are available immediately from the net. You know, it's just a lot more works than I wanna get into right now.

So are there any other topics that you think we should cover before we close out the show? No. Not really. I,

I want to if you don't mind, I'd like to say hello and thanks

to Michael Terry

and to,

Aaron White house

and to

Edgar Solden. They have been 3 of the primary

supporters of duplicity, and I'm probably leaving somebody out. But,

I don't have any notes in front of me, so my mind is, that's those are the 3 that do the most work in keeping this thing going besides me. And, I'd like to say thanks to them. They've done a lot of hard work. They've done a lot of frustrating

bug chasing

and frustrating user interaction. Very happy they took on. So I've had to do it myself.

So for anybody who wants

to get in touch with you or follow your work on the project, I'll have you send me your preferred contact information. Sure. And with that, I'll bring us into the picks. For my pick this week, I'm gonna choose the movie Passengers.

I watched that, recently with my wife and it was a pretty interesting movie. Had a few interesting ideas around sort of ethics

and the complications of space travel. Pretty well done movie. I enjoyed watching it, and I think that others would like it as well. So with that, I'll pass to you. Do you have any picks for us this week, Ken?

I don't have a real pick for this week per se, believe it or not, but I am process of catching up on a lot of old TV shows that I have watched not watched because I was so busy. I would recommend NCIS to anybody, I think. But do it on Netflix, not on commercial

network. I'm willing to wait a year for the next series of net of NCIS to come out. It's a very good series. It's very well done. I have not seen a lot of movies lately. My taste in movies goes to science fiction, but a lot of times, it goes to

the weirder science fiction, like, Plan 9 from Outer Space and,

some of those b movies that are quite funny to watch actually. Pretend to be serious, but they're quite funny to watch.

Well, I appreciate you taking the time out of your day to tell us more about duplicity and the work that you've been doing with it. It's definitely a very useful tool and 1 that I rely on myself for all my backups. So thank you for that and, thank you for continuing to support it. And I hope you enjoy the rest of your day. Thank you. And you too.

The Python Podcast.init

Summary

Preface

Interview

Contact

Picks

Links

The Python Podcast.__init__