Rsync snapshots
Next we need to make another “copy” of the data we backed up. This copy is used as the previous version when we update this version. To do this, we copy the target directory to a .<n> version, where n is any digit. If we do this using hard links, the second “copy” won’t take up any space, so we’ll use the -l option of cp to tell it to use hard links when it makes the copy (cp -al /backups/home.0 /backups/home.1). Now we have two identical copies of our source directory on our backup system (/backups/home.0 and /backups/home.1) that take up the size of only one copy.
Now that we’ve copied the backup to another location, it’s time to make another backup. To do this, we need to identify any files in the source that are new or changed, remove them in the target directory if they’re there, then copy them to the target directory. If it’s an updated version of a file already in our target directory, it must be unlinked first. We can use the rsync command to do this all in one step (rsync --delete -av /home/. /backups/home.0/). This step is the heart of the idea. By removing a file that was already in the backup directory (/backups/home.0), we sever the hard link to the previous version (/backups/home.1). But since we used hard links, the previous version is still there (in /backups/home.1), and the newer version is in the current backup directory (in /backups/home.0).
To make another backup, we move the older directory (/backups/home.1) to an older number (/backups/home.2), then repeat the hard link copy, unlink, and copy process. We can do this as many times as we want and keep as many versions as we want. The space requirements are modest; the only increase in storage is for files that are new with that backup.
An Example
Let’s back up the directory /home. If we take a look in this directory, we find three files:
$ echo "original myfile" >/home/myfile.txt
$ echo "original myotherfile" >/home/myotherfile.txt
$ echo "original mythirdfile" >/home/mythirdfile.txt
$ ls -l /home
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
Now create a copy of /home in /backups/home.0.
$ cp -a /home /backups/home.0
$ ls -l /backups/home.0
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
58 /backups
Note that each file shows a 1 in the links column, there is one copy of the /home directory in /backups, it contains the same files as /home, and the entire /backups directory is taking up 58 bytes, which is the same as the number of bytes in all three files. Now let’s create a second copy using hard links.
$ cp -al /backups/home.0 /backups/home.1
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.1:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
58 /backups
Now you can see that there are two copies of /home in /backups, each contains the same files as /home, and they still only take up 58 bytes—because we used hard links. You should also note that the links column in the ls -l listing now contains a 2. Now let’s change a file in the source directory.
$ ls -l /home/myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 /home/myfile.txt
$ echo "LET'S CHANGE MYFILE" >/home/myfile.txt
$ ls -l /home/myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 20 Jul 8 19:41 /home/myfile.txt
Please note that the size and modification time of myfile.txt changed. Now it’s time to make a backup. The process we described earlier would notice that /home/myfile.txt has changed, and that it should be removed from the backup directory and copied from the source. So let’s do that.
$ rm /backups/home.0/myfile.txt
$ cp -a /home/myfile.txt /backups/home.0/myfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.1:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
78 /backups
Now you can see that myfile.txt in /backups/home.1 has only one link (because we removed the second link in /backups/home.0), but it still has the same size and modification date as before. You can also see that /backups/home.0 now contains the new version of myfile.txt. And, perhaps most importantly, the size of /backups is now the original size (58 bytes) plus the size of the new version of myfile.txt (20 bytes), for a total of 78 bytes. Now let’s get ready to make another backup. First, we have to create the older versions by moving directories around.
$ mv /backups/home.1 /backups/home.2 $ mv /backups/home.0 /backups/home.1
Then we need to create the new previous version using cp -al.
$ cp -al /backups/home.1 /backups/home.0
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.1:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.2:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
78 /backups
Now we have /backups/home.2, which contains the oldest version, and /backups/home.1 and /backups/home.0, which both contain the current backup. Please note that the size of /backups hasn’t changed since the last time we looked at it; it’s still 78 bytes. Let’s change another file and back it up.
$ echo "LET'S CHANGE MYOTHERFILE" >/home/myotherfile.txt
$ rm /backups/home.0/myotherfile.txt
$ cp -a /home/myotherfile.txt /backups/home.0/myotherfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 25 Jul 8 19:45 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.1:
total 3
-rw-r--r-- 2 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.2:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
103 /backups
You can see that /backups/home.0 now contains a different version of myotherfile.txt than what is in the other directories and that the size of /backups has changed from 78 to 103, which is a difference of 25—the size of the new version of myotherfile.txt. Let’s prepare for one more backup.
$ mv /backups/home.2 /backups/home.3
$ mv /backups/home.1 /backups/home.2
$ mv /backups/home.0 /backups/home.1
$ cp -al /backups/home.1 /backups/home.0
Now we’ll change one final file and back it up.
$ echo "NOW LET'S CHANGE MYTHIRDFILE" >/home/mythirdfile.txt
$ rm /backups/home.0/mythirdfile.txt
$ cp -a /home/mythirdfile.txt /backups/home.0/mythirdfile.txt
$ ls -l /backups/*
/backups/home.0:
total 3
-rw-r--r-- 3 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 25 Jul 8 19:45 myotherfile.txt
-rw-r--r-- 1 cpreston mkgroup-l-d 29 Jul 8 19:51 mythirdfile.txt
/backups/home.1:
total 3
-rw-r--r-- 3 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 25 Jul 8 19:45 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.2:
total 3
-rw-r--r-- 3 cpreston mkgroup-l-d 20 Jul 8 19:41 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
/backups/home.3:
total 3
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 myfile.txt
-rw-r--r-- 2 cpreston mkgroup-l-d 21 Jul 8 19:35 myotherfile.txt
-rw-r--r-- 3 cpreston mkgroup-l-d 21 Jul 8 19:35 mythirdfile.txt
$ du -sb /backups
132 /backups
Again, the total size of /backups has changed from 103 bytes to 132, a difference of 29 bytes, which is the size of the new version of mythirdfile.txt.
The proof is in the restore, right? Let’s look at all versions of all files by running the cat command against them.
$ cat /backups/home.3/*
original myfile
original myotherfile
original mythirdfile
$ cat /backups/home.2/*
LET'S CHANGE MYFILE
original myotherfile
original mythirdfile
$ cat /backups/home.1/*
LET'S CHANGE MYFILE
LET'S CHANGE MYOTHERFILE
original mythirdfile
$ cat /backups/home.0/*
LET'S CHANGE MYFILE
LET'S CHANGE MYOTHERFILE
NOW LET'S CHANGE MYTHIRDFILE
You can see that the oldest version (/backups/home.3) has the original version of every file, the next newest directory has the modified myfile.txt, the next newest version has that file and the modified myotherfile.txt, and the most recent version has all of the changed versions of every file. Isn’t it a thing of beauty?
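As a cross-check, the first backup cycle of the walkthrough can be replayed in a scratch directory. This is a minimal sketch, assuming GNU cp and GNU stat; the temporary directory and the variable names are illustrative and are not part of the original example.

```shell
#!/bin/sh
# Replay the first backup cycle under a temporary directory, so the real
# /home and /backups are never touched (the scratch layout is an assumption).
set -eu
work=$(mktemp -d)
mkdir "$work/home" "$work/backups"
echo "original myfile" > "$work/home/myfile.txt"
echo "original myotherfile" > "$work/home/myotherfile.txt"

# First full copy, then a hard-linked "previous version".
cp -a "$work/home" "$work/backups/home.0"
cp -al "$work/backups/home.0" "$work/backups/home.1"

# Change one source file, unlink its old backup copy, and copy the new one.
echo "LET'S CHANGE MYFILE" > "$work/home/myfile.txt"
rm "$work/backups/home.0/myfile.txt"
cp -a "$work/home/myfile.txt" "$work/backups/home.0/myfile.txt"

# %h is the hard link count in GNU stat.
changed_links=$(stat -c %h "$work/backups/home.0/myfile.txt")
shared_links=$(stat -c %h "$work/backups/home.0/myotherfile.txt")
old_version=$(cat "$work/backups/home.1/myfile.txt")

echo "links on changed file:   $changed_links"
echo "links on unchanged file: $shared_links"
echo "previous version reads:  $old_version"
```

The changed file should end up with a single link in each directory, while the unchanged file keeps two, which mirrors the links column in the listings above.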
Beyond the Example
The example works and is simple to understand, but it is clunky to implement. Several of the steps are manual: creating the hard-linked directory, identifying the new files, removing the files we’re about to overwrite, and then copying the new files. In addition, our manual example does not deal with files that have been deleted from the original; they would remain in the backup. What if we had a command that could do all that in one step? We do! It’s called rsync.
The rsync utility is a very well-known piece of GPL’d software, written originally by Andrew Tridgell and Paul Mackerras. If you have a common Unix variant, you probably already have it installed; if not, you can install a precompiled package for your system or download the source code from rsync.samba.org and build it yourself. rsync’s specialty is efficiently synchronizing file trees across a network, but it works well on a single machine too. Here is an example to illustrate basic operation.
Suppose you have a directory called <source>/ whose contents you wish to copy into another directory called <destination>/. If you have GNU cp, you could make the copy like this:
$ cp -a source/. destination/
The “archive” flag (-a) causes cp to descend recursively through the file tree and to preserve file metadata, such as ownerships, permissions, and timestamps. The preceding command first creates the destination directory if necessary. It is important to use <source>/. and not <source>/* because the latter silently ignores top-level files and subdirectories whose names start with a period (.). Such files are considered hidden and are not normally displayed in directory listings, but they may be important to you!
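The difference between source/. and source/* is easy to see in a scratch directory. The sketch below assumes GNU cp and a shell whose globbing skips dot files by default; the file names are made up for illustration.

```shell
#!/bin/sh
# Compare "source/." with "source/*" when the source holds a hidden file.
set -eu
work=$(mktemp -d)
mkdir "$work/source" "$work/dest_dot" "$work/dest_glob"
echo "visible" > "$work/source/notes.txt"
echo "hidden"  > "$work/source/.profile"

cp -a "$work/source/." "$work/dest_dot/"    # copies everything, dot files included
cp -a "$work/source/"* "$work/dest_glob/"   # the glob never expands to .profile

echo "copied with /. :"; ls -A "$work/dest_dot"
echo "copied with /* :"; ls -A "$work/dest_glob"
```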
However, if you make regular backups from <source>/ to <destination>/, running cp every time is not efficient because even files that have not changed (which is most of them) must be copied every time. Also, you would have to periodically delete <destination>/ and start fresh, or backups of files that have been deleted from <source>/ will begin to accumulate. Fortunately, where cp falls short at copying mostly unchanged filesystems, rsync excels. The rsync command works similarly to cp but uses a very clever algorithm that copies only changes. The equivalent rsync command would be:
$ rsync -a source/. destination/
rsync is persnickety about trailing slashes on the source argument; it treats <source> and <source>/ differently. Using the trailing /. is a good way to avoid ambiguity.
You’ll probably want to add the --delete flag, which, in addition to copying new changes, also deletes any files in <destination>/ that are absent (because they have presumably been deleted) from <source>/. I also like to add the verbose flag (-v) to get detailed information about the transfer. The following command is a good way to regularly synchronize <source>/ to <destination>/:
$ rsync -av --delete source/. destination/
rsync is good for local file synchronization, but where it really stands out is for synchronizing files over a network. rsync’s unique ability to copy only changes makes it very fast, and it can operate transparently over an ssh connection. To rsync from /<source>/ on the remote computer example.oreilly.com to the local directory /<destination>/ over ssh, you could use the command:
$ rsync -av --delete username@example.oreilly.com:/source/. /destination/
That was pull mode. rsync works just as well in push mode from a local /<source>/ to a remote /<destination>/:
$ rsync -av --delete /source/. username@example.oreilly.com:/destination/
As a final note, rsync provides a variety of --include and --exclude options that allow for fine-grained control over which parts of the source directory to copy. They are helpful if you wish to leave certain files out of the backup, such as any file ending in .bak, or certain subdirectories. For details and examples, see the rsync man page.
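As a rough illustration of the exclude options, the following sketch skips files ending in .bak and any directory named cache. The directory name cache and the scratch layout are assumptions chosen for the example; the pattern rules in the rsync man page are the authority on matching.

```shell
#!/bin/sh
# Exclude *.bak files and directories named "cache" from a local sync.
work=$(mktemp -d)
mkdir -p "$work/source/cache" "$work/destination"
echo "keep" > "$work/source/notes.txt"
echo "skip" > "$work/source/notes.txt.bak"
echo "skip" > "$work/source/cache/tmp.dat"

rsync -a --delete \
    --exclude='*.bak' \
    --exclude='cache/' \
    "$work/source/." "$work/destination/"

ls -A "$work/destination"
```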
Understanding Hard Links
Hard links are an important Unix concept to understand if you’re going to use this technique. Every object in a Unix filesystem, which includes every directory, symbolic link, named pipe, and device node, is identified by a unique positive integer known as an inode number. An inode keeps track of such mundane details as what kind of object it is, where its data lives, when it was last updated, and who has permission to access it. If you use ls to list files, you can see inode numbers by adding the -i flag.
$ ls -i foo
409736 foo
What does foo consist of? It consists of an inode (specifically, inode number 409736) and some data. Also— and this is the critical part—there is now a record of the new inode in the directory where foo resides. It now lists the name foo next to the number 409736. That last part, the entry of a name and inode number in the parent directory, constitutes a hard link.
Most ordinary files are referenced by only one hard link, but it is possible for an inode to be referenced by more than one. Inodes also keep a count of the number of hard links pointing to them. The ls -l command has a column showing you how many links a given file has. (It is the number to the left of the user ID that owns the file.)
$ ls -l foo
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 foo
To make a second link to a file within a filesystem, use the ln command. For example,
$ ln foo bar
$ ls -i foo bar
409736 foo
409736 bar
Hard links, like the one illustrated here, can be created only for files within the same filesystem.
Now the names foo and bar refer to the same file. If you edit foo, you are simultaneously editing bar, and vice versa. If you change the permissions or ownership on one, you’ve changed them on the other too. There is no way to tell which name came first. They are equivalent.
When you remove a file, all you’re removing is a link to that inode; this is called unlinking. An inode is not actually released until the number of links to it drops to zero. For example, ls -l now tells us that foo has two links. If we remove bar, only one remains.
$ ls -l foo
-rw-r--r-- 2 cpreston mkgroup-l-d 16 Jul 8 19:35 foo
$ rm bar
$ ls -l foo
-rw-r--r-- 1 cpreston mkgroup-l-d 16 Jul 8 19:35 foo
The situation would have been the same if we’d removed foo and run ls -l on bar instead. If we now remove foo, the link count drops to zero and the operating system releases inode number 409736.
Let’s summarize some of the important properties of hard links now. If foo and bar are hard links to the same inode:
- Changes to foo immediately affect bar, and vice-versa.
- Changes to the metadata of foo—the permissions, ownership, or timestamps—affect those of bar as well, and vice-versa.
- The contents of the file are stored only once. The ln command does not appreciably increase disk usage.
- The hard links foo and bar must reside on the same filesystem. You cannot create a hard link in one filesystem to an inode in another because inode numbers are unique only within filesystems.
- You must unlink both foo and bar (using rm) before the inode and data are released to the operating system.
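These properties can be checked directly. The following is a small sketch that assumes GNU stat (%i prints the inode number, %h the link count) and works in a temporary directory; the file contents are arbitrary.

```shell
#!/bin/sh
# Verify the hard link properties listed above for two names, foo and bar.
set -eu
work=$(mktemp -d)
cd "$work"

echo "first version" > foo
ln foo bar                               # second name for the same inode

inode_foo=$(stat -c %i foo)
inode_bar=$(stat -c %i bar)
links_before=$(stat -c %h foo)

echo "appended through bar" >> bar       # in-place change made via bar
foo_lines=$(wc -l < foo)                 # the change is visible through foo

rm bar                                   # unlink one name; the data survives
links_after=$(stat -c %h foo)

echo "inodes: $inode_foo $inode_bar"
echo "links before rm: $links_before, after rm: $links_after"
echo "lines in foo: $foo_lines"
```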
Hard Link Copies
In the previous section, we learned that ln foo bar creates a second hard link called bar to the inode of file foo. In many respects, bar looks like a copy of foo created at the same time. The differences become relevant only when you try to change one of them, examine inode numbers, or check disk space. In other words, so long as in-place changes are prohibited, the outcomes of cp foo bar and ln foo bar are virtually indistinguishable to users. The latter, however, does not use additional disk space.
Suppose we wanted to make regular backups of a directory called <source>/. We might make the first backup, a full copy, using rsync:
$ rsync -av --delete source/. backup.0/
To make the second backup, we could simply make another full copy, like so:
$ mv backup.0 backup.1
$ rsync -av --delete source/. backup.0/
That would be inefficient if only a few of the files in <source>/ changed in the interim. Therefore, we create a second copy of backup.0 to use as a destination. However, since we’re using hard links, it won’t take up any more space. You can use two different techniques to make backups in this way. The first is a bit easier to understand, and it’s what we used in the example. The second streamlines things a bit, doing everything in one command.
GNU cp provides a flag, -l, to make hard-link copies rather than regular copies. It can even be invoked recursively on directories:
$ mv backup.0 backup.1
$ cp -al backup.1/. backup.0/
$ rsync -av --delete source/. backup.0/
Putting cp -al in between the two backup commands creates a hard-linked copy of the most recent backup, and then you rsync new changes from <source>/ to the backup.0. rsync ignores files that have not changed, so it leaves the links of unchanged files intact. When it needs to change a file, it unlinks the original first, so its partner is unaffected. As mentioned before, the --delete flag also deletes any files in the destination that are no longer present in the source (which only unlinks them in the current backup directory).
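Why unlinking first matters can be shown without rsync at all. In this sketch (assuming GNU cp; the file names a.txt and b.txt are invented), one file in the new snapshot is overwritten in place and the other is removed and recreated. Only the second approach leaves the previous snapshot intact.

```shell
#!/bin/sh
# Contrast an in-place overwrite with unlink-then-write in a hard-linked copy.
set -eu
work=$(mktemp -d)
mkdir "$work/backup.1"
echo "old contents" > "$work/backup.1/a.txt"
echo "old contents" > "$work/backup.1/b.txt"
cp -al "$work/backup.1" "$work/backup.0"     # hard-linked snapshot

# In-place write: the shared inode is truncated and rewritten.
echo "new contents" > "$work/backup.0/a.txt"

# Unlink first: backup.0 gets a fresh inode, backup.1 keeps the old data.
rm "$work/backup.0/b.txt"
echo "new contents" > "$work/backup.0/b.txt"

clobbered=$(cat "$work/backup.1/a.txt")
preserved=$(cat "$work/backup.1/b.txt")
echo "backup.1/a.txt now reads: $clobbered"
echo "backup.1/b.txt now reads: $preserved"
```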
rsync is now taking care of a lot for us. It decides which files to copy, unlinks them from the destination, copies the changed files, and deletes any files it needs to. Now let’s take a look at how it can handle the hard links as well. rsync now provides a new option, --link-dest, that will do this for us, even when only metadata has changed. Rather than running separate cp -al and rsync stages, the --link-dest flag instructs rsync to do the whole job, copying changes into the new directory, and making hard links where possible for unchanged files. It is significantly faster, too.
$ mv /backups/home.0 /backups/home.1
$ rsync -av --delete --link-dest=../home.1 /home/. /backups/home.0/
Notice the relative path of ../ for home.1/. The path for the --link-dest argument should be relative to the target directory (in this case, /backups/home.0). This has confused many people.
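To see the relative --link-dest path at work, here is a scratch-directory sketch. It assumes rsync and GNU stat are installed; in it the previous snapshot is named home.1 and the new one home.0, and both names, like the file names, are assumptions of this example.

```shell
#!/bin/sh
# Build two snapshots with --link-dest and inspect the resulting link counts.
work=$(mktemp -d)
mkdir "$work/home" "$work/backups"
echo "unchanged" > "$work/home/keep.txt"
echo "version 1" > "$work/home/edit.txt"

# Previous snapshot: a plain full copy.
rsync -a --delete "$work/home/." "$work/backups/home.1/"

# Change one file (size differs, so rsync will notice), then snapshot again.
echo "version 2, longer" > "$work/home/edit.txt"

# ../home.1 is resolved relative to the destination directory home.0/.
rsync -a --delete --link-dest=../home.1 "$work/home/." "$work/backups/home.0/"

keep_links=$(stat -c %h "$work/backups/home.0/keep.txt")
edit_links=$(stat -c %h "$work/backups/home.0/edit.txt")
echo "keep.txt links: $keep_links"
echo "edit.txt links: $edit_links"
```

The unchanged file should be shared between the two snapshots, while the edited file is stored separately in each.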
A simple example script
The following script can be run as many times as you want. The first time it runs, it creates the first “full backup” of /home in /backups/home.inprogress and moves that directory to /backups/home.0 upon completion. The next time through, rsync builds /backups/home.inprogress again, hard-linking unchanged files to their copies in /backups/home.0 and copying only the files that have changed. The script then rotates the directories so that three versions of the backups are kept.
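# Build the new snapshot, hard-linking unchanged files against home.0
rsync -av --delete --link-dest=../home.0 /home/. /backups/home.inprogress/
# Drop the oldest snapshot, then shift the remaining ones up by one
[ -d /backups/home.2 ] && rm -rf /backups/home.2
[ -d /backups/home.1 ] && mv /backups/home.1 /backups/home.2
[ -d /backups/home.0 ] && mv /backups/home.0 /backups/home.1
# Promote the finished snapshot and update its timestamp
[ -d /backups/home.inprogress ] && mv /backups/home.inprogress /backups/home.0
[ -d /backups/home.2 ] && touch /backups/home.0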
This is a very basic script just for the purpose of understanding how it works. If you’re serious about implementing this idea, you have two options. Either go to Mike Rubel’s web page at http://www.mikerubel.org/computers/rsync_snapshots or look at the section on rsnapshot later in this chapter. It is a full implementation of this idea, complete with a user group that supports it.
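If you want to keep a different number of versions, the rotation step generalizes naturally. The function below is only a sketch: the name rotate_snapshots and its two arguments are hypothetical, and it covers just the renaming, not the rsync run. It is exercised here on empty scratch directories.

```shell
#!/bin/sh
# rotate_snapshots BASE KEEP (hypothetical helper)
# Shifts BASE.0 .. BASE.(KEEP-2) up by one, discarding BASE.(KEEP-1),
# then promotes BASE.inprogress to BASE.0.
set -eu

rotate_snapshots() {
    base=$1
    keep=$2
    i=$((keep - 1))
    if [ -d "$base.$i" ]; then
        rm -rf "$base.$i"                 # oldest snapshot falls off the end
    fi
    while [ "$i" -gt 0 ]; do
        prev=$((i - 1))
        if [ -d "$base.$prev" ]; then
            mv "$base.$prev" "$base.$i"
        fi
        i=$prev
    done
    if [ -d "$base.inprogress" ]; then
        mv "$base.inprogress" "$base.0"   # newest snapshot becomes .0
    fi
}

# Demonstration on scratch directories with marker files.
work=$(mktemp -d)
mkdir "$work/home.0" "$work/home.1" "$work/home.2" "$work/home.inprogress"
touch "$work/home.0/was0" "$work/home.1/was1" "$work/home.2/was2" \
      "$work/home.inprogress/new"

rotate_snapshots "$work/home" 3
ls "$work"
```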
Restoring from the Backup
Because backups created this way are just conventional Unix filesystems, there are as many options for restoring them as there are ways to copy files. If your backups are stored locally (as on a removable hard disk) or accessible over a network filesystem like NFS, you can simply cp files from the backups to /home. Or better yet, rsync them back:
$ rsync -av --delete /backups/home.0/. /home/
Be careful with that --delete flag when restoring: make sure you really mean it! If the backups are stored remotely on a machine you can access by ssh, you can use scp or rsync over ssh. Other simple arrangements are also possible, such as placing the directories somewhere accessible to a web server.
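One way to be careful is to preview a restore with rsync's -n (--dry-run) flag, which reports what would be copied and deleted without changing anything. The sketch below runs against scratch directories; the file names are invented for the example.

```shell
#!/bin/sh
# Preview a restore that uses --delete before running it for real.
work=$(mktemp -d)
mkdir "$work/backup" "$work/home"
echo "saved" > "$work/backup/report.txt"
echo "created after the backup" > "$work/home/draft.txt"

# -n lists the planned actions, including deletions, but performs none.
preview=$(rsync -avn --delete "$work/backup/." "$work/home/")
printf '%s\n' "$preview"

# draft.txt is still present here; it would be removed only without -n.
ls -A "$work/home"
```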
Things to Consider
Here are a few other things to consider if you’re going to use the rsync method for creating backups.
How large is each backup?
One drawback of the rsync/hard link approach is that the sharing of unchanged files makes it deceptively hard to define the size of any one backup directory. A normally reasonable question such as, “Which backup directories should I erase to free 100 megabytes of disk space?” cannot be answered in a straightforward way. The space freed by removing any one backup directory is the total disk usage of all files whose only hard links reside in that directory, plus overhead. You can obtain a list of such files using the find command, here applied to the backup directory /backups/home.1/:
$ find /backups/home.1 -type f -links 1 -print
The following command prints their total disk usage:
$ du -hc $(find /backups/home.1 -type f -links 1 -print) | tail -n 1
Deleting more than one backup directory usually frees more than the sum of individual disk usages because it also erases any files which were shared exclusively among them. This command may report erroneous numbers if the source data had a lot of hard-linked files.
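When file names contain spaces, or the list is too long for one command line, a null-delimited pipeline is more robust. This variant is a sketch that assumes GNU find and GNU du (for the --files0-from option); it is shown on a scratch snapshot pair with made-up file names.

```shell
#!/bin/sh
# Total the files that would be freed by deleting one snapshot directory.
set -eu
work=$(mktemp -d)
mkdir "$work/home.1" "$work/home.2"
echo "only in home.1" > "$work/home.1/unique file.txt"
echo "shared" > "$work/home.1/shared.txt"
ln "$work/home.1/shared.txt" "$work/home.2/shared.txt"

# Files whose only link lives in home.1:
unique_files=$(find "$work/home.1" -type f -links 1 -print)
printf '%s\n' "$unique_files"

# Their combined disk usage; names are passed null-delimited to du.
find "$work/home.1" -type f -links 1 -print0 \
    | du -ch --files0-from=- \
    | tail -n 1
```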
A brief word about mail formats
There are a number of popular mail storage formats in use today. The venerable mbox format holds all messages of a folder in one large flat file. The newer maildir format, popularized by Qmail, allows each message to be a small file. Other database mail stores are also in use.
Of these, maildirs are by far the most efficient for the rsync/hard link technique because their structure leaves most files (older messages) unchanged. (This would be true of the original rsync/hard link method and true of rsnapshot that will be covered later in the chapter.) For mbox format mail spools, consider rdiff-backup instead.
Other useful rsync flags
If you have a slow network connection, you may wish to use rsync’s --bwlimit flag to keep the backup from saturating it. It allows you to specify a maximum bandwidth in kilobytes per second. If you give rsync the --numeric-ids option, it ignores usernames, avoiding the need to create user accounts on the backup server.
Backing up databases or other large files that keep changing
The method described in this chapter is designed for many small files that don’t change that much. When this assumption breaks down, such as when backing up large files that change regularly (such as databases, mbox-format mail spools, or UML COW files), this method is not disk-space efficient. For these situations, consider using rdiff-backup, covered later in this chapter.
Backing up Windows systems
While rsync works under cygwin, issues have been reported with timestamps, particularly when backing up FAT filesystems. Windows systems traditionally operate on local time, with its daylight savings and local quirks, as opposed to Unix’s universal time. At least some file timestamps are only made with two-second resolution. Consider giving rsync a --modify-window of 2 on Windows.
Large filesystems and rsync’s memory scaling
While it has improved in recent years, rsync uses a lot of memory when synchronizing large file trees. It may be necessary to break up your backup job into pieces and run them individually. If you take this route, it may be necessary to manually --delete pieces from the destination that have been deleted from the server.
Atomicity and partial transfers
rsync takes a finite period of time to operate, and when a source file changes while the backup is in progress, a partial transfer error (code 23) may be generated. Only the files which have changed may not be transferred; rsync completes the job as much as possible.
If you run backups only when your system is relatively static, such as in the middle of the night for an office environment, partial transfers may never be a problem for you. If they are a problem, consider making use of rsync’s --partial option.
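A backup script may want to treat a partial transfer differently from an outright failure. The helper below is a hypothetical sketch, not part of rsync: it maps exit codes to a coarse outcome, using code 23 as described above plus code 24, which the rsync man page lists as a partial transfer due to vanished source files. How a "partial" result should be handled is left to you.

```shell
#!/bin/sh
# classify_rsync_status CODE (hypothetical helper)
# Maps an rsync exit status to "complete", "partial", or "failed".
classify_rsync_status() {
    case "$1" in
        0)     echo "complete" ;;
        23|24) echo "partial" ;;   # some files were not transferred
        *)     echo "failed" ;;
    esac
}

# Typical use after a real run (commented out so the sketch has no side effects):
#   rsync -a --delete /home/. /backups/home.inprogress/
#   outcome=$(classify_rsync_status $?)

for code in 0 23 24 12; do
    echo "exit $code: $(classify_rsync_status "$code")"
done
```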
Implementations
The backup tools listed below implement Mike’s hard-linked backup approach through the use of rsync.
Windows/Linux version with version history
apt-get install openjdk-7-jre
Hard link backups:
rsync --archive --one-file-system --hard-links --human-readable --inplace --numeric-ids --delete --delete-excluded --exclude-from=excludes.txt --link-dest=/data/backup/_home_asylum.2005-07-25.15-32-42 asylum:/home/asylum/ /backup/rsync/asylum/_home_asylum.incomplete