Roundup of Data Backup and Archiving Tools

Here is a comparison of various data backup and archiving tools. For background, see my blog post in which I discuss the difference between backup and archiving. In a nutshell, backups are designed to recover from a disaster that you can fairly rapidly detect. Archives are designed to survive for many years, protecting against disasters that impact not only the original equipment but also the person who created the data. That blog post goes into a lot of detail on what makes a good backup or archiving tool.

Comparison table

Let me give you the comparison here, and explain the features and their significance below.

| Feature | backuppc | bacula (community edition) | borg | dar | git-annex |
|---|---|---|---|---|---|
| Storage type | reference-counted file tree | archive files | reference-counted chunk tree | archive files | sym/hard-link file tree by hash; history via git |
| Supports streaming-only (tape, etc) | no | yes | no | yes | no |
| Can save backup to pipe/FIFO | no | yes (FIFO only) | no | yes (pipe and FIFO) | no |
| Asynchronous backups possible | no | no | no | yes | yes |
| Multi-volume support | no | yes | no | yes | yes |
| Single files larger than a volume | no | yes | no | yes | no |
| Individual backup larger than a volume | no | yes | no | yes | yes (with separate repo) |
| Volume identification | n/a | volume label (except for FIFO) | n/a | backup filename + slice number | repo name |
| Backup rotation / pruning | automatic per configured rules | automatic per configured rules | CLI prune call with rules | manual (SaraB/Baras provide CLI with configured rules) | CLI drop call to delete old data |
| Deduplication | file-level | common base only (paid version has more) | block-level | common base only | file-level |
| Compression | zlib at storage; ssh/rsync transport | zlib | lz4, zstd, zlib, lzma | lz4, zstd, zlib, bzip2, lzo, xz | no |
| Can avoid re-compressing | no | no | no | yes (based on extension; configurable) | n/a |
| Binary deltas | at transport, not storage | no | yes | yes | no |
| Supports encryption | no | data only (filenames & EAs unencrypted) | yes (symmetric) | yes (both public key with gpg and symmetric) | with certain special remotes |
| Zero-trust target | no | moderate (risk of forced keys to client) | yes (if targeted by only 1 machine) | yes | with certain special remotes |
| Authentication / verification | no | X.509 RSA file signatures | HMAC-SHA256 | gpg-signed session key, detached sha512, par2; any pipe | secure hashes and signed commits |
| Can directly back up Windows machines | if rsync installed | if agent installed | no | yes | yes (if git installed) |
| Can directly back up *nix machines | if rsync installed | if agent installed | yes | yes | yes (if git installed) |
| Can directly back up Mac machines | if rsync installed | if agent installed | yes | yes | yes (if git installed) |
| Preserves Mac resource forks | no | yes | yes | yes | no |
| Preserves timestamps | yes | yes | yes | yes | no |
| Preserves *nix hard links | yes | yes | no | yes | no |
| Preserves *nix symlinks | yes | yes | yes | yes | no |
| Preserves *nix EAs and ACLs | yes | yes | yes | yes | no |
| Preserves *nix ownership (uid/gid) | yes | yes | yes | yes | no |
| Preserves *nix sparse files | no | yes | simulated | yes | no |
| System model | daemon on storage; pull via rsync | daemons everywhere; pull | CLI | CLI, C++ library, Python library | CLI |
| Network/remote support | backs up systems using rsync+ssh | scheduler; backs up from/to multiple systems | push to remote using ssh+borg | push to remote on any curl backend, SFTP, ssh, or pipe | push to any of numerous special remotes or ssh+git-annex |
| GUI available | native web interface | yes | yes (Vorta) | yes (gdar, DarGUI) | limited web interface focusing on synchronization |
| Restoration without using tool | no | no | no | no | file data but not tree updates |
| External runtime dependencies | rsync, Perl | MySQL or PostgreSQL | Python | none | git |
| Standalone binary distribution | no | no, but bls/bextract can be used in emergency | dynamic (includes Python) | dynamic or static for multiple platforms, from author or distro | dynamic (requires external git) |
| Disaster recovery method | mount, reconfigure hosts | bscan to rebuild DB | normal commands | normal commands | normal, but may need to rescan repos |
| Scheduling | internal | internal | external | external or wrapper script | external |
| Supported platforms for storage | local *nix | *nix, Windows, Mac | local or ssh *nix, Mac | local or ssh *nix, Mac, Windows; curl; SFTP | local or ssh *nix, Mac, Windows; special remotes |
| Supports incremental backups | yes | yes | yes | yes | yes |
| Supports decremental backups | yes* | no | yes* | yes | yes* |

Here’s what the different features mean:

Storage type
What the backup storage looks like. backuppc’s reference-counted file tree is a tree on the filesystem, where each file corresponds to a file on the original. borg’s reference-counted chunk tree encodes each block of files as a file on the filesystem. archive files are large files that group multiple files into one. git-annex’s tree is similar to a reference-counted file tree, but achieves that via links. Of these, the archive files provide the most flexible storage, since they don’t even require a filesystem (and can be put on tape directly), while the reference-counted chunk tree represents the most efficient; see deduplication below.
Supports streaming-only
Can the backup system write to devices that do not support random access? That would include things such as tapes and pipes.
Can save backup to pipe/FIFO
For those that support streaming, can they write the backup data to a pipe or FIFO (named pipe)? This allows a backup to be, for example, streamed over ssh.
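For instance, here is a minimal sketch of streaming a dar backup over ssh; the host name and paths are placeholders:

```sh
# write the archive to stdout ("-") and stream it over ssh;
# reading the stream back later requires dar's --sequential-read mode
dar -c - -R /home -z | ssh backuphost 'cat > /backups/home.dar'
```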
Asynchronous backups possible
If yes, the system being backed up and the ultimate storage destination do not have to be reachable over the network in real time. This means they support asynchronous communication (such as NNCP or Filespooler), which facilitates things like airgapped backups, or using sneakernet with temporary storage on portable devices to transport the data to its ultimate storage host.
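As a hedged sketch of the sneakernet case with dar, one might spool the archive onto a portable drive and carry it to the storage host; all paths here are hypothetical:

```sh
# on the source machine: write the backup to a spool area on a portable drive
dar -c /mnt/usb/spool/home-2024 -R /home -z

# later, on the storage host, after physically transporting the drive:
mv /mnt/usb/spool/home-2024.*.dar /backups/
```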
Multi-volume support
Whether the backup system supports more than one volume for storing backups. Here a volume means a removable drive, a tape, an optical disc, or something similar. OS or hardware tricks to aggregate drives (eg, RAID) don’t count here.
Single files larger than a volume
If the system supports multiple volumes, whether it can split a single file across multiple volumes.
Individual backup larger than a volume
If the system supports multiple volumes, whether it can split a backup session across multiple volumes.
Volume identification
How a multi-volume-capable system identifies volumes.
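To illustrate, dar can split a backup into fixed-size slices and pause between them so media can be swapped; the size and names below are just examples:

```sh
# -s sets the slice size, -p pauses between slices for a media change;
# slices are named home.1.dar, home.2.dar, ... (filename + slice number)
dar -c home -R /home -z -s 700M -p
```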
Deduplication
Whether the backup system can detect duplicate data in the backup set and store it once. Block-level is the most efficient, as it detects common parts of files. File-level will typically hash files. Common base means that you can use a single base backup (eg, an installed OS when you back up multiple machines) and base incrementals on that, and is the least flexible.
Compression
Whether the backup system supports compression, and if so, what kind.
Can avoid re-compressing
As a performance optimization, whether the backup system can avoid re-compressing already-compressed data.
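For example, dar can be told not to compress files matching certain masks; this sketch assumes a few typical already-compressed extensions:

```sh
# -z enables compression; -Z excludes matching files from compression
dar -c /backups/full -R /home -z -Z '*.jpg' -Z '*.gz' -Z '*.zip'
```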
Binary deltas
Traditional backup systems will take any change in a file, even one bit, as a reason to store an entirely new copy of that file. Binary deltas store a more efficient representation of the difference, which can be used to bring the previous file to the new state. BackupPC supports binary deltas over the network, but not at storage. borg and dar support binary deltas both over the network and at storage.
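As a sketch of how this looks with dar (version 2.6 or later), delta signatures are stored with the reference archive so later differentials can save binary deltas; paths are placeholders:

```sh
# store rsync-style delta signatures alongside the full backup
dar -c /backups/full -R /home -z --delta sig

# a later differential against that reference can then store binary deltas
dar -c /backups/diff1 -R /home -z -A /backups/full
```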
Supports encryption
Whether and how the system can generate encrypted backups.
Zero-trust target
If a system supports encryption, whether the host storing the backup data can be prevented from decrypting it. “Yes” is best.
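With borg 1.x, for example, keyfile encryption keeps the key material on the client, so the storage host alone cannot decrypt the repository; the URL below is a placeholder:

```sh
# keyfile mode stores the key under the client's ~/.config/borg/keys,
# not in the repository itself
borg init --encryption=keyfile ssh://backuphost/./borg-repo
```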
Authentication / verification
Whether a backup system provides integrated authentication of the backup data. With some, this is integrated with the encryption code and may require encryption (eg, Bacula). With others, such as git-annex, it is totally separate. dar provides two built-in options: --sign, which signs the encryption key used for the session, and --hash, which computes a SHA-512 hash while writing the archive and writes it to a separate file once the archive is written. It also integrates with par2 to create par2 verification and recovery data. Since dar creates archive files like tar does, it can also be used with any other tool that can sign data on disk or on a pipe; for instance, gpg could be used to provide stronger assurances than the built-in --sign.
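As a small sketch of dar's options here (with placeholder paths), combined with an external gpg signature over the resulting slice:

```sh
# --hash writes a detached SHA-512 file alongside each slice as it is created
dar -c /backups/full -R /home -z --hash sha512

# any external tool can then sign the on-disk archive; gpg is one option
gpg --detach-sign /backups/full.1.dar
```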
Can directly back up … machines
Whether the program can back up machines running certain operating systems without using external helpers (sftp, etc). “*nix” means Unix/Linux/BSD.
Preserves …
Whether the backup system saves and restores given types of metadata. To preserve a hard link, the backup program must, at restore time, hard link together the exact same set of files that were hard linked in the source data, and no others (even if identical by content). Borg’s simulated support for sparse files means that it saves holes as blocks of NULLs at backup time, and can convert blocks of NULLs to holes at extract time. This doesn’t necessarily preserve the exact sparse structure of the original file, but should achieve roughly similar storage gains.
System model
How the system works. backuppc runs a daemon on the system doing the storage, which pulls data from the systems being backed up using rsync. Bacula has a director daemon that performs scheduling and coordination, a storage daemon that runs on the system(s) providing storage, a file daemon running on systems being backed up, and also requires a PostgreSQL or MySQL database. The CLI tools are typically invoked from the command line (possibly by cron or systemd).
Network/remote support
How it supports having the backup and the source data on different machines. BackupPC can use rsync over ssh. Bacula uses the daemons as noted, which can communicate over a network. borg can push to a remote over ssh, so long as borg itself can be executed on the remote. dar can push to a remote using backends supported by libcurl, or SFTP, or any command that can be piped to. git-annex has a set of special remotes that can be pushed to, though they may not necessarily preserve all metadata.
GUI available
Whether a graphical interface is available, and what type. Third-party FLOSS projects provide these for borg and dar. BackupPC's web interface is its primary interface. git-annex's assistant provides Dropbox-like synchronization with its web interface, but doesn't work well with all workflows git-annex makes possible.
Restoration without using tool
Whether you can restore data without using the particular backup tool used to create it. Of these, only git-annex has some support here; it could let you at least access the file data, even if you may wind up with duplicate copies after renames, deletions, etc.
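A sketch of what that looks like: annexed files are normally symlinks into .git/annex/objects, so a plain dereferencing copy can recover file contents without git-annex installed (paths here are hypothetical):

```sh
# -L dereferences symlinks, copying out the underlying annexed content
cp -rL /mnt/rescued/photos-repo /restore/photos
```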
External runtime dependencies
Things that must be present to run the tool. Of these tools, only dar is fully self-contained, and can be built into a statically-linked single binary on *nix platforms that has no external dependencies.
Standalone binary distribution
Whether a self-contained standalone binary is available, and if so, what kind. Borg's standalone binary is dynamically-linked and includes the Python environment necessary to run. dar is available both dynamically- and statically-linked, from the author or distros; a statically-linked binary is the most portable option. git-annex provides a dynamically-linked binary, which also requires git to be installed.
Disaster recovery method
How to recover the data if only the backup volumes survive a disaster. With BackupPC, you install BackupPC on a fresh system, configure the hosts, and then can restore. Bacula would have you make a fresh install, then use bscan to load the information about volumes into its database. Bacula does support bls/bextract commands as well, but their usage is complex and impractical for most. borg and dar would just have you use the same commands as usual, since they don't require any external configuration. git-annex may need you to run git annex sync from repos to reload their status, but otherwise doesn't need anything special, provided you have saved the git metadata somewhere.
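For instance, recovering from a rescued borg repository is just the normal workflow; the paths and archive name here are placeholders:

```sh
# list the archives in the surviving repository, then extract one
borg list /mnt/rescued-drive/borg-repo
borg extract /mnt/rescued-drive/borg-repo::home-2024-05-01
```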
Scheduling
How the backup system schedules backups. “Internal” means the backup software has a daemon running that does its own scheduling, often with limits on simultaneous backups and such that it can enforce. External means something like cron handles the scheduling.
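An external scheduler can be as simple as a crontab entry; this sketch assumes a borg repository at a placeholder path:

```sh
# crontab entry: run a borg backup nightly at 02:00
0 2 * * * borg create --compression zstd /backups/repo::'{hostname}-{now}' /home
```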
Supported platforms for storage
Built-in support for backup destinations. “Local” means storage local to the backup software. “ssh” means via ssh to another system running the backup software. Dar supports libcurl destinations (https/ftp/sftp/etc). git-annex has support for special remotes for various targets. Since dar is a pipe-friendly CLI program, it can be combined with others to support a wide variety of schemes; for instance, rclone to cloud. Emulations such as the Windows Subsystem for Linux don’t count as Windows support; here I mean native support.
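As a sketch of that pipeline approach (the rclone remote name and paths are hypothetical):

```sh
# stream a dar archive directly to cloud storage via rclone's rcat
dar -c - -R /home -z | rclone rcat cloud:backups/home.dar
```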
Supports incremental backups
Whether the backup system supports storing just the changes since the last backup. All systems here do.
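With dar, for example, an incremental takes the previous backup as its reference via -A; paths are placeholders:

```sh
dar -c /backups/full -R /home -z                      # initial full backup
dar -c /backups/incr1 -R /home -z -A /backups/full    # changes since the full
```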
Supports decremental backups
Whether the backup system supports storing the most recent backup as a full backup, then deltas running back in time – sort of the opposite of a traditional incremental. backuppc, borg, and git-annex use a storage format that is equally efficient going forwards and backwards, so I rated them each as “yes*”.

Features every program here has

  • Included in the Debian distribution and many others

  • Supports random access that is efficient enough to extract a single file without reading an entire backup, when the underlying device supports random access

Overview of the tools and analysis

BackupPC

BackupPC is a single-daemon system that backs up remote systems using rsync. This means that network bandwidth is used efficiently. It stores the files in a file-level deduplicated directory tree. It is a simple solution for basic backups of *nix machines when the backups fit on a standard filesystem.

Bacula

Bacula has its heritage in the tape backup world. It supports full backups and incrementals in the traditional sense. It keeps a database (in PostgreSQL or MySQL) of the different volumes and their contents. This is used both to determine which media are needed for a restore and to implement volume reuse rules (only allowing a full to be overwritten when a more recent full exists, for instance). It is the only tool here to provide automation around many-to-many storage relationships (it can back up many systems to many storage systems) and provides the most sophisticated automation around volume management. On the other hand, it is also the most complex to install and set up, requiring its own daemons on every relevant system, as well as a database server. The complexity of restores may be a problem for decades-long archival, but on the other hand, those making heavy use of removable media may appreciate its flexibility. Its real target is the enterprise market, and a commercial version adds additional features.

Borg

Borg does backups to a filesystem. Borg's emphasis is on efficiency; of all the tools here, it is the most efficient both over the network and on disk. Its on-disk format is a filesystem tree consisting of deduplicated chunks of files, which can also be compressed. Therefore, if you move a file, even from one machine to another, it need be neither re-transmitted nor stored again, because borg's deduplication will detect this. By supporting binary deltas, it also efficiently stores changes to files. It is the best solution for very slow network links or situations where storage space is at a premium. On the other hand, it has its own repository format that should ideally have the time-consuming borg check (which can take days) run periodically, and backups can be slow. Borg doesn't support multiple volumes.
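A minimal borg 1.x lifecycle, with placeholder paths and retention rules, might look like:

```sh
borg init --encryption=repokey /backups/repo      # one-time repository setup
borg create --compression zstd /backups/repo::'home-{now}' /home
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /backups/repo
borg check /backups/repo                          # periodic consistency check
```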

Dar

Dar represents a kind of next-generation tar. It is a command-line program that is supremely flexible, offers integrated par2 support, and is designed to integrate well with external tools. I've written a lot about dar; my dar page has links to my articles. Of all these tools, dar is the most flexible about storage, since it can be used in a pipeline. It also supports tape drives, with hooks allowing you to run commands to, for instance, operate a changer or have an operator switch tapes. Its isolated catalogs feature makes for efficient tracking of backed-up data without requiring a separate SQL database as with Bacula. You could look at dar as the all-around most flexible option. While it's not quite as efficient on-disk as borg, nor does it have quite the level of built-in volume management sophistication of Bacula, it does pretty well compared to both. It is also a better tar than tar, a better zip than zip, and the most "Unixy" of these tools due to its ability to be used in pipelines. It can be thought of as a powerful filesystem differ/patcher, or the workhorse of your own backup scripts. It is also the most standalone of all the tools here, able to function as just a single statically-linked binary.
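A hedged sketch of the isolated-catalog workflow (paths are placeholders): the catalog is a small extract of an archive's metadata, which can then serve as the reference for the next incremental:

```sh
dar -c /backups/full -R /home -z
dar -C /backups/full-cat -A /backups/full                 # isolate the catalogue
dar -c /backups/incr1 -R /home -z -A /backups/full-cat    # incremental via catalogue
```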

git-annex

git-annex isn’t designed as a backup tool at all, but it has a robust feature set that allows it to be used in such a way. It is more of a data-tracking and moving application. Uniquely, if certain care is used, backed-up data can be presented as plain files along with metadata, meaning that a worst-case scenario of a restore by an unrelated person in the future might at least get at your family photos, even if there are 5 copies of each due to renames; using a full git-annex would resolve that situation.

