Without a proper understanding of what the different backup types are and how they work, companies can adopt bad practices whose consequences range from wasted bandwidth and storage to important data missing from their backups. Understanding these concepts is also crucial when selecting new data-protection products or services.
A full backup contains all data in the entire system. A full backup of the C: drive in Windows contains every file on the C: drive. A full backup of a Windows system should contain a copy of every file on every drive on the machine or VM (e.g. C:, D:, F:, etc.). The same goes for a full backup of a UNIX or Linux machine; it contains every file on every file system on the machine (e.g. /, /home, /opt, etc.).
The only things that should be left out of a full backup are files specifically excluded by the configuration. For example, many system administrators choose to exclude directories that will have no value during a restore (e.g. /boot or /dev) or that contain transient files (e.g. C:\Windows\TEMP in Windows, or /tmp in Linux).
There are two philosophies about which files should be included in or excluded from a backup: back up everything and exclude what you know you don’t need, or select only what you want to back up. The former is the safer option; the latter will save some space on your backup system. Some people see it as a waste to back up application files, such as the directory into which you have loaded Oracle or SQL Server. They believe they would simply reload the application during a restore. The risk of this approach is that someone will place valuable data in a directory that is not selected for backup. For example, if you select only /home1 or D:\Data to be backed up, how will the backup system know if someone adds /home2 or E:\Data? This is why it is much safer to back up everything and exclude only the files that you know you don’t need, even if it does take up some additional space. An exception might be a strongly controlled environment where all data is always loaded in the same place and you have a well-orchestrated solution for replacing the operating system and applications in a restore.
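To make the difference between the two philosophies concrete, here is a minimal Python sketch. The function names, the file list, and the exclusion and inclusion sets are all hypothetical and do not reflect any backup product's configuration syntax; the point is only to show how a newly added /home2 is handled under each approach.

```python
from pathlib import PurePosixPath

def is_under(path, roots):
    """True if path equals one of roots or lies beneath it."""
    p = PurePosixPath(path)
    return any(p == PurePosixPath(r) or PurePosixPath(r) in p.parents for r in roots)

def select_by_exclusion(all_paths, excluded):
    """Back up everything except what is explicitly excluded."""
    return [p for p in all_paths if not is_under(p, excluded)]

def select_by_inclusion(all_paths, included):
    """Back up only what is explicitly selected."""
    return [p for p in all_paths if is_under(p, included)]

# Hypothetical file list; /home2 was added after the backup was configured.
paths = [
    "/home1/alice/report.txt",
    "/home2/bob/notes.txt",
    "/tmp/scratch.dat",
    "/dev/null",
]

print(select_by_exclusion(paths, {"/dev", "/tmp"}))
# ['/home1/alice/report.txt', '/home2/bob/notes.txt']
print(select_by_inclusion(paths, {"/home1"}))
# ['/home1/alice/report.txt']
```

With the exclusion list, the new /home2 data is picked up without anyone touching the configuration. With the inclusion list, it is silently skipped until someone remembers to add it.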
An incremental backup typically backs up all data that has changed since the last backup of any kind. Historically, such backups were file-based backups, meaning that they backed up all files that had changed since the last backup. The challenge with this from a modern data protection standpoint is that we are attempting in every way to minimize the I/O impact of backups on the server (especially when backing up VMs), and backing up a 10 GB file because 1 MB has changed isn’t very efficient.
This is why many vendors have switched to block-based incremental backups, which back up only the blocks that have changed. This is most commonly done when backup software backs up VMware or Hyper-V through their APIs. The application notifies the appropriate API that it is doing a block-based incremental, after which it is given a list of blocks to back up.
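The idea of backing up only changed blocks can be sketched in a few lines of Python. Note the assumption: this example finds changed blocks by comparing per-block hashes of two versions of the data, which is a stand-in for illustration. It is not how the VMware or Hyper-V APIs work; those hand the backup application a list of changed blocks directly. The block size and function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, to keep the example readable

def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return the indexes of blocks in new that differ from, or do not exist in, old."""
    old_h = block_hashes(old, block_size)
    new_h = block_hashes(new, block_size)
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or old_h[i] != h]

old = b"AAAABBBBCCCCDDDD"
new = b"AAAABxBBCCCCDDDDEEEE"  # one byte changed in block 1, block 4 appended

print(changed_blocks(old, new))  # [1, 4]
```

A file-based incremental would copy all 20 bytes of the new version because the file changed; the block-based approach copies only the two affected blocks, 8 bytes. Scale that up and you get the 10 GB file with 1 MB of changes.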
Although the term has meant a few different things over the years, it is now widely accepted that a differential backup backs up all data that has changed since the last full backup. This type of backup was much more in vogue in the days of tape, as it minimized the number of tapes that were required for a restore. A restore needed the latest full, followed by the latest differential, followed by any incrementals taken since that differential.
If you are still doing tape-based backups, consider moving from weekly fulls to a monthly full, weekly differentials, and daily incrementals. A restore will need to load one more backup than it would under a weekly full backup setup, but the change saves a tremendous amount of tape and network bandwidth. This schedule has long been popular with those still using tape.
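The "one more backup" claim is easy to check with a small Python sketch. The schedules below are hypothetical (a 7-day week starting with a full, and a 28-day month with a full on day 1 and a differential every seventh day after), and `restore_chain` is an illustrative helper, not part of any backup product.

```python
def restore_chain(backups):
    """Return the backups a restore must load, oldest first.

    backups is a chronological list of (label, kind) tuples, where kind is
    'full', 'diff', or 'incr'. Assumes at least one full backup exists.
    """
    last_full = max(i for i, (_, kind) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]

    # The latest differential after that full replaces every backup between them.
    diffs = [i for i, (_, kind) in enumerate(backups) if kind == "diff" and i > last_full]
    start = last_full
    if diffs:
        start = diffs[-1]
        chain.append(backups[start])

    # Every incremental taken after that point is still needed.
    chain.extend(b for b in backups[start + 1:] if b[1] == "incr")
    return chain

# Weekly full plus daily incrementals, restoring at the end of the week.
weekly = [("day1", "full")] + [(f"day{d}", "incr") for d in range(2, 8)]

# Monthly full, weekly differential, daily incremental, restoring on day 28.
monthly = [
    (f"day{d}", "full" if d == 1 else "diff" if (d - 1) % 7 == 0 else "incr")
    for d in range(1, 29)
]

print(len(restore_chain(weekly)))   # 7
print(len(restore_chain(monthly)))  # 8
```

In the worst case the weekly-full schedule loads seven backups, and the monthly schedule loads eight: the full, the latest differential, and the six incrementals that follow it.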
The advent of disk and deduplication has made full and differential backups passé. As mentioned previously, the reason we did the occasional full and differential backups was to minimize the number of tapes necessary to perform a restore. This no longer applies in the world of disk backups. As long as a product has been architected to fully utilize disk, restoring data from thousands of incrementals should take no more time than restoring it from a single full. This is because the backup system is simply keeping a record of where all of the files/blocks are in its storage and transferring all of those files/blocks from its storage back to the client during a restore. How those files/blocks got there is irrelevant in a modern backup world. Forever-incremental, especially if it is implemented using a block-based approach, is the most efficient way to update your backup repository with the latest information from each backup client.
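A toy model shows why the number of incrementals stops mattering once the backup system tracks where each block lives. The `BlockIndex` class below is a hypothetical sketch, not any vendor's design: it keeps a deduplicated block store keyed by content hash plus a map from block number to the current hash. For brevity it only tracks the latest state, whereas a real product would also retain earlier points in time.

```python
import hashlib

class BlockIndex:
    """Toy forever-incremental repository with deduplicated block storage."""

    def __init__(self):
        self.store = {}   # content hash -> block data (each unique block stored once)
        self.latest = {}  # block number -> content hash of its current version

    def ingest(self, blocks):
        """Record one backup, given as a dict of {block number: bytes}."""
        for number, data in blocks.items():
            digest = hashlib.sha256(data).hexdigest()
            self.store.setdefault(digest, data)  # skip blocks already stored
            self.latest[number] = digest

    def restore(self):
        """Rebuild the current data with one lookup per block."""
        return b"".join(self.store[self.latest[n]] for n in sorted(self.latest))

repo = BlockIndex()
repo.ingest({0: b"AAAA", 1: b"BBBB", 2: b"CCCC"})  # first backup
repo.ingest({1: b"BxBB"})                          # incremental: block 1 changed
repo.ingest({2: b"AAAA"})                          # incremental: block 2 now matches block 0

print(repo.restore())   # b'AAAABxBBAAAA'
print(len(repo.store))  # 4
```

The restore does one lookup per block regardless of how many incrementals were ingested, and the third backup added nothing to storage because its block was already present.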
Windows systems use an attribute called the archive bit to determine whether a file has changed since the last backup. Any modification to a file sets its archive bit, after which a backup of any level will back it up. After the file has been backed up, the backup application clears the archive bit, and the file will not be backed up again until it is modified or the next full backup runs.
Many backup purists do not like the archive bit, if for no other reason than that it should be called the backup bit, since backups are not archives. Another issue is that two backup applications running against the same files will step on each other, because each one clears the archive bit that the other relies on.
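The conflict between two backup applications can be simulated in Python. This is an in-memory model only: the `File` class and `incremental_backup` function are hypothetical and do not touch real file attributes.

```python
class File:
    """Stand-in for a file that carries an archive bit."""

    def __init__(self, name):
        self.name = name
        self.archive = True  # a newly created file has the bit set

    def modify(self):
        self.archive = True  # any modification sets the bit again

def incremental_backup(files):
    """Back up every file whose archive bit is set, then clear the bit."""
    backed_up = [f.name for f in files if f.archive]
    for f in files:
        f.archive = False
    return backed_up

files = [File("a.txt"), File("b.txt")]

first_run_a = incremental_backup(files)   # application A sees both new files
files[1].modify()                         # b.txt changes
second_run_a = incremental_backup(files)  # application A captures the change and clears the bit
run_b = incremental_backup(files)         # application B runs next and finds nothing to do

print(first_run_a)   # ['a.txt', 'b.txt']
print(second_run_a)  # ['b.txt']
print(run_b)         # []
```

Application B never sees the change to b.txt, because application A cleared the one flag both of them depend on. On a real Windows system, Python exposes the bit through `os.stat(path).st_file_attributes` and the `stat.FILE_ATTRIBUTE_ARCHIVE` constant.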
Most companies' move to virtualization, the use of backup APIs that interface at the virtualization level, and the subsequent use of block-based incremental backups have made the archive bit less important than it used to be. It really only applies to host-based backups, which are becoming rarer every day.