Thursday, September 17, 2009

Understanding DeDuplication

De-duplication comes in several flavors (block-based, hash-based, etc.), and each flavor can be implemented in several different ways (variable vs. fixed block size, 128-bit vs. 64-bit hashes, sub-file, etc.) and at different points in the data path (post-processing, source-based, in-line, etc.).

Let us examine each of these and see where they fit into the products available on the market. The most common questions posed about de-duplication are:

- Is it source-based, post-processing, or in-line?
- Is it hash-based or block-based?
- What are the chances of a hash collision?
- What is the impact of un-deduping?
- How fast can the device de-dupe?
- What de-dupe ratio am I going to get?
- What kind of data can be de-duped?
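On the hash-collision question, a rough birthday-bound estimate gives a feel for why 128-bit hashes are considered safe while 64-bit hashes are not. The sketch below is purely illustrative arithmetic (it assumes an ideal hash and unique 8 KB blocks), not a claim about any particular product:

```python
import math

def collision_probability(num_blocks: int, hash_bits: int) -> float:
    """Approximate birthday-bound probability of at least one collision
    among num_blocks unique blocks, assuming an ideal hash function:
    P ~= 1 - exp(-n^2 / 2^(b+1))."""
    # expm1 keeps precision when the probability is astronomically small
    return -math.expm1(-(num_blocks ** 2) / (2 ** (hash_bits + 1)))

# One billion unique 8 KB blocks (~8 PB of logical data):
p128 = collision_probability(10 ** 9, 128)  # vanishingly small
p64 = collision_probability(10 ** 9, 64)    # a few percent -- real risk
print(f"128-bit: {p128:.3e}, 64-bit: {p64:.3e}")
```

With a 128-bit hash the collision probability is on the order of 10^-21 for that data set; with a 64-bit hash it climbs to a few percent, which is why shorter hashes generally need a byte-level verify step.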

Let us talk about source-based de-duplication in this post:

In my opinion, source-based de-duplication is not the preferred option for most Storage/System Engineers, as it adds load on the host and needs a potentially large repository if the host has a ton of files. The de-dupe application maintains a local repository that keeps track of every file on the system with a block reference or hash reference. When the time comes to back up, it scans the whole file system and compares each hash against what the repository has; anything that changed is sent over to the backup application for storing on tape/disk.
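The repository-and-compare idea above can be sketched in a few lines. This is a minimal illustration of the concept only (fixed 4 KB blocks, SHA-256, an in-memory set standing in for the local repository), not any vendor's actual implementation:

```python
import hashlib

class SourceDedupRepository:
    """Minimal sketch of a source-side de-dupe repository: it remembers
    the hash of every block already backed up, so only new or changed
    blocks are shipped to the backup target."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.seen: set[bytes] = set()  # hashes already on the target

    def backup(self, data: bytes) -> list[bytes]:
        """Return only the blocks not already known to the repository."""
        new_blocks = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in self.seen:
                self.seen.add(digest)
                new_blocks.append(block)
        return new_blocks

repo = SourceDedupRepository()
# Two identical 4 KB blocks de-dupe to a single shipped block:
first = repo.backup(b"A" * 8192)
# On the next run only the changed tail is shipped:
second = repo.backup(b"A" * 8192 + b"B")
print(len(first), len(second))
```

Note that de-duplication happens both within a backup run (the two identical blocks in `first`) and across runs (the unchanged data in `second` is never resent).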

It is more efficient than a plain incremental backup, since an incremental cannot work at the sub-file level: if a file changes, the entire file is sent rather than just the changed blocks. Several implementations of source-based de-duplication are in use today, with the most popular being
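The sub-file advantage is easy to demonstrate. The sketch below compares two versions of a file fixed-block by fixed-block and reports only the blocks whose hashes differ; it is a simplified illustration (real products often use variable-size blocks to handle insertions better):

```python
import hashlib

BLOCK = 4096

def changed_blocks(old: bytes, new: bytes) -> list[int]:
    """Return indices of fixed-size blocks whose hashes differ between
    two versions of a file (a sketch of sub-file change detection)."""
    indices = []
    for i in range(0, max(len(old), len(new)), BLOCK):
        h_old = hashlib.sha256(old[i:i + BLOCK]).digest()
        h_new = hashlib.sha256(new[i:i + BLOCK]).digest()
        if h_old != h_new:
            indices.append(i // BLOCK)
    return indices

old = bytes(1024 * 1024)       # 1 MB file of zeroes (256 blocks)
new = bytearray(old)
new[500_000] = 0xFF            # a single byte changes
changed = changed_blocks(old, bytes(new))
# A classic incremental resends all 256 blocks; block-level de-dupe
# ships just the one changed 4 KB block.
print(changed)
```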

CommVault, Avamar (EMC), PureDisk (Symantec), and OSSV (NetApp).

CommVault can also store the de-duped blocks on tape. Like other solutions, when the data is sent to tape it traditionally gets un-deduped, or re-hydrated. But with CommVault's latest Simpana release, de-duped data can be written to tape as-is, which can save tons of space on tape.

As with any advantage, there are associated drawbacks as well. If you lose a single tape (or, for that matter, a disk block if disk storage is the target) from the whole sequence, it can be a disaster: that tape might contain pointers to far more data than physically fits on it. A de-dupe ratio of 9:1 means roughly 89% space savings (1 − 1/9), which boils down to one stored block being referenced in place of nine logical blocks; if that block cannot be recovered, you lose every backup that points to it. This is not specific to source-based de-dupe, nor to any single vendor; it is the general case with any de-dupe solution. One has to make sure that whatever technology sits at the back end is not just cheap disk, but a reliable, highly available target that can guarantee data integrity.
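The relationship between a de-dupe ratio and the space it saves is worth sanity-checking, since the two are easy to conflate. A ratio of r:1 saves 1 − 1/r of the space:

```python
def space_savings(ratio: float) -> float:
    """Fraction of space saved for a de-dupe ratio of ratio:1."""
    return 1.0 - 1.0 / ratio

print(f"{space_savings(9):.1%}")   # 9:1  saves ~88.9%
print(f"{space_savings(10):.1%}")  # 10:1 saves exactly 90%
print(f"{space_savings(20):.1%}")  # 20:1 saves 95%, not "twice" 10:1
```

Note the diminishing returns: doubling the ratio from 10:1 to 20:1 only moves savings from 90% to 95%.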

With PureDisk, Avamar, and NetApp, their own disk needs to be used as the backup target to make this most efficient. And tape-out is quite complex with Avamar; in the case of NetApp, you can simply send that volume out via NDMP.

One disadvantage I see with this is that the final target area might hold data from several hosts that were each de-duped independently, so commonalities between them never get factored out. In other words, there is no global de-dupe across all the clients' backups. This can be taken care of if the backup disk target is itself a device that de-dupes what it receives.
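The cross-host commonality can be shown with a toy example. Below, two hypothetical hosts share most of their data (think of a common OS image) but each de-dupes in isolation; a global index on the target stores each unique block only once. This is just the idea, not any vendor's design:

```python
import hashlib

def block_hashes(data: bytes, block: int = 4096) -> set[bytes]:
    """Set of unique fixed-block hashes for one host's local de-dupe."""
    return {hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)}

# Two hosts: identical 15 KB "OS image" plus a small host-unique tail.
host_a = block_hashes(b"shared-os-image" * 1000 + b"a-only" * 100)
host_b = block_hashes(b"shared-os-image" * 1000 + b"b-only" * 100)

# Each host ships its locally unique blocks; a globally de-duping
# target keeps only one copy of the blocks the hosts have in common.
global_index = host_a | host_b
print(len(host_a) + len(host_b), "blocks shipped,",
      len(global_index), "stored globally")
```

The shared blocks are shipped twice (once per host) but stored once, which is exactly the saving a de-duping backend target recovers.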