In this continuation of my blog "Understanding Deduplication", I would like to talk about deduplication ratios: how to read and interpret them, what to expect in your environment, how to achieve better ratios, and which data types dedupe well and which do not.
Most vendors claim at least 10:1 or 20:1, and some go to the extreme of 500:1. What do these numbers really mean, and how exactly do they help you determine the space savings?
In its simplest form, the deduplication ratio is the ratio of the actual bytes referenced by the dedupe blocks to the bytes actually stored, i.e. the ratio of actual data to stored data. If the actual data on the source is 100G but the dedupe target stores only 10G, the ratio is 100:10 or 10:1; the stored data is 1/10 of the original, so the space reduction in % is (1 - 1/10) x 100 = 90%. For a 500:1 dedup ratio the reduction is (1 - 1/500) x 100 = 99.8%, and for 100:1 it is 99%.
As large as it may seem, 500:1 does not save you much more space than 100:1 (99.8% vs 99%). On a 100TB store, a 500:1 ratio would save 99.8TB versus 99TB at 100:1. This shows that the ratios do not matter much once they go above 10:1 (90%), since the theoretical maximum space savings you can achieve is 100%. When you start looking at space-saving devices, make sure you understand what the numbers are really telling you.
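To make the diminishing returns easy to see, here is a small Python sketch (my own illustration, not taken from any vendor tool) that converts a dedup ratio into a savings percentage and the space saved on a hypothetical 100TB store:

```python
def savings_percent(ratio):
    """Space reduction for a given dedup ratio, e.g. 10 means 10:1."""
    return (1 - 1 / ratio) * 100

capacity_tb = 100  # hypothetical 100TB of source data

for ratio in (2, 5, 10, 20, 100, 500):
    pct = savings_percent(ratio)
    saved_tb = capacity_tb * pct / 100
    print(f"{ratio}:1 -> {pct:.1f}% reduction, {saved_tb:.1f}TB saved out of {capacity_tb}TB")
```

Running it shows that everything past 10:1 is fighting over the last few percent of the same 100TB.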
Factors that influence deduplication ratios
There is no secret to deduplication: the amount of space savings you see depends on how many identical copies of data you have. If there is no duplicate data to begin with, no product can give you any space savings. Apart from primary storage, the biggest dedup savings are usually found in backups, since backups copy the same data day in and day out. If backups are done the way most implementations do it, a weekly full and daily incrementals, the fulls achieve the best dedup and the incrementals also dedupe to a good extent. An incremental backup is usually driven by the file's archive bit, which is set even when only a small part of the file changes - yet the entire file is backed up again. That data dedupes easily, as shown in the sketch below.
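As a small illustration (my own sketch, not any vendor's engine), block-level hashing is what lets a dedup target store only the changed blocks even when the backup software re-sends the whole file because its archive bit flipped:

```python
import hashlib
import random

def unique_blocks(data, block_size=4096):
    """Split data into fixed-size blocks and return the set of block hashes."""
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}

random.seed(42)
original = bytes(random.getrandbits(8) for _ in range(256 * 1024))  # a 256KB "file"
modified = bytearray(original)
modified[100:110] = b"new bytes!"   # tiny change; archive bit flips, whole file is re-sent

stored = unique_blocks(original)                # blocks kept from the first backup
new_blocks = unique_blocks(modified) - stored   # blocks the target must store again
print(f"Blocks in re-sent file: {len(unique_blocks(modified))}, new blocks stored: {len(new_blocks)}")
```

Even though the entire file travels to the target again, only the one changed block consumes new space.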
The type of data also influences dedup ratios. General user file data and database backups are good candidates for dedup. Medical imaging, geological/seismic data, and other pre-compressed or inherently unique data sets typically see poor ratios.
Length of data retention and the way backups are performed are the most impactful factors in determining the dedup ratio. The more copies of the same data you store, the higher the ratio you get. If retention on a dedup target is only one week, the most you can achieve is the dedup within a single full backup plus its incrementals (differential and cumulative).
If the full backup is 100G and you take 6 differentials with a 10% rate of change, you back up about 10G per day, or 60G over 6 days, for 160G written in total. If the unique data in the full is 20G, the dedup ratio achieved is 160/20 = 8x. If instead you ran a daily full, the ratio would be 700/20 = 35x. Daily fulls also give better restore times (better RTOs), since a restore needs only the latest full rather than a full plus a chain of incremental backups.
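The arithmetic above can be captured in a short sketch (my own simplified model, which assumes all repeated data dedupes completely against what is already stored):

```python
def dedup_ratio(bytes_written_gb, unique_stored_gb):
    """Ratio of data sent to the target vs. unique data it actually stores."""
    return bytes_written_gb / unique_stored_gb

full_gb = 100        # size of one full backup
unique_gb = 20       # unique data left after deduping the full itself
change_rate = 0.10   # 10% daily rate of change
days = 6             # incrementals between weekly fulls

# Weekly full + 6 differentials: 100G + 6 * 10G = 160G written
weekly_written = full_gb + days * (full_gb * change_rate)
print(f"Weekly full + incrementals: {dedup_ratio(weekly_written, unique_gb):.0f}x")  # 8x

# Daily fulls for a week: 7 * 100G = 700G written
daily_written = 7 * full_gb
print(f"Daily fulls:                {dedup_ratio(daily_written, unique_gb):.0f}x")   # 35x
```

The more redundant data you send (daily fulls, longer retention), the more impressive the ratio looks, even though the unique data on disk is the same 20G in both cases.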
Great post. The same question was asked over at Dedupe2.com.
The question I still don't see answered here or there is: what are the REAL numbers end users are seeing vs. the theoretical "best case" numbers? Anyone out there willing to share some experiences?
Theoretical best-case numbers are posted in my blog "VTL Vendor comparison for Data DeDuplication". As for real-world numbers, I do not have access to evaluate every vendor, but I will certainly post numbers for the ones I do.
Seems like the real numbers are elusive here. Vendors are quick to tout big numbers to get press, but when real answers are requested they go silent! Ignore the puffed-up numbers and assume something like a 2-10x reduction when you start an ROI analysis!!
That is exactly the reason you should always run an internal test with "your" data to see what kind of benefit you get from dedup. Since it is not possible to do this with every vendor, you have to shortlist based on advertised features. Even though I may see a dedup ratio of xx for my data type, the same may not hold for yours, as there are a lot of variables that impact the ratio, as mentioned in my post.