Wednesday, August 12, 2009

Data Corruption in Storage

You might be surprised to learn that data corruption happens even in the latest and greatest storage arrays, with all the good RAID technologies taking care of the data. Data corruption might not be something you run into daily, but failed disks are. Let us look at what exactly a failed disk is and what it means for your data.

Let us first examine what a disk looks like and what it is made of. At the most granular level, a hard drive is divided into sectors, typically 512 bytes each, and data is written to the disk in blocks made up of those 512-byte sectors. You might be wondering why that 1TB hard drive you just bought from Fry's or BestBuy shows 900+GB and not a full 1000GB. Most of that gap is a matter of units: the drive is sold as 10^12 bytes, and an operating system that counts in powers of two reports that as roughly 931GB. Separately, in order to recover from failed reads, failed writes and corrupted data, hard drives reserve extra space alongside each sector to store a checksum (commonly called ECC). That overhead is part of the difference between 'raw' and 'usable' capacity at the media level, although it is not counted in the advertised capacity. Well, if the hard drive can detect all these errors, then why are we even bothered about data corruption? As a hard drive ages, several factors contribute to data corruption on the individual drive, and not every error is caught by the drive's own checks.
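To make the per-sector checksum idea concrete, here is a minimal sketch. It assumes a plain CRC-32 stored next to each 512-byte sector, which can only detect corruption; real drives use stronger ECC codes that can also correct some errors. The names `write_sector` and `read_sector` are made up for illustration.

```python
import zlib

SECTOR_SIZE = 512  # bytes of user data per sector, as described above


def write_sector(data: bytes) -> tuple:
    """Return the sector payload together with a checksum stored alongside it."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("a sector holds exactly 512 bytes in this sketch")
    return data, zlib.crc32(data)


def read_sector(data: bytes, stored_checksum: int) -> bytes:
    """Return the payload, or raise if it no longer matches its checksum."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch: sector is corrupt")
    return data


payload, checksum = write_sector(bytes(range(256)) * 2)

# Flip a single bit to simulate corruption on the media.
damaged = bytearray(payload)
damaged[100] ^= 0x01

try:
    read_sector(bytes(damaged), checksum)
    corruption_detected = False
except IOError:
    corruption_detected = True
```

Note that this check only says the bytes changed after they were written. It says nothing about whether the right bytes were written to the right place, which is where the problems discussed later come in.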

RAID technology takes care of the issues that independent hard drives cannot, using parity calculations, periodic scrubbing, reconstruction, predictive analysis and so on.
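The parity calculation and reconstruction mentioned above can be sketched with simple XOR parity, the scheme used in single-parity RAID levels. This is a toy illustration with made-up four-byte blocks and hypothetical helper names, not how any particular array implements it.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(blocks, parity):
    """Return True if the stored parity still matches the data blocks."""
    return xor_blocks(blocks) == parity


# Three data blocks, one per (hypothetical) data disk in the stripe.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# Simulate losing the second disk and rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Here `rebuilt` comes out equal to the lost block, and `scrub` is the kind of check a periodic scrubbing pass would run over each stripe. If one data block is silently corrupted, scrubbing notices the mismatch, but XOR parity alone cannot tell which block is the bad one.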

A published study, "An Analysis of Data Corruption in the Storage Stack" by Bairavasundaram et al., analyzed corruption instances recorded in production storage systems containing 1.53 million disk drives over a period of 41 months and found more than 400,000 checksum mismatches.

As disk capacities keep growing (there is a 2TB SATA hard drive now), these errors will keep growing and reliability will be reduced. There has to be some mechanism to take care of these errors beyond the regular RAID implementation. Misdirected writes, torn writes, data path corruption, parity pollution and the like are all of great concern with SATA drives. To protect FC drives from these same errors, a new standard was put in place: T10 DIF.
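A rough back-of-the-envelope sketch of the capacity point: if the per-bit error rate stays the same, the expected number of errors seen when reading the whole drive grows in proportion to its size. The rate of one error per 10^14 bits used here is purely an illustrative assumption, not a figure from the study above.

```python
ASSUMED_ERRORS_PER_BIT = 1e-14  # illustrative assumption, not from the study


def expected_errors_full_read(capacity_bytes: float) -> float:
    """Expected error count when every bit on the drive is read once."""
    return capacity_bytes * 8 * ASSUMED_ERRORS_PER_BIT


one_tb = expected_errors_full_read(1e12)  # drive sold as 1TB
two_tb = expected_errors_full_read(2e12)  # drive sold as 2TB
```

Under that assumption the 2TB drive sees twice as many expected errors per full read as the 1TB drive, even though the error rate per bit has not changed at all.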

How T10 DIF helps

Enterprise drives have 520- or 528-byte sectors, while operating systems and applications use 512-byte sectors once the drive is formatted. With 520-byte sectors, the additional 8 bytes are used to store a "TAG". This protects against data misdirection between the host HBA and the storage system. There are also DIF extensions that carry the protection all the way up to the application, enabling true end-to-end data integrity protection.
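Here is a minimal sketch of what that 8-byte tag can look like. It assumes the commonly described T10 DIF layout (a 2-byte guard CRC over the data, a 2-byte application tag and a 4-byte reference tag holding the low 32 bits of the block address). The function names and the sample block address are made up for illustration, and real HBAs and arrays do this in hardware.

```python
import struct

SECTOR_SIZE = 512
T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial used for the guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 (poly 0x8BB7, initial value 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(data: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append the 8-byte tuple (guard, application tag, reference tag)."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected exactly 512 bytes of data")
    tag = struct.pack(">HHI", crc16_t10dif(data), app_tag, lba & 0xFFFFFFFF)
    return data + tag


def verify(sector: bytes, expected_lba: int) -> bool:
    """Check both the data (guard) and the location (reference tag)."""
    data, tag = sector[:SECTOR_SIZE], sector[SECTOR_SIZE:]
    guard, _app_tag, ref = struct.unpack(">HHI", tag)
    return guard == crc16_t10dif(data) and ref == (expected_lba & 0xFFFFFFFF)


# A 512-byte block destined for (hypothetical) block address 1000.
sector = protect(b"\x5a" * SECTOR_SIZE, lba=1000)
```

Calling `verify(sector, 1000)` succeeds, while `verify(sector, 1001)` fails because the reference tag does not match the address being read. That second check is what catches a misdirected write, something a checksum over the data alone cannot do.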

Only a few storage vendors have implemented DIF in their storage arrays so far, and those arrays offer much stronger data integrity protection. Some of them are expensive, and some less so.

2 comments:

  1. Quote "As disk capacities keep growing (there is a 2TB SATA hard drive now), these errors will keep growing and reliability will be reduced."

    This is saying disk reliability is decreasing, but I would suggest this is not correct. I think what you are trying to say is that because disks are getting bigger the number of errors is getting bigger, but this does not mean that disk reliability is reduced unless the percentage is growing.

  2. I appreciate the correction. You are correct that disk reliability is not going down; rather, the chance of running into reliability issues is higher because there is more storage capacity on the drive and it is used more frequently.
