Monday, August 31, 2009

Storage Area Networks Security

One might think of a Fibre Channel storage network as bullet-proof: nobody can hack into it because it sits on a private, Fibre Channel-only network. Let us look at the various storage security scenarios and how they can be attacked. I am not trying to say that storage networks are insecure; rather the other way around, I am trying to show that attacking them is extremely difficult.

Several tools provided by HBA vendors make a few of these attacks fairly easy to carry out, but physical access is still required for most of them.

World Wide Name (WWN) spoofing – A way of bypassing authorization methods in a SAN. Resources in a SAN are allocated based on WWN, and if someone spoofs the WWN of an HBA to match the WWN of another, authorized HBA, the LUNs assigned/allocated to the authorized HBA are granted to the unauthorized one. This is also commonly referred to as a DoS attack, except that in this case the attacker can end up with access to the storage devices as well.

Name Server Pollution – The process of corrupting the name server information on a Fibre Channel switch (Switch-NS) during a PLOGI or FLOGI (port or fabric login) by a client node to a SAN fabric. The pWWN can be spoofed to match that of an authorized one. Once the name server information is polluted, frames can be sent as if they came from an authorized entity. This requires sophisticated software and hardware.

Session hijacking – The act of intercepting Fibre Channel sessions between two trusting entities by guessing the predictable sequence control number and static sequence ID of the Fibre Channel frames that control the session. Once an unauthorized user has hijacked a session, sessions (such as a management session) can be controlled from an unauthorized resource.

LUN mask subversion – The act of changing the masking properties that have been implemented for a particular node, either by spoofing the node's WWN or simply by changing the LUN masking properties on the management client, which does not require authentication.

F-port replication – This occurs when an attacker copies all the data from one host port to another host port that he or she controls, using intelligent switch features that do not require authentication.

Attack Points:
Any OS that has an IP connection and a Fibre Channel connection (HBA) can be a gateway to the FC SAN.

If any server has been infected with a virus, worm, or Trojan, it can be compromised by an attacker and used as a gateway into the SAN.

Ethernet interfaces (management interfaces) on all FC switches connected to the SAN are attack points for SAN enumeration.

Thursday, August 27, 2009

Clustered NAS or Scale Out NAS?

Clustered or Scale-out NAS? Let us look at what each of these terms means, how they relate in a NAS environment, and what the impact is. Historically, clustering has always been done on servers - with Veritas (or any other product) clustering setting up an NFS share and serving clients. Then NetApp came into the NAS market, started with its own file servers, and then advanced them to clusters. In the last few years another technology - scale-out NAS - has become popular.

As the name implies, clustering means 2 controllers in an active-active or active-passive configuration. The main advantage of this is that there is no interruption in service when one of the controllers fails; the second controller takes on the functions of the first. When designing clustered NAS, keep in mind the failure scenario where a single controller needs to do the work of two. It might get overloaded and bring all services to a screeching halt if both controllers are running at >75% utilization all the time. The other disadvantage of a clustered NAS solution is that each controller "owns" its storage and thereby its NAS shares. So if you have a single share that needs the performance of both controllers combined, it cannot be done. You will end up using 100% of one controller while the other does nothing for that one share that needs more performance. This also implies there is no global namespace for the shares - each share can be accessed only via an IP that is tied to one hardware controller.
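
A quick way to sanity-check that failover headroom, as a rough sketch (the 100% ceiling and the example loads are illustrative; real controllers degrade well before full saturation):

```python
# For a two-controller cluster, the surviving controller must carry both loads
# after a failover; the utilization numbers here are illustrative only.
def survives_failover(load_a_pct, load_b_pct, ceiling_pct=100):
    return load_a_pct + load_b_pct <= ceiling_pct

print(survives_failover(45, 45))   # True: both under 50%, the survivor copes
print(survives_failover(75, 75))   # False: 150% of one controller's capacity
```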

Scale-out solutions started showing up a few years ago, and there are a lot of them in the market now. The idea here is horizontal scalability: adding more controllers adds more performance for that single file share you need. Keep an eye on what you need versus what the solution offers, as scale-out solutions differ quite a bit from one another. Some scale-out solutions are storage-agnostic, i.e. you can attach anything as the backend storage and they just provide the controllers. This provides more flexibility for an enterprise that already has an existing relationship with a hardware vendor it procures storage from. You can also move/migrate from one backend storage to another seamlessly without any downtime, and you can leverage the storage virtualization features for better performance, scalability and availability.

The other major advantage of a scale-out solution is the global namespace. The front-end IP space (which NAS clients use) is virtualized on top of more IPs that are logically tied to the controllers. So an access request can come in through one IP (a virtualized backend IP) and leave through another. This makes it a good candidate to leverage more network ports and utilize all available bandwidth, rather than getting stuck with one controller, one IP and a few network ports. And a controller failure (or multiple failures, with some vendors) will not impact client access.

Wednesday, August 26, 2009

Blade Servers - to boot from local disk or SAN?

Blade servers are meant to reduce data-center footprint in physical space, power and cooling by replacing rack-mount servers. I do not want to debate whether they are doing that efficiently or effectively, but let us look at the various methods employed to boot these blades. Most of today's blade servers have an internal local disk on which Operating Systems can be installed. If you are moving from rack-mount servers, the first question that will pop up in your mind is whether there is redundancy for this local disk. Some blade centers support dual local hard drives and some do not; even if there is redundancy, the size of that hard drive is so small that it is probably not going to fit your requirements. Some blades have SSDs for local disks to speed up the boot process.

If your blade server does not have dual local hard drives, then using them for boot should be out of the question. It might also depend on what purposes you are using these blades for. If they are hosting a virtualized environment, maybe a single drive does not matter, as the VMs can fail over to other available servers in the virtual cluster. But it would still put a lot of load on the virtual environment, depending on how many VMs (and of what configuration) need to be moved - just because the blade server is not capable of having mirrored boot disks.

In some cases, boot from SAN is employed even when there are mirrored boot disks. This might be purely for operational reasons, as the data can be moved/migrated to a new cluster if needed. Let us look at why this solution is good or bad.

From a storage perspective, in order to sustain boot IOPS from a whole bunch of servers (physical or virtual), the underlying storage needs to support 'x' amount of IOPS, and that will determine the number of disks that need to be in the boot pool. Even though it is rare for all of the servers to boot at the same time, all servers will be swapping in/out depending on the applications running on them. If boot is from SAN, swap space is obviously also coming from SAN, and that needs to be factored into the IO count while finalizing the storage design. Also, SAN storage is more expensive than local drives (even local SSDs). Blade server vendors are now implementing local SSD drive solutions with RAID capabilities, and that should be the preferred way to use them, especially for virtualized environments.
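
A minimal sizing sketch for that boot pool; the per-host boot and swap IOPS, the per-spindle IOPS and the boot concurrency below are illustrative assumptions, so plug in your own measurements:

```python
# Estimate how many spindles a boot-from-SAN pool needs for IOPS alone.
# All inputs are assumptions for illustration; substitute measured values.
import math

def boot_pool_spindles(hosts, boot_iops_per_host=50, swap_iops_per_host=20,
                       iops_per_spindle=150, concurrent_boot_fraction=0.25):
    boot_load = hosts * boot_iops_per_host * concurrent_boot_fraction
    swap_load = hosts * swap_iops_per_host      # swap load is steady-state
    return math.ceil((boot_load + swap_load) / iops_per_spindle)

print(boot_pool_spindles(64))   # e.g. 64 blades -> ~14 spindles before capacity is even considered
```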

Unless there is a business case for booting off the SAN, it should not be done just because "he/she said so". The already fine line between systems engineers and storage engineers gets even thinner with blade server setups, and most "slow boot" issues tend to get blamed on the storage layer if it is not designed properly. A sufficient amount of cache on the storage will certainly help, as will the RAID configuration where the boot and swap volumes live. Server design should be planned carefully so there is an ample amount of RAM and little or no swapping.

Tuesday, August 25, 2009

HDS Thin Provisioning - Is it really thin?

Thin provisioning has in recent years become a great tool for Storage Admins and Systems Engineers: provision storage on demand and then go on to purchase what is actually required. It greatly helps with capacity management as you do not have to procure storage upfront. Not all provisioning can be thin - it depends on the type of data and the application usage. If it is user data and files keep being added until full capacity, then no thin provisioning technology can save you space. But if it is database data, where the files need to exist upfront but only zero blocks are written into them, or user shares where 'xx' amount of capacity needs to be provisioned upfront but will only be consumed over a period of time, then this is really a great tool to use.

Over time, all vendors have implemented TP; it is now prime time. If a storage array cannot support it, you should think twice about making that purchase. Why would a vendor want to support TP when they could sell more disks to you? Well, they really didn't want to. But the storage community pushed every vendor to implement it, so now it is something every vendor has in their products. Some give it away for free, some license it.

Let us look at how Thin Provisioning was implemented in HDS high-end arrays. The term HDS uses for this is Dynamic Provisioning. Data allocation is done in 42MB units, each referred to as a 'page'. This is the most granular level at which 'free pages' can be reclaimed. Now, 42MB seems to be a very large number - especially when other storage vendors can reclaim blocks from 4k to 256k (depending on the vendor). How can 42MB be efficient for reclaiming unused storage space, and the bigger question is, will this really work with all kinds of Operating Systems and applications?

Right off the bat, there are some Operating Systems that are excluded due to the way they generate metadata while creating file systems. The Solaris UFS file system writes metadata roughly every 52MB, which means all of the pages allocated for a UFS file system will be consumed and none can be reclaimed even though the file system is empty. This certainly does not sound like "thin" provisioning. Linux EXT2 and EXT3 file systems generate metadata roughly every 128MB, so maybe you can see some benefit there. The most HDS-thin-provisioning-friendly file systems seem to be VxFS / JFS (HP-UX) / JFS2 (AIX), as these write metadata at the top/initial block of the partition. The same benefit applies to NTFS as well.
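
To make the impact of the 42MB page concrete, here is a rough illustration. It assumes each metadata write dirties a single page; real cylinder-group metadata can span more than one, which is why in practice nearly every UFS page ends up allocated:

```python
import math

# Rough illustration (not HDS-specific): what fraction of 42MB HDP pages get
# touched just by filesystem metadata laid down at a fixed interval across an
# otherwise empty filesystem.
PAGE_MB = 42

def pages_touched_by_metadata(fs_size_mb, metadata_stride_mb):
    total_pages = math.ceil(fs_size_mb / PAGE_MB)
    touched = {offset // PAGE_MB for offset in range(0, fs_size_mb, metadata_stride_mb)}
    return len(touched), total_pages

for name, stride in [("UFS (~52MB)", 52), ("ext3 (~128MB)", 128)]:
    touched, total = pages_touched_by_metadata(100 * 1024, stride)   # a 100GB filesystem
    print(f"{name}: {touched}/{total} pages allocated "
          f"({100 * touched / total:.0f}%) before any user data is written")
```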

Also, in order to use HDP "efficiently", it is suggested to create partitions sized to the actual requirement. Why would you want to allocate a 2TB "thin" LUN for a 200G requirement, then keep monitoring the usage and, when it hits a threshold, increase the partition size "dynamically" to the next increment? And with 42MB pages, how much space do you really expect to be reclaimed?

Apart from the page size itself, the other challenges today (which may go away with future updates) are that you need a minimum of 4 parity groups per pool, and you can ONLY increase the pool size by adding the same number of parity groups currently in the pool (i.e. 4 more). This sounds ridiculous when other storage vendors can create a thin provisioned block device out of just about any amount of available storage and can increase its size as long as there is enough capacity to commit those blocks.

Also, best practices say it is best not to assign independent LDEVs from parity groups that are also used elsewhere; rather, create a single LDEV out of a parity group and assign it as a whole to the pool.

The way I see it, HDP is just Hitachi's wide striping rather than thin provisioning. All these years, while other storage vendors were able to stripe data across tens of disks, the largest Hitachi array could only do it across 7 (or rather, 7 data and 1 parity), which created lots of hot spots for customers with high transaction-processing requirements and forced them to micro-manage their storage (LDEVs) and move them to unused parity groups for better performance.

Oh yeah! Did I forget to mention that parity groups in an HDP pool cannot be released independently unless you destroy the whole pool?

Thursday, August 20, 2009

Remote Office Branch Office (ROBO) Backups

The problem of having to back up remote offices / branch offices exists in almost all enterprises, and there are several ways to handle it. Most solutions involve having a backup server at each location with a small tape drive / library and backing up locally, without having to remove or replace tape cartridges. There are several problems with this. If the remote office does not have an IT setup (most do not, as they are sales/marketing sites) and there are any issues that need to be handled locally - a failed backup server, a tape library that needs a reboot, media that needs to be replaced, added or removed - someone from Ops needs to pick up a phone and work with the non-technical staff at that location. In some scenarios, that would be the security guard on duty or a receptionist who has no clue what they are doing or what is expected of them. The major issue is that data that needs to be protected is at risk if tapes are not being sent offsite, or if the site has a major power loss resulting in unusable file servers, backup server and tape setup. This greatly impacts productivity at offices where data needs to be accessed in a timely fashion for business contract documents and the like. Also, if tapes are simply left out where anyone can access them, there is IP theft and the legal issues that may stem from it. So, how do we make sure that data at these sites is protected in a timely fashion while achieving better RPO and RTO at the same time?

There are a variety of Managed Backup Services that have come out in the recent past that can help with this problem: the MSP (Managed Service Provider) adds an agent to the host and takes backups periodically. Your data is now safely backed up and secured at an offsite location, ready to be restored if needed. This approach has many disadvantages:
  • Enterprises need to be comfortable storing their IP with a third party
  • Depending on the data size, bandwidth plays an important role. Increasing bandwidth just for backups is an expensive proposition
  • Initial backups take a long time (low bandwidth) and recovery will be just as challenging. To overcome this problem, the initial backup can be seeded and sent over on a USB drive, and restores can be done the same way
  • Most solutions support only Windows-based clients
  • You do not have control of your backed-up data
  • The overall solution can become quite expensive as the amount of data grows. With the kind of data growth seen over the past few years, even a small office can have anywhere from 500G-1TB of data, and that could double or triple in no time
Another approach is to do the same thing (as the MSP) yourself, i.e. choose a site that is large enough to qualify as a central site within each region (a designated central location in the Americas, EU, APAC, etc.). The central-site qualification should be based on the staff skills there, bandwidth to the site, infrastructure presence, etc. Then you can choose a product that does agentless backups (yes, there are agentless backup products - and preferably pick one), which reduces the operations overhead of managing and maintaining agent compatibility whenever you have to upgrade the backup server. You can do the initial backup to a locally attached USB disk, ship it to the central site, and seed that first backup.

The technology used in these backups plays an important role in reducing the amount of data sent over the WAN. The product should have the intelligence to eliminate redundant data blocks, compress, encrypt and then send to the central vault. This completes the whole backup cycle, and you keep control of your backed-up data as well as what to back up and what not to. If you have a SQL Server or an Exchange server at one of these sites, that should not be a problem either, as the product should be able to support it (preferably agentless). You should not lose any desired functionality just because the product is not capable of providing it.

This will make sure you can maintain your defined RPO and RTO levels and reduce costs by eliminating local backup servers and tape infrastructure at all locations. If the central site needs offsite backups, you can send that data to tape at the central location and send the tapes offsite - or you can duplicate your central vault to headquarters. This not only gets you peace of mind, but also saves the valuable time your operations team spends managing these small offices. The beauty of this is that if you want to back up the laptops users carry, you can configure them to do so whenever they connect to the network (CDP).

Wednesday, August 19, 2009

SSD in Storage Arrays

The latest buzz word in Storage world is SSD. What is SSD and how can it help storage systems? Solid State Storage is made from Flash Memory (or rather NAND flash). The beauty of this is that there are no moving parts like in disk devices and hence performs better than regular disk devices. Access times are faster and application performance will be going through the roof and storage will not be a bottleneck anymore. Hold on to that thought while we discuss the use of SSDs in storage arrays.

At the heart of any storage device is Data Integrity which should not be compromised for any kind of performance increase. Let us take a deep dive into this new technology (SSD) and see what it can do and where it can help and where it cannot.

As mentioned earlier, there are no moving parts, and that is a great thing for SSDs. The bad part is that they have a short life span and a high bit error rate, as well as low capacity compared to disk drives (and they are expensive, too). There can also be data retention problems at higher temperatures. These shortcomings can be addressed with a 'controller' that adds more reliability to SSDs.

SSDs come in a disk drive package with either a SATA or a Fibre Channel interface. There are also SSD solutions on PCI cards that can be used as high-performing local storage. Media performance (IOPS) is great compared to HDDs, and the lower power consumption per IOP makes a great ROI point. When you see vendors advertising their SSD solutions, you need to look at both read and write IOP numbers. Reads are always great on SSD, but writes take a big hit, so the write IOP numbers will always be lower. Unlike on an HDD, a write is not just a write to a block; the controller has to erase, transfer and program, hence the imbalance between read and write IOPS. You also need to be wary of who made the SSD, as some cannot do the erase-modify-write cycle, i.e. they are effectively write-once.

Performance drivers for SSDs include the number of NAND flash chips (also called dies), the number of buses, data protection (ECC, Data Integrity Field (DIF), RAID, etc.), whether it is Single-Level Cell (SLC) or Multi-Level Cell (MLC), the effective block size, and lots of other factors.

Of these, the most impactful factor is whether it is SLC or MLC. Let us take a quick look at what these mean. SLC and MLC flash memory are designed in a similar way, except that MLC devices cost less and can have more storage capacity. SLC fares well with high write (erase) performance and greater reliability (even at higher temperatures). Due to the nature of MLC, high-capacity flash memory cards are available to consumers at low prices (the ones in your cell phone/PDA/camera, etc.). Where high performance is required, SLC is used (and hence it is expensive). SLC makes a good fit in embedded systems.

In order to make high-capacity flash (SSD) for storage systems, MLC is used. And since it is not as reliable, a RAID mechanism has to be implemented to take care of the BER (bit error rate), and half the capacity of the SSD is dedicated to failed/dying cells to preserve data integrity. The average life of an SSD in a storage array with a 75/25 read/write ratio is about 5 years. Since these are almost sure to be replaced at the end of those 5 years, the replacement cost is factored into the selling price as well.

Finally, let us see if we really need SSDs in our storage arrays! Storage requirements come in two forms: IOPS or capacity. For most applications it is a capacity requirement, while some have a high IO requirement (OLTP). There are a few cases where an application requires high IO as well as large capacity (again, OLTP, email, etc.). Naturally, when you have a huge capacity requirement you will have a rather large spindle count, which can take care of your IO requirements. If proper capacity planning (both IO and size) is not done and you keep adding new applications to your existing pool of disks, then you will certainly run into performance issues.

Let us take a use case with a high IO requirement: 2,000 IOPS for a DB of 1TB. If we do it the HDD way, we will need approximately 8 disks (with 300G drives, including RAID parity disks, 6+2 / 7+1, etc.) for capacity, but that count will not satisfy the IO requirement, so you will need to double or triple it (~150 IOPS per 15K drive). Now, if that requirement were a sustained random read load of 2,000 IOPS, the number of HDDs required would be even more - maybe 5 times. So, for 1TB of data you are now using 40 disks (12TB raw capacity). It will be hard to explain to management why you cannot use the rest of that capacity for other projects (good luck with that).

If we do it the SSD way, you will need approximately 6 SSDs (RAID5, 4+2P / 5+1P, etc.) from a capacity point of view, which is way over your IOPS requirement. This is economical ONLY if those 6 SSDs are less expensive than 40 HDDs. In today's storage arrays, where most licensing applies to installed raw TB, with 40x300G you will be paying more for licenses on that array, and less if you go with SSD. It all depends on the price per requirement.
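
Here is the same sizing argument as a small sketch. The per-drive figures are assumptions for illustration (about 150 IOPS per 15K HDD as above, and an assumed 256GB SSD at roughly 5,000 IOPS), not vendor specifications:

```python
# Compare how many drives are needed for 1TB at 2,000 IOPS, the HDD way and
# the SSD way. The 1.25 factor is a crude allowance for RAID parity capacity.
import math

def drives_needed(capacity_tb, iops, drive_gb, drive_iops, parity_factor=1.25):
    by_capacity = math.ceil(capacity_tb * 1024 / drive_gb * parity_factor)
    by_iops = math.ceil(iops / drive_iops)
    return max(by_capacity, by_iops)

print("15K HDDs:", drives_needed(1, 2000, drive_gb=300, drive_iops=150))   # IOPS-bound
print("SSDs:    ", drives_needed(1, 2000, drive_gb=256, drive_iops=5000))  # capacity-bound
```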

Not many applications generate the kind of load that really needs SSDs. Maybe if there were a tiered storage array that could assign Tier-0 (SSD) blocks to highly active data and then move them off to Tier 1 (15K HDD) or Tier 2 (10K HDD), that would be an ideal world for Storage Engineers: dynamically built solutions with a very low price per IO or price per GB.

Tuesday, August 18, 2009

Storage Queue Depth and Performance

Queue depth is something most storage admins tend to overlook, and it is behind many performance issues. Let us look at what queue depth is, what it does, how it impacts performance and how to make things better.

Queue depth is the number of commands that the HBA can send / receive in a single chunk - per LUN. From the host HBA point of view (initiator), it is the number of commands that can be queued (or stored) and then sent to storage. From the storage point of view (target), it is the number of commands it can accept in one shot, again per LUN. I keep stressing 'per LUN' since this is the most important factor in determining what the queue depth setting on the host should be. On a storage target you cannot change it, as target ports come pre-configured - usually with 2048 or 4096 queues (Fibre Channel). Most storage target ports use 4096.

Let us examine how to determine the queue depth on an initiator. If you have a storage target port that supports a queue depth of 4096, and there is a single host accessing that port with 10 LUNs, the maximum queue depth setting is 4096/10 = 409, and since queue depth is set in powers of 2 (2/4/8/16/32/64/128/256 and so on), we should use 256 in this case. Most HBAs have a default queue depth of 32 (0x20 hex), and changing it from 32 to 256 improves the response times for the storage LUNs. One important thing to remember here is that you have now nearly maxed out the storage target port queues (2560 out of 4096). You can add 6 more LUNs to the same host with a 256 queue depth - anything above that and you will start seeing issues. You will see latency problems only if all the queues are full, but you have to plan for the worst case and assume the queues are full all the time (at least they will be during backups).
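
A minimal sketch of that per-LUN arithmetic, assuming the typical 4096-queue target port and the 256 cap used in the example; check your HBA and array documentation for the real limits:

```python
# Per-LUN queue depth from the target port's total queue budget, rounded down
# to a power of two and capped at a typical HBA maximum of 256.
def host_queue_depth(target_port_queues, luns_behind_port, cap=256):
    per_lun = target_port_queues // luns_behind_port
    depth = 1
    while depth * 2 <= min(per_lun, cap):
        depth *= 2
    return depth

print(host_queue_depth(4096, 10))   # 256, as in the example above
print(host_queue_depth(4096, 50))   # 64 when many more LUNs share the port
```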

Performance degradation starts quickly once you reach the maximum queue depth on the target: it starts backing off the queues, and that hits your response times hard. You also cannot add any further hosts or present any more LUNs from that target port. If you have only a few hosts and LUNs, the default is fine. If you have the luxury of dedicating ports to high-performance applications, make sure the queue depth is configured appropriately so you get the best performance.

In Linux, you can set the queue depth through modules.conf (2.4 kernel) or modprobe.conf (2.6), and in Solaris through the QLogic or Emulex config file. That is just at the host level. The HBA itself needs its setting changed to match the host, which can be done from the HBA BIOS or the CLI package that QLogic provides.

Monday, August 17, 2009

VMware backup pains for Storage/Backup Admins

If there is anything that has been talked about more than anything else lately, it is virtualization (and VMware). In this post, let us examine the pain points for storage/backup admins: how this is impacting them and the IT backup infrastructure.

Backups for most VMware implementations are done via the host, i.e. install a backup agent on the VM and back it up like a regular host. The advantages here are the same as for a physical host, i.e. host-level recovery. The main disadvantage of this approach is needing a license for each client, in which case your ROI is not any better (for backups). A slightly better approach is to back up via the ESX server, i.e. the backup agent is installed on ESX and it backs up all the VMDK files. The disadvantage is that you will be backing up all data regardless of the backup type (full or incremental): since VMDK files get modified while VMs are active, an incremental backup is the same size as a full. In some cases backups are done at the storage level, i.e. back up the underlying storage volume via NDMP. This has the same disadvantage: incremental backup size is the same as a full. The 'ESX way' to do backups is VCB, in which case you have to write scripts and have a dedicated server to act as the VCB backup server. The main advantage is that incrementals are really incrementals.

Is it worth going through the pain of setting up VCB backups (maybe that is the reason most VMware backups are not done this way)? A recent approach is to use 3rd-party software that does block-level incremental backups, which really does seem to save a lot of trouble and backup size. But wait, if it is that simple, why isn't everyone implementing it? Maybe there is a downside? Well, what is not advertised with these products is that they put a very high load on your primary storage, making all the VMs starve for disk resources, with everyone ultimately blaming the storage admin for poor disk response and the architect for designing such a poor storage solution. These backup products claim to do block-level, de-duplicated incrementals - which is true, but at the cost of having all the VMs suffer until the backup completes. The product does not have a true deduplicated repository and hence keeps scanning all the blocks in the VMDK (even empty blocks) for changes. For example, if you have a 500G VMDK but are using only 200G and the rest is white space, all 500G is scanned for changed blocks - not once, but every time a backup is done.
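
A back-of-the-envelope sketch of what that rescanning costs the primary array; the scan rate and sizes below are illustrative assumptions, not measurements of any particular product:

```python
# Why rescanning the whole VMDK every backup hammers primary storage,
# compared with scanning only used blocks or tracking changed blocks.
def scan_hours(size_gb, scan_rate_mb_s):
    return size_gb * 1024 / scan_rate_mb_s / 3600

provisioned_gb, used_gb, changed_gb = 500, 200, 10
rate = 100   # MB/s of read load placed on the array during the scan (assumed)

print(f"Full-VMDK scan:   {scan_hours(provisioned_gb, rate):.1f} h of array reads per backup")
print(f"Used blocks only: {scan_hours(used_gb, rate):.1f} h")
print(f"True changed-block tracking would read roughly {changed_gb} GB instead.")
```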

Due diligence needs to be done before you choose whichever method you want to implement, and make sure the storage folks are involved. This is not just a VMware admin decision, as it impacts other components, and ultimately it is the storage engineer's responsibility to "fix" the problem. You might like the "deduped, block-level backups", but you also need to understand what problems they might cause and how to tackle them.

If only there were a real product that could integrate into both ESX and the storage and do "real" block-level incrementals with dedupe, that would be a great one. Since it involves storage, I would expect such a product to come from a storage vendor, since they have insight into the workings of storage - both blocks and dedupe. The initial investment might be high, but you will be happy you made it, especially with large VMware deployments.

vSphere is coming out with a VMware backup methodology that will help from a backup point of view as well as provide some kind of DR. They are not trying to make it a full-blown backup product for VMware (at least not at this time), but if it matures into that, it would be a real help.

Saturday, August 15, 2009

Unified Storage with focus on NetApp and EMC

When you think of storage, you either think of SAN or NAS, depending on what your requirement is. What if it is both? The tendency is to treat the two requirements as separate and hence to look for two solutions. What if both requirements can be met with a single solution? That is where "unified storage" comes into the picture. Unified storage is no mystery, except that the underlying storage array should be capable of serving both NAS and SAN clients.

Even though it was Sun that developed the original Network File System protocol, it was NetApp that made it popular in the storage world. NetApp championed the protocol and developed storage arrays based on NFS. The only file server competitor it had at that time was Auspex (if you care to remember). The names NetApp and Filer go hand in hand when people refer to them. NetApp started adding block-based protocols to its storage arrays soon after it realized there was not much potential for growth in the NAS segment alone. That was really a good move for storage admins who prefer to have both file-based and block-based protocols served from a single array - managing one device rather than two (or more). Maybe it was not so good from a stability point of view (too many bugs!!). But soon NetApp overcame the initial hiccups and made its storage devices and OS (Data ONTAP) as robust for block as they were for NAS.

I don't think EMC needs an introduction to anyone in the storage field (or any field, for that matter). If there is any company in the storage arena making acquisitions like crazy, it is EMC. It can also be credited with poor integration of those acquisitions into its mainstream products. EMC is very well known for its block-based products. It also has unified storage products serving both NAS and SAN.

Let us take a quick look at both of them and compare their capabilities, advantages and disadvantages for both NAS and SAN in their unified product lines.

Disk virtualization was made popular by NetApp. With its aggregates spread across multiple disks and ONTAP managing 4K block segments, there would literally be no hot spots on disks. When it initially came out with an aggregate limitation of 16TB, not many gave it a second thought, since 16TB seemed far off from what they used. But with growing disk sizes, it is now a major problem pushing NetApp and its customers to look at alternative solutions. The advantages of the whole disk-virtualization concept are now gone, as you can only have "x" number of disks in each aggregate and you will need to do a layout and maintain spreadsheets of what goes where. This may not be the case with every customer, but large customers certainly face this everyday problem of managing space.

EMC took its CLARiiON product, slapped a couple of NAS "heads" on top, and started doing NAS with Celerra. I am not sure this is really a good solution, especially when the OS hangs and you have to reboot controllers (or blades). Agreed, the controllers are redundant, but this is still an issue. The N+1 architecture defines one hot standby that provides failover for all the remaining N controllers. This may seem like a good thing until you realize what happens if more than one controller fails over to the standby. In order for the standby to support failover for more than a single blade, each of the individual blades has to be running at 1/N of its capacity. This does not seem right!! At least with NetApp, to support failover you just need to be running at 50%. With 3 active blades you have to run at 33%, and with 4 blades you have to stick to 25%. Even EMC is plagued by the familiar 16TB limitation, except that theirs is on usable capacity while NetApp's is based on raw capacity.
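
The 1/N headroom math above, as a tiny sketch (worst case: the single standby must be able to absorb every active blade's load at once):

```python
# Worst case discussed above: one standby must absorb the load of all active
# blades, so each active blade can safely run at only 1/N of its capacity.
def max_safe_utilization(active_blades):
    return 1.0 / active_blades

for n in (2, 3, 4):
    print(f"{n} active blades + 1 standby: run each at <= {max_safe_utilization(n):.0%}")
```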

Both NetApp and EMC can serve NAS (NFS and CIFS) with the same degree of robustness. But when it comes to SAN, with NetApp you just create another volume, then a LUN, assign it to the host, and you are done. With EMC, you cannot use the same interface you manage your NAS (Celerra) with - you now have to use the CLARiiON interface to create the LUNs, and watch out for where you are creating them. EMC cannot virtualize disks the way NetApp does over a whole bunch of disks; instead, LUNs have to be carved out of a set number of disks. And watch out for those 5 drives in the first DAE where the FLARE code resides. Your array and its configuration are wholly dependent on these drives, which form a RAID-5 group - so your entire array's reliability and availability are only as good as RAID-5.

Again, with NetApp you can use built-in OS tools to manage multipathing in the SAN, whereas with EMC you will have to purchase licensed PowerPath. There is also a lot of other software for managing and monitoring storage performance that needs to be purchased from both EMC and NetApp, which you need to be aware of.

Scaling NAS connectivity with NetApp is as simple as adding more Ethernet ports, but with Celerra you need to know what each blade configuration can provide - some models support only a few Ethernet ports, and if you need more you will have to upgrade to the next model. If you are involved in the purchase decision, the last thing you want to do is stand before management and ask for budget to upgrade the storage array you just bought a few months ago. I would not want to be standing in those shoes!!

One of the major advantages of NetApp over EMC, as I see it, is the ability to resize volumes as and when required. With Celerra that is not possible, and you will have to do a data migration to resize volumes - and although EMC claims you can start off with a thinly provisioned volume, that does not help with volume resizing in any way.

And snapshots are just a snap with NetApp - be it a NAS volume or a SAN LUN. With EMC you have to go through the SnapView and/or cloning process (again through CLARiiON), which is neither efficient nor effective since it is copy-on-write - which carries a write penalty - while NetApp's snapshot is a true pointer-based snapshot.

Where both companies are lagging is in HSM - Hierarchical Storage Management. EMC uses Rainfinity, while NetApp does not even have an offering. NetApp's preferred one used to be VFM, which is no longer offered (that's the last I know of). With both companies having 16TB volume limitations, not addressing them, and being unable to provide a solution for customers' pain points, I believe they are not even listening to customers. Rather than starting a bidding war for DataDomain, why can't they address the issues in their storage arrays? Maybe they don't care? Once you are locked into a vendor, they know you will stick with them - or else the loss is yours.

Recent technologies enable massive clustered file systems with almost no limitations and virtually all the features the big boys have. Maybe it is time to let the big boys know that we don't care about them if they are not listening!!

Thursday, August 13, 2009

Understanding DeDuplication - part2

In this continuation of my blog "Understanding DeDuplication", I would like to talk about deduplication ratios: how they get interpreted, how to read them, what to expect in your environment, how to achieve better ratios, and which data types get better ratios and which do not.

Most vendors claim at least 10:1 or 20:1, and some go to the extreme of 500:1. Now, what are these numbers, and how exactly do they help you determine the space savings?

In its simplest form, the deduplication ratio is the ratio of the actual bytes referenced by the dedupe blocks to the bytes actually stored, i.e. the ratio of actual data to stored data. If the actual data on the source is 100G but you are only storing 10G on the dedupe target, the ratio is 100:10 or 10:1, the space reduction is 1/10, and in % it is (1-1/10)x100 = 90%. If we calculate the % for a 500:1 dedupe ratio, it is (1-1/500)x100 = 99.8%, and for 100:1 it is 99%.

As large as it may seem, 500:1 is not saving you much more space than 100:1 (99.8% vs 99%). If this were a 100TB store, you would see a space savings of 99TB at 100:1 as opposed to 99.8TB at 500:1. This shows that the ratios are not a much bigger deal once they go above 10:1 (90%), as the theoretical maximum space savings you can achieve is 100%. When you start looking at space-saving devices, make sure you understand what they are actually saying.
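
The ratio-to-savings arithmetic, as a quick calculation that shows how flat the curve gets past about 10:1:

```python
# Convert a claimed dedupe ratio into the percentage of space actually saved.
def space_savings_pct(dedupe_ratio):
    return (1 - 1 / dedupe_ratio) * 100

for ratio in (2, 5, 10, 20, 100, 500):
    print(f"{ratio:>3}:1  ->  {space_savings_pct(ratio):.1f}% space saved")
```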

Factors that influence deduplication ratios

There is no secret to deduplication: the amount of space savings you see is based on the number of identical copies you have. If there is no duplicate data, no product can get you any space savings. Apart from primary storage, the biggest dedupe savings are found in backups, since these keep copying and re-creating the same data day in and day out. If backups are done like most implementations - weekly fulls and daily incrementals - all of the fulls can achieve the best dedupe, and the incrementals do fairly well too. An incremental backup is usually based on the file archive bit, which flips when there is even a small change in the file - but the entire file gets backed up. That can easily be deduped.

The type of data also influences dedupe ratios: general user file data and database backups are good candidates for deduplication, while medical imaging, geological data and other pre-compressed or inherently unique data sets dedupe poorly.

Length of data retention and the way backups are performed - this is the most impactful factor in determining the dedupe ratio. The more copies of the same data are stored, the higher the dedupe ratio. If backup retention on a dedupe target is only 1 week, the most you can achieve is the deduplication within the same full backup plus the incrementals (differential and cumulative).

If the full backup size is 100G and, assuming 6 differentials with a 10% rate of change, you back up 10G per day and 60G over 6 days, and the unique data in the full is 20G, the dedupe ratio achieved is 160/20 = 8x. Now, if those were daily full backups, the dedupe ratio would have been 700/20 = 35x. Daily fulls would also help restore times (better RTOs), as all you need is the latest full backup rather than restoring a full and applying the incrementals.
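
The retention example above as a quick calculation (it assumes the 20G of unique data stays constant across the week, which is a simplification):

```python
# Dedupe ratio = logical data backed up / unique data actually stored.
def dedupe_ratio(logical_gb_backed_up, unique_gb_stored):
    return logical_gb_backed_up / unique_gb_stored

full_gb, daily_change_gb, days, unique_gb = 100, 10, 6, 20

weekly_full = full_gb + daily_change_gb * days   # 100 + 60 = 160 GB logical
daily_fulls = full_gb * (days + 1)               # 7 x 100  = 700 GB logical

print(f"Weekly full + incrementals: {dedupe_ratio(weekly_full, unique_gb):.0f}:1")
print(f"Daily fulls:                {dedupe_ratio(daily_fulls, unique_gb):.0f}:1")
```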

Wednesday, August 12, 2009

Data Corruption in Storage

You might be surprised to know that data corruption happens in the latest and greatest storage arrays, even with all the good RAID technologies taking care of the data. Data corruption might not be something you run into daily, but failed disks are. Let us look at what exactly a failed disk is and what it means for your data.

Let us first examine what a disk looks like and what it is comprised of. At the most granular level, a hard drive is divided into sectors, typically 512 bytes each, and data is written to the disk in blocks made up of these 512-byte sectors. You might be wondering why that 1TB hard drive you just bought from Fry's or BestBuy shows 900+GB and not a full 1000GB. Part of that is simply decimal versus binary units (the drive vendor's 1TB is 10^12 bytes, roughly 931GiB to the OS), and in addition, in order to recover from failed reads/writes/corrupted data, hard drives reserve extra space per sector to store a checksum (commonly called ECC). This extra space is part of the difference between 'raw' and 'usable' capacity. Well, if the hard drive can detect all these errors, why are we even bothered about data corruption? Because as a hard drive ages, several factors contribute to corruption on individual drives that the drive itself cannot catch.
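
The decimal-versus-binary part of that difference is easy to compute (the ECC/spare reserve sits on top of this and varies by drive):

```python
# Drive vendors count capacity in decimal (10^12 bytes per "TB"); operating
# systems count in binary GiB (2^30 bytes), so the same drive looks smaller.
def marketed_tb_to_gib(tb):
    return tb * 10**12 / 2**30

print(f"1TB -> {marketed_tb_to_gib(1):.0f} GiB")   # ~931
print(f"2TB -> {marketed_tb_to_gib(2):.0f} GiB")   # ~1863
```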

RAID technology takes care of the issues that independent hard drives cannot handle on their own, using parity calculations, periodic scrubbing, reconstruction, predictive analysis, etc.

A study, "An Analysis of Data Corruption in the Storage Stack" by Bairavasundaram et al., analyzed corruption instances recorded in production storage systems containing 1.53 million disk drives over a period of 41 months and found about 400,000 checksum mismatches.

As disk capacities keep growing (there is a 2TB SATA hard drive now), these errors will keep growing and reliability will decrease. There has to be some mechanism to take care of these errors beyond the regular RAID implementation. Misdirected writes, torn writes, data path corruption, parity pollution, etc. are all of great concern with SATA drives. To protect FC drives from these same errors, a standard was put in place: T10 DIF.

How T10 DIF helps

Enterprise drives have 520/528-byte sectors, while Operating Systems and applications use 512-byte sectors when formatted. The additional 8 bytes are used to store protection "tags". This protects against data misdirection between the host HBA and the storage system. There are DIF extensions that carry the protection all the way up to the application, enabling true end-to-end data integrity protection.
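
As a rough sketch of what lives in those 8 bytes - the standard T10 DIF layout is a 2-byte guard CRC, a 2-byte application tag and a 4-byte reference tag (typically the low 32 bits of the LBA) - the code below is illustrative rather than a validated implementation:

```python
import struct

def crc16_t10dif(data, poly=0x8BB7):
    # Bitwise CRC-16 with the T10-DIF polynomial, init 0, no reflection.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def dif_tuple(sector_data, lba, app_tag=0):
    # 2-byte guard (CRC over the 512-byte sector), 2-byte application tag,
    # 4-byte reference tag (low 32 bits of the LBA for Type 1 protection).
    guard = crc16_t10dif(sector_data)
    return struct.pack(">HHI", guard, app_tag, lba & 0xFFFFFFFF)

sector = bytes(512)                          # an all-zero 512-byte sector
print(dif_tuple(sector, lba=1234).hex())     # 8 bytes of protection information
```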

There are only a few storage vendors out there that have implemented DIF in their storage arrays - which goes a long way toward absolute data integrity. Some of them are expensive and some are not so much.

VTL Vendor comparison for Data DeDuplication

In continuation of my original post "DeDuplicating Data Backups", here are the details of the follow-up.

As I was saying in my requirements list, one of the main objectives is integration with NetBackup OST. Only a few vendors have that: DataDomain, Sepaton, FalconStor, Quantum, Copan (an OEM of FalconStor), and Diligent/IBM.

I took each of these individually and compared their features - what they claim they can do and whether they can meet my requirements. Based on that, FalconStor came out on top of the list.

| Feature | Sepaton | Diligent/IBM | Data Domain | FalconStor | Quantum | NetApp |
|---|---|---|---|---|---|---|
| Appliance or Software based | Appliance | Appliance | Appliance | Both | Appliance | Appliance |
| De-Duplication? | Y | Y | Y | Y | Y | Y |
| Source/Target based de-dup | Target | Target | Target | Target | Target | Target |
| De-Duplication type (Inline/Post-Processing) | Post | Inline | Inline | Post | Post/policy-based | Post |
| If post processing, need to wait until all data is received? | Y | NA | NA | Configurable | Configurable | Y |
| De-duplication level | File | File | Sub-file | Sub-file | Sub-file | Block/sub-file |
| Fixed/Variable length segment size - granularity | Differencing | | Variable | Fixed | Variable | Variable |
| De-duplication technology | Delta diff. | Delta diff. | Hash | Hash | Hash | Delta diff. |
| Global (if multiple devices) de-duplication? | Y | Y | N | Y | N | N |
| Periodic/scheduled scrubbing to remove unclaimed blocks? | | | Y | N | | Y |
| Max devices in global de-dup | 5 | 2 | 1 | 8 | 1 | 1 |
| Max throughput per device (ingest rate) | 600MB/s | 900MB/s | 750MB/s | 1.5GB/s | 880MB/s | 600MB/s |
| Max throughput in max config (ingest rate) | 3GB/s (11TB/hr) | NA | 750MB/s | 12GB/s (43TB/hr) | NA | NA |
| De-duplication speeds (per device/globally) | 1500MB/s (5.5TB/hr) | NA | 750MB/s | 4GB/s | 500MB/s | Not published |
| Restore speeds same as backup? (impact of rehydration) | N | N | N | Y | N | N |
| Encryption | N | | | Y | | N |
| Compression | Y | Y | Y | Y | Y | Y |
| Integrated replication | Y | Y | Y | Y | Y | Y |
| Replication technology (FC/IP) | IP | IP | IP | Both | IP | FC (DWDM) |
| Bi-directional replication on same set of devices? | | | Y | Y | Y | |
| Network-optimized replication? (dedup/compress) | Y | Y | Y | Y | Y | N |
| Physical tape integration? | N | N | N | Y | Y | Y |
| Integration with Symantec OST? | N | N | Y | Y | Y | N |
| Shared storage (leverage existing storage infrastructure & no vendor tie-in for backend storage) | Y | Y | N | Y | N | N |
| Dynamic addition of capacity to VTL | N | Y | N | Y | N | N |
| HA configuration (in case one of the VTL appliances fails) | Y | Y | N | Y | N | N |
| If appliance, RAID level used | 5 | Vendor qualification matrix | 5 | Vendor qualification matrix | 5 | 5 |
| Scalability (with max node config - HA only) | 1.2PB | 4 | 768TB | 2.4PB (SIR) / 32PB (VTL) | 220TB | 128TB |