Thursday, September 17, 2009

Understanding DeDuplication

De-duplication comes in several flavors (block-based, hash-based, etc.), each of which can be implemented in different ways (variable block or fixed block, 128-bit or 64-bit hashes, sub-file, etc.) and at different points in the data path (post-processing, source-based, in-line, etc.).

Let us examine each of these and see where they fit into the products available on the market. The most common questions posed about de-duplication are:

- Is it source-based, post-processing, or in-line?
- Is it hash-based or block-based?
- What are the chances of a hash collision?
- What is the impact of un-deduping (re-hydration)?
- How fast can the device de-dupe?
- What de-dupe ratio am I going to get?
- What kinds of data can be de-duped?

Let us talk about source-based deduplication in this post:

In my opinion, source-based de-duplication is not the preferred option for most storage/system engineers, as it adds load on the host and needs a possibly large repository if the host has a ton of files. The de-dupe application maintains a local repository that keeps track of every file on that system with a block reference or hash reference. When the time comes to back up, it scans the whole file system, compares each hash against what the repository has, and anything that has changed is sent over to the backup application for storing on tape or disk.
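As a rough illustration of that scan-and-compare step, here is a minimal Python sketch of hash-based source de-duplication. It assumes a simple fixed block size and an in-memory `repository` dictionary standing in for the local reference store; none of this reflects any particular vendor's implementation.

```python
import hashlib
import os

BLOCK_SIZE = 64 * 1024          # illustrative fixed block size (64 KB)
repository = {}                 # hash -> seen, stands in for the local reference store

def changed_blocks(path):
    """Yield (offset, data) for blocks whose hash is not yet in the repository."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in repository:
                repository[digest] = True     # remember it for the next backup run
                yield offset, block           # only this data goes to the backup target
            offset += len(block)

# During a backup run, only blocks the repository has never seen are sent over the wire.
for name in os.listdir("."):
    if os.path.isfile(name):
        for off, data in changed_blocks(name):
            pass  # send (name, off, data) to the backup application
```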

This is more efficient than an incremental backup, since an incremental backup cannot work at the sub-file level: if a file changes, the entire file is sent rather than just the changed blocks. Several implementations of source-based de-duplication are in use today, the most popular being

CommVault, Avamar (EMC), PureDisk (Symantec), and OSSV (NetApp).

CommVault can also store the de-duped blocks on tape. With most other solutions, when the data is sent to tape it gets un-deduped (re-hydrated), but with the latest CommVault Simpana, de-duped data can be written to tape as-is, which can save a ton of space on tape.

As with any advantage, there are associated drawbacks as well. If you lose a single tape (or a disk block, if disk storage is the target) from the whole sequence, it can be a disaster, because that tape may be referenced by far more data than the amount that physically fits on it. If you have achieved a de-dupe ratio of 9:1, that means roughly 89% space savings, which boils down to one stored block being referenced in place of nine; if that block cannot be recovered, you lose every backup that points to it. This is not specific to source-based de-duplication or to any single vendor, but is the general case with any de-dupe solution. One has to make sure that whatever technology sits on the backend is not just cheap disk but a reliable, highly available target that can guarantee data integrity.
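The arithmetic behind those ratios is straightforward; a quick sketch of converting a de-dupe ratio into space savings:

```python
def space_savings(dedupe_ratio):
    """Fraction of capacity saved for a given dedupe ratio (e.g. 9 for 9:1)."""
    return 1.0 - 1.0 / dedupe_ratio

for ratio in (2, 5, 9, 10, 20):
    print(f"{ratio}:1 ratio -> {space_savings(ratio):.1%} space savings")
# 9:1 works out to about 88.9% savings; 10:1 is the exact 90% figure.
```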

With PureDisk, Avamar and NetApp, their own disk needs to be used as the backup target to make this most efficient. Tape-out is quite complex with Avamar; in the case of NetApp, you can simply send the volume out via NDMP.

One disadvantage I see with this is that the final target area might hold data from several hosts that were all de-duped independently and have commonalities between them that are not being taken care of. In other words, there is no global de-dupe across the data from all the clients' backups. This could be addressed if the backup disk target is itself a device that can de-dupe what it receives.
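A tiny illustration of the problem, assuming two hypothetical hosts (`hostA`, `hostB`) that each keep their own hash repository: the same block looks "new" to both of them, so the target ends up storing it twice unless it de-dupes what it receives.

```python
import hashlib

# Two hosts back up an identical block of data, each against its own local repository.
block = b"common OS library contents" * 1000

host_repos = {"hostA": set(), "hostB": set()}   # per-host reference stores
global_repo = set()                             # what a de-duping target could keep

stored_on_target = 0
for host, repo in host_repos.items():
    digest = hashlib.sha256(block).hexdigest()
    if digest not in repo:          # each host independently sees the block as "new"
        repo.add(digest)
        stored_on_target += 1       # so each host sends its own copy
    global_repo.add(digest)         # a global index would keep only one copy

print(stored_on_target, "copies land on the target without global de-dupe")
print(len(global_repo), "copy would be kept if the target de-dupes what it receives")
```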

Monday, August 31, 2009

Storage Area Networks Security

One might think of their Fibre Channel storage network as bullet-proof, reasoning that nobody can hack into it because it sits on a private, FibreChannel-only network. Let us look at the various storage security scenarios and how they can be attacked. I am not trying to say that storage networks are insecure; rather the opposite, I want to show that attacking them is extremely difficult.

Several tools provided by HBA vendors make a few of these attacks fairly easy to carry out, but physical access is still required for most of them.

World Wide Name (WWN) spoofing – a way of bypassing authorization in a SAN. Resources in a SAN are allocated based on WWN, and if someone spoofs an HBA's WWN to match that of another, authorized HBA, the LUNs assigned to that authorized HBA will be granted to the unauthorized one. This can also be used as a denial-of-service (DoS) attack, except in this case you can end up with access to the storage devices as well.
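A conceptual sketch of why this works, assuming a LUN-masking table keyed purely by WWN; the WWNs and LUN names below are made up and do not reflect any array's actual interface.

```python
# Conceptual illustration only: identity based purely on a presented WWN is spoofable.
lun_masking = {
    "10:00:00:00:c9:aa:bb:01": ["LUN_0", "LUN_1"],   # the authorized HBA's WWN
}

def luns_granted(presented_wwn):
    """The array grants whatever is masked to the WWN it is shown -- nothing more."""
    return lun_masking.get(presented_wwn, [])

print(luns_granted("10:00:00:00:c9:ff:ee:99"))       # unknown HBA: no LUNs
print(luns_granted("10:00:00:00:c9:aa:bb:01"))       # spoofed WWN: full access
```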

Name server pollution – the process of corrupting the name server information on a Fibre Channel switch during a PLOGI or FLOGI (port or fabric login) by a client node to the SAN fabric. The pWWN can be spoofed to match that of an authorized node. Once the name server information is polluted, frames can be sent to an authorized entity. This requires sophisticated software and hardware.

Session hijacking – the act of intercepting a Fibre Channel session between two trusting entities by guessing the predictable sequence control number and static sequence ID of the Fibre Channel frames that control the session. Once an unauthorized user has hijacked it, the session (a management session, for example) can be controlled from an unauthorized resource.

LUN mask subversion – the act of changing the masking properties that have been applied to a particular node, either by spoofing the node's WWN or by simply changing the LUN masking properties on the management client, which does not require authentication.

F-port replication – occurs when an attacker copies all the data flowing through one host port to another host port that he or she controls, using intelligent switch features that do not require authentication.

Attack Points:

- Any OS that has both an IP connection and a Fibre Channel connection (HBA) can be a gateway to the FC SAN.
- If a server has been infected with a virus, worm or Trojan, it can be compromised by an attacker and used as the gateway into the SAN.
- The Ethernet (management) interfaces on all FC switches connected to the SAN are attack points for SAN enumeration.

Thursday, August 27, 2009

Clustered NAS or Scale Out NAS?

Clustered or scale-out NAS? Let us look at what each of these terms means, how they apply in a NAS environment, and what the impact is. Historically, clustering has always been done on servers, with Veritas (or any other clustering product) setting up an NFS share and serving it out to clients. Then NetApp came into the NAS market with its own file servers and later advanced them into clusters. In the last few years, another technology, scale-out NAS, has been gaining popularity.

As the name implies, clustering means two controllers in an active-active or active-passive configuration. The main advantage is that there is no interruption in service when one of the controllers fails; the second controller takes over the functions of the first. When designing clustered NAS, keep in mind the failure scenario where a single controller needs to do the work of two. It can get overloaded and bring all services to a screeching halt if both controllers are running at over 75% utilization all the time. The other disadvantage of a clustered NAS solution is that each controller "owns" its storage and thereby its NAS shares. So if you have a single share that needs the performance of both controllers combined, it cannot be done: you will end up using 100% of one controller while the other does nothing for that one share that needs more performance. This also means there is no global namespace for the shares; each share is accessed through an IP that is tied to one hardware controller.
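A quick back-of-the-envelope check of that failure scenario, assuming the surviving controller simply inherits the failed controller's load:

```python
def surviving_load(util_a, util_b):
    """Approximate load on the surviving controller after a failover."""
    return util_a + util_b

for a, b in [(0.40, 0.40), (0.75, 0.75)]:
    total = surviving_load(a, b)
    verdict = "fits" if total <= 1.0 else "overloaded"
    print(f"{a:.0%} + {b:.0%} -> {total:.0%} on one controller ({verdict})")
# 40% + 40% leaves headroom after a failover; 75% + 75% does not.
```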

Scale-out solutions started showing up a few years ago, and there are a lot of them on the market now. The idea is horizontal scalability: adding more controllers adds more performance for that single file share you need. Keep an eye on what you need versus what a solution offers, because scale-out solutions differ from one another. Some are storage-agnostic, i.e., you can attach anything on the backend as the storage and they just provide the controllers. This gives an enterprise more flexibility if it already has an existing relationship with a hardware vendor it procures storage from. You can also move/migrate from one backend storage to another seamlessly, without any downtime, and leverage the storage virtualization features for better performance, scalability and availability.

The other major advantage of a scale-out solution is the global namespace. The front-end IP space that NAS clients use is virtualized across IPs that are logically tied to the controllers, so an access request can come in through one IP and leave through another. This makes it a good candidate for leveraging more network ports and utilizing all available bandwidth, rather than being stuck with one controller, one IP and a few network ports. And a controller failure (or multiple failures, with some vendors) will not impact client access.
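A rough sketch of the idea, assuming a pool of virtual front-end IPs that are simply redistributed across whatever controllers survive; the addresses and node names are made up, and real products handle this in their own ways.

```python
# Illustrative only: clients mount virtual IPs, which are remapped when a
# controller fails, so the share path they use never changes.
virtual_ips = ["10.0.0.101", "10.0.0.102", "10.0.0.103", "10.0.0.104"]
controllers = ["node1", "node2", "node3"]

def assign(vips, nodes):
    """Spread virtual IPs across the surviving controllers round-robin."""
    return {vip: nodes[i % len(nodes)] for i, vip in enumerate(vips)}

print(assign(virtual_ips, controllers))            # all controllers healthy
print(assign(virtual_ips, ["node1", "node3"]))     # node2 fails; clients keep the same IPs
```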