Data deduplication (often called “intelligent compression” or “single-instance storage”) is a method of reducing storage needs by eliminating redundant data. Only one unique instance of the data is retained on storage media such as disk or tape; each redundant copy is replaced with a pointer to that unique instance, and an index of all the data is kept so that any of it can be retrieved when required. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB.
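The email example can be sketched in a few lines of Python. This is only an illustration of the single-instance idea, not any vendor's implementation: the class name `SingleInstanceStore`, the mailbox paths, and the choice of SHA-256 as the content hash are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level store: one copy per unique content, plus references."""

    def __init__(self):
        self.blobs = {}       # content hash -> bytes (stored once)
        self.references = {}  # logical file name -> content hash (the "pointer")

    def put(self, name, content):
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # store only if not seen before
        self.references[name] = digest

    def get(self, name):
        return self.blobs[self.references[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


MB = 1024 * 1024
attachment = b"x" * MB  # stands in for a 1 MB attachment

store = SingleInstanceStore()
for i in range(100):
    # Hypothetical mailbox paths; every mailbox holds the same attachment.
    store.put(f"mailbox-{i}/report.pdf", attachment)

print(store.stored_bytes() // MB, "MB stored for", len(store.references), "references")
```

All 100 names remain retrievable through `get`, but the bytes are held only once.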
In mailing-list processing, dedupe (also called dupe combine) is a process that uses matching logic to eliminate duplicate records (dupes) from a file. Dedupe programs come in different strengths, depending on the objectives of the file user. For example, if the product being sold by the file user is inappropriate for apartment dwellers, then households with the same street address but different apartment numbers are treated as dupes and eliminated from the list. If several rented lists are being deduped during a merge/purge, a priority statement must be built into the matching logic to indicate which list the dupes should be removed from. Payment is made to the list owners for the names remaining after the dedupe process, so the fewer dupes removed from a list, the more its owner is paid. Random prioritization protects list owners from being disproportionately penalized by removing dupes from the lists on a random basis. For example, if List A and List B have eight records in common, four of the duplicates are removed from List A and four from List B, reducing their rental revenue equally.
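Here is a rough sketch of a merge/purge with random prioritization. Everything in it is an assumption made for illustration: the matching rule (same name and address), the record layout, and the way removals are spread (shuffle the dupes, then rotate which list keeps each one, so that an even number of dupes splits evenly as in the List A / List B example). It also assumes each list has no duplicates within itself.

```python
import random
from collections import defaultdict


def match_key(record):
    # Hypothetical matching logic: same name and street address means a dupe.
    return (record["name"].strip().lower(), record["address"].strip().lower())


def merge_purge(lists, seed=None):
    """Keep one copy of each duplicated record, spreading removals across lists."""
    rng = random.Random(seed)

    owners = defaultdict(list)  # match key -> names of the lists that contain it
    for list_name, records in lists.items():
        for record in records:
            owners[match_key(record)].append(list_name)

    dupe_keys = [key for key, names in owners.items() if len(names) > 1]
    rng.shuffle(dupe_keys)  # random order, so no list is systematically favoured

    keeper = {}  # match key -> the one list allowed to keep that record
    for i, key in enumerate(dupe_keys):
        names = owners[key]
        keeper[key] = names[i % len(names)]

    return {
        list_name: [r for r in records
                    if keeper.get(match_key(r), list_name) == list_name]
        for list_name, records in lists.items()
    }


shared = [{"name": f"Person {i}", "address": f"{i} Main St"} for i in range(8)]
list_a = shared + [{"name": f"Only A {i}", "address": f"{i} Oak Ave"} for i in range(2)]
list_b = shared + [{"name": f"Only B {i}", "address": f"{i} Elm Rd"} for i in range(2)]

purged = merge_purge({"A": list_a, "B": list_b}, seed=1)
print(len(purged["A"]), len(purged["B"]))
```

With eight shared records, each list loses four names, whichever seed is used; only the choice of which four changes.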
Data deduplication offers other benefits. Lower storage space requirements save money on disk expenditures. The more efficient use of disk space also allows for longer disk retention periods, which provides better recovery time objectives (RTO) for a longer time and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication, and disaster recovery.
Data deduplication can generally operate at the file, block, and even the bit level. File deduplication eliminates duplicate files (as in the example above), but this is not a very efficient means of deduplication. Block and bit deduplication look within a file and save unique iterations of each block or bit. Each chunk of data is processed using a hash algorithm such as MD5 or SHA-1. This generates an identifying number for each piece that is, for practical purposes, unique, and that number is stored in an index. If a file is updated, only the changed data is saved. That is, if only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved; the changes don’t constitute an entirely new file. This behaviour makes block and bit deduplication far more efficient. However, block and bit deduplication take more processing power and use a much larger index to track the individual pieces.
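To make the block-level case concrete, here is a minimal sketch. The 4096-byte block size, the `BlockStore` class, and the "recipe" (the ordered list of block hashes that rebuilds a file) are assumptions for illustration, and SHA-256 is used as a stand-in for the MD5 or SHA-1 mentioned above.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


class BlockStore:
    def __init__(self):
        self.index = {}  # block hash -> block bytes

    def write(self, data):
        """Store data; return (recipe, number of blocks that were actually new)."""
        recipe, new_blocks = [], 0
        for block in split_blocks(data):
            digest = hashlib.sha256(block).digest()
            if digest not in self.index:
                self.index[digest] = block
                new_blocks += 1
            recipe.append(digest)
        return recipe, new_blocks

    def read(self, recipe):
        return b"".join(self.index[digest] for digest in recipe)


store = BlockStore()

# Ten distinct blocks, then the same file with four bytes overwritten in block 3.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
edited = bytearray(original)
edited[3 * BLOCK_SIZE + 100:3 * BLOCK_SIZE + 104] = b"EDIT"
edited = bytes(edited)

recipe_v1, new_v1 = store.write(original)
recipe_v2, new_v2 = store.write(edited)
print("new blocks for first version:", new_v1)
print("new blocks for edited version:", new_v2)
```

The edited version adds a single block to the store, yet both versions can still be rebuilt in full from their recipes. The cost is the index itself, which grows with every unique block.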
Hash collisions are a potential problem with deduplication. When a piece of data receives a hash number, that number is compared with the index of existing hash numbers. If the hash number is already in the index, the piece of data is considered a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data. When such a hash collision occurs, the system won’t store the new data because it sees that its hash number already exists in the index. This is called a false positive, and it can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors also examine metadata to identify data and prevent collisions.
In actual practice, data deduplication is often used in conjunction with other forms of data reduction such as conventional compression and delta differencing. Taken together, these three techniques can be very effective at optimizing the use of storage space.
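One further safeguard, not among the vendor approaches listed above, is to compare the actual bytes whenever a hash matches. The sketch below shows that idea under stated assumptions: `weak_hash` is a deliberately poor stand-in hash chosen so that a collision is easy to trigger, and `VerifyingStore` is a hypothetical name.

```python
def weak_hash(block):
    # Deliberately weak stand-in hash so that collisions are easy to trigger.
    return sum(block) % 256


class VerifyingStore:
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.buckets = {}  # hash -> list of distinct blocks sharing that hash

    def put(self, block):
        digest = self.hash_fn(block)
        bucket = self.buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == block:  # byte-for-byte check confirms a true duplicate
                return digest, slot
        bucket.append(block)       # same hash, different bytes: keep both
        return digest, len(bucket) - 1

    def get(self, ref):
        digest, slot = ref
        return self.buckets[digest][slot]


store = VerifyingStore(weak_hash)
first = b"ab"
second = b"ba"  # different data, same weak hash as b"ab"

ref_first = store.put(first)
ref_second = store.put(second)
ref_again = store.put(first)  # a genuine duplicate

print(weak_hash(first) == weak_hash(second), ref_first != ref_second)
```

A store that trusted the hash alone would have discarded `second` as a duplicate of `first`. The byte comparison costs an extra read on every match, which is the trade-off for ruling out a false positive.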

Server Backup
Data is very rarely common between production servers of different types. It’s not difficult to imagine that an Exchange email server will not have the same content as an Oracle database server. But data is largely duplicated within groups of similar servers: file servers, Exchange servers, or, say, a bunch of ERP servers (development and test). This duplication creates potential bottlenecks in the bandwidth and storage used for backup.

Existing players have offered two solutions to this problem:
1. Traditional single-instancing at the backup server to filter out common content, e.g. Microsoft Single Instance Storage (in Data centre edition). This saves only storage cost, and how much depends on the level at which commonalities are filtered: file, block, or byte. A big player in this space is Data Domain. These solutions don’t have a client component; they just save storage space.
2. Newer solutions like Avamar (now with EMC) and PureDisk (now with Veritas), which try to filter content at the backup server level before the data goes to the (remote) store. This makes them much better suited to remote-office backups. They save both bandwidth and storage.
But there are three unsolved problems with these approaches (which also explains the poor response to these products in the market):
1. Most of the time, simple block checksum matching fails to find common data because the data does not fall on block boundaries. E.g. if you insert a single byte into a file, all the following blocks shift, none of their checksums match any more, and the block checksum approach fails.
2. Checksum calculation is very costly and makes backups CPU-intensive.
3. These approaches target storage cost, not time and bandwidth, which are more critical.
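The first problem in the list above is easy to reproduce. The sketch below is only a demonstration of the failure, using assumed values (64-byte blocks, SHA-256 checksums, and deterministic pseudo-random test data); it is not how any of the products mentioned work internally.

```python
import hashlib


def block_hashes(data, size=64):
    """Checksums of fixed-size blocks cut at offsets 0, size, 2*size, ..."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}


# 4096 bytes of deterministic pseudo-random data, i.e. 64 blocks of 64 bytes.
original = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(128))

appended = original + b"!"  # one byte added at the end
inserted = b"!" + original  # one byte inserted at the start

base = block_hashes(original)
shared_after_append = len(base & block_hashes(appended))
shared_after_insert = len(base & block_hashes(inserted))

print("blocks in original:", len(base))
print("still matching after append:", shared_after_append)
print("still matching after insert:", shared_after_insert)
```

Appending leaves every original block intact, but a one-byte insert at the front shifts every boundary, so no block checksum matches even though almost all of the data is unchanged.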

PC Backups
The problem is much more complex at the PC level, because duplicated data is distributed among users and is as high as 90% in some cases. Emails, documents, and similar file formats create a large pool of duplicate data between users. Also, since 50% of a PC backup is mainly large email files, this problem is particularly difficult to solve using the simple file-based de-duplication techniques used by servers. Druvaa inSync v2.0 uses an on-wire (distributed) de-duplication technique which senses duplicate data before the backup starts and hence skips it from the backup. This is transparent to the user; all the user notices is a tenfold boost in backup speed, with over 90% reduction in bandwidth and storage usage.

How it works
This technology creates and maintains a Global “Single Instance” File System at the backup server. Each time a user wants to back up a file, the inSync client prepares a file fingerprint (using a linear polynomial-based hash) and checks it against the server. After the server sends a response, the backup happens only for the “unique” data within the file. The (patent pending) advanced file-fingerprinting makes it computationally very easy to filter common content such as the same paragraphs in different documents, the same CCed email, or media-rich corporate presentations. This cuts down backup time by 10 times and reduces bandwidth and storage utilization by 90%.
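The exchange described above (fingerprint on the client, ask the server, send only what is missing) can be sketched roughly as follows. This is not the inSync implementation, whose fingerprinting is not detailed here: the fixed 4096-byte chunks, the SHA-256 fingerprints (used in place of the polynomial hash), and the names `BackupServer` and `backup` are all assumptions for illustration.

```python
import hashlib

CHUNK = 4096  # assumed chunk size for this sketch


def fingerprint_chunks(data):
    """Return (fingerprint, chunk) pairs for fixed-size chunks of data."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return [(hashlib.sha256(chunk).hexdigest(), chunk) for chunk in chunks]


class BackupServer:
    """Hypothetical server holding one global pool of chunks for all users."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes
        self.files = {}   # file name -> ordered list of fingerprints

    def missing(self, fingerprints):
        return {fp for fp in fingerprints if fp not in self.chunks}

    def commit(self, name, recipe, uploads):
        self.chunks.update(uploads)
        self.files[name] = recipe

    def restore(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])


def backup(server, name, data):
    """Client side: send fingerprints first, then only the chunks the server lacks."""
    pairs = fingerprint_chunks(data)
    recipe = [fp for fp, _ in pairs]
    needed = server.missing(recipe)
    uploads = {fp: chunk for fp, chunk in pairs if fp in needed}
    server.commit(name, recipe, uploads)
    return sum(len(chunk) for chunk in uploads.values())  # bytes sent on the wire


server = BackupServer()
deck = b"A" * CHUNK + b"B" * CHUNK + b"C" * CHUNK  # stands in for a shared presentation
deck_with_notes = deck + b"D" * CHUNK              # second user's copy plus one extra chunk

sent_first = backup(server, "alice/deck.ppt", deck)
sent_second = backup(server, "bob/deck.ppt", deck_with_notes)
print(sent_first, sent_second)
```

The second user uploads only the one chunk the server has not seen, which is where the bandwidth saving comes from. Fixed-size chunks keep the sketch short, but they suffer from the block-boundary problem described earlier.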

Other Interesting Features
Another good use of the Global Single Instance File System is Continuous Data Protection. After starting a restore, the user can see how the files changed over time. This gives the option to restore point-in-time data from any point in the past. The marketing name for the feature is “Eternity. Never lose a file. Ever.” A long name, but it gets the meaning across.
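A single-instance store lends itself to point-in-time restore because every version is just another reference to content that is already stored. The sketch below illustrates that idea only; the `VersionedStore` class, the integer timestamps, and the assumption that versions are saved in time order are all made up for the example.

```python
import hashlib
from bisect import bisect_right


class VersionedStore:
    def __init__(self):
        self.blobs = {}    # content hash -> bytes, shared across all versions
        self.history = {}  # file name -> list of (timestamp, content hash)

    def save(self, name, timestamp, content):
        # Assumes versions of a file are saved in increasing timestamp order.
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.history.setdefault(name, []).append((timestamp, digest))

    def restore(self, name, at):
        """Return the file as it was at time `at` (latest version not after it)."""
        versions = self.history[name]
        times = [t for t, _ in versions]
        pos = bisect_right(times, at)
        if pos == 0:
            raise KeyError(f"no version of {name} at or before {at}")
        return self.blobs[versions[pos - 1][1]]


store = VersionedStore()
store.save("plan.doc", 10, b"draft")
store.save("plan.doc", 20, b"final")
store.save("plan.doc", 30, b"draft")  # reverted: no new content is stored

print(store.restore("plan.doc", 25), len(store.blobs))
```

Three versions are recorded, but only two distinct contents are kept, so going back to an earlier state costs no extra storage.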

Business Opportunities
The same technology/product can be stripped down to back up PDAs and scaled up to back up servers. A good use case would be to reduce the backup time for a group of related remote servers.
