[clug] [OT] midnight meandering: In an aggressive de-dupe storage environment, how long does it take to copy a file?

steve jenkin sjenkin at canb.auug.org.au
Wed Nov 7 15:24:59 MST 2012


The list helped me with "lrzip"/"lrztar" and its near-optimal
compression of files with large sections in common.

It occurs to me that this same technology can be, and is, applied to
deduplicating data in storage systems. Better if the system understands
files rather than just working with raw blocks, but doable either way.

On write, the dedupe layer identifies 'known' sequences and replaces
them with tags and pointers (updating a reference count?); on read, it
expands the sequences again. You get a version of CoW (Copy on Write),
or diffs as in a versioning system.

This must happen before the compression and encryption layers, all of
which rely on lossless media management underneath (e.g. RAID).

So, how long does it take to copy a file on an aggressive data de-dupe
system?

The best case is akin to hard-linking a file: very fast, because the
real work of writing changed data is deferred.

The next best case is akin to cloning an inode along with its chain of
used blocks. That scales with file size, but it's still 'really fast',
and again the cost of writing changed data is deferred.

All this assumes you put a whole lot of effort into making the
underlying storage lossless... Not easy; it takes more than RAID-5/6/7.

Questions:

 Does anyone here use, or work on, de-dupe data storage?

 And, apart from saving disk blocks, do you see throughput improvements?

cheers
steve

-- 
Steve Jenkin, Info Tech, Systems and Design Specialist.
0412 786 915 (+61 412 786 915)
PO Box 48, Kippax ACT 2615, AUSTRALIA

sjenkin at canb.auug.org.au http://members.tip.net.au/~sjenkin

