The growing volume of data stored and backed up in data centres is a major concern. Most backup jobs contain only a small percentage of genuinely new data, typically less than five percent; the rest duplicates data that has remained unmodified since the previous backup. Eliminating this duplicate data promises to reduce storage requirements and to improve data restore times considerably. This project presents a solution that eliminates such duplicate data using data de-duplication.
The de-duplication is performed inline and at block granularity. We use the Tux3 file system for the prototype implementation. Tux3 is a write-anywhere, atomic-commit, btree-based versioning file system being developed by Daniel Phillips. It aims to provide efficient snapshotting and replication, with its main intended use in Network Attached Storage. Tux3 is a recent file system that shows promise of eventual inclusion in the Linux kernel.
The design consists of a btree-based lookup layer on top of a bucket data structure. The Locality Based Bucket Layout and the Fingerprint Index together enable fast and efficient detection and elimination of duplicate data blocks. The design is integrated into the file system and requires no application-level intelligence.
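The core idea of inline, block-granularity de-duplication via a fingerprint index can be sketched as follows. This is an illustrative toy in Python, not the Tux3 implementation: the `DedupStore` class, its in-memory dict index, and the use of SHA-256 are assumptions made for the sketch, and the btree lookup layer and bucket layout of the actual design are abstracted away.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative block granularity


class DedupStore:
    """Toy inline block de-duplicator: a fingerprint index maps
    SHA-256 digests to block addresses, so each unique block is
    stored once and duplicates merely reference it."""

    def __init__(self):
        self.blocks = []             # stands in for on-disk block storage
        self.fingerprint_index = {}  # digest -> block address

    def write_block(self, data: bytes) -> int:
        """Store a block, returning its address; duplicates are detected
        inline (at write time) and resolved to the existing address."""
        digest = hashlib.sha256(data).digest()
        addr = self.fingerprint_index.get(digest)
        if addr is not None:
            return addr              # duplicate: reuse the existing block
        addr = len(self.blocks)
        self.blocks.append(data)
        self.fingerprint_index[digest] = addr
        return addr


store = DedupStore()
a = store.write_block(b"x" * BLOCK_SIZE)
b = store.write_block(b"x" * BLOCK_SIZE)  # duplicate of the first block
c = store.write_block(b"y" * BLOCK_SIZE)  # genuinely new block
print(a == b, len(store.blocks))          # the duplicate added no storage
```

In the real design, the dict would be replaced by the persistent btree-indexed Fingerprint Index, and the block list by buckets laid out for locality, but the write-path logic (fingerprint, look up, reuse or store) is the same.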