rsync, --sparse and VM disk images

Ryan Malayter malayter at
Fri Oct 9 07:32:02 MDT 2009

From: Bas Bahlmann || Steady IT Systeembeheer
> I am using rsync for my customers to have disaster recovery off-site
> with files from a VMware Server (under Linux). All works very well, but
> when I defragment the VMs (once a week) or Exchange defragments its
> datastore, the disk layout of course changes, sometimes a lot.

Defragmenting a virtual disk file is usually not a good idea with most
modern shared storage subsystems (the kind often used with VMware).
NetApp, LeftHand, EqualLogic, and most other arrays already
"virtualize" the block layout so they can do things like snapshots. So
defragmenting really doesn't help performance much and may actually
make things much worse. It also often breaks "thin provisioning" at
either the VMware or disk array layers, since new blocks are written
but the old ones are still allocated even though they are empty.
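
To see whether a disk image is still thin, compare its apparent size
with the space actually allocated for it. The following is a minimal
Python sketch, assuming a Linux/Unix host where st_blocks is reported
in 512-byte units; the helper name disk_usage and the temporary test
file are only illustrative.

```python
import os
import tempfile


def disk_usage(path):
    """Return (apparent_bytes, allocated_bytes) for a file.

    Assumes a Unix-like host where st_blocks counts 512-byte units.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


# Create a 1 MiB file that consists only of a hole (no data written).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(1 << 20)
    path = f.name

apparent, allocated = disk_usage(path)
print("apparent size :", apparent)
print("allocated size:", allocated)
os.remove(path)
```

On a filesystem that supports sparse files, the allocated size should
come out far smaller than the apparent size. After a defragment inside
the guest, the same check on a thin image will typically show the two
numbers much closer together.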

It also of course makes things tough on rsync, as massive amounts of
data change. While the file data blocks might still get matched if you
force a small block size, this increases CPU utilization for rsync
drastically. And the "index" structures of the filesystem will likely
be completely different and not be matched at all by rsync.
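
The block-size tradeoff can be illustrated with a toy model. This is
not rsync's actual algorithm (rsync slides a rolling checksum over
every byte offset); it is a simplified Python sketch that assumes the
guest moves data in whole 4 KiB clusters and, as a stand-in for a
defragment, simply reverses their order. All names here are
hypothetical.

```python
import hashlib
import random

CLUSTER = 4096  # assumed guest filesystem cluster size


def signatures(data, block_size):
    """Hash every aligned block of the old file, as the receiver would."""
    return {
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    }


def matched_fraction(old, new, block_size):
    """Fraction of the new file that can be rebuilt from old blocks."""
    sigs = signatures(old, block_size)
    matched = 0
    pos = 0
    while pos + block_size <= len(new):
        window = new[pos:pos + block_size]
        if hashlib.sha256(window).digest() in sigs:
            matched += block_size
            pos += block_size
        else:
            # Simplification: step one cluster at a time instead of
            # one byte at a time, since moved data stays cluster-aligned.
            pos += CLUSTER
    return matched / len(new)


rng = random.Random(0)
clusters = [rng.getrandbits(8 * CLUSTER).to_bytes(CLUSTER, "big")
            for _ in range(64)]
old = b"".join(clusters)
new = b"".join(reversed(clusters))  # same data, different layout

for block_size in (4096, 16384):
    print(block_size,
          "signatures:", len(signatures(old, block_size)),
          "matched:", matched_fraction(old, new, block_size))
```

With blocks no larger than a cluster, the relocated data is still
found, but the number of block signatures to compute and look up grows
in inverse proportion to the block size. With larger blocks, each
block spans several clusters whose neighbours have changed, so little
or nothing matches.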

This is very similar to the problem of rsyncing database backup files
(Exchange, SQL Server, MySQL, Oracle, whatever) that have had the
indexes rebuilt. There have been several threads on this recently.

> Sometimes, because of the defragmentation within the VM or Exchange, the
> disk layout changes so much that a split .vmdk file that was very small
> becomes filled with 2 GB of data. As a result, rsync has to transfer 2 GB
> of data for that .vmdk, which takes a lot of time. In my opinion that's
> not necessary, because the data is probably available in another split
> .vmdk, since it was only moved across the virtual disk.

Again, defragmenting VMs usually is not helpful for this very reason.
Once blocks get allocated, you can't get them back. Also, using the 2
GB split VMs has always caused me problems. VMFS and every other
modern filesystem has no trouble with very big files, so keep things
simple and just use single large VMDKs for each virtual disk.

> Is it possible to add an option to rsync that reads the .vmdk config
> file for the split disks, so it can search for known data across all
> the split .vmdk files within one virtual disk? If this is possible,
> it would improve the rsync process in a major way!

A VMware-specific enhancement to rsync is likely a non-starter.
