An idea: rsyncfs, an rsync-based real-time replicated filesystem

Tue Sep 26 14:55:00 GMT 2006

Recently, I have been investigating FUSE (instead of inotify) as an option
for implementing something like what I proposed to this list in April 2005.

Just yesterday, I submitted some patches to the mysqlfs-general mailing
list that improve mysqlfs a bit.  With a little more work (which I may or
may not do), mysqlfs (a FUSE filesystem) combined with mysql replication
can provide a real-time, one-to-many, over-the-network replicated
filesystem.  That can be done with mysqlfs today, particularly with my
patches in place, but limitations remain for now: there is no support for
sparse files, nor for open()/seek()/read() or open()/seek()/write()
operations.  At present, you have to read and write files contiguously,
from beginning to end (that is true with or without my patches).

My patches to mysqlfs-0.2 (that I posted yesterday) primarily address
performance and mysql mirror-ability.

Mysqlfs certainly has its place, and I am very glad to see that it exists,
but the reason that I may not finish my mysqlfs work is that SQL servers
are just not designed to store filesystems, and I am convinced that no
amount of hacking is ever going to make FUSE+mysqld give performance even
close to that of a real, on-disk file system.  Further, and mostly based
on my mysqlfs work, I am also now fully convinced that I can author a FUSE
filesystem that realizes my original proposition from 18 months ago.

My new concept is this:

1) Mount a filesystem, say /dev/vg0/vol0, on /mnt/.vol0_do_not_touch/

2) Mount the FUSE fs, like: /sbin/rsyncfs -odbhost=host -odbuser=user \
   -odbpasswd=passwd -odb=db -ofstarget=/mnt/.vol0_do_not_touch /mnt/vol0

3) Work only in /mnt/vol0.  rsyncfs performs every filesystem operation
   against the fstarget dir and also records each path that changes in
   the DB (building a queue).

4) Have a separate process that rsyncs to the target machine(s) only the
   files in the changed queue (and drains that queue), eliminating the
   need to find/transfer/evaluate large lists of unchanged files.
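
As a rough illustration of steps 3 and 4, here is a minimal Python sketch
of the changed-path queue.  It is not an implementation of rsyncfs: the
table name, column names, and function names are all hypothetical, SQLite
is used only because it ships with Python, and the FUSE layer itself and
the handling of deleted paths are left out.  The only rsync options used
are --files-from and --from0, which already exist.

```python
import sqlite3
import time

# Hypothetical schema: one row per changed path, however often it changes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS changed_paths (
    path        TEXT PRIMARY KEY,  -- path relative to the fstarget dir
    first_seen  REAL NOT NULL,     -- when the path entered the queue
    last_change REAL NOT NULL      -- most recent change seen
)
"""

def open_queue(db_file=":memory:"):
    """Open (and if needed create) the queue database."""
    conn = sqlite3.connect(db_file)
    conn.execute(SCHEMA)
    return conn

def record_change(conn, path, now=None):
    """What the FUSE layer would call after any operation that alters path."""
    now = time.time() if now is None else now
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE changed_paths SET last_change = ? WHERE path = ?",
            (now, path))
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO changed_paths (path, first_seen, last_change) "
                "VALUES (?, ?, ?)", (path, now, now))

def drain(conn):
    """Remove and return the queued paths, sorted by name.

    A row is deleted only if its last_change is unchanged since it was
    read, so a path modified again in the meantime stays queued.
    """
    rows = conn.execute(
        "SELECT path, last_change FROM changed_paths ORDER BY path").fetchall()
    with conn:
        conn.executemany(
            "DELETE FROM changed_paths WHERE path = ? AND last_change = ?",
            rows)
    return [path for path, _ in rows]

def write_list(paths, list_file):
    """Write paths NUL-separated, the format rsync --from0 expects."""
    with open(list_file, "wb") as fh:
        for path in paths:
            fh.write(path.encode() + b"\0")

def rsync_command(src_dir, dest, list_file):
    """Build (but do not run) an rsync call limited to the listed paths."""
    return ["rsync", "-a", "--from0", "--files-from=" + list_file,
            src_dir, dest]
```

The separate rsyncer process would call drain(), pass the result to
write_list(), and then run the command from rsync_command() with
/mnt/.vol0_do_not_touch/ as the source, once per target machine.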

My rsyncfs command line example above uses an external database, like
mysql or postgres, but the more I think about this the more I am leaning
towards using an embedded database, like SQLite, and insisting that the
rsyncing process also live on the master host and push changes out to the
slaves.  A networked database could allow one to reverse that model, but
the embedded database seems attractive to me.  Opinions?

This design can be stretched further, depending on your use case.  For
example, if you knew your FS was "mostly idle", the rsyncing process could
sync only those changed paths that have been idle for more than 60 secs.
That avoids rsyncing a seldom-changed file more than once while a user is
making a short burst of rapid changes to it.  The rsyncer process could
also warn when a file has been in the queue longer than a certain time and
is still hot (actively changing), or just sync it anyway.
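
The idle-time rule could look something like the following Python sketch.
The function name, the in-memory queue layout (path mapped to a pair of
timestamps), and the 600-second warning threshold are assumptions made for
illustration; only the 60-second idle window comes from the text above.

```python
import time

IDLE_SECS = 60    # a path must be quiet this long before it is synced
WARN_SECS = 600   # hypothetical: queued this long and still hot -> warn

def split_queue(queue, now=None, idle_secs=IDLE_SECS, warn_secs=WARN_SECS):
    """Split queued paths into (ready, overdue).

    queue maps path -> (first_seen, last_change) timestamps in seconds.
    ready   : paths idle for at least idle_secs, safe to rsync now.
    overdue : paths still changing that have been queued for at least
              warn_secs; the caller may warn about them or sync anyway.
    Paths that are hot but recently queued appear in neither list.
    """
    now = time.time() if now is None else now
    ready, overdue = [], []
    for path, (first_seen, last_change) in sorted(queue.items()):
        if now - last_change >= idle_secs:
            ready.append(path)
        elif now - first_seen >= warn_secs:
            overdue.append(path)
    return ready, overdue
```

Whether an overdue path triggers a warning or a forced sync is a policy
choice left to the caller.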

My reason for writing this email is to solicit comments from the list.  My
post from April of last year received quite a few responses, but not an
overwhelming number.  We have started to upgrade servers to rsync 2.6.8,
and are finding good performance improvements (reduced load) from it, and
that fact lessens (but does not eliminate) our need for rsyncfs.

Steve Bonds (CCed) seemed to be most in need of something like the rsyncfs
that I proposed last year, and so I specifically CCed him to see if that
interest still exists.  There is also the chance that someone else has
solved this problem in a way that I do not know about.  That is another
reason for this email -- I don't want to spend time coding rsyncfs if a
solution that will work for us already exists.

Thanks in advance for any feedback.

--
Lester Hightower

> On 4/12/05, Lester Hightower wrote:
>
> > [...snip...]
> > The actual replication happens in user-land with rsync as the transport.
> > I think rsync will have to be tweaked a little to make this work, but
> > given all the features already in rsync I don't think this will be a big
> > deal.  I envision an rsync running on Host A like:
> >
> > # rsync --constant --from0 --files-from-fifo=/vol/0/chglog.fifo ...
> >
> > that will be communicating with an "rsync --constant ..." on the other
> > end.  The --constant flag is my way of stating that both rsyncs should
> > become daemons and plan to "constantly" exchange syncing information until
> > killed -- that is, this is a "constant" rsync, not just one run.
> > [...snip...]

On Wed, 13 Apr 2005, Steve Bonds wrote:

> Lester:
>
> Something like this is very high on my list of products I wish I had.
> I frequently use rsync to replicate data on a near real-time basis.
> My biggest pain point here is replicating filesystems with many
> (millions) of small files.  The time rsync spends traversing these
> directories is immense.
>
> There have been discussions in the past of making an rsync that would
> replicate the contents of a raw device directly, saving the time spent
> checking each small file:
>
> http://lists.samba.org/archive/rsync/2002-August/003545.html
> http://lists.samba.org/archive/rsync/2003-October/007466.html
>
> It seems that the consensus from the list at those times is that rsync
> is not the best utility for this since it's designed to transfer many
> files rather than just one really big "file" (the contents of the
> device.)
>
> Despite the fact that the above discussions are almost 18 months ago,
> I have seen no sign of the rsync-a-device utility.  If it exists, this
> might be the solution to what you propose-- and it would work on more
> than Linux.
>
> To achieve your goal with this proposed utility you would simply do
> something like this:
>
> + for each device
> ++ make a snapshot if your LVM supports it
> ++ transfer the diffs to the remote device
> + go back and do it all again
>
> If the appropriate permissions were in place this could be done
> entirely in user-mode, which is a great advantage for portability.  As
> you touched on in your original message, knowing what's changed since
> the last run would be very helpful in reducing the amount of data that
> needs to be read on the source side.  In my experience, sequential
> reads like this, even on large devices, don't take a huge amount of
> time compared with accessing large numbers of files.  If there were
> only a few files on a mostly-empty volume the performance difference
> would be more substantial.  ;-)
>
> Another thought to eliminate the kernel dependency is to combine the
> inode-walk done by the "dump" utility with the rsync algorithm to
> reduce the file data transferred.  The inode walk would be
> filesystem-specific, but could be done in user space using existing
> interfaces.
>
>   -- Steve