An idea: rsyncfs, an rsync-based real-time replicated filesystem

Lester Hightower hightowe-rsync-list at 10east.com
Wed Apr 13 03:57:37 GMT 2005


This is only my second email to the rsync mailing list.  My first was sent
under the title "Re: TODO hardlink performance optimizations" on Jan 3,
2004.  The response of the rsync developers to that email was remarkable
(in my opinion).  I felt that the rsync performance enhancements that
resulted from the ensuing discussion and code improvements were so
consequential that I dared not post again until I felt I had another topic
worthy of the time and consideration of this superb group.

This idea might be a little far-fetched, but if so, I guess I am just
ignorant enough in this area to be willing to pitch a dumb idea.
Sometimes ignorance is fruitful.  If the list consensus is that it's a bad
idea, or outside the goals of the rsync project, please feel free to just
let me know that.  I have thick skin and am fine with picking up my toys
and leaving the playground until I have some other idea of possible value
to contribute.   :)


So, here goes:


Continuity of service, high availability, disaster recovery, etc. are all
hot topics in today's world.  Rsync is an excellent tool in the service
continuity arsenal.  However, routine rsyncs of large volumes with high
file counts can cause service/performance issues on production servers
tasked with their normal load plus servicing rsyncs for mirror slaves.

Recently I have learned of, and my group has evaluated, some real-time
peer-to-peer filesystem replication technologies.  Two commercial products
in particular, PeerFS and Constant Replicator, have promising designs, but
neither is open source and both are from smaller companies without proven
track records.

http://www.radiantdata.com/
http://www.constantdata.com/products/cr.php

These are _not_ cluster filesystems, where "cluster" would imply a single
filesystem that N hosts can access simultaneously, but rather methods to
keep N copies of a filesystem synchronized (in real time) across N Linux
hosts. (**important distinction**)

To help articulate my idea for rsyncfs, I think it is important to describe
the two approaches taken by PeerFS and Constant Replicator and to use them
as a springboard.  Both embed themselves into the Linux kernel (as
modules), and I assume into/under the VFS, but I am not sure.

PeerFS is a peer-to-peer filesystem. It uses its own on-disk filesystem,
so one uses PeerFS _instead_ of another filesystem like ext3 or reiser.
Constant Replicator sits between the kernel and a normal filesystem, so
one uses Constant Replicator on top of a "normal" Linux filesystem like
ext3.  Here are some simple block diagrams to illustrate how I think each
is architected:

      Host A         Host B                 Host A         Host B

   +----------+   +----------+           +----------+   +----------+
   |  Block   |   |  Block   |           |  Block   |   |  Block   |
   |  Device  |   |  Device  |           |  Device  |   |  Device  |
   +----------+   +----------+           +----------+   +----------+
        ^^             ^^                     ^^             ^^
   +----------+   +----------+           +----------+   +----------+
   |  PeerFS  |<->|  PeerFS  |           |  fs/ext3 |   |  fs/ext3 |
   +----------+   +----------+           +----------+   +----------+
        ^^             ^^                     ^^             ^^
   +----------+   +----------+           +----------+   +----------+
   |  kernel  |   |  kernel  |           | Constant |-->| Constant |
   |   VFS    |   |   VFS    |           |Replicator|   |Replicator|
   +----------+   +----------+           +----------+   +----------+
                                              ^^             ^^
                                         +----------+   +----------+
                                         |  kernel  |   |  kernel  |
                                         |   VFS    |   |   VFS    |
                                         +----------+   +----------+

PeerFS is a many-to-many replication system where all "peers" in the
cluster are read/write.  Constant Replicator is a one-to-many system where
only one master is read/write, and every mirror is read-only.  Replication
communication between hosts in both systems is via TCP/IP.

I see benefits to both designs.  I personally don't need the many-to-many
features of PeerFS, and though we tested it and it seemed to work well, the
design scares me.  It just seems that too many issues would arise that
would be impossible to troubleshoot -- NFS file locking haunts me.  Even
with PeerFS, though, you can force a one-to-many replication setup, which
brings me to my next concern: PeerFS is a closed-source filesystem on Linux
with no track record.  As best I can tell it is _not_ journalled, and the
system ships with mkfs.peerfs and fsck.peerfs tools.

The Constant Replicator design appeals to me because it more closely
matches my needs and has fewer "scary black boxes".  However, the more I
have thought about what is really going on here, the more I am convinced
that my rsyncfs idea is doable.

Let me diagram an rsyncfs scenario that I have in my head.  Host A is the
master, Host B the slave in this example:

              Host A         Host B

           +----------+   +----------+
           |  Block   |   |  Block   |
           |  Device  |   |  Device  |
           +----------+   +----------+
                ^^             ^^
           +----------+   +----------+
           |  fs/ext3 |   |  fs/ext3 |
           +----------+   +----------+
                ^^             ^^
           +----------+   +----------+
           |VFS Change|   |  kernel  |
           |  Logger  |   |   VFS    |
           +----------+   +----------+
                ^^
           +----------+
           |  kernel  |
           |   VFS    |
           +----------+

I envision the "VFS Change Logger" as a (hopefully very thin) middleware
layer that sits between the kernel's VFS interfaces and a real filesystem,
like ext3, reiser, etc.  The "VFS Change Logger" will pass VFS calls
through to the underlying filesystem driver, but it will make note of
certain types of calls.  I have these in mind so far, but there are likely
others:

  open( ... , "w"), write(fd), unlink( ... ), mkdir( ... ), rmdir( ... )

The "VFS Change Logger" is then responsible for periodically reporting the
names of the paths (files and directories) that have changed to a named
pipe (a FIFO) that it places in the root of the mounted filesystem it is
managing.  So, when rsyncfs is managing a filesystem with an empty root
directory, one should expect to see a named pipe called "chglog.fifo", and
catting that named pipe will show a constant stream of pathnames that the
"VFS Change Logger" has noted changes to.
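
To make the consumer side of that interface concrete, here is a rough
user-land sketch in Python.  Everything in it is an assumption on my part:
the FIFO location (taken from the example further down), NUL-delimited
records (to line up with rsync's --from0), and pathnames reported relative
to the root of the managed filesystem.

  import time

  FIFO = "/vol/0/chglog.fifo"     # hypothetical location of the change-log FIFO

  def read_batches(window=2.0):
      """Yield de-duplicated batches of changed paths, roughly every `window` seconds."""
      paths, deadline = set(), time.time() + window
      while True:                                 # reopen if the logger ever closes its end
          with open(FIFO, "rb", buffering=0) as fifo:   # open() blocks until the logger opens the FIFO
              buf = b""
              while True:
                  chunk = fifo.read(4096)         # blocks until the logger writes something
                  if not chunk:                   # EOF: writer closed its end; reopen
                      break
                  buf += chunk
                  records = buf.split(b"\0")      # assume NUL-delimited pathnames
                  buf = records.pop()             # keep any partial trailing record
                  paths.update(r.decode() for r in records if r)
                  if time.time() >= deadline and paths:
                      yield sorted(paths)
                      paths, deadline = set(), time.time() + window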

The actual replication happens in user-land with rsync as the transport.
I think rsync will have to be tweaked a little to make this work, but
given all the features already in rsync I don't think this will be a big
deal.  I envision an rsync running on Host A like:

# rsync --constant --from0 --files-from-fifo=/vol/0/chglog.fifo ...

that will be communicating with an "rsync --constant ..." on the other
end.  The --constant flag is my way of stating that both rsyncs should
become daemons and plan to "constantly" exchange syncing information until
killed -- that is, this is a "constant" rsync, not just one run.
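
The --constant and --files-from-fifo switches do not exist today, of
course; they are part of what I am proposing.  Just to show that most of
the plumbing is already there, here is a rough sketch (same assumptions as
the previous sketch, plus a purely hypothetical slave destination) that
approximates the behavior with the existing --from0 and --files-from=-
options by running one ordinary rsync per batch of changed paths:

  import subprocess

  SRC, DEST = "/vol/0/", "slave:/vol/0/"    # hypothetical source and destination

  for batch in read_batches():              # read_batches() from the earlier sketch
      # Feed the changed paths (relative to SRC) to rsync on stdin,
      # NUL-separated to match --from0.
      subprocess.run(
          ["rsync", "-a", "--from0", "--files-from=-", SRC, DEST],
          input=b"\0".join(p.encode() for p in batch) + b"\0",
          check=False)                      # don't abort the loop on one failed run

A real --constant mode would presumably keep a single rsync connection open
instead of forking a fresh rsync for every batch, which is where the tweaks
to rsync itself would come in.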

I believe that this could be a highly efficient method of keeping slave
filesystems in sync, in very near real time, while leveraging all of the
good work already done and ongoing in rsync.

I also believe that two-way peering (more similar to PeerFS) might be
possible (I'm not as confident in this one yet).  Consider:

              Host A         Host B
                                              Legend:
           +----------+   +----------+        <M> - master
           |  Block   |   |  Block   |        <S> - slave
           |  Device  |   |  Device  |
           +----------+   +----------+        * Note, both boxes have
                ^^             ^^               a Master and a Slave
           +----------+   +----------+          rsync daemon running,
           |  fs/ext3 |   |  fs/ext3 |          each supporting one half
           +----------+   +----------+          of the full-duplex, or
                ^^             ^^               bidirectional, syncing.
           +----------+   +----------+
           |VFS Change|   |VFS Change|
           |  Logger  |   |  Logger  |
           +----------+   +----------+
                ^^             ^^
           +----------+   +----------+
           |  kernel  |   |  kernel  |
           |   VFS    |   |   VFS    |
           +----------+   +----------+

Now consider four rsync daemons, two on each host, establishing two-way
syncing (just add another pair in the other direction).  Without any
"help", the communication would go like this: M-rsync-A tells S-rsync-B to
update /tmp/foo, and it does, which modifies B's filesystem; M-rsync-B then
tells S-rsync-A to update /tmp/foo, but the two copies are determined to
match, so nothing is done and things stop right there.  In *theory* that
sounds doable, but I smell race conditions in there somewhere.
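
One form that "help" could take -- and this is purely hypothetical on my
part, not something either product is known to do -- is for each host's
FIFO consumer to suppress "echoes": ignore, for a short window, any path
that its own slave-side rsync just wrote, so that replicated changes are
not immediately re-announced to the peer.  A tiny sketch:

  import time

  ECHO_WINDOW = 5.0            # seconds to suppress re-announcing a path
  recently_replicated = {}     # path -> time the slave-side rsync wrote it

  def note_incoming(path):
      """Record that the slave-side rsync just updated `path` for us."""
      recently_replicated[path] = time.time()

  def filter_echoes(batch):
      """Drop paths from an outgoing batch that we only changed via replication."""
      now = time.time()
      return [p for p in batch
              if now - recently_replicated.get(p, 0) > ECHO_WINDOW]

That would not eliminate the race conditions, but it would at least keep
the two masters from chattering about changes they caused on each other.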

I don't feel capable of coding this myself or I would probably be sending
code snippets along with this email.  I am hoping that someone with much
more kernel expertise than I might read this email and comment on its
practicality, workability, difficulty, or maybe even be inspired to give
it a go.  I would appreciate feedback, positive or negative, from anyone
who has read this far.

I also apologize for such a lengthy email, but I wanted to share this idea
with the rsync community, and it just took lots of space to convey it...

Sincerely,

--
Lester Hightower
10East Corp.

