superlifter design notes (was Re: Latest rZync release: 0.06)

jw schultz jw at
Sun Jul 21 01:36:01 EST 2002

On Thu, Jul 11, 2002 at 07:06:29PM +1000, Martin Pool wrote:
> I've put a cleaned-up version of my design notes up here
> It's very early days, but (gentle :-) feedback would be welcome.  It
> has some comments on Wayne's rzync design, which on the whole looks
> pretty clever.
> I don't have any worthwhile code specifically towards this yet, but I
> have been experimenting with the protocol ideas in distcc
> I like the way it has worked out there: the protocol is simple and
> easy to understand, the bugs more or less "found themselves", and it
> feels like I'm using TCP in a natural way -- all of these much more so
> than rsync at the moment.  (Of course, the rsync problem is much more
> complicated.)

I too have been thinking about a redesign.  In fact I was
close to writing something recently, but rsync did the job so
I lost the motivation.  Your document, while useful, seems to
imply an actual design but doesn't outline it.  The focus of
this document is to describe an overall algorithm and
discuss its implications.  I have no code yet.  If that is
your sticking point, move on.  I'd rather we discuss this
for a while and look at different approaches than dive in
without a clear design.

Please don't flame, but if you are polite you can point out
where I'm a total idiot.  I'd rather put up a bad idea and
have it shot down than fail to suggest a good one. :)


From what I can see rsync is very clever.  The biggest
problems I see are its inability to scale to large trees,
a little accumulated cruft and featuritis, and
excessively tight integration.

By loosening the integration I believe we can deal with the
scalability problem (through parallelization) and create an
inherent flexibility where parts of the program could be
replaced for special purposes with no impact on the rest.
Tight integration has helped rsync to be as fast as it is,
but increasing CPU speeds are making the network our primary
bottleneck, and a monolithic implementation doesn't help there.

What I am seeing is a multi-stage pipeline.  Instead of one
side driving the other with command and response codes, each
side (client/server) would set up a pipeline containing
those components that are needed, with the appropriate
plumbing.  Each stage would largely look like a simple
utility: reading from input, doing one thing, and writing to
output, error and log.  The output of each stage is sent to
the next uni-directionally, with no handshake required.

Each stage could be implemented as a library function.
Other utilities could be built with or linked to those
functions.  Rsync-TNG would be a framework containing the
library functions that would set up the necessary plumbing
between stages and let them run.  Each stage could well be a
separate thread or even process but need not be.
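A minimal sketch of the stage-as-library-function idea.  All names
here are hypothetical, and temporary files stand in for the pipes a
real framework would use, so this version runs the stages
sequentially in one process:

```c
/* Sketch: stages share one signature and are chained by a tiny
 * framework.  Names and the toy stage are illustrative only. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef void (*stage_fn)(FILE *in, FILE *out);

/* Toy stage: read input, do one thing (upcase), write output. */
static void upcase_stage(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF)
        fputc((c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c, out);
}

/* The framework: plumb each stage's output into the next. */
static void run_pipeline(stage_fn *stages, int n, FILE *in, FILE *out)
{
    FILE *cur = in;
    for (int i = 0; i < n; i++) {
        FILE *next = (i == n - 1) ? out : tmpfile();
        stages[i](cur, next);
        if (cur != in)
            fclose(cur);
        if (next != out)
            rewind(next);
        cur = next;
    }
}

/* Helper so the pipeline can be exercised on a plain string. */
static void run_on_string(stage_fn *stages, int n,
                          const char *src, char *dst, size_t cap)
{
    FILE *in = tmpfile(), *out = tmpfile();
    fputs(src, in);
    rewind(in);
    run_pipeline(stages, n, in, out);
    rewind(out);
    size_t got = fread(dst, 1, cap - 1, out);
    dst[got] = '\0';
    fclose(in);
    fclose(out);
}
```

A real framework could instead run each stage as a thread or
process with actual pipes, as described above; only run_pipeline
would change.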

Partial pipelines could be built, allowing stream capture and
filtering before or after any stage.  Wherever possible the
pipeline would be transfer-direction agnostic, allowing the
directives in the stream to be inverted.

		--- The Stages ---

1	scan

	scan directory tree
	generate stat info

	checksum optional

	For comparisons, mtime represents the data fork and
	etime represents all other forks.  If the system
	doesn't support etime for other forks, etime is
	max(etime, mtime).

	The variable-length ACL is considered part of the
	stat info; we might only send a checksum of the ACL
	at this point.

	It might be desirable to generate checksums of forks.

	This stage could accept a file list as input,
	and may be optimized to utilize cached data.

2	compare scan

	Scan the directory tree and compare with the stat info.
	Based only on stat info and optional checksums,
	identify file actions according to whether
		stat changed,
		data changed,
		forks changed,
		the file only exists at one location.

	Unchanged files may be removed from the list.

	This stage may be optimized to utilize cached data.
	Must be run on a different node than stage 1.
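In skeleton form, the compare-scan decision might reduce to
something like this (the field names and the precedence of the
checks are my assumptions, not a spec):

```c
/* Hypothetical stage-2 classifier, using only the stat fields
 * carried in the stream. */
#include <assert.h>
#include <stdint.h>

enum action { ACT_NONE, ACT_STAT, ACT_DATA, ACT_FORKS };

struct statinfo {
    uint32_t mode, uid, gid;     /* type+permissions, owner, group */
    uint64_t size, ea_size;      /* main-fork and other-fork sizes */
    uint32_t mtime, etime;       /* data-fork and other-fork mtimes */
};

static enum action classify(const struct statinfo *a,
                            const struct statinfo *b)
{
    if (a->size != b->size || a->mtime != b->mtime)
        return ACT_DATA;                /* data fork changed */
    if (a->ea_size != b->ea_size || a->etime != b->etime)
        return ACT_FORKS;               /* other forks changed */
    if (a->mode != b->mode || a->uid != b->uid || a->gid != b->gid)
        return ACT_STAT;                /* stat-only change */
    return ACT_NONE;                    /* unchanged: may drop from list */
}
```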

3	block checksum

	Add block checksums of data forks for files whose
	data fork changed.

	Add block checksums of other forks for files where a
	fork change was identified by etime or checksum.

	This stage may be optimized to utilize cached data.
	May be run on the same node as stage 2.

4	rolling checksum comparison

	The block checksums from stage 3 are compared with
	rolling checksums of the local files.
	For each changed file, generate a list of
	fork+offset+offset+length tuples.

	This stage is computationally intensive, so some
	control should exist over whether it is run on the
	client or the server.

	Must be run on a different node than stage 3.
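For illustration, the weak rolling checksum that makes stage 4
cheap to slide byte-by-byte can be sketched as below (the names
are mine; rsync's real design also verifies a strong per-block
checksum on every weak match):

```c
/* Weak rolling checksum: s1 is the byte sum, s2 the sum of the
 * running s1 values, both mod 2^16. */
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { uint16_t s1, s2; } rolsum;

/* Checksum of one block, computed from scratch. */
static rolsum rolsum_block(const unsigned char *p, size_t len)
{
    rolsum r = {0, 0};
    for (size_t i = 0; i < len; i++) {
        r.s1 += p[i];
        r.s2 += r.s1;
    }
    return r;
}

/* Slide a len-byte window right by one: drop `gone`, admit
 * `come`.  O(1), which is what lets every offset be scanned. */
static rolsum rolsum_roll(rolsum r, size_t len,
                          unsigned char gone, unsigned char come)
{
    r.s1 = (uint16_t)(r.s1 - gone + come);
    r.s2 = (uint16_t)(r.s2 - len * gone + r.s1);
    return r;
}

static uint32_t rolsum_digest(rolsum r)
{
    return ((uint32_t)r.s2 << 16) | r.s1;
}
```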

5	transmit

	For each changed file, using the
	fork+src_offset+src_length+dest_offset+dest_length
	tuples, insert the data changes into the stream.

	This stage must be run on the sender.
	May be run on the same node as stage 4.

6	update

	Apply file changes, inode changes, deletions,
	creations and links.

	This is the only stage that should modify anything.

	This stage must be run on the receiver.
	Must be run on a different node than stage 5.

		--- Notes ---


	Timestamps should be represented as seconds from
	the Epoch (SuS) as an unsigned 32-bit int.  It will
	be more than 90 years before we exceed this, by which
	time the protocol will be extended to use a uint64 of
	milliseconds.

	Each system will identify a timestamp mask
	representing the precision of the underlying
	filesystem.  This mask will be transmitted as
	part of the datastream and retransmitted whenever
	a context change (directory) alters the mask.

	Time comparisons will use
		mask = mask1 & mask2
		(time1 & mask) == (time2 & mask)
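A sketch of that comparison in C; the example mask ~1 in the
tests, modelling a filesystem with two-second granularity, is
only illustrative:

```c
/* Masked timestamp comparison: times compare equal if they
 * agree under the intersection of both sides' precisions. */
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

static bool times_equal(uint32_t t1, uint32_t mask1,
                        uint32_t t2, uint32_t mask2)
{
    uint32_t mask = mask1 & mask2;   /* intersect both precisions */
    return (t1 & mask) == (t2 & mask);
}
```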

Stat info

	The stat info would include (not limited to)
		mode		file type and permissions mask
		uid		numeric user ID
		gid		numeric group ID
		size		size of main fork
		EA_size		size of other forks (if any)
		mtime		time of last modification of main fork
		etime		time of last modification of other forks
		ACL_size	size of ACLs
		ACL		ACL and/or checksum of ACL
						(mutually exclusive?)

User/group names

	I think that by default users and groups should be
	handled only by numeric ID.

	When textual names are used, a special chunk in the
	datastream would specify a "node+ID -> name"
	equivalency immediately before the first use of that ID.

Hard Links

	In the tree scan stage, the first time we hit a given
	inode with st_nlink > 1 we add it to a hardlink list
	and decrement its st_nlink.  Each time we find another
	path that references the inode, we indicate in the
	datastream that it is a link and decrement the
	st_nlink of the entry in our list.  When the entry's
	st_nlink reaches 0 we remove it from the list.
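The bookkeeping above can be sketched like this (the fixed-size
table and plain integer dev/ino fields are simplifications for
illustration):

```c
/* Hardlink list: remember an inode the first time st_nlink > 1,
 * flag later paths as links, drop the entry once all paths of
 * the inode have been seen. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 64

struct hl_entry { unsigned long dev, ino; unsigned nlink; };

static struct hl_entry hl_table[MAX_LINKS];
static size_t hl_count;

/* Note one path to an inode.  Returns true if this inode was
 * already seen, i.e. the path should be sent as a link. */
static bool hl_note(unsigned long dev, unsigned long ino, unsigned nlink)
{
    if (nlink <= 1)
        return false;                     /* not hardlinked: ignore */
    for (size_t i = 0; i < hl_count; i++) {
        if (hl_table[i].dev == dev && hl_table[i].ino == ino) {
            if (--hl_table[i].nlink == 0) /* all paths seen: retire */
                hl_table[i] = hl_table[--hl_count];
            return true;
        }
    }
    if (hl_count == MAX_LINKS)
        return false;                     /* table full: sketch punts */
    /* first sighting: remember it, nlink-1 paths still to come */
    hl_table[hl_count].dev = dev;
    hl_table[hl_count].ino = ino;
    hl_table[hl_count].nlink = nlink - 1;
    hl_count++;
    return false;
}
```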

	Most hardlinks are within the same directory so the
	hardlink list will remain fairly small.  In
	situations where one side is hardlinked to another
	location (cp -al or rsync --link-dest) the most
	efficient thing would be to run the tree scan on the
	other node.  This is one more reason for the ability
	to override the default pipeline construction.

File Forks/Extended Attributes (EA)

	Depending on the platform, files may have one or more
	pieces of additional data associated with them.
	Sometimes called "forks", they are generalized on
	Linux as "extended attributes".

	A prime example of a fork is the "resource fork" on
	HFS for the Macintosh.  On Linux, ACLs and capability
	flags are being implemented through EAs, and Linus
	has suggested the possibility of storing icon and
	thumbnail images, as well as arbitrary metadata,
	there too.

	Where ACLs are stored in a fork on the filesystem,
	that fork would not be handled separately from the
	stat info.

Access Control Lists (ACL)

	ACLs should be considered as a variable length
	portion of the stat info.

	ACLs have the same name/ID issues as file ownership,
	so they need to be dealt with in the same way.  I
	think numeric IDs should be the default, as the most
	common case as well as the most efficient, but I
	don't care that much as long as I can use numeric IDs.

		--- Download IN ACTION ---

             client                      server
          compare scan - 2 <----------- 1 - tree scan
        block checksum - 3 -----------> 4 - rolling checksum
                update - 6 <----------- 5 - transmit

This involves 3 network transfers.  Swapping nodes for
stages 1-4 reduces the computational load on the server at
the cost of one more network transmission.

             client                      server

             tree scan - 1 <----------- 2 - compare scan
      rolling checksum - 4 <----------- 3 - block checksum
                update - 6 <----------- 5 - transmit

		--- Pro and Con ---


Con:

	Involves a good deal of inter-stage traffic that
	_could_ add to network load.

Pro:

	Inter-stage traffic can be done in large blocks,
	minimizing stalls and freeing us from most network
	latency issues.

	Each stage can be implemented as a thread or process
	in the pipeline and run in parallel with the others.
	By making the pipeline format source-dest agnostic
	all stages except the last 2 can be inverted to run
	on either node.

	Using a single well defined format for all
	inter-stage pipelines we can capture, filter or
	externally generate that pipeline.

	By running multiple stages in parallel and
	processing a tree directory-by-directory (depth
	first, with a file or directory preference) we will
	achieve the temporal locality needed to take
	advantage of inode and even file caches in the host OS.

		--- Optimizations and future features ---

File Names:
	Use filenames relative to the current directory.
	Directory changes could also be relative to current
	directory. The current directory would be relative
	to the path given on the command line.  In this way
	no directory name need be given twice.

	The only filenames that might need to be retained
	past immediate use would be directories, hardlinks
	and those being relocated to later in the stream for
	some reason.

Network traffic:

	The basic design involves sending a good deal of
	information about each file between successive
	stages.  Much of this information will have already
	been sent.  One approach to this would be to use
	back-channels between non-adjacent stages on each
	node:
          compare scan - 2 <----------- 1 - tree scan
                         |                      |
                         |                      | backchannel
                         V                      V
        block checksum - 3 -----------> 4 - rolling checksum
                |                       |
    backchannel |                       |
                V                       V
                update - 6 <----------- 5 - transmit

	In this way we would only have to transmit the
	changes to the datastream over the network.
	Chunks with no changes would not even be sent via
	the network.
	One approach I have to this would be to designate a
	bit for each part of the file data, e.g.:
		0x01	ACL checksum
		0x02	ACL data
		0x04	filename & stat
		0x08	action
		0x10	file & fork checksums
		0x20	file & fork block checksums
		0x40	file & fork changelist
		0x80	file & fork change data

	Each chunk transmitted would have an ID.  Positive
	ChunkIDs would be created by the client and negative IDs
	by the server.   A chunk header might look something
	like this (I am not wedded to the field sizes):
		struct chunk {
			uint32	ID;
			uint16	command_code;
			uint16	fieldmask;
			uint64	length;
			char	data[];
		};

	Each stage (other than the first) would merge the
	backchannel stream with the network stream.  The
	network stream would only have fields that were
	changed or added.  Each stage (other than the last)
	would track which fields were changed for a given
	chunk and only send those fields over a network
	connection, while sending the complete record on the
	backchannel.
	A set of library routines would handle the merge and
	split functionality.

	Once a stage has processed and handed off a chunk it
	can forget about it.  We don't need any registers,
	hashes etc.
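The merge half of those routines might look like the sketch
below, with the record flattened to fixed uint32 slots for
illustration (real chunks would carry variable-length fields):

```c
/* Merge: the backchannel carries the full record, the network
 * carries only the fields whose bits are set in the fieldmask. */
#include <assert.h>
#include <stdint.h>

#define NFIELDS 8

struct record {
    uint16_t fieldmask;          /* which slots in field[] are valid */
    uint32_t field[NFIELDS];
};

/* Overlay the sparse network delta onto the full local record. */
static struct record merge(struct record full, struct record delta)
{
    for (int i = 0; i < NFIELDS; i++)
        if (delta.fieldmask & (1u << i))
            full.field[i] = delta.field[i];
    full.fieldmask |= delta.fieldmask;
    return full;
}
```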

Client for low impact on servers:

	Using caching of stat info, file checksums and block
	checksums specialized rsync tools can minimize load
	on the server or even eliminate the need for an
	rsync server by using chunk transfers via HTTP.

          compare scan - 2 <----------- GET http://server/blocksums
      rolling checksum - 4 
      special transmit - 5  <----------- http GET chunks
                update - 6


          compare scan - 2 <----------- GET http://server/stat+filesums
                         |--------------| - (sparse stream)
      rolling checksum - 4 <----------- 3 - block checksum (from cache)
      special transmit - 5  <----------- http GET chunks
                update - 6 <----------- 5 - transmit

File Rename:
	By inserting another pair of stages between the
	compare scan and the block checksum stages we could
	identify file renames.

	rename - 1

		Generate file checksums for all local files
		that exist only on this node where one or more
		files only on the other node have the same
		size.  This means relocating the stream chunks
		for all single-node files to the end of the
		stream.
	rename - 2
		Generate file checksums for all local files
		that exist only on this node and compare them
		with the file checksums of corresponding files
		on the other node.  If a checksum matches,
		change the commands for both chunks to cause a
		rename.

	This is just one method of moderately low-impact
	rename detection.  More exhaustive methods could be
	developed, involving the rolling checksums or
	maintaining inode info from previous runs.
Broadcast and replication:

	Capturing the stream from the rolling checksums
	or transmit stage would allow high efficiency
	updates of other nodes.

	If you capture the output from the rolling checksum
	stage, invert it, and send it through the transmit
	stage on the receiving node, you will have an undo
	stream that can be fed to the update stage to
	regenerate the old tree.

             client                      server
          compare scan - 2 <----------- 1 - tree scan
        block checksum - 3 -----------> 4 - rolling checksum
              transmit - 5 <--invert----|
	                 |              |
      save to undo file  <              |
                     defer <----------- 5 - transmit
                update - 6 

	J.W. Schultz            Pegasystems Technologies
	email address:		jw at

		Remember Cernan and Schmitt
