superlifter design notes (was Re: Latest rZync release: 0.06)

jw schultz jw at pegasys.ws
Sun Jul 21 01:36:01 EST 2002


On Thu, Jul 11, 2002 at 07:06:29PM +1000, Martin Pool wrote:
> I've put a cleaned-up version of my design notes up here
> 
>   http://samba.org/~mbp/superlifter/design-notes.html
> 
> It's very early days, but (gentle :-) feedback would be welcome.  It
> has some comments on Wayne's rzync design, which on the whole looks
> pretty clever.
> 
> I don't have any worthwhile code specifically towards this yet, but I
> have been experimenting with the protocol ideas in distcc
> 
>   http://distcc.samba.org/
> 
> I like the way it has worked out there: the protocol is simple and
> easy to understand, the bugs more or less "found themselves", and it
> feels like I'm using TCP in a natural way -- all of these much more so
> than rsync at the moment.  (Of course, the rsync problem is much more
> complicated.)

I too have been thinking about a redesign.  In fact I was
close to writing something recently, but rsync did the job
so I lost the motivation.  Your document, while useful,
seems to imply an actual design but doesn't outline it.
The focus of this document is to describe an overall
algorithm and discuss the implications.  I have no code
yet.  If that is your sticking point, move on.  I'd rather
we discuss this for a while and look at different
approaches than dive in without a clear design.

Please don't flame, but if you are polite you can point out
where I'm a total idiot.  I'd rather put up a bad idea and
have it shot down than fail to suggest a good one. :)

================================================================

From what I can see rsync is very clever.  The biggest
problems I see with it are its inability to scale to large
trees, a bit of accumulated cruft and featuritis, and
excessively tight integration.

By loosening the integration I believe we can both deal
with the scalability problem (through parallelization) and
create an inherent flexibility where parts of the program
could be replaced for special purposes with no impact on
the rest.  Tight integration has helped rsync to be as fast
as it is, but increasing CPU speeds make the network our
primary bottleneck, and a monolithic implementation doesn't
help there.

What I am envisioning is a multi-stage pipeline.  Instead
of one side driving the other with command and response
codes, each side (client/server) would set up a pipeline
containing those components that are needed, with the
appropriate plumbing.  Each stage would largely look like a
simple utility: reading from input, doing one thing, and
writing to output, error and log.  The output of each stage
is sent to the next uni-directionally with no handshake
required.

Each stage could be implemented as a library function.
Other utilities could be built with or linked to those
functions.  Rsync-TNG would be a framework containing the
library functions that would set up the necessary plumbing
between stages and let them run.  Each stage could well be a
separate thread or even process but need not be.

Partial pipelines could be built, allowing stream capture
and filtering before or after any stage.  Wherever possible
the pipeline would be transfer-direction agnostic, allowing
the directives in the stream to be inverted.
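
As a rough illustration (names here are hypothetical, not a
real API), each stage might boil down to a function that
reads records on one descriptor and writes them on another;
the framework's only job is to wire such functions together:

	#include <unistd.h>

	/* Hypothetical stage signature: read from in_fd, do
	 * one thing, write to out_fd; errors/logging go to
	 * log_fd.  The framework would run one of these per
	 * stage, as a thread or process, and connect them
	 * with pipes or sockets. */
	typedef int (*stage_fn)(int in_fd, int out_fd, int log_fd);

	/* Trivial pass-through stage, just to show the shape. */
	static int stage_copy(int in_fd, int out_fd, int log_fd)
	{
		char buf[65536];
		ssize_t n;

		(void)log_fd;
		while ((n = read(in_fd, buf, sizeof buf)) > 0) {
			if (write(out_fd, buf, (size_t)n) != n)
				return -1;
		}
		return n < 0 ? -1 : 0;
	}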

		--- The Stages ---

1	scan

	scan directory tree
	generate stat info

	checksum optional

	For comparisons, mtime represents the data fork and
	etime represents all other forks.  If the system
	doesn't support an etime for other forks, etime is
	max(etime, mtime).

	The variable-length ACL is considered part of the
	stat info; we might only send a checksum of the ACL
	at this point.

	It might be desirable to generate checksums of forks.

	This stage could accept a file list as input and may
	be optimized to utilize cached data.
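
	As a minimal sketch of what this stage amounts to
	(the output format here is made up purely for
	illustration), a tree walk with nftw() could emit one
	stat record per entry:

		#define _XOPEN_SOURCE 500
		#include <ftw.h>
		#include <stdio.h>
		#include <sys/stat.h>

		/* Emit one line of stat info per entry; a real
		 * stage would emit binary chunks and handle
		 * forks, EAs and optional checksums. */
		static int emit_stat(const char *path,
				     const struct stat *sb,
				     int type, struct FTW *ftw)
		{
			(void)type; (void)ftw;
			printf("%s mode=%o uid=%u gid=%u "
			       "size=%lld mtime=%lld\n",
			       path, (unsigned)sb->st_mode,
			       (unsigned)sb->st_uid,
			       (unsigned)sb->st_gid,
			       (long long)sb->st_size,
			       (long long)sb->st_mtime);
			return 0;
		}

		/* FTW_PHYS: do not follow symlinks. */
		int scan_tree(const char *root)
		{
			return nftw(root, emit_stat, 64, FTW_PHYS);
		}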

2	compare scan

	Scan the directory tree and compare with the incoming
	stat info.  Based only on stat info and optional
	checksums, classify each file as:
		unchanged,
		stat changed,
		data changed,
		forks changed,
		only existing at one location.

	Unchanged files may be removed from list.

	This stage may be optimized to utilize cached data.
	Must be run on a different node than stage 1.

3	block checksum

	Add block checksums of data forks for data-fork changed
	files.

	Add block checksums of other forks for files where a
	fork change is identified by etime or checksum.

	This stage may be optimized to utilize cached data.
	May be run on the same node as stage 2.

4	rolling checksum comparison

	Block checksums from stage 3 are compared with rolling
	checksums computed over the local files.  For each
	changed file, generate a list of
	fork+offset+offset+length tuples.

	This stage is computationally intensive, so there
	should be some control over whether it runs on the
	client or the server.

	Must be run on a different node than stage 3.
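
	For reference, the sort of weak rolling checksum rsync
	uses today (and which this stage would presumably keep
	in some form) can be slid over the local data one byte
	at a time; a sketch of just the arithmetic:

		#include <stddef.h>
		#include <stdint.h>

		/* Weak checksum of buf[0..len-1] in the style
		 * of rsync's: a = sum of bytes, b = sum of the
		 * running a values, both mod 2^16. */
		static uint32_t weak_sum(const unsigned char *buf,
					 size_t len)
		{
			uint32_t a = 0, b = 0;
			size_t i;

			for (i = 0; i < len; i++) {
				a += buf[i];
				b += a;
			}
			return (a & 0xffff) | ((b & 0xffff) << 16);
		}

		/* Slide a window of constant length 'len' one
		 * byte: drop 'out', append 'in'. */
		static uint32_t weak_roll(uint32_t sum,
					  unsigned char out,
					  unsigned char in,
					  size_t len)
		{
			uint32_t a = sum & 0xffff;
			uint32_t b = (sum >> 16) & 0xffff;

			a = (a - out + in) & 0xffff;
			b = (b - (uint32_t)len * out + a) & 0xffff;
			return a | (b << 16);
		}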
	
5	transmit

	For each changed file, use the
	fork+src_offset+src_length+dest_offset+dest_length
	tuples to insert the data changes into the stream.

	This stage must be run on the sender.
	May be run on the same node as stage 4.

6	update

	Apply file changes, inode changes, deletions,
	creations and links.

	This is the only stage that should modify anything.

	This stage must be run on the receiver.
	Must be run on a different node than stage 5.
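
	One possible reading of the stage-5 tuples on the
	update side (the struct and field names are only a
	strawman): a tuple says that the receiver already
	holds src_length bytes at src_offset of this fork and
	should place them at dest_offset in the rebuilt file;
	anything not covered arrives as literal data.

		#include <stdint.h>
		#include <unistd.h>

		struct change {
			uint32_t fork;
			uint64_t src_offset, src_length;
			uint64_t dest_offset, dest_length;
		};

		/* Copy a matched range from the basis file
		 * into the file being rebuilt.  Literal data
		 * would be written the same way but read from
		 * the incoming stream instead. */
		static int copy_matched(int basis_fd, int new_fd,
					const struct change *c)
		{
			char buf[65536];
			uint64_t done = 0;

			while (done < c->src_length) {
				size_t want = sizeof buf;
				ssize_t n;

				if (c->src_length - done < want)
					want = (size_t)(c->src_length - done);
				n = pread(basis_fd, buf, want,
					  (off_t)(c->src_offset + done));
				if (n <= 0)
					return -1;
				if (pwrite(new_fd, buf, (size_t)n,
					   (off_t)(c->dest_offset + done)) != n)
					return -1;
				done += (uint64_t)n;
			}
			return 0;
		}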

		--- Notes ---

Timestamps:

	Timestamps should be represented as seconds from the
	Epoch (SuS) as an unsigned 32-bit int.  It will be
	>90 years before we exceed this, by which time the
	protocol will have been extended to use a uint64 of
	milliseconds.

	Each system will identify a timestamp mask
	representing the precision of the underlying
	filesystem.  This mask will be transmitted as
	part of the datastream and retransmitted whenever
	a context change (directory) alters the mask.

	Time comparisons will use
		mask = mask1 & mask2
		(time1 & mask) == (time2 & mask)
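
	Spelled out as a small C helper (note the extra
	parentheses; in C, == binds tighter than &):

		#include <stdint.h>

		/* Compare two timestamps at the precision both
		 * filesystems can represent. */
		static int times_equal(uint32_t t1, uint32_t mask1,
				       uint32_t t2, uint32_t mask2)
		{
			uint32_t mask = mask1 & mask2;

			return (t1 & mask) == (t2 & mask);
		}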

Stat info

	The stat info would include (but is not limited to):
		mode		file type and permissions mask
		uid		numeric user ID
		gid		numeric group ID
		size		size of main fork
		EA_size		size of other forks (if any)
		mtime		time of last modification of main fork
		etime		time of last modification of other forks
		ACL_size	size of ACLs
		ACL		ACL and/or checksum of ACL
						(mutually exclusive?)
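
	As a strawman only (field sizes and the ACL handling
	are open questions), that might translate into a
	payload like:

		#include <stdint.h>

		/* Strawman per-file stat record.  The ACL part
		 * is variable length and might be the ACL
		 * itself, its checksum, or both. */
		struct stat_info {
			uint32_t mode;		/* type and permissions */
			uint32_t uid, gid;	/* numeric IDs */
			uint64_t size;		/* main (data) fork */
			uint64_t ea_size;	/* other forks */
			uint32_t mtime;		/* main fork */
			uint32_t etime;		/* other forks */
			uint32_t acl_size;	/* length of acl[] */
			unsigned char acl[];	/* ACL and/or checksum */
		};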

User/group names

	I think by default users and groups should only be
	handled numerically.

	When textual names are used a special chunk in the
	datastream would specify a "node+ID -> name"
	equivalency immediately before the first use of that
	number.
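
	A purely illustrative sketch of what such an
	equivalency chunk might carry:

		#include <stdint.h>

		/* Hypothetical "node+ID -> name" payload, sent
		 * once, immediately before the first chunk
		 * that uses this numeric ID. */
		struct id_name_map {
			uint8_t  which;		/* 0 = uid, 1 = gid */
			uint8_t  node;		/* 0 = client, 1 = server */
			uint32_t id;		/* numeric ID on that node */
			uint16_t name_len;
			char	 name[];	/* name_len bytes, no NUL */
		};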

Hard Links

	In the tree scan stage the first time we hit a given
	inode with st_nlink > 1 we add it to a hardlink list and
	decrement st_nlink.  Each time we find another path
	that references the inode we indicate it is a link
	in the datastream and decrement st_nlink of the one
	in our list.  When the entry in the list has
	st_nlink == 0 we remove it from the list.
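
	A minimal sketch of that bookkeeping, keeping the
	pending inodes in a small list keyed by (st_dev,
	st_ino):

		#include <stdlib.h>
		#include <sys/stat.h>

		struct hl_entry {
			dev_t dev;
			ino_t ino;
			nlink_t left;	/* links not yet seen */
			struct hl_entry *next;
		};

		static struct hl_entry *hl_list;

		/* Returns 1 if this path is a link to an inode
		 * already emitted (mark it as a link in the
		 * stream), 0 if it is the first path seen for
		 * the inode. */
		static int hardlink_check(const struct stat *sb)
		{
			struct hl_entry **pp, *e;

			if (sb->st_nlink <= 1)
				return 0;
			for (pp = &hl_list; (e = *pp) != NULL;
			     pp = &e->next) {
				if (e->dev != sb->st_dev ||
				    e->ino != sb->st_ino)
					continue;
				if (--e->left == 0) {
					*pp = e->next;	/* last link seen */
					free(e);
				}
				return 1;
			}
			e = malloc(sizeof *e);
			if (e == NULL)
				return 0;	/* real code would handle OOM */
			e->dev = sb->st_dev;
			e->ino = sb->st_ino;
			e->left = sb->st_nlink - 1;
			e->next = hl_list;
			hl_list = e;
			return 0;
		}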

	Most hardlinks are within the same directory so the
	hardlink list will remain fairly small.  In
	situations where one side is hardlinked to another
	location (cp -al or rsync --link-dest) the most
	efficient thing would be to run the tree scan on the
	other node.  This is one more reason for the ability
	to override the default pipeline construction.


File Forks/Extended Attributes (EA)

	Depending on the platform, files may have one or more
	pieces of additional data associated with them.
	Sometimes called "forks", they are generalized on
	Linux as "extended attributes".

	A prime example of a fork is the "resource fork" on
	HFS for the Macintosh.  On Linux, ACLs and capability
	flags are being implemented through EAs, and Linus
	has suggested the possibility of storing icon and
	thumbnail images, as well as arbitrary meta-data,
	there too.

	Where ACLs are stored in a fork on the filesystem,
	that fork would not be handled separately from the
	stat info.

Access Control Lists (ACL)

	ACLs should be considered as a variable length
	portion of the stat info.

	ACLs have the same name/ID issues as file ownership,
	so they need to be dealt with in the same way.  I
	think numeric IDs should be the default, being the
	most common case as well as the most efficient, but I
	don't care that much as long as I can use numeric IDs.


		--- Download IN ACTION ---

             client                      server
				
          compare scan - 2 <----------- 1 - tree scan
                         |
                         |
                         V
        block checksum - 3 -----------> 4 - rolling checksum
                                        |
                                        |
                                        V
                update - 6 <----------- 5 - transmit

This involves 3 network transfers.  Swapping nodes for
stages 1-4 reduces the computational load on the server at
the cost of one more network transmission.

             client                      server

             tree scan - 1 -----------> 2 - compare scan
                                        |
                                        |
                                        V
      rolling checksum - 4 <----------- 3 - block checksum
                         |
                         |--------------|
                                        V
                update - 6 <----------- 5 - transmit


		--- Pro and Con ---

con:

	involves a good deal of inter-stage traffic that
	_could_ add to network load.

		YOUR INPUT HERE

pro:

	Inter-stage traffic can be done in large blocks,
	minimizing stalls and freeing us from most network
	latency issues.
	

	Each stage can be implemented as a thread or process
	in the pipeline and run in parallel with other
	stages.

	By making the pipeline format source-dest agnostic
	all stages except the last 2 can be inverted to run
	on either node.

	By using a single well-defined format for all
	inter-stage pipelines we can capture, filter or
	externally generate that pipeline.

	By running multiple stages in parallel and
	processing a tree directory-by-directory (depth
	first) with a file or a directory preference we will
	achieve the temporal locality to take advantage of
	inode and even file caches in the host OS.


		--- Optimizations and future features---
	
File Names:
	Use filenames relative to the current directory.
	Directory changes could also be relative to current
	directory. The current directory would be relative
	to the path given on the command line.  In this way
	no directory name need be given twice.

	The only filenames that might need to be retained
	past immediate use would be directories, hardlinks
	and those being relocated to later in the stream for
	some reason.

Network traffic:

	The basic design involves sending a good deal of
	information about each file between successive
	stages.  Much of this information will have already
	been sent.  One approach to this would be to use
	back-channels between non-adjacent stages on each
	node.

          compare scan - 2 <----------- 1 - tree scan
                         |                      |
                         |                      | backchannel
                         V                      V
        block checksum - 3 -----------> 4 - rolling checksum
                |                       |
    backchannel |                       |
                V                       V
                update - 6 <----------- 5 - transmit

	In this way we would only have to transmit the
	changes to the datastream over the network.
	Chunks with no changes would not even be sent over
	the network.

	One approach I have to this would be to designate a
	bit for each part of the file data, e.g.:
		0x01	ACL checksum
		0x02	ACL data
		0x04	filename & stat
		0x08	action
		0x10	file & fork checksums
		0x20	file & fork block checksums
		0x40	file & fork changelist
		0x80	file & fork change data

	Each chunk transmitted would have an ID.  Positive
	chunk IDs would be created by the client and negative
	IDs by the server.  A chunk header might look
	something like this (I am not wedded to the field
	sizes):
		struct chunk {
			int32_t		ID;	/* negative: created by server */
			uint16_t	command_code;
			uint16_t	fieldmask;
			uint64_t	length;
			unsigned char	data[];	/* 'length' bytes follow */
		};

	Each stage (other than the first) would merge the
	backchannel stream with the network stream.  The
	network stream would only have fields that were
	changed or added.  Each stage (other than the last)
	would track which fields were changed for a given chunk
	and only send those fields over a network connection
	while sending the complete record on the
	backchannel.

	A set of library routines would handle the merge and
	split functionality.
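
	Sticking with the strawman chunk header above, the
	merge side might look roughly like this.  The field
	table is hypothetical, with fixed sizes only to keep
	the example short; real fields would be variable
	length:

		#include <stdint.h>
		#include <string.h>

		#define NFIELDS 8

		/* Bytes each field occupies in a full record
		 * (made-up sizes, indexed by bit position in
		 * the fieldmask). */
		static const size_t field_size[NFIELDS] = {
			4,	/* 0x01 ACL checksum */
			64,	/* 0x02 ACL data */
			256,	/* 0x04 filename & stat */
			2,	/* 0x08 action */
			16,	/* 0x10 file & fork checksums */
			512,	/* 0x20 block checksums */
			512,	/* 0x40 changelist */
			0	/* 0x80 change data (streamed) */
		};

		/* Full per-file record as carried on the
		 * backchannel: every field present. */
		struct record {
			unsigned char field[NFIELDS][512];
		};

		/* Overlay the fields present in a sparse
		 * network chunk (only those named in
		 * 'fieldmask', packed back to back in 'data')
		 * onto the full backchannel record. */
		static void merge_chunk(struct record *full,
					uint16_t fieldmask,
					const unsigned char *data)
		{
			int i;

			for (i = 0; i < NFIELDS; i++) {
				if (!(fieldmask & (1u << i)))
					continue;
				memcpy(full->field[i], data,
				       field_size[i]);
				data += field_size[i];
			}
		}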

	Once a stage has processed and handed off a chunk it
	can forget about it.  We don't need any registers,
	hashes etc.

Client for low impact on servers:

	By caching stat info, file checksums and block
	checksums, specialized rsync tools could minimize the
	load on the server or even eliminate the need for an
	rsync server by transferring chunks via HTTP.

          compare scan - 2 <----------- GET http://server/blocksums
                         |
                         |
                         V
      rolling checksum - 4 
                         |
                         |
                         V
      special transmit - 5  <----------- http GET chunks
                         |
                         |
                         V
                update - 6

or

          compare scan - 2 <----------- GET http://server/stat+filesums
                         |
                         |--------------| - (sparse stream)
                                        V
      rolling checksum - 4 <----------- 3 - block checksum (from cache)
                         |
                         |
                         V
      special transmit - 5  <----------- http GET chunks
                         |
                         |
                         V
                update - 6 <----------- 5 - transmit


File Rename:
	By inserting another pair of stages between the
	compare scan and the block checksum stages we could
	identify file renames.

	rename - 1

		Generate file checksums for all files that
		exist only on this node where one or more
		files that exist only on the other node have
		the same size.  This means relocating the
		stream chunks for all single-node files to
		the end of the stream.

	rename - 2
		Generate file checksums for all files that
		exist only on this node and compare them with
		the checksums of corresponding files from the
		other node.  If a checksum matches, change
		the commands for both chunks to cause a
		rename.

	This is just one method of moderately low-impact
	rename detection.  More exhaustive methods could be
	developed involving the rolling checksums or
	maintaining inode info from previous transfers.
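
	A rough sketch of the checksum test behind this (the
	whole-file checksum here is only a stand-in; a real
	implementation would use a strong hash such as MD4):

		#include <stdint.h>
		#include <stdio.h>

		/* Stand-in whole-file checksum. */
		static int file_sum(const char *path, uint64_t *sum)
		{
			FILE *f = fopen(path, "rb");
			unsigned char buf[65536];
			size_t n, i;

			if (f == NULL)
				return -1;
			*sum = 0;
			while ((n = fread(buf, 1, sizeof buf, f)) > 0)
				for (i = 0; i < n; i++)
					*sum += buf[i];
			fclose(f);
			return 0;
		}

		/* Size and checksum of a file that exists only
		 * on the other node, as received in its chunk. */
		struct rename_cand {
			uint64_t size;
			uint64_t sum;
		};

		/* A local-only file is a rename if a same-sized
		 * other-node-only file has the same checksum. */
		static int is_rename(uint64_t size, uint64_t sum,
				     const struct rename_cand *c,
				     int ncand)
		{
			int i;

			for (i = 0; i < ncand; i++)
				if (c[i].size == size && c[i].sum == sum)
					return 1;
			return 0;
		}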

Broadcast and replication:

	Capturing the stream from the rolling checksum
	or transmit stage would allow high-efficiency
	updates of other nodes.

	If you capture the output from the rolling checksum
	stage, invert it, and send it through the transmit
	stage on the receiving node, you will have an undo
	stream that can regenerate the old tree when fed to
	the update stage.

             client                      server
				
          compare scan - 2 <----------- 1 - tree scan
                         |
                         |
                         V
        block checksum - 3 -----------> 4 - rolling checksum
                                        |
              transmit - 5 <--invert----|
	                 |              |
      save to undo file  <              |
                                        V
                     defer <----------- 5 - transmit
                         |
                         |
                         V
                update - 6 

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw at pegasys.ws

		Remember Cernan and Schmitt



