No subject

Sun Jan 4 23:01:44 GMT 2004

usage appears to grow gradually, not exponentially.  An rsync may take
several hours to complete.  (I have one running now that started over four
hours ago.  The filesystem contains 236 GB of data in 2.4 million files.  It
is currently taking up 1351 MB of memory on the mirror server and 646 MB on
the source server.)  All filesystems are Veritas filesystems, in case that is
relevant.

I saw that someone on this list recently mentioned changing the block size.  The
rsync man page indicates this defaults to 700 (bytes?).  Would a larger
block size reduce memory usage, since there will be fewer (but larger)
blocks of data, and therefore fewer checksums to store?
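
For a rough sense of scale, here is a minimal Python sketch of that arithmetic.  It assumes the 32-byte per-block checksum record described for rsync 2.4.3 in the older note quoted below, and a 700-byte default block size; both figures may differ in other versions.  The function name is only illustrative.  According to that same note, this memory is held for one file at a time, so a larger block size would lower the per-file peak rather than the file-list overhead.

```python
SUM_BUF_BYTES = 32  # assumed size of one per-block checksum record (rsync 2.4.3 notes)


def checksum_memory(file_size, block_size=700):
    """Return (block_count, bytes) needed to hold one file's block checksums."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    return blocks, blocks * SUM_BUF_BYTES


for block_size in (700, 8192):
    blocks, mem = checksum_memory(1_000_000_000, block_size)
    print(f"block size {block_size}: {blocks} blocks, {mem} bytes of checksums")
```

Under these assumptions a 1,000,000,000-byte file needs about 45.7 million bytes of checksum records at 700-byte blocks and about 3.9 million at 8192-byte blocks.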

You suggested setting ARENA_SIZE to 0... I guess this would be done like
this?

% ARENA_SIZE=0 ./configure

Doug

David Bolen [mailto:db3l at fitlinxx.com] writes:
> Granzow, Doug (NCI) [granzowd at mail.nih.gov] writes:
> 
> > Hmm... I have a filesystem that contains 3,098,119 files.  That's
> > 3,098,119 * 56 bytes or 173,494,664 bytes (about 165 MB).  Allowing
> > for the exponential resizing we end up with space for 4,096,000
> > files * 56 bytes = 218 MB.  But 'top' tells me the rsync running on
> > this filesystem is taking up 646 MB, about 3 times what it should.
> >
> > Are there other factors that affect how much memory rsync takes up?
> > I only ask because I would certainly prefer it used 218 MB instead
> > of 646. :)
> 
> Hmm, yes - I only mentioned the per-file meta-data overhead since
> that's the only memory user in the original note case, which was
> failing before it actually got the file list transferred, and it
> hadn't yet started computing any checksums.  But there are definitely
> some other dynamic memory chunks.  However, in general the per-file
> meta-data ought to be the major contributor to memory usage.
> 
> I've attached an old e-mail of mine from when I did some examining of
> memory usage for an older version of rsync (2.4.3), which I think is
> still fairly valid.  I don't think it'll explain your significantly
> larger usage than expected.  (A followup note corrected the first
> paragraph, as rsync doesn't create any "tree" structures.)
> 
> Two possibilities I can think of have to do with the fact that the
> per-file overhead is handled by 'realloc'ing the space as it grows.
> It's possible that the sequence of events is such that some other
> allocation is being done in the midst of that growth which forces the
> next realloc to actually move the memory to gain more space, thus
> leaving a hole of unused memory that just takes up process space.
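
The first possibility can be bounded with a small Python sketch.  It takes the 56-byte entry size and the 1000-entry initial capacity quoted in this thread, and assumes the pessimistic case in which every doubling realloc moves the block and no abandoned block is ever reused.  A real allocator may well do better; the function name is only illustrative.

```python
ENTRY_BYTES = 56      # per-file entry size quoted in this thread
INITIAL_SLOTS = 1000  # initial file-list capacity quoted in this thread


def worst_case_footprint(file_count):
    """Return (live_bytes, hole_bytes), assuming every doubling realloc moves
    the block and none of the abandoned blocks is ever reused."""
    slots = INITIAL_SLOTS
    hole_bytes = 0
    while slots < file_count:
        hole_bytes += slots * ENTRY_BYTES  # old block left behind as a hole
        slots *= 2
    return slots * ENTRY_BYTES, hole_bytes


live, holes = worst_case_footprint(3_098_119)
print(live, holes, live + holes)
```

For 3,098,119 files this gives 229,376,000 live bytes plus 229,320,000 bytes of holes, or 458,696,000 bytes in total.  Under this assumption the holes roughly double the file-list footprint, but they would not by themselves account for 646 MB.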
> 
> Or, it's also possible that the underlying allocation library (e.g.,
> the system malloc()) is itself performing some exponential rounding up
> in order to help prevent just such movement.  I know that AIX used to
> do that, and even provided an environment variable way to revert to
> older behavior.
> 
> What you might try doing is observing the process growth during the
> directory scanning phase and see how much memory actually gets used to
> that point in time - gauged either by observing client/server traffic
> for when the file list starts getting transmitted, or by
> enabling/adding some debugging output to rsync.
> 
> I just peeked at the latest sources in CVS, and it looks like around
> version 2.4.6 the file list processing added some of its own
> micro-management of memory for small strings, so there's something
> else going on there too, in theory to help avoid platform growth like
> that mentioned in my last paragraph.  So if you're using <2.4.6, you might
> try a later version to see if it improves things.  Or if you're using
> a later version you might try rebuilding with ARENA_SIZE set to 0 to
> disable this code to see if your native platform handles it better
> somehow.
> 
> -- David
> 
> /-----------------------------------------------------------------------\
>  \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
>   |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
>  /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
> \-----------------------------------------------------------------------/
> 
> 	  - - - - - - - - - - - - - - - - - - - - - - - - -
> 
> From: David Bolen <db3l at fitlinxx.com>
> To: 'Lenny Foner' <foner at media.mit.edu>
> Cc: rsync at us5.samba.org
> Subject: RE: The "out of memory" problem with large numbers of files
> Date: Thu, 25 Jan 2001 13:25:43 -0500
> 
> Lenny Foner [foner at media.mit.edu] writes:
> 
> > While we're discussing memory issues, could someone provide a simple
> > answer to the following three questions?
> 
> Well, as with any dynamic system, I'm not sure there's a totally
> simple answer to the overall allocation, as the tree structure created
> on the sender side can depend on the files involved and thus the total
> memory demands are themselves dynamic.
> 
> > (a) How much memory, in bytes/file, does rsync allocate?
> 
> This is only based on my informal code peeks in the past, so take it
> with a grain of salt - I don't know if anyone has done a more formal
> memory analysis.
> 
> I believe that the major driving factors in memory usage that I can
> see are:
> 
> 1. The per-file overhead in the filelist for each file in the system.
>    The memory is kept for all files for the life of the rsync process.
> 
>    I believe this is 56 bytes per file (it's a file_list structure),
>    but a critical point is that it is allocated initially for 1000
>    files, but then grows exponentially (doubling).  So the space will
>    grow as 1000, 2000, 4000, 8000 etc.. until it has enough room for
>    the files necessary.  This means you might, worst case, have just
>    about twice as much memory as necessary, but it reduces the
>    reallocation calls quite a bit.  At ~56K per 1000 files, if you've
>    got a file system with 10000 files in it, you'll allocate room for
>    16000 and use up 896K.
> 
>    This growth pattern seems to occur on both sender and receiver of
>    any given file list (e.g., I don't see a transfer of the total
>    count over the wire used to optimize the allocation on the
>    receiver).
> 
> 2. The per-block overhead for the checksums for each file as it is 
>    processed.  This memory exists only for the duration of one file.
>    
>    This is 32 bytes per block (a sum_buf) allocated as one memory chunk.
>    This exists on the receiver as it is computed and transmitted, and
>    on the sender as it receives it and uses it to match against the
>    new file.
> 
> 3. The match tables built to determine the delta between the original
>    file and the new file.
>   
>    I haven't looked closely at this section of code, but I believe
>    we're basically talking about the hash table, which is going to be
>    a one time (during rsync execution) 256K for the tag table and then
>    8 (or maybe 6 if your compiler doesn't pad the target struct) bytes
>    per block of the file being worked on, which only exists for the
>    duration of the file.
>    
>    This only occurs on the sender.
> 
> There is also some fixed space for various things - I think the
> largest of which is up to 256K for the buffer used to map files.
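
The file-list growth described in item 1 can be sketched in a few lines of Python.  This is only an illustration of the doubling rule as described above (56 bytes per entry, starting at 1000 entries), not code taken from rsync; the function name is hypothetical.

```python
ENTRY_BYTES = 56      # per-file entry size given in item 1
INITIAL_SLOTS = 1000  # initial capacity given in item 1


def file_list_allocation(file_count):
    """Return (slots, bytes) allocated once the list can hold file_count files."""
    slots = INITIAL_SLOTS
    while slots < file_count:
        slots *= 2  # capacity doubles each time the list fills up
    return slots, slots * ENTRY_BYTES


print(file_list_allocation(10_000))     # (16000, 896000)
print(file_list_allocation(3_098_119))  # (4096000, 229376000)
```

The first result matches the 10000-file example in item 1.  The second corresponds to the 3,098,119-file filesystem discussed in the reply at the top of this message.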
> 
> > (b) Is this the same for the rsyncs on both ends, or is there
> >     some asymmetry there?
> 
> There's asymmetry.  Both sides need the memory to handle the lists of
> files involved.  But while the receiver just constructs the checksums
> and sends them, and then waits for instructions on how to build the
> new file (either new data or pulling from the old file), the sender
> also constructs the hash of those checksums to use while walking
> through the new file.
> 
> So in general on any given transfer, I think the sender will end up
> using a bit more memory.
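
As a rough illustration of that asymmetry, the Python sketch below adds up the per-file figures listed above for each side: 32 bytes per block of checksums on both ends, plus 8 bytes per block and a one-time 256K tag table on the sender.  The 700-byte default block size and the function name are assumptions, and the file list and the fixed buffers are left out.

```python
SUM_BUF_BYTES = 32            # per-block checksum record, held on both sides
TARGET_BYTES = 8              # per-block match-table entry (maybe 6 without padding)
TAG_TABLE_BYTES = 256 * 1024  # one-time tag table, sender only


def per_file_memory(file_size, block_size=700):
    """Return (receiver_bytes, sender_bytes) while one file is being processed."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    receiver = blocks * SUM_BUF_BYTES
    sender = blocks * (SUM_BUF_BYTES + TARGET_BYTES) + TAG_TABLE_BYTES
    return receiver, sender


print(per_file_memory(100_000_000))
```

For a 100,000,000-byte file this comes to about 4.6 million bytes on the receiver and about 6.0 million on the sender, which is in line with "a bit more" rather than a large multiple.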
> 
> > (c) Does it matter whether pushing or pulling?
> 
> Yes, inasmuch as the asymmetry is based on who is sending and who is
> receiving a given file.  What matters is not who initiates the contact
> but the direction in which the files are flowing.  This is due to the
> algorithm (the sender is the component that has to construct the
> mapping from the new file using portions of the old file as
> transmitted by the receiver).
> 
> > By the way, this does seem to be (once again) a potential argument
> > for the --files-from switch: doing it -that- way means (I hope!)
> > that rsync would not be building up an in-memory copy of the
> > filesystem, and its memory requirements would presumably only
> > increase until it had enough files in its current queue to keep its
> > network connections streaming at full speed, and would then
> > basically stabilize.  So presumably it might know about the 10-100
> > files it's currently trying to compute checksums for and get across
> > the network, but not 100,000 files.
> 
> I think you'd need more fundamental changes than just the --files-from
> switch to get the improvement you're thinking about.  Rsync exchanges
> file information up front between sender and receiver and then the
> receiver walks the files to handle the receipt.  That would have to be
> changed to interleave the file listing amongst the transfers but that
> would eliminate the ability to notice problems up front with
> specifications or what not.  I have a feeling that's a pretty
> significant change (it would certainly be a new protocol number).
> 
> I do think there's probably some room for improvement in some respects
> that would stay compatible with the existing protocol.  First in terms
> of how the allocation for the file list is grown (less aggressive
> expansion of the list) - or even perhaps an option to scan the
> filesystem twice to determine just how many files match so that you
> only allocate exactly what you need.
> 
> I could also see transmitting the checksums from the receiver as they
> were computed rather than bothering to store them locally in memory.
> That would remove that usage entirely from the receiver.  In theory
> you could receive them on the sender and place them right into the
> hash without affecting the protocol, but that would be more
> significant surgery on the source itself.
> 
> -- David
> 
> /-----------------------------------------------------------------------\
>  \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
>   |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
>  /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
> \-----------------------------------------------------------------------/