File Fragmentation issue...especially relating to NTFS...

Linda Walsh rsync at tlinx.org
Sun Dec 3 00:38:25 GMT 2006


I'm sorry for being so late to the party on this thread, but I'm
not sure what the original trigger for this fragmentation problem
was.  I had thought (imagined?) that one issue was related to
rsync's performance under cygwin.

I see the solution is heading toward a pre-allocation switch, but
the question I have is: why?  Under most circumstances, rsync should
already know how big the file it is going to write will be, since it
has already compared all the files on the source and destination,
and a difference in file size would force a transfer.

This thread seems(?) to go back to this email:
===============
Rob Bosch, on Sun, 06 Aug 2006 09:04:30, wrote:
>  I've been running some tests on files created by rsync and noticing
>  fragmentation issues. ...
>  The test I conducted was straightforward:
...
>  3. Resulting file had almost 12000 fragments.
>
>  It doesn't really surprise me since rsync grabs blocks as it needs them
>  for the new file.  I was wondering why rsync doesn't preallocate the space
>  like copy or other utilities do under Windows.
...
>  I am running the cygwin version of rsync ...
--------------

In doing some investigation, I noticed the problem isn't
specific to rsync but is common to many "gnunix" utils running
under cygwin on Windows.

There seems to be a belief that Windows pre-allocates the space for
copied files.  I haven't found this to be the case in my testing.
What does seem to be the case is that Windows copies files using
as much memory as it can.  If a file is 64M long, for example,
Windows seems to read the entire file, then write the output file
with a single write call.  It is that one large write that seems to
"encourage" Win(NT)/NTFS to allocate a contiguous region big enough
to hold the entire file.
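
This is easy to see from a cygwin shell.  The sketch below (an
illustration, not my original test commands; it assumes a 64M source
file named "src" and the Sysinternals "contig" tool, whose -a switch
analyzes a file's fragmentation) writes the same data once with a
single 64M write and once with many 8k writes:
---one_big_write_vs_many_small---
# One 64M write: NTFS can hunt for a single free run big enough.
dd if=src of=big-write.bin bs=64M count=1

# Many 8k writes: NTFS allocates piecemeal, one write at a time.
dd if=src of=small-write.bin bs=8k

# Compare fragment counts (-a analyzes without defragmenting).
contig -a big-write.bin
contig -a small-write.bin
---------------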

The file fragmentation problem isn't unique to cygwin on NTFS;
I duplicated it under linux with the ext[23] file systems.

I first wrote this writeup in response to the same problem being
reported on the cygwin list:
------------

The "fault" is the behavior of the file system.
I compared NTFS with ext3 & xfs on linux (jfs & reiser hide how many
fragments a file is divided into).

NTFS is in the middle as far as fragmentation performance goes.  My
disk is usually defragmented, but the built-in Windows defragmenter
doesn't defragment free space.

I used a 64M source file and copied it to a destination file
using various utils.
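(The fragment counts below can be read back with a tool like
filefrag from e2fsprogs on linux -- my assumption about the
measurement, since the original tests don't name a tool:)
---count_fragments---
# Report how many extents (fragments) a file occupies.
filefrag target-file
# -v lists each extent with its physical block range.
filefrag -v target-file
---------------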

With xfs (linux), I wasn't able to fragment the target file.  Even
when writing 1K chunks in append mode, the target file always ended
up as a single 64M fragment.
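
For instance, that append-mode worst case can be reproduced with
something like the following sketch (GNU dd; "src64m" is an assumed
64M source file):
---append_1k_chunks---
# Append the source 1K at a time -- 65536 tiny writes.
dd if=src64m of=target bs=1k oflag=append conv=notrunc
filefrag target
---------------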

With ext3 (also linux), the copy method didn't seem to matter:
cp, dd (block size 64M), and rsync all produced a target file with
2473 fragments.

NTFS under cygwin varies the fragment count based on the tool
writing the output.  "cp" produced the most fragments at 515.
"rsync" came next with 19 fragments.
"dd" (using bs=32M or bs=64M) did best at 1 fragment, while
"dd" with a block size of 8k produced the same results as "cp".

It appears cygwin does exactly the right thing as far as file
writes are concerned -- it writes the output using the block size
specified by the client program you are running.  If you use a
small block size, NTFS allocates space for each write that you do.

If you use a big block size, NTFS appears to look for the first place
where the entire write will fit.  Back in DOS days, the built-in COPY
command buffered as much data as would fit in memory, then wrote it
out -- meaning it was likely to create the output with a minimal
number of fragments.
If you want your files to be unfragmented, you need to use a
file copy (or file write) util that uses a large buffer size --
one that (if possible) writes the entire file in 1 write.
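
If you want to check what block size a given utility actually uses,
one way (on linux) is to trace its write calls; the file names here
are only placeholders:
---trace_write_sizes---
# Log every write() the copy makes; the last argument of each
# logged call is the byte count, i.e. the block size used.
strace -e trace=write -o cp.trace cp src dst
grep -c 'write(' cp.trace   # how many writes it took
head -3 cp.trace            # eyeball the block size
---------------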

In the "tar zcvf a.tgz *" case, I'd suggest piping the output of
tar into "dd" and use a large blocksize.

Linda
------------------------------------------

    In regard to rsync's performance under Windows, most of the
problems can be solved by increasing the amount of data written at
one time.  It is the small writes that "force" the problem under
Windows.

    However, as noted in my testing, ext3 seems to have a problem
allocating contiguous space for a file no matter how big the write
is.  If you are creating a special patch for rsync to pre-allocate
files, I suggest it be tested under ext3 using my "degenerate" case.
For my linux tests with ext3 & xfs, I created my test case using:
---create_frag_tst_files---
#!/bin/bash
# Create 10000 pairs of small (4k) files, then delete every other
# one, leaving the free space riddled with 4k holes.
stat="status=noxfer"   # silence dd's transfer summary
src=/dev/zero
echo Creating files...
for ((i=0;i<=9;++i)); do
    for ((j=0;j<=9;++j)); do
        echo -ne "\r$i${j}00..."
        for ((k=0;k<=9;++k)); do
            for ((l=0;l<=9;++l)); do
                of=t$i$j$k$l
                # a keeper file and an adjacent throwaway file
                dd $stat if=$src of=$of     bs=4096 count=1 2>/dev/null
                dd $stat if=$src of=$of-tmp bs=4096 count=1 2>/dev/null
            done
        done
    done
done
echo -ne "\rRemoving tmps..."
rm t*-tmp    # deleting the -tmp files creates the holes
echo -e  "\rDone.           "
---------------
    As noted above, xfs did not split the output file.
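
    The copy step itself wasn't shown above; here is a sketch of how
the rest of the test can be driven after running the script (the
file names and the use of filefrag are assumptions):
---run_frag_test---
# Make a 64M source, copy it three ways into the holey filesystem,
# and report how many extents each copy landed in.
dd if=/dev/zero of=src64m bs=1M count=64 2>/dev/null
cp src64m by-cp
dd if=src64m of=by-dd bs=64M 2>/dev/null
rsync src64m by-rsync
filefrag by-cp by-dd by-rsync
---------------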

    If rsync wants to easily encourage less fragmentation on NTFS, I'd
suggest buffering more data in memory before doing the "write".
Ideally, it would buffer each file in full before writing it out, but
large files may not fit in memory.

    Perhaps instead of a --pre-alloc switch, rsync could default to
using a minimum 32M file buffer, with a switch to tell it to use
more memory -- say 512MB -- or up to whatever file size the user can
fit in memory.

    I don't know that even pre-allocation can help on some file
systems.  Of course, some files measured in gigabytes might have
trouble fitting in one segment no matter how they are copied, unless
they are copied to a very clean disk....
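
    That doubt can be tested directly on file systems whose allocators
support fallocate(2) (xfs or ext4, say; ext3 predates it).  A sketch,
using the util-linux fallocate command in the fragmented free space
created by the script above:
---test_preallocation---
# Preallocate 64M in one request, then see how many extents
# the filesystem actually handed back.
fallocate -l 64M prealloc-test
filefrag prealloc-test
---------------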

    As I said at the beginning, sorry for the delay in posting this,
but I've been more than a bit swamped by various items, including some
unhappy personal business that came up and died, so to speak. :-(

Linda




