[RFC] dynamic checksum size
jw schultz
jw at pegasys.ws
Mon Mar 24 03:36:37 EST 2003
On Mon, Mar 24, 2003 at 12:54:26AM +1100, Donovan Baarda wrote:
> On Sun, Mar 23, 2003 at 03:46:34AM -0800, jw schultz wrote:
> > On Sun, Mar 23, 2003 at 05:45:47PM +1100, Donovan Baarda wrote:
> > > On Sun, 2003-03-23 at 07:40, jw schultz wrote:
> > > > The two attached patches implement per-file dynamic checksum
> > > > lengths.
[...]
> > > As you can see from this, your heuristic is very conservative for large
> > > files, but optimistic for smaller files. Might I suggest the following
> > > code fragments:
> >
> > That is due to my error in the block-count magnitude in the
> > block size calculation.
>
> It's more than that: you are increasing the blocksum size by a byte every
> time the block count quadruples... you are not taking into account how the
> block size and file size affect things. This will only work well for a
> particular block size heuristic.
Yes, I never expected that to be what was implemented. I
knew it was overly aggressive in the rate of lengthening
sum2; it merely provided an example. You are the first to
suggest something based on more than vague feelings.

The nice thing is that even after we implement this, it can
be changed. The receiver is the only end that calculates
block and sum2 sizes.
> > > #define BLOCKLEN_EXP 13 /* block_len heuristic 'm' value */
> > > #define BLOCKSUM_EXP 10 /* blocksum size heuristic 'n' value */
> > >
> > > if (block_size) {
> > > block_len = block_size;
> > > } else {
> > > block_len = (len >> BLOCKLEN_EXP) & ~15; /* rough heuristic */
> > > block_len = MIN(block_len, BLOCK_SIZE);
> > > block_len = MAX(block_len, (CHUNK_SIZE / 2));
> >
> > As in the patch, the MIN and MAX macros are reversed.
> [...]
>
> Ahh yeah. I copied these from the patch on the premise that they were
> required for other limitations in the implementation (what is CHUNK_SIZE?).
> I would prefer if block_len could grow indefinitely as the file size
> increases.
CHUNK_SIZE is used in mapping the file into memory. I'm
not absolutely sure (I haven't spelunked that part of the
code) what would happen if the block size were to exceed it.
I suspect it would be OK, because I don't recall hearing
anyone complain as a result of using the --block-size option
with a huge value.
The problem with growing the block size indefinitely is that
you must balance the cost of the checksums with the cost of
sending large blocks for small changes.
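For the record, here is how I read that fragment with the macros
swapped into the intended order. This is only a sketch; CHUNK_SIZE
is, as far as I can tell, the 32K window size used by the mapping
code, so the cap below matches the 16K limit in my tables:

#define BLOCKLEN_EXP 13         /* block_len heuristic 'm' value */

    if (block_size) {
            block_len = block_size;     /* honour --block-size */
    } else {
            block_len = (len >> BLOCKLEN_EXP) & ~15;    /* rough heuristic */
            block_len = MAX(block_len, BLOCK_SIZE);     /* floor: the 700 byte default */
            block_len = MIN(block_len, CHUNK_SIZE / 2); /* cap: half a map chunk, 16K */
    }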
> > With a modicum of fixup to get it to actually work, and some
> > min-max bounds, here is an approximate table of what that would
> > do.
> [...]
> > file length  block_len  block_count  s2length  checksum array size
> [...]
> >        122M        15K         8198         3                  56K
> >        127M        15K         8193         3                  56K
> >       1063M        16K          66K         4                 531K
> >         16G        16K        1034K         5                9312K
> >        261G        16K          16M         6                 163M
>
> This end bit shows why I'd like the block size to be able to grow larger
> than some arbitrary limit.
How many 260GB files do you have? On the other hand, with
the fixed block length we see a diminishing advantage.
We also take a hit with the hashing, match lookups and
memory footprint.
> > I'm not qualified to argue the formulae. But i can't say
> > i'm overly fond of the numbers coming out.
> > Personally, i'd like to see it a bit more aggressive on the
> > block size and perhaps a closer relationship between
> > the checksum array size and the file size.
>
> The block_size heuristic is pretty arbitrary, but the blocksum_size
> calculation is not, and calculates the minimum blocksum size to achieve a
> particular probability of success for given file and blocksum sizes. You can
> replace the block_size heuristic with anything, and the second part will
> calculate the required blocksum size.
>
> > I do have a variant that scales the length divisor for block
> > size from 1024 to 16538 as the length progresses from 1MB to
> > 128MB. This in conjunction with your sum2 length formula
> > produces this:
>
> That sounds pretty arbitrary to me... I'd prefer something like having it
> grow at the square-root of filesize... so 1K for 1M, 8K for 64M, 16K for
> 256M, 32K for 1G, etc.
I lean most strongly toward a bounded square root of the file
size. It produces a nice smooth curve of resource consumption,
with the load per MB diminishing as files grow. I just didn't
have a decent fast integer sqrt function at hand, and I don't
really care to start bringing in the math lib just for this
one calculation. I've cobbled together one that is OK.
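Roughly, what I cobbled together looks like the sketch below. It is
not necessarily the exact code in my test program -- the isqrt name
and the clamping constants are just illustrative -- but the heuristic
is the one used for the tables (block_len ~ sqrt(file length),
rounded down to a multiple of 16 and bounded):

    /* Bit-by-bit integer square root -- no math lib needed.
     * int64 is rsync's 64-bit integer type. */
    static int64 isqrt(int64 n)
    {
            int64 res = 0;
            int64 bit = (int64) 1 << 62;    /* largest power of 4 that fits */

            while (bit > n)
                    bit >>= 2;
            while (bit) {
                    if (n >= res + bit) {
                            n -= res + bit;
                            res = (res >> 1) + bit;
                    } else {
                            res >>= 1;
                    }
                    bit >>= 2;
            }
            return res;
    }

    block_len = isqrt(len) & ~15;               /* ~ sqrt(file length) */
    block_len = MAX(block_len, BLOCK_SIZE);     /* at least the 700 byte default */
    block_len = MIN(block_len, CHUNK_SIZE / 2); /* the 16K cap used below */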
This is what I am getting with the square root:
file length  block_len  block_count  s2length  xmit sums  array_size
         50        700            1         2          6          36
       831K        912          934         2       5604         32K
      1439K       1200         1229         2       7374         43K
      2493K       1584         1612         2       9672         56K
      4317K       2096         2110         2        12K         74K
      7475K       2752         2782         2        16K         97K
        12M       3632         3650         2        21K        128K
        21M       4784         4799         2        28K        168K
        32M       5824         5835         3        39K        205K
        56M       7664         7678         3        52K        269K
        97M         9K           9K         3        69K        355K
       168M        12K          12K         3        90K        467K
      1063M        16K          66K         4       531K       2392K
        16G        16K        1034K         5      9312K         36M
       261G        16K          16M         6       163M        589M
      4239G        16K         264M         7      2914M       9539M
        64T        16K        4126M         8        48G        145G
Bump the maximum block size to 64KB and:
file length  block_len  block_count  s2length  xmit sums  array_size
       168M        12K          12K         3        90K        467K
       291M        17K          17K         3       119K        614K
       504M        22K          22K         3       157K        809K
       873M        29K          29K         3       207K       1064K
      1513M        38K          38K         3       272K       1400K
      2070M        45K          45K         4       364K       1638K
      3586M        59K          59K         4       479K       2156K
        32G        64K         524K         5      4716K         18M
       530G        64K        8486K         6        82M        298M
      8259G        64K         129M         7      1419M       4645M
       130T        64K        2089M         8        24G         73G
Barring other factors, a decent-sized machine could hope to handle a 500GB file.

If we eliminate the block size limit I get:
file length  block_len  block_count  s2length  xmit sums  array_size
      3586M        59K          59K         4       479K       2156K
      6210M        78K          78K         4       630K       2837K
        10G       103K         103K         4       829K       3733K
        18G       136K         136K         4      1091K       4913K
        31G       179K         179K         4      1436K       6465K
        54G       236K         236K         4      1890K       8507K
        66G       260K         260K         5      2346K       9384K
       114G       343K         343K         5      3087K         12M
       198G       451K         451K         5      4062K         15M
       344G       593K         593K         5      5345K         20M
       596G       781K         781K         5      7034K         27M
      1033G      1028K        1028K         5      9257K         36M
      1789G      1353K        1353K         5        11M         47M
      2092G      1463K        1463K         6        14M         51M
      3624G      1926K        1926K         6        18M         67M
      6276G      2535K        2535K         6        24M         89M
        10T      3336K        3336K         6        32M        117M
        18T      4390K        4390K         6        42M        154M
        31T      5776K        5776K         6        56M        203M
        55T      7602K        7602K         6        74M        267M
        95T         9M           9M         6        97M        351M
       130T        11M          11M         7       125M        411M
       226T        15M          15M         7       165M        541M
Not too shabby; do you have many 50TB files?
Clearly we are getting into the realm of sci-fi here. Even
with the 16KB block length limit, files in the 1-16GB range
should be manageable on almost any system that is likely to
have them.
** Tables were generated by a test program that only prints a row
when s2length, block_len or block_count has changed appreciably.
Note that the last entry in each table is a different file size;
most duplicate rows were removed.
The s2length calculation uses the algorithm you provided:
#define BLOCKSUM_EXP 10

/* add two bits for every doubling of the file size... */
b = BLOCKSUM_EXP;
l = len;
while (l >>= 1) {
        b += 2;
}
/* ...subtract one bit for every doubling of the block length... */
c = block_len;
while ((c >>= 1) && b) {        /* parenthesized so the shift binds before && */
        b--;
}
/* ...add a bit, subtract the 32-bit rolling checksum, round up to whole bytes */
s2length = (b + 1 - 32 + 7) / 8;
s2length = MAX(s2length, 2);
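Checking one row above by hand: 1063M is about 2^30, so b works out
to 10 + 2*30 = 70; a 16K (2^14) block length knocks off 14, leaving
56; and (56 + 1 - 32 + 7) / 8 = 4, which is the s2length shown in the
tables for that file size.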
--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw at pegasys.ws
Remember Cernan and Schmitt