[TDB] Patches for file and memory usage growth issues

simo idra at samba.org
Mon Apr 11 10:24:49 MDT 2011


On Tue, 2011-04-12 at 01:09 +0930, Rusty Russell wrote:
> On Mon, 11 Apr 2011 08:02:11 -0400, simo <idra at samba.org> wrote:
> > On Mon, 2011-04-11 at 20:29 +0930, Rusty Russell wrote:
> > > On Sat, 09 Apr 2011 23:01:15 -0400, simo <idra at samba.org> wrote:
> > > > Hi all,
> > > > during this week I have been working on a memory growth issue affecting
> > > > sssd when storing large groups of users in the LDB-based cache.
> > > > 
> > > > It turns out that a lot of memory is being used by TDB.
> > > > 
> > > > Please take a look at the last 3 patches in this tree:
> > > > http://git.samba.org/?p=idra/samba.git;a=shortlog;h=refs/heads/tdb-memsize
> > > > 
> > > > The first one should be uncontroversial: the original intent makes sense
> > > > in a tdb database where you use small records, but when a rare huge
> > > > record is saved, growing the db by 100 times the record's size really is
> > > > overkill. I also reduced the minimum overhead from 25% to 10%, as with
> > > > large TDBs (in the order of 500MB/1GB) growing by 25% is quite a lot.
> > > 
> > > Yep, that makes sense.  Perhaps just drop the heuristic and ignore record
> > > size altogether?  Double until the database is > 1M, then add 10%?
> > 
> > Can't. You may be unlucky enough that the record you are trying to store
> > is much bigger than the entire database :)
> > Although perhaps I can make the DB grow faster when it is smaller, say
> > min 25% up to 100MiB and then switch to 10% (or maybe even less).
> 
> Right, so something like:
> 
>         if (tdb->map_size < 100 * 1024 * 1024)
>                 additional = tdb->map_size / 4;
>         else
>                 additional = tdb->map_size / 10;
> 
>         if (additional < size)
>                 additional = size;

I would still do size*2 at a minimum, but I can change the patch to
switch from 25% to 10% at the 100M mark.
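
Something along these lines (just a sketch; tdb_expand_adjust is a
hypothetical helper name, not the actual patch):

static tdb_off_t tdb_expand_adjust(tdb_off_t map_size, tdb_off_t size)
{
        tdb_off_t additional;

        /* grow by 25% while the db is small, 10% once it is large */
        if (map_size < 100 * 1024 * 1024) {
                additional = map_size / 4;
        } else {
                additional = map_size / 10;
        }

        /* but always leave room for at least one more record of the
         * size we are currently storing */
        if (additional < size * 2) {
                additional = size * 2;
        }

        return additional;
}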

> > > Note that we repack on expand, so limiting expansions will end up
> > > slowing down some benchmarks.  I don't think it's a problem IRL, though.
> > 
> > Yeah, most of these issues may need much smarter heuristics or
> > speed/size compromises. TDB was originally used for very small
> > datasets, so it made sense to gain whatever speed you could even if
> > it meant wasting some space relative to the size of the dataset. But
> > now that TDB is also used for large datasets, this is hurting.
> 
> Now, I just figured out your backup problem.  Sorry for being slow.
> 
> If you take an empty database, start a transaction and then dump records
> into it, you will do a repack when the transaction finishes, which is
> probably not a good idea.

Right.

> Far better to only repack if we've expanded the DB *once* in the
> transaction: this reflects the fragmentation case better than the
> current heuristic.
> 
> Something like this? (Completely untested, it's 1:30 am here)
> 
> diff --git a/lib/tdb/common/transaction.c b/lib/tdb/common/transaction.c
> index b6cf771..164b0d7 100644
> --- a/lib/tdb/common/transaction.c
> +++ b/lib/tdb/common/transaction.c
> @@ -403,8 +403,6 @@ static int transaction_expand_file(struct tdb_context *tdb, tdb_off_t size,
>  		return -1;
>  	}
>  
> -	tdb->transaction->need_repack = true;
> -
>  	return 0;
>  }
>  
> @@ -979,6 +977,12 @@ static int _tdb_transaction_prepare_commit(struct tdb_context *tdb)
>  			_tdb_transaction_cancel(tdb);
>  			return -1;
>  		}
> +
> +		/* We repack if we're slowly growing, eg. not for backup! */
> +		if (tdb->map_size < tdb->transaction->old_map_size * 2) {
> +			tdb->transaction->need_repack = true;
> +		}
> +
>  		tdb->map_size = tdb->transaction->old_map_size;
>  		methods->tdb_oob(tdb, tdb->map_size + 1, 1);
>  	}

Ok, I'll try this and see what difference it makes.
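
To exercise the backup pattern I'll use something like this minimal
reproducer (dump many records into an empty tdb inside one
transaction; with your change the commit should no longer schedule a
repack):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <tdb.h>

int main(void)
{
        struct tdb_context *tdb;
        char kbuf[32], dbuf[256];
        TDB_DATA key, data;
        int i;

        tdb = tdb_open("backup-test.tdb", 0, TDB_DEFAULT,
                       O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (tdb == NULL) {
                perror("tdb_open");
                return 1;
        }

        if (tdb_transaction_start(tdb) != 0) {
                fprintf(stderr, "%s\n", tdb_errorstr(tdb));
                return 1;
        }

        /* bulk-load records into the empty db, backup style */
        memset(dbuf, 'x', sizeof(dbuf));
        for (i = 0; i < 100000; i++) {
                snprintf(kbuf, sizeof(kbuf), "key-%d", i);
                key.dptr = (unsigned char *)kbuf;
                key.dsize = strlen(kbuf);
                data.dptr = (unsigned char *)dbuf;
                data.dsize = sizeof(dbuf);
                if (tdb_store(tdb, key, data, TDB_INSERT) != 0) {
                        fprintf(stderr, "%s\n", tdb_errorstr(tdb));
                        return 1;
                }
        }

        return tdb_transaction_commit(tdb) == 0 ? 0 : 1;
}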

> > 
> > > > The second one needs discussion. I noticed that a transaction can
> > > > substantially increase the size of a tdb: in my tests a 1.2G TDB
> > > > actually grew to ~1.7G after a backup ... by not using a
> > > > transaction to store the backup copy, the same 1.2G TDB shrunk down
> > > > to ~900MB. I know that the reason we use a transaction is to avoid
> > > > rsyncs of partial DBs, so maybe we want to add an option to choose
> > > > whether to use a transaction for the 2nd db? Comments welcome.
> > > 
> > > Hmm, that's odd.  Compressing the recovery area makes a great deal of
> > > sense for tdb2 though.
> > > 
> > > Note that we overallocate all TDB entries by 25%.  (TDB2 only does that
> > > when you expand an existing record).
> > 
> > The overallocation is useful when you save many entries at the same
> > time, which is actually a frequent use case for sssd. How would tdb2
> > behave on bursts of growth (many new records)?
> 
> I think you misunderstand?  I'm saying we allocate 25% extra room in
> each record, to reduce fragmentation in case it grows.

Ooh, I did misunderstand, thanks for explaining.
And yes, it would probably make quite a big difference
performance-wise, although it may also end up consuming a lot of extra
memory; I will need to carefully check its impact.
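
If I understand it right, the allocation side is conceptually
something like this (a sketch only, not the actual tdb_allocate()
code):

/* every record gets 25% extra space so that it can grow in place
 * without fragmenting the freelist */
static size_t record_alloc_size(size_t key_len, size_t data_len)
{
        size_t needed = key_len + data_len;

        return needed + needed / 4;
}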

> I've got some regressions with TDB2 in S3, but once I've fixed those
> I'd really like to see what your stress tests give with tdb2...
> 
> > > > The third one I think is less controversial, but still not something I
> > > > would put in without discussion. This feature has the very nice property
> > > > of being completely optional and needs an explicit flag to be passed at
> > > > tdb_open(). This makes it relatively harmless and not very intrusive.
> > > 
> > > But it makes an incompatible database :(
> > > 
> > > OTOH it's fairly easy for the user to do this themselves and they know
> > > what tradeoffs they want to make.
> > 
> > This is why you have to explicitly enable it. Old tools will indeed
> > report errors, but that seems like a very desirable outcome if you do
> > not have a clue how to handle new data types :)
> 
> I like the idea, but I think it's best to use that compression for the
> recovery area, not normal data.  That might actually speed things up
> (less data to sync)...

I am sure that one will be "better" than the other depending on your
workload and use case.

Perhaps we should have a setting in tdb2 where you define, at db
creation time, what you want to optimize the db for, and then use
different strategies based on that choice.
Some people need to optimize for speed and don't care how much space
is used; this is certainly true for some sensitive databases in
samba3. Yet others have to care about size too, especially since
working memory is something we need to care about.
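
A purely hypothetical sketch of the idea, none of these flags exist
today:

#define TDB_OPTIMIZE_SPEED  0x1000  /* hypothetical: favour speed,
                                     * aggressive growth and padding */
#define TDB_OPTIMIZE_SIZE   0x2000  /* hypothetical: favour size,
                                     * minimal growth, no padding */

struct tdb_context *tdb = tdb_open("cache.tdb", 0,
                                   TDB_DEFAULT | TDB_OPTIMIZE_SIZE,
                                   O_RDWR | O_CREAT, 0600);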

When we run tdb_repack we currently double the memory used, as we keep
a full copy of the DB in memory. Although the kernel can easily handle
that with respect to OOM, because mmap'ed areas can be flushed back to
disk under memory pressure, it causes a lot of waste in the file/page
caches, as it forces other stuff out of memory. A repack that doesn't
use all that memory would be highly appreciated too.
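
For reference, the current repack is conceptually something like this
(a simplified sketch of tdb_repack, error handling omitted; the
in-memory copy in step 1 is where the 2x footprint comes from):

static int copy_fn(struct tdb_context *from, TDB_DATA key,
                   TDB_DATA data, void *private_data)
{
        struct tdb_context *dest = private_data;

        return tdb_store(dest, key, data, TDB_INSERT);
}

static int repack_sketch(struct tdb_context *tdb)
{
        struct tdb_context *tmp;

        /* 1. copy the whole dataset into an in-memory tdb: from
         * here until the end the data exists twice */
        tmp = tdb_open("repack-tmp", 0, TDB_INTERNAL, O_RDWR, 0);
        if (tmp == NULL) {
                return -1;
        }
        tdb_traverse(tdb, copy_fn, tmp);

        /* 2. throw away the fragmented layout */
        tdb_wipe_all(tdb);

        /* 3. copy the records back, now densely packed */
        tdb_traverse(tmp, copy_fn, tdb);

        tdb_close(tmp);
        return 0;
}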

Simo.

-- 
Simo Sorce
Samba Team GPL Compliance Officer <simo at samba.org>
Principal Software Engineer at Red Hat, Inc. <simo at redhat.com>


