tdb clear-if-first problem [was hanging smbd(s) revisited]

Mon Mar 6 14:36:03 GMT 2006

In working through bug #3569, I've uncovered what may be the cause of the
high load, with smbds getting only a little work done. This line of thinking
concerns locking.tdb and hundreds of users logging off and logging on
simultaneously.

Between 3.0.20 and 3.0.20a, there were three changes to tdb/tdb.c:

* empty chain optimization

This optimization looks quite interesting although I'm still learning the
insides of tdb.

* tdb_brlock read lock moved before race test and tdb_mmap in tdb_reopen().

Do you really want to try a brlock before the race test and before the
tdb_mmap? This just seems wrong to me when compared to tdb_open_ex(). Also,
is it really necessary to repeat the F_RDLCK on the ACTIVE lock, since that
was already done in tdb_open_ex() by the initial opener with the
TDB_CLEAR_IF_FIRST flag set? Technically this flag should not be set on a
*re*open (if I understand this correctly).
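
To make the question concrete, here is a minimal sketch of the clear-if-first
idea as I understand it: the opener probes one "active" byte with a
non-blocking write lock, clears the file if the probe succeeds, and then
settles on a shared read lock. This is not the tdb.c code. The names
byte_lock(), open_clear_if_first(), demo_two_openers() and the offset
SKETCH_ACTIVE_OFFSET are made up for illustration, and the sketch leaves out
the separate global lock that would normally serialize openers.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_ACTIVE_OFFSET 4 /* hypothetical lock byte, not taken from tdb.c */

/* Apply an fcntl record lock (F_RDLCK, F_WRLCK or F_UNLCK) to one byte.
 * cmd is F_SETLK (fail at once) or F_SETLKW (wait for the lock). */
static int byte_lock(int fd, off_t offset, short type, int cmd)
{
    struct flock fl = {0};

    assert(fd >= 0);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = offset;
    fl.l_len = 1;
    return fcntl(fd, cmd, &fl);
}

/* Returns 1 if no other process held the active byte (the file is cleared),
 * 0 if some other process did, -1 on error. On success the caller is left
 * holding a read lock on the active byte. */
static int open_clear_if_first(int fd)
{
    int first = 0;

    if (byte_lock(fd, SKETCH_ACTIVE_OFFSET, F_WRLCK, F_SETLK) == 0) {
        first = 1;
        if (ftruncate(fd, 0) == -1)
            return -1;
    }
    /* Replace the write lock with a read lock, or wait to share one. */
    if (byte_lock(fd, SKETCH_ACTIVE_OFFSET, F_RDLCK, F_SETLKW) == -1)
        return -1;
    return first;
}

/* Returns 1 if the parent was "first" and a second process was not. */
static int demo_two_openers(void)
{
    char path[] = "/tmp/cif_sketch_XXXXXX";
    int fd = mkstemp(path);
    int status, ok = 0;
    pid_t pid;

    if (fd == -1)
        return -1;
    if (open_clear_if_first(fd) == 1 && (pid = fork()) != -1) {
        if (pid == 0) {
            /* Second opener: the parent's read lock defeats the probe. */
            int cfd = open(path, O_RDWR);
            _exit(cfd != -1 && open_clear_if_first(cfd) == 0 ? 0 : 1);
        }
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            ok = (WEXITSTATUS(status) == 0);
    }
    unlink(path);
    close(fd);
    return ok;
}
```

The point of the sketch is only that the read lock left behind by the first
opener is what tells later openers they are not first.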

* no longer clearing the TDB_CLEAR_IF_FIRST flag in tdb_reopen_all()

In server/smbd.c there is language stating that special fork handling
requires the removal of TDB_CLEAR_IF_FIRST. If that's true, then why was the
clearing of the flag removed? Doesn't this now complicate the tdb locking
process?
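
One detail that bears on the fork handling: POSIX record locks belong to a
process and are not inherited across fork(), so a child that wants a read
lock on the ACTIVE byte has to take one itself. The sketch below shows a
child-side reopen that drops a clear-if-first flag and takes its own read
lock. All names here (struct sketch_ctx, sketch_reopen(), lock_active(),
demo_fork_reopen(), SKETCH_CLEAR_IF_FIRST, SKETCH_ACTIVE_OFFSET) are
hypothetical and this is not the tdb_reopen() code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define SKETCH_CLEAR_IF_FIRST 0x1u /* hypothetical flag bit */
#define SKETCH_ACTIVE_OFFSET 4     /* hypothetical lock byte */

struct sketch_ctx {
    int fd;
    unsigned flags;
};

/* Lock, relock or unlock the single "active" byte with fcntl. */
static int lock_active(int fd, short type, int cmd)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = SKETCH_ACTIVE_OFFSET;
    fl.l_len = 1;
    return fcntl(fd, cmd, &fl);
}

/* Child-side reopen: never clear on a reopen, but take our own read lock. */
static int sketch_reopen(struct sketch_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->flags &= ~SKETCH_CLEAR_IF_FIRST;
    return lock_active(ctx->fd, F_RDLCK, F_SETLKW);
}

/* Returns 1 if, once the parent drops its lock, the child's own read lock
 * still blocks a write-lock probe; 0 if it does not; -1 on setup error. */
static int demo_fork_reopen(void)
{
    char path[] = "/tmp/reopen_sketch_XXXXXX";
    struct sketch_ctx ctx = { -1, SKETCH_CLEAR_IF_FIRST };
    int ready[2], done[2], status, blocked = 0;
    char c = 0;
    pid_t pid;

    ctx.fd = mkstemp(path);
    if (ctx.fd == -1 || pipe(ready) == -1 || pipe(done) == -1)
        return -1;
    if (lock_active(ctx.fd, F_RDLCK, F_SETLKW) == -1 || (pid = fork()) == -1)
        return -1;
    if (pid == 0) {
        close(done[1]);
        c = (sketch_reopen(&ctx) == 0 &&
             !(ctx.flags & SKETCH_CLEAR_IF_FIRST));
        if (write(ready[1], &c, 1) != 1)
            _exit(1);
        /* Hold the lock until the parent closes its end of the pipe. */
        if (read(done[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }
    close(ready[1]);
    if (read(ready[0], &c, 1) == 1 && c == 1) {
        lock_active(ctx.fd, F_UNLCK, F_SETLK);
        /* Only the child's read lock can make this probe fail now. */
        blocked = (lock_active(ctx.fd, F_WRLCK, F_SETLK) == -1);
    }
    close(done[1]);
    waitpid(pid, &status, 0);
    unlink(path);
    close(ctx.fd);
    return blocked;
}
```

In this sketch the flag is cleared so that a reopen can never truncate the
file, while the read lock is repeated because the child has none of its own
after the fork.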

It seems to me that under duress, the last two changes would be the catalyst
that sends smbds into unnecessary context switching while waiting on a
resource. This would explain (in my corrupt mind, anyway) why work is getting
done, but painfully slowly. My thinking is that this might eventually clear
up if allowed to finish, but our users are not that patient.

So we bounce smbd and all is well until the next wave about 24 hours later.

I'm going to try reversing just the last two changes and see what happens
when applied to 3.0.21c (keeping the optimization).

Please help me to understand this if I'm off my rocker.

Cheers,

Bill