Problem with connection database

Tue Aug 28 16:26:19 GMT 2001

Hi Claus,
Please cc the samba-technical list on your emails regarding this, so we can
get more people involved in the issue.
I am interested in your observation regarding large files - how are you able
to tell that accessing these files is causing the problem? And is there ANY
way you can FORCE this problem to happen, so I could reproduce it under
controlled circumstances?

Thanks,
don
-----Original Message-----
From: Claus Svarer [mailto:csvarer at nru.dk]
Sent: Monday, August 27, 2001 11:09 AM
To: MCCALL,DON (HP-USA,ex1)
Subject: Re: Problem with connection database

Hi Don

Thanks for looking at the problem. I don't believe the error condition has
been wrapped out of the log files, because at this low log level (1) the log
files, as far as I can see, only wrap every second day (or about that).

One other observation we have made is that it is the really large files
that cause the problems. In particular, the mailboxes from the email system
(Netscape Messenger with the files on the network) cause problems, probably
because of the file sizes (200 MByte or more).

Thanks for your help

BR
Claus

"MCCALL,DON (HP-USA,ex1)" wrote:

> Hi Claus,
> I haven't had as much time as I would like to look at this, and I'm not as
> up on the tdb stuff as I would like,
> so I'm posting my initial observations back to the samba-technical list as
> well, in hopes these will suggest a course of action to someone more
> knowledgeable - to me it almost appears to be a race condition of some sort:
>
> The bad magic number reports in the log all boil down to the
> following magic numbers:
>
> magic 0x42424242
> magic 0x0
> magic 0x10000
> magic 0x3122
> magic 0x41ed
> magic 0x4d8
> magic 0xd9fee666
>
> And of course
>
> magic 0x26011999  is reported when a free call is made, because this is the
> TDB_MAGIC number, not the TDB_FREE_MAGIC number...  what does this mean?
> Something crapped out before changing the magic number to the free magic,
> but somehow put the record in the list to be freed???...
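
As a quick arithmetic check on the magic values listed above, here is a
minimal sketch. Only TDB_MAGIC = 0x26011999 comes from the message; the
helper names are made up for illustration, and whether tdb really defines
its free magic as the 32-bit complement of TDB_MAGIC should be checked
against the tdb source before relying on it.

```python
TDB_MAGIC = 0x26011999  # value quoted in the message above


def complement32(value):
    """Return the 32-bit bitwise complement of value."""
    return ~value & 0xFFFFFFFF


def is_repeated_byte(value):
    """True if all four bytes of a 32-bit value are identical."""
    raw = value.to_bytes(4, "big")
    return len(set(raw)) == 1


# Bad magic values copied from the list above.
reported = [0x42424242, 0x0, 0x10000, 0x3122, 0x41ED, 0x4D8, 0xD9FEE666]

for magic in reported:
    notes = []
    if magic == complement32(TDB_MAGIC):
        notes.append("32-bit complement of 0x26011999")
    if is_repeated_byte(magic):
        notes.append("one byte value repeated four times")
    print(f"{magic:#010x}: {', '.join(notes) or 'no obvious pattern'}")
```

If the complement assumption holds, a record reported with 0xd9fee666 would
be one that is marked free while something still tries to read it as live.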
>
> The offsets reported all boil down to:
>
> 0x0 offset=18676
> 0x0 offset=2
> 0x0 offset=24113
> 0x0 offset=26003
> 0x0 offset=26019
> 0x0 offset=4589
> 0x0 offset=620
> 0x0 offset=64
> 0x0 offset=65536
> 0x10000 offset=2834
> 0x3122 offset=0
> 0x41ed offset=45336
> 0x42424242 offset=1316
> 0x42424242 offset=18676
> 0x42424242 offset=2556
> 0x42424242 offset=5656
> 0x42424242 offset=620
> 0x42424242 offset=6276
> 0x42424242 offset=6896
> 0x42424242 offset=696
> 0x42424242 offset=7516
>
> 0x4d8 offset=25496
> 0xd9fee666 offset=11856
> 0xd9fee666 offset=15576
> 0xd9fee666 offset=16816
> 0xd9fee666 offset=18676
> 0xd9fee666 offset=45336
> 0xd9fee666 offset=696
> 0xd9fee666 offset=73596
> 0xd9fee666 offset=88740
>
> and the offsets where it is reporting bad magic because it's NOT
> TDB_FREE_MAGIC are:
>
> 0x26011999 offset=16816
> 0x26011999 offset=18676
> 0x26011999 offset=49156
>
> Ok, we are trying to FREE the above three offsets;
> the first two also got bad magic when they were being read/written or
> something; maybe comparing what was going on when a record was being freed
> as opposed to being allocated/read/written, and the timing of that, might
> give us some clues....
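
The overlap between the free-time offsets and the earlier bad-magic offsets
can be checked mechanically. A minimal sketch, with the values copied by
hand from the two lists above; the variable and function names are
illustrative only, not anything from tdb.

```python
from collections import defaultdict

# rec_read bad-magic reports copied from the list above: magic -> offsets
READ_REPORTS = {
    0x0: [18676, 2, 24113, 26003, 26019, 4589, 620, 64, 65536],
    0x10000: [2834],
    0x3122: [0],
    0x41ED: [45336],
    0x42424242: [1316, 18676, 2556, 5656, 620, 6276, 6896, 696, 7516],
    0x4D8: [25496],
    0xD9FEE666: [11856, 15576, 16816, 18676, 45336, 696, 73596, 88740],
}

# Offsets where a free was attempted on a record still carrying 0x26011999.
FREE_REPORTS = [16816, 18676, 49156]

# Invert the mapping so each offset lists every magic value seen there.
magics_by_offset = defaultdict(set)
for magic, offsets in READ_REPORTS.items():
    for offset in offsets:
        magics_by_offset[offset].add(magic)


def seen_magics(offset):
    """Return the sorted bad-magic values reported at this offset."""
    return sorted(magics_by_offset.get(offset, set()))


for offset in FREE_REPORTS:
    found = ", ".join(f"{m:#x}" for m in seen_magics(offset)) or "none"
    print(f"offset {offset}: bad magics also seen on read: {found}")
```

An offset that shows several different magic values over time is a natural
place to start when lining the reports up against log timestamps.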
>
> We also have a bunch of "tdb_oob len xxxxxxx beyond eof" reports at the
> following len/eof's:
>
> -1627389928 65536
> -2147483577 65536
> -788529101 65536
> 1109419649 73732
> 1111638598 73732   >>> this is 0x42424246
> 1111638598 90112   in fact all of the
> 1111638618 24576   ones in this range look
> 1111638618 65536   like we just have the initialized
> 1111638618 73732   memory except for last byte<<<<<<
> 131137560 65536
> 1346978887 65536
> 16777247 65536
> 181207044 65536
> 181207044 73732
> 485512 65536
> 65560 65536
> 875573320 65536
> 925905992 65536
> 956301336 65536
>
> With all this, it APPEARS that a link in the list is broken, so we are just
> following out to 'random' offsets in the tdb file; we are prevented from
> making really BAD mistakes by the fact that the TDB_MAGIC for each record is
> checked, and processing stops at a bad magic number; but how the link got
> corrupted to begin with is still evading me...
>
> I haven't been able to pin down the initial corruptor in the logs; we may
> have missed this, and are now seeing only the results of a corruption whose
> log entry has already wrapped.
>
> Hope this will suggest something to someone more knowledgeable on the tdb
> than myself.  I still have your logs, etc. if someone else on the list
> wishes to help out.
> Don
>
> -----Original Message-----
> From: Claus Svarer [mailto:csvarer at nru.dk]
> Sent: Friday, August 24, 2001 5:33
> To: MCCALL,DON (HP-USA,ex1)
> Subject: Problem with connection database
>
> Hi Don
>
> Our server has now reproduced the connections.tdb error. I have
> made a gzipped tar file of all the log and lock files that were on the
> server when it crashed (attached to this mail). I don't know if you can
> find out what happens, but as you know the error message is:
>
> fog:/users/csvarer 22 : smbstatus | more
> INFO: Debug class all level = 1   (pid 843 from pid 843)
> tdb(/usr/local/samba/var/locks/connections.tdb): rec_read bad magic
> 0x42424242 at offset=5656
>
> If I search in the log file I can find several machines that see this
> problem. At the time when we observed the problem, no new machines were
> authenticating against the samba server.
>
> Best regards
> Claus