Problem with connection database

Claus Svarer csvarer at nru.dk
Fri Aug 31 09:48:56 GMT 2001


Hi Don

I have now setup a job that detects when one of the tdb files are affected by an
error. It is started as a crontab job every 5 minutes and searches through the
log files for tdb errors. I have attached a gzipped and tar'ed version of log
and lock files to this mail. At the time where the samba server chrashed the
Samba server has been running for 35 minutes. It was not the connection database
but the messages database that was corrupted. Further, I have attached the
smb.conf file and the script I used for detection of problems.

Best regards
Claus


"MCCALL,DON (HP-USA,ex1)" wrote:

> Hi Claus,
> Tell me a little more about the environment in which this is happening:
>
> max concurrent connections to samba (guess; how many pc's max
> would be connected to your samba server at any given time):
> average concurrent connections to samba(smbstatus during working hrs):
> how long (average) between the time you bring samba up, and FIRST log
> message involving any tdb database?:
> Look in dmesg, and /var/adm/syslog/syslog.log - any unusual messages
> there about full system tables, or filesystem full, etc?:
> any core files show up?(use "file core" to determine if smbd or not, and
> date of core):
>
>
> You might want to bring samba down, remove any log files, and then make a
> simple cron job that runs every few minutes and  parses all the log files
> for any tdb failure.  once it
> spots a tbd failure, the cron job would copy all the log files to another
> directory, do an smbstatus to a file in that same directory, and stop.
> Then we could try to see what was going on around the FIRST error
> logged, without a lot of extra stuff after the fact.  You might want to
> turn your log level up a bit as well, so that we'd get more info about
> what was being done at the time of the 1st error as well; since you're
> going to copy the files off to a safe location once you hit the first error,
> hopefully we won't wrap around and miss it.
>
> I'm also going to post a message to the samba list to see if anyone else
> can look in their logs to see if any of these errors are showing up on 2.2.x
> on archetectures OTHER than HP-UX...  Right now the only ones I have seen
> are on HP-UX 9.05, 10.20 and 11.xx...  I consider that somewhat
> suspicious...
>
> Thanks,
>
> Don
>
>
> -----Original Message-----
> From: Claus Svarer [mailto:csvarer at nru.dk]
> Sent: Tuesday, August 28, 2001 1:16 PM
> To: MCCALL,DON (HP-USA,ex1)
> Cc: 'samba-technical at lists.samba.org'
> Subject: Re: Problem with connection database
>
> Hi Don,
>
> Regarding problems with .tdb files at a HP-UX 11.00 system.
>
> The reason that I believe that it has something to do with large files is
> that it mostly are four users using Netscape Messenger (with the mail box at
> a net drive) that experience the problems. These four people are also the
> only ones that has these > 200 MByte files and it is almost always in these
> mail files we first experience the problem when the connections.tdb database
> are corrupted again.
>
> I have today tried to compile the Samba CVS (2.2.2-pre) using gcc 3.0.1
> instead of the HP ANSI-C compiler, but we see excatly the same problems,
> that the connection database chrashes, with error messages that look like:
>
> [2001/08/28 15:09:10, 0] tdb/tdbutil.c:tdb_log(342)
>   tdb(/usr/local/samba/var/locks/connections.tdb): rec_read bad magic 0x0 at
> offset=28399
> [2001/08/28 15:09:10, 0] tdb/tdbutil.c:tdb_log(342)
>   tdb(/usr/local/samba/var/locks/connections.tdb): rec_read bad magic
> 0x3b8b97af at offset=127840
> [2001/08/28 15:15:17, 0] tdb/tdbutil.c:tdb_log(342)
>   tdb(/usr/local/samba/var/locks/connections.tdb): rec_read bad magic
> 0x42424242 at offset=82540
> [2001/08/28 15:15:52, 0] smbd/oplock.c:request_oplock_break(1010)
>   request_oplock_break: no response received to oplock break request to pid
> 23374 on port 59593 for dev = 40020002, inode = 2126454
>   for dev = 40020002, inode = 2126454, tv_sec = 3b8b8ca5, tv_usec = 275c3
> [2001/08/28 15:15:52, 0] smbd/open.c:open_mode_check(554)
>   open_mode_check: exlusive oplock left by process 23374 after break ! For
> file Netscape/nru/Cache/fat.db, dev = 40020002, inode = 2
> l
>
> I don't know if anybody can trace down why the connections .tdb chrashes
> (Don MCCALL posted an example of a chrashed tdb lock file to the mailing
> list the other day). Could it be a result of what was reported earlier this
> month at the samba-tecnical mailing list  by andreas moroder
> claudiamoroder at st-ulrich.suedtirol.net Tue, 14 Aug 2001 22:33:54 +0200 that:
>
> Hello,
>
> I did run lclint ( YES I really did it ) over the tdb.c file. I got
> thousand,
>
> no billions of warnings. Most of them are not worth checking except the
>
> warnings about different types.
>
> at line 150     fl.l_type = rw_type;   l_type is short int , rw_type MAY be
>
> they are the same, but I don't know if it is guaranted on every machine
>
> at line tdb->locked[list+1].ltype = ltype;    int is assigned to unsigned
> int
>
> Another proble are the functions that return values that are never checked.
>
> At line 610 the return value of tdb_oob is ignored.
>
> At line 207 tdb_spinunlock
>
> and many more.
>
> Most of the functions seem to give back valuable return values, so they
>
> should all be checked.
>
> Any proposal that could help in solving this problem with the samba at the
> hp-ux system will be very welcome.
>
> Best regards
> Claus Svarer
>
>
> "MCCALL,DON (HP-USA,ex1)" wrote:
>
> Hi Claus,
> Please cc the samba-technical list on your emails regarding this, so we can
> get
> more people involved in the issue.
> I am interested in your observation regarding large files - how are you able
>
> to
> tell that accessing these files are causing the problem?  and is there ANY
> way
> you can FORCE this problem to happen, so I could reproduce it under
> controlled
> circumstances?
>
> Thanks,
> don
> -----Original Message-----
> From: Claus Svarer [ mailto:csvarer at nru.dk <mailto:csvarer at nru.dk> ]
> Sent: Monday, August 27, 2001 11:09 AM
> To: MCCALL,DON (HP-USA,ex1)
> Subject: Re: Problem with connection database
>
> Hi Don
>
> Thanks for looking at the problem. I don't believe the error condition have
> been
> wrapped out of the log files because with this low log level (1) the log
> files
> as far as I can see only is wrapped every second day (or about that).
>
> One other observation that we have had is that it is the really large files
> that
> causes the problems. Especially, there mailbox'es from the email system
> (Netscape Messenger with the files at the net) causes problems probably
> because
> of the files sizes (200 MByte or more).
>
> Thanks for your help
>
> BR
> Claus
>
> "MCCALL,DON (HP-USA,ex1)" wrote:
>
> > Hi Claus,
> > I haven't had as much time as I would like to look at this, and I'm not as
>
> > up on the tdb stuff as I would like,
> > so I'm posting my initial observations back to the samba-technical list as
>
> > well, in hopes these will suggest a course of action to someone more
> > knowlegable - to me it appears to almost be a race condition of some sort:
>
> >
> > All of the bad magic number reports in the log all boil down to the
> > following magic numbers:
> >
> > magic 0x42424242
> > magic 0x0
> > magic 0x10000
> > magic 0x3122
> > magic 0x41ed
> > magic 0x4d8
> > magic 0xd9fee666
> >
> > And of course
> >
> > magic 0x26011999  is reported when a free call is made, because this is
> the
> > TDB_MAGIC number, not the TDB_MAGIC_FREE number...  what does this mean?
> > something crapped out before changing the magic number to free_magic, but
> > somehow put it in the list to be freed???...
> >
> > The offsets reported all boil down to:
> >
> > 0x0 offset=18676
> > 0x0 offset=2
> > 0x0 offset=24113
> > 0x0 offset=26003
> > 0x0 offset=26019
> > 0x0 offset=4589
> > 0x0 offset=620
> > 0x0 offset=64
> > 0x0 offset=65536
> > 0x10000 offset=2834
> > 0x3122 offset=0
> > 0x41ed offset=45336
> > 0x42424242 offset=1316
> > 0x42424242 offset=18676
> > 0x42424242 offset=2556
> > 0x42424242 offset=5656
> > 0x42424242 offset=620
> > 0x42424242 offset=6276
> > 0x42424242 offset=6896
> > 0x42424242 offset=696
> > 0x42424242 offset=7516
> >
> > 0x4d8 offset=25496
> > 0xd9fee666 offset=11856
> > 0xd9fee666 offset=15576
> > 0xd9fee666 offset=16816
> > 0xd9fee666 offset=18676
> > 0xd9fee666 offset=45336
> > 0xd9fee666 offset=696
> > 0xd9fee666 offset=73596
> > 0xd9fee666 offset=88740
> >
> > and the offsets where it is reporting bad magic because it's NOT
> > TDB_FREE_MAGIC is:
> >
> > 0x26011999 offset=16816
> > 0x26011999 offset=18676
> > 0x26011999 offset=49156
> >
> > Ok, we are trying to FREE the above three offsets;
> > the 1st to got bad magic when the were being tried to
> > read/write or something as well; maybe comparing what was going on when it
>
> > was being freed as opposed to being allocated/read/wrote, and the timing
> of
> > that might give us some clues....
> >
> > We also have a bunch of tdb_oob len xxxxxxx beyond eof at
> >
> > at the following len/eof's:
> >
> > -1627389928 65536
> > -2147483577 65536
> > -788529101 65536
> > 1109419649 73732
> > 1111638598 73732   >>> this is 0x42424246
> > 1111638598 90112   in fact all of the
> > 1111638618 24576   ones in this range look
> > 1111638618 65536   like we just have the initialized
> > 1111638618 73732   memory except for last byte<<<<<<
> > 131137560 65536
> > 1346978887 65536
> > 16777247 65536
> > 181207044 65536
> > 181207044 73732
> > 485512 65536
> > 65560 65536
> > 875573320 65536
> > 925905992 65536
> > 956301336 65536
> >
> > With all this, it APPEARS that a link in the list is broken, so we are
> just
> > following out to 'random' offsets in the tdb file; we are prevented from
> > making really BAD mistakes by the fact that the TDB_MAGIC for each record
> is
> > checked, and processing stops at a bad magic number; but how the link got
> > corrupted to begin with is still evaiding me...
> >
> > I haven't been able to pin down the initial corruptor in the logs; we may
> > have missed this, and are now seeing only the results of a corruption
> whose
> > log entry has already wrapped.
> >
> > Hope this will suggest something to some one more knowlegable on the tdb
> > than myself.  I still have your logs, etc. if someone else on the list
> > wishes to help out.
> > Don
> >
> > -----Original Message-----
> > From: Claus Svarer [ mailto:csvarer at nru.dk <mailto:csvarer at nru.dk> ]
> > Sent: Friday, August 24, 2001 5:33
> > To: MCCALL,DON (HP-USA,ex1)
> > Subject: Problem with connection database
> >
> > Hi Don
> >
> > Our server have now recreated the connection.tdb error again. I have
> > made a gzipped tar file of all the log and lock files that was at the
> > server when it chrashed (attached to this mail). I don't know if you can
> > find out what happens, but as you know the error message is:
> >
> > fog:/users/csvarer 22 : smbstatus | more
> > INFO: Debug class all level = 1   (pid 843 from pid 843)
> > tdb(/usr/local/samba/var/locks/connections.tdb): rec_read bad magic
> > 0x42424242 at offset=5656
> >
> > If I search int the log file I can find several machines that sees this
> > problem. At the time where we observed the problem no new machines was
> > authorizing against the samba server.
> >
> > Best regards
> > Claus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: var.310801.113500.tar.gz
Type: application/x-gzip
Size: 18316 bytes
Desc: not available
Url : http://lists.samba.org/archive/samba-technical/attachments/20010831/eac94f20/var.310801.113500.tar.bin
-------------- next part --------------
# Samba config file created using SWAT
# from csport.pet.rh.dk (130.226.104.174)
# Date: 2001/08/29 10:16:45

# Global parameters
[global]
	workgroup = NEUROBIO
	netbios name = FOG
	server string = Samba Server
	encrypt passwords = Yes
	root directory = /
	passwd program = /usr/bin/yppasswd %u
	passwd chat = *Old*NIS*password* %o\n *New*password* %n\n *new*password* %n\n *changed*
	password level = 4
	unix password sync = Yes
	log level = 1
	log file = /usr/local/samba/var/log.%m
	max log size = 50
	time server = Yes
	printcap name = lpstat
	domain admin group = root @pc_adm
	logon script = scripts\%U.bat
	logon path = \\%L\profiles\%U
	logon drive = Z:
	logon home = \\%L\%U\.profiles
	domain logons = Yes
	os level = 64
	preferred master = True
	domain master = True
	wins support = Yes
	guest account = pcguest
	printer admin = @pc_adm
	hosts allow = 130.226.104.128/255.255.255.128 130.226.104.64/255.255.255.192 192.168.1.0/255.255.255.0 192.168.2.0/255.255.255.0

[homes]
	comment = Home Directories
	path = /pc_disk1/users/%u
	read only = No
	create mask = 0750
	veto oplock files = /Netscape/ 

[netlogon]
	comment = Samba Network Logon Service
	path = /pc_disk1/netlogon
	write list = csvarer
	guest ok = Yes

[profiles]
	path = /pc_disk1/netlogon/profiles/
	read only = No
	create mask = 0600
	directory mask = 0700
	guest ok = Yes

[printers]
	comment = All Printers
	path = /usr/spool/samba
	create mask = 0700
	guest ok = Yes
	printable = Yes
	lppause command = /usr/bin/lpalt %p-%j -p0
	lpresume command = /usr/bin/lpalt %p-%j -p2
	browseable = No

[print$]
	path = /usr/local/samba/printers
	write list = @pc_adm,root
	guest ok = Yes

[fax]
	comment = Fax
	path = /usr/spool/samba
	create mask = 0700
	printable = Yes
	postscript = Yes
	print command = /usr/local/bin/faxlpr
	lpq command = /usr/local/bin/faxlpq %U
	lprm command = /usr/local/bin/faxlprm %j %U

[tmp]
	comment = Temporary file space
	path = /pc_disk1/tmp
	read only = No
	create mask = 0770
	directory mask = 0770
	guest ok = Yes

[programs]
	comment = Program space
	path = /pc_disk1/programs
	write list = @pc_adm
	create mask = 0755
	guest ok = Yes

[prg_data]
	comment = Users Program setup
	path = /pc_disk1/prg_data
	read only = No
	create mask = 0775
	guest ok = Yes

[f-prot-v5]
	comment = Virus protection directory
	path = /pc_disk1/f-prot-v5
	force group = users
	read only = No
	force create mode = 0775
	force directory mode = 0775

[n_adm]
	comment = N administration
	path = /pc_disk1/n_adm
	valid users = @adm_n
	read only = No
	create mask = 0770
	directory mask = 0770

[nruweb]
	comment = NRU web
	path = /pc_disk1/web
	valid users = @adm_n, @unix_adm, @pc_adm
	write list = @pc_adm, @unix_adm, @adm_n
	read only = No
	create mask = 0775
	directory mask = 0775

[brain99]
	comment = N administration
	path = /pc_disk1/brain99
	valid users = @b99_grp
	read only = No
	create mask = 0770
	directory mask = 0770

[all]
	comment = All data
	path = /pc_disk1
	valid users = @pc_adm
	read only = No
	create mask = 0770
	browseable = No
-------------- next part --------------
#!/bin/sh
#
# Program that switches the log file for samba, whenever an error in a 
# tdb file is recognized
#
# CS, 310801
#
grep tdb /usr/local/samba/var/log.* > /tmp/SambaTdbError.txt
if (test -s /tmp/SambaTdbError.txt) then
  echo "Error in one of the tdb files detected"
  if (test ! -f /tmp/SambaTdbErrorDetected) then
    cp -R /usr/local/samba/var /usr/local/samba/var.`date +%d%m%y.%H%M%S`
    touch /tmp/SambaTdbErrorDetected
    mailx -s "Error in one of the tdb files detected" pc_adm
  fi
else
  echo "Error in one of the tdb files not detected"
fi
rm /tmp/SambaTdbError.txt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: csvarer.vcf
Type: text/x-vcard
Size: 394 bytes
Desc: Card for Claus Svarer
Url : http://lists.samba.org/archive/samba-technical/attachments/20010831/eac94f20/csvarer.vcf


More information about the samba-technical mailing list