[Samba] smbd mortality
kvkamin at aim.com
Thu Nov 13 20:12:35 GMT 2008
I encountered a serious Samba problem and want to publish the details for public benefit.
A SLES 10 server running Samba 3.0.28 as domain controller, file server and CUPS print server, which had run uneventfully for 2 years, suddenly drops all users; load rapidly grows to about 250 and the server becomes unresponsive. smbstatus reveals that every user has about 10 instances of smbd instead of one. CPU utilization (dual processor, dual core) is very low (2%, mostly X and top). A reboot clears the problem, but the issue returns every 30 minutes or so. The logs are empty of any useful info: /var/log/messages and /var/log/samba/log.smbd. dmesg shows no errors. The system is not using any swap space. The server passes all diagnostics possible. The system is fully patched. The 2 TB RAID array attached via 320 SCSI comes back clean from fsck with zero errors, and so does each of the local file system slices. The open-file limit is not reached: the limit is ~202000, and lsof says only 8800 files are open during the load spool-up. 50 irritated people idle.
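On Linux the load average counts processes in uninterruptible sleep (state D) as well as runnable ones, which is one way load can climb while the CPU sits idle. The following is a minimal sketch, not something run on the server above, of how one could count smbd processes by state from /proc. The function names parse_stat and count_states are made up for illustration.

```python
import os

def parse_stat(text):
    """Return (command, state) from the contents of a /proc/<pid>/stat file."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    left, right = text.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state

def count_states(name="smbd", proc="/proc"):
    """Count processes called `name`, grouped by state letter (R, S, D, ...)."""
    counts = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or entry is unreadable
        if comm == name:
            counts[state] = counts.get(state, 0) + 1
    return counts
```

Calling count_states("smbd") during the load spool-up would show at a glance whether the smbd instances are piling up in state D rather than actually running.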
Grasping at straws, we verify that all 50 Windows XP clients have the latest virus signatures and do a deep scan of every machine. Two viruses are discovered, but neither seems responsible.
A clue comes in from a user: "Every time I try to open a certain file, my system freezes." Oh really...
I go to the subdirectory where the suspect file is located, via the Linux console, and ls the directory. 9 files. ls -al gets Killed. After running ls -al filename for each of the 9 files, I determine that 5 of them are badly corrupt. I perform an experiment: tell everyone to leave these files alone, reboot the server, and it runs happily for an hour. Load average is 0.05. I ask one user to attempt to open one of the corrupt files, and instantly all 50 smbd daemons go into uninterruptible sleep, every WinXP client re-establishes its smbd session with the server, and these (all 50) new smbd sessions also die and go to heaven. This cycle repeats rapidly, sending the load sky high with no CPU utilization to speak of.
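The per-file ls -al check can be sketched as a small script that stats each entry in a throwaway child process, so that a stuck or killed stat does not take the console with it. This is only an illustration under assumptions: probe and probe_directory are invented names, the 5 second timeout is arbitrary, and a child that ends up in uninterruptible sleep may well linger until the next reboot.

```python
import os
import subprocess
import sys
import time

def probe(path, timeout=5.0):
    """lstat `path` in a child process; return 'ok', 'error' or 'hung'."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.lstat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = child.poll()
        if code is not None:
            # Non-zero covers both a Python exception and death by signal.
            return "ok" if code == 0 else "error"
        time.sleep(0.05)
    # Still running: send SIGKILL but do not wait for the child. A process in
    # uninterruptible sleep does not exit until its I/O returns, and waiting
    # here would hang this script as well.
    child.kill()
    return "hung"

def probe_directory(directory, timeout=5.0):
    """Probe every entry of `directory`; return {name: result}."""
    # os.listdir only reads the names; it does not stat each entry.
    return {name: probe(os.path.join(directory, name), timeout)
            for name in sorted(os.listdir(directory))}
```

Anything reported as "error" or "hung" is a candidate poison file. Entries reported as "ok" only passed an lstat, which does not prove their contents are readable.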
The short-term fix is to move the offending directory to another place on the volume that is out of scope of any share. I am not sure how to delete these files, as Linux tools seem unable to handle them.
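A rough sketch of that quarantine step follows, assuming the move is a plain rename within one filesystem. A rename only rewrites the directory entry in the parent, so the files inside are never opened or stat'ed. The function name quarantine and any paths passed to it are hypothetical.

```python
import os

def quarantine(src_dir, dest_parent):
    """Rename src_dir into dest_parent without touching the files inside it."""
    src_dir = os.path.abspath(src_dir)
    dest = os.path.join(dest_parent, os.path.basename(src_dir))
    # os.rename cannot cross filesystems, and a cross-device move would need a
    # copy that reads every file, so refuse that case up front.
    if os.stat(src_dir).st_dev != os.stat(dest_parent).st_dev:
        raise OSError("source and destination are on different filesystems")
    if os.path.lexists(dest):
        raise FileExistsError(dest)
    os.rename(src_dir, dest)
    return dest
```

The destination parent has to exist already and lie outside every path exported in smb.conf, otherwise clients can still reach the files.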
Questions that remain:
1. Why do all client smbd daemons have to die if only one of them ran into trouble?
2. How do files get into a state where they can't be viewed or managed? Virus, lack of sunspots?
3. Why did fsck say that the filesystem was fine, when obviously it isn't?
4. How to delete these poison files?