[Samba] Samba PDC problem: Please help me avoid a mutiny! :-)

Sun Nov 10 04:53:00 GMT 2002

I've been beating my head against this one and just can't figure it out. I 
hope someone here may have an answer. The employees using the workstations on 
this network are getting increasingly upset with this problem.

The problem is wildly varying logon and logoff times over the network. This is 
definitely not a matter of long profile transfers. An individual can log onto 
a workstation one time and get on quickly, and another time, have to wait 
five minutes or more. There is no apparent pattern that I can discern. No 
workstations seem to manifest this problem more than others; no users seem to 
have more difficulty with this than others; it makes no difference whether 
the user has logged onto a particular station before, or whether he or she is 
logged onto another station at the same time.

The network consists of one Samba PDC, 2.2.6, recently upgraded from 2.2.3a, 
and about 12 NT 4.0 workstations on two subnets. The problem occurs with 
workstations on both the PDC's local subnet and the other one. Cross-subnet 
browsing is working fine.

In the effort to troubleshoot this, I set up the log file parameter to create 
a separate log for each workstation and user (log file = 
/var/log/samba/log.smbd.%m-%U). It helps untangle the mess, and I can merge 
the log files when I need to. When running tests I jacked up the log level to 
10, and when I upgraded to 2.2.6, I compiled a test version with some extra 
debugging code of my own to help figure it out. Still, I'm baffled.
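
A minimal sketch of how the per-workstation, per-user logs could be merged 
into one timeline. It assumes each log entry starts with a header line of the 
form "[2002/11/10 04:53:00, 3] file.c:function(line)", optionally with 
fractional seconds after the time, followed by indented message lines. The 
header pattern and the names parse_entries and merge_logs are illustrative 
assumptions, not part of Samba.

```python
import re
from datetime import datetime

# Assumed header format: "[YYYY/MM/DD HH:MM:SS, level] ..." with optional
# ".ffffff" fractional seconds when high-resolution timestamps are enabled.
HEADER = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})(?:\.(\d+))?, *\d+\]"
)

def parse_entries(text, source):
    """Split one log's text into (timestamp, source, entry_text) tuples."""
    entries = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
            if match.group(2):
                # Normalize the fraction to microseconds.
                micros = int(match.group(2)[:6].ljust(6, "0"))
                stamp = stamp.replace(microsecond=micros)
            current = [stamp, source, [line]]
            entries.append(current)
        elif current is not None:
            # Continuation line: belongs to the most recent header.
            current[2].append(line)
    return [(ts, src, "\n".join(lines)) for ts, src, lines in entries]

def merge_logs(logs):
    """Merge logs (dict: source label -> log text) into one sorted list."""
    merged = []
    for source, text in logs.items():
        merged.extend(parse_entries(text, source))
    # Stable sort: entries with equal timestamps keep their original order.
    merged.sort(key=lambda entry: entry[0])
    return merged
```

Each merged entry keeps its source label, so it stays clear which 
workstation and user a given line came from.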

The manifestation is, in nearly all cases, that the PDC sends a message to the 
workstation and waits for a response. The response eventually arrives, and as 
far as I can tell, makes sense, but the time that elapses before the reply 
from the workstation can sometimes amount to minutes. The workstation event 
logs have entries pertaining to these gaps (verified by comparing timestamps) 
from the Redirector services usually saying "The redirector has timed out a 
request to SERVICES" (SERVICES is the NetBIOS name of the PDC).
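
As a rough illustration of how such gaps could be located automatically, the 
sketch below scans a list of log timestamps and reports every silence longer 
than a threshold. The function name find_gaps and the 30-second default are 
assumptions chosen for the example, not values taken from the logs.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, threshold=timedelta(seconds=30)):
    """Return (start, end, duration) for each silence longer than threshold."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each timestamp with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later, later - earlier))
    return gaps
```

The start and end of each reported gap can then be compared against the 
timestamps of the Redirector entries in the workstation event log.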

Sometimes, however, there is an entry saying, "A write-behind operation has 
failed to the remote server services. The data contains the amount requested 
to write and the amount actually written." The data dump reads:

00 00 08 00 02 00 52 00

These numbers are consistent in case after case.

Read as little-endian 16-bit values, this would mean zero requested and 8 
written, which makes no sense. Read as little-endian 32-bit values, 524288 
(0x80000) was requested and 5373954 (0x520002) was written. Neither 
interpretation makes any sense to me.
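
For reference, the arithmetic can be checked with a short sketch. It assumes 
the dump is little-endian, as on the x86 workstations; the function name 
decode_dump is made up for the example. Note that the eight bytes hold four 
16-bit words, not two.

```python
import struct

def decode_dump(hex_text):
    """Decode an 8-byte hex dump as little-endian 16-bit and 32-bit values."""
    raw = bytes.fromhex(hex_text)
    words = struct.unpack("<4H", raw)    # four unsigned 16-bit values
    dwords = struct.unpack("<2I", raw)   # two unsigned 32-bit values
    return words, dwords

words, dwords = decode_dump("00 00 08 00 02 00 52 00")
print(words)                      # (0, 8, 2, 82)
print(dwords)                     # (524288, 5373954)
print([hex(d) for d in dwords])   # ['0x80000', '0x520002']
```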

The socket options are SO_KEEPALIVE TCP_NODELAY IPTOS_LOWDELAY. 
I can't imagine a reason why the workstation would try to send something and 
the server wouldn't accept it. In a few early tests, I added tcpdump output 
to the logs (using hires timestamps to correlate them) and it appears that 
the workstations are not even trying to send anything during that gap.

I'm lost at this point. I really hope someone can help. This problem has been 
around for quite some time, and the workers are getting tired of both it and 
my promises to fix it.

Many thanks in advance,

Ray Simard
ray.simard at sylvan-glade.com