[Samba] winbind causes Linux to lockup when connectivity to AD is lost (subject line edited for clarity)

Clayton Hill admin at ateamonsite.com
Mon Oct 19 13:07:06 MDT 2009


Matthew J. Salerno wrote:


 > Please understand that I am not a samba dev, I am just an average 
user who is willing to help others out when I can because I know how 
much it sucks to be stuck.  I do not have the time to mirror your 
environment.  Regarding the settings I recommended in my last post, I'm 
not sure what the best settings would be for them, but since they all 
deal with caching info from AD I figured that they might be useful.  
Honestly, I would set them all to cache for a very long time, simulate 
an outage, adjust and repeat.
 > 
 > Have you checked on any suse forums?  If it is a suse issue, chances 
are that you are not the only person having this problem.  I'll try the 
outage out in my Redhat env.
 >

I appreciate your help, dev or not, even though my answers are somewhat 
glib (hopefully amusing!). I honestly wish I could have posted this to the 
samba technical list instead... but I like the chain of command here.

Also, I didn't find anything useful on the suse forums, and besides, I 
don't think this is a suse issue.
Plus I hope to avoid standard overgeneralized tech support/newbie Linux 
user questions, or inflated forum moderator egos, by posting here 
instead. I guarantee they would ask me the opposite question: "hey did 
you check the samba forums?" ;-)

Those options you mentioned:


      idmap cache time (G)

    This parameter specifies the number of seconds that Winbind's idmap
    interface will cache positive SID/uid/gid query results.

    Default: idmap cache time = 604800 (one week)

This default setting looks fine to me... one week is a lot longer than 1 
hour, so I don't believe this causes the issue, nor does it help alleviate 
the symptoms. Maybe I am wrong.



      idmap negative cache time (G)

    This parameter specifies the number of seconds that Winbind's idmap
    interface will cache negative SID/uid/gid query results.

    Default: idmap negative cache time = 120


120 what? hmmm seconds? minutes? LOL
I am assuming "negative" here does not mean a negative number but a 
"bad" (failed) lookup. Since I do not query bad SIDs in this test I don't 
think this is the cause either. Maybe I am wrong.


      winbind cache time (G)


    This parameter specifies the number of seconds the winbindd(8)
    <http://samba.org/samba/docs/man/manpages-3/winbindd.8.html> daemon
    will cache user and group information before querying a Windows NT
    server again.

    This does not apply to authentication requests, these are always
    evaluated in real time unless the winbind offline logon
    <http://samba.org/samba/docs/man/manpages-3/smb.conf.5.html#WINBINDOFFLINELOGON>
    option has been enabled.

    Default: winbind cache time = 300

300 what? -- years? fortnights? furlongs? farthings? bushels? bottles of 
beer on the wall?
This setting may be useful... but the problem with messing with this is 
that once the limit is reached, the system is still unusable.
Even after adjusting it, I do not see the system go back to a usable 
state in a reasonable amount of time once the AD is back up.
Perhaps my goal is to find out if this is a design misstep, and if so 
have the devs fix that issue and make samba more resilient: able to tell 
if the AD is up or down at a moment's notice, and not fubar the samba 
server during an AD server outage. You know, like you would see if you 
used a windows workstation....


      winbind offline logon (G)

This isn't really what I am doing here. I am not using this samba box as 
a workstation. I am using it as a NAS joined to an AD domain. The only 
queries it makes are to validate passwords for logging into CIFS shares 
from windows workstations, and to set/read ACLs in the filesystem.
Neither of those is what causes the system to become unresponsive. All 
you need to do is take the AD offline for a minute or two.
--  Option Disqualified! ;-)



      winbind reconnect delay (G)

    This parameter specifies the number of seconds the winbindd(8)
    <http://samba.org/samba/docs/man/manpages-3/winbindd.8.html> daemon
    will wait between attempts to contact a Domain controller for a
    domain that is determined to be down or not contactable.

    Default: winbind reconnect delay = 30

Hmm 30 bottles of beer? I am guessing seconds. If this is true, then I 
should not have this issue once the AD is back up. I have seen this 
problem continue long after the AD is back up and running, so this causes 
concern. If this were working right, it looks like setting it to 5 
instead of 30 would cure my problem and let winbind know almost 
immediately whether the AD was up or down -- but hey, it could be 30 
minutes, hours, days etc - I don't know!
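
For reference, the four timing parameters discussed above could be collected in one place roughly like this. It is only a sketch: the first three values are the defaults quoted from the man page, and the 5-second reconnect delay is the untested value floated above, not a recommendation. The quoted man page text gives the unit for each of them as seconds.

```ini
[global]
    # All values are in seconds, per the smb.conf(5) text quoted above.
    idmap cache time = 604800
    idmap negative cache time = 120
    winbind cache time = 300
    # Assumption: lowered from the default of 30 so the DC is retried sooner.
    winbind reconnect delay = 5
```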


Hope this helps!

Thanks,
-Clayton




>
>
> ------------------------------------------------------------------------
> *From:* Clayton Hill <admin at ateamonsite.com>
> *To:* Matthew J. Salerno <Vagabond_king at yahoo.com>
> *Cc:* samba at lists.samba.org
> *Sent:* Mon, October 19, 2009 1:20:00 PM
> *Subject:* Re: [Samba] winbind causes Linux to lockup when 
> connectivity to AD is lost (subject line edited for clarity)
>
> Hi Matthew,
>
> /> I don't have the time to setup an environment to match yours, but I did take the time to go back to your initial post and read through your smb.conf./
>
> Understandable, but that is not going to be of much help if you don't have a way to reproduce this issue.. and I'll be answering too many basic questions. ;-)
>
>
> /> 1. http://samba.org/samba/docs/man/manpages-3/winbindd.8.html - Did you check your winbind config to make sure you are not running it with a "-n" ?
> />
>
> Yes. I am using the default init script to start and stop winbind. Remember I am using suse 11.0 x86_64  
> BUT I have tested this without -n which is a totally useless way to run winbind and ironically should be far worse usability-wise than this scenario - but isn't.
>
>
>
>
> > 2. http://samba.org/samba/docs/man/manpages-3/smb.conf.5.html - Have you tried playing with the "winbind cache time", "winbind offline logon", "winbind reconnect delay" and "idmap cache time" settings?
> >
>
> I will reread those options in the man page, but.... what do you recommend here? Feels like a shot in the dark, and a lengthy way to randomly test this,
> i.e. this test renders a samba machine useless every time it is run... so these are very long, slow shots in the dark.
> _Need some experienced expert advice here on which options are best to modify and why._
>
>
>
>
> /> 3. Have you tried increasing the log level and enabling winbind debug and creating an artificial outage and then review the logs?/
>
> Yes - I will give you a snippet of log level 2 though during a "fake AD outage" in a bit. I doubt it will be useful but I'll try it.
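
A sketch of what raising only the winbind logging for such a test could look like in smb.conf, assuming the per-class form of "log level" described in smb.conf(5). The levels chosen here are illustrative and are not taken from the configuration discussed in this thread.

```ini
[global]
    # Keep general logging at 2 but raise the winbind debug class to 10.
    log level = 2 winbind:10
```

On a running system, "smbcontrol winbindd debug 10" should raise the level without a restart, though that has not been tried in this thread.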
>
>
>  
> /> Again, what kind of troubleshooting have you done and what are the results?/
>
> Please- try and reproduce this issue. It will become quite obvious to you after that. 
>   
>
>
> Thanks,
> -Clayton
>
>
>
> Matthew J. Salerno wrote:
>> ----- Original Message ----
>> From: Clayton Hill <admin at ateamonsite.com>
>> To: Matthew J. Salerno <Vagabond_king at yahoo.com>
>> Cc: samba at lists.samba.org; Jeremy Allison <jra at samba.org>
>> Sent: Sun, October 18, 2009 7:49:01 PM
>> Subject: Re: [Samba] winbind causes Linux to lockup when connectivity to AD is lost (subject line edited for clarity)
>>
>> Thanks for confirming my config is good. I already know about the old 
>> problem with SSH and reverse DNS lookups. That actually takes about 5 
>> minutes or less to log in, with this issue be prepared to wait almost an 
>> hour if it even works. Similar but not the same issue.
>> Please, to get an understanding of this problem do the following steps 
>> to reproduce this problem.
>>
>> SUSE 11.0
>> Samba 3.2
>> Join windows 2003 AD domain (with 40,000 objects) using      net ads join
>> Take domain controller offline.
>>
>> Try to log in LOCALLY as ROOT to your console on your domain member 
>> linux box. Do not even bother to log in as any samba user or do ANYTHING 
>> samba related.
>> Watch as it takes more time than bearable (I am talking MORE THAN 20 
>> minutes!) to log in to the LOCAL TERMINAL
>> attempt to do the same with ssh
>> if you are already logged in before you do this test as root LOCALLY TTY 
>> then try and run simple commands such as:  top,ls,ps,man etc etc
>>
>> After seeing the problem clearly simply do this to become unstuck:
>> killall winbindd
>> or
>> service winbind stop
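
The reproduction steps above could be timed with a small wrapper like the one below, so that each test command is capped and cannot tie up the terminal for an hour. This is a minimal sketch, not part of Samba: probe and PROBE_LIMIT are hypothetical names made up here, and it assumes the coreutils "timeout" utility is installed.

```shell
#!/bin/sh
# probe CMD [ARGS...]: run CMD under a time cap and report how long it took.
# PROBE_LIMIT (seconds, default 10) is a hypothetical knob for this sketch.
probe() {
    limit="${PROBE_LIMIT:-10}"
    start=$(date +%s)
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    case "$rc" in
        0)   status="ok" ;;
        124) status="timeout" ;;        # GNU timeout exits 124 when the cap is hit
        *)   status="failed(rc=$rc)" ;;
    esac
    echo "$status $((end - start))s $*"
}

# With the domain controller offline one might then run, for example:
#   probe getent passwd root
#   probe ls -l /home
#   probe wbinfo -p
```

Comparing the output before the outage, during it, and after the DC is back would show which lookups stall and for how long.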
>>
>>
>> have a lot of fun.
>>
>> Cheers,
>> -Clayton
>>
>>
>>
>>
>>
>>
>> Matthew J. Salerno wrote:
>>   
>>> Your /etc/nsswitch.conf looks correct to me.  For services like ssh, you should just disable PTR lookups (VerifyReverseMapping no).  Regarding winbind, do you have any services or processes running on the box as a domain user?  Perhaps there is a timeout setting for krb and winbind.  I don't recall seeing one for winbind, but I would imagine that there is one for kerberos.  Have you bumped up the debugging and purposefully caused an AD failure (ifdown or bad route)?  Have you had the console open and watched top to see if it's a process consuming too much CPU?  What kind of troubleshooting have you done, and what are the results?
>>>
>>>
>>>
>>> ----- Original Message ----
>>> From: "admin at ateamonsite.com" <admin at ateamonsite.com>
>>> To: admin at ateamonsite.com
>>> Cc: samba at lists.samba.org; Jeremy Allison <jra at samba.org>
>>> Sent: Fri, October 16, 2009 3:59:45 PM
>>> Subject: Re: [Samba] winbind causes Linux to lockup when connectivity to AD is lost (subject line edited for clarity)
>>>
>>>
>>> Ok I am not hearing replies back - I don't want this issue to be swept under
>>> the rug. 
>>>
>>>
>>> It has been an issue for me since SuSE 10.1 + samba-3.0.30-0.1.112 even..
>>> I now suspect that the commands I mentioned (ls, man, etc.) all look up
>>> user/password info, perhaps to see if you have permission to run them? IDK,
>>> I am guessing.
>>>
>>> BUT - if winbind is really caching and the connection is lost, then this
>>> should be a non-issue as you say.
>>>
>>> Well here is my nsswitch.conf:
>>>
>>>
>>> cat /etc/nsswitch.conf
>>>
>>>
>>> passwd: compat winbind
>>> group:  compat winbind
>>>
>>> networks:      files dns
>>>
>>> services:      files
>>> protocols:      files
>>> rpc:    files
>>> ethers: files
>>> netmasks:      files
>>> netgroup:      files
>>> publickey:      files
>>>
>>> bootparams:    files
>>> automount:      files
>>> aliases:        files
>>>
>>> hosts:  files dns
>>> shadow: compat
>>>
>>>
>>> Isn't this set up right? ;-)
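
One way to see which lookups can end up at winbindd with a file like the one above is to list the databases whose source list names winbind. The sketch below does only that; winbind_dbs is a hypothetical helper name. That these lookups are what stalls local logins and commands like ls during an outage is an assumption here, not something the thread confirms.

```shell
#!/bin/sh
# winbind_dbs [FILE]: print each nsswitch database that lists winbind as a source.
# FILE defaults to /etc/nsswitch.conf.
winbind_dbs() {
    awk '
        {
            sub(/#.*/, "")              # strip comments
            if ($1 !~ /:$/) next        # skip blank or malformed lines
            for (i = 2; i <= NF; i++)
                if ($i == "winbind") {
                    db = $1
                    sub(/:$/, "", db)   # "passwd:" -> "passwd"
                    print db
                    break
                }
        }' "${1:-/etc/nsswitch.conf}"
}

# Usage: winbind_dbs /etc/nsswitch.conf
```

For the file shown above this would list passwd and group, which are the two databases consulted whenever a UID, GID, or group membership has to be resolved.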
>>>
>>>
>>> So, famously when DNS is down, crap like SSH and NFS take unreasonable
>>> amounts of time and cause system hangs in linux. This is what I've been
>>> told, and I can accept that.
>>> Since DNS is hosted on the AD server, when that server goes down, SSH and
>>> even local login hang for extremely long amounts of time - I'm talking more
>>> than 10 minutes... then fail.
>>>
>>> In Windows (I'm sorry, I'm about to compare 2 operating systems) this is a non
>>> issue and you can use the machine even if the networking is hosed or you
>>> can't talk to the AD.
>>>
>>> So.......
>>>
>>> BUMP! :-)
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 14 Oct 2009 16:51:10 -0600, <admin at ateamonsite.com> wrote:
>>>   
>>>     
>>>> Hopefully that isn't a bad thing! haha 
>>>> Thanks! 
>>>>
>>>>
>>>> On Wed, 14 Oct 2009 15:44:54 -0700, Jeremy Allison <jra at samba.org> wrote:
>>>>     
>>>>       
>>>>> On Wed, Oct 14, 2009 at 04:02:41PM -0600, admin at ateamonsite.com wrote:
>>>>>       
>>>>>         
>>>>>> Hi Jeremy,
>>>>>>
>>>>>>
>>>>>>         
>>>>>>           
>>>>>>> Sorry, didn't look too closely at your winbindd issue.
>>>>>>> winbindd will cache all information to allow disconnected
>>>>>>> operation (we made this work perfectly at SuSE), so there
>>>>>>> certainly shouldn't be a problem with a loss of connection to a DC.
>>>>>>>           
>>>>>>>             
>>>>>> I am sorry to report that I am in fact using SuSE, and this problem is
>>>>>> very
>>>>>> easy to reproduce if I power off my AD domain, then wait (I guess) 10
>>>>>> minutes - then try and ssh to my Linux box. There is no way to log into
>>>>>> the
>>>>>> box. 
>>>>>>         
>>>>>>           
>>>>> Ok, then I'm going to hand you over to the SuSE Samba Team
>>>>> maintainers on this list (sorry :-).
>>>>>
>>>>> Jeremy.
>>>>>       
>>>>>         
>> I don't have the time to setup an environment to match yours, but I did take the time to go back to your initial post and read through your smb.conf.
>>
>> 1. http://samba.org/samba/docs/man/manpages-3/winbindd.8.html - Did you check your winbind config to make sure you are not running it with a "-n" ?
>> 2. http://samba.org/samba/docs/man/manpages-3/smb.conf.5.html - Have you tried playing with the "winbind cache time", "winbind offline logon", "winbind reconnect delay" and "idmap cache time" settings?
>> 3. Have you tried increasing the log level and enabling winbind debug and creating an artificial outage and then review the logs?
>>  
>> Again, what kind of troubleshooting have you done and what are the results?
>>
>>
>>       
>>   
>
> Please understand that I am not a samba dev, I am just an average user 
> who is willing to help others out when I can because I know how much 
> it sucks to be stuck.  I do not have the time to mirror your 
> environment.  Regarding the settings I recommended in my last post, 
> I'm not sure what the best settings would be for them, but since they 
> all deal with caching info from AD I figured that they might be 
> useful.  Honestly, I would set them all to cache for a very long 
> time, simulate an outage, adjust and repeat.
>  
> Have you checked on any suse forums?  If it is a suse issue, chances 
> are that you are not the only person having this problem.  I'll try 
> the outage out in my Redhat env.
>


