[Samba] Homes shares randomly disappear on AD DCs

Achim Gottinger achim at ag-web.biz
Thu Jul 24 02:48:49 MDT 2014


On 23.07.2014 17:42, Achim Gottinger wrote:
> On 23.07.2014 11:24, Achim Gottinger wrote:
>> On 23.07.2014 11:22, Achim Gottinger wrote:
>>> On 23.07.2014 10:46, Achim Gottinger wrote:
>>>> On 15.07.2014 09:18, Achim Gottinger wrote:
>>>>> On 10.07.2014 12:13, Achim Gottinger wrote:
>>>>>> On 09.07.2014 12:58, Achim Gottinger wrote:
>>>>>>> On 09.07.2014 11:29, Achim Gottinger wrote:
>>>>>>>> On 09.07.2014 11:08, Jonathan Buzzard wrote:
>>>>>>>>> On Wed, 2014-07-09 at 10:42 +0200, Achim Gottinger wrote:
>>>>>>>>>
>>>>>>>>> [SNIP]
>>>>>>>>>
>>>>>>>>>>   I use unscd for caching, restarted it but it did not help.
>>>>>>>>> I take it that you missed the big warnings not to use nscd in
>>>>>>>>> combination with winbind? You are aware that winbind does its
>>>>>>>>> own caching?
>>>>>>>>>
>>>>>>>>> I would suggest your first port of call is to disable unscd
>>>>>>>>> and see if the problem goes away.
>>>>>>>>>
>>>>>>>>> JAB.
>>>>>>>>>
>>>>>>>> Thank you for the tip, I disabled it at all four locations. I
>>>>>>>> also used unscd on the main site, which always ran rock solid.
>>>>>>>>
>>>>>>>> Restarting samba on the branches with winbind/nss issues fixed
>>>>>>>> the wbinfo/getent passwd tests for a few minutes, but now they
>>>>>>>> do not resolve again. I have to watch it with unscd disabled now.
>>>>>>>> I am thinking about downgrading to 4.1.4, which also had the
>>>>>>>> issues, but there they appeared only once a week and not every
>>>>>>>> few hours.
>>>>>>>>
>>>>>>>> achim~
>>>>>>>>
>>>>>>> I had to restart samba a few more times in the meantime. I was
>>>>>>> able to make it fail by running wbinfo -u a few times. Since the
>>>>>>> servers in the branches are all VMs with 1GB, I increased the
>>>>>>> memory to 3GB, and since then I have not been able to make samba
>>>>>>> fail with wbinfo -u. I hope that did the trick.
>>>>>>>
>>>>>> So far no more [homes] dropouts with 3GB memory assigned. Also
>>>>>> wbinfo -u and getent passwd work flawlessly. I am skimming through
>>>>>> the saved log files from yesterday trying to find anything memory
>>>>>> related, but I cannot find anything. There are also no signs such
>>>>>> as OOM kills in syslog in that timeframe.
>>>>>> The VMs had 4GB swap space assigned, which showed usage in the
>>>>>> range of a few MB.
>>>>>> I would have expected slowdowns due to swapping, but not silent
>>>>>> dropping of shares, if a server runs out of memory.
>>>>>>
>>>>>> achim
>>>>> After it worked on Friday, Saturday and Monday, this morning the
>>>>> shares disappeared at our main site for the first time. This VM
>>>>> has 6GB memory and 4 CPU cores assigned, and it is the first time
>>>>> the [homes] share stopped working there. Even after restarting
>>>>> samba, wbinfo -u and wbinfo -g sometimes take up to 30 seconds to
>>>>> enumerate users/groups.
>>>>>
>>>>> achim~
>>>>>
>>>> So far the issue reappeared on our main site last Friday at around
>>>> 9am and again multiple times today since 9:15am. It has not appeared
>>>> on the branches since I increased memory to 3GB.
>>>> People start calling that their home directories are no longer
>>>> accessible. Not all accounts seem to be affected, and others can
>>>> continue to work for a while.
>>>>
>>>> wbinfo -u reports "Error looking up domain users".
>>>>
>>>> Reloading the samba services does not help; I have to restart them.
>>>> It is difficult to track down the issue because the server is in
>>>> production and must get back into a working state asap.
>>>>
>>>> I also noticed that wbinfo -u sometimes takes a long time to report
>>>> results. This is a snippet of an strace, showing a few timeouts
>>>> trying to access /var/run/samba/winbindd/pipe.
>>>>
>>>> Any suggestions on how I can track the issues down are welcome.
>>>>
>>>> Thanks in advance,
>>>> achim~
>>>>
>>>> connect(3, {sa_family=AF_FILE, 
>>>> path="/var/run/samba/winbindd/pipe"}, 110) = 0
>>>>
>>>> poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, -1) = 1 ([{fd=3, 
>>>> revents=POLLOUT}])
>>>>
>>>> write(3, 
>>>> "0\10\0\0\0\0\0\0\0\0\0\0\306|\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>> 2096) = 2096
>>>>
>>>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=3, 
>>>> revents=POLLIN}])
>>>>
>>>> read(3, 
>>>> "\250\r\0\0\2\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>> 3496) = 3496
>>>>
>>>> poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, -1) = 1 ([{fd=3, 
>>>> revents=POLLOUT}])
>>>>
>>>> write(3, 
>>>> "0\10\0\0/\0\0\0\0\0\0\0\306|\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>> 2096) = 2096
>>>>
>>>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=3, 
>>>> revents=POLLIN}])
>>>>
>>>> read(3, 
>>>> "\313\r\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>> 3496) = 3496
>>>>
>>>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=3, 
>>>> revents=POLLIN}])
>>>>
>>>> read(3, "/var/lib/samba/winbindd_privileg"..., 35) = 35
>>>>
>>>> lstat("/var/lib/samba/winbindd_privileged", {st_mode=S_IFDIR|0750, 
>>>> st_size=4096, ...}) = 0
>>>>
>>>> lstat("/var/lib/samba/winbindd_privileged/pipe", 
>>>> {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
>>>>
>>>> socket(PF_FILE, SOCK_STREAM, 0)         = 4
>>>>
>>>> fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
>>>>
>>>> fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>
>>>> fcntl(4, F_GETFD)                       = 0
>>>>
>>>> fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
>>>>
>>>> connect(4, {sa_family=AF_FILE, 
>>>> path="/var/lib/samba/winbindd_privileged/pipe"}, 110) = 0
>>>>
>>>> close(3)                                = 0
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLOUT|POLLHUP}], 1, -1) = 1 ([{fd=4, 
>>>> revents=POLLOUT}])
>>>>
>>>> write(4, 
>>>> "0\10\0\0\22\0\0\0\0\0\0\0\306|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>> 2096) = 2096
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=4, 
>>>> revents=POLLIN}])
>>>>
>>>> read(4, 
>>>> "\24\20\0\0\2\0\0\0\236\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>> 3496) = 3496
>>>>
>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=4, 
>>>> revents=POLLIN}])
>>>>
>>>>
>>> May I ask list members to do a quick test and post the results of
>>> time "wbinfo -u"
>> It's
>> time wbinfo -u
>> without the quotes
>>>
>>> On this network it often takes up to 30 seconds, a short time
>>> (<5 secs) later 1-2 seconds, but soon afterwards it is 30 seconds
>>> again. This domain has around 200 user accounts and around 50 clients.
>>>
>>> On another network with 50 users and 30 clients there is no delay
>>> when calling "wbinfo -u".
>>>
>>> achim~
>>>
>>
> It is really odd: now in the late afternoon "time wbinfo -u" usually
> takes 1.1-1.2s, without a long delay for the first run. In roughly
> 1 out of 10 runs there are still spikes of up to 10-20s.
> The number of users is in the same range as it was in the morning.
>
> Inspecting the level 3 log of the morning for the user who called me,
> the first appearance of DOMAIN\username is
>
> [2014/07/23 09:12:09.658813,  3] 
> ../source3/smbd/password.c:138(register_homes_share)
>   No home directory defined for user 'DOMAIN\username'
>
> Searching the log for that error message, it first appeared here.
>
> [2014/07/23 09:08:48.927639,  3] 
> ../source3/smbd/password.c:138(register_homes_share)
>   No home directory defined for user 'DOMAIN\ACRIBA$'
> [2014/07/23 09:08:48.929571,  3] 
> ../source3/smbd/password.c:138(register_homes_share)
>   No home directory defined for user 'DOMAIN\WIN7-Z-EMPFANG2$'
> [2014/07/23 09:08:48.930933,  3] 
> ../source3/smbd/password.c:138(register_homes_share)
>   No home directory defined for user 'DOMAIN\TERMINALSERVER$'
> [2014/07/23 09:08:48.931084,  3] 
> ../source3/smbd/process.c:1802(process_smb)
>   Transaction 2 of length 88 (0 toread)
> [2014/07/23 09:08:48.931374,  3] 
> ../source3/smbd/process.c:1405(switch_message)
>   switch message SMBtconX (pid 22456) conn 0x0
> [2014/07/23 09:08:48.932020,  2] 
> ../source3/smbd/process.c:2672(deadtime_fn)
>   Closing idle connection
> [2014/07/23 09:08:48.932510,  3] 
> ../source3/lib/access.c:338(allow_access)
>   Allowed connection from 192.168.1.104 (192.168.1.104)
> [2014/07/23 09:08:48.932706,  3] 
> ../source3/smbd/server.c:159(msg_exit_server)
>   got a SHUTDOWN message
>
> But the "No home directory defined for user 'DOMAIN\username'" error
> message also appeared in the afternoon while everything was working.
>
> Running "time wbinfo -u" on the other network, with half the user base
> and a much faster hard disk backend, takes ~0.075-0.1s.
>
> .....
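
For reference, the first occurrence of the "No home directory defined"
message per account can be pulled out of a level 3 log with a small
script. This is only a sketch in Python, assuming the smbd log layout
seen in the excerpt above: a bracketed timestamp header line followed by
the message line (in the log file itself the header is a single line, the
wrapping above comes from the mail client). The function name
first_no_home_per_user is made up for illustration.

```python
import re

# Header line of an smbd log entry, e.g.
# [2014/07/23 09:08:48.927639,  3] ../source3/smbd/password.c:138(register_homes_share)
HEADER_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+),\s+\d+\]")
# Message emitted by register_homes_share at log level 3.
NO_HOME_RE = re.compile(r"No home directory defined for user '([^']+)'")


def first_no_home_per_user(log_lines):
    """Map each user to the timestamp of its first 'No home directory' message."""
    first_seen = {}
    current_ts = None
    for line in log_lines:
        header = HEADER_RE.match(line)
        if header:
            # Remember the timestamp; the message follows on the next line.
            current_ts = header.group(1)
        hit = NO_HOME_RE.search(line)
        if hit and hit.group(1) not in first_seen:
            first_seen[hit.group(1)] = current_ts
    return first_seen
```

Feeding it the lines of the smbd log file (the path depends on the
installation) gives one timestamp per account, which can then be compared
with the times the users called in.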
I tried to figure out why winbind behaves so slowly on that setup. I powered
down all other VMs running on the server and all VMs in the branches.
The time for a "wbinfo -u" query dropped from 1.2s to 0.9s. That is still
ten times slower than on the other network, which has 50 (vs. 150) user
entries to return.
I can rule out disk backend constraints because both networks have branch
servers with approximately identical hardware (AMD quad-core 2.4-2.6GHz,
8GB RAM, Adaptec RAID controller of the same type running a RAID1).
On the "slow" setup the wbinfo -u time at the branch servers is a bit
faster than on the main site, ~0.8s. On the "faster" setup it is in the 0.1s
range, like on the main server there.
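
Since single runs vary this much, the comparison is easier with repeated
measurements. A minimal sketch in Python, assuming wbinfo is in the PATH;
the helper names time_command and summarize and the 5 second threshold
for a "slow" run are arbitrary choices for illustration, not part of any
samba tooling.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=10, pause=0.0):
    """Run cmd `runs` times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.monotonic() - start)
        time.sleep(pause)
    return durations


def summarize(durations, slow_threshold=5.0):
    """Return min/median/max and the number of runs slower than slow_threshold."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "slow_runs": sum(1 for d in durations if d > slow_threshold),
    }
```

Calling summarize(time_command(["wbinfo", "-u"], runs=20, pause=1.0)) on
the main server and on a branch server of each setup should give numbers
that can be compared directly, including how often the long delays occur.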

I plan to grab a copy of a VM from the slow setup for further testing at
my office, away from production. I hope the slowness of the
queries is somehow related to the other issue of the disappearing home
shares.
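
On the test VM, the poll timeouts in an strace capture of wbinfo -u can
be added up per file descriptor to see where the time goes. A rough
sketch in Python, assuming the strace output format of the quoted
excerpt, with each poll() call on a single line as strace writes it; the
name poll_timeouts is only illustrative.

```python
import re

# Matches a poll() call on a single fd that returned 0 (timed out) and
# captures the fd number and the timeout in milliseconds, e.g.
# poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
TIMEOUT_RE = re.compile(
    r"poll\(\[\{fd=(\d+),[^\]]*\],\s*\d+,\s*(\d+)\)\s*=\s*0 \(Timeout\)"
)


def poll_timeouts(strace_text):
    """Return {fd: total milliseconds spent in poll() calls that timed out}."""
    waited = {}
    for fd, timeout_ms in TIMEOUT_RE.findall(strace_text):
        waited[int(fd)] = waited.get(int(fd), 0) + int(timeout_ms)
    return waited
```

Applied to the excerpt quoted above, this counts five 5000 ms timeouts on
fd 4, i.e. about 25 seconds spent waiting for an answer on the privileged
pipe, which is in line with the delays of up to 30 seconds.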




