[Samba] Homes shares randomly disappear on AD DCs [Case closed]

Achim Gottinger achim at ag-web.biz
Fri Aug 8 08:53:13 MDT 2014


> On 24.07.2014 10:48, Achim Gottinger wrote:
>> On 23.07.2014 17:42, Achim Gottinger wrote:
>>> On 23.07.2014 11:24, Achim Gottinger wrote:
>>>> On 23.07.2014 11:22, Achim Gottinger wrote:
>>>>> On 23.07.2014 10:46, Achim Gottinger wrote:
>>>>>> On 15.07.2014 09:18, Achim Gottinger wrote:
>>>>>>> On 10.07.2014 12:13, Achim Gottinger wrote:
>>>>>>>> On 09.07.2014 12:58, Achim Gottinger wrote:
>>>>>>>>> On 09.07.2014 11:29, Achim Gottinger wrote:
>>>>>>>>>> On 09.07.2014 11:08, Jonathan Buzzard wrote:
>>>>>>>>>> On Wed, 2014-07-09 at 10:42 +0200, Achim Gottinger wrote:
>>>>>>>>>>
>>>>>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>>>   I use unscd for caching, restarted it but it did not help.
>>>>>>>>>> I take it that you missed the big warnings not to use nscd in
>>>>>>>>>> combination with winbind? You are aware that winbind does its
>>>>>>>>>> own caching?
>>>>>>>>>>
>>>>>>>>>> I would suggest your first port of call is to disable unscd 
>>>>>>>>>> and see if
>>>>>>>>>> the problem goes away.
>>>>>>>>>>
>>>>>>>>>> JAB.
>>>>>>>>>>
>>>>>>>>> Thank you for the tip; I disabled it at all four locations. I
>>>>>>>>> also used unscd on the main site, which always ran rock solid.
>>>>>>>>>
>>>>>>>>> Restarting samba on the branches with winbind/nss issues fixed
>>>>>>>>> the wbinfo/getent passwd tests for a few minutes, but now they
>>>>>>>>> do not resolve again. I'll have to watch it with unscd disabled
>>>>>>>>> now. Thinking about downgrading to 4.1.4, which also had the
>>>>>>>>> issues, but there they appeared only once a week and not every
>>>>>>>>> few hours.
>>>>>>>>>
>>>>>>>>> achim~
>>>>>>>>>
>>>>>>>> Had to restart samba a few more times meanwhile. I was able to
>>>>>>>> make it fail by running wbinfo -u a few times. Since the servers
>>>>>>>> are all VMs with 1 GB of memory in the branches, I increased the
>>>>>>>> memory to 3 GB, and since then I have not been able to make
>>>>>>>> samba fail with wbinfo -u. Hope that did the trick.
>>>>>>>>
>>>>>>> So far no more [homes] dropouts with 3 GB of memory assigned.
>>>>>>> Also, wbinfo -u and getent passwd work flawlessly. I skimmed
>>>>>>> through saved log files from yesterday trying to find anything
>>>>>>> memory related, but I cannot find anything. There are also no
>>>>>>> signs of OOM kills in syslog in that timeframe.
>>>>>>> The VMs had 4 GB of swap space assigned, which had shown usage
>>>>>>> in the few-MB range.
>>>>>>> I would have expected slowdowns due to swapping, but not silent
>>>>>>> dropping of shares, if a server runs out of memory.
>>>>>>>
>>>>>>> achim
>>>>>> After it worked on Friday, Saturday and Monday, this morning the
>>>>>> shares disappeared at our main site for the first time. This VM
>>>>>> has 6 GB of memory and 4 CPU cores assigned, and it is the first
>>>>>> time the [homes] share stopped working there. Even after
>>>>>> restarting samba, wbinfo -u and wbinfo -g sometimes take up to
>>>>>> 30 seconds to enumerate users/groups.
>>>>>>
>>>>>> achim~
>>>>>>
>>>>> So far the issue reappeared on our main site last Friday at around
>>>>> 9am and again multiple times today since 9:15am. It has not
>>>>> appeared on the branches since I increased memory to 3 GB.
>>>>> People start calling because their home directories are no longer
>>>>> accessible. Not all accounts seem to be affected, and others can
>>>>> continue to work for a while.
>>>>>
>>>>> wbinfo -u reports "Error looking up domain users".
>>>>>
>>>>> Reloading the samba services does not help; I have to restart
>>>>> them. It's difficult to track down the issue because the server is
>>>>> in production and must get back into a working state asap.
>>>>>
>>>>> I also noticed that wbinfo -u sometimes takes a long time to
>>>>> report results. Below is a snippet of an strace, showing a few
>>>>> timeouts trying to access /var/run/samba/winbindd/pipe.
>>>>>
>>>>> Any suggestions on how I can track this down are welcome.
>>>>>
>>>>> Thanks in advance,
>>>>> achim~
>>>>>
>>>>> connect(3, {sa_family=AF_FILE, 
>>>>> path="/var/run/samba/winbindd/pipe"}, 110) = 0
>>>>>
>>>>> poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, -1) = 1 ([{fd=3, 
>>>>> revents=POLLOUT}])
>>>>>
>>>>> write(3, 
>>>>> "0\10\0\0\0\0\0\0\0\0\0\0\306|\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 2096) = 2096
>>>>>
>>>>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=3, 
>>>>> revents=POLLIN}])
>>>>>
>>>>> read(3, 
>>>>> "\250\r\0\0\2\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 3496) = 3496
>>>>>
>>>>> poll([{fd=3, events=POLLIN|POLLOUT|POLLHUP}], 1, -1) = 1 ([{fd=3, 
>>>>> revents=POLLOUT}])
>>>>>
>>>>> write(3, 
>>>>> "0\10\0\0/\0\0\0\0\0\0\0\306|\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 2096) = 2096
>>>>>
>>>>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=3, 
>>>>> revents=POLLIN}])
>>>>>
>>>>> read(3, 
>>>>> "\313\r\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 3496) = 3496
>>>>>
>>>>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=3, 
>>>>> revents=POLLIN}])
>>>>>
>>>>> read(3, "/var/lib/samba/winbindd_privileg"..., 35) = 35
>>>>>
>>>>> lstat("/var/lib/samba/winbindd_privileged", {st_mode=S_IFDIR|0750, 
>>>>> st_size=4096, ...}) = 0
>>>>>
>>>>> lstat("/var/lib/samba/winbindd_privileged/pipe", 
>>>>> {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
>>>>>
>>>>> socket(PF_FILE, SOCK_STREAM, 0)         = 4
>>>>>
>>>>> fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
>>>>>
>>>>> fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
>>>>>
>>>>> fcntl(4, F_GETFD)                       = 0
>>>>>
>>>>> fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
>>>>>
>>>>> connect(4, {sa_family=AF_FILE, 
>>>>> path="/var/lib/samba/winbindd_privileged/pipe"}, 110) = 0
>>>>>
>>>>> close(3)                                = 0
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLOUT|POLLHUP}], 1, -1) = 1 ([{fd=4, 
>>>>> revents=POLLOUT}])
>>>>>
>>>>> write(4, 
>>>>> "0\10\0\0\22\0\0\0\0\0\0\0\306|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 2096) = 2096
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 0 (Timeout)
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=4, 
>>>>> revents=POLLIN}])
>>>>>
>>>>> read(4, 
>>>>> "\24\20\0\0\2\0\0\0\236\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 
>>>>> 3496) = 3496
>>>>>
>>>>> poll([{fd=4, events=POLLIN|POLLHUP}], 1, 5000) = 1 ([{fd=4, 
>>>>> revents=POLLIN}])
>>>>>
>>>>>
>>>> May I ask list members to do a quick test and post the results of
>>>> time "wbinfo -u"
>>> It's
>>> time wbinfo -u
>>> without the quotes
>>>>
>>>> On this network it often takes up to 30 seconds; a short time
>>>> (<5 s) later it is 1-2 seconds, but soon afterwards it is 30
>>>> seconds again. This domain has around 200 user accounts and around
>>>> 50 clients.
>>>>
>>>> On another network with 50 users and 30 clients there is no delay
>>>> calling "wbinfo -u"
>>>>
>>>> achim~
>>>>
>>>
>> It is really odd; now in the late afternoon "time wbinfo -u" usually
>> takes 1.1-1.2 s, without a long delay for the first run. On roughly
>> 1 in 10 runs there are still spikes up to 10-20 s.
>> The number of users is in the same range as in the morning.
>>
>> Inspecting the level-3 log from the morning for the user who called
>> me, the first appearance of DOMAIN\username is
>>
>> [2014/07/23 09:12:09.658813,  3] 
>> ../source3/smbd/password.c:138(register_homes_share)
>>   No home directory defined for user 'DOMAIN\username'
>>
>> Searching for that error message, it first appeared here:
>>
>> [2014/07/23 09:08:48.927639,  3] 
>> ../source3/smbd/password.c:138(register_homes_share)
>>   No home directory defined for user 'DOMAIN\ACRIBA$'
>> [2014/07/23 09:08:48.929571,  3] 
>> ../source3/smbd/password.c:138(register_homes_share)
>>   No home directory defined for user 'DOMAIN\WIN7-Z-EMPFANG2$'
>> [2014/07/23 09:08:48.930933,  3] 
>> ../source3/smbd/password.c:138(register_homes_share)
>>   No home directory defined for user 'DOMAIN\TERMINALSERVER$'
>> [2014/07/23 09:08:48.931084,  3] 
>> ../source3/smbd/process.c:1802(process_smb)
>>   Transaction 2 of length 88 (0 toread)
>> [2014/07/23 09:08:48.931374,  3] 
>> ../source3/smbd/process.c:1405(switch_message)
>>   switch message SMBtconX (pid 22456) conn 0x0
>> [2014/07/23 09:08:48.932020,  2] 
>> ../source3/smbd/process.c:2672(deadtime_fn)
>>   Closing idle connection
>> [2014/07/23 09:08:48.932510,  3] 
>> ../source3/lib/access.c:338(allow_access)
>>   Allowed connection from 192.168.1.104 (192.168.1.104)
>> [2014/07/23 09:08:48.932706,  3] 
>> ../source3/smbd/server.c:159(msg_exit_server)
>>   got a SHUTDOWN message
>>
>> But the "No home directory defined for user 'DOMAIN\username'" error
>> message also appeared in the afternoon while everything was working.
>>
>> Running "time wbinfo -u" on the other network, with half the user
>> base and a much faster disk backend, the command takes ~0.075-0.1 s.
>>
>> .....
> Tried to figure out why winbind behaves so slowly on that setup. I
> powered down all other VMs running on the server and all VMs in the
> branches.
> The time for a "wbinfo -u" query dropped from 1.2 s down to 0.9 s.
> That is still ten times slower than on the other network, which has 50
> (vs. 150) user entries to return.
> I can rule out disk backend constraints because both networks have
> branch servers with approximately identical hardware (AMD quad-core
> 2.4-2.6 GHz, 8 GB RAM, Adaptec RAID controllers of the same type
> running a RAID1).
> On the "slow" setup the wbinfo -u time at the branch servers is a bit
> faster than on the main site, ~0.8 s. On the "faster" setup it is in
> the 0.1 s range, like on that network's main server.
>
> I plan to grab a copy of a VM from the slow setup for further testing
> at my office, without it being in production. I hope this slowness of
> the queries is somehow related to the other issue of the disappearing
> home shares.
>
>
>
Since my last post I did the following on all servers and samba VMs.

- Applied the latest hotfixes to the XenServer dom0s (rollup pack 1 for 
6.2SP1 plus a few hotfixes released afterwards).
- Increased the assigned memory for the VM at my main site from 6 to 8 GB.
- Lowered the tombstone lifetime in steps of 10 days, from 180 down to 
30 days. It took around 20 minutes after each ten-day reduction until 
the deleted objects got purged on all DCs.
- Wrote a small SystemTap script which logs negative exit signals of 
terminating processes.
- Now, with 20k instead of 73k deleted objects in the directory, I ran 
"samba-tool dbcheck --cross-ncs", which took ~2 hrs to complete.

Checking 20970 objects
ERROR: wrong
dn[DC=client\0ACNF:ce76b285-0ade-444b-b08c-e8b2f7c9fcf9,CN=Deleted 
Objects,DC=DomainDnsZones,DC=...]
dc='client\nCNF:ce76b285-0ade-444b-b08c-e8b2f7c9fcf9'
name='client\nDEL:ce76b285-0ade-444b-b08c-e8b2f7c9fcf9'
new_dn[DC=client\0ADEL:ce76b285-0ade-444b-b08c-e8b2f7c9fcf9,CN=Deleted 
Objects,DC=DomainDnsZones,DC=...]
Not renaming
DC=client\0ACNF:ce76b285-0ade-444b-b08c-e8b2f7c9fcf9,CN=Deleted 
Objects,DC=DomainDnsZones,DC=... to
DC=client\0ADEL:ce76b285-0ade-444b-b08c-e8b2f7c9fcf9,CN=Deleted 
Objects,DC=DomainDnsZones,DC=...
Please use --fix to fix these errors
Checked 20970 objects (1 errors)

I re-ran it with --fix, but that did not help. So I took a snapshot of 
all AD DC VMs and deleted the entry on all of them using this syntax:

ldbdel -H /var/lib/samba/private/sam.ldb 
"<GUID=ce76b285-0ade-444b-b08c-e8b2f7c9fcf9>"
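For anyone who has to repeat this on several DCs, the deletion plus the follow-up checks can be wrapped in one place. This is only a sketch, not the exact procedure used above: the DC host names, the ssh access, and the DRY_RUN switch are my assumptions; the ldbdel and dbcheck invocations are the ones from this thread, and "samba-tool drs showrepl" is the standard way to check replication health afterwards.

```shell
#!/bin/sh
# Hedged sketch: remove a conflicted object by GUID on every DC, then
# re-verify. DC host names and ssh access are assumptions; DRY_RUN
# defaults to 1 so the script only prints what it would do.
GUID="${GUID:-ce76b285-0ade-444b-b08c-e8b2f7c9fcf9}"   # example GUID from above
DCS="${DCS:-dc1 dc2 dc3}"          # hypothetical DC host names
DRY_RUN="${DRY_RUN:-1}"            # set to 0 to actually execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

for dc in $DCS; do
    # Snapshot the DC's VM first (outside this script), then delete:
    run ssh "$dc" ldbdel -H /var/lib/samba/private/sam.ldb "<GUID=$GUID>"
done
# Confirm the database is clean and replication is healthy afterwards.
run samba-tool dbcheck --cross-ncs
run samba-tool drs showrepl
```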

Afterwards dbcheck passed, and so far I have not seen replication 
errors. Winbind name resolution has also worked since then, and the 
home dirs did not disappear again.
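Since the visible symptom each time was "wbinfo -u" failing or hanging, a small health check can flag a recurrence before users start calling. A sketch, assuming coreutils "timeout" is available; the probe command, time limit, and log path are illustrative choices, not something taken from this setup.

```shell
#!/bin/sh
# Hedged sketch of a winbind health probe: run "wbinfo -u" with a time
# limit and log the result. CHECK_CMD/LIMIT/LOG are overridable
# assumptions, not a tested production configuration.
CHECK_CMD="${CHECK_CMD:-wbinfo -u}"
LIMIT="${LIMIT:-30}"                      # seconds before we call it hung
LOG="${LOG:-/var/log/winbind-watch.log}"  # assumed log location

check_once() {
    if timeout "$LIMIT" $CHECK_CMD >/dev/null 2>&1; then
        echo "$(date '+%F %T') OK"
    else
        echo "$(date '+%F %T') FAIL: '$CHECK_CMD' failed or timed out"
    fi
}
# Run e.g. from cron every few minutes:
# check_once >>"$LOG"
```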

I usually stop samba once a day and do a database backup. Last Friday 
not all processes ended and samba did not restart properly. I upgraded 
all VMs to 4.1.11 and added "killall samba; killall smbd" statements to 
my backup script, just in case this happens again.
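The nightly stop/kill/backup/start cycle could be sketched roughly like this, assuming a sysvinit-style service script and a tar backup; the service name, paths, and grace period are assumptions, and the killall fallback mirrors the statements mentioned above.

```shell
#!/bin/sh
# Hedged sketch of a nightly stop/backup/start cycle with a killall
# fallback for stray processes. Service name, paths, and timings are
# assumptions; override them via the environment for your setup.
STOP="${STOP:-service samba-ad-dc stop}"
START="${START:-service samba-ad-dc start}"
KILL_CMD="${KILL_CMD:-killall samba smbd winbindd}"
GRACE="${GRACE:-10}"                       # seconds to wait for shutdown
BACKUP_SRC="${BACKUP_SRC:-/var/lib/samba}"
BACKUP_DST="${BACKUP_DST:-/srv/backup/samba-$(date +%F).tar.gz}"

backup_cycle() {
    $STOP
    sleep "$GRACE"            # give samba time to shut down cleanly
    $KILL_CMD 2>/dev/null     # fallback: nothing may keep the DBs open
    sleep 1
    tar czf "$BACKUP_DST" "$BACKUP_SRC"
    $START
}
# backup_cycle   # e.g. from a nightly cron job
```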

I did a lot of modifications at once, so it's unclear which one did the 
trick.

