Samba4 and memory consumption - needing frequent kills

Kev Latimer klatimer at tolent.co.uk
Wed Aug 8 02:32:26 MDT 2012


Hi all.

Been running Samba4 in production from alpha16ish to recent beta (git 
from last Friday I think - 4.0.0beta6-GIT-02dcf05) and have had this 
problem all the while.  I've addressed the obvious problems in my set up 
(or at least I think I have, as I don't have my DRS errors any more).  I 
was hoping it was going to improve as the code came into beta but if 
anything, it's been getting worse.

I have 6 DCs: 2 in the first site and 1 in each of the other four.  The 
first "primary" DC also runs BIND9 DLZ with two zones.  The other DCs just 
run a standard secondary with zone transfers from the primary for 
read-only redundancy.  All DCs run the same GIT of Samba4-beta6 (as 
above).  They're all ESXi guests with 4-8GB assigned and between 8 and 12GB 
of swap space available.  Since the first samba install, they've had the 
RAM assignment and swap space doubled to try and mitigate this issue.  
They're all Debian Squeeze 2.6.32-5-amd64 with 1 to 2 vCPUs (no 
pattern, obviously based on what I felt like at provision time!).

Every 36 hours or so, I start getting Nagios warnings from my DC's (they 
usually start complaining over a 12 hour period) about the swap space 
getting low.  Grepping ps for the processes shows this:

root      7411  0.0  0.1 483076  4336 ?        Ss   Aug05   0:00 /usr/local/samba/sbin/samba -D
root      7417  8.1 30.7 1670360 1248508 ?     R    Aug05 286:21 /usr/local/samba/sbin/samba -D
root      7418  0.0  0.2 487228 10392 ?        S    Aug05   0:07 /usr/local/samba/sbin/samba -D
root      7419  0.0  0.0 483076  2264 ?        S    Aug05   0:00 /usr/local/samba/sbin/samba -D
root      7420  0.1  0.6 483076 27728 ?        S    Aug05   5:28 /usr/local/samba/sbin/samba -D
root      7421  0.0  1.4 557880 60896 ?        S    Aug05   0:51 /usr/local/samba/sbin/samba -D
root      7422  0.2  2.7 566988 111088 ?       S    Aug05   9:24 /usr/local/samba/sbin/samba -D
root      7423  1.0 23.0 1724652 935944 ?      S    Aug05  35:57 /usr/local/samba/sbin/samba -D
root      7424  0.0  0.1 488072  6724 ?        S    Aug05   0:36 /usr/local/samba/sbin/samba -D
root      7425  0.0  0.0 483076  2340 ?        S    Aug05   0:00 /usr/local/samba/sbin/samba -D
root      7426  0.1  5.2 792936 213728 ?       S    Aug05   4:05 /usr/local/samba/sbin/samba -D
root      7427  0.0  0.2 483076  9832 ?        S    Aug05   0:16 /usr/local/samba/sbin/samba -D
root      7428  0.0  0.0 483076  3088 ?        S    Aug05   0:37 /usr/local/samba/sbin/samba -D
root      7766  0.0  0.5 487264 20376 ?        S    Aug05   1:27 /usr/local/samba/sbin/samba -D
root     19975  0.0  0.2 487228  8672 ?        S    07:47   0:00 /usr/local/samba/sbin/samba -D
root     19976  1.2  0.3 487276 15976 ?        S    07:47   0:00 /usr/local/samba/sbin/samba -D

You can see there are two processes consuming a lot of CPU time and 
RAM.  This is from this morning and it's not a bad one - sometimes I 
have to let it carry on over a weekend and then it's really bad.  I need 
to do a "killall samba" and over about 5-10 minutes the processes all 
exit gracefully.  Sometimes, the smaller of the two big ones (usually 
that one) needs a -9 to get me to a point I can start the sbin/samba 
again, which is usually when users start complaining of login times and 
I need to give them their DC back. This morning, I tried doing a kill of 
the samba PID consuming the most resources before a killall and that 
seems to stop them all much quicker so perhaps there's a clue in there?
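
For picking out the heaviest process without reading the listing by eye, 
a minimal sketch follows.  It assumes `ps aux` style columns as in the 
listing above (PID second, VSZ fifth, RSS sixth, TIME tenth) and matches 
on the path fragment "sbin/samba"; the function names are made up for 
illustration and are not part of Samba.

```python
import subprocess


def parse_ps_aux(text, command_filter="sbin/samba"):
    """Return (pid, rss_kb, vsz_kb, cpu_time) tuples, largest RSS first."""
    rows = []
    for line in text.splitlines():
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skips the header row and any malformed line
        if command_filter not in fields[10]:
            continue  # not one of the processes we care about
        rows.append((int(fields[1]), int(fields[5]), int(fields[4]), fields[9]))
    # Sort by resident set size so the biggest consumer comes first
    return sorted(rows, key=lambda row: row[1], reverse=True)


def current_samba_processes():
    """Run ps on the local machine and rank the samba processes (assumes procps)."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return parse_ps_aux(result.stdout)
```

Calling current_samba_processes()[0] would give the PID with the largest 
resident set, which is the one that seemed worth killing first this 
morning.  Whether that ordering really helps shutdown is only an 
observation from one run, not something the sketch verifies.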

As an aside, if I let it consume all the swap space, which I have once, 
the server becomes unresponsive and the log fills with segfaults.  
It's been a while since I've let it do that though.

In the situation with the large processes, if I check DRS replication, 
everything looks fine.  No timeouts and all partitions report successful 
replication.  log.samba looks like this:

[2012/08/08 02:51:36,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 02:51:36,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 03:01:36,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 03:01:36,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 03:21:36,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 03:21:37,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 03:31:35,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:01:35,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:01:37,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:11:35,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:21:37,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:41:37,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:41:37,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:47:05,  0] 
../source4/ntvfs/posix/pvfs_oplock.c:143(pvfs_oplock_break)
   pvfs_oplock_break: do not resend oplock break level 1 for 
'\CIDs\S000\SAVSCFXP\master.upd' 0x1210350
[2012/08/08 04:51:38,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 04:51:38,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 05:01:39,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 05:01:43,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 05:11:37,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 05:11:39,  0] 
../source4/dsdb/dns/dns_update.c:323(dnsupdate_spnupdate_done)
   ../source4/dsdb/dns/dns_update.c:323: Failed SPN update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 05:21:49,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 05:51:50,  0] 
../source4/dsdb/dns/dns_update.c:294(dnsupdate_nameupdate_done)
   ../source4/dsdb/dns/dns_update.c:294: Failed DNS update - 
NT_STATUS_IO_TIMEOUT
[2012/08/08 07:09:02,  0] 
../auth/ntlmssp/ntlmssp_sign.c:236(ntlmssp_check_packet)
   NTLMSSP NTLM2 packet check failed due to invalid signature!
[2012/08/08 07:16:54,  0] 
../auth/ntlmssp/ntlmssp_sign.c:236(ntlmssp_check_packet)
   NTLMSSP NTLM2 packet check failed due to invalid signature!
[2012/08/08 07:23:20,  0] 
../auth/ntlmssp/ntlmssp_sign.c:236(ntlmssp_check_packet)
   NTLMSSP NTLM2 packet check failed due to invalid signature!
[2012/08/08 07:28:54,  0] 
../auth/ntlmssp/ntlmssp_sign.c:236(ntlmssp_check_packet)
   NTLMSSP NTLM2 packet check failed due to invalid signature!
[2012/08/08 07:37:16,  0] 
../auth/ntlmssp/ntlmssp_sign.c:236(ntlmssp_check_packet)
   NTLMSSP NTLM2 packet check failed due to invalid signature!

I've given it a bit of a trim, but only by half.  Not sure what the 
TIMEOUT represents.  I do have a similar BIND9 consumption issue on the 
"primary" DC, where it's just a CPU and memory hog - I've actually got a 
cron job restarting it daily and I'm probably going to have to increase 
the frequency of that as it's usually taking about 2GB RAM after a few 
hours.  I've never seen BIND misbehave that much and it's only on the 
DLZ box so if anyone has any side notes on BIND memory that'd be great...!

The pvfs_oplock_break is for Sophos Anti-Virus pushing itself to CIFS 
shares for my client machines to update from.  In case that was 
affecting the memory, I tried using s3fs but that actually caused the 
s3fs code to segfault, as well as breaking my GPOs, so I switched back 
to NTVFS.  I obviously need to make some oplock setting changes to 
accommodate but using the NTVFS/s3fs file server for Sophos was only a 
temporary arrangement and I expect to relieve it of this duty by next 
week anyhow.

If anyone has any ideas about this, they'd be appreciated.  Is this 
normal?  What's the best way of finding out what these huge processes 
are doing?  I'll try running in the foreground, but as it takes a few days 
for this to manifest, I'm not sure if that's practical.
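
Since the growth takes days, one option that avoids sitting on a 
foreground process is to sample a PID's memory at intervals and keep the 
output.  The sketch below is only that, a sketch: it reads the standard 
Linux /proc/<pid>/status file, the function names and the 5 minute 
interval are arbitrary, and the VmSwap field is only present on kernels 
that export it, so on an older kernel it may simply come back as None.  
It shows how fast a process grows, not what it is doing.

```python
import time


def read_memory_kb(status_text):
    """Extract VmRSS and VmSwap (in kB) from the text of /proc/<pid>/status."""
    wanted = {"VmRSS": None, "VmSwap": None}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # The value looks like "  1248508 kB"; keep the number only
            wanted[key] = int(rest.split()[0])
    return wanted


def sample_forever(pid, interval_seconds=300):
    """Print a timestamped memory reading for one PID until interrupted."""
    while True:
        with open(f"/proc/{pid}/status") as status:
            usage = read_memory_kb(status.read())
        print(time.strftime("%Y-%m-%d %H:%M:%S"), pid, usage, flush=True)
        time.sleep(interval_seconds)
```

Running sample_forever() against the two large PIDs and redirecting the 
output to a file would at least show whether the growth is steady or 
comes in jumps that line up with entries in log.samba.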

Thanks everyone.
-- 
Kev

