[Samba] winbind fails
mabuqu at ilstu.edu
Fri Nov 5 16:51:15 GMT 2004
Hello everyone,
This is a response to a problem that I posted earlier this year. I just
wanted to let everyone know that the problem has been solved and that it
was NOT samba or winbind that was causing it. As mentioned in the problem
description below, our site has 4 Active Directory DCs. In my smb.conf I
had "password server = *" so that it would authenticate with any DC in
the realm for redundancy. After looking through all the logs I "finally"
realized that it was always getting hung up while communicating with 1
of the four DCs. I changed the setting to "password server =
dc2.dns.name dc3.dns.name dc4.dns.name", forcing authentication to only
the 3 DCs that were working properly and leaving out the 1 DC that
winbind was getting hung up on. After that change there were no more
accumulating CLOSE_WAITs, logon speeds are phenomenal, and overall
performance and stability are excellent. Our Linux box is now acting as
a Linux box should. This problem has been fixed for a few months now; I
only just got around to posting my experience. I updated to 3.0.7 a day
or two after it was released, and it has been running flawlessly ever
since (I still haven't restarted the service).
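A minimal sketch of one way to narrow this kind of problem down without
reading every log: count the CLOSE_WAIT sockets per remote address, so
that a single misbehaving DC stands out. This is not the exact method
used above. The function name count_close_wait is hypothetical, and the
sketch assumes the usual net-tools netstat column layout (foreign
address in column 5, state in column 6).

```shell
#!/bin/sh
# Hypothetical helper: count CLOSE_WAIT sockets per remote address.
# Reads netstat-style lines on stdin, so it works on live output or on
# a saved log file. Assumes column 5 is the foreign address (host:port)
# and column 6 is the TCP state.
count_close_wait() {
    awk '$6 == "CLOSE_WAIT" {
             host = $5
             sub(/:[^:]*$/, "", host)   # strip the trailing :port
             n[host]++
         }
         END { for (h in n) print n[h], h }' | sort -rn
}

# Live usage (assumes netstat from net-tools is installed):
#   netstat -antu | count_close_wait
```

If one address accounts for nearly all of the CLOSE_WAIT entries, that
host is a reasonable first candidate to leave out of "password server".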
As for the problematic DC, the admins never really figured out what the
problem was. All they said was that they saw a few RPC errors in the
event logs from time to time. They wouldn't really take me seriously
because I was using Linux and samba for our local file/cvs server. They
didn't really do anything about the problem until other Windows users
(or other departmental Microsoft admins) started to have problems with
Active Directory logon scripts. They ended up having to rebuild the
server to solve the problem.
So all in all, I wanted to thank the developers for their efforts.
Majeed wrote:
> I have been having the same problem with winbind for quite a while now
> and have researched up and down, but I can't get the problem resolved. I
> have been dealing with this since 3.0.2. I then moved to 3.0.2a, then to
> 3.0.3pre2 since the release notes mentioned a crash fix in ADS mode,
> then to 3.0.3 since it was a production release, and then to 3.0.4 since
> some memory leaks and socket handling issues were fixed in winbind. I
> will now illustrate my problem.
> Info:
> - 4 windows 2000 domain controllers
> - linux box joins the domain and uses Kerberos active directory
>   authentication to shares
> - distribution: Gentoo 1.4
> - kernel 2.4.26 (stock sources)
> - current version of samba: 3.0.4
> - If anything else is needed please let me know
> - configure command to compile:
> ./configure --prefix=/usr --sysconfdir=/etc/samba --localstatedir=/var \
>   --with-privatedir=/etc/samba/private --with-lockdir=/var/cache/samba \
>   --with-swatdir=/usr/share/swat --with-configdir=/etc/samba \
>   --enable-static --enable-shared --with-manpages-langs=en \
>   --without-spinlocks --with-libsmbclient \
>   --with-automount --with-smbmount --with-winbind --with-syslog \
>   --with-idmap --with-ldap \
>   --with-ads --with-krb5 --with-pam
> Problem:
> After compiling and installing samba and copying the pam_winbind.so,
> libnss_winbind.so, and libnss_wins.so files to the appropriate
> directories, I start samba and winbind using a startup script. It
> takes about 30 seconds to a minute for authentication to start working
> (probably winbind talking to the DCs). Once it starts authenticating it
> works GREAT and will continue to do so for a period of 3 days to a week.
> Once it hits a certain point winbind will no longer authenticate.
>
> Since I have been having this problem for a while now, I have been
> monitoring winbindd. Around 3 hours after I start winbindd, sockets in
> the CLOSE_WAIT state start accumulating in the output of the
> "netstat -antupo" command. All the sockets in this state are owned by
> the winbindd process. They never close unless I kill the winbindd
> process. Once the number of CLOSE_WAITs reaches around 1000 it causes
> winbindd to stop authenticating and samba to crash, and I am not able
> to ssh in (I can connect and authenticate, but after I successfully
> authenticate ssh shoots back a signal 11 error and drops the
> connection). I believe the ssh problem is caused by winbind because of
> all the sockets and port numbers it has tied up in the CLOSE_WAIT state.
> Once I restart winbindd and sshd everything works fine again until that
> certain amount of time has passed.
>
> After doing much research I found that it is usually the application
> that is not closing the socket correctly, due to a bug. At first I
> thought it might be the kernel, so I upgraded from 2.4.25 to 2.4.26, but
> the same symptoms came about. After that I was reading a developers'
> forum and someone said that if you kill the process that owns the
> sockets in the CLOSE_WAIT state and they disappear, then it is not a
> kernel issue. Also, while monitoring winbindd I noticed that its memory
> consumption steadily increases (maybe a leak?).
>
> I wanted to be able to show the developers and everyone else what I was
> seeing, so I wrote a script and set up a cron job to run it every hour
> at 10 minutes past the hour. The script runs the following commands and
> writes the output to a text file. This isn't the entire script but it is
> the meat of it.
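> # Timestamped log file name under the log directory
> LOG_FILE=$(date +%F_%H.%M%P_winbind_info.log)
> PREFIX=/var/log/winbind/
> # ps header line, then the winbindd and smbd/nmbd processes
> ps aux | grep PID >> "$PREFIX$LOG_FILE"
> ps aux | grep winb >> "$PREFIX$LOG_FILE"
> ps aux | grep mbd >> "$PREFIX$LOG_FILE"
> # Memory and state details for the running winbindd
> cat "/proc/$(cat /var/run/samba/winbindd.pid)/status" >> "$PREFIX$LOG_FILE"
> # All TCP/UDP sockets with owning process and timers
> netstat -antupo >> "$PREFIX$LOG_FILE"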
> I put all the logs, starting from the minute I started winbindd up
> until now, on a webpage for people to see. They are in order by date and
> time, and you will be able to see how things progress: memory usage and
> the CLOSE_WAIT problem. Hopefully the developers can use this
> information. If not, it would be great if anyone has any idea why I
> have all these CLOSE_WAITs. I am replying to a previous post that I
> created, but back then I was just going to upgrade to see if I still had
> the same problems. And I did, as you can see. Any insight would be
> great. I would be glad to entertain any questions or tests that people
> would like me to try. I have a test server and a production server, and
> this problem happens on both.
> Go to www.analoglove.com/winbind <http://www.analoglove.com/winbind>
> Below is how the message ended the last time I posted about this.
> Thank you very much for your time,
> Majeed Qulbain
> Majeed wrote:
>> I'm going to install the new version, and report back in a week or so.
>> Thanks for the reply!
>> Majeed
>> Tim Jordan wrote:
>>> I see there is a fix for winbind crashing in the latest release notes.
>>> http://download.samba.org/samba/ftp/pre/
>>> TJ
>>> On Mon, 2004-04-05 at 10:25, Majeed wrote:
>>>> I have also been seeing this over the last few weeks. For me it
>>>> also happens randomly, as you stated. I am trying to pinpoint when it
>>>> started, and I believe it started right after I upgraded the kernel
>>>> from 2.4.24 to 2.4.25 (vanilla sources on Gentoo 1.4) (mremap
>>>> problems), but I can't be too sure. Samba 3.0.2 was compiled with the
>>>> following options:
>>>> ./configure --prefix=/usr --sysconfdir=/etc/samba \
>>>>   --localstatedir=/var --libdir=/usr/lib/samba \
>>>>   --with-privatedir=/etc/samba/private --with-lockdir=/var/cache/samba \
>>>>   --with-piddir=/var/run/samba --with-swatdir=/usr/share/swat \
>>>>   --with-configdir=/etc/samba --with-logfilebase=/var/log/samba \
>>>>   --enable-static --enable-shared --with-manpages-langs=en \
>>>>   --without-spinlocks --with-libsmbclient --with-automount --with-smbmount \
>>>>   --with-winbind --with-syslog --with-idmap --with-ldap --with-ads \
>>>>   --with-krb5 --with-pam
>>>> Here are some symptoms I am seeing when the problem occurs.
>>>> Symptom 1) I cannot log in through ssh. It's weird because I can
>>>> connect and put in my username and password, and it authenticates, but
>>>> then the connection gets reset. There is even a line in the ssh log
>>>> file that says access was granted. I then go to the console and log in.
>>>> Symptom 2) While logged into the console I run "netstat -antu" and
>>>> get some interesting results:
>>>> tcp 0 0 sambaserv_ip:44134 win2000dc_ip:139 CLOSE_WAIT
>>>> tcp 0 0 sambaserv_ip:44072 win2000dc_ip:139 CLOSE_WAIT
>>>> tcp 0 0 sambaserv_ip:44075 win2000dc_ip:139 CLOSE_WAIT
>>>> tcp 0 0 sambaserv_ip:44076 win2000dc_ip:139 CLOSE_WAIT
>>>> tcp 0 0 sambaserv_ip:44078 win2000dc_ip:139 CLOSE_WAIT
>>>> tcp 0 0 sambaserv_ip:44079 win2000dc_ip:139 CLOSE_WAIT
>>>> There are HUNDREDS of these CLOSE_WAIT lines, all with different
>>>> ascending port numbers.
>>>> After restarting samba and winbind, netstat looked normal and
>>>> everything worked as it should have.
>>>> Symptom 3) While logged into the console I checked the samba log files,
>>>> and log.winbind showed the following problems.
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> [2004/04/05 10:11:05, 0] lib/util_sock.c:open_socket_in(634)
>>>> open_socket_in(): socket() call failed: Too many open files
>>>> Again there were HUNDREDS of these lines.
>>>> So I think winbind might be the cause of the problems. This happens
>>>> on both my production and my test server. The test server is mirrored
>>>> to production for testing.
>>>> Today I am going to download the newest version of samba 3 and
>>>> see if that helps. If it doesn't, then I might try a different kernel
>>>> version. As mentioned before, all I do is restart samba and winbind and
>>>> things will work perfectly for a random amount of time, usually 3 or
>>>> more days, before it happens again.
>>>> Does anyone have any suggestions? Maybe some different things I
>>>> could look for? Maybe different compile options?
>>>> Thanks
>>>> Majeed Qulbain
>>>> Hoskinson, David P wrote:
>>>>> We have a windows 2003 dc here at the university and I have
>>>>> setup samba-3.0.2-6.3E on a RHEL WS3 machine. The problem is that after
>>>>> several hours, or several days winbind stops running and connections
>>>>> fail. I have seen instances of this on other sites, but no firm
>>>>> answers. I can provide files and logs if helpful