Process smbd using 100% CPU and impossible to kill

David Collier-Brown davec-b at rogers.com
Mon Feb 16 12:25:19 MST 2009


Cedric Simon wrote:
> Hello,
>
> We have recently installed a Samba server on openSUSE 11.1 and we have
> the following problem: after some time, an smbd process starts using 100%
> of the CPU, and it is impossible to kill it, even with a kill -9 pid.
>
> The Samba service can be stopped and started, but the smbd process keeps
> using 100% CPU. Shutdown does not work either. Only powering off the
> server can 'solve' the problem.
>   
You've tripped over a low-level problem of some sort which is doing a
denial-of-service attack on Samba, and therefore on everyone else (;-))

If it's always the same user that triggers the problem, or if the first
user who logs on will trigger it, you can attach a debugger to the Samba
process, induce the problem, and tell the Samba folks where it died,
which *may* give you a clue about what failed. If not, you can run strace
on it and see if it loops on a system call. Failing that, I'd try
swapping parts (:-().
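
Before attaching strace or a debugger, it can help to know whether the
process is burning CPU in user code or inside the kernel. The top output
quoted below shows 0.0%us against 25.0%sy, which suggests the latter, and
a process that never returns from the kernel cannot act on a signal,
kill -9 included. A minimal Python sketch of that check follows. It
assumes a Linux /proc filesystem, and the helper names (parse_stat,
sample) are made up for illustration; they are not part of Samba or of
any existing tool.

```python
import os
import sys
import time


def parse_stat(text):
    """Return (state, utime_ticks, stime_ticks) from /proc/<pid>/stat text."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so cut at the LAST ')' instead of splitting blindly.
    rest = text[text.rindex(")") + 2:].split()
    # rest[0] is the state; utime and stime are fields 14 and 15 of the
    # full line, i.e. indexes 11 and 12 once pid and comm are removed.
    return rest[0], int(rest[11]), int(rest[12])


def sample(pid, interval=1.0):
    """Return (state, user_tick_delta, system_tick_delta) over `interval` seconds."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        _, user_before, sys_before = parse_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        state, user_after, sys_after = parse_stat(f.read())
    return state, user_after - user_before, sys_after - sys_before


if __name__ == "__main__" and len(sys.argv) > 1:
    state, user_ticks, system_ticks = sample(int(sys.argv[1]))
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    print(f"state={state} user_ticks={user_ticks} "
          f"system_ticks={system_ticks} (about {hz} ticks per second per CPU)")
```

Run it as root with the PID from top, for example
"python3 cpucheck.py 11878" (the script name is hypothetical). If nearly
all the ticks are system ticks, strace will probably show nothing new,
because the process is not entering new system calls; it is stuck inside
one.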

--dave


> Please find below my findings and info.
>
> The process is running as root instead of admon, and its running time
> equals the time since the user (IP 192.168.1.67) actually disconnected
> from Samba. I assume something goes wrong while closing the process, and
> the process enters an unstable/phantom state, using 100% of the CPU.
>
> As the CPU is used at 100%, it affects the whole server :-(((
>
> If you have any idea what could be wrong or how to solve this problem,
> feel free to tell me. If you need more info and I can get it, I'll
> send it to you.
>
> As the server is in production at a client's site, I can't do 'what I
> want' with the server. We are investigating moving the Windows clients
> to NFS.
>
> Please note that most users, including the one at IP 192.168.1.67, are
> on wireless connections and in some cases can lose the network. This
> might be part of the problem.
>
> But my major concern is how Linux can have a running process (smbd) that
> is impossible to kill and that prevents shutdown of the server, as well
> as 'normal' operation, since it uses 100% CPU.
>
> Many thanks in advance for your help.
>
> Cedric Simon.
>
>
> smb.conf
>
> [global]
> workgroup = MEDLAB
> server string = Servidor de archivos de Medlab
> map to guest = Bad User
> null passwords = Yes
> guest account = samba
> printcap name = cups
> ldap ssl = no
> create mask = 0777
> force create mode = 0777
> force security mode = 0777
> directory mask = 0777
> force directory mode = 0777
> force directory security mode = 0777
> cups options = raw
>
> [users]
> comment = All users
> path = /shared
> read only = No
> inherit acls = Yes
> veto files = /aquota.user/groups/shares/
>
> [admon]
> comment = Administracion
> path = /shared/admon
> read only = No
> inherit acls = Yes
> veto files = /aquota.user/groups/shares/
>
> [clientes]
> comment = Clientes
> path = /shared/clientes
> read only = No
> inherit acls = Yes
> veto files = /aquota.user/groups/shares/
>
> [gerencia]
> comment = Gerencia
> path = /shared/gerencia
> read only = No
> inherit acls = Yes
> veto files = /aquota.user/groups/shares/
>
> [medicos]
> comment = Medicos
> path = /shared/medicos/
> inherit acls = yes
> veto files = /aquota.user/groups/shares/
> guest ok = yes
> read only = no
>
>
> [compartido]
> comment = All groups
> path = /shared/compartido/
> username = samba
> read only = No
> acl check permissions = No
> force unknown acl user = Yes
> guest ok = Yes
> hosts allow = 192.168.1.
>
>
> relih:~ # top
> top - 12:35:00 up 19:40,  1 user,  load average: 3.01, 2.92, 2.33
> Tasks: 132 total,   4 running, 128 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us, 25.0%sy,  0.0%ni, 74.5%id,  0.3%wa,  0.0%hi,  0.2%si,  0.0%st
> Mem:   2048884k total,  1996896k used,    51988k free,    98664k buffers
> Swap:  2104504k total,       28k used,  2104476k free,  1506752k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 11878 root      20   0 16984 5188 3932 R  100  0.3  37:24.13 smbd
> 14763 root      20   0  2432 1132  848 R    1  0.1   0:00.04 top
>     1 root      20   0  1008  380  332 S    0  0.0   0:02.00 init
>     2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
>     3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0
>     4 root      15  -5     0    0    0 S    0  0.0   0:00.84 ksoftirqd/0
>     5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/1
>
> log.smbd:
>
> [2009/02/14 03:45:15,  0] smbd/server.c:main(1208)
>   smbd version 3.2.6-0.3.1-2042-SUSE-CODE11 started.
>   Copyright Andrew Tridgell and the Samba Team 1992-2008
> [2009/02/14 07:25:36,  1] smbd/service.c:make_connection_snum(1194)
>   nadia (::ffff:192.168.1.104) connect to service admon initially as
> user admon (uid=1002, gid=100) (pid 11622)
> [2009/02/14 07:34:17,  1] smbd/service.c:make_connection_snum(1194)
>   lenovo_medicos (::ffff:192.168.1.80) connect to service medicos
> initially as user medicos (uid=1004, gid=100) (pid 11651)
> [2009/02/14 07:34:17,  1] smbd/service.c:make_connection_snum(1194)
>   lenovo_medicos (::ffff:192.168.1.80) connect to service compartido
> initially as user medicos (uid=1004, gid=100) (pid 11651)
> [2009/02/14 07:42:03,  1] smbd/service.c:make_connection_snum(1194)
>   recepcion (::ffff:192.168.1.112) connect to service compartido
> initially as user admon (uid=1002, gid=100) (pid 11661)
> [2009/02/14 07:49:43,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service admon initially
> as user admon (uid=1002, gid=100) (pid 11694)
> [2009/02/14 07:49:43,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service compartido
> initially as user admon (uid=1002, gid=100) (pid 11694)
> [2009/02/14 07:49:44,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
> [2009/02/14 07:49:44,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 07:49:44,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 07:53:58,  1] smbd/service.c:make_connection_snum(1194)
>   direccion_medic (::ffff:192.168.1.70) connect to service medicos
> initially as user medicos (uid=1004, gid=100) (pid 11698)
> [2009/02/14 07:53:58,  1] smbd/service.c:make_connection_snum(1194)
>   direccion_medic (::ffff:192.168.1.70) connect to service compartido
> initially as user medicos (uid=1004, gid=100) (pid 11698)
> [2009/02/14 07:54:06,  1] smbd/service.c:make_connection_snum(1194)
>   server_medlab (::ffff:192.168.1.71) connect to service clientes
> initially as user clientes (uid=1006, gid=100) (pid 11700)
> [2009/02/14 07:54:06,  1] smbd/service.c:make_connection_snum(1194)
>   server_medlab (::ffff:192.168.1.71) connect to service users initially
> as user clientes (uid=1006, gid=100) (pid 11700)
> [2009/02/14 08:05:12,  1] smbd/service.c:make_connection_snum(1194)
>   emma (::ffff:192.168.1.162) connect to service medicos initially as
> user medicos (uid=1004, gid=100) (pid 11739)
> [2009/02/14 08:05:12,  1] smbd/service.c:make_connection_snum(1194)
>   emma (::ffff:192.168.1.162) connect to service compartido initially as
> user medicos (uid=1004, gid=100) (pid 11739)
> [2009/02/14 08:05:15,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
> [2009/02/14 08:05:15,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:05:15,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:09:42,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:09:42,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:09:42,  1] smbd/service.c:close_cnum(1405)
>   contabilidad (::ffff:192.168.1.68) closed connection to service
> compartido
> [2009/02/14 08:09:42,  1] smbd/service.c:close_cnum(1405)
>   contabilidad (::ffff:192.168.1.68) closed connection to service admon
> [2009/02/14 08:15:07,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service admon initially
> as user admon (uid=1002, gid=100) (pid 11772)
> [2009/02/14 08:20:06,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:20:06,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:20:06,  1] smbd/service.c:close_cnum(1405)
>   contabilidad (::ffff:192.168.1.68) closed connection to service admon
> [2009/02/14 08:22:57,  1] smbd/service.c:make_connection_snum(1194)
>   medlab (::ffff:192.168.1.67) connect to service admon initially as
> user admon (uid=1002, gid=100) (pid 11787)
> [2009/02/14 08:23:04,  1] smbd/service.c:make_connection_snum(1194)
>   medlab (::ffff:192.168.1.67) connect to service compartido initially
> as user admon (uid=1002, gid=100) (pid 11787)
> [2009/02/14 08:24:08,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service admon initially
> as user admon (uid=1002, gid=100) (pid 11791)
> [2009/02/14 08:24:11,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:24:11,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:30:35,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:30:35,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:30:35,  1] smbd/service.c:close_cnum(1405)
>   contabilidad (::ffff:192.168.1.68) closed connection to service admon
> [2009/02/14 08:32:15,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service admon initially
> as user admon (uid=1002, gid=100) (pid 11819)
> [2009/02/14 08:38:48,  1] smbd/service.c:close_cnum(1405)
>   medlab (::ffff:192.168.1.67) closed connection to service admon
> [2009/02/14 08:38:48,  1] smbd/service.c:close_cnum(1405)
>   medlab (::ffff:192.168.1.67) closed connection to service compartido
> [2009/02/14 08:44:47,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
> [2009/02/14 08:44:47,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:44:47,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:46:28,  1] smbd/service.c:make_connection_snum(1194)
>   medlab (::ffff:192.168.1.67) connect to service admon initially as
> user admon (uid=1002, gid=100) (pid 11868)
> [2009/02/14 08:46:28,  1] smbd/service.c:make_connection_snum(1194)
>   medlab (::ffff:192.168.1.67) connect to service compartido initially
> as user admon (uid=1002, gid=100) (pid 11868)
> [2009/02/14 08:47:33,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:47:33,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:47:33,  1] smbd/service.c:close_cnum(1405)
>   contabilidad (::ffff:192.168.1.68) closed connection to service admon
> [2009/02/14 08:47:46,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 08:47:46,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 08:47:46,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service admon initially
> as user admon (uid=1002, gid=100) (pid 11873)
> [2009/02/14 08:50:18,  1] smbd/service.c:close_cnum(1405)
>   medlab (::ffff:192.168.1.67) closed connection to service admon
> [2009/02/14 08:50:18,  1] smbd/service.c:close_cnum(1405)
>   medlab (::ffff:192.168.1.67) closed connection to service compartido
> [2009/02/14 08:51:20,  1] smbd/service.c:make_connection_snum(1194)
>   medlab (::ffff:192.168.1.67) connect to service admon initially as
> user admon (uid=1002, gid=100) (pid 11878)
> [2009/02/14 08:51:20,  1] smbd/service.c:make_connection_snum(1194)
>   medlab (::ffff:192.168.1.67) connect to service compartido initially
> as user admon (uid=1002, gid=100) (pid 11878)
> [2009/02/14 09:37:46,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 09:37:46,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 09:37:46,  1] smbd/service.c:close_cnum(1405)
>   contabilidad (::ffff:192.168.1.68) closed connection to service admon
> [2009/02/14 10:06:23,  1] smbd/service.c:make_connection_snum(1194)
>   contabilidad (::ffff:192.168.1.68) connect to service admon initially
> as user admon (uid=1002, gid=100) (pid 14293)
> [2009/02/14 10:06:26,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
> [2009/02/14 10:06:26,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 10:06:26,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 11:47:29,  1] smbd/service.c:close_cnum(1405)
>   nadia (::ffff:192.168.1.104) closed connection to service admon
> [2009/02/14 11:53:07,  1] smbd/notify_inotify.c:watch_destructor(351)
>   inotify_rm_watch returned Argumento inválido
> [2009/02/14 11:53:11,  1] smbd/notify_inotify.c:watch_destructor(351)
>   inotify_rm_watch returned Argumento inválido
> [2009/02/14 11:53:24,  1] smbd/notify_inotify.c:watch_destructor(351)
>   inotify_rm_watch returned Argumento inválido
> [2009/02/14 11:53:27,  1] smbd/notify_inotify.c:watch_destructor(351)
>   inotify_rm_watch returned Argumento inválido
> [2009/02/14 11:55:58,  1] smbd/service.c:close_cnum(1405)
>   recepcion (::ffff:192.168.1.112) closed connection to service
> compartido
> [2009/02/14 11:57:54,  1] smbd/notify_inotify.c:watch_destructor(351)
>   inotify_rm_watch returned Argumento inválido
> [2009/02/14 11:57:58,  1] smbd/service.c:close_cnum(1405)
>   medlab (::ffff:192.168.1.67) closed connection to service compartido
> [2009/02/14 11:57:58,  1] smbd/service.c:close_cnum(1405)
>   medlab (::ffff:192.168.1.67) closed connection to service admon
> [2009/02/14 12:01:14,  1] smbd/service.c:close_cnum(1405)
>   lenovo_medicos (::ffff:192.168.1.80) closed connection to service
> medicos
> [2009/02/14 12:01:14,  1] smbd/service.c:close_cnum(1405)
>   lenovo_medicos (::ffff:192.168.1.80) closed connection to service
> compartido
> [2009/02/14 12:03:56,  1] smbd/service.c:close_cnum(1405)
>   emma (::ffff:192.168.1.162) closed connection to service medicos
> [2009/02/14 12:03:56,  1] smbd/service.c:close_cnum(1405)
>   emma (::ffff:192.168.1.162) closed connection to service compartido
> [2009/02/14 12:05:15,  1] smbd/service.c:make_connection_snum(1194)
>   ar (::ffff:192.168.1.54) connect to service medicos initially as user
> gerencia (uid=1003, gid=100) (pid 14639)
> [2009/02/14 12:05:15,  1] smbd/service.c:make_connection_snum(1194)
>   ar (::ffff:192.168.1.54) connect to service admon initially as user
> gerencia (uid=1003, gid=100) (pid 14639)
> [2009/02/14 12:05:15,  1] smbd/service.c:make_connection_snum(1194)
>   ar (::ffff:192.168.1.54) connect to service gerencia initially as user
> gerencia (uid=1003, gid=100) (pid 14639)
> [2009/02/14 12:05:15,  1] smbd/service.c:make_connection_snum(1194)
>   ar (::ffff:192.168.1.54) connect to service clientes initially as user
> gerencia (uid=1003, gid=100) (pid 14639)
> [2009/02/14 12:05:15,  1] smbd/service.c:make_connection_snum(1194)
>   ar (::ffff:192.168.1.54) connect to service compartido initially as
> user gerencia (uid=1003, gid=100) (pid 14639)
> [2009/02/14 12:05:18,  0] lib/util_sock.c:read_socket_with_timeout(939)
> [2009/02/14 12:05:18,  0] lib/util_sock.c:get_peer_addr_internal(1607)
>   getpeername failed. Error was El otro extremo de la conexión no está
> conectado
>   read_socket_with_timeout: client 0.0.0.0 read error = Conexión
> reinicializada por la máquina remota.
> [2009/02/14 12:08:26,  1] smbd/service.c:close_cnum(1405)
>   server_medlab (::ffff:192.168.1.71) closed connection to service
> clientes
> [2009/02/14 12:08:26,  1] smbd/service.c:close_cnum(1405)
>   server_medlab (::ffff:192.168.1.71) closed connection to service users
>
>
>
> relih:~ # rcsmb stop
> Shutting down Samba SMB daemon  Warning: daemon not running.
> done
> relih:~ # rcsmb start
> Starting Samba SMB daemon
> done
> relih:~ # smbstatus
>
> Samba version 3.2.6-0.3.1-2042-SUSE-CODE11
> PID     Username      Group         Machine
> -------------------------------------------------------------------
>
> Service      pid     machine       Connected at
> -------------------------------------------------------
>
> No locked files
>
> relih:~ # kill -9 11878
> relih:~ # top
> top - 17:31:07 up 1 day, 36 min,  1 user,  load average: 10.00, 10.00, 9.94
> Tasks: 129 total,   4 running, 124 sleeping,   0 stopped,   1 zombie
> Cpu(s):  0.0%us, 25.0%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   2048884k total,  2035512k used,    13372k free,    84460k buffers
> Swap:  2104504k total,       28k used,  2104476k free,  1553572k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 11878 root      20   0 16984 5188 3932 R  101  0.3 333:32.09 smbd
>     1 root      20   0  1008  380  332 S    0  0.0   0:02.04 init
>     2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
>
>   
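
Cedric's observation that the running time matches the time since
192.168.1.67 disconnected can be checked against the numbers in the
quoted output. A small Python sketch, assuming pid 11878 has been at
roughly 100% of one CPU ever since it went wrong, that it used negligible
CPU before that, and that both top snapshots were taken on the same day
as the log (2009/02/14). The function name spin_start is invented here.

```python
from datetime import datetime, timedelta


def spin_start(wall_clock, cpu_time):
    """Estimate when a process pinned at ~100% CPU started spinning.

    wall_clock: time shown in the top header, "HH:MM:SS"
    cpu_time:   the TIME+ column from top, "MMM:SS.hh"
    """
    minutes, seconds = cpu_time.split(":")
    used = timedelta(minutes=int(minutes), seconds=float(seconds))
    # Only the time of day matters, so the default date from strptime is fine.
    return datetime.strptime(wall_clock, "%H:%M:%S") - used


# (top header time, TIME+ of pid 11878) from the two quoted top snapshots
for wall, cpu in [("12:35:00", "37:24.13"), ("17:31:07", "333:32.09")]:
    print(wall, cpu, spin_start(wall, cpu).strftime("%H:%M:%S"))
```

Both snapshots point to about 11:57:35, some 20 seconds before log.smbd
records medlab (192.168.1.67, pid 11878) closing its connections at
11:57:58, and a few minutes after the first inotify_rm_watch errors at
11:53. That is only a correlation, but it is probably worth mentioning
in a bug report.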


-- 
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb at spamcop.net           |                      -- Mark Twain
(416) 223-8968


