[Samba] Troubleshooting a suspected ctdb performance issue
Smith, Jarrod A
jarrod.smith at Vanderbilt.Edu
Tue Mar 8 02:09:33 UTC 2016
We have a three-node ctdb/samba cluster (8x Sandy Bridge cores + 64GB RAM per node) running on top of GPFS 4.1.0.8, serving 500-600 CIFS clients. We use sernet-samba-4.1.6 with ctdb 1.0.114 on CentOS 6.6. Unfortunately, the administrator who originally installed the ctdb/samba solution left some time ago, and we are still learning it.
Users have been reporting intermittent "latency" issues that occur multiple times per day, going back several months. Typical complaints include waits of 30-60s to open folders or files, and sometimes being disconnected from the service. The samba and ctdb logs show nothing at debug level WARNING. We have taken tcpdump/wireshark packet captures during such events and analyzed them; they showed no obvious misbehavior in the network.
I have recently been probing ctdb itself, and today I realized that periodically the number of ctdbd processes on a node quickly grows from 2 to 250+. This lasts from 30 seconds to single-digit minutes, after which it corrects itself. It seems to be correlated with a large increase in the number of lines in /proc/locks. We also see what I feel are fairly high max_lockwait_latency and max_call_latency values (see our statistics outputs below). I don't know what causes this, or how to fix it (if it indeed needs fixing). Keeping in mind that I am new to samba and ctdb, do you have any recommendations for how we can troubleshoot this further, and/or fix the issue if you think I have already hit upon it?
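To check the apparent correlation between the ctdbd process count and the size of /proc/locks, both can be sampled with timestamps and then lined up against user complaints. The following is only a minimal sketch, assuming Python 3 is available on the node. The function names, the 5 second interval, and the spike threshold of 50 processes are all hypothetical choices for illustration, not anything taken from ctdb.

```python
import os
import time


def count_processes(name, proc_root="/proc"):
    """Count processes whose name (the 'Name:' line of /proc/<pid>/status) equals name."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                first = f.readline()
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
        if first.startswith("Name:") and first.split(":", 1)[1].strip() == name:
            count += 1
    return count


def count_lock_lines(locks_path="/proc/locks"):
    """Return the number of lines in the kernel's lock table listing."""
    with open(locks_path) as f:
        return sum(1 for _ in f)


def sample(proc_root="/proc"):
    """Take one timestamped sample: (time string, ctdbd count, lock line count)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    procs = count_processes("ctdbd", proc_root)
    locks = count_lock_lines(os.path.join(proc_root, "locks"))
    return stamp, procs, locks


def monitor(interval=5.0, threshold=50):
    """Print one sample per interval forever; mark samples at or above threshold.

    Both defaults are arbitrary illustration values (assumptions).
    """
    while True:
        stamp, procs, locks = sample()
        flag = " <-- spike" if procs >= threshold else ""
        print(f"{stamp} ctdbd={procs} locks={locks}{flag}", flush=True)
        time.sleep(interval)
```

Calling monitor() on each node and redirecting the output to a file would give a per-node timeline that can be compared against the times users report slow folder opens.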
Thanks for your advice,
--
Jarrod A. Smith, Ph.D.
Asst. Director, Center for Structural Biology
Research Assoc. Professor, Biochemistry
Vanderbilt University - 5135 MRB III
615-322-1739
-----------------------------------
CTDB statistics for each node.
The counters were reset a week or two ago.
-----------------------------------
CTDB version 1
num_clients 136
frozen 0
recovering 0
client_packets_sent 286501416
client_packets_recv 325458087
node_packets_sent 354199901
node_packets_recv 266253799
keepalive_packets_sent 394382
keepalive_packets_recv 394374
node
    req_call 143496620
    reply_call 90253
    req_dmaster 55005319
    reply_dmaster 60629271
    reply_error 0
    req_message 1720274
    req_control 74416680
    reply_control 29674368
client
    req_call 253163227
    req_message 1113562
    req_control 71335811
timeouts
    call 0
    control 1
    traverse 3
total_calls 253163227
pending_calls 0
lockwait_calls 11138044
pending_lockwait_calls 0
childwrite_calls 6
pending_childwrite_calls 0
memory_used 210352
max_hop_count 2162
max_reclock_ctdbd 0.141385 sec
max_reclock_recd 169.497819 sec
max_call_latency 310.868259 sec
max_lockwait_latency 214.839209 sec
max_childwrite_latency 0.014314 sec
-----------------------------------
CTDB version 1
num_clients 132
frozen 0
recovering 0
client_packets_sent 247024177
client_packets_recv 286512929
node_packets_sent 336526909
node_packets_recv 255250235
keepalive_packets_sent 394339
keepalive_packets_recv 394328
node
    req_call 128153759
    reply_call 70305
    req_dmaster 60194286
    reply_dmaster 53335499
    reply_error 0
    req_message 1521121
    req_control 73830804
    reply_control 24189543
client
    req_call 219206250
    req_message 1037383
    req_control 66378108
timeouts
    call 0
    control 3
    traverse 5
total_calls 219206250
pending_calls 0
lockwait_calls 3265686
pending_lockwait_calls 0
childwrite_calls 6
pending_childwrite_calls 0
memory_used 253340
max_hop_count 2163
max_reclock_ctdbd 0.342660 sec
max_reclock_recd 0.000000 sec
max_call_latency 437.201033 sec
max_lockwait_latency 67.572988 sec
max_childwrite_latency 0.015522 sec
-----------------------------------
CTDB version 1
num_clients 205
frozen 0
recovering 0
client_packets_sent 299537550
client_packets_recv 349951795
node_packets_sent 376914119
node_packets_recv 272621669
keepalive_packets_sent 417794
keepalive_packets_recv 417782
node
    req_call 139163332
    reply_call 154848
    req_dmaster 59987264
    reply_dmaster 58997998
    reply_error 0
    req_message 1083367
    req_control 85827912
    reply_control 34427476
client
    req_call 262120527
    req_message 2169333
    req_control 85858251
timeouts
    call 0
    control 2
    traverse 5
total_calls 262120527
pending_calls 0
lockwait_calls 5550667
pending_lockwait_calls 0
childwrite_calls 6
pending_childwrite_calls 0
memory_used 250736
max_hop_count 2152
max_reclock_ctdbd 0.016747 sec
max_reclock_recd 166.169447 sec
max_call_latency 16350.672816 sec
max_lockwait_latency 74.970163 sec
max_childwrite_latency 0.016126 sec
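
Comparing the three blocks above by eye is error-prone, so a small script could pull out the timing fields and the share of calls that had to wait for a lock. This is a rough sketch, assuming Python 3 and assuming the statistics text has the "name value [sec]" layout shown in this post. The helper names and the 1 second threshold are hypothetical, and the sample is a subset of the first node's values copied from above.

```python
import re

# One field per line: a name, a number, and an optional "sec" unit.
_FIELD = re.compile(r"^\s*(\w+)\s+(\d+(?:\.\d+)?)(\s+sec)?\s*$")


def parse_statistics(text):
    """Split a statistics block into (counters, timings) dictionaries.

    Names repeated under the node/client sections (e.g. req_call) overwrite
    each other in counters; only uniquely named fields are used below.
    """
    counters, timings = {}, {}
    for line in text.splitlines():
        m = _FIELD.match(line)
        if not m:
            continue  # section headers and the "CTDB version" line
        name, value, is_time = m.group(1), m.group(2), m.group(3)
        if is_time:
            timings[name] = float(value)
        else:
            counters[name] = int(value) if value.isdigit() else float(value)
    return counters, timings


def slow_timings(timings, threshold_sec=1.0):
    """Return the timing fields above threshold_sec (an arbitrary cut-off)."""
    return {k: v for k, v in timings.items() if v > threshold_sec}


def lockwait_fraction(counters):
    """Fraction of total_calls that were counted as lockwait_calls."""
    return counters["lockwait_calls"] / counters["total_calls"]


# Subset of the first node's output, copied from the post.
SAMPLE = """\
CTDB version 1
num_clients 136
total_calls 253163227
lockwait_calls 11138044
max_hop_count 2162
max_reclock_ctdbd 0.141385 sec
max_reclock_recd 169.497819 sec
max_call_latency 310.868259 sec
max_lockwait_latency 214.839209 sec
max_childwrite_latency 0.014314 sec
"""

counters, timings = parse_statistics(SAMPLE)
for name, seconds in sorted(slow_timings(timings).items()):
    print(f"{name}: {seconds:.3f} sec")
print(f"lockwait fraction: {lockwait_fraction(counters):.4f}")
```

Since the max_* values are maxima since the last counter reset, running this on output collected at regular intervals (with a reset in between) would show whether the large values recur or came from a single event.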