CTDB woes

Fri Apr 12 06:39:31 MDT 2013

Hi folks,

We've long been using CTDB and Samba for our NAS service, serving ~500 
users. Over the last few weeks we've been suffering from CTDB 
performance problems, likely triggered either by an upgrade of Samba 
from 3.5 to 3.6 (and enabling of SMB2 as a result), or possibly by 
additional users coming on with a new workload.

We run CTDB 1.0.114.4-1 (from SerNet) and samba3-3.6.12-44 (again, from 
SerNet). Before we roll back, we'd like to make sure we can't fix the 
problem and stick with Samba 3.6 (and we don't even know that a 
rollback would fix the issue).

The symptoms are a complete freeze of the service for CIFS users for 
10-60 seconds, and on the servers a corresponding spawning of large 
numbers of CTDB processes, which seem to be created in a "big bang", and 
then do what they do and exit in the subsequent 10-60 seconds.
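
For anyone wanting to check whether they see the same pattern, here is a 
minimal sketch of how the process bursts could be logged and lined up 
against the client freezes. It assumes a Linux /proc filesystem and that 
the spawned processes show up under the command name "ctdbd"; that name, 
the helper function names and the threshold are illustrative assumptions, 
not something taken from CTDB itself.

```python
import os
import time


def count_processes(name, proc_root="/proc"):
    """Count processes whose command name in /proc/<pid>/stat equals `name`."""
    count = 0
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return 0  # no /proc available (e.g. not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # The stat line looks like "pid (comm) state ..."; comm may itself
        # contain parentheses, so take the outermost pair.
        left, right = data.find("("), data.rfind(")")
        if left != -1 and right > left and data[left + 1:right] == name:
            count += 1
    return count


def find_bursts(samples, threshold):
    """Return (start, end, peak) for each run of samples above `threshold`.

    `samples` is a list of (timestamp, process_count) pairs in time order.
    """
    bursts = []
    start = None
    last_t = None
    peak = 0
    for t, n in samples:
        if n > threshold:
            if start is None:
                start, peak = t, n
            else:
                peak = max(peak, n)
            last_t = t
        elif start is not None:
            bursts.append((start, last_t, peak))
            start = None
    if start is not None:
        bursts.append((start, last_t, peak))
    return bursts


def sample(name="ctdbd", interval=1.0, duration=60.0):
    """Poll the process count every `interval` seconds for `duration` seconds."""
    samples = []
    end = time.time() + duration
    while time.time() < end:
        samples.append((time.time(), count_processes(name)))
        time.sleep(interval)
    return samples
```

Running find_bursts(sample(duration=3600), threshold=20) on each node 
would give a list of burst start/end times and peak process counts for 
that hour, which can then be compared with when users report a freeze. 
The threshold of 20 is an arbitrary starting point and would need tuning 
against the normal process count on a quiet node.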

We also serve up NFS from the same ctdb-managed frontends, and GPFS from 
the cluster - and these are both fine throughout.

This was happening 5-10 times per hour, though not at exact intervals. 
When we added a third node to the CTDB cluster, it "got worse", and when 
we dropped the CTDB cluster down to a single node, everything started 
behaving fine - which is where we are now.

So, I've got a bunch of questions!

  - does anyone know why ctdb would be spawning these processes, and 
whether there's anything we can do to avoid it? Also - any idea 
how we might reproduce this kind of behaviour in a dev/test lab?
  - has anyone done any more general performance / config optimisation 
of CTDB/Samba/GPFS/Linux?

And - more generally - does anyone else actually use ctdb/samba/gpfs at 
the scale of ~500 users or more? If so - how do you find it?

-- 
             --
    Dr Orlando Richards
   Information Services
IT Infrastructure Division
        Unix Section
     Tel: 0131 650 4994

The University of Edinburgh is a charitable body, registered in 
Scotland, with registration number SC005336.