CTDB woes
Orlando Richards
orlando.richards at ed.ac.uk
Mon Apr 15 04:16:44 MDT 2013
On 13/04/13 01:21, Amitay Isaacs wrote:
>
>
>
> On Sat, Apr 13, 2013 at 1:41 AM, Orlando Richards
> <orlando.richards at ed.ac.uk> wrote:
>
> On 12/04/13 16:35, Amitay Isaacs wrote:
>
>
> On Fri, Apr 12, 2013 at 10:39 PM, Orlando Richards
> <orlando.richards at ed.ac.uk> wrote:
>
>
> Hi folks,
>
> We've long been using CTDB and Samba for our NAS service, servicing
> ~500 users. We've been suffering from some problems with CTDB
> performance over the last few weeks, likely triggered either by an
> upgrade of samba from 3.5 to 3.6 (and the enabling of SMB2 as a
> result), or possibly by additional users coming on with a new
> workload.
>
> We run CTDB 1.0.114.4-1 (from sernet) and samba3-3.6.12-44 (again,
> from sernet). Before we roll back, we'd like to make sure we can't
> fix the problem and stick with Samba 3.6 (and we don't even know
> that a roll back would fix the issue).
>
> The symptoms are a complete freeze of the service for CIFS users for
> 10-60 seconds, and on the servers a corresponding spawning of large
> numbers of CTDB processes, which seem to be created in a "big bang"
> and then do what they do and exit in the subsequent 10-60 seconds.
>
> We also serve up NFS from the same ctdb-managed frontends, and GPFS
> from the cluster - and these are both fine throughout.
>
> This was happening 5-10 times per hour, though not at exact
> intervals. When we added a third node to the CTDB cluster it "got
> worse", and when we dropped the CTDB cluster down to a single node
> everything started behaving fine - which is where we are now.
>
> So, I've got a bunch of questions!
>
> - does anyone know why ctdb would be spawning these processes, and
> if there's anything we can do to stop it needing to do so? Also -
> any idea how we might reproduce this kind of behaviour in a dev/test
> lab?
>
> Hi Amitay,
>
> It looks like there is contention for some record(s), which results
> in CTDB creating lockwait child processes to wait for the record. I
> would suggest you try CTDB 1.2.61.
>
>
> Is that the current "stable" release? I must admit to getting a bit
> confused around release numbers for ctdb! The sernet release we're
> on has proved to be very stable for us (it can't be said enough -
> thanks Sernet!).
>
> Yes. The current development release is 2.1.
>
> - has anyone done any more general performance / config
> optimisation of CTDB/Samba/GPFS/Linux?
>
>
> For general performance tracking you will have to check if there is
> heavy CPU load, high memory pressure, or lots of processes in wait
> state. That will give you clues as to what the next bottleneck is.
>
>
> From what we could see at the time, I'd have characterised it as
> typical of "lots of processes in wait state", but I couldn't figure
> out what they were waiting for.
>
>
> Yes. Those are lockwait processes waiting for fcntl locks.
Thanks Amitay,

I've got CTDB 1.2.61 built (from the 1.2.40 git branch) and ready to go,
and a recompiled samba built against it too. However, this morning we
tried re-initialising the tdb databases in our two-node cluster, roughly
as sketched below:

- shut down ctdb on one node, move /var/ctdb out of the way, and start
it back up again
- shut down ctdb on the other node, move /var/ctdb out of the way, and
start it back up again
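
In case it's useful to anyone, this is roughly what that amounted to on
each node in turn - the paths and init script name are just what our
install uses, so treat it as a sketch rather than a recipe:

    #!/usr/bin/env python
    # Sketch: re-initialise this node's tdbs by moving /var/ctdb aside
    # while ctdbd is stopped. Paths/service name are assumptions from
    # our setup - and only ever do one node at a time.
    import os
    import subprocess
    import time

    CTDB_DIR = "/var/ctdb"   # volatile tdb directory (assumed location)
    BACKUP = CTDB_DIR + ".old-" + time.strftime("%Y%m%d-%H%M%S")

    subprocess.check_call(["service", "ctdb", "stop"])   # stop ctdbd here
    os.rename(CTDB_DIR, BACKUP)                          # move old tdbs aside
    os.mkdir(CTDB_DIR)                                   # fresh, empty dir
    subprocess.check_call(["service", "ctdb", "start"])  # ctdbd recreates tdbs

    # check the node has come back healthy before touching the next one
    subprocess.check_call(["ctdb", "status"])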
And we've been in production today with no obvious recurrence of the
problem. Of course, that could easily be down to a reduced load (it's
Monday morning!), so we're continuing to monitor things. We have seen a
few spikes in the number of ctdb processes, but they lasted less than 10
seconds, so I don't know if they caused any service "pauses". Associated
with those spikes were these messages from ctdb (we've got it on DEBUG
level logging just now):
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_call.c:475 deferred ctdb_request_call
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_ltdb_server.c:278 PACKET REQUEUED
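
I'm guessing those correspond to the record contention you described,
with the lockwait children sitting in a blocking fcntl() wait on the
contended record. Just to check I've understood the mechanism, a minimal
illustration of that kind of wait (plain Python of my own, not CTDB's
code) would be something like:

    #!/usr/bin/env python
    # Minimal illustration of a blocking fcntl (POSIX record) lock - the
    # same primitive a CTDB lockwait child waits on. Purely illustrative.
    import fcntl
    import os
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/lockdemo.tdb"
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)

    # lockf() issues a blocking fcntl() lock request: if another process
    # already holds an exclusive lock on the first byte, we just sit here -
    # which is the stall clients see while a record is contended.
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, 0)

    print("got the lock on %s" % path)
    os.close(fd)  # closing the fd drops the lock

Run two copies against the same file and the second one blocks until the
first exits and releases the lock.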
We're also seeing loads of these all the time, as one would probably expect:
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_vacuum.c:1484 schedule for deletion: db[locking.tdb] db_id[0x42fe72c5] key_hash[0xd1218614] lmaster[1] migrated_with_data[no]
Apr 15 11:12:52 nasfe03 ctdbd: server/ctdb_vacuum.c:1484 schedule for deletion: db[brlock.tdb] db_id[0x1421fb78] key_hash[0x21dbd0cc] lmaster[1] migrated_with_data[yes]
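
On the process spikes themselves: something along these lines (just a
sketch - the threshold and pgrep pattern are arbitrary choices of mine)
would let us sample the ctdbd process count once a second and line any
spikes up against client "pauses":

    #!/usr/bin/env python
    # Sketch: sample the number of ctdbd processes once a second and log
    # anything above a threshold, so spikes can be matched against pauses.
    import subprocess
    import time

    THRESHOLD = 20  # flag when more than this many ctdbd processes exist

    while True:
        out = subprocess.Popen(["pgrep", "-c", "ctdbd"],
                               stdout=subprocess.PIPE).communicate()[0]
        count = int(out.strip() or 0)
        if count > THRESHOLD:
            print("%s ctdbd count: %d" % (time.strftime("%H:%M:%S"), count))
        time.sleep(1)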
I'll follow up with how things go today - our next intervention is
likely to be to deploy the updated ctdb, as you suggested.
Cheers,
Orlando.
--
Dr Orlando Richards
Information Services
IT Infrastructure Division
Unix Section
Tel: 0131 650 4994
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.