[RFC] Performance improvements

Amitay Isaacs amitay at gmail.com
Wed Jul 18 05:46:45 UTC 2018


On Tue, Jul 17, 2018 at 9:07 PM, Swen Schillig via samba-technical
<samba-technical at lists.samba.org> wrote:
> Hi Martin
>
> As suggested by you, I added some debug messages to see how CTDB
> performs under high load in the area of change.
>
> The code is updated as discussed before
>         - no patch-10 (leaving tevent immediate scheduling as is)
>         - no alignment patches
>         - buffer_size raised to 32k (see comment below)
>
> As a preliminary statement: even though the debug output is slowing
> down the system, it can be seen that the positive effects reported
> earlier still exist.
>
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2240 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=59720 used=16 offset=59704 ready=47672 max_offset=2047816 max_ready=644680
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=80 max_offset=0 max_ready=200
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=8640 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=200 max_offset=0 max_ready=200
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2160 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3280 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=1520 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2320 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=5600 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2240 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=7360 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=107464 used=64 offset=107400 ready=30608 max_offset=1906056 max_ready=682248
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3840 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3680 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3760 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=6240 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=4400 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=200 max_offset=0 max_ready=200
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3360 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=5040 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=5200 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2720 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3600 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=400 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=6240 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=7360 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=1600 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=7200 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=5760 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3440 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2000 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=640 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=4560 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=1840 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=7280 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=1360 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3680 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2320 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=1760 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=640 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2720 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=3520 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=1920 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2000 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=640 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=2480 max_offset=0 max_ready=21840
> Jul 17 12:06:03 fscc-x36m3-32 ctdbd[10007]: buffer size=32768 used=0 offset=0 ready=4480 max_offset=0 max_ready=21840
>
> As you can see, there is almost always more than 1k of data on the socket, but only rarely more than 8k (the "ready" value).
> In addition, we can see that the processing of the received data is almost always fully done
> before new data is read from the socket (offset == 0).
> I will try to put some higher load (overload?) on the system and see if things change.

I have spent a few years trying to figure out a better way of dealing
with this problem (maybe I am just too dumb!).  Just increasing the
buffer size is *definitely* the wrong thing to do here.  There are a
couple of reasons for this.

1. It is not clear from the output whether it relates to a single fd
or multiple fds, but that is irrelevant.  On a large system, you can
have thousands of smbd processes connected to ctdb.  Each smbd opens
multiple connections to ctdb (last time I checked it was 3, but it
could be down to 2).  This means a huge increase in the amount of
memory pre-allocated (going from 1k to 32k is a 32x increase).  That
may not be a problem for big servers, but it could be an issue for
embedded systems.
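
To make the memory arithmetic concrete, here is a rough
back-of-envelope sketch; the client and connection counts are
illustrative assumptions, not measured values:

#include <stdio.h>

int main(void)
{
    size_t num_smbd = 2000;     /* assumed smbd clients on a large node */
    size_t conns_per_smbd = 3;  /* connections per smbd, as above */
    size_t old_buf = 1024;      /* 1k pre-allocated buffer */
    size_t new_buf = 32768;     /* 32k pre-allocated buffer */
    size_t fds = num_smbd * conns_per_smbd;

    printf("fds=%zu old=%zu KiB new=%zu KiB\n",
           fds, fds * old_buf / 1024, fds * new_buf / 1024);
    /* 6000 fds: 6000 KiB (~6 MiB) at 1k vs 192000 KiB (~188 MiB)
     * at 32k, before a single packet has arrived. */
    return 0;
}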

2. The main issue is unfair scheduling and, in the worst case,
starvation (which often shows up as the "Handling event took xx
seconds" issue).  This has to do with the way tevent handles events.

If you read a large buffer from an fd, then it potentially contains a
large number of packets.  queue_process() schedules an immediate event
if there is more than one packet in the buffer.  If you look at the
way loop_once() is coded, it will first try to finish all the pending
immediate and timer events before processing fd events.  So a large
buffer (generally) results in a large number of packets, and ctdb
spends more time processing them before it can do anything else.
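
A simplified sketch of that scheduling order; this is only the shape
of loop_once() as described above, not the actual tevent source, and
the helper names are made up:

#include <stdbool.h>

struct tevent_context;

/* Hypothetical stand-ins for tevent internals. */
extern bool run_one_immediate(struct tevent_context *ev);
extern bool run_one_timer(struct tevent_context *ev);
extern int wait_and_run_fd_events(struct tevent_context *ev);

static int loop_once_sketch(struct tevent_context *ev)
{
    /* 1. A pending immediate event runs first... */
    if (run_one_immediate(ev)) {
        return 0;
    }
    /* 2. ...then an expired timer event... */
    if (run_one_timer(ev)) {
        return 0;
    }
    /* 3. ...and only then are fd events polled and dispatched.
     * Since queue_process() schedules immediates for the extra
     * packets, a full buffer keeps steps 1-2 busy for many loop
     * iterations before the fds are looked at again. */
    return wait_and_run_fd_events(ev);
}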

The next problem is that if lots of smbd processes are sending packets
to ctdb, then epoll_wait() in the event_loop() function can return
multiple fds (these can be multiple smbd clients and the TCP sockets
to other nodes).  event_loop() processes each fd in a loop, which ends
up calling queue_process() in ctdb.  This means that for each fd,
queue_process() sets up immediate events, so ctdb has to do a lot more
work before the next epoll_wait() gets called.  In such a case, if
epoll_wait() did not return any of the TCP sockets for the other
nodes, the packets from those nodes will not get processed until all
the packets for all the fds (returned by epoll_wait) are processed.
This is usually the culprit for "Handling event took xx seconds"
(where xx is anywhere from tens to hundreds of seconds).  It
unnecessarily causes a failover, because one or more nodes think that
the node is unreachable.  This is the worst-case starvation due to
unfair scheduling of fd events.
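
For reference, the dispatch pattern looks roughly like this;
dispatch_once() and handle_fd_readable() are illustrative names, not
the real tevent/ctdb symbols:

#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Hypothetical stand-in for the ctdb queue read callback. */
extern void handle_fd_readable(int fd);

void dispatch_once(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    int i, n;

    n = epoll_wait(epfd, events, MAX_EVENTS, -1);

    for (i = 0; i < n; i++) {
        /* Each readable fd drains its buffer and, via
         * queue_process(), schedules immediate events for the
         * extra packets.  All of those immediates run before the
         * next epoll_wait(), so a node socket that missed this
         * batch waits for the whole backlog to drain. */
        handle_fd_readable(events[i].data.fd);
    }
}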

This problem is further complicated by timer events that require ctdb
to fork processes, which causes additional delay in getting back to
epoll_wait() for pending fd events.

i. CTDB runs the monitor event every few seconds, and that means
forking to run 20-odd event scripts.  On Linux, fork() can be very
expensive if a process has lots of mapped pages.  This was solved
first by replacing fork() with vfork() and an event helper (see the
sketch after this list), and later by creating a dedicated event
daemon.

ii. CTDB runs vacuuming every few seconds for each volatile database.
This is still done by fork(), and on a busy system fork() can take a
while.  The problem has been partially dealt with by serializing
vacuuming and allowing only a single vacuuming process to run at a
time.  However, this is only a band-aid and not a real fix.

iii. There are a few other instances where CTDB creates processes
during normal operation, for example persistent database transactions,
cluster-wide database traverses, etc.
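
For (i), the vfork()+helper pattern avoids the cost because the child
does nothing but exec, so the parent's large address space is never
copied.  A minimal sketch; the helper path here is illustrative, not
the actual ctdb helper location:

#include <sys/types.h>
#include <unistd.h>

pid_t run_event_helper(const char *event)
{
    pid_t pid = vfork();

    if (pid == 0) {
        /* Child: only exec*() or _exit() are safe after vfork();
         * exec replaces the image immediately, so no page tables
         * are duplicated. */
        execl("/usr/libexec/ctdb_event_helper",
              "ctdb_event_helper", event, (char *)NULL);
        _exit(127);
    }
    return pid;  /* parent reaps it later with waitpid() */
}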

There are two competing goals when trying to process the packets:
 - reducing the number of packets processed per fd, and
 - avoiding too many ioctl() and read() calls to process the packets
on each fd; each buffer read costs one ioctl() and one read() (see
the sketch below).
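
That per-buffer read looks roughly like this (simplified: the real
queue code also grows the buffer and handles partial reads and
errors):

#include <sys/ioctl.h>
#include <unistd.h>

ssize_t read_available(int fd, char *buf, size_t buflen)
{
    int avail = 0;

    /* One ioctl() to ask how many bytes are ready... */
    if (ioctl(fd, FIONREAD, &avail) != 0 || avail <= 0) {
        return -1;
    }
    if ((size_t)avail > buflen) {
        avail = (int)buflen;
    }
    /* ...and one read() to pull them all in at once. */
    return read(fd, buf, (size_t)avail);
}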

Any change in the buffer size is a compromise between the two goals
(number of packets to process versus number of system calls).  This
is the reasoning behind reducing the buffer size to 1k: most of the
packets are between 64 and 128 bytes, so a 1k buffer holds a maximum
of 16 packets, which is much better than >100 packets.  I'm sure we
can reduce it further.

In all the new code that is being written using tevent_req, only a
single packet is read at a time.  I am sure that is the best way to
avoid unfair scheduling completely.  However, I am yet to be convinced
that it will improve performance, as each packet read now requires two
ioctl() and two read() calls.
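
As a rough illustration of the single-packet approach, here is a
simplified, blocking sketch (the tevent_req code does this
asynchronously, and this glosses over partial reads, length
validation and the availability checks):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* ctdb packets begin with a 32-bit length field that covers the
 * whole packet. */
uint8_t *read_one_packet(int fd, uint32_t *len_out)
{
    uint32_t len;
    uint8_t *pkt;

    /* First read: the length header only. */
    if (read(fd, &len, sizeof(len)) != sizeof(len) ||
        len < sizeof(len)) {
        return NULL;
    }
    pkt = malloc(len);
    if (pkt == NULL) {
        return NULL;
    }
    memcpy(pkt, &len, sizeof(len));
    /* Second read: the rest of this packet only; any following
     * packet stays on the socket for the next event loop pass. */
    if (read(fd, pkt + sizeof(len), len - sizeof(len)) !=
        (ssize_t)(len - sizeof(len))) {
        free(pkt);
        return NULL;
    }
    *len_out = len;
    return pkt;
}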

So any proposal to increase the buffer size upstream will get a strong
NACK unless the way ctdb handles packets using tevent is substantially
improved.  You are free to change the buffer size in your product for
a specific workload.

Many of the issues mentioned above will be solved by the ongoing ctdb
re-design.  The crucial cluster-wide communication is decoupled from a
single process that has to deal with thousands of fds.  The problem of
efficiently processing packets from thousands of fds still remains for
the database daemon, but it is much less critical if the database
daemon is only doing database operations and is not involved in
important ctdb functions.  Avoiding fork() in the database daemon is
still a challenging problem, but it can be solved using vfork() and
helpers.  Vacuuming will be done by a dedicated daemon, which
completely avoids the problem of fork() from the database daemon.

Amitay.


