Samba with multichannel and io_uring
axboe at kernel.dk
Thu Oct 15 16:11:02 UTC 2020
On 10/15/20 3:58 AM, Stefan Metzmacher wrote:
> related to my talk at the virtual storage developer conference
> "multichannel / iouring Status Update within Samba"
> I have some additional updates.
> DDN was so kind to sponsor about a week of research on real world
> hardware with 100GBit/s interfaces and two NUMA nodes per server.
> I was able to improve the performance drastically.
> I concentrated on SMB2 read performance, but similar improvements would be expected for write too.
> We used "server multi channel support = yes" and the network interface is RSS capable,
> it means that a Windows client uses 4 connections by default.
> I first tested a share using /dev/shm and the results were really slow,
> it was not possible to reach more than ~30 GBits/s on the net and ~ 3.8 GBytes/s
> from fio.exe.
> smbd uses pread() from within a pthread based threadpool for file io
> and sendmsg() to deliver the response to the socket. All multichannel
> connections are served by the same smbd process (based on the client guid).
> The main smbd is cpu bound and the helper threads also use quite some cpu
> about ~ 600% in total!
> It turns out that NUMA access caused a lot of slow down.
> The network adapter was connected to numa node 1, so we pinned
> the ramdisk and smbd to that node.
> mount -t tmpfs -o size=60g,mpol=bind:1 tmpfs /dev/shm-numanode1
> numactl --cpunodebind=netdev:ens3f0 --membind=netdev:ens3f0 smbd
> With that it was possible to reach ~ 5 GBytes/s from fio.exe
> But the main problem remains: the kernel is busy copying data
> and sendmsg() takes up to 0.5 msecs, which means that we don't process new requests
> during these 0.5 msecs.
> I created a prototype that uses IORING_OP_SENDMSG with IOSQE_ASYNC (I used a 5.8.12 kernel)
> instead of the sync sendmsg() calls, which means that one kernel thread
> (io_wqe_work ~50% cpu) per connection is doing the memory copy to the socket
> and the main smbd only uses ~11% cpu, but we still use > 400% cpu in total.
> But the numa binding for the io_wqe_work thread doesn't seem to work as expected,
> so the results vary between 5.0 GBytes/s and 7.6 GBytes/s, depending on which numa node
> io_wqe_work kernel threads are running. Also note that the threadpool with pread was
> still faster than using IORING_OP_READV towards the filesystem, the reason might also
> be numa dependent.
> The main problem is still copy_user_enhanced_fast_string, so I tried to use
> IORING_OP_SPLICE (from the filesystem via a pipe to the socket) in order to avoid
> copying memory around.
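To make the "filesystem via a pipe to the socket" path concrete, here is a minimal userspace sketch. It uses the synchronous splice(2) syscall rather than io_uring, which is an assumption made to keep it short; the prototype discussed above submits the equivalent operations through the ring. The helper name splice_file_to_socket() is hypothetical and is not taken from the Samba patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>   /* for callers that create sock_fd */
#include <sys/types.h>
#include <unistd.h>

/*
 * Forward up to `len` bytes from file_fd (starting at *off) to sock_fd
 * through an intermediate pipe, so the payload is not copied through a
 * user-space buffer. *off is advanced by the kernel.
 * Returns the number of bytes forwarded, or -1 with errno set.
 * Hypothetical helper for illustration only.
 */
ssize_t splice_file_to_socket(int file_fd, off64_t *off, int sock_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    assert(off != NULL);
    if (pipe(p) < 0)
        return -1;

    while ((size_t)total < len) {
        /* Stage 1: filesystem -> pipe (may return less than requested). */
        ssize_t in = splice(file_fd, off, p[1], NULL,
                            len - (size_t)total, SPLICE_F_MOVE);
        if (in < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        if (in == 0)
            break; /* end of file */

        /* Stage 2: pipe -> socket; drain everything stage 1 queued. */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(p[0], NULL, sock_fd, NULL,
                                 (size_t)left, SPLICE_F_MOVE);
            if (out < 0) {
                if (errno == EINTR)
                    continue;
                goto fail;
            }
            if (out == 0) {
                errno = EPIPE;
                goto fail;
            }
            left -= out;
            total += out;
        }
    }

    close(p[0]);
    close(p[1]);
    return total;

fail:
    {
        int saved = errno; /* keep the splice() error across close() */
        close(p[0]);
        close(p[1]);
        errno = saved;
    }
    return -1;
}
```

A caller would open the file, set an off64_t to the read offset, and pass a connected stream socket as sock_fd. Each stage blocks in this sketch, which is exactly what the io_uring variant avoids for the main smbd thread.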
> With that I was able to reduce the cpu usage of the main smbd to ~6% cpu with
> io_wqe_work threads using between ~3-6% cpu (filesystem to pipe) and
> 6-30% cpu (pipe to socket).
> But the Windows client wasn't able to reach better numbers than 7.6 GBytes/s (65 GBits/s).
> Only using "Set-SmbClientConfiguration -ConnectionCountPerRssNetworkInterface 16" helped to
> get up to 8.9 GBytes/s (76 GBits/s).
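The paired figures above (7.6 GBytes/s with 65 GBits/s, 8.9 GBytes/s with 76 GBits/s) line up if the byte rate is counted in binary units (2^30 bytes) and the bit rate in decimal units (10^9 bits). That reading is an inference from the numbers, not something the mail states. Under that assumption the conversion can be sketched as:

```c
#include <assert.h>

/*
 * Convert a rate in GiB/s (2^30 bytes per second) to decimal Gbit/s
 * (10^9 bits per second). The binary/decimal split is an assumption
 * inferred from the figures quoted in the mail.
 */
double gib_per_s_to_gbit_per_s(double gib_per_s)
{
    assert(gib_per_s >= 0.0);
    return gib_per_s * 1073741824.0 * 8.0 / 1e9;
}
```

With this, 7.6 maps to roughly 65.3 and 8.9 to roughly 76.5, consistent with the rounded values reported for the Windows client.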
> With 8 MByte IOs smbd is quite idle at ~ 5% cpu with the io_wqe_work threads ~100% cpu in total.
> With 512 KByte IOs smbd uses ~56% cpu with the io_wqe_work threads ~130% cpu in total.
> With 256 KByte IOs smbd uses ~87% cpu with the io_wqe_work threads ~180% cpu in total.
> In order to get higher numbers I also tested with smbclient.
> - With the default configuration (sendmsg and threadpool pread) I was able to get
> 4.2 GBytes/s over a single connection, while smbd with all threads uses ~150% cpu.
> - With IORING_OP_SPLICE I was able to get 5 GBytes/s over a single connection,
> while smbd uses ~ 6% cpu, with 2 io_wqe_work threads (filesystem to pipe) at 5.3% cpu each +
> 1 io_wqe_work thread (pipe to socket) at ~29% cpu. This is only ~55% cpu in total on the server
> and the client is the bottleneck here.
> - With a modified smbclient using a forced client guid I used 4 connections into
> a single smbd on the server. With that I was able to reach ~ 11 GBytes/s (92 GBits/s)
> (This is similar to what 4 iperf instances are able to reach).
> The main smbd uses 8.6 % cpu with 4 io_wqe_work threads (pipe to socket) at ~20% cpu each.
> - With 8 smbclient instances over loopback we are able to reach ~ 22 GBytes/s (180 GBits/s)
> and smbd uses 22 % cpu.
> So IORING_OP_SPLICE will bring us into very good shape for streaming reads.
> Also note that numa pinning is not really needed here as the memory is not touched at all.
> It's very likely that IORING_OP_RECVMSG in combination with IORING_OP_SPLICE would also improve the write path.
> Using AF_KCM socket (Kernel Connection Multiplexor) as wrapper to the
> (TCP) stream socket might be able to avoid wakeups for incoming packets and
> should allow better buffer management for incoming packets within smbd.
> The prototype/work in progress patches are available here:
> Also notice the missing generic multichannel things via this meta bug:
> I'm not sure when all this will be production ready, but it's great to know
> the potential we have on a modern Linux kernel!
> Later SMB-Direct should be able to reduce the cpu load of the io_wqe_work threads (pipe to socket)...
Thanks for sending this, very interesting! As per this email, I took a
look at the NUMA bindings. If you can, please try this one-liner below.
I'd be interested to know if that removes the fluctuations you're seeing
due to bad locality.
Looks like kthread_create_on_node() doesn't actually do anything (at
least in terms of binding).
diff --git a/fs/io-wq.c b/fs/io-wq.c
index 74b84e8562fb..7bebb198b3df 100644
@@ -676,6 +676,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
+ kthread_bind_mask(worker->task, cpumask_of_node(wqe->node));
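
For readers who want to experiment with the same idea outside the kernel, a rough userspace analogue of binding a worker to the CPUs of one node is sketched below. It reads the node's CPU list from sysfs and applies it with sched_setaffinity(). This is not what the one-liner does internally; the helper names parse_cpulist() and bind_self_to_node() are hypothetical, and the sysfs path assumes a standard Linux NUMA layout.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a sysfs-style CPU list such as "0-3,8-11" into `set`.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_cpulist(const char *s, cpu_set_t *set)
{
    assert(s != NULL && set != NULL);
    CPU_ZERO(set);

    while (*s && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        if (end == s || lo < 0)
            return -1;

        long hi = lo;
        if (*end == '-') {          /* range "lo-hi" */
            s = end + 1;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
        }
        if (hi >= CPU_SETSIZE)
            return -1;

        for (long c = lo; c <= hi; c++)
            CPU_SET((int)c, set);

        s = end;
        if (*s == ',')
            s++;
        else if (*s && *s != '\n')
            return -1;
    }
    return 0;
}

/*
 * Pin the calling thread to the CPUs of NUMA node `node`.
 * Returns 0 on success, -1 on failure.
 */
int bind_self_to_node(int node)
{
    char path[64], buf[4096];
    cpu_set_t set;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/cpulist", node);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *line = fgets(buf, sizeof(buf), f);
    fclose(f);

    if (!line || parse_cpulist(buf, &set) != 0)
        return -1;
    if (CPU_COUNT(&set) == 0)
        return -1;                  /* node without CPUs */

    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling bind_self_to_node(1) from a worker thread would restrict that thread to node 1's CPUs, which is the locality the numactl invocation above aims for at process level.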