Work in progress SMB-Direct driver for the linux kernel

Stefan Metzmacher metze at samba.org
Wed Apr 4 14:30:58 UTC 2018


Hi Tom,

>>>>> The first goal is to provide a socket fd to userspace (or in kernel
>>>>> consumers)
>>>>> which provides semantics like a TCP socket which is used as transport
>>>>> for SMB3. Basically frames are submitted with a 4 byte length header.

I was able to fix the kernel freezes (a 100% cpu loop in the kernel is
not a good idea).

>>>> Part of the point of RDMA is that we don't need to make protocol
>>>> specific kernel modules like this - is there a specific reason this
>>>> needs to be in the kernel like this?
>>>
>>> If I had to guess it would be because Samba currently uses a fork
>>> model ... it might be years before it gets to a completely threaded
>>> model.
>>
>> Yes, and it also means that our client and server code only need
>> minimal changes in order to work in the same way it would work
>> over tcp.

With these minimal changes to the userspace code I was able to use
smbclient over soft iWARP or soft RoCE (I guess it'll also work with
real hardware).

https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-smbdirect
https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=957c01705a97
https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=244a10b272b5
https://git.samba.org/?p=metze/samba/wip.git;a=commitdiff;h=d2dc5bac16eb

(The magic with port 5445 is only to get things going; we'll need a
better strategy in the future so that we can also use RDMA with port 445
for RoCE and InfiniBand.)

Note that smbclient just works as if on top of a TCP socket here,
there's no RDMA READ/WRITE...
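
To illustrate the TCP-like semantics mentioned at the top (frames
submitted with a 4 byte length header), here is a minimal userspace
sketch in Python. It is not taken from the driver or from the Samba
patches. The function names and the 24-bit length limit (top byte zero,
as in SMB over TCP) are assumptions made for the example.

```python
import struct

# Assumption: the top byte of the 4 byte header is zero, so the usable
# length is 24 bits, as in SMB over TCP.
MAX_FRAME = 0x00FFFFFF


def pack_frame(payload: bytes) -> bytes:
    """Prefix one payload with a 4 byte big-endian length header."""
    if len(payload) > MAX_FRAME:
        raise ValueError("payload too large for a single frame")
    return struct.pack(">I", len(payload)) + payload


def unpack_frames(buf: bytes):
    """Split a byte stream into complete frames.

    Returns (frames, rest), where rest holds the bytes of a trailing
    incomplete frame that needs more data before it can be parsed.
    """
    frames = []
    pos = 0
    while len(buf) - pos >= 4:
        (length,) = struct.unpack_from(">I", buf, pos)
        if length > MAX_FRAME:
            raise ValueError("invalid frame length")
        if len(buf) - pos - 4 < length:
            break  # header seen, payload not complete yet
        frames.append(buf[pos + 4:pos + 4 + length])
        pos += 4 + length
    return frames, buf[pos:]
```

A receiver would keep `rest` around and prepend it to the next chunk it
reads from the socket.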

>> Only the RDMA read and writes need some more work, but I have
>> some ideas where the userspace gives the kernel an fd, offset and length
>> plus a remote memory descriptor as ioctl on the connection fd. Then the
>> kernel can get the content from the filesystem and directly pass it to
>> the rdma adapter, avoiding the copy from kernel to userspace and back.
>> From userspace we'll just wait in the syscall and don't have to care
>> about memory registrations and all other complex stuff.
> 
> Doesn't this sort of transport shimming put back all the overhead it was
> trying to avoid? Stripping off the 4-byte record marker, rearranging the
> read/write data and SMB3_READ operation header to add the channel
> (memory registration) handles, and most importantly placing the data
> in bounce buffers to accommodate the readv()/writev() calls are quite
> complex and expensive. And, just to present a file descriptor? 

I'm not sure. I'd say that the overhead should be smaller, as it avoids
the complex TCP layer in the kernel. And the RDMA READ/WRITE operations
will not use readv/writev calls; they will be optimized ioctl calls.
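
As a rough illustration of the ioctl idea quoted above (userspace hands
the kernel an fd, offset and length plus a remote memory descriptor),
the argument could look something like the following Python sketch. The
layout, field order and all names here are hypothetical, and no such
ioctl exists yet. The remote descriptor fields (offset, token, length)
follow the SMB Direct buffer descriptor; the little-endian packing is an
assumption chosen only to keep the example deterministic.

```python
import struct
from typing import NamedTuple


class RemoteBuffer(NamedTuple):
    """Remote memory descriptor as advertised by the peer."""
    offset: int  # 64-bit offset inside the registered remote region
    token: int   # 32-bit steering tag for the remote region
    length: int  # 32-bit size of the remote region in bytes


# Hypothetical argument layout: int32 file_fd, uint32 length,
# uint64 file_offset, then the remote descriptor (u64, u32, u32).
REQUEST = struct.Struct("<iIQQII")


def pack_rdma_request(file_fd, file_offset, length, remote):
    """Build the (hypothetical) ioctl argument for one transfer."""
    if length > remote.length:
        raise ValueError("transfer does not fit into the remote buffer")
    return REQUEST.pack(file_fd, length, file_offset,
                        remote.offset, remote.token, remote.length)


def unpack_rdma_request(raw):
    """Decode the argument again, as the kernel side would."""
    fd, length, file_offset, r_off, r_tok, r_len = REQUEST.unpack(raw)
    return fd, file_offset, length, RemoteBuffer(r_off, r_tok, r_len)
```

The point of such an interface is that userspace only describes the
transfer and then waits in the syscall; memory registration and the
RDMA work requests stay in the kernel.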

And the really good thing is that smbd needs minimal changes
and it will at least be faster than tcp.

Once we have a working solution we can further optimize things.

> Experience in early NFS/RDMA and Windows Sockets Direct have taught
> that transparency above the RDMA transport interface is generally the
> enemy of performance. The shims are forced to perform additional syscalls,
> RDMA work requests, and sometimes even network round trips. Do you
> have performance results for yours?

No, I just got smbclient connecting multiple times without freezing the
kernel:-)

>> It also happens that smbd sometimes blocks in syscalls like unlink for
>> a long time. It's good to have the kernel as 2nd entity that takes care
>> of keepalives.
> 
> I agree that implementing SMB Direct in your userspace SMB3 daemon
> may be problematic. But what of the existing SMB Direct code in the
> CIFS kernel client? How will that coexist going forward?

I'm not sure; the most important thing is that it's already in the
upstream kernel and seems to work quite well.

Once I have smbd and smbclient working with RDMA read/write, we can
think about merging things together in the kernel.
The good thing would be that we can easily compare my code with the
existing code. And merging only makes sense if my code isn't slower.

But this is currently just a spare time project and it might take
months for the next coding sprint to happen.

metze
