SMB Direct Implementation via Linux Device Driver for Samba Design Doc, final?

Stefan (metze) Metzmacher metze at samba.org
Mon Aug 26 01:42:58 MDT 2013


Hi Richard,

> Sorry for Spamming you yet again on this.
> 
> I have received feedback from several people, and have modified things
> to incorporate that feedback.
> 
> I am now ready to start coding so I don't expect to be sending this out again.
> 
> I am aware that there is another effort under way to implement this,
> and I think that is healthy.

Which one? I know only of my own work (based on librdmacm and libibverbs),
which is mainly for exploring the protocol, as it's for the client side only.
There's no fork or fd-passing support in libibverbs.

> I have included Samba Technical as well so it receives wider
> distribution. Some of you are not on Samba Technical, which might
> cause bounces if you reply to a reply.

Thanks, I think it's the best place to discuss things like this...

> When Windows uses SMB Direct, or SMB over RDMA, it does so in a way that is 
> not easy to integrate into Samba as it exists today.
> 
> Samba uses a forking model of handling connections from Windows clients. The
> master smbd listens for new connections and forks a new smbd before handling
> any SMB PDUs. It is the new smbd process that handles all PDUs on the new
> connection.
> 
> Please see the documents [MS-SMB2].pdf and [MS-SMBD].pdf for more details about
> SMB Over additional channels and the SMB Direct protocol. However, in brief,
> what happens is the following:
> 
> 1. The client establishes an SMB connection over TCP to a server. For Samba,
>    this involves the forking of a new process.
> 
> 2. The client NEGOTIATES the protocol and then does a SESSION SETUP. If this
>    is successful, the client now has a Session ID it will use in establishing
>    additional channels, including any via SMB Direct (RDMA).
> 
> 3. The client uses a TREE CONNECT request to connect to a share.
> 
> 4. The client issues an FSCTL_QUERY_NETWORK_INTERFACE_INFO IOCTL to determine
>    what interfaces are available.
> 
> 5. If there are any RDMA interfaces in common between the client and the 
>    server, and the server supports MULTI_CHANNEL, the client initiates an 
>    RDMA connection to the server.
> 
> 6. The client then sends a NEGOTIATE requesting SMB3.0 and above as well as
>    support for MULTI_CHANNEL.
> 
> 7. If that succeeded, the client then sends a SESSION_SETUP and specifies
>    SMB2_SESSION_FLAG_BINDING along with the Session ID obtained on the first
>    connection.
> 
> At this point, we now have an RDMA channel between the client and server.

I think there's one more important detail: the client_guid sent in the
negprot request should identify the client.

> SMB Direct actually involves a small protocol but the details are not relevant
> here and can be read about in [MS-SMBD].pdf.
> 
> There is a problem here for Samba in handling any form of MULTI_CHANNEL support
> but there is an even bigger problem in handling SMB Direct.

I don't think there's a much bigger problem with SMB Direct; most of
the work to get MULTI_CHANNEL to work is really protocol independent.
The largest problem is exit_server*(), which exits the whole process if
an incoming connection terminates. We also have some smb_panic() calls,
which could also cause problems.
In general we're not prepared to handle more than one connection per
process. Volker and I started to remove global variables and clean
things up to take context pointers, but this work is far from being
finished...

> The problem for MULTI_CHANNEL is that we cannot determine which smbd should 
> handle the new channel (be it TCP or RDMA based) until we have seen the 
> SESSION_SETUP request on the new channel. In addition, Windows clients always
> connect on port 445 for TCP and 5445 for SMB Direct.
> 
> Here, I only want to handle SMB Direct and not generic MULTI_CHANNEL. However,
> to fully support generic MULTI_CHANNEL would require that Samba defer passing
> a new TCP connection to a subsidiary smbd until it determines that the
> connection is not destined to join an existing session.

I think if we base this on the client_guid, we can have the smbd which
already handles this client listen on a unix domain socket with the
client_guid as name.

The smbd which accepts a new connection looks for an existing socket
with the client_guid as name. If there is a valid socket, it would use
sendmsg() and pass the received SMB2 Negprot request together with some
metadata and one or more file descriptors to the other smbd, which
already handles that client. The current process will then exit on
success.

When we do this at the SMB2 Negprot stage, we don't have any
session/tree/open IDs or other state in memory, so it should be
relatively easy to transfer the connection with its relatively small
state to another process. This avoids the need for layer violations and
keeps things simple.
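
To illustrate just the fd-passing part, here is a minimal sketch (plain
POSIX; the function name pass_connection_fd and the payload layout are
only for illustration) of how the accepting smbd could hand the
connection over to the smbd that owns the client_guid socket:

  /* Minimal sketch: pass a connection fd plus the buffered SMB2 Negprot
   * PDU over a unix domain socket named after the client_guid.  The
   * function name and the payload layout are made up for illustration. */
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <sys/un.h>
  #include <string.h>
  #include <unistd.h>

  static int pass_connection_fd(const char *guid_path, int conn_fd,
                                const void *negprot, size_t negprot_len)
  {
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      struct iovec iov = {
          .iov_base = (void *)negprot,
          .iov_len  = negprot_len,
      };
      union {
          struct cmsghdr hdr;
          char buf[CMSG_SPACE(sizeof(int))];
      } cmsgbuf;
      struct msghdr msg = {
          .msg_iov        = &iov,
          .msg_iovlen     = 1,
          .msg_control    = cmsgbuf.buf,
          .msg_controllen = sizeof(cmsgbuf.buf),
      };
      struct cmsghdr *cmsg;
      int sock;

      strncpy(addr.sun_path, guid_path, sizeof(addr.sun_path) - 1);

      sock = socket(AF_UNIX, SOCK_STREAM, 0);
      if (sock == -1) {
          return -1;
      }
      if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
          close(sock);
          return -1;
      }

      /* attach the connection fd as ancillary data (SCM_RIGHTS) */
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type  = SCM_RIGHTS;
      cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &conn_fd, sizeof(int));

      if (sendmsg(sock, &msg, 0) == -1) {
          close(sock);
          return -1;
      }
      close(sock);
      return 0;
  }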

> In a like manner, we cannot hand an RDMA connection to an existing smbd until
> we have determined which session it wishes to join.
> 
> However, there is an additional issue with RDMA. The RDMA connections have to 
> be terminated in a single process (as only one process can listen on port 5445)
> but then they would have to be transferred to the process that should control 
> that connection, but only after some RDMA RECVs and RDMA SENDs have occurred.
> 
> I am told by Mellanox folks that there is no real support for transferring all
> the RDMA state between processes at this stage.

Yes, that's the real problem with doing RDMA in userspace with
libibverbs/librdmacm.

> Another approach would be to have a single process responsible for all RDMA 
> handling and have the smbd's communicate with that process about new incoming
> connections and reads and writes. While this would work, and could eliminate
> multiple copies of the data with shared memory, it would involve a context 
> switch for most if not all RDMA transfers.

The only reason for doing this would be to research how the protocol works...

> A LINUX KERNEL SMB DIRECT MODULE
> 
> An alternative model is to develop a Linux kernel driver to handle RDMA
> connections.
>
> While this approach locks us into Linux for the moment, it seems to be a
> useful alternative.

This is an excellent idea! I think this is the only way to implement
SMB Direct in Samba, where we would get a performance win. We could
also add some shortcuts to do something like sendfile/recvfile (not
copying file data into userspace).

With libibverbs we would still have to copy file content into userspace
buffers, which will be used by the RDMA hardware.

A generic SMB Direct kernel implementation would also make sense for
the client side:
a) for testing
b) to handle keepalive messages at the kernel level; this is important
   as smbclient uses sync function calls, which means the low level
   tevent loop that could handle the keepalives from the server is only
   active during the function call, not when the connection is idle.

If we should then also have softiwarp available in the kernel, it would
even be possible to develop everything without the need for RDMA
hardware, which would be a big improvement, at least for me...

> It would function somewhat like this:
> 
> The smbdirect device driver would be a character device driver and would be 
> loaded after any drivers for the RDMA cards and ipoib.
> 
> When Samba starts, it would attempt to open the device driver, and if 
> successful, would call an IOCTL to set the basic SMB Direct parameters, as
> explained below. This would allow the smbdirect driver to start accepting 
> incoming connections.
> 
> When an smbd gets to the point of accepting a SESSION SETUP request it would
> call another IOCTL against the driver to register this session with the 
> driver.
> 
> The driver would accept all incoming SMB Direct RDMA connections via the 
> connection manager and would:
> 
> 1. Initialize the SMB Direct protocol
> 
> 2. Handle the NEGOTIATE request once the SMB Direct protocol engine is running
> 
> 3. Accept the SESSION_SETUP request, and if it matches a registered Session ID
>    of an established session, would pass the request to the smbd that owns
>    that session. Otherwise it would reject the SESSION setup and drop the 
>    connection.
> 
> This is discussed in more detail below.

This seems like a layer violation, which makes the task much more
complex than it has to be.

I'd prefer to do something like this:

- have a way to enumerate all SMB Direct network interfaces, together
  with details like link status, link speed and IP address(es)

- have a way to tell the driver that it should listen for incoming
  traffic on specified interface(s) or ALL. The caller also needs to
  specify the desired options for the SMB Direct layer (max credits,
  max send/recv size, max fragmented size, keepalive interval, ...).
  And export some kind of fd which is pollable for incoming connections.

At that stage the parent smbd can do similar things as for TCP
sockets: it just looks into its configuration, sets up the listening
SMB Direct "sockets" and uses tevent_add_fd for READ to get notified
when a new connection arrives (a rough sketch follows below).

- we need a way to call "accept", which will return some kind of fd
  which represents the new connection; it should also return the
  negotiated parameters of the SMB Direct connection and the addresses
  of both peers.

At that stage smbd can accept the connection similar to TCP sockets
and fork a child. But Samba still needs an smb_transport abstraction,
as the communication with the SMB Direct kernel driver is likely to be
different compared to a STREAM socket. The transport abstraction should
allow sending and receiving PDU BLOBs, while the NBT/length header or
the SMB Direct headers are hidden.

The important thing is that it has to be possible to transfer the SMB
Direct fd to another smbd using sendmsg().
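
To make that idea a bit more concrete, here is a rough sketch of how
the parent smbd could drive such an interface. Everything prefixed with
SMBDIRECT_ below, including the structs and ioctl numbers, is invented
purely for illustration; only tevent_add_fd() is the existing Samba
event API:

  /* Sketch only: the SMBDIRECT_* ioctls and structs do not exist, they
   * just illustrate a socket-like listen/accept flow on the device fd. */
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <tevent.h>

  struct smbdirect_listen_params {          /* hypothetical */
      uint16_t receive_credit_max;
      uint16_t send_credit_max;
      uint32_t max_send_size;
      uint32_t max_receive_size;
      uint32_t max_fragmented_size;
      uint32_t keepalive_interval;
  };

  struct smbdirect_accept_info {            /* hypothetical */
      struct smbdirect_listen_params negotiated;
      struct sockaddr_storage local_addr;
      struct sockaddr_storage remote_addr;
  };

  #define SMBDIRECT_IOC_LISTEN _IOW('D', 1, struct smbdirect_listen_params)
  #define SMBDIRECT_IOC_ACCEPT _IOR('D', 2, struct smbdirect_accept_info)

  static void smbdirect_listener_handler(struct tevent_context *ev,
                                         struct tevent_fd *fde,
                                         uint16_t flags,
                                         void *private_data)
  {
      int listen_fd = *(int *)private_data;
      struct smbdirect_accept_info info;
      int conn_fd;

      /* "accept": returns a new fd representing the connection plus the
       * negotiated SMB Direct parameters and both peer addresses */
      conn_fd = ioctl(listen_fd, SMBDIRECT_IOC_ACCEPT, &info);
      if (conn_fd == -1) {
          return;
      }

      /* here smbd would fork a child for conn_fd, or later hand it to
       * another smbd via sendmsg(), just like a TCP socket */
      close(conn_fd);
  }

  static int smbdirect_setup_listener(struct tevent_context *ev, int *dev_fd)
  {
      struct smbdirect_listen_params params = {
          .receive_credit_max  = 255,
          .send_credit_max     = 255,
          .max_send_size       = 1364,
          .max_receive_size    = 8192,
          .max_fragmented_size = 1024 * 1024,
          .keepalive_interval  = 120,
      };

      if (ioctl(*dev_fd, SMBDIRECT_IOC_LISTEN, &params) == -1) {
          return -1;
      }

      /* let tevent tell us when a new connection can be accepted */
      if (tevent_add_fd(ev, ev, *dev_fd, TEVENT_FD_READ,
                        smbdirect_listener_handler, dev_fd) == NULL) {
          return -1;
      }
      return 0;
  }

The forked child (or the smbd that later receives the fd via sendmsg())
would then use that connection fd behind the smb_transport abstraction,
just like a TCP socket fd today.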

> STEPS TO BE TAKEN BY SAMBA
> 
> 1. When an smbd successfully handles a SESSION SETUP, and the smbdirect 
>    driver has been successfully opened, it will call an IOCTL on the device
>    to register the current Session ID. It would also enable the device as
>    a FD to be monitored by tevent using tevent_add_fd for READ and possibly
>    WRITE events.

> 2. When an FSCTL_QUERY_NETWORK_INTERFACE_INFO IOCTL request is received, it 
>    will respond with the IP address(es) of all the RDMA interfaces as specified
>    in [MS-SMB2].pdf.

This should just enumerate all network interfaces and SMB Direct
interfaces and use the same logic that was used to decide on which
interfaces we want to listen. The difference would be to also take the
link state into account.

It has to be possible to run multiple smbd instances on different
IP addresses/interfaces.

> 3. When the handler for READ events on the smbdirect FD is called, it will
>    retrieve a set of events and process them. Any that are for incoming
>    SMB PDUs will be sent down the stack for processing. Responses will be
>    sent back possibly by the next IOCTL to the driver.
> 
> 4. When a LOGOFF is received on the smbdirect connection, a response will be
>    sent. Once that has completed, the device will be closed, which will cause
>    the RDMA connection to be dropped.

What do you mean by 'LOGOFF' here?

> REQUIREMENTS OF THE DRIVER
> 
> When the driver loads it will begin listening for incoming RDMA connections
> to IP_ADDR_ANY:5445. If there are no sessions registered by smbds or if the
> smbd layer has not been initialized by Samba, these connection attempts
> will be rejected.

I would not do that; the driver should be dumb and should not do any
action on its own! Everything should be actively triggered by the
caller, similar to the socket() interface, which also needs explicit
bind()/listen()/accept()/connect() calls.

> When Samba opens the driver the first time, it will use an IOCTL to register
> the following parameters:
> 
>  - ReceiveCreditsMax
>  - SendCreditMax
>  - MaxSendSize
>  - MaxFragmentSize
>  - MaxReceiveSize
>  - KeepAliveInterval

In my idea this would be the "listen" call.

>  - The initial security blob required to handle the SMB3 Negotiate response.
>
> The security blob is a constant, in any case, and needs to be available to
> handle the SMB3 Negotiate response.

This would be a layer violation and should be avoided.

> When an smbd is forked to handle a TCP connection, that smbd will also open
> the device. It will subsequently perform the following actions:
> 
> 1. Register the Session ID for the current session once the SESSION_SETUP has
>    been processed.

I really think we don't need this layer violation...

> 2. Call an IOCTL to retrieve the shared memory parameters (typically) the
>    size of the shared memory region required.
> 
> 3. Call mmap on the device to mmap the shared memory region that allows us 
>    to avoid copying large amounts of data between userspace and the kernel.
> 
> When PDUs are available for the smbd to process, or when RDMA READ or WRITE
> operations have completed (and possibly when PDU SENDs have completed) the
> device will set in a POLLIN state so that the smbd can process the new events.
> 
> IOCTLS
> 
> The following IOCTLS are needed:
> 
> 1. SET_SMBD_PARAMETERS
> 
> This IOCTL sets the set of parameters that SMB Direct operates under:
> 
>   - ReceiveCreditMax
>   - SendCreditMax
>   - MaxSendSize
>   - MaxFragmentSize
>   - MaxReceiveSize
>   - KeepaliveInterval
>   - The initial security blob required to handle the SMB3 Negotiate response.

See above...

> 2. SET_SMBD_SESSION_ID
> 
> This ioctl tells the smbd driver the session ID in use by the current smbd
> process and thus allows connections over RDMA using this session id.

See above...

> 3. GET_MEM_PARAMS
> 
> This ioctl is used to retrieve important memory parameters established when an
> smbd opens the device. Each open after the first open allocates memory that
> will be used to receive and send PDUs as well as buffers to be used for
> RDMA READs and WRITES.
> 
> The information retrieved by this IOCTL includes the size of the memory area
> that the smbd should mmap against the device.
> 
> 4. GET_SMBD_EVENT
> 
> This ioctl is used by the smbd to retrieve the latest events from the driver.
> Events can be of the following type:
> 
> a. PDU received
> b. PDU sent
> c. RDMA READ/WRITE complete and thus the buffers can be reused.
> 
> A list of events is provided for the smbd to deal with.
> 
> When PDUs received events are handled, the PDU will be copied into memory
> pointed to by the event array passed in. The reason for this copy is to allow
> the SMB Direct protocol engine to turn its internal buffers around and return
> credits to the client. The cost of copying these PDUs is small in return for
> getting more requests in.
> 
> The device will remain in a POLLIN state if there are outstanding events
> to be handled.
> 
> 5. SEND_PDU
> 
> This ioctl takes an array of pointers to memory containing PDUs. These are 
> copied to internal buffers and then scheduled for sending. When the IOCTL
> returns the data has been copied but not yet sent.
> 
> An event will be returned when the send is complete.
> 
> 6. RDMA_READ_WRITE
> 
> This ioctl takes a set of shared memory areas as well as remote memory
> descriptors and schedules RDMA READs or RDMA WRITEs as needed. 
> 
> Each memory region is registered prior to the RDMA operation and unregistered
> after the RDMA operation.
> 
> 7. SET_SMBD_DISCONNECT.

This might work out...

Another idea would be to use writev()/readv() for the normal PDUs
and just do shared memory tricks for the RDMA_READ_WRITE.
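
A minimal sketch of what the writev() side could look like on such a
connection fd. This assumes, purely for illustration, that the driver
itself adds the SMB Direct data transfer header and takes care of
fragmentation:

  /* Sketch: send an SMB2 PDU (header + body) with a single writev()
   * instead of a SEND_PDU ioctl.  The SMB Direct data transfer header
   * and any fragmentation are assumed to be handled by the driver. */
  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  static ssize_t smbdirect_send_pdu(int conn_fd,
                                    const uint8_t *hdr, size_t hdr_len,
                                    const uint8_t *body, size_t body_len)
  {
      struct iovec iov[2] = {
          { .iov_base = (void *)hdr,  .iov_len = hdr_len  },
          { .iov_base = (void *)body, .iov_len = body_len },
      };

      /* one writev() call maps to one (possibly fragmented) message */
      return writev(conn_fd, iov, 2);
  }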

> EVENT SIGNALLING
> 
> The driver will maintain a queue of events to userland. When events are
> available, the device will be placed in the POLLIN state, allowing poll/epoll
> to be used to determine when events are available.
> 
> MEMORY LAYOUT
> 
> When the smbd driver is opened for the second and subsequent times by a
> different user, it will allocate 64MB of memory (which might need to be
> physically contiguous.) Subsequent opens by the same process will not
> allocate more memory.

I think the memory for normal SMB Direct PDUs should depend on the
used credits and max sizes on the connection.

The memory for RDMA_READ_WRITE should depend on the caller and the
"RDMA Provider IRD/ORD Negotiation" together with the negotiated
MaxReadWriteSize.
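
For example (with purely illustrative numbers): with
ReceiveCreditMax = 255 and MaxReceiveSize = 8192, the receive buffer
pool for one connection would be 255 * 8192 bytes, i.e. roughly 2 MiB,
rather than a fixed 64 MiB region.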

> This memory will be available via mmap. It is expected that the GET_MEM_PARAMS
> IOCTL will be called to get the size and other parameters before mmap is
> called.

In addition we should have a way to instruct the driver to take (an
array of) fd, offset, length which should be transferred via RDMA
READ/WRITE. This way we would avoid context switches.

But the mmap way is also needed for VFS backends without real files...
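
Roughly what such a request to the driver could look like; all the
struct and field names below are invented for illustration:

  /* Sketch only: a hypothetical ioctl argument that lets the driver
   * read/write file data directly via (fd, offset, length), without a
   * round trip through userspace buffers.  None of these names exist. */
  #include <stdint.h>

  struct smbdirect_rdma_file_segment {
      int32_t  fd;             /* open file to read from / write to */
      uint64_t file_offset;    /* offset within that file */
      uint32_t length;         /* number of bytes in this segment */
      uint32_t remote_token;   /* remote memory descriptor from the client */
      uint64_t remote_offset;
      uint32_t remote_length;
  };

  struct smbdirect_rdma_file_io {
      uint32_t flags;          /* RDMA READ vs. RDMA WRITE */
      uint32_t count;          /* number of segments that follow */
      struct smbdirect_rdma_file_segment segs[];
  };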

> The memory will be organized as 64 1MiB buffers for RDMA READs or RDMA WRITEs.

I don't think these should be fixed values...

> SAMBA CHANGES
> 
> There will need to be a few changes to Samba, including:
> 
> 1. During startup to attempt to open /dev/smbd.
> 
> 2. The FSCTL_QUERY_NETWORK_INTERFACE_INFO ioctl will need to be implemented.
> 
> 3. SMB2_READ and SMB2_WRITE handling code in the SMB2 code-path will need to
>    be modified to understand remote buffer descriptors and call the correct
>    driver IOCTL to initiate RDMA_WRITE or RDMA_READ operations as needed.
> 
> 4. Changes might be needed to have Samba understand that the PDUs have come
>    from another source other than the TCP socket it expects.

As said above, the changes to Samba (at least for SMB2/3) to support
SMB Direct should be relatively small. The real work is to get
MULTI_CHANNEL support first!

I hope you take my comments as constructive improvements.

metze
