SMB Direct Implementation via Linux Device Driver for Samba Design Doc, final?

Or Gerlitz ogerlitz at mellanox.com
Mon Aug 26 05:29:11 MDT 2013


On 26/08/2013 01:45, Richard Sharpe wrote:
> IOCTLS
>
> The following IOCTLS are needed:
>
> 1. SET_SMBD_PARAMETERS
>
> This IOCTL sets the parameters that SMB Direct operates under:
>
>    - ReceiveCreditMax
>    - SendCreditMax
>    - MaxSendSize
>    - MaxFragmentSize
>    - MaxReceiveSize
>    - KeepaliveInterval
>    - The initial security blob required to handle the SMB3 Negotiate response.
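
For reference while reading the above, a minimal sketch of how such an
ioctl argument could be marshalled; every name here is hypothetical,
just to illustrate a fixed-size ABI with the blob passed by pointer:

#include <linux/types.h>
#include <linux/ioctl.h>

struct smbd_params {
	__u32 receive_credit_max;
	__u32 send_credit_max;
	__u32 max_send_size;
	__u32 max_fragment_size;
	__u32 max_receive_size;
	__u32 keepalive_interval;	/* seconds */
	__u32 security_blob_len;
	__u64 security_blob;		/* userspace pointer to the blob */
};

#define SET_SMBD_PARAMETERS _IOW('s', 1, struct smbd_params)
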
>
> 2. SET_SMBD_SESSION_ID
>
> This ioctl tells the smbd driver the session ID in use by the current smbd
> process and thus allows connections over RDMA using this session id.
>
> 3. GET_MEM_PARAMS
>
> This ioctl is used to retrieve important memory parameters established when an
> smbd opens the device. Each open after the first open allocates memory that
> will be used to receive and send PDUs as well as buffers to be used for
> RDMA READs and WRITES.
>
> The information retrieved by this IOCTL includes the size of the memory area
> that the smbd should mmap against the device.
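
Along the same lines, a hypothetical sketch of what GET_MEM_PARAMS
could hand back (field names invented here):

#include <linux/types.h>
#include <linux/ioctl.h>

struct smbd_mem_params {
	__u64 mmap_size;	/* total size the smbd should mmap */
	__u32 buffer_size;	/* per-buffer size, e.g. 1MiB */
	__u32 buffer_count;	/* e.g. 64 */
};

#define GET_SMBD_MEM_PARAMS _IOR('s', 3, struct smbd_mem_params)
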
>
> 4. GET_SMBD_EVENT
>
> This ioctl is used by the smbd to retrieve the latest events from the driver.
> Events can be of the following type:
>
> a. PDU received
> b. PDU sent
> c. RDMA READ/WRITE complete and thus the buffers can be reused.
>
> A list of events is provided for the smbd to deal with.
>
> When PDU-received events are handled, the PDU will be copied into memory
> pointed to by the event array passed in. The reason for this copy is to allow
> the SMB Direct protocol engine to turn its internal buffers around and return
> credits to the client. The cost of copying these PDUs is small in return for
> getting more requests in.
>
> The device will remain in a POLLIN state if there are outstanding events
> to be handled.

For getting good IOPS numbers, it's important to put elements into the
design which allow for cost amortization of user/kernel transitions.
E.g. if multiple PDUs have arrived, the client should issue multiple
IOCTLs to consume them all before it goes back to its epoll/select on
the char-device fd.

So maybe add an IOCTL that tells you how many events are there?
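
To illustrate, a sketch of such a consumption loop on the smbd side;
the GET_SMBD_EVENT_COUNT ioctl, the event layout and handle_event()
are all invented here for illustration:

#include <sys/ioctl.h>
#include <linux/types.h>

struct smbd_event {
	__u32 type;		/* PDU received / PDU sent / RDMA complete */
	__u32 len;
	__u64 data;		/* buffer the PDU was copied into */
};

struct smbd_event_list {
	__u32 max;		/* in: capacity of events[] */
	__u32 returned;		/* out: entries actually filled in */
	struct smbd_event *events;
};

#define GET_SMBD_EVENT       _IOWR('s', 4, struct smbd_event_list)
#define GET_SMBD_EVENT_COUNT _IOR('s', 5, int)

extern void handle_event(const struct smbd_event *ev);

static void drain_events(int fd)
{
	struct smbd_event ev[32];
	struct smbd_event_list list = { .max = 32, .events = ev };
	int pending;

	/* consume until the driver reports no pending events, so the
	 * user/kernel transition cost is paid per batch, not per PDU */
	while (ioctl(fd, GET_SMBD_EVENT_COUNT, &pending) == 0 && pending > 0) {
		if (ioctl(fd, GET_SMBD_EVENT, &list) < 0)
			break;
		for (__u32 i = 0; i < list.returned; i++)
			handle_event(&ev[i]);
	}
}
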


>
> 5. SEND_PDU
>
> This ioctl takes an array of pointers to memory containing PDUs. These are
> copied to internal buffers and then scheduled for sending. When the IOCTL
> returns the data has been copied but not yet sent.
>
> An event will be returned when the send is complete.
>
> 6. RDMA_READ_WRITE
>
> This ioctl takes a set of shared memory areas as well as remote memory
> descriptors and schedules RDMA READs or RDMA WRITEs as needed.
>
> Each memory region is registered prior to the RDMA operation and unregistered
> after the RDMA operation.

Same for the send-pdu and rdma-read-write IOCTLs: let's have a way for
the user space process to batch multiple such directives in a single
IOCTL.
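
One hypothetical shape for such batching: a single submit ioctl taking
an array of directives, each either a PDU send or an RDMA read/write
(all names invented here):

#include <linux/types.h>
#include <linux/ioctl.h>

struct smbd_directive {
	__u32 op;		/* SEND_PDU / RDMA_READ / RDMA_WRITE */
	__u32 len;
	__u64 local_offset;	/* offset into the mmap'ed area */
	__u64 remote_token;	/* remote memory descriptor (RDMA ops) */
	__u64 remote_offset;
};

struct smbd_batch {
	__u32 count;		/* directives submitted in one transition */
	struct smbd_directive *dirs;
};

#define SMBD_SUBMIT _IOW('s', 6, struct smbd_batch)
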

Basically, some drivers use write() and not ioctl() for user/kernel
communication, such as the IB uverbs layer; maybe you want to look at
that approach vs. IOCTL at this stage and see if it could serve better
for batching/amortization, and maybe even generally (to avoid the BKL,
if it's still out there).
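
In the uverbs model each write() carries a small command header
followed by the command payload, which makes pushing several commands
down the fd cheap and ioctl-free. A rough sketch of that style applied
here; the header is hypothetical, loosely modelled on uverbs'
struct ib_uverbs_cmd_hdr:

#include <unistd.h>
#include <string.h>
#include <linux/types.h>

struct smbd_cmd_hdr {
	__u32 command;		/* e.g. a hypothetical SMBD_CMD_SEND_PDU */
	__u16 in_words;		/* payload length, in 4-byte words */
	__u16 out_words;
};

static int submit_cmd(int fd, __u32 cmd, const void *payload, size_t len)
{
	char buf[256];
	struct smbd_cmd_hdr *hdr = (struct smbd_cmd_hdr *)buf;

	if (len > sizeof(buf) - sizeof(*hdr))
		return -1;
	hdr->command = cmd;
	hdr->in_words = len / 4;
	hdr->out_words = 0;
	memcpy(buf + sizeof(*hdr), payload, len);
	return write(fd, buf, sizeof(*hdr) + len) < 0 ? -1 : 0;
}
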


>
> 7. SET_SMBD_DISCONNECT.
>
> Not sure if I need this.
>
> EVENT SIGNALLING
>
> The driver will maintain a queue of events to userland. When events are
> available, the device will be placed in the POLLIN state, allowing poll/epoll
> to be used to determine when events are available.
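
The wait loop on the smbd side is then the standard one; a minimal
sketch with epoll, reusing the drain_events() helper sketched above:

#include <sys/epoll.h>

static void event_loop(int smbd_fd)
{
	struct epoll_event e = { .events = EPOLLIN, .data.fd = smbd_fd };
	int ep = epoll_create1(0);

	epoll_ctl(ep, EPOLL_CTL_ADD, smbd_fd, &e);
	for (;;) {
		struct epoll_event out;

		/* the fd stays POLLIN while events are queued, so a
		 * stray wakeup just leads to an empty drain */
		if (epoll_wait(ep, &out, 1, -1) > 0)
			drain_events(out.data.fd);
	}
}
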
>
> MEMORY LAYOUT
>
> When the smbd driver is opened for the second and subsequent times by a
> different user, it will allocate 64MB of memory (which might need to be
> physically contiguous). Subsequent opens by the same process will not
> allocate more memory.
>
> This memory will be available via mmap. It is expected that the GET_MEM_PARAMS
> IOCTL will be called to get the size and other parameters before mmap is
> called.
>
> The memory will be organized as 64 1MiB buffers for RDMA READs or RDMA WRITEs.
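
On the userland side that presumably boils down to the following
(GET_SMBD_MEM_PARAMS as sketched earlier; the driver's mmap handler is
assumed to map the per-process area at offset 0):

#include <sys/mman.h>
#include <sys/ioctl.h>

static void *map_smbd_buffers(int fd, struct smbd_mem_params *mp)
{
	if (ioctl(fd, GET_SMBD_MEM_PARAMS, mp) < 0)
		return MAP_FAILED;
	/* one contiguous user mapping, carved into mp->buffer_count
	 * buffers of mp->buffer_size each (64 x 1MiB per the doc) */
	return mmap(NULL, mp->mmap_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
}
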

Please note that the 1MB of memory for each RDMA buffer need not be
physically contiguous: you can use the RDMA stack APIs to register each
such buffer, which is made of (on a system with a 4K page size) random
pages, and get a unique lkey which represents the whole 1MB with the
card. Later you plug this lkey into the ib_post_send operation which
does the RDMA.
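
To make that last step concrete, a minimal kernel-side sketch of
posting a 1MB RDMA WRITE with such an lkey (qp setup and completion
handling are assumed to exist elsewhere):

#include <linux/string.h>
#include <linux/sizes.h>
#include <rdma/ib_verbs.h>

static int smbd_rdma_write(struct ib_qp *qp, u64 local_dma, u32 lkey,
			   u64 remote_addr, u32 rkey)
{
	struct ib_sge sge = {
		.addr   = local_dma,	/* pre-registered 1MB buffer */
		.length = SZ_1M,
		.lkey   = lkey,
	};
	struct ib_send_wr wr, *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.opcode              = IB_WR_RDMA_WRITE;
	wr.send_flags          = IB_SEND_SIGNALED;
	wr.sg_list             = &sge;
	wr.num_sge             = 1;
	wr.wr.rdma.remote_addr = remote_addr;
	wr.wr.rdma.rkey        = rkey;

	return ib_post_send(qp, &wr, &bad_wr);
}
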

You can see how this is done in the iser initiator and also in the LIO
iser target.

For the initiator patches see
http://git.kernel.org/cgit/linux/kernel/git/roland/infiniband.git/log/?id=refs/heads/for-next
For the target patches see the patches I will forward to you; they will
appear in a few days under
http://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/log/?h=for-next

Note that clients/initiators typically deal with a truly random list of
pages for each transaction, so they need to do memory registration per
IO; to that end, we are using Fast-Memory-Registration (FRWR)
techniques in the iser initiator. In LIO we do it for other reasons
which are beyond the scope of SMB (SCSI T-10 sig).

In your case, you can allocate the pages once and, for each set of 256
pages, produce the required mapping once, use it for the whole life
cycle of the session, and only when the session ends, deregister the
mapping with the card.
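
A rough sketch of that once-per-session registration using the fast
registration verbs (error unwinding omitted; dma_addrs[] is assumed to
hold the DMA addresses of the 256 pages):

#include <linux/err.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

#define SMBD_BUF_PAGES 256	/* 256 x 4K = 1MiB */

/* the resulting mr->lkey/mr->rkey stay valid until the session tears
 * the mapping down with ib_dereg_mr() */
static struct ib_mr *smbd_reg_buffer(struct ib_qp *qp, struct ib_pd *pd,
				     u64 *dma_addrs)
{
	struct ib_fast_reg_page_list *frpl;
	struct ib_send_wr wr, *bad_wr;
	struct ib_mr *mr;
	int i;

	mr = ib_alloc_fast_reg_mr(pd, SMBD_BUF_PAGES);
	if (IS_ERR(mr))
		return mr;

	frpl = ib_alloc_fast_reg_page_list(qp->device, SMBD_BUF_PAGES);
	for (i = 0; i < SMBD_BUF_PAGES; i++)
		frpl->page_list[i] = dma_addrs[i];

	memset(&wr, 0, sizeof(wr));
	wr.opcode                    = IB_WR_FAST_REG_MR;
	wr.wr.fast_reg.iova_start    = dma_addrs[0];
	wr.wr.fast_reg.page_list     = frpl;
	wr.wr.fast_reg.page_list_len = SMBD_BUF_PAGES;
	wr.wr.fast_reg.page_shift    = PAGE_SHIFT;
	wr.wr.fast_reg.length        = SMBD_BUF_PAGES * PAGE_SIZE;
	wr.wr.fast_reg.access_flags  = IB_ACCESS_LOCAL_WRITE |
				       IB_ACCESS_REMOTE_READ |
				       IB_ACCESS_REMOTE_WRITE;
	wr.wr.fast_reg.rkey          = mr->rkey;

	ib_post_send(qp, &wr, &bad_wr);
	return mr;
}
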


Or.

