Question(s) about smbd in respect to preadv2

Mon Mar 16 11:30:08 MDT 2015

Guys,

I have an update on this, but first off, sorry for the long break.

I spoke about preadv2 and used the Samba use case and numbers at the
LSF/MM summit in Boston this past week. Christoph was very helpful in
getting me there in the first place and in helping to build consensus
around getting these patches upstream. I also spent a bit of time with
Steve on the last day talking about these changes.

Here are the materials I used for discussion at LSF:
https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
It's not really a presentation, just a synopsis of the syscall and a
summary of the Samba numbers using it.

The test uses the FIO engine I wrote. It uses only a single client
because samba forks a process for each user / connection, so with
multiple users I wouldn't really be testing the preadv2 improvements.
Here are the raw numbers for a lot of different cases, including tail
latencies:
https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing

As you can see, we can make quite a bit of impact on the mean latency
of requests (in terms of percentage) and on overall throughput, even at
larger client block sizes. Also, it looks like there's a bit of
overhead in the samba threadpool case (a different code path from sync)
even before we get to the threadpool enqueueing, because we still do a
bit worse than sync in terms of latency even in the 100% cached case
(when the request is not sent to the threadpool).

I wish I could have tested multiple multiplexed (SMB3) requests to the
server, but the samba client library doesn't do a good job of exposing
that (or I'm too thick to use it). I believe that in that case
threadpool + preadv2 will be a better balance than either sync or
threadpool alone. Sync will suffer worse tail latencies in that case
(due to stalls in the epoll loop while paging in the data), and preadv2
should have the same throughput and better mean latency than threadpool
(since it can execute them faster).
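
For readers who want to see the shape of the threadpool + preadv2 idea,
here is a minimal sketch in Python rather than the actual samba patch.
It assumes the non-blocking behaviour is available through the
RWF_NOWAIT flag that current Linux and Python expose via os.preadv; the
names try_cached_read, read_at and pool are made up for illustration.
The read is first attempted without blocking; only if the data is not
(fully) in the page cache, or the flag is unsupported, is the remainder
handed to a worker thread.

```python
import errno
import os
from concurrent.futures import ThreadPoolExecutor

# Assumption: RWF_NOWAIT is the flag that gives "fail instead of blocking"
# semantics. It may be missing on older Python builds or kernels.
RWF_NOWAIT = getattr(os, "RWF_NOWAIT", None)

# Errors that mean "the fast path is unavailable, use the threadpool".
_FALLBACK_ERRNOS = {errno.EAGAIN, errno.EOPNOTSUPP, errno.ENOSYS}

# Hypothetical worker pool standing in for samba's pthreadpool.
pool = ThreadPoolExecutor(max_workers=4)


def try_cached_read(fd, length, offset):
    """Return whatever can be read without blocking (possibly b"")."""
    if RWF_NOWAIT is None or not hasattr(os, "preadv"):
        return b""
    buf = bytearray(length)
    try:
        n = os.preadv(fd, [buf], offset, RWF_NOWAIT)
    except OSError as exc:
        if exc.errno in _FALLBACK_ERRNOS:
            return b""
        raise
    return bytes(buf[:n])


def read_at(fd, length, offset):
    """Read up to `length` bytes at `offset`; report which path was used."""
    data = try_cached_read(fd, length, offset)
    if len(data) == length:
        return data, "cached"
    # Short or empty result: fetch the rest with a blocking pread in a
    # worker thread. A real event loop would attach a completion callback
    # instead of waiting on .result() here.
    future = pool.submit(os.pread, fd, length - len(data), offset + len(data))
    return data + future.result(), "threadpool"
```

In the fully cached case the request never touches the pool, which is
where the mean latency win would come from; in the uncached case the
cost is one extra failed syscall before the usual threadpool path.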

To wrap this up: hopefully the next patch set I send for the kernel,
man pages, and xfstests will be the final one. Once that's done, I'll
send you guys the patch that I used to test this with samba. Volker did
the initial work and I wrapped it up. Right now it's a hack, so I
imagine you'll want to clean it up. Additionally, I'm going to try to
get the FIO engine upstream, which requires making it buildable at
least with system samba (Ubuntu and Red Hat) and trunk samba...
hopefully it'll be useful to others.

Best,
- Milosz

On Tue, Jan 27, 2015 at 11:44 AM, Milosz Tanski <milosz at adfin.com> wrote:
> Replying to myself with some additional numbers.
>
> On Mon, Jan 26, 2015 at 10:58 AM, Milosz Tanski <milosz at adfin.com> wrote:
>> On Sat, Jan 24, 2015 at 6:39 PM, Jeremy Allison <jra at samba.org> wrote:
>>> On Sat, Jan 24, 2015 at 03:46:18PM -0500, Milosz Tanski wrote:
>>>> I'm at a bit of a crossroads with testing preadv2 using samba and how
>>>> samba could benefit from the syscall. I need to better understand the
>>>> samba architecture today and where it's going. I spent a few hours
>>>> yesterday and today trying to better understand samba; I also looked
>>>> at Jeremy's SambaXP 2014 talk. There are a few clarifications I need.
>>>>
>>>> I did some preliminary testing using preadv2 to perform the reads.
>>>> The first tests I ran were using the source3 smbd server, and I
>>>> compared sync, pthreadpool and pthreadpool + preadv2. Just a simple
>>>> random read test with one client and cached data.
>>>>
>>>> The results were that sync was fastest (no surprise there).
>>>> Pthreadpool + preadv2 was about 8% slower than sync. Plain old
>>>> pthreadpool was 26% slower. So not a bad win there. Additionally, it
>>>> looks like vfs_pread_send and vfs_pread_recv have a bit more overhead
>>>> than plain old vfs_pread in the code path, so it's possible to get
>>>> that 8% even closer to the sync case.
>>>>
>>>> So far, so good... but what struck me is that I don't really
>>>> understand why samba uses the pthreadpool if it forks() for each
>>>> client connection. Why bother?
>>>
>>> Because it allows a client to have multiple outstanding
>>> read and write IO ops to a single smbd daemon.
>>>
>>> This is important if a client has multiple processes
>>> reading and writing multiple large files - the client
>>> redirector just pipelines to the number of outstanding
>>> SMB2 credits.
>>>
>>> Using pthreadpool then allows a single smbd to have
>>> multiple preads/pwrites outstanding.
>>
>> I can only imagine that the latency distribution would shift left
>> (lower) for fully cached / sequential reads, as I've seen that in the
>> app that I originally developed preadv2 for.
>
> I have some additional numbers for the single sync client test. This
> test case is a 64k sequential read of a large file (8x page cache) on
> a spinning disk. I expected the bandwidth differences between the
> modes to narrow in this case... but it doesn't look like the
> difference got noticeably smaller.
>
> Here's a link to the full numbers: http://i.imgur.com/Lnxb4Lk.png
>
>>
>> I haven't figured out yet how to make the async client submit multiple
>> SMB2 async requests using the cifs FIO engine. After spending more
>> time reading the stuff in libcli, it looks like the code is there; it's
>> just not exported in an external library.
>>

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz at adfin.com