Solved (sort of): Strange VFS performance problem

Thu Feb 17 18:10:28 GMT 2005

On Wed, Feb 17, Terry Griffin wrote:
> On Wed, 2005-02-09 at 17:01, Jeremy Allison wrote:
>> On Wed, Feb 09, 2005 at 04:56:25PM -0800, Terry Griffin wrote:
>> > Hi all,
>> >
>> > I'm having a very strange Samba VFS performance problem. Hoping
>> > you can provide some clues.
>> >
>> > I've implemented a custom Samba VFS module. Functionally everything
>> > is fine and the module does what it's supposed to do.
>> >
>> > With a Windows 2000 client the VFS module introduces an expected
>> > throughput hit on large writes in the range of 10-20% (over Gigabit
>> > Ethernet). But with a Linux CIFS client the throughput hit is more
>> > like 90%!
>> >
>> > The Linux/CIFS and W2K client throughput numbers are similar in
>> > the case where the custom VFS module is not in use.
>> >
>> > What about a VFS module could cause such drastically different
>> > results between a W2K client and a Linux/CIFS client? And is there
>> > something I can do to improve the Linux numbers? I've fiddled with
>> > directio and CIFSMaxBufSize options on the client side with no
>> > help. I've fiddled with all the usual tuning parameters on the Samba
>> > side, again with no improvement.
>> >
>> > Linux (on both the CIFS client and Samba server) is Fedora Core 2
>> > with kernel 2.6.10. The Samba version is 3.0.10.
>>
>> Try using cachegrind :
>>
>> http://developer.kde.org/~sewardj/docs-2.2.0/cg_techdocs.html
>>
>> Profile the fast and slow cases and look for differences. Use
>> smbd -i to look at one instance.
>>
>> It's what I use to track down code performance problems in Samba.
>>
>> Jeremy.
>
> Well I've made a little progress after getting sidetracked for
> a while.
>
> The main difference between W2K as a client and Linux/CIFS is the
> sizes of the writes. W2K will send over 64K at a time while
> Linux/CIFS will send over only 4K at a time when copying a large
> file. There doesn't seem to be anything I can do to get Linux/CIFS
> to send anything other than 4K chunks.
>
> The other dimension to the problem is that I'm using asynchronous
> I/O (AIO) in my VFS module's write/pwrite functions (aio_write,
> aio_error, and aio_return). The AIO routines seem especially
> inefficient for 4K-sized writes but perform nicely with 64K writes.
> Oddly I don't see the same performance difference between 4K and 64K
> writes with the synchronous I/O functions even though in 2.6.x the
> synchronous I/O functions are supposedly just wrappers around the AIO
> functions.
>
> Anyway, as before any clues would be appreciated, especially if they
> are clues that get me bigger-than-4K writes between a Linux/CIFS
> client and a Samba server.
>
> Thanks,
> Terry
>

The final culprit was AIO, or rather the absence of any mechanism
for a user-space application to respond "instantly" to the completion
of an AIO operation. The 4K writes that the Linux/CIFS client sends
to the VFS pwrite() made things worse but were not the root cause.

The original sequence in my VFS module's pwrite() was:

- aio_write()
- Do some other useful stuff with the data
- Polling loop: aio_error()/nanosleep()
- aio_return()

The problem is nanosleep(). (See the BUGS section of the man page.)
It sleeps far too long for this to be efficient when the data is
coming in over Gigabit Ethernet.
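
For illustration, the original sequence might look roughly like the
C sketch below. This is not the actual module code: the names
pwrite_aio_sleep_poll() and demo_sleep_poll() and the 1 microsecond
sleep request are assumptions made for the example. On older glibc
versions it needs to be linked with -lrt.

```c
#define _XOPEN_SOURCE 700   /* expose the POSIX AIO, nanosleep, pread, mkstemp declarations */
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: write n bytes at offset off with POSIX AIO,
 * polling for completion with aio_error() and nanosleep(). */
static ssize_t pwrite_aio_sleep_poll(int fd, const void *buf, size_t n, off_t off)
{
    struct aiocb cb;
    struct timespec nap = { 0, 1000 };  /* request 1 us; the real sleep may be much longer */
    ssize_t ret;
    int err;

    assert(buf != NULL);
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)buf;
    cb.aio_nbytes = n;
    cb.aio_offset = off;

    if (aio_write(&cb) != 0)
        return -1;

    /* ... other useful work with the data would go here ... */

    /* Polling loop: sleep between completion checks. */
    while ((err = aio_error(&cb)) == EINPROGRESS)
        nanosleep(&nap, NULL);

    ret = aio_return(&cb);              /* collect the result exactly once */
    if (err != 0)
        errno = err;
    return ret;
}

/* Usage sketch: write to a scratch file and read it back; 0 on success. */
static int demo_sleep_poll(void)
{
    char path[] = "/tmp/aio_demo_XXXXXX", back[5];
    int fd = mkstemp(path), ok;

    if (fd < 0)
        return -1;
    ok = pwrite_aio_sleep_poll(fd, "hello", 5, 0) == 5
      && pread(fd, back, 5, 0) == 5
      && memcmp(back, "hello", 5) == 0;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Each pass through the loop costs at least one full sleep, so with 4K
writes the per-write sleep overhead is paid sixteen times as often as
with 64K writes for the same amount of data.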

One workaround is:

- aio_write()
- Do some other useful stuff with the data
- nice(1)
- Polling loop: aio_error(), but no sleeping!
- nice(-1)
- aio_return()

This gives me a ten-fold improvement in throughput with the
Linux/CIFS 4K writes. The nice(1)/nice(-1) pair is required so the
polling loop doesn't take CPU away from the aio_write() operation.
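
A minimal C sketch of the workaround, under the same caveats: the
names pwrite_aio_busy_poll() and demo_busy_poll() are hypothetical,
and this is not the actual module code. Note that nice(-1) raises
priority again and so only succeeds for a suitably privileged
process; the sketch simply ignores a failure there.

```c
#define _XOPEN_SOURCE 700   /* expose the POSIX AIO, nice, pread, mkstemp declarations */
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write n bytes at offset off with POSIX AIO,
 * busy-polling aio_error() at lowered priority instead of sleeping. */
static ssize_t pwrite_aio_busy_poll(int fd, const void *buf, size_t n, off_t off)
{
    struct aiocb cb;
    ssize_t ret;
    int err;

    assert(buf != NULL);
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)buf;
    cb.aio_nbytes = n;
    cb.aio_offset = off;

    if (aio_write(&cb) != 0)
        return -1;

    /* ... other useful work with the data would go here ... */

    /* Lower our priority so the spin loop yields CPU to the write. */
    if (nice(1) == -1) {
        /* ignored in this sketch */
    }

    /* Polling loop: no sleeping, just re-check until done. */
    while ((err = aio_error(&cb)) == EINPROGRESS)
        ;

    /* Restore priority; fails without privilege, ignored here. */
    if (nice(-1) == -1) {
        /* ignored in this sketch */
    }

    ret = aio_return(&cb);              /* collect the result exactly once */
    if (err != 0)
        errno = err;
    return ret;
}

/* Usage sketch: write to a scratch file and read it back; 0 on success. */
static int demo_busy_poll(void)
{
    char path[] = "/tmp/aio_demo_XXXXXX", back[5];
    int fd = mkstemp(path), ok;

    if (fd < 0)
        return -1;
    ok = pwrite_aio_busy_poll(fd, "hello", 5, 0) == 5
      && pread(fd, back, 5, 0) == 5
      && memcmp(back, "hello", 5) == 0;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```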

What I really need is a blocking aio_error() with instant return
on completion of the AIO operation. A real nanosleep() might also
do it.
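
For what it's worth, POSIX does define aio_suspend(), which blocks
until at least one of a list of AIO requests has completed. It was
not tried in the tests above, and whether it wakes up promptly enough
in this situation depends on the AIO implementation, so the following
is only a sketch. The names pwrite_aio_suspend() and demo_suspend()
are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* expose the POSIX AIO, pread, mkstemp declarations */
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write n bytes at offset off with POSIX AIO,
 * blocking in aio_suspend() until the request completes. */
static ssize_t pwrite_aio_suspend(int fd, const void *buf, size_t n, off_t off)
{
    struct aiocb cb;
    const struct aiocb *const list[1] = { &cb };
    ssize_t ret;
    int err;

    assert(buf != NULL);
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)buf;
    cb.aio_nbytes = n;
    cb.aio_offset = off;

    if (aio_write(&cb) != 0)
        return -1;

    /* ... other useful work with the data would go here ... */

    /* Block with no timeout. If aio_suspend() returns early (for
     * example with EINTR), the loop re-checks and blocks again. */
    while ((err = aio_error(&cb)) == EINPROGRESS)
        (void)aio_suspend(list, 1, NULL);

    ret = aio_return(&cb);              /* collect the result exactly once */
    if (err != 0)
        errno = err;
    return ret;
}

/* Usage sketch: write to a scratch file and read it back; 0 on success. */
static int demo_suspend(void)
{
    char path[] = "/tmp/aio_demo_XXXXXX", back[5];
    int fd = mkstemp(path), ok;

    if (fd < 0)
        return -1;
    ok = pwrite_aio_suspend(fd, "hello", 5, 0) == 5
      && pread(fd, back, 5, 0) == 5
      && memcmp(back, "hello", 5) == 0;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```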

Terry
-- 
Terry Griffin
Axian Inc.
http://www.axian.com/