fix to util_sock.c
tridge at linuxcare.com
Tue Nov 14 09:29:19 GMT 2000
> What will cause slower because of this implementation, is that now,
> samba will STUPIDLY wait for entire SMB request data stream to
> arrive, while we can do many things even with only first four bytes.
Many of your points are good, but please forget your idea of doing
lots of stuff with the first 4 bytes. That idea is completely bogus.
There are several reasons for this:
1) the first 4 bytes only give the length. If you want to nit-pick
then in fact only 17 bits of those 4 bytes give the length, and the
other 15 bits are just padding on all except the first packet
(where it carries info for a session request) or an SMB keepalive.
2) we need the full SMB header (a minimum of 39 bytes) _before_ you
start any processing, because without the full header you don't know
whether what the client is asking for is allowed for that user. You
don't know what security context to run the operation in, you don't
know what file descriptor to operate on and you don't know what
offset in the file they are interested in.
3) SMB packets are variable length, and even the control info for
starting to make a decision on whether you can do the command in
the packet is variable length. You need _all_ the smb_vwv words
before you can make decisions about the packet, and you don't know
how many of those there are until after you've read the first 39
bytes, and you can't read the first 39 bytes until after you've
read the first 4 bytes giving the length.
Now if you want to read more than 4 bytes from the socket initially
then you _must_ use a non-blocking read. Otherwise you could block, as
there might be only 4 bytes sitting there! (in the case of an SMB
keep-alive). Using a non-blocking read immediately kills the idea of
using MSG_WAITALL for that initial read. What the heck would it mean
to use MSG_WAITALL with a non-blocking read?
Ok, so what we could do is read 4 bytes, then ask to read something
shorter than the known length of the packet (we could use MIN(length,
1500) as a good first guess). That would be fine, but it would only
help for SMBwrite*(). It isn't a general "speed up all of SMB" trick
because SMB request packets are almost always shorter than an ethernet
frame except in the case of SMBwrite. Chaining isn't used widely
enough to be important here (and rarely takes a packet over 1.5k).
So what you're really doing is a hack for SMBwrite*(), and it is not
at all clear to me that you win anything in this case. If (as is not
uncommon these days) the bottleneck is the disk subsystem rather than
your network then deliberately reading less data than is in the packet
for a SMBwrite*() can slow you down - if the network is significantly
faster than the disk then smbd will be lagging the network and you
may have the whole packet sitting in the socket buffer. Reading that
in several chunks rather than in one go will be slower, and will
consume more of the context switches you are so worried about.
This is where netbench is such a poor choice for making these kinds of
decisions. Because of the way oplocks work the core of netbench often
ends up like:
write write write write
On Linux the data never hits disk at all if you have more than 23MB
of available page cache per client. This means you are operating at
the extreme edge of disk technology, with effectively infinitely fast
disks. On more realistic systems the data does get written out to
disk, and in that case, if the disk subsystem isn't keeping up with
the network, doing the reads piecemeal will hurt.
The other reason I am wary of this sort of change is that it is
concentrating on things that just don't matter all that much. With SMB
and its complex interaction with different OSes we often see _massive_
differences in performance with different strategies (negotiated buffer
sizes, filesystem settings, OS choice etc) whereas what you are
fiddling with is at the 1% level at best. What's more, it's in an area
that has specifically burnt us on a major platform (Solaris) in the
past. I hope you understand why you are receiving a lot of resistance
to your ideas in this case.
> I disagree. tbench will not give you the information INSIDE kernel.
> And that's what we should focus for using recv().
fine, so run a kernel profiling tool while tbench is running. I spent
a lot of time using readprofile plus my own kgprof extensions to Linux
a couple of years ago when trying to improve the performance of Linux
as a large fileserver. It helped a _lot_ having a tool that gave a
simple and reproducible load on the box, rather than trying to
interpret the incredibly complex interactions that a netbench run
involves.
As it turned out, the initial solution I used to gain a heap of
performance (nearly a factor of 2) was using recv() with a special
flag instead of read(). That flag wasn't MSG_WAITALL (which didn't
help); it was a hack to reduce the locking contention in the tcp
code.