frustration Re: FW: [PATCH] Stop packet_recv greediness, stop using event_add_timed

Sam Liddicott sam at
Tue May 19 13:20:29 GMT 2009

* Volker Lendecke wrote, On 19/05/09 12:31:
> On Tue, May 19, 2009 at 11:51:14AM +0100, Sam Liddicott wrote:
>>> Sorry for that, but I think my query is still valid. You're
>>> seeing a problem with the policy to read everything in a
>>> greedy way. I *think* this has been added due to performance
>>> reasons. 
>> Maybe, but as you mention below, you suspect that the gains will be lost
>> in the noise; and the quest for such minimal gains results in the sort
>> of bad behaviour I'm trying to fix, as well as the potential DOS you
>> mention.
>> It is worth addressing the performance gains of different solutions, but
>> not in comparison to the current behaviour which is broken, and
>> unnecessarily corks server responses in order to save a few syscalls -
>> i.e. it wastes network capacity in favour of cpu capacity, to a
>> noticeable degree.
> I'm trying to get that torture test because *I* myself would
> have to go through that if I wanted to apply such a change
> to the core of Samba 4.

Then I'm still not clear on which test you want;

I had thought that you wanted a test between my patch and the supposed
efficiency benefits of current behaviour*, but now I think that you want
a performance test between two patches:

* returning to epoll after 4 bytes
* reading the full packet if possible before returning to epoll

is that correct?

[* which would be valueless if the current behaviour is more efficient
by virtue of being broken]

>> For real numbers obtained during debugging here, a sequence of 5
>> responses which would normally have been sent out around 10ms apart were
>> being collected and set out at once 10us apart. That means the first
>> response was being delayed by 40ms, which more than doubles the
>> effective RTT for many WAN, or wastes around 500Kbytes of data
>> opportunity at LAN speeds.
> Wait a second -- this sounds like a problem with your TCP
> socket options. This definitely sounds like you want to turn
> off NAGLE.

I've not altered this, so I can't imagine that TCP_NODELAY is not set,
but I will check if this is the case by adding an ioctl to query the
socket settings to be sure.

>>> If you would be happy with just reading a packet at
>>> a time with 4 syscalls (epoll -> read 4 bytes -> epoll ->
>>> read rest) this might be slower than the 3 syscalls (epoll
>>> -> ioctl(FIONREAD) -> read everything). 
>> I was never proposing anything as bad as that, there is no need to
>> return to epoll after 4 bytes at all.
> Seriously: I would definitely propose exactly that:
> epoll->read4->epoll->readrest. This is conceptually the
> simplest of all schemes, and would probably fully solve your
> problem. But to actually get that through, I would want
> those performance numbers.

With respect; you would definitely propose that patch because you like it.
I don't propose that patch because I don't want to. I propose the patch
I sent.

Your patch is very similar, the in-function execution flow (goto again)
has been replaced by an out-of-function flow (return, epoll, call). It's
subjective if it is actually simpler. I think it isn't.

> Probably all these mails at some point get me doing your
> work and provide that trivial 100-liner myself, I should
> probably have spent reading those mails doing it and I would
> already have it by now :-(((

What? I don't want you to do my work; I've already done it.

I've already provided this "trivial 100 liner" you speak of already, and
we are arguing about the trivial details of that patch.

I'm not asking you to do my work but you are asking me to write a test
to prove that your patch is nearly as efficient as mine.

What we have here is the old scenario: "You didn't write it how we would
have done" - and it's because I'm not you.

But the difference is _so trivial_, I find it had to believe that the
objections are so material...

I don't want you to do my work, but I don't intend to provide the patch
you would propose along with the justifying test.

I offer /this/ patch. It is very close to one you prefer. It does not
have the objectionable qualities that A Bartlett despised.
The question is:   is it good enough?

I seemed to be doomed to condemn myself and whoever I send a patch to,
to waste a hole day talking about it.

>>> I think the discussion
>>> will become much easier. If you get a packet rate loss below
>>> noise, it's done I think. If you get a measurable packet
>>> rate loss, we have to discuss efficiency vs code clarity.
>> I don't understand what you mean by packet loss here, I don't expect it
>> to introduce any kind of packet loss.
> I wanted to talk about "loss in packet *RATE*", not packet
> loss. Sorry if my english was not sufficient.
>> I'm not proposing as many epolls as the "4 syscall" scheme though.
> Again: Why not? If this turns out to be fast enough, why not
> have much simpler code that is easier to understand?
> Deferring all the buffering work to the kernel *might* not
> be as bad performance-wise as we all suspect.

Because in an effort not to shock reviewers with my dreadful code style
I'm following the same pattern unreliable_select follows.

Because I don't think the code is simpler.

Because I certainly don't think effort of writing the tests to prove
that the code is only slightly less efficient is worth the possible
marginal simplicity gains.

Because then it avoids the whole debate of whether or not it is too

only somehow we are still having a debate over the virtues of marginal
simplicity and marginal efficiency.

>> Not really a randomiser although it may serve.
>> With a server socket current implementation already prefers to write
>> than to read (apart from this problem that contradicts it) - this is so
>> that the server gets to send the responses it worked so hard to generated.
>> smbd/service_stream.c/stream_io_handler()
>> With client sockets we prefer to read than to write, this is because it
>> is polite to the server to listen to the answers instead of just queuing
>> more, and if we don't, we may deadlock if it decides to stop responding.
> Even in the client you want to get rid of data you generated
> as quickly as possible. It's the smbclient application that
> actually triggered that change in Samba 3.  If you want to
> do multiplexed read&x you are better off sending as many
> requests to the server as possible and rely on the kernel
> queues to sort it out. Otherwise you will end up with
> oscillating traffic which sucks for bandwidth utilization.
I can go with that.
>> However, when we have a related server/client socket pair (e.g. when
>> server requests sometimes generate new client requests of any kind),  we
>> have this priority scenario, which epoll doesn't support as it keeps
>> fd's seperate:
>> 1. send server responses
>> 2. receive client responses
>> 3. send client requests
>> 4. receive server requests
>> This effectively pushes any bottleneck back to the client, instead of
>> buffering dangerously in the server.
>> There is no point in receiving server requests if resultant client
>> requests are being blocked because the upstream server is slow (or
>> blocking because we are not receiving our client responses) or because
>> we are just storing them in ram - so server requests is the lowest
>> priority - thus throttling our client to our available processing rate.
>> There is no point though in sending client requests if we are not
>> listening to the client responses or we risk deadlock.
>> There is no point in receiving client responses if we just fill up the
>> our server send queue.
> Ah, ok, I see what you mean. You're coming from your proxy
> perspective. In your case I think you definitely want to
> read all your proxy client's requests into user space,
> regardless of what the server you're proxying does at this
> moment: The magic word is SMBEcho requests. If a server is
> stuck for extended periods of time for any reason, clients
> start sending SMBEchos. You definitely want to fish them out
> of the client request floods, and reply to them locally.
> There is no other way than to just read everything and
> filter on SMBEcho.
Well, I may be wrong but this is how I see it

1. The server ought not to be stuck for extended periods of time for any
reason, if it did, that sounds like a bug that should get fixed.
(Somewhat like the one I'm trying to fix when responses get stuck for
some reason)

2. Any client dumb enough to expect to send an smb echo on a socket that
is not yet writable gets what it deserves.
Sucking up the whole socket in case a client wants to do this seems
bizarre and seems more likely to cause the client to send the echo.

3. A client ought not to get anxious for an smb echo as long as it is
receiving responses generally. (All the more reason not to sit on them).

4. It wouldn't stop the packet_recv of one socket doesn't starving the
smb_echo of another socket, but may make it more likely! All the more
reason to have a fairness in tevent so that all descriptors get a chance.

5. To get the gains you describe and manage to pull smb_echo
out-of-order would involve changes to packet.c and smb_server.c so
significant that I would have no hope of ever getting the committed to samba


More information about the samba-technical mailing list