okuyamak at dd.iij4u.or.jp
Thu Nov 2 18:34:48 GMT 2000
Your question matches an issue that came up on the Japanese ML several days ago, too.
>>>>> "NT" == Nagy Tamás <nagyt at regens.hu> writes:
NT> I've noticed that when the inner network is even slightly loaded, the
NT> computers on the far side of the Linux box can't enter the network,
NT> because Linux won't let them in. Windows reports that the password is
NT> incorrect or that there is no suitable domain controller. I'm sure the
NT> password is correct: several times when I try to log in, Linux lets me
NT> in, but once in a while it doesn't.
NT> In the cases where Linux didn't let me in, I found things like this in
NT> the log file:
In smb.conf, set the send buffer size as small as possible; my
recommendation is 1460*2. And if the network still goes down after
that, make the ENTIRE network faster, uniformly: if the server is
connected with 100Mbps Ethernet, connect the clients with 100Mbps
Ethernet as well; if the server is at 1Gbps, the clients at 1Gbps too.
Don't simply make the server 1G while the clients stay at 100M.
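For example, the send buffer can be shrunk from smb.conf with the
"socket options" parameter (the value 2920 = 1460*2 follows the
recommendation above; tune it for your own network):

```
[global]
    socket options = SO_SNDBUF=2920
```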
Linux's TCP layer tries to send as many IP packets as possible, up to
the send buffer size or the size of the FIFO to the NIC. If it still
has something to send, it sleeps and waits for someone to wake it up.
The problem is that on a very heavily loaded network, once a context
switch occurs and another process starts running, that process (which
is usually another smbd on the server) is very likely to be sending
packets to its own client as well. Once this second process has called
the send() system call, there is no chance of rescheduling: the second
process starts filling the FIFO to the NIC even though the first
process still has data waiting to be sent.
This means that, from a macro point of view, the TCP/IP layer on Linux
does not behave as a FIFO. One smbd process ends up replying to its
client more frequently than another smbd process does. If your Windows
client is attached to an "UNLUCKY" smbd process, the turnaround time
grows longer and longer until the connection finally times out and the
client cuts it.
But if you look at this from a micro point of view, what you see is
the server sending packets concentrated on one specific client. If the
server is connected to the network with a fast NIC, say 1Gbps, while
the client is connected at 100Mbps, then somebody (usually the hub)
has to bridge the speed difference using store-and-forward. That
requires a buffer, which lives on the hub. Since the buffer is not
infinite, if too many packets are concentrated on a single client the
buffer overflows, which means packets get dropped. Packet drops cause
retransmission, and if retransmission happens too frequently, the
sender's (server's) TCP layer (which exists per socket) stops sending
and waits for the network to settle, which again makes that specific
connection slow. Once one connection stops sending, the other
connections have a higher chance of getting their packets through to
their clients.
So, from a micro point of view, the priority of each connection is not
equal on a heavily loaded network.
Since an overloaded network causes unfairness from both the micro and
the macro point of view, the only way to solve this is to lower the
network load, and there are two ways:
1) Make the ENTIRE network fast, not partially (e.g. only the server).
   A partial speedup only raises the probability of packet drops.
2) Make each individual connection slower. This can be done by
   shrinking the send buffer size. Keep in mind that any setting which
   causes resending will only overload the network further. So, please
   also look at:
There you'll see another reason why Winsock2 causes packet drops. You
have to keep this in mind as well, to minimize packet dropping.
There is an old Japanese saying: what happens twice will happen three
times. It's the same here; there is yet another answer to this
problem:
"re-brush up Samba, so that smbd/nmbd do not use so much CPU time."
If the system is not heavily loaded, it has a better chance of coping
with this situation. But currently Samba eats up a lot of CPU, and so
the system has less chance of resolving it.
Samba's current major path does something like this:
* read() 4 bytes to get the SMB request type and the SMB packet length.
* read() the rest of that length.
* either stat(), read(), or write() the regular file.
* send() the reply.
In a heavily loaded situation, this "in kernel" work eats up to 75% of
the CPU; the smbd processes themselves use only 20%. BUT IT'S SAMBA
THAT IS CAUSING THIS HEAVY LOAD. If system calls are being issued at
30k/sec (which is what I measured), almost any system will be
overloaded.
If you could prefetch the request, you could read() the entire request
at once (except for write-to-server requests). But it's not that easy:
Samba's current implementation clobbers the receive buffer while it
builds the reply. (Don't ask me why it was implemented that way; at
least it doesn't buy any speed. For example, we need to convert a
filename from DOS format to Unix format. Using a new buffer for the
Unix-format filename costs a single copy, while if you try to reuse
the receive buffer you first have to build the Unix-format string in a
local buffer and then copy it back into the original buffer, which
costs two copies.) This implementation prevents prefetching from being
added.
In most cases, stat() can be replaced with fstat(), because usually,
after you call stat() to make sure the file exists, you call open().
Both stat() and open() require a filename->inode conversion, which
takes a lot of time (e.g. 90% of stat() is spent on the
filename->inode conversion). fstat() does not need this conversion,
because the file descriptor already remembers the inode. And what's
even more important:
i) open() can check the existence of a file just like stat() can.
ii) the very fact that stat() found the file does not mean the file
    will still exist at open() time; another process may have removed
    it in between.
So the stat()->open() sequence is meaningless; the correct answer is
open()->fstat(). But since the code is so messy, it's hard to find out
which calls can be replaced and which cannot.
If select() did not have to wait on the pipe, it could be removed and
we would only need a blocking read(). This could be implemented using
a signal from the IPC caller, but since SIGUSR1 and SIGUSR2 are both
already used for debugging purposes (which I don't think is a smart
idea; using SIGUSR1/SIGUSR2 for ANY communication with the parent
process, and SIGHUP for passing the debug flag to the top smbd, would
be enough), this can't be implemented without a large change.
As you can see, there are many ways to reduce CPU usage, but none of
them is easy. It will be especially hard while the main development
team focuses on adding new functionality, because shooting a moving
target is a lot harder than shooting a stationary one. Samba's current
implementation is also not designed so that the lower communication
layer is divided from the upper functionality layer.
# And if you take a look at the code more carefully, you'll find
# many bugs in non-major paths, which also cause instability.
NT> What to do?
Stop developing Samba 2.2 and focus on redesigning. What we first need
is a Samba 2.0.8 with a totally new lower communication layer. After
we have made the lower communication layer stable, robust, and fast,
it will be a lot easier to build more functionality on top of it.
# And because of that, I don't think the Samba-TNG team has any chance
# of pulling it off. Think: who made this messy code messy?! (^^;)
## As the old saying goes: taking the first step is one talent;
## walking is another.
Kenichi Okuyama at Tokyo Research Lab. IBM-Japan. Co.