avoiding stat() races

okuyamak at dd.iij4u.or.jp okuyamak at dd.iij4u.or.jp
Sat Nov 18 09:54:26 GMT 2000

>>>>> "CTD" == Cole, Timothy D <timothy_d_cole at md.northgrum.com> writes:
>> ... Or maybe we should create our own FILE structure, and use it
>> instead of file descriptor itself. our FILE structure knows where
>> and which scache we should look at.
CTD> 	That might be preferable.  Foisting the task of maintaining
CTD> association between fds and scache * on the API's client is not ideal.

For this purpose, I believe we should have a 'poll'-like API.
# but with time_t for the timeout, which can also be set to 0,
# meaning 'come back without any waiting', or to NULL ( wait forever ).

And there, I wish to have something like "don't come back until you
received *** bytes" for individual FILEs.

Also, we should use a recvmsg/sendmsg type of API, instead of
recv/send, so that we don't need to combine multiple buffers ourselves.
# We can build the header part and the body part of a message in
# different buffers ( if this gives better performance ).

.... Or, if we can use threads, or change kernel ....

There's an idea of having a shared memory area for each stream
socket/pipe IO, which is common in OSes other than unix.

What we will do as preparation is create an input/output ring buffer like:

typedef struct _IO_area {
       int	fd;

       volatile char	*inbuf;
       volatile ssize_t	*inbuf_start;
       volatile ssize_t	*inbuf_size;

       volatile char	*outbuf;
       volatile ssize_t	*outbuf_start;
       volatile ssize_t	*outbuf_size;

       /* I think we need a mutex and condition for each of inbuf/outbuf too */
} IO_area;

with a large enough inbuf ( around 3 to 4 times the size of an SMB
request ). It is better to do this preparation inside the kernel, if we
can, so that inbuf/outbuf can be page aligned. Also, inbuf_start,
inbuf_size, outbuf_start and outbuf_size are all pointers, so that we
can set different access permissions on each parameter.

Once we are finished with initializing, there will be a thread that
simply waits for the next data from the socket ( this can be an
in-kernel thread, or part of the TCP/IP socket implementation as
well ). When we receive a packet, it will be stored into the inbuf
area, inbuf_start and inbuf_size are maintained, and the required
thread will be awakened via the condition.

The actual SMB replying thread will be awakened by the condition. It
will check what's in inbuf, and the size it has already received. If it
can do anything with the already-received data, like branching into the
reply_* functions, it will simply do so. If it's not enough, it will
wait on the condition again, until the required size arrives. And when
it's finished with that request, it removes the SMB request from the
IO area, and waits for the next request.

This works extremely well, especially when we can have this "thread
for receiving" as part of the TCP/IP stack. We no longer need the
overhead we have now for system calls plus copying from the receive
buffer, because we can use the inbuf area as the receive buffer
itself: we do not need to copy the data out of inbuf, but can
directly use what's already been stored there.

Output is also better, because you can use outbuf for creating the
output data, instead of creating a buffer in user space and then
copying it to kernel space inside a system call. All you have to do is
not update outbuf_size while you are still constructing the data
stream for the reply. When you're finished with a specific data area,
change outbuf_size and wake the sender thread with the condition.

Kenichi Okuyama at Tokyo Research Lab, IBM-Japan, Co.
# Winds, Clouds, tell me..:
# Why does Linux-2.4.0-test8's ( even test10's )
# net/ipv4/tcp.c: cleanup_rbuf() work so slowly against samba?
# It requires nearly 30usec on a P-II 300MHz machine... and that's
# 3/4 of the CPU time which tcp_recvmsg requires to receive
# 64kbytes, and that's why MSG_WAITALL seems to make no big
# difference to performance ( what we save is so small against the
# total requirement ).

More information about the samba-technical mailing list