avoiding stat() races

Sat Nov 11 06:52:51 GMT 2000

Dear Timothy

>>>>> "CTD" == Cole, Timothy D <timothy_d_cole at md.northgrum.com> writes:
CTD> 	Okay, so something to the effect of:
CTD> 	 scache *scache_new(int flags);
CTD> 	 int scache_open(scache *cache, const char *path, int flags, int mode);
CTD> 	 int scache_openi(scache *cache, const char *path, int flags, int mode);
CTD> 	 int scache_stat(scache *cache, const char *path, SMB_STAT_STRUCT *st_buf);
CTD> 	 int scache_stati(scache *cache, const char *path, SMB_STAT_STRUCT *st_buf);
CTD> 	 void scache_close(int fd);
CTD> 	 void scache_destroy(scache *cache);

I'd like to recommend adding

    int	 scache_fstat( scache *cache, const int fd, SMB_STAT_STRUCT *st_buf );

to the list. Even on the application side, we don't want to search by
path when we don't need to. And if we can ask for fstat information
through scache, then scache gets a chance to update its own
information.
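
Just to show what I mean, here is a rough sketch. The one-slot-per-
descriptor table is only my assumption for illustration, not the real
scache layout, and I use plain struct stat where the real code would
use SMB_STAT_STRUCT.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical cache layout: one slot per small file descriptor.
 * The real scache would be keyed by path as well. */
#define SCACHE_SLOTS 64

typedef struct {
    struct stat st[SCACHE_SLOTS];    /* last known stat per descriptor */
    int         valid[SCACHE_SLOTS]; /* nonzero once a slot is filled  */
} scache;

/* Ask the OS through the cache, so the cache can refresh its copy.
 * Returns 0 on success, -1 if fstat() fails. */
int scache_fstat(scache *cache, const int fd, struct stat *st_buf)
{
    assert(cache != NULL && st_buf != NULL);

    if (fstat(fd, st_buf) != 0)
        return -1;              /* nothing cached on failure */

    if (fd >= 0 && fd < SCACHE_SLOTS) {
        cache->st[fd] = *st_buf;
        cache->valid[fd] = 1;
    }
    return 0;
}
```

The point is only that the caller never touches the path, and the cache
still sees every answer the OS gives.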

# This means most file manipulations, including the sendfile()
# syscall on Linux, are worth considering for addition.

... Or maybe we should create our own FILE structure, and use it
instead of the file descriptor itself. Our FILE structure would know
where and which scache to look at.

>> You should rather say, current timestamp only serve to give you
>> information of "INVALIDNESS", like hash function.
>> 
CTD> 	I ... think that's what I said, isn't it?  Mmm.. wait, we're looking
CTD> at 'valid' from different directions.  Maybe 'not stale' would have been
CTD> better than 'valid' in this case.

We're saying exactly the same thing. I only focused on "invalidness"
rather than validness (I focused on what we can have, not on what we
can't have).

>> What I believe is, that we should have 256bits for timestamp.  128
>> for describing over dot seconds, 128bit for under dot second.  If
>> system time does not have accuracy of 128bits, like ... 30 bits for
>> example ... use 128-30=98bits for reference counter within that time
>> accuracy.
>> 
CTD> 	This is reasonable.
CTD> 	I still think keeping a separate 'reference counter' regardless of
CTD> the available time precision would be preferable.  It's a nice hedge against
CTD> access times passing timestamp resolution (or more important, accuracy),
CTD> which _will_ keep happening.

... Maybe we are talking about slightly different "REFERENCE COUNTER".

You are talking about 'reference counter per file', individual one, right?

I'm talking about a 'reference counter per system': a single global
reference counter for the entire operating system. Any action against
the filesystem increases the reference counter, until 'time' changes.

Let's think about this example case:

Suppose we have a system which manages timestamps with 32 bits.

You made a file named './afo' at time 0x00001111.
Then you changed './afo' while we are still at time 0x00001111.
Currently we have no way of telling from the timestamp whether './afo'
has changed or not.

Let's add a 'reference counter per file' to the system.
Now we can find out that the first './afo' has time 0x00001111 and
counter 0x00000000. The second './afo' has time 0x00001111 and counter
0x00000001.

But what will happen if you
1) create './afo'
2) delete './afo'
3) create './afo' again
within the same timestamp? And, unluckily, the system assigned the
same i-node in the 1st step and the 3rd step (this can happen, though
it is a very rare case). The only clues we have are the timestamp and
the reference counter.

1) create './afo' : time = 0x00001111, counter = 0x00000000
2) delete './afo'
  < we lost all information about ./afo now >
3) create './afo' : time = 0x00001111, counter = 0x00000000

Now we have no way of telling the 1st and 3rd './afo' apart.
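
The failure can be reproduced with a small sketch. The file_meta
structure and its helper functions are hypothetical names I made up for
this mail; I also assume that deleting a file throws its counter away
and that a new file always starts at counter 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-file metadata: timestamp plus per-file counter. */
typedef struct {
    uint32_t time;     /* coarse 32-bit timestamp                 */
    uint32_t counter;  /* changes seen within the same timestamp  */
    int      in_use;   /* nonzero while the file exists           */
} file_meta;

/* A newly created file always starts with counter 0. */
void file_create(file_meta *m, uint32_t now)
{
    m->time = now;
    m->counter = 0;
    m->in_use = 1;
}

/* A change within the same timestamp bumps the per-file counter. */
void file_modify(file_meta *m, uint32_t now)
{
    assert(m->in_use);
    if (now == m->time) {
        m->counter++;
    } else {
        m->time = now;
        m->counter = 0;
    }
}

/* Deleting the file loses everything we knew about it. */
void file_delete(file_meta *m)
{
    m->time = 0;
    m->counter = 0;
    m->in_use = 0;
}

/* The only clues available: timestamp and counter. */
int file_meta_same(const file_meta *a, const file_meta *b)
{
    return a->time == b->time && a->counter == b->counter;
}
```

A plain change is detected, but create, delete, create within one
timestamp gives back exactly the same pair of values.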

If we add a 'reference counter per system' instead, the story differs.
The system counts up the reference counter while we're within the same
timestamp. Now we'll have

1) create './afo' : time = 0x00001111, counter = 0x00000000
2) delete './afo'
  < this action was counted as 0x00000001 >
3) create './afo' : time = 0x00001111, counter = 0x00000002

As a result, the 3rd step gets a reference counter different from that
of the 1st file.

This way, we can record the 'order' of filesystem manipulations in
the timestamp. All files carry the correct change order.

And once this is kept, we can merge the timestamp and the reference
counter into one field, so that any comparison works correctly
regardless of timestamp accuracy, without deep thinking.
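
A minimal sketch of the merged field, using the 32-bit example above.
The names fs_clock, fs_stamp, fs_clock_tick and fs_stamp_key are
hypothetical. I assume the counter starts over from 0 whenever the time
changes, that the clock never goes backwards, and that the counter does
not overflow within one timestamp.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stamp: coarse time plus system-wide counter. */
typedef struct {
    uint32_t time;     /* coarse 32-bit system time              */
    uint32_t counter;  /* global counter within that time value  */
} fs_stamp;

/* Hypothetical single clock state for the whole system. */
typedef struct {
    uint32_t last_time;     /* time of the previous operation     */
    uint32_t next_counter;  /* counter for the next operation     */
} fs_clock;

/* Issue one stamp per filesystem operation (create, delete, ...). */
fs_stamp fs_clock_tick(fs_clock *clk, uint32_t now)
{
    assert(now >= clk->last_time);  /* assumed: time never goes back */
    if (now != clk->last_time) {
        clk->last_time = now;
        clk->next_counter = 0;      /* assumed: restart on new time  */
    }
    fs_stamp s = { now, clk->next_counter++ };
    return s;
}

/* Merge both parts into one 64-bit field: time in the high bits,
 * counter in the low bits, so integer order equals operation order. */
uint64_t fs_stamp_key(fs_stamp s)
{
    return ((uint64_t)s.time << 32) | s.counter;
}
```

Starting from a zeroed fs_clock, three operations at time 0x00001111
get counters 0, 1 and 2 as in the example, and comparing the merged
keys with a plain '<' gives the order of the operations.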

If I remember right, this was first found by..... I'm sorry I forgot
the name, the person who made LaTeX ( Lamport ... ? ).

best regards,
---- 
Kenichi Okuyama at Tokyo Research Lab. IBM-Japan, Co.