[jcifs] Curious "race" condition

Fri Oct 22 10:06:59 MDT 2010

Michael B Allen wrote:
> On Thu, Oct 21, 2010 at 5:45 AM, André Warnier <aw at ice-sa.com> wrote:
>> Hi.
>> I don't know if this is the right list for this, but I figure that there are
>> enough Samba experts here to maybe give me some pointers (or tell me I'm
>> wrong and need to look somewhere else).
>>
>> Here is the issue :
>>
>> Program A runs on a Windows system.  It creates new files on a network
>> drive, which is actually situated on a Solaris machine and shared via Samba.
>> The creation sequence is as follows :
>> - open the new file for output, with a name
>> "//servername/sharename/xxxxx.dat.tmp" (where "xxxxx" is guaranteed to be
>> unique each time)
>> - write data to the file
>> - close the file
>> - only if no errors occurred, rename the output file from "xxxxx.dat.tmp" to
>> "xxxxx.dat"
>>
>> At the same time, program B runs on the Solaris machine.
>> It regularly scans the same (for it, local) directory for files ending in
>> ".dat".
>> When it finds one, it opens it and reads it.
>>
>> This happens thousands of times per month without problems.
>>
>> But once in a great while (2-3 times per year, no more), program B reports
>> an error and crashes.  The reported error leads me to believe that it finds
>> a "xxxxx.dat" file that is either empty or only partially written.
>> If we restart program B, it processes that same file properly.
>>
>> Considering the sequence of operations above, my understanding is that
>> program B should never be able to find a "xxxxx.dat" file that is empty or
>> partially written.
> 
> So which is it? What is the condition of the file after the failure?
> Does it contain any data, some data?

The file is fine when we check it after the crash.  It is never empty, and always contains 
the correct data.  If we just restart program B, it finds the file again, and processes it 
just fine.  It then processes thousands of files just fine again, during the next 3-4 
months. Then it will crash again, with the same problem.

The file contains XML.  Program B opens it and parses it with an XML parser, and the crash 
occurs inside the parser.  It bombs out with a fatal XML parsing error, either:
- a) with a message saying that it has not found any entity at byte 0 of the file (meaning 
basically that it sees the file as empty), or
- b) with a message saying that the given XML is invalid (because a portion is missing).
Case (b) has happened only once in more than a year, and we are not sure about that 
occurrence; it may have been a different issue.  Case (a) seems to be the main problem.

> 
> I think it is more likely that there is a code path where the file
> creation step looked like it was successful to your program A when in
> fact there was just some kind of network or server failure.
> 

That is not possible here, for the reason given above.  Program A writes the file only 
once, and when we check the file after program B has crashed, it is always OK.
If the failure were in program A or the network, we would find the file with size 0 or 
truncated, but we never do.

> You could add a step to open the .tmp file a second time, seek to the
> end and check the last 16 bytes. Then rename it.
> 
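
For concreteness, the extra check suggested above could be sketched roughly as follows.
This is only an illustration in Python (not necessarily the language program A is written 
in); the function names write_verify_rename and tail_matches are made up for the example, 
and the 16-byte tail length is simply the figure from the suggestion.

```python
import os


def tail_matches(path, expected_tail):
    """Reopen the file, seek to the end, and compare its last bytes."""
    n = len(expected_tail)
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < n:
            # File is shorter than the tail we expect: treat as incomplete.
            return False
        f.seek(size - n)
        return f.read(n) == expected_tail


def write_verify_rename(tmp_path, final_path, data, tail_len=16):
    """Hypothetical version of program A's sequence with the extra check.

    1. write the data to the .tmp name and close the file
    2. open the .tmp file a second time and check its last tail_len bytes
    3. only if that check passes, rename .tmp to the final .dat name
    """
    with open(tmp_path, "wb") as f:
        f.write(data)
    # The file is closed here, before the verification reopens it.

    if not tail_matches(tmp_path, data[-tail_len:]):
        raise IOError("tail check failed for %s, not renaming" % tmp_path)

    # Assumes final_path does not exist yet (names are unique each time).
    os.rename(tmp_path, final_path)
```

A call would look like write_verify_rename("xxxxx.dat.tmp", "xxxxx.dat", xml_bytes), 
with both names on the share.
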
>> But my question is : considering that this happens on a network share shared
>> via Samba, is it possible due to some race condition or configuration issue,
>> that the above may nevertheless "sometimes" occur ?
> 
> It shouldn't. But you probably should ask on the samba users list.
> 

Ok. I will ask on the Samba list.

I was just wondering whether, even if it is extremely unlikely, there could be some small 
window of time, while program A is renaming the file across an SMB share and program B is 
reading that same directory and opening that file, in which program B /could/ find a 
directory entry associated with a zero-size file.
I know that it sounds unlikely, but at the moment I cannot think of any other explanation 
that would match the symptoms.
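
One way to gather evidence for or against that scenario would be to have the scanning side 
record the size it sees for each ".dat" entry at the moment of the scan.  The following is 
only a sketch in Python with made-up names (scan_dat_files, zero_size_entries); it is not 
program B's actual code, just an illustration of the scan described earlier with a size 
check added.

```python
import os


def scan_dat_files(directory):
    """Return (path, size) for each entry ending in ".dat", as seen right now."""
    found = []
    for name in sorted(os.listdir(directory)):
        # "xxxxx.dat.tmp" does not end in ".dat", so files still being
        # written under their temporary name are skipped.
        if name.endswith(".dat"):
            path = os.path.join(directory, name)
            found.append((path, os.stat(path).st_size))
    return found


def zero_size_entries(found):
    """Pick out the entries that looked empty at scan time."""
    return [path for path, size in found if size == 0]
```

If zero_size_entries ever returned a file that later turns out to be complete, that would 
support the idea of a short window in which the ".dat" name is visible with size 0.
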

Your response seems to indicate that you do not really believe in that scenario either.
As with all such issues, the reason will probably be obvious once we have found it.