[jcifs] Curious "race" condition

Michael Mercier mmercier at gmail.com
Fri Oct 22 15:52:34 MDT 2010

On 22-Oct-10, at 12:06 PM, André Warnier wrote:

> Michael B Allen wrote:
>> On Thu, Oct 21, 2010 at 5:45 AM, André Warnier <aw at ice-sa.com> wrote:
>>> Hi.
>>> I don't know if this is the right list for this, but I figure that  
>>> there are
>>> enough Samba experts here to maybe give me some pointers (or tell  
>>> me I'm
>>> wrong and need to look somewhere else).
>>> Here is the issue :
>>> Program A runs on a Windows system.  It creates new files on a  
>>> network
>>> drive, which is actually situated on a Solaris machine and shared  
>>> via Samba.
>>> The creation sequence is as follows :
>>> - open the new file for output, with a name
>>> "//servername/sharename/xxxxx.dat.tmp" (where "xxxxx" is guaranteed to be
>>> unique each time)
>>> - write data to the file
>>> - close the file
>>> - only if no errors occurred, rename the output file from  
>>> "xxxxx.dat.tmp" to
>>> "xxxxx.dat"
>>> At the same time, program B runs on the Solaris machine.
>>> It regularly scans the same (for him, local) directory, for files  
>>> ending in
>>> ".dat".
>>> When it finds one, it opens it and reads it.
>>> This happens thousands of times per month without problems.
>>> But once in a great while (2-3 times per year, no more), program B  
>>> reports
>>> an error and crashes.  The reported error leads me to believe that  
>>> it finds
>>> a "xxxxx.dat" file that is either empty or only partially written.
>>> If we restart program B, it processes that same file properly.
>>> Considering the sequence of operations above, my understanding is that
>>> program B should never be able to find a "xxxxx.dat" file that is empty or
>>> partially written.
>> So which is it? What is the condition of the file after the failure?
>> Does it contain any data, some data?
> The file is fine when we check it after the crash.  It is never  
> empty, and always contains the correct data.  If we just restart  
> program B, it finds the file again, and processes it just fine.  It  
> then processes thousands of files just fine again, during the next  
> 3-4 months. Then it will crash again, with the same problem.
> The file contains XML.  Program B opens it and parses it, using an  
> XML parser, in which the crash occurs.  The parser bombs out with a  
> fatal XML parsing error, either :
> - a) with a message saying that it has not found any entity at byte  
> 0 of the file (meaning basically that it sees the file as empty)
> - or b) with a message saying that the given XML is invalid (because  
> a portion is missing)
> (but this second case has happened only once in more than a year,  
> and we are not sure in that case; it may have been a different  
> issue). (a) seems to be the main problem.
>> I think it is more likely that there is a code path where the file
>> creation step looked like it was successful to your program A when in
>> fact there was just some kind of network or server failure.
> That is not possible here, for the reason given above.  Program A  
> writes the file only once, and when we check the file after the  
> crash of program B, it is always ok.
> If the failure was in program A or the network, then we would find  
> the file with size 0 or truncated, but we never do.
>> You could add a step to open the .tmp file a second time, seek to the
>> end and check the last 16 bytes. Then rename it.
>>> But my question is : considering that this happens on a network  
>>> share shared
>>> via Samba, is it possible due to some race condition or  
>>> configuration issue,
>>> that the above may nevertheless "sometimes" occur ?
>> It shouldn't. But you probably should ask on the samba users list.
> Ok. I will ask on the Samba list.
> I was just wondering if somehow, even if extremely unlikely, there
> could be some small window of time while program A was in the
> process of renaming the file across an SMB share, and program B is
> reading that same directory and opening that file, where program B
> /could/ find a directory entry associated with a zero-size file.
> I know that it sounds unlikely, but at this moment I cannot think of  
> any other explanation which would match the symptoms.
> Your response seems to indicate that you do not really believe in  
> that scenario either.
> Like in all such issues, the reason will probably be very clear and  
> evident when we have found it.

I had a similar issue to the one described, with Samba on Linux.

Here is what I did...

I used the 'smbstatus' command to make sure the file was not in use
before processing it.  You could also try 'lsof', if Solaris has an
equivalent.
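
For illustration only, here is a rough sketch of that kind of check in
Python.  It is not the script I actually ran; the function names are made
up, and it assumes that 'smbstatus -L' (locked files only) can be run by
the user program B runs as.  The column layout of the lock listing differs
between Samba versions, so the sketch only looks for the file name
anywhere on a line rather than parsing columns.

```python
import subprocess


def smbstatus_locks(run=subprocess.run):
    """Return the text printed by `smbstatus -L` (locked files only).

    `run` is injectable so the rest of the logic can be exercised
    without a Samba server.
    """
    result = run(["smbstatus", "-L"], capture_output=True, text=True, check=True)
    return result.stdout


def is_in_use(filename, locks_output):
    """Crude check: is `filename` mentioned on any line of the lock listing?

    This is a substring match, so it errs on the side of caution:
    "x.dat" also matches a lock held on "x.dat.tmp".
    """
    return any(filename in line for line in locks_output.splitlines())


def ready_files(candidates, locks_output):
    """Keep only the candidate names that no Samba client appears to hold open."""
    return [name for name in candidates if not is_in_use(name, locks_output)]
```

Program B would call smbstatus_locks() once per directory scan, pass the
result to ready_files() together with the ".dat" names it found, and leave
anything still listed for the next scan.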

