LF-delimited files are corrupted when written to a Samba/VMS share

Ben Armstrong BArmstrong at dymaxion.ca
Tue Jun 27 13:12:21 GMT 2006


A problem we've been unable to resolve for some time is that
LF-delimited files are corrupted when they are written to a Samba/VMS
share: every record ends up double-spaced.  I have mentioned this
problem at least once before (most recently, according to my records,
in August 2005) but have no record of having received an answer.

A trivial Ruby script can be used to demonstrate the problem.  (C++ and
Perl test programs can reproduce it too; see the Perl example later in
this message.)  The client system in this test case is a Linux system,
which considers a bare LF to be a newline:

$ ruby -e 'puts "a\nb"' >bg1.tmp

A:BG> dir/full bg1.tmp
...
Record format:      Stream, maximum 0 bytes, longest 32767 bytes
...
A:BG> dump/rec bg1.tmp
...
Record number 1 (00000001), 2 (0002) bytes, RFA(0001,0000,0000)
                                                                   0A61 
a............................... 000000
Record number 2 (00000002), 2 (0002) bytes, RFA(0001,0000,0002)
                                                                   0A62 
b............................... 000000
A:BG> dump bg1.tmp
...
00000000 00000000 00000000 00000000 00000000 00000000 00000000 0A620A61 
a.b............................. 000000
...
A:BG> type bg1.tmp
a

b

A:BG>

You can see why "type" is printing the file as double-spaced.  The file
now consists of two records, each of which contains two characters,
*including* the LF line delimiter.  Stream is apparently a very
"forgiving" file format: it does consider records to end at LF
characters, as well as at a variety of other delimiters (e.g. form
feed, CR+LF, vertical tab).  However, of all these delimiters, it seems
that only the CR+LF pair is excluded from the record itself.
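
To make the record splitting concrete, here is a small Ruby sketch of
the behaviour described above.  It is only a toy model based on the
dumps in this message, not RMS code: the helper names stream_records
and type_output are made up for illustration, the delimiter set is
limited to the ones listed above, and it assumes "type" prints each
record followed by a newline.

```ruby
# Toy model of how a "Stream" file appears to be split into records.
# Assumption: a record ends at CR+LF, LF, VT (0x0B) or FF (0x0C), and
# only the CR+LF pair is stripped from the record itself.
def stream_records(bytes)
  bytes.scan(/.*?(?:\r\n|[\n\x0B\x0C])|.+\z/m)
       .map { |rec| rec.sub(/\r\n\z/, "") }
end

# Assumption: "type" shows each record followed by a newline.
def type_output(records)
  records.map { |rec| rec + "\n" }.join
end

linux_file   = "a\nb\n"      # bytes written by the Linux client
windows_file = "a\r\nb\r\n"  # bytes written by the Windows client

p stream_records(linux_file)    # the LF stays inside each record
p stream_records(windows_file)  # the CR+LF is dropped from each record
p type_output(stream_records(linux_file))
```

In this model the Linux-written file yields two 2-byte records, and
printing them gives an extra blank line after each one, which matches
the "type" output shown above.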

The difficulty is that there is no way I know of for the client
application/system to convey that the file is LF-delimited and must
remain LF-delimited, and therefore should be written as Stream_LF.  The
end result is that any LF-delimited file written to the Samba share is
corrupted: so far as RMS is concerned, it has been converted into a
double-spaced file.  The corruption gets worse as further reads and
writes occur on the file from both systems, double-double-spacing the
file, then double-double-double-spacing it, etc.

A Windows client system running natively compiled Ruby, which considers 
newlines to consist of CR+LF, does not exhibit the same problem behaviour.

Here are the results when the file is created on the Windows system with 
the same Ruby test:

$ ruby -e 'puts "a\nb"' >bg2.tmp

A:BG> dir/full bg2.tmp
...
Record format:      Stream, maximum 0 bytes, longest 32767 bytes
...
A:BG> dump/rec bg2.tmp
...
Record number 1 (00000001), 1 (0001) byte, RFA(0001,0000,0000)
                                                                     61 
a............................... 000000
Record number 2 (00000002), 1 (0001) byte, RFA(0001,0000,0003)
                                                                     62 
b............................... 000000
A:BG> dump bg2.tmp
...
00000000 00000000 00000000 00000000 00000000 00000000 00000A0D 620A0D61 
a..b............................ 000000
...
A:BG> type bg2.tmp
a
b
A:BG>

As you can see, we now get the expected results, a single-spaced file 
containing CR+LF-delimited lines in our "Stream" type file.  RMS sees 
the CR+LF delimiters as terminating each record, and does not consider 
them to be a part of the record itself.
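
For anyone comparing the two "dump" listings byte by byte, the sketch
below decodes the hex columns into file order.  It assumes the usual
DUMP layout, in which the hex longwords read right to left; dump_bytes
is a hypothetical helper, and only the last two longwords of each dump
line above are fed in (the rest is zero padding).

```ruby
# Turn the hex columns of a DUMP line into bytes in file order.
# Assumption: the hex is printed right to left, so the first byte of
# the file is the rightmost pair of hex digits.
def dump_bytes(hex_columns)
  hex_columns.split.join.scan(/../).reverse.map { |pair| pair.to_i(16) }
end

linux   = dump_bytes("00000000 0A620A61").take(4)  # from bg1.tmp
windows = dump_bytes("00000A0D 620A0D61").take(6)  # from bg2.tmp

p linux.pack("C*")    # bytes of the file written from Linux
p windows.pack("C*")  # bytes of the file written from Windows
```

Decoded this way, bg1.tmp holds the bytes 61 0A 62 0A and bg2.tmp holds
61 0D 0A 62 0D 0A, i.e. the only difference between the two files is
the CR in front of each LF.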

So if all applications on all three platforms could agree to use CR+LF
as the "canonical" text file format, we wouldn't have this problem.
However, even Ruby on VMS and Perl on VMS (the more popular and widely
used of the two) are examples of applications that insist on writing
Stream_LF files by default.  For example:

A:BG> perl -e "print ""a\nb\n"";" >bg3.tmp
A:BG> dir/full bg3.tmp
...
Record format:      Stream_LF, maximum 0 bytes, longest 32767 bytes
...
A:BG> dump/rec bg3.tmp
...
Record number 1 (00000001), 1 (0001) byte, RFA(0001,0000,0000)
                                                                      61 
a............................... 000000
Record number 2 (00000002), 1 (0001) byte, RFA(0001,0000,0002)
                                                                      62 
b............................... 000000
A:BG> dump bg3.tmp
...
00000000 00000000 00000000 00000000 00000000 00000000 00000000 0A620A61
a.b............................. 000000
...

Furthermore, not every Windows application writes CR+LF line terminators
in text files.  For instance, Vim for Windows (the text editor we use)
understands how to read both "unix" (LF-delimited) and "dos"
(CR+LF-delimited) text files, and preserves the original line
terminator type when the file is written out again.  An increasing
number of open source applications are being ported to the Windows
platform and must operate correctly in a mixed-platform environment, so
LF-delimited text files written from a Windows system are now a fact of
life that cannot be easily worked around.
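
As a rough illustration of the "unix" versus "dos" distinction, the
following Ruby sketch classifies a chunk of text by the line
terminators it contains.  It is a simplistic heuristic written for this
message, not Vim's actual detection logic, and line_ending_style is a
made-up name.

```ruby
# Classify text by its line terminators (hypothetical helper).
# Returns :dos for CR+LF only, :unix for bare LF only, :mixed when
# both occur, and :none when there is no LF at all.
def line_ending_style(data)
  crlf    = data.scan(/\r\n/).size
  bare_lf = data.count("\n") - crlf  # LFs not preceded by a CR
  return :none if crlf.zero? && bare_lf.zero?
  return :dos  if bare_lf.zero?
  return :unix if crlf.zero?
  :mixed
end

p line_ending_style("a\nb\n")      # file from the Linux client
p line_ending_style("a\r\nb\r\n")  # file from the Windows client
```

A tool that preserves the terminator type, as Vim does, will hand back
whichever of these styles it was given, so a share has to cope with
both arriving from the same client.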

In conclusion, it is not a practical solution to insist that all text
files be written as CR+LF-delimited.  Samba/VMS *must* accommodate
LF-delimited text files somehow.  Without a solution for this problem,
the product's usefulness in a production cross-platform environment is
seriously limited.  If anyone has any idea how we can solve the problem
effectively, please share it!

Ben



