[clug] [OT] Null-terminated strings: expensive mistake or not
bob at cs.anu.edu.au
Thu Aug 4 00:48:20 MDT 2011
On 04/08/11 16:22, Robert Edwards wrote:
> On 04/08/11 12:41, steve jenkin wrote:
>> A friend sent me this link making a case that null-terminated strings
>> were "The Most Expensive One-byte Mistake".
>> I think it was exactly right at the time:
>> - fits the Unix Philosophy: simple and definitive solutions
>> - falls naturally out of C and how it handles pointers
>> Which begs the question:
>> Is it right for now?
>> There are a bunch of C and other coders on the list.
>> Thought you might have interesting opinions on this.
> I don't personally see any particular problem with \0 terminated strings
> in C or assembly. Some operations can be performed quite quickly whilst
> others take longer, but then so does maintaining a [1|2|4|more] byte
> length parameter at the beginning of your string object.
> The article referenced makes some questionable observations:
> "If the source string is NUL terminated, however, attempting to access
> it in units larger than bytes risks attempting to read characters after
> the NUL. If the NUL character is the last byte of a VM (virtual memory)
> page and the next VM page is not defined, this would cause the process
> to die from an unwarranted "page not present" fault."
> Performing non-word-aligned accesses on a string in units larger than
> a byte is going to hurt performance anyway, so why would anyone do that?
> And most architectures will have their VM pages word-aligned... so I
> can't really see this being a problem. I wonder if it has actually
> happened to the author, or if he might just be hypothesizing?
> As for Hardware Development Costs, one of the key reasons for using
> \0-terminated strings (other than, say, 0x7f terminated) is that almost
> all CPUs will set the "Z" (or zero) bit when moving etc. a 0 value
> and so the test is trivial to perform on existing hardware. No need for
> the special instructions of the Z-80 etc. that limited string length
> to 255 chars. I think this argument works in favour of \0-termination.
> I think a similar argument applies to the difference between DIX
> Ethernet frames (as used by IPv4 etc.) which have no length field in
> the Ethernet header and IEEE802.3 frames which do. Manipulating the
> IEEE802.3 frame often requires going back to the header and updating
> the length field. Too bad if the frame is already being put on the
> wire... With DIX Ethernet frames, you just send data until the frame
> ends (as detected in hardware) and the length is implicit
> (cf. \0-terminated). Apparently makes network stacks heaps easier to
> implement.
> Bob Edwards.
Replying to my own e-mail (sorry about that).
I just had a look at my first edition K&R "The C Programming Language"
and the first time it references strings (page 27), it says:
'...when a string constant like
is written in a C program, the compiler creates an array of characters
containing the characters of the string, and terminates it with a \0
so that functions such as printf can detect the end:...'
From this, I understand that the C language itself is not at fault
here: it imposes \0-termination only when dealing with string constants.
In all other ways, the language is ignorant of strings, and the
\0-terminated string is an artefact of the standard C library
functions, such as printf().
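The compiler's appended \0 is easy to observe directly; a minimal
demonstration (nothing here beyond standard C):

```c
#include <string.h>

/* The compiler appends a \0 to every string constant, so the array
 * initialised from "hello" holds six bytes: 'h' 'e' 'l' 'l' 'o' '\0'.
 * strlen() finds the end by scanning for that \0. */
static const char greeting[] = "hello";
```

sizeof reports 6 bytes for the array, while strlen() reports 5
characters - the difference is exactly the terminator the compiler
added.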
I might be splitting hairs here, but if you don't like \0-terminated
strings, you can write your own functions in C that use a different
representation of a string and the only way the compiler would not
be helpful is in dealing with initialising string constants - you would
need to write a macro to do this instead.
Or am I wrong on this one?
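To make the hair-splitting concrete, here is a minimal sketch of such a
roll-your-own representation. The names (lstr, LSTR_LIT) are made up
for illustration, not from any library; the macro is exactly the bit of
help with string constants mentioned above - it uses sizeof on the
literal (which includes the compiler's \0) to record the length.

```c
#include <stddef.h>

/* A hypothetical length-prefixed ("Pascal-style") string type.
 * The data need not be \0-terminated and may contain embedded \0s. */
struct lstr {
    size_t len;
    const char *data;
};

/* Build an lstr from a string literal.  sizeof(s) counts the literal's
 * terminating \0, so subtract 1; this is the only place the compiler's
 * handling of string constants matters. */
#define LSTR_LIT(s) ((struct lstr){ sizeof(s) - 1, (s) })
```

With this, the length is available in O(1) with no scan for a
terminator, at the cost of carrying the length field around - the same
trade-off discussed above.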