[clug] [OT] Null-terminated strings: expensive mistake or not

Thu Aug 4 00:22:25 MDT 2011

On 04/08/11 12:41, steve jenkin wrote:
> A friend sent me this link making a case that null-terminated strings
> were "The Most Expensive One-byte Mistake".
>
> I think it was exactly right at the time:
>   - fits the Unix Philosophy: simple and definitive solutions
>   - falls naturally out of C and how it handles pointers
>
> Which begs the question:
>    Is it right for now?
>
> <http://queue.acm.org/detail.cfm?id=2010365>
>
> There are a bunch of C and other coders on the list.
> Thought you might have interesting opinions on this.
>

I don't personally see any particular problem with \0-terminated strings
in C or assembly. Some operations can be performed quite quickly whilst
others take longer, but maintaining a [1|2|4|more] byte length field at
the beginning of your string object has its own costs.
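
A rough sketch of that trade-off in C, for illustration only. The
struct pstr type and the pstr_len, pstr_push and cstr_skip helpers are
names I have made up for this example; they are not from the article or
from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-prefixed string: a 1-byte length, then the data.
 * A 1-byte prefix caps the length at 255; wider prefixes cost more space. */
struct pstr {
    unsigned char len;
    char data[255];
};

/* Length lookup is O(1) here, where strlen() has to scan for the NUL. */
size_t pstr_len(const struct pstr *s)
{
    return s->len;
}

/* Append one character: the length field must be updated on every change. */
int pstr_push(struct pstr *s, char c)
{
    if (s->len == 255)
        return -1;              /* the prefix cannot describe a longer string */
    s->data[s->len++] = c;
    return 0;
}

/* Drop the first n characters of a NUL-terminated string: a pointer bump,
 * with no copy and no bookkeeping. The caller must ensure n <= strlen(s). */
const char *cstr_skip(const char *s, size_t n)
{
    assert(n <= strlen(s));
    return s + n;
}
```

Taking a suffix is where the terminated form is cheap: cstr_skip("hello", 2)
is simply a pointer to "llo", whereas a suffix of a struct pstr needs a
new length byte in front of the data, which means a copy.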

The article referenced makes some questionable observations:

"If the source string is NUL terminated, however, attempting to access 
it in units larger than bytes risks attempting to read characters after 
the NUL. If the NUL character is the last byte of a VM (virtual memory) 
page and the next VM page is not defined, this would cause the process 
to die from an unwarranted "page not present" fault."

Performing non-word-aligned accesses on a string in units larger than
a byte is going to hurt anyway, so why would anyone do that?
And most architectures will have their VM pages word-aligned, so a
word-aligned read that contains the NUL cannot spill into the next
page... I can't really see this being a problem. I wonder if it has
actually happened to the author, or if he might just be hypothesizing?
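
The alignment arithmetic can be checked with a small sketch. It assumes
a 4096-byte page (ASSUMED_PAGE_SIZE is my own placeholder, real systems
report the value at run time) and a power-of-two word width; the helper
names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u   /* assumption: a multiple of the word size */

/* Returns 1 if a read of `width` bytes starting at `addr` stays in one page. */
int read_stays_in_page(uintptr_t addr, size_t width)
{
    return addr / ASSUMED_PAGE_SIZE == (addr + width - 1) / ASSUMED_PAGE_SIZE;
}

/* Round an address down to the start of the aligned word that contains it. */
uintptr_t align_down(uintptr_t addr, size_t width)
{
    assert(width != 0 && (width & (width - 1)) == 0);   /* power of two only */
    return addr & ~(uintptr_t)(width - 1);
}
```

With these assumptions, the aligned 8-byte word holding the last byte of
a page (address 4095) starts at 4088 and ends at 4095, so it stays in
that page. An unaligned 8-byte read starting at 4093 would cross into
the next page, which is the case the article worries about.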

As for Hardware Development Costs, one of the key reasons for using
\0-terminated strings (rather than, say, 0x7f-terminated) is that almost
all CPUs will set the "Z" (or zero) bit when loading, moving or
otherwise handling a 0 value, and so the test is trivial to perform on
existing hardware. No need for special instructions like those of the
Z-80 etc. that limited string length to 255 chars. I think this argument
works in favour of \0-termination.
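
In C the same point shows up in the classic copy loop. This is only a
sketch: copy_nul and copy_7f are names invented here, the 0x7f
terminator is the hypothetical alternative mentioned above, and what a
given compiler actually emits for either loop depends on the target.

```c
#include <assert.h>
#include <stddef.h>

#define TERM_7F 0x7f   /* hypothetical non-zero terminator, for comparison */

/* Copy a NUL-terminated string, terminator included. The loop tests the
 * value just copied for zero, which on CPUs that set a zero flag on a
 * load or move needs no separate compare instruction. */
char *copy_nul(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* The same copy with a 0x7f terminator: every byte has to be compared
 * against a constant before the branch. */
char *copy_7f(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    char *d = dst;
    while ((*d++ = *src++) != TERM_7F)
        ;
    return dst;
}
```

Both functions assume the destination buffer is large enough for the
source plus its terminator; neither checks.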

I think a similar argument applies to the difference between DIX
Ethernet frames (as used by IPv4 etc.) which have no length field in
the Ethernet header and IEEE 802.3 frames which do. Manipulating the
IEEE 802.3 frame often requires going back to the header and updating
the length field. Too bad if the frame is already being put on the
wire... With DIX Ethernet frames, you just send data until the frame
ends (as detected in hardware) and the length is implicit
(cf. \0-terminated). Apparently this makes network stacks heaps easier
to write.
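
For reference, the two frame types are told apart by the 16-bit field
that follows the two MAC addresses: values up to 1500 are an IEEE 802.3
length, values from 0x0600 (1536) up are a DIX EtherType. A minimal
sketch of that check follows; the enum and the classify_frame function
are names made up for this example, and VLAN tags are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum frame_kind { FRAME_8023_LENGTH, FRAME_DIX_ETHERTYPE, FRAME_UNDEFINED };

/* Classify a frame by the big-endian 16-bit field at byte offset 12,
 * right after the 6-byte destination and 6-byte source addresses. */
enum frame_kind classify_frame(const uint8_t *frame, size_t n)
{
    assert(frame != NULL && n >= 14);   /* need the full 14-byte header */
    uint16_t field = (uint16_t)((frame[12] << 8) | frame[13]);
    if (field <= 1500)
        return FRAME_8023_LENGTH;       /* IEEE 802.3: payload length */
    if (field >= 0x0600)
        return FRAME_DIX_ETHERTYPE;     /* DIX: protocol id, 0x0800 is IPv4 */
    return FRAME_UNDEFINED;             /* 1501..1535 is neither */
}
```

In the DIX case the receiver learns the frame length only from where the
frame ends on the wire, which is the implicit-length behaviour described
above.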

Cheers,

Bob Edwards.