Are code assertions considered harmful?

Sat Nov 8 19:52:31 GMT 2003

On Sat, 8 Nov 2003, "Richard Sharpe" <rsharpe at richardsharpe.com> wrote
> On Sat, 8 Nov 2003, Cameron Paine wrote:
>
> > I'm interested in opinions from a group of seasoned designers and
> > coders...

> What undocumented features ...?

I'm fighting four. You are aware of one of them and I've been
reluctant to hassle you about it. The other three look to be
manageable by a SMB newbie so I'm tackling them myself. See
below.

> > To my question then: why are assertions considered unhelpful? ...

> Assertions are not considered unhelpful. At work they are often very
> helpful. However, you will find very few in the two open
> source projects I work on. Perhaps it is the nature of those projects.

Hmmm, that concurs with my own observation. I'm a passionate advocate
of open source in my work places. My advocacy is not however based on
cost. It's about transparency, the free exchange of knowledge and the
fact that we can spend our time improving on each other's work rather
than re-inventing pale imitations of it.

Your comment implies that open source is fundamentally less rigorous
than "work" coding. Can we afford to take that view if we want open
source to be taken seriously by other than our peers and the home
hobbyists?

> > Also, I understand the traditional (and lame IMO) argument for not
> > documenting code. And yes, I can read C better than I can read my
> > own handwriting. However, when the code implements conflicting
> > behaviours, how can the maintainer know which was the intended one
> > and which one slipped through the cracks in the testing regime.
>
> What conflicting behavio[u]rs?

My observation was intended as a discussion point. Specific
instances in the client library code are to be the subject of
some future patch submissions. However, lest you think I'm
ducking and weaving, here's a few, albeit trivial, examples.

smbc_init:

1. ignores its logging argument
2. sets errno and returns -1 on most errors.
3. returns an errno.h constant if it cannot allocate
the file table.

smbc_open:

1. sets errno and returns -1 on most errors.
2. returns 1 if it falls out the bottom (currently it can't
do that but...)
3. returns a descriptor on success

smbc_getatr (a function declared as returning a BOOL):

1. sets errno and returns -1 on one error.
2. doesn't set errno and returns False on another error.
3. returns True on success.

Since these occur at the library's API layer and were therefore
amongst the first things I saw when I started my bug hunt, you
can perhaps appreciate that they made me wonder about the rest
of the code.

> > While on the subject of code commentary, what is the attitude of
> > the team to others adding commentary to their code? ...

> You are more than welcome to submit documentation in the form
> of comments. I will look through them and commit them if I am
> happy with them.

Thanks. I'll punt them to you as patches once I've got my app
working reliably.

> Howeevr, please tell us more about the problems you are experiencing.

Thanks for the offer. Background: I have a daemon (24x7) process
that's talking to 90 wintel servers. It detects changed data on
a share on each server and transfers the changed blocks. The
servers are scattered around the .au capital cities and are
interconnected (ultimately) via MPLS.

I have four failure modes; one we've discussed (you may recall I
forwarded you some session captures):

Memory management. I was losing 16-20 KB every time I visited
a server. I know memory is cheap these days, but... :-) I
found a fix proposed to this mailing list by "Ong Kian Win"
(codegrunt at rubbercookie.com) back in Feb 2002. So far, it has
worked for me. I'm bothered though that you declined to
integrate it. Why was that? What side-effects await me?

Inconsistent opendir returns. This is the one I referred to you.
Essentially smbc_opendir returns varying numbers of files from
a share that is not changing. I have a hack in place that's
keeping this end of the universe propped up until I get advice
from you. It's not (currently) urgent but I don't want to give
my customers a solution that contains a hack that I don't
understand.

Silent death. The process simply dies. The frequency varies. By
carefully adding additional trace messages I'm generating logs
that are starting to point to the problem. I've narrowed it down
to clientgen.c:cli_send_smb(). This is a work in progress.

Transport layer errors don't propagate upwards. When this one
bites I cannot continue transactions with the target server
unless I shut down the process. There seems to be some persistent,
per-server state that is maintained that I won't pretend to
understand.

I've added enough trace code to let me identify the "trigger"
event. I don't fully understand the implication of what I'm
about to tell you. A typical scenario goes like this:

libsmb/libsmbclient.c:smbc_opendir()
  libsmb/clilist.c:cli_list_new()
    libsmb/clitrans.c:cli_receive_trans()
      libsmb/clientgen.c:cli_receive_smb()
        lib/util_sock.c:client_receive_smb()
          lib/util_sock.c:receive_smb()
            lib/util_sock.c:read_smb_length_return_keepalive()
              lib/util_sock.c:read_socket_with_timeout()

[2003/11/07 17:00:16, 0] lib/util_sock.c:read_socket_with_timeout(300)
  read_socket_with_timeout: timeout read. read error = Connection reset by
peer.

This correctly sets cli->smb_rw_error and propagates a failure
back up the call chain. However, cli_list_new() only considers
server-side errors as (potential) failure modes. So it sees the
error return but reports NT_STATUS_OK.

There is code inside lib/util_sock.c:receive_smb() that
invalidates cli->fd when this (and other similar) errors
occur. This is transparent to the library client.

On subsequent calls to smbc_opendir() for a server that has
previously failed in this way, the cli structure seems to
be recovered (from some semi-persistent cache?) and the
invalid cli->fd gets passed down into util_sock.c. This too
causes an error, this time without setting *any* error code.
>From this point forward I'm hosed and I have to restart the
process.

It might be helpful for you to know that all servers are in
the same AD domain. I suspect that this means that the
authenticator gets cached which is possibly why the cli
structure gets reused. This is wild speculation on my part.

This problem is also a work in progress. I took some time
out from working on it to compose this response. I had just
discovered that cli->smb_rw_error is only ever read inside
cli_errstr() and only then for the purpose of generating a
human-readable error string.

Any insight/assistance you can offer will be gratefully
received. I'm currently only able to chip away at these
problems in the wee-small hours after the day's real work
is done.

Cameron Paine