improper check on written in non mmap path of tdb_write, tdb 1.2.9

Wed Feb 2 18:11:42 MST 2011

On Fri, 28 Jan 2011 01:49:29 pm James Caccese wrote:
> Hi all, I was just casually reading the sources for tdb 1.2.9 and in tdb_write i think there may be a bug in the non mmap path.
> 
> The non mmap path begins at line 94:
> 
>  94:   } else {
>  95:      ssize_t written = pwrite(tdb->fd, buf, len, off);
>  96:      if ((written != (ssize_t)len) && (written != -1)) {
>  97:         /* try once more */
>  98:         tdb->ecode = TDB_ERR_IO;
>  99:         TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_write: wrote only "
> 100:             "%d of %d bytes at %d, trying once more\n",
> 101:             (int)written, len, off));
> 102:         written = pwrite(tdb->fd, (const char *)buf+written,
> 103:                len-written,
> 104:                off+written);
> 105:      }
> 106:      if (written == -1) {
> 107:         /* Ensure ecode is set for log fn. */
> 108:         tdb->ecode = TDB_ERR_IO;
> 109:         TDB_LOG((tdb, TDB_DEBUG_FATAL,"tdb_write failed at %d "
> 110:             "len=%d (%s)\n", off, len, strerror(errno)));
> 111:         return -1;
> 112:      } else if (written != (ssize_t)len) {
> 113:         tdb->ecode = TDB_ERR_IO;
> 114:         TDB_LOG((tdb, TDB_DEBUG_FATAL, "tdb_write: failed to "
> 115:             "write %d bytes at %d in two attempts\n",
> 116:             len, off));
> 117:         return -1;
> 118:      }
> 119:   }
> 
> At line 95 the first attempt to call pwrite returns the bytes stored into "written". Lets call that value written1.
> 
> Line 96 tests to see if it was an incomplete write, and if so, at 102 a second attempt is made. This second attempt starts written1 bytes later (with 0 <= written1 <= len - 1), and returns the bytes stored also into written. Lets call that written2.
> 
> Later on line 112 we check to see that written is the same as len, which would be fine if the first call to pwrite completes fully or was interrupted with zero bytes stored, but, and here's the bug, if the first call to pwrite writes partially and the second finishes fully, written has the value of written2, not the value of written1 + written2.
> 
> If this occurs, the test at line 112 will report failure even though the whole the operation completed successfully.

Yep... last time I read this code I was tempted to rip out the entire thing.
We don't create sparse files, so write failures here are unlikely.  And
retrying seems like a forlorn hope (and as you've shown, untested and buggy).

We basically try again to get the real errno, but unless someone has some real
case (NFS or something weird?) I think we should just log the error w/o errno
in that case.

Looks like Michael's code, so I've CC'd him.

Thoughts?
Rusty.