improper check on written in non mmap path of tdb_write, tdb 1.2.9

Michael Adam obnox at
Mon Feb 7 06:57:24 MST 2011


sorry for the late reply.

Rusty, you are right in that I did those changes.
Actually they were done in oder to treat only the correct
error conditions on pwrite as errors. But it seems
something is not quite correct here. Thanks, James,
for spotting this.

There is another place where I did the first patch like this:
it is tdb_expand_file() in common/io.c
But there, I think I did it right or at least better
(commits d568e2b and af8432e in master):

263         /* now fill the file with something. This ensures that the
264            file isn't sparse, which would be very bad if we ran out of
265            disk. This must be done with write, not via mmap */
266         memset(buf, TDB_PAD_BYTE, sizeof(buf));
267         while (addition) {
268                 size_t n = addition>sizeof(buf)?sizeof(buf):addition;
269                 ssize_t written = pwrite(tdb->fd, buf, n, size);
270                 if (written == 0) {
271                         /* prevent infinite loops: try _once_ more */
272                         written = pwrite(tdb->fd, buf, n, size);
273                 }
274                 if (written == 0) {
275                         /* give up, trying to provide a useful errno */
276                         TDB_LOG((tdb, TDB_DEBUG_FATAL, "expand_file write "
277                                 "returned 0 twice: giving up!\n"));
278                         errno = ENOSPC;
279                         return -1;
280                 } else if (written == -1) {
281                         TDB_LOG((tdb, TDB_DEBUG_FATAL, "expand_file write of "
282                                  "%d bytes failed (%s)\n", (int)n,
283                                  strerror(errno)));
284                         return -1;
285                 } else if (written != n) {
286                         TDB_LOG((tdb, TDB_DEBUG_WARNING, "expand_file: wrote "
287                                  "only %d of %d bytes - retrying\n", (int)written,
288                                  (int)n));
289                 }
290                 addition -= written;
291                 size += written;
292         }

At least it is better in two respects:
- it tries more than once
- it correctly corrects the summed up size of what has been written
maybe the made-up errno of ENOSPC is not good.

Doing these patches was triggered by a real error situation
over which I stumbled, in which using
(return code of pwrite) != (bytes to be written)
really was wrong. I can't quite remember what it was right now, but
I will try to reproduce it.

Anyways, this led me to checking the documentation of pwrite for
the correct usage, and this seemed to be to read again until all
had been written or return code was 0 two times in a row.

It might well be that some of this is theoretic (at least in our case of tdb).

I will think some more about how I would do it properly.
Attached find two patches:
* the first one lowers the one debug message from FATAL to WARNING.
* the second one changes the retry logic to a proper retry loop
  in the lines of the expand_file code.

This is not thoroughly tested yet, but does that look more

Rusty Russell wrote:
> On Sat, 5 Feb 2011 05:45:36 am James Caccese wrote:
> > > Yep... last time I read this code I was tempted to rip out
> > > the entire thing.
> > > We don't create sparse files, so write failures here are
> > > unlikely.  And
> > > retrying seems like a forlorn hope (and as you've shown,
> > > untested and buggy).
> > > 
> > > We basically try again to get the real errno, but unless
> > > someone has some real
> > > case (NFS or something weird?) I think we should just log
> > > the error w/o errno
> > > in that case.
> > > 
> > > Looks like Michael's code, so I've CC'd him.
> > > 
> > > Thoughts?
> > > Rusty.
> > > 
> > 
> > I'd go with the rewrite. I don't know the TDB codebase other then casual reading, but if you'd like I can put together a patch.
> > 
> > As for write failures, it may be my noobishness, but what do sparse files have to do with them? I was under the impression that anything from a signal, to the OS just feeling like it could result in an incomplete write, and that write / pwrite should be called in a loop until you get all the data out, or it gives an error. The code as it stands just looks like a buggy version of that idea except that it only tries once.
> Not on a real file.  Try doing a 1G write on a file, and hitting ^C.  You'll
> have to wait until the write is finished.  I was reminded of this (and
> retested it!) recently by sfr.  (For pipes, sockets and devices, the rules
> are different).
> So for real files, there are only two cases I can think of which could
> return partial write:
> 1) Disk is full, quota exceeded, etc.
> 2) I/O error, though in practice it might just return -1 an EIO, or
>    return success.
> Now, #1 can't happen in the tdb_write case, since the file isn't sparse by
> design.  We're basically down to 

Something missing here?

> > Also, the pread call in tdb_read should be tried in a loop as well.
> > 
> > I'm not sure I understand what you're saying about the real errno though.
> Hope that clarifies?
> Rusty.

Cheers - Michael

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-tdb-lower-a-debug-level-from-FATAL-to-WARNING-in-tdb.patch
Type: text/x-patch
Size: 944 bytes
Desc: not available
URL: <>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-tdb-fix-the-pwrite-retry-logic-in-tdb_write.patch
Type: text/x-patch
Size: 2660 bytes
Desc: not available
URL: <>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 206 bytes
Desc: not available
URL: <>

More information about the samba-technical mailing list