[ccache] why is limit_multiple ignored?

Mon Jan 29 06:14:44 UTC 2018

Joel Rosdahl <joel at rosdahl.net> wrote:

> On 7 January 2018 at 14:02, Scott Bennett wrote:
>
> > The design problem is that there is no centralized index maintained of
> > cache entries' paths, their sizes, and their timestamps, necessitating
> > the plumbing of the directory trees. [...]
>
> Thanks for sharing your ideas!

     You may wish to retract any thanks once you've read what follows.  The
current independence of ccache from third-party software is valued, and for
good reasons.  However, I hope to show below a better way to do things.
That independence can still be maintained, but only at the cost of another
wheel reinvention. :-(
>
> I fully agree that the cleanup algorithm/design hasn't aged well. It has
> essentially stayed the same since Tridge created ccache in 2002, when
> storage devices were much smaller and a cache of one GB or two probably
> was considered quite large.
>
> Trying to improve the cleanup algorithm/design has not been a priority
> since I personally haven't seen such pathological behavior that you
> describe ("cleanups that can take over a half hour to run and hammer a

     I don't know whether users of other operating systems are using ccache
in building their systems, but many FreeBSD users do so because the time
savings are so great.  When one can cut a build time of six hours to, say,
an hour and a half, one tends to appreciate the tool(s) that make(s) it
possible.  I.e., we use and love ccache because, in general, it works so well
and improves performance so much.
     However, compiling an operating system means a pretty large cache area
is needed if one is to fit the working set within the cache.  Similarly,
FreeBSD users who compile third-party software from the ports tree, rather
than installing it from prebuilt packages, potentially need an even larger
cache area whose size roughly depends upon the number and size of the ports
built and installed onto their systems.  For example, I currently have over
2300 ports installed, which should make clear the reason my ports cache area
is so large.  Cleanups of large cache areas take a long time to run.
(FWIW, I use cache_dir_levels = 5, which may not be optimal in terms of
performance.  I don't have a good way of determining the optimal depth to
use for the cache directory trees.  It seems to be very, very fast for use
in building things, but may well be a killer for cleanups.)
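
     As a rough illustration of why a deep tree might hurt cleanups, the
sketch below counts the directories a full walk would have to visit.  It
assumes that each level fans out into 16 subdirectories, one per hex digit of
the hash, which is how the ccache 3.x layout appears to work; the function
names are purely illustrative and not part of ccache.

```python
def leaf_directories(levels, fanout=16):
    """Leaf directories in a tree with `levels` levels of `fanout` subdirectories."""
    return fanout ** levels


def total_directories(levels, fanout=16):
    """All directories below the cache root, summed over every level."""
    return sum(fanout ** k for k in range(1, levels + 1))


if __name__ == "__main__":
    # Compare a two-level tree with a five-level tree (assumed fanout of 16).
    for levels in (2, 5):
        print(levels, leaf_directories(levels), total_directories(levels))
```

Under that assumption, five levels mean over a million leaf directories to
open and read during a cleanup, against 256 for two levels, regardless of how
many cache entries actually exist.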

> hard drive mercilessly"). However, I'm not at all convinced that
> introducing a centralized index is the panacea you describe.

     Countless data base software implementations handle these situations
acceptably well.
>
> Do you have a sketch design of how to maintain a centralized index? Here

     Well, sort of.  I.e., I haven't written up a design spec or anything
of that sort, but some things seem rather obvious.
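
     A minimal sketch of such an index, assuming SQLite (via Python's sqlite3
module) stands in for whatever data base software might actually be chosen.
The table name, column names, and function names are all hypothetical.  The
point is only that the total size and the oldest entries can be found with
queries instead of a walk over the directory trees.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the hypothetical index of cache entries."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " path TEXT PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " mtime REAL NOT NULL)"
    )
    # Secondary index so that "oldest first" does not need a full sort.
    conn.execute("CREATE INDEX IF NOT EXISTS entries_by_mtime ON entries (mtime)")
    return conn


def record_entry(conn, path, size, mtime):
    """Insert or refresh one entry; the `with` block commits or rolls back."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO entries (path, size, mtime) VALUES (?, ?, ?)",
            (path, size, mtime),
        )


def total_size(conn):
    """Sum of all recorded entry sizes (0 for an empty index)."""
    return conn.execute("SELECT COALESCE(SUM(size), 0) FROM entries").fetchone()[0]


def evict_oldest_until(conn, max_bytes):
    """Remove the oldest rows until the total fits; return the evicted paths."""
    evicted = []
    with conn:
        total = total_size(conn)
        rows = conn.execute(
            "SELECT path, size FROM entries ORDER BY mtime ASC"
        ).fetchall()
        for path, size in rows:
            if total <= max_bytes:
                break
            conn.execute("DELETE FROM entries WHERE path = ?", (path,))
            total -= size
            evicted.append(path)
    return evicted
```

The caller would still have to unlink the files named in the returned list;
the sketch covers only the bookkeeping.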

> are some requirements to consider for the design:
>
> A. It should cope with a ccache process being killed at any time.

     Sure.
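
     To illustrate, here is a small sketch of how transactional data base
software can cover requirement A.  It again assumes SQLite through Python's
sqlite3 module, with a hypothetical table and function name.  A raised
exception stands in for the process being killed partway through an update;
with a real kill, the data base's own journal would be expected to do the
equivalent rollback on the next open.

```python
import sqlite3


def add_entries_atomically(conn, entries, fail_after=None):
    """Insert all entries in one transaction.

    `fail_after` simulates the process dying after that many inserts.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        for count, (path, size) in enumerate(entries):
            if fail_after is not None and count == fail_after:
                raise RuntimeError("simulated crash")
            conn.execute(
                "INSERT INTO entries (path, size) VALUES (?, ?)", (path, size)
            )


def count_entries(conn):
    """Number of rows currently visible in the index."""
    return conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]


def make_index():
    """Create an empty in-memory index for the demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (path TEXT PRIMARY KEY, size INTEGER)")
    conn.commit()
    return conn
```

If the simulated crash happens after the first insert, the index ends up with
no new rows at all rather than with half of the update.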

> B. It should work reasonably well on flaky and/or slow file systems,
>    e.g. NFS.

     No, not at all.  Using a file system as data base software is usually
a Very Bad Idea (tm).

> C. It should not introduce lock contention for reasonable use cases.
> D. It should be quick for cache misses (not only for cleanup).
> E. It should handle cleanup quickly and gracefully.

     In my view, the above are misconceived in the sense that they are
predicated upon the use of file system code as data base software.
>
> I'm guessing that you envision having one centralized lock for the
> index. The tiny stats files already suffer from lock contention in some
> scenarios because they are so few. That's why ideas like
> https://github.com/ccache/ccache/issues/168 and comments like
> https://www.mail-archive.com/ccache@lists.samba.org/msg01011.html
> (comment number 2) pop up. Even if a centralized index only needs a lock
> for writing, it would still serialize writes to the cache. I have
> trouble seeing how that would work out well. But I'll gladly be proved
> wrong.
>
     Try this on for size for a moment.  Imagine the software as two programs,
ccache and ccached.  ccache would contain all the current code analysis and
comparison (including hashes) stuff that it currently has, but it would make
a connection via UDP or TCP to the other program, which we will call ccached,
to access the cache data base.  Modern data base software packages do very
well at handling multiple, simultaneous clients, atomic commits of updates,
multiple indices, and so forth.
     Now, keep in mind that this "ccached" might be a specialized program
linked to data base software or it might simply be a generic data base server.
Multiple caches (in the current sense) might be maintained as separate data
bases, either through a single server instance or as multiple, discrete server
processes, depending upon the software chosen for the purpose, but the
server(s) would be accessed by potentially many concurrent ccache processes
and could deal with consistency/integrity issues at the cache-entry or
cache-entry-element level.
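
     The split described above might be sketched as follows.  This is a toy
under stated assumptions, not a design: the line-based PUT/GET protocol, the
class and function names, and the in-memory dictionary standing in for the
data base are all invented for illustration.  It shows only that many client
processes could talk to one server over TCP while the server alone decides
how updates are serialized.

```python
import socket
import socketserver
import threading


class CacheHandler(socketserver.StreamRequestHandler):
    """Serves one client connection; each request is a single text line."""

    def handle(self):
        for raw in self.rfile:
            parts = raw.decode("utf-8").rstrip("\n").split(" ", 2)
            if parts[0] == "PUT" and len(parts) == 3:
                with self.server.lock:
                    self.server.store[parts[1]] = parts[2]
                reply = "OK"
            elif parts[0] == "GET" and len(parts) == 2:
                with self.server.lock:
                    value = self.server.store.get(parts[1])
                reply = "MISS" if value is None else "HIT " + value
            else:
                reply = "ERR"
            self.wfile.write((reply + "\n").encode("utf-8"))


class CacheServer(socketserver.ThreadingTCPServer):
    """Hypothetical "ccached": one thread per client, one shared store."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, CacheHandler)
        self.store = {}               # stand-in for the real data base
        self.lock = threading.Lock()  # the server, not the clients, serializes


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = CacheServer((host, port))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request(address, line):
    """What a "ccache" client would do: send one line, read one reply line."""
    with socket.create_connection(address) as sock:
        sock.sendall((line + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline().rstrip("\n")
```

A client would call, for example, request(server.server_address, "GET abc123")
and fall back to running the compiler on a MISS reply.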
     Please don't ask me for a recommendation of particular data base software
because I haven't the foggiest idea.  I haven't worked with a data base
package since the early 1970s, although considerably later I did work with
various software built on IBM's ISAM that today would be thought of as data
base applications but was not so regarded at the time.  Back then, a data
base typically involved many files and indices, all interlinked at the record
level, so an access method like ISAM was not, by itself, sufficient to be
called a data base, but it was sometimes a component of a data base.  Very
often, though, people wrote their own data base access methods or bought a
commercial data base package.  A data base was a more formal affair with every
field defined in a data dictionary, etc., etc.  ccache needs nothing so
complex, but you would need to consult someone familiar with each of the
"modern" types of data base software available to decide which way to go.
Very possibly you have the requisite knowledge/experience yourself.
     To modify ccache to use data base software is admittedly a major
rewriting job, so I expect such an idea to put you off, but it's a project
that should ultimately yield a far superior product, IMO.  Those are my two
bits' worth, and you are more than welcome to take shots at what I've written.

                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************