[ccache] ccache scalability

Tue Apr 12 19:34:50 GMT 2005

On Tue, Apr 12, 2005 at 07:59:55PM +0200, Minto van der Sluis wrote:

> I wonder how scalable ccache is. In other words, are very large cache 
> trees (multiple GB) still efficient?
> 
> The reason I ask? I am working on T2 ( http://www.t2-project.org ). This 
> is a Linux distribution build environment. Every package in a 
> distribution is built from source if possible. Currently for every 
> distribution being built we have a separate cache. I wonder about the 
> possibility to have a single cache for every distribution being built. 

I do not recommend this.

We're building a distribution, too. We have approx. 1000 source packages,
built for only one architecture, with only one set of optimization flags.

When we used to have one huge cache pool, the whole common cache was around
10-20 GB. At that time only one machine was used to build all our packages.
We didn't have any drawbacks due to the huge size.

Later, when we decentralized our system and implemented distributed builds,
we realized that this was the wrong way to go, and we switched to a separate
cache pool for each of our packages. Note that we do not use distcc or a
similar system: a particular build of a package is done by only one host,
which fetches the source and the ccache pool from the server at the
beginning and puts
back the result at the end. (The server distributes builds and gives jobs to
the clients, based on build dependencies amongst the packages, controlled by
a Makefile. One client handles one or two, rarely three (due to a race
condition :-)) builds at a time. However, all this here in parentheses is
irrelevant to the current topic.)

There were two main reasons why we switched to per-package ccache pools:

1) Speed. Not bandwidth, but rather round-trip time. If you have ccache over
nfs, and each and every ccache query goes over nfs, compiling a normal
application (e.g. bash) gets even slower than without ccache at all.
However, fetching bash-ccache.tar.gz at the beginning (either over nfs or
with scp) and putting back the new version of this file at the end is much
faster, and its cost is negligible compared to the build time of a package.

2) Maintenance of the ccache pool. If you have one giant pool, it just keeps
growing and growing, and it's really hard to keep it clean, i.e. to remove
the files that will most likely never be used again. After an upgrade of a
core component, such as gcc, glibc, or sometimes ccache itself, it's quite
likely that you'll get no more hits, so you can manually clear the whole
cache. But if you upgrade a piece of software such as glib, then most likely
glib applications will not get many hits due to the changed glib headers,
but other applications will. In this case it's quite hard to keep track of
the ccache files that are no longer needed.
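
To make the per-package round trip from point 1 concrete, here is a rough
sketch in Python (not our actual scripts). The function names and the
archive layout are made up for illustration; the only ccache-specific part
is the CCACHE_DIR environment variable, which tells ccache where its pool
lives. Copying the archive from and to the server (nfs, scp, ...) is left
out.

```python
import os
import subprocess
import tarfile
from pathlib import Path


def unpack_pool(archive: Path, pool_dir: Path) -> None:
    """Unpack a previously saved pool; start with an empty one if none exists.

    `archive` is a hypothetical local copy of e.g. bash-ccache.tar.gz that
    was already fetched from the server.
    """
    pool_dir.mkdir(parents=True, exist_ok=True)
    if archive.exists():
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(pool_dir)


def run_build(cmd: list[str], pool_dir: Path) -> int:
    """Run the build command with ccache pointed at the local pool."""
    env = dict(os.environ, CCACHE_DIR=str(pool_dir))
    return subprocess.run(cmd, env=env).returncode
```

This way every ccache lookup during the build hits the local disk, and the
network is touched only twice per package.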

In our new system, the build procedure of a package remembers the timestamp
when the build started, and if the build was successful, then at the end it
removes all files from its ccache pool whose access timestamp is older than
the start of this build (with a little trick so that .stderr files are
removed if and only if the main file is removed too). (If the build failed,
however, it keeps all the files in the pool.) Then it compresses this
tree (using "gzip -1", which seems to give the best compress+upload
time) and puts it back on the server.
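
The cleanup step can be sketched roughly like this in Python (again, not
our actual scripts; prune_pool and pack_pool are made-up names). The sketch
assumes that the companion of a cache file is named "<file>.stderr", and it
uses one possible version of the trick: only the main file's access time is
checked, and the .stderr file simply follows its main file. It also assumes
that the filesystem really updates access times (i.e. it is not mounted
with noatime).

```python
import tarfile
from pathlib import Path


def prune_pool(pool_dir: Path, build_start: float) -> list[Path]:
    """Remove cache files not accessed since `build_start` (epoch seconds).

    Call this only after a successful build. A "<file>.stderr" companion
    is removed exactly when its main file is removed, whatever its own
    access time says.
    """
    removed = []
    # sorted() snapshots the tree so that deleting while looping is safe
    for path in sorted(pool_dir.rglob("*")):
        if not path.is_file() or path.name.endswith(".stderr"):
            continue  # companions are handled together with their main file
        if path.stat().st_atime < build_start:
            companion = path.with_name(path.name + ".stderr")
            for victim in (path, companion):
                if victim.exists():
                    victim.unlink()
                    removed.append(victim)
    return removed


def pack_pool(pool_dir: Path, archive: Path) -> None:
    """Pack the pool with the fastest gzip level, like "gzip -1"."""
    with tarfile.open(archive, "w:gz", compresslevel=1) as tar:
        tar.add(pool_dir, arcname=".")
```

A caller would record time.time() just before the build starts, and after a
successful build call prune_pool(pool, start) followed by
pack_pool(pool, archive) before uploading the archive.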

-- 
Egmont