Joel Rosdahl [Fri, 19 Jul 2019 07:22:04 +0000 (09:22 +0200)]
Don’t pass -Werror and compilation-only options to the preprocessor
Clang emits warnings when it sees unused options, so when ccache runs
the Clang preprocessor separately, options that are not used by the
preprocessor will produce warnings. This means that the user may get
warnings which would not be present when not using ccache. And if
-Werror is present then the preprocessing step fails, which needless to
say is not optimal.
To work around this:
* Options known to have the above mentioned problem are not passed to
the preprocessor.
* In addition, -Werror is also not passed to the preprocessor so that
options not properly marked as “compiler only” will only trigger
warnings, not errors.
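A minimal sketch of the filtering idea, assuming a hypothetical option
table and stand-in argument helpers (ccache’s real table is larger):

    #include <stdbool.h>
    #include <string.h>

    struct args;                                      /* stand-in for ccache's arg list type */
    void args_add(struct args *args, const char *s);  /* stand-in helper */

    /* Illustrative list of options to keep away from the preprocessor. */
    static const char *const compile_only_options[] = {"-Werror", NULL};

    static bool is_compile_only(const char *arg)
    {
        for (const char *const *p = compile_only_options; *p; ++p) {
            if (strcmp(arg, *p) == 0) {
                return true;
            }
        }
        return false;
    }

    /* Build the preprocessor command line, dropping compile-only options. */
    static void add_preprocessor_args(struct args *pp_args, int argc, char **argv)
    {
        for (int i = 1; i < argc; ++i) {
            if (!is_compile_only(argv[i])) {
                args_add(pp_args, argv[i]);
            }
        }
    }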
Joel Rosdahl [Wed, 17 Jul 2019 08:39:30 +0000 (10:39 +0200)]
Improve -x/--show-compression
- Ignore *.tmp.* files.
- Mention on-disk size (adjusted for disk block size) to make it match
the cache size reported by “ccache --show-stats”.
- Introduce “space savings” and “of original” percentages.
- Calculate compression ratio only for compressed files.
- Include “incompressible files” size, i.e. total size of .raw files and
files produced by previous ccache versions.
- Remove file counts since I don’t think they are of much
interest.
- Handle unparsable manifest files from previous ccache versions
gracefully.
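A sketch of the adjustment meant here, relying on stat(2) semantics
(st_blocks counts 512-byte units regardless of the file system’s block
size):

    #include <stdint.h>
    #include <sys/stat.h>

    /* On-disk size, i.e. what the file actually occupies on the disk. */
    static uint64_t file_size_on_disk(const char *path)
    {
        struct stat st;
        return stat(path, &st) == 0 ? (uint64_t)st.st_blocks * 512 : 0;
    }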
Joel Rosdahl [Mon, 15 Jul 2019 12:10:28 +0000 (14:10 +0200)]
Implement support for file cloning on Linux (Btrfs/XFS)
- Added a new file_clone (CCACHE_FILECLONE) configuration setting. If
set, ccache uses the FICLONE ioctl if available to clone files to/from
the cache. If file cloning is not supported by the file system, ccache
will silently fall back to copying (or hard linking if hard_link is
enabled).
- Compression will be disabled if file_clone is enabled, just like for
hard_link.
- file_clone has priority over hard_link.
- Tested on Btrfs and XFS on Linux 5.0.0.
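A minimal sketch of the clone-with-fallback behavior, where
copy_file_fd() is a hypothetical stand-in for the regular copy routine:

    #include <sys/ioctl.h>
    #ifdef __linux__
    #include <linux/fs.h> /* defines FICLONE on Linux >= 4.5 */
    #endif

    int copy_file_fd(int src_fd, int dest_fd); /* hypothetical copy fallback */

    static int clone_or_copy(int src_fd, int dest_fd)
    {
    #ifdef FICLONE
        if (ioctl(dest_fd, FICLONE, src_fd) == 0) {
            return 0; /* destination now shares extents with the source */
        }
        /* e.g. EOPNOTSUPP or EXDEV: this file system can't clone; fall back */
    #endif
        return copy_file_fd(src_fd, dest_fd);
    }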
Anders Björklund [Mon, 15 Jul 2019 13:28:26 +0000 (15:28 +0200)]
Add command to show compression statistics (#440)
This will only show information about the files that it knows about
(right magic bytes), so the file count might differ from what is shown
by the regular statistics (which include all files, including old ones).
The terminology used here is a bit confusing: the compression ratio is
supposed to grow upwards, and the same information is sometimes expressed
as “space savings” instead, so both values (ratio and savings) are listed
to make the output more obvious.
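A small worked example of the two presentations of the same numbers:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t content_size = 4096, compressed_size = 1024; /* example numbers */
        double ratio = (double)content_size / compressed_size;
        double savings = 100.0 * (1.0 - (double)compressed_size / content_size);
        printf("%.1fx ratio, %.1f%% space savings\n", ratio, savings); /* 4.0x, 75.0% */
        return 0;
    }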
Joel Rosdahl [Fri, 5 Jul 2019 19:43:07 +0000 (21:43 +0200)]
Reimplement the hard link mode
- Files stored by hard linking are saved as _N.raw files next to their
.result file, where N is the 0-based index of the entry in the .result
file.
- The .result file stores expected file sizes for the .raw files and the
code verifies that they are correct before retrieving the files from
the cache.
- The manual has been updated to mention the new file size check and
also some other caveats.
The hard link mode was previously removed with the following rationale:
1. Hard links are error prone.
2. Compression will make hard links obsolete as a means of saving cache
space.
3. A future backend storage API will be easier to write.
Point 1 is still true, but since the result file now stores expected
file sizes, many inadvertent modifications of files will be detected.
Point 2 is also still true, but you might want to trade cache size for
speed in cases where increased speed actually is measurable, like with
very large object files.
Point 3 does not quite hold now that I have thought some more about
future APIs. I think that it will be relatively straightforward to add
operations like supports_raw_files, get_raw_file and put_raw_file to the
API.
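A sketch of the file size check mentioned above, with illustrative names:

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/stat.h>

    /* Before retrieving a _N.raw file, verify that it still has the size
     * recorded in the .result file. */
    static bool raw_file_intact(const char *raw_path, uint64_t expected_size)
    {
        struct stat st;
        return stat(raw_path, &st) == 0 && (uint64_t)st.st_size == expected_size;
    }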
Joel Rosdahl [Tue, 2 Jul 2019 11:57:11 +0000 (13:57 +0200)]
Probe whether the compiler produces a .dwo
GCC and Clang behave differently when given e.g. “-gsplit-dwarf -g1”:
GCC produces a .dwo file but Clang doesn’t. Trying to guess how the
different options behave for each compiler is complex and error prone.
Instead, ccache now probes whether the compiler produced a .dwo and only
stores it if it was produced. On a cache hit, the .dwo is restored if it
exists in the previous result – if it doesn’t exist in the result, it
means that the compilation didn’t produce a .dwo.
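A sketch of the probing idea, assuming the expected .dwo path is already
known:

    #include <stdbool.h>
    #include <sys/stat.h>

    /* After running the real compiler: did it produce a .dwo next to the
     * object file? Only then is one stored in the result. */
    static bool compiler_produced_dwo(const char *dwo_path)
    {
        struct stat st;
        return stat(dwo_path, &st) == 0;
    }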
Joel Rosdahl [Sun, 30 Jun 2019 12:01:38 +0000 (14:01 +0200)]
Add checksumming of cached content
Both compressed and uncompressed content are checksummed and verified.
The chosen checksum algorithm is XXH64, which is the same that the zstd
frame format uses (but ccache stores all 64 bits instead of only 32,
because why not?).
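A sketch of one-shot checksumming with xxHash’s public API:

    #include <stddef.h>
    #include <stdint.h>
    #include <xxhash.h>

    /* The full 64-bit value is stored alongside the cached content and
     * compared on retrieval. */
    static uint64_t checksum_data(const void *data, size_t size)
    {
        return (uint64_t)XXH64(data, size, 0 /* seed */);
    }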
Joel Rosdahl [Sat, 29 Jun 2019 20:35:50 +0000 (22:35 +0200)]
Require libzstd and remove zlib support
* zlib has been removed. Good riddance!
* libzstd is now required for building ccache. However, it’s not bundled
like zlib was.
* To make it easier to build ccache on systems that lack an easily
installable libzstd, the configure script now offers a
--with-libzstd-from-internet option, which downloads a zstd source
release archive, unpacks it in the tree and sets up the Makefile to
build the library and link ccache (statically) with it.
* Enabled compression by default.
* Made compression level 0 mean “use a default level suitable for the
current compression algorithm”. For zstd, that’s initially level -1,
but that could change in the future. The reason for using 0 as a
special marker is that a future alternative compression algorithm
could have another reasonable default than zstd. (Let’s hope that
future algorithms don’t use level 0 for something.)
* Changed default compression level to 0.
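A sketch of how the level 0 marker could be resolved, matching the
description above (the -1 default is what currently applies to zstd):

    /* 0 means "use a default level suitable for the current algorithm". */
    static int actual_compression_level(int configured_level)
    {
        return configured_level == 0 ? -1 : configured_level;
    }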
Joel Rosdahl [Sat, 29 Jun 2019 18:39:00 +0000 (20:39 +0200)]
Restructure Travis configuration
In preparation for switching from zlib to zstd. I find it easier to use
a flat job list instead of a matrix and to state settings explicitly for
the different jobs.
Joel Rosdahl [Sat, 29 Jun 2019 18:35:48 +0000 (20:35 +0200)]
Replace murmurhashneutral2 with xxHash (XXH64)
XXH64 is significantly faster than murmurhashneutral2 (on 64-bit
systems, which one can assume ccache is almost always running on these
days). This of course doesn’t matter for keys in hash tables, but it
opens up the possibility of using XXH64 as a checksumming algorithm for
cached data as well.
Joel Rosdahl [Sat, 29 Jun 2019 18:25:31 +0000 (20:25 +0200)]
Don’t try a higher zstd level than supported
If the user tries a higher level than supported by libzstd,
initialization will fail. Instead, let’s clamp the level to the highest
supported value.
Regarding negative levels: They are supported from libzstd 1.3.4, but
the query function ZSTD_minCLevel is only supported from 1.4.0 (from
1.3.6 with ZSTD_STATIC_LINKING_ONLY), so let’s not use it for
verification of the level. In libzstd 1.3.3 and older, negative levels
are silently converted to zstd’s default level (3), so there’s no
major harm done if a user uses a negative level with older libzstd
versions.
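A sketch of the clamping, using libzstd’s public ZSTD_maxCLevel():

    #include <zstd.h>

    /* Clamp only at the top; the lower bound is deliberately not checked
     * since ZSTD_minCLevel() isn't available in older libzstd versions. */
    static int clamp_compression_level(int level)
    {
        return level > ZSTD_maxCLevel() ? ZSTD_maxCLevel() : level;
    }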
Joel Rosdahl [Sat, 22 Jun 2019 20:51:42 +0000 (22:51 +0200)]
Use the compression API for reading and writing manifests
* Manifest and result files now share the same common header (sans the
magic bytes) and will be compressed using the common compression
settings.
* Removed the legacy “hash size” and “reserved” fields.
Joel Rosdahl [Sun, 16 Jun 2019 14:42:42 +0000 (16:42 +0200)]
Add a content size field to the result file header
The content size field says how much uncompressed data is stored in the
file. This can be used to relatively quickly determine the compression
ratio for the whole cache by only inspecting each file’s header instead
of having to read and decompress all files.
Since the content size needs to be calculated before actually adding the
content to the result file, I’ve reverted to letting the format use a
“number of entries” field instead of an EOF marker (similar to Anders
Björklund’s original work in 0399be2d), since the information about the
number of files now has to be known beforehand.
Another subtle change is that the compression level field now is int8_t
instead of uint8_t to make it possible to record negative levels.
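A sketch of the whole-cache scan that the content size field enables,
where read_content_size() and file_size_on_disk() are hypothetical
helpers (parse one file’s header; stat the file):

    #include <stddef.h>
    #include <stdint.h>

    uint64_t read_content_size(const char *path); /* hypothetical: header only */
    uint64_t file_size_on_disk(const char *path); /* hypothetical: stat */

    static double whole_cache_compression_ratio(const char *const *paths, size_t n)
    {
        uint64_t content = 0, stored = 0;
        for (size_t i = 0; i < n; ++i) {
            content += read_content_size(paths[i]); /* no decompression needed */
            stored += file_size_on_disk(paths[i]);
        }
        return stored > 0 ? (double)content / stored : 0.0;
    }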
Igor Pylypiv [Mon, 10 Jun 2019 20:52:35 +0000 (13:52 -0700)]
Fix possible NULL pointer dereference (#433)
cppcheck:
[src/manifest.c:270] -> [src/manifest.c:269]: (warning)
Either the condition '!errmsg' is redundant or there is possible null pointer
dereference: errmsg.
Joel Rosdahl [Sat, 8 Jun 2019 18:49:39 +0000 (20:49 +0200)]
Improve naming of things
Some words are being used to mean several things in the code base:
* “object” can mean both “the object file (.o) produced by the compiler”
and “the result stored in the cache, including e.g. the .o file and .d
file”.
* “hash” can mean “the state that the hash_* functions operate on”,
“the output of a hash function” and “the key used to index results
and manifests in the cache”.
This commit tries to make the naming more consistent:
* “object” means “the object file (.o) produced by the compiler”
* “result” means “the result stored in the cache, including e.g. the .o
file and .d file”.
* “struct hash” means “the state that the hash_* functions operate on”.
* “digest” means “the output of a hash function”. However, “hash” is
still used in documentation and command line output since I think that
“hash” is easier to understand for most people, especially since
that’s the term used by Git.
* “name” means “the key used to index results and manifests in the
cache”.
Joel Rosdahl [Sat, 8 Jun 2019 11:25:49 +0000 (13:25 +0200)]
Improve how <MD4, number of hashed bytes> is represented
Internally, the tuple <MD4 hash, number of hashed bytes>, which is the
key used for cached results and manifests, was represented as 16 bytes +
1 uint32_t. Externally, i.e. in file names, it was represented as
<MD4>-<size>, with <MD4> being 32 hex digits and <size> being the number
of hashed bytes in human-readable form.
This commit changes the internal representation to 20 bytes, where the
last 4 bytes are the number of hashed bytes in big-endian order. The
external representation has been changed to match this, i.e. to be 40
hex digits. This makes the code slightly less complex and more
consistent. Also, the code that converts the key into string form has
been rewritten to not allocate on the heap but to just write the output
into a buffer owned by the caller.
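A sketch of the heap-free conversion, with illustrative names:

    #include <stddef.h>
    #include <stdint.h>

    #define DIGEST_SIZE 20 /* 16 MD4 bytes + 4 size bytes, as described above */

    /* The caller owns the output buffer: 40 hex digits plus NUL. */
    static void digest_to_hex(const uint8_t digest[DIGEST_SIZE],
                              char out[2 * DIGEST_SIZE + 1])
    {
        static const char hex[] = "0123456789abcdef";
        for (size_t i = 0; i < DIGEST_SIZE; ++i) {
            out[2 * i] = hex[digest[i] >> 4];
            out[2 * i + 1] = hex[digest[i] & 0xf];
        }
        out[2 * DIGEST_SIZE] = '\0';
    }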
struct file_hash (16 bytes + 1 uint32_t) has been renamed to struct
digest (20 bytes) in order to emphasize that it represents the output of
a hash algorithm that does not necessarily get file content as its input.
The documentation of the manifest format has been updated to reflect the
logical change of keys, even though the actual serialized content of
manifest files hasn’t changed. While at it, reading of the obsolete
“hash_size” and “reserved” fields has been removed. (Future changes in
the manifest format will be handled by just stepping the version.)
Joel Rosdahl [Thu, 6 Jun 2019 18:10:10 +0000 (20:10 +0200)]
Remove the hard link mode
Rationale:
* The hard link feature is prone to errors: a) changes to files outside
the cache will corrupt the cache, and b) the mtime field in the file's
i-node is used for different purposes by ccache and build tools like
make.
* The upcoming enabling of LZ4 compression by default will make the hard
link mode obsolete as a means of saving cache space.
* Not supporting hard links will make a future backend storage API
simpler.
Joel Rosdahl [Thu, 6 Jun 2019 11:44:16 +0000 (13:44 +0200)]
Improve error handling of (de)compressors
Previously, some kinds of corruption were not detected by the zlib
decompressor since it didn’t check that it had reached the end of the
stream and therefore didn’t verify the Adler-32 checksum.
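A sketch of the stronger check using zlib’s API directly (input/output
buffer management trimmed for brevity):

    #include <stdbool.h>
    #include <zlib.h>

    /* Only Z_STREAM_END means zlib reached the end of the stream and
     * therefore verified the Adler-32 checksum; Z_OK after all input has
     * been consumed would mean the stream was truncated. */
    static bool finish_inflate(z_stream *strm)
    {
        return inflate(strm, Z_FINISH) == Z_STREAM_END;
    }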
Joel Rosdahl [Tue, 4 Jun 2019 19:49:52 +0000 (21:49 +0200)]
Use the compression API for results
It didn’t feel right to use zlib’s gzip format for the embedded content,
especially since other compression libraries don’t support a similar
interface. Therefore, use the standard low-level zlib API instead.
Joel Rosdahl [Thu, 30 May 2019 18:37:12 +0000 (20:37 +0200)]
Revise disk format for results
* Removed unused hash_size and reserved fields. Since there are no
hashes stored in the result metadata, hash size is superfluous. The
reserved bits field is also unnecessary; if we need to change the
format, we can just step RESULT_VERSION and be done with it.
* Instead of storing file count in the header, store an EOF marker after
the file entries. The main reason for this is that files can then be
appended to the result file without having to precalculate how many
files the result will contain.
* Don’t include trailing NUL in suffix strings since the length is known.
* Instead of potentially compressing the whole file, added an
uncompressed header telling how/if the rest of the file is
compressed (which algorithm and level). This makes it possible to more
efficiently recompress files in a batch job since it’s possible to
reasonably efficiently check if a cached file should be repacked. The
reason for not having compression info in each subfile
header (supporting different compression algorithms/levels per
subfile) is to make the repacking scenario simpler.
* Prepared for adding support for “reference entries”, which refer to
other results. There are two potential use cases for reference
entries: a) deduplication and b) storing partial results with a
different compression algorithm/level. It’s probably only the
deduplication use case that is interesting, though. It can be done
either at cache miss time or later as a batch job. If we really want
to, we can in the future add similar “raw reference entries” that
refer to files stored verbatim in the storage, thus re-enabling hard
link functionality.
* Changed to cCrS as the magic bytes for result files. This is analogous
to the magic bytes used for manifest files.
* Added documentation of the format.
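A rough picture of the resulting layout as described above (field order
and widths are illustrative where the text doesn’t state them):

    /*
     * "cCrS"             magic bytes
     * version            result format version (RESULT_VERSION)
     * compression type   how the rest of the file is stored
     * compression level
     * --- possibly compressed from here on ---
     * file entry 0       type, suffix (no trailing NUL), size, data
     * ...
     * EOF marker
     */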