Jia Tan [Wed, 1 Nov 2023 06:58:24 +0000 (14:58 +0800)]
xz: Disable sandbox when recursive mode is used.
The sandbox is very restrictive when one file is being encoded/decoded
to standard out. In recursive mode, processing a directory requires
opening sub-files and sub-directories which would not be allowed under
the sandbox.
Jia Tan [Thu, 9 Nov 2023 05:56:25 +0000 (13:56 +0800)]
xz: Hide the number of input files with recursive mode.
In recursive mode we don't know how many files to process at the
beginning. So just like when using --files or --files0, the number of
total files will not be shown.
Jia Tan [Wed, 1 Nov 2023 06:47:05 +0000 (14:47 +0800)]
xz: Parse directories in recursive mode.
This directory parsing method prioritizes lower memory usage and file
descriptor utilization at the cost of more complicated code and a higher
number of small allocations. This method makes no recursive calls and
instead keeps a queue of directories to parse.Only one directory file
descriptor is ever needed at one time.
The directory_iterator abstracts the implementation of the directory
parsing to allow for an easy interface for both POSIX and MSVC.
Currently the MSVC builds suffers from MAX_PATH being limited to 260 by
default. This restricts the usefulness of recursive mode on Windows. A
user can edit a registry config in Windows 10, Version 1607 and later to
remove this low path limit. Alternatively, we can prefix the absolute
path with "\\?\" to also remove the restriction. Note, this restriction
also applies to the compatibility functions so MSVC builds cannot read
or write to files with paths longer than 260 characters.
Jia Tan [Wed, 1 Nov 2023 06:43:03 +0000 (14:43 +0800)]
xz: Restrict when recursive mode can be used.
If we are not compiling with dirent.h or MSVC, then we cannot use
recursive mode. Unfortunatly, there is not a great portable way to parse
directory contents.
There are _find_next() functions available for DOS like platforms, but
Windows version of these functions is different. Since we do not have a
good way to test these functions, support will not be added at this
time.
Jia Tan [Sat, 21 Oct 2023 15:07:08 +0000 (23:07 +0800)]
xz: Change the way coder_run() and list_run() are called in main().
Previously, a function pointer was used to determine if coder_run() or
list_run() should be called in the main entry processing loop. This was
replaced by an extra function call to process_entry().
coder_run() and list_run() were changed to accept a file_pair * argument
instead of a filename. The common repeated code was moved to
process_entry() instead.
Jia Tan [Sat, 21 Oct 2023 14:55:51 +0000 (22:55 +0800)]
xz: Move some list_file() checks to args_parse().
The checks enforce that list mode will only run on .xz files. The
opt_format is only set during argument parsing and will not change
after. So we only need to check this once instead of every call to
list_file(). Additionally, this will cause the error to be detected
slightly earlier.
Jia Tan [Sat, 21 Oct 2023 12:32:05 +0000 (20:32 +0800)]
xz: Add a function to print Windows specific error messages.
Native Windows C API functions do not use errno, but instead have to
call GetLastError(). There is not an easy way to convert this error
code into a helpful message, so this creates a wrapper around the
slightly complicated FormatMessage() function.
The new message_windows_error() function calls message_error() under the
hood, so it will set the exit status to 1.
Jia Tan [Tue, 9 Jan 2024 08:40:56 +0000 (16:40 +0800)]
Doxygen: Added the XZ logo and copyright information.
The PROJECT_LOGO field is now used to include the XZ logo. The footer
of each page now lists the copyright information instead of the default
footer. The license is also copied to statisfy the copyright and so the
link in the documentation can be local.
Lasse Collin [Tue, 23 Jan 2024 16:29:28 +0000 (18:29 +0200)]
xz: Use threaded mode by defaut (as if --threads=0 was used).
This hopefully does more good than bad:
+ It's faster by default.
+ Only the threaded compressor creates files that
can be decompressed in threaded mode.
- Compression ratio is worse, usually not too much though.
When it matters, -T1 must be used.
- Memory usage increases.
- Scripts that assume single-threaded mode but don't use -T1 will
possibly use too much resources, for example, if they run
multiple xz processes in parallel to compress multiple files.
- Output from single-threaded and multi-threaded compressors
differ but such changes could happen for other reasons too
(they just haven't happened since 5.0.0).
Lasse Collin [Mon, 22 Jan 2024 22:09:48 +0000 (00:09 +0200)]
liblzma: RISC-V filter: Use byte-by-byte access.
Not all RISC-V processors support fast unaligned access so
it's better to read only one byte in the main loop. This can
be faster even on x86-64 when compared to reading 32 bits at
a time as half the time the address is only 16-bit aligned.
The downside is larger code size on archs that do support
fast unaligned access.
Jia Tan [Mon, 22 Jan 2024 15:33:39 +0000 (23:33 +0800)]
xz: Update xz -lvv for RISC-V filter.
Version 5.6.0 will be shown, even though upcoming alphas and betas
will be able to support this filter. 5.6.0 looks nicer in the output and
people shouldn't be encouraged to use an unstable version in production
in any way.
Jia Tan [Mon, 22 Jan 2024 15:33:39 +0000 (23:33 +0800)]
Tests: Add two RISC-V Filter test files.
These test files achieve 100% code coverage in
src/liblzma/simple/riscv.c. They contain all of the instructions that
should be filtered and a few cases that should not.
Lasse Collin [Thu, 11 Jan 2024 13:22:36 +0000 (15:22 +0200)]
liblzma: CRC: Remove crc_always_inline, use lzma_always_inline instead.
Now crc_simd_body() in crc_x86_clmul.h is only called once
in a translation unit, we no longer need to be so cautious
about ensuring the always-inline behavior.
Lasse Collin [Wed, 10 Jan 2024 16:23:31 +0000 (18:23 +0200)]
liblzma: Rename arch-specific CRC functions and macros.
CRC_CLMUL was split to CRC_ARCH_OPTIMIZED and CRC_X86_CLMUL.
CRC_ARCH_OPTIMIZED is defined when an arch-optimized version is used.
Currently the x86 CLMUL implementations are the only arch-optimized
versions, and these also use the CRC_x86_CLMUL macro to tell when
crc_x86_clmul.h needs to be included.
is_clmul_supported() was renamed to is_arch_extension_supported().
crc32_clmul() and crc64_clmul() were renamed to
crc32_arch_optimized() and crc64_arch_optimized().
This way the names make sense with arch-specific non-CLMUL
implementations as well.
Lasse Collin [Fri, 20 Oct 2023 20:35:10 +0000 (23:35 +0300)]
liblzma: Avoid extern lzma_crc32_clmul() and lzma_crc64_clmul().
A CLMUL-only build will have the crcxx_clmul() inlined into
lzma_crcxx(). Previously a jump to the extern lzma_crcxx_clmul()
was needed. Notes about shared liblzma on ELF platforms:
- On platforms that support ifunc and -fvisibility=hidden, this
was silly because CLMUL-only build would have that single extra
jump instruction of extra overhead.
- On platforms that support neither -fvisibility=hidden nor linker
version script (liblzma*.map), jumping to lzma_crcxx_clmul()
would go via PLT so a few more instructions of overhead (still
not a big issue but silly nevertheless).
There was a downside with static liblzma too: if an application only
needs lzma_crc64(), static linking would make the linker include the
CLMUL code for both CRC32 and CRC64 from crc_x86_clmul.o even though
the CRC32 code wouldn't be needed, thus increasing code size of the
executable (assuming that -ffunction-sections isn't used).
Also, now compilers are likely to inline crc_simd_body()
even if they don't support the always_inline attribute
(or MSVC's __forceinline). Quite possibly all compilers
that build the code do support such an attribute. But now
it likely isn't a problem even if the attribute wasn't supported.
Now all x86-specific stuff is in crc_x86_clmul.h. If other archs
The other archs can then have their own headers with their own
is_clmul_supported() and crcxx_clmul().
Another bonus is that the build system doesn't need to care if
crc_clmul.c is needed.
is_clmul_supported() stays as inline function as it's not needed
when doing a CLMUL-only build (avoids a warning about unused function).
Lasse Collin [Wed, 20 Dec 2023 19:15:16 +0000 (21:15 +0200)]
liblzma: Use 8-byte method in memcmplen.h on ARM64.
It requires fast unaligned access to 64-bit integers
and a fast instruction to count leading zeros in
a 64-bit integer (__builtin_ctzll()). This perhaps
should be enabled on some other archs too.
Thanks to Chenxi Mao for the original patch:
https://github.com/tukaani-project/xz/pull/75 (the first commit)
According to the numbers there, this may improve encoding
speed by about 3-5 %.
This enables the 8-byte method on MSVC ARM64 too which
should work but wasn't tested.
Jia Tan [Wed, 20 Dec 2023 14:39:13 +0000 (22:39 +0800)]
CMake: Move sandbox detection outside of xz section.
The sandbox is now enabled for xzdec as well, so it no longer belongs
in just the xz section. xz and xzdec are always built, except for older
MSVC versions, so there isn't a need to conditionally show the sandbox
configuration. CMake will do a little unecessary work on older MSVC
versions that can't build xz or xzdec, but this is a very small
downside.
Jia Tan [Tue, 19 Dec 2023 13:18:28 +0000 (21:18 +0800)]
xzdec: Add sandbox support for Pledge, Capsicum, and Landlock.
A very strict sandbox is used when the last file is decompressed. The
likely most common use case of xzdec is to decompress a single file.
The Pledge sandbox is applied to the entire process with slightly more
relaxed promises, until the last file is processed.
Thanks to Christian Weisgerber for the initial patch adding Pledge
sandboxing.
Jia Tan [Wed, 20 Dec 2023 13:31:34 +0000 (21:31 +0800)]
liblzma: Initialize lzma_lz_encoder pointers with NULL.
This fixes the recent change to lzma_lz_encoder that used memzero
instead of the NULL constant. On some compilers the NULL constant
(always 0) may not equal the NULL pointer (this only needs to guarentee
to not point to valid memory address).
Later code compares the pointers to the NULL pointer so we must
initialize them with the NULL pointer instead of 0 to guarentee
code correctness.
Jia Tan [Sat, 16 Dec 2023 12:51:38 +0000 (20:51 +0800)]
liblzma: Set all values in lzma_lz_encoder to NULL after allocation.
The first member of lzma_lz_encoder doesn't necessarily need to be set
to NULL since it will always be set before anything tries to use it.
However the function pointer members must be set to NULL since other
functions rely on this NULL value to determine if this behavior is
supported or not.
This fixes a somewhat serious bug, where the options_update() and
set_out_limit() function pointers are not set to NULL. This seems to
have been forgotten since these function pointers were added many years
after the original two (code() and end()).
The problem is that by not setting this to NULL we are relying on the
memory allocation to zero things out if lzma_filters_update() is called
on a LZMA1 encoder. The function pointer for set_out_limit() is less
serious because there is not an API function that could call this in an
incorrect way. set_out_limit() is only called by the MicroLZMA encoder,
which must use LZMA1 where set_out_limit() is always set. Its currently
not possible to call set_out_limit() on an LZMA2 encoder at this time.
So calling lzma_filters_update() on an LZMA1 encoder had undefined
behavior since its possible that memory could be manipulated so the
options_update member pointed to a different instruction sequence.
This is unlikely to be a bug in an existing application since it relies
on calling lzma_filters_update() on an LZMA1 encoder in the first place.
For instance, it does not affect xz because lzma_filters_update() can
only be used when encoding to the .xz format.
This is fixed by using memzero() to set all members of lzma_lz_encoder
to NULL after it is allocated. This ensures this mistake will not occur
here in the future if any additional function pointers are added.
Jia Tan [Sat, 16 Dec 2023 12:28:21 +0000 (20:28 +0800)]
liblzma: Make parameter names in function definition match declaration.
lzma_raw_encoder() and lzma_raw_encoder_init() used "options" as the
parameter name instead of "filters" (used by the declaration). "filters"
is more clear since the parameter represents the list of filters passed
to the raw encoder, each of which contains filter options.
Jia Tan [Sat, 16 Dec 2023 12:18:47 +0000 (20:18 +0800)]
liblzma: Improve lzma encoder init function consistency.
lzma_encoder_init() did not check for NULL options, but
lzma2_encoder_init() did. This is more of a code style improvement than
anything else to help make lzma_encoder_init() and lzma2_encoder_init()
more similar.
Jia Tan [Wed, 6 Dec 2023 10:30:25 +0000 (18:30 +0800)]
Tests: Minor cleanups to OSS-Fuzz files.
Most of these fixes are small typos and tweaks. A few were caused by bad
advice from me. Here is the summary of what is changed:
- Author line edits
- Small comment changes/additions
- Using the return value in the error messages in the fuzz targets'
coder initialization code
- Removed fuzz_encode_stream.options. This set a max length, which may
prevent some worthwhile code paths from being properly exercised.
- Removed the max_len option from fuzz_decode_stream.options for the
same reason as fuzz_encode_stream. The alone decoder fuzz target still
has this restriction.
- Altered the dictionary contents for fuzz_lzma.dict. Instead of keeping
the properties static and varying the dictionary size, the properties
are varied and the dictionary size is kept small. The dictionary size
doesn't have much impact on the code paths but the properties do.
Maksym Vatsyk [Tue, 5 Dec 2023 15:31:09 +0000 (16:31 +0100)]
Tests: Add fuzz_encode_stream ossfuzz target.
This fuzz target handles .xz stream encoding. The first byte of input
is used to dynamically set the preset level in order to increase the
fuzz coverage of complex critical code paths.
Maksym Vatsyk [Mon, 4 Dec 2023 16:23:24 +0000 (17:23 +0100)]
Tests: Add fuzz_decode_alone OSS-Fuzz target
This fuzz target that handles LZMA alone decoding. A new fuzz
dictionary .dict was also created with common LZMA header values to
help speed up the discovery of valid headers.
Maksym Vatsyk [Mon, 4 Dec 2023 16:21:29 +0000 (17:21 +0100)]
Tests: Update OSS-Fuzz Makefile.
All .c files can be built as separate fuzz targets. This simplifies
the Makefile by allowing us to use wildcards instead of having a
Makefile target for each fuzz target.
Jia Tan [Tue, 21 Nov 2023 12:56:55 +0000 (20:56 +0800)]
Build: Change --enable-ifunc handling.
Some compilers support __attribute__((__ifunc__())) even though the
dynamic linker does not. The compiler is able to create the binary
but it will fail on startup. So it is not enough to just test if
the attribute is supported.
The default value for enable_ifunc is now auto, which will attempt
to compile a program using __attribute__((__ifunc__())). There are
additional checks in this program if glibc is being used or if it
is running on FreeBSD.
Setting --enable-ifunc will skip this test and always enable
__attribute__((__ifunc__())), even if is not supported.
Jia Tan [Thu, 23 Nov 2023 14:04:35 +0000 (22:04 +0800)]
xz: Create separate is_tty() function.
The new is_tty() will report if a file descriptor is a terminal or not.
On POSIX systems, it is a wrapper around isatty(). However, the native
Windows implementation of isatty() will return true for all character
devices, not just terminals. So is_tty() has a special case for Windows
so it can use alternative Windows API functions to determine if a file
descriptor is a terminal.
This fixes a bug with MSVC and MinGW-w64 builds that refused to read from
or write to non-terminal character devices because xz thought it was a
terminal. For instance:
xz foo -c > /dev/null
would fail because /dev/null was assumed to be a terminal.
Jia Tan [Fri, 17 Nov 2023 12:35:11 +0000 (20:35 +0800)]
Tests: Create test_suffix.sh.
This tests some complicated interactions with the --suffix= option.
The suffix option must be used with --format=raw, but can optionally
be used to override the default .xz suffix.
This test also verifies some recent bugs have been correctly solved
and to hopefully avoid further regressions in the future.
Jia Tan [Fri, 17 Nov 2023 12:19:26 +0000 (20:19 +0800)]
xz: Fix a bug with --files and --files0 in raw mode without a suffix.
The following command caused a segmentation fault:
xz -Fraw --lzma1 --files=foo
when foo was a valid file. The usage of --files or --files0 was not
being checked when compressing or decompressing in raw mode without a
suffix. The suffix checking code was meant to validate that all files
to be processed are "-" (if not writing to standard out), meaning the
data is only coming from standard in. In this case, there were no file
names to check since --files and --files0 store their file name in a
different place.
Later code assumed the suffix was set and caused a segmentation fault.
Now, the above command results in an error.
Jia Tan [Wed, 15 Nov 2023 15:40:13 +0000 (23:40 +0800)]
xz: Refactor suffix test with raw format.
The previous version set opt_stdout, but this caused an issue with
copying an input file to standard out when decompressing an unknown file
type. The following needs to result in an error:
echo foo | xz -df
since -c, --stdout is not used. This fixes the previous error by not
setting opt_stdout.
Jia Tan [Tue, 14 Nov 2023 12:27:46 +0000 (20:27 +0800)]
xz: Move suffix check after stdout mode is detected.
This fixes a bug introduced in cc5aa9ab138beeecaee5a1e81197591893ee9ca0
when the suffix check was initially moved. This caused a situation that
previously worked:
echo foo | xz -Fraw --lzma1 | wc -c
to fail because the old code knew that this would write to standard out
so a suffix was not needed.
Jia Tan [Tue, 14 Nov 2023 12:27:04 +0000 (20:27 +0800)]
xz: Detect when all data will be written to standard out earlier.
If the -c, --stdout argument is not used, then we can still detect when
the data will be written to standard out if all of the provided
filenames are "-" (denoting standard in) or if no filenames are
provided.
Lasse Collin [Tue, 31 Oct 2023 19:41:09 +0000 (21:41 +0200)]
liblzma: Fix compilation of fastpos_tablegen.c.
The macro lzma_attr_visibility_hidden has to be defined to make
fastpos.h usable. The visibility attribute is irrelevant to
fastpos_tablegen.c so simply #define the macro to an empty value.
fastpos_tablegen.c is never built by the included build systems
and so the problem wasn't noticed earlier. It's just a standalone
program for generating fastpos_table.c.
Fixes: https://github.com/tukaani-project/xz/pull/69
Thanks to GitHub user Jamaika1.