Lasse Collin [Sun, 6 Mar 2022 21:36:20 +0000 (23:36 +0200)]
liblzma: Add threaded .xz decompressor.
I realize that this is about a decade late.
Big thanks to Sebastian Andrzej Siewior for the original patch.
I made a bunch of smaller changes but after a while quite a few
things got rewritten. So any bugs in the commit were created by me.
Lasse Collin [Sun, 6 Mar 2022 14:41:19 +0000 (16:41 +0200)]
liblzma: Add new output queue (lzma_outq) features.
Add lzma_outq_clear_cache2() which may leave one buffer allocated
in the cache.
Add lzma_outq_outbuf_memusage() to get the memory needed for
a single lzma_outbuf. This is now used internally in outqueue.c too.
Track both the total amount of memory allocated and the amount of
memory that is in active use (not in cache).
In lzma_outbuf, allow storing the current input position that
matches the current output position. This way the main thread
can notice when no more output is possible without first providing
more input.
Allow specifying return code for lzma_outq_read() in a finished
lzma_outbuf.
Lasse Collin [Tue, 22 Feb 2022 00:04:18 +0000 (02:04 +0200)]
liblzma: Check the return value of lzma_index_append() in threaded encoder.
If lzma_index_append() failed (most likely memory allocation failure)
it could have gone unnoticed and the resulting .xz file would have
an incorrect Index. Decompressing such a file would produce the
correct uncompressed data but then an error would occur when
verifying the Index field.
Lasse Collin [Sun, 20 Feb 2022 18:36:27 +0000 (20:36 +0200)]
liblzma: Make Block decoder catch certain types of errors better.
Now it limits the input and output buffer sizes that are
passed to a raw decoder. This way there's no need to check
if the sizes can grow too big or overflow when updating
Compressed Size and Uncompressed Size counts. This also means
that a corrupt file cannot cause the raw decoder to process
useless extra input or output that would exceed the size info
in Block Header (and thus cause LZMA_DATA_ERROR anyway).
More importantly, now the size information is verified more
carefully in case raw decoder returns LZMA_OK. This doesn't
really matter with the current single-threaded .xz decoder
as the errors would be detected slightly later anyway. But
this helps avoiding corner cases in the upcoming threaded
decompressor, and it might help other Block decoder uses
outside liblzma too.
The test files bad-1-lzma2-{9,10,11}.xz test these conditions.
With the single-threaded .xz decoder the only difference is
that LZMA_DATA_ERROR is detected in a difference place now.
Lasse Collin [Sun, 6 Feb 2022 23:14:37 +0000 (01:14 +0200)]
Translations: Add French translation of man pages.
This matches xz-utils 5.2.5-2 in Debian.
The translation was done by "bubu", proofread by the debian-l10n-french
mailing list contributors, and submitted to me on the xz-devel mailing
list by Jean-Pierre Giraud. Thanks to everyone!
jiat75 [Fri, 28 Jan 2022 12:47:55 +0000 (20:47 +0800)]
liblzma: Add NULL checks to LZMA and LZMA2 properties encoders.
Previously lzma_lzma_props_encode() and lzma_lzma2_props_encode()
assumed that the options pointers must be non-NULL because the
with these filters the API says it must never be NULL. It is
good to do these checks anyway.
Ville Skyttä [Sat, 13 Nov 2021 08:11:57 +0000 (10:11 +0200)]
xzgrep: use `grep -E/-F` instead of `egrep` and `fgrep`
`egrep` and `fgrep` have been deprecated in GNU grep since 2007, and in
current post 3.7 Git they have been made to emit obsolescence warnings:
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=a9515624709865d480e3142fd959bccd1c9372d1
Alexander Bluhm [Tue, 5 Oct 2021 21:33:16 +0000 (23:33 +0200)]
xz: Avoid fchown(2) failure.
OpenBSD does not allow to change the group of a file if the user
does not belong to this group. In contrast to Linux, OpenBSD also
fails if the new group is the same as the old one. Do not call
fchown(2) in this case, it would change nothing anyway.
This fixes an issue with Perl Alien::Build module.
https://github.com/PerlAlien/Alien-Build/issues/62
Lasse Collin [Thu, 9 Sep 2021 18:41:51 +0000 (21:41 +0300)]
liblzma: Use _MSVC_LANG to detect when "noexcept" can be used with MSVC.
By default, MSVC always sets __cplusplus to 199711L. The real
C++ standard version is available in _MSVC_LANG (or one could
use /Zc:__cplusplus to set __cplusplus correctly).
Lasse Collin [Fri, 4 Jun 2021 15:52:48 +0000 (18:52 +0300)]
xzless: Fix less(1) version detection when it contains a dot.
Sometimes the version number from "less -V" contains a dot,
sometimes not. xzless failed detect the version number when
it does contain a dot. This fixes it.
Thanks to nick87720z for reporting this. Apparently it had been
reported here <https://bugs.gentoo.org/489362> in 2013.
Due to architectural limitations, address space available to a single
userspace process on MIPS32 is limited to 2 GiB, not 4, even on systems
that have more physical RAM -- e.g. 64-bit systems with 32-bit
userspace, or systems that use XPA (an extension similar to x86's PAE).
So, for MIPS32, we have to impose stronger memory limits. I've chosen
2000MiB to give the process some headroom.
Lasse Collin [Sat, 30 Jan 2021 16:36:04 +0000 (18:36 +0200)]
CMake: Try to improve compatibility with the FindLibLZMA module.
The naming conflict with FindLibLZMA module gets worse.
Not avoiding it in the first place was stupid.
Normally find_package(LibLZMA) will use the module and
find_package(liblzma 5.2.5 REQUIRED CONFIG) will use the config
file even with a case insensitive file system. However, if
CMAKE_FIND_PACKAGE_PREFER_CONFIG is TRUE and the file system
is case insensitive, find_package(LibLZMA) will find our liblzma
config file instead of using FindLibLZMA module.
One big problem with this is that FindLibLZMA uses
LibLZMA::LibLZMA and we use liblzma::liblzma as the target
name. With target names CMake happens to be case sensitive.
To workaround this, this commit adds
add_library(LibLZMA::LibLZMA ALIAS liblzma::liblzma)
to the config file. Then both spellings work.
To make the behavior consistent between case sensitive and
insensitive file systems, the config and related files are
renamed from liblzmaConfig.cmake to liblzma-config.cmake style.
With this style CMake looks for lowercase version of the package
name so find_package(LiBLzmA 5.2.5 REQUIRED CONFIG) will work
to find our config file.
There are other differences between our config file and
FindLibLZMA so it's still possible that things break for
reasons other than the spelling of the target name. Hopefully
those situations aren't too common.
When the config file is available, it should always give as good or
better results as FindLibLZMA so this commit doesn't affect the
recommendation to use find_package(liblzma 5.2.5 REQUIRED CONFIG)
which explicitly avoids FindLibLZMA.
Lasse Collin [Sun, 17 Jan 2021 17:20:50 +0000 (19:20 +0200)]
liblzma: In EROFS LZMA decoder, verify that comp_size matches at the end.
When the uncompressed size is known to be exact, after decompressing
the stream exactly comp_size bytes of input must have been consumed.
This is a minor improvement to error detection.
Lasse Collin [Thu, 14 Jan 2021 18:07:01 +0000 (20:07 +0200)]
liblzma: Add EROFS LZMA encoder and decoder.
Right now this is just a planned extra-compact format for use
in the EROFS file system in Linux. At this point it's possible
that the format will either change or be abandoned and removed
completely.
The special thing about the encoder is that it uses the
output-size-limited encoding added in the previous commit.
EROFS uses fixed-sized blocks (e.g. 4 KiB) to hold compressed
data so the compressors must be able to create valid streams
that fill the given block size.
Lasse Collin [Wed, 13 Jan 2021 17:16:32 +0000 (19:16 +0200)]
liblzma: Add rough support for output-size-limited encoding in LZMA1.
With this it is possible to encode LZMA1 data without EOPM so that
the encoder will encode as much input as it can without exceeding
the specified output size limit. The resulting LZMA1 stream will
be a normal LZMA1 stream without EOPM. The actual uncompressed size
will be available to the caller via the uncomp_size pointer.
One missing thing is that the LZMA layer doesn't inform the LZ layer
when the encoding is finished and thus the LZ may read more input
when it won't be used. However, this doesn't matter if encoding is
done with a single call (which is the planned use case for now).
For proper multi-call encoding this should be improved.
This commit only adds the functionality for internal use.
Nothing uses it yet.
Lasse Collin [Mon, 11 Jan 2021 21:41:16 +0000 (23:41 +0200)]
xz: Make --keep accept symlinks, hardlinks, and setuid/setgid/sticky.
Previously this required using --force but that has other
effects too which might be undesirable. Changing the behavior
of --keep has a small risk of breaking existing scripts but
since this is a fairly special corner case I expect the
likehood of breakage to be low enough.
I think the new behavior is more logical. The only reason for
the old behavior was to be consistent with gzip and bzip2.
Thanks to Vincent Lefevre and Sebastian Andrzej Siewior.
Lasse Collin [Mon, 11 Jan 2021 21:28:52 +0000 (23:28 +0200)]
Scripts: Fix exit status of xzgrep.
Omit the -q option from xz, gzip, and bzip2. With xz this shouldn't
matter. With gzip it's important because -q makes gzip replace SIGPIPE
with exit status 2. With bzip2 it's important because with -q bzip2
is completely silent if input is corrupt while other decompressors
still give an error message.
Avoiding exit status 2 from gzip is important because bzip2 uses
exit status 2 to indicate corrupt input. Before this commit xzgrep
didn't recognize corrupt .bz2 files because xzgrep was treating
exit status 2 as SIGPIPE for gzip compatibility.
zstd still needs -q because otherwise it is noisy in normal
operation.
The code to detect real SIGPIPE didn't check if the exit status
was due to a signal (>= 128) and so could ignore some other exit
status too.
Lasse Collin [Mon, 11 Jan 2021 20:01:51 +0000 (22:01 +0200)]
Scripts: Fix exit status of xzdiff/xzcmp.
This is a minor fix since this affects only the situation when
the files differ and the exit status is something else than 0.
In such case there could be SIGPIPE from a decompression tool
and that would result in exit status of 2 from xzdiff/xzcmp
while the correct behavior would be to return 1 or whatever
else diff or cmp may have returned.
This commit omits the -q option from xz/gzip/bzip2/lzop arguments.
I'm not sure why the -q was used in the first place, perhaps it
hides warnings in some situation that I cannot see at the moment.
Hopefully the removal won't introduce a new bug.
With gzip the -q option was harmful because it made gzip return 2
instead of >= 128 with SIGPIPE. Ignoring exit status 2 (warning
from gzip) isn't practical because bzip2 uses exit status 2 to
indicate corrupt input file. It's better if SIGPIPE results in
exit status >= 128.
With bzip2 the removal of -q seems to be good because with -q
it prints nothing if input is corrupt. The other tools aren't
silent in this situation even with -q. On the other hand, if
zstd support is added, it will need -q since otherwise it's
noisy in normal situations.
Thanks to Étienne Mollier and Sebastian Andrzej Siewior.
Lasse Collin [Sat, 9 Jan 2021 19:14:36 +0000 (21:14 +0200)]
liblzma: Make lzma_outq usable for threaded decompression too.
Before this commit all output queue buffers were allocated as
a single big allocation. Now each buffer is allocated separately
when needed. Used buffers are cached to avoid reallocation
overhead but the cache will keep only one buffer size at a time.
This should make things work OK in the decompression where most
of the time the buffer sizes will be the same but with some less
common files the buffer sizes may vary.
While this should work fine, it's still a bit preliminary
and may even get reverted if it turns out to be useless for
decompression.
Lasse Collin [Tue, 17 Nov 2020 18:51:48 +0000 (20:51 +0200)]
CMake: Fix compatibility with CMake 3.13.
The syntax "if(DEFINED CACHE{FOO})" requires CMake 3.14.
In some other places the code treats the cache variables
like normal variables already (${FOO} or if(FOO) is used,
not ${CACHE{FOO}).
Lasse Collin [Sun, 1 Nov 2020 20:34:25 +0000 (22:34 +0200)]
xz: Avoid unneeded \f escapes on the man page.
I don't want to use \c in macro arguments but groff_man(7)
suggests that \f has better portability. \f would be needed
for the .TP strings for portability reasons anyway.
Lasse Collin [Sun, 1 Nov 2020 16:41:21 +0000 (18:41 +0200)]
xz: Avoid the abbreviation "e.g." on the man page.
A few are simply omitted, most are converted to "for example"
and surrounded with commas. Sounds like that this is better
style, for example, man-pages(7) recommends avoiding such
abbreviations except in parenthesis.
Lasse Collin [Sun, 12 Jul 2020 20:10:03 +0000 (23:10 +0300)]
xz man page: Change \- (minus) to \(en (en-dash) for a numeric range.
Docs of ancient troff/nroff mention \(em (em-dash) but not \(en
and \- was used for both minus and en-dash. I don't know how
portable \(en is nowadays but it can be changed back if someone
complains. At least GNU groff and OpenBSD's mandoc support it.
Output is from: test-groff -b -e -mandoc -T utf8 -rF0 -t -w w -z
[ "test-groff" is a developmental version of "groff" ]
Input file is ./src/scripts/xzgrep.1
<src/scripts/xzgrep.1>:20 (macro RB): only 1 argument, but more are expected
<src/scripts/xzgrep.1>:23 (macro RB): only 1 argument, but more are expected
<src/scripts/xzgrep.1>:26 (macro RB): only 1 argument, but more are expected
<src/scripts/xzgrep.1>:29 (macro RB): only 1 argument, but more are expected
<src/scripts/xzgrep.1>:32 (macro RB): only 1 argument, but more are expected
"abc..." does not mean the same as "abc ...".
The output from nroff and troff is unchanged except for the space
between "file" and "...".
Output is from: test-groff -b -e -mandoc -T utf8 -rF0 -t -w w -z
[ "test-groff" is a developmental version of "groff" ]
Input file is ./src/xz/xz.1
<src/xz/xz.1>:408 (macro BR): only 1 argument, but more are expected
<src/xz/xz.1>:1009 (macro BR): only 1 argument, but more are expected
<src/xz/xz.1>:1743 (macro BR): only 1 argument, but more are expected
<src/xz/xz.1>:1920 (macro BR): only 1 argument, but more are expected
<src/xz/xz.1>:2213 (macro BR): only 1 argument, but more are expected
Output from nroff and troff is unchanged, except for a font change of a
full stop (.).
Lasse Collin [Wed, 11 Mar 2020 19:15:35 +0000 (21:15 +0200)]
xz: Never use thousand separators in DJGPP builds.
DJGPP 2.05 added support for thousands separators but it's
broken at least under WinXP with Finnish locale that uses
a non-breaking space as the thousands separator. Workaround
by disabling thousands separators for DJGPP builds.
Lasse Collin [Mon, 2 Mar 2020 11:54:33 +0000 (13:54 +0200)]
liblzma: Fix a comment and RC_SYMBOLS_MAX.
The comment didn't match the value of RC_SYMBOLS_MAX and the value
itself was slightly larger than actually needed. The only harm
about this was that memory usage was a few bytes larger.
Lasse Collin [Thu, 27 Feb 2020 18:24:27 +0000 (20:24 +0200)]
Build: Add support for --no-po4a option to autogen.sh.
Normally, if po4a isn't available, autogen.sh will return
with non-zero exit status. The option --no-po4a can be useful
when one knows that po4a isn't available but wants autogen.sh
to still return with zero exit status.
Lasse Collin [Tue, 25 Feb 2020 18:42:31 +0000 (20:42 +0200)]
Build: Fix bugs in the CMake files.
Seems that the phrase "add more quotes" from sh/bash scripting
applies to CMake as well. E.g. passing an unquoted list ${FOO}
to a function that expects one argument results in only the
first element of the list being passed as an argument and
the rest get ignored. Adding quotes helps ("${FOO}").
list(INSERT ...) is weird. Inserting an empty string to an empty
variable results in empty list, but inserting it to a non-empty
variable does insert an empty element to the list.
Since INSERT requires at least one element,
"${CMAKE_THREAD_LIBS_INIT}" needs to be quoted in CMakeLists.txt.
It might result in an empty element in the list. It seems to not
matter as empty elements consistently get ignored in that variable.
In fact, calling cmake_check_push_state() and cmake_check_pop_state()
will strip the empty elements from CMAKE_REQUIRED_LIBRARIES!
In addition to quoting fixes, this fixes checks for the cache
variables in tuklib_cpucores.cmake and tuklib_physmem.cmake.
Thanks to Martin Matuška for testing and reporting the problems.
These fixes aren't tested yet but hopefully they soon will be.
Lasse Collin [Mon, 24 Feb 2020 21:38:16 +0000 (23:38 +0200)]
Build: Add very limited experimental CMake support.
This does *NOT* replace the Autotools-based build system in
the foreseeable future. See the comment in the beginning
of CMakeLists.txt.
So far this has been tested only on GNU/Linux but I commit
it anyway to make it easier for others to test. Since I
haven't played much with CMake before, it's likely that
there are things that have been done in a silly or wrong
way and need to be fixed.
Lasse Collin [Mon, 24 Feb 2020 21:29:35 +0000 (23:29 +0200)]
tuklib: Omit an unneeded <sys/types.h> from a tests.
tuklib_cpucores.c and tuklib_physmem.c don't include <sys/types.h>
even via other files in this package, so clearly that header isn't
needed in the tests either (no one has reported build problems due
to a missing header in a .c file).
Lasse Collin [Fri, 21 Feb 2020 15:01:15 +0000 (17:01 +0200)]
Build: Add visibility.m4 from gnulib.
Appears that this file used to get included as a side effect of
gettext. After the change to gettext version requirements this file
no longer got copied to the package and so the build was broken.
Lasse Collin [Thu, 20 Feb 2020 16:54:04 +0000 (18:54 +0200)]
tuklib_exit: Add missing header.
strerror() needs <string.h> which happened to be included via
tuklib_common.h -> tuklib_config.h -> sysdefs.h if HAVE_CONFIG_H
was defined. This wasn't tested without config.h before so it
had worked fine.
Lasse Collin [Tue, 18 Feb 2020 17:12:35 +0000 (19:12 +0200)]
Revert the previous commit and add a comment.
The previous commit broke crc32_tablegen.c.
If the whole package is built without config.h (with defines
set on the compiler command line) this should still work fine
as long as these headers conform to C99 well enough.
Lasse Collin [Sun, 16 Feb 2020 09:18:28 +0000 (11:18 +0200)]
sysdefs.h: Omit the conditionals around string.h and limits.h.
string.h is used unconditionally elsewhere in the project and
configure has always stopped if limits.h is missing, so these
headers must have been always available even on the weirdest
systems.
Lasse Collin [Sat, 15 Feb 2020 01:08:32 +0000 (03:08 +0200)]
Build: Use AM_GNU_GETTEXT_REQUIRE_VERSION and require 0.19.6.
This bumps the version requirement from 0.19 (from 2014) to
0.19.6 (2015).
Using only the old AM_GNU_GETTEXT_VERSION results in old
gettext infrastructure being placed in the package. By using
both macros we get the latest gettext files while the other
programs in the Autotools family can still see the old macro.