git.ipfire.org Git - thirdparty/xz.git/commit

author	Lasse Collin <lasse.collin@tukaani.org>
	Tue, 25 Mar 2025 13:18:31 +0000 (15:18 +0200)
committer	Lasse Collin <lasse.collin@tukaani.org>
	Tue, 25 Mar 2025 13:18:31 +0000 (15:18 +0200)
commit	943b012d09f717f7b44284c4e4976ea41264c731
tree	b69e283e597586459c2f527d8c9a04be08033ede
parent	bc14e4c94e788d42eeab984298391fc0ca46f969

liblzma: Use SSE2 intrinsics instead of memcpy() in dict_repeat()

SSE2 is supported on every x86-64 processor. The SSE2 code is used on
32-bit x86 if compiler options permit unconditional use of SSE2.

dict_repeat() copies short random-sized unaligned buffers. At least
on glibc, FreeBSD, and Windows (MSYS2, UCRT, MSVCRT), memcpy() is
clearly faster than byte-by-byte copying in this use case. Compared
to the memcpy() version, the new SSE2 version reduces decompression
time by 0-5 % depending on the machine and libc. It should never be
slower than the memcpy() version.
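
As a rough illustration of the idea above (not the code from this
commit), the sketch below copies a short unaligned buffer in 16-byte
steps with unaligned SSE2 loads and stores, and falls back to memcpy()
when the compiler doesn't allow unconditional SSE2. The names
sketch_copy() and SKETCH_HAVE_SSE2 are hypothetical. The sketch assumes
that the buffers don't overlap and that both have at least 15 bytes of
accessible slack after len, because the loop rounds the copy size up to
a multiple of 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// SSE2 is always available on x86-64. On 32-bit x86 it is used only
// if the compiler options enable it (__SSE2__ on GCC/Clang,
// _M_IX86_FP >= 2 on MSVC).
#if defined(__SSE2__) || defined(_M_X64) \
		|| (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#	include <emmintrin.h>
#	define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper, not part of liblzma.
// Assumption: dst and src don't overlap, and up to 15 bytes past
// len may be read from src and written to dst.
static void
sketch_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
#ifdef SKETCH_HAVE_SSE2
	// Copy whole 16-byte chunks. The last chunk may go past len,
	// which is why the slack described above is required.
	for (size_t i = 0; i < len; i += 16) {
		const __m128i x = _mm_loadu_si128(
				(const __m128i *)(src + i));
		_mm_storeu_si128((__m128i *)(dst + i), x);
	}
#else
	// Portable fallback when SSE2 cannot be used unconditionally.
	memcpy(dst, src, len);
#endif
}
```

Allowing the copy to overshoot avoids a separate tail loop and a
size-dependent branch, which is one plausible reason why such a loop
can beat a general-purpose memcpy() on short random-sized buffers.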

However, on musl 1.2.5 on x86-64, the memcpy() version is the slowest.
Compared to the memcpy() version:

- The byte-by-byte version takes 6-7 % less time to decompress.
- The SSE2 version takes 16-18 % less time to decompress.

The numbers are from decompressing a Linux kernel source tarball in
single-threaded mode on older AMD and Intel systems. The tarball
compresses well, and thus dict_repeat() performance matters more
than with some other files.

src/liblzma/lz/lz_decoder.c
src/liblzma/lz/lz_decoder.h