From: Adam Stylinski
Date: Mon, 3 Feb 2025 02:05:37 +0000 (-0500)
Subject: Fix an unfortunate bug with Visual Studio 2015
X-Git-Tag: 2.2.4~2
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=287c4dce22a3244a0e2602d9e5bf0929df74fd27;p=thirdparty%2Fzlib-ng.git

Fix an unfortunate bug with Visual Studio 2015

Evidently vbroadcasti128, despite its intrinsic taking a register operand,
is a memory-register instruction, and there seems to be no alignment
requirement for the source operand. Because of this, compilers with
optimizations disabled perform the unaligned load and then spill the value
back to the stack in order to do the broadcasting load. In doing so, MSVC
seems to spill to the stack with an aligned move at an unaligned address,
causing a segfault. GCC does not make this mistake, as it stashes the value
at an aligned address.

If we're on Visual Studio 2015, let's just do the longer, 9-cycle sequence
of a 128-bit load followed by a vinserti128. This _should_ fix issue #1861.
---

diff --git a/arch/x86/chunkset_avx2.c b/arch/x86/chunkset_avx2.c
index b9051bb9..c7f336fd 100644
--- a/arch/x86/chunkset_avx2.c
+++ b/arch/x86/chunkset_avx2.c
@@ -32,7 +32,13 @@ static inline void chunkmemset_8(uint8_t *from, chunk_t *chunk) {
 }
 
 static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
+    /* See explanation in chunkset_avx512.c */
+#if defined(_MSC_VER) && _MSC_VER <= 1900
+    halfchunk_t half = _mm_loadu_si128((__m128i*)from);
+    *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
+#else
     *chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
+#endif
 }
 
 static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
diff --git a/arch/x86/chunkset_avx512.c b/arch/x86/chunkset_avx512.c
index 929d04cd..9d28d33d 100644
--- a/arch/x86/chunkset_avx512.c
+++ b/arch/x86/chunkset_avx512.c
@@ -46,7 +46,17 @@ static inline void chunkmemset_8(uint8_t *from, chunk_t *chunk) {
 }
 
 static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
+    /* Unfortunately there seems to be a compiler bug in Visual Studio 2015 where
+     * the load is dumped to the stack with an aligned move for this memory-register
+     * broadcast. The vbroadcasti128 instruction is 2 fewer cycles and this dump to
+     * stack doesn't exist if compiled with optimizations. For the sake of working
+     * properly in a debugger, let's take the 2 cycle penalty */
+#if defined(_MSC_VER) && _MSC_VER <= 1900
+    halfchunk_t half = _mm_loadu_si128((__m128i*)from);
+    *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
+#else
     *chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
+#endif
 }
 
 static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
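
For reference, here is a minimal standalone sketch of the two sequences
involved, outside of zlib-ng. It is not part of the patch: the names
broadcast128_fast and broadcast128_safe are invented for illustration, and
the chunk_t/halfchunk_t typedefs from these files are assumed to map to
__m256i/__m128i. Compile with AVX2 enabled (e.g. gcc -mavx2 or MSVC
/arch:AVX2).

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static inline __m256i broadcast128_fast(const uint8_t *from) {
    /* Single memory-register vbroadcasti128: the sequence reported to be
     * miscompiled by MSVC 2015 without optimizations, spilling through a
     * misaligned stack slot with an aligned move. */
    return _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)from));
}

static inline __m256i broadcast128_safe(const uint8_t *from) {
    /* Workaround from the patch: explicit 128-bit unaligned load, then
     * duplicate the register into the upper lane with vinserti128
     * (~2 cycles slower per the commit message). */
    __m128i half = _mm_loadu_si128((const __m128i *)from);
    return _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
}

int main(void) {
    uint8_t buf[17];
    for (int i = 0; i < 17; ++i)
        buf[i] = (uint8_t)i;
    /* Deliberately misaligned source pointer, as in the crash scenario. */
    __m256i a = broadcast128_fast(buf + 1);
    __m256i b = broadcast128_safe(buf + 1);
    uint8_t out_a[32], out_b[32];
    _mm256_storeu_si256((__m256i *)out_a, a);
    _mm256_storeu_si256((__m256i *)out_b, b);
    printf("results %s\n", memcmp(out_a, out_b, 32) == 0 ? "match" : "differ");
    return 0;
}

On a correctly behaving compiler both paths produce identical 256-bit
results; the fast path is the one reported to fault under Visual Studio
2015 when optimizations are disabled, which is why the patch gates it
behind _MSC_VER <= 1900.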