Evidently this instruction, despite the intrinsic taking a register operand,
is a memory-register instruction, and there seems to be no alignment requirement
for the source operand. Because of this, when compiling without optimizations,
compilers perform the unaligned load and then spill it back to the stack before
doing the broadcasting load. While doing so, MSVC appears to spill to the stack
with an aligned move at an unaligned address, causing a segfault. GCC does not
make this mistake, as it stashes to an aligned address.
If we're on Visual Studio 2015, let's just use the longer, 9-cycle sequence of a
128-bit load followed by a vinserti128. This _should_ fix this (issue #1861).
}
static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
+ /* See explanation in chunkset_avx512.c */
+#if defined(_MSC_VER) && _MSC_VER <= 1900
+ halfchunk_t half = _mm_loadu_si128((__m128i*)from);
+ *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
+#else
*chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
+#endif
}
static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {
}
static inline void chunkmemset_16(uint8_t *from, chunk_t *chunk) {
+    /* Unfortunately there seems to be a compiler bug in Visual Studio 2015
+     * where the load is dumped to the stack with an aligned move for this
+     * memory-register broadcast. The vbroadcasti128 instruction is 2 cycles
+     * cheaper, and the dump to the stack doesn't happen when compiling with
+     * optimizations. For the sake of working properly in a debugger, let's
+     * take the 2 cycle penalty */
+#if defined(_MSC_VER) && _MSC_VER <= 1900
+ halfchunk_t half = _mm_loadu_si128((__m128i*)from);
+ *chunk = _mm256_inserti128_si256(_mm256_castsi128_si256(half), half, 1);
+#else
*chunk = _mm256_broadcastsi128_si256(_mm_loadu_si128((__m128i*)from));
+#endif
}
static inline void loadchunk(uint8_t const *s, chunk_t *chunk) {