From: Michael Weiser <michael.weiser@gmx.de>
Date: Tue, 13 Feb 2018 21:13:14 +0000 (+0100)
Subject: Document arm endianness considerations
X-Git-Tag: nettle_3.5rc1~74
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=70135c70863eedfd9b300614f4a5535b8b93066c;p=thirdparty%2Fnettle.git

Document arm endianness considerations

Extend arm/README to provide some background on considerations to be taken into
account when writing assembly routines supposed to work in big and little memory
endianness.
---

diff --git a/arm/README b/arm/README
index 9bacd97b..1ba54e0d 100644
--- a/arm/README
+++ b/arm/README
@@ -44,4 +44,71 @@ q12 (d24, d25)	Y
 q13 (d26, d27)	Y
 q14 (d28, d29)	Y
 q15 (d30, d31)	Y
-		    
+
+Endianness
+
+ARM supports big- and little-endian memory access modes. The representation of
+values in registers stays the same, but loads and stores swap bytes. This has
+to be taken into account in various cases.
+
+Two m4 macros are provided to handle these special cases in assembly source:
+IF_LE(<if-true>,<if-false>)
+IF_BE(<if-true>,<if-false>)
+expand to <if-true> if the target system is little-endian or big-endian,
+respectively. Otherwise they expand to <if-false>.
+
+1. ldr/str
+
+Loading and storing 32-bit words will reverse the words' bytes in little-endian
+mode. If the handled data is actually a byte sequence or data in network byte
+order (big-endian), the loaded word needs to be reversed after load to get it
+back into correct sequence. See the LOAD macro in arm/v6/sha1-compress.asm.
+
+2. shifts
+
+If data is to be processed with bit operations only, endianness can be ignored
+because byte-swapping on load and store will cancel each other out. Shift
+directions, however, have to be inverted. See arm/memxor.asm for an example.
+
+3. vld1.8
+
+NEON's vld instruction can be used to produce endianness-neutral code. vld1.8
+will load a byte sequence into a register regardless of memory endianness. This
+can be used to process byte sequences. See arm/neon/umac-nh.asm for an example.
+
+4. vldm/vstm
+
+Care has to be taken when using vldm/vstm because they have two non-obvious
+characteristics:
+
+a. vldm/vstm do normal byte-swapping on each value they load. When loading into
+   d (doubleword) registers, this means that bytes, halfwords and words of the
+   doubleword get swapped. When the data loaded actually represents e.g.
+   vectors of 32-bit words this will swap columns.
+b. vldm/vstm on q (quadword) registers get translated into vldm/vstm on the
+   equivalent number of d (doubleword) registers. Instead of a 128-bit load it
+   does two 64-bit loads. When again handling vectors of 32-bit words this will
+   still swap adjacent columns but will not reverse all four columns.
+
+memory adr0: w0 w1 w2 w3
+register q0: w1 w0 w3 w2
+
+See arm/neon/chacha-core-internal.asm for an example.
+
+5. simple byte store
+
+Sometimes it is necessary to store remaining single bytes to memory. A simple
+approach stores the lowest byte of a register, then shifts right and starts
+over until all bytes are stored. Since this constitutes a
+least-significant-byte-first store, the data to be stored needs to be reversed
+first on a big-endian system. See arm/memxor.asm Lmemxor_leftover for an
+example.
+
+6. Function parameters/return values
+
+AAPCS requires 64-bit parameters to be passed to and returned from functions
+"in two consecutive registers [...] as if the value had been loaded from memory
+representation with a single LDM instruction." Since loading a big-endian
+doubleword using ldm transposes its words, the same has to be done when e.g.
+returning a 64-bit value from an assembler routine. See arm/neon/umac-nh.asm
+for an example.
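
Note on item 6 (not part of the patch): the register assignment described
there can be modelled in portable C. The sketch below is illustrative only.
The helper names store64, load32, aapcs_pair and check_aapcs_pair are
hypothetical and do not exist in nettle, and the code simulates memory
endianness with an explicit flag instead of running on a big-endian ARM
system.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write v to buf as a 64-bit value in the given
   memory endianness (big_endian != 0 means most significant byte first). */
static void store64(uint8_t buf[8], uint64_t v, int big_endian)
{
  for (int i = 0; i < 8; i++) {
    int shift = big_endian ? 8 * (7 - i) : 8 * i;
    buf[i] = (uint8_t)(v >> shift);
  }
}

/* Hypothetical helper: model a 32-bit word load in the given endianness. */
static uint32_t load32(const uint8_t *p, int big_endian)
{
  if (big_endian)
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
  return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16)
       | ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Model "as if loaded with a single LDM": r0 receives the word at the
   lower address, r1 the word at the next higher address. */
static void aapcs_pair(uint64_t v, int big_endian,
                       uint32_t *r0, uint32_t *r1)
{
  uint8_t mem[8];
  store64(mem, v, big_endian);
  *r0 = load32(mem, big_endian);
  *r1 = load32(mem + 4, big_endian);
}

/* Usage example: the same 64-bit value lands in r0/r1 with its words
   transposed between the two modes. */
static void check_aapcs_pair(void)
{
  uint32_t r0, r1;

  aapcs_pair(0x0011223344556677ULL, 0, &r0, &r1);  /* little-endian */
  assert(r0 == 0x44556677u && r1 == 0x00112233u);  /* low word in r0 */

  aapcs_pair(0x0011223344556677ULL, 1, &r0, &r1);  /* big-endian */
  assert(r0 == 0x00112233u && r1 == 0x44556677u);  /* high word in r0 */
}
```

In this model the low word ends up in r0 for little-endian and in r1 for
big-endian, which is why an assembly routine that computes its result as
low/high words has to swap the two registers before returning on a
big-endian system.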