Commit | Line | Data |
---|---|---|
6aa36e8e RS |
1 | #! /usr/bin/env perl |
2 | # Copyright 2010-2016 The OpenSSL Project Authors. All Rights Reserved. | |
3 | # | |
81cae8ce | 4 | # Licensed under the Apache License 2.0 (the "License"). You may not use |
6aa36e8e RS |
5 | # this file except in compliance with the License. You can obtain a copy |
6 | # in the file LICENSE in the source distribution or at | |
7 | # https://www.openssl.org/source/license.html | |
8 | ||
e3a510f8 AP |
9 | # |
10 | # ==================================================================== | |
11 | # Written by Andy Polyakov <appro@openssl.org> for the OpenSSL | |
12 | # project. The module is, however, dual licensed under OpenSSL and | |
13 | # CRYPTOGAMS licenses depending on where you obtain it. For further | |
14 | # details see http://www.openssl.org/~appro/cryptogams/. | |
15 | # ==================================================================== | |
16 | # | |
8525950e | 17 | # March, May, June 2010 |
480cd6ab | 18 | # |
c3473126 AP |
19 | # The module implements "4-bit" GCM GHASH function and underlying |
20 | # single multiplication operation in GF(2^128). "4-bit" means that it | |
21 | # uses 256 bytes per-key table [+64/128 bytes fixed table]. It has two | |
5c88dcca AP |
22 | # code paths: vanilla x86 and vanilla SSE. Former will be executed on |
23 | # 486 and Pentium, latter on all others. SSE GHASH features so called | |
8525950e AP |
24 | # "528B" variant of "4-bit" method utilizing additional 256+16 bytes |
25 | # of per-key storage [+512 bytes shared table]. Performance results | |
26 | # are for streamed GHASH subroutine and are expressed in cycles per | |
27 | # processed byte, less is better: | |
e3a510f8 | 28 | # |
5c88dcca | 29 | #               gcc 2.95.3(*)   SSE assembler   x86 assembler |
e3a510f8 | 30 | # |
d52d5ad1 AP |
31 | # Pentium       105/111(**)     -               50 |
32 | # PIII          68 /75          12.2            24 | |
33 | # P4            125/125         17.8            84(***) | |
34 | # Opteron       66 /70          10.1            30 | |
35 | # Core2         54 /67          8.4             18 | |
d2e18031 AP |
36 | # Atom          105/105         16.8            53 |
37 | # VIA Nano      69 /71          13.0            27 | |
e3a510f8 AP |
38 | # |
39 | # (*) gcc 3.4.x was observed to generate few percent slower code, | |
480cd6ab | 40 | # which is one of reasons why 2.95.3 results were chosen, |
e3a510f8 | 41 | # another reason is lack of 3.4.x results for older CPUs; |
5c88dcca | 42 | # comparison with SSE results is not completely fair, because C |
d52d5ad1 AP |
43 | # results are for vanilla "256B" implementation, while |
44 | # assembler results are for "528B";-) | |
e3a510f8 AP |
45 | # (**) second number is result for code compiled with -fPIC flag, |
46 | # which is actually more relevant, because assembler code is | |
47 | # position-independent; | |
48 | # (***) see comment in non-MMX routine for further details; | |
49 | # | |
8525950e | 50 | # To summarize, it's >2-5 times faster than gcc-generated code. To |
480cd6ab | 51 | # anchor it to something else SHA1 assembler processes one byte in |
5c88dcca AP |
52 | # ~7 cycles on contemporary x86 cores. As for choice of MMX/SSE |
53 | # in particular, see comment at the end of the file... | |
e3a510f8 | 54 | |
c1f092d1 AP |
55 | # May 2010 |
56 | # | |
d52d5ad1 | 57 | # Add PCLMULQDQ version performing at 2.10 cycles per processed byte. |
c1f092d1 AP |
58 | # The question is how close is it to theoretical limit? The pclmulqdq |
59 | # instruction latency appears to be 14 cycles and there can't be more | |
60 | # than 2 of them executing at any given time. This means that single | |
61 | # Karatsuba multiplication would take 28 cycles *plus* few cycles for | |
62 | # pre- and post-processing. Then multiplication has to be followed by | |
63 | # modulo-reduction. Given that aggregated reduction method [see | |
64 | # "Carry-less Multiplication and Its Usage for Computing the GCM Mode" | |
65 | # white paper by Intel] allows you to perform reduction only once in | |
66 | # a while we can assume that asymptotic performance can be estimated | |
67 | # as (28+Tmod/Naggr)/16, where Tmod is time to perform reduction | |
68 | # and Naggr is the aggregation factor. | |
69 | # | |
70 | # Before we proceed to this implementation let's have closer look at | |
71 | # the best-performing code suggested by Intel in their white paper. | |
72 | # By tracing inter-register dependencies Tmod is estimated as ~19 | |
d52d5ad1 AP |
73 | # cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per |
74 | # processed byte. As implied, this is quite optimistic estimate, | |
75 | # because it does not account for Karatsuba pre- and post-processing, | |
76 | # which for a single multiplication is ~5 cycles. Unfortunately Intel | |
77 | # does not provide performance data for GHASH alone. But benchmarking | |
78 | # AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt | |
79 | # alone resulted in 2.46 cycles per byte out of 16KB buffer. Note that | |
80 | # the result accounts even for pre-computing of degrees of the hash | |
81 | # key H, but its portion is negligible at 16KB buffer size. | |
c1f092d1 AP |
82 | # |
83 | # Moving on to the implementation in question. Tmod is estimated as | |
84 | # ~13 cycles and Naggr is 2, giving asymptotic performance of ... | |
85 | # 2.16. How is it possible that measured performance is better than | |
86 | # optimistic theoretical estimate? There is one thing Intel failed | |
d52d5ad1 AP |
87 | # to recognize. By serializing GHASH with CTR in same subroutine |
88 | # former's performance is really limited to above (Tmul + Tmod/Naggr) | |
89 | # equation. But if GHASH procedure is detached, the modulo-reduction | |
90 | # can be interleaved with Naggr-1 multiplications at instruction level | |
91 | # and under ideal conditions even disappear from the equation. So that | |
92 | # optimistic theoretical estimate for this implementation is ... | |
93 | # 28/16=1.75, and not 2.16. Well, it's probably way too optimistic, | |
94 | # at least for such small Naggr. I'd argue that (28+Tproc/Naggr)/16, | |
95 | # where Tproc is time required for Karatsuba pre- and post-processing, | |
96 | # is more realistic estimate. In this case it gives ... 1.91 cycles. | |
97 | # Or in other words, depending on how well we can interleave reduction | |
60250017 | 98 | # and one of the two multiplications the performance should be between |
d52d5ad1 AP |
99 | # 1.91 and 2.16. As already mentioned, this implementation processes |
100 | # one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart | |
101 | # - in 2.02. x86_64 performance is better, because larger register | |
102 | # bank allows to interleave reduction and multiplication better. | |
c1f092d1 AP |
103 | # |
104 | # Does it make sense to increase Naggr? To start with it's virtually | |
105 | # impossible in 32-bit mode, because of limited register bank | |
46f4e1be | 106 | # capacity. Otherwise improvement has to be weighed against slower |
c1f092d1 AP |
107 | # setup, as well as code size and complexity increase. As even |
108 | # optimistic estimate doesn't promise 30% performance improvement, | |
109 | # there are currently no plans to increase Naggr. | |
1aa8a629 | 110 | # |
e3713c36 RS |
111 | # Special thanks to David Woodhouse for providing access to a |
112 | # Westmere-based system on behalf of Intel Open Source Technology Centre. | |
c1f092d1 | 113 | |
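The cycles-per-byte figures quoted in the May 2010 note can be reproduced with a few lines of arithmetic. The sketch below is illustrative and not part of the module: `cycles_per_byte` is a hypothetical helper, the inputs (28 cycles per Karatsuba multiplication, Tmod of ~19 and ~13, Tproc of ~5) are taken from the comment, and the division by 16 bytes per block is assumed for the Tproc case too, since that is what the quoted 1.91 implies.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of ghash-x86.pl): asymptotic cycles per
# byte for one 16-byte block, given the multiplication time, an overhead
# that is paid once per aggregate, and the aggregation factor Naggr.
sub cycles_per_byte {
    my ($tmul, $toverhead, $naggr) = @_;
    return ($tmul + $toverhead / $naggr) / 16;
}

my $tmul = 28;    # single Karatsuba multiplication, per the comment

my @cases = (
    [ "Intel white paper (Tmod~19, Naggr=4)", 19, 4 ],
    [ "this module       (Tmod~13, Naggr=2)", 13, 2 ],
    [ "reduction fully interleaved         ",  0, 2 ],
    [ "Karatsuba overhead (Tproc~5, Naggr=2)", 5, 2 ],
);

for my $case (@cases) {
    my ($label, $toverhead, $naggr) = @$case;
    printf "%s  %.2f\n", $label, cycles_per_byte($tmul, $toverhead, $naggr);
}
```

The four printed values correspond to the 2.05, 2.16, 1.75 and 1.91 estimates discussed in the comment.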
bc5b136c AP |
114 | # January 2010 |
115 | # | |
116 | # Tweaked to optimize transitions between integer and FP operations | |
117 | # on same XMM register, PCLMULQDQ subroutine was measured to process | |
118 | # one byte in 2.07 cycles on Sandy Bridge, and in 2.12 - on Westmere. | |
119 | # The minor regression on Westmere is outweighed by ~15% improvement | |
120 | # on Sandy Bridge. Strangely enough attempt to modify 64-bit code in | |
121 | # similar manner resulted in almost 20% degradation on Sandy Bridge, | |
122 | # where original 64-bit code processes one byte in 1.95 cycles. | |
123 | ||
d2e18031 AP |
124 | ##################################################################### |
125 | # For reference, AMD Bulldozer processes one byte in 1.98 cycles in | |
126 | # 32-bit mode and 1.89 in 64-bit. | |
127 | ||
273a8081 AP |
128 | # February 2013 |
129 | # | |
130 | # Overhaul: aggregate Karatsuba post-processing, improve ILP in | |
131 | # reduction_alg9. Resulting performance is 1.96 cycles per byte on | |
132 | # Westmere, 1.95 - on Sandy/Ivy Bridge, 1.76 - on Bulldozer. | |
133 | ||
e3a510f8 AP |
134 | $0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1; |
135 | push(@INC,"${dir}","${dir}../../perlasm"); | |
136 | require "x86asm.pl"; | |
137 | ||
1aa89a7a | 138 | $output=pop and open STDOUT,">$output"; |
4f0d5f18 | 139 | |
e195c8a2 | 140 | &asm_init($ARGV[0],$x86only = $ARGV[$#ARGV] eq "386"); |
e3a510f8 | 141 | |
c1f092d1 AP |
142 | $sse2=0; |
143 | for (@ARGV) { $sse2=1 if (/-DOPENSSL_IA32_SSE2/); } | |
e3a510f8 | 144 | |
c1f092d1 | 145 | ($Zhh,$Zhl,$Zlh,$Zll) = ("ebp","edx","ecx","ebx"); |
e3a510f8 AP |
146 | $inp = "edi"; |
147 | $Htbl = "esi"; | |
c1f092d1 | 148 | \f |
e3a510f8 AP |
149 | $unroll = 0; # Affects x86 loop. Folded loop performs ~7% worse |
150 | # than unrolled, which has to be weighed against | |
c1f092d1 | 151 | # 2.5x x86-specific code size reduction. |
e3a510f8 AP |
152 | |
153 | sub x86_loop { | |
154 | my $off = shift; | |
155 | my $rem = "eax"; | |
156 | ||
157 | &mov ($Zhh,&DWP(4,$Htbl,$Zll)); | |
158 | &mov ($Zhl,&DWP(0,$Htbl,$Zll)); | |
159 | &mov ($Zlh,&DWP(12,$Htbl,$Zll)); | |
160 | &mov ($Zll,&DWP(8,$Htbl,$Zll)); | |
161 | &xor ($rem,$rem); # avoid partial register stalls on PIII | |
162 | ||
163 | # shrd practically kills P4, 2.5x deterioration, but P4 has | |
164 | # MMX code-path to execute. shrd runs tad faster [than twice | |
165 | # the shifts, move's and or's] on pre-MMX Pentium (as well as | |
166 | # PIII and Core2), *but* minimizes code size, spares register | |
167 | # and thus allows to fold the loop... | |
168 | if (!$unroll) { | |
169 | my $cnt = $inp; | |
170 | &mov ($cnt,15); | |
171 | &jmp (&label("x86_loop")); | |
172 | &set_label("x86_loop",16); | |
173 | for($i=1;$i<=2;$i++) { | |
174 | &mov (&LB($rem),&LB($Zll)); | |
175 | &shrd ($Zll,$Zlh,4); | |
176 | &and (&LB($rem),0xf); | |
177 | &shrd ($Zlh,$Zhl,4); | |
178 | &shrd ($Zhl,$Zhh,4); | |
179 | &shr ($Zhh,4); | |
180 | &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); | |
181 | ||
182 | &mov (&LB($rem),&BP($off,"esp",$cnt)); | |
183 | if ($i&1) { | |
184 | &and (&LB($rem),0xf0); | |
185 | } else { | |
186 | &shl (&LB($rem),4); | |
187 | } | |
188 | ||
189 | &xor ($Zll,&DWP(8,$Htbl,$rem)); | |
190 | &xor ($Zlh,&DWP(12,$Htbl,$rem)); | |
191 | &xor ($Zhl,&DWP(0,$Htbl,$rem)); | |
192 | &xor ($Zhh,&DWP(4,$Htbl,$rem)); | |
193 | ||
194 | if ($i&1) { | |
195 | &dec ($cnt); | |
196 | &js (&label("x86_break")); | |
197 | } else { | |
198 | &jmp (&label("x86_loop")); | |
199 | } | |
200 | } | |
201 | &set_label("x86_break",16); | |
202 | } else { | |
203 | for($i=1;$i<32;$i++) { | |
204 | &comment($i); | |
205 | &mov (&LB($rem),&LB($Zll)); | |
206 | &shrd ($Zll,$Zlh,4); | |
207 | &and (&LB($rem),0xf); | |
208 | &shrd ($Zlh,$Zhl,4); | |
209 | &shrd ($Zhl,$Zhh,4); | |
210 | &shr ($Zhh,4); | |
211 | &xor ($Zhh,&DWP($off+16,"esp",$rem,4)); | |
212 | ||
213 | if ($i&1) { | |
214 | &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); | |
215 | &and (&LB($rem),0xf0); | |
216 | } else { | |
217 | &mov (&LB($rem),&BP($off+15-($i>>1),"esp")); | |
218 | &shl (&LB($rem),4); | |
219 | } | |
220 | ||
221 | &xor ($Zll,&DWP(8,$Htbl,$rem)); | |
222 | &xor ($Zlh,&DWP(12,$Htbl,$rem)); | |
223 | &xor ($Zhl,&DWP(0,$Htbl,$rem)); | |
224 | &xor ($Zhh,&DWP(4,$Htbl,$rem)); | |
225 | } | |
226 | } | |
227 | &bswap ($Zll); | |
228 | &bswap ($Zlh); | |
229 | &bswap ($Zhl); | |
230 | if (!$x86only) { | |
231 | &bswap ($Zhh); | |
232 | } else { | |
233 | &mov ("eax",$Zhh); | |
234 | &bswap ("eax"); | |
235 | &mov ($Zhh,"eax"); | |
236 | } | |
237 | } | |
238 | ||
239 | if ($unroll) { | |
240 | &function_begin_B("_x86_gmult_4bit_inner"); | |
241 | &x86_loop(4); | |
242 | &ret (); | |
243 | &function_end_B("_x86_gmult_4bit_inner"); | |
244 | } | |
245 | ||
c1f092d1 AP |
246 | sub deposit_rem_4bit { |
247 | my $bias = shift; | |
e3a510f8 | 248 | |
c1f092d1 AP |
249 | &mov (&DWP($bias+0, "esp"),0x0000<<16); |
250 | &mov (&DWP($bias+4, "esp"),0x1C20<<16); | |
251 | &mov (&DWP($bias+8, "esp"),0x3840<<16); | |
252 | &mov (&DWP($bias+12,"esp"),0x2460<<16); | |
253 | &mov (&DWP($bias+16,"esp"),0x7080<<16); | |
254 | &mov (&DWP($bias+20,"esp"),0x6CA0<<16); | |
255 | &mov (&DWP($bias+24,"esp"),0x48C0<<16); | |
256 | &mov (&DWP($bias+28,"esp"),0x54E0<<16); | |
257 | &mov (&DWP($bias+32,"esp"),0xE100<<16); | |
258 | &mov (&DWP($bias+36,"esp"),0xFD20<<16); | |
259 | &mov (&DWP($bias+40,"esp"),0xD940<<16); | |
260 | &mov (&DWP($bias+44,"esp"),0xC560<<16); | |
261 | &mov (&DWP($bias+48,"esp"),0x9180<<16); | |
262 | &mov (&DWP($bias+52,"esp"),0x8DA0<<16); | |
263 | &mov (&DWP($bias+56,"esp"),0xA9C0<<16); | |
264 | &mov (&DWP($bias+60,"esp"),0xB5E0<<16); | |
265 | } | |
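The sixteen constants stored by `deposit_rem_4bit` follow a regular pattern: entry `i` is the XOR of `0x1C20 << k` over every set bit `k` of `i`. The sketch below checks that observation against the literal values above; `rem_4bit_entry` is a hypothetical helper introduced only for illustration, and the module itself simply hard-codes the table.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Constants copied from deposit_rem_4bit, before the <<16 shift.
my @expected = (
    0x0000, 0x1C20, 0x3840, 0x2460, 0x7080, 0x6CA0, 0x48C0, 0x54E0,
    0xE100, 0xFD20, 0xD940, 0xC560, 0x9180, 0x8DA0, 0xA9C0, 0xB5E0,
);

# Hypothetical helper: XOR together 0x1C20<<k for every set bit k of $i.
sub rem_4bit_entry {
    my ($i) = @_;
    my $r = 0;
    for my $k (0 .. 3) {
        $r ^= 0x1C20 << $k if ($i >> $k) & 1;
    }
    return $r;
}

my @derived = map { rem_4bit_entry($_) } 0 .. 15;

for my $i (0 .. 15) {
    printf "rem_4bit[%2d] = 0x%04X %s\n", $i, $derived[$i],
        $derived[$i] == $expected[$i] ? "ok" : "MISMATCH";
}
```

Note that `0x1C20 << 3` is `0xE100`, whose top byte is the `0xE1` constant of the GCM reduction polynomial.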
266 | \f | |
267 | $suffix = $x86only ? "" : "_x86"; | |
e3a510f8 | 268 | |
c1f092d1 | 269 | &function_begin("gcm_gmult_4bit".$suffix); |
e3a510f8 AP |
270 | &stack_push(16+4+1); # +1 for stack alignment |
271 | &mov ($inp,&wparam(0)); # load Xi | |
272 | &mov ($Htbl,&wparam(1)); # load Htable | |
273 | ||
274 | &mov ($Zhh,&DWP(0,$inp)); # load Xi[16] | |
275 | &mov ($Zhl,&DWP(4,$inp)); | |
276 | &mov ($Zlh,&DWP(8,$inp)); | |
277 | &mov ($Zll,&DWP(12,$inp)); | |
278 | ||
279 | &deposit_rem_4bit(16); | |
280 | ||
281 | &mov (&DWP(0,"esp"),$Zhh); # copy Xi[16] on stack | |
282 | &mov (&DWP(4,"esp"),$Zhl); | |
283 | &mov (&DWP(8,"esp"),$Zlh); | |
284 | &mov (&DWP(12,"esp"),$Zll); | |
285 | &shr ($Zll,20); | |
286 | &and ($Zll,0xf0); | |
287 | ||
288 | if ($unroll) { | |
289 | &call ("_x86_gmult_4bit_inner"); | |
290 | } else { | |
291 | &x86_loop(0); | |
292 | &mov ($inp,&wparam(0)); | |
293 | } | |
294 | ||
295 | &mov (&DWP(12,$inp),$Zll); | |
296 | &mov (&DWP(8,$inp),$Zlh); | |
297 | &mov (&DWP(4,$inp),$Zhl); | |
298 | &mov (&DWP(0,$inp),$Zhh); | |
299 | &stack_pop(16+4+1); | |
c1f092d1 AP |
300 | &function_end("gcm_gmult_4bit".$suffix); |
301 | ||
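The `shr $Zll,20` / `and $Zll,0xf0` pair above turns the dword loaded from `Xi[12..15]` into a byte offset into `Htable`: the low nibble of `Xi[15]` multiplied by the 16-byte entry size. Inside `x86_loop` the remaining nibbles are formed with `and 0xf0` (high nibble) and `shl 4` on the low byte (low nibble). A minimal model of that index arithmetic, using hypothetical helper names and assuming a little-endian load as on x86:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Model of "mov Zll,[Xi+12]; shr Zll,20; and Zll,0xf0".
# Arguments are the bytes Xi[12], Xi[13], Xi[14], Xi[15].
sub first_table_offset {
    my (@xi_tail) = @_;
    my $dword = unpack("V", pack("C4", @xi_tail));   # little-endian load
    return ($dword >> 20) & 0xf0;
}

# Model of the two per-byte forms used inside x86_loop.
sub high_nibble_offset { my ($b) = @_; return $b & 0xf0; }           # and al,0xf0
sub low_nibble_offset  { my ($b) = @_; return ($b << 4) & 0xff; }    # shl al,4

for my $b15 (0x00, 0x5a, 0xff) {
    my $off = first_table_offset(0x11, 0x22, 0x33, $b15);
    printf "Xi[15]=0x%02x -> offset %3d (Htable entry %2d)\n",
        $b15, $off, $off >> 4;
}
```

With one nibble consumed before the loop, two per pass over bytes 15 down to 1, and the final pass leaving through `x86_break`, all 32 nibbles of `Xi` are covered.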
302 | &function_begin("gcm_ghash_4bit".$suffix); | |
303 | &stack_push(16+4+1); # +1 for 64-bit alignment | |
304 | &mov ($Zll,&wparam(0)); # load Xi | |
305 | &mov ($Htbl,&wparam(1)); # load Htable | |
306 | &mov ($inp,&wparam(2)); # load in | |
307 | &mov ("ecx",&wparam(3)); # load len | |
308 | &add ("ecx",$inp); | |
309 | &mov (&wparam(3),"ecx"); | |
310 | ||
311 | &mov ($Zhh,&DWP(0,$Zll)); # load Xi[16] | |
312 | &mov ($Zhl,&DWP(4,$Zll)); | |
313 | &mov ($Zlh,&DWP(8,$Zll)); | |
314 | &mov ($Zll,&DWP(12,$Zll)); | |
315 | ||
316 | &deposit_rem_4bit(16); | |
317 | ||
318 | &set_label("x86_outer_loop",16); | |
319 | &xor ($Zll,&DWP(12,$inp)); # xor with input | |
320 | &xor ($Zlh,&DWP(8,$inp)); | |
321 | &xor ($Zhl,&DWP(4,$inp)); | |
322 | &xor ($Zhh,&DWP(0,$inp)); | |
323 | &mov (&DWP(12,"esp"),$Zll); # dump it on stack | |
324 | &mov (&DWP(8,"esp"),$Zlh); | |
325 | &mov (&DWP(4,"esp"),$Zhl); | |
326 | &mov (&DWP(0,"esp"),$Zhh); | |
327 | ||
328 | &shr ($Zll,20); | |
329 | &and ($Zll,0xf0); | |
330 | ||
331 | if ($unroll) { | |
332 | &call ("_x86_gmult_4bit_inner"); | |
333 | } else { | |
334 | &x86_loop(0); | |
335 | &mov ($inp,&wparam(2)); | |
336 | } | |
337 | &lea ($inp,&DWP(16,$inp)); | |
338 | &cmp ($inp,&wparam(3)); | |
339 | &mov (&wparam(2),$inp) if (!$unroll); | |
340 | &jb (&label("x86_outer_loop")); | |
341 | ||
342 | &mov ($inp,&wparam(0)); # load Xi | |
343 | &mov (&DWP(12,$inp),$Zll); | |
344 | &mov (&DWP(8,$inp),$Zlh); | |
345 | &mov (&DWP(4,$inp),$Zhl); | |
346 | &mov (&DWP(0,$inp),$Zhh); | |
347 | &stack_pop(16+4+1); | |
348 | &function_end("gcm_ghash_4bit".$suffix); | |
349 | \f | |
350 | if (!$x86only) {{{ | |
351 | ||
352 | &static_label("rem_4bit"); | |
353 | ||
98909c1d | 354 | if (!$sse2) {{ # pure-MMX "May" version... |
8525950e AP |
355 | |
356 | $S=12; # shift factor for rem_4bit | |
357 | ||
07e29c12 AP |
358 | &function_begin_B("_mmx_gmult_4bit_inner"); |
359 | # MMX version performs 3.5 times better on P4 (see comment in non-MMX | |
360 | # routine for further details), 100% better on Opteron, ~70% better | |
361 | # on Core2 and PIII... In other words effort is considered to be well | |
362 | # spent... Since initial release the loop was unrolled in order to | |
363 | # "liberate" register previously used as loop counter. Instead it's | |
364 | # used to optimize critical path in 'Z.hi ^= rem_4bit[Z.lo&0xf]'. | |
365 | # The path involves move of Z.lo from MMX to integer register, | |
366 | # effective address calculation and finally merge of value to Z.hi. | |
367 | # Reference to rem_4bit is scheduled so late that I had to >>4 | |
368 | # rem_4bit elements. This resulted in 20-45% improvement | |
053fa39a | 369 | # on contemporary µ-archs. |
07e29c12 AP |
370 | { |
371 | my $cnt; | |
372 | my $rem_4bit = "eax"; | |
373 | my @rem = ($Zhh,$Zll); | |
c1f092d1 AP |
374 | my $nhi = $Zhl; |
375 | my $nlo = $Zlh; | |
c1f092d1 AP |
376 | |
377 | my ($Zlo,$Zhi) = ("mm0","mm1"); | |
378 | my $tmp = "mm2"; | |
379 | ||
380 | &xor ($nlo,$nlo); # avoid partial register stalls on PIII | |
381 | &mov ($nhi,$Zll); | |
382 | &mov (&LB($nlo),&LB($nhi)); | |
c1f092d1 AP |
383 | &shl (&LB($nlo),4); |
384 | &and ($nhi,0xf0); | |
385 | &movq ($Zlo,&QWP(8,$Htbl,$nlo)); | |
386 | &movq ($Zhi,&QWP(0,$Htbl,$nlo)); | |
07e29c12 AP |
387 | &movd ($rem[0],$Zlo); |
388 | ||
389 | for ($cnt=28;$cnt>=-2;$cnt--) { | |
390 | my $odd = $cnt&1; | |
391 | my $nix = $odd ? $nlo : $nhi; | |
392 | ||
393 | &shl (&LB($nlo),4) if ($odd); | |
394 | &psrlq ($Zlo,4); | |
395 | &movq ($tmp,$Zhi); | |
396 | &psrlq ($Zhi,4); | |
397 | &pxor ($Zlo,&QWP(8,$Htbl,$nix)); | |
398 | &mov (&LB($nlo),&BP($cnt/2,$inp)) if (!$odd && $cnt>=0); | |
399 | &psllq ($tmp,60); | |
400 | &and ($nhi,0xf0) if ($odd); | |
401 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem[1],8)) if ($cnt<28); | |
402 | &and ($rem[0],0xf); | |
403 | &pxor ($Zhi,&QWP(0,$Htbl,$nix)); | |
404 | &mov ($nhi,$nlo) if (!$odd && $cnt>=0); | |
405 | &movd ($rem[1],$Zlo); | |
406 | &pxor ($Zlo,$tmp); | |
407 | ||
408 | push (@rem,shift(@rem)); # "rotate" registers | |
409 | } | |
c1f092d1 | 410 | |
07e29c12 | 411 | &mov ($inp,&DWP(4,$rem_4bit,$rem[1],8)); # last rem_4bit[rem] |
c1f092d1 AP |
412 | |
413 | &psrlq ($Zlo,32); # lower part of Zlo is already there | |
414 | &movd ($Zhl,$Zhi); | |
415 | &psrlq ($Zhi,32); | |
416 | &movd ($Zlh,$Zlo); | |
417 | &movd ($Zhh,$Zhi); | |
07e29c12 | 418 | &shl ($inp,4); # compensate for rem_4bit[i] being >>4 |
c1f092d1 AP |
419 | |
420 | &bswap ($Zll); | |
421 | &bswap ($Zhl); | |
422 | &bswap ($Zlh); | |
07e29c12 | 423 | &xor ($Zhh,$inp); |
c1f092d1 | 424 | &bswap ($Zhh); |
07e29c12 AP |
425 | |
426 | &ret (); | |
c1f092d1 | 427 | } |
07e29c12 | 428 | &function_end_B("_mmx_gmult_4bit_inner"); |
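The `push (@rem,shift(@rem))` line in the unrolled loop above is a generator-time trick: no instruction is emitted, the Perl script merely swaps which register name each role maps to in the next iteration. The "528B" routine later does the same with `unshift(@x,pop(@x))`, which rotates in the opposite direction. A small standalone sketch of both idioms, with register names matching `$Zhh`="ebp" and `$Zll`="ebx" from the top of the file:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Two-register rotation, as in _mmx_gmult_4bit_inner.
my @rem = ("ebp", "ebx");
my @trace2;
for my $step (0 .. 3) {
    push @trace2, "$rem[0]/$rem[1]";
    push(@rem, shift(@rem));        # first element moves to the end
}
print "push/shift:  @trace2\n";

# Three-register rotation, as used for @red in the "528B" loop.
my @red = ("mm0", "mm1", "mm2");
my @trace3;
for my $step (0 .. 3) {
    push @trace3, $red[0];
    unshift(@red, pop(@red));       # last element moves to the front
}
print "unshift/pop: @trace3\n";
```

Because the rotation happens while the assembly text is being generated, the emitted code pays nothing for it; the price is that the loop has to be unrolled.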
c1f092d1 AP |
429 | |
430 | &function_begin("gcm_gmult_4bit_mmx"); | |
431 | &mov ($inp,&wparam(0)); # load Xi | |
432 | &mov ($Htbl,&wparam(1)); # load Htable | |
e3a510f8 | 433 | |
e3a510f8 AP |
434 | &call (&label("pic_point")); |
435 | &set_label("pic_point"); | |
436 | &blindpop("eax"); | |
e3a510f8 AP |
437 | &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); |
438 | ||
c1f092d1 AP |
439 | &movz ($Zll,&BP(15,$inp)); |
440 | ||
07e29c12 | 441 | &call ("_mmx_gmult_4bit_inner"); |
c1f092d1 | 442 | |
07e29c12 | 443 | &mov ($inp,&wparam(0)); # load Xi |
c1f092d1 AP |
444 | &emms (); |
445 | &mov (&DWP(12,$inp),$Zll); | |
446 | &mov (&DWP(4,$inp),$Zhl); | |
447 | &mov (&DWP(8,$inp),$Zlh); | |
448 | &mov (&DWP(0,$inp),$Zhh); | |
449 | &function_end("gcm_gmult_4bit_mmx"); | |
450 | \f | |
451 | # Streamed version performs 20% better on P4, 7% on Opteron, | |
452 | # 10% on Core2 and PIII... | |
453 | &function_begin("gcm_ghash_4bit_mmx"); | |
4f39edbf AP |
454 | &mov ($Zhh,&wparam(0)); # load Xi |
455 | &mov ($Htbl,&wparam(1)); # load Htable | |
456 | &mov ($inp,&wparam(2)); # load in | |
457 | &mov ($Zlh,&wparam(3)); # load len | |
c1f092d1 AP |
458 | |
459 | &call (&label("pic_point")); | |
460 | &set_label("pic_point"); | |
461 | &blindpop("eax"); | |
462 | &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); | |
463 | ||
e3a510f8 | 464 | &add ($Zlh,$inp); |
4f39edbf | 465 | &mov (&wparam(3),$Zlh); # len to point at the end of input |
e3a510f8 | 466 | &stack_push(4+1); # +1 for stack alignment |
c1f092d1 | 467 | |
e3a510f8 AP |
468 | &mov ($Zll,&DWP(12,$Zhh)); # load Xi[16] |
469 | &mov ($Zhl,&DWP(4,$Zhh)); | |
470 | &mov ($Zlh,&DWP(8,$Zhh)); | |
471 | &mov ($Zhh,&DWP(0,$Zhh)); | |
c1f092d1 | 472 | &jmp (&label("mmx_outer_loop")); |
e3a510f8 AP |
473 | |
474 | &set_label("mmx_outer_loop",16); | |
475 | &xor ($Zll,&DWP(12,$inp)); | |
476 | &xor ($Zhl,&DWP(4,$inp)); | |
477 | &xor ($Zlh,&DWP(8,$inp)); | |
478 | &xor ($Zhh,&DWP(0,$inp)); | |
07e29c12 | 479 | &mov (&wparam(2),$inp); |
e3a510f8 AP |
480 | &mov (&DWP(12,"esp"),$Zll); |
481 | &mov (&DWP(4,"esp"),$Zhl); | |
482 | &mov (&DWP(8,"esp"),$Zlh); | |
483 | &mov (&DWP(0,"esp"),$Zhh); | |
484 | ||
07e29c12 | 485 | &mov ($inp,"esp"); |
e3a510f8 AP |
486 | &shr ($Zll,24); |
487 | ||
07e29c12 | 488 | &call ("_mmx_gmult_4bit_inner"); |
e3a510f8 | 489 | |
07e29c12 | 490 | &mov ($inp,&wparam(2)); |
e3a510f8 | 491 | &lea ($inp,&DWP(16,$inp)); |
4f39edbf | 492 | &cmp ($inp,&wparam(3)); |
e3a510f8 AP |
493 | &jb (&label("mmx_outer_loop")); |
494 | ||
4f39edbf | 495 | &mov ($inp,&wparam(0)); # load Xi |
e3a510f8 AP |
496 | &emms (); |
497 | &mov (&DWP(12,$inp),$Zll); | |
498 | &mov (&DWP(4,$inp),$Zhl); | |
499 | &mov (&DWP(8,$inp),$Zlh); | |
500 | &mov (&DWP(0,$inp),$Zhh); | |
501 | ||
502 | &stack_pop(4+1); | |
c1f092d1 AP |
503 | &function_end("gcm_ghash_4bit_mmx"); |
504 | \f | |
8525950e | 505 | }} else {{ # "June" MMX version... |
04e2b793 AP |
506 | # ... has slower "April" gcm_gmult_4bit_mmx with folded |
507 | # loop. This is done to conserve code size... | |
8525950e AP |
508 | $S=16; # shift factor for rem_4bit |
509 | ||
510 | sub mmx_loop() { | |
511 | # MMX version performs 2.8 times better on P4 (see comment in non-MMX | |
512 | # routine for further details), 40% better on Opteron and Core2, 50% | |
513 | # better on PIII... In other words effort is considered to be well | |
514 | # spent... | |
515 | my $inp = shift; | |
516 | my $rem_4bit = shift; | |
517 | my $cnt = $Zhh; | |
518 | my $nhi = $Zhl; | |
519 | my $nlo = $Zlh; | |
520 | my $rem = $Zll; | |
521 | ||
522 | my ($Zlo,$Zhi) = ("mm0","mm1"); | |
523 | my $tmp = "mm2"; | |
524 | ||
525 | &xor ($nlo,$nlo); # avoid partial register stalls on PIII | |
526 | &mov ($nhi,$Zll); | |
527 | &mov (&LB($nlo),&LB($nhi)); | |
528 | &mov ($cnt,14); | |
529 | &shl (&LB($nlo),4); | |
530 | &and ($nhi,0xf0); | |
531 | &movq ($Zlo,&QWP(8,$Htbl,$nlo)); | |
532 | &movq ($Zhi,&QWP(0,$Htbl,$nlo)); | |
533 | &movd ($rem,$Zlo); | |
534 | &jmp (&label("mmx_loop")); | |
535 | ||
536 | &set_label("mmx_loop",16); | |
537 | &psrlq ($Zlo,4); | |
538 | &and ($rem,0xf); | |
539 | &movq ($tmp,$Zhi); | |
540 | &psrlq ($Zhi,4); | |
541 | &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); | |
542 | &mov (&LB($nlo),&BP(0,$inp,$cnt)); | |
543 | &psllq ($tmp,60); | |
544 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); | |
545 | &dec ($cnt); | |
546 | &movd ($rem,$Zlo); | |
547 | &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); | |
548 | &mov ($nhi,$nlo); | |
549 | &pxor ($Zlo,$tmp); | |
550 | &js (&label("mmx_break")); | |
551 | ||
552 | &shl (&LB($nlo),4); | |
553 | &and ($rem,0xf); | |
554 | &psrlq ($Zlo,4); | |
555 | &and ($nhi,0xf0); | |
556 | &movq ($tmp,$Zhi); | |
557 | &psrlq ($Zhi,4); | |
558 | &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); | |
559 | &psllq ($tmp,60); | |
560 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); | |
561 | &movd ($rem,$Zlo); | |
562 | &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); | |
563 | &pxor ($Zlo,$tmp); | |
564 | &jmp (&label("mmx_loop")); | |
565 | ||
566 | &set_label("mmx_break",16); | |
567 | &shl (&LB($nlo),4); | |
568 | &and ($rem,0xf); | |
569 | &psrlq ($Zlo,4); | |
570 | &and ($nhi,0xf0); | |
571 | &movq ($tmp,$Zhi); | |
572 | &psrlq ($Zhi,4); | |
573 | &pxor ($Zlo,&QWP(8,$Htbl,$nlo)); | |
574 | &psllq ($tmp,60); | |
575 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); | |
576 | &movd ($rem,$Zlo); | |
577 | &pxor ($Zhi,&QWP(0,$Htbl,$nlo)); | |
578 | &pxor ($Zlo,$tmp); | |
579 | ||
580 | &psrlq ($Zlo,4); | |
581 | &and ($rem,0xf); | |
582 | &movq ($tmp,$Zhi); | |
583 | &psrlq ($Zhi,4); | |
584 | &pxor ($Zlo,&QWP(8,$Htbl,$nhi)); | |
585 | &psllq ($tmp,60); | |
586 | &pxor ($Zhi,&QWP(0,$rem_4bit,$rem,8)); | |
587 | &movd ($rem,$Zlo); | |
588 | &pxor ($Zhi,&QWP(0,$Htbl,$nhi)); | |
589 | &pxor ($Zlo,$tmp); | |
590 | ||
591 | &psrlq ($Zlo,32); # lower part of Zlo is already there | |
592 | &movd ($Zhl,$Zhi); | |
593 | &psrlq ($Zhi,32); | |
594 | &movd ($Zlh,$Zlo); | |
595 | &movd ($Zhh,$Zhi); | |
596 | ||
597 | &bswap ($Zll); | |
598 | &bswap ($Zhl); | |
599 | &bswap ($Zlh); | |
600 | &bswap ($Zhh); | |
601 | } | |
602 | ||
603 | &function_begin("gcm_gmult_4bit_mmx"); | |
604 | &mov ($inp,&wparam(0)); # load Xi | |
605 | &mov ($Htbl,&wparam(1)); # load Htable | |
606 | ||
607 | &call (&label("pic_point")); | |
608 | &set_label("pic_point"); | |
609 | &blindpop("eax"); | |
610 | &lea ("eax",&DWP(&label("rem_4bit")."-".&label("pic_point"),"eax")); | |
611 | ||
612 | &movz ($Zll,&BP(15,$inp)); | |
613 | ||
614 | &mmx_loop($inp,"eax"); | |
615 | ||
616 | &emms (); | |
617 | &mov (&DWP(12,$inp),$Zll); | |
618 | &mov (&DWP(4,$inp),$Zhl); | |
619 | &mov (&DWP(8,$inp),$Zlh); | |
620 | &mov (&DWP(0,$inp),$Zhh); | |
621 | &function_end("gcm_gmult_4bit_mmx"); | |
622 | \f | |
623 | ###################################################################### | |
624 | # Below subroutine is "528B" variant of "4-bit" GCM GHASH function | |
625 | # (see gcm128.c for details). It provides further 20-40% performance | |
04e2b793 | 626 | # improvement over above mentioned "May" version. |
8525950e AP |
627 | |
628 | &static_label("rem_8bit"); | |
629 | ||
630 | &function_begin("gcm_ghash_4bit_mmx"); | |
631 | { my ($Zlo,$Zhi) = ("mm7","mm6"); | |
632 | my $rem_8bit = "esi"; | |
633 | my $Htbl = "ebx"; | |
634 | ||
635 | # parameter block | |
636 | &mov ("eax",&wparam(0)); # Xi | |
637 | &mov ("ebx",&wparam(1)); # Htable | |
638 | &mov ("ecx",&wparam(2)); # inp | |
639 | &mov ("edx",&wparam(3)); # len | |
640 | &mov ("ebp","esp"); # original %esp | |
641 | &call (&label("pic_point")); | |
642 | &set_label ("pic_point"); | |
643 | &blindpop ($rem_8bit); | |
644 | &lea ($rem_8bit,&DWP(&label("rem_8bit")."-".&label("pic_point"),$rem_8bit)); | |
645 | ||
646 | &sub ("esp",512+16+16); # allocate stack frame... | |
647 | &and ("esp",-64); # ...and align it | |
648 | &sub ("esp",16); # place for (u8)(H[]<<4) | |
649 | ||
650 | &add ("edx","ecx"); # pointer to the end of input | |
651 | &mov (&DWP(528+16+0,"esp"),"eax"); # save Xi | |
652 | &mov (&DWP(528+16+8,"esp"),"edx"); # save inp+len | |
653 | &mov (&DWP(528+16+12,"esp"),"ebp"); # save original %esp | |
654 | ||
655 | { my @lo = ("mm0","mm1","mm2"); | |
656 | my @hi = ("mm3","mm4","mm5"); | |
657 | my @tmp = ("mm6","mm7"); | |
f9c5e5d9 | 658 | my ($off1,$off2,$i) = (0,0,); |
8525950e AP |
659 | |
660 | &add ($Htbl,128); # optimize for size | |
661 | &lea ("edi",&DWP(16+128,"esp")); | |
662 | &lea ("ebp",&DWP(16+256+128,"esp")); | |
663 | ||
664 | # decompose Htable (low and high parts are kept separately), | |
04e2b793 | 665 | # generate Htable[]>>4, (u8)(Htable[]<<4), save to stack... |
8525950e AP |
666 | for ($i=0;$i<18;$i++) { |
667 | ||
668 | &mov ("edx",&DWP(16*$i+8-128,$Htbl)) if ($i<16); | |
669 | &movq ($lo[0],&QWP(16*$i+8-128,$Htbl)) if ($i<16); | |
670 | &psllq ($tmp[1],60) if ($i>1); | |
671 | &movq ($hi[0],&QWP(16*$i+0-128,$Htbl)) if ($i<16); | |
672 | &por ($lo[2],$tmp[1]) if ($i>1); | |
673 | &movq (&QWP($off1-128,"edi"),$lo[1]) if ($i>0 && $i<17); | |
674 | &psrlq ($lo[1],4) if ($i>0 && $i<17); | |
675 | &movq (&QWP($off1,"edi"),$hi[1]) if ($i>0 && $i<17); | |
676 | &movq ($tmp[0],$hi[1]) if ($i>0 && $i<17); | |
677 | &movq (&QWP($off2-128,"ebp"),$lo[2]) if ($i>1); | |
678 | &psrlq ($hi[1],4) if ($i>0 && $i<17); | |
679 | &movq (&QWP($off2,"ebp"),$hi[2]) if ($i>1); | |
680 | &shl ("edx",4) if ($i<16); | |
681 | &mov (&BP($i,"esp"),&LB("edx")) if ($i<16); | |
682 | ||
683 | unshift (@lo,pop(@lo)); # "rotate" registers | |
684 | unshift (@hi,pop(@hi)); | |
685 | unshift (@tmp,pop(@tmp)); | |
686 | $off1 += 8 if ($i>0); | |
687 | $off2 += 8 if ($i>1); | |
688 | } | |
689 | } | |
690 | ||
691 | &movq ($Zhi,&QWP(0,"eax")); | |
692 | &mov ("ebx",&DWP(8,"eax")); | |
693 | &mov ("edx",&DWP(12,"eax")); # load Xi | |
694 | ||
695 | &set_label("outer",16); | |
696 | { my $nlo = "eax"; | |
697 | my $dat = "edx"; | |
698 | my @nhi = ("edi","ebp"); | |
699 | my @rem = ("ebx","ecx"); | |
700 | my @red = ("mm0","mm1","mm2"); | |
701 | my $tmp = "mm3"; | |
702 | ||
04e2b793 | 703 | &xor ($dat,&DWP(12,"ecx")); # merge input data |
8525950e AP |
704 | &xor ("ebx",&DWP(8,"ecx")); |
705 | &pxor ($Zhi,&QWP(0,"ecx")); | |
706 | &lea ("ecx",&DWP(16,"ecx")); # inp+=16 | |
707 | #&mov (&DWP(528+12,"esp"),$dat); # save inp^Xi | |
708 | &mov (&DWP(528+8,"esp"),"ebx"); | |
709 | &movq (&QWP(528+0,"esp"),$Zhi); | |
710 | &mov (&DWP(528+16+4,"esp"),"ecx"); # save inp | |
711 | ||
712 | &xor ($nlo,$nlo); | |
713 | &rol ($dat,8); | |
714 | &mov (&LB($nlo),&LB($dat)); | |
715 | &mov ($nhi[1],$nlo); | |
716 | &and (&LB($nlo),0x0f); | |
717 | &shr ($nhi[1],4); | |
718 | &pxor ($red[0],$red[0]); | |
04e2b793 | 719 | &rol ($dat,8); # next byte |
8525950e AP |
720 | &pxor ($red[1],$red[1]); |
721 | &pxor ($red[2],$red[2]); | |
722 | ||
60250017 | 723 | # Just like in "May" version modulo-schedule for critical path in |
04e2b793 AP |
724 | # 'Z.hi ^= rem_8bit[Z.lo&0xff^((u8)H[nhi]<<4)]<<48'. Final 'pxor' |
725 | # is scheduled so late that rem_8bit[] has to be shifted *right* | |
726 | # by 16, which is why last argument to pinsrw is 2, which | |
727 | # corresponds to <<32=<<48>>16... | |
8525950e AP |
728 | for ($j=11,$i=0;$i<15;$i++) { |
729 | ||
730 | if ($i>0) { | |
731 | &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] | |
732 | &rol ($dat,8); # next byte | |
733 | &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); | |
734 | ||
735 | &pxor ($Zlo,$tmp); | |
736 | &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); | |
04e2b793 | 737 | &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) |
8525950e AP |
738 | } else { |
739 | &movq ($Zlo,&QWP(16,"esp",$nlo,8)); | |
740 | &movq ($Zhi,&QWP(16+128,"esp",$nlo,8)); | |
741 | } | |
742 | ||
743 | &mov (&LB($nlo),&LB($dat)); | |
04e2b793 | 744 | &mov ($dat,&DWP(528+$j,"esp")) if (--$j%4==0); |
8525950e AP |
745 | |
746 | &movd ($rem[0],$Zlo); | |
04e2b793 AP |
747 | &movz ($rem[1],&LB($rem[1])) if ($i>0); |
748 | &psrlq ($Zlo,8); # Z>>=8 | |
8525950e AP |
749 | |
750 | &movq ($tmp,$Zhi); | |
751 | &mov ($nhi[0],$nlo); | |
752 | &psrlq ($Zhi,8); | |
753 | ||
754 | &pxor ($Zlo,&QWP(16+256+0,"esp",$nhi[1],8)); # Z^=H[nhi]>>4 | |
755 | &and (&LB($nlo),0x0f); | |
756 | &psllq ($tmp,56); | |
757 | ||
758 | &pxor ($Zhi,$red[1]) if ($i>1); | |
759 | &shr ($nhi[0],4); | |
760 | &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2) if ($i>0); | |
761 | ||
762 | unshift (@red,pop(@red)); # "rotate" registers | |
763 | unshift (@rem,pop(@rem)); | |
764 | unshift (@nhi,pop(@nhi)); | |
765 | } | |
766 | ||
767 | &pxor ($Zlo,&QWP(16,"esp",$nlo,8)); # Z^=H[nlo] | |
768 | &pxor ($Zhi,&QWP(16+128,"esp",$nlo,8)); | |
04e2b793 | 769 | &xor (&LB($rem[1]),&BP(0,"esp",$nhi[0])); # rem^(H[nhi]<<4) |
8525950e AP |
770 | |
771 | &pxor ($Zlo,$tmp); | |
772 | &pxor ($Zhi,&QWP(16+256+128,"esp",$nhi[0],8)); | |
773 | &movz ($rem[1],&LB($rem[1])); | |
774 | ||
775 | &pxor ($red[2],$red[2]); # clear 2nd word | |
776 | &psllq ($red[1],4); | |
777 | ||
778 | &movd ($rem[0],$Zlo); | |
04e2b793 | 779 | &psrlq ($Zlo,4); # Z>>=4 |
8525950e AP |
780 | |
781 | &movq ($tmp,$Zhi); | |
782 | &psrlq ($Zhi,4); | |
04e2b793 | 783 | &shl ($rem[0],4); # rem<<4 |
8525950e AP |
784 | |
785 | &pxor ($Zlo,&QWP(16,"esp",$nhi[1],8)); # Z^=H[nhi] | |
786 | &psllq ($tmp,60); | |
787 | &movz ($rem[0],&LB($rem[0])); | |
788 | ||
789 | &pxor ($Zlo,$tmp); | |
790 | &pxor ($Zhi,&QWP(16+128,"esp",$nhi[1],8)); | |
791 | ||
792 | &pinsrw ($red[0],&WP(0,$rem_8bit,$rem[1],2),2); | |
793 | &pxor ($Zhi,$red[1]); | |
794 | ||
795 | &movd ($dat,$Zlo); | |
04e2b793 | 796 | &pinsrw ($red[2],&WP(0,$rem_8bit,$rem[0],2),3); # last is <<48 |
8525950e | 797 | |
04e2b793 | 798 | &psllq ($red[0],12); # correct by <<16>>4 |
8525950e AP |
799 | &pxor ($Zhi,$red[0]); |
800 | &psrlq ($Zlo,32); | |
801 | &pxor ($Zhi,$red[2]); | |
802 | ||
803 | &mov ("ecx",&DWP(528+16+4,"esp")); # restore inp | |
804 | &movd ("ebx",$Zlo); | |
805 | &movq ($tmp,$Zhi); # 01234567 | |
806 | &psllw ($Zhi,8); # 1.3.5.7. | |
807 | &psrlw ($tmp,8); # .0.2.4.6 | |
808 | &por ($Zhi,$tmp); # 10325476 | |
809 | &bswap ($dat); | |
810 | &pshufw ($Zhi,$Zhi,0b00011011); # 76543210 | |
811 | &bswap ("ebx"); | |
609b0852 | 812 | |
8525950e AP |
813 | &cmp ("ecx",&DWP(528+16+8,"esp")); # are we done? |
814 | &jne (&label("outer")); | |
815 | } | |
816 | ||
817 | &mov ("eax",&DWP(528+16+0,"esp")); # restore Xi | |
818 | &mov (&DWP(12,"eax"),"edx"); | |
819 | &mov (&DWP(8,"eax"),"ebx"); | |
820 | &movq (&QWP(0,"eax"),$Zhi); | |
821 | ||
822 | &mov ("esp",&DWP(528+16+12,"esp")); # restore original %esp | |
823 | &emms (); | |
824 | } | |
825 | &function_end("gcm_ghash_4bit_mmx"); | |
826 | }} | |
827 | \f | |
c1f092d1 AP |
828 | if ($sse2) {{ |
829 | ###################################################################### | |
830 | # PCLMULQDQ version. | |
831 | ||
832 | $Xip="eax"; | |
833 | $Htbl="edx"; | |
834 | $const="ecx"; | |
835 | $inp="esi"; | |
836 | $len="ebx"; | |
837 | ||
838 | ($Xi,$Xhi)=("xmm0","xmm1"); $Hkey="xmm2"; | |
839 | ($T1,$T2,$T3)=("xmm3","xmm4","xmm5"); | |
840 | ($Xn,$Xhn)=("xmm6","xmm7"); | |
841 | ||
842 | &static_label("bswap"); | |
843 | ||
844 | sub clmul64x64_T2 { # minimal "register" pressure | |
273a8081 | 845 | my ($Xhi,$Xi,$Hkey,$HK)=@_; |
c1f092d1 AP |
846 | |
847 | &movdqa ($Xhi,$Xi); # | |
848 | &pshufd ($T1,$Xi,0b01001110); | |
273a8081 | 849 | &pshufd ($T2,$Hkey,0b01001110) if (!defined($HK)); |
c1f092d1 | 850 | &pxor ($T1,$Xi); # |
273a8081 AP |
851 | &pxor ($T2,$Hkey) if (!defined($HK)); |
852 | $HK=$T2 if (!defined($HK)); | |
c1f092d1 AP |
853 | |
854 | &pclmulqdq ($Xi,$Hkey,0x00); ####### | |
855 | &pclmulqdq ($Xhi,$Hkey,0x11); ####### | |
273a8081 | 856 | &pclmulqdq ($T1,$HK,0x00); ####### |
bc5b136c AP |
857 | &xorps ($T1,$Xi); # |
858 | &xorps ($T1,$Xhi); # | |
c1f092d1 AP |
859 | |
860 | &movdqa ($T2,$T1); # | |
861 | &psrldq ($T1,8); | |
862 | &pslldq ($T2,8); # | |
863 | &pxor ($Xhi,$T1); | |
864 | &pxor ($Xi,$T2); # | |
865 | } | |
e3a510f8 | 866 | |
c1f092d1 AP |
867 | sub clmul64x64_T3 { |
868 | # Even though this subroutine offers visually better ILP, it | |
869 | # was empirically found to be a tad slower than above version. | |
870 | # At least in gcm_ghash_clmul context. But it's just as well, | |
871 | # because loop modulo-scheduling is possible only thanks to | |
872 | # minimized "register" pressure... | |
873 | my ($Xhi,$Xi,$Hkey)=@_; | |
874 | ||
875 | &movdqa ($T1,$Xi); # | |
876 | &movdqa ($Xhi,$Xi); | |
877 | &pclmulqdq ($Xi,$Hkey,0x00); ####### | |
878 | &pclmulqdq ($Xhi,$Hkey,0x11); ####### | |
879 | &pshufd ($T2,$T1,0b01001110); # | |
880 | &pshufd ($T3,$Hkey,0b01001110); | |
881 | &pxor ($T2,$T1); # | |
882 | &pxor ($T3,$Hkey); | |
883 | &pclmulqdq ($T2,$T3,0x00); ####### | |
884 | &pxor ($T2,$Xi); # | |
885 | &pxor ($T2,$Xhi); # | |
886 | ||
887 | &movdqa ($T3,$T2); # | |
888 | &psrldq ($T2,8); | |
889 | &pslldq ($T3,8); # | |
890 | &pxor ($Xhi,$T2); | |
891 | &pxor ($Xi,$T3); # | |
892 | } | |
893 | \f | |
894 | if (1) { # Algorithm 9 with <<1 twist. | |
895 | # Reduction is shorter and uses only two | |
896 | # temporary registers, which makes it a better | |
897 | # candidate for interleaving with 64x64 | |
898 | # multiplication. Pre-modulo-scheduled loop | |
899 | # was found to be ~20% faster than Algorithm 5 | |
07e29c12 AP |
900 | # below. Algorithm 9 was therefore chosen for |
901 | # further optimization... | |
c1f092d1 | 902 | |
273a8081 | 903 | sub reduction_alg9 { # 17/11 times faster than Intel version |
c1f092d1 AP |
904 | my ($Xhi,$Xi) = @_; |
905 | ||
906 | # 1st phase | |
273a8081 AP |
907 | &movdqa ($T2,$Xi); # |
908 | &movdqa ($T1,$Xi); | |
909 | &psllq ($Xi,5); | |
910 | &pxor ($T1,$Xi); # | |
c1f092d1 AP |
911 | &psllq ($Xi,1); |
912 | &pxor ($Xi,$T1); # | |
c1f092d1 | 913 | &psllq ($Xi,57); # |
273a8081 | 914 | &movdqa ($T1,$Xi); # |
c1f092d1 | 915 | &pslldq ($Xi,8); |
609b0852 | 916 | &psrldq ($T1,8); # |
273a8081 AP |
917 | &pxor ($Xi,$T2); |
918 | &pxor ($Xhi,$T1); # | |
c1f092d1 AP |
919 | |
920 | # 2nd phase | |
921 | &movdqa ($T2,$Xi); | |
273a8081 AP |
922 | &psrlq ($Xi,1); |
923 | &pxor ($Xhi,$T2); # | |
924 | &pxor ($T2,$Xi); | |
c1f092d1 AP |
925 | &psrlq ($Xi,5); |
926 | &pxor ($Xi,$T2); # | |
927 | &psrlq ($Xi,1); # | |
273a8081 | 928 | &pxor ($Xi,$Xhi); # |
c1f092d1 | 929 | } |
e3a510f8 | 930 | |
c1f092d1 AP |
931 | &function_begin_B("gcm_init_clmul"); |
932 | &mov ($Htbl,&wparam(0)); | |
933 | &mov ($Xip,&wparam(1)); | |
e3a510f8 | 934 | |
c1f092d1 AP |
935 | &call (&label("pic")); |
936 | &set_label("pic"); | |
937 | &blindpop ($const); | |
938 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); | |
e3a510f8 | 939 | |
c1f092d1 AP |
940 | &movdqu ($Hkey,&QWP(0,$Xip)); |
941 | &pshufd ($Hkey,$Hkey,0b01001110);# dword swap | |
e3a510f8 | 942 | |
c1f092d1 AP |
943 | # <<1 twist |
944 | &pshufd ($T2,$Hkey,0b11111111); # broadcast uppermost dword | |
945 | &movdqa ($T1,$Hkey); | |
946 | &psllq ($Hkey,1); | |
947 | &pxor ($T3,$T3); # | |
948 | &psrlq ($T1,63); | |
949 | &pcmpgtd ($T3,$T2); # broadcast carry bit | |
950 | &pslldq ($T1,8); | |
951 | &por ($Hkey,$T1); # H<<=1 | |
e3a510f8 | 952 | |
c1f092d1 AP |
953 | # magic reduction |
954 | &pand ($T3,&QWP(16,$const)); # 0x1c2_polynomial | |
955 | &pxor ($Hkey,$T3); # if(carry) H^=0x1c2_polynomial | |
e3a510f8 | 956 | |
c1f092d1 AP |
957 | # calculate H^2 |
958 | &movdqa ($Xi,$Hkey); | |
959 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); | |
960 | &reduction_alg9 ($Xhi,$Xi); | |
e3a510f8 | 961 | |
273a8081 AP |
962 | &pshufd ($T1,$Hkey,0b01001110); |
963 | &pshufd ($T2,$Xi,0b01001110); | |
964 | &pxor ($T1,$Hkey); # Karatsuba pre-processing | |
c1f092d1 | 965 | &movdqu (&QWP(0,$Htbl),$Hkey); # save H |
273a8081 | 966 | &pxor ($T2,$Xi); # Karatsuba pre-processing |
c1f092d1 | 967 | &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 |
273a8081 AP |
968 | &palignr ($T2,$T1,8); # low part is H.lo^H.hi |
969 | &movdqu (&QWP(32,$Htbl),$T2); # save Karatsuba "salt" | |
c1f092d1 AP |
970 | |
971 | &ret (); | |
972 | &function_end_B("gcm_init_clmul"); | |
973 | ||
974 | &function_begin_B("gcm_gmult_clmul"); | |
975 | &mov ($Xip,&wparam(0)); | |
976 | &mov ($Htbl,&wparam(1)); | |
977 | ||
978 | &call (&label("pic")); | |
979 | &set_label("pic"); | |
980 | &blindpop ($const); | |
981 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); | |
982 | ||
983 | &movdqu ($Xi,&QWP(0,$Xip)); | |
984 | &movdqa ($T3,&QWP(0,$const)); | |
bc5b136c | 985 | &movups ($Hkey,&QWP(0,$Htbl)); |
c1f092d1 | 986 | &pshufb ($Xi,$T3); |
273a8081 | 987 | &movups ($T2,&QWP(32,$Htbl)); |
c1f092d1 | 988 | |
273a8081 | 989 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey,$T2); |
c1f092d1 AP |
990 | &reduction_alg9 ($Xhi,$Xi); |
991 | ||
992 | &pshufb ($Xi,$T3); | |
993 | &movdqu (&QWP(0,$Xip),$Xi); | |
994 | ||
995 | &ret (); | |
996 | &function_end_B("gcm_gmult_clmul"); | |
997 | ||
998 | &function_begin("gcm_ghash_clmul"); | |
999 | &mov ($Xip,&wparam(0)); | |
1000 | &mov ($Htbl,&wparam(1)); | |
1001 | &mov ($inp,&wparam(2)); | |
1002 | &mov ($len,&wparam(3)); | |
1003 | ||
1004 | &call (&label("pic")); | |
1005 | &set_label("pic"); | |
1006 | &blindpop ($const); | |
1007 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); | |
1008 | ||
1009 | &movdqu ($Xi,&QWP(0,$Xip)); | |
1010 | &movdqa ($T3,&QWP(0,$const)); | |
1011 | &movdqu ($Hkey,&QWP(0,$Htbl)); | |
1012 | &pshufb ($Xi,$T3); | |
1013 | ||
1014 | &sub ($len,0x10); | |
1015 | &jz (&label("odd_tail")); | |
1016 | ||
1017 | ####### | |
1018 | # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = | |
1019 | # [(H*Ii+1) + (H*Xi+1)] mod P = | |
1020 | # [(H*Ii+1) + H^2*(Ii+Xi)] mod P | |
1021 | # | |
1022 | &movdqu ($T1,&QWP(0,$inp)); # Ii | |
1023 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 | |
1024 | &pshufb ($T1,$T3); | |
1025 | &pshufb ($Xn,$T3); | |
273a8081 | 1026 | &movdqu ($T3,&QWP(32,$Htbl)); |
c1f092d1 AP |
1027 | &pxor ($Xi,$T1); # Ii+Xi |
1028 | ||
273a8081 AP |
1029 | &pshufd ($T1,$Xn,0b01001110); # H*Ii+1 |
1030 | &movdqa ($Xhn,$Xn); | |
1031 | &pxor ($T1,$Xn); # | |
98e143f1 | 1032 | &lea ($inp,&DWP(32,$inp)); # i+=2 |
273a8081 AP |
1033 | |
1034 | &pclmulqdq ($Xn,$Hkey,0x00); ####### | |
1035 | &pclmulqdq ($Xhn,$Hkey,0x11); ####### | |
273a8081 | 1036 | &pclmulqdq ($T1,$T3,0x00); ####### |
98e143f1 AP |
1037 | &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 |
1038 | &nop (); | |
c1f092d1 | 1039 | |
c1f092d1 AP |
1040 | &sub ($len,0x20); |
1041 | &jbe (&label("even_tail")); | |
273a8081 | 1042 | &jmp (&label("mod_loop")); |
c1f092d1 | 1043 | |
273a8081 AP |
1044 | &set_label("mod_loop",32); |
1045 | &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi) | |
1046 | &movdqa ($Xhi,$Xi); | |
1047 | &pxor ($T2,$Xi); # | |
98e143f1 | 1048 | &nop (); |
c1f092d1 | 1049 | |
273a8081 AP |
1050 | &pclmulqdq ($Xi,$Hkey,0x00); ####### |
1051 | &pclmulqdq ($Xhi,$Hkey,0x11); ####### | |
273a8081 | 1052 | &pclmulqdq ($T2,$T3,0x10); ####### |
98e143f1 | 1053 | &movups ($Hkey,&QWP(0,$Htbl)); # load H |
c1f092d1 | 1054 | |
273a8081 | 1055 | &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) |
98e143f1 | 1056 | &movdqa ($T3,&QWP(0,$const)); |
273a8081 AP |
1057 | &xorps ($Xhi,$Xhn); |
1058 | &movdqu ($Xhn,&QWP(0,$inp)); # Ii | |
1059 | &pxor ($T1,$Xi); # aggregated Karatsuba post-processing | |
1060 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 | |
1061 | &pxor ($T1,$Xhi); # | |
c1f092d1 | 1062 | |
273a8081 | 1063 | &pshufb ($Xhn,$T3); |
98e143f1 | 1064 | &pxor ($T2,$T1); # |
c1f092d1 | 1065 | |
273a8081 AP |
1066 | &movdqa ($T1,$T2); # |
1067 | &psrldq ($T2,8); | |
1068 | &pslldq ($T1,8); # | |
1069 | &pxor ($Xhi,$T2); | |
1070 | &pxor ($Xi,$T1); # | |
1071 | &pshufb ($Xn,$T3); | |
1072 | &pxor ($Xhi,$Xhn); # "Ii+Xi", consume early | |
1073 | ||
1074 | &movdqa ($Xhn,$Xn); #&clmul64x64_TX ($Xhn,$Xn,$Hkey); H*Ii+1 | |
1075 | &movdqa ($T2,$Xi); #&reduction_alg9($Xhi,$Xi); 1st phase | |
1076 | &movdqa ($T1,$Xi); | |
1077 | &psllq ($Xi,5); | |
1078 | &pxor ($T1,$Xi); # | |
c1f092d1 AP |
1079 | &psllq ($Xi,1); |
1080 | &pxor ($Xi,$T1); # | |
c1f092d1 | 1081 | &pclmulqdq ($Xn,$Hkey,0x00); ####### |
98e143f1 | 1082 | &movups ($T3,&QWP(32,$Htbl)); |
c1f092d1 | 1083 | &psllq ($Xi,57); # |
273a8081 | 1084 | &movdqa ($T1,$Xi); # |
c1f092d1 | 1085 | &pslldq ($Xi,8); |
609b0852 | 1086 | &psrldq ($T1,8); # |
273a8081 AP |
1087 | &pxor ($Xi,$T2); |
1088 | &pxor ($Xhi,$T1); # | |
1089 | &pshufd ($T1,$Xhn,0b01001110); | |
c1f092d1 | 1090 | &movdqa ($T2,$Xi); # 2nd phase |
273a8081 AP |
1091 | &psrlq ($Xi,1); |
1092 | &pxor ($T1,$Xhn); | |
98e143f1 | 1093 | &pxor ($Xhi,$T2); # |
273a8081 AP |
1094 | &pclmulqdq ($Xhn,$Hkey,0x11); ####### |
1095 | &movups ($Hkey,&QWP(16,$Htbl)); # load H^2 | |
273a8081 | 1096 | &pxor ($T2,$Xi); |
c1f092d1 AP |
1097 | &psrlq ($Xi,5); |
1098 | &pxor ($Xi,$T2); # | |
1099 | &psrlq ($Xi,1); # | |
273a8081 | 1100 | &pxor ($Xi,$Xhi); # |
c1f092d1 | 1101 | &pclmulqdq ($T1,$T3,0x00); ####### |
c1f092d1 AP |
1102 | |
1103 | &lea ($inp,&DWP(32,$inp)); | |
1104 | &sub ($len,0x20); | |
1105 | &ja (&label("mod_loop")); | |
1106 | ||
1107 | &set_label("even_tail"); | |
273a8081 AP |
1108 | &pshufd ($T2,$Xi,0b01001110); # H^2*(Ii+Xi) |
1109 | &movdqa ($Xhi,$Xi); | |
1110 | &pxor ($T2,$Xi); # | |
1111 | ||
1112 | &pclmulqdq ($Xi,$Hkey,0x00); ####### | |
1113 | &pclmulqdq ($Xhi,$Hkey,0x11); ####### | |
1114 | &pclmulqdq ($T2,$T3,0x10); ####### | |
1115 | &movdqa ($T3,&QWP(0,$const)); | |
1116 | ||
1117 | &xorps ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) | |
1118 | &xorps ($Xhi,$Xhn); | |
1119 | &pxor ($T1,$Xi); # aggregated Karatsuba post-processing | |
1120 | &pxor ($T1,$Xhi); # | |
1121 | ||
1122 | &pxor ($T2,$T1); # | |
c1f092d1 | 1123 | |
273a8081 AP |
1124 | &movdqa ($T1,$T2); # |
1125 | &psrldq ($T2,8); | |
1126 | &pslldq ($T1,8); # | |
1127 | &pxor ($Xhi,$T2); | |
1128 | &pxor ($Xi,$T1); # | |
c1f092d1 AP |
1129 | |
1130 | &reduction_alg9 ($Xhi,$Xi); | |
1131 | ||
1132 | &test ($len,$len); | |
1133 | &jnz (&label("done")); | |
1134 | ||
bc5b136c | 1135 | &movups ($Hkey,&QWP(0,$Htbl)); # load H |
c1f092d1 AP |
1136 | &set_label("odd_tail"); |
1137 | &movdqu ($T1,&QWP(0,$inp)); # Ii | |
1138 | &pshufb ($T1,$T3); | |
1139 | &pxor ($Xi,$T1); # Ii+Xi | |
1140 | ||
1141 | &clmul64x64_T2 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) | |
1142 | &reduction_alg9 ($Xhi,$Xi); | |
1143 | ||
1144 | &set_label("done"); | |
1145 | &pshufb ($Xi,$T3); | |
1146 | &movdqu (&QWP(0,$Xip),$Xi); | |
1147 | &function_end("gcm_ghash_clmul"); | |
1148 | \f | |
60250017 | 1149 | } else { # Algorithm 5. Kept for reference purposes. |
c1f092d1 AP |
1150 | |
1151 | sub reduction_alg5 { # 19/16 times faster than Intel version | |
1152 | my ($Xhi,$Xi)=@_; | |
1153 | ||
1154 | # <<1 | |
1155 | &movdqa ($T1,$Xi); # | |
1156 | &movdqa ($T2,$Xhi); | |
1157 | &pslld ($Xi,1); | |
1158 | &pslld ($Xhi,1); # | |
1159 | &psrld ($T1,31); | |
1160 | &psrld ($T2,31); # | |
1161 | &movdqa ($T3,$T1); | |
1162 | &pslldq ($T1,4); | |
1163 | &psrldq ($T3,12); # | |
1164 | &pslldq ($T2,4); | |
1165 | &por ($Xhi,$T3); # | |
1166 | &por ($Xi,$T1); | |
1167 | &por ($Xhi,$T2); # | |
1168 | ||
1169 | # 1st phase | |
1170 | &movdqa ($T1,$Xi); | |
1171 | &movdqa ($T2,$Xi); | |
1172 | &movdqa ($T3,$Xi); # | |
1173 | &pslld ($T1,31); | |
1174 | &pslld ($T2,30); | |
1175 | &pslld ($Xi,25); # | |
1176 | &pxor ($T1,$T2); | |
1177 | &pxor ($T1,$Xi); # | |
1178 | &movdqa ($T2,$T1); # | |
1179 | &pslldq ($T1,12); | |
1180 | &psrldq ($T2,4); # | |
1181 | &pxor ($T3,$T1); | |
1182 | ||
1183 | # 2nd phase | |
1184 | &pxor ($Xhi,$T3); # | |
1185 | &movdqa ($Xi,$T3); | |
1186 | &movdqa ($T1,$T3); | |
1187 | &psrld ($Xi,1); # | |
1188 | &psrld ($T1,2); | |
1189 | &psrld ($T3,7); # | |
1190 | &pxor ($Xi,$T1); | |
1191 | &pxor ($Xhi,$T2); | |
1192 | &pxor ($Xi,$T3); # | |
1193 | &pxor ($Xi,$Xhi); # | |
e3a510f8 AP |
1194 | } |
1195 | ||
c1f092d1 AP |
1196 | &function_begin_B("gcm_init_clmul"); |
1197 | &mov ($Htbl,&wparam(0)); | |
1198 | &mov ($Xip,&wparam(1)); | |
1199 | ||
1200 | &call (&label("pic")); | |
1201 | &set_label("pic"); | |
1202 | &blindpop ($const); | |
1203 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); | |
1204 | ||
1205 | &movdqu ($Hkey,&QWP(0,$Xip)); | |
1206 | &pshufd ($Hkey,$Hkey,0b01001110);# dword swap | |
1207 | ||
1208 | # calculate H^2 | |
1209 | &movdqa ($Xi,$Hkey); | |
1210 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); | |
1211 | &reduction_alg5 ($Xhi,$Xi); | |
1212 | ||
1213 | &movdqu (&QWP(0,$Htbl),$Hkey); # save H | |
1214 | &movdqu (&QWP(16,$Htbl),$Xi); # save H^2 | |
1215 | ||
1216 | &ret (); | |
1217 | &function_end_B("gcm_init_clmul"); | |
1218 | ||
1219 | &function_begin_B("gcm_gmult_clmul"); | |
1220 | &mov ($Xip,&wparam(0)); | |
1221 | &mov ($Htbl,&wparam(1)); | |
1222 | ||
1223 | &call (&label("pic")); | |
1224 | &set_label("pic"); | |
1225 | &blindpop ($const); | |
1226 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); | |
1227 | ||
1228 | &movdqu ($Xi,&QWP(0,$Xip)); | |
1229 | &movdqa ($Xn,&QWP(0,$const)); | |
1230 | &movdqu ($Hkey,&QWP(0,$Htbl)); | |
1231 | &pshufb ($Xi,$Xn); | |
1232 | ||
1233 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); | |
1234 | &reduction_alg5 ($Xhi,$Xi); | |
1235 | ||
1236 | &pshufb ($Xi,$Xn); | |
1237 | &movdqu (&QWP(0,$Xip),$Xi); | |
1238 | ||
1239 | &ret (); | |
1240 | &function_end_B("gcm_gmult_clmul"); | |
1241 | ||
1242 | &function_begin("gcm_ghash_clmul"); | |
1243 | &mov ($Xip,&wparam(0)); | |
1244 | &mov ($Htbl,&wparam(1)); | |
1245 | &mov ($inp,&wparam(2)); | |
1246 | &mov ($len,&wparam(3)); | |
1247 | ||
1248 | &call (&label("pic")); | |
1249 | &set_label("pic"); | |
1250 | &blindpop ($const); | |
1251 | &lea ($const,&DWP(&label("bswap")."-".&label("pic"),$const)); | |
1252 | ||
1253 | &movdqu ($Xi,&QWP(0,$Xip)); | |
1254 | &movdqa ($T3,&QWP(0,$const)); | |
1255 | &movdqu ($Hkey,&QWP(0,$Htbl)); | |
1256 | &pshufb ($Xi,$T3); | |
1257 | ||
1258 | &sub ($len,0x10); | |
1259 | &jz (&label("odd_tail")); | |
1260 | ||
1261 | ####### | |
1262 | # Xi+2 =[H*(Ii+1 + Xi+1)] mod P = | |
1263 | # [(H*Ii+1) + (H*Xi+1)] mod P = | |
1264 | # [(H*Ii+1) + H^2*(Ii+Xi)] mod P | |
1265 | # | |
1266 | &movdqu ($T1,&QWP(0,$inp)); # Ii | |
1267 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 | |
1268 | &pshufb ($T1,$T3); | |
1269 | &pshufb ($Xn,$T3); | |
1270 | &pxor ($Xi,$T1); # Ii+Xi | |
1271 | ||
1272 | &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 | |
1273 | &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 | |
1274 | ||
1275 | &sub ($len,0x20); | |
1276 | &lea ($inp,&DWP(32,$inp)); # i+=2 | |
1277 | &jbe (&label("even_tail")); | |
1278 | ||
1279 | &set_label("mod_loop"); | |
1280 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) | |
1281 | &movdqu ($Hkey,&QWP(0,$Htbl)); # load H | |
1282 | ||
1283 | &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) | |
1284 | &pxor ($Xhi,$Xhn); | |
1285 | ||
1286 | &reduction_alg5 ($Xhi,$Xi); | |
1287 | ||
1288 | ####### | |
1289 | &movdqa ($T3,&QWP(0,$const)); | |
1290 | &movdqu ($T1,&QWP(0,$inp)); # Ii | |
1291 | &movdqu ($Xn,&QWP(16,$inp)); # Ii+1 | |
1292 | &pshufb ($T1,$T3); | |
1293 | &pshufb ($Xn,$T3); | |
1294 | &pxor ($Xi,$T1); # Ii+Xi | |
1295 | ||
1296 | &clmul64x64_T3 ($Xhn,$Xn,$Hkey); # H*Ii+1 | |
1297 | &movdqu ($Hkey,&QWP(16,$Htbl)); # load H^2 | |
1298 | ||
1299 | &sub ($len,0x20); | |
1300 | &lea ($inp,&DWP(32,$inp)); | |
1301 | &ja (&label("mod_loop")); | |
1302 | ||
1303 | &set_label("even_tail"); | |
1304 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H^2*(Ii+Xi) | |
1305 | ||
1306 | &pxor ($Xi,$Xn); # (H*Ii+1) + H^2*(Ii+Xi) | |
1307 | &pxor ($Xhi,$Xhn); | |
1308 | ||
1309 | &reduction_alg5 ($Xhi,$Xi); | |
1310 | ||
1311 | &movdqa ($T3,&QWP(0,$const)); | |
1312 | &test ($len,$len); | |
1313 | &jnz (&label("done")); | |
1314 | ||
1315 | &movdqu ($Hkey,&QWP(0,$Htbl)); # load H | |
1316 | &set_label("odd_tail"); | |
1317 | &movdqu ($T1,&QWP(0,$inp)); # Ii | |
1318 | &pshufb ($T1,$T3); | |
1319 | &pxor ($Xi,$T1); # Ii+Xi | |
1320 | ||
1321 | &clmul64x64_T3 ($Xhi,$Xi,$Hkey); # H*(Ii+Xi) | |
1322 | &reduction_alg5 ($Xhi,$Xi); | |
1323 | ||
1324 | &movdqa ($T3,&QWP(0,$const)); | |
1325 | &set_label("done"); | |
1326 | &pshufb ($Xi,$T3); | |
1327 | &movdqu (&QWP(0,$Xip),$Xi); | |
1328 | &function_end("gcm_ghash_clmul"); | |
1329 | ||
1330 | } | |
1331 | \f | |
1332 | &set_label("bswap",64); | |
1333 | &data_byte(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0); | |
1334 | &data_byte(1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0xc2); # 0x1c2_polynomial | |
8525950e AP |
1335 | &set_label("rem_8bit",64); |
1336 | &data_short(0x0000,0x01C2,0x0384,0x0246,0x0708,0x06CA,0x048C,0x054E); | |
1337 | &data_short(0x0E10,0x0FD2,0x0D94,0x0C56,0x0918,0x08DA,0x0A9C,0x0B5E); | |
1338 | &data_short(0x1C20,0x1DE2,0x1FA4,0x1E66,0x1B28,0x1AEA,0x18AC,0x196E); | |
1339 | &data_short(0x1230,0x13F2,0x11B4,0x1076,0x1538,0x14FA,0x16BC,0x177E); | |
1340 | &data_short(0x3840,0x3982,0x3BC4,0x3A06,0x3F48,0x3E8A,0x3CCC,0x3D0E); | |
1341 | &data_short(0x3650,0x3792,0x35D4,0x3416,0x3158,0x309A,0x32DC,0x331E); | |
1342 | &data_short(0x2460,0x25A2,0x27E4,0x2626,0x2368,0x22AA,0x20EC,0x212E); | |
1343 | &data_short(0x2A70,0x2BB2,0x29F4,0x2836,0x2D78,0x2CBA,0x2EFC,0x2F3E); | |
1344 | &data_short(0x7080,0x7142,0x7304,0x72C6,0x7788,0x764A,0x740C,0x75CE); | |
1345 | &data_short(0x7E90,0x7F52,0x7D14,0x7CD6,0x7998,0x785A,0x7A1C,0x7BDE); | |
1346 | &data_short(0x6CA0,0x6D62,0x6F24,0x6EE6,0x6BA8,0x6A6A,0x682C,0x69EE); | |
1347 | &data_short(0x62B0,0x6372,0x6134,0x60F6,0x65B8,0x647A,0x663C,0x67FE); | |
1348 | &data_short(0x48C0,0x4902,0x4B44,0x4A86,0x4FC8,0x4E0A,0x4C4C,0x4D8E); | |
1349 | &data_short(0x46D0,0x4712,0x4554,0x4496,0x41D8,0x401A,0x425C,0x439E); | |
1350 | &data_short(0x54E0,0x5522,0x5764,0x56A6,0x53E8,0x522A,0x506C,0x51AE); | |
1351 | &data_short(0x5AF0,0x5B32,0x5974,0x58B6,0x5DF8,0x5C3A,0x5E7C,0x5FBE); | |
1352 | &data_short(0xE100,0xE0C2,0xE284,0xE346,0xE608,0xE7CA,0xE58C,0xE44E); | |
1353 | &data_short(0xEF10,0xEED2,0xEC94,0xED56,0xE818,0xE9DA,0xEB9C,0xEA5E); | |
1354 | &data_short(0xFD20,0xFCE2,0xFEA4,0xFF66,0xFA28,0xFBEA,0xF9AC,0xF86E); | |
1355 | &data_short(0xF330,0xF2F2,0xF0B4,0xF176,0xF438,0xF5FA,0xF7BC,0xF67E); | |
1356 | &data_short(0xD940,0xD882,0xDAC4,0xDB06,0xDE48,0xDF8A,0xDDCC,0xDC0E); | |
1357 | &data_short(0xD750,0xD692,0xD4D4,0xD516,0xD058,0xD19A,0xD3DC,0xD21E); | |
1358 | &data_short(0xC560,0xC4A2,0xC6E4,0xC726,0xC268,0xC3AA,0xC1EC,0xC02E); | |
1359 | &data_short(0xCB70,0xCAB2,0xC8F4,0xC936,0xCC78,0xCDBA,0xCFFC,0xCE3E); | |
1360 | &data_short(0x9180,0x9042,0x9204,0x93C6,0x9688,0x974A,0x950C,0x94CE); | |
1361 | &data_short(0x9F90,0x9E52,0x9C14,0x9DD6,0x9898,0x995A,0x9B1C,0x9ADE); | |
1362 | &data_short(0x8DA0,0x8C62,0x8E24,0x8FE6,0x8AA8,0x8B6A,0x892C,0x88EE); | |
1363 | &data_short(0x83B0,0x8272,0x8034,0x81F6,0x84B8,0x857A,0x873C,0x86FE); | |
1364 | &data_short(0xA9C0,0xA802,0xAA44,0xAB86,0xAEC8,0xAF0A,0xAD4C,0xAC8E); | |
1365 | &data_short(0xA7D0,0xA612,0xA454,0xA596,0xA0D8,0xA11A,0xA35C,0xA29E); | |
1366 | &data_short(0xB5E0,0xB422,0xB664,0xB7A6,0xB2E8,0xB32A,0xB16C,0xB0AE); | |
1367 | &data_short(0xBBF0,0xBA32,0xB874,0xB9B6,0xBCF8,0xBD3A,0xBF7C,0xBEBE); | |
5c88dcca AP |
1368 | }} # $sse2 |
1369 | ||
1370 | &set_label("rem_4bit",64); | |
1371 | &data_word(0,0x0000<<$S,0,0x1C20<<$S,0,0x3840<<$S,0,0x2460<<$S); | |
1372 | &data_word(0,0x7080<<$S,0,0x6CA0<<$S,0,0x48C0<<$S,0,0x54E0<<$S); | |
1373 | &data_word(0,0xE100<<$S,0,0xFD20<<$S,0,0xD940<<$S,0,0xC560<<$S); | |
1374 | &data_word(0,0x9180<<$S,0,0x8DA0<<$S,0,0xA9C0<<$S,0,0xB5E0<<$S); | |
c1f092d1 AP |
1375 | }}} # !$x86only |
1376 | ||
e3a510f8 AP |
1377 | &asciz("GHASH for x86, CRYPTOGAMS by <appro\@openssl.org>"); |
1378 | &asm_finish(); | |
07e29c12 | 1379 | |
a21314db | 1380 | close STDOUT or die "error closing STDOUT: $!"; |
4f0d5f18 | 1381 | |
07e29c12 AP |
1382 | # A question was raised about choice of vanilla MMX. Or rather why wasn't | |
1383 | # SSE2 chosen instead? In addition to the fact that MMX runs on legacy | |
1384 | # CPUs such as PIII, "4-bit" MMX version was observed to provide better | |
1385 | # performance than *corresponding* SSE2 one even on contemporary CPUs. | |
1386 | # SSE2 results were provided by Peter-Michael Hager. He maintains SSE2 | |
1387 | # implementation featuring full range of lookup-table sizes, but with | |
1388 | # per-invocation lookup table setup. Latter means that table size is | |
1389 | # chosen depending on how much data is to be hashed in every given call, | |
1390 | # more data - larger table. Best reported result for Core2 is ~4 cycles | |
04e2b793 AP |
1391 | # per processed byte out of 64KB block. This number accounts even for |
1392 | # 64KB table setup overhead. As discussed in gcm128.c we choose to be | |
1393 | # more conservative with respect to lookup table sizes, but how do the | |
1394 | # results compare? Minimalistic "256B" MMX version delivers ~11 cycles | |
1395 | # on same platform. As also discussed in gcm128.c, next in line "8-bit | |
1396 | # Shoup's" or "4KB" method should deliver twice the performance of | |
1397 | # "256B" one, in other words not worse than ~6 cycles per byte. It | |
1398 | # should also be noted that in SSE2 case improvement can be | |
1399 | # "super-linear," i.e. more than twice, mostly because >>8 maps to single | |
1400 | # instruction on SSE2 register. This is unlike "4-bit" case when >>4 | |
1401 | # maps to same number of instructions in both MMX and SSE2 cases. | |
8525950e AP |
1402 | # Bottom line is that switch to SSE2 is considered to be justifiable |
1403 | # only in case we choose to implement "8-bit" method... |
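
The `rem_8bit` and `rem_4bit` tables emitted earlier in this file look like carry-less (XOR) multiples of a single constant: entry `i` of `rem_8bit` appears to be `i` multiplied by `0x01C2` without carries, and entry `i` of `rem_4bit` (before the `<<$S` shift) the same with `0x1C20`. The sketch below regenerates both under that assumption. It is a standalone illustration, not part of the module, and `clmul_small` is a hypothetical helper name; only a handful of entries were spot-checked against the tables above.

```perl
#!/usr/bin/env perl
# Illustrative sketch only, not part of ghash-x86.pl.
# Assumption: each table entry is the carry-less product of its index
# and a fixed constant (0x01C2 for rem_8bit, 0x1C20 for rem_4bit).
use strict;
use warnings;

# Carry-less multiply of a small index by a 16-bit constant:
# XOR together ($const << $bit) for every set bit of $idx.
# "clmul_small" is a hypothetical helper name.
sub clmul_small {
    my ($idx, $const) = @_;
    my $acc = 0;
    for (my $bit = 0; $idx >> $bit; $bit++) {
        $acc ^= $const << $bit if ($idx >> $bit) & 1;
    }
    return $acc;
}

our @rem_8bit = map { clmul_small($_, 0x01C2) } 0 .. 255;
our @rem_4bit = map { clmul_small($_, 0x1C20) } 0 .. 15;   # value before <<$S

# Print rem_8bit eight entries per row, mirroring the data_short layout.
for my $row (0 .. 31) {
    print join(",", map { sprintf "0x%04X", $rem_8bit[$row * 8 + $_] } 0 .. 7), "\n";
}
```

Comparing the printed rows with the `data_short` lines is a quick way to catch a mistyped table entry, since every entry then follows from the one constant.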