# Copyright 2009-2020 The OpenSSL Project Authors. All Rights Reserved.
#
# Licensed under the Apache License 2.0 (the "License"). You may not use
# this file except in compliance with the License. You can obtain a copy
# in the file LICENSE in the source distribution or at
# https://www.openssl.org/source/license.html
#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# This module implements support for the Intel AES-NI extension. In
# the OpenSSL context it is used with the Intel engine, but it can also
# be used as a drop-in replacement for crypto/aes/asm/aes-x86_64.pl
# [see below for details].
#
# Given the latency of the aes(enc|dec) instructions, asymptotic
# performance for non-parallelizable modes such as CBC encrypt is 3.75
# cycles per byte processed with a 128-bit key. And given their
# throughput, asymptotic performance for parallelizable modes is 1.25
# cycles per byte. Being an asymptotic limit, it is not something you
# commonly achieve in reality, but how close does one get? Below are
# results collected for different modes and block sizes. Pairs of
# numbers are for en-/decryption.
#
#        16-byte     64-byte     256-byte    1-KB        8-KB
# ECB    4.25/4.25   1.38/1.38   1.28/1.28   1.26/1.26   1.26/1.26
# CTR    5.42/5.42   1.92/1.92   1.44/1.44   1.28/1.28   1.26/1.26
# CBC    4.38/4.43   4.15/1.43   4.07/1.32   4.07/1.29   4.06/1.28
# CCM    5.66/9.42   4.42/5.41   4.16/4.40   4.09/4.15   4.06/4.07
# OFB    5.42/5.42   4.64/4.64   4.44/4.44   4.39/4.39   4.38/4.38
# CFB    5.73/5.85   5.56/5.62   5.48/5.56   5.47/5.55   5.47/5.55
#
# ECB, CTR, CBC and CCM results are free from EVP overhead. This means
# that the otherwise used 'openssl speed -evp aes-128-??? -engine aesni
# [-decrypt]' will exhibit 10-15% worse results for smaller blocks.
# The results were collected with a specially crafted speed.c benchmark
# in order to compare them with results reported in the "Intel Advanced
# Encryption Standard (AES) New Instruction Set" White Paper Revision
# 3.0 dated May 2010. All above results are consistently better. This
# module also provides better performance for block sizes smaller than
# 128 bytes in points *not* represented in the above table.
#
# Looking at the results for 8-KB buffer.
#
# CFB and OFB results are far from the limit, because the implementation
# uses "generic" CRYPTO_[c|o]fb128_encrypt interfaces relying on
# single-block aesni_encrypt, which is not the optimal way to go.
# CBC encrypt result is unexpectedly high and there is no documented
# explanation for it. Seemingly there is a small penalty for feeding
# the result back to the AES unit the way it's done in CBC mode. There
# is nothing one can do and the result appears optimal. CCM result is
# identical to CBC, because CBC-MAC is essentially CBC encrypt without
# saving output. CCM CTR "stays invisible," because it's neatly
# interleaved with CBC-MAC. This provides ~30% improvement over a
# "straightforward" CCM implementation with CTR and CBC-MAC performed
# disjointly. Parallelizable modes practically achieve the theoretical
# limit.
#
# Looking at how results vary with buffer size.
#
# Curves are practically saturated at 1-KB buffer size. In most cases
# "256-byte" performance is >95%, and "64-byte" is ~90% of the "8-KB"
# one. The CTR curve doesn't follow this pattern and is the "slowest"
# changing one, with the "256-byte" result being 87% of "8-KB." This is
# because overhead in CTR mode is the most computationally intensive.
# Small-block CCM decrypt is slower than encrypt, because first CTR and
# last CBC-MAC iterations can't be interleaved.
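#
# As a rough cross-check of those percentages, computed here from the
# table above (reader's arithmetic on the quoted numbers, not a new
# measurement), dividing the "8-KB" result by the smaller-buffer one:
#
#	ECB	64-byte: 1.26/1.38 = ~0.91	256-byte: 1.26/1.28 = ~0.98
#	CTR	64-byte: 1.26/1.92 = ~0.66	256-byte: 1.26/1.44 = 0.875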
#
# Results for 192- and 256-bit keys.
#
# EVP-free results were observed to scale perfectly with the number of
# rounds for larger block sizes, i.e. the 192-bit result being 10/12
# times lower and the 256-bit one 10/14. Well, in the CBC encrypt case
# differences are a tad smaller, because the above mentioned penalty
# biases all results by the same constant value. In a similar way
# function call overhead affects small-block performance, as well as
# OFB and CFB results. Differences are not large, most common
# coefficients are 10/11.7 and 10/13.4 (as opposed to 10/12.0 and
# 10/14.0), but one can observe even 10/11.2 and 10/12.4 (CTR, OFB,
# CFB)...
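#
# To make the ideal scaling concrete (expected values derived from the
# ratios above, not measured): AES-192 uses 12 rounds and AES-256 uses
# 14, so an 8-KB ECB result of 1.26 cycles per byte with 128-bit key
# would ideally become
#
#	192-bit key:	1.26 * 12/10 = ~1.51 cycles per byte
#	256-bit key:	1.26 * 14/10 = ~1.76 cycles per byte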
#
# While the Westmere processor features 6 cycles latency for
# aes[enc|dec] instructions, which can be scheduled every second cycle,
# Sandy Bridge spends 8 cycles per instruction, but it can schedule them
# every cycle. This means that code targeting Westmere would perform
# suboptimally on Sandy Bridge. Therefore this update.
#
# In addition, non-parallelizable CBC encrypt (as well as CCM) is
# optimized. Relative improvement might appear modest, 8% on Westmere,
# but in absolute terms it's 3.77 cycles per byte encrypted with
# 128-bit key on Westmere, and 5.07 on Sandy Bridge. These numbers
# should be compared to asymptotic limits of 3.75 for Westmere and
# 5.00 for Sandy Bridge. Actually, the fact that they get this close
# to asymptotic limits is quite amazing. Indeed, the limit is
# calculated as latency times number of rounds, 10 for 128-bit key,
# and divided by 16, the number of bytes in a block, or in other words
# it accounts *solely* for aesenc instructions. But there are extra
# instructions, and numbers so close to the asymptotic limits mean
# that it's as if it takes as little as *one* additional cycle to
# execute all of them. How is it possible? It is possible thanks to
# out-of-order execution logic, which manages to overlap post-processing
# of the previous block, things like saving the output, with actual
# encryption of the current block, as well as pre-processing of the
# current block, things like fetching input and xor-ing it with the
# 0-round element of the key schedule, with actual encryption of the
# previous block. Keep this in mind...
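#
# Put in per-block terms (simple arithmetic on the figures quoted above,
# 16 bytes per block, 10 rounds):
#
#	Westmere:	3.77*16 = 60.32 cycles vs. 6*10 = 60 for aesenc alone
#	Sandy Bridge:	5.07*16 = 81.12 cycles vs. 8*10 = 80 for aesenc alone
#
# i.e. roughly 0.3 and 1.1 cycles per block are left for everything else.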
#
# For parallelizable modes, such as ECB, CBC decrypt, CTR, higher
# performance is achieved by interleaving instructions working on
# independent blocks. In which case the asymptotic limit for such modes
# can be obtained by dividing the above mentioned numbers by the AES
# instructions' interleave factor. Westmere can execute at most 3
# instructions at a time, meaning that the optimal interleave factor is
# 3, and that's where the "magic" number of 1.25 comes from. "Optimal
# interleave factor" means that an increase of the interleave factor
# does not improve performance. The formula has proven to reflect
# reality pretty well on Westmere... Sandy Bridge on the other hand can
# execute up to 8 AES instructions at a time, so how does varying the
# interleave factor affect the performance? Here is a table for ECB
# (numbers are cycles per byte processed with 128-bit key):
#
# instruction interleave factor          3x      6x      8x
# theoretical asymptotic limit           1.67    0.83    0.625
# measured performance for 8KB block     1.05    0.86    0.84
#
# "as if" interleave factor              4.7x    5.8x    6.0x
#
# Further data for other parallelizable modes:
#
# CBC decrypt                            1.16    0.93    0.74
#
# Well, given the 3x column it's probably inappropriate to call the
# limit asymptotic, if it can be surpassed, isn't it? What happens
# there? Rewind to the CBC paragraph for the answer. Yes, out-of-order
# execution magic is responsible for this. The processor overlaps not
# only the additional instructions with AES ones, but even AES
# instructions processing adjacent triplets of independent blocks. In
# the 6x case additional instructions still claim a disproportionally
# small amount of additional cycles, but in the 8x case the number of
# instructions must be a tad too high for out-of-order logic to cope
# with, and the AES unit remains underutilized... As you can see 8x
# interleave is hardly justifiable, so there is no need to feel bad
# that 32-bit aesni-x86.pl utilizes 6x interleave because of limited
# register bank capacity.
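#
# The limits and "as if" factors quoted above follow from simple
# arithmetic. The two helpers below are only an illustrative sketch:
# their names are hypothetical, they are not part of the original
# module, and nothing in this script calls them.
sub sketch_asymptotic_cpb {
    my ($latency,$rounds,$interleave)=@_;
    $interleave=1 if (!defined($interleave));	# non-parallelizable mode
    # latency times number of rounds, divided by 16 bytes in a block,
    # divided by the interleave factor
    return $latency*$rounds/16/$interleave;
}
sub sketch_as_if_interleave {
    my ($latency,$rounds,$measured_cpb)=@_;
    # interleave factor that would explain the measured result
    return sketch_asymptotic_cpb($latency,$rounds)/$measured_cpb;
}
# sketch_asymptotic_cpb(6,10,3) is 1.25 (Westmere, 3x interleave),
# sketch_asymptotic_cpb(8,10,8) is 0.625 (Sandy Bridge, 8x interleave),
# sketch_as_if_interleave(8,10,0.86) is about 5.8 (Sandy Bridge, 6x row).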
#
# Higher interleave factors do have a negative impact on Westmere
# performance. While for ECB mode it's a negligible ~1.5%, other
# parallelizables perform ~5% worse, which is outweighed by a ~25%
# improvement on Sandy Bridge. To balance the regression on Westmere,
# CTR mode was implemented with 6x aesenc interleave factor.
#
# Add aesni_xts_[en|de]crypt. Westmere spends 1.25 cycles processing
# one byte out of 8KB with 128-bit key, Sandy Bridge - 0.90. Just like
# in CTR mode AES instruction interleave factor was chosen to be 6x.
#
# Add aesni_ocb_[en|de]crypt. AES instruction interleave factor was
######################################################################
# Current large-block performance in cycles per byte processed with
# 128-bit key (less is better).
#
#              CBC en-/decrypt  CTR    XTS    ECB      OCB
# Westmere     3.77/1.25        1.25   1.25   1.26
# * Bridge     5.07/0.74        0.75   0.90   0.85     0.98
# Haswell      4.44/0.63        0.63   0.73   0.63     0.70
# Skylake      2.62/0.63        0.63   0.63   0.63
# Silvermont   5.75/3.54        3.56   4.12   3.87(*)  4.11
# Knights L    2.54/0.77        0.78   0.85   -        1.50
# Goldmont     3.82/1.26        1.26   1.29   1.29     1.50
# Bulldozer    5.77/0.70        0.72   0.90   0.70     0.95
# Ryzen        2.71/0.35        0.35   0.44   0.38     0.49
#
# (*)	Atom Silvermont ECB result is suboptimal because of penalties
#	incurred by operations on %xmm8-15. As ECB is not considered
#	critical, nothing was done to mitigate the problem.

$PREFIX="aesni";	# if $PREFIX is set to "AES", the script
			# generates drop-in replacement for
			# crypto/aes/asm/aes-x86_64.pl:-)

# $output is the last argument if it looks like a file (it has an extension)
# $flavour is the first argument if it doesn't look like a file
$output = $#ARGV >= 0 && $ARGV[$#ARGV] =~ m|\.\w+$| ? pop : undef;
$flavour = $#ARGV >= 0 && $ARGV[0] !~ m|\.| ? shift : undef;
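# For illustration only, these hypothetical invocations (the file and
# flavour names are assumptions, not taken from the build system) would
# be split by the two statements above as follows:
#
#	perl aesni-x86_64.pl elf aesni-x86_64.s
#		$flavour is "elf", $output is "aesni-x86_64.s"
#	perl aesni-x86_64.pl aesni-x86_64.s
#		$flavour is undef, $output is "aesni-x86_64.s"
#	perl aesni-x86_64.pl nasm
#		$flavour is "nasm", $output is undef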

$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
die "can't locate x86_64-xlate.pl";

open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\""
    or die "can't call $xlate: $!";

$movkey = $PREFIX eq "aesni" ? "movups" : "movups";
@_4args=$win64?	("%rcx","%rdx","%r8", "%r9") :	# Win64 order
		("%rdi","%rsi","%rdx","%rcx");	# Unix order
$code.=".extern	OPENSSL_ia32cap_P\n";

$rounds="%eax";	# input to and changed by aesni_[en|de]cryptN !!!
# this is natural Unix argument order for public $PREFIX_[ecb|cbc]_encrypt ...
$key="%rcx";	# input to and changed by aesni_[en|de]cryptN !!!
$ivp="%r8";	# cbc, ctr, ...

$rnds_="%r10d";	# backup copy for $rounds
$key_="%r11";	# backup copy for $key

# %xmm register layout
$rndkey0="%xmm0";	$rndkey1="%xmm1";
$inout0="%xmm2";	$inout1="%xmm3";
$inout2="%xmm4";	$inout3="%xmm5";
$inout4="%xmm6";	$inout5="%xmm7";
$inout6="%xmm8";	$inout7="%xmm9";

$in2="%xmm6";		$in1="%xmm7";	# used in CBC decrypt, CTR, ...
$in0="%xmm8";		$iv="%xmm9";

# Inline version of internal aesni_[en|de]crypt1.
#
# Why folded loop? Because aes[enc|dec] is slow enough to accommodate
# cycles which take care of loop variables...
sub aesni_generate1 {
my ($p,$key,$rounds,$inout,$ivec)=@_;	$inout=$inout0 if (!defined($inout));
	$movkey	($key),$rndkey0
	$movkey	16($key),$rndkey1
$code.=<<___ if (defined($ivec));
$code.=<<___ if (!defined($ivec));
	xorps	$rndkey0,$inout
	aes${p}	$rndkey1,$inout
	$movkey	($key),$rndkey1
	jnz	.Loop_${p}1_$sn	# loop body is 16 bytes
	aes${p}last	$rndkey1,$inout
# void $PREFIX_[en|de]crypt (const void *inp,void *out,const AES_KEY *key);
{ my ($inp,$out,$key) = @_4args;
.globl	${PREFIX}_encrypt
.type	${PREFIX}_encrypt,\@abi-omnipotent
	movups	($inp),$inout0		# load input
	mov	240($key),$rounds	# key->rounds
	&aesni_generate1("enc",$key,$rounds);
	pxor	$rndkey0,$rndkey0	# clear register bank
	pxor	$rndkey1,$rndkey1
	movups	$inout0,($out)		# output
.size	${PREFIX}_encrypt,.-${PREFIX}_encrypt

.globl	${PREFIX}_decrypt
.type	${PREFIX}_decrypt,\@abi-omnipotent
	movups	($inp),$inout0		# load input
	mov	240($key),$rounds	# key->rounds
	&aesni_generate1("dec",$key,$rounds);
	pxor	$rndkey0,$rndkey0	# clear register bank
	pxor	$rndkey1,$rndkey1
	movups	$inout0,($out)		# output
.size	${PREFIX}_decrypt, .-${PREFIX}_decrypt

# _aesni_[en|de]cryptN are private interfaces, N denotes interleave
# factor. Why were 3x subroutines originally used in loops? Even though
# aes[enc|dec] latency was originally 6, it could be scheduled only
# every *2nd* cycle. Thus 3x interleave was the one providing optimal
# utilization, i.e. when the subroutine's throughput is virtually the
# same as that of the non-interleaved subroutine [for number of input
# blocks up to 3]. This is why it originally made no sense to implement
# a 2x subroutine. But times change and it became appropriate to spend
# an extra 192 bytes on a 2x subroutine on Atom Silvermont account. For
# processors that can schedule aes[enc|dec] every cycle the optimal
# interleave factor equals the corresponding instruction's latency. 8x
# is optimal for * Bridge and "super-optimal" for other Intel CPUs...

sub aesni_generate2 {
# As already mentioned it takes in $key and $rounds, which are *not*
# preserved. $inout[0-1] is cipher/clear text...
.type	_aesni_${dir}rypt2,\@abi-omnipotent
	$movkey	($key),$rndkey0
	$movkey	16($key),$rndkey1
	xorps	$rndkey0,$inout0
	xorps	$rndkey0,$inout1
	$movkey	32($key),$rndkey0
	lea	32($key,$rounds),$key
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	$movkey	($key,%rax),$rndkey1
	aes${dir}	$rndkey0,$inout0
	aes${dir}	$rndkey0,$inout1
	$movkey	-16($key,%rax),$rndkey0
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}last	$rndkey0,$inout0
	aes${dir}last	$rndkey0,$inout1
.size	_aesni_${dir}rypt2,.-_aesni_${dir}rypt2
sub aesni_generate3 {
# As already mentioned it takes in $key and $rounds, which are *not*
# preserved. $inout[0-2] is cipher/clear text...
.type	_aesni_${dir}rypt3,\@abi-omnipotent
	$movkey	($key),$rndkey0
	$movkey	16($key),$rndkey1
	xorps	$rndkey0,$inout0
	xorps	$rndkey0,$inout1
	xorps	$rndkey0,$inout2
	$movkey	32($key),$rndkey0
	lea	32($key,$rounds),$key
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
	$movkey	($key,%rax),$rndkey1
	aes${dir}	$rndkey0,$inout0
	aes${dir}	$rndkey0,$inout1
	aes${dir}	$rndkey0,$inout2
	$movkey	-16($key,%rax),$rndkey0
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
	aes${dir}last	$rndkey0,$inout0
	aes${dir}last	$rndkey0,$inout1
	aes${dir}last	$rndkey0,$inout2
.size	_aesni_${dir}rypt3,.-_aesni_${dir}rypt3
# 4x interleave is implemented to improve small block performance,
# most notably [and naturally] for 4 blocks, by ~30%. One can argue
# that one should have implemented 5x as well, but improvement would
# be <20%, so it's not worth it...
sub aesni_generate4 {
# As already mentioned it takes in $key and $rounds, which are *not*
# preserved. $inout[0-3] is cipher/clear text...
.type	_aesni_${dir}rypt4,\@abi-omnipotent
	$movkey	($key),$rndkey0
	$movkey	16($key),$rndkey1
	xorps	$rndkey0,$inout0
	xorps	$rndkey0,$inout1
	xorps	$rndkey0,$inout2
	xorps	$rndkey0,$inout3
	$movkey	32($key),$rndkey0
	lea	32($key,$rounds),$key
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
	aes${dir}	$rndkey1,$inout3
	$movkey	($key,%rax),$rndkey1
	aes${dir}	$rndkey0,$inout0
	aes${dir}	$rndkey0,$inout1
	aes${dir}	$rndkey0,$inout2
	aes${dir}	$rndkey0,$inout3
	$movkey	-16($key,%rax),$rndkey0
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
	aes${dir}	$rndkey1,$inout3
	aes${dir}last	$rndkey0,$inout0
	aes${dir}last	$rndkey0,$inout1
	aes${dir}last	$rndkey0,$inout2
	aes${dir}last	$rndkey0,$inout3
.size	_aesni_${dir}rypt4,.-_aesni_${dir}rypt4
sub aesni_generate6 {
# As already mentioned it takes in $key and $rounds, which are *not*
# preserved. $inout[0-5] is cipher/clear text...
.type	_aesni_${dir}rypt6,\@abi-omnipotent
	$movkey	($key),$rndkey0
	$movkey	16($key),$rndkey1
	xorps	$rndkey0,$inout0
	pxor	$rndkey0,$inout1
	pxor	$rndkey0,$inout2
	aes${dir}	$rndkey1,$inout0
	lea	32($key,$rounds),$key
	aes${dir}	$rndkey1,$inout1
	pxor	$rndkey0,$inout3
	pxor	$rndkey0,$inout4
	aes${dir}	$rndkey1,$inout2
	pxor	$rndkey0,$inout5
	$movkey	($key,%rax),$rndkey0
	jmp	.L${dir}_loop6_enter
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
.L${dir}_loop6_enter:
	aes${dir}	$rndkey1,$inout3
	aes${dir}	$rndkey1,$inout4
	aes${dir}	$rndkey1,$inout5
	$movkey	($key,%rax),$rndkey1
	aes${dir}	$rndkey0,$inout0
	aes${dir}	$rndkey0,$inout1
	aes${dir}	$rndkey0,$inout2
	aes${dir}	$rndkey0,$inout3
	aes${dir}	$rndkey0,$inout4
	aes${dir}	$rndkey0,$inout5
	$movkey	-16($key,%rax),$rndkey0
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
	aes${dir}	$rndkey1,$inout3
	aes${dir}	$rndkey1,$inout4
	aes${dir}	$rndkey1,$inout5
	aes${dir}last	$rndkey0,$inout0
	aes${dir}last	$rndkey0,$inout1
	aes${dir}last	$rndkey0,$inout2
	aes${dir}last	$rndkey0,$inout3
	aes${dir}last	$rndkey0,$inout4
	aes${dir}last	$rndkey0,$inout5
.size	_aesni_${dir}rypt6,.-_aesni_${dir}rypt6
sub aesni_generate8 {
# As already mentioned it takes in $key and $rounds, which are *not*
# preserved. $inout[0-7] is cipher/clear text...
.type	_aesni_${dir}rypt8,\@abi-omnipotent
	$movkey	($key),$rndkey0
	$movkey	16($key),$rndkey1
	xorps	$rndkey0,$inout0
	xorps	$rndkey0,$inout1
	pxor	$rndkey0,$inout2
	pxor	$rndkey0,$inout3
	pxor	$rndkey0,$inout4
	lea	32($key,$rounds),$key
	aes${dir}	$rndkey1,$inout0
	pxor	$rndkey0,$inout5
	pxor	$rndkey0,$inout6
	aes${dir}	$rndkey1,$inout1
	pxor	$rndkey0,$inout7
	$movkey	($key,%rax),$rndkey0
	jmp	.L${dir}_loop8_inner
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
.L${dir}_loop8_inner:
	aes${dir}	$rndkey1,$inout2
	aes${dir}	$rndkey1,$inout3
	aes${dir}	$rndkey1,$inout4
	aes${dir}	$rndkey1,$inout5
	aes${dir}	$rndkey1,$inout6
	aes${dir}	$rndkey1,$inout7
.L${dir}_loop8_enter:
	$movkey	($key,%rax),$rndkey1
	aes${dir}	$rndkey0,$inout0
	aes${dir}	$rndkey0,$inout1
	aes${dir}	$rndkey0,$inout2
	aes${dir}	$rndkey0,$inout3
	aes${dir}	$rndkey0,$inout4
	aes${dir}	$rndkey0,$inout5
	aes${dir}	$rndkey0,$inout6
	aes${dir}	$rndkey0,$inout7
	$movkey	-16($key,%rax),$rndkey0
	aes${dir}	$rndkey1,$inout0
	aes${dir}	$rndkey1,$inout1
	aes${dir}	$rndkey1,$inout2
	aes${dir}	$rndkey1,$inout3
	aes${dir}	$rndkey1,$inout4
	aes${dir}	$rndkey1,$inout5
	aes${dir}	$rndkey1,$inout6
	aes${dir}	$rndkey1,$inout7
	aes${dir}last	$rndkey0,$inout0
	aes${dir}last	$rndkey0,$inout1
	aes${dir}last	$rndkey0,$inout2
	aes${dir}last	$rndkey0,$inout3
	aes${dir}last	$rndkey0,$inout4
	aes${dir}last	$rndkey0,$inout5
	aes${dir}last	$rndkey0,$inout6
	aes${dir}last	$rndkey0,$inout7
.size	_aesni_${dir}rypt8,.-_aesni_${dir}rypt8
&aesni_generate2("enc") if ($PREFIX eq "aesni");
&aesni_generate2("dec");
&aesni_generate3("enc") if ($PREFIX eq "aesni");
&aesni_generate3("dec");
&aesni_generate4("enc") if ($PREFIX eq "aesni");
&aesni_generate4("dec");
&aesni_generate6("enc") if ($PREFIX eq "aesni");
&aesni_generate6("dec");
&aesni_generate8("enc") if ($PREFIX eq "aesni");
&aesni_generate8("dec");

if ($PREFIX eq "aesni") {
########################################################################
# void aesni_ecb_encrypt (const void *in, void *out,
#			  size_t length, const AES_KEY *key,
.globl	aesni_ecb_encrypt
.type	aesni_ecb_encrypt,\@function,5
$code.=<<___ if ($win64);
	movaps	%xmm6,(%rsp)		# offload $inout4..7
	movaps	%xmm7,0x10(%rsp)
	movaps	%xmm8,0x20(%rsp)
	movaps	%xmm9,0x30(%rsp)
	and	\$-16,$len		# if ($len<16)
	jz	.Lecb_ret		# return
	mov	240($key),$rounds	# key->rounds
	$movkey	($key),$rndkey0
	mov	$key,$key_		# backup $key
	mov	$rounds,$rnds_		# backup $rounds
	test	%r8d,%r8d		# 5th argument
#--------------------------- ECB ENCRYPT ------------------------------#
	cmp	\$0x80,$len		# if ($len<8*16)
	jb	.Lecb_enc_tail		# short input
	movdqu	($inp),$inout0		# load 8 input blocks
	movdqu	0x10($inp),$inout1
	movdqu	0x20($inp),$inout2
	movdqu	0x30($inp),$inout3
	movdqu	0x40($inp),$inout4
	movdqu	0x50($inp),$inout5
	movdqu	0x60($inp),$inout6
	movdqu	0x70($inp),$inout7
	lea	0x80($inp),$inp		# $inp+=8*16
	sub	\$0x80,$len		# $len-=8*16 (can be zero)
	jmp	.Lecb_enc_loop8_enter
	movups	$inout0,($out)		# store 8 output blocks
	mov	$key_,$key		# restore $key
	movdqu	($inp),$inout0		# load 8 input blocks
	mov	$rnds_,$rounds		# restore $rounds
	movups	$inout1,0x10($out)
	movdqu	0x10($inp),$inout1
	movups	$inout2,0x20($out)
	movdqu	0x20($inp),$inout2
	movups	$inout3,0x30($out)
	movdqu	0x30($inp),$inout3
	movups	$inout4,0x40($out)
	movdqu	0x40($inp),$inout4
	movups	$inout5,0x50($out)
	movdqu	0x50($inp),$inout5
	movups	$inout6,0x60($out)
	movdqu	0x60($inp),$inout6
	movups	$inout7,0x70($out)
	lea	0x80($out),$out		# $out+=8*16
	movdqu	0x70($inp),$inout7
	lea	0x80($inp),$inp		# $inp+=8*16
.Lecb_enc_loop8_enter:
	jnc	.Lecb_enc_loop8		# loop if $len-=8*16 didn't borrow
	movups	$inout0,($out)		# store 8 output blocks
	mov	$key_,$key		# restore $key
	movups	$inout1,0x10($out)
	mov	$rnds_,$rounds		# restore $rounds
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	movups	$inout6,0x60($out)
	movups	$inout7,0x70($out)
	lea	0x80($out),$out		# $out+=8*16
	add	\$0x80,$len		# restore real remaining $len
	jz	.Lecb_ret		# done if ($len==0)
.Lecb_enc_tail:				# $len is less than 8*16
	movups	($inp),$inout0
	movups	0x10($inp),$inout1
	movups	0x20($inp),$inout2
	movups	0x30($inp),$inout3
	movups	0x40($inp),$inout4
	movups	0x50($inp),$inout5
	movdqu	0x60($inp),$inout6
	xorps	$inout7,$inout7
	movups	$inout0,($out)		# store 7 output blocks
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	movups	$inout6,0x60($out)
	&aesni_generate1("enc",$key,$rounds);
	movups	$inout0,($out)		# store one output block
	movups	$inout0,($out)		# store 2 output blocks
	movups	$inout1,0x10($out)
	movups	$inout0,($out)		# store 3 output blocks
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout0,($out)		# store 4 output blocks
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	xorps	$inout5,$inout5
	movups	$inout0,($out)		# store 5 output blocks
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout0,($out)		# store 6 output blocks
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
#--------------------------- ECB DECRYPT ------------------------------#
	cmp	\$0x80,$len		# if ($len<8*16)
	jb	.Lecb_dec_tail		# short input
	movdqu	($inp),$inout0		# load 8 input blocks
	movdqu	0x10($inp),$inout1
	movdqu	0x20($inp),$inout2
	movdqu	0x30($inp),$inout3
	movdqu	0x40($inp),$inout4
	movdqu	0x50($inp),$inout5
	movdqu	0x60($inp),$inout6
	movdqu	0x70($inp),$inout7
	lea	0x80($inp),$inp		# $inp+=8*16
	sub	\$0x80,$len		# $len-=8*16 (can be zero)
	jmp	.Lecb_dec_loop8_enter
	movups	$inout0,($out)		# store 8 output blocks
	mov	$key_,$key		# restore $key
	movdqu	($inp),$inout0		# load 8 input blocks
	mov	$rnds_,$rounds		# restore $rounds
	movups	$inout1,0x10($out)
	movdqu	0x10($inp),$inout1
	movups	$inout2,0x20($out)
	movdqu	0x20($inp),$inout2
	movups	$inout3,0x30($out)
	movdqu	0x30($inp),$inout3
	movups	$inout4,0x40($out)
	movdqu	0x40($inp),$inout4
	movups	$inout5,0x50($out)
	movdqu	0x50($inp),$inout5
	movups	$inout6,0x60($out)
	movdqu	0x60($inp),$inout6
	movups	$inout7,0x70($out)
	lea	0x80($out),$out		# $out+=8*16
	movdqu	0x70($inp),$inout7
	lea	0x80($inp),$inp		# $inp+=8*16
.Lecb_dec_loop8_enter:
	$movkey	($key_),$rndkey0
	jnc	.Lecb_dec_loop8		# loop if $len-=8*16 didn't borrow
	movups	$inout0,($out)		# store 8 output blocks
	pxor	$inout0,$inout0		# clear register bank
	mov	$key_,$key		# restore $key
	movups	$inout1,0x10($out)
	mov	$rnds_,$rounds		# restore $rounds
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	movups	$inout6,0x60($out)
	movups	$inout7,0x70($out)
	lea	0x80($out),$out		# $out+=8*16
	add	\$0x80,$len		# restore real remaining $len
	jz	.Lecb_ret		# done if ($len==0)
	movups	($inp),$inout0
	movups	0x10($inp),$inout1
	movups	0x20($inp),$inout2
	movups	0x30($inp),$inout3
	movups	0x40($inp),$inout4
	movups	0x50($inp),$inout5
	movups	0x60($inp),$inout6
	$movkey	($key),$rndkey0
	xorps	$inout7,$inout7
	movups	$inout0,($out)		# store 7 output blocks
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	movups	$inout6,0x60($out)
	&aesni_generate1("dec",$key,$rounds);
	movups	$inout0,($out)		# store one output block
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout0,($out)		# store 2 output blocks
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout1,0x10($out)
	movups	$inout0,($out)		# store 3 output blocks
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout0,($out)		# store 4 output blocks
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	xorps	$inout5,$inout5
	movups	$inout0,($out)		# store 5 output blocks
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout0,($out)		# store 6 output blocks
	pxor	$inout0,$inout0		# clear register bank
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	xorps	$rndkey0,$rndkey0	# %xmm0
	pxor	$rndkey1,$rndkey1
$code.=<<___ if ($win64);
	movaps	%xmm0,(%rsp)		# clear stack
	movaps	0x10(%rsp),%xmm7
	movaps	%xmm0,0x10(%rsp)
	movaps	0x20(%rsp),%xmm8
	movaps	%xmm0,0x20(%rsp)
	movaps	0x30(%rsp),%xmm9
	movaps	%xmm0,0x30(%rsp)
.size	aesni_ecb_encrypt,.-aesni_ecb_encrypt

######################################################################
# void aesni_ccm64_[en|de]crypt_blocks (const void *in, void *out,
#                         size_t blocks, const AES_KEY *key,
#                         const char *ivec,char *cmac);
#
# Handles only complete blocks, operates on 64-bit counter and
# does not update *ivec! Nor does it finalize CMAC value
# (see engine/eng_aesni.c for details)
#
my $cmac="%r9";	# 6th argument

my $increment="%xmm9";
my $bswap_mask="%xmm7";

.globl	aesni_ccm64_encrypt_blocks
.type	aesni_ccm64_encrypt_blocks,\@function,6
aesni_ccm64_encrypt_blocks:
$code.=<<___ if ($win64);
	movaps	%xmm6,(%rsp)		# $iv
	movaps	%xmm7,0x10(%rsp)	# $bswap_mask
	movaps	%xmm8,0x20(%rsp)	# $in0
	movaps	%xmm9,0x30(%rsp)	# $increment
	mov	240($key),$rounds		# key->rounds
	movdqa	.Lincrement64(%rip),$increment
	movdqa	.Lbswap_mask(%rip),$bswap_mask
	movdqu	($cmac),$inout1
	lea	32($key,$rounds),$key		# end of key schedule
	pshufb	$bswap_mask,$iv
	sub	%rax,%r10			# twisted $rounds
	jmp	.Lccm64_enc_outer
	$movkey	($key_),$rndkey0
	movups	($inp),$in0			# load inp
	xorps	$rndkey0,$inout0		# counter
	$movkey	16($key_),$rndkey1
	xorps	$rndkey0,$inout1		# cmac^=inp
	$movkey	32($key_),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	$movkey	($key,%rax),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	$movkey	-16($key,%rax),$rndkey0
	jnz	.Lccm64_enc2_loop
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	paddq	$increment,$iv
	dec	$len				# $len-- ($len is in blocks)
	aesenclast	$rndkey0,$inout0
	aesenclast	$rndkey0,$inout1
	xorps	$inout0,$in0			# inp ^= E(iv)
	movups	$in0,($out)			# save output
	pshufb	$bswap_mask,$inout0
	lea	16($out),$out			# $out+=16
	jnz	.Lccm64_enc_outer		# loop if ($len!=0)
	pxor	$rndkey0,$rndkey0		# clear register bank
	pxor	$rndkey1,$rndkey1
	pxor	$inout0,$inout0
	movups	$inout1,($cmac)			# store resulting mac
	pxor	$inout1,$inout1
$code.=<<___ if ($win64);
	movaps	%xmm0,(%rsp)		# clear stack
	movaps	0x10(%rsp),%xmm7
	movaps	%xmm0,0x10(%rsp)
	movaps	0x20(%rsp),%xmm8
	movaps	%xmm0,0x20(%rsp)
	movaps	0x30(%rsp),%xmm9
	movaps	%xmm0,0x30(%rsp)
.size	aesni_ccm64_encrypt_blocks,.-aesni_ccm64_encrypt_blocks
######################################################################
.globl	aesni_ccm64_decrypt_blocks
.type	aesni_ccm64_decrypt_blocks,\@function,6
aesni_ccm64_decrypt_blocks:
$code.=<<___ if ($win64);
	lea	-0x58(%rsp),%rsp
	movaps	%xmm6,(%rsp)		# $iv
	movaps	%xmm7,0x10(%rsp)	# $bswap_mask
	movaps	%xmm8,0x20(%rsp)	# $in0
	movaps	%xmm9,0x30(%rsp)	# $increment
	mov	240($key),$rounds		# key->rounds
	movdqu	($cmac),$inout1
	movdqa	.Lincrement64(%rip),$increment
	movdqa	.Lbswap_mask(%rip),$bswap_mask
	pshufb	$bswap_mask,$iv
	&aesni_generate1("enc",$key,$rounds);
	movups	($inp),$in0			# load inp
	paddq	$increment,$iv
	lea	16($inp),$inp			# $inp+=16
	sub	%r10,%rax			# twisted $rounds
	lea	32($key_,$rnds_),$key		# end of key schedule
	jmp	.Lccm64_dec_outer
	xorps	$inout0,$in0			# inp ^= E(iv)
	movups	$in0,($out)			# save output
	lea	16($out),$out			# $out+=16
	pshufb	$bswap_mask,$inout0
	sub	\$1,$len			# $len-- ($len is in blocks)
	jz	.Lccm64_dec_break		# if ($len==0) break
	$movkey	($key_),$rndkey0
	$movkey	16($key_),$rndkey1
	xorps	$rndkey0,$inout0
	xorps	$in0,$inout1			# cmac^=out
	$movkey	32($key_),$rndkey0
	jmp	.Lccm64_dec2_loop
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	$movkey	($key,%rax),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	$movkey	-16($key,%rax),$rndkey0
	jnz	.Lccm64_dec2_loop
	movups	($inp),$in0			# load input
	paddq	$increment,$iv
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenclast	$rndkey0,$inout0
	aesenclast	$rndkey0,$inout1
	lea	16($inp),$inp			# $inp+=16
	jmp	.Lccm64_dec_outer
	#xorps	$in0,$inout1			# cmac^=out
	mov	240($key_),$rounds
	&aesni_generate1("enc",$key_,$rounds,$inout1,$in0);
	pxor	$rndkey0,$rndkey0		# clear register bank
	pxor	$rndkey1,$rndkey1
	pxor	$inout0,$inout0
	movups	$inout1,($cmac)			# store resulting mac
	pxor	$inout1,$inout1
1172 $code.=<<___
if ($win64);
1174 movaps
%xmm0,(%rsp) # clear stack
1175 movaps
0x10(%rsp),%xmm7
1176 movaps
%xmm0,0x10(%rsp)
1177 movaps
0x20(%rsp),%xmm8
1178 movaps
%xmm0,0x20(%rsp)
1179 movaps
0x30(%rsp),%xmm9
1180 movaps
%xmm0,0x30(%rsp)
1187 .size aesni_ccm64_decrypt_blocks
,.-aesni_ccm64_decrypt_blocks
######################################################################
# void aesni_ctr32_encrypt_blocks (const void *in, void *out,
#                         size_t blocks, const AES_KEY *key,
#                         const char *ivec);
#
# Handles only complete blocks, operates on 32-bit counter and
# does not update *ivec! (see crypto/modes/ctr128.c for details)
#
# Overhaul based on suggestions from Shay Gueron and Vlad Krasnov,
# http://rt.openssl.org/Ticket/Display.html?id=3021&user=guest&pass=guest.
# Keywords are full unroll and modulo-schedule counter calculations
# with zero-round key xor.
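#
# The counter contract stated above can be modelled in C. This is only a
# minimal sketch under assumptions: ctr32_block() and
# ctr32_blocks_until_wrap() are hypothetical helper names that do not
# exist in this module or in crypto/modes/ctr128.c. It assumes the
# counter is the last four bytes of ivec, read big-endian, wrapping
# modulo 2^32 without carrying into the first twelve bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counter block for block number i.
 * Only bytes 12..15 act as a big-endian counter; ivec is const and
 * is never written back, mirroring "does not update *ivec". */
static void ctr32_block(const unsigned char ivec[16], uint32_t i,
                        unsigned char out[16])
{
    uint32_t ctr;

    assert(ivec != NULL && out != NULL);
    ctr = ((uint32_t)ivec[12] << 24) | ((uint32_t)ivec[13] << 16) |
          ((uint32_t)ivec[14] << 8)  |  (uint32_t)ivec[15];
    ctr += i;                        /* unsigned, wraps modulo 2^32 */

    memcpy(out, ivec, 12);           /* first 12 bytes copied verbatim */
    out[12] = (unsigned char)(ctr >> 24);
    out[13] = (unsigned char)(ctr >> 16);
    out[14] = (unsigned char)(ctr >> 8);
    out[15] = (unsigned char)ctr;
}

/* Hypothetical helper: number of blocks that can be processed before
 * the 32-bit counter wraps to zero. */
static uint64_t ctr32_blocks_until_wrap(const unsigned char ivec[16])
{
    uint32_t ctr = ((uint32_t)ivec[12] << 24) | ((uint32_t)ivec[13] << 16) |
                   ((uint32_t)ivec[14] << 8)  |  (uint32_t)ivec[15];

    return (uint64_t)0x100000000ULL - ctr;
}
```

# A caller that needs a full 128-bit counter would presumably split its
# request at the wrap boundary and propagate the carry itself.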
my ($in0,$in1,$in2,$in3,$in4,$in5)=map("%xmm$_",(10..15));
my ($key0,$ctr)=("%ebp","${ivp}d");
my $frame_size = 0x80 + ($win64?160:0);
.globl	aesni_ctr32_encrypt_blocks
.type	aesni_ctr32_encrypt_blocks,\@function,5
aesni_ctr32_encrypt_blocks:
	# handle single block without allocating stack frame,
	# useful when handling edges
	movups	($ivp),$inout0
	movups	($inp),$inout1
	mov	240($key),%edx	# key->rounds
	&aesni_generate1("enc",$key,"%edx");
	pxor	$rndkey0,$rndkey0	# clear register bank
	pxor	$rndkey1,$rndkey1
	xorps	$inout1,$inout0
	pxor	$inout1,$inout1
	movups	$inout0,($out)
	xorps	$inout0,$inout0
	jmp	.Lctr32_epilogue
	lea	(%rsp),$key_	# use $key_ as frame pointer
.cfi_def_cfa_register	$key_
	sub	\$$frame_size,%rsp
	and	\$-16,%rsp	# Linux kernel stack can be incorrectly seeded
$code.=<<___ if ($win64);
	movaps	%xmm6,-0xa8($key_)	# offload everything
	movaps	%xmm7,-0x98($key_)
	movaps	%xmm8,-0x88($key_)
	movaps	%xmm9,-0x78($key_)
	movaps	%xmm10,-0x68($key_)
	movaps	%xmm11,-0x58($key_)
	movaps	%xmm12,-0x48($key_)
	movaps	%xmm13,-0x38($key_)
	movaps	%xmm14,-0x28($key_)
	movaps	%xmm15,-0x18($key_)
	# 8 16-byte words on top of stack are counter values
	# xor-ed with zero-round key
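	#
	# Why pre-xoring with the zero-round key pays off can be sketched
	# in C. This is an illustration under assumptions, not the module's
	# code: whiten_full() and whiten_patch() are hypothetical names.
	# Since the first 12 bytes of (counter block ^ round[0]) never
	# change, advancing the counter only requires rewriting the last
	# 32-bit word of a stored slot.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Reference path: build the whole counter block, then xor all
 * 16 bytes with the zero-round key rk0. */
static void whiten_full(const unsigned char ivec[16], uint32_t ctr,
                        const unsigned char rk0[16], unsigned char out[16])
{
    unsigned char blk[16];
    int i;

    assert(ivec != NULL && rk0 != NULL && out != NULL);
    memcpy(blk, ivec, 12);
    put_be32(blk + 12, ctr);
    for (i = 0; i < 16; i++)
        out[i] = (unsigned char)(blk[i] ^ rk0[i]);
}

/* Shortcut: slot already holds (ivec ^ rk0); only bytes 12..15 are
 * replaced with be32(ctr) ^ rk0[12..15]. */
static void whiten_patch(unsigned char slot[16], uint32_t ctr,
                         const unsigned char rk0[16])
{
    put_be32(slot + 12, ctr ^ get_be32(rk0 + 12));
}
```

	# Both paths yield the same 16 bytes, so one 32-bit store per slot
	# is enough to prepare the next counter value.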
	movdqu	($ivp),$inout0
	movdqu	($key),$rndkey0
	mov	12($ivp),$ctr	# counter LSB
	pxor	$rndkey0,$inout0
	mov	12($key),$key0	# 0-round key LSB
	movdqa	$inout0,0x00(%rsp)	# populate counter block
	movdqa	$inout0,$inout1
	movdqa	$inout0,$inout2
	movdqa	$inout0,$inout3
	movdqa	$inout0,0x40(%rsp)
	movdqa	$inout0,0x50(%rsp)
	movdqa	$inout0,0x60(%rsp)
	mov	%rdx,%r10	# about to borrow %rdx
	movdqa	$inout0,0x70(%rsp)
	pinsrd	\$3,%eax,$inout1
	movdqa	$inout1,0x10(%rsp)
	pinsrd	\$3,%edx,$inout2
	mov	%r10,%rdx	# restore %rdx
	movdqa	$inout2,0x20(%rsp)
	pinsrd	\$3,%eax,$inout3
	movdqa	$inout3,0x30(%rsp)
	mov	%r10d,0x40+12(%rsp)
	mov	240($key),$rounds	# key->rounds
	mov	%r9d,0x50+12(%rsp)
	mov	%r10d,0x60+12(%rsp)
	mov	OPENSSL_ia32cap_P+4(%rip),%r10d
	and	\$`1<<26|1<<22`,%r10d	# isolate XSAVE+MOVBE
	mov	%r9d,0x70+12(%rsp)
	$movkey	0x10($key),$rndkey1
	movdqa	0x40(%rsp),$inout4
	movdqa	0x50(%rsp),$inout5
	cmp	\$8,$len	# $len is in blocks
	jb	.Lctr32_tail	# short input if ($len<8)
	sub	\$6,$len	# $len is biased by -6
	cmp	\$`1<<22`,%r10d	# check for MOVBE without XSAVE
	je	.Lctr32_6x	# [which denotes Atom Silvermont]
	lea	0x80($key),$key	# size optimization
	sub	\$2,$len	# $len is biased by -8
	lea	32($key,$rounds),$key	# end of key schedule
	sub	%rax,%r10	# twisted $rounds
	add	\$6,$ctr	# next counter value
	$movkey	-48($key,$rnds_),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	movbe	%eax,`0x00+12`(%rsp)	# store next counter value
	aesenc	$rndkey1,$inout2
	movbe	%eax,`0x10+12`(%rsp)
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	movbe	%eax,`0x20+12`(%rsp)
	aesenc	$rndkey1,$inout5
	$movkey	-32($key,$rnds_),$rndkey1
	aesenc	$rndkey0,$inout0
	movbe	%eax,`0x30+12`(%rsp)
	aesenc	$rndkey0,$inout1
	movbe	%eax,`0x40+12`(%rsp)
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	movbe	%eax,`0x50+12`(%rsp)
	mov	%r10,%rax	# mov $rnds_,$rounds
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	$movkey	-16($key,$rnds_),$rndkey0
	movdqu	($inp),$inout6	# load 6 input blocks
	movdqu	0x10($inp),$inout7
	movdqu	0x20($inp),$in0
	movdqu	0x30($inp),$in1
	movdqu	0x40($inp),$in2
	movdqu	0x50($inp),$in3
	lea	0x60($inp),$inp	# $inp+=6*16
	$movkey	-64($key,$rnds_),$rndkey1
	pxor	$inout0,$inout6	# inp^=E(ctr)
	movaps	0x00(%rsp),$inout0	# load next counter [xor-ed with 0 round]
	pxor	$inout1,$inout7
	movaps	0x10(%rsp),$inout1
	movaps	0x20(%rsp),$inout2
	movaps	0x30(%rsp),$inout3
	movaps	0x40(%rsp),$inout4
	movaps	0x50(%rsp),$inout5
	movdqu	$inout6,($out)	# store 6 output blocks
	movdqu	$inout7,0x10($out)
	movdqu	$in0,0x20($out)
	movdqu	$in1,0x30($out)
	movdqu	$in2,0x40($out)
	movdqu	$in3,0x50($out)
	lea	0x60($out),$out	# $out+=6*16
	jnc	.Lctr32_loop6	# loop if $len-=6 didn't borrow
	add	\$6,$len	# restore real remaining $len
	jz	.Lctr32_done	# done if ($len==0)
	lea	-48($rnds_),$rounds
	lea	-80($key,$rnds_),$key	# restore $key
	shr	\$4,$rounds	# restore $rounds
	add	\$8,$ctr	# next counter value
	movdqa	0x60(%rsp),$inout6
	aesenc	$rndkey1,$inout0
	movdqa	0x70(%rsp),$inout7
	aesenc	$rndkey1,$inout1
	$movkey	0x20-0x80($key),$rndkey0
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	mov	%r9d,0x00+12(%rsp)	# store next counter value
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	aesenc	$rndkey1,$inout6
	aesenc	$rndkey1,$inout7
	$movkey	0x30-0x80($key),$rndkey1
for($i=2;$i<8;$i++) {
my $rndkeyx = ($i&1)?$rndkey1:$rndkey0;
	aesenc	$rndkeyx,$inout0
	aesenc	$rndkeyx,$inout1
	aesenc	$rndkeyx,$inout2
	aesenc	$rndkeyx,$inout3
	mov	%r9d,`0x10*($i-1)`+12(%rsp)
	aesenc	$rndkeyx,$inout4
	aesenc	$rndkeyx,$inout5
	aesenc	$rndkeyx,$inout6
	aesenc	$rndkeyx,$inout7
	$movkey	`0x20+0x10*$i`-0x80($key),$rndkeyx
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	movdqu	0x00($inp),$in0	# start loading input
	aesenc	$rndkey0,$inout3
	mov	%r9d,0x70+12(%rsp)
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	aesenc	$rndkey0,$inout6
	aesenc	$rndkey0,$inout7
	$movkey	0xa0-0x80($key),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	aesenc	$rndkey1,$inout6
	aesenc	$rndkey1,$inout7
	$movkey	0xb0-0x80($key),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	aesenc	$rndkey0,$inout6
	aesenc	$rndkey0,$inout7
	$movkey	0xc0-0x80($key),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	aesenc	$rndkey1,$inout6
	aesenc	$rndkey1,$inout7
	$movkey	0xd0-0x80($key),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	aesenc	$rndkey0,$inout6
	aesenc	$rndkey0,$inout7
	$movkey	0xe0-0x80($key),$rndkey0
	jmp	.Lctr32_enc_done
	movdqu	0x10($inp),$in1
	pxor	$rndkey0,$in0	# input^=round[last]
	movdqu	0x20($inp),$in2
	movdqu	0x30($inp),$in3
	movdqu	0x40($inp),$in4
	movdqu	0x50($inp),$in5
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	aesenc	$rndkey1,$inout6
	aesenc	$rndkey1,$inout7
	movdqu	0x60($inp),$rndkey1	# borrow $rndkey1 for inp[6]
	lea	0x80($inp),$inp	# $inp+=8*16
	aesenclast	$in0,$inout0	# $inN is inp[N]^round[last]
	pxor	$rndkey0,$rndkey1	# borrowed $rndkey
	movdqu	0x70-0x80($inp),$in0
	aesenclast	$in1,$inout1
	movdqa	0x00(%rsp),$in1	# load next counter block
	aesenclast	$in2,$inout2
	aesenclast	$in3,$inout3
	movdqa	0x10(%rsp),$in2
	movdqa	0x20(%rsp),$in3
	aesenclast	$in4,$inout4
	aesenclast	$in5,$inout5
	movdqa	0x30(%rsp),$in4
	movdqa	0x40(%rsp),$in5
	aesenclast	$rndkey1,$inout6
	movdqa	0x50(%rsp),$rndkey0
	$movkey	0x10-0x80($key),$rndkey1	# real 1st-round key
	aesenclast	$in0,$inout7
	movups	$inout0,($out)	# store 8 output blocks
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	movdqa	$rndkey0,$inout5
	movups	$inout6,0x60($out)
	movups	$inout7,0x70($out)
	lea	0x80($out),$out	# $out+=8*16
	jnc	.Lctr32_loop8	# loop if $len-=8 didn't borrow
	add	\$8,$len	# restore real remaining $len
	jz	.Lctr32_done	# done if ($len==0)
	lea	-0x80($key),$key
	# note that at this point $inout0..5 are populated with
	# counter values xor-ed with 0-round key
	# if ($len>4) compute 7 E(counter)
	movdqa	0x60(%rsp),$inout6
	pxor	$inout7,$inout7
	$movkey	16($key),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	lea	32-16($key,$rounds),$key	# prepare for .Lenc_loop8_enter
	aesenc	$rndkey1,$inout2
	add	\$16,%rax	# prepare for .Lenc_loop8_enter
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	movups	0x10($inp),$in1	# pre-load input
	movups	0x20($inp),$in2
	aesenc	$rndkey1,$inout5
	aesenc	$rndkey1,$inout6
	call	.Lenc_loop8_enter
	movdqu	0x30($inp),$in3
	movdqu	0x40($inp),$in0
	movdqu	$inout0,($out)	# store output
	movdqu	$inout1,0x10($out)
	movdqu	$inout2,0x20($out)
	movdqu	$inout3,0x30($out)
	movdqu	$inout4,0x40($out)
	jb	.Lctr32_done	# $len was 5, stop store
	movups	0x50($inp),$in1
	movups	$inout5,0x50($out)
	je	.Lctr32_done	# $len was 6, stop store
	movups	0x60($inp),$in2
	movups	$inout6,0x60($out)
	jmp	.Lctr32_done	# $len was 7, stop store
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	$movkey	($key),$rndkey1
	aesenclast	$rndkey1,$inout0
	aesenclast	$rndkey1,$inout1
	movups	($inp),$in0	# load input
	movups	0x10($inp),$in1
	aesenclast	$rndkey1,$inout2
	aesenclast	$rndkey1,$inout3
	movups	0x20($inp),$in2
	movups	0x30($inp),$in3
	movups	$inout0,($out)	# store output
	movups	$inout1,0x10($out)
	movdqu	$inout2,0x20($out)
	movdqu	$inout3,0x30($out)
	jmp	.Lctr32_done	# $len was 4, stop store
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	$movkey	($key),$rndkey1
	aesenclast	$rndkey1,$inout0
	aesenclast	$rndkey1,$inout1
	aesenclast	$rndkey1,$inout2
	movups	($inp),$in0	# load input
	movups	$inout0,($out)	# store output
	jb	.Lctr32_done	# $len was 1, stop store
	movups	0x10($inp),$in1
	movups	$inout1,0x10($out)
	je	.Lctr32_done	# $len was 2, stop store
	movups	0x20($inp),$in2
	movups	$inout2,0x20($out)	# $len was 3, stop store
	xorps	%xmm0,%xmm0	# clear register bank
$code.=<<___ if (!$win64);
	movaps	%xmm0,0x00(%rsp)	# clear stack
	movaps	%xmm0,0x10(%rsp)
	movaps	%xmm0,0x20(%rsp)
	movaps	%xmm0,0x30(%rsp)
	movaps	%xmm0,0x40(%rsp)
	movaps	%xmm0,0x50(%rsp)
	movaps	%xmm0,0x60(%rsp)
	movaps	%xmm0,0x70(%rsp)
$code.=<<___ if ($win64);
	movaps	-0xa8($key_),%xmm6
	movaps	%xmm0,-0xa8($key_)	# clear stack
	movaps	-0x98($key_),%xmm7
	movaps	%xmm0,-0x98($key_)
	movaps	-0x88($key_),%xmm8
	movaps	%xmm0,-0x88($key_)
	movaps	-0x78($key_),%xmm9
	movaps	%xmm0,-0x78($key_)
	movaps	-0x68($key_),%xmm10
	movaps	%xmm0,-0x68($key_)
	movaps	-0x58($key_),%xmm11
	movaps	%xmm0,-0x58($key_)
	movaps	-0x48($key_),%xmm12
	movaps	%xmm0,-0x48($key_)
	movaps	-0x38($key_),%xmm13
	movaps	%xmm0,-0x38($key_)
	movaps	-0x28($key_),%xmm14
	movaps	%xmm0,-0x28($key_)
	movaps	-0x18($key_),%xmm15
	movaps	%xmm0,-0x18($key_)
	movaps	%xmm0,0x00(%rsp)
	movaps	%xmm0,0x10(%rsp)
	movaps	%xmm0,0x20(%rsp)
	movaps	%xmm0,0x30(%rsp)
	movaps	%xmm0,0x40(%rsp)
	movaps	%xmm0,0x50(%rsp)
	movaps	%xmm0,0x60(%rsp)
	movaps	%xmm0,0x70(%rsp)
.cfi_def_cfa_register	%rsp
.size	aesni_ctr32_encrypt_blocks,.-aesni_ctr32_encrypt_blocks
######################################################################
# void aesni_xts_[en|de]crypt(const char *inp,char *out,size_t len,
#	const AES_KEY *key1, const AES_KEY *key2,
#	const unsigned char iv[16]);
my @tweak=map("%xmm$_",(10..15));
my ($twmask,$twres,$twtmp)=("%xmm8","%xmm9",@tweak[4]);
my ($key2,$ivp,$len_)=("%r8","%r9","%r9");
my $frame_size = 0x70 + ($win64?160:0);
my $key_ = "%rbp";	# override so that we can use %r11 as FP
.globl	aesni_xts_encrypt
.type	aesni_xts_encrypt,\@function,6
	lea	(%rsp),%r11	# frame pointer
.cfi_def_cfa_register	%r11
	sub	\$$frame_size,%rsp
	and	\$-16,%rsp	# Linux kernel stack can be incorrectly seeded
$code.=<<___ if ($win64);
	movaps	%xmm6,-0xa8(%r11)	# offload everything
	movaps	%xmm7,-0x98(%r11)
	movaps	%xmm8,-0x88(%r11)
	movaps	%xmm9,-0x78(%r11)
	movaps	%xmm10,-0x68(%r11)
	movaps	%xmm11,-0x58(%r11)
	movaps	%xmm12,-0x48(%r11)
	movaps	%xmm13,-0x38(%r11)
	movaps	%xmm14,-0x28(%r11)
	movaps	%xmm15,-0x18(%r11)
	movups	($ivp),$inout0	# load clear-text tweak
	mov	240(%r8),$rounds	# key2->rounds
	mov	240($key),$rnds_	# key1->rounds
	# generate the tweak
	&aesni_generate1("enc",$key2,$rounds,$inout0);
	$movkey	($key),$rndkey0	# zero round key
	mov	$key,$key_	# backup $key
	mov	$rnds_,$rounds	# backup $rounds
	mov	$len,$len_	# backup $len
	$movkey	16($key,$rnds_),$rndkey1	# last round key
	movdqa	.Lxts_magic(%rip),$twmask
	movdqa	$inout0,@tweak[5]
	pshufd	\$0x5f,$inout0,$twres
	pxor	$rndkey0,$rndkey1
	# alternative tweak calculation algorithm is based on suggestions
	# by Shay Gueron. psrad doesn't conflict with AES-NI instructions
	# and should help in the future...
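	#
	# For reference, the per-block tweak update that the SIMD sequence
	# computes can be sketched in C. This is an illustration under
	# assumptions, not a transcription of the assembly: xts_double()
	# and xts_double_lanes() are hypothetical names, and the reduction
	# constant 0x87 is the standard XTS one (x^128 + x^7 + x^2 + x + 1)
	# with the tweak held in little-endian byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise form: multiply the 128-bit tweak by x in GF(2^128). */
static void xts_double(unsigned char t[16])
{
    unsigned int carry;
    int i;

    assert(t != NULL);
    carry = t[15] >> 7;                 /* bit shifted out of the top */
    for (i = 15; i > 0; i--)
        t[i] = (unsigned char)((t[i] << 1) | (t[i - 1] >> 7));
    t[0] = (unsigned char)((t[0] << 1) ^ (carry ? 0x87 : 0));
}

/* Lane-wise form, closer in spirit to doubling each 64-bit half
 * independently and then folding the two carries back in through
 * masks instead of branches.  t[0] is the low half, t[1] the high. */
static void xts_double_lanes(uint64_t t[2])
{
    uint64_t lo_carry = t[0] >> 63;                  /* goes into high half */
    uint64_t hi_mask  = (uint64_t)0 - (t[1] >> 63);  /* all-ones if set     */

    t[0] = (t[0] << 1) ^ (hi_mask & 0x87);
    t[1] = (t[1] << 1) ^ lo_carry;
}
```

	# The mask form needs no data-dependent branch, which is the
	# property a sign-broadcasting shift provides in the vector code.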
for ($i=0;$i<4;$i++) {
	movdqa	$twres,$twtmp
	movdqa	@tweak[5],@tweak[$i]
	psrad	\$31,$twtmp	# broadcast upper bits
	paddq	@tweak[5],@tweak[5]
	pxor	$rndkey0,@tweak[$i]
	pxor	$twtmp,@tweak[5]
	movdqa	@tweak[5],@tweak[4]
	paddq	@tweak[5],@tweak[5]
	pxor	$rndkey0,@tweak[4]
	pxor	$twres,@tweak[5]
	movaps	$rndkey1,0x60(%rsp)	# save round[0]^round[last]
	jc	.Lxts_enc_short	# if $len-=6*16 borrowed
	lea	32($key_,$rnds_),$key	# end of key schedule
	sub	%r10,%rax	# twisted $rounds
	$movkey	16($key_),$rndkey1
	mov	%rax,%r10	# backup twisted $rounds
	lea	.Lxts_magic(%rip),%r8
	jmp	.Lxts_enc_grandloop
.Lxts_enc_grandloop:
	movdqu	`16*0`($inp),$inout0	# load input
	movdqa	$rndkey0,$twmask
	movdqu	`16*1`($inp),$inout1
	pxor	@tweak[0],$inout0	# input^=tweak^round[0]
	movdqu	`16*2`($inp),$inout2
	pxor	@tweak[1],$inout1
	aesenc	$rndkey1,$inout0
	movdqu	`16*3`($inp),$inout3
	pxor	@tweak[2],$inout2
	aesenc	$rndkey1,$inout1
	movdqu	`16*4`($inp),$inout4
	pxor	@tweak[3],$inout3
	aesenc	$rndkey1,$inout2
	movdqu	`16*5`($inp),$inout5
	pxor	@tweak[5],$twmask	# round[0]^=tweak[5]
	movdqa	0x60(%rsp),$twres	# load round[0]^round[last]
	pxor	@tweak[4],$inout4
	aesenc	$rndkey1,$inout3
	$movkey	32($key_),$rndkey0
	lea	`16*6`($inp),$inp
	pxor	$twmask,$inout5
	pxor	$twres,@tweak[0]	# calculate tweaks^round[last]
	aesenc	$rndkey1,$inout4
	pxor	$twres,@tweak[1]
	movdqa	@tweak[0],`16*0`(%rsp)	# put aside tweaks^round[last]
	aesenc	$rndkey1,$inout5
	$movkey	48($key_),$rndkey1
	pxor	$twres,@tweak[2]
	aesenc	$rndkey0,$inout0
	pxor	$twres,@tweak[3]
	movdqa	@tweak[1],`16*1`(%rsp)
	aesenc	$rndkey0,$inout1
	pxor	$twres,@tweak[4]
	movdqa	@tweak[2],`16*2`(%rsp)
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	movdqa	@tweak[4],`16*4`(%rsp)
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	$movkey	64($key_),$rndkey0
	movdqa	$twmask,`16*5`(%rsp)
	pshufd	\$0x5f,@tweak[5],$twres
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	$movkey	-64($key,%rax),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	$movkey	-80($key,%rax),$rndkey0
	movdqa	(%r8),$twmask	# start calculating next tweak
	movdqa	$twres,$twtmp
	aesenc	$rndkey1,$inout0
	paddq	@tweak[5],@tweak[5]
	aesenc	$rndkey1,$inout1
	$movkey	($key_),@tweak[0]	# load round[0]
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	pxor	$twtmp,@tweak[5]
	movaps	@tweak[0],@tweak[1]	# copy round[0]
	aesenc	$rndkey1,$inout5
	$movkey	-64($key),$rndkey1
	movdqa	$twres,$twtmp
	aesenc	$rndkey0,$inout0
	pxor	@tweak[5],@tweak[0]
	aesenc	$rndkey0,$inout1
	paddq	@tweak[5],@tweak[5]
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	movaps	@tweak[1],@tweak[2]
	aesenc	$rndkey0,$inout4
	pxor	$twtmp,@tweak[5]
	movdqa	$twres,$twtmp
	aesenc	$rndkey0,$inout5
	$movkey	-48($key),$rndkey0
	aesenc	$rndkey1,$inout0
	pxor	@tweak[5],@tweak[1]
	aesenc	$rndkey1,$inout1
	paddq	@tweak[5],@tweak[5]
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	movdqa	@tweak[3],`16*3`(%rsp)
	pxor	$twtmp,@tweak[5]
	aesenc	$rndkey1,$inout4
	movaps	@tweak[2],@tweak[3]
	movdqa	$twres,$twtmp
	aesenc	$rndkey1,$inout5
	$movkey	-32($key),$rndkey1
	aesenc	$rndkey0,$inout0
	pxor	@tweak[5],@tweak[2]
	aesenc	$rndkey0,$inout1
	paddq	@tweak[5],@tweak[5]
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	aesenc	$rndkey0,$inout4
	pxor	$twtmp,@tweak[5]
	movaps	@tweak[3],@tweak[4]
	aesenc	$rndkey0,$inout5
	movdqa	$twres,$rndkey0
	aesenc	$rndkey1,$inout0
	pxor	@tweak[5],@tweak[3]
	aesenc	$rndkey1,$inout1
	paddq	@tweak[5],@tweak[5]
	pand	$twmask,$rndkey0
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	pxor	$rndkey0,@tweak[5]
	$movkey	($key_),$rndkey0
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	$movkey	16($key_),$rndkey1
	pxor	@tweak[5],@tweak[4]
	aesenclast	`16*0`(%rsp),$inout0
	paddq	@tweak[5],@tweak[5]
	aesenclast	`16*1`(%rsp),$inout1
	aesenclast	`16*2`(%rsp),$inout2
	mov	%r10,%rax	# restore $rounds
	aesenclast	`16*3`(%rsp),$inout3
	aesenclast	`16*4`(%rsp),$inout4
	aesenclast	`16*5`(%rsp),$inout5
	pxor	$twres,@tweak[5]
	lea	`16*6`($out),$out	# $out+=6*16
	movups	$inout0,`-16*6`($out)	# store 6 output blocks
	movups	$inout1,`-16*5`($out)
	movups	$inout2,`-16*4`($out)
	movups	$inout3,`-16*3`($out)
	movups	$inout4,`-16*2`($out)
	movups	$inout5,`-16*1`($out)
	jnc	.Lxts_enc_grandloop	# loop if $len-=6*16 didn't borrow
	mov	$key_,$key	# restore $key
	shr	\$4,$rounds	# restore original value
	# at this point @tweak[0..5] are populated with tweak values
	mov	$rounds,$rnds_	# backup $rounds
	pxor	$rndkey0,@tweak[0]
	add	\$16*6,$len	# restore real remaining $len
	jz	.Lxts_enc_done	# done if ($len==0)
	pxor	$rndkey0,@tweak[1]
	jb	.Lxts_enc_one	# $len is 1*16
	pxor	$rndkey0,@tweak[2]
	je	.Lxts_enc_two	# $len is 2*16
	pxor	$rndkey0,@tweak[3]
	jb	.Lxts_enc_three	# $len is 3*16
	pxor	$rndkey0,@tweak[4]
	je	.Lxts_enc_four	# $len is 4*16
	movdqu	($inp),$inout0	# $len is 5*16
	movdqu	16*1($inp),$inout1
	movdqu	16*2($inp),$inout2
	pxor	@tweak[0],$inout0
	movdqu	16*3($inp),$inout3
	pxor	@tweak[1],$inout1
	movdqu	16*4($inp),$inout4
	lea	16*5($inp),$inp	# $inp+=5*16
	pxor	@tweak[2],$inout2
	pxor	@tweak[3],$inout3
	pxor	@tweak[4],$inout4
	pxor	$inout5,$inout5
	call	_aesni_encrypt6
	xorps	@tweak[0],$inout0
	movdqa	@tweak[5],@tweak[0]
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	movdqu	$inout0,($out)	# store 5 output blocks
	xorps	@tweak[3],$inout3
	movdqu	$inout1,16*1($out)
	xorps	@tweak[4],$inout4
	movdqu	$inout2,16*2($out)
	movdqu	$inout3,16*3($out)
	movdqu	$inout4,16*4($out)
	lea	16*5($out),$out	# $out+=5*16
	movups	($inp),$inout0
	lea	16*1($inp),$inp	# inp+=1*16
	xorps	@tweak[0],$inout0
	&aesni_generate1("enc",$key,$rounds);
	xorps	@tweak[0],$inout0
	movdqa	@tweak[1],@tweak[0]
	movups	$inout0,($out)	# store one output block
	lea	16*1($out),$out	# $out+=1*16
	movups	($inp),$inout0
	movups	16($inp),$inout1
	lea	32($inp),$inp	# $inp+=2*16
	xorps	@tweak[0],$inout0
	xorps	@tweak[1],$inout1
	call	_aesni_encrypt2
	xorps	@tweak[0],$inout0
	movdqa	@tweak[2],@tweak[0]
	xorps	@tweak[1],$inout1
	movups	$inout0,($out)	# store 2 output blocks
	movups	$inout1,16*1($out)
	lea	16*2($out),$out	# $out+=2*16
	movups	($inp),$inout0
	movups	16*1($inp),$inout1
	movups	16*2($inp),$inout2
	lea	16*3($inp),$inp	# $inp+=3*16
	xorps	@tweak[0],$inout0
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	call	_aesni_encrypt3
	xorps	@tweak[0],$inout0
	movdqa	@tweak[3],@tweak[0]
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	movups	$inout0,($out)	# store 3 output blocks
	movups	$inout1,16*1($out)
	movups	$inout2,16*2($out)
	lea	16*3($out),$out	# $out+=3*16
	movups	($inp),$inout0
	movups	16*1($inp),$inout1
	movups	16*2($inp),$inout2
	xorps	@tweak[0],$inout0
	movups	16*3($inp),$inout3
	lea	16*4($inp),$inp	# $inp+=4*16
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	xorps	@tweak[3],$inout3
	call	_aesni_encrypt4
	pxor	@tweak[0],$inout0
	movdqa	@tweak[4],@tweak[0]
	pxor	@tweak[1],$inout1
	pxor	@tweak[2],$inout2
	movdqu	$inout0,($out)	# store 4 output blocks
	pxor	@tweak[3],$inout3
	movdqu	$inout1,16*1($out)
	movdqu	$inout2,16*2($out)
	movdqu	$inout3,16*3($out)
	lea	16*4($out),$out	# $out+=4*16
	and	\$15,$len_	# see if $len%16 is 0
	movzb	($inp),%eax	# borrow $rounds ...
	movzb	-16($out),%ecx	# ... and $key
	sub	$len_,$out	# rewind $out
	mov	$key_,$key	# restore $key
	mov	$rnds_,$rounds	# restore $rounds
	movups	-16($out),$inout0
	xorps	@tweak[0],$inout0
	&aesni_generate1("enc",$key,$rounds);
	xorps	@tweak[0],$inout0
	movups	$inout0,-16($out)
	xorps	%xmm0,%xmm0	# clear register bank
$code.=<<___ if (!$win64);
	movaps	%xmm0,0x00(%rsp)	# clear stack
	movaps	%xmm0,0x10(%rsp)
	movaps	%xmm0,0x20(%rsp)
	movaps	%xmm0,0x30(%rsp)
	movaps	%xmm0,0x40(%rsp)
	movaps	%xmm0,0x50(%rsp)
	movaps	%xmm0,0x60(%rsp)
$code.=<<___ if ($win64);
	movaps	-0xa8(%r11),%xmm6
	movaps	%xmm0,-0xa8(%r11)	# clear stack
	movaps	-0x98(%r11),%xmm7
	movaps	%xmm0,-0x98(%r11)
	movaps	-0x88(%r11),%xmm8
	movaps	%xmm0,-0x88(%r11)
	movaps	-0x78(%r11),%xmm9
	movaps	%xmm0,-0x78(%r11)
	movaps	-0x68(%r11),%xmm10
	movaps	%xmm0,-0x68(%r11)
	movaps	-0x58(%r11),%xmm11
	movaps	%xmm0,-0x58(%r11)
	movaps	-0x48(%r11),%xmm12
	movaps	%xmm0,-0x48(%r11)
	movaps	-0x38(%r11),%xmm13
	movaps	%xmm0,-0x38(%r11)
	movaps	-0x28(%r11),%xmm14
	movaps	%xmm0,-0x28(%r11)
	movaps	-0x18(%r11),%xmm15
	movaps	%xmm0,-0x18(%r11)
	movaps	%xmm0,0x00(%rsp)
	movaps	%xmm0,0x10(%rsp)
	movaps	%xmm0,0x20(%rsp)
	movaps	%xmm0,0x30(%rsp)
	movaps	%xmm0,0x40(%rsp)
	movaps	%xmm0,0x50(%rsp)
	movaps	%xmm0,0x60(%rsp)
.cfi_def_cfa_register	%rsp
.size	aesni_xts_encrypt,.-aesni_xts_encrypt
.globl	aesni_xts_decrypt
.type	aesni_xts_decrypt,\@function,6
	lea	(%rsp),%r11	# frame pointer
.cfi_def_cfa_register	%r11
	sub	\$$frame_size,%rsp
	and	\$-16,%rsp	# Linux kernel stack can be incorrectly seeded
$code.=<<___ if ($win64);
	movaps	%xmm6,-0xa8(%r11)	# offload everything
	movaps	%xmm7,-0x98(%r11)
	movaps	%xmm8,-0x88(%r11)
	movaps	%xmm9,-0x78(%r11)
	movaps	%xmm10,-0x68(%r11)
	movaps	%xmm11,-0x58(%r11)
	movaps	%xmm12,-0x48(%r11)
	movaps	%xmm13,-0x38(%r11)
	movaps	%xmm14,-0x28(%r11)
	movaps	%xmm15,-0x18(%r11)
	movups	($ivp),$inout0	# load clear-text tweak
	mov	240($key2),$rounds	# key2->rounds
	mov	240($key),$rnds_	# key1->rounds
	# generate the tweak
	&aesni_generate1("enc",$key2,$rounds,$inout0);
	xor	%eax,%eax	# if ($len%16) len-=16;
	$movkey	($key),$rndkey0	# zero round key
	mov	$key,$key_	# backup $key
	mov	$rnds_,$rounds	# backup $rounds
	mov	$len,$len_	# backup $len
	$movkey	16($key,$rnds_),$rndkey1	# last round key
	movdqa	.Lxts_magic(%rip),$twmask
	movdqa	$inout0,@tweak[5]
	pshufd	\$0x5f,$inout0,$twres
	pxor	$rndkey0,$rndkey1
for ($i=0;$i<4;$i++) {
	movdqa	$twres,$twtmp
	movdqa	@tweak[5],@tweak[$i]
	psrad	\$31,$twtmp	# broadcast upper bits
	paddq	@tweak[5],@tweak[5]
	pxor	$rndkey0,@tweak[$i]
	pxor	$twtmp,@tweak[5]
	movdqa	@tweak[5],@tweak[4]
	paddq	@tweak[5],@tweak[5]
	pxor	$rndkey0,@tweak[4]
	pxor	$twres,@tweak[5]
	movaps	$rndkey1,0x60(%rsp)	# save round[0]^round[last]
	jc	.Lxts_dec_short	# if $len-=6*16 borrowed
	lea	32($key_,$rnds_),$key	# end of key schedule
	sub	%r10,%rax	# twisted $rounds
	$movkey	16($key_),$rndkey1
	mov	%rax,%r10	# backup twisted $rounds
	lea	.Lxts_magic(%rip),%r8
	jmp	.Lxts_dec_grandloop
.Lxts_dec_grandloop:
	movdqu	`16*0`($inp),$inout0	# load input
	movdqa	$rndkey0,$twmask
	movdqu	`16*1`($inp),$inout1
	pxor	@tweak[0],$inout0	# input^=tweak^round[0]
	movdqu	`16*2`($inp),$inout2
	pxor	@tweak[1],$inout1
	aesdec	$rndkey1,$inout0
	movdqu	`16*3`($inp),$inout3
	pxor	@tweak[2],$inout2
	aesdec	$rndkey1,$inout1
	movdqu	`16*4`($inp),$inout4
	pxor	@tweak[3],$inout3
	aesdec	$rndkey1,$inout2
	movdqu	`16*5`($inp),$inout5
	pxor	@tweak[5],$twmask	# round[0]^=tweak[5]
	movdqa	0x60(%rsp),$twres	# load round[0]^round[last]
	pxor	@tweak[4],$inout4
	aesdec	$rndkey1,$inout3
	$movkey	32($key_),$rndkey0
	lea	`16*6`($inp),$inp
	pxor	$twmask,$inout5
	pxor	$twres,@tweak[0]	# calculate tweaks^round[last]
	aesdec	$rndkey1,$inout4
	pxor	$twres,@tweak[1]
	movdqa	@tweak[0],`16*0`(%rsp)	# put aside tweaks^last round key
	aesdec	$rndkey1,$inout5
	$movkey	48($key_),$rndkey1
	pxor	$twres,@tweak[2]
	aesdec	$rndkey0,$inout0
	pxor	$twres,@tweak[3]
	movdqa	@tweak[1],`16*1`(%rsp)
	aesdec	$rndkey0,$inout1
	pxor	$twres,@tweak[4]
	movdqa	@tweak[2],`16*2`(%rsp)
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	movdqa	@tweak[4],`16*4`(%rsp)
	aesdec	$rndkey0,$inout4
	aesdec	$rndkey0,$inout5
	$movkey	64($key_),$rndkey0
	movdqa	$twmask,`16*5`(%rsp)
	pshufd	\$0x5f,@tweak[5],$twres
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	aesdec	$rndkey1,$inout4
	aesdec	$rndkey1,$inout5
	$movkey	-64($key,%rax),$rndkey1
	aesdec	$rndkey0,$inout0
	aesdec	$rndkey0,$inout1
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	aesdec	$rndkey0,$inout4
	aesdec	$rndkey0,$inout5
	$movkey	-80($key,%rax),$rndkey0
	movdqa	(%r8),$twmask	# start calculating next tweak
	movdqa	$twres,$twtmp
	aesdec	$rndkey1,$inout0
	paddq	@tweak[5],@tweak[5]
	aesdec	$rndkey1,$inout1
	$movkey	($key_),@tweak[0]	# load round[0]
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	aesdec	$rndkey1,$inout4
	pxor	$twtmp,@tweak[5]
	movaps	@tweak[0],@tweak[1]	# copy round[0]
	aesdec	$rndkey1,$inout5
	$movkey	-64($key),$rndkey1
	movdqa	$twres,$twtmp
	aesdec	$rndkey0,$inout0
	pxor	@tweak[5],@tweak[0]
	aesdec	$rndkey0,$inout1
	paddq	@tweak[5],@tweak[5]
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	movaps	@tweak[1],@tweak[2]
	aesdec	$rndkey0,$inout4
	pxor	$twtmp,@tweak[5]
	movdqa	$twres,$twtmp
	aesdec	$rndkey0,$inout5
	$movkey	-48($key),$rndkey0
	aesdec	$rndkey1,$inout0
	pxor	@tweak[5],@tweak[1]
	aesdec	$rndkey1,$inout1
	paddq	@tweak[5],@tweak[5]
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	movdqa	@tweak[3],`16*3`(%rsp)
	pxor	$twtmp,@tweak[5]
	aesdec	$rndkey1,$inout4
	movaps	@tweak[2],@tweak[3]
	movdqa	$twres,$twtmp
	aesdec	$rndkey1,$inout5
	$movkey	-32($key),$rndkey1
	aesdec	$rndkey0,$inout0
	pxor	@tweak[5],@tweak[2]
	aesdec	$rndkey0,$inout1
	paddq	@tweak[5],@tweak[5]
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	aesdec	$rndkey0,$inout4
	pxor	$twtmp,@tweak[5]
	movaps	@tweak[3],@tweak[4]
	aesdec	$rndkey0,$inout5
	movdqa	$twres,$rndkey0
	aesdec	$rndkey1,$inout0
	pxor	@tweak[5],@tweak[3]
	aesdec	$rndkey1,$inout1
	paddq	@tweak[5],@tweak[5]
	pand	$twmask,$rndkey0
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	pxor	$rndkey0,@tweak[5]
	$movkey	($key_),$rndkey0
	aesdec	$rndkey1,$inout4
	aesdec	$rndkey1,$inout5
	$movkey	16($key_),$rndkey1
	pxor	@tweak[5],@tweak[4]
	aesdeclast	`16*0`(%rsp),$inout0
	paddq	@tweak[5],@tweak[5]
	aesdeclast	`16*1`(%rsp),$inout1
	aesdeclast	`16*2`(%rsp),$inout2
	mov	%r10,%rax	# restore $rounds
	aesdeclast	`16*3`(%rsp),$inout3
	aesdeclast	`16*4`(%rsp),$inout4
	aesdeclast	`16*5`(%rsp),$inout5
	pxor	$twres,@tweak[5]
	lea	`16*6`($out),$out	# $out+=6*16
	movups	$inout0,`-16*6`($out)	# store 6 output blocks
	movups	$inout1,`-16*5`($out)
	movups	$inout2,`-16*4`($out)
	movups	$inout3,`-16*3`($out)
	movups	$inout4,`-16*2`($out)
	movups	$inout5,`-16*1`($out)
	jnc	.Lxts_dec_grandloop	# loop if $len-=6*16 didn't borrow
	mov	$key_,$key	# restore $key
	shr	\$4,$rounds	# restore original value
	# at this point @tweak[0..5] are populated with tweak values
	mov	$rounds,$rnds_	# backup $rounds
	pxor	$rndkey0,@tweak[0]
	pxor	$rndkey0,@tweak[1]
	add	\$16*6,$len	# restore real remaining $len
	jz	.Lxts_dec_done	# done if ($len==0)
	pxor	$rndkey0,@tweak[2]
	jb	.Lxts_dec_one	# $len is 1*16
	pxor	$rndkey0,@tweak[3]
	je	.Lxts_dec_two	# $len is 2*16
	pxor	$rndkey0,@tweak[4]
	jb	.Lxts_dec_three	# $len is 3*16
	je	.Lxts_dec_four	# $len is 4*16
	movdqu	($inp),$inout0	# $len is 5*16
	movdqu	16*1($inp),$inout1
	movdqu	16*2($inp),$inout2
	pxor	@tweak[0],$inout0
	movdqu	16*3($inp),$inout3
	pxor	@tweak[1],$inout1
	movdqu	16*4($inp),$inout4
	lea	16*5($inp),$inp	# $inp+=5*16
	pxor	@tweak[2],$inout2
	pxor	@tweak[3],$inout3
	pxor	@tweak[4],$inout4
	call	_aesni_decrypt6
	xorps	@tweak[0],$inout0
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	movdqu	$inout0,($out)	# store 5 output blocks
	xorps	@tweak[3],$inout3
	movdqu	$inout1,16*1($out)
	xorps	@tweak[4],$inout4
	movdqu	$inout2,16*2($out)
	movdqu	$inout3,16*3($out)
	pcmpgtd	@tweak[5],$twtmp
	movdqu	$inout4,16*4($out)
	lea	16*5($out),$out	# $out+=5*16
	pshufd	\$0x13,$twtmp,@tweak[1]	# $twres
	movdqa	@tweak[5],@tweak[0]
	paddq	@tweak[5],@tweak[5]	# psllq 1,$tweak
	pand	$twmask,@tweak[1]	# isolate carry and residue
	pxor	@tweak[5],@tweak[1]
	movups	($inp),$inout0
	lea	16*1($inp),$inp	# $inp+=1*16
	xorps	@tweak[0],$inout0
	&aesni_generate1("dec",$key,$rounds);
	xorps	@tweak[0],$inout0
	movdqa	@tweak[1],@tweak[0]
	movups	$inout0,($out)	# store one output block
	movdqa	@tweak[2],@tweak[1]
	lea	16*1($out),$out	# $out+=1*16
	movups	($inp),$inout0
	movups	16($inp),$inout1
	lea	32($inp),$inp	# $inp+=2*16
	xorps	@tweak[0],$inout0
	xorps	@tweak[1],$inout1
	call	_aesni_decrypt2
	xorps	@tweak[0],$inout0
	movdqa	@tweak[2],@tweak[0]
	xorps	@tweak[1],$inout1
	movdqa	@tweak[3],@tweak[1]
	movups	$inout0,($out)	# store 2 output blocks
	movups	$inout1,16*1($out)
	lea	16*2($out),$out	# $out+=2*16
	movups	($inp),$inout0
	movups	16*1($inp),$inout1
	movups	16*2($inp),$inout2
	lea	16*3($inp),$inp	# $inp+=3*16
	xorps	@tweak[0],$inout0
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	call	_aesni_decrypt3
	xorps	@tweak[0],$inout0
	movdqa	@tweak[3],@tweak[0]
	xorps	@tweak[1],$inout1
	movdqa	@tweak[4],@tweak[1]
	xorps	@tweak[2],$inout2
	movups	$inout0,($out)	# store 3 output blocks
	movups	$inout1,16*1($out)
	movups	$inout2,16*2($out)
	lea	16*3($out),$out	# $out+=3*16
	movups	($inp),$inout0
	movups	16*1($inp),$inout1
	movups	16*2($inp),$inout2
	xorps	@tweak[0],$inout0
	movups	16*3($inp),$inout3
	lea	16*4($inp),$inp	# $inp+=4*16
	xorps	@tweak[1],$inout1
	xorps	@tweak[2],$inout2
	xorps	@tweak[3],$inout3
	call	_aesni_decrypt4
	pxor	@tweak[0],$inout0
	movdqa	@tweak[4],@tweak[0]
	pxor	@tweak[1],$inout1
	movdqa	@tweak[5],@tweak[1]
	pxor	@tweak[2],$inout2
	movdqu	$inout0,($out)	# store 4 output blocks
	pxor	@tweak[3],$inout3
	movdqu	$inout1,16*1($out)
	movdqu	$inout2,16*2($out)
	movdqu	$inout3,16*3($out)
	lea	16*4($out),$out	# $out+=4*16
	and	\$15,$len_	# see if $len%16 is 0
	mov	$key_,$key	# restore $key
	mov	$rnds_,$rounds	# restore $rounds
	movups	($inp),$inout0
	xorps	@tweak[1],$inout0
	&aesni_generate1("dec",$key,$rounds);
	xorps	@tweak[1],$inout0
	movups	$inout0,($out)
	movzb	16($inp),%eax	# borrow $rounds ...
	movzb	($out),%ecx	# ... and $key
	sub	$len_,$out	# rewind $out
	mov	$key_,$key	# restore $key
	mov	$rnds_,$rounds	# restore $rounds
	movups	($out),$inout0
	xorps	@tweak[0],$inout0
	&aesni_generate1("dec",$key,$rounds);
	xorps	@tweak[0],$inout0
	movups	$inout0,($out)
	xorps	%xmm0,%xmm0	# clear register bank
$code.=<<___ if (!$win64);
	movaps	%xmm0,0x00(%rsp)	# clear stack
	movaps	%xmm0,0x10(%rsp)
	movaps	%xmm0,0x20(%rsp)
	movaps	%xmm0,0x30(%rsp)
	movaps	%xmm0,0x40(%rsp)
	movaps	%xmm0,0x50(%rsp)
	movaps	%xmm0,0x60(%rsp)
$code.=<<___ if ($win64);
	movaps	-0xa8(%r11),%xmm6
	movaps	%xmm0,-0xa8(%r11)	# clear stack
	movaps	-0x98(%r11),%xmm7
	movaps	%xmm0,-0x98(%r11)
	movaps	-0x88(%r11),%xmm8
	movaps	%xmm0,-0x88(%r11)
	movaps	-0x78(%r11),%xmm9
	movaps	%xmm0,-0x78(%r11)
	movaps	-0x68(%r11),%xmm10
	movaps	%xmm0,-0x68(%r11)
	movaps	-0x58(%r11),%xmm11
	movaps	%xmm0,-0x58(%r11)
	movaps	-0x48(%r11),%xmm12
	movaps	%xmm0,-0x48(%r11)
	movaps	-0x38(%r11),%xmm13
	movaps	%xmm0,-0x38(%r11)
	movaps	-0x28(%r11),%xmm14
	movaps	%xmm0,-0x28(%r11)
	movaps	-0x18(%r11),%xmm15
	movaps	%xmm0,-0x18(%r11)
	movaps	%xmm0,0x00(%rsp)
	movaps	%xmm0,0x10(%rsp)
	movaps	%xmm0,0x20(%rsp)
	movaps	%xmm0,0x30(%rsp)
	movaps	%xmm0,0x40(%rsp)
	movaps	%xmm0,0x50(%rsp)
	movaps	%xmm0,0x60(%rsp)
	.cfi_def_cfa_register	%rsp
.size	aesni_xts_decrypt,.-aesni_xts_decrypt
######################################################################
# void aesni_ocb_[en|de]crypt(const char *inp, char *out, size_t blocks,
#	const AES_KEY *key, unsigned int start_block_num,
#	unsigned char offset_i[16], const unsigned char L_[][16],
#	unsigned char checksum[16]);
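#
# The code below indexes the L_ table by ntz(block number): `bsf` yields
# the number of trailing zero bits and `shl \$4` turns it into a byte
# offset into a table of 16-byte entries. The following is a minimal,
# non-authoritative sketch of that indexing in plain Perl. The helper
# names (ntz, l_table_offset, next_offset) are hypothetical and are not
# part of this module; the update rule offset_i = offset_{i-1} ^ L_[ntz(i)]
# is the standard OCB one and is assumed here rather than taken from this
# file.

```perl
use strict;
use warnings;

# ntz(i): number of trailing zero bits of a non-zero block number,
# i.e. the value `bsf` leaves in the index register.
sub ntz {
    my ($i) = @_;
    die "ntz(0) is undefined\n" if $i == 0;
    my $n = 0;
    until ($i & 1) { $i >>= 1; $n++; }
    return $n;
}

# Byte offset of L_[ntz(i)] inside a table of 16-byte entries,
# mirroring the `shl \$4` that follows `bsf`.
sub l_table_offset {
    my ($i) = @_;
    return ntz($i) << 4;
}

# Advance a 16-byte offset by one block (assumed OCB rule):
#   offset_i = offset_{i-1} XOR L_[ntz(i)]
# $l_table is a reference to an array of 16-byte strings, a hypothetical
# stand-in for the L_ argument.
sub next_offset {
    my ($offset, $l_table, $i) = @_;
    return $offset ^ $l_table->[ntz($i)];
}

printf "block %d -> ntz %d, table offset %d\n",
    $_, ntz($_), l_table_offset($_) for 1 .. 8;
```

# Every odd block number has ntz == 0, which is why L_0 can be kept in a
# register for all odd-numbered blocks.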
my @offset=map("%xmm$_",(10..15));
my ($checksum,$rndkey0l)=("%xmm8","%xmm9");
my ($block_num,$offset_p)=("%r8","%r9");	# 5th and 6th arguments
my ($L_p,$checksum_p) = ("%rbx","%rbp");
my ($i1,$i3,$i5) = ("%r12","%r13","%r14");
my $seventh_arg = $win64 ? 56 : 8;
.globl	aesni_ocb_encrypt
.type	aesni_ocb_encrypt,\@function,6
$code.=<<___ if ($win64);
	lea	-0xa0(%rsp),%rsp
	movaps	%xmm6,0x00(%rsp)	# offload everything
	movaps	%xmm7,0x10(%rsp)
	movaps	%xmm8,0x20(%rsp)
	movaps	%xmm9,0x30(%rsp)
	movaps	%xmm10,0x40(%rsp)
	movaps	%xmm11,0x50(%rsp)
	movaps	%xmm12,0x60(%rsp)
	movaps	%xmm13,0x70(%rsp)
	movaps	%xmm14,0x80(%rsp)
	movaps	%xmm15,0x90(%rsp)
	mov	$seventh_arg(%rax),$L_p	# 7th argument
	mov	$seventh_arg+8(%rax),$checksum_p	# 8th argument
	mov	240($key),$rnds_
	$movkey	($key),$rndkey0l	# round[0]
	$movkey	16($key,$rnds_),$rndkey1	# round[last]
	movdqu	($offset_p),@offset[5]	# load last offset_i
	pxor	$rndkey1,$rndkey0l	# round[0] ^ round[last]
	pxor	$rndkey1,@offset[5]	# offset_i ^ round[last]
	lea	32($key_,$rnds_),$key
	$movkey	16($key_),$rndkey1	# round[1]
	sub	%r10,%rax	# twisted $rounds
	mov	%rax,%r10	# backup twisted $rounds
	movdqu	($L_p),@offset[0]	# L_0 for all odd-numbered blocks
	movdqu	($checksum_p),$checksum	# load checksum
	test	\$1,$block_num	# is first block number odd?
	movdqu	($L_p,$i1),$inout5	# borrow
	movdqu	($inp),$inout0
	movdqa	$inout5,@offset[5]
	movups	$inout0,($out)
	lea	1($block_num),$i1	# even-numbered blocks
	lea	3($block_num),$i3
	lea	5($block_num),$i5
	lea	6($block_num),$block_num
	bsf	$i1,$i1	# ntz(block)
	shl	\$4,$i1	# ntz(block) -> table offset
	jmp	.Locb_enc_grandloop
.Locb_enc_grandloop:
	movdqu	`16*0`($inp),$inout0	# load input
	movdqu	`16*1`($inp),$inout1
	movdqu	`16*2`($inp),$inout2
	movdqu	`16*3`($inp),$inout3
	movdqu	`16*4`($inp),$inout4
	movdqu	`16*5`($inp),$inout5
	lea	`16*6`($inp),$inp
	movups	$inout0,`16*0`($out)	# store output
	movups	$inout1,`16*1`($out)
	movups	$inout2,`16*2`($out)
	movups	$inout3,`16*3`($out)
	movups	$inout4,`16*4`($out)
	movups	$inout5,`16*5`($out)
	lea	`16*6`($out),$out
	jnc	.Locb_enc_grandloop
	movdqu	`16*0`($inp),$inout0
	movdqu	`16*1`($inp),$inout1
	movdqu	`16*2`($inp),$inout2
	movdqu	`16*3`($inp),$inout3
	movdqu	`16*4`($inp),$inout4
	pxor	$inout5,$inout5
	movdqa	@offset[4],@offset[5]
	movups	$inout0,`16*0`($out)
	movups	$inout1,`16*1`($out)
	movups	$inout2,`16*2`($out)
	movups	$inout3,`16*3`($out)
	movups	$inout4,`16*4`($out)
	movdqa	@offset[0],$inout5	# borrow
	movdqa	$inout5,@offset[5]
	movups	$inout0,`16*0`($out)
	pxor	$inout2,$inout2
	pxor	$inout3,$inout3
	movdqa	@offset[1],@offset[5]
	movups	$inout0,`16*0`($out)
	movups	$inout1,`16*1`($out)
	pxor	$inout3,$inout3
	movdqa	@offset[2],@offset[5]
	movups	$inout0,`16*0`($out)
	movups	$inout1,`16*1`($out)
	movups	$inout2,`16*2`($out)
	movdqa	@offset[3],@offset[5]
	movups	$inout0,`16*0`($out)
	movups	$inout1,`16*1`($out)
	movups	$inout2,`16*2`($out)
	movups	$inout3,`16*3`($out)
	pxor	$rndkey0,@offset[5]	# "remove" round[last]
	movdqu	$checksum,($checksum_p)	# store checksum
	movdqu	@offset[5],($offset_p)	# store last offset_i
	xorps	%xmm0,%xmm0	# clear register bank
$code.=<<___ if (!$win64);
$code.=<<___ if ($win64);
	movaps	0x00(%rsp),%xmm6
	movaps	%xmm0,0x00(%rsp)	# clear stack
	movaps	0x10(%rsp),%xmm7
	movaps	%xmm0,0x10(%rsp)
	movaps	0x20(%rsp),%xmm8
	movaps	%xmm0,0x20(%rsp)
	movaps	0x30(%rsp),%xmm9
	movaps	%xmm0,0x30(%rsp)
	movaps	0x40(%rsp),%xmm10
	movaps	%xmm0,0x40(%rsp)
	movaps	0x50(%rsp),%xmm11
	movaps	%xmm0,0x50(%rsp)
	movaps	0x60(%rsp),%xmm12
	movaps	%xmm0,0x60(%rsp)
	movaps	0x70(%rsp),%xmm13
	movaps	%xmm0,0x70(%rsp)
	movaps	0x80(%rsp),%xmm14
	movaps	%xmm0,0x80(%rsp)
	movaps	0x90(%rsp),%xmm15
	movaps	%xmm0,0x90(%rsp)
	lea	0xa0+0x28(%rsp),%rax
	.cfi_def_cfa_register	%rsp
.size	aesni_ocb_encrypt,.-aesni_ocb_encrypt
.type	__ocb_encrypt6,\@abi-omnipotent
	pxor	$rndkey0l,@offset[5]	# offset_i ^ round[0]
	movdqu	($L_p,$i1),@offset[1]
	movdqa	@offset[0],@offset[2]
	movdqu	($L_p,$i3),@offset[3]
	movdqa	@offset[0],@offset[4]
	pxor	@offset[5],@offset[0]
	movdqu	($L_p,$i5),@offset[5]
	pxor	@offset[0],@offset[1]
	pxor	$inout0,$checksum	# accumulate checksum
	pxor	@offset[0],$inout0	# input ^ round[0] ^ offset_i
	pxor	@offset[1],@offset[2]
	pxor	$inout1,$checksum
	pxor	@offset[1],$inout1
	pxor	@offset[2],@offset[3]
	pxor	$inout2,$checksum
	pxor	@offset[2],$inout2
	pxor	@offset[3],@offset[4]
	pxor	$inout3,$checksum
	pxor	@offset[3],$inout3
	pxor	@offset[4],@offset[5]
	pxor	$inout4,$checksum
	pxor	@offset[4],$inout4
	pxor	$inout5,$checksum
	pxor	@offset[5],$inout5
	$movkey	32($key_),$rndkey0
	lea	1($block_num),$i1	# even-numbered blocks
	lea	3($block_num),$i3
	lea	5($block_num),$i5
	pxor	$rndkey0l,@offset[0]	# offset_i ^ round[last]
	bsf	$i1,$i1	# ntz(block)
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	pxor	$rndkey0l,@offset[1]
	pxor	$rndkey0l,@offset[2]
	aesenc	$rndkey1,$inout4
	pxor	$rndkey0l,@offset[3]
	pxor	$rndkey0l,@offset[4]
	aesenc	$rndkey1,$inout5
	$movkey	48($key_),$rndkey1
	pxor	$rndkey0l,@offset[5]
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	$movkey	64($key_),$rndkey0
	shl	\$4,$i1	# ntz(block) -> table offset
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	$movkey	($key,%rax),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	aesenc	$rndkey0,$inout4
	aesenc	$rndkey0,$inout5
	$movkey	-16($key,%rax),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	aesenc	$rndkey1,$inout4
	aesenc	$rndkey1,$inout5
	$movkey	16($key_),$rndkey1
	aesenclast	@offset[0],$inout0
	movdqu	($L_p),@offset[0]	# L_0 for all odd-numbered blocks
	mov	%r10,%rax	# restore twisted rounds
	aesenclast	@offset[1],$inout1
	aesenclast	@offset[2],$inout2
	aesenclast	@offset[3],$inout3
	aesenclast	@offset[4],$inout4
	aesenclast	@offset[5],$inout5
.size	__ocb_encrypt6,.-__ocb_encrypt6
.type	__ocb_encrypt4,\@abi-omnipotent
	pxor	$rndkey0l,@offset[5]	# offset_i ^ round[0]
	movdqu	($L_p,$i1),@offset[1]
	movdqa	@offset[0],@offset[2]
	movdqu	($L_p,$i3),@offset[3]
	pxor	@offset[5],@offset[0]
	pxor	@offset[0],@offset[1]
	pxor	$inout0,$checksum	# accumulate checksum
	pxor	@offset[0],$inout0	# input ^ round[0] ^ offset_i
	pxor	@offset[1],@offset[2]
	pxor	$inout1,$checksum
	pxor	@offset[1],$inout1
	pxor	@offset[2],@offset[3]
	pxor	$inout2,$checksum
	pxor	@offset[2],$inout2
	pxor	$inout3,$checksum
	pxor	@offset[3],$inout3
	$movkey	32($key_),$rndkey0
	pxor	$rndkey0l,@offset[0]	# offset_i ^ round[last]
	pxor	$rndkey0l,@offset[1]
	pxor	$rndkey0l,@offset[2]
	pxor	$rndkey0l,@offset[3]
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	$movkey	48($key_),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	$movkey	64($key_),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	$movkey	($key,%rax),$rndkey1
	aesenc	$rndkey0,$inout0
	aesenc	$rndkey0,$inout1
	aesenc	$rndkey0,$inout2
	aesenc	$rndkey0,$inout3
	$movkey	-16($key,%rax),$rndkey0
	aesenc	$rndkey1,$inout0
	aesenc	$rndkey1,$inout1
	aesenc	$rndkey1,$inout2
	aesenc	$rndkey1,$inout3
	$movkey	16($key_),$rndkey1
	mov	%r10,%rax	# restore twisted rounds
	aesenclast	@offset[0],$inout0
	aesenclast	@offset[1],$inout1
	aesenclast	@offset[2],$inout2
	aesenclast	@offset[3],$inout3
.size	__ocb_encrypt4,.-__ocb_encrypt4
.type	__ocb_encrypt1,\@abi-omnipotent
	pxor	@offset[5],$inout5	# offset_i
	pxor	$rndkey0l,$inout5	# offset_i ^ round[0]
	pxor	$inout0,$checksum	# accumulate checksum
	pxor	$inout5,$inout0	# input ^ round[0] ^ offset_i
	$movkey	32($key_),$rndkey0
	aesenc	$rndkey1,$inout0
	$movkey	48($key_),$rndkey1
	pxor	$rndkey0l,$inout5	# offset_i ^ round[last]
	aesenc	$rndkey0,$inout0
	$movkey	64($key_),$rndkey0
	aesenc	$rndkey1,$inout0
	$movkey	($key,%rax),$rndkey1
	aesenc	$rndkey0,$inout0
	$movkey	-16($key,%rax),$rndkey0
	aesenc	$rndkey1,$inout0
	$movkey	16($key_),$rndkey1	# redundant in tail
	mov	%r10,%rax	# restore twisted rounds
	aesenclast	$inout5,$inout0
.size	__ocb_encrypt1,.-__ocb_encrypt1
.globl	aesni_ocb_decrypt
.type	aesni_ocb_decrypt,\@function,6
$code.=<<___ if ($win64);
	lea	-0xa0(%rsp),%rsp
	movaps	%xmm6,0x00(%rsp)	# offload everything
	movaps	%xmm7,0x10(%rsp)
	movaps	%xmm8,0x20(%rsp)
	movaps	%xmm9,0x30(%rsp)
	movaps	%xmm10,0x40(%rsp)
	movaps	%xmm11,0x50(%rsp)
	movaps	%xmm12,0x60(%rsp)
	movaps	%xmm13,0x70(%rsp)
	movaps	%xmm14,0x80(%rsp)
	movaps	%xmm15,0x90(%rsp)
	mov	$seventh_arg(%rax),$L_p	# 7th argument
	mov	$seventh_arg+8(%rax),$checksum_p	# 8th argument
	mov	240($key),$rnds_
	$movkey	($key),$rndkey0l	# round[0]
	$movkey	16($key,$rnds_),$rndkey1	# round[last]
	movdqu	($offset_p),@offset[5]	# load last offset_i
	pxor	$rndkey1,$rndkey0l	# round[0] ^ round[last]
	pxor	$rndkey1,@offset[5]	# offset_i ^ round[last]
	lea	32($key_,$rnds_),$key
	$movkey	16($key_),$rndkey1	# round[1]
	sub	%r10,%rax	# twisted $rounds
	mov	%rax,%r10	# backup twisted $rounds
	movdqu	($L_p),@offset[0]	# L_0 for all odd-numbered blocks
	movdqu	($checksum_p),$checksum	# load checksum
	test	\$1,$block_num	# is first block number odd?
	movdqu	($L_p,$i1),$inout5	# borrow
	movdqu	($inp),$inout0
	movdqa	$inout5,@offset[5]
	movups	$inout0,($out)
	xorps	$inout0,$checksum	# accumulate checksum
	lea	1($block_num),$i1	# even-numbered blocks
	lea	3($block_num),$i3
	lea	5($block_num),$i5
	lea	6($block_num),$block_num
	bsf	$i1,$i1	# ntz(block)
	shl	\$4,$i1	# ntz(block) -> table offset
	jmp	.Locb_dec_grandloop
.Locb_dec_grandloop:
	movdqu	`16*0`($inp),$inout0	# load input
	movdqu	`16*1`($inp),$inout1
	movdqu	`16*2`($inp),$inout2
	movdqu	`16*3`($inp),$inout3
	movdqu	`16*4`($inp),$inout4
	movdqu	`16*5`($inp),$inout5
	lea	`16*6`($inp),$inp
	movups	$inout0,`16*0`($out)	# store output
	pxor	$inout0,$checksum	# accumulate checksum
	movups	$inout1,`16*1`($out)
	pxor	$inout1,$checksum
	movups	$inout2,`16*2`($out)
	pxor	$inout2,$checksum
	movups	$inout3,`16*3`($out)
	pxor	$inout3,$checksum
	movups	$inout4,`16*4`($out)
	pxor	$inout4,$checksum
	movups	$inout5,`16*5`($out)
	pxor	$inout5,$checksum
	lea	`16*6`($out),$out
	jnc	.Locb_dec_grandloop
	movdqu	`16*0`($inp),$inout0
	movdqu	`16*1`($inp),$inout1
	movdqu	`16*2`($inp),$inout2
	movdqu	`16*3`($inp),$inout3
	movdqu	`16*4`($inp),$inout4
	pxor	$inout5,$inout5
	movdqa	@offset[4],@offset[5]
	movups	$inout0,`16*0`($out)	# store output
	pxor	$inout0,$checksum	# accumulate checksum
	movups	$inout1,`16*1`($out)
	pxor	$inout1,$checksum
	movups	$inout2,`16*2`($out)
	pxor	$inout2,$checksum
	movups	$inout3,`16*3`($out)
	pxor	$inout3,$checksum
	movups	$inout4,`16*4`($out)
	pxor	$inout4,$checksum
	movdqa	@offset[0],$inout5	# borrow
	movdqa	$inout5,@offset[5]
	movups	$inout0,`16*0`($out)	# store output
	xorps	$inout0,$checksum	# accumulate checksum
	pxor	$inout2,$inout2
	pxor	$inout3,$inout3
	movdqa	@offset[1],@offset[5]
	movups	$inout0,`16*0`($out)	# store output
	xorps	$inout0,$checksum	# accumulate checksum
	movups	$inout1,`16*1`($out)
	xorps	$inout1,$checksum
	pxor	$inout3,$inout3
	movdqa	@offset[2],@offset[5]
	movups	$inout0,`16*0`($out)	# store output
	xorps	$inout0,$checksum	# accumulate checksum
	movups	$inout1,`16*1`($out)
	xorps	$inout1,$checksum
	movups	$inout2,`16*2`($out)
	xorps	$inout2,$checksum
	movdqa	@offset[3],@offset[5]
	movups	$inout0,`16*0`($out)	# store output
	pxor	$inout0,$checksum	# accumulate checksum
	movups	$inout1,`16*1`($out)
	pxor	$inout1,$checksum
	movups	$inout2,`16*2`($out)
	pxor	$inout2,$checksum
	movups	$inout3,`16*3`($out)
	pxor	$inout3,$checksum
	pxor	$rndkey0,@offset[5]	# "remove" round[last]
	movdqu	$checksum,($checksum_p)	# store checksum
	movdqu	@offset[5],($offset_p)	# store last offset_i
	xorps	%xmm0,%xmm0	# clear register bank
$code.=<<___ if (!$win64);
$code.=<<___ if ($win64);
	movaps	0x00(%rsp),%xmm6
	movaps	%xmm0,0x00(%rsp)	# clear stack
	movaps	0x10(%rsp),%xmm7
	movaps	%xmm0,0x10(%rsp)
	movaps	0x20(%rsp),%xmm8
	movaps	%xmm0,0x20(%rsp)
	movaps	0x30(%rsp),%xmm9
	movaps	%xmm0,0x30(%rsp)
	movaps	0x40(%rsp),%xmm10
	movaps	%xmm0,0x40(%rsp)
	movaps	0x50(%rsp),%xmm11
	movaps	%xmm0,0x50(%rsp)
	movaps	0x60(%rsp),%xmm12
	movaps	%xmm0,0x60(%rsp)
	movaps	0x70(%rsp),%xmm13
	movaps	%xmm0,0x70(%rsp)
	movaps	0x80(%rsp),%xmm14
	movaps	%xmm0,0x80(%rsp)
	movaps	0x90(%rsp),%xmm15
	movaps	%xmm0,0x90(%rsp)
	lea	0xa0+0x28(%rsp),%rax
	.cfi_def_cfa_register	%rsp
.size	aesni_ocb_decrypt,.-aesni_ocb_decrypt
.type	__ocb_decrypt6,\@abi-omnipotent
	pxor	$rndkey0l,@offset[5]	# offset_i ^ round[0]
	movdqu	($L_p,$i1),@offset[1]
	movdqa	@offset[0],@offset[2]
	movdqu	($L_p,$i3),@offset[3]
	movdqa	@offset[0],@offset[4]
	pxor	@offset[5],@offset[0]
	movdqu	($L_p,$i5),@offset[5]
	pxor	@offset[0],@offset[1]
	pxor	@offset[0],$inout0	# input ^ round[0] ^ offset_i
	pxor	@offset[1],@offset[2]
	pxor	@offset[1],$inout1
	pxor	@offset[2],@offset[3]
	pxor	@offset[2],$inout2
	pxor	@offset[3],@offset[4]
	pxor	@offset[3],$inout3
	pxor	@offset[4],@offset[5]
	pxor	@offset[4],$inout4
	pxor	@offset[5],$inout5
	$movkey	32($key_),$rndkey0
	lea	1($block_num),$i1	# even-numbered blocks
	lea	3($block_num),$i3
	lea	5($block_num),$i5
	pxor	$rndkey0l,@offset[0]	# offset_i ^ round[last]
	bsf	$i1,$i1	# ntz(block)
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	pxor	$rndkey0l,@offset[1]
	pxor	$rndkey0l,@offset[2]
	aesdec	$rndkey1,$inout4
	pxor	$rndkey0l,@offset[3]
	pxor	$rndkey0l,@offset[4]
	aesdec	$rndkey1,$inout5
	$movkey	48($key_),$rndkey1
	pxor	$rndkey0l,@offset[5]
	aesdec	$rndkey0,$inout0
	aesdec	$rndkey0,$inout1
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	aesdec	$rndkey0,$inout4
	aesdec	$rndkey0,$inout5
	$movkey	64($key_),$rndkey0
	shl	\$4,$i1	# ntz(block) -> table offset
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	aesdec	$rndkey1,$inout4
	aesdec	$rndkey1,$inout5
	$movkey	($key,%rax),$rndkey1
	aesdec	$rndkey0,$inout0
	aesdec	$rndkey0,$inout1
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	aesdec	$rndkey0,$inout4
	aesdec	$rndkey0,$inout5
	$movkey	-16($key,%rax),$rndkey0
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	aesdec	$rndkey1,$inout4
	aesdec	$rndkey1,$inout5
	$movkey	16($key_),$rndkey1
	aesdeclast	@offset[0],$inout0
	movdqu	($L_p),@offset[0]	# L_0 for all odd-numbered blocks
	mov	%r10,%rax	# restore twisted rounds
	aesdeclast	@offset[1],$inout1
	aesdeclast	@offset[2],$inout2
	aesdeclast	@offset[3],$inout3
	aesdeclast	@offset[4],$inout4
	aesdeclast	@offset[5],$inout5
.size	__ocb_decrypt6,.-__ocb_decrypt6
.type	__ocb_decrypt4,\@abi-omnipotent
	pxor	$rndkey0l,@offset[5]	# offset_i ^ round[0]
	movdqu	($L_p,$i1),@offset[1]
	movdqa	@offset[0],@offset[2]
	movdqu	($L_p,$i3),@offset[3]
	pxor	@offset[5],@offset[0]
	pxor	@offset[0],@offset[1]
	pxor	@offset[0],$inout0	# input ^ round[0] ^ offset_i
	pxor	@offset[1],@offset[2]
	pxor	@offset[1],$inout1
	pxor	@offset[2],@offset[3]
	pxor	@offset[2],$inout2
	pxor	@offset[3],$inout3
	$movkey	32($key_),$rndkey0
	pxor	$rndkey0l,@offset[0]	# offset_i ^ round[last]
	pxor	$rndkey0l,@offset[1]
	pxor	$rndkey0l,@offset[2]
	pxor	$rndkey0l,@offset[3]
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	$movkey	48($key_),$rndkey1
	aesdec	$rndkey0,$inout0
	aesdec	$rndkey0,$inout1
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	$movkey	64($key_),$rndkey0
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	$movkey	($key,%rax),$rndkey1
	aesdec	$rndkey0,$inout0
	aesdec	$rndkey0,$inout1
	aesdec	$rndkey0,$inout2
	aesdec	$rndkey0,$inout3
	$movkey	-16($key,%rax),$rndkey0
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	$movkey	16($key_),$rndkey1
	mov	%r10,%rax	# restore twisted rounds
	aesdeclast	@offset[0],$inout0
	aesdeclast	@offset[1],$inout1
	aesdeclast	@offset[2],$inout2
	aesdeclast	@offset[3],$inout3
.size	__ocb_decrypt4,.-__ocb_decrypt4
.type	__ocb_decrypt1,\@abi-omnipotent
	pxor	@offset[5],$inout5	# offset_i
	pxor	$rndkey0l,$inout5	# offset_i ^ round[0]
	pxor	$inout5,$inout0	# input ^ round[0] ^ offset_i
	$movkey	32($key_),$rndkey0
	aesdec	$rndkey1,$inout0
	$movkey	48($key_),$rndkey1
	pxor	$rndkey0l,$inout5	# offset_i ^ round[last]
	aesdec	$rndkey0,$inout0
	$movkey	64($key_),$rndkey0
	aesdec	$rndkey1,$inout0
	$movkey	($key,%rax),$rndkey1
	aesdec	$rndkey0,$inout0
	$movkey	-16($key,%rax),$rndkey0
	aesdec	$rndkey1,$inout0
	$movkey	16($key_),$rndkey1	# redundant in tail
	mov	%r10,%rax	# restore twisted rounds
	aesdeclast	$inout5,$inout0
.size	__ocb_decrypt1,.-__ocb_decrypt1
########################################################################
# void $PREFIX_cbc_encrypt (const void *inp, void *out,
#			    size_t length, const AES_KEY *key,
#			    unsigned char *ivp,const int enc);
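#
# The decrypt path below keeps each ciphertext block as the "future iv"
# before decrypting it, then XORs the decrypted block with the previous
# IV. A minimal, non-authoritative sketch of that chaining in plain Perl
# follows. The names cbc_decrypt and $toy_decrypt are hypothetical, and
# the XOR-with-a-pad "cipher" is only a stand-in for a raw AES block
# decryption; it is not secure and is not what this module emits.

```perl
use strict;
use warnings;

# CBC decryption over whole 16-byte blocks.
# $decrypt_block is a code ref standing in for one raw block decryption
# (hypothetical; supplied by the caller). Returns (plaintext, next IV).
sub cbc_decrypt {
    my ($decrypt_block, $ciphertext, $iv) = @_;
    die "length must be a multiple of 16\n" if length($ciphertext) % 16;
    my $plaintext = '';
    for (my $off = 0; $off < length($ciphertext); $off += 16) {
        my $c = substr($ciphertext, $off, 16);
        $plaintext .= $decrypt_block->($c) ^ $iv;   # P_i = D(C_i) XOR C_{i-1}
        $iv = $c;                                   # this ciphertext is the future iv
    }
    return ($plaintext, $iv);
}

# Toy stand-in for a block cipher: XOR with a fixed pad (illustration only).
my $pad         = "\x5a" x 16;
my $toy_decrypt = sub { $_[0] ^ $pad };

# Build two chained ciphertext blocks with the matching toy "encryption".
my $iv = "\x01" x 16;
my ($p1, $p2) = ("A" x 16, "B" x 16);
my $c1 = ($p1 ^ $iv) ^ $pad;
my $c2 = ($p2 ^ $c1) ^ $pad;

my ($plain, $next_iv) = cbc_decrypt($toy_decrypt, $c1 . $c2, $iv);
print $plain eq $p1 . $p2 ? "round trip ok\n" : "mismatch\n";
```

# Because the IV for block i is just ciphertext block i-1, the blocks can
# be decrypted independently, which is what allows the bulk path to work
# on several blocks at once.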
my $frame_size = 0x10 + ($win64?0xa0:0);	# used in decrypt
my ($iv,$in0,$in1,$in2,$in3,$in4)=map("%xmm$_",(10..15));
.globl	${PREFIX}_cbc_encrypt
.type	${PREFIX}_cbc_encrypt,\@function,6
${PREFIX}_cbc_encrypt:
	test	$len,$len	# check length
	mov	240($key),$rnds_	# key->rounds
	mov	$key,$key_	# backup $key
	test	%r9d,%r9d	# 6th argument
#--------------------------- CBC ENCRYPT ------------------------------#
	movups	($ivp),$inout0	# load iv as initial state
	movups	($inp),$inout1	# load input
	#xorps	$inout1,$inout0
	&aesni_generate1("enc",$key,$rounds,$inout0,$inout1);
	mov	$rnds_,$rounds	# restore $rounds
	mov	$key_,$key	# restore $key
	movups	$inout0,0($out)	# store output
	pxor	$rndkey0,$rndkey0	# clear register bank
	pxor	$rndkey1,$rndkey1
	movups	$inout0,($ivp)
	pxor	$inout0,$inout0
	pxor	$inout1,$inout1
	mov	$len,%rcx	# zaps $key
	xchg	$inp,$out	# $inp is %rsi and $out is %rdi now
	.long	0x9066A4F3	# rep movsb
	mov	\$16,%ecx	# zero tail
	.long	0x9066AAF3	# rep stosb
	lea	-16(%rdi),%rdi	# rewind $out by 1 block
	mov	$rnds_,$rounds	# restore $rounds
	mov	%rdi,%rsi	# $inp and $out are the same
	mov	$key_,$key	# restore $key
	xor	$len,$len	# len=16
	jmp	.Lcbc_enc_loop	# one more spin
\f#--------------------------- CBC DECRYPT ------------------------------#
	jne	.Lcbc_decrypt_bulk
	# handle single block without allocating stack frame,
	# useful in ciphertext stealing mode
	movdqu	($inp),$inout0	# load input
	movdqu	($ivp),$inout1	# load iv
	movdqa	$inout0,$inout2	# future iv
	&aesni_generate1("dec",$key,$rnds_);
	pxor	$rndkey0,$rndkey0	# clear register bank
	pxor	$rndkey1,$rndkey1
	movdqu	$inout2,($ivp)	# store iv
	xorps	$inout1,$inout0	# ^=iv
	pxor	$inout1,$inout1
	movups	$inout0,($out)	# store output
	pxor	$inout0,$inout0
	lea	(%rsp),%r11	# frame pointer
	.cfi_def_cfa_register	%r11
	sub	\$$frame_size,%rsp
	and	\$-16,%rsp	# Linux kernel stack can be incorrectly seeded
$code.=<<___ if ($win64);
	movaps	%xmm6,0x10(%rsp)
	movaps	%xmm7,0x20(%rsp)
	movaps	%xmm8,0x30(%rsp)
	movaps	%xmm9,0x40(%rsp)
	movaps	%xmm10,0x50(%rsp)
	movaps	%xmm11,0x60(%rsp)
	movaps	%xmm12,0x70(%rsp)
	movaps	%xmm13,0x80(%rsp)
	movaps	%xmm14,0x90(%rsp)
	movaps	%xmm15,0xa0(%rsp)
my $inp_=$key_="%rbp";	# reassign $key_
	mov	$key,$key_	# [re-]backup $key [after reassignment]
	$movkey	($key),$rndkey0
	movdqu	0x00($inp),$inout0	# load input
	movdqu	0x10($inp),$inout1
	movdqu	0x20($inp),$inout2
	movdqu	0x30($inp),$inout3
	movdqu	0x40($inp),$inout4
	movdqu	0x50($inp),$inout5
	mov	OPENSSL_ia32cap_P+4(%rip),%r9d
	jbe	.Lcbc_dec_six_or_seven
	and	\$`1<<26|1<<22`,%r9d	# isolate XSAVE+MOVBE
	sub	\$0x50,$len	# $len is biased by -5*16
	cmp	\$`1<<22`,%r9d	# check for MOVBE without XSAVE
	je	.Lcbc_dec_loop6_enter	# [which denotes Atom Silvermont]
	sub	\$0x20,$len	# $len is biased by -7*16
	lea	0x70($key),$key	# size optimization
	jmp	.Lcbc_dec_loop8_enter
	movups	$inout7,($out)
.Lcbc_dec_loop8_enter:
	movdqu	0x60($inp),$inout6
	pxor	$rndkey0,$inout0
	movdqu	0x70($inp),$inout7
	pxor	$rndkey0,$inout1
	$movkey	0x10-0x70($key),$rndkey1
	pxor	$rndkey0,$inout2
	cmp	\$0x70,$len	# is there at least 0x60 bytes ahead?
	pxor	$rndkey0,$inout3
	pxor	$rndkey0,$inout4
	pxor	$rndkey0,$inout5
	pxor	$rndkey0,$inout6
	aesdec	$rndkey1,$inout0
	pxor	$rndkey0,$inout7
	$movkey	0x20-0x70($key),$rndkey0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	aesdec	$rndkey1,$inout4
	aesdec	$rndkey1,$inout5
	aesdec	$rndkey1,$inout6
	aesdec	$rndkey1,$inout7
	$movkey	0x30-0x70($key),$rndkey1
for($i=1;$i<12;$i++) {
my $rndkeyx = ($i&1)?$rndkey0:$rndkey1;
$code.=<<___	if ($i==7);
	aesdec	$rndkeyx,$inout0
	aesdec	$rndkeyx,$inout1
	aesdec	$rndkeyx,$inout2
	aesdec	$rndkeyx,$inout3
	aesdec	$rndkeyx,$inout4
	aesdec	$rndkeyx,$inout5
	aesdec	$rndkeyx,$inout6
	aesdec	$rndkeyx,$inout7
	$movkey	`0x30+0x10*$i`-0x70($key),$rndkeyx
$code.=<<___	if ($i<6 || (!($i&1) && $i>7));
$code.=<<___	if ($i==7);
$code.=<<___	if ($i==9);
$code.=<<___	if ($i==11);
	aesdec	$rndkey1,$inout0
	aesdec	$rndkey1,$inout1
	aesdec	$rndkey1,$inout2
	aesdec	$rndkey1,$inout3
	aesdec	$rndkey1,$inout4
	aesdec	$rndkey1,$inout5
	aesdec	$rndkey1,$inout6
	aesdec	$rndkey1,$inout7
	movdqu	0x50($inp),$rndkey1
	aesdeclast	$iv,$inout0
	movdqu	0x60($inp),$iv	# borrow $iv
	pxor	$rndkey0,$rndkey1
	aesdeclast	$in0,$inout1
	movdqu	0x70($inp),$rndkey0	# next IV
	aesdeclast	$in1,$inout2
	movdqu	0x00($inp_),$in0
	aesdeclast	$in2,$inout3
	aesdeclast	$in3,$inout4
	movdqu	0x10($inp_),$in1
	movdqu	0x20($inp_),$in2
	aesdeclast	$in4,$inout5
	aesdeclast	$rndkey1,$inout6
	movdqu	0x30($inp_),$in3
	movdqu	0x40($inp_),$in4
	aesdeclast	$iv,$inout7
	movdqa	$rndkey0,$iv	# return $iv
	movdqu	0x50($inp_),$rndkey1
	$movkey	-0x70($key),$rndkey0
	movups	$inout0,($out)	# store output
	movups	$inout1,0x10($out)
	movups	$inout2,0x20($out)
	movups	$inout3,0x30($out)
	movups	$inout4,0x40($out)
	movups	$inout5,0x50($out)
	movdqa	$rndkey1,$inout5
	movups	$inout6,0x60($out)
	movaps	$inout7,$inout0
	lea	-0x70($key),$key
	jle	.Lcbc_dec_clear_tail_collected
	movups	$inout7,($out)
.Lcbc_dec_six_or_seven:
	movaps	$inout5,$inout6
	call	_aesni_decrypt6
	pxor	$iv,$inout0	# ^= IV
	movdqu	$inout0,($out)
	movdqu	$inout1,0x10($out)
	pxor	$inout1,$inout1	# clear register bank
	movdqu	$inout2,0x20($out)
	pxor	$inout2,$inout2
	movdqu	$inout3,0x30($out)
	pxor	$inout3,$inout3
	movdqu	$inout4,0x40($out)
	pxor	$inout4,$inout4
	movdqa	$inout5,$inout0
	pxor	$inout5,$inout5
	jmp	.Lcbc_dec_tail_collected
	movups	0x60($inp),$inout6
	xorps	$inout7,$inout7
	call	_aesni_decrypt8
	movups	0x50($inp),$inout7
	pxor	$iv,$inout0	# ^= IV
	movups	0x60($inp),$iv
	movdqu	$inout0,($out)
	movdqu	$inout1,0x10($out)
	pxor	$inout1,$inout1	# clear register bank
	movdqu	$inout2,0x20($out)
	pxor	$inout2,$inout2
	movdqu	$inout3,0x30($out)
	pxor	$inout3,$inout3
	movdqu	$inout4,0x40($out)
	pxor	$inout4,$inout4
	pxor	$inout7,$inout6
	movdqu	$inout5,0x50($out)
	pxor	$inout5,$inout5
	movdqa	$inout6,$inout0
	pxor	$inout6,$inout6
	pxor	$inout7,$inout7
	jmp	.Lcbc_dec_tail_collected
	movups	$inout5,($out)
	movdqu	0x00($inp),$inout0	# load input
	movdqu	0x10($inp),$inout1
	movdqu	0x20($inp),$inout2
	movdqu	0x30($inp),$inout3
	movdqu	0x40($inp),$inout4
	movdqu	0x50($inp),$inout5
.Lcbc_dec_loop6_enter:
	movdqa	$inout5,$inout6
	call	_aesni_decrypt6
	pxor	$iv,$inout0		# ^= IV
	movdqu	$inout0,($out)
	movdqu	$inout1,0x10($out)
	movdqu	$inout2,0x20($out)
	movdqu	$inout3,0x30($out)
	movdqu	$inout4,0x40($out)
	movdqa	$inout5,$inout0
	jle	.Lcbc_dec_clear_tail_collected
	movups	$inout5,($out)
	movups	($inp),$inout0
	jbe	.Lcbc_dec_one			# $len is 1*16 or less
	movups	0x10($inp),$inout1
	jbe	.Lcbc_dec_two			# $len is 2*16 or less
	movups	0x20($inp),$inout2
	jbe	.Lcbc_dec_three			# $len is 3*16 or less
	movups	0x30($inp),$inout3
	jbe	.Lcbc_dec_four			# $len is 4*16 or less
	movups	0x40($inp),$inout4		# $len is 5*16 or less
	xorps	$inout5,$inout5
	call	_aesni_decrypt6
	movdqu	$inout0,($out)
	movdqu	$inout1,0x10($out)
	pxor	$inout1,$inout1		# clear register bank
	movdqu	$inout2,0x20($out)
	pxor	$inout2,$inout2
	movdqu	$inout3,0x30($out)
	pxor	$inout3,$inout3
	movdqa	$inout4,$inout0
	pxor	$inout4,$inout4
	pxor	$inout5,$inout5
	jmp	.Lcbc_dec_tail_collected
	&aesni_generate1("dec",$key,$rounds);
	jmp	.Lcbc_dec_tail_collected
	call	_aesni_decrypt2
	movdqu	$inout0,($out)
	movdqa	$inout1,$inout0
	pxor	$inout1,$inout1		# clear register bank
	jmp	.Lcbc_dec_tail_collected
	call	_aesni_decrypt3
	movdqu	$inout0,($out)
	movdqu	$inout1,0x10($out)
	pxor	$inout1,$inout1		# clear register bank
	movdqa	$inout2,$inout0
	pxor	$inout2,$inout2
	jmp	.Lcbc_dec_tail_collected
	call	_aesni_decrypt4
	movdqu	$inout0,($out)
	movdqu	$inout1,0x10($out)
	pxor	$inout1,$inout1		# clear register bank
	movdqu	$inout2,0x20($out)
	pxor	$inout2,$inout2
	movdqa	$inout3,$inout0
	pxor	$inout3,$inout3
	jmp	.Lcbc_dec_tail_collected
.Lcbc_dec_clear_tail_collected:
	pxor	$inout1,$inout1		# clear register bank
	pxor	$inout2,$inout2
	pxor	$inout3,$inout3
$code.=<<___ if (!$win64);
	pxor	$inout4,$inout4		# %xmm6..9
	pxor	$inout5,$inout5
	pxor	$inout6,$inout6
	pxor	$inout7,$inout7
.Lcbc_dec_tail_collected:
	jnz	.Lcbc_dec_tail_partial
	movups	$inout0,($out)
	pxor	$inout0,$inout0
.Lcbc_dec_tail_partial:
	movaps	$inout0,(%rsp)
	pxor	$inout0,$inout0
	.long	0x9066A4F3		# rep movsb
	movdqa	$inout0,(%rsp)
	xorps	$rndkey0,$rndkey0	# %xmm0
	pxor	$rndkey1,$rndkey1
$code.=<<___ if ($win64);
	movaps	0x10(%rsp),%xmm6
	movaps	%xmm0,0x10(%rsp)	# clear stack
	movaps	0x20(%rsp),%xmm7
	movaps	%xmm0,0x20(%rsp)
	movaps	0x30(%rsp),%xmm8
	movaps	%xmm0,0x30(%rsp)
	movaps	0x40(%rsp),%xmm9
	movaps	%xmm0,0x40(%rsp)
	movaps	0x50(%rsp),%xmm10
	movaps	%xmm0,0x50(%rsp)
	movaps	0x60(%rsp),%xmm11
	movaps	%xmm0,0x60(%rsp)
	movaps	0x70(%rsp),%xmm12
	movaps	%xmm0,0x70(%rsp)
	movaps	0x80(%rsp),%xmm13
	movaps	%xmm0,0x80(%rsp)
	movaps	0x90(%rsp),%xmm14
	movaps	%xmm0,0x90(%rsp)
	movaps	0xa0(%rsp),%xmm15
	movaps	%xmm0,0xa0(%rsp)
.cfi_def_cfa_register	%rsp
.size	${PREFIX}_cbc_encrypt,.-${PREFIX}_cbc_encrypt
# int ${PREFIX}_set_decrypt_key(const unsigned char *inp,
#				int bits, AES_KEY *key)
#
# input:	$inp	user-supplied key
#		$bits	$inp length in bits
#		$key	pointer to key schedule
# output:	%eax	0 denoting success, -1 or -2 - failure (see C)
#		*$key	key schedule
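#
# For orientation only, a minimal C sketch of the schedule inversion
# this routine performs. It assumes the decrypt schedule is the encrypt
# schedule with round keys in reverse order and aesimc applied to all
# but the two outermost keys; invert_schedule() and its arguments are
# hypothetical names, not part of this module.
#
#	#include <wmmintrin.h>
#
#	static void invert_schedule(__m128i *rk, int rounds)
#	{
#	    int i, j;
#	    __m128i t = rk[0];			/* just swap */
#	    rk[0] = rk[rounds];
#	    rk[rounds] = t;
#	    for (i = 1, j = rounds - 1; i < j; i++, j--) {
#		t = _mm_aesimc_si128(rk[i]);	/* swap and inverse */
#		rk[i] = _mm_aesimc_si128(rk[j]);
#		rk[j] = t;
#	    }
#	    rk[i] = _mm_aesimc_si128(rk[i]);	/* inverse middle */
#	}
#
# rounds is 10, 12 or 14, so the two indices always meet at a single
# middle key.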
{ my ($inp,$bits,$key) = @_4args;
.globl	${PREFIX}_set_decrypt_key
.type	${PREFIX}_set_decrypt_key,\@abi-omnipotent
${PREFIX}_set_decrypt_key:
	.byte	0x48,0x83,0xEC,0x08	# sub rsp,8
.cfi_adjust_cfa_offset	8
	call	__aesni_set_encrypt_key
	shl	\$4,$bits		# rounds-1 after _aesni_set_encrypt_key
	lea	16($key,$bits),$inp	# points at the end of key schedule
	$movkey	($key),%xmm0		# just swap
	$movkey	($inp),%xmm1
	$movkey	%xmm0,($inp)
	$movkey	%xmm1,($key)
	$movkey	($key),%xmm0		# swap and inverse
	$movkey	($inp),%xmm1
	$movkey	%xmm0,16($inp)
	$movkey	%xmm1,-16($key)
	ja	.Ldec_key_inverse
	$movkey	($key),%xmm0		# inverse middle
	$movkey	%xmm0,($inp)
.cfi_adjust_cfa_offset	-8
.LSEH_end_set_decrypt_key:
.size	${PREFIX}_set_decrypt_key,.-${PREFIX}_set_decrypt_key
# This is based on a submission from Intel by
# Aggressively optimized with respect to aeskeygenassist's critical path
# and confined to %xmm0-5 to meet the Win64 ABI requirement.
#
# int ${PREFIX}_set_encrypt_key(const unsigned char *inp,
#				int bits, AES_KEY * const key);
#
# input:	$inp	user-supplied key
#		$bits	$inp length in bits
#		$key	pointer to key schedule
# output:	%eax	0 denoting success, -1 or -2 - failure (see C)
#		$bits	rounds-1 (used in aesni_set_decrypt_key)
#		*$key	key schedule
#		$key	pointer to key schedule (used in
#			aesni_set_decrypt_key)
#
# Subroutine is frame-less, which means that only volatile registers
# are used. Note that it's declared "abi-omnipotent", which means that
# the number of volatile registers is smaller on Windows.
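#
# For orientation only, a C sketch of one AES-128 expansion round in the
# textbook aeskeygenassist formulation. This is not the shufps-based
# sequence used in this module, which shortens the critical path;
# expand_step() is a hypothetical helper name.
#
#	#include <wmmintrin.h>
#
#	static __m128i expand_step(__m128i key, __m128i assist)
#	{
#	    assist = _mm_shuffle_epi32(assist, 0xff);	/* broadcast top dword */
#	    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
#	    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
#	    key = _mm_xor_si128(key, _mm_slli_si128(key, 4));
#	    return _mm_xor_si128(key, assist);
#	}
#
#	/* rk[1] = expand_step(rk[0], _mm_aeskeygenassist_si128(rk[0], 0x01)); */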
.globl	${PREFIX}_set_encrypt_key
.type	${PREFIX}_set_encrypt_key,\@abi-omnipotent
${PREFIX}_set_encrypt_key:
__aesni_set_encrypt_key:
	.byte	0x48,0x83,0xEC,0x08	# sub rsp,8
.cfi_adjust_cfa_offset	8
	mov	\$`1<<28|1<<11`,%r10d	# AVX and XOP bits
	movups	($inp),%xmm0		# pull first 128 bits of *userKey
	xorps	%xmm4,%xmm4		# low dword of xmm4 is assumed 0
	and	OPENSSL_ia32cap_P+4(%rip),%r10d
	lea	16($key),%rax		# %rax is used as modifiable copy of $key
	mov	\$9,$bits			# 10 rounds for 128-bit key
	cmp	\$`1<<28`,%r10d		# AVX, but no XOP
	$movkey	%xmm0,($key)			# round 0
	aeskeygenassist	\$0x1,%xmm0,%xmm1	# round 1
	call		.Lkey_expansion_128_cold
	aeskeygenassist	\$0x2,%xmm0,%xmm1	# round 2
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x4,%xmm0,%xmm1	# round 3
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x8,%xmm0,%xmm1	# round 4
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x10,%xmm0,%xmm1	# round 5
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x20,%xmm0,%xmm1	# round 6
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x40,%xmm0,%xmm1	# round 7
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x80,%xmm0,%xmm1	# round 8
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x1b,%xmm0,%xmm1	# round 9
	call		.Lkey_expansion_128
	aeskeygenassist	\$0x36,%xmm0,%xmm1	# round 10
	call		.Lkey_expansion_128
	$movkey	%xmm0,(%rax)
	mov	$bits,80(%rax)	# 240(%rdx)
	movdqa	.Lkey_rotate(%rip),%xmm5
	movdqa	.Lkey_rcon1(%rip),%xmm4
	aesenclast	%xmm4,%xmm0
	movdqu	%xmm0,-16(%rax)
	movdqa	.Lkey_rcon1b(%rip),%xmm4
	aesenclast	%xmm4,%xmm0
	aesenclast	%xmm4,%xmm0
	movdqu	%xmm0,16(%rax)
	mov	$bits,96(%rax)	# 240($key)
	movq	16($inp),%xmm2			# remaining 1/3 of *userKey
	mov	\$11,$bits			# 12 rounds for 192
	cmp	\$`1<<28`,%r10d			# AVX, but no XOP
	$movkey	%xmm0,($key)			# round 0
	aeskeygenassist	\$0x1,%xmm2,%xmm1	# round 1,2
	call		.Lkey_expansion_192a_cold
	aeskeygenassist	\$0x2,%xmm2,%xmm1	# round 2,3
	call		.Lkey_expansion_192b
	aeskeygenassist	\$0x4,%xmm2,%xmm1	# round 4,5
	call		.Lkey_expansion_192a
	aeskeygenassist	\$0x8,%xmm2,%xmm1	# round 5,6
	call		.Lkey_expansion_192b
	aeskeygenassist	\$0x10,%xmm2,%xmm1	# round 7,8
	call		.Lkey_expansion_192a
	aeskeygenassist	\$0x20,%xmm2,%xmm1	# round 8,9
	call		.Lkey_expansion_192b
	aeskeygenassist	\$0x40,%xmm2,%xmm1	# round 10,11
	call		.Lkey_expansion_192a
	aeskeygenassist	\$0x80,%xmm2,%xmm1	# round 11,12
	call		.Lkey_expansion_192b
	$movkey	%xmm0,(%rax)
	mov	$bits,48(%rax)	# 240(%rdx)
	movdqa	.Lkey_rotate192(%rip),%xmm5
	movdqa	.Lkey_rcon1(%rip),%xmm4
	aesenclast	%xmm4,%xmm2
	pshufd	\$0xff,%xmm0,%xmm3
	movdqu	%xmm0,-16(%rax)
	mov	$bits,32(%rax)	# 240($key)
	movups	16($inp),%xmm2			# remaining half of *userKey
	mov	\$13,$bits			# 14 rounds for 256
	cmp	\$`1<<28`,%r10d			# AVX, but no XOP
	$movkey	%xmm0,($key)			# round 0
	$movkey	%xmm2,16($key)			# round 1
	aeskeygenassist	\$0x1,%xmm2,%xmm1	# round 2
	call		.Lkey_expansion_256a_cold
	aeskeygenassist	\$0x1,%xmm0,%xmm1	# round 3
	call		.Lkey_expansion_256b
	aeskeygenassist	\$0x2,%xmm2,%xmm1	# round 4
	call		.Lkey_expansion_256a
	aeskeygenassist	\$0x2,%xmm0,%xmm1	# round 5
	call		.Lkey_expansion_256b
	aeskeygenassist	\$0x4,%xmm2,%xmm1	# round 6
	call		.Lkey_expansion_256a
	aeskeygenassist	\$0x4,%xmm0,%xmm1	# round 7
	call		.Lkey_expansion_256b
	aeskeygenassist	\$0x8,%xmm2,%xmm1	# round 8
	call		.Lkey_expansion_256a
	aeskeygenassist	\$0x8,%xmm0,%xmm1	# round 9
	call		.Lkey_expansion_256b
	aeskeygenassist	\$0x10,%xmm2,%xmm1	# round 10
	call		.Lkey_expansion_256a
	aeskeygenassist	\$0x10,%xmm0,%xmm1	# round 11
	call		.Lkey_expansion_256b
	aeskeygenassist	\$0x20,%xmm2,%xmm1	# round 12
	call		.Lkey_expansion_256a
	aeskeygenassist	\$0x20,%xmm0,%xmm1	# round 13
	call		.Lkey_expansion_256b
	aeskeygenassist	\$0x40,%xmm2,%xmm1	# round 14
	call		.Lkey_expansion_256a
	$movkey	%xmm0,(%rax)
	mov	$bits,16(%rax)	# 240(%rdx)
	movdqa	.Lkey_rotate(%rip),%xmm5
	movdqa	.Lkey_rcon1(%rip),%xmm4
	movdqu	%xmm0,0($key)
	movdqu	%xmm2,16($key)
	aesenclast	%xmm4,%xmm2
	pshufd	\$0xff,%xmm0,%xmm2
	aesenclast	%xmm3,%xmm2
	movdqu	%xmm2,16(%rax)
	mov	$bits,16(%rax)	# 240($key)
.cfi_adjust_cfa_offset	-8
.LSEH_end_set_encrypt_key:
.Lkey_expansion_128:
	$movkey	%xmm0,(%rax)
.Lkey_expansion_128_cold:
	shufps	\$0b00010000,%xmm0,%xmm4
	shufps	\$0b10001100,%xmm0,%xmm4
	shufps	\$0b11111111,%xmm1,%xmm1	# critical path
.Lkey_expansion_192a:
	$movkey	%xmm0,(%rax)
.Lkey_expansion_192a_cold:
.Lkey_expansion_192b_warm:
	shufps	\$0b00010000,%xmm0,%xmm4
	shufps	\$0b10001100,%xmm0,%xmm4
	pshufd	\$0b01010101,%xmm1,%xmm1	# critical path
	pshufd	\$0b11111111,%xmm0,%xmm3
.Lkey_expansion_192b:
	shufps	\$0b01000100,%xmm0,%xmm5
	$movkey	%xmm5,(%rax)
	shufps	\$0b01001110,%xmm2,%xmm3
	$movkey	%xmm3,16(%rax)
	jmp	.Lkey_expansion_192b_warm
.Lkey_expansion_256a:
	$movkey	%xmm2,(%rax)
.Lkey_expansion_256a_cold:
	shufps	\$0b00010000,%xmm0,%xmm4
	shufps	\$0b10001100,%xmm0,%xmm4
	shufps	\$0b11111111,%xmm1,%xmm1	# critical path
.Lkey_expansion_256b:
	$movkey	%xmm0,(%rax)
	shufps	\$0b00010000,%xmm2,%xmm4
	shufps	\$0b10001100,%xmm2,%xmm4
	shufps	\$0b10101010,%xmm1,%xmm1	# critical path
.size	${PREFIX}_set_encrypt_key,.-${PREFIX}_set_encrypt_key
.size	__aesni_set_encrypt_key,.-__aesni_set_encrypt_key
	.byte	15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
	.byte	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
	.long	0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d,0x0c0f0e0d
	.long	0x04070605,0x04070605,0x04070605,0x04070605
	.long	0x1b,0x1b,0x1b,0x1b
.asciz	"AES for Intel AES-NI, CRYPTOGAMS by <appro\@openssl.org>"
# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
.extern	__imp_RtlVirtualUnwind
$code.=<<___ if ($PREFIX eq "aesni");
.type	ecb_ccm64_se_handler,\@abi-omnipotent
ecb_ccm64_se_handler:
	mov	120($context),%rax	# pull context->Rax
	mov	248($context),%rbx	# pull context->Rip
	mov	8($disp),%rsi		# disp->ImageBase
	mov	56($disp),%r11		# disp->HandlerData
	mov	0(%r11),%r10d		# HandlerData[0]
	lea	(%rsi,%r10),%r10	# prologue label
	cmp	%r10,%rbx		# context->Rip<prologue label
	jb	.Lcommon_seh_tail
	mov	152($context),%rax	# pull context->Rsp
	mov	4(%r11),%r10d		# HandlerData[1]
	lea	(%rsi,%r10),%r10	# epilogue label
	cmp	%r10,%rbx		# context->Rip>=epilogue label
	jae	.Lcommon_seh_tail
	lea	0(%rax),%rsi		# %xmm save area
	lea	512($context),%rdi	# &context.Xmm6
	mov	\$8,%ecx		# 4*sizeof(%xmm0)/sizeof(%rax)
	.long	0xa548f3fc		# cld; rep movsq
	lea	0x58(%rax),%rax		# adjust stack pointer
	jmp	.Lcommon_seh_tail
.size	ecb_ccm64_se_handler,.-ecb_ccm64_se_handler
.type	ctr_xts_se_handler,\@abi-omnipotent
	mov	120($context),%rax	# pull context->Rax
	mov	248($context),%rbx	# pull context->Rip
	mov	8($disp),%rsi		# disp->ImageBase
	mov	56($disp),%r11		# disp->HandlerData
	mov	0(%r11),%r10d		# HandlerData[0]
	lea	(%rsi,%r10),%r10	# prologue label
	cmp	%r10,%rbx		# context->Rip<prologue label
	jb	.Lcommon_seh_tail
	mov	152($context),%rax	# pull context->Rsp
	mov	4(%r11),%r10d		# HandlerData[1]
	lea	(%rsi,%r10),%r10	# epilogue label
	cmp	%r10,%rbx		# context->Rip>=epilogue label
	jae	.Lcommon_seh_tail
	mov	208($context),%rax	# pull context->R11
	lea	-0xa8(%rax),%rsi	# %xmm save area
	lea	512($context),%rdi	# & context.Xmm6
	mov	\$20,%ecx		# 10*sizeof(%xmm0)/sizeof(%rax)
	.long	0xa548f3fc		# cld; rep movsq
	mov	-8(%rax),%rbp		# restore saved %rbp
	mov	%rbp,160($context)	# restore context->Rbp
	jmp	.Lcommon_seh_tail
.size	ctr_xts_se_handler,.-ctr_xts_se_handler
.type	ocb_se_handler,\@abi-omnipotent
	mov	120($context),%rax	# pull context->Rax
	mov	248($context),%rbx	# pull context->Rip
	mov	8($disp),%rsi		# disp->ImageBase
	mov	56($disp),%r11		# disp->HandlerData
	mov	0(%r11),%r10d		# HandlerData[0]
	lea	(%rsi,%r10),%r10	# prologue label
	cmp	%r10,%rbx		# context->Rip<prologue label
	jb	.Lcommon_seh_tail
	mov	4(%r11),%r10d		# HandlerData[1]
	lea	(%rsi,%r10),%r10	# epilogue label
	cmp	%r10,%rbx		# context->Rip>=epilogue label
	jae	.Lcommon_seh_tail
	mov	8(%r11),%r10d		# HandlerData[2]
	lea	(%rsi,%r10),%r10
	cmp	%r10,%rbx		# context->Rip>=pop label
	mov	152($context),%rax	# pull context->Rsp
	lea	(%rax),%rsi		# %xmm save area
	lea	512($context),%rdi	# & context.Xmm6
	mov	\$20,%ecx		# 10*sizeof(%xmm0)/sizeof(%rax)
	.long	0xa548f3fc		# cld; rep movsq
	lea	0xa0+0x28(%rax),%rax
	mov	%rbx,144($context)	# restore context->Rbx
	mov	%rbp,160($context)	# restore context->Rbp
	mov	%r12,216($context)	# restore context->R12
	mov	%r13,224($context)	# restore context->R13
	mov	%r14,232($context)	# restore context->R14
	jmp	.Lcommon_seh_tail
.size	ocb_se_handler,.-ocb_se_handler
.type	cbc_se_handler,\@abi-omnipotent
	mov	152($context),%rax	# pull context->Rsp
	mov	248($context),%rbx	# pull context->Rip
	lea	.Lcbc_decrypt_bulk(%rip),%r10
	cmp	%r10,%rbx		# context->Rip<"prologue" label
	jb	.Lcommon_seh_tail
	mov	120($context),%rax	# pull context->Rax
	lea	.Lcbc_decrypt_body(%rip),%r10
	cmp	%r10,%rbx		# context->Rip<cbc_decrypt_body
	jb	.Lcommon_seh_tail
	mov	152($context),%rax	# pull context->Rsp
	lea	.Lcbc_ret(%rip),%r10
	cmp	%r10,%rbx		# context->Rip>="epilogue" label
	jae	.Lcommon_seh_tail
	lea	16(%rax),%rsi		# %xmm save area
	lea	512($context),%rdi	# &context.Xmm6
	mov	\$20,%ecx		# 10*sizeof(%xmm0)/sizeof(%rax)
	.long	0xa548f3fc		# cld; rep movsq
	mov	208($context),%rax	# pull context->R11
	mov	-8(%rax),%rbp		# restore saved %rbp
	mov	%rbp,160($context)	# restore context->Rbp
	mov	%rax,152($context)	# restore context->Rsp
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi
	mov	40($disp),%rdi		# disp->ContextRecord
	mov	$context,%rsi		# context
	mov	\$154,%ecx		# sizeof(CONTEXT)
	.long	0xa548f3fc		# cld; rep movsq
	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
	mov	0(%rsi),%r8		# arg3, disp->ControlPc
	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
	mov	40(%rsi),%r10		# disp->ContextRecord
	lea	56(%rsi),%r11		# &disp->HandlerData
	lea	24(%rsi),%r12		# &disp->EstablisherFrame
	mov	%r10,32(%rsp)		# arg5
	mov	%r11,40(%rsp)		# arg6
	mov	%r12,48(%rsp)		# arg7
	mov	%rcx,56(%rsp)		# arg8, (NULL)
	call	*__imp_RtlVirtualUnwind(%rip)
	mov	\$1,%eax		# ExceptionContinueSearch
.size	cbc_se_handler,.-cbc_se_handler
$code.=<<___ if ($PREFIX eq "aesni");
	.rva	.LSEH_begin_aesni_ecb_encrypt
	.rva	.LSEH_end_aesni_ecb_encrypt
	.rva	.LSEH_begin_aesni_ccm64_encrypt_blocks
	.rva	.LSEH_end_aesni_ccm64_encrypt_blocks
	.rva	.LSEH_info_ccm64_enc
	.rva	.LSEH_begin_aesni_ccm64_decrypt_blocks
	.rva	.LSEH_end_aesni_ccm64_decrypt_blocks
	.rva	.LSEH_info_ccm64_dec
	.rva	.LSEH_begin_aesni_ctr32_encrypt_blocks
	.rva	.LSEH_end_aesni_ctr32_encrypt_blocks
	.rva	.LSEH_info_ctr32
	.rva	.LSEH_begin_aesni_xts_encrypt
	.rva	.LSEH_end_aesni_xts_encrypt
	.rva	.LSEH_info_xts_enc
	.rva	.LSEH_begin_aesni_xts_decrypt
	.rva	.LSEH_end_aesni_xts_decrypt
	.rva	.LSEH_info_xts_dec
	.rva	.LSEH_begin_aesni_ocb_encrypt
	.rva	.LSEH_end_aesni_ocb_encrypt
	.rva	.LSEH_info_ocb_enc
	.rva	.LSEH_begin_aesni_ocb_decrypt
	.rva	.LSEH_end_aesni_ocb_decrypt
	.rva	.LSEH_info_ocb_dec
	.rva	.LSEH_begin_${PREFIX}_cbc_encrypt
	.rva	.LSEH_end_${PREFIX}_cbc_encrypt
	.rva	${PREFIX}_set_decrypt_key
	.rva	.LSEH_end_set_decrypt_key
	.rva	${PREFIX}_set_encrypt_key
	.rva	.LSEH_end_set_encrypt_key
$code.=<<___ if ($PREFIX eq "aesni");
	.rva	ecb_ccm64_se_handler
	.rva	.Lecb_enc_body,.Lecb_enc_ret		# HandlerData[]
.LSEH_info_ccm64_enc:
	.rva	ecb_ccm64_se_handler
	.rva	.Lccm64_enc_body,.Lccm64_enc_ret	# HandlerData[]
.LSEH_info_ccm64_dec:
	.rva	ecb_ccm64_se_handler
	.rva	.Lccm64_dec_body,.Lccm64_dec_ret	# HandlerData[]
	.rva	ctr_xts_se_handler
	.rva	.Lctr32_body,.Lctr32_epilogue		# HandlerData[]
	.rva	ctr_xts_se_handler
	.rva	.Lxts_enc_body,.Lxts_enc_epilogue	# HandlerData[]
	.rva	ctr_xts_se_handler
	.rva	.Lxts_dec_body,.Lxts_dec_epilogue	# HandlerData[]
	.rva	.Locb_enc_body,.Locb_enc_epilogue	# HandlerData[]
	.rva	.Locb_dec_body,.Locb_dec_epilogue	# HandlerData[]
	.byte	0x01,0x04,0x01,0x00
	.byte	0x04,0x02,0x00,0x00	# sub rsp,8
  local *opcode=shift;
    $rex|=0x04 if($dst>=8);
    $rex|=0x01 if($src>=8);
    push @opcode,$rex|0x40 if($rex);
    if ($line=~/(aeskeygenassist)\s+\$([x0-9a-f]+),\s*%xmm([0-9]+),\s*%xmm([0-9]+)/) {
	rex(\@opcode,$4,$3);
	push @opcode,0x0f,0x3a,0xdf;
	push @opcode,0xc0|($3&7)|(($4&7)<<3);	# ModR/M
	push @opcode,$c=~/^0/?oct($c):$c;
	return ".byte\t".join(',',@opcode);
    elsif ($line=~/(aes[a-z]+)\s+%xmm([0-9]+),\s*%xmm([0-9]+)/) {
		"aesenc" => 0xdc,	"aesenclast" => 0xdd,
		"aesdec" => 0xde,	"aesdeclast" => 0xdf
	return undef if (!defined($opcodelet{$1}));
	rex(\@opcode,$3,$2);
	push @opcode,0x0f,0x38,$opcodelet{$1};
	push @opcode,0xc0|($2&7)|(($3&7)<<3);	# ModR/M
	return ".byte\t".join(',',@opcode);
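	# Worked example of the register-to-register encoding, assuming
	# @opcode starts out holding the 0x66 prefix (it is initialized
	# outside the lines shown here):
	#   aesenc %xmm1,%xmm2  -> no REX byte, opcodelet 0xdc,
	#                          ModR/M = 0xc0|(1&7)|((2&7)<<3) = 0xd1,
	#                          bytes 0x66,0x0f,0x38,0xdc,0xd1
	#   aesenc %xmm9,%xmm10 -> rex() yields 0x04|0x01|0x40 = 0x45,
	#                          ModR/M = 0xc0|(9&7)|((10&7)<<3) = 0xd1,
	#                          bytes 0x66,0x45,0x0f,0x38,0xdc,0xd1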
    elsif ($line=~/(aes[a-z]+)\s+([0x1-9a-fA-F]*)\(%rsp\),\s*%xmm([0-9]+)/) {
		"aesenc" => 0xdc,	"aesenclast" => 0xdd,
		"aesdec" => 0xde,	"aesdeclast" => 0xdf
	return undef if (!defined($opcodelet{$1}));
	push @opcode,0x44 if ($3>=8);
	push @opcode,0x0f,0x38,$opcodelet{$1};
	push @opcode,0x44|(($3&7)<<3),0x24;	# ModR/M
	push @opcode,($off=~/^0/?oct($off):$off)&0xff;
	return ".byte\t".join(',',@opcode);
	".byte 0x0f,0x38,0xf1,0x44,0x24,".shift;
$code =~ s/\`([^\`]*)\`/eval($1)/gem;
$code =~ s/\b(aes.*%xmm[0-9]+).*$/aesni($1)/gem;
#$code =~ s/\bmovbe\s+%eax/bswap %eax; mov %eax/gm;	# debugging artefact
$code =~ s/\bmovbe\s+%eax,\s*([0-9]+)\(%rsp\)/movbe($1)/gem;
close STDOUT or die "error closing STDOUT: $!";