<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="ch05" xreflabel="SHA">
<title>Secure Hash Standard</title>
<indexterm>
<primary>SHA</primary>
</indexterm>
<para>SHA is the Secure Hash Algorithm family. It is specified in <ulink url="https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf">FIPS 180-4, Secure Hash Standard (SHS)</ulink>. You should read the standard if you are not familiar with the hash family.</para>
<para>The code below is available online at <ulink url="https://github.com/noloader/SHA-Intrinsics">SHA Intrinsics</ulink>. The GitHub repository provides accelerated SHA for Intel, ARMv8 and POWER8.</para>
<section id="sha_strategy" xreflabel="SHA Strategy">
<title>Strategy</title>
<para>SHA provides a lot of freedom to an implementation. You can approach your SHA implementation in several ways, but most of them will result in an under-performing SHA. This section provides one of the strategies for a better performing implementation.</para>
<para>The first design element is to perform everything in vector registers. The only integer operations should be reading 2 longs or 4 integers from memory during a load, and writing 2 longs or 4 integers after the round during a store.</para>
<para>Second, when you need an integer for a calculation, shift it from one vector register into another using <systemitem>vec_sld</systemitem>. Most of the time you only care about element 0 in a vector register, and the remaining elements are "don't care" elements.</para>
<para>Third, don't maintain a full <systemitem>W[64]</systemitem> or <systemitem>W[80]</systemitem> table. Use <systemitem>X[16]</systemitem> instead, and transform each element in-place using a rolling strategy.</para>
<para>Fourth, the eight working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem> each get their own vector register. The value you care about is located at element 0, and the remaining elements in the vector are "don't care" elements.</para>
<para>It does not matter if you rotate the working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem> in the caller or in the callee. Both designs have nearly the same performance characteristics.</para>
<para>Because you operate on <systemitem>X[16]</systemitem> in a rolling fashion instead of <systemitem>W[64]</systemitem> or <systemitem>W[80]</systemitem>, the main body of your compression function will look similar to the code below.</para>
<programlisting><?code-font-size 75% ?>// SHA-256 partial compression function
uint32x4_p X[16];
...
for (i = 16; i < 64; i++)
{
uint32x4_p s0, s1, T1, T2;
s0 = sigma0(X[(i + 1) & 0x0f]);
s1 = sigma1(X[(i + 14) & 0x0f]);
T1 = (X[i & 0xf] += s0 + s1 + X[(i + 9) & 0xf]);
T1 += h + Sigma1(e) + Ch(e, f, g) + KEY[i];
T2 = Sigma0(a) + Maj(a, b, c);
...
}
</programlisting>
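<para>The rolling strategy can be checked in portable scalar C++ before any vector code is written. The sketch below is not taken from the SHA Intrinsics repository, and the names <systemitem>schedule_full</systemitem>, <systemitem>schedule_rolling</systemitem> and <systemitem>schedules_match</systemitem> are hypothetical. It assumes plain <systemitem>uint32_t</systemitem> words instead of vector registers, and it compares the words produced by an in-place <systemitem>X[16]</systemitem> schedule against a full <systemitem>W[64]</systemitem> table.</para>
<programlisting><?code-font-size 75% ?>#include <cstdint>

// Rotate a 32-bit word right by n bits (n must be 1..31).
static inline uint32_t rotr32(uint32_t v, unsigned int n)
{
    return (v >> n) | (v << (32 - n));
}

// FIPS 180-4 lower-case sigma functions for SHA-256.
static inline uint32_t sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}
static inline uint32_t sigma1(uint32_t x)
{
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}

// Full-table schedule: W[0..15] is the message, W[16..63] is derived.
void schedule_full(const uint32_t msg[16], uint32_t W[64])
{
    for (unsigned int i = 0; i < 16; i++)
        W[i] = msg[i];
    for (unsigned int i = 16; i < 64; i++)
        W[i] = sigma1(W[i-2]) + W[i-7] + sigma0(W[i-15]) + W[i-16];
}

// Rolling schedule: only 16 words are live at any time. out[i] records
// the word that round i would consume so both approaches can be compared.
void schedule_rolling(const uint32_t msg[16], uint32_t out[64])
{
    uint32_t X[16];
    for (unsigned int i = 0; i < 16; i++)
        out[i] = X[i] = msg[i];
    for (unsigned int i = 16; i < 64; i++)
    {
        // X[i & 0xf] still holds W[i-16] and is updated in place.
        const uint32_t s0 = sigma0(X[(i + 1) & 0xf]);   // W[i-15]
        const uint32_t s1 = sigma1(X[(i + 14) & 0xf]);  // W[i-2]
        out[i] = (X[i & 0xf] += s0 + s1 + X[(i + 9) & 0xf]);  // + W[i-7]
    }
}

// Returns true when both schedules produce the same 64 words.
bool schedules_match(const uint32_t msg[16])
{
    uint32_t full[64], rolling[64];
    schedule_full(msg, full);
    schedule_rolling(msg, rolling);
    for (unsigned int i = 0; i < 64; i++)
        if (full[i] != rolling[i])
            return false;
    return true;
}
</programlisting>
<para>The index arithmetic works because, at round <systemitem>i</systemitem>, slot <systemitem>i &amp; 0xf</systemitem> still holds the word from 16 rounds earlier. Adding the three other terms to it in place produces the word for the current round.</para>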
</section>
<section id="sha_ch" xreflabel="Ch Function">
<title>Ch function</title>
<para><indexterm><primary>SHA</primary><secondary>Ch function</secondary></indexterm>The SHA <systemitem>Ch</systemitem> function is implemented in POWER systems using the <systemitem>vsel</systemitem> instruction or the <systemitem>vec_sel</systemitem> builtin. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes <systemitem>uint64x2_p</systemitem> arguments. The important piece of information is <systemitem>x</systemitem> used as the selector.</para>
<programlisting><?code-font-size 75% ?>uint32x4_p
VecCh(uint32x4_p x, uint32x4_p y, uint32x4_p z)
{
return vec_sel(z, y, x);
}</programlisting>
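<para>The select formulation can be sanity-checked with scalar code. The sketch below uses hypothetical helper names (<systemitem>select32</systemitem>, <systemitem>ch_reference</systemitem>, <systemitem>ch_select</systemitem>) and models the select as taking bits from the second operand wherever the mask bit is set, which is the operand order assumed for <systemitem>vec_sel(z, y, x)</systemitem> above.</para>
<programlisting><?code-font-size 75% ?>#include <cstdint>

// Scalar model of a bitwise select: take bits from b where mask is 1,
// and bits from a where mask is 0.
static inline uint32_t select32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

// Ch as written in FIPS 180-4.
static inline uint32_t ch_reference(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

// Ch as a single select with x as the selector.
static inline uint32_t ch_select(uint32_t x, uint32_t y, uint32_t z)
{
    return select32(z, y, x);
}
</programlisting>
<para>Both forms agree because each bit of <systemitem>x</systemitem> chooses between the corresponding bits of <systemitem>y</systemitem> and <systemitem>z</systemitem>, so the two masked terms never overlap.</para>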
</section>
<section id="sha_maj" xreflabel="Maj Function">
<title>Maj function</title>
<para><indexterm><primary>SHA</primary><secondary>Maj function</secondary></indexterm>The SHA <systemitem>Maj</systemitem> function is implemented in POWER systems using the <systemitem>vsel</systemitem> instruction or the <systemitem>vec_sel</systemitem> builtin. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes <systemitem>uint64x2_p</systemitem> arguments. The important piece of information is <systemitem>x^y</systemitem> used as the selector.</para>
<programlisting><?code-font-size 75% ?>uint32x4_p
VecMaj(uint32x4_p x, uint32x4_p y, uint32x4_p z)
{
return vec_sel(y, z, vec_xor(x, y));
}</programlisting>
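<para>The same kind of scalar check applies to <systemitem>Maj</systemitem>. The sketch below uses hypothetical helper names and the same select model as <systemitem>vec_sel(y, z, vec_xor(x, y))</systemitem>: where <systemitem>x</systemitem> and <systemitem>y</systemitem> agree, the majority is that shared bit, otherwise <systemitem>z</systemitem> breaks the tie.</para>
<programlisting><?code-font-size 75% ?>#include <cstdint>

// Scalar model of a bitwise select: take bits from b where mask is 1,
// and bits from a where mask is 0.
static inline uint32_t select32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}

// Maj as written in FIPS 180-4.
static inline uint32_t maj_reference(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

// Maj as a single select with x^y as the selector.
static inline uint32_t maj_select(uint32_t x, uint32_t y, uint32_t z)
{
    return select32(y, z, x ^ y);
}
</programlisting>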
</section>
<section id="sha_insn" xreflabel="Sigma Functions">
<title>Sigma functions</title>
<para><indexterm><primary>SHA</primary><secondary>Sigma functions</secondary></indexterm>POWER8 provides the <systemitem>vshasigmaw</systemitem> and <systemitem>vshasigmad</systemitem> instructions to accelerate SHA calculations for 32-bit and 64-bit words, respectively. Besides the source vector, each instruction takes two integer constants that select among <systemitem>Sigma0</systemitem>, <systemitem>Sigma1</systemitem>, <systemitem>sigma0</systemitem> and <systemitem>sigma1</systemitem>. In the wrappers below, the first constant selects the lowercase (0) or uppercase (1) functions, and the second constant selects function 0 (mask <systemitem>0</systemitem>) or function 1 (mask <systemitem>0xf</systemitem>) for all elements.</para>
<para><indexterm><primary>__builtin_crypto_vshasigmaw</primary></indexterm><indexterm><primary>__builtin_crypto_vshasigmad</primary></indexterm><indexterm><primary>__vshasigmaw</primary></indexterm><indexterm><primary>__vshasigmad</primary></indexterm>The builtin GCC functions for the instructions are <systemitem>__builtin_crypto_vshasigmaw</systemitem> and <systemitem>__builtin_crypto_vshasigmad</systemitem>. The XLC functions for the instructions are <systemitem>__vshasigmaw</systemitem> and <systemitem>__vshasigmad</systemitem>. The C/C++ wrapper for the SHA-256 functions should look similar to below.</para>
<programlisting><?code-font-size 75% ?>uint32x4_p Vec_sigma0(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmaw(val, 0, 0);
#else
return __builtin_crypto_vshasigmaw(val, 0, 0);
#endif
}
uint32x4_p Vec_sigma1(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmaw(val, 0, 0xf);
#else
return __builtin_crypto_vshasigmaw(val, 0, 0xf);
#endif
}
uint32x4_p VecSigma0(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmaw(val, 1, 0);
#else
return __builtin_crypto_vshasigmaw(val, 1, 0);
#endif
}
uint32x4_p VecSigma1(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmaw(val, 1, 0xf);
#else
return __builtin_crypto_vshasigmaw(val, 1, 0xf);
#endif
}</programlisting>
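<para>When testing the wrappers it helps to have a scalar reference for one 32-bit element. The sketch below is an assumption-laden model, not the instruction definition: <systemitem>sha256_sigma_model</systemitem> is a hypothetical name, <systemitem>st</systemitem> mirrors the first constant used by the wrappers above, and <systemitem>fn</systemitem> stands for the single mask bit that applies to one element. The rotation and shift counts are the SHA-256 values from FIPS 180-4.</para>
<programlisting><?code-font-size 75% ?>#include <cstdint>

// Rotate a 32-bit word right by n bits (n must be 1..31).
static inline uint32_t rotr32(uint32_t v, unsigned int n)
{
    return (v >> n) | (v << (32 - n));
}

// Scalar model of one 32-bit element.
//   st == 0: lower-case sigma functions, st != 0: upper-case Sigma functions
//   fn == 0: function 0,                 fn != 0: function 1
uint32_t sha256_sigma_model(uint32_t x, unsigned int st, unsigned int fn)
{
    if (st == 0)
    {
        return (fn == 0)
            ? (rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3))     // sigma0
            : (rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10));  // sigma1
    }
    return (fn == 0)
        ? (rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22))    // Sigma0
        : (rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25));   // Sigma1
}
</programlisting>
<para>A test harness on POWER8 could compare element 0 of each wrapper's result against this model for a range of inputs.</para>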
</section>
<section id="sha_sha256" xreflabel="SHA256">
<title>SHA-256</title>
<para><indexterm><primary>SHA</primary><secondary>SHA-256</secondary></indexterm>The SHA-256 implementation has four parts. The first part loads the existing state and creates working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>. The second part loads the message and performs the first 16 rounds. The third part performs the remaining rounds. The final part stores the new state.</para>
<para><emphasis role="bold">Part 1.</emphasis> Load the existing state and create working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>.</para>
<programlisting><?code-font-size 75% ?>uint32x4_p abcd = VecLoad32x4u(state+0, 0);
uint32x4_p efgh = VecLoad32x4u(state+4, 0);
enum {A=0,B=1,C,D,E,F,G,H};
uint32x4_p X[16], S[8];
S[A] = abcd; S[E] = efgh;
S[B] = VecShiftLeft<4>(S[A]);
S[F] = VecShiftLeft<4>(S[E]);
S[C] = VecShiftLeft<4>(S[B]);
S[G] = VecShiftLeft<4>(S[F]);
S[D] = VecShiftLeft<4>(S[C]);
S[H] = VecShiftLeft<4>(S[G]);
</programlisting>
<para><emphasis role="bold">Part 2.</emphasis> Load the message and perform the first 16 rounds.</para>
<programlisting><?code-font-size 75% ?>const uint32_t* k = reinterpret_cast<const uint32_t*>(KEY256);
const uint32_t* m = reinterpret_cast<const uint32_t*>(data);
uint32x4_p vm, vk;
unsigned int i, offset=0;
vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<0>(X,S, vk,vm);
SHA256_ROUND1<1>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<2>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<3>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;
vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<4>(X,S, vk,vm);
SHA256_ROUND1<5>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<6>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<7>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;
vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<8>(X,S, vk,vm);
SHA256_ROUND1<9>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<10>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<11>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;
vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1<12>(X,S, vk,vm);
SHA256_ROUND1<13>(X,S, VecShiftLeft<4>(vk), VecShiftLeft<4>(vm));
SHA256_ROUND1<14>(X,S, VecShiftLeft<8>(vk), VecShiftLeft<8>(vm));
SHA256_ROUND1<15>(X,S, VecShiftLeft<12>(vk), VecShiftLeft<12>(vm));
offset+=16;
</programlisting>
<para><emphasis role="bold">Part 3.</emphasis> Perform the remaining rounds.
</para>
<programlisting><?code-font-size 75% ?>for (i=16; i<64; i+=16)
{
vk = VecLoad32x4(k, offset);
SHA256_ROUND2<0>(X,S, vk);
SHA256_ROUND2<1>(X,S, VecShiftLeft<4>(vk));
SHA256_ROUND2<2>(X,S, VecShiftLeft<8>(vk));
SHA256_ROUND2<3>(X,S, VecShiftLeft<12>(vk));
offset+=16;
vk = VecLoad32x4(k, offset);
SHA256_ROUND2<4>(X,S, vk);
SHA256_ROUND2<5>(X,S, VecShiftLeft<4>(vk));
SHA256_ROUND2<6>(X,S, VecShiftLeft<8>(vk));
SHA256_ROUND2<7>(X,S, VecShiftLeft<12>(vk));
offset+=16;
vk = VecLoad32x4(k, offset);
SHA256_ROUND2<8>(X,S, vk);
SHA256_ROUND2<9>(X,S, VecShiftLeft<4>(vk));
SHA256_ROUND2<10>(X,S, VecShiftLeft<8>(vk));
SHA256_ROUND2<11>(X,S, VecShiftLeft<12>(vk));
offset+=16;
vk = VecLoad32x4(k, offset);
SHA256_ROUND2<12>(X,S, vk);
SHA256_ROUND2<13>(X,S, VecShiftLeft<4>(vk));
SHA256_ROUND2<14>(X,S, VecShiftLeft<8>(vk));
SHA256_ROUND2<15>(X,S, VecShiftLeft<12>(vk));
offset+=16;
}
</programlisting>
<para><emphasis role="bold">Part 4.</emphasis> Repack and store the new state.</para>
<programlisting><?code-font-size 75% ?>abcd += VecPack(S[A],S[B],S[C],S[D]);
efgh += VecPack(S[E],S[F],S[G],S[H]);
VecStore32x4u(abcd, state+0, 0);
VecStore32x4u(efgh, state+4, 0);
</programlisting>
<para><emphasis role="bold">VecLoadMsg32x4.</emphasis> Perform an endian-aware load of four 32-bit message words into a vector register.
</para>
<programlisting><?code-font-size 75% ?>template <class T>
uint32x4_p VecLoadMsg32x4(const T* data, int offset)
{
#if __LITTLE_ENDIAN__
uint8x16_p mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
uint32x4_p r = VecLoad32x4u(data, offset);
return (uint32x4_p)vec_perm(r, r, mask);
#else
return VecLoad32x4u(data, offset);
#endif
}
</programlisting>
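<para>The byte permutation exists because SHA message words are big-endian. A portable scalar sketch of the same load is shown below. The names <systemitem>load_be32</systemitem> and <systemitem>load_msg32x4</systemitem> are hypothetical, and the sketch assumes <systemitem>offset</systemitem> is a byte offset, as the <systemitem>offset+=16</systemitem> steps in Part 2 suggest.</para>
<programlisting><?code-font-size 75% ?>#include <cstddef>
#include <cstdint>

// Read one big-endian 32-bit word from a byte buffer. The result is the
// same on little-endian and big-endian hosts.
static inline uint32_t load_be32(const uint8_t* p)
{
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
}

// Load four consecutive message words starting at a byte offset.
// This is a scalar stand-in for one 16-byte vector load.
void load_msg32x4(const uint8_t* data, size_t offset, uint32_t out[4])
{
    for (unsigned int i = 0; i < 4; i++)
        out[i] = load_be32(data + offset + 4 * i);
}
</programlisting>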
<para><emphasis role="bold">SHA256_ROUND1.</emphasis> Mix the state with a round constant and a message word.
</para>
<programlisting><?code-font-size 75% ?>template <unsigned int R>
void SHA256_ROUND1(uint32x4_p X[16], uint32x4_p S[8],
const uint32x4_p K, const uint32x4_p M)
{
uint32x4_p T1, T2;
X[R] = M;
T1 = S[H] + VecSigma1(S[E]);
T1 += VecCh(S[E],S[F],S[G]) + K + M;
T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);
S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
S[E] = S[D] + T1;
S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
S[A] = T1 + T2;
}
</programlisting>
<para><emphasis role="bold">SHA256_ROUND2.</emphasis> Mix the state with a round constant and a scheduled message word.
</para>
<programlisting><?code-font-size 75% ?>template <unsigned int R>
void SHA256_ROUND2(uint32x4_p X[16], uint32x4_p S[8],
const uint32x4_p K)
{
// Indexes into the X[] array
enum {IDX0=(R+0)&0xf, IDX1=(R+1)&0xf,
IDX9=(R+9)&0xf, IDX14=(R+14)&0xf};
const uint32x4_p s0 = Vec_sigma0(X[IDX1]);
const uint32x4_p s1 = Vec_sigma1(X[IDX14]);
uint32x4_p T1 = (X[IDX0] += s0 + s1 + X[IDX9]);
T1 += S[H] + VecSigma1(S[E]) + VecCh(S[E],S[F],S[G]) + K;
uint32x4_p T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);
S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
S[E] = S[D] + T1;
S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
S[A] = T1 + T2;
}
</programlisting>
<para><emphasis role="bold">VecPack.</emphasis> Repack working variables.
</para>
<programlisting><?code-font-size 75% ?>uint32x4_p VecPack(const uint32x4_p a, const uint32x4_p b,
const uint32x4_p c, const uint32x4_p d)
{
uint8x16_p m1 = {0,1,2,3, 16,17,18,19, 0,0,0,0, 0,0,0,0};
uint8x16_p m2 = {0,1,2,3, 4,5,6,7, 16,17,18,19, 20,21,22,23};
return vec_perm(vec_perm(a,b,m1), vec_perm(c,d,m1), m2);
}
</programlisting>
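<para>A scalar reference makes it easier to validate the vector code, because both can be fed the same block and the resulting states compared. The sketch below is not the repository's implementation. It is a minimal portable compression function that assumes the same rolling <systemitem>X[16]</systemitem> layout and the same <systemitem>S[8]</systemitem> rotation as the round functions above, with hypothetical names <systemitem>K256</systemitem>, <systemitem>sha256_compress</systemitem> and <systemitem>sha256_abc_matches</systemitem>. The self-check hashes the one-block message <systemitem>"abc"</systemitem> from the FIPS 180-4 examples.</para>
<programlisting><?code-font-size 75% ?>#include <cstdint>
#include <cstring>

// SHA-256 round constants from FIPS 180-4.
static const uint32_t K256[64] = {
    0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
    0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
    0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
    0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
    0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
    0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
    0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
    0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

static inline uint32_t rotr(uint32_t v, unsigned int n) { return (v >> n) | (v << (32 - n)); }
static inline uint32_t Ch(uint32_t x, uint32_t y, uint32_t z) { return (z & ~x) | (y & x); }
static inline uint32_t Maj(uint32_t x, uint32_t y, uint32_t z)
{
    const uint32_t m = x ^ y;       // selector, as in the vector version
    return (y & ~m) | (z & m);
}
static inline uint32_t Sigma0(uint32_t x) { return rotr(x,2) ^ rotr(x,13) ^ rotr(x,22); }
static inline uint32_t Sigma1(uint32_t x) { return rotr(x,6) ^ rotr(x,11) ^ rotr(x,25); }
static inline uint32_t sigma0(uint32_t x) { return rotr(x,7) ^ rotr(x,18) ^ (x >> 3); }
static inline uint32_t sigma1(uint32_t x) { return rotr(x,17) ^ rotr(x,19) ^ (x >> 10); }

// Process one 64-byte block and add the result into state.
void sha256_compress(uint32_t state[8], const uint8_t block[64])
{
    enum {A=0,B,C,D,E,F,G,H};
    uint32_t X[16], S[8];
    std::memcpy(S, state, sizeof(S));

    for (unsigned int i = 0; i < 64; i++)
    {
        uint32_t w;
        if (i < 16)
        {
            // Rounds 0..15 consume big-endian message words.
            const uint8_t* p = block + 4 * i;
            w = X[i] = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
                       (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
        }
        else
        {
            // Rounds 16..63 update X[] in place.
            w = (X[i & 0xf] += sigma0(X[(i + 1) & 0xf]) +
                               sigma1(X[(i + 14) & 0xf]) + X[(i + 9) & 0xf]);
        }
        const uint32_t T1 = S[H] + Sigma1(S[E]) + Ch(S[E],S[F],S[G]) + K256[i] + w;
        const uint32_t T2 = Sigma0(S[A]) + Maj(S[A],S[B],S[C]);
        S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
        S[E] = S[D] + T1;
        S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
        S[A] = T1 + T2;
    }
    for (unsigned int i = 0; i < 8; i++)
        state[i] += S[i];
}

// Hash "abc" as a single padded block and compare with the known digest.
bool sha256_abc_matches()
{
    uint32_t state[8] = {0x6a09e667,0xbb67ae85,0x3c6ef372,0xa54ff53a,
                         0x510e527f,0x9b05688c,0x1f83d9ab,0x5be0cd19};
    uint8_t block[64] = {0};
    block[0] = 'a'; block[1] = 'b'; block[2] = 'c';
    block[3] = 0x80;    // padding marker
    block[63] = 24;     // message length in bits
    sha256_compress(state, block);
    static const uint32_t expect[8] = {0xba7816bf,0x8f01cfea,0x414140de,0x5dae2223,
                                       0xb00361a3,0x96177a9c,0xb410ff61,0xf20015ad};
    return std::memcmp(state, expect, sizeof(expect)) == 0;
}
</programlisting>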
</section>
<section id="sha_sha512" xreflabel="SHA512">
<title>SHA-512</title>
<para><indexterm><primary>SHA</primary><secondary>SHA-512</secondary></indexterm>The SHA-512 implementation is like SHA-256 and has four parts. The first part loads the existing state and creates working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>. The second part loads the message and performs the first 16 rounds. The third part performs the remaining rounds. The final part stores the new state.</para>
<para><emphasis role="bold">Part 1.</emphasis> Load the existing state and create working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>.</para>
<programlisting><?code-font-size 75% ?>uint64x2_p ab = VecLoad64x2u(state+0, 0);
uint64x2_p cd = VecLoad64x2u(state+2, 0);
uint64x2_p ef = VecLoad64x2u(state+4, 0);
uint64x2_p gh = VecLoad64x2u(state+6, 0);
// Indexes into the S[] array
enum {A=0, B=1, C, D, E, F, G, H};
uint64x2_p X[16], S[8];
S[A] = ab; S[C] = cd;
S[E] = ef; S[G] = gh;
S[B] = VecShiftLeft<8>(S[A]);
S[D] = VecShiftLeft<8>(S[C]);
S[F] = VecShiftLeft<8>(S[E]);
S[H] = VecShiftLeft<8>(S[G]);
</programlisting>
<para><emphasis role="bold">Part 2.</emphasis> Load the message and perform the first 16 rounds.</para>
<programlisting><?code-font-size 75% ?>const uint64_t* k = reinterpret_cast<const uint64_t*>(KEY512);
const uint64_t* m = reinterpret_cast<const uint64_t*>(data);
uint64x2_p vm, vk;
unsigned int i, offset=0;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<0>(X,S, vk,vm);
SHA512_ROUND1<1>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<2>(X,S, vk,vm);
SHA512_ROUND1<3>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<4>(X,S, vk,vm);
SHA512_ROUND1<5>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<6>(X,S, vk,vm);
SHA512_ROUND1<7>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<8>(X,S, vk,vm);
SHA512_ROUND1<9>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<10>(X,S, vk,vm);
SHA512_ROUND1<11>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<12>(X,S, vk,vm);
SHA512_ROUND1<13>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1<14>(X,S, vk,vm);
SHA512_ROUND1<15>(X,S, VecShiftLeft<8>(vk),VecShiftLeft<8>(vm));
offset+=16;
</programlisting>
<para><emphasis role="bold">Part 3.</emphasis> Perform the remaining rounds.
</para>
<programlisting><?code-font-size 75% ?>for (i=16; i<80; i+=16)
{
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<0>(X,S, vk);
SHA512_ROUND2<1>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<2>(X,S, vk);
SHA512_ROUND2<3>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<4>(X,S, vk);
SHA512_ROUND2<5>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<6>(X,S, vk);
SHA512_ROUND2<7>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<8>(X,S, vk);
SHA512_ROUND2<9>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<10>(X,S, vk);
SHA512_ROUND2<11>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<12>(X,S, vk);
SHA512_ROUND2<13>(X,S, VecShiftLeft<8>(vk));
offset+=16;
vk = VecLoad64x2(k, offset);
SHA512_ROUND2<14>(X,S, vk);
SHA512_ROUND2<15>(X,S, VecShiftLeft<8>(vk));
offset+=16;
}
</programlisting>
<para><emphasis role="bold">Part 4.</emphasis> Repack and store the new state.
</para>
<programlisting><?code-font-size 75% ?>ab += VecPack(S[A],S[B]);
cd += VecPack(S[C],S[D]);
ef += VecPack(S[E],S[F]);
gh += VecPack(S[G],S[H]);
VecStore64x2u(ab, state+0, 0);
VecStore64x2u(cd, state+2, 0);
VecStore64x2u(ef, state+4, 0);
VecStore64x2u(gh, state+6, 0);
</programlisting>
<para><emphasis role="bold">VecLoadMsg64x2.</emphasis> Perform an endian-aware load of two 64-bit message words into a vector register.
</para>
<programlisting><?code-font-size 75% ?>template <class T>
uint64x2_p VecLoadMsg64x2(const T* data, int offset)
{
#if __LITTLE_ENDIAN__
uint8x16_p mask = {7,6,5,4, 3,2,1,0, 15,14,13,12, 11,10,9,8};
uint64x2_p r = VecLoad64x2u(data, offset);
return (uint64x2_p)vec_perm(r, r, mask);
#else
return VecLoad64x2u(data, offset);
#endif
}
</programlisting>
<para><emphasis role="bold">SHA512_ROUND1.</emphasis> Mix the state with a round constant and a message word.
</para>
<programlisting><?code-font-size 75% ?>template <unsigned int R>
void SHA512_ROUND1(uint64x2_p X[16], uint64x2_p S[8],
const uint64x2_p K, const uint64x2_p M)
{
uint64x2_p T1, T2;
X[R] = M;
T1 = S[H] + VecSigma1(S[E]);
T1 += VecCh(S[E],S[F],S[G]) + K + M;
T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);
S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
S[E] = S[D] + T1;
S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
S[A] = T1 + T2;
}
</programlisting>
<para><emphasis role="bold">SHA512_ROUND2.</emphasis> Mix the state with a round constant and a scheduled message word.
</para>
<programlisting><?code-font-size 75% ?>template <unsigned int R>
void SHA512_ROUND2(uint64x2_p X[16], uint64x2_p S[8],
const uint64x2_p K)
{
// Indexes into the X[] array
enum {IDX0=(R+0)&0xf, IDX1=(R+1)&0xf,
IDX9=(R+9)&0xf, IDX14=(R+14)&0xf};
const uint64x2_p s0 = Vec_sigma0(X[IDX1]);
const uint64x2_p s1 = Vec_sigma1(X[IDX14]);
uint64x2_p T1 = (X[IDX0] += s0 + s1 + X[IDX9]);
T1 += S[H] + VecSigma1(S[E]) + VecCh(S[E],S[F],S[G]) + K;
uint64x2_p T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);
S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
S[E] = S[D] + T1;
S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
S[A] = T1 + T2;
}
</programlisting>
<para><emphasis role="bold">VecPack.</emphasis> Repack working variables.
</para>
<programlisting><?code-font-size 75% ?>uint64x2_p VecPack(const uint64x2_p x, const uint64x2_p y)
{
const uint8x16_p m = {0,1,2,3, 4,5,6,7, 16,17,18,19, 20,21,22,23};
return vec_perm(x,y,m);
}
</programlisting>
<para>The SHA-512 implementation uses the same functions as SHA-256, but SHA-512 uses a 64x2 arrangement rather than the 32x4 arrangement. You should copy/paste/replace as required for SHA-512. For example, below is the SHA <systemitem>Ch</systemitem> for the 64x2 arrangement.</para>
<programlisting><?code-font-size 75% ?>uint64x2_p
VecCh(uint64x2_p x, uint64x2_p y, uint64x2_p z)
{
return vec_sel(z,y,x);
}
</programlisting>
<para>In fact, since this is C++ code, a template function works nicely. The compiler will instantiate <systemitem>VecCh</systemitem> for both <systemitem>uint32x4_p</systemitem> and <systemitem>uint64x2_p</systemitem>.</para>
<programlisting><?code-font-size 75% ?>template <class T>
T VecCh(T x, T y, T z)
{
return vec_sel(z,y,x);
}
</programlisting>
<para>Templates do not work for the Sigma functions because the 32-bit and 64-bit builtins have different names. You will have to supply C++ overloaded functions as shown below.</para>
<programlisting><?code-font-size 75% ?>uint64x2_p Vec_sigma0(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmad(val, 0, 0);
#else
return __builtin_crypto_vshasigmad(val, 0, 0);
#endif
}
uint64x2_p Vec_sigma1(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmad(val, 0, 0xf);
#else
return __builtin_crypto_vshasigmad(val, 0, 0xf);
#endif
}
uint64x2_p VecSigma0(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmad(val, 1, 0);
#else
return __builtin_crypto_vshasigmad(val, 1, 0);
#endif
}
uint64x2_p VecSigma1(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
return __vshasigmad(val, 1, 0xf);
#else
return __builtin_crypto_vshasigmad(val, 1, 0xf);
#endif
}
</programlisting>
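<para>As with SHA-256, a scalar reference for one 64-bit element is handy when testing the overloads. The sketch below uses hypothetical function names and the SHA-512 rotation and shift counts from FIPS 180-4. It is a cross-check aid and makes no claim about how the instruction is implemented.</para>
<programlisting><?code-font-size 75% ?>#include <cstdint>

// Rotate a 64-bit word right by n bits (n must be 1..63).
static inline uint64_t rotr64(uint64_t v, unsigned int n)
{
    return (v >> n) | (v << (64 - n));
}

// Upper-case Sigma functions, applied to working variables A and E.
uint64_t sha512_Sigma0(uint64_t x) { return rotr64(x, 28) ^ rotr64(x, 34) ^ rotr64(x, 39); }
uint64_t sha512_Sigma1(uint64_t x) { return rotr64(x, 14) ^ rotr64(x, 18) ^ rotr64(x, 41); }

// Lower-case sigma functions, applied in the message schedule.
uint64_t sha512_sigma0(uint64_t x) { return rotr64(x, 1) ^ rotr64(x, 8) ^ (x >> 7); }
uint64_t sha512_sigma1(uint64_t x) { return rotr64(x, 19) ^ rotr64(x, 61) ^ (x >> 6); }
</programlisting>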
</section>
</chapter>