<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="ch04" xreflabel="AES">
<title>Advanced Encryption Standard</title>
<indexterm>
<primary>AES</primary>
</indexterm>
<para>AES is the Advanced Encryption Standard. AES is specified in <ulink url="https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf">FIPS 197, Advanced Encryption Standard (AES)</ulink>. You should read the standard if you are not familiar with the block cipher.</para>
<para>GCC and IBM XL C/C++ use different data types and intrinsics to perform AES. GCC uses a 64x2 arrangement and IBM XL C/C++ uses an 8x16 arrangement. GCC uses <indexterm><primary>__builtin_crypto_vcipher</primary></indexterm><systemitem>__builtin_crypto_vcipher</systemitem>, <indexterm><primary>__builtin_crypto_vcipherlast</primary></indexterm><systemitem>__builtin_crypto_vcipherlast</systemitem>, <indexterm><primary>__builtin_crypto_vncipher</primary></indexterm><systemitem>__builtin_crypto_vncipher</systemitem> and <indexterm><primary>__builtin_crypto_vncipherlast</primary></indexterm><systemitem>__builtin_crypto_vncipherlast</systemitem> intrinsics. IBM XL C/C++ uses <indexterm><primary>__vcipher</primary></indexterm><systemitem>__vcipher</systemitem>, <indexterm><primary>__vcipherlast</primary></indexterm><systemitem>__vcipherlast</systemitem>, <indexterm><primary>__vncipher</primary></indexterm><systemitem>__vncipher</systemitem> and <indexterm><primary>__vncipherlast</primary></indexterm><systemitem>__vncipherlast</systemitem> intrinsics.</para>
<para>POWER8 offers instructions to perform encryption and decryption only. The ISA does not supply instructions that assist in key generation, like Intel's <systemitem>AESKEYGENASSIST</systemitem>.</para>
<para>Finally, the code below is available online at <ulink url="https://github.com/noloader/AES-Intrinsics">AES Intrinsics</ulink>. The GitHub repository provides accelerated AES for Intel, ARMv8 and POWER8.</para>
<section id="aes_strategy" xreflabel="AES Strategy">
<title>Strategy</title>
<para>The strategy to perform AES encryption and decryption is straightforward. First, the subkey or round key table is created based on the user key. The round keys are stored in big-endian format so a swap is avoided when loading a round key. Second, the message is loaded into the AES state array and an endian swap is performed as required. Third, the AES encryption or decryption round function is applied to the state array the required number of times. Each application of the round function is accompanied by the loading of a subkey. Finally, the encrypted or decrypted message is stored after performing an endian swap as needed.</para>
</section>
<section id="aes_endianness" xreflabel="AES Endianness">
<title>Endianness</title>
<para>The AES hardware operates in big-endian mode. On little-endian systems like <systemitem>gcc112</systemitem> you have to convert from little-endian to big-endian during loads and stores. The code to perform the conversion is shown below. Also recall from <xref linkend="vsx_unaligned_ld_st"/> that POWER7 provides unaligned loads and stores, so POWER8 has them available.</para>
<programlisting><?code-font-size 75% ?>uint8x16_p VecReverse8x16(const uint8x16_p src)
{
    uint8x16_p mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
    return vec_perm(src, src, mask);
}
uint64x2_p VecReverse64x2(const uint64x2_p src)
{
    uint8x16_p mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
    uint8x16_p val = (uint8x16_p) src;
    return (uint64x2_p)vec_perm(val, val, mask);
}
uint8x16_p VecLoad8x16(const uint8_t src[16])
{
#if defined(__xlc__) || defined(__xlC__)
    return vec_xl_be(0, (uint8_t*)src);
#else
# if __LITTLE_ENDIAN__
    return VecReverse8x16(vec_vsx_ld(0, src));
# else
    return vec_vsx_ld(0, src);
# endif
#endif
}
void VecStore8x16(const uint8x16_p src, uint8_t dest[16])
{
#if defined(__xlc__) || defined(__xlC__)
    vec_xst_be(src, 0, (uint8_t*)dest);
#else
# if __LITTLE_ENDIAN__
    vec_vsx_st(VecReverse8x16(src), 0, dest);
# else
    vec_vsx_st(src, 0, dest);
# endif
#endif
}
uint64x2_p VecLoad64x2(const uint8_t src[16])
{
#if defined(__xlc__) || defined(__xlC__)
    return (uint64x2_p)vec_xl_be(0, (uint8_t*)src);
#else
# if __LITTLE_ENDIAN__
    return (uint64x2_p)VecReverse8x16(vec_vsx_ld(0, src));
# else
    return (uint64x2_p)vec_vsx_ld(0, src);
# endif
#endif
}
void VecStore64x2(const uint64x2_p src, uint8_t dest[16])
{
#if defined(__xlc__) || defined(__xlC__)
    vec_xst_be((uint8x16_p)src, 0, (uint8_t*)dest);
#else
# if __LITTLE_ENDIAN__
    vec_vsx_st(VecReverse8x16((uint8x16_p)src), 0, dest);
# else
    vec_vsx_st((uint8x16_p)src, 0, dest);
# endif
#endif
}</programlisting>
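<para>The byte reversal can be checked on any machine with a scalar model of the permute. The sketch below is portable C++ and does not use the POWER8 intrinsics. <systemitem>Bytes16</systemitem>, <systemitem>PermuteModel</systemitem> and <systemitem>ReverseModel</systemitem> are hypothetical names introduced for illustration, and the model assumes the mask selects bytes in element order.</para>
<programlisting><?code-font-size 75% ?>
```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for a 16-byte vector; this is not the uint8x16_p type.
typedef std::array<uint8_t, 16> Bytes16;

// Scalar model of a byte permute: each mask byte selects one byte from the
// 32-byte concatenation of a and b (0-15 pick from a, 16-31 pick from b).
Bytes16 PermuteModel(const Bytes16& a, const Bytes16& b, const Bytes16& mask)
{
    Bytes16 r;
    for (size_t i = 0; i < 16; ++i) {
        const uint8_t sel = mask[i] & 0x1F;  // only the low 5 bits matter
        r[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
    return r;
}

// Same mask as VecReverse8x16: output byte i is input byte 15-i.
Bytes16 ReverseModel(const Bytes16& src)
{
    const Bytes16 mask = {{15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0}};
    const Bytes16 r = PermuteModel(src, src, mask);
    assert(r[0] == src[15] && r[15] == src[0]);  // sanity check on the model
    return r;
}
```
</programlisting>
<para>Applying <systemitem>ReverseModel</systemitem> twice returns the original bytes, which is why the same reversal serves both the load and the store.</para>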
</section>
<section id="aes_functions" xreflabel="AES Functions">
<title>Functions</title>
<para>GCC and IBM XL C/C++ use different intrinsics and different datatypes for encryption and decryption. GCC uses a 64x2 vector arrangement while IBM XL C/C++ uses an 8x16 arrangement. An intrinsics-based implementation should wrap two functions for encryption and two functions for decryption.</para>
<para>For the encryption operation POWER8 provides a standard round function and a function to encrypt the last round. Source code should look similar to the code below.</para>
<programlisting><?code-font-size 75% ?>template <class T>
T VecEncrypt(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vcipher(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vcipher(s, k);
#endif
}
template <class T>
T VecEncryptLast(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vcipherlast(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vcipherlast(s, k);
#endif
}</programlisting>
<para>And the corresponding decryption functions are shown below.</para>
<programlisting><?code-font-size 75% ?>template <class T>
T VecDecrypt(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vncipher(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vncipher(s, k);
#endif
}
template <class T>
T VecDecryptLast(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vncipherlast(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vncipherlast(s, k);
#endif
}</programlisting>
</section>
<section id="aes_golden_key" xreflabel="AES Golden Key">
<title>Golden key</title>
<para>FIPS 197 Appendix B provides a user key expanded into round keys. We refer to it as the "golden key" and it allows us to independently test round key derivation, encryption and decryption. The sections <xref linkend="aes_encryption"/> and <xref linkend="aes_decryption"/> use the golden key to simplify the discussions.</para>
<para>Appendix B provides two key parameters. The first is the AES key supplied by the user. The second is the expanded subkey or round key table. Below is the user key supplied by Appendix B.</para>
<programlisting><?code-font-size 75% ?>const uint8_t key[16] = {
0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6,
0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c
};</programlisting>
<para>The round keys for AES-128 are as follows. Since we control the round key buffer we can make it aligned. The aligned loads will save a tiny amount of time during each load of a round key.</para>
<programlisting><?code-font-size 75% ?>__attribute__((aligned(16)))
const uint8_t subkeys[10][16] = {
{0xA0, 0xFA, 0xFE, 0x17, 0x88, 0x54, 0x2c, 0xb1,
0x23, 0xa3, 0x39, 0x39, 0x2a, 0x6c, 0x76, 0x05},
{0xF2, 0xC2, 0x95, 0xF2, 0x7a, 0x96, 0xb9, 0x43,
0x59, 0x35, 0x80, 0x7a, 0x73, 0x59, 0xf6, 0x7f},
{0x3D, 0x80, 0x47, 0x7D, 0x47, 0x16, 0xFE, 0x3E,
0x1E, 0x23, 0x7E, 0x44, 0x6D, 0x7A, 0x88, 0x3B},
{0xEF, 0x44, 0xA5, 0x41, 0xA8, 0x52, 0x5B, 0x7F,
0xB6, 0x71, 0x25, 0x3B, 0xDB, 0x0B, 0xAD, 0x00},
{0xD4, 0xD1, 0xC6, 0xF8, 0x7C, 0x83, 0x9D, 0x87,
0xCA, 0xF2, 0xB8, 0xBC, 0x11, 0xF9, 0x15, 0xBC},
{0x6D, 0x88, 0xA3, 0x7A, 0x11, 0x0B, 0x3E, 0xFD,
0xDB, 0xF9, 0x86, 0x41, 0xCA, 0x00, 0x93, 0xFD},
{0x4E, 0x54, 0xF7, 0x0E, 0x5F, 0x5F, 0xC9, 0xF3,
0x84, 0xA6, 0x4F, 0xB2, 0x4E, 0xA6, 0xDC, 0x4F},
{0xEA, 0xD2, 0x73, 0x21, 0xB5, 0x8D, 0xBA, 0xD2,
0x31, 0x2B, 0xF5, 0x60, 0x7F, 0x8D, 0x29, 0x2F},
{0xAC, 0x77, 0x66, 0xF3, 0x19, 0xFA, 0xDC, 0x21,
0x28, 0xD1, 0x29, 0x41, 0x57, 0x5c, 0x00, 0x6E},
{0xD0, 0x14, 0xF9, 0xA8, 0xC9, 0xEE, 0x25, 0x89,
0xE1, 0x3F, 0x0c, 0xC8, 0xB6, 0x63, 0x0C, 0xA6}
};</programlisting>
</section>
<section id="aes_keying" xreflabel="AES Key Schedule">
<title>Key schedule</title>
<indexterm>
<primary>AES</primary>
<secondary>Key schedule</secondary>
</indexterm>
<para>TODO. We don't have optimized code for key scheduling. Use Paulo Barreto's code to generate the key table in C/C++. It is available on the internet.</para>
<para>A brief discussion of a POWER8-optimized key schedule can be found at <ulink url="https://www.ibm.com/developerworks/library/se-power8-in-core-cryptography/index.html">POWER8 in-core cryptography</ulink>.</para>
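<para>Until optimized code is available, a round key table for a 128-bit key can be produced with a few lines of portable C++. The sketch below follows the FIPS 197 key expansion. It is not Paulo Barreto's code and it is not optimized or constant time. The names <systemitem>GfMul</systemitem>, <systemitem>SubByte</systemitem> and <systemitem>ExpandKey128</systemitem> are hypothetical, and the S-box is computed from its definition rather than read from a table. The output layout is assumed to match a <systemitem>subkeys[10][16]</systemitem> table that excludes the user key.</para>
<programlisting><?code-font-size 75% ?>
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
static uint8_t GfMul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        const bool carry = (a & 0x80) != 0;
        a = static_cast<uint8_t>(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return p;
}

static uint8_t Rotl8(uint8_t x, int n)
{
    return static_cast<uint8_t>((x << n) | (x >> (8 - n)));
}

// S-box from its definition: field inverse (0 maps to 0), then the affine map.
// Slow by design; it only runs 40 times per key.
static uint8_t SubByte(uint8_t x)
{
    uint8_t inv = 0;
    for (int y = 1; x != 0 && y < 256; ++y) {
        if (GfMul(x, static_cast<uint8_t>(y)) == 1) {
            inv = static_cast<uint8_t>(y);
            break;
        }
    }
    return static_cast<uint8_t>(inv ^ Rotl8(inv, 1) ^ Rotl8(inv, 2) ^
                                Rotl8(inv, 3) ^ Rotl8(inv, 4) ^ 0x63);
}

// Expand a 128-bit user key into 10 round keys of 16 bytes each.
void ExpandKey128(const uint8_t key[16], uint8_t subkeys[10][16])
{
    assert(key != nullptr && subkeys != nullptr);
    const uint8_t* prev = key;
    uint8_t rcon = 0x01;
    for (size_t r = 0; r < 10; ++r) {
        uint8_t* cur = subkeys[r];
        // First word: rotate and substitute the last word of the previous
        // round key, add the round constant, then XOR with its first word.
        cur[0] = static_cast<uint8_t>(prev[0] ^ SubByte(prev[13]) ^ rcon);
        cur[1] = static_cast<uint8_t>(prev[1] ^ SubByte(prev[14]));
        cur[2] = static_cast<uint8_t>(prev[2] ^ SubByte(prev[15]));
        cur[3] = static_cast<uint8_t>(prev[3] ^ SubByte(prev[12]));
        // Remaining words chain from the word just produced.
        for (size_t i = 4; i < 16; ++i)
            cur[i] = static_cast<uint8_t>(prev[i] ^ cur[i - 4]);
        prev = cur;
        rcon = GfMul(rcon, 2);
    }
}
```
</programlisting>
<para>A table produced this way can be compared byte for byte against the round keys published in FIPS 197 before it is used with the intrinsics.</para>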
</section>
<section id="aes_encryption" xreflabel="AES Encryption">
<title>Encryption</title>
<indexterm>
<primary>AES</primary>
<secondary>Encryption</secondary>
</indexterm>
<para>AES encryption consists of three parts. First, the user's message is loaded into a state buffer. On little-endian machines the byte order is reversed during the load. Second, a round key is loaded and the AES round function is applied. The second part is repeated the required number of times. For example, AES with a 128-bit key applies the round function 10 times. Third, the result of encrypting the state, which is the encrypted block, is stored. On little-endian machines the byte order is reversed again during the store.</para>
<para><emphasis role="bold">Part 1.</emphasis> Load the user message into the state vector and XOR it with the user key. <systemitem>VecLoad64x2</systemitem> swaps endianness as required. The 64x2 arrangement indicates this is a GCC code path.</para>
<programlisting><?code-font-size 75% ?>uint64x2_p s = VecLoad64x2(input);
uint64x2_p k = VecLoad64x2(key);
s = VecXor(s, k);</programlisting>
<para><emphasis role="bold">Part 2.</emphasis> Load a subkey and encrypt the state buffer. The round key does not need an endian swap. Lather, rinse and repeat the required number of times.</para>
<para>In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__builtin_crypto_vcipher</primary></indexterm>
<indexterm><primary>__builtin_crypto_vcipherlast</primary></indexterm></para>
<programlisting><?code-font-size 75% ?>k = VecLoad64x2(subkeys[0]);
s = VecEncrypt(s, k);
k = VecLoad64x2(subkeys[1]);
s = VecEncrypt(s, k);
k = VecLoad64x2(subkeys[2]);
s = VecEncrypt(s, k);
...
k = VecLoad64x2(subkeys[7]);
s = VecEncrypt(s, k);
k = VecLoad64x2(subkeys[8]);
s = VecEncrypt(s, k);
k = VecLoad64x2(subkeys[9]);
s = VecEncryptLast(s, k);</programlisting>
<para><emphasis role="bold">Part 3.</emphasis> Store the new state which is the encrypted block. <systemitem>VecStore64x2</systemitem> swaps endianness as required.</para>
<programlisting><?code-font-size 75% ?>VecStore64x2(s, output);</programlisting>
<para>The AES-128 code shown above demonstrates a GCC code path using the 64x2 arrangement. Below is the IBM XL C/C++ code path using an 8x16 arrangement. In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__vcipher</primary></indexterm>
<indexterm><primary>__vcipherlast</primary></indexterm></para>
<programlisting><?code-font-size 75% ?>uint8x16_p s = VecLoad8x16(input);
uint8x16_p k = VecLoad8x16(key);
s = VecXor(s, k);
k = VecLoad8x16(subkeys[0]);
s = VecEncrypt(s, k);
k = VecLoad8x16(subkeys[1]);
s = VecEncrypt(s, k);
k = VecLoad8x16(subkeys[2]);
s = VecEncrypt(s, k);
...
k = VecLoad8x16(subkeys[7]);
s = VecEncrypt(s, k);
k = VecLoad8x16(subkeys[8]);
s = VecEncrypt(s, k);
k = VecLoad8x16(subkeys[9]);
s = VecEncryptLast(s, k);
VecStore8x16(s, output);</programlisting>
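<para>When bringing up the intrinsics it helps to have a slow, portable reference to compare against. The sketch below models the same sequence in plain C++: an initial XOR with the user key, nine full rounds and one last round without MixColumns. It is an illustration under stated assumptions and not the code from the repository. <systemitem>RoundModel</systemitem> and <systemitem>EncryptReference128</systemitem> are hypothetical names, the S-box is computed from its definition, and the code is not constant time.</para>
<programlisting><?code-font-size 75% ?>
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
static uint8_t GfMul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        const bool carry = (a & 0x80) != 0;
        a = static_cast<uint8_t>(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return p;
}

static uint8_t Rotl8(uint8_t x, int n)
{
    return static_cast<uint8_t>((x << n) | (x >> (8 - n)));
}

// S-box from its definition: field inverse (0 maps to 0), then the affine map.
static uint8_t SubByte(uint8_t x)
{
    uint8_t inv = 0;
    for (int y = 1; x != 0 && y < 256; ++y) {
        if (GfMul(x, static_cast<uint8_t>(y)) == 1) {
            inv = static_cast<uint8_t>(y);
            break;
        }
    }
    return static_cast<uint8_t>(inv ^ Rotl8(inv, 1) ^ Rotl8(inv, 2) ^
                                Rotl8(inv, 3) ^ Rotl8(inv, 4) ^ 0x63);
}

// One AES round on a state stored column by column, as in FIPS 197.
// 'last' skips MixColumns, mirroring the round / last-round split.
static void RoundModel(uint8_t s[16], const uint8_t rk[16], bool last)
{
    uint8_t t[16];
    // SubBytes and ShiftRows together: row r rotates left by r columns.
    for (size_t c = 0; c < 4; ++c)
        for (size_t r = 0; r < 4; ++r)
            t[r + 4*c] = SubByte(s[r + 4*((c + r) % 4)]);
    // MixColumns (unless last), then AddRoundKey.
    for (size_t c = 0; c < 4; ++c) {
        const uint8_t* a = t + 4*c;
        for (size_t r = 0; r < 4; ++r) {
            const uint8_t m = last ? a[r] : static_cast<uint8_t>(
                GfMul(a[r], 2) ^ GfMul(a[(r + 1) % 4], 3) ^
                a[(r + 2) % 4] ^ a[(r + 3) % 4]);
            s[r + 4*c] = static_cast<uint8_t>(m ^ rk[r + 4*c]);
        }
    }
}

// AES-128 encryption of one block with a precomputed round key table.
void EncryptReference128(const uint8_t input[16], const uint8_t key[16],
                         const uint8_t subkeys[10][16], uint8_t output[16])
{
    assert(input != nullptr && key != nullptr && output != nullptr);
    uint8_t s[16];
    for (size_t i = 0; i < 16; ++i)
        s[i] = static_cast<uint8_t>(input[i] ^ key[i]);
    for (size_t r = 0; r < 9; ++r)
        RoundModel(s, subkeys[r], false);
    RoundModel(s, subkeys[9], true);
    std::memcpy(output, s, 16);
}
```
</programlisting>
<para>Given the FIPS 197 Appendix B input, cipher key and round keys, the reference should reproduce the Appendix B ciphertext. The intrinsics path can then be checked against the same bytes.</para>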
</section>
<section id="aes_decryption" xreflabel="AES Decryption">
<title>Decryption</title>
<indexterm>
<primary>AES</primary>
<secondary>Decryption</secondary>
</indexterm>
<para>AES decryption is the reverse operation of AES encryption. There are three parts, as with AES encryption. First, the encrypted message is loaded into a state buffer. The message is endian swapped as required. Second, a subkey is loaded and the AES inverse round function is applied. The second part is repeated the required number of times. For example, AES with a 128-bit key applies the inverse round function 10 times. Third, the result of decrypting the state, which is the decrypted block, is stored. The decrypted message is endian swapped as required.</para>
<para>AES decryption has two minor differences from the encryption algorithm. First the round keys are iterated in reverse order. Second, the user or master key is used last instead of first.</para>
<para>The code below demonstrates AES-128 using the GCC code path. GCC uses the 64x2 arrangement. In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__builtin_crypto_vncipher</primary></indexterm>
<indexterm><primary>__builtin_crypto_vncipherlast</primary></indexterm></para>
<programlisting><?code-font-size 75% ?>uint64x2_p s = VecLoad64x2(input);
uint64x2_p k = VecLoad64x2(subkeys[9]);
s = VecXor(s, k);
k = VecLoad64x2(subkeys[8]);
s = VecDecrypt(s, k);
k = VecLoad64x2(subkeys[7]);
s = VecDecrypt(s, k);
k = VecLoad64x2(subkeys[6]);
s = VecDecrypt(s, k);
...
k = VecLoad64x2(subkeys[1]);
s = VecDecrypt(s, k);
k = VecLoad64x2(subkeys[0]);
s = VecDecrypt(s, k);
k = VecLoad64x2(key);
s = VecDecryptLast(s, k);
VecStore64x2(s, output);</programlisting>
<para>As with AES encryption there is a different code path for IBM XL C/C++ using the 8x16 datatypes. The code below shows XL C/C++ decryption using the 8x16 datatype. In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__vncipher</primary></indexterm>
<indexterm><primary>__vncipherlast</primary></indexterm></para>
<programlisting><?code-font-size 75% ?>uint8x16_p s = VecLoad8x16(input);
uint8x16_p k = VecLoad8x16(subkeys[9]);
s = VecXor(s, k);
k = VecLoad8x16(subkeys[8]);
s = VecDecrypt(s, k);
k = VecLoad8x16(subkeys[7]);
s = VecDecrypt(s, k);
k = VecLoad8x16(subkeys[6]);
s = VecDecrypt(s, k);
...
k = VecLoad8x16(subkeys[1]);
s = VecDecrypt(s, k);
k = VecLoad8x16(subkeys[0]);
s = VecDecrypt(s, k);
k = VecLoad8x16(key);
s = VecDecryptLast(s, k);
VecStore8x16(s, output);</programlisting>
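<para>The key ordering used for decryption can be cross-checked with a portable model of the inverse cipher. The sketch below assumes each inverse round applies InvShiftRows, InvSubBytes, the round key XOR and then InvMixColumns, which is the ordering that allows the unmodified round keys to be used in reverse. It is an illustration and not the repository code. <systemitem>InvRoundModel</systemitem> and <systemitem>DecryptReference128</systemitem> are hypothetical names, and the code is neither optimized nor constant time.</para>
<programlisting><?code-font-size 75% ?>
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
static uint8_t GfMul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        const bool carry = (a & 0x80) != 0;
        a = static_cast<uint8_t>(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return p;
}

static uint8_t Rotl8(uint8_t x, int n)
{
    return static_cast<uint8_t>((x << n) | (x >> (8 - n)));
}

// Forward S-box from its definition: field inverse, then the affine map.
static uint8_t SubByte(uint8_t x)
{
    uint8_t inv = 0;
    for (int y = 1; x != 0 && y < 256; ++y) {
        if (GfMul(x, static_cast<uint8_t>(y)) == 1) {
            inv = static_cast<uint8_t>(y);
            break;
        }
    }
    return static_cast<uint8_t>(inv ^ Rotl8(inv, 1) ^ Rotl8(inv, 2) ^
                                Rotl8(inv, 3) ^ Rotl8(inv, 4) ^ 0x63);
}

// Inverse S-box, built once by inverting the forward map (not thread safe).
static uint8_t InvSubByte(uint8_t y)
{
    static uint8_t table[256];
    static bool ready = false;
    if (!ready) {
        for (int x = 0; x < 256; ++x)
            table[SubByte(static_cast<uint8_t>(x))] = static_cast<uint8_t>(x);
        ready = true;
    }
    return table[y];
}

// One inverse round; 'last' skips InvMixColumns.
static void InvRoundModel(uint8_t s[16], const uint8_t rk[16], bool last)
{
    uint8_t t[16];
    // InvShiftRows and InvSubBytes: row r rotates right by r columns.
    for (size_t c = 0; c < 4; ++c)
        for (size_t r = 0; r < 4; ++r)
            t[r + 4*c] = InvSubByte(s[r + 4*((c + 4 - r) % 4)]);
    // The round key is added before InvMixColumns in this ordering.
    for (size_t i = 0; i < 16; ++i) t[i] ^= rk[i];
    for (size_t c = 0; c < 4; ++c) {
        const uint8_t* a = t + 4*c;
        for (size_t r = 0; r < 4; ++r) {
            s[r + 4*c] = last ? a[r] : static_cast<uint8_t>(
                GfMul(a[r], 14) ^ GfMul(a[(r + 1) % 4], 11) ^
                GfMul(a[(r + 2) % 4], 13) ^ GfMul(a[(r + 3) % 4], 9));
        }
    }
}

// AES-128 decryption of one block: last round key first, user key last.
void DecryptReference128(const uint8_t input[16], const uint8_t key[16],
                         const uint8_t subkeys[10][16], uint8_t output[16])
{
    assert(input != nullptr && key != nullptr && output != nullptr);
    uint8_t s[16];
    for (size_t i = 0; i < 16; ++i)
        s[i] = static_cast<uint8_t>(input[i] ^ subkeys[9][i]);
    for (size_t r = 9; r-- > 0; )   // r runs 8, 7, ..., 0
        InvRoundModel(s, subkeys[r], false);
    InvRoundModel(s, key, true);
    std::memcpy(output, s, 16);
}
```
</programlisting>
<para>Feeding the FIPS 197 Appendix B ciphertext, cipher key and round keys to this model should return the Appendix B input block.</para>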
</section>
<section id="aes_performance" xreflabel="AES Performance">
<title>Performance</title>
<indexterm>
<primary>AES</primary>
<secondary>Performance</secondary>
</indexterm>
<para>The code in <xref linkend="aes_encryption"/> and <xref linkend="aes_decryption"/> provides the basic AES algorithms. It will perform well when compared to C/C++ but there is room for improvement. You can improve the code to run closer to 1 to 2 cycles per byte (cpb) by processing multiple blocks at a time.</para>
<para>Experimentation shows 6 or 8 blocks at a time is a good place to be. Crypto++ processes 6 blocks at a time while Botan processes 8 blocks at a time. The Linux kernel processes 12 blocks at a time for some POWER8 algorithms.</para>
<para>The code below processes 16*8 or 128 bytes of data at a time using the GCC code path. The variable <systemitem>rounds</systemitem> is the number of AES rounds, which is 10 for AES-128. The IBM code path would be similar.</para>
<para>In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__builtin_crypto_vcipher</primary></indexterm>
<indexterm><primary>__builtin_crypto_vcipherlast</primary></indexterm></para>
<programlisting><?code-font-size 75% ?>uint64x2_p k = VecLoad64x2(key);
uint64x2_p s0 = VecLoad64x2(input+0);
uint64x2_p s1 = VecLoad64x2(input+16);
uint64x2_p s2 = VecLoad64x2(input+32);
uint64x2_p s3 = VecLoad64x2(input+48);
uint64x2_p s4 = VecLoad64x2(input+64);
uint64x2_p s5 = VecLoad64x2(input+80);
uint64x2_p s6 = VecLoad64x2(input+96);
uint64x2_p s7 = VecLoad64x2(input+112);
s0 = VecXor(s0, k);
s1 = VecXor(s1, k);
s2 = VecXor(s2, k);
s3 = VecXor(s3, k);
s4 = VecXor(s4, k);
s5 = VecXor(s5, k);
s6 = VecXor(s6, k);
s7 = VecXor(s7, k);
for (size_t i=0; i<rounds-1; ++i)
{
    k = VecLoad64x2(subkeys[i]);
    s0 = VecEncrypt(s0, k);
    s1 = VecEncrypt(s1, k);
    s2 = VecEncrypt(s2, k);
    s3 = VecEncrypt(s3, k);
    s4 = VecEncrypt(s4, k);
    s5 = VecEncrypt(s5, k);
    s6 = VecEncrypt(s6, k);
    s7 = VecEncrypt(s7, k);
}
k = VecLoad64x2(subkeys[rounds-1]);
s0 = VecEncryptLast(s0, k);
s1 = VecEncryptLast(s1, k);
s2 = VecEncryptLast(s2, k);
s3 = VecEncryptLast(s3, k);
s4 = VecEncryptLast(s4, k);
s5 = VecEncryptLast(s5, k);
s6 = VecEncryptLast(s6, k);
s7 = VecEncryptLast(s7, k);
VecStore64x2(s0, output+0);
VecStore64x2(s1, output+16);
VecStore64x2(s2, output+32);
VecStore64x2(s3, output+48);
VecStore64x2(s4, output+64);
VecStore64x2(s5, output+80);
VecStore64x2(s6, output+96);
VecStore64x2(s7, output+112);</programlisting>
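<para>A multi-block routine also has to deal with buffers that are not a multiple of the group size. The sketch below shows one way to split a length into full groups, leftover single blocks and a partial tail. <systemitem>BufferSplit</systemitem> and <systemitem>SplitBuffer</systemitem> are hypothetical names introduced for illustration, and the group width is a parameter so that 6, 8 or 12 blocks can be tried.</para>
<programlisting><?code-font-size 75% ?>
```cpp
#include <cassert>
#include <cstddef>

// How a byte count divides into work for a multi-block AES routine.
struct BufferSplit {
    size_t groups;   // full groups of blocksPerGroup blocks
    size_t singles;  // remaining whole 16-byte blocks
    size_t tail;     // leftover bytes that do not fill a block
};

BufferSplit SplitBuffer(size_t length, size_t blocksPerGroup)
{
    assert(blocksPerGroup > 0);
    const size_t blockSize = 16;                      // AES block size in bytes
    const size_t groupSize = blockSize * blocksPerGroup;
    BufferSplit s;
    s.groups = length / groupSize;
    const size_t rest = length % groupSize;
    s.singles = rest / blockSize;
    s.tail = rest % blockSize;
    return s;
}
```
</programlisting>
<para>A caller would run the 8-block code <systemitem>groups</systemitem> times and then fall back to the single-block code for <systemitem>singles</systemitem>. Handling of <systemitem>tail</systemitem> depends on the mode of operation and is outside the scope of this sketch.</para>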
</section>
</chapter>