ch04.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="ch04" xreflabel="AES">
  <title>Advanced Encryption Standard</title>
  <indexterm>
    <primary>AES</primary>
  </indexterm>
  <para>AES is the Advanced Encryption Standard. AES is specified in <ulink url="https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.197.pdf">FIPS 197, Advanced Encryption Standard (AES)</ulink>. You should read the standard if you are not familiar with the block cipher.</para>
  <para>GCC and XL C/C++ use different data types and intrinsics to perform AES. GCC uses a 64x2 arrangement and IBM XL C/C++ uses a 8x16 arrangement. GCC uses <indexterm><primary>__builtin_crypto_vcipher</primary></indexterm><systemitem>__builtin_crypto_vcipher</systemitem>, <indexterm><primary>__builtin_crypto_vcipherlast</primary></indexterm><systemitem>__builtin_crypto_vcipherlast</systemitem>, <indexterm><primary>__builtin_crypto_vncipher</primary></indexterm><systemitem>__builtin_crypto_vncipher</systemitem> and <indexterm><primary>__builtin_crypto_vncipherlast</primary></indexterm><systemitem>__builtin_crypto_vncipherlast</systemitem> intrinsics. IBM XL C/C++ uses <indexterm><primary>__vcipher</primary></indexterm><systemitem>__vcipher</systemitem>, <indexterm><primary>__vcipherlast</primary></indexterm><systemitem>__vcipherlast</systemitem>, <indexterm><primary>__vncipher</primary></indexterm><systemitem>__vncipher</systemitem> and <indexterm><primary>__vncipherlast</primary></indexterm><systemitem>__vncipherlast</systemitem> intrinsics.</para>
  <para>POWER8 offers instructions to perform encryption and decryption only. The ISA does not supply instructions that assist in key generation, like Intel's <systemitem>AESKEYGENASSIST</systemitem>.</para>
  <para>Finally the code below is available online at <ulink url="https://github.com/noloader/AES-Intrinsics">AES Intrinsics</ulink>. The GitHub provides accelerated AES for Intel, ARMv8 and POWER8.</para>
  <section id="aes_strategy" xreflabel="AES Strategy">
    <title>Strategy</title>
    <para>The strategy to perform AES encryption and decryption is straight forward. First the subkey or round key table is created based on the user key. The round keys are stored in big-endian format so a swap is avoided when loading a round key. Second, the message is loaded into the AES state array and an endian swap is performed as required. Third the the AES encryption or decryption round function is applied to the state array the required number of times. Each application of the round function is accompanied by a loading of a subkey. Finally the encrypted or decrypted message is stored after performing an endian swap as needed.</para>
  </section>
  <section id="aes_endianness" xreflabel="AES Endianness">
    <title>Endianness</title>
    <para>The AES hardware operates in big-endian mode. On little-endian systems like <systemitem>gcc112</systemitem> you have to convert from little-endian to big-endian during loads and stores. The code to perform the conversion is shown below. Also recall from <xref linkend="vsx_unaligned_ld_st"/> POWER7 provides unaligned loads and stores so POWER8 has them available.</para>
    <programlisting><?code-font-size 75% ?>uint8x16_p VecReverse8x16(const uint8x16_p src)
{
    uint8x16_p mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
    return vec_perm(src, src, mask);
}

uint64x2_p VecReverse64x2(const uint64x2_p src)
{
    uint8x16_p mask = {15,14,13,12, 11,10,9,8, 7,6,5,4, 3,2,1,0};
    uint8x16_p val = (uint8x16_p) src;
    return (uint64x2_p)vec_perm(val, val, mask);
}

uint8x16_p VecLoad8x16(const uint8_t src[16])
{
#if defined(__xlc__) || defined(__xlC__)
    return vec_xl_be(0, (uint8_t*)src);
#else
# if __LITTLE_ENDIAN__
    return VecReverse8x16(vec_vsx_ld(0, src));
# else
    return vec_vsx_ld(0, src);
# endif
#endif
}

void VecStore8x16(const uint8x16_p src, uint8_t dest[16])
{
#if defined(__xlc__) || defined(__xlC__)
    vec_xst_be(src, 0, (uint8_t*)dest);
#else
# if __LITTLE_ENDIAN__
    vec_vsx_st(VecReverse8x16(src), 0, dest);
# else
    vec_vsx_st(src, 0, dest);
# endif
#endif
}

uint64x2_p VecLoad64x2(const uint8_t src[16])
{
#if defined(__xlc__) || defined(__xlC__)
    return (uint64x2_p)vec_xl_be(0, (uint8_t*)src);
#else
# if __LITTLE_ENDIAN__
    return (uint64x2_p)VecReverse8x16(vec_vsx_ld(0, src));
# else
    return (uint64x2_p)vec_vsx_ld(0, src);
# endif
#endif
}

void VecStore64x2(const uint64x2_p src, uint8_t dest[16])
{
#if defined(__xlc__) || defined(__xlC__)
    vec_xst_be((uint8x16_p)src, 0, (uint8_t*)dest);
#else
# if __LITTLE_ENDIAN__
    vec_vsx_st(VecReverse8x16((uint8x16_p)src), 0, dest);
# else
    vec_vsx_st((uint8x16_p)src, 0, dest);
# endif
#endif
}</programlisting>
  </section>
  <section id="aes_functions" xreflabel="AES Functions">
    <title>Functions</title>
    <para>GCC and IBM XL C/C++ uses different intrinsics and different datatypes for encryption and decryption. GCC uses a 64x2 vector arrangement while IBM XL CC++ uses a 8x16 arrangement. An intrinsics based implementation should wrap two functions for encryption and two functions for decryption.</para>
    <para>For the encryption operation POWER8 provides a standard round function and a function to encrypt the last round. Source code should look similar to below.</para>
    <programlisting><?code-font-size 75% ?>template &lt;class T&gt;
T VecEncrypt(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vcipher(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vcipher(s, k);
#endif
}

template &lt;class T&gt;
T VecEncryptLast(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vcipherlast(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vcipherlast(s, k);
#endif
}</programlisting>
    <para>And the corresponding decryption functions are shown below.</para>
    <programlisting><?code-font-size 75% ?>template &lt;class T&gt;
T VecDecrypt(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vncipher(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vncipher(s, k);
#endif
}

template &lt;class T&gt;
T VecDecryptLast(const T state, const T rkey)
{
#if defined(__xlc__) || defined(__xlC__)
    uint8x16_p s = (uint8x16_p)state;
    uint8x16_p k = (uint8x16_p)rkey;
    return (T)__vncipherlast(s, k);
#else
    uint64x2_p s = (uint64x2_p)state;
    uint64x2_p k = (uint64x2_p)rkey;
    return (T)__builtin_crypto_vncipherlast(s, k);
#endif
}</programlisting>
  </section>
  <section id="aes_golden_key" xreflabel="AES Golden Key">
    <title>Golden key</title>
    <para>FIPS 197 Appendix B provides a user key expanded into round keys. We refer to it as the "golden key" and it allows us to independently test round key derivation, encryption and decryption. The sections <xref linkend="aes_encryption"/> and <xref linkend="aes_decryption"/> use the golden key to simplify the discussions.</para>
    <para>Appendix B provides two key parameters. The first is the AES key supplied by the user. The second is the expanded subkey or round key table. Below is the user key supplied by Appendix B.</para>
    <programlisting><?code-font-size 75% ?>const uint8_t key[16] = {
     0x32, 0x43, 0xf6, 0xa8, 0x88, 0x5a, 0x30, 0x8d,
     0x31, 0x31, 0x98, 0xa2, 0xe0, 0x37, 0x07, 0x34
};</programlisting>
    <para>The round keys for AES-128 are as follows. Since we control the round key buffer we can make it aligned. The aligned loads will save a tiny amount of time during each load of a round key.</para>
    <programlisting><?code-font-size 75% ?>__attribute__((aligned(16)))
const uint8_t subkeys[10][16] = {
     {0xA0, 0xFA, 0xFE, 0x17, 0x88, 0x54, 0x2c, 0xb1,
        0x23, 0xa3, 0x39, 0x39, 0x2a, 0x6c, 0x76, 0x05},
     {0xF2, 0xC2, 0x95, 0xF2, 0x7a, 0x96, 0xb9, 0x43,
        0x59, 0x35, 0x80, 0x7a, 0x73, 0x59, 0xf6, 0x7f},
     {0x3D, 0x80, 0x47, 0x7D, 0x47, 0x16, 0xFE, 0x3E,
        0x1E, 0x23, 0x7E, 0x44, 0x6D, 0x7A, 0x88, 0x3B},
     {0xEF, 0x44, 0xA5, 0x41, 0xA8, 0x52, 0x5B, 0x7F,
        0xB6, 0x71, 0x25, 0x3B, 0xDB, 0x0B, 0xAD, 0x00},
     {0xD4, 0xD1, 0xC6, 0xF8, 0x7C, 0x83, 0x9D, 0x87,
        0xCA, 0xF2, 0xB8, 0xBC, 0x11, 0xF9, 0x15, 0xBC},
     {0x6D, 0x88, 0xA3, 0x7A, 0x11, 0x0B, 0x3E, 0xFD,
        0xDB, 0xF9, 0x86, 0x41, 0xCA, 0x00, 0x93, 0xFD},
     {0x4E, 0x54, 0xF7, 0x0E, 0x5F, 0x5F, 0xC9, 0xF3,
        0x84, 0xA6, 0x4F, 0xB2, 0x4E, 0xA6, 0xDC, 0x4F},
     {0xEA, 0xD2, 0x73, 0x21, 0xB5, 0x8D, 0xBA, 0xD2,
        0x31, 0x2B, 0xF5, 0x60, 0x7F, 0x8D, 0x29, 0x2F},
     {0xAC, 0x77, 0x66, 0xF3, 0x19, 0xFA, 0xDC, 0x21,
        0x28, 0xD1, 0x29, 0x41, 0x57, 0x5c, 0x00, 0x6E},
     {0xD0, 0x14, 0xF9, 0xA8, 0xC9, 0xEE, 0x25, 0x89,
        0xE1, 0x3F, 0x0c, 0xC8, 0xB6, 0x63, 0x0C, 0xA6}
};</programlisting>
  </section>
  <section id="aes_keying" xreflabel="AES Key Schedule">
    <title>Key schedule</title>
    <indexterm>
      <primary>AES</primary>
      <secondary>Key schedule</secondary>
    </indexterm>
    <para>TODO. We don't have optimized code for key scheduling. Use Paulo Barreto's code to generate the key table in C/C++. It is available on the internet.</para>
    <para>A brief discussion of an POWER8 optimized key schedule can be found at <ulink url="https://www.ibm.com/developerworks/library/se-power8-in-core-cryptography/index.html">POWER8 in-core cryptography</ulink>.</para>
  </section>
  <section id="aes_encryption" xreflabel="AES Encryption">
    <title>Encryption</title>
    <indexterm>
      <primary>AES</primary>
      <secondary>Encryption</secondary>
    </indexterm>
    <para>AES encryption consists of three steps. First, the user's message is loaded into a state buffer. On little-endian machines the byte order will be reversed. Second, a round key is loaded and the AES round function is applied. The second part is repeated a required number of times. For example, AES with 128-bit key applies the round function 10 times. The third part stores the result of encrypting the state, which is the encrypted block. On little-endian machines the byte order will be reversed.</para>
    <para><emphasis role="bold">Part 1.</emphasis> Load the user message into the state vector. <systemitem>VecLoad64x2</systemitem> swaps endianness as required. The 64x2 arrangement tells this is a GCC code path.</para>
    <programlisting><?code-font-size 75% ?>uint64x2_p s = VecLoad64x2(input);
uint64x2_p k = VecLoad64x2(key);
s = VecXor(s, k);</programlisting>
    <para><emphasis role="bold">Part 2.</emphasis> Load a subkey and encrypt the state buffer. The round key does not need an endian swap. Lather, rinse and repeat the required number of times.</para>
    <para>In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__builtin_crypto_vcipher</primary></indexterm>
      <indexterm><primary>__builtin_crypto_vcipherlast</primary></indexterm></para>
    <programlisting><?code-font-size 75% ?>k = VecLoad64x2(subkeys[0]);
s = VecEncrypt(s, k);

k = VecLoad64x2(subkeys[1]);
s = VecEncrypt(s, k);

k = VecLoad64x2(subkeys[2]);
s = VecEncrypt(s, k);

...

k = VecLoad64x2(subkeys[7]);
s = VecEncrypt(s, k);

k = VecLoad64x2(subkeys[8]);
s = VecEncrypt(s, k);

k = VecLoad64x2(subkeys[9]);
s = VecEncryptLast(s, k);</programlisting>
    <para><emphasis role="bold">Part 3.</emphasis> Store the new state which is the encrypted block. <systemitem>VecStore64x2</systemitem> swaps endianness as required.</para>
    <programlisting><?code-font-size 75% ?>VecStore64x2(s, output);</programlisting>
    <para>The AES-128 code shown above demonstrates a GCC code path using the 64x2 arrangement. Below is the IBM XL C/C++ code path using an 8x16 arrangement. In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__vcipher</primary></indexterm>
      <indexterm><primary>__vcipherlast</primary></indexterm></para>
    <programlisting><?code-font-size 75% ?>uint8x16_p s = VecLoad8x16(input);
uint8x16_p k = VecLoad8x16(key);
s = VecXor(s, k);

k = VecLoad8x16(subkeys[0]);
s = VecEncrypt(s, k);

k = VecLoad8x16(subkeys[1]);
s = VecEncrypt(s, k);

k = VecLoad8x16(subkeys[2]);
s = VecEncrypt(s, k);

...

k = VecLoad8x16(subkeys[7]);
s = VecEncrypt(s, k);

k = VecLoad8x16(subkeys[8]);
s = VecEncrypt(s, k);

k = VecLoad8x16(subkeys[9]);
s = VecEncryptLast(s, k);

VecStore8x16(s, output);</programlisting>
  </section>
  <section id="aes_decryption" xreflabel="AES Decryption">
    <title>Decryption</title>
    <indexterm>
      <primary>AES</primary>
      <secondary>Decryption</secondary>
    </indexterm>
    <para>AES decryption is the reverse operation of AES encryption. There are three parts as with AES decryption. First, the encrypted message is loaded into a state buffer. The message is endian swapped as required. The second part loads a subkey and applies the AES inverse round function. The second part is repeated a required number of times. For example, AES with 128-bit key applies the inverse round function 10 times. The third part stores the result of decrypting the state, which is the decrypted block. The decrypted message is endian swapped as required.</para>
    <para>AES decryption has two minor differences from the encryption algorithm. First the round keys are iterated in reverse order. Second, the user or master key is used last instead of first.</para>
    <para>The code below demonstrates AES-128 using the GCC code path. GCC uses the 64x2 arrangement. In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__builtin_crypto_vncipher</primary></indexterm>
      <indexterm><primary>__builtin_crypto_vncipherlast</primary></indexterm></para>
    <programlisting><?code-font-size 75% ?>uint64x2_p s = VecLoad64x2(input);
uint64x2_p k = VecLoad64x2(subkeys[9]);
s = VecXor(s, k);

k = VecLoad64x2(subkeys[8]);
s = VecDecrypt(s, k);

k = VecLoad64x2(subkeys[7]);
s = VecDecrypt(s, k);

k = VecLoad64x2(subkeys[6]);
s = VecDecrypt(s, k);

...

k = VecLoad64x2(subkeys[1]);
s = VecDecrypt(s, k);

k = VecLoad64x2(subkeys[0]);
s = VecDecrypt(s, k);

k = VecLoad64x2(key);
s = VecDecryptLast(s, k);

VecStore8x16(s, output);</programlisting>
    <para>As with AES encryption there is a different code path for IBM XL C/C++ using the 8x16 datatypes. The code below shows XL C/C++ decryption using the 8x16 datatype. In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__vncipher</primary></indexterm>
      <indexterm><primary>__vncipherlast</primary></indexterm></para>
    <programlisting><?code-font-size 75% ?>uint8x16_p s = VecLoad8x16(input);
uint8x16_p k = VecLoad8x16(subkeys[9]);
s = VecXor(s, k);

k = VecLoad8x16(subkeys[8]);
s = VecDecrypt(s, k);

k = VecLoad8x16(subkeys[7]);
s = VecDecrypt(s, k);

k = VecLoad8x16(subkeys[6]);
s = VecDecrypt(s, k);

...

k = VecLoad8x16(subkeys[1]);
s = VecDecrypt(s, k);

k = VecLoad8x16(subkeys[0]);
s = VecDecrypt(s, k);

k = VecLoad8x16(key);
s = VecDecryptLast(s, k);

VecStore8x16(s, output);</programlisting>
  </section>
  <section id="aes_performance" xreflabel="AES Performance">
    <title>Performance</title>
    <indexterm>
      <primary>AES</primary>
      <secondary>Performance</secondary>
    </indexterm>
    <para>The code in <xref linkend="aes_encryption"/> and <xref linkend="aes_decryption"/> provides the basic AES algorithms. They will perform well when compared to C/C++ but there is room for improvement. You can improve the code to run closer to 1 to 2 cycle-per-byte (cpb) by processing multiple blocks at a time.</para>
    <para>Experimentation shows 6 or 8 blocks at a time is a good place to be. Crypto++ processes 6 blocks at a time while Botan processes 8 blocks at a time. The Linux kernel processes 12 blocks at a time for some POWER8 algorithms.</para>
    <para>The code below processes 16*8 or 128-bytes of data at a time using the GCC code path. The IBM code path would be similar.</para>
    <para>In the code below remember that <systemitem>subkeys</systemitem> is <systemitem>subkeys[10][16]</systemitem>. The expression <systemitem>subkeys[i]</systemitem> is a byte pointer and indexes into the i-th 16-byte round key.<indexterm><primary>__builtin_crypto_vcipher</primary></indexterm>
      <indexterm><primary>__builtin_crypto_vcipherlast</primary></indexterm></para>
    <programlisting><?code-font-size 75% ?>uint64x2_p k = VecLoad64x2(key);

uint64x2_p s0 = VecLoad64x2(input+0);
uint64x2_p s1 = VecLoad64x2(input+16);
uint64x2_p s2 = VecLoad64x2(input+32);
uint64x2_p s3 = VecLoad64x2(input+48);
uint64x2_p s4 = VecLoad64x2(input+64);
uint64x2_p s5 = VecLoad64x2(input+80);
uint64x2_p s6 = VecLoad64x2(input+96);
uint64x2_p s7 = VecLoad64x2(input+112);

s0 = VecXor(s0, k);
s1 = VecXor(s1, k);
s2 = VecXor(s2, k);
s3 = VecXor(s3, k);
s4 = VecXor(s4, k);
s5 = VecXor(s5, k);
s6 = VecXor(s6, k);
s7 = VecXor(s7, k);

for (size_t i=0; i&lt;rounds-1; ++i)
{
    k = VecLoad64x2(subkeys[i]);
    s0 = VecEncrypt(s0, k);
    s1 = VecEncrypt(s1, k);
    s2 = VecEncrypt(s2, k);
    s3 = VecEncrypt(s3, k);
    s4 = VecEncrypt(s4, k);
    s5 = VecEncrypt(s5, k);
    s6 = VecEncrypt(s6, k);
    s7 = VecEncrypt(s7, k);
}

k = VecLoad64x2(subkeys[rounds-1]);
s0 = VecEncryptLast(s0, k);
s1 = VecEncryptLast(s1, k);
s2 = VecEncryptLast(s2, k);
s3 = VecEncryptLast(s3, k);
s4 = VecEncryptLast(s4, k);
s5 = VecEncryptLast(s5, k);
s6 = VecEncryptLast(s6, k);
s7 = VecEncryptLast(s7, k);

VecStore64x2(s0, output+0);
VecStore64x2(s1, output+16);
VecStore64x2(s2, output+32);
VecStore64x2(s3, output+48);
VecStore64x2(s4, output+64);
VecStore64x2(s5, output+80);
VecStore64x2(s6, output+96);
VecStore64x2(s7, output+112);</programlisting>
  </section>
</chapter>