ch05.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="ch05" xreflabel="SHA">
  <title>Secure Hash Standard</title>
  <indexterm>
    <primary>SHA</primary>
  </indexterm>
  <para>SHA is the Secure Hash Standard. SHA is specified in <ulink url="https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.180-4.pdf">FIPS 180-4, Secure Hash Standard (SHS)</ulink>. You should read the standard if you are not familiar with the hash family.</para>
  <para>The code below is available online at <ulink url="https://github.com/noloader/SHA-Intrinsics">SHA Intrinsics</ulink>. The GitHub provides accelerated SHA for Intel, ARMv8 and POWER8.</para>
  <section id="sha_strategy" xreflabel="SHA Strategy">
    <title>Strategy</title>
    <para>SHA provides a lot of freedom to an implementation. You can approach your SHA implementation in several ways, but most of them will result in an under-performing SHA. This section provides one of the strategies for a better performing implementation.</para>
    <para>The first design element is to perform everything in vector registers. The only integer operations should be reading 2 longs or 4 integers from memory during a load, and writing 2 longs or 4 integers after the round during a store.</para>
    <para>Second, when you need an integer for a calculation you will shift it out from a vector register to another vector register using <systemitem>vec_sld</systemitem>. Most of the time you only care about element 0 in a vector register, and the remainder of elements are "don't care" elements.</para>
    <para>Third, don't maintain a full <systemitem>W[64]</systemitem> or <systemitem>W[80]</systemitem> table. Use <systemitem>X[16]</systemitem> instead, and transform each element in-place using a rolling strategy.</para>
    <para>Fourth, the eight working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem> each get their own vector register. The one you care about is located at element 0, the remainder of the elements in the vector are "don't care" elements.</para>
    <para>It does not matter if you rotate the working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem> in the caller or in the callee. Both designs have nearly the same performance characteristics.</para>
    <para>Since you are operating on <systemitem>X[16]</systemitem> in a rolling fashion instead of <systemitem>W[64]</systemitem> or <systemitem>W[80]</systemitem> the main body of your compression function will look similar to below.</para>
    <programlisting><?code-font-size 75% ?>// SHA-256 partial compression function
uint32x4_p X[16];
...

for (i = 16; i &lt; 64; i++)
{
   uint32x4_p s0, s1, T0, T1;

   s0 = sigma0(X[(i + 1) &amp; 0x0f]);
   s1 = sigma1(X[(i + 14) &amp; 0x0f]);

   T1 = (X[i &amp; 0xf] += s0 + s1 + X[(i + 9) &amp; 0xf]);
   T1 += h + Sigma1(e) + Ch(e, f, g) + KEY[i];
   T2 = Sigma0(a) + Maj(a, b, c);
   ...
}
</programlisting>
  </section>
  <section id="sha_ch" xreflabel="Ch Function">
    <title>Ch function</title>
    <para><indexterm><primary>SHA</primary><secondary>Ch function</secondary></indexterm>The SHA <systemitem>Ch</systemitem> function is implemented in POWER systems using the <systemitem>vsel</systemitem> instruction or the <systemitem>vec_sel</systemitem> builtin. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes <systemitem>uint64x2_p</systemitem> arguments. The important piece of information is <systemitem>x</systemitem> used as the selector.</para>
    <programlisting><?code-font-size 75% ?>uint32x4_p
VecCh(uint32x4_p x, uint32x4_p y, uint32x4_p z)
{
    return vec_sel(z, y, x);
}</programlisting>
  </section>
  <section id="sha_maj" xreflabel="Maj Function">
    <title>Maj function</title>
    <para><indexterm><primary>SHA</primary><secondary>Maj function</secondary></indexterm>The SHA <systemitem>Maj</systemitem> function is implemented in POWER systems using the <systemitem>vsel</systemitem> instruction or the <systemitem>vec_sel</systemitem> builtin. The implementation for the 32x4 arrangement is shown below. The code is the same for the 64x2 arrangement, but the function takes <systemitem>uint64x2_p</systemitem> arguments. The important piece of information is <systemitem>x^y</systemitem> used as the selector.</para>
    <programlisting><?code-font-size 75% ?>uint32x4_p
VecCh(uint32x4_p x, uint32x4_p y, uint32x4_p z)
{
    return vec_sel(y, z, vec_xor(x, y));
}</programlisting>
  </section>
  <section id="sha_insn" xreflabel="Signma Functions">
    <title>Sigma functions</title>
    <para><indexterm><primary>SHA</primary><secondary>Sigma functions</secondary></indexterm>POWER8 provides the <systemitem>vshasigmaw</systemitem> and <systemitem>vshasigmad</systemitem> instructions to accelerate SHA calculations for 32-bit and 64-bit words, respectively. The instructions take two integer arguments and the constants are used to select among <systemitem>Sigma0</systemitem>, <systemitem>Sigma1</systemitem>, <systemitem>sigma0</systemitem> and <systemitem>sigma1</systemitem>.</para>
    <para><indexterm><primary>__builtin_crypto_vshasigmaw</primary></indexterm><indexterm><primary>__builtin_crypto_vshasigmad</primary></indexterm><indexterm><primary>__vshasigmaw</primary></indexterm><indexterm><primary>__vshasigmad</primary></indexterm>The builtin GCC functions for the instructions are <systemitem>__builtin_crypto_vshasigmaw</systemitem> and <systemitem>__builtin_crypto_vshasigmad</systemitem>. The XLC functions for the instructions are <systemitem>__vshasigmaw</systemitem> and <systemitem>__vshasigmad</systemitem>. The C/C++ wrapper for the SHA-256 functions should look similar to below.</para>
    <programlisting><?code-font-size 75% ?>uint32x4_p Vec_sigma0(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 0, 0);
#else
    return __builtin_crypto_vshasigmaw(val, 0, 0);
#endif
}

uint32x4_p Vec_sigma1(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 0, 0xf);
#else
    return __builtin_crypto_vshasigmaw(val, 0, 0xf);
#endif
}

uint32x4_p VecSigma0(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 1, 0);
#else
    return __builtin_crypto_vshasigmaw(val, 1, 0);
#endif
}

uint32x4_p VecSigma1(const uint32x4_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmaw(val, 1, 0xf);
#else
    return __builtin_crypto_vshasigmaw(val, 1, 0xf);
#endif
}</programlisting>
  </section>
  <section id="sha_sha256" xreflabel="SHA256">
    <title>SHA-256</title>
    <para><indexterm><primary>SHA</primary><secondary>SHA-256</secondary></indexterm>The SHA-256 implementation has four parts. The first part is loads the existing state and creates working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>. The second part loads the message and performs the first 16 rounds. The third part performs the remaining rounds. The final part stores the new state.</para>
    <para><emphasis role="bold">Part 1.</emphasis> Load the existing state and create working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>.</para>
    <programlisting><?code-font-size 75% ?>uint32x4_p abcd = VecLoad32x4u(state+0, 0);
uint32x4_p efgh = VecLoad32x4u(state+4, 0);

enum {A=0,B=1,C,D,E,F,G,H};
uint32x4_p X[16], S[8];

S[A] = abcd; S[E] = efgh;
S[B] = VecShiftLeft&lt;4&gt;(S[A]);
S[F] = VecShiftLeft&lt;4&gt;(S[E]);
S[C] = VecShiftLeft&lt;4&gt;(S[B]);
S[G] = VecShiftLeft&lt;4&gt;(S[F]);
S[D] = VecShiftLeft&lt;4&gt;(S[C]);
S[H] = VecShiftLeft&lt;4&gt;(S[G]);
</programlisting>
    <para><emphasis role="bold">Part 2.</emphasis> Load the message and perform the first 16 rounds.</para>
    <programlisting><?code-font-size 75% ?>const uint32_t* k = reinterpret_cast&lt;const uint32_t*&gt;(KEY256);
const uint32_t* m = reinterpret_cast&lt;const uint32_t*&gt;(data);

uint32x4_p vm, vk;
unsigned int i, offset=0;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1&lt;0&gt;(X,S, vk,vm);
SHA256_ROUND1&lt;1&gt;(X,S, VecShiftLeft&lt;4&gt;(vk), VecShiftLeft&lt;4&gt;(vm));
SHA256_ROUND1&lt;2&gt;(X,S, VecShiftLeft&lt;8&gt;(vk), VecShiftLeft&lt;8&gt;(vm));
SHA256_ROUND1&lt;3&gt;(X,S, VecShiftLeft&lt;12&gt;(vk), VecShiftLeft&lt;12&gt;(vm));
offset+=16;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1&lt;4&gt;(X,S, vk,vm);
SHA256_ROUND1&lt;5&gt;(X,S, VecShiftLeft&lt;4&gt;(vk), VecShiftLeft&lt;4&gt;(vm));
SHA256_ROUND1&lt;6&gt;(X,S, VecShiftLeft&lt;8&gt;(vk), VecShiftLeft&lt;8&gt;(vm));
SHA256_ROUND1&lt;7&gt;(X,S, VecShiftLeft&lt;12&gt;(vk), VecShiftLeft&lt;12&gt;(vm));
offset+=16;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1&lt;8&gt;(X,S, vk,vm);
SHA256_ROUND1&lt;9&gt;(X,S, VecShiftLeft&lt;4&gt;(vk), VecShiftLeft&lt;4&gt;(vm));
SHA256_ROUND1&lt;10&gt;(X,S, VecShiftLeft&lt;8&gt;(vk), VecShiftLeft&lt;8&gt;(vm));
SHA256_ROUND1&lt;11&gt;(X,S, VecShiftLeft&lt;12&gt;(vk), VecShiftLeft&lt;12&gt;(vm));
offset+=16;

vk = VecLoad32x4(k, offset);
vm = VecLoadMsg32x4(m, offset);
SHA256_ROUND1&lt;12&gt;(X,S, vk,vm);
SHA256_ROUND1&lt;13&gt;(X,S, VecShiftLeft&lt;4&gt;(vk), VecShiftLeft&lt;4&gt;(vm));
SHA256_ROUND1&lt;14&gt;(X,S, VecShiftLeft&lt;8&gt;(vk), VecShiftLeft&lt;8&gt;(vm));
SHA256_ROUND1&lt;15&gt;(X,S, VecShiftLeft&lt;12&gt;(vk), VecShiftLeft&lt;12&gt;(vm));
offset+=16;
</programlisting>
    <para><emphasis role="bold">Part 3.</emphasis> Perform the remaining rounds.
    </para>
    <programlisting><?code-font-size 75% ?>for (i=16; i&lt;64; i+=16)
{
    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2&lt;0&gt;(X,S, vk);
    SHA256_ROUND2&lt;1&gt;(X,S, VecShiftLeft&lt;4&gt;(vk));
    SHA256_ROUND2&lt;2&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    SHA256_ROUND2&lt;3&gt;(X,S, VecShiftLeft&lt;12&gt;(vk));
    offset+=16;

    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2&lt;4&gt;(X,S, vk);
    SHA256_ROUND2&lt;5&gt;(X,S, VecShiftLeft&lt;4&gt;(vk));
    SHA256_ROUND2&lt;6&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    SHA256_ROUND2&lt;7&gt;(X,S, VecShiftLeft&lt;12&gt;(vk));
    offset+=16;

    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2&lt;8&gt;(X,S, vk);
    SHA256_ROUND2&lt;9&gt;(X,S, VecShiftLeft&lt;4&gt;(vk));
    SHA256_ROUND2&lt;10&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    SHA256_ROUND2&lt;11&gt;(X,S, VecShiftLeft&lt;12&gt;(vk));
    offset+=16;

    vk = VecLoad32x4(k, offset);
    SHA256_ROUND2&lt;12&gt;(X,S, vk);
    SHA256_ROUND2&lt;13&gt;(X,S, VecShiftLeft&lt;4&gt;(vk));
    SHA256_ROUND2&lt;14&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    SHA256_ROUND2&lt;15&gt;(X,S, VecShiftLeft&lt;12&gt;(vk));
    offset+=16;
}
</programlisting>
    <para><emphasis role="bold">Part 4.</emphasis> Repack and store the new state.</para>
    <programlisting><?code-font-size 75% ?>abcd += VecPack(S[A],S[B],S[C],S[D]);
efgh += VecPack(S[E],S[F],S[G],S[H]);

VecStore32x4u(abcd, state+0, 0);
VecStore32x4u(efgh, state+4, 0);
</programlisting>
    <para><emphasis role="bold">VecLoadMsg32x4.</emphasis> Perform an endian-aware load of a user message into a word.
    </para>
    <programlisting><?code-font-size 75% ?>template &lt;class T&gt;
uint32x4_p VecLoadMsg32x4(const T* data, int offset)
{
#if __LITTLE_ENDIAN__
    uint8x16_p mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    uint32x4_p r = VecLoad32x4u(data, offset);
    return (uint32x4_p)vec_perm(r, r, mask);
#else
    return VecLoad32x4u(data, offset);
#endif
}
</programlisting>
    <para><emphasis role="bold">SHA256_ROUND1.</emphasis> Mix state with a round key and user message.
    </para>
    <programlisting><?code-font-size 75% ?>template &lt;unsigned int R&gt;
void SHA256_ROUND1(uint32x4_p X[16], uint32x4_p S[8],
                   const uint32x4_p K, const uint32x4_p M)
{
    uint32x4_p T1, T2;

    X[R] = M;
    T1 = S[H] + VecSigma1(S[E]);
    T1 += VecCh(S[E],S[F],S[G]) + K + M;
    T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
</programlisting>
    <para><emphasis role="bold">SHA256_ROUND2.</emphasis> Mix state with a round key.
    </para>
    <programlisting><?code-font-size 75% ?>template &lt;unsigned int R&gt;
void SHA256_ROUND2(uint32x4_p X[16], uint32x4_p S[8],
                   const uint32x4_p K)
{
    // Indexes into the X[] array
    enum {IDX0=(R+0)&amp;0xf, IDX1=(R+1)&amp;0xf,
          IDX9=(R+9)&amp;0xf, IDX14=(R+14)&amp;0xf};

    const uint32x4_p s0 = Vec_sigma0(X[IDX1]);
    const uint32x4_p s1 = Vec_sigma1(X[IDX14]);

    uint32x4_p T1 = (X[IDX0] += s0 + s1 + X[IDX9]);
    T1 += S[H] + VecSigma1(S[E]) + VecCh(S[E],S[F],S[G]) + K;
    uint32x4_p T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
</programlisting>
    <para><emphasis role="bold">VecPack.</emphasis> Repack working variables.
    </para>
    <programlisting><?code-font-size 75% ?>uint32x4_p VecPack(const uint32x4_p a, const uint32x4_p b,
                    const uint32x4_p c, const uint32x4_p d)
{
    uint8x16_p m1 = {0,1,2,3, 16,17,18,19, 0,0,0,0, 0,0,0,0};
    uint8x16_p m2 = {0,1,2,3, 4,5,6,7, 16,17,18,19, 20,21,22,23};
    return vec_perm(vec_perm(a,b,m1), vec_perm(c,d,m1), m2);
}
</programlisting>
  </section>
  <section id="sha_sha512" xreflabel="SHA512">
    <title>SHA-512</title>
    <para><indexterm><primary>SHA</primary><secondary>SHA-512</secondary></indexterm>The SHA-512 implementation is like SHA-256 and has four parts. The first part is loads the existing state and creates working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>. The second part loads the message and performs the first 16 rounds. The third part performs the remaining rounds. The final part stores the new state.</para>
    <para><emphasis role="bold">Part 1.</emphasis> Load the existing state and create working variables <systemitem>{A,B,C,D,E,F,G,H}</systemitem>.</para>
    <programlisting><?code-font-size 75% ?>uint64x2_p ab = VecLoad64x2u(state+0, 0);
uint64x2_p cd = VecLoad64x2u(state+2, 0);
uint64x2_p ef = VecLoad64x2u(state+4, 0);
uint64x2_p gh = VecLoad64x2u(state+6, 0);

// Indexes into the S[] array
enum {A=0, B=1, C, D, E, F, G, H};
uint64x2_p X[16], S[8];

S[A] = ab; S[C] = cd;
S[E] = ef; S[G] = gh;
S[B] = VecShiftLeft&lt;8&gt;(S[A]);
S[D] = VecShiftLeft&lt;8&gt;(S[C]);
S[F] = VecShiftLeft&lt;8&gt;(S[E]);
S[H] = VecShiftLeft&lt;8&gt;(S[G]);
</programlisting>
    <para><emphasis role="bold">Part 2.</emphasis> Load the message and perform the first 16 rounds.</para>
    <programlisting><?code-font-size 75% ?>const uint64_t* k = reinterpret_cast&lt;const uint64_t*&gt;(KEY512);
const uint64_t* m = reinterpret_cast&lt;const uint64_t*&gt;(data);

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;0&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;1&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;2&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;3&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;4&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;5&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;6&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;7&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;8&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;9&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;10&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;11&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;12&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;13&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;

vk = VecLoad64x2(k, offset);
vm = VecLoadMsg64x2(m, offset);
SHA512_ROUND1&lt;14&gt;(X,S, vk,vm);
SHA512_ROUND1&lt;15&gt;(X,S, VecShiftLeft&lt;8&gt;(vk),VecShiftLeft&lt;8&gt;(vm));
offset+=16;
</programlisting>
    <para><emphasis role="bold">Part 3.</emphasis> Perform the remaining rounds.
    </para>
    <programlisting><?code-font-size 75% ?>for (i=16; i&lt;80; i+=16)
{
    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;0&gt;(X,S, vk);
    SHA512_ROUND2&lt;1&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;2&gt;(X,S, vk);
    SHA512_ROUND2&lt;3&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;4&gt;(X,S, vk);
    SHA512_ROUND2&lt;5&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;6&gt;(X,S, vk);
    SHA512_ROUND2&lt;7&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;8&gt;(X,S, vk);
    SHA512_ROUND2&lt;9&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;10&gt;(X,S, vk);
    SHA512_ROUND2&lt;11&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;12&gt;(X,S, vk);
    SHA512_ROUND2&lt;13&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;

    vk = VecLoad64x2(k, offset);
    SHA512_ROUND2&lt;14&gt;(X,S, vk);
    SHA512_ROUND2&lt;15&gt;(X,S, VecShiftLeft&lt;8&gt;(vk));
    offset+=16;
}
</programlisting>
    <para><emphasis role="bold">Part 4.</emphasis> Repack and store the new state.
    </para>
    <programlisting><?code-font-size 75% ?>ab += VecPack(S[A],S[B]);
cd += VecPack(S[C],S[D]);
ef += VecPack(S[E],S[F]);
gh += VecPack(S[G],S[H]);

VecStore64x2u(ab, state+0, 0);
VecStore64x2u(cd, state+2, 0);
VecStore64x2u(ef, state+4, 0);
VecStore64x2u(gh, state+6, 0);
</programlisting>
    <para><emphasis role="bold">VecLoadMsg64x2.</emphasis> Perform an endian-aware load of a user message into a word.
    </para>
    <programlisting><?code-font-size 75% ?>template &lt;class T&gt;
uint32x4_p VecLoadMsg64x2(const T* data, int offset)
{
#if __LITTLE_ENDIAN__
    uint8x16_p mask = {7,6,5,4, 3,2,1,0, 15,14,13,12, 11,10,9,8};
    uint64x2_p r = VecLoad64x2u(data, offset);
    return (uint64x2_p)vec_perm(r, r, mask);
#else
    return VecLoad64x2u(data, offset);
#endif
}
</programlisting>
    <para><emphasis role="bold">SHA512_ROUND1.</emphasis> Mix state with a round key and user message.
    </para>
    <programlisting><?code-font-size 75% ?>template &lt;unsigned int R&gt;
void SHA512_ROUND1(uint64x2_p X[16], uint64x2_p S[8],
                   const uint64x2_p K, const uint64x2_p M)
{
    uint64x2_p T1, T2;

    X[R] = M;
    T1 = S[H] + VecSigma1(S[E]);
    T1 += VecCh(S[E],S[F],S[G]) + K + M;
    T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
</programlisting>
    <para><emphasis role="bold">SHA512_ROUND2.</emphasis> Mix state with a round key.
    </para>
    <programlisting><?code-font-size 75% ?>template &lt;unsigned int R&gt;
void SHA512_ROUND2(uint64x2_p X[16], uint64x2_p S[8],
                   const uint64x2_p K)
{
    // Indexes into the X[] array
    enum {IDX0=(R+0)&amp;0xf, IDX1=(R+1)&amp;0xf,
          IDX9=(R+9)&amp;0xf, IDX14=(R+14)&amp;0xf};

    const uint64x2_p s0 = Vec_sigma0(X[IDX1]);
    const uint64x2_p s1 = Vec_sigma1(X[IDX14]);

    uint64x2_p T1 = (X[IDX0] += s0 + s1 + X[IDX9]);
    T1 += S[H] + VecSigma1(S[E]) + VecCh(S[E],S[F],S[G]) + K;
    uint64x2_p T2 = VecSigma0(S[A]) + VecMaj(S[A],S[B],S[C]);

    S[H] = S[G]; S[G] = S[F]; S[F] = S[E];
    S[E] = S[D] + T1;
    S[D] = S[C]; S[C] = S[B]; S[B] = S[A];
    S[A] = T1 + T2;
}
</programlisting>
    <para><emphasis role="bold">VecPack.</emphasis> Repack working variables.
    </para>
    <programlisting><?code-font-size 75% ?>uint64x2_p VecPack(const uint64x2_p x, const uint64x2_p y)
{
    const uint8x16_p m = {0,1,2,3, 4,5,6,7, 16,17,18,19, 20,21,22,23};
    return vec_perm(x,y,m);
}
</programlisting>
    <para>The SHA-512 implementation uses the same functions as SHA-256, but SHA-512 uses a 64x2 arrangement rather than the 32x4 arrangement. You should copy/paste/replace as required for SHA-512. For example, below is the SHA <systemitem>Ch</systemitem> for the 64x2 arrangement.</para>
    <programlisting><?code-font-size 75% ?>uint64x2_p
VecCh(uint64x2_p x, uint64x2_p y, uint64x2_p z)
{
    return vec_sel(z,y,x);
}
</programlisting>
    <para>In fact, since this is C++ code, a template function works nicely. The language will use the template to instantiate <systemitem>VecCh</systemitem> using both <systemitem>uint32x4_p</systemitem> and <systemitem>uint64x2_p</systemitem>.</para>
    <programlisting><?code-font-size 75% ?>template &lt;class T&gt;
T VecCh(T x, T y, T z)
{
    return vec_sel(z,y,x);
}
</programlisting>
    <para>Templates do not work the Sigma functions and you will have to supply C++ overloaded functions as shown below.</para>
    <programlisting><?code-font-size 75% ?>uint64x2_p Vec_sigma0(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 0, 0);
#else
    return __builtin_crypto_vshasigmad(val, 0, 0);
#endif
}

uint64x2_p Vec_sigma1(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 0, 0xf);
#else
    return __builtin_crypto_vshasigmad(val, 0, 0xf);
#endif
}

uint64x2_p VecSigma0(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 1, 0);
#else
    return __builtin_crypto_vshasigmad(val, 1, 0);
#endif
}

uint64x2_p VecSigma1(const uint64x2_p val)
{
#if defined(__xlc__) || defined(__xlC__)
    return __vshasigmad(val, 1, 0xf);
#else
    return __builtin_crypto_vshasigmad(val, 1, 0xf);
#endif
}
</programlisting>
  </section>
</chapter>