Skip to content

Latest commit

 

History

History
36 lines (28 loc) · 2.81 KB

README.md

File metadata and controls

36 lines (28 loc) · 2.81 KB

ProtoNTT

ProtoNTT is a basic implementation of the Small Primes Number-Theoretic Transform (NTT) algorithm for multiplication of large integers.

ProtoNTT is an early prototype for the Small Primes NTT that was added to y-cruncher v0.6.8. This prototype has most of the low-effort/high-payoff optimizations. But it lacks all the difficult memory optimizations that are needed to make it viable in a serious bignum application. Nevertheless, ProtoNTT is still "reasonably" efficient and is open-sourced here for educational purposes.

Build Instructions/Requirements:

  • Hardware: x64 is required. The program will not compile for 32-bit.
  • Compiler:
    • Windows: Visual Studio 2013 update 2 or later. (update 2 is needed for the add-with-carry intrinsics)
    • Linux: GCC 4.8 or later.

A Visual Studio project is included. The Linux version can be built by simply running compile-gcc.sh.

Comparison with y-cruncher v0.6.8's NTT:

Feature(s) ProtoNTT y-cruncher v0.6.8
Prime Size 63-bit 63-bit
# of Primes 3, 5, 7, or 9 3, 4, 5, 6, 7, 8, or 9
Transform Lengths 2k, 3·2k, 5·2k, 7·2k 2k, 3·2k, 5·2k, 7·2k
Maximum Convolution Length ~10^18 bits ~10^18 bits
Parallelization No Yes
Swap Mode (Out-of-core Computation) No Yes
Supported Architectures x64 Any 32-bit or 64-bit
Processor-Specific Optimizations x64 x86, x64, SSE4.2, XOP, AVX2, AVX512-(F/DQ)
Butterfly Radix Radix 2 Generalized Bailey's 4-step
Twiddle Factor Table Size O(N) O(1)
Lines of Code 3,557 36,224 (excluding dependencies)

The only thing that prevents ProtoNTT from being a viable algorithm is that:

  • It uses much memory when all twiddle factors are precomputed. (4x the transform size)
  • It is too slow when no twiddle factors are precomputed. (3x slower)

The space-time trade-off curve for precomputing a subset of twiddle factors is particularly unfavorable for ProtoNTT. The cuts needed to reduce the table down to a reasonable size will easily slap on a performance penalty of 50% or more to an already slow algorithm.