- Code is highly optimized in terms of work, data locality and parallelism.
- Automatic code generation is used at library installation time.
- Example: one-dimensional FFT

- Functions provide a variety of algorithmic solutions for a given operation.
- The user can choose between algorithms minimizing work or algorithms maximizing parallelism.
- Example: Schönaghe-Strassen, divide-and-conquer, k-way Toom-Cook and the two-convolution method for integer polynomial multiplication

- Functions combine several Level 2 algorithms for achieving a given task.
- This leads to adaptive algorithms that select appropriate Level 2 functions depending on available resources (number of cores, input data size).
- Example: Taylor shift operation

- The BPAS classes Integer and RationalNumber are BPAS wrappers for GMP's mpz and mpq classes.
- Many other classes are provided, like Intervals, RationalRegularChain, ...

- Assuming an ideal cache of Z words with L words per cache line; n be the input vector size
- Traditional 2-way FFT proceeds on the FFT graph row-by-row (left picture above)
and a cache complexity in Ω(n / L (log
_{2}(n) - log_{2}(Z))), which is not optimal. - By proceeding in a block-wise manner (picture on the right),
cache complexity becomes
Ο(n / L (log
_{2}(n) / log_{2}(Z))), which is optimal. - BPAS 1-D FFT code is automatically generated by a configurable Python script implementing cache-optimal FFTs.
- In addition to the optimal blocking strategy, instruction level parallelism (ILP) is carefully considered: vectorized instructions are explicitly used (SSE2, SSE4) and instruction pipeline usage is highly optimized.
- Experimental results: Modpn (serial) vs BPAS (serial)

- An adaptive algorithm based on the input size and available cores
- Very small: Plain multiplication
- Small or Single-core: Schönaghe-Strassen algorithm via Kronecker's substitution
- Big but a few cores: Parallel 4-way Toom-Cook
- Big: Parallel 8-way Toom-Cook
- Very big: Parallel two-convolution method
- The two-convolution method

Input: a(y), b(y) ∈ Ζ[y] of degree d with coefficients of bit-size N = K M

Output: the product a(y) b(y)

Notations:

- F(e,s) is an upper bound of the number of bot operations
for computing the FFT of a vector of size s
over a prime field Ζ/p with e ∈ Θ(log
_{2}(p)). - If p is a Fourier prime, we have
F(e,s) ∈ Θ(s log
_{2}(s) e log_{2}(e) log_{2}log_{2}(e)). - If e ∈ Θ(log
_{2}(s)) and p is a generalized Fermat prime, then F(e,s) ∈ Θ(s log_{2}(s) e log_{2}log_{2}(e)). - K ∈ Θ(d)
- M ∈ Θ(log
_{2}(d)) - if p is a Fourier prime, then Ο(d
^{2}log_{2}^{2}(d) log_{2}log_{2}log_{2}(d)) (implemented and available) - if p is a generalized Fermat prime, then
Ο(d
^{2}log_{2}(d) log_{2}log_{2}log_{2}(d)) (work in progress) - in comparison, Schönaghe-Strassen gives here Ο(d
^{2}log_{2}^{}(d) log_{2}log_{2}(d)) - Experimental results are on a 48-core AMD Opteron 6168 node:
BPAS (parallel) vs FLINT (serial) vs Maple 2015 (parallel)

Our code relies only and directly on machine word operations (addition, shift and multiplication).

1) Exact integration followed by a numerical approximation of the integral:

- Hermite reduction (rational part) and Lazard-Rioboo-Trager algorithm (transcendental part) to integrate;
- Multiprecision real-root isolation on factors of the resultant using MPSolve;
- Symbolic post-processing to compute arctangent terms.
- Comparison between 53-bit and 106-bit precisions

2) Multiprecision partial fraction decomposition (PFD) followed by exact integration (proposed by R. Fateman):

- Multiprecision real-root isolation for denominator using MPSolve;
- Symbolic computation of partial fraction decomposition of reconstructed denominator;
- Term-wise integration of rational terms and linear and quadratic (denominator) transcendental terms.
- Comparison between 53-bit and 106-bit precisions:

- Experimental results: Running time (in sec.) on a 48-core AMD Opteron 6168 node for BPAS (parallel) vs CMY (parallel) vs Maple 18 realroot (serial)

- Experimental results: Running time (in sec.) on a 12-core Intel Xeon 5650
node for BPAS (parallel) vs. Maple 17 RealRootIsolate (serial) vs.
C (with Maple 17 interface) Isolate (serial)

ORCCA Lab,
Department of Computer Science,
The University of Western Ontario,
London, Ontario, Canada N6A 5B7