BPAS - Features

Targeted functionalities

Functions provide a variety of algorithmic solutions for a given operation.
The user can choose between algorithms minimizing work or algorithms maximizing parallelism.
Example: Schönaghe-Strassen, divide-and-conquer, k-way Toom-Cook and the two-convolution method for integer polynomial multiplication

Functions combine several Level 2 algorithms for achieving a given task.
This leads to adaptive algorithms that select appropriate Level 2 functions depending on available resources (number of cores, input data size).
Example: Taylor shift operation

The BPAS classes Integer and RationalNumber are BPAS wrappers for GMP's mpz and mpq classes.
Many other classes are provided, like Intervals, RationalRegularChain, ...

Assuming an ideal cache of Z words with L words per cache line; n be the input vector size
Traditional 2-way FFT proceeds on the FFT graph row-by-row (left picture above) and a cache complexity in Ω(n / L (log₂(n) - log₂(Z))), which is not optimal.
By proceeding in a block-wise manner (picture on the right), cache complexity becomes Ο(n / L (log₂(n) / log₂(Z))), which is optimal.
BPAS 1-D FFT code is automatically generated by a configurable Python script implementing cache-optimal FFTs.
In addition to the optimal blocking strategy, instruction level parallelism (ILP) is carefully considered: vectorized instructions are explicitly used (SSE2, SSE4) and instruction pipeline usage is highly optimized.
Experimental results: Modpn (serial) vs BPAS (serial)

The two-convolution method
Input: a(y), b(y) ∈ Ζ[y] of degree d with coefficients of bit-size N = K M
Output: the product a(y) b(y)

Notations:

F(e,s) is an upper bound of the number of bot operations for computing the FFT of a vector of size s over a prime field Ζ/p with e ∈ Θ(log₂(p)).
If p is a Fourier prime, we have F(e,s) ∈ Θ(s log₂(s) e log₂(e) log₂log₂(e)).
If e ∈ Θ(log₂(s)) and p is a generalized Fermat prime, then F(e,s) ∈ Θ(s log₂(s) e log₂log₂(e)).

if p is a Fourier prime, then Ο(d² log₂²(d) log₂log₂log₂(d)) (implemented and available)
if p is a generalized Fermat prime, then Ο(d² log₂(d) log₂log₂log₂(d)) (work in progress)
in comparison, Schönaghe-Strassen gives here Ο(d² log₂(d) log₂log₂(d))

Ο(

₂

)

Experimental results are on a 48-core AMD Opteron 6168 node: BPAS (parallel) vs FLINT (serial) vs Maple 2015 (parallel)

Multiprecision approximate integration of rational functions with error control using two methods:

1) Exact integration followed by a numerical approximation of the integral:

Hermite reduction (rational part) and Lazard-Rioboo-Trager algorithm (transcendental part) to integrate;
Multiprecision real-root isolation on factors of the resultant using MPSolve;
Symbolic post-processing to compute arctangent terms.
Comparison between 53-bit and 106-bit precisions

2) Multiprecision partial fraction decomposition (PFD) followed by exact integration (proposed by R. Fateman):

Multiprecision real-root isolation for denominator using MPSolve;
Symbolic computation of partial fraction decomposition of reconstructed denominator;
Term-wise integration of rational terms and linear and quadratic (denominator) transcendental terms.
Comparison between 53-bit and 106-bit precisions:

Parallel real root isolation

Experimental results: Running time (in sec.) on a 48-core AMD Opteron 6168 node for BPAS (parallel) vs CMY (parallel) vs Maple 18 realroot (serial)

Experimental results: Running time (in sec.) on a 12-core Intel Xeon 5650 node for BPAS (parallel) vs. Maple 17 RealRootIsolate (serial) vs. C (with Maple 17 interface) Isolate (serial)