Accelerating NIST SP 800-22 statistical tests using Intel oneAPI and SYCL, achieving up to 60× speedup over the serial reference on CPU and GPU with a single portable codebase.
The NIST STS serial C implementation becomes a bottleneck at GB scale. This work reimplements the two fundamental frequency tests using SYCL parallel kernels, running the same source code on both CPU and GPU — eliminating the CUDA vendor lock-in present in all prior acceleration work.
Monobit (frequency) test: two-phase parallel reduction using work-group local memory and sycl::atomic_ref. Tests the global balance of 0s and 1s in the sequence.
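The two-phase structure can be sketched on the host in plain C++. This is an illustrative serial reference, not the project's SYCL kernel: the function name `monobit_p_value`, the `group_size` parameter, and the use of a '0'/'1' character string as input are all assumptions made for the example, and `std::atomic` merely stands in for `sycl::atomic_ref`.

```cpp
#include <algorithm>
#include <atomic>
#include <cmath>
#include <cstddef>
#include <string>

// Hypothetical host-side reference for the monobit test (requires a non-empty input).
// bits: sequence as '0'/'1' characters; group_size: elements per simulated work-group.
double monobit_p_value(const std::string& bits, std::size_t group_size = 4) {
    const std::size_t n = bits.size();
    std::atomic<long long> s_n{0};  // stand-in for a global counter updated via sycl::atomic_ref

    for (std::size_t start = 0; start < n; start += group_size) {
        // Phase 1: each group sums its own slice locally (+1 for a 1 bit, -1 for a 0 bit).
        long long local = 0;
        const std::size_t end = std::min(start + group_size, n);
        for (std::size_t i = start; i < end; ++i) {
            local += (bits[i] == '1') ? 1 : -1;
        }
        // Phase 2: a single atomic add per group folds the partial sum into the total.
        s_n.fetch_add(local, std::memory_order_relaxed);
    }

    // S_obs = |S_n| / sqrt(n), P-value = erfc(S_obs / sqrt(2)).
    const double s_obs = std::fabs(static_cast<double>(s_n.load())) /
                         std::sqrt(static_cast<double>(n));
    return std::erfc(s_obs / std::sqrt(2.0));
}
```

Calling `monobit_p_value("1011010101")` reproduces the small-input case from the validation table to within the stated tolerance. On a device, the groups would run concurrently, which is why the second phase needs an atomic update.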
Block frequency test: two-pass data-parallel kernel, with a counting pass followed by sycl::reduction for chi-squared accumulation. Tests the local balance of 0s and 1s within M-bit blocks.
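A serial C++ sketch of the same two passes may help make the arithmetic concrete. It is not the project's kernel: the names `block_frequency_p_value` and the local `igamc` helper are assumptions for this example, the input is taken as a '0'/'1' string for readability, and the incomplete gamma routine is a textbook series/continued-fraction evaluation that has not been tuned for very large block counts.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Regularized upper incomplete gamma function Q(a, x); illustrative implementation.
double igamc(double a, double x) {
    if (x <= 0.0) return 1.0;
    const double log_prefactor = -x + a * std::log(x) - std::lgamma(a);
    if (x < a + 1.0) {
        // Series expansion for P(a, x); return Q = 1 - P.
        double ap = a, term = 1.0 / a, sum = term;
        for (int k = 0; k < 100000; ++k) {
            ap += 1.0;
            term *= x / ap;
            sum += term;
            if (std::fabs(term) < std::fabs(sum) * 1e-15) break;
        }
        return 1.0 - sum * std::exp(log_prefactor);
    }
    // Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    const double tiny = 1e-300;
    double b = x + 1.0 - a, c = 1.0 / tiny, d = 1.0 / b, h = d;
    for (int i = 1; i <= 100000; ++i) {
        const double an = -i * (i - a);
        b += 2.0;
        d = an * d + b;
        if (std::fabs(d) < tiny) d = tiny;
        c = b + an / c;
        if (std::fabs(c) < tiny) c = tiny;
        d = 1.0 / d;
        const double delta = d * c;
        h *= delta;
        if (std::fabs(delta - 1.0) < 1e-15) break;
    }
    return h * std::exp(log_prefactor);
}

// Hypothetical host-side reference for the block frequency test (requires M > 0).
double block_frequency_p_value(const std::string& bits, std::size_t M) {
    const std::size_t N = bits.size() / M;  // trailing bits that do not fill a block are ignored
    std::vector<std::size_t> ones(N, 0);

    // Pass 1: count the ones in each M-bit block (independent per block).
    for (std::size_t blk = 0; blk < N; ++blk)
        for (std::size_t j = 0; j < M; ++j)
            ones[blk] += (bits[blk * M + j] == '1') ? 1 : 0;

    // Pass 2: accumulate (pi_i - 0.5)^2 in double, then scale to chi-squared.
    double sum = 0.0;
    for (std::size_t blk = 0; blk < N; ++blk) {
        const double pi = static_cast<double>(ones[blk]) / static_cast<double>(M);
        sum += (pi - 0.5) * (pi - 0.5);
    }
    const double chi2 = 4.0 * static_cast<double>(M) * sum;
    return igamc(static_cast<double>(N) / 2.0, chi2 / 2.0);
}
```

For the input 0110011010 with M=3, the three blocks contain 2, 1, and 2 ones, which gives χ² = 1 and a P-value that agrees with the validation table.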
Compiles for Intel CPU, Intel GPU, NVIDIA GPU, and AMD GPU from a single SYCL source — no platform-specific code paths.
P-values produced by the SYCL implementation are compared against the official NIST STS 2.1.2 serial reference. All results match to within floating-point tolerance (<1×10⁻⁵).
| Test | Input Sequence | Expected P-value | SYCL P-value | Status |
|---|---|---|---|---|
| Monobit | 1011010101 (n=10) | 0.527089 | 0.527089 | ✓ PASS |
| Block Frequency | 0110011010 (n=10, M=3) | 0.801252 | 0.801252 | ✓ PASS |
| Large-scale (1M–1G bits) | Random binary via /dev/urandom | n/a | \|ΔP\| < 1×10⁻⁵ | ✓ PASS |
Benchmarked on random binary inputs from 1 MB to 1 GB.
Baseline: NIST STS 2.1.2 compiled with gcc -O2.
SYCL GPU target: nvptx64-nvidia-cuda, compiled with icpx -O3 -fsycl.
| Input Size | NIST STS Serial (ms) | SYCL CPU (ms) | SYCL GPU (ms) | GPU Speedup vs Serial | GPU Speedup vs CPU |
|---|---|---|---|---|---|
| 1 MB (8M bits) | ~15 | ~8 | ~2 | ~7.5× | ~4× |
| 10 MB (80M bits) | ~150 | ~50 | ~5 | ~30× | ~10× |
| 100 MB (800M bits) | ~1,500 | ~400 | ~30 | ~50× | ~13× |
| 1 GB (8G bits) | ~15,000 | ~4,000 | ~250 | ~60× | ~16× |
Monobit: work-group partial sums are combined via sycl::atomic_ref. S_obs = |Sₙ|/√n, P-value = erfc(S_obs/√2).
Block frequency: sycl::reduction accumulates the sum in double precision. P-value = igamc(N/2, χ²/2).
Work-group size: power-of-two required for binary tree reduction; avoids warp divergence on most GPU architectures.
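To see why a binary tree reduction wants a power-of-two element count, consider the following host-side sketch of one simulated work-group. The helper names `is_power_of_two` and `tree_reduce` are hypothetical, and the inner loop runs serially here, whereas on a device each stride level would execute in parallel with a barrier between levels.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: true when n is a nonzero power of two.
bool is_power_of_two(std::size_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Binary tree reduction over one simulated work-group's local array.
// The vector is taken by value, so the caller's data is left unchanged.
long long tree_reduce(std::vector<long long> local) {
    if (!is_power_of_two(local.size())) {
        // With any other size, halving the stride would skip some elements.
        throw std::invalid_argument("work-group size must be a power of two");
    }
    for (std::size_t stride = local.size() / 2; stride > 0; stride /= 2) {
        // Every active index i < stride does the same work, so there is no branching
        // between neighbouring work-items inside a level.
        for (std::size_t i = 0; i < stride; ++i) {
            local[i] += local[i + stride];
        }
        // On a device, a work-group barrier would go here before the next level.
    }
    return local[0];
}
```

With a size such as 6, the first level would pair indices 0..2 with 3..5 and the next level (stride 1) would only fold index 1 into index 0, leaving index 2 unaccounted for. The size check makes that failure explicit instead of silent.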
Device memory: the input is allocated with sycl::malloc_device and copied with a single memcpy before both kernels, so there are no redundant host↔device transfers.
Block frequency accumulates (πᵢ−0.5)² in double to prevent loss of significance at N=10⁶ blocks.
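The effect of the accumulator type can be illustrated with a small host-side experiment. The template `sum_squared_deviation` is a hypothetical helper written for this example, and the input used in the usage note (every block proportion equal to 0.51) is deliberately artificial so that the rounding drift in single precision is easy to observe.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: accumulate (pi_i - 0.5)^2 using the accumulator type Acc.
template <typename Acc>
Acc sum_squared_deviation(const std::vector<double>& pi) {
    Acc acc = 0;
    for (double p : pi) {
        const double d = p - 0.5;
        // Each term is rounded to Acc before being added, as a float kernel would do.
        acc += static_cast<Acc>(d * d);
    }
    return acc;
}
```

As a usage example, filling a vector with 10⁶ copies of 0.51 and comparing `sum_squared_deviation<float>` against `sum_squared_deviation<double>` shows the float result drifting visibly away from 10⁶ × (0.51 − 0.5)², while the double result stays close to it. Once the running sum is much larger than each term, a float accumulator rounds away a fixed fraction of every addition, and that bias compounds over a million blocks.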