SYCL-based Parallel Frequency Tests for Randomness Assessment

Accelerating the NIST SP 800-22 frequency tests using Intel oneAPI and SYCL, achieving up to a 60× speedup over the serial reference from a single portable codebase that runs on both CPU and GPU.

SYCL / DPC++ Intel oneAPI NIST SP 800-22 GPU Acceleration C++17 GM/T 0005-2021

At a Glance

What This Project Does

The serial C implementation of the NIST STS becomes a bottleneck at gigabyte scale. This work reimplements the two fundamental frequency tests as SYCL parallel kernels that run the same source code on both CPU and GPU, avoiding the CUDA vendor lock-in of prior GPU acceleration work.

60× peak GPU speedup (1 GB input)
32 Gbit/s GPU throughput at scale
<1×10⁻⁵ P-value error vs. the NIST reference
1 codebase shared by CPU & GPU
🔀 Frequency (Monobit) Test
Two-phase parallel reduction using work-group local memory and sycl::atomic_ref. Tests the global balance of 0s and 1s in the sequence.

📦 Block Frequency Test
Two-pass data-parallel kernel: a counting pass followed by a sycl::reduction for the chi-squared accumulation. Tests local balance within M-bit blocks.

🌐 Cross-Architecture Portability
Compiles for Intel CPU, Intel GPU, NVIDIA GPU, and AMD GPU from a single SYCL source, with no platform-specific code paths.


Correctness Verification

P-value Accuracy

P-values produced by the SYCL implementation are compared against the official NIST STS 2.1.2 serial reference. All results match to within floating-point tolerance (<1×10⁻⁵).

| Test | Input Sequence | Expected P-value | SYCL P-value | Status |
|---|---|---|---|---|
| Monobit | 1011010101 (n=10) | 0.527089 | 0.527089 | ✓ PASS |
| Block Frequency | 0110011010 (n=10, M=3) | 0.801252 | 0.801252 | ✓ PASS |
| Large-scale (1M–1G bits) | Random binary via /dev/urandom | serial reference | abs(ΔP) < 1×10⁻⁵ | ✓ PASS |

Performance

Speedup & Throughput

Benchmarked on random binary inputs from 1 MB to 1 GB. Baseline: NIST STS 2.1.2 compiled with gcc -O2. SYCL GPU target: nvptx64-nvidia-cuda, compiled with icpx -O3 -fsycl.

[Figures: execution time (ms), serial vs. SYCL GPU; speedup over the NIST STS serial baseline; CPU vs. GPU execution time for the monobit test; throughput (Mbit/s).]

| Input Size | NIST STS Serial (ms) | SYCL CPU (ms) | SYCL GPU (ms) | GPU Speedup vs Serial | GPU Speedup vs CPU |
|---|---|---|---|---|---|
| 1 MB (8M bits) | ~15 | ~8 | ~2 | ~7.5× | ~4× |
| 10 MB (80M bits) | ~150 | ~50 | ~5 | ~30× | ~10× |
| 100 MB (800M bits) | ~1,500 | ~400 | ~30 | ~50× | ~13× |
| 1 GB (8G bits) | ~15,000 | ~4,000 | ~250 | ~60× | ~16× |

Algorithm Design

How the Kernels Work

⚡ Monobit Kernel

1. Each work-item loads bits, maps 0→−1 and 1→+1, and accumulates into work-group local memory.
2. Binary tree reduction over the work-group local memory, with a barrier between levels.
3. Work-item 0 of each group atomically adds its group's sum to a global accumulator via sycl::atomic_ref.
4. The host computes S_obs = |Sₙ|/√n and P-value = erfc(S_obs/√2).

📊 Block Frequency Kernel

1. Pass 1: each work-item counts the ones in one M-bit block (fully data-parallel, no communication).
2. Pass 2: each work-item computes (πᵢ − 0.5)² for its block; sycl::reduction accumulates the sum in double precision.
3. Pass 2 declares an explicit event dependency on Pass 1, so no host-side blocking is needed between the passes.
4. The host scales the result: χ² = 4M·Σ(πᵢ − 0.5)², P-value = igamc(N/2, χ²/2).
🧠 Work-group size: 256
A power of two is required by the binary tree reduction; 256 also avoids warp divergence on most GPU architectures.

💾 USM Device Memory
sycl::malloc_device with a single host-to-device copy before both kernels; no redundant host↔device transfers.

🔢 Double-precision reduction
The block frequency kernel accumulates (πᵢ − 0.5)² in double precision to prevent loss of significance at N = 10⁶ blocks.


References

Key References

  1. Bassham L. et al. NIST SP 800-22 Rev. 1a — A Statistical Test Suite for Random and Pseudorandom Number Generators. NIST, 2010.
  2. Suciu A. et al. Parallel implementation of the NIST statistical test suite. IEEE ICCP, 2010.
  3. Osama M., Hussein A. Parallelization of PRNG statistical tests using CUDA. IEEE ICCES, 2015.
  4. Sýs M., Říha Z. Faster Randomness Testing with the NIST Statistical Test Suite. LNCS Springer, 2014.
  5. Luo Y. et al. Fast SIMD-based implementation of the monobit frequency and block frequency tests (基于SIMD的单比特频数和块内频数检测快速实现). Communications Technology (通信技术), 2025.
  6. Reinders J. et al. Data Parallel C++: Mastering DPC++ for Heterogeneous Systems. Apress, 2021.
  7. State Cryptography Administration of China (国家密码管理局). GM/T 0005-2021 Randomness Test Specification (随机性检测规范). 2021.