SYCL-based Parallel Frequency Tests for Randomness Assessment

Accelerating NIST SP 800-22 statistical tests using Intel oneAPI and SYCL — up to 2,234× speedup on GPU and 357× speedup on a 24-core CPU, from a single portable codebase. Bit-exact P-value match against the NIST STS 2.1.2 serial reference.

SYCL / DPC++ Intel oneAPI NIST SP 800-22 GPU Acceleration CPU Acceleration C++17 RTX 4080 SUPER i9-13900KF
⭐ View on GitHub 📊 See Results

At a Glance

What This Project Does

The NIST STS serial C implementation becomes a bottleneck for modern cryptographic pipelines that ingest gigabits of entropy per second. This work reimplements the two fundamental frequency tests using SYCL parallel kernels — one codebase that runs on CPU, Intel, NVIDIA, and AMD GPUs — eliminating the vendor lock-in of prior CUDA-only acceleration work.

2,234×
Peak GPU speedup @ 100 M bits (Block Freq)
357×
Peak CPU speedup @ 100 M bits (Block Freq)
160.7 Gbit/s
Peak GPU throughput
6-decimal
P-value match vs NIST STS 2.1.2 (CPU + GPU)
🔀

Frequency (Monobit) Test

Two-phase parallel reduction with work-group shared memory and sycl::atomic_ref. Tests global balance of 0s and 1s in the sequence.

📦

Block Frequency Test

Two-pass data-parallel kernel: a counting pass followed by sycl::reduction for chi-squared accumulation. Tests local balance within M-bit blocks.

🌐

Cross-Architecture Portability

Compiles for Intel CPU, Intel GPU, NVIDIA GPU, and AMD GPU from a single SYCL source — no platform-specific code paths.


Correctness Verification

P-value Accuracy

P-values produced by the SYCL implementation are compared against the official NIST STS 2.1.2 serial reference. All P-values match the serial reference to at least six decimal places across every tested configuration.

Test Input Sequence Expected P-value SYCL P-value Status
Monobit 1011010101 (n=10) 0.527089 0.527089 ✓ PASS
Block Frequency 0110011010 (n=10, M=3) 0.801252 0.801252 ✓ PASS
Block Frequency (M=1024) Random binary, n=100 M bits 0.661400 0.661400 ✓ PASS (|ΔP| < 10⁻⁹)
Monobit (large-scale) Random binary, n=100 M bits |ΔP| < 10⁻⁹ ✓ PASS

Performance

Speedup & Throughput

Benchmarked on NVIDIA RTX 4080 SUPER (Intel DPC++ 2024.0, Ubuntu). Baseline: NIST STS 2.1.2 compiled with gcc -O2; SYCL GPU target: nvptx64-nvidia-cuda, compiled with icpx -O3 -fsycl. Each datapoint is the median of 5 runs with standard deviation below 2%.

Execution Time — SYCL vs NIST STS Serial

Execution time comparison

Speedup vs Input Size — Amdahl's-Law Fit

Speedup vs input size

Speedup Comparison — Monobit vs Block Frequency

Speedup comparison

Throughput vs Input Size — log-log Scale

Throughput vs input size
Test Bits (n) NIST STS Serial SYCL CPU SYCL GPU
Time Speedup Throughput Time Speedup Throughput
Frequency (Monobit) 100 M 1,390.10 ms 14.705 ms 94.5× 6.80 Gbit/s 1.052 ms 1,321.0× 95.0 Gbit/s
Block Frequency (M=1024) 100 M 1,390.10 ms 3.894 ms 357.0× 25.68 Gbit/s 0.622 ms 2,234.5× 160.7 Gbit/s

SYCL CPU benchmarked on Intel Core i9-13900KF (24 cores), Intel OpenCL 3.0 runtime. P-values match NIST STS reference exactly across CPU, GPU, and serial.


Algorithm Design

How the Kernels Work

⚡ Monobit Kernel

1
Each work-item loads bits and maps 0→−1, 1→+1, accumulates into work-group shared memory.
2
Binary tree reduction over work-group shared memory with synchronization barriers.
3
Work-item 0 per group atomically adds group sum to global accumulator via sycl::atomic_ref.
4
Host computes S_obs = |Sₙ|/√n, P-value = erfc(S_obs/√2).

📊 Block Frequency Kernel

1
Pass 1: Each work-item counts ones in one M-bit block (fully data-parallel, no communication).
2
Pass 2: Each work-item computes (πᵢ − 0.5)² for its block; sycl::reduction accumulates the sum in double precision.
3
Pass 2 declares an explicit event dependency on Pass 1 — no host-side blocking needed between passes.
4
Host scales result: χ² = 4M·Σ(πᵢ−0.5)², P-value = igamc(N/2, χ²/2).
🧠

Work-group size: 256

Power-of-two required for binary tree reduction; avoids warp divergence on most GPU architectures.

💾

USM Device Memory

sycl::malloc_device with a single memcpy before both kernels — no redundant host↔device transfers.

🔢

Double-precision reduction

Block frequency accumulates (πᵢ−0.5)² in double to prevent loss of significance at N=10⁶ blocks.


References

Key References

  1. Bassham L. et al. NIST SP 800-22 Rev. 1a — A Statistical Test Suite for Random and Pseudorandom Number Generators. NIST, 2010.
  2. Suciu A. et al. Parallel implementation of the NIST statistical test suite. IEEE ICCP, 2010.
  3. Osama M., Hussein A. Parallelization of PRNG statistical tests using CUDA. IEEE ICCES, 2015.
  4. Sýs M., Říha Z. Faster Randomness Testing with the NIST Statistical Test Suite. LNCS Springer, 2014.
  5. 罗影 et al. 基于SIMD的单比特频数和块内频数检测快速实现. 通信技术, 2025.
  6. Reinders J. et al. Data Parallel C++: Mastering DPC++ for Heterogeneous Systems. Apress, 2021.
  7. 国家密码管理局. GM/T 0005-2021 随机性检测规范. 2021.