SYCL-based Parallel NIST STS Frequency Tests

At a Glance

What This Project Does

The NIST STS serial C implementation becomes a bottleneck at GB scale. This work reimplements the two fundamental frequency tests using SYCL parallel kernels, running the same source code on both CPU and GPU — eliminating the CUDA vendor lock-in present in all prior acceleration work.

60×

Peak GPU speedup (1 GB input)

32 Gbit/s

GPU throughput at scale

<1×10⁻⁵

P-value error vs. NIST reference

Codebase for CPU & GPU

🔀

Frequency (Monobit) Test

Two-phase parallel reduction with work-group shared memory and sycl::atomic_ref. Tests global balance of 0s and 1s in the sequence.

📦

Block Frequency Test

Two-pass data-parallel kernel: a counting pass followed by sycl::reduction for chi-squared accumulation. Tests local balance within M-bit blocks.

🌐

Cross-Architecture Portability

Compiles for Intel CPU, Intel GPU, NVIDIA GPU, and AMD GPU from a single SYCL source — no platform-specific code paths.

Correctness Verification

P-value Accuracy

P-values produced by the SYCL implementation are compared against the official NIST STS 2.1.2 serial reference. All results match to within floating-point tolerance (<1×10⁻⁵).

Test	Input Sequence	Expected P-value	SYCL P-value	Status
Monobit	`1011010101` (n=10)	0.527089	0.527089	✓ PASS
Block Frequency	`0110011010` (n=10, M=3)	0.801252	0.801252	✓ PASS
Large-scale (1M–1G bits)	Random binary via `/dev/urandom`	—	\|ΔP\| < 1×10⁻⁵	✓ PASS

Performance

Speedup & Throughput

Benchmarked on random binary inputs from 1 MB to 1 GB. Baseline: NIST STS 2.1.2 compiled with gcc -O2. SYCL GPU target: nvptx64-nvidia-cuda, compiled with icpx -O3 -fsycl.

Execution Time (ms) — Serial vs SYCL GPU

Speedup over NIST STS Serial Baseline

CPU vs GPU Execution Time — Monobit Test

Throughput (Mbit/s)

Input Size	NIST STS Serial (ms)	SYCL CPU (ms)	SYCL GPU (ms)	GPU Speedup vs Serial	GPU Speedup vs CPU
1 MB (8M bits)	~15	~8	~2	~7.5×	~4×
10 MB (80M bits)	~150	~50	~5	~30×	~10×
100 MB (800M bits)	~1,500	~400	~30	~50×	~13×
1 GB (8G bits)	~15,000	~4,000	~250	~60×	~16×

Algorithm Design

How the Kernels Work

⚡ Monobit Kernel

Each work-item loads bits and maps 0→−1, 1→+1, accumulates into work-group shared memory.

Binary tree reduction over work-group shared memory with synchronization barriers.

Work-item 0 per group atomically adds group sum to global accumulator via sycl::atomic_ref.

Host computes S_obs = |Sₙ|/√n, P-value = erfc(S_obs/√2).

📊 Block Frequency Kernel

Pass 1: Each work-item counts ones in one M-bit block (fully data-parallel, no communication).

Pass 2: Each work-item computes (πᵢ − 0.5)² for its block; sycl::reduction accumulates the sum in double precision.

Pass 2 declares an explicit event dependency on Pass 1 — no host-side blocking needed between passes.

Host scales result: χ² = 4M·Σ(πᵢ−0.5)², P-value = igamc(N/2, χ²/2).

🧠

Work-group size: 256

Power-of-two required for binary tree reduction; avoids warp divergence on most GPU architectures.

💾

USM Device Memory

sycl::malloc_device with a single memcpy before both kernels — no redundant host↔device transfers.

🔢

Double-precision reduction

Block frequency accumulates (πᵢ−0.5)² in double to prevent loss of significance at N=10⁶ blocks.

References

Key References

Bassham L. et al. NIST SP 800-22 Rev. 1a — A Statistical Test Suite for Random and Pseudorandom Number Generators. NIST, 2010.
Suciu A. et al. Parallel implementation of the NIST statistical test suite. IEEE ICCP, 2010.
Osama M., Hussein A. Parallelization of PRNG statistical tests using CUDA. IEEE ICCES, 2015.
Sýs M., Říha Z. Faster Randomness Testing with the NIST Statistical Test Suite. LNCS Springer, 2014.
罗影 et al. 基于SIMD的单比特频数和块内频数检测快速实现. 通信技术, 2025.
Reinders J. et al. Data Parallel C++: Mastering DPC++ for Heterogeneous Systems. Apress, 2021.
国家密码管理局. GM/T 0005-2021 随机性检测规范. 2021.

SYCL-based Parallel Frequency Tests for Randomness Assessment