Use 4 PCG output function variants instead of iterating 4x per fork
Using four PCG steps per fork iterates through the PCG space 4x faster, getting back to the start after 2^62 splits instead of 2^64 (eh, who cares?). It's also a bit slow and hard to parallelize. It also reduces the space of possible weights for each register of the SplitMix dot product from 2^64 to 2^62, which is a reduction in collision resistance.

This commit instead applies four different PCG output functions to the same LCG state to get four sufficiently linearly unrelated streams of pseudorandom SplitMix dot product weights (one for each xoshiro256 state register). This should give us 256 bits of SplitMix collision resistance, which is formidable and means that collisions are impossible in practice.

Appealingly, this approach is also easy to vectorize. I even implemented it with SIMD intrinsics just to be sure (code compiled but not tested):

```
void jl_rng_split(uint64_t dst[JL_RNG_SIZE], uint64_t src[JL_RNG_SIZE])
{
    // load and advance PCG's LCG state
    uint64_t x = src[4];
    // high spectrum multiplier from https://arxiv.org/abs/2001.05304
    src[4] = dst[4] = x * 0xd1342543de82ef95 + 1;

    // manually vectorized PCG-RXS-M-XS with four variants
    static const uint64_t a[4] = {
        0xe5f8fa077b92a8a8, // random additive offsets...
        0x7a0cd918958c124d,
        0x86222f7d388588d4,
        0xd30cbd35f2b64f52
    };
    static const uint64_t m[4] = {
        0xaef17502108ef2d9, // standard multiplier
        0xf34026eeb86766af, // random odd multipliers...
        0x38fd70ad58dd9fbb,
        0x6677f9b93ab0c04d
    };
    __m256i p, s;
    p = _mm256_set1_epi64x(x);                            // p = x
    p = _mm256_add_epi64(p, _mm256_load_epi64(a));        // p += a
    s = _mm256_srlv_epi64(p, _mm256_set1_epi64x(59));     // s = p >> 59
    s = _mm256_add_epi64(s, _mm256_set1_epi64x(5));       // s += 5
    p = _mm256_xor_epi64(p, _mm256_srlv_epi64(p, s));     // p ^= p >> s
    p = _mm256_mullo_epi64(p, _mm256_load_epi64(m));      // p *= m
    s = _mm256_set1_epi64x(43);                           // s = 43
    p = _mm256_xor_epi64(p, _mm256_srlv_epi64(p, s));     // p ^= p >> s

    // load, modify & store xoshiro256 state
    __m256i sv = _mm256_load_epi64(src);
    __m256i dv = _mm256_add_epi64(sv, p); // SplitMix dot product
    _mm256_store_epi64(dst, dv);
}
```

I didn't end up using this because it only works on hardware with the necessary AVX instructions, so it's not portable, but I wanted to be sure it could be done. The committed version just uses a loop.

One concern with this approach is that the 256 bits of SplitMix dot product collision resistance could actually be a mirage. Why? Because the random weights are generated from 64 bits of LCG state. How is that an issue? In the proof of DotMix's collision avoidance, which SplitMix inherits, the number of possible weight values is key: the collision probability is 1/N where N is the number of possible weight vectors. If we consider all four xoshiro256 registers as one big dot product and apply the proof to it, we have a problem: despite 256 bits of register, there are only 2^64 possible weight values we can generate, so the proof only gives us a pairwise collision probability of 1/2^64.

Another way to look at this, however, is to consider the four xoshiro256 register dot products separately: each one has a 1/2^64 collision probability and there are four of them; as long as the chance of each one colliding is independent, the probability of all of them colliding together is (1/2^64)^4 = 1/2^256.

Clearly there are ways to generate the four weights that don't satisfy independence.
You could use the same weights four times, for example. Or you could use weights that are just scaled copies of each other. Basically, any linear relationship between the weights is problematic. That's yet another reason that iterating PCG multiple times to generate weights may not be ideal: the LCG that drives PCG is very linear; only the output function sabotages the linearity. If the output function being non-linear is crucial, why not use multiple different output functions instead?

So that's what I'm doing here: using four different variations on the PCG output function. First we perturb the LCG state by four different random additive constants, which moves it to four distant and unrelated places in the state space and gives the xor shifts different bits to work with. We also use four different multiplicative constants in the middle of the output function: the first is the standard PCG multiplier, so we get known-good weights for one of the registers; the rest are random odd multipliers. A potential improvement is to look for weights with optimal cascading behaviors, but random constants tend to be good.

Assuming our four output variants are sufficiently independent, we should get very strong collision resistance with a pairwise collision probability of 1/2^256. It would, however, be reassuring to have empirical evidence that this approach actually works. To that end, I scaled everything down to four 8-bit SplitMix dot products, counted how many simulated task spawns it takes before we get collisions, and compared that to a single 8-bit dot product.
Here's the test code:

```
# 8-bit PCG-RXS-M-XS output: a single weight
function pcg_output_rxs_m_xs_8_8(x::UInt8)
    p = x
    p += 0xa0 # random but same as below
    p ⊻= p >> ((p >> 6) + 2)
    p *= 0xd9 # standard multiplier
    p ⊻= p >> 6
end

# four output variants: one weight per simulated register
function pcg_output_rxs_m_xs_8_32(x::UInt8)
    ntuple(4) do i
        p = x
        p += (0xa0, 0x98, 0x66, 0x8d)[i]
        p ⊻= p >> ((p >> 6) + 2)
        p *= (0xd9, 0x2b, 0x19, 0x9b)[i]
        p ⊻= p >> 6
    end
end

Base.zero(::Type{NTuple{4, UInt8}}) = (0x0, 0x0, 0x0, 0x0)

# simulate a binary tree of task spawns `rec` levels deep and
# count how often each dot product value occurs at the leaves
function gen_collisions(
    ::Type{T},
    rec :: Int;
    cnt :: Dict{T,Int} = Dict{T,Int}(),
    lcg :: UInt8 = zero(UInt8),
    dot :: T = zero(T),
    out :: Function = T == UInt8 ?
        pcg_output_rxs_m_xs_8_8 :
        pcg_output_rxs_m_xs_8_32
) where {T}
    if rec > 0
        h = out(lcg)
        lcg = lcg * 0x8d + 0x01
        # argument order matches the signature: type first, then depth
        gen_collisions(T, rec - 1; out, cnt, lcg, dot = dot)
        gen_collisions(T, rec - 1; out, cnt, lcg, dot = dot .+ h)
    else
        cnt[dot] = get(cnt, dot, 0) + 1
    end
    return cnt
end
```

With this, `gen_collisions(UInt8, 5)`, which generates 2^5 = 32 dot products, already has collisions, whereas with four registers we have to go to `gen_collisions(NTuple{4,UInt8}, 20)`, generating 2^20 = 1048576 dot products, before we get collisions. This provides empirical evidence that the weights generated this way are sufficiently independent and that we can really expect 256 bits of SplitMix collision resistance.
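For reference, the portable loop mentioned earlier ("The committed version just uses a loop") might look roughly like the following. This is a sketch reconstructed from the SIMD version in this message, not a copy of the committed code: the name `rng_split_scalar` and the `RNG_SIZE` macro are placeholders, and the state layout (four xoshiro256 registers followed by one LCG word) is assumed from how the SIMD code indexes `src`.

```c
#include <stdint.h>

/* Assumed layout: indices 0..3 are xoshiro256 registers, index 4 is the LCG state. */
#define RNG_SIZE 5

/* Hypothetical scalar counterpart of the SIMD routine; same constants, one lane at a time. */
void rng_split_scalar(uint64_t dst[RNG_SIZE], uint64_t src[RNG_SIZE])
{
    /* load and advance the LCG state; parent and child both get the new value */
    uint64_t x = src[4];
    src[4] = dst[4] = x * 0xd1342543de82ef95ULL + 1;

    /* additive offsets and multipliers, copied from the SIMD version */
    static const uint64_t a[4] = {
        0xe5f8fa077b92a8a8ULL, 0x7a0cd918958c124dULL,
        0x86222f7d388588d4ULL, 0xd30cbd35f2b64f52ULL
    };
    static const uint64_t m[4] = {
        0xaef17502108ef2d9ULL, 0xf34026eeb86766afULL,
        0x38fd70ad58dd9fbbULL, 0x6677f9b93ab0c04dULL
    };

    for (int i = 0; i < 4; i++) {
        uint64_t p = x + a[i];         /* perturb the old LCG state */
        p ^= p >> ((p >> 59) + 5);     /* random xorshift, shift amount in 5..36 */
        p *= m[i];                     /* variant-specific multiplier */
        p ^= p >> 43;                  /* final xorshift */
        dst[i] = src[i] + p;           /* add this register's weight */
    }
}
```

Each loop iteration corresponds to one 64-bit lane of the vectorized code, so the weights depend only on the old LCG state `x` and never on the register contents.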