-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Random: allow negative seeds #51416
Merged
Merged
Random: allow negative seeds #51416
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stream of random numbers is documented not to be stable across versions, even for the same seed. This seems like a lot of additional code to avoid breaking something we don't guarantee in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can definitely break the random streams, but it can be some work then to fix the seeds in LinearAlgebra! And probably for some people in the wild. But I don't think that not breaking most seeds didn't represent much more code; the alternatives I saw were:
seed[0]
, which would require shiftingseed[1]
orseed[end]
(probably the simplest option)seed!
to encode negativity (e.g.UInt32[0xbc7f82b8, 0x53f967d9, 0x538ab2e4, 0x7c8725a9]
); this would also have been "non-stream-breaking"There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in the other PR, I still don't understand why the sign bit needs to be treated specially at all. It's a bit.
make_seed
certainly doesn't make use of the fact that what's passed in is a mathematical number; it only cares for the bits. Why is this one bit special to the purpose ofmake_seed
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already explained at length my reasonning in the other PR, I'm afraid I don't have much more ideas on how to explain this.
A Julia integer can be seen as a bag of bits, but it's first and foremost an object representing a mathematical number. I simply want (1)
n == m
impliesmake_seed(n) == make_seed(m)
, and (2)n != m
impliesmake_seed(n) != make_seed(m)
, without caring about the bits which encoden
orm
. I just need to care about the bits (in particular the sign bit) in the implementation, which must simply do what it takes to honor these properties.Property (1) is what we expect from mathematical functions, and property (2) is just useful for
seed!
: after all we have the following in the docstring forMersenneTwister
:Yes, with different seeds, we expect different streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My objection is with the interpretation that
make_seed
takes a mathematical integer and maps that to a stream of numbers. It doesn't - it takes the given bag-of-bits and initializes the RNG state with that. A JuliaInt
(8/16/32/64/128) (which are really from a modular ring, not the entire integers) is just the most convenient & packed representation of that we have.This is necessarily an operation that can't produce unique RNG objects for every possible seed, by the pigeonhole principle: the state of the PRNG is of finite size. You can of course say that you only really care about finitely sized inputs (so
Base.BitInteger
s, basically), but then you also have to prove that those different inputs not only produce different RNG initial streams, but that the stream of numbers (and the associated states) doesn't overlap (otherwise you're on a shared cycle). This is already not the case for any RNG based on LFSR, which merely has long cycle lengths.So, with that in mind, I don't understand why those two properties in particular are desirable, other than to give the illusion of uncorrelated RNG streams. I get that the properties seem nice, but you don't really get much out of them algorithmically. Documenting that
make_seed
by default cares only about the bits of whatever object its given seems just as simple to document & easier to maintain.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it might be easier to maintain, but not by much as this PR shows. But also, again, I see no particular reason to treat integers as bag of bits: it's only an implementation detail of
make_seed
. Mostly everywhere else, we treat julia integers as mathematical integers as much as reasonably possible.hash
for example does that.Besides, by just applying
unsigned
to the input like in #46190, we still have to handleBigInt
separately, and this wouldn't even address the case of other non-bits integers in the wild. Heremake_seed
will work for any integer implementation providing a view of their instances as two's complement (which BigInt does).There are two steps in the process:
If the two properties (P1) mathematical function and (P2) injectivity, i.e.
x == y iff f(x) == f(y)
, hold for each step, then it holds for the whole process.This PR addresses only 1), and implements P1 and P2.
For 2), (P1) is already generally implemented: for a given initialization vector, the RNGs implemented by
Random
are initialized into a deterministic state.For (P2), of course, as already mentioned in the other PR, true injectivity can't hold, but we get it in practice by using SHA2_256.
This is a good point, that users should be aware of; though I don't have to prove anything, I'm not trying to solve this here, only trying to do the best possible thing for seeding. To be certain to avoid overlapping streams, the only way I know of (for the RNGs we have) would be to use "jumping", which is already implemented for
MersenneTwister
, and there is an open PR forXoshiro
. For users not using jumps, I think the best we can do is to makeseed!(::Integer)
produce different streams for different inputs (and the same streams for==
inputs). In practice, two distinct streams will be extremely unlikely to overlap during program execution. For example, if i'm not mistaken, this paper (i could only read the intro, but it gives the formula) suggests that forn
processors each producing sequences of lenghtL
for a state with periodP
, "the probability of overlap is always at mostn^2 * L / P
". For example, settingn = big(10.0)^4; L = big(2.0)^50; P = big(2.0)^256
(Xoshiro has period 2^256-1), this formula gives a probability of getting overlapping sequences upper bounded by 2^(-179), i.e. were we to trust this paper, with these parameters above, it's more likely to find a collision in SHA2_256 than to get overlapping sequences with Xoshiro.It's maybe an illusion in theory, but with the arguments above, i believe that for all practical matters, these properties go a long way into ensuring uncorrelated streams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No; Julia integers are not mathematical integers (i.e., the set of natural numbers, if that's what you mean). They are modular integers..
make_seed
ignores that in any case, and is only interested in the "bag of bits" behavior of them - otherwise operations like& 0xffffffff
wouldn't make sense in the first place, and would have to be written like% 0x100000000
instead.Naive conversion via
unsigned
still propagates the sign bit:which is exactly why I chose it in the original PR, in order to not have to take special considerations. So by special casing
signbit
, you've introduced a bias where the sign bit is counted twice if it's set (once during the conversion tounsigned
, then again in yourif neg
), meaning it contributes more to the seeded value than the other bits. That's why it's dangerous to do this kind of special casing, and why it should be avoided at all costs."suitable" is doing a lot of heavy lifting here. Why is the direct interpretation of "bag of bits" not suitable - other than losing numerical interpretation, which
make_seed
has no history of making in the first place?Could you expand on what you mean with "(P1) mathematical function"? That is not a testable property of a piece of code, as far as I'm aware. What criterion makes a function "mathematical"?
SHA cannot make a non-injective input injective overall. The failure of injectivity already happened before SHA ever saw any input; it cannot "disentangle" that without additional information, which is not provided as far as I can tell.
Yes, I agree with this. But I don't agree with "different inputs" meaning "different numerical interpretations of a string of bits" for the purpose of seeding an RNG.
make_seed
and RNGs in general do not care about numerical interpretation of the input seed; they only care about the raw bits, without interpretation.I don't follow here. This is not about jumping or multiprocessing at all - this is about what
make_seed
interprets its input as. I'm arguing that it should only care about the bits, not the type/domain. You're arguing for the opposite. Any existing user will be wholly unaffected by this (or the other PR), because the existing working inputs don't suddenly overlap with existing working inputs. It's only non-working inputs that may overlap with existing working inputs (which is, again, inevitable due to the nature of finite space).Further, any user relying on task-based multithreading in julia will (or does) already benefit from jumping, both in MT as well as Xoshiro; the tasks are already seeded with a jumped value from the parent RNG.
No, it's also an illusion in practice, and one we should document & make visible, so that people can avoid shooting themselves in the foot by accident, by encouraging use of jumping and heavily discouraging seeding multiple RNGs. The correct procedure for getting multiple statistically uncorrelated RNG streams is to seed one parent RNG and create multiple sub-RNGs by jumping. Seeding multiple RNGs and hoping for uncorrelation is not sound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what makes you say that
make_seed
is only interested in the bag of bits. The current implementation can lead to both interpretations. the fact that we use bitwise operations is just an implementation detail: they are the tools we have to dissect and assemble integers; I again concede that for that we need to make the assumption that these integer types expose a two's complement "interface", i.e. bitwise operation.I dont' understand this point.
I believe not: one way to see that is that, unless of course the implementation in this PR is faulty, it implements the following:
x == y iff f(x) == f(y)
. Now consider the path taken forx::BigInt
:abs(x)
can't be<0
, and nounsigned
call is involved. Therefore, forBigInt
, the sign bit is not counted twice: it's removed fromx
when takingabs
, and is "tranferred" intoneg
, which is then used to set one bit in the resulting array. Now for any non-BigInty
equal tox
, we havemake_seed(x) == make_seed(y)
, so the sign bit ofy
is not counted twice.But even then, even if the sign bit was counted twice, for whatever that means, I fail to see what problem this would lead to.
I think it's suitable to a great extent, but I like the interpretation as "math integers" better here, as for example
seed(-1)
will lead to the same stream on 32 vs 64 bits platforms.As an anecdote,
make_seed
was initially accepting negative numbers (67f6fee), but was changed shortly after (3c5946c). Tere is not a lot of information in these commits, but I would guess the change was precisely to get consistent streams across 32 and 64 bits platforms (but i could be totally wrong there).Sorry I was too terse I guess. The property
x == y iff f(x) == f(y)
contains(P1) "mathematical function", i.e.
x == y
impliesf(x) == f(y)
(P2) injectivity, i.e.
f(x) == f(y)
impliesx == y
So
rand(x::Integer)
is definitely not a "mathematical function", buthash(x)
is. Likewise, I wantmake_seed
to be mathematical.I don't get this. The pipeline being
x |> make_seed |> SHA
, ifSHA
is injective and ifmake_seed
is injective, thenSHA ∘ make_seed
, is injective, i.e.seed!
essentially.Yes and I don't see any advantage to that, except slightly simpler code if we didn't bother to accept
BigInt
; butBigInt
must be supported.But the big difference is that with the other PR using
unsigned
, it's trivial to construct two distinct (!=
) seeds which produce the same streams, whereas the solution proposed here make this practically impossible.I'm all for making jumping more visible, and finding it useful, I even implemented the jumping code for julia's
MersenneTwister
. But at the same time, users will sometimes just not care about it, because seeding a bunch of integers just works in practice, e.g.Xoshiro.((1, 2, 3, 4))
. I'm not an expert on the subject, but I tried to provide some justification for why the 4 resultingXoshiro
instances have such an astronomically small chance of running into overlapping sequence that it's impossible in practice. In other words, barring bugs, I believe no-one will ever run into overlapping sequences by accident when using distinct integer seeds. On the other hand, it's very easy to run in the same seqence from two distinct seeds if we interpret them as bag of bits.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is counted twice:
Once in propagation of the original data (hence the leading
8
), and once when pushing the additional zero! You're "extracting" two bits of information from a single input bit, whereas the other bits only contribute a single bit themselves. That is a bias in favor of that single bit, which ends up contributing 33 bits of "information" to the input of SHA256 (which is always going to be zero, or not exist at all - meaning, it might as well not be there!) There is no additional entropy being added; all you're doing is mapping to a different set of SHA256 outputs due to theUnsigned
ness (on the type level) of the input.Those are both "mathematical functions". Both map an input to an output. What you're describing with P1 and P2 sounds to me like bijectivity, a property some mathematical functions (i.e., functions) have. Since we've already established in the other PR that this function cannot be injective (and thus not bijective) precisely because of
BigInt
, it seems odd to me to want to cling to this.Why is that important? We intentionally barely guarantee anything about randomness. Why guarantee this? Why should a user expect this to be the case, when in tons of other contexts (e.g. hashing) we don't provide stableness across architectures at all?
No, it's just as trivial. Just create two random integers, convert them to
BigInt
, multiply them and use them as the seed. Because we're going through SHA256 later on, the larger input will be compressed into a fixed length and by the pigeonhole principle, we're pretty much guaranteed to have a collision for sufficiently largeBigInt
. SupportingBigInt
just means throwing any sort of injectivity right out of the window, if we intend to then go through SHA.So you mean to tell me that it's more likely that a user runs into an overlap between
0x80
vs.-1
, no matter the interpretation? I think this is a false equivalence; if the input bits are the exact same distribution, both approaches lead to equally likely collisions (again, by pigeonhole). There simply aren't magically more bits; signedness is not special.Not to mention that due to SHA, "by accident" is also not going to work for the approach in the other PR. Are there bitpatterns for signed integers that overlap with the results from unsigned integers? Yes, but this PR does nothing to prevent that, save for pushing an additional
zero(UInt32)
when the input is::Unsigned
(which would be easy to add to the other PR as well, which would still also result in biasing).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I think i now see what you mean. But no, when interpreting julia integers as mathematical integers, I'm not counting the sign bit twice: a
UInt32
needs 33 bits to be encoded as a math integers: 32 bits for the absolute value given by theUInt32
, and one bit for the sign. IOW, inUInt32
, the sign bit is implicit and doesn't require an explicit storage bit to store this information, because the information is in the type itself.make_seed
is nothing more than an encoder for arbitrary-length integers, in particular the set of possible outputs is isomorphic to the set ofBigInt
s. This transformation is useful only insofar as RNGs don't digestBigInt
well, they like to eat arrays of unsigned. When you pass a value of typeUInt
orUInt32
orBigInt
etc intomake_seed
, the input type is erased, only the mathematical value is taken into account. In particular, the sign of the value0x80000001
must be encoded in this array representation, if I want to implement properties P1 and P2.But anyway, I don't actually make much sense of "You're "extracting" two bits of information from a single input bit", for a single bit i can extract at most a single bit of information; the way i encode it doesn't change that; i could encode the sign bit by using 10 bits of storage, these 10 bits would still store one bit of information. And this redundant information wouldn't affect at all the soundess of the
integer |> make_seed |> SHA
pipeline,SHA
doesn't care, it would just give a distinct, random-looking output for distinct inputs, however they are encoded.By "both", are you refering to the examples I gave, i.e.
rand(x)
andhash(x)
?rand(x)
is not mapping an input to an output:For a given input
1
, there isn't any fixed output.No, P1 is
x == y implies f(x) == f(y)
which is a property of math functions (and whichrand
doesn't satisfy), and P2 is injectivityf(x) == f(y) implies x == y
. Bijectivity is injectivity + surjectivity. Here I don't specifically bother about surjection, that's why I don't try to have every possibleVector{UInt32}
be an encoding of an integer.I will repeat for the last time: while
make_seed
is truly injective, of course, in theory, SHA is ridiculously NOT injective, an infinite number of inputs can map to the same output, but for all practical matter, it can be considered injective.It's not crucial, but nice to have and trivial to implement. We have
seed!
for a reason, to reproduce a computation; if i want to reproduce a computation on another machine with a difference architecture, it's just nice to be able to have the stream of random numbers be reproducible.AFAIK,
hash
implements P1, e.g.hash(Int64(-1)) == hash(Int32(-1)) == hash(big(-1))
.hash
goes to great length to ensure that the inputs are considered as math integers rather than only bag of bits in order to implement P1.It can't be trivial if no one on earth is able to provide an instance of two distinct seeds producing the same stream. It's trivial only to an oracle.
I'm talking in terms of
==
. I'm saying: with the other PR ("bag of bits" interpretation) the distinct seedsInt8(-1)
and0xff
will lead to the same streams, and there are many such simple examples. With this PR here ("math interpretation"), it's practically impossible that anyone will ever find two distinct seeds leading to the same streams. But of course, if we had an equality operator for "bag of bits", say=_=
, then the same property would hold for the other PR: ifx =_= y
is false, then the streams will be distinct. But afaik we don't have such an operator in julia. Integers are compared with==
which uses the mathematical interpretation.I don't understand the "save for", encoding the sign bit in the upper bit
seed[end]
is excatly designed to prevent that; there are many ways to encode this bit, you just have to choose one. But I don't understand you concern with the biasing thing, there is no biasing happening here. In this PR, every integer is mapped to a dermined bit field of size 256 (output of SHA), which is random looking and cannot be computationally mapped back to the original integer seed. For whatever input seeds you choose, the distribution of the outputs 256-bitfields is uniform, there is no bias. And for all practical matters, no two integers will ever map to the same 256 bits, which is nice to have.