Add brief explainer of how unlikely collisions are #15
Found some useful stats and equations at https://en.wikipedia.org/w/index.php?title=Universally_unique_identifier&oldid=755882275#Random_UUID_probability_of_duplicates

Using the approximation from that article: P(n) ≈ n² / (2 × 2^x). For x = 122 (v4 UUIDs have 122 bits of entropy) and n = 9.46×10^15 (1M UUIDs/second × 300 years, per the Betable example), P(n) ≈ 8.4×10^-6 ... or roughly an 8.4-in-a-million chance of collision.
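A quick sketch of that calculation in plain JavaScript (the 300-year rate is the Betable example mentioned above; nothing here is code from the uuid module itself):

```js
// Approximation from the Wikipedia article: P(n) ≈ n^2 / (2 * 2^x)
// x = number of random bits, n = number of UUIDs generated.
const x = 122;                              // bits of entropy in a v4 UUID
const n = 1e6 * 60 * 60 * 24 * 365 * 300;   // 1M UUIDs/second for 300 years ≈ 9.46e15
const p = (n * n) / (2 * 2 ** x);

console.log(n.toExponential(2)); // 9.46e+15
console.log(p.toExponential(1)); // 8.4e-6
```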
@broofa neat 😃 For me, the 100-year timeline (as used in the Wikipedia article) seems less arbitrary than 300 years: what are the odds of a collision within the upper bound of a human lifespan?
I think this was mostly resolved with #17. However, during a discussion in microsoft/azure-pipelines-tasks#11021 (comment) I realized that it might be worthwhile to actually add a comparison of how unlikely a collision in v1 is compared to v4.
That would be an interesting comparison, but it's complicated by the fact that the [theoretical] odds of collision for v1 ids depend on some rather churlish aspects of the RFC.

For version 4, collision probability is pretty easy. The theoretical probability of two UUIDs colliding, Pc, is:

Pc = 1 / 2^(# of bits of entropy)

I.e., for v4, where there are 122 bits of entropy, Pc ≈ 1.8e-37.

For version 1, however:

Pc = Pt × Pn

Where:

- Pt: probability the IDs are generated in the same 100-nanosecond time interval
- Pn: probability the IDs share the same node and clock-sequence values

Unfortunately, neither Pt nor Pn is particularly easy to quantify.

For Pt, the RFC does require that UUID services generate an error if they expect to produce more than one ID in a given 100-nsec time interval, but realistically that's only enforceable within a single thread/process on a single host. Enforcing it across multiple processes / hosts requires non-trivial architectures that run counter to the main thesis of the UUID spec: "One of the main reasons for using UUIDs is that no centralized authority is required to administer them". So, in practice, Pt is non-zero. For example, in our README scenario where IDs are generated at a rate of 1M/second, Pt = 0.1 ... which ain't great, and causes us to turn the spotlight on Pn, the probability the IDs share the same node and clock-sequence values.

Unfortunately, the mechanism for generating those fields doesn't guarantee uniqueness either. Fortunately, the RFC provides an escape hatch: the node field may be set to a random value when a MAC address isn't available, and the clock sequence is supposed to be seeded with a random value as well.

If we extrapolate this onto our README scenario (1M UUIDs/second), with an expected Pc = 3.6e-17 (0.1 × 3.6e-16) for v1 UUIDs, we find that instead of it taking 327 years to get to a 1-in-a-million chance of collision, it only takes about 30 seconds. :-/

... And what we do with that knowledge I don't know. In hindsight, I don't really know if this is a useful comparison (and, btw, someone should probably check my logic and #'s), but this is about all the energy I have for thinking about this at the moment.
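For what it's worth, here's the arithmetic from that comment as a small JavaScript sketch. The Pn value is simply the figure quoted above (not derived here), and these numbers get revised further down the thread:

```js
// v1 collision probability as sketched above: Pc = Pt * Pn
const rate = 1e6;            // UUIDs per second (README scenario)
const tick = 100e-9;         // v1 timestamp resolution: 100 nanoseconds
const Pt = rate * tick;      // ≈ 0.1 — expected IDs per tick, used as the odds of sharing a tick
const Pn = 3.6e-16;          // odds of matching node + clock sequence (figure quoted above)
const Pc = Pt * Pn;          // ≈ 3.6e-17

console.log(Pt, Pc.toExponential(1)); // 0.1 '3.6e-17'
```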
Thanks for this excellent reasoning, @broofa! I get a result that's off by one decimal for the random node case.
So, looking at the current README again, I'm wondering whether we're off by 10 there as well. Generating 1M UUIDs/second for 327 years yields a collision probability, per the approximation above, of:

P(n) ≈ (10^6 × 60 × 60 × 24 × 365 × 327)² / (2 × 2^122) ≈ 1.0 × 10^-5
Which would be 10 in a million, not 1 in a million.
Could you briefly explain how you arrive at the 30-second figure? I end up with something closer to 75 msec.

Where does all that take us? Well, all I want to do is counter the idea that v1 UUIDs are somehow safer against collisions than v4 just because there's a clock in the mix. I'll propose a new FAQ item once we have our numbers straight.
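To sanity-check the README figure in code (a quick sketch using the same Wikipedia approximation as above):

```js
const n = 1e6 * 60 * 60 * 24 * 365 * 327;          // 1M UUIDs/second for 327 years
const p = (n * n) / (2 * 2 ** 122);

console.log(p.toExponential(1));                   // 1.0e-5
console.log((p * 1e6).toFixed(0), 'in a million'); // "10 in a million"
```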
Yeah, apparently I screwed up all my math. Not sure what happened. 'Guess I was more tired than I thought. 😦
Right you are.
Yup, my bad. In an attempt to not screw this up a second time, I've put together a birthday problem repl.it for us to reason with. It uses the Taylor series approximation for the birthday problem. It shows that P = .000001 after 104 years for v4 @ 1M UUIDs/second.
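For reference, a minimal sketch of that approximation (plain JavaScript, not the actual repl code):

```js
// Taylor-series approximation of the birthday problem:
//   P(collision) ≈ 1 - exp(-n^2 / (2 * N))
// where N is the number of possible values (2^122 for v4 UUIDs).
const N = 2 ** 122;
const perYear = 1e6 * 60 * 60 * 24 * 365;  // 1M UUIDs/second for one year
const n = perYear * 104;                   // 104 years' worth of UUIDs
const P = 1 - Math.exp(-(n * n) / (2 * N));

console.log(P.toExponential(1)); // ≈ 1.0e-6
```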
First, yes, 75 msecs is not 30 seconds. 'Not sure how I got that 30-second number. However, the basic rationale is this: since Pc - the odds of two UUIDs colliding - is equal to 1/N for some effective pool size N, you can treat v1 IDs as if they were random draws from a pool of N = 1/Pc values and run them through the same birthday-problem math ... which is what I've done in the repl. Hopefully that's the right approach.

Stepping back, this feels weird. Like... really? If I spin up two independent v1 UUID services and start generating 1M ids/second, I don't actually expect to get duplicates right away. That's just silly. But the reason this feels unintuitive is that the odds of them having duplicate node and clock-sequence values are vanishingly small in the first place.

I think this is why contrasting v4 and v1 collision probabilities is difficult. v4 has this minuscule probability of collision for each and every UUID produced. But for v1, the odds of collision are exactly zero... except for the very rare case when it's a near certainty! As I think I said before, they have very different natures.

Does this make sense? I'm just ruminating out loud here... trying to convince myself that this is a valid comparison, I guess.
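A sketch of that "run it through the same birthday math" step. The Pc constant below reuses the 3.6e-16 figure that appeared earlier in the thread, chosen here because it reproduces the ~75 msec number mentioned in this comment; treat it as a placeholder, since the thread revises these values more than once:

```js
// Treat v1 IDs as random draws from an effective pool of N = 1/Pc values,
// then invert the birthday approximation P ≈ n^2 / (2 * N) to find how many
// IDs it takes to reach a 1-in-a-million collision probability.
const Pc = 3.6e-16;                      // placeholder per-pair collision probability
const N = 1 / Pc;                        // effective pool size ≈ 2.8e15
const n = Math.sqrt(2 * N * 1e-6);       // IDs needed for P ≈ 1e-6

console.log(Math.round(n));              // ≈ 74,536 IDs
console.log((n / 1e6).toFixed(3), 's');  // ≈ 0.075 s (~75 msec) at 1M IDs/second
```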
@broofa I think your reasoning makes sense and I have tried to compress it into a new FAQ item.
In reading TIFU by using Math.random(), I came across a statistic for the approach being used by Betable (the 1M-IDs-per-second-for-300-years example). If we could work out a similar statistic for UUID v4, generated using a CSPRNG, it might be a neat little note to add into the README, as a way of reassuring folks who are concerned about randomness and wish there was a clock in the mix.