Add brief explainer of how unlikely collisions are #15
Found some useful stats and equations at https://en.wikipedia.org/w/index.php?title=Universally_unique_identifier&oldid=755882275#Random_UUID_probability_of_duplicates

Using the approximation from that article: P(n) ≈ n² / (2 × 2^x). For x = 122 (v4 UUIDs have 122 bits of entropy) and n = 9.46×10^15 (1M UUIDs/second × 300 years, per the Betable example), P(n) ≈ 8.4×10^-6 ... or roughly an 8.4-in-a-million chance of collision.
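A quick sketch of that calculation in plain JavaScript (the 300-year rate is the Betable example mentioned above; nothing here is code from the uuid module itself):

```js
// Approximation from the Wikipedia article: P(n) ≈ n^2 / (2 * 2^x)
// x = number of random bits, n = number of UUIDs generated.
const x = 122;                              // bits of entropy in a v4 UUID
const n = 1e6 * 60 * 60 * 24 * 365 * 300;   // 1M UUIDs/second for 300 years ≈ 9.46e15
const p = (n * n) / (2 * 2 ** x);

console.log(n.toExponential(2)); // 9.46e+15
console.log(p.toExponential(1)); // 8.4e-6
```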
@broofa neat 😃 For me, the 100-year timeline (as used in the Wikipedia article) seems less arbitrary than 300 years: what are the odds of a collision within the upper bound of a human lifespan?
I think this was mostly resolved with #17. However, during a discussion in microsoft/azure-pipelines-tasks#11021 (comment) I realized that it might be worthwhile to actually add a comparison of how unlikely a collision in v1 is compared to v4.
That would be an interesting comparison, but it's complicated by the fact that the [theoretical] odds of collision for v1 ids depend on some rather churlish aspects of the RFC.

For version 4, collision probability is pretty easy. The theoretical probability of two UUIDs colliding, Pc, is:

Pc = 1 / 2^(# of bits of entropy)

I.e., for v4, where there are 122 bits of entropy, Pc ≈ 1.8e-37.

For version 1, however:

Pc = Pt × Pn

Where:

- Pt: probability the IDs are generated in the same 100-nanosecond time interval
- Pn: probability the IDs share the same node and clock-sequence values

Unfortunately, neither Pt nor Pn is particularly easy to quantify.

For Pt, the RFC does require that UUID services generate an error if they expect to produce more than one ID in a given 100-nsec time interval, but realistically that's only enforceable within a single thread/process on a single host. Enforcing it across multiple processes / hosts requires non-trivial architectures that run counter to the main thesis of the UUID spec: "One of the main reasons for using UUIDs is that no centralized authority is required to administer them". So, in practice, Pt is non-zero. For example, in our README scenario where IDs are generated at a rate of 1M/second, Pt = 0.1 ... which ain't great, and causes us to turn the spotlight on Pn, the probability the IDs share the same node and clock-sequence values.

Unfortunately, the mechanism for generating those fields doesn't guarantee uniqueness either. Fortunately, the RFC provides an escape hatch: the node field may be set to a random value when a MAC address isn't available, and the clock sequence is supposed to be seeded with a random value as well.

If we extrapolate this onto our README scenario (1M UUIDs/second), with an expected Pc = 3.6e-17 (0.1 × 3.6e-16) for v1 UUIDs, we find that instead of it taking 327 years to get to a 1-in-a-million chance of collision, it only takes about 30 seconds. :-/

... And what we do with that knowledge I don't know. In hindsight, I don't really know if this is a useful comparison (and, btw, someone should probably check my logic and #'s), but this is about all the energy I have for thinking about this at the moment.
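For what it's worth, here's the arithmetic from that comment as a small JavaScript sketch. The Pn value is simply the figure quoted above (not derived here), and these numbers get revised further down the thread:

```js
// v1 collision probability as sketched above: Pc = Pt * Pn
const rate = 1e6;            // UUIDs per second (README scenario)
const tick = 100e-9;         // v1 timestamp resolution: 100 nanoseconds
const Pt = rate * tick;      // ≈ 0.1 — expected IDs per tick, used as the odds of sharing a tick
const Pn = 3.6e-16;          // odds of matching node + clock sequence (figure quoted above)
const Pc = Pt * Pn;          // ≈ 3.6e-17

console.log(Pt, Pc.toExponential(1)); // 0.1 '3.6e-17'
```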
Thanks for this excellent reasoning, @broofa! I get a result that's off by one decimal for the random node case.
So, looking at the current README again, I'm wondering whether we're off by 10 there as well. Generating 1M UUIDs/second for 327 years yields a collision probability, per the approximation above, of:

P(n) ≈ (10^6 × 60 × 60 × 24 × 365 × 327)² / (2 × 2^122) ≈ 1.0 × 10^-5
Which would be 10 in a million, not 1 in a million.
Could you briefly explain how you arrive at the 30-second figure? I end up with something closer to 75 msec.

Where does all that take us? Well, all I want to do is counter the idea that v1 UUIDs are somehow safer against collisions than v4 just because there's a clock in the mix. I'll propose a new FAQ item once we have our numbers straight.
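To sanity-check the README figure in code (a quick sketch using the same Wikipedia approximation as above):

```js
const n = 1e6 * 60 * 60 * 24 * 365 * 327;          // 1M UUIDs/second for 327 years
const p = (n * n) / (2 * 2 ** 122);

console.log(p.toExponential(1));                   // 1.0e-5
console.log((p * 1e6).toFixed(0), 'in a million'); // "10 in a million"
```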
Yeah, apparently I screwed up all my math. Not sure what happened. 'Guess I was more tired than I thought. 😦
Right you are.
Yup, my bad. In an attempt to not screw this up a second time, I've put together a birthday problem repl.it for us to reason with. It uses the Taylor series approximation for the birthday problem. It shows that P = .000001 after 104 years for v4 @ 1M UUIDs/second.
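For reference, a minimal sketch of that approximation (plain JavaScript, not the actual repl code):

```js
// Taylor-series approximation of the birthday problem:
//   P(collision) ≈ 1 - exp(-n^2 / (2 * N))
// where N is the number of possible values (2^122 for v4 UUIDs).
const N = 2 ** 122;
const perYear = 1e6 * 60 * 60 * 24 * 365;  // 1M UUIDs/second for one year
const n = perYear * 104;                   // 104 years' worth of UUIDs
const P = 1 - Math.exp(-(n * n) / (2 * N));

console.log(P.toExponential(1)); // ≈ 1.0e-6
```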
First, yes, 75 msecs is not 30 seconds. 'Not sure how I got that 30-second number. However, the basic rationale is this: since Pc - the odds of two UUIDs colliding - is equal to 1/N for some effective pool size N, you can treat v1 IDs as if they were random draws from a pool of N = 1/Pc values and run them through the same birthday-problem math ... which is what I've done in the repl. Hopefully that's the right approach.

Stepping back, this feels weird. Like... really? If I spin up two independent v1 UUID services and start generating 1M ids/second, I don't actually expect to get duplicates right away. That's just silly. But the reason this feels unintuitive is that the odds of them having duplicate node and clock-sequence values are vanishingly small in the first place.

I think this is why contrasting v4 and v1 collision probabilities is difficult. v4 has this minuscule probability of collision for each and every UUID produced. But for v1, the odds of collision are exactly zero... except for the very rare case when it's a near certainty! As I think I said before, they have very different natures.

Does this make sense? I'm just ruminating out loud here... trying to convince myself that this is a valid comparison, I guess.
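A sketch of that "run it through the same birthday math" step. The Pc constant below reuses the 3.6e-16 figure that appeared earlier in the thread, chosen here because it reproduces the ~75 msec number mentioned in this comment; treat it as a placeholder, since the thread revises these values more than once:

```js
// Treat v1 IDs as random draws from an effective pool of N = 1/Pc values,
// then invert the birthday approximation P ≈ n^2 / (2 * N) to find how many
// IDs it takes to reach a 1-in-a-million collision probability.
const Pc = 3.6e-16;                      // placeholder per-pair collision probability
const N = 1 / Pc;                        // effective pool size ≈ 2.8e15
const n = Math.sqrt(2 * N * 1e-6);       // IDs needed for P ≈ 1e-6

console.log(Math.round(n));              // ≈ 74,536 IDs
console.log((n / 1e6).toFixed(3), 's');  // ≈ 0.075 s (~75 msec) at 1M IDs/second
```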
@broofa I think your reasoning makes sense and I have tried to compress it into a new FAQ item.
In reading TIFU by using Math.random(), I came across a statistic for the approach being used by Betable (the 1M-IDs-per-second-for-300-years example). If we could work out a similar statistic for UUID v4, generated using a CSPRNG, it might be a neat little note to add into the README, as a way of reassuring folks who are concerned about randomness and wish there was a clock in the mix.