Discussion: UUIDv7 subsecond precision encoding #24
Adding Draft V01 quote from thayne: https://news.ycombinator.com/item?id=28088213
|
I don't like using 38, 24 and 12 bits for nanos, micros and millis respectively. I think this could 'waste' a few bits. I prefer to use 30, 20 and 10, but that's just my preference. However, fields of 38-24-12 sub-second bits can be very convenient to implement using string operations rather than bitwise operations. It's not efficient to create UUIDs with string operations, but it works well, at least in Python. At the end of this answer, there are 3 implementations of UUID7, tackling the problem with a 'character-oriented' approach. We can think of UUIDs as a 3-part string, leaving out version and variant:
The 'character-oriented' layout is this:
These are the character settings for nanosecond, microsecond and millisecond precision, respectively:
Here are the Python implementations of the 3 configurations shown above:

#---------------------------------
# UUID7 with nanosecond precision
#---------------------------------
import uuid
import time
import random
now = time.time_ns()
sec = now // 10**9
subsec = now % 10**9
f1 = f'{sec:09x}'                        # 9 hex chars (36 bits): whole seconds
f2 = f'{subsec:09x}'                     # 9 hex chars (36 bits): nanoseconds within the second
f3 = f'{random.randrange(2**48):012x}'   # 12 hex chars (48 bits): random node
v1 = '7' # version
v2 = random.choice(['8', '9', 'a', 'b']) # variant
id = f1 + f2 + f3            # 30 hex chars
id = id[:12] + v1 + id[12:]  # insert the version character
id = id[:16] + v2 + id[16:]  # insert the variant character
uid = uuid.UUID(id)
print(uid)  # 06113208-6028-78d2-aa42-aae9eba8a7a2

#---------------------------------
# UUID7 with microsecond precision
#---------------------------------
import uuid
import time
import random
now = time.time_ns()
sec = now // 10**9
subsec = now % 10**9 // 10**3 # adjust to micros
f1 = f'{sec:09x}'
f2 = f'{subsec:06x}'
fx = f'{random.randrange(2**12):03x}' # eXtra bits: random
f3 = f'{random.randrange(2**48):012x}' # node bits: random
v1 = '7' # version
v2 = random.choice(['8', '9', 'a', 'b']) # variant
id = f1 + f2 + fx + f3
id = id[:12] + v1 + id[12:]
id = id[:16] + v2 + id[16:]
uid = uuid.UUID(id)
print(uid)  # 06113208-e044-7660-b754-b43ac4b4a721

#---------------------------------
# UUID7 with millisecond precision
#---------------------------------
import uuid
import time
import random
now = time.time_ns()
sec = now // 10**9
subsec = now % 10**9 // 10**6 # adjust to millis
f1 = f'{sec:09x}'
f2 = f'{subsec:03x}'
fx = f'{random.randrange(2**24):06x}' # eXtra bits: random
f3 = f'{random.randrange(2**48):012x}' # node bits: random
v1 = '7' # version
v2 = random.choice(['8', '9', 'a', 'b']) # variant
id = f1 + f2 + fx + f3
id = id[:12] + v1 + id[12:]
id = id[:16] + v2 + id[16:]
uid = uuid.UUID(id)
print(uid)  # 06113209-430f-783b-89b8-68d0adb7fa01
|
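For completeness, here is a sketch of the reverse operation for the nanosecond variant above. It is an editor-added illustration, not part of the original comment, and it assumes the same character layout the snippet uses: 9 hex characters of seconds, 9 of nanoseconds, 12 of random node data, with the version and variant characters inserted at string indexes 12 and 16.

import uuid

def decode_char_oriented(u: uuid.UUID) -> int:
    '''Recover the nanosecond Unix timestamp from the character-oriented layout.'''
    s = u.hex                 # 32 hex characters
    s = s[:16] + s[17:]       # drop the variant character (index 16)
    s = s[:12] + s[13:]       # drop the version character (index 12)
    sec = int(s[:9], 16)      # first 9 characters: whole seconds
    nanos = int(s[9:18], 16)  # next 9 characters: nanoseconds
    return sec * 10**9 + nanos

# e.g. decode_char_oriented(uid) == now, for the nanosecond snippet above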
Sorry, I didn't focus on the main question in my previous answer. I prefer the first case (fraction of a second), although the second (specific number of ms/us/ns) is simpler and more efficient. As I said in a previous issue, encoding and decoding can be done with fixed-point arithmetic. We don't have to deal with floating-point math directly. To encode, multiply the fraction of a second by 2**n, where n is the number of sub-second bits, and round to the nearest integer.
To decode, do the reverse: divide the stored value by 2**n.
That's it. The decoder can extract the exact original fraction, unless the decoder reads more bits than the encoder used. The excellent example given by @bradleypeabody above can be expressed like this in Python:

bits = 12
fraction = 0.321
subsec = round(fraction * 2**bits)

To decode, the reverse:

bits = 12
subsec = 1315
fraction = round(subsec / 2**bits, 3)

I used round(..., 3) because the original fraction has three decimal places. In the other case, which simply copies the number of milliseconds/microseconds/nanoseconds, no conversion of this kind is needed. |
@fabiolimace Great writeups. I agree with all points. However, if the alternative option only ever requires a maximum of 30 bits to encode up to nanoseconds properly, why not fix the total bits ever allocated to it? 12/12/12 is nice thematically and for …
Note: I really like those ASCII tables. I may use that format in draft-02 to help illustrate UUIDv6/7/8 Basic Creation Algorithm examples along with the current ASCII bit layout format and pseudocode steps. |
TLDR; look at the bottom bit layout for the proposal. Yeah, there seems to be a lot of pain caused by how not-aligned the various fields are. A friend of mine has been hacking away on this: http://www.new-uuid.info/ (just put it up and still wip) and we went through the same thing - dealing with raw bits in most any language is painful (Go in this case). Some ideas: 1. One idea might be to try changing the UUID7 spec so it is more byte-aligned - more practical to use and implement. Like taking the ver and var fields and blocking off the entire byte that each of those use and giving some meaning to the extra bits there, but otherwise everything else can be moved into place in whole bits. To make it really usable we'd probably want to change the 36-bit timestamp to either a 32-bit unsigned int, or another option would be to make it 40 bits (5 whole bytes). 2. Another option, crazy idea here, bear with me: What if we ditch the positions of these old fields and instead provide another means of reliably detecting v6/7/8 UUIDs? This whole thing would be way simpler if we didn't have to dance around those two old legacy fields (ver and var). And while their positions are obviously locked into old software, we could probably get away with breaking that for newer values IF there were still a reliable way for software to detect which version was in use. Actually, just looking at this again - https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.1 :
...
That might be our ticket right there.... Per the spec, if we set those three bits to 1s, then we can change whatever we want. I mean, I'm sure we'll break someone's implementation somewhere, but per the spec it's legit - we can move or remove the version field if we set those bits. Okay so.... What if we set those var bits to 1s, and then make the rest of the byte the version field?! By combining these legacy fields, we then only have to work around this one byte:
(this incidentally will provide 5 version bits and allow for 32 UUID versions instead of 16 with the existing 4 bits) If you ask me, this is definitely simpler. From there, the main thing for UUID7 to figure out alignment for is the unix timestamp field. I'm not sure how hard we want to try / how complex we want to make it, in solving that. There are some simple transforms that would solve the problem. E.g. using the number of minutes since the unix epoch instead of seconds fits in 32 bits up to year 10136, and then we could have a sub-minute portion instead of a sub-second portion. But I'm not sure any of this is better than bit shifting or just choosing 40 bits for the integer timestamp. Also, I mean heck, a uint32 gets us to the year 2106 - maybe we should just go with that and assume that if people are still using this standard 85 years from now it will have served its purpose and it's time for another version... If we did that, we'd get something like this:
(Where subsec_seq_rand is a concatenation of the subsecond value, sequence counter and random data, with whatever lengths the implementation wants to choose for each, up to a max length of 88 bits / 11 bytes) I know this would be a big change, but it would definitely be simpler and easier on the development side and to show/explain to people. I'm still thinking this over myself and just throwing this out there. Let me know what y'all think. |
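A minimal sketch of what such a byte-aligned value could look like (Python, added for illustration; the original ASCII layout isn't reproduced above, so placing the merged version/variant byte at octet 8 - the legacy variant position, which keeps the '111' bits where RFC 4122 readers look for them - is an assumption):

import os
import time
import uuid

def byte_aligned_uuid(version=7):
    '''Hypothetical layout: a 4-byte unix timestamp (uint32, good until 2106),
    one byte merging the '111' variant bits with a 5-bit version, and 11 bytes
    of subsec_seq_rand (random here as a stand-in for subsecond + sequence + random).'''
    ts = int(time.time()).to_bytes(4, 'big')
    ver_var = bytes([0b11100000 | (version & 0b11111)])  # '111' + 5-bit version
    filler = os.urandom(11)
    raw = ts + filler[:4] + ver_var + filler[4:]  # octets: 0-3 ts, 4-7 fill, 8 ver/var, 9-15 fill
    return uuid.UUID(bytes=raw)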
Well, @bradleypeabody, I said that it's annoying, but it's not impossible. First, data for everyone: I just finished a couple more tweaks of http://www.new-uuid.info. The website is written in Go and runs my implementation of the UUIDv7 generator. Repo here. On the website, you can literally “dial in” the precision lengths for different parts of the UUID and see online how this generator produces values. At the end of the website, I added a simple playground where you can try converting milliseconds into "Brad's encoded" milliseconds, just so it's real to touch. Now, as for the implementation. This is a very rough draft, although it's working. To simplify what you are talking about (the ver and var fields), you can just subtract the length of those fields from the total length of the UUID (128 - 4 - 2 = 122 bits) and work with a 122-bit structure. Once you're done populating it, you copy it into the final bit array, inserting those fields at their set positions. As for working with bits themselves, it's not that complicated.

// getBit returns the value of a bit at a specified position in a given byte array
func getBit(b []byte, index int) bool {
pos := index / 8
j := uint(index % 8)
j = 7 - j
return (b[pos] & (uint8(1) << j)) != 0
}
// setBit sets the value of a bit at a specified position in []byte
func setBit(b []byte, index int, value bool) {
pos := index / 8
j := uint(index % 8)
j = 7 - j
if value {
b[pos] |= (uint8(1) << j)
} else {
b[pos] &= ^(uint8(1) << j)
}
}

And this is only if you are very concerned about memory consumption (which is rarely the case nowadays anyway); you could use an array of bools instead, but I did it “the right way”. Extracting the timestamp from this version of UUID is elementary from the viewpoint of a processor.

// Time returns the unix epoch stored in the struct, without sub-second precision
func (u UUIDv7) Time() time.Time {
bytes := [16]byte(u)
tmp := toUint64(bytes[0:5])
tmp = tmp >> 4 // We are off by the last 4 bits of the byte here; they are not part of the 36-bit timestamp.
return time.Unix(int64(tmp), 0)
}

You just take the first 5 bytes (40 bits) and shift the resulting value right by 4, leaving the 36-bit timestamp. This is something that any 64-bit processor can do in a click. It takes longer to extract the other data. Currently, I'm using getBit/setBit, but it can be done with logical shifts. Those would be complex from the viewpoint of a programmer, but super-easy for a processor. Pros: it's good that I can adjust the format. For example, if I know that I'm usually generating about 10 values per second, I could set sub-second precision and the counter to 4 bits each and be happy with the random data that goes after. I guess the main question is: "Who would be implementing this system?". |
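To illustrate the "logical shifts" point, here is a sketch (Python, added for illustration) that pulls the other fields out with shifts and masks only; the field positions assume the draft-02 order unixts(36) | subsec_a(12) | ver(4) | subsec_b(12) | var(2) | subsec_seq_node(62):

import uuid

def split_fields(u: uuid.UUID):
    '''Extract the draft-02-style fields using only shifts and masks.'''
    n = u.int                         # the whole UUID as a 128-bit integer
    unixts   = n >> 92                # top 36 bits: whole seconds
    subsec_a = (n >> 80) & 0xFFF      # next 12 bits
    ver      = (n >> 76) & 0xF        # version nibble
    subsec_b = (n >> 64) & 0xFFF      # 12 bits between version and variant
    var      = (n >> 62) & 0x3        # 2 variant bits
    rest     = n & ((1 << 62) - 1)    # remaining 62 bits (subsec_seq_node)
    return unixts, subsec_a, subsec_b, ver, var, rest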
I wouldn't go as far as changing the position of the version bits. It would be better to keep compatibility with libraries implementing RFC 4122. Sure, cutting nanoseconds in three makes it a bit more complicated, but only marginally in my opinion. I also did some experimentation with UUIDv7 in Go: https://github.com/coding-socks/uuiddraft/blob/v7/uuid.go. This implementation already contains the earlier suggestion to only use 30 bits (12, 12, 6) for nanosecond precision. Splitting into the different subsec sections is as simple as the following:

nsec := timestamp % int64(time.Second)
subsecA := (nsec >> 18) & 0xfff
subsecB := (nsec >> 6) & 0xfff
subsecC := nsec & 0x3f

Going line by line: nsec is the nanosecond part of the second (it always fits in 30 bits); subsecA takes the top 12 of those bits, subsecB the middle 12, and subsecC the remaining low 6 bits.
With the current 38 bits (12, 12, 14) nanosecond precision it would be the following:

nsec := timestamp % int64(time.Second)
subsecA := (nsec >> 26) & 0xfff
subsecB := (nsec >> 14) & 0xfff
subsecC := nsec & 0x3fff

Going line by line: the idea is the same, but the split is 12 + 12 + 14 bits; since nsec itself needs only 30 bits, subsecA ends up left-padded with zeros, subsecB takes the next 12 bits, and subsecC the low 14 bits.
It seems I'm one of those people who thinks bit manipulation is easy. 😄 To implement all precisions I'm waiting for Go 1.17 which will add |
@nurked that is very cool, I enjoy the visualizations to help drive the point home! As tedious as the version/variant fields are, I agree with the others that leaving them where they are is the best decision. That being said, Brad's comment about the variant bits got me thinking: why not implement both?
With this in mind we could also set the definition of any UUID Version (UUIDv1/2/3/4/5/6/7/8) + Variant E (111) as a method for signaling an alternative bit layout to any previously defined UUID. With that precedent set, UUIDv6 could be converted to UUIDv1 (0001) + Variant E/F (111) as a method to signal the alternative encoding for UUIDv1 (which is the entire goal of UUIDv6 at the moment). Edit: See #26 for further discussion on UUIDvXε variant usage. |
@kyzer-davis I really like the idea of UUIDv1ε because UUIDv6 in draft 01 is almost the same as UUIDv1 with a different byte order. On the other hand, I don't quite understand the difference between UUIDv6 and UUIDv6ε. Do you mind opening another issue to discuss the possible renaming, so we can focus on subsecond precision in this issue? If UUIDv7 subsecond precision in the case of nanoseconds is changed from 36 to 30 bits, the layout would become the following:
From:
In both cases nsec is cut in 3 pieces, therefore I think it is better to keep it as compact as possible. However, in the case of millisecond and microsecond precision, decreasing nsec does not provide as much benefit because they don't free up a byte for … Also, in the case of 36 bits we could consider a way which would allow implementers to avoid bit shifting.
A simple division and remainder calculation should be enough. Finally, 2 bits for … Edit: The statement about bit shifting being an option is not true. |
Done, #26 open, will edit my original comment. |
What do you think about having a fixed length
This would allow
Mixed precision example:
Order … I'm trying to avoid using a 64-bit integer because some scripting languages use the IEEE-754 double-precision (64-bit) format to represent double and float values. In this case a 53-bit int is the biggest they can represent. This means those languages cannot represent a full 64-bit unix nanosecond value. Meanwhile, a 34-bit int … I assume 48 random bits should be sufficient for millisecond, microsecond and nanosecond precision as well. |
IMO this is too little to avoid collisions. Especially with millisecond timestamp precision.
It only takes a few lines of code to get a 64-bit multiplication result using 32-bit arithmetic. Example |
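The linked example isn't reproduced here, but a minimal sketch of the general technique could look like this (written in Python for readability; a language limited to 32-bit integer results would use its native ints the same way, e.g. unsigned shifts in JavaScript):

def mul32x32(a, b):
    '''Multiply two unsigned 32-bit values into a 64-bit result returned as a
    (hi, lo) pair of 32-bit halves; no intermediate value exceeds 32 bits.'''
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    lo_lo = a_lo * b_lo                               # partial products, each fits in 32 bits
    hi_lo = a_hi * b_lo
    lo_hi = a_lo * b_hi
    hi_hi = a_hi * b_hi
    cross = (lo_lo >> 16) + (hi_lo & 0xFFFF) + lo_hi  # still fits in 32 bits
    lo = ((cross & 0xFFFF) << 16) | (lo_lo & 0xFFFF)
    hi = hi_hi + (hi_lo >> 16) + (cross >> 16)
    return hi, lo

hi, lo = mul32x32(0xFFFFFFFF, 0xFFFFFFFF)
assert (hi << 32) | lo == 0xFFFFFFFF * 0xFFFFFFFF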
Continuing on the idea of
And sequence is allowed to overflow. Overflow could be done with
Take sequence into consideration as well. With the rules from my previous comment it would be 48 bits of random and a 12-bit sequence in the case of millisecond precision. With the rules from this comment it would be 48 bits of random and a sequence with a 64_000_000 maximum value. The latter would allow the same theoretical maximum for each precision. I'm also thinking about how high-resolution time could be used in addition to sequence.
You are correct, I totally forgot you can use two 32-bit ints as a 64-bit int. I will play with this after work. |
I would like to show you how to encode and decode in Python, Java and Javascript using the timestamp with 40 bits and 10 ms precision. IMO a timestamp with 40 bits and 10 ms precision is easy to encode and decode. It also has the best 'score' in the spreadsheet created by @edo1 (#23 (comment)) for a 200-year life span. Also, it is easy to describe the algorithm: just use the scale factor. Also, it's not difficult to encode and decode in Javascript: you have to encode and decode the timestamp and the timestamp extension separately to get around the limitation of the Number type. Here is (yet) another layout/structure I would like to propose:
Python

In Python you use the scale factor 100 * 2**20. Encode to fixed-point:

scale_factor = 100 * 2**20
timestamp = unixts * scale_factor

Decode back to floating-point:

unixts = timestamp / scale_factor

Here is a complete example:

import time
DEC_SCALE_FACTOR = 100
BIN_SCALE_FACTOR = 2**20
SCALE_FACTOR = DEC_SCALE_FACTOR * BIN_SCALE_FACTOR
def encode(unix_time):
'''Encode to FIXED-point number'''
return int(unix_time * SCALE_FACTOR)
def decode(timestamp):
'''Decode to floating-point number'''
return float(timestamp / SCALE_FACTOR)
# get UNIXTS as floating-point
unixts = time.time()
# encode UNIXTS to FIXED-point
timestamp = encode(unixts)
# decode timestamp to floating-point
unixts_decoded = decode(timestamp)
print(unixts)
print(unixts_decoded)

Java

In Java you also use the scale factor 100 * 2^20. Encode to fixed-point:

double scale_factor = 100.0 * Math.pow(2, 20);
double unixts = System.currentTimeMillis() / 1000.0;
long timestamp = (long) (unixts * scale_factor);

Decode back to floating-point:

double unixts_decoded = timestamp / scale_factor;

Here is a complete example:

public final class Uuid7Timestamp40b {
private static final double DEC_SCALE_FACTOR = 100;
private static final double BIN_SCALE_FACTOR = 1 << 20; // 2**20
private static final double SCALE_FACTOR = DEC_SCALE_FACTOR * BIN_SCALE_FACTOR;
public static long encode(final double unixts) {
return (long) (unixts * SCALE_FACTOR);
}
public static double decode(final long timestamp) {
return timestamp / SCALE_FACTOR;
}
public static double time() {
return System.currentTimeMillis() / 1000.0;
}
public static void main(String[] args) {
// get UNIXTS as floating-point
double unixts = time();
// encode UNIXTS to FIXED-point
long timestamp = encode(unixts);
// decode timestamp to floating-point
double unixts_decoded = decode(timestamp);
System.out.println(unixts);
System.out.println(unixts_decoded);
}
}

Javascript

In Javascript you use the decimal scale factor (100) and the binary scale factor (2^20) separately. You can't use the fixed-point representation because of the internal representation of Number in Javascript, but you can represent the timestamp as a hexadecimal string. Encode to hexadecimal:
Decode back to Number:
Here is a complete example:

var DEC_SCALE_FACTOR = 100.0
var BIN_SCALE_FACTOR = Math.pow(2, 20)
function time() {
return (new Date()).getTime() / 1000
}
function int(n) {
return Math.floor(n)
}
// output in hexadecimal
function encode(unixts) {
unixts = unixts * DEC_SCALE_FACTOR
// encode the first 40 bits
timestamp = int(unixts)
hexa = timestamp.toString(16)
padded = ("0000000000"+hexa).slice(-10)
// encode the last 20 bits
timestamp_ext = int((unixts - int(unixts)) * BIN_SCALE_FACTOR)
hexa_ext = timestamp_ext.toString(16)
padded_ext = ("00000"+hexa_ext).slice(-5)
return padded + padded_ext
}
// input in hexadecimal
function decode(timestamp) {
// decode the first 40 bits
hexa = timestamp.slice(0,10)
parsed = parseInt(hexa, 16)
// decode the last 20 bits
hexa_ext = timestamp.slice(10,15)
parsed_ext = parseInt(hexa_ext, 16) / BIN_SCALE_FACTOR
return (parsed + parsed_ext) / DEC_SCALE_FACTOR
}
unixts = time()
timestamp = encode(unixts)
unixts_decoded = decode(timestamp)
console.log(unixts)
console.log(unixts_decoded)
|
What if a 64-bit integer is used for nanosecond time with a …
I created an example in NodeJS: https://gist.github.com/nerg4l/7de56f715486fada84ac4c1f27a17b23. JavaScript does not have a 64-bit int primitive, and when bitwise operations are used it converts the value to a 32-bit int. Therefore, most of the calculations had to be done in interesting ways. I'm not sure if this is the correct way forward. It looks like a unix-based UUIDv6 with 64-bit time. In a lot of cases the main pain point is still the unfortunate position of … Lastly, I was looking into maximum time resolutions on different operating systems: https://en.wikipedia.org/wiki/System_time. For example, according to the linked wiki the best Windows can do is 100 ns, but macOS/OSX can do 1 ns with … |
This was dropped from UUIDv7 in Draft 03, although I have re-worked some of the lessons learned into the "5.1. Timestamp Granularity > Sub-second Precision and Accuracy" bullet. |
Question: fraction-of-a-second representation, or a specific number of milliseconds/microseconds/nanoseconds, for subsecond precision encoding within UUIDv7's subsec_a, subsec_b, and subsec_seq_node?

Current draft-02 implementation as per @bradleypeabody (who wrote the UUIDv7 section):
Sources from email between Brad and me on the topic:
The alternative is a specific number of milliseconds/microseconds/nanoseconds as an integer.
Sources: https://en.wikipedia.org/wiki/Orders_of_magnitude_(time)
More Discussion on the alternative:
If the alternative were utilized in Draft 02, we would encode as follows (see the sketch after this list):
- Millisecond precision: subsec_a is fully dedicated to millisecond information, with a pad of two bits at the start since we only need 10. It is easier to do this padding since the Version bits come up right after.
- Microsecond precision: subsec_a and subsec_b are utilized for microsecond information, with 4 bits padded at the start since we only require 20. It is easier to pad here because the Variant bits are coming up.
- Nanosecond precision: subsec_a, subsec_b, and 6 bits of subsec_seq_node are used for nanosecond information. The final 48 bits of subsec_seq_node are used for random/entropy/node data. No padding is required for this scenario.
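A minimal sketch of this alternative encoding (Python, added for illustration; it only computes the sub-second field values, and the zero placeholders stand in for whatever sequence/random data a real generator would put in the unused bits):

import time

def split_subsec(ns_in_second, precision):
    '''Split the sub-second part (0..999_999_999 ns) into
    (subsec_a, subsec_b, top 6 bits of subsec_seq_node).'''
    if precision == 'ms':
        ms = ns_in_second // 1_000_000   # needs 10 bits; 2 zero pad bits at the top of subsec_a
        return ms, 0, 0
    if precision == 'us':
        us = ns_in_second // 1_000       # needs 20 bits; 4 zero pad bits at the top of subsec_a
        return (us >> 12) & 0xFFF, us & 0xFFF, 0
    if precision == 'ns':
        ns = ns_in_second                # needs 30 bits; no padding
        return (ns >> 18) & 0xFFF, (ns >> 6) & 0xFFF, ns & 0x3F
    raise ValueError(precision)

subsec_a, subsec_b, seq_node_top = split_subsec(time.time_ns() % 10**9, 'ns')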