Calculating SAIDs with Blake3-256 - nuance #58
The way we calculate the digest is also CESR-specific. Let's go step by step over an example (given in JavaScript):
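As a rough orientation for the steps discussed below, here is a minimal TypeScript sketch of the process, assuming Node.js, the @noble/hashes library (mentioned further down in this thread), and a hypothetical top-level `said` field. This is an illustration of the technique, not HCF's implementation:

```typescript
import { blake3 } from '@noble/hashes/blake3';

function saidify(obj: Record<string, unknown>, field = 'said'): string {
  // 1. Replace the SAID field with a '#' placeholder the same length as the
  //    final SAID (44 characters for a 32-byte digest).
  const prepared = { ...obj, [field]: '#'.repeat(44) };
  // 2. Serialize and take the raw 32-byte Blake3-256 digest.
  const raw = blake3(JSON.stringify(prepared));
  // 3. Prepend one zero byte: 33 bytes = 264 bits, a multiple of 24.
  const padded = new Uint8Array(33);
  padded.set(raw, 1);
  // 4. Base64URLSafe-encode; the leading zero bits encode as 'A'.
  const b64 = Buffer.from(padded).toString('base64url');
  // 5. Swap the leading pad character for the derivation code
  //    ('E' for Blake3-256 in the CESR master code table).
  return 'E' + b64.slice(1);
}
```

The producer then writes the returned value back into the `said` field of the original JSON.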
These are great details that I think should be included in the OCA specification. Could we discuss using something without the additional CESR complexity, just a raw hashing algorithm? The benefits and trade-offs should be made clear, and this would be an excellent topic for the DSWG. What if someone wants to use a different hashing algorithm? Will they be required to follow similar instructions, and will those instructions be documented? Or should we document only Blake3-256 and note that the OCA specification is prepared for other hashing algorithms in the future, which can be added to the OCA spec as needed without requiring a new version of the spec for each addition?
Given that there is no need for using CESR or worrying about the length of hashes falling on bit boundaries, why is the OCA spec following that CESR standard? Since the Blake3 hashing algorithm was not accepted by NIST and there are far fewer libraries for it in various languages compared to SHA hashing, why use it? That said, if this all gets documented cleanly in the OCA spec so that an independent implementation can be defined, all good. Documented means the permitted hashing algorithm(s), how the hashing algorithm can be detected (to allow for future evolution of the spec), steps to hash, and steps to verify. I infer from the first line that the canonicalization approach is as follows and also needs to be in the spec:
I would recommend (although presumably it is too late) to:
Please treat this discussion as a continuation of the earlier one. @swcurran, it is perfectly fine if you want to use SHA instead of BLAKE. Use the SHA hashing algorithm and prepend the digest with the proper letter. For the record:
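For readers following along, these are the one-character codes for 32-byte digests as I read the CESR master code table ('E' and 'H' are confirmed later in this thread; the others are worth verifying against the table itself):

```typescript
const DIGEST_CODES: Record<string, string> = {
  E: 'Blake3-256',
  F: 'Blake2b-256',
  G: 'Blake2s-256',
  H: 'SHA3-256',
  I: 'SHA2-256',
};
```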
So the expectation is that any OCA parser MUST support all of the “permitted” hashing algorithms, and that is the list? I suspect that means a lot of extra dependencies. If that is what the spec says, that is OK, although not ideal. My suggestion (and what we do in the DID TDW spec) is to keep the prepending flag (which is what the “multihash” standard defines, but with a much larger list of hash algorithms), but to specify that OCA Bundle creators MUST use ideally just one hash algorithm (preferably sha-256), or at most an option or two. That reduces the number of dependencies an implementation needs to include, reducing the dependabot updates needed. A later version of the spec might add newer, better algorithms and deprecate older ones. And all of this needs to be added to the spec.
In the OCA spec, we say that the OCA Object identifier is a SAID. What will come as a refinement (clarification) in v1.1 of the spec is that all objects are SADs (self-addressing data). SAID presumes specific algorithms that are sufficient for current needs. SAID doesn't support all the algorithms that Multihash does because SAID is pragmatic: if it's not needed, why bother? This "not needed" argument is significant because we don't have to support old or legacy systems. We started greenfield to benefit from what is available now or in the future (SAID will get the proper letters then). As discussed in #59, we made design choices and onboarded SAID. We cannot narrow the list of hashing algorithms to just one, for the same reason git is now stuck with SHA1 after picking a single algorithm in the past: we want to be forward-compatible. We are specifically interested in first and second pre-image resistance for any hashing algorithm in OCA, and historically, MD5 and SHA1 were found to lack these properties. SAID solves not only this, but also specifies how to make the identifier self-referential (embeddable in the document it refers to).
I love new ways and greenfield, but I am looking for a defensible, pragmatic, and clear benefit of using SAIDs vs. multihash and the JSON canonicalization method. For the calculation of the hashing, I had indeed gone through Sam Smith's documentation (which is very much in flux, since it has changed standards-body hosting at least once). Even after going through that method I still apparently hadn't dug deep enough, because E is not just the Blake3-256 digest; it apparently has a bunch of additional processing. I don't think it is a good idea to point to Sam's most recent documentation as the source: it is confusing, a work in progress, and any user would really have to dig. How does OCA benefit from using CESR? Can this be articulated? Could we write out a detailed, non-normative description of exactly how you calculate SAIDs for two hashing examples, noting that the official method is Sam Smith's documentation? Currently the weakest part of the specification is canonicalization and digests: OCA is all about cryptographic reproducibility, yet it is very hard to verify (except by depending on a few experts to do it for you), and rather than going a standard route that can be easily reproduced by others, we went the less documented, less used, and more difficult to figure out CESR route without (yet) being able to articulate its clear benefits. Could we be ready for CESR and use it in v3 or v4 of the OCA spec? I'm not against CESR at all; I am suggesting a different timing.
@carlyh-micb, please use the #59 discussion to continue discussing the design choices, i.e., the CESR-related ones. We can answer the questions there. CESR looks complex because it's a clever protocol. SAID isn't technically CESR but uses CESR concepts, i.e., the code table and the encoding process described in the 2nd post above.
We seem to be talking past each other @blelump re: how to declare what hash algorithms are allowed and how to detect which one was actually used. My points:
Writing this, I realize that there should be another requirement: each OCA document should have to declare what version of the OCA spec was used in creating the document. That will ensure that both evolution of the spec and backwards compatibility can be managed in real deployments. However, the most important point: whatever you decide to do with hashing and SAD generation, please put it in the spec. We can’t do anything until that is done.
Stephen, I posted something in the discussions earlier, not sure if you saw it. I will note that you can tell what version you are in because each overlay is versioned: "type": "spec/capture_base/1.0" is an example of a valid overlay type (v1.0). It looks like you could even have a schema with a mix of versions of overlays. I don't think there is a version for the entire schema bundle, which makes sense if it supports versions at the level of the overlay.
After discussing this at ADC: is the version number here the version number for the overlay (according to the spec), or is it the version of the content of the overlay, which the user can change?!
There's no reason to keep each OCA Object versioned separately, as it is now defined in the spec. The serialization scheme ensures the …
To clarify things I will respond to each of @blelump's and @swcurran's comments inline. While you must understand a bare minimum of CESR-style encoding of a SAID to maintain interoperability with CESR SAIDs, you don't have to know or implement all of CESR in order to have a CESR-compliant SAID. It is entirely possible to create non-CESR-compliant SAIDs as well, if you want to completely omit any CESR-related concepts; to get many of the same benefits you would need to come up with your own scheme for signifying specific types of digests and pad character sizes. You can see my short SAID implementation called saidify, about 525 lines of TypeScript, showing a minimal implementation of CESR code tables. Michael's process is correct. I will restate it briefly for clarity.
This can be simplified to:
Stephen:
You are missing the digest, alignment, and derive steps in your process. So Michael is correct:
SAID uses CESR-style self-framing derivation codes as well as the 24-bit alignment strategy to ensure composability of cryptographic primitives in a byte stream. And Stephen, regarding
I understand how it could be seen as annoying to change the length of the placeholder, yet there are a host of important benefits for stream parsers when you make the placeholder the same length as the resulting digest. When you keep the placeholder and the digest the same length, the size of the overall SAIDified data structure remains the same pre- and post-saidification. This fixed size enables incremental, efficient parsing of a byte stream, since you can parse primitives one by one thanks to the TLV encoding scheme, and it also supports pipelining, where you can hand incrementally received cryptographic primitives off to separate CPU cores for efficient parsing.
This should work for anything that is not KERI or insertion ordered. There is a good argument for insertion ordering for some use cases and there are similarly good arguments for alphabetical ordering for other use cases. I personally favor insertion ordering because it allows for developer and user friendly natural field orderings, though some people prefer alphabetical ordering in order to simplify debates and decision making. And @carlyh-micb, regarding your question about using CESR,
You don't have to use all of CESR for SAIDs, though since HCF is building on top of CESR, you can leverage all of its benefits, including cryptographic agility, data efficiency with incremental parsing, pipelining, and a very compact encoding format.
I show how to calculate Blake3, Blake2, SHA3, and SHA2 digests (hashes) with SAIDify:
I have a longer explanation targeting programmers here, detailing the steps to produce a SAID and the supporting concepts of 24-bit boundary alignment. Beware, it's a bit of a read.
In a second review of Michael's process, he is missing a key part of the Base64 encoding process that CESR uses in his Blake3 digest. What CESR does is repurpose the space that would have been taken up by equals-sign ('=') padding characters for derivation codes that allow you to look up the type and length of a cryptographic primitive by reading only the front bytes of an encoded primitive: take the observed type, read the code table entry to get the length (size) of the primitive, and then strip exactly that number of bytes from the incoming byte stream. This is why CESR is a TLV (type, length, value) encoding scheme. The use of a prefixed, TLV, self-framing encoding scheme with 24-bit boundary alignment provides composability, including layered hierarchical composability, which enables really cool things like indexed signatures and group count codes, among other things.
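To make the TLV idea concrete, a tiny sketch (hypothetical names; the 44-character size for one-character-code, 32-byte digests comes from the steps above):

```typescript
// Sizes are qb64 character counts looked up from the code table.
const SIZES: Record<string, number> = { E: 44, H: 44 }; // Blake3-256, SHA3-256

function readPrimitive(stream: string): [primitive: string, rest: string] {
  const code = stream[0];   // T: the type, read from the front of the stream
  const size = SIZES[code]; // L: the length, from the code table entry
  // V: strip exactly that many characters, leaving the remainder of the stream.
  return [stream.slice(0, size), stream.slice(size)];
}
```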
thanks @kentbull. For your information, the Human Colossus Foundation provides a fully-fledged JS client for calculating SAIDs, exposed here. Both the source of the web page and the library are publicly available. The documentation level needs to be improved, but we always welcome contributions in this area.
Thanks @kentbull for the (pretty) clear description of the algorithm. I’d still like some further clarifications:
About this comment:
My interpretation of the use of JCS is that it is only used when calculating the SAID and so does not impact what the developer does with the JSON. The JSON is created as the developer wants, the SAID is calculated using the JSON as input to a function, and the original JSON is updated with the SAID. That JCS was used in the SAID calculation function does not change the JSON itself. Likewise, a verifier receives the JSON, verifies the SAID in a function, and then continues processing the JSON. The JCS processing is used only in the SAID function. The problem with relying on insertion order is that a verifier could receive the JSON through a process that alters the ordering and so cannot verify the SAID. By using JCS during the SAID calculation, both the producer and the verifier know exactly what they have to do to the JSON before calculating the hash.
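To illustrate the idea (a sketch only: real RFC 8785 canonicalization also pins down number and string serialization, and the `canonicalize` package on npm implements it properly):

```typescript
// Recursively sort object keys so producer and verifier hash the same bytes.
function sortKeysDeep(v: unknown): unknown {
  if (Array.isArray(v)) return v.map(sortKeysDeep);
  if (v !== null && typeof v === 'object')
    return Object.fromEntries(
      Object.keys(v as object).sort().map(k => [k, sortKeysDeep((v as any)[k])])
    );
  return v;
}

// 'doc' is a hypothetical input JSON object; hash the canonical form, not the original.
const canonical = JSON.stringify(sortKeysDeep(doc));
```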
Good questions @swcurran, I wasn't as clear as I could be in my answer. I'll see if I can clean that up here.

**Field Ordering**
I agree, field ordering must be strictly defined somewhere, and enforcing order with JCS is a good way to do that. Your comment on JSON not preserving insertion order is important: JSON does not inherently support any field ordering at all, as it makes no guarantees about field ordering. Any ordering guarantees must be made on top of JSON and enforced by something. Newer versions of JavaScript do preserve field order in JavaScript maps and objects, as well as in JSON serialization and deserialization, yet that's JavaScript, not JSON. So picking an ordering scheme, whether JCS or something else, is essential and should be formally called out as you stated.
Insertion-ordered data structures are the only thing that CESR calls out. If there is intermediary JSON processing of a byte stream that may not preserve the insertion ordering of a data structure, then there must be some specification, whether in a dedicated schema document or another resource such as JCS, that enables reproducible field ordering. CESR does not specify a mechanism for creating this reproducible ordering in the event of a rearrangement of JSON fields. This is likely because the CESR spec doesn't seek to describe intermediate JSON processing and relies on putting bytes in a stream that are ordered correctly, which is functional as long as neither source nor destination reorders fields in transit. As you mentioned, any use case that needs to reprocess JSON and potentially reorder fields must introduce its own field ordering to calculate consistent digests.
That makes sense.
Yes, if the verifier or any intermediary manipulates the order of the JSON document then it will have a different SAID. If you have a use case where you expect this to happen, then maintaining some sort of schema specification or canonical ordering process is the only way to create reproducible digests/SAIDs.

**Aligning on 24-bit boundaries**
Yes and yes. The length of the hash does depend on the algorithm. And the handling of a digest should also be algorithm specific. The "derivation code" in the CESR master code table is a lookup key, essentially an object/class type, indicating which cryptographic digest algorithm to use, the length of the cryptographic digest, and how many pad bytes were added to a digest.
Steps (detailed below):
You asked:
Although the answer is technically yes from a Base64 perspective, the answer is no from a CESR perspective, because CESR changes the padding to be on the left (start). Base64URL encoding includes padding, yet it is on the right (end), because standard Base64 encoding adds pad characters (equals signs, '=') to align on 24-bit boundaries. CESR adds pad bytes on the left rather than the right in order to have self-framing, composable cryptographic primitives. I go into more detail on this below.
No; if a Base64URLSafe encoding already aligns on a 24-bit boundary, then the derivation code for a CESR primitive is prepended like all other CESR derivation codes. Yet for SAIDs the underlying 32-byte digests never align on 24-bit boundaries: they need one more byte, 33 bytes in total, to reach a multiple of 24 bits. For SAIDs that rely on 32-byte digests you don't have to worry about the other cases that CESR covers, and you don't want to, because then you'd have to implement more CESR rules than are absolutely necessary for SAIDs.
The pre-padded 'A', or zero bits, get added directly after creating the raw digest. You create a raw digest, 32 bytes in our case, and then prepend zero bytes, 1 byte in our case, to get to a multiple of 24 bits: 33 bytes, which is 264 bits. So you aren't really pre-padding 'A' characters; you are pre-padding zero bits to get to 264 bits, a multiple of 24. These zero bits end up encoding to Base64URLSafe as 'A' characters (because 'A' is index 0 in the Base64URLSafe character list), which is why you see the pre-padded 'A' characters on the final Base64URLSafe-encoded value. See the steps below for a thorough explanation.
No, the '#' pound-sign characters are not used to pad the digest to align on 24-bit boundaries. The pound-sign characters are used as a placeholder in the target document to ensure the document length is consistent both before and after taking the digest of the document. The pound signs are only a placeholder and are not included as part of the actual Base64URLSafe-encoded digest.
No, as I mentioned above, the pound signs are only used to provide a fixed-width placeholder in the document being "saidified." And it is important to remember that it is zero bits that are prepended to the digest, which happen to translate to 'A' in Base64URLSafe encoding.

**Step 1: Pad zero bytes on the left**

Padding gets added on the front (left side) of a digest, as shown in the examples below. I will detail the steps for you.
Say you have the following object and are using the "said" field for the SAID digest:

```js
f = {
  "said": "############################################",
  "first": "Sue",
  "last": "Smith",
  "role": "Founder"
}
```

The Base64URLSafe encoding of the raw, unpadded bytes of the Blake3-256 digest from above looks like the following. It is 43 Base64URLSafe characters, representing 256 bits (32 bytes):
Yet this is incorrect because it does not align on a 24-bit boundary. To create this alignment you add between one and three zero bytes. Base64 aligns values on 24-bit boundaries as well, using '=' equals signs, as you have seen. Typically Base64 encoding, whether regular Base64 or Base64URLSafe, adds this padding at the end of the string. Yet CESR, rather than letting those equals signs be wasted bytes, repurposes them to store the derivation code, which is why some CESR derivation codes are as small as 1, 2, or 3 Base64URLSafe characters. To align the 32-byte digest from above (43 Base64URLSafe chars), you pad with zero bytes on the left-hand side to reach a multiple of 24 bits. Why 24 bits? To have a clean separation of stored bytes in Base64 characters. You don't want a single Base64 character to hold information for two different adjacent digests or other cryptographic primitives, since this complicates parsing and does not allow simple, round-trippable, lossless encoding and decoding. If you share/overlap bits from two different digests/primitives in one Base64 character, then you end up having to parse and interpret both digests together in order to cleanly separate them. And since you don't know where in a stream such overlaps would occur, you end up having to wait for the whole stream and parse it as a single operation, still counting all the bytes digest by digest, because you don't have clean frames to separate digests/primitives on reliable bounds. If you want more clarification on this, I can talk you through it. The diagram below helps illustrate this.

**Step 2: Base64URLSafe-encode the zero-byte-padded value**

When you pad with zero bits in our example, this results in the following 44 Base64URLSafe characters:
As you see, there is an 'A' character on the front rather than an '=' equals sign on the end. The padding has been moved from the end to the beginning, like the resolution of a well-written dramatic character arc. The reason there is an 'A' at the start of this digest is that the first six bits are all zero, which corresponds to the 'A' character, the zeroth character in Base64URLSafe encoding.

**Step 3: Replace 'A' (zero bit) Base64URLSafe characters with the appropriate derivation code**

Finally, there is one more step to get from this Base64URLSafe-encoded value to a CESR-compatible SAID: replacing the prepended 'A' (zero bit) Base64 characters with the self-framing derivation code. In this case that is the character 'E' for a Blake3-256 digest, and it would similarly be 'H' for a SHA3-256 digest. This results in the following:
See this illustration of what the bytes look like. This example shows 8 pad bits (a whole byte), of which the first six zero pad bits are encoded as the 'A' Base64URLSafe character and the last two zero pad bits are included in the second Base64URLSafe character, 'J'. Replacement of pad bits with derivation codes only happens with 'A' characters, where all the bits are zero.

*Illustration of Base64URLSafe, left-padded encoding*
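A tiny worked example of that bit sharing, assuming a hypothetical digest whose first byte is 0x94 (so its first four bits are 1001, matching the 'J' above):

```typescript
// Pad byte (0x00), first digest byte (0x94), plus a filler byte for alignment.
const padded = Uint8Array.from([0x00, 0x94, 0x00]);
console.log(Buffer.from(padded).toString('base64url')); // "AJQA"
// bytes:        00000000 10010100 00000000
// 6-bit groups: 000000 | 001001 | 010000 | 000000  ->  'A' 'J' 'Q' 'A'
// 'A' holds six zero pad bits; 'J' holds the last two pad bits plus four digest bits.
```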
Thanks, Kent. I’m going to take another pass at what I think belongs in the OCA Specification. Let me know what you think:

**Calculating the SAID Digest in OCA Overlays**

Precondition: The OCA spec defines the permitted hash schemes, for example sha256 and Blake3-256. Balance the list to allow some flexibility vs. simplifying the verifiers.
To verify the SAID, start with the OCA JSON containing the digest item value set to the …
Not quite, though you are really close. One step is missing the padding part and another step is incorrect. I list your steps and then show what needs to be changed.

**Calculation**
Step 2 here prematurely Base64URLSafe-encodes the digest. You don't Base64URLSafe-encode anything until you have pre-padded it with zero bits. For 32-byte SAIDs you don't have to use the calculation described in qb64b, because there will always be 1 pad byte. For longer SAIDs you need to use the calculation specified in the qb64b function. I would be happy to walk you through this. So the steps should be:
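Restated as a sketch (my reconstruction of the corrected ordering, with hypothetical names):

```typescript
// 1. Set the SAID field to a 44-character '#' placeholder.
// 2. Serialize the JSON and compute the raw 32-byte digest.
// 3. Pad FIRST: prepend one zero byte (32 -> 33 bytes, a multiple of 24 bits).
// 4. THEN encode: Base64URLSafe the 33 bytes (44 characters, leading 'A').
// 5. Replace the leading 'A' with the derivation code, e.g. 'E' for Blake3-256.
const padded = new Uint8Array(1 + rawDigest.length); // 'rawDigest' from step 2
padded.set(rawDigest, 1);
const said = 'E' + Buffer.from(padded).toString('base64url').slice(1);
```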
Step 3 here is unnecessary. I can see where you might have got this idea when I talked about how the CESR encoding does not use the '=' equals signs on Base64URLSafe-encoded values and instead pre-pads zero bits, though I will clear this up to eliminate any miscommunication on my part and hopefully address any misunderstanding. There are no trailing pound signs '#' in the Base64URLSafe encoding of the digest. The pound signs are only used as a fixed-width placeholder in the document.

**Verification**

The easy way is to just re-calculate the SAID based on the JSON at the destination and compare the SAIDs. Though if you want to do another Blake3-256 or other digest computation, you need to get back to the original un-padded raw digest bytes. First let's clear up the steps you mentioned.
This is not correct. Removing the first character leaves you with a partial digest, which will not match the digest of the destination JSON because it is missing the zero bits; it would be off by one character, the 'A' character of zero bits. What you want to do is get back to the original raw digest bytes.
This is correct.
The calculation here is incorrect. See the corrected calculation steps above, which pre-pad the correct number of zero bits onto the raw digest bytes prior to encoding, to give you the qb64 value.
There is no need to remove any Base64URLSafe '=' equals sign padding characters from the hash string because the raw bytes that produced the encoded value were already aligned on a 24 bit multiple (boundary). Base64 padding characters only show up if your value is not aligned on a 24 bit boundary, which will never happen in properly padded and encoded SAIDs or other CESR primitives.
When you calculate the … I wrote a lot here, and we've written a few times back and forth. I know that async communication can sometimes be difficult on very precise things like this. Feel free to reach out to me on LinkedIn, ToIP Slack, or the KERI Discord if you want to have a realtime conversation about this.
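To tie the verification discussion together, a minimal sketch that just recomputes and compares, assuming the saidify() sketch from earlier in this thread:

```typescript
// Recompute the SAID over the placeholder form of the document and compare it
// to the claimed value; saidify() internally swaps the field for '#' characters.
function verifySaid(obj: Record<string, unknown>, field = 'said'): boolean {
  return saidify(obj, field) === obj[field];
}
```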
Stephen, a note on dependencies: the @noble/hashes library is a single dependency that includes all of the Blake2b-256, Blake3-256, SHA3-256, and SHA2-256 digest algorithms. When you said "dependency" were you referring to software or conceptual dependencies? If software dependencies, then …
We’re getting close to something that can go in the spec!!! About this:
I intentionally proposed the technique I did so the “extra steps” are done with characters vs. bits, which I thought would be easier. If you look at RFC 4648, the padding characters are added for exactly the same reason in that RFC as in CESR: to get the byte boundary even. By using an RFC 4648 implementation that meets the spec, you get “the right thing” for CESR without having to implement your own steps to figure out the bit padding needed. Just look at what the Base64 padding is (0, 1 or 2 '=' characters). About this:
In my version, I deliberately left off the pre-pending in the verification, so I think we get to the same point: I strip it from the input and don’t pre-pend it in the calculated value; you don’t remove it from the input and include the pre-pending in the calculated value (in the corrections to the later steps that I proposed). So I leave it off in both cases, you include it in both cases, so both would work. I’m OK either way that works :-). Shall we take a shot at getting a PR to the spec done?
I’m thinking more of conceptual. In writing the spec, we don’t want to make assumptions about what libraries, languages, or other constraints a specific instance might have. So we can’t assume that every instance can find a dependency that supports all of the hash algorithms we want to support. We also want to support instances where someone wants to implement the entire thing. We have to support some minimal acceptable algorithm, e.g., nothing that has a weakness, so that’s the low bar. Beyond that, we want to pick the one(s) that have the broadest support. Further, we want to limit those choices so a resolver only has to support the algorithms that OCA Bundle producers are permitted to create. So, the bottom line is: what is the value of requiring support for all 4 algorithms? Are any weak and so should be dropped? If all are the same, why allow all of them? Example: in working on did:tdw, we were going to use sha-256 and sha3-256, and we discovered (surprisingly enough) that there were no generally accepted sha3-256 TypeScript libraries (or so I’m told). And since using sha3-256 didn’t really make the implementation more secure, we figured we would drop the option of using it until it (or a better algorithm) was readily available. In a later version of the spec, we’ll likely add support for another hashing algorithm, but for now, we’ll leave it at sha-256.
I respond to the comments inline below. The TL;DR is: sharing bits in pad bytes forces us to do more work. A visual helps explain this clearly. CESR pads on the front, Base64 pads on the back, and both share bits in the Base64 character adjacent to the padding character. (Four diagrams accompanied this comment: Legend; Example SAIDified JSON; CESR Pre-Padded Encoded Digest Diagram; Naive Base64 Post-Padded Encoded Digest Diagram.) Bit sharing between Base64 characters encoding both bits from pad bytes and bits from raw value bytes forces you to manage padding bits whether using CESR or naive Base64. This sharing of bits occurs in both Base64 and CESR because it is a consequence of the need to align on a 24-bit boundary. When encoding values where padding is necessary, as when the bit count of the raw value is not a multiple of 24, you must share either four or two bits with one of the pad characters. Whether four or two bits depends, as in section 4 of RFC 4648, on the number of pad bytes used. There are two cases: one pad byte or two pad bytes.
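The pad-byte count itself is simple to compute; a sketch (hypothetical helper name):

```typescript
// Bytes of left padding needed to reach a multiple of 3 bytes (24 bits).
function padSize(rawLen: number): number {
  return (3 - (rawLen % 3)) % 3; // 32 -> 1 pad byte, 64 -> 2 pad bytes
}
```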
**How many pad bytes?**

In both Base64 and CESR the count of pad characters tells you how many pad bytes are used, which you can use to reverse-engineer the original digest by stripping this count of bytes from the decoded value, and thus remove any need to depend on a code table for something simple like a SAID.

**Bit sharing forces dealing with pad bytes**

What you have to choose is whether you want to use pre-padding like CESR or post-padding like Base64. What you can't choose is whether or not you have to add and strip pad bytes. A SAID implementation could work with either pre-padding or post-padding, yet it must account for the sharing of bits between pad characters and value characters. CESR uses pre-padding (on the front/start of the raw bytes being encoded); Base64 uses post-padding (on the back/end of the raw bytes being encoded). To get the same digest as a CESR-compatible SAID you must use pre-padding AND you must know the number of pad bytes to extract from the front of the padded raw digest in order to get back to the original digest. The derivation code/lookup character tells you how many pad bytes to remove from the decoded value, not how many pad characters to remove. You can't only remove pad characters, because of shared bits between some pad characters and the raw encoded digest/primitive. If you only add or remove pad characters post-conversion to Base64, then you are missing the pad bits that were encoded as part of the raw bytes because:
Because Base64 encodes only 6 bits of information per character, while raw bytes carry 8 bits, you always end up sharing raw bits with a pad character whenever you encode something that needs padding to align on the 24-bit boundary. CESR puts these pad bits on the front, while Base64 puts them on the back. The images above make this clear.

**Benefits to pre-padding**

The benefits to padding on the front are:
**Type code substitution**

As shown in the diagram above, the CESR-compatible SAID encoding substitutes the resulting pre-padded 'A' characters with a type code, 'H' in this instance, that serves as a lookup value into a parsing rules table indicating the length of the digest. As mentioned above, using prepended type codes both increases human-friendliness and dramatically simplifies parser implementations. This is what is called a self-framing digest, or self-framing cryptographic primitive.

**Itemized Response**
While I see why you would want to make it easier or simpler, the reason why you can't do the padding with characters only is that pad bytes split across multiple characters, and only some of the characters end up being a pure "zero" character, the 'A' character. You will end up with incorrect SAIDs if you only consider the text-based pad characters, post-conversion to Base64, to be all of the padding you have to account for; in the example pictured above the pad bits stretch beyond the six bits in the pad character to two more bits that are shared with the character adjacent to the pure padding character. As mentioned above, there will be either four or two data bits included in the character that holds pad bits adjacent to the pure padding character, whether using Base64- or CESR-style padding.
Due to bit sharing, this is not entirely correct. The padding characters are not all of the padding added, because some of the pad bits (2 or 4, as shown above) are included in the encoded Base64 character adjacent to the padding character. So getting to the bit boundary is not as simple as adding or removing a padding character, because padding is added as bytes, not Base64 characters.
This is incorrect. A RFC4648 compliant implementation of Base64 does post-padding while CESR does pre-padding. You will get different digests if you use what is called "naive Base64." Why is it called "naive?" This gets into the composability property of CESR that I reference rather than repeat here. I am more than happy to talk you through this if you want additional clarification. The gist is that "naive Base64" makes it impossible for a parser to cleanly separate primitives without additional parsing instructions beyond a TLV scheme because there are no clean boundaries of primitives included in the encoding itself. You have to add parsing instructions beyond the TLV rules to understand where the boundaries are in the naive Base64 data in order to properly separate raw values from the parsed stream, and this all has to be done at once due to the lack of boundaries, meaning you can't pipeline the processing of the stream and thus can't utilize all your CPU cores for efficient stream processing.
You are off by two bits because of the bit sharing described above; your digest would be different and would not match the original SAID. If the padding were only limited to adding or removing a character at the front, then what you are saying would work. Yet bit sharing is the important reason why it does not work. Some of the 8 pad bits both Base64 and CESR use in the post-pad/pre-pad are included in the Base64 character adjacent to the pad character(s). So just stripping the pad character(s) off the front, or the back, will leave you with two extra bits in your raw output, which will give you digests that don't match.
Both would not work. If you leave the Base64 pad character off in both cases, then you are ignoring the two additional pad bits that have been encoded into the Base64 character adjacent to the pad character, causing you to compute a different digest, one that would not be CESR compatible because it would not perform pre-padding in the way CESR expects. Doing pre-padding with pad bytes prior to conversion to Base64 is the only way to be CESR compatible. For what it's worth, the default behavior of Base64 encoding is also to pad bytes prior to conversion, just with post-padding. Because pad characters encode only 6 bits of information, you can't pad to a 24-bit boundary correctly by just adding or stripping characters. Padding must be done with bytes prior to conversion to Base64 characters, or you will end up with Base64 doing its own post-padding, which is not SAID compatible.
It seems it would not be desirable to have the spec tied to only one specific cryptographic algorithm; rather, open it up so implementors could use anything that follows a given process. Along these same lines, standardizing on a process plans for expansion and evolution in the specification language, which would be valuable given the expected need to change cryptographic libraries for post-quantum resistance once quantum computers actually take off. So focusing on a general encoding process like SAIDification, rather than a particular algorithm, would be a way to both be specific and retain flexibility so the specification remains valid even when cryptographic algorithms need to change. There's essentially low to no cost to include a list of algorithms in the spec that work with SAIDification. As long as the TLV scheme used allows for clear identification of the type of algorithm, the coding effort to support an additional digest algorithm is minimal. So the bottom line to me is that allowing all four, or even more, has such a low cost and a comparatively high benefit of meeting the needs of diverse applications that it seems like an easy win and low-hanging fruit.
Are you referring to "not generally accepted" by a given body, or to independent security audits? I know the readily available …

**Summary**

With the pre-padding and prepended type-code substitution process outlined above, your digest is aligned on a 24-bit boundary whether using the SHA or Blake families of cryptographic algorithms. SAIDification is about padding on the front to meet the needs of a TLV encoding scheme for human-friendly, textually encoded values and simple parser implementations. If this meets the needs of the OCA spec, then I suggest we work out language that clarifies the use of pre-padding …
I'm not sure if this is documented above (I didn't notice it), so I wanted to include it here since it also influences the SAID calculation (thank you Kent for the help in understanding this). In OCA they use a "v" version string field. This is using the legacy version 1.xx string field format: https://trustoverip.github.io/tswg-cesr-specification/#legacy-version-1xx-string-field-format
The serialization length is calculated for each schema, so the size portion of the version string differs per schema. This means that when calculating a SAID where you use a "v" field:
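For illustration, this is how such a version string breaks down per the linked legacy format (a sketch; the example value is one quoted later in this thread):

```typescript
// "OCAB" "10"  "JSON" "0010eb" "_"
//  proto  v1.0  kind   size     terminator
// Version is two hex digits, one each for major and minor; size is the
// serialized byte length in hex (0x10eb = 4331 bytes here).
const m = 'OCAB10JSON0010eb_'
  .match(/^(\w{4})([0-9a-f]{2})(JSON|CBOR|MGPK)([0-9a-f]{6})_$/);
```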
I collected everything from our conversation here regarding SAIDs into my latest blog post. I also added the HCF demos @blelump suggested to the "Implementations" section at the end of the blog post. If that post leaves anything unclear, please reach out to me and I will further clarify it.
Thanks Kent, I’ll defer to you on the bit-twiddling. If you say it is needed, it is! I checked out the blog and it is great. The section … I'm a bit confused about this.
You seemed to be saying that you disagree with me, and then seem to repeat the point I was making. There is a cost for every consumer to have to support any algorithm that any producer might use. Each producer need only support one, but every consumer has to support all. So I agree that we use an approach that allows for any algorithm (the SAID spec achieves that), but limit the algorithms that can be used in any specific version of the OCA (not SAID) spec to those that add value by their inclusion. So, as long as the OCA spec says “You MUST use algorithm A, B, C, D, E, F and G”, I’m happy. Are you going to do a PR to the V1 OCA spec for the SAID calculation?
@carlyh-micb, are you saying that any OCA file now needs to have a "v" field? I do think that the version of the OCA spec being used by an overlay is needed. I’m not thrilled by the complexity of the version calculation, but I’m guessing it's not that hard to calculate the length of the OCA Bundle. It does force the dev to insert the length into the version string. I’m guessing this is another CESR thing? I find it doesn’t help adoption to have OCA intertwined with CESR where it adds complexity. In this case, other than for CESR, why does the length of the JSON matter? Question: in calculating the JSON length, how are end-of-lines counted: Linux/MacOS (1 character) or Windows style (2)? :-)
@swcurran It doesn't show up in the spec yet; it is in the test site where you can produce a single JSON schema bundle (https://repository.oca.argo.colossi.network/). On "I'm not thrilled by the complexity of the version calculation": I'm curious to know which use cases need this complexity now. Semantic versioning (e.g. 1.1.0) seems to be sufficient. Perhaps any kind of 'wrapper' of the schema could include the size. At ADC we are working on adding our own overlays, which will have SAIDs calculated. We plan to create a Schema Package which contains an HCF-generated OCA schema bundle (with all its complexities) that we generate using an API. This gets combined with the presentation JSON and our own overlays. Our syntax would use only semantic versioning and not this CESR complexity. The idea is that we will be able to incorporate any HCF JSON OCA schema quickly without having to change our own overlay calculations. @swcurran, do you want to chat with us about how we are making our own overlays? This Schema Package we are designing could be standardized and shared.
I was disagreeing with you, though after sleeping on it I changed my mind and agree with you. The spec is more likely to be implemented and well supported if it has as few algorithms as absolutely necessary.
This is a very important point. Thank you for reiterating it.
Yes, this is reasonable. What algorithm do you favor? I've heard people say that SHA2-256 and SHA3-256 are good candidates. Would you pick one of those or something different?
I was not originally planning on it, though I would be happy to. Which section would you like me to add it to? The concepts section?
I agree on versioning. Unless you need and want the benefits that CESR provides for clear use cases then I recommend against using a version field because of the following constraints:
If you don't need those things, don't use them. KISS
Knowing the length of the OCA bundle would only be required for the version string. You don't need to know the length of the entire JSON object, which I assume is the bundle, for the SAID filler characters. The quantity of SAID filler characters is determined only by the number of bytes in the fully padded digest.
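In other words (a sketch, assuming the padSize() helper sketched earlier in this thread):

```typescript
// Placeholder length = Base64 characters needed for the padded digest,
// 4 characters per 3 bytes of the padded raw value.
const placeholderLen = ((32 + padSize(32)) / 3) * 4; // (32 + 1) / 3 * 4 = 44
```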
Yes, the version field and version string are a CESR thing. If what they provide is not needed for the OCA spec, then I recommend they be elided.
To my knowledge you don't have to worry about this if you aren't using a "v" version string field.
I think the explanation of the SAID calculation doesn't belong in the OCA specification, because it already belongs in another specification that is the root of trust for the standard (KERI/CESR). However, I do think it would be valuable to have in helper documentation for OCA, though I'm not sure where that is. If I were told where its home is, I could contribute a clear introduction to the rationale of the architecture; an in-depth description of how SAIDs are calculated (non-normative and referencing CESR) would also belong there. I think the OCA specification can allow anyone to use any hashing algorithm they choose (that is allowed in CESR). However, any ecosystem that uses OCA may want to limit the choices for interoperability. Finally, I really think the version string in OCA should be removed. At ADC we are adding more overlays, and I think that will be very common. We are also adding the presentation overlay. The swagger API that generates the OCA bundle includes dependencies as part of the JSON object; does the version string include those extra dependencies? At this point there is a bunch of stuff often being added to the bundle, so whatever size is specified in the schema bundle is inaccurate with all these additions. Perhaps the time of data transit is the time for a payload size to be calculated.
My preference is that there is no need in OCA to reference either the CESR or SAID specifications, and that by putting the SAID (or "overlay identifier") calculation into the OCA spec, the specification can stand alone. That makes OCA acceptance in circles outside of the KERI world much easier. I think it is reasonable to reference the version of OCA being used by an overlay (although it adds a lot of overhead), and I would encourage not using the CESR version string.
The OCA specification already references outside standards; we do not give the full specification of each hashing algorithm, as those are outside the specification as well. Same with the syntax of JSON, which has its own external specification. I think it is normal and acceptable that many standards exist outside of the OCA specification and are referenced (but I can understand the appeal of a one-stop shop). Where information in an external specification is hard to find or hard to understand (e.g. SAIDs), the OCA specification could provide helper documentation. I agree that the version would be better as Semantic Versioning (e.g. 1.0.0), which many people are familiar with.
Just to check: using the OCA swagger API and some quick schemas I wrote, I tested a schema with a single dependency. The bundle has version: OCAM10JSON00032c_. @kentbull says "CESR requires version fields to be at the front of the JSON payload, which is really good for a TLV scheme, yet would feel unwieldy if you don't need such versioning or don't need to support a TLV streaming parser." These version strings seem to be distributed throughout the payload. My preference is to use Semantic Versioning instead.
Ordered dicts are the way that the Python reference implementation maintains a consistent ordering of each JSON, CBOR, and MessagePack serialization. Such field ordering is not inherently a feature of JSON, yet newer versions of JavaScript, as of ECMAScript 2015 (ES6), and many JSON implementations support insertion ordering. JavaScript objects, class constructors, and JSON.parse and JSON.stringify are explicitly mandated by the ECMAScript specification to preserve insertion ordering of properties. Yet this ordering issue is a separate issue from the version string. If you are not trying to build a parser that supports a speed- and space-optimized type, length, value (TLV) parsing scheme for ordered field maps (like insertion-ordered JSON), then you don't really need the same kind of version string as CESR data types; you are likely just dealing with JSON or CBOR. Then again, I do not know how OCA wants to handle versions of the OCA spec and how that translates to serialization versions. Depending on what kind of parsing rules an OCA JSON parser needs to handle, a version string may be useful. Maybe …
… is sufficient. I do not have the context to know what the needs of a version field are for OCA objects, and so defer to you two.
Also, just checking: in OCA they use "v": "OCAB10JSON0010eb_" in the JSON version of the generated schema. This is using the legacy version 1.xx string field format: https://trustoverip.github.io/tswg-cesr-specification/#legacy-version-1xx-string-field-format
Does this mean, since there are only two version characters, that the highest version allowed would be 9.9? That seems a restrictive choice for OCA. I also noticed that OCA (the swagger API for generated JSON schema bundles) generates different protocol version strings; I've now seen OCAB and OCAM.
Since the digits are hexadecimal, the largest version would be "ff", i.e., version 15.15. Nonetheless, if you want a full semantic version string then you would need to do something different. Does compatibility at the version string level between OCA and ACDC / KERI data structures matter? If not, then you can pick whatever version string semantics you want and ignore the CESR-style version string.
Sorry to do this, but I’ll try one more time on this issue.
The currently expired SAID spec is (in my opinion) overly complicated and does not clearly present the algorithm, hence this long discussion that Kent has beautifully summarized in his blog post. At minimum, the SAID spec should be replaced with what Kent has written. If that spec can't be simplified, then my recommendation is to just put the text into the OCA spec. The algorithm is relatively short and easy, and doesn't need much introduction ("A unique identifier is needed for an overlay, and that identifier MUST be derived from the content of the overlay using this algorithm:"). The CESR spec should not be referenced at all, as it is not needed. If there is insistence on using the CESR … @carlyh-micb, you are absolutely right about referencing other specs: hashing, JCS and the like.
I've been thinking, @swcurran, about your comment. Standards belong in the organizations that are responsible for them. However, standards that are still under development and not officially released as a locked version are at risk of changing, even if it is unlikely. We also acknowledge that the source for this is the CESR standard, and that when they officially release their standard as an official version, OCA switches to the official KERI/CESR/SAID standard, unless the calculation has changed in the meantime, whereupon OCA would need to do a new version release of the OCA specification referencing the newer SAID calculation. If it is still hard to read the standard to find the SAID calculation, the OCA specification can still release non-normative helper documentation for understanding the SAID spec. It would be bad if OCA uses the unreleased KERI/CESR/SAID specification as the official source of SAID calculations and then the calculation gets changed because the SAID spec is still under development. Did I capture the argument?
That’s pretty much it.
@kentbull any chance that Sam Smith would release the SAID specification, as you have written it, as an official versioned release that can be cited as a specification standard and will not change?
It is very unlikely that the SAID calculation will change, because it is part of the bedrock on which everything else in CESR, KERI, and ACDC is built. If you are worried about the SAID calculation changing, you can lay those fears to rest. @carlyh-micb, as far as Sam releasing a spec with the language I used, that seems unlikely, though I will ask what we can do on the next spec call this coming Tuesday. I like the idea of the SAID spec being released separately; it's just one more document to manage, so I could see that being the primary argument against a separate spec. Right now the SAID spec language is included in the CESR spec.
Given the utility of the SAID spec, I'd love to see it released separately. It can be useful in many places, not just CESR. Because of its utility, and the challenge for people to get the details out of the spec, I think it deserves a separate specification. Separation of concerns, right? I've seen SAID specifications on WebOfTrust and IETF, and now ToIP, all places that have hosted this specification as it has developed. Without an official release it can be confusing to identify the actual home of the current draft (when you know you know, but a Google search is unclear).
Yes, so would I. I will bring this up to Sam and see what he says. I'm sure he'd at least give it some consideration. Worst case scenario I'd be happy to draft some language for the OCA spec as a first version. There could be a non-normative reference to the SAID language in the CESR spec for those who want more details. The outline of what the language would include is this:
After bringing this to this morning's ToIP ACDC Task Force meeting my recommendation is twofold:
The rationale and discussion points are these:
To directly address the concern of the SAID spec or process changing, I note that SAIDs will not be changing in the foreseeable future. If they did, then pointing to the CESR spec would be the best, most accurate place to reference for a full implementation. And regarding the complexity or simplicity of the CESR specification and available reference material such as my blog post, I favor having a combination of authoritative, detailed language from the CESR specification along with helpful reference material. @carlyh-micb, regarding the concern about things changing,
The KERI Working Group at ToIP is in the process of locking the specifications to version 0.9. It makes the most sense to me to point to those locked spec versions. During the time where this process is resolving the draft spec can be pointed to by the OCA language. Lastly, to address the issue @swcurran brought up of CESR not being needed,
I changed my mind on this after writing the SAID blog post. CESR should be referenced from the OCA spec for a few good reasons. The short version is that there are enough CESR-specific ideas involved in SAIDs that the economical and supportable option is to reference the CESR spec.
If the OCA spec and implementations are only ever going to use one algorithm, such as SHA3-256, then you could hardcode one algorithm in the spec and leave it that way. A downside to this hardcoding approach is that it does not plan for the future and requires potentially large future spec updates when longer SAIDs are used, such as 64-byte rather than 32-byte SAIDs, or even larger. While I favor starting with one algorithm, SHA3-256, I also favor including the concept of a parse table that provides a way to add additional spec-supported algorithms as growth occurs.
Reading the SAID section in the CESR spec, I would suggest that we put the calculation inline in the OCA spec without an external reference. That section of the spec is largely non-normative, except for one "MUST" that points to another section of the CESR spec (without a link). Worse, the narrative is irrelevant to how the SAID feature is used in OCA: to create a content-based identifier for overlays so that we are certain we have consistency between references to the overlay and the overlay content. The idea of having a non-normative reference to the details of how a required calculation is done seems extremely odd and unhelpful. Since the KERI and OCA communities are largely distinct, and the overlap is this one calculation, I don't see value in the 6 stated reasons, only confusion for OCA developers/users ("What is this big other spec, and why can't I just see the calculation?"). The reference to the very complex CESR spec for such a small feature is daunting. If the CESR spec changes, it would be better that the OCA spec NOT be impacted, as that would independently break interop. For all of those reasons, I think we put the algorithm into the OCA spec with a non-normative acknowledgment of the CESR source. Regarding the single hashing algorithm: as long as we state that there is a path to adding more algorithms in future versions (2.x, 3.x) of the spec, and we have a SAID algorithm capable of adding new versions (which we do), I think it is safe and helpful.
This is tough. I go back and forth on whether the language should be referenced or included as a subset. As much as I like the argument for including a subset of the language in the OCA spec, I also know that the best-maintained language on the CESR-style SAID encoding is in the CESR spec. If the OCA maintainers are okay with manually keeping the OCA spec up to date with the CESR-style encoding, which will never change at this point, then including a subset of the language is low risk. Then again, why use a CESR-style encoding at all for content-addressable identifiers? If you just want one type of hashing algorithm, then you don't need any type codes. If you do want type codes, then you don't have to use CESR-style encodings; you could just prefix a type code on a Base64 value without worrying about any other parse table rules like CESR does. If the goal is just to have content-addressable identifiers, then why involve CESR-style SAIDs? There would need to be other use cases where KERI or CESR are involved to justify using CESR-style SAIDs; if the HCF has those other use cases, then I could understand sticking with them. Then again, maybe the lowest-cost, lowest-friction way forward is to just include CESR-style SAID language in the OCA spec. Forgive me for not having a more educated opinion on this to start; this is the first spec discussion I've been a part of. Thanks for bearing with my back and forth. As I have thought through the arguments on each side and discovered new thinking and information, I have changed my position a few times. If you want help drafting the language to place in the OCA spec, I'm more than happy to help. I don't have a dog in this fight, yet I want all dogs to be healthy, if that makes sense.
We only want one or two hashing algorithms today, but a future OCA spec might want more; hence the need for identifying the algorithm used in a given hash instance. I would personally rather use multihash, as I think (but don’t know) it is more likely to be accepted by a broader community. Thanks for your help in this, especially in getting the algorithm clearly defined. That is a huge help. I do have a dog in the fight in that I want OCA to be accepted broadly, as we are using it and getting great benefit from it. Complexity without benefit is a barrier to acceptance of any spec, and we need to reduce complexity.
@kentbull, this has come up again, as we are finally working on getting the SAID calculation into the OCA spec. Yay! I want to point a developer to your blog post to see if they can generate a compatible hash using just that info. But I see what I think is a big omission in your algorithm; not sure how I missed it before. In your “7 steps”, you either leave off the part in step 3 of canonicalizing the JSON before hashing, or (I guess) assume that the input JSON is in canonicalized form. I think there needs to be an explicit reference to generating the JCS from the input JSON before calculating the hash (…). Thoughts?
One important note: I realize, when I tried to use an online Blake3-256 hashing function, that there are actually multiple options when hashing, and I couldn't get the exact same digest; some characters were mismatched (see the picture below). There is obviously some nuance with regard to hashing algorithms, and it would be good to document this in the OCA specification.
https://emn178.github.io/online-tools/blake3/