
Calculating SAIDs with Blake3-256 - nuance #58

carlyh-micb opened this issue Jul 30, 2024 · 51 comments
@carlyh-micb (Collaborator)

One important note: when I tried to use an online Blake3-256 hashing tool, I realized there are actually multiple options when hashing, and I couldn't get the exact same digest; some characters were mismatched (see the picture below). There is clearly some nuance to how hashing algorithms are applied, and it would be good to document this in the OCA specification.

[image: screenshot of the online Blake3 tool showing a digest that does not match]

https://emn178.github.io/online-tools/blake3/

@blelump (Member) commented Jul 31, 2024

The way we calculate the digest is also CESR-specific. Let's go step by step over an example (given in JavaScript):

  1. Assume the payload is abcd.
  2. await blake3(Buffer.from("abcd", "utf-8")) => jJyYgYBdGoRxAtekLli5kNCI3YioT3MU1xyDgQdXHys=. The same digest is returned from https://emn178.github.io/online-tools/blake3/ for abcd.
  3. In CESR (which SAID also uses for encoding), we're not interested in the raw output but in output whose bit length is divisible by 24. This is to support framing in CESR for all its primitives.
  4. The length of jJyYgYBdGoRxAtekLli5kNCI3YioT3MU1xyDgQdXHys= (after decoding back to bytes) is 32 bytes. 32*8 gives 256 bits; dividing by 24 gives 10.666666666666666, so 11*24 (=264) is the number of bits we need for encoding. All we have to do is prepend the 32 bytes (256 bits) with 1 additional byte (8 bits) in this case.
  5. We then concat the bytes from await blake3(Buffer.from("abcd", "utf-8")) with one byte representing 0: Buffer.concat([Buffer.from([0]), raw])
  6. Final result: AIycmIGAXRqEcQLXpC5YuZDQiN2IqE9zFNccg4EHVx8r. When we replace A with E (the Blake3 code), we get exactly the same output as from https://said.argo.colossi.network/.

@carlyh-micb (Collaborator, author)

These are great details that I think should be included in the OCA specification.

Could we discuss using something without the additional CESR complexity, i.e. just a raw hashing algorithm? I think the benefits and trade-offs should be made clear; this is an excellent topic for the DSWG.

What if someone wants to use a different hashing algorithm? Will they be required to follow similar instructions, and will those instructions be documented? Or do we document only Blake3-256 and note that the OCA specification is prepared for other hashing algorithms in the future, which can be added to the OCA spec as needed without requiring a new version of the spec?

@swcurran

Given that there is no need for using CESR or worrying about the length of hashes falling on bit boundaries, why is the OCA spec following/using the CESR standard?

Since the Blake3 hashing algorithm was not accepted by NIST, and there are far fewer libraries for it in various languages than for SHA hashing, why use it?

That said — if this all gets documented cleanly in the OCA spec so that an independent implementation can be defined, all good. Documented means the permitted hashing algorithm(s), how the hashing algorithm can be detected (to allow for future evolution of the spec), steps to hash, and steps to verify.

I infer from the first line that the canonicalization approach is as follows and needs also to be in the spec:

  1. Start with JSON input.
  2. Place the string “####…” in all places the hash will go (usually, just 1).
  3. Recursively sort the JSON struct alphabetically by item.
  4. Remove all of the extraneous whitespace in the JSON.
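A minimal sketch of these four canonicalization steps (the SAID field name "d" and the 44-character placeholder length are illustrative assumptions, not taken from the OCA spec):

```javascript
// Step 3: recursively sort object keys alphabetically.
function sortKeysDeep(value) {
  if (Array.isArray(value)) return value.map(sortKeysDeep);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.keys(value).sort().map((k) => [k, sortKeysDeep(value[k])])
    );
  }
  return value;
}

// Steps 1-4: insert the placeholder, sort, and serialize compactly
// (JSON.stringify emits no extraneous whitespace).
function canonicalize(obj, saidField = "d") {
  const withPlaceholder = { ...obj, [saidField]: "#".repeat(44) };
  return JSON.stringify(sortKeysDeep(withPlaceholder));
}

console.log(canonicalize({ name: "test", d: "" }));
```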

I would recommend (although presumably it is too late) to:

  1. Remove the need to make the placeholder (“####…”) the same length as the resulting hash, since that is unnecessary, will change if new hash algorithms are permitted, and kind of annoying.
  2. For steps 3 and 4, use JSON Canonicalization Scheme, since that is a standardized version of the two steps.

@blelump (Member) commented Aug 1, 2024

Please treat this discussion as a continuation of the OCA v1 design choices topic. It is discussed here and there, so the open discussion should address at least some of the questions.

@swcurran, it is perfectly fine if you want to use SHA instead of BLAKE. Use the SHA hashing algorithm and prepend the digest with the proper letter. For the record:

Code	Algorithm	Code length	Total length
E	Blake3-256 Digest	1	44
F	Blake2b-256 Digest	1	44
G	Blake2s-256 Digest	1	44
H	SHA3-256 Digest	1	44
I	SHA2-256 Digest	1	44
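Read as a lookup table, this is enough for a parser to identify a digest by its first character. A small sketch (the object shape and function name are illustrative, not from any SAID library):

```javascript
// Digest derivation codes from the table above. "fullSize" is the total
// Base64 length of the encoded digest, including the one-character code.
const DIGEST_CODES = {
  E: { algorithm: "Blake3-256", fullSize: 44 },
  F: { algorithm: "Blake2b-256", fullSize: 44 },
  G: { algorithm: "Blake2s-256", fullSize: 44 },
  H: { algorithm: "SHA3-256", fullSize: 44 },
  I: { algorithm: "SHA2-256", fullSize: 44 },
};

// Identify the algorithm of an encoded digest from its first character.
function identify(encoded) {
  const entry = DIGEST_CODES[encoded[0]];
  if (!entry) throw new Error(`unknown derivation code: ${encoded[0]}`);
  return entry.algorithm;
}

console.log(identify("EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ")); // Blake3-256
```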

@swcurran commented Aug 1, 2024

So the expectation is any OCA parser MUST support all of the “permitted” hashing algorithms, and that is the list? I suspect that means a lot of extra dependencies. If that is what the spec says, that is OK — although not ideal.

My suggestion (and what we do in the DID TDW spec) is to keep the prepending flag (which is what the “multihash” standard defines, but with a much larger list of hash algorithms), but to specify that OCA Bundle creators MUST use ideally just one hash algorithm (preferably SHA-256), or at most an option or two. That reduces the number of dependencies an implementation needs to include, reducing the dependabot updates needed. A later version of the spec might add newer, better algorithms and deprecate older ones.

And all of this needs to be added to the spec.

@blelump (Member) commented Aug 2, 2024

In the OCA spec, we say that the OCA Object identifier is a SAID. What will come as a refinement (clarification) in v1.1 of the spec is that all objects are SADs (self-addressing data). SAID presumes specific algorithms that are sufficient for the current needs. SAID doesn't support all the algorithms Multihash does, because SAID is pragmatic: if it's not needed, why bother? This "not needed" argument is significant because we don't have to support old or legacy systems. We started greenfield to benefit from what is available now or in the future (SAID will get the proper letters then).

As discussed in #59, we made design choices and onboarded SAID. We cannot narrow the list of hashing algorithms to just one, to avoid what happened to git, which settled on SHA1 in the past and is now stuck with it: we want to be forward-compatible. We are specifically interested in first and second pre-image resistance for any hashing algorithm in OCA, and historically MD5 and SHA1 were found to lack these properties. SAID solves not only this, but also specifies how to make the digest self-referential (embeddable in the document it refers to).

@carlyh-micb (Collaborator, author)

I love new ways and greenfield; I am looking for a defensible, pragmatic, and clear benefit of using SAIDs vs. multihash and the JSON canonicalization method.

For the calculation of the hashes, I had indeed gone through Sam Smith's documentation (which is very much in flux, since it has changed standards-body hosting at least once). Even after going through that method I apparently still hadn't dug deep enough, because E is not just the Blake3-256 digest; it apparently involves a bunch of additional processing. I don't think it is a good idea to point to Sam's most recent documentation as the source; it is confusing, a work in progress, and any user would really have to dig.

How does OCA benefit from using CESR? Can this be articulated? Could we write out a detailed, non-normative description of exactly how to calculate SAIDs for two hashing examples, while noting that the official method is Sam Smith's documentation? Currently the weakest part of the specification is canonicalization and digests: OCA is all about cryptographic reproducibility, yet it is very hard to verify (except by depending on a few experts to do it for you). Rather than going a standard route that can be easily reproduced by others, we went the less documented, less used, and more difficult-to-figure-out CESR route without (yet) being able to articulate its clear benefits. Could we be ready for CESR and use it in v3 or v4 of the OCA spec? I'm not against CESR at all; I am suggesting a different timing.

@blelump (Member) commented Aug 2, 2024

@carlyh-micb , please use the #59 discussion to continue discussing the design choices, i.e., these CESR-related. We can answer the questions there.

CESR looks complex because it's a clever protocol. SAID isn't technically CESR but uses CESR concepts, i.e., the code table and the encoding process described in the 2nd post above.

@swcurran commented Aug 2, 2024

We seem to be talking past each other @blelump re: how to declare what hash algorithms are allowed and how to detect which one was actually used. My points:

  • The OCA spec (not the SAID spec) should define what hash algorithms are allowed to be used, and that list should be updated with each version of OCA as it is determined that better algorithms are available. That said, the list in the spec should be as small as possible to reduce the dependencies needed.
  • Multihash and the prefix used by SAID serve the same purpose. I’m just suggesting using multihash because it is more widely used.

Writing this, I realize there should be another requirement: each OCA Document should have to declare what version of the OCA spec was used in creating the document. That will ensure that both evolution of the spec and backwards compatibility can be managed in real deployments.

However — the most important point. Whatever you decide to do with hashing and SAD generation — please put it in the spec. We can’t do anything until that is done.

@carlyh-micb (Collaborator, author)

Stephen, I posted something in the discussions earlier, not sure if you saw it.

I will note that you can tell what version you are in because each overlay is versioned.

"type": "spec/capture_base/1.0",
"type": "spec/overlays/unit/1.0",
"type": "spec/overlays/entry_code/1.0",
"type": "spec/overlays/entry/1.0",

These are examples of valid overlays (v1.0). It looks like you could even have a schema with a mix of versions of overlays.

I don't think there is a version for the entire schema bundle, which makes sense if it supports versions at the level of overlay.

@carlyh-micb (Collaborator, author)

After discussing this at ADC: is the version number here the version number of the overlay format (according to the spec), or is it the version of the content of the overlay, which the user can change?

@blelump (Member) commented Aug 16, 2024


After discussing this at ADC - is the version number here the version number for the overlay (according to the spec) or is it the version of the content of the overlay and the user can change the versions?!!!

There's no reason to keep each OCA Object versioned separately, as it is now defined in the spec. The serialization scheme ensures a v attribute at the beginning that describes the serialized content's (Bundle's) major and minor semantic versions. If any OCA Object changes, the semantic version of the Bundle must also change; otherwise it is unknown whether the parser can parse it. However, how to version the community overlays concept remains an open topic.

@kentbull commented Aug 25, 2024

To clarify things, I will respond to each of @blelump's and @swcurran's comments inline. While you must understand a bare minimum of CESR-style encoding of a SAID to maintain interoperability with CESR SAIDs, you don't have to know or implement all of CESR in order to have a CESR-compliant SAID. It is entirely possible to create non-CESR-compliant SAIDs as well, if you want to completely omit any CESR-related concepts; to get many of the same benefits you would then need to come up with your own scheme for signifying specific types of digests and pad character sizes.

You can see my short SAID implementation, called saidify, of about 525 lines of TypeScript, showing a minimal implementation of the CESR code tables.

Michael's process is correct. I will restate it briefly for clarity.
Michael correctly wrote:

  1. Assume the payload is abcd.
  2. await blake3(Buffer.from("abcd", "utf-8")) => jJyYgYBdGoRxAtekLli5kNCI3YioT3MU1xyDgQdXHys=. The same digest is returned from https://emn178.github.io/online-tools/blake3/ for abcd.
  3. In CESR (that SAID is also using for encoding), we're not interested in raw output but in output that can be divided by 24. This is to support framing in CESR for all its primitives.
  4. The length of jJyYgYBdGoRxAtekLli5kNCI3YioT3MU1xyDgQdXHys= (after decoding back to bytes) is 32 bytes. 32*8, we get 256 bits and then divide by 24: 10.666666666666666. 11*24 (=264) is the number of bits we're looking for to encoding. All we have to do is prepend the 32 bytes (256 bits) with 1 additional byte (8 bits) – in this case.
  5. We then concat the bytes from await blake3(Buffer.from("abcd", "utf-8")) with one byte representing 0: Buffer.concat([Buffer.from([0]), raw])
  6. Final result: AIycmIGAXRqEcQLXpC5YuZDQiN2IqE9zFNccg4EHVx8r . When we replace A with E (the Blake3 code), we get exactly the same output as from https://said.argo.colossi.network/ .

This can be simplified to:

  1. Start with any arbitrary JSON payload that includes an empty attribute for your SAID, say the "d" field for now.
  2. Place the correctly sized pad string "####..." in the SAID field, "d" here. This size is read from the CESR master code table and corresponds to the type of digest you produce, 'E' for our case of Blake3-256. This type of digest results in a 44-character (33-byte) Base64URLSafe encoding: 32 bytes for the digest and 1 byte for the prefixed derivation code.
  3. Create a digest (Blake3-256, Blake2b-256, SHA3, SHA2, etc.) of the raw, serialized JSON bytes, preserving some reproducible field order, whether insertion order or some other canonical order. The digest is typically a 32-byte digest (256 bits). The field order varies between organizations. CESR uses insertion order, though you can use whatever order you want as long as you are consistent.
  4. Align the digest to a 24-bit boundary with pre-padded zero bytes.
    You need to get to a multiple of 24 since that is the least common multiple of the Base64 character bit count (6) and the byte bit count (8).
    Here that means 33 bytes, or 264 bits, meaning 44 Base64URLSafe characters.
  5. Replace any pre-padded "A" (zero bit) Base64 characters with the correct CESR derivation code from the CESR master code table, in this case replacing a single 'A' with 'E' for Blake3-256. SHA3-256 would be 'H'.
    AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
    ↓
    EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
    

Stephen:

  1. Start with JSON input.
  2. Place the string “####…” in all places the hash will go (usually, just 1).
  3. Recursively sort the JSON struct alphabetically by item.
    This sorting is not strictly necessary, though it is compatible with CESR-encoded SAIDs. I know you aren't using KERI, though just FYI since OCA is integrating with KERI: alphabetical ordering will break anything using KERI, since KERI data structures have a specific insertion order that is not alphabetical. This order is specified in section 7 of the ToIP KERI specification.
  4. Remove all of the extraneous whitespace in the JSON.
    Yes, this is implied with JSON serialization. Raw digests should only ever be computed on compact, no-whitespace JSON.

You are missing the digest, alignment, and derive steps in your process.

So Michael is correct:

CESR looks complex because it's a clever protocol. SAID isn't technically CESR but uses CESR concepts, i.e., the code table and the encoding process described in the #58 (comment) above.

SAID uses CESR-style self-framing derivation codes as well as the 24-bit alignment strategy to ensure composability of cryptographic primitives in a byte stream.

And Stephen, regarding

I would recommend (although presumably it is too late) to:

  1. Remove the need to make the placeholder (“####…”) the same length as the resulting hash, since that is unnecessary, will change if new hash algorithms are permitted, and kind of annoying.

I understand how it could be seen as annoying to change the length of the placeholder, yet there are a host of important benefits for stream parsers when you make the placeholder the same length as the resulting digest. Keeping the placeholder and the digest the same length means the size of the overall SAIDified data structure remains the same pre- and post-saidification. This fixed size enables incremental, efficient parsing of a byte stream, since you can parse primitives one by one thanks to the TLV encoding scheme; it also supports pipelining, where you can hand incrementally received cryptographic primitives off to separate CPU cores for efficient parsing.

  2. For steps 3 and 4, use JSON Canonicalization Scheme, since that is a standardized version of the two steps.

This should work for anything that is not KERI or insertion ordered. There is a good argument for insertion ordering for some use cases and there are similarly good arguments for alphabetical ordering for other use cases. I personally favor insertion ordering because it allows for developer and user friendly natural field orderings, though some people prefer alphabetical ordering in order to simplify debates and decision making.

And @carlyh-micb, regarding your question about using CESR,

How does OCA benefit from using CESR? Can this be articulated?

You don't have to use all of CESR for SAIDs, though since HCF is building on top of CESR, you can leverage all of its benefits, including cryptographic agility, data efficiency with incremental parsing, pipelining, and a very compact encoding format.

Could we write out a detailed, non-normative description of exactly how you calculate SAIDs for two hashing examples and that the official method is Sam Smith's documentation?

I show how to calculate Blake3, Blake2, SHA3, and SHA2 digests (hashes) with SAIDify:

  • here how to calculate a Blake3-256 digest
  • here how to calculate a Blake2-256 digest
  • here how to calculate a SHA3-256 digest
  • here how to calculate a SHA2-256 digest

I have a longer explanation targeting programmers here detailing the steps to produce a SAID and the supporting concepts of 24-bit boundary alignment. Beware, it's a bit of a read.

@kentbull commented Aug 26, 2024

Michael's process is correct.

On a second review of Michael's process, he is missing a key part of the Base64 encoding process that CESR uses. His Blake3 digest of jJyYgYBdGoRxAtekLli5kNCI3YioT3MU1xyDgQdXHys= contains an equals sign '=', which would never occur in a valid CESR encoding of a SAID, because prior to encoding a digest for a SAID the value is always pre-padded with zero bytes to align on a 24-bit boundary. Equals signs '=' only occur when a Base64-encoded value is not aligned on a 24-bit boundary, because Base64 encoding is supposed to always produce 24-bit-aligned values, whether CESR or not.

What CESR does is repurpose the space that would have been taken up by equals-sign ('=') padding characters for derivation codes that let you look up the type and length of a cryptographic primitive by reading only the front bytes of an encoded primitive: you take the observed type, read the code table entry to get the length (size) of the primitive, and then strip exactly that number of bytes from the incoming byte stream. This is why CESR is a TLV (type, length, value) encoding scheme.

The use of a prefixed, TLV, self-framing encoding scheme with 24-bit boundary alignments provides composability including layered hierarchical composability which enables really cool things like indexed signatures and group count codes, among other things.

@blelump (Member) commented Aug 26, 2024

Thanks @kentbull. For your information, the Human Colossus Foundation provides a fully fledged JS client for calculating SAIDs, exposed here. Both the source of the web page and the library are publicly available. The documentation level needs to be improved, but we always welcome contributions in this area.

@swcurran

Thanks @kentbull for the (pretty) clear description of the algorithm. I’d still like some further clarifications:

  • Start with any arbitrary JSON payload that includes an empty attribute for your SAID, say the "d" field for now.
  • Place the pad string "####..." in the SAID field, "d" here.
    • It should be specified that the length of the string depends on the selected hashing algorithm (I think that is the “total length” column in the table referenced below).
  • Create a digest (Blake3-256, Blake2b-256, SHA3, SHA2, etc.) of the raw, serialized JSON bytes preserving some reproducible field order whether insertion order or some other canonical order. The digest is typically a 32 byte digest (256 bits). The field order varies between organization. CESR uses insertion order, though you can use whatever order you want as long as you are consistent.
    • AFAIK, since the verifier could be anyone with or without a relationship with the many SAID (OCA Bundle) producers, I don’t think it is enough to not strictly define the canonicalization. Both parties MUST do exactly the same thing, or the digest will not match. I think this needs to be quite formal. CESR may not specify it (although it seems odd to me), but OCA needs to specify exactly what all OCA implementations should do.
  • Align the digest to a 24 bit boundary with pre-padded zero bytes. You need to get to a multiple of 24 since that is the least common multiple between the Base64 character bit count and a byte bit count. Here that means 33 bytes, or 264 bits, meaning 44 Base64URLSafe characters.
    • It's not clear to me what is done here. The length of the hash depends on the algorithm, right? So the handling should likewise. How exactly does the padding get added? Does the result of base64url(hash(input)) have padding if needed? Is it always needed? Where in the process does the pre-padded “A” get inserted? AFAIK, any padding needed (which is not always needed…) is the # at the end. Is the idea to remove any trailing # and correspondingly add the A at the beginning? Is it possible that there is no trailing #? If so, should the A be prepended anyway?
  • Replace any pre-padded "A" (zero bit) Base64 characters with the correct CESR derivation code from the CESR master code table, in this case replacing a single 'A' with 'E' for Blake3-256. SHA3-256 would be 'H'.

AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

@swcurran

About this comment:

For steps 3 and 4, use JSON Canonicalization Scheme, since that is a standardized version of the two steps.

This should work for anything that is not KERI or insertion ordered. There is a good argument for insertion ordering for some use cases and there are similarly good arguments for alphabetical ordering for other use cases. I personally favor insertion ordering because it allows for developer and user friendly natural field orderings, though some people prefer alphabetical ordering in order to simplify debates and decision making.

My interpretation of the use of JCS is that it is only used when calculating the SAID, and so does not impact what the developer does with the JSON. The JSON is created as the developer wants, the SAID is calculated using the JSON as input to a function, and the original JSON is updated with the SAID. That JCS was used in the SAID calculation function does not change the JSON itself. Likewise, a verifier receives the JSON, verifies the SAID in a function, and then continues processing the JSON. The JCS processing is used only in the SAID function.

The problem with relying on insertion order is that a verifier could receive the JSON through a process that alters the ordering, and so could not verify the SAID. By using JCS during the SAID calculation, both the producer and the verifier know exactly what they have to do to the JSON before calculating the hash.

@kentbull commented Aug 28, 2024

Good questions @swcurran, I wasn't as clear as I could be in my answer. I'll see if I can clean that up here.

Field Ordering

AFAIK, since the verifier could be anyone with or without a relationship with the many SAID (OCA Bundle) producers, I don’t think it is enough to not strictly define the canonicalization.

I agree, field ordering must be strictly defined somewhere. Enforcing order with JCS is a good way to do that. Your comment on JSON not preserving insertion order is important. JSON does not inherently support any field ordering at all as it makes no guarantees about field ordering. Any ordering guarantees must be made on top of JSON and enforced by something. Newer versions of JavaScript do preserve field order in JavaScript maps and JavaScript objects as well as JSON serialization and deserialization, yet that's JavaScript, not JSON.

So picking an ordering scheme, whether JCS or something else, is essential and should be formally called out, as you stated.

CESR may not specify it (although it seems odd to me), but OCA needs to specify exactly what all OCA implementations should do.

Insertion-ordered data structures are the only thing that CESR calls out. If there is intermediary JSON processing of a byte stream that may not preserve the insertion ordering of a data structure, then there must be some specification, whether in a dedicated schema document or another resource such as JCS, that enables reproducible field ordering. CESR does not specify a mechanism for creating this reproducible ordering in the event of a rearrangement of JSON fields. This is likely because the CESR spec doesn't seek to describe intermediate JSON processing; it relies on putting correctly ordered bytes in a stream, which works as long as neither source nor destination reorders fields in transit.

As you mentioned, any use case that needs to reprocess JSON and potentially reorder fields must create or introduce its own field ordering to calculate consistent digests.

The JCS processing is used only in the SAID function.

That makes sense.

The problem with relying on insertion order is that a verifier could receive the JSON through a process that alters the ordering and so cannot verify the SAID.

Yes, if the verifier or any intermediary manipulates the order of the JSON document then it will have a different SAID. If you have a use case where you expect this to happen then maintaining some sort of schema specification or canonical ordering process is the only way to create reproducible digests/SAIDs.

Aligning on 24 bit boundaries

Its not clear to me what is done here. The length of the hash depends on the algorithm, right? So the handling should likewise.

Yes and yes. The length of the hash does depend on the algorithm. And the handling of a digest should also be algorithm specific. The "derivation code" in the CESR master code table is a lookup key, essentially an object/class type, indicating which cryptographic digest algorithm to use, the length of the cryptographic digest, and how many pad bytes were added to a digest.

How exactly does the padding get added?

Steps (detailed below):

  1. Pad zero bytes to the left to get to something that creates a multiple of 24 bits.
  2. Base64URLSafe encode the zero byte padded value.
  3. Replace any 'A' (zero bits) Base64URLSafe characters in the encoded value with the correct derivation code from the master code table
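On a concrete 32-byte value (an arbitrary stand-in for a real digest), the three steps look like this; 'E' is used as an example derivation code:

```javascript
// Step 1: prepend one zero byte to reach a multiple of 24 bits.
const raw = Buffer.alloc(32, 0xff);                    // stand-in digest
const padded = Buffer.concat([Buffer.from([0]), raw]); // 33 bytes, 264 bits
// Step 2: Base64URL-encode; the six leading zero bits encode as 'A'.
const encoded = padded.toString("base64url");          // 44 characters
// Step 3: overwrite the pad character with the derivation code.
const said = "E" + encoded.slice(1);
console.log(encoded[0], said[0]); // A E
```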

You asked:

Does the result of base64url(hash(input)) have padding if needed?

Although the answer is technically yes from a Base64 perspective, the answer is no from a CESR perspective, because CESR puts the padding on the left (start). Base64URL encoding includes pad characters, yet they are on the right (end), because standard Base64 encoding adds pad characters (equals signs '=') to align on 24-bit boundaries. CESR adds pad bytes on the left rather than the right in order to have self-framing, composable cryptographic primitives. I go into more detail on this below.

Is it always needed?

No; if a Base64URLSafe encoding already aligns on a 24-bit boundary, then the derivation code for a CESR primitive is prepended like all other CESR derivation codes. Yet for SAIDs the underlying 32-byte digests never align on 24-bit boundaries, because they need one more byte, 33 bytes, to reach a multiple of 24 bits. For SAIDs that rely on 32-byte digests you don't have to worry about the other cases that CESR covers, and you don't want to, because then you'd have to implement more CESR rules than are absolutely necessary for SAIDs.

Where in the process does the pre-padded “A” get inserted?

The pre-padded 'A', or zero bits, get added directly after creating the raw digest. You create a raw digest, 32 bytes in our case, and then prepend zero bytes, 1 byte in our case, to get to a multiple of 24 bits, 33 bytes in our case, which is 264 bits.

So you aren't really pre-padding 'A' characters, you are pre-padding zero bits to get to 264 bits, a multiple of 24 bits. These zero bits end up encoding to Base64URLSafe as 'A' characters (because it is index 0 in the Base64URLSafe character list) which is why you see the pre-padded 'A' characters on the final Base64URLSafe encoded value.

See the steps below for a thorough explanation.

AFAIK — any padding needed (which is not always needed…) is the # at the end.

No, the '#' pound sign characters are not used to pad the digest to align on 24 bit boundaries. The pound sign characters are used as a placeholder in the target document to ensure the document length is consistent both before and after taking the digest of the document. The pound sign characters are only related to being a placeholder and are not included as a part of the actual Base64URLSafe encoded digest.

Is the idea to remove any trailing # and correspondingly add the A at the beginning? Is it possible that there is no trailing #? If so, should the A be pre-pended anyway?

No, as I mentioned above, the pound signs are only used to provide a fixed-width placeholder in the document being "saidified." And it is important to remember that it is zero bits that are prepended to the digests that happen to translate to 'A' in Base64URLSafe encoding.

Step 1 Pad zero bytes on the left

Padding gets added on the front (left side) of a digest as shown in the below examples. I will detail the steps for you.

Primitive Raw Bytes (unencoded)
       |r32:r31:r30:r29:r28:r27:r26:r25:r24:r23:r22:r21:r20:r19:r18:r17:r16:r15:r14:r13:r12:r11:r10:r9:r8:r7:r6:r5:r4:r3:r2:r1:r0|

Pad Byte of zeroes (unencoded)
|p33|

Padded Primitive - Raw Bytes
|p33 + r32:r31:r30:r29:r28:r27:r26:r25:r24:r23:r22:r21:r20:r19:r18:r17:r16:r15:r14:r13:r12:r11:r10:r9:r8:r7:r6:r5:r4:r3:r2:r1:r0|

Say you have the following object and are using the "said" field for the SAID digest:

f = {
"said": "############################################",
"first": "Sue",
"last": "Smith",
"role": "Founder"
}

The Base64URLSafe encoding of the raw, unpadded bytes of the Blake3-256 digest from above looks like the following. It is 43 Base64URLSafe data characters plus one '=' pad character, representing 256 bits (32 bytes):

nKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk=

Yet this is incorrect because it does not align on a 24 bit boundary. To create this alignment you add one or two zero bytes (or none, if the value is already aligned). Base64 aligns values on 24 bit boundaries as well, using '=' equals signs, as you have seen. Typically Base64 encoding, whether regular Base64 or Base64URLSafe, adds these pad characters on the end of a Base64 string. Yet CESR, rather than have those equals signs be wasted bytes, repurposes them to store the derivation code, which is why some CESR derivation codes are as small as 1, 2, or 3 Base64URLSafe characters.

To align the 32 byte digest from above (43 Base64URLSafe chars), you pad with zero bytes on the left hand side to reach a multiple of 24 bits. Why 24 bits? To have a clean separation of stored bytes in Base64 characters. You don't want a single Base64 character to hold information for two adjacent digests or other cryptographic primitives, since this complicates parsing and prevents simple, round-trippable, lossless encoding and decoding.

If you share/overlap bits from two different digests/primitives in one Base64 character then you end up having to parse and interpret both digests together in order to cleanly separate them. And since you don't know where in a stream such overlaps would occur, you end up having to receive the whole stream and parse it as a single operation, counting bytes digest by digest, because there are no clean frames to separate digests/primitives on reliable boundaries. If you want more clarification on this I can talk you through it. The diagram below helps illustrate this.
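The pad-byte count for any raw length follows from mod-3 arithmetic. A sketch (the helper name `pad_size` is hypothetical, mirroring CESR's calculation):

```python
def pad_size(raw_length: int) -> int:
    """Zero pad bytes needed to align raw_length bytes on a 24 bit (3 byte) boundary."""
    return (3 - (raw_length % 3)) % 3

print(pad_size(32))  # 1 -> a 32-byte digest needs one leading zero byte
print(pad_size(33))  # 0 -> already aligned
print(pad_size(64))  # 2 -> a 64-byte digest would need two
```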

Step 2: Base64URLSafe encode the zero byte padded value

When you pad with zero bits in our example this results in the following 44 Base64URLSafe characters:

AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ

As you see there is the 'A' character on the front rather than the '=' equals sign on the end. The padding has been moved from the end to the beginning like the resolution of a well-written dramatic character arc.

The reason there is an 'A' at the start of this digest is because all of the first six bits are zero, which corresponds to the 'A' character, the zeroth character, in Base64URLSafe encoding.

Step 3: Replace 'A' (zero bit) Base64URLSafe characters with appropriate derivation code

Finally, there is one more step to get from this Base64URLSafe encoded value to a CESR compatible SAID. That step is replacing the prepended 'A' (zero bit) Base64 characters with the self-framing derivation code. In this case that is the character 'E' for a Blake3-256 digest, and would similarly be 'H' for a SHA3-256 digest.

This results in the following:

EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
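The substitution is a one-character replacement on the encoded string. A sketch, with the derivation code table abbreviated to the two codes mentioned here (`qualify` is a hypothetical helper name):

```python
import base64

# Abbreviated derivation code table for the algorithms discussed here.
CODES = {"blake3-256": "E", "sha3-256": "H"}

def qualify(padded_digest: bytes, algorithm: str) -> str:
    """Encode a zero-byte pre-padded digest and swap the leading 'A' for the derivation code."""
    encoded = base64.urlsafe_b64encode(padded_digest).decode()
    assert encoded.startswith("A"), "a pre-padded digest always encodes to a leading 'A'"
    return CODES[algorithm] + encoded[1:]

# Round trip the unqualified digest string from the example above.
unqualified = "AJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ"
print(qualify(base64.urlsafe_b64decode(unqualified), "blake3-256"))
# EJymtAC4piy_HkHWRs4JSRv0sb53MZJr8BQ4SMixXIVJ
```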

See this illustration of what the bytes look like. This example shows 8 pad bits (a whole byte) of which the first six pad zero bits are encoded as the 'A' Base64URLSafe character and the last two pad zero bits are included in the second Base64URLSafe character 'J'. Replacement of pad bits with derivation codes only happens with 'A' characters where all the bits are zero.

Illustration of Base64URLSafe, left-padded encoding

P33 = leftmost pad byte of zero bits
R32 = leftmost raw primitive byte
R31 = second from left raw primitive byte
EP44 = leftmost Base64 encoded pad character ('A')
EP43 = second Base64 encoded character, some pad bits (zeroes) from P33, some raw primitive bits from R32
ER42 = first Base64 encoded character that is all raw primitive bits
ER41 = second Base64 encoded character that is all raw primitive bits

byte index   -> |           P33         |           R32         |           R31         | ...
bit index    -> |p7:p6:p5:p4:p3:p2:p1:p0|r7:r6:r5:r4:r3:r2:r1:r0|r7:r6:r5:r4:r3:r2:r1:r0| ...
bit label    -> | ------ PAD BITS ----- | ------- RAW PRIMITIVE BITS OF DIGEST -------- | ...
Raw Bits     -> | 0: 0: 0: 0: 0: 0: 0: 0| 1: 0: 0: 1: 1: 1: 0: 0| 1: 0: 1: 0: 0: 1: 1: 0| ...
Base64 Index -> |       EP44      |       EP43      |       ER42      |       ER41      | ...
Base64 Char  -> |       A         |       J         |       y         |       m         | ...
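The diagram's bit arithmetic can be checked directly. A sketch reproducing the first four encoded characters from the three bytes shown above:

```python
import base64

B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

# P33 = 0x00 (pad byte), R32 = 0x9C, R31 = 0xA6, as pictured above.
padded = bytes([0x00, 0x9C, 0xA6])

chars = [
    B64_ALPHABET[padded[0] >> 2],                                # six pad bits -> 'A'
    B64_ALPHABET[((padded[0] & 0x03) << 4) | (padded[1] >> 4)],  # 2 pad bits + 4 raw bits -> 'J'
    B64_ALPHABET[((padded[1] & 0x0F) << 2) | (padded[2] >> 6)],  # raw bits only -> 'y'
    B64_ALPHABET[padded[2] & 0x3F],                              # raw bits only -> 'm'
]
print("".join(chars))                             # AJym
print(base64.urlsafe_b64encode(padded).decode())  # AJym -- the library agrees
```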

@swcurran
Copy link

swcurran commented Aug 30, 2024

Thanks, Kent. I’m going to try to take another pass at what I think belongs in the OCA Specification. Let me know what you think:

Calculating the SAID Digest In OCA Overlays

Precondition: The OCA spec. defines the permitted hash schemes. For example, sha256 and Blake3-256. Balance the list to allow some flexibility vs. simplifying the verifiers.

  • I recommend that we choose hash algorithms that all yield a 256 bit hash to simplify the processing — assuming they are deemed sufficiently secure. Other hash algorithms added in later versions of the OCA spec could yield different hash lengths, and if so, some details of this process may change.
  1. Take the OCA JSON (an OCA overlay) and replace the value of the item that will hold the digest with 44 # characters. The result is the SAID input JSON.
    1. If the permitted hash algorithms are changed to those that yield a different length hash, the number of characters will change in this step.
  2. Calculate the hash string as base64url(hash(<algorithm>, JCS(SAID input JSON))) following RFC4648 (Base64URL), RFC8785 (JCS) and using one of the permitted hash algorithms.
  3. Remove the trailing padding character # from the resulting hash string.
    1. If the permitted hash algorithms are changed to those that yield a different length hash, there may be more or no padding characters.
  4. Prepend the hash string with the indicated character from the CESR Master Table for the hash algorithm used. For sha256, use an I, and for blake3-256 use an E. This is the digest string.
  5. Set the value of the digest item in the OCA JSON to the digest string.

To verify the SAID, start with the OCA JSON containing the digest item value set to the digest string.

  1. Extract the value of the digest item as the digest string.
  2. Extract the first character of the digest string and use it to determine hash algorithm used by looking in the CESR Master Table.
  3. Remove the first character of the digest string. The result is the input hash string.
  4. Create the digest JSON by replacing the value of the digest item with a string of # characters the same length as the existing digest item value.
  5. Calculate the hash string as base64url(hash(<algorithm>, JCS(SAID input JSON))) following RFC4648 (Base64URL), RFC8785 (JCS) and the hash algorithm determined in the earlier step.
  6. Remove the padding characters (if any) from the hash string.
  7. The resulting hash string MUST equal the input hash.

@kentbull
Copy link

Not quite, though you are really close. One step is missing the padding part and the other step is incorrect. I list your steps and then show what needs to be changed.

Calculation

  1. Calculate the hash string as base64url(hash(, JCS(SAID input JSON))) following RFC4648 (Base64URL), RFC8785 (JCS) and using one of the permitted hash algorithms.
  2. Remove the trailing padding character # from the resulting hash string.
    If the permitted hash algorithms are changed to those that yield a different length hash, there may be more or no padding characters.

Step 2 here prematurely Base64URLSafe encodes the digest. You don't Base64URLSafe encode anything until you have prepadded it with zero bits. For 32 byte SAIDs you don't have to use the calculation described here in qb64b because it will always be 1 pad byte. For longer SAIDs then you need to use the calculation specified in the qb64b function. I would be happy to walk you through this.

So the steps should be:

  1. Calculate the digest (hash string) as digest(<algorithm>, JCS(SAID input JSON))
  2. Calculate the number of zero bytes to be concatenated as padding to the digest (shown in qb64b) - for SAIDs using 32 byte digests this is always 1, though the formula returns either 2, 1, or 0 pad bytes (computed as (3 - (length mod 3)) mod 3).
  3. Prepend the zero pad bytes to the front of the digest: zbits + digest. This is the padded digest bytes.
  4. Encode the concatenation as base64url(padded digest bytes). This is the unqualified digest string.
    • There will be no trailing '=' equals signs (Base64URLSafe pad characters) because the padded digest bytes are already aligned on a 24 bit boundary.
  5. Replace the leading 'A' character (representing encoded zero bits) in the unqualified digest string with the indicated character (derivation code) corresponding to the algorithm used from the master code table. This is the fully qualified digest string.
  6. Replace all of the "#" pound sign characters in the digest field in the OCA JSON with the fully qualified digest string.
  1. Remove the trailing padding character # from the resulting hash string.
    If the permitted hash algorithms are changed to those that yield a different length hash, there may be more or no padding characters.

Step 3 here is unnecessary. I can see where you might have got this idea when I talked about how the CESR encoding does not use the '=' equals signs on Base64URLSafe encoded values and instead pre-pads zero bits, though I will clear this up to eliminate any miscommunication on my part and hopefully address any misunderstanding. There are no trailing pound signs '#' in the Base64URLSafe encoding of the digest. The pound signs are only used as a fixed-width placeholder in the OCA JSON during the SAID digest calculation process. Does that make sense now?
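The corrected calculation steps can be sketched end to end in Python. This uses SHA2-256 (derivation code 'I') since Blake3 is not in the standard library, and approximates RFC 8785 (JCS) with compact, key-sorted `json.dumps`, which coincides with JCS for simple ASCII-only objects; `saidify` is a hypothetical helper name.

```python
import base64
import hashlib
import json

def saidify(obj: dict, said_field: str = "said") -> dict:
    """Compute a CESR-compatible SAID over obj using SHA2-256 (code 'I')."""
    # Fixed-width placeholder: 44 '#' characters, the length of the final SAID.
    obj = {**obj, said_field: "#" * 44}
    # Compact, key-sorted JSON approximates JCS for simple ASCII-only objects.
    serialized = json.dumps(obj, separators=(",", ":"), sort_keys=True).encode()
    # Raw 32-byte digest, pre-padded with one zero byte to 33 bytes (a 24-bit multiple).
    padded = b"\x00" + hashlib.sha256(serialized).digest()
    # 44 Base64URLSafe characters, no '=' padding, leading 'A'.
    encoded = base64.urlsafe_b64encode(padded).decode()
    # Swap the leading 'A' for the SHA2-256 derivation code.
    obj[said_field] = "I" + encoded[1:]
    return obj

doc = saidify({"first": "Sue", "last": "Smith", "role": "Founder"})
print(doc["said"])  # 44 characters, starting with 'I'
```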

Verification

The easy way is to just re-calculate the SAID based on the JSON at the destination and compare the SAIDs, though if you want to do another Blake3-256 or other digest computation then you need to get back to the original un-padded raw digest bytes. First let's clear up the steps you mentioned.

  1. Remove the first character of the digest string. The result is the input hash string.

This is not correct. Removing the first character leaves you with a partial digest which will not match the digest of the destination JSON because it is missing the zero bits. It would be off by one character, the 'A' character of zero bits.

What you want to do is to get back to the unqualified digest string. When you remove the first character of the digest string then you need to replace it with the 'A' character since you did the reverse of this when you created the unqualified digest string.

  1. Create the digest JSON by replacing the value of the digest item with a string of # characters the same length as the existing digest item value.

This is correct.

  1. Calculate the hash string as base64url(hash(, JCS(SAID input JSON))) following RFC4648 (Base64URL), RFC8785 (JCS) and the hash algorithm determined in the earlier step.

The calculation here is incorrect. See the corrected calculation steps above that pre-pad the correct number of zero bits prior to calculating the raw digest bytes to give you the padded digest bytes.

  1. Remove the padding characters (if any) from the hash string.

There is no need to remove any Base64URLSafe '=' equals sign padding characters from the hash string because the raw bytes that produced the encoded value were already aligned on a 24 bit multiple (boundary). Base64 padding characters only show up if your value is not aligned on a 24 bit boundary, which will never happen in properly padded and encoded SAIDs or other CESR primitives.

  1. The resulting hash string MUST equal the input hash.

When you calculate the hash string (unqualified digest string) as noted with my corrected steps from above then the hash will match.
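Verification by recomputation can be sketched the same way (same assumptions as the calculation: SHA2-256 with code 'I', compact key-sorted JSON as a JCS stand-in, and a hypothetical helper name):

```python
import base64
import hashlib
import json

def verify_said(obj: dict, said_field: str = "said") -> bool:
    """Recompute the SAID with the digest field blanked out and compare."""
    claimed = obj[said_field]
    # Restore the fixed-width '#' placeholder, same length as the claimed SAID.
    blanked = {**obj, said_field: "#" * len(claimed)}
    serialized = json.dumps(blanked, separators=(",", ":"), sort_keys=True).encode()
    padded = b"\x00" + hashlib.sha256(serialized).digest()
    encoded = base64.urlsafe_b64encode(padded).decode()
    # Equivalent to swapping the claimed code back to 'A': swap our 'A' out instead.
    return "I" + encoded[1:] == claimed
```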

I wrote a lot here and we've written a few times back and forth. I know that async communication can sometimes be difficult on very precise things like this. Feel free to reach out to me on LinkedIn, ToIP Slack, or KERI Discord if you want to have a realtime conversation about this.

@kentbull
Copy link

kentbull commented Sep 2, 2024

Stephen, a note on dependencies: the @noble/hashes library is a single dependency that includes all of the Blake2b-256, Blake3-256, SHA3-256, and SHA2-256 digest algorithms. When you said "dependency" were you referring to software or conceptual dependencies?

If software dependencies then @noble/hashes can help address that problem. If conceptual dependencies then picking one digest algorithm would be best. If having multiple digest algorithms is acceptable from a conceptual perspective then @noble/hashes is a good option for the algorithms I mentioned.

@swcurran
Copy link

swcurran commented Sep 3, 2024

We’re getting close to something that can go in the spec!!!

About this:

Step 2 here prematurely Base64URLSafe encodes the digest. You don't Base64URLSafe encode anything until you have prepadded it with zero bits. For 32 byte SAIDs you don't have to use the calculation described here in qb64b because it will always be 1 pad byte. For longer SAIDs then you need to use the calculation specified in the qb64b function. I would be happy to walk you through this.

I intentionally proposed the technique I did so the “extra steps” are done with characters vs. bits — which I thought would be easier. If you look at the RFC4648, the padding characters are added for exactly the same reason in that RFC as for CESR—to get the byte boundary even. By using an RFC4648 implementation that meets the spec, you get “the right thing” for CESR, without having to implement your own steps to figure out the bit padding needed. Just look at what the Base64 padding is (0, 1 or 2 = characters), and put the prefix on accordingly. That said, what you are saying also works. It just seems easier to me to deal with characters instead of bits. :-)

About this:

This is not correct. Removing the first character leaves you with a partial digest which will not match the digest of the destination JSON because it is missing the zero bits. It would be off by one character, the 'A' character of zero bits.

In my version, I deliberately left off the pre-pending in the verification, so I think we get to the same point — I strip it in the input, and don’t pre-pend in the calculated value. You don’t remove it from the input, and include the pre-pending to the calculated value (in the corrections to the later steps that I proposed). So I leave it off in both cases, you include it in both cases, so both would work. I’m OK either way that works :-).

Shall we take a shot at getting a PR to the spec done?

@swcurran
Copy link

swcurran commented Sep 3, 2024

Stephen, a note on dependencies, The @noble/hashes library is a single dependency that includes all of Blake2b-256, Blake3-256, SHA3-256, and SHA2-256 digest algorithms. When you said "dependency" were you referring to software or conceptual dependencies?

I’m thinking more of conceptual. In writing the spec, we don’t want to make assumptions about what libraries, languages or other constraints a specific instance might need. So we can’t assume that every instance can find a dependency that supports all of the hash algorithms we want to support. We also want to support instances where someone wants to implement the entire thing.

We have to support some minimal acceptable algorithm — e.g. nothing that has a weakness, so that’s the low bar. Beyond that, we want to pick the one(s) that have the broadest support. Further, we want to limit those choices so a resolver only has to support the algorithms that OCA Bundle producers are permitted to use.

So — the bottom line is, what is the value of requiring support all 4 algorithms? Are any weak and so should be dropped? If all are the same, why allow all of them?

Example — in working on did:tdw, we were going to use sha-256 and sha3-256, and we discovered (surprisingly enough) that there were no generally accepted sha3-256 TypeScript libraries (or so I’m told). And since using sha3-256 didn’t really make the implementation more secure, we figured we would drop the option of using it until it (or a better) algorithm was readily available. In a later version of the spec, we’ll likely add support for another hashing algorithm, but for now, we’ll leave it at sha-256.

@kentbull
Copy link

kentbull commented Sep 15, 2024

I respond to the comments inline below.

The TL;DR is: sharing bits in pad bytes forces us to do more work. A visual helps explain this clearly. CESR pads on the front, Base64 pads on the back, and both share bits in the Base64 character (light green box) adjacent to the padding character.

Legend

image

Example SAIDified JSON

image

CESR Pre-Padded Encoded Digest Diagram

image

Naive Base64 Post-Padded Encoded Digest Diagram

image

Bit sharing between Base64 characters encoding both bits from pad bytes and bits from raw value bytes forces you to have to manage padding bits whether using CESR or naive Base64. This sharing of bits occurs in both Base64 and CESR because it is a consequence of the need to align on a 24 bit boundary.

When encoding values where padding is necessary, as in when the byte count of the raw value is not a multiple of 3 (so the bit count is not a multiple of 24), you must share either four or two bits with one of the pad characters. Whether four or two bits depends, as in section 4 of RFC 4648, on the number of pad bytes used.

There are two cases, one pad byte or two pad bytes.

  • For one pad byte, four bits are shared. This is pictured in the diagram.
  • For two pad bytes, two bits are shared.

How many pad bytes?

In both Base64 and CESR the count of pad characters tells you how many pad bytes are used. You can use this to reverse-engineer the original digest by stripping that count of bytes from the decoded value, which removes any need to depend on a count code table for something as simple as a SAID.
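This correspondence is easy to check: the number of '=' pad characters naive Base64 appends equals the number of zero pad bytes CESR prepends. A small sketch:

```python
import base64

for raw_len in (32, 33, 64):
    raw = bytes(raw_len)  # dummy raw primitive of raw_len zero bytes
    pad_chars = base64.b64encode(raw).decode().count("=")  # Base64 '=' count
    pad_bytes = (3 - raw_len % 3) % 3                      # CESR zero pad bytes
    print(raw_len, pad_chars, pad_bytes)  # the two counts always match
```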

Bit sharing forces dealing with pad bytes

What you have to choose is whether you want to use pre-padding like CESR or post-padding like Base64. What you can't choose is whether or not you have to add and strip pad bytes. A SAID implementation could work with either pre-padding or post-padding, yet must account for the sharing of bits between pad bits and raw digest bits within a single Base64 character.

CESR uses pre-padding (on the front/start of the raw bytes being encoded), Base64 uses post-padding (on the back/end of the raw bytes being encoded).

To get the same digest as a CESR-compatible SAID you must use pre-padding AND you must know the number of pad bytes to extract from the front of the padded raw digest in order to get back to the original digest. The derivation code/lookup character tells you how many pad bytes to remove from the decoded value, not how many pad characters to remove. You can't only remove pad characters because of shared bits between some pad characters and the raw encoded digest/primitive.

If you only add or remove pad characters post-conversion to Base64 then you are missing the pad bits that were encoded as a part of the raw bytes because:

  • whether using Base64 post-padding or CESR pre-padding, in scenarios with 1 or 2 pad characters,
    • some of the bits of your digest share the same Base64 character as some of the pad bits.

Because each Base64 character encodes only 6 bits of information while each raw byte holds 8 bits, you always end up sharing raw bits with a pad character whenever you encode something that needs pad characters to align on the 24 bit boundary.

What CESR does is put these pad bits on the front while Base64 puts these pad bits on the back. The images above make this clear.
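Bit sharing is easy to demonstrate: stripping the leading 'A' from a CESR-style encoding does not recover the naive Base64 encoding of the raw digest, because every remaining character is shifted by two bits. A sketch (SHA-256 stands in for Blake3 again):

```python
import base64
import hashlib

raw = hashlib.sha256(b"abcd").digest()  # any 32-byte digest

# Naive Base64: pads on the back with one '=' character.
naive = base64.urlsafe_b64encode(raw).decode()
# CESR-style: one zero byte prepended before encoding, so no '=' is needed.
cesr_style = base64.urlsafe_b64encode(b"\x00" + raw).decode()

# Dropping the leading 'A' does NOT recover the naive encoding:
# two of the pad bits live in the character after the 'A'.
assert cesr_style[1:] != naive.rstrip("=")
print(naive)       # 43 data characters plus '='
print(cesr_style)  # 'A' plus 43 characters, no '='
```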

Benefits to pre-padding

The benefits to padding on the front are:

  • Conforms to the standard in Type Length Value (TLV) encoding where the framing code is prepended to the front of a stream of bytes.
    • This allows a parser to be able to efficiently select the correct quantity of bytes from a byte stream without having to decode the entire stream or primitive first.
      • Conversely, putting the type code on the back, as pictured in the "Naive Base64" example above, would force the parser to do extra work to identify which character in a stream is the start or end of a primitive. You would have to include other serialization and deserialization instructions in order for the parser to do its work.
  • Self-contained, standalone value encoding due to using TLV.
  • Easier to read, relatively; human friendly
  • Tremendously simplifies parser implementations.
    • TLV parsers with lookup tables for sizes are much easier to write than something with complicated parsing instructions.

Type code substitution

As shown in the diagram above the CESR-compatible SAID encoding substitutes the resulting pre-padded 'A' characters with a type code, 'H' in this instance, that serves as a lookup value for a parsing rules table indicating the length of the digest.

image

As mentioned above, using pre-pended type codes both increases human-friendliness and dramatically simplifies parser implementations. This is what is called a self-framing digest, or a self-framing cryptographic primitive.

Itemized Response

I intentionally proposed the technique I did so the “extra steps” are done with characters vs. bits — which I thought would be easier.

While I see why you would want to make it easier or simpler, the reason you can't do the padding with characters only is that pad bytes split across multiple characters and only some of the characters end up being a pure "zero" character, the 'A' character. You will end up with incorrect SAIDs if you only consider the text-based pad characters post-conversion to Base64, because all of the pad bits you have to account for, in the example pictured above, stretch beyond the six bits in the pad character to two more bits that are shared with the character adjacent to the pure padding character.

As mentioned above, the character adjacent to the pure padding character will contain either four or two data bits alongside pad bits, whether using Base64 or CESR style padding.

If you look at the RFC4648, the padding characters are added for exactly the same reason in that RFC as for CESR—to get the byte boundary even.

Due to bit sharing this is not entirely correct. The padding characters are not all of the padding added because some of the pad bits (2 or 4 as shown above) are included in the encoded Base64 character adjacent to the padding character. So getting to the bit boundary is not as simple as adding or removing a padding character because padding is added as bytes and not Base64 characters.

By using an RFC4648 implementation that meets the spec, you get “the right thing” for CESR, without having to implement your own steps to figure out the bit padding needed.

This is incorrect. An RFC4648 compliant implementation of Base64 does post-padding while CESR does pre-padding. You will get different digests if you use what is called "naive Base64." Why is it called "naive?" This gets into the composability property of CESR that I reference rather than repeat here. I am more than happy to talk you through this if you want additional clarification. The gist is that "naive Base64" makes it impossible for a parser to cleanly separate primitives without additional parsing instructions beyond a TLV scheme because there are no clean boundaries of primitives included in the encoding itself. You have to add parsing instructions beyond the TLV rules to understand where the boundaries are in the naive Base64 data in order to properly separate raw values from the parsed stream, and this all has to be done at once due to the lack of boundaries, meaning you can't pipeline the processing of the stream and thus can't utilize all your CPU cores for efficient stream processing.

In my version, deliberately left off the pre-pending in the verification, so I think we get to the same point — I strip it in the input, and don’t pre-pend in the calculated value. You don’t remove it from the input, and include the pre-pending to the calculated value (in the corrections to the later steps that I proposed).

You are off by two bits because of bit sharing as described above. Thus your digest would be different and would not match the original SAID.

If the padding were only limited to adding or removing a character at the front then what you are saying would work. Yet bit sharing is the important reason why it does not work. Some of the 8 pad bits that both Base64 and CESR use in the post-pad/pre-pad are included in the Base64 character adjacent to the pad character(s). So just stripping the pad character(s) off the front, or the back, will leave you with two extra bits in your raw output, which will give you digests that don't match.

So I leave it off in both cases, you include it in both cases, so both would work. I’m OK either way that works :-).

Both would not work. If you leave the Base64 pad character off in both cases then you are ignoring the two additional pad bits that have been encoded into the Base64 character adjacent to the pad character, causing you to compute a different digest, one that would not be CESR compatible because the resulting different digest would not perform pre-padding in the way CESR expects.

Doing pre-padding with pad bytes prior to conversion to Base64 is the only way to be CESR compatible. For what it's worth, the default behavior of Base64 encoding is to pad bytes prior to conversion as well, just with post-padding. Because pad characters encode only 6 bits of information, you can't pad to a 24 bit boundary correctly by just adding or stripping characters. Padding must be done with bytes prior to conversion to Base64 characters or you will end up with Base64 doing its own post-padding, which is not SAID compatible.

So — the bottom line is, what is the value of requiring support all 4 algorithms? Are any weak and so should be dropped? If all are the same, why allow all of them?

It seems that it would not be desirable to have a spec tied to only one specific cryptographic algorithm; rather, open it up so implementors could use anything that follows a given process. Along these same lines, standardizing on a process plans for expansion and evolution in the specification language, which would be valuable given the expected need to change cryptographic libraries for post quantum resistance once quantum computers actually take off. So focusing on a general encoding process like SAIDification rather than a particular algorithm would be a way to both be specific and retain flexibility for the specification to remain valid even when cryptographic algorithms need to change.

There's essentially low to no cost to include a list of algorithms in the spec that work with SAIDification. As long as the TLV scheme used allows for clear identification of the type of algorithm then the coding effort to support an additional digest algorithm is minimal.

So the bottom line to me is that allowing all four, or even more, has such a low cost and a comparatively high benefit of meeting the needs of diverse applications that it seems like an easy win and low hanging fruit.

we discovered (suprisingly enough) that there were no generally accepted sha3-256 TypeScript libraries (or so I’m told).

Are you referring to not generally accepted by a given body or independent security audits? I know the readily available @noble/hashes package on NPM provides each of the algorithms we have discussed in this thread, Blake3-256, Blake2b-256, SHA3-256, SHA2-256. According to this NIST policy I just Googled it appears that both SHA3-256 and SHA2-256 are generally accepted by NIST. As far as the @noble/hashes library goes it was audited in 2022 in case you are considering using it.

Summary

With the pre-padding and prepended type code substitution process outlined above your digest is aligned on a 24 bit boundary whether using the SHA or Blake families of cryptographic algorithms. SAIDification is about padding on the front to meet the needs of a TLV encoding scheme for human-friendly textually encoded values and simple parser implementations.

If this meets the needs of the OCA spec then I suggest we work out language that clarifies the use of pre-padding.

@carlyh-micb
Copy link
Collaborator Author

carlyh-micb commented Sep 16, 2024

I'm not sure if this is documented above (didn't notice it) so I wanted to include it here since it also influences the SAID calculation (thank you Kent for the help in understanding this).

In OCA they use: "v": "OCAB10JSON0010eb_", in the JSON version of the schema generated.

This is using: https://trustoverip.github.io/tswg-cesr-specification/#legacy-version-1xx-string-field-format

The format of the Version String for version 1.XX is PPPPvvKKKKllllll_. It is 17 characters in length and is divided into five parts:
  • Protocol: PPPP, a four character protocol string (for example, KERI or ACDC or OCAB)
  • Version: vv, a two character major/minor version (described below)
  • Serialization kind: KKKK, a four character string of the serialization type (JSON, CBOR, MGPK, CESR)
  • Serialization length: llllll, an integer encoded in lowercase hexadecimal (Base 16) format
  • Legacy version terminator character: _

The serialization length is calculated for each schema. So the "OCAB10JSON" is standard and the "0010eb_" is the size of the serialized schema file (with the _ termination character).

This means when calculating a SAID where you use a "v" field:
  1. First add the #'s where the SAID will go.
  2. Then calculate the size of the entire serialized object (e.g. "0010eb").
  3. Insert that size into the v value.
  4. Then calculate the SAID and replace the #'s with it.
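Because both the version string (17 characters) and the SAID placeholder (44 characters) are fixed width, the serialized length can be measured once and substituted without changing the total. A sketch (`with_version_string` is a hypothetical helper; the SAID calculation would then proceed over this serialization):

```python
import json

def with_version_string(obj: dict) -> dict:
    """Fill in the llllll hex length field of an OCAB version string."""
    # Fixed-width placeholders keep the total length stable while measuring it.
    obj = {**obj, "v": "OCAB10JSON000000_", "said": "#" * 44}
    size = len(json.dumps(obj, separators=(",", ":")).encode())
    # The 'v' field stays exactly 17 characters, so substituting the real
    # hex length does not change the overall serialized size.
    obj["v"] = f"OCAB10JSON{size:06x}_"
    return obj

doc = with_version_string({"first": "Sue"})
print(doc["v"])  # OCAB10JSON + six lowercase hex digits + '_'
```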

@kentbull
Copy link

I collected everything from our conversation here regarding SAIDs into my latest blog post. I also added the HCF demos @blelump suggested to the "Implementations" section of the blog post at the end.

If that post leaves anything unclear then please reach out to me and I will further clarify the post.

@swcurran
Copy link

Thanks Kent — I’ll defer to you on the bit-twiddling. If you say it is needed, it is! I checked out the blog and it is great — the section How does it work? How are SAIDs calculated? should go into the OCA spec more or less exactly. Nice!

I'm a bit confused about this.

There's essentially low to no cost to include a list of algorithms in the spec that work with SAIDification. As long as the TLV scheme used allows for clear identification of the type of algorithm then the coding effort to support an additional digest algorithm is minimal.

You seemed to be saying that you disagree with me, and then seem to repeat the point I was making.

There is a cost for every consumer to have to support any algorithm that any producer might use. Each producer need only support one, but every consumer has to support all. So agree that we use an approach that allows for any algorithm (the SAID spec. achieves that), but limit the algorithms that can be used in any specific version of the OCA (not SAID) spec to those that add value by their inclusion.

So, as long as the OCA Spec says “You MUST use algorithm A, B, C, D, E, F and G” — I’m happy.

Are you going to do a PR to the V1 OCA spec for the SAID calculation?

@swcurran
Copy link

@carlyh-micb — are you saying that any OCA file now needs to have a v item that meets that criteria? I don’t see anything in the spec that requires that. Where is that defined?

I do think that the version of the OCA spec being used by an overlay is needed. I’m not thrilled by the complexity of the version calculation, but I’m guessing it's not that hard to calculate the length of the OCA Bundle. It does force the dev to insert the N # characters, as the length must be known before the SAID is calculated. @kentbull — you might want to add that to the OCA spec calculation, so that the OCA spec is self-standing.

I’m guessing this is another CESR thing? I find it doesn’t help adoption by having OCA intertwined with CESR where it adds complexity. In this case, other than for CESR, why does the length of the JSON matter?

Question: In calculating the JSON — how are end-of-lines calculated — Linux/MacOS (1 character) or Windows style (2)? :-).

@carlyh-micb
Collaborator Author

carlyh-micb commented Sep 22, 2024

@swcurran It doesn't show up in the spec yet, it is in the test site where you can produce a single JSON schema bundle (https://repository.oca.argo.colossi.network/).

Regarding "I'm not thrilled by the complexity of the version calculation": I'm curious to know which use cases need this complexity now. Semantic versioning (e.g. 1.1.0) seems to be sufficient. Perhaps any kind of 'wrapper' of the schema could include the size.

At ADC we are working on adding our own overlays which will have SAIDs calculated. We plan to create a Schema Package which contains an HCF-generated OCA schema bundle (with all their complexities) that we generate using an API. This gets combined with the presentation JSON and our own overlays. Our syntax would use only semantic versioning and not this CESR complexity. The idea is that we will be able to incorporate any HCF JSON OCA schema quickly without having to change our own overlay calculations. @swcurran do you want to chat with us about how we are making our own overlays? This Schema Package we are designing could be standardized and shared.

@kentbull

@swcurran

You seemed to be saying that you disagree with me, and then seem to repeat the point I was making.

I was disagreeing with you, though after sleeping on it I changed my mind and agree with you. The spec is more likely to be implemented and well supported if it has as few algorithms as absolutely necessary.

There is a cost for every consumer to have to support any algorithm that any producer might use. Each producer need only support one, but every consumer has to support all.

This is a very important point. Thank you for reiterating it.

So agree that we use an approach that allows for any algorithm (the SAID spec. achieves that), but limit the algorithms that can be used in any specific version of the OCA (not SAID) spec to those that add value by their inclusion.

Yes, this is reasonable. What algorithm do you favor? I've heard people say that SHA2-256 and SHA3-256 are good candidates. Would you pick one of those or something different?

@kentbull

Are you going to do a PR to the V1 OCA spec for the SAID calculation?

I was not originally planning on it though I would be happy to. Which section would you like me to add it to? The concepts section?

@kentbull

I agree on versioning. Unless you need and want the benefits that CESR provides for clear use cases then I recommend against using a version field because of the following constraints:

  1. CESR requires version fields to be at the front of the JSON payload, which is really good for a TLV scheme, yet would feel unwieldy if you don't need such versioning or don't need to support a TLV streaming parser.
  2. The version of the entire JSON object must be known and encoded into the version string with some sort of format such as raw digits, hex, or Base64 digits. Managing this complexity isn't difficult, yet it is just one more thing to worry about that you should not worry about if you don't want to leverage the advantages of having versioned, sized objects.

If you don't need those things, don't use them. KISS

I’m guessing it's not that hard to calculate the length of the OCA Bundle.

Knowing the length of the OCA bundle would only be required for the version string. You don't need to know the length of the entire JSON object, which I assume is the bundle, for the SAID filler characters. The quantity of SAID filler characters is determined only by the number of bytes in the fully padded digest.
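Kent's rule here can be sketched as a tiny helper: the filler count equals the fully qualified Base64 length of the pre-padded digest (the function name is mine, for illustration).

```python
# Minimal sketch of the filler-count rule: the number of '#' characters
# equals the fully qualified Base64 length of the pre-padded digest.
def filler_count(digest_bytes: int) -> int:
    pad = (3 - digest_bytes % 3) % 3      # pre-pad bytes to reach a 24-bit boundary
    return (digest_bytes + pad) * 4 // 3  # Base64 chars; the code occupies the pad chars
```

For a 32-byte digest such as Blake3-256 or SHA3-256 this gives 44 filler characters; for a 64-byte digest it gives 88.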

I’m guessing this is another CESR thing? I find it doesn’t help adoption by having OCA intertwined with CESR where it adds complexity. In this case, other than for CESR, why does the length of the JSON matter?

Yes, the version field and version string is a CESR thing. If what that provides is not needed for the OCA spec then I recommend it be elided.

Question: In calculating the JSON — how are end-of-lines calculated — Linux/MacOS (1 character) or Windows style (2)? :-).

To my knowledge you don't have to worry about this if you aren't using a v version field. If you are using a version field then it's probably Linux style line endings. I actually don't know on this point and would need to check whatever JSON standards there are for JSON parsing libraries.

@carlyh-micb
Collaborator Author

I think the explanation of the calculation of the SAID doesn't belong in the OCA specification because it already belongs in another specification, that is the root of trust of the standard (KERI/CESR).

However, I do think it would be valuable to have in helper documentation for OCA, but I'm not sure where that lives. If I were told where the home is, I could contribute a clear introduction to the rationale of the architecture; an in-depth description of how SAIDs are calculated (non-normative and referencing CESR) would also belong there.

I think the OCA specification can allow anyone to use any hashing algorithm they choose (that is allowed in CESR). However, any ecosystem that uses OCA may want to limit the choices for interoperability.

Finally, I really think the version string in OCA should be removed. At ADC we are adding more overlays and I think that will be very common. We are also adding the presentation overlay. The Swagger API that generates the OCA bundle includes dependencies as part of the JSON object; does the version string include those extra dependencies? At this point, a bunch of stuff is often being added to the bundle, so whatever size is specified in the schema bundle is inaccurate once all these additions are in place. Perhaps the time of data transit is when a payload size should be calculated.

@swcurran

My preference is that there is no need in OCA to reference either CESR or SAID specifications, and that by putting in the SAID (or "overlay identifier") calculation into the OCA spec, the specification can stand alone. That makes OCA acceptance in circles outside of the KERI world much easier.

I think it is reasonable to reference the version of OCA being used by an overlay (although it adds a lot of overhead), and I would encourage not using the CESR v identifier as the model — especially inclusion of the length, and the requirement that it be at the front of the overlay (since that is not a reliable concept in JSON) and is unnecessary. Just ”v”: “1.0” would be sufficient in my opinion.

@carlyh-micb
Collaborator Author

carlyh-micb commented Sep 24, 2024

OCA specification is already referencing outside standards, we do not give out the full specification of each hashing algorithm, those are outside the specification as well. Same with the syntax of JSON which has its own external specification. I think it is normal and acceptable that many standards exist outside of the OCA specification and are referenced (but I can understand the appeal of a one-stop-shop). Where information on an external specification is hard to find or hard to understand (e.g. SAIDs) OCA specification could provide helper documentation.
Referencing the external specification for SAIDs means that we don't have to do the work of harmonizing the standards every time there is an update. (I can see confusion and people asking 'is it SAID-KERI or SAID-OCA that we are using' whenever something is updated).
That doesn't stop OCA from including a non-normative description of how things work.

I agree that the version would be better as Semantic Versioning (e.g. 1.0.0) which many people are familiar with.

@carlyh-micb
Collaborator Author

Just to check, using the OCA swagger API and some quick schemas I wrote, I tested a schema with a single dependency.

The bundle has version: OCAM10JSON00032c_
The dependency has a single schema with version: OCAM10JSON00032d_

@kentbull says "CESR requires version fields to be at the front of the JSON payload, which is really good for a TLV scheme, yet would feel unwieldy if you don't need such versioning or don't need to support a TLV streaming parser."

These version strings seem to be distributed throughout the payload.

My preference is to use Semantic Versioning instead.

{
  "bundle": {
    "v": "OCAM10JSON00032c_",
    "d": "EMOv2x0Vy4Ejp3g0EmUVy0rYPl_iTlpYrMXKuL06OGkV",
    "capture_base": {
      "d": "EMM5Fhv9qG-2mOtqwoAbj3GfzZJjyQ9suYBcP5dV-7b7",
      "type": "spec/capture_base/1.0",
      "classification": "RDF107",
      "attributes": {
        "reference": "refs:ENuWNrIcSZzLn6zeG7eJo8P0RiuAA6dogdCKAYEOoEHn"
      },
      "flagged_attributes": []
    },
    "overlays": {
      "character_encoding": {
        "d": "EMTbffHCVcO9nH2k-MNoUPSB28VvfTZQTJYw6dEdI7s2",
        "type": "spec/overlays/character_encoding/1.0",
        "capture_base": "EMM5Fhv9qG-2mOtqwoAbj3GfzZJjyQ9suYBcP5dV-7b7",
        "attribute_character_encoding": {
          "reference": "utf-8"
        }
      },
      "meta": [
        {
          "d": "EL4K6n1ltZI9bzN24QCIUakiI-n3FQnM5IZVypr82SU3",
          "language": "eng",
          "type": "spec/overlays/meta/1.0",
          "capture_base": "EMM5Fhv9qG-2mOtqwoAbj3GfzZJjyQ9suYBcP5dV-7b7",
          "description": "This schema will reference another schema.",
          "name": "testing references"
        }
      ]
    }
  },
  "dependencies": [
    {
      "v": "OCAM10JSON00032d_",
      "d": "ENuWNrIcSZzLn6zeG7eJo8P0RiuAA6dogdCKAYEOoEHn",
      "capture_base": {
        "d": "EBDctubUaASR1t4FjZLV9izKkvwSANyW_Hi7ds_qlIpC",
        "type": "spec/capture_base/1.0",
        "classification": "RDF107",
        "attributes": {
          "v1": "Numeric",
          "v2": "Text"
        },
        "flagged_attributes": []
      },
      "overlays": {
        "character_encoding": {
          "d": "EGYXTClyg25g6nKdS1oLeRjcZi_-114YUve8Oo8o6uEH",
          "type": "spec/overlays/character_encoding/1.0",
          "capture_base": "EBDctubUaASR1t4FjZLV9izKkvwSANyW_Hi7ds_qlIpC",
          "attribute_character_encoding": {
            "v1": "utf-8",
            "v2": "utf-8"
          }
        },
        "meta": [
          {
            "d": "EEwqBMLIFlryq8YatvA5DBqhMIRp-QAkrFVhF_XPRVW2",
            "language": "eng",
            "type": "spec/overlays/meta/1.0",
            "capture_base": "EBDctubUaASR1t4FjZLV9izKkvwSANyW_Hi7ds_qlIpC",
            "description": "Adding references to a schema to test how they work and test dependencies.",
            "name": "Testing references"
          }
        ]
      }
    }
  ]
}

@kentbull

I would encourage not using the CESR v identifier as the model — especially inclusion of the length, and the requirement that it be at the front of the overlay (since that is not a reliable concept in JSON) and is unnecessary.

Ordered dicts are how the Python reference implementation maintains a consistent field ordering for JSON, CBOR, and MessagePack. Such field order is not inherently a feature of JSON, yet newer versions of JavaScript (as of ECMAScript 2015/ES6) and many JSON implementations support insertion ordering. JavaScript objects, class constructors, and the JSON.parse and JSON.stringify functions are explicitly mandated by the ECMAScript specification to preserve insertion order of properties.

Yet this ordering issue is a separate issue from the version string.

If you are not trying to build a parser that supports a speed and space optimized type, length, value (TLV) parsing scheme for ordered field maps (like insertion ordered JSON) then you don't really need the same kind of version string as CESR data types because you are likely just dealing with JSON or CBOR.

Then again, I do not know how OCA wants to handle versions of the OCA spec and how that translates to serialization versions. Depending on what kind of parsing rules an OCA JSON parser needs to handle a version string may be useful.

Maybe

Just ”v”: “1.0” would be sufficient in my opinion.

is sufficient.

I do not have the context to know what the needs of a version field are for OCA objects and so defer to you two.

@carlyh-micb
Collaborator Author

Also, just checking, but:

In OCA they use: "v": "OCAB10JSON0010eb_", in the JSON version of the schema generated.

This is using: https://trustoverip.github.io/tswg-cesr-specification/#legacy-version-1xx-string-field-format

The format of the Version String for version 1.XX is `PPPPvvKKKKllllll_`. It is 17 characters in length and is divided into five parts:

- Protocol: `PPPP`, a four-character protocol string (for example, KERI, ACDC, or OCAB)
- Version: `vv`, a two-character major/minor version (described below)
- Serialization kind: `KKKK`, a four-character string for the type (JSON, CBOR, MGPK, CESR)
- Serialization length: `llllll`, an integer encoded in lowercase hexadecimal (Base16) format
- The legacy version terminator character `_`
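For illustration, here is a hedged sketch of parsing that 17-character format; the regex and the returned field names are mine, not from the spec.

```python
# Hedged sketch of parsing the legacy 1.XX version string described above.
import re

def parse_version_string(v: str) -> dict:
    m = re.fullmatch(
        r"(?P<proto>[A-Z]{4})(?P<major>[0-9a-f])(?P<minor>[0-9a-f])"
        r"(?P<kind>[A-Z]{4})(?P<size>[0-9a-f]{6})_",
        v,
    )
    if m is None:
        raise ValueError(f"not a legacy version string: {v!r}")
    return {
        "protocol": m["proto"],                                # e.g. OCAB
        "version": f"{int(m['major'], 16)}.{int(m['minor'], 16)}",
        "kind": m["kind"],                                     # JSON, CBOR, MGPK, CESR
        "size": int(m["size"], 16),                            # serialized length in bytes
    }
```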

Does this mean, since there are only two version characters, that the highest version allowed would be 9.9? That seems to be a restrictive choice for OCA.

And I also noticed that OCA (swagger API for generated JSON schema bundles) generates different Protocol version strings. I've now seen OCAB and OCAM.

@kentbull
Copy link

Does this mean, since there are only two version character codes that the highest version allowed would be 9.9?

Since the digits are hexadecimal then the largest version would be f.f or equivalently 15.15 in decimal. This is enough for CESR protocol versions of KERI and ACDC because it is not expected that those protocols will change often. If OCA has objects whose versions will change often then a different versioning scheme should be used.

Nonetheless, if you want a full semantic version string then you would need to do something different.

Does compatibility at a version string level between OCA and ACDC / KERI data structures matter? If not then you can pick whatever version string semantics you want and you can ignore the CESR-style version string.

@swcurran

Sorry to do this, but I’ll try one more time on this issue.

I think the explanation of the calculation of the SAID doesn't belong in the OCA specification because it already belongs in another specification, that is the root of trust of the standard (KERI/CESR).

The currently expired SAID spec. is (in my opinion) overly complicated and does not clearly present the algorithm — hence this long discussion that Kent has beautifully summarized in his Blog post. At minimum, the SAID spec. should be replaced with what Kent has written. If that spec. can’t be simplified, then my recommendation is to just put the text into the OCA spec. The algorithm is relatively short and easy, doesn’t need much introduction (“A unique identifier is needed for an overlay and that identifier MUST be derived from the content of the overlay using this algorithm:”).

The CESR spec. should not be referenced at all, as it is not needed. If there is insistence on using the CESR v, then again, define the value, don’t reference an unrelated spec. for one tiny part of it.

@carlyh-micb — you are absolutely right about referencing other specs — hashing, JCS and the like.

@carlyh-micb
Collaborator Author

carlyh-micb commented Sep 25, 2024

I've been thinking @swcurran about your comment.

Standards belong in the organizations that are responsible for them.
CESR with its SAID calculation is clearly the responsibility of Sam Smith and the standard they are developing (KERI/CESR/SAID). Even if it is hard to read and find in the specification.

However, standards that are still under development and not officially released as a locked version are at risk of changing. Even if it is unlikely.
It would be prudent, since there isn't an official, locked, version 1 release of SAID calculations, that the OCA spec release the SAID calculation as its own standard, official, locked, released, version 1.

We also acknowledge that the source for this is CESR standard and that when they officially release their standard as an official version release, then OCA switches to the official KERI/CESR/SAID standard. Unless the calculation has changed in the meantime whereupon OCA would need to do a new version release of the OCA specification referencing the newer SAID calculation. If it is still hard to read the standard to find the SAID calculation OCA specification can still release the non-normative helper documentation for understanding the SAID spec.

It would be bad if OCA uses the unreleased KERI/CESR/SAID specification as the official source of SAID calculations and then the calculation gets changed because the SAID spec is still under development.

Did I capture the argument?

@swcurran

That’s pretty much it.

@carlyh-micb
Collaborator Author

@kentbull any chance that Sam Smith would release the SAID specification, as you have written it, as an official versioned release that can be cited as a specification standard and will not change?

@kentbull

kentbull commented Sep 25, 2024

It would be bad if OCA uses the unreleased KERI/CESR/SAID specification as the official source of SAID calculations and then the calculation gets changed because the SAID spec is still under development.

It is very unlikely that the SAID calculation will change because it is part of the bedrock on which everything else in CESR, KERI, and ACDC is built. If you are worried about the SAID calculation changing you can lay those fears to rest.

@carlyh-micb as far as Sam releasing a spec with the language I used, that seems unlikely, though I will ask what we can do on the next spec call this coming Tuesday.

I like the idea of the SAID spec being released separately, it's just one more document to manage so I could see that being the primary argument against a separate spec. Right now the SAID spec language has been included with the CESR spec.

@carlyh-micb
Collaborator Author

Given the utility of the SAID spec, I'd love to see it released separately. It can be useful in many places, not just CESR. Because of its utility and the challenge for people to get the details out of the spec I think it deserves a separate specification. Separation of concerns right?

I've seen SAID specifications on WebOfTrust and IETF, and now ToIP, all places that have hosted this specification as it is developed. Without an official release it can be confusing to identify the actual home of the current draft (when you know you know, but a google search is unclear).

@kentbull

Given the utility of the SAID spec, I'd love to see it released separately.

Yes, so would I. I will bring this up to Sam and see what he says. I'm sure he'd at least give it some consideration. Worst case scenario I'd be happy to draft some language for the OCA spec as a first version. There could be a non-normative reference to the SAID language in the CESR spec for those who want more details.

The outline of what the language would include is this:

  1. Description of the code table of digest algorithm types, like the following table, with all supported algorithms, starting with just two, SHA3-256 and SHA2-256. We can omit either of these if desired.

     | Derivation Code | Description | Code Length | Count Length | Total Length |
     | --- | --- | --- | --- | --- |
     | H | SHA3-256 Digest | 1 | | 44 |
     | I | SHA2-256 Digest | 1 | | 44 |

  2. Explanation of the process of making a digest with the supported algorithm
  3. Step-by-step language on the pre-padding process to get the padded raw bytes
  4. Explanation of the Derivation Code (type code) lookup process and encoding it into the front of the padded raw bytes
  5. Description of the SAID embedding process
  6. Walk-through of the SAID verification process (recomputing the digest and comparing it)
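The embedding and verification steps at the end of that outline can be sketched as a recompute-and-compare routine. This assumes SHA3-256 with derivation code `H` and a 44-character SAID; the filler-restoration approach and the function name are illustrative.

```python
# Illustrative verification sketch: restore the fillers, recompute, compare.
# Assumes SHA3-256 with derivation code 'H' and a 44-character SAID.
import base64
import hashlib

def verify_said(serialized: str, said: str) -> bool:
    # Put the '#' fillers back where the SAID sits.
    restored = serialized.replace(said, "#" * len(said))
    # Recompute the digest over the restored serialization.
    raw = hashlib.sha3_256(restored.encode("utf-8")).digest()
    # Re-qualify: one zero pad byte, Base64URL encode, code 'H' over the pad char.
    recomputed = "H" + base64.urlsafe_b64encode(b"\x00" + raw).decode("ascii")[1:]
    return recomputed == said
```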

@kentbull

kentbull commented Oct 1, 2024

After bringing this to this morning's ToIP ACDC Task Force meeting my recommendation is twofold:

  1. Reference section 12.6 Self-addressing identifier (SAID) of the CESR specification.
  2. Add some non-normative language, if desired, to point to my blog post as a starting point for implementors.

The rationale and discussion points are these:

  1. Managing a separate spec for SAIDs has overhead.
  2. When there are two specs then questions invariably come up about syncing between two specs (CESR & SAID)
  3. Pointing to the CESR spec gives a stable reference.
  4. Breaking things out of the CESR spec would be an OCA / HCF effort that would need to be independently maintained.
  5. There is no guarantee that an OCA / HCF SAID specification would be reviewed by anyone from the KERI community.
  6. Referencing the CESR specification allows OCA to gain the benefit, at least partially, of the spec review efforts in the KERI community.

To directly address the concern of the SAID spec or process changing, I make the comment that SAIDs will not be changing in the foreseeable future. If they did, then pointing to the CESR spec would be the best, most accurate place to reference for a full implementation.

And, regarding the complexity or simplicity of the CESR specification and available reference material such as my blog post, I favor having a combination of authoritative, detailed language from the CESR specification along with helpful reference material.

@carlyh-micb regarding the concern on things changing,

standards that are still under development and not officially released as a locked version are at risk of changing.

The KERI Working Group at ToIP is in the process of locking the specifications to version 0.9. It makes the most sense to me to point to those locked spec versions. During the time where this process is resolving the draft spec can be pointed to by the OCA language.

Lastly, to address the issue @swcurran brought up of CESR not being needed,

The CESR spec. should not be referenced at all, as it is not needed.

I changed my mind on this after writing the SAID blog post. CESR should be referenced from the OCA spec for a few good reasons. The short version is that there are enough CESR-specific ideas involved in SAIDs that the economical and supportable option is to reference the CESR spec.

  1. SAIDs use what is called "fully qualified Base64URLSafe encoding" which is a CESR concept. This is all of the bit padding I delved deep on above and in my blog post.
  2. The authoritative language for "fully qualified Base64URLSafe encoding" that has the most eyeballs on it is in the CESR spec.
  3. The parse table ideas represented by the CESR Master Code Table (where the SAID type codes come from), or some version of it, however minimal, must be present in or linked from any OCA spec version that is going to grow over time and use more than one algorithm type.

If the OCA spec and implementations are only ever going to use one algorithm such as SHA3-256 then you could hardcode one algorithm in the spec and leave it that way. A downside to this hardcoding approach is that it does not plan for the future and requires potentially large future spec updates when longer SAIDs are used, such as 64 byte rather than 32 byte SAIDs, or even larger SAIDs.

While I favor starting with one algorithm, SHA3-256, I also favor including the concept of a parse table that provides a way to add additional spec-supported algorithms as growth occurs.
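A minimal sketch of that parse-table idea, using the two codes from the table proposed earlier in the thread (names and structure are mine, for illustration):

```python
# Sketch of the parse-table idea: derivation code -> digest routine and
# expected raw size. Growing the spec later means adding a row, not new logic.
import hashlib

DIGEST_CODES = {
    "H": (hashlib.sha3_256, 32),  # SHA3-256 Digest, total encoded length 44
    "I": (hashlib.sha256, 32),    # SHA2-256 Digest, total encoded length 44
}

def digest_for(code: str, data: bytes) -> bytes:
    func, raw_size = DIGEST_CODES[code]
    digest = func(data).digest()
    assert len(digest) == raw_size
    return digest
```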

@swcurran

swcurran commented Oct 2, 2024

Reading the SAID section in the CESR spec, I would suggest that we put the calculation inline in the OCA spec without an external reference. That section of the spec is largely non-normative, except for one “MUST” that points to another section of the CESR spec (without a link). Worse, the narrative is irrelevant to how the SAID feature is used in OCA — to create a content-based identifier for overlays so that we are certain we have consistency between references to the overlay and the overlay content.

The idea of having a non-normative reference to the details of how a required calculation is done seems extremely odd and unhelpful.

Since the KERI and OCA communities are largely distinct, and the overlap is small (this one calculation), I don’t see value in the six stated reasons — only confusion for OCA developers/users (“What is this big other spec, and why can’t I just see the calculation?”). A reference to the very complex CESR spec for such a small feature is daunting. If the CESR spec changes, it would be better that the OCA spec NOT be impacted, as that would independently break interop.

For all of those reasons, I think we put the algorithm into the OCA spec with a non-normative acknowledgment of the CESR source.

Regarding the single hashing algorithm — as long as we state that there is a path to adding more algorithms in future versions (2.x, 3.x) of the spec, and we have a SAID scheme capable of identifying new algorithms (which we do), I think it is safe and helpful.

@kentbull

kentbull commented Oct 2, 2024

This is tough. I go back and forth on whether language should be referenced or whether it should be included as a subset.

As much as I like the argument of including a subset of language in the OCA spec I also know that the most well maintained language on the CESR-style SAID encoding is in the CESR spec.

If the OCA maintainers are okay with manually keeping the OCA spec up to date with the CESR style encoding, which will never change at this point, then including a subset of language is low risk.

Then again, why use a CESR style encoding at all for content addressable identifiers? If you just want one type of hashing algorithm then you don't need any type codes.

If you do want type codes then you don't have to use CESR style encodings, you could just prefix a type code on a Base64 value without worrying about any other parse table rules like CESR does.

If the goal is to just have content addressable identifiers then why involve CESR-style SAIDs? There would need to be other use cases where KERI or CESR are involved to justify using CESR-style SAIDs. If the HCF has those other use cases then I could understand sticking with CESR-style SAIDs.

Then again, maybe the lowest cost, lowest friction way forward is to just include CESR-style SAID language in the SAID spec.

Forgive me for not having a more educated opinion on this to start. This is the first spec discussion I've been a part of. Thanks for bearing with my back and forth. As I have thought through the arguments on each side and discovered new thinking and information I changed my position a few times.

If you want help drafting that language to place in the OCA spec I'm more than happy to help.

I don't have a dog in this fight yet I want all dogs to be healthy, if that makes sense.

@swcurran

swcurran commented Oct 2, 2024

Then again, why use a CESR style encoding at all for content addressable identifiers? If you just want one type of hashing algorithm then you don't need any type codes.

We only want one or two hashing algorithms today, but a future OCA spec might want more. Hence the need for identifying the algorithm used in a given hash instance. I would personally rather use multihash, as I think (but don’t know) it is more likely to be accepted by a broader community.

Thanks for your help in this — especially in getting the algorithm clearly defined. That is a huge help.

I do have a dog in the fight in that I want OCA to be accepted broadly, as we are using it and getting great benefit from it. Complexity without benefit is a barrier to acceptance of any spec, and we need to reduce complexity.

@swcurran

@kentbull — this has come up again, as we are finally working on getting the SAID calculation into the OCA spec. Yay!

I want to point a developer to your blog post to see if they can generate a compatible hash using just that info. But I see what I think is a big omission in your algorithm — not sure how I missed it before. In your “7 steps”, you either leave off the part in step 3 of canonicalizing the JSON before hashing, or (I guess) assume that the input JSON is in canonicalized form. I think there needs to be an explicit reference to generating the JCS from the input JSON before calculating the hash (hash = hash(jcs(JSON-with-###s))). You don’t want to just hash the input JSON, as you have no idea what has changed from when the JSON was created and when it is being verified. The input JSON is not altered by the canonicalization (it’s only done in generating the hash value). I searched for “canon” in your blog post and didn’t find anything.
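For illustration, a minimal sketch of hash(jcs(JSON)): json.dumps with sorted keys and compact separators approximates JCS (RFC 8785) for simple ASCII-safe data, though a real JCS implementation also has number and string normalization rules. SHA3-256 stands in for whichever digest is chosen.

```python
# Illustrative only: sorted keys + compact separators approximate JCS
# (RFC 8785) for simple data; a real JCS library also normalizes numbers
# and string escapes.
import hashlib
import json

def canonical_digest(obj) -> bytes:
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha3_256(canonical.encode("utf-8")).digest()
```

The point of canonicalizing first is visible in use: two serializations with different key order produce the same digest.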

Thoughts?
