
docs: datawherehouse location claim + store/publish #13

Merged (7 commits, Mar 20, 2024)

Conversation

@vasco-santos (Contributor) commented Mar 7, 2024:

HTML View

This RFC proposes datawherehouse to keep track of the locations of bytes stored with write targets as part of data anywhere. In short, this is a datastore for locations that also requires a flow such as the one proposed in #10.

@vasco-santos force-pushed the docs/datawherehouse branch from b044e2b to 0d9e9ac on March 7, 2024 15:39
@vasco-santos force-pushed the docs/datawherehouse branch from 0d9e9ac to e59f282 on March 7, 2024 15:43
@vasco-santos requested a review from Gozala on March 11, 2024
@alanshaw (Member) left a comment:

Approve for "location claims with encoded params"!

On the other side of things, the next diagram presents the flow when a client wants to read some data, illustrated with the following steps:

1. Client requests to read some bytes by the CID of the data
2. Service discovers where the requested bytes are stored, relying on the content claims service and the materialized claims from `datawherehouse`
A reviewer (Member) commented:

I think this is not correct anymore, right?

@Gozala (Contributor) left a comment:

I provided some feedback inline. I did not approve the PR only because, reading this document, I'm left with the impression that two conflicting proposals are made. Thinking more on this and digesting all the shared insights, I'm coming to the following conclusion:

  1. Client could issue a receipt that they uploaded content (IN THE FUTURE), but that should not trigger the pipeline, meaning it should not make that content available on IPFS or put it into Filecoin.
    • I think this would also align with the potential private data requirement that has been brought up by product.
  2. Client should explicitly request a read URL for the content, which is effectively a reframing of store/deliver. When a user requests content identified by CID to be readable, we can verify that we have it and either issue a location claim or produce an error saying we don't have it.
    • Furthermore, we should probably consider incorporating a TTL in the user request, because forever is impossible in the finite bounds of the universe. In fact, accepting a TTL is a commitment, and we probably could/should reflect this in billing.
    • This also may set us up on a path of supporting read URLs that are not public, which I think is a matter of time. I imagine such URLs could be under the space DID and require a UCAN token to perform reads. Not something I propose we tackle now, but it is good to consider how things would fit together.
  3. When a client requests a content read URL, we should respond with a location claim from us (the audience should be the DID of the `with` of the request). That means the client can publish the claim on our behalf wherever they wish, because it is a commitment from us.
  4. I think we really should design it with RAW CIDs as opposed to CAR CIDs, because we do not verify those are CARs. I'm not suggesting rewriting everything here, but I would encourage doing it for the new capabilities, as it is easy to go from CAR <-> RAW CID; it's just switching a codec byte (see the sketch below). Alternatively, we could use a multihash instead.
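
A minimal sketch of point 4's codec switch, using the multiformats JS library; the helper names `carToRaw`/`rawToCar` are illustrative, not from this thread:

```ts
import { CID } from 'multiformats/cid'

// Codec codes from the multicodec table.
const RAW = 0x55
const CAR = 0x0202

// Converting keeps the multihash and only swaps the codec.
const carToRaw = (cid: CID) => CID.createV1(RAW, cid.multihash)
const rawToCar = (cid: CID) => CID.createV1(CAR, cid.multihash)
```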

@prodalex commented Mar 14, 2024:

> Client could issue a receipt that they uploaded content (IN THE FUTURE), but that should not trigger the pipeline, meaning it should not make that content available on IPFS or put it into Filecoin.
> I think this would also align with the potential private data requirement that has been brought up by product.

Yeah. The client will eventually need to be able to decide whether content should be public, private, kept in hot storage (maybe even the number of replicas/store targets, though that is to be considered in billing), or just archived in Filecoin and then deleted from hot storage once the CommP and pieceID are available.

We should also think about object lifecycle management. The customer might want to set up rules so that content is automatically deleted from hot storage after X time, or X time without reads, etc., and only kept in Filecoin.

This doesn't need to go into the current design, but keep it in mind.

The same goes for permission policies and roles for CRUD.

@vasco-santos changed the title from "docs: datawherehouse" to "docs: datawherehouse location claim + store/publish" on Mar 14, 2024
Comment on lines 78 to 93
Return on success the following receipt:

```json
{
  "ran": "bafy...storePublish",
  "out": {
    "ok": {
      "link": { "/": "bafy...BLOBCID" }, // RAW CID - going CAR <-> RAW CID is just switching a codec byte
      "location": "https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key"
    }
  },
  "fx": {}
}
```

TODO: Should return the actual location claim? How would we do that?
vasco-santos (Author) commented:

TODO: would love your help here @Gozala ?

vasco-santos (Author) commented:

Or maybe a CID for it? But the client can't really read one today. Maybe the CID of the receipt stating that the claim was made?

A reviewer (Contributor) commented:

I was suggesting to take the actual signed location claim, provide a link to it, and inline the block in the response. Our client can take care of decoding it for the users.

A reviewer (Contributor) commented:

An alternative can be to simply inline the content claim, as the client can derive the same CID on their own. I'm getting more bullish on this since #8, but whether we inline or embed the location claim is more of an implementation detail, so I don't care that much.
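
A sketch of what "derive the same CID on their own" could look like client-side, assuming the claim block is DAG-CBOR encoded and sha2-256 hashed (both assumptions, not settled in this thread):

```ts
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'

// Re-derive the claim CID from the inlined claim bytes in the response.
// 0x71 is the dag-cbor multicodec code.
const deriveClaimCid = async (claimBytes: Uint8Array) =>
  CID.createV1(0x71, await sha256.digest(claimBytes))
```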

vasco-santos (Author) commented:

Embed would mean we need to implement in ucanto what we talked about in Istanbul, attaching things via the receipt builder, but that seems good.

"rsc": "did:key:abc...space",
"input": {
"link": { "/": "bafy...BLOBCID" }, // RAW CID - go from CAR <-> RAW CID it's just switching a codec byte
"url": "https://..." // write target presignedurl previously provided
vasco-santos (Author) commented:

Maybe I should include some info for this. My reasoning for putting this here is to let the service know what the write target actually was, so that the service can check it is there. Otherwise, we would need to find out by storing presigned URLs somewhere. Thoughts?

A reviewer (Contributor) commented:

I would prefer not to. We already need to store a record in the DB that the user has allocated space for the blob, and we also need to check that the user has such an allocation. It seems like it would make perfect sense for us to store the details about where space was allocated in that DB record.

Alternatively, I would suggest putting a link to our store/add receipt, which should contain all of this info.

A reviewer (Contributor) commented:

Thinking more about this, I feel like this is not a good requirement, because the client may have to do the /add request in one session and /publish in another. Having to maintain state across these invocations seems like incidental complexity, and I would advocate against it.

vasco-santos (Author) commented:

> Thinking more about this, I feel like this is not a good requirement, because the client may have to do the /add request in one session and /publish in another. Having to maintain state across these invocations seems like incidental complexity, and I would advocate against it.

I'm not sure what you meant. My point of having the URL was precisely what I think you mean here: not requiring state on our end across multiple invocations. Note that I am also assuming that we can derive the write target details via the presigned URL, which I am not entirely sure will be true.

A reviewer (Contributor) commented:

You are thinking about the state we (the service) need to maintain, and I was referring to the state the client will need to maintain. Requiring the client to be stateful, or depending on them providing the right URL, is not a good choice.

The point I was making is that we (the service) need to look up state in the DB anyway, so extending the (state) record with a URL so the client can be stateless was my recommendation. That also reduces the surface for errors where the client fails to provide a URL, or a correct URL for that matter.
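
A sketch of the allocation record this recommendation implies; the shape and field names are assumptions, not part of the RFC:

```ts
// Hypothetical DB record kept per allocation so the client can stay
// stateless across the /add and /publish invocations.
interface AllocationRecord {
  space: string       // did:key of the space the blob was allocated in
  blob: string        // multihash (or raw CID) identifying the blob
  writeTarget: string // e.g. bucket/region/key where space was allocated
  invocation: string  // CID of the store/add invocation or its receipt
}
```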

vasco-santos (Author) commented:

👌🏼

@Gozala (Contributor) left a comment:

I have proposed various things here, but I think we can call this a consensus and iterate on the remaining details in the spec.

* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
* Service handler verifies the requested blob was stored on the client's space, issues a location claim with encoded information for the location, and returns it to the client
* Can be batched with other invocations like `filecoin/offer`
A reviewer (Contributor) commented:

I do not follow the batching part here; mind elaborating more on this, please?

A reviewer (Contributor) commented:

Or delete it if it is not important.

vasco-santos (Author) commented:

Not important, but it was more towards the idea that we should batch all follow-up invocations of store/add that can happen in parallel.

## High level flow of proposed solution

* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
A reviewer (Contributor) commented:

Given the discussions around graceful upgrades, I propose we leverage the #12 proposal to decide whether to direct the client to buckets that assume events vs. buckets that do not. At the same time, I would like to address the "claimed CARs aren't always CARs" issue and propose that we introduce a new /space/content/add/blob capability that is effectively a store/add with the following differences:

  1. It does not imply publishing (or advertising content on the network).
  2. It does not have an origin field.
  3. The link field is renamed to blob and is turned into a multihash (I can be convinced that it should be a raw CID instead).

This would put us in a better position because we won't assume that uploaded bytes represent a valid CAR, and we will know that we can direct this request to the bucket without events.

I would propose renaming the above-described store/publish to /space/content/add/blob/publish instead.

Overall this will provide us with a much more granular upgrade path in comparison to blanket protocol versioning.

P.S.: We could also consider adding semver to the end of the ability, e.g. /0.0.1, if we feel like it. This is something that @gobengo has advocated for on several occasions. I'm not convinced, because I feel like feature detection has worked better in web-like open systems than versioning.
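
A sketch of what the proposed capability could look like as a ucanto-style definition; the schema helpers and exact shapes are assumptions, not settled in this thread:

```ts
import { capability, Schema } from '@ucanto/validator'

// Sketch only: unlike store/add there is no `origin` field, publishing is
// not implied, and the content is identified by multihash rather than a
// CAR link.
export const addBlob = capability({
  can: 'space/content/add/blob',
  with: Schema.did({ method: 'key' }),
  nb: Schema.struct({
    blob: Schema.struct({
      digest: Schema.bytes(),  // multihash bytes of the blob
      size: Schema.integer()
    })
  })
})
```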

vasco-santos (Author) commented:

Yeah, I think we can work in that direction indeed!

vasco-santos (Author) commented:

`/space/content/add/blob/publish`: yeah, I think I agree with this direction, plus starting with the blob work for the new path.

3. Client writes the bytes into the write target
4. Client requests the service to serve the written bytes under the given stored CID
5. Service verifies that the bytes are stored in the provided write target/space and writes a claim about it.
6. Service issues a receipt stating the bytes are stored by the service, per a given location claim.
A reviewer (Contributor) commented:

While it is not strictly necessary, I would really like us to write the issued receipt containing the content claim into the user space so we can charge users for it. We do not have to actually store it there, but I'd like us to create a record so we could bill users for it. @alanshaw is probably best positioned to advise on what would be the best way to go about this.

P.S. In the future we may consider letting the user request a location claim without performing a publish, and let them decide when to add it to their space to perform the actual publishing. E.g. /space/content/get/blob could be used to obtain a content location claim without us publishing it anywhere.

A reviewer (Contributor) commented:

Relatedly, I would suggest that /space/content/add/blob returns a content location claim if we already have the corresponding blob in the bucket.

vasco-santos (Author) commented:

While I think that would be great, I think we need to figure out a lot of things about the UX of this. I do not want to make it easy for folks to delete these files, and I would like them to be associated with the actual CAR in such a way that they could be deleted together.

Perhaps we can try the same structure as proposed in the hierarchical capabilities: have a "folder" as the CAR CID, which then has its bytes, location claim, needed indexes, etc. I would like to make this part of the next iteration, though, so as not to increase the scope further.
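
A minimal sketch of the earlier "return a location claim when the blob already exists" suggestion; every helper name here is hypothetical:

```ts
type Deps = {
  headBucketObject: (digest: Uint8Array) => Promise<{ url: string } | null>
  issueLocationClaim: (digest: Uint8Array, url: string) => Promise<string>
  createPresignedUploadUrl: (digest: Uint8Array) => Promise<string>
}

// If the blob is already in the bucket, short-circuit with a location
// claim instead of handing out another upload URL.
const handleAddBlob = async (digest: Uint8Array, deps: Deps) => {
  const existing = await deps.headBucketObject(digest)
  return existing
    ? { ok: { claim: await deps.issueLocationClaim(digest, existing.url) } }
    : { ok: { url: await deps.createPresignedUploadUrl(digest) } }
}
```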


## Location claims encoding location hints

The content claims service is currently deployed implementing the [content claims spec](https://github.com/web3-storage/specs/pull/86). Among other claims, it provides [Location Claims](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view#Location-Claims), which MAY be used to claim that the bytes that hash to a given CID are available at a given URL.
A reviewer (Contributor) commented:

I think the fact that the size of the content/blob is not captured (or is optional) is an unfortunate oversight. I would suggest making the range field required; that way the size of the content is clear from the claim.

"ran": "bafy...storePublish",
"out": {
"ok": {
"link" : { "/": "bafy...BLOBCID" }, // RAW CID - go from CAR <-> RAW CID it's just switching a codec byte
A reviewer (Contributor) commented:

In the new capabilities I'd suggest using `blob` or `content` as the field name instead of `link`.


While thinking about using location claims to record where bytes are stored by the service, there are a few characteristics we want:
- the location claim MUST resolve to public and fetchable URLs
- the location in a location claim SHOULD (ideally) not change recurrently, given it MAY negatively impact the reputation of a party. However, we should consider letting the client choose how long the location claim should be valid for.
A reviewer (Contributor) commented:

I would say it should not change within the commitment window of the claim.

Comment on lines 124 to 130
### _private_ location claims

_private_ location claims would enable us not to expose these claims directly to the user, given that their sole purpose at the moment is internal routing. This would enable queries from w3s read/write interfaces to know where the bytes for a CID are stored.

With this building block we can issue claims that MAY not be public and fetchable URLs, as well as avoid worries about a potential future data migration.

A _private_ location claim MAY look like:
A reviewer (Contributor) commented:

I'm starting to think that private location claims are obsolete. A content claim should include whatever metadata it needs to make anyone holding it able to perform a read. We can simply leverage the UCAN auth system for the rest.

In other words, query params ≈ UCAN auth header. The latter additionally gives us the ability to choose who can exercise it, while the former is public by default.
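
A sketch of the "query params ≈ UCAN auth header" idea; `blobUrl` and `ucanJwt` are placeholders, not part of any current API:

```ts
// Exercise a non-public read location with a UCAN bearer token instead of
// signed query params.
const readBlob = async (blobUrl: string, ucanJwt: string) => {
  const res = await fetch(blobUrl, {
    headers: { Authorization: `Bearer ${ucanJwt}` }
  })
  if (!res.ok) throw new Error(`read failed: ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```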

vasco-santos (Author) commented:

Since we are all in agreement not to use this, I will just drop it from the RFC. FWIW, it is currently there as an "alternative", and what this RFC actually proposes is specified later.

"input": {
"content" : CID /* // RAW CID */,
"location": "`https://<BUCKET_NAME>.<REGION>.web3.storage/<CID>/<CID>.car`",
"range" : [ start, end ] /* Optional: Byte Range in URL
A reviewer (Contributor) commented:

We should make this a required field so the size of the content can be inferred from the claim.

vasco-santos (Author) commented:

Well, in this case we mean blobs (or CAR files), which in theory is the entire thing. But yeah, since we need to do a HEAD request anyway to see if it is in the write target, we can make this required, as it is really no extra cost.
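
A sketch of that HEAD check, deriving the now-required range from Content-Length; treating the range as an inclusive `[start, end]` is an assumption:

```ts
// Verify the blob exists at the write target and compute the byte range
// covering the whole blob.
const verifyAndRange = async (url: string): Promise<[number, number]> => {
  const res = await fetch(url, { method: 'HEAD' })
  if (!res.ok) throw new Error(`blob not found at write target (${res.status})`)
  const size = Number(res.headers.get('content-length'))
  return [0, size - 1]
}
```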
