docs: datawherehouse location claim + store/publish #13
Conversation
Approving for "location claims with encoded params"!
rfc/datawherehouse.md (Outdated)
On the other side of things, the next diagram presents the flow when a client wants to read some data, and is illustrated with the following steps:

1. Client requests to read some bytes by the CID of the data
2. Service discovers where the requested bytes are stored, relying on the content claims service and the materialized claims from `datawherehouse`
I think this is not correct anymore, right?
I provided some feedback inline. I did not approve the PR only because, reading this document, I'm left with the impression that two conflicting proposals are made. Thinking more on this and digesting all the shared insights, I think I'm coming to the following conclusion:

- Client could issue a receipt that they uploaded content (IN THE FUTURE), but that should not trigger the pipeline, meaning it should not make that content available on IPFS or put it into Filecoin.
- I think this would also align with the potential private data requirement that has been brought up by product.
- Client should explicitly request a read URL for the content, which is effectively a reframing of `store/deliver`. When a user requests content identified by CID to be readable, we can verify that we have it and either issue a location claim or produce an error saying we don't have it.
- Furthermore, we probably should consider incorporating a TTL in the user request, because forever is impossible in the finite bounds of the universe. In fact, accepting a TTL is a commitment and we probably could/should reflect this in billing.
- This also may set us up on a path of supporting read URLs that are not public, which I think is a matter of time. I imagine such URLs could be under the space DID and require a UCAN token to perform reads. Not something I propose we tackle now, but it is good to consider how things would fit together.
- When a client requests a content read URL, we should respond with a location claim from us (the audience should be the DID of the `with` of the request). That means the client can publish the claim on our behalf wherever they wish, because it is a commitment from us.
- I think we really should design this with RAW CIDs as opposed to CAR CIDs, because we do not verify those are CARs. I'm not suggesting rewriting everything here, but I would encourage doing it for the new capabilities, as it is easy to go from a CAR CID to a RAW CID: it's just switching a codec byte (see the sketch below). Alternatively, we could use a multihash instead.
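A minimal sketch of that codec swap, assuming the `multiformats` JS library (0x55 is the multicodec code for raw binary, 0x0202 for CAR):

```ts
import { CID } from 'multiformats/cid'

const RAW_CODE = 0x55   // multicodec code for raw binary
const CAR_CODE = 0x0202 // multicodec code for CAR

// Re-tag a CAR CID as a RAW CID (and back) by swapping the codec while
// keeping the same multihash digest; the addressed bytes do not change.
export const toRawCID = (cid: CID): CID => CID.createV1(RAW_CODE, cid.multihash)
export const toCarCID = (cid: CID): CID => CID.createV1(CAR_CODE, cid.multihash)
```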
Yeah. The client will need to be able to decide eventually if content should be public, private, kept in hot storage (maybe even the number of replicas/store targets, but to be considered in billing), or actually just be archived in Filecoin and then deleted from hot storage once the CommP and pieceID are available. We should also think about object lifecycle management: the customer might want to set up rules where content is automatically deleted from hot storage after X time, or X time without reads, etc., and only kept in Filecoin. This doesn't need to go into the current design, but keep it in mind. The same goes for permission policies and roles for CRUD.
rfc/datawherehouse.md (Outdated)
Return on success the following receipt:

```json
{
  "ran": "bafy...storePublish",
  "out": {
    "ok": {
      "link": { "/": "bafy...BLOBCID" }, // RAW CID - going from CAR <-> RAW CID is just switching a codec byte
      "location": "https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key"
    }
  },
  "fx": {}
}
```

TODO: Should this return the actual location claim? How would we do that?
TODO: would love your help here @Gozala?
Or maybe a CID for it? But the client can't really read one today. Maybe the CID of a receipt that the claim was made?
I was suggesting to take the actual signed location claim, provide a link to it, and inline the block in the response. Our client can take care of decoding it for the users.
An alternative can be to simply inline the content claim, as the client can derive the same CID on their own. I'm getting more bullish on this since #8, but whether we inline or embed the location claim is more of an implementation detail, so I don't care that much.
Embedding would mean we need to implement in ucanto what we talked about in Istanbul to attach things via the receipt builder, but that seems good.
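For illustration, a hedged sketch of a receipt that links to the signed location claim and ships its block inline; the `claim` field name and overall shape are assumptions, not a settled wire format:

```ts
// Hypothetical receipt shape: `claim` links to the signed location claim,
// whose block travels inline with the response so the client can decode it
// without an extra fetch. All names here are illustrative.
const receipt = {
  ran: 'bafy...storePublish',
  out: {
    ok: {
      blob: { '/': 'bafy...BLOBCID' },       // RAW CID of the stored bytes
      claim: { '/': 'bafy...LOCATIONCLAIM' } // link to the inlined claim block
    }
  },
  fx: {}
}
```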
"rsc": "did:key:abc...space", | ||
"input": { | ||
"link": { "/": "bafy...BLOBCID" }, // RAW CID - go from CAR <-> RAW CID it's just switching a codec byte | ||
"url": "https://..." // write target presignedurl previously provided |
Maybe I should include some info for this. My reasoning for including it is to let the service know what the write target actually was, so that the service can check the content is there. Otherwise, we would need to find out by storing presigned URLs somewhere. Thoughts?
I would prefer not to. We already need to store a record in the DB that the user has allocated space for the blob, and we also need to check that the user has such an allocation. It seems like it would make perfect sense for us to store the details about where the space was allocated in that DB record.

Alternatively, I would suggest putting a link to our `store/add` receipt, which should contain all of this info.
Thinking more about this, I feel like this is not a good requirement, because the client may have to request `/add` in one session and do `/publish` in another. Having to maintain state across these invocations seems like incidental complexity and I would advocate against it.
> Thinking more about this, I feel like this is not a good requirement, because the client may have to request `/add` in one session and do `/publish` in another. Having to maintain state across these invocations seems like incidental complexity and I would advocate against it.

Not sure what you meant. My point of having the URL was precisely what I think you mean here: not requiring state on our end across multiple invocations. Note that I am also assuming that we can derive write target details via the presigned URL, which I am not entirely sure will be true.
You are thinking about the state we (the service) need to maintain, and I was referring to the state the client will need to maintain. Requiring the client to be stateful, or depending on them providing the right URL, is not a good choice.

The point I was making is that we (the service) need to look up state in the DB anyway, so extending the (state) record with a URL so the client can be stateless was my recommendation. That also reduces the surface for errors where the client fails to provide a URL, or a correct URL for that matter.
👌🏼
I have proposed various things here, but I think we can call this a consensus and can iterate on remaining details in the spec.
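To make that consensus concrete, here is a hedged sketch of the allocation record the service could keep so the client stays stateless across `/add` and `/publish`; every field name here is an assumption:

```ts
// Hypothetical allocation record: written at /add time and looked up at
// /publish time, so the client never has to echo the presigned URL back.
interface BlobAllocation {
  space: string      // space DID, e.g. 'did:key:abc...space'
  blob: string       // multihash (or RAW CID) of the allocated bytes
  writeTarget: {     // where the presigned URL pointed, recorded by the service
    region: string
    bucket: string
    key: string
  }
  insertedAt: string // ISO-8601 timestamp of the allocation
}
```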
rfc/datawherehouse.md (Outdated)
* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
* Service handler verifies the requested blob was stored on the client space, issues a location claim with encoded information for location and returns it to the client
* Can be batched with other invocations like `filecoin/offer`
I do not follow the batching part here, mind elaborating more on this please?
Or delete it if it is not important.
Not important, but it was more towards the idea that we should batch all follow-up invocations of `store/add` that can happen in parallel.
## High level flow of proposed solution

* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
Given the discussions around graceful upgrades, I propose we leverage the #12 proposal to decide whether to direct the client to buckets that assume events vs buckets that do not. At the same time, I would like to address the "claimed CARs aren't always CARs" issue and propose that we introduce a new `/space/content/add/blob` capability that is effectively a `store/add` with the following differences:

- It does not imply publishing (or advertising content on the network).
- It does not have an `origin` field.
- The `link` field is renamed to `blob` and is turned into a multihash (I can be convinced that it should be a raw CID instead).

This would put us in a better position because we won't assume that uploaded bytes represent a valid CAR, and we will know that we can direct this request to the bucket without events.

I would propose renaming the above-described `store/publish` to `/space/content/add/blob/publish` instead; a sketch of the resulting invocation shapes follows below.

Overall this will provide us with a much more granular upgrade path in comparison to blanket protocol versioning.

P.S.: We could also consider adding semver to the end of the ability, e.g. `/0.0.1`, if we feel like it. This is something that @gobengo advocated for on several occasions. I'm not convinced, because I feel like feature detection has worked better in web-like open systems than versioning.
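A hedged sketch of the invocation shapes this proposal implies; beyond the ability names and the `blob` field discussed above, everything here is an assumption:

```ts
type SpaceDID = `did:key:${string}` // space DID, e.g. 'did:key:abc...space'

// Proposed /space/content/add/blob: no publishing implied, no `origin`
// field, and `blob` carries a multihash rather than a CAR CID.
interface AddBlobInvocation {
  can: '/space/content/add/blob'
  with: SpaceDID
  nb: {
    blob: Uint8Array // multihash bytes of the uploaded content
  }
}

// Proposed rename of store/publish; publishing stays an explicit, separate step.
interface PublishBlobInvocation {
  can: '/space/content/add/blob/publish'
  with: SpaceDID
  nb: {
    blob: Uint8Array // multihash of the previously added content
  }
}
```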
Yeah, I think we can work in that direction indeed!
> `/space/content/add/blob/publish`

Yeah, I think I agree with this direction + starting with the blob work for the new path.
3. Client writes the bytes into the write target
4. Client requests the service to serve the written bytes under the given stored CID
5. Service verifies that the bytes are stored in the provided write target/space and writes a claim about it.
6. Service issues a receipt stating the bytes are being stored by the service on a given location claim.
While it is not strictly necessary, I would really like us to write the issued receipt containing a content claim into the user space so we can charge users for it. We do not have to actually store it there, but I'd like us to create a record so we could bill users for it. @alanshaw is probably best positioned to advise on what would be the best way to go about this.

P.S. In the future we may consider letting the user request a location claim without performing a publish, and make them decide when to add it to their space to perform the actual publishing. E.g. `/space/content/get/blob` could be used to obtain a content location claim without us publishing it anywhere.
Relatedly, I would suggest that `/space/content/add/blob` return a content location claim if we already have the corresponding blob in the bucket.
While I think that would be great, I think we need to figure out a lot of things about the UX of this. I do not want to make it easy for folks to delete these files, and I would like them to be associated with the actual CAR in a way that they could be deleted together.

Perhaps we can try the same structure as proposed in the hierarchical capabilities: have a "folder" as the CAR CID, which then has its bytes, location claim, needed indexes, etc. I would like to try to make this part of the next iteration, though, to not increase the scope further.
## Location claims encoding location hints

Content claims service is currently deployed implementing the [content claims spec](https://github.com/web3-storage/specs/pull/86). Among other claims, it provides [Location Claims](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view#Location-Claims) which MAY be used to claim that the bytes that hash to a given CID are available at a given URL.
I think the fact that the size of the content/blob is not captured (or is optional) is an unfortunate oversight. I would suggest making the `range` field required; that way the size of the content is clear from the claim.
rfc/datawherehouse.md (Outdated)
"ran": "bafy...storePublish", | ||
"out": { | ||
"ok": { | ||
"link" : { "/": "bafy...BLOBCID" }, // RAW CID - go from CAR <-> RAW CID it's just switching a codec byte |
In the new capabilities I'd suggest using `blob` or `content` as the field name instead of `link`.
rfc/datawherehouse.md (Outdated)
While thinking about using location claims to record where bytes are stored by the service, there are a few characteristics we want to have:

- location claim MUST resolve to public and fetchable URLs
- location in a location claim SHOULD (ideally) not change recurrently, given that it MAY negatively impact the reputation of a party. However, we should consider letting the client choose how long the location claim should be valid for.
I would say it should not change within the commitment window of the claim.
rfc/datawherehouse.md (Outdated)
### _private_ location claims

_private_ location claims would enable us to not expose these claims directly to the user, given their sole purpose at the moment is internal routing. This would enable queries of w3s read/write interfaces to know where the bytes for a CID are stored.

With this building block we can issue claims that MAY not be public and fetchable URLs, as well as not worry about a potential future data migration.

A _private_ location claim MAY look like:
I'm starting to think that private location claims are obsolete. A content claim should include whatever metadata it needs in the claim to make anyone holding it able to perform a read. We can simply leverage the UCAN auth system for the rest.

In other words, query params ≈ UCAN auth header. The latter additionally gives us the ability to choose who can exercise it, while the former is public by default.
Since we are all in agreement not to use this, I will just drop it from the RFC. FWIW, this is currently there as an "alternative"; what this RFC proposes is specified later.
rfc/datawherehouse.md (Outdated)
"input": { | ||
"content" : CID /* // RAW CID */, | ||
"location": "`https://<BUCKET_NAME>.<REGION>.web3.storage/<CID>/<CID>.car`", | ||
"range" : [ start, end ] /* Optional: Byte Range in URL |
We should make this a required field so the size of the content can be inferred from the claim.
Well, in this case we mean blobs (or CAR files), where in theory the range is the entire thing. But yeah, since we need to do a HEAD request to see if it is in the write target, we can make this required as it is really no extra cost, as sketched below.
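A minimal sketch of that check, assuming the write target honours HEAD requests and reports `content-length`; the function and field names are illustrative:

```ts
import type { CID } from 'multiformats/cid'

// Verify the blob landed at the write target and derive the now-required
// `range` from the reported size before issuing the location claim.
async function buildLocationClaimInput(content: CID, location: URL) {
  const head = await fetch(location, { method: 'HEAD' })
  const length = head.headers.get('content-length')
  if (!head.ok || length === null) {
    throw new Error(`blob not found at write target: HTTP ${head.status}`)
  }
  const size = Number(length)
  return {
    content,                       // RAW CID of the blob
    location: location.toString(),
    range: [0, size - 1]           // required: the claim now pins the content size
  }
}
```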
This RFC proposes `datawherehouse` to keep track of the location of bytes stored with write targets as part of data anywhere. In short, this is a datastore for locations, which also requires a flow such as the one proposed in #10.