
docs: datawherehouse location claim + store/publish #13

Merged (7 commits, Mar 20, 2024)

Conversation

@vasco-santos (Contributor) commented Mar 7, 2024:

HTML View

This RFC proposes datawherehouse to keep track of the locations of bytes stored with write targets as part of data anywhere. In short, this is a datastore for locations that also requires a flow such as the one proposed in #10.

@vasco-santos force-pushed the docs/datawherehouse branch from b044e2b to 0d9e9ac on March 7, 2024 15:39
@vasco-santos force-pushed the docs/datawherehouse branch from 0d9e9ac to e59f282 on March 7, 2024 15:43
@vasco-santos requested a review from Gozala on March 11, 2024
@alanshaw (Member) left a comment:

Approve for "location claims with encoded params"!

On the other side of things, the next diagram presents the flow when a client wants to read some data, illustrated with the following steps:

1. Client requests to read some bytes by the CID of the data
2. Service discovers where the requested bytes are stored, relying on the content claims service and the materialized claims from `datawherehouse`
A reviewer (Member) commented:

I think this is not correct anymore, right?

@Gozala (Contributor) left a comment:

I provided some feedback inline. I did not approve the PR only because, reading this document, I'm left with the impression that two conflicting proposals are made. Thinking more on this and digesting all the shared insights, I'm coming to the following conclusion:

  1. Client could issue a receipt that they uploaded content (IN THE FUTURE), but that should not trigger the pipeline, meaning it should not make that content available on IPFS or put it into Filecoin.
    • I think this would also align with the potential private data requirement that has been brought up by product.
  2. Client should explicitly request a read URL for the content, which is effectively a reframing of store/deliver. When a user requests content identified by CID to be readable, we can verify that we have it and either issue a location claim or produce an error saying we don't have it.
    • Furthermore, we should probably consider incorporating a TTL in the user request, because forever is impossible in the finite bounds of the universe. In fact, accepting a TTL is a commitment, and we probably could/should reflect this in billing.
    • This also may set us up on a path of supporting read URLs that are not public, which I think is a matter of time. I imagine such URLs could be under the space DID and require a UCAN token to perform reads. Not something I propose we tackle now, but it is good to consider how things would fit together.
  3. When a client requests a content read URL, we should respond with a location claim from us (the audience should be the DID of the `with` of the request). That means the client can publish the claim on our behalf wherever they wish, because it is a commitment from us.
  4. I think we really should design it with RAW CIDs as opposed to CAR CIDs, because we do not verify those are CARs. I'm not suggesting rewriting everything here, but I would encourage doing it for the new capabilities, as it is easy to go from CAR <-> RAW CID; it's just switching a codec byte (see the sketch below). Alternatively, we could use a multihash instead.
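
A minimal sketch of point 4's codec switch, using the multiformats JS library; the helper names `carToRaw`/`rawToCar` are illustrative, not from this thread:

```ts
import { CID } from 'multiformats/cid'

// Codec codes from the multicodec table.
const RAW = 0x55
const CAR = 0x0202

// Converting keeps the multihash and only swaps the codec.
const carToRaw = (cid: CID) => CID.createV1(RAW, cid.multihash)
const rawToCar = (cid: CID) => CID.createV1(CAR, cid.multihash)
```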

@prodalex commented Mar 14, 2024:

> Client could issue a receipt that they uploaded content (IN THE FUTURE), but that should not trigger the pipeline, meaning it should not make that content available on IPFS or put it into Filecoin.
> I think this would also align with the potential private data requirement that has been brought up by product.

Yeah. The client will eventually need to be able to decide whether content should be public, private, kept in hot storage (maybe even the number of replicas/store targets, though that is to be considered in billing), or just archived in Filecoin and then deleted from hot storage once the CommP and pieceID are available.

We should also think about object lifecycle management. The customer might want to set up rules so that content is automatically deleted from hot storage after X time, or X time without reads, etc., and only kept in Filecoin.

This doesn't need to go into the current design, but keep it in mind.

The same goes for permission policies and roles for CRUD.

@vasco-santos changed the title from "docs: datawherehouse" to "docs: datawherehouse location claim + store/publish" on Mar 14, 2024
Comment on lines 78 to 93
Return on success the following receipt:

```json
{
  "ran": "bafy...storePublish",
  "out": {
    "ok": {
      "link": { "/": "bafy...BLOBCID" }, // RAW CID - going CAR <-> RAW CID is just switching a codec byte
      "location": "https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key"
    }
  },
  "fx": {}
}
```

TODO: Should return the actual location claim? How would we do that?
vasco-santos (Author) commented:

TODO: would love your help here @Gozala ?

vasco-santos (Author) commented:

Or maybe a CID for it? But the client can't really read one today. Maybe the CID of the receipt stating that the claim was made?

A reviewer (Contributor) commented:

I was suggesting to take the actual signed location claim, provide a link to it, and inline the block in the response. Our client can take care of decoding it for the users.

A reviewer (Contributor) commented:

An alternative can be to simply inline the content claim, as the client can derive the same CID on their own. I'm getting more bullish on this since #8, but whether we inline or embed the location claim is more of an implementation detail, so I don't care that much.
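
A sketch of what "derive the same CID on their own" could look like client-side, assuming the claim block is DAG-CBOR encoded and sha2-256 hashed (both assumptions, not settled in this thread):

```ts
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'

// Re-derive the claim CID from the inlined claim bytes in the response.
// 0x71 is the dag-cbor multicodec code.
const deriveClaimCid = async (claimBytes: Uint8Array) =>
  CID.createV1(0x71, await sha256.digest(claimBytes))
```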

vasco-santos (Author) commented:

Embed would mean we need to implement in ucanto what we talked about in Istanbul, attaching things via the receipt builder, but that seems good.

"rsc": "did:key:abc...space",
"input": {
"link": { "/": "bafy...BLOBCID" }, // RAW CID - go from CAR <-> RAW CID it's just switching a codec byte
"url": "https://..." // write target presignedurl previously provided
vasco-santos (Author) commented:

Maybe I should include some info for this. My reasoning for putting this here is to let the service know what the write target actually was, so that the service can check it is there. Otherwise, we would need to find out by storing presigned URLs somewhere. Thoughts?

A reviewer (Contributor) commented:

I would prefer not to. We already need to store a record in the DB that the user has allocated space for the blob, and we also need to check that the user has such an allocation. It seems like it would make perfect sense for us to store the details about where space was allocated in that DB record.

Alternatively, I would suggest putting a link to our store/add receipt, which should contain all of this info.

A reviewer (Contributor) commented:

Thinking more about this, I feel like this is not a good requirement, because the client may have to do the /add request in one session and /publish in another. Having to maintain state across these invocations seems like incidental complexity, and I would advocate against it.

vasco-santos (Author) commented:

> Thinking more about this, I feel like this is not a good requirement, because the client may have to do the /add request in one session and /publish in another. Having to maintain state across these invocations seems like incidental complexity, and I would advocate against it.

I'm not sure what you meant. My point of having the URL was precisely what I think you mean here: not requiring state on our end across multiple invocations. Note that I am also assuming that we can derive the write target details via the presigned URL, which I am not entirely sure will be true.

A reviewer (Contributor) commented:

You are thinking about the state we (the service) need to maintain, and I was referring to the state the client will need to maintain. Requiring the client to be stateful, or depending on them providing the right URL, is not a good choice.

The point I was making is that we (the service) need to look up state in the DB anyway, so extending the (state) record with a URL so the client can be stateless was my recommendation. That also reduces the surface for errors where the client fails to provide a URL, or a correct URL for that matter.
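
A sketch of the allocation record this recommendation implies; the shape and field names are assumptions, not part of the RFC:

```ts
// Hypothetical DB record kept per allocation so the client can stay
// stateless across the /add and /publish invocations.
interface AllocationRecord {
  space: string       // did:key of the space the blob was allocated in
  blob: string        // multihash (or raw CID) identifying the blob
  writeTarget: string // e.g. bucket/region/key where space was allocated
  invocation: string  // CID of the store/add invocation or its receipt
}
```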

vasco-santos (Author) commented:

👌🏼

@Gozala (Contributor) left a comment:

I have proposed various things here, but I think we can call this a consensus and iterate on the remaining details in the spec.

* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
* Service handler verifies the requested blob was stored on the client's space, issues a location claim with encoded information for the location, and returns it to the client
* Can be batched with other invocations like `filecoin/offer`
A reviewer (Contributor) commented:

I do not follow the batching part here; mind elaborating more on this, please?

A reviewer (Contributor) commented:

Or delete it if it is not important.

vasco-santos (Author) commented:

Not important, but it was more towards the idea that we should batch all follow-up invocations of store/add that can happen in parallel.

## High level flow of proposed solution

* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
A reviewer (Contributor) commented:

Given the discussions around graceful upgrades, I propose we leverage the #12 proposal to decide whether to direct the client to buckets that assume events vs. buckets that do not. At the same time, I would like to address the "claimed CARs aren't always CARs" issue and propose that we introduce a new /space/content/add/blob capability that is effectively a store/add with the following differences:

  1. It does not imply publishing (or advertising content on the network).
  2. It does not have an origin field.
  3. The link field is renamed to blob and is turned into a multihash (I can be convinced that it should be a raw CID instead).

This would put us in a better position because we won't assume that uploaded bytes represent a valid CAR, and we will know that we can direct this request to the bucket without events.

I would propose renaming the above-described store/publish to /space/content/add/blob/publish instead.

Overall this will provide us with a much more granular upgrade path in comparison to blanket protocol versioning.

P.S.: We could also consider adding semver to the end of the ability, e.g. /0.0.1, if we feel like it. This is something that @gobengo has advocated for on several occasions. I'm not convinced, because I feel like feature detection has worked better in web-like open systems than versioning.
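
A sketch of what the proposed capability could look like as a ucanto-style definition; the schema helpers and exact shapes are assumptions, not settled in this thread:

```ts
import { capability, Schema } from '@ucanto/validator'

// Sketch only: unlike store/add there is no `origin` field, publishing is
// not implied, and the content is identified by multihash rather than a
// CAR link.
export const addBlob = capability({
  can: 'space/content/add/blob',
  with: Schema.did({ method: 'key' }),
  nb: Schema.struct({
    blob: Schema.struct({
      digest: Schema.bytes(),  // multihash bytes of the blob
      size: Schema.integer()
    })
  })
})
```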

vasco-santos (Author) commented:

Yeah, I think we can work in that direction indeed!

vasco-santos (Author) commented:

`/space/content/add/blob/publish`: yeah, I think I agree with this direction, plus starting with the blob work for the new path.

3. Client writes the bytes into the write target
4. Client requests the service to serve the written bytes under the given stored CID
5. Service verifies that the bytes are stored in the provided write target/space and writes a claim about it.
6. Service issues a receipt stating the bytes are stored by the service, per a given location claim.
A reviewer (Contributor) commented:

While it is not strictly necessary, I would really like us to write the issued receipt containing the content claim into the user space so we can charge users for it. We do not have to actually store it there, but I'd like us to create a record so we could bill users for it. @alanshaw is probably best positioned to advise on what would be the best way to go about this.

P.S. In the future we may consider letting the user request a location claim without performing a publish, and let them decide when to add it to their space to perform the actual publishing. E.g. /space/content/get/blob could be used to obtain a content location claim without us publishing it anywhere.

A reviewer (Contributor) commented:

Relatedly, I would suggest that /space/content/add/blob returns a content location claim if we already have the corresponding blob in the bucket.

vasco-santos (Author) commented:

While I think that would be great, I think we need to figure out a lot of things about the UX of this. I do not want to make it easy for folks to delete these files, and I would like them to be associated with the actual CAR in such a way that they could be deleted together.

Perhaps we can try the same structure as proposed in the hierarchical capabilities: have a "folder" as the CAR CID, which then has its bytes, location claim, needed indexes, etc. I would like to make this part of the next iteration, though, so as not to increase the scope further.
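
A minimal sketch of the earlier "return a location claim when the blob already exists" suggestion; every helper name here is hypothetical:

```ts
type Deps = {
  headBucketObject: (digest: Uint8Array) => Promise<{ url: string } | null>
  issueLocationClaim: (digest: Uint8Array, url: string) => Promise<string>
  createPresignedUploadUrl: (digest: Uint8Array) => Promise<string>
}

// If the blob is already in the bucket, short-circuit with a location
// claim instead of handing out another upload URL.
const handleAddBlob = async (digest: Uint8Array, deps: Deps) => {
  const existing = await deps.headBucketObject(digest)
  return existing
    ? { ok: { claim: await deps.issueLocationClaim(digest, existing.url) } }
    : { ok: { url: await deps.createPresignedUploadUrl(digest) } }
}
```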


## Location claims encoding location hints

The content claims service is currently deployed implementing the [content claims spec](https://github.com/web3-storage/specs/pull/86). Among other claims, it provides [Location Claims](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view#Location-Claims), which MAY be used to claim that the bytes that hash to a given CID are available at a given URL.
A reviewer (Contributor) commented:

I think the fact that the size of the content/blob is not captured (or is optional) is an unfortunate oversight. I would suggest making the range field required; that way the size of the content is clear from the claim.

"ran": "bafy...storePublish",
"out": {
"ok": {
"link" : { "/": "bafy...BLOBCID" }, // RAW CID - go from CAR <-> RAW CID it's just switching a codec byte
A reviewer (Contributor) commented:

In the new capabilities I'd suggest using `blob` or `content` as the field name instead of `link`.


While thinking about using location claims to record where bytes are stored by the service, there are a few characteristics we want:
- the location claim MUST resolve to public and fetchable URLs
- the location in a location claim SHOULD (ideally) not change recurrently, given it MAY negatively impact the reputation of a party. However, we should consider letting the client choose how long the location claim should be valid for.
A reviewer (Contributor) commented:

I would say it should not change within the commitment window of the claim.

Comment on lines 124 to 130
### _private_ location claims

_private_ location claims would enable us not to expose these claims directly to the user, given that their sole purpose at the moment is internal routing. This would enable queries from w3s read/write interfaces to know where the bytes for a CID are stored.

With this building block we can issue claims that MAY not be public and fetchable URLs, as well as avoid worries about a potential future data migration.

A _private_ location claim MAY look like:
A reviewer (Contributor) commented:

I'm starting to think that private location claims are obsolete. A content claim should include whatever metadata it needs to make anyone holding it able to perform a read. We can simply leverage the UCAN auth system for the rest.

In other words, query params ≈ UCAN auth header. The latter additionally gives us the ability to choose who can exercise it, while the former is public by default.
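
A sketch of the "query params ≈ UCAN auth header" idea; `blobUrl` and `ucanJwt` are placeholders, not part of any current API:

```ts
// Exercise a non-public read location with a UCAN bearer token instead of
// signed query params.
const readBlob = async (blobUrl: string, ucanJwt: string) => {
  const res = await fetch(blobUrl, {
    headers: { Authorization: `Bearer ${ucanJwt}` }
  })
  if (!res.ok) throw new Error(`read failed: ${res.status}`)
  return new Uint8Array(await res.arrayBuffer())
}
```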

vasco-santos (Author) commented:

Since we are all in agreement not to use this, I will just drop it from the RFC. FWIW, it is currently there as an "alternative", and what this RFC actually proposes is specified later.

"input": {
"content" : CID /* // RAW CID */,
"location": "`https://<BUCKET_NAME>.<REGION>.web3.storage/<CID>/<CID>.car`",
"range" : [ start, end ] /* Optional: Byte Range in URL
A reviewer (Contributor) commented:

We should make this a required field so the size of the content can be inferred from the claim.

vasco-santos (Author) commented:

Well, in this case we mean blobs (or CAR files), which in theory is the entire thing. But yeah, since we need to do a HEAD request anyway to see if it is in the write target, we can make this required, as it is really no extra cost.
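
A sketch of that HEAD check, deriving the now-required range from Content-Length; treating the range as an inclusive `[start, end]` is an assumption:

```ts
// Verify the blob exists at the write target and compute the byte range
// covering the whole blob.
const verifyAndRange = async (url: string): Promise<[number, number]> => {
  const res = await fetch(url, { method: 'HEAD' })
  if (!res.ok) throw new Error(`blob not found at write target (${res.status})`)
  const size = Number(res.headers.get('content-length'))
  return [0, size - 1]
}
```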
