docs: datawherehouse location claim + store/publish #13

Merged · 7 commits · Mar 20, 2024
rfc/datawherehouse.md — 224 additions, 0 deletions
# DATAWHEREHOUSE

> or... where is that file living on?

## Authors

- [Vasco Santos], [Protocol Labs]

## Background

TL;DR

1. We're missing a mapping of Blob CID -> target URI (Bucket, Saturn Node, etc).
2. We don't want clients to create location claims because they have no control over the location and consequently are in no position to make such claims.
3. We SHOULD NOT assume that a user uploading content implies they want it to be publicly available for everyone to read. Users should signal that intention explicitly via an invocation. Not to mention that we do not have bucket events in Cloudflare and require the client to tell us when an upload is complete.
4. We want freeway code to be usable in Saturn nodes, so ideally it uses only content claims to discover locations.
5. We want this information to be available as soon as possible so that read interfaces can serve the content immediately.

### Extended details

> 1. We're missing a mapping of Blob CID -> target URI (Bucket, Saturn Node, etc).

When we first faced this problem, we considered issuing location claims pointing directly at the bytes (i.e. `r2://...`). But when we got closer to putting that into practice, we realized it was not a good idea. We need this mapping so that w3s Read/Write Interfaces can discover where the bytes are. The bytes may actually be stored in private write targets (for instance, an R2 bucket) whose location is not public. We consider that location claims MUST be retrievable, have public access, and not be heavily rate limited. Finally, some read interfaces (for instance Roundabout and Freeway) require information encoded in the URI (like the bucket name), which would not be available in a public R2 bucket URL. All things considered, location claims should include URIs like `https://bafy...blob.ipfs.w3s.link`, and we need a mapping from `bafy...blob` to where its bytes are actually stored internally.

> 2. We don't want clients to create location claims for internal bucket URLs because we might change the location in the future (reputation hit for client).

Extending on the first point, making location claims include "private/unavailable" URIs would make it harder for the service to move blobs elsewhere, given it would need to revoke a bunch of claims and reissue them with the new location.

> 3. We don't have bucket events in Cloudflare, so need the client to tell us when it has uploaded something to the provided write target.

Actually, we can extend this point further by saying that today we have no verifiability of data being sent by the user, nor of it being received by the service. Having the client sign that the bytes were sent, and the service check and also sign that this is true, will allow us to achieve that. Moreover, this interaction also opens the door to a challenge/proof of delivery.
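
As an illustration of the verification step, here is a minimal sketch (assuming the blob CID uses sha2-256 and that the service can fetch the bytes back from the write target; the function name is hypothetical):

```ts
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import { equals } from 'multiformats/bytes'

/**
 * Hypothetical check run by the service once the client signals the upload
 * is complete: fetch the bytes from the write target and confirm they hash
 * to the claimed blob CID before signing anything about them.
 * Assumes the CID uses the sha2-256 multihash.
 */
export async function verifyUploadedBlob(claimedCid: string, writeTargetUrl: string): Promise<boolean> {
  const cid = CID.parse(claimedCid)

  const res = await fetch(writeTargetUrl)
  if (!res.ok) return false // bytes were never written to the write target

  const bytes = new Uint8Array(await res.arrayBuffer())
  const digest = await sha256.digest(bytes)

  // Compare the computed digest with the one embedded in the claimed CID.
  return equals(digest.digest, cid.multihash.digest)
}
```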

## High level flow of proposed solution

* Client stores some blob with web3.storage
* Client requests this blob to be published with `store/publish`
**Contributor:**

Given the discussions around graceful upgrades, I propose we leverage the #12 proposal to decide whether to direct the client to buckets that assume events vs buckets that do not. At the same time, I would like to address the "claimed CARs aren't always CARs" issue and propose that we introduce a new `/space/content/add/blob` capability that is effectively a `store/add` with the following differences:

  1. It does not imply publishing (or advertising content on the network).
  2. It does not have an origin field.
  3. The link field is renamed to blob and is turned into a multihash (I can be convinced that it should be a raw CID instead).

This would put us in a better position because we won't assume that uploaded bytes represent a valid CAR, and we will know that we can direct this request to the bucket without events.

I would propose renaming the above described `store/publish` to `/space/content/add/blob/publish` instead.

Overall this will provide us with a much more granular upgrade path in comparison to blanket protocol versioning.

P.S.: We could also consider adding semver to the end of the ability, e.g. `/0.0.1`, if we feel like it. This is something that @gobengo advocated for on several occasions. I'm not convinced, because I feel like feature detection has worked better than versioning in web-like open systems.

**Contributor Author:**

Yeah, I think we can work in that direction indeed!

**Contributor Author:**

`/space/content/add/blob/publish`: yeah, I think I agree with this direction, plus starting with the blob work for the new path.

* Service handler verifies the requested blob was stored in the client space, issues a location claim with encoded location information, and returns it to the client

The following diagram presents the described flow, illustrated with the following steps (a code sketch follows the diagram):

1. Client requests to store some bytes with the storefront service
2. Service issues a receipt stating the URI of a write target for the client to write those bytes
3. Client writes the bytes into the write target
4. Client requests the service to serve the written bytes under the given CID
5. Service verifies that the bytes are stored in the provided write target/space and writes a claim about it.
6. Service issues a receipt stating the bytes are stored by the service, along with the issued location claim.
**Contributor:**

While it is not strictly necessary, I would really like us to write the issued receipt containing a content claim in the user space so we can charge users for it. We do not have to actually store it there, but I'd like us to create a record so we could bill users for it. @alanshaw is probably best positioned to advise on the best way to go about this.

P.S. In the future we may consider letting the user request a location claim without performing a publish, and let them decide when to add it to their space to perform the actual publishing. E.g. `/space/content/get/blob` could be used to obtain a content location claim without us publishing it anywhere.

**Contributor:**

Relatedly, I would suggest that `/space/content/add/blob` return a content location claim if we already have the corresponding blob in the bucket.

**Contributor Author:**

While I think that would be great, I think we need to figure out a lot of things regarding the UX of this. I do not want to make it easy for folks to delete these files, and I would also like them to be associated with the actual CAR in such a way that they could be deleted together.

Perhaps we can try the same structure as proposed in the hierarchical capabilities: have a "folder" keyed by the CAR CID, which then holds its bytes, location claim, needed indexes, etc. I would like to make this part of the next iteration, though, to avoid increasing the scope further.


![datawherehouse1](./datawherehouse/datawherehouse-1.svg)
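
A rough sketch of this write path from the client's point of view follows. The `Invoke` signature, the receipt field names (`url`, `location`), and the direct `PUT` to a presigned URL are illustrative assumptions rather than the actual client API:

```ts
// `Invoke` is a stand-in for a UCAN invocation helper (hypothetical); it
// sends a capability invocation to the storefront service and returns the
// receipt's `out` field.
type Invoke = (cap: { can: string; with: string; nb: Record<string, unknown> }) =>
  Promise<{ ok?: Record<string, any>; error?: { name: string } }>

// Sketch of the write flow described above, under the assumption that
// `store/add` returns a presigned write-target URL and `store/publish`
// returns the issued location claim URL.
export async function storeAndPublish(
  invoke: Invoke,
  space: string,    // did:key of the space
  blobCid: string,  // RAW CID of the bytes
  bytes: Uint8Array
): Promise<string> {
  // 1-2. Ask the service where to write; the receipt carries a presigned URL.
  const allocated = await invoke({ can: 'store/add', with: space, nb: { link: blobCid, size: bytes.length } })
  if (!allocated.ok) throw new Error(`allocation failed: ${allocated.error?.name}`)

  // 3. Write the bytes directly to the write target (e.g. an R2 bucket).
  const put = await fetch(allocated.ok.url, { method: 'PUT', body: bytes })
  if (!put.ok) throw new Error('upload to write target failed')

  // 4-6. Ask the service to publish; it verifies the bytes and issues a
  //      location claim, returned in the receipt.
  const published = await invoke({ can: 'store/publish', with: space, nb: { content: blobCid, url: allocated.ok.url } })
  if (!published.ok) throw new Error(`publish failed: ${published.error?.name}`)
  return published.ok.location
}
```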

On the read side, the next diagram presents the flow when a client wants to read some blob, illustrated with the following steps:

1. Client requests to read some bytes by the CID of the blob
2. Service discovers where the requested bytes are stored, relying on the content claims service to find the location claim issued by the service
3. Service serves the blob stored at the discovered location

![datawherehouse2](./datawherehouse/datawherehouse-2.svg)
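
And a similar sketch for the read path, assuming a hypothetical HTTP endpoint and response shape for the content claims service (the real API may differ):

```ts
// Hypothetical content claims endpoint and response shape, for illustration.
const CLAIMS_ENDPOINT = 'https://claims.example.com'

export async function readBlob(blobCid: string): Promise<Uint8Array> {
  // 2. Discover where the bytes live by asking the content claims service
  //    for claims about this CID.
  const res = await fetch(`${CLAIMS_ENDPOINT}/claims/${blobCid}`)
  const { claims } = (await res.json()) as { claims: Array<{ type: string; location: string }> }

  const claim = claims.find((c) => c.type === 'assert/location')
  if (!claim) throw new Error(`no location claim found for ${blobCid}`)

  // 3. Serve the blob from the claimed (public, fetchable) location.
  const blob = await fetch(claim.location)
  if (!blob.ok) throw new Error(`failed to fetch ${blobCid} from ${claim.location}`)
  return new Uint8Array(await blob.arrayBuffer())
}
```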

## `store/publish` capability

After a blob has been stored with web3.storage, a client MAY request that it be made available under the given CID. The service MUST verify that it has the bytes corresponding to that CID and that they were stored in the provided client space. If so, the service MUST issue a location claim to the content claims service. Moreover, the service SHOULD respond with a location claim from which reads can be performed.

This method enables the service to handle private data in the future. It should likely allow the client to specify a TTL for the produced read URL, and even accommodate a future where this read URI might require permission (e.g. UCAN authorization).
**Contributor:**

Now that I think about it, I like the idea of requesting a content claim from the service and then publishing it even more, because:

  1. The user can request a location claim and use it to authorize private reads.
  2. The user can publish the delegated content claim, and if they do, they effectively enable public reads.

Note that in the 1st case the `aud` of the location claim would be the user space DID, and in the 2nd case the `aud` would be our service DID.

This would also capture very well the fact that if someone creates a content claim loop like w3up → alice → bob → w3up, they have effectively made the reads public. As long as a reader can provide a valid UCAN chain, we will be required to serve the content or be held accountable for not upholding our commitment.


This capability can be specified as follows:

```json
{
  "op": "store/publish",
  "rsc": "did:key:abc...space",
  "input": {
    "content": { "/": "bafy...BLOBCID" }, // RAW CID - going from CAR <-> RAW CID is just switching a codec byte
    "url": "https://..." // presigned URL of the write target previously provided
  }
}
```

**Contributor Author** (on the `url` field):

Maybe I should include some more information here. My reasoning for including it is to let the service know what the write target actually was, so that the service can check the bytes are there. Otherwise, we would need to find that out by storing presigned URLs somewhere. Thoughts?

**Contributor:**

I would prefer not to. We already need to store a record in the DB stating that the user has allocated space for the blob, and we also need to check that the user has such an allocation. It seems like it would make perfect sense for us to store the details about where space was allocated in that DB record.

Alternatively, I would suggest putting a link to our `store/add` receipt, which should contain all of this information.

**Contributor:**

Thinking more about this, I feel like this is not a good requirement, because the client may have to do the `/add` request in one session and the `/publish` in another. Having to maintain state across these invocations seems like incidental complexity, and I would advocate against it.

**Contributor Author:**

> Thinking more about this, I feel like this is not a good requirement, because the client may have to do the `/add` request in one session and the `/publish` in another. Having to maintain state across these invocations seems like incidental complexity, and I would advocate against it.

Not sure what you meant. My point in having the URL was precisely what I think you mean here: not requiring state on our end across multiple invocations. Note that I am also assuming that we can derive the write target details from the presigned URL, which I am not entirely sure will be true.

**Contributor:**

You are thinking about state we (the service) need to maintain, while I was referring to the state the client will need to maintain. Requiring the client to be stateful, or depending on them providing the right URL, is not a good choice.

The point I was making is that we (the service) need to look up state in the DB anyway, so extending that (state) record with a URL so the client can be stateless was my recommendation. That also reduces the surface for errors where the client fails to provide a URL, or a correct URL for that matter.

**Contributor Author:**

👌🏼

On success, the following receipt is returned:

```json
{
  "ran": "bafy...storePublish",
  "out": {
    "ok": {
      "content": { "/": "bafy...BLOBCID" }, // RAW CID - going from CAR <-> RAW CID is just switching a codec byte
      "location": "https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key"
    }
  },
  "fx": {}
}
```

TODO: Should this return the actual location claim? How would we do that?

In the event of failure, the receipt should state the error behind it, such as:

```json
{
  "ran": "bafy...storePublish",
  "out": {
    "error": {
      "name": "ContentNotFoundError",
      "content": { "/": "bafy...BLOBCID" }
    }
  },
  "fx": {}
}
```

**Contributor:**

Let's have multiple distinct error types please:

  1. Blob was not allocated.
  2. Blob was not written.

or

```json
{
  "ran": "bafy...storePublish",
  "out": {
    "error": {
      "name": "ContentNotAllocatedError",
      "content": { "/": "bafy...BLOBCID" }
    }
  },
  "fx": {}
}
```
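
For illustration, a service-side handler that distinguishes these error cases might look roughly like the sketch below. The `AllocationStore` and `ClaimsService` interfaces, the HEAD-based existence check, and the hint format are assumptions, not the actual implementation:

```ts
// Hypothetical persistence interfaces; the real service will have its own.
interface AllocationStore {
  // Returns the allocation record written at `store/add` time, if any.
  get(space: string, content: string): Promise<{ url: string; origin: string } | null>
}
interface ClaimsService {
  // Issues an `assert/location` claim to the content claims service.
  assertLocation(content: string, location: string): Promise<void>
}

type PublishResult =
  | { ok: { content: string; location: string } }
  | { error: { name: 'ContentNotAllocatedError' | 'ContentNotFoundError'; content: string } }

// Sketch of a `store/publish` handler: check the allocation, check the bytes
// were actually written, then issue a location claim and return it.
export async function handleStorePublish(
  space: string,
  content: string,
  allocations: AllocationStore,
  claims: ClaimsService
): Promise<PublishResult> {
  // Blob was never allocated in this space.
  const allocation = await allocations.get(space, content)
  if (!allocation) return { error: { name: 'ContentNotAllocatedError', content } }

  // Blob was allocated, but the bytes were never written to the write target.
  const head = await fetch(allocation.url, { method: 'HEAD' })
  if (!head.ok) return { error: { name: 'ContentNotFoundError', content } }

  // Issue a location claim pointing at a public, fetchable URL that encodes
  // the internal origin as a hint (see the next section).
  const location = `https://w3s.link/ipfs/${content}?origin=${allocation.origin}`
  await claims.assertLocation(content, location)
  return { ok: { content, location } }
}
```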


## Location claims encoding location hints

The content claims service is currently deployed implementing the [content claims spec](https://github.com/web3-storage/specs/pull/86). Among other claims, it provides [Location Claims](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view#Location-Claims), which MAY be used to claim that the bytes that hash to a given CID are available at a given URL.
**Contributor:**

I think the fact that the size of the content/blob is not captured (or is optional) is an unfortunate oversight. I would suggest making the `range` field required, so that the size of the content is clear from the claim.


In w3s the service is responsible for deciding the write target; therefore, the service SHOULD be responsible for claiming the location of the blob when the user requests it to be published.

While thinking about using location claims to record where bytes are stored by the service, there are a few characteristics we want:
- the location claim MUST resolve to a public, fetchable URL
- the location in a location claim SHOULD NOT change within the commitment window of the claim, given that a change MAY negatively impact the reputation of a party. However, the client SHOULD be able to choose how long the location claim is valid for.

Read interfaces MAY have requirements beyond the CID in order to serve content better, such as knowing the bucket name, region, etc.

As a way to store the location of these bytes, we discussed relying on a "private" location claims concept, or on location claims for a gateway that carry hints as encoded params in the URL, which the read interface can decide whether to use. This would allow us to keep the infra and datastores we already have, leaving the decentralization of content claims as a completely separate problem. We decided on encoded params in the location claim, given that it also points us toward a future where write targets may actually have public URLs.

### location claims with encoded params

On the other hand, we could also rely on a known resolvable location and encode the needed information as part of the URL. This would allow the w3s service to simply issue claims pointing to the gateway, with extra hints that read interfaces can use as a "fast lane".

A location claim MAY look like:

```json
{
  "op": "assert/location",
  "aud": "did:key:abc...space",
  "rsc": "did:web:web3.storage",
  "input": {
    "content": { "/": "bafy...BLOBCID" }, // RAW CID
    "location": "https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key",
    "range": [ start, end ] // byte range within the URL
  }
}
```

The public IPFS HTTP Gateway can decide whether it wants to use the hints or any other discovery method. Therefore, this location should still be fetchable in the future when the content lives somewhere else.

We do not need an internal "private" claim to store this data. Once we move to a decentralized write target approach, those targets will likely have public locations we can use directly, which means we could rely only on location claims issued by the service (even though revocation would become a concern as data moves around).

If we issue further claims with different query parameters, the service can still look at their dates and attempt the latest first, without any real need to revoke older ones, given the URL will still resolve to the data.

Also note that we do not really need to change anything in `w3s.link`. The service can call the content claims service and see what the hints are. For optimization purposes, however, we can check the hints and try them first.
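
For illustration, a read interface could extract such a hint with standard URL parsing alone. The sketch below assumes the `r2://region/bucketName/key` origin format used in the examples above:

```ts
// Extract the optional origin hint from a location claim URL such as:
//   https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://region/bucketName/key
export function parseLocationHint(location: string): { region: string; bucket: string; key: string } | null {
  const origin = new URL(location).searchParams.get('origin')
  if (!origin || !origin.startsWith('r2://')) return null

  const [region, bucket, ...key] = origin.slice('r2://'.length).split('/')
  if (!region || !bucket || key.length === 0) return null
  return { region, bucket, key: key.join('/') }
}

// A read interface can try the "fast lane" (e.g. R2 Worker bindings or a
// presigned URL built from the hint) first, and fall back to fetching the
// public URL itself if the hint is missing or fails.
const hint = parseLocationHint(
  'https://w3s.link/ipfs/bafy...BLOBCID?origin=r2://auto/some-bucket/bafy...BLOBCID/bafy...BLOBCID.car'
)
```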

---

## Deprecated research on data location URIs

### Bucket data Location URIs

Defining the format of data location URIs for these write targets is critical, so that these locations can be mapped to the buckets and fulfill all the requirements of read interfaces (see https://hackmd.io/5qyJwDORTc6B-nqZSmzmWQ#Read-Use-cases).

#### URIs in well known write targets

Typically, objects in S3 buckets can be located via the following URIs:
- S3 URI (e.g. `s3://<BUCKET_NAME>/<CID>/<CID>.car`)
- Object URL (e.g. `https://<BUCKET_NAME>.s3.<AWS_REGION>.amazonaws.com/<CID>/<CID>.car`)
  - can be used to fetch the bytes by any HTTP client if the bucket is public

However, R2 object locations do not follow the S3 pattern. They can be:
- Public Object URL for Dev (e.g. `https://pub-<INTERNAL_R2_BUCKET_IDENTIFIER>.r2.dev/<CID>/<CID>.car`)
  - can be used to fetch the bytes by any HTTP client if the bucket is public, though it is heavily rate limited
  - [R2 docs](https://developers.cloudflare.com/r2/buckets/public-buckets/#enable-managed-public-access) state that such URLs should only be used for dev!
- Custom domain object URL (e.g. `https://<CUSTOM_DOMAIN>.web3.storage/<CID>/<CID>.car`)
  - can be used to fetch the bytes by any HTTP client, if a custom domain is configured in R2
  - can be rate limited by operator configuration
  - the account will need to pay for the egress of reading from the bucket
- Presigned URL (e.g. `https://<ACCOUNT_ID>.r2.cloudflarestorage.com/<BUCKET_NAME>/<CID>/<CID>.car?Signature=...&Expires=...`)
  - can be used to fetch the bytes by any HTTP client, if it has the signature query parameter and has not expired
  - no heavy rate limits in place, and no egress costs to read data at rest
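
As a point of reference, a presigned URL of this shape can be generated with the standard AWS SDK v3 presigner pointed at the R2 endpoint. This is a sketch with placeholder account, bucket, and credential values:

```ts
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

// R2 is S3-compatible, so the regular S3 client works when pointed at the
// account's R2 endpoint. All identifiers below are placeholders.
const r2 = new S3Client({
  region: 'auto',
  endpoint: 'https://<ACCOUNT_ID>.r2.cloudflarestorage.com',
  credentials: { accessKeyId: '<ACCESS_KEY_ID>', secretAccessKey: '<SECRET_ACCESS_KEY>' },
})

// Produces a time-limited presigned GET URL for `<CID>/<CID>.car` in the bucket.
export async function presignedCarUrl(cid: string, expiresIn = 3600): Promise<string> {
  const command = new GetObjectCommand({ Bucket: '<BUCKET_NAME>', Key: `${cid}/${cid}.car` })
  return getSignedUrl(r2, command, { expiresIn })
}
```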

Note that a data location URI may not be readable from all actors, as some may be behind a given set of permissions/capabilities.

#### URI Patterns

The main pattern we can identify is to have URLs that can be accessed by any HTTP client. Except for S3 URIs, and provided the correct setup/keys are available, all the other URLs are fetchable. We can therefore count it as an advantage when a claim is directly fetchable without any pre-knowledge.

Having a claim that cannot be used by every potential client (i.e. one that needs extra permissions) is a disadvantage that may result in penalties in a reputation system. Moreover, rate limits can have a negative impact on reputation as well.

URIs that include all the information needed for smart clients to derive access via Worker R2 Bindings or to generate presigned URLs are critical for several use cases. URIs that minimize egress costs can also be preferred by smart clients.

Per the above, for S3 it looks like we should rely on the S3 Object URL (e.g. `https://<BUCKET_NAME>.s3.<AWS_REGION>.amazonaws.com/<CID>/<CID>.car`).

But with R2, there is no perfect fit. The only good option would be the format used by presigned URLs, but those should only be created on request, given their expiration. The custom domain offers better retrievability than the Public Object URL for Dev, but does not encode enough information for smart clients (i.e. no way to know the bucket name). For R2, we will likely need to add 2 location URIs:
- Custom domain object URL (e.g. `https://<CUSTOM_DOMAIN>.web3.storage/<CID>/<CID>.car`)
  - can be used out of the box to fetch content
- Presigned-URL-like URI without query params (e.g. `https://<ACCOUNT_ID>.r2.cloudflarestorage.com/<BUCKET_NAME>/<CID>/<CID>`)
  - won't really work on its own, but smart clients can see if it is available and rely on its encoded info to use CF Worker Bindings, create presigned URLs, etc.

Alternatively, we can require `<CUSTOM_DOMAIN>` to be the bucket name, in order to make it work as a single location URI. The main disadvantage, besides the extra requirement, is that the URL gives no indication of being an R2 bucket, which would mean hard-coded assumptions. We could also consider mimicking the S3 URL format for R2 here as well.

The Public Object URL for Dev should not be adopted: it is heavily rate limited, we do not know what CF may do with it in the future, and it does not even include information about the bucket name.

#### Proposal

Nothing prevents us from claiming multiple location URIs for a given piece of content. However, we may need to be careful about having multiple claims for the same location: if it is not fully available, it MAY be ranked badly in whatever reputation system we may create. On the other hand, some smart clients MAY benefit from cost savings or faster retrievals if they have extra information encoded in the URI.

In conclusion, this document proposes that w3up clients, once they successfully perform an upload, create location claims in the following formats:
- S3
  - `https://<BUCKET_NAME>.s3.<AWS_REGION>.amazonaws.com/<CID>/<CID>.car`
- R2
  - `https://<CUSTOM_DOMAIN>.web3.storage/<CID>/<CID>.car`
  - `https://<ACCOUNT_ID>.r2.cloudflarestorage.com/<BUCKET_NAME>/<CID>/<CID>`
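
A small sketch of how a client could assemble these claim URIs from its upload configuration (all configuration values below are placeholders):

```ts
// Placeholder configuration; real values depend on the deployment.
const targets = {
  s3: { bucket: '<BUCKET_NAME>', region: '<AWS_REGION>' },
  r2: { customDomain: '<CUSTOM_DOMAIN>', accountId: '<ACCOUNT_ID>', bucket: '<BUCKET_NAME>' },
}

// Location URIs a client could claim for a CAR it just uploaded, following
// the formats proposed above.
export function locationUris(carCid: string): string[] {
  const key = `${carCid}/${carCid}.car`
  return [
    // S3 object URL
    `https://${targets.s3.bucket}.s3.${targets.s3.region}.amazonaws.com/${key}`,
    // R2 custom domain object URL
    `https://${targets.r2.customDomain}.web3.storage/${key}`,
    // R2 presigned-URL-like URI without query params (hint for smart clients)
    `https://${targets.r2.accountId}.r2.cloudflarestorage.com/${targets.r2.bucket}/${carCid}/${carCid}`,
  ]
}
```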