docs: datawherehouse location claim + store/publish #13

Merged · 7 commits · Mar 20, 2024
rfc/datawherehouse.md (198 additions, 0 deletions)
# DATAWHEREHOUSE

> or... where is that file living?

## Authors

- [Vasco Santos], [Protocol Labs]

## Background

TL;DR

1. We're missing a mapping of Data CID -> target URI (Bucket, Saturn Node, etc).
2. We don't want clients to create location claims tied to internal bucket URLs because we might change the location in the future (a reputation hit for the client).
3. We don't have bucket events in Cloudflare, so we need the client to tell us when it has uploaded something to the provided write target.
4. We want freeway code to be usable in Saturn nodes, so ideally it uses only content claims to discover locations.
5. We want this information available as soon as content is written so that read interfaces can serve the content right away.

### Extended details

> 1. We're missing a mapping of Data CID -> target URI (Bucket, Saturn Node, etc).

At first we considered that location claims pointing directly to the data bytes would be the solution here. But when we got closer to putting that into practice, we realized it was not a good idea. We need this mapping so that w3s read/write interfaces can discover where the bytes are. The bytes may actually be stored in private write targets (for instance, an R2 bucket) whose location is not public, while we consider that location claims MUST be retrievable, publicly accessible, and not heavily rate limited. Finally, some read interfaces (for instance Roundabout, Freeway) require some information encoded in the URI (like the bucket name), which would not be available in a public R2 bucket URL. All things considered, location claims should include URIs like `https://bafy...data.ipfs.w3s.link`, and we need a mapping of `bafy...data` to where its bytes are actually stored internally.

> 2. We don't want clients to create location claims tied to internal bucket URLs because we might change the location in the future (a reputation hit for the client).

Extending on the first point, making location claims include "private/unavailable" URIs would make it harder for the service to move data to other places, given it would need to revoke a bunch of claims and re-issue them with the new location.

> 3. We don't have bucket events in Cloudflare, so we need the client to tell us when it has uploaded something to the provided write target.

We can actually extend this point by noting that today we have no verifiability that data was sent by the user, nor that it was received by the service. Having the client sign that the bytes were sent, and the service check and counter-sign that this is true, will allow us to achieve that. Moreover, this interaction also opens the door to a challenge/proof of delivery.

## High level flow of proposed solution

* Simulate a bucket event by having the client submit a `store/deliver` invocation (see the sketch after this list).
  * Confirms successful transfer of the data.
  * Effect linked from the `store/add` receipt.
  * Can be batched with other invocations like `filecoin/offer`.
* Handler writes a location claim with the encoded information.
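
For illustration, a `store/deliver` invocation could look something like the sketch below. The capability name comes from this RFC, but the caveat fields (`link`, `size`) are assumptions for illustration, not a finalized schema.

```ts
// Hypothetical `store/deliver` invocation, expressed as a plain object in the
// shape of a UCAN invocation. Caveat names under `nb` are assumptions.
const carCid = 'bag...' // CAR CID the client wrote to the write target

const storeDeliver = {
  iss: 'did:key:zAlice',           // client that wrote the bytes
  aud: 'did:web:web3.storage',     // storefront service
  att: [
    {
      with: 'did:key:zAliceSpace', // space the bytes belong to
      can: 'store/deliver',
      nb: {
        link: carCid,              // content the client claims was delivered
        size: 4096,                // bytes written, so the service can verify via HEAD
      },
    },
  ],
}
```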

The following diagram presents the described flow, illustrated with the following steps:

1. Client requests to store some bytes with the storefront service
2. Service issues a receipt stating the URI of a write target for the client to write those bytes (see the receipt sketch after the diagram)
3. Client writes the bytes into the write target
4. Client notifies the service that the bytes were written to the provided write target
5. Service verifies that the bytes are stored in the provided write target and writes a claim about it.
6. Service issues a receipt stating the bytes are being stored by the service.

![datawherehouse1](./datawherehouse/datawherehouse-1.svg)
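
As a sketch of step 2, the `store/add` receipt could link the follow-up `store/deliver` task as an effect. The shape below loosely follows the UCAN invocation/receipt model; the exact field names and values here are assumptions for illustration.

```ts
// Hypothetical `store/add` receipt. `fx.join` points at the `store/deliver`
// task the client is expected to run after writing the bytes.
const storeAddInvocationCid = 'bafy...invocation' // placeholder link
const storeDeliverTaskCid = 'bafy...task'         // placeholder link

const storeAddReceipt = {
  ran: storeAddInvocationCid, // the `store/add` invocation this receipt is for
  out: {
    ok: {
      status: 'upload',                            // client still needs to upload the bytes
      url: 'https://<presigned-write-target-url>', // where to PUT the bytes
    },
  },
  fx: {
    join: storeDeliverTaskCid, // effect: confirm delivery via `store/deliver`
  },
}
```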

On the other side of things, the next diagram presents the flow when a client wants to read some data, illustrated with the following steps:

1. Client requests to read some bytes by the CID of the data
2. Service discovers where the requested bytes are stored, relying on the content claims service and the materialized claims from `datawherehouse`

> **Review comment (Member):** I think this is not correct anymore, right?

3. Service serves the data stored at the discovered location.

![datawherehouse2](./datawherehouse/datawherehouse-2.svg)

## Location claims

The content claims service is currently deployed, implementing the [content claims spec](https://github.com/web3-storage/specs/pull/86). Among other claims, it provides [Location Claims](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view#Location-Claims), which MAY be used to claim that the bytes that hash to a given CID are available at a given URL.

> **Review comment (Contributor):** I think the fact that the size of the content/blob is not captured (or is optional) is an unfortunate oversight. I would suggest making the `range` field required, so that the size of the content is clear from the claim.


In w3s the service is responsible for deciding the write target; therefore, the service SHOULD be responsible for claiming the location of the bytes.

While thinking about using location claims to record where bytes are stored by the service, there are a few characteristics we want:
- the location claim MUST resolve to public, fetchable URLs
- the location in a location claim SHOULD (ideally) not change frequently, given that churn MAY negatively impact the reputation of a party

Read interfaces MAY require information beyond the CID to perform better, such as the bucket name, region, etc.

As a way to store the location of these bytes, we discussed relying on a "private" location claims concept, or on location claims for a gateway that carry hints as encoded params in the URL, which the read interface can decide whether to use. This would allow us to keep the infra and datastores we already have, leaving the decentralization of content claims as a completely separate problem.

### _private_ location claims

_private_ location claims would enable us to not expose these claims directly to the user, given that their sole purpose at the moment is internal routing. This would let w3s read/write interfaces query where the bytes for a CID are stored.

With this building block we can issue claims whose URLs MAY not be public and fetchable, and stop worrying about a potential future data migration.

A _private_ location claim MAY look like:

> **Review comment (Contributor):** I'm starting to think that private location claims are obsolete. A content claim should include whatever metadata it needs so that anyone holding it can perform a read. We can simply leverage the UCAN auth system for the rest. In other words, query params ≈ UCAN auth header. The latter additionally gives us the ability to choose who can exercise it, while the former is public by default.

> **Review comment (Author):** Since we are all in agreement not to use this, I will just drop it from the RFC. FWIW, it is currently here as an "alternative"; what this RFC proposes is specified later.


```json
{
  "op": "assert/datawherehouse",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID, /* CAR CID */
    "location": "https://<BUCKET_NAME>.<REGION>.web3.storage/<CID>/<CID>.car",
    "range": [ start, end ] /* Optional: byte range in the URL */
  }
}
```

> **Review comment (Contributor):** We should make `range` a required field so the size of the content can be inferred from the claim.

> **Review comment (Author):** Well, for this case we mean blobs (or CAR files), which in theory is the entire thing. But since we need to do a HEAD request to see if it is in the write target anyway, we can make this required, as it is really no extra cost.

Note that we could actually make this location URL publicly available via an R2 custom domain if we wanted to. Of course, this would still not be a good reason to make it public, given that moving the data to a different location would lead to invalid claims. But it can actually be a good idea for a transition period toward decentralized write nodes.

### location claims with encoded params

On the other side, we could also rely on a known resolvable location and encode the needed information as part of the URL. This would allow the w3s service to just issue claims pointing to the gateway, with extra hints that read interfaces can use as a "fast lane".

A location claim MAY look like:

```json
{
  "op": "assert/location",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID, /* CAR CID */
    "location": "https://<CID>.dag.w3s.link?bucket=<bucketName>&region=<region>",
    "range": [ start, end ] /* Optional: byte range in the URL */
  }
}
```

The public IPFS HTTP Gateway could decide whether it wants to use the hints or any other discovery method. Therefore, this location should still be fetchable in the future when the content is somewhere else.

We do not need an internal "private" claim for storing this data. Once we move to a decentralized write target approach, write targets will likely have public locations we can use directly, which means we could rely solely on location claims issued by the service (even though revocation would become a concern as data moves around).

If we issue further claims with different query parameters, the service can still look at their dates and try the latest first, with no real need to revoke older ones, given the URL will still resolve to the data.

Also note that we do not really need to make any change to `dag.w3s.link`. The service can call the content claims service and see what the hints are. For optimization purposes, we can however check and try the hints first.
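
A minimal sketch of how a read interface could extract these hints, assuming the query parameter names (`bucket`, `region`) from the example claim above:

```ts
// Extract the "fast lane" hints from a claimed location URL. If the hints are
// absent or stale, the reader can fall back to any other discovery method.
function readHints(location: string): { bucket?: string; region?: string } {
  const url = new URL(location)
  return {
    bucket: url.searchParams.get('bucket') ?? undefined,
    region: url.searchParams.get('region') ?? undefined,
  }
}

// readHints('https://bafy...data.dag.w3s.link?bucket=carpark-prod-0&region=auto')
// => { bucket: 'carpark-prod-0', region: 'auto' }
```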

## Proposal

Location claims with encoded params seem to be the simplest solution and also point us toward a future where write targets may actually have public URLs. Therefore, relying on `location claims with encoded params` can solve all the requirements while better positioning us for the future. In addition, it is also the easiest solution to implement.

## Deprecated

### Store design

carwhere is a store that enables a `store/*` implementer to map CAR CIDs to one or more locations where they are written (and confirmed!).

The store should be indexed and queryable by CAR CID, but should also support multiple entries per CAR CID. Therefore, we have two potential solutions for this store:
- Bucket store with keys formatted as `${carCid}/${bucketName}/${region}/${key}`, quite similar to the current Dudewhere indexes. E.g. `bag.../carpark-prod-0/auto/bag.../bag....key`
- DynamoDB store where the partition key is `${carCid}` and the sort key is `${bucketName}/${region}/${key}`

From a price standpoint, as well as for ease of storage migration, the bucket store will be way cheaper. However, DynamoDB will be faster at high throughput. Given the index read will likely be one of the least costly parts of reading content, it MAY not make a big difference which one the indexes are read from, especially as we will have the full content cached later on.

Proposal: Bucket Store
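
A minimal sketch of the bucket-store key layout described above (helper names are illustrative):

```ts
// Encode a carwhere entry as a bucket key following
// `${carCid}/${bucketName}/${region}/${key}`. Listing objects by the
// `${carCid}/` prefix then yields every confirmed location for that CAR.
function encodeEntry(carCid: string, bucketName: string, region: string, key: string): string {
  return `${carCid}/${bucketName}/${region}/${key}`
}

function decodeEntry(entry: string): { carCid: string; bucketName: string; region: string; key: string } {
  const [carCid, bucketName, region, ...rest] = entry.split('/')
  return { carCid, bucketName, region, key: rest.join('/') } // key may itself contain '/'
}

// encodeEntry('bag...', 'carpark-prod-0', 'auto', 'bag.../bag....car')
// => 'bag.../carpark-prod-0/auto/bag.../bag....car'
```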

### Bucket data Location URIs

Defining the format of data location URIs for these write targets is critical to map these locations to the buckets and fulfill all the requirements of the read interfaces (See https://hackmd.io/5qyJwDORTc6B-nqZSmzmWQ#Read-Use-cases).

#### URIs in well known write targets

Typically, objects in S3 buckets can be located via the following URIs:
- S3 URI (e.g. `s3://<BUCKET_NAME>/<CID>/<CID>.car`)
- Object URL (e.g. `https://<BUCKET_NAME>.s3.<AWS_REGION>.amazonaws.com/<CID>/<CID>.car`)
  - can be used to fetch the bytes by any HTTP client if the bucket is public

However, R2 object locations have different patterns instead of following the S3 pattern. They can be:
- Public Object URL for Dev (e.g. `https://pub-<INTERNAL_R2_BUCKET_IDENTIFIER>.r2.dev/<CID>/<CID>.car`)
  - can be used to fetch the bytes by any HTTP client if the bucket is public, but is heavily rate limited
  - [R2 docs](https://developers.cloudflare.com/r2/buckets/public-buckets/#enable-managed-public-access) state that such URLs should only be used for dev!
- Custom domain object URL (e.g. `https://<CUSTOM_DOMAIN>.web3.storage/<CID>/<CID>.car`)
  - can be used to fetch the bytes by any HTTP client if a custom domain is configured in R2
  - can be rate limited by operator configuration
  - the account will need to pay for the egress of reading from the bucket
- Presigned URL (e.g. `https://<ACCOUNT_ID>.r2.cloudflarestorage.com/<BUCKET_NAME>/<CID>/<CID>.car?Signature=...&Expires=...`)
  - can be used to fetch the bytes by any HTTP client if it has the signature query parameter and is not expired
  - no heavy rate limits in place, and no egress costs to read data at rest

Note that a data location URI may not be readable by all actors, as some may be behind a given set of permissions/capabilities.

#### URI Patterns

The main pattern we can identify is URLs that can be accessed by any HTTP client. Except for S3 URIs, and given the correct setup/keys, all other URLs are fetchable. Therefore, we can count it as an advantage when a claim is directly fetchable without any prior knowledge.

Having a claim that cannot be used by every potential client (i.e. one that needs some extra permissions) is a disadvantage that may result in penalties in a reputation system. Moreover, rate limits can have a negative impact on reputation as well.

URIs that include all the information smart clients need to derive alternatives (e.g., to rely on Worker R2 Bindings or to generate presigned URLs) are critical for several use cases. URIs that minimize egress costs can also be preferred by smart clients.

Per the above, considering S3, it looks like we should rely on S3 Object URL (e.g. `https://<BUCKET_NAME>.s3.<AWS_REGION>.amazonaws.com/<CID>/<CID>.car`).

But with R2, there is no perfect fit. The only good option would be the format used by presigned URLs, but those should only be created on request, given their expiration. The custom domain offers better retrievability than the Public Object URL for Dev, but does not encode enough information for smart clients (i.e. no way to know the bucket name). For R2, we will likely need to add two location URIs:
- Custom domain object URL (e.g. `https://<CUSTOM_DOMAIN>.web3.storage/<CID>/<CID>.car`)
  - can be used out of the box to fetch content
- Presigned-URL-like URI without query params (e.g. `https://<ACCOUNT_ID>.r2.cloudflarestorage.com/<BUCKET_NAME>/<CID>/<CID>`)
  - won't really work on its own, but smart clients can see if it is available and rely on its encoded info to use CF Worker Bindings, create presigned URLs, etc.

Alternatively, we can require `<CUSTOM_DOMAIN>` to be the bucket name, in order to make it work as a single location URI. The main disadvantage, besides the extra requirement, is that the URL makes no mention of being an R2 bucket, which would mean hard-coded assumptions. We could also consider mimicking the S3 URL format for R2 here as well.

The Public Object URL for Dev should not be adopted: it is heavily rate limited, we do not know what CF may do with it in the future, and it does not even carry information about the bucket name.

#### Proposal

Nothing prevents us from claiming multiple location URIs for a given piece of content. However, we may need to be careful about having multiple claims for the same location, as a location that is not fully available MAY be ranked badly in whatever reputation system we create. Still, some smart clients MAY benefit from cost savings or faster retrievals if they have extra information encoded in the URI.

In conclusion, this document proposes that w3up clients, once they successfully perform an upload, create location claims in the following formats (a small helper sketch follows the list):
- S3
  - `https://<BUCKET_NAME>.s3.<AWS_REGION>.amazonaws.com/<CID>/<CID>.car`
- R2
  - `https://<CUSTOM_DOMAIN>.web3.storage/<CID>/<CID>.car`
  - `https://<ACCOUNT_ID>.r2.cloudflarestorage.com/<BUCKET_NAME>/<CID>/<CID>`
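
A small helper sketch for building these URIs; the URL templates come from this document, while the helper functions themselves are illustrative:

```ts
// Build the proposed location URIs from their parts.
const s3Location = (bucket: string, region: string, cid: string): string =>
  `https://${bucket}.s3.${region}.amazonaws.com/${cid}/${cid}.car`

const r2CustomDomainLocation = (customDomain: string, cid: string): string =>
  `https://${customDomain}.web3.storage/${cid}/${cid}.car`

// Not directly fetchable, but encodes account/bucket info for smart clients.
const r2DerivedLocation = (accountId: string, bucket: string, cid: string): string =>
  `https://${accountId}.r2.cloudflarestorage.com/${bucket}/${cid}/${cid}`
```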

### Materialize location claims

Extend materialized location claims to include short-lived carwhere locations. We will need to align on what these claims will look like.