feat: proposal for content claims #86
# Content Claims Protocol

![wip](https://img.shields.io/badge/status-wip-orange.svg?style=flat-square)

## Editors

- [Alan Shaw](https://github.com/alanshaw), [DAG House](https://dag.house/)

## Authors

- [Irakli Gozalishvili](https://github.com/Gozala), [DAG House](https://dag.house/)
- [Mikeal Rogers](https://github.com/mikeal), [DAG House](https://dag.house/)

# Abstract

A UCAN-based protocol that allows actors to share information about specific content (identified by CID).

> We base the protocol on top of the [UCAN invocation specification 0.2](https://github.com/ucan-wg/invocation/blob/v0.2/README.md#23-ipld-schema).
> [Original proposal](https://hackmd.io/IiKMDqoaSM61TjybSxwHog?view) (includes implementation notes).

## Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119](https://datatracker.ietf.org/doc/html/rfc2119).

# Introduction

## Motivation

## NextGen IPFS Content Discovery

Content discovery in IPFS today is "peer based", meaning that an IPFS client first performs a "content discovery" step and then a "peer discovery" step in order to read data, in the following way:

1. The client finds peer IDs that have announced they have a CID.
2. The client finds transport protocols for the given peer ID that it can use to read IPFS data from that peer.
3. The client retrieves data from that peer.

![](https://hackmd.io/_uploads/r1pmgLuVh.png)

File reads in this system require that two entities, the client and the peer serving the data, both support an IPFS-aware transport protocol. The peer serving a given CID is required to serve data at a specific endpoint and keep that endpoint operational for the data to be available.

When a publisher wants to hire a remote storage device, this protocol design presents a challenge: these remote storage devices don't speak native IPFS protocols, therefore IPFS clients can't read content from them directly. We end up having to place a node between these remote storage devices and the client, proxying the data and incurring unnecessary egress.

![](https://i.imgur.com/Jan0Opg.png)

Content Claims change this.

Publishers post **verifiable claims** about IPFS data they publish in remote storage devices into content discovery networks and services.

From these claims, IPFS clients can request data directly from remote storage devices over standard transports like HTTP.

![](https://i.imgur.com/4iutSwA.png)

With content claims, any actor can provide data into the network without making the data available over an IPFS-aware transport protocol. Rather than "serving" the data, content providers post verifiable claims about the data and its location.

Clients can use these claims to read directly, over any existing transport (mostly HTTP), and in the act of reading the data they verify the related claims.

This removes a **substantial** source of cost from providing content into the network. Content can be published into the network "at rest" on any permanent or temporary storage device that supports reading over HTTP.

# Protocol

## Claim Types

The requirements for content claims break down into a few isolated components, each representing a specific claim. These claims are assembled together to represent the proof information necessary for retrieving content.

All claim types map a **single** CID to "claim information."

While you can derive block indexes from these claims (see: Inclusion Claims), each individual claim is indexed by a single **significant** CID (file root, directory root, etc.) referenced by the `content` field.

These claims include examples of a unixfs file encoded into CAR files, but the protocol itself makes heavy use of CIDs in order to support a variety of future protocols and other use cases.

Since this protocol builds upon the UCAN Invocation Specification, all of these claims are contained within a message from an *issuer* to a *destination*. This means the protocol can be used to send specific actors unique route information apart from public content discovery networks, and it can also be used to send messages to public content discovery networks by simply addressing them to a DID representing the discovery network.

### Location Claims

* Claims that a CID is available at a URL.
* Block level interface:
  * The GET body MUST match the multihash digest in the CID.
  * Using a CAR CID, or any future BlockSet() identifier, allows this to be a multi-block interface.

```javascript
{
  "op": "assert/location",
  "rsc": "https://web3.storage",
  "input": {
    "content" : CID /* CAR CID */,
    "location": "https://r2.cf/bag...car",
    "range"   : [ start, end ] /* Optional: Byte Range in URL */
  }
}
```

From this, we can derive a verifiable block interface over HTTP or any other URL-based address. This could even be used with `BitSwap` using `bitswap://`.
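
To make "verifiable" concrete, here is a minimal sketch of a client consuming a location claim. It assumes the JS `multiformats` library, a sha2-256 addressed CID, and that `input` is the claim's `input` object with string `content` and `location` values; the function name is illustrative, not part of the protocol.

```javascript
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import { equals } from 'multiformats/bytes'

// Minimal sketch (not part of the spec): fetch the bytes named by an
// assert/location claim and verify them against the claimed CID before
// trusting the response.
async function fetchAndVerify (input) {
  const cid = CID.parse(input.content)
  const headers = {}
  if (input.range) {
    // the claim may scope the content to a byte range within the URL
    headers.Range = `bytes=${input.range[0]}-${input.range[1]}`
  }
  const res = await fetch(input.location, { headers })
  if (!res.ok) throw new Error(`fetch failed: ${res.status}`)
  const bytes = new Uint8Array(await res.arrayBuffer())

  // recompute the multihash and compare it to the digest in the CID
  const digest = await sha256.digest(bytes)
  if (!equals(digest.bytes, cid.multihash.bytes)) {
    throw new Error('response bytes do not match claimed CID')
  }
  return bytes
}
```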

Filecoin Storage Providers (running Boost) would be addressable via the [lowest cost read method available (HTTP GET Range in Piece)](https://boost.filecoin.io/http-retrieval#retrieving-a-full-piece).

And you can provide multiple locations for the same content using a list:

```javascript
{
  "op": "assert/location",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* CAR CID */,
    "location": [ "https://r2.cf/bag...car", "s3://bucket/bag...car" ],
    "range": [ start, end ] /* Optional: Byte Range in URL */
  }
}
```

### Equivalency Claims

We also have cases in which the same data is referred to by another CID and/or multihash. Equivalency Claims represent this association as a verifiable claim.

```javascript
{
  "op": "assert/equals",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* CAR CID */,
    "equals": CID /* CommP CID */
  }
}
```

We should expect content discovery services to index these claims by the `content` CID, since that is standard across all claims, but since this is a cryptographic equivalency, equivalency-claim-aware systems are encouraged to index both CIDs.
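
As a hedged illustration of that indexing choice (not part of the protocol), a discovery service might key the same claim under both CIDs:

```javascript
// Hypothetical sketch: index an equivalency claim under both CIDs so a
// lookup by either the CAR CID or the CommP CID returns the same claim.
const byCid = new Map() // CID string -> array of claims

function indexEquivalencyClaim (claim) {
  const { content, equals } = claim.input
  for (const key of [content.toString(), equals.toString()]) {
    const claims = byCid.get(key) ?? []
    claims.push(claim)
    byCid.set(key, claims)
  }
}
```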

### Inclusion Claims

* Claims that a CID includes the contents claimed in another CID.
* Multi-block level interface:
  * One CID is included in another CID.
  * When the included CID is a CARv2 index CID, it also provides a block-level index of the referenced CAR CID's contents.
  * Using CAR CIDs and CARv2 indexes, we get a multi-block interface.
  * Combined with HTTP location information for the CAR CID, this means we can read individual block sections using HTTP ranges.

```javascript
{
  "op": "assert/inclusion",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* CAR CID */,
    "includes": CID /* CARv2 Index CID */,
    "proof": CID /* Optional: zero-knowledge proof */
  }
}
```

This can also be used to provide verifiable claims for sub-deal inclusions in Filecoin.

```javascript
{
  "op": "assert/inclusion",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* PieceCID (CommP) */,
    "includes": CID /* Sub-Deal CID (CommP) */,
    "proof": CID /* Sub-Deal Inclusion Proof */
  }
}
```

### Partition Claims

* Claims that a CID's graph can be read from the blocks found in parts:
  * `content` (root CID)
  * `blocks` (list of ordered CIDs)
  * `parts` (list of archives [CAR CIDs] containing the blocks)

```javascript
{
  "op": "assert/partition",
  "rsc": "https://web3.storage",
  "input": {
    "content": CID /* Content Root CID */,
    "blocks": CID, /* CIDs CID */
    "parts": [
      CID /* CAR CID */,
      CID /* CAR CID */,
      ...
    ]
  }
}
```

# Consuming Claims

An IPFS client wishing to perform a verifiable `read()` of IPFS data can construct one from verifiable claims. Given the amount of cryptography and protocol expertise necessary to perform these operations, a few examples are detailed below.

## IPFS File Publishing

An IPFS file is a continuous set of bytes (`source_file`) that has been encoded into a merkle-dag. The resulting tree is referenced by a CID using the `dag-pb` codec.

When files are encoded and published into the network, they are often packed into CAR files. One CAR file might contain **many** IPFS files, and one large file could be encoded into **many** CAR files. CAR files can be referenced by CID (digest of the CAR) and are typically exchanged transactionally (block level interface). This can get confusing, as transactional (block level) interfaces now ***contain*** multi-block interfaces.

### Large File

Large file uploads can be problematic when performed as a single pass/fail transaction, so it has become routine to break large upload transactions into smaller chunks. The default encoder for web3.storage (w3up) is configurable but defaults to ~100MB, so as the file is encoded into the unixfs block structure it is streamed into CAR encodings of the configured size. Each of these is uploaded transactionally, and if the operation were aborted and started again it would effectively resume, as long as the file and settings hadn't changed.

When the encode is complete, the root CID of the unixfs merkle tree will be available as a `dag-pb` CID. This means we need a claim structure that can cryptographically route from this `dag-pb` CID to the data ***inside*** the resulting CAR files.

Content Claims allow us to expose the cryptography in the IPFS encoding to clients, such that clients can read **directly from the encoded CAR** rather than requiring an intermediary to *assemble* the unixfs merkle tree for them.

So, after we stream encode a bunch of CARs and upload them, we encode and publish (see the example after this list):
* A **Partition Claim** for the ***Content Root CID***, which includes:
  * An ordered list of every CID we encoded.
  * A list of CAR CIDs where those CIDs can be found.
* **Inclusion Claims** for every ***CAR CID***, which include:
  * A CARv2 index of the CAR file.

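
As a purely illustrative example (all CIDs below are placeholders), the set of claims published for a file split across two CARs might look like this:

```javascript
// Hypothetical example: claims published after uploading a file whose
// unixfs root is "bafy...root", split across two CARs.
const claims = [
  {
    "op": "assert/partition",
    "rsc": "https://web3.storage",
    "input": {
      "content": "bafy...root"   /* unixfs dag-pb root */,
      "blocks": "bafy...cids"    /* CID of the ordered block CID list */,
      "parts": [ "bag...car1", "bag...car2" ]
    }
  },
  {
    "op": "assert/inclusion",
    "rsc": "https://web3.storage",
    "input": {
      "content": "bag...car1",
      "includes": "bafy...idx1"  /* CARv2 index for car1 */
    }
  },
  {
    "op": "assert/inclusion",
    "rsc": "https://web3.storage",
    "input": {
      "content": "bag...car2",
      "includes": "bafy...idx2"  /* CARv2 index for car2 */
    }
  }
]
```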

Now, if we publish these CAR files to Filecoin, we're going to want to capture another address (CommP) for the CAR file. We then include:

* An **Equivalency Claim** for every ***CAR CID***, which claims that the CAR CID is equivalent to the CommP for that CAR file.

Once that CAR data is aggregated into Filecoin deals, the list of CommP addresses included in each deal can be used to compute a sub-inclusion proof. At this time you may also publish:

* **Inclusion Claims** for every ***Sub-Deal CID*** (CommP), which claim each Sub-Deal CID is "included" in a Piece CID (also CommP) and which include:
  * The referenced sub-deal inclusion proof.

From this point forward, you can look up Filecoin Storage Providers on chain 😁

The preferred (lowest cost) method of reading data from Storage Providers is through an HTTP interface that requests data by Piece CID. You could describe these locations, perhaps playing the role of a chain oracle, as:

* **Location Claims** for every ***CAR CID*** in every Storage Provider, which include:
  * The offsets in the Piece addressable by the HTTP Range header.

And of course, you can also use **Location Claims** for any other HTTP-accessible storage system you ever decide to put a CAR into.

### *`read(DagPBCID, offset, length)`*

Now that we have a better idea of what sorts of claims are being published, let's construct a read operation from the claims.

It's out of scope for this specification to define the exact means by which you **discover** claims, but content discovery systems are expected to index these claims by `content` CID (in theory, you could index every block in the CARv2 indexes, but we should assume many actors don't want to pay for that, so everything in this specification requires indexing only the `content` CID in each claim). However, once a publisher sends these claims into public networks, it should presume that all addresses exposed in the claims are "discoverable" from a security and privacy perspective.

Once you have the claims for a given `dag-pb` CID, like the claim examples above, you can (see the sketch after this list):

* Build a Set() of CAR CIDs claimed to be holding relevant blocks, and
* Build a Map() of inclusions indexed by CAR CID, and
  * depending on your CARv2 library, you may want to parse the CARv2 indexes into something you can read from quickly, and
* Build a Map() of locations indexed by CAR CID,
* and you may even be able to build those concurrently.

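
A minimal sketch of that bookkeeping, assuming the claims are plain objects shaped like the examples above and that CIDs are compared as strings (the function name and return shape are illustrative, not part of the protocol):

```javascript
// Build the lookup structures a client needs before planning reads.
function indexClaims (claims) {
  const cars = new Set()       // CAR CIDs claimed to hold relevant blocks
  const inclusions = new Map() // CAR CID -> CARv2 index CID
  const locations = new Map()  // CAR CID -> [URL, ...]
  let partition                // the partition claim input, if present

  for (const { op, input } of claims) {
    if (op === 'assert/partition') {
      partition = input
      for (const part of input.parts) cars.add(part.toString())
    } else if (op === 'assert/inclusion') {
      inclusions.set(input.content.toString(), input.includes)
    } else if (op === 'assert/location') {
      // location may be a single URL or a list of URLs
      locations.set(input.content.toString(), [].concat(input.location))
    }
  }
  return { partition, cars, inclusions, locations }
}
```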

Take the list of blocks from the Partition Claim and find the smallest number of CAR CIDs containing a complete set.

Using the CARv2 indexes, you can now determine the byte offsets within every CAR for every block. You should also take the time to improve read performance by coalescing nearby offsets, so that you have fewer individual reads for contiguous sections of data; that will also reduce the burden your client puts on data providers, since the same data will be fetched in fewer requests.
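
For example, a simple range-coalescing pass might look like this (a sketch; `ranges` and the gap threshold are assumptions, not part of the protocol):

```javascript
// Merge per-block byte ranges within a CAR when the gap between them is
// small, so contiguous sections are fetched in one request.
// `ranges` is assumed to be [[start, end], ...] derived from a CARv2 index.
function coalesce (ranges, maxGap = 1024) {
  const sorted = [...ranges].sort((a, b) => a[0] - b[0])
  const merged = []
  for (const [start, end] of sorted) {
    const last = merged[merged.length - 1]
    if (last && start - last[1] <= maxGap) {
      last[1] = Math.max(last[1], end) // extend the previous range
    } else {
      merged.push([start, end])
    }
  }
  return merged
}
```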

You're now free to request this data from all CARs concurrently in a single round-trip, wherever they are.

#### From Remote HTTP Endpoint

You can use HTTP Range headers to request the required block sections over HTTP. Keep in mind that Location Claims for CAR CIDs support offsets, which would need to be included in the offset calculation you then perform *within* the CAR.
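
A minimal sketch of such a ranged read, assuming `url` and `claimRange` come from a Location Claim and `start`/`end` are byte offsets of a block section derived from the CARv2 index (names are illustrative):

```javascript
// Read a byte section of a CAR over HTTP using a Range header. If the
// location claim itself carries a `range`, the CAR starts at that offset
// within the URL, so it is added to the block offsets.
async function readSection (url, claimRange, start, end) {
  const base = claimRange ? claimRange[0] : 0
  const res = await fetch(url, {
    headers: { Range: `bytes=${base + start}-${base + end}` }
  })
  if (res.status !== 206 && res.status !== 200) {
    throw new Error(`range request failed: ${res.status}`)
  }
  return new Uint8Array(await res.arrayBuffer())
}
```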

#### From Filecoin Sub-Inclusion Proofs

Pretty simple really (sketched below):

* The **Equivalency Claims** tell you the sub-deal CIDs for your CAR CIDs.
* The **Inclusion Claims** for those sub-deal CIDs give you the Piece CIDs that Storage Providers commit to on chain.

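
A hypothetical sketch of chaining those two claim types, following the sub-deal inclusion example above (where `content` is the Piece CID and `includes` is the sub-deal CID) and comparing CIDs as strings:

```javascript
// Resolve a CAR CID to the Piece CIDs it was aggregated into.
function piecesForCar (carCid, claims) {
  // assert/equals claims map the CAR CID to its sub-deal CID (CommP)
  const subDeals = new Set(
    claims
      .filter(c => c.op === 'assert/equals' && String(c.input.content) === String(carCid))
      .map(c => String(c.input.equals))
  )
  // assert/inclusion claims (as in the sub-deal example above) say a
  // Piece CID (`content`) includes a sub-deal CID (`includes`)
  return claims
    .filter(c => c.op === 'assert/inclusion' && subDeals.has(String(c.input.includes)))
    .map(c => c.input.content) // Piece CIDs committed to on chain
}
```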

#### From BitSwap 🤯

One amazing thing about these protocols is that by simply using a BitSwap URL, we have BitSwap peers as "locations" for CAR CIDs in the same **Location Claim** protocol.

One could build a system/protocol for "loading" CAR files into BitSwap peers and then publish Location Claims.

Even over BitSwap, this would represent a performance improvement over the current client protocols, as you avoid the round-trips of traversing the graph.

## IPLD Schema

```ipldsch
# For UCAN IPLD Schema see
# https://github.com/ucan-wg/ucan-ipld

type URL string
type URLs [URL]
type CIDs [&Any]

type Assertion union {
  | AssertLocation "assert/location"
  | AssertInclusion "assert/inclusion"
  | AssertPartition "assert/partition"
} representation inline {
  discriminantKey "op"
}

type AssertLocation struct {
  on URL
  input ContentLocation
}
type AssertInclusion struct {
  on URL
  input ContentInclusion
}
type AssertPartition struct {
  on URL
  input ContentPartition
}

type Range struct {
  start Int
  end Int
} representation tuple

type Locations union {
  | URL string
  | URLs list
} representation kinded

type ContentLocation struct {
  content &Any
  location Locations
  range optional Range
}

type ContentInclusion struct {
  content &Any
  includes &Any
  proof optional &Any
}

type ContentPartition struct {
  # Content that is partitioned
  content &Any
  # Links to the CARs in which the content is contained
  parts [&ContentArchive]
  # Block addresses in read order
  blocks &CIDs
}
```
> Note that the implementation currently receives an array here, which I would think is also the best option :)