-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support byte range for raw block requests #101
Conversation
if (searchParams.get('format') === 'car' || headers.get('Accept')?.includes('application/vnd.ipld.car')) { | ||
return await handleCar(request, env, ctx) | ||
} | ||
return handler(request, env, ctx) // pass to other handlers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this single handler split into 2 (well, 3 really) to allow the ?format=raw
handler to run before the middleware that rejects requests that have HTTP Range
headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add this as a code comment. I'm also not sure I fully understand what you mean just yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I do feel however that following two changes would be a recommended, but not required:
- Location claims should probably use raw CIDs in the URL as opposed to CAR CIDs. We already had mistagged CAR content which this would avoid. I do have to admit that I don't fully understanding if such a change would invalidate assumptions of other components, in which case it will not be worth it, yet it would be good idea to document that in the code comments.
- I think if we have raw CID we should not require additional signals in form of query param or headers. This probably could be a followup improvement
return async (request, env, ctx) => { | ||
const { headers } = request | ||
const { searchParams } = ctx | ||
if (!searchParams) throw new Error('missing URL search params') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why require search params ? Code comment answering this would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, so this middleware requires a searchParams: URLSearchParams
on the context that some other middleware has extracted from the URL prior to it running.
const { headers } = request | ||
const { searchParams } = ctx | ||
if (!searchParams) throw new Error('missing URL search params') | ||
if (searchParams.get('format') === 'raw' || headers.get('Accept')?.includes('application/vnd.ipld.raw')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why default content type of application/octet-stream
would be refused here ? I would personally expect that specifying raw codec in the CID should be enough signal not to require anything in addition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This middleware is designed to only handle the request if ?format=raw
is in the query string or Accept: application/vnd.ipld.raw
is in the HTTP headers.
I will make this handler also handle requests when the CID codec is raw when I can, but it requires some other changes to keep parity with what the unixfs handler does when a raw block is requested (mainly content type sniffing).
if (searchParams.get('format') === 'car' || headers.get('Accept')?.includes('application/vnd.ipld.car')) { | ||
return await handleCar(request, env, ctx) | ||
} | ||
return handler(request, env, ctx) // pass to other handlers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add this as a code comment. I'm also not sure I fully understand what you mean just yet
src/lib/blockstore.js
Outdated
if (IndexEntry.isLocated(entry)) { | ||
// if host is "w3s.link" then content can be found in CARPARK | ||
const url = entry.site.location.find(l => l.hostname === 'w3s.link') | ||
// TODO: allow fetch from _any_ URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think of location commitments not just as locations but as permissions authorizing the read, meaning you may not be able to simply read from that http endpoint, you may have to check if commitment is delegated to you (the node) and are able to read it.
Might be worth adding some comment to that end here.
src/lib/blockstore.js
Outdated
const link = Link.parse(url.pathname.split('/')[2]) | ||
const digestString = base58btc.encode(link.multihash.bytes) | ||
const key = `${digestString}/${digestString}.blob` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I would recommend factoring out this piece of code into a separate function, perhaps something like parseSite(url: URL): { key: string, range: Range }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or alternatively something like IndexEntry.toBucketGet(entry: IndexEntry): [key:string, options?:GetOptions]
} | ||
} | ||
|
||
export type IndexEntry = NotLocatedIndexEntry | LocatedIndexEntry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I can't stop thinking of two as lost & found as in type IndexEntry = LostIndexEntry | FoundIndexEntry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 if you're ok with that I can get behind it. I was thinking lost implies that you had it at some point...
} | ||
|
||
export type IndexEntry = NotLocatedIndexEntry | LocatedIndexEntry | ||
|
||
export interface Index { | ||
get (c: UnknownLink): Promise<IndexEntry|undefined> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: As alluded earlier I think Result
is a much better option because:
- Interface tells you exactly why you did not got anything - "could not find entry"
- There is a room to grow, as in if we need to do some IO (which can fail) we can distinguish not found from failed to read and potentially more things like unsupported input link etc...
get (c: UnknownLink): Promise<IndexEntry|undefined> | |
get (c: UnknownLink): Promise<Result<IndexEntry, NotFound>> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I mean I wrote all this before I even knew about Result
. I don't disagree (I like the type information Result
gives us for errors) but I think we should do a transition to it separately to this PR.
@@ -136,7 +133,11 @@ export class ContentClaimsIndex { | |||
|
|||
const entries = await decodeIndex(content, block.bytes) | |||
for (const entry of entries) { | |||
this.#cache.set(Link.create(raw.code, entry.multihash), entry) | |||
const entryCid = Link.create(raw.code, entry.multihash) | |||
// do not overwrite an existing LocatedIndexEntry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Do we just drop it ? Can we have different entry for the same key ? I think it would really help to address in the comment above why dropping entries after we have one in cache is desired behavior
/** @param {import('multiformats').MultihashDigest} digest */ | ||
export const toBlobKey = digest => { | ||
const digestString = base58btc.encode(digest.bytes) | ||
return `${digestString}/${digestString}.blob` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I've been thinking whether doing something like blob/${digestSstring}.blob
would be a better option ? At least visually it would be easier to distinguish that looking at car vs blob hashes and potentially other things in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vasco-santos WDYT? We originally used cid/cid.car
so that we could put indexes next to the data, like cid/cid.car.idx
but we never did that and used a different bucket instead. So I'm not too bothered about having the extra "directory" per upload.
This would also be a good way to segregate new blob data from the old.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also do this keying with pathing to better optimize S3 Rate limits. Re R2 Rate limits, this was not rate limited like S3, but last time we talked (6 months + ago), they would be looking into having that possibility on their end.
So, I would push back on doing this change, and prioritize to finish what we need on the reads interface to swap the bucket
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we have different tables in Dynamo for Blobs and Store allocations. Therefore, if we ever need to go through the whole list, we can rely on dynamo instead of a list with prefix from bucket
test/helpers/content-claims.js
Outdated
const blocks = claims.get(cid) ?? [] | ||
// @ts-expect-error | ||
blocks.push(await encode(invocation)) | ||
const location = new URL(`https://w3s.link/ipfs/${carCid}?format=raw`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use raw CIDs here as opposed to CAR CIDs with format=raw
should not we ?
package.json
Outdated
@@ -41,7 +41,7 @@ | |||
"@ipld/dag-json": "^10.1.7", | |||
"@ipld/dag-pb": "^4.0.8", | |||
"@web3-storage/content-claims": "^4.0.5", | |||
"@web3-storage/gateway-lib": "^4.1.1", | |||
"@web3-storage/gateway-lib": "github:web3-storage/gateway-lib#5a027e05be62f985407b46bce70748f543d302b7", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs update
/** @param {import('multiformats').MultihashDigest} digest */ | ||
export const toBlobKey = digest => { | ||
const digestString = base58btc.encode(digest.bytes) | ||
return `${digestString}/${digestString}.blob` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also do this keying with pathing to better optimize S3 Rate limits. Re R2 Rate limits, this was not rate limited like S3, but last time we talked (6 months + ago), they would be looking into having that possibility on their end.
So, I would push back on doing this change, and prioritize to finish what we need on the reads interface to swap the bucket
🤖 I have created a release *beep* *boop* --- ## [2.17.0](v2.16.0...v2.17.0) (2024-05-15) ### Features * support byte range for raw block requests ([#101](#101)) ([1ff3bad](1ff3bad)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Pulls in latest gateway-lib and dagula to allow byte range requests to raw blocks (
?format=raw
).supersedes #100