Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify pieces are CAR before aggregation #1304

Open
Gozala opened this issue Feb 6, 2024 · 3 comments
Open

Verify pieces are CAR before aggregation #1304

Gozala opened this issue Feb 6, 2024 · 3 comments

Comments

@Gozala
Copy link
Contributor

Gozala commented Feb 6, 2024

Boost assumes that aggregate pieces are CARs

https://github.com/filecoin-project/boost/blob/baf26c6dc883dade04947ff4632eb1e664fb55b9/storagemarket/deal_commp.go#L148-L154

However currently we do not verify that uploaded CAR is actually a valid CAR, we instead leverage S3 / R2 presigned URLs to verify that uploaded file matches sha256 in the CAR cid.

As it turns out some users have uploaded non CAR files through a CAR upload pipeline, resulting in mistagged CIDs.

We should either start verifying that uploaded content is a CAR, or perhaps use raw blocks instead so content is not mistagged. If we do later we also need to decide what to do when putting pieces into an aggregate as they may not be CARs which would violate assumptions in boost.

@Gozala
Copy link
Contributor Author

Gozala commented Feb 7, 2024

@vasco-santos It occured to me that we could validate that input to piece CID derivation is a valid CAR since we have to read CAR bytes anyway to derive piece CID. If we discover that piece is not a CAR we could error, mark content invalid and take it out of queue.

Alternatively we could attempt something more complicated and if piece payload is not a CAR simply prepend a CAR header + content CID and derive piece from there. If we store that information somewhere in content claims we could also make a CAR on demand when serving. This option does seem very complicated however and probably not worth the effort. People really should not be putting non-cars into store anyway.

@vasco-santos
Copy link
Contributor

vasco-santos commented Feb 8, 2024

It occured to me that we could validate that input to piece CID derivation is a valid CAR since we have to read CAR bytes anyway to derive piece CID. If we discover that piece is not a CAR we could error, mark content invalid and take it out of queue.

Yes, we could check that there today. If we would move to a decentralized run of piece computation, where this would run in a Filecoin Station for example, this would not be possible anymore. Second option would also be problematic, given a computation workload should not have side effects. Therefore, I would like to avoid that direction and discuss first how strong we care about not storing non CARs into aggregates, before making it difficult to decentralize these bits of infra.

I was thinking more about checking the CAR cid on store/add, with that we could already get rid of part of "undesirable" uploads if CID does not have the right codes for what we want

I still think it is ok to aggregate non-CARs, so maybe we should align as a team first if that is or not desirable

@Gozala
Copy link
Contributor Author

Gozala commented Feb 8, 2024

Yes, we could check that there today. If we would move to a decentralized run of piece computation, where this would run in a Filecoin Station for example, this would not be possible anymore. Second option would also be problematic, given a computation workload should not have side effects. Therefore, I would like to avoid that direction and discuss first how strong we care about not storing non CARs into aggregates, before making it difficult to decentralize these bits of infra.

I think it depends how you define the interface. If interface is "compute piece cid for the bytes" I agree it is problematic. However if interface is "compute piece cid for the CAR", I would argue that passing anything but CAR should fail that computation whether it runs in station or or elsewhere.

There are no side effects piece CID computation should always fail if input invariant is violated.

I was thinking more about checking the CAR cid on store/add, with that we could already get rid of part of "undesirable" uploads if CID does not have the right codes for what we want

I actually would like to remove the assumption that we take CARs. In fact directionally I think we all want to move to a system where we don't turn files into CARs which we then index to assemble files back. Instead we want to take files and optionally index them if we want to support block level reads. I'll write more on this in a separate issue.

I still think it is ok to aggregate non-CARs, so maybe we should align as a team first if that is or not desirable

I think this is not our decision to make. Regardless though we need to fix our assumptions which currently are invalid. We assume that store/add takes CARs and that aggregate builder aggregates from pieces that are bytes. I think we should change them around, that is drop assumption that bytes stored are CARs and in aggregation impose invariant that pieces are derived from CARs not from bytes.

On a final note if we do fix our assumptions we also will have alternative way to deal with the problem, specifically we'll know which one is the piece derived from CAR payload and which one is piece derived from arbitrary bytes. We could retain that info and consider it in the piece sorting algorithm so that aggregate will start with a CAR.

That said, I personally don't think we should force non CARs into filecoin pipeline if there is a resistance (which there seems to be). I would instead collect evidence of negative effects of that resistance so opinions can be changed and if they do we could trivially remove invariant that payload is CAR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants