-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify pieces are CAR before aggregation #1304
Comments
@vasco-santos It occured to me that we could validate that input to piece CID derivation is a valid CAR since we have to read CAR bytes anyway to derive piece CID. If we discover that piece is not a CAR we could error, mark content invalid and take it out of queue. Alternatively we could attempt something more complicated and if piece payload is not a CAR simply prepend a CAR header + content CID and derive piece from there. If we store that information somewhere in content claims we could also make a CAR on demand when serving. This option does seem very complicated however and probably not worth the effort. People really should not be putting non-cars into store anyway. |
Yes, we could check that there today. If we would move to a decentralized run of piece computation, where this would run in a Filecoin Station for example, this would not be possible anymore. Second option would also be problematic, given a computation workload should not have side effects. Therefore, I would like to avoid that direction and discuss first how strong we care about not storing non CARs into aggregates, before making it difficult to decentralize these bits of infra. I was thinking more about checking the CAR cid on I still think it is ok to aggregate non-CARs, so maybe we should align as a team first if that is or not desirable |
I think it depends how you define the interface. If interface is "compute piece cid for the bytes" I agree it is problematic. However if interface is "compute piece cid for the CAR", I would argue that passing anything but CAR should fail that computation whether it runs in station or or elsewhere. There are no side effects piece CID computation should always fail if input invariant is violated.
I actually would like to remove the assumption that we take CARs. In fact directionally I think we all want to move to a system where we don't turn files into CARs which we then index to assemble files back. Instead we want to take files and optionally index them if we want to support block level reads. I'll write more on this in a separate issue.
I think this is not our decision to make. Regardless though we need to fix our assumptions which currently are invalid. We assume that store/add takes CARs and that aggregate builder aggregates from pieces that are bytes. I think we should change them around, that is drop assumption that bytes stored are CARs and in aggregation impose invariant that pieces are derived from CARs not from bytes. On a final note if we do fix our assumptions we also will have alternative way to deal with the problem, specifically we'll know which one is the piece derived from CAR payload and which one is piece derived from arbitrary bytes. We could retain that info and consider it in the piece sorting algorithm so that aggregate will start with a CAR. That said, I personally don't think we should force non CARs into filecoin pipeline if there is a resistance (which there seems to be). I would instead collect evidence of negative effects of that resistance so opinions can be changed and if they do we could trivially remove invariant that payload is CAR. |
Boost assumes that aggregate pieces are CARs
https://github.com/filecoin-project/boost/blob/baf26c6dc883dade04947ff4632eb1e664fb55b9/storagemarket/deal_commp.go#L148-L154
However currently we do not verify that uploaded CAR is actually a valid CAR, we instead leverage S3 / R2 presigned URLs to verify that uploaded file matches sha256 in the CAR cid.
As it turns out some users have uploaded non CAR files through a CAR upload pipeline, resulting in mistagged CIDs.
We should either start verifying that uploaded content is a CAR, or perhaps use raw blocks instead so content is not mistagged. If we do later we also need to decide what to do when putting pieces into an aggregate as they may not be CARs which would violate assumptions in boost.
The text was updated successfully, but these errors were encountered: