File-based import/export for Merkle DAGs (à la git bundle) #1195
My interest in pushing this ahead is to replace the current doc2go workflow with an IPO bundle. For more details, see whyrusleeping/doc2go#2 and the difficult-to-extend-to-recursive-directories-or-other-useful-objects code here.
It will be nice to have our archive format finally written. 👍 (although ipo isn't the name i would choose)
On Sun, May 03, 2015 at 11:22:40PM -0700, Jeromy Johnson wrote:
But you could have this great little asteroid or comet icon :p. I was
The format name is going to be something… Some examples:
i don't care much about clashing with utilities that are barely used now. all things being equal it's better not to, but name collisions are inevitable as time passes. I don't have any statistics on Disk ARchive usage to warrant either clashing or not clashing.
On Mon, May 04, 2015 at 08:17:36AM -0700, Juan Batiz-Benet wrote:
I don't think the self-authenticating part of this is important enough
I'm gathering notes on similar existing formats. Here's the spec for
I'll keep looking for specs on other similar formats to give us a
@wking Git pack files are transferred too. You might want to have a look at: http://git-scm.com/book/es/v2/Git-Internals-Transfer-Protocols And I think a corruption-checking end-of-pack sanity hash is nice to have.
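For what it's worth, here's a minimal sketch (in Go, with made-up function names, not any settled format) of what such an end-of-pack sanity hash could look like: the writer appends a SHA-256 of everything before the trailer, and the reader recomputes it over the body and compares.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
)

// appendTrailer appends a SHA-256 digest of the archive body,
// analogous to the sanity hash at the end of a git packfile.
func appendTrailer(body []byte) []byte {
	sum := sha256.Sum256(body)
	return append(body, sum[:]...)
}

// checkTrailer verifies that the last 32 bytes match the digest
// of everything that precedes them.
func checkTrailer(archive []byte) error {
	if len(archive) < sha256.Size {
		return errors.New("archive truncated: no room for trailer")
	}
	body := archive[:len(archive)-sha256.Size]
	want := archive[len(archive)-sha256.Size:]
	got := sha256.Sum256(body)
	if !bytes.Equal(got[:], want) {
		return errors.New("archive corrupt: trailer mismatch")
	}
	return nil
}

func main() {
	archive := appendTrailer([]byte("serialized objects..."))
	fmt.Println(checkTrailer(archive)) // <nil>
}
```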
On Tue, May 05, 2015 at 05:09:22AM -0700, Christian Couder wrote:
Ah, good to know.
That explains why you'd want a separate packfile index. Does anyone
Can you explain why? Each object is already individually hashed to
Yeah, according to Junio in this thread: "The bundle file is a thinly wrapped packfile, with extra information
Because this makes sure that the whole pack has been correctly transferred.
On Tue, May 05, 2015 at 10:30:45AM -0700, Christian Couder wrote:
What do you do if you detect truncation or corruption (I don't see how

a. Ask for a fresh copy. In this case, I'd rather handle the check in

If those mechanisms are insufficient for a given use-case, I think you

On the other hand, it's easy to add and check, so I'm ok with just
@wking said
Yeah.
Agreed. (we will want to, so should be careful not to make it hard on ourselves.)
Yep, exactly. A dag with one root. (want multiple roots? add a virtual root at the top that gathers all of them)

Ideally we'll want to store the objects with offsets so seeking is very fast. (am reminded of cdb). We'll probably end up with a format that wraps (prefix or suffix) each object with file offset tables (mapping a link to a file offset), etc. It sort of looks to me like we'd take the dag and wrap it with another dag (each object wrapped in another object) where the links are file offsets instead of the hashes. (well, the full link is the pair (hash, offset))

(forgive me if this is rambly and didn't make sense. i can draw it.)
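To make the "wrapping dag" shape a bit more concrete, here's a rough sketch (Go; all type and field names are illustrative, not a spec): each stored object carries its payload plus its links rewritten as (hash, offset) pairs, and a cdb-style table maps hashes to offsets for fast seeking.

```go
// Package archiveformat sketches the "wrapping dag" layout discussed
// above; every name here is illustrative, nothing is settled.
package archiveformat

// OffsetLink is the "full link": the original object hash paired with
// the byte offset where the linked object lives in the archive file.
type OffsetLink struct {
	Hash   []byte // multihash of the linked object
	Offset uint64 // byte offset of that object within the archive
}

// Entry wraps one dag object for on-disk storage: its raw serialized
// bytes prefixed (or suffixed) with outgoing links rewritten as
// (hash, offset) pairs.
type Entry struct {
	Links []OffsetLink
	Data  []byte
}

// Index is the seek table: hash -> offset, so a reader can jump
// straight to any object without scanning the whole file
// (reminiscent of cdb's hash-table-of-offsets layout).
type Index map[string]uint64
```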
This sounds good to me. Would have to play with it more to see if it has any drawbacks/pitfalls, but it should work correctly. The index size could actually be a function of the number of objects. Treat it like a hash table, the more objects, the more collisions possible, so we bump up the size. (reminded again of cdb).
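A hedged sketch of that sizing idea (the load-factor headroom and minimum below are arbitrary placeholders, not a proposal): pick the number of index buckets as a function of the object count so collision chains stay short, much as cdb does.

```go
package archiveformat

// bucketCount picks an index size as a function of the number of
// objects. The ~1.33x headroom and the minimum of 16 buckets are
// arbitrary illustrative choices, not part of any spec.
func bucketCount(numObjects int) int {
	n := numObjects + numObjects/3
	if n < 16 {
		n = 16
	}
	return n
}
```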
Yep! 👍
Yeah I agree. Since it's a proper merkle dag (hash tree properties), we can know that the thing is missing pieces.

(Aside: Actually, I was just pointed to (thanks @zmanian !) a very cool talk by agl where he shows a streaming merkledag construction. and he made a point that most AEAD stream constructions fail to account for truncation correctly (miss clearly labeling start and end chunks), whereas a merkledag approach does so easily.)

Another question raised is what to do in the case of corruption -- we'll be able to detect the level of corruption in most cases (i.e. because we have the full merkledag we can isolate the corrupted subtrees, and use the uncorrupted ones. signatures on the top object still check out and everything). Maybe this is an application decision, where some applications need the full intact archive, and others are ok tolerating some loss.

We could even possibly reconstruct multiple corrupted dag archives if they happen to complement each other (RAID style). (I'm also reminded of long-distance transport protocols (where latency >> bandwidth), which transmit the same data multiple times back to back. could even have a mode where we trade off storage size for object redundancy in the same archive file, storing objects a number of times and randomizing their location to correct for disk sectors going bad, and so on. This is all out of scope, but relevant to mention as may be useful to some people in the future.)
sounds good! 👍 very useful. @chriscool and @wking: on the trailing hash problem, am not sure, but I think @wking is right. we can walk the dag to check all the objects individually in lieu of the full checksum; the whole thing should be consistent. this may be slower than a straight up checksum, but it's got other better properties.
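A rough sketch of the walk-and-verify idea (Go; SHA-256 stands in for whatever multihash the real format uses, and an in-memory map stands in for reading objects at their file offsets): start from the root hash, re-hash every object you read, and recurse into its links, so a corrupted or missing subtree is detected and isolated without any whole-file checksum.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// obj is a stand-in for a decoded dag node: its payload bytes and the
// hashes of the nodes it links to.
type obj struct {
	data  []byte
	links [][]byte
}

// hashNode stands in for the real serialization: in a merkle dag the
// hash covers the payload *and* the links.
func hashNode(o obj) []byte {
	h := sha256.New()
	h.Write(o.data)
	for _, l := range o.links {
		h.Write(l)
	}
	return h.Sum(nil)
}

// verify walks the dag from root, re-hashing each object against the
// hash its parent claimed, so corrupted or missing subtrees are
// reported individually instead of failing the whole archive.
func verify(store map[string]obj, root []byte, bad *[]string) {
	o, ok := store[string(root)]
	if !ok {
		*bad = append(*bad, fmt.Sprintf("missing %x", root))
		return
	}
	if !bytes.Equal(hashNode(o), root) {
		*bad = append(*bad, fmt.Sprintf("corrupt %x", root))
		return
	}
	for _, l := range o.links {
		verify(store, l, bad)
	}
}

func main() {
	leaf := obj{data: []byte("leaf")}
	leafHash := hashNode(leaf)
	root := obj{data: []byte("root"), links: [][]byte{leafHash}}
	rootHash := hashNode(root)

	store := map[string]obj{string(rootHash): root, string(leafHash): leaf}
	var bad []string
	verify(store, rootHash, &bad)
	fmt.Println("bad subtrees:", bad) // prints an empty list
}
```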
I think so too, but again, not sure. there may be something we're missing that @chriscool is pointing at, or that other git contributors saw as relevant for the packfile. (Not sure, for example, why the packfile isn't a merkledag itself, and is instead just loose objects back to back without full merkle structure. maybe they just didn't think of it because their use case didn't lead them to that path). More on names:
@wking i think the depth of these discussions warrants making a repo. I've just made https://github.com/ipfs/archive-format
closing, continue discussion at https://github.com/ipfs/archive-format
@jbenet has mentioned this a few times on IRC, but there are no formal specs yet. The eventual self-certifying goal is blocking on public-key and signature objects (#1045, ipfs/specs#3), but we can certainly start work on a serialized-object file format before those land.
Tangentially, I'd prefer `.ipo` as the extension (InterPlanetary Objects), since `.dar` seems to be already used by Disk ARchiver.

Do we want to spec this file format out in ipfs/specs, or should I just dive in with implementations for:
Once we get that implemented, a useful extension for the UI would be a repeatable `--ignore $IPFS_OR_IPNS_OBJECT_NAME` argument to `ipfs ipo export`, which would allow you to say “export all objects reachable by `<names-listed-in-the-positional-arguments>` except for `<names-listed-in-ignore-options>` and their descendants”. That's the same sort of thing you can do with `git bundle` for “bundle all objects reachable from my master branch reference except for those reachable from my v1.0.0 tag”, but I like the separate option to make it easier to truncate in multiple locations.
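As a sketch of those semantics (Go; `children` is a hypothetical stand-in for resolving an object's links, and the CLI flag itself is only proposed above, not implemented): the export set is everything reachable from the positional roots minus everything reachable from the `--ignore` roots.

```go
package main

import "fmt"

// reachable collects every name reachable from the given roots via
// the children function (a stand-in for following merkledag links).
func reachable(roots []string, children func(string) []string) map[string]bool {
	seen := map[string]bool{}
	stack := append([]string{}, roots...)
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[n] {
			continue
		}
		seen[n] = true
		stack = append(stack, children(n)...)
	}
	return seen
}

// exportSet implements the proposed semantics: objects reachable from
// the positional roots, excluding anything reachable from --ignore.
func exportSet(roots, ignore []string, children func(string) []string) []string {
	skip := reachable(ignore, children)
	var out []string
	for n := range reachable(roots, children) {
		if !skip[n] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Toy dag: root -> {a, old}; old -> {b}.
	links := map[string][]string{"root": {"a", "old"}, "old": {"b"}}
	children := func(n string) []string { return links[n] }
	fmt.Println(exportSet([]string{"root"}, []string{"old"}, children))
	// Prints some ordering of [root a]; "old" and "b" are pruned.
}
```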