
File-based import/export for Merkle DAGs (à la git bundle) #1195

Closed
wking opened this issue May 4, 2015 · 12 comments
Labels
need/community-input Needs input from the wider community

wking commented May 4, 2015

@jbenet has mentioned this a few times on IRC, but there are no formal specs yet. The eventual self-certifying goal is blocking on public-key and signature objects (#1045, ipfs/specs#3), but we can certainly start work on a serialized-object file format before those land.

Tangentially, I'd prefer .ipo as the extension (InterPlanetary Objects), since .dar seems to be already used by Disk ARchiver.

Do we want to spec this file format out in ipfs/specs, or should I just dive in with implementations for:

$ ipfs ipo export $IPFS_OR_IPNS_OBJECT_NAME … > archive.ipo
$ ipfs ipo ls archive.ipo
$IPFS_OR_IPNS_OBJECT_NAME
…
$ ipfs import < archive.ipo
$IPFS_OR_IPNS_OBJECT_NAME
…

Once we get that implemented, a useful extension for the UI would be a repeatable --ignore $IPFS_OR_IPNS_OBJECT_NAME argument to ipfs ipo export, which would allow you to say “export all objects reachable by <names-listed-in-the-positional-arguments> except for <names-listed-in-ignore-options> and their descendants”. That's the same sort of thing you can do with:

$ git bundle create mybundle v1.0.0..master

for “bundle all objects reachable from my master branch reference except for those reachable from my v1.0.0 tag”, but I like the separate option to make it easier to truncate in multiple locations.
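For example, a hypothetical session with that flag (neither the ipo subcommand nor --ignore exists yet; the variable names are just placeholders) could look like:

$ ipfs ipo export --ignore $OLD_RELEASE_NAME --ignore $VENDORED_TREE_NAME $NEW_RELEASE_NAME > incremental.ipo

which would pack everything reachable from $NEW_RELEASE_NAME, stopping the walk at the two ignored roots and anything below them.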

wking commented May 4, 2015

My interest in pushing this ahead is to replace the current doc2go workflow with an IPO bundle. For more details, see whyrusleeping/doc2go#2 and the difficult-to-extend-to-recursive-directories-or-other-useful-objects code here.

@whyrusleeping

It will be nice to have our archive format finally written. 👍 (although ipo isn't the name i would choose)

wking commented May 4, 2015

On Sun, May 03, 2015 at 11:22:40PM -0700, Jeromy Johnson wrote:

(although ipo isn't the name i would choose)

But you could have this great little asteroid or comet icon :p. I was
mostly looking for something without obvious conflicts though, so I'm
happy to use a different name. What would you choose?

jbenet commented May 4, 2015

The format name is going to be something of the form ._ar, so it goes with all the -ar archival utilities.

Some examples:

  • dar for merkleDag ARchive (clashes with Disk ARchive)
  • mar for Merkle(dag) ARchive
  • sar for Secure ARchive
  • aar for Authenticated ARchive (pirates!)
  • kar for merKledag ARchive
  • yar for ______ ARchive.

seems to be already used by ...

i don't care much about clashing with utilities that are barely used now. all things being equal better not to, but name collisions are inevitable as time passes. I don't have any statistics on Disk ARchive usage to warrant either clashing or not clashing.

wking commented May 4, 2015

On Mon, May 04, 2015 at 08:17:36AM -0700, Juan Batiz-Benet wrote:

  • dar for merkleDag ARchive (clashes with Disk ARchive)
  • mar for Merkle(dag) ARchive
  • sar for Secure ARchive
  • aar for Authenticated ARchive (pirates!)
  • kar for merKledag ARchive
  • yar for ______ ARchive.

I don't think the self-authenticating part of this is important enough
to be worth enshrining in the name ;). Things that focus on the
Merkle-ness, DAG-ness, or IPFS-Object-ness make more sense to me. How
about .oar for object-archive. Seems to be unused, and it's fun to
say :). The only space-oriented name I've turned up with Y is Yildun,
but “the one next to Polaris” isn't the most awesome qualification ;).

@wking wking added the ready label May 4, 2015
@wking wking changed the title File-base import/export for Merkle DAGs (à la git bundle) File-based import/export for Merkle DAGs (à la git bundle) May 4, 2015
@wking wking self-assigned this May 4, 2015
wking commented May 4, 2015

I'm gathering notes on similar existing formats. Here's the spec for
Git's packfiles [1]. The main differences for us are:

  • We want a single file with both the packed objects and the index.

  • I don't think we want to bother with delta representations yet.

  • We want a variable-length table with hashes for any top-level
    objects (maybe just copied from what the user listed on the command
    line), so we can show folks the main entrypoints for an archive
    without walking the whole pack.

  • Our multihash keys are variable length (Git uses 20-byte SHA-1s
    everywhere). I expect we want to index our entries with a 20-byte
    (or thereabout) slice of the crypto portion of the multihash (padded
    with zeros), with an offset pointing to the first of those objects
    in the pack portion. Folks looking up a particular object can find
    the offset in the index, jump to that offset in the pack portion,
    and start re-hashing those objects until either:

    a. The calculated multihash matches the lookup key (success),
    b. The calculated multihash has a different 20-byte crypto portion
    (failure), or
    c. We run out of file (failure).

    This approach is basically using the main index table as a second
    level of fanout (with very few collisions), with the authoritative
    hashes moving from the main index table to a property of the packed
    objects themselves.

    Case (c) means we don't want Git's end-of-pack sanity hash [2].
    Since we're intending to transfer these objects between hosts
    (vs. Git's local-only packfiles), we'll be hashing each of the
    included objects anyway (and eventually including signature objects,
    etc), so I don't think a corruption-checking sanity hash is
    particularly useful here. (A rough sketch of this lookup follows
    below.)
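Here's a rough Go sketch of that lookup loop (every name below is invented for illustration; nothing is an existing go-ipfs API, and the multihash parsing is a stand-in):

package archive

import (
    "bytes"
    "errors"
)

// indexEntry maps a zero-padded 20-byte slice of a multihash digest to the
// offset of the first candidate object in the pack region.
type indexEntry struct {
    keyPrefix [20]byte
    offset    int64
}

type packReader struct {
    index    []indexEntry
    packSize int64
    // readObjectAt stands in for decoding one packed object: it returns the
    // object bytes, the freshly computed full multihash, and the offset of
    // the next object in the pack.
    readObjectAt func(off int64) (data, mh []byte, next int64, err error)
}

// digestPrefix stands in for slicing the crypto portion of a multihash
// (after the function-code and length bytes) down to 20 zero-padded bytes.
func digestPrefix(mh []byte) (p [20]byte) {
    copy(p[:], mh[2:]) // assumes a 2-byte multihash header, just for the sketch
    return p
}

func (r *packReader) lookup(target []byte) ([]byte, error) {
    want := digestPrefix(target)
    for _, e := range r.index {
        if e.keyPrefix != want {
            continue
        }
        for off := e.offset; off < r.packSize; {
            data, mh, next, err := r.readObjectAt(off)
            if err != nil {
                return nil, err
            }
            if bytes.Equal(mh, target) {
                return data, nil // case (a): full multihash matches
            }
            if digestPrefix(mh) != want {
                return nil, errors.New("not found") // case (b): different crypto prefix
            }
            off = next // same prefix, keep re-hashing forward
        }
        return nil, errors.New("not found") // case (c): ran out of file
    }
    return nil, errors.New("no index entry")
}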

I'll keep looking for specs on other similar formats to give us a
better shot of making well-informed choices here ;).

@wking wking removed the ready label May 4, 2015
@chriscool

@wking Git pack files are transferred too.

You might want to have a look at:

http://git-scm.com/book/es/v2/Git-Internals-Transfer-Protocols

And I think a corruption-checking end-of-pack sanity hash is nice to have.

wking commented May 5, 2015

On Tue, May 05, 2015 at 05:09:22AM -0700, Christian Couder wrote:

@wking Git pack files are transferred too.

Ah, good to know.

http://git-scm.com/book/es/v2/Git-Internals-Transfer-Protocols

That explains why you'd want a separate packfile index. Does anyone
know of docs for the ‘git bundle’ format? Is it just an index and
packfile wedged together in a single file?

And I think a corruption-checking end-of-pack sanity hash is nice to
have.

Can you explain why? Each object is already individually hashed to
figure out its ID. That means object corruption would just give you
“couldn't find <hash>”, which seems reasonable. And index corruption
would just send you to the wrong place, which would probably also give
you “couldn't find <hash>” because the hash calculated at the target
location wouldn't match [1]. What do you gain by hashing everything
together?

@chriscool

That explains why you'd want a separate packfile index. Does anyone know of docs for the ‘git bundle’ format? Is it just an index and packfile wedged together in a single file?

Yeah, according to Junio in this thread:

"The bundle file is a thinly wrapped packfile, with extra information
that tells what objects in the bundle are the tips of histories and
what objects the repository the bundle gets unbundled has to have."

Can you explain why?

Because this makes sure that the whole pack has been correctly transferred.
Only checking object hashes doesn't guarantee that the file has not been truncated.
And you can check it fast, for example without having to check if each object in the index file is also in the pack file.

wking commented May 5, 2015

On Tue, May 05, 2015 at 10:30:45AM -0700, Christian Couder wrote:

Can you explain why?

Because this makes sure that the whole pack has been correctly
transferred. Only checking object hashes doesn't guarantee that the
file has not been truncated. And you can check it fast, for example
without having to check if each object in the index file is also in
the pack file.

What do you do if you detect truncation or corruption (I don't see how
a trailing-checksum spec would let you distinguish between them)? I
can see a few choices:

a. Ask for a fresh copy. In this case, I'd rather handle the check in
the transport layer itself. For example, [1,2] will catch most
cases.
b. Abort with an error. In this case, I'd rather just assume an
uncorrupted archive and try and extract the objects. If the usual
procedure gets you a chunk of bytes whose multihash matches what
you're looking for, then you should be pretty safe.

If those mechanisms are insufficient for a given use-case, I think you
might as well be signing the whole archive with an external signature.
A trailing checksum just seems like a weak middle ground.

On the other hand, it's easy to add and check, so I'm ok with just
adding it if these arguments aren't convincing ;). In that case I'll
just need clarity on how we should handle the “trailing-checksum
doesn't match” case.

jbenet commented May 5, 2015

@wking said

I'm gathering notes on similar existing formats. Here's the spec for
Git's packfiles [1]. The main differences for us are:

  • We want a single file with both the packed objects and the index.

Yeah.

  • I don't think we want to bother with delta representations yet.

Agreed. (we will want to, so should be careful not to make it hard on ourselves.)

  • We want a variable-length table with hashes for any top-level
    objects (maybe just copied from what the user listed on the command
    line), so we can show folks the main entrypoints for an archive
    without walking the whole pack.

Yep, exactly. A dag with one root. (want multiple roots? add a virtual root at the top that gathers all of them)

Ideally we'll want to store the objects with offsets so seeking is very fast. (am reminded of cdb).

We'll probably end up with a format that wraps (prefix or suffix) each object with file offset tables (mapping a link to a file offset), etc.

It sort of looks to me like we'd take the dag and wrap it with another dag (each object wrapped in another object) where the links are file offsets instead of the hashes. (well, the full link is the pair (offset, hash), cause you need + use both). (It's sort of like unixfs in reverse?)

(forgive me if this is rambly and didn't make sense. i can draw it.)
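To pin the idea down, here's a tiny layout sketch (names invented, not a spec): each packed entry keeps the original object bytes untouched and carries a small table pairing each of its merkle links with the offset where that child starts in the file.

package archive

// offsetLink is the (offset, hash) pair for one merkle link.
type offsetLink struct {
    Hash   []byte // authoritative multihash of the linked object
    Offset int64  // where that object begins in the pack region
}

// packedEntry wraps one original object for the archive.
type packedEntry struct {
    Links []offsetLink // one (offset, hash) pair per link in Data; offsets are archive-local
    Data  []byte       // the original serialized object, byte for byte
}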

  • Our multihash keys are variable length (Git uses 20-byte SHA-1s
    everywhere). I expect we want to index our entries with a 20-byte
    (or thereabout) slice of the crypto portion of the multihash (padded
    with zeros), with an offset pointing to the first of those objects
    in the pack portion. Folks looking up a particular object can find
    the offset in the index, jump to that offset in the pack portion,
    and start re-hashing those objects until either:

    a. The calculated multihash matches the lookup key (success),
    b. The calculated multihash has a different 20-byte crypto portion
    (failure), or
    c. We run out of file (failure).

This sounds good to me. Would have to play with it more to see if it has any drawbacks/pitfalls, but it should work correctly. The index size could actually be a function of the number of objects: treat it like a hash table; the more objects, the more possible collisions, so we bump up the size. (reminded again of cdb).
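For example (a toy heuristic, nothing agreed on), the fanout could be picked the way a hash table picks its bucket count:

// indexBuckets is purely illustrative: choose a power-of-two bucket count
// that keeps the load factor at or below roughly 0.75, so more objects mean
// a wider first-level table and fewer candidates to re-hash per lookup.
func indexBuckets(numObjects int) int {
    buckets := 1
    for 3*buckets < 4*numObjects { // i.e. while numObjects/buckets > 0.75
        buckets *= 2
    }
    return buckets
}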

This approach is basically using the main index table as a second
level of fanout (with very few collisions), with the authoritative
hashes moving from the main index table to a property of the packed
objects themselves.

Yep! 👍

Case (c) means we don't want Git's end-of-pack sanity hash [2].
Since we're intending to transfer these objects between hosts
(vs. Git's local-only packfiles), we'll be hashing each of the
included objects anyway (and eventually including signature objects,
etc), so I don't think a corruption-checking sanity hash is
particularly useful here.

Yeah I agree. Since it's a proper merkle dag (hash tree properties), we can tell when the thing is missing pieces.

(Aside: Actually, I was just pointed to (thanks @zmanian !) a very cool talk by agl where he shows a streaming merkledag construction, and he made a point that most AEAD stream constructions fail to account for truncation correctly (miss clearly labeling start and end chunks), whereas a merkledag approach does so easily.)

Another question raised is what to do in the case of corruption -- we'll be able to detect the level of corruption in most cases (i.e. because we have the full merkledag we can isolate the corrupted subtrees, and use the uncorrupted ones. signatures on the top object still check out and everything). Maybe this is an application decision, where some applications need the full intact archive, and others are ok tolerating some loss. We could even possibly reconstruct multiple corrupted dag archives if they happen to complement each other (RAID style).

(I'm also reminded of long-distance transport protocols (where latency >> bandwidth), which transmit the same data multiple times back to back. could even have a mode where we tradeoff storage size for object redundancy in the same archive file, storing objects a number of times and randomizing their location to correct for disk sectors going bad, and so on. This is all out of scope, but relevant to mention as may be useful to some people in the future.)

I'll keep looking for specs on other similar formats to give us a
better shot of making well-informed choices here ;).

sounds good! 👍 very useful.


@chriscool and @wking

On the trailing hash problem, am not sure, but I think @wking is right. can walk the dag to check all the objects individually in lieu of the full checksum. the whole thing should be consistent. this may be slower than a straight up checksum, but it's got other better properties.

If those mechanisms are insufficient for a given use-case, I think you might as well be signing the whole archive with an external signature. A trailing checksum just seems like a weak middle ground.

I think so too, but again, not sure. there may be something we're missing that @chriscool is pointing at, or that other git contributors saw as relevant for the packfile. (Not sure, for example, why the packfile isn't a merkledag itself, and is instead just loose objects back to back without full merkle structure. maybe they just didn't think of it because their use case didn't lead them to that path).


More on names:

  • .car or .kar is growing on me. "Certified ARchive" or "merKle ARchive". (and "it's a car for your data".) I tested a few of these names on people, and the hard consonant sound made it sound way better to them. People seemed to prefer "tar >> dar", and "car >> tar", and "sar > dar". This is too small/anecdotal to count though-- Maybe we could survey lots of people at some point :) -- good names matter, and lots of factors are relevant here (relevance of name, synergy with other names (tar), clarity (acronyms are useful), fun/dramatic flair (.yar, .ipo), sound (tar, yar, car)).
  • another ... interesting ... name came up: .ship. it evokes "shipping things around", "releases", "interplanetary travel". It doesn't work with the ._ar files, but to be fair, it sort of goes with .zip. (im not very sold on it, but mentioning for completeness). not sure how much the unix community would like a 4 letter archive name ;), let alone one that doesn't have ._ar.

@wking i think the depth of these discussions warrants making a repo. I've just made ipfs/archive-format (to not bias us). we can change the name when it becomes relevant. let's maybe separate the discussions here into issues there as we go. https://github.com/ipfs/archive-format

@whyrusleeping

closing, continue discussion at https://github.com/ipfs/archive-format
