Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL store #4896

Merged
merged 13 commits into from
Jul 13, 2018
Merged

URL store #4896

merged 13 commits into from
Jul 13, 2018

Conversation

Stebalien
Copy link
Member

This is an experimental URL store feature that needs some discussion/design work before we even consider merging it. I'm opening this PR at @diasdavid's request so we don't lose track of it.

@Stebalien Stebalien added status/deferred Conscious decision to pause or backlog status/blocked Unable to be worked further until needs are met labels Mar 29, 2018
@Stebalien Stebalien requested a review from Kubuxu as a code owner March 29, 2018 18:55
@ghost ghost assigned Stebalien Mar 29, 2018
@ghost ghost added status/in-progress In progress and removed status/deferred Conscious decision to pause or backlog labels Mar 29, 2018
@Stebalien Stebalien added status/deferred Conscious decision to pause or backlog and removed status/in-progress In progress labels Mar 29, 2018
@Stebalien Stebalien changed the title [DO NOT MERGE} URL store [DO NOT MERGE] URL store Mar 29, 2018
@kyledrake
Copy link

This is beautiful, thank you. ❤️

@whyrusleeping
Copy link
Member

whyrusleeping commented Mar 30, 2018

Reading through this and thinking a bit, It seems like the filestore/urlstore logic should be self-contained entirely in a special chunker. The fact that its leaking into the dagbuilderhelper importer stuff appears to be unnecessary.

Cleaning that up might make it easier to integrate both the filestore and the urlstore in the same codebase

-- more thinking --

For this to work, the import would need to be changed to so that it accepts a chunker with an interface that gives it just a cid and a length (and changing the chunker to be responsible for putting the blocks to the datastore. Switching things up might actually make it easier for us to optimize the importing process by adding batched writes of the file being chunked in the chunker itself.

Copy link
Member

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this allowed to pass cid/multihash params as it would allow to build bridges to ipld things more easily

@@ -4,4 +4,5 @@ message DataObj {
optional string FilePath = 1;
optional uint64 Offset = 2;
optional uint64 Size = 3;
optional bool URL = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this an enum

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Making this a type enum makes more sense and allows for better future extensibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could:

  1. Rename FilePath to Location.
  2. Use URLs for everything and use the file:/// schema for files.
  3. Unify the commands.

This should make this trivial to extend via plugins.

(don't bother if this leads to a bunch of additional work)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(2) Use URLs for everything and use the file:/// schema for files.

This will require a migration. Alternatively if I am correct that paths or stored absolute, we can use that to distinguish between url's and files.

(3) Unify the commands.

This makes sense since they are essentially doing the same thing, and I don't think it will be too hard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will require a migration. Alternatively if I am correct that paths or stored absolute, we can use that to distinguish between url's and files.

I assume they're absolute. And yes, that'd be a reasonable way forward.

Copy link
Contributor

@kevina kevina Jun 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there not. However the string http:// or https:// will not appear in any normalized file path as // will always be simplified to '/'. So I still think checking for a http:// or https:// is valid way forward and can make this patch fairly small and non-invasive.

@Stebalien Stebalien removed their assignment Mar 31, 2018
@parkan
Copy link

parkan commented Jun 5, 2018

attempting to bring this up to date with master to potentially resolve some issues that IA is seeing (https://github.com/protocol/collab-internet-archive/issues/13)

most of the conflicts seem to be caused (or made worse by) unstable import sort order, is this not something gofmt takes care of for us?

@ghost ghost assigned Kubuxu Jun 5, 2018
@ghost ghost added status/in-progress In progress and removed status/deferred Conscious decision to pause or backlog labels Jun 5, 2018
@Kubuxu
Copy link
Member

Kubuxu commented Jun 5, 2018

I've rebased it.

@Kubuxu
Copy link
Member

Kubuxu commented Jun 6, 2018

Tests are failing, I didn't have time to check why.

@parkan
Copy link

parkan commented Jun 6, 2018

not finding gx/ipfs/QmWo8jYc19ppG7YoTsrr2kEtLRbARTJho5oNXFTR6B7Peq/go-ipfs-chunker, can I get push access to the repo so I can diagnose without PRing from my fork?

@parkan
Copy link

parkan commented Jun 6, 2018

side note: this is probably a (long) separate discussion, but the goimports behavior with gx seems to introduce a nontrivial overhead with constantly reordering imports based on the hash, possible to patch it to sort based on alias instead?

for example, this rebase could be done automatically (--theirs semantics) if not for non-deterministic sorting

@Stebalien
Copy link
Member Author

side note: this is probably a (long) separate discussion, but the goimports behavior with gx seems to introduce a nontrivial overhead with constantly reordering imports based on the hash, possible to patch it to sort based on alias instead?

Yeah... #4831

@parkan
Copy link

parkan commented Jun 7, 2018

still don't have push access, unbroken here: #5096

@ghost ghost assigned whyrusleeping Jun 8, 2018
@Kubuxu
Copy link
Member

Kubuxu commented Jun 8, 2018

Sorry that I've missed it.

@parkan
Copy link

parkan commented Jun 21, 2018

hi guys, in order to ensure readiness for the demo at DWeb summit (7/30-ish), we need to get this merged

the scenario I am particularly concerned about is breaking changes introduced at the last minute in master that would require panicked rebasing, a sure recipe for demofail (precedent: the rebase in #4896 (comment) was required because accessing unixfs objects broke, thus breaking serving the static frontend of the IA viewer)

by @bigs's estimate there's about an engineer-week of work to get this in good shape (probably less for minimum viable shape)

can we get some eng time allocated to this? 100% understand that go-ipfs team is stretched thin already, but this is a relatively small linchpin that holds together a lot of other work and not getting it done would jeopardize an important opportunity to showcase IPFS to the world

@Kubuxu @whyrusleeping

@kevina
Copy link
Contributor

kevina commented Jun 21, 2018

Could I get some background on what lead to the p.r. and why is it needed?

If I understand what is going allows storing a url in place on an actual object. When it comes time to retrieve the object the URL is fetched instead. I can see how this can cause all sorts of problems, not the least of which is that URL can be very unstable and there is no guarantee that the content will stay around. Retrieving the content could also be very slow.

@Stebalien
Copy link
Member Author

If I understand what is going allows storing a url in place on an actual object. When it comes time to retrieve the object the URL is fetched instead. I can see how this can cause all sorts of problems, not the least of which is that URL can be very unstable and there is no guarantee that the content will stay around. Retrieving the content could also be very slow.

This is designed for cases where the user controls both the server and the IPFS node. Basically, it's treating a remote HTTP server as a filesystem.

Specifically, the internet archive needs this so they can serve their files over IPFS without moving their entire infrastructure over to IPFS.

Copy link
Member Author

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for now. We can make it extensible later.

@@ -48,6 +48,8 @@ type DagBuilderParams struct {
// NoCopy signals to the chunker that it should track fileinfo for
// filestore adds
NoCopy bool

URL string
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment?

}
if IsURL("adir/afile") {
t.Fatal("IsURL recognized non-url")
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should add some tests that look more like URLs (e.g., http:/ /a/file).

Copy link
Member

@whyrusleeping whyrusleeping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Already thinking of some changes we'll want for the future (like a cache in front of the http fetch), but this works for now!

This will be the first thing merged in after 0.4.16 lands. The rush to get it merged was simply so nothing breaks it, right @parkan ?

@kevina
Copy link
Contributor

kevina commented Jun 30, 2018

@whyrusleeping should I rebase this now or wait until 0.4.16 lands (to avoid having to rebase it a second time)

@whyrusleeping
Copy link
Member

@kevina I'd just wait until 0.4.16 lands. I don't think anything else is going to land between then and now, but just in case.

@parkan
Copy link

parkan commented Jul 1, 2018

merging first thing after 0.4.16 sgtm -- ideally this would be before the dev days (planning to demo there and would prefer to do it from master)

thanks again @kevina!

@kevina
Copy link
Contributor

kevina commented Jul 1, 2018

@parkan your welcome. I have this rebased within a day after 0.4.16 lands. If it is before the dev conf, well that depends on @whyrusleeping.

Kubuxu and others added 13 commits July 13, 2018 09:04
License: MIT
Signed-off-by: Jakub Sztandera <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@kevina
Copy link
Contributor

kevina commented Jul 13, 2018

@whyrusleeping @parkan just rebased now that it looks like we finally got a release out.

@daviddias
Copy link
Member

0.4.16 has been released \o/

@whyrusleeping whyrusleeping merged commit 95f721c into master Jul 13, 2018
@ghost ghost removed the status/in-progress In progress label Jul 13, 2018
@whyrusleeping
Copy link
Member

choo choo!

@whyrusleeping whyrusleeping deleted the feat/ai-mirror branch July 13, 2018 15:09
@kyledrake
Copy link

kyledrake commented Jul 13, 2018 via email

@parkan
Copy link

parkan commented Jul 16, 2018

beautiful

@whyrusleeping
Copy link
Member

I think the next steps here will be to add some sort of cache in front of the urlstore, maybe even disk backed. But before we do that, we should probably add metrics of some sort to get an idea of how well it performs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/review Needs a review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants