Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inability to pin two things at once #5512

Merged
merged 2 commits into from
Sep 28, 2018

Conversation

michaelavila
Copy link
Contributor

Goals

Fix #5376 by allowing content to be fetched concurrently when pinning multiple hashes.

Implementation

Simply release the pinner lock when fetching content and re-lock it immediately afterwards.

For Discussion

ipfs dag put <<<'{"link": {"/": "QmdZmoS1b1jp9vL5vgTyebEYUVH6zjd6erhp3fpqV1Zktx"}}'
ipfs pin add zdpuAsWv1LNwtxvFB3iTQjtwDLxPxRaa3WeE4i6BoDuKDy4gf &
echo "thing" | ipfs add
  • This implementation still blocks gc:
ipfs dag put <<<'{"link": {"/": "QmdZmoS1b1jp9vL5vgTyebEYUVH6zjd6erhp3fpqV1Zktx"}}'
ipfs pin add zdpuAsWv1LNwtxvFB3iTQjtwDLxPxRaa3WeE4i6BoDuKDy4gf &
ipfs repo gc

@ghost ghost assigned michaelavila Sep 23, 2018
@ghost ghost added the status/in-progress In progress label Sep 23, 2018
@michaelavila michaelavila force-pushed the fix/reduce-critical-pinning-session-5376 branch 2 times, most recently from 55338e6 to 4613e53 Compare September 23, 2018 02:30
pin/pin.go Show resolved Hide resolved
@michaelavila michaelavila force-pushed the fix/reduce-critical-pinning-session-5376 branch from 4613e53 to df09116 Compare September 23, 2018 16:36
@Stebalien
Copy link
Member

but it appears that gc locking is managed here.

My concern with this is that we're requiring that the caller hold this lock and, if they aren't holding it, we can end up in a bad pin state (where we think something has been pinned but it hasn't). This will cause GC to hard-fail (it will simply refuse to run).

We actually have some code that will call this without holding that lock. I can pretty much guarantee that we won't run into issues but I'd feel more comfortable with a completely correct fix.

At the moment, I can't think of any simple and correct fixes that don't involve giving the Pinner access to the GCLocker.

@Stebalien
Copy link
Member

Note: There may actually be other solutions and I'd love to have alternatives. One would be to just fix all calls to Pinner.Pin(...) and make sure they hold the correct locks. Alternatively, we might be able to switch everything to the coreapi's pinning interface (core/coreapi/pin.go) which will always take the correct locks. That may actually be the correct solution.

@Stebalien
Copy link
Member

@lus0p let's let @michaelavila work on this one for now.

@michaelavila
Copy link
Contributor Author

@Stebalien

I'm looking into your suggestions now, but I want to run something by you. This PR solves a very specific problem, but as you noted there is another problem that we would like to solve to make the overall solution more correct. Are you amenable to tackling these two problems separately? One thing we can do is get this solution merged (once it's ready) and create an issue to solve the other problem. I see a few benefits:

  • Overall, the PRs will be smaller (rather than solve two problems in one, we have two that each solve one problem)
  • One of the problems will be solved without being held up on solving another problem
  • Each piece of work can be prioritized separately
  • If the next most important thing for me to do is to solve the second problem, I can start on it immediately, so I don't think we lose much by breaking it out

Ultimately, if this solution (i.e. the code) I'm proposing is flawed in a serious way then I'd recommend not proceeding. But if the problem already exists today (which I believe it does), then I think risk is low.

Let me know what you think.

cc @eingenito @hannahhoward

@Stebalien
Copy link
Member

My issue is that this patch technically introduces a rather nasty bug (albeit one that probably won't come up in practice). We have code that calls pinner.Pin(...) without holding the GCLock. If we happen to GC in the middle of this, we'll end up in a broken state with pinned nodes that aren't in the local blockstore.

A simple alternative is to just fetch everything ahead of time. That is, have the pin command act like the add command:

  1. Take the pin lock.
  2. Fetch the dag in the command instead of relying on the pinner.
  3. Call pinner.Pin(...) on the already fetched DAG.
  4. Optional, future: Give the pinner an offline dag so it can't fetch anything. Tell the user that they need to already have everything they want to pin.

The downside is that we'll have to traverse the dag twice (once under a lock). On the other hand, this is what add already does so I'm not so concerned.

@Stebalien
Copy link
Member

Basically, I don't want to go straight for the correct fix (get rid of the gclock and instead introduce temporary runtime "pins"). However, I also don't want to introduce a bug (even it it's a rare edge case).

@michaelavila
Copy link
Contributor Author

michaelavila commented Sep 25, 2018

@Stebalien makes sense, I didn't realize I was introducing something that wasn't already there. I also don't want to introduce a new bug. I'll keep cracking at this after dealing with the sharded ls PR.

Thanks for the quick response.

@michaelavila
Copy link
Contributor Author

@Stebalien I've been working through your feedback and trying to understand the problem I may have introduced, but am having trouble. It seems like the issue you're alluding to existed before this change. Can you explain in more detail what issue was introduced by this change? My apologies, I'd like to figure this out without bugging you, but after digging into it both myself and with a pair I am still not sure. Thank you.

@Stebalien
Copy link
Member

Actually... I think I was wrong. Thanks for asking me to actually think through this.

Basically, I figured that holding the pin lock would prevent GC. This is correct. However, we can still have the following sequence of events:

  1. A GC starts.
  2. The GC gets the current pinset.
  3. Something calls pinner.Pin without taking the GC lock.
  4. GC removes the block.

Oops? Yep, we already have that bug. New pinner.Pin requirement: "the GC lock must be held or else!".

Stebalien
Stebalien previously approved these changes Sep 27, 2018
@Stebalien Stebalien requested a review from magik6k September 27, 2018 02:16
@Stebalien
Copy link
Member

@magik6k could you double-check the logic?

Kubuxu
Kubuxu previously approved these changes Sep 27, 2018
Copy link
Member

@Kubuxu Kubuxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we aren't worried about the case when GC lock is not held, LGTM.

@Stebalien
Copy link
Member

If we aren't worried about the case when GC lock is not held, LGTM.

Yeah, turns out taking this lock doesn't actually help. If the GC lock isn't held, bad things can still happen.

magik6k
magik6k previously approved these changes Sep 27, 2018
Copy link
Member

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I've also looked at all invocations of this method, it seems like all places where GC can happen (so after initialization), do hold n.Blockstore.PinLock() (unless GC can be triggered while the node is still starting)

@michaelavila
Copy link
Contributor Author

Thanks everyone. Before I merge this, I want to resolve the codecov issue which was introduced when I added that check in after re-locking. I'll add a note when that's done.

@Stebalien Stebalien dismissed stale reviews from magik6k, Kubuxu, and themself via 7ed1a53 September 27, 2018 21:54
@Stebalien Stebalien force-pushed the fix/reduce-critical-pinning-session-5376 branch from df09116 to 7ed1a53 Compare September 27, 2018 21:54
@Stebalien
Copy link
Member

(rebased to get github to stop complaining)

Let's add tests in a different PR (tests would be much appreciated). It's not an issue with your code, just our test coverage in general.

@michaelavila
Copy link
Contributor Author

@Stebalien ok, are you saying I should just merge this now then?

@Stebalien
Copy link
Member

Yeah... I'm just trying to get jenkins to pass.

@Stebalien Stebalien merged commit c4398fe into master Sep 28, 2018
@ghost ghost removed the status/in-progress In progress label Sep 28, 2018
@Stebalien
Copy link
Member

🎉

@Stebalien Stebalien deleted the fix/reduce-critical-pinning-session-5376 branch September 28, 2018 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants