Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More Robust GC #3712

Merged
merged 10 commits into from
Mar 22, 2017
Merged

More Robust GC #3712

merged 10 commits into from
Mar 22, 2017

Conversation

kevina
Copy link
Contributor

@kevina kevina commented Feb 20, 2017

No description provided.

@kevina kevina added the status/in-progress In progress label Feb 20, 2017
@whyrusleeping whyrusleeping self-requested a review February 20, 2017 23:10
@kevina
Copy link
Contributor Author

kevina commented Feb 20, 2017

@whyrusleeping this is very much a work in progress, just want some early feedback.

@kevina
Copy link
Contributor Author

kevina commented Feb 20, 2017

This is based on #3700 use this link for a better comparison until that is merged: kevina/enumerate-children-refactor...kevina/more-robust-gc

@kevina
Copy link
Contributor Author

kevina commented Feb 23, 2017

This took longer than I would of like and I am not sure I am happy with the approach but it should work. Everything is tested, but the test could also use some improvements in order to document why what they are testing is important.

pin/gc/gc.go Outdated
}
}
if errors {
errOutput <- CouldNotDeleteSomeBlocksError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrCouldNotDeleteSomeBlocks

pin/gc/gc.go Outdated
@@ -22,58 +25,73 @@ var log = logging.Logger("gc")
//
// The routine then iterates over every block in the blockstore and
// deletes any block that is not found in the marked set.
func GC(ctx context.Context, bs bstore.GCBlockstore, ls dag.LinkService, pn pin.Pinner, bestEffortRoots []*cid.Cid) (<-chan *cid.Cid, error) {
//
func GC(ctx context.Context, bs bstore.GCBlockstore, ls dag.LinkService, pn pin.Pinner, bestEffortRoots []*cid.Cid) (<-chan *cid.Cid, <-chan error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a cleaner return value would be to return a channel of 'results'. Where a result would be defined as something like:

type Result struct {
    Cid *cid.Cid
    Err error
}

Its always simpler to just have to deal with a single channel in cases like these, it can cause tricky race conditions (though the race conditions probably arent an issue here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that is one of the things I didn't like also. Working on fixing that today.

pin/gc/gc.go Outdated
}
}

var CoundNotFetchAllLinksError = errors.New("garbage collection aborted: could not retrieve some links")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure errors like this have an Err prefix

test_description="Test ipfs pinning operations"

. lib/test-lib.sh
set -e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use set -e, sharness handles these sorts of things

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you can run the tests with -i to stop when a test fails)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove it if you want, but note it does serve a very useful purpose. Please see #3721. I already know about -i and that is not why i am using it.

# MIT Licensed; see the LICENSE file in this repository.
#

test_description="Test ipfs pinning operations"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably want to update the description

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah :)

LEAF4=zb2rhnvAVvfDDnndcfQkwqNgUm94ba3zBSXyJKCfVXwU4FXx

test_expect_success "remove a leaf node from the repo manually" '
rm $LEAF1FILE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quotes around the variable, just in case it changes


test_expect_success "create a permission problem" '
chmod 500 `dirname $LEAF2FILE` &&
test_must_fail ipfs block rm $LEAF2 2>&1 | tee err &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably fine just to redirect this to 'err'. May also want to name 'err' something a tiny bit more descriptive like 'block_rm_err'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason I do it this was is so rather than a simple '2> err' is so that I can see the output of the command when developing the test. It is far easier than having to do a cat "trash directory XXX/err". The sharness framework will suppress both stdout and stderr from a test unless '-v' is given so I don't see extra noise as a problem. If the test does fail for some reason then it is helpful to see the output of the command without having to track down the output file name. If a test fails on a build server the output file may not even be accessible.

Nevertheless, I understand it is not the convention used in the other tests, so if you still don't like I it I will reluctantly change it after I am done developing the tests and the code. Let me know.

@kevina kevina force-pushed the kevina/more-robust-gc branch 2 times, most recently from 4a71e51 to cdccbae Compare February 24, 2017 22:13
pin/gc/gc.go Outdated
} else {
return gcs, nil
}
}

var CoundNotFetchAllLinksError = errors.New("garbage collection aborted: could not retrieve some links")
var ErrCoundNotFetchAllLinks = errors.New("garbage collection aborted: could not retrieve some links")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here (not sure if you fixed already, going through commit by commit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -37,6 +39,23 @@ test_gc_robust_part1() {
ipfs repo gc
'

test_expect_success "corrupt the root node of 1MB file" '
ls -l "$HASH1FILE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this ls here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was seeing something. Will remove.

grep -q "aborted" gc_err
'

test_expect_success "leaf nodes where not removed after gc" '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, s/where/were/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


test_expect_success "leaf nodes where not removed after gc" '
ipfs cat $LEAF3 > /dev/null &&
ipfs cat $LEAF4 > /dev/null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this test run both with and without a daemon? If yes, we should add a --timeout flag here, if no, we should make sure to run this test with and without a daemon ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whyrusleeping Um, that is what the "--offline" flag is for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Touche

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved via irc, no need for --timeout flag.


outChan := make(chan interface{})
outChan := make(chan interface{}, len(gcOutChan))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does len on a channel tell us? Is it the capacity? or the number of items currently waiting in it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, should be cap. Sorry. Will fix.

default:
res.SetError(corerepo.NewMultiError(errors...), cmds.ErrNormal)
return
err := corerepo.CollectResult(gcOutChan, func(k *cid.Cid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though you probably want to select with a context here to make sure there are no accidental hangs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whyrusleeping I am never sure when I should be using a context and when it is okay to just read from a channel. When is checking for ctx.Done() important in the pipeline?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(discussed in IRC, putting here for posterity)

Any time youre sending or receiving on a channel that is outside the control of your immediate goroutine, its a good idea to use a context. In this case, the reader is the commands lib, and the commands lib will stop reading from the channel if the user cancels the operation.

@@ -51,6 +56,7 @@ order to reclaim hard disk space.
},
Options: []cmds.Option{
cmds.BoolOption("quiet", "q", "Write minimal output.").Default(false),
cmds.BoolOption("stream-errors", "Stream errors.").Default(false),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I also like this. It is a good option.

pin/gc/gc.go Outdated
@@ -66,7 +66,7 @@ func GC(ctx context.Context, bs bstore.GCBlockstore, ls dag.LinkService, pn pin.
err := bs.DeleteBlock(k)
if err != nil {
errors = true
output <- Result{Error: &CouldNotDeleteBlockError{k, err}}
output <- Result{Error: &ErrCouldNotDeleteBlock{k, err}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you had originally was good. It seems the go convention for error types is to suffix them with the word Error.

https://golang.org/pkg/net/#DNSConfigError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Will revert that commit.

@kevina kevina force-pushed the kevina/more-robust-gc branch from cdccbae to 9cfba7d Compare February 24, 2017 23:29
@kevina kevina changed the title WIP: More Robust GC More Robust GC Feb 24, 2017
@kevina
Copy link
Contributor Author

kevina commented Feb 24, 2017

@whyrusleeping Okay, I address all your comments. I think this is ready. Will rebase once #3700 is merged.

@kevina kevina force-pushed the kevina/more-robust-gc branch from a40d15c to b95e413 Compare February 24, 2017 23:51
@whyrusleeping
Copy link
Member

@kevina great, thanks!

@kevina
Copy link
Contributor Author

kevina commented Mar 17, 2017

Thanks.

@@ -87,26 +110,29 @@ order to reclaim hard disk space.
return nil, err
}

marshal := func(v interface{}) (io.Reader, error) {
obj, ok := v.(*corerepo.KeyRemoved)
for v := range outChan {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should still use the marshal func construction like it previously did. Output for stdout can go through the same style of returned pipe, and we can print to stderr within that.

The reason for this is to avoid the double error printing that you can see in @Kubuxu's sample output

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure I am following, but I attempted to do what you want. It did not seam to help with the duplicate error reporting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whyrusleeping for example here is the output from a test case when run with -v

expecting success: 
                test_must_fail ipfs repo gc --stream-errors 2>&1 | tee repo_gc_out &&
                grep -q "Error: could not retrieve links for $LEAF1" repo_gc_out &&
                grep -q "Error: could not retrieve links for $LEAF2" repo_gc_out &&
                grep -q "Error: garbage collection aborted" repo_gc_out

Error: could not retrieve links for QmSijovevteoY63Uj1uC5b8pkpDU5Jgyk2dYBqz3sMJUPc: merkledag: not found
Error: could not retrieve links for QmTbPEyrA1JyGUHFvmtx1FNZVzdBreMv8Hc8jV9sBRWhNA: The block referred to by 'QmTbPEyrA1JyGUHFvmtx1FNZVzdBreMv8Hc8jV9sBRWhNA' was not a valid merkledag node
Error: garbage collection aborted: could not retrieve some links
01:35:09.788 ERROR commands/h: encountered errors during gc run client.go:247
Error: encountered errors during gc run
ok 48 - 'ipfs repo gc --stream-errors' should abort and report each error separately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@whyrusleeping since using the marshal func construction didn't fix the double error reporting, do you still want to use the marshal func construction, I think it makes the code more complicated to understand. But I can leave it if you want.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
if !ok {
return nil, u.ErrCast()
}

buf := new(bytes.Buffer)
if obj.Error != "" {
fmt.Fprintf(res.Stderr(), "Error: %s\n", obj.Error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use fmt.Fprintf and not use a standard ipfs logger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent was for the output to go to the client's stderr. The logger won't do thus as far as I know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this part of code runs on Client CLI (IIRC) not on the daemon so we are handling logging/errors manually here.

@@ -39,6 +40,11 @@ var RepoCmd = &cmds.Command{
},
}

type GcResult struct {
Copy link
Contributor

@hoenirvili hoenirvili Mar 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every type exported and every attribute that's also exported should have a well described comment above.


func CollectResult(ctx context.Context, gcOut <-chan gc.Result, cb func(*cid.Cid)) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add here also a doc string above the function name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// CollectResult collects the output of a garbage collection run and calls the given callback for each object removed. It also collects all errors into a MultiError which is returned after the gc is completed

return &MultiError{errs[:len(errs)-1], errs[len(errs)-1]}
}

type MultiError struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also you could tell that the type implements the Error interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioning that it implements the error interface is pretty redundant and not standard go convention

roots, err := BestEffortRoots(n.FilesRoot)
if err != nil {
return nil, err
func NewMultiError(errs ...error) *MultiError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// NewMultiError creates a new MultiError object from a given slice of errors

pin/gc/gc.go Outdated
@@ -14,6 +16,11 @@ import (

var log = logging.Logger("gc")

type Result struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Result represents an incremental output from a garbage collection run. It contains either an error, or the cid of a removed object

pin/gc/gc.go Outdated
@@ -83,35 +103,74 @@ func Descendants(ctx context.Context, getLinks dag.GetLinks, set *cid.Set, roots
return nil
}

func ColoredSet(ctx context.Context, pn pin.Pinner, ls dag.LinkService, bestEffortRoots []*cid.Cid) (*cid.Set, error) {
func ColoredSet(ctx context.Context, pn pin.Pinner, ls dag.LinkService, bestEffortRoots []*cid.Cid, output chan<- Result) (*cid.Set, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// ColoredSet computes the set of nodes in the graph that are pinned by the pins in the given pinner

pin/gc/gc.go Outdated
}
}

var ErrCouldNotFetchAllLinks = errors.New("garbage collection aborted: could not retrieve some links")
Copy link
Contributor

@hoenirvili hoenirvili Mar 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported fields should have a doc string. Also, can we have the name ErrCannotFetchLinks ?

pin/gc/gc.go Outdated

var ErrCouldNotFetchAllLinks = errors.New("garbage collection aborted: could not retrieve some links")

var ErrCouldNotDeleteSomeBlocks = errors.New("garbage collection incomplete: could not delete some blocks")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string and can we have ErrCannotDeleteBlocks ?

pin/gc/gc.go Outdated

var ErrCouldNotDeleteSomeBlocks = errors.New("garbage collection incomplete: could not delete some blocks")

type CouldNotFetchLinksError struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

pin/gc/gc.go Outdated
Err error
}

func (e *CouldNotFetchLinksError) Error() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

pin/gc/gc.go Outdated
return fmt.Sprintf("could not retrieve links for %s: %s", e.Key, e.Err)
}

type CouldNotDeleteBlockError struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc string.

pin/gc/gc.go Outdated

if errors {
return nil, ErrCouldNotFetchAllLinks
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can dropthe else stmt.

This should look like

if errors {
  return nil, ErrCouldNotFetchAllLinks
}

return gcs, nil

@kevina kevina force-pushed the kevina/more-robust-gc branch from bfaba39 to bdd0674 Compare March 19, 2017 02:21
@@ -39,6 +40,12 @@ var RepoCmd = &cmds.Command{
},
}

// GcResult is the result returned by "repo gc" command
type GcResult struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can actually remove this whole type in favor of the one in the gc package (just change the field names so the json marshaling matches up correctly)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is we can't pass the error type through the API. I get:

02:40:33.946 ERROR commands/h: json: cannot unmarshal object into Go value of type error client.go:247
Error: json: cannot unmarshal object into Go value of type error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, lets figure that out later then

@kevina kevina force-pushed the kevina/more-robust-gc branch 2 times, most recently from 39b2f74 to fbc4f90 Compare March 19, 2017 03:35
@whyrusleeping
Copy link
Member

@Kubuxu the circleCI failure here is the iptb failed to connect problem again. Didnt you have an idea on why that was happening?

@Kubuxu
Copy link
Member

Kubuxu commented Mar 20, 2017

@whyrusleeping nope, couldn't figure it out. It happens much more on Travis and Circle than our CI which means it might be some race or instability.

I tried to replicate it locally (1000 runs) didn't find even one failure.

License: MIT
Signed-off-by: Kevin Atkinson <[email protected]>
@kevina kevina force-pushed the kevina/more-robust-gc branch from fbc4f90 to d39d9ed Compare March 20, 2017 18:46
Copy link
Member

@whyrusleeping whyrusleeping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Travis flakiness aside, this LGTM.

Thanks @kevina, making GC better has been needed for a while now.

@whyrusleeping whyrusleeping merged commit 21072a5 into master Mar 22, 2017
@whyrusleeping whyrusleeping deleted the kevina/more-robust-gc branch March 22, 2017 00:16
@kevina
Copy link
Contributor Author

kevina commented Mar 28, 2017

Yes I can, I have snapper running so have snapshots of this repo before repofix. I will archive it on my PC when I get back to my studio (Sunday), and then I can expose it for you over 60Mbit/s uplink pipe if you want to download it.

@Kubuxu did you ever do this.

I am very surprised you can't just remove the broken pins with the latest IPFS version. See #3796 (comment)

@kevina kevina mentioned this pull request Aug 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/review Needs a review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants