Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in check --repair #87

Closed
OleLaursen opened this issue May 28, 2014 · 11 comments
Closed

Crash in check --repair #87

OleLaursen opened this issue May 28, 2014 · 11 comments

Comments

@OleLaursen
Copy link

Hi again, I'm beginning to suspect there's a problem with either a disk or the memory on this machine, but in any case I hit this problem on pruning

Traceback (most recent call last):
  File "/home/olau/Attic-0.12/build/scripts-3.3/attic", line 3, in <module>
    main()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 702, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 692, in run
    return args.func(args)
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 343, in do_prune
    archive.delete(stats)
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 327, in delete
    self.cache.chunk_decref(chunk_id, stats)
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/cache.py", line 198, in chunk_decref
    count, size, csize = self.chunks[id]
  File "hashindex.pyx", line 151, in attic.hashindex.ChunkIndex.__getitem__ (attic/hashindex.c:3090)
KeyError

then run check --repair which crashes with

Starting repository check...
Repository check complete, no problems found.
Starting archive consistency check...
Analyzing archive servers-2014-05-04-062434 (1/31)
...
Analyzing archive servers-2014-05-17-092237 (30/31)
Traceback (most recent call last):
  File "/home/olau/Attic-0.12/build/scripts-3.3/attic", line 3, in <module>
    main()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 702, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 692, in run
    return args.func(args)
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 83, in do_check
    if not args.repo_only and not ArchiveChecker().check(repository, repair=args.repair):
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 481, in check
    self.rebuild_refcounts()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 630, in rebuild_refcounts
    for item in robust_iterator(archive):
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 609, in robust_iterator
    for item in unpacker:
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 458, in __next__
    return next(self._unpacker)
  File "_unpacker.pyx", line 377, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:377)
  File "_unpacker.pyx", line 306, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:306)
TypeError: unhashable type: 'list'
@jborg
Copy link
Owner

jborg commented May 28, 2014

This is very weird. It looks like the archive metadata for archive servers-2014-05-17-092237 is corrupt even though the repository level checksum is valid.

Did you see any "Archive metadata damage detected" message on stderr during the repair?

Here's a completely untested patch which might give some additional debug information about the crash. It also skips all archives except the problematic one to make it faster to reproduce the crash.

https://gist.github.com/jborg/93edaa92afa6d7ca815a

So please if you can, please apply the following patch and run:

attic check --archives-only <repo_path>

@OleLaursen
Copy link
Author

Thanks for the help! No archive metadata damage detected messages. Here's the complete output with the patch, unfortunately the debug statement didn't get triggered, it looks like the check here is crashing in the outer loop?

Starting archive consistency check...
Skipping archive servers-2014-04-30-013256 (1/33)
Skipping archive servers-2014-05-02-062845 (2/33)
Skipping archive servers-2014-05-06-063617 (3/33)
Skipping archive servers-2014-05-08-062041 (4/33)
Skipping archive servers-2014-05-23-083703 (5/33)
Skipping archive servers-2014-05-13-062923 (6/33)
Skipping archive servers-2014-05-16-061734 (7/33)
Skipping archive servers-2014-05-24-010201 (8/33)
Skipping archive servers-2014-05-12-063154 (9/33)
Skipping archive servers-2014-04-27-063428 (10/33)
Skipping archive servers-2014-05-28-010300 (11/33)
Skipping archive servers-2014-05-11-062235 (12/33)
Skipping archive servers-2014-05-10-062340 (13/33)
Skipping archive servers-2014-05-15-064021 (14/33)
Skipping archive servers-2014-05-09-062454 (15/33)
Skipping archive servers-2014-05-26-010201 (16/33)
Skipping archive servers-2014-04-26-062813 (17/33)
Skipping archive servers-2014-05-01-062828 (18/33)
Skipping archive servers-2014-05-20-085713 (19/33)
Analyzing archive servers-2014-05-17-092237 (20/33)
Traceback (most recent call last):
  File "/home/olau/Attic-0.12/build/scripts-3.3/attic", line 3, in <module>
    main()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 702, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 692, in run
    return args.func(args)
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 83, in do_check
    if not args.repo_only and not ArchiveChecker().check(repository, repair=args.repair):
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 482, in check
    self.rebuild_refcounts()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 638, in rebuild_refcounts   
    for item in robust_iterator(archive):
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 608, in robust_iterator
    for i, (chunk_id, cdata) in zip(items, repository.get_many(items)):
ValueError: too many values to unpack (expected 2)

@jborg
Copy link
Owner

jborg commented May 29, 2014

for i, (chunk_id, cdata) in zip(items, repository.get_many(items)):

oops my bad, it should be "enumerate(zip(...))"

I've corrected the gist: https://gist.github.com/jborg/93edaa92afa6d7ca815a/81104a39cfbcf30486623ec98575bc4b1114efa8

@OleLaursen
Copy link
Author

Okay, it says:

Analyzing archive servers-2014-05-17-092237 (4/34)
unpack of items chunk b'3d90991d450e7d1fefa5b0f2c885389f90a31fbe1dc4cfa13ced87304410ded6' (423) failed
Traceback (most recent call last):
  File "/home/olau/Attic-0.12/build/scripts-3.3/attic", line 3, in <module>
    main()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 702, in main
    exit_code = archiver.run(sys.argv[1:])
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 692, in run
    return args.func(args)
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archiver.py", line 83, in do_check
    if not args.repo_only and not ArchiveChecker().check(repository, repair=args.repair):
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 482, in check
    self.rebuild_refcounts()
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 638, in rebuild_refcounts
    for item in robust_iterator(archive):
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 611, in robust_iterator
    for item in unpacker:
  File "/home/olau/Attic-0.12/build/lib.linux-i686-3.3/attic/archive.py", line 459, in __next__
    return next(self._unpacker)
  File "_unpacker.pyx", line 377, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:377)
  File "_unpacker.pyx", line 306, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:306)
TypeError: unhashable type: 'list'

Can I find and send this chunk to you or perhaps try looking into it myself somehow?

@jborg
Copy link
Owner

jborg commented May 30, 2014

Can I find and send this chunk to you or perhaps try looking into it myself somehow?

Yeah, that would be really helpful. I created a tiny script to dump an archives metadata to a bunch of files:

https://gist.github.com/jborg/78bbb98f3b95275131d3

Run it like this:

$ python attic_dump_archive_metadata.py /path/to/repository::servers-2014-05-17-092237

And email me files 422_XXXXX, 423_YYYY and 424_ZZZ (or all of them). But please note that these files contain file metadata such as file names but no actual file contents.

One more thing. Can you verify that this command also crashes? (I suspect it will)

$ attic list /path/to/repository::servers-2014-05-17-092237

@jborg
Copy link
Owner

jborg commented Jun 5, 2014

Hi again, I'm beginning to suspect there's a problem with either a disk or the memory on this machine, but in any case I hit this problem on pruning

I think I can confirm your theory about bad memory. After analysing the files you sent to me I found that a single bit in the 423-file has been flipped.

The byte at offset 0x3ef8 is 0x8e but should be 0xce

Since this corruption was not detected by the data checksums it happened at archive creation time, before the checksum was calculated.

Ideally --repair should be able to handle all types of corruption, but this kind of corruption that happens before the initial checksum is calculated is especially hard to deal with, and sometimes impossible.

So unfortunately I'm closing this ticket now since a system with broken ram can produce all kind of corrupted data that is in most cases impossible to properly detect and fix in a good way.

@jborg jborg closed this as completed Jun 5, 2014
@OleLaursen
Copy link
Author

Oh, well, if --repair could just delete the single backup run where the problem is, that would be nice, but I suppose I'll start a new repository then. I did pull out a couple of the RAM sticks the other day after running memtest86+ for 24 hours and getting 1 error. Thanks for looking into it.

@jborg
Copy link
Owner

jborg commented Jun 6, 2014

Since --repair is already supposed to handle cases where data is either missing or detectable corrupt the easiest way to get your repository back into a working state is simply to delete the corrupt object.

You can delete object 3d90991d450e7d1fefa5b0f2c885389f90a31fbe1dc4cfa13ced87304410ded6 like this:

$ python attic_delete_objects.py /path/to/repo.attic 3d90991d450e7d1fefa5b0f2c885389f90a31fbe1dc4cfa13ced87304410ded6

https://gist.github.com/jborg/d5bd7ceb419becdbc8f0

After that attic check --repair should hopefully be able to recover your repository, except for a couple of files from the affected archive.

Let me know if it works.

/ Jonas

@OleLaursen
Copy link
Author

Sorry for the slow reply, a couple of things happened (including a new motherboard). I did archive the old repository and started a new one, but tried the delete script anyway now. Unfortunately, it dies with

python3 attic_delete_objects.py attic-old 3d90991d450e7d1fefa5b0f2c885389f90a31fbe1dc4cfa13ced87304410ded6Deleting 3d90991d450e7d1fefa5b0f2c885389f90a31fbe1dc4cfa13ced87304410ded6
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/attic/repository.py", line 368, in delete
    segment, offset = self.index.pop(id)
  File "hashindex.pyx", line 81, in attic.hashindex.IndexBase.pop (attic/hashindex.c:2127)
  File "hashindex.pyx", line 75, in attic.hashindex.IndexBase.pop (attic/hashindex.c:2036)
  File "hashindex.pyx", line 95, in attic.hashindex.NSIndex.__getitem__ (attic/hashindex.c:2315)
KeyError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "attic_delete_objects.py", line 16, in <module>
    repository.delete(unhexlify(key))
  File "/usr/local/lib/python3.4/dist-packages/attic/repository.py", line 370, in delete
    raise self.DoesNotExist(self.path)
attic.repository.DoesNotExist: attic-old

So unless you're still interested in trying this out, I think I'll delete the old archive in a couple of weeks when the backups in there are too old to be interesting.

@jborg
Copy link
Owner

jborg commented Jun 18, 2014

Was that the first time you ran that script?

The "attic.repository.DoesNotExist" exception simply indicates that the object the script tries to delete did not exist. So the object has already been deleted by a precious invocation or the object id is incorrect.

@OleLaursen
Copy link
Author

Aha. Okay, maybe I did run it twice - I had trouble getting the include paths to work correctly (long story involving Python 3.3 and 3.4) so it could have happened. I tried check --repair and it seems to have worked:

...
Analyzing archive servers-2014-05-17-092237 (12/41)
Archive metadata damage detected

And I can now list the contents of the previously totally broken snapshot. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants