7z-kaboom

7zip extract bug explained

Ever wondered why 7zip is sometimes horribly slow to extract archives? It has a problem extracting into large directories that contain lots of files or subdirectories.

Why is this?

The 7z process reads the full listing of the target directory before extracting each single file from the archive. As the number of files in the directory grows, extraction slows down gradually but quickly: every newly extracted file makes the next directory scan a bit longer.

In this git repo, there is an example file 7z-kaboom.7z that you may try to extract with 7zip if you want. It contains 100 000 files, each holding just 1 KB of null bytes. It is a small 7z file (under 2 MB) that takes a veeery long time to extract.

This has been made just to demonstrate the extraction behaviour of 7z.

The reason for this behaviour is easy to reveal by running strace on the 7z extract process, which shows output similar to this:

mkdir("hugedir", 0700)                  = -1 EEXIST (File exists)
open("hugedir", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
fstat(4, {st_mode=S_IFDIR|0700, st_size=295980, ...}) = 0
getdents(4, /* 1027 entries */, 32768)  = 32760
getdents(4, /* 1026 entries */, 32768)  = 32752
getdents(4, /* 1026 entries */, 32768)  = 32752
getdents(4, /* 1026 entries */, 32768)  = 32744
getdents(4, /* 1026 entries */, 32768)  = 32744
getdents(4, /* 1026 entries */, 32768)  = 32752
getdents(4, /* 1026 entries */, 32768)  = 32752
getdents(4, /* 1026 entries */, 32768)  = 32752
getdents(4, /* 1026 entries */, 32768)  = 32760
...
...LOTS of similar lines deleted.
...
close(4)                                = 0
umask(0)                                = 02
umask(02)                               = 0
readlink("hugedir/f23314", 0x11df214, 1024) = -1 ENOENT (No such file or directory)
open("hugedir/f23314", O_WRONLY|O_CREAT, 0664) = 4
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
close(4)                                = 0
stat("hugedir/f23314", {st_mode=S_IFREG|0664, st_size=1024, ...}) = 0
utime("hugedir/f23314", [2017/04/26-14:58:49, 2017/04/26-14:54:36]) = 0
lstat("hugedir/f23314", {st_mode=S_IFREG|0664, st_size=1024, ...}) = 0
chmod("hugedir/f23314", 0664)           = 0

The first 10k to 20k files are extracted at a relatively normal speed, but once some 50k files have been extracted, the process slows to a crawl.
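To put rough numbers on that slowdown, here is a minimal sketch of the cost model. It assumes the whole directory listing is re-read before every file is created, as the strace output suggests, and it ignores the "." and ".." entries. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Total directory entries read while extracting `n` files into one
// initially empty directory, ASSUMING the full listing is re-read before
// each file is created (file number k is preceded by a scan of k-1 entries).
std::uint64_t entries_scanned(std::uint64_t n) {
    if (n == 0) return 0;
    return n * (n - 1) / 2;  // 0 + 1 + ... + (n-1)
}

// Entries read only while going from `from` to `to` extracted files.
std::uint64_t entries_scanned_between(std::uint64_t from, std::uint64_t to) {
    assert(from <= to);
    return entries_scanned(to) - entries_scanned(from);
}
```

Under this model, entries_scanned(100000) is about 5 billion entries in total. The first 10 000 files account for roughly 1% of that, while the last 50 000 account for about 75%, which fits the observation that the early part feels normal and the later part crawls.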

You may create your own hard-to-extract 7z archive like this, for example:

mkdir hugedir
cd hugedir
for i in `seq 100000`; do dd status=none if=/dev/zero of=f$i bs=1024 count=1; done
cd ..
7z a 7z-kaboom.7z hugedir

The root cause

The culprit is this 7z internal function:

bool CFindFile::FindFirst(CFSTR cfWildcard, CFileInfo &fi, bool ignoreLink)

This function is apparently called from many places in the p7zip code. It takes a "wildcard" as an argument, but it is often called with just a plain filename that has no wildcard characters in it. Either way, the function always scans the directory until it finds a filename matching cfWildcard. When the file does not exist yet, as is the case right before it is extracted, that means reading the whole directory.

When an archive contains a very large flat directory with tens of thousands of files or more, these repeated full directory scans become very costly, and extracting archives with huge subdirs holding millions of files becomes practically impossible. Using an SSD does not help. Using a ramdisk does not help either. 7zip is just burning CPU cycles in a very useless way. This is not an I/O bottleneck problem at all!
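For illustration, the slow pattern looks roughly like the sketch below: a lookup by name implemented as a walk over the directory listing. This is not p7zip's code. The function name, signature and the entry counter are hypothetical, and real wildcard matching is replaced by a plain string comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <dirent.h>

// Illustrative only (NOT p7zip's implementation): look for `name` in `dir`
// by walking the directory listing, stopping at the first match.
// Stores the number of entries read in *entries_read.
bool find_by_scanning(const char *dir, const char *name,
                      std::size_t *entries_read) {
    *entries_read = 0;
    DIR *d = opendir(dir);
    if (d == nullptr) return false;  // directory missing or unreadable

    bool found = false;
    while (dirent *e = readdir(d)) {
        ++*entries_read;
        if (std::strcmp(e->d_name, name) == 0) {
            found = true;
            break;
        }
    }
    closedir(d);
    return found;
}
```

Calling this for a name that is not in the directory reads every entry before giving up, which is the same shape as the long runs of getdents calls in the strace output above.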

The fix

A patch for p7zip 16.02 source code is included in this git repo.

Basically, the function CFindFile::FindFirst() now checks whether the given wildcard parameter actually contains any wildcard characters, which are ? and * as usual.

If we have just a plain filename with no wildcard in it, the potentially costly full directory scan is skipped, and a single normal stat() call checks whether the given file exists.
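The idea can be sketched as follows. This is a simplified stand-in and not the patch from this repo: the names has_wildcard, plain_name_exists, find_first_fast and the Lookup enum are hypothetical, and the ignoreLink handling and CFileInfo filling of the real function are left out.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

// True if `pattern` contains a wildcard character ('?' or '*').
bool has_wildcard(const std::string &pattern) {
    return pattern.find_first_of("?*") != std::string::npos;
}

// Ask the kernel about this one path instead of listing its directory.
bool plain_name_exists(const std::string &path) {
    struct stat st;
    return stat(path.c_str(), &st) == 0;
}

enum class Lookup { Found, NotFound, NeedsScan };

// Hypothetical fast path: only real wildcards fall back to a directory scan.
Lookup find_first_fast(const std::string &path) {
    if (has_wildcard(path)) return Lookup::NeedsScan;
    return plain_name_exists(path) ? Lookup::Found : Lookup::NotFound;
}
```

With this shape, a lookup such as find_first_fast("hugedir/f23314") costs one stat() call no matter how many files hugedir holds, and only a pattern like "hugedir/*" would still need the scan.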

Results

Extracting the sample archive with the unfixed 7za takes over 22 minutes on a fast machine, about 19 of them spent in system time.

% time ~/p7zip_16.02/bin/7za x 7z-kaboom.7z 

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,56 CPUs x64)

Scanning the drive for archives:
1 file, 1927937 bytes (1883 KiB)

Extracting archive: 7z-kaboom.7z
--
Path = 7z-kaboom.7z
Type = 7z
Physical Size = 1927937
Headers Size = 1913422
Method = LZMA:24
Solid = +
Blocks = 1

Everything is Ok                                     

Folders: 1
Files: 100000
Size:       102400000
Compressed: 1927937

real	22m36.940s
user	3m35.887s
sys	19m0.180s

Extracting the sample archive with the fixed 7za takes a few seconds.

% time ~/p7zip_16.02/bin/7za x 7z-kaboom.7z 

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,56 CPUs x64)

Scanning the drive for archives:
1 file, 1927937 bytes (1883 KiB)

Extracting archive: 7z-kaboom.7z
--
Path = 7z-kaboom.7z
Type = 7z
Physical Size = 1927937
Headers Size = 1913422
Method = LZMA:24
Solid = +
Blocks = 1

Everything is Ok                                     

Folders: 1
Files: 100000
Size:       102400000
Compressed: 1927937

real	0m3.683s
user	0m1.356s
sys	0m2.325s

Now we are talking!

Version data

Last updated: 2017-04-26

7zip versions tested:

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,4 CPUs)

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,56 CPUs Intel(R) Xeon(R) CPU E5-2658 v4 @ 2.30GHz (406F1),ASM,AES-NI)

BR, -sjm
