Improve performance for multi-threaded access to encrypted zip files #97

mxmlnkn · 2022-11-13T16:15:14Z

BTW, one thing I've discovered while trying to integrate libarchive is that the Python zipfile module has inefficiencies similar to those of the tarfile module:
if two threads try to access the same member, each of them will decompress (and decrypt, if password-protected) the member from the beginning.
The situation is better than with .tar.gz, which needs to be decompressed from the start of the whole archive, but it is still problematic.
There is a need to develop something like the SQLindexedTar class to checkpoint decompression and decryption states.

This might need yet another backend like indexed_bzip2 that works with zip files. So... a lot of work.
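
A minimal sketch of the behaviour described above, using only the standard library. The archive is built in memory and the member name `member.bin` is made up for illustration. Each `archive.open` call gets its own decompressor, so both threads inflate the member from its first byte even though each only wants a few bytes near the end.

```python
import io
import threading
import zipfile

# 1 MiB of compressible data, stored as a single deflated member.
payload = bytes(range(256)) * 4096
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("member.bin", payload)

archive = zipfile.ZipFile(buffer)
results = {}


def read_at(key, offset, size):
    # Every open() starts a fresh decompressor at the beginning of the member.
    with archive.open("member.bin") as member:
        # A forward seek on a deflated member reads and discards data,
        # so its cost grows with the offset.
        member.seek(offset)
        results[key] = member.read(size)


offsets = [len(payload) - 32, len(payload) - 16]
threads = [
    threading.Thread(target=read_at, args=(index, offset, 16))
    for index, offset in enumerate(offsets)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
archive.close()
```

With a checkpoint index, the second reader could instead resume from a saved decompressor state close to its offset.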


mxmlnkn · 2022-11-22T20:14:51Z

@Vadiml1024 Could it be that you are running into this issue instead:

Decryption is extremely slow as it is implemented in native Python rather than C.

With #98 also observing performance issues I get the feeling that a better zip module must be available :/. Maybe libarchive? But, we tried. And, concurrency support in the libarchive Python-bindings was a work in progress. czipfile exists but it seems to be Python 2 and dead.

So, I guess another self-written backend.
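
To illustrate why a pure-Python implementation is slow, here is a rough sketch of the traditional PKWARE ZIP stream cipher, written from the published algorithm and not taken from any of the libraries mentioned here. The class name `ZipCryptoSketch` is hypothetical, and the 12-byte encryption header that precedes real member data is left out. Every byte costs one interpreter-level loop iteration with several integer operations. The sketch also shows that the complete cipher state is just three 32-bit keys, which is all a decryption checkpoint would need to store.

```python
def _make_crc_table():
    # Standard CRC-32 lookup table (reflected polynomial 0xEDB88320).
    table = []
    for value in range(256):
        for _ in range(8):
            value = (value >> 1) ^ 0xEDB88320 if value & 1 else value >> 1
        table.append(value)
    return table


CRC_TABLE = _make_crc_table()


class ZipCryptoSketch:
    """Traditional PKWARE stream cipher; the whole state is three 32-bit keys."""

    def __init__(self, password: bytes):
        self.keys = [0x12345678, 0x23456789, 0x34567890]
        for byte in password:
            self._update(byte)

    def _update(self, plain_byte: int):
        key0, key1, key2 = self.keys
        key0 = (key0 >> 8) ^ CRC_TABLE[(key0 ^ plain_byte) & 0xFF]
        key1 = (key1 + (key0 & 0xFF)) & 0xFFFFFFFF
        key1 = (key1 * 134775813 + 1) & 0xFFFFFFFF
        key2 = (key2 >> 8) ^ CRC_TABLE[(key2 ^ (key1 >> 24)) & 0xFF]
        self.keys = [key0, key1, key2]

    def _keystream_byte(self) -> int:
        temp = self.keys[2] | 2
        return ((temp * (temp ^ 1)) >> 8) & 0xFF

    def decrypt(self, data: bytes) -> bytes:
        result = bytearray()
        for byte in data:  # one Python-level iteration per byte
            plain = byte ^ self._keystream_byte()
            self._update(plain)  # keys are always updated with the plaintext
            result.append(plain)
        return bytes(result)

    def encrypt(self, data: bytes) -> bytes:
        result = bytearray()
        for byte in data:
            result.append(byte ^ self._keystream_byte())
            self._update(byte)
        return bytes(result)
```

Copying `keys` after decrypting a prefix and restoring them into a new instance lets decryption resume at that offset without reprocessing the prefix.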

Vadiml1024 · 2022-11-22T21:34:34Z

I'll look into czipfile. I suppose porting it to Python 3 will not be too difficult.


Vadiml1024 · 2022-11-22T21:50:10Z

Somebody already ported czipfile to python3
https://github.com/ziyuang/czipfile

mxmlnkn · 2022-11-22T22:16:10Z

Ah nice. I didn't see it on PyPI.

Cython as opposed to Python is also said to be faster: https://stackoverflow.com/a/72513075/2191065

And there is this: https://github.com/TkTech/fasterzip But it seems like it might be missing some features like setting a password among others.

hendursaga · 2024-08-28T17:04:20Z

I haven't tested it out yet, but perhaps something like https://github.com/kamilmahmood/fastzipfile could work? https://github.com/TkTech/fasterzip has been archived, and then there's this report, though I haven't benchmarked things myself.

mxmlnkn · 2024-08-28T19:27:03Z

Thanks for mentioning it here. Some benchmarks should be quickly doable and would be interesting, but the development state, as visible from the last commit and the open issues, doesn't bode well. It might also get archived soon... Software obsolescence is sad. Same thing for fusepy.

mxmlnkn · 2024-08-29T09:35:36Z

Benchmark decryption of one large file

Installation and test file creation:

(
    git clone https://github.com/ziyuang/czipfile.git
    cd czipfile
    sed -i "s|'README'|'README.md'|" setup.py
    python3 setup.py build
    python3 setup.py install --user
)

(
    git clone https://github.com/TkTech/fasterzip.git
    cd fasterzip
    python3 setup.py build
    python3 setup.py install --user
)

(
    git clone https://github.com/kamilmahmood/fastzipfile.git
    cd fastzipfile
    sed -i -r "s|(python_requires='>=3.5), <3.9|\1|" setup.py
    python3 setup.py build
    python3 setup.py install --user
)

for size in 4 64; do
    head -c $(( size * 1024 * 1024 )) /dev/urandom > random-${size}MiB.dat
    zip encrypted-${size}MiB.zip --encrypt --password password random-${size}MiB.dat
    7z a 7z-encrypted-${size}MiB.zip -tzip -mem=AES256 -ppassword random-${size}MiB.dat
done

benchmarkDecryption.py

import sys
import timeit
import numpy as np

path = sys.argv[1]
fileName = sys.argv[2]
repeat = 50

import zipfile
def readWithPythonZipFile():
    with zipfile.ZipFile(path) as archive:
        archive.setpassword(b"password")
        with archive.open(fileName) as file:
            file.read()

times = timeit.repeat(readWithPythonZipFile, number=1, repeat=5)
print(np.mean(times), "+-", np.std(times, ddof=1))

import czipfile
def readWithCZipFile():
    with czipfile.PyZipFile(path) as archive:
        archive.setpassword(b"password")
        with archive.open(fileName) as file:
            file.read()

times = timeit.repeat(readWithCZipFile, number=1, repeat=repeat)
print(np.mean(times), "+-", np.std(times, ddof=1))

# Does not seem to support encryption. No way to set passwords.
# API is different from zipfile, it will wholly extract whole entries and return them.
# This has memory usage implications!
# import fasterzip

import fastzipfile  # monkey-patches Python zipfile on import!
times = timeit.repeat(readWithPythonZipFile, number=1, repeat=repeat)
print(np.mean(times), "+-", np.std(times, ddof=1))

Call with:

for size in 4 64; do
for prefix in '7z-' ''; do
    echo "==  ${prefix}encrypted-${size}MiB.zip =="
    python3 benchmarkDecryption.py ${prefix}encrypted-${size}MiB.zip random-${size}MiB.dat
done
done

Library	4 MiB file	64 MiB file
zipfile	3.3 s	53.72 s +- 0.28 s
czipfile	0.0439 s +- 0.0023 s	0.693 s +- 0.006 s
fastzipfile	0.0463 s +- 0.0010 s	0.758 s +- 0.005 s
fasterzip	-	-

The performance improvements of czipfile and fastzipfile are nice! Both are roughly 70 to 80 times faster than Python zipfile at decrypting these files.
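
The speedups can be read off the table with a few lines of arithmetic. This sketch only reuses the mean timings listed above; the dictionary layout is arbitrary, and the throughput figure simply divides the 64 MiB file size by the measured time.

```python
# Mean wall-clock times in seconds, copied from the table above.
timings = {
    "zipfile": {"4 MiB": 3.3, "64 MiB": 53.72},
    "czipfile": {"4 MiB": 0.0439, "64 MiB": 0.693},
    "fastzipfile": {"4 MiB": 0.0463, "64 MiB": 0.758},
}

# Speedup of each alternative relative to the standard zipfile module.
speedups = {
    library: {size: timings["zipfile"][size] / seconds for size, seconds in row.items()}
    for library, row in timings.items()
    if library != "zipfile"
}

# Decryption throughput for the 64 MiB file in MiB/s.
throughput = {library: 64 / row["64 MiB"] for library, row in timings.items()}

for library, row in speedups.items():
    print(library, {size: round(factor, 1) for size, factor in row.items()})
for library, rate in throughput.items():
    print(f"{library}: {rate:.1f} MiB/s")
```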

AES encryption is not supported by Python zipfile; the corresponding issue was closed as "won't fix due to legal concerns", probably because of cryptography export restrictions. It is therefore also not supported by the fork czipfile or by the hot-patching fastzipfile. Python zipfile and fastzipfile raise the exception NotImplementedError: That compression method is not supported, while czipfile raises RuntimeError: Bad password for file. Fasterzip does not support encryption at all.
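
Instead of waiting for one of these exceptions, a caller could inspect the entry metadata first. The following is a sketch under two assumptions: that the archive uses the WinZip AES convention, which marks entries with compression method 99, and that any other entry with the encryption bit set uses ZipCrypto (this ignores other schemes such as PKWARE strong encryption). The helper names `encryption_kind` and `summarize` are made up for this example.

```python
import io
import zipfile

AES_COMPRESSION_METHOD = 99  # WinZip AES marks entries with method 99
ENCRYPTED_FLAG = 0x1  # bit 0 of the general purpose bit flag


def encryption_kind(info: zipfile.ZipInfo) -> str:
    """Classify one entry as 'none', 'aes', or 'zipcrypto' (assumed)."""
    if not info.flag_bits & ENCRYPTED_FLAG:
        return "none"
    if info.compress_type == AES_COMPRESSION_METHOD:
        return "aes"
    return "zipcrypto"


def summarize(path_or_file) -> dict:
    """Map each member name to its encryption kind without reading any data."""
    with zipfile.ZipFile(path_or_file) as archive:
        return {info.filename: encryption_kind(info) for info in archive.infolist()}
```

Listing the central directory does not decompress or decrypt anything, so a check like this is cheap even for large archives.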

The three proposed libraries only affect standard ZIP encryption (ZipCrypto), which is said to be broken. Other encryptions such as AES are not improved upon, and creating AES-encrypted ZIPs is not supported by the standard zip tool. The website reads:

Latest Release [...] Zip 3.0, released 7 July 2008:
The next major release of Zip will be version 3.1, with AES encryption

so it doesn't seem likely that AES support will be added soon.

However, p7zip supports it. I didn't find any mention of it in the manual, only in this answer.

Note that I was not able to install any of the three packages from PyPI, had to patch two of them and fasterzip does not even support encryption. All of them are hardly usable as they are now.

To speed up decryption, fastzipfile looks the best. It does one simple small thing and does it well.
The code is short, 200 lines of C, and works almost as fast as czipfile, which consists of 2000 lines of Cython. That size would be fine if czipfile were actively maintained, but it isn't, and it seems to basically be a fork of the Python 2.6.5 zipfile.
A lot of bugfixes and improvements have presumably happened to the upstream zipfile since then.
I guess, one could take a look at the diff to Python 2.6.5 zipfile and reapply it to a newer version.
I assume that the result wouldn't look that different from fastzipfile.

Benchmark reading of many small 10 KiB unencrypted files

mkdir -p 10k-10KiB-files
for i in $( seq 10000 ); do
    base64 /dev/urandom | head -c $(( 10 * 1024 )) > 10k-10KiB-files/$i
done
zip -r 10k-10KiB-files.zip 10k-10KiB-files

import numpy as np
import timeit

path = "10k-10KiB-files.zip"
repeat = 10

import zipfile
def readWithPythonZipFile():
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                with archive.open(info) as file:
                    file.read()

times = timeit.repeat(readWithPythonZipFile, number=1, repeat=5)
print("zipfile:", np.mean(times), "+-", np.std(times, ddof=1))

import czipfile
def readWithCZipFile():
    with czipfile.PyZipFile(path) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                with archive.open(info) as file:
                    file.read()

times = timeit.repeat(readWithCZipFile, number=1, repeat=repeat)
print("czipfile:", np.mean(times), "+-", np.std(times, ddof=1))

import fasterzip
def readWithFasterZip():
    archive = fasterzip.ZipFile(path.encode())
    for info in archive.infolist():
        if not info["m_filename"].endswith(b"/"):
            with archive.read(info["m_filename"]) as file:
                len(file)

times = timeit.repeat(readWithFasterZip, number=1, repeat=repeat)
print("fasterzip:", np.mean(times), "+-", np.std(times, ddof=1))

import fastzipfile  # monkey-patches Python zipfile on import!
times = timeit.repeat(readWithPythonZipFile, number=1, repeat=repeat)
print("fastzipfile:", np.mean(times), "+-", np.std(times, ddof=1))

Library	Time in s for an archive with 10k files of 10 KiB each
zipfile	0.635 +- 0.007
czipfile	0.667 +- 0.012
fasterzip	0.619 +- 0.005
fastzipfile	0.652 +- 0.015

There is basically no difference in this benchmark.

mxmlnkn added the enhancement, performance, and help wanted labels and removed the help wanted label on Nov 13, 2022

mxmlnkn mentioned this issue Feb 22, 2023

Add parallel support for large compressed zip members #105

Open

mxmlnkn closed this as completed in a42bf15 Oct 4, 2024

mxmlnkn mentioned this issue Oct 6, 2024

Support more formats #109

Open

5 tasks
