Improve performance for multi-threaded access to encrypted zip files #97
Comments
@Vadiml1024 Could it be that you are running into this issue (https://docs.python.org/3/library/zipfile.html) instead:
"Decryption is extremely slow as it is implemented in native Python rather than C."
With #98 also reporting performance issues, I get the feeling that a better zip module must be available :/. Maybe libarchive? But we tried that, and concurrency support in the libarchive Python bindings was a work in progress. czipfile (https://pypi.org/project/czipfile/) exists, but it seems to be Python 2 only and dead. So, I guess, another self-written backend.
I'll look into czipfile. I suppose porting it to Python 3 will not be too difficult.
Somebody already ported czipfile to Python 3.
Ah nice. I didn't see it on PyPI. Cython, as opposed to Python, is also said to be faster: https://stackoverflow.com/a/72513075/2191065 And there is this: https://github.com/TkTech/fasterzip But it seems like it might be missing some features, such as setting a password.
I haven't tested it out yet, but perhaps something like https://github.com/kamilmahmood/fastzipfile could work? https://github.com/TkTech/fasterzip has been archived, and then there's this report, though I haven't benchmarked things myself.
Thanks for mentioning it here. Some benchmarks should be quickly doable and would be interesting, but the development state, as visible from the last commit and the open issues, doesn't bode well. It might also get archived soon... Software obsolescence is sad. Same thing for fusepy.
Benchmark decryption of one large file

Installation and test file creation:

(
git clone https://github.com/ziyuang/czipfile.git
cd czipfile
sed -i "s|'README'|'README.md'|" setup.py
python3 setup.py build
python3 setup.py install --user
)
(
git clone https://github.com/TkTech/fasterzip.git
cd fasterzip
python3 setup.py build
python3 setup.py install --user
)
(
git clone https://github.com/kamilmahmood/fastzipfile.git
cd fastzipfile
sed -i -r "s|(python_requires='>=3.5), <3.9|\1|" setup.py
python3 setup.py build
python3 setup.py install --user
)
for size in 4 64; do
head -c $(( size * 1024 * 1024 )) /dev/urandom > random-${size}MiB.dat
zip encrypted-${size}MiB.zip --encrypt --password password random-${size}MiB.dat
7z a 7z-encrypted-${size}MiB.zip -tzip -mem=AES256 -ppassword random-${size}MiB.dat
done

benchmarkDecryption.py:

import sys
import timeit
import numpy as np

path = sys.argv[1]
fileName = sys.argv[2]
repeat = 50

import zipfile

def readWithPythonZipFile():
    with zipfile.ZipFile(path) as archive:
        archive.setpassword(b"password")
        with archive.open(fileName) as file:
            file.read()

# Fewer repetitions here because the pure-Python decryption is slow.
times = timeit.repeat(readWithPythonZipFile, number=1, repeat=5)
print(np.mean(times), "+-", np.std(times, ddof=1))

import czipfile

def readWithCZipFile():
    with czipfile.PyZipFile(path) as archive:
        archive.setpassword(b"password")
        with archive.open(fileName) as file:
            file.read()

times = timeit.repeat(readWithCZipFile, number=1, repeat=repeat)
print(np.mean(times), "+-", np.std(times, ddof=1))

# Does not seem to support encryption. No way to set passwords.
# API is different from zipfile, it will wholly extract whole entries and return them.
# This has memory usage implications!
# import fasterzip

import fastzipfile  # monkey-patches Python zipfile on import!

times = timeit.repeat(readWithPythonZipFile, number=1, repeat=repeat)
print(np.mean(times), "+-", np.std(times, ddof=1))

Call with:

for size in 4 64; do
    for prefix in '7z-' ''; do
        echo "== ${prefix}encrypted-${size}MiB.zip =="
        python3 benchmarkDecryption.py ${prefix}encrypted-${size}MiB.zip random-${size}MiB.dat
    done
done
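
The test archives come in two flavors: those written by `zip --encrypt` and those written by `7z ... -mem=AES256`. A small sketch for checking which encryption method each entry in an archive uses, relying only on the `flag_bits` and `compress_type` attributes of `zipfile.ZipInfo`. The function names are made up for this example, and the sketch assumes that any encrypted entry without the AES marker uses traditional encryption, which ignores other "strong encryption" variants.

```python
import zipfile

ENCRYPTED_FLAG = 0x1          # bit 0 of the general purpose bit flag
AES_COMPRESSION_METHOD = 99   # marker value used for WinZip AES entries


def classify_encryption(flag_bits, compress_type):
    """Return "none", "AES" or "ZipCrypto" for one archive entry."""
    if not flag_bits & ENCRYPTED_FLAG:
        return "none"
    if compress_type == AES_COMPRESSION_METHOD:
        return "AES"
    # Assumption: everything else that is encrypted is traditional encryption.
    return "ZipCrypto"


def list_encryption(path_or_file):
    """Map each entry name to its encryption method without decrypting anything."""
    with zipfile.ZipFile(path_or_file) as archive:
        return {
            info.filename: classify_encryption(info.flag_bits, info.compress_type)
            for info in archive.infolist()
        }
```

Calling `list_encryption("7z-encrypted-4MiB.zip")` only reads the central directory, so it should work even for entries that `zipfile` cannot open.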

The performance improvements of czipfile and fastzipfile are nice!

AES encryption is not supported by Python zipfile, and the corresponding issue was closed as "won't fix due to legal concerns", probably because of cryptography export restrictions. It is therefore also not supported by the fork czipfile or by the hot-patching fastzipfile. Python zipfile and fastzipfile raise an exception for AES-encrypted archives.

The proposed libraries only affect standard ZIP encryption (ZipCrypto), which is said to be broken. Other encryption methods such as AES are not sped up, and creating AES-encrypted ZIPs is not supported by the standard zip tool. Judging by its website, it doesn't seem likely that AES support will be added soon. However, p7zip supports it. I didn't find any mention of this in the manual, only in this answer.

Note that I was not able to install any of the three packages from PyPI, had to patch two of them, and fasterzip does not even support encryption. All of them are hardly usable as they are now. To speed up decryption, fastzipfile looks the best. It does one simple small thing and does it well.

Benchmark reading of many small 10 KiB unencrypted files

mkdir -p 10k-10KiB-files
for i in $( seq 10000 ); do
base64 /dev/urandom | head -c $(( 10 * 1024 )) > 10k-10KiB-files/$i
done
zip -r 10k-10KiB-files.zip 10k-10KiB-files

import numpy as np
import timeit

path = "10k-10KiB-files.zip"
repeat = 10

import zipfile

def readWithPythonZipFile():
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                with archive.open(info) as file:
                    file.read()

times = timeit.repeat(readWithPythonZipFile, number=1, repeat=5)
print("zipfile:", np.mean(times), "+-", np.std(times, ddof=1))

import czipfile

def readWithCZipFile():
    with czipfile.PyZipFile(path) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                with archive.open(info) as file:
                    file.read()

times = timeit.repeat(readWithCZipFile, number=1, repeat=repeat)
print("czipfile:", np.mean(times), "+-", np.std(times, ddof=1))

import fasterzip

def readWithFasterZip():
    archive = fasterzip.ZipFile(path.encode())
    for info in archive.infolist():
        if not info["m_filename"].endswith(b"/"):
            with archive.read(info["m_filename"]) as file:
                len(file)

times = timeit.repeat(readWithFasterZip, number=1, repeat=repeat)
print("fasterzip:", np.mean(times), "+-", np.std(times, ddof=1))

import fastzipfile  # monkey-patches Python zipfile on import!

times = timeit.repeat(readWithPythonZipFile, number=1, repeat=repeat)
print("fastzipfile:", np.mean(times), "+-", np.std(times, ddof=1))

There is basically no difference in this benchmark.
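
Both benchmarks read sequentially from a single thread. For the multi-threaded access named in the issue title, one possible arrangement is to give every worker thread its own `zipfile.ZipFile` handle so that threads do not share a file position. This is only a sketch under that assumption, not necessarily what this project does. `PerThreadZipReader`, `read_all_in_parallel` and `demo` are hypothetical names, and because decryption in the standard library runs as Python code, the GIL will likely still serialize most of the work.

```python
import os
import tempfile
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor


class PerThreadZipReader:
    """Opens one zipfile.ZipFile per worker thread (hypothetical helper)."""

    def __init__(self, path, password=None):
        self._path = path
        self._password = password
        self._local = threading.local()

    def _archive(self):
        # Lazily open the archive the first time a thread needs it.
        archive = getattr(self._local, "archive", None)
        if archive is None:
            archive = zipfile.ZipFile(self._path)
            if self._password is not None:
                archive.setpassword(self._password)
            self._local.archive = archive
        return archive

    def read(self, name):
        with self._archive().open(name) as file:
            return file.read()


def read_all_in_parallel(path, names, password=None, threads=4):
    """Read the given entries with a thread pool, keeping the input order."""
    reader = PerThreadZipReader(path, password)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(reader.read, names))


def demo():
    """Create a small unencrypted archive and read it back in parallel."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.zip")
        names = [f"file-{i}" for i in range(16)]
        with zipfile.ZipFile(path, "w") as archive:
            for i, name in enumerate(names):
                archive.writestr(name, bytes([i]) * 1024)
        return read_all_in_parallel(path, names)
```

The sketch leaves closing the per-thread handles to garbage collection. A longer-lived reader would need to track and close them explicitly.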
#96 (comment)
This might need yet another backend like indexed_bzip2 that works with zip files. So... a lot of work.