
System.IO.Compression: ZipArchive loads entire file in memory on .Dispose #1543

Open
Tracked by #62658
qmfrederik opened this issue Sep 13, 2016 · 6 comments
Labels
area-System.IO.Compression enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@qmfrederik
Contributor

When you open a ZipArchive in Update mode, the entire zip file is loaded into memory when the .Dispose method is invoked.

This is because .Dispose calls .WriteFile, which:

  • Calls LoadLocalHeaderExtraFieldAndCompressedBytesIfNeeded for all entries, which loads the compressed data into memory for those entries
  • Sets the size of the .zip archive to 0, by calling _archiveStream.SetLength(0);
  • Writes out all entries one by one.

As a result:

  • A lot of memory is used, because the compressed data for each entry is loaded into memory
  • A lot of unnecessary disk I/O is performed, because all entries are written out again, even if they were not modified.

An alternative would be to update the zip archive incrementally, and only rewrite the entries that have changed.

@jakubsuchybio

This is a real pain. We use the full .NET Framework and have some really large zip files in which we need to update small files, but because of this loading into memory we get an OutOfMemoryException: we have a 32-bit application and cannot switch to 64-bit, because our dependent driver DLLs are 32-bit only.

@carlossanlop
Member

Triage:
We should scope this to a fix that only loads the entries that have changed.

@carlossanlop carlossanlop transferred this issue from dotnet/corefx Jan 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors and removed untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@carlossanlop carlossanlop added this to the Future milestone Jan 9, 2020
@Jlalond
Contributor

Jlalond commented Jan 17, 2020

@carlossanlop Hmm, it looks like it's only writing files on creates/updates.

But it looks like it's still iterating over them all, and I don't see any property on ZipArchiveEntry to denote whether it's been modified. I'm going to keep doing research once I get home, but I think this would be fun to pick up.

@IDisposable
Contributor

Anyone working on this? I could pick it up if not.

@ulrichb

ulrichb commented Nov 7, 2023

Just found this ticket after creating a duplicate (#94455), and wanted to cross-post our use case, where the current behavior is a big issue:

a) We're dealing with potentially large user-provided ZIP files (in the GB range),
b) we need to update entries (actually not directly but via System.IO.Packaging.ZipPackage for OPC file processing), and
c) the whole process can happen in parallel.

This means that with the current ZipArchive implementation we need to reserve dozens of GBs of virtual memory just for System.IO.Packaging.ZipPackage processing; otherwise we risk container memory limit violations and therefore OOM exceptions.

@edwardneal
Contributor

With #102704 now merged, there's been some progress toward improving this. The PR includes some infrastructure to track the type of change made to an individual ZipArchiveEntry, and alters the way that entries are written to the output stream.

The present behaviour is:

  • If an entry is added to the archive and the archive is then disposed of, only the new entry (and the new central directory) is written.
  • If an existing entry's fixed-length metadata (e.g. the last write time) is modified, the entry headers are rewritten in-place. The entry contents are not.
  • If an existing entry's dynamic-length metadata (e.g. the filename) or contents are modified, that entry is written. Every entry which follows it in the archive is also written.
  • If an existing entry is deleted, every entry which follows it in the archive is written.

Appending an entry to a ZipArchive is now faster and uses less memory, particularly when appending to large archives. If a ZipArchive was opened in Update mode and never modified, this will no longer write to the output stream at all.

Making ad-hoc modifications to entries in ZipArchives now has less consistent performance, but it should still be faster on average. Deleting or modifying the first entry in the archive is a modest improvement over .NET 9.0; doing the same to the last entry should be significantly faster than .NET 9.0, because fewer entries are rewritten to the output stream.

If we've got control over the source ZIP file, ZipArchive will perform best when the largest and the most frequently modified entries are placed at the end of the archive.

I had a few ideas while writing the PR, but most of these involve archive size/write speed/memory usage tradeoffs. I'm leaving them below in case anyone wants to develop them.

  • We currently rewrite the entry when dynamic-length metadata changes, because that metadata could otherwise overwrite the contents. This isn't guaranteed: if the total length of the dynamic-length metadata doesn't change, we could safely rewrite the header without touching the entry contents or the entries which follow.
  • When an entry's contents are modified, there's no guarantee that they'll get larger - that's just the safest assumption. If it shrinks, we could overwrite the now-free space; we wouldn't need to rewrite every entry which follows.
    • We wouldn't want to do this if the total size of the following entries is greater than the amount of now-free space.
  • When an entry's contents do become larger, they might only intersect a few entries which follow it. In this situation, we might only need to rewrite a few of the following entries. There'll be some now-free space between the end of the enlarged entry and the start of the first untouched entry which follows.
  • If someone deletes an entry in the ZipArchive, it might be more efficient to simply overwrite that space in the archive (marking it as free) and not rewrite the entries which follow.
  • If there's now free space in the ZipArchive and we create a new entry (or move an existing one), that entry might be able to fit in the free space.
