[API Proposal]: Allow opening (raw) compressed archive entries in ZipArchiveEntry #63155

Open
Tracked by #62658
PJB3005 opened this issue Dec 27, 2021 · 9 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression

@PJB3005
Contributor

PJB3005 commented Dec 27, 2021

Background and motivation

Right now, ZipArchive only supports opening entries compressed with Stored, Deflate and Deflate64. While there are open issues about adding support for more specified methods such as LZMA, I would like to propose an orthogonal solution to this problem.

Allow access to the raw compressed streams in the zip file, and the compression method flag in the entry.
This opens up a few possibilities:

  • Allows developers to use third-party compression libraries to get support for algorithms like zstd or LZMA themselves.
  • Can be used in advanced scenarios when, for example, copying between zip files, to avoid having to decompress and re-compress data.

I am far from an expert on the zip file format, but from my rudimentary understanding of it, this should be possible?
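
For illustration, here is a minimal sketch of what "raw access" means at the format level. It is not the proposed API: `RawZipEntryDemo` and its method names are made up for this example. It builds a one-entry archive with the existing `ZipArchive`, then walks the end-of-central-directory record, the central directory header and the local file header by hand to pull out the 2-byte compression method and the still-compressed bytes. It assumes a small archive with no archive comment, no Zip64 records and no encryption.

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class RawZipEntryDemo
{
    // Builds a one-entry archive in memory with the built-in ZipArchive.
    public static byte[] CreateSampleZip(string text)
    {
        var buffer = new MemoryStream();
        using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.CreateEntry("foo.json", CompressionLevel.Optimal);
            using var writer = new StreamWriter(entry.Open());
            writer.Write(text);
        }
        return buffer.ToArray();
    }

    // Returns the method ID and the raw (still compressed) bytes of the first entry.
    // Assumption: no archive comment, so the end-of-central-directory record is the last 22 bytes.
    public static (ushort Method, byte[] RawData) ReadFirstEntryRaw(byte[] zip)
    {
        using var reader = new BinaryReader(new MemoryStream(zip));

        reader.BaseStream.Position = zip.Length - 22;
        if (reader.ReadUInt32() != 0x06054b50)
            throw new InvalidDataException("End of central directory record not found.");
        reader.BaseStream.Position += 12;   // disk numbers, entry counts, directory size
        uint centralDirectoryOffset = reader.ReadUInt32();

        reader.BaseStream.Position = centralDirectoryOffset;
        if (reader.ReadUInt32() != 0x02014b50)
            throw new InvalidDataException("Central directory header not found.");
        reader.BaseStream.Position += 6;    // version made by, version needed, flags
        ushort method = reader.ReadUInt16();
        reader.BaseStream.Position += 8;    // time, date, CRC-32
        uint compressedSize = reader.ReadUInt32();
        reader.BaseStream.Position += 18;   // uncompressed size, lengths, disk start, attributes
        uint localHeaderOffset = reader.ReadUInt32();

        // The local header has its own name/extra lengths at offset 26; data follows them.
        reader.BaseStream.Position = localHeaderOffset + 26;
        ushort nameLength = reader.ReadUInt16();
        ushort extraLength = reader.ReadUInt16();
        reader.BaseStream.Position += nameLength + extraLength;
        return (method, reader.ReadBytes((int)compressedSize));
    }

    // Zip method 8 is a raw Deflate stream, which DeflateStream can read directly.
    public static string Inflate(byte[] rawDeflate)
    {
        using var inflater = new DeflateStream(new MemoryStream(rawDeflate), CompressionMode.Decompress);
        using var reader = new StreamReader(inflater);
        return reader.ReadToEnd();
    }

    static void Main()
    {
        byte[] zip = CreateSampleZip("{\"hello\": \"world\"}");
        var (method, raw) = ReadFirstEntryRaw(zip);
        Console.WriteLine($"method={method}, raw bytes={raw.Length}");
        if (method == 8)
            Console.WriteLine(Inflate(raw));
    }
}
```

The `Inflate` step stands in for any decompressor: with raw access, the same bytes could be handed to a third-party library or copied elsewhere without being decompressed at all.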

API Proposal

namespace System.IO.Compression
{
	public class ZipArchiveEntry
	{
		public ZipCompressionMethod CompressionMethod { get; }
		public Stream OpenRaw();
	}

	public class ZipArchive
	{
		public ZipArchiveEntry CreateEntry(string entryName, ZipCompressionMethod compression);
	}

	public enum ZipCompressionMethod : short
	{
		// Corresponds to the compression method described by APPNOTE.TXT section 4.4.5
		Stored = 0,
		Deflate = 8,
		Bzip2 = 12,
		Lzma = 14,
		Zstd = 93
	}
}

API Usage

Using third-party decompression streams with ZipArchive:

var zipArchive = new ZipArchive(..., ZipArchiveMode.Read);
var entry = zipArchive.GetEntry("foo.json");
Debug.Assert(entry.CompressionMethod == ZipCompressionMethod.Zstd);

// Imagine a ZstdStream from a third-party library.
var stream = new ZstdStream(entry.OpenRaw(), CompressionMode.Decompress);

Copying compressed blobs between zip files:

ZipArchive a = ...;
ZipArchive b = ...;

var aEntry = a.GetEntry("foo.json");
var bEntry = b.CreateEntry("foo.json", aEntry.CompressionMethod);

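// Dispose both raw streams once the compressed bytes have been copied.
using Stream source = aEntry.OpenRaw();
using Stream destination = bEntry.OpenRaw();
source.CopyTo(destination);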

Alternative Designs

No response

Risks

No response

@PJB3005 PJB3005 added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Dec 27, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Dec 27, 2021
@ghost

ghost commented Dec 27, 2021

Tagging subscribers to this area: @dotnet/area-system-io-compression
See info in area-owners.md if you want to be subscribed.


@AlgorithmsAreCool
Contributor

I have a real-world use case for this also. I recently implemented my own incomplete parser for ZIP archives to use LibDeflate as the decompressor, which got me some nice speedups. It would be nice to be able to use the structure parsing with my own compression libs.

@PJB3005
Contributor Author

PJB3005 commented Dec 27, 2021

My main use case is that I want to use zip files (because it's a standard format) with LZMA (significant space savings in my case), while also being able to dump these blobs straight into an SQLite DB while they are still compressed. Another use case is using zip files as object storage behind an API: being able to send the compressed blobs over the wire directly would be great.

This would hit multiple birds with one stone.

@Clockwork-Muse
Contributor

> Allows developers to use third-party compression libraries to get support for algorithms like zstd or LZMA themselves.

Having an enum that requires a third-party library to supply that compression algorithm is likely to cause confusion.

At least some compression libraries add a header to the compressed stream - that being the case, if the constructor instead took something like

public interface IZipCompressionStream
{
    // Interfaces cannot declare fields, so these are read-only properties.
    string CompressionMethod { get; }
    ReadOnlySpan<byte> Header { get; }
    Stream Compress(Stream raw);
    bool TryDecompress(Stream compressed, out Stream raw);
    Stream Decompress(Stream compressed);
}

... this would allow for arbitrary compression methods, including ones not currently envisioned.

@PJB3005
Contributor Author

PJB3005 commented Dec 27, 2021

> Having an enum that requires a third-party library to supply that compression algorithm is likely to cause confusion.

It is a lower level API that simply exposes more information about the underlying zip file format. Python also exposes the ZipInfo.compress_type field in its zipfile module (but no ability to access the raw stream, AFAICT).

Limiting the enum members to the compression methods supported by .NET today would be an option, which I suppose is closer to what Python does in this regard.

> At least some compression libraries add a header to the compressed stream - that being the case, if the constructor instead took something like

Relying on such headers is silly for zip files, since they already have a standardized 2-byte entry field for compression method.

This entire IZipCompressionStream seems like a very complex solution, and it does not address the other point: access to raw blobs (although you could probably abuse it to achieve that by jumping through many silly hoops).
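
As a rough sketch of how the standard 2-byte method field alone could drive third-party decoders: the `ZipDecoders` class below is hypothetical and not part of System.IO.Compression. It maps APPNOTE.TXT section 4.4.5 method IDs to factories that wrap a raw entry stream, using only the existing `DeflateStream` for the built-in case.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration only; not an existing or proposed .NET type.
static class ZipDecoders
{
    // Keys are compression method IDs from APPNOTE.TXT section 4.4.5.
    private static readonly Dictionary<ushort, Func<Stream, Stream>> s_decoders = new()
    {
        [0] = raw => raw,                                                // Stored: bytes are used as-is
        [8] = raw => new DeflateStream(raw, CompressionMode.Decompress), // Deflate
    };

    // Lets an application plug in a decoder for a method the framework does not handle.
    public static void Register(ushort methodId, Func<Stream, Stream> factory) =>
        s_decoders[methodId] = factory;

    // Wraps the raw entry stream in the decoder registered for its method ID.
    public static Stream Wrap(ushort methodId, Stream raw) =>
        s_decoders.TryGetValue(methodId, out var factory)
            ? factory(raw)
            : throw new NotSupportedException(
                $"No decoder registered for compression method {methodId}.");
}
```

With something like this, an application could call `ZipDecoders.Register(93, raw => ...)` with a stream from whatever zstd library it uses, then pass the entry's method ID and raw stream to `Wrap`. No per-library header sniffing is needed, because the archive already records the method.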

@svick
Contributor

svick commented Dec 27, 2021

@Clockwork-Muse I think the API should follow the standard (though which of the specified compression methods should be named members of the enum is up for debate), instead of inventing its own way of specifying the compression method, which may or may not be useful in the future. Or do you have an example where what you're proposing would be useful today?

@Clockwork-Muse
Contributor

> Relying on such headers is silly for zip files, since they already have a standardized 2-byte entry field for compression method.

Ah, I was not aware that zip itself listed the possible methods, my bad.

@adamsitnik
Member

@carlossanlop what is your take on this? Would adding such an API help to implement algorithms that are currently not supported OOTB?

@jeffhandley
Member

Thanks for this suggestion, @PJB3005. I'm moving this to Future, but I've also referenced it in #62658 so that we look at it alongside the LZMA and other potential investments during our .NET 8 planning.

@jeffhandley jeffhandley added this to the Future milestone Aug 2, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Aug 2, 2022