cli: collect: sort: Enable gzip compression #489
base: main
Conversation
Add the --gzip parameter to allow output to be gzip-compressed. Also add support for reading both gzip and uncompressed files transparently.
Signed-off-by: Mike Pattrick <[email protected]>
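For reference, a minimal sketch of how a `--gzip` output option plus transparent reading of both compressed and plain files could look, assuming the `flate2` crate; `open_output` and `open_input` are hypothetical helper names, not the actual retis code:

```rust
// Minimal sketch (not the retis implementation): gzip-compress output when
// requested, and read back either gzip or plain files transparently by
// checking the gzip magic bytes. Assumes the `flate2` crate.

use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Seek, SeekFrom, Write};

use flate2::read::MultiGzDecoder;
use flate2::write::GzEncoder;
use flate2::Compression;

/// Open the output file, wrapping it in a gzip encoder when `gzip` is set
/// (hypothetical equivalent of the proposed `--gzip` flag). The caller is
/// expected to flush/finish the writer before exiting.
fn open_output(path: &str, gzip: bool) -> io::Result<Box<dyn Write>> {
    let file = BufWriter::new(File::create(path)?);
    if gzip {
        Ok(Box::new(GzEncoder::new(file, Compression::default())))
    } else {
        Ok(Box::new(file))
    }
}

/// Open an event file for reading, auto-detecting gzip via the 0x1f 0x8b
/// magic bytes so both formats are handled transparently.
fn open_input(path: &str) -> io::Result<Box<dyn Read>> {
    let mut file = File::open(path)?;
    let mut magic = [0u8; 2];
    let n = file.read(&mut magic)?;
    file.seek(SeekFrom::Start(0))?;

    if n == 2 && magic == [0x1f, 0x8b] {
        Ok(Box::new(MultiGzDecoder::new(BufReader::new(file))))
    } else {
        Ok(Box::new(BufReader::new(file)))
    }
}
```

Detection relies only on the two gzip magic bytes, so existing uncompressed event files keep working unchanged.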
This is not a full review. I wanted to relate this feature to the discussions that took place in #454. In essence, should we build this kind of functionality into retis, or should retis instead play nice with UNIX pipes? In this particular case, the latter would save us from re-implementing "gzip" inside retis. For example, with this PR:
With pipes and a slightly hacked main branch[1]:
By using pipes, retis is able to capture more than 2.5x the events. The compression rate is similar, although still, the pipe option has a decent
Still, having a slightly lower compression rate but more than doubling the number of events is a fair trade-off IMHO. Of course this assumes you have a server where [1] I quickly hacked a
This makes sense. With my patch the compression happens in the main thread, which is not ideal! Having writes happen in a different thread is a better solution even without built-in gzip, as writes could block for an indeterminate amount of time in any case. Re: #454, another possible option would be to add support for serde_cbor. The interface is very similar, so not a lot of code would have to change, and one benchmark showed a 3x improvement on serialization and a 30% improvement on deserialization. But probably the best long-term solution is to have a separate thread do the read/write and encode/decode. This patch was my first code written in Rust, so I don't know what that would look like, but I can try to tackle it this weekend.
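For the writer-thread idea, a rough sketch under the assumption of a simple std::sync::mpsc channel between the collection loop and a dedicated writer thread; the names and the stand-in loop are illustrative, not retis internals:

```rust
// Illustrative sketch only: move file writes (and any encoding/compression)
// off the collection path by sending already-serialized events to a
// dedicated writer thread over a channel.

use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::mpsc;
use std::thread;

fn main() -> std::io::Result<()> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Writer thread: owns the output file and blocks on I/O without
    // stalling the thread that drains events from the kernel.
    let writer = thread::spawn(move || -> std::io::Result<()> {
        let mut out = BufWriter::new(File::create("events.json")?);
        for buf in rx {
            out.write_all(&buf)?;
        }
        out.flush()
    });

    // Collection side (stand-in loop): serialize and hand off, never write.
    for i in 0..10 {
        let line = format!("{{\"event\":{i}}}\n").into_bytes();
        tx.send(line).expect("writer thread gone");
    }
    drop(tx); // close the channel so the writer thread exits

    writer.join().expect("writer thread panicked")
}
```

With this split the collection side only pays the cost of a channel send, so a slow disk (or an expensive encoder such as gzip) no longer blocks event retrieval.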
Binary representation is something I've wanted to explore for some time. I wanted to wait until we had Python bindings because, in my mind, parsing JSON in Python was always a "plan B". Now even the Python library published on pip uses Rust to parse, so this change could very well be completely transparent to users. @atenart thoughts?
We've also discussed this a few times. An argument keeps popping up: let's keep it simple; keep the collection as simple as possible and do the rest as post-processing. I actually must have patches for this somewhere...
Yes, that makes sense. I have a similar feeling: JSON was important to keep as long as it was the only option for post-processing, but now we have Python support and that is the preferred way. We also never promised to keep the event file format stable, so it's fine to improve it. Although, if that does not add too many quirks, we could support JSON and other formats and then just make the binary one the default.
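As a hedged illustration of keeping JSON alongside a binary format, here is a sketch assuming serde with the serde_json and serde_cbor crates; RawEvent and EventFormat are made-up names for the example, not the retis event model:

```rust
// Sketch of dual-format event serialization behind a single switch,
// assuming serde + serde_json + serde_cbor. `RawEvent` and `EventFormat`
// are hypothetical names, not retis types.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct RawEvent {
    ts: u64,
    probe: String,
}

enum EventFormat {
    Json,
    Cbor,
}

fn encode(event: &RawEvent, format: &EventFormat) -> Vec<u8> {
    match format {
        // Human-readable and grep-friendly; one JSON object per line.
        EventFormat::Json => {
            let mut buf = serde_json::to_vec(event).unwrap();
            buf.push(b'\n');
            buf
        }
        // Compact binary; smaller and cheaper to (de)serialize.
        EventFormat::Cbor => serde_cbor::to_vec(event).unwrap(),
    }
}
```

Because both encoders work off the same serde derive, flipping the default to the binary format would not require touching the event definitions themselves.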
IIRC this might not be possible w/o some prerequisites, but I might not remember well. On that front I have multiple ideas about improving performance (startup + run time) and I'll likely have a look at those after the current set of PRs I have open.
I agree with this approach. JSON may still prove to be useful.
In general, the BPF/user-space throughput has room for improvement (with the file write being the end of the chain for collection), some of those improvements may influence each other (per-CPU buffers, event structure, consumers), and among the multiple things discussed there is also an alignment front to take care of.
One more thing to keep in mind (that might even turn out not to be a problem at all) is backward compatibility (whether we want to have it to some extent or not); the binary format should not corner us (ideally, at least no more than the other supported formats :)
One note here: I find it extremely useful to be able to grep the events.json for things I'm looking for, just to check if they are present or not. For example:
This is especially important in situations where retis sort/print takes multiple minutes, or even dozens of minutes, to parse the file, while grep can find the required information in seconds. The gzip archive can be uncompressed if necessary, but having an arbitrary binary format is not really desirable, from my point of view.
Different use cases have different needs. That makes perfect sense, in addition to backward compatibility to read older event files.
Add the --gzip parameter to allow output to be gzip compressed. Also add support for reading both gzip and uncompressed files transparently.
This feature was suggested by @igsilya