Poor performance when opening raw data with (wrong) gzip encoding in header #88
Hm, this is an interesting effect. I'm a bit hesitant to include special functionality to intercept when the compression level is 0. This seems like a special use case. Is there a reason why you need to use a compression level of 0? If you have a special use case where this is deemed necessary or advantageous, then I'm fine with what you suggest. However, it could be worthwhile to investigate where the bottleneck comes from. There are two options as far as I can tell: the decompression of the data itself, or the reading of the data from the file.
Haven't done any digging yet but will try to do so. From a user perspective, setting compression_level is how you control the size/performance tradeoff.
Ok, I did some testing, and it is the decompression object (alternative 1) that is the problem. Since the data is split into 1 MB chunks, the reading loop runs about 600 times for me, and the time for each iteration increases: the first takes about 2-3 ms and the last 385 ms for running line 409. I assume the memory is reallocated every time the immutable bytes object "decompressed_data" grows. Growing a bytearray instead solves the problem, but maybe there is a better way? np.fromstring() still wants immutable bytes as input. This comment makes me think there is room for improvement even with some minor changes.
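To make the pattern concrete, here is a minimal sketch of the chunked-decompression loop being discussed (the function names and chunk size are illustrative, not pynrrd's actual code), showing the bytes-vs-bytearray difference:

```python
import zlib

_CHUNK_SIZE = 2 ** 20  # 1 MB chunks, as discussed above (illustrative value)

def decompress_slow(fh, decompobj):
    # Each += on an immutable bytes object copies everything accumulated so
    # far, so the cost of the loop grows quadratically with the data size.
    decompressed_data = b''
    while True:
        chunk = fh.read(_CHUNK_SIZE)
        if not chunk:
            break
        decompressed_data += decompobj.decompress(chunk)
    return decompressed_data

def decompress_fast(fh, decompobj):
    # Appending to a mutable bytearray amortizes the reallocations.
    decompressed_data = bytearray()
    while True:
        chunk = fh.read(_CHUNK_SIZE)
        if not chunk:
            break
        decompressed_data += decompobj.decompress(chunk)
    # Convert back to immutable bytes for APIs such as np.fromstring().
    return bytes(decompressed_data)

# Usage sketch: for a gzip stream, decompobj = zlib.decompressobj(zlib.MAX_WBITS | 16)
```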
Perfect, I was doing some testing as well and got the same results. An interesting thing I found is that bzip2 does not support a compression level of 0, only 1-9. Maybe we should restrict the compression level to 1-9, throw an error if the user sets it to 0, and document that appropriately. However, I think you've stumbled across an interesting find that we should address. The 1 MB chunks are before my time, so I'll need to check out the commit to see why they were added. I would think this parameter could be much larger, say 50-100 MB, before causing problems. Edit: I just hacked it to make the size 100 MB and it reduced the time to load any compressed data by 50-100x. This is worthwhile to look into: we should set this to an appropriate value and/or let it be adjustable.
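If we do make it adjustable, one lightweight option is a module-level constant that the read loop consults (a sketch only; the name `_READ_CHUNKSIZE` and the module path are hypothetical, not an existing pynrrd setting):

```python
# Hypothetical module-level knob consulted by the compressed-read loop.
_READ_CHUNKSIZE = 2 ** 20  # default 1 MB, the current value discussed above

# A caller who knows their files are large could raise it before reading, e.g.:
#   nrrd.reader._READ_CHUNKSIZE = 100 * 2 ** 20   # 100 MB
```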
Here's the commit: c98ac3c. Here's the issue: #21. Not sure if this Python bug is related, since it doesn't mention anything about the CRC check: https://bugs.python.org/issue25626. As far as I know, we should be able to set this limit to 4 GB and be safe; I think that's a safe assumption. We need some way to test this adequately without having to save a 4 GB file. The test should generate a > 4 GB file and then read from it. If this takes too long, then it's best not to run it on TravisCI.
Great, it seems we can improve the performance overall, and the handling of header['encoding'] and compression_level is a separate problem. Still, I believe fixing the current improper memory handling is important, even if the performance problem becomes less significant (or insignificant) with much larger chunks. Using io.BytesIO seems to be the fastest, but it requires an additional import. Keeping the 1 MB chunks, I can reduce the load time from 120 seconds to 4 seconds in my specific case by using io.BytesIO as a buffer.
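For reference, a sketch of the io.BytesIO approach described above (again with illustrative names, assuming the same 1 MB chunking):

```python
import io

_CHUNK_SIZE = 2 ** 20  # keep the existing 1 MB chunks

def decompress_with_bytesio(fh, decompobj):
    # io.BytesIO appends into an internal resizable buffer, so repeated
    # write() calls avoid copying the whole accumulated result each time.
    buffer = io.BytesIO()
    while True:
        chunk = fh.read(_CHUNK_SIZE)
        if not chunk:
            break
        buffer.write(decompobj.decompress(chunk))
    return buffer.getvalue()  # immutable bytes, suitable for np.fromstring()
```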
Alright, so we have three different problems we're analyzing. Let me start by summarizing each problem and giving my proposed solution so we can analyze this systematically. Problems:

1. Writing with compression_level=0 and a 'gzip' encoding produces essentially uncompressed data that is labeled as gzip and is very slow to read back (the original report).
2. Compressed data is read in 1 MB chunks, which appears to be far too small.
3. The decompressed chunks are accumulated by concatenating onto an immutable bytes object, which reallocates memory on every iteration.
Here are the solutions I think we should implement for these problems; let me know your opinion:

1. Restrict the compression level to 1-9, throw an error (or at least warn) if the user sets it to 0, and document this.
2. Increase the chunk size used when reading compressed data to an appropriate value and/or make it adjustable.
3. Accumulate the decompressed data in a mutable buffer (bytearray or io.BytesIO) instead of concatenating immutable bytes.
I don't know the specifics of how appending to a bytestring compares to writing to an io.BytesIO buffer. It may be a good idea to do some research on these methods and see what others in the Python community are doing for reading buffered data. If io.BytesIO really is the fastest, why can't we use it?
I've done some research. This might be a really easy fix to make... Refer to this article for details on the most efficient way to concatenate multiple byte strings: https://stackoverflow.com/questions/12169839/which-is-the-preferred-way-to-concatenate-a-string-in-python The second answer covers the byte-string case. I made a Jupyter notebook to do some testing (the attached image shows the results). In summary, using += is slightly faster than BytesIO if you append to a bytearray rather than a bytes object (b''). A bytearray is mutable; bytes (b'') is immutable (see here). So the proposed solution is just to change line 400 from b'' to bytearray(). I deleted my test script. Can you make that change and test again please? I'm thinking this will greatly improve reading speeds.
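Something along these lines reproduces the comparison (a standalone timing sketch, not the deleted notebook; absolute numbers will differ by machine and Python version):

```python
import io
import timeit

CHUNK = b'x' * 2 ** 20   # one 1 MB chunk
N = 200                  # number of chunks appended (~200 MB total)

def concat_bytes():          # immutable bytes: quadratic copying
    out = b''
    for _ in range(N):
        out += CHUNK
    return out

def concat_bytearray():      # mutable bytearray: amortized appends
    out = bytearray()
    for _ in range(N):
        out += CHUNK
    return bytes(out)

def concat_bytesio():        # io.BytesIO buffer
    out = io.BytesIO()
    for _ in range(N):
        out.write(CHUNK)
    return out.getvalue()

for fn in (concat_bytes, concat_bytearray, concat_bytesio):
    print(fn.__name__, timeit.timeit(fn, number=1))
```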
I basically agree.
Regarding the
Points 1 & 3 we are in agreement on; point 2 still needs some more discussion. In terms of PRs, I think point 1 should be its own PR, and points 2 & 3 can go into one PR named something like "Fixing performance issues in compressed NRRDs". For point 1, we will want to make sure to update the documentation. Once that is submitted, I'll leave it up for others to review in case it breaks compatibility for them. For points 2 and 3, we can get one started for point 3 and then add in point 2 after we finish discussing it. The 3rd point is a simple fix; however, the 2nd point will potentially need to include some tests for larger datasets... Are you able to submit PRs? If not, I can get to them at some point.
As for point 2, I just tested with a chunk size of 4 GB (4 * 2 ** 30) and it worked fine on Python 3 and Python 2. Honestly, I'm not sure if buffered reading is required since the original issue was with the
@AAAArcus Bump. Any more thoughts? If not, are you able to submit PRs or would you like me to? No rush, just want to keep the conversation going...
@AAAArcus The PRs are made. Let me know what you think and please review them.
@addisonElliott Sorry for dropping off the radar. I haven't been able to work on this for more than a week... It makes sense to remove compression level zero without actually breaking code that is functional (but perhaps currently rather slow). The performance fix will make this a minor issue even for those who still run with compression_level=0. I didn't notice the difference in how the writer handled the chunks compared to the reader; it's reasonable that they now look more symmetric.
No problem. This weekend was going to be the most free time I'll have in the next few weeks, so I wanted to get some things done on my TODO list. Yeah, the asymmetry is likely due to the intricacies of the GZIP compression algorithm, something I'm not too familiar with. Releasing the memory piece by piece is not a bad idea. Another option is to figure out why, but my opinion is to keep everything the way it is until someone runs across an issue. I've found that if I try to optimize for every single use case, it ends up taking forever! I think I mentioned this in the PR, but my justification is that the compressed data is likely much smaller than the decompressed data. Thus, if the user is able to hold the entire decompressed data in memory, then the RAM required is the decompressed data size plus the compressed data size, which shouldn't be much more than the RAM required for the decompressed data alone. Whenever I get some free time this week, I'll merge the PRs and close this issue. Thanks for reporting it and helping come to a solution!
I guess the use case where the memory problem could show up would be if the uncompressed data basically takes up all the memory, so the compressed data would not fit alongside it. But that shouldn't be too common. Usually you need some memory to process the uncompressed data afterwards anyway, so fitting the compressed data should rarely be the main problem. Looking forward to the new version (instead of my own hack).
This PR makes two changes to improve the performance of the `nrrd.write` function for compressed NRRD files, i.e. gzip or bzip2 compressed data. Fixes issue #88
The GZIP library supports a compression level of `0` for no compression, but this has been found to be slower than setting the encoding to `raw`. This PR removes the documentation for setting `compression_level=0`. In addition, Python's `bz2` does not support a compression level of 0, so another benefit of this PR is a unified compression level range. Note that the user can still specify a compression level of 0 at their own risk. Fixes issue #88
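For reference, the standard-library asymmetry that motivates the unified 1-9 range can be seen directly (behavior of CPython's zlib and bz2 modules):

```python
import bz2
import zlib

# zlib (used for the 'gzip' encoding) accepts level 0, meaning "store, no compression".
zlib.compressobj(0)

# bz2 accepts only levels 1-9; level 0 raises ValueError.
try:
    bz2.BZ2Compressor(0)
except ValueError as err:
    print(err)  # e.g. "compresslevel must be between 1 and 9"
```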
Manually putting the field 'encoding'='raw' in the header overrides any compression_level>0 input to nrrd.write(), which I guess is the way it was intended to be. And when setting 'encoding'='gzip' and writing a file with compression_level=0, the output is uncompressed but the header is unmodified. This would not be an issue if it weren't for the fact that reading a raw data file with 'encoding'='gzip' in the header (which it should not have, of course) seems to take much longer.
In my example with 600 MB of raw data, writing with compression_level=0 and 'encoding'='raw' in the header gives one 600 MB file, and writing again, just changing the header to 'encoding'='gzip', gives another 600 MB file. However, reading the latter with nrrd.read() takes 300(!) times longer than reading the first.
There is probably an explanation if you look into the details, but it feels like if compression_level=0, the encoding should be set to 'raw' in the header automatically on write?
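A rough reproduction of the report above might look like the following (a sketch assuming nrrd.write accepts a header dict and the compression_level keyword discussed in this thread; sizes and timings are illustrative):

```python
import time

import numpy as np
import nrrd

data = np.zeros((600, 1024, 1024), dtype=np.uint8)  # roughly 600 MB of raw data

# Case 1: explicitly raw encoding.
nrrd.write('raw.nrrd', data, {'encoding': 'raw'})

# Case 2: the header claims gzip, but compression_level=0 stores the data
# essentially uncompressed, so the file ends up about the same size as the raw one.
nrrd.write('gzip_level0.nrrd', data, {'encoding': 'gzip'}, compression_level=0)

for filename in ('raw.nrrd', 'gzip_level0.nrrd'):
    start = time.time()
    nrrd.read(filename)
    print(filename, '%.1f s' % (time.time() - start))
```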