👌 IMPROVE: Add AbstractRepositoryBackend.iter_object_streams #5168
Conversation
This is essentially an addition to aiidateam#5156 and required for aiidateam#5145. Without the "optimised" use of `Container.get_objects_stream_and_meta` for the `DiskObjectStoreRepositoryBackend`, the profiled archive creation in aiidateam#5145 goes from 4 minutes to 9 minutes!
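For context, the batched access this relies on looks roughly like the following (a minimal sketch; the exact `Container` API usage is an assumption based on the disk-objectstore package, and the paths and keys are hypothetical):

```python
from disk_objectstore import Container

container = Container('/path/to/container')  # hypothetical container location
hashkeys = ['abc123', 'def456']  # hypothetical object keys

# One context opens many objects in an optimised order, instead of
# opening and closing each object individually.
with container.get_objects_stream_and_meta(hashkeys) as triplets:
    for hashkey, stream, meta in triplets:
        data = stream.read()  # each stream is only valid within its iteration step
```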
Codecov Report
@@ Coverage Diff @@
## develop #5168 +/- ##
===========================================
+ Coverage 80.96% 80.97% +0.01%
===========================================
Files 535 535
Lines 37188 37201 +13
===========================================
+ Hits 30107 30119 +12
- Misses 7081 7082 +1
Hey @chrisjsewell, great work advancing the cause of batch operations! Haha. I only wanted to go over a few comments, but this should be ready to go in no problem.
aiida/repository/backend/abstract.py
for key in keys:
    with self.open(key) as handle:  # pylint: disable=not-context-manager
        yield key, handle
I noticed the two implementations of this method have slightly different behaviors. The abstract.iter_object_streams will get each handle in its own iteration step and close each file when moving to the next element, while disk_object_store.iter_object_streams will get all handles at the start and close all files only at the end. This means that, for example, one could use the second to do a "pass of content" from one file to the next but not the first, right? (weird case, but just to illustrate).
Since the external iteration is something the caller could do on their own, I would personally prefer to implement the version with the internal iteration, just to offer a different path for the caller to decide what is best (one never knows for what use case that one didn't anticipate it could matter). However, it is true that this case is very subtle, so I am also ok if you prefer to leave it like this.
In any case, what I do think could be more important is to document this behavior a bit more in the docstring: that the streams will be closed after the iteration finishes, so you can't use this to put them in a list, for example, and (if we go that way) that this function may offer the possibility of doing the upfront opening where better performance would be expected, though it is not guaranteed for all cases (and each handle is only guaranteed to be available in its own iteration step).
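To make the contrast concrete, here is a minimal sketch of the two behaviors (a hypothetical simplification; the real DiskObjectStoreRepositoryBackend delegates to Container.get_objects_stream_and_meta rather than an ExitStack):

```python
import contextlib

def iter_streams_lazy(backend, keys):
    """Open each object only for its own iteration step (the abstract default)."""
    for key in keys:
        with backend.open(key) as handle:
            yield key, handle  # closed as soon as the caller advances the iterator

def iter_streams_upfront(backend, keys):
    """Open all objects at the start; close them together when iteration ends."""
    with contextlib.ExitStack() as stack:
        handles = [(key, stack.enter_context(backend.open(key))) for key in keys]
        yield from handles  # every handle stays open until the generator is exhausted
```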
Heya, yeh I guess you can just document the behaviour more, but yes, each stream should only be read in its own iteration.
This means that, for example, one could use the second to do a "pass of content" from one file to the next but not the first, right?
They are read only handles, so this would not be the case.
updated the docs
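For reference, a hedged sketch of the kind of docstring note discussed (the wording is illustrative, not the exact text that was merged):

```python
def iter_object_streams(self, keys):
    """Return an iterator over the byte streams of the objects identified by ``keys``.

    .. note:: The streams are only guaranteed to be readable during their own
        iteration step and are closed once the iterator advances, so the handles
        cannot be collected into a list for later use. Implementations may open
        all streams upfront for performance, but this is not guaranteed.
    """
```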
def test_iter_object_streams(repository):
    """Test the ``Repository.iter_object_streams`` method."""
    key = repository.put_object_from_filelike(io.BytesIO(b'content'))

    for _key, stream in repository.iter_object_streams([key]):
        assert _key == key
        assert stream.read() == b'content'
I find it curious that you decided to implement a default behavior on the abstract repository backend, for subclasses to overwrite, instead of leaving it as an abstract method. I guess it makes sense given that having the open method implemented already allows for a fallback implementation, but I'd be interested to know if there were other aspects you considered for this (even considerations in the other direction that you weighed as less important here), or was it just because of that?
Anyways, the reason I comment here is that, given that design choice, it would make more sense to put this test in tests/repository/backend/test_abstract.py, no? There is already a mock repository there, so it shouldn't require much adaptation. Also, we could have 2 or 3 files to iterate over, so as to be sure we are covering more of the functionality of the iterator (see the sketch below).
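A hedged sketch of what such a multi-object test might look like (the names mirror the existing test; the exact fixture API is assumed):

```python
import io

def test_iter_object_streams_multiple(repository):
    """Test ``iter_object_streams`` over several objects (illustrative sketch)."""
    contents = [b'content-a', b'content-b', b'content-c']
    keys = [repository.put_object_from_filelike(io.BytesIO(content)) for content in contents]

    expected = dict(zip(keys, contents))
    for key, stream in repository.iter_object_streams(keys):
        assert stream.read() == expected[key]
```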
or was it just because of that?
Yes, just the reason that it is possible to already have a default implementation, so there's no need to duplicate this code across implementations.
There is already a mock repository there so it shouldn't require much adaptation.
The abstract class, and thus the mock object, does not implement an open method, and so this does not work. In fact the abstract class is faulty in this respect, in that it should make open abstract; plus, implementations should likely return a contextmanager rather than an (already within the context) iterator.
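A minimal sketch of the contextmanager-returning shape being suggested (the signature is an assumption for illustration, not the API that was merged):

```python
import contextlib
from typing import BinaryIO, Iterable, Iterator, Tuple

@contextlib.contextmanager
def iter_object_streams(backend, keys: Iterable[str]) -> Iterator[Iterator[Tuple[str, BinaryIO]]]:
    """Yield an iterator of (key, stream) pairs, valid only inside the ``with`` block.

    In the real class this would be a method taking ``self`` instead of ``backend``.
    """
    with contextlib.ExitStack() as stack:
        # Streams open lazily as the caller iterates; the ExitStack closes
        # everything that was opened when the outer context exits.
        yield ((key, stack.enter_context(backend.open(key))) for key in keys)

# Usage: handles are guaranteed open only within the context.
# with iter_object_streams(backend, keys) as pairs:
#     for key, stream in pairs:
#         data = stream.read()
```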
See https://github.com/chrisjsewell/aiida_core/blob/8132651145311e2bf5b3f02beb3457af50a10248/aiida/tools/archive/abstract.py#L257 and its overloads for a good implementation example of a stream opener.
The abstract class, and thus the mock object, does not implement an open method, and so this does not work.
I mean, I didn't say it would require no adaptation 😅, but it should be just a matter of implementing that method and adjusting one or two more things, right? The change would make the test more findable (if I want to identify where we test that implemented logic, I would look in the tests for the abstract), more informative (a failing test ensures the problem is in the abstract class and that there is no contamination from the sandbox), and more robust (if for some reason we make changes to or even remove the sandbox, and thus its tests, we would otherwise no longer be testing this default implementation).
In fact the abstract class is faulty in this respect, in that it should make open abstract; plus, implementations should likely return a contextmanager rather than an (already within the context) iterator
Yeah, this is a good point! We should probably change that.
Ok, I've removed the code from the abstract class, made the method an abstractmethod, and moved the code to the SandboxRepositoryBackend.
Honestly, there shouldn't really be any testing against the AbstractRepositoryBackend, because it is just that: an abstract class; it should merely define the public interface of the class. Yeh, you can sneak in some "shared" code to reduce duplication, but from a "purity" kind of view, using Protocols is the way forward: https://www.python.org/dev/peps/pep-0544/
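For reference, a minimal sketch of what a Protocol-based (PEP 544) interface could look like (purely illustrative and not code from this PR; the method signatures are simplified assumptions):

```python
from typing import BinaryIO, Iterable, Iterator, Protocol, Tuple

class RepositoryBackend(Protocol):
    """Structural interface: any class providing these methods conforms,
    with no inheritance from an abstract base required."""

    def open(self, key: str) -> BinaryIO:
        ...

    def iter_object_streams(self, keys: Iterable[str]) -> Iterator[Tuple[str, BinaryIO]]:
        ...
```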