[WIP] Fix append to dynamic table #920
Conversation
Codecov Report
@@ Coverage Diff @@
## dev #920 +/- ##
==========================================
- Coverage 71.16% 71.11% -0.06%
==========================================
Files 37 37
Lines 2792 2797 +5
Branches 554 556 +2
==========================================
+ Hits 1987 1989 +2
Misses 679 679
- Partials 126 129 +3
Continue to review full report at Codecov.
Is there any way to do this without reading the entire dataset into memory?
        except Exception as e:
            self.reader.close()
            self.reader = None
            raise e
Will this cause test_roundtrip to get skipped if actOnContainer is not implemented?
@@ -194,6 +202,10 @@ def getContainer(self, nwbfile):
        ''' Should take an NWBFile object and return the Container'''
        raise unittest.SkipTest('Cannot run test unless getContainer is implemented')

    def actOnContainer(self, nwbfile):
Is the point of this to provide the ability to do something to a container after roundtripping?
@@ -318,6 +319,8 @@ def __getitem__(self, args):
        return self.data[args]

    def append(self, arg):
        if isinstance(self.data, HDMFDataset) or isinstance(self.data, Dataset):
            self.__data = self.data[()]
You could avoid reading the dataset into memory by reshaping the dataset, and then adding the new data.
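A rough sketch of that idea, assuming the underlying HDF5 dataset was created as chunked with an unlimited maxshape along the first axis (otherwise h5py will refuse to resize it); the file name and dataset path are only illustrative:

import h5py
import numpy as np

# Minimal sketch (not the PR's implementation): append to an HDF5 dataset
# in place by resizing it instead of reading everything into memory.
# Assumes the dataset was created with chunks and maxshape=(None,); the
# names 'example.h5' and 'table/col' are made up for illustration.
new_rows = np.array([1.0, 2.0, 3.0])
with h5py.File('example.h5', 'r+') as f:
    dset = f['table/col']
    n = dset.shape[0]
    dset.resize(n + len(new_rows), axis=0)  # grow the dataset on disk
    dset[n:] = new_rows                     # write only the appended values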
@rly Just a few questions
Hi @rly, thanks for the fix!
@luiztauffer could you elaborate a bit more on the specific use case you are working with that requires removal of rows?
@oruebel The current version of pynwb does not allow directly setting already defined fields. E.g., I tried to directly assign a new field: if I just comment out this line of code, I can update any fields I want. Is there any particular reason why this direct update of fields is forbidden in pynwb?
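For illustration only (the exact field and error message depend on the PyNWB version), the kind of direct assignment being discussed looks roughly like this:

from pynwb import NWBHDF5IO

# Hypothetical illustration of the forbidden direct assignment discussed
# above; PyNWB containers normally refuse to overwrite a field once it
# has been set, so the assignment below is expected to raise an error.
with NWBHDF5IO('existing.nwb', 'r') as io:
    nwbfile = io.read()
    nwbfile.session_description = 'corrected description'  # rejected by the container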
I agree that modification of data already written to file is an important and useful feature, especially for the invalid trial times. This is also relevant when intermediate data is stored. However, this is a complicated issue that we need to discuss as a team and with users. Currently, dataset modification is not really supported except for perhaps a few edge cases. So even though you can update fields by commenting out that line of code, I think that with the current code your change would not be written to disk when you go to write the NWBFile. One way to implement this would be: any time a user wants to alter an existing dataset (add, remove, or modify), PyNWB would read the entire original dataset and alter it. Writing the changes to disk would then involve writing the entire modified dataset. You could currently do this yourself by reading the entire dataset into memory, making changes, reading all other data, and then writing a brand new NWBFile, but it would be nice to have this functionality built in. We would just have to be very explicit about it because of the potentially high computing cost.
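A rough sketch of that manual workaround, assuming 'original.nwb' contains a TimeSeries named 'ts' in acquisition (all names here are illustrative, not part of the PR):

import numpy as np
from pynwb import NWBHDF5IO

# Rough sketch of the manual read-modify-rewrite workaround described
# above. The dataset is pulled fully into memory and altered there; the
# modified array would then have to be attached to a brand new NWBFile
# and written out with a second NWBHDF5IO opened in 'w' mode.
with NWBHDF5IO('original.nwb', 'r') as io:
    nwbfile = io.read()
    data = np.array(nwbfile.acquisition['ts'].data)  # load everything into memory
    data = np.append(data, [42.0])                    # make the change in memory
    # ...build a new NWBFile, add a TimeSeries that uses `data`, and write it...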
@rly
@luiztauffer For data arrays that are read lazily, you can already do updates right now, but that is limited to large data arrays (and the updates are immediate). Enabling update of files directly (without making full copies) should be doable but will require tracking which fields have been updated. Currently this is done on a per-container basis but not yet on a per-field basis.
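A sketch of what such an immediate update looks like today, assuming the file is opened in read/write mode and contains a TimeSeries named 'ts' (names are illustrative):

from pynwb import NWBHDF5IO

# Sketch of updating a lazily read data array in place. Opening the file
# in 'a' (read/write) mode means the h5py-backed dataset can be written
# to, and slice assignment goes straight to disk without loading the
# whole array into memory.
with NWBHDF5IO('existing.nwb', 'a') as io:
    nwbfile = io.read()
    ts_data = nwbfile.acquisition['ts'].data  # h5py.Dataset, read lazily
    ts_data[0:10] = 0.0                       # immediate, in-place update on disk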
If you mean allowing users to assign a version number to a file, that is certainly doable. However, more general version control and journaling of data are not on the current development plan. Doing versioning of files as a whole (i.e., storing full versions of a file) is problematic due to the large size of the data (and is something that one could easily do themselves). Doing journaling on a per-field basis, where each attribute, dataset, etc. is journaled and versioned independently, is very involved and would require us to roll out our own solutions, because none of the existing file standards support this. In general, this sort of functionality is more in the regime of the storage backend, rather than NWB:N or a specific API for NWB:N. E.g., one could imagine creating a database-based FORMIO backend for NWB:N.
Yes, built-in versioning would be pretty cool, but as @oruebel points out, it would be very involved. Actually, Gigantum does something like this and might be of interest for users wanting version control of big data files. But let's get back to the issue of altering an existing written dataset. I think our best option is altering the data by accessing it directly on disk via h5py.
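For the direct-h5py route, a minimal sketch (the dataset path depends on the layout of the particular NWB file and is only an example here):

import h5py

# Sketch of editing values directly on disk with h5py, bypassing PyNWB.
# 'intervals/trials/start_time' is just an example path; inspect the file
# to find where the dataset of interest actually lives.
with h5py.File('existing.nwb', 'r+') as f:
    start_time = f['intervals/trials/start_time']
    start_time[3] = 12.5  # overwrite a single value in place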
This PR has been superseded by hdmf-dev/hdmf#161. Discussions are still relevant for #1067, however.
Fixes #918, which was actually a deeper issue: it was not possible to append to any DynamicTable that was read from a file.
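For context, a minimal reproduction of the failure looks roughly like this (the file name and the choice of the trials table are only illustrative):

from pynwb import NWBHDF5IO

# Hypothetical reproduction of #918: appending a row to a DynamicTable
# (here the trials table) after it has been read back from disk. Before
# this fix, the add_row call below failed for tables read from a file.
with NWBHDF5IO('session.nwb', 'a') as io:
    nwbfile = io.read()
    nwbfile.trials.add_row(start_time=10.0, stop_time=12.0)
    io.write(nwbfile)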
Let me know if there is a more elegant way to handle:
I also added tests to act on containers after checking that what was written equals what was read.