
[WIP] Fix append to dynamic table #920

Closed
rly wants to merge 5 commits into dev from fix_dyntable_append
Conversation

@rly (Contributor) commented May 8, 2019

Fixes #918, which was actually a deeper issue: it was not possible to append to any DynamicTable that was read from a file.

Let me know if there is a more elegant way to handle:

def append(self, arg):
    if isinstance(self.data, HDMFDataset) or isinstance(self.data, Dataset):
        self.__data = self.data[()]

I also added tests to act on containers after checking that what was written equals what was read.
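
For context, roughly the kind of round trip this fixes (a hedged sketch; the file name and trial times are made up, and the file is assumed to already contain a trials table):

from pynwb import NWBHDF5IO

# open writable, read the file, append a row, and write the change back
io = NWBHDF5IO('example.nwb', mode='a')
nwbfile = io.read()

# add_trial() appends a row to the trials DynamicTable, which exercises
# VectorData.append() on column data that was read from disk
nwbfile.add_trial(start_time=10.0, stop_time=12.0)

io.write(nwbfile)
io.close()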

@rly requested a review from @ajtritt on May 8, 2019
@codecov (bot) commented May 8, 2019

Codecov Report

Merging #920 into dev will decrease coverage by 0.05%.
The diff coverage is 44.44%.


@@            Coverage Diff             @@
##              dev     #920      +/-   ##
==========================================
- Coverage   71.16%   71.11%   -0.06%     
==========================================
  Files          37       37              
  Lines        2792     2797       +5     
  Branches      554      556       +2     
==========================================
+ Hits         1987     1989       +2     
  Misses        679      679              
- Partials      126      129       +3
Impacted Files       Coverage Δ
src/pynwb/file.py    71.03% <100%> (ø) ⬆️
src/pynwb/core.py    73.3% <37.5%> (-0.2%) ⬇️

Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 962b91e...d0a6088.

@bendichter (Contributor)

Is there any way to do this without reading the entire dataset into memory?

except Exception as e:
    self.reader.close()
    self.reader = None
    raise e
@ajtritt (Member) commented May 8, 2019

will this cause test_roundtrip to get skipped if actOnContainer is not implemented?

@@ -194,6 +202,10 @@ def getContainer(self, nwbfile):
        ''' Should take an NWBFile object and return the Container'''
        raise unittest.SkipTest('Cannot run test unless getContainer is implemented')

    def actOnContainer(self, nwbfile):
@ajtritt (Member)

Is the point of this to provide the ability to do something to a container after roundtripping?
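
If that is the intent, a hypothetical subclass might use the hook roughly like this (the base-class name TestMapRoundTrip and the trial values are assumptions for illustration; only getContainer and actOnContainer come from this diff):

class TestTrialsRoundTrip(TestMapRoundTrip):  # hypothetical test case

    def getContainer(self, nwbfile):
        return nwbfile.trials

    def actOnContainer(self, nwbfile):
        # after the written-equals-read check, exercise the read container,
        # e.g. append a row to a table whose columns came from disk
        nwbfile.add_trial(start_time=0.0, stop_time=1.0)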

@@ -318,6 +319,8 @@ def __getitem__(self, args):
        return self.data[args]

    def append(self, arg):
        if isinstance(self.data, HDMFDataset) or isinstance(self.data, Dataset):
            self.__data = self.data[()]
@ajtritt (Member)

You could avoid reading the dataset into memory by reshaping the dataset, and then adding the new data.
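
At the h5py level that suggestion could look roughly like this (a sketch assuming the dataset was created as resizable, i.e. with a maxshape that allows growth; the file and dataset paths are illustrative):

import h5py
import numpy as np

new_values = np.array([20.0, 25.0])

with h5py.File('example.nwb', 'a') as f:
    dset = f['intervals/trials/start_time']      # illustrative dataset path
    n = dset.shape[0]
    dset.resize((n + len(new_values),))          # grow along the first axis
    dset[n:] = new_values                        # write only the new rows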

@ajtritt (Member) left a comment

@rly Just a few questions

@luiztauffer

Hi @rly, thanks for the fix!
I'm working on a project where it would be desirable not only to append elements but also to remove them. What's the best way to give the user more freedom to manipulate the data?

@oruebel (Contributor) commented May 9, 2019

@luiztauffer could you elaborate a bit more on the specific use-case you are working with that requires removal of rows?

@luiztauffer

> @luiztauffer could you elaborate a bit more on the specific use-case you are working with that requires removal of rows?

@oruebel
I'm working on a GUI for manually marking invalid intervals (or any other classification of intervals of interest) for multiple time series signals, all within an NWB file. Often the identification of events in these time series is not a clear-cut decision and, even when it is, the user might mark and save a wrongly assigned time interval and wish to remove it later.

The current version of pynwb does not allow directly setting already defined fields. For example, I tried to directly assign a new field:
nwb2.invalid_times = TimeIntervals('invalid_times', 'time intervals to be removed')
which raises the error:
AttributeError: can't set attribute 'invalid_times' - already set
from core.py, line 134.

I just commented out this line of code and can now update any field I want. Is there any particular reason why directly updating fields is forbidden in pynwb?

@rly (Contributor, Author) commented May 9, 2019

I agree that modification of data already written to file is an important and useful feature, especially for the invalid trial times. This is also relevant when intermediate data is stored. However, this is a complicated issue that we need to discuss as a team and with users. Currently dataset modification is not really supported except for probably a few edge cases. So even though you can update fields by commenting out that line of code, I think with the current code, when you go to write the NWBFile, your change would not be written to disk.

One way to implement this: any time a user wants to alter an existing dataset (add, remove, or modify), PyNWB would read the entire original dataset and alter it in memory. Writing the changes to disk would then involve writing the entire modified dataset.

You could currently do this yourself by reading the entire dataset into memory, making changes, reading all other data, and then writing a brand new NWBFile, but it would be nice to have this functionality built-in. We would just have to be very explicit about it because of the potentially high computing cost.
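
A very rough sketch of that manual workaround, using the invalid_times case from above (the file names, the row to drop, and the column access spelling are illustrative, and copying over everything else in the file is omitted):

from pynwb import NWBFile, NWBHDF5IO

# read the original file and load what will be edited into memory
with NWBHDF5IO('original.nwb', 'r') as read_io:
    old = read_io.read()
    start = old.invalid_times['start_time'][:]
    stop = old.invalid_times['stop_time'][:]
    session_description = old.session_description
    identifier = old.identifier
    session_start_time = old.session_start_time

# drop the unwanted interval (here the row at index 2, purely illustrative)
keep = [i for i in range(len(start)) if i != 2]

# build a brand-new NWBFile, re-add the edited intervals, and write it out
new = NWBFile(session_description, identifier, session_start_time)
for i in keep:
    new.add_invalid_time_interval(start_time=start[i], stop_time=stop[i])

with NWBHDF5IO('edited.nwb', 'w') as write_io:
    write_io.write(new)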

@luiztauffer

@rly
For small files that might do, but for data over GBs it's certainly not a good option.
I understand where the concerns might come from, and possibly you have discussed all of this already. But how about a safety check, e.g. requiring an argument such as overwrite=True?
Is any data versioning functionality in the future plans?

@oruebel (Contributor) commented May 10, 2019

@luiztauffer For data arrays that are being read lazily, you can already do updates right now, but that is limited to large data arrays (and updates are immediate). Enabling updates of files directly (without making full copies) should be doable, but will require tracking which fields have been updated. Currently this is done on a per-container basis, but not yet on a per-field basis.
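
For example, something like the following already changes values on disk (a sketch; the file and TimeSeries names are illustrative, and the new values must match the shape of the slice being replaced):

from pynwb import NWBHDF5IO

# open writable so in-place edits of the underlying HDF5 datasets are allowed
io = NWBHDF5IO('example.nwb', mode='a')
nwbfile = io.read()

# when read lazily, ts.data is an h5py.Dataset; slice assignment writes
# straight to the file, with no io.write() call needed
ts = nwbfile.get_acquisition('test_timeseries')
ts.data[0:3] = [1.0, 2.0, 3.0]

io.close()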

> Any data versioning functionality on the future plans?

If you mean allowing users to assign a version number to a file, that is certainly doable. However, more general version control and journaling of data are not on the current development plan. Versioning files as a whole (i.e., storing full copies of a file) is problematic due to the large size of the data (and is something one could easily do themselves). Journaling on a per-field basis, where each attribute, dataset, etc. is journaled and versioned independently, is very involved and would require us to roll our own solution, because none of the existing file standards support this. In general, this sort of functionality is more the domain of the storage backend than of NWB:N or a specific API for NWB:N. E.g., one could imagine creating a database-based FORMIO backend for NWB:N.

@bendichter (Contributor)

Yes, built-in versioning would be pretty cool, but as @oruebel points out it would be very involved. Actually, Gigantum does something like this and might be of interest for users wanting version control of big data files.

But let's get back to the issue of altering an existing written dataset. I think our best option is altering the data by accessing it directly on disk via h5py Datasets, e.g. nwb.acquisition.t_series.data[:] = [1., 2., 3.]. This deviates from the normal pynwb workflow, because it changes the values immediately and does not require a write command. It also currently only works if the new data has the same size as the old data, so it would not work for append or remove. In order to get append to work, we'll need a maxshape parameter that accommodates the growing array.
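
For the maxshape part, a sketch of how the dataset could be created as growable at write time using hdmf's H5DataIO wrapper (the names and values here are illustrative):

from datetime import datetime

import numpy as np
from hdmf.backends.hdf5 import H5DataIO
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

# wrap the data so the HDF5 dataset is created with an unlimited first
# dimension, which is what later allows the array on disk to grow
data = H5DataIO(data=np.arange(10.0), maxshape=(None,))
ts = TimeSeries(name='t_series', data=data, unit='m',
                starting_time=0.0, rate=1.0)

nwbfile = NWBFile('demo session', 'demo-id', datetime.now().astimezone())
nwbfile.add_acquisition(ts)

with NWBHDF5IO('example.nwb', 'w') as io:
    io.write(nwbfile)

# afterwards the dataset can be grown in place (e.g. with h5py's
# Dataset.resize) and new values appended, instead of rewriting it all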

@rly changed the title from "Fix append to dynamic table" to "[WIP] Fix append to dynamic table" on Jul 30, 2019
@rly (Contributor, Author) commented Oct 7, 2019

This PR has been superseded by hdmf-dev/hdmf#161. Discussions are still relevant for #1067, however.

@rly closed this on Oct 7, 2019
@rly deleted the fix_dyntable_append branch on October 7, 2019
Successfully merging this pull request may close these issues:
cannot change shape of dataset once written