Implement the H5StoreManager class. #129

Merged (15 commits) on Feb 25, 2019

Contributor

@csadorf csadorf commented Feb 19, 2019

Description

Introduces the DictManager and the H5StoreManager classes that are exposed as job.stores and project.stores.

Motivation and Context

It would be highly advantageous to enable users to easily manage multiple HDF5-files instead of just one. There are plenty of reasons why that is a good idea:

  1. While the HDF5 format is generally mutable, it is fundamentally designed to be used as an immutable data container. In fact, even the latest version is internally restricted to append-only operations, so a file continues to grow with repeated additions and edits. The only way to reduce the file size is to create a new file (e.g., with h5repack).
  2. A single large file is difficult to synchronize. Even if we change only a few MBs or even KBs of a multi-TB file, there is a good chance that synchronization tools like rsync or GLOBUS will have to transfer the complete file. Furthermore, all data stored within one large file share one modification time stamp, making fine-grained synchronization even more difficult. Finally, synchronizing only parts of a file is impossible.
  3. A single large file, especially when it is repeatedly mutated, poses a data corruption risk. The HDF5 library uses file locks to reduce the risk of data corruption, but there is no journaling as of yet, and corruption of part of the file, e.g., through a crashed application (think of a job killed at its walltime limit), might corrupt the whole file, leading to complete data loss.
  4. The HDF5 format does not play well with concurrent/parallel access. While the single-writer-multiple-reader (SWMR) mode is implemented for some libraries, it comes with severe limitations and requires a special access style that cannot be easily made transparent to the user. Existence checks etc. would be much easier to execute concurrently if they were kept at the file system level without needing to open the file.
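
Point 4 can be illustrated with a minimal sketch. It assumes, hypothetically, that each store lives in its own file named <key>.h5 inside one directory; the helper names store_path and store_exists and the ".h5" extension are illustrative assumptions, not part of signac. Under that assumption, an existence check is a plain file system query that never opens an HDF5 file:

```python
import os
import tempfile

SUFFIX = ".h5"  # assumed file extension, chosen for illustration only


def store_path(prefix, key, suffix=SUFFIX):
    """Return the path of the file that would hold the store named `key`."""
    return os.path.join(prefix, key + suffix)


def store_exists(prefix, key, suffix=SUFFIX):
    """Check for a store at the file system level, without opening the file."""
    return os.path.isfile(store_path(prefix, key, suffix))


def demo():
    with tempfile.TemporaryDirectory() as prefix:
        # An empty placeholder file stands in for a real HDF5 file here.
        open(store_path(prefix, "trajectory"), "wb").close()
        return store_exists(prefix, "trajectory"), store_exists(prefix, "energies")
```

Because the check touches only directory metadata, several processes can run it at the same time without coordinating access to the HDF5 library.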

Types of Changes

  • Documentation update
  • Bug fix
  • New feature
  • Breaking change¹

This pull request is based on

  • master, because it is a bug fix or an update to the documentation.
  • develop, because it introduces a new feature.

¹ The change breaks (or has the potential to break) existing functionality.

Checklist:

If necessary:

  • I have updated the API documentation as part of the package doc-strings.
  • I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
  • I have updated the changelog.

@csadorf csadorf marked this pull request as ready for review February 19, 2019 21:07
@csadorf csadorf requested a review from a team as a code owner February 19, 2019 21:07
@csadorf csadorf added this to the v1.0.0 milestone Feb 20, 2019
@vyasr vyasr assigned vyasr and csadorf and unassigned vyasr Feb 20, 2019
Contributor Author

csadorf commented Feb 20, 2019

@glotzerlab/signac-maintainers This is ready for review.

@csadorf csadorf requested review from vyasr and mikemhenry February 20, 2019 23:50
Manages multiple instances of H5Store within a specific directory.

Exposed as job.stores and project.stores.
Register returned dicts in internal registry to ensure persistence of
returned dicts over the lifetime of the manager instance.
In addition to testing 'job.data', we also test 'job.stores.test'.
This patch fixes an issue that required users to explicitly hold a
reference to each store returned from `job.stores`; otherwise the
returned container would be immediately deleted prior to use.
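
The registry behaviour described in these commit messages can be sketched as follows. This is not the code from the pull request: Store and Manager are hypothetical stand-ins, the path layout is an assumption, and only the attribute name _dict_registry is taken from the diff excerpts in this conversation.

```python
class Store(dict):
    """Stand-in for a file-backed container such as H5Store (hypothetical)."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename


class Manager:
    """Hands out one Store per key and keeps it alive in a registry."""

    def __init__(self, prefix, suffix=".h5"):
        self.prefix = prefix
        self.suffix = suffix
        self._dict_registry = {}  # key -> Store, held for the manager's lifetime

    def __getitem__(self, key):
        # Reuse the registered instance so repeated lookups share one object.
        if key not in self._dict_registry:
            filename = "{}/{}{}".format(self.prefix, key, self.suffix)
            self._dict_registry[key] = Store(filename)
        return self._dict_registry[key]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails; private names are
        # rejected so that a missing `_dict_registry` cannot recurse forever.
        if name.startswith("_"):
            raise AttributeError(name)
        return self[name]


stores = Manager("workspace/job")
stores.test["x"] = 1  # no explicit reference to the returned store is kept
```

Since the manager holds each container in its registry, a later stores.test lookup returns the same object, so the value written above is still visible.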
@csadorf csadorf force-pushed the feature/h5-store-manager branch from d9f8cb5 to ed4ddde Compare February 21, 2019 18:02
Collaborator

@mikemhenry mikemhenry left a comment

LGTM, I don't work with the H5Store enough to approve the PR but looking over it I don't see anything obviously wrong. We should make sure to update glotzerlab/signac-docs#6 in light of the changes made here.

Contributor Author

csadorf commented Feb 21, 2019

LGTM, I don't work with the H5Store enough to approve the PR but looking over it I don't see anything obviously wrong. We should make sure to update glotzerlab/signac-docs#6 in light of the changes made here.

Thank you, that's a good point and I will go through the docs before merging.

Contributor

vyasr commented Feb 23, 2019

I will review this tomorrow, trying to finish up another task before I start going back through signac PRs.

Contributor

@vyasr vyasr left a comment

This is a pretty minimal set of changes to enable the new use-case, I like it. Just a couple small things and we're good to go.

changelog.txt (outdated, resolved)
signac/contrib/project.py (outdated, resolved)
signac/core/dict_manager.py (outdated, resolved)
        pass
    raise error
else:
    del self._dict_registry[key]
Contributor

I don't follow why this is necessary. The file corresponding to key was successfully created if we reach this point, why does it need to be removed from the tracked list?

Contributor Author

It's not really necessary, I figured it might be a good idea to replace any references of "old" H5Store under the same key, but there is virtually no difference. I'll remove that.

Contributor Author

Turns out this is actually necessary. I tried to delete this line and the test_assign and test_assign_data unit tests started to fail.

for fn in os.listdir(self.prefix):
    m = re.match('(.*){}'.format(self.suffix), fn)
    if m:
        yield m.groups()[0]
Contributor

Using re seems like overkill here; I'd switch this to glob.glob: yield from glob.glob("*.{}".format(self.suffix)) (but without yield from, for py2 compatibility).

Contributor Author

I can't directly extract the key name with glob.

Contributor

I suppose you'd still have to do something like ...replace(self.suffix, '') to get just the key without the extension; that's all you're referring to, right? I personally think that's still clearer than using a regex, since you're really just parsing out an extension from a set of filenames, but I'm OK with leaving it as is if you prefer, since it doesn't simplify the code that much.

Contributor Author

I just tried it. I would also need to strip off the directory name, so I'm sticking to the original implementation.

However, should we explicitly skip hidden h5-files?

Contributor

@vyasr vyasr Feb 25, 2019

I don't think that's necessary. I view creating and editing hidden HDF5 files as a feature, not a bug. If users explicitly access a hidden file, I think it's reasonable to expect that they know what they're doing.
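
The two approaches discussed in this thread can be compared with a small sketch. The function names keys_regex and keys_glob and the ".h5" suffix are assumptions made for illustration. The regex variant escapes and anchors the suffix, so it deliberately differs from the pattern shown in the diff. It also shows one behavioural difference relevant to the hidden-file question: os.listdir returns names that begin with a dot, whereas the "*" wildcard in glob does not match them.

```python
import glob
import os
import re
import tempfile


def keys_regex(prefix, suffix):
    """Keys via os.listdir and a regex, with the suffix escaped and anchored."""
    pattern = re.compile("(.*)" + re.escape(suffix) + "$")
    for fn in os.listdir(prefix):
        m = pattern.match(fn)
        if m:
            yield m.group(1)


def keys_glob(prefix, suffix):
    """Keys via glob; the directory and the suffix must be stripped by hand."""
    # Assumes a non-empty suffix, since fn[:-0] would yield an empty string.
    for path in glob.glob(os.path.join(glob.escape(prefix), "*" + suffix)):
        fn = os.path.basename(path)
        yield fn[: -len(suffix)]


def demo():
    with tempfile.TemporaryDirectory() as prefix:
        for fn in ("a.h5", "b.h5", ".hidden.h5", "notes.txt"):
            open(os.path.join(prefix, fn), "w").close()
        return sorted(keys_regex(prefix, ".h5")), sorted(keys_glob(prefix, ".h5"))
```

In this sketch the listdir-based variant reports the hidden file's key and the glob-based variant does not, so the choice between them also decides whether hidden files are listed.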

Contributor Author

@csadorf csadorf left a comment

@vyasr Thank you for your review. I'll address any required code changes tomorrow.

@vyasr vyasr self-requested a review February 24, 2019 16:02
@bdice bdice requested review from mikemhenry and removed request for mikemhenry February 24, 2019 18:03
self.assertIn(key, job.stores.test)
self.assertEqual(job.stores.test[key], d)
self.assertEqual(job.stores.test.get(key), d)
self.assertEqual(job.stores.test.get('bs', d), d)
Member

@bdice bdice Feb 24, 2019

I would suggest another word like 'nonexistent_key' so it's more clear. Do a find/replace because this re-occurs.

Contributor Author

Ummm, yes. Oops. :D

Contributor Author

I'm going to fix that in a separate commit, because there are a bunch of other occurrences of the same pattern in other test modules.

tests/test_job.py (outdated, resolved)
@bdice bdice self-requested a review February 24, 2019 18:15
Contributor Author

csadorf commented Feb 24, 2019

@vyasr I've addressed all of your comments.

Contributor

vyasr commented Feb 25, 2019

Approving pending the change that Bradley requested.

@csadorf csadorf merged commit 66cdb21 into develop Feb 25, 2019
@csadorf csadorf deleted the feature/h5-store-manager branch February 25, 2019 01:17
tcmoore3 added a commit that referenced this pull request Jul 26, 2019
This is related to signac-flow issue #129 and PR #130, where we decided
it makes more sense to break up the data space initialization into
signac core and the flow project initialization into flow.
4 participants