Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage abstraction #169

Merged
merged 25 commits into from
Aug 27, 2022
Merged

Storage abstraction #169

merged 25 commits into from
Aug 27, 2022

Conversation

jpwchang
Copy link
Collaborator

Description

Introduces a new layer of abstraction between Corpus components (Utterance, Speaker, Conversation, ConvoKitMeta) and concrete data storage. Data storage is now handled by a StorageManager instance variable in the Corpus. StorageManager defines an abstract class interface that can have concrete implementations with varying storage scheme; this pull request implements MemStorageManager which stores all data in in-memory dicts, thus replicating the existing behavior.

Motivation and Context

In previous versions of ConvoKit, Corpus components have stored their data and metadata directly as instance variables. This scheme inherently means that the data must live directly in memory. Plans are currently underway to introduce alternative data backings such as database-backed storage. Introducing an abstraction layer for data storage will allow these alternative data models to be implemented without having to reimplement every Corpus component class from scratch.

How has this been tested?

Besides ensuring that all existing unit tests pass, tests have also been expanded to add new checks for StorageManager integrity after complex operations such as merges.

@jpwchang
Copy link
Collaborator Author

jpwchang commented Jul 18, 2022

Per our discussion on Slack, I have updated this branch with changes to make certain destructive Corpus operations (merging, reindexing, and filtering by utterance) static methods. The actual implementation and behavior of these methods remains unchanged. Additionally, add_utterances has been updated to always operate in-place (before, it was in-place when with_checks=False and not-in-place otherwise). Making this change required decoupling add_utterances from merge since the latter is not-in-place. Tests for add_utterances have also been updated to more thoroughly verify the integrity of the modified Corpus, and to verify that the operation is in fact in-place (i.e., returns the same Corpus it was called from)

@jpwchang jpwchang requested a review from calebchiam July 18, 2022 16:15
Copy link
Collaborator

@calebchiam calebchiam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall functionality looks great! Some minor changes requested, mainly to do with naming, documentation, and user-facing elements.

@jpwchang jpwchang requested a review from calebchiam July 20, 2022 19:09

def reinitialize_from(self, other: Union["ConvoKitMeta", dict]):
"""
Reinitialize this ConvoKitMeta instance with the data from other
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: from another ConvoKitMeta instance or a dictionary with metadata key-value pairs

Copy link
Collaborator

@calebchiam calebchiam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go!

@calebchiam
Copy link
Collaborator

@jpwchang Can we merge this?

@jpwchang jpwchang merged commit f610e67 into master Aug 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants