DB storage manager #175

Merged
merged 13 commits into from
Sep 4, 2022

jpwchang
Collaborator

Description

Adds support for MongoDB-backed representation of ConvoKit objects. Support is mainly provided through a new StorageManager subclass, though the Corpus constructor does have some DB-specific changes to streamline the JSON-to-DB construction path for the sake of efficiency.

Motivation and Context

Previously, ConvoKit objects were stored entirely in RAM, which requires enough memory to hold both the raw dataset contents and the additional overhead added by ConvoKit. This has proven infeasible for large corpora, a use case that is extremely common in conversational research. Using a DB-backed representation lets us take advantage of the DB's lazy evaluation, thus reducing memory overhead. A second use case for DB-backed representation is applications that perform real-time updates of a Corpus over a long period and require resiliency; for instance, using ConvoKit in a web application backend. Because the DB is ultimately backed by on-disk file storage, in the event of a program crash or system outage, in-progress work that was not yet dumped can still be recovered by reconnecting to the DB (a path explicitly supported through a new reconnect_to_db factory method).
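
For readers unfamiliar with the pattern, here is a minimal sketch of the idea of swapping an in-memory backend for a disk-backed one and reconnecting after a restart. It is not ConvoKit's actual StorageManager API: every class and function name below (`ToyStorage`, `ToyMemoryStorage`, `ToyDiskStorage`, `demo_reconnect`) is hypothetical, and `sqlite3` stands in for MongoDB only so that the example runs without a database server.

```python
import json
import os
import sqlite3
import tempfile
from abc import ABC, abstractmethod
from typing import Any, Dict


class ToyStorage(ABC):
    """Hypothetical interface; NOT ConvoKit's real StorageManager."""

    @abstractmethod
    def put(self, key: str, value: Dict[str, Any]) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Dict[str, Any]: ...


class ToyMemoryStorage(ToyStorage):
    """Keeps every record in a Python dict, so all data lives in RAM."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, value: Dict[str, Any]) -> None:
        self._data[key] = value

    def get(self, key: str) -> Dict[str, Any]:
        return self._data[key]


class ToyDiskStorage(ToyStorage):
    """Keeps records on disk and loads each one only when requested.

    Assumption: sqlite3 is used here purely as a stand-in for MongoDB.
    """

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value TEXT)"
        )
        self._conn.commit()

    def put(self, key: str, value: Dict[str, Any]) -> None:
        # Each write is committed immediately, so it survives a later crash.
        self._conn.execute(
            "INSERT OR REPLACE INTO records (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self._conn.commit()

    def get(self, key: str) -> Dict[str, Any]:
        row = self._conn.execute(
            "SELECT value FROM records WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def close(self) -> None:
        self._conn.close()


def demo_reconnect() -> Dict[str, Any]:
    """Write a record, drop the connection, then reopen and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_corpus.db")
        first = ToyDiskStorage(path)
        first.put("utt_1", {"text": "hello", "speaker": "alice"})
        first.close()  # stands in for a program crash or restart

        second = ToyDiskStorage(path)  # "reconnect" to the same on-disk data
        try:
            return second.get("utt_1")
        finally:
            second.close()
```

Because both backends expose the same `put`/`get` interface, calling code does not need to know which one it is talking to; only the disk-backed one can be reopened after the process exits.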

How has this been tested?

So far all testing has been manual, but I've confirmed a few things: first, that all basic functionality works; second, that (at least for a small corpus) the DB version matches the in-memory version (iterative equality tests over all component properties pass); and finally, that there is a memory saving (the RAM footprint of the Python process for the wikiconv-2004 corpus goes from 5GB to 1GB). However, we should add new unit tests for DB mode to formalize this.

@jpwchang jpwchang requested a review from calebchiam August 27, 2022 23:37
@@ -44,6 +46,8 @@ def __init__(
        self,
        filename: Optional[str] = None,
        utterances: Optional[List[Utterance]] = None,
        db_collection_prefix: Optional[str] = None,
Collaborator

What are your thoughts on putting db arguments in a db_config argument?

This allows for easier extensibility and maintainability in the long run since we don’t have to change the Corpus constructor if we want to include more configurability for db mode. It also makes the constructor less intimidating for users.

We should have done this for the exclude_ arguments as well IMO, but that’s hindsight for ya
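
To make the suggestion concrete, here is a rough sketch of what grouping the DB options into one object could look like. Everything in it is hypothetical: `ToyDBConfig`, `ToyCorpus`, and the `host` field are illustrative names only, and `collection_prefix` simply mirrors the `db_collection_prefix` argument from the diff above. It is not a proposal for the exact signature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDBConfig:
    """Hypothetical bundle of DB-mode options (not part of ConvoKit)."""

    # Mirrors the db_collection_prefix constructor argument in this PR.
    collection_prefix: Optional[str] = None
    # Assumed extra option, included only to show how the config could grow.
    host: Optional[str] = None


class ToyCorpus:
    """Hypothetical stand-in for Corpus showing a single db_config argument."""

    def __init__(
        self,
        filename: Optional[str] = None,
        db_config: Optional[ToyDBConfig] = None,
    ) -> None:
        self.filename = filename
        # Fall back to an all-defaults config so callers can omit it entirely.
        self.db_config = db_config if db_config is not None else ToyDBConfig()


# Adding a new DB option later only touches ToyDBConfig, not the constructor.
corpus = ToyCorpus(
    filename="corpus.json",
    db_config=ToyDBConfig(collection_prefix="wiki"),
)
```

With this shape, new DB settings become new fields with defaults, so existing calls to the constructor keep working unchanged.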

@calebchiam calebchiam merged commit 9634005 into master Sep 4, 2022