DB storage manager #175

jpwchang · 2022-08-27T23:37:27Z

Description

Adds support for a MongoDB-backed representation of ConvoKit objects. Support is mainly provided through a new StorageManager subclass, though the Corpus constructor also has some DB-specific changes that streamline the JSON-to-DB construction path for efficiency.
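For readers unfamiliar with the storage abstraction, here is a rough sketch of the idea: one interface, with an in-memory implementation and a DB-backed one behind it. Everything below is an illustrative assumption rather than the actual ConvoKit code: the method names (`get_data`, `update_data`), the `MemStorageManager` name, the collection naming scheme, and the `FakeCollection` stand-in (which exists only so the sketch runs without a MongoDB server). The `find_one` / `update_one` calls mirror the pymongo Collection methods of the same name.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Any, Dict


class StorageManager(ABC):
    """Hypothetical minimal interface (not ConvoKit's actual class)."""

    @abstractmethod
    def get_data(self, component_type: str, component_id: str) -> Dict[str, Any]:
        ...

    @abstractmethod
    def update_data(self, component_type: str, component_id: str, field: str, value: Any) -> None:
        ...


class MemStorageManager(StorageManager):
    """Keeps every component's data in nested Python dicts, i.e. all in RAM."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Dict[str, Any]]] = defaultdict(dict)

    def get_data(self, component_type, component_id):
        return self._data[component_type][component_id]

    def update_data(self, component_type, component_id, field, value):
        self._data[component_type].setdefault(component_id, {})[field] = value


class DBStorageManager(StorageManager):
    """Same interface, but every call is forwarded to a collection.

    `db` is anything that maps a collection name to an object offering
    pymongo-style find_one / update_one (for example a pymongo Database).
    """

    def __init__(self, db, collection_prefix: str) -> None:
        self._db = db
        self._prefix = collection_prefix

    def _collection(self, component_type: str):
        # Assumed naming scheme: one collection per component type.
        return self._db[f"{self._prefix}_{component_type}"]

    def get_data(self, component_type, component_id):
        doc = self._collection(component_type).find_one({"_id": component_id})
        if doc is None:
            raise KeyError(component_id)
        return doc

    def update_data(self, component_type, component_id, field, value):
        self._collection(component_type).update_one(
            {"_id": component_id}, {"$set": {field: value}}, upsert=True
        )


class FakeCollection:
    """In-process stand-in for a collection, only so the sketch runs offline."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def find_one(self, query):
        return self._docs.get(query["_id"])

    def update_one(self, query, update, upsert=False):
        if query["_id"] not in self._docs and not upsert:
            return
        doc = self._docs.setdefault(query["_id"], {"_id": query["_id"]})
        doc.update(update["$set"])


def demo():
    """Run the same caller code against both backends."""
    fake_db = defaultdict(FakeCollection)
    results = []
    for storage in (MemStorageManager(), DBStorageManager(fake_db, "mycorpus")):
        storage.update_data("utterance", "u1", "text", "hello")
        results.append(storage.get_data("utterance", "u1")["text"])
    return results
```

The point of the design is that calling code never needs to know which backend it is talking to; only the object handed to it changes.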

Motivation and Context

Previously, ConvoKit objects were stored entirely in RAM, which requires enough RAM to hold both the raw dataset contents and the additional overhead added by ConvoKit. This has proven infeasible for large corpora, a use case that is extremely common in conversational research. Using a DB-backed representation lets us take advantage of the DB's lazy evaluation, thus reducing memory overhead. A second use case for the DB-backed representation is applications that perform real-time updates of a Corpus over a long period and require resiliency; for instance, using ConvoKit in a web application backend. Because the DB is ultimately backed by on-disk file storage, in the event of a program crash or system outage, in-progress work that was not yet dumped can still be recovered by reconnecting to the DB (a path explicitly supported through a new reconnect_to_db factory method).
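To make the recovery path concrete, here is a minimal sketch of the write-through-then-reconnect pattern. Note the assumptions: SQLite from the standard library is swapped in for MongoDB purely so the example runs without a server, and `DiskBackedStore`, its `reconnect` classmethod, and the table naming are hypothetical names that do not reflect the real `reconnect_to_db` signature.

```python
import json
import os
import sqlite3
import tempfile
from typing import Any, Dict


class DiskBackedStore:
    """Hypothetical store: every write goes straight to an on-disk SQLite file."""

    def __init__(self, path: str, prefix: str) -> None:
        self._conn = sqlite3.connect(path)
        self._table = f"{prefix}_utterances"
        self._conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{self._table}" (id TEXT PRIMARY KEY, doc TEXT)'
        )
        self._conn.commit()

    @classmethod
    def reconnect(cls, path: str, prefix: str) -> "DiskBackedStore":
        """Attach to data written by an earlier (possibly crashed) process."""
        if not os.path.exists(path):
            raise FileNotFoundError(path)
        return cls(path, prefix)

    def put(self, utt_id: str, doc: Dict[str, Any]) -> None:
        # Commit on every write, so nothing lives only in process memory.
        self._conn.execute(
            f'INSERT OR REPLACE INTO "{self._table}" VALUES (?, ?)',
            (utt_id, json.dumps(doc)),
        )
        self._conn.commit()

    def get(self, utt_id: str) -> Dict[str, Any]:
        # Lazy: a document is only read from disk when it is asked for.
        row = self._conn.execute(
            f'SELECT doc FROM "{self._table}" WHERE id = ?', (utt_id,)
        ).fetchone()
        if row is None:
            raise KeyError(utt_id)
        return json.loads(row[0])

    def close(self) -> None:
        self._conn.close()


def demo() -> Dict[str, Any]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.sqlite")
        store = DiskBackedStore(path, "mycorpus")
        store.put("u1", {"text": "hello"})
        store.close()  # stands in for the process ending without an explicit dump

        recovered = DiskBackedStore.reconnect(path, "mycorpus")
        doc = recovered.get("u1")
        recovered.close()
        return doc
```

Because each write is committed to disk immediately, the second handle sees the data even though the first one never performed a separate dump step.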

How has this been tested?

So far all testing has been manual, but I've confirmed a few things: first, that all basic functionality works; second, that (at least for a small corpus) the DB version matches the in-memory version (iterative equality tests over all component properties pass); and finally, that there are memory savings (the RAM footprint of the Python process for the wikiconv-2004 corpus goes from 5GB to 1GB). However, we should add new unit tests for DB mode to formalize this.

calebchiam · 2022-09-03T21:18:28Z

convokit/model/corpus.py

@@ -44,6 +46,8 @@ def __init__(
        self,
        filename: Optional[str] = None,
        utterances: Optional[List[Utterance]] = None,
+        db_collection_prefix: Optional[str] = None,


What are your thoughts on putting db arguments in a db_config argument?

This allows for easier extensibility and maintainability in the long run since we don’t have to change the Corpus constructor if we want to include more configurability for db mode. It also makes the constructor less intimidating for users.
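A rough sketch of what this could look like, assuming a small dataclass. `DBConfig`, its field names (in particular `host`, which is purely illustrative), and `CorpusSketch` are all hypothetical; only the collection prefix corresponds to an argument in the diff above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DBConfig:
    """Hypothetical container for DB-mode options; field names are illustrative."""

    collection_prefix: Optional[str] = None
    host: Optional[str] = None  # illustrative extra option, not from the PR


class CorpusSketch:
    """Stand-in for the Corpus constructor, trimmed to the relevant arguments."""

    def __init__(
        self,
        filename: Optional[str] = None,
        utterances: Optional[List[str]] = None,
        db_config: Optional[DBConfig] = None,
    ) -> None:
        self.filename = filename
        self.utterances = utterances or []
        # New DB options become fields on DBConfig; this signature stays put.
        self.db_config = db_config or DBConfig()


corpus = CorpusSketch(db_config=DBConfig(collection_prefix="wikiconv"))
```

Users who never touch DB mode can ignore `db_config` entirely and get the defaults.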

We should have done this for the exclude_ arguments as well IMO, but that’s hindsight for ya

jpwchang added 7 commits July 19, 2022 11:26

initial implementation of DBStorageManager

56295ce

Merge branch 'storage-abstraction' into db-storage-manager

9a1cbec

Update github actions and dependencies (from original db-mode branch)

0cb56ad

Basic DB mode functionality implemented

0d5931d

Implement database reconnection

ce72df8

Add support for binary metadata in db

8a8c3b2

Implement optimized JSON-to-DB path

5cae325

jpwchang requested a review from calebchiam August 27, 2022 23:37

jpwchang and others added 2 commits August 27, 2022 19:41

Delete accidentally added file arcExtractor.py

3ec55b5

ran black fmt

78e015c

calebchiam reviewed Aug 28, 2022

convokit/model/corpusHelper.py (resolved)

calebchiam reviewed Aug 28, 2022

convokit/model/corpusHelper.py (outdated, resolved)

calebchiam added 2 commits August 28, 2022 23:46

corpusHelper -> corpus_helpers

9064cd2

renaming some funcs for clarity

5d33915

calebchiam reviewed Aug 28, 2022

convokit/model/corpus.py (outdated, resolved)

jpwchang added 2 commits August 29, 2022 11:37

Refactor DB operation calls into corpus_helpers

8ac810c

Add docs for DB mode

7689385

calebchiam reviewed Sep 3, 2022

calebchiam merged commit 9634005 into master Sep 4, 2022
