DB storage manager #175
@@ -44,6 +46,8 @@ def __init__(
        self,
        filename: Optional[str] = None,
        utterances: Optional[List[Utterance]] = None,
        db_collection_prefix: Optional[str] = None,
What are your thoughts on putting db arguments in a db_config argument?
This allows for easier extensibility and maintainability in the long run since we don’t have to change the Corpus constructor if we want to include more configurability for db mode. It also makes the constructor less intimidating for users.
We should have done this for the exclude_ arguments as well IMO, but that’s hindsight for ya
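To make the suggestion concrete, here is a minimal sketch of what a single db_config argument could look like. Everything in it is hypothetical: the DBConfig class, its field names (including host), and the toy Corpus stand-in are illustrations only and are not part of this PR or of ConvoKit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DBConfig:
    """Hypothetical container for DB-mode options (not part of the PR)."""

    collection_prefix: Optional[str] = None
    host: Optional[str] = None  # illustrative extra option, an assumption


class Corpus:
    """Toy stand-in for the real Corpus, showing only the constructor shape."""

    def __init__(
        self,
        filename: Optional[str] = None,
        utterances: Optional[List[str]] = None,
        db_config: Optional[DBConfig] = None,
    ):
        self.filename = filename
        self.utterances = utterances or []
        # DB mode is on only when a config is supplied. New DB options go
        # into DBConfig, so this signature does not need to change.
        self.db_config = db_config
        self.in_db_mode = db_config is not None


# In-memory construction needs no DB arguments at all.
mem_corpus = Corpus(utterances=["hi"])

# DB-backed construction passes every DB option through one object.
db_corpus = Corpus(
    utterances=["hi"],
    db_config=DBConfig(collection_prefix="wikiconv"),
)
```

With this shape, adding another DB option later would touch only DBConfig, and users who never use DB mode see one optional argument instead of several.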
Description
Adds support for MongoDB-backed representation of ConvoKit objects. Support is mainly provided through a new StorageManager subclass, though the Corpus constructor does have some DB-specific changes to streamline the JSON-to-DB construction path for the sake of efficiency.
Motivation and Context
Previously, ConvoKit objects have been stored entirely in RAM, which requires enough RAM to hold both the raw dataset contents and the additional overhead added by ConvoKit. This has proven infeasible for large corpora, a use case that is extremely common in conversational research. Using a DB-backed representation lets us take advantage of the DB's lazy evaluation, thus reducing memory overhead. A second use case for DB-backed representation is applications that perform real-time updates of a Corpus over a long period and require resiliency; for instance, using ConvoKit in a web application backend. Because the DB is ultimately backed by on-disk file storage, in the event of a program crash or system outage, in-progress work that was not yet dumped can still be recovered by reconnecting to the DB (a path explicitly supported through a new reconnect_to_db factory method).
How has this been tested?
So far all testing has been manual, but I've confirmed a few things: first, that all basic functionality works; second, that (at least for a small corpus) the DB version matches the in-memory version (iterative equality tests over all component properties pass); and finally, that there is a memory savings (RAM footprint of the Python process for the wikiconv-2004 corpus goes from 5GB to 1GB). However, we should add new unit tests for DB mode to formalize this.
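As a starting point for those unit tests, the iterative equality check could be sketched roughly as follows. This is only an illustration under stated assumptions: the two backends are modeled as plain dictionaries keyed by object ID, and find_mismatches is a hypothetical helper, not an existing ConvoKit function.

```python
from typing import Any, Dict, Iterable, List, Tuple


def find_mismatches(
    in_memory: Dict[str, Dict[str, Any]],
    db_backed: Dict[str, Dict[str, Any]],
    properties: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (object_id, property) pairs where the two backends disagree."""
    properties = list(properties)
    mismatches = []
    # An ID present in only one backend is itself a mismatch.
    for obj_id in sorted(set(in_memory) | set(db_backed)):
        if obj_id not in in_memory or obj_id not in db_backed:
            mismatches.append((obj_id, "<missing>"))
            continue
        # Compare each requested property one at a time.
        for prop in properties:
            if in_memory[obj_id].get(prop) != db_backed[obj_id].get(prop):
                mismatches.append((obj_id, prop))
    return mismatches


# Toy data standing in for utterances loaded through each backend.
mem = {
    "u1": {"text": "hello", "speaker": "alice"},
    "u2": {"text": "bye", "speaker": "bob"},
}
db = {
    "u1": {"text": "hello", "speaker": "alice"},
    "u2": {"text": "bye!", "speaker": "bob"},
}
result = find_mismatches(mem, db, ["text", "speaker"])
```

A real test would build both inputs from the same source corpus and assert that the returned list is empty.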