feat(dgw): persistent job queue for crash resistance #1108

Merged
CBenoit merged 4 commits into master from hackaton-job-queue on Nov 15, 2024

Member

@CBenoit CBenoit commented Nov 14, 2024

Hackathon: Crash and kill signal resistance using a persistent job queue via Turso’s libSQL

Motivation

This year we added some background tasks in the Gateway that should not be canceled, or if they are, should be restarted later. Essentially two tasks: mass deletion of recordings (relatively important, but it's always possible to launch indexing in DVLS in case of a problem) and remuxing recordings to webm format (good to have). If the service is killed in the middle of one of these operations, we should resume execution on the next startup.

Demo

rec2024-11-14-12-46-44.webm

Implementation

This persistent job queue is implemented using Turso’s libSQL. Using libSQL (or SQLite) to implement the queue allows us to benefit from all the work put into building a reliable, secure and performant disk-based database, instead of attempting to implement our own ad-hoc storage and debugging it forever.

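Inspiration was taken from 37signals' Solid Queue:

- https://dev.37signals.com/introducing-solid-queue/
- https://github.com/rails/solid_queue/

And "How to build a job queue with Rust and PostgreSQL" from kerkour.com:

- https://kerkour.com/rust-job-queue-with-postgresql

The 'user_version' value, which is a SQLite PRAGMA, is used to keep track of the migration state. It's a very lightweight approach as it is just an integer at a fixed offset in the SQLite file.

- https://sqlite.org/pragma.html#pragma_user_version
- https://www.sqlite.org/fileformat.html#user_version_number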
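
As a rough illustration of why this is lightweight (a std-only sketch, not the code from this PR): the SQLite file format documentation places the user version as a 4-byte big-endian integer at byte offset 60 of the 100-byte database header. The helper names `read_user_version` and `pending_migrations`, and the placeholder `MIGRATIONS` list, are assumptions made up for this example. In practice the value would be read and written through `PRAGMA user_version` rather than by parsing the file.

```rust
/// Byte offset of the user version in the 100-byte SQLite database header,
/// as given by the SQLite file format documentation.
const USER_VERSION_OFFSET: usize = 60;

/// Hypothetical migration list: entry `i` upgrades the schema from
/// version `i` to version `i + 1`. The statements are placeholders.
const MIGRATIONS: &[&str] = &[
    "/* migration 0 (placeholder) */",
    "/* migration 1 (placeholder) */",
];

/// Reads the user version (4 bytes, big-endian) from raw header bytes.
/// Returns `None` if the slice is too short to contain the field.
fn read_user_version(header: &[u8]) -> Option<i32> {
    let b = header.get(USER_VERSION_OFFSET..USER_VERSION_OFFSET + 4)?;
    Some(i32::from_be_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns the migrations that still have to run for a database whose
/// current user version is `user_version`. After applying them, the
/// caller would set the user version to `migrations.len()`.
fn pending_migrations<'a>(user_version: i32, migrations: &'a [&'a str]) -> &'a [&'a str] {
    // Negative or too-large versions are clamped into the valid range.
    let applied = (user_version.max(0) as usize).min(migrations.len());
    &migrations[applied..]
}
```

With this scheme, a fresh database (user version 0) runs every migration, and a database already at `MIGRATIONS.len()` runs none.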

Why Turso’s libSQL?

Introducing Turso’s libSQL, as opposed to SQLite, will serve us for "Recording Farms" in the future. We’ll want instances of the same Recording Farm to coordinate. At that point, we’ll want to use Turso's libSQL network database feature. Indeed, putting the SQLite database file on a virtual filesystem is not recommended, as this can lead to corruption and data loss. Turso will allow us to have a local mode for the simplest setups, and a network and distributed mode for Recording Farms when we get there.

@pacmancoder pacmancoder self-requested a review November 14, 2024 09:36
@CBenoit CBenoit enabled auto-merge (squash) November 15, 2024 07:57
async fn push_job(&self, job: &DynJob, schedule_for: Option<OffsetDateTime>) -> anyhow::Result<()> {
    let sql_query = "INSERT INTO job_queue
        (id, scheduled_for, failed_attempts, status, name, def)
        VALUES (:id, :scheduled_for, :failed_attempts, :status, :name, jsonb(:def))";
Contributor

Cool, SQLite finally added a dedicated binary JSON type! 👍

Member Author

Yeah, it seems convenient. It’s even possible to query and index it!
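
For readers following along, here is a minimal in-memory sketch of how the `job_queue` columns in the excerpt above (`scheduled_for`, `failed_attempts`, `status`, `name`) could support resuming work after a kill. It is an assumption-laden model, not the PR's implementation: the `Status` variants, the `JobRow` struct, and the functions `recover_interrupted` and `claim_next` are hypothetical, the `def` column is omitted, and `scheduled_for` is simplified to Unix seconds instead of `OffsetDateTime`.

```rust
/// Hypothetical job states; the actual values stored in `status` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Queued,
    Running,
}

/// In-memory stand-in for one row of the `job_queue` table.
#[derive(Debug, Clone)]
struct JobRow {
    id: String,
    scheduled_for: u64, // Unix seconds (simplification of OffsetDateTime)
    failed_attempts: u32,
    status: Status,
    name: String,
}

/// Run once at startup: a row still marked `Running` belongs to a process
/// that was killed mid-job, so it is put back in the queue.
/// Returns how many rows were recovered.
fn recover_interrupted(rows: &mut [JobRow]) -> usize {
    let mut recovered = 0;
    for row in rows.iter_mut().filter(|r| r.status == Status::Running) {
        row.status = Status::Queued;
        recovered += 1;
    }
    recovered
}

/// Claims the due job with the earliest `scheduled_for`, marking it
/// `Running` so that an interrupted run can be detected on the next startup.
fn claim_next(rows: &mut [JobRow], now: u64) -> Option<&JobRow> {
    let row = rows
        .iter_mut()
        .filter(|r| r.status == Status::Queued && r.scheduled_for <= now)
        .min_by_key(|r| r.scheduled_for)?;
    row.status = Status::Running;
    Some(&*row)
}
```

Because the state lives in the table rather than in process memory, the same two steps work whether the previous run exited cleanly or was killed.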


// UUID v4 only provides randomness, which leads to fragmentation.
// We use ULID instead to reduce index fragmentation.
// https://github.com/ulid/spec
Contributor

Nice catch! 👍
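
To make the fragmentation point concrete, here is a std-only sketch of the ULID layout as described in the ULID spec: a 48-bit millisecond timestamp followed by 80 bits of randomness, encoded as 26 Crockford Base32 characters. The function names `ulid_bits` and `encode_ulid` are made up for this example, the randomness is passed in by the caller because the standard library has no random number generator, and the PR itself most likely relies on an existing crate instead.

```rust
/// Crockford Base32 alphabet (no I, L, O, U), in ascending ASCII order.
const CROCKFORD: &[u8; 32] = b"0123456789ABCDEFGHJKMNPQRSTVWXYZ";

/// Packs a 48-bit millisecond timestamp and 80 bits of randomness into the
/// 128-bit ULID layout, with the timestamp in the most significant bits.
/// Extra high bits of either input are discarded.
fn ulid_bits(timestamp_ms: u64, randomness: u128) -> u128 {
    let ts = (timestamp_ms as u128) & ((1u128 << 48) - 1);
    let rand = randomness & ((1u128 << 80) - 1);
    (ts << 80) | rand
}

/// Encodes the 128 bits as 26 Crockford Base32 characters, most significant
/// group first. The first character only carries the top 3 bits.
fn encode_ulid(bits: u128) -> String {
    (0..26)
        .rev()
        .map(|i| CROCKFORD[((bits >> (5 * i)) & 0x1F) as usize] as char)
        .collect()
}
```

Since the timestamp occupies the leading characters and the alphabet is in ASCII order, IDs generated later compare greater as plain strings, so new rows tend to land at the end of the index instead of at random positions as with UUID v4.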

Contributor

@pacmancoder pacmancoder left a comment

Impressive work! 🥇
Looks good to merge! 🎉

@CBenoit CBenoit merged commit 2420b07 into master Nov 15, 2024
30 checks passed
@CBenoit CBenoit deleted the hackaton-job-queue branch November 15, 2024 11:28