From b5f00f1efa2189adba1af118c8b584be884361f5 Mon Sep 17 00:00:00 2001 From: Praveen Kumar Date: Fri, 17 Jan 2025 13:26:29 +0000 Subject: [PATCH] chore: added wal/snapshot doc --- influxdb3_wal/wal.md | 214 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 214 insertions(+) create mode 100644 influxdb3_wal/wal.md diff --git a/influxdb3_wal/wal.md b/influxdb3_wal/wal.md new file mode 100644 index 00000000000..2bd8b1bc8c3 --- /dev/null +++ b/influxdb3_wal/wal.md @@ -0,0 +1,214 @@ +# WAL (write-ahead-log) and Snapshot process + +All writes get buffered into a wal buffer and then every flush interval all the contents are flushed to disk at which point the writes become durable and +the clients are sent the confirmation that their writes are successful. At each flush interval there is also a check to see if queryable buffer need to +evict data to disk, this process is called snapshotting. The rest of this doc discusses all the moving parts in wal flush and snapshotting. + + +## Overview + +``` + + + + ┌────────────┐ ┌────────────┐ + │flush buffer│──────────►│ wal buffer │ + └────────────┘ └────────────┘ + ▲ + │ + (takes everything from wal buffer and writes to wal file) + │ + │ (actual background_wal_flush uses wal trait - which wal obj store impls) + │ ┌──────────────(background create)───────────────────────────┐ + │ │ ▼ + ┌─────┴──────┐ │ ┌────────┐ + ┌────►│wal objstore├─┴──(remove wal,need each wal file num / drop semaphore)─►│wal file│ + │ └─────┬──────┘ └────────┘ + │ │ + │ │ + │ │ + │ │ + │ │ + │ │ + │ (notifies wal + │ and snapshot optionally & snapshot semaphore taken) ┌─────────────┐ + │ │ ┌───────────────►│update buffer│ (whatever wal flushed into chunks) + │ ▼ │ └─────────────┘ + │ ┌────────────┐ │ ┌─────────────┐ + │ │query buffer├───────────────────────┼───────────────►│parquet file │ + │ └─────┬──────┘ │ └─────────────┘ + │ │ │ + │ (run snapshot) │ ┌─────────────┐ + └───────────┘ ├───────────────►│snapshot file│ (holds last wal seq number) + │ └─────────────┘ + │ ┌────────────┐ + └───────────────►│clear buffer│ (whatever snapshotted is removed) + └────────────┘ +``` + + +#### Steps + +1. When _writes_ comes in, they go into a write batch in wal buffer. These batches are held per database and the batches keep track of min + and max times within each batch. These batches further hold per table chunks. This chunk is created by taking incoming rows and pinning + them to a period. It is done by `t - (t % gen_1_duration)`. If `gen_1_duration` is 10 mins, then all of the rows will be divided into + 10 min chunks. As an example if there are rows for 10.29 and 10.35 then they both go into 2 separate chunks (10.20 and 10.30). And this + 10.20 and 10.30 are used later as the key in queryable buffer. +2. Every flush interval, the wal buffer is flushed and all batches are written to to wal file (converts to wal content and gets min/max + times from all batches) and every time wal file contents are written, we add wal period to snapshot tracker that tracks min/max times across + batches that went into single wal file +3. Then snapshot tracker is checked to see if there are enough wal periods to trigger snapshot. There are different conditions checked, the + common scenarios are, + - wal periods > (1.5 * snapshot size). e.g default settings will lead to snapshot size = 600, if we have 900 wal periods trigger snapshot + - force snapshot, irrespective of sizes go ahead and run snapshot + + If going ahead with force snapshotting, pick all the wal periods in the tracker and find the max time from most recent wal period. This will be + used as the `end_time_marker` to evict data from query buffer. Because forcing a snapshot can be triggered when wal buffer is empty (even though + queryable buffer is full), we need to add `Noop` (a no-op WalOp) to the wal file to hold the snapshot details in wal file. + + ``` + + Snapshot (all, emptying wal periods in tracker) + ▲ ▲ + │◄───wal in snapshot───────►│ + ┌──────┬──────┬──────┬──────┤ + │ 0 │ 1 │ 2 │ 3 │ + │ │ │ │ │ + ────┴──────┴──────┴──────┴──────┴─────► (time) + + ``` + + If it is a normal snapshot, then leave one wal period (`3` in eg below) and pick the last one (`2` in eg below) max time used as `end_time_marker` + and every wal period with max time is less than this `end_time_marker` will be removed from snapshot tracker. + + ``` + Snapshot (wal max < last wal period max, + 3 is max here that is left behind) + ▲ ▲ ▲ + │◄─ wal in snapshot─►│ │ + ┌──────┬──────┬──────┬──────┤ + │ 0 │ 1 │ 2 │ 3 │ + │ │ │ │ │ + ────┴──────┴──────┴──────┴──────┴─────────────► (time) + + ``` + At this point we may or may not have to snapshot, but the wal buffer will still need to be emptied into the wal file. The clients that have been + writing data are held up at most of flush interval time (defaults to 1s). The clients get notified together that the writes have been successful + when writes are persisted to wal file. + +4. Then query buffer is notified to update the buffer, and query buffer does the following things in sequence, + - it updates the buffer with incoming rows + - writes to parquet file, + - writes snapshot summary file + - clears the buffer (using the `end_time_marker` that has been passed along) + + It is useful to visualize how query buffer holds data internally to understand how the buffer is cleared. + - query buffer holds data as mapping between chunk time and MutableTableChunks (these are col id -> arrow array builder mappings with + min/max times for that chunk), looks roughly like below + ``` + + │ + ├───10.20───────────►┌────────────────────────────┐ + │ │ chunk 10.20 - 10.29 │ + │ │ │ + ├───10.30───────────►├────────────────────────────┤ + │ │ chunk 10.30 - 10.39 │ + │ │ │ + ├───10.40───────────►└────────────────────────────┘ + │ + ▼ + time + + ``` + The crucial thing to note here is, `10.20` and `10.30` are the keys and they're taken from the write batches that have been derived + from newly added rows. These are added to query buffer which manages an arrow backed buffer that holds all the newly added rows. + + - Above mutable chunks are added each time a wal flush happens, when snapshotting it uses the `end_time_marker` to evict data. Say, + 10.30 is the `end_time_marker` to evict data from queryable buffer, then query buffer evicts all data before 10.30 and holds it as + snapshot chunk which has converted the arrow arrays to a record batch that's ready to be passed to sort/dedupe before writing to + parquet file. Then a summary snapshot file is written which tracks what wal file number was last snapshotted along with pointers to + parquet files created. This data is updated in persisted files so that any new query which spans that time can find data from the + parquet files as buffer will not have them anymore. + +5. Once snapshotting process is complete, now deletes all the wal files up to a configured number of wal files to retain. When replaying + the last snapshotted file is looked up from snapshot summary file that has been created so none of the wal files that've already been + snapshotted is loaded into the buffer. +6. If server crashed and restarted wal replay happens. But before that all snapshots are loaded, but even though snapshot file is written + out it doesn't guarantee that wal files relevant to snapshot has been removed. + +## Example + +Below section walks through some of the nuances touched in the overall process described above + +- These are the wal files (1-4) with data for different time ranges [20 - 50] etc. Notice data can be overlapping between time periods +``` +1[20 - 50] +2[31 - 70] +3[51 - 90] +4[45 - 110] - snapshot! <= 2 (wal file num), <= 70 (end_time_marker). snapshot details has a file number and end_time_marker. +``` + +- This is a mapping between queryable buffer's chunk time and what wal files the data comes from. This is not how queryable buffer + maps the data internally, it doesn't know about wal files - however this mapping helps visualise the dependency with wal file better. + The data for time slice 20 originates purely from wal file 1, for 40 it holds rows from wal files 1, 2 and 4 etc. +``` +20 - 1 +30 - 1 +40 - 1, 2, 4 +50 - 2, 4 +60 - 2, 3, 4 +70 - 2, 3, 4 +80 - 3, 4 +90 - 3, 4 +``` + +- At this point, lets say we need to snapshot as per the details above (`snapshot! <= 2 (wal file num), <= 70 (end_time_marker)`) file is written out + & parquet files removing all items in queryable buffer upto time 70. This is the left over in queryable buffer - + mainly the data that was in wal files 3, 4 are now in parquet files already. +``` +80 - 3, 4 +90 - 3, 4 +``` + +- If there's been a restart at this point data is loaded from wal files (due to restart), times 60 and 70 will be loaded into memory. so it may look like this, +``` +60 - 3, 4 +70 - 3, 4 +80 - 3, 4 +90 - 3, 4 +``` + +- And if further snapshot is kicked off by replaying earlier snapshot from wal file 4, because the `end_time_marker`(=70) is still there in wal file + we'd end up removing it from the query buffer leaving the query buffer in state as expected. +``` +80 - 3, 4 +90 - 3, 4 +``` + +Say instead we force snapshotted instead of normal snapshotting, + +- Everything will be removed from query buffer. These are the wal files with data for different time ranges [20 - 50] etc. Notice data can be overlapping between time periods +``` +1[20 - 50] +2[31 - 70] +3[51 - 90] +4[45 - 110] - snapshot! <= 4 (wal file num), <= 110 (end_time_marker) +``` + +- And query buffer looks like, +``` +20 - 1 +30 - 1 +40 - 1, 2, 4 +50 - 2, 4 +60 - 2, 3, 4 +70 - 2, 3, 4 +80 - 3, 4 +90 - 3, 4 +100 - 4 +110 - 4 +``` + +- At this point, snapshot file is written out & parquet files removing all items in queryable buffer upto time 110. So, all files 1, 2, 3, 4 are ready for deletion +- There should be nothing in queryable buffer as end time marker is 110 and if there's been a restart at this point there are no wal files to load and everything is in parquet files (snapshotted) +