Add Titan documents (pingcap#3811)
* Add Titan documents

* config

* config

* Update titan-configuration.md

* Update titan-overview.md

* Apply suggestions from code review

Co-authored-by: Lilian Lee <[email protected]>

* unify LSM-tree

* Apply suggestions from code review

Co-authored-by: Lilian Lee <[email protected]>

Co-authored-by: Lilian Lee <[email protected]>
TomShawn and lilin90 authored Sep 17, 2020
1 parent 24f31fb commit 4a18cbc
Showing 10 changed files with 269 additions and 0 deletions.
2 changes: 2 additions & 0 deletions TOC.md
@@ -479,6 +479,8 @@
+ TiKV
+ [TiKV Overview](/tikv-overview.md)
+ [RocksDB Overview](/storage-engine/rocksdb-overview.md)
+ [Titan Overview](/storage-engine/titan-overview.md)
+ [Titan Configuration](/storage-engine/titan-configuration.md)
+ TiFlash
+ [Overview](/tiflash/tiflash-overview.md)
+ [Use TiFlash](/tiflash/use-tiflash.md)
Binary file added media/titan/titan-1.png
Binary file added media/titan/titan-2.png
Binary file added media/titan/titan-3.png
Binary file added media/titan/titan-4.png
Binary file added media/titan/titan-5.png
Binary file added media/titan/titan-6.png
Binary file added media/titan/titan-7.png
142 changes: 142 additions & 0 deletions storage-engine/titan-configuration.md
@@ -0,0 +1,142 @@
---
title: Titan Configuration
summary: Learn how to configure Titan.
---

# Titan Configuration

This document introduces how to enable and disable [Titan](/storage-engine/titan-overview.md) using the corresponding configuration items, as well as the relevant parameters and the Level Merge feature.

## Enable Titan

Titan is compatible with RocksDB, so you can directly enable Titan on the existing TiKV instances that use RocksDB. You can use one of the following two methods to enable Titan:

+ Method 1: If you have deployed the cluster using TiUP, you can execute the `tiup cluster edit-config ${cluster-name}` command and edit the TiKV configuration file as the following example shows:

{{< copyable "shell-regular" >}}

```shell
tikv:
  rocksdb.titan.enabled: true
```

Reload the configuration and TiKV will be rolling restarted online:

{{< copyable "shell-regular" >}}

```shell
tiup cluster reload ${cluster-name} -R tikv
```

For the detailed command, see [Modify the configuration using TiUP](/maintain-tidb-using-tiup.md#modify-the-configuration).

+ Method 2: Directly edit the TiKV configuration file to enable Titan (**NOT** recommended for the production environment).

{{< copyable "" >}}

```toml
[rocksdb.titan]
enabled = true
```

After Titan is enabled, the existing data stored in RocksDB is not immediately moved to the Titan engine. As new data is written to the TiKV foreground and RocksDB performs compaction, the values are progressively separated from keys and written to Titan. You can view the **TiKV Details** -> **Titan kv** -> **blob file size** panel to confirm the size of the data stored in Titan.

If you want to speed up the migration of values into Titan, compact data of the whole TiKV cluster manually using tikv-ctl. For details, see [manual compaction](/tikv-control.md#compact-data-of-the-whole-tikv-cluster-manually).
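
For example, assuming the PD endpoint is `127.0.0.1:2379` (a placeholder; replace it with your own PD address and see the tikv-control document above for the full set of options), you can trigger the whole-cluster compaction roughly as follows:

{{< copyable "shell-regular" >}}

```shell
# Compact the whole TiKV cluster through PD so that values are separated
# into Titan blob files sooner. The PD address is a placeholder.
tikv-ctl --pd 127.0.0.1:2379 compact-cluster
```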

> **Note:**
>
> When Titan is disabled, RocksDB cannot read data that has already been migrated to Titan. If `rocksdb.titan.enabled` is mistakenly set to `false` on a TiKV instance where Titan is already enabled, TiKV fails to start, and the `You have disabled titan when its data directory is not empty` error appears in the TiKV log. To disable Titan correctly, see [Disable Titan](#disable-titan-experimental).

## Parameters

To adjust Titan-related parameters using TiUP, refer to [Modify the configuration](/maintain-tidb-using-tiup.md#modify-the-configuration).

+ Titan GC thread count.

From the **TiKV Details** -> **Thread CPU** -> **RocksDB CPU** panel, if you observe that the Titan GC threads are at full capacity for a long time, consider increasing the size of the Titan GC thread pool.

{{< copyable "" >}}

```toml
[rocksdb.titan]
max-background-gc = 1
```
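
For a cluster managed by TiUP, the same setting can be written in the dotted-key format used in [Enable Titan](#enable-titan). The value `4` below is only an illustration; size the pool according to the panel mentioned above:

{{< copyable "shell-regular" >}}

```shell
tikv:
  rocksdb.titan.max-background-gc: 4
```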

+ Value size threshold.

When the size of the value written to the foreground is smaller than the threshold, this value is stored in RocksDB; otherwise, this value is stored in the blob file of Titan. Based on the distribution of value sizes, if you increase the threshold, more values are stored in RocksDB and TiKV performs better in reading small values. If you decrease the threshold, more values go to Titan, which further reduces RocksDB compactions.

```toml
[rocksdb.defaultcf.titan]
min-blob-size = "1KB"
```
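
For example, if monitoring shows that many values fall between 512 B and 1 KB, you can lower the threshold so that these values are also separated into Titan. This is a hypothetical tuning; the `512B` literal assumes TiKV accepts the byte-unit suffix in its readable-size format:

```toml
[rocksdb.defaultcf.titan]
# Hypothetical tuning: values larger than 512 B go to Titan blob files.
min-blob-size = "512B"
```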

+ The algorithm used for compressing values in Titan, which compresses each value individually (the value is the compression unit).

```toml
[rocksdb.defaultcf.titan]
blob-file-compression = "lz4"
```
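
Titan supports compression algorithms such as Snappy, LZ4, and Zstd (see [Titan Overview](/storage-engine/titan-overview.md)). For example, to trade some CPU for a higher compression ratio, you might switch to Zstd. This is a sketch; verify the exact option value against your TiKV version:

```toml
[rocksdb.defaultcf.titan]
# Use Zstandard instead of the default LZ4 for a higher compression ratio.
blob-file-compression = "zstd"
```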

+ The size of the value cache in Titan.

Larger cache size means higher read performance of Titan. However, too large a cache size causes Out of Memory (OOM). It is recommended to set the value of `storage.block-cache.capacity` to the store size minus the blob file size and set `blob-cache-size` to `memory size * 50% - block cache size` according to the monitoring metrics when the database is running stably. This maximizes the blob cache size when the block cache is large enough for the whole RocksDB engine.

```toml
[rocksdb.defaultcf.titan]
blob-cache-size = 0
```
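
As a hypothetical example, on a TiKV node with 64 GiB of memory where `storage.block-cache.capacity` is set to 24 GiB, the formula above gives 64 GiB * 50% - 24 GiB = 8 GiB for the blob cache:

```toml
[rocksdb.defaultcf.titan]
# Hypothetical sizing: 64 GiB memory * 50% - 24 GiB block cache = 8 GiB.
blob-cache-size = "8GB"
```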

+ When the ratio of discardable data (the corresponding key has been updated or deleted) in a blob file exceeds the following threshold, Titan GC is triggered.

```toml
[rocksdb.defaultcf.titan]
discardable-ratio = 0.5
```

When Titan GC writes the useful data of this blob file to another file, you can use the `discardable-ratio` value to estimate the upper limits of write amplification and space amplification (assuming compression is disabled):

Upper limit of write amplification = 1 / discardable_ratio

Upper limit of space amplification = 1 / (1 - discardable_ratio)

From the two equations above, you can see that decreasing the value of `discardable_ratio` can reduce space amplification but causes GC to be more frequent in Titan. Increasing the value reduces Titan GC, the corresponding I/O bandwidth, and CPU consumption but increases disk usage.
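
For example, with the default `discardable-ratio = 0.5`, write amplification is at most 1 / 0.5 = 2 and space amplification is at most 1 / (1 - 0.5) = 2. Raising the ratio to 0.8 lowers the write amplification bound to 1.25 but raises the space amplification bound to 5.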

+ The following option limits the I/O rate of RocksDB compaction. During peak traffic, limiting the I/O bandwidth and CPU consumption of RocksDB compaction reduces its impact on the read and write performance of the foreground.

When Titan is enabled, this option limits the combined I/O rate of RocksDB compaction and Titan GC. If you find that the I/O and/or CPU consumption of RocksDB compaction and Titan GC is too high, set this option to a suitable value according to the disk I/O bandwidth and the actual write traffic.

```toml
[rocksdb]
rate-bytes-per-sec = 0
```
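
For example, to cap the combined compaction and Titan GC I/O at roughly 300 MiB/s (a hypothetical value; choose it based on your disk bandwidth and write traffic, and note that the readable-size string form depends on your TiKV version):

```toml
[rocksdb]
# Hypothetical cap on RocksDB compaction + Titan GC I/O.
rate-bytes-per-sec = "300MB"
```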

## Disable Titan (experimental)

To disable Titan, you can configure the `rocksdb.defaultcf.titan.blob-run-mode` option. The optional values for `blob-run-mode` are as follows:

- When the option is set to `kNormal`, Titan performs read and write operations normally.
- When the option is set to `kReadOnly`, all newly written values are written into RocksDB, regardless of the value size.
- When the option is set to `kFallback`, all newly written values are written into RocksDB, regardless of the value size. Also, all compacted values stored in the Titan blob files are automatically moved back to RocksDB.

To disable Titan, set `blob-run-mode = "kFallback"` and perform a full compaction using tikv-ctl. After that, check the monitoring metrics and confirm that the blob file size has decreased to `0`. Then you can set `rocksdb.titan.enabled` to `false` and restart TiKV.
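
The corresponding configuration item looks like the following. Set it on every TiKV instance from which you want to migrate data out of Titan:

```toml
[rocksdb.defaultcf.titan]
# Move values back into RocksDB during compaction before disabling Titan.
blob-run-mode = "kFallback"
```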

> **Note:**
>
> Disabling Titan is an experimental feature. It is **NOT** recommended to use it if not necessary.

## Level Merge (experimental)

In TiKV 4.0, [Level Merge](/storage-engine/titan-overview.md#level-merge), a new algorithm, is introduced to improve the performance of range query and to reduce the impact of Titan GC on the foreground write operations. You can enable Level Merge using the following option:

```toml
[rocksdb.defaultcf.titan]
level-merge = true
```

Enabling Level Merge has the following benefits:

- Greatly improves the performance of Titan range queries.
- Reduces the impact of Titan GC on the foreground write operations and improves write performance.
- Reduces the space amplification of Titan and the disk usage (compared to the disk usage with the default configuration).

Accordingly, the write amplification with Level Merge enabled is slightly higher than that of Titan but is still lower than that of the native RocksDB.
125 changes: 125 additions & 0 deletions storage-engine/titan-overview.md
@@ -0,0 +1,125 @@
---
title: Titan Overview
summary: Learn the overview of the Titan storage engine.
---

# Titan Overview

[Titan](https://github.com/pingcap/rocksdb/tree/titan-5.15) is a high-performance [RocksDB](https://github.com/facebook/rocksdb) plugin for key-value separation. Titan can reduce write amplification in RocksDB when large values are used.

When the value size in Key-Value pairs is large, Titan performs better than RocksDB in write, update, and point read scenarios. However, Titan achieves this higher write performance by sacrificing storage space and range query performance. As the price of SSDs continues to decrease, this trade-off becomes more and more worthwhile.

## Key features

- Reduce write amplification by separating values from the log-structured merge-tree (LSM-tree) and storing them independently.
- Seamlessly upgrade RocksDB instances to Titan. The upgrade does not require human intervention and does not impact online services.
- Achieve 100% compatibility with all RocksDB features used by the current TiKV.

## Usage scenarios

Titan is suitable for the scenarios where a huge volume of data is written to the TiKV foreground:

- RocksDB triggers a large amount of compactions, which consumes a lot of I/O bandwidth or CPU resources. This causes poor read and write performance of the foreground.
- The RocksDB compaction lags much behind (due to the I/O bandwidth limit or CPU bottleneck) and frequently causes write stalls.
- RocksDB triggers a large amount of compactions, which causes a lot of I/O writes and affects the life of the SSD disk.

## Prerequisites

The prerequisites for enabling Titan are as follows:

- The average size of values is large, or the size of all large values accounts for much of the total value size. Currently, a value larger than 1 KB is considered a large value. In some situations, this threshold can be 512 B. Note that a single value written to TiKV cannot exceed 6 MB due to the limitation of the TiKV Raft layer.
- No range queries are performed, or high range query performance is not required. Because the data stored in Titan is not well-ordered, its range query performance is poorer than that of RocksDB, especially for queries over a large range. According to PingCAP's internal test, Titan's range query performance is 40% to a few times lower than that of RocksDB.
- Sufficient disk space, because Titan reduces write amplification at the cost of disk space. In addition, Titan compresses values one by one, and its compression rate is lower than that of RocksDB. RocksDB compresses blocks one by one. Therefore, Titan consumes more storage space than RocksDB, which is expected and normal. In some situations, Titan's storage consumption can be twice that of RocksDB.

If you want to improve the performance of Titan, see the blog post [Titan: A RocksDB Plugin to Reduce Write Amplification](https://pingcap.com/blog/titan-storage-engine-design-and-implementation/).

## Architecture and implementation

The following figure shows the architecture of Titan:

![Titan Architecture](/media/titan/titan-1.png)

During flush and compaction operations, Titan separates values from the LSM-tree. The advantage of this approach is that the write process is consistent with RocksDB, which reduces the chance of invasive changes to RocksDB.

### BlobFile

When Titan separates a value from the LSM-tree, it stores the value in a blob file (BlobFile). The following figure shows the BlobFile format:

![BlobFile Format](/media/titan/titan-2.png)

A blob file mainly consists of blob records, meta blocks, a meta index block, and a footer. Each blob record stores a Key-Value pair. The meta blocks are used for scalability and store properties related to the blob file. The meta index block is used for meta block searching.

> **Note:**
>
> + The Key-Value pairs in the blob file are stored in order, so that when the Iterator is implemented, the sequential reading performance can be improved via prefetching.
> + Each blob record keeps a copy of the user key corresponding to the value. This way, when Titan performs Garbage Collection (GC), it can query the user key and identify whether the corresponding value is outdated. However, this process introduces some write amplification.
> + BlobFile supports compression at the blob record level. Titan supports multiple compression algorithms, such as [Snappy](https://github.com/google/snappy), [LZ4](https://github.com/lz4/lz4), and [Zstd](https://github.com/facebook/zstd). Currently, the default compression algorithm Titan uses is LZ4.

### TitanTableBuilder

![TitanTableBuilder](/media/titan/titan-3.png)

TitanTableBuilder is the key to achieving Key-Value separation. TitanTableBuilder determines the value size of each Key-Value pair and, based on that, decides whether to separate the value from the key and store it in a blob file.

+ If the value size is greater than or equal to `min_blob_size`, TitanTableBuilder separates the value and stores it in the blob file. TitanTableBuilder also generates an index and writes it into the SST.
+ If the value size is smaller than `min_blob_size`, TitanTableBuilder writes the value directly into the SST.

Titan can also be downgraded to RocksDB in the process above: when RocksDB performs compaction, the separated values can be written back into the newly generated SST files.

## Garbage Collection

Titan uses Garbage Collection (GC) to reclaim space. As the keys are being reclaimed in the LSM-tree compaction, some values stored in blob files are not deleted at the same time. Therefore, Titan needs to perform GC periodically to delete outdated values. Titan provides the following two types of GC:

+ Blob files are periodically integrated and rewritten to delete outdated values. This is the regular way of performing GC.
+ Blob files are rewritten while the LSM-tree compaction is performed at the same time. This is the Level Merge feature.

### Regular GC

Titan uses the TablePropertiesCollector and EventListener components of RocksDB to collect the information for GC.

#### TablePropertiesCollector

RocksDB supports using BlobFileSizeCollector, a custom table property collector, to collect properties of an SST and write them into the corresponding SST file. The collected properties are named BlobFileSizeProperties. The following figure shows the BlobFileSizeCollector workflow and data formats:

![BlobFileSizeProperties](/media/titan/titan-4.png)

On the left is the SST index format. The first column is the blob file ID; the second column is the offset for the blob record in the blob file; the third column is the blob record size.

On the right is the BlobFileSizeProperties format. Each line represents a blob file and how much data is saved in this blob file. The first column is the blob file ID; the second column is the size of the data.

#### EventListener

RocksDB uses compaction to discard old data and reclaim space. After each compaction, some blob files in Titan might contain partly or entirely outdated data. Therefore, you can trigger GC by listening to compaction events. During compaction, you can collect and compare the blob file size properties of the input and output SSTs to determine which blob files require GC. The following figure shows the general process:

![EventListener](/media/titan/titan-5.png)

+ *inputs* stands for the blob file size properties for all SSTs that participate in the compaction.
+ *outputs* stands for the blob file size properties for all SSTs generated in the compaction.
+ *discardable size* is the size of the file to be discarded for each blob file, calculated based on inputs and outputs. The first column is the blob file ID. The second column is the size of the file to be discarded.

For each valid blob file, Titan maintains a discardable size variable in memory. After each compaction, this variable is accumulated for the corresponding blob file. Each time GC starts, it picks the blob file with the greatest discardable size as the candidate file for GC. To reduce write amplification, a certain level of space amplification is allowed, which means GC is started on a blob file only when its discardable data has reached a specific proportion of the file size.

For the selected blob file, Titan checks whether the blob index of the key corresponding to each value still exists or has been updated, to determine whether this value is outdated. If the value is not outdated, Titan merges and sorts the value into a new blob file, and writes the updated blob index into the SST using WriteCallback or MergeOperator. Then, Titan records the latest sequence number of RocksDB and does not delete the old blob file until the sequence of the oldest snapshot exceeds the recorded sequence number. The reason is that after the blob index is written back to the SST, the old blob index is still accessible via the previous snapshot. Therefore, Titan needs to ensure that no snapshot can access the old blob index before GC safely deletes the corresponding blob file.

### Level Merge

Level Merge is a newly introduced algorithm in Titan. According to the implementation principle of Level Merge, Titan merges and rewrites the blob files that correspond to the SST files, and generates new blob files while compactions are performed on the LSM-tree. The following figure shows the general process:

![LevelMerge General Process](/media/titan/titan-6.png)

When compactions are performed on the SSTs of level z-1 and level z, Titan reads and writes Key-Value pairs in order. Then it writes the values of the selected blob files into new blob files in order, and updates the blob indexes of keys when new SSTs are generated. For the keys deleted in compactions, the corresponding values are not written to the new blob files, which works similarly to GC.

Compared with the regular way of GC, the Level Merge approach completes the blob GC while compactions are performed on the LSM-tree. In this way, Titan no longer needs to check the status of the blob index in the LSM-tree or to write the new blob index into the LSM-tree. This reduces the impact of GC on the foreground operations. As the blob files are repeatedly rewritten, fewer files overlap with each other, which keeps the whole system in better order and improves the performance of scans.

However, layering blob files in a way similar to tiered compaction brings write amplification. Because 99% of the data in the LSM-tree is stored at the lowest two levels, Titan performs the Level Merge operation only on the blob files that correspond to data compacted to the lowest two levels of the LSM-tree.

#### Range Merge

Range Merge is an optimized GC approach based on Level Merge. However, the bottom level of the LSM-tree might be in poorer order in the following situations:

- When `level_compaction_dynamic_level_bytes` is enabled, the data volume at each level of the LSM-tree increases dynamically, and the number of sorted runs at the bottom level keeps increasing.
- A specific range of data is frequently compacted, and this causes a lot of sorted runs in that range.

![RangeMerge](/media/titan/titan-7.png)

Therefore, the Range Merge operation is needed to keep the number of sorted runs within a certain level. At the time of OnCompactionComplete, Titan counts the number of sorted runs in a range. If the number is large, Titan marks the corresponding blob file as ToMerge and rewrites it in the next compaction.
