diff --git a/alert-rules.md b/alert-rules.md
index 86904a1c2d287..e23e13ec4adaf 100644
--- a/alert-rules.md
+++ b/alert-rules.md
@@ -233,7 +233,7 @@ This section gives the alert rules for the PD component.
 
 * Description:
 
-    The number of Region replicas is smaller than the value of `max-replicas`. When a TiKV machine is down and its downtime exceeds `max-down-time`, it usually leads to missing replicas for some Regions during a period of time.
+    The number of Region replicas is smaller than the value of `max-replicas`.
 
 * Solution:
 
diff --git a/best-practices/java-app-best-practices.md b/best-practices/java-app-best-practices.md
index ee05c7301c878..7adff8a1d6b91 100644
--- a/best-practices/java-app-best-practices.md
+++ b/best-practices/java-app-best-practices.md
@@ -232,7 +232,7 @@ The application needs to return the connection after finishing using it. It is a
 
 ### Probe configuration
 
-The connection pool maintains persistent connections to TiDB. TiDB does not proactively close client connections by default (unless an error is reported), but generally there will be network proxies such as LVS or HAProxy between the client and TiDB. Usually, these proxies will proactively clean up connections that are idle for a certain period of time. In addition to paying attention to the idle configuration of the proxies, the connection pool also needs to keep alive or probe connections.
+The connection pool maintains persistent connections to TiDB. TiDB does not proactively close client connections by default (unless an error is reported), but generally there will be network proxies such as LVS or HAProxy between the client and TiDB. Usually, these proxies will proactively clean up connections that are idle for a certain period of time (controlled by the proxy's idle configuration). In addition to paying attention to the idle configuration of the proxies, the connection pool also needs to keep alive or probe connections.
 
 If you often see the following error in your Java application:
 
diff --git a/br/backup-and-restore-use-cases.md b/br/backup-and-restore-use-cases.md
index 42f61e981cf7f..369ffffb62497 100644
--- a/br/backup-and-restore-use-cases.md
+++ b/br/backup-and-restore-use-cases.md
@@ -86,7 +86,7 @@ The BR tool already supports self-adapting to GC. It automatically registers `ba
 For the detailed usage of the `br backup` command, refer to [Use BR Command-line for Backup and Restoration](/br/use-br-command-line-tool.md).
 
 1. Before executing the `br backup` command, ensure that no DDL is running on the TiDB cluster.
-2. Ensure that the storage device where the backup will be created has sufficient space.
+2. Ensure that the storage device where the backup will be created has sufficient space (no less than 1/3 of the disk space of the backup cluster).
 
 ### Preparation for restoration
 
diff --git a/daily-check.md b/daily-check.md
index 7e4dbde5e262b..7f7b337e5356c 100644
--- a/daily-check.md
+++ b/daily-check.md
@@ -40,7 +40,7 @@ You can locate the slow SQL statement executed in the cluster. Then you can opti
 + `miss-peer-region-count`: The number of Regions without enough replicas. This value is not always greater than `0`.
 + `extra-peer-region-count`: The number of Regions with extra replicas. These Regions are generated during the scheduling process.
 + `empty-region-count`: The number of empty Regions, generated by executing the `TRUNCATE TABLE`/`DROP TABLE` statement. If this number is large, you can consider enabling `Region Merge` to merge Regions across tables.
-+ `pending-peer-region-count`: The number of Regions with outdated Raft logs. It is normal that a few pending peers are generated in the scheduling process. However, it is not normal if this value is large for a period of time.
++ `pending-peer-region-count`: The number of Regions with outdated Raft logs. It is normal that a few pending peers are generated in the scheduling process. However, it is not normal if this value is large for a period of time (longer than 30 minutes).
 + `down-peer-region-count`: The number of Regions with an unresponsive peer reported by the Raft leader.
 + `offline-peer-region-count`: The number of Regions during the offline process.
 
diff --git a/dm/deploy-a-dm-cluster-using-tiup.md b/dm/deploy-a-dm-cluster-using-tiup.md
index 05de051fb951c..27234998dc044 100644
--- a/dm/deploy-a-dm-cluster-using-tiup.md
+++ b/dm/deploy-a-dm-cluster-using-tiup.md
@@ -15,7 +15,7 @@ TiUP supports deploying DM v2.0 or later DM versions. This document introduces h
 
 ## Prerequisites
 
-When DM performs a full data replication task, the DM-worker is bound with only one upstream database. The DM-worker first exports the full amount of data locally, and then imports the data into the downstream database. Therefore, the worker's host needs sufficient storage space (The storage path is specified later when you create the task).
+When DM performs a full data replication task, the DM-worker is bound with only one upstream database. The DM-worker first exports the full amount of data locally, and then imports the data into the downstream database. Therefore, the worker's host must have enough storage space to store all the upstream tables to be exported. The storage path is specified later when you create the task.
 
 In addition, you need to meet the [hardware and software requirements](/dm/dm-hardware-and-software-requirements.md) when deploying a DM cluster.
 
diff --git a/dumpling-overview.md b/dumpling-overview.md
index 6b5ecf502f818..43752df6d6e04 100644
--- a/dumpling-overview.md
+++ b/dumpling-overview.md
@@ -317,9 +317,9 @@ When Dumpling is exporting a large single table from TiDB, Out of Memory (OOM) m
 + Reduce the value of `--tidb-mem-quota-query` to `8589934592` (8 GB) or lower. `--tidb-mem-quota-query` controls the memory usage of a single query statement in TiDB.
 + Adjust the `--params "tidb_distsql_scan_concurrency=5"` parameter. [`tidb_distsql_scan_concurrency`](/system-variables.md#tidb_distsql_scan_concurrency) is a session variable which controls the concurrency of the scan operations in TiDB.
 
-### TiDB GC settings when exporting a large volume of data
+### TiDB GC settings when exporting a large volume of data (more than 1 TB)
 
-When exporting data from TiDB, if the TiDB version is later than or equal to v4.0.0 and Dumpling can access the PD address of the TiDB cluster, Dumpling automatically extends the GC time without affecting the original cluster.
+When exporting data from TiDB (more than 1 TB), if the TiDB version is later than or equal to v4.0.0 and Dumpling can access the PD address of the TiDB cluster, Dumpling automatically extends the GC time without affecting the original cluster.
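If Dumpling cannot reach PD, or if the TiDB version is earlier than v4.0.0, the GC time is not extended automatically, so you can check and adjust the GC lifetime yourself before the export. The following is a minimal sketch, assuming TiDB v5.0 or later where `tidb_gc_life_time` is exposed as a system variable; `720h` is an illustrative value, not a recommendation:

```sql
-- Check the current GC lifetime before starting the export.
SHOW GLOBAL VARIABLES LIKE 'tidb_gc_life_time';

-- Temporarily extend the GC lifetime for the duration of the export (illustrative value).
SET GLOBAL tidb_gc_life_time = '720h';
```

Restore the original value after the export finishes, because an overly long GC lifetime keeps old data versions and increases storage usage.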
 In other scenarios, if the data size is very large, to avoid export failure due to GC during the export process, you can extend the GC time in advance:
 
diff --git a/faq/manage-cluster-faq.md b/faq/manage-cluster-faq.md
index 6e028507f86a0..c93dc7087d5a8 100644
--- a/faq/manage-cluster-faq.md
+++ b/faq/manage-cluster-faq.md
@@ -425,6 +425,6 @@ This section describes common problems you may encounter during backup and resto
 
 ### How to back up data in TiDB?
 
-Currently, for the backup of a large volume of data, the preferred method is using [BR](/br/backup-and-restore-tool.md). Otherwise, the recommended tool is [Dumpling](/dumpling-overview.md). Although the official MySQL tool `mysqldump` is also supported in TiDB to back up and restore data, its performance is worse than [BR](/br/backup-and-restore-tool.md) and it needs much more time to back up and restore large volumes of data.
+Currently, for the backup of a large volume of data (more than 1 TB), the preferred method is using [BR](/br/backup-and-restore-tool.md). Otherwise, the recommended tool is [Dumpling](/dumpling-overview.md). Although the official MySQL tool `mysqldump` is also supported in TiDB to back up and restore data, its performance is worse than [BR](/br/backup-and-restore-tool.md) and it needs much more time to back up and restore large volumes of data.
 
 For more FAQs about BR, see [BR FAQs](/br/backup-and-restore-faq.md).
diff --git a/migrate-large-mysql-to-tidb.md b/migrate-large-mysql-to-tidb.md
index f29354d47efe1..f96b993cf384e 100644
--- a/migrate-large-mysql-to-tidb.md
+++ b/migrate-large-mysql-to-tidb.md
@@ -28,7 +28,7 @@ This document describes how to migrate large datasets from MySQL to TiDB. The wh
 
 **Disk space**:
 
-- Dumpling requires enough disk space to store the whole data source. SSD is recommended.
+- Dumpling requires enough disk space to store the whole data source, that is, all the upstream tables to be exported. SSD is recommended.
 - During the import, TiDB Lightning needs temporary space to store the sorted key-value pairs. The disk space should be enough to hold the largest single table from the data source.
 - If the full data volume is large, you can increase the binlog storage time in the upstream. This is to ensure that the binlogs are not lost during the incremental replication.
 
@@ -78,7 +78,7 @@ The target TiKV cluster must have enough disk space to store the imported data.
     |-`B` or `--database` |Specifies a database to be exported|
     |`-f` or `--filter` |Exports tables that match the pattern. Refer to [table-filter](/table-filter.md) for the syntax.|
 
-    Make sure `${data-path}` has enough space to store the exported data. To prevent the export from being interrupted by a large table consuming all the spaces, it is strongly recommended to use the `-F` option to limit the size of a single file.
+    Make sure `${data-path}` has enough space to store all the exported upstream tables. To prevent the export from being interrupted by a large table consuming all the space, it is strongly recommended to use the `-F` option to limit the size of a single file.
 
 2. View the `metadata` file in the `${data-path}` directory. This is a Dumpling-generated metadata file. Record the binlog position information, which is required for the incremental replication in Step 3.
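As a cross-check of the binlog position recorded in `metadata`, you can also query the upstream MySQL instance directly. This is an illustrative sketch rather than part of the documented procedure, and it assumes the connected user has the `REPLICATION CLIENT` privilege:

```sql
-- Run on the upstream MySQL instance. Compare the File and Position columns
-- with the binlog name and offset recorded in the Dumpling-generated metadata file.
SHOW MASTER STATUS;
```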
diff --git a/pd-control.md b/pd-control.md
index d396d127fe434..04cc9f4cc8b0c 100644
--- a/pd-control.md
+++ b/pd-control.md
@@ -255,19 +255,19 @@ Usage:
     >> config set region-schedule-limit 2    // 2 tasks of Region scheduling at the same time at most
     ```
 
-- `replica-schedule-limit` controls the number of tasks scheduling the replica at the same time. This value affects the scheduling speed when the node is down or removed. A larger value means a higher speed and setting the value to 0 closes the scheduling. Usually the replica scheduling has a large load, so do not set a too large value.
+- `replica-schedule-limit` controls the number of tasks scheduling the replica at the same time. This value affects the scheduling speed when the node is down or removed. A larger value means a higher speed and setting the value to 0 closes the scheduling. Usually the replica scheduling has a large load, so do not set a too large value. Note that this configuration item is usually kept at the default value. If you want to change the value, you need to try a few values to see which one works best according to the real situation.
 
     ```bash
     >> config set replica-schedule-limit 4    // 4 tasks of replica scheduling at the same time at most
     ```
 
-- `merge-schedule-limit` controls the number of Region Merge scheduling tasks. Setting the value to 0 closes Region Merge. Usually the Merge scheduling has a large load, so do not set a too large value.
+- `merge-schedule-limit` controls the number of Region Merge scheduling tasks. Setting the value to 0 closes Region Merge. Usually the Merge scheduling has a large load, so do not set a too large value. Note that this configuration item is usually kept at the default value. If you want to change the value, you need to try a few values to see which one works best according to the real situation.
 
     ```bash
     >> config set merge-schedule-limit 16    // 16 tasks of Merge scheduling at the same time at most
     ```
 
-- `hot-region-schedule-limit` controls the hot Region scheduling tasks that are running at the same time. Setting its value to `0` means to disable the scheduling. It is not recommended to set a too large value, otherwise it might affect the system performance.
+- `hot-region-schedule-limit` controls the hot Region scheduling tasks that are running at the same time. Setting its value to `0` means disabling the scheduling. It is not recommended to set a too large value. Otherwise, it might affect the system performance. Note that this configuration item is usually kept at the default value. If you want to change the value, you need to try a few values to see which one works best according to the real situation.
 
     ```bash
     >> config set hot-region-schedule-limit 4    // 4 tasks of hot Region scheduling at the same time at most
diff --git a/releases/release-5.3.0.md b/releases/release-5.3.0.md
index 1bf11d5bc3890..d8d77147bf68f 100644
--- a/releases/release-5.3.0.md
+++ b/releases/release-5.3.0.md
@@ -117,7 +117,7 @@ In v5.3, the key new features or improvements are as follows:
 
     Support the `ALTER TABLE [PARTITION] ATTRIBUTES` statement that allows you to set attributes for a table or partition. Currently, TiDB only supports setting the `merge_option` attribute. By adding this attribute, you can explicitly control the Region merge behavior.
 
-    User scenarios: When you perform the `SPLIT TABLE` operation, if no data is inserted after a certain period of time, the empty Regions are automatically merged by default. In this case, you can set the table attribute to `merge_option=deny` to avoid the automatic merging of Regions.
+    User scenarios: When you perform the `SPLIT TABLE` operation, if no data is inserted after a certain period of time (controlled by the PD parameter [`split-merge-interval`](/pd-configuration-file.md#split-merge-interval)), the empty Regions are automatically merged by default. In this case, you can set the table attribute to `merge_option=deny` to avoid the automatic merging of Regions.
 
     [User document](/table-attributes.md), [#3839](https://github.com/tikv/pd/issues/3839)
 
diff --git a/schedule-replicas-by-topology-labels.md b/schedule-replicas-by-topology-labels.md
index 361994bd2e4e2..a34cfd066b016 100644
--- a/schedule-replicas-by-topology-labels.md
+++ b/schedule-replicas-by-topology-labels.md
@@ -167,7 +167,7 @@ Then, assume that the number of cluster replicas is 5 (`max-replicas=5`). Becaus
 
 In the case of the 5-replica configuration, if z3 fails or is isolated as a whole, and cannot be recovered after a period of time (controlled by `max-store-down-time`), PD will make up the 5 replicas through scheduling. At this time, only 4 hosts are available. This means that host-level isolation cannot be guaranteed and that multiple replicas might be scheduled to the same host. But if the `isolation-level` value is set to `zone` instead of being left empty, this specifies the minimum physical isolation requirements for Region replicas. That is to say, PD will ensure that replicas of the same Region are scattered among different zones. PD will not perform corresponding scheduling even if following this isolation restriction does not meet the requirement of `max-replicas` for multiple replicas.
 
-For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time, PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment.
+For example, a TiKV cluster is distributed across three data zones z1, z2, and z3. Each Region has three replicas as required, and PD distributes the three replicas of the same Region to these three data zones respectively. If a power outage occurs in z1 and cannot be recovered after a period of time (controlled by [`max-store-down-time`](/pd-configuration-file.md#max-store-down-time), which defaults to 30 minutes), PD determines that the Region replicas on z1 are no longer available. However, because `isolation-level` is set to `zone`, PD needs to strictly guarantee that different replicas of the same Region will not be scheduled on the same data zone. Because both z2 and z3 already have replicas, PD will not perform any scheduling under the minimum isolation level restriction of `isolation-level`, even if there are only two replicas at this moment.
 
 Similarly, when `isolation-level` is set to `rack`, the minimum isolation level applies to different racks in the same data center. With this configuration, the isolation at the zone layer is guaranteed first if possible. When the isolation at the zone level cannot be guaranteed, PD tries to avoid scheduling different replicas to the same rack in the same zone. The scheduling works similarly when `isolation-level` is set to `host` where PD first guarantees the isolation level of rack, and then the level of host.
 
diff --git a/shard-row-id-bits.md b/shard-row-id-bits.md
index 1aba21c2417b3..9877c6d9e2255 100644
--- a/shard-row-id-bits.md
+++ b/shard-row-id-bits.md
@@ -11,7 +11,7 @@ This document introduces the `SHARD_ROW_ID_BITS` table attribute, which is used
 
 For the tables with a non-integer primary key or no primary key, TiDB uses an implicit auto-increment row ID. When a large number of `INSERT` operations are performed, the data is written into a single Region, causing a write hot spot.
 
-To mitigate the hot spot issue, you can configure `SHARD_ROW_ID_BITS`. The row IDs are scattered and the data are written into multiple different Regions. But setting an overlarge value might lead to an excessively large number of RPC requests, which increases the CPU and network overheads.
+To mitigate the hot spot issue, you can configure `SHARD_ROW_ID_BITS`. The row IDs are scattered and the data are written into multiple different Regions.
 
 - `SHARD_ROW_ID_BITS = 4` indicates 16 shards
 - `SHARD_ROW_ID_BITS = 6` indicates 64 shards
diff --git a/statement-summary-tables.md b/statement-summary-tables.md
index 730ffb7c6ca06..431635f24d590 100644
--- a/statement-summary-tables.md
+++ b/statement-summary-tables.md
@@ -139,11 +139,11 @@ The system variables above have two scopes: global and session. These scopes wor
 
 > **Note:**
>
-> The `tidb_stmt_summary_history_size`, `tidb_stmt_summary_max_stmt_count`, and `tidb_stmt_summary_max_sql_length` configuration items affect memory usage. It is recommended that you adjust these configurations based on your needs. It is not recommended to set them too large values.
+> The `tidb_stmt_summary_history_size`, `tidb_stmt_summary_max_stmt_count`, and `tidb_stmt_summary_max_sql_length` configuration items affect memory usage. It is recommended that you adjust these configurations based on your needs, the SQL size, SQL count, and machine configuration. It is not recommended to set them to excessively large values. You can calculate the memory usage using `tidb_stmt_summary_history_size` \* `tidb_stmt_summary_max_stmt_count` \* `tidb_stmt_summary_max_sql_length` \* `3`.
 
 ### Set a proper size for statement summary
 
-After the system has run for a period of time, you can check the `statement_summary` table to see whether SQL eviction has occurred. For example:
+After the system has run for a period of time (depending on the system load), you can check the `statement_summary` table to see whether SQL eviction has occurred. For example:
 
 ```sql
 select @@global.tidb_stmt_summary_max_stmt_count;
diff --git a/storage-engine/titan-overview.md b/storage-engine/titan-overview.md
index d1fa7ecac018b..c31e688a440c4 100644
--- a/storage-engine/titan-overview.md
+++ b/storage-engine/titan-overview.md
@@ -7,7 +7,7 @@ summary: Learn the overview of the Titan storage engine.
 
 [Titan](https://github.com/pingcap/rocksdb/tree/titan-5.15) is a high-performance [RocksDB](https://github.com/facebook/rocksdb) plugin for key-value separation. Titan can reduce write amplification in RocksDB when large values are used.
 
-When the value size in Key-Value pairs is large, Titan performs better than RocksDB in write, update, and point read scenarios. However, Titan gets a higher write performance by sacrificing storage space and range query performance. As the price of SSDs continues to decrease, this trade-off will be more and more meaningful.
+When the value size in Key-Value pairs is large (larger than 1 KB or 512 B), Titan performs better than RocksDB in write, update, and point read scenarios. However, Titan gets a higher write performance by sacrificing storage space and range query performance. As the price of SSDs continues to decrease, this trade-off will be more and more meaningful.
 
 ## Key features
 
@@ -29,7 +29,7 @@ The prerequisites for enabling Titan are as follows:
 
 - The average size of values is large, or the size of all large values accounts for much of the total value size. Currently, the size of a value greater than 1 KB is considered as a large value. In some situations, this number (1 KB) can be 512 B. Note that a single value written to TiKV cannot exceed 8 MB due to the limitation of the TiKV Raft layer. You can adjust the [`raft-entry-max-size`](/tikv-configuration-file.md#raft-entry-max-size) configuration value to relax the limit.
 - No range query will be performed or you do not need a high performance of range query. Because the data stored in Titan is not well-ordered, its performance of range query is poorer than that of RocksDB, especially for the query of a large range. According PingCAP's internal test, Titan's range query performance is 40% to a few times lower than that of RocksDB.
-- Sufficient disk space, because Titan reduces write amplification at the cost of disk space. In addition, Titan compresses values one by one, and its compression rate is lower than that of RocksDB. RocksDB compresses blocks one by one. Therefore, Titan consumes more storage space than RocksDB, which is expected and normal. In some situations, Titan's storage consumption can be twice that of RocksDB.
+- Sufficient disk space (consider reserving disk space that is twice the disk consumption of RocksDB with the same data volume). This is because Titan reduces write amplification at the cost of disk space. In addition, Titan compresses values one by one, and its compression rate is lower than that of RocksDB. RocksDB compresses blocks one by one. Therefore, Titan consumes more storage space than RocksDB, which is expected and normal. In some situations, Titan's storage consumption can be twice that of RocksDB.
 
 If you want to improve the performance of Titan, see the blog post [Titan: A RocksDB Plugin to Reduce Write Amplification](https://pingcap.com/blog/titan-storage-engine-design-and-implementation/).
 
diff --git a/system-variables.md b/system-variables.md
index a472ee73c5923..d515218fb1ca0 100644
--- a/system-variables.md
+++ b/system-variables.md
@@ -602,7 +602,7 @@ Constraint checking is always performed in place for pessimistic transactions (d
 - Unit: Rows
 - This variable is used to set the batch size during the `re-organize` phase of the DDL operation. For example, when TiDB executes the `ADD INDEX` operation, the index data needs to backfilled by `tidb_ddl_reorg_worker_cnt` (the number) concurrent workers. Each worker backfills the index data in batches.
     - If many updating operations such as `UPDATE` and `REPLACE` exist during the `ADD INDEX` operation, a larger batch size indicates a larger probability of transaction conflicts. In this case, you need to adjust the batch size to a smaller value. The minimum value is 32.
-    - If the transaction conflict does not exist, you can set the batch size to a large value. This can increase the speed of the backfilling data, but the write pressure on TiKV also becomes higher.
+    - If the transaction conflict does not exist, you can set the batch size to a large value (considering the worker count; see [Interaction Test on Online Workloads and `ADD INDEX` Operations](/benchmark/online-workloads-and-add-index-operations.md) for reference). This can increase the speed of the backfilling data, but the write pressure on TiKV also becomes higher.
 
 ### tidb_ddl_reorg_priority
 
@@ -642,7 +642,7 @@ Constraint checking is always performed in place for pessimistic transactions (d
 - This variable is used to set the concurrency of the `scan` operation.
 - Use a bigger value in OLAP scenarios, and a smaller value in OLTP scenarios.
 - For OLAP scenarios, the maximum value should not exceed the number of CPU cores of all the TiKV nodes.
-- If a table has a lot of partitions, you can reduce the variable value appropriately to avoid TiKV becoming out of memory (OOM).
+- If a table has a lot of partitions, you can reduce the variable value appropriately (based on the size of the data to be scanned and the frequency of the scans) to avoid TiKV becoming out of memory (OOM).
 
 ### tidb_dml_batch_size
 
diff --git a/ticdc/troubleshoot-ticdc.md b/ticdc/troubleshoot-ticdc.md
index 6baa4c661f3b6..917ed139bc64b 100644
--- a/ticdc/troubleshoot-ticdc.md
+++ b/ticdc/troubleshoot-ticdc.md
@@ -102,39 +102,16 @@ A replication task might be interrupted in the following known scenarios:
     2. Use the new task configuration file and add the `ignore-txn-start-ts` parameter to skip the transaction corresponding to the specified `start-ts`.
     3. Stop the old replication task via HTTP API. Execute `cdc cli changefeed create` to create a new task and specify the new task configuration file. Specify `checkpoint-ts` recorded in step 1 as the `start-ts` and start a new task to resume the replication.
 
-- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.
+- In TiCDC v4.0.13 and earlier versions, when TiCDC replicates the partitioned table, it might encounter an error that leads to replication interruption.
 
     In this scenario, TiCDC saves the task information. Because TiCDC has set the service GC safepoint in PD, the data after the task checkpoint is not cleaned by TiKV GC within the valid period of `gc-ttl`.
 
     Handling procedures:
 
-    1. Pause the replication task by executing `cdc cli changefeed pause -c <changefeed-id>`.
+    1. Pause the replication task by executing `cdc cli changefeed pause -c <changefeed-id>`.
     2. Wait for about one minute, and then resume the replication task by executing `cdc cli changefeed resume -c <changefeed-id>`.
 
 ### What should I do to handle the OOM that occurs after TiCDC is restarted after a task interruption?
 
-- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**.
-
-- In the above updated versions, you can enable the Unified Sorter to help you sort data in the disk when the system memory is insufficient. To enable this function, you can pass `--sort-engine=unified` to the `cdc cli` command when creating a replication task. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed update -c <changefeed-id> --sort-engine="unified" --pd=http://10.0.10.25:2379
-```
-
-If you fail to update your cluster to the above new versions, you can still enable Unified Sorter in **previous versions**. You can pass `--sort-engine=unified` and `--sort-dir=/path/to/sort_dir` to the `cdc cli` command when creating a replication task. For example:
-
-{{< copyable "shell-regular" >}}
-
-```shell
-cdc cli changefeed update -c <changefeed-id> --sort-engine="unified" --sort-dir="/data/cdc/sort" --pd=http://10.0.10.25:2379
-```
-
-> **Note:**
->
-> + Since v4.0.9, TiCDC supports the unified sorter engine.
-> + TiCDC (the 4.0 version) does not support dynamically modifying the sorting engine yet. Make sure that the changefeed has stopped before modifying the sorter settings.
-> + `sort-dir` has different behaviors in different versions. Refer to [compatibility notes for`sort-dir` and `data-dir`](/ticdc/ticdc-overview.md#compatibility-notes-for-sort-dir-and-data-dir), and configure it with caution.
-> + Currently, the unified sorter is an experimental feature. When the number of tables is too large (>=100), the unified sorter might cause performance issues and affect replication throughput. Therefore, it is not recommended to use it in a production environment. Before you enable the unified sorter, make sure that the machine of each TiCDC node has enough disk capacity. If the total size of unprocessed data changes might exceed 1 TB, it is not recommend to use TiCDC for replication.
+- Update your TiDB cluster and TiCDC cluster to the latest versions. The OOM problem has already been resolved in **v4.0.14 and later v4.0 versions, v5.0.2 and later v5.0 versions, and the latest versions**.
 
 ## What is `gc-ttl` in TiCDC?
 
diff --git a/tidb-configuration-file.md b/tidb-configuration-file.md
index df5cbbb627caa..022c6d67be8f3 100644
--- a/tidb-configuration-file.md
+++ b/tidb-configuration-file.md
@@ -14,7 +14,7 @@ The TiDB configuration file supports more options than command-line parameters.
 
 - Determines whether to create a separate Region for each table.
 - Default value: `true`
-- It is recommended to set it to `false` if you need to create a large number of tables.
+- It is recommended to set it to `false` if you need to create a large number of tables (for example, more than 100 thousand tables).
 
 ### `token-limit`
 
diff --git a/tidb-lightning/tidb-lightning-distributed-import.md b/tidb-lightning/tidb-lightning-distributed-import.md
index 5d58d8ee6b255..657605e65fc03 100644
--- a/tidb-lightning/tidb-lightning-distributed-import.md
+++ b/tidb-lightning/tidb-lightning-distributed-import.md
@@ -134,7 +134,7 @@ nohup tiup tidb-lightning -config tidb-lightning.toml > nohup.out &
 
 During parallel import, TiDB Lightning automatically performs the following checks after starting the task.
 
-- Check whether there is enough space on the local disk and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there may be cases where the size of the source data is less than the available space on the local disk, but still the check fails.
+- Check whether there is enough space on the local disk (controlled by the `sorted-kv-dir` configuration) and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there might be cases where the size of the source data is less than the available space on the local disk, but still the check fails.
 - Check whether the regions in the TiKV cluster are distributed evenly and whether there are too many empty regions. If the number of empty regions exceeds max(1000, number of tables * 3), i.e. greater than the bigger one of "1000" or "3 times the number of tables ", then the import cannot be executed.
 - Check whether the data is imported in order from the data sources. The size of `mydumper.batch-size` is automatically adjusted based on the result of the check. Therefore, the `mydumper.batch-size` configuration is no longer available.
 
diff --git a/tidb-lightning/tidb-lightning-prechecks.md b/tidb-lightning/tidb-lightning-prechecks.md
index b5676fbefdada..6045c2138b9ca 100644
--- a/tidb-lightning/tidb-lightning-prechecks.md
+++ b/tidb-lightning/tidb-lightning-prechecks.md
@@ -13,7 +13,7 @@ The following table describes each check item and detailed explanation.
 | ---- | ---- |---- |
 | Cluster version and status | >= 5.3.0 | Check whether the cluster can be connected in the configuration, and whether the TiKV/PD/TiFlash version supports the Local import mode when the backend mode is Local. |
 | Read permission| >= 5.3.0 | Check whether TiDB Lightning can access data stored on cloud (Amazon S3). This ensures that the import is not interrupted by the lack of the read permission. |
-| Disk space | >= 5.3.0 | Check whether there is enough space on the local disk and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there may be cases where the size of the source data is less than the available space on the local disk, but still the check fails. When the backend is Local, it also checks whether the local storage is sufficient because external sorting needs to be done locally. |
+| Disk space | >= 5.3.0 | Check whether there is enough space on the local disk and on the TiKV cluster for importing data. TiDB Lightning samples the data sources and estimates the percentage of the index size from the sample result. Because indexes are included in the estimation, there might be cases where the size of the source data is less than the available space on the local disk, but still the check fails. When the backend is Local, it also checks whether the local storage is sufficient because external sorting needs to be done locally. The local storage space used for external sorting is controlled by the `sorted-kv-dir` configuration. |
 | Region distribution status | >= 5.3.0 | Check whether the Regions in the TiKV cluster are distributed evenly and whether there are too many empty Regions. If the number of empty Regions exceeds max(1000, number of tables * 3), i.e. greater than the bigger one of "1000" or "3 times the number of tables ", then the import cannot be executed. |
 | Exceedingly Large CSV files in the data file | >= 5.3.0 | When there are CSV files larger than 10 GiB in the backup file and auto-slicing is not enabled (StrictFormat=false), it will impact the import performance. The purpose of this check is to remind you to ensure the data is in the right format and to enable auto-slicing. |
 | Recovery from breakpoints | >= 5.3.0 | This check ensures that no changes are made to the source file or schema in the database during the breakpoint recovery process that would result in importing the wrong data. |
 
diff --git a/tidb-troubleshooting-map.md b/tidb-troubleshooting-map.md
index b7c31c4f5adf2..ec848055103c6 100644
--- a/tidb-troubleshooting-map.md
+++ b/tidb-troubleshooting-map.md
@@ -13,7 +13,7 @@ This document summarizes common issues in TiDB and other components. You can use
 
 - 1.1.1 The `Region is Unavailable` error is usually because a Region is not available for a period of time. You might encounter `TiKV server is busy`, or the request to TiKV fails due to `not leader` or `epoch not match`, or the request to TiKV time out. In such cases, TiDB performs a `backoff` retry mechanism. When the `backoff` exceeds a threshold (20s by default), the error will be sent to the client. Within the `backoff` threshold, this error is not visible to the client.
 
-- 1.1.2 Multiple TiKV instances are OOM at the same time, which causes no Leader in a Region for a period of time. See [case-991](https://github.com/pingcap/tidb-map/blob/master/maps/diagnose-case-study/case991.md) in Chinese.
+- 1.1.2 Multiple TiKV instances are OOM at the same time, which causes Regions to have no Leader during the OOM period. See [case-991](https://github.com/pingcap/tidb-map/blob/master/maps/diagnose-case-study/case991.md) in Chinese.
 
 - 1.1.3 TiKV reports `TiKV server is busy`, and exceeds the `backoff` time. For more details, refer to [4.3](#43-the-client-reports-the-server-is-busy-error). `TiKV server is busy` is a result of the internal flow control mechanism and should not be counted in the `backoff` time. This issue will be fixed.
 
diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md
index d5d4882c162a4..dc07db6171df9 100644
--- a/tiflash/troubleshoot-tiflash.md
+++ b/tiflash/troubleshoot-tiflash.md
@@ -210,10 +210,6 @@ After deploying a TiFlash node and starting replication (by performing the ALTER
     - If the keyword is found, the PD schedules properly.
     - If not, the PD does not schedule properly. Contact PingCAP technical support for help.
 
-> **Note:**
->
-> When there are many small Regions in the table to be replicated, and the `region merge` parameter is enabled or set to a large value, the replication progress might stay unchanged or be reduced in a period of time.
-
 ## Data replication gets stuck
 
 If data replication on TiFlash starts normally but then all or some data fails to be replicated after a period of time, you can confirm or resolve the issue by performing the following steps:
 
diff --git a/tikv-configuration-file.md b/tikv-configuration-file.md
index fe7c17c2b87bc..97c40a2fa68a2 100644
--- a/tikv-configuration-file.md
+++ b/tikv-configuration-file.md
@@ -1537,4 +1537,4 @@ For pessimistic transaction usage, refer to [TiDB Pessimistic Transaction Mode](
 ### `pipelined`
 
 - This configuration item enables the pipelined process of adding the pessimistic lock. With this feature enabled, after detecting that data can be locked, TiKV immediately notifies TiDB to execute the subsequent requests and write the pessimistic lock asynchronously, which reduces most of the latency and significantly improves the performance of pessimistic transactions. But there is a still low probability that the asynchronous write of the pessimistic lock fails, which might cause the failure of pessimistic transaction commits.
-- Default value: `true`
\ No newline at end of file
+- Default value: `true`
diff --git a/troubleshoot-hot-spot-issues.md b/troubleshoot-hot-spot-issues.md
index 97f32d9d79d9b..05aef0c95c7c2 100644
--- a/troubleshoot-hot-spot-issues.md
+++ b/troubleshoot-hot-spot-issues.md
@@ -87,7 +87,7 @@ Hover over the bright block, you can see what table or index has a heavy load. F
 
 For a non-integer primary key or a table without a primary key or a joint primary key, TiDB uses an implicit auto-increment RowID. When a large number of `INSERT` operations exist, the data is written into a single Region, resulting in a write hotspot.
 
-By setting `SHARD_ROW_ID_BITS`, RowID are scattered and written into multiple Regions, which can alleviates the write hotspot issue. However, if you set `SHARD_ROW_ID_BITS` to an over large value, the number of RPC requests will be enlarged, increasing CPU and network overhead.
+By setting `SHARD_ROW_ID_BITS`, row IDs are scattered and written into multiple Regions, which can alleviate the write hotspot issue.
 
 ```
 SHARD_ROW_ID_BITS = 4 # Represents 16 shards.