From 29543b54296b6fcc6d69b7deb6018e816f57a53a Mon Sep 17 00:00:00 2001
From: shichun-0415 <89768198+shichun-0415@users.noreply.github.com>
Date: Mon, 20 Feb 2023 17:17:05 +0800
Subject: [PATCH] Modify TiFlash's troubleshooting procedure to adapt current version (#12269)

---
 tiflash/troubleshoot-tiflash.md | 63 ++++++++------------------------
 1 file changed, 15 insertions(+), 48 deletions(-)

diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md
index 40d6998d9586b..c014f403d8329 100644
--- a/tiflash/troubleshoot-tiflash.md
+++ b/tiflash/troubleshoot-tiflash.md
@@ -127,12 +127,12 @@ After deploying a TiFlash node and starting replication (by performing the ALTER
 
     - If there is output, go to the next step.
     - If there is no output, run the `SELECT * FROM information_schema.tiflash_replica` command to check whether TiFlash replicas have been created. If not, run the `ALTER table ${tbl_name} set tiflash replica ${num}` command again, check whether other statements (for example, `add index`) have been executed, or check whether DDL executions are successful.
 
-2. Check whether the TiFlash process runs correctly.
+2. Check whether TiFlash Region replication runs correctly.
 
-    Check whether there is any change in `progress`, the `flash_region_count` parameter in the `tiflash_cluster_manager.log` file, and the Grafana monitoring item `Uptime`:
+    Check whether there is any change in `progress`:
 
-    - If yes, the TiFlash process runs correctly.
-    - If no, the TiFlash process is abnormal. Check the `tiflash` log for further information.
+    - If yes, TiFlash replication runs correctly.
+    - If no, TiFlash replication is abnormal. In `tidb.log`, search the log saying `Tiflash replica is not available`. Check whether `progress` of the corresponding table is updated. If not, check the `tiflash log` for further information. For example, search `lag_region_info` in `tiflash log` to find out which Region lags behind.
 
 3. Check whether the [Placement Rules](/configure-placement-rules.md) function has been enabled by using pd-ctl:
@@ -170,40 +170,23 @@ After deploying a TiFlash node and starting replication (by performing the ALTER
         }'
         ```
 
-5. Check whether the connection between TiDB or PD and TiFlash is normal.
+5. Check whether TiDB has created any placement rule for tables.
 
-    Search the `flash_cluster_manager.log` file for the `ERROR` keyword.
-
-    - If no `ERROR` is found, the connection is normal. Go to the next step.
-    - If `ERROR` is found, the connection is abnormal. Perform the following check.
-
-        - Check whether the log records PD keywords.
-
-            If PD keywords are found, check whether `raft.pd_addr` in the TiFlash configuration file is valid. Specifically, run the `curl '{pd-addr}/pd/api/v1/config/rules'` command and check whether there is any output in 5s.
-
-        - Check whether the log records TiDB-related keywords.
-
-            If TiDB keywords are found, check whether `flash.tidb_status_addr` in the TiFlash configuration file is valid. Specifically, run the `curl '{tidb-status-addr}/tiflash/replica'` command and check whether there is any output in 5s.
-
-        - Check whether the nodes can ping through each other.
-
-    > **Note:**
-    >
-    > If the problem persists, collect logs of the corresponding component for troubleshooting.
-
-6. Check whether `placement-rule` is created for tables.
-
-    Search the `flash_cluster_manager.log` file for the `Set placement rule … table-<table_id>-r` keyword.
+    Search the logs of TiDB DDL Owner and check whether TiDB has notified PD to add placement rules. For non-partitioned tables, search `ConfigureTiFlashPDForTable`. For partitioned tables, search `ConfigureTiFlashPDForPartitions`.
 
     - If the keyword is found, go to the next step.
     - If not, collect logs of the corresponding component for troubleshooting.
 
+6. Check whether PD has configured any placement rule for tables.
+
+    Run the `curl http://<pd-ip>:<pd-port>/pd/api/v1/config/rules/group/tiflash` command to view all TiFlash placement rules on the current PD. If a rule with the ID being `table-<table_id>-r` is found, the PD has configured a placement rule successfully.
+
 7. Check whether the PD schedules properly.
 
     Search the `pd.log` file for the `table-<table_id>-r` keyword and scheduling behaviors like `add operator`.
 
     - If the keyword is found, the PD schedules properly.
-    - If not, the PD does not schedule properly. You can [get support](/support.md) from PingCAP or the community.
+    - If not, the PD does not schedule properly.
 
 ## Data replication gets stuck
 
@@ -216,33 +199,17 @@ If data replication on TiFlash starts normally but then all or some data fails t
 
     - If the disk usage ratio is greater than or equal to the value of `low-space-ratio`, the disk space is insufficient. To relieve the disk space, remove unnecessary files, such as `space_placeholder_file` (if necessary, set `reserve-space` to 0MB after removing the file) under the `${data}/flash/` folder.
     - If the disk usage ratio is less than the value of `low-space-ratio`, the disk space is sufficient. Go to the next step.
 
-2. Check the network connectivity between TiKV, TiFlash, and PD.
-
-    In `flash_cluster_manager.log`, check whether there are any new updates to `flash_region_count` corresponding to the table that gets stuck.
-
-    - If no, go to the next step.
-    - If yes, search for `down peer` (replication gets stuck if there is a peer that is down).
+2. Check whether there is any `down peer` (a `down peer` might cause the replication to get stuck).
 
-        - Run `pd-ctl region check-down-peer` to search for `down peer`.
-        - If `down peer` is found, run `pd-ctl operator add remove-peer\ <region-id>\ <tiflash-store-id>` to remove it.
-
-3. Check CPU usage.
-
-    On Grafana, choose **TiFlash-Proxy-Details** > **Thread CPU** > **Region task worker pre-handle/generate snapshot CPU**. Check the CPU usage of `<instance-ip>:<instance-port>-region-worker`.
-
-    If the curve is a straight line, the TiFlash node is stuck. Terminate the TiFlash process and restart it, or [get support](/support.md) from PingCAP or the community.
+    Run the `pd-ctl region check-down-peer` command to check whether there is any `down peer`. If any, run the `pd-ctl operator add remove-peer <region-id> <tiflash-store-id>` command to remove it.
 
 ## Data replication is slow
 
 The causes may vary. You can address the problem by performing the following steps.
 
-1. Adjust the value of the scheduling parameters.
-
-    - Increase [`store limit`](/configure-store-limit.md#usage) to accelerate replication.
-    - Decrease [`config set patrol-region-interval 10ms`](/pd-control.md#command) to make checker scan on Regions more frequent in TiKV.
-    - Increase [`region merge`](/pd-control.md#command) to reduce the number of Regions, which means fewer scans and higher check frequencies.
+1. Increase [`store limit`](/configure-store-limit.md#usage) to accelerate replication.
 
-2. Adjust the load on TiFlsh.
+2. Adjust the load on TiFlash.
 
     Excessively high load on TiFlash can also result in slow replication. You can check the load of TiFlash indicators on the **TiFlash-Summary** panel on Grafana:
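The updated step 2 ("check whether there is any change in `progress`") can be scripted by reading `progress` from `information_schema.tiflash_replica` twice, some interval apart, and comparing the two readings. The query comes from the patched doc; the helper function and the sample values below are illustrative assumptions, not part of the patch:

```shell
# progress_stalled OLD NEW -> exit 0 if replication progress did not advance.
# In practice OLD and NEW would come from two runs, some interval apart, of:
#   SELECT progress FROM information_schema.tiflash_replica WHERE table_name = '<tbl>';
progress_stalled() {
    # An unchanged reading means no advance, so string comparison is enough.
    [ "$1" = "$2" ]
}

# Sample readings (assumed values, for illustration only).
if progress_stalled "0.5" "0.5"; then
    echo "no change in progress: search tidb.log for 'Tiflash replica is not available'"
else
    echo "replication is advancing"
fi
```

If the check reports no change, the patched procedure then points at `tidb.log` and the TiFlash log (`lag_region_info`) for the lagging Region.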
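The new step 6 ("check whether PD has configured any placement rule") can likewise be automated by filtering the PD rules endpoint output for the expected rule ID. The endpoint path and the `table-<table_id>-r` ID format come from the patch; the helper function and the inline sample JSON are assumptions sketched for illustration (a real check would pipe the `curl` output into the helper instead):

```shell
# has_tiflash_rule TABLE_ID -> exit 0 if stdin (a PD placement-rule listing)
# contains a rule whose ID is "table-<TABLE_ID>-r".
has_tiflash_rule() {
    grep -q "\"id\"[[:space:]]*:[[:space:]]*\"table-$1-r\""
}

# Illustrative sample only -- the real listing comes from:
#   curl http://<pd-ip>:<pd-port>/pd/api/v1/config/rules/group/tiflash
sample='[{"group_id":"tiflash","id":"table-45-r","role":"learner","count":1}]'

if printf '%s' "$sample" | has_tiflash_rule 45; then
    echo "placement rule found: go to step 7"
else
    echo "placement rule missing: check TiDB DDL Owner logs (step 5)"
fi
```

A missing rule points back to step 5 (TiDB never notified PD); a present rule but no `add operator` in `pd.log` points at step 7 (PD scheduling).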