Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[BACKPORT 2024.1][#20336] YSQL: GUC to avoid kReadRestart Errors with…
… Bounded Staleness Guarantees Summary: Original commit: 2724346 / D34002 Read restart errors are a distributed database specific error scenario that do not occur in a single node database such as PostgreSQL. These errors occur due to clock skew, usually when there are reads with simultaneous writes to that same data (refer https://docs.yugabyte.com/preview/architecture/transactions/read-restart-error/ for details). Read restart errors are thrown to maintain the "read-after-commit-visibility" guarantee: any client issued read should see all data that was committed before the read request was issued (even in the presence of clock skew between nodes). In other words, the following example should always work: (1) User X commits some data (for which the db picks a commit timestamp say ht1) (2) Then user X communicates to user Y to inform about the commit via a channel outside the database (say a phone call) (3) Then user Y issues a read to some YB node which picks a read time (less than ht1 due to clock skew) (4) Then it should not happen that user Y gets an output without the data that user Y was informed about. To ensure this guarantee, when the database performs a read at read_time, it picks a global_limit (= read_time + max_clock_skew_us). If it finds any matching records for the read in the window (read_time, global_limit], there is a chance for the above guarantee to be broken. In this case docdb throws a kReadRestart error. However, users migrating from PostgreSQL are surprised by this error. Moreover, some users may not be in a position to change their application code to handle this new error scenario. The number of kReadRestart errors thrown to the external client are reduced currently by retrying the transaction/statement at the query layer or at the docdb layer. A retry at the docdb is layer is possible when this is the first RPC in a transaction/statement and no read time was picked yet on the query layer. The query layer retries have the following limitations: - Is limited by `ysql_output_buffer_size`: if the YSQL to client buffer fills up and some data was already sent to the client, YSQL can't retry the whole query on a new read point. - Has higher tail latency, sometimes, leading to statement timeouts or retries exhaustion. - kReadRestart errors are not retried for statements other than the first one in a transaction block in Repeatable Read isolation. This change aims to provide users with an opposite tradeoff mechanism of sacrificing the read-after-commit-visibility guarantee for ease of use instead. Minimizing read restart errors is a multi-stage plan and here we take the first step. Provide users with a GUC `yb_read_after_commit_visibility` to relax the guarantee. Configurable Options: 1. strict * Default. * Same behavior as current. 2. relaxed * Ignores clock skew and sets the global_limit to be the same as read_time, thus ignoring the uncertainty window. * Pick read time on the query layer and not on the storage layer. This is necessary so that users do not miss commits from their own session. That would be bad. For simplicity, the relaxed option does not affect transactions unless they are marked **READ ONLY**. Handling generic transactions is more involved because of read-write operations. This may be handled in a future change. Moreover, we ignore DDLs and catalog requests for the purposes of this revision. In the next section, we discuss the semantics of the relaxed option. In this section, we discuss what guarantees can be retained even in the relaxed mode. (1) Same Session Guarantees The reads never miss writes from its own session. |//conn1//| |--------| |INSERT | |SELECT | <--- always observes the preceding DML statements. Providing this guarantee is less obvious than one would think. (1a) The read time of SELECT should not be lower than the commit time of the preceding INSERT operation. The insert itself may pick its commit time at any node in the distributed database. However, the hybrid time is propagated back to the local proxy. As a result, the SELECT statement's read time will be higher than the preceding commit time as long as the read time is picked on local proxy, i.e. we do not pick the read time on some remote docdb. Corollary 1a: read time of read only queries must be picked on local proxy whenever we relax the yb_read_after_commit_visibility guarantee. Tested in **PgReadAfterCommitVisibilityTest.SameSessionRecency**. (1b) If hypothetically we were to pick a read time on DocDB even after corollary 1a, that would lead to another problem too: DocDB picks safe time as the read time. This is potentially a time in the past and might be smaller than the commit time of the INSERT before the SELECT. So, ignoring the uncertainty window on docdb might lead to the SELECT not seeing the prior INSERT from the same connection. Corollary 1b: Do not ignore the uncertainty window when the read time is picked on the storage layer. This cannot happen with read-only statements & transactions since we always pick read time on the local proxy. Tested in **PgSingleTServerTest.NoSafeTimeClampingInRelaxedReadAfterCommit**. (1c) Server side connection pooling should not sacrifice the above same session guarantee. Since - server side connection pooling multiplexes connections only within the same node, and - there is a common proxy tserver across all pg connections on the node, we are guaranteed to see commits within the same session even with server-side connection pooling in effect. Tested in **PgReadAfterCommitVisibilityTest.SamePgNodeRecency**. Client-side connection pooling is out of scope for discussion (especially in the case of node failures, smart drivers, etc). (2) Different Session guarantees Relaxed mode does not provide read-after-commit-visibility guarantee with writes from a different session. We still have good consistency guarantees, nonetheless. (2a) The first guarantee is consistent prefix. | //conn1// | //conn2// | | ... | INSERT 1 | | ... | INSERT 2 | | SELECT | ... | First things first, the SELECT statement on conn1 need not observe the `INSERT 2` statement on conn2 even though the insert happens before the SELECT in real time. This may happen in a distributed database because of clock skew between different machines (and no uncertainty window). Next, if SELECT does observe INSERT 2, it must also observe INSERT 1 (and all the preceding statements). This is the consistent prefix guarantee and is maintained by the fact that INSERT 2 will always have a higher commit time than INSERT 1. (2b) Monotonic Reads | //conn1// | //conn2// | | ... | INSERT 1 | | ... | INSERT 2 | | SELECT 1 | ... | | SELECT 2 | ... | A closely related consistency is that we guarantee monotonic reads. If SELECT 1 observes INSERT 1, then SELECT 2 also observes INSERT 1 (and maybe even more such as INSERT 2). This is because SELECT 2 has a higher read time than SELECT 1 because read time increases monotonically within the same session. Note that this would not be the case if we let SELECT pick the read time on the storage layer instead of force picking it on the proxy. Explanation: safe time is not //necessarily// affected by the most recent hybrid time propagation since it is potentially a time in the past. (2c) Bounded Staleness | //conn1// | //conn2// | | ... | INSERT 1 | <--- 500ms old | ... | INSERT 2 | | SELECT 1 | ... | Most intuitive property. Since physical clocks do not skew more than max_clock_skew_usec, the SELECTs always see INSERTs that are older than max_clock_skew_usec. In practice, the staleness bound is even lower since the skew between hybrid time (not physical time) across the machines is the more relevant metric here. hybrid time is close to each other across nodes since there is a regular exchange of messages across yb-tservers and yb-master. Tested in **PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeStaleRead** and **PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeBoundedStaleness**. (3) Thematic worst-case scenario Here, we discuss the type of workload that is most susceptible to stale reads. For a stale read to occur, - The read must touch a node with a higher time (than the pg connection). More likely when the read is touching a lot of nodes. - The writes don't touch enough nodes to ensure hybrid time is propagated to the query layer of the node that performs the read. Happens when the writes are single row inserts/updates. Therefore, thematically, we are most susceptible to miss recent writes with the relaxed option when there are high throughput single-row DML ops happening concurrently with a long read that touches a lot of rows. Backport-through: 2024.1 **Upgrade/Rollback safety:** Fortunately, the only change in proto files is in pg_client.proto. pg_client.proto is used exclusively for communication between postgres and local tserver proxy layer. During upgrades once a node is upgraded, both Pg and local tserver are upgraded. Therefore, both of them understand this new field. Moreover, even though the read behavior is changed in the new relaxed mode, it is only changed for upgraded nodes. Non upgraded nodes do not require any knowledge of changes in the upgraded nodes because the existing interface between the query and storage layers works well to support this new feature. No auto flags are necessary. Jira: DB-9323 Test Plan: Jenkins **In TestPgTransparentRestarts** 1. When yb_read_after_commit_visibility is strict, Long reads that exceed the ysql_output_buffer_size threshold raise a read restart error to the client since they cannot be handled transparently. ``` ./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLong ``` 2. When yb_read_after_commit_visibility is relaxed, For read only txns/stmts, we silently ignore read restart errors. ``` ./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLong_relaxedReadAfterCommitVisibility ``` 3. For execution of prepared statements, relaxed mode must be set before the execute command and not necessarily before the prepare command. ``` ./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLongExecute_relaxedReadAfterCommitVisibility ``` 4. We raise no read restart errors even after transactions restarts due to conflicts. This is because we decided to relax the guarantee only for read only queries/transactions. In addition, read only ops do not run into any transaction conflicts. **In PgReadAfterCommitVisibilityTest** 1. Same session read-after-commit-visibility guarantee. ``` ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SameSessionRecency ``` 2. Same node read-after-commit-visibility guarantee. ``` ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SamePgNodeRecency ``` 3. Sessions connecting to Pg on different nodes - bounded staleness guarantee. ``` ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeStaleRead ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeBoundedStaleness ``` 1. Guard ourselves against this scenario - Read time is picked on docdb. - Read uncertainty window is ignored. - The picked time is safe time, which is a time before the previous statement in the same session. - We miss recent updates from the same session because the read time is in the past. ``` ./yb_build.sh --java-test PgSingleTServerTest.NoSafeTimeClampingInRelaxedReadAfterCommit ``` 2. We never ignore read uncertainty window (thus do not relax read-after-commit-visibility guarantee) with - INSERT/UPDATE/DELETE - inserts/updates in WITH clause ``` ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDuplicateInsertCheck ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionUpdateKeyCheck ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDeleteKeyCheck ./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDmlHidden ``` 3. We also avoid this scenario (because relaxed mode does not affect INSERTs) - Two concurrent inserts to the same key - both single shard. - The read time of the insert is picked on the local proxy (because this is in relaxed mode). - This read time is used for checking transaction conflicts happening to the same key. - However, the conflict resolution step on the RegularDB is skipped in single-shard inserts, see GitHub issue #19407. ``` ./yb_build.sh --cxx-test pgwrapper_pg_read_time-test --gtest_filter PgReadTimeTest.CheckRelaxedReadAfterCommitVisibility ``` Reviewers: pjain, smishra, rthallam Reviewed By: pjain Subscribers: ybase, yql, tnayak, hsunder, rthallam Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D36197
- Loading branch information