[BACKPORT 2024.1][#20336] YSQL: GUC to avoid kReadRestart Errors with Bounded Staleness Guarantees

Summary:
Original commit: 2724346 / D34002

Read restart errors are an error scenario specific to distributed databases; they do not occur in a single-node database such as PostgreSQL. These errors occur due to clock skew, usually when reads run concurrently with writes to the same data (refer to https://docs.yugabyte.com/preview/architecture/transactions/read-restart-error/ for details).

Read restart errors are thrown to maintain the "read-after-commit-visibility" guarantee: any client-issued read should see all data that was committed before the read request was issued (even in the presence of clock skew between nodes). In other words, the following example should always work:
(1) User X commits some data (for which the db picks a commit timestamp, say ht1).
(2) User X then informs user Y about the commit via a channel outside the database (say, a phone call).
(3) User Y then issues a read to some YB node, which picks a read time (less than ht1 due to clock skew).
(4) It should not happen that user Y gets an output without the data that user Y was informed about.

To ensure this guarantee, when the database performs a read at read_time, it picks a global_limit (= read_time + max_clock_skew_usec). If it finds any matching records for the read in the window (read_time, global_limit], there is a chance for the above guarantee to be broken. In this case docdb throws a kReadRestart error.

However, users migrating from PostgreSQL are surprised by this error. Moreover, some users may not be in a position to change their application code to handle this new error scenario.

The number of kReadRestart errors thrown to the external client is currently reduced by retrying the transaction/statement at the query layer or at the docdb layer. A retry at the docdb layer is possible when this is the first RPC in a transaction/statement and no read time was picked yet on the query layer.
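The uncertainty-window check described above can be sketched in a few lines of C++. This is only an illustration under simplifying assumptions: plain integers stand in for hybrid times, and the names `ReadWindow`, `StrictWindow`, `Visibility` and `Classify` are hypothetical, not taken from the docdb code.

```cpp
#include <cstdint>

// Hypothetical stand-in for a hybrid timestamp (microseconds only).
using TimeMicros = std::uint64_t;

// A read is performed at read_time; records committed in
// (read_time, global_limit] are "uncertain" because of clock skew.
struct ReadWindow {
  TimeMicros read_time;
  TimeMicros global_limit;
};

// Strict behavior: global_limit = read_time + max_clock_skew_usec.
inline ReadWindow StrictWindow(TimeMicros read_time, TimeMicros max_clock_skew_usec) {
  return {read_time, read_time + max_clock_skew_usec};
}

enum class Visibility { kVisible, kRestartRequired, kInvisible };

// Classifies a record by its commit time relative to the read window.
inline Visibility Classify(const ReadWindow& window, TimeMicros commit_time) {
  if (commit_time <= window.read_time) {
    return Visibility::kVisible;          // committed at or before the read time
  }
  if (commit_time <= window.global_limit) {
    return Visibility::kRestartRequired;  // inside the uncertainty window
  }
  return Visibility::kInvisible;          // definitely committed after the read
}
```

With a read time of 100 and a maximum skew of 50 in this sketch, a record committed at 120 falls inside the window and would correspond to a kReadRestart error, while a record committed at 151 is simply not visible.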
The query layer retries have the following limitations:
- They are limited by `ysql_output_buffer_size`: if the YSQL-to-client buffer fills up and some data was already sent to the client, YSQL can't retry the whole query on a new read point.
- They have higher tail latency, sometimes leading to statement timeouts or retry exhaustion.
- kReadRestart errors are not retried for statements other than the first one in a transaction block in Repeatable Read isolation.

This change gives users the opposite tradeoff: sacrificing the read-after-commit-visibility guarantee in exchange for ease of use.

Minimizing read restart errors is a multi-stage plan and here we take the first step: provide users with a GUC `yb_read_after_commit_visibility` to relax the guarantee.

Configurable Options:
1. strict
  * Default.
  * Same behavior as current.
2. relaxed
  * Ignores clock skew and sets the global_limit to be the same as read_time, thus ignoring the uncertainty window.
  * Picks the read time on the query layer and not on the storage layer. This is necessary so that users do not miss commits from their own session.

For simplicity, the relaxed option does not affect transactions unless they are marked **READ ONLY**. Handling generic transactions is more involved because of read-write operations. This may be handled in a future change. Moreover, we ignore DDLs and catalog requests for the purposes of this revision.

The rest of this summary discusses the semantics of the relaxed option, i.e. which guarantees are retained even in relaxed mode.

(1) Same Session Guarantees

Reads never miss writes from their own session.

|//conn1//|
|--------|
|INSERT  |
|SELECT  | <--- always observes the preceding DML statements.

Providing this guarantee is less obvious than one would think.

(1a) The read time of SELECT should not be lower than the commit time of the preceding INSERT operation.
The INSERT itself may pick its commit time at any node in the distributed database. However, the hybrid time is propagated back to the local proxy. As a result, the SELECT statement's read time will be higher than the preceding commit time as long as the read time is picked on the local proxy, i.e. we do not pick the read time on some remote docdb.

Corollary 1a: the read time of read-only queries must be picked on the local proxy whenever we relax the yb_read_after_commit_visibility guarantee.

Tested in **PgReadAfterCommitVisibilityTest.SameSessionRecency**.

(1b) If, hypothetically, we were to pick the read time on DocDB in spite of corollary 1a, that would lead to another problem: DocDB picks safe time as the read time. This is potentially a time in the past and might be smaller than the commit time of the INSERT before the SELECT. So, ignoring the uncertainty window on docdb might lead to the SELECT not seeing the prior INSERT from the same connection.

Corollary 1b: do not ignore the uncertainty window when the read time is picked on the storage layer. This cannot happen with read-only statements and transactions since we always pick the read time on the local proxy.

Tested in **PgSingleTServerTest.NoSafeTimeClampingInRelaxedReadAfterCommit**.

(1c) Server-side connection pooling should not sacrifice the above same-session guarantee. Since
- server-side connection pooling multiplexes connections only within the same node, and
- there is a common proxy tserver across all pg connections on the node,
we are guaranteed to see commits within the same session even with server-side connection pooling in effect.

Tested in **PgReadAfterCommitVisibilityTest.SamePgNodeRecency**.

Client-side connection pooling is out of scope for this discussion (especially in the case of node failures, smart drivers, etc).

(2) Different Session Guarantees

Relaxed mode does not provide the read-after-commit-visibility guarantee for writes from a different session. We still have good consistency guarantees, nonetheless.
(2a) The first guarantee is consistent prefix.

| //conn1// | //conn2// |
| ...       | INSERT 1  |
| ...       | INSERT 2  |
| SELECT    | ...       |

First, the SELECT statement on conn1 need not observe the `INSERT 2` statement on conn2 even though the insert happens before the SELECT in real time. This may happen in a distributed database because of clock skew between different machines (and no uncertainty window).

Next, if SELECT does observe INSERT 2, it must also observe INSERT 1 (and all the preceding statements). This is the consistent prefix guarantee and is maintained by the fact that INSERT 2 will always have a higher commit time than INSERT 1.

(2b) Monotonic Reads

| //conn1// | //conn2// |
| ...       | INSERT 1  |
| ...       | INSERT 2  |
| SELECT 1  | ...       |
| SELECT 2  | ...       |

A closely related guarantee is monotonic reads. If SELECT 1 observes INSERT 1, then SELECT 2 also observes INSERT 1 (and possibly more, such as INSERT 2). This is because SELECT 2 has a higher read time than SELECT 1, since read time increases monotonically within the same session.

Note that this would not be the case if we let SELECT pick the read time on the storage layer instead of forcing it to be picked on the proxy. Explanation: safe time is not //necessarily// affected by the most recent hybrid time propagation since it is potentially a time in the past.

(2c) Bounded Staleness

| //conn1// | //conn2// |
| ...       | INSERT 1  | <--- 500ms old
| ...       | INSERT 2  |
| SELECT 1  | ...       |

This is the most intuitive property. Since physical clocks do not skew by more than max_clock_skew_usec, SELECTs always see INSERTs that are older than max_clock_skew_usec.

In practice, the staleness bound is even lower, since the skew between hybrid times (not physical times) across machines is the more relevant metric here. Hybrid times stay close to each other across nodes because of the regular exchange of messages between yb-tservers and yb-master.
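The bounded staleness argument in (2c) can be made concrete with a small C++ model. It is a sketch under stated assumptions: each node's clock is modeled as true time plus a fixed offset, hybrid time propagation is ignored (it can only tighten the bound), and the names `Node`, `Timestamp`, `RelaxedReadSeesWrite` and `WorstCaseStalenessUsec` are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model: a node's clock is true time plus a fixed offset.
struct Node {
  std::int64_t clock_offset_usec;
};

// Timestamp a node would assign at the given true time.
inline std::int64_t Timestamp(const Node& node, std::int64_t true_time_usec) {
  return true_time_usec + node.clock_offset_usec;
}

// In relaxed mode there is no uncertainty window, so a read observes a
// write only if the write's commit time is at or below the read time.
inline bool RelaxedReadSeesWrite(const Node& writer, const Node& reader,
                                 std::int64_t write_true_time_usec,
                                 std::int64_t read_true_time_usec) {
  return Timestamp(writer, write_true_time_usec) <=
         Timestamp(reader, read_true_time_usec);
}

// A write can be missed only while it is younger than the amount by which
// the writer's clock runs ahead of the reader's clock.
inline std::int64_t WorstCaseStalenessUsec(const Node& writer, const Node& reader) {
  return std::max<std::int64_t>(0, writer.clock_offset_usec - reader.clock_offset_usec);
}
```

In this model, with the writer's clock 300 usec ahead of the reader's, a read issued 100 usec after the write misses it, while any read issued at least 300 usec after the write observes it. Because the offset difference never exceeds the maximum clock skew, staleness is bounded by max_clock_skew_usec.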
Tested in **PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeStaleRead** and **PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeBoundedStaleness**.

(3) Thematic worst-case scenario

Here, we discuss the type of workload that is most susceptible to stale reads. For a stale read to occur,
- The read must touch a node with a higher time (than the pg connection). This is more likely when the read touches a lot of nodes.
- The writes don't touch enough nodes to ensure hybrid time is propagated to the query layer of the node that performs the read. This happens when the writes are single-row inserts/updates.

Therefore, thematically, we are most likely to miss recent writes with the relaxed option when high-throughput single-row DML ops run concurrently with a long read that touches a lot of rows.

Backport-through: 2024.1

**Upgrade/Rollback safety:**

Fortunately, the only change in proto files is in pg_client.proto. pg_client.proto is used exclusively for communication between postgres and the local tserver proxy layer. During upgrades, once a node is upgraded, both Pg and the local tserver are upgraded. Therefore, both of them understand this new field.

Moreover, even though the read behavior is changed in the new relaxed mode, it is only changed for upgraded nodes. Non-upgraded nodes do not require any knowledge of changes in the upgraded nodes because the existing interface between the query and storage layers works well to support this new feature.

No auto flags are necessary.

Jira: DB-9323

Test Plan:
Jenkins

**In TestPgTransparentRestarts**

1. When yb_read_after_commit_visibility is strict, long reads that exceed the ysql_output_buffer_size threshold raise a read restart error to the client since they cannot be handled transparently.

```
./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLong
```

2. When yb_read_after_commit_visibility is relaxed, we silently ignore read restart errors for read-only txns/stmts.
```
./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLong_relaxedReadAfterCommitVisibility
```

3. For execution of prepared statements, relaxed mode must be set before the execute command and not necessarily before the prepare command.

```
./yb_build.sh --java-test TestPgTransparentRestarts#selectStarLongExecute_relaxedReadAfterCommitVisibility
```

4. We raise no read restart errors even after transaction restarts due to conflicts. This is because we decided to relax the guarantee only for read-only queries/transactions. In addition, read-only ops do not run into any transaction conflicts.

**In PgReadAfterCommitVisibilityTest**

1. Same session read-after-commit-visibility guarantee.

```
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SameSessionRecency
```

2. Same node read-after-commit-visibility guarantee.

```
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SamePgNodeRecency
```

3. Sessions connecting to Pg on different nodes - bounded staleness guarantee.

```
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeStaleRead
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.SessionOnDifferentNodeBoundedStaleness
```

1. Guard ourselves against this scenario:
- Read time is picked on docdb.
- Read uncertainty window is ignored.
- The picked time is safe time, which is a time before the previous statement in the same session.
- We miss recent updates from the same session because the read time is in the past.

```
./yb_build.sh --java-test PgSingleTServerTest.NoSafeTimeClampingInRelaxedReadAfterCommit
```

2. We never ignore the read uncertainty window (and thus do not relax the read-after-commit-visibility guarantee) with
- INSERT/UPDATE/DELETE
- inserts/updates in a WITH clause

```
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDuplicateInsertCheck
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionUpdateKeyCheck
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDeleteKeyCheck
./yb_build.sh --cxx-test pg_txn-test --gtest_filter PgReadAfterCommitVisibilityTest.NewSessionDmlHidden
```

3. We also avoid this scenario (because relaxed mode does not affect INSERTs):
- Two concurrent inserts to the same key, both single shard.
- The read time of the insert is picked on the local proxy (because this is in relaxed mode).
- This read time is used for checking transaction conflicts happening to the same key.
- However, the conflict resolution step on the RegularDB is skipped in single-shard inserts, see GitHub issue #19407.

```
./yb_build.sh --cxx-test pgwrapper_pg_read_time-test --gtest_filter PgReadTimeTest.CheckRelaxedReadAfterCommitVisibility
```

Reviewers: pjain, smishra, rthallam

Reviewed By: pjain

Subscribers: ybase, yql, tnayak, hsunder, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36197
yugabyte · Jun 27, 2024 · 025c231
1 parent 717fbc5
commit 025c231
Showing 25 changed files with 987 additions and 12 deletions.
diff --git a/java/yb-pgsql/src/test/java/org/yb/pgsql/TestPgTransparentRestarts.java b/java/yb-pgsql/src/test/java/org/yb/pgsql/TestPgTransparentRestarts.java
@@ -383,6 +383,58 @@ public PreparedStatement createStatement(Connection conn) throws Exception {
     }.runTest();
   }
 
+  /**
+   * No restarts expected when yb_read_after_commit_visibility is relaxed.
+   * Note that this is a long read that exceeds the output buffer size.
+   */
+  @Test
+  public void selectStarLong_relaxedReadAfterCommitVisibility() throws Exception {
+    new RegularStatementTester(
+        getConnectionBuilder(),
+        "SELECT * FROM test_rr",
+        getLongString(),
+        false /* expectRestartErrors */
+    ) {
+      @Override
+      public String getReadAfterCommitVisibility() {
+        return "relaxed";
+      }
+    }.runTest();
+  }
+
+  /*
+   * Ensures that we need not set the yb_read_after_commit_visibility GUC
+   * before the prepare statement.
+   * This test guards us against scenarios where the GUC
+   * is captured in the prepared statement to be used during execution.
+   *
+   * Example: pg_hint_plan is a planner time configuration and is to be
+   * captured in the prepare phase to be used during execution.
+   *
+   * Also, use simple query mode to test that simultaneously.
+   */
+  @Test
+  public void selectStarLongExecute_relaxedReadAfterCommitVisibility() throws Exception {
+    new RegularStatementTester(
+        getConnectionBuilder().withPreferQueryMode("simple"),
+        "EXECUTE select_stmt(0)",
+        getLongString(),
+        false /* expectRestartErrors */) {
+
+      @Override
+      public Statement createStatement(Connection conn) throws Exception {
+        Statement stmt = super.createStatement(conn);
+        stmt.execute("PREPARE select_stmt (int) AS SELECT * FROM test_rr WHERE i >= $1");
+        return stmt;
+      };
+
+      @Override
+      public String getReadAfterCommitVisibility() {
+        return "relaxed";
+      }
+    }.runTest();
+  }
+
   /**
    * The following two methods attempt to test retries on kReadRestart for all below combinations -
    *    1. Type of statement - UPDATE/DELETE.
@@ -886,9 +938,16 @@ private Boolean expectConflictErrors(IsolationLevel isolation) throws Exception
           !is_wait_on_conflict_concurrency_control;
     }
 
+    // Override this function to set a different read window behavior.
+    public String getReadAfterCommitVisibility() {
+      return "strict";
+    }
+
     @Override
     public List<ThrowingRunnable> getRunnableThreads(
         ConnectionBuilder cb, BooleanSupplier isExecutionDone) {
+      String setReadAfterCommitVisibility = "SET yb_read_after_commit_visibility TO ";
+
       List<ThrowingRunnable> runnables = new ArrayList<>();
       //
       // Singular SELECT statement (equal probability of being either serializable/repeatable read/
@@ -926,6 +985,14 @@ public List<ThrowingRunnable> getRunnableThreads(
             auxSerializableStatement.execute(LOG_RESTARTS_SQL);
             auxRrStatement.execute(LOG_RESTARTS_SQL);
             auxRcStatement.execute(LOG_RESTARTS_SQL);
+
+            // SET yb_read_after_commit_visibility
+            auxSerializableStatement.execute(
+              setReadAfterCommitVisibility + getReadAfterCommitVisibility());
+            auxRrStatement.execute(
+              setReadAfterCommitVisibility + getReadAfterCommitVisibility());
+            auxRcStatement.execute(
+              setReadAfterCommitVisibility + getReadAfterCommitVisibility());
           }
 
           for (/* No setup */; !isExecutionDone.getAsBoolean(); /* NOOP */) {
@@ -1020,8 +1087,18 @@ public List<ThrowingRunnable> getRunnableThreads(
               Stmt stmt = createStatement(selectTxnConn)) {
             try (Statement auxStmt = selectTxnConn.createStatement()) {
               auxStmt.execute(LOG_RESTARTS_SQL);
+
+              // SET yb_read_after_commit_visibility
+              auxStmt.execute(setReadAfterCommitVisibility + getReadAfterCommitVisibility());
             }
             selectTxnConn.setAutoCommit(false);
+            // This is a read only txn, so setReadOnly.
+            // Moreover, yb_read_after_commit_visibility option relies on txn being read only.
+            if (isolation != IsolationLevel.SERIALIZABLE) {
+              // SERIALIZABLE, READ ONLY txns are not actually serializable
+              // txns and we wish to test SERIALIZABLE txns too.
+              selectTxnConn.setReadOnly(true);
+            }
             for (/* No setup */; !isExecutionDone.getAsBoolean(); ++txnsAttempted) {
               int numCompletedOps = 0;
               try {

diff --git a/src/postgres/src/backend/access/transam/xact.c b/src/postgres/src/backend/access/transam/xact.c
@@ -3829,6 +3829,10 @@ BeginTransactionBlock(void)
 				 BlockStateAsString(s->blockState));
 			break;
 	}
+
+	/* YB: Notify pggate that we are within a txn block. */
+	if (IsYugaByteEnabled())
+		HandleYBStatus(YBCPgSetInTxnBlock(true));
 }
 
 /*
@@ -4157,6 +4161,10 @@ BeginImplicitTransactionBlock(void)
 	 */
 	if (s->blockState == TBLOCK_STARTED)
 		s->blockState = TBLOCK_IMPLICIT_INPROGRESS;
+
+	/* YB: Notify pggate that we are within an (implicit) txn block. */
+	if (IsYugaByteEnabled())
+		HandleYBStatus(YBCPgSetInTxnBlock(true));
 }
 
 /*
@@ -5091,7 +5099,7 @@ CommitSubTransaction(void)
 
 	/* Conserve sticky object count before popping transaction state. */
 	s->parent->ybUncommittedStickyObjectCount = s->ybUncommittedStickyObjectCount;
-	
+
 	PopTransaction();
 }
 

diff --git a/src/postgres/src/backend/executor/execMain.c b/src/postgres/src/backend/executor/execMain.c
@@ -162,8 +162,10 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * We have lower-level defenses in CommandCounterIncrement and elsewhere
 	 * against performing unsafe operations in parallel mode, but this gives a
 	 * more user-friendly error message.
+	 *
+	 * YB: We also notify pggate whether the statement is read only.
 	 */
-	if ((XactReadOnly || IsInParallelMode()) &&
+	if ((IsYugaByteEnabled() || XactReadOnly || IsInParallelMode()) &&
 		!(eflags & EXEC_FLAG_EXPLAIN_ONLY))
 		ExecCheckXactReadOnly(queryDesc->plannedstmt);
 
@@ -762,11 +764,14 @@ ExecCheckRTEPermsModified(Oid relOid, Oid userid, Bitmapset *modifiedCols,
  * Note: in a Hot Standby this would need to reject writes to temp
  * tables just as we do in parallel mode; but an HS standby can't have created
  * any temp tables in the first place, so no need to check that.
+ *
+ * YB: We also notify pggate whether the statement is read only.
  */
 static void
 ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
 {
 	ListCell   *l;
+	bool		yb_is_read_only = true;
 
 	/*
 	 * Fail if write permissions are requested in parallel mode for table
@@ -786,10 +791,21 @@ ExecCheckXactReadOnly(PlannedStmt *plannedstmt)
 			continue;
 
 		PreventCommandIfReadOnly(CreateCommandTag((Node *) plannedstmt));
+		yb_is_read_only = false;
 	}
 
 	if (plannedstmt->commandType != CMD_SELECT || plannedstmt->hasModifyingCTE)
+	{
 		PreventCommandIfParallelMode(CreateCommandTag((Node *) plannedstmt));
+		yb_is_read_only = false;
+	}
+
+	if (IsYugaByteEnabled())
+	{
+		if (plannedstmt->rowMarks)
+			yb_is_read_only = false;
+		HandleYBStatus(YBCPgSetReadOnlyStmt(yb_is_read_only));
+	}
 }
 
 

diff --git a/src/postgres/src/backend/utils/misc/guc.c b/src/postgres/src/backend/utils/misc/guc.c
@@ -219,6 +219,7 @@ static bool check_transaction_priority_upper_bound(double *newval, void **extra,
 extern void YBCAssignTransactionPriorityUpperBound(double newval, void* extra);
 extern double YBCGetTransactionPriority();
 extern TxnPriorityRequirement YBCGetTransactionPriorityType();
+static bool yb_check_no_txn(int* newval, void **extra, GucSource source);
 
 static void assign_yb_pg_batch_detection_mechanism(int new_value, void *extra);
 static void assign_ysql_upgrade_mode(bool newval, void *extra);
@@ -485,6 +486,12 @@ static struct config_enum_entry shared_memory_options[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry yb_read_after_commit_visibility_options[] = {
+  {"strict", YB_STRICT_READ_AFTER_COMMIT_VISIBILITY, false},
+  {"relaxed", YB_RELAXED_READ_AFTER_COMMIT_VISIBILITY, false},
+  {NULL, 0, false}
+};
+
 /*
  * Options for enum values stored in other modules
  */
@@ -5444,6 +5451,46 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, assign_yb_pg_batch_detection_mechanism, NULL
 	},
 
+	{
+		/*
+		 * Read-after-commit-visibility guarantee: any client issued read
+		 * should see all data that was committed before the read request
+		 * was issued (even in the presence of clock skew between nodes).
+		 * In other words, the following example should always work:
+		 * (1) User X commits some data (for which the db picks a commit
+		 * 	timestamp say ht1)
+		 * (2) Then user X communicates to user Y to inform about the commit
+		 * 	via a channel outside the database (say a phone call)
+		 * (3) Then user Y issues a read to some YB node which picks a
+		 * 	read time (less than ht1 due to clock skew)
+		 * (4) Then it should not happen that user Y gets an output without
+		 * 	the data that user Y was informed about.
+		 */
+		{
+			"yb_read_after_commit_visibility", PGC_USERSET, CUSTOM_OPTIONS,
+			gettext_noop("Control read-after-commit-visibility guarantee."),
+			gettext_noop(
+				"This GUC is intended as a crutch for users migrating from PostgreSQL and new to"
+				" read restart errors. Users can now largely avoid these errors when"
+				" read-after-commit-visibility guarantee is not a strong requirement."
+				" This option cannot be set from within a transaction block."
+				" Configure one of the following options:"
+				" (a) strict: Default Behavior. The read-after-commit-visibility guarantee is"
+				" maintained by the database. However, users may see read restart errors that"
+				" show \"ERROR:  Query error: Restart read required at: ...\". The database"
+				" attempts to retry on such errors internally but that is not always possible."
+				" (b) relaxed: With this option, the read-after-commit-visibility guarantee is"
+				" relaxed. Read only statements/transactions do not see read restart errors but"
+				" may miss recent updates with staleness bounded by clock skew."
+			),
+			0
+		},
+		&yb_read_after_commit_visibility,
+		YB_STRICT_READ_AFTER_COMMIT_VISIBILITY,
+		yb_read_after_commit_visibility_options,
+		yb_check_no_txn, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, NULL, NULL, NULL, NULL
@@ -12807,5 +12854,22 @@ yb_check_toast_catcache_threshold(int *newVal, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * YB: yb_check_no_txn
+ *
+ * Do not allow users to set yb_read_after_commit_visibility
+ * from within a txn block.
+ */
+static bool
+yb_check_no_txn(int *newVal, void **extra, GucSource source)
+{
+	if (IsTransactionBlock())
+	{
+		GUC_check_errdetail("Cannot be set within a txn block.");
+		return false;
+	}
+	return true;
+}
+
 
 #include "guc-file.c"
diff --git a/src/yb/common/common_fwd.h b/src/yb/common/common_fwd.h
@@ -21,6 +21,7 @@
 #include "yb/common/ql_protocol.fwd.h"
 #include "yb/common/redis_protocol.fwd.h"
 #include "yb/common/wire_protocol.fwd.h"
+#include "yb/util/strongly_typed_bool.h"
 
 namespace yb {
 
@@ -65,6 +66,8 @@ enum class DataType;
 
 enum class SortingType;
 
+YB_STRONGLY_TYPED_BOOL(ClampUncertaintyWindow);
+
 namespace common {
 
 class Jsonb;

diff --git a/src/yb/common/consistent_read_point.cc b/src/yb/common/consistent_read_point.cc
@@ -33,8 +33,11 @@ void ConsistentReadPoint::SetReadTimeUnlocked(
   restarts_.clear();
 }
 
-void ConsistentReadPoint::SetCurrentReadTimeUnlocked() {
-  SetReadTimeUnlocked(ReadHybridTime::FromHybridTimeRange(clock_->NowRange()));
+void ConsistentReadPoint::SetCurrentReadTimeUnlocked(const ClampUncertaintyWindow clamp) {
+  SetReadTimeUnlocked(
+    clamp
+    ? ReadHybridTime::SingleTime(clock_->Now())
+    : ReadHybridTime::FromHybridTimeRange(clock_->NowRange()));
 }
 
 void ConsistentReadPoint::SetReadTime(
@@ -43,9 +46,9 @@ void ConsistentReadPoint::SetReadTime(
   SetReadTimeUnlocked(read_time, &local_limits);
 }
 
-void ConsistentReadPoint::SetCurrentReadTime() {
+void ConsistentReadPoint::SetCurrentReadTime(const ClampUncertaintyWindow clamp) {
   std::lock_guard lock(mutex_);
-  SetCurrentReadTimeUnlocked();
+  SetCurrentReadTimeUnlocked(clamp);
 }
 
 Status ConsistentReadPoint::TrySetDeferredCurrentReadTime() {
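
The effect of the new `clamp` parameter in `SetCurrentReadTimeUnlocked` can be illustrated with a simplified C++ sketch. The types below (`ClockRange`, `ReadTime`) and the helpers `PickReadTime` and `HasUncertaintyWindow` are hypothetical stand-ins that use plain integers; they assume that a single-time read point has a global limit equal to its read time and that a range-based read point takes its global limit from the upper end of the clock range.

```cpp
#include <cstdint>

// Hypothetical stand-in for the (now, max global now) pair returned by a clock.
struct ClockRange {
  std::uint64_t now;
  std::uint64_t max_global_now;
};

// Hypothetical, simplified read point: a read time plus a global limit.
struct ReadTime {
  std::uint64_t read;
  std::uint64_t global_limit;
};

// With clamp set, the global limit collapses onto the read time, so there is
// no uncertainty window. Otherwise the window extends to max_global_now.
inline ReadTime PickReadTime(const ClockRange& range, bool clamp) {
  return clamp ? ReadTime{range.now, range.now}
               : ReadTime{range.now, range.max_global_now};
}

// True when records committed after the read time could still trigger a restart.
inline bool HasUncertaintyWindow(const ReadTime& read_time) {
  return read_time.global_limit > read_time.read;
}
```

In this sketch a clamped read point never reports an uncertainty window, which mirrors why relaxed mode does not surface read restart errors for reads whose time is picked on the local proxy.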

diff --git a/src/yb/common/consistent_read_point.h b/src/yb/common/consistent_read_point.h
@@ -41,7 +41,9 @@ class ConsistentReadPoint {
   void MoveFrom(ConsistentReadPoint* rhs);
 
   // Set the current time as the read point.
-  void SetCurrentReadTime() EXCLUDES(mutex_);
+  // No uncertainty window when clamp is set.
+  void SetCurrentReadTime(
+    const ClampUncertaintyWindow clamp = ClampUncertaintyWindow::kFalse) EXCLUDES(mutex_);
 
   // If read point is not set, use the current time as the read point and defer it to the global
   // limit. If read point was already set, return error if it is not deferred.
@@ -91,7 +93,8 @@ class ConsistentReadPoint {
  private:
   inline void SetReadTimeUnlocked(
       const ReadHybridTime& read_time, HybridTimeMap* local_limits = nullptr) REQUIRES(mutex_);
-  void SetCurrentReadTimeUnlocked() REQUIRES(mutex_);
+  void SetCurrentReadTimeUnlocked(
+    const ClampUncertaintyWindow clamp = ClampUncertaintyWindow::kFalse) REQUIRES(mutex_);
   void UpdateLimitsMapUnlocked(
       const TabletId& tablet, const HybridTime& local_limit, HybridTimeMap* map) REQUIRES(mutex_);
   void RestartRequiredUnlocked(const TabletId& tablet, const ReadHybridTime& restart_time)

diff --git a/src/yb/integration-tests/mini_cluster.cc b/src/yb/integration-tests/mini_cluster.cc
@@ -874,6 +874,11 @@ server::SkewedClockDeltaChanger JumpClock(
   return server::SkewedClockDeltaChanger(delta, skewed_clock);
 }
 
+server::SkewedClockDeltaChanger JumpClock(
+    tserver::MiniTabletServer* server, std::chrono::milliseconds delta) {
+  return JumpClock(server->server(), delta);
+}
+
 std::vector<server::SkewedClockDeltaChanger> SkewClocks(
     MiniCluster* cluster, std::chrono::milliseconds clock_skew) {
   std::vector<server::SkewedClockDeltaChanger> delta_changers;

diff --git a/src/yb/integration-tests/mini_cluster.h b/src/yb/integration-tests/mini_cluster.h
@@ -298,6 +298,15 @@ class MiniCluster : public MiniClusterBase {
   PortPicker port_picker_;
 };
 
+// Requires that skewed clock is registered as physical clock.
+// Jumps the physical clock by delta
+// i.e. no effect on hybrid ts unless by physical clock.
+// new_clock = old_clock + delta (clocks are moving).
+// Returns an RAII structure that resets the delta change
+// when it goes out of scope.
+server::SkewedClockDeltaChanger JumpClock(
+    tserver::MiniTabletServer* server, std::chrono::milliseconds delta);
+
 MUST_USE_RESULT std::vector<server::SkewedClockDeltaChanger> SkewClocks(
     MiniCluster* cluster, std::chrono::milliseconds clock_skew);
 

diff --git a/src/yb/tserver/pg_client.proto b/src/yb/tserver/pg_client.proto
@@ -607,6 +607,12 @@ message PgPerformOptionsPB {
   int64 pg_txn_start_us = 22;
 
   AshMetadataPB ash_metadata = 23;
+
+  // When set,
+  // - Sets the read time locally.
+  // - Clamps the read uncertainty window.
+  // See the commit desc for more info.
+  bool clamp_uncertainty_window = 24;
 }
 
 message PgPerformRequestPB {