[YSQL] Leverage AWS Clock Bound to reduce the number of read restarts. #21963

Closed
1 task done
pao214 opened this issue Apr 13, 2024 · 1 comment
Labels
2.23.1_blocker 2024.2 Backport Required area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@pao214
Contributor

pao214 commented Apr 13, 2024

Jira Link: DB-10879

Description

Motivation

Using timestamps to decide the order between events in a distributed system is tricky because there is an inherent clock skew between machines. YB uses a very conservative value for the clock skew. This makes it infeasible to wait out the clock skew to resolve event ordering issues.

Proposal

AWS provides a really tight error bound on clocks, see https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/.

Roadmap [TBD]

Impact on hybrid time

NtpClock picks the earliest time in the uncertainty interval as the physical clock.

Properties of new hybrid time

  1. The true time forms an upper bound on the hybrid times across the nodes in the cluster. This follows from the fact that the hybrid time is actually the earliest possible time on some node and that earliest time is still lower than the true time. If anything, the true time has progressed since then.
  2. The true time is therefore a valid global limit. Since the latest possible time in any node's uncertainty interval is at least the true time, the latest possible time is an easily computable global limit too. Thus, we have a global limit without requiring any explicit coordination across nodes in the cluster (outside NTP, of course).
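
The two properties above can be sketched as follows. This is a minimal illustration, assuming each node can read an uncertainty interval [earliest, latest] that contains the true time. The names `ClockReading`, `read_clock`, `hybrid_now`, and `global_limit` are hypothetical and not YugabyteDB APIs, and the sketch ignores the logical component of hybrid time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockReading:
    """Uncertainty interval around the true time, in microseconds (hypothetical type)."""
    earliest: int
    latest: int


def read_clock(true_time_us: int, offset_us: int, error_bound_us: int) -> ClockReading:
    """Simulate a node whose local clock is off by offset_us, where |offset_us| <= error_bound_us."""
    assert abs(offset_us) <= error_bound_us
    local = true_time_us + offset_us
    return ClockReading(local - error_bound_us, local + error_bound_us)


def hybrid_now(reading: ClockReading) -> int:
    # Property 1: the physical clock is the earliest possible time,
    # so it never exceeds the true time.
    return reading.earliest


def global_limit(reading: ClockReading) -> int:
    # Property 2: latest >= true time >= every node's hybrid time.
    return reading.latest


true_time = 1_000_000
# Three simulated nodes with (offset, error bound) pairs chosen for illustration.
nodes = [read_clock(true_time, off, err) for off, err in [(-40, 50), (0, 10), (30, 35)]]
max_hybrid = max(hybrid_now(n) for n in nodes)
```

Here every node's `global_limit` is at least `max_hybrid`, even though no node communicated with any other.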

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@pao214 pao214 added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Apr 13, 2024
@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue labels Apr 13, 2024
@pao214 pao214 changed the title from "[YSQL] Use AWS Time Sync Service to get better clock error bounds." to "[YSQL] Use AWS Clock Bound to reduce the number of read restarts." Aug 15, 2024
@pao214 pao214 changed the title from "[YSQL] Use AWS Clock Bound to reduce the number of read restarts." to "[YSQL] Leverage AWS Clock Bound to reduce the number of read restarts." Aug 15, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Aug 21, 2024
@pao214 pao214 moved this from Pending to In Progress in Wait-Queue Based Locking Sep 13, 2024
pao214 added a commit that referenced this issue Oct 9, 2024
…t errors.

Summary:
### Motivation

Prior to this revision, the physical clock uses a constant 500ms time window for the possible clock skew between any two nodes in the cluster. The skew is very conservative since it is a constant and we need to account for the worst case scenarios. This leads to an excessive number of read restart errors, see https://docs.yugabyte.com/preview/architecture/transactions/read-restart-error/.

A better approach handles the clock error dynamically. This can be done by leveraging the AWS clockbound library. Since the clock error is several orders of magnitude lower than the conservative constant bound, we raise far fewer read restart errors. In fact, the read latency improves significantly for the SqlStaleReadDetector yb-sample-apps workload.

This revision improves clock precision. It also limits the impact of faulty clocks on the cluster since only those nodes that are out of sync crash.
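
To illustrate why a smaller clock error bound means fewer read restarts, here is a simplified sketch. It assumes the rule described in the read restart documentation linked above: a record whose commit timestamp falls after the read time but within the uncertainty window is ambiguous and forces a restart. The function names, the sample commit timestamps, and the 100us dynamic bound are assumptions for illustration only. The actual database logic is more involved.

```python
FIXED_SKEW_US = 500_000  # the constant 500ms window described above


def needs_read_restart(read_time_us: int, global_limit_us: int, commit_time_us: int) -> bool:
    """A record committed in (read_time, global_limit] may or may not precede the read."""
    return read_time_us < commit_time_us <= global_limit_us


def count_restarts(read_time_us: int, window_us: int, commit_times_us: list) -> int:
    """Count records that fall inside the ambiguity window of a single read."""
    limit = read_time_us + window_us
    return sum(needs_read_restart(read_time_us, limit, t) for t in commit_times_us)


read_time = 10_000_000
# Hypothetical commit timestamps of concurrent writes, as offsets from the read time.
commits = [read_time + d for d in (-200, 50, 900, 20_000, 300_000, 600_000)]

fixed = count_restarts(read_time, FIXED_SKEW_US, commits)  # constant 500ms window
dynamic = count_restarts(read_time, 100, commits)          # assumed 100us dynamic bound
```

With these made-up inputs, four records are ambiguous under the fixed window and only one under the tighter bound.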

### About Clockbound

As mentioned above, we use the clockbound library to retrieve the uncertainty intervals for timestamps. Clockbound works in a server-client architecture where a clock-bound-d daemon is registered as a systemd service. This daemon queries chronyd for timestamp-related information and publishes the clock accuracy information and clock synchronization status to shared memory. The clockbound client then computes the current timestamp uncertainty interval based on the information in the shared memory.

NOTE: chronyd does not have sufficient information when using PTP. In such cases, clockbound augments clock error with error information from special device files.
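
The client-side computation described above can be sketched roughly as follows. This is not the clockbound API or its shared memory layout. `SharedClockState` and its field names are assumptions, and widening the bound by a worst-case drift since the last refresh is the general idea rather than the library's exact formula.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SharedClockState:
    """Hypothetical snapshot of what the daemon publishes (field names are assumptions)."""
    synchronized: bool
    as_of_ns: int       # when the daemon last refreshed the bound
    bound_ns: int       # clock error bound at as_of_ns
    max_drift_ppb: int  # assumed worst-case drift rate, in parts per billion


def uncertainty_interval(state: SharedClockState, now_ns: int) -> Tuple[int, int]:
    """Return (earliest, latest) for the local clock reading now_ns."""
    if not state.synchronized:
        raise RuntimeError("clock unsynchronized")
    elapsed_ns = max(0, now_ns - state.as_of_ns)
    # The bound widens with the time since the last refresh to account for drift.
    drift_ns = elapsed_ns * state.max_drift_ppb // 1_000_000_000
    error_ns = state.bound_ns + drift_ns
    return now_ns - error_ns, now_ns + error_ns


state = SharedClockState(synchronized=True, as_of_ns=1_000_000_000,
                         bound_ns=40_000, max_drift_ppb=1_000)
earliest, latest = uncertainty_interval(state, now_ns=3_000_000_000)
```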

### Configuration

Configuring clockbound is a two-step process.

1. Configure the system to set up precise timestamps.
2. Configure the database to use these precise timestamps.

#### System Configuration

```
# Run only when a PHC is available:
sudo bash ./bin/configure_ptp.sh
sudo bash ./bin/configure_clockbound.sh
```

#### Database Configuration

Set tserver and master gFlag `time_source=clockbound`.

#### yugabyted Configuration

Autodetects AWS clusters and recommends configuring clockbound.

Provides the `--enhance_time_sync_via_clockbound` flag in the `yugabyted start` command.

1. Prechecks for chrony and clockbound configuration.
2. Configures the database with time_source=clockbound.
3. Autodetects PTP and configures clockbound_clock_error_estimate to an appropriate value.
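
The three steps above can be sketched as a small helper. This is an illustration, not yugabyted's code: `build_time_sync_flags` and its parameters are hypothetical, and the value for `clockbound_clock_error_estimate` is left to the caller because the appropriate value is not stated here.

```python
from typing import List, Optional


def build_time_sync_flags(chrony_ok: bool, clockbound_ok: bool, ptp_detected: bool,
                          ptp_error_estimate: Optional[str] = None) -> List[str]:
    """Return the gflags to pass to tserver and master, mirroring the three steps above."""
    # Step 1: prechecks for chrony and clockbound configuration.
    if not (chrony_ok and clockbound_ok):
        raise RuntimeError("chrony and clockbound must be configured first")
    # Step 2: switch the database to the clockbound time source.
    flags = ["--time_source=clockbound"]
    # Step 3: with PTP, also set the error estimate (value supplied by the caller).
    if ptp_detected and ptp_error_estimate is not None:
        flags.append(f"--clockbound_clock_error_estimate={ptp_error_estimate}")
    return flags


ntp_flags = build_time_sync_flags(chrony_ok=True, clockbound_ok=True, ptp_detected=False)
```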

### Design

#### Clockbound Client

The clockbound client library is compiled and packaged in the third party library repo. This is a library written in Rust that is linked to tserver and accessed through its C interface.

#### Clockbound Clock

Uses the clockbound library to get the uncertainty intervals. See the comment on clockbound_clock.cc for more information.

#### Fault Tolerance

When clocks go out of sync, the node crashes and, as a result, is temporarily removed from the Raft groups it belongs to. This prevents stale read anomalies. Crashing also prevents the node from killing other nodes in the cluster, since it no longer propagates extremely skewed timestamps.
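
A minimal sketch of this crash-on-unsynchronized policy, assuming a reader callback that raises when the clock is out of sync. `ClockUnsynchronizedError` and `now_or_crash` are hypothetical names, and the injectable `exit_fn` exists only to make the sketch testable.

```python
import sys


class ClockUnsynchronizedError(RuntimeError):
    """Raised by the (hypothetical) interval reader when the clock is out of sync."""


def now_or_crash(read_interval, exit_fn=sys.exit):
    """Return (earliest, latest), or terminate the process if the clock is out of sync."""
    try:
        return read_interval()
    except ClockUnsynchronizedError as e:
        # Crashing takes this node out of its Raft groups until it restarts with a
        # healthy clock, so it can neither serve stale reads nor spread skewed timestamps.
        print(f"FATAL: couldn't get the current time: {e}", file=sys.stderr)
        exit_fn(1)


healthy = now_or_crash(lambda: (100, 120))
```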

#### Utilities

Includes the following additional utilities:

1. configure_ptp.sh
  - Installs network driver compiled with PHC.
  - Configures chrony to use PHC as refclock.
2. configure_clockbound.sh
  - Sets up chrony to give accurate timestamp uncertainty intervals.
  - Sets up the clockbound agent.
  - Sets up permissions.
3. clockbound_dump
  - Dumps the result of clockbound_now client side API.
  - Useful for computing clock error in external applications such as YBA.
Jira: DB-10879

Test Plan:
Jenkins: urgent, compile only

### Quick Benchmark (Not statistically significant)

Ran the SqlStaleReadDetector workload for 5 minutes and measured the number of restart read requests and the read latency per operation. The workload:

1. Increments random counters in write threads.
2. Aggregates the counter values in the read thread.

| Measurement           | WallClock | NtpClock | ClockboundClock | EST_ERROR=0 | NTP/PHC | PTP/PHC |
|-----------------------|-----------|----------|-----------------|-------------|---------|---------|
| Restart Read Requests | ~5k       | ~380     | ~70             | ~36         | ~5      | ~5      |
| Latency (ms/op)       | ~430      | ~150     | ~120            | ~105        | ~140*   | ~150*   |

The latencies are measured on the client side.

| Configuration | Description |
|---------------|-------------|
| **WallClock** | Current clock implementation. |
| **ClockboundClock** | Proposed wall clock compatible clock implementation. |
| **EST_ERROR=0** | When using now=earliest, global_limit=latest where reference clock is in interval [earliest, latest]. |
| **NTP/PHC** | Same, but when running the database in the US N. Virginia region where PHC is available. |
| **PTP/PHC** | Same, but using PTP for timestamps. |

*Higher latency is expected with PHC since the client is present in Oregon and the database is running in N. Virginia.
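
To put the restart counts in perspective, the ratios relative to WallClock can be computed as below. The inputs are the approximate values from the table, so the ratios are rough as well and inherit the caveat that this run is not statistically significant.

```python
# Approximate restart read request counts from the table above (5-minute run).
restarts = {
    "WallClock": 5000,
    "NtpClock": 380,
    "ClockboundClock": 70,
    "EST_ERROR=0": 36,
    "NTP/PHC": 5,
    "PTP/PHC": 5,
}

baseline = restarts["WallClock"]
# How many times fewer restarts each configuration saw compared to WallClock.
reduction = {name: round(baseline / count, 1) for name, count in restarts.items()}

for name, factor in reduction.items():
    print(f"{name}: ~{factor}x fewer restarts than WallClock")
```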

### Other benchmarks

Developed a few realistic apps in yb-sample-apps.

1. SqlEventCounter
2. SqlBankTransfers
3. SqlWarehouseStock
4. SqlMessageQueue
5. SqlConsistentHashing

They all demonstrate a reduction of several orders of magnitude in read restart errors, reinforcing the value of using AWS Time Sync Service and clockbound.

### Failure Scenarios

1. When clockbound is not set up and the user configures time_source=clockbound,

The database fails to start with an error in the tserver.err log.

```
F0826 17:47:53.453330  4432 hybrid_clock.cc:157] Couldn't get the current time: Clock unsynchronized. Status: IO error (yb/util/clockbound_time.cc:145): clockbound API failed with error: No such file or directory, and detail: open
...
```

2. When SELinux permissions are not set correctly for clockbound to access the chronyd socket,

The systemctl status shows an error:

```
Aug 26 17:55:57 ip-10-9-10-243.us-west-2.compute.internal clockbound[32122]: 2024-08-26T17:55:57.318518Z ERROR ThreadId(02) /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/clock-bound-d-1.0.0/src/chrony_poller.rs:73: No reply from chronyd. Is it running? Error: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
```
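
When triaging, the two scenarios above can be told apart by the log text alone. The helper below is a hypothetical sketch that matches only the substrings visible in the sample logs. It is not part of the database or of clockbound.

```python
def classify_clockbound_failure(log_line: str) -> str:
    """Map a log line to one of the two failure scenarios described above."""
    if "clockbound API failed" in log_line and "No such file or directory" in log_line:
        # Scenario 1: time_source=clockbound without a working clockbound setup.
        return "clockbound is not set up"
    if "No reply from chronyd" in log_line:
        # Scenario 2: the clockbound daemon cannot talk to the chronyd socket.
        return "clockbound cannot reach chronyd (check SELinux permissions)"
    return "unknown"


sample = ("Couldn't get the current time: Clock unsynchronized. Status: IO error: "
          "clockbound API failed with error: No such file or directory, and detail: open")
diagnosis = classify_clockbound_failure(sample)
```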

Backport-through: 2024.2

Reviewers: sergei, mbautin, pjain

Reviewed By: sergei, mbautin, pjain

Subscribers: svc_phabricator, mbautin, sergei, rthallam, smishra, yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37365
@pao214 pao214 moved this from In Progress to Backporting in Wait-Queue Based Locking Oct 9, 2024
pao214 added a commit that referenced this issue Oct 9, 2024
…educe read restart errors.

Summary:
Original commit: 28f27ee / D37365
Jira: DB-10879

Test Plan:
Jenkins: urgent


Backport-through: 2024.2

Reviewers: sergei, mbautin, pjain

Reviewed By: pjain

Subscribers: ybase, yql, smishra, rthallam, sergei, mbautin, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D38858
@pao214 pao214 moved this from Backporting to Done in Wait-Queue Based Locking Oct 9, 2024
@pao214
Contributor Author

pao214 commented Oct 11, 2024

Landed on master (2.23.1) and 2024.2

@pao214 pao214 closed this as completed Oct 11, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 14, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 14, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 15, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 15, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 17, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 17, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 17, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 18, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 18, 2024
pao214 added a commit to pao214/yugabyte-db that referenced this issue Oct 18, 2024
pao214 added a commit that referenced this issue Oct 18, 2024
1. Changes to manual deployment configuration with additional details added to clock sync setup.
2. Changes to database configuration after the system is set up with the clockbound systemd service.
3. Changes to read restart error doc on additional recommendation about using the new clock.
pao214 added a commit that referenced this issue Oct 22, 2024
Summary:
_exported libs are not used anywhere. Remove them.
Jira: DB-10879

Test Plan: Jenkins

Reviewers: mbautin

Reviewed By: mbautin

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38526
pao214 added a commit that referenced this issue Oct 23, 2024
Summary:
time_source does not have any secrets. Call home info on time_source is useful.

Also, time_source is a non-runtime flag.
Jira: DB-10879

Test Plan:
Jenkins

Backport-through: 2024.2

Reviewers: hsunder, smishra

Reviewed By: hsunder

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39031
pao214 added a commit that referenced this issue Oct 25, 2024
Summary:
### Azure PHC Issue

Azure VMs have hardware clocks too. However, we haven't figured out how we can use them yet. Currently, the clockbound configuration script fails with the following fatal error.

```
PHC is not available on eth0
```

**Fix:** Configure PTP only when the script runs on an AWS machine.

### Missing policycoreutils package

Install policycoreutils-devel explicitly.

### Yugabyted changes

clockbound can now be used on any cloud provider. So, alert users with a warning when using Azure or GCP as well.
Jira: DB-10879

Test Plan:
Jenkins: compile only

Ran

```
sudo bash ./bin/configure_clockbound.sh
```

on AWS, Azure, and GCP

Reviewers: nikhil, sanketh

Reviewed By: sanketh

Differential Revision: https://phorge.dev.yugabyte.com/D39224
pao214 added a commit that referenced this issue Oct 30, 2024
Summary:
Original commit: d5c096f / D39031
Jira: DB-10879

Test Plan:
Jenkins: compile only

Backport-through: 2024.2

Reviewers: hsunder, smishra

Reviewed By: hsunder

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39361
pao214 added a commit that referenced this issue Dec 4, 2024
…viders

Summary:
Original commit: 689117b / D39224

Jira: DB-10879

Test Plan:
Jenkins: urgent, compile only

Ran

```
sudo bash ./bin/configure_clockbound.sh
```

on AWS, Azure, and GCP

Reviewers: nikhil, sanketh

Reviewed By: sanketh

Differential Revision: https://phorge.dev.yugabyte.com/D39464
Projects
Status: Done
3 participants