Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storcon: switch to diesel-async and tokio-postgres #10280

Merged
merged 6 commits into from
Jan 27, 2025
Merged

Conversation

arpad-m
Copy link
Member

@arpad-m arpad-m commented Jan 3, 2025

Switches the storcon away from using diesel's synchronous APIs in favour of diesel-async.

Advantages:

We had to turn off usage of the connection pool for migrations, as diesel migrations don't support async APIs. Thus we still use spawn_blocking in that one place. But this is explicitly done in one of the diesel-async examples.

@arpad-m arpad-m requested a review from jcsp January 3, 2025 23:15
@arpad-m arpad-m requested a review from a team as a code owner January 3, 2025 23:15
Copy link

github-actions bot commented Jan 4, 2025

7429 tests run: 7043 passed, 0 failed, 386 skipped (full report)


Flaky tests (5)

Postgres 17

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 33.5% (8493 of 25347 functions)
  • lines: 49.3% (71479 of 145075 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
84230b8 at 2025-01-27T14:10:07.668Z :recycle:

@arpad-m
Copy link
Member Author

arpad-m commented Jan 4, 2025

Reconcile error: Cannot perform this operation while a transaction is open

not sure what the cause could be, I did a 1:1 translation, except that we don't start a transaction for running the migrations because of weiznich/diesel_async#115 . but that happens on an entirely separate connection from the reconciler where the issue occurs.

@conradludgate
Copy link
Contributor

I was curious so I tried with 1.85 with AsyncFn. It almost works, but there seems to be a limitation in proving Send bounds. I also had to manually implement the logic of TransactionManager.

@arpad-m
Copy link
Member Author

arpad-m commented Jan 8, 2025

I had Send issues as well, what worked is adding a Sync requirement. Not really beautiful, but it works :).

@arpad-m arpad-m force-pushed the arpad/diesel_async branch from 9775558 to 2f14a6e Compare January 13, 2025 02:35
@arpad-m
Copy link
Member Author

arpad-m commented Jan 13, 2025

Made it use r2d2 instead, but no avail, now it times out in the run function of TransactionBuilder when trying to figure out who the current leader is (basically hangs there).

@arpad-m arpad-m force-pushed the arpad/diesel_async branch from 2f14a6e to a661209 Compare January 14, 2025 23:57
@alexanderlaw
Copy link
Contributor

Standing on 94ab6c5, with debug logging added, I can see the following when test_check_visibility_map fails:
in storage_controller_1/storage_controller.log:

...
2025-01-23T16:58:18.265439Z  INFO reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: !!!with_conn| run...
2025-01-23T16:58:18.265452Z DEBUG reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: executing statement batch: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
...
2025-01-23T16:58:18.266355Z DEBUG reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: preparing query s48 with types [Int4, Int8, Text, Int4, Int4]: UPDATE "tenant_shards" SET "generation" 
...
2025-01-23T16:58:18.273133Z DEBUG reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: executing statement s48 with parameters: ...
...
2025-01-23T16:58:18.285455Z DEBUG reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: executing statement batch: COMMIT
...
2025-01-23T16:58:18.291589Z  INFO reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: !!!with_conn| SerializationFailure: Query(DatabaseError(SerializationFailure, "could not serialize access due to read/write dependencies among transactions"))
2025-01-23T16:58:18.291605Z DEBUG reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: Retrying transaction on serialization failure Query(DatabaseError(SerializationFailure, "could not serialize access due to read/write dependencies among transactions"))
### SerializationFailure may be expected here
2025-01-23T16:58:18.291611Z  INFO reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: !!!with_conn| run...
2025-01-23T16:58:18.291622Z  INFO reconciler{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: !!!with_conn| other Err: Query(AlreadyInTransaction)
### But not AlreadyInTransaction, which is returned by diesel's AnsiTransactionManager::begin_transaction_sql, I guess
2025-01-23T16:58:18.291871Z  WARN process_result{seq=2 tenant_id=cdf37dd4ec533a62533523a2add299a8 shard_id=0508}: Reconcile error: Cannot perform this operation while a transaction is open

in storage_controller_db/postgres.log:

...
2025-01-23 16:58:18.265 GMT [737251:7] [unknown] LOG:  statement: BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE
...
2025-01-23 16:58:18.273 GMT [737251:8] [unknown] LOG:  execute s48: UPDATE "tenant_shards" SET "generation"  ...
...
2025-01-23 16:58:18.273 GMT [737251:9] [unknown] DETAIL:  parameters: $1 = '1', $2 = '1',...
...
2025-01-23 16:58:18.285 GMT [737251:10] [unknown] LOG:  statement: COMMIT
...
2025-01-23 16:58:18.285 GMT [737251:11] [unknown] ERROR:  could not serialize access due to read/write dependencies among transactions
2025-01-23 16:58:18.285 GMT [737251:12] [unknown] DETAIL:  Reason code: Canceled on identification as a pivot, during commit attempt.
2025-01-23 16:58:18.285 GMT [737251:13] [unknown] HINT:  The transaction might succeed if retried.
2025-01-23 16:58:18.291 GMT [737251:14] [unknown] DEBUG:  unexpected EOF on client connection
...

So it looks like we can't recover after SerializationFailure.

@alexanderlaw
Copy link
Contributor

Probably something like this:

--- a/storage_controller/src/persistence.rs
+++ b/storage_controller/src/persistence.rs
@@ -293,9 +293,9 @@ async fn with_conn<'a, 'b, F, R>(&self, func: F) -> DatabaseResult<R>
             + 'a,
         R: Send + 'b,
     {
-        let mut conn = self.connection_pool.get().await?;
         let mut retry_count = 0;
         loop {
+            let mut conn = self.connection_pool.get().await?;
             match conn
                 .build_transaction()
                 .serializable()

could fix the issue. At least it makes the test pass reliably for me, even when " ERROR: could not serialize access due to read/write dependencies among transaction" found in storage_controller_db/postgres.log.

@arpad-m arpad-m force-pushed the arpad/diesel_async branch from a661209 to 8e7ce7a Compare January 24, 2025 14:24
@arpad-m
Copy link
Member Author

arpad-m commented Jan 24, 2025

Yup, the test works now when I apply the change. Good, then we don't need r2d2 and can use bb8 instead (bb8 is properly async, r2d2 which we use right now only works via blocking wrappers).

I think this means that once tests pass, we can merge this PR. Thank you so much @alexanderlaw for identifying where the issue was.

@alexanderlaw
Copy link
Contributor

By the way, maybe you would like also to get rid of

neon/Makefile

Lines 67 to 68 in 9f1408f

# Set PQ_LIB_DIR to make sure `storage_controller` get linked with bundled libpq (through diesel)
CARGO_CMD_PREFIX += PQ_LIB_DIR=$(POSTGRES_INSTALL_DIR)/v16/lib

within this PR, as PQ_LIB_DIR is specified there only for pq-sys, which should not be used after this change.

Copy link
Collaborator

@jcsp jcsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Nice to have one less spawn_blocking!

I didn't review the modified Persistence functions line-by-line, presuming they are all just wrapping in box-futures.

I took a look at our new dependencies' repos & they all seem to be actively maintained projects.

@bayandin
Copy link
Member

within this PR, as PQ_LIB_DIR is specified there only for pq-sys, which should not be used after this change.

Should we remove PQ_LIB_DIR from all places to make sure we don't need it (afaiu, we needed it only for the storage controller), sorry if late for the party.

$ git grep PQ_LIB_DIR=
  .github/workflows/_build-and-test-locally.yml:          PQ_LIB_DIR=$(pwd)/pg_install/v16/lib
  .github/workflows/_build-and-test-locally.yml:          PQ_LIB_DIR=$(pwd)/pg_install/v16/lib
  .github/workflows/build-macos.yml:        run: PQ_LIB_DIR=$(pwd)/pg_install/v17/lib cargo build --all --release -j$(sysctl -n hw.ncpu)
  .github/workflows/neon_extra_builds.yml:        run: PQ_LIB_DIR=$(pwd)/pg_install/v17/lib cargo build --all --release --timings -j$(nproc)
  Dockerfile:    && PQ_LIB_DIR=$(pwd)/pg_install/v${STABLE_PG_VERSION}/lib RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=mold -Clink-arg=-Wl,--no-rosegment -Cforce-frame-pointers=yes ${ADDITIONAL_RUSTFLAGS}" cargo build \

@arpad-m arpad-m added this pull request to the merge queue Jan 27, 2025
Merged via the queue into main with commit b0b4b7d Jan 27, 2025
85 checks passed
@arpad-m arpad-m deleted the arpad/diesel_async branch January 27, 2025 14:26
@arpad-m
Copy link
Member Author

arpad-m commented Jan 27, 2025

@bayandin I agree we should remove it from the other places too, I'll file a followup.

github-merge-queue bot pushed a commit that referenced this pull request Jan 27, 2025
We now don't need libpq any more for the build of the storage
controller, as we use `diesel-async` since #10280. Therefore, we remove
the env var that gave cargo/rustc the location for libpq.

Follow-up of #10280
arpad-m added a commit that referenced this pull request Jan 30, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jan 30, 2025
…10592)

There was a regression of #10280, tracked in
[#23583](neondatabase/cloud#23583).

I have ideas how to fix the issue, but we are too close to the release
cutoff, so revert #10280 for now. We can revert the revert later :).
github-merge-queue bot pushed a commit that referenced this pull request Jan 31, 2025
Update to a rebased version of our rust-postgres patches, rebased on
[this](sfackler/rust-postgres@98f5a11)
commit this time.

With #10280 reapplied, this means that the rust-postgres crates will be
deduplicated, as the new crate versions are finally compatible with the
requirements of diesel-async.

Earlier update: #10561

rust-postgres PR: neondatabase/rust-postgres#39
arpad-m added a commit that referenced this pull request Jan 31, 2025
github-merge-queue bot pushed a commit that referenced this pull request Feb 3, 2025
Successor of #10280 after it was reverted in #10592.

Re-introduce the usage of diesel-async again, but now also add TLS
support so that we connect to the storcon database using TLS. By
default, diesel-async doesn't support TLS, so add some code to make us
explicitly request TLS.

cc neondatabase/cloud#23583
winter-loo pushed a commit to winter-loo/neon that referenced this pull request Feb 4, 2025
Switches the storcon away from using diesel's synchronous APIs in favour
of `diesel-async`.

Advantages:

* less C dependencies, especially no openssl, which might be behind the
bug: https://github.com/neondatabase/cloud/issues/21010
* Better to only have async than mix of async plus `spawn_blocking`

We had to turn off usage of the connection pool for migrations, as
diesel migrations don't support async APIs. Thus we still use
`spawn_blocking` in that one place. But this is explicitly done in one
of the `diesel-async` examples.
winter-loo pushed a commit to winter-loo/neon that referenced this pull request Feb 4, 2025
We now don't need libpq any more for the build of the storage
controller, as we use `diesel-async` since neondatabase#10280. Therefore, we remove
the env var that gave cargo/rustc the location for libpq.

Follow-up of neondatabase#10280
winter-loo pushed a commit to winter-loo/neon that referenced this pull request Feb 4, 2025
…ase#10280)" (neondatabase#10592)

There was a regression of neondatabase#10280, tracked in
[#23583](https://github.com/neondatabase/cloud/issues/23583).

I have ideas how to fix the issue, but we are too close to the release
cutoff, so revert neondatabase#10280 for now. We can revert the revert later :).
winter-loo pushed a commit to winter-loo/neon that referenced this pull request Feb 4, 2025
Update to a rebased version of our rust-postgres patches, rebased on
[this](sfackler/rust-postgres@98f5a11)
commit this time.

With neondatabase#10280 reapplied, this means that the rust-postgres crates will be
deduplicated, as the new crate versions are finally compatible with the
requirements of diesel-async.

Earlier update: neondatabase#10561

rust-postgres PR: neondatabase/rust-postgres#39
winter-loo pushed a commit to winter-loo/neon that referenced this pull request Feb 4, 2025
…0614)

Successor of neondatabase#10280 after it was reverted in neondatabase#10592.

Re-introduce the usage of diesel-async again, but now also add TLS
support so that we connect to the storcon database using TLS. By
default, diesel-async doesn't support TLS, so add some code to make us
explicitly request TLS.

cc https://github.com/neondatabase/cloud/issues/23583
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants