Compute release 2025-02-07 #10713

vipvap · 2025-02-07T07:00:59Z

Compute release 2025-02-07

Please merge this Pull Request using 'Create a merge commit' button

## Problem Because dashmap 6 switched to hashbrown RawTable API, it required us to use unsafe code in the upgrade: #8107 ## Summary of changes Switch to clashmap, a fork maintained by me which removes much of the unsafe and ultimately switches to HashTable instead of RawTable to remove much of the unsafe requirement on us.

Update `tokio` base crates and their deps. Pin `tokio` to at least 1.41 which stabilized task ID APIs. To dedup `mio` dep the `notify` crate is updated. It's used in `compute_tools`. https://github.com/neondatabase/neon/blob/9f81828429ad6475b4fbb1a814240213b74bec63/compute_tools/src/pg_helpers.rs#L258-L367

…10585) ## Problem test_scrubber_tenant_snapshot restarts pageservers, but log validation fails tests on any non white listed storcon warnings, making the test flaky. ## Summary of changes Allow warns like 2025-01-29T12:37:42.622179Z WARN reconciler{seq=1 tenant_id=2011077aea9b4e8a60e8e8a19407634c shard_id=0004}: Call to node 2 (localhost:15352) management API failed, will retry (attempt 1): receive body: error sending request for url (http://localhost:15352/v1/tenant/2011077aea9b4e8a60e8e8a19407634c-0004/location_config): client error (Connect) ref #10462

…10594) ## Problem If offloading races with normal shutdown, we get a "failed to freeze and flush: cannot flush frozen layers when flush_loop is not running, state is Exited". This is harmless but points to it being quite strange to try and freeze and flush such a timeline. flushing on shutdown for an archived timeline isn't useful. Related: #10389 ## Summary of changes - During Timeline::shutdown, ignore ShutdownMode::FreezeAndFlush if the timeline is archived

We want to keep the AWS SDK up to date as that way we benefit from new developments and improvements. Prior update was in #10056

## Problem This test may not fully detect data corruption during splits, since we don't force-compact the entire keyspace. ## Summary of changes Force-compact all data in `test_sharding_autosplit`.

@myrrc

…rl (#10575) Ref: neondatabase/cloud#23461 ## Problem Just made changes around and see these 2 base layers could be optimised. and after review comment from @myrrc setting up timeouts and retries in `alpine/curl` image ## Summary of changes

## Problem We want to check that `diesel print-schema` doesn't generate any changes (`storage_controller/src/schema.rs`) in comparison with the list of migration. ## Summary of changes - Add `diesel_cli` to `build-tools` image - Add `Check diesel schema` step to `build-neon` job, at this stage we have all required binaries, so don't need to compile anything additionally - Check runs only on x86 release builds to be sure we do it at least once per CI run.

Update to a rebased version of our rust-postgres patches, rebased on [this](sfackler/rust-postgres@98f5a11) commit this time. With #10280 reapplied, this means that the rust-postgres crates will be deduplicated, as the new crate versions are finally compatible with the requirements of diesel-async. Earlier update: #10561 rust-postgres PR: neondatabase/rust-postgres#39

## Problem There are two (related) problems with the previous handling of `cargo-deny`: - When a new advisory is added to rustsec that affects a dependency, unrelated pull requests will fail. - New advisories rely on pushes or PRs to be surfaced. Problems that already exist on main will only be found if we try to merge new things into main. ## Summary of changes We split out `cargo-deny` into a separate workflow that runs on all PRs that touch `Cargo.lock`, and on a schedule on `main`, `release`, `release-compute` and `release-proxy` to find new advisories.

## Problem This assertion is incorrect: it is legal to see another shard's data at this point, after a shard split. Closes: #10609 ## Summary of changes - Remove faulty assertion

## Problem When a client dropped before a request completed, and a handler returned an ApiError, we would log that at error severity. That was excessive in the case of a request erroring on a shutdown, and could cause test flakes. example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/13067651123/index.html#suites/ad9c266207b45eafe19909d1020dd987/6021ce86a0d72ae7/ ``` Cancelled request finished with an error: ShuttingDown ``` ## Summary of changes - Log a different info-level on ShuttingDown and ResourceUnavailable API errors from cancelled requests

## Problem Merge Queue fails if changes include Rust code. ## Summary of changes - Fix condition for `build-build-tools-image` - Add a couple of no-op `false ||` to make predicates look symmetric

…nch --initialize in benchmarking workflow (#10616) ## Problem we want to disable some steps in benchmarking workflow that do not initialize new projects and instead run the test more frequently Test run https://github.com/neondatabase/neon/actions/runs/13077737888

The primary benefit is that all the ad hoc get_matches() calls are no longer necessary. Now all it takes to get at the CLI arguments is referencing a struct member. It's also great the we can replace the ad hoc CLI struct we had with this more formal solution. Signed-off-by: Tristan Partin <[email protected]>

…nal (#10608) We forked copy_bidirectional to solve some issues like fast-shutdown (disallowing half-open connections) and to introduce better error tracking (which side of the conn closed down). A change recently made its way upstream offering performance improvements: tokio-rs/tokio#6532. These seem applicable to our fork, thus it makes sense to apply them here as well.

…ock for public internet / VPC (#10143) - Wired up filtering on VPC endpoints - Wired up block access from public internet / VPC depending on per project flag - Added cache invalidation for VPC endpoints (partially based on PR from Raphael) - Removed BackendIpAllowlist trait --------- Co-authored-by: Ivan Efremov <[email protected]>

We keep the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://releases.rs/docs/1.84.1/). Prior update was in #10328. Co-authored-by: Arpad Müller <[email protected]>

## Problem when introducing pg17 for job step `Generate matrix for OLAP benchmarks` I introduced a syntax error that only hits on Saturdays. ## Summary of changes Remove trailing comma ## successful test run https://github.com/neondatabase/neon/actions/runs/13086363907

In the safekeeper, we block deletions on the timeline's gate closing, and any `WalResidentTimeline` keeps the gate open (because it owns a gate lock object). Thus, unless the `main_task` function of a partial backup doesn't return, we can't delete the associated timeline. In order to make these tasks exit early, we call the cancellation token of the timeline upon its shutdown. However, the partial backup task wasn't looking for the cancellation while waiting to acquire a partial backup permit. On a staging safekeeper we have been in a situation in the past where the semaphore was already empty for a duration of many hours, rendering all attempted deletions unable to proceed until a restart where the semaphore was reset: https://neondb.slack.com/archives/C03H1K0PGKH/p1738416586442029

## Problem #10616 was only intended temparily during the weekend, want to reset to prior state ## Summary of changes revert #10616 but keep fixes in #10622

…shot (#10606) ## Problem This test would sometimes emit unexpected logs from the storage controller's requests to do migrations, which overlap with the test's restarts of pageservers, where those migrations are happening some time after a shard split as the controller moves load around. Example: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10602/13067323736/index.html#testresult/f66f1329557a1fc5/retries ## Summary of changes - Do a reconcile_until_idle after shard split, so that the rest of the test doesn't run concurrently with migrations

## Problem While working on #10617 I (unintentionally) merged the PR before the main CI pipeline has finished. I suspect this happens because we have received all the required job results from the pre-merge-checks workflow, which runs on PRs that include changes to relevant files. ## Summary of changes - Skip the `conclusion` job in `pre-merge-checks` workflows for PRs

Successor of #10280 after it was reverted in #10592. Re-introduce the usage of diesel-async again, but now also add TLS support so that we connect to the storcon database using TLS. By default, diesel-async doesn't support TLS, so add some code to make us explicitly request TLS. cc neondatabase/cloud#23583

…ker (#5718) (#10627) ## Problem Since [#5580](#5580) the delete and uninit file markers are no longer needed. ## Summary of changes Remove the remaining code for the delete and uninit markers. Additionally removes the `ends_with_suffix` function as it is no longer required. Closes [#5718](#5718).

Fixes flaky test_lr_with_slow_safekeeper test #10242 Fix query to `pg_catalog.pg_stat_subscription` catalog to handle table synchronization and parallel LR correctly.

@varunsh-coder

## Problem 1. First of all it's more correct 2. Current usage allows ` Time-of-Check-Time-of-Use (TOCTOU) 'Pwn Request' vulnerabilities`. Please check security slack channel or reach me for more details. I will update PR description after merge. ## Summary of changes 1. Use `actions/checkout` with `ref: ${{ github.event.pull_request.head.sha }}` Discovered by and Co-author: @varunsh-coder

Logging errors with the debug format specifier causes multi-line errors, which are sometimes a pain to deal with. Instead, we should use anyhow's alternate display format, which shows the same information on a single line. Also adjusted a couple of error messages that were stale. Fixes neondatabase/cloud#14710.

## Problem Tests for logical replication (on Staging) have been failing for some time because logical replication is not enabled for them. This issue occurred after switching to an org API key with a different default setting, where logical replication was not enabled by default. ## Summary of changes - Add `enable_logical_replication` input to `actions/neon-project-create` - Enable logical replication in `test-logical-replication` job

…10633) ## Problem * The behavior of this flag changed. Plus, it's not necessary to disable the IP check as long as there are no IPs listed in the local postgres. ## Summary of changes * Drop the flag from the command in the README.md section. * Change the postgres URL passed to proxy to not use the endpoint hostname. * Also swap postgres creation and proxy startup, so the DB is running when proxy comes up.

Add a build with sanitizers (asan, ubsan) to the CI pipeline and run tests on it. See #6053 --------- Co-authored-by: Alexander Bayandin <[email protected]>

## Problem We need a metrics to know what's going on in pageserver's background jobs. ## Summary of changes * Waiting tasks: task still waiting for the semaphore. * Running tasks: tasks doing their actual jobs. --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>

…10692) ## Problem During ingest_benchmark which uses `pgcopydb` ([see](https://github.com/dimitri/pgcopydb))we sometimes had outages. - when PostgreSQL COPY step failed we got a segfault (reported [here](dimitri/pgcopydb#899)) - the root cause was Neon idle_in_transaction_session_timeout is set to 5 minutes which is suboptimal for long-running tasks like project import (reported [here](dimitri/pgcopydb#900)) ## Summary of changes Patch pgcopydb to avoid segfault. override idle_in_transaction_session_timeout and set it to "unlimited"

#10693) ## Problem The default `GITHUB_TOKEN` is used to push changes created with `approved-for-ci-run`, which doesn't work: ``` Run git push --force origin "${BRANCH}" remote: Permission to neondatabase/neon.git denied to github-actions[bot]. fatal: unable to access 'https://github.com/neondatabase/neon/': The requested URL returned error: 403 ``` Ref: https://github.com/neondatabase/neon/actions/runs/13166108303/job/36746518291?pr=10687 ## Summary of changes - Use `CI_ACCESS_TOKEN` to clone an external repo - Remove unneeded `actions/checkout`

## Problem This test called NeonPageserver.tenant_detach, which as well as detaching locally on the pageserver, also updates the storage controller to put the tenant into Detached mode. When the test runs slowly in debug mode, it sometimes takes long enough that the background_reconcile loop wakes up and drops the tenant from memory in response, such that the pageserver can't validate its deletions and the test does not behave as expected. Closes: #10513 ## Summary of changes - Call the pageserver HTTP client directly rather than going via NeonPageserver.tenant_detach

## Problem We don't have visibility into how long an individual background job is waiting for a semaphore permit. ## Summary of changes * Make `pageserver_background_loop_semaphore_wait_seconds` a histogram rather than a sum. * Add a paced warning when a task takes more than 10 minutes to get a permit (for now). * Drive-by cleanup of some `EnumMap` usage.

#10620) (#10687) ## Problem Currently the following line below uses array subscript notation which is confusing since `reqlsns` is not an array but just a pointer to a struct. ``` XLogWaitForReplayOf(reqlsns[0].request_lsn); ``` ## Summary of changes Switch from array subscript notation to arrow operator to improve readability of code. Close #10620.

## Problem We have unclear issue with stuck s3 client, probably after partial aws sdk update without updating sdk-s3. #10588 Let's try to update s3 as well. ## Summary of changes Result of running cargo update -p aws-types -p aws-sigv4 -p aws-credential-types -p aws-smithy-types -p aws-smithy-async -p aws-sdk-kms -p aws-sdk-iam -p aws-sdk-s3 -p aws-config ref #10695

when library name doesn't match extension name. The bug was introduced by recent commit ebc55e6

## Problem Found another pgcopydb segfault in error handling ```bash 2025-02-06 15:30:40.112 51299 ERROR pgsql.c:2330 [TARGET -738302813] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.112 51298 ERROR pgsql.c:2330 [TARGET -1407749748] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.112 51297 ERROR pgsql.c:2330 [TARGET -2073308066] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.112 51300 ERROR pgsql.c:2330 [TARGET 1220908650] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.432 51300 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.513 51290 ERROR copydb.c:773 Sub-process 51300 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:40.578 51299 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:40.613 51290 ERROR copydb.c:773 Sub-process 51299 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:41.253 51298 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:41.314 51290 ERROR copydb.c:773 Sub-process 51298 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:43.133 51297 ERROR pgsql.c:2536 [Postgres] FATAL: terminating connection due to administrator command 2025-02-06 15:30:43.215 51290 ERROR copydb.c:773 Sub-process 51297 exited with code 0 and signal Segmentation fault 2025-02-06 15:30:43.215 51290 ERROR indexes.c:123 Some INDEX worker process(es) have exited with error, see above for details 2025-02-06 15:30:43.215 51290 ERROR indexes.c:59 Failed to create indexes, see above for details 2025-02-06 15:30:43.232 51271 ERROR copydb.c:768 Sub-process 51290 exited with code 12 ``` ```bashadmin@ip-172-31-38-164:~/pgcopydb$ gdb /usr/local/pgsql/bin/pgcopydb core GNU gdb (Debian 13.1-3) 13.1 Copyright (C) 2023 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "aarch64-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: <https://www.gnu.org/software/gdb/bugs/>. Find the GDB manual and other documentation resources online at: <http://www.gnu.org/software/gdb/documentation/>. For help, type "help". Type "apropos word" to search for commands related to "word"... Reading symbols from /usr/local/pgsql/bin/pgcopydb... [New LWP 51297] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1". Core was generated by `pgcopydb: create index ocr.ocr_pipeline_step_results_version_pkey '. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000aaaac3a4b030 in splitLines (lbuf=lbuf@entry=0xffffd8b86930, buffer=<optimized out>) at string_utils.c:630 630 *newLinePtr = '\0'; (gdb) bt #0 0x0000aaaac3a4b030 in splitLines (lbuf=lbuf@entry=0xffffd8b86930, buffer=<optimized out>) at string_utils.c:630 #1 0x0000aaaac3a3a678 in pgsql_execute_log_error (pgsql=pgsql@entry=0xffffd8b87040, result=result@entry=0x0, sql=sql@entry=0xffff81fe9be0 "CREATE UNIQUE INDEX IF NOT EXISTS ocr_pipeline_step_results_version_pkey ON ocr.ocr_pipeline_step_results_version USING btree (id, transaction_id);", debugParameters=debugParameters@entry=0xaaaaec5f92f0, context=context@entry=0x0) at pgsql.c:2322 #2 0x0000aaaac3a3bbec in pgsql_execute_with_params (pgsql=pgsql@entry=0xffffd8b87040, sql=0xffff81fe9be0 "CREATE UNIQUE INDEX IF NOT EXISTS ocr_pipeline_step_results_version_pkey ON ocr.ocr_pipeline_step_results_version USING btree (id, transaction_id);", paramCount=paramCount@entry=0, paramTypes=paramTypes@entry=0x0, paramValues=paramValues@entry=0x0, context=context@entry=0x0, parseFun=parseFun@entry=0x0) at pgsql.c:1649 #3 0x0000aaaac3a3c468 in pgsql_execute (pgsql=pgsql@entry=0xffffd8b87040, sql=<optimized out>) at pgsql.c:1522 #4 0x0000aaaac3a245f4 in copydb_create_index (specs=specs@entry=0xffffd8b8ec98, dst=dst@entry=0xffffd8b87040, index=index@entry=0xffff81f71800, ifNotExists=<optimized out>) at indexes.c:846 #5 0x0000aaaac3a24ca8 in copydb_create_index_by_oid (specs=specs@entry=0xffffd8b8ec98, dst=dst@entry=0xffffd8b87040, indexOid=<optimized out>) at indexes.c:410 #6 0x0000aaaac3a25040 in copydb_index_worker (specs=specs@entry=0xffffd8b8ec98) at indexes.c:297 #7 0x0000aaaac3a25238 in copydb_start_index_workers (specs=specs@entry=0xffffd8b8ec98) at indexes.c:209 #8 0x0000aaaac3a252f4 in copydb_index_supervisor (specs=specs@entry=0xffffd8b8ec98) at indexes.c:112 #9 0x0000aaaac3a253f4 in copydb_start_index_supervisor (specs=0xffffd8b8ec98) at indexes.c:57 #10 copydb_start_index_supervisor (specs=specs@entry=0xffffd8b8ec98) at indexes.c:34 #11 0x0000aaaac3a51ff4 in copydb_process_table_data (specs=specs@entry=0xffffd8b8ec98) at table-data.c:146 #12 0x0000aaaac3a520dc in copydb_copy_all_table_data (specs=specs@entry=0xffffd8b8ec98) at table-data.c:69 #13 0x0000aaaac3a0ccd8 in cloneDB (copySpecs=copySpecs@entry=0xffffd8b8ec98) at cli_clone_follow.c:602 #14 0x0000aaaac3a0d2cc in start_clone_process (pid=0xffffd8b743d8, copySpecs=0xffffd8b8ec98) at cli_clone_follow.c:502 #15 start_clone_process (copySpecs=copySpecs@entry=0xffffd8b8ec98, pid=pid@entry=0xffffd8b89788) at cli_clone_follow.c:482 #16 0x0000aaaac3a0d52c in cli_clone (argc=<optimized out>, argv=<optimized out>) at cli_clone_follow.c:164 #17 0x0000aaaac3a53850 in commandline_run (command=command@entry=0xffffd8b9eb88, argc=0, argc@entry=22, argv=0xffffd8b9edf8, argv@entry=0xffffd8b9ed48) at /home/admin/pgcopydb/src/bin/pgcopydb/../lib/subcommands.c/commandline.c:71 #18 0x0000aaaac3a01464 in main (argc=22, argv=0xffffd8b9ed48) at main.c:140 (gdb) ``` The problem is most likely that the following call returned a message in a read-only memory segment where we cannot replace \n with \0 in string_utils.c splitLines() function ```C char *message = PQerrorMessage(pgsql->connection); ``` ## Summary of changes modified the patch to also address this problem

github-actions · 2025-02-07T07:52:44Z

7451 tests run: 7098 passed, 0 failed, 353 skipped (full report)

Flaky tests (1)

Postgres 17

test_scrubber_physical_gc_ancestors[None]: release-x86-64-with-lfc

Code coverage* (full report)

functions: 33.2% (8586 of 25823 functions)
lines: 49.1% (72292 of 147256 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
ffc1a81 at 2025-02-07T13:35:22.601Z :recycle:}

conradludgate and others added 30 commits January 31, 2025 09:53

Update AWS SDK crates (#10588)

7d5c70c

We want to keep the AWS SDK up to date as that way we benefit from new developments and improvements. Prior update was in #10056

test_runner: force-compact in test_sharding_autosplit (#10605)

afbcebe

## Problem This test may not fully detect data corruption during splits, since we don't force-compact the entire keyspace. ## Summary of changes Force-compact all data in `test_sharding_autosplit`.

pageserver: remove faulty debug assertion in compaction (#10610)

a93e9f2

## Problem This assertion is incorrect: it is legal to see another shard's data at this point, after a shard split. Closes: #10609 ## Summary of changes - Remove faulty assertion

CI(pre-merge-checks): fix condition (#10617)

48c87dc

## Problem Merge Queue fails if changes include Rust code. ## Summary of changes - Fix condition for `build-build-tools-image` - Add a couple of no-op `false ||` to make predicates look symmetric

revert #10616 (#10631)

4dfe60e

## Problem #10616 was only intended temparily during the weekend, want to reset to prior state ## Summary of changes revert #10616 but keep fixes in #10622

Fix logical_replication_sync test fixture (#10531)

b1bc33e

Fixes flaky test_lr_with_slow_safekeeper test #10242 Fix query to `pg_catalog.pg_stat_subscription` catalog to handle table synchronization and parallel LR correctly.

alexanderlaw and others added 11 commits February 6, 2025 12:53

Enable sanitizers for postgres v17 (#10401)

977781e

Add a build with sanitizers (asan, ubsan) to the CI pipeline and run tests on it. See #6053 --------- Co-authored-by: Alexander Bayandin <[email protected]>

Fix remote extension lookup (#10708)

44b905d

when library name doesn't match extension name. The bug was introduced by recent commit ebc55e6

Compute release 2025-02-07

ffc1a81

vipvap requested review from a team as code owners February 7, 2025 07:01

vipvap requested review from MMeent, myrrc, jcsp, clipperhouse, piercypixel, cloneable and mikhail-sakhnov and removed request for a team February 7, 2025 07:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compute release 2025-02-07 #10713

Compute release 2025-02-07 #10713

vipvap commented Feb 7, 2025

github-actions bot commented Feb 7, 2025 •

edited

Loading

Postgres 17

Compute release 2025-02-07 #10713

Are you sure you want to change the base?

Compute release 2025-02-07 #10713

Conversation

vipvap commented Feb 7, 2025