convert `Timeline::layers` to async RwLock atop #4333 #4361

problame · 2023-05-26T18:19:00Z

convert Timeline::layers to async RwLock atop #4333

This PR refactors the original page_binutils with a single tool pagectl, use clap derive for better command line parsing, and adds the dump kv tool to extract information from delta file. This helps me better understand what's inside the page server. We can add support for other types of file and more functionalities in the future. --------- Signed-off-by: Alex Chi <[email protected]>

Probably increase CI success rate. --------- Signed-off-by: Alex Chi <[email protected]>

(Instead of going through mgr every iteration.) The `wait_for_active_tenant` function's `wait` argument could be removed because it was only used for the loop that waits for the tenant to show up in the tenants map. Since we're passing the tenant in, we now longer need to get it from the tenants map. NB that there's no guarantee that the tenant object is in the tenants map at the time the background loop function starts running. But the tenant mgr guarantees that it will be quite soon. See `tenant_map_insert` way upwards in the call hierarchy for details. This is prep work to eliminate `subscribe_for_state_updates` (PR #4299 ) Fixes: #3501

## Describe your changes ## Issue ticket number and link ## Checklist before requesting a review - [x] I have performed a self-review of my code. - [x] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Before this patch, it would use error type `TenantStateError` which has many more error variants than can actually happen with `mgr::get_tenant`. Along the way, I also introduced `SetNewTenantConfigError` because it uses `mgr::get_tenant` and also can only fail in much fewer ways than `TenantStateError` suggests. The new `page_service.rs`'s `GetActiveTimelineError` and `GetActiveTenantError` types were necessary to avoid an `Other` variant on the `GetTenantError`. This patch is a by-product of reading code that subscribes to `Tenant::state` changes. Can't really connect it to any given project.

Note this also changes the status code to the (correct) 404. Not sure if that's relevant to Console. Context: https://neondb.slack.com/archives/C04PSBP2SAF/p1684746238831449?thread_ts=1684742106.169859&cid=C04PSBP2SAF Atop #4300 because it cleans up the mgr::get_tenant() error type and I want eyes on that PR.

(This is prep work to make `Timeline::activate` infallible.) This patch removes the global storage_broker client instance from the pageserver codebase. Instead, pageserver startup instantiates it and passes it down to the `Timeline::activate` function, which in turn passes it to the WalReceiver, which is the entity that actually uses it. Patch series: - #4316 - #4317 - #4318 - #4319

proceeding #4237, this PR bumps AWS dependencies along with all other dependencies to the latest compatible semver. Signed-off-by: Alex Chi <[email protected]>

Flakyness most likely introduced in #4170, detected in https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4232/4980691289/index.html#suites/542b1248464b42cc5a4560f408115965/18e623585e47af33. Opted to allow it globally because it can happen in other tests as well, basically whenever compaction is enabled and we stop pageserver gracefully.

Routine `vm-builder` version bump, from autoscaling repo release. You can find the release notes here: https://github.com/neondatabase/autoscaling/releases/tag/v0.8.0 The changes are from v0.7.2 — most of them were already included in v0.7.3-alpha3. Of particular note: This (finally) fixes the cgroup issues, so we should now be able to scale up when we're about to run out of memory. **NB:** This has the effect of limit the DB's memory usage in a way it wasn't limited before. We may run into issues because of that. There is currently no way to disable that behavior, other than switching the endpoint back to the k8s-pod provisioner.

Timeline::activate() was only fallible because `launch_wal_receiver` was. `launch_wal_receiver` was fallible only because of some preliminary checks in `WalReceiver::start`. Turns out these checks can be shifted to the type system by delaying creatinon of the `WalReceiver` struct to the point where we activate the timeline. The changes in this PR were enabled by my previous refactoring that funneled the broker_client from pageserver startup to the activate() call sites. Patch series: - #4316 - #4317 - #4318 - #4319

## Problem Test `test_metric_collection` become flaky: ``` AssertionError: assert not ['2023-05-25T14:03:41.644042Z ERROR metrics_collection: failed to send metrics: reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: Some(18022), path: "/billing/api/v1/usage_events", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })) }', ...] ``` I suspect it is caused by having 2 places when we define `httpserver_listen_address` fixture (which is internally used by `pytest-httpserver` plugin) ## Summary of changes - Remove the definition of `httpserver_listen_address` from `test_runner/regress/test_ddl_forwarding.py` and keep one in `test_runner/fixtures/neon_fixtures.py` - Also remote unused `httpserver_listen_address` parameter from `test_proxy_metric_collection`

We currently have a semaphore based rate limiter which we hope will keep us under S3 limits. However, the semaphore does not consider time, so I've been hesitant to raise the concurrency limit of 100. See #3698. The PR Introduces a leaky-bucket based rate limiter instead of the `tokio::sync::Semaphore` which will allow us to raise the limit later on. The configuration changes are not contained here.

We used to generate the ID, if the caller didn't specify it. That's bad practice, however, because network is never fully reliable, so it's possible we create a new tenant but the caller doesn't know about it, and because it doesn't know the tenant ID, it has no way of retrying or checking if it succeeded. To discourage that, make it mandatory. The web control plane has not relied on the auto-generation for a long time.

## Problem GitHub Autocomment script posts a comment only for PRs. It's harder to debug failed tests on main or release branches. ## Summary of changes - Change the GitHub Autocomment script to be able to post a comment to either a PR or a commit of a branch

This reverts commit 024109f for it failing to be speed up anything, but run into more errors. See: #3698.

…ct_level0_phase1 Without this, the seocnd read().unwrap() becomes an await point, which makes the future not-Send, but, we require it to be Send because it runs inside task_mgr::spawn, which requires the Fut's to be Send

The checkpointer loop isn't running anyway, so, there's no risk of blocking it through the pre-lock. (cherry picked from commit 1b26633)

…utex internally fails with cs@devvm:[~/src/neon]: cargo check -p pageserver --features testing Checking pageserver v0.1.0 (/home/cs/src/neon/pageserver) error: future cannot be sent between threads safely --> pageserver/src/tenant/timeline/walreceiver/connection_manager.rs:426:33 | 426 | let connection_handle = TaskHandle::spawn(move |events_sender, cancellation| { | ^^^^^^^^^^^^^^^^^ future created by async block is not `Send` | = help: within `Instrumented<[async block@pageserver/src/tenant/timeline/walreceiver/connection_manager.rs:427:13: 439:14]>`, the trait `std::marker::Send` is not implemented for `std::sync::RwLockReadGuard<'_, LayerMap<dyn PersistentLayer>>` note: future is not `Send` as this value is used across an await --> pageserver/src/tenant/timeline.rs:872:46 | 850 | let layers = self.layers.read().unwrap(); | ------ has type `std::sync::RwLockReadGuard<'_, LayerMap<dyn PersistentLayer>>` which is not `Send` ... 872 | self.freeze_inmem_layer(true).await; | ^^^^^^ await occurs here, with `layers` maybe used later ... 881 | } | - `layers` is later dropped here note: required by a bound in `TaskHandle::<E>::spawn` --> pageserver/src/tenant/timeline/walreceiver.rs:196:52 | 192 | fn spawn<Fut>( | ----- required by a bound in this ... 196 | Fut: Future<Output = anyhow::Result<()>> + Send, | ^^^^ required by this bound in `TaskHandle::<E>::spawn` error: could not compile `pageserver` due to previous error

Compaction is usually a compute-heavy process and might affect other futures running on the thread of the compaction. Therefore, we add `block_in_place` as a temporary solution to avoid blocking other futures on the same thread as compaction in the runtime. As we are migrating towards a fully-async-style pageserver, we can revert this change when everything is async and when we move compaction to a separate runtime. --------- Signed-off-by: Alex Chi <[email protected]>

Require the error type to be ApiError. It implicitly required that anyway, because the function used error::handler, which downcasted the error to an ApiError. If the error was in fact anything else than ApiError, it would just panic. Better to check it at compilation time. Also make the last-resort error handler more forgiving, so that it returns an 500 Internal Server error response, instead of panicking, if a request handler returns some other error than an ApiError.

Previously, you used it like this: |r| RequestSpan(my_handler).handle(r) But I don't see the point of the RequestSpan struct. It's just a wrapper around the handler function. With this commit, the call becomes: |r| request_span(r, my_handler) Which seems a little simpler. At first I thought that the RequestSpan struct would allow "chaining" other kinds of decorators like RequestSpan, so that you could do something like this: |r| CheckPermissions(RequestSpan(my_handler)).handle(r) But it doesn't work like that. If each of those structs wrap a handler *function*, it would actually look like this: |r| CheckPermissions(|r| RequestSpan(my_handler).handle(r))).handle(r) This commit doesn't make that kind of chaining any easier, but seems a little more straightforward anyway.

If the timeline is already being deleted, return an error. We used to notice the duplicate request and error out in persist_index_part_with_deleted_flag(), but it's better to detect it earlier. Add an explicit lock for the deletion. Note: This doesn't do anything about the async cancellation problem (github issue #3478): if the original HTTP request dropped, because the client disconnected, the timeline deletion stops half-way through the operation. That needs to be fixed, too, but that's a separate story. (This is a simpler replacement for PR #4194. I'm also working on the cancellation shielding, see PR #4314.)

#4155 inadvertently switched to a version of the VM builder that leaves the file cache integration disabled by default. This re-enables the vm-informant's file cache integration. (as a refresher: The vm-informant is the autoscaling component that sits inside the VM and manages postgres / compute_ctl) See also: neondatabase/autoscaling#265

This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). The motivation is to avoid locking `Tenant::timelines` in places that can't be `async`, because in #4333 we want to convert Tenant::timelines from `std::sync::Mutex` to `tokio::sync::Mutex`. But, the changes here are useful in general because they clean up & document tenant state transitions. That also paves the way for #4350, which is an alternative to #4333 that refactors the pageserver code so that we can keep the `Tenant::timelines` mutex sync. This patch consists of the following core insights and changes: * spawn_load and spawn_attach own the tenant state until they're done * once load()/attach() calls are done ... * if they failed, transition them to Broken directly (we know that there's no background activity because we didn't call activate yet) * if they succeed, call activate. We can make it infallible. How? Later. * set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish. * This sounds scary because it might hinder detach or shutdown, but actually, concurrent attach+detach, or attach+shutdown, or load+shutdown, or attach+shutdown were just racy before this PR. So, with this change, they're not anymore. In the future, we can add a `CancellationToken` stored in Tenant to cancel `load` and `attach` faster, i.e., make `spawn_load` / `spawn_attach` transition them to Broken state sooner. See the doc comments on TenantState for the state transitions that are now possible. It might seem scary, but actually, this patch reduces the possible state transitions. We introduce a new state `TenantState::Activating` to avoid grabbing the `Tenant::timelines` lock inside the `send_modify` closure. These were the humble beginnings of this PR (see Motivation section), and I think it's still the right thing to have this `Activating` state, even if we decide against async `Tenant::timelines` mutex. The reason is that `send_modify` locks internally, and by moving locking of Tenant::timelines out of the closure, the internal locking of `send_modify` becomes a leaf of the lock graph, and so, we eliminate deadlock risk. Co-authored-by: Joonas Koivunen <[email protected]>

Startup can take a long time. We suspect it's the initial logical size calculations. Long term solution is to not block the tokio executors but do most of I/O in spawn_blocking. See: #4025, #4183 Short-term solution to above: - Delay global background tasks until initial tenant loads complete - Just limit how many init logical size calculations can we have at the same time to `cores / 2` This PR is for trying in staging.

added in #4366. revert for testing without it; it may have unintenteded side-effects, and it's very difficult to understand the results from the 10k load testing environments. earlier results: #4366 (comment)

…d… (#4353) This parameter can be use to restrict number of image layers generated because of GC request (wanted image layers). Been set to zero it completely eliminates creation of such image layers. So it allows to avoid extra storage consumption after merging #3673 ## Problem PR #3673 forces generation of missed image layers. So i short term is cause cause increase (in worst case up to two times) size of storage. It was intended (by me) that GC period is comparable with PiTR interval. But looks like it is not the case now - GC is performed much more frequently. It may cause the problem with space exhaustion: GC forces new image creation while large PiTR still prevent GC from collecting old layers. ## Summary of changes Add new pageserver parameter` forced_image_creation_limit` which restrict number of created image layers which are requested by GC. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist

compute_ctl can panic, but `tracing` is used for logging. panic stderr output can interleave with messages from normal logging. The fix is to use the established way (pageserver, safekeeper, storage_broker) of using `tracing` to report panics.

## Problem This PR includes doc changes to the current metrics as well as adding new metrics. With the new set of metrics, we can quantitatively analyze the read amp., write amp. and space amp. in the system, when used together with https://github.com/neondatabase/neonbench close #4312 ref #4347 compaction metrics TBD, a novel idea is to print L0 file number and number of layers in the system, and we can do this in the future when we start working on compaction. ## Summary of changes * Add `READ_NUM_FS_LAYERS` for computing read amp. * Add `MATERIALIZED_PAGE_CACHE_HIT_UPON_REQUEST`. * Add `GET_RECONSTRUCT_DATA_TIME`. GET_RECONSTRUCT_DATA_TIME + RECONSTRUCT_TIME + WAIT_LSN_TIME should be approximately total time of reads. * Add `5.0` and `10.0` to `STORAGE_IO_TIME_BUCKETS` given some fsync runs slow (i.e., > 1s) in some cases. * Some `WAL_REDO` metrics are only used when Postgres is involved in the redo process. --------- Signed-off-by: Alex Chi <[email protected]>

## Problem Part of #4373 ## Summary of changes This PR adds `PersistentLayerDesc`, which will be used in LayerMap mapping and probably layer cache. After this PR and after we change LayerMap to map to layer desc, we can safely drop RemoteLayerDesc. --------- Signed-off-by: Alex Chi <[email protected]> Co-authored-by: bojanserafimov <[email protected]>

See superceded #4390. - capture log in test - expand the span to cover init and error reporting - remove obvious logging by logging only unexpected

We now spawn a new task for every HTTP request, and wait on the JoinHandle. If Hyper drops the Future, the spawned task will keep running. This protects the rest of the pageserver code from unexpected async cancellations. This creates a CancellationToken for each request and passes it to the handler function. If the HTTP request is dropped by the client, the CancellationToken is signaled. None of the handler functions make use for the CancellationToken currently, but they now they could. The CancellationToken arguments also work like documentation. When you're looking at a function signature and you see that it takes a CancellationToken as argument, it's a nice hint that the function might run for a long time, and won't be async cancelled. The default assumption in the pageserver is now that async functions are not cancellation-safe anyway, unless explictly marked as such, but this is a nice extra reminder. Spawning a task for each request is OK from a performance point of view because spawning is very cheap in Tokio, and none of our HTTP requests are very performance critical anyway. Fixes issue #3478

Upgrading to rust 1.70 will require these.

## Describe your changes Port HNSW implementation for ANN search top Postgres ## Issue ticket number and link https://www.pinecone.io/learn/hnsw ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist

As seen on deployment of 2023-06-01 release, times were improving but there were some outliers caused by: - timelines `eviction_task` starting while activating and running imitation - timelines `initial logical size` calculation This PR fixes it so that `eviction_task` is delayed like other background tasks fixing an oversight from earlier #4372. After this PR activation will be two phases: 1. load and activate tenants AND calculate some initial logical sizes 2. rest of initial logical sizes AND background tasks - compaction, gc, disk usage based eviction, timelines `eviction_task`, consumption metrics

Create pgx_ulid extension ``` postgres=# create extension ulid; CREATE EXTENSION postgres=# CREATE TABLE users ( id ulid NOT NULL DEFAULT gen_ulid() PRIMARY KEY, name text NOT NULL ); CREATE TABLE postgres=# insert into users (name) values ('vadim'); INSERT 0 1 postgres=# select * from users; id | name ----------------------------+------- 01H25DDG3KYMYZTNR41X38E256 | vadim ```

Solves a flaky test error in the wild[^1] by: - Make the gc shutdown signal reading an `allowed_error` - Note the gc shutdown signal readings as being in `allowed_error`s - Allow passing tenant conf to init_start to avoid unncessary tenants [^1]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4399/5176432780/index.html#suites/b97efae3a617afb71cb8142f5afa5224/2cd76021ea011f93

walreceiver logs are a bit hard to understand because of partial span usage, extra messages, ignored errors popping up as huge stacktraces. Fixes #3330 (by spans, also demote info -> debug). - arrange walreceivers spans into a hiearchy: - `wal_connection_manager{tenant_id, timeline_id}` -> `connection{node_id}` -> `poller` - unifies the error reporting inside `wal_receiver`: - All ok errors are now `walreceiver connection handling ended: {e:#}` - All unknown errors are still stacktraceful task_mgr reported errors with context `walreceiver connection handling failure` - Remove `connect` special casing, was: `DB connection stream finished` for ok errors - Remove `done replicating` special casing, was `Replication stream finished` for ok errors - lowered log levels for (non-exhaustive list): - `WAL receiver manager started, connecting to broker` (at startup) - `WAL receiver shutdown requested, shutting down` (at shutdown) - `Connection manager loop ended, shutting down` (at shutdown) - `sender is dropped while join handle is still alive` (at lucky shutdown, see #2885) - `timeline entered terminal state {:?}, stopping wal connection manager loop` (at shutdown) - `connected!` (at startup) - `Walreceiver db connection closed` (at disconnects?, was without span) - `Connection cancelled` (at shutdown, was without span) - `observed timeline state change, new state is {new_state:?}` (never after Timeline::activate was made infallible) - changed: - `Timeline dropped state updates sender, stopping wal connection manager loop` - was out of date; sender is not dropped but `Broken | Stopping` state transition - also made `debug!` - `Timeline dropped state updates sender before becoming active, stopping wal connection manager loop` - was out of date: sender is again not dropped but `Broken | Stopping` state transition - also made `debug!` - log fixes: - stop double reporting panics via JoinError

added the `allowed_error` to the `positive_env` so any tests completing the attach are allowed have this print out. they are allowed to do so, because the `random_init_delay` can produce close to zero and thus the first run will be near attach. Though... Unsure if we ever really need the eviction task to run **before** it can evict something, as in after 20min or 24h. in the failed test case however period is 20s so interesting that we didn't run into this sooner. evidence of flaky: https://github.com/neondatabase/neon/actions/runs/5175677035/jobs/9323705929?pr=4399#step:4:38536

…#4412) …ompiler ## Problem ## Summary of changes ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist

Flakyness introduced by #4402 evidence [^1]. I had assumed the NotConnected would had been an expected io error, but it's not. Restore the global `allowed_error`. [^1]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-4407/5185897757/index.html#suites/82004ab4e3720b47bf78f312dabe7c55/14f636d0ecd3939d/

We have 2 ways of tenant shutdown, we should have just one. Changes are mostly mechanical simple refactorings. Added `warn!` on the "shutdown all remaining tasks" should trigger test failures in the between time of not having solved the "tenant/timeline owns all spawned tasks" issue. Cc: #4327.

This adds test coverage for 'compute_ctl', as it is now used by all the python tests. There are a few differences in how 'compute_ctl' is called in the tests, compared to the real web console: - In the tests, the postgresql.conf file is included as one large string in the spec file, and it is written out as it is to the data directory. I added a new field for that to the spec file. The real web console, however, sets all the necessary settings in the 'settings' field, and 'compute_ctl' creates the postgresql.conf from those settings. - In the tests, the information needed to connect to the storage, i.e. tenant_id, timeline_id, connection strings to pageserver and safekeepers, are now passed as new fields in the spec file. The real web console includes them as the GUCs in the 'settings' field. (Both of these are different from what the test control plane used to do: It used to write the GUCs directly in the postgresql.conf file). The plan is to change the control plane to use the new method, and remove the old method, but for now, support both. Some tests that were sensitive to the amount of WAL generated needed small changes, to accommodate that compute_ctl runs the background health monitor which makes a few small updates. Also some tests shut down the pageserver, and now that the background health check can run some queries while the pageserver is down, that can produce a few extra errors in the logs, which needed to be allowlisted. Other changes: - remove obsolete comments about PostgresNode; - create standby.signal file for Static compute node; - log output of `compute_ctl` and `postgres` is merged into `endpoints/compute.log`. --------- Co-authored-by: Anastasia Lubennikova <[email protected]>

context: #4414 And improve messages/comments here and there.

ref #4414

sync-safekeepers doesn't know it and sends 0.

The issue was in the wrong `control` file name

Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In deployment of 2023-06-06 about a 200 initial logical sizes were calculated on hosts which took the longest to complete initial load (12s). Implements the three step/tier initialization ordering described in #4397: 1. load local tenants 2. do initial logical sizes per walreceivers for 10s 3. background tasks Ordering is controlled by: - waiting on `utils::completion::Barrier`s on background tasks - having one attempt for each Timeline to do initial logical size calculation - `pageserver/src/bin/pageserver.rs` releasing background jobs after timeout or completion of initial logical size calculation The timeout is there just to safeguard in case a legitimate non-broken timeline initial logical size calculation goes long. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible. Co-authored-by: Christian Schwarz <[email protected]>

…ne-get/timeline-layers-tokio-sync-atop-4333

github-actions · 2023-06-07T14:40:36Z

1000 tests run: 958 passed, 1 failed, 41 skipped (full report)

Failures on Posgres 14

test_get_tenant_size_with_multiple_branches: release

# Run failed on Postgres 14 tests locally:
DEFAULT_PG_VERSION=14 scripts/pytest -k "test_get_tenant_size_with_multiple_branches[release-pg14]"

Flaky tests (1)

Postgres 15

test_remote_storage_upload_queue_retries[local_fs]: ✅ debug

_{The comment gets automatically updated with the latest test results
a66a16d at 2023-06-07T14:40:36.304Z :recycle:}

problame · 2023-06-12T16:38:06Z

Abandoned in favor of #4441

skyzh and others added 21 commits May 24, 2023 19:36

ci: use eu-central-1 bucket (#4315)

f276f21

Probably increase CI success rate. --------- Signed-off-by: Alex Chi <[email protected]>

bump dependencies version (#4336)

ab2757f

proceeding #4237, this PR bumps AWS dependencies along with all other dependencies to the latest compatible semver. Signed-off-by: Alex Chi <[email protected]>

Storage: use Postgres 15 as default (#2809)

08e7d24

Revert "Allow for higher s3 concurrency (#4292)" (#4356)

be177f8

This reverts commit 024109f for it failing to be speed up anything, but run into more errors. See: #3698.

basebackup import: pre-lock the layer map for the flush() calls

a1680b1

The checkpointer loop isn't running anyway, so, there's no risk of blocking it through the pre-lock. (cherry picked from commit 1b26633)

turn Timeline::layers into tokio::sync::RwLock

2001c31

problame changed the base branch from main to problame/async-timeline-get/tenant-timelines-tokio-sync-mutex May 26, 2023 18:21

skyzh and others added 8 commits May 26, 2023 17:15

revert: static global init logical size limiter (#4368)

db14355

added in #4366. revert for testing without it; it may have unintenteded side-effects, and it's very difficult to understand the results from the 10k load testing environments. earlier results: #4366 (comment)

bojanserafimov and others added 27 commits May 31, 2023 12:23

Fix layer map correctness bug (#4342)

b644746

Fix stale and misleading comment in LayerMap (#4297)

3300836

pg_sni_router: add session_id to more messages (#4403)

ef80a90

See superceded #4390. - capture log in test - expand the span to cover init and error reporting - remove obvious logging by logging only unexpected

chore: comment marker fixes (#4406)

a55c663

Upgrading to rust 1.70 will require these.

Add libcurl as dependency to readme (#4405)

4ba950a

Remove arch-specific stuff from HNSW extension (#4423)

ac11e7c

Increase timeouts on compute -> sk connections.

dc6a382

context: #4414 And improve messages/comments here and there.

Quick exit in truncate_wal if nothing to do.

c058e1c

ref #4414

Don't warn on system id = 0 in walproposer greeting.

6b3c020

sync-safekeepers doesn't know it and sends 0.

Fix pgx_ulid extension (#4431)

88f0cfc

The issue was in the wrong `control` file name

Merge remote-tracking branch 'origin/main' into problame/async-timeli…

a66a16d

…ne-get/timeline-layers-tokio-sync-atop-4333

problame closed this Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

convert `Timeline::layers` to async RwLock atop #4333 #4361

convert `Timeline::layers` to async RwLock atop #4333 #4361

problame commented May 26, 2023 •

edited

Loading

github-actions bot commented Jun 7, 2023

Postgres 15

problame commented Jun 12, 2023

convert Timeline::layers to async RwLock atop #4333 #4361

convert Timeline::layers to async RwLock atop #4333 #4361

Conversation

problame commented May 26, 2023 • edited Loading

github-actions bot commented Jun 7, 2023

1000 tests run: 958 passed, 1 failed, 41 skipped (full report)

Failures on Posgres 14

Postgres 15

problame commented Jun 12, 2023

convert `Timeline::layers` to async RwLock atop #4333 #4361

convert `Timeline::layers` to async RwLock atop #4333 #4361

problame commented May 26, 2023 •

edited

Loading