Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible correctness issue in neon_read_at_lsnv() that could cause incorrect pages on Hot Standby replicas #10620

Open
mvdholst opened this issue Jan 31, 2025 · 3 comments
Labels
c/compute Component: compute, excluding postgres itself external A PR or Issue is created by an external user t/bug Issue Type: Bug

Comments

@mvdholst
Copy link

mvdholst commented Jan 31, 2025

In the function neon_read_at_lsnv(), there is the following code:

for (int i = 0; i < nblocks; i++)
{
	void	   *buffer = buffers[i];
	BlockNumber blockno = base_blockno + i;
	neon_request_lsns *reqlsns = &request_lsns[i];
	TimestampTz		start_ts, end_ts;

	if (PointerIsValid(mask) && !BITMAP_ISSET(mask, i))
		continue;

	start_ts = GetCurrentTimestamp();

	if (RecoveryInProgress() && MyBackendType != B_STARTUP)
		XLogWaitForReplayOf(reqlsns[0].request_lsn);

I think the last line here is incorrect, and should be

       XLogWaitForReplayOf(reqlsns[i].request_lsn)

Note that 0 is replaced with i.

The current code, as is, might return pages from the page server that have LSNs that are higher or equal to the current WAL replay LSN. In case of a multi-page WAL record, this might cause problems if the WAL record is partly replayed, and some pages were missing in the buffer pool and skipped by redo.

@mvdholst mvdholst added the t/bug Issue Type: Bug label Jan 31, 2025
@github-actions github-actions bot added the external A PR or Issue is created by an external user label Jan 31, 2025
@kelvich
Copy link
Contributor

kelvich commented Jan 31, 2025

Thank you!

@knizhnik can you please take a look

@ololobus ololobus added the c/compute Component: compute, excluding postgres itself label Jan 31, 2025
@knizhnik
Copy link
Contributor

It is not a bug, just need to look some line above where variable reqlsns is initialized:

		neon_request_lsns *reqlsns = &request_lsns[i];
		TimestampTz		start_ts, end_ts;

		if (PointerIsValid(mask) && !BITMAP_ISSET(mask, i))
			continue;

		start_ts = GetCurrentTimestamp();

		if (RecoveryInProgress() && MyBackendType != B_STARTUP)
			XLogWaitForReplayOf(reqlsns[0].request_lsn);

So, reqlsns[0].request_lsn is strange but correct way to write reqlsns->equest_lsn

@mvdholst
Copy link
Author

Yup, my mistake.

I still suggest to simplify the code a little bit. The last line becomes much clearer when you write:

XLogWaitForReplayOf(reqlsns->request_lsn);

i.e. get rid of the [0] stuff, which is weird anyway, as reqlsns is not an array, but just a pointer to some struct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/compute Component: compute, excluding postgres itself external A PR or Issue is created by an external user t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

4 participants