
feat: a script to backfill commitments table #184

Closed
wants to merge 2 commits

Conversation

@bajtos commented Jan 3, 2024

I will keep this PR as a draft and then close it unmerged after completing the migration manually.

This is a follow-up to #182.

@bajtos requested a review from @juliangruber on January 3, 2024 at 12:44
Review thread on an excerpt from the diff:

```sql
GROUP BY published_as
ORDER BY published_at
ON CONFLICT DO NOTHING
RETURNING cid -- I _think_ this helps to get correct row count reported by PG
```
@bajtos (Member Author):

I did not run this query; there may be syntax errors.

WDYT, should I spend a bit of extra time to clone a subset of live measurements to my local PG database and run the query on that DB first? Otherwise, I'll go YOLO and debug the query while executing it against the live data.
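For reference, here is the full statement under review, reassembled from the two diff excerpts in this conversation. The WHERE clause is not visible in the excerpts, so the batch-window filter below is an assumption, inferred from the "Processing measurements reported before ..." log output later in the thread:

```sql
INSERT INTO commitments (cid, published_at)
SELECT
  published_as AS cid,
  MAX(finished_at) + INTERVAL '3 minutes' AS published_at
FROM measurements
-- Assumed filter (not shown in the excerpts): only committed measurements
-- inside the current batch window; $1 is the cursor, $2 the window interval.
WHERE published_as IS NOT NULL
  AND finished_at < $1::timestamptz
  AND finished_at >= $1::timestamptz - $2::interval
GROUP BY published_as
ORDER BY published_at
ON CONFLICT DO NOTHING
RETURNING cid
```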

@juliangruber (Member):

I think we can YOLO; if it goes wrong, we can just delete all commitments and start over, right?

@bajtos (Member Author):

> I think we can YOLO; if it goes wrong, we can just delete all commitments and start over, right?

Yes, we can 👍🏻

Review thread on an excerpt from the diff:

```js
// The "ON CONFLICT DO NOTHING" clause tells PG to silently skip rows that already exist.
const { rows, rowCount } = await client.query(`
  INSERT INTO commitments (cid, published_at)
  SELECT published_as as cid, MAX(finished_at) + INTERVAL '3 minutes' as published_at FROM measurements
```
@juliangruber (Member):

This 3-minute offset will lead to lots of overlap between the 1-hour windows; would it be easier to remove it?

@bajtos (Member Author):

I don't mind the overlaps; it gives me higher confidence that we don't accidentally miss any measurements.

I would like to keep the estimated published_at value close to the real values we report for new commitments.

We are reading the current time after the smart contract call completes; see here:
https://github.com/filecoin-station/spark-api/blob/ca10df39450cf50ff68be871869b763287bc0939/spark-publish/index.js#L81-L83

I can change that line to use `started` instead, which should bring us closer to `MAX(finished_at)`. It still would not be perfect: if there is a large backlog of unpublished measurements, we may commit a batch of measurements much later than they were recorded (finished).

I can also rework that line to compute MAX(finished_at) from the committed measurements. I think that's the most accurate solution, WDYT?
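A minimal sketch of that idea (hypothetical; assumes spark-publish can look the batch up by the CID it was just published under):

```sql
-- Derive published_at from the newest measurement in the committed batch
-- instead of reading the wall clock after the contract call.
SELECT MAX(finished_at) AS published_at
FROM measurements
WHERE published_as = $1; -- $1: CID of the batch that was just committed
```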

Also, maybe I am sweating this too much, and this few-minute difference does not really matter. WDYT?

@juliangruber (Member):

Nah, I don't mind at all if you think this doesn't create any bad effects 👍

@bajtos commented Jan 3, 2024

Next steps:

  • Run the script and fix any problems discovered. It's safe to re-run this script if it crashes.
  • Check that we backfilled all commitments, e.g. `SELECT COUNT(DISTINCT published_as) FROM measurements` should return the same value as `SELECT COUNT(*) FROM commitments`. Warning: the first query will be expensive!
  • Manually delete old measurements, e.g. `DELETE FROM measurements WHERE finished_at < '2023-11-01T00:00:00.000Z';`
  • Implement periodic cleanup of old measurements; this last step is not urgent (see the sketch after this list).
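A sketch of what that periodic cleanup could look like (hypothetical; the 7-day retention window is a placeholder):

```sql
-- Delete measurements that were already committed and are older than the
-- retention window. Safe to run repeatedly, e.g. from a scheduled job.
DELETE FROM measurements
WHERE published_as IS NOT NULL
  AND finished_at < now() - INTERVAL '7 days';
```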

cc @juliangruber

@bajtos commented Jan 4, 2024

The query stopped working because we had a long gap where no measurements were recorded. The script stops as soon as a window returns zero rows, so a gap longer than the window looks the same as reaching the end of the data.

```
Processing measurements reported before 2023-12-09T08:50:25.461Z
rowCount: 9 rows.length: 9
Processing measurements reported before 2023-12-09T08:04:53.404Z
rowCount: 0 rows.length: 0
```

After bumping up the interval to 12 hours, the script runs again:

```
Processing measurements reported before 2023-12-09T08:04:53.404Z
rowCount: 5 rows.length: 5
Processing measurements reported before 2023-12-08T20:47:01.827Z
rowCount: 162 rows.length: 162
Processing measurements reported before 2023-12-08T08:51:40.292Z
rowCount: 351 rows.length: 351
```
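A hypothetical diagnostic for spotting such gaps before re-running; it scans only the suspect range, so it should stay cheap given an index on finished_at:

```sql
-- Count measurements per hour around the suspected gap; hours with no
-- measurements simply do not appear in the output.
SELECT date_trunc('hour', finished_at) AS hour, COUNT(*) AS measurements
FROM measurements
WHERE finished_at >= '2023-12-08' AND finished_at < '2023-12-10'
GROUP BY hour
ORDER BY hour;
```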

@juliangruber commented:

> Implement periodic cleanup of old measurements - this last step is not urgent

Could we make this part of spark-publish, to always delete measurements after committing them?
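For example (hypothetical sketch; assumes spark-publish marks measurements via published_as and knows the CID it just committed):

```sql
-- Run right after a successful commitment to drop the batch just published.
DELETE FROM measurements
WHERE published_as = $1; -- $1: CID of the batch that was just committed
```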

@bajtos commented Jan 4, 2024

Another hiccup for 2023-12-07; I solved it by temporarily bumping the interval up to 24 hours.

```
Processing measurements reported before 2023-11-08T16:10:37.206Z
rowCount: 34 rows.length: 34
Processing measurements reported before 2023-11-08T11:55:39.865Z
rowCount: 0 rows.length: 0
[--- fix & restart ---]
Processing measurements reported before 2023-11-08T11:55:39.865Z
rowCount: 571 rows.length: 571
Processing measurements reported before 2023-11-07T11:59:33.213Z
rowCount: 691 rows.length: 691
Processing measurements reported before 2023-11-07T00:03:32.238Z
rowCount: 0 rows.length: 0
```

The oldest measurement in our DB is from 2023-11-07 00:00:00.93795+00, so all should be good now.

@bajtos commented Jan 4, 2024

`SELECT COUNT(*) FROM commitments` → 39145

`SELECT COUNT(DISTINCT published_as) FROM measurements` → this killed our DB server 😢

@bajtos commented Jan 4, 2024

Running `SELECT COUNT(DISTINCT published_as) FROM measurements` with the following filters:

```
WHERE finished_at < '2023-11-15'                                 → 5341
WHERE finished_at >= '2023-11-15' AND finished_at < '2023-12-01' → 10369
WHERE finished_at >= '2023-12-01' AND finished_at < '2023-12-10' → 5734
```

vs

```
SELECT COUNT(*) FROM commitments WHERE published_at < '2023-11-15'                                  → 5338
SELECT COUNT(*) FROM commitments WHERE published_at >= '2023-11-15' AND published_at < '2023-12-01' → 10369
SELECT COUNT(*) FROM commitments WHERE published_at >= '2023-12-01' AND published_at < '2023-12-10' → 5733
```

The counts are roughly the same. I think the difference is caused by the 3 minutes we add to finished_at: a commitment whose newest measurement finished at, say, 2023-11-14 23:58 gets published_at = 2023-11-15 00:01, so it lands in the first measurements bucket but the second commitments bucket.
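If we wanted to verify that explanation, a hypothetical cross-check (bounded by finished_at, so it avoids the full-table scan that killed the server earlier):

```sql
-- CIDs counted in the first measurements bucket but missing from the first
-- commitments bucket; these should be exactly the boundary-crossing ones.
SELECT DISTINCT published_as FROM measurements
WHERE published_as IS NOT NULL AND finished_at < '2023-11-15'
EXCEPT
SELECT cid FROM commitments WHERE published_at < '2023-11-15';
```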

I am going to assume the migration was successful and will delete historical data older than 1 week later today.

@juliangruber commented:

Sounds good!

@juliangruber commented:

The DB is at 84% storage capacity, so we have enough time.

@bajtos commented Jan 4, 2024

FWIW, running DELETE commands is not enough to release the disk space back to the OS.
Will investigate further.

https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY

> In PostgreSQL, an UPDATE or DELETE of a row does not immediately remove the old version of the row. This approach is necessary to gain the benefits of multiversion concurrency control (MVCC, see Chapter 13): the row version must not be deleted while it is still potentially visible to other transactions. But eventually, an outdated or deleted row version is no longer of interest to any transaction. The space it occupies must then be reclaimed for reuse by new rows, to avoid unbounded growth of disk space requirements. This is done by running VACUUM.
>
> The standard form of VACUUM removes dead row versions in tables and indexes and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained. In contrast, VACUUM FULL actively compacts tables by writing a complete new version of the table file with no dead space. This minimizes the size of the table, but can take a long time. It also requires extra disk space for the new copy of the table, until the operation completes.
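In practice that means running (standard PostgreSQL commands):

```sql
-- Marks dead row versions as reusable, but does not shrink the files on disk.
VACUUM measurements;

-- Rewrites the table with no dead space and returns disk space to the OS;
-- takes an exclusive lock and needs extra disk for the copy while it runs.
VACUUM FULL measurements;
```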

@bajtos commented Jan 4, 2024

Deleted all measurements `WHERE published_as IS NOT NULL AND finished_at < '2023-12-05'`.

@bajtos commented Jan 4, 2024

A useful command to know: list info about tables & indices, including their size.

```
spark=# \dti+
```

Things occupying more than 1 GB (the commitments entries are included for comparison):

```
 Schema |            Name             | Type  | Owner |    Table     | Persistence | Access method |  Size
--------+-----------------------------+-------+-------+--------------+-------------+---------------+---------
 public | commitments                 | table | spark |              | permanent   | heap          | 3928 kB
 public | commitments_pkey            | index | spark | commitments  | permanent   | btree         | 4424 kB
 public | commitments_published_at    | index | spark | commitments  | permanent   | btree         | 1264 kB
 public | measurements                | table | spark |              | permanent   | heap          | 325 GB
 public | measurements_finished_at    | index | spark | measurements | permanent   | btree         | 34 GB
 public | measurements_not_published  | index | spark | measurements | permanent   | btree         | 1414 MB
```

@bajtos commented Jan 8, 2024

> Next steps:
>
> • Run the script and fix any problems discovered. It's safe to re-run this script if it crashes.
> • Check that we backfilled all commitments, e.g. `SELECT COUNT(DISTINCT published_as) FROM measurements` should return the same value as `SELECT COUNT(*) FROM commitments`. Warning: the first query will be expensive!
> • Manually delete old measurements, e.g. `DELETE FROM measurements WHERE finished_at < '2023-11-01T00:00:00.000Z';`
> • Implement periodic cleanup of old measurements; this last step is not urgent.

I opened a PR to implement the cleanup; see #186

Closing this PR unmerged as the work here is finished.
