
feat: a script to backfill commitments table #184

Closed
wants to merge 2 commits

Conversation

@bajtos commented Jan 3, 2024

I will keep this PR as a draft and then close it unmerged after completing the migration manually.

This is a follow-up to #182.

@bajtos requested a review from @juliangruber on January 3, 2024 at 12:44
Review thread on an excerpt from the diff:

```sql
GROUP BY published_as
ORDER BY published_at
ON CONFLICT DO NOTHING
RETURNING cid -- I _think_ this helps to get correct row count reported by PG
```
@bajtos (Member Author):

I did not run this query; there may be syntax errors.

WDYT, should I spend a bit of extra time to clone a subset of live measurements to my local PG database and run the query on that DB first? Otherwise, I'll go YOLO and debug the query while executing it against the live data.
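For reference, here is the full statement under review, reassembled from the two diff excerpts in this conversation. The WHERE clause is not visible in the excerpts, so the batch-window filter below is an assumption, inferred from the "Processing measurements reported before ..." log output later in the thread:

```sql
INSERT INTO commitments (cid, published_at)
SELECT
  published_as AS cid,
  MAX(finished_at) + INTERVAL '3 minutes' AS published_at
FROM measurements
-- Assumed filter (not shown in the excerpts): only committed measurements
-- inside the current batch window; $1 is the cursor, $2 the window interval.
WHERE published_as IS NOT NULL
  AND finished_at < $1::timestamptz
  AND finished_at >= $1::timestamptz - $2::interval
GROUP BY published_as
ORDER BY published_at
ON CONFLICT DO NOTHING
RETURNING cid
```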

@juliangruber (Member):

I think we can YOLO; if it goes wrong, we can just delete all commitments and start over, right?

@bajtos (Member Author):

> I think we can YOLO; if it goes wrong, we can just delete all commitments and start over, right?

Yes, we can 👍🏻

Review thread on an excerpt from the diff:

```js
// The "ON CONFLICT DO NOTHING" clause tells PG to silently skip rows that already exist.
const { rows, rowCount } = await client.query(`
  INSERT INTO commitments (cid, published_at)
  SELECT published_as as cid, MAX(finished_at) + INTERVAL '3 minutes' as published_at FROM measurements
```
@juliangruber (Member):

This 3-minute offset will lead to lots of overlap between the 1-hour windows; would it be easier to remove it?

@bajtos (Member Author):

I don't mind the overlaps; it gives me higher confidence that we don't accidentally miss any measurements.

I would like to keep the estimated published_at value close to the real values we report for new commitments.

We are reading the current time after the smart contract call completes; see here:
https://github.com/filecoin-station/spark-api/blob/ca10df39450cf50ff68be871869b763287bc0939/spark-publish/index.js#L81-L83

I can change that line to use `started` instead, which should bring us closer to `MAX(finished_at)`. It still would not be perfect: if there is a large backlog of unpublished measurements, we may commit a batch of measurements much later than they were recorded (finished).

I can also rework that line to compute MAX(finished_at) from the committed measurements. I think that's the most accurate solution, WDYT?
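A minimal sketch of that idea (hypothetical; assumes spark-publish can look the batch up by the CID it was just published under):

```sql
-- Derive published_at from the newest measurement in the committed batch
-- instead of reading the wall clock after the contract call.
SELECT MAX(finished_at) AS published_at
FROM measurements
WHERE published_as = $1; -- $1: CID of the batch that was just committed
```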

Also, maybe I am sweating this too much, and this few-minute difference does not really matter. WDYT?

@juliangruber (Member):

Nah, I don't mind at all if you think this doesn't create any bad effects 👍

@bajtos commented Jan 3, 2024

Next steps:

  • Run the script and fix any problems discovered. It's safe to re-run this script if it crashes.
  • Check that we backfilled all commitments, e.g. `SELECT COUNT(DISTINCT published_as) FROM measurements` should return the same value as `SELECT COUNT(*) FROM commitments`. Warning: the first query will be expensive!
  • Manually delete old measurements, e.g. `DELETE FROM measurements WHERE finished_at < '2023-11-01T00:00:00.000Z';`
  • Implement periodic cleanup of old measurements; this last step is not urgent (see the sketch after this list).
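A sketch of what that periodic cleanup could look like (hypothetical; the 7-day retention window is a placeholder):

```sql
-- Delete measurements that were already committed and are older than the
-- retention window. Safe to run repeatedly, e.g. from a scheduled job.
DELETE FROM measurements
WHERE published_as IS NOT NULL
  AND finished_at < now() - INTERVAL '7 days';
```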

cc @juliangruber

@bajtos commented Jan 4, 2024

The query stopped working because we had a long gap where no measurements were recorded. The script stops as soon as a window returns zero rows, so a gap longer than the window looks the same as reaching the end of the data.

```
Processing measurements reported before 2023-12-09T08:50:25.461Z
rowCount: 9 rows.length: 9
Processing measurements reported before 2023-12-09T08:04:53.404Z
rowCount: 0 rows.length: 0
```

After bumping up the interval to 12 hours, the script runs again:

```
Processing measurements reported before 2023-12-09T08:04:53.404Z
rowCount: 5 rows.length: 5
Processing measurements reported before 2023-12-08T20:47:01.827Z
rowCount: 162 rows.length: 162
Processing measurements reported before 2023-12-08T08:51:40.292Z
rowCount: 351 rows.length: 351
```
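A hypothetical diagnostic for spotting such gaps before re-running; it scans only the suspect range, so it should stay cheap given an index on finished_at:

```sql
-- Count measurements per hour around the suspected gap; hours with no
-- measurements simply do not appear in the output.
SELECT date_trunc('hour', finished_at) AS hour, COUNT(*) AS measurements
FROM measurements
WHERE finished_at >= '2023-12-08' AND finished_at < '2023-12-10'
GROUP BY hour
ORDER BY hour;
```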

@juliangruber commented:

> Implement periodic cleanup of old measurements - this last step is not urgent

Could we make this part of spark-publish, to always delete measurements after committing them?
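For example (hypothetical sketch; assumes spark-publish marks measurements via published_as and knows the CID it just committed):

```sql
-- Run right after a successful commitment to drop the batch just published.
DELETE FROM measurements
WHERE published_as = $1; -- $1: CID of the batch that was just committed
```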

@bajtos commented Jan 4, 2024

Another hiccup for 2023-12-07; I solved it by temporarily bumping the interval up to 24 hours.

```
Processing measurements reported before 2023-11-08T16:10:37.206Z
rowCount: 34 rows.length: 34
Processing measurements reported before 2023-11-08T11:55:39.865Z
rowCount: 0 rows.length: 0
[--- fix & restart ---]
Processing measurements reported before 2023-11-08T11:55:39.865Z
rowCount: 571 rows.length: 571
Processing measurements reported before 2023-11-07T11:59:33.213Z
rowCount: 691 rows.length: 691
Processing measurements reported before 2023-11-07T00:03:32.238Z
rowCount: 0 rows.length: 0
```

The oldest measurement in our DB is from 2023-11-07 00:00:00.93795+00, so all should be good now.

@bajtos commented Jan 4, 2024

`SELECT COUNT(*) FROM commitments` → 39145

`SELECT COUNT(DISTINCT published_as) FROM measurements` → this killed our DB server 😢

@bajtos commented Jan 4, 2024

Running `SELECT COUNT(DISTINCT published_as) FROM measurements` with the following filters:

```
WHERE finished_at < '2023-11-15'                                 → 5341
WHERE finished_at >= '2023-11-15' AND finished_at < '2023-12-01' → 10369
WHERE finished_at >= '2023-12-01' AND finished_at < '2023-12-10' → 5734
```

vs

```
SELECT COUNT(*) FROM commitments WHERE published_at < '2023-11-15'                                  → 5338
SELECT COUNT(*) FROM commitments WHERE published_at >= '2023-11-15' AND published_at < '2023-12-01' → 10369
SELECT COUNT(*) FROM commitments WHERE published_at >= '2023-12-01' AND published_at < '2023-12-10' → 5733
```

The counts are roughly the same. I think the difference is caused by the 3 minutes we add to finished_at: a commitment whose newest measurement finished at, say, 2023-11-14 23:58 gets published_at = 2023-11-15 00:01, so it lands in the first measurements bucket but the second commitments bucket.
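If we wanted to verify that explanation, a hypothetical cross-check (bounded by finished_at, so it avoids the full-table scan that killed the server earlier):

```sql
-- CIDs counted in the first measurements bucket but missing from the first
-- commitments bucket; these should be exactly the boundary-crossing ones.
SELECT DISTINCT published_as FROM measurements
WHERE published_as IS NOT NULL AND finished_at < '2023-11-15'
EXCEPT
SELECT cid FROM commitments WHERE published_at < '2023-11-15';
```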

I am going to assume the migration was successful and will delete historical data older than 1 week later today.

@juliangruber commented:

Sounds good!

@juliangruber commented:

The DB is at 84% storage capacity, so we have enough time.

@bajtos commented Jan 4, 2024

FWIW, running DELETE commands is not enough to release the disk space back to the OS.
Will investigate further.

https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-SPACE-RECOVERY

> In PostgreSQL, an UPDATE or DELETE of a row does not immediately remove the old version of the row. This approach is necessary to gain the benefits of multiversion concurrency control (MVCC, see Chapter 13): the row version must not be deleted while it is still potentially visible to other transactions. But eventually, an outdated or deleted row version is no longer of interest to any transaction. The space it occupies must then be reclaimed for reuse by new rows, to avoid unbounded growth of disk space requirements. This is done by running VACUUM.
>
> The standard form of VACUUM removes dead row versions in tables and indexes and marks the space available for future reuse. However, it will not return the space to the operating system, except in the special case where one or more pages at the end of a table become entirely free and an exclusive table lock can be easily obtained. In contrast, VACUUM FULL actively compacts tables by writing a complete new version of the table file with no dead space. This minimizes the size of the table, but can take a long time. It also requires extra disk space for the new copy of the table, until the operation completes.
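In practice that means running (standard PostgreSQL commands):

```sql
-- Marks dead row versions as reusable, but does not shrink the files on disk.
VACUUM measurements;

-- Rewrites the table with no dead space and returns disk space to the OS;
-- takes an exclusive lock and needs extra disk for the copy while it runs.
VACUUM FULL measurements;
```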

@bajtos commented Jan 4, 2024

Deleted all measurements `WHERE published_as IS NOT NULL AND finished_at < '2023-12-05'`.

@bajtos commented Jan 4, 2024

A useful command to know: list info about tables & indices, including their size.

```
spark=# \dti+
```

Things occupying more than 1 GB (the commitments entries are included for comparison):

```
 Schema |            Name             | Type  | Owner |    Table     | Persistence | Access method |  Size
--------+-----------------------------+-------+-------+--------------+-------------+---------------+---------
 public | commitments                 | table | spark |              | permanent   | heap          | 3928 kB
 public | commitments_pkey            | index | spark | commitments  | permanent   | btree         | 4424 kB
 public | commitments_published_at    | index | spark | commitments  | permanent   | btree         | 1264 kB
 public | measurements                | table | spark |              | permanent   | heap          | 325 GB
 public | measurements_finished_at    | index | spark | measurements | permanent   | btree         | 34 GB
 public | measurements_not_published  | index | spark | measurements | permanent   | btree         | 1414 MB
```

@bajtos commented Jan 8, 2024

> Next steps:
>
> • Run the script and fix any problems discovered. It's safe to re-run this script if it crashes.
> • Check that we backfilled all commitments, e.g. `SELECT COUNT(DISTINCT published_as) FROM measurements` should return the same value as `SELECT COUNT(*) FROM commitments`. Warning: the first query will be expensive!
> • Manually delete old measurements, e.g. `DELETE FROM measurements WHERE finished_at < '2023-11-01T00:00:00.000Z';`
> • Implement periodic cleanup of old measurements; this last step is not urgent.

I opened a PR to implement the cleanup; see #186

Closing this PR unmerged as the work here is finished.
