feat: a script to backfill commitments table #184
````js
import pg from 'pg'

/*
Usage:

1. Set up port forwarding between your local computer and the Postgres instance hosted
   by Fly.io ([docs](https://fly.io/docs/postgres/connecting/connecting-with-flyctl/)).
   Remember to use a different port if you have a local Postgres server for development!

   ```sh
   fly proxy 5454:5432 -a spark-db
   ```

2. Find the spark-db entry in 1Password and get the user and password from the
   connection string.

3. Run the following command to apply the updates; remember to replace "user" and
   "password" with the real credentials:

   ```sh
   DATABASE_URL="postgres://user:password@localhost:5454/spark" node mig.js
   ```
*/

const { DATABASE_URL } = process.env

const client = new pg.Client({ connectionString: DATABASE_URL })
await client.connect()

while (true) {
  // Step 1: find the oldest commitment (the first commitment that we don't need to backfill)
  const { rows: [{ published_at: end }] } = await client.query(
    'SELECT published_at FROM commitments ORDER BY published_at LIMIT 1'
  )
  console.log('Processing measurements reported before', end)

  // Step 2: backfill commitments for the 12 hours before "end".
  // Note: we may omit some older measurements belonging to the first commitment found.
  // That's ok. When the next run finds a CID already present in the commitments table,
  // the "ON CONFLICT DO NOTHING" clause tells PG to silently ignore the INSERT command.
  const { rows, rowCount } = await client.query(`
    INSERT INTO commitments (cid, published_at)
    SELECT published_as AS cid, MAX(finished_at) + INTERVAL '3 minutes' AS published_at
    FROM measurements
    WHERE finished_at >= ($1::TIMESTAMPTZ - INTERVAL '12 hours') AND finished_at <= $1::TIMESTAMPTZ
    GROUP BY published_as
    ORDER BY published_at
    ON CONFLICT DO NOTHING
    RETURNING cid -- I _think_ this helps to get a correct row count reported by PG
  `, [
    end
  ])
````

*(Review thread on the `RETURNING cid` line:)*

> **Author:** I did not run this query; there may be syntax errors. WDYT, should I spend a bit of extra time to clone a subset of live measurements to my local PG database and run the query on that DB first? Otherwise, I'll go YOLO and debug this query while executing it against the live data.

> **Reviewer:** I think we can YOLO. If it goes wrong, we can just delete all commitments and start over, right?

> **Author:** Yes, we can 👍🏻
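For reference, cloning a subset of live measurements for a local dry run, as discussed above, could look roughly like this. This is only a sketch: the local connection string, the one-day window, and the assumption that the local `measurements` table accepts rows with just the two columns the backfill query reads are all made up here.

```js
// Hypothetical sketch: copy one day of measurements from the proxied
// production DB into a local database to test the backfill query on.
import pg from 'pg'

const source = new pg.Client({ connectionString: 'postgres://user:password@localhost:5454/spark' })
const target = new pg.Client({ connectionString: 'postgres://localhost:5432/spark_local' })
await source.connect()
await target.connect()

// Fetch only the two columns the backfill query uses.
const { rows } = await source.query(`
  SELECT published_as, finished_at FROM measurements
  WHERE finished_at > now() - INTERVAL '1 day'
`)
for (const m of rows) {
  await target.query(
    'INSERT INTO measurements (published_as, finished_at) VALUES ($1, $2)',
    [m.published_as, m.finished_at]
  )
}

await source.end()
await target.end()
```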
*(The script continues:)*

````js
  // See https://node-postgres.com/apis/result#resultrowcount-int--null
  // The property `result.rowCount` does not reflect the number of rows returned from a
  // query; e.g. an UPDATE statement could update many rows (a high result.rowCount value)
  // while result.rows.length would be zero.
  // I am not sure which value is the correct one to use; I'll update this code after I run
  // it for the first time.
  console.log('rowCount: %s rows.length: %s', rowCount, rows.length)
  if (rowCount === 0) break
}

await client.end()
````
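On the `rowCount` vs `rows.length` question raised in the comment above: here is a minimal sketch, assuming a hypothetical `things` table, of how node-postgres reports the two values. Because the backfill INSERT uses `RETURNING`, both values should agree there.

```js
// Without RETURNING: rowCount counts the affected rows, but rows is empty.
const a = await client.query('UPDATE things SET seen = true WHERE NOT seen')
console.log(a.rowCount, a.rows.length) // e.g. 42, 0

// With RETURNING (run against the same initial state): every affected row
// is also returned, so rowCount and rows.length agree.
const b = await client.query('UPDATE things SET seen = true WHERE NOT seen RETURNING id')
console.log(b.rowCount, b.rows.length) // e.g. 42, 42
```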
*(Review thread on the `MAX(finished_at) + INTERVAL '3 minutes'` offset:)*

> **Reviewer:** This 3-minute offset will lead to lots of overlap between 1h windows; would it be easier to remove it?

> **Author:** I don't mind the overlaps; it gives me higher confidence that we don't accidentally miss any measurements. I would like to keep the estimated `published_at` value close to the real values we report for new commitments. We are reading the current time after the smart contract call completes, see here: https://github.com/filecoin-station/spark-api/blob/ca10df39450cf50ff68be871869b763287bc0939/spark-publish/index.js#L81-L83
>
> I can change that line to use `started` instead, which should bring us closer to `MAX(finished_at)`. It still would not be perfect, because if there is a large backlog of unpublished measurements, we can commit a batch of measurements much later than they were recorded (finished). I can also rework that line to compute `MAX(finished_at)` from the committed measurements. I think that's the most accurate solution, WDYT?
>
> Also, maybe I am sweating this too much and this few-minute difference does not really matter. WDYT?
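A rough sketch of that last option, to make the idea concrete. This is not the actual spark-publish code at the linked lines; `measurements` (the batch being committed, each record carrying a `finished_at` timestamp) and `cid` are assumed to be in scope:

```js
// Hypothetical sketch: derive published_at from the committed batch instead
// of reading the wall clock after the smart-contract call completes.
// Assumes the batch is non-empty.
const publishedAt = measurements
  .map(m => new Date(m.finished_at))
  .reduce((max, t) => (t > max ? t : max))

await client.query(
  'INSERT INTO commitments (cid, published_at) VALUES ($1, $2)',
  [cid, publishedAt]
)
```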
> **Reviewer:** Nah, I don't mind at all, if you think this doesn't create any bad effects 👍