feat: a script to backfill commitments table #184

Closed · wants to merge 2 commits
63 changes: 63 additions & 0 deletions mig.js
@@ -0,0 +1,63 @@
import pg from 'pg'

/*
Usage:

1. Set up port forwarding between your local computer and the Postgres instance hosted by
Fly.io ([docs](https://fly.io/docs/postgres/connecting/connecting-with-flyctl/)). Remember
to use a different port if you have a local Postgres server for development!

```sh
fly proxy 5454:5432 -a spark-db
```

2. Find the spark-db entry in 1Password and get the user and password from the connection string.

3. Run the following command to apply the updates; remember to replace "user" and "password"
with the real credentials:

```sh
DATABASE_URL="postgres://user:password@localhost:5454/spark" node mig.js
```
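
(Optional) You can sanity-check the proxied connection first with psql, if you have it
installed (same placeholder credentials as above):

```sh
psql "postgres://user:password@localhost:5454/spark" -c 'SELECT 1'
```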

*/

const { DATABASE_URL } = process.env

const client = new pg.Client({ connectionString: DATABASE_URL })
await client.connect()

while (true) {
// Step 1: find the oldest commitment (the first commitment that we don't need to backfill)
const { rows: [{ published_at: end }] } = await client.query(
'SELECT published_at FROM commitments ORDER BY published_at LIMIT 1'
)
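// Note: this destructuring assumes the commitments table is never empty; if it were,
// rows[0] would be undefined and this destructuring would throw a TypeError.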
console.log('Processing measurements reported before', end)

// Step 2: backfill commitments for the 12 hours before "end"
// Note: we may omit some older measurements belonging to the first commitment found.
// That's ok. When the next iteration finds a CID already present in the commitments table,
// the "ON CONFLICT DO NOTHING" clause will tell PG to silently ignore the INSERT command.
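// (ON CONFLICT DO NOTHING without a conflict target skips rows violating any
// unique constraint -- presumably the primary key or unique index on "cid" here.)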
const { rows, rowCount } = await client.query(`
INSERT INTO commitments (cid, published_at)
SELECT published_as AS cid, MAX(finished_at) + INTERVAL '3 minutes' AS published_at FROM measurements
**Member:**

This 3-minute offset will lead to lots of overlap between the 1-hour windows; would it be easier to remove it?

**Member Author:**

I don't mind the overlaps; it gives me higher confidence that we don't accidentally miss any measurements.

I would like to keep the estimated `published_at` value close to the real values we report for new commitments.

We are reading the current time after the smart contract call completes, see here:
https://github.com/filecoin-station/spark-api/blob/ca10df39450cf50ff68be871869b763287bc0939/spark-publish/index.js#L81-L83

I can change that line to use `started` instead, which should bring us closer to `MAX(finished_at)`. It still would not be perfect: if there is a large backlog of unpublished measurements, we can commit a batch of measurements much later than they were recorded (finished).

I can also rework that line to compute `MAX(finished_at)` from the committed measurements. I think that's the most accurate solution, WDYT?

Also, maybe I am sweating this too much, and this few-minute difference does not really matter. WDYT?
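
To illustrate, a rough sketch of that most-accurate option (hypothetical code, not the actual spark-publish implementation; it assumes `measurements` is the committed batch and each `finished_at` is a `Date`):

```js
// Hypothetical sketch: derive the commitment's published_at from the newest
// measurement in the committed batch, instead of reading the wall clock
// after the smart contract call completes.
const measurements = [
  { finished_at: new Date('2024-01-01T00:01:00Z') },
  { finished_at: new Date('2024-01-01T00:05:00Z') }
]
const publishedAt = new Date(Math.max(
  ...measurements.map(m => m.finished_at.getTime())
))
console.log(publishedAt) // 2024-01-01T00:05:00.000Z
```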

**Member:**

Nah, I don't mind at all, if you think this doesn't create any bad effects 👍

WHERE finished_at >= ($1::TIMESTAMPTZ - INTERVAL '12 hours') AND finished_at <= $1::TIMESTAMPTZ
GROUP BY published_as
ORDER BY published_at
ON CONFLICT DO NOTHING
RETURNING cid -- I _think_ this helps to get the correct row count reported by PG
**Member Author:**

I did not run this query; there may be syntax errors.

WDYT, should I spend a bit of extra time to clone a subset of live measurements to my local PG database and run the query on that DB first? Otherwise, I'll go YOLO and debug this query while executing it against the live data.
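
For reference, cloning a subset could look something like this (hypothetical commands; they assume psql is installed, reuse the proxy from step 1, and a local database named `spark_dev`):

```sh
# Export the last day of measurements from the live DB (via the fly proxy)...
psql "postgres://user:password@localhost:5454/spark" \
  -c "\copy (SELECT * FROM measurements WHERE finished_at > now() - interval '1 day') TO 'measurements.csv' CSV HEADER"
# ...and load them into the local development database.
psql "postgres://localhost:5432/spark_dev" \
  -c "\copy measurements FROM 'measurements.csv' CSV HEADER"
```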

**Member:**

I think we can YOLO; if it goes wrong, we can just delete all commitments and start over, right?

**Member Author:**

> I think we can YOLO; if it goes wrong, we can just delete all commitments and start over, right?

Yes, we can 👍🏻

`, [
end
])

// See https://node-postgres.com/apis/result#resultrowcount-int--null
// The property `result.rowCount` does not reflect the number of rows returned from a
// query. For example, an UPDATE statement could update many rows (high result.rowCount)
// while result.rows.length is zero.
// I am not sure which value is the correct one to use; I'll update this code after I run
// it for the first time.
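// (For an INSERT ... RETURNING like the one above, the two values should match:
// rowCount counts the inserted rows, while RETURNING populates `rows` with one
// entry per inserted row.)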
console.log('rowCount: %s rows.length: %s', rowCount, rows.length)
if (rowCount === 0) break
}

await client.end()