Bring down infra cost #71

Closed · 4 tasks · Tracked by #90 ...

bajtos opened this issue Feb 13, 2024 · 30 comments
@bajtos commented Feb 13, 2024

Tasks

@juliangruber

Asked on Slack whether the dashboard work will provide us with the total number of jobs performed. I believe this is the only metric that requires us to keep infinite retention on the station bucket, which is the main cost factor.

@juliangruber

The dashboard work will replace that 👍 So we can turn on a bucket retention policy after these metrics land.

In order to implement some quick cost reductions, I propose we perform a manual purge:

  1. Sum the total job count before April 1st 2024
  2. Delete all measurements before April 1st 2024
  3. Add the previously calculated sum as a fixed offset to Grafana charts and the website

Wdyt @bajtos @patrickwoodhead?
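For step 1, the sum could be computed with a Flux query along these lines. This is only a sketch; the bucket, measurement and field names ("station", "jobs-completed", "value") are assumptions based on what's discussed later in this thread:

```flux
// Total jobs completed before April 1st 2024 (sketch).
from(bucket: "station")
  |> range(start: 0, stop: 2024-04-01T00:00:00Z)
  |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")
  |> group()  // drop per-station grouping so sum() yields one grand total
  |> sum()
```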

@juliangruber commented May 2, 2024

As suggested by @bajtos: Before deleting all measurements, store them in cold storage. Try compressing using https://facebook.github.io/zstd/.
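For example, something like this (the file name is hypothetical; a high compression level plus long-range matching tends to pay off on repetitive ndjson):

```sh
# Compress with a high level, long-range matching, and all CPU cores.
# Note: decompression then also needs the --long=31 flag.
zstd -19 --long=31 -T0 measurements.ndjson -o measurements.ndjson.zst
```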

@juliangruber commented May 12, 2024

  • Export measurements before May 1st from Influx
  • Store export
  • Enable 30d bucket retention policy
  • Sum total jobs completed
  • Add sum as fixed offset
    • Grafana
    • Website
  • Re-enable website query
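The retention-policy step could be done with the influx CLI roughly as follows (a sketch; assumes the CLI is already configured with the org and an API token):

```sh
# Find the bucket ID, then set a 30-day retention period on it.
influx bucket list --name station
influx bucket update --id "$BUCKET_ID" --retention 30d
```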

@bajtos commented May 13, 2024

> Export measurements before April 1st from Influx
> Enable 30d bucket retention policy

The 30d retention will delete all data older than April 13th if enabled today. Your export will contain only data older than April 1st. We will lose measurements recorded between April 1st and 13th.

Is that okay? Did I miss something?

@juliangruber commented May 13, 2024

Of course 😅 OK, I will include more data in the export. I've updated the task list to go up to May 1st (to be sure).

@juliangruber

Script used for the export, currently running: https://gist.github.com/juliangruber/cd50f1227d08e8b94d6b4b36620b4711

@juliangruber

The export finished. The resulting ndjson file is 58 GB in size. Something strange is going on, however: it doesn't include any records with "_measurement":"jobs-completed" and "_field":"value", which is what we need to recreate the job count. My suspicion is that the export is incomplete. In comparison, this query works:

[Screenshot from 2024-05-14 23:36 showing an Influx query that does return jobs-completed records]

I'm going to try to export only these records. I'm also going to repeat the export of everything, to see if it creates a different file.
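One quick way to check the export for such records (the file name is assumed):

```sh
# Count ndjson lines that mention the jobs-completed measurement;
# the suspicion above is that this comes back as 0.
grep -c '"_measurement":"jobs-completed"' measurements.ndjson
```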

@juliangruber

This export finished:

2024-05-14T23:32:57.724Z { rows: 23382470, jobs: 32774485 }

It suggests there were 32m jobs completed, while on the website we show 161m. I will repeat this export to see if it is deterministic.
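For context, a minimal sketch of how such a count can be produced with the JS Influx client's queryRows. Connection details and the exact Flux filter are assumptions; the real export script is the gist linked above:

```js
import { InfluxDB } from '@influxdata/influxdb-client'

// Assumed connection details.
const influx = new InfluxDB({ url: process.env.INFLUX_URL, token: process.env.INFLUX_TOKEN })
const queryApi = influx.getQueryApi(process.env.INFLUX_ORG)

const flux = `
  from(bucket: "station")
    |> range(start: 0, stop: 2024-05-01T00:00:00Z)
    |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")`

let rows = 0
let jobs = 0
queryApi.queryRows(flux, {
  next (row, tableMeta) {
    const o = tableMeta.toObject(row)
    rows += 1
    jobs += Number(o._value) // assumes each row carries a per-station jobs-completed counter
  },
  error (err) {
    console.error('query failed', err)
  },
  complete () {
    console.log(new Date().toISOString(), { rows, jobs })
  }
})
```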

@juliangruber

Next run:

2024-05-15T13:40:09.512Z { rows: 23483927, jobs: 32840534 }

It's in the same ballpark, but not exact. Since no new events are being added to the old timeframe, this export mechanism is flawed. Let's check if we can do something else.

@juliangruber

Next run:

2024-05-16T12:16:11.297Z { rows: 20583385, jobs: 29386919 }

This time I used async iteration instead of the queryRows function. It looked at significantly fewer measurements.

@juliangruber

I assume we can improve our chances by performing many smaller queries, maybe one for each day. I will try this now.
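A sketch of what that per-day chunking could look like, reusing the query API instance from the earlier snippet; bucket name, filter and date range are assumptions for illustration:

```js
// Query one day at a time to keep each result set small (sketch).
const DAY_MS = 24 * 60 * 60 * 1000

async function countJobs (start, stop) {
  const flux = `
    from(bucket: "station")
      |> range(start: ${start.toISOString()}, stop: ${stop.toISOString()})
      |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")`
  let jobs = 0
  for await (const { values, tableMeta } of queryApi.iterateRows(flux)) {
    jobs += Number(tableMeta.toObject(values)._value)
  }
  return jobs
}

let total = 0
for (
  let day = new Date('2022-10-31T00:00:00Z');
  day < new Date('2024-05-01T00:00:00Z');
  day = new Date(day.getTime() + DAY_MS)
) {
  total += await countJobs(day, new Date(day.getTime() + DAY_MS))
  console.log({ day, total })
}
```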

@juliangruber

The oldest row it can find is from 2022-11-05. We landed the telemetry commit on Oct 31st (CheckerNetwork/desktop@6d135e6). I don't know what this means.

@juliangruber commented May 22, 2024

  • downloading measurements from one day takes ~4 minutes
  • at the moment the script is at June 28th, 2023
  • it will process data until May 1st, 2024
  • that's 308 days left
  • the script is expected to finish in 21h / ~Thursday 23rd 8am CEST
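(Sanity check on the estimate: 308 days × ~4 min/day ≈ 1,232 min ≈ 20.5 h, which matches the ~21 h figure above.)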

@juliangruber commented May 22, 2024

The script ran until { day: 2023-11-17T00:00:00.000Z }, when we started receiving 429 Too Many Requests / org XYZ has exceeded limited_query plan limit.

I'm going to continue the script tomorrow with that date as the new starting point, and will merge the result with the previous export.

@juliangruber commented Jun 3, 2024

The 1TB disk instance ran out of space. It's currently on 2024-01-07. I'm resizing the machine to 2TB, removing the incomplete day from the export, and will then let it continue.

@juliangruber

The script is currently at 2024-01-29 (file size 1.4TB) and has until 2024-06-01 to run.

@juliangruber commented Jul 2, 2024

Up to 2024-03-03T17:14:30.236298505Z, there were 567,564,171 jobs recorded in InfluxDB. This is as far as my script got before being rate limited again. I'm now going to destroy the machine that keeps this export.

@juliangruber

I will now evaluate deleting these old rows; more work needs to be done before we can turn on a retention policy.

@juliangruber

I have deleted all rows from the station bucket that were recorded before 2024-03-03T17:14:30.236298505Z.
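For the record, a delete like this can be issued with the influx CLI roughly as follows (a sketch; org/token setup is assumed, and the exact command used isn't recorded here):

```sh
# Delete everything in the bucket recorded before the cut-off timestamp.
influx delete \
  --bucket station \
  --start 1970-01-01T00:00:00Z \
  --stop 2024-03-03T17:14:30Z
```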

@juliangruber

From 2024-03-03T17:14:30.000Z to 2024-03-03T18:27:50.000Z there were 1_025_567 more jobs. I suspect we're getting rate limited again.

@juliangruber

I have paused the script, as even with a 1s query window it was bringing down the Influx cluster.

@juliangruber

We are waiting to hear back from the Influx support team, which has taken on this case.

@juliangruber

For now, I'm exporting all measurements from web3.storage, to get a job count without needing InfluxDB.

@juliangruber

Still exporting Voyager; the export is currently 3.7TB in size.

@juliangruber

InfluxDB support told us that up to 2024-07-09T00:00:00Z the sum of all jobs completed is 155_153_052_475.

@juliangruber

I've enabled a 90-day retention policy on the "station" bucket. This matches what we have for "peer-checker", "spark-evaluate" and "spark-publish". We can reduce it to 30 days if it's still too expensive.

@bajtos commented Jul 25, 2024

FWIW, the daily_stations table started tracking the number of jobs completed on 2024-06-28. That means we have the data to calculate the total job count after 2024-07-09. All is good 👍🏻

@juliangruber

The Grafana + JSON data integration doesn't work any more (on PL Grafana), so this has to be reimplemented once the "Station" dashboard has been moved to the Space Meridian Grafana. With that, this job is finally complete.
