Bring down infra cost #71

Closed · 4 tasks · Tracked by #90 ...

bajtos opened this issue Feb 13, 2024 · 30 comments
@bajtos commented Feb 13, 2024

Tasks

@juliangruber

Asked on Slack whether the dashboard work will provide us with the total number of jobs performed. I believe this is the only metric that requires us to keep infinite retention on the station bucket, which is the main cost factor.

@juliangruber

The dashboard work will replace that 👍 So we can turn on a bucket retention policy after these metrics land.

In order to implement some quick cost reductions, I propose we perform a manual purge:

  1. Sum the total job count before April 1st 2024
  2. Delete all measurements before April 1st 2024
  3. Add the previously calculated sum as a fixed offset to Grafana charts and the website

Wdyt @bajtos @patrickwoodhead?
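For step 1, the sum could be computed with a Flux query along these lines. This is only a sketch; the bucket, measurement and field names ("station", "jobs-completed", "value") are assumptions based on what's discussed later in this thread:

```flux
// Total jobs completed before April 1st 2024 (sketch).
from(bucket: "station")
  |> range(start: 0, stop: 2024-04-01T00:00:00Z)
  |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")
  |> group()  // drop per-station grouping so sum() yields one grand total
  |> sum()
```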

@juliangruber commented May 2, 2024

As suggested by @bajtos: Before deleting all measurements, store them in cold storage. Try compressing using https://facebook.github.io/zstd/.
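For example, something like this (the file name is hypothetical; a high compression level plus long-range matching tends to pay off on repetitive ndjson):

```sh
# Compress with a high level, long-range matching, and all CPU cores.
# Note: decompression then also needs the --long=31 flag.
zstd -19 --long=31 -T0 measurements.ndjson -o measurements.ndjson.zst
```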

@juliangruber commented May 12, 2024

  • Export measurements before May 1st from Influx
  • Store export
  • Enable 30d bucket retention policy
  • Sum total jobs completed
  • Add sum as fixed offset
    • Grafana
    • Website
  • Re-enable website query
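The retention-policy step could be done with the influx CLI roughly as follows (a sketch; assumes the CLI is already configured with the org and an API token):

```sh
# Find the bucket ID, then set a 30-day retention period on it.
influx bucket list --name station
influx bucket update --id "$BUCKET_ID" --retention 30d
```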

@bajtos commented May 13, 2024

> Export measurements before April 1st from Influx
> Enable 30d bucket retention policy

The 30d retention will delete all data older than April 13th if enabled today. Your export will contain only data older than April 1st. We will lose measurements recorded between April 1st and 13th.

Is that okay? Did I miss something?

@juliangruber commented May 13, 2024

Of course 😅 OK, I will include more data in the export. I've updated the task list to go up to May 1st (to be sure).

@juliangruber

Script used for the export, currently running: https://gist.github.com/juliangruber/cd50f1227d08e8b94d6b4b36620b4711

@juliangruber

The export finished. The resulting ndjson file is 58 GB in size. Something strange is going on, however: it doesn't include any records with "_measurement":"jobs-completed" and "_field":"value", which is what we need to recreate the job count. My suspicion is that the export is incomplete. In comparison, this query works:

[Screenshot from 2024-05-14 23:36 showing an Influx query that does return jobs-completed records]

I'm going to try to export only these records. I'm also going to repeat the export of everything, to see if it creates a different file.
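One quick way to check the export for such records (the file name is assumed):

```sh
# Count ndjson lines that mention the jobs-completed measurement;
# the suspicion above is that this comes back as 0.
grep -c '"_measurement":"jobs-completed"' measurements.ndjson
```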

@juliangruber

This export finished:

2024-05-14T23:32:57.724Z { rows: 23382470, jobs: 32774485 }

It suggests there were 32m jobs completed, while on the website we show 161m. I will repeat this export to see if it is deterministic.
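For context, a minimal sketch of how such a count can be produced with the JS Influx client's queryRows. Connection details and the exact Flux filter are assumptions; the real export script is the gist linked above:

```js
import { InfluxDB } from '@influxdata/influxdb-client'

// Assumed connection details.
const influx = new InfluxDB({ url: process.env.INFLUX_URL, token: process.env.INFLUX_TOKEN })
const queryApi = influx.getQueryApi(process.env.INFLUX_ORG)

const flux = `
  from(bucket: "station")
    |> range(start: 0, stop: 2024-05-01T00:00:00Z)
    |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")`

let rows = 0
let jobs = 0
queryApi.queryRows(flux, {
  next (row, tableMeta) {
    const o = tableMeta.toObject(row)
    rows += 1
    jobs += Number(o._value) // assumes each row carries a per-station jobs-completed counter
  },
  error (err) {
    console.error('query failed', err)
  },
  complete () {
    console.log(new Date().toISOString(), { rows, jobs })
  }
})
```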

@juliangruber

Next run:

2024-05-15T13:40:09.512Z { rows: 23483927, jobs: 32840534 }

It's in the same ballpark, but not exact. Since no new events are being added to the old timeframe, this export mechanism is flawed. Let's check if we can do something else.

@juliangruber

Next run:

2024-05-16T12:16:11.297Z { rows: 20583385, jobs: 29386919 }

This time I used async iteration instead of the queryRows function. It looked at significantly fewer measurements.

@juliangruber

I assume we can improve our chances by performing many smaller queries, maybe one for each day. I will try this now.
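A sketch of what that per-day chunking could look like, reusing the query API instance from the earlier snippet; bucket name, filter and date range are assumptions for illustration:

```js
// Query one day at a time to keep each result set small (sketch).
const DAY_MS = 24 * 60 * 60 * 1000

async function countJobs (start, stop) {
  const flux = `
    from(bucket: "station")
      |> range(start: ${start.toISOString()}, stop: ${stop.toISOString()})
      |> filter(fn: (r) => r._measurement == "jobs-completed" and r._field == "value")`
  let jobs = 0
  for await (const { values, tableMeta } of queryApi.iterateRows(flux)) {
    jobs += Number(tableMeta.toObject(values)._value)
  }
  return jobs
}

let total = 0
for (
  let day = new Date('2022-10-31T00:00:00Z');
  day < new Date('2024-05-01T00:00:00Z');
  day = new Date(day.getTime() + DAY_MS)
) {
  total += await countJobs(day, new Date(day.getTime() + DAY_MS))
  console.log({ day, total })
}
```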

@juliangruber

The oldest row it can find is from 2022-11-05. We landed the telemetry commit on Oct 31st (CheckerNetwork/desktop@6d135e6). I don't know what this means.

@juliangruber commented May 22, 2024

  • downloading measurements from one day takes ~4 minutes
  • at the moment the script is at June 28th, 2023
  • it will process data until May 1st, 2024
  • that's 308 days left
  • the script is expected to finish in 21h / ~Thursday 23rd 8am CEST
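(Sanity check on the estimate: 308 days × ~4 min/day ≈ 1,232 min ≈ 20.5 h, which matches the ~21 h figure above.)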

@juliangruber commented May 22, 2024

The script ran until { day: 2023-11-17T00:00:00.000Z }, when we started receiving 429 Too Many Requests / org XYZ has exceeded limited_query plan limit.

I'm going to continue the script tomorrow with that date as the new starting point, and will merge the result with the previous export.

@juliangruber commented Jun 3, 2024

The 1TB disk instance ran out of space. It's currently on 2024-01-07. I'm resizing the machine to 2TB, removing the incomplete day from the export, and will then let it continue.

@juliangruber

The script is currently at 2024-01-29 (file size 1.4TB) and has until 2024-06-01 to run.

@juliangruber commented Jul 2, 2024

Up to 2024-03-03T17:14:30.236298505Z, there were 567,564,171 jobs recorded in InfluxDB. This is as far as my script got before being rate limited again. I'm now going to destroy the machine that keeps this export.

@juliangruber

I will now evaluate deleting these old rows; more work needs to be done before we can turn on a retention policy.

@juliangruber

I have deleted all rows from the station bucket that were recorded before 2024-03-03T17:14:30.236298505Z.
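For the record, a delete like this can be issued with the influx CLI roughly as follows (a sketch; org/token setup is assumed, and the exact command used isn't recorded here):

```sh
# Delete everything in the bucket recorded before the cut-off timestamp.
influx delete \
  --bucket station \
  --start 1970-01-01T00:00:00Z \
  --stop 2024-03-03T17:14:30Z
```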

@juliangruber

From 2024-03-03T17:14:30.000Z to 2024-03-03T18:27:50.000Z there were 1_025_567 more jobs. I suspect we're getting rate limited again.

@juliangruber

I have paused the script, as even with a 1s query window it was bringing down the Influx cluster.

@juliangruber

We are waiting to hear back from the Influx support team, which has taken on this case.

@juliangruber

For now, I'm exporting all measurements from web3.storage, to get a job count without needing InfluxDB.

@juliangruber

Still exporting Voyager; the export is currently 3.7TB in size.

@juliangruber

InfluxDB support told us that up to 2024-07-09T00:00:00Z the sum of all jobs completed is 155_153_052_475.

@juliangruber

I've enabled a 90-day retention policy on the "station" bucket. This matches what we have for "peer-checker", "spark-evaluate" and "spark-publish". We can reduce it to 30 days if it's still too expensive.

@bajtos commented Jul 25, 2024

FWIW, the daily_stations table started tracking the number of jobs completed on 2024-06-28. That means we have the data to calculate the total job count after 2024-07-09. All is good 👍🏻

@juliangruber

The Grafana + JSON data integration doesn't work any more (on PL Grafana), so this has to be reimplemented once the "Station" dashboard has been moved to the Space Meridian Grafana. With that, this job is finally complete.
