
[get.jenkins.io/mirrors/mirrorbit - Azure] High costs due to usage of Azure File Storage #3917

Closed
dduportal opened this issue Jan 21, 2024 · 15 comments


@dduportal
Contributor

dduportal commented Jan 21, 2024

Service(s)

get.jenkins.io, mirrors.jenkins.io, pkg.jenkins.io, Update center

Summary

While checking the sources of costs in Azure for the year 2023, it appeared that the storage account prodjenkinsreleases (in the release group prod-core-releases) costs us around 20-25% of the monthly bills.

The amount is around $1,800–2,000 monthly, which is insane for 1 TB of shared file storage.

[Screenshot: Azure cost analysis, 2024-01-21]

Most of the cost comes from the "LRS Write Operations" meter (from $1,000 up to $1,700 in the past months), followed by "Read Operations" (~$180–200) and "Protocol Operations" (~$170–190) [costs per month]. The storage itself is really cheap: ~$30 monthly + $6 "hot" tier (i.e. caching data that is often read on the filer).

This issue is to track the analysis and study to see if we can decrease this cost one way or another.


The usage of this storage is:

  • get.jenkins.io (mirrors.jenkins.io):

    • Mirrorbits reads the volume content to check files considered as "references" to compare against mirrors when scanning. It primarily scans the FS tree hierarchically and the inode metadata (name, checksum, date, etc.) at regular intervals.
    • Apache (httpd) reads files when used as a fallback from the load balancer (e.g. when serving non-file resources such as directory listings, or when accessed through fallback.get.jenkins.io).
    • Both services (mirrorbits and httpd) are replicated for high availability. We have 2 replicas with a budget of "always 1 running" maintained by Kubernetes for rolling updates and maintenance.
    • Replication plus the split between mirrorbits and httpd requires concurrent access to the same file storage by the 4 processes. We used to have a ReadWriteMany persistent volume to achieve this, but we are moving to a full ReadOnlyMany mode soon as no write is needed.
      • Azure File Storage is the only CSI volume type supporting one or the other (i.e. mounting the same volume on different nodes).
  • trusted.ci.jenkins.io:

  • release.ci.jenkins.io:

=> Both the trusted.ci and release.ci usages are responsible for writing data. They do so by running commands remotely on pkg.jenkins.io (an AWS VM), which is allowed to access the storage through the Azure storage API. This pattern (writing every 3 minutes) is clearly the culprit for the cost here.


Storage consideration (usage/price):

  • This storage is bound to always grow: we do not garbage collect data as we want to keep it forever.
    • This policy could be revised as we have archives.jenkins.io.
  • Current usage today is:
    • Storage uses around 520 GB.
    • Current maximum IOPS: 1,000/s.
    • Current maximum bandwidth: 60 Mbit/s for ingress, the same for egress.

Reproduction steps

No response

@dduportal
Contributor Author

dduportal commented Jan 21, 2024

Proposal:

Since the original implementation of get.jenkins.io with mirrorbits, many things changed:

  • The number of replicas for get.jenkins.io has been decreased to 2, which means only 2 "views" of the persistent storage are needed
  • As we recently worked on parallelizing remote copy operations on the update center, the overhead of copying to 2 block-storage volumes is now acceptable
  • The flow of copies (Azure -> AWS -> Azure) changed quite often in 2020/2021 but has not been revisited since then

We could try switching to using 2 PVCs of type block storage instead of the current Azure File Storage:

  • We would need to implement a set of "write to disk" services to provide an entrypoint for writing to the disks

    • Luckily, the recent work of @hervelemeur added rsyncd as either included in the mirrorbits-parent chart or as a distinct chart with a ReadWriteOnce policy
    • We would need to replace blobxfer calls with rsync calls:
      • Use SSH key-based authentication so it would be easier to run rsync from the pkg/update VM (and eventually the release.ci pipeline for core releases)
      • Parallelize the copies to the 2 rsyncd instances in each script (to avoid hitting the 3-minute limit on the update center)
      • Restrict IPs at the LB level to increase security
  • Scheduling constraint: due to the ReadWriteOnce policy, each get.jenkins.io replica (i.e. a set of 1 mirrorbits pod, 1 httpd pod and 1 rsyncd pod) would have to be scheduled on the same machine

    • Not an HA problem: we can keep 2 distinct machines, each one hosting 1 "replica" with its own PVC
    • However, we would need to roll back httpd and rsync to using x86_64 nodes instead of arm64 nodes, at least until mirrorbits is arm64 compliant. Not a billing problem if we "pack" properly
  • Pricing changes:

    • As per https://azure.microsoft.com/en-us/pricing/details/managed-disks/, a Premium SSD of 1 TB (512 GB is the smallest option available below 1 TB) is $184.32 monthly with 5,000 IOPS (burstable) and 200 MB/s. There could be "outbound bandwidth" billed per GB, but only if we read the disk from another availability zone: if we use LRS we are fine.
    • That means we could expect a static bill of ~$400 monthly (usage and error margin included) instead of the ~$1,800–2,000 today! Even including the engineering effort, it is clearly worthwhile.

Besides, switching to a block storage-based persistent volume would:

@dduportal
Contributor Author

dduportal commented Jan 21, 2024

Alternative Proposal:

From the Azure documentation: "Premium file shares use a provisioned billing model, where you pay for the amount of storage you would like your file share to have, regardless of how much you use."

This solution could be better than my disk proposal (less engineering effort) as the volume billing would be:

  • $164 monthly ($0.16 × 1024 GB for premium)
  • Snapshot costs would drastically increase though ($0.136/GB instead of $0.06/GB today). As we already copy data to the pkg VM and to archives (2 other locations), this is acceptable.
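The provisioned-billing arithmetic checks out, assuming the $0.16 per provisioned GiB per month premium rate quoted above:

```shell
# Provisioned premium billing: price scales with provisioned size, not operations.
# Assumed rate: $0.16 per provisioned GiB per month.
awk 'BEGIN { printf "%.2f\n", 1024 * 0.16 }'   # 1 TiB provisioned share, USD/month
# → 163.84
```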

=> It would also allow us to use NFS file shares for our volumes, which could greatly improve #3525
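Mounting a premium file share over NFS from Kubernetes could look roughly like the fragment below. This is a sketch only: the names (`azurefile-premium-nfs`, `get-jenkins-io-data`) are hypothetical, while the provisioner and parameter keys follow the azurefile CSI driver's documented ones.

```yaml
# Sketch: StorageClass + PVC for an NFS-mounted premium Azure file share.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-premium-nfs
provisioner: file.csi.azure.com
parameters:
  skuName: Premium_ZRS   # premium, zone-redundant
  protocol: nfs          # mount over NFS 4.1 instead of SMB
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: get-jenkins-io-data
spec:
  accessModes:
    - ReadWriteMany      # NFS shares can be mounted on several nodes at once
  storageClassName: azurefile-premium-nfs
  resources:
    requests:
      storage: 1Ti
```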

Gotta check whether converting the existing volume is possible, and under which constraints.


(edit)

Looks like a migration is required to a new distinct Storage account: https://learn.microsoft.com/en-us/answers/questions/413129/how-to-migrate-aure-standard-storage-to-premium-st

Still worth the effort

@dduportal
Contributor Author

Update: PR opened to create new storage (premium): jenkins-infra/azure#598

@smerle33
Contributor

As per https://azure.microsoft.com/en-us/pricing/details/storage/files/

ZRS redundancy is a good choice: the price is a little higher than LRS but still cheap (in the Premium tier), as we consume only 600 GB:

LRS:
[Screenshot: LRS pricing, 2024-01-24]

ZRS:
[Screenshot: ZRS pricing, 2024-01-24]

And writes are included in Premium:
[Screenshot: Premium transaction pricing, 2024-01-24]

dduportal added a commit to jenkins-infra/azure that referenced this issue Jan 25, 2024
…ns.io (#598)

as per jenkins-infra/helpdesk#3917

---------

Co-authored-by: Damien Duportal <[email protected]>
Co-authored-by: Hervé Le Meur <[email protected]>
dduportal added a commit to jenkins-infra/azure that referenced this issue Jan 25, 2024
Ref. jenkins-infra/helpdesk#3917

Fixup of #598 

This PR corrects attributes, mainly `account_kind`, which must have the
value `FileStorage`

Signed-off-by: Damien Duportal <[email protected]>
@dduportal
Contributor Author

dduportal commented Jan 25, 2024

Update:

Next step:

  • Start copying the data between the 2 storage accounts (should take the night)
  • Test a new PVC pointing to this storage with NFS in publick8s
  • Test access from pkg.origin.jenkins.io

@dduportal
Contributor Author

dduportal commented Jan 25, 2024

Update (Start copying the data between the 2 storage accounts):

  • Data is being copied:
    • Executed in a screen session named helpdesk-3917 on pkg.origin.jenkins.io's VM.
    • azcopy was already present at /bin/azcopy and was upgraded to the latest 10.x version.
    • Generated a SAS (with only file and read/list permissions) on prodjenkinsreleases, valid for only 5 days.
    • Command is azcopy copy 'https://prodjenkinsreleases.file.core.windows.net/mirrorbits?sv=2022-11-02&<redacted>' 'https://getjenkinsio.file.core.windows.net/mirrorbits?sv=2022-11-02&<redacted>' --preserve-smb-permissions=true --preserve-smb-info=true --recursive
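Regenerating such a short-lived read/list SAS could be sketched as below. The `az` CLI flags shown are real, but this is an assumption about how the token was produced, and the command is shown commented out since it needs an authenticated `az` session:

```shell
# Compute a 5-day expiry ("valid only 5 days"); GNU date syntax assumed.
EXPIRY="$(date -u -d '+5 days' '+%Y-%m-%dT%H:%MZ')"
echo "Token expiry: ${EXPIRY}"

# Hypothetical sketch of the SAS generation (requires a logged-in az CLI):
# az storage account generate-sas \
#   --account-name prodjenkinsreleases \
#   --services f \          # File service only
#   --resource-types sco \
#   --permissions rl \      # read + list, no write
#   --expiry "$EXPIRY" \
#   --https-only
```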

@dduportal
Contributor Author

dduportal commented Jan 26, 2024

Update (Test access from pkg.origin.jenkins.io and Keeping storage accounts in sync):

@dduportal
Contributor Author

WiP (Test a new PVC pointing to this storage with NFS in publick8s):

@dduportal
Contributor Author

Update: production migration to be started:

@dduportal
Contributor Author

Update: Operation on get.jenkins.io finished with success:

Next steps:

  • Close operation (and communication)
  • Remove access to prodjenkinsreleases from PKG
  • Cleanup prod-core-releases resource group
  • profit

@dduportal
Contributor Author

Proceeding to the deletion of the prod-core-releases resource group, which only holds the storage account prodjenkinsreleases with 2 file shares: mirrorbits (replaced by getjenkinsio's mirrorbits file share) and an unused website file share:

[Screenshot: file share list, 2024-01-26]

=> This storage account has not been used for the past 15 min and no errors are seen on the update center, trusted, or the PKG VM.

[Screenshots: monitoring graphs, 2024-01-26]

@smerle33
Contributor

Usage of the new premium storage class:

[Screenshot: storage class usage, 2024-01-26]

@dduportal
Contributor Author

[Screenshot: Azure daily costs, 2024-01-29]

=> The outcome looks really good: the daily rate, previously in the $70–110 range, is now ~$5 per day for get.jenkins.io!

Closing the issue as complete, but there should be 1 new issue to automatically propagate the storage size from jenkins-infra/azure to the Kubernetes PVC size, plus 1 comment proposing that updates.jenkins.io be switched to the same pattern (premium storage).

@dduportal
Contributor Author

See #3913 (comment) for overall costs

@lemeurherve
Member

Proceeding to deletion of prod-core-releases resource group which only has the storage account prodjenkinsreleases with 2 file shares: mirrorbits (replaced by getjenkinsio's mirrorbits file share) and an unused website file share:

Likely the cause of #3927
