Metrics data loss in K8S controller #3607

Closed
alvin-7 opened this issue Jan 23, 2024 · 11 comments · Fixed by #3692
Labels
kind/bug These are bugs.

Comments

@alvin-7
Contributor

alvin-7 commented Jan 23, 2024

What happened:
After restarting the K8S controller, the "agones_gameservers_total" metric is no longer being collected for the "shipping-mode1-map1-3568" battle server. However, the "agones_gameservers_count" metric is still being collected.

What you expected to happen:
I expected both the "agones_gameservers_total" and "agones_gameservers_count" metrics to continue being collected consistently, even after the controller restart.

  • max by(type) (agones_gameservers_total{fleet_name="shipping-mode1-map1-3568"})
    image
  • avg by(type) (agones_gameservers_count{fleet_name="shipping-mode1-map1-3568"})
    image

How to reproduce it (as minimally and precisely as possible):

  1. Start the K8S controller - Agones Controller.
  2. Create "shipping-mode1-map1-3568" fleet
  3. Check the metrics for the "shipping-mode1-map1-3568" battle server.
  4. Observe that the "agones_gameservers_total" metric is being collected, but "agones_gameservers_count" metric is not.
  5. Restart the controller.
  6. Check the metrics for the "shipping-mode1-map1-3568" battle server after the restart.
  7. Notice that the "agones_gameservers_total" metric is no longer being collected, while the "agones_gameservers_count" metric is still being collected.
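
The "check the metrics" steps above can be made repeatable with a small script. The following is a minimal sketch, assuming the controller's metrics are available in the Prometheus text exposition format; the helper name `metricsForFleet` and the sample payload are hypothetical and only illustrate the check, they are not taken from the report.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricsForFleet scans Prometheus text-format output and returns the set of
// metric names that have at least one sample labelled with the given fleet.
// Hypothetical helper for illustration only.
func metricsForFleet(exposition, fleet string) map[string]bool {
	found := make(map[string]bool)
	want := fmt.Sprintf("fleet_name=%q", fleet)
	scanner := bufio.NewScanner(strings.NewReader(exposition))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip blank lines and "# HELP" / "# TYPE" comment lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// Only labelled samples for the requested fleet are of interest.
		brace := strings.Index(line, "{")
		if brace < 0 || !strings.Contains(line, want) {
			continue
		}
		found[line[:brace]] = true
	}
	return found
}

func main() {
	// Hypothetical sample of what a metrics scrape might contain.
	sample := `# TYPE agones_gameservers_count gauge
agones_gameservers_count{fleet_name="shipping-mode1-map1-3568",type="Ready"} 2
# TYPE agones_gameservers_total counter
agones_gameservers_total{fleet_name="other-fleet",type="Ready"} 5
`
	found := metricsForFleet(sample, "shipping-mode1-map1-3568")
	for _, name := range []string{"agones_gameservers_count", "agones_gameservers_total"} {
		fmt.Printf("%s present: %v\n", name, found[name])
	}
}
```

Running the same check before and after the controller restart shows which of the two series disappears for the fleet.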

Anything else we need to know?:

  1. In the cluster, there are a total of 10 fleets, and there is a continuous process of deleting existing fleets and creating new fleets. This dynamic fleet activity might have an impact on the metrics data collection.

Environment:

  • Agones version: 1.35.0

  • Kubernetes version (use kubectl version):
    Client Version: v1.27.2
    Kustomize Version: v5.0.1
    Server Version: v1.22.5-tke.19

  • Cloud provider or hardware configuration:

  • Install method (yaml/helm): helm

  • Troubleshooting guide log(s):

  • Others:

alvin-7 added the kind/bug label Jan 23, 2024
@markmandel
Collaborator

This sounds like works as intended.

  1. If a fleet is deleted and we restart the controller, we can't create the old metrics - it's all in memory.
  2. If a fleet is deleted we specifically remove it from all metrics reporting to ensure a memory leak / metric explosion doesn't happen (we have to do a full reset to do it).

See #2478 for context.
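
A toy model may help picture why in-memory series behave differently after a reset or restart. This sketch is not Agones code and every name in it is hypothetical. It assumes that gauge-style series (such as a per-fleet count) are recomputed from live state on each sync, while counter-style series (such as a running total) only exist once an event has been recorded. Whether that is the actual cause of the behaviour reported here is not established in this thread.

```go
package main

import "fmt"

// registry is a toy, in-memory stand-in for a metrics exporter.
// It is NOT Agones code; all names here are hypothetical.
type registry struct {
	gauges   map[string]int // fleet -> current number of game servers
	counters map[string]int // fleet -> events seen since the last reset
}

func newRegistry() *registry {
	return &registry{gauges: map[string]int{}, counters: map[string]int{}}
}

// sync recomputes gauges from the live cluster state, as a periodic
// resync would. Fleets that no longer exist simply drop out.
func (r *registry) sync(live map[string]int) {
	r.gauges = map[string]int{}
	for fleet, n := range live {
		r.gauges[fleet] = n
	}
}

// recordEvent bumps a counter; a series only exists once an event is seen.
func (r *registry) recordEvent(fleet string) {
	r.counters[fleet]++
}

// reset drops every series, modelling a full reset on fleet deletion
// (a process restart has the same effect on in-memory state).
func (r *registry) reset() {
	r.gauges = map[string]int{}
	r.counters = map[string]int{}
}

func main() {
	live := map[string]int{"fleet-a": 3}
	r := newRegistry()
	r.recordEvent("fleet-a")
	r.sync(live)

	r.reset()
	r.sync(live) // the gauge comes back on the next sync

	_, hasGauge := r.gauges["fleet-a"]
	_, hasCounter := r.counters["fleet-a"]
	fmt.Println("gauge present:", hasGauge, "counter present:", hasCounter)
}
```

In this model the counter for a still-existing fleet stays absent until a new event arrives, which is one way a live fleet could lose a series after a reset.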

@alvin-7
Contributor Author

alvin-7 commented Jan 24, 2024

> This sounds like works as intended.
>
>   1. If a fleet is deleted and we restart the controller, we can't create the old metrics - it's all in memory.
>   2. If a fleet is deleted we specifically remove it from all metrics reporting to ensure a memory leak / metric explosion doesn't happen (we have to do a full reset to do it).
>
> See #2478 for context.

In my specific scenario, the controller metrics had already malfunctioned before the restart:

  1. Start the K8S controller - Agones Controller.
  2. Create "shipping-mode1-map1-3568" fleet
  3. Check the metrics for the "shipping-mode1-map1-3568" battle server.
  4. Observe that the "agones_gameservers_total" metric is being collected, but "agones_gameservers_count" metric is not.

@markmandel
Collaborator

That's a good point - will have to attempt to replicate 🤔

@alvin-7
Contributor Author

alvin-7 commented Jan 25, 2024

> That's a good point - will have to attempt to replicate 🤔

In our use case, Agones is configured with 10 fleets, and each fleet has a fleet autoscaler enabled. Additionally, 10 separate gameservers have been configured, which are not managed by the fleets.

Hope this can help you successfully reproduce the issue. Thank you for your hard work.

@alvin-7
Contributor Author

alvin-7 commented Jan 31, 2024

> That's a good point - will have to attempt to replicate 🤔

Hello, markmandel,

I hope this message finds you well. I wanted to follow up on the issue. Furthermore, I understand that replicating the issue can sometimes be challenging, and I'm wondering if there's any additional information or assistance I can provide to facilitate the process.

If any specific scenarios, logs, or system configurations would be helpful, please let me know. I’m also willing to assist with testing or any other tasks that might help you address the issue more efficiently.

Looking forward to your guidance on how I can best support your efforts. Thank you for your time and attention to this matter.

Best regards

@markmandel
Collaborator

Sorry this isn't currently at the top of my priority queue, so haven't had a chance to look at it. Would definitely be happy to provide pointers if you wanted to dig into it?

@alvin-7
Contributor Author

alvin-7 commented Feb 2, 2024 via email

@markmandel
Collaborator

If you would like to go digging (and I encourage it!), all these metrics are managed here: https://github.com/googleforgames/agones/tree/main/pkg/metrics

Feel free to drop questions here, or in #development channel on our Slack!

@Kalaiselvi84
Contributor

Kalaiselvi84 commented Feb 20, 2024

We have replicated this issue locally and the agones_gameservers_total is missing after restarting the agones-controller.

Before restart:
Screenshot 2024-02-20 at 3 05 07 PM

After:
Screenshot 2024-02-20 at 3 05 47 PM

@alvin-7
Contributor Author

alvin-7 commented Mar 6, 2024

In Agones version 1.35.0, disabling the FeatureGate: "ResetMetricsOnDelete" can resolve issues with metrics anomalies.

Through an in-depth analysis of the source code, I've discovered that this feature can lead to certain memory optimization benefits. However, it also results in an increase in code complexity. Notably, during this optimization process, there seems to be a bug within the code that causes anomalies in the metrics indicators.

Based on these findings, I will attempt to fix this issue and provide a pull request (PR) if everything goes smoothly.
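
For readers who want to try the workaround: Agones feature gates are passed as a URL-query-style string such as `ResetMetricsOnDelete=false` (for Helm installs this is, to my understanding, the `agones.featureGates` value; please check the chart documentation for your version). The sketch below shows how such a string decodes into per-gate booleans. `parseGates` is a hypothetical helper written for illustration and is not the Agones implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parseGates decodes a "Name=true&Other=false" style string into a map.
// Hypothetical helper for illustration; not the Agones implementation.
func parseGates(s string) (map[string]bool, error) {
	values, err := url.ParseQuery(s)
	if err != nil {
		return nil, err
	}
	gates := make(map[string]bool, len(values))
	for name, v := range values {
		// If a gate is repeated, the last value wins in this sketch.
		enabled, err := strconv.ParseBool(v[len(v)-1])
		if err != nil {
			return nil, fmt.Errorf("gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseGates("ResetMetricsOnDelete=false")
	if err != nil {
		panic(err)
	}
	fmt.Println("ResetMetricsOnDelete enabled:", gates["ResetMetricsOnDelete"])
}
```

Several gates can be combined in one string by joining them with `&`.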

@markmandel
Collaborator

Thanks for digging in!

markmandel added a commit that referenced this issue Apr 1, 2024
* fix: #3607 Metrics data loss in K8S controller
* add unit test for #3607

Co-authored-by: Zach Loafman
Co-authored-by: Mark Mandel