Speed up check_kubernetes_service_replication #4020
base: master
grr, i forgot that make test doesn't run mypy for some reason
huh, what are our tests doing - mypy is showing me that this should have definitely broken some of the other checkers (e.g., the flink one)
At some point this ran in <1min, and we have >1 things assuming this is still running in <1min...but the current version of this script actually takes multiple minutes in most of our clusters.

I grabbed a couple flamegraphs of the current version of this (which I've since lost in my scrollback/browser history, but I can re-gen pretty easily) and noticed a couple obvious things:

* we spend a significant amount of time getting pods - let's parallelize that! Note: we do this with multiprocessing and not multithreading since there's a lot of serialization happening of k8s objects that would likely hold onto the GIL
* we were spending an obscene amount of time in filter_pods_by_service_instance() since we were calling it over and over again - let's group pods once in a smarter way and pass the grouping around instead :)
* this one is kinda ???: socket.gethostbyaddr() was pretty prominent in the flamegraphs - we don't actually use the hostnames in this check, so let's add a way to skip that hostname resolution

I also deleted the --additional-namespaces code since it's entirely unused now and it made things slightly cleaner.
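
To make the grouping change above a bit more concrete, here's a minimal sketch of the idea: walk the pod list once, bucket pods by service and instance, and then do cheap dict lookups instead of re-filtering the full list for every service instance. This is illustrative only and is not the code in this PR: the function name group_pods_by_service_instance, the plain-dict pods, and the "service"/"instance" label keys are all assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Hypothetical stand-in for a k8s pod object: just a mapping of label -> value.
Pod = Mapping[str, str]


def group_pods_by_service_instance(
    pods: Iterable[Pod],
) -> Dict[str, Dict[str, List[Pod]]]:
    """Bucket pods by (service, instance) in a single pass over the list.

    The "service" and "instance" keys are assumed label names for this sketch.
    """
    grouped: Dict[str, Dict[str, List[Pod]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for pod in pods:
        service = pod.get("service")
        instance = pod.get("instance")
        if service is None or instance is None:
            # Pods without both labels can't belong to any service instance.
            continue
        grouped[service][instance].append(pod)
    # Return plain dicts so a lookup of a missing key doesn't create entries.
    return {service: dict(instances) for service, instances in grouped.items()}


pods = [
    {"service": "web", "instance": "main", "name": "web-main-1"},
    {"service": "web", "instance": "main", "name": "web-main-2"},
    {"service": "web", "instance": "canary", "name": "web-canary-1"},
    {"name": "unlabeled-pod"},
]
pods_by_service_instance = group_pods_by_service_instance(pods)

# Each per-instance check is now a dict lookup rather than a full scan.
web_main_pods = pods_by_service_instance.get("web", {}).get("main", [])
```

With N pods and M service instances, re-filtering the whole list per instance does roughly N * M comparisons, while grouping once does a single pass over the N pods plus M lookups.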
6f67448 to e09c741
I'd prefer grouped_pods to be something more descriptive like pods_by_service_instance or something
…92-speedup-check_kubernetes_service_replication
I also tested the flink replication checker on infrastage after mypy alerted me that I had broken it :)
@EvanKrall done!
EDIT: I regenerated the flamegraphs: