We discovered slow networking on GKE and reported the issue. This series of experiments investigates different aspects of it.
- run1 was my original testing done on GKE to report the issue (May 4, 2023)
- run2 was a secondary test to run nslookup across different cluster flags (May 16, 2023)
- run3 ran telnet in the worker pod to look at connection times/patterns to the broker leader (index 0)
- run4 used a test deployment of the operator that wrapped flux start with strace
- run5 attempts to remove DNS by getting pod IP addresses and writing them into /etc/hosts
- run6 is an effort to put together best practices of what we learned and reproduce the run1 experiments with improvements (May 17, 2023)
- run7 is the same as run6 but adds CoreDNS back to see if it replicates the original error
- run8 was one more attempt to reproduce the issue (completed, with one huge timeout)
- run9 was the final attempt to replicate the issue (it did)
- run10 is the equivalent experiment but scaled up to a larger cluster
- run11 contains results from Dmitri on the Google networking team
- run12 is a small run that tests the original experiment with only one hostname
- run13 tests more random configurations, hoping for insight
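
The run5 idea (bypassing DNS by writing pod IP addresses into /etc/hosts) can be sketched roughly as below. This is a minimal illustration and not the script used in the experiments: the function name `render_hosts`, the example hostnames, and the `flux-service` suffix are all hypothetical, and the sketch assumes the pod names and IPs have already been collected (for example from `kubectl get pods -o wide`).

```python
def render_hosts(pod_ips, domain_suffix=""):
    """Build /etc/hosts lines from a {hostname: ip} mapping.

    pod_ips: dict of pod hostname -> pod IP (assumed to be collected beforehand).
    domain_suffix: optional suffix (hypothetical), so that the fully
    qualified name resolves to the same IP as the short name.
    """
    lines = []
    # Sort by hostname so the output is stable across runs.
    for hostname in sorted(pod_ips):
        names = [hostname]
        if domain_suffix:
            # Put the fully qualified name first, then the short alias.
            names.insert(0, f"{hostname}.{domain_suffix}")
        lines.append(f"{pod_ips[hostname]}\t{' '.join(names)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical pod names and addresses, for illustration only.
    example = {"flux-sample-1": "10.0.0.2", "flux-sample-0": "10.0.0.1"}
    print(render_hosts(example, "flux-service"), end="")
```

The rendered text would then be appended to /etc/hosts in each pod before the workers try to connect to the broker, so that hostname resolution does not go through cluster DNS.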