Adding distinct exit codes for cluster connection failures. #4933

Merged (23 commits) on Nov 3, 2020

Contributor

@PriyaModali PriyaModali commented Oct 23, 2020

Fixes: #4645
Related: Tracking issue #4921
Description
Adding distinct exit codes for cluster connection failures.

User facing changes
These changes enable skaffold dev and skaffold deploy to return distinct error codes when the Kubernetes cluster cannot be reached.

Example error messages before this change:

Unable to connect to the server: net/http: TLS handshake timeout

exiting dev mode because first deploy failed: unable to connect to Kubernetes: Get "https://192.168.64.3:8443/version?timeout=32s": net/http: TLS handshake timeout

Example error messages after this change:

Deploy Failed. Could not connect to <cluster name> cluster.

Deploy Failed. Could not connect to cluster test_cluster due to \"https://192.168.64.3:8443/version?timeout=32s\": net/http: TLS handshake timeout. Check your connection for the cluster.

@PriyaModali PriyaModali requested a review from a team as a code owner October 23, 2020 18:04
@PriyaModali PriyaModali requested a review from nkubala October 23, 2020 18:04
@google-cla google-cla bot added the cla: yes label Oct 23, 2020
regexp: re(fmt.Sprintf(".*%s.* Unable to connect: .*", ClusterConnectErrPrefix)),
errCode: proto.StatusCode_DEPLOY_CLUSTER_CONNECTION_ERR,
description: func(error) string {
return "Deploy Failed."
Member

I think this gets turned into the log entry, which is then reflected to the output on the IDEs. Maybe "Unable to connect to cluster"?

Contributor

@tejal29 tejal29 Oct 24, 2020

The description is not used in the Event API, only for the console.
The Event API uses ActionableErr.Message, which is the raw error.

I made a conscious decision to send the raw error to IDEs so they could do any extra processing of the error message.
Maybe it makes sense to send the description instead, or we could add another field to ActionableErr, e.g.
ProcessedMessage or ExtractedMessage, alongside the raw error text in Message.

WDYT?

Member

I only wondered if Skaffold would swallow the Unable to connect and instead report Deploy Failed.

Contributor

Ah, got it. Something like this:

Deploy Failed. Check your connection for the cluster test_cluster.

versus what we have right now:

Deploy Failed. Could not connect to cluster test_cluster due to \"https://192.168.64.3:8443/version?timeout=32s\": net/http: TLS handshake timeout. Check your connection for the cluster.

Contributor Author

This change makes the message more readable / actionable.

@google-cla

google-cla bot commented Oct 24, 2020

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no and removed cla: yes labels Oct 24, 2020
@tejal29
Contributor

tejal29 commented Oct 24, 2020

@googlebot I consent

@google-cla google-cla bot added cla: yes and removed cla: no labels Oct 24, 2020
@codecov

codecov bot commented Oct 24, 2020

Codecov Report

Merging #4933 into master will increase coverage by 0.00%.
The diff coverage is 67.74%.

@@           Coverage Diff           @@
##           master    #4933   +/-   ##
=======================================
  Coverage   72.19%   72.19%           
=======================================
  Files         362      363    +1     
  Lines       12715    12749   +34     
=======================================
+ Hits         9179     9204   +25     
- Misses       2855     2863    +8     
- Partials      681      682    +1     
Impacted Files Coverage Δ
pkg/skaffold/errors/err_map.go 87.17% <55.55%> (-9.49%) ⬇️
pkg/skaffold/errors/deploy_problems.go 71.42% <71.42%> (ø)
pkg/skaffold/errors/errors.go 96.87% <100.00%> (+0.10%) ⬆️
pkg/skaffold/schema/defaults/defaults.go 87.34% <0.00%> (-1.75%) ⬇️
pkg/skaffold/runner/dev.go 67.66% <0.00%> (ø)
pkg/skaffold/runner/build_deploy.go 68.88% <0.00%> (ø)
pkg/skaffold/docker/image.go 81.93% <0.00%> (+1.32%) ⬆️
...ffold/kubernetes/portforward/resource_forwarder.go 86.00% <0.00%> (+3.50%) ⬆️

Powered by Codecov. Last update f9010a6...8d5ae6a. Read the comment docs.

@briandealwis
Member

BTW: this code path is never run on the first iteration of dev. I tried pausing minikube and then running skaffold dev. I see an error reported:

exiting dev mode because first deploy failed: unable to connect to Kubernetes: Get "https://192.168.64.2:8443/version?timeout=32s": net/http: TLS handshake timeout

But there are no events:

{"result":{"timestamp":"2020-10-26T17:17:44.630814Z","event":{"metaEvent":{"entry":"Starting Skaffold: \u0026{Version: ConfigVersion:skaffold/v2beta9 GitVersion: GitCommit: GitTreeState: BuildDate: GoVersion:go1.14.3 Compiler:gc Platform:darwin/amd64}","metadata":{"build":{"numberOfArtifacts":1,"builders":[{"type":"DOCKER","count":1}],"type":"LOCAL"},"deploy":{"deployers":[{"type":"KUBECTL","count":1}],"cluster":"MINIKUBE"}}}}}}
{"result":{"timestamp":"2020-10-26T17:17:45.067668Z","event":{"devLoopEvent":{"status":"In Progress"}},"entry":"Update initiated"}}
event.DevLoopFailedInPhase: phase=Deploy deployState=status:"Not Started" autoTrigger:true 
{"result":{"timestamp":"2020-10-26T17:17:55.194247Z","event":{"devLoopEvent":{"status":"Failed","err":{"message":"unable to connect to Kubernetes: Get \"https://192.168.64.2:8443/version?timeout=32s\": net/http: TLS handshake timeout"}}},"entry":"Update failed with error code OK"}}
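
One way to spot this symptom mechanically is to parse each JSON line and flag a failed dev loop whose entry still reports "error code OK". The sketch below is only an illustration: the struct covers just the fields visible in the sample output above, and `devLoopLine` and `failedWithoutCode` are made-up names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// devLoopLine models only the fields visible in the sample output above;
// the full event schema is not reproduced here.
type devLoopLine struct {
	Result struct {
		Entry string `json:"entry"`
		Event struct {
			DevLoopEvent struct {
				Status string `json:"status"`
			} `json:"devLoopEvent"`
		} `json:"event"`
	} `json:"result"`
}

// failedWithoutCode reports whether a line describes a failed dev loop whose
// entry still claims the error code "OK". Non-JSON lines return an error.
func failedWithoutCode(line string) (bool, error) {
	var l devLoopLine
	if err := json.Unmarshal([]byte(line), &l); err != nil {
		return false, err
	}
	failed := l.Result.Event.DevLoopEvent.Status == "Failed"
	return failed && strings.HasSuffix(l.Result.Entry, "error code OK"), nil
}

func main() {
	line := `{"result":{"event":{"devLoopEvent":{"status":"Failed"}},"entry":"Update failed with error code OK"}}`
	bad, err := failedWithoutCode(line)
	fmt.Println(bad, err)
}
```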

@tejal29
Contributor

tejal29 commented Oct 26, 2020

@PriyaModali Can you please verify?

@tejal29
Contributor

tejal29 commented Oct 26, 2020

Currently, the suggestion we show is "Check cluster connection".

@gsquared94 added a way to detect whether a context is minikube in PR #4701.

We could use this functionality to tailor the suggestion:

  1. If cluster.IsMinikube(opts.KubeContext) is true and opts.KubeContext == "minikube", the suggestion should be
Check if minikube is running using `minikube status` command and try again.
  2. If cluster.IsMinikube(opts.KubeContext) is true and opts.KubeContext != "minikube", e.g. "cloud-run-dev-internal", the suggestion should be
Check if minikube is running using `minikube status -p cloud-run-dev-internal` command and try again.
  3. If cluster.IsMinikube(opts.KubeContext) is false and opts.KubeContext == "gke_tejal-test_us-central1-a_dump", the suggestion should be
Check your cluster connection for cluster gke_tejal-test_us-central1-a_dump.

It would be great if, for conditions 1 and 2, we could actually run the minikube status command and:

  1. if minikube is paused, suggest that users run minikube unpause -p <>
  2. if minikube is stopped, suggest that users run minikube start -p <>.

You can follow up on the above two changes in another PR and create an issue for this.
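
The conditions above could be sketched as follows. This is a hypothetical illustration, not the PR's code: `suggestion` and `followUp` are invented names, the minikube check is passed in as a plain boolean, and the "paused"/"stopped" strings stand in for however the minikube state would actually be detected.

```go
package main

import "fmt"

// suggestion picks the hint to show, following the three conditions above.
// isMinikube is passed in so the sketch does not depend on the real detection.
func suggestion(isMinikube bool, kubeContext string) string {
	switch {
	case isMinikube && kubeContext == "minikube":
		return "Check if minikube is running using `minikube status` command and try again."
	case isMinikube:
		return fmt.Sprintf("Check if minikube is running using `minikube status -p %s` command and try again.", kubeContext)
	default:
		return fmt.Sprintf("Check your cluster connection for cluster %s.", kubeContext)
	}
}

// followUp maps a minikube state to the command a user could run next.
// The state strings are assumptions; an empty result means no specific hint.
func followUp(state, profile string) string {
	switch state {
	case "paused":
		return "minikube unpause -p " + profile
	case "stopped":
		return "minikube start -p " + profile
	default:
		return ""
	}
}

func main() {
	fmt.Println(suggestion(true, "cloud-run-dev-internal"))
	fmt.Println(followUp("paused", "cloud-run-dev-internal"))
}
```

Keeping the minikube check as an input makes each branch easy to exercise without a running cluster.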

@PriyaModali
Contributor Author

PriyaModali commented Oct 29, 2020

Looks like I accidentally deleted the error display logic during the merge process. Added it back.

Contributor

@tejal29 tejal29 left a comment

  1. Use the evaluated kubeContext to check if the cluster is local/minikube.
  2. Mock isMinikube and add tests for all variations:
  • isMinikube=true, clusterName = minikube
  • isMinikube=true, clusterName = some_random
  • isMinikube=false, clusterName = "test_cluster" (what you have right now)
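
One common way to make the check mockable is a package-level function variable that tests swap out. The sketch below assumes that approach; `isMinikube`, `suggest`, and the placeholder default are hypothetical and do not reflect how the PR wires the real detection.

```go
package main

import "fmt"

// isMinikube is a package-level function variable so tests can replace it.
// This default is a placeholder, not the real detection logic.
var isMinikube = func(kubeContext string) bool { return kubeContext == "minikube" }

// suggest builds the hint for a cluster using whatever isMinikube reports.
func suggest(clusterName string) string {
	if !isMinikube(clusterName) {
		return fmt.Sprintf("Check your connection for the %s cluster.", clusterName)
	}
	if clusterName == "minikube" {
		return "Check if minikube is running using `minikube status`."
	}
	return fmt.Sprintf("Check if minikube is running using `minikube status -p %s`.", clusterName)
}

func main() {
	cases := []struct {
		minikube    bool
		clusterName string
	}{
		{true, "minikube"},
		{true, "some_random"},
		{false, "test_cluster"},
	}
	orig := isMinikube
	defer func() { isMinikube = orig }() // restore the real check afterwards
	for _, c := range cases {
		want := c.minikube
		isMinikube = func(string) bool { return want } // mock for this case
		fmt.Printf("isMinikube=%v cluster=%s -> %s\n", c.minikube, c.clusterName, suggest(c.clusterName))
	}
}
```

In a real test file the same table would drive subtests, with each case comparing the result against an expected string.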

@tejal29
Contributor

tejal29 commented Nov 2, 2020

As we discussed in our 1:1 on Friday, switch opts.KubeContext to the currentConfig() value.

Contributor

@tejal29 tejal29 left a comment

This code looks good to me except for one nit.


Successfully merging this pull request may close these issues.

Skaffold- Could not connect to kubernetes cluster should report a distinct exit code.
3 participants