-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retry for pv creation commands #457
Conversation
Is this happening with 4.8 or 4.9? |
@cfergeau This is from 4.9 side. |
/retest |
/test e2e-snc |
1 similar comment
/test e2e-snc |
Not a big fan that some temporary files need to be added, the script is nicer before these changes :( |
@@ -170,19 +170,20 @@ function create_pvs() { | |||
|
|||
for pvname in $(seq -f "pv%04g" 1 ${count}); do | |||
if ! ${OC} get pv "${pvname}" &> /dev/null; then | |||
generate_pv "${pvdir}/${pvname}" "${pvname}" | ${OC} create -f - | |||
generate_pv "${pvdir}/${pvname}" "${pvname}" > tmp_pv.yaml | |||
retry ${OC} create -f tmp_pv.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could work and avoid the need for the temporary file?
retry ${OC} create -f - <<EOF
$(generate_pv "${pvdir}/${pvname}" "${pvname}")
EOF
else | ||
echo "persistentvolume ${pvname} already exists" | ||
fi | ||
done | ||
|
||
# Apply registry pvc to bound with pv0001 | ||
${OC} apply -f registry_pvc.yaml | ||
retry ${OC} apply -f registry_pvc.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it's possible to do something like
OC="retry ${OC}"
so that all calls to oc
go through a retry. Alternatively, this could be a run_oc
function which calls retry ${OC} ...
. Might be cleaner.
snc.sh
Outdated
@@ -191,7 +191,8 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op | |||
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge | |||
|
|||
# Add a tip in the login page | |||
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html | |||
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json > tmp_secret_template.json | |||
${JQ} -r '.data["login.html"]' tmp_secret_template.json | base64 -d > login.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like this:
diff --git a/snc.sh b/snc.sh
index 4ffe9ad..d3dfaef 100755
--- a/snc.sh
+++ b/snc.sh
@@ -191,7 +191,7 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
# Add a tip in the login page
-retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html
+retry_xx /dev/null ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html
${PATCH} login.html < login.html.patch
retry ${OC} create secret generic login-template --from-file=login.html -n openshift-config
diff --git a/tools.sh b/tools.sh
index 288bd74..2cad5c2 100755
--- a/tools.sh
+++ b/tools.sh
@@ -88,7 +88,9 @@ if ! which ${PATCH}; then
sudo yum -y install /usr/bin/patch
fi
-function retry {
+function retry_xx {
+ local output=$1
+ shift
local retries=10
local count=0
until "$@"; do
@@ -106,6 +108,10 @@ function retry {
return 0
}
+function retry {
+ retry /dev/stdout $@
+}
+
function get_vm_prefix {
local crc_vm_name=$1
# This random_string is created by installer and added to each resource type,
could work.
Alternatively, have you tried something like this:
branding_template=$(retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json)
${JQ} ...
Actually I'm not sure I understand why the pipe won't work when the redirect does :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm not sure I understand why the pipe won't work when the redirect does :-/
Pipe and redirect both works but in case of pipe next command execute even with the first command failure but in case of redirect it complete the retry. Following bash script will do retry for 4 times and if you replace redirect with | echo "foobar"
it just do once and exit.
#!/bin/bash
set -exuo pipefail
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
function retry {
local retries=4
local count=0
until "$@"; do
exit=$?
wait=$((2 ** $count))
count=$(($count + 1))
if [ $count -lt $retries ]; then
echo "Retry $count/$retries exited $exit, retrying in $wait seconds..."
sleep $wait
else
echo "Retry $count/$retries exited $exit, no more retries left."
return $exit
fi
done
return 0
}
retry false > tmp_file
Ah, so the temporary file will also contain |
Give more resources to the cluster, we run the testsuite on should hopefully prevent some disruptions/timeouts because of low memory conditions.
This would avoid to have any retry data on a file if we redirect a command output to it and command failed for 1-2 times. ``` retry false > temp_file ```
/retest |
1 similar comment
/retest |
snc.sh
Outdated
@@ -191,7 +191,8 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op | |||
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge | |||
|
|||
# Add a tip in the login page | |||
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html | |||
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json > tmp_secret_template.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this commit needed now that retry
output goes to stderr? I think pipe only picks up stdout and not stderr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfergeau yes this is still required because even the output is stderr but retry <cmd1> | <cmd2>
still going to execute the cmd2 on first failure, what we want is cmd2 only execute if cmd1 successful.
#!/bin/bash
set -exuo pipefail
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
function retry {
local retries=4
local count=0
until "$@"; do
exit=$?
wait=$((2 ** $count))
count=$(($count + 1))
if [ $count -lt $retries ]; then
echo "Retry $count/$retries exited $exit, retrying in $wait seconds..." 1>&2
sleep $wait
else
echo "Retry $count/$retries exited $exit, no more retries left." 1>&2
return $exit
fi
done
return 0
}
retry false | echo "Hello" > foo.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What seems to matter is whether <cmd1>
outputs something on stdout on failures, or if the only time something is printed to stdout is on success
#!/bin/bash
set -exuo pipefail
export LC_ALL=C.UTF-8
export LANG=C.UTF-8
function retry {
local retries=4
local count=0
until "$@"; do
exit=$?
wait=$((2 ** $count))
count=$(($count + 1))
if [ $count -lt $retries ]; then
echo "fail">/dev/null
else
echo '{"foo": 0}'
return 0
fi
done
return 0
}
retry false | jq .
works fine, jq
is called on the last iteration, but if I remove the >/dev/null
and print to stdout in the error cases, then it fails.
Maybe what I suggested before would do the trick:
branding_template=$(retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json)
${JQ} ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfergeau so with latest patch I tried looks like it is failing on the CI since it have input stream once ?
++ cat
+ pv_template='apiVersion: v1
kind: PersistentVolume
metadata:
name: pv0001
labels:
volume: pv0001
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
- ReadWriteMany
- ReadOnlyMany
hostPath:
path: /mnt/pv-data/pv0001
persistentVolumeReclaimPolicy: Recycle'
+ retry ./openshift-clients/linux/oc create -f /dev/fd/63
+ local retries=10
++ echo 'apiVersion: v1
kind: PersistentVolume
metadata:
name: pv0001
labels:
volume: pv0001
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
- ReadWriteMany
- ReadOnlyMany
hostPath:
path: /mnt/pv-data/pv0001
persistentVolumeReclaimPolicy: Recycle'
+ local count=0
+ ./openshift-clients/linux/oc create -f /dev/fd/63
error: error when creating "/dev/fd/63": Post "https://api.crc.testing:6443/api/v1/persistentvolumes?fieldManager=kubectl-create": dial tcp 192.168.126.11:6443: connect: connection refused
+ exit=1
+ wait=1
+ count=1
+ '[' 1 -lt 10 ']'
+ echo 'Retry 1/10 exited 1, retrying in 1 seconds...'
Retry 1/10 exited 1, retrying in 1 seconds...
+ sleep 1
+ ./openshift-clients/linux/oc create -f /dev/fd/63
error: no objects passed to create
+ exit=1
+ wait=2
+ count=2
+ '[' 2 -lt 10 ']'
+ echo 'Retry 2/10 exited 1, retrying in 2 seconds...'
Retry 2/10 exited 1, retrying in 2 seconds...
+ sleep 2
+ ./openshift-clients/linux/oc create -f /dev/fd/63
error: no objects passed to create
.gitignore
Outdated
@@ -9,3 +9,4 @@ yq | |||
openshift-clients/ | |||
podman-remote/ | |||
.sw[a-p] | |||
tmp_pv.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still prefer to avoid temporary files as much as possible :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfergeau yes but I don't see options here :(
/retest |
2 similar comments
/retest |
/retest |
d75e265
to
0426e8e
Compare
/retest |
0426e8e
to
258e50f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add some explanation to the commit log for Add cluster-cloud-controller-manager-operator to cvo override list
If we need to revisit that decision in 1 year, will be useful to remember why we decided it was not needed in the past.
snc-library.sh
Outdated
@@ -170,19 +170,20 @@ function create_pvs() { | |||
|
|||
for pvname in $(seq -f "pv%04g" 1 ${count}); do | |||
if ! ${OC} get pv "${pvname}" &> /dev/null; then | |||
generate_pv "${pvdir}/${pvname}" "${pvname}" | ${OC} create -f - | |||
pv_template=$(generate_pv "${pvdir}/${pvname}" "${pvname}") | |||
retry ${OC} create -f <(echo ${pv_template}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation mismatch it seems, one line with tabs, the other with spaces?
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cfergeau, praveenkumar The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
258e50f
to
adf6d95
Compare
New changes are detected. LGTM label has been removed. |
@@ -191,7 +191,8 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op | |||
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge | |||
|
|||
# Add a tip in the login page | |||
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html | |||
secret_template=$(retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json) | |||
${JQ} -r '.data["login.html"]' <(echo "${secret_template}") | base64 -d > login.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be echo ${secret_template} | ${JQ} ...
. Not familiar at all with the <(echo ...)
syntax even if I was the one to bring it up, so don't know if this is behaving as expected.
snc-library.sh
Outdated
@@ -170,19 +170,20 @@ function create_pvs() { | |||
|
|||
for pvname in $(seq -f "pv%04g" 1 ${count}); do | |||
if ! ${OC} get pv "${pvname}" &> /dev/null; then | |||
generate_pv "${pvdir}/${pvname}" "${pvname}" | ${OC} create -f - | |||
pv_template=$(generate_pv "${pvdir}/${pvname}" "${pvname}") | |||
retry ${OC} create -f <(echo "${pv_template}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think anything involving <
|
won't work as expected with retry, as the first try will consume the data which was produced by echo
, and the next tries will have nothing to read from.
diff --git a/tools.sh b/tools.sh
index 288bd74..d22426a 100755
--- a/tools.sh
+++ b/tools.sh
@@ -91,7 +91,13 @@ fi
function retry {
local retries=10
local count=0
- until "$@"; do
+ if [ -t 0 ]; then
+ # no pipe
+ input=""
+ else
+ # pipe, save stdin
+ input="$(cat)"
+ fi
+
+ until echo "$input" | "$@"; do
exit=$?
wait=$((2 ** $count))
count=$(($count + 1))
might help, but I expect there will be other corner cases I'm not familiar with. An intermediate function may help (one which runs echo ${pv_template} | ${OC} create -f
so that we can then run retry intermediate_function ${pv_template}
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cfergeau in that case I will go with intermediate file and remove it after retry happen. Don't want to spend more time around it with hit an trial method. I can create a follow up issue for improvement around it if you like and spend sometime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #460
This patch will fix following error from CI ``` ./openshift-clients/linux/oc create -f - error: error when creating "STDIN": Post "https://api.crc.testing:6443/api/v1/persistentvolumes?fieldManager=kubectl-create": dial tcp 192.168.126.11:6443: connect: connection refused ```
This patch will fix following CI failure. ``` + retry ./openshift-clients/linux/oc get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json + local retries=10 + local count=0 + jq -r '.data["login.html"]' + ./openshift-clients/linux/oc get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json + base64 -d The connection to the server api.crc.testing:6443 was refused - did you specify the right host or port? + exit=1 + wait=1 + count=1 + '[' 1 -lt 10 ']' + echo 'Retry 1/10 exited 1, retrying in 1 seconds...' + sleep 1 parse error: Invalid numeric literal at line 1, column 6 ```
On the CI these tests are failing every single time. - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/code-ready_snc/457/pull-ci-code-ready-snc-master-e2e-snc/1420281953436831744
adf6d95
to
25449e3
Compare
This patch will fix following error from CI