Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry for pv creation commands #457

Merged
merged 6 commits into from
Aug 16, 2021

Conversation

praveenkumar
Copy link
Member

This patch will fix following error from CI

./openshift-clients/linux/oc create -f -
error: error when creating "STDIN": Post "https://api.crc.testing:6443/api/v1/persistentvolumes?fieldManager=kubectl-create": dial tcp 192.168.126.11:6443: connect: connection refused

@cfergeau
Copy link
Contributor

This patch will fix following error from CI

Is this happening with 4.8 or 4.9?

@praveenkumar
Copy link
Member Author

Is this happening with 4.8 or 4.9?

@cfergeau This is from 4.9 side.

@praveenkumar
Copy link
Member Author

/retest

@praveenkumar
Copy link
Member Author

/test e2e-snc

1 similar comment
@praveenkumar
Copy link
Member Author

/test e2e-snc

@cfergeau
Copy link
Contributor

cfergeau commented Aug 2, 2021

Not a big fan that some temporary files need to be added, the script is nicer before these changes :(

@@ -170,19 +170,20 @@ function create_pvs() {

for pvname in $(seq -f "pv%04g" 1 ${count}); do
if ! ${OC} get pv "${pvname}" &> /dev/null; then
generate_pv "${pvdir}/${pvname}" "${pvname}" | ${OC} create -f -
generate_pv "${pvdir}/${pvname}" "${pvname}" > tmp_pv.yaml
retry ${OC} create -f tmp_pv.yaml
Copy link
Contributor

@cfergeau cfergeau Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could work and avoid the need for the temporary file?

retry  ${OC} create -f - <<EOF
    $(generate_pv "${pvdir}/${pvname}" "${pvname}")
EOF

else
echo "persistentvolume ${pvname} already exists"
fi
done

# Apply registry pvc to bound with pv0001
${OC} apply -f registry_pvc.yaml
retry ${OC} apply -f registry_pvc.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it's possible to do something like

OC="retry ${OC}"

so that all calls to oc go through a retry. Alternatively, this could be a run_oc function which calls retry ${OC} .... Might be cleaner.

snc.sh Outdated
@@ -191,7 +191,8 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge

# Add a tip in the login page
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json > tmp_secret_template.json
${JQ} -r '.data["login.html"]' tmp_secret_template.json | base64 -d > login.html
Copy link
Contributor

@cfergeau cfergeau Aug 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like this:

diff --git a/snc.sh b/snc.sh
index 4ffe9ad..d3dfaef 100755
--- a/snc.sh
+++ b/snc.sh
@@ -191,7 +191,7 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op
 retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
 
 # Add a tip in the login page
-retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]'  | base64 -d > login.html
+retry_xx /dev/null ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]'  | base64 -d > login.html
 ${PATCH} login.html < login.html.patch
 retry ${OC} create secret generic login-template --from-file=login.html -n openshift-config
 
diff --git a/tools.sh b/tools.sh
index 288bd74..2cad5c2 100755
--- a/tools.sh
+++ b/tools.sh
@@ -88,7 +88,9 @@ if ! which ${PATCH}; then
     sudo yum -y install /usr/bin/patch
 fi
 
-function retry {
+function retry_xx {
+    local output=$1
+    shift
     local retries=10
     local count=0
     until "$@"; do
@@ -106,6 +108,10 @@ function retry {
     return 0
 }
 
+function retry {
+        retry /dev/stdout $@
+}
+
 function get_vm_prefix {
     local crc_vm_name=$1
     # This random_string is created by installer and added to each resource type,

could work.

Alternatively, have you tried something like this:

branding_template=$(retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json)
${JQ} ...

Actually I'm not sure I understand why the pipe won't work when the redirect does :-/

Copy link
Member Author

@praveenkumar praveenkumar Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm not sure I understand why the pipe won't work when the redirect does :-/

Pipe and redirect both works but in case of pipe next command execute even with the first command failure but in case of redirect it complete the retry. Following bash script will do retry for 4 times and if you replace redirect with | echo "foobar" it just do once and exit.

#!/bin/bash

set -exuo pipefail

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

function retry {
    local retries=4
    local count=0
    until "$@"; do
        exit=$?
        wait=$((2 ** $count))
        count=$(($count + 1))
        if [ $count -lt $retries ]; then
            echo "Retry $count/$retries exited $exit, retrying in $wait seconds..."
            sleep $wait
        else
            echo "Retry $count/$retries exited $exit, no more retries left."
            return $exit
        fi
    done
    return 0
}

retry false > tmp_file

@cfergeau
Copy link
Contributor

cfergeau commented Aug 4, 2021

Ah, so the temporary file will also contain Retry ..., retrying in ... seconds... together with the json data, and we are lucky that jq accepts to parse this? Would be better to be able to cleanly separate the logs from retry from the actual json data?

cfergeau and others added 2 commits August 5, 2021 11:39
Give more resources to the cluster, we run the testsuite on
should hopefully prevent some disruptions/timeouts
because of low memory conditions.
This would avoid to have any retry data on a file if we redirect a
command output to it and command failed for 1-2 times.

```
retry false > temp_file
```
@praveenkumar
Copy link
Member Author

/retest

1 similar comment
@praveenkumar
Copy link
Member Author

/retest

snc.sh Outdated
@@ -191,7 +191,8 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge

# Add a tip in the login page
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json > tmp_secret_template.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this commit needed now that retry output goes to stderr? I think pipe only picks up stdout and not stderr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfergeau yes this is still required because even the output is stderr but retry <cmd1> | <cmd2> still going to execute the cmd2 on first failure, what we want is cmd2 only execute if cmd1 successful.

#!/bin/bash

set -exuo pipefail

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

function retry {
    local retries=4
    local count=0
    until "$@"; do
        exit=$?
        wait=$((2 ** $count))
        count=$(($count + 1))
        if [ $count -lt $retries ]; then
            echo "Retry $count/$retries exited $exit, retrying in $wait seconds..." 1>&2
            sleep $wait
        else
            echo "Retry $count/$retries exited $exit, no more retries left." 1>&2
            return $exit 
        fi
    done
    return 0
}

retry false | echo "Hello" > foo.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What seems to matter is whether <cmd1> outputs something on stdout on failures, or if the only time something is printed to stdout is on success

#!/bin/bash

set -exuo pipefail

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

function retry {
    local retries=4
    local count=0
    until "$@"; do
        exit=$?
        wait=$((2 ** $count))
        count=$(($count + 1))
        if [ $count -lt $retries ]; then
            echo "fail">/dev/null
        else
            echo '{"foo": 0}'
            return 0
        fi
    done
    return 0
}

retry false | jq .

works fine, jq is called on the last iteration, but if I remove the >/dev/null and print to stdout in the error cases, then it fails.
Maybe what I suggested before would do the trick:

branding_template=$(retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json)
${JQ} ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfergeau so with latest patch I tried looks like it is failing on the CI since it have input stream once ?

++ cat
+ pv_template='apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
  labels:
    volume: pv0001
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
    - ReadOnlyMany
  hostPath:
    path: /mnt/pv-data/pv0001
  persistentVolumeReclaimPolicy: Recycle'
+ retry ./openshift-clients/linux/oc create -f /dev/fd/63
+ local retries=10
++ echo 'apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0001
  labels:
    volume: pv0001
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
    - ReadWriteMany
    - ReadOnlyMany
  hostPath:
    path: /mnt/pv-data/pv0001
  persistentVolumeReclaimPolicy: Recycle'
+ local count=0
+ ./openshift-clients/linux/oc create -f /dev/fd/63
error: error when creating "/dev/fd/63": Post "https://api.crc.testing:6443/api/v1/persistentvolumes?fieldManager=kubectl-create": dial tcp 192.168.126.11:6443: connect: connection refused
+ exit=1
+ wait=1
+ count=1
+ '[' 1 -lt 10 ']'
+ echo 'Retry 1/10 exited 1, retrying in 1 seconds...'
Retry 1/10 exited 1, retrying in 1 seconds...
+ sleep 1
+ ./openshift-clients/linux/oc create -f /dev/fd/63
error: no objects passed to create
+ exit=1
+ wait=2
+ count=2
+ '[' 2 -lt 10 ']'
+ echo 'Retry 2/10 exited 1, retrying in 2 seconds...'
Retry 2/10 exited 1, retrying in 2 seconds...
+ sleep 2
+ ./openshift-clients/linux/oc create -f /dev/fd/63
error: no objects passed to create

.gitignore Outdated
@@ -9,3 +9,4 @@ yq
openshift-clients/
podman-remote/
.sw[a-p]
tmp_pv.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still prefer to avoid temporary files as much as possible :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfergeau yes but I don't see options here :(

@praveenkumar
Copy link
Member Author

/retest

2 similar comments
@praveenkumar
Copy link
Member Author

/retest

@praveenkumar
Copy link
Member Author

/retest

@praveenkumar
Copy link
Member Author

/retest

Copy link
Contributor

@cfergeau cfergeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add some explanation to the commit log for Add cluster-cloud-controller-manager-operator to cvo override list
If we need to revisit that decision in 1 year, will be useful to remember why we decided it was not needed in the past.

snc-library.sh Outdated
@@ -170,19 +170,20 @@ function create_pvs() {

for pvname in $(seq -f "pv%04g" 1 ${count}); do
if ! ${OC} get pv "${pvname}" &> /dev/null; then
generate_pv "${pvdir}/${pvname}" "${pvname}" | ${OC} create -f -
pv_template=$(generate_pv "${pvdir}/${pvname}" "${pvname}")
retry ${OC} create -f <(echo ${pv_template})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation mismatch it seems, one line with tabs, the other with spaces?

@openshift-ci
Copy link

openshift-ci bot commented Aug 11, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfergeau, praveenkumar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cfergeau,praveenkumar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link

openshift-ci bot commented Aug 11, 2021

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot removed the lgtm label Aug 11, 2021
@@ -191,7 +191,8 @@ retry ${OC} scale --replicas=1 ingresscontroller/default -n openshift-ingress-op
retry ${OC} patch config.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge

# Add a tip in the login page
retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json | ${JQ} -r '.data["login.html"]' | base64 -d > login.html
secret_template=$(retry ${OC} get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json)
${JQ} -r '.data["login.html"]' <(echo "${secret_template}") | base64 -d > login.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be echo ${secret_template} | ${JQ} .... Not familiar at all with the <(echo ...) syntax even if I was the one to bring it up, so don't know if this is behaving as expected.

snc-library.sh Outdated
@@ -170,19 +170,20 @@ function create_pvs() {

for pvname in $(seq -f "pv%04g" 1 ${count}); do
if ! ${OC} get pv "${pvname}" &> /dev/null; then
generate_pv "${pvdir}/${pvname}" "${pvname}" | ${OC} create -f -
pv_template=$(generate_pv "${pvdir}/${pvname}" "${pvname}")
retry ${OC} create -f <(echo "${pv_template}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think anything involving < | won't work as expected with retry, as the first try will consume the data which was produced by echo, and the next tries will have nothing to read from.

diff --git a/tools.sh b/tools.sh
index 288bd74..d22426a 100755
--- a/tools.sh
+++ b/tools.sh
@@ -91,7 +91,13 @@ fi
 function retry {
     local retries=10
     local count=0
-    until "$@"; do
+    if [ -t 0 ]; then
+        # no pipe
+        input=""
+    else
+        # pipe, save stdin
+        input="$(cat)"
+    fi
+
+    until echo "$input" | "$@"; do
         exit=$?
         wait=$((2 ** $count))
         count=$(($count + 1))

might help, but I expect there will be other corner cases I'm not familiar with. An intermediate function may help (one which runs echo ${pv_template} | ${OC} create -f so that we can then run retry intermediate_function ${pv_template})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cfergeau in that case I will go with intermediate file and remove it after retry happen. Don't want to spend more time around it with hit an trial method. I can create a follow up issue for improvement around it if you like and spend sometime.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #460

This patch will fix following error from CI
```
./openshift-clients/linux/oc create -f -
error: error when creating "STDIN": Post "https://api.crc.testing:6443/api/v1/persistentvolumes?fieldManager=kubectl-create": dial tcp 192.168.126.11:6443: connect: connection refused
```
This patch will fix following CI failure.
```
+ retry ./openshift-clients/linux/oc get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json
+ local retries=10
+ local count=0
+ jq -r '.data["login.html"]'
+ ./openshift-clients/linux/oc get secrets -n openshift-authentication v4-0-config-system-ocp-branding-template -o json
+ base64 -d
The connection to the server api.crc.testing:6443 was refused - did you specify the right host or port?
+ exit=1
+ wait=1
+ count=1
+ '[' 1 -lt 10 ']'
+ echo 'Retry 1/10 exited 1, retrying in 1 seconds...'
+ sleep 1
parse error: Invalid numeric literal at line 1, column 6
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants