Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[postgresql-ha] delete pods in ha break cluster #2805

Closed
borgez opened this issue Jun 10, 2020 · 23 comments
Closed

[postgresql-ha] delete pods in ha break cluster #2805

borgez opened this issue Jun 10, 2020 · 23 comments
Labels
on-hold Issues or Pull Requests with this label will never be considered stale

Comments

@borgez
Copy link

borgez commented Jun 10, 2020

Which chart:
postgresql-ha:3.2.9

Describe the bug
postgresql-0, postgresql-1, postgresql-2 cannot start agan because repmgr passwords change when pods recreate

To Reproduce

  1. create with default config
  2. wait to full up Workloads
  3. hard delete all pods from deployment (like update chart)
  4. wait for pods recreate and up
  5. pods cannot start because passwords change what?

Expected behavior

  1. pods up, cluster resync
  2. values in Secrets not change
@marcosbc
Copy link
Contributor

Hi @borgez, when you mention this:

hard delete all pods from deployment (like update chart)

Do you mean running a command such as "helm update"?

Note that if you do that, secrets are regenerated and if you had a persistent volume, there will be a mismatch between the current secrets and configured secrets, meaning the authentication fails.

To fix that, you have a couple of options:

  • Specify existingSecrets with the name of the previous -working- secrets (i.e. postgresql.existingSecrets/pgpool.existingSecrets/ldap.existingSecrets).
  • Specify the password in each update of the chart (i.e. postgresql.password/postgresql.repmgrPassword/ldap.bindPassword/pgpool.adminPassword).

IMPORTANT: These options need to be specified both in the creation (helm install) and update (helm update) of the chart. If you don't have the values for the previous deployment, either delete the existing PVC and lose any previous data (the simpler option) or reset the password to match the existing secret (hard). Finally, re-crete the deployment specifying the secrets/password.

@borgez
Copy link
Author

borgez commented Jun 11, 2020

Do you mean running a command such as "helm update"?

I tried to break the cluster in different ways to check its fault tolerance.
I think I did similar thing to "helm update"

Note that if you do that, secrets are regenerated and if you had a persistent volume, there will be a mismatch between the current secrets and configured secrets, meaning the authentication fails.

it break cluster

Specify existingSecrets with the name of the previous -working- secrets (i.e. postgresql.existingSecrets/pgpool.existingSecrets/ldap.existingSecrets).

I not found this in docs

Specify the password in each update of the chart (i.e. postgresql.password/postgresql.repmgrPassword/ldap.bindPassword/pgpool.adminPassword).

it fix for me, yes

But this is not the expected behavior.
In docs i found update section and i expect it will be new password, not exist, this not clear why is there such default behavior.

We have secrets and I expect that when updating nothing will break and existing secrets wil be reuse as default.
When i want change passwords i expect i pass somting like --new-password="1234" or update it manually by cmd

@borgez
Copy link
Author

borgez commented Jun 11, 2020

Or may be helm param global.rollPasswords=true, but not by default

@borgez
Copy link
Author

borgez commented Jun 11, 2020

And i think if chart force roll password it need to notify payload about it and change passwords in payload postgresql, repmgr etc, it break kubernetes philosophy

@joancafom
Copy link
Contributor

Hi @borgez

By default, if no password is provided by the end user, the Chart will generate and use a random one instead. When performing an upgrade (and not specifying a password) this behaviour is preserved. By providing with the same credentials (or an existingSecret containing them), this random stage is skipped. As aforementioned, this is specially important when dealing with persistent volumes, as the new release may change the secrets but in no manner will it be able to alter previously persisted configuration.

@borgez
Copy link
Author

borgez commented Jun 15, 2020

Hi @joancafom

Yes, if secrets alredy exists random stage should be skipped but it not...

@borgez
Copy link
Author

borgez commented Jun 15, 2020

I record screencast https://nimb.ws/MkaKpZ

@joancafom
Copy link
Contributor

Hi @borgez

I will try to exemplify my words by running an example. Let's say I want to deploy a new release of my chart using helm:

$ helm install tauro bitnami/postgresql-ha                                                                                                                                                                 
NAME: tauro
LAST DEPLOYED: Tue Jun 16 09:43:02 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
** Please be patient while the chart is being deployed **
...

To get the password for "postgres" run:

    export POSTGRES_PASSWORD=$(kubectl get secret --namespace default tauro-postgresql-ha-postgresql -o jsonpath="{.data.postgresql-password}" | base64 --decode)

To get the password for "repmgr" run:

    export REPMGR_PASSWORD=$(kubectl get secret --namespace default tauro-postgresql-ha-postgresql -o jsonpath="{.data.repmgr-password}" | base64 --decode)
...

Once I give some reasonable amount of time for the pods to go up, I am successfully able to see them all running:

$ kubectl get pods                                                                                                                                                                                             
NAME                                          READY   STATUS    RESTARTS   AGE
tauro-postgresql-ha-pgpool-5c48ff85f8-hkdwd   1/1     Running   0          3m52s
tauro-postgresql-ha-postgresql-0              1/1     Running   0          3m52s
tauro-postgresql-ha-postgresql-1              1/1     Running   0          3m44s

This release also came along with some secrets using that were populated using random passwords, as I did not specify any on installation time:

$ kubectl get secrets                                                                                                                                                                                            
NAME                             TYPE                                  DATA   AGE
tauro-postgresql-ha-pgpool       Opaque                                1      14m
tauro-postgresql-ha-postgresql   Opaque                                2      14m

Now imagine that I want to update the number of replicas, be it one more than current's count. In order to preserve the same passwords and prevent the chart from failing on the update, I must provide the old passwords. We can create two new variables containing them, using the instructions provided on the first command:

$ export POSTGRES_PASSWORD=$(kubectl get secret --namespace default tauro-postgresql-ha-postgresql -o jsonpath="{.data.postgresql-password}" | base64 --decode)
$ export REPMGR_PASSWORD=$(kubectl get secret --namespace default tauro-postgresql-ha-postgresql -o jsonpath="{.data.repmgr-password}" | base64 --decode)

I can safely perform the update now:

$ helm upgrade tauro --set postgresql.password=$POSTGRES_PASSWORD \
--set postgresql.repmgrPassword=$REPMGR_PASSWORD \
--set postgresql.replicaCount=3 \
bitnami/postgresql-ha
$ kubectl get pods                                                                                                                                                                                               
NAME                                          READY   STATUS    RESTARTS   AGE
tauro-postgresql-ha-pgpool-64b79f5659-l869k   1/1     Running   0          94s
tauro-postgresql-ha-postgresql-0              1/1     Running   1          32s
tauro-postgresql-ha-postgresql-1              1/1     Running   0          62s
tauro-postgresql-ha-postgresql-2              1/1     Running   2          94s

If the passwords weren't provided in the upgrade stage, the Chart will create new pairs of randomly generated credentials, leading to a failure. We are currently considering how to improve this behaviour.

I hope this running example helps, thanks!

@stale
Copy link

stale bot commented Jul 1, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Jul 1, 2020
@borgez
Copy link
Author

borgez commented Jul 1, 2020

@joancafom hi, you have plan to fix it?

@stale stale bot removed the stale 15 days without activity label Jul 1, 2020
@joancafom
Copy link
Contributor

Hi @borgez

We are evaluating different approaches regarding how to skip this complementary information provision when performing an upgrade, but I am afraid I can't give you an ETA. On the meantime, README.md contains a section explaining this is necessary and how to overcome the problem.

Thanks again!

@stale
Copy link

stale bot commented Jul 17, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Jul 17, 2020
@joancafom
Copy link
Contributor

joancafom commented Jul 20, 2020

Hi again,

We are currently making improvements to prevent this issue from happening. You can check this PR #3150 for further details 😄, but we plan to log some errors when some required fields were not provided in the upgrade.

Thanks for your issue!

@stale stale bot removed the stale 15 days without activity label Jul 20, 2020
@borgez
Copy link
Author

borgez commented Jul 20, 2020

@joancafom Great, thank you!

@stale
Copy link

stale bot commented Aug 6, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Aug 6, 2020
@tombh
Copy link

tombh commented Aug 9, 2020

I just ran into this problem too. I've manually set global.postgresql.database.repmgrPassword now so it should be ok. It's not clear from PR #3150 whether this is fixed yet? Or indeed if that PR is just providing better error reporting?

@stale stale bot removed the stale 15 days without activity label Aug 9, 2020
@joancafom
Copy link
Contributor

Hi @tombh

We are still implementing changes regarding this (check PR #3335 out). Once they are done, any upgrade release not providing the correct parameters will block and report the issue to the end user, thus preventing this situation from happening.

Thanks

@stale
Copy link

stale bot commented Aug 27, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Aug 27, 2020
@rafariossaa
Copy link
Contributor

Hi ,
Sorry for the delay, the PR #3335 was merged.
Could you give this a try and check if this issue keeps happening to you ?

@stale stale bot removed the stale 15 days without activity label Aug 27, 2020
@tombh
Copy link

tombh commented Aug 27, 2020

Thank you ❤️️ I'll test on our next blue/green cluster swap, but might not be for a while :/

@stale
Copy link

stale bot commented Sep 11, 2020

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Sep 11, 2020
@rafariossaa rafariossaa added the on-hold Issues or Pull Requests with this label will never be considered stale label Sep 14, 2020
@stale stale bot removed the stale 15 days without activity label Sep 14, 2020
@rafariossaa
Copy link
Contributor

I will left this as on-hold waiting for confirmation.

@carrodher
Copy link
Member

Unfortunately, this issue was created a long time ago and although there is an internal task to fix it, it was not prioritized as something to address in the short/mid term. It's not a technical reason but something related to the capacity since we're a small team.

Being said that, contributions via PRs are more than welcome in both repositories (containers and charts). Just in case you would like to contribute.

During this time, there are several releases of this asset and it's possible the issue has gone as part of other changes. If that's not the case and you are still experiencing this issue, please feel free to reopen it and we will re-evaluate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
on-hold Issues or Pull Requests with this label will never be considered stale
Projects
None yet
Development

No branches or pull requests

6 participants