RabbitMQ OCF RA M/S HA cluster agent migration #1698
Conversation
Based on Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e. All the work done for fuel-library packaging. Should be overridden by the change above after we switch CI to package-based. implements blueprint: package-fuel-components Change-Id: I48ed37a009b42f0a9a21cc869a869edb505b39c3
1) Package fuel library into three different packages: RPM: fuel-library6.1 ALL: fuel-ha-utils, fuel-misc 2) Install packages onto slave nodes implements blueprint: package-fuel-components Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
This commit makes the post-start notify action check that the hostlist of nodes to be joined to the cluster contains not only the nodes that will be started but also the ones that are already started. This fixes the case where Pacemaker sends notifies only for the latest event, so a node not included in the start list would not join the cluster. It also checks whether the node is already clustered and skips the join if it is not needed. Change-Id: Ibe8ecdcfe42c14228350b1eb3c9d08b1a64e117d Closes-bug: #1455761
There is a mistake in the OCF logic: it tries to start the rabbitmq app without a running beam after a Mnesia reset, getting into a loop that constantly fails until it times out. Change-Id: Id096961e206a083b51978fc5034f99d04715d7ea Related-bug: #1436812
W/o this patch, the code in the OCF script from the deployment/ dir will never get into the fuel-library packages, which are built from the files/ and debian/ dirs only. The solution is to: 1) sync the diverged code to files/ and debian/; 2) either remove the source OCF file or update the way the files are linked. This patch fixes only step 1, as it is not yet decided how to deal with step 2. Related-bug: #1457441 Related-bug: #184966 Change-Id: Ied86640e8e853de99bcd26f1ae726fc8272b6db7 Signed-off-by: Bogdan Dobrelya <[email protected]>
W/o this fix, when the rabbit app cannot start due to a corrupted mnesia state, the mnesia would not be cleaned completely. This may prevent the rabbit app from starting and take the node out of the cluster permanently. The solution is to remove all mnesia files related to the rabbit node. Closes-bug: #1457766 Change-Id: I680efbf573c22aa9a13d8429d985b5a57235b2bf Signed-off-by: Bogdan Dobrelya <[email protected]>
* When the rabbit node went down, its status remains 'running' in the mnesia db for a while, so a few retries (50 sec total) are required in order to kick and forget this node from the cluster. This also requires +50 sec for the stop & demote action timeouts.
* The rabbit master score in the CIB is retained after the current master is moved manually. This is wrong and the score must be reset ASAP on post-demote and post-stop as well.
* The demoted node must be kicked from the cluster by the other nodes on post-demote processing.
* Post-demote should stop the rabbit app at the node being demoted, as this node should be kicked from the cluster by the other nodes. Instead, it stops the app at the *other* nodes and brings full cluster downtime.
* The check to join should only be done at post-start and not at post-promote; otherwise the node being promoted may think it is clustered with some node while the join check reports it as already clustered with another one. (The regression was caused by https://review.openstack.org/184671)
* Change `hostname` calls to `crm_node -n` via $THIS_PCMK_NODE everywhere to ensure we are using the correct pacemaker node name.
* Handle empty values of OCF_RESKEY_CRM_meta_notify_* by reporting the resource as not running. This will rerun the resource and restore its state, eventually.
Closes-bug: #1436812 Closes-bug: #1455761 Change-Id: Ib01c1731b4f06e6b643a4bca845828f7db507ad3 Signed-off-by: Bogdan Dobrelya <[email protected]>
W/o this fix, the failover time was longer than expected, as rabbit nodes were able to query corosync nodes that had left the cluster and also tried to join them into the rabbit cluster, ending up being reset and rejoining alive nodes later. 1) Add functions: a) to get all alive nodes in the partition; b) to get all nodes. This fixes get_monitor behaviour so that it ignores attributes of dead nodes, as crm_node behaviour changed with the upgrade of pacemaker. Thus rabbit nodes will never try to join the dead ones. 2) Fix bash scopes for local variables. A minor change removing the unexpected behaviour where a local variable impacts the global scope. Related-bug: #1436812 Change-Id: I89b716b4cd007572bb6832365d4424669921f057 Signed-off-by: Bogdan Dobrelya <[email protected]>
W/o this fix, rabbitmqctl sometimes may hang, failing many commands. This is a problem as it brings the rabbit node into an unresponsive and broken state. This also may affect entire cluster operations, for example, when the failed command is forget_cluster_node. The solution is to check for the cases when the rabbitmqctl list_channels command timed out and was killed or terminated with exit code 137 or 124, and return a generic error. A related confusing error message, "get_status() returns generic error", which may be logged when the rabbit node is running out of the cluster, is fixed as well. Closes-bug: #1459173 Change-Id: Ia52fc5f2ab7adb36252a7194f9209ab87ce487de Signed-off-by: Bogdan Dobrelya <[email protected]>
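The exit-code handling described above can be sketched in shell. The function name and the simulated return code are illustrative, not the RA's actual code:

```shell
# Illustrative sketch: treat the exit codes left by a `timeout`-terminated
# command as a generic error rather than "not running".
# 124 = command timed out, 137 = killed with SIGKILL (128 + 9).
check_rmqctl_rc() {
    rc=$1
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        echo "rabbitmqctl timed out or was killed (rc=$rc)"
        return 1   # the RA would map this to OCF_ERR_GENERIC
    fi
    return 0
}

rc=124   # e.g. the result of `timeout ${T} rabbitmqctl list_channels`
check_rmqctl_rc "$rc" || echo "monitor fails with generic error"
```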
This commit checks whether there is a running rabbitmq cluster and whether the rabbitmq app is running on the node, and exits with a non-zero code if the current node is not running rabbitmq but should be. Change-Id: I2098405b39ade7325b94781aeb997de0937bdf4c Closes-bug: #1458828
W/o this fix, a situation is possible when a rabbit node gets stuck in a start/stop loop, failing to join the cluster with the error: "no_running_cluster_nodes, You cannot leave a cluster if no online nodes are present." This is an issue because the rabbit node should always be able to join the cluster if it was ordered to start by the pacemaker RA. The solution is to force a mnesia reset if the rabbit node cannot join the cluster on the post-start notify. Note that for a master start, the node wouldn't be reset, so the mnesia will be kept intact at least on the resource master. Partial-bug: #1461509 Change-Id: I69bc13266a1dc784681b2677ae5616bfc28cf54f Signed-off-by: Bogdan Dobrelya <[email protected]>
W/o this fix, a dead-end situation is possible when the rabbit node has no free memory left and the cluster blocks all publishing, by design. But the app thinks "let's wait for the publish block to be lifted" and cannot recover. The workaround is to monitor the results of crucial rabbitmqctl commands and restart the rabbit node if queues/channels/alarms cannot be listed or if memory alarms are found. This is similar logic to what we have for the cases when rabbitmqctl list_channels hangs, but the channels check is also fixed to verify whether the exit code is >0 while the rabbit app is running. The additional checks added to the monitor also require extending the timeout window for the monitor action from 60 to 180 seconds. Besides that, this patch makes the monitor action gather the rabbit status and runtime stats, like memory consumed by all queues out of total Mem+Swap, total messages in all queues, and average queue consumer utilization. This info should help to troubleshoot failures better. DocImpact: ops guide. If any rabbitmq node exceeds its memory threshold, publishing becomes blocked cluster-wide, by design. For such cases, this rabbit node would be recovered from the raised memory alert and immediately stopped, to be restarted later by pacemaker. Otherwise, this blocked publishing state might never be lifted if the pressure from the OpenStack apps side persists. Closes-bug: #1463433 Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9 Signed-off-by: Bogdan Dobrelya <[email protected]>
W/o this fix, the list of file names not accessible by the rabbitmq user is treated as multiple arguments to the if command, causing it to throw a "too many arguments" error, and the chown command is skipped. This is a problem as it might prevent the rabbitmq server from starting because of bad file ownership. The solution is to pass the list of files as a single argument, "${foo}". Closes-bug: #1472175 Change-Id: I1d00ec3f31cd0f023bd58a4e11e5b31659977229 Signed-off-by: Bogdan Dobrelya <[email protected]>
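The quoting bug is easy to reproduce in a few lines; the `files` variable and the chown target below are made up for illustration:

```shell
# files stands in for the list of paths not owned by the rabbitmq user.
files="/tmp/rmq-a /tmp/rmq-b /tmp/rmq-c"

# Broken: unquoted expansion passes three words to `[`, which fails with
# "too many arguments", so the chown branch would be skipped:
#   if [ -n $files ]; then ... fi

# Fixed: quoting passes the whole list as a single argument.
if [ -n "${files}" ]; then
    echo "would run: chown rabbitmq:rabbitmq ${files}"
fi
```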
W/o this fix, a situation is possible when the rabbit OCF returns OCF_NOT_RUNNING in the hope that pacemaker will restart the resource later. But in fact, pacemaker will not trigger the restart action if monitor returns "not running". This is an issue as we want the resource restarted. The solution is to return OCF_ERR_GENERIC instead of OCF_NOT_RUNNING when we expect the resource to be restarted (which is action stop plus action start). Closes-bug: #1472230 Change-Id: I10c6e43d92cb23596636d86932674b36864d1595 Signed-off-by: Bogdan Dobrelya <[email protected]>
This change leverages the rabbitmq management plugin to dump exchanges, queues, bindings, users, virtual hosts, permissions and parameters from the running system. Specifically, this change adds the following: * dumping the RabbitMQ definitions (users/vhosts/exchanges/etc) at the end of the deployment; * the ability for the rabbitmq-server OCF script to restore definitions during RabbitMQ startup; * the rabbitmq admin plugin enabled, but restricted to localhost traffic. This reverts Ic01c26200f6019a8112b1c5fb04a282e64b3b3e6 but adds firewall rules to mitigate the issue. DocImpact: The dump_rabbit_definitions task can be used to back up the rabbitmq definitions; if custom definitions (users/vhosts/etc) are created, it must be run, or the changes may be lost during rabbitmq failover via pacemaker. Change-Id: I715f7c2ae527f7e105b9f6b7d82c443e8accf178 Closes-bug: #1383258 Related-bug: #1450443 Co-Authored-By: Alex Schultz <[email protected]>
Previously we were sending the JSON backup data on the command line, which fails when the dataset is large. This change updates the command-line options for curl to pass the filename directly and lets curl handle reading the data. Change-Id: I37f298279beca06df41fb08e1745602976c6a776 Closes-Bug: #1383258
It's really hard to debug when get_status() returns only $OCF_NOT_RUNNING and loses the exit code and error output. Added more logs to avoid this situation. Related-Bug: #1488999 Change-Id: Id0999235d7be688f55799e2952fe22e97b678ce7
W/o this patch, a race condition is possible when there are no running rabbit nodes/resource master. The rabbit nodes will start/stop in an endless loop as a result, introducing full downtime for the AMQP cluster and the cloud control plane. The solution is: * On post-start/post-promote notify, do nothing if either of the following is true: - there are no rabbit resources running, or no master; - the list of rabbit resources being started/promoted is reported empty. * For such cases, do not report a resource failure and delegate recovery, if needed, to the "running out of the cluster" monitor logic. * Additionally, report a last man standing when there are no running rabbit resources around. Closes-bug: #1491306 Change-Id: If1c62fac26b63410636413c49fce55c35e53dc5f Signed-off-by: Bogdan Dobrelya <[email protected]>
The change makes the OCF script ignore a small number of rabbitmqctl timeouts for the 'heavy' operations: list_channels, get_alarms and list_queues. The number of timeouts tolerated in a row is configured through a new variable, 'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl timeouts are not tolerated at all. Bug #1487517 is fixed by extracting the declaration of the local variables 'rc_alarms' and 'rc_queues' from the assignment operations. Text for the Operations Guide: If, on a node where RabbitMQ is deployed, other processes consume a significant part of the CPU, RabbitMQ starts responding slowly to queries by the 'rabbitmqctl' utility. The utility is used by RabbitMQ's OCF script to monitor the state of RabbitMQ. When the utility fails to return within a pre-defined timeout, the OCF script considers RabbitMQ to be down and restarts it, which might lead to a limited (several minutes) OpenStack downtime. Such restarts are undesirable as they cause downtime without benefit. To mitigate the issue, the OCF script can be told to tolerate a certain number of rabbitmqctl timeouts in a row using the following command: crm_resource --resource p_rabbitmq-server --set-parameter max_rabbitmqctl_timeouts --parameter-value N Here N should be replaced with the number of timeouts. For instance, if it is set to 3, the OCF script will tolerate two rabbitmqctl timeouts in a row, but fail when the third one occurs. By default the parameter is set to 1, i.e. a rabbitmqctl timeout is not tolerated at all. The downside of increasing the parameter is that if a real issue occurs which causes rabbitmqctl to time out, the OCF script will detect it only after N monitor runs, so the restart, which might fix the issue, will be delayed. To understand that a RabbitMQ restart was caused by a rabbitmqctl timeout, examine lrmd.log of the corresponding controller on the Fuel master node in the /var/log/docker-logs/remote/ directory. Lines like "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..." indicate a rabbitmqctl timeout. The next line will explain whether it caused a restart or not. For example: "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now." DocImpact: user-guide, operations-guide Closes-Bug: #1479815 Closes-Bug: #1487517 Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
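A minimal sketch of the tolerance counter described above. The real RA persists the count across monitor runs (e.g. in a node attribute); a plain shell variable and an illustrative function name are used here for brevity:

```shell
max_rabbitmqctl_timeouts=3
timeouts_count=0

# Count consecutive timeouts (exit codes 124/137); fail only when the
# counter reaches max_rabbitmqctl_timeouts, reset it on any success.
handle_rabbitmqctl_rc() {
    rc=$1
    if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
        timeouts_count=$((timeouts_count + 1))
        if [ "$timeouts_count" -ge "$max_rabbitmqctl_timeouts" ]; then
            echo "timeout $timeouts_count in a row: failing the monitor"
            return 1
        fi
        echo "rabbitmqctl timed out $timeouts_count of max. $max_rabbitmqctl_timeouts time(s) in a row. Doing nothing for now."
        return 0
    fi
    timeouts_count=0
    return 0
}

handle_rabbitmqctl_rc 124                                  # 1st timeout: tolerated
handle_rabbitmqctl_rc 137                                  # 2nd timeout: tolerated
handle_rabbitmqctl_rc 124 || echo "3rd in a row: monitor fails"
```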
Change get_status to return NOT_RUNNING when beam is not running. Otherwise, pacemaker will get stuck during rabbitmq failover and will not attempt to restart the failed resource. Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30 Closes-bug: #1484280
On notify, if we detect that we are a part of a cluster we still need to start the RabbitMQ application, because it is always down after action_start finishes. Closes-Bug: #1496386 Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
When the data returned from 'rabbitmqctl list_queues' grows a lot and awk sums up all the rows, especially for the memory calculation, it returns the sum in scientific notation (example from the bug was .15997e+09); later, when we want to calculate the memory in MB instead of bytes, the bash division does not accept this string. We can avoid the situation by doing the division into MB in awk itself, since we don't need the memory in bytes anyway. Closes-Bug: #1503331 Change-Id: I38d25406b84d0f70ed62101d5fb5ba108bcab8bd
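The failure mode is easy to reproduce: awk's default output format prints large sums in scientific notation, which bash arithmetic rejects, while converting to MiB inside awk keeps the value integral. The byte counts below are made up:

```shell
# With a plain `print sum`, awk would emit something like 1.5997e+08,
# which bash's $(( ... )) cannot parse. Dividing inside awk with an
# integer printf avoids handing bash a scientific-notation string.
printf '100000000\n59970000\n' \
    | awk '{ sum += $1 } END { printf "%d\n", sum / 1048576 }'
```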
Added a new OCF key, stop_time (corresponding to start_time). Added a wait_sync function which waits up to start_time/2 for the queues on the stopped/demoted node to reach the synced state. Added an optional [-t timeout] to the su_rabbit_cmd function to provide an arbitrary timeout. Change-Id: Iae2211b3d477a9603a58d5eacb12e0fba924861a Closes-Bug: #1464637
Sync upstream changes back to the Fuel downstream. Source: https://github.com/rabbitmq/rabbitmq-server, version stable/fedfefebaa39a0aeb41cf9328ba44c3a458e4614. Related blueprint: upstream-rabbit-ocf Closes-bug: #1473015 Change-Id: Ie19c2f071c53b873a359c6c5134e9498c6391e66 Signed-off-by: Bogdan Dobrelya <[email protected]>
... in the source distribution anymore
And fix local bashisms as a little bonus Upstream patch rabbitmq/rabbitmq-server#374 Related-bug: #1464637 Change-Id: I13189de9f8abce23673c031d11132e495e1972e3 Signed-off-by: Bogdan Dobrelya <[email protected]>
…mq nodes We introduce the OCF_RESKEY_allowed_cluster_nodes parameter, which can be used to specify which nodes of the cluster rabbitmq is expected to run on. When this variable is not set, the resource agent assumes that all nodes of the cluster (the output of crm_node -l) are eligible to run rabbitmq. The use case here is clusters that have a large number of nodes, where only a specific subset is used for rabbitmq (usually this is done with some constraints). Tested in a 9-node cluster as follows:

[root@messaging-0 ~]# pcs resource config rabbitmq
Resource: rabbitmq (class=ocf provider=rabbitmq type=rabbitmq-server-ha)
  Attributes: allowed_cluster_nodes="messaging-0 messaging-1 messaging-2" avoid_using_iptables=true
  Meta Attrs: container-attribute-target=host master-max=3 notify=true ordered=true
  Operations: demote interval=0s timeout=30 (rabbitmq-demote-interval-0s)
              monitor interval=5 timeout=30 (rabbitmq-monitor-interval-5)
              monitor interval=3 role=Master timeout=30 (rabbitmq-monitor-interval-3)
              notify interval=0s timeout=20 (rabbitmq-notify-interval-0s)
              promote interval=0s timeout=60s (rabbitmq-promote-interval-0s)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)

[root@messaging-0 ~]# pcs status |grep -e rabbitmq -e messaging
* Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
...
* Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
  * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-0
  * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-1
  * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-2
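A sketch of the fallback logic described above. The function name is an assumption based on the description, not the agent's verbatim code, and the `crm_node -l` output format ("id name state") is assumed:

```shell
# If allowed_cluster_nodes is set, only those nodes may run rabbitmq;
# otherwise fall back to every pacemaker node reported by `crm_node -l`
# (whose lines are assumed to be "id name state"; the name is field 2).
get_allowed_nodes() {
    if [ -n "${OCF_RESKEY_allowed_cluster_nodes}" ]; then
        echo "${OCF_RESKEY_allowed_cluster_nodes}"
    else
        crm_node -l | awk '{ print $2 }'
    fi
}
```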
Currently every call to unblock_client_access() is followed by a log line showing which function requested the unblocking. When we pass the parameter OCF_RESKEY_avoid_using_iptables=true, it makes no sense to log the unblocking of iptables, since it is effectively a no-op. Let's move that logging inside the unblock_client_access() function, allowing a parameter to log which function called it. Tested on a cluster with rabbitmq bundles with avoid_using_iptables=true and observed no spurious logging any longer:

[root@messaging-0 ~]# journalctl |grep 'unblocked access to RMQ port' |wc -l
0
RABBITMQ_NODE_PORT is exported by default and set to 5672. Re-exporting it in that case will actually break the case where we set up rabbit with TLS on the default port:

2021-02-28 07:44:10.732 [error] <0.453.0> Failed to start Ranch listener {acceptor,{172,17,1,93},5672} in ranch_ssl:listen([{cacerts,'...'},{key,'...'},{cert,'...'},{ip,{172,17,1,93}},{port,5672}, inet,{keepalive,true}, {versions,['tlsv1.1','tlsv1.2']},{certfile,"/etc/pki/tls/certs/rabbitmq.crt"},{keyfile,"/etc/pki/tls/private/rabbitmq.key"}, {depth,1},{secure_renegotiate,true},{reuse_sessions,true},{honor_cipher_order,true},{verify,verify_none},{fail_if_no_peer_cert,false}]) for reason eaddrinuse (address already in use)

This is because by explicitly always exporting it, we force rabbit to listen on that port via TCP, and that is a problem when we want to do SSL on that port. Since 5672 is already the default port, we can just avoid exporting it when the user does not customize the port. Tested both in a non-TLS env (A) and in a TLS env (B) successfully:

(A) Non-TLS
[root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq
[root@messaging-0 /]#
[root@messaging-0 /]# pcs status |grep rabbitmq
* rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-0
* rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-1
* rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-2

(B) TLS
[root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq/ |head -n3
/etc/rabbitmq/rabbitmq.config: {ssl, [{versions, ['tlsv1.1', 'tlsv1.2']}]},
/etc/rabbitmq/rabbitmq.config: {ssl_listeners, [{"172.17.1.48", 5672}]},
/etc/rabbitmq/rabbitmq.config: {ssl_options, [
[root@messaging-0 ~]# pcs status |grep rabbitmq
* rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-0
* rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-1
* rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha): Master messaging-2

Note: I don't believe we should export RABBITMQ_NODE_PORT at all, since you can specify all ports in the rabbit configuration anyway, but I prefer to play it safe here as folks might rely on being able to customize this. Signed-off-by: Michele Baldessari <[email protected]>
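The conditional export described above can be sketched as follows. The function name is illustrative; the variable names follow the OCF convention as an assumption about the surrounding script:

```shell
DEFAULT_NODE_PORT=5672

# Only export the port when the user overrides the default; re-exporting
# 5672 would force a plain TCP listener there and clash with a TLS
# listener bound to the same default port.
maybe_export_node_port() {
    if [ -n "${OCF_RESKEY_node_port}" ] \
       && [ "${OCF_RESKEY_node_port}" != "${DEFAULT_NODE_PORT}" ]; then
        export RABBITMQ_NODE_PORT="${OCF_RESKEY_node_port}"
    fi
}
```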
Upgrade Lager to 3.9 for OTP 24 compatibility
In newer Erlang, beam.smp no longer writes a pidfile until the rabbit application starts. It also no longer passes the -mnesia dir and -sname options, which are required in order to start the node only, delaying the application start-up. Handle that so the Pacemaker HA setup keeps working with newer Erlang and rabbitmq-server versions. Fix '[ x == x ]' bashisms as well, to silence errors in the RA logs. Signed-off-by: Bogdan Dobrelya <[email protected]>
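The bashism in question: `==` inside `[ ]` is a bash extension, and a POSIX `sh` such as dash logs an error for it. The variable below is illustrative:

```shell
node_name="rabbit@host1"

# Bashism, may log "unexpected operator" when the RA runs under dash:
#   [ "${node_name}" == "rabbit@host1" ]

# POSIX-compliant form uses a single `=`:
if [ "${node_name}" = "rabbit@host1" ]; then
    echo "name matches"
fi
```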
Fuel for OpenStack originated the rabbitmq OCF RA. Restore the history of changes for it. This commit is empty and only sets a milestone for its Fuel history, ending at Tue May 10 15:27:53 2016. Signed-off-by: Bogdan Dobrelya <[email protected]>
Every time we recompile the erlang/elixir/rebar/rabbitmq stack, there are one or more fresh new warnings that will completely trip up any parsing of these commands. Most end up being bugs that get fixed later on [1]. Since stderr is rarely interesting and just holds any rebase up, let's ignore it when running these rabbitmqctl commands. [1] https://elixirforum.com/t/mix-local-hex-warning-authenticity-is-not-established-by-certificate-path-validation/39665 Authored-by: Michele Baldessari <[email protected]> Signed-off-by: Bogdan Dobrelya <[email protected]>
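The change amounts to discarding stderr on such calls. To keep the sketch runnable, `fake_ctl` below stands in for rabbitmqctl:

```shell
# fake_ctl emulates a rabbitmqctl call that mixes a warning on stderr
# with the output we actually want to parse on stdout.
fake_ctl() {
    echo "warning: some fresh compiler warning" >&2   # noise to drop
    echo "channel1"                                   # output to parse
}

# Discarding stderr means stray warnings cannot corrupt the parsed output.
out=$(fake_ctl 2>/dev/null)
echo "$out"
```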
Can one of the admins verify this patch?
Moving it from repo: https://github.com/rabbitmq/rabbitmq-server
The original path: scripts/rabbitmq-server-ha.ocf

Also preserve the list of authors and the change history since its very initial commit in the Fuel for OpenStack project, now archived: https://github.com/openstack-archive/fuel-library

To get the history use:
$ git log --follow heartbeat/rabbitmq-server-ha.ocf

Reasoning behind: the OCF RA script provides an M/S HA pacemaker resource for a RabbitMQ cluster and better fits this place.

Background
==========
It's been actively maintained for years. And now it needs a new home due to requests of the RabbitMQ team, since it is no longer possible to run CI tests for changes proposed against it in the old location. The TripleO upstream project and its layered RH OSP product have plans to adopt this OCF RA for their use. That guarantees future maintenance and support for it.

How it works
============
Documentation is kept at its original upstream location: https://www.rabbitmq.com/pacemaker.html#auto-pacemaker

Future Plans
============
Once it is there, the package builds for RDO will catch up with changes for this OCF RA and make sure it's CI'ed, also in TripleO and OSP.

Status Quo
==========
Until the adoption completes, I'm planning to test changes proposed via this new location in my fork, with github actions, like [0]. The CI runs on pre-built images [1] and vagrant scripts [2] that I maintain for the (more or less) recent Pacemaker and RabbitMQ builds. The test coverage includes a simple cluster assemble smoke test and a sophisticated Jepsen testcase that verifies auto-healing of the cluster resource in Pacemaker managed by this OCF RA.

[0] https://github.com/bogdando/rabbitmq-server/runs/3757495446
[1] https://hub.docker.com/r/bogdando/rabbitmq-cluster-ocf
[2] https://github.com/bogdando/rabbitmq-cluster-ocf-vagrant

Signed-off-by: Bogdan Dobrelya <[email protected]>
Now it works as expected, see an example commit from history
Please note that this PR intentionally includes commits all the way back to 2015, to preserve authored changes and link authors whenever possible
Looks good in general.
I've added some comments on what I think should be improved.
You should also check which parts of the INFO logging should stay at INFO, and which can be changed to DEBUG (to avoid spamming the logs too much).
Consider switching the license to the same as the other agents, unless you have a good reason for using the current license:
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc#licensing
heartbeat/rabbitmq-server-ha.ocf
local LH="${LL} my_host():"
hostname=$(process_fqdn $(get_hostname))
ocf_log info "${LH} hostlist is: $hostlist"
This should be debug, so we don't spam it to the logs unless we want it.
heartbeat/rabbitmq-server-ha.ocf
# remove all temporary RMQ blocking rules, if there are more than one exist
for i in $(iptables -nvL --wait --line-numbers | awk '/temporary RMQ block/ {print $1}'); do
iptables --wait -D INPUT -p tcp -m tcp --dport ${OCF_RESKEY_node_port} -m state --state NEW,RELATED,ESTABLISHED \
-m comment --comment 'temporary RMQ block' -j REJECT --reject-with tcp-reset
Indent one step more to make it more visible that it's part of the line above.
elif [ -n "${alarms}" ]; then
for node in ${alarms}; do
name=`echo ${node} | perl -n -e "m/memory,'(?<n>\S+)+'/ && print \"$+{n}\n\""`
We should use sed or awk here.
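A possible sed replacement for the perl one-liner, assuming an alarms entry looks like `{memory,'rabbit@node-1'}` (an assumption based on the regexp being replaced):

```shell
# Extract the node name between the single quotes after "memory,".
node="{memory,'rabbit@node-1'}"
name=$(echo "${node}" | sed -n "s/.*memory,'\([^']*\)'.*/\1/p")
echo "${name}"   # rabbit@node-1
```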
Good point, I'd fix that in a follow-up. Since that might regress the agent code and requires an extra run of functional testing, I'd run it in my forked repo with github actions once I point that at the "new home".
@jeckersb PTAL
@oalbrigt thank you for reviewing! Do you think it would make sense to land this first, then address your comments with small patches, which I could test on my own, as individual changes?
We can just fix it here, as I'm in no hurry to merge it, so we can leave it open for a week or three.
It's good to go. I tested it with the github workflows set up for my fork; it passed smoke and jepsen without regressions (in my test suite jepsen always reports failures because messages are expectedly lost, so that's the expected result)