
Stop of rabbit app within start_rmq_server_app (OCF rabbitmq-server-ha) #1833

Open

f-schie opened this issue Jan 11, 2023 · 5 comments


f-schie commented Jan 11, 2023

Hi,

In the OCF agent rabbitmq-server-ha, I don't understand why the function stop_rmq_server_app is called right after a successful start.

stop_rmq_server_app

As seen in the snippet below, why would I want to stop the RMQ server app when it has just been started successfully as master of the cluster:

    if [ $rc -eq $OCF_SUCCESS ] ; then
        # rabbitmq-server started successfuly as master of cluster
        master_score $MIN_MASTER_SCORE
        stop_rmq_server_app
        rc=$?
        if [ $rc -ne 0 ] ; then
            ocf_log err "${LH} RMQ-server app can't be stopped. Beam will be killed."
            kill_rmq_and_remove_pid
            unblock_client_access "${LH}"
            return $OCF_ERR_GENERIC
        fi

Clearly I am missing something, could someone please explain why it is done this way? @bogdando maybe can you help me out here?

We are using the OCF rabbitmq-server-ha agent within a three-node Pacemaker cluster and experience slow starts and a somewhat strange master election of the RabbitMQ master (a newly booted node tears down the active master and starts its own promotion...).

f-schie changed the title from Stop of rabbit app within start_rmq_server_app to Stop of rabbit app within start_rmq_server_app (OCF rabbitmq-server-ha) Jan 11, 2023
bogdando (Contributor) commented Jan 11, 2023

Firstly, thank you for using this agent and taking care of its health!

In the repository from which this OCF agent originates (now in openstack-archive), there was a related change with a corresponding Gerrit change. Some related LP bugs are linked in the commit message for more context.

For the record: setting master_score 1 (the minimal positive master score for a node) means the application is stopping on a non-master node (all of them do this). The master normally takes master_score 1000, and a node which should never be promoted takes score 0.
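To illustrate the scoring scheme (a minimal sketch, not the agent's actual code): master_score in agents like this is typically a thin wrapper around Pacemaker's crm_master CLI. Here crm_master is stubbed with a shell function so the snippet runs without a cluster, and BEST_MASTER_SCORE is an assumed name for the 1000 value mentioned above.

```shell
#!/bin/sh
# Sketch only: crm_master is stubbed; in a real agent it is Pacemaker's CLI tool.
MIN_MASTER_SCORE=1      # minimal positive score: healthy non-master
BEST_MASTER_SCORE=1000  # assumed name for the master's score

crm_master() { echo "crm_master $*"; }               # stub standing in for the real binary

master_score() { crm_master -Q -l reboot -v "$1"; }  # typical wrapper shape in OCF agents

master_score "$BEST_MASTER_SCORE"   # the promoted master
master_score "$MIN_MASTER_SCORE"    # a non-master node
master_score 0                      # a node that must never be promoted
```

Pacemaker then promotes whichever node currently advertises the highest master score.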

So, as the follow-up fix clarifies, the app is started and then stopped only as a test of whether it can be started "for real". There were corner cases where the application reports it has started but is in fact not functioning properly; the linked LP bug explains that in detail. FWIW, we want to make sure the app can be stopped without errors after we have started it. If it cannot, the Mnesia DB is cleaned up, so that the next time Pacemaker runs monitor or processes other events, the app should start without problems (most likely!)
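The start-then-test-stop logic can be sketched roughly as follows. This is a hedged sketch with stubbed helpers: try_start, try_stop and cleanup are hypothetical stand-ins for the agent's try_to_start_rmq_app, stop_rmq_server_app and the kill/Mnesia-cleanup path.

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

try_start() { echo "rabbitmqctl start_app"; }   # stub: start reports success
try_stop()  { echo "rabbitmqctl stop_app"; }    # stub: stop succeeds
cleanup()   { echo "kill beam + clean Mnesia"; }

start_and_verify() {
    try_start || return "$OCF_ERR_GENERIC"
    # Start reported success, but verify the app is really healthy by
    # checking that it can also be stopped cleanly.
    if ! try_stop; then
        cleanup   # so the next monitor/start attempt begins from a clean state
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}

start_and_verify && echo "start verified"
```

The point of the extra stop is exactly this verification step: a node whose app cannot even be stopped cleanly is treated as broken and reset rather than promoted.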

bogdando (Contributor)

By the way, there is some automation around customized Jepsen tests, which I used to run from time to time in a fork of the rabbitmq-server repo via GitHub Actions.

It always reassembled the cluster after the network partitions caused by the testing framework, which allowed the tests to complete. I no longer maintain that automation and fork, as we moved the script from the rabbitmq-server repo to this new home. Having that Jepsen CI around here could be a good idea...

bogdando (Contributor)

newly booted node tears down active master and starts its own promotion

This could be a valid issue, and it would also explain the suboptimal Jepsen testing results (many pending messages).

f-schie (Author) commented Jan 11, 2023

Thanks for the quick reply!
So if I understand correctly, it is OK (and partially expected) to have a scenario like this:

  1. A restart of msRabbitMQ is initiated via Pacemaker
  2. RabbitMQ is stopped via action_stop() (which calls stop_server_process())
  3. After everything has stopped successfully, action_start() invokes start_rmq_server_app()
  4. start_rmq_server_app() starts the RMQ server app via try_to_start_rmq_app()
  5. Within it, /usr/sbin/rabbitmqctl start_app is called, and even if it succeeds, stop_app is executed to verify correct stop/start behavior
  6. If stop_app fails, the Mnesia DB is reset and the whole process is repeated
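That cycle can be condensed into a stubbed walk-through. The function names echo the agent's real ones as quoted in this thread, but every body here is a placeholder, and reset_mnesia is a hypothetical stand-in for the agent's cleanup path.

```shell
#!/bin/sh
# Stubbed walk-through of the restart cycle; all bodies are placeholders.
stop_server_process()  { echo "beam stopped"; }
action_stop()          { stop_server_process; }

try_to_start_rmq_app() { echo "rabbitmqctl start_app"; }
verify_stop()          { echo "rabbitmqctl stop_app"; }   # the test-stop from step 5
reset_mnesia()         { echo "mnesia reset"; }           # step 6, only on failure

start_rmq_server_app() {
    try_to_start_rmq_app || return 1
    verify_stop || { reset_mnesia; return 1; }
}
action_start()         { start_rmq_server_app; }

action_stop && action_start
```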

Resetting the Mnesia DB means losing all durable exchanges/queues and the data in them, doesn't it? If there is a cluster outage because of power loss, does that mean all data that has not yet been processed is lost upon recovery?

Regarding my other scenario:

newly booted node tears down active master and starts its own promotion

I need to investigate this further. The remaining node receives a notify followed by a demote action from the DC as soon as the "old" master reboots.

Thanks for the link to the automation repo - I'll check this out!

f-schie closed this as completed Jan 11, 2023
f-schie reopened this Jan 11, 2023
bogdando (Contributor) commented Jan 11, 2023

Resetting the Mnesia DB is the standard handling for unrecoverable start/stop/join (and similar) failures. When using HA (mirrored) queues or Raft-based quorum queues (which also require durable queues), the data loss can perhaps be minimized.
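To make the data-loss point concrete, here is an illustration only: the temporary directory stands in for the node's real Mnesia directory (commonly somewhere like /var/lib/rabbitmq/mnesia), and a real reset goes through rabbitmqctl, which also clears the node's cluster membership.

```shell
#!/bin/sh
# Illustration only: what a Mnesia reset amounts to on disk for one node.
MNESIA_DIR=$(mktemp -d)                 # stand-in for the node's Mnesia directory
touch "$MNESIA_DIR/durable_queue.dat"   # durable queue/exchange definitions persist here
rm -rf "$MNESIA_DIR"/*                  # reset: this node's durable data is gone
ls -A "$MNESIA_DIR"                     # directory is now empty
rmdir "$MNESIA_DIR"
```

With mirrored (HA) or quorum queues, replicas on the other cluster nodes still hold the data, which is why a single node's reset need not lose messages cluster-wide.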
