Commit
scylla_cluster: fix handling of wait_other_notice
After starting a multi-node cluster, it's important to wait until all nodes are aware that all other nodes are available. Otherwise, if the user sends a CL=ALL request to one node, that node might not be aware that the other replicas are usable, and fail the request.

For this reason a cluster's start() has a wait_other_notice=True option, and dtests correctly use it. However, the implementation doesn't wait for the right thing. To check if node A thinks that B is available, it checks that A's log contains the message "InetAddress B is now UP". But this message is unreliable: when it is printed, A still doesn't think that B is fully available, and it can still think that B is in a "joining" state for a while longer. If wait_other_notice returns at this point, and the user sends a CL=ALL request to node A, it will fail.

The solution I propose in this patch uses the REST API, instead of the log, to wait until node A thinks node B is both live and finished joining.

This patch is needed if Scylla is modified to boot up faster. We start seeing dtests which use CL=ALL in the beginning of a test failing, because the node we contact doesn't know that the other nodes are usable.

Fixes #461

Signed-off-by: Nadav Har'El <[email protected]>
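
To illustrate the REST-based wait described above, here is a minimal sketch, not the code from this patch. It assumes node A's REST API exposes the endpoints `/gossiper/endpoint/live` and `/storage_service/nodes/joining`, each returning a JSON list of IP addresses; those paths, the helper names `fetch_json`, `sees_as_available` and `wait_until_available`, and the timeout defaults are all assumptions chosen for illustration.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=5):
    """GET a URL and decode the JSON body (assumed to be a list of IPs)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def sees_as_available(api_base, other_ip, fetch=fetch_json):
    """True if the node behind api_base thinks other_ip is live AND done joining.

    The two endpoint paths are assumptions about the node's REST API.
    """
    live = fetch(api_base + "/gossiper/endpoint/live")
    joining = fetch(api_base + "/storage_service/nodes/joining")
    return other_ip in live and other_ip not in joining


def wait_until_available(api_base, other_ips, timeout=120.0, interval=0.1,
                         fetch=fetch_json):
    """Poll until the node sees every address in other_ips as available."""
    deadline = time.monotonic() + timeout
    pending = set(other_ips)
    while True:
        try:
            pending = {ip for ip in pending
                       if not sees_as_available(api_base, ip, fetch)}
        except OSError:
            # The REST API may not be listening yet; keep the old pending set.
            pass
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("still not seen as available: %s" % sorted(pending))
        time.sleep(interval)
```

A caller would run something like `wait_until_available("http://127.0.0.1:10000", ["127.0.0.2", "127.0.0.3"])` for each node after starting the cluster; the address and port here are placeholders. Passing `fetch` as a parameter keeps the polling logic testable without a running cluster.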