scylla_cluster: fix handling of wait_other_notice
After starting a multi-node cluster, it's important to wait until all nodes
are aware that all other nodes are available; otherwise, if the user sends a
CL=ALL request to one node, that node may not yet know that the other
replicas are usable, and will fail the request.
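The failure mode can be shown with a toy model of the coordinator's membership view (the names below are hypothetical illustration, not the ccm or driver API): a CL=ALL request can only be served if the coordinator believes every replica is live *and* fully joined.

```python
# Toy model, hypothetical names: a coordinator serves a CL=ALL request
# only when it considers every replica usable (live and not "joining").
def can_serve_cl_all(replicas, live, joining):
    usable = set(live) - set(joining)
    return set(replicas).issubset(usable)

# Node A has seen B come "UP", but still thinks B is joining:
print(can_serve_cl_all({"A", "B", "C"}, {"A", "B", "C"}, {"B"}))  # False

# Once B finishes joining, the same request can be served:
print(can_serve_cl_all({"A", "B", "C"}, {"A", "B", "C"}, set()))  # True
```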

For this reason a cluster's start() has a wait_other_notice=True option,
and dtests correctly use it. However, the implementation doesn't wait
for the right thing... To check if node A thinks that B is available,
it checks that A's log contains the message "InetAddress B is now UP".
But this message is unreliable - when it is printed, A still doesn't
think that B is fully available - it can still think that B is in a
"joining" state for a while longer. If wait_other_notice returns at
this point, and the user sends a CL=ALL request to node A, it will fail.

The solution I propose in this patch uses the REST API, instead of the
log, to wait until node A thinks node B is both live and finished
joining.
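Concretely, the patch polls two REST endpoints on node A until every target address appears in the live set and has left the joining set. A minimal standalone sketch of that loop (endpoint paths and port as in the patch; the function names are hypothetical, and stdlib `urllib` stands in for the `requests` calls the patch actually uses):

```python
import json
import time
from urllib.request import urlopen  # the patch itself uses `requests`

def operational_view(live, joining):
    """A node considers an address fully usable when it is live AND
    no longer in the "joining" set."""
    return set(live) - set(joining)

def wait_nodes_operational(host, addresses, timeout=120, port=10000):
    """Poll `host`'s ScyllaDB REST API until it sees every address in
    `addresses` as live and finished joining (sketch of the patch's
    watch_rest_for_alive; names here are hypothetical)."""
    tofind = set(addresses)
    deadline = time.time() + timeout
    while time.time() < deadline:
        live = json.load(urlopen(f"http://{host}:{port}/gossiper/endpoint/live"))
        joining = json.load(urlopen(f"http://{host}:{port}/storage_service/nodes/joining"))
        if tofind <= operational_view(live, joining):
            return
        time.sleep(0.1)
    raise TimeoutError(f"{tofind} not fully up on {host} after {timeout}s")
```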

This patch is needed if Scylla is modified to boot up faster: we then start
seeing dtests that send CL=ALL requests at the beginning of a test fail,
because the node they contact doesn't yet know that the other nodes are
usable.

Fixes #461

Signed-off-by: Nadav Har'El <[email protected]>
nyh committed May 28, 2023
1 parent 2cc25cf commit 78fe133
Showing 2 changed files with 31 additions and 2 deletions.
2 changes: 1 addition & 1 deletion ccmlib/scylla_cluster.py
@@ -146,7 +146,7 @@ def start_nodes(self, nodes=None, no_wait=False, verbose=False, wait_for_binary_
         for old_node, mark in marks:
             for node, _, _ in started:
                 if old_node is not node:
-                    old_node.watch_log_for_alive(node, from_mark=mark)
+                    old_node.watch_rest_for_alive(node)
 
         return started

31 changes: 30 additions & 1 deletion ccmlib/scylla_node.py
@@ -18,6 +18,7 @@
 import yaml
 import glob
 import re
+import requests
 
 from ccmlib.common import CASSANDRA_SH, BIN_DIR, wait_for, copy_directory

@@ -309,7 +310,7 @@ def _start_scylla(self, args, marks, update_pid, wait_other_notice,
         if wait_other_notice:
             for node, mark in marks:
                 t = timeout if timeout is not None else 120 if self.cluster.scylla_mode != 'debug' else 360
-                node.watch_log_for_alive(self, from_mark=mark, timeout=t)
+                node.watch_rest_for_alive(self, timeout=t)
 
         if wait_for_binary_proto:
             t = timeout * 4 if timeout is not None else 420 if self.cluster.scylla_mode != 'debug' else 900
@@ -1335,6 +1336,34 @@ def upgrade(self, upgrade_to_version):
     def rollback(self, upgrade_to_version):
         self.upgrader.upgrade(upgrade_version=upgrade_to_version, recover_system_tables=True)
 
+    def watch_rest_for_alive(self, nodes, timeout=120):
+        """
+        Use the REST API to wait until this node detects that the nodes listed
+        in "nodes" become fully operational (live and no longer "joining").
+        This is similar to watch_log_for_alive but uses ScyllaDB's REST API
+        instead of the log file and waits for the nodes to be really usable,
+        not just "UP" (see issue #461).
+        """
+        tofind = nodes if isinstance(nodes, list) else [nodes]
+        tofind = set([node.address() for node in tofind])
+        url_live = f"http://{self.address()}:10000/gossiper/endpoint/live"
+        url_joining = f"http://{self.address()}:10000/storage_service/nodes/joining"
+        endtime = time.time() + timeout
+        while time.time() < endtime:
+            live = set()
+            response = requests.get(url=url_live)
+            if response.text:
+                live = set(response.json())
+            response = requests.get(url=url_joining)
+            if response.text:
+                live = live - set(response.json())
+            if tofind.issubset(live):
+                # This node thinks that all given nodes are alive and not
+                # "joining", so we're done.
+                return
+            time.sleep(0.1)
+        raise TimeoutError(f"watch_rest_for_alive() timeout after {timeout} seconds")
+
     @property
     def gnutls_config_file(self):
         candidates = [
