
scylla_cluster: fix handling of wait_other_notice #462

Merged: 3 commits into scylladb:next from better-wait-other-notice-2 on Jul 19, 2023

Conversation

@nyh (Contributor) commented on May 28, 2023

After starting a multi-node cluster, it's important to wait until all nodes are aware that all other nodes are available, otherwise if the user sends a CL=ALL request to one node, it might not be aware that the other replicas are usable, and fail the request.

For this reason a cluster's start() has a wait_other_notice=True option, and dtests correctly use it. However, the implementation doesn't wait for the right thing... To check if node A thinks that B is available, it checks that A's log contains the message "InetAddress B is now UP". But this message is unreliable - when it is printed, A still doesn't think that B is fully available - it can still think that B is in a "joining" state for a while longer. If wait_other_notice returns at this point, and the user sends a CL=ALL request to node A, it will fail.

The solution I propose in this patch uses the REST API, instead of the log, to wait until node A thinks node B is both live and finished joining.
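For illustration, here is a minimal sketch of the idea (a hedged reconstruction, not the merged code: the helper name follows the PR, but the exact endpoint paths and REST port 10000 are assumptions based on Scylla's REST API):

import time
import requests

def watch_rest_for_alive(node, others, timeout=120):
    # Poll `node`'s REST API until it reports every address in `others`
    # as live AND no longer in the "joining" state.
    api = f"http://{node.address()}:10000"
    tofind = set(other.address() for other in others)
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Addresses this node's gossiper currently considers live.
        live = set(requests.get(f"{api}/gossiper/endpoint/live").json())
        # Peers that are still bootstrapping ("joining") don't count.
        joining = set(requests.get(f"{api}/storage_service/nodes/joining").json())
        if tofind.issubset(live - joining):
            return
        time.sleep(0.1)
    raise TimeoutError(f"{node.name} did not see {sorted(tofind)} as up in {timeout}s")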

This patch is needed if Scylla is modified to boot up faster. We start seeing dtests which use RF=ALL in the beginning of a test failing, because the node we contact doesn't know that the other nodes are usable.

Fixes #461

@nyh requested review from @fruch and @eliransin on May 28, 2023 14:55
Review threads: ccmlib/scylla_cluster.py (outdated, resolved), ccmlib/scylla_node.py (outdated, resolved)
@fruch (Contributor) commented on May 28, 2023

@nyh it seems there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

@nyh (Contributor, Author) commented on May 28, 2023

> @nyh it seems there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

Do we run dtests also on Cassandra? If we do, my new function won't work there :-(

@fruch (Contributor) commented on May 28, 2023

> @nyh it seems there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged
>
> Do we run dtests also on Cassandra? If we do, my new function won't work there :-(

we could run with Cassandra, but we don't do that regularly

We could overload watch_log_for_alive to always go to watch_rest_for_alive if it's Scylla; then tests would still work with no change needed (see the sketch below).
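Something like this hypothetical override (assuming ScyllaNode subclasses ccm's Node and that the watch_rest_for_alive helper from this PR is available; the signature mirrors ccm's log-based method):

from ccmlib.node import Node

class ScyllaNode(Node):
    def watch_log_for_alive(self, nodes, from_mark=None, timeout=720, filename='system.log'):
        # On Scylla, the REST API gives a more reliable "fully usable"
        # signal than the "is now UP" log line, so redirect log-based
        # callers to the REST-based wait.
        self.watch_rest_for_alive(nodes, timeout=timeout)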

@nyh force-pushed the better-wait-other-notice-2 branch from 78fe133 to a5d16bf on May 29, 2023 11:36
@nyh (Contributor, Author) commented on May 29, 2023

Pushed a new version with that "_" thing.
Didn't change the sleep from 0.1 seconds - I can change it to anything, but I don't know what to pick...

@nyh force-pushed the better-wait-other-notice-2 branch from 11acc47 to 3d405ca on June 6, 2023 10:06
@nyh (Contributor, Author) commented on Jun 6, 2023

Pushed a new version; the REST API now also waits for nodes to know other nodes' tokens.
With this patch, when dtest is modified to disable "wait for gossip to settle", we're down from 170 failures when I started to "just" 38.

@nyh (Contributor, Author) commented on Jun 13, 2023

Ping.

@fruch (Contributor) left a review:

LGTM

@fruch (Contributor) commented on Jun 13, 2023

@bhalevy, as someone who refactored this a few times, care to also give it a look?

Review thread on ccmlib/scylla_node.py (outdated):
for n in check:
    # Ask this node which tokens it knows for peer n; an empty JSON
    # list means n's tokens haven't been gossiped here yet, so keep
    # waiting for n.
    response = requests.get(url=url_tokens+n)
    if response.text == '[]':
        tofind.add(n)
@bhalevy (Member) commented:

I wonder how this would work with topology-related tests that currently track the log to inject some event during e.g. bootstrap or replace.
Although they can pass wait_other_notice=False and implement their own state tracking, they might break with this change.

@nyh (Contributor, Author) replied:

I don't know what kind of tests you mean, but we can see if any tests start failing with this patch.
I think that if a certain test wants to capture some exact moment in the middle of the boot process, it can't ask ccm to wait until the boot is over.

@fruch (Contributor) commented on Jun 26, 2023

@nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.

And then we could merge it

@mykaul (Contributor) commented on Jul 6, 2023

@eliransin - do you need to review this to get this moving forward?

@fruch (Contributor) commented on Jul 6, 2023

> @nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.
>
> And then we could merge it

running this with a full suite of dtest, to verify it's not causing regressions (or not too many of them)
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/

@fruch (Contributor) commented on Jul 6, 2023

> @eliransin - do you need to review this to get this moving forward?

It's a change that affects all of the tests; I would want it tested first (@nyh has tested it so far only with 'skip_wait_for_gossip_to_settle': 0).

nyh added 3 commits on July 6, 2023 16:35
After starting a multi-node cluster, it's important to wait until all nodes
are aware that all other nodes are available, otherwise if the user sends a
CL=ALL request to one node, it might not be aware that the other replicas
are usable, and fail the request.

For this reason a cluster's start() has a wait_other_notice=True option,
and dtests correctly use it. However, the implementation doesn't wait
for the right thing... To check if node A thinks that B is available,
it checks that A's log contains the message "InetAddress B is now UP".
But this message is unreliable - when it is printed, A still doesn't
think that B is fully available - it can still think that B is in a
"joining" state for a while longer. If wait_other_notice returns at
this point, and the user sends a CL=ALL request to node A, it will fail.

The solution I propose in this patch uses the REST API, instead of the
log, to wait until node A thinks node B is both live and finished
joining.

This patch is needed if Scylla is modified to boot up faster. We start
seeing dtests which use RF=ALL in the beginning of a test failing,
because the node we contact doesn't know that the other nodes are usable.

Fixes scylladb#461

Signed-off-by: Nadav Har'El <[email protected]>
In the previous patch, we introduced the watch_rest_for_alive() function
that uses the REST API to wait for node A to be able to use node B.
We waited for node A to consider node B "alive" and "not joining" (in
nodetool status, this is called the "UN" state), but for some tests this
is not enough: if the test sends A a read with CL=ALL, it needs to
know B's tokens to send it one of the reads. It turns out that node
A can think that B is alive before B has gossiped its tokens.

So in this patch we modify watch_rest_for_alive() to use the REST API
to **also** ensure that node A is aware of B's tokens.

An example dtest that is fixed by this patch is cdc_test.py::
TestCdcWithCompactStorage::test_artificial_column_with_type_empty_is_missing

Signed-off-by: Nadav Har'El <[email protected]>
When starting a node with "wait_other_notice", the existing code waited
until all other nodes think the new node is ready. However, it's safer
to also check that the new node thinks all the other nodes are ready.
It should usually be the case (the new node gets information on all
other nodes in the first gossip), but it doesn't strictly have to be
true, so better to ensure that it is.

Signed-off-by: Nadav Har'El <[email protected]>
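The token check described in the second commit message above boils down to the hunk quoted in the review thread earlier; here is a hedged sketch of the surrounding logic (the /storage_service/tokens/{endpoint} path and the variable names are assumptions matching that hunk):

# Once the live/joining check passes, also require that this node
# knows each peer's tokens before declaring the cluster usable.
url_tokens = f"http://{node.address()}:10000/storage_service/tokens/"
check, tofind = tofind, set()
for n in check:
    response = requests.get(url=url_tokens + n)
    if response.text == '[]':  # empty list: n's tokens not gossiped here yet
        tofind.add(n)
if not tofind:
    return  # every peer is live, joined, and has known tokens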
@nyh force-pushed the better-wait-other-notice-2 branch from 3d405ca to c12ded5 on July 6, 2023 13:36
@nyh (Contributor, Author) commented on Jul 6, 2023

Pushed a new version - rebased, and changed {} to set() as @bhalevy noticed.

@nyh (Contributor, Author) commented on Jul 6, 2023

> @eliransin - do you need to review this to get this moving forward?
>
> It's a change that affects all of the tests; I would want it tested first (@nyh has tested it so far only with 'skip_wait_for_gossip_to_settle': 0).

I think you did this above, but I don't know how to read the results (if I understood correctly, we can't expect zero failures?).

@fruch (Contributor) commented on Jul 6, 2023

> @nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.
> And then we could merge it
>
> running this with a full suite of dtest, to verify it's not causing regressions (or not too many of them) https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/

@nyh only a handful of failures.

Some are hitting the 2-minute timeout of the newly introduced wait function.
I'm running now only the gating with it:
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/315/

@fruch (Contributor) commented on Jul 6, 2023

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/junit/bootstrap_test/TestBootstrap/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split018___test_seeds_on_duty/

this test expects the boot to fail; the test should enable wait_other_notice

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/replace_address_test/TestReplaceAddress/Run_Dtest_Parallel_Cloud_Machines___HeavyDtest___heavy_split001___test_replace_node_diff_ip_take_write_use_host_id_rbo_disabled_/

fails a bit like failures in debug runs...

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/compaction_additional_test/TestTimeWindowDataSegregation/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split006___test_twcs_multiple_sstables_during_bootstrap/

missing data in node3, doesn't fail like that on master runs

failing as in master:
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/repair_based_node_operations_test/TestRepairBasedNodeOperations/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split019___test_ignore_dead_nodes_for_replace_option_2/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/compaction_additional_test/TestTimeWindowDataSegregation/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split021___test_enable_disable_optimized_query_for_twcs/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_decommission_after_changing_node_ip_2/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/cqlsh_tests.cqlsh_copy_tests/TestCqlshCopy/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_copy_to_with_child_process_crashing/

@fruch (Contributor) commented on Jul 11, 2023

@nyh gating passed o.k.

So we are left with 3 tests that seem to regress (and we didn't test enterprise features, even though most of them are single node).

@bhalevy @eliransin, it's your call: are we o.k. with fixing those tests later, or should this be blocked until someone is able to attend to those?

@bhalevy (Member) commented on Jul 11, 2023

I don't see how it would be ok to introduce regressions

@nyh (Contributor, Author) commented on Jul 11, 2023

> I don't see how it would be ok to introduce regressions

It's not ok, but to be honest, I don't have the peace of mind to fix random buggy dtests while I keep getting requests to do other things at top priority, so I'm not going to fix this any time soon.

@roydahan (Contributor) commented:
I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/

The 3 test failures @fruch mentioned are directly related, but they look quite easy to fix, so we can assign someone to fix them.

However, this PR by itself won't improve runtime, and may possibly even degrade it.

@nyh (Contributor, Author) commented on Jul 17, 2023

> I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/
>
> The 3 test failures @fruch mentioned are directly related, but they look quite easy to fix, so we can assign someone to fix them.

Thanks.

> However, this PR by itself won't improve runtime, and may possibly even degrade it.

The idea is to have a dtest PR which runs Scylla without huge sleeps. The branch is https://github.com/nyh/scylla-dtest/commits/quick-boot-2. There is a one-line patch https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c plus one fix for one test's bug - https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c. There will need to be more patches, because a few dozen tests still fail (not as many as before the ccm patches, but still a non-zero list).

The next step can be to make the dtest patch an optional parameter (so developers can use it), but @fruch and @eliransin suggested that it can be the default, and we would just add a tag for tests that can't work in this "sleepless" mode, so that only they will be run in the slow mode.

@fruch (Contributor) commented on Jul 18, 2023

I'm handling the tests that need to be fixed; it looks like the issue is identical in all 3 tests - we need to not use wait_other_notice, and that solves it.

So this last run had 8 failures; 6 of them would be solved by https://github.com/scylladb/scylla-dtest/pull/3313, and the other 2 are identical to master failures (which means we have only 2 failures on master ????)

@fruch (Contributor) commented on Jul 18, 2023

> I'm handling the tests that need to be fixed; it looks like the issue is identical in all 3 tests - we need to not use wait_other_notice, and that solves it.
>
> So this last run had 8 failures; 6 of them would be solved by scylladb/scylla-dtest#3313, and the other 2 are identical to master failures (which means we have only 2 failures on master ????)

I was skeptical, so I went to check:
https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/306/testReport/

for the last 3 runs, we have only 2 failing tests

@roydahan (Contributor) left a review:

Approved from my side.
We can merge this and the PR for dtest fixes.

fruch merged commit 65872f6 into scylladb:next on Jul 19, 2023. 1 check passed.
@kbr-scylla commented:
@nyh This is incorrect. It may fix some problems but introduces others

Consider the following:

  • 2 nodes A, B
  • kill node A
  • start node A with wait_other_notice=True

The start call will often return quickly, but not because node B noticed that A restarted
but because node B didn't even notice that A was killed yet!

That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this node's restart!

We're actually now observing this scenario trying to unflake one dtest with @temichus (https://github.com/scylladb/scylla-dtest/pull/3991)...
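For contrast, the log-marker pattern avoids this race because the marker is taken before the restart, so only messages printed afterwards can satisfy the wait. A hedged sketch using ccm's mark_log/watch_log_for_alive API (node and variable names are illustrative):

# Take log positions on the *other* nodes before restarting A, so an
# "is now UP" line printed before the kill cannot satisfy the wait.
marks = {other: other.mark_log() for other in other_nodes}
node_a.stop(gently=False)              # kill A
node_a.start(wait_other_notice=False)
for other, mark in marks.items():
    # Only log lines written after `mark` count: each node must notice
    # the *new* incarnation of A, not merely remember the old one as up.
    other.watch_log_for_alive(node_a, from_mark=mark)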

@temichus commented on Mar 8, 2024

> @nyh This is incorrect. It may fix some problems but introduces others
>
> Consider the following:
>
>   • 2 nodes A, B
>   • kill node A
>   • start node A with wait_other_notice=True
>
> The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!
>
> That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this node's restart!
>
> We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...

@kbr-scylla

I've created a PR where the issue can be reproduced: https://github.com/scylladb/scylla-dtest/pull/4042/files
With approx. 500 runs, we should get at least 1 error.

@bhalevy (Member) commented on Mar 8, 2024

> @nyh This is incorrect. It may fix some problems but introduces others
>
> Consider the following:
>
>   • 2 nodes A, B
>   • kill node A
>   • start node A with wait_other_notice=True
>
> The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it (see the sketch at the end of this comment).

> That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this node's restart!
>
> We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...
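For illustration, the suggested workaround in test code would look something like this (a sketch; parameter names as in ccm's stop()/start()):

# Make the others log that A went down before bringing A back, so the
# subsequent wait cannot be satisfied by stale "A is up" state.
node_a.stop(wait_other_notice=True)   # returns once others marked A down
node_a.start(wait_other_notice=True)  # returns once others see A up again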

@kbr-scylla commented:
> You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.

Which is only a workaround for the problem; the implementation of wait_other_notice is still incorrect.

@bhalevy (Member) commented on Mar 8, 2024

> You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.
>
> Which is only a workaround for the problem; the implementation of wait_other_notice is still incorrect.

Then it needs to be fixed. Issue nr?

@kbr-scylla
Copy link

I opened #563

bhalevy added a commit to bhalevy/scylla-ccm that referenced this pull request Mar 12, 2024
…r_alive

scylladb#462 introduced
watch_rest_for_alive, which replaced the calls to watch_log_for_alive
on the scylla node(s) start path.

But if a node is killed and then restarted, and other nodes
miss the kill event, `watch_rest_for_alive` will consider
that node as already up as seen by the other nodes,
while previously, watch_log_for_alive waited until
other nodes discovered this node as up again, based
on marks taken right before (re)starting that node.

This change brings this call back.

Fixes scylladb#563

Signed-off-by: Benny Halevy <[email protected]>
fruch pushed a commit that referenced this pull request on Mar 15, 2024 (same commit message as above).

Successfully merging this pull request may close these issues:

wait_other_notice doesn't wait for the right thing (#461)

8 participants