
scylla_cluster: fix handling of wait_other_notice #462

Merged: 3 commits into scylladb:next from better-wait-other-notice-2 on Jul 19, 2023

Conversation

@nyh (Contributor) commented on May 28, 2023

After starting a multi-node cluster, it's important to wait until all nodes are aware that all other nodes are available, otherwise if the user sends a CL=ALL request to one node, it might not be aware that the other replicas are usable, and fail the request.

For this reason a cluster's start() has a wait_other_notice=True option, and dtests correctly use it. However, the implementation doesn't wait for the right thing... To check if node A thinks that B is available, it checks that A's log contains the message "InetAddress B is now UP". But this message is unreliable - when it is printed, A still doesn't think that B is fully available - it can still think that B is in a "joining" state for a while longer. If wait_other_notice returns at this point, and the user sends a CL=ALL request to node A, it will fail.

The solution I propose in this patch uses the REST API, instead of the log, to wait until node A thinks node B is both live and finished joining.
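For illustration, here is a minimal sketch of the idea (a hedged reconstruction, not the merged code: the helper name follows the PR, but the exact endpoint paths and REST port 10000 are assumptions based on Scylla's REST API):

import time
import requests

def watch_rest_for_alive(node, others, timeout=120):
    # Poll `node`'s REST API until it reports every address in `others`
    # as live AND no longer in the "joining" state.
    api = f"http://{node.address()}:10000"
    tofind = set(other.address() for other in others)
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Addresses this node's gossiper currently considers live.
        live = set(requests.get(f"{api}/gossiper/endpoint/live").json())
        # Peers that are still bootstrapping ("joining") don't count.
        joining = set(requests.get(f"{api}/storage_service/nodes/joining").json())
        if tofind.issubset(live - joining):
            return
        time.sleep(0.1)
    raise TimeoutError(f"{node.name} did not see {sorted(tofind)} as up in {timeout}s")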

This patch is needed if Scylla is modified to boot up faster. We start seeing dtests which use RF=ALL in the beginning of a test failing, because the node we contact doesn't know that the other nodes are usable.

Fixes #461

@nyh requested review from @fruch and @eliransin on May 28, 2023 14:55
Review threads: ccmlib/scylla_cluster.py (outdated, resolved), ccmlib/scylla_node.py (outdated, resolved)
@fruch (Contributor) commented on May 28, 2023

@nyh it seems there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

@nyh (Contributor, Author) commented on May 28, 2023

> @nyh it seems there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged

Do we run dtests also on Cassandra? If we do, my new function won't work there :-(

@fruch (Contributor) commented on May 28, 2023

> @nyh it seems there are some places in dtest using watch_log_for_alive; we should probably replace them after this one gets merged
>
> Do we run dtests also on Cassandra? If we do, my new function won't work there :-(

we could run with Cassandra, but we don't do that regularly

We could overload watch_log_for_alive to always go to watch_rest_for_alive if it's Scylla; then tests would still work with no change needed (see the sketch below).
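Something like this hypothetical override (assuming ScyllaNode subclasses ccm's Node and that the watch_rest_for_alive helper from this PR is available; the signature mirrors ccm's log-based method):

from ccmlib.node import Node

class ScyllaNode(Node):
    def watch_log_for_alive(self, nodes, from_mark=None, timeout=720, filename='system.log'):
        # On Scylla, the REST API gives a more reliable "fully usable"
        # signal than the "is now UP" log line, so redirect log-based
        # callers to the REST-based wait.
        self.watch_rest_for_alive(nodes, timeout=timeout)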

@nyh force-pushed the better-wait-other-notice-2 branch from 78fe133 to a5d16bf on May 29, 2023 11:36
@nyh (Contributor, Author) commented on May 29, 2023

Pushed a new version with that "_" thing.
Didn't change the sleep from 0.1 seconds - I can change it to anything, but I don't know what to pick...

@nyh force-pushed the better-wait-other-notice-2 branch from 11acc47 to 3d405ca on June 6, 2023 10:06
@nyh (Contributor, Author) commented on Jun 6, 2023

Pushed a new version; the REST API now also waits for nodes to know other nodes' tokens.
With this patch, when dtest is modified to disable "wait for gossip to settle", we're down from 170 failures when I started to "just" 38.

@nyh (Contributor, Author) commented on Jun 13, 2023

Ping.

@fruch (Contributor) left a review:

LGTM

@fruch (Contributor) commented on Jun 13, 2023

@bhalevy, as someone who refactored this a few times, care to also give it a look?

Review thread on ccmlib/scylla_node.py (outdated):
for n in check:
    # Ask this node which tokens it knows for peer n; an empty JSON
    # list means n's tokens haven't been gossiped here yet, so keep
    # waiting for n.
    response = requests.get(url=url_tokens+n)
    if response.text == '[]':
        tofind.add(n)
@bhalevy (Member) commented:

I wonder how this would work with topology-related tests that currently track the log to inject some event during e.g. bootstrap or replace.
Although they can pass wait_other_notice=False and implement their own state tracking, they might break with this change.

@nyh (Contributor, Author) replied:

I don't know what kind of tests you mean, but we can see if any tests start failing with this patch.
I think that if a certain test wants to capture some exact moment in the middle of the boot process, it can't ask ccm to wait until the boot is over.

@fruch (Contributor) commented on Jun 26, 2023

@nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.

And then we could merge it

@mykaul (Contributor) commented on Jul 6, 2023

@eliransin - do you need to review this to get this moving forward?

@fruch (Contributor) commented on Jul 6, 2023

> @nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.
>
> And then we could merge it

running this with a full suite of dtest, to verify it's not causing regressions (or not too many of them)
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/

@fruch (Contributor) commented on Jul 6, 2023

> @eliransin - do you need to review this to get this moving forward?

It's a change that affects all of the tests; I would want it tested first (@nyh has tested it so far only with 'skip_wait_for_gossip_to_settle': 0).

nyh added 3 commits on July 6, 2023 16:35
After starting a multi-node cluster, it's important to wait until all nodes
are aware that all other nodes are available, otherwise if the user sends a
CL=ALL request to one node, it might not be aware that the other replicas
are usable, and fail the request.

For this reason a cluster's start() has a wait_other_notice=True option,
and dtests correctly use it. However, the implementation doesn't wait
for the right thing... To check if node A thinks that B is available,
it checks that A's log contains the message "InetAddress B is now UP".
But this message is unreliable - when it is printed, A still doesn't
think that B is fully available - it can still think that B is in a
"joining" state for a while longer. If wait_other_notice returns at
this point, and the user sends a CL=ALL request to node A, it will fail.

The solution I propose in this patch uses the REST API, instead of the
log, to wait until node A thinks node B is both live and finished
joining.

This patch is needed if Scylla is modified to boot up faster. We start
seeing dtests which use RF=ALL in the beginning of a test failing,
because the node we contact doesn't know that the other nodes are usable.

Fixes scylladb#461

Signed-off-by: Nadav Har'El <[email protected]>
In the previous patch, we introduced the watch_rest_for_alive() function
that uses the REST API to wait for node A to be able to use node B.
We waited for node A to consider node B "alive" and "not joining" (in
nodetool status, this is called the "UN" state), but for some tests this
is not enough: if the test sends A a read with CL=ALL, it needs to
know B's tokens to send it one of the reads. It turns out that node
A can think that B is alive before B has gossiped its tokens.

So in this patch we modify watch_rest_for_alive() to use the REST API
to **also** ensure that node A is aware of B's tokens.

An example dtest that is fixed by this patch is cdc_test.py::
TestCdcWithCompactStorage::test_artificial_column_with_type_empty_is_missing

Signed-off-by: Nadav Har'El <[email protected]>
When starting a node with "wait_other_notice", the existing code waited
until all other nodes think the new node is ready. However, it's safer
to also check that the new node thinks all the other nodes are ready.
It should usually be the case (the new node gets information on all
other nodes in the first gossip), but it doesn't strictly have to be
true, so better to ensure that it is.

Signed-off-by: Nadav Har'El <[email protected]>
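The token check described in the second commit message above boils down to the hunk quoted in the review thread earlier; here is a hedged sketch of the surrounding logic (the /storage_service/tokens/{endpoint} path and the variable names are assumptions matching that hunk):

# Once the live/joining check passes, also require that this node
# knows each peer's tokens before declaring the cluster usable.
url_tokens = f"http://{node.address()}:10000/storage_service/tokens/"
check, tofind = tofind, set()
for n in check:
    response = requests.get(url=url_tokens + n)
    if response.text == '[]':  # empty list: n's tokens not gossiped here yet
        tofind.add(n)
if not tofind:
    return  # every peer is live, joined, and has known tokens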
@nyh force-pushed the better-wait-other-notice-2 branch from 3d405ca to c12ded5 on July 6, 2023 13:36
@nyh (Contributor, Author) commented on Jul 6, 2023

Pushed a new version - rebased, and changed {} to set() as @bhalevy noticed.

@nyh (Contributor, Author) commented on Jul 6, 2023

> @eliransin - do you need to review this to get this moving forward?
>
> It's a change that affects all of the tests; I would want it tested first (@nyh has tested it so far only with 'skip_wait_for_gossip_to_settle': 0).

I think you did this above, but I don't know how to read the results (if I understood correctly, we can't expect zero failures?).

@fruch (Contributor) commented on Jul 6, 2023

> @nyh let's do a run of this change without the wait for gossip change, to see that we aren't breaking any tests with it.
> And then we could merge it
>
> running this with a full suite of dtest, to verify it's not causing regressions (or not too many of them) https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/

@nyh only a handful of failures.

Some are hitting the 2-minute timeout of the newly introduced wait function.
I'm running now only the gating with it:
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/315/

@fruch (Contributor) commented on Jul 6, 2023

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/junit/bootstrap_test/TestBootstrap/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split018___test_seeds_on_duty/

this test expects the boot to fail; the test should enable wait_other_notice

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/replace_address_test/TestReplaceAddress/Run_Dtest_Parallel_Cloud_Machines___HeavyDtest___heavy_split001___test_replace_node_diff_ip_take_write_use_host_id_rbo_disabled_/

fails a bit like failures in debug runs...

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/compaction_additional_test/TestTimeWindowDataSegregation/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split006___test_twcs_multiple_sstables_during_bootstrap/

missing data in node3, doesn't fail like that on master runs

failing as in master:
https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/repair_based_node_operations_test/TestRepairBasedNodeOperations/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split019___test_ignore_dead_nodes_for_replace_option_2/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/compaction_additional_test/TestTimeWindowDataSegregation/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split021___test_enable_disable_optimized_query_for_twcs/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/update_cluster_layout_tests/TestUpdateClusterLayout/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split010___test_decommission_after_changing_node_ip_2/

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/314/testReport/cqlsh_tests.cqlsh_copy_tests/TestCqlshCopy/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_copy_to_with_child_process_crashing/

@fruch (Contributor) commented on Jul 11, 2023

@nyh gating passed o.k.

So we are left with 3 tests that seem to regress (and we didn't test enterprise features, even though most of them are single node).

@bhalevy @eliransin, it's your call: are we o.k. with fixing those tests later, or should this be blocked until someone is able to attend to those?

@bhalevy (Member) commented on Jul 11, 2023

I don't see how it would be ok to introduce regressions

@nyh (Contributor, Author) commented on Jul 11, 2023

> I don't see how it would be ok to introduce regressions

It's not ok, but to be honest, I don't have the peace of mind to fix random buggy dtests while I keep getting requests to do other things at top priority, so I'm not going to fix this any time soon.

@roydahan (Contributor) commented:
I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/

The 3 test failures @fruch mentioned are directly related, but they look quite easy to fix, so we can assign someone to fix them.

However, this PR by itself won't improve runtime, and may possibly even degrade it.

@nyh (Contributor, Author) commented on Jul 17, 2023

> I started a new run with this: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/fruch/job/new-dtest-pytest-parallel/
>
> The 3 test failures @fruch mentioned are directly related, but they look quite easy to fix, so we can assign someone to fix them.

Thanks.

> However, this PR by itself won't improve runtime, and may possibly even degrade it.

The idea is to have a dtest PR which runs Scylla without huge sleeps. The branch is https://github.com/nyh/scylla-dtest/commits/quick-boot-2. There is a one-line patch https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c plus one fix for one test's bug - https://github.com/nyh/scylla-dtest/commit/0b23742b4622c9b6e18f4613722ce433640cf41c. There will need to be more patches, because a few dozen tests still fail (not as many as before the ccm patches, but still a non-zero list).

The next step can be to make the dtest patch an optional parameter (so developers can use it), but @fruch and @eliransin suggested that it can be the default, and we would just add a tag for tests that can't work in this "sleepless" mode, so that only they will be run in the slow mode.

@fruch (Contributor) commented on Jul 18, 2023

I'm handling the tests that need to be fixed; it looks like the issue is identical in all 3 tests - we need to not use wait_other_notice, and that solves it.

So this last run had 8 failures; 6 of them would be solved by https://github.com/scylladb/scylla-dtest/pull/3313, and the other 2 are identical to master failures (which means we have only 2 failures on master ????)

@fruch (Contributor) commented on Jul 18, 2023

> I'm handling the tests that need to be fixed; it looks like the issue is identical in all 3 tests - we need to not use wait_other_notice, and that solves it.
>
> So this last run had 8 failures; 6 of them would be solved by scylladb/scylla-dtest#3313, and the other 2 are identical to master failures (which means we have only 2 failures on master ????)

I was skeptical, so I went to check:
https://jenkins.scylladb.com/job/scylla-master/job/dtest-daily-release/306/testReport/

for the last 3 runs, we have only 2 failing tests

@roydahan (Contributor) left a review:

Approved from my side.
We can merge this and the PR for dtest fixes.

fruch merged commit 65872f6 into scylladb:next on Jul 19, 2023. 1 check passed.
@kbr-scylla commented:
@nyh This is incorrect. It may fix some problems but introduces others

Consider the following:

  • 2 nodes A, B
  • kill node A
  • start node A with wait_other_notice=True

The start call will often return quickly, but not because node B noticed that A restarted
but because node B didn't even notice that A was killed yet!

That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this node's restart!

We're actually now observing this scenario trying to unflake one dtest with @temichus (https://github.com/scylladb/scylla-dtest/pull/3991)...
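For contrast, the log-marker pattern avoids this race because the marker is taken before the restart, so only messages printed afterwards can satisfy the wait. A hedged sketch using ccm's mark_log/watch_log_for_alive API (node and variable names are illustrative):

# Take log positions on the *other* nodes before restarting A, so an
# "is now UP" line printed before the kill cannot satisfy the wait.
marks = {other: other.mark_log() for other in other_nodes}
node_a.stop(gently=False)              # kill A
node_a.start(wait_other_notice=False)
for other, mark in marks.items():
    # Only log lines written after `mark` count: each node must notice
    # the *new* incarnation of A, not merely remember the old one as up.
    other.watch_log_for_alive(node_a, from_mark=mark)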

@temichus commented on Mar 8, 2024

> @nyh This is incorrect. It may fix some problems but introduces others
>
> Consider the following:
>
>   • 2 nodes A, B
>   • kill node A
>   • start node A with wait_other_notice=True
>
> The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!
>
> That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this node's restart!
>
> We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...

@kbr-scylla

I've created a PR where the issue can be reproduced: https://github.com/scylladb/scylla-dtest/pull/4042/files
With approx. 500 runs, we should get at least 1 error.

@bhalevy (Member) commented on Mar 8, 2024

> @nyh This is incorrect. It may fix some problems but introduces others
>
> Consider the following:
>
>   • 2 nodes A, B
>   • kill node A
>   • start node A with wait_other_notice=True
>
> The start call will often return quickly, but not because node B noticed that A restarted but because node B didn't even notice that A was killed yet!

You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it (see the sketch at the end of this comment).

> That's why the old check used the log markers --- the log markers ensure that we don't depend on obsolete state for wait_other_notice. That is, other nodes really have to notice this node's restart!
>
> We're actually now observing this scenario trying to unflake one dtest with @temichus (scylladb/scylla-dtest#3991)...
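For illustration, the suggested workaround in test code would look something like this (a sketch; parameter names as in ccm's stop()/start()):

# Make the others log that A went down before bringing A back, so the
# subsequent wait cannot be satisfied by stale "A is up" state.
node_a.stop(wait_other_notice=True)   # returns once others marked A down
node_a.start(wait_other_notice=True)  # returns once others see A up again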

@kbr-scylla commented:
> You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.

Which is only a workaround for the problem; the implementation of wait_other_notice is still incorrect.

@bhalevy (Member) commented on Mar 8, 2024

> You can stop A with wait_other_notice=True to ensure node B notices that A went down, before restarting it.
>
> Which is only a workaround for the problem; the implementation of wait_other_notice is still incorrect.

Then it needs to be fixed. Issue nr?

@kbr-scylla
Copy link

I opened #563

bhalevy added a commit to bhalevy/scylla-ccm that referenced this pull request Mar 12, 2024
…r_alive

scylladb#462 introduced
watch_rest_for_alive, which replaced the calls to watch_log_for_alive
on the scylla node(s) start path.

But if a node is killed and then restarted, and other nodes
miss the kill event, `watch_rest_for_alive` will consider
that node as already up as seen by the other nodes,
while previously, watch_log_for_alive waited until
other nodes discovered this node as up again, based
on marks taken right before (re)starting that node.

This change brings this call back.

Fixes scylladb#563

Signed-off-by: Benny Halevy <[email protected]>
fruch pushed a commit that referenced this pull request on Mar 15, 2024 (same commit message as above).

Successfully merging this pull request may close these issues:

wait_other_notice doesn't wait for the right thing (#461)

8 participants