Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners #523

Conversation

bhalevy
Copy link
Member

@bhalevy bhalevy commented Nov 8, 2023

It is not enough for node to know about other nodes' tokens, as they might not be reflected in the token_metadata map.

Instead, check the /storage_service/host_id api that provides a list of nodes that are normal token owners and ready to be used by queries.

Refs scylladb/scylladb#15146

ccmlib/scylla_node.py Outdated Show resolved Hide resolved
@bhalevy bhalevy force-pushed the scylla_node-watch_rest_for_alive-normal-token-owners branch from e6f05bc to 9f97400 Compare November 9, 2023 06:05
@bhalevy
Copy link
Member Author

bhalevy commented Nov 9, 2023

@nyh, @fruch please review

ccmlib/scylla_node.py Outdated Show resolved Hide resolved
@fruch
Copy link
Contributor

fruch commented Nov 9, 2023

Seems o.k.

But I'll want to take a ride in at least in gating

@bhalevy bhalevy force-pushed the scylla_node-watch_rest_for_alive-normal-token-owners branch from 9f97400 to ca9352d Compare November 9, 2023 13:50
@bhalevy
Copy link
Member Author

bhalevy commented Nov 9, 2023

@bhalevy seems like we are getting None as host_id all over the place

should be fixed now (at least it passes for me locally, for example update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_removenode_3[api-True])

@fruch
Copy link
Contributor

fruch commented Nov 9, 2023

@bhalevy seems like we are getting None as host_id all over the place

should be fixed now (at least it passes for me locally, for example update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_removenode_3[api-True])

running gating again:

@bhalevy bhalevy force-pushed the scylla_node-watch_rest_for_alive-normal-token-owners branch from ca9352d to 34712cd Compare November 9, 2023 19:42
@bhalevy
Copy link
Member Author

bhalevy commented Nov 9, 2023

  • scylla_node: start_scylla: reset node_hostid

so to force reload of the host_id after restart as the node may wake up with a different host_id (after being wiped)

@bhalevy
Copy link
Member Author

bhalevy commented Nov 9, 2023

@bhalevy
Copy link
Member Author

bhalevy commented Nov 10, 2023

@bhalevy bhalevy force-pushed the scylla_node-watch_rest_for_alive-normal-token-owners branch from 34712cd to 54332a9 Compare November 10, 2023 19:29
@bhalevy
Copy link
Member Author

bhalevy commented Nov 10, 2023

In 54332a9:

  • scylla_node: hostid: handle errors
  • scylla_node: watch_rest_for_alive: check response.status_code
  • scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners:
    • extends current function by waiting on tokens first, then on nrmal token ownership
  • scylla_node: watch_rest_for_alive: add wait_normal_token_owner param
    • for backward compatibility
    • applied also from ScyllaCluster::start_nodes

@fruch
Copy link
Contributor

fruch commented Nov 12, 2023

giving one last run, this is some that gonna happen on every single test:

@bhalevy also keep in mind it might break non gating tests...
we'll need to keep that in mind following this merge

@bhalevy
Copy link
Member Author

bhalevy commented Nov 12, 2023

@bhalevy also keep in mind it might break non gating tests...
we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

@fruch
Copy link
Contributor

fruch commented Nov 12, 2023

@bhalevy also keep in mind it might break non gating tests...
we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

And which tests are gonna use that ? Only one we will be unstable?

@bhalevy
Copy link
Member Author

bhalevy commented Nov 12, 2023

@bhalevy also keep in mind it might break non gating tests...
we'll need to keep that in mind following this merge

Right. That's why I added the wait_normal_token_owner flag.

And which tests are gonna use that ? Only one we will be unstable?

Yes. This category of tests monitor starting nodes' log looking for particular events.
In most cases we already call start with wait_other_notice=False, wait_for_binary_protocol=False, but in cases we didn't do that we'll have to figure out what's the right thing to do.

@bhalevy
Copy link
Member Author

bhalevy commented Nov 12, 2023

This is a test issue: https://github.com/scylladb/scylla-dtest/issues/3717

@bhalevy
Copy link
Member Author

bhalevy commented Nov 13, 2023

@nyh since you're the last to change watch_rest_for_alive, any comments before we merge this change?

Copy link
Contributor

@fruch fruch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bhalevy
Copy link
Member Author

bhalevy commented Nov 20, 2023

@nyh please review (since you were the last one to change this area)

ccmlib/scylla_node.py Show resolved Hide resolved
ccmlib/scylla_node.py Show resolved Hide resolved
return self.node_hostid
except Exception as e:
self.error(f"Failed to get hostid using {url}: {e}")
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

‎This "pass" isn't needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but I find it clarifying, as it documents the fact that the error is ignored on purpose.
Otherwise, a naive reader might suspect there is a missing raise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped pass in next version

if tofind.issubset(live):
# This node thinks that all given nodes are alive and not
# "joining", we're almost done, but still need to verify
# that the node knows the others' tokens.
check = tofind
tofind = set()
have_no_tokens = set()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: The original code, after checking that node X does have tokens, removed it from the "tofind" list, so no need to check it again in the next iteration. Your rewrite lost this optimization. But I agree it's not a very important optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see what it takes to preserve that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I restored the trimming of tofind in the next version (in a different way)

ccmlib/scylla_cluster.py Show resolved Hide resolved
@bhalevy bhalevy force-pushed the scylla_node-watch_rest_for_alive-normal-token-owners branch from 54332a9 to 5fb352a Compare November 30, 2023 09:34
@bhalevy
Copy link
Member Author

bhalevy commented Nov 30, 2023

In 5fb352a:

  • scylla_node: override hostid() using the REST api
    • used timeout arg for requests.get
  • scylla_node: start_scylla: reset node_hostid
    • added code comment explaining why we do that
  • scylla_node: hostid: handle errors
    • dropped pass statement
  • scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners
    • trim tofind as nodes are found
    • but maintain cummulative found set and found_host_id_map dict for debugging purposes
  • scylla_node: watch_rest_for_alive: add wait_normal_token_owner param
    • Document wait_normal_token_owner param also in watch_rest_for_alive

@bhalevy bhalevy requested a review from nyh December 3, 2023 10:57
@bhalevy
Copy link
Member Author

bhalevy commented Dec 6, 2023

@nyh please re-review.
I addressed your review comments.

@bhalevy
Copy link
Member Author

bhalevy commented Dec 12, 2023

@nyh please re-review. I addressed your review comments.

@nyh ping, when do you think you can re-review?

tofind = tofind.difference(normal)
if not tofind:
return
# Update cummulative maps for debugging
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spelling error: cummulative

ccmlib/scylla_cluster.py Show resolved Hide resolved
For scylla, it is better to retrieve the node host_id
using the REST api as it doesn't require scylla-jmx
to serve `nodetool`.  This allows us to get the node's
host_id much earlier in its start process,
as soone as it starts to server the REST API.

Signed-off-by: Benny Halevy <[email protected]>
So it will be retrieved again once the node restarts
since it might restart with a different host_id than it previously
had if it was wiped and reused for bootstrap / replace.

Signed-off-by: Benny Halevy <[email protected]>
If `hostid()` is called too early, it may fail as follows:
```
           requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.89.5', port=10000): Max retries exceeded with url: /storage_service/hostid/local (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa20680aa50>: Failed to establish a new connection: [Errno 111] Connection refused'))
```

Just print the error and return None to indicate that the
hostid is unknown.

Signed-off-by: Benny Halevy <[email protected]>
The correct way to determine if the http request
succeeded is by checking response.status_code == requests.codes.ok.

Signed-off-by: Benny Halevy <[email protected]>
…ormal token owners

It is not enough for node to know about other nodes' tokens,
as they might not be reflected in the token_metadata map.

After checking tokens, check also the `/storage_service/host_id` api
that provides a list of nodes that are normal token owners and ready
to be used by queries.

Refs scylladb/scylladb#15146

Signed-off-by: Benny Halevy <[email protected]>
For backward compatibility.
Some tests may want to pass `node.start(wait_other_notice=True)`
and not wait for nodes to become normal token owners
if they need to examine the node earlier than that.

Signed-off-by: Benny Halevy <[email protected]>
@bhalevy bhalevy force-pushed the scylla_node-watch_rest_for_alive-normal-token-owners branch from 5fb352a to 1d226d2 Compare December 19, 2023 15:03
@bhalevy
Copy link
Member Author

bhalevy commented Dec 19, 2023

  • fixed spelling mistake and rebased
    @fruch please merge when CI passed

@bhalevy
Copy link
Member Author

bhalevy commented Dec 19, 2023

All checks have passed, please merge

@fruch fruch merged commit 1411230 into scylladb:next Dec 19, 2023
3 checks passed
@bhalevy
Copy link
Member Author

bhalevy commented Jan 1, 2024

@fruch can you please backport to 5.4 and 2024.1?

@mykaul
Copy link
Contributor

mykaul commented Jan 1, 2024

This kinda reminds a (very) long discussion with the operator team, where they did not know when can they safely do a rolling restart of Scylla pods - and the request was a simple API to let them know 'yes, it's real, this node is really up and serving traffic and is part of the cluster and you can safely move to the next node and restart it'.
See scylladb/scylladb#8275

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants