scylla_node: watch_rest_for_alive: wait for others to be considered normal token owners #523
e6f05bc to 9f97400
Seems OK, but I'd like to take it for a ride, at least in gating.
@bhalevy seems like we are getting None as host_id all over the place
9f97400 to ca9352d
should be fixed now (at least it passes for me locally, for example
running gating again:
ca9352d to 34712cd
so to force reload of the host_id after restart as the node may wake up with a different host_id (after being wiped)
34712cd to 54332a9
In 54332a9:
giving one last run, this is something that's gonna happen on every single test:
@bhalevy also keep in mind it might break non-gating tests...
Right. That's why I added the
And which tests are gonna use that? Only the ones that would be unstable?
Yes. This category of tests monitors the starting nodes' logs, looking for particular events.
There's only https://jenkins.scylladb.com/job/scylla-staging/job/dtest-pytest-gating/146/testReport/junit/cql_tests/TestTruncate/FullDtest___full_split000___test_cql_query_filtering_without_indexes/
This is a test issue: https://github.com/scylladb/scylla-dtest/issues/3717
@nyh since you're the last to change
LGTM
@nyh please review (since you were the last one to change this area)
ccmlib/scylla_node.py
Outdated
    return self.node_hostid
except Exception as e:
    self.error(f"Failed to get hostid using {url}: {e}")
    pass
This "pass" isn't needed.
True, but I find it clarifying, as it documents the fact that the error is ignored on purpose.
Otherwise, a naive reader might suspect there is a missing `raise`.
Dropped `pass` in the next version.
if tofind.issubset(live):
    # This node thinks that all given nodes are alive and not
    # "joining", we're almost done, but still need to verify
    # that the node knows the others' tokens.
    check = tofind
    tofind = set()
    have_no_tokens = set()
Nitpick: The original code, after checking that node X does have tokens, removed it from the "tofind" list, so no need to check it again in the next iteration. Your rewrite lost this optimization. But I agree it's not a very important optimization.
I'll see what it takes to preserve that.
I restored the trimming of `tofind` in the next version (in a different way).
54332a9 to 5fb352a
In 5fb352a:
@nyh please re-review.
ccmlib/scylla_node.py
Outdated
tofind = tofind.difference(normal)
if not tofind:
    return
# Update cummulative maps for debugging
spelling error: cummulative
For scylla, it is better to retrieve the node host_id using the REST API, as it doesn't require scylla-jmx to serve `nodetool`. This allows us to get the node's host_id much earlier in its start process, as soon as it starts to serve the REST API. Signed-off-by: Benny Halevy <[email protected]>
So it will be retrieved again once the node restarts since it might restart with a different host_id than it previously had if it was wiped and reused for bootstrap / replace. Signed-off-by: Benny Halevy <[email protected]>
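The cache reset described in this commit message can be sketched as follows. This is a minimal illustration, not the actual `ScyllaNode` code: the class name `NodeSketch` and the injected `fetch_host_id` callable are hypothetical, and only the `node_hostid` attribute name is taken from the reviewed diff.

```python
from typing import Callable, Optional


class NodeSketch:
    """Hypothetical stand-in for a ccm node; not the real ScyllaNode class."""

    def __init__(self, fetch_host_id: Callable[[], Optional[str]]):
        # The lookup (for example a REST call) is injected to keep the sketch self-contained.
        self._fetch_host_id = fetch_host_id
        # Cached value; None means "unknown, look it up on the next call".
        self.node_hostid: Optional[str] = None

    def hostid(self) -> Optional[str]:
        # Only perform the lookup when nothing is cached yet.
        if self.node_hostid is None:
            self.node_hostid = self._fetch_host_id()
        return self.node_hostid

    def start(self) -> None:
        # Drop the cached value on (re)start: a node that was wiped and reused
        # for bootstrap / replace may come back with a different host_id.
        self.node_hostid = None
```

With this shape, repeated `hostid()` calls between restarts cost one lookup, while a restart always forces a fresh one.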
If `hostid()` is called too early, it may fail as follows:
```
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.89.5', port=10000): Max retries exceeded with url: /storage_service/hostid/local (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa20680aa50>: Failed to establish a new connection: [Errno 111] Connection refused'))
```
Just print the error and return None to indicate that the hostid is unknown. Signed-off-by: Benny Halevy <[email protected]>
The correct way to determine if the http request succeeded is by checking response.status_code == requests.codes.ok. Signed-off-by: Benny Halevy <[email protected]>
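Taken together, the last two commit messages suggest a lookup along these lines. This is a hedged sketch, not the code from the PR: the function name `get_host_id` and the `http` parameter (present only so the example can be exercised without a live node) are hypothetical. The port and path are the ones in the error message quoted above. That the endpoint returns the host_id as a JSON string is an assumption.

```python
import requests


def get_host_id(address, port=10000, http=requests, timeout=5):
    """Return the node's host_id, or None if it cannot be determined yet."""
    url = f"http://{address}:{port}/storage_service/hostid/local"
    try:
        response = http.get(url, timeout=timeout)
    except requests.exceptions.RequestException as e:
        # Typically "connection refused" when the REST API is not up yet.
        print(f"Failed to get hostid using {url}: {e}")
        return None
    # Judge success by the status code rather than by the response body.
    if response.status_code != requests.codes.ok:
        print(f"Failed to get hostid using {url}: HTTP {response.status_code}")
        return None
    # Assumption: the body is a JSON-encoded string holding the host_id.
    return response.json()
```

Returning None rather than raising lets a caller simply retry later, which fits a node that is still starting.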
…ormal token owners It is not enough for a node to know about other nodes' tokens, as they might not be reflected in the token_metadata map. After checking tokens, also check the `/storage_service/host_id` API, which provides a list of nodes that are normal token owners and ready to be used by queries. Refs scylladb/scylladb#15146 Signed-off-by: Benny Halevy <[email protected]>
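The wait described above could look roughly like the loop below. It is a sketch under stated assumptions rather than the PR's `watch_rest_for_alive` implementation: the function name, the timeout handling, and the injected `get_normal_host_ids` callable are hypothetical. The callable stands in for a query to `/storage_service/host_id`, whose exact response shape is not shown in this thread. The `tofind.difference(normal)` trimming mirrors the reviewed diff.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_normal_token_owners(
    get_normal_host_ids: Callable[[], Iterable[str]],
    expected_host_ids: Iterable[str],
    timeout: float = 120.0,
    poll_interval: float = 0.1,
) -> None:
    """Block until every expected host_id has been reported as a normal token owner."""
    tofind: Set[str] = set(expected_host_ids)
    deadline = time.monotonic() + timeout
    while True:
        normal = set(get_normal_host_ids())
        # Trim nodes already seen as normal so they are not checked again.
        tofind = tofind.difference(normal)
        if not tofind:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"nodes not seen as normal token owners: {sorted(tofind)}"
            )
        time.sleep(poll_interval)
```

Because `tofind` only ever shrinks, a node that was once reported as normal is not waited for again, which is the optimization discussed in the review comments.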
For backward compatibility. Some tests may want to pass `node.start(wait_other_notice=True)` and not wait for nodes to become normal token owners if they need to examine the node earlier than that. Signed-off-by: Benny Halevy <[email protected]>
5fb352a to 1d226d2
All checks have passed, please merge
@fruch can you please backport to 5.4 and 2024.1?
This kinda reminds me of a (very) long discussion with the operator team, where they did not know when they could safely do a rolling restart of Scylla pods. The request was for a simple API to tell them 'yes, it's real, this node is really up and serving traffic and is part of the cluster and you can safely move to the next node and restart it'.
It is not enough for a node to know about other nodes' tokens, as they might not be reflected in the token_metadata map.
Instead, check the `/storage_service/host_id` API, which provides a list of nodes that are normal token owners and ready to be used by queries.
Refs scylladb/scylladb#15146