AddPeer API #5123

Open
wants to merge 1 commit into
base: main

ramonberrutti
Contributor

WIP AddPeer API.

Need to:

  • Test edge cases.
  • Corrupt disk comeback.
  • Empty disk comeback.

Signed-off-by: Ramon Berrutti [email protected]

@ramonberrutti ramonberrutti requested a review from a team as a code owner February 22, 2024 19:41
Member

@derekcollison derekcollison left a comment

What specific problem are we trying to solve? Peers automatically get added in. Is this specific to after a peer remove step?

@ramonberrutti
Contributor Author

ramonberrutti commented Feb 22, 2024

> What specific problem are we trying to solve? Peers automatically get added in. Is this specific to after a peer remove step?

Yes. After a peer is removed, one way to rejoin the cluster is to change the server name, but in our case we want to add the same server back after some minutes or hours.
Another workaround we found is to force leader elections until the nodes are added to the Raft group again (I haven't looked at how this works, or whether it is just luck).

The solution in this PR only works for peers that were removed but are still kept in the hashmap.

@derekcollison
Member

You could simply shut down the server, do the maintenance needed, and restart?

If you need to move stream and consumer peers off that machine during the downtime, you can do that separately.

I will double-check when the system will re-add a peer that was removed.

@ramonberrutti
Contributor Author

> You could simply shut down the server, do the maintenance needed, and restart?
>
> If you need to move stream and consumer peers off that machine during the downtime, you can do that separately.
>
> I will double-check when the system will re-add a peer that was removed.

We can't do that because we want to adjust the quorum size.
During our maintenance, we scale up new nodes (1/3 of the total nodes).

For example, we have 9 nodes, so we need 5 to reach the meta leader quorum.

During our maintenance, we scale up 3 new nodes and scale down 3.
Now we need 7 nodes to reach quorum, but we have already lost 3, so we can only lose 2 more nodes.
We want to remove that node for a bit so that we can lose 3 nodes instead of 2.

Also, we need to repeat that process multiple times, and we want to remove the added nodes once the first removed nodes have recovered.
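
The quorum arithmetic in this comment can be checked with a small sketch. It assumes a simple majority quorum (peers/2 + 1, integer division), which matches the 9 → 5 and 12 → 7 figures given here. The helper names `quorum` and `tolerable` are hypothetical and are not part of nats-server. The last case is one reading of "remove that node for a bit": peer-removing a single downed node so that the group has 11 members.

```go
package main

import "fmt"

// quorum returns the majority needed for a group with the given number
// of registered peers (assumption: simple majority, peers/2 + 1).
func quorum(peers int) int {
	return peers/2 + 1
}

// tolerable returns how many of the currently alive peers can still fail
// before quorum is lost. A negative value means quorum is already lost.
func tolerable(peers, alive int) int {
	return alive - quorum(peers)
}

func main() {
	// Steady state: 9 registered peers, all alive.
	fmt.Println(quorum(9), tolerable(9, 9))
	// Maintenance: 12 registered peers (9 + 3 new), 3 scaled down, 9 alive.
	fmt.Println(quorum(12), tolerable(12, 9))
	// One downed peer removed from the group: 11 registered, 9 alive.
	fmt.Println(quorum(11), tolerable(11, 9))
}
```

Under these assumptions, removing one downed peer lowers the quorum from 7 to 6, which is where the "3 nodes instead of 2" figure comes from.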

@derekcollison
Member

With your 9-node cluster, you can tolerate 4 failures and still have the whole cluster available (meta). What purpose is served by scaling up to 12?

@ripienaar
Contributor

Why do you want to adjust the quorum number?

The process of swapping machines in and out works really well in a rolling fashion if you bring nodes back with predictable, fixed server_names. There should never be a need to peer remove a server unless it is gone for good.

Rather than carry API bloat, I'd prefer to see a better maintenance process used here, and to find out how we can help you achieve a process that keeps the RAFT layer stable over time.
