
Only broadcast HostRemovedFromCluster when a host is part of a cluster #1611

Merged
merged 6 commits into main from broadcast-host-removed-from-cluster on Jul 11, 2023

Conversation

jamie-suse
Contributor

Description

Relates to #1474

This PR fixes possible concurrency issues when a host is removed from a cluster (the delta-deregistration flow).
The issue is caused by two aggregates acting concurrently, so the ordering of the HostAddedToCluster and HostRemovedFromCluster events is not guaranteed.

The solution is to change the projector logic to skip the event if the host is no longer part of the cluster.

We still need to conditionally broadcast the event to the FE, but that will come once the FE PR is merged.

This change broadcasts HostRemovedFromCluster only when the given host is part of a cluster.
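
For illustration, a minimal sketch of the guarded projection described above, written as a Commanded Ecto projection. Trento.HostProjector and Trento.HostReadModel come from the diff below, while Trento.Repo and the changeset details are assumptions, so the merged implementation may differ:

  # Sketch only: skip the update when the cluster_id carried by the event no
  # longer matches the cluster currently stored for the host, i.e. a concurrent
  # HostAddedToCluster has already moved the host to another cluster.
  project(%HostRemovedFromCluster{host_id: id, cluster_id: cluster_id}, _metadata, fn multi ->
    host = Trento.Repo.get!(Trento.HostReadModel, id)

    if host.cluster_id == cluster_id do
      Ecto.Multi.update(multi, :host, Ecto.Changeset.change(host, cluster_id: nil))
    else
      # Event ordering is not guaranteed between aggregates: nothing to change here.
      multi
    end
  end)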

How was this tested?

Added conditions in unit tests to assert the existence/non-existence of the broadcast messages when HostRemovedFromCluster is handled.

@jamie-suse jamie-suse added the elixir Pull requests that update Elixir code label Jul 10, 2023
@@ -275,6 +275,17 @@ defmodule Trento.HostProjector do
end
end

  def after_update(
        %HostRemovedFromCluster{host_id: id, cluster_id: cluster_id},
Contributor

@arbulu89 arbulu89 Jul 11, 2023


We don't need the cluster_id here.
This field is used to know which cluster to remove the host from.
Here we need to broadcast cluster: nil only if the cluster_id coming in the event matches the current cluster id stored in the DB.

You could do that by pattern matching the 3rd argument:

  def after_update(
        %HostRemovedFromCluster{host_id: id},
        _,
        %{host: %HostReadModel{cluster_id: nil}}
       ) do

With this clause, you would only run the broadcast if there was a change in the multi.
Anyway, we should have the tests to confirm it hehe

@jamie-suse jamie-suse force-pushed the broadcast-host-removed-from-cluster branch from ff9076a to d615d40 on July 11, 2023 08:33
        %{host: %Trento.HostReadModel{cluster_id: nil}}
      ) do
    TrentoWeb.Endpoint.broadcast("monitoring:hosts", "host_details_updated", %{
      id: host_id
Contributor Author


Is this the correct payload to broadcast? cluster_id will always be nil because of the pattern matching, so is it useful to transmit this?

Contributor


No, we need to explicitly send cluster_id: nil, because this is the value we want to store in the frontend.
If we don't send it, we are not actually sending any information.
This message is used to merge the host data already stored in Redux with the new one, where we want to set the cluster to nil.
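
For reference, a minimal sketch of that broadcast inside the after_update clause, sending cluster_id: nil explicitly; it only reuses names that appear in the diff above, and the merged code may differ slightly:

  def after_update(
        %HostRemovedFromCluster{host_id: host_id},
        _,
        %{host: %Trento.HostReadModel{cluster_id: nil}}
      ) do
    # Send cluster_id: nil explicitly so the frontend can merge it into the host
    # entry already stored in Redux and clear the previous cluster association.
    TrentoWeb.Endpoint.broadcast("monitoring:hosts", "host_details_updated", %{
      id: host_id,
      cluster_id: nil
    })
  end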

Contributor

@arbulu89 arbulu89 left a comment


All good from my side!
I only commented on the refute_broadcast timeout value, which slows down our tests.


refute_broadcast "host_details_updated",
                 %{id: ^host_id},
                 1000
Contributor


I'm just a bit worried about this 1000 value.
In the refute, this blocks the test until the timeout expires, so it will wait for 1 second to get a green test.
Maybe we should use the default 100 value?
@CDimonaco @fabriziosestito Any suggestions on this?

Member


I think the 1000 value was set by me when I started writing the broadcast tests; it was chosen by manually trying different values. If the default timeout is sufficient, we can switch to that.

Contributor


I mean, in assert_broadcast it's fine, because it will just wait until the message is sent, and it will only be a slow test on failure.
The refute will always wait for the full timeout, even in passing tests.
That's the difference.
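
For context, a sketch of that asymmetry using the Phoenix.ChannelTest assert_broadcast/refute_broadcast macros; the event name and pinned host_id follow the snippet above, and the concrete timeout values are illustrative:

  # assert_broadcast returns as soon as a matching message arrives, so a
  # generous timeout only costs time when the test is already failing.
  assert_broadcast "host_details_updated", %{id: ^host_id}, 1000

  # refute_broadcast always blocks for its full timeout, even in passing
  # tests, so a large value slows down every run of the suite.
  refute_broadcast "host_details_updated", %{id: ^host_id}, 100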

Contributor Author


I reduced this timeout to 100 ms 👍

Contributor


Just a last request XD
Could you test whether putting this same refute_broadcast with 100 ms in the test above fails? Of course, just temporarily, in your local env.
It should fail. I want to be sure that the 100 ms makes sense.

Contributor Author


Just did on my local machine, and refute_broadcast can catch messages using a timeout as low as 1 ms 😵 I guess we may see different performance in the CI environment, though.

@jamie-suse jamie-suse marked this pull request as ready for review July 11, 2023 09:18
@arbulu89
Contributor

@jamie-suse Green light from my side!

@jamie-suse jamie-suse merged commit 68d8614 into main Jul 11, 2023
16 checks passed
@jamie-suse jamie-suse deleted the broadcast-host-removed-from-cluster branch July 11, 2023 11:34