
fence_heuristics_resource: add new fence-agent for dynamic delay fencing #308

Open
curvygrin wants to merge 4 commits into main from implements-fence-heuristics-resource

Conversation

@curvygrin
Contributor

In a two-node cluster, if a fence race occurs while the service is running on the standby node (i.e., the node whose fence agent has the delay set) after a failover caused by, e.g., a monitoring failure on the active node, the standby node will be fenced and the service cannot continue.
This is because the fencing delay setting is fixed to a particular node.

proposal (this PR)

Implement a new heuristics agent that determines the node where the service (resource) is running and dynamically delays fencing.

Signed-off-by: Kumabuchi Kenji [email protected]
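As a rough sketch of the intent (illustrative only, not the PR's actual code; where_resource_runs and mynodename are hypothetical, assumed to come from the rest of the agent):

import time

def heuristics_off(resource_id, standby_wait, mynodename):
    # If the protected resource is NOT running locally, this is the
    # standby node: wait, so that the active node wins the fence race.
    if mynodename not in where_resource_runs(resource_id):  # hypothetical helper
        time.sleep(standby_wait)
    # Per the option help text discussed below, the action then reports
    # success so that the real fence agent configured with it can proceed.
    return True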

@knet-ci-bot

Can one of the admins verify this patch?

fence-agents.spec.in — review thread (outdated, resolved)
@curvygrin force-pushed the implements-fence-heuristics-resource branch from b8c5478 to c928a80 on November 5, 2019
@oalbrigt
Collaborator

oalbrigt commented Nov 5, 2019

ok to test

standby_wait = int(options["--standby-wait"])
p = None
cmd = "%s -r %s -W" % (crm_resource_path, resource)
search_str = re.compile(r"\s%s$" % os.uname()[1])
Contributor

As pacemaker doesn't necessarily use uname as the name of the node, it would be preferable to use crm_node --name or similar here.
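For illustration, the suggested replacement could look like this (a minimal sketch, assuming the agent shells out the same way it does for crm_resource):

import subprocess

def get_local_node_name():
    # crm_node --name prints the name pacemaker uses for this node,
    # which may differ from os.uname()[1] (the system hostname).
    return subprocess.check_output(["crm_node", "--name"],
                                   universal_newlines=True).strip()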

Contributor Author

Fixed to use crm_node --name

Contributor

@wenningerk left a comment

Actually, the idea of heuristics agents was to enable/disable another fence agent rather than to introduce a delay. You should at least adapt the description accordingly.
The logic should also cover the case of self-fencing, I guess, where additional waiting doesn't make sense regardless of whether the resource is running or not.
To make it more generic, it would probably make sense to consider promotable/master-slave resources as well.

@curvygrin
Contributor Author

@wenningerk Thank you for your feedback. I think it's not too difficult to fix the description and to consider promotable resources. But I have no idea how to cover self-fencing. Is there a good way to detect it?

Certainly, in the case of self-fencing a wasteful delay occurs. But honestly, I think it's no different from a delay parameter fixed to a node.

@wenningerk
Contributor


Well you know who you are (crm_node --name) and you know the target ;-)

@wenningerk
Contributor

I was also talking to Ken yesterday about this approach, and he shares my feeling that doing something complicated like determining the location of a resource as part of a fence agent should be considered carefully. At least a note in the fence agent should state that this may put more stress on keeping the one instance running than on reliable recovery.
If you put this fence agent on the same level as the real fence agent, and determining the location fails or hangs, this node would never fence.
So one possibility to think about would be to put it on a higher level than the real fence agent and to always make it fail. Then, if determining the location works, it may introduce a delay, and if it fails, we would at least still have fencing. This could also be an option, so that the user can choose whether to weight reliable recovery higher than keeping the service running.
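For illustration, the layered setup being suggested could be registered like this (pcs syntax; the device names are placeholders, not from this PR):

# level 1: the heuristics agent, configured to always fail,
# so it can only ever add a delay
pcs stonith level add 1 node1 fence_heur_node1
# level 2: the real fence agent, reached because level 1 fails
pcs stonith level add 2 node1 fence_ipmi_node1

This way a hang or error in the location lookup degrades to a missing delay (level 1 times out or fails, level 2 still fences) rather than to a node that can never be fenced.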

@curvygrin
Contributor Author

@wenningerk & Ken
Thank you very much for considering this proposal.
I understand the concerns with the current approach.

So one possibility to think about would be to put it on a higher level than the real fence agent and to always make it fail. Then, if determining the location works, it may introduce a delay, and if it fails, we would at least still have fencing.

I certainly think that the above approach is safer.
(In fact, I thought about it, but an implementation that always returns FALSE seemed strange, so I dropped it.)
I will try to fix the PR.

@wenningerk
Contributor

If you feel uneasy about it, leave the default as is and make returning FALSE an option ...

@kgaillot left a comment

Hi,

Thanks for contributing a significant new capability! I do have some concerns that I've commented on.

logging.error("Command failed. rc=%s, stderr=%s" % (rc, err))
return False

search_str = re.compile(r"\s%s%s$" % (mynodename, '\sMaster' if promotable else ''))

Have you tested this? I think there's a space at the end of the line if "Master" isn't printed.

I'd be a little concerned about introducing a dependency on the output format. We regularly have problems crop up when other software (e.g. pcs) looks for specific strings in the output, since we don't always remember that something requires it when we need to update the pacemaker output for some reason.

We actually are in the middle of a project to have all the pacemaker tools support XML output, so external code can use a format that we will try to keep backward-compatible while still allowing us to change text output as needed. Unfortunately crm_resource has not yet been adapted, and you'd want to support older pacemaker versions anyway.

Another concern I have is that crm_resource -W doesn't distinguish between running cleanly and failed. The output would be the same in either case. If the resource is failed and the node needs to be fenced, this will unnecessarily delay recovery.

One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:

<resource id="R" ... failed="true" failure_ignored="false" ... >
    <node name="N" ... />
</resource>

Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.

Side note: the upcoming release of pacemaker will have a slight change in how crm_mon does XML output. --as-xml will still behave the same, but it will be deprecated (however it will remain supported for years to come). A new --output-as=xml option will generate nearly identical XML output, but with a different outermost tag. That is part of the project to make all tools support XML -- that option and outermost tag will be identical for all tools.
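Returning to the crm_mon --as-xml suggestion, a minimal sketch of how an agent might consume that XML (the helper name is illustrative, not the PR's code):

import subprocess
import xml.etree.ElementTree as ET

def clean_nodes_running(resource_id):
    # Parse `crm_mon --as-xml`; unlike `crm_resource -W`, the XML lets
    # us tell a cleanly running instance from a failed one.
    out = subprocess.check_output(["crm_mon", "--as-xml"],
                                  universal_newlines=True)
    root = ET.fromstring(out)
    nodes = set()
    for res in root.iter("resource"):
        if res.get("id") != resource_id:
            continue
        # A non-ignored failure means "do not delay": the target node
        # may legitimately need fencing for recovery.
        if res.get("failed") == "true" and res.get("failure_ignored") != "true":
            return set()
        for node in res.findall("node"):
            nodes.add(node.get("name"))
    return nodes

The agent would then skip the delay unless the result is exactly one node other than the local one, which also covers the multi-node corner case and self-fencing.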

Contributor Author

I think there's a space at the end of the line if "Master" isn't printed.

At that point, the implementation removed the trailing space with strip().

One possibility would be to use crm_mon --as-xml instead, which would avoid both the magic string problem and the success vs failure problem. crm_mon's XML output has been supported "forever". It will generate a bunch of output including something like:

I added a new commit to use crm_mon --as-xml.
Is it necessary and sufficient to detect resource failure by checking the following two attributes?

failed="true" failure_ignored="false"

If even one of the resources has failed, including resources inside a resource group or promotable resource, the delay is skipped.

Another corner case is that a resource can be (wrongly) active on more than one node. In a 2-node cluster, it would be simple to skip the delay here if the resource is active on the local node, even if it is also active on the other node. In a larger cluster the best course of action is not as obvious, but a safe choice would be to skip the delay if the resource is active on multiple nodes.

Regardless of the number of nodes in the cluster, the delay is skipped when the resource is started on multiple nodes.

In the latest commit, the delay occurs only when the specified resource, resource group, or promotable master resource is started on another single node (not a remote node) and has not failed. The delay is also skipped for self-fencing.

"getopt" : "r:",
"longopt" : "resource",
"required" : "1",
"help" : "-r, --resource=[resource-id] ID of the resource that should be running in the ACT node",

I would document that it does not make sense to specify a cloned or bundled resource unless it is promotable and has only a single master instance.

Contributor Author

Ok. I documented it.

"shortdesc" : "Handle the promotable resource. The node on which the master resource is running is considered as ACT.",
"default" : "False",
"order" : 1
}

If you use crm_mon --as-xml, you can detect this automatically. Cloned resources are wrapped in clone tags that include a multi_state true/false attribute that is equivalent to promotable.
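For illustration, the clone wrapper in crm_mon's XML output looks roughly like this (attributes abbreviated):

<clone id="R-clone" multi_state="true" ... >
    <resource id="R" role="Master" ... >
        <node name="N1" ... />
    </resource>
    <resource id="R" role="Slave" ... >
        <node name="N2" ... />
    </resource>
</clone>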

Contributor Author

Fixed to distinguish promotable resources based on clone tags and attributes in the XML output.

"longopt" : "standby-wait",
"required" : "0",
"help" : "-w, --standby-wait=[seconds] Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",
"shortdesc" : "Wait X seconds on SBY node. If a positive number is specified, fencing action of this agent will always succeed after waits.",

The agent will delay but not succeed

Contributor Author

Fixed.

@curvygrin
Contributor Author

Sorry for the late response, and thank you for the many important comments.
I updated the PR and tested it on RHEL 8.0 / Pacemaker 2.0.1.
But I'm not familiar with crm_mon's XML output, so please let me know if you have other comments.

@gao-yan
Member

gao-yan commented Dec 4, 2019

Delaying fencing actions targeting the nodes that are hosting more significant resources or resource instances (promoted), to make them win fencing matches, has indeed been a very interesting topic for a lot of use cases, especially with regard to two-node clusters. Thanks for bringing up the topic, along with an implementation via a special fence agent ...

Actually, I've been thinking about the possibility of achieving this in pacemaker without requiring an additional fence agent or resource agent.

How about we introduce a couple of meta-attributes for resources, for example:

fencing-delay
-- Delay fencing actions targeting the nodes that are hosting this resource for the specified period of time, so that they survive fencing matches.

fencing-delay-promoted
-- Delay fencing actions targeting the nodes that are hosting the promoted instances of this resource for the specified period of time, so that they survive fencing matches.

The fencer could either combine these delays with pcmk_delay_base/max to calculate the eventual delay, or pick the longest delay among them as the actual delay.
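For illustration, the proposal would look something like this in the CIB (fencing-delay is a hypothetical attribute name as proposed above, not an existing Pacemaker feature):

<primitive id="db" class="ocf" provider="heartbeat" type="pgsql">
    <meta_attributes id="db-meta">
        <!-- hypothetical: nodes hosting "db" get a 60s head start
             in any fencing match -->
        <nvpair id="db-meta-fencing-delay" name="fencing-delay" value="60s"/>
    </meta_attributes>
</primitive>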

What do you think?

@kgaillot

kgaillot commented Dec 4, 2019


That's an interesting idea -- I was hesitant about agents getting that involved in pacemaker's internal state, and that would avoid any such concerns.

We already have a "priority" meta-attribute for resources -- maybe it could be involved somehow. Instead of a per-resource fencing-delay, maybe a cluster-wide priority-fencing-delay that would get applied to the node with the highest total resource priority. I'm not sure how to prefer promoted instances, maybe a "promoted-priority" meta-attribute for clones that would get added to the base priority of promoted instances.

The fencer uses the "quick location" flag with the scheduler, so I'm not sure offhand whether it has all the information needed.

@wenningerk
Contributor


I guess this brings up the old issue again that the view fenced has of the cluster state is only updated by a few rules, so it wouldn't know which delay to take, as at the moment it isn't aware of whether a clone is promoted or not.
But of course we could go a route where the scheduler pushes some additional delay into fenced, with fenced being agnostic of how that delay is composed -- either as a parameter of the fence action or asynchronously via a side channel.

@kgaillot

kgaillot commented Dec 5, 2019

But of course we could go a route where the scheduler pushes some additional delay into fenced, with fenced being agnostic of how that delay is composed -- either as a parameter of the fence action or asynchronously via a side channel.

I like the idea of the scheduler calculating the delay and adding it to the graph action. However, that would mean it would only apply to fencing initiated by the cluster (and not e.g. stonith_admin or dlm).

@gao-yan
Member

gao-yan commented Dec 5, 2019

Probably we could somehow let the fencer check the cluster status and calculate the delay only when fencing is actually requested?

@gao-yan
Member

gao-yan commented Dec 5, 2019

We already have a "priority" meta-attribute for resources -- maybe it could be involved somehow. Instead of a per-resource fencing-delay, maybe a cluster-wide priority-fencing-delay that would get applied to the node with the highest total resource priority. I'm not sure how to prefer promoted instances, maybe a "promoted-priority" meta-attribute for clones that would get added to the base priority of promoted instances.

Good thinking. If it just involved the "priority" meta-attribute, that'd be fine. I'm not sure whether it'd become confusing to introduce another "promoted-priority" attribute... A simpler rule of course could be: promoted instances take only a slightly higher priority (+0.000001) than the base priority in the calculation, not higher than any other resource with a genuinely higher priority (>= +1).

@kgaillot

kgaillot commented Dec 5, 2019

Probably we could somehow let the fencer check the cluster status and calculate the delay only when fencing is actually requested?

Up to now we've tried to avoid the overhead. The fencer checks the status at noncritical times, like when the CIB is changed. When fencing is requested, we kind of want that to be as streamlined as possible, both to avoid unnecessary delays and to reduce the chance that something could go wrong. If we want status then, we have to get the current CIB, so in the worst case we have to wait for that to time out before giving up.

But it would make it easier to have dynamic capabilities like rule processing and this.

If we did check status at fence time, we'd have to define what to do when reading the current CIB fails. Do we go with the most recent copy we have, or fail? If we use the most recent copy, do we disable the priority feature to avoid using outdated information?

@kgaillot

kgaillot commented Dec 5, 2019

A simpler rule of course could be: promoted instances take only a slightly higher priority (+0.000001) than the base priority in the calculation, not higher than any other resource with a genuinely higher priority (>= +1).

That makes sense. Scores are integers, so I'd just go with +1 rather than converting back and forth from floating point. The +1 for implicit clone stickiness establishes a precedent. In my experience users tend to use bigger jumps between configured scores anyway (at least 10).

Overall I lean to doing the calculations in the scheduler (and losing the feature for external fencing) rather than making the fencer more complex, but I can see arguments both ways.

@gao-yan
Member

gao-yan commented Dec 6, 2019

If we did check status at fence time, we'd have to define what to do when reading the current CIB fails. Do we go with the most recent copy we have, or fail? If we use the most recent copy, do we disable the priority feature to avoid using outdated information?

It'd be a difficult situation indeed ...

@gao-yan
Member

gao-yan commented Dec 6, 2019

A simpler rule of course could be: promoted instances take only a slightly higher priority (+0.000001) than the base priority in the calculation, not higher than any other resource with a genuinely higher priority (>= +1).

That makes sense. Scores are integers, so I'd just go with +1 rather than converting back and forth from floating point. The +1 for implicit clone stickiness establishes a precedent. In my experience users tend to use bigger jumps between configured scores anyway (at least 10).

Makes sense.

Overall I lean to doing the calculations in the scheduler (and losing the feature for external fencing) rather than making the fencer more complex, but I can see arguments both ways.

Agreed. There probably always have to be trade-offs. I'll start with something on this.

@gao-yan
Member

gao-yan commented Dec 6, 2019

And again, a question: should the fencer combine priority-fencing-delay with pcmk_delay_base/max, or should it just pick the longest among them as the actual delay?

@kgaillot

kgaillot commented Dec 6, 2019

And again, a question: should the fencer combine priority-fencing-delay with pcmk_delay_base/max, or should it just pick the longest among them as the actual delay?

My first reaction was that they should combine, but thinking about possible scenarios, I think priority-fencing-delay should always have precedence.

The idea is that the user could configure priority-fencing-delay, which would always be used for cluster-initiated fencing, and then configure a static and/or random delay as a fallback to be used with external fencing. If the fencer receives a request with a priority delay specified, it would use that, otherwise it would use the normal delay parameters. That means the scheduler should always add it to fencing ops, even if it's 0 (i.e. targeting a lower-priority node), if the option is enabled.

The only scenario I can think of where combining them would be useful is when two nodes have equal priority. Someone could configure a static and/or random delay totalling less than the priority delay, which would only matter in that situation. But since by definition the user doesn't have a preference for which node gets fenced in that case, it should be acceptable to pick one node algorithmically (lowest ID or whatever) to get a +1 in that situation.

A mixed-version cluster might be another scenario where they're useful, but that should be a temporary situation.

I wonder if it would be worthwhile to have some exceptions, e.g. don't use a delay if we're fencing a node in our partition, and/or use the delay only if there's an even number of nodes. But presumably the user wouldn't configure a delay for an odd number of nodes.
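In sketch form, the precedence rule described above (illustrative names, not actual fencer code):

def effective_delay(request):
    # A priority delay attached by the scheduler, even 0, wins;
    # otherwise fall back to the static/random delay parameters.
    if request.priority_delay is not None:
        return request.priority_delay
    return pcmk_delay_base + random_up_to(pcmk_delay_max)  # hypothetical helpers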

gao-yan added a commit to gao-yan/pacemaker that referenced this pull request Mar 17, 2020
This feature addresses the relevant topics and implements the ideas
brought up from:

ClusterLabs/fence-agents#308

This commit adds priority-fencing-delay option (just the option, not the
feature itself).

Enforce specified delay for the fencings that are targeting the lost
nodes with the highest total resource priority in case we don't
have the majority of the nodes in our cluster partition, so that
the more significant nodes potentially win any fencing match,
which is especially meaningful under split-brain of 2-node
cluster. If all the nodes have equal priority, then any
pcmk_delay_base/max configured for the corresponding fencing
resources will be applied. Otherwise as long as it's set, even if
to 0, it takes precedence over any configured pcmk_delay_base/max.
By default, priority fencing delay is disabled.
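As a rough usage sketch of the option this commit message describes (pcs syntax; resource name and values are illustrative):

# Give the nodes hosting "db" a head start in fencing matches:
pcs resource meta db priority=10
# Enable the delay; when set, it takes precedence over any
# pcmk_delay_base/max configured on the fencing resources:
pcs property set priority-fencing-delay=60s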
gao-yan added further commits to gao-yan/pacemaker referencing this pull request on Mar 18, Mar 19, Mar 20, and Apr 20, 2020, with essentially the same commit message; the later revisions add: "A promoted resource instance takes the base priority + 1 on calculation if the base priority is not 0."
@knet-jenkins

knet-jenkins bot commented Jun 12, 2023

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/fence-agents-pipeline/job/PR-308/1/input

6 participants