fence_scsi_check is leaving zombie processes #535

Open
calippus opened this issue Apr 6, 2023 · 5 comments

@calippus

calippus commented Apr 6, 2023

I have a cluster with watchdog enabled. I have realized that fence_scsi_check is continuously creating zombie processes on both nodes.

[root@d-clus1 ~]# top -b1 -n1 | grep Z
3224472 root      -2   0       0      0      0 Z   0.0   0.0   0:00.12 fence_scsi_chec

I have tried to increase the "test-timeout" in /etc/watchdog.conf but it didn't solve the problem.

Honestly, it could be a problem with watchdog itself, I am not sure.
Here is the output of watchdog:

[root@d-clus1 ~]# /usr/sbin/watchdog -v
watchdog: Integer 'test-timeout' found = 590
watchdog: Integer 'interval' found = 10
watchdog: String 'log-dir' found as '/var/log/watchdog'
watchdog: Variable 'realtime' found as 'yes' = 1
watchdog: Integer 'priority' found = 1
watchdog: adding /etc/watchdog.d/fence_scsi_check_hardreboot to list of auto-repair binaries
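
A quick way to confirm which process is leaving the zombie un-reaped (the parent is the one that has to wait() for it) is to look at the zombie's PPID. Using the PID from the top output above, something like the following with standard ps options; this only shows who should be reaping the zombie, not why it has not happened yet:

# 3224472 is the zombie PID from the top output above; the second command
# shows the name of the parent process that has not reaped it yet.
ps -o ppid= -p 3224472
ps -o pid,comm -p "$(ps -o ppid= -p 3224472)"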
@oalbrigt
Collaborator

Is this on physical servers, VMs or cloud?

Can you provide some more info, e.g. distro, distro version, and watchdog type (might be logged in /var/log/watchdog and/or /var/log/messages)?

@calippus
Author

It is currently on a VM (VMware).

Distro: Rocky 8.6
watchdog package: watchdog-5.15-2.el8.x86_64

/etc/watchdog.conf:

#ping			= 172.31.14.1
#ping			= 172.26.1.255
#interface		= eth0
#file			= /var/log/messages
#change			= 1407

# Uncomment to enable test. Setting one of these values to '0' disables it.
# These values will hopefully never reboot your machine during normal use
# (if your machine is really hung, the loadavg will go much higher than 25)
#max-load-1		= 24
#max-load-5		= 18
#max-load-15		= 12

# Note that this is the number of pages!
# To get the real size, check how large the pagesize is on your machine.
#min-memory		= 1
#allocatable-memory	= 1

# With enforcing SELinux policy please use the /usr/libexec/watchdog/scripts/
# or /etc/watchdog.d/ for your test-binary and repair-binary configuration.
#repair-binary		= /usr/sbin/repair
#repair-timeout		= 60
#test-binary		= /etc/watchdog.d/fence_scsi_check_hardreboot
test-timeout		= 590

# The retry-timeout and repair limit are used to handle errors in a more robust
# manner. Errors must persist for longer than retry-timeout to action a repair
# or reboot, and if repair-maximum attempts are made without the test passing a
# reboot is initiated anyway.
#retry-timeout		= 60
#repair-maximum		= 1

#watchdog-device	= /dev/watchdog

# Defaults compiled into the binary
#temperature-sensor	=
#max-temperature	= 90

# Defaults compiled into the binary
#admin			= root
interval		= 10
#logtick                = 1
log-dir		= /var/log/watchdog

# This greatly decreases the chance that watchdog won't be scheduled before
# your machine is really loaded
realtime		= yes
priority		= 1

# When using custom service pid check with custom service
# systemd unit file please be aware the "Requires="
# does dependent service deactivation.
# Using "Before=watchdog.service" or "Before=watchdog-ping.service"
# in the custom service unit file may be the desired operation instead.
# See man 5 systemd.unit for more details.
#
# Check if rsyslogd is still running by enabling the following line
#pidfile		= /var/run/rsyslogd.pid

@wenningerk
Contributor

wenningerk commented Apr 13, 2023

For unresponsive block devices (e.g. you can reproduce this by suspending the device via dm) you usually get a zombie when trying to access it, and I think there is no real way around that.
Usually we handle this by doing the block-device access in a sub-process, which then becomes the zombie while the main process detects the timeout.
When it is just about tickling a watchdog if everything is OK, we wouldn't necessarily need these sub-processes, as hanging might be fine for that purpose as well.
For this particular purpose I don't see zombies in case of an issue with the device as that big of a problem:
if the device comes back in time they should disappear, and if the device doesn't come back we want the watchdog to trigger anyway.
If your watchdog doesn't trigger properly you might get a number of zombies instead of a reboot.
There isn't enough info in the thread above for me to know exactly what happens, but I hope this is somewhat helpful for understanding what might be going on.
If we're checking the status of a block device and the reaction to a non-responsive device isn't expected to be a reboot, we'd like to prevent zombies from piling up. For those cases, checking for the presence of a previous check process instead of firing up new ones is one of the viable solutions. But I don't think this is needed here, and the check script should rather be kept as simple as possible.
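
As a rough illustration of the two ideas mentioned above (this is not the shipped fence_scsi_check script), a minimal shell sketch might look like the following; /dev/sdX, the lock path, and the sg_persist call are placeholders:

#!/bin/sh
# Hypothetical sketch only: skip the check if a previous invocation is still
# stuck on the device, and bound the device access itself with a timeout.
LOCK=/var/run/fence_scsi_check.lock
DEV=/dev/sdX

exec 9>"$LOCK"
if ! flock -n 9; then
    # An earlier check is still hanging on the device; don't start another one.
    exit 1
fi

# timeout(1) returns after 10s even if the child is stuck in uninterruptible
# I/O; the stuck child may linger as described above, but this script itself
# comes back in time for the watchdog daemon to act on the result.
timeout 10 sg_persist --no-inquiry --in --read-keys --device="$DEV" >/dev/null 2>&1
exit $?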

@calippus
Author

Thanks a lot for the explanation. Actually, the fencing and rebooting functionality is working as expected, as far as I could test.
I don't know exactly what other information I can provide.

I was checking the watchdog documentation and found this part:

watchdog will try periodically to fork itself to see whether the process table is full. This process will leave a zombie process until watchdog wakes up again and catches it; this is harmless, don't worry about it.

So should I suppose that this is a problem only with the watchdog process?
It says "don't worry", but the process ID number is increasing like crazy, so I worry.
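
One way to tell a single, recycled zombie apart from zombies actually piling up is to watch the count rather than the PID; a minimal loop using standard ps options (the 30-second period is arbitrary):

# Print the zombie count every 30 seconds; a steady 0 or 1 matches the
# documented fork test being reaped each cycle, a growing number does not.
while true; do
    date
    ps -eo stat= | grep -c '^Z'
    sleep 30
done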

@wenningerk
Contributor

So should I suppose that this is a problem only with the watchdog process? It says "don't worry", but the process ID number is increasing like crazy, so I worry.

I wasn't aware of this mechanism in the watchdog daemon - that, btw, seems totally unrelated to the fence-scsi script being called.
The explanation you quoted sounds reasonable, and it sounds like a good idea to do that as one of the measures to detect a system that isn't working properly anymore and thus should be rebooted.
As long as the number of zombies isn't increasing - initially it wasn't clear how many zombies you were seeing - it is probably nothing to worry about.
The process ID of this zombie is expected to increase, given the explanation of how it is created; after reaching some maximum it should wrap around to low numbers again.
If you don't want it to increase that quickly, you can probably reduce the 'base-clock' of the watchdog daemon - I don't know off the top of my head how exactly, but it just has to be quick enough to reliably kick the hardware watchdog.
Relaxing the hardware-watchdog timeout would slow down how quickly a hanging node actually reboots. But in the case of scsi-fencing that doesn't even affect how quickly resources in a cluster can be recovered on a different node; only if you e.g. need them to fail back, because the other node is failing as well or is overloaded, might this timeout really be relevant.
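
If the goal is just for the fork-test zombie (and its PID) to turn over less often, one knob that already appears in the configuration above is `interval`, the number of seconds the daemon sleeps between check cycles. A purely illustrative excerpt, assuming the fork test runs once per cycle and keeping the value well below the hardware-watchdog timeout so the device is still kicked in time:

# /etc/watchdog.conf (illustrative values only)
interval        = 20    # wake up and run the checks every 20s instead of 10s
test-timeout    = 590   # unchanged from the config above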
