Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Signal propagation by the ros2 run verb #895

Open
meyerj opened this issue Apr 10, 2024 · 9 comments
Open

Signal propagation by the ros2 run verb #895

meyerj opened this issue Apr 10, 2024 · 9 comments
Assignees
Labels
help wanted Extra attention is needed

Comments

@meyerj
Copy link

meyerj commented Apr 10, 2024

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04 Jammy
  • Installation type:
  • Version or commit hash:
    $ dpkg -l | grep ros2cli
    ii  ros-humble-ros2cli                                0.18.9-1jammy.20240217.070501               amd64        Framework for ROS 2 command line tools.
    ii  ros-humble-ros2cli-common-extensions              0.1.1-4jammy.20240217.081520                amd64        Meta package for ros2cli common extensions
    
  • DDS implementation:
    • Fast-RTPS and Cyclone
  • Client library (if applicable):
    • N/A

Steps to reproduce issue

#!/bin/bash
# Launch a child process (e.g. in a shell script) with ros2 run and remember its PID.
ros2 run demo_nodes_cpp talker &
PID=$!

# Do something else, start more processes, or just wait a bit.
sleep 5

# Try to kill the background node and wait.
kill $PID        # same as kill -TERM $PID
wait

Expected behavior

The talker process should cleanly terminate, as if the signal was sent directly to the child process. Some processes may have to perform some cleanup work, like close files, log something etc.

If I do not use ros2 run and launch the node executable directly, everything does work as expected:

#!/bin/bash
# Launch a child process (e.g. in a shell script) by invoking the executable directly and remember its PID.
/opt/ros/humble/lib/demo_nodes_cpp/talker &
PID=$!

# Do something else, start more processes, or just wait a bit.
sleep 5

# Try to kill the background node and wait.
kill $PID        # same as kill -TERM $PID
wait

With this variant, the talker process terminates cleanly, then wait returns.

The shell script is only an exemplary example - there are other cases where a supervisor process monitors its children and expects them to terminate after sending a TERM signal, e.g. systemd or the Docker daemon. Other signals should also be propagated and may be needed by some processes, for example SIGPIPE or SIGWINCH or user signals SIGUSR1 and SIGUSR1.

Actual behavior

The SIGTERM signal is handled by the ros2 run Python process in the default way, and it terminates and detaches from its child talker, but the signal is not propagated and the node keeps running and talking. I did not expect that ros2 run does not replace itself by the launched process, like rosrun in ROS 1 did, apparently only for the sake of logging when the child process terminated and its exit status.

It would be possible to work around by sending the signal to all child processes of $PID (e.g. pkill -P $PID), or other techniques to send the signal to the actual process instead of the direct child process. Or use ROS 2 launch.

Pressing Ctrl-C in the same terminal before ros2 run process was terminated works, because the shell sends the SIGINT signal to all processes in its process group then, including the talker. After the talker was detached and the script finished, Ctrl-C has no effect anymore on the running node process.

A similar (undesired) effect when running in Docker:

$ docker run --rm -d osrf/ros:humble-desktop ros2 run demo_nodes_cpp talker
4f6054bf2eea3a32d9883d2717286ad4821411e96f19819923d051bfeaf8466d
$ docker stop 4f6054bf2eea3a32d9883d2717286ad4821411e96f19819923d051bfeaf8466d
[...]
# The container processes are forcefully killed after 10 seconds only by default.
# The initial SIGTERM signal sent by the docker daemon was not propagated to talker by ros2 run.

While this works fine and docker stop returns after 1-2 seconds:

$ docker run --rm -d osrf/ros:humble-desktop /opt/ros/humble/lib/demo_nodes_cpp/talker
25c535f175c6d976897e1e12507af9c94bddcb9fbca732d0dcaf65e31edac1cb
$ docker stop 25c535f175c6d976897e1e12507af9c94bddcb9fbca732d0dcaf65e31edac1cb
25c535f175c6d976897e1e12507af9c94bddcb9fbca732d0dcaf65e31edac1cb
$ 

Feature request

Feature description

  • ros2 run should propagate all signals that can be handled to its child process, or at least SIGINT and SIGTERM.
  • ros2 run process should never terminate and leave its child running detached.
  • In any case the two shell scripts above are expected to behave in exactly the same way.

Implementation considerations

I am not sure how other platforms, like Windows, are dealing with this and whether replacing the process with its child would work there.

@clalancette
Copy link
Contributor

I am not sure how other platforms, like Windows, are dealing with this and whether replacing the process with its child would work there.

Yeah, I'm guessing this is the reason for the difference in behavior. Windows does not have execve(). There may be ways to achieve what you want in a cross-platform way, but signals are tricky in general, and even trickier on Windows. Someone would have to do a deep dive here to figure it out.

@clalancette clalancette added the help wanted Extra attention is needed label Apr 10, 2024
@fujitatomoya
Copy link
Collaborator

A similar (undesired) effect when running in Docker:

i am not sure what you are expecting here.
the following that i checked on ubuntu20.04, works as expected? removing namespace should delete anything inside of that.

> docker run --rm -d osrf/ros:humble-desktop ros2 run demo_nodes_cpp talker
aeceb9dc2f4cee7d1ddebbb1d2241c3602a662e699d43a20d481bcfecd2ac51f

> ps -ef | grep humble
root     3516555 3516533  1 16:03 ?        00:00:00 /usr/bin/python3 /opt/ros/humble/bin/ros2 run demo_nodes_cpp talker
root     3516613 3516555  0 16:03 ?        00:00:00 /opt/ros/humble/lib/demo_nodes_cpp/talker
tomoyaf+ 3516708    7546  0 16:03 pts/0    00:00:00 grep --color humble

> docker stop aeceb9dc2f4c
aeceb9dc2f4c

> ps -ef | grep humble
tomoyaf+ 3517055    7546  0 16:04 pts/0    00:00:00 grep --color humble

@fujitatomoya
Copy link
Collaborator

How about adding the signal handler in ros2run and then check if the child process is still under communication. if child process is still alive and running, we can send the same signal received on this signal handler to the child process. and exit from communication, to check the return code from the child process? the reason is i think basically executable decides what needs to be done with the signal in the executable's signal handlers, so this does not break that behavior.

@meyerj
Copy link
Author

meyerj commented Apr 21, 2024

A similar (undesired) effect when running in Docker:

i am not sure what you are expecting here. the following that i checked on ubuntu20.04, works as expected? removing namespace should delete anything inside of that.

> docker run --rm -d osrf/ros:humble-desktop ros2 run demo_nodes_cpp talker
aeceb9dc2f4cee7d1ddebbb1d2241c3602a662e699d43a20d481bcfecd2ac51f

> ps -ef | grep humble
root     3516555 3516533  1 16:03 ?        00:00:00 /usr/bin/python3 /opt/ros/humble/bin/ros2 run demo_nodes_cpp talker
root     3516613 3516555  0 16:03 ?        00:00:00 /opt/ros/humble/lib/demo_nodes_cpp/talker
tomoyaf+ 3516708    7546  0 16:03 pts/0    00:00:00 grep --color humble

> docker stop aeceb9dc2f4c
aeceb9dc2f4c

> ps -ef | grep humble
tomoyaf+ 3517055    7546  0 16:04 pts/0    00:00:00 grep --color humble

Yes, the talker node is ultimately killed when the whole Docker container is stopped, but it never gets a chance to handle the SIGTERM signal and is only forcefully killed by the Docker daemon with SIGKILL after 10 seconds by default (--stop-timeout). For some processes, e.g. ros2 bag record that is a problem, and they cannot close the open file properly and write out the metadata.

How about adding the signal handler in ros2run and then check if the child process is still under communication. if child process is still alive and running, we can send the same signal received on this signal handler to the child process. and exit from communication, to check the return code from the child process? the reason is i think basically executable decides what needs to be done with the signal in the executable's signal handlers, so this does not break that behavior.

Exactly, that is also what I meant by the first item in the "Feature Request" section of the issue description. But implementation-wise it would be simpler to replace the ros2 Python process by the new process like in ROS 1 on platforms that support that. Then the shell or whatever other parent process (systemd, docker, ...) directly sees the launched process as its child, without the need to communicate or forward signals explicitly.

@fujitatomoya
Copy link
Collaborator

@meyerj thanks for providing the good information. i think i can come up with patch fix for this. (in addition to the patch fix, we should probably add this information how to use docker container in ros2 documentation somewhere in https://docs.ros.org/en/rolling/How-To-Guides/Run-2-nodes-in-single-or-separate-docker-containers.html)

ros2 bag record that is a problem, and they cannot close the open file properly and write out the metadata.

besides this, we will lose writer cache data... this is bad...

@fujitatomoya
Copy link
Collaborator

@meyerj can you try #899? that should work for all concerned cases above, including rosbag2 cases. (so i think that is okay to skip documentation update.)

@fujitatomoya fujitatomoya self-assigned this Apr 22, 2024
@meyerj
Copy link
Author

meyerj commented Apr 23, 2024

@meyerj can you try #899? that should work for all concerned cases above, including rosbag2 cases. (so i think that is okay to skip documentation update.)

I am not sure whether ros2 bag record was actually affected because, if I am not mistaken, it uses the extension points of ros2cli and does not fork a new process like ros2 run does. It may only indirectly be affected, for example if ros2 run ... launches a shell script that in turn calls

exec ros2 bag record ...

with some predefined and commonly used options specific to the application, like we sometimes do at Intermodalics. Note that if you would forget the exec prefix here, or make sure that signals are propagated in another way, the ros2 bag record process would detach and keep running again, even if ros2 run properly signals the intermediate shell process now on SIGTERM.

Thanks for the quick solution, code-wise it looks good and should resolve the issue! I quickly built the patched version in an overlay workspace with ROS 2 Humble binaries installed from http://packages.ros.org/ros2/ubuntu, and the example snippet from the description behaves as expected now.

@fujitatomoya
Copy link
Collaborator

I am not sure whether ros2 bag record was actually affected because, if I am not mistaken, it uses the extension points of ros2cli and does not fork a new process like ros2 run does. It may only indirectly be affected, for example if ros2 run ... launches a shell script that in turn calls

Ah sorry, my bad... you are correct. my PR just fixes ros2run package, not other extensions 😅

it does not fork the process, besides it handles signals https://github.com/ros2/rosbag2/blob/a111ee561fb8cb54a56ae91eb5b29a7e347967f2/rosbag2_py/src/rosbag2_py/_transport.cpp#L175-L180

and the example snippet from the description behaves as expected now.

appreciate that!

@ciandonovan
Copy link

@meyerj

Feature request

Feature description

* `ros2 run` should propagate all signals that can be handled to its child process, or at least SIGINT and SIGTERM.
  
  * Alternatively, it could use [`os.execve()`](https://docs.python.org/3/library/os.html#os.execve) and its variants, to **replace** the current process by the child process, like [`rosrun` in ROS 1 did](https://github.com/ros/ros/blob/136e18e9dfda748a87b5ed197a36587f612c6e46/tools/rosbash/scripts/rosrun#L150).

* `ros2 run` process should never terminate and leave its child running detached.

* In any case the two shell scripts above are expected to behave in exactly the same way.

Implementation considerations

I am not sure how other platforms, like Windows, are dealing with this and whether replacing the process with its child would work there.

Since we already have ros2 launch as a supervisor process for running and monitoring nodes, would be nice to see ros2 run take a different approach with execve(). Certain toolchains struggle or get confused with debugging and profiling apps that fork, I'm thinking particularly of Qt with QtCreator. ros2 run being a program just to take arguments, setup, but ultimately just run a single executable would be really complimentary to ros2 launch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

4 participants