【Feature Request】Add Process Status Check Before Profiling to Handle Non-Running Training Tasks #943

yingjun8 · 2024-06-03T07:38:47Z

Description
We have developed an on-demand profiling daemon for our large-scale cluster by integrating dynolog with kineto. While this tool greatly benefits our users in diagnosing their training jobs, we have encountered a usability issue during interaction.

Our users sometimes initiate profiling requests when their training jobs are not actually running. Both the dynolog CLI and kineto currently have limited ability to handle this case gracefully, which leads to confusion and unnecessary waiting for our users.

Feature Request
We are requesting a feature that implements a process status check before attempting to profile. Specifically, the profiling tool should:

1. Verify that the training task's process is active.
2. If the process is not running or cannot be found, return a clear and specific error message or code to the user or CLI tool.
3. Prevent the profiling request from proceeding, thereby saving resources and user time.
This functionality will not only enhance user experience but also prevent the profiling daemon from engaging in futile profiling attempts, thereby improving the overall efficiency of our on-demand profiling service.
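
To make the three steps above concrete, here is a minimal sketch of such a pre-check, assuming a POSIX host and assuming the caller already knows the PID of the training process. The names `ProcStatus`, `check_process`, and `request_profile` and the exit code `2` are hypothetical and are not part of dynolog or kineto. The check relies on the POSIX convention that signal `0` delivers nothing and only tests whether the PID exists.

```python
import os
from enum import Enum


class ProcStatus(Enum):
    RUNNING = 0
    NOT_FOUND = 1
    NO_PERMISSION = 2


def check_process(pid: int) -> ProcStatus:
    """Probe a PID with signal 0 (POSIX only): no signal is delivered."""
    if pid <= 0:
        # Non-positive PIDs address process groups, not a single process.
        return ProcStatus.NOT_FOUND
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return ProcStatus.NOT_FOUND
    except PermissionError:
        # The process exists but is owned by another user.
        return ProcStatus.NO_PERMISSION
    return ProcStatus.RUNNING


def request_profile(pid: int) -> int:
    """Hypothetical wrapper: reject the request early if the PID is gone."""
    status = check_process(pid)
    if status is ProcStatus.NOT_FOUND:
        print(f"error: no process with pid {pid}; profiling request rejected")
        return 2  # hypothetical error code
    # Assumption: the real profiling trigger would be invoked here.
    return 0
```

This only confirms that a process with that PID exists. It says nothing about whether the process is currently in its training loop, and a zombie process would still be reported as present.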

Thank you for considering this feature addition. Any guidance or suggestions on how to implement this check or if there are already existing techniques we could leverage would be greatly appreciated.

sraikund16 · 2024-10-17T17:02:37Z

The on-demand workflow only kicks in if there is a backend Kineto process able to field the request. So as long as this is instantiated, it will begin the profiling loop. It is unclear to me how this backend would be aware of where the program is in its execution to determine if training is happening or not.

One idea is to have an API on the python side to toggle a flag in Kineto to gracefully exit an on-demand process early.
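
A rough sketch of what that flag could look like from the Python side, purely as an illustration: `OnDemandGate`, `set_training_active`, and `handle_request` are hypothetical names and do not exist in Kineto or PyTorch today. The training script would toggle the flag around its training loop, and the on-demand handler would consult it before starting a trace.

```python
import threading
from typing import Callable


class OnDemandGate:
    """Hypothetical flag toggled by the training script; not a Kineto API."""

    def __init__(self) -> None:
        self._active = threading.Event()

    def set_training_active(self, active: bool) -> None:
        # Called by the training script when it enters or leaves its loop.
        if active:
            self._active.set()
        else:
            self._active.clear()

    def handle_request(self, start_trace: Callable[[], None]) -> str:
        """Run start_trace() only while training is active, else exit early."""
        if not self._active.is_set():
            return "rejected: training is not active"
        start_trace()
        return "accepted"
```

The returned string stands in for whatever status the backend would report back to the CLI; the actual mechanism for passing it through is left open here.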
