Description
We have developed an on-demand profiling daemon for our large-scale cluster by integrating dynolog with kineto. While this tool greatly benefits our users in diagnosing their training jobs, we have encountered a usability issue during interaction.
Our users initiate profiling requests when their training jobs are not actually running. Both dynolog CLI and kineto currently have limited capacity to handle such scenarios gracefully, leading to confusion and unnecessary waiting times for our users.
Feature Request
We are requesting a feature that implements a process status check before attempting to profile. Specifically, the profiling tool should:
Verify if the training task's process is active.
If the process is not running or cannot be found, return a clear and specific error message or code to the user or CLI tool.
Prevent the profiling request from proceeding, thereby saving resources and user time.
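The three steps above could be sketched roughly as follows. This is only an illustration in Python, not existing dynolog or kineto code: the function names `process_is_running` and `check_before_profiling` and the `(code, message)` return shape are hypothetical, and it assumes a POSIX host where the daemon already knows the PID of the training process.

```python
import os


def process_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 performs existence and permission checks without
        # actually delivering a signal to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def check_before_profiling(pid: int) -> tuple[int, str]:
    """Hypothetical pre-flight check: (0, "ok") or a non-zero code and message."""
    if not process_is_running(pid):
        return 1, f"process {pid} is not running; profiling request rejected"
    return 0, "ok"
```

A check like this only shows that the process exists. It does not show that a training loop is currently executing inside it, and a PID can in principle be reused by an unrelated process.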
This functionality will not only enhance user experience but also prevent the profiling daemon from engaging in futile profiling attempts, thereby improving the overall efficiency of our on-demand profiling service.
Thank you for considering this feature addition. Any guidance on how to implement this check, or on existing techniques we could leverage, would be greatly appreciated.
The on-demand workflow only kicks in if there is a backend Kineto process able to field the request. So as long as this backend is instantiated, it will begin the profiling loop. It is unclear to me how this backend would be aware of where the program is in its execution to determine if training is happening or not.
One idea is to have an API on the python side to toggle a flag in Kineto to gracefully exit an on-demand process early.
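To make the idea concrete, here is a pure-Python sketch of the flag pattern. Nothing in it is a Kineto or PyTorch API: `OnDemandGate`, `set_training_active`, and `handle_request` are hypothetical names, and a real version would need the flag to live on the Kineto side.

```python
import threading
from typing import Callable


class OnDemandGate:
    """Hypothetical gate that rejects on-demand requests while training is idle."""

    def __init__(self) -> None:
        # threading.Event gives a thread-safe boolean flag.
        self._training_active = threading.Event()

    def set_training_active(self, active: bool) -> None:
        """Called from the training script when the loop starts or stops."""
        if active:
            self._training_active.set()
        else:
            self._training_active.clear()

    def handle_request(self, run_profile: Callable[[], str]) -> str:
        """Run the profile only if training is active; otherwise exit early."""
        if not self._training_active.is_set():
            return "rejected: training is not active"
        return run_profile()
```

The training script would call `set_training_active(True)` on entering its loop and `set_training_active(False)` on leaving it, so a request that arrives outside the loop returns immediately instead of waiting.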