
【Feature Request】Add Process Status Check Before Profiling to Handle Non-Running Training Tasks #943

Open
yingjun8 opened this issue Jun 3, 2024 · 1 comment

Comments

@yingjun8

yingjun8 commented Jun 3, 2024

Description
We have developed an on-demand profiling daemon for our large-scale cluster by integrating dynolog with kineto. While this tool greatly benefits our users in diagnosing their training jobs, we have encountered a usability issue during interaction.

Our users sometimes initiate profiling requests when their training jobs are not actually running. Neither the dynolog CLI nor kineto currently handles this case gracefully, which leads to confusion and unnecessary waiting for our users.

Feature Request
We are requesting a feature that implements a process status check before attempting to profile. Specifically, the profiling tool should:

1. Verify that the training task's process is active.
2. If the process is not running or cannot be found, return a clear, specific error message or error code to the user or CLI tool.
3. Prevent the profiling request from proceeding, thereby saving resources and user time.

This functionality will not only enhance user experience but also prevent the profiling daemon from engaging in futile profiling attempts, thereby improving the overall efficiency of our on-demand profiling service.
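
To make the request concrete, here is a minimal sketch of the kind of pre-flight check we have in mind, written as a standalone Python helper. It is not part of dynolog or kineto: the names `process_is_running` and `check_before_profiling`, the exit codes, and the message text are all hypothetical. It assumes a POSIX host and relies on `os.kill(pid, 0)`, which performs the existence and permission check without delivering a signal.

```python
import os


def process_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        # Signal 0 delivers nothing; the kernel only checks existence/permission.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def check_before_profiling(pids):
    """Hypothetical pre-flight step: return (exit_code, message) for the CLI."""
    missing = [pid for pid in pids if not process_is_running(pid)]
    if missing:
        return 1, f"error: no running process for PID(s) {missing}; profiling request not sent"
    return 0, "ok"


if __name__ == "__main__":
    code, message = check_before_profiling([os.getpid()])
    print(code, message)
```

Two caveats with this approach: a zombie process still passes the check, and a live process is not necessarily in its training loop, so this only rules out the "job is not running at all" case.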

Thank you for considering this feature addition. Any guidance or suggestions on how to implement this check or if there are already existing techniques we could leverage would be greatly appreciated.

@sraikund16
Contributor

The on-demand workflow only kicks in if there is a backend Kineto process able to field the request. As long as that backend is instantiated, it will begin the profiling loop. It is unclear to me how the backend would know where the program is in its execution, and therefore whether training is happening or not.

One idea is to add an API on the Python side that toggles a flag in Kineto, so that an on-demand profiling request can exit gracefully and early.
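
A rough sketch of what such a flag could look like, purely to illustrate the idea. Nothing here exists in Kineto or PyTorch today: `ProfilingGate` and `on_demand_profile_loop` are hypothetical names, and the loop is a stand-in for the real profiling loop rather than a description of how Kineto implements it.

```python
import threading
import time


class ProfilingGate:
    """Hypothetical Python-side flag; not an existing Kineto or PyTorch API."""

    def __init__(self):
        self._cancel = threading.Event()

    def cancel(self):
        # Called by the training script when no training is in progress.
        self._cancel.set()

    def reset(self):
        # Called when training (re)starts and profiling makes sense again.
        self._cancel.clear()

    def cancelled(self):
        return self._cancel.is_set()


def on_demand_profile_loop(gate, iterations, step_seconds=0.0):
    """Stand-in for an on-demand profiling loop.

    Returns (status, steps_done) so the caller can report an early exit.
    """
    for step in range(iterations):
        if gate.cancelled():
            return "cancelled", step
        time.sleep(step_seconds)  # placeholder for tracing one iteration
    return "completed", iterations


if __name__ == "__main__":
    gate = ProfilingGate()
    gate.cancel()  # e.g. the job is idle or has finished training
    print(on_demand_profile_loop(gate, iterations=5))
```

The point of returning a status is that the requester (for example the dynolog CLI) could then surface "cancelled" to the user instead of waiting for a trace that will never be produced.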
