Remove FailedScheduling event from list of unrecoverable workspace pod events #1280

AObuchow · 2024-06-26T14:51:48Z

Treating the FailedScheduling event as an immediate workspace failure is problematic in many cases. For example, flaky cluster infrastructure can require multiple attempts to schedule a workspace pod. Additionally, the cluster autoscaler can only kick in while a pod remains in the unschedulable state: if we delete the deployment immediately after a pod is determined to be unschedulable, the autoscaler never gets the chance to kick in.

Thus we should remove the FailedScheduling event from the list of unrecoverable workspace pod events. #1279 is required so that users can re-add the FailedScheduling event to that list.

AObuchow · 2024-08-15T18:20:01Z

After further consideration, removing the FailedScheduling event from the hard-coded list of unrecoverable workspace pod events might not be the best approach.

If the FailedScheduling event is ignored, it'll be shown on workspace timeout:

NAME                  DEVWORKSPACE ID             PHASE      INFO
theia-next-high-cpu   workspace656dfe6d86764967   Failed     devworkspace failed to progress past phase 'Starting' for longer than timeout (1m). Reason: Detected unrecoverable event FailedScheduling: 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..

However, if we remove the FailedScheduling event from the hard-coded list of unrecoverable pod events, it will no longer be shown on workspace timeout.

Our goal is to have the FailedScheduling event not cause workspace timeouts by default. This would make it easier to use cluster autoscaling in Che, and prevent workspaces from failing immediately if there are transient cluster issues.

However, we would still like it to be possible to catch the FailedScheduling event (hence #1279), and to let users know if their workspace timed out because of it.

Thus a potential alternative approach is to include the FailedScheduling event in the ignoredUnrecoverableEvents of the DevWorkspaceOperatorConfig (DWOC) by default. This would probably be accomplished through kubebuilder annotations as well as the internal default DWOC.

However, this alternative approach might not work, because we need to ensure users can remove the FailedScheduling event from the default ignoredUnrecoverableEvents list if they want to. It might be difficult or impossible to differentiate between the list being emptied by the user (no unrecoverable events should be ignored) and the list not being configured at all (the default, FailedScheduling, should be ignored).

AObuchow added this to Eclipse Che Team B Backlog Jun 26, 2024

dkwon17 assigned AObuchow Jun 26, 2024

dkwon17 moved this to 📅 Planned for this Sprint in Eclipse Che Team B Backlog Jul 17, 2024

mkuznyetsov self-assigned this Aug 13, 2024

mkuznyetsov mentioned this issue Aug 14, 2024

feat: remove FailedScheduling event from list of unrecoverable worksp… #1306

Closed

3 tasks

mkuznyetsov unassigned AObuchow Aug 15, 2024

mkuznyetsov moved this from 📅 Planned for this Sprint to 🚧 In Progress in Eclipse Che Team B Backlog Aug 15, 2024

mkuznyetsov mentioned this issue Aug 22, 2024

feat: set defaults for ignoredUnrecoverableEvents operator config #1310

Closed

3 tasks

This was referenced Sep 9, 2024

feat: set default ignoredUnrecoverableEvents eclipse-che/che-operator#1897

Merged

feat: update autoscaler documentation eclipse-che/che-docs#2789

Merged

mkuznyetsov moved this from 🚧 In Progress to Ready for Review in Eclipse Che Team B Backlog Sep 18, 2024

mkuznyetsov closed this as completed Sep 20, 2024

github-project-automation bot moved this from Ready for Review to ✅ Done in Eclipse Che Team B Backlog Sep 20, 2024
