
[SPARK-50648][CORE] when the job is cancelled during shuffle retry in parent stage, might leave behind zombie running tasks #49270

Status: Open. Wants to merge 12 commits into base: master.

Conversation

@yabola (Contributor) commented Dec 23, 2024

What changes were proposed in this pull request?

This is a long-standing problem in Spark. See the following section for the scenario in which the problem occurs.

When a job is cancelled, some of its tasks may keep running.
The reason is that when `DAGScheduler#handleTaskCompletion` encounters a `FetchFailed`, it calls `markStageAsFinished` to remove the stage from `DAGScheduler#runningStages` (see `markStageAsFinished(failedStage, errorMessage = Some(failureMessage))`) without calling `killAllTaskAttempts`.
But `DAGScheduler#cancelRunningIndependentStages` only looks at `runningStages`, so cancellation leaves behind zombie shuffle tasks that keep occupying cluster resources.

Why are the changes needed?

Assume a job consists of stage1 -> stage2. When a FetchFailed occurs during stage2, both stage1 and stage2 are resubmitted. (stage2 may still have some tasks running even after it is resubmitted; this is expected, because those tasks may eventually succeed and avoid a retry.)

But if the SQL is cancelled while stage1-retry is executing, the tasks in stage1 and stage1-retry can all be killed, while the tasks that were previously running in stage2 keep running and cannot be killed. These zombie tasks can greatly affect cluster stability and occupy resources.
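The gap described above can be illustrated with a small Python model (this is a toy sketch, not Spark code; all names such as `running_stages` and `on_fetch_failed` are illustrative): after a FetchFailed removes a stage from the running set without killing its tasks, a cancel that only consults the running set misses those tasks.

```python
# Toy model (not Spark code) of the cancellation gap described above.
running_stages = set()   # stages the scheduler still tracks as running
running_tasks = {}       # stage -> task ids actually occupying executors

def submit(stage, tasks):
    running_stages.add(stage)
    running_tasks[stage] = set(tasks)

def on_fetch_failed(stage):
    # Models markStageAsFinished: the stage leaves the running set,
    # but its in-flight tasks are NOT killed.
    running_stages.discard(stage)

def cancel_job():
    # Models cancelRunningIndependentStages: only the running set is consulted.
    killed = []
    for stage in list(running_stages):
        killed.extend(sorted(running_tasks.pop(stage)))
        running_stages.discard(stage)
    return killed

submit("stage1", ["t1"])
submit("stage2", ["t2a", "t2b"])
on_fetch_failed("stage2")   # stage2 will retry; its old tasks keep running

killed = cancel_job()
zombies = {s: t for s, t in running_tasks.items() if t}
print(killed)    # only stage1's task is killed
print(zombies)   # stage2's tasks survive the cancel: the "zombie" tasks
```

The fix discussed in this PR widens the cancellation check so that stages like `stage2` are also killed.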

Does this PR introduce any user-facing change?

No

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Dec 23, 2024
@yabola yabola changed the title [SPARK-50648] [core] when the job is cancelled during shuffle retry in parent stage, might leave behind zombie tasks [SPARK-50648] [core] when the job is cancelled during shuffle retry in parent stage, might leave behind zombie running tasks Dec 23, 2024
@HyukjinKwon HyukjinKwon changed the title [SPARK-50648] [core] when the job is cancelled during shuffle retry in parent stage, might leave behind zombie running tasks [SPARK-50648][CORE] when the job is cancelled during shuffle retry in parent stage, might leave behind zombie running tasks Dec 24, 2024
@wangyum (Member) commented Dec 24, 2024

cc @cloud-fan

@@ -2937,7 +2937,9 @@ private[spark] class DAGScheduler(
     } else {
       // This stage is only used by the job, so finish the stage if it is running.
       val stage = stageIdToStage(stageId)
-      if (runningStages.contains(stage)) {
+      val isRunningStage = runningStages.contains(stage) ||
+        (waitingStages.contains(stage) && taskScheduler.hasRunningTasks(stageId))
Contributor
what if we just kill all waiting stages? Does taskScheduler.killAllTaskAttempts handle it well?

Contributor
and can we add a special flag to indicate the waiting stages that are submitted due to retry?

@yabola (Contributor Author) commented Dec 24, 2024
what if we just kill all waiting stages? Does taskScheduler.killAllTaskAttempts handle it well?

I tested it: if the normally generated waiting stages call killAllTaskAttempts, the stage status is displayed as FAILED (it was SKIPPED before); killAllTaskAttempts itself does not go wrong.
Always killing waiting stages seems to be the safer approach (the tasks really shouldn't run anymore), but it may generate unnecessary stageFailed events compared with before.

and can we add a special flag to indicate the waiting stages that are submitted due to retry?

Yes, we can add a flag, please see the updated code.
Actually, there is a trade-off in deciding which waiting stages to kill. The range of choices, from largest to smallest:

  • kill all waiting stages
  • kill waiting stages that have failed before (stage#failedAttemptIds > 0)
  • kill waiting stages that were resubmitted after a fetch failure (stage#resubmitInFetchFailed)
  • kill only waiting stages that still have running tasks (this might not be enough?)
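The four options above can be made concrete with a small Python model (field names such as `failed_attempt_ids` mirror the discussion but are illustrative, not Spark's actual API): each predicate selects a different subset of waiting stages to kill, and only the broadest one also kills a waiting stage that would otherwise be SKIPPED.

```python
# Illustrative model of the four candidate kill predicates above.
class Stage:
    def __init__(self, name, failed_attempt_ids=(),
                 resubmit_in_fetch_failed=False, has_running_tasks=False):
        self.name = name
        self.failed_attempt_ids = set(failed_attempt_ids)
        self.resubmit_in_fetch_failed = resubmit_in_fetch_failed
        self.has_running_tasks = has_running_tasks

predicates = {
    "all_waiting":   lambda s: True,
    "had_failed":    lambda s: len(s.failed_attempt_ids) > 0,
    "fetch_retry":   lambda s: s.resubmit_in_fetch_failed,
    "running_tasks": lambda s: s.has_running_tasks,
}

# stage2 from the PR scenario: retried after FetchFailed, old tasks still running.
stage2 = Stage("stage2", failed_attempt_ids={0},
               resubmit_in_fetch_failed=True, has_running_tasks=True)
# A normally generated waiting stage that never ran (would be shown as SKIPPED).
skipped = Stage("skipped")

for name, pred in predicates.items():
    print(name, "kills stage2:", pred(stage2), "kills skipped:", pred(skipped))
```

All four predicates catch the zombie stage2, but only `all_waiting` also kills the never-run stage, which is what would produce the unnecessary stageFailed events mentioned above.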

@cloud-fan (Contributor)
This is a good catch! cc @jiangxb1987 @Ngone51

@github-actions github-actions bot added the SQL label Dec 24, 2024
@@ -121,4 +121,6 @@ private[spark] trait TaskScheduler {
    */
   def applicationAttemptId(): Option[String]
+
+  def hasRunningTasks(stageId: Int): Boolean
Contributor
Why do we need this method ?

@yabola (Contributor Author) commented Dec 25, 2024
Please see this.
This is the code I wrote at the beginning (kill waiting stages that only have running tasks).
I'm not sure which way is better; I will delete this method if we choose one of the other options.

Contributor
If the method is no longer being used, please remove it.

@@ -2937,7 +2938,9 @@ private[spark] class DAGScheduler(
} else {
// This stage is only used by the job, so finish the stage if it is running.
val stage = stageIdToStage(stageId)
if (runningStages.contains(stage)) {
val shouldKill = runningStages.contains(stage) ||
(waitingStages.contains(stage) && stage.resubmitInFetchFailed)
Member
Shall we check the failedAttemptIds instead?

Suggested change:
- (waitingStages.contains(stage) && stage.resubmitInFetchFailed)
+ stage.failedAttemptIds.nonEmpty

Contributor Author
Yes, we can. Please see this.
I'm not sure which way is better.

@mridulm (Contributor) commented Dec 26, 2024
I like @Ngone51's suggestion better: simply check for stage.failedAttemptIds.nonEmpty || runningStages.contains(stage).
I can see an argument being made for failed stages as well.
With this, the PR will boil down to this one change, plus tests to stress this logic, of course.

Contributor Author
@mridulm @Ngone51 do you think it is necessary to use (waitingStages.contains(stage) && stage.failedAttemptIds.nonEmpty) || runningStages.contains(stage)? Only considering failedAttemptIds may result in repeated kill calls for stages that have already completed and failed.

Member
Only considering failedAttemptIds may result in repeated kill calls for stages that have already completed and failed.

It looks like there could be a case where the stage exists in failedStages but not in waitingStages. For example, in the case of fetch failures, the map stage and the reduce stage can both be added into failedStages, but the related job could be cancelled before they are resubmitted. So adding waitingStages.contains(stage) would miss the stages in failedStages. And I don't think we would have repeated calls, as we don't kill tasks for those failed stages.
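The timing window in that comment can be sketched with a small Python model (hypothetical names, not Spark's actual scheduler state): a stage that hit a FetchFailed is parked in a failed set until resubmission, so a cancel arriving in between finds it in neither the running nor the waiting set.

```python
# Toy timeline (illustrative names) of the window described above.
running_stages, waiting_stages, failed_stages = set(), set(), set()

def on_fetch_failed(stage):
    # The stage is parked for later resubmission; its tasks are not killed.
    running_stages.discard(stage)
    failed_stages.add(stage)

def resubmit_failed_stages():
    # Resubmission happens later, via a delayed scheduler event.
    while failed_stages:
        waiting_stages.add(failed_stages.pop())

running_stages.update({"map_stage", "reduce_stage"})
on_fetch_failed("reduce_stage")
on_fetch_failed("map_stage")

# The job is cancelled BEFORE resubmit_failed_stages() ever runs:
missed_by_waiting_check = failed_stages - waiting_stages - running_stages
print(sorted(missed_by_waiting_check))  # both stages are invisible to a
                                        # waitingStages.contains(stage) check
```

This is why checking stage.failedAttemptIds.nonEmpty (a property of the stage itself) covers cases that a membership test on waitingStages cannot.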

Contributor Author
Thanks for the confirmation, done

@github-actions github-actions bot removed the SQL label Dec 26, 2024