
Issue with "status.applicationState.state" After Spark-Operator Upgrade from v1beta2-1.4.6-3.5.0 to 2.0.2 #2366

anushkafer opened this issue Dec 18, 2024 · 1 comment

anushkafer commented Dec 18, 2024

What question do you want to ask?

  • ✋ I have searched the open/closed issues and my issue is not listed.

We are encountering an issue where Spark jobs executed as part of an Argo Workflow are no longer triggered successfully. Workflows that use Spark fail to proceed beyond the execution step; the problem first appeared after upgrading the Spark-Operator from version v1beta2-1.4.6-3.5.0 to 2.0.2.

The status.applicationState.state field is not being updated, so it never reflects the actual state during Spark job execution. [cx-lab-create1-vv52g, NOTE: the job with Completed status ran before the upgrade]
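For context, the failing step is an Argo Workflows resource template that creates the SparkApplication and then waits on its status. A minimal sketch of such a step, assuming a setup like ours (step name and manifest fields are illustrative; only the success/failure conditions are taken verbatim from the pod logs below):

    # Hypothetical Argo Workflows step -- only the conditions mirror our setup
    - name: run-spark-job
      resource:
        action: create
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: lab-create1-
          spec:
            # ...application spec omitted...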


Spark application pod logs for lab-create1-xxxxx:

time="2024-12-18T14:29:02.450Z" level=info msg="Get sparkapplications 200"
time="2024-12-18T14:29:02.450Z" level=info msg="failure condition '{status.applicationState.state == [FAILED]}' evaluated false"
time="2024-12-18T14:29:02.450Z" level=info msg="success condition '{status.applicationState.state == [COMPLETED]}' evaluated false"
time="2024-12-18T14:29:02.450Z" level=info msg="0/1 success conditions matched"
time="2024-12-18T14:29:02.450Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/lab-create1-xxxxx in namespace <NAMESPACE> resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."

spark-operator-controller pod logs:

Displaying logs from Namespace: spark for Pod: spark-operator-controller-b6bdb5dd9-txzk6. Logs from 12/18/2024, 4:53:36 PM
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=info --namespaces=default --controller-threads=10 --enable-ui-service=true --enable-metrics=true --me
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID:
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
I1218 14:53:38.142572 10 request.go:697] Waited for 1.035793361s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/karpenter.k8s.aws/v1
2024-12-18T14:53:38.607Z INFO controller/start.go:298 Starting manager
2024-12-18T14:53:38.608Z INFO controller-runtime.metrics server/server.go:205 Starting metrics server
2024-12-18T14:53:38.608Z INFO manager/server.go:50 starting server {"kind": "health probe", "addr": "0.0.0.0:8081"}
2024-12-18T14:53:38.608Z INFO controller-runtime.metrics server/server.go:244 Serving metrics server {"bindAddress": ":8080", "secure": false}
I1218 14:53:38.609333 10 leaderelection.go:250] attempting to acquire leader lease spark/spark-operator-controller-lock...
I1218 14:53:54.335544 10 leaderelection.go:260] successfully acquired lease spark/spark-operator-controller-lock
2024-12-18T14:53:54.335Z INFO controller/controller.go:178 Starting EventSource {"controller": "spark-application-controller", "source": "kind source: *v1beta2.SparkApplication"}
2024-12-18T14:53:54.336Z INFO controller/controller.go:178 Starting EventSource {"controller": "spark-application-controller", "source": "kind source: *v1.Pod"}
2024-12-18T14:53:54.336Z INFO controller/controller.go:186 Starting Controller {"controller": "spark-application-controller"}
2024-12-18T14:53:54.335Z INFO controller/controller.go:178 Starting EventSource {"controller": "scheduled-spark-application-controller", "source": "kind source: *v1beta2.ScheduledSparkApplication"}
2024-12-18T14:53:54.336Z INFO controller/controller.go:186 Starting Controller {"controller": "scheduled-spark-application-controller"}
2024-12-18T14:53:54.437Z INFO controller/controller.go:220 Starting workers {"controller": "spark-application-controller", "worker count": 10}
2024-12-18T14:53:54.437Z INFO controller/controller.go:220 Starting workers {"controller": "scheduled-spark-application-controller", "worker count": 10}
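
Note that the controller is started with --namespaces=default (visible in the exec line above), while the SparkApplication is created in <NAMESPACE>. In 2.x the watched namespaces are driven by the Helm chart values; a sketch of the relevant values, assuming the kubeflow/spark-operator 2.x chart (the spark.jobNamespaces key is our reading of that chart and should be verified against its values.yaml):

    # values.yaml sketch for the spark-operator 2.x Helm chart
    # (spark.jobNamespaces is an assumption; it should map to --namespaces)
    spark:
      jobNamespaces:
        - <NAMESPACE>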

Steps to Reproduce:

  • Upgrade the Spark-Operator from v1beta2-1.4.6-3.5.0 to 2.0.2.
  • Submit a Spark job.
  • Observe the status.applicationState.state field.

Expected Behavior:

  • The Spark job should be successfully triggered and executed as part of the Argo-Workflow.

Additional context

Argo-Workflow version: v3.2.7
Spark-Operator version: 2.0.2
EKS version: 1.25

Have the same question?

Give it a 👍. We prioritize the questions with the most 👍.

anushkafer (Author) commented:

Any suggestions?
