
Issue with "status.applicationState.state" After Spark-Operator Upgrade from v1beta2-1.4.6-3.5.0 to 2.0.2 #2366

anushkafer opened this issue Dec 18, 2024 · 1 comment

anushkafer commented Dec 18, 2024

What question do you want to ask?

  • ✋ I have searched the open/closed issues and my issue is not listed.

We are encountering an issue where Spark jobs executed as part of an Argo Workflow are no longer triggered successfully. Workflows that use Spark fail to proceed beyond the execution step; the problem first appeared after upgrading the Spark-Operator from version v1beta2-1.4.6-3.5.0 to 2.0.2.

The status.applicationState.state field is not being updated, so it never reflects the actual state during Spark job execution. [cx-lab-create1-vv52g, NOTE: the job with Completed status ran before the upgrade]
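For context, the failing step is an Argo Workflows resource template that creates the SparkApplication and then waits on its status. A minimal sketch of such a step, assuming a setup like ours (step name and manifest fields are illustrative; only the success/failure conditions are taken verbatim from the pod logs below):

    # Hypothetical Argo Workflows step -- only the conditions mirror our setup
    - name: run-spark-job
      resource:
        action: create
        successCondition: status.applicationState.state == COMPLETED
        failureCondition: status.applicationState.state == FAILED
        manifest: |
          apiVersion: sparkoperator.k8s.io/v1beta2
          kind: SparkApplication
          metadata:
            generateName: lab-create1-
          spec:
            # ...application spec omitted...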


Spark application pod logs for lab-create1-xxxxx:

time="2024-12-18T14:29:02.450Z" level=info msg="Get sparkapplications 200"
time="2024-12-18T14:29:02.450Z" level=info msg="failure condition '{status.applicationState.state == [FAILED]}' evaluated false"
time="2024-12-18T14:29:02.450Z" level=info msg="success condition '{status.applicationState.state == [COMPLETED]}' evaluated false"
time="2024-12-18T14:29:02.450Z" level=info msg="0/1 success conditions matched"
time="2024-12-18T14:29:02.450Z" level=info msg="Waiting for resource sparkapplication.sparkoperator.k8s.io/lab-create1-xxxxx in namespace <NAMESPACE> resulted in retryable error: Neither success condition nor the failure condition has been matched. Retrying..."

spark-operator-controller pod logs:

Displaying logs from Namespace: spark for Pod: spark-operator-controller-b6bdb5dd9-txzk6. Logs from 12/18/2024, 4:53:36 PM
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=info --namespaces=default --controller-threads=10 --enable-ui-service=true --enable-metrics=true --me
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID:
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
I1218 14:53:38.142572 10 request.go:697] Waited for 1.035793361s due to client-side throttling, not priority and fairness, request: GET:https://172.20.0.1:443/apis/karpenter.k8s.aws/v1
2024-12-18T14:53:38.607Z INFO controller/start.go:298 Starting manager
2024-12-18T14:53:38.608Z INFO controller-runtime.metrics server/server.go:205 Starting metrics server
2024-12-18T14:53:38.608Z INFO manager/server.go:50 starting server {"kind": "health probe", "addr": "0.0.0.0:8081"}
2024-12-18T14:53:38.608Z INFO controller-runtime.metrics server/server.go:244 Serving metrics server {"bindAddress": ":8080", "secure": false}
I1218 14:53:38.609333 10 leaderelection.go:250] attempting to acquire leader lease spark/spark-operator-controller-lock...
I1218 14:53:54.335544 10 leaderelection.go:260] successfully acquired lease spark/spark-operator-controller-lock
2024-12-18T14:53:54.335Z INFO controller/controller.go:178 Starting EventSource {"controller": "spark-application-controller", "source": "kind source: *v1beta2.SparkApplication"}
2024-12-18T14:53:54.336Z INFO controller/controller.go:178 Starting EventSource {"controller": "spark-application-controller", "source": "kind source: *v1.Pod"}
2024-12-18T14:53:54.336Z INFO controller/controller.go:186 Starting Controller {"controller": "spark-application-controller"}
2024-12-18T14:53:54.335Z INFO controller/controller.go:178 Starting EventSource {"controller": "scheduled-spark-application-controller", "source": "kind source: *v1beta2.ScheduledSparkApplication"}
2024-12-18T14:53:54.336Z INFO controller/controller.go:186 Starting Controller {"controller": "scheduled-spark-application-controller"}
2024-12-18T14:53:54.437Z INFO controller/controller.go:220 Starting workers {"controller": "spark-application-controller", "worker count": 10}
2024-12-18T14:53:54.437Z INFO controller/controller.go:220 Starting workers {"controller": "scheduled-spark-application-controller", "worker count": 10}
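
Note that the controller is started with --namespaces=default (visible in the exec line above), while the SparkApplication is created in <NAMESPACE>. In 2.x the watched namespaces are driven by the Helm chart values; a sketch of the relevant values, assuming the kubeflow/spark-operator 2.x chart (the spark.jobNamespaces key is our reading of that chart and should be verified against its values.yaml):

    # values.yaml sketch for the spark-operator 2.x Helm chart
    # (spark.jobNamespaces is an assumption; it should map to --namespaces)
    spark:
      jobNamespaces:
        - <NAMESPACE>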

Steps to Reproduce:

  • Upgrade the Spark-Operator from v1beta2-1.4.6-3.5.0 to 2.0.2.
  • Submit a Spark job.
  • Observe the status.applicationState.state field.

Expected Behavior:

  • The Spark job should be successfully triggered and executed as part of the Argo-Workflow.

Additional context

Argo-Workflow version: v3.2.7
Spark-Operator version: 2.0.2
EKS version: 1.25

Have the same question?

Give it a 👍. We prioritize the questions with the most 👍.

anushkafer (Author) commented:

Any suggestions?
