
[frontend] KFP v1 Pipeline Run Details Page Component Status shows "Execution was skipped" and "ML Metadata not found" #11457

Open
ahsan-habib-ta opened this issue Dec 10, 2024 · 0 comments


Environment

  • How did you deploy Kubeflow Pipelines (KFP)?

Kubeflow manifest 1.9.1 (multi-user)
k8s version: 1.29 (AWS EKS)

  • KFP version:
  • kubeflow pipeline 2.3.0
  • Argo-workflow: v3.4.17
  • KFP Python SDK: 1.8.2

Steps to reproduce

  • Create a KFP v1 pipeline based on the hello_world example, with component caching disabled by setting max_cache_staleness = "P0D".
  • Create an Experiment from the UI and create a run for the hello_world pipeline.
  • Go to the Run Details page, wait for the run to complete, and refresh the page.
  • The component status shows "Execution was skipped and outputs were taken from cache" even though the component was actually executed.
  • Clicking the "ML Metadata" tab in the component details section shows "Corresponding ML Metadata not found." even though database inspection shows the metadata exists.
(Screenshot: v1_run_status_and_metadata)

Expected result

The component status should not show the incorrect message "Execution was skipped and outputs were taken from cache", and the ML Metadata tab should display the execution's metadata.

Materials and Reference

This issue is related to Argo Workflows v3.4.17: starting with v3.4.0 (Changelog), Argo Workflows uses pod naming format v2.

Argo Workflows v2 pod-naming format: [workflow-name]-[step-template-name]-[random-number-string] [Ref], which differs from the node_id format [workflow-name]-[random-number-string].

Because of this formatting change, the function wasNodeCached fails to determine the cache status [similar issue].
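To make the mismatch concrete, here is a minimal sketch (the workflow and template names are hypothetical, not from the actual run) deriving the v2 pod name from a node ID:

```typescript
// Hypothetical names for illustration; real values come from the Argo
// WorkflowStatus. Pod-naming v2 inserts the template name before the hash,
// so the pod name no longer equals the node ID.
function podNameV2(nodeId: string, templateName: string): string {
  const parts = nodeId.split('-');
  const hash = parts[parts.length - 1];            // trailing random suffix
  const workflowName = parts.slice(0, -1).join('-'); // [workflow-name]
  return `${workflowName}-${templateName}-${hash}`;
}

const nodeId = 'hello-world-x7k2p';            // [workflow-name]-[random-string]
console.log(podNameV2(nodeId, 'say-hello'));   // hello-world-say-hello-x7k2p
```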

Updating the function as follows fixes the cache status issue:


function wasNodeCached(node: NodeStatus): boolean {
  const artifacts = node.outputs?.artifacts;
  // HACK: There is a way to detect the skipped pods based on the WorkflowStatus alone.
  // All output artifacts have the pod name (same as node ID) in the URI. But for skipped
  // pods, the pod name does not match the URIs.
  // (And now there are always some output artifacts since we've enabled log archiving).
  // Reconstruct the Argo v2 pod name ([workflow-name]-[template-name]-[hash]) so that
  // artifacts keyed by either naming format are recognized.
  const split = node.id.split('-');
  const hash = split[split.length - 1];
  const prefix = split.slice(0, split.length - 1).join('-');
  const pod_name = prefix.concat('-', node.templateName).concat('-', hash);

  return !artifacts || !node.id || node.type !== 'Pod'
    ? false
    : artifacts.some(
        artifact =>
          artifact.s3 && !(artifact.s3.key.includes(node.id) || artifact.s3.key.includes(pod_name)),
      );
}
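As a sanity check, the patched logic can be exercised against mock node objects. The interfaces below are simplified stand-ins for the real Argo/KFP types, and the IDs and keys are made up for illustration:

```typescript
// Simplified stand-ins for the Argo NodeStatus / artifact shapes.
interface S3Artifact { key: string; }
interface Artifact { s3?: S3Artifact; }
interface NodeStatus {
  id: string;
  type: string;
  templateName: string;
  outputs?: { artifacts?: Artifact[] };
}

// Patched cache detection: a node is "cached" only if none of its artifact
// keys contain the node ID or the reconstructed v2 pod name.
function wasNodeCached(node: NodeStatus): boolean {
  const artifacts = node.outputs?.artifacts;
  const split = node.id.split('-');
  const hash = split[split.length - 1];
  const prefix = split.slice(0, split.length - 1).join('-');
  const pod_name = prefix.concat('-', node.templateName).concat('-', hash);
  return !artifacts || !node.id || node.type !== 'Pod'
    ? false
    : artifacts.some(
        a => a.s3 && !(a.s3.key.includes(node.id) || a.s3.key.includes(pod_name)),
      );
}

// Node that actually ran: its artifact key contains the v2 pod name.
const ran: NodeStatus = {
  id: 'hello-world-12345',
  type: 'Pod',
  templateName: 'say-hello',
  outputs: { artifacts: [{ s3: { key: 'artifacts/hello-world-say-hello-12345/main.log' } }] },
};

// Node whose outputs came from cache: the key points at some other pod.
const cached: NodeStatus = {
  id: 'hello-world-67890',
  type: 'Pod',
  templateName: 'say-hello',
  outputs: { artifacts: [{ s3: { key: 'artifacts/other-run-say-hello-99999/main.log' } }] },
};

console.log(wasNodeCached(ran));    // false — executed, not cached
console.log(wasNodeCached(cached)); // true  — outputs taken from cache
```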

Because of the same pod name formatting change, the UI also fails to fetch the Metadata information:

    const selectedExecution = mlmdExecutions?.find(
      execution => ExecutionHelpers.getKfpPod(execution) === selectedNodeId,
    );

The Metadata display issue in the UI can be fixed with the following modification:

    const selectedExecution = mlmdExecutions?.find(
      execution => (ExecutionHelpers.getKfpPod(execution) === selectedNodeId || ExecutionHelpers.getKfpPod(execution) === selectedNodeName),
    );
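A minimal mock of why the fallback comparison finds the execution under v2 naming — note that the execution objects and getKfpPod below are simplified stand-ins for MLMD's Execution protos and the real ExecutionHelpers.getKfpPod, and the IDs are hypothetical:

```typescript
// Stand-in for an MLMD execution record; the real one stores the pod name
// as a custom property read via ExecutionHelpers.getKfpPod.
interface MockExecution { kfpPod: string; }
const getKfpPod = (e: MockExecution): string => e.kfpPod;

const mlmdExecutions: MockExecution[] = [
  { kfpPod: 'hello-world-say-hello-12345' }, // pod name recorded under v2 naming
];

const selectedNodeId = 'hello-world-12345';              // Argo node ID
const selectedNodeName = 'hello-world-say-hello-12345';  // node's v2 pod name

// Matching by node ID alone misses; the fallback on the node name matches.
const selectedExecution = mlmdExecutions.find(
  e => getKfpPod(e) === selectedNodeId || getKfpPod(e) === selectedNodeName,
);
console.log(selectedExecution !== undefined); // true — execution found
```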

Impacted by this bug? Give it a 👍.
