
Add support to emit metric to the target AMP #486

Open · wants to merge 1 commit into base: main

Conversation

weicongw
Contributor

Add support for emitting metrics to the target Amazon Managed Service for Prometheus workspace

Issue #, if available:

Description of changes:

  • Add support for emitting metrics to the target Amazon Managed Service for Prometheus (AMP) workspace
  • The test supports emitting metrics cross-account and cross-region
  • If the AMP URL is not set, the test does not emit metrics
  • Emit the NCCL test average bus bandwidth metric
  • Add metadata labels to the metric
  • Add/update the README
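The emit-only-when-configured behavior described above can be sketched as follows (the `Sample` type and `emitIfConfigured` helper are illustrative names, not the PR's actual API; the SigV4-signed POST is omitted):

```go
package main

import "fmt"

// Sample is a single metric observation plus its metadata labels.
type Sample struct {
	Name   string
	Value  float64
	Labels map[string]string
}

// emitIfConfigured skips emission when no AMP remote-write URL is set,
// so tests still pass when metrics are not configured.
func emitIfConfigured(ampURL string, s Sample) (emitted bool) {
	if ampURL == "" {
		return false // metrics are optional; nothing is sent
	}
	// ... SigV4-sign and POST to ampURL (omitted in this sketch) ...
	fmt.Printf("emitting %s=%g with %d labels\n", s.Name, s.Value, len(s.Labels))
	return true
}

func main() {
	s := Sample{
		Name:   "nccl_average_bandwidth_gbps",
		Value:  3.68456,
		Labels: map[string]string{"instance_type": "p4de.24xlarge", "node_count": "2"},
	}
	fmt.Println(emitIfConfigured("", s)) // false: no URL configured
}
```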

Test

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia --efaEnabled=true --feature=multi-node --ampMetricUrl=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/remote_write --ampMetricRoleArn=arn:aws:iam::665181186642:role/amp
...
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 3.68456 
        [1,0]<stdout>:#
        [1,0]<stdout>:
        
    mpi_test.go:145: Emitting nccl test metrics to AMP

Query the metric from AMP

export AMP_QUERY_ENDPOINT=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/query

awscurl -X POST --region us-west-2 \
--service aps "${AMP_QUERY_ENDPOINT}" \
-d 'query=nccl_average_bandwidth_gbps[60m]' \
--header 'Content-Type: application/x-www-form-urlencoded'

{"status":"success","data":{"resultType":"matrix","result":[{"metric":
{"__name__":"nccl_average_bandwidth_gbps","ami_id":"ami-0cd7612ff47454cd6",
"aws_ofi_nccl_version":"1.9.1","efa_count":"1","efa_enabled":"true",
"efa_installer_version":"1.34.0","instance_type":"p4de.24xlarge",
"kubernetes_version":"1.30+","nccl_version":"2.18.5","node_count":"2",
"nvidia_driver_version":"550.90.07","os_type":"Amazon Linux 2"},
"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}
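For reference, a response in the shape shown above can be decoded with the standard library; a minimal sketch (struct and function names are illustrative, and only the fields used here are modeled):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queryResponse models the Prometheus range-vector ("matrix") response
// returned by the AMP query API, limited to the fields we read.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Values [][2]any          `json:"values"` // [timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

const exampleResponse = `{"status":"success","data":{"resultType":"matrix","result":[{"metric":{"__name__":"nccl_average_bandwidth_gbps","instance_type":"p4de.24xlarge"},"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}`

// latestSample returns the metric name and the most recent sample value
// of the first series in the response.
func latestSample(raw []byte) (name, value string, err error) {
	var resp queryResponse
	if err = json.Unmarshal(raw, &resp); err != nil {
		return "", "", err
	}
	s := resp.Data.Result[0]
	v := s.Values[len(s.Values)-1]
	return s.Metric["__name__"], v[1].(string), nil
}

func main() {
	n, v, err := latestSample([]byte(exampleResponse))
	if err != nil {
		panic(err)
	}
	fmt.Println(n, v) // nccl_average_bandwidth_gbps 3.68456
}
```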

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@weicongw weicongw marked this pull request as ready for review September 24, 2024 22:01
}

// PushMetricsToAMP pushes metric data to AWS Managed Prometheus (AMP) using SigV4 authentication
func (m *MetricManager) PushMetricsToAMP(name string, help string, value float64) error {
Member

  1. batching the samples would be preferable to making a separate call to the remote_write API for every sample we collect, IMO

  2. are you not able to use the upstream remote write client because of the assume-role jump? https://github.com/prometheus/prometheus/blob/5037cf75f2d4f1671ad365ba1e99902fc36808d5/storage/remote/client.go#L180
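The batching suggestion in point 1 can be sketched as below. This is a stdlib-only stand-in for the real flow, which would encode the pending samples into a snappy-compressed `prompb.WriteRequest` and SigV4-sign the POST; all names here are hypothetical:

```go
package main

import "fmt"

// sample is one metric observation with a millisecond timestamp.
type sample struct {
	name  string
	value float64
	tsMs  int64
}

// batcher accumulates samples so the test makes one remote_write call per
// flush instead of one call per sample.
type batcher struct {
	pending []sample
}

func (b *batcher) add(s sample) { b.pending = append(b.pending, s) }

// flush would encode b.pending into a single remote_write request and send
// it; this sketch just reports how many samples the one request carries.
func (b *batcher) flush() int {
	n := len(b.pending)
	if n == 0 {
		return 0
	}
	// ... encode + send (omitted) ...
	b.pending = b.pending[:0]
	return n
}

func main() {
	var b batcher
	b.add(sample{"nccl_average_bandwidth_gbps", 3.62432, 1726791286534})
	b.add(sample{"nccl_average_bandwidth_gbps", 3.68456, 1726794564870})
	fmt.Println(b.flush()) // 2: both samples go out in one request
	fmt.Println(b.flush()) // 0: nothing pending after a flush
}
```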

Contributor Author

For the first point, that sounds good—I’ll change it in the next revision. As for the second point, I spent some time trying to use the remote write client, but I wasn’t able to integrate it into my code.

e2e2/internal/metric/metric.go (outdated; resolved)
e2e2/internal/utils/aws.go (outdated; resolved)
return nil, fmt.Errorf("no nodes found in the cluster")
}

// Get instance type and metadata from the first node
Member

the test case shouldn't really assume that all the nodes in the cluster are the same across all these dimensions; can you pass in the dimensions with your sample, instead of fetching them ahead of time? Then you'd be able to pass dimensions that you know match the sample
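One way to realize this suggestion, sketched with a hypothetical helper that renders a sample in Prometheus exposition style so the dimensions demonstrably travel with the value rather than being fetched ahead of time:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSample attaches caller-supplied dimensions to the sample itself,
// so the labels are known to match the node that produced the measurement.
func formatSample(name string, value float64, dims map[string]string) string {
	keys := make([]string, 0, len(dims))
	for k := range dims {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic label order
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s=%q`, k, dims[k]))
	}
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(formatSample("nccl_average_bandwidth_gbps", 3.68456,
		map[string]string{"instance_type": "p4de.24xlarge", "node_count": "2"}))
	// nccl_average_bandwidth_gbps{instance_type="p4de.24xlarge",node_count="2"} 3.68456
}
```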

Contributor Author

Updated in the latest revision

Comment on lines 108 to 112
echo "NCCL Version: $NCCL_VERSION" && \
echo "AWS OFI NCCL Version: $AWS_OFI_NCCL_VERSION" && \
printf "NVIDIA Driver Version: " && \
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
Member

I think this would be better suited for an ENTRYPOINT script that logged this info and then ran whatever CMD was used

Contributor Author

Updated in the latest revision

"os_type": osType,
}

// Create a job to fetch the logs of meta info
Member

seems like you could just log these details in your actual test run instead of using a separate pod to print them

Contributor Author

I couldn't. I tried adding an ENTRYPOINT to my Dockerfile, but the NCCL test pods don't print these details.

Contributor

Wondering the same. The launcher or worker pods should have these details, right? They also run the entrypoint script.

Comment on lines +26 to +30
### Enter the Kubetest2 Container

```bash
docker run --name kubetest2 -d -i -t kubetest2 /bin/sh
docker exec -it kubetest2 sh
Member

I would just build the deployer and e2e-nvidia binary locally, would be simpler + faster during dev

Contributor Author

Isn't Kubetest2 the deployer?

Comment on lines 284 to 299
job := &batchv1.Job{
ObjectMeta: metav1.ObjectMeta{
Name: "metadata-job",
Namespace: "default",
},
Spec: batchv1.JobSpec{
Template: v1.PodTemplateSpec{
Spec: v1.PodSpec{
RestartPolicy: v1.RestartPolicyNever,
Containers: []v1.Container{
{
Name: "metadata-job",
Image: *nvidiaTestImage,
ImagePullPolicy: v1.PullAlways,
Resources: v1.ResourceRequirements{
Limits: v1.ResourceList{
"nvidia.com/gpu": node.Status.Capacity["nvidia.com/gpu"],
"vpc.amazonaws.com/efa": node.Status.Capacity["vpc.amazonaws.com/efa"],
},
},
},
},
},
},
},
}
Contributor

I think we can use a template here to reduce the function size

Contributor Author

Updated in the new revision

ObjectMeta: metav1.ObjectMeta{Name: "metadata-job", Namespace: "default"},
}
err = wait.For(fwext.NewConditionExtension(cfg.Client().Resources()).JobSucceeded(job),
wait.WithContext(ctx))
Contributor @Pavani-Panakanti, Nov 13, 2024
Can we add some comments about the purpose of this job and what it runs?

ARG EFA_INSTALLER_VERSION=latest
# Add ENV to make ARG values available at runtime
ARG EFA_INSTALLER_VERSION=1.34.0
ARG NCCL_VERSION=2.18.5
Contributor @Pavani-Panakanti, Nov 13, 2024

Why are we using an older version of NCCL here? The general recommendation is to use one of the last two releases (preferably n-1, since the latest might have issues).

)

type MetricManager struct {
// Metadata map[string]string
Contributor

Are we planning to use this field later?

@Pavani-Panakanti (Contributor)

@weicongw Can we also rebase the PR? Thanks
