Add support to emit metric to the target AMP #486
base: main
Conversation
e2e2/internal/metric/metric.go
Outdated
```go
}

// PushMetricsToAMP pushes metric data to AWS Managed Prometheus (AMP) using SigV4 authentication
func (m *MetricManager) PushMetricsToAMP(name string, help string, value float64) error {
```
- batching the samples would be preferable to making a separate call to the remote_write API for every sample we collect, IMO
- are you not able to use the upstream remote write client because of the assume-role jump? https://github.com/prometheus/prometheus/blob/5037cf75f2d4f1671ad365ba1e99902fc36808d5/storage/remote/client.go#L180
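For the batching point, a minimal sketch of what I mean (names like `Batcher` and `Sample` are hypothetical, not from this PR): buffer samples in memory and push them in one remote_write call when the batch fills, instead of one HTTP request per sample. The `flush` callback stands in for the actual SigV4-signed push to AMP.

```go
package main

import "fmt"

// Sample is a hypothetical in-memory representation of one metric sample.
type Sample struct {
	Name  string
	Value float64
}

// Batcher collects samples and flushes them in one call instead of making a
// separate remote_write request per sample.
type Batcher struct {
	samples []Sample
	limit   int
	flush   func([]Sample) error
}

func NewBatcher(limit int, flush func([]Sample) error) *Batcher {
	return &Batcher{limit: limit, flush: flush}
}

// Add buffers a sample and flushes when the batch is full.
func (b *Batcher) Add(s Sample) error {
	b.samples = append(b.samples, s)
	if len(b.samples) >= b.limit {
		return b.Flush()
	}
	return nil
}

// Flush sends all buffered samples in a single request and resets the buffer.
func (b *Batcher) Flush() error {
	if len(b.samples) == 0 {
		return nil
	}
	err := b.flush(b.samples)
	b.samples = b.samples[:0]
	return err
}

func main() {
	calls := 0
	b := NewBatcher(3, func(batch []Sample) error {
		calls++
		fmt.Printf("pushed %d samples in call %d\n", len(batch), calls)
		return nil
	})
	for i := 0; i < 6; i++ {
		b.Add(Sample{Name: "nccl_bandwidth", Value: float64(i)})
	}
	b.Flush()
	fmt.Println("total calls:", calls)
}
```

Six samples with a batch limit of 3 result in two remote_write calls rather than six; a final `Flush` at test teardown catches any partial batch.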
For the first point, that sounds good—I’ll change it in the next revision. As for the second point, I spent some time trying to use the remote write client, but I wasn’t able to integrate it into my code.
e2e2/test/cases/nvidia/main_test.go
Outdated
```go
	return nil, fmt.Errorf("no nodes found in the cluster")
}

// Get instance type and metadata from the first node
```
The test case shouldn't really assume that all the nodes in the cluster are the same across all these dimensions. Can you pass in the dimensions with your sample, instead of fetching them ahead of time? Then you'd be able to pass dimensions that you know match the sample.
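As a rough sketch of what "pass the dimensions with the sample" could look like (the `MetricSample` type and `seriesKey` helper are illustrative, not from this PR): each sample carries its own label map, so the emitter never assumes every node shares one instance type or OS.

```go
package main

import (
	"fmt"
	"sort"
)

// MetricSample is a hypothetical sample that carries its own dimensions,
// so nothing has to be fetched from the cluster ahead of time.
type MetricSample struct {
	Name       string
	Value      float64
	Dimensions map[string]string
}

// seriesKey renders the sample as a Prometheus-style series string with
// sorted label names, roughly what a remote_write payload would encode.
func seriesKey(s MetricSample) string {
	keys := make([]string, 0, len(s.Dimensions))
	for k := range s.Dimensions {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	out := s.Name + "{"
	for i, k := range keys {
		if i > 0 {
			out += ","
		}
		out += fmt.Sprintf("%s=%q", k, s.Dimensions[k])
	}
	return out + "}"
}

func main() {
	s := MetricSample{
		Name:  "nccl_all_reduce_bus_bandwidth",
		Value: 42.0,
		Dimensions: map[string]string{
			"instance_type": "p4d.24xlarge",
			"os_type":       "linux",
		},
	}
	fmt.Println(seriesKey(s))
}
```

The test then attaches whichever dimensions it actually observed for the node that produced the measurement, so the labels are guaranteed to match the sample.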
Updated in the latest revision
e2e2/test/images/nvidia/Dockerfile
Outdated
```dockerfile
echo "NCCL Version: $NCCL_VERSION" && \
echo "AWS OFI NCCL Version: $AWS_OFI_NCCL_VERSION" && \
printf "NVIDIA Driver Version: " && \
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
```
I think this would be better suited for an `ENTRYPOINT` script that logged this info and then ran whatever `CMD` was used.
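Something along these lines (a sketch only; the script name and the assumption that the versions are exported via `ENV` are mine): the entrypoint logs the versions at container start and then hands off to the original command with `exec "$@"`, so it works with any `CMD`.

```shell
#!/bin/sh
# entrypoint.sh (hypothetical): log build/runtime versions, then run the
# container's CMD unchanged. Assumes NCCL_VERSION and AWS_OFI_NCCL_VERSION
# are made available at runtime via ENV in the Dockerfile.
echo "NCCL Version: ${NCCL_VERSION:-unknown}"
echo "AWS OFI NCCL Version: ${AWS_OFI_NCCL_VERSION:-unknown}"
if command -v nvidia-smi >/dev/null 2>&1; then
    printf "NVIDIA Driver Version: "
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
fi
# Hand off to whatever CMD (or explicit args) the container was started with.
exec "$@"
```

In the Dockerfile this would be wired up with `COPY entrypoint.sh /entrypoint.sh` and `ENTRYPOINT ["/entrypoint.sh"]`, leaving the existing `CMD` untouched.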
Updated in the latest revision
e2e2/test/cases/nvidia/main_test.go
Outdated
```go
	"os_type": osType,
}

// Create a job to fetch the logs of meta info
```
seems like you could just log these details in your actual test run instead of using a separate pod to print them
I couldn't. I tried adding an `ENTRYPOINT` in my Dockerfile, but the NCCL test pods don't print these details.
Wondering the same. The launcher pods or worker pods should have these details, right? They also run the entrypoint script.
### Enter the Kubetest2 Container

```bash
docker run --name kubetest2 -d -i -t kubetest2 /bin/sh
docker exec -it kubetest2 sh
```
I would just build the deployer and e2e-nvidia binary locally; that would be simpler and faster during dev.
Isn't Kubetest2 the deployer?
e2e2/test/cases/nvidia/mpi_test.go
Outdated
```go
job := &batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "metadata-job",
		Namespace: "default",
	},
	Spec: batchv1.JobSpec{
		Template: v1.PodTemplateSpec{
			Spec: v1.PodSpec{
				RestartPolicy: v1.RestartPolicyNever,
				Containers: []v1.Container{
					{
						Name:            "metadata-job",
						Image:           *nvidiaTestImage,
						ImagePullPolicy: v1.PullAlways,
						Resources: v1.ResourceRequirements{
							Limits: v1.ResourceList{
								"nvidia.com/gpu":        node.Status.Capacity["nvidia.com/gpu"],
								"vpc.amazonaws.com/efa": node.Status.Capacity["vpc.amazonaws.com/efa"],
							},
						},
					},
				},
			},
		},
	},
}
```
I think we can use a template here to reduce the function size
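One way to do that, sketched with the standard library's `text/template` (the template string and `jobParams` type are illustrative; the repo may prefer a helper that fills in a `batchv1.Job` struct directly instead):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// jobTemplate is a hypothetical manifest template that replaces the long
// inline struct literal with a short parameter set.
const jobTemplate = `apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Name }}
  namespace: {{ .Namespace }}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: {{ .Name }}
          image: {{ .Image }}
          resources:
            limits:
              nvidia.com/gpu: {{ .GPULimit }}
              vpc.amazonaws.com/efa: {{ .EFALimit }}
`

type jobParams struct {
	Name, Namespace, Image string
	GPULimit, EFALimit     string
}

// renderJob fills the template, producing a manifest the test framework
// can then decode and apply.
func renderJob(p jobParams) (string, error) {
	t, err := template.New("job").Parse(jobTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := renderJob(jobParams{
		Name: "metadata-job", Namespace: "default",
		Image: "nvidia-test:latest", GPULimit: "8", EFALimit: "4",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The calling test function then shrinks to a few lines of parameters plus a render-and-apply call.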
Updated in new rev
```go
	ObjectMeta: metav1.ObjectMeta{Name: "metadata-job", Namespace: "default"},
}
err = wait.For(fwext.NewConditionExtension(cfg.Client().Resources()).JobSucceeded(job),
	wait.WithContext(ctx))
```
Can we add some comments around the purpose of this job and what it's running?
```dockerfile
ARG EFA_INSTALLER_VERSION=latest
# Add ENV to make ARG values available at runtime
ARG EFA_INSTALLER_VERSION=1.34.0
ARG NCCL_VERSION=2.18.5
```
Why are we using an older version of NCCL here? The general recommendation is to use either of the last two releases (n-1 preferred, as the latest might have issues).
```go
)

type MetricManager struct {
	// Metadata map[string]string
```
Are we planning to use this field later?
@weicongw Can we also rebase the PR? Thanks
Add support to emit metric to the target Amazon Managed Service for Prometheus workspace
Issue #, if available:
Description of changes:
Test
Query the metric from AMP
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.