DCGM exporter doesn't work on the latest version of Bottlerocket AMI #34

peter-volkov · 2024-05-08T12:40:47Z

I'm not sure this is the right place to report this; please redirect me if it isn't.

Goal:
I want an EKS cluster with working observability, the Bottlerocket AMI, and GPU nodes (g5* instances).
I use this Helm chart by enabling the amazon-cloudwatch-observability EKS add-on for my cluster.

Steps to reproduce:

I create an EKS cluster on the latest version, with GPU nodes running the latest version of the Bottlerocket AMI.
I use the latest nvidia-device-plugin (https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0).
I enable the latest version of the amazon-cloudwatch-observability EKS add-on (the dcgm-exporter image 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/observability/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04 is used).
All related DaemonSets except dcgm-exporter work well.
The dcgm-exporter containers log the following:

time="2024-05-08T11:22:02Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-05-08T11:22:02Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

I guess this is a version incompatibility between DCGM and the NVIDIA driver (installed on the nodes via k8s-device-plugin).
What should I do to make the DCGM exporter work?


mitali-salvi · 2024-05-21T16:10:50Z

Hey @peter-volkov,
could you provide the full Bottlerocket AMI details so that we can re-create this issue on our end?

peter-volkov · 2024-05-21T19:01:14Z

I appreciate your help.
I'm just creating a node group with ami_type = "BOTTLEROCKET_x86_64_NVIDIA" specified in the Terraform config. It takes the latest version of the image family at initial creation.
Currently I have:
release_version = "1.19.5-64049ba8"
imageId=ami-0f3f964e4f939bbd0

But I don't really care about the version. If you can successfully run the DCGM exporter as part of the amazon-cloudwatch-observability EKS add-on on a g5.xlarge with any Bottlerocket image, that will be enough for me. I will then treat the issue as my own problem and debug it myself.

lfpalacios · 2024-07-11T18:25:17Z

I'm experiencing this as well. I use the official bottlerocket-nvidia AMI ami-02ce823b770755757 (bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.20.2-536d69d0) in my EKS 1.30 cluster for GPU workloads.

The dcgm-exporter enters a CrashLoopBackOff state, with the following logs:

Warning #2: dcgm-exporter doesn't have sufficient privileges to expose profiling metrics. To get profiling metrics with dcgm-exporter, use --cap-add SYS_ADMIN
time="2024-07-11T18:15:00Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-07-11T18:15:00Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"

I tried to manually set the capability in the aws-observability Helm chart values.yaml, but the parameter seems to be ignored or doesn't exist:

dcgmExporter:
  ...
  securityContext:
    capabilities:
      add: ["SYS_ADMIN"]

I tried to make dcgm-exporter work in many ways, including the official NVIDIA Helm charts and the Datadog integration; they all fail when dcgm-exporter tries to start up on Kubernetes.
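
One way to confirm whether a values.yaml override like the one above actually reached the running pod is to inspect the pod manifest itself. The following is a minimal sketch, assuming the manifest has been exported as JSON (for example with `kubectl get pod <pod-name> -o json`) and loaded into a dict; the function name `container_capabilities` is hypothetical. Note that, per the log text above, the SYS_ADMIN warning concerns profiling metrics only, so this check does not by itself address the NVML initialization failure.

```python
def container_capabilities(pod_manifest: dict) -> dict:
    """Map each container name to the capabilities added in its securityContext."""
    spec = pod_manifest.get("spec") or {}
    result = {}
    for container in spec.get("containers") or []:
        security = container.get("securityContext") or {}
        added = (security.get("capabilities") or {}).get("add") or []
        result[container.get("name", "")] = list(added)
    return result


def has_capability(pod_manifest: dict, capability: str) -> bool:
    """True if any container in the pod adds the given capability."""
    return any(
        capability in added
        for added in container_capabilities(pod_manifest).values()
    )
```

If `has_capability(manifest, "SYS_ADMIN")` returns False after applying the override, the chart is most likely not passing that value through to the pod spec.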

lisguo · 2024-07-12T20:02:28Z

We are aware of the issue and are looking at potential solutions. Will keep you posted on a path forward.

dbcelm · 2024-08-13T03:49:03Z

I'm facing this issue as well with Bottlerocket AMI GPU nodes. I don't see a direct way to disable the DCGM exporter from the Helm chart. Could we add a conditional variable to disable the DCGM Exporter in the chart until this is fixed? As of now, the only way seems to be deleting its CRD resource.

movence · 2024-08-19T13:56:56Z

@dbcelm You can try updating the agent configuration to disable accelerated hardware monitoring, which should disable DCGM Exporter in your cluster.
The following configuration will disable DCGM Exporter while keeping Enhanced Container Insights feature activated:

{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "enhanced_container_insights": true,
        "accelerated_compute_metrics": false
      }
    }
  }
}

For more details, please check the doc.

Please note that disabling GPU monitoring with the accelerated_compute_metrics flag in the agent configuration will disable both DCGM Exporter (NVIDIA) and Neuron Monitor (Inferentia & Trainium).
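
If you already have a customized agent configuration, the flag can be merged into it rather than replacing the whole document. The following is a minimal sketch in Python, assuming the existing configuration is available as a dict; the helper name `disable_accelerated_compute` is hypothetical, and only the two keys shown in the JSON above are taken from this thread.

```python
import copy
import json


def disable_accelerated_compute(agent_config: dict) -> dict:
    """Return a copy of the agent config with accelerated compute metrics off.

    Other settings in the config are preserved; the input dict is not modified.
    """
    updated = copy.deepcopy(agent_config)
    kubernetes = (
        updated.setdefault("logs", {})
        .setdefault("metrics_collected", {})
        .setdefault("kubernetes", {})
    )
    # Keep Enhanced Container Insights on unless the caller already set it.
    kubernetes.setdefault("enhanced_container_insights", True)
    # Disables both DCGM Exporter and Neuron Monitor, as noted above.
    kubernetes["accelerated_compute_metrics"] = False
    return updated


if __name__ == "__main__":
    print(json.dumps(disable_accelerated_compute({}), indent=2))
```

Run on an empty config, this prints the same structure as the JSON above.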

dbcelm · 2024-08-22T08:54:55Z

@movence even after this change, I don't see GPU metrics widgets on Container Insights Dashboard

movence · 2024-08-26T13:27:43Z

> I don't see GPU metrics widgets on Container Insights Dashboard

@dbcelm Are you looking for an option to disable GPU monitoring or to disable DCGM Exporter only for your cluster?

GPU widgets in the Container Insights Dashboard are displayed conditionally, only when there are GPU metrics to generate the widgets from. If you followed the instructions above to disable GPU monitoring, there will be no GPU metrics, since the flag disables BOTH DCGM Exporter AND Neuron Monitor.

dbcelm · 2024-08-26T18:06:06Z

@movence I have installed DCGM Exporter helm chart separately on this cluster

movence · 2024-08-29T15:19:19Z

@dbcelm thanks for the information. Here are some possible reasons:

The DCGM Exporter that was spun up separately does not have the required label (k8s-app=dcgm-exporter-service), which CWAgent uses to look up DCGM Exporter through its service discovery mechanism within a node.
CWAgent uses TLS to communicate with DCGM Exporter, with certs generated by the Helm charts. These certs are added to both CWAgent and DCGM Exporter as secrets using volume mounts. DCGM Exporters not managed by the Helm charts might be missing this TLS setup.
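
Both conditions can be checked against an exported manifest. The following is a minimal sketch in Python, assuming the manifest of the separately installed exporter has been loaded into a dict (for example from `kubectl get ... -o json`). The function names are hypothetical, and which object must carry the label is an assumption here; the secret check only confirms that some secret-backed volume is mounted, not that it holds the certs CWAgent expects.

```python
# Label taken from the comment above; CWAgent uses it for service discovery.
REQUIRED_LABEL = ("k8s-app", "dcgm-exporter-service")


def has_required_label(manifest: dict) -> bool:
    """True if the manifest's metadata carries k8s-app=dcgm-exporter-service."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    key, value = REQUIRED_LABEL
    return labels.get(key) == value


def mounts_any_secret(pod_spec: dict) -> bool:
    """True if the pod spec declares at least one secret-backed volume."""
    return any("secret" in volume for volume in pod_spec.get("volumes") or [])
```

A standalone DCGM Exporter for which either check returns False matches one of the two possible reasons listed above.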

dbcelm · 2024-09-10T06:02:09Z

@movence I have deployed through the add-on and now the DCGM exporter pods are working. But I still don't see GPU metrics on the dashboard. I suspect it could be because DCGM Exporter is not running in "hostNetwork: true" mode, as I use the Cilium CNI.

Is there a way I can configure "hostNetwork: true" when deploying through the add-on to test this?

lisguo added the bug label on Jul 12, 2024
