KubeFlow on GPUs

These are the steps currently necessary to enable KubeFlow-compatible GPU support on a worker node. Eventually these steps will be automated by the kubernetes-worker charm.

On the worker node

Use juju ssh to connect to each GPU-enabled kubernetes-worker node and perform the following steps:

# add nvidia-docker repo
curl -s -L | sudo apt-key add -
curl -s -L | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# add docker repo
curl -fsSL | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] \
   $(lsb_release -cs) \
# install all the necessary bits
sudo apt-get update
sudo apt-get remove docker docker-engine && sudo apt-get install docker-ce nvidia-docker2

# if you get driver/library version mismatches:
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia
sudo modprobe nvidia
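
The rmmod sequence above stops with an error for any module that is not currently loaded. A minimal sketch of a more forgiving variant follows, assuming the usual lsmod layout (module name in the first column). The helper name loaded_nvidia_modules is hypothetical, and it reads lsmod-style text on stdin so it can be tried on a machine without a GPU.

```shell
#!/bin/sh
# Hypothetical helper: print the NVIDIA kernel modules that are currently
# loaded, in the order they must be unloaded (dependents before nvidia itself).
# Reads lsmod-style text on stdin.
loaded_nvidia_modules() {
    input=$(cat)
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # The first column of lsmod output is the module name; match it exactly.
        if printf '%s\n' "$input" | awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            printf '%s\n' "$mod"
        fi
    done
}

# Usage on a worker node (commented out: needs root and no process using the GPU):
# for mod in $(lsmod | loaded_nvidia_modules); do sudo rmmod "$mod"; done
# sudo modprobe nvidia
```

The unload order is the same as in the manual steps; only the "is it loaded" check is new.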

# now, this command should work:
nvidia-smi -a

# after restarting docker, the nvidia runtime should work as well:
sudo systemctl restart docker.service
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

# to make the nvidia runtime change permanent, edit the charm template on each worker node
# (NODE_NR is the unit number of the kubernetes-worker unit on that node):
sudo vi /var/lib/juju/agents/unit-kubernetes-worker-${NODE_NR}/charm/templates/docker.systemd
# and replace the "docker daemon" invocation with "dockerd"

# then apply the same change to the installed systemd unit:
sudo sed -i 's|ExecStart=/usr/bin/docker daemon -H fd:// $DOCKER_OPTS|ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS|' /lib/systemd/system/docker.service
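
The substitution above only takes effect if the ExecStart line matches exactly, so it can be worth previewing it before touching the real unit file. A minimal sketch, assuming the same expression as above without -i so nothing is modified. The helper name preview_dockerd_unit and the scratch file are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the given file(s), or stdin, with the
# "docker daemon" invocation rewritten to "dockerd". Nothing is edited in place.
preview_dockerd_unit() {
    sed 's|ExecStart=/usr/bin/docker daemon -H fd:// $DOCKER_OPTS|ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS|' "$@"
}

# Dry run against a scratch file containing a sample ExecStart line.
scratch=$(mktemp)
printf '%s\n' '[Service]' 'ExecStart=/usr/bin/docker daemon -H fd:// $DOCKER_OPTS' > "$scratch"
preview_dockerd_unit "$scratch" | grep '^ExecStart='
rm -f "$scratch"
```

On a worker node, preview_dockerd_unit /lib/systemd/system/docker.service | grep '^ExecStart=' shows what the unit would look like. If the line still reads "docker daemon", the pattern did not match and the sed command needs adjusting.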

# reload the systemd unit files and restart docker
sudo systemctl daemon-reload
sudo systemctl restart docker.service

# once the juju client changes below have been applied, you can verify that nvidia is the default runtime
# (note that no --runtime flag is passed here):
docker run --rm nvidia/cuda nvidia-smi
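
The default runtime can also be checked without starting a container. A minimal sketch, assuming docker info exposes the DefaultRuntime field through --format. The helper name is_nvidia_default is hypothetical, and it takes the runtime name as an argument so it can be exercised without docker installed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given runtime name is "nvidia".
is_nvidia_default() {
    if [ "$1" = "nvidia" ]; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is '$1', expected nvidia" >&2
        return 1
    fi
}

# Usage on a worker node (commented out: requires a running docker daemon):
# is_nvidia_default "$(docker info --format '{{.DefaultRuntime}}')"
```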

On the juju client machine

juju config kubernetes-worker kubelet-extra-args="feature-gates=DevicePlugins=true"
juju config kubernetes-worker docker-opts="--default-runtime=nvidia"

# you also need to deploy the nvidia device plugin daemonset to expose the GPUs on the worker nodes
kubectl create -f
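
Once the daemonset is running, each GPU node should advertise its GPUs as an allocatable resource. A minimal sketch of a check, assuming the resource is named nvidia.com/gpu (the name used by the NVIDIA device plugin; adjust it if your plugin registers a different one). The helper names list_node_gpus and gpu_nodes are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list each node with its allocatable nvidia.com/gpu count.
# kubectl prints <none> for nodes that do not advertise the resource.
list_node_gpus() {
    kubectl get nodes --no-headers \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Hypothetical helper: keep only the nodes that advertise at least one GPU.
# Reads "NAME COUNT" lines on stdin.
gpu_nodes() {
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage on the juju client machine (commented out: requires cluster access):
# list_node_gpus | gpu_nodes
```

An empty result usually means the daemonset pods are not running on the GPU nodes, or the kubelet was not restarted with the DevicePlugins feature gate.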