GPUs example
We will follow the guide from Nvidia to deploy the gpu-operator into a Kubernetes cluster built by Claudie. If you decide to use a different cloud provider, make sure you fulfill the requirements listed in the prerequisites before continuing.
In this example we will be using GenesisCloud as our provider, with the following config:
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: genesis-example
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: genesiscloud
      providerType: genesiscloud
      secretRef:
        name: genesiscloud-secret
        namespace: secrets
  nodePools:
    dynamic:
      - name: gencloud-cpu
        providerSpec:
          name: genesiscloud
          region: ARC-IS-HAF-1
        count: 1
        serverType: vcpu-2_memory-4g_disk-80g
        image: "Ubuntu 22.04"
        storageDiskSize: 50
      - name: gencloud-gpu
        providerSpec:
          name: genesiscloud
          region: ARC-IS-HAF-1
        count: 2
        serverType: vcpu-4_memory-12g_disk-80g_nvidia3080-1
        image: "Ubuntu 22.04"
        storageDiskSize: 50
  kubernetes:
    clusters:
      - name: gpu-example
        version: v1.27.0
        network: 172.16.2.0/24
        pools:
          control:
            - gencloud-cpu
          compute:
            - gencloud-gpu
After the InputManifest has been successfully built by Claudie, we deploy the gpu-operator to the gpu-example Kubernetes cluster.
- Create a namespace for the gpu-operator.
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
- Add the Nvidia Helm repository.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
&& helm repo update
- Install the operator.
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
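Because `--generate-name` gives the release a generated name, it can be useful to look the release up afterwards. The following is a minimal sketch, assuming Helm 3 and that the operator is the only release in the gpu-operator namespace. The function name `gpu_operator_release` is hypothetical and not part of Helm or Claudie.

```shell
# Hypothetical helper: print the name of the deployed release in the
# gpu-operator namespace. Returns non-zero when no deployed release exists.
# Assumes the operator is the only release in that namespace.
gpu_operator_release() {
  release="$(helm list -n gpu-operator --deployed --short | head -n 1)"
  [ -n "$release" ] && printf '%s\n' "$release"
}
```

It could be used as `gpu_operator_release || echo "gpu-operator release not found"`.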
- Wait for the pods in the gpu-operator namespace to be ready.
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4lrbz                                       1/1     Running     0          10m
gpu-feature-discovery-5x88d                                       1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-gc-84ff8f47tn7cd   1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-master-757c27tm6   1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-worker-495z2       1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-worker-n8fl6       1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-worker-znsk4       1/1     Running     0          10m
gpu-operator-6dfb9bd487-2gxzr                                     1/1     Running     0          10m
nvidia-container-toolkit-daemonset-jnqwn                          1/1     Running     0          10m
nvidia-container-toolkit-daemonset-x9t56                          1/1     Running     0          10m
nvidia-cuda-validator-l4w85                                       0/1     Completed   0          10m
nvidia-cuda-validator-lqxhq                                       0/1     Completed   0          10m
nvidia-dcgm-exporter-l9nzt                                        1/1     Running     0          10m
nvidia-dcgm-exporter-q7c2x                                        1/1     Running     0          10m
nvidia-device-plugin-daemonset-dbjjl                              1/1     Running     0          10m
nvidia-device-plugin-daemonset-x5kfs                              1/1     Running     0          10m
nvidia-driver-daemonset-dcq4g                                     1/1     Running     0          10m
nvidia-driver-daemonset-sjjlb                                     1/1     Running     0          10m
nvidia-operator-validator-jbc7r                                   1/1     Running     0          10m
nvidia-operator-validator-q59mc                                   1/1     Running     0          10m
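Rather than re-running `kubectl get pods` by hand, the wait can be scripted. A plain `kubectl wait --for=condition=Ready pods --all` would likely time out here, because the one-shot validator pods end up Completed rather than Ready. The sketch below instead parses the pod listing. The function names `all_pods_settled` and `wait_for_gpu_operator` are hypothetical, and the 10 second polling interval is an arbitrary choice.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and succeeds
# only if at least one pod is listed and every pod is either Completed or
# Running with all of its containers ready.
all_pods_settled() {
  awk '
    NR == 1 { next }                 # skip the header row
    $3 == "Completed" { next }       # finished one-shot pods (validators)
    {
      split($2, ready, "/")          # READY column looks like "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { exit (bad || NR < 2 ? 1 : 0) }
  '
}

# Hypothetical helper: poll every 10 seconds until the namespace has settled.
wait_for_gpu_operator() {
  until kubectl get pods -n gpu-operator | all_pods_settled; do
    sleep 10
  done
}
```

Calling `wait_for_gpu_operator` blocks until the listing looks like the output above. It has no timeout, so in automation it may be worth wrapping it in one.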
When all pods are ready, you should be able to verify that the GPUs are available to the cluster:
kubectl get nodes -o json | jq -r '.items[] | {name:.metadata.name, gpus:.status.capacity."nvidia.com/gpu"}'
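If `jq` is not installed, a similar check can be done with kubectl's own `custom-columns` output. The following is a minimal sketch that sums the `nvidia.com/gpu` capacity over all nodes. The function name `total_gpus` is hypothetical.

```shell
# Hypothetical helper: sum the nvidia.com/gpu capacity over all nodes.
# Nodes without a GPU report "<none>" in the GPUS column and are skipped.
total_gpus() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.capacity.nvidia\.com/gpu' |
    awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}
```

For the manifest in this example, which requests two GPU nodes, `total_gpus` would be expected to print 2, assuming each node of that server type exposes a single GPU.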
- Deploy an example manifest that uses one of the available GPUs from the worker nodes.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
The logs of the pod should show output similar to the following:
kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
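The same check can be scripted, for example in a smoke test. The sketch below waits for the pod to finish and then looks for the `Test PASSED` line in its logs. It assumes a kubectl client recent enough to support `kubectl wait --for=jsonpath=...` (v1.23 or later). The function names and the 300 second timeout are hypothetical choices.

```shell
# Hypothetical helper: succeeds if the log on stdin reports a passed test.
vectoradd_passed() {
  grep -q 'Test PASSED'
}

# Hypothetical helper: wait for the pod to reach the Succeeded phase,
# then check its logs. The timeout value is an arbitrary assumption.
check_vectoradd() {
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
    pod/cuda-vectoradd --timeout=300s &&
    kubectl logs cuda-vectoradd | vectoradd_passed
}
```

Running `check_vectoradd && echo "GPU workload OK"` prints the message only when the sample ran to completion and reported success.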