GPUs example
We will follow the guide from NVIDIA to deploy the gpu-operator into a Claudie-built Kubernetes cluster. If you decide to use a different cloud provider, make sure you meet the requirements listed in the guide's prerequisites before continuing.
AWS GPU Example
In this example we will be using AWS as our provider. AWS GPU instances (like g4dn.xlarge) come with GPUs attached, so no additional machineSpec configuration is needed:
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: aws-gpu-example
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: aws-1
      providerType: aws
      secretRef:
        name: aws-secret
        namespace: secrets
  nodePools:
    dynamic:
      - name: control-aws
        providerSpec:
          name: aws-1
          region: eu-central-1
          zone: eu-central-1a
        count: 1
        serverType: t3.medium
        # AMI ID of the image Ubuntu 24.04.
        # Make sure to update it according to the region.
        image: ami-07eef52105e8a2059
      - name: gpu-aws
        providerSpec:
          name: aws-1
          region: eu-central-1
          zone: eu-central-1a
        count: 2
        serverType: g4dn.xlarge
        # AMI ID of the image Ubuntu 24.04.
        # Make sure to update it according to the region.
        image: ami-07eef52105e8a2059
        storageDiskSize: 50
  kubernetes:
    clusters:
      - name: gpu-example
        version: v1.34.0
        network: 172.16.2.0/24
        pools:
          control:
            - control-aws
          compute:
            - gpu-aws
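Because AMI IDs are region-specific, the `image` value in the manifest above has to be looked up again for every region. The following is a minimal sketch of one way to do that with the AWS CLI. It assumes Canonical's publisher account ID (`099720109477`) and Canonical's current naming pattern for Ubuntu 24.04 amd64 images; the function name `latest_ubuntu_ami` is made up for illustration and is not part of Claudie.

```shell
# Assumptions: Canonical's AWS owner ID and its image naming pattern
# for Ubuntu 24.04 (noble) amd64 server images.
CANONICAL_OWNER_ID="099720109477"
UBUNTU_NAME_PATTERN="ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-*"

# Print the ID of the most recently created matching AMI in the given region.
# Requires an AWS CLI that is already configured with credentials.
latest_ubuntu_ami() {
  region="$1"
  aws ec2 describe-images \
    --region "$region" \
    --owners "$CANONICAL_OWNER_ID" \
    --filters "Name=name,Values=$UBUNTU_NAME_PATTERN" "Name=state,Values=available" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text
}
```

Calling `latest_ubuntu_ami eu-central-1` should print a single AMI ID that can be pasted into the `image` field of both nodepools.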
GCP GPU Example
For GCP, you must explicitly specify the GPU type and count using the machineSpec block. GCP requires both nvidiaGpuCount and nvidiaGpuType to attach GPUs to instances:
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gcp-gpu-example
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: gcp-1
      providerType: gcp
      # GCP Spot VM support is available from claudie-config v0.11.4+
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.11.4
        path: "templates/terraformer/gcp"
      secretRef:
        name: gcp-secret
        namespace: secrets
  nodePools:
    dynamic:
      - name: control-gcp
        providerSpec:
          name: gcp-1
          region: us-central1
          zone: us-central1-a
        count: 1
        serverType: e2-medium
        image: ubuntu-2404-noble-amd64-v20251001
      - name: gpu-gcp
        providerSpec:
          name: gcp-1
          region: us-central1
          zone: us-central1-a
        count: 2
        # Use n1-standard machine types for GPU attachment
        serverType: n1-standard-4
        image: ubuntu-2404-noble-amd64-v20251001
        storageDiskSize: 50
        # GPU configuration required for GCP
        machineSpec:
          nvidiaGpuCount: 1
          nvidiaGpuType: nvidia-tesla-t4
  kubernetes:
    clusters:
      - name: gpu-example
        version: v1.34.0
        network: 172.16.2.0/24
        pools:
          control:
            - control-gcp
          compute:
            - gpu-gcp
GCP GPU Requirements
- The nvidiaGpuType field is required when nvidiaGpuCount > 0 for GCP providers.
- Available GPU types vary by zone. Check GCP GPU regions and zones for availability.
- Common GPU types: nvidia-tesla-t4, nvidia-tesla-v100, nvidia-tesla-a100, nvidia-l4.
- GPU instances cannot be live migrated, so they will be terminated during maintenance events.
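Since the available GPU types differ between zones, it can help to list them before choosing a value for `nvidiaGpuType`. The sketch below assumes an authenticated `gcloud` CLI with a default project set; the function name `gpu_types_in_zone` is hypothetical, and the filter expression is one common way to restrict the listing to a single zone.

```shell
# List the accelerator (GPU) type names offered in one GCP zone.
# Assumes gcloud is installed, authenticated, and has a default project.
gpu_types_in_zone() {
  zone="$1"
  gcloud compute accelerator-types list \
    --filter="zone:$zone" \
    --format="value(name)"
}
```

For the manifest above, `gpu_types_in_zone us-central1-a` would be expected to include `nvidia-tesla-t4` in its output if that type is offered there.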
GCP Spot GPU Inference Example (Autoscaled)
GCP Spot VMs offer 60–91% cost savings over on-demand pricing, in exchange for possible reclamation with about 30 seconds of notice. Combined with GPU attachment and scale-from-zero autoscaling, this is a common pattern for cost-effective GPU inference: the nodepool scales up when work arrives and back down to zero when idle.
To request spot nodes, set spot: true on a GCP dynamic nodepool. Spot is only supported on worker (compute) nodepools and is rejected by the webhook on control-plane nodepools or unsupported providers. Claudie automatically applies the label claudie.io/spot=true and the taint claudie.io/spot=true:NoSchedule to every node in the pool, so only pods with a matching toleration are scheduled there.
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: gcp-spot-gpu-autoscaled
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: gcp-1
      providerType: gcp
      # GCP Spot VM support is available from claudie-config v0.11.4+
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.11.4
        path: "templates/terraformer/gcp"
      secretRef:
        name: gcp-secret
        namespace: secrets
  nodePools:
    dynamic:
      - name: control-gcp
        providerSpec:
          name: gcp-1
          region: us-central1
          zone: us-central1-a
        count: 1
        serverType: e2-medium
        image: ubuntu-2404-noble-amd64-v20251001
      - name: spot-gpu-workers
        providerSpec:
          name: gcp-1
          region: us-central1
          zone: us-central1-a
        # Use autoscaler instead of a fixed count; scales to zero when idle.
        autoscaler:
          min: 0
          max: 4
        serverType: n1-standard-4
        image: ubuntu-2404-noble-amd64-v20251001
        storageDiskSize: 50
        machineSpec:
          nvidiaGpuCount: 1
          nvidiaGpuType: nvidia-tesla-t4
        # GCP Spot VMs: significant cost savings for interruptible inference workloads.
        spot: true
  kubernetes:
    clusters:
      - name: spot-gpu-cluster
        version: v1.34.0
        network: 172.16.4.0/24
        pools:
          control:
            - control-gcp
          compute:
            - spot-gpu-workers
Spot reclamation
GCP may reclaim spot instances with approximately 30 seconds of notice. Design workloads on spot nodepools to handle abrupt termination gracefully (e.g. checkpoint frequently, use job restart policies).
Pods that need to run on this nodepool must include both a spot toleration (at the pod spec level) and a GPU resource request (under spec.containers[]):
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  # Tolerate the spot taint so the pod is allowed onto spot nodes.
  tolerations:
    - key: claudie.io/spot
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: inference
      image: my-inference:latest
      # Request a GPU so the scheduler (and the autoscaler) place this on the GPU pool.
      resources:
        limits:
          nvidia.com/gpu: 1
GPU Operator on spot nodepools
The spot taint claudie.io/spot=true:NoSchedule also keeps the NVIDIA GPU Operator components off spot nodes unless they tolerate it. When installing the operator, add a toleration for claudie.io/spot so its driver, device-plugin and toolkit daemonsets schedule on spot GPU nodes (otherwise nvidia.com/gpu is never advertised). For example, with Helm:
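A minimal sketch of such an install follows. It assumes the chart's `daemonsets.tolerations` value is what controls the tolerations of the operator-managed daemonsets, and that the `nvidia` Helm repository has already been added. The file name and both function names are hypothetical.

```shell
# Hypothetical file name for the Helm values override; any path works.
SPOT_VALUES_FILE="${SPOT_VALUES_FILE:-gpu-operator-spot-values.yaml}"

# Write a values override that adds the Claudie spot toleration.
# Setting daemonsets.tolerations replaces the chart's default list, so the
# default nvidia.com/gpu toleration is repeated here (assumption: chart
# defaults at the time of writing).
write_spot_values() {
  cat > "$SPOT_VALUES_FILE" <<'EOF'
daemonsets:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: claudie.io/spot
      operator: Equal
      value: "true"
      effect: NoSchedule
EOF
}

# Install the operator with the override plus the CDI/NRI settings
# used by the regular install command in this guide.
install_gpu_operator_for_spot() {
  write_spot_values
  helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set cdi.enabled=true --set cdi.nriPluginEnabled=true \
    -f "$SPOT_VALUES_FILE"
}
```

Running `install_gpu_operator_for_spot` writes the values file and installs the chart with it. The bundled node-feature-discovery worker may need the same toleration (likely under `node-feature-discovery.worker.tolerations`) so that spot GPU nodes get labeled; that part is not covered by this sketch.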
Exoscale GPU Example
For Exoscale, GPU instances have the GPU built into the instance type (like AWS), so no additional machineSpec configuration is needed. Simply use a GPU instance type such as gpu2.small as the serverType:
apiVersion: claudie.io/v1beta1
kind: InputManifest
metadata:
  name: exoscale-gpu-example
  labels:
    app.kubernetes.io/part-of: claudie
spec:
  providers:
    - name: exoscale-1
      providerType: exoscale
      # Exoscale templates are supported from claudie-config v0.9.18+
      templates:
        repository: "https://github.com/berops/claudie-config"
        tag: v0.9.18
        path: "templates/terraformer/exoscale"
      secretRef:
        name: exoscale-secret
        namespace: secrets
  nodePools:
    dynamic:
      - name: control-exo
        providerSpec:
          name: exoscale-1
          region: ch-gva-2
        count: 1
        serverType: standard.medium
        image: "Linux Ubuntu 24.04 LTS 64-bit"
      - name: gpu-exo
        providerSpec:
          name: exoscale-1
          region: at-vie-1
        count: 1
        serverType: gpu2.small
        image: "Linux Ubuntu 24.04 LTS 64-bit"
        storageDiskSize: 50
  kubernetes:
    clusters:
      - name: gpu-example
        version: v1.34.0
        network: 172.16.2.0/24
        pools:
          control:
            - control-exo
          compute:
            - gpu-exo
Exoscale GPU Requirements
- GPU instance types require account authorization from Exoscale. Contact Exoscale support to enable GPU quota.
- Available GPU types and zones may change. List current offerings with exo compute instance-type list --verbose | grep -i gpu or check the Exoscale pricing page.
Deploying the GPU Operator
After the InputManifest has been successfully built by Claudie, deploy the gpu-operator to the gpu-example Kubernetes cluster.
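The first two steps below map to commands along the following lines. This is a sketch that assumes `kubectl` and `helm` already point at the gpu-example cluster and uses NVIDIA's public Helm repository URL; the function names are made up for illustration.

```shell
NAMESPACE="gpu-operator"
# NVIDIA's public Helm repository.
NVIDIA_HELM_REPO="https://helm.ngc.nvidia.com/nvidia"

# Step 1: create the namespace for the operator.
create_operator_namespace() {
  kubectl create namespace "$NAMESPACE"
}

# Step 2: register the NVIDIA chart repository and refresh the local index.
add_nvidia_repo() {
  helm repo add nvidia "$NVIDIA_HELM_REPO" && helm repo update
}
```

Call `create_operator_namespace` and then `add_nvidia_repo` before installing the chart. Because the install command in this guide passes `--create-namespace`, the first call can be skipped if the namespace does not need to exist beforehand.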
- Create a namespace for the gpu-operator.
- Add Nvidia Helm repository.
- Install the operator.
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator --set cdi.enabled=true --set cdi.nriPluginEnabled=true
Claudie overwrites /etc/containerd/config.toml on every reconciliation loop. The cdi and nri options in the install command are enabled so that the operator does not conflict with the Claudie-reconciled /etc/containerd/config.toml.
- Wait for the pods in the gpu-operator namespace to be ready.
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4lrbz                                       1/1     Running     0          10m
gpu-feature-discovery-5x88d                                       1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-gc-84ff8f47tn7cd   1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-master-757c27tm6   1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-worker-495z2       1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-worker-n8fl6       1/1     Running     0          10m
gpu-operator-1708080094-node-feature-discovery-worker-znsk4       1/1     Running     0          10m
gpu-operator-6dfb9bd487-2gxzr                                     1/1     Running     0          10m
nvidia-container-toolkit-daemonset-jnqwn                          1/1     Running     0          10m
nvidia-container-toolkit-daemonset-x9t56                          1/1     Running     0          10m
nvidia-cuda-validator-l4w85                                       0/1     Completed   0          10m
nvidia-cuda-validator-lqxhq                                       0/1     Completed   0          10m
nvidia-dcgm-exporter-l9nzt                                        1/1     Running     0          10m
nvidia-dcgm-exporter-q7c2x                                        1/1     Running     0          10m
nvidia-device-plugin-daemonset-dbjjl                              1/1     Running     0          10m
nvidia-device-plugin-daemonset-x5kfs                              1/1     Running     0          10m
nvidia-driver-daemonset-dcq4g                                     1/1     Running     0          10m
nvidia-driver-daemonset-sjjlb                                     1/1     Running     0          10m
nvidia-operator-validator-jbc7r                                   1/1     Running     0          10m
nvidia-operator-validator-q59mc                                   1/1     Running     0          10m
When all pods are ready, verify that the nodes report their GPUs as allocatable capacity:
kubectl get nodes -o json | jq -r '.items[] | {name:.metadata.name, gpus:.status.capacity."nvidia.com/gpu"}'
- Deploy an example manifest that uses one of the available GPUs from the worker nodes.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
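One way to apply this manifest and read its output is sketched below. It assumes the manifest has been saved as `cuda-vectoradd.yaml` (a hypothetical file name) and that `kubectl` points at the gpu-example cluster; the function name is also made up.

```shell
# Hypothetical file name: the Pod manifest above saved to disk.
MANIFEST_FILE="${MANIFEST_FILE:-cuda-vectoradd.yaml}"

# Apply the manifest, wait until the pod has run to completion
# (phase Succeeded, up to 5 minutes), then print its logs.
run_cuda_vectoradd() {
  kubectl apply -f "$MANIFEST_FILE" &&
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/cuda-vectoradd --timeout=300s &&
    kubectl logs pod/cuda-vectoradd
}
```

If `run_cuda_vectoradd` times out, the pod is probably still Pending, which usually means no node is advertising `nvidia.com/gpu` yet.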
From the logs of the pods you should be able to see