Running GPUStack with NVIDIA MIG: A Deep Dive into Multi-Instance GPU Orchestration
Multi-Instance GPU (MIG) technology promises to maximize GPU utilization by partitioning a single GPU into isolated instances. But getting MIG to work with container orchestration tools like GPUStack requires navigating a maze of CDI configuration, device enumeration, and runtime patches. This technical deep-dive shares our battle-tested solutions.
NVIDIA's Multi-Instance GPU (MIG) technology, introduced with the Ampere architecture, enables a single physical GPU to be partitioned into up to seven isolated instances. Each instance has dedicated compute resources, memory bandwidth, and L2 cache - making it ideal for multi-tenant AI inference workloads.
Yet deploying MIG with container orchestration tools remains challenging. The intersection of CDI (Container Device Interface), NVIDIA container runtime, and application-level device enumeration creates a complex web of potential failure points.
This article documents our journey deploying GPUStack with MIG on NVIDIA H200 NVL GPUs, including:
- 8 distinct bugs discovered and fixed
- Runtime patches for GPUStack and vLLM
- Complete automation scripts for production deployment
The MIG Value Proposition
Before diving into implementation, let's understand why MIG matters for AI infrastructure.
Traditional GPU Allocation
Without MIG, GPUs are allocated as whole units:
┌─────────────────────────────────────────────────────────────┐
│ TRADITIONAL GPU ALLOCATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ GPU 0 (H200 - 141GB) GPU 1 (H200 - 141GB) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ │ │ │ │
│ │ Model A │ │ Model B │ │
│ │ (uses 20GB) │ │ (uses 35GB) │ │
│ │ │ │ │ │
│ │ ░░░░░░░░░░ │ │ ░░░░░░░░░░ │ │
│ │ 121GB WASTED │ │ 106GB WASTED │ │
│ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ Utilization: ~20% Utilization: ~25% │
│ │
└─────────────────────────────────────────────────────────────┘
MIG-Enabled Allocation
With MIG, a single GPU can serve multiple isolated workloads:
┌─────────────────────────────────────────────────────────────┐
│ MIG-ENABLED GPU ALLOCATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ GPU 0 (H200 - Full) GPU 1 (H200 - MIG Enabled) │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ │ │ ┌─────────────┐ │ │
│ │ Large Model │ │ │ 4g.71gb │ │ Model B │
│ │ (needs full │ │ │ (71GB) │ │ │
│ │ GPU memory) │ │ └─────────────┘ │ │
│ │ │ │ ┌───────┐ │ │
│ │ │ │ │2g.35gb│ │ Model C │
│ │ │ │ │(35GB) │ │ │
│ │ │ │ └───────┘ │ │
│ │ │ │ ┌────┐ │ │
│ │ │ │ │1g │ │ Model D │
│ │ │ │ │18GB│ │ │
│ │ │ │ └────┘ │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ Utilization: 100% Utilization: ~90% │
│ (workload needs it) (3 isolated workloads) │
│ │
└─────────────────────────────────────────────────────────────┘
MIG Profile Sizes (H200 NVL)
| Profile | GPU Memory | Compute Slices | Use Case |
|---|---|---|---|
| 1g.18gb | 18 GB | 1 of 7 | Small inference, testing |
| 2g.35gb | 35 GB | 2 of 7 | Medium models (7B params) |
| 3g.47gb | 47 GB | 3 of 7 | Large models (13B params) |
| 4g.71gb | 71 GB | 4 of 7 | Very large models (30B+) |
| 7g.141gb | 141 GB | 7 of 7 | Full GPU (no partitioning) |
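If you want to sanity-check this table against your own hardware, a minimal pynvml sketch (assuming the nvidia-ml-py package is installed) can enumerate each GPU's MIG mode and the profiles actually instantiated:
# mig_inventory.py - illustrative sketch using pynvml (nvidia-ml-py);
# lists MIG mode and instantiated MIG profiles per GPU.
import pynvml

def as_str(value):
    # Older pynvml versions return bytes for names/UUIDs.
    return value.decode() if isinstance(value, bytes) else value

pynvml.nvmlInit()
try:
    for idx in range(pynvml.nvmlDeviceGetCount()):
        dev = pynvml.nvmlDeviceGetHandleByIndex(idx)
        name = as_str(pynvml.nvmlDeviceGetName(dev))
        try:
            current, _pending = pynvml.nvmlDeviceGetMigMode(dev)
        except pynvml.NVMLError:
            current = pynvml.NVML_DEVICE_MIG_DISABLE  # MIG not supported on this GPU
        print(f"GPU {idx}: {name} MIG={'on' if current == pynvml.NVML_DEVICE_MIG_ENABLE else 'off'}")
        if current != pynvml.NVML_DEVICE_MIG_ENABLE:
            continue
        for mig_idx in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mig_idx)
            except pynvml.NVMLError_NotFound:
                continue  # MIG indices can be non-contiguous
            profile = as_str(pynvml.nvmlDeviceGetName(mig))
            mem_gb = pynvml.nvmlDeviceGetMemoryInfo(mig).total / 1024**3
            uuid = as_str(pynvml.nvmlDeviceGetUUID(mig))
            print(f"  MIG {mig_idx}: {profile} {mem_gb:.0f} GiB {uuid}")
finally:
    pynvml.nvmlShutdown()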
Our Environment
Hardware Configuration
┌─────────────────────────────────────────────────────────────┐
│ HARDWARE SETUP │
├─────────────────────────────────────────────────────────────┤
│ │
│ Server: ai-1 │
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ GPU 0 │ │ GPU 1 │ │
│ │ NVIDIA H200 NVL │ │ NVIDIA H200 NVL │ │
│ │ MIG: DISABLED │ │ MIG: ENABLED │ │
│ │ Memory: 143 GB │ │ Memory: 141 GB total │ │
│ │ │ │ │ │
│ │ UUID: GPU-aaaaaaaa- │ │ UUID: GPU-11111111- │ │
│ │ bbbb-cccc-... │ │ 2222-3333-... │ │
│ │ │ │ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │ Full GPU │ │ │ │ MIG 4g.71gb │ │ │
│ │ │ Available │ │ │ │ 71 GB │ │ │
│ │ └─────────────────┘ │ │ │ MIG-xxxxxxxx-.. │ │ │
│ │ │ │ └─────────────────┘ │ │
│ │ │ │ ┌───────────┐ │ │
│ │ │ │ │ MIG 2g.35gb│ │ │
│ │ │ │ │ 35 GB │ │ │
│ │ │ │ │ MIG-abc60c│ │ │
│ │ │ │ └───────────┘ │ │
│ │ │ │ ┌─────┐ │ │
│ │ │ │ │1g.18│ 18 GB │ │
│ │ │ │ │MIG- │ │ │
│ │ │ │ └─────┘ │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Software Stack
| Component | Version |
|---|---|
| GPUStack | v2.0.3 |
| gpustack-runtime | v0.1.38.post4 |
| vLLM | 0.13.0 |
| NVIDIA Driver | 590.48.01 |
| NVIDIA Container Toolkit | Latest |
| Container Runtime | Docker with nvidia-container-runtime |
The Problem: Eight Distinct Failures
When we first attempted to deploy models on MIG devices through GPUStack, we encountered a cascade of failures. Each fix revealed another underlying issue - a classic "peeling the onion" debugging experience.
Failure Cascade Overview
┌─────────────────────────────────────────────────────────────┐
│ FAILURE CASCADE │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. CDI Vendor Mismatch │
│ │ "unresolvable CDI devices runtime.nvidia.com/gpu" │
│ ▼ │
│ 2. CDI Device Naming │
│ │ Indices don't match parent:child format │
│ ▼ │
│ 3. MIG Temperature/Power Queries │
│ │ pynvml.NVMLError on MIG device queries │
│ ▼ │
│ 4. MIG Enumeration NotFound │
│ │ Non-contiguous MIG indices throw errors │
│ ▼ │
│ 5. MIG Name Reuse Bug │
│ │ All MIG devices show same name │
│ ▼ │
│ 6. MIG Index Collision │
│ │ MIG indices start at 0, collide with non-MIG GPU │
│ ▼ │
│ 7. CUDA_VISIBLE_DEVICES Index Mismatch │
│ │ GPUStack indices don't match CUDA enumeration │
│ ▼ │
│ 8. vLLM UUID Parsing │
│ │ "ValueError: invalid literal for int()" │
│ ▼ │
│ ✓ SUCCESS: Models deploy on MIG devices │
│ │
└─────────────────────────────────────────────────────────────┘
Root Cause Analysis
Issue 1: CDI Vendor Mismatch
Symptom:
unresolvable CDI devices runtime.nvidia.com/gpu=2
Root Cause:
CDI (Container Device Interface) uses a vendor prefix to namespace devices. The default NVIDIA CDI generation creates devices under nvidia.com/gpu, but GPUStack requests devices using runtime.nvidia.com/gpu.
Default CDI Output:
# nvidia.com/gpu - DEFAULT (wrong)
nvidia.com/gpu=0
nvidia.com/gpu=1
Required CDI Output:
# runtime.nvidia.com/gpu - REQUIRED
runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1
Fix:
nvidia-ctk cdi generate \
--vendor=runtime.nvidia.com \
--device-name-strategy=index \
--device-name-strategy=uuid \
--output=/etc/cdi/nvidia.yaml
Issue 2: CDI Device Naming Strategy
Symptom:
GPUStack requests MIG devices by index (e.g., runtime.nvidia.com/gpu=2), but CDI generates MIG devices using parent:child notation (e.g., runtime.nvidia.com/gpu=1:0).
Analysis:
MIG devices exist within a parent GPU context. The CDI default naming reflects this hierarchy:
GPU 0 (no MIG) → gpu=0
GPU 1 (MIG parent) → gpu=1
MIG instance 0 → gpu=1:0
MIG instance 1 → gpu=1:1
MIG instance 2 → gpu=1:2
But GPUStack's device allocation uses flat indices and UUIDs.
Fix:
Generate CDI with both index and UUID naming strategies; the uuid strategy is what enables UUID-based device selection:
nvidia-ctk cdi generate \
--vendor=runtime.nvidia.com \
--device-name-strategy=index \
--device-name-strategy=uuid \
--output=/etc/cdi/nvidia.yaml
Resulting CDI Devices:
runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1:0
runtime.nvidia.com/gpu=1:1
runtime.nvidia.com/gpu=1:2
runtime.nvidia.com/gpu=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
runtime.nvidia.com/gpu=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
runtime.nvidia.com/gpu=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
runtime.nvidia.com/gpu=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
runtime.nvidia.com/gpu=all
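To confirm the spec on disk really contains both naming strategies, a short sketch like the following (assuming PyYAML is available and the spec was written to /etc/cdi/nvidia.yaml) prints every fully-qualified device name:
# list_cdi_devices.py - sketch: print fully-qualified CDI device names
# from the generated spec (assumes PyYAML and the default spec path).
import yaml

with open("/etc/cdi/nvidia.yaml") as f:
    spec = yaml.safe_load(f)

kind = spec["kind"]  # e.g. "runtime.nvidia.com/gpu"
for device in spec["devices"]:
    print(f"{kind}={device['name']}")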
Issue 3: MIG Temperature/Power Query Failures
Symptom:
pynvml.NVMLError: NVML_ERROR_NOT_SUPPORTED
Root Cause:
The pynvml library (Python bindings for NVIDIA Management Library) throws errors when querying temperature and power for MIG device handles. MIG instances don't support these queries - only the parent GPU does.
Problematic Code (gpustack_runtime/detector/nvidia.py):
# These calls fail for MIG devices
mdev_temp = pynvml.nvmlDeviceGetTemperature(mdev, pynvml.NVML_TEMPERATURE_GPU)
mdev_power_used = pynvml.nvmlDeviceGetPowerUsage(mdev) // 1000
Fix:
Wrap queries in contextlib.suppress to gracefully handle failures:
import contextlib

mdev_temp = None
with contextlib.suppress(pynvml.NVMLError):
    mdev_temp = pynvml.nvmlDeviceGetTemperature(
        mdev,
        pynvml.NVML_TEMPERATURE_GPU,
    )

mdev_power_used = None
with contextlib.suppress(pynvml.NVMLError):
    mdev_power_used = pynvml.nvmlDeviceGetPowerUsage(mdev) // 1000
Issue 4: MIG Enumeration NotFound Errors
Symptom:
pynvml.NVMLError_NotFound during MIG device enumeration
Root Cause:
MIG device indices can be non-contiguous. If you create MIG instances 0, 1, 2, then delete instance 1, you're left with indices 0 and 2. The code assumed contiguous indexing:
mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)  # Throws NotFound!
Fix:
Add try/except handling to skip missing indices:
for mdev_idx in range(mdev_count):
    try:
        mdev = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(dev, mdev_idx)
    except pynvml.NVMLError_NotFound:
        continue  # Skip non-existent MIG indices
Issue 5: MIG Name Reuse Bug
Symptom:
All MIG devices displayed the same name (the first MIG device's profile name).
Root Cause:
The mdev_name variable was initialized outside the MIG device loop and never reset:
mdev_name = ""  # Initialized once
mdev_cores = 1
mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    # mdev_name keeps the value from the previous iteration!
    if some_condition:
        mdev_name = profile_name  # Only set conditionally
Fix:
Reset mdev_name inside the loop:
mdev_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(dev)
for mdev_idx in range(mdev_count):
    mdev_name = ""  # Reset for each MIG device
    mdev_cores = 1
    # ... rest of loop
Issue 6: MIG Index Collision
Symptom:
GPUStack showed MIG devices with indices 0, 1, 2 - but index 0 was already used by the non-MIG GPU. This caused confusion and potential device selection errors.
idx=0 name=NVIDIA H200 NVL # Non-MIG GPU
idx=0 name=4g.71gb # MIG device - COLLISION!
idx=1 name=2g.35gb
idx=2 name=1g.18gb
Root Cause:
MIG device index assignment used the local MIG index (0, 1, 2...) instead of a global index that accounts for non-MIG devices:
mdev_index = mdev_idx # Local MIG index, starts at 0
Fix:
Use len(ret) to start MIG indices after all previously detected devices:
mig_global_idx = len(ret)  # Start after non-MIG devices
for mdev_idx in range(mdev_count):
    mdev_index = mig_global_idx  # Global index
    mig_global_idx += 1
Corrected Output:
idx=0 name=NVIDIA H200 NVL # Non-MIG GPU
idx=1 name=4g.71gb # MIG device (unique index)
idx=2 name=2g.35gb
idx=3 name=1g.18gb
Issue 7: CUDA_VISIBLE_DEVICES Index Mismatch
Symptom:
Model containers failed to start with errors about invalid CUDA device indices.
Root Cause:
GPUStack assigns device indices 0, 1, 2, 3 to all detected devices (1 non-MIG GPU + 3 MIG devices). But CUDA's device enumeration is different - it only sees:
- Device 0: The non-MIG GPU
- Device 1: The MIG parent GPU (with MIG instances accessible via UUIDs)
When GPUStack sets CUDA_VISIBLE_DEVICES=2, CUDA fails because it doesn't have a device 2.
The Index Translation Problem:
┌─────────────────────────────────────────────────────────────┐
│ INDEX MISMATCH PROBLEM │
├─────────────────────────────────────────────────────────────┤
│ │
│ GPUStack View: CUDA View: │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ idx=0: H200 │ → │ device=0: H200│ │
│ │ idx=1: 4g.71gb│ ? │ device=1: MIG │ (parent) │
│ │ idx=2: 2g.35gb│ ? │ │ │
│ │ idx=3: 1g.18gb│ │ NO device 2 │ │
│ └───────────────┘ │ NO device 3 │ │
│ └───────────────┘ │
│ │
│ CUDA_VISIBLE_DEVICES=2 → ERROR: invalid device │
│ CUDA_VISIBLE_DEVICES=MIG-uuid → SUCCESS │
│ │
└─────────────────────────────────────────────────────────────┘
Fix:
Configure GPUStack to use UUIDs instead of indices for CUDA_VISIBLE_DEVICES:
# Old code: maps index to sequential index
alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}
# New code: maps index to UUID
alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}
Enable via environment variable:
GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE_ALIGNMENT=CUDA_VISIBLE_DEVICES
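To make the effect of this change concrete, here is an illustrative sketch using placeholder device lists that mirror our node (one full GPU plus three MIG instances); dev_indexes, dev_uuids, and devs are stand-ins for the runtime's internal structures:
# Illustrative only: placeholder device data mirroring our node
# (1 full GPU + 3 MIG instances) to show the alignment mapping.
dev_indexes = [0, 1, 2, 3]
dev_uuids = [
    "GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy",
    "MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz",
]
devs = dev_indexes  # stand-in for the detected device objects

# Old behaviour: index -> sequential index ("2" is not a valid CUDA device here)
old_alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}

# New behaviour: index -> UUID (always resolvable by CUDA)
new_alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}

# A model allocated GPUStack device 2 now gets a UUID:
allocated = [2]
print(",".join(new_alignment[i] for i in allocated))
# -> MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy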
Issue 8: vLLM UUID Parsing
Symptom:
ValueError: invalid literal for int() with base 10: 'MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
Root Cause:
After fixing Issue 7, GPUStack correctly sets CUDA_VISIBLE_DEVICES to the MIG UUID. But vLLM's device mapping code assumes this environment variable contains integers:
# vllm/platforms/interface.py (simplified)
def get_device_mapping(device_id):
    physical_device_id = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    return int(physical_device_id)  # Fails with UUID!
Fix:
Patch vLLM to handle UUID values gracefully:
def get_device_mapping(device_id):
    physical_device_id = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    try:
        return int(physical_device_id)
    except ValueError:
        # UUID format (e.g., MIG-xxx) - CUDA has already
        # remapped devices, so return the local device_id
        return device_id
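The patched logic can be exercised without a GPU. The sketch below reproduces the simplified mapping function and shows that a UUID in CUDA_VISIBLE_DEVICES now falls back to the local device ordinal instead of raising:
# Standalone check of the UUID-tolerant mapping logic (no GPU required).
import os

def device_id_to_physical(device_id: int):
    value = os.environ.get("CUDA_VISIBLE_DEVICES", "").split(",")[device_id]
    try:
        return int(value)  # plain integer index, e.g. "0" or "1"
    except ValueError:
        # UUID form (GPU-... / MIG-...): CUDA already remapped devices,
        # so the local ordinal is the right answer.
        return device_id

os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
print(device_id_to_physical(0))  # -> 0, instead of ValueError

os.environ["CUDA_VISIBLE_DEVICES"] = "1"
print(device_id_to_physical(0))  # -> 1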
The Complete Solution
Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ GPUSTACK + MIG ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ HOST SYSTEM │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ CDI Configuration (/etc/cdi/nvidia.yaml) │ │ │
│ │ │ - vendor: runtime.nvidia.com │ │ │
│ │ │ - strategies: index + uuid │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────────────────────┐ │ │
│ │ │ NVIDIA Container Runtime Config │ │ │
│ │ │ - default-kind: runtime.nvidia.com/gpu │ │ │
│ │ └─────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ gpustack-worker Container │ │
│ │ ┌───────────────────────────────────────────────────┐ │ │
│ │ │ gpustack-runtime (PATCHED) │ │ │
│ │ │ - MIG temp/power: contextlib.suppress │ │ │
│ │ │ - MIG enumeration: NotFound handling │ │ │
│ │ │ - MIG naming: reset per device │ │ │
│ │ │ - MIG indexing: global indices │ │ │
│ │ │ - CUDA alignment: UUID-based │ │ │
│ │ └───────────────────────────────────────────────────┘ │ │
│ │ Environment Variables: │ │
│ │ - NVIDIA_VISIBLE_DEVICES=all │ │
│ │ - GPUSTACK_RUNTIME_DEPLOY_RUNTIME_VISIBLE_DEVICES_VALUE │ │
│ │ _UUID=NVIDIA_VISIBLE_DEVICES │ │
│ │ - GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE │ │
│ │ _ALIGNMENT=CUDA_VISIBLE_DEVICES │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ Spawns │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ vLLM Runner Container (PATCHED) │ │
│ │ ┌───────────────────────────────────────────────────┐ │ │
│ │ │ vllm (PATCHED) │ │ │
│ │ │ - UUID handling in CUDA_VISIBLE_DEVICES │ │ │
│ │ └───────────────────────────────────────────────────┘ │ │
│ │ Environment: │ │
│ │ - NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-... │ │
│ │ - CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-... │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ MIG Device Instance │ │
│ │ Profile: 4g.71gb | Memory: 71GB | Isolated Compute │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Fix Implementation
We've automated the entire fix into a single script. Here's the breakdown:
Step 1: CDI Regeneration
nvidia-ctk cdi generate \
--vendor=runtime.nvidia.com \
--device-name-strategy=index \
--device-name-strategy=uuid \
--output=/etc/cdi/nvidia.yaml
Step 2: Container Runtime Configuration
sed -i 's|default-kind = "nvidia.com/gpu"|default-kind = "runtime.nvidia.com/gpu"|' \
/etc/nvidia-container-runtime/config.toml
Step 3: Build Patched vLLM Runner Image
FROM gpustack/runner:cuda12.9-vllm0.13.0
# Patch vLLM to handle UUID values in CUDA_VISIBLE_DEVICES
# (the replacement indentation must match the function body in interface.py)
RUN sed -i 's/return int(physical_device_id)/# VLLM_UUID_FIX\n        try:\n            return int(physical_device_id)\n        except ValueError:\n            return device_id/' \
/usr/local/lib/python3.12/dist-packages/vllm/platforms/interface.py && \
python3 -c "from vllm.platforms.interface import Platform; print('vLLM patch verified')"
Step 4: Deploy GPUStack Worker with MIG Environment Variables
docker run -d \
--name gpustack-worker \
--hostname gpustack-worker \
--restart unless-stopped \
--network host \
--runtime nvidia \
--privileged \
--shm-size 64m \
-v /data/cache:/var/lib/gpustack/cache \
-v /var/run/docker.sock:/var/run/docker.sock \
-v /data/models:/data/models \
-v gpustack-data:/var/lib/gpustack \
-e "GPUSTACK_TOKEN=$TOKEN" \
-e "NVIDIA_DISABLE_REQUIRE=true" \
-e "NVIDIA_VISIBLE_DEVICES=all" \
-e "NVIDIA_DRIVER_CAPABILITIES=compute,utility" \
-e "GPUSTACK_RUNTIME_DEPLOY_MIRRORED_DEPLOYMENT=true" \
-e "GPUSTACK_RUNTIME_DEPLOY_RUNTIME_VISIBLE_DEVICES_VALUE_UUID=NVIDIA_VISIBLE_DEVICES" \
-e "GPUSTACK_RUNTIME_DEPLOY_BACKEND_VISIBLE_DEVICES_VALUE_ALIGNMENT=CUDA_VISIBLE_DEVICES" \
gpustack/gpustack:v2.0.3 \
--server-url http://$SERVER_IP \
--worker-ip $WORKER_IP \
--worker-port 10170 \
--worker-metrics-port 10171
Step 5: Apply Runtime Patches
The gpustack-runtime patches must be applied inside the running container. Here's the CUDA UUID alignment patch:
# Patch location: /usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py
# Old code:
alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}
# New code:
alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}
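To avoid hand-editing after every container recreation, a small helper script (hypothetical, but it performs exactly the one-line substitution above and is idempotent) can be copied into the container and run:
# apply_uuid_alignment_patch.py - hypothetical helper; run inside the
# gpustack-worker container to apply the UUID alignment patch idempotently.
from pathlib import Path

TARGET = Path(
    "/usr/local/lib/python3.11/dist-packages/gpustack_runtime/deployer/__types__.py"
)
OLD = "alignment = {dev_indexes[i]: str(i) for i in range(len(devs))}"
NEW = "alignment = {dev_indexes[i]: dev_uuids[i] for i in range(len(devs))}"

text = TARGET.read_text()
if NEW in text:
    print("patch already applied")
elif OLD in text:
    TARGET.write_text(text.replace(OLD, NEW))
    print("patch applied")
else:
    raise SystemExit("expected line not found - gpustack-runtime version mismatch?")
Restart the worker process afterwards so the patched module is re-imported.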
Verification
After applying all fixes, verify the setup:
GPU Detection Test
docker exec gpustack-worker python3 -c "
from gpustack_runtime.detector.nvidia import NVIDIADetector
det = NVIDIADetector()
result = det.detect()
for d in result:
    mem = getattr(d, 'memory', '?')
    uuid = getattr(d, 'uuid', '?')
    print(f'idx={d.index} name={d.name} mem={mem}MB uuid={uuid}')
"
Expected Output:
idx=0 name=NVIDIA H200 NVL mem=143771MB uuid=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
idx=1 name=4g.71gb mem=71424MB uuid=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
idx=2 name=2g.35gb mem=33280MB uuid=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
idx=3 name=1g.18gb mem=16384MB uuid=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
vLLM MIG Test
docker run --rm --runtime nvidia \
-e "NVIDIA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
-e "CUDA_VISIBLE_DEVICES=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" \
gpustack/runner:cuda12.9-vllm0.13.0 python3 -c "
import torch
print(f'CUDA devices: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f' {i}: {torch.cuda.get_device_name(i)}')
from vllm.platforms.cuda import CudaPlatform
print('vLLM imported successfully!')
"
Expected Output:
CUDA devices: 1
0: NVIDIA H200 NVL
vLLM imported successfully!
CDI Verification
nvidia-ctk cdi list
Expected Output (includes UUIDs):
runtime.nvidia.com/gpu=0
runtime.nvidia.com/gpu=1:0
runtime.nvidia.com/gpu=1:1
runtime.nvidia.com/gpu=1:2
runtime.nvidia.com/gpu=GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
runtime.nvidia.com/gpu=MIG-zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz
runtime.nvidia.com/gpu=MIG-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
runtime.nvidia.com/gpu=MIG-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
runtime.nvidia.com/gpu=all
Operational Considerations
Patch Persistence
| Component | Persistence | Notes |
|---|---|---|
| CDI configuration | Survives reboots | Written to /etc/cdi/nvidia.yaml |
| Container runtime config | Survives reboots | Written to /etc/nvidia-container-runtime/config.toml |
| vLLM runner image | Permanent | Baked into Docker image |
| gpustack-runtime patches | Survives restarts | Lost on container recreation |
Important: If you docker rm and recreate the gpustack-worker container, you must reapply the runtime patches. However, docker restart preserves them.
Monitoring MIG Devices
MIG devices have limited monitoring capabilities compared to full GPUs:
| Metric | Full GPU | MIG Device |
|---|---|---|
| Temperature | Yes | No (parent GPU only) |
| Power Usage | Yes | No (parent GPU only) |
| Memory Usage | Yes | Yes |
| Utilization | Yes | Yes |
| Process List | Yes | Yes |
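In monitoring code this means reading temperature and power from the parent handle while reading memory from the MIG handles themselves. A minimal pynvml sketch, assuming GPU index 1 is the MIG-enabled parent as in our setup:
# Sketch: read temperature/power from the MIG parent, memory from the
# MIG instances themselves (assumes GPU index 1 is the MIG-enabled parent).
import pynvml

pynvml.nvmlInit()
parent = pynvml.nvmlDeviceGetHandleByIndex(1)

temp = pynvml.nvmlDeviceGetTemperature(parent, pynvml.NVML_TEMPERATURE_GPU)
power_w = pynvml.nvmlDeviceGetPowerUsage(parent) / 1000
print(f"parent: {temp} C, {power_w:.0f} W")

for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, i)
    except pynvml.NVMLError_NotFound:
        continue
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"  MIG {i}: {mem.used / 1024**2:.0f} / {mem.total / 1024**2:.0f} MiB used")

pynvml.nvmlShutdown()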
MIG Profile Selection Guidelines
| Model Size | Recommended Profile | Notes |
|---|---|---|
| < 5B params | 1g.18gb | Small inference tasks |
| 5B - 13B params | 2g.35gb | Typical 7B model with context |
| 13B - 30B params | 4g.71gb | Larger models, batch inference |
| > 30B params | Full GPU | Disable MIG for this GPU |
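These guidelines can be captured as a rough heuristic; the sketch below mirrors the table's cut-offs, but actual requirements depend on quantization, context length, and batch size:
# Rough heuristic mirroring the table above; real requirements depend on
# quantization, context length, and batch size.
def recommend_mig_profile(params_billion: float) -> str:
    if params_billion < 5:
        return "1g.18gb"
    if params_billion <= 13:
        return "2g.35gb"
    if params_billion <= 30:
        return "4g.71gb"
    return "full-gpu"  # disable MIG for this GPU

print(recommend_mig_profile(7))   # -> 2g.35gb
print(recommend_mig_profile(70))  # -> full-gpu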
Key Takeaways
- MIG + container orchestration is complex. The intersection of CDI, NVIDIA container runtime, and application-level device enumeration creates multiple potential failure points.
- Vendor prefixes matter. CDI's vendor namespace (nvidia.com vs runtime.nvidia.com) must match what your orchestrator requests.
- Device naming strategies must align. MIG devices can be addressed by parent:child index or UUID. Your orchestrator and runtime must agree on which to use.
- pynvml has MIG limitations. Not all NVML queries work on MIG device handles. Wrap potentially failing calls in error handlers.
- Index enumeration differs between layers. GPUStack, CUDA, and CDI may all enumerate devices differently. UUID-based device selection is the most reliable approach.
- Runtime patches may be necessary. Both gpustack-runtime and vLLM required patches to handle MIG correctly. These should eventually be upstreamed.
- Test the full stack. Verify GPU detection, CDI configuration, and actual model deployment. Each layer can fail independently.
- Document your patches. Runtime patches don't survive container recreation. Automate their application and document the process.
MIG is a powerful technology for maximizing GPU utilization in multi-tenant environments. With the right configuration and patches, GPUStack can effectively orchestrate inference workloads across MIG partitions - enabling more efficient use of expensive GPU hardware.
Frederico Vicente
AI Research Engineer