docs(operations): add containerized GPU workloads guide#555

Open
Aleksei Sviridkin (lexfrei) wants to merge 1 commit into
mainfrom
feat/gpu-container-workloads-docs

@lexfrei Aleksei Sviridkin (lexfrei) commented May 28, 2026

What this PR does

Add a new operations guide describing the container variant of cozystack.gpu-operator — the architectural mode for containerized GPU workloads (CUDA pods, ML training, inference) on Linux GPU nodes that already ship the NVIDIA driver and nvidia-container-toolkit via the distro package manager.
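
One way to confirm that a node meets this host-driver prerequisite is to run nvidia-smi on the host through a node debug pod. The sketch below is not taken from the guide: the helper names, the busybox image, and the `--profile=sysadmin` flag (present in recent kubectl releases) are assumptions, and the script only prints the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: check the host NVIDIA driver on a node via `kubectl debug`.
# Assumptions (not from the guide): helper names, busybox image, sysadmin profile.

# Print the command when DRY_RUN is unset or 1; execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# A node debug pod mounts the host root filesystem at /host,
# so chroot /host calls the host's own nvidia-smi binary.
check_host_driver() {
  node="$1"
  run kubectl debug "node/${node}" -it --image=busybox --profile=sysadmin \
    -- chroot /host nvidia-smi
}
```

Calling `check_host_driver gpu-node-1` (a hypothetical node name) prints the command for review; `DRY_RUN=0 check_host_driver gpu-node-1` runs it against the current cluster context.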

The new page lands at content/en/docs/next/operations/gpu-container-workloads.md and rounds out the GPU documentation surface.

Content covers when to pick the variant (host driver + host toolkit prerequisite), the operator-validator host-driver auto-detect mechanism (/host/usr/bin/nvidia-smi), the Talos caveat with a pointer to the examples/values-native-talos.yaml reference, install steps with Package CR variant: container, a sample CUDA pod for verification, and a three-row variant comparison matrix.
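
For readers who want a feel for the verification step, a GPU test Pod might look roughly like the sketch below. It is not the Pod from the guide: the Pod name, the helper name, and the sample image tag are assumptions, and the only part grounded in standard Kubernetes GPU usage is the `nvidia.com/gpu` resource limit.

```shell
#!/bin/sh
# Sketch: render a one-shot CUDA test Pod that requests a single GPU.
# Assumptions (not from the guide): Pod name, helper name, sample image tag.

render_cuda_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      # Assumed sample image; substitute whatever CUDA image the guide uses.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

With a working cluster context, `render_cuda_pod | kubectl apply -f -` creates the Pod and `kubectl logs cuda-vectoradd` shows whether the sample could use the GPU.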

Companion to cozystack/cozystack#2766, which adds the container variant itself.

Release note

docs(operations): add guide for containerized GPU workloads via the gpu-operator `container` variant.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guide for running containerized GPU workloads on cluster nodes, including prerequisites, installation steps, verification procedures for operator health and GPU resource allocation, sample CUDA Pod workflow, and comparison of container, default, and vGPU operator variants.

netlify Bot commented May 28, 2026

Deploy Preview for cozystack ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 8b83e54 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/cozystack/deploys/6a18889a629dd1000851e38e |
| 😎 Deploy Preview | https://deploy-preview-555--cozystack.netlify.app |
To edit notification comments on pull requests, go to your Netlify project configuration.

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 897bf5d8-cda2-4452-b97a-9aa8bf762ee4

📥 Commits

Reviewing files that changed from the base of the PR and between ef54f10 and 8b83e54.

📒 Files selected for processing (1)
  • content/en/docs/next/operations/gpu-container-workloads.md

📝 Walkthrough

This PR adds a new operations documentation page explaining how to deploy and use the cozystack.gpu-operator container variant on Cozystack management cluster nodes. The guide covers installation prerequisites, step-by-step setup via a Package CR, verification steps, a CUDA workload example, HAMi-based fractional GPU sharing, and a comparison of operator variants.

Changes

GPU Container Workloads Documentation

| Layer / File(s) | Summary |
|---|---|
| GPU container variant guide (content/en/docs/next/operations/gpu-container-workloads.md) | New operations guide explains when to use the container variant (host already has NVIDIA driver and nvidia-container-toolkit), installation prerequisites, Package CR setup with warnings against bundles.enabledPackages, health verification, CUDA Pod example, HAMi fractional sharing guidance, and variant comparison table. |

Possibly related issues

  • cozystack/cozystack#2764: Directly addresses the same cozystack.gpu-operator container variant documentation and configuration guidance referenced in this PR.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A GPU path through containers clear,
With HAMi sharing, fractional cheer!
From driver checks to CUDA Pod flight,
The operations guide shines oh-so bright! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding documentation for containerized GPU workloads, which matches the new guide page created in the pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds a new documentation page detailing how to run containerized GPU workloads using the container variant of the cozystack.gpu-operator package. The review feedback suggests specifying the cozy-system namespace in both the kubectl patch command and the Package resource manifest to ensure they are applied to the correct namespace.

Comment on lines +36 to +37
kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
-p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

Priority: medium

In Cozystack, the Package resources (including cozystack.cozystack-platform) are typically located in the cozy-system namespace. Running kubectl patch without specifying the namespace will fail if the user's current context is set to another namespace (like default). Adding -n cozy-system ensures the command runs successfully.

Suggested change

Original:
kubectl patch packages.cozystack.io cozystack.cozystack-platform --type=json \
  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'

Suggested:
kubectl patch packages.cozystack.io cozystack.cozystack-platform -n cozy-system --type=json \
  -p '[{"op": "add", "path": "/spec/components/platform/values/bundles/enabledPackages/-", "value": "cozystack.gpu-operator"}]'
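
Whether the namespace flag is needed depends on whether the Package resource is namespaced in the target cluster, which can be checked rather than assumed. The helper below is a hypothetical sketch: it parses the default column layout of `kubectl api-resources --no-headers` (NAME, SHORTNAMES, APIVERSION, NAMESPACED, KIND) and does not call the cluster itself.

```shell
#!/bin/sh
# Sketch: read the NAMESPACED column from `kubectl api-resources` output.
# Assumption: the helper name is_namespaced is illustrative only.

# Reads `kubectl api-resources --no-headers` lines on stdin and prints
# true or false for the resource named in $1.
# SHORTNAMES may be empty, so the column is addressed from the right:
# the last field is KIND and the one before it is NAMESPACED.
# Exits non-zero when the resource is not listed.
is_namespaced() {
  awk -v name="$1" '$1 == name { print $(NF-1); found = 1 } END { exit !found }'
}
```

A typical call would be `kubectl api-resources --api-group=cozystack.io --no-headers | is_namespaced packages`; a result of `true` means commands such as the patch above need a namespace, either via `-n` or the current context.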

Comment on lines +43 to +48
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.gpu-operator
spec:
  variant: container

Priority: medium

The Package resource needs to be created in the cozy-system namespace for the Cozystack operator to detect and reconcile it. Adding namespace: cozy-system to the metadata ensures it is applied to the correct namespace.

Suggested change

Original:
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.gpu-operator
spec:
  variant: container

Suggested:
apiVersion: cozystack.io/v1alpha1
kind: Package
metadata:
  name: cozystack.gpu-operator
  namespace: cozy-system
spec:
  variant: container

Document the new container variant of cozystack.gpu-operator, paired with
cozystack/cozystack#2766. Covers the apt-installed-driver-and-toolkit
Linux shape that the variant targets: when to pick it over the
passthrough and vGPU variants, prerequisites (host driver + host
nvidia-container-toolkit, validated via nvidia-smi over kubectl debug),
the operator-validator host-driver auto-detect path
(/host/usr/bin/nvidia-smi), Talos caveat with a pointer to the
values-native-talos.yaml reference, install steps, a sample CUDA pod for
verification, the variant comparison matrix, and a cross-reference to
the HAMi sharing guide for tenant Kubernetes clusters.

Lands under operations/ — symmetric with virtualization/gpu.md (VM
passthrough on management cluster) and kubernetes/gpu-sharing.md (HAMi
in tenant Kubernetes addons).

Assisted-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0
