Volcano
What is Volcano?
Volcano is a cloud-native batch scheduling system for Kubernetes, designed for high-performance workloads. It is a CNCF (Cloud Native Computing Foundation) project at the incubating stage, and the project describes itself as the first and only official container batch scheduling project in the CNCF.
The core problem it solves: Kubernetes’ default scheduler (kube-scheduler) handles pods one at a time and is designed for long-running services (web servers, APIs). It is inadequate for batch workloads like ML training, big data jobs, and scientific computing, which require:
- Scheduling groups of pods together (gang scheduling)
- Queue-based resource management and multi-tenancy
- Fair sharing, preemption, and resource reclaim between teams
- Lifecycle management of jobs with multiple interdependent tasks
Volcano fills this gap without replacing kube-scheduler — it runs alongside it.
Why Not Just kube-scheduler?
| Requirement | kube-scheduler | Volcano |
|---|---|---|
| Schedule pods individually | Yes | Yes |
| Schedule groups of pods atomically (gang) | No | Yes |
| Queue-based resource management | No | Yes |
| Fair sharing / DRF across queues | No | Yes |
| Preemption between queues based on priority | Limited | Yes |
| Resource reclaim when queues are over-allocated | No | Yes |
| Job lifecycle management (retry, policies) | Basic (via Job) | Advanced (VolcanoJob) |
| GPU/NPU-aware scheduling | Basic | Advanced (MIG, vCUDA) |
| Network topology-aware scheduling | No | Yes |
Architecture
Volcano consists of four components:
graph TD
API["Kubernetes API Server<br/><i>CRDs: Queue, PodGroup, VolcanoJob</i>"]
API --> Scheduler
API --> CM["Controller Manager"]
API --> Admission["Admission Webhook"]
subgraph Scheduler
direction TB
S_Actions["Actions:<br/>enqueue → allocate → preempt<br/>→ reclaim → backfill"]
S_Plugins["Plugins:<br/>gang, priority, DRF, proportion,<br/>predicates, nodeorder, binpack"]
end
subgraph CM["Controller Manager"]
QCM["Queue CM"]
PGCM["PodGroup CM"]
VJCM["VCJob CM"]
end
Admission -.- |"Validates CRD<br/>API requests"| API
vcctl["vcctl<br/><i>CLI client</i>"] --> API
- Scheduler — Schedules jobs to nodes based on configurable actions and plugins.
- Controller Manager — Manages the lifecycle of Volcano CRDs (Queue, PodGroup, VolcanoJob).
- Admission — Validates CRD API requests (webhook).
- vcctl — Command-line client for Volcano.
Core Concepts
Volcano introduces three custom resource definitions (CRDs):
1. Queue
A Queue is a collection of PodGroups. It is the unit of resource division in Volcano and follows FIFO ordering.
Queues enable multi-tenancy: each team/project gets a queue with guaranteed and capped resources.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  # Hard upper limit of resources this queue can use
  capability:
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: "8"
  # Minimum guaranteed resources (cannot be taken by other queues)
  guarantee:
    resource:
      cpu: "16"
      memory: 64Gi
  # Relative weight for proportional sharing (proportion plugin)
  # deserved = (weight / total-weight) * total-cluster-resources
  weight: 4
  # Allow other queues to reclaim excess resources?
  reclaimable: true
  # Higher priority queues get allocation/preemption precedence
  priority: 100
Key Fields
| Field | Description |
|---|---|
| capability | Hard ceiling. Queue cannot use more than this. If unset, defaults to cluster total minus other queues’ guarantees. |
| guarantee | Reserved minimum. Other queues cannot touch these resources. Must be <= deserved. |
| weight | Soft share. The proportion plugin computes deserved as (weight/total_weight) * cluster_resources. |
| deserved | Expected resource amount (used by capacity plugin instead of weight). If a queue exceeds its deserved, excess can be reclaimed. |
| reclaimable | Whether excess resources can be reclaimed by other queues (default: true). |
| priority | Higher priority queues get precedence in allocation/preemption. |
| parent | For hierarchical queues: specifies the parent queue. |
Queue States
| State | Meaning |
|---|---|
| Open | Accepting new PodGroups |
| Closed | Not accepting new PodGroups |
| Closing | Transitioning to Closed |
| Unknown | Status unknown (e.g. network issue) |
Special Queues
- default: Created automatically at startup (weight=1). Jobs not assigned to a queue go here.
- root: Created automatically. Parent of all queues when hierarchical queues are enabled.
Resource Sharing Model
graph LR
subgraph Cluster["Cluster Total: 100 CPU"]
A["Queue A<br/>weight=3<br/>deserved=60 CPU"]
B["Queue B<br/>weight=1<br/>deserved=20 CPU"]
C["Queue C<br/>weight=1<br/>deserved=20 CPU"]
end
A -- "borrows idle<br/>resources" --> B
A -- "borrows idle<br/>resources" --> C
B -- "reclaims when<br/>work submitted" --> A
deserved = (weight / total_weight) * cluster_total
- If Queue B and C are idle → Queue A can borrow up to its capability
- When Queue B submits work → Queue A’s excess beyond 60 CPU is reclaimed back
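The deserved formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Volcano's implementation; the function name `deserved_shares` is made up for illustration, and a single resource (CPU) stands in for the full resource vector.

```python
from typing import Dict


def deserved_shares(weights: Dict[str, int], cluster_total: float) -> Dict[str, float]:
    """Split cluster_total across queues in proportion to their weights."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one queue must have a positive weight")
    return {name: cluster_total * w / total_weight for name, w in weights.items()}


# Same numbers as the diagram: weights 3/1/1 on a 100 CPU cluster.
shares = deserved_shares({"A": 3, "B": 1, "C": 1}, cluster_total=100)
print(shares)
```

With these inputs Queue A's deserved share is 60 CPU and Queues B and C get 20 CPU each, matching the diagram.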
2. PodGroup
A PodGroup is a group of pods with strong association. It is the scheduling unit — the scheduler considers all pods in a PodGroup together (gang scheduling).
When you create a VolcanoJob, a PodGroup is automatically created with the same name.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job
spec:
  # Minimum pods required to start (gang scheduling threshold)
  minMember: 4
  # Minimum total resources required to start
  minResources:
    cpu: "8"
    memory: 32Gi
  # Queue this PodGroup belongs to
  queue: ml-training
  # Priority class for scheduling order
  priorityClassName: high-priority
Key Fields
| Field | Description |
|---|---|
| minMember | Minimum number of pods that must be schedulable. If the cluster can’t fit minMember pods, none are scheduled (gang scheduling). |
| minResources | Minimum aggregate resources required. If unavailable, nothing is scheduled. |
| queue | Which queue this PodGroup belongs to. |
| priorityClassName | Used to order PodGroups within a queue. |
PodGroup States
stateDiagram-v2
[*] --> Pending
Pending --> Inqueue : resources available
Inqueue --> Running : minMember pods bound
Running --> [*] : completed
Running --> Unknown : some pods lost/unscheduled
Unknown --> Running : pods rescheduled
| State | Meaning |
|---|---|
| Pending | Accepted but resource requirements not yet met |
| Inqueue | Passed validation, waiting to be bound to nodes (transient) |
| Running | At least minMember pods are running |
| Unknown | Some of minMember pods are running, others are not scheduled |
Why PodGroups Matter (Gang Scheduling)
Without gang scheduling, in a distributed training job requiring 4 workers:
- kube-scheduler might schedule 2 out of 4 workers
- Those 2 sit idle waiting for the other 2 (which may never be scheduled)
- Resources are wasted — deadlock
With gang scheduling via PodGroup:
- Either all 4 workers get scheduled, or none do
- No wasted resources, no deadlock
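The all-or-nothing rule can be illustrated with a small simulation. The sketch below is not Volcano's algorithm: it assumes identical pods, a single resource (CPU), and first-fit placement, and the function name `gang_place` is hypothetical. The point it shows is that a failed attempt reserves nothing.

```python
from typing import List, Optional


def gang_place(free_cpu_per_node: List[int], pod_cpu: int, min_member: int) -> Optional[List[int]]:
    """Return the node index chosen for each pod, or None if fewer than
    min_member pods fit. Placement is tried on a copy of the node state,
    so a failed attempt leaves no partial reservation behind."""
    free = list(free_cpu_per_node)
    placement: List[int] = []
    for _ in range(min_member):
        # First-fit: pick the first node with enough free CPU.
        node = next((i for i, cpu in enumerate(free) if cpu >= pod_cpu), None)
        if node is None:
            return None  # one pod does not fit, so the whole gang waits
        free[node] -= pod_cpu
        placement.append(node)
    return placement


# Four 4-CPU workers: only three fit on nodes with 8 and 4 free CPUs.
too_small = gang_place([8, 4], pod_cpu=4, min_member=4)
# With 8 free CPUs on both nodes, all four workers fit.
big_enough = gang_place([8, 8], pod_cpu=4, min_member=4)
print(too_small, big_enough)
```

In the first call three workers would fit, but because the fourth does not, nothing is placed and the resources stay available for other jobs.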
3. VolcanoJob (vcjob)
A VolcanoJob is Volcano’s enhanced job type. It wraps multiple tasks (each with their own pod template and replica count) into a single job with lifecycle management.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training
spec:
  # Minimum pods to consider the job "running"
  minAvailable: 3
  # Use Volcano scheduler (not default-scheduler)
  schedulerName: volcano
  # Queue assignment
  queue: ml-training
  # Priority for preemption
  priorityClassName: high-priority
  # Max retries before marking as failed
  maxRetry: 5
  # Lifecycle policies
  policies:
    - event: PodEvicted
      action: RestartJob
  # Plugins for job features
  plugins:
    ssh: []  # Sets up SSH between pods (for MPI)
    env: []  # Injects env vars (VK_TASK_INDEX, etc.)
    svc: []  # Creates headless service for DNS
  # Task definitions
  tasks:
    # Parameter server
    - name: ps
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tf-training:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
          restartPolicy: Never
    # Workers
    - name: worker
      replicas: 4
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - name: tensorflow
              image: tf-training:latest
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
                  nvidia.com/gpu: "1"
                # Extended resources such as GPUs must also appear in limits
                limits:
                  nvidia.com/gpu: "1"
          restartPolicy: Never
Key Fields
| Field | Description |
|---|---|
| schedulerName | Must be volcano to use Volcano scheduler. Can also be default-scheduler. |
| minAvailable | Minimum running pods for the job to be considered Running. |
| queue | Queue the job belongs to. |
| tasks | List of task groups. Each has a name, replicas, pod template, and optional policies. |
| policies | Default lifecycle policies for all tasks. |
| plugins | Volcano plugins: ssh (inter-pod SSH), env (task index env vars), svc (headless DNS service). |
| maxRetry | Max retries before marking the job as Failed. |
| priorityClassName | Priority for scheduling and preemption. |
VolcanoJob States
stateDiagram-v2
[*] --> Pending
Pending --> Running : minAvailable pods running
Pending --> Failed : maxRetry exceeded
Running --> Completing : tasks finishing
Completing --> Completed
Running --> Restarting : e.g. PodEvicted policy
Restarting --> Pending
Running --> Aborting : external cause
Aborting --> Aborted
Running --> Terminating : internal cause
Terminating --> Terminated
| State | Meaning |
|---|---|
| Pending | Waiting to be scheduled |
| Running | At least minAvailable pods running |
| Completing | Pods finishing, cleanup in progress |
| Completed | At least minAvailable pods completed |
| Restarting | Job is restarting (e.g. after PodEvicted policy) |
| Aborting/Aborted | Being/has been aborted (external cause) |
| Terminating/Terminated | Being/has been terminated (internal cause) |
| Failed | Could not start after maxRetry attempts |
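The state diagram can be written down as a transition table, which is handy for sanity-checking tooling that watches job status. The sketch below encodes only the arrows drawn in the diagram above; the real controller may allow transitions that are not shown here, and the helper names are hypothetical.

```python
# Allowed transitions, copied from the state diagram above.
TRANSITIONS = {
    "Pending": {"Running", "Failed"},
    "Running": {"Completing", "Restarting", "Aborting", "Terminating"},
    "Completing": {"Completed"},
    "Restarting": {"Pending"},
    "Aborting": {"Aborted"},
    "Terminating": {"Terminated"},
}


def can_transition(current: str, target: str) -> bool:
    """True if the diagram has an arrow from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(state: str) -> bool:
    """A state with no outgoing arrows is terminal."""
    return state not in TRANSITIONS


print(can_transition("Running", "Restarting"))
print(can_transition("Pending", "Completed"))
print(sorted(s for s in ["Completed", "Aborted", "Terminated", "Failed"] if is_terminal(s)))
```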
The Scheduler: Actions and Plugins
The Volcano scheduler is built on a composite pattern: actions define what to do; plugins define how to do it.
Scheduling Cycle
graph TD
Open["Open Session"] --> E["1. enqueue<br/><i>filter jobs, Pending → Inqueue</i>"]
E --> A["2. allocate<br/><i>find best node for each job</i>"]
A --> P["3. preempt<br/><i>evict lower-priority in same queue</i>"]
P --> R["4. reclaim<br/><i>reclaim cross-queue excess</i>"]
R --> B["5. backfill<br/><i>fill remaining gaps</i>"]
B --> Close["Close Session"]
Close -.-> |"repeats periodically"| Open
Actions
| Action | What it does |
|---|---|
| enqueue | Filters jobs that meet scheduling requirements. Moves PodGroups from Pending → Inqueue. |
| allocate | Selects a node for each task: predicates filter out nodes that cannot run the task, then scoring ranks the remaining nodes and the best one is chosen. |
| preempt | Within the same queue, evicts lower-priority tasks to make room for higher-priority ones. |
| reclaim | Across queues, reclaims resources from queues that exceed their deserved share when another queue needs resources. |
| backfill | Fills remaining pending tasks into available node slots to maximize utilization. |
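The split between actions (what to do) and plugins (how to do it) can be sketched in a few lines. Everything below is a toy model and not Volcano's Go API: `Session`, `job_sort_key`, `priority_plugin` and `run_cycle` are hypothetical names, and only two of the five actions are modelled.

```python
from typing import Any, Callable, Dict, List

Job = Dict[str, Any]


class Session:
    """Hypothetical stand-in for the state shared during one scheduling cycle."""

    def __init__(self) -> None:
        # Plugins overwrite this hook; the default keeps submission order.
        self.job_sort_key: Callable[[Job], Any] = lambda job: 0
        self.log: List[str] = []


def priority_plugin(session: Session) -> None:
    """Plugin: supplies the 'how' (higher priority sorts first)."""
    session.job_sort_key = lambda job: -job["priority"]


def enqueue(session: Session, jobs: List[Job]) -> None:
    """Action: move Pending jobs to Inqueue."""
    for job in jobs:
        if job["state"] == "Pending":
            job["state"] = "Inqueue"
            session.log.append(f"enqueue {job['name']}")


def allocate(session: Session, jobs: List[Job]) -> None:
    """Action: visit Inqueue jobs in the order the plugins define."""
    ready = [job for job in jobs if job["state"] == "Inqueue"]
    for job in sorted(ready, key=session.job_sort_key):
        session.log.append(f"allocate {job['name']}")


ACTIONS = {"enqueue": enqueue, "allocate": allocate}


def run_cycle(action_names: List[str], plugins, jobs: List[Job]) -> List[str]:
    session = Session()            # open session
    for register in plugins:       # plugins register their hooks
        register(session)
    for name in action_names:      # actions run in configured order
        ACTIONS[name](session, jobs)
    return session.log             # close session


jobs = [
    {"name": "low", "priority": 1, "state": "Pending"},
    {"name": "high", "priority": 10, "state": "Pending"},
]
log = run_cycle(["enqueue", "allocate"], [priority_plugin], jobs)
print(log)
```

The allocate action never mentions priority; swapping the plugin changes the ordering without touching the action.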
Plugins
Plugins provide the actual algorithms called by actions:
| Plugin | Purpose |
|---|---|
| gang | Ensures all tasks of a PodGroup can be scheduled together. Checks minMember/minResources. |
| priority | Compares priority between jobs (by priorityClassName) and tasks (by priority, createTime, id). |
| DRF (Dominant Resource Fairness) | Gives higher priority to jobs whose dominant resource share is smaller, so that resources are shared fairly across multiple dimensions (CPU, memory, GPU). |
| proportion | Computes each queue’s deserved share based on weight. Handles resource borrowing/reclaim between queues. |
| predicates | Evaluates whether a task can physically fit on a node (CPU, memory, affinity, taints, etc.). |
| nodeorder | Scores nodes for a task. Picks the highest-scoring node. |
| binpack | Packs tasks tightly onto fewer nodes (opposite of spread) to maximize utilization. |
| conformance | Protects critical pods, such as those in the kube-system namespace, from being evicted by preempt or reclaim. |
Default Configuration
# ConfigMap: volcano-scheduler-configmap (namespace: volcano-system)
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
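Plugins listed in the same tier are evaluated together. Tiers are consulted in order, so a decision reached in an earlier tier takes precedence over later tiers. Note that this configuration enables only the enqueue, allocate, and backfill actions; preempt and reclaim take effect only when they are added to the actions list.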
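One way to picture tier precedence for ordering decisions is "the first plugin with an opinion wins". The sketch below models only that reading; it is an assumption about ordering callbacks, it does not model eviction decisions, and the comparator names are hypothetical.

```python
from typing import Callable, Dict, List

Job = Dict[str, float]
# A comparator returns <0 if a goes first, >0 if b goes first, 0 for "no opinion".
Compare = Callable[[Job, Job], int]


def by_priority(a: Job, b: Job) -> int:
    return int(b["priority"] - a["priority"])


def by_dominant_share(a: Job, b: Job) -> int:
    # Smaller share goes first.
    return (a["share"] > b["share"]) - (a["share"] < b["share"])


def tiered_compare(tiers: List[List[Compare]], a: Job, b: Job) -> int:
    """Walk tiers in order and return the first non-zero answer."""
    for tier in tiers:
        for compare in tier:
            result = compare(a, b)
            if result != 0:
                return result
    return 0


tiers = [[by_priority], [by_dominant_share]]
job_a = {"priority": 5, "share": 0.5}
job_b = {"priority": 5, "share": 0.2}
job_c = {"priority": 9, "share": 0.9}

# Equal priority: the second tier breaks the tie in favour of job_b.
tie_broken = tiered_compare(tiers, job_a, job_b)
# Different priority: the first tier decides and the second tier is never asked.
first_tier = tiered_compare(tiers, job_c, job_b)
print(tie_broken, first_tier)
```

Here `job_c` sorts ahead of `job_b` despite its larger share, because the priority comparator sits in the earlier tier.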
Scheduling Policies in Depth
Gang Scheduling
The defining feature of Volcano. Ensures all-or-nothing scheduling for a group of pods.
Problem it solves: distributed training jobs (TensorFlow, PyTorch, MPI) require N workers to run simultaneously. If only N-1 get scheduled, all N-1 sit idle wasting resources. With multiple such jobs, you get resource deadlock.
How it works:
- Job specifies minAvailable: N (or PodGroup specifies minMember: N).
- Scheduler checks if the cluster can satisfy N pods simultaneously.
- If yes → schedule all N. If no → schedule none, keep job pending.
# Gang scheduling example: MPI job needing exactly 8 workers
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
spec:
  minAvailable: 8
  schedulerName: volcano
  queue: compute
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          containers:
            - name: mpi-worker
              image: mpi-app:latest
              resources:
                requests:
                  cpu: "4"
                  nvidia.com/gpu: "1"
                # Extended resources such as GPUs must also appear in limits
                limits:
                  nvidia.com/gpu: "1"
Proportion (Queue Fair Sharing)
Each queue gets a deserved share proportional to its weight:
deserved(queue) = (queue.weight / sum(all_weights)) * cluster_total
- If a queue uses less than its deserved share, other queues can borrow the slack.
- If a queue uses more than its deserved share and another queue needs resources, the excess is reclaimed.
- Queues with reclaimable: false cannot have their excess taken back.
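The three rules above boil down to comparing each queue's allocation with its deserved share. The sketch below computes how much each queue could be asked to give back; it is a single-resource simplification, and `QueueUsage` and `reclaimable_excess` are hypothetical names rather than Volcano types.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueueUsage:
    name: str
    deserved: float      # share computed from the queue weight
    allocated: float     # what the queue currently holds
    reclaimable: bool = True


def reclaimable_excess(queues: List[QueueUsage]) -> Dict[str, float]:
    """CPU each queue could be asked to return: usage above its deserved
    share, or nothing if the queue is marked reclaimable: false."""
    return {
        q.name: max(0.0, q.allocated - q.deserved) if q.reclaimable else 0.0
        for q in queues
    }


queues = [
    QueueUsage("A", deserved=60.0, allocated=85.0),                     # borrowed 25
    QueueUsage("B", deserved=20.0, allocated=5.0),                      # under its share
    QueueUsage("C", deserved=20.0, allocated=30.0, reclaimable=False),  # protected
]
excess = reclaimable_excess(queues)
print(excess)
```

Queue A has 25 CPU above its share that can be reclaimed, Queue B has none, and Queue C keeps its 10 CPU of excess because it is not reclaimable.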
Binpack Scheduling
Packs pods tightly onto fewer nodes rather than spreading them. Leaves some nodes completely empty (for autoscaler to reclaim or for large jobs).
# Enable binpack with custom weights
- name: binpack
  arguments:
    binpack.weight: 10
    binpack.cpu: 1
    binpack.memory: 1
    binpack.resources: nvidia.com/gpu
    binpack.resources.nvidia.com/gpu: 2
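To see why these weights favour fuller nodes, here is a simplified scoring sketch. It assumes the score is a weighted average of each requested resource's utilization after placement, scaled by the plugin weight; Volcano's actual formula and scaling may differ, and `binpack_score` is a hypothetical name.

```python
from typing import Dict


def binpack_score(
    request: Dict[str, float],
    used: Dict[str, float],
    allocatable: Dict[str, float],
    resource_weights: Dict[str, int],
    plugin_weight: int = 1,
) -> float:
    """Weighted average utilization after placing the task (simplified)."""
    weighted = 0.0
    weight_sum = 0
    for resource, amount in request.items():
        weight = resource_weights.get(resource, 0)
        if amount == 0 or weight == 0:
            continue  # resource not requested or not configured for binpack
        after = used.get(resource, 0) + amount
        if after > allocatable.get(resource, 0):
            return 0.0  # the task does not fit on this node
        weighted += weight * after / allocatable[resource]
        weight_sum += weight
    if weight_sum == 0:
        return 0.0
    return plugin_weight * weighted / weight_sum


# Weights taken from the configuration above.
weights = {"cpu": 1, "memory": 1, "nvidia.com/gpu": 2}
task = {"cpu": 4, "nvidia.com/gpu": 1}
node_size = {"cpu": 16, "nvidia.com/gpu": 4}

busy = binpack_score(task, {"cpu": 8, "nvidia.com/gpu": 2}, node_size, weights, plugin_weight=10)
empty = binpack_score(task, {"cpu": 0, "nvidia.com/gpu": 0}, node_size, weights, plugin_weight=10)
print(busy, empty)
```

The half-full node scores higher than the empty one, so the task lands there and the empty node stays free.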
DRF (Dominant Resource Fairness)
For multi-dimensional resources (CPU + memory + GPU), DRF identifies each job’s dominant resource (the resource it needs most relative to total) and equalizes dominant resource shares.
Example: If Job A is CPU-heavy and Job B is memory-heavy, DRF ensures neither dominates all resources.
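The dominant share is straightforward to compute by hand. The sketch below uses made-up cluster and job numbers; `dominant_share` is a hypothetical helper, and choosing the job with the smallest dominant share next is the textbook DRF rule rather than a trace of Volcano's code.

```python
from typing import Dict, Tuple


def dominant_share(allocated: Dict[str, float], cluster_total: Dict[str, float]) -> Tuple[str, float]:
    """Return (resource, share) for the resource this job uses most,
    measured as a fraction of the cluster total."""
    shares = {
        resource: allocated.get(resource, 0) / total
        for resource, total in cluster_total.items()
        if total > 0
    }
    resource = max(shares, key=shares.get)
    return resource, shares[resource]


cluster = {"cpu": 100, "memory_gib": 400}
job_a = {"cpu": 20, "memory_gib": 20}    # CPU-heavy
job_b = {"cpu": 5, "memory_gib": 120}    # memory-heavy

res_a, share_a = dominant_share(job_a, cluster)
res_b, share_b = dominant_share(job_b, cluster)
print(res_a, share_a, res_b, share_b)

# DRF serves the job with the smaller dominant share next.
next_job = "A" if share_a < share_b else "B"
print(next_job)
```

Job A's dominant resource is CPU (20% of the cluster) and Job B's is memory (30%), so Job A is served next even though it holds more CPU.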
Preemption vs Reclaim
| | Preempt | Reclaim |
|---|---|---|
| Scope | Within the same queue | Across queues |
| Trigger | Higher-priority job in the queue | Queue needs resources from over-allocated queues |
| Based on | priorityClassName | Queue weight/deserved |
Lifecycle Policies
VolcanoJob supports event-driven lifecycle management:
policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: AbortJob
  - event: TaskCompleted
    action: CompleteJob
| Event | Description |
|---|---|
| PodEvicted | A pod was evicted (e.g. node pressure) |
| PodFailed | A pod failed |
| TaskCompleted | All replicas of a task completed |
| JobUnknown | Job state is unknown |
| * | Any event |
| Action | Description |
|---|---|
| RestartJob | Restart the entire job |
| AbortJob | Abort the job |
| CompleteJob | Mark the job as completed |
| TerminateJob | Terminate the job |
| RestartTask | Restart only the failed task |
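Policy lookup can be thought of as a table from events to actions. The sketch below assumes that task-level policies are consulted before job-level defaults and that an exact event match beats the `*` wildcard at the same level; both rules are assumptions made for illustration, and `resolve_action` is a hypothetical helper.

```python
from typing import Dict, List, Optional

Policy = Dict[str, str]


def resolve_action(event: str, task_policies: List[Policy], job_policies: List[Policy]) -> Optional[str]:
    """Return the action for an event, or None if no policy matches.
    Assumed order: task-level first, then job-level; exact match before '*'."""
    for policies in (task_policies, job_policies):
        wildcard = None
        for policy in policies:
            if policy["event"] == event:
                return policy["action"]
            if policy["event"] == "*":
                wildcard = policy["action"]
        if wildcard is not None:
            return wildcard
    return None


job_policies = [
    {"event": "PodEvicted", "action": "RestartJob"},
    {"event": "PodFailed", "action": "AbortJob"},
]
worker_policies = [{"event": "TaskCompleted", "action": "CompleteJob"}]

on_complete = resolve_action("TaskCompleted", worker_policies, job_policies)
on_evict = resolve_action("PodEvicted", worker_policies, job_policies)
on_unknown = resolve_action("JobUnknown", worker_policies, job_policies)
print(on_complete, on_evict, on_unknown)
```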
Job Plugins
Volcano jobs can use built-in plugins:
plugins:
  ssh: []  # Sets up SSH keys between pods (for MPI/Horovod)
  env: []  # Injects env vars: VK_TASK_INDEX, VK_TASK_NUM, etc.
  svc: []  # Creates a headless Service for pod DNS resolution
- ssh: Generates SSH key pairs, distributes them to all pods. Essential for MPI-based distributed training (Horovod, OpenMPI).
- env: Injects environment variables so each pod knows its index (VK_TASK_INDEX) and total count (VK_TASK_NUM).
- svc: Creates a headless Kubernetes Service so pods can discover each other by DNS name (e.g. <job-name>-worker-0.<job-name>).
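Inside a container, a training script would typically combine the env and svc plugins to work out its own rank and the addresses of its peers. The sketch below reads `VK_TASK_INDEX` as described above; the hostname pattern `<job>-<task>-<index>.<job>` is an assumption about how pods are named behind the headless Service, so verify it against the pods in your cluster. Both helper names are hypothetical.

```python
import os
from typing import List, Mapping


def task_index(environ: Mapping[str, str] = os.environ) -> int:
    """Index of this pod within its task, as injected by the env plugin.
    Falls back to 0 when the variable is absent (e.g. local runs)."""
    return int(environ.get("VK_TASK_INDEX", "0"))


def peer_hostnames(job_name: str, task_name: str, replicas: int) -> List[str]:
    """DNS names of all replicas of a task (assumed naming pattern)."""
    return [f"{job_name}-{task_name}-{i}.{job_name}" for i in range(replicas)]


# Simulate the environment of the third worker instead of reading the real one.
rank = task_index({"VK_TASK_INDEX": "2"})
peers = peer_hostnames("pytorch-ddp", "worker", replicas=4)
print(rank, peers[0])
```

A common convention is to treat the replica with index 0 as the rendezvous address for the other workers.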
Real-World Examples
PyTorch Distributed Training
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp
spec:
  minAvailable: 4
  schedulerName: volcano
  queue: ml-training
  plugins:
    env: []
    svc: []
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-training:latest
              command: ["python", "-m", "torch.distributed.launch",
                        "--nproc_per_node=1", "--nnodes=4",
                        "train.py"]
              resources:
                requests:
                  cpu: "4"
                  memory: 16Gi
                  nvidia.com/gpu: "1"
                # Extended resources such as GPUs must also appear in limits
                limits:
                  nvidia.com/gpu: "1"
          restartPolicy: OnFailure
Spark Job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: data-processing
  tasks:
    - name: driver
      replicas: 1
      template:
        spec:
          containers:
            - name: spark-driver
              image: spark:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
    - name: executor
      replicas: 4
      template:
        spec:
          containers:
            - name: spark-executor
              image: spark:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi
Multi-Queue Setup for Multi-Tenancy
# Team A: ML research (high priority, more resources)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-research
spec:
  weight: 4
  priority: 100
  capability:
    cpu: "128"
    memory: 512Gi
    nvidia.com/gpu: "16"
  guarantee:
    resource:
      cpu: "32"
      memory: 128Gi
      nvidia.com/gpu: "4"
  reclaimable: true
---
# Team B: Data pipeline (lower priority, fewer resources)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: data-pipeline
spec:
  weight: 2
  priority: 50
  capability:
    cpu: "64"
    memory: 256Gi
  guarantee:
    resource:
      cpu: "8"
      memory: 32Gi
  reclaimable: true
Hierarchical Queues
Queues can be organized in a tree structure for complex organizational resource management:
graph TD
root["root"]
root --> deptA["dept-a<br/>weight=3"]
root --> deptB["dept-b<br/>weight=2"]
deptA --> teamA1["team-a1<br/>weight=2"]
deptA --> teamA2["team-a2<br/>weight=1"]
deptB --> teamB1["team-b1<br/>weight=1"]
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a
spec:
  parent: root
  weight: 3
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a1
spec:
  parent: dept-a
  weight: 2
Child queues inherit and share resources from their parent. This models organizational hierarchies naturally.
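To make the tree concrete, the sketch below splits a cluster total down the hierarchy from the diagram. It assumes that each parent's share is divided among its children in proportion to their weights; that rule is an illustrative assumption rather than a description of Volcano's implementation, and `hierarchical_shares` is a hypothetical helper.

```python
from typing import Dict, List


def hierarchical_shares(
    children_of: Dict[str, List[str]],
    weights: Dict[str, int],
    root: str,
    total: float,
) -> Dict[str, float]:
    """Walk the queue tree from the root, dividing each parent's share
    among its children in proportion to their weights."""
    shares = {root: total}
    stack = [root]
    while stack:
        parent = stack.pop()
        children = children_of.get(parent, [])
        weight_sum = sum(weights[child] for child in children)
        for child in children:
            shares[child] = shares[parent] * weights[child] / weight_sum
            stack.append(child)
    return shares


# Tree and weights from the diagram above, on a hypothetical 100 CPU cluster.
children_of = {
    "root": ["dept-a", "dept-b"],
    "dept-a": ["team-a1", "team-a2"],
    "dept-b": ["team-b1"],
}
weights = {"dept-a": 3, "dept-b": 2, "team-a1": 2, "team-a2": 1, "team-b1": 1}
tree_shares = hierarchical_shares(children_of, weights, "root", 100.0)
print(tree_shares)
```

Under this assumption dept-a receives 60 CPU, of which team-a1 gets 40 and team-a2 gets 20, while team-b1 receives all 40 CPU of dept-b.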
Installation
# Using Helm
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
# Using kubectl (from release manifests)
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# Verify
kubectl get pods -n volcano-system
# Should see: volcano-scheduler, volcano-controllers, volcano-admission
CLI (vcctl)
# List queues
vcctl queue list
# Create a queue
vcctl queue create -n my-queue -w 2
# View job status
vcctl job list
vcctl job view --name my-job
# Suspend/resume a job
vcctl job suspend --name my-job
vcctl job resume --name my-job
# Delete a job
vcctl job delete --name my-job
Supported Frameworks
Volcano integrates with major computing frameworks without modification:
| Category | Frameworks |
|---|---|
| ML/DL | TensorFlow, PyTorch, MindSpore, PaddlePaddle, Horovod, MXNet |
| Big Data | Spark, Flink |
| Workflow | Argo, Kubeflow |
| HPC | OpenMPI |
| Bio | Cromwell, KubeGene |
| Other | Ray |
Mental Model
graph TD
subgraph Cluster["Cluster Resources"]
subgraph QA["Queue A (weight=3)"]
PG1["PodGroup 1<br/><i>gang</i>"]
PG2["PodGroup 2"]
end
subgraph QB["Queue B (weight=1)"]
PG3["PodGroup 3"]
end
subgraph QC["Queue C (weight=1)"]
PG5["PodGroup 5"]
end
end
PG1 --> |"schedule"| Nodes["Nodes"]
PG2 --> |"schedule"| Nodes
PG3 --> |"schedule"| Nodes
PG5 --> |"schedule"| Nodes
Sched["Scheduler<br/>enqueue → allocate → preempt<br/>→ reclaim → backfill"] --> Cluster
Jobs submit to Queues → PodGroups scheduled as gangs → Pods land on Nodes
Key properties:
- Gang scheduling: all-or-nothing pod group scheduling, no partial deadlocks
- Fair sharing: queues get proportional resources, can borrow slack, must return on demand
- Multi-tenancy: queues with guarantees, caps, and priorities isolate teams
- Extensible: custom actions and plugins for any scheduling algorithm