Volcano

What is Volcano?

Volcano is a cloud-native batch scheduling system for Kubernetes, designed for high-performance workloads. It is a CNCF (Cloud Native Computing Foundation) incubating project, and the project describes itself as the CNCF's first and only official container batch scheduling project.

The core problem it solves: Kubernetes’ default scheduler (kube-scheduler) handles pods one at a time and is designed for long-running services (web servers, APIs). It is inadequate for batch workloads like ML training, big data jobs, and scientific computing, which require:

Scheduling groups of pods together (gang scheduling)
Queue-based resource management and multi-tenancy
Fair sharing, preemption, and resource reclaim between teams
Lifecycle management of jobs with multiple interdependent tasks

Volcano fills this gap without replacing kube-scheduler — it runs alongside it.

Why Not Just kube-scheduler?

Requirement	kube-scheduler	Volcano
Schedule pods individually	Yes	Yes
Schedule groups of pods atomically (gang)	No	Yes
Queue-based resource management	No	Yes
Fair sharing / DRF across queues	No	Yes
Preemption between queues based on priority	Limited	Yes
Resource reclaim when queues are over-allocated	No	Yes
Job lifecycle management (retry, policies)	Basic (via Job)	Advanced (VolcanoJob)
GPU/NPU-aware scheduling	Basic	Advanced (MIG, vCUDA)
Network topology-aware scheduling	No	Yes

Architecture

Volcano consists of four components:

graph TD
    API["Kubernetes API Server<br/><i>CRDs: Queue, PodGroup, VolcanoJob</i>"]

    API --> Scheduler
    API --> CM["Controller Manager"]
    API --> Admission["Admission Webhook"]

    subgraph Scheduler
        direction TB
        S_Actions["Actions:<br/>enqueue → allocate → preempt<br/>→ reclaim → backfill"]
        S_Plugins["Plugins:<br/>gang, priority, DRF, proportion,<br/>predicates, nodeorder, binpack"]
    end

    subgraph CM["Controller Manager"]
        QCM["Queue CM"]
        PGCM["PodGroup CM"]
        VJCM["VCJob CM"]
    end

    Admission -.- |"Validates CRD<br/>API requests"| API

    vcctl["vcctl<br/><i>CLI client</i>"] --> API

Scheduler — Schedules jobs to nodes based on configurable actions and plugins.
Controller Manager — Manages the lifecycle of Volcano CRDs (Queue, PodGroup, VolcanoJob).
Admission — Validates CRD API requests (webhook).
vcctl — Command-line client for Volcano.

Core Concepts

Volcano introduces three custom resource definitions (CRDs):

1. Queue

A Queue is a collection of PodGroups. It is the unit of resource division in Volcano and follows FIFO ordering.

Queues enable multi-tenancy: each team/project gets a queue with guaranteed and capped resources.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  # Hard upper limit of resources this queue can use
  capability:
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: "8"

  # Minimum guaranteed resources (cannot be taken by other queues)
  guarantee:
    resource:
      cpu: "16"
      memory: 64Gi

  # Relative weight for proportional sharing (proportion plugin)
  # deserved = (weight / total-weight) * total-cluster-resources
  weight: 4

  # Allow other queues to reclaim excess resources?
  reclaimable: true

  # Higher priority queues get allocation/preemption precedence
  priority: 100

Key Fields

Field	Description
`capability`	Hard ceiling. Queue cannot use more than this. If unset, defaults to cluster total minus other queues’ guarantees.
`guarantee`	Reserved minimum. Other queues cannot touch these resources. Must be <= `deserved`.
`weight`	Soft share. The `proportion` plugin computes deserved as `(weight/total_weight) * cluster_resources`.
`deserved`	Expected resource amount (used by `capacity` plugin instead of weight). If a queue exceeds its deserved, excess can be reclaimed.
`reclaimable`	Whether excess resources can be reclaimed by other queues (default: `true`).
`priority`	Higher priority queues get precedence in allocation/preemption.
`parent`	For hierarchical queues — specifies the parent queue.

Queue States

State	Meaning
`Open`	Accepting new PodGroups
`Closed`	Not accepting new PodGroups
`Closing`	Transitioning to Closed
`Unknown`	Status unknown (e.g. network issue)

Special Queues

default — Created automatically at startup (weight=1). Jobs not assigned to a queue go here.
root — Created automatically. Parent of all queues when hierarchical queues are enabled.

Resource Sharing Model

graph LR
    subgraph Cluster["Cluster Total: 100 CPU"]
        A["Queue A<br/>weight=3<br/>deserved=60 CPU"]
        B["Queue B<br/>weight=1<br/>deserved=20 CPU"]
        C["Queue C<br/>weight=1<br/>deserved=20 CPU"]
    end

    A -- "borrows idle<br/>resources" --> B
    A -- "borrows idle<br/>resources" --> C
    B -- "reclaims when<br/>work submitted" --> A

deserved = (weight / total_weight) * cluster_total
If Queue B and C are idle → Queue A can borrow up to its capability
When Queue B submits work → Queue A’s excess beyond 60 CPU is reclaimed back
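The arithmetic behind the diagram can be checked with a few lines of Python. This is only a sketch of the formula stated above, not Volcano's implementation; the function name `deserved_shares` is made up for illustration.

```python
def deserved_shares(weights, cluster_total):
    """Split cluster_total across queues in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: cluster_total * w / total_weight for name, w in weights.items()}


# Same numbers as the diagram: 100 CPU, weights 3 / 1 / 1.
shares = deserved_shares({"A": 3, "B": 1, "C": 1}, cluster_total=100)
print(shares)  # {'A': 60.0, 'B': 20.0, 'C': 20.0}
```

Queue A may still run above 60 CPU while B and C are idle; the deserved value only marks the point above which its usage counts as borrowed.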

2. PodGroup

A PodGroup is a group of pods with strong association. It is the scheduling unit — the scheduler considers all pods in a PodGroup together (gang scheduling).

When you create a VolcanoJob, a PodGroup is automatically created with the same name.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job
spec:
  # Minimum pods required to start (gang scheduling threshold)
  minMember: 4

  # Minimum total resources required to start
  minResources:
    cpu: "8"
    memory: 32Gi

  # Queue this PodGroup belongs to
  queue: ml-training

  # Priority class for scheduling order
  priorityClassName: high-priority

Key Fields

Field	Description
`minMember`	Minimum number of pods that must be schedulable. If the cluster can’t fit `minMember` pods, none are scheduled (gang scheduling).
`minResources`	Minimum aggregate resources required. If unavailable, nothing is scheduled.
`queue`	Which queue this PodGroup belongs to.
`priorityClassName`	Used to order PodGroups within a queue.

PodGroup States

stateDiagram-v2
    [*] --> Pending
    Pending --> Inqueue : resources available
    Inqueue --> Running : minMember pods bound
    Running --> [*] : completed
    Running --> Unknown : some pods lost/unscheduled
    Unknown --> Running : pods rescheduled

State	Meaning
`Pending`	Accepted but resource requirements not yet met
`Inqueue`	Passed validation, waiting to be bound to nodes (transient)
`Running`	At least `minMember` pods are running
`Unknown`	Some of `minMember` pods are running, others are not scheduled

Why PodGroups Matter (Gang Scheduling)

Without gang scheduling, in a distributed training job requiring 4 workers:

kube-scheduler might schedule 2 out of 4 workers
Those 2 sit idle waiting for the other 2 (which may never be scheduled)
Resources are wasted — deadlock

With gang scheduling via PodGroup:

Either all 4 workers get scheduled, or none do
No wasted resources, no deadlock
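The difference can be illustrated with a toy model. The sketch below assumes a cluster with a fixed number of identical slots and two jobs that each need 4 workers; the function names are hypothetical and the interleaving is a simplification of how pods from different jobs compete for nodes.

```python
def place_individually(jobs, free_slots):
    """Pod-at-a-time placement: pods from all jobs are interleaved."""
    placed = {name: 0 for name in jobs}
    pending = dict(jobs)  # workers still waiting, per job
    while free_slots > 0 and any(pending.values()):
        for name in jobs:
            if free_slots > 0 and pending[name] > 0:
                placed[name] += 1
                pending[name] -= 1
                free_slots -= 1
    return placed


def place_as_gangs(jobs, free_slots):
    """All-or-nothing placement: a job is placed only if every worker fits."""
    placed = {}
    for name, workers in jobs.items():
        if workers <= free_slots:
            placed[name] = workers
            free_slots -= workers
        else:
            placed[name] = 0
    return placed


jobs = {"job-a": 4, "job-b": 4}
print(place_individually(jobs, free_slots=6))  # {'job-a': 3, 'job-b': 3}
print(place_as_gangs(jobs, free_slots=6))      # {'job-a': 4, 'job-b': 0}
```

With individual placement all 6 slots are occupied but neither job can start. With gang placement job-a runs and job-b waits without holding any resources.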

3. VolcanoJob (vcjob)

A VolcanoJob is Volcano’s enhanced job type. It wraps multiple tasks (each with its own pod template and replica count) into a single job with lifecycle management.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training
spec:
  # Minimum pods to consider the job "running"
  minAvailable: 3

  # Use Volcano scheduler (not default-scheduler)
  schedulerName: volcano

  # Queue assignment
  queue: ml-training

  # Priority for preemption
  priorityClassName: high-priority

  # Max retries before marking as failed
  maxRetry: 5

  # Lifecycle policies
  policies:
    - event: PodEvicted
      action: RestartJob

  # Plugins for job features
  plugins:
    ssh: []     # Sets up SSH between pods (for MPI)
    env: []     # Injects env vars (VK_TASK_INDEX, etc.)
    svc: []     # Creates headless service for DNS

  # Task definitions
  tasks:
    # Parameter server
    - name: ps
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tf-training:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
          restartPolicy: Never

    # Workers
    - name: worker
      replicas: 4
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - name: tensorflow
              image: tf-training:latest
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
                  nvidia.com/gpu: "1"
          restartPolicy: Never

Key Fields

Field	Description
`schedulerName`	Set to `volcano` to have the Volcano scheduler place the job’s pods. `default-scheduler` is also accepted, in which case kube-scheduler places the pods.
`minAvailable`	Minimum running pods for the job to be considered `Running`.
`queue`	Queue the job belongs to.
`tasks`	List of task groups. Each has a `name`, `replicas`, pod `template`, and optional `policies`.
`policies`	Default lifecycle policies for all tasks.
`plugins`	Volcano plugins: `ssh` (inter-pod SSH), `env` (task index env vars), `svc` (headless DNS service).
`maxRetry`	Max retries before marking the job as `Failed`.
`priorityClassName`	Priority for scheduling and preemption.

VolcanoJob States

stateDiagram-v2
    [*] --> Pending
    Pending --> Running : minAvailable pods running
    Pending --> Failed : maxRetry exceeded

    Running --> Completing : tasks finishing
    Completing --> Completed

    Running --> Restarting : e.g. PodEvicted policy
    Restarting --> Pending

    Running --> Aborting : external cause
    Aborting --> Aborted

    Running --> Terminating : internal cause
    Terminating --> Terminated

State	Meaning
`Pending`	Waiting to be scheduled
`Running`	At least `minAvailable` pods running
`Completing`	Pods finishing, cleanup in progress
`Completed`	At least `minAvailable` pods completed
`Restarting`	Job is restarting (e.g. after PodEvicted policy)
`Aborting/Aborted`	Being/has been aborted (external cause)
`Terminating/Terminated`	Being/has been terminated (internal cause)
`Failed`	Could not start after `maxRetry` attempts

The Scheduler: Actions and Plugins

The Volcano scheduler is built on a composite pattern: actions define what to do; plugins define how to do it.

Scheduling Cycle

graph TD
    Open["Open Session"] --> E["1. enqueue<br/><i>filter jobs, Pending → Inqueue</i>"]
    E --> A["2. allocate<br/><i>find best node for each job</i>"]
    A --> P["3. preempt<br/><i>evict lower-priority in same queue</i>"]
    P --> R["4. reclaim<br/><i>reclaim cross-queue excess</i>"]
    R --> B["5. backfill<br/><i>fill remaining gaps</i>"]
    B --> Close["Close Session"]
    Close -.-> |"repeats periodically"| Open

Actions

Action	What it does
enqueue	Filters jobs that meet scheduling requirements. Moves PodGroups from `Pending` → `Inqueue`.
allocate	Selects the best node for each task by filtering nodes with predicates and then scoring the remaining candidates.
preempt	Within the same queue, evicts lower-priority tasks to make room for higher-priority ones.
reclaim	Across queues, reclaims resources from queues that exceed their deserved share when a new queue needs resources.
backfill	Fills remaining pending tasks into available node slots to maximize utilization.

Plugins

Plugins provide the actual algorithms called by actions:

Plugin	Purpose
gang	Ensures all tasks of a PodGroup can be scheduled together. Checks `minMember`/`minResources`.
priority	Compares priority between jobs (by `priorityClassName`) and tasks (by priority, createTime, id).
DRF (Dominant Resource Fairness)	Jobs with a smaller dominant resource share get higher priority. Ensures fair multi-dimensional resource sharing.
proportion	Computes each queue’s deserved share based on `weight`. Handles resource borrowing/reclaim between queues.
predicates	Evaluates whether a task can physically fit on a node (CPU, memory, affinity, taints, etc.).
nodeorder	Scores nodes for a task. Picks the highest-scoring node.
binpack	Packs tasks tightly onto fewer nodes (opposite of spread) to maximize utilization.
conformance	Protects critical pods, such as those in the `kube-system` namespace, from being preempted or reclaimed.

Default Configuration

# ConfigMap: volcano-scheduler-configmap (namespace: volcano-system)
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

Tiers are evaluated in order, and plugins in the same tier are consulted together. A decision reached in an earlier tier takes precedence over later tiers.
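One way to picture tier precedence is as a chain of comparison callbacks in which the first plugin with an opinion wins. The sketch below is a simplified model under that assumption; `compare_jobs`, `by_priority` and `by_share` are hypothetical names, not Volcano functions.

```python
def compare_jobs(tiers, left, right):
    """Walk tiers and plugins in order; the first non-zero verdict decides.

    A negative verdict means `left` is scheduled first.
    """
    for tier in tiers:
        for order_fn in tier:
            verdict = order_fn(left, right)
            if verdict != 0:
                return verdict
    return 0


def by_priority(l, r):
    # Higher priority goes first.
    return (r["priority"] > l["priority"]) - (l["priority"] > r["priority"])


def by_share(l, r):
    # Smaller resource share goes first.
    return (l["share"] > r["share"]) - (l["share"] < r["share"])


tiers = [[by_priority], [by_share]]  # priority tier is consulted before share tier
a = {"priority": 10, "share": 0.8}
b = {"priority": 10, "share": 0.2}
c = {"priority": 50, "share": 0.9}

print(compare_jobs(tiers, a, b))  # 1  (priorities tie, b has the smaller share)
print(compare_jobs(tiers, c, b))  # -1 (c wins on priority; share is never consulted)
```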

Scheduling Policies in Depth

Gang Scheduling

The defining feature of Volcano. Ensures all-or-nothing scheduling for a group of pods.

Problem it solves: distributed training jobs (TensorFlow, PyTorch, MPI) require N workers to run simultaneously. If only N-1 get scheduled, all N-1 sit idle wasting resources. With multiple such jobs, you get resource deadlock.

How it works:

Job specifies minAvailable: N (or PodGroup specifies minMember: N).
Scheduler checks if the cluster can satisfy N pods simultaneously.
If yes → schedule all N. If no → schedule none, keep job pending.

# Gang scheduling example: MPI job needing exactly 8 workers
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
spec:
  minAvailable: 8
  schedulerName: volcano
  queue: compute
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          containers:
            - name: mpi-worker
              image: mpi-app:latest
              resources:
                requests:
                  cpu: "4"
                  nvidia.com/gpu: "1"

Proportion (Queue Fair Sharing)

Each queue gets a deserved share proportional to its weight:

deserved(queue) = (queue.weight / sum(all_weights)) * cluster_total

If a queue uses less than its deserved share, other queues can borrow the slack.
If a queue uses more than its deserved share and another queue needs resources, the excess is reclaimed.
Queues with reclaimable: false cannot have their excess taken back.
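The three rules above can be condensed into a small calculation of how much each queue could lose to reclaim. This is a sketch of the rules as stated here, for a single resource; the field names `allocated` and `deserved` and the function name are illustrative.

```python
def reclaimable_excess(queues):
    """Amount each queue holds beyond its deserved share and may have to give back."""
    excess = {}
    for name, q in queues.items():
        over = max(0, q["allocated"] - q["deserved"])
        # Queues marked reclaimable: false keep whatever they have borrowed.
        excess[name] = over if q.get("reclaimable", True) else 0
    return excess


queues = {
    "A": {"allocated": 100, "deserved": 60},                      # borrowed 40
    "B": {"allocated": 0, "deserved": 20},                        # idle
    "D": {"allocated": 30, "deserved": 20, "reclaimable": False}, # protected
}
print(reclaimable_excess(queues))  # {'A': 40, 'B': 0, 'D': 0}
```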

Binpack Scheduling

Packs pods tightly onto fewer nodes rather than spreading them. Leaves some nodes completely empty (for autoscaler to reclaim or for large jobs).

# Enable binpack with custom weights
- name: binpack
  arguments:
    binpack.weight: 10
    binpack.cpu: 1
    binpack.memory: 1
    binpack.resources: nvidia.com/gpu
    binpack.resources.nvidia.com/gpu: 2
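To see what the weights do, the following sketch scores a node by its weighted utilization after placing a request, so fuller nodes score higher. It is a simplified model in the spirit of binpack, not necessarily Volcano's exact formula; `binpack_score` and the node dictionaries are made up for illustration.

```python
def binpack_score(node, request, weights, plugin_weight=1):
    """Higher score means the node would be fuller after placing the request."""
    score, weight_sum = 0.0, 0
    for resource, amount in request.items():
        w = weights.get(resource, 0)
        if amount == 0 or w == 0:
            continue  # resource not requested or not weighted
        after = node["used"].get(resource, 0) + amount
        capacity = node["allocatable"][resource]
        if after > capacity:
            return 0.0  # request does not fit on this node
        score += w * after / capacity
        weight_sum += w
    return plugin_weight * score / weight_sum if weight_sum else 0.0


# Weights taken from the configuration above.
weights = {"cpu": 1, "memory": 1, "nvidia.com/gpu": 2}
capacity = {"cpu": 16, "memory": 64, "nvidia.com/gpu": 4}
busy = {"allocatable": capacity, "used": {"cpu": 8, "memory": 32, "nvidia.com/gpu": 2}}
empty = {"allocatable": capacity, "used": {}}
request = {"cpu": 4, "memory": 16, "nvidia.com/gpu": 1}

print(binpack_score(busy, request, weights, plugin_weight=10))   # 7.5
print(binpack_score(empty, request, weights, plugin_weight=10))  # 2.5
```

The half-full node wins, so the empty node stays empty for a large job or for the autoscaler to remove.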

DRF (Dominant Resource Fairness)

For multi-dimensional resources (CPU + memory + GPU), DRF identifies each job’s dominant resource (the resource for which the job’s share of the cluster total is largest) and tries to equalize dominant shares across jobs.

Example: If Job A is CPU-heavy and Job B is memory-heavy, DRF ensures neither dominates all resources.
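A short worked example makes the idea concrete. The numbers below are invented for illustration, and `dominant_share` is a hypothetical helper rather than part of Volcano.

```python
def dominant_share(allocated, cluster_total):
    """Return (resource, share) for the resource where the job's share is largest."""
    shares = {r: allocated.get(r, 0) / cluster_total[r] for r in cluster_total}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]


total = {"cpu": 100, "memory": 400}   # memory in GiB
job_a = {"cpu": 40, "memory": 40}     # CPU-heavy: 40% of CPU, 10% of memory
job_b = {"cpu": 10, "memory": 120}    # memory-heavy: 10% of CPU, 30% of memory

print(dominant_share(job_a, total))  # ('cpu', 0.4)
print(dominant_share(job_b, total))  # ('memory', 0.3)
```

Under DRF, Job B would be served next because its dominant share (0.3) is smaller than Job A's (0.4), even though the two jobs stress different resources.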

Preemption vs Reclaim

	Preempt	Reclaim
Scope	Within the same queue	Across queues
Trigger	Higher-priority job in the queue	Queue needs resources from over-allocated queues
Based on	`priorityClassName`	Queue `weight`/`deserved`

Lifecycle Policies

VolcanoJob supports event-driven lifecycle management:

policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: AbortJob
  - event: TaskCompleted
    action: CompleteJob

Event	Description
`PodEvicted`	A pod was evicted (e.g. node pressure)
`PodFailed`	A pod failed
`TaskCompleted`	All replicas of a task completed
`JobUnknown`	Job state is unknown
`*`	Any event

Action	Description
`RestartJob`	Restart the entire job
`AbortJob`	Abort the job
`CompleteJob`	Mark the job as completed
`TerminateJob`	Terminate the job
`RestartTask`	Restart only the failed task
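The event-to-action mapping can be pictured as a lookup. The sketch below assumes that task-level policies are consulted before job-level ones and that `*` matches any event; both the precedence and the function name `resolve_action` are assumptions made for illustration.

```python
def resolve_action(event, task_policies, job_policies):
    """Return the action of the first matching policy, or None if nothing matches."""
    for policies in (task_policies, job_policies):  # task level first (assumed)
        for policy in policies:
            if policy["event"] in (event, "*"):
                return policy["action"]
    return None


job_policies = [
    {"event": "PodEvicted", "action": "RestartJob"},
    {"event": "PodFailed", "action": "AbortJob"},
]
task_policies = [
    {"event": "TaskCompleted", "action": "CompleteJob"},
]

print(resolve_action("TaskCompleted", task_policies, job_policies))  # CompleteJob
print(resolve_action("PodFailed", task_policies, job_policies))      # AbortJob
print(resolve_action("JobUnknown", task_policies, job_policies))     # None
```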

Job Plugins

Volcano jobs can use built-in plugins:

plugins:
  ssh: []    # Sets up SSH keys between pods (for MPI/Horovod)
  env: []    # Injects env vars: VK_TASK_INDEX, VK_TASK_NUM, etc.
  svc: []    # Creates a headless Service for pod DNS resolution

ssh: Generates SSH key pairs, distributes them to all pods. Essential for MPI-based distributed training (Horovod, OpenMPI).
env: Injects environment variables so each pod knows its index (VK_TASK_INDEX) and total count (VK_TASK_NUM).
svc: Creates a headless Kubernetes Service so pods can discover each other by DNS name (e.g. tf-training-worker-0.tf-training, following the pattern <job>-<task>-<index>.<job>).
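Inside a container, a training script might combine the `env` and `svc` plugins roughly as follows. This is a sketch: it assumes pods are named `<job>-<task>-<index>` behind a headless Service named after the job, and the helper names are hypothetical.

```python
import os


def peer_hosts(job_name, task_name, replicas):
    """Build per-pod DNS names (assumed pattern: <job>-<task>-<index>.<job>)."""
    return [f"{job_name}-{task_name}-{i}.{job_name}" for i in range(replicas)]


def my_rank(environ=os.environ):
    """Read this pod's index from the variable injected by the env plugin."""
    return int(environ.get("VK_TASK_INDEX", "0"))


hosts = peer_hosts("pytorch-ddp", "worker", replicas=4)
print(hosts[0])                          # pytorch-ddp-worker-0.pytorch-ddp
print(my_rank({"VK_TASK_INDEX": "3"}))   # 3
```

A common convention is to treat the pod with index 0 as the rendezvous address that the other workers connect to.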

Real-World Examples

PyTorch Distributed Training

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp
spec:
  minAvailable: 4
  schedulerName: volcano
  queue: ml-training
  plugins:
    env: []
    svc: []
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-training:latest
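              # Note: a real multi-node launch also needs --node_rank and
              # --master_addr/--master_port on every pod; they are omitted here.
              command: ["python", "-m", "torch.distributed.launch",
                        "--nproc_per_node=1", "--nnodes=4",
                        "train.py"]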
              resources:
                requests:
                  cpu: "4"
                  memory: 16Gi
                  nvidia.com/gpu: "1"
          restartPolicy: OnFailure

Spark Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: data-processing
  tasks:
    - name: driver
      replicas: 1
      template:
        spec:
          containers:
            - name: spark-driver
              image: spark:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi

    - name: executor
      replicas: 4
      template:
        spec:
          containers:
            - name: spark-executor
              image: spark:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi

Multi-Queue Setup for Multi-Tenancy

# Team A: ML research (high priority, more resources)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-research
spec:
  weight: 4
  priority: 100
  capability:
    cpu: "128"
    memory: 512Gi
    nvidia.com/gpu: "16"
  guarantee:
    resource:
      cpu: "32"
      memory: 128Gi
      nvidia.com/gpu: "4"
  reclaimable: true
---
# Team B: Data pipeline (lower priority, fewer resources)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: data-pipeline
spec:
  weight: 2
  priority: 50
  capability:
    cpu: "64"
    memory: 256Gi
  guarantee:
    resource:
      cpu: "8"
      memory: 32Gi
  reclaimable: true

Hierarchical Queues

Queues can be organized in a tree structure for complex organizational resource management:

graph TD
    root["root"]
    root --> deptA["dept-a<br/>weight=3"]
    root --> deptB["dept-b<br/>weight=2"]
    deptA --> teamA1["team-a1<br/>weight=2"]
    deptA --> teamA2["team-a2<br/>weight=1"]
    deptB --> teamB1["team-b1<br/>weight=1"]

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a
spec:
  parent: root
  weight: 3
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a1
spec:
  parent: dept-a
  weight: 2

Child queues inherit and share resources from their parent. This models organizational hierarchies naturally.
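Using the weights from the diagram, the split can be worked through by hand or with the sketch below. It assumes a weight-only model in which each parent's share is divided among its children in proportion to their weights; the cluster size of 150 CPU and the function name are invented for illustration.

```python
def hierarchical_shares(tree, weights, total, node="root"):
    """Recursively split `total` among the children of `node` by weight."""
    shares = {node: total}
    children = tree.get(node, [])
    if children:
        weight_sum = sum(weights[c] for c in children)
        for child in children:
            child_total = total * weights[child] / weight_sum
            shares.update(hierarchical_shares(tree, weights, child_total, child))
    return shares


tree = {
    "root": ["dept-a", "dept-b"],
    "dept-a": ["team-a1", "team-a2"],
    "dept-b": ["team-b1"],
}
weights = {"dept-a": 3, "dept-b": 2, "team-a1": 2, "team-a2": 1, "team-b1": 1}

shares = hierarchical_shares(tree, weights, total=150)
print(shares["dept-a"], shares["dept-b"])    # 90.0 60.0
print(shares["team-a1"], shares["team-a2"])  # 60.0 30.0
```

Because team-b1 is the only child of dept-b, it receives the department's full 60 CPU.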

Installation

# Using Helm
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

# Using kubectl (development manifest from the master branch)
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Verify
kubectl get pods -n volcano-system
# Should see: volcano-scheduler, volcano-controllers, volcano-admission

CLI (vcctl)

# List queues
vcctl queue list

# Create a queue
vcctl queue create -n my-queue -w 2

# View job status
vcctl job list
vcctl job view --name my-job

# Suspend/resume a job
vcctl job suspend --name my-job
vcctl job resume --name my-job

# Delete a job
vcctl job delete --name my-job

Supported Frameworks

Volcano integrates with major computing frameworks without modification:

Category	Frameworks
ML/DL	TensorFlow, PyTorch, MindSpore, PaddlePaddle, Horovod, MXNet
Big Data	Spark, Flink
Workflow	Argo, Kubeflow
HPC	OpenMPI
Bio	Cromwell, KubeGene
Other	Ray

Mental Model

graph TD
    subgraph Cluster["Cluster Resources"]
        subgraph QA["Queue A (weight=3)"]
            PG1["PodGroup 1<br/><i>gang</i>"]
            PG2["PodGroup 2"]
        end
        subgraph QB["Queue B (weight=1)"]
            PG3["PodGroup 3"]
        end
        subgraph QC["Queue C (weight=1)"]
            PG5["PodGroup 5"]
        end
    end

    PG1 --> |"schedule"| Nodes["Nodes"]
    PG2 --> |"schedule"| Nodes
    PG3 --> |"schedule"| Nodes
    PG5 --> |"schedule"| Nodes

    Sched["Scheduler<br/>enqueue → allocate → preempt<br/>→ reclaim → backfill"] --> Cluster

Jobs submit to Queues → PodGroups scheduled as gangs → Pods land on Nodes

Key properties:

Gang scheduling: all-or-nothing pod group scheduling, no partial deadlocks
Fair sharing: queues get proportional resources, can borrow slack, must return on demand
Multi-tenancy: queues with guarantees, caps, and priorities isolate teams
Extensible: custom actions and plugins for any scheduling algorithm
