Volcano

What is Volcano?

Volcano is a cloud-native batch scheduling system for Kubernetes, designed for high-performance workloads. It is a CNCF (Cloud Native Computing Foundation) incubating project, and the project describes itself as the first and only official container batch scheduling project in the CNCF.

The core problem it solves: Kubernetes’ default scheduler (kube-scheduler) handles pods one at a time and is designed for long-running services (web servers, APIs). It is inadequate for batch workloads like ML training, big data jobs, and scientific computing, which require:

  • Scheduling groups of pods together (gang scheduling)
  • Queue-based resource management and multi-tenancy
  • Fair sharing, preemption, and resource reclaim between teams
  • Lifecycle management of jobs with multiple interdependent tasks

Volcano fills this gap without replacing kube-scheduler — it runs alongside it.


Why Not Just kube-scheduler?

| Requirement | kube-scheduler | Volcano |
| --- | --- | --- |
| Schedule pods individually | Yes | Yes |
| Schedule groups of pods atomically (gang) | No | Yes |
| Queue-based resource management | No | Yes |
| Fair sharing / DRF across queues | No | Yes |
| Preemption between queues based on priority | Limited | Yes |
| Resource reclaim when queues are over-allocated | No | Yes |
| Job lifecycle management (retry, policies) | Basic (via Job) | Advanced (VolcanoJob) |
| GPU/NPU-aware scheduling | Basic | Advanced (MIG, vCUDA) |
| Network topology-aware scheduling | No | Yes |

Architecture

Volcano consists of four components:

graph TD
    API["Kubernetes API Server<br/><i>CRDs: Queue, PodGroup, VolcanoJob</i>"]

    API --> Scheduler
    API --> CM["Controller Manager"]
    API --> Admission["Admission Webhook"]

    subgraph Scheduler
        direction TB
        S_Actions["Actions:<br/>enqueue → allocate → preempt<br/>→ reclaim → backfill"]
        S_Plugins["Plugins:<br/>gang, priority, DRF, proportion,<br/>predicates, nodeorder, binpack"]
    end

    subgraph CM["Controller Manager"]
        QCM["Queue CM"]
        PGCM["PodGroup CM"]
        VJCM["VCJob CM"]
    end

    Admission -.- |"Validates CRD<br/>API requests"| API

    vcctl["vcctl<br/><i>CLI client</i>"] --> API

  • Scheduler: schedules jobs to nodes based on configurable actions and plugins.
  • Controller Manager: manages the lifecycle of Volcano CRDs (Queue, PodGroup, VolcanoJob).
  • Admission: validates CRD API requests (webhook).
  • vcctl: command-line client for Volcano.

Core Concepts

Volcano introduces three custom resource definitions (CRDs):

1. Queue

A Queue is a collection of PodGroups. It is the unit of resource division in Volcano and follows FIFO ordering.

Queues enable multi-tenancy: each team/project gets a queue with guaranteed and capped resources.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training
spec:
  # Hard upper limit of resources this queue can use
  capability:
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: "8"

  # Minimum guaranteed resources (cannot be taken by other queues)
  guarantee:
    resource:
      cpu: "16"
      memory: 64Gi

  # Relative weight for proportional sharing (proportion plugin)
  # deserved = (weight / total-weight) * total-cluster-resources
  weight: 4

  # Allow other queues to reclaim excess resources?
  reclaimable: true

  # Higher priority queues get allocation/preemption precedence
  priority: 100

Key Fields

| Field | Description |
| --- | --- |
| capability | Hard ceiling. The queue cannot use more than this. If unset, defaults to cluster total minus other queues’ guarantees. |
| guarantee | Reserved minimum. Other queues cannot touch these resources. Must be <= deserved. |
| weight | Soft share. The proportion plugin computes deserved as (weight/total_weight) * cluster_resources. |
| deserved | Expected resource amount (used by the capacity plugin instead of weight). If a queue exceeds its deserved amount, the excess can be reclaimed. |
| reclaimable | Whether excess resources can be reclaimed by other queues (default: true). |
| priority | Higher-priority queues get precedence in allocation/preemption. |
| parent | For hierarchical queues: specifies the parent queue. |

Queue States

| State | Meaning |
| --- | --- |
| Open | Accepting new PodGroups |
| Closed | Not accepting new PodGroups |
| Closing | Transitioning to Closed |
| Unknown | Status unknown (e.g. network issue) |

Special Queues

  • default — Created automatically at startup (weight=1). Jobs not assigned to a queue go here.
  • root — Created automatically. Parent of all queues when hierarchical queues are enabled.

Resource Sharing Model

graph LR
    subgraph Cluster["Cluster Total: 100 CPU"]
        A["Queue A<br/>weight=3<br/>deserved=60 CPU"]
        B["Queue B<br/>weight=1<br/>deserved=20 CPU"]
        C["Queue C<br/>weight=1<br/>deserved=20 CPU"]
    end

    A -- "borrows idle<br/>resources" --> B
    A -- "borrows idle<br/>resources" --> C
    B -- "reclaims when<br/>work submitted" --> A

  • deserved = (weight / total_weight) * cluster_total
  • If Queue B and C are idle → Queue A can borrow up to its capability
  • When Queue B submits work → Queue A’s usage beyond its deserved 60 CPU is reclaimed
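The weight formula above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only, not Volcano's implementation; the function name `deserved_shares` is hypothetical, and the weights and the 100 CPU total are taken from the diagram.

```python
from fractions import Fraction


def deserved_shares(weights, cluster_total):
    """Split cluster_total across queues in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        name: Fraction(weight, total_weight) * cluster_total
        for name, weight in weights.items()
    }


# Weights and cluster size from the diagram: A=3, B=1, C=1, 100 CPU in total.
shares = deserved_shares({"A": 3, "B": 1, "C": 1}, cluster_total=100)
for name, cpu in shares.items():
    print(f"Queue {name}: deserved = {float(cpu):.0f} CPU")
# Queue A: deserved = 60 CPU
# Queue B: deserved = 20 CPU
# Queue C: deserved = 20 CPU
```

Fractions are used so the shares always add up to the cluster total without floating-point drift.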

2. PodGroup

A PodGroup is a group of pods with strong association. It is the scheduling unit — the scheduler considers all pods in a PodGroup together (gang scheduling).

When you create a VolcanoJob, a PodGroup is automatically created with the same name.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job
spec:
  # Minimum pods required to start (gang scheduling threshold)
  minMember: 4

  # Minimum total resources required to start
  minResources:
    cpu: "8"
    memory: 32Gi

  # Queue this PodGroup belongs to
  queue: ml-training

  # Priority class for scheduling order
  priorityClassName: high-priority

Key Fields

| Field | Description |
| --- | --- |
| minMember | Minimum number of pods that must be schedulable. If the cluster can’t fit minMember pods, none are scheduled (gang scheduling). |
| minResources | Minimum aggregate resources required. If unavailable, nothing is scheduled. |
| queue | Which queue this PodGroup belongs to. |
| priorityClassName | Used to order PodGroups within a queue. |

PodGroup States

stateDiagram-v2
    [*] --> Pending
    Pending --> Inqueue : resources available
    Inqueue --> Running : minMember pods bound
    Running --> [*] : completed
    Running --> Unknown : some pods lost/unscheduled
    Unknown --> Running : pods rescheduled

| State | Meaning |
| --- | --- |
| Pending | Accepted but resource requirements not yet met |
| Inqueue | Passed validation, waiting to be bound to nodes (transient) |
| Running | At least minMember pods are running |
| Unknown | Some of minMember pods are running, others are not scheduled |

Why PodGroups Matter (Gang Scheduling)

Without gang scheduling, in a distributed training job requiring 4 workers:

  • kube-scheduler might schedule 2 out of 4 workers
  • Those 2 sit idle waiting for the other 2 (which may never be scheduled)
  • Resources are wasted — deadlock

With gang scheduling via PodGroup:

  • Either all 4 workers get scheduled, or none do
  • No wasted resources, no deadlock
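The difference can be illustrated with a toy simulation. It is a sketch under stated assumptions, not how either scheduler is implemented: the cluster is reduced to a count of free slots, and the interleaved per-pod placement is just one ordering kube-scheduler could end up producing. Both function names are hypothetical.

```python
def schedule_per_pod(jobs, free_slots):
    """Place pods one at a time, interleaving jobs, until slots run out."""
    placed = {name: 0 for name in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for name, wanted in jobs.items():
            if free_slots > 0 and placed[name] < wanted:
                placed[name] += 1
                free_slots -= 1
                progress = True
    return placed


def schedule_gang(jobs, free_slots):
    """Place a job only if all of its pods fit; otherwise place none of them."""
    placed = {}
    for name, wanted in jobs.items():
        if wanted <= free_slots:
            placed[name] = wanted
            free_slots -= wanted
        else:
            placed[name] = 0
    return placed


# Two jobs that each need all 4 workers, on a cluster with room for 6 pods.
jobs = {"job-1": 4, "job-2": 4}
per_pod = schedule_per_pod(jobs, free_slots=6)
gang = schedule_gang(jobs, free_slots=6)

print(per_pod)  # {'job-1': 3, 'job-2': 3}  -> neither job can start
print(gang)     # {'job-1': 4, 'job-2': 0}  -> job-1 runs, job-2 waits
```

In the per-pod case all six slots are occupied yet no job makes progress; in the gang case two slots stay free, but one job actually runs.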

3. VolcanoJob (vcjob)

A VolcanoJob is Volcano’s enhanced job type. It wraps multiple tasks (each with their own pod template and replica count) into a single job with lifecycle management.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training
spec:
  # Minimum pods to consider the job "running"
  minAvailable: 3

  # Use Volcano scheduler (not default-scheduler)
  schedulerName: volcano

  # Queue assignment
  queue: ml-training

  # Priority for preemption
  priorityClassName: high-priority

  # Max retries before marking as failed
  maxRetry: 5

  # Lifecycle policies
  policies:
    - event: PodEvicted
      action: RestartJob

  # Plugins for job features
  plugins:
    ssh: []     # Sets up SSH between pods (for MPI)
    env: []     # Injects env vars (VK_TASK_INDEX, etc.)
    svc: []     # Creates headless service for DNS

  # Task definitions
  tasks:
    # Parameter server
    - name: ps
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tf-training:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
          restartPolicy: Never

    # Workers
    - name: worker
      replicas: 4
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - name: tensorflow
              image: tf-training:latest
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
                  nvidia.com/gpu: "1"
          restartPolicy: Never

Key Fields

| Field | Description |
| --- | --- |
| schedulerName | Must be volcano to use the Volcano scheduler. Can also be default-scheduler. |
| minAvailable | Minimum running pods for the job to be considered Running. |
| queue | Queue the job belongs to. |
| tasks | List of task groups. Each has a name, replicas, pod template, and optional policies. |
| policies | Default lifecycle policies for all tasks. |
| plugins | Volcano plugins: ssh (inter-pod SSH), env (task index env vars), svc (headless DNS service). |
| maxRetry | Max retries before marking the job as Failed. |
| priorityClassName | Priority for scheduling and preemption. |

VolcanoJob States

stateDiagram-v2
    [*] --> Pending
    Pending --> Running : minAvailable pods running
    Pending --> Failed : maxRetry exceeded

    Running --> Completing : tasks finishing
    Completing --> Completed

    Running --> Restarting : e.g. PodEvicted policy
    Restarting --> Pending

    Running --> Aborting : external cause
    Aborting --> Aborted

    Running --> Terminating : internal cause
    Terminating --> Terminated

| State | Meaning |
| --- | --- |
| Pending | Waiting to be scheduled |
| Running | At least minAvailable pods running |
| Completing | Pods finishing, cleanup in progress |
| Completed | At least minAvailable pods completed |
| Restarting | Job is restarting (e.g. after PodEvicted policy) |
| Aborting/Aborted | Being/has been aborted (external cause) |
| Terminating/Terminated | Being/has been terminated (internal cause) |
| Failed | Could not start after maxRetry attempts |
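The state diagram can also be written down as a transition table, which makes it easy to check whether an observed sequence of job phases is consistent with it. The table below only encodes the arrows shown in the diagram above; the real controller may allow transitions that the diagram omits, and the helper name `is_valid_path` is hypothetical.

```python
# Allowed transitions, copied from the arrows in the state diagram.
TRANSITIONS = {
    "Pending": {"Running", "Failed"},
    "Running": {"Completing", "Restarting", "Aborting", "Terminating"},
    "Completing": {"Completed"},
    "Restarting": {"Pending"},
    "Aborting": {"Aborted"},
    "Terminating": {"Terminated"},
}

# States with no outgoing arrow in the diagram.
TERMINAL = {"Completed", "Failed", "Aborted", "Terminated"}


def is_valid_path(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


# A job that is evicted once, restarts, and then completes.
happy = ["Pending", "Running", "Restarting", "Pending",
         "Running", "Completing", "Completed"]
print(is_valid_path(happy))                     # True
print(is_valid_path(["Pending", "Completed"]))  # False: must pass through Running
```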

The Scheduler: Actions and Plugins

The Volcano scheduler is built on a composite pattern: actions define what to do; plugins define how to do it.

Scheduling Cycle

graph TD
    Open["Open Session"] --> E["1. enqueue<br/><i>filter jobs, Pending → Inqueue</i>"]
    E --> A["2. allocate<br/><i>find best node for each job</i>"]
    A --> P["3. preempt<br/><i>evict lower-priority in same queue</i>"]
    P --> R["4. reclaim<br/><i>reclaim cross-queue excess</i>"]
    R --> B["5. backfill<br/><i>fill remaining gaps</i>"]
    B --> Close["Close Session"]
    Close -.-> |"repeats periodically"| Open

Actions

| Action | What it does |
| --- | --- |
| enqueue | Filters jobs that meet scheduling requirements. Moves PodGroups from Pending → Inqueue. |
| allocate | Selects the best node for each task using predicate (filtering) and scoring algorithms. |
| preempt | Within the same queue, evicts lower-priority tasks to make room for higher-priority ones. |
| reclaim | Across queues, reclaims resources from queues that exceed their deserved share when another queue needs resources. |
| backfill | Places remaining pending tasks into leftover node capacity to maximize utilization. |

Plugins

Plugins provide the actual algorithms called by actions:

| Plugin | Purpose |
| --- | --- |
| gang | Ensures all tasks of a PodGroup can be scheduled together. Checks minMember/minResources. |
| priority | Compares priority between jobs (by priorityClassName) and tasks (by priority, createTime, id). |
| DRF (Dominant Resource Fairness) | Jobs with a smaller dominant resource share get higher priority. Ensures fair multi-dimensional resource sharing. |
| proportion | Computes each queue’s deserved share based on weight. Handles resource borrowing/reclaim between queues. |
| predicates | Evaluates whether a task can physically fit on a node (CPU, memory, affinity, taints, etc.). |
| nodeorder | Scores nodes for a task. Picks the highest-scoring node. |
| binpack | Packs tasks tightly onto fewer nodes (opposite of spread) to maximize utilization. |
| conformance | Protects critical pods, such as those in the kube-system namespace, from being preempted. |

Default Configuration

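# ConfigMap: volcano-scheduler-configmap (namespace: volcano-system)
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    # Actions run in this order on every scheduling cycle
    actions: "enqueue, allocate, backfill"
    tiers:
    # Tier 1: job-level ordering and protection
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    # Tier 2: fairness, node filtering, and node scoring
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack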

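Plugins in the same tier are consulted together, and tiers are consulted in order: once a plugin in an earlier tier reaches a decision, later tiers are not asked. Only the actions listed in actions run, so preempt and reclaim take effect only when they are added to that list.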
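As a rough mental model of tier precedence, consider job ordering. The sketch below assumes a "first non-zero comparison wins" rule across tiers; the comparator names and job fields are hypothetical, and the real scheduler has several extension points whose combination rules differ in detail.

```python
from functools import cmp_to_key


def by_priority(left, right):
    """Higher priority sorts first; 0 means this plugin has no opinion."""
    return (right["priority"] > left["priority"]) - (right["priority"] < left["priority"])


def by_dominant_share(left, right):
    """Smaller dominant resource share sorts first (DRF-style)."""
    return (left["share"] > right["share"]) - (left["share"] < right["share"])


def compare_jobs(tiers, left, right):
    """Walk tiers in order; the first plugin with a non-zero answer decides."""
    for tier in tiers:
        for plugin in tier:
            result = plugin(left, right)
            if result != 0:
                return result
    return 0


tiers = [[by_priority], [by_dominant_share]]  # tier 1, then tier 2
jobs = [
    {"name": "a", "priority": 10, "share": 0.8},
    {"name": "b", "priority": 10, "share": 0.2},
    {"name": "c", "priority": 100, "share": 0.9},
]
ordered = sorted(jobs, key=cmp_to_key(lambda l, r: compare_jobs(tiers, l, r)))
print([job["name"] for job in ordered])  # ['c', 'b', 'a']
```

Job c wins on priority in the first tier even though its share is the largest; the second tier only breaks the tie between a and b.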


Scheduling Policies in Depth

Gang Scheduling

The defining feature of Volcano. Ensures all-or-nothing scheduling for a group of pods.

Problem it solves: distributed training jobs (TensorFlow, PyTorch, MPI) require N workers to run simultaneously. If only N-1 get scheduled, all N-1 sit idle wasting resources. With multiple such jobs, you get resource deadlock.

How it works:

  1. Job specifies minAvailable: N (or PodGroup specifies minMember: N).
  2. Scheduler checks if the cluster can satisfy N pods simultaneously.
  3. If yes → schedule all N. If no → schedule none, keep job pending.

# Gang scheduling example: MPI job needing exactly 8 workers
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mpi-job
spec:
  minAvailable: 8
  schedulerName: volcano
  queue: compute
  tasks:
    - name: worker
      replicas: 8
      template:
        spec:
          containers:
            - name: mpi-worker
              image: mpi-app:latest
              resources:
                requests:
                  cpu: "4"
                  nvidia.com/gpu: "1"

Proportion (Queue Fair Sharing)

Each queue gets a deserved share proportional to its weight:

deserved(queue) = (queue.weight / sum(all_weights)) * cluster_total

  • If a queue uses less than its deserved share, other queues can borrow the slack.
  • If a queue uses more than its deserved share and another queue needs resources, the excess is reclaimed.
  • Queues with reclaimable: false cannot have their excess taken back.
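The three rules above can be turned into a small calculation of how much each queue could be asked to give back. This is a sketch of the bookkeeping only; the dictionary layout and the function name `reclaimable_excess` are assumptions made for illustration, and the real plugin works per resource dimension and per task.

```python
def reclaimable_excess(queues):
    """Return, per queue, how much usage above its deserved share can be reclaimed."""
    result = {}
    for name, queue in queues.items():
        excess = max(0, queue["allocated"] - queue["deserved"])
        # A queue marked reclaimable: false keeps whatever it has borrowed.
        result[name] = excess if queue.get("reclaimable", True) else 0
    return result


# 100 CPU cluster: A and C have borrowed the slack left by idle queue B.
queues = {
    "A": {"deserved": 60, "allocated": 70, "reclaimable": True},
    "B": {"deserved": 20, "allocated": 0, "reclaimable": True},
    "C": {"deserved": 20, "allocated": 30, "reclaimable": False},
}
print(reclaimable_excess(queues))  # {'A': 10, 'B': 0, 'C': 0}
```

When B submits work, only the 10 CPU that A holds above its deserved share are candidates for reclaim; C is also over its share but is protected by its reclaimable setting.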

Binpack Scheduling

Packs pods tightly onto fewer nodes rather than spreading them. This leaves some nodes completely empty, so the autoscaler can remove them or large jobs can use them.

# Enable binpack with custom weights
- name: binpack
  arguments:
    binpack.weight: 10
    binpack.cpu: 1
    binpack.memory: 1
    binpack.resources: nvidia.com/gpu
    binpack.resources.nvidia.com/gpu: 2
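To see why these weights favour fuller nodes, here is a simplified scoring model: each requested resource contributes its utilization after placement, weighted as configured, and the result is normalized by the sum of weights. This is an illustration of the idea rather than Volcano's exact formula; the function name and the node figures are hypothetical.

```python
def binpack_score(request, used, allocatable, weights, plugin_weight=1):
    """Score a node higher the fuller it would be after placing the task."""
    weighted_sum = 0.0
    weight_total = 0
    for resource, amount in request.items():
        if amount == 0 or resource not in weights:
            continue
        after = used.get(resource, 0) + amount
        if after > allocatable[resource]:
            return 0.0  # the task does not fit on this node
        weighted_sum += weights[resource] * after / allocatable[resource]
        weight_total += weights[resource]
    if weight_total == 0:
        return 0.0
    return plugin_weight * weighted_sum / weight_total


# Per-resource weights from the configuration above.
weights = {"cpu": 1, "memory": 1, "nvidia.com/gpu": 2}
allocatable = {"cpu": 16, "memory": 64, "nvidia.com/gpu": 4}
request = {"cpu": 4, "nvidia.com/gpu": 1}

busy = binpack_score(request, {"cpu": 8, "nvidia.com/gpu": 2}, allocatable, weights)
empty = binpack_score(request, {}, allocatable, weights)
print(busy, empty)  # 0.75 0.25 -> the half-full node is preferred
```

Doubling the GPU weight means GPU utilization counts twice as much as CPU when nodes are ranked, which helps keep whole GPU nodes free.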

DRF (Dominant Resource Fairness)

For multi-dimensional resources (CPU + memory + GPU), DRF identifies each job’s dominant resource (the resource it needs most relative to total) and equalizes dominant resource shares.

Example: If Job A is CPU-heavy and Job B is memory-heavy, DRF ensures neither dominates all resources.
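A small worked example makes the idea concrete. The cluster size and job usage figures below are invented for illustration, and `dominant_share` is a hypothetical helper, not a Volcano API.

```python
def dominant_share(usage, cluster):
    """Return (resource, share) for the resource this job uses most, relative to the cluster."""
    shares = {resource: usage.get(resource, 0) / total
              for resource, total in cluster.items()}
    resource = max(shares, key=shares.get)
    return resource, shares[resource]


cluster = {"cpu": 100, "memory_gib": 400}
job_a = {"cpu": 40, "memory_gib": 40}    # CPU-heavy
job_b = {"cpu": 10, "memory_gib": 120}   # memory-heavy

print(dominant_share(job_a, cluster))  # ('cpu', 0.4)
print(dominant_share(job_b, cluster))  # ('memory_gib', 0.3)

# DRF serves the job with the smaller dominant share next.
jobs = {"A": job_a, "B": job_b}
next_job = min(jobs, key=lambda name: dominant_share(jobs[name], cluster)[1])
print(next_job)  # B
```

Job A holds 40% of the CPUs while Job B holds 30% of the memory, so B is favoured until the two dominant shares even out.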

Preemption vs Reclaim

| Aspect | Preempt | Reclaim |
| --- | --- | --- |
| Scope | Within the same queue | Across queues |
| Trigger | Higher-priority job in the queue | Queue needs resources from over-allocated queues |
| Based on | priorityClassName | Queue weight/deserved |

Lifecycle Policies

VolcanoJob supports event-driven lifecycle management:

policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: AbortJob
  - event: TaskCompleted
    action: CompleteJob

| Event | Description |
| --- | --- |
| PodEvicted | A pod was evicted (e.g. node pressure) |
| PodFailed | A pod failed |
| TaskCompleted | All replicas of a task completed |
| JobUnknown | Job state is unknown |
| * | Any event |

| Action | Description |
| --- | --- |
| RestartJob | Restart the entire job |
| AbortJob | Abort the job |
| CompleteJob | Mark the job as completed |
| TerminateJob | Terminate the job |
| RestartTask | Restart only the failed task |
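Policies can be set on the job and on individual tasks, so it helps to think about which one applies to a given event. The sketch below assumes that task-level policies take precedence over job-level ones and that an exact event match beats the `*` wildcard at the same level; both rules, and the function name `resolve_action`, are assumptions for illustration rather than a description of the controller's code.

```python
def resolve_action(event, task_policies, job_policies):
    """Pick the action for an event: task level first, then job level."""
    for policies in (task_policies, job_policies):
        wildcard = None
        for policy in policies:
            if policy["event"] == event:
                return policy["action"]      # exact match at this level
            if policy["event"] == "*":
                wildcard = policy["action"]  # remember the catch-all
        if wildcard is not None:
            return wildcard
    return None  # no policy configured for this event


job_policies = [
    {"event": "PodEvicted", "action": "RestartJob"},
    {"event": "PodFailed", "action": "AbortJob"},
]
worker_policies = [
    {"event": "TaskCompleted", "action": "CompleteJob"},
]

print(resolve_action("TaskCompleted", worker_policies, job_policies))  # CompleteJob
print(resolve_action("PodEvicted", worker_policies, job_policies))     # RestartJob
print(resolve_action("JobUnknown", worker_policies, job_policies))     # None
```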

Job Plugins

Volcano jobs can use built-in plugins:

plugins:
  ssh: []    # Sets up SSH keys between pods (for MPI/Horovod)
  env: []    # Injects env vars: VK_TASK_INDEX, VK_TASK_NUM, etc.
  svc: []    # Creates a headless Service for pod DNS resolution

  • ssh: Generates SSH key pairs and distributes them to all pods. Essential for MPI-based distributed training (Horovod, OpenMPI).
  • env: Injects environment variables so each pod knows its index (VK_TASK_INDEX) and total count (VK_TASK_NUM).
  • svc: Creates a headless Kubernetes Service so pods can discover each other by DNS name (e.g. job-name-worker-0.job-name).
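Inside a container, a training script might use these plugins roughly as follows. This is a sketch under stated assumptions: it reads only VK_TASK_INDEX, and it assumes pods are named `<job>-<task>-<index>` and are resolvable under a headless service named after the job. Both helper functions are hypothetical.

```python
import os


def my_rank(environ=os.environ):
    """Read this pod's index within its task (set by the env plugin)."""
    return int(environ.get("VK_TASK_INDEX", "0"))


def peer_hosts(job_name, task_name, replicas):
    """Build the DNS names of all pods in a task (assumed naming scheme)."""
    return [f"{job_name}-{task_name}-{index}.{job_name}" for index in range(replicas)]


hosts = peer_hosts("pytorch-ddp", "worker", replicas=4)
print(hosts[0])  # pytorch-ddp-worker-0.pytorch-ddp

# Simulate the environment of the third worker pod.
print(my_rank({"VK_TASK_INDEX": "2"}))  # 2
```

A common convention is to treat index 0 as the rendezvous host, so every worker connects to `hosts[0]` and passes `my_rank()` as its node rank.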

Real-World Examples

PyTorch Distributed Training

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp
spec:
  minAvailable: 4
  schedulerName: volcano
  queue: ml-training
  plugins:
    env: []
    svc: []
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-training:latest
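              # node_rank comes from VK_TASK_INDEX (env plugin). The master address
              # assumes the svc plugin exposes pods as <job>-<task>-<index>.<job>.
              # Port 23456 is an arbitrary choice.
              command: ["sh", "-c"]
              args:
                - >-
                  python -m torch.distributed.launch
                  --nproc_per_node=1 --nnodes=4
                  --node_rank=$VK_TASK_INDEX
                  --master_addr=pytorch-ddp-worker-0.pytorch-ddp
                  --master_port=23456
                  train.py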
              resources:
                requests:
                  cpu: "4"
                  memory: 16Gi
                  nvidia.com/gpu: "1"
          restartPolicy: OnFailure

Spark Job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: spark-job
spec:
  minAvailable: 3
  schedulerName: volcano
  queue: data-processing
  tasks:
    - name: driver
      replicas: 1
      template:
        spec:
          containers:
            - name: spark-driver
              image: spark:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi

    - name: executor
      replicas: 4
      template:
        spec:
          containers:
            - name: spark-executor
              image: spark:latest
              resources:
                requests:
                  cpu: "2"
                  memory: 8Gi

Multi-Queue Setup for Multi-Tenancy

# Team A: ML research (high priority, more resources)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-research
spec:
  weight: 4
  priority: 100
  capability:
    cpu: "128"
    memory: 512Gi
    nvidia.com/gpu: "16"
  guarantee:
    resource:
      cpu: "32"
      memory: 128Gi
      nvidia.com/gpu: "4"
  reclaimable: true
---
# Team B: Data pipeline (lower priority, fewer resources)
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: data-pipeline
spec:
  weight: 2
  priority: 50
  capability:
    cpu: "64"
    memory: 256Gi
  guarantee:
    resource:
      cpu: "8"
      memory: 32Gi
  reclaimable: true

Hierarchical Queues

Queues can be organized in a tree structure for complex organizational resource management:

graph TD
    root["root"]
    root --> deptA["dept-a<br/>weight=3"]
    root --> deptB["dept-b<br/>weight=2"]
    deptA --> teamA1["team-a1<br/>weight=2"]
    deptA --> teamA2["team-a2<br/>weight=1"]
    deptB --> teamB1["team-b1<br/>weight=1"]

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a
spec:
  parent: root
  weight: 3
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a1
spec:
  parent: dept-a
  weight: 2

Child queues share the resources allotted to their parent: a department's share is divided among its teams, which mirrors how organizations budget resources.
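For the tree in the diagram, the resulting shares can be estimated by applying the weight formula at each level. This sketch assumes weights simply subdivide the parent's share; how Volcano actually enforces hierarchy depends on the plugin in use, and the nested dictionary layout and function name are hypothetical.

```python
def hierarchical_shares(total, queues):
    """Split `total` among sibling queues by weight, then recurse into children."""
    shares = {}
    weight_sum = sum(queue["weight"] for queue in queues.values())
    for name, queue in queues.items():
        share = total * queue["weight"] / weight_sum
        shares[name] = share
        if queue.get("children"):
            shares.update(hierarchical_shares(share, queue["children"]))
    return shares


# Queue tree from the diagram, placed under root with 100 CPU.
tree = {
    "dept-a": {"weight": 3, "children": {
        "team-a1": {"weight": 2},
        "team-a2": {"weight": 1},
    }},
    "dept-b": {"weight": 2, "children": {
        "team-b1": {"weight": 1},
    }},
}
shares = hierarchical_shares(100, tree)
for name, cpu in shares.items():
    print(f"{name}: {cpu:.0f} CPU")
# dept-a: 60 CPU
# team-a1: 40 CPU
# team-a2: 20 CPU
# dept-b: 40 CPU
# team-b1: 40 CPU
```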


Installation

# Using Helm
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace

# Using kubectl (development manifest from the master branch; prefer a tagged release for production)
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# Verify
kubectl get pods -n volcano-system
# Should see: volcano-scheduler, volcano-controllers, volcano-admission

CLI (vcctl)

# List queues
vcctl queue list

# Create a queue
vcctl queue create -n my-queue -w 2

# View job status (for job commands, -n is the namespace flag, so the long --name flag is used here)
vcctl job list
vcctl job view --name my-job

# Suspend/resume a job
vcctl job suspend --name my-job
vcctl job resume --name my-job

# Delete a job
vcctl job delete --name my-job

Supported Frameworks

Volcano integrates with major computing frameworks without modification:

| Category | Frameworks |
| --- | --- |
| ML/DL | TensorFlow, PyTorch, MindSpore, PaddlePaddle, Horovod, MXNet |
| Big Data | Spark, Flink |
| Workflow | Argo, Kubeflow |
| HPC | OpenMPI |
| Bio | Cromwell, KubeGene |
| Other | Ray |

Mental Model

graph TD
    subgraph Cluster["Cluster Resources"]
        subgraph QA["Queue A (weight=3)"]
            PG1["PodGroup 1<br/><i>gang</i>"]
            PG2["PodGroup 2"]
        end
        subgraph QB["Queue B (weight=1)"]
            PG3["PodGroup 3"]
        end
        subgraph QC["Queue C (weight=1)"]
            PG5["PodGroup 5"]
        end
    end

    PG1 --> |"schedule"| Nodes["Nodes"]
    PG2 --> |"schedule"| Nodes
    PG3 --> |"schedule"| Nodes
    PG5 --> |"schedule"| Nodes

    Sched["Scheduler<br/>enqueue → allocate → preempt<br/>→ reclaim → backfill"] --> Cluster

Jobs submit to Queues → PodGroups scheduled as gangs → Pods land on Nodes

Key properties:

  • Gang scheduling: all-or-nothing pod group scheduling, no partial deadlocks
  • Fair sharing: queues get proportional resources, can borrow slack, must return on demand
  • Multi-tenancy: queues with guarantees, caps, and priorities isolate teams
  • Extensible: custom actions and plugins for any scheduling algorithm

Further Resources