Kueue

What is Kueue?

Kueue is a Kubernetes-native job queueing system. It manages when workloads are admitted to run on a cluster based on available resources, quotas, and priorities — without replacing the existing Kubernetes scheduler.

The core problem it solves: Kubernetes has a scheduler that decides where to place pods, but it has no concept of queuing. If you submit 1000 batch jobs and the cluster only has capacity for 50, vanilla Kubernetes creates pods for all 1000. Most of them sit in Pending until capacity frees up, which loads the API server and scheduler and gives you no queue-level control over which team's or priority's jobs start next.

Kueue adds:

  • Fair-sharing of resources between teams/tenants
  • Prioritized queuing: important jobs run first
  • Resource quotas with borrowing and preemption
  • Quota-aware admission: admit only what fits, queue the rest
  • Multi-tenancy: teams get guaranteed and ceiling quotas

Kueue is a Kubernetes SIG project (kubernetes-sigs/kueue) sponsored by SIG Scheduling. It is installed as an add-on rather than shipped with core Kubernetes.


How Kueue Fits Into Kubernetes

flowchart TD
    A["User submits Job<br>(with queue-name label)"]
    B["<b>Kueue</b><br>Wraps Job as Workload<br>Places in LocalQueue<br>Routes to ClusterQueue<br>Checks quotas & priorities<br>Admits or queues the Workload"]
    C["<b>Kubernetes Scheduler</b><br>Decides WHERE to place pods<br>(node selection, affinity, etc.)"]

    A --> B
    B -->|admitted| C

Key distinction:

  • Kueue decides IF and WHEN a workload should run (admission control).
  • kube-scheduler decides WHERE it runs (pod placement on nodes).

Kueue does not replace the scheduler. It gates workloads: a Job’s pods are not created until Kueue admits the workload.


Core Concepts

1. Workload

A Workload is Kueue’s internal representation of a job. You don’t create Workloads directly — Kueue automatically creates one for each supported job type (Job, MPIJob, RayJob, PyTorchJob, etc.).

The Workload object tracks:

  • Which queue the job belongs to
  • How many resources it requests (CPU, memory, GPU, etc.)
  • Its admission status (pending, admitted, finished)
  • Priority

# You don't write this -- Kueue creates it automatically
apiVersion: kueue.x-k8s.io/v1beta1
kind: Workload
metadata:
  name: job-my-training-xxxxx
spec:
  queueName: team-a-queue
  podSets:
  - count: 4
    name: main
    template:
      spec:
        containers:
        - resources:
            requests:
              cpu: "8"
              memory: "32Gi"
              nvidia.com/gpu: "1"
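
As a rough illustration of what gets counted against quota, the total request of a podSet is the per-pod request multiplied by the pod count. The Python sketch below reproduces that arithmetic for the Workload above. The function name and the use of plain integers (memory expressed in Gi) are assumptions made for this example, not part of Kueue.

```python
# Values copied from the Workload example: 4 pods, each requesting
# 8 CPUs, 32Gi of memory, and 1 GPU.
POD_COUNT = 4
PER_POD = {"cpu": 8, "memory_gi": 32, "nvidia.com/gpu": 1}


def total_requests(count, per_pod):
    """Multiply each per-pod request by the number of pods in the podSet."""
    return {name: qty * count for name, qty in per_pod.items()}


totals = total_requests(POD_COUNT, PER_POD)
print(totals)
```

Under these assumptions the Workload needs 32 CPUs, 128Gi of memory, and 4 GPUs of quota before it can be admitted.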

2. ResourceFlavor

A ResourceFlavor represents a type of resource in your cluster. It maps to a set of nodes with specific characteristics (GPU type, region, spot vs on-demand, etc.).

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
  nodeTaints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-t4
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot-cpu
spec:
  nodeLabels:
    cloud.google.com/gke-provisioning: spot
---
# A "default" flavor with no special labels
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default
spec: {}

Think of ResourceFlavors as tiers or classes of hardware. Kueue can try multiple flavors for a workload (e.g., try A100 first, fall back to T4).

3. ClusterQueue

A ClusterQueue is the central resource management object. It defines:

  • What resources are available (and from which flavors)
  • How much quota each resource has (guaranteed and borrowing limits)
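  • Preemption policies

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: research-cohort      # borrowingLimit / lendingLimit are only valid inside a cohort
  namespaceSelector: {}        # Match all namespaces; when omitted, no namespace is eligible
  # Resources and flavors managed by this queue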
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: gpu-a100
      resources:
      - name: "cpu"
        nominalQuota: 40        # Guaranteed 40 CPUs from A100 nodes
        borrowingLimit: 20      # Can borrow up to 20 more
        lendingLimit: 10        # Will lend up to 10 to others
      - name: "memory"
        nominalQuota: 160Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8         # Guaranteed 8 A100 GPUs
        borrowingLimit: 4
    - name: gpu-t4               # Fallback flavor
      resources:
      - name: "cpu"
        nominalQuota: 20
      - name: "memory"
        nominalQuota: 80Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4

  # Preemption policy
  preemption:
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: LowerPriority
    withinClusterQueue: LowerPriority

  # Fair sharing weight (relative to other CQs in the cohort)
  fairSharing:
    weight: 1

4. LocalQueue

A LocalQueue is the namespaced entry point for users. Users submit jobs to a LocalQueue, which routes them to a ClusterQueue.

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a-cq    # Which ClusterQueue to route to

Multiple LocalQueues (in different namespaces) can point to the same ClusterQueue. This is how multi-tenancy works: each team has its own namespace and LocalQueue, but they share cluster resources through ClusterQueues.

5. Cohort

A Cohort is a group of ClusterQueues that can borrow resources from each other. Cohorts are defined by name — any ClusterQueues with the same cohort field belong to the same cohort.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: research-cohort    # <-- same cohort name
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]   # every flavor must define each covered resource
    flavors:
    - name: gpu-a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 4      # Can borrow up to 4 extra from cohort
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-cq
spec:
  cohort: research-cohort    # <-- same cohort name
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]   # every flavor must define each covered resource
    flavors:
    - name: gpu-a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 4

If team-a only uses 4 of its 8 GPUs, team-b can borrow the idle 4 (up to its borrowingLimit). When team-a needs them back, Kueue can preempt team-b’s borrowing workloads.
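
The borrowing arithmetic in this two-queue cohort can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a single resource (GPUs), exactly two ClusterQueues, and no lendingLimit. The function name is hypothetical and is not a Kueue API.

```python
def borrowable(nominal_other, used_other, borrowing_limit):
    """GPUs a queue may borrow: the other queue's idle quota, capped by the borrower's limit."""
    idle = max(nominal_other - used_other, 0)
    return min(idle, borrowing_limit)


# team-a-cq: nominalQuota 8, currently using 4.
# team-b-cq: nominalQuota 8, borrowingLimit 4.
extra_for_team_b = borrowable(nominal_other=8, used_other=4, borrowing_limit=4)
team_b_ceiling = 8 + extra_for_team_b
print(extra_for_team_b, team_b_ceiling)
```

If team-a were using all 8 of its GPUs, the same function would return 0 and team-b would be limited to its nominal quota.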

flowchart LR
    subgraph cohort ["Cohort: research-cohort"]
        direction LR
        A["team-a-cq<br>8 GPUs nominal"]
        B["team-b-cq<br>8 GPUs nominal"]
        A <-->|"Borrowing / lending"| B
    end

    cohort -.- note["Total cohort: 16 GPUs<br>Each team guaranteed 8<br>Can borrow up to 4 from idle"]

    style note fill:none,stroke:none

6. WorkloadPriorityClass

Defines priority levels for workloads. Higher-priority workloads are admitted first and can preempt lower-priority ones.

apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: production
value: 1000
description: "Production workloads -- highest priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: research
value: 100
description: "Research workloads -- normal priority"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: background
value: 10
description: "Background workloads -- preemptable"

How Admission Works — Step by Step

  1. User submits a Job with a kueue.x-k8s.io/queue-name label.
  2. Kueue suspends the Job (sets .spec.suspend = true) so no pods are created.
  3. Kueue creates a Workload object representing the Job.
  4. The Workload enters the LocalQueue, which routes it to the ClusterQueue.
  5. Kueue sorts the queue by priority (highest first), then FIFO within the same priority.
  6. For each pending Workload (in order), Kueue checks:
    • Does the ClusterQueue have enough quota (nominal or borrowed)?
    • Which ResourceFlavor can satisfy the request?
    • Should a lower-priority workload be preempted?
  7. If resources are available, Kueue admits the Workload:
    • Sets an Admission status on the Workload.
    • Unsuspends the Job (sets .spec.suspend = false).
    • The Kubernetes scheduler then places the pods normally.
  8. If resources are not available: the Workload stays queued until resources free up.
  9. When the Job finishes, Kueue releases the quota back to the ClusterQueue.

flowchart TD
    A["Job submitted<br>(suspend=true)"] --> B["Workload created"]
    B --> C["LocalQueue"]
    C --> D["ClusterQueue"]
    D --> E{"Enough<br>quota?"}
    E -->|yes| F["Admitted<br>(suspend=false)"]
    E -->|no| G{"Preempt lower<br>priority workload?"}
    G -->|yes| F
    G -->|no| H["Stay queued<br>(wait for resources)"]
    F --> I["Pods created"]
    I --> J["Job runs"]
    J --> K["Job finishes<br>Quota released"]
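
The ordering and quota check in the steps above can be sketched as a single pass over the queue. This is a deliberately small model, not Kueue's implementation: it tracks one resource (GPUs), ignores flavors, borrowing, and preemption, and keeps scanning past workloads that do not fit. The `Workload` class, its `created` field, and the sample workloads are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int
    created: int  # hypothetical submission order (smaller = earlier)
    gpus: int


def admit(pending, free_gpus):
    """Return (admitted names, still-queued names) for one pass over the queue."""
    # Highest priority first, then FIFO within the same priority.
    ordered = sorted(pending, key=lambda w: (-w.priority, w.created))
    admitted, queued = [], []
    for w in ordered:
        if w.gpus <= free_gpus:
            free_gpus -= w.gpus
            admitted.append(w.name)
        else:
            queued.append(w.name)
    return admitted, queued


pending = [
    Workload("research-1", priority=100, created=1, gpus=4),
    Workload("prod-1", priority=1000, created=2, gpus=4),
    Workload("background-1", priority=10, created=3, gpus=2),
    Workload("research-2", priority=100, created=4, gpus=2),
]
admitted, queued = admit(pending, free_gpus=8)
print(admitted, queued)
```

With 8 free GPUs, `prod-1` and `research-1` are admitted and the other two stay queued, even though `research-1` was submitted before `prod-1`.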

Preemption

Kueue supports several preemption strategies:

Within a ClusterQueue

preemption:
  withinClusterQueue: LowerPriority

A higher-priority workload can preempt lower-priority workloads in the same ClusterQueue.

Within a Cohort (reclaim lent resources)

preemption:
  reclaimWithinCohort: Any           # or LowerPriority

If team-a’s resources are being borrowed by team-b, team-a can reclaim them by preempting team-b’s workloads.

Borrowing within a Cohort

preemption:
  borrowWithinCohort:
    policy: LowerPriority
    maxPriorityThreshold: 100        # Only preempt workloads at or below this priority

A workload that needs to borrow quota from the cohort can preempt lower-priority workloads in other ClusterQueues of that cohort.

When a workload is preempted:

  1. Kueue sets the Job back to suspend=true.
  2. The running pods are terminated.
  3. The Workload goes back to the queue and waits for re-admission.
  4. When it’s re-admitted, the Job resumes (new pods are created).
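
To make the LowerPriority policy concrete, here is a minimal Python sketch of victim selection: it considers only running workloads with a priority below the incoming one and evicts the lowest priorities first until enough GPUs are freed. Kueue's real selection logic weighs more factors (borrowing state, cohort policies, minimizing the number of victims), so treat this as an illustration only. The function name and tuple layout are assumptions.

```python
def pick_victims(running, incoming_priority, needed_gpus):
    """Pick lower-priority workloads to preempt until needed_gpus are freed.

    running: list of (name, priority, gpus) tuples.
    Returns victim names, or [] if preemption cannot free enough GPUs.
    """
    candidates = [w for w in running if w[1] < incoming_priority]
    candidates.sort(key=lambda w: w[1])  # lowest priority first
    victims, freed = [], 0
    for name, _priority, gpus in candidates:
        if freed >= needed_gpus:
            break
        victims.append(name)
        freed += gpus
    return victims if freed >= needed_gpus else []


running = [("bg-1", 10, 2), ("research-1", 100, 4), ("prod-1", 1000, 2)]
print(pick_victims(running, incoming_priority=1000, needed_gpus=2))
print(pick_victims(running, incoming_priority=1000, needed_gpus=6))
print(pick_victims(running, incoming_priority=10, needed_gpus=2))
```

An incoming workload at priority 10 finds no candidates here, so it stays queued instead of preempting anything.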

Supported Job Types

Kueue is not limited to batch/v1 Job. It supports a range of workload types through built-in integrations:

| Type | API Group/Version | Use Case |
| --- | --- | --- |
| Job | batch/v1 | Standard batch jobs |
| JobSet | jobset.x-k8s.io/v1alpha2 | Multi-template jobs (e.g., driver + workers) |
| MPIJob | kubeflow.org/v2beta1 | MPI distributed training |
| PyTorchJob | kubeflow.org/v1 | PyTorch distributed training |
| TFJob | kubeflow.org/v1 | TensorFlow distributed training |
| PaddleJob | kubeflow.org/v1 | PaddlePaddle training |
| XGBoostJob | kubeflow.org/v1 | XGBoost training |
| RayJob | ray.io/v1 | Ray workloads |
| RayCluster | ray.io/v1 | Ray clusters |
| Pod | v1 | Plain pods (with the queue-name label) |

Each integration knows how to:

  • Suspend/unsuspend the workload type
  • Extract resource requests from the job spec
  • Map the job’s pod templates to Kueue’s podSets

Practical Examples

Example 1: Simple Batch Job

apiVersion: batch/v1
kind: Job
metadata:
  name: my-training-job
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue    # <-- This is all Kueue needs
spec:
  parallelism: 4
  completions: 4
  template:
    spec:
      containers:
      - name: trainer
        image: my-training:latest
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
      restartPolicy: Never

That’s it. The label kueue.x-k8s.io/queue-name is the only thing you add. Kueue handles the rest.

Example 2: Job with Priority

apiVersion: batch/v1
kind: Job
metadata:
  name: urgent-inference
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
    kueue.x-k8s.io/priority-class: production    # <-- Kueue priority
spec:
  template:
    spec:
      containers:
      - name: inference
        image: my-model:latest
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            nvidia.com/gpu: "2"
      restartPolicy: Never

Example 3: Full Setup — Two Teams Sharing a GPU Cluster

# 1. Resource Flavors
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100-80gb
spec:
  nodeLabels:
    gpu-type: a100-80gb
---
# 2. ClusterQueue for Team ML
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-team-cq
spec:
  namespaceSelector: {}         # Match all namespaces; when omitted, no namespace is eligible
  cohort: gpu-pool              # Shared pool
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100-80gb
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8          # Guaranteed 8 GPUs
        borrowingLimit: 8        # May borrow up to 8, but this cohort only has 4 more to lend
  preemption:
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
---
# 3. ClusterQueue for Team Data
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: data-team-cq
spec:
  namespaceSelector: {}         # Match all namespaces; when omitted, no namespace is eligible
  cohort: gpu-pool              # Same shared pool
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100-80gb
      resources:
      - name: "cpu"
        nominalQuota: 32
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4          # Guaranteed 4 GPUs
        borrowingLimit: 4        # Can use up to 8 total
  preemption:
    reclaimWithinCohort: LowerPriority
    withinClusterQueue: LowerPriority
---
# 4. LocalQueues (one per namespace)
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-queue
  namespace: ml-team
spec:
  clusterQueue: ml-team-cq
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: data-queue
  namespace: data-team
spec:
  clusterQueue: data-team-cq

In this setup:

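  • ML team is guaranteed 8 GPUs and can burst to 12 when the data team is idle. Its borrowingLimit of 8 would allow 16, but the cohort only holds 12 GPUs of nominal quota.
  • Data team is guaranteed 4 GPUs, can burst to 8.
  • The cluster needs at least 12 A100 GPUs to back the nominal quotas.
  • If both teams need all their GPUs, the owning team reclaims its quota: borrowing workloads are preempted if they have lower priority than the workload reclaiming the quota (reclaimWithinCohort: LowerPriority).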

Fair Sharing

When multiple ClusterQueues in a cohort compete for resources, Kueue uses Dominant Resource Fairness (DRF) to decide admission order.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  cohort: shared
  fairSharing:
    weight: 2       # Gets 2x the share compared to weight=1

The weight determines relative share. If team-a has weight 2 and team-b has weight 1, team-a gets ~67% of shared resources and team-b gets ~33%.

Fair sharing kicks in when there’s contention. When there’s no contention, any team can use whatever’s available (up to their borrowingLimit).
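
The weight-to-share relationship described above is a simple ratio, sketched here in Python. This only illustrates the proportions; Kueue's actual fair-sharing computation works on borrowed resources and is more involved. The function name is hypothetical.

```python
from fractions import Fraction


def weighted_shares(weights):
    """Map each queue to its fraction of the contended resources."""
    total = sum(weights.values())
    return {name: Fraction(weight, total) for name, weight in weights.items()}


shares = weighted_shares({"team-a-cq": 2, "team-b-cq": 1})
for name, share in shares.items():
    print(f"{name}: {float(share):.1%}")
```

Adding a third queue with weight 1 would change the split to 2/4, 1/4, and 1/4, since each share is the queue's weight over the sum of all weights in the cohort.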


Admission Checks

Kueue supports AdmissionChecks — external controllers that must approve a workload before it’s admitted. Use cases:

  • Provisioning: trigger node auto-provisioning before admitting (e.g., create GPU nodes on demand).
  • Custom validation: an external controller that validates the workload.
  • Budget approval: check with a cost system before admitting expensive jobs.

apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: provisioning-check
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: spot-provisioning
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  admissionChecks:
  - provisioning-check         # Must pass before admission
  resourceGroups: [...]

ProvisioningRequest Integration

Kueue can integrate with cluster autoscaler’s ProvisioningRequest API to ensure nodes exist before admitting:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: spot-provisioning
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io
  parameters:
    spot: "true"

MultiKueue — Multi-Cluster Federation

MultiKueue lets you distribute workloads across multiple clusters from a single management cluster.

flowchart TD
    subgraph mgmt ["Management Cluster"]
        A["User submits Job"] --> B["LocalQueue"]
        B --> C["ClusterQueue"]
        C --> D["MultiKueue"]
    end

    D --> E["Worker Cluster 1<br>(on-prem)"]
    D --> F["Worker Cluster 2<br>(cloud-A)"]
    D --> G["Worker Cluster 3<br>(cloud-B)"]

The user interacts only with the management cluster. MultiKueue routes workloads to whichever worker cluster has capacity.


Visibility and Monitoring

Check queue status

# List all ClusterQueues and their status
kubectl get clusterqueues

# Detailed view of a ClusterQueue (shows usage vs quota)
kubectl describe clusterqueue team-a-cq

# List all pending/admitted workloads
kubectl get workloads -A

# Check a specific workload's status
kubectl describe workload job-my-training-xxxxx -n team-a

# List LocalQueues in a namespace
kubectl get localqueues -n team-a

ClusterQueue status fields

status:
  admittedWorkloads: 3
  pendingWorkloads: 7
  reservingWorkloads: 3
  flavorsReservation:
  - name: gpu-a100
    resources:
    - name: "nvidia.com/gpu"
      total: "6"           # 6 of 8 GPUs in use
  conditions:
  - type: Active
    status: "True"

Prometheus Metrics

Kueue exposes metrics on :8080/metrics:

| Metric | Description |
| --- | --- |
| kueue_admitted_active_workloads | Currently running admitted workloads |
| kueue_pending_workloads | Workloads waiting for admission |
| kueue_admitted_workloads_total | Total workloads admitted (counter) |
| kueue_evicted_workloads_total | Total workloads evicted/preempted |
| kueue_cluster_queue_resource_usage | Current resource usage per CQ |
| kueue_cluster_queue_nominal_quota | Configured quota per CQ |
| kueue_admission_wait_time_seconds | Time workloads spend waiting |
| kueue_admission_checks_wait_time_seconds | Time waiting for admission checks |

Installation

# Install a pinned release (v0.9.1 shown; check the releases page for newer versions)
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.9.1/manifests.yaml

# Or with Helm
helm install kueue oci://registry.k8s.io/charts/kueue \
  --version v0.9.1 \
  --namespace kueue-system \
  --create-namespace

Kueue Configuration

# Stored as controller_manager_config.yaml in the kueue-manager-config ConfigMap.
# Configuration is a flat component config: it has no metadata or spec wrapper.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/mpijob"
  - "kubeflow.org/pytorchjob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
  - "jobset.x-k8s.io/jobset"
  - "pod"                       # Enable plain pod integration
fairSharing:
  enable: true
  preemptionStrategies:
  - LessThanOrEqualToFinalShare
  - LessThanInitialShare

Common Patterns

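Pattern 1: Default Queue per Namespace

Instead of labeling every Job, create a LocalQueue named default in the namespace. In Kueue releases that ship the LocalQueueDefaulting feature gate, and with that gate enabled, Kueue adds the queue-name label to Jobs that lack one:

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default
  namespace: team-a
spec:
  clusterQueue: team-a-cq

Now any Job in the team-a namespace without an explicit queue label is routed through the default LocalQueue to team-a-cq.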

Pattern 2: Spot + On-Demand Flavors (Cost Optimization)

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cost-optimized-cq
spec:
  namespaceSelector: {}       # Match all namespaces; when omitted, no namespace is eligible
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: spot              # Try spot first (cheaper)
      resources:
      - name: "cpu"
        nominalQuota: 100
      - name: "memory"
        nominalQuota: 400Gi
    - name: on-demand          # Fall back to on-demand
      resources:
      - name: "cpu"
        nominalQuota: 50
      - name: "memory"
        nominalQuota: 200Gi

Kueue tries flavors in order. Jobs land on spot nodes first; if spot quota is exhausted, they go to on-demand.
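
The in-order fallback can be sketched as a first-fit search over the flavor list. This Python sketch looks at CPU only and uses the quotas from the ClusterQueue above. In Kueue a flavor must fit every resource in the resource group, so the real check is stricter. The function name and the `used` figures are assumptions.

```python
def assign_flavor(flavors, used, request_cpu):
    """Return the first flavor, in listed order, whose remaining CPU quota fits the request."""
    for name, nominal in flavors:
        if used.get(name, 0) + request_cpu <= nominal:
            return name
    return None  # nothing fits: the workload stays queued


# Order and nominal CPU quotas taken from cost-optimized-cq.
FLAVORS = [("spot", 100), ("on-demand", 50)]
used = {"spot": 96, "on-demand": 0}  # hypothetical current usage

print(assign_flavor(FLAVORS, used, request_cpu=4))
print(assign_flavor(FLAVORS, used, request_cpu=8))
print(assign_flavor(FLAVORS, used, request_cpu=64))
```

The 4-CPU request still fits on spot, the 8-CPU request falls back to on-demand, and the 64-CPU request fits neither flavor and waits.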

Pattern 3: StopPolicy — Draining a Queue

# Stop admitting new workloads and evict the ones already admitted
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq
spec:
  stopPolicy: HoldAndDrain     # Use "Hold" to leave admitted workloads running

  • Hold: stop admitting, but don’t touch running workloads.
  • HoldAndDrain: stop admitting and evict running workloads.

Useful for maintenance windows.

Pattern 4: Lending Limits

Control how much of your quota others can borrow:

resources:
- name: "nvidia.com/gpu"
  nominalQuota: 8
  lendingLimit: 4       # Only 4 of my 8 GPUs can be borrowed by others

The remaining 4 are reserved exclusively for this ClusterQueue even when idle.
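
A short Python sketch of that rule: the quota available to other queues is whatever is idle, capped by lendingLimit. This models a single resource and a single lender, and the function name is hypothetical.

```python
def lendable(nominal, used, lending_limit):
    """Quota other queues may borrow: idle quota, capped by lendingLimit."""
    idle = max(nominal - used, 0)
    return min(idle, lending_limit)


# nominalQuota 8, lendingLimit 4, as in the snippet above.
for used in (0, 6, 8):
    print(used, lendable(nominal=8, used=used, lending_limit=4))
```

With nothing running, only 4 of the 8 idle GPUs are offered to the cohort. Once the queue itself uses 6, the offer shrinks to the 2 that remain idle.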


Mental Model

flowchart TD
    subgraph cluster ["Cluster"]
        subgraph cohort ["Cohort"]
            CQA["ClusterQueue A<br>nominalQuota / borrowingLimit<br>lendingLimit / preemption"]
            CQB["ClusterQueue B<br>nominalQuota / borrowingLimit<br>lendingLimit / preemption"]
            CQA <-->|"borrow / lend"| CQB
        end

        LQA["LocalQueue<br>(ns: team-a)"]
        LQB["LocalQueue<br>(ns: team-b)"]

        CQA --> LQA
        CQB --> LQB

        LQA --> JA["Jobs with<br>queue-name label"]
        LQB --> JB["Jobs with<br>queue-name label"]

        RF1["ResourceFlavor: gpu-a100<br>nodeLabels: gpu=a100"]
        RF2["ResourceFlavor: spot-cpu<br>nodeLabels: spot=true"]

        WPC["WorkloadPriorityClass<br>production 1000<br>research 100<br>background 10"]
    end

    style RF1 fill:#e8f4f8,stroke:#888
    style RF2 fill:#e8f4f8,stroke:#888
    style WPC fill:#f8f0e8,stroke:#888

Kueue vs Alternatives

| Feature | Kueue | Volcano | YARN (Hadoop) |
| --- | --- | --- | --- |
| Kubernetes-native | Yes | Yes | No (separate system) |
| Scope | Admission control | Scheduler + admission | Full resource manager |
| Replaces kube-scheduler | No | Yes (custom scheduler) | N/A |
| Multi-tenancy | Cohorts + quotas | Queues + quotas | Queues + quotas |
| Preemption | Yes | Yes | Yes |
| Fair sharing | DRF | DRF + others | DRF + others |
| Gang scheduling | Via JobSet | Built-in | Built-in |
| Multi-cluster | MultiKueue | No | Federation |
| Complexity | Low (CRDs only) | Higher | High (separate infra) |

Kueue’s philosophy: do one thing well (admission control) and let Kubernetes handle the rest. Volcano replaces more of the Kubernetes stack. Kueue is simpler and composes better with existing Kubernetes tooling.


Troubleshooting

# Why is my workload not being admitted?
kubectl describe workload <name> -n <namespace>
# Look at Conditions and Events

# Common reasons:
# - "Inadmissible": not enough quota in any flavor
# - "Pending": waiting in queue (lower priority than others)
# - "AdmissionCheck": waiting for external check

# Check ClusterQueue capacity
kubectl get clusterqueue <name> -o yaml
# Look at status.flavorsReservation vs spec quota

# Check Kueue controller logs
kubectl logs -n kueue-system deployment/kueue-controller-manager

# Workload stuck? Check if the Job is properly labeled
kubectl get job <name> -n <namespace> -o yaml | grep -A2 labels

# Verify LocalQueue points to the right ClusterQueue
kubectl get localqueue -n <namespace> -o yaml

Further Resources