Flyte

What is Flyte?

Flyte is an open-source workflow orchestration platform for building, deploying, and managing data and ML pipelines at scale. Originally built at Lyft, it is now a CNCF incubating project backed by Union.ai.

It is not a scheduler like Volcano or Kueue. It operates at a higher level: you define tasks (Python functions) and workflows (DAGs of tasks), and Flyte handles execution, data passing, caching, retries, versioning, and infrastructure.

Think of it as Airflow but designed for ML/data from the ground up, with strong typing, containerized execution, and Kubernetes-native architecture.

Flyte vs Airflow vs Volcano/Kueue

	Flyte	Airflow	Volcano / Kueue
Level	Workflow orchestration	Workflow orchestration	Pod/job scheduling and queue management
Unit of work	Python function (`@task`)	Operator	Pod / Job
DAG definition	Python code with strong types	Python code (weakly typed)	YAML manifests
Data passing	Automatic, typed, via blob store	XCom (limited, JSON-based)	N/A (not its concern)
Versioning	Built-in (immutable, versioned tasks)	External (Git-based)	N/A
Caching	Built-in (deterministic memoization)	None	N/A
Execution	Kubernetes pods (one per task)	Varied (local, K8s, Celery)	Kubernetes pods
Multi-tenancy	Projects + Domains	DAG-level permissions	Queues
Use case	ML pipelines, data workflows, ETL	General orchestration, ETL	Batch job scheduling, GPU sharing

Flyte and Volcano/Kueue solve different problems — Flyte can even use Volcano or Kueue under the hood for pod scheduling while Flyte manages the workflow DAG above.

Core Concepts

Tasks

A task is the smallest unit of work — a decorated Python function that runs in its own container.

from flytekit import task, Resources

@task(
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),
    retries=3,
    cache=True,
    cache_version="1.0",
)
def train_model(data: pd.DataFrame, epochs: int) -> FlyteFile:
    model = ...  # training logic
    model.save("/tmp/model.pkl")
    return FlyteFile("/tmp/model.pkl")

Key properties:

Typed inputs/outputs — Flyte enforces types at compile time (int, str, DataFrame, FlyteFile, etc.)
Containerized — each task runs in its own pod with specified resources
Cacheable — deterministic memoization based on input hash + cache version
Retryable — automatic retries with configurable backoff
Versioned — tasks are immutable once registered; new versions coexist

Workflows

A workflow is a DAG of tasks. It defines the execution order and data flow.

from flytekit import workflow

@workflow
def training_pipeline(raw_data: str, epochs: int = 10) -> FlyteFile:
    data = preprocess(raw_data=raw_data)       # task 1
    model = train_model(data=data, epochs=epochs) # task 2
    report = evaluate(model=model, data=data)    # task 3
    return model

graph LR
    A["preprocess"] --> B["train_model"]
    A --> C["evaluate"]
    B --> C
    C --> D["return model"]

Workflows are:

Compiled to a DAG before execution (not interpreted at runtime like Airflow)
Strongly typed — type mismatches caught before submission
Composable — workflows can call other workflows (sub-workflows)

Launch Plans

A launch plan binds a workflow to specific inputs and scheduling. It is the “deployment unit.”

from flytekit import LaunchPlan, CronSchedule

daily_training = LaunchPlan.get_or_create(
    workflow=training_pipeline,
    name="daily_training",
    default_inputs={"epochs": 20},
    schedule=CronSchedule(schedule="0 2 * * *"),  # 2 AM daily
)

Projects and Domains

Flyte organizes work into:

Projects — logical grouping (e.g. fraud-detection, recommender)
Domains — environment isolation (typically development, staging, production)

Each project+domain combination has independent resource quotas, IAM roles, and configurations.

Architecture

graph TD
    User["User / CI"] -->|"flytectl / SDK"| Admin["FlyteAdmin - API server"]
    Admin --> DB["PostgreSQL - metadata store"]
    Admin --> Blob["Blob Store - S3 / GCS"]
    Admin -->|"creates workflow CRD"| Propeller["FlytePropeller - K8s operator"]
    Propeller -->|"creates pods"| K8s["Kubernetes"]
    K8s --> Pods["Task Pods"]
    Pods -->|"read/write data"| Blob
    Console["FlyteConsole - Web UI"] --> Admin

Component	Role
FlyteAdmin	API server. Stores workflow definitions, manages executions, handles auth.
FlytePropeller	Kubernetes operator. Watches workflow CRDs, creates pods for each task, manages DAG execution.
FlyteConsole	Web UI for viewing workflows, executions, task logs, data lineage.
Blob Store	S3/GCS/MinIO. Stores task input/output data (offloaded from pods).
PostgreSQL	Metadata store for workflow versions, execution history, etc.
DataCatalog	Caching service. Indexes task outputs by input hash for memoization.

Key Features

Strong Typing and Data Lineage

Every task declares typed inputs and outputs. Flyte tracks the full data lineage across the DAG — which task produced which artifact, with which inputs.

Supported types: primitives, dataclass, NamedTuple, pd.DataFrame, FlyteFile, FlyteDirectory, FlyteSchema, custom types via TypeTransformer.

Caching / Memoization

If a task has cache=True, Flyte skips re-execution when the same inputs (by hash) + cache version have been seen before. This saves compute on repeated runs.

Dynamic Workflows

Tasks can generate workflow DAGs at runtime (e.g. fan-out based on data):

from flytekit import dynamic

@dynamic
def process_all(items: List[str]) -> List[Result]:
    results = []
    for item in items:
        results.append(process_one(item=item))
    return results

Map Tasks

Fan-out a single task over a list of inputs:

from flytekit import map_task

@workflow
def batch_pipeline(inputs: List[str]) -> List[Result]:
    return map_task(process_one)(item=inputs)

Task Plugins

Flyte has plugins for external execution backends:

Plugin	What it does
Spark	Submit PySpark jobs
Ray	Run Ray clusters
MPI / Horovod	Distributed training
Sagemaker	Train on AWS SageMaker
BigQuery	Run SQL queries
Hive	Run Hive queries
Papermill	Execute Jupyter notebooks
Dolt	Dataset versioning
Pod	Custom K8s pod specs
DuckDB	Run analytical SQL

Interruptible Tasks (Spot Instances)

Tasks can be marked as interruptible to run on spot/preemptible instances:

@task(interruptible=True, retries=3)
def cheap_training(...):
    ...

If the spot instance is reclaimed, Flyte automatically retries on another node.

Multi-Tenancy

Project isolation — separate namespaces, quotas, IAM
Domain isolation — dev/staging/prod with different resource limits
Role-based access — per-project, per-domain

Reproducibility

Every execution is fully reproducible:

Task code is versioned and containerized
Input data is stored in blob store (immutable)
Exact dependency versions locked in container image
Execution parameters recorded in metadata

Flyte 2 (Latest)

Flyte 2 introduced a significant API redesign:

Pure Python model — less decorator boilerplate, more Pythonic
TaskEnvironment — defines the execution environment (container image, resources) separately from task logic
Apps — first-class support for serving apps (FastAPI, Streamlit, vLLM) alongside batch tasks
AI Agent support — built-in primitives for building LLM agents with memory, tool use, and MCP integration
Async model — better support for long-running and event-driven workflows

Example: Full ML Pipeline

from flytekit import task, workflow, Resources, FlyteFile
import pandas as pd

@task(requests=Resources(cpu="2", mem="8Gi"))
def load_data(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)

@task(requests=Resources(cpu="4", mem="16Gi"))
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    df["feature"] = df["raw"].apply(transform)
    return df

@task(
    requests=Resources(cpu="4", mem="16Gi", gpu="1"),
    cache=True,
    cache_version="v2",
    retries=2,
)
def train(df: pd.DataFrame, lr: float = 0.001) -> FlyteFile:
    model = train_model(df, lr)
    path = "/tmp/model.pt"
    torch.save(model, path)
    return FlyteFile(path)

@task(requests=Resources(cpu="2", mem="8Gi"))
def evaluate(model: FlyteFile, df: pd.DataFrame) -> float:
    m = torch.load(model.download())
    return compute_accuracy(m, df)

@workflow
def ml_pipeline(data_path: str, lr: float = 0.001) -> float:
    raw = load_data(path=data_path)
    clean = preprocess(df=raw)
    model = train(df=clean, lr=lr)
    accuracy = evaluate(model=model, df=clean)
    return accuracy

Run it:

# Local execution
pyflyte run ml_pipeline.py ml_pipeline --data_path s3://bucket/data.parquet

# Remote execution (on Flyte cluster)
pyflyte run --remote ml_pipeline.py ml_pipeline --data_path s3://bucket/data.parquet

# Register for scheduling
pyflyte register ml_pipeline.py

When to Use Flyte

Good fit:

ML training pipelines (data prep + training + evaluation)
Data processing / ETL pipelines
Pipelines requiring caching, retries, and reproducibility
Multi-step workflows with typed data dependencies
Teams needing versioning and lineage tracking

Not the right tool for:

Simple cron jobs (overkill)
Real-time streaming (use Flink/Kafka)
Pod-level scheduling optimization (use Volcano/Kueue)
Simple CI/CD (use GitHub Actions, etc.)