Flyte

What is Flyte?

Flyte is an open-source workflow orchestration platform for building, deploying, and managing data and ML pipelines at scale. Originally built at Lyft, it is now a CNCF incubating project backed by Union.ai.

It is not a scheduler like Volcano or Kueue. It operates at a higher level: you define tasks (Python functions) and workflows (DAGs of tasks), and Flyte handles execution, data passing, caching, retries, versioning, and infrastructure.

Think of it as Airflow but designed for ML/data from the ground up, with strong typing, containerized execution, and Kubernetes-native architecture.


Flyte vs Airflow vs Volcano/Kueue

FlyteAirflowVolcano / Kueue
LevelWorkflow orchestrationWorkflow orchestrationPod/job scheduling and queue management
Unit of workPython function (@task)OperatorPod / Job
DAG definitionPython code with strong typesPython code (weakly typed)YAML manifests
Data passingAutomatic, typed, via blob storeXCom (limited, JSON-based)N/A (not its concern)
VersioningBuilt-in (immutable, versioned tasks)External (Git-based)N/A
CachingBuilt-in (deterministic memoization)NoneN/A
ExecutionKubernetes pods (one per task)Varied (local, K8s, Celery)Kubernetes pods
Multi-tenancyProjects + DomainsDAG-level permissionsQueues
Use caseML pipelines, data workflows, ETLGeneral orchestration, ETLBatch job scheduling, GPU sharing

Flyte and Volcano/Kueue solve different problems — Flyte can even use Volcano or Kueue under the hood for pod scheduling while Flyte manages the workflow DAG above.


Core Concepts

Tasks

A task is the smallest unit of work — a decorated Python function that runs in its own container.

from flytekit import task, Resources

@task(
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),
    retries=3,
    cache=True,
    cache_version="1.0",
)
def train_model(data: pd.DataFrame, epochs: int) -> FlyteFile:
    model = ...  # training logic
    model.save("/tmp/model.pkl")
    return FlyteFile("/tmp/model.pkl")

Key properties:

  • Typed inputs/outputs — Flyte enforces types at compile time (int, str, DataFrame, FlyteFile, etc.)
  • Containerized — each task runs in its own pod with specified resources
  • Cacheable — deterministic memoization based on input hash + cache version
  • Retryable — automatic retries with configurable backoff
  • Versioned — tasks are immutable once registered; new versions coexist

Workflows

A workflow is a DAG of tasks. It defines the execution order and data flow.

from flytekit import workflow

@workflow
def training_pipeline(raw_data: str, epochs: int = 10) -> FlyteFile:
    data = preprocess(raw_data=raw_data)       # task 1
    model = train_model(data=data, epochs=epochs) # task 2
    report = evaluate(model=model, data=data)    # task 3
    return model
graph LR
    A["preprocess"] --> B["train_model"]
    A --> C["evaluate"]
    B --> C
    C --> D["return model"]

Workflows are:

  • Compiled to a DAG before execution (not interpreted at runtime like Airflow)
  • Strongly typed — type mismatches caught before submission
  • Composable — workflows can call other workflows (sub-workflows)

Launch Plans

A launch plan binds a workflow to specific inputs and scheduling. It is the “deployment unit.”

from flytekit import LaunchPlan, CronSchedule

daily_training = LaunchPlan.get_or_create(
    workflow=training_pipeline,
    name="daily_training",
    default_inputs={"epochs": 20},
    schedule=CronSchedule(schedule="0 2 * * *"),  # 2 AM daily
)

Projects and Domains

Flyte organizes work into:

  • Projects — logical grouping (e.g. fraud-detection, recommender)
  • Domains — environment isolation (typically development, staging, production)

Each project+domain combination has independent resource quotas, IAM roles, and configurations.


Architecture

graph TD
    User["User / CI"] -->|"flytectl / SDK"| Admin["FlyteAdmin - API server"]
    Admin --> DB["PostgreSQL - metadata store"]
    Admin --> Blob["Blob Store - S3 / GCS"]
    Admin -->|"creates workflow CRD"| Propeller["FlytePropeller - K8s operator"]
    Propeller -->|"creates pods"| K8s["Kubernetes"]
    K8s --> Pods["Task Pods"]
    Pods -->|"read/write data"| Blob
    Console["FlyteConsole - Web UI"] --> Admin
ComponentRole
FlyteAdminAPI server. Stores workflow definitions, manages executions, handles auth.
FlytePropellerKubernetes operator. Watches workflow CRDs, creates pods for each task, manages DAG execution.
FlyteConsoleWeb UI for viewing workflows, executions, task logs, data lineage.
Blob StoreS3/GCS/MinIO. Stores task input/output data (offloaded from pods).
PostgreSQLMetadata store for workflow versions, execution history, etc.
DataCatalogCaching service. Indexes task outputs by input hash for memoization.

Key Features

Strong Typing and Data Lineage

Every task declares typed inputs and outputs. Flyte tracks the full data lineage across the DAG — which task produced which artifact, with which inputs.

Supported types: primitives, dataclass, NamedTuple, pd.DataFrame, FlyteFile, FlyteDirectory, FlyteSchema, custom types via TypeTransformer.

Caching / Memoization

If a task has cache=True, Flyte skips re-execution when the same inputs (by hash) + cache version have been seen before. This saves compute on repeated runs.

Dynamic Workflows

Tasks can generate workflow DAGs at runtime (e.g. fan-out based on data):

from flytekit import dynamic

@dynamic
def process_all(items: List[str]) -> List[Result]:
    results = []
    for item in items:
        results.append(process_one(item=item))
    return results

Map Tasks

Fan-out a single task over a list of inputs:

from flytekit import map_task

@workflow
def batch_pipeline(inputs: List[str]) -> List[Result]:
    return map_task(process_one)(item=inputs)

Task Plugins

Flyte has plugins for external execution backends:

PluginWhat it does
SparkSubmit PySpark jobs
RayRun Ray clusters
MPI / HorovodDistributed training
SagemakerTrain on AWS SageMaker
BigQueryRun SQL queries
HiveRun Hive queries
PapermillExecute Jupyter notebooks
DoltDataset versioning
PodCustom K8s pod specs
DuckDBRun analytical SQL

Interruptible Tasks (Spot Instances)

Tasks can be marked as interruptible to run on spot/preemptible instances:

@task(interruptible=True, retries=3)
def cheap_training(...):
    ...

If the spot instance is reclaimed, Flyte automatically retries on another node.

Multi-Tenancy

  • Project isolation — separate namespaces, quotas, IAM
  • Domain isolation — dev/staging/prod with different resource limits
  • Role-based access — per-project, per-domain

Reproducibility

Every execution is fully reproducible:

  • Task code is versioned and containerized
  • Input data is stored in blob store (immutable)
  • Exact dependency versions locked in container image
  • Execution parameters recorded in metadata

Flyte 2 (Latest)

Flyte 2 introduced a significant API redesign:

  • Pure Python model — less decorator boilerplate, more Pythonic
  • TaskEnvironment — defines the execution environment (container image, resources) separately from task logic
  • Apps — first-class support for serving apps (FastAPI, Streamlit, vLLM) alongside batch tasks
  • AI Agent support — built-in primitives for building LLM agents with memory, tool use, and MCP integration
  • Async model — better support for long-running and event-driven workflows

Example: Full ML Pipeline

from flytekit import task, workflow, Resources, FlyteFile
import pandas as pd

@task(requests=Resources(cpu="2", mem="8Gi"))
def load_data(path: str) -> pd.DataFrame:
    return pd.read_parquet(path)

@task(requests=Resources(cpu="4", mem="16Gi"))
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    df["feature"] = df["raw"].apply(transform)
    return df

@task(
    requests=Resources(cpu="4", mem="16Gi", gpu="1"),
    cache=True,
    cache_version="v2",
    retries=2,
)
def train(df: pd.DataFrame, lr: float = 0.001) -> FlyteFile:
    model = train_model(df, lr)
    path = "/tmp/model.pt"
    torch.save(model, path)
    return FlyteFile(path)

@task(requests=Resources(cpu="2", mem="8Gi"))
def evaluate(model: FlyteFile, df: pd.DataFrame) -> float:
    m = torch.load(model.download())
    return compute_accuracy(m, df)

@workflow
def ml_pipeline(data_path: str, lr: float = 0.001) -> float:
    raw = load_data(path=data_path)
    clean = preprocess(df=raw)
    model = train(df=clean, lr=lr)
    accuracy = evaluate(model=model, df=clean)
    return accuracy

Run it:

# Local execution
pyflyte run ml_pipeline.py ml_pipeline --data_path s3://bucket/data.parquet

# Remote execution (on Flyte cluster)
pyflyte run --remote ml_pipeline.py ml_pipeline --data_path s3://bucket/data.parquet

# Register for scheduling
pyflyte register ml_pipeline.py

When to Use Flyte

Good fit:

  • ML training pipelines (data prep + training + evaluation)
  • Data processing / ETL pipelines
  • Pipelines requiring caching, retries, and reproducibility
  • Multi-step workflows with typed data dependencies
  • Teams needing versioning and lineage tracking

Not the right tool for:

  • Simple cron jobs (overkill)
  • Real-time streaming (use Flink/Kafka)
  • Pod-level scheduling optimization (use Volcano/Kueue)
  • Simple CI/CD (use GitHub Actions, etc.)

Further Resources