Flyte
What is Flyte?
Flyte is an open-source workflow orchestration platform for building, deploying, and managing data and ML pipelines at scale. Originally built at Lyft, it is now a CNCF incubating project backed by Union.ai.
It is not a scheduler like Volcano or Kueue. It operates at a higher level: you define tasks (Python functions) and workflows (DAGs of tasks), and Flyte handles execution, data passing, caching, retries, versioning, and infrastructure.
Think of it as Airflow but designed for ML/data from the ground up, with strong typing, containerized execution, and Kubernetes-native architecture.
Flyte vs Airflow vs Volcano/Kueue
| Flyte | Airflow | Volcano / Kueue | |
|---|---|---|---|
| Level | Workflow orchestration | Workflow orchestration | Pod/job scheduling and queue management |
| Unit of work | Python function (@task) | Operator | Pod / Job |
| DAG definition | Python code with strong types | Python code (weakly typed) | YAML manifests |
| Data passing | Automatic, typed, via blob store | XCom (limited, JSON-based) | N/A (not its concern) |
| Versioning | Built-in (immutable, versioned tasks) | External (Git-based) | N/A |
| Caching | Built-in (deterministic memoization) | None | N/A |
| Execution | Kubernetes pods (one per task) | Varied (local, K8s, Celery) | Kubernetes pods |
| Multi-tenancy | Projects + Domains | DAG-level permissions | Queues |
| Use case | ML pipelines, data workflows, ETL | General orchestration, ETL | Batch job scheduling, GPU sharing |
Flyte and Volcano/Kueue solve different problems — Flyte can even use Volcano or Kueue under the hood for pod scheduling while Flyte manages the workflow DAG above.
Core Concepts
Tasks
A task is the smallest unit of work — a decorated Python function that runs in its own container.
from flytekit import task, Resources
@task(
requests=Resources(cpu="2", mem="4Gi", gpu="1"),
retries=3,
cache=True,
cache_version="1.0",
)
def train_model(data: pd.DataFrame, epochs: int) -> FlyteFile:
model = ... # training logic
model.save("/tmp/model.pkl")
return FlyteFile("/tmp/model.pkl")
Key properties:
- Typed inputs/outputs — Flyte enforces types at compile time (int, str, DataFrame, FlyteFile, etc.)
- Containerized — each task runs in its own pod with specified resources
- Cacheable — deterministic memoization based on input hash + cache version
- Retryable — automatic retries with configurable backoff
- Versioned — tasks are immutable once registered; new versions coexist
Workflows
A workflow is a DAG of tasks. It defines the execution order and data flow.
from flytekit import workflow
@workflow
def training_pipeline(raw_data: str, epochs: int = 10) -> FlyteFile:
data = preprocess(raw_data=raw_data) # task 1
model = train_model(data=data, epochs=epochs) # task 2
report = evaluate(model=model, data=data) # task 3
return model
graph LR
A["preprocess"] --> B["train_model"]
A --> C["evaluate"]
B --> C
C --> D["return model"]
Workflows are:
- Compiled to a DAG before execution (not interpreted at runtime like Airflow)
- Strongly typed — type mismatches caught before submission
- Composable — workflows can call other workflows (sub-workflows)
Launch Plans
A launch plan binds a workflow to specific inputs and scheduling. It is the “deployment unit.”
from flytekit import LaunchPlan, CronSchedule
daily_training = LaunchPlan.get_or_create(
workflow=training_pipeline,
name="daily_training",
default_inputs={"epochs": 20},
schedule=CronSchedule(schedule="0 2 * * *"), # 2 AM daily
)
Projects and Domains
Flyte organizes work into:
- Projects — logical grouping (e.g.
fraud-detection,recommender) - Domains — environment isolation (typically
development,staging,production)
Each project+domain combination has independent resource quotas, IAM roles, and configurations.
Architecture
graph TD
User["User / CI"] -->|"flytectl / SDK"| Admin["FlyteAdmin - API server"]
Admin --> DB["PostgreSQL - metadata store"]
Admin --> Blob["Blob Store - S3 / GCS"]
Admin -->|"creates workflow CRD"| Propeller["FlytePropeller - K8s operator"]
Propeller -->|"creates pods"| K8s["Kubernetes"]
K8s --> Pods["Task Pods"]
Pods -->|"read/write data"| Blob
Console["FlyteConsole - Web UI"] --> Admin
| Component | Role |
|---|---|
| FlyteAdmin | API server. Stores workflow definitions, manages executions, handles auth. |
| FlytePropeller | Kubernetes operator. Watches workflow CRDs, creates pods for each task, manages DAG execution. |
| FlyteConsole | Web UI for viewing workflows, executions, task logs, data lineage. |
| Blob Store | S3/GCS/MinIO. Stores task input/output data (offloaded from pods). |
| PostgreSQL | Metadata store for workflow versions, execution history, etc. |
| DataCatalog | Caching service. Indexes task outputs by input hash for memoization. |
Key Features
Strong Typing and Data Lineage
Every task declares typed inputs and outputs. Flyte tracks the full data lineage across the DAG — which task produced which artifact, with which inputs.
Supported types: primitives, dataclass, NamedTuple, pd.DataFrame, FlyteFile, FlyteDirectory, FlyteSchema, custom types via TypeTransformer.
Caching / Memoization
If a task has cache=True, Flyte skips re-execution when the same inputs (by hash) + cache version have been seen before. This saves compute on repeated runs.
Dynamic Workflows
Tasks can generate workflow DAGs at runtime (e.g. fan-out based on data):
from flytekit import dynamic
@dynamic
def process_all(items: List[str]) -> List[Result]:
results = []
for item in items:
results.append(process_one(item=item))
return results
Map Tasks
Fan-out a single task over a list of inputs:
from flytekit import map_task
@workflow
def batch_pipeline(inputs: List[str]) -> List[Result]:
return map_task(process_one)(item=inputs)
Task Plugins
Flyte has plugins for external execution backends:
| Plugin | What it does |
|---|---|
| Spark | Submit PySpark jobs |
| Ray | Run Ray clusters |
| MPI / Horovod | Distributed training |
| Sagemaker | Train on AWS SageMaker |
| BigQuery | Run SQL queries |
| Hive | Run Hive queries |
| Papermill | Execute Jupyter notebooks |
| Dolt | Dataset versioning |
| Pod | Custom K8s pod specs |
| DuckDB | Run analytical SQL |
Interruptible Tasks (Spot Instances)
Tasks can be marked as interruptible to run on spot/preemptible instances:
@task(interruptible=True, retries=3)
def cheap_training(...):
...
If the spot instance is reclaimed, Flyte automatically retries on another node.
Multi-Tenancy
- Project isolation — separate namespaces, quotas, IAM
- Domain isolation — dev/staging/prod with different resource limits
- Role-based access — per-project, per-domain
Reproducibility
Every execution is fully reproducible:
- Task code is versioned and containerized
- Input data is stored in blob store (immutable)
- Exact dependency versions locked in container image
- Execution parameters recorded in metadata
Flyte 2 (Latest)
Flyte 2 introduced a significant API redesign:
- Pure Python model — less decorator boilerplate, more Pythonic
- TaskEnvironment — defines the execution environment (container image, resources) separately from task logic
- Apps — first-class support for serving apps (FastAPI, Streamlit, vLLM) alongside batch tasks
- AI Agent support — built-in primitives for building LLM agents with memory, tool use, and MCP integration
- Async model — better support for long-running and event-driven workflows
Example: Full ML Pipeline
from flytekit import task, workflow, Resources, FlyteFile
import pandas as pd
@task(requests=Resources(cpu="2", mem="8Gi"))
def load_data(path: str) -> pd.DataFrame:
return pd.read_parquet(path)
@task(requests=Resources(cpu="4", mem="16Gi"))
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
df = df.dropna()
df["feature"] = df["raw"].apply(transform)
return df
@task(
requests=Resources(cpu="4", mem="16Gi", gpu="1"),
cache=True,
cache_version="v2",
retries=2,
)
def train(df: pd.DataFrame, lr: float = 0.001) -> FlyteFile:
model = train_model(df, lr)
path = "/tmp/model.pt"
torch.save(model, path)
return FlyteFile(path)
@task(requests=Resources(cpu="2", mem="8Gi"))
def evaluate(model: FlyteFile, df: pd.DataFrame) -> float:
m = torch.load(model.download())
return compute_accuracy(m, df)
@workflow
def ml_pipeline(data_path: str, lr: float = 0.001) -> float:
raw = load_data(path=data_path)
clean = preprocess(df=raw)
model = train(df=clean, lr=lr)
accuracy = evaluate(model=model, df=clean)
return accuracy
Run it:
# Local execution
pyflyte run ml_pipeline.py ml_pipeline --data_path s3://bucket/data.parquet
# Remote execution (on Flyte cluster)
pyflyte run --remote ml_pipeline.py ml_pipeline --data_path s3://bucket/data.parquet
# Register for scheduling
pyflyte register ml_pipeline.py
When to Use Flyte
Good fit:
- ML training pipelines (data prep + training + evaluation)
- Data processing / ETL pipelines
- Pipelines requiring caching, retries, and reproducibility
- Multi-step workflows with typed data dependencies
- Teams needing versioning and lineage tracking
Not the right tool for:
- Simple cron jobs (overkill)
- Real-time streaming (use Flink/Kafka)
- Pod-level scheduling optimization (use Volcano/Kueue)
- Simple CI/CD (use GitHub Actions, etc.)
Further Resources
- Official docs
- GitHub
- Union.ai — managed Flyte platform
- Flyte School (YouTube)
- Slack community