Overview of #540: Modern Python monorepo with uv and prek
This episode of Talk Python to Me (host Michael Kennedy) is a deep, practical look inside Apache Airflow, one of the largest open-source Python monorepos, with maintainers Jarek Potiuk and Amogh Desai. They explain why Airflow uses a monorepo, the tooling and Python packaging standards that made it feasible at scale, how the contributor workflow works (uv workspaces + prek hooks + per-package pyproject.toml files), and the architectural choices (including a symlink-based "shared libraries" approach) that let them balance DRY code reuse, isolation, and backward compatibility.
Guests and context
- Jarek Potiuk — Airflow maintainer, Apache Software Foundation member, Security Committee member (drives security/supply-chain thinking).
- Amogh Desai — Apache Airflow PMC member, top contributor, works at Astronomer (a major Airflow stakeholder).
- Project scale highlighted: ~1.2M lines of Python (≈918k excluding comments), 100+ internal packages/distributions, heavy daily PR/issue traffic (dozens of PRs/day).
Key topics covered
- What a monorepo is (and how it differs from a monolith or multi-repo approach).
- Why Airflow chose/keeps a monorepo and how modern tooling changed the tradeoffs.
- Tooling and standards that enabled scaling: uv workspaces, per-package pyproject.toml, dependency groups, inline script metadata, pip changes.
- prek (a faster, workspace-aware replacement for pre-commit) for workspace-scoped hooks and quicker local checks.
- Shared libraries implemented via symlinks + automated vendoring to avoid runtime/dependency conflicts while keeping code DRY.
- Contribution/workflow realities (CI/QA, code review, AI-generated PRs, security considerations).
- IDE integration and helper scripts for PyCharm/VS Code to make multi-package repos editable.
Main takeaways (concise)
- Modern packaging standards + tooling make monorepos practical for large Python projects. The classic reasons to split into many repos have mostly faded if you use the right tools.
- uv workspaces are a game-changer: they let you treat a sub-package as the “active project”, auto-sync (and isolate) the virtual environment to only the dependencies declared for that package, and use source packages in-workspace rather than PyPI installs.
- Common commands: uv sync (create/update the environment for that package) and uv run pytest (auto-sync the environment, then run that package's tests).
- Use a per-package pyproject.toml (dependency groups, inline script metadata) as the single source of truth for each distribution. New PEPs (inline script metadata, PEP 723; dependency groups, PEP 735) plus pip and tooling support make this workable.
- prek (a workspace-aware pre-commit alternative) allows defining hooks inside each distribution, runs fast, and supports tab completion and local scoping of hooks, making local dev + CI more reliable.
- Shared libraries: symlink-and-vendor approach — embed (vendor) the exact version of shared code into a distribution at build time to avoid runtime conflicts and keep packages independent. This gives the benefits of code reuse without forcing a single shared runtime dependency version for all consumers.
- Large-project benefits from these approaches:
- Enforced isolation (can't import code unless declared dependency).
- Easier local testing and reproducible builds.
- Cleaner architecture: encourages explicit initialization, dependency injection, and fewer implicit global imports.
- Better contributor onboarding and fewer “mystery” dependencies in developer envs.
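As a minimal sketch of the per-package layout described above (all package names here are hypothetical, not Airflow's actual distributions), uv's workspace convention puts a members list in the repo-root pyproject.toml and gives each distribution its own file:

```toml
# Repo-root pyproject.toml: declares the workspace so uv can discover members.
[project]
name = "myrepo"
version = "0.1.0"
requires-python = ">=3.9"

[tool.uv.workspace]
members = ["packages/*"]

# packages/provider-foo/pyproject.toml: one distribution, its own source of truth.
# [project]
# name = "provider-foo"
# version = "1.0.0"
# dependencies = ["requests>=2.31"]
#
# [dependency-groups]
# dev = ["pytest>=8"]
```

Running `uv sync` inside packages/provider-foo would then build an environment containing only that package's declared dependencies, which is what enforces the "can't import it unless you declared it" isolation.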
Notable numbers & operational facts
- ~1.2M lines of Python; 120+ Python distributions/subpackages.
- Weekly pulse example: ~310 active PRs in a week, ~200 merged; heavy daily review load.
- Hundreds of automated checks (pre-commit/CI hooks) enforce quality across packages.
- Airflow actively addresses the influx of low-quality / AI-generated PRs through contributor guidelines and triage.
Practical, actionable checklist (if you want to try this)
- Start per-package:
- Add a pyproject.toml to each package (declare dependencies + dependency groups).
- Define a top-level workspace in the repo-level pyproject.toml so tooling can discover packages.
- Adopt uv (or a workspace-capable tool) for local env management:
- Use uv sync in a package dir to create the correct venv for that package.
- Use uv run <tool> (pytest, linters) so env auto-syncs before running.
- Use dependency groups (the [dependency-groups] table in pyproject.toml, per PEP 735) to separate dev/test tooling from runtime dependencies.
- Use inline script metadata for runnable scripts and simpler pre-commit / tooling config.
- Replace the monolithic pre-commit YAML with a workspace-aware solution (prek) that lets hooks live inside each package and runs only the relevant hooks locally.
- Consider a vendoring strategy for shared internal libs:
- Use symlink generation + pre-processing hooks to include (vendor) specific shared-library code into a distribution at build time if you need independent versioning.
- Add small IDE helper scripts for PyCharm/VS Code that auto-discover and mark all package source/test roots — improves navigation & autocomplete.
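For the inline-script-metadata step in the checklist, here is a minimal PEP 723 example. Tools such as `uv run` read the `# /// script` comment block and provision a matching environment before running the file (the script body and its data are illustrative; this one needs only the standard library):

```python
# /// script
# requires-python = ">=3.9"
# dependencies = []
# ///
# PEP 723 inline script metadata: the comment block above travels with the
# script, so no separate requirements file or pre-configured venv is needed.
# Third-party dependencies would be listed in the `dependencies` array.
import json


def summarize(counts: dict) -> str:
    """Render a tiny JSON report of per-package line counts (hypothetical data)."""
    total = sum(counts.values())
    return json.dumps({"packages": len(counts), "total_lines": total})


if __name__ == "__main__":
    print(summarize({"airflow-core": 900_000, "task-sdk": 120_000}))
```

Because the metadata is plain comments, the file still runs under a bare `python script.py` as long as only the standard library is used.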
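The symlink-and-vendor step can be sketched as a small build-time helper (the function and paths are hypothetical, not Airflow's actual scripts): during development the shared library lives in each package as a symlink, and at build time the link is replaced by a real copy so the built wheel ships its own pinned version of the shared code:

```python
import shutil
from pathlib import Path


def vendor_shared(shared_src: Path, package_dir: Path, name: str = "_shared") -> Path:
    """Replace a dev-time symlink with a real copy of the shared library.

    shared_src: the canonical shared-library source tree.
    package_dir: the distribution currently being built.
    Returns the path of the vendored copy inside the package.
    """
    target = package_dir / name
    if target.is_symlink():
        # Drop the development symlink before vendoring.
        target.unlink()
    elif target.exists():
        # Remove a stale copy left over from a previous build.
        shutil.rmtree(target)
    # Copy the exact shared code into the package, so consumers are
    # independent and no single runtime version is forced on everyone.
    shutil.copytree(shared_src, target)
    return target
```

Wired into a pre-build hook per distribution, this gives DRY code reuse in the repo while keeping every published package self-contained.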
Notable quotes & insights
- “The reasons why you would like to have multiple repos are gone now if you're using the right tooling. Only the benefits of having everything in one place remain.” — Jarek
- “uv workspaces were the most important thing for me — they let us split the repo into many distributions and make development isolated and simple.” — Amogh
- “The best way to foresee the future is to shape it.” — Jarek (on collaborating with tool authors to make tooling support monorepo workflows)
Risks, trade-offs & operational notes
- You must invest in CI, automated checks and rigorous contribution guidelines to handle high PR volumes and prevent regressions.
- AI-generated contributions require triage strategies — make low-quality submissions expensive for submitters and quick to close for maintainers.
- GitHub/host availability can still create operational pain (cloning, Git operations) for big repos, but it is generally not a blocking problem.
Resources (mentioned / recommended)
- Apache Airflow GitHub repo — inspect the monorepo and the implementations described.
- “Modern Python repo for Apache Airflow” — four-part blog series by Jarek (detailed how-to + rationale).
- FOSDEM / conference talk from the guests (recording available from FOSDEM).
- uv (workspace-capable Python packaging tool) and prek (workspace-aware pre-commit-style tool) — try them on a small repo to learn the workflow.
Final recommendation from the guests
- If you’re considering a monorepo for a growing Python project: don’t fear it — with pyproject.toml per package, uv workspaces, dependency groups, inline script metadata and workspace-aware hooks, a monorepo is now a robust, maintainable option. Their bottom line: “Just do it.”
