Overview of Post Reports — "The quest to ‘destructively scan’ all the world’s books"
This episode of The Washington Post’s Post Reports (host Martine Powers) summarizes reporting by Will Oremus about Anthropic’s secretive 2024 program codenamed Project Panama: a plan to acquire, slice, scan, and digitize huge quantities of printed books to create training data for its AI models (Claude). The episode explains how the project worked, why books are prized for AI training, the legal fallout (including settled and ongoing copyright lawsuits), and the broader implications for authors, publishers, and the AI industry.
Key points and main takeaways
- Project Panama: Anthropic attempted to build a large, high‑quality digital library by buying bulk used books, removing spines, scanning pages, and recycling the physical remains after digitization.
- Leadership and provenance: Anthropic hired Tom Turvey, formerly involved with Google Books, to lead the effort.
- Scale: Exact figures are redacted, but court documents indicate purchases costing “many millions of dollars” and acquisitions that could involve hundreds of thousands to millions of print books.
- Legal exposure: Anthropic (and other AI firms) also downloaded material from shadow libraries (e.g., LibGen) via torrenting—an activity that contributed to copyright lawsuits and settlements.
- Court outcomes are mixed and unsettled: a judge found that scanning purchased books for training could be fair use, but the pirated copies the company retained without using them for training supported infringement claims; Anthropic settled one authors' suit for roughly $1.5 billion.
- Why books matter: Books are seen as higher‑quality, edited, curated text compared with noisy internet content—valuable for improving model reliability and reducing hallucinations.
- Industry context: Authors, publishers, journalists, photographers, illustrators, and other creatives have brought multiple suits against major AI players (Anthropic, Meta, OpenAI, Microsoft, Google). Decisions in these cases will shape licensing norms and model‑training practices.
Project Panama — how it worked
- Acquisition strategy: Anthropic targeted bulk sellers and used‑book warehouses (e.g., Better World Books), and explored deals with bookstores (The Strand) and libraries.
- Scanning process: To speed scanning and avoid the distortion that bound spines cause, copies were "destructively" processed: spines were sliced off and loose pages were fed into high-speed scanners, after which the remains were recycled.
- Purpose: The digitized library would be used to train Anthropic’s AI models (Claude), with the company arguing this constituted a transformative use.
Legal issues and court rulings
Shadow libraries and torrenting
- Evidence showed Anthropic (and other companies) torrenting large collections from shadow libraries (LibGen and similar). Torrenting raised separate legal risks—potentially uploading pirated material and distributing it—beyond training‑data questions.
- Internal company messages (surfaced in other firms' filings) reflected engineers' concerns about legality; senior leadership reportedly greenlit the use of shadow libraries for model training.
Copyright and fair use
- Transformative-use argument: Anthropic argued that training an AI model on books transforms them into a fundamentally different product (the model itself), which may qualify as fair use because the model does not directly substitute for sales of the books.
- Case outcomes:
- Anthropic: The judge found that scanning purchased books for training likely fell under fair use, but copying pirated books not used for training (building a retained library of copies) raised infringement problems. Anthropic settled the authors' suit for roughly $1.5 billion.
- Meta and others: Different judges have reached different conclusions; in at least one Meta-related decision, a judge said plaintiffs failed to demonstrate how AI training specifically harmed their sales—leaving room for further litigation and divergent rulings.
- Bottom line: Law remains unsettled; outcomes will define whether and how AI firms must license creative works for training.
Why books are especially valuable for AI
- Higher editorial standards: Books tend to be better edited, fact-checked, and structured than much online content—helpful for creating reliable language models.
- Signal-to-noise: The open web includes “cruddy” content, bot‑generated text, and misinformation; books offer concentrated, curated textual quality.
- Strategic play for smaller companies: Anthropic used books as a way to compete with much larger firms by focusing on data quality rather than sheer volume.
Implications for creators, companies, and policy
- Creators: Authors and other creatives want recognition and compensation; many would prefer licensing agreements rather than their works being used without payment.
- AI companies: Some firms have made content deals with news outlets; broader, standardized licensing regimes may follow depending on court rulings or legislation.
- Policy and markets: Court decisions will influence whether training data requires licensing, what counts as fair use for model training, and how damages/liability are assessed.
- Public debate: The tension mirrors earlier digital‑age copyright battles (e.g., Napster)—a formative moment for norms around large‑scale scraping and reuse of cultural works.
Notable quotes and moments
- Evocative phrasing from documents and reporting: “destructively scan all the books in the world” — a key line that captured public attention.
- Internal memo excerpt illustrating approval: "after a prior escalation to MZ, GenAI has been approved to use LibGen for Llama 3 with a number of agreed-upon mitigations"; cited in filings to show internal decision processes around using shadow libraries.
Practical takeaways / recommendations
- For authors and rights holders: Monitor ongoing litigation and consider pursuing collective licensing agreements; engage with policy debates to clarify fair‑use boundaries.
- For AI developers: Prioritize clear legal advice and transparent licensing where possible; weigh data quality gains against legal and reputational risk.
- For policymakers and judges: There is a need for clearer rules or frameworks that balance innovation with fair compensation and copyright protection when models are trained on copyrighted works.
Produced by The Washington Post’s Post Reports. The episode provides context on Anthropic’s Project Panama, the interplay of technology and copyright law, and why the outcomes of these cases will matter for the future of AI and creative labor.
