Overview of Book Ratings and Recommendations
This episode of Data Skeptic (host Kyle Polich) features Hannes Rosenbusch (University of Amsterdam), a psychologist/data scientist and fiction writer, discussing his papers that analyze Goodreads ratings and explore personalized, interpretable book-recommendation methods. The core question, "Are some books better than others?", is examined empirically using Goodreads data, written reviews, metadata, and experiments that combine human introspection with large language models (LLMs).
Key findings
- Variance decomposition of Goodreads ratings shows reader effects dominate: differences in how individual readers rate books explain far more variance than differences among professionally published books.
- Book-level rating differences are often small; many books fall in a similar "corridor" of ratings unless they have very large numbers of ratings.
- Ratings stabilize only after many reviews (often thousands to tens of thousands). Even stable global differences (e.g., 4.1 vs 4.4) are weak predictors of any particular individual's enjoyment.
- Experienced reviewers converge more than casual reviewers. Two possible explanations:
  - Social/influencer effects (experienced users align with prevailing averages).
  - Experienced readers apply more consistent rubrics (narrative structure, characterization, craft), producing less noisy ratings.
- No major variance difference between fiction and nonfiction ratings—both are noisy and idiosyncratic in aggregate.
- Written-review content tends to reflect reviewer tendencies more than book-specific consensus: what readers complain about or notice is often personal rather than universally shared.
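The variance-decomposition finding can be sketched with a small simulation. Everything below is illustrative: the effect sizes are invented to mirror the reported pattern (reader effects dominating book effects), and the method-of-moments partition assumes a fully crossed reader-by-book design, which real Goodreads data is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_readers, n_books = 500, 100

# Invented effect sizes, chosen so that reader effects dominate
# book effects, mirroring the reported finding.
reader_eff = rng.normal(0.0, 0.6, n_readers)   # harsh vs. generous raters
book_eff = rng.normal(0.0, 0.15, n_books)      # small "quality" differences
noise = rng.normal(0.0, 0.5, (n_readers, n_books))
ratings = 3.8 + reader_eff[:, None] + book_eff[None, :] + noise

# Method-of-moments partition for a fully crossed design:
# subtract row/column means to estimate residual noise, then back
# out the reader and book variance components.
grand = ratings.mean()
row_means = ratings.mean(axis=1)               # one mean per reader
col_means = ratings.mean(axis=0)               # one mean per book
resid = ratings - row_means[:, None] - col_means[None, :] + grand
var_noise = resid.var()
var_reader = row_means.var() - var_noise / n_books
var_book = col_means.var() - var_noise / n_readers

total = var_reader + var_book + var_noise
print(f"reader share of variance: {var_reader / total:.2f}")
print(f"book share of variance:   {var_book / total:.2f}")
```

Run on this synthetic data, the reader component comes out roughly an order of magnitude larger than the book component, which is the qualitative shape of the paper's result.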
Data, methods, and experiments
- Data: Goodreads ratings and written reviews plus standard metadata (title, description, page count, etc.). Limited by lack of read-through/read-time metrics on Goodreads.
- Analysis approach:
  - Variance partitioning: disentangle variance attributable to books vs. readers.
  - Sentiment and content analysis of written reviews to test whether textual complaints/claims converge across readers.
  - Correlational analyses for features such as page count, genre, protagonist traits—predictive but generally weak at the individual level.
  - Stability checks: rating averages require large sample sizes to stabilize; population drift (e.g., political re-evaluations of classics) affects longitudinal ratings.
- Self-study of the Isaac method (Hannes plus one friend): a personalized model using many LLM-annotated content features; held-out correlation with true ratings ≈ 0.4 (moderate).
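The stability claim is easy to reproduce in a toy simulation: with per-rating noise on the order of one star, a book's running average wanders over the first few hundred ratings, and pinning it down to within ±0.05 stars takes well over a thousand. The numbers below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# One hypothetical book: true mean 4.1, individual 1-5 star ratings
# with roughly one star of spread (typical for such scales).
true_mean, rating_sd = 4.1, 1.0
stars = np.clip(np.round(rng.normal(true_mean, rating_sd, 50_000)), 1, 5)

# The running average drifts until the sample is large.
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} ratings: average = {stars[:n].mean():.2f}")

# Sample size for a 95% CI half-width of 0.05 stars: n = (1.96 * sd / 0.05)^2
n_needed = (1.96 * rating_sd / 0.05) ** 2
print(f"~{n_needed:.0f} ratings needed for +/-0.05 precision")
```

Even once the average is stable, the episode's point stands: a stable 4.1 vs. 4.4 gap says little about any single reader's enjoyment, because reader-level variance dwarfs it.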
Theoretical and practical implications for recommender systems
- Global average ratings (the intercept) are poor predictors for individual taste.
- Collaborative filtering (user–item co-occurrence) may be effective for generic recommendations but offers little interpretability or support for self-understanding.
- Content-based recommendation is favored by Hannes for books: characterizing books by interpretable features (themes, protagonist attributes, pacing, content flags) helps users understand why a recommendation is made and supports personal taste discovery.
- LLMs are useful as scalable annotators and hypothesis generators for content features, enabling richer content-based features at scale, but human-in-the-loop validation remains necessary.
Isaac method (Introspection + Support: AI Annotation + Curation)
- Workflow:
  - User supplies their rated books (e.g., Goodreads history).
  - System generates many hypothesized predictors (genres, themes, protagonist features, pacing, page count, common review mentions).
  - LLMs automatically annotate book content/metadata with these predictors; annotations are quality-checked on samples by humans.
  - A model is trained per user to predict ratings and to surface top predictors (feature importance).
- Benefits:
  - Produces interpretable, portable features about why a user likes certain books.
  - Helps users gain self-knowledge about reading preferences (e.g., Hannes found survival/dystopian themes strongly predictive for him).
  - Performs moderately well for individuals (reported correlation ~0.4 in small self-studies).
- Limitations:
  - Requires either public-domain/text-accessible books or external sources (summaries, reviews); primary-source content access for commercial books is often restricted.
  - LLM annotation quality needs validation; current studies are small-scale.
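A minimal sketch of the per-user modeling step, with all data invented: in the real pipeline the feature annotations come from LLMs and there are many more of them (pacing, page count, review-mention features, and so on); here three hand-coded binary features and ordinary least squares stand in for the whole thing.

```python
import numpy as np

# Hypothetical stand-ins for LLM-annotated content features.
features = ["survival_theme", "dystopian", "romance"]
X = np.array([
    [1, 1, 0],   # survival + dystopian
    [1, 0, 0],   # survival only
    [0, 1, 0],   # dystopian only
    [0, 0, 1],   # romance
    [0, 0, 1],   # romance
    [1, 1, 0],   # survival + dystopian
], dtype=float)
y = np.array([5, 4, 4, 2, 3, 5], dtype=float)  # one user's star ratings

# Per-user linear model: least squares with an intercept column; the
# fitted coefficients double as interpretable feature importances.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

importance = dict(zip(features, coef[1:]))
for name, w in sorted(importance.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {w:+.2f}")
```

For this invented user the model surfaces survival and dystopian themes as positive predictors and romance as a negative one, which is exactly the kind of "why" a content-based recommender can show back to the reader.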
LLMs in this research and creative workflow
- Uses LLMs as research assistants for:
  - Large-scale annotation of books, summaries, and review content.
  - Hypothesis generation about predictive features.
  - Rapid scripting/coding help (Hannes reports not writing code manually recently).
  - Summarization of papers and large texts.
- Limitations and human role:
  - Hannes uses LLMs for editorial help (finding words, background research) but resists using them to write creative fiction — creative identity and voice are central to authorship.
  - LLMs are not yet trusted as sole editors for fiction; human editors are still preferred for nuanced feedback.
  - Human-in-the-loop validation and spot-checking of LLM outputs are standard practice.
Practical takeaways / recommendations
- For readers:
  - Treat global star averages as a coarse market signal—useful for gauging mainstream appeal but not determinative of your own taste.
  - Try forming a personal opinion before looking at the average rating; writing your own notes and then comparing them to the platform consensus helps reveal where your taste diverges.
- For authors:
  - Single-feature optimizations (e.g., page count, genre) rarely produce strong signals across readers; targeting a specific readership and using interpretable features may be more productive.
- For platforms/recommender designers:
  - Invest in content-based features and surface interpretable reasons for recommendations (e.g., “You like survival/dystopian themes; this book is similar”).
  - Add or expose richer signals: read-through rates, time-on-book, finer-grained content flags (swearing, sex, violence) — these are actionable and user-personalizable.
  - Use NLP/LLM pipelines to extract review-derived flags and themes, but keep human validation and transparency in the loop.
  - Consider product features that characterize a user's rating behavior (e.g., a tendency to give high or low scores) and show how a book matches those tendencies.
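The last point can be made concrete in a few lines: characterize the user's own rating distribution, then re-express a book's platform average in that user's units. The platform-wide numbers here are invented, and a real system would shrink the estimates for users with few ratings.

```python
import statistics

user_ratings = [5, 4, 5, 3, 5, 4, 5]      # a generous rater (invented data)
platform_mean, platform_sd = 3.9, 1.0     # illustrative platform-wide stats

user_mean = statistics.mean(user_ratings)
user_sd = statistics.stdev(user_ratings)

def expected_for_user(book_avg: float) -> float:
    """Map a book's platform average onto this user's personal scale."""
    z = (book_avg - platform_mean) / platform_sd
    return user_mean + z * user_sd

print(f"user mean {user_mean:.2f} vs platform mean {platform_mean:.2f}")
print(f"a 4.2-average book maps to ~{expected_for_user(4.2):.2f} for this user")
```

A platform could surface exactly this kind of translation ("for a rater like you, this book would typically land around 4.7") instead of the raw global average.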
Limitations & caveats
- Goodreads dataset lacks direct consumption signals (read-through, time-to-complete), which would be highly informative.
- Many analyses rely on metadata and review text rather than full primary book text; moving toward primary-source analysis requires open-text corpora or licensing.
- Findings about predictability and feature importance are user-specific and high-dimensional—no silver-bullet predictor exists.
What's next (from Hannes)
- Shift focus to primary sources: analyzing book contents directly (public domain works, fan fiction, scripts) to get closer to author-level signals and reduce the noisy intermediary of reader reviews.
- Scale and refine the Isaac-style pipelines, create public annotated book databases, and continue evaluating LLM annotation quality.
Notable quotes
- “Most books … just fall into the same corridor of ratings. The differences between them are, to use a loaded term, not significant.”
- “I prefer content-based recommenders for books — not because they're strictly better at raw accuracy, but because they help me understand myself.”
Where to follow / more info
- Hannes Rosenbusch (fiction & research): HannesRosenbusch.com
- Affiliation: University of Amsterdam (Department of Psychological Methods)
- Code/papers: Hannes indicates code is public for the Isaac pipeline—look for links in the episode show notes or his website.
