Overview of Book Ratings and Recommendations
This episode of Data Skeptic (host Kyle Polich) features Hannes Rosenbusch (University of Amsterdam), a psychologist/data scientist and fiction writer, discussing his papers that analyze Goodreads ratings and explore personalized, interpretable book-recommendation methods. The core question, "Are some books better than others?", is examined empirically using Goodreads data, written reviews, metadata, and experiments that combine human introspection with large language models (LLMs).
Key findings
- Variance decomposition of Goodreads ratings shows reader effects dominate: differences in how individual readers rate books explain far more variance than differences among professionally published books.
- Book-level rating differences are often small; many books fall in a similar "corridor" of ratings unless they have very large numbers of ratings.
- Ratings stabilize only after many reviews (often thousands to tens of thousands). Even stable global differences (e.g., 4.1 vs 4.4) are weak predictors of any particular individual's enjoyment.
- Experienced reviewers converge more than casual reviewers. Two possible explanations:
  - Social/influencer effects (experienced users align with prevailing averages).
  - Experienced readers apply more consistent rubrics (narrative structure, characterization, craft), producing less noisy ratings.
- No major variance difference between fiction and nonfiction ratings—both are noisy and idiosyncratic in aggregate.
- Written-review content tends to reflect reviewer tendencies more than book-specific consensus: what readers complain about or notice is often personal rather than universally shared.
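The variance-decomposition finding can be sketched with a small simulation. Everything below is illustrative: the effect sizes are invented to mirror the reported pattern (reader effects dominating book effects), and the method-of-moments partition assumes a fully crossed reader-by-book design, which real Goodreads data is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n_readers, n_books = 500, 100

# Invented effect sizes, chosen so that reader effects dominate
# book effects, mirroring the reported finding.
reader_eff = rng.normal(0.0, 0.6, n_readers)   # harsh vs. generous raters
book_eff = rng.normal(0.0, 0.15, n_books)      # small "quality" differences
noise = rng.normal(0.0, 0.5, (n_readers, n_books))
ratings = 3.8 + reader_eff[:, None] + book_eff[None, :] + noise

# Method-of-moments partition for a fully crossed design:
# subtract row/column means to estimate residual noise, then back
# out the reader and book variance components.
grand = ratings.mean()
row_means = ratings.mean(axis=1)               # one mean per reader
col_means = ratings.mean(axis=0)               # one mean per book
resid = ratings - row_means[:, None] - col_means[None, :] + grand
var_noise = resid.var()
var_reader = row_means.var() - var_noise / n_books
var_book = col_means.var() - var_noise / n_readers

total = var_reader + var_book + var_noise
print(f"reader share of variance: {var_reader / total:.2f}")
print(f"book share of variance:   {var_book / total:.2f}")
```

Run on this synthetic data, the reader component comes out roughly an order of magnitude larger than the book component, which is the qualitative shape of the paper's result.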
Data, methods, and experiments
- Data: Goodreads ratings and written reviews plus standard metadata (title, description, page count, etc.). Limited by lack of read-through/read-time metrics on Goodreads.
- Analysis approach:
  - Variance partitioning: disentangle variance attributable to books vs. readers.
  - Sentiment and content analysis of written reviews to test whether textual complaints/claims converge across readers.
  - Correlational analyses for features such as page count, genre, protagonist traits—predictive but generally weak at the individual level.
  - Stability checks: rating averages require large sample sizes to stabilize; population drift (e.g., political re-evaluations of classics) affects longitudinal ratings.
- Self-study of the Isaac method (Hannes plus one friend): a personalized model using many LLM-annotated content features; held-out correlation with true ratings ≈ 0.4 (moderate).
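The stability claim is easy to reproduce in a toy simulation: with per-rating noise on the order of one star, a book's running average wanders over the first few hundred ratings, and pinning it down to within ±0.05 stars takes well over a thousand. The numbers below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# One hypothetical book: true mean 4.1, individual 1-5 star ratings
# with roughly one star of spread (typical for such scales).
true_mean, rating_sd = 4.1, 1.0
stars = np.clip(np.round(rng.normal(true_mean, rating_sd, 50_000)), 1, 5)

# The running average drifts until the sample is large.
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} ratings: average = {stars[:n].mean():.2f}")

# Sample size for a 95% CI half-width of 0.05 stars: n = (1.96 * sd / 0.05)^2
n_needed = (1.96 * rating_sd / 0.05) ** 2
print(f"~{n_needed:.0f} ratings needed for +/-0.05 precision")
```

Even once the average is stable, the episode's point stands: a stable 4.1 vs. 4.4 gap says little about any single reader's enjoyment, because reader-level variance dwarfs it.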
Theoretical and practical implications for recommender systems
- Global average ratings (the intercept) are poor predictors for individual taste.
- Collaborative filtering (user–item co-occurrence) may be effective for generic recommendations but offers little interpretability or support for self-understanding.
- Content-based recommendation is favored by Hannes for books: characterizing books by interpretable features (themes, protagonist attributes, pacing, content flags) helps users understand why a recommendation is made and supports personal taste discovery.
- LLMs are useful as scalable annotators and hypothesis generators for content features, enabling richer content-based features at scale, but human-in-the-loop validation remains necessary.
Isaac method (Introspection + Support: AI Annotation + Curation)
- Workflow:
  - User supplies their rated books (e.g., Goodreads history).
  - System generates many hypothesized predictors (genres, themes, protagonist features, pacing, page count, common review mentions).
  - LLMs automatically annotate book content/metadata with these predictors; annotations are quality-checked on samples by humans.
  - A model is trained per user to predict ratings and to surface top predictors (feature importance).
- Benefits:
  - Produces interpretable, portable features about why a user likes certain books.
  - Helps users gain self-knowledge about reading preferences (e.g., Hannes found survival/dystopian themes strongly predictive for him).
  - Performs moderately well for individuals (reported correlation ~0.4 in small self-studies).
- Limitations:
  - Requires either public-domain/text-accessible books or external sources (summaries, reviews); primary-source content access for commercial books is often restricted.
  - LLM annotation quality needs validation; current studies are small-scale.
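A minimal sketch of the per-user modeling step, with all data invented: in the real pipeline the feature annotations come from LLMs and there are many more of them (pacing, page count, review-mention features, and so on); here three hand-coded binary features and ordinary least squares stand in for the whole thing.

```python
import numpy as np

# Hypothetical stand-ins for LLM-annotated content features.
features = ["survival_theme", "dystopian", "romance"]
X = np.array([
    [1, 1, 0],   # survival + dystopian
    [1, 0, 0],   # survival only
    [0, 1, 0],   # dystopian only
    [0, 0, 1],   # romance
    [0, 0, 1],   # romance
    [1, 1, 0],   # survival + dystopian
], dtype=float)
y = np.array([5, 4, 4, 2, 3, 5], dtype=float)  # one user's star ratings

# Per-user linear model: least squares with an intercept column; the
# fitted coefficients double as interpretable feature importances.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

importance = dict(zip(features, coef[1:]))
for name, w in sorted(importance.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {w:+.2f}")
```

For this invented user the model surfaces survival and dystopian themes as positive predictors and romance as a negative one, which is exactly the kind of "why" a content-based recommender can show back to the reader.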
LLMs in this research and creative workflow
- Uses LLMs as research assistants for:
  - Large-scale annotation of books, summaries, and review content.
  - Hypothesis generation about predictive features.
  - Rapid scripting/coding help (Hannes reports not writing code manually recently).
  - Summarization of papers and large texts.
- Limitations and human role:
  - Hannes uses LLMs for editorial help (finding words, background research) but resists using them to write creative fiction — creative identity and voice are central to authorship.
  - LLMs are not yet trusted as sole editors for fiction; human editors are still preferred for nuanced feedback.
  - Human-in-the-loop validation and spot-checking of LLM outputs are standard practice.
Practical takeaways / recommendations
- For readers:
  - Treat global star averages as a coarse market signal—useful for gauging mainstream appeal but not determinative of your own taste.
  - Try forming a personal opinion before looking at the average rating; writing your own notes and then comparing them to the platform consensus helps reveal where your taste diverges.
- For authors:
  - Single-feature optimizations (e.g., page count, genre) rarely produce strong signals across readers; targeting a specific readership and using interpretable features may be more productive.
- For platforms/recommender designers:
  - Invest in content-based features and surface interpretable reasons for recommendations (e.g., “You like survival/dystopian themes; this book is similar”).
  - Add or expose richer signals: read-through rates, time-on-book, finer-grained content flags (swearing, sex, violence) — these are actionable and user-personalizable.
  - Use NLP/LLM pipelines to extract review-derived flags and themes, but keep human validation and transparency in the loop.
  - Consider product features that characterize a user's rating behavior (e.g., a tendency to give high or low scores) and show how a book matches those tendencies.
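The last point can be made concrete in a few lines: characterize the user's own rating distribution, then re-express a book's platform average in that user's units. The platform-wide numbers here are invented, and a real system would shrink the estimates for users with few ratings.

```python
import statistics

user_ratings = [5, 4, 5, 3, 5, 4, 5]      # a generous rater (invented data)
platform_mean, platform_sd = 3.9, 1.0     # illustrative platform-wide stats

user_mean = statistics.mean(user_ratings)
user_sd = statistics.stdev(user_ratings)

def expected_for_user(book_avg: float) -> float:
    """Map a book's platform average onto this user's personal scale."""
    z = (book_avg - platform_mean) / platform_sd
    return user_mean + z * user_sd

print(f"user mean {user_mean:.2f} vs platform mean {platform_mean:.2f}")
print(f"a 4.2-average book maps to ~{expected_for_user(4.2):.2f} for this user")
```

A platform could surface exactly this kind of translation ("for a rater like you, this book would typically land around 4.7") instead of the raw global average.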
Limitations & caveats
- Goodreads dataset lacks direct consumption signals (read-through, time-to-complete), which would be highly informative.
- Many analyses rely on metadata and review text rather than full primary book text; moving toward primary-source analysis requires open-text corpora or licensing.
- Findings about predictability and feature importance are user-specific and high-dimensional—no silver-bullet predictor exists.
What's next (from Hannes)
- Shift focus to primary sources: analyzing book contents directly (public domain works, fan fiction, scripts) to get closer to author-level signals and reduce the noisy intermediary of reader reviews.
- Scale and refine the Isaac-style pipelines, create public annotated book databases, and continue evaluating LLM annotation quality.
Notable quotes
- “Most books … just fall into the same corridor of ratings. The differences between them are, to use a loaded term, not significant.”
- “I prefer content-based recommenders for books — not because they're strictly better at raw accuracy, but because they help me understand myself.”
Where to follow / more info
- Hannes Rosenbusch (fiction & research): HannesRosenbusch.com
- Affiliation: University of Amsterdam (Department of Psychological Methods)
- Code/papers: Hannes indicates code is public for the Isaac pipeline—look for links in the episode show notes or his website.
