Overview of Disentanglement and Interpretability in Recommender Systems (Data Skeptic)
This episode of Data Skeptic features Ervin Dervishai (3rd‑year PhD student, University of Copenhagen) discussing his survey and reproducibility study on disentangled representation learning in recommender systems. The work investigates whether disentangled latent representations (where different latent dimensions correspond to independent factors) actually yield better interpretability and/or better recommendation performance, and highlights reproducibility challenges and practical implications for researchers and practitioners.
Key points and main takeaways
- Disentanglement aims to separate independent factors of variation in learned representations (e.g., size vs. price for a t‑shirt), which intuitively should improve interpretability and allow controlled perturbations.
- Ervin and co‑authors conducted a literature survey and reproducibility study: they collected models/datasets used in prior disentanglement work in recommender systems and evaluated them quantitatively.
- Metrics used:
- Disentanglement literature metrics: disentanglement and completeness.
- Interpretability techniques: LIME and SHAP (used to derive explainability scores).
- Main findings:
- Strong positive correlation between disentanglement and interpretability (i.e., more disentangled embeddings tend to be more explainable).
- No consistent or reliable correlation between disentanglement and recommendation effectiveness/accuracy; disentanglement alone did not guarantee better recommendation performance.
- Interpretable/disentangled representations can be useful for user-facing controls (e.g., let users filter by specific latent attributes) and build trust, even if they sometimes come at a slight performance cost.
- Reproducibility is a major issue: prior works often reported qualitative evaluations, omitted exact hyperparameters, code, and data splits, and sometimes lacked ground truth factors needed for disentanglement evaluation.
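The disentanglement and completeness metrics mentioned above come from the importance-matrix style of evaluation used in the disentanglement literature (the DCI framework). A minimal numpy sketch, assuming the importance matrix has already been obtained (e.g., from feature importances of a regressor predicting each ground-truth factor from the latent codes):

```python
import numpy as np

def dci_scores(importance):
    """Disentanglement/completeness from a (codes x factors) importance
    matrix, DCI-style. Rows are latent dimensions, columns are factors."""
    R = np.abs(importance) + 1e-12
    # Disentanglement: each code should matter for only one factor.
    P = R / R.sum(axis=1, keepdims=True)            # rows sum to 1
    H_codes = -(P * np.log(P)).sum(axis=1) / np.log(R.shape[1])
    rho = R.sum(axis=1) / R.sum()                   # per-code weights
    disentanglement = float(((1.0 - H_codes) * rho).sum())
    # Completeness: each factor should be captured by a single code.
    Q = R / R.sum(axis=0, keepdims=True)            # columns sum to 1
    H_factors = -(Q * np.log(Q)).sum(axis=0) / np.log(R.shape[0])
    completeness = float((1.0 - H_factors).mean())
    return disentanglement, completeness
```

An identity-like importance matrix (each code explains exactly one factor) scores near 1 on both; a uniform matrix (every code spread over every factor) scores near 0.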
Methods and evidence (brief)
- The team reimplemented previously published disentanglement models for recommender tasks and collected the same datasets when possible.
- They computed established disentanglement/completeness metrics and interpretability scores (via LIME/SHAP) and ran correlation analyses across models and datasets.
- Multiple runs with different random seeds were averaged; nevertheless, some originally reported scores could not be replicated, likely due to missing hyperparameter details, data splits, or other omitted experimental details.
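A correlation analysis of this shape can be sketched with a simple rank correlation over per-model scores. The score arrays below are hypothetical placeholders, not numbers from the study:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assumes no ties): correlate the
    ranks of two score lists rather than the raw values."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# Hypothetical scores, one entry per (model, dataset) pair:
disent = [0.21, 0.35, 0.48, 0.62, 0.80]   # disentanglement metric
explain = [0.30, 0.41, 0.39, 0.66, 0.71]  # LIME/SHAP explainability
print(spearman(disent, explain))          # strong positive rank correlation
```

Running the same comparison against accuracy metrics (e.g., NDCG) rather than explainability scores is where the study found no reliable relationship.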
Interpretation and explanations
- Why does disentanglement correlate with interpretability but not with accuracy?
- Disentanglement acts like an inductive prior / regularizer: it constrains the model to build structured, interpretable factors, which can reduce model flexibility and sometimes slightly hurt raw predictive performance.
- Prior claims of performance gains may reflect limited datasets, qualitative claims, or unreported experimental details.
- Practical trade‑off: interpretability and user trust vs. small losses in accuracy. In high‑stakes or trust‑sensitive applications, interpretability may be worth the trade.
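The "inductive prior / regularizer" view is most familiar from beta-VAE-style objectives, where a weight beta > 1 on the KL term pushes latent dimensions toward an independent prior. A minimal numpy sketch of that general idea (not the specific models from the study):

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, logvar, beta=4.0):
    """Reconstruction error plus a beta-weighted KL divergence to a
    standard-normal prior. beta > 1 pressures latent dimensions toward
    independence (disentanglement) at some cost to reconstruction."""
    recon = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + beta * kl
```

Raising beta typically trades reconstruction fidelity for more factorized latents, which mirrors the interpretability-vs-accuracy trade-off discussed here.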
Practical recommendations (for researchers & practitioners)
- For researchers:
- Report reproducible artifacts: release code, the exact hyperparameters that produced the reported results, data splits, and random seeds.
- Include quantitative disentanglement metrics (not just qualitative visualizations) and interpretability evaluations.
- When claiming performance improvements, show robust evaluations across multiple datasets and runs.
- For practitioners/product teams:
- Use disentanglement primarily for explainability, UI controls (let users tweak latent factors), and trust-building, not as a guaranteed way to boost offline accuracy.
- Consider combining disentanglement with other components (e.g., content metadata, hybrid models) if you want both interpretability and strong performance.
- When using LLMs (see below) for tasks like denoising, validate outputs carefully; LLMs can help but are still black boxes.
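One way the "UI controls" idea can look in practice: if a latent dimension is known to encode a factor such as price sensitivity, a user-facing slider can nudge that dimension and re-rank items. A hypothetical sketch (the function and the dimension assignment are illustrative, not from the episode):

```python
import numpy as np

def rerank_with_control(user_vec, item_vecs, dim, delta):
    """Nudge one (assumed disentangled) latent dimension of the user
    vector, e.g. a 'price' factor, then re-score items by dot product.
    Returns item indices ordered best-first."""
    v = user_vec.copy()
    v[dim] += delta                 # the user's slider adjustment
    scores = item_vecs @ v
    return np.argsort(-scores)
```

This only behaves sensibly if the dimension really is disentangled; with entangled factors the slider would change several item properties at once.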
Additional topics discussed
- Content‑based explanations / metadata ("Because you liked X") are practical and widely used ways to make recommendations interpretable to users.
- Large language models (LLMs): Ervin described work using LLMs to denoise user interaction histories (identify and remove noisy/outlier interactions to improve downstream recommendation training). He noted LLMs are useful but also opaque—why they flag items can be unclear.
- Career note: Ervin is interning at Amazon working on LLM model merging and related topics.
Limitations & open problems
- Lack of ground truth for latent factors in many recommender datasets makes disentanglement evaluation difficult.
- Many prior papers used qualitative evidence; robust, quantitative benchmarks and baselines for disentanglement in recommendation are still needed.
- Reproducibility gaps (missing hyperparameters, data splits, seeds, code) hinder progress and verification.
Future directions suggested
- More reproducible, quantitative work on disentanglement in recommender systems.
- Hybrid approaches that combine disentangled representations with other signals to try to recover or improve prediction performance while retaining interpretability.
- Better evaluation protocols and datasets with known factors of variation (or synthetic/annotated benchmarks) to measure disentanglement more reliably.
- Continued exploration of LLMs for auxiliary tasks (denoising, feature extraction), with attention to explainability of their outputs.
Notable quotes
- "If you can show explanations why the user is getting that specific recommendation, then they sort of believe more in the system."
- "Relying only on the disentanglement component did not provide connection to the performance of the recommender system."
Actionable next steps for interested readers
- If you’re a researcher: when publishing, include code, exact hyperparameters, seeds, and data splits; add quantitative disentanglement/interpretability metrics.
- If you’re a product person: experiment with disentangled embeddings for explainability and user controls, but A/B test the business metrics (revenue/retention) because offline accuracy gains aren’t guaranteed.
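The seeds-plus-config discipline above can be as simple as a wrapper that fixes every seed and persists the exact configuration next to the result. A minimal sketch (function names and the saved schema are illustrative):

```python
import json
import random

import numpy as np

def run_experiment(config, seed, out_path):
    """Fix all relevant seeds, run the (placeholder) experiment, and
    save config + seed + result together so the run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    # Placeholder for actual training/evaluation with `config`:
    result = {"ndcg@10": round(float(np.random.rand()), 4)}
    with open(out_path, "w") as f:
        json.dump({"config": config, "seed": seed, "result": result}, f)
    return result
```

Re-running with the same config and seed reproduces the stored result exactly, which is precisely what the unreplicable papers in the study failed to enable.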
