Data Skeptic

Technology
Science

by Kyle Polich

The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.

7 episodes summarized

Episodes

Healthy Friction in Job Recommender Systems

In this episode, host Kyle Polich speaks with Roan Schellingerhout, a fourth-year PhD student at Maastricht University, about explainable multi-stakeholder recommender systems for job recruitment. Roan discusses his research on creating AI-powered job matching systems that balance the needs of multiple stakeholders: job seekers, recruiters, HR professionals, and companies. The conversation explores different types of explanations for job recommendations, including textual, bar chart, and graph-based formats, with findings showing that lay users strongly prefer simple textual explanations over more technical visualizations. Roan shares insights from his "healthy friction" study, which tested whether users could distinguish between real AI-generated explanations and randomly generated ones, revealing that participants often used explanations as information sources rather than decision-making tools.

The discussion delves into the technical architecture behind these systems, including the use of knowledge graphs built from tabular data, inference rules, and large language models to generate human-friendly explanations. Roan explains how his research aims to open the black box of recommender systems, making them more transparent and trustworthy for non-technical users. Looking forward, he discusses ongoing work on automated knowledge graph construction from resumes and job listings, research into fairness considerations around gender and location, and plans for real-world testing with actual job seekers. The episode concludes with Roan's vision for the future: AI systems that support rather than replace human recruiters, making the job search process less grueling while maintaining the essential human judgment that recruitment requires.

February 2, 2026 · 26:37
Fairness in PCA-Based Recommenders

In this episode, we explore recommender systems and algorithmic fairness with David Liu, Assistant Research Professor at Cornell University's Center for Data Science for Enterprise and Society. David shares insights from his research on how machine learning models can inadvertently create unfairness, particularly for minority and niche user groups, even without any malicious intent. We examine his work on Principal Component Analysis (PCA) and collaborative filtering, and why these fundamental techniques sometimes fail to serve all users equally.

David introduces the concept of "power niche users": highly active users with specialized interests who generate valuable data that can benefit the entire platform. We discuss his paper "When Collaborative Filtering Is Not Collaborative," which reveals how PCA can over-specialize on popular content, neglecting niche items and even failing to properly recommend popular artists to new potential fans. David presents solutions based on item-weighted PCA and thoughtful data upweighting strategies that can improve both fairness and performance simultaneously, challenging the common assumption that these goals must be in tension. The conversation spans from theoretical insights to practical applications at companies like Meta, and looks ahead to the future of personalized recommendations.

January 26, 2026 · 49:59
Video Recommendations in Industry

In this episode, Kyle Polich sits down with Cory Zechmann, a content curator working in streaming television with 16 years of experience running the music blog "Silence Nogood." They explore the intersection of human curation and machine learning in content discovery, discussing the concept of "algatorial" curation, where algorithms and editorial expertise work together. Key topics include the cold start problem, why every metric is just a "proxy metric" for what users actually want, the challenge of filter bubbles, and the importance of balancing familiarity with discovery. Cory shares insights on why TikTok's algorithm works so well (clean data and massive interaction volume), the crucial role of homepage curation, and how human curators help by contextualizing content, cleaning data, and identifying positive feedback loops that algorithms might miss.

The conversation covers practical challenges like measuring "surprise and delight," the content deluge created by democratized creation tools, and why trust in tech companies is essential for better personalization. Cory emphasizes that discovery is "a good type of friction" and explains how the CODE framework (Capture, Organize, Distill, Express, plus Analysis) guides professional curation work. Looking to the future, they discuss the need for systems thinking that creates narrative connections between content, the potential for conversational AI to help users articulate preferences, and why diverse perspectives beyond engineering are crucial for building effective discovery systems. Resources mentioned include the newsletter "Top Information Retrieval Papers of the Week" and NotebookLM for synthesizing research.

December 26, 2025 · 38:16
Designing Recommender Systems for Digital Humanities

In this episode of Data Skeptic, we explore the intersection of recommender systems and digital humanities with guest Florian Atzenhofer-Baumgartner, a PhD student at Graz University of Technology. Florian is working on Monasterium.net (http://monasterium.net/), Europe's largest online collection of historical charters, containing millions of medieval and early modern documents from across the continent. The conversation delves into why traditional recommender systems fall short in the digital humanities space, where users range from expert historians and genealogists to art historians and linguists, each with unique research needs and information-seeking behaviors.

Florian explains the technical challenges of building a recommender system for cultural heritage materials, including dealing with sparse user-item interaction matrices, the cold start problem, and the need for multi-modal similarity approaches that can handle text, images, metadata, and historical context. The platform leverages various embedding techniques and gives users control over weighting different modalities, whether they're searching based on text similarity, visual imagery, or diplomatic features like issuers and receivers. Key insights from Florian's research include the importance of balancing serendipity with utility, representing the whole collection to prevent bias, and keeping the system explainable while maintaining effectiveness.

The discussion also touches on unique evaluation challenges in non-commercial recommendation contexts, including Florian's "research funnel" framework that considers discovery, interaction, integration, and impact stages. Looking ahead, Florian envisions recommendation systems becoming standard tools for exploration across digital archives and cultural heritage repositories throughout Europe, potentially transforming how researchers discover and engage with historical materials. The new version of Monasterium.net, set to launch with enhanced semantic search and recommendation features, represents an important step toward making cultural heritage more accessible and discoverable for everyone.
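To make the idea of user-controlled modality weighting concrete, here is a minimal sketch. It is not Monasterium.net's implementation: the function names, the two modalities, and the toy two-dimensional embeddings are all assumptions chosen for illustration. It combines per-modality cosine similarities into one score using weights the user supplies.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def combined_similarity(query: dict, candidate: dict, weights: dict) -> float:
    """Weighted average of per-modality cosine similarities.

    `weights` maps a modality name (hypothetical: "text", "image") to a
    non-negative weight chosen by the user.
    """
    total = sum(weights.values())
    score = sum(w * cosine(query[m], candidate[m]) for m, w in weights.items())
    return score / total


# Toy embeddings: charter A matches the query on text, charter B on imagery.
query = {"text": np.array([1.0, 0.0]), "image": np.array([1.0, 0.0])}
charter_a = {"text": np.array([1.0, 0.0]), "image": np.array([0.0, 1.0])}
charter_b = {"text": np.array([0.0, 1.0]), "image": np.array([1.0, 0.0])}

text_heavy = {"text": 0.8, "image": 0.2}
image_heavy = {"text": 0.2, "image": 0.8}

a_text_heavy = combined_similarity(query, charter_a, text_heavy)
b_text_heavy = combined_similarity(query, charter_b, text_heavy)
a_image_heavy = combined_similarity(query, charter_a, image_heavy)
b_image_heavy = combined_similarity(query, charter_b, image_heavy)
```

With text-heavy weights charter A ranks first, and with image-heavy weights charter B does, so the same index can serve users with different research questions.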

November 23, 2025 · 36:48
DataRec Library for Reproducible in Recommend Systems

In this episode of Data Skeptic's Recommender Systems series, host Kyle Polich explores DataRec, a new Python library designed to bring reproducibility and standardization to recommender systems research. Guest Alberto Carlo Mario Mancino, a postdoctoral researcher at Politecnico di Bari, Italy, discusses the challenges of dataset management in recommendation research, from version control issues to preprocessing inconsistencies, and how DataRec provides automated downloads, checksum verification, and standardized filtering strategies for popular datasets like MovieLens, Last.fm, and Amazon reviews.

The conversation covers Alberto's research journey through knowledge graphs, graph-based recommenders, privacy considerations, and recommendation novelty. He explains why small modifications in datasets can significantly impact research outcomes, the importance of offline evaluation, and DataRec's vision as a lightweight library that integrates with existing frameworks rather than replacing them. Whether you're benchmarking new algorithms or exploring recommendation techniques, this episode offers practical insights into one of the most critical yet overlooked aspects of reproducible ML research.
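As an illustration of what checksum verification buys a researcher, the sketch below uses only Python's standard library. It does not use or reproduce DataRec's API: the helper names and the throwaway `ratings.csv` file are hypothetical. It shows how a recorded SHA-256 digest detects even a one-character change in a dataset file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dataset(path: Path, expected_sha256: str) -> bool:
    """True when the file on disk matches the recorded checksum."""
    return sha256_of(path) == expected_sha256.lower()


# Demo with a throwaway file standing in for a downloaded dataset.
with tempfile.TemporaryDirectory() as tmp:
    ratings = Path(tmp) / "ratings.csv"
    ratings.write_bytes(b"user,item,rating\n1,10,4.0\n")
    recorded = sha256_of(ratings)  # digest published alongside the data
    unchanged_ok = verify_dataset(ratings, recorded)

    # A single edited rating is enough to change the digest.
    ratings.write_bytes(b"user,item,rating\n1,10,5.0\n")
    modified_ok = verify_dataset(ratings, recorded)
```

Here `unchanged_ok` is true and `modified_ok` is false, which is the property that lets two research groups confirm they are evaluating on byte-identical data.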

November 13, 2025 · 32:48
Shilling Attacks on Recommender Systems

In this episode of Data Skeptic's Recommender Systems series, Kyle sits down with Aditya Chichani, a senior machine learning engineer at Walmart, to explore the darker side of recommendation algorithms. The conversation centers on shilling attacks, a form of manipulation in which malicious actors create multiple fake profiles to game recommender systems, either to promote specific items or to sabotage competitors. Aditya, who researched these attacks during his undergraduate studies at SPIT before completing his master's in computer science with a data science specialization at UC Berkeley, explains how these vulnerabilities emerge particularly in collaborative filtering systems. From promoting a friend's ska band on Spotify to inflating product ratings on e-commerce platforms, shilling attacks represent a significant threat in an industry where approximately 4% of reviews are fake, translating to $800 billion in annual sales in the US alone.

The discussion delves deep into collaborative filtering, explaining both user-user and item-item approaches that create similarity matrices to predict user preferences. However, these systems face various shilling attacks of increasing sophistication: random attacks use minimal information with average ratings, while segmented attacks strategically target popular items (like Taylor Swift albums) to build credibility before promoting target items. Bandwagon attacks focus on highly popular items to connect with genuine users, and average attacks leverage item rating knowledge to appear authentic. User-user collaborative filtering proves particularly vulnerable, requiring as few as 500 fake profiles to impact recommendations, while item-item filtering demands significantly more resources.

Aditya addresses detection through machine learning techniques that analyze behavioral patterns using methods like PCA to identify profiles with unusually high correlation and suspicious rating consistency. However, this remains an evolving challenge as attackers adapt strategies, now using large language models to generate more authentic-seeming fake reviews. His research with the MovieLens dataset tested detection algorithms against synthetic attacks, highlighting how these concerns extend to modern e-commerce systems. While companies rarely share attack and detection data publicly to avoid giving attackers advantages, academic research continues advancing both offensive and defensive strategies in recommender systems security.
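The mechanics of an average attack on user-user collaborative filtering can be sketched on toy data. This is an illustrative simplification, not the setup from Aditya's research: the five-user rating matrix, the cosine similarity over co-rated items, and the choice of three fake profiles are all assumptions. Each fake profile rates the filler items at their observed item means to look ordinary, then gives the target item the maximum rating.

```python
import numpy as np


def cosine_on_corated(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity restricted to items both users rated (0 = unrated)."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a_m, b_m = a[mask], b[mask]
    return float(a_m @ b_m / (np.linalg.norm(a_m) * np.linalg.norm(b_m)))


def predict(ratings: np.ndarray, user: int, item: int) -> float:
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue
        sim = cosine_on_corated(ratings[user], ratings[other])
        num += sim * ratings[other, item]
        den += abs(sim)
    return num / den if den else 0.0


# Rows are genuine users, columns are items; item 4 is the attack target.
genuine = np.array([
    [5, 4, 1, 2, 0],   # user 0 has not rated the target item
    [5, 5, 1, 1, 1],
    [4, 5, 2, 1, 2],
    [1, 2, 5, 4, 1],
    [2, 1, 4, 5, 2],
], dtype=float)
target_item, victim = 4, 0

# Average attack: filler ratings equal each item's mean observed rating.
item_means = genuine.sum(axis=0) / (genuine > 0).sum(axis=0)
fake_profile = item_means.copy()
fake_profile[target_item] = 5.0  # push the target item
attacked = np.vstack([genuine, np.tile(fake_profile, (3, 1))])

before = predict(genuine, victim, target_item)
after = predict(attacked, victim, target_item)
```

Because the fake profiles have positive similarity to the victim and all rate the target at 5, `after` is strictly higher than `before`, which cannot exceed 2 on this data. Detection methods of the kind discussed in the episode look for exactly this signature: several profiles that are nearly identical to one another.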

November 5, 2025 · 34:48
Interpretable Real Estate Recommendations

In this episode of Data Skeptic's Recommender Systems series, host Kyle Polich interviews Dr. Kunal Mukherjee, a postdoctoral research associate at Virginia Tech, about the paper "Z-REx: Human-Interpretable GNN Explanations for Real Estate...

September 22, 2025 · 32:57