Summary of "Your LLM issues are really data issues", a podcast episode by The Stack Overflow Podcast

Overview of Your LLM issues are really data issues

In this episode of the Stack Overflow Podcast, Ryan Donovan talks with Harsha Chintalapani, co-founder and CTO of Collate and co-creator of OpenMetadata, about why many “AI problems” are actually data problems. The discussion focuses on production data, especially structured, real-time business data, and why LLMs struggle when organizations lack clear metadata, semantics, ownership, lineage, and data quality controls.

Harsha argues that AI does not magically solve messy data environments. Instead, LLMs often amplify existing organizational confusion: unclear definitions, duplicate tables, stale pipelines, and undocumented business metrics. The solution, he says, is to treat metadata as the foundation for AI and analytics.

Key Takeaways

The real challenge is no longer data processing; it is data understanding.
- Distributed systems, cloud infrastructure, and storage/compute scaling are relatively mature.
- The harder problem is knowing what the data means, who owns it, and whether it can be trusted.
LLMs struggle without semantic context.
- If an organization has not clearly defined terms like “customer,” “ARR,” or “customer health,” an LLM has no reliable way to infer them.
- Natural language interfaces to data only work well when the underlying business concepts are documented and connected to the right tables and metrics.
These problems affect companies of all sizes.
- Uber-scale data makes the problem obvious, but even smaller companies quickly run into issues once they have more than a handful of people and tables.
- As soon as a company has a data team, dashboards, or ML models, governance and metadata become necessary.
Metadata is the starting point.
- Harsha recommends starting with metadata rather than raw data.
- A strong metadata layer should capture:
  - table names and schemas
  - ownership
  - lineage
  - freshness
  - quality signals
  - glossary terms and business definitions
Data culture should borrow from engineering culture.
- Code has owners, runbooks, on-call rotations, SLAs, and production processes.
- Data often lacks the same rigor, even when it is just as critical to the business.
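
As a rough illustration of the metadata layer listed above, a single catalog entry could be modeled as a small record. This is a minimal sketch and not OpenMetadata's actual schema: the class name `TableMetadata`, its field names, and the sample table are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class TableMetadata:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    schema: Dict[str, str]                     # column name -> column type
    owner: Optional[str] = None                # team accountable for the table
    upstream: List[str] = field(default_factory=list)   # lineage: source tables
    last_updated: Optional[datetime] = None    # freshness signal
    quality_checks_passed: Optional[bool] = None         # quality signal
    glossary_terms: List[str] = field(default_factory=list)

    def is_fresh(self, max_age: timedelta, now: datetime) -> bool:
        """Return True if the table was updated within max_age of `now`."""
        if self.last_updated is None:
            return False  # unknown freshness is treated as stale
        return now - self.last_updated <= max_age


# Made-up example entry.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
orders = TableMetadata(
    name="analytics.orders",
    schema={"order_id": "bigint", "customer_id": "bigint", "amount": "decimal"},
    owner="data-platform",
    upstream=["raw.orders"],
    last_updated=datetime(2024, 1, 1, 12, tzinfo=timezone.utc),
    quality_checks_passed=True,
    glossary_terms=["customer", "revenue"],
)
```

With a record like this, questions such as "who owns this table?" or "was it refreshed in the last day?" become lookups rather than tribal knowledge, for example `orders.is_fresh(timedelta(days=1), now)`.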

Why LLMs Fail on Structured Data

Lack of shared definitions

A central issue is that business terms are often tacit knowledge rather than explicit documentation. For example:

- Marketing, sales, and engineering may all define “customer” differently.
- Different teams may interpret “customer health” using different signals.
- Without a glossary, the LLM cannot know which interpretation to use.
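
The ambiguity described above can be made concrete with a toy glossary lookup. This is a sketch under stated assumptions: the `GLOSSARY` contents, the team names, and the `resolve_term` helper are invented for illustration and do not come from the episode or from any real product.

```python
from typing import Optional

# Hypothetical glossary: each term maps to the definitions different teams use.
GLOSSARY = {
    "customer": {
        "marketing": "anyone who signed up for the newsletter",
        "sales": "an account with a signed contract",
        "engineering": "a row in the accounts table",
    },
    "arr": {
        "finance": "annualized value of active recurring contracts",
    },
}


def resolve_term(term: str, team: Optional[str] = None) -> str:
    """Return one definition, or raise if the term is undefined or ambiguous."""
    definitions = GLOSSARY.get(term.lower())
    if not definitions:
        raise KeyError(f"no glossary entry for {term!r}")
    if team is not None:
        return definitions[team]
    if len(definitions) > 1:
        # More than one team defines the term and no team was specified,
        # so refuse to guess instead of silently picking one.
        raise ValueError(f"{term!r} is ambiguous: defined by {sorted(definitions)}")
    return next(iter(definitions.values()))
```

Here `resolve_term("ARR")` succeeds because only one definition exists, while `resolve_term("customer")` raises until the caller says whose definition applies. An LLM querying the data faces the same choice, but without a glossary it has nothing to raise an error against.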

Poor discovery and trust

Even if the right table exists, users still face several hurdles:

- finding the right dataset
- identifying duplicate or stale tables
- verifying access permissions
- confirming freshness and quality
- understanding column meanings

LLMs don’t solve those problems by themselves. They need a metadata backbone to work from.
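
One of the hurdles above, spotting duplicate tables, can be approximated from metadata alone. The sketch below groups tables whose column sets are identical. It is a deliberately crude heuristic, and the function name `find_possible_duplicates` and the sample table names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_possible_duplicates(tables: Dict[str, List[str]]) -> List[List[str]]:
    """Group table names whose column sets match (case-insensitive)."""
    groups = defaultdict(list)
    for name, columns in tables.items():
        # The fingerprint ignores column order and letter case.
        fingerprint = frozenset(column.lower() for column in columns)
        groups[fingerprint].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Made-up catalog: table name -> column names.
catalog = {
    "analytics.orders": ["order_id", "amount"],
    "tmp.orders_copy": ["ORDER_ID", "amount"],
    "analytics.users": ["user_id"],
}
duplicates = find_possible_duplicates(catalog)
# duplicates == [["analytics.orders", "tmp.orders_copy"]]
```

Matching column sets only flags candidates. Deciding which copy is canonical still depends on ownership, lineage, and freshness metadata.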

Data quality issues become AI issues

If the underlying data is stale, incomplete, or mislabeled, then the AI output will also be wrong. Harsha emphasizes that:

- bad data leads to bad dashboards
- bad dashboards lead to bad decisions
- bad decisions lead to costly business mistakes

What OpenMetadata / Collate Is Trying to Solve

Harsha describes OpenMetadata as a way to make data more understandable and usable for humans and AI alike.

Core capabilities discussed

- Automated metadata collection from tools like Snowflake, Hadoop, MySQL, Kafka, and BI systems
- Knowledge graph / RDF-based relationships between assets
- Glossary and metrics catalog for business definitions
- Lineage tracking at column and dataset level
- Data quality and observability signals
- Ownership and tiering, so critical data gets treated like critical services

The bigger idea

Preparing data for AI should not be a purely manual effort; AI should help make data ready for itself.

That means using agents and automation to:

- document data
- classify sensitive fields
- infer relationships
- detect stale or incomplete datasets
- surface the right table or metric for a user query
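
To give a feel for one of these tasks, here is a very simple take on classifying sensitive fields from column names alone. It is a sketch, not how OpenMetadata or Collate does it: the patterns, tags, and the `classify_columns` function are assumptions for illustration, and a realistic classifier would also sample column values and would need to handle false positives.

```python
import re
from typing import Dict, Iterable

# Hypothetical name patterns; name matching alone will produce false positives.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "national_id": re.compile(r"ssn|passport|national[-_]?id", re.IGNORECASE),
}


def classify_columns(columns: Iterable[str]) -> Dict[str, str]:
    """Map each column whose name matches a pattern to its sensitivity tag."""
    tags = {}
    for column in columns:
        for tag, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(column):
                tags[column] = tag
                break  # first matching tag wins
    return tags


tags = classify_columns(["user_email", "Phone_Number", "order_id", "ssn"])
# tags == {"user_email": "email", "Phone_Number": "phone", "ssn": "national_id"}
```

Even a heuristic this small shows why automation matters at scale: the same rule runs over every table in a catalog, whereas manual classification has to be repeated column by column.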

Practical Recommendations

1. Start with metadata early

As soon as you have a data team, you should begin collecting and organizing metadata.

2. Define business semantics

Create a shared glossary for important business terms:

- customer
- revenue
- ARR
- customer health
- conversion
- active user

3. Track ownership and lineage

Every critical table or dashboard should have:

- an owner
- a source
- a lineage chain
- a freshness/quality signal
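
The checklist above lends itself to an automated audit. The sketch below reports which of the four items are missing for each asset. The dictionary keys, the sample assets, and the `missing_fields` helper are hypothetical and stand in for whatever a real catalog would expose.

```python
from typing import Dict, List

# The four items from the checklist, as hypothetical metadata keys.
REQUIRED_FIELDS = ("owner", "source", "lineage", "freshness")


def missing_fields(asset: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty for one asset."""
    return [name for name in REQUIRED_FIELDS if not asset.get(name)]


# Made-up assets for illustration.
assets = {
    "dashboard.exec_revenue": {
        "owner": "finance-analytics",
        "source": "analytics.revenue_daily",
        "lineage": ["staging.orders", "raw.orders"],
        "freshness": "2024-01-01T12:00:00Z",
    },
    "tmp.orders_copy": {"owner": None, "source": "analytics.orders"},
}

audit = {name: missing_fields(meta) for name, meta in assets.items()}
# audit == {"dashboard.exec_revenue": [],
#           "tmp.orders_copy": ["owner", "lineage", "freshness"]}
```

Running a check like this on a schedule turns "every critical table should have an owner" from a policy statement into something that can fail loudly.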

4. Prioritize the most important dashboards and pipelines

Harsha suggests working backwards from the dashboards executives and business teams rely on most, then tracing lineage upstream to protect the most critical data paths.
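
Working backwards from a dashboard amounts to walking a lineage graph upstream. The sketch below does this with a breadth-first traversal. The `UPSTREAM` mapping, the asset names, and the `trace_upstream` function are assumptions made for illustration, not output from any real lineage tool.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "dashboard.exec_revenue": ["analytics.revenue_daily"],
    "analytics.revenue_daily": ["staging.orders", "staging.refunds"],
    "staging.orders": ["raw.orders"],
    "staging.refunds": ["raw.refunds"],
}


def trace_upstream(asset: str, upstream: Dict[str, List[str]]) -> List[str]:
    """Return every asset the given one depends on, nearest dependencies first."""
    seen = set()
    ordered = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


critical_path = trace_upstream("dashboard.exec_revenue", UPSTREAM)
# critical_path == ["analytics.revenue_daily", "staging.orders",
#                   "staging.refunds", "raw.orders", "raw.refunds"]
```

Everything in `critical_path` feeds the executive dashboard, so those are the tables and pipelines to cover first with owners, quality checks, and alerts.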

5. Treat data like production software

If a dashboard or pipeline drives business decisions, it should have:

- documentation
- quality checks
- observability
- on-call responsibility
- escalation paths

Notable Examples

Uber surfaced many of the challenges:
- duplicate tables
- stale or experimental datasets being mistaken for canonical ones
- inconsistent definitions across teams
- GDPR classification done manually at scale
- missed pipeline failures affecting driver payments and business reporting
A tips pipeline failure showed how a data issue can turn into a real-world business and accounting problem.
Wrong audience targeting for ads/incentives illustrated how data mistakes can waste significant money in ML-driven systems.

Final Takeaway

The episode’s central message is that AI is only as good as the data ecosystem behind it. LLMs can help users ask questions and discover information faster, but they cannot fix unclear definitions, poor ownership, stale pipelines, or missing metadata.

If companies want AI to be reliable in production, they need to invest in:

- metadata
- semantics
- lineage
- governance
- data quality
- ownership

In other words: the LLM issue is usually a data issue first.
