Overview of “Your LLM issues are really data issues”
In this episode of the Stack Overflow Podcast, Ryan Donovan talks with Harsha Chintalapani, co-founder and CTO of Collate and co-creator of OpenMetadata, about why many “AI problems” are actually data problems. The discussion focuses on production data, especially structured, real-time business data, and why LLMs struggle when organizations lack clear metadata, semantics, ownership, lineage, and data quality controls.
Harsha argues that AI does not magically solve messy data environments. Instead, LLMs often amplify existing organizational confusion: unclear definitions, duplicate tables, stale pipelines, and undocumented business metrics. The solution, he says, is to treat metadata as the foundation for AI and analytics.
Key Takeaways
- The real challenge is not data processing anymore; it’s data understanding.
  - Distributed systems, cloud infrastructure, and storage/compute scaling are relatively mature.
  - The harder problem is knowing what the data means, who owns it, and whether it can be trusted.
- LLMs struggle without semantic context.
  - If an organization has not clearly defined terms like “customer,” “ARR,” or “customer health,” an LLM has no reliable way to infer them.
  - Natural language interfaces to data only work well when the underlying business concepts are documented and connected to the right tables and metrics.
- These problems affect companies of all sizes.
  - Uber-scale data makes the problem obvious, but even smaller companies quickly run into issues once they have more than a handful of people and tables.
  - As soon as a company has a data team, dashboards, or ML models, governance and metadata become necessary.
- Metadata is the starting point.
  - Harsha recommends starting with metadata rather than raw data.
  - A strong metadata layer should capture:
    - table names and schemas
    - ownership
    - lineage
    - freshness
    - quality signals
    - glossary terms and business definitions
- Data culture should borrow from engineering culture.
  - Code has owners, runbooks, on-call rotations, SLAs, and production processes.
  - Data often lacks the same rigor, even when it is just as critical to the business.
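The metadata fields listed under "Metadata is the starting point" can be pictured as one record per table. The sketch below is only an illustration of that idea: the class name, field names, and sample values are assumptions made for this example and are not taken from OpenMetadata or the episode.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TableMetadata:
    """Hypothetical catalog record; every field name here is illustrative."""
    name: str
    schema: dict[str, str]                                # column name -> column type
    owner: Optional[str] = None                           # accountable team or person
    upstream: list[str] = field(default_factory=list)     # lineage: direct source tables
    last_updated: Optional[datetime] = None               # freshness
    quality_checks_passed: Optional[bool] = None          # quality signal
    glossary_terms: list[str] = field(default_factory=list)  # linked business terms


# Made-up example record
orders = TableMetadata(
    name="analytics.orders",
    schema={"order_id": "BIGINT", "customer_id": "BIGINT", "amount": "DECIMAL"},
    owner="data-platform",
    upstream=["raw.orders", "raw.customers"],
    last_updated=datetime(2024, 1, 15, tzinfo=timezone.utc),
    quality_checks_passed=True,
    glossary_terms=["customer", "revenue"],
)
```

Leaving most fields optional is a deliberate choice in this sketch: a record with `owner=None` makes the gap visible instead of hiding it behind a placeholder value.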
Why LLMs Fail on Structured Data
Lack of shared definitions
A central issue is that business terms are often tacit knowledge rather than explicit documentation. For example:
- Marketing, sales, and engineering may all define “customer” differently.
- Different teams may interpret “customer health” using different signals.
- Without a glossary, the LLM cannot know which interpretation to use.
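To make the glossary point concrete, here is a minimal sketch of how agreed definitions might be attached to a user question before it reaches an LLM. The `GlossaryTerm` class, the `build_context` helper, and the sample definitions are all hypothetical; the matching is a naive substring check chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    term: str
    definition: str
    owner: str
    source_table: str


# Hypothetical glossary entries; the definitions are invented for illustration.
GLOSSARY = {
    "customer": GlossaryTerm(
        term="customer",
        definition="An account with at least one paid invoice",
        owner="finance",
        source_table="analytics.customers",
    ),
    "active user": GlossaryTerm(
        term="active user",
        definition="A user with a login in the last 30 days",
        owner="product",
        source_table="analytics.active_users",
    ),
}


def build_context(question: str, glossary: dict[str, GlossaryTerm]) -> str:
    """Return the agreed definitions for glossary terms mentioned in the question."""
    lowered = question.lower()
    lines = [
        f"{t.term}: {t.definition} (table: {t.source_table}, owner: {t.owner})"
        for key, t in glossary.items()
        if key in lowered  # naive substring match, good enough for a sketch
    ]
    return "\n".join(lines)


context = build_context("How many customers churned last month?", GLOSSARY)
```

When a term such as "ARR" has no entry, `build_context` returns nothing for it, which is the situation described above: the model is left to guess.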
Poor discovery and trust
Even if the right table exists, users still face several hurdles:
- finding the right dataset
- identifying duplicate or stale tables
- verifying access permissions
- confirming freshness and quality
- understanding column meanings
LLMs don’t solve those problems by themselves. They need a metadata backbone to work from.
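As a rough illustration of what a metadata backbone allows, the sketch below filters candidate tables down to those that have an owner, are not marked deprecated, and were updated recently. The dictionary keys (`owner`, `deprecated`, `last_updated`) and the one-day threshold are assumptions for this example, not a real catalog schema.

```python
from datetime import datetime, timedelta, timezone


def trusted_candidates(tables, now, max_age=timedelta(days=1)):
    """Return names of tables that are owned, not deprecated, and fresh."""
    return [
        t["name"]
        for t in tables
        if t.get("owner")
        and not t.get("deprecated", False)
        and t.get("last_updated") is not None
        and now - t["last_updated"] <= max_age
    ]


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
# Made-up candidates: one healthy table, one stale copy, one without an owner.
candidates = [
    {"name": "analytics.orders", "owner": "data-platform",
     "last_updated": datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)},
    {"name": "tmp.orders_copy", "owner": "data-platform",
     "last_updated": datetime(2023, 11, 1, tzinfo=timezone.utc)},
    {"name": "sandbox.orders_test", "owner": None,
     "last_updated": datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)},
]
usable = trusted_candidates(candidates, now)
```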
Data quality issues become AI issues
If the underlying data is stale, incomplete, or mislabeled, then the AI output will also be wrong. Harsha emphasizes that:
- bad data leads to bad dashboards
- bad dashboards lead to bad decisions
- bad decisions lead to costly business mistakes
What OpenMetadata / Collate Is Trying to Solve
Harsha describes OpenMetadata as a way to make data more understandable and usable for humans and AI alike.
Core capabilities discussed
- Automated metadata collection from tools like Snowflake, Hadoop, MySQL, Kafka, and BI systems
- Knowledge graph / RDF-based relationships between assets
- Glossary and metrics catalog for business definitions
- Lineage tracking at column and dataset level
- Data quality and observability signals
- Ownership and tiering, so critical data gets treated like critical services
The bigger idea
Preparing data for AI should not be a purely manual effort; AI itself should help make data ready for AI.
That means using agents and automation to:
- document data
- classify sensitive fields
- infer relationships
- detect stale or incomplete datasets
- surface the right table or metric for a user query
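One of the tasks above, classifying sensitive fields, can be sketched with simple column-name heuristics. This is not how OpenMetadata or Collate performs classification; the patterns and labels below are assumptions chosen for illustration, and a real system would likely also inspect sample values.

```python
import re

# Hypothetical name-based patterns; real classifiers usually inspect data too.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "national_id": re.compile(r"ssn|passport|national[-_]?id", re.IGNORECASE),
}


def classify_columns(columns):
    """Map each column name to the sensitive labels whose pattern it matches."""
    tags = {}
    for col in columns:
        matches = [
            label for label, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(col)
        ]
        if matches:
            tags[col] = matches
    return tags


tags = classify_columns(["user_email", "order_id", "mobile_number"])
```

Columns that match no pattern are left out of the result, so a reviewer only sees the fields that were flagged.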
Practical Recommendations
1. Start with metadata early
As soon as you have a data team, you should begin collecting and organizing metadata.
2. Define business semantics
Create a shared glossary for important business terms:
- customer
- revenue
- ARR
- customer health
- conversion
- active user
3. Track ownership and lineage
Every critical table or dashboard should have:
- an owner
- a source
- a lineage chain
- a freshness/quality signal
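The checklist above lends itself to a simple audit. The following sketch reports which of the four items are missing for an asset; the key names are hypothetical and would need to be mapped onto whatever catalog is actually in use.

```python
# Hypothetical key names for the four items in the checklist above.
REQUIRED_FIELDS = ("owner", "source", "lineage", "freshness_signal")


def missing_metadata(asset: dict) -> list[str]:
    """Return the required fields that are absent or empty for this asset."""
    return [name for name in REQUIRED_FIELDS if not asset.get(name)]


# Made-up asset with an owner and source but no lineage or freshness signal.
dashboard = {
    "name": "dashboard.exec_revenue",
    "owner": "finance",
    "source": "snowflake",
    "lineage": [],
}
gaps = missing_metadata(dashboard)
```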
4. Prioritize the most important dashboards and pipelines
Harsha suggests working backwards from the dashboards executives and business teams rely on most, then tracing lineage upstream to protect the most critical data paths.
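Working backwards from a dashboard amounts to walking a lineage graph upstream. The sketch below does this with a breadth-first traversal over a hypothetical lineage mapping (each asset points to its direct upstream sources); the asset names are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: asset -> list of direct upstream assets.
LINEAGE = {
    "dashboard.exec_revenue": ["analytics.revenue_daily"],
    "analytics.revenue_daily": ["staging.orders", "staging.refunds"],
    "staging.orders": ["raw.orders"],
    "staging.refunds": ["raw.refunds"],
}


def upstream_assets(start, lineage):
    """Return every asset upstream of `start`, nearest dependencies first."""
    seen = {start}          # guards against cycles and repeat visits
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in lineage.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


critical_path = upstream_assets("dashboard.exec_revenue", LINEAGE)
```

Every asset in `critical_path` is a candidate for the ownership, quality, and freshness checks discussed in this section, since a failure in any of them can reach the dashboard.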
5. Treat data like production software
If a dashboard or pipeline drives business decisions, it should have:
- documentation
- quality checks
- observability
- on-call responsibility
- escalation paths
Notable Examples
- Uber surfaced many of the challenges:
  - duplicate tables
  - stale or experimental datasets being mistaken for canonical ones
  - inconsistent definitions across teams
  - GDPR classification done manually at scale
  - missed pipeline failures affecting driver payments and business reporting
- A tips pipeline failure showed how a data issue can turn into a real-world business and accounting problem.
- Wrong audience targeting for ads/incentives illustrated how data mistakes can waste significant money in ML-driven systems.
Final Takeaway
The episode’s central message is that AI is only as good as the data ecosystem behind it. LLMs can help users ask questions and discover information faster, but they cannot fix unclear definitions, poor ownership, stale pipelines, or missing metadata.
If companies want AI to be reliable in production, they need to invest in:
- metadata
- semantics
- lineage
- governance
- data quality
- ownership
In other words: the LLM issue is usually a data issue first.
