Overview of The Stack Overflow Podcast — Episode: "Even GenAI uses Wikipedia as a source"
This episode (host Ryan Donovan) features Philippe Sade, AI project lead at Wikimedia Deutschland, discussing the Wikimedia vectorization project: a large-scale effort to build a vector database on top of Wikidata that enables semantic search, easier access for AI/RAG applications, and broader community-driven uses. The pipeline transforms structured Wikidata items into textual representations, embeds them with a pre-trained model, and publishes the processed data (Parquet) on Hugging Face to reduce load on Wikimedia infrastructure. An alpha release, built on a September 2024 snapshot, covers ~30 million embedded items (those linked to Wikipedia pages) and is available for testing and feedback.
Key takeaways
- Motivation: Massive scraping and RAG workloads were stressing Wikimedia infrastructure — a cooperative solution (providing a preprocessed vector DB) reduces repeated heavy API/DB queries.
- Scope & scale: The team processed a large Wikidata dump (terabytes of text); they embedded a filtered subset (~30M items linked to Wikipedia pages) for the alpha.
- Approach: Convert graph items to text (labels, descriptions, aliases, and statement-derived sentences), chunk by statement, then embed with a pre-trained embedding API rather than self-hosting.
- Infrastructure / distribution: Processed Parquet exports were posted on Hugging Face so third parties can consume pre-aggregated labels without scraping Wikimedia servers.
- UX / tooling: An accompanying interface (MCP server) helps editors and users explore Wikidata and assists LLMs in generating correct SPARQL queries; vector search is used for exploration and SPARQL for precise retrieval.
- Release & testing: Alpha released (built on the September 2024 snapshot); the team is collecting user feedback to guide improvements, update cadence, and feature priorities.
Technical details
Data preparation & what gets embedded
- Source: full Wikidata data dump (large — multiple passes required to aggregate connected labels).
- Embedded subset: ~30 million items chosen for having links to Wikipedia pages (filtered to reduce volume and focus on general knowledge).
- Textual representation per item includes:
- Label, description, aliases
- Statement-derived sentences: for each edge/property, include property label + connected item label as part of a sentence
- Excluded: raw external IDs and other opaque identifiers, which add noise to embeddings (though a flag noting the presence of such properties may be recorded)
- Output format for downstream processing: Parquet (columnar), with each row containing the aggregated labels/fields needed so consumers can process row-by-row.
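The graph-to-text step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the example item, property names, and the `item_to_text` helper are hypothetical, and the real pipeline aggregates connected labels across the full Wikidata dump in multiple passes.

```python
def item_to_text(item: dict) -> str:
    """Aggregate label, description, aliases, and statement-derived
    sentences into one embeddable text block. Raw external IDs are
    excluded, matching the project's filtering."""
    parts = [item["label"], item.get("description", "")]
    if item.get("aliases"):
        parts.append("Also known as: " + ", ".join(item["aliases"]) + ".")
    # One sentence per statement: "<item label> <property label> <value label>."
    for prop_label, value_label in item.get("statements", []):
        parts.append(f"{item['label']} {prop_label} {value_label}.")
    return " ".join(p for p in parts if p)

item = {
    "label": "Berlin",
    "description": "capital of Germany",
    "aliases": ["Berlin, Germany"],
    "statements": [("country", "Germany"), ("instance of", "city")],
}
row = {"qid": "Q64", "text": item_to_text(item)}
# Each such row becomes one line of the Parquet export, so consumers
# can stream the dump row by row (e.g. with pyarrow.parquet).
```

Note that the QID lives in its own column rather than in the embedded text, consistent with keeping opaque identifiers out of the embeddings.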
Embedding pipeline
- Model: a pre-trained embedding model (Jina Embeddings v3 in the discussion), consumed via the provider's API for speed and to avoid self-hosting.
- Chunking: items were chunked at the statement level, with each chunk carrying the item's label/description/aliases as context. Both 1024- and 512-token chunk capacities were tested; 512 tokens was selected as a practical tradeoff between accuracy and resource cost.
- Techniques: used a matryoshka-style embedding approach supported by the provider (nested embeddings that can be truncated to smaller sizes with modest accuracy loss); chunking kept the per-item count low, typically at most ~4 chunks.
- Infrastructure partners: project experimented with infrastructure partners to host and scale processing (enabling cost-effective experimentation).
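The chunking-plus-embedding step might look like the sketch below. The `embed` function is a deterministic stub standing in for the provider API (Jina Embeddings v3 in the episode), and the grouping size is an assumed parameter; only the overall shape (context-prefixed chunks, matryoshka-style truncation of the returned vectors) follows the discussion.

```python
import hashlib

EMBED_DIM = 1024               # full embedding size from the provider
TRUNCATED_DIM = 512            # smaller matryoshka prefix kept in practice
MAX_STATEMENTS_PER_CHUNK = 8   # hypothetical grouping size

def embed(text: str, dim: int = EMBED_DIM) -> list[float]:
    """Stand-in for the provider embedding API: hashes the text into a
    fixed-size vector so the sketch runs offline. Replace with a real
    API call in practice."""
    raw = hashlib.sha256(text.encode()).digest()   # 32 bytes
    vals = [b / 255 for b in raw]
    return (vals * (dim // len(vals) + 1))[:dim]

def chunk_item(label, description, aliases, statements):
    """Each chunk repeats label/description/aliases as context and
    carries a slice of the statement-derived sentences."""
    header = f"{label}. {description}. Also known as: {', '.join(aliases)}."
    for i in range(0, len(statements), MAX_STATEMENTS_PER_CHUNK):
        body = " ".join(statements[i:i + MAX_STATEMENTS_PER_CHUNK])
        yield f"{header} {body}"

chunks = list(chunk_item(
    "Berlin", "capital of Germany", ["Berlin, Germany"],
    [f"Berlin has example property {n}." for n in range(20)],
))
# Matryoshka-style embeddings can be truncated to a prefix of the
# full vector, trading a little accuracy for storage and compute.
vectors = [embed(c)[:TRUNCATED_DIM] for c in chunks]
```

With 20 statement sentences and chunks of 8, this item yields three chunks, in line with the "at most ~4 chunks per item" observation.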
Search / query flow
- Vector search: good for exploratory queries and surface-retrieval (natural language input).
- SPARQL (described as “SQL for knowledge graphs”): precise queries once you know the graph structure and identifiers.
- MCP server: intermediary that uses vector search to help LLMs discover graph structure and then synthesize valid SPARQL queries for precise retrieval.
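The two-step flow above can be illustrated with a toy example: vector search for exploration, then a precise SPARQL query built from what it finds. The two-item index and the 2-D vectors are stand-ins for the published vector database, and the query template is one an LLM (or a human) would synthesize; only the endpoint URL and property ID are real Wikidata facts.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: QID -> embedding. In practice this would be loaded from
# the processed Parquet dump.
index = {"Q64": [1.0, 0.1], "Q90": [0.2, 1.0]}

def vector_search(query_vec, k=1):
    """Exploratory step: return the k most similar items."""
    return sorted(index, key=lambda q: cosine(query_vec, index[q]),
                  reverse=True)[:k]

qid = vector_search([0.9, 0.2])[0]  # discovers Q64 (Berlin)

# Precise step: the discovered QID is filled into a SPARQL query
# (P1082 = population) for https://query.wikidata.org/sparql.
sparql = f"""
SELECT ?population WHERE {{
  wd:{qid} wdt:P1082 ?population .
}}"""
```

This mirrors the MCP server's role: fuzzy retrieval narrows the graph down to concrete identifiers, and SPARQL then retrieves exact values.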
Challenges and limitations
- Graph-to-text conversion: mapping rich graph structure to good textual embeddings required multiple passes and careful aggregation of connected labels.
- What to include: deciding which property types matter (dates, qualifiers, external IDs) required testing; raw IDs generally add noise to embeddings and were excluded.
- Scale & update cadence: re-vectorizing is costly. Current approach uses a September 2024 snapshot for alpha; they plan periodic updates and explore change-detection (last-edited timestamps, property-level checks) to avoid full re-embeddings.
- Evaluation: determining an objectively “best” embedding configuration is hard without curated evaluation datasets; alpha user testing is being used to guide choices (use cases, accuracy, feature requests).
- Resource tradeoffs: balancing embedding granularity (512 vs 1024 tokens), API cost, and infrastructure constraints required pragmatic decisions.
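The timestamp-based change detection mentioned above might be sketched as follows. The item records and the cutoff logic are illustrative assumptions; the source only says the team is exploring last-edited timestamps and property-level checks.

```python
from datetime import datetime, timezone

# Snapshot date of the alpha release's source dump.
SNAPSHOT = datetime(2024, 9, 1, tzinfo=timezone.utc)

def needs_reembedding(item: dict) -> bool:
    """Queue only items edited after the snapshot for re-vectorization;
    property-level diffing could narrow this further (e.g. skip edits
    that only touch excluded external-ID properties)."""
    return item["last_edited"] > SNAPSHOT

items = [
    {"qid": "Q64", "last_edited": datetime(2024, 8, 15, tzinfo=timezone.utc)},
    {"qid": "Q90", "last_edited": datetime(2024, 11, 2, tzinfo=timezone.utc)},
]
stale = [i["qid"] for i in items if needs_reembedding(i)]
```

Here only the item edited after the snapshot would be re-embedded, avoiding the cost of a full re-vectorization pass.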
Use cases & next steps
- Intended use cases:
- RAG (retrieval-augmented generation) backends that need a curated, searchable knowledge base
- Exploratory search and discovery across Wikidata
- Classification, clustering, and other ML tasks that benefit from vectorized representations of graph items
- Tools to help authors/curators craft correct SPARQL queries using generative assistance
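As a concrete instance of the classification use case, here is a minimal nearest-centroid classifier over item vectors. The 2-D toy vectors and class labels are invented for illustration; the real embeddings are far higher-dimensional, and this is just one of many ML tasks the vectors could feed.

```python
def centroid(vecs):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def classify(vec, centroids):
    """Assign vec to the class whose centroid is nearest (squared
    Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist(vec, centroids[label]))

# Toy training vectors per class (stand-ins for item embeddings).
centroids = {
    "city": centroid([[1.0, 0.1], [0.9, 0.2]]),
    "person": centroid([[0.1, 1.0], [0.2, 0.8]]),
}
label = classify([0.8, 0.3], centroids)  # nearest centroid: "city"
```

The same distance machinery underlies clustering (e.g. k-means), which is why a single set of published vectors can serve several downstream tasks.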
- Next steps the team is seeking through alpha feedback:
- Validate primary use cases (RAG vs classification vs tooling)
- Decide on update frequency for re-vectorization
- Determine whether fine-tuning or model changes are required based on user accuracy needs
- Iterate on MCP features and developer-facing APIs
How to access & try it
- Alpha: released using a Sept 2024 Wikidata snapshot. Search for the “Wikidata Vector Database” / “Wikidata vector” project to find the demo and details.
- Processed dataset: published on Hugging Face in Parquet format (the project put a preprocessed dump of labels/rows there to reduce pressure on Wikimedia servers).
- Feedback: Philippe and the Wikimedia Deutschland team are soliciting user testing and use-case reports to inform improvements.
- Contact from the episode: podcast@stackoverflow.com for the show; Philippe’s team at Wikimedia Deutschland for the vector DB alpha.
Notable quotes & insights
- “You can’t really stop these type of scrapings. It’s better to find solutions to either provide the data in a simpler way instead of having multiple calls on the API.”
- “Vector search is great for exploration, SPARQL is great for precision — using both together unlocks powerful workflows.”
- Practical engineering tradeoff: “We used the provider API rather than self-hosting because it was way faster and they have really good infrastructure.”
If you want to test the alpha and share feedback, look up the Wikidata Vector Database (alpha) or check the processed Parquet dump on Hugging Face to experiment without hitting Wikimedia servers.
