The Project
Version 1.0
AI-assisted: This article was written with the help of ChatGPT (OpenAI), based on original sketches and ideas by Francesc López Marió.
WordGuardian is a personal project that combines game mechanics, experimentation, and semantic data work.
The goal is to present real, absurd, and invented definitions inside a simple, readable interface.
🧩 Objective
I wanted to create a game where users guessed plausible definitions, either automatically generated or extracted.
But the journey was anything but simple.
I used Astro and Preact to avoid FOUC and keep the site responsive, and I worked with massive Wikidata dumps, decompressed with pbzip2 and zstd and parsed with parallel Rust scripts.
Stack & Techniques
- Frontend: Astro, Preact, Shoelace, Vanilla Extract
  Astro turned out to be a hidden gem. It lets you combine components from different frameworks and enables efficient server-side rendering (SSR). This helps search engines index the content properly, a key difference if you care about SEO in 2025. Also, exposing a small server API (via /api/xyz endpoints) is simple and clean, with no need for a full backend.
- Backend: Node.js for scraping; parallel Rust scripts for parsing Wikidata
- Processing: Wikidata dumps using P31, P279, P1629... a whole world of data
- NLP: distractor generation with Groq, heuristics using embeddings.
- Other: Load without FOUC (Flash of Unstyled Content), making sure the page doesn't flash ugly before styles kick in. Minimal optimization, and as always, lots of console logs.
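One possible reading of "heuristics using embeddings" is to rank candidate distractors by cosine similarity to the correct definition and keep only those in a middle band: close enough to be plausible, not so close that they are paraphrases. The sketch below assumes the embedding vectors already exist; the function names and the 0.3 and 0.9 thresholds are illustrative, not values from the project.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_distractors(target_vec, candidates, low=0.3, high=0.9, k=3):
    """Keep up to k candidates whose similarity to the target is in [low, high].

    candidates is a list of (text, vector) pairs. The thresholds are
    assumptions for illustration and would need tuning on real data.
    """
    scored = [(cosine(target_vec, vec), text) for text, vec in candidates]
    in_band = [(score, text) for score, text in scored if low <= score <= high]
    in_band.sort(reverse=True)  # most similar first
    return [text for _, text in in_band[:k]]

# Toy 2-D vectors standing in for real embeddings.
target = [1.0, 0.0]
candidates = [
    ("paraphrase", [1.0, 0.01]),     # too close to the real definition
    ("plausible", [1.0, 1.0]),
    ("unrelated", [0.0, 1.0]),       # too far to be believable
    ("also plausible", [1.0, 2.0]),
]
chosen = pick_distractors(target, candidates)
```

With real embeddings the band would likely need adjusting per model, since similarity scores are not comparable across embedding models.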
I also relied on the Wiktextract project by Tatu Ylönen, and the dataset from Kaikki.org, which offers Wiktionary content in a structured, usable way. Without it, building a solid lexical base would have taken much longer.
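As a rough illustration of what "structured, usable" means here, the sketch below pulls (word, part of speech, gloss) triples out of Wiktextract-style JSON Lines. It assumes each entry has "word", "pos", and a "senses" list whose items carry a "glosses" list, which is how I understand the Wiktextract output; the function name and sample data are made up for the example.

```python
import io
import json

def iter_glosses(lines):
    """Yield (word, pos, gloss) triples from Wiktextract-style JSON Lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        word = entry.get("word")
        if not word:
            continue
        for sense in entry.get("senses", []):
            # Some senses carry no gloss at all; those are skipped.
            for gloss in sense.get("glosses", []):
                yield word, entry.get("pos", ""), gloss

# Two hand-written lines standing in for a real data file.
sample = io.StringIO(
    '{"word": "guardian", "pos": "noun", "senses": [{"glosses": ["Someone who guards."]}]}\n'
    '{"word": "xyz", "pos": "noun", "senses": [{}]}\n'
)
triples = list(iter_glosses(sample))
```

Because the function accepts any iterable of lines, the same code works on an open file handle without loading the whole file into memory.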
💡 What I Learned
- Wikidata is a treasure, but navigating it without a map is a nightmare: items, properties, lexemes, glosses, cross-references... and SPARQL queries that look like arcane rituals.
- Tools I discovered:
  - Wikidata dumps often come in JSONL (JSON Lines) format: a file where each line is a standalone JSON object. Great for streaming: you don't need to load everything into memory. But with 90 GB... even the best strategy requires patience.
  - Compression formats matter: bz2 is common in official dumps, but painfully slow to decompress. zstd is much faster, especially with multiple threads. With pzstd -d -p10 you can speed up decompression, although your own code then becomes the bottleneck.
- Generating plausible distractors is not trivial. Doing it with Groq was fun but inconsistent.
- Making fake but believable definitions is tricky. You need precise prompting with Groq or HuggingFace.
- Dumps are not for the faint of heart. Especially when they're 90 GB.
- I also learned to wrap small NLP services in Docker, exposing local APIs with ease. Using a lightweight image like python:3.11-slim and a requirements.txt, I quickly deployed a service for semantic comparisons and phrase similarity. Perfect for local tests without dependency chaos.
- I explored Common Crawl to gather URLs from the public web. Although I didn't end up using it, I built a script to filter the links and create a custom corpus. The idea was to use it to generate or validate questions with real content. It's on hold... for now.
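The line-by-line streaming idea from the list above can be sketched in a few lines. This is not the project's Rust pipeline: Python's built-in bz2 module stands in for pbzip2/pzstd and is single-threaded, so it only illustrates the approach. As far as I know, the official Wikidata JSON dump wraps the entities in one large array, so the sketch tolerates the bracket lines and trailing commas; the function names and the dump path are hypothetical.

```python
import bz2
import json

def iter_entities(lines):
    """Yield one entity dict per line, skipping array brackets and trailing commas."""
    for line in lines:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)

def instance_of(entity):
    """Return the item IDs listed under P31 (instance of) for one entity."""
    ids = []
    for statement in entity.get("claims", {}).get("P31", []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        # Statements without a value (novalue/somevalue) have no datavalue.
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids

def stream_dump(path):
    """Lazily decompress a .bz2 dump and yield entities (path is hypothetical)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_entities(handle)

# A tiny in-memory stand-in for the dump.
sample_lines = [
    "[\n",
    '{"id": "Q42", "claims": {"P31": [{"mainsnak": {"snaktype": "value", '
    '"datavalue": {"value": {"id": "Q5"}}}}]}},\n',
    '{"id": "Q1", "claims": {}}\n',
    "]\n",
]
entities = list(iter_entities(sample_lines))
```

Only one line is held in memory at a time, which is what makes a 90 GB file workable at all, though not fast.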
Monetization? Yikes...
I had hoped to at least cover the hosting with ads. But the numbers speak for themselves:
- 462 visits
- 22 active users
- €0.00 revenue
- An average of 7 minutes per user... but no clicks
In 2025, if you don't have TikTok, a newsletter, or a million page views, ads won't even buy you coffee.
Making a website isn't enough. You have to be a showman, have money, or do daily marketing.
🎯 So, what now?
I'm not killing it.
WordGuardian will live on as an experimental project and a small tribute to words and to Wikidata.
Maybe it will evolve into an educational app. Maybe it'll become the base for a new game.
Or maybe it'll stand as a monument to the time spent on an idea that still makes me smile.
I'm also toying with the idea of using Wikidata's ontology to generate questions by topic: science, history, pop culture, etc.
The subclass (P279) and instance (P31) hierarchies could be used to build thematic levels or topic-based quizzes, making the game a bit less "psychedelic" and a bit more navigable.
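To make the idea concrete, here is a minimal sketch of how subclass and instance links could group items into levels under a topic. It assumes the P279 and P31 edges have already been extracted into plain dictionaries; the function name, the depth limit, and the toy labels (used in place of real Wikidata IDs) are all illustrative.

```python
from collections import deque

def topic_levels(root, subclass_of, instance_of, max_depth=3):
    """Group items by how many subclass hops their class sits below root.

    subclass_of maps a class to its parent classes (P279 edges).
    instance_of maps an item to its classes (P31 edges).
    Returns {depth: [items]}; items outside the topic are left out.
    """
    # Invert the P279 edges so the graph can be walked downwards from root.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first search, recording the shallowest depth of each class.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if depth[current] == max_depth:
            continue
        for child in children.get(current, []):
            if child not in depth:
                depth[child] = depth[current] + 1
                queue.append(child)

    levels = {}
    for item, classes in instance_of.items():
        hits = [depth[c] for c in classes if c in depth]
        if hits:
            levels.setdefault(min(hits), []).append(item)
    return levels

# Toy labels instead of real Wikidata IDs.
subclass_edges = {"physics": ["science"], "optics": ["physics"], "history": ["humanities"]}
instance_edges = {"laser": ["optics"], "gravity": ["physics"], "waterloo": ["history"]}
levels = topic_levels("science", subclass_edges, instance_edges)
```

The depth could then double as a difficulty level, with deeper, more specialised classes feeding harder quizzes. The depth limit matters in practice because subclass chains in Wikidata can be long and occasionally cyclic; the visited check above keeps the walk from looping.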
Try it: https://eurekatop.com/wordguardian
Open source (coming soon on GitHub)
A project by mutiitu.com
© Francesc López Marió, 2025
