Feedback: It's a good model, but it hallucinates badly on local facts (Germany)
Hello,
Whenever I test a new model, I like to ask it questions about local facts. In doing so I've noticed that Qwen 3 does a very poor job, in fact a lot worse than smaller models like Gemma 3 12B. I tested the models on their respective platforms with default sampler settings.
Here's one example out of my pool of questions related to Germany:
- Bitte nenne mir 5 Achterbahnen im Europapark. (List 5 roller coasters at Europapark, which is one of the most popular theme parks in Europe.)
Qwen 3 32B (answer translated from German):
Here are 5 well-known roller coasters at Europapark Rust:
Silver Star – a high-speed coaster with a loop and a large incline.
Blue Tornado – an action-packed ride with several loops and twists.
Rulantica Speed Coaster – the fastest coaster in the park, located in the Rulantica world.
Valkyria – a modern steel coaster with high speeds and dynamic elements.
Colossus – one of the classics in the park, a long-established wooden coaster.
If you'd like more details on any of these rides, I'm happy to help!
The only correct roller coaster here is Silver Star (and it doesn't even have loops). All the other coasters are incorrect.
Gemma 3 12B (answer translated from German):
Here are 5 roller coasters at Europapark:
Silver Star: a gigantic steel coaster with high speed and long climbs.
Blue Fire Megacoaster: a fast-paced launch coaster with inversions.
Colossos: a wooden coaster with 10 loop elements.
Arthur – In the Minimoys Kingdom: a family coaster with special effects and immersive theming.
Ekstrem!: a spinning coaster with surprising twists and turns.
Gemma 3 12B gets 3 coasters correct (Silver Star, Blue Fire and Arthur).
Even the recently released GLM-4 32B, also a Chinese model, does a much better job than Qwen 3 (answer translated from German):
Gladly, here are 5 roller coasters at Europapark:
Silver Star
Blue Fire - Mega Coaster
Raptor
Euro-Mir
Colossos - Kampenwalzer
3 coasters are correct (Silver Star, Blue Fire, Euro-Mir).
I noticed the same thing when asking questions about bands and TV shows.
I've had a similar experience, though something other than "local knowledge" might be at play. For example, I tried asking Qwen3-30B-A3B-bf16 in Greek about the Peloponnesian War. The thought process (entirely in English) was correct as far as I could tell, but the reply in Greek then mentioned something completely unrelated.
<think> [...] plague in Athens [...] </think>
Was once translated as:
[...] πληθωρισμός στην Αθήνα [...] (translation: inflation in Athens)
And another time as:
[...] πανικός στην Αθήνα [...] (translation: panic in Athens)
This specific case hints that it could be:
- random hallucination, I guess
- translation / tokenization / sampling error: "plague" should be translated as πανούκλα (panoukla), but it came out as πληθωρισμός (plithorismos) one time and as πανικός (panikos, "panic") another time. Maybe the model generated the token for "π" (p) first, and that then influenced the rest of the word?
- data / context / training error: inflation is a pretty common topic, and Athens is home to a large share of the Greek population, so the internet probably contains large amounts of Greek text about inflation and Athens.
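The sampling hypothesis above can be sketched with a toy example (all probabilities are made up for illustration, not taken from any real model): if the decoder samples stochastically over candidate translations, repeated runs of the "same" prompt can land on different wrong words, and here all three candidates even share the initial "π", consistent with the prefix-token idea.

```python
import random

# Hypothetical next-word distribution for translating "plague" in this
# context (made-up numbers). All three candidates share the initial "π".
candidates = {
    "πανούκλα": 0.45,     # plague (the correct translation)
    "πληθωρισμός": 0.30,  # inflation
    "πανικός": 0.25,      # panic
}

def sample(dist, rng):
    """Draw one word with probability proportional to its weight."""
    r = rng.random()
    cum = 0.0
    for word, p in dist.items():
        cum += p
        if r < cum:
            return word
    return word  # guard against floating-point rounding

rng = random.Random(0)
draws = [sample(candidates, rng) for _ in range(5)]
print(draws)  # different draws pick different words for the same "translation"
```

Temperature and top-p reshape this distribution but don't remove the randomness; greedy decoding (temperature 0) would always pick the highest-probability word here, which would be one way to test whether the mistranslation is sampling noise or a deeper knowledge gap.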