https://raw.githubusercontent.com/niieani/gpt-tokenizer/refs...
Indeed, how do they deal with Chinese? Are some ideograms multiple tokens?
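For what it's worth, this is easy to check directly. A minimal sketch, assuming OpenAI's tiktoken library and the o200k_base encoding (the one behind GPT-4o); the specific characters are just illustrative picks. Since BPE operates on UTF-8 bytes, common ideograms tend to be a single token, while rare ones can fall back to several byte-level tokens:

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")

    # Compare common ideograms with a rare one; BPE works on UTF-8
    # bytes, so a character the vocabulary never merged stays split
    # across multiple byte-fragment tokens.
    for ch in ["中", "国", "龘"]:
        ids = enc.encode(ch)
        print(ch, "->", ids, f"({len(ids)} token(s))")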
(ges)(chn)(iegelt)
[1]: https://platform.openai.com/tokenizer

"This word is geschniegelt" is [2500, 2195, 382, 192786]
The last token here is " geschniegelt".
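If anyone wants to reproduce this: a minimal sketch, assuming the ids above came from the o200k_base encoding behind platform.openai.com/tokenizer, and that the "(ges)(chn)(iegelt)" split upthread came from an older encoding such as cl100k_base. It encodes the same sentence under both encodings and decodes each id individually to show the token boundaries:

    import tiktoken

    # Encode the sentence under the old and new encodings and decode
    # each id on its own to see where the token boundaries fall.
    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        ids = enc.encode("This word is geschniegelt")
        print(name, ids, [enc.decode([i]) for i in ids])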
Sometimes it would reply with the correct definition of geschniegelt (roughly "spruced up", "dapper"); the description would sometimes be in German, sometimes in English.
Most of the time it would give me a definition for a different German word, "Geil" (literally "horny", colloquially "awesome").
For whatever reason, the most interesting results I got were via my work's M365 Copilot interface, where it would give me random word descriptions in Hebrew[0] and Arabic[1].
"But maybe... OLEICAT? no..."