Comment by 1718627440, 2 hours ago

> I wonder do German brains work on a much longer context window because of the language?

Maybe, but if so it's more due to the spelling of numbers and to long sentences. Compound words are not an example of this, since Germans parse them as distinct components just fine. It just means that the lowest level of "tokenization" in everyday use is not the word but its subcomponents.

Do native English speakers "tokenize" expressions within words? Do you see it as '(labelling) (of) (minced)' or '(label)l(ing) (of) (minc)(ed)'?

I can't speak for most Germans, but the algorithm I think I use is simply greedy matching from left to right. This is also consistent with how mistokenization works in common puns, so I suspect it's widespread.
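To make "greedy from left to right" concrete, here's a minimal sketch: at each position, take the longest known component that matches, then continue. The toy vocabulary below is purely illustrative, not an actual lexicon; note how greedy matching reproduces the '(label)l(ing)' split above, and how it can mistokenize 'minced' by grabbing the longer "mince" before it can try "minc" + "ed".

```python
def greedy_segment(word, vocab):
    """Split `word` by repeatedly taking the longest vocab prefix."""
    parts = []
    i = 0
    while i < len(word):
        # Try the longest remaining prefix first (greedy).
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                parts.append(word[i:j])
                i = j
                break
        else:
            # No component matched: fall back to a single character.
            parts.append(word[i])
            i += 1
    return parts

# Illustrative toy vocabulary of "components".
vocab = {"label", "ing", "minc", "mince", "ed"}

print(greedy_segment("labelling", vocab))  # ['label', 'l', 'ing']
print(greedy_segment("minced", vocab))     # ['mince', 'd'] -- greedy mistokenization
```

The second result shows the pun mechanism: a greedy reader commits to the longest familiar chunk and only notices the misparse afterwards.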

In primary school we were trained to recognize syllable boundaries. Is that just a German thing, or is it common in other countries too? You need them for spelling, and once you know them, separating word components becomes trivial.