Comment by troad 14 hours ago

> Really, this bug is nothing but programmers failing to take into account that not everybody writes in English.

This bug is the exact opposite of that. The program would have worked fine had it used pure ASCII transforms (±0x20); it was the use of library functions that did in fact take Turkish into account that caused the problem.
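To make the contrast concrete, here is a minimal Java sketch. It is not taken from the program under discussion: the class name `AsciiCaseDemo`, the helper `asciiUpper`, and the sample word are all made up for illustration. It compares a pure ASCII shift with the library call given a Turkish locale.

```java
import java.util.Locale;

// Hypothetical demo class; nothing here comes from the affected program.
final class AsciiCaseDemo {
    // ASCII-only uppercase: shifts a-z down by 0x20 and leaves every other char alone.
    static String asciiUpper(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c >= 'a' && c <= 'z' ? (char) (c - 0x20) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Plain ASCII shift: 'i' becomes 'I' (U+0049).
        System.out.println(asciiUpper("title"));
        // Locale-aware call: 'i' becomes dotted capital I (U+0130).
        System.out.println("title".toUpperCase(turkish));
    }
}
```

The second result no longer equals the ASCII string "TITLE", which is the kind of mismatch described above.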

More broadly, this is not an easy issue to solve. If a Turkish programmer writes code, what is the expected behaviour for metaprogramming and compilers? Are the function names in English or Turkish? What about variables, object members, struct fields? You could have one variable name that references some government ID number using its native Turkish name, right next to another variable name that uses the English "ID". How does the compiler know what locale to use for which symbol?
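A rough Java sketch of the mixed-language case, under invented assumptions: the column names `KİMLİK_NO` and `ID`, the class `MixedLocaleLookup`, and the idea that keys are stored uppercased are all hypothetical, chosen only to show that a single locale cannot serve both symbols.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical lookup table; the schema is invented for illustration.
final class MixedLocaleLookup {
    // Keys are stored uppercased: one Turkish name (with dotted capital I, U+0130)
    // and one English name.
    static final Map<String, String> COLUMNS = Map.of(
            "K\u0130ML\u0130K_NO", "Turkish-named ID column",
            "ID", "English-named ID column");

    // Uppercases the symbol under the given locale, then looks it up.
    static boolean found(String symbol, Locale locale) {
        return COLUMNS.containsKey(symbol.toUpperCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(found("id", Locale.ROOT));        // true
        System.out.println(found("kimlik_no", Locale.ROOT)); // false
        System.out.println(found("id", turkish));            // false
        System.out.println(found("kimlik_no", turkish));     // true
    }
}
```

Whichever locale is picked, one of the two lookups misses, so the choice has to be made per symbol rather than per program.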

Boiling all of this down to 'just be more considerate' is not actually constructive or actionable.

jeroenhd 4 hours ago

The issue is actually quite easy to solve by specifying a default locale for string operations when you are not dealing with user input. Whether you pick US or ROOT or Turkish as a default locale, all you need to do is make sure that your fancy metaprogramming tricks relying on strings-as-enums are all parsed the same way. Locale.ROOT for Java, InvariantCulture or ToUpperInvariant() for C#, you name it.
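As a sketch of what that looks like in Java (the enum `Kind` and both method names are hypothetical, not from any real codebase), compare a strings-as-enums lookup that relies on the default locale with one that pins `Locale.ROOT`:

```java
import java.util.Locale;

// Hypothetical example of a strings-as-enums lookup.
final class EnumLookup {
    enum Kind { INT, STRING, ID }

    // Fragile: the no-argument toUpperCase() uses Locale.getDefault(),
    // so the result depends on the machine the code runs on.
    static Kind parseWithDefaultLocale(String name) {
        return Kind.valueOf(name.toUpperCase());
    }

    // Stable: Locale.ROOT gives the same mapping on every machine.
    static Kind parseInvariant(String name) {
        return Kind.valueOf(name.toUpperCase(Locale.ROOT));
    }
}
```

With a Turkish default locale, `parseWithDefaultLocale("int")` uppercases to a string starting with U+0130 and `Kind.valueOf` throws `IllegalArgumentException`, while `parseInvariant("int")` still returns `Kind.INT`.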

The whole problem is that the compiler has no idea about the locale of any strings in the system, which is why it's on the programmer to specify them.

Lowercasing/uppercasing a string takes an (infuriatingly) optional locale parameter, and the moment that gets involved, you should think twice before using it for anything other than user data processing. I would happily see Oracle deprecate all string operations lacking a locale in the next version of Java.

  • troad 3 hours ago

    > actually quite easy to solve

    I cannot square your earlier assertion that we should be more mindful "that not everybody writes in English" with your current assertion that all code must only ever contain English, for simplicity's sake. Either is a cogent position on its own, just not both at the same time.

    This bug arose because the programmers made incorrect assumptions about the result of a case-changing operation. If you impose English case rules on Turkish symbol names, this exact bug would simply arise in reverse.

    More problematically, as I alluded to earlier, Turkish code may contain a mix of languages. It may, for example, be using a DSL to talk to a database with fields named in Turkish, as well as making calls to standard library functions named in English. Which half of the code is your proposed invariant locale going to break?
