Comment by okanat 19 hours ago

As a Turkish speaker who used a Turkish-locale setup in my teenage years, these kinds of bugs frustrated me infinitely. Half of the Java or Python apps I installed never ran. My PHP webservers always had problems with random software. Ultimately, I had to change my system's language to English. However, US has godawful standards for everything: dates, measurement units, paper sizes.

When I shared computers with my parents I had to switch languages back and forth all the time. This helped me learn English rather quickly, but I find it a huge accessibility and software design issue.

If your program depends on letter cases, that is a badly designed program, period. If a language ships toUpper or a toLower function without a mandatory language field, it is badly designed too. The only slightly-better option is making toUpper and toLower ASCII-only and throwing error for any other character set.
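A minimal sketch of the ASCII-only fallback described above, in Python. `ascii_upper` and `ascii_lower` are hypothetical helper names, not part of any standard library, and the sketch assumes that rejecting non-ASCII input with an error is acceptable to the caller.

```python
def ascii_upper(text: str) -> str:
    """Uppercase a-z only; refuse any input that is not pure ASCII."""
    if not text.isascii():
        raise ValueError("ascii_upper() only accepts ASCII input")
    # For pure-ASCII input, str.upper() can only change a-z.
    return text.upper()


def ascii_lower(text: str) -> str:
    """Lowercase A-Z only; refuse any input that is not pure ASCII."""
    if not text.isascii():
        raise ValueError("ascii_lower() only accepts ASCII input")
    return text.lower()
```

With these, `ascii_upper("exit")` returns `"EXIT"`, while a string containing `ı` raises `ValueError` instead of silently picking some language's rules.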

While half of the language design of C is questionable and outright dangerous, making its functions locale-sensitive by all popular OSes was an avoidable mistake. Yet everybody did that. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.

I don't care if Unicode releases a conversion map. Natural-language behavior should always require natural language metadata too. Even modern languages like Rust did a crappy job of enforcing it: https://doc.rust-lang.org/std/primitive.char.html#method.to_... . Yes it is significantly safer but converting 'ß' to 'SS' in German definitely has gotchas too.
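To illustrate the 'ß' gotcha in a runnable form: Python's built-in `str.upper()` and `str.lower()` take no language argument and apply Unicode's default mappings, so this sketch should behave the same under any system locale. The variable names are only for illustration.

```python
word = "straße"
upper = word.upper()        # the default Unicode mapping expands ß to SS
round_trip = upper.lower()  # lowercasing cannot know the SS came from ß

print(len(word), len(upper))  # the uppercase form is one character longer
print(round_trip == word)     # the round trip does not restore the original
```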

newpavlov 4 hours ago

>Even modern languages like Rust did a crappy job of enforcing it

Rust did the only sensible thing here. String handling algorithms SHOULD NOT depend on locale, and reusing LATIN CAPITAL LETTER I arguably was a terrible decision on the Unicode side (I know there were reasons for it, but I believe they should've bitten the bullet here), same as Han unification.

collinfunk 18 hours ago

> While half of the language design of C is questionable and outright dangerous, making its functions locale-sensitive by all popular OSes was an avoidable mistake. Yet everybody did that. Just the existence of this behavior is a reason I would like to get rid of anything GNU-based in the systems I develop today.

POSIX requires that many functions account for the current locale. I'm not sure why you are blaming GNU for this.

keyle 11 hours ago

C wasn't designed to be running Facebook, it was designed to not have to write assembly.

- jen20 2 hours ago
  
  At a time when many machines did not have as many bytes of memory as there are Unicode code points.
  
immibis 6 hours ago

I'm not sure why you are blaming POSIX! The role of POSIX is to write down what is already common practice in almost all POSIX-like systems. It doesn't usually specify new behaviour.

- GTP 2 hours ago
  
  I always assumed it was the other way around: a system follows POSIX to be POSIX-compliant.
  
1718627440 19 hours ago

> However, US has godawful standards for everything: dates, measurement units, paper sizes.

Isn't the choice of language and date and unit formats normally independent?

neandrake 19 hours ago

There are OS-level settings for date and unit formats but not all software obeys that, instead falling back to using the default date/unit formats for the selected locale.

Waterluvian 17 hours ago

They’re about as independent as system language defaults causing software not to work properly. It’s that whole realm of “well we assumed that…” design error.

okanat 19 hours ago

> > However, US has godawful standards for everything: dates, measurement units, paper sizes.
> Isn't the choice of language and date and unit formats normally independent?
You would hope so, but no. Quite a bit of software ties the language setting to the locale setting. If you are lucky, they will provide an "English (UK)" option (which still uses miles or FFS WTF is a stone!).
On Windows you can kinda select the units easily. On Linux let me introduce you to the journey to LC_ environment variables: https://www.baeldung.com/linux/locale-environment-variables . This doesn't mean the websites or the apps will obey them. Quite a few of them don't and just use LANGUAGE, LANG or LC_CTYPE as their setting.
My company switched to Notion this year (I still miss Confluence). It was hell until last month since they only had "English (US)" and used M/D/Y everywhere with no option to change!
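For the LC_ journey mentioned above, here is a hedged sketch of the lookup order POSIX describes for a single locale category: `LC_ALL` first, then the category's own variable, then `LANG`. `effective_locale` is a hypothetical helper, and returning `"C"` when nothing is set is an assumption (POSIX leaves that default implementation-defined). `LANGUAGE` is left out on purpose because it is a GNU gettext extension for message translation rather than part of this lookup.

```python
import os
from typing import Mapping, Optional


def effective_locale(category: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the locale name that applies to one category, e.g. "LC_TIME"."""
    env = os.environ if env is None else env
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated the same way
            return value
    return "C"  # assumed fallback; the real default is implementation-defined
```

So a user with `LANG=en_US.UTF-8` and `LC_TIME=en_IE.UTF-8` gets Irish date formats from software that asks for `LC_TIME`, and US formats from software that only ever reads `LANG`.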

- miki123211 8 hours ago
  
  Mac OS actually lets you do English (Afghanistan) or English (Somalia) or whatever.
  It's just English (I don't know when it's US and when it's UK, it's UK for Poland), but with the date / temperature / currency / unit preferences of whatever locale you actually live in.
  
  spookie 3 hours ago
  
  At least for any country in continental Europe, "English" is usually "English International", meaning UK English.
  Maybe there are some exceptions if we speak globally, hence limiting myself to Europe. But I assume it is the same deal.
  
- spookie 4 hours ago
  
  Certain desktop environments like KDE provide a nice GUI for changing the locale environment variables. It has worked quite well for me, to use euro instead of my country's small currency :')
  
- menage 10 hours ago
  
  > FFS WTF is a stone!
  It's actually a pretty good weight for measuring humans (14lb). Your weight in pounds varies from day to day but your weight in (half-)stones is much more stable.
  
  doix 10 hours ago
  
  The real travesty is the fact that the sub-unit for a stone is a pound and not a pebble. I have no idea what stones and pounds are, but if it was stones and pebbles at least it'd be funnier
  
- doublerabbit 18 hours ago
  
  > FFS WTF is a stone
  An English imperial measurement. It was based on an actual stone and was mainly used for weighing agricultural items such as animal meat and potatoes. We also used tons and pounds before we incorporated the metric system of Europe.
  
  emmelaich 17 hours ago
  
  A stone is 1/8th of a long hundredweight. Easy!
  
  stefs 10 hours ago
  
  My car gets 40 rods to the hogshead and that's the way I likes it!
  
emmelaich 17 hours ago

If it's offered, choose EN-Australian or EN-international. Then you get sensible dates and measurement units.

benhurmarcel 9 hours ago

I usually set the Ireland locale, they use English but use civilized units. Sometimes there's also an "English (Europe)" or "English (Germany)" locale that works too.

- distances 7 hours ago
  
  I also use Ireland sometimes for user accounts. For example, Hotels.com only offers the local languages when you select which country to use. The Irish version is one of the few that allows you to buy in Euros in English.
  
- okanat 8 hours ago
  
  Nowadays this works for many applications. Not for the "legacy" ARM compiler that was definitely invented after Win NT adopted UTF though. It crashes with "English (Germany)". Just whyy.
  
Waterluvian 17 hours ago

And if you want it to be more sensible but still not sensible, pick EN-ca.

layer8 16 hours ago

> While half of the language design of C is questionable and outright dangerous, making its functions locale-sensitive by all popular OSes was an avoidable mistake.

It wasn’t a mistake for local software that is supposed to automatically use the user’s locale. It’s what made a lot of local software usefully locale-sensitive without the developer having to put much effort into it, or even necessarily be aware of it. It’s the reason why setting the LC_* environment variables on Linux has any effect on most software.

The age of server software, and software talking to other systems, is what made that default less convenient.

jkrejcha 11 hours ago

On the contrary, the locale APIs are problematic for many reasons. If C had just been like "well C only supports the C locale, write your own support if that's what you want", much more software would have been less subtly broken.
There's a few fundamental problems with it:
1. The locale APIs weren't designed very well and things were added over the years that do not play nice with it.
So like as an example, what should `int toupper(int c)` return? (By the way, the parameter `c` is really an unsigned char; if you try to put anything but a single byte here, you get undefined behavior.) What if you're using something that uses a multibyte encoding? You only get one byte back so that doesn't really help there either.
Many of the functions were clearly designed for the "1 character = 1 byte" world, which is a key assumption of all of these APIs. Which is fine if you're working with ASCII, but blows up as soon as you change locales.
And even so, it creates problems where you try to use it. Say I have a "shell" where all of the commands are internally stored as uppercase, but you want to be compatible. If you try to use anything outside of ASCII with locales, you can't just store the command list in uppercase form because then they won't match when doing a string comparison using the obvious function for it (strcmp). You have to use strcoll instead, and sometimes you just might not have a match for multibyte encodings.
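A small illustration of the "1 character = 1 byte" assumption, sketched in Python rather than C. `bytes.upper()` only maps ASCII letters, which roughly mirrors calling `toupper` one byte at a time in the C locale, while `str.upper()` shows what a text-aware mapping produces. Treat it as an analogy, not as a model of any particular libc.

```python
text = "é"                      # U+00E9
encoded = text.encode("utf-8")  # two bytes: 0xC3 0xA9

# Byte-wise uppercasing only touches ASCII a-z, so both bytes pass through unchanged.
bytewise = encoded.upper()

# Text-aware uppercasing maps U+00E9 to U+00C9, which encodes as 0xC3 0x89.
textwise = text.upper().encode("utf-8")

print(len(text), len(encoded))                    # one character, two bytes
print(bytewise == encoded, bytewise == textwise)  # unchanged, and not the real uppercase
```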
2. The locale is global state.
The worst part about it is that it's actually global state (not even faux-global state like errno). This means that it's wildly thread-unsafe, as you can have thread 1 running toupper(x) while another thread, possibly in a completely different library, is calling setlocale (as many library functions do to guard against the semantics of a lot of standard library functions changing unexpectedly). And boom, instant undefined behavior, with basically nothing you could reasonably do about it. You'll probably get something out of it, but the pieces are probably going to display weirdly unless your users are from the US, where the C locale is pretty close to the US locale.
This means any of the functions in this list[1] is potentially a bomb:
> fprintf, isprint, iswdigit, localeconv, tolower, fscanf, ispunct, iswgraph, mblen, toupper, isalnum, isspace, iswlower, mbstowcs, towlower, isalpha, isupper, iswprint, mbtowc, towupper, isblank, iswalnum, iswpunct, setlocale, wcscoll, iscntrl, iswalpha, iswspace, strcoll, wcstod, isdigit, iswblank, iswupper, strerror, wcstombs, isgraph, iswcntrl, iswxdigit, strtod, wcsxfrm, islower, iswctype, isxdigit.
And there are some important ones in there too like strerror. Searching through GitHub as a random sample, it's not uncommon to see these functions be used[2], and really, would you expect `isdigit` to be thread-unsafe?
It's a little better with POSIX as they define a bunch of "_r" variants of functions like strerror and the like which at least give some thread safety (and uselocale at least is a thread-only variant of setlocale, which lets you safely do the whole "guard all library calls with the C locale" thing, e.g. `uselocale(newlocale(LC_ALL_MASK, "C", (locale_t)0))`). But Windows doesn't support uselocale so you have to use _configthreadlocale instead.
It also creates hard to trace bug reports. Saying you only support ASCII or whatever is, well it's not great today, but it's at least somewhat understandable, and is commonly seen to be the lowest common denominator for software. Sure, ideally we'd all use byte strings where we don't care or UTF-8 where we actually want to work with text (and maybe UTF-16 on Windows for certain things), but that's just a feature that doesn't exist, whereas memory corruption when you do something with a string but only for people in a certain part of the world in certain circumstances is not really a great user experience or developer experience for that matter.
The thing is, I actually like C in a lot of ways. It's a very useful programming language and has incredible importance even today and probably for the far future, but I don't really think the locale API was all that well designed.
[1]: Source: https://en.cppreference.com/w/c/locale/setlocale.html
[2]: https://github.com/search?q=strerror%28+language%3AC&type=co...
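The "guard library calls with a known locale" pattern from point 2 can be sketched in Python, whose `locale.setlocale` wraps the same process-wide C state. `c_numeric_locale` and `_locale_lock` are hypothetical names. The lock only protects callers that go through this helper, so any other thread calling `setlocale` directly can still race with it, which is exactly the weakness described above.

```python
import locale
import threading
from contextlib import contextmanager

# Only code that uses this helper is serialized; setlocale itself stays global.
_locale_lock = threading.Lock()


@contextmanager
def c_numeric_locale():
    """Temporarily force LC_NUMERIC to "C", then restore the previous value."""
    with _locale_lock:
        saved = locale.setlocale(locale.LC_NUMERIC)  # no second argument: query only
        locale.setlocale(locale.LC_NUMERIC, "C")
        try:
            yield
        finally:
            locale.setlocale(locale.LC_NUMERIC, saved)
```

Wrapping a call as `with c_numeric_locale(): ...` keeps number parsing and formatting inside the block on the "C" rules, at the cost of briefly changing the setting for every other thread in the process.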

arccy 19 hours ago

Use Australian English: English but with sane settings for everything else, including keyboard layout.

okanat 19 hours ago

I live in Germany now, so I generally set it to Irish nowadays. Since I like the ISO-style enter key, I use the UK keyboard layout (also easier to switch to Turkish than an ANSI layout). However, many OSes now have an English (Europe) locale too.

Sesse__ 19 hours ago

Many Linux distributions provide en_DK specifically for this purpose. English as it is used in Denmark. :-)

- Symbiote 11 hours ago
  
  This uses a comma decimal separator, which might or might not be desired.
  Irish English locale uses a dot.
  
- fph 19 hours ago
  
  Denmark doesn't have Euros as currency, unfortunately.
  
  jojomodding 16 hours ago
  
  Tying currency to locale seems insane. I have bank accounts in multiple currencies and use both several times per week. Why does all software on my system need to have a default currency? Most software does not care about money, those that do usually give you a quote in a currency fixed by someone else.
  
  Symbiote 11 hours ago
  
  en_IE does.
  
thaumasiotes 17 hours ago

> If a language ships toUpper or a toLower function without a mandatory language field, it is badly designed too. The only slightly-better option is making toUpper and toLower ASCII-only and throwing error for any other character set.

There is a deeper bug within Unicode.

The Turkish letter TURKISH CAPITAL LETTER DOTLESS I is represented as the code point U+0049, which is named LATIN CAPITAL LETTER I.

The Greek letter GREEK CAPITAL LETTER IOTA is represented as the code point U+0399, named... GREEK CAPITAL LETTER IOTA.

The relationship between the Greek letter I and the Roman letter I is identical in every way to the relationship between the Turkish letter dotless I and the Roman letter I. (Heck, the lowercase form is also dotless.) But lowercasing works on GREEK CAPITAL LETTER IOTA because it has a code point to call its own.

Should iota have its own code point? The answer to that question is "no": it is, by definition, drawn identically to the ASCII I. But Unicode has never followed its principles. This crops up again and again and again, everywhere you look. (And, in "defense" of Unicode, it has several principles that directly contradict each other.)

Then people come to rely on behavior that only applies to certain buggy parts of Unicode, and get messed up by parts that don't share those particular bugs.
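The asymmetry described above can be checked with Python's `unicodedata` module and the built-in, language-agnostic case mappings. This is only an illustrative sketch; the constant names are made up for the example.

```python
import unicodedata

LATIN_I = "\u0049"           # shared by English "I" and Turkish dotless capital "I"
GREEK_IOTA = "\u0399"        # has a code point of its own
DOTLESS_I = "\u0131"         # Turkish lowercase dotless i
DOTTED_CAPITAL_I = "\u0130"  # Turkish uppercase dotted I

for ch in (LATIN_I, GREEK_IOTA, DOTLESS_I, DOTTED_CAPITAL_I):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Greek round-trips cleanly because iota is not shared with Latin.
greek_round_trip = GREEK_IOTA.lower().upper()

# Turkish does not: dotless i uppercases to the shared U+0049,
# and the default lowercase of U+0049 is the dotted "i".
turkish_round_trip = DOTLESS_I.upper().lower()

print(greek_round_trip == GREEK_IOTA, turkish_round_trip == DOTLESS_I)
```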

layer8 16 hours ago

It’s not a bug, it’s a feature. The reason is that ISO 8859-7 [0] used for Greek has a separate character code for Iota (for all Greek letters, really), while ISO 8859-3 [1] and -9 [2] used for Turkish do not for the usual dotless uppercase I.
One important goal of Unicode is to be able to convert from existing character sets to Unicode (and back) without having to know the language of the text that is being converted. If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.
[0] https://en.wikipedia.org/wiki/ISO/IEC_8859-7
[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-3
[2] https://en.wikipedia.org/wiki/ISO/IEC_8859-9
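The round-trip argument can be made concrete with Python's built-in codecs for two of those legacy encodings. This is a sketch for illustration; the constant names are assumptions of the example, and only ISO 8859-9 is shown for Turkish.

```python
TURKISH = "iso8859_9"
GREEK = "iso8859_7"

# Byte 0x49 is the plain capital I in both encodings, so a converter
# never needs to know the language to choose the code point.
turkish_capital_i = b"\x49".decode(TURKISH)
greek_latin_i = b"\x49".decode(GREEK)

# Only the letters that are new in Turkish have bytes of their own.
dotted_capital = b"\xdd".decode(TURKISH)  # U+0130
dotless_small = b"\xfd".decode(TURKISH)   # U+0131

# Greek iota has always had a separate byte, hence a separate code point.
greek_iota = b"\xc9".decode(GREEK)        # U+0399

print(turkish_capital_i == greek_latin_i == "I")
print(turkish_capital_i.encode(TURKISH) == b"\x49")  # lossless round trip
```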

- thaumasiotes 14 hours ago
  
  I know that. That's why I mentioned
  > in "defense" of Unicode, it has several principles that directly contradict each other
  Unicode wants to do several things, and they aren't mutually compatible. It is premised on the idea that you can be all things to all people.
  > It’s not a bug, it’s a feature.
  It is a bug. It directly violates Unicode's stated principles. It's also a feature, but that won't make it not a bug.
  
- newpavlov 4 hours ago
  
  >If they had invented a separate code point for I in Turkish, then when converting text from those existing ISO character encodings, you’d have to know whether the text is Turkish or English or something else, to know which Unicode code point to map the source “I” into. That’s exactly what Unicode was designed to avoid.
  Great. So now we have to know the locale for handling case conversion for probably centuries to come, but it was totally worth it to save a bit of effort in the relatively short transition phase. /s
  
  JuniperMesos 15 minutes ago
  
  You always have to know locale to handle case conversion - this is not actually defined the same way in different human languages and it is a mistake to pretend it is.
  
  newpavlov 9 minutes ago
  
  In most cases the locale is encoded in the character itself, i.e. Latin "a" and Cyrillic "a" are two different characters, despite being visually indistinguishable in most cases.
  The "language-sensitive" section of the special casing document [0] is extremely small and contains only the cases of stupid reuse of Latin I.
  [0]: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing....
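A simplified sketch of what a language-aware wrapper for the Turkish and Azeri I rules could look like in Python. `to_lower` and `to_upper` are hypothetical helpers that take a bare language code; the sketch ignores the combining-dot context rules and the Lithuanian rows in that file, so it is not a full SpecialCasing implementation.

```python
_TURKIC = {"tr", "az"}


def to_lower(text: str, lang: str) -> str:
    """Lowercase with the Turkish/Azeri I rules applied when lang asks for them."""
    if lang in _TURKIC:
        # Dotted capital goes to plain i, then plain capital I goes to dotless.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()


def to_upper(text: str, lang: str) -> str:
    """Uppercase with the Turkish/Azeri I rules applied when lang asks for them."""
    if lang in _TURKIC:
        # Plain i goes to dotted capital, then dotless goes to plain capital I.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()
```

The same input then gives different, correct results per language: `to_lower("TITLE", "en")` yields `"title"` while `to_lower("TITLE", "tr")` yields `"tıtle"`.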
  
  fhars 4 hours ago
  
  Without it, there would not have been a transition phase.
  
  newpavlov 3 hours ago
  
  I call BS. Without a series of MAJOR blunders Unicode was destined to succeed. When the rest of the world has migrated to Unicode, I am more than certain that Turks would've migrated as well. Yes, they may have complained for several years and would've spent a minuscule amount of resources to adopt the conversion software, but that's it, a decade or two later everyone would've forgotten about it.
  I believe that even addition of emojis was completely unnecessary despite the pressure from Japanese telecoms. Today's landscape of messengers only confirms that.
  
themafia 16 hours ago

I thought locale is mostly controlled by the environment. So you can run your system and each program with its own separate locale settings if you like.

silon42 7 hours ago

I wish there was a single letter universal locale with sane values, maybe call it U or E, with:
ISO (or RFC....) date time, UTF-8 default (maybe also alternative with ISO8859-1), decimal point in numbers and _ for thousands, metric paper / A4, ..., Unicode neutral collation
but keeps US-English language
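For what it's worth, the number and date parts of such a locale can be approximated today without touching the system locale at all. A rough Python sketch, where `format_number` and `format_timestamp` are hypothetical helper names and two decimal places is an arbitrary choice:

```python
from datetime import datetime, timezone


def format_number(value: float) -> str:
    """Decimal point, underscore as thousands separator, two decimals."""
    return f"{value:_.2f}"


def format_timestamp(moment: datetime) -> str:
    """ISO 8601 date and time, to the second."""
    return moment.isoformat(timespec="seconds")


print(format_number(1234567.5))
print(format_timestamp(datetime(2024, 1, 31, 13, 5, tzinfo=timezone.utc)))
```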

fukka42 5 hours ago

Just use English. If you want to program you need to learn it anyway to make sense of anything.

I'm not a native English speaker btw. I learned it as I was learning programming as a kid 20 years ago

whynotmaybe 3 hours ago

Yes and no. This will work only if you don't create software used internationally.
If you only work in English, you will test in English and never run into use cases like the one described in the article.
Did you know that many towns and streets in Canada have a ' in their name? And that many websites reject any ' in their text fields because they think it's SQL injection?
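Rejecting the apostrophe is unnecessary when the query is parameterized. A minimal sketch using Python's built-in `sqlite3` module; the `addresses` table and the `store_and_find` helper are made up for the example.

```python
import sqlite3


def store_and_find(street: str) -> list:
    """Insert a street name and read it back using placeholders only."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE addresses (street TEXT)")
        # The value travels separately from the SQL text, so quotes need no escaping.
        conn.execute("INSERT INTO addresses (street) VALUES (?)", (street,))
        rows = conn.execute(
            "SELECT street FROM addresses WHERE street = ?", (street,)
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


print(store_and_find("St. John's"))
```

The apostrophe is stored and matched as ordinary data, so the text field has no reason to refuse it.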

- jen20 2 hours ago
  
  Ms O’Reilly would like a word about surname fields.
  
- fukka42 3 hours ago
  
  My EU country does the same. Of course software should work for the locales you're targeting but that is different from the language used by developer tooling. The GP is talking about changing the locale of their development machine so I assume that's what they're referring to.
  