Comment by 3pt14159

Comment by 3pt14159 9 hours ago

I remember a time before UTF-8's ubiquity. It was such a headache moving to i18n. I love UTF-8.

linguae 9 hours ago

I remember learning Japanese in the early 2000s and the fun of dealing with multiple encodings for the same language: JIS, Shift-JIS, and EUC. As late as 2011 I had to deal with processing a dataset encoded under EUC in Python 2 for a graduate-level machine learning course where I worked on a project for segmenting Japanese sentences (typically there are no spaces in Japanese sentences).

UTF-8 made processing Japanese text much easier! No more needing to manually change encoding options in my browser! No more mojibake!

layer8 7 hours ago

On the other hand, you now have to deal with the issues of Han unification: https://en.wikipedia.org/wiki/Han_unification#Examples_of_la...

acdha 3 hours ago

I worked on a site in the late 90s which had news in several Asian languages, including both simplified and traditional Chinese. We had a partner in Hong Kong sending articles and, being a stereotypical monolingual American, I took them at their word that they were sending us simplified Chinese. I had it loaded into our PHP app, which dutifully served it with that encoding. It was clearly Chinese, so I figured we had that feed working.

A couple of days later, I got an email from someone explaining that it was gibberish. Apparently our content partner, who claimed to be sending GB2312 simplified Chinese, was in fact sending us Big5 traditional Chinese, so while many of the byte values mapped to valid characters, the result was nonsensical.
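
For anyone who has not seen this failure mode, here is a minimal Python 3 sketch of it. The sample headline is made up for illustration, and `errors="replace"` is an assumption standing in for a lenient browser; the actual PHP app and feed are not reproduced here.

```python
# Hypothetical sample headline; any Traditional Chinese text would do.
original = "繁體中文新聞"

# What the partner actually sent: Big5-encoded bytes (two bytes per character).
payload = original.encode("big5")

# What the site assumed: GB2312. With errors="replace", byte pairs that have
# no GB2312 mapping become U+FFFD instead of raising UnicodeDecodeError.
misread = payload.decode("gb2312", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = payload.decode("big5")

print("served as GB2312:", misread)
print("decoded as Big5: ", recovered)
```

The misread string is made of real characters (or replacement marks), which is why it can pass for Chinese to someone who cannot read it.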

glxxyz 9 hours ago

I worked on an email client. Many, many character set headaches.
