twoodfin 9 hours ago

UTF-8 is indeed a genius design. But of course it’s crucially dependent on the decision for ASCII to use only 7 bits, which even in 1963 was kind of an odd choice.

Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.

mort96 9 hours ago

I don't know if this is the reason or if the causality goes the other way, but it's worth noting that we didn't always have 8 general-purpose bits. 7 bits + 1 parity bit (or flag bit, or something else) was really common, enough so that e-mail to this day still uses quoted-printable [1] to encode arbitrary octets using only 7-bit characters. A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and wasn't always a given.

In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...

[1] https://en.wikipedia.org/wiki/Quoted-printable

[2] https://en.wikipedia.org/wiki/8-bit_clean
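
To make that "spare 8th bit" concrete, here's a minimal Python sketch (my own, not from any standard): pure ASCII bytes always have the high bit clear, and UTF-8 only ever sets that bit on the bytes of multi-byte sequences.

    # Pure ASCII text stays below 0x80 under UTF-8 (the high bit is never set).
    ascii_bytes = "plain ASCII".encode("utf-8")
    assert all(b < 0x80 for b in ascii_bytes)

    # Non-ASCII code points are encoded entirely with high-bit-set bytes,
    # so 7-bit data passes through untouched and >= 0x80 marks multi-byte sequences.
    for ch in "é€":
        encoded = ch.encode("utf-8")
        print(ch, [hex(b) for b in encoded])
        assert all(b >= 0x80 for b in encoded)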

  • ajross 8 hours ago

    "Five characters in a 36 bit word" was a fairly common trick on pre-byte architectures too.

    • bear8642 7 hours ago

      5 characters?

      I thought it was normally six 6-bit characters?

      • mort96 7 hours ago

        The relevant Wikipedia page (https://en.wikipedia.org/wiki/36-bit_computing) indicates that 6x6 was the most common, but that 5x7 was sometimes used as well.

        ... However I'm not sure how much I trust it. It says that 5x7 was "the usual PDP-6/10 convention" and was called "five-seven ASCII", but I can't find the phrase "five-seven ASCII" anywhere on Google except for posts quoting that Wikipedia page. It cites two references, neither of which contain the phrase "five-seven ascii".

        Though one of the references (RFC 114, for FTP) corroborates that PDP-10 could use 5x7:

            [...] For example, if a
            PDP-10 receives data types A, A1, AE, or A7, it can store the
            ASCII characters five to a word (DEC-packed ASCII).  If the
            datatype is A8 or A9, it would store the characters four to a
            word.  Sixbit characters would be stored six to a word.
        
        To me, it seems like 5x7 was one of multiple conventions in which you could store character data on a PDP-10 (and probably other 36-bit machines), and Wikipedia hallucinated that the name for this convention is "five-seven ASCII". (For niche topics like this, I sometimes see authors just stating their own personal terminology as fact; be sure to check sources!)
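
        To make that packing concrete, here's a rough Python sketch of my own (assuming the layout of five 7-bit characters in the high 35 bits of the word, with the low bit left spare; the exact placement of the spare bit varied by convention):

            def pack_5x7(chars):
                """Pack five 7-bit ASCII characters into one 36-bit integer 'word'."""
                assert len(chars) == 5 and all(ord(c) < 128 for c in chars)
                word = 0
                for c in chars:
                    word = (word << 7) | ord(c)   # fill the high 35 bits, left to right
                return word << 1                  # low bit is the spare/flag bit

            def unpack_5x7(word):
                word >>= 1                        # drop the spare bit
                chars = []
                for _ in range(5):
                    chars.append(chr(word & 0x7F))
                    word >>= 7
                return "".join(reversed(chars))

            w = pack_5x7("HELLO")
            print(f"{w:012o}")        # 36-bit words were conventionally written in octal
            print(unpack_5x7(w))      # -> HELLO
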
      • ajross 38 minutes ago

        That was true at the system level on ITS; file and command names were all 6-bit. But six bits doesn't leave space for important code points (like "lower case") needed for text processing. More practical stuff on the PDP-6/10 and pre-360 IBM machines played other tricks.

jasonwatkinspdx 8 hours ago

Not an expert but I happened to read about some of the history of this a while back.

ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.

Morse code is variable length, which made automatic telegraph machines and teletypes awkward to implement. The solution was the 5-bit Baudot code. Using a fixed-length code simplified the devices. Operators could type Baudot code with one hand on a 5-key keyboard; part of the code's design was to minimize operator fatigue.

Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.

Anyhow, the next change came when, instead of telegraph machines directly signaling on the wire, a typewriter was used to create a punched tape of codepoints, which would be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where stuff like "Carriage Return" and "Line Feed" originate. This got standardized by Western Union and internationally.

By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.

So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.

  • pcthrowaway 7 hours ago

    > But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.

    Technically, the punch card processing technology was patented by inventor Herman Hollerith in 1884, and the company he founded wouldn't become IBM until 40 years later (though it was folded with 3 other companies into the Computing-Tabulating-Recording company in 1911, which would then become IBM in 1924).

    To be honest though, I'm not clear how ASCII came from anything used by the punch card sorting machines, since it wasn't proposed until 1961 (by an IBM engineer, but 32 years after Hollerith's death). Do you know where I can read more about the progression here?

colejohnson66 9 hours ago

The idea was that the free bit would be repurposed, likely for parity.

  • layer8 7 hours ago

    When ASCII was invented, 36-bit computers were popular, which would fit five ASCII characters with just one unused bit per 36-bit word. Before, 6-bit character codes were used, where a 36-bit word could fit six of them.

  • KPGv2 8 hours ago

    This is not true. ASCII (technically US-ASCII) was a fixed-width encoding of 7 bits. There was no 8th bit reserved. You can read the original standard yourself here: https://ia600401.us.archive.org/23/items/enf-ascii-1968-1970...

    Crucially, "the 7-bit coded character set" is described on page 6 using only seven total bits (1-indexed, so don't get confused when you see b7 in the chart!).

    There is an encoding mechanism to use 8 bits, but it's for storage on a type of magnetic tape, and even that is silent on the 8th bit being repurposed. It's likely, given the lack of discussion about it, that it was for ergonomic or technical reasons related to the medium (8 is a power of 2) rather than for future extensibility.

    • kbolino 8 hours ago

      Notably, it is mentioned that the 7-bit code is developed "in anticipation of" ISO requesting such a code, and we see in the addenda attached at the end of the document that ISO began to develop 8-bit codes extending the base 7-bit code shortly after it was published.

      So, it seems that ASCII was kept to 7 bits primarily so "extended ASCII" sets could exist, with additional characters for various purposes (such as other languages, but also for things like mathematical symbols).

    • zokier 7 hours ago

      Mackenzie claims that parity was an explicit concern in selecting a 7-bit code for ASCII. He cites the X3.2 subcommittee, although he doesn't reference exactly which document, but considering that he was a member of those committees (as far as I can tell) I would put some weight on his word.

      https://hcs64.com/files/Mackenzie%20-%20Coded%20Character%20... sections 13.6 and 13.7
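
      For what it's worth, the parity scheme is easy to sketch (my own Python example, using even parity):

          def add_even_parity(byte7):
              """Set the 8th bit so the total number of 1 bits is even."""
              assert byte7 < 0x80
              parity = bin(byte7).count("1") & 1
              return byte7 | (parity << 7)

          def check_and_strip(byte8):
              """Verify even parity, then return the 7-bit character."""
              if bin(byte8).count("1") & 1:
                  raise ValueError("parity error (a bit flipped in transit?)")
              return byte8 & 0x7F

          for c in "ASCII":
              sent = add_even_parity(ord(c))
              print(c, f"{sent:08b}", chr(check_and_strip(sent)))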

  • EGreg 8 hours ago

    I would love to think this is true, and it makes sense, but do you have any actual evidence for this you could share with HN?

toast0 8 hours ago

7 bits isn't that odd. Baudot was 5 bits and found insufficient, so 6-bit codes were developed; those were found insufficient too, so 7-bit ASCII was developed.

IBM had standardized 8-bit bytes with their System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7 bits was weird, but characters didn't necessarily fit nicely into system words anyway.

layer8 8 hours ago

The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well.

The accident of history is less that ASCII happens to be 7 bits, and more that the relevant phase of computer development happened to occur primarily in an English-speaking country, and that English text happens to be well representable with 7-bit units.

  • necovek 2 hours ago

    Most languages are well representable with 128 characters (7 bits) if you don't have to include the English letters among them (e.g. if you can replace those 52 characters and some of the control/punctuation/symbol slots).

    This is easily proven by the success of all the ISO-8859-*, Windows and IBM CP-* encodings, and all the *SCII (ISCII, YUSCII...) extensions — they fit one or more languages in the upper 128 characters.

    Among large languages it's mostly CJK that fails to fit within 128 characters as a whole (though there are smaller languages that don't fit either).
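
    For instance, here's a quick Python check (my own example, using ISO 8859-5 for Cyrillic):

        # ISO 8859-5 keeps ASCII in the lower half and maps Cyrillic into the upper 128.
        text = "Привет, world"
        encoded = text.encode("iso8859_5")
        print(encoded.hex(" "))
        assert all(b >= 0x80 for b in "Привет".encode("iso8859_5"))
        assert all(b < 0x80 for b in ", world".encode("iso8859_5"))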

KPGv2 8 hours ago

Historical luck. Though "luck" is probably pushing it in the way one might say certain math proofs are historically "lucky" based on previous work. It's more an almost natural consequence.

Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just like technically there are a number of ASCII variants, with the common one just referred to as ASCII these days).

BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64; capital letters + numbers give you 36, and a few common punctuation marks put you around 50. IIRC the original by IBM was around 45 or something: slash, period, comma, etc.

So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't print more than something like 128 characters anyway. There was no way to print ó or ö or anything like that, so why support them?

But eventually that yielded to 8-bit encodings (various extended ASCIIs like Latin-1 that had ñ, etc.).

Crucially, UTF-8 is only compatible with 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.
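
A concrete way to see that incompatibility (my own Python example, using Latin-1 / ISO 8859-1 as the 8-bit flavour):

    text = "señor"

    # Latin-1 encodes ñ as the single byte 0xF1, which is not valid UTF-8 on its own.
    latin1 = text.encode("latin-1")
    print(latin1.hex(" "))   # 0xF1 looks like a UTF-8 lead byte with no continuation bytes
    try:
        latin1.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)

    # Pure 7-bit ASCII, by contrast, is byte-for-byte valid UTF-8.
    ascii_bytes = "senor".encode("ascii")
    assert ascii_bytes.decode("utf-8") == "senor"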

michaelsshaw 8 hours ago

I'm not sure, but it does seem like a great bit of historical foresight. It stands as a lesson to anyone standardizing something: wanna use a 32-bit integer? Make it 31 bits. Just in case. Obviously this isn't always applicable (e.g. sizes, etc.), but the idea of leaving even the smallest amount of space for future extensibility is crucial.
