Comment by PaulHoule
Depends what you call '16-bit'. The addressable word size [1] for the PDP-11 was 8 bits and the address space seen by a user program was 16 bits, so a user program on the PDP-11 could access 64kB of RAM, same as an Apple ][, except that multiple users could have their own address spaces.
The IBM 360 had 24-bit addresses, 8-bit words, and 16 32-bit registers.
8-bit words were thoroughly established by the 1980s for general purpose computers, I think because of the use of 7/8-bit ASCII characters. I mean, you could pack ASCII characters into larger words in different ways, but the most comfortable (portable) way to handle them is to have a char*, which requires either 8-bit words or some way to address subwords.
The PDP-10 was probably the most loved heterodox architecture, with an 18-bit address space and 36-bit words. It had pointers that could point to specific bits inside a word, so it was possible to port C to it with char*'s. The user space was 256k words, or 1152k 8-bit bytes. (If an architecture like the PDP-10 let you access bits in the next word, you could even point something like a char* at a variable sized UTF-8 char, if you don't mind pointer arithmetic being limited to scans.)
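To make the pointer-into-a-word idea concrete, here is a rough sketch in Python of a PDP-10-style byte pointer as a (word, bit offset, byte size) triple. The names `BytePointer`, `load_byte`, `advance`, `pack` and `read_string` are made up for illustration, and the model is simplified: it counts the offset from the left of the word and does not reproduce the real instruction encoding.

```python
from dataclasses import dataclass

WORD_BITS = 36  # PDP-10 word width


@dataclass(frozen=True)
class BytePointer:
    word: int    # index of the word in memory
    offset: int  # bits to the left of the byte within the word
    size: int    # byte width in bits


def load_byte(memory, bp):
    """Extract the byte that bp designates from a list of 36-bit words."""
    shift = WORD_BITS - bp.offset - bp.size
    return (memory[bp.word] >> shift) & ((1 << bp.size) - 1)


def advance(bp):
    """Step to the next byte; move to the next word when no whole byte fits."""
    nxt = bp.offset + bp.size
    if nxt + bp.size > WORD_BITS:
        return BytePointer(bp.word + 1, 0, bp.size)
    return BytePointer(bp.word, nxt, bp.size)


def pack(text, size):
    """Pack characters left-justified into 36-bit words, `size` bits each."""
    per_word = WORD_BITS // size
    words = []
    for i in range(0, len(text), per_word):
        w = 0
        for j, ch in enumerate(text[i:i + per_word]):
            w |= ord(ch) << (WORD_BITS - (j + 1) * size)
        words.append(w)
    return words


def read_string(memory, size, count):
    """Scan `count` bytes from the start of memory, like a char* walk."""
    bp = BytePointer(0, 0, size)
    out = []
    for _ in range(count):
        out.append(chr(load_byte(memory, bp)))
        bp = advance(bp)
    return "".join(out)
```

With a byte size of 7, five characters fit in a word with one bit left over; with a byte size of 9, four fit exactly. In this model the pointer only ever moves forward by scanning, which is the limitation on pointer arithmetic mentioned above.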
Some of the 8-bit micros had 16-bit registers, such as the 8086/8088 and the 6809. The word size doesn't have to be related to the size of the data bus: the 8086 had a 16-bit data bus and the 8088 had an 8-bit data bus, so the 8088 just pumped twice if it needed 16 bits. The 68k series had 32-bit registers and a 32-bit address space (like the DEC VAX, which was the first modern computer) but had various bus sizes, as low as 8 bits in the 68008.
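As a toy illustration of "pumping twice", the sketch below reads the same 16-bit little-endian value over a 16-bit and an 8-bit bus and counts bus cycles. `read16` is a hypothetical helper, and the model is deliberately crude: it assumes an even address and ignores prefetch queues and real timing.

```python
def read16(memory, addr, bus_bits):
    """Read a little-endian 16-bit value; return (value, bus_cycles).

    Simplifying assumption: addr is even, so a 16-bit bus fetches both
    bytes in one cycle. An 8-bit bus always needs two cycles.
    """
    if bus_bits == 16:
        value = memory[addr] | (memory[addr + 1] << 8)
        return value, 1
    low = memory[addr]       # first bus cycle: low byte
    high = memory[addr + 1]  # second bus cycle: high byte
    return low | (high << 8), 2
```

The program sees the same 16-bit result either way; only the number of bus cycles differs.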
With a cache the data bus could be larger than the word size.
Programming really isn't fun if you don't have index registers at least as large as the address space. There were numerous attempts to extend 8-bit architectures to a 24-bit address space that didn't provide large enough index registers, the 65816 is probably the most famous. The eZ80 on the other hand, extends the registers to 24-bits so it's easy to write programs that use the whole address space.
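A rough way to see why narrow index registers hurt: with base-plus-index addressing, an index narrower than the address can only reach a window of the address space from any one base. The sketch below is a generic model, not the exact addressing rules of the 65816 or the eZ80; `effective_address` and `windows_needed` are hypothetical names.

```python
def effective_address(base, index, index_bits, addr_bits=24):
    """Base + index, with the index truncated to the index register width."""
    index &= (1 << index_bits) - 1
    return (base + index) & ((1 << addr_bits) - 1)


def windows_needed(length, index_bits):
    """How many times the base must be reset to walk `length` bytes."""
    window = 1 << index_bits
    return -(-length // window)  # ceiling division
```

In this model, walking a 1 MiB buffer with a 16-bit index means resetting the base 16 times, while a 24-bit index covers it in one pass.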
[1] which I'm just going to call word size
You are using "word size" to mean "memory addressing unit size", and while you are clear about this, its clash with common usage makes your comment somewhat confusing to read. But, doing the mental translations, I think everything you said is correct, even though much of it would be false if interpreted in accordance with the usual definitions.
Usually "word size" means "register size" and a "16-bit architecture" is one with a word size, in that sense, of 16 bits; that is, one whose architectural registers are 16 bits wide. That describes all the CPUs in my list, I think. The definition necessarily gets a bit ambiguous on machines with multiple register widths like the CDC 6600, the 8080, the 8086, and the 80386. But usually on this basis we say the 6600 was 60-bit (despite its smaller address registers), the 8080 was 8-bit (despite its 16-bit register-pair instructions) and so was the 6809, the 8086 was 16-bit (despite AH, AL, etc.) and so was the 65816, and the 386 and 360 and 68k and VAX were 32-bit.
I suspect that standardizing on 8-bit byte addressability was largely due to the influence of the 360, which didn't use ASCII. ASCII (a 7-bit code) was probably a significant influence, but it fit as nicely into 9-bit PDP-10 bytes as into 8-bit bytes, with space for a 512-character character set.
One minor quibble on the PDP-11: though addresses were 16 bits, as you probably know, later PDP-11 models supported split instruction and data spaces, with separate code and data segments. This doubled the memory available for a normal user program over what an Apple ][ could manage without bank switching. Later versions of PDP-11 Unix required this capability for some larger programs, though I don't remember which.
I think the status of the VAX as "the first modern computer" is pretty debatable. Other defensible candidates might be the IBM 801, the IBM PC, the SUN workstation, the Alto, Berkeley RISC I, Stretch, the CDC 6600, the IBM 360, the IBM 360 Model 91, the IBM 360 Model 67, and the Acorn Archimedes. But the VAX definitely has a plausible claim to that title.