Exploring pre-1990 versions of wc(1) (2023)

shric 10 months ago

A fun read on word count optimization can be found in Abrash's Black Book:

https://www.jagregory.com/abrash-black-book/#lessons-learned...

You can gloss over the asm if you wish, the tricks that are explained around it are worth it imho.

Joker_vD 10 months ago

I wonder if large lookup tables/table-driven state machines are still as good as they used to be. After all, even with all the on-chip caches, the additional memory accesses today seem to be slower than doing some multi-instruction SIMD voodoo.
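
For concreteness, a table-driven counter might look like the sketch below. It is a minimal illustration, not code from any real wc: the names `is_delim` and `count_words_table` are made up, and the delimiter set (space, tab, newline) is taken from the man page wording quoted elsewhere in this thread. Whether this beats a SIMD approach on a given machine is something only a benchmark can settle.

```c
#include <stddef.h>

/* Hypothetical 256-entry classification table: 1 for a word delimiter,
   0 for everything else. Unlisted entries default to 0. */
static const unsigned char is_delim[256] = {
    [' ']  = 1,
    ['\t'] = 1,
    ['\n'] = 1,
};

/* Counts words with one table load per input byte plus a two-state
   machine (inside a word / between words). */
size_t count_words_table(const unsigned char *buf, size_t len)
{
    size_t words = 0;
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        int delim = is_delim[buf[i]];
        if (!delim && !in_word)
            words++;            /* transition: between words -> in a word */
        in_word = !delim;
    }
    return words;
}
```

The table here is only 256 bytes, so it fits in a handful of cache lines; the concern about memory traffic applies more to the multi-kilobyte tables that older optimization guides favored.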

- LegionMammal978 10 months ago
  
  At least the GNU version of wc [0] uses AVX2 for line counting, if available. Though it falls back to a simple character-by-character loop if you ask for a character count [not to be confused with a byte count!] or a word count.
  [0] https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/wc_...
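
To illustrate why a character count differs from a byte count, here is a rough sketch. It is not how GNU wc does it: it assumes the input is valid UTF-8 and simply skips continuation bytes, and the function names are hypothetical.

```c
#include <stddef.h>

/* Line count: a plain byte comparison, which is the part that
   vectorizes easily. */
size_t count_lines(const unsigned char *buf, size_t len)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            lines++;
    return lines;
}

/* Character count under the assumption of valid UTF-8: every byte that
   is NOT a continuation byte (bit pattern 10xxxxxx) starts a character. */
size_t count_utf8_chars(const unsigned char *buf, size_t len)
{
    size_t chars = 0;
    for (size_t i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}
```

For a 7-byte input such as "héllo" followed by a newline, the byte count is 7 while the character count is 6, because "é" takes two bytes in UTF-8.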
  
tripdout 10 months ago

Those `goto`s between two different for loops are crazy.

actionfromafar 10 months ago

Assembly / machine code thinking.

- amszmidt 10 months ago
  
  More like a relic of (actual) "spaghetti code", it was relatively common in really old Lisp code.
  
lifthrasiir 10 months ago

Not that crazy, given that it closely mirrors its state machine structure.
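
A hypothetical sketch of the pattern (not the historical source, just the same shape): each label is a state, each loop consumes input while in that state, and each `goto` is a transition. The function name and the NUL-terminated input are assumptions for the sake of a small example.

```c
#include <stddef.h>

/* Word counter written as two loops joined by gotos.
   State 1 (between_words): skip delimiters until a word starts.
   State 2 (in_word): skip word characters until a delimiter appears. */
size_t count_words_goto(const char *s)
{
    size_t words = 0;

between_words:
    for (;; s++) {
        if (*s == '\0')
            return words;
        if (*s != ' ' && *s != '\t' && *s != '\n') {
            words++;            /* first character of a new word */
            s++;
            goto in_word;
        }
    }

in_word:
    for (;; s++) {
        if (*s == '\0')
            return words;
        if (*s == ' ' || *s == '\t' || *s == '\n') {
            s++;
            goto between_words;
        }
    }
}
```

Read this way, the two loops are the two states and the jumps are the edges of the diagram, which is why the structure looks odd as structured code but natural as a state machine.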

Joker_vD 10 months ago

> A word is a maximal string of characters delimited by spaces, tabs or newlines.

And then the actual code explicitly filters out and ignores every character larger than 0x7F. Just why.

jolmg 10 months ago

Probably because they're not characters. They're just bytes undefined by ASCII.

Tor3 10 months ago

ASCII is 7 bits (the eighth bit would be parity), so that makes perfect sense in an ASCII world.

- Joker_vD 10 months ago
  
  So the character e.g. "B" would have this parity bit set and therefore should be filtered out and not count as a letter, in the ASCII world?
  
  aap_ 10 months ago
  
  There are only 7 bits in ASCII. An 8th can be used for parity when transmitting data but a regular program will never see it. Anything above 0x7F is simply not a character.
  
  Tor3 10 months ago
  
  Parity bits are not part of the character. They are for detecting transmission errors. You filter off the parity bit before looking at the byte.
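
As a small sketch of what that filtering means in practice (hypothetical helper names, and assuming even parity carried in bit 7):

```c
/* Drops the eighth bit, leaving the 7-bit ASCII code. */
unsigned char strip_parity(unsigned char byte)
{
    return byte & 0x7F;
}

/* Returns 1 if the byte, parity bit included, has an even number of
   1 bits, i.e. it passes an even-parity check. */
int has_even_parity(unsigned char byte)
{
    int ones = 0;
    while (byte) {
        ones += byte & 1;
        byte >>= 1;
    }
    return (ones % 2) == 0;
}
```

For example, 'C' is 0x43 and has three 1 bits, so under even parity it would travel as 0xC3; the receiver checks the parity, masks with 0x7F, and the program sees 0x43 again.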
  
  epcoa 10 months ago
  
  What in the hell are you going on about? B is 0x42, which is < 0x7F.
  
ivan_gammel 10 months ago

Because they thought that a word is something said in a human language that they can understand.

- Joker_vD 10 months ago
  
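  Mi ne pensas ke lingvoj kiuj usas ekskluzive la basan latinan alfabeton estas komprepeneblaj per si mem. [Esperanto: "I don't think that languages which use exclusively the basic Latin alphabet are comprehensible by themselves."]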
  
  luismedel 10 months ago
  
  Cool how my native language is Spanish and I can almost-understand 80% of Esperanto.
  
  actionfromafar 10 months ago
  
  Ze riform iz komplit.
  
  Joker_vD 10 months ago
  
  The [z] and [ð] are phonemically different in English, just as [i] and [i:] are, so it'd actually be "Ðe riform is komplijt". American rhoticity prevents us from spelling it "rifoom" as would be proper, unfortunately.
  
dexen 10 months ago

The brevity carried over to Plan 9. Re-posting my older comment (https://news.ycombinator.com/item?id=4023385):

Plan 9 (http://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs) follows the Unix philosophy. A lot of legacy has been shed. I can count 13 options to ls, 11 options to sed and just 5 to sed.

The standard Plan 9 shell, Rc, is described in a mere ~500 lines of manpage, while Bash takes a whopping ~5400 lines.

Oh, and there is no `dll hell' in P9 :-)
