Comment by jerf
I think the "push for loops down" is missing a bit of detail about the why. The author alludes to "superior performance" but I don't think makes it clear how that can happen.
Vectorization is a bit obscure and a lot of coders aren't worried about whether their code vectorizes, but there's a much more common example that I have seen shred the performance of a lot of real-world code bases and HTTP APIs, which is functions (including APIs) that take only a single thing when they should take the full list.
Suppose we have posts in a database, like for a forum or something. Consider the difference between:
posts = {}
for id in postIDs:
post[id] = fetchPost(id)
versus posts = fetchPosts(postIDs)
fetchPost and fetchPosts both involve hitting the database. The singular version means that the resulting SQL will, by necessity, only have the one ID in it, and as a result, a full query will be made per post. This is a problem because it's pretty likely here that fetching a post is a very fast (indexed) operation, so the per-query overhead is going to hit you hard.The plural "fetchPosts", on the other hand, has all the information necessary to query the DB in one shot for all the posts, which is going to be much faster. An architecture based on fetching one post at a time is intrinsically less performant in this case.
This opens up even more in the HTTP API world, where a single query is generally of even higher overhead than a DB query. I think the most frequent mistake I see in HTTP API design (at least, ignoring quibbling about which method and error code scheme to use) is providing APIs that operate on one thing at a time when the problem domain naturally lends itself to operating on arrays (or map/objects/dicts) at a time. It's probably a non-trivial part of the reason why so many web sites and apps are so much slower than they need to be.
I find it is often easy to surprise other devs with how fast your system works. This is one of my "secrets" (please steal it!); you make sure you avoid as many "per-thing" penalties as possible by keeping sets of things together as long as possible. The "per-thing" penalties can really sneak up on you. Like nested for loops, they can easily start stacking up on you if you're not careful, as the inability to fetch all the posts at once further cascades in to you then, say, fetching user avatars one-by-one in some other loop, and then a series of other individual queries. Best part is, profiling may make it look like the problem is the DB because "the DB is taking a long time to serve this" because profiles are not always that good at turning up that your problem is per-item overhead rather than the amount of real work being done.
The worst / most amusing example of this I've seen in the wild was a third party line of business application that was sequentially triaging "pending tasks" to assign priority/to workers.
Our cloud provider had an aircon/overheating incident in the region we were using, and after it was resolved network latency between the database and application increased by a few milliseconds. Turns out if you multiply that by a few million/fast arrival rate you get a significant amount of time, and the pending tasks queue backs up causing the high priority tasks to be delayed.
Based on the traces we had it looked like a classic case of "ORM made it easy to do it this way, and it works fine until it doesn't" but was unfortunately out of our control being a third party product.
If they'd fetched/processed batches of tasks from the database instead I'm confident it wouldn't have been an issue.