Comment by coffeemug
If you ignore the word "agent" and autocomplete it in your mind to "step", things will make more sense.
Here is an example: I highlight physical books as I read them with a red pen. Sometimes my highlights are underlines, sometimes I bracket relevant text. I also write some comments in the margins.
I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask it to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So, you have to do it in steps. First you ask for the comments. Then for underlined highlights. Then for bracketed highlights. Then you merge the output. Empirically, this produces much better results. (This is a really simple example, but imagine you add summarization or something; then the steps feed into each other.)
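To make the "steps" idea concrete, here is a minimal sketch of that pipeline. Everything in it is an assumption for illustration: `ask_model` stands for whatever vision-capable model call you use, the prompts are made up, and `fake_model` is only a stand-in so the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical model interface: (image bytes, prompt) -> transcribed text.
AskModel = Callable[[bytes, str], str]

# One narrow prompt per step; the wording here is illustrative only.
STEP_PROMPTS: Dict[str, str] = {
    "comments": "Transcribe only the handwritten margin comments.",
    "underlined": "Transcribe only the text underlined in red.",
    "bracketed": "Transcribe only the text enclosed in red brackets.",
}


def run_steps(image: bytes, ask_model: AskModel) -> Dict[str, str]:
    """Run each extraction step as its own model call."""
    return {name: ask_model(image, prompt).strip()
            for name, prompt in STEP_PROMPTS.items()}


def merge(results: Dict[str, str]) -> str:
    """Combine the non-empty per-step outputs into one plain-text document."""
    return "\n\n".join(f"[{name}]\n{text}"
                       for name, text in results.items() if text)


def fake_model(image: bytes, prompt: str) -> str:
    """Stand-in for a real vision model so the sketch runs offline."""
    return f"(output for: {prompt})"


if __name__ == "__main__":
    print(merge(run_steps(b"page-bytes", fake_model)))
```

The steps are fixed in code here, and that is the whole trick: each call has one small job, and the merge is ordinary string handling rather than another prompt.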
As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.) Having a library with some nice tooling can help with those. It's not especially magical and nothing you couldn't do yourself. But you also could write Datadog or Splunk yourself. It's just convenient not to.
The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.
To add some color to this:
Anthropic does a good job of breaking down some common architectures built around these components [1] (there is also a good outline of this if you prefer video [2]).
"Agent" is definitely an overloaded term - the best framing of this I've seen aligns closely with the Anthropic definition. Specifically, an "agent" is a GenAI system that dynamically identifies the tasks ("steps" from the parent comment) without having to be told what those steps are. There are obvious parallels to the reasoning capabilities released in the latest round of foundation models.
So for example, the "Agent" would first build a plan for how to address the query, dynamically farm out the steps in that plan to other LLM calls, and then evaluate execution for correctness/success.
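A rough sketch of that plan / farm out / evaluate loop, to show how it differs from hard-coded steps. This is my own illustration, not Anthropic's reference design: the `llm` callable, the `PLAN:` / `EXECUTE:` / `EVALUATE:` prompt prefixes, the PASS/FAIL convention, and `fake_llm` are all hypothetical.

```python
from typing import Callable, List

# Hypothetical model interface: prompt in, text out.
LLM = Callable[[str], str]


def plan(llm: LLM, query: str) -> List[str]:
    """Ask the model to propose the steps itself, one per line."""
    reply = llm(f"PLAN: list the steps needed to answer: {query}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def run_agent(llm: LLM, query: str, max_attempts: int = 2) -> List[str]:
    """Plan, farm each step out to its own call, and check each result."""
    results: List[str] = []
    for step in plan(llm, query):
        output = ""
        for _ in range(max_attempts):
            output = llm(f"EXECUTE: {step}")
            verdict = llm(
                f"EVALUATE: does this complete '{step}'? "
                f"Answer PASS or FAIL.\n{output}"
            )
            if verdict.strip().upper().startswith("PASS"):
                break
        results.append(output)  # keeps the last attempt even if it failed
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in model so the sketch runs offline."""
    if prompt.startswith("PLAN:"):
        return "find comments\nfind highlights\n"
    if prompt.startswith("EXECUTE:"):
        return "done: " + prompt[len("EXECUTE: "):]
    return "PASS"


if __name__ == "__main__":
    print(run_agent(fake_llm, "extract my notes from this page"))
```

The only structural difference from the fixed pipeline is that the list of steps comes back from a model call instead of being written by the programmer, and a second call judges each result before moving on.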
[1] https://www.anthropic.com/research/building-effective-agents
[2] https://www.youtube.com/watch?v=pGdZ2SnrKFU