Comment by brap 3 days ago
I don’t really understand agents. I just don’t get why we need to pretend we have multiple personalities, especially when they’re all using the same model.
Can anyone please give me a use case that couldn't be solved with a single API call to a modern LLM (capable of multi-step planning/reasoning) and a proper prompt?
Or is this really just about building the prompt, and giving the LLM closer guidance by splitting into multiple calls?
I’m specifically not asking about function calling.
If you ignore the word "agent" and autocomplete it in your mind to "step", things will make more sense.
Here is an example: I highlight physical books with a red pen as I read them. Sometimes my highlights are underlines, sometimes I bracket relevant text. I also write some comments in the margins.
I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask it to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So you have to do it in steps: first you ask for the margin comments, then for the underlined highlights, then for the bracketed highlights, then you merge the output. Empirically, this produces much better results. (This is a really simple example, but imagine you add summarization or something; then the steps feed into each other.)
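Concretely, the "agent" here is just a small script that makes a few narrow calls and stitches the results together. A minimal sketch in Python (ask_model is a hypothetical stand-in for whatever vision-capable LLM client you use, and the prompts are only illustrative):

    import base64

    def ask_model(prompt: str, image_b64: str) -> str:
        # Stand-in for one call to whatever vision-capable LLM API you use
        # (e.g. a chat-completions request with the page image attached).
        raise NotImplementedError

    def transcribe_page(path: str) -> str:
        with open(path, "rb") as f:
            page = base64.b64encode(f.read()).decode()

        # One focused prompt per kind of markup, instead of one do-everything prompt.
        comments = ask_model("Transcribe only the handwritten margin comments.", page)
        underlines = ask_model("Transcribe only the underlined passages.", page)
        brackets = ask_model("Transcribe only the passages marked with brackets.", page)

        # Merge step: plain code here, but it could just as well be another model call.
        return "\n\n".join([
            "MARGIN COMMENTS:\n" + comments,
            "UNDERLINED:\n" + underlines,
            "BRACKETED:\n" + brackets,
        ])

Each call gets one narrow job; the merge can be plain code or yet another call.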
As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.). Having a library with some nice tooling can help with those. It's not especially magical, and it's nothing you couldn't do yourself. But you could also write Datadog or Splunk yourself. It's just convenient not to.
The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.