Comment by brap 3 days ago

I don’t really understand agents. I just don’t get why we need to pretend we have multiple personalities, especially when they’re all using the same model.

Can anyone please give me a use case that couldn’t be solved with a single API call to a modern LLM (capable of multi-step planning/reasoning) and a proper prompt?

Or is this really just about building the prompt, and giving the LLM closer guidance by splitting into multiple calls?

I’m specifically not asking about function calling.

coffeemug 2 days ago

If you ignore the word "agent" and autocomplete it in your mind to "step", things will make more sense.

Here is an example: I highlight physical books with a red pen as I read them. Sometimes my highlights are underlines, sometimes I bracket the relevant text. I also write some comments in the margins.

I want to photograph relevant pages and get the highlights and my comments into plain text. If I send an image of a highlighted/commented page to ChatGPT and ask to get everything into plain text, it doesn't work. It's just not smart enough to do it in one prompt. So, you have to do it in steps. First you ask for the comments. Then for underlined highlights. Then for bracketed highlights. Then you merge the output. Empirically, this produces much better results. (This is a really simple example; but imagine you add summarization or something, then the steps feed into each other)
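
Roughly, the "steps" version is just glue code. A minimal sketch in Python (the llm() helper is hypothetical, standing in for whatever vision-capable chat-completion call you actually use):

    def llm(prompt: str, image: bytes | None = None) -> str:
        """Hypothetical wrapper around a vision-capable chat-completion API."""
        raise NotImplementedError

    def transcribe_page(page_image: bytes) -> str:
        # Each call gets one narrow job instead of "do everything at once".
        comments   = llm("Transcribe only the handwritten margin comments.", image=page_image)
        underlines = llm("Transcribe only the underlined passages.", image=page_image)
        brackets   = llm("Transcribe only the bracketed passages.", image=page_image)

        # A final call merges the partial results into one plain-text document.
        return llm(
            "Merge these into a single transcript, keeping each comment next to "
            "the passage it refers to:\n\n"
            f"COMMENTS:\n{comments}\n\nUNDERLINES:\n{underlines}\n\nBRACKETS:\n{brackets}"
        )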

As these things get complicated, you start bumping into repeated problems (like understanding what's happening between each step, tweaking prompts, etc.). Having a library with some nice tooling can help with those. It's not especially magical and nothing you couldn't do yourself. But you could also write Datadog or Splunk yourself. It's just convenient not to.

The internet decided to call these types of programs agents, which confuses engineers like you (and me) who tend to think concretely. But if you get past that word, and maybe write an example app or something, I promise these things will make sense.

  • fryz 2 days ago

    To add some color to this:

    Anthropic does a good job of breaking down some common architectures around using these components [1] (good outline of this if you prefer video [2]).

    "Agent" is definitely an overloaded term - the best framing I've seen aligns most closely with the Anthropic definition. Specifically, an "agent" is a GenAI system that dynamically identifies the tasks (the "steps" from the parent comment) without having to be instructed that those are the steps. There are obvious parallels to the reasoning capabilities we've seen released in the latest cut of the foundation models.

    So for example, the "Agent" would first build a plan for how to address the query, dynamically farm out the steps in that plan to other LLM calls, and then evaluate execution for correctness/success.
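
    In rough Python, that pattern looks something like this (llm() is a hypothetical wrapper, not a real API):

        import json

        def llm(prompt: str) -> str:
            """Hypothetical wrapper around a chat-completion call."""
            raise NotImplementedError

        def run_agent(query: str) -> str:
            # 1. The model itself decides what the steps are.
            plan = json.loads(llm(f"Break this task into a JSON list of steps: {query}"))

            # 2. Each step is farmed out as its own focused LLM call.
            results = [llm(f"Do this step and report the result: {step}") for step in plan]

            # 3. A final call evaluates the work and assembles the answer.
            return llm(
                "Given the original task and these step results, check them for "
                f"correctness and write the final answer.\nTask: {query}\nResults: {results}"
            )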

    [1] https://www.anthropic.com/research/building-effective-agents [2] https://www.youtube.com/watch?v=pGdZ2SnrKFU

    • eric-burel 2 days ago

      This sums up as ranging from multiple LLM calls to build a smart feature, to letting the LLM decide what to do next. I think you can go very far with the former, but the latter is more autonomous in unconstrained environments (like chatting with a human, etc.).

bravura 3 days ago

https://aider.chat/2024/09/26/architect.html

"Aider now has experimental support for using two models to complete each coding task:

An Architect model is asked to describe how to solve the coding problem.

An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.

Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars)."

In particular, recent discord chat suggests that o3m is the most effective architect and Claude Sonnet is the most effective code editor.
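
Stripped of the aider specifics, the two-role split is just two chained calls against different models. A hedged sketch (the model names and the llm() helper are placeholders):

    def llm(model: str, prompt: str) -> str:
        """Hypothetical wrapper around a chat-completion call for a given model."""
        raise NotImplementedError

    def solve_task(task: str, source: str) -> str:
        # Architect: a strong reasoning model describes the change in prose.
        plan = llm("architect-model",
                   f"Describe how to solve this coding problem:\n{task}\n\nCode:\n{source}")

        # Editor: a cheaper, edit-reliable model turns that description into concrete edits.
        return llm("editor-model",
                   f"Apply this plan as precise edit instructions to the code:\n{plan}\n\nCode:\n{source}")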

weego 3 days ago

I don't get it either. Watching implementations on YouTube etc., it primarily feels like a load of verbiage trying to carve out a sub-industry, but the meat on the bone just seems to be defining discrete units of AI actions that can be chained into workflows that interact with non-AI services.

  • jacobr1 2 days ago

    > defining discrete units of AI actions that can be chained into workflows that interact with non-AI services.

    You got it. But that is the interesting part! To make AI useful beyond basic content generation in a chat context, you need interaction with the outside world. And you may need iterative workflows that can spawn more work based on the output of those interactions. The focus on Agents as personas is a tangent to the core use case. We could just call this stuff "AI Workflow Orchestration" or something ... and it would remain pretty useful!

    • karn97 2 days ago

      I won't trust an agent with anything by itself in its current state, though.

ToJans 2 days ago

AI seems to forget more things as the context window grows. Agents keep scope local and focused, so you can get better/faster results, or use models trained on specific tasks.

Just like in real life, there are generalists and experts. Depending on your task you might prefer an expert over a generalist; think e.g. brain surgery versus "summarize this text".

2pointsomone 3 days ago

I don’t work in prompt engineering but my partner does, and she tells me there’s a real need for agents in cases where you want some technology that goes and seeks things on the live web, then comes back so you can make sense of the found data with the LLM and pre-written prompts that use that data as variables, and then possibly goes back to the web if the task remains unsolved.

  • dimgl 3 days ago

    Can't that be solved with regular workflow tools and prompts? Is that what an agent is, essentially?

    Or is an agent a collection of prompts with a limited set of available tools?

    • 2pointsomone 2 days ago

      I think the agent part is deciding how to navigate the web on its own and, when it is convinced it found what it wanted (without you having specified that deterministically), coming back to work with your prompts. You can't really hard-code that logic into a workflow.
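
      A rough sketch of that loop (llm() and web_search() are hypothetical stand-ins):

          def llm(prompt: str) -> str:
              """Hypothetical chat-completion wrapper."""
              raise NotImplementedError

          def web_search(query: str) -> str:
              """Hypothetical search tool returning page snippets."""
              raise NotImplementedError

          def research(question: str, max_rounds: int = 5) -> str:
              notes = ""
              for _ in range(max_rounds):
                  # The model, not your code, decides whether it has found enough.
                  verdict = llm(f"Question: {question}\nNotes so far: {notes}\n"
                                "Reply DONE if the notes answer the question, "
                                "otherwise reply with the next search query only.")
                  if verdict.strip() == "DONE":
                      break
                  notes += "\n" + web_search(verdict)
              # Back to your pre-written prompts, with the found data as a variable.
              return llm(f"Answer using these notes.\nQuestion: {question}\nNotes: {notes}")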

blainm 3 days ago

One of the key limitations of even state-of-the-art LLMs is that their coherence and usefulness tend to degrade as the context window grows. When tackling complex workflows, such as customer support automation or code review pipelines, breaking the process into smaller, well-defined tasks allows the model to operate with more relevant and focused context at each step, improving reliability.

Additionally, in self-hosted environments, using an agent-based approach can be more cost-effective. Simpler or less computationally intensive tasks can be offloaded to smaller models, which not only reduces costs but also improves response times.
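
For example, a trivial router (the model names are placeholders and llm() is a hypothetical wrapper):

    def llm(model: str, prompt: str) -> str:
        """Hypothetical wrapper around a chat-completion call for a given model."""
        raise NotImplementedError

    def answer(task: str) -> str:
        # A cheap call classifies the task, then routes it to an appropriately sized model.
        verdict = llm("small-local-model", f"Reply SIMPLE or COMPLEX only. Task: {task}")
        model = "small-local-model" if "SIMPLE" in verdict else "large-hosted-model"
        return llm(model, task)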

That being said, this approach is most effective when dealing with structured workflows that can be logically decomposed. In more open-ended tasks, such as "build me an app," the results can be inconsistent unless the task is well-scoped or has extensive precedent (e.g., generating a simple Pong clone). In such cases, additional oversight and iterative refinement are often necessary.

jacobr1 2 days ago

One way to think about it is job orchestration. You end up with some kind of DAG of work to execute. If all the work you are doing is based on context from the initiation of the workflow, then theoretically you could do everything in a single prompt. But it gets more interesting when there is some kind of real-world interaction, potentially multiple: a web search, executing code, calling an API. Then you take action based on those results, which in turn might trigger another decision to take some other action, iteratively, and potentially branching.
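
Concretely, the orchestration side can be as small as a table of steps plus their dependencies; each step can call an LLM, run code, or hit an API, and its output becomes context for the steps that depend on it. A minimal sketch (the step functions themselves are placeholders, and the dependency graph is assumed to have no cycles):

    from typing import Callable

    def run_dag(steps: dict[str, Callable[[dict], object]],
                deps: dict[str, list[str]]) -> dict:
        # steps: name -> function taking the results gathered so far;
        # deps:  name -> names that must finish before it can run.
        done: set[str] = set()
        results: dict[str, object] = {}
        while len(done) < len(steps):
            for name, fn in steps.items():
                if name not in done and all(d in done for d in deps.get(name, [])):
                    results[name] = fn(results)
                    done.add(name)
        return results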

nsonha 2 days ago

Without checking out this particular framework: the word is sometimes overloaded with that meaning (LLM personality), but in software engineering in general, "agent" means something with its own inner loop and branching logic (agent as in autonomy). It's a necessary abstraction when you compose multiple workflows together under the same LLM interface: things like which flow to run next, edge-case handling for each of them, etc.

andrewmutz 3 days ago

Modularity. We could put all code in a single function (it is possible), but we prefer to organize it differently to make it easier to develop and reason about. Agents are similar.