Comment by nowittyusername 9 hours ago

If you are building an agent, start from scratch and build your own framework. This will save you a lot of headache and wasted time down the line. One of the issues with using someone else's framework is that you miss out on learning and understanding important fundamentals about LLMs: how they work, context, etc. Many developers also never learn the fundamentals of running LLMs locally and thus miss crucial context (giggidy) that would have helped them better understand the whole system. It seems to me that the author here came to a similar conclusion, like many of us have. I do want to add my own insights, though, which might be of use to some.

One of the things he talked about was trouble getting reliable tool calling out of the model. I recommend he try the following approach. Have the agent perform a self-calibration exercise at the beginning of the context that exercises its tools. Make it perform some complex tasks, run that many times to test for coherence and accuracy, and adjust the system prompt toward more accurate tool calls as you go. Once the agent has completed that calibration process successfully, you "freeze" the calibration context/history by broadening --keep n to cover not just the system prompt in the rolling window but everything up to the end of the calibration session. Then, no matter how far the context window drifts, the conditioned tokens from that calibration session keep steering the agent toward proper tool use; from then on your "system prompt" effectively includes those turns. Note that this is probably not possible with cloud-based models, since you don't have direct access to the inference engine. A hacky way around that is to emulate the conversation turns inside the system prompt.
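
Something like this, roughly, for the cloud-model hack (the transcript below and build_messages are just illustrative placeholders, not any particular library's API; locally you would get the same effect with llama.cpp's --keep flag instead):

```python
BASE_SYSTEM_PROMPT = "You are an agent with tools: list_dir, read_file, ..."

# A frozen, pre-verified transcript recorded during your own calibration runs.
CALIBRATION_TRANSCRIPT = """\
user: List the files in /tmp and report how many there are.
assistant: <call_tool>{"name": "list_dir", "args": {"path": "/tmp"}}</call_tool>
tool: ["a.txt", "b.txt"]
assistant: <message_to_user>There are 2 files in /tmp.</message_to_user>
"""

def build_messages(rolling_history: list[dict]) -> list[dict]:
    # The calibration turns ride along with the system prompt on every request,
    # so they survive any trimming applied to rolling_history.
    system = (
        BASE_SYSTEM_PROMPT
        + "\n\nReference transcript of correct tool usage:\n"
        + CALIBRATION_TRANSCRIPT
    )
    return [{"role": "system", "content": system}] + rolling_history
```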

On the note of benchmarks: that calibration test is your benchmark from then on. When introducing new tools to the system prompt or adjusting any important variable, always rerun the same test to make sure your adjustments don't negatively affect the stability of the system.
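
In practice that can be as simple as a small regression harness you rerun after every change (run_agent_turn and the cases here are hypothetical stand-ins for your own calibration suite):

```python
CASES = [
    {"prompt": "List the files in /tmp and report how many there are.",
     "expected_tool": "list_dir"},
    # ... the rest of your calibration cases
]

def run_calibration_suite(run_agent_turn, trials: int = 5) -> float:
    """Return the fraction of runs where the agent picked the expected tool."""
    passed = total = 0
    for case in CASES:
        for _ in range(trials):
            chosen_tool = run_agent_turn(case["prompt"])  # returns the tool the agent called
            passed += chosen_tool == case["expected_tool"]
            total += 1
    return passed / total

# e.g. refuse to ship a prompt/tool change if accuracy drops below your baseline:
# assert run_calibration_suite(my_agent_fn) >= 0.95
```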

On context engineering: that is a must, because a bloated history will decohere the whole system. So it's important to devise an automated system that compresses the context down while retaining the overall essence of the history. There are about a billion ways you could do this and you will need to experiment a lot. LLMs are conditioned quite heavily by their own outputs, so having the ability to remove failed tool calls from the context is a big boon, since the model then becomes less likely to repeat the same mistakes. There are trade-offs, though: like he said, caching is a no-go if you go this route, but you gain a lot more control and stability in the whole system if you do it right. It's basically reliability vs. cost, and I tend to lean toward reliability. Also, I don't recommend using the model's whole context size. Most LLMs perform very poorly past a certain point, and I find that using at most 50% of the full context window is best for cohesion. Meaning that if, say, the max context window is 100k tokens, treat 50k as the hard limit and start compressing the history around 35k tokens. You can set up a granular, step-wise system where the most recent context is the most detailed and uncompressed, and the further back from the present it goes, the less detailed it gets. Obviously you want to store the full uncompressed history for a subagent that uses RAG, which allows the agent to see the whole thing in detail if it finds the need to.
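
A rough sketch of that tiered compression, using the 100k example above (count_tokens, summarize, and archive are placeholders for whatever tokenizer, summarizer, and RAG store you use):

```python
MODEL_CONTEXT = 100_000
HARD_LIMIT = MODEL_CONTEXT // 2   # never let the working context exceed 50k
COMPRESS_AT = 35_000              # start compressing well before the hard limit

def maintain_context(history: list[dict], count_tokens, summarize, archive) -> list[dict]:
    # 1. Drop failed tool calls so the model isn't conditioned on its own mistakes.
    history = [turn for turn in history if not turn.get("tool_error")]

    # 2. Archive the full, uncompressed history for the RAG subagent to query later.
    archive(history)

    # 3. Fold the oldest turns into summaries until the working context fits the
    #    budget; older material gets re-summarized over time, so detail naturally
    #    decreases the further back you go.
    while count_tokens(history) > COMPRESS_AT and len(history) > 4:
        oldest, history = history[:4], history[4:]
        history.insert(0, {"role": "system", "content": summarize(oldest)})
    return history
```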

Ah, also, on the matter of output: I have found great success with giving my agent explicit input and output channels. The model is conditioned to use specific channels for specific interactions: a <think> channel for CoT and reasoning, a <message_to_user> channel for explicit messages to the user, a <call_agent> channel for calling other agents and interacting with them, <call_tool> for tool use, and a few other environment and system channels that serve as input channels from error scripts and the environment back to the agent. This channel segmentation allows for better management of internal automated scripts and keeps the model focused. One other important thing: you need at least two separate output layers, meaning you separate the raw LLM output from what is displayed to the user, each with its own rules. That lets you display information to the real human in a very readable way, hiding all the noise, while still retaining the crucial context the model needs to function properly.
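
The routing can be as simple as a regex over the raw completion (the channel names match the ones above; parse_channels and the display split are just my own illustration, not a specific library's API):

```python
import re

CHANNEL_RE = re.compile(
    r"<(think|message_to_user|call_agent|call_tool)>(.*?)</\1>", re.DOTALL
)

def parse_channels(raw_output: str) -> list[tuple[str, str]]:
    """Split a raw model completion into (channel, content) segments."""
    return [(m.group(1), m.group(2).strip()) for m in CHANNEL_RE.finditer(raw_output)]

def route(raw_output: str, model_log: list[str]) -> None:
    # Layer 1: keep everything, noise included, in the log the model sees next turn.
    model_log.append(raw_output)
    # Layer 2: show the human only the message_to_user channel.
    for channel, content in parse_channels(raw_output):
        if channel == "message_to_user":
            print(content)
        elif channel == "call_tool":
            pass  # hand the payload to your tool executor here
        # <think> and <call_agent> stay hidden from the user but remain in model_log
```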

Bah, anyway, I've rambled long enough. Good luck, folks. Hope this info helps someone.