gbnwl 2 days ago

I'm not sure how many HN users frequent other places related to agentic coding, like the subreddits of particular providers, but this has got to be the 1000th "ultimate memory system"/break-free-of-the-context-limit-tyranny! project I've seen, and like all the similar projects, there's never any evidence of, or even an attempt at measuring, any performance metric it improves. Of course it's hard to measure such a thing, but that's exactly part of why it's hard to build something like this. Here's user #1001 who's been told by Claude "What a fascinating idea! You've identified a real gap in the market for a simple database-based memory system to extend agent memory."

beefsack a day ago

I feel like so many of these memory solutions are incredibly over-engineered too.

You can work around a lot of the memory issues for large and complex tasks just by making the agent keep work logs. Critical context to keep throughout large pieces of work includes decisions, conversations, investigations, plans, and implementations - a normal developer should be tracking these anyway, and it's sensible to have the agent track them too, in a way that survives compaction.

  • wfn a day ago

    Yes. I have (as part of Claude's output):

    - `FEATURE_IMPL_PLAN.md` (master plan; or `NEXT_FEATURES_LIST.md` or somesuch)

    - `FEATURE_IMPL_PROMPT_TEMPLATE.md` (where I replace placeholders with next feature to be implemented; prompt includes various points about being thorough, making sure to validate and loop until full test pipeline works, to git version tag upon user confirmation, etc.)

    - `feature-impl-plans/` directory where Claude is to keep per-feature detailed docs (with current status) up to date - this is esp. useful for complex features which may require multiple sessions, for example

    - I also instruct it to keep the main impl plan doc up to date, but that one is deliberately limited in size/depth/scope, so as not to overwhelm it

    - CLAUDE.md has a summary of important code references (paths / modules / classes, etc.) for lookup, but is also restricted in size. It does, however, include a full (up-to-date) inventory of all doc files, for itself - a script like the sketch after this list can regenerate that inventory

    - If I end up expanding CLAUDE.md for some reason or temporarily (before I offload some content to separate docs), I will say as part of the prompt template to "make sure to read in the whole @CLAUDE.md without skipping any content"
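
    To make that inventory refresh concrete, here's a rough sketch of the glue I mean (purely illustrative - the paths, marker comments, and size budget are my own conventions, nothing Claude Code requires):

        #!/usr/bin/env python3
        """Regenerate the doc inventory in CLAUDE.md. Illustrative sketch:
        paths and markers below are my conventions, not Claude Code's."""
        from pathlib import Path

        CLAUDE_MD = Path("CLAUDE.md")
        BEGIN, END = "<!-- DOC-INVENTORY:BEGIN -->", "<!-- DOC-INVENTORY:END -->"
        SIZE_BUDGET = 8_000  # chars; CLAUDE.md is kept small on purpose

        # Everything the agent should know exists, at a glance
        docs = sorted(Path("feature-impl-plans").glob("*.md"))
        docs += sorted(Path(".").glob("FEATURE_*.md"))
        inventory = "\n".join(f"- `{p}`" for p in docs)

        # Splice the fresh inventory between the marker comments
        text = CLAUDE_MD.read_text()
        head, _, rest = text.partition(BEGIN)
        _, _, tail = rest.partition(END)
        new = f"{head}{BEGIN}\n{inventory}\n{END}{tail}"
        CLAUDE_MD.write_text(new)

        if len(new) > SIZE_BUDGET:
            print(f"WARNING: CLAUDE.md is {len(new)} chars; offload content to docs")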

  • ramoz a day ago

    Great advice. For large plans I tell the agent to write to an “implementation_log.md” and make note of it during compaction. Additionally, the agent can also just reference the original session logs.

    • hasperdi a day ago

      The problem with this approach is that the model may forget to update the log... It usually happens once the context window is more than 50% full.

      • ramoz a day ago

        I've found this happens less often if the task is part of the plan. It typically gets into a habit of cycling between editing code and updating the doc.

      • ilvez a day ago

        ...and not only those, but the baseline as well, aka CLAUDE.md. I've told it the basics countless times, in the same session, without compacting, etc.

  • SkyPuncher a day ago

    Yep. I just have my agents write out key details to a markdown file. Doesn’t have to be perfect. Just enough for them to reorient themselves to a problem.

  • ryanthedev a day ago

    I agree. I use plan files, plus git for my work logs. It's been successful.

xnx a day ago

Some with a coding background love prompt engineering, contrived supporting systems, JSON prompting, and any other superstition that makes it feel like they're really doing something.

They refuse to believe that it's possible to instruct these tools in terse plain English and get useful results.

christinetyip a day ago

This is fair; many memory projects out there boil down to better summaries or prompt glue without any clear way to measure impact.

One thing I’d clarify about what we’re building is that it’s not meant to be “the best memory for a single agent.”

The core idea is portability and sharing, not just persistence.

Concretely:

- you can give Codex access to memory created while working in Claude

- Claude Code can retrieve context from work done in other tools

- multiple agents can read/write the same memory instead of each carrying their own partial copy

- specific parts of context can be shared with teammates or collaborators

That’s the part that’s hard (or impossible) to do with markdown files or tool-local memory, and it’s also why we don’t frame this as “breaking the context limit.”

Measuring impact here is tricky, but the problem we’re solving shows up as fragmentation rather than forgetting: duplicated explanations, divergent state between agents, and lost context when switching tools or models.

If someone only uses a single agent in a single tool and is already using a customized CLAUDE.md, they probably don’t need this. The value shows up once you treat agents as interchangeable workers rather than a single long-running conversation.

  • gbnwl a day ago

    > That’s the part that’s hard (or impossible) to do with markdown files or tool-local memory.

    I'm confused because every single thing in that list is trivial? Why would Codex have trouble reading a markdown file Claude wrote or vice versa? Why would multiple agents need their own copy of the markdown file instead of just referring to it as needed? Why would it be hard to share specific files with teammates or collaborators?

    Edit - I realize I could be more helpful if I actually shared how I manage project context:

    CLAUDE.md or Agents.md is not the only place to store context for agents in a project; you can just store docs at any layer of granularity you want. What's worked best for me is to:

    1. Have a standards doc (or several); you can point the agents to the same doc(s) in their respective claude.md/agents.md.

    2. Before coding, have the agent create implementation plans that get stored as tickets (markdown files), one for each chunk of work estimated to take about one context window.

    3. Work through the tickets and update them as completed. Easy to refer back to when needed.

    4. If you want, you can ask the agent to contribute to an overall dev log as well, but this gets long fast. It's useful for agents to read the last 50 lines or so to immediately get up to speed on "what just happened?", though git history can serve the same purpose. (A small helper sketch follows this list.)

    5. Ultimately the code is going to be the real "memory" of the true state, so try to organize it in a way that's easy for agents to comb through (no 5000-line files that agents struggle to carefully jump around in to find what they need without immediately eating up their entire context window).
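
    For point 4, a tiny helper like this is all I mean (a sketch; DEVLOG.md, tickets/, and the "Status: done" convention are hypothetical names, not a standard):

        #!/usr/bin/env python3
        """'What just happened?' helper. Sketch only: DEVLOG.md and
        tickets/ are hypothetical names - use whatever layout you have."""
        from pathlib import Path

        def recent_devlog(path="DEVLOG.md", n=50):
            """Last n lines of the dev log - quick reorientation for an agent."""
            return "\n".join(Path(path).read_text().splitlines()[-n:])

        def open_tickets(folder="tickets"):
            """Convention: a ticket is open until it contains 'Status: done'."""
            return [str(p) for p in sorted(Path(folder).glob("*.md"))
                    if "Status: done" not in p.read_text()]

        if __name__ == "__main__":
            print(recent_devlog())
            print("\nOpen tickets:")
            print("\n".join(f"- {t}" for t in open_tickets()))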

    • christinetyip a day ago

      You’re right that reading the same markdown file is trivial; that’s not the hard part.

      Where it stopped being trivial for us was once multiple agents were working at the same time. For example, one agent is deciding on an architecture while another is already generating code. A constraint changes mid-way. With a flat file, both agents can read it, but you’re relying on humans as the coordination layer: deciding which docs are authoritative, when plans are superseded, which tickets are still valid, and how context should be scoped for a given agent.

      This gets harder once context is shared across tools or collaborators’ agents. You start running into questions like who can read vs. update which parts of context, how to share only relevant decisions, how agents discover what matters without scanning a growing pile of files, and how updates propagate without state drifting apart.

      You can build conventions around this with files, and for many workflows that works well. But once multiple agents are updating state asynchronously, the complexity shifts from storage to coordination. That boundary, sharing and coordinating evolving context across many agents and tools, is what we’re focused on and what an external memory network can solve.

      If you’ve found ways to push that boundary further with files alone, I’d genuinely be curious - this still feels like an open design space.

      • gbnwl a day ago

        You're still not closing the gap between the problems you're naming and how your solution solves them?

        > With a flat file, both agents can read it, but you’re relying on humans as the coordination layer: deciding which docs are authoritative, when plans are superseded, which tickets are still valid, and how context should be scoped for a given agent.

        So the memory system also automates project management by removing "humans as the coordination layer"? From the OP the only details we got were

        "What it does: (1) persists context between sessions (2) semantic & temportal search (not just string grep)"

        Which are fine, but neither it nor you explain how it can solve any of these broader problems you bring up:

        "deciding which docs are authoritative, when plans are superseded, which tickets are still valid, and how context should be scoped for a given agent, questions like who can read vs. update which parts of context, how to share only relevant decisions, how agents discover what matters without scanning a growing pile of files, and how updates propagate without state drifting apart."

        You're claiming that semantic and temporal search has solved all of this for free? This project was presented as a memory solution, and now it seems like you're saying it's actually an agent orchestration framework, but the gap between what you're claiming your system can achieve and how you claim it works seems vast.

stingraycharles a day ago

imho, if it’s not based on RAG, it’s not a real memory system. the agent often doesn’t know what it doesn’t know, and as such relevant memories must be pushed into the context window by embedding distance, not actively looked up.
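
the core loop is small - a minimal sketch, where `embed` is a stand-in for whatever embedding model you use (not a specific API):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Stand-in: call your embedding model of choice here."""
        raise NotImplementedError

    def top_k_memories(query, memories, k=5):
        """Push the k nearest memories into context by embedding distance."""
        q = embed(query)
        scored = []
        for m in memories:
            v = embed(m)  # in practice these would be precomputed and indexed
            cos = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
            scored.append((cos, m))
        return [m for _, m in sorted(scored, reverse=True)[:k]]

    def build_prompt(query, memories):
        """The agent never asks for these; they are injected unprompted."""
        recalled = "\n".join(f"- {m}" for m in top_k_memories(query, memories))
        return f"Relevant memories:\n{recalled}\n\nTask: {query}"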

austinbaggio 2 days ago

Which of the 1000 is your favorite? There does seem to be a shallow race to optimize xyz benchmark for some narrow sliver of the context problem, but you're right, the context problem space is big, so I don't think we'll hurry to join that narrow race.

  • gbnwl a day ago

    > Which of the 1000 is your favorite?

    None, that's what I'm trying to say. My favorite is just storing project context locally in docs that agents can discover on their own or that I can point to if needed. This doesn't require me to upload sensitive code or information to anonymous people's side projects, and it has an equivalent amount of hard evidence for efficacy (zero), but at least it has my own anecdotal evidence of helping and doesn't invite additional security risk.

    People go way overboard with MCPs and armies of subagents built on wishes and unproven memory systems because no one really knows for sure how to get past the spot we all hit where the agentic project that was progressing perfectly hits a sharp downtrend in progress. That doesn't mean it's time to send our data to strangers.

    • gck1 a day ago

      > no one really knows for sure how to get past the spot we all hit where the agentic project that was progressing perfectly hits a sharp downtrend in progress.

      FWIW, I find this eventual degradation point comes much later and with fewer consequences when there are strict guardrails inside and outside of the LLM itself.

      From what I've seen, most people try to fix only the "inside" part - tweaking the prompts, installing 500 MCPs (which ironically pollute the context and make the problem worse), yelling in uppercase in hopes that it will remember, etc. - and they ignore that automated compliance checks existed way before LLMs.

      Throw the strictest and most masochistic linting rules at it in a language that is masochistic itself (e.g. Rust), add tons of integration tests that encode intent, add a stop hook in CC that runs all these checks, and you've got a system that simply isn't allowed to silently drift and can put itself back on track with the feedback it gets.
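
      For illustration, the check script such a stop hook runs can be as dumb as this (a sketch assuming a Rust project; the hook just needs a command that exits non-zero and prints why):

          #!/usr/bin/env python3
          """Guardrail script for a stop hook - illustrative sketch.
          Exits non-zero with the failing output so the agent gets
          concrete feedback instead of silently drifting."""
          import subprocess
          import sys

          CHECKS = [
              ["cargo", "clippy", "--", "-D", "warnings"],  # masochistic lint rules
              ["cargo", "test"],                            # tests that encode intent
          ]

          for cmd in CHECKS:
              result = subprocess.run(cmd, capture_output=True, text=True)
              if result.returncode != 0:
                  # Let the code scream at the agent and feed the context
                  print(result.stdout + result.stderr, file=sys.stderr)
                  sys.exit(result.returncode)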

      Basically, rather than trying to hypnotize an agent into remembering everything by writing a 5000-line agents.md, just let the code itself scream at it and feed the context.

AndyNemmity a day ago

The funny part is, the vast majority of them are barely doing anything at all.

All of these systems are for managing context.

You can generally tell which ones are actually doing something if they are using skills, with programs in them.

Because then, you're actually attaching some sort of feature to the system.

Otherwise, you're just feeding in different prompts and steps, which can add some value, but okay, it doesn't take much to do that.

Like adding image generation to Claude Code with Google's nano banana, via a Python script that does it.

That's actually adding something Claude Code doesn't have, instead of just saying "You are an expert in blah".
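
Such a script is roughly this shape - a sketch only, assuming the google-genai SDK (pip install google-genai), an API key in the environment, and that the "nano banana" model name below is still current:

    #!/usr/bin/env python3
    """Tiny image-generation tool an agent can shell out to.
    Sketch only: verify the SDK usage and model name before relying on it."""
    import sys
    from google import genai

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # aka "nano banana"
        contents=" ".join(sys.argv[1:]) or "a banana, nano-sized",
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data is not None:  # the generated image bytes
            with open("out.png", "wb") as f:
                f.write(part.inline_data.data)
            print("wrote out.png")
            break
    else:
        print("no image returned", file=sys.stderr)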

  • austinbaggio a day ago

    It sounds like you've used quite a few. What programs are you expecting? Assuming you're talking about doing some inference on the data? Or optimizing for some RAG or something?

    • AndyNemmity a day ago

      An example of a skill I gave: adding image generation via nano banana.

      Another is one Claude Code ships with, using ripgrep.

      Those are actual features. It's adding deterministic programs that the LLM calls when it needs something.

      • austinbaggio a day ago

        Oh got it - tool use

        • AndyNemmity a day ago

          Exactly. That adds actual value. Some of the 1000s of projects do this. Those pieces add value, if the tool itself adds value, which also isn’t a given.

  • troupo a day ago

    > You can generally tell which ones are actually doing something if they are using skills, with programs in them.

    > Otherwise, you're just feeding in different prompts and steps

    "skills" are literally just .md files with different prompts and steps.

    > That's actually adding something claude code doesn't have, instead of just saying "You are an expert in blah"

    It's not adding anything but a prompt saying "when asked to do X invoke script Y or do steps Z"

    • AndyNemmity 21 hours ago

      Skills are .md files, but they are not just that. They are also scripts. That's what I mean by adding things. You can make a skill that is just a prompt, but that misses the point of the value.

      You're packaging the tool with the skill, or multiple tools to do a single thing.

      • troupo 16 hours ago

        In the end it's still an .md file pointing to a script that ends up being just a prompt for the agent, one it may or may not pick up, may or may not discover, and may or may not forget after context compaction, etc.

        There's no inherent magic to skills, nor any fundamental difference between them and "just feeding in different prompts and steps". It literally is just feeding in different prompts and steps.

Forgeties79 a day ago

Have you tried using it? Not being flippant and annoying. Just curious if you tried it and what the results were.

  • Game_Ender a day ago

    Why should he put effort into measuring a tool that the author has not? The point is that there are so many of these tools that an objective measure their creators could compare against each other would be better.

    So a better question to ask is: do you have any ideas for an objective way to measure the performance of agentic coding tools? Then we could truly determine what improves performance and what doesn't.

    I would hope that, internally, OpenAI and Anthropic use something similar to the harnesses/test cases they use for training their full models to determine whether changes to Claude Code result in better performance.

    • morkalork a day ago

      Well, if I were Microsoft and training Copilot, I would log all the <restore checkpoint> user actions and grade the agents on that. At scale across all users, "resets per agent command" should be useful. But then again, publishing the true numbers might be embarrassing...

      • kuboble a day ago

        I'm not sure it's a good signal.

        I often restore a conversation checkpoint after successfully completing a side quest.

  • gbnwl a day ago

    Who has time to try this when there's this huge backlog here: https://www.reddit.com/r/ClaudeAI/search/?q=memory

    • Forgeties79 a day ago

      Have you tried any of those?

      • gbnwl a day ago

        Yes, they haven't helped. Have you found one that works for you?

johnnyfived a day ago

I imagine HN, despite being full of experts and vet devs, also might have a prevalent attitude of looking down on using tools like MCP servers or agentic AI libraries for coding, which might be why something advertised like this seems novel rather than redundant.

  • DrewADesign a day ago

    > I imagine HN, despite being full of experts and vet devs, also might have a prevalent attitude of looking down on using tools like MCP servers or agentic AI libraries for coding, which might be why something advertised like this seems novel rather than redundant.

    I’m not sure where the ‘despite’ comes in. Experts and vets have opinions and this is probably the best online forum to express them. Lots of experts and vets also dislike extremely popular unrelated tools like VB, Windows, “no-code” systems, and Google web search… it’s not a personality flaw. It doesn’t automatically mean they’re right, either, but ‘expert’ and ‘vet’ are earned statuses, and that means something. We’ve seen trends come and go and empires rise and fall, and been repeatedly showered in the related hype/PR/FUD. Not reflexively embracing everything that some critical mass of other people like is totally fine.

    • gbnwl a day ago

      I think maybe the point they were trying to make is that despite people on HN being very technically experienced, skepticism and distrust of LLM-assisted coding tools may have prevented many of them from exploring the space too deeply yet. So a project like this may seem novel to many readers here, when the reality for users who've been using and following tools like Claude Code (and similar) closely for a while now is that claims like the ones this project is making come out multiple times per week.

      • johnnyfived a day ago

        They pretty much perfectly encapsulated the point in their fired-up response, haha.

  • troupo a day ago

    Because experts and vets can usually quickly disassemble layers of marketing bullshit and see through false promises?

    Because experts and vets often use these tools and find them extremely lacking?