Comment by ramoz a day ago

If markdown in a git repository isn’t good enough for collaboration, then why would any plugged-in abstraction be better?

You imply you have a solution for capturing current holistic state. For that you would need a solution for context decay and relevant curation — with benchmarks that prove it is also more valuable than constant rediscovery (for both quality and cost).

That narrative becomes harsher once you pivot to “general-purpose agents”, because you’re then competing with every existing knowledge-work platform. So you’ll shift to “unified context for all your KW platforms”, where presumably the agents already have access (Claude today can basically go scrape all knowledge from anywhere).

So then it becomes an offering of “current state” in complex human processes, and I’m not sure that is a concept any technology can capture, whether across codebases (for which humans settled on git) or, especially, general working scenarios. And I guess this is where it becomes unified multi-agent holistic state capture. Ambitious and fun problem.

austinbaggio 17 hours ago

| need a solution for context decay and relevant curation — with benchmarks that prove it is also more valuable than constant rediscovery (for quality and cost).

I agree. We are looking at some METR benchmarks, not expecting a simple answer to this, but do you have any in mind that you find compelling?

  • ramoz 16 hours ago

    Not really. But you could go viral again with a "Coding Agents with memory build better software using fewer tokens" post, showcasing how you benchmarked a "Twitter rebuild":

    1. Setup Claude Code to build some layers of the stack

    2. Setup Codex to build others.

    In one instance equip them both with your product. Maybe bake in some tribal knowledge.

    In another instance let them work raw.

    In both instances, capture:

         - Time to completion
         - Tokens spent
         - Ability to meet original spec
         - Subjective quality 
         - Number of errors, categorized by layer, so you can state something like "raw-claude's backend kept failing with raw-codex's frontend", etc.
    
    I imagine this benchmark working well in your favor.
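
    For concreteness, a minimal sketch of how the comparison harness could record those metrics, assuming each run (with-memory vs. raw) writes a JSON log; all file names and log fields below are made up for illustration:

        # Hypothetical harness sketch; log format and file names are assumptions.
        # Assumes each run writes a JSON log like:
        #   {"wall_clock_s": 5400, "tokens": 1200000,
        #    "spec_items_met": 41, "spec_items_total": 50,
        #    "quality_score": 7.5,
        #    "errors": [{"layer": "backend", "msg": "..."}]}
        import json
        from collections import Counter
        from pathlib import Path

        def summarize(log_path):
            run = json.loads(Path(log_path).read_text())
            return {
                "hours": run["wall_clock_s"] / 3600,
                "tokens": run["tokens"],
                "spec_met_pct": 100 * run["spec_items_met"] / run["spec_items_total"],
                "quality": run["quality_score"],  # subjective 1-10 rating
                "errors_by_layer": Counter(e["layer"] for e in run["errors"]),
            }

        for label, path in [("with-memory", "memory_run.json"), ("raw", "raw_run.json")]:
            s = summarize(path)
            print(f"{label}: {s['hours']:.1f}h, {s['tokens']} tokens, "
                  f"{s['spec_met_pct']:.0f}% of spec met, quality {s['quality']}, "
                  f"errors by layer {dict(s['errors_by_layer'])}")

    Counting errors per layer (rather than one total) is what lets you make the cross-stack claim above instead of just "run A had fewer errors."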