Comment by faangguyindia 5 days ago
I am developing a coding agent that currently manages and indexes over 5,000 repositories. The agent's state is stored locally in a hidden `.agent` directory, which contains a configuration folder for different agent roles and their specific instructions. Within it there's an "agents" folder with multiple files; each file contains

`<Role> <instruction>`

The agent only reads a file if its own role is defined there.

Inside the project directory, we have a `.<coding agent name>` folder where the coding agent's state is stored.
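A rough sketch of the role-lookup idea (not our actual code; the paths and the two-part file format are just illustrative):

```python
from pathlib import Path

AGENT_DIR = Path.home() / ".agent"   # assumed location of the hidden state directory
ROLES_DIR = AGENT_DIR / "agents"     # one file per role definition

def load_instructions(role: str) -> str | None:
    """Return the instructions for `role`, or None if no file defines that role."""
    for role_file in ROLES_DIR.iterdir():
        if not role_file.is_file():
            continue
        text = role_file.read_text()
        # assumed layout: first line names the role, the rest is its instruction
        header, _, instruction = text.partition("\n")
        if header.strip() == role:
            return instruction.strip()
    return None  # the agent ignores files that don't mention its role
```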
Our process kicks off with an `/init` command, which triggers a deep analysis of the entire repository. Instead of just indexing the raw code, the agent generates a high-level summary of its architecture and logic. These summaries appear in the editor as toggleable "ghost comments." They're a metadata layer, not part of the source code, so they are never committed with the actual code. A sophisticated mapping system precisely links each summary annotation to the relevant lines of code.
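A rough sketch of what one annotation carries (illustrative only; the field names and the hash-based anchoring are simplified, not the real mapping system):

```python
from dataclasses import dataclass
from pathlib import Path
import hashlib

@dataclass
class GhostComment:
    """One summary annotation, kept outside the source tree."""
    file: str          # repo-relative path of the annotated file
    start_line: int    # first line the summary covers (1-based)
    end_line: int      # last line the summary covers
    summary: str       # high-level description of architecture/logic
    anchor_hash: str   # hash of the covered lines, to detect drift after edits

def annotate(repo_root: Path, file: str, start: int, end: int, summary: str) -> GhostComment:
    lines = (repo_root / file).read_text().splitlines()[start - 1:end]
    digest = hashlib.sha256("\n".join(lines).encode()).hexdigest()[:16]
    return GhostComment(file, start, end, summary, digest)
```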
This architecture is the solution to a problem we faced early on: running Retrieval-Augmented Generation (RAG) directly on source code never gave us the results we needed.
Our current system uses a hybrid search model. We use the AST for fast, literal lexical searches, while RAG is reserved for performing semantic searches on our high-level summaries. This makes all the difference. If you ask, "How does authentication work in this app?", a purely lexical search might only find functions containing the word `login` and functions/classes appearing in its call hierarchy. Our semantic search, however, queries the narrative-like summaries. It understands the entire authentication flow like it's reading a story, piecing together the plot points from different files to give you a complete picture.
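In sketch form, the two passes look something like this (toy code, not our implementation; `embed` stands in for whatever embedding model you use, and summaries are plain dicts here):

```python
import ast
import math
from pathlib import Path
from typing import Callable

def lexical_hits(repo_root: Path, symbol: str) -> list[tuple[str, int]]:
    """AST pass: literal matches on function/class names (fast, exact)."""
    hits = []
    for py_file in repo_root.rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text())
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)) \
                    and symbol in node.name:
                hits.append((str(py_file), node.lineno))
    return hits

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_hits(query: str, summaries: list[dict], embed: Callable, top_k: int = 5) -> list[dict]:
    """RAG pass: rank the high-level summaries by similarity to the question."""
    q = embed(query)
    return sorted(summaries, key=lambda s: -cosine(q, embed(s["summary"])))[:top_k]
```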
It works like magic.
Working on something similar. Legacy codebase understanding requires this type of annotation, and "just use code comments" is too blunt an instrument to do much good. Are you storing the annotations completely out of band wrt the files, or using filesystem capabilities like metadata?
This type of metadata could have value in its own right; there are many types of documents that will be analyzed by LLMs and will need not only a place to store analysis alongside document parts, but also meta-metadata about the analysis (timestamps, models, prompts used, etc.). Of course this could all be done OOB, but then you need a robust way to link your metadata store to a file that has a lifecycle all its own that's only observable by you (probably).
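To make that concrete, an out-of-band record could carry a content hash for the linking plus the analysis provenance; a sketch with purely illustrative field names:

```python
from dataclasses import dataclass
from pathlib import Path
import hashlib
import time

@dataclass
class AnalysisRecord:
    """Out-of-band analysis plus the meta-metadata needed to trust it later."""
    file: str             # repo-relative path at analysis time
    content_sha256: str   # ties the record to the exact bytes that were analyzed
    summary: str          # the analysis itself
    model: str            # which model produced it
    prompt: str           # prompt (or prompt id) used
    created_at: float     # unix timestamp

def record_analysis(repo_root: Path, file: str, summary: str,
                    model: str, prompt: str) -> AnalysisRecord:
    digest = hashlib.sha256((repo_root / file).read_bytes()).hexdigest()
    return AnalysisRecord(file, digest, summary, model, prompt, time.time())

# If the stored hash no longer matches the file on disk, the analysis is stale
# and the link has to be re-established (or the file re-analyzed).
```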