Comment by zh2408 4 days ago

4 replies

The Linux repository has ~50M tokens, which goes beyond the 1M token limit for Gemini 2.5 Pro. I think there are two paths forward: (1) decompose the repository into smaller parts (e.g., kernel, shell, file system, etc.), or (2) wait for larger-context models with a 50M+ input limit.
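To make path (1) concrete, here is a rough sketch of how one might size up the pieces before deciding where to split. It assumes about 4 characters per token, which is only a crude heuristic and not Gemini's actual tokenizer. The function names (`estimate_tokens`, `tokens_by_top_level_dir`, `split_by_budget`) are made up for illustration.

```python
import os
from collections import defaultdict

CHARS_PER_TOKEN = 4        # assumption: crude heuristic, not a real tokenizer
TOKEN_BUDGET = 1_000_000   # the 1M context limit mentioned above


def estimate_tokens(path):
    """Estimate a file's token count from its size in bytes."""
    try:
        return os.path.getsize(path) // CHARS_PER_TOKEN
    except OSError:
        # Broken symlinks or unreadable files count as zero.
        return 0


def tokens_by_top_level_dir(repo_root):
    """Sum estimated tokens for each top-level directory of the repo."""
    totals = defaultdict(int)
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Skip version-control metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        rel = os.path.relpath(dirpath, repo_root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            totals[top] += estimate_tokens(os.path.join(dirpath, name))
    return dict(totals)


def split_by_budget(totals, budget=TOKEN_BUDGET):
    """Separate directories that fit in one context from those that do not."""
    fits = {d: t for d, t in totals.items() if t <= budget}
    too_big = {d: t for d, t in totals.items() if t > budget}
    return fits, too_big
```

Anything that lands in `too_big` would need a second pass, for example repeating the same walk one directory level deeper.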

achierius 4 days ago

Some huge percentage of that is just drivers. The kernel is likely what would be of interest to someone in this regard; moreover, much of that is architecture specific. IIRC the x86 kernel is <1M lines, though probably not <1M tokens.

  • throwup238 3 days ago

    The AMDGPU driver alone is 5 million lines, out of about 37 million lines total. Over 10% of the codebase is a driver for a single vendor, although most of it is auto-generated per-product headers.

rtolsma 4 days ago

You can use the AST for some languages to identify modular components that are smaller and can fit into the 1M window.
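As a minimal sketch of the idea, using Python's built-in `ast` module on Python source (the kernel is C, so a real attempt would need a C parser instead): split a file into top-level functions and classes, then greedily pack them into chunks under a token budget. The 4-characters-per-token estimate and the helper names `top_level_units` and `pack_units` are assumptions for illustration.

```python
import ast

CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


def top_level_units(source):
    """Yield (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, ast.get_source_segment(source, node)


def pack_units(units, budget_tokens):
    """Greedily group consecutive units into chunks under the budget.

    A single unit larger than the budget still gets its own chunk.
    """
    chunks, current, used = [], [], 0
    for name, text in units:
        cost = len(text) // CHARS_PER_TOKEN + 1
        if current and used + cost > budget_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Packing consecutive units keeps neighbouring definitions together, which is a simple stand-in for "modular"; grouping by call graph or imports would likely work better but needs more analysis.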

ryao 3 days ago

The first path would be the most interesting, especially if it can be automated.