Comment by ryao

Comment by ryao 3 months ago

I would find this more interesting if it made tutorials out of the Linux, LLVM, OpenZFS and FreeBSD codebases.

zh2408 3 months ago

The Linux repository has ~50M tokens, which goes beyond the 1M token limit for Gemini 2.5 Pro. I think there are two paths forward: (1) decompose the repository into smaller parts (e.g., kernel, shell, file system, etc.), or (2) wait for larger-context models with a 50M+ input limit.
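
A minimal sketch of path (1), assuming the repository is split along top-level directories and that token counts are approximated at roughly 4 characters per token. That ratio is a crude heuristic, not the real Gemini tokenizer, and the function names here are hypothetical:

```python
from collections import defaultdict

# Assumption: ~4 characters per token. A real tokenizer will give different counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int) -> int:
    """Rough token estimate from a character count (ceiling division)."""
    return -(-num_chars // CHARS_PER_TOKEN)


def tokens_per_subsystem(file_sizes: dict[str, int]) -> dict[str, int]:
    """Sum estimated tokens per top-level directory.

    file_sizes maps a repo-relative path (e.g. "fs/ext4/inode.c")
    to that file's size in characters.
    """
    totals: dict[str, int] = defaultdict(int)
    for path, num_chars in file_sizes.items():
        top_level = path.split("/", 1)[0]
        totals[top_level] += estimate_tokens(num_chars)
    return dict(totals)


def pack_into_batches(totals: dict[str, int], budget: int) -> list[list[str]]:
    """Greedy first-fit-decreasing packing of subsystems into batches.

    A subsystem that exceeds the budget by itself ends up alone in its
    own batch, which signals that it needs further decomposition.
    """
    batches: list[list] = []  # each entry: [used_tokens, [subsystem names]]
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, tokens in ordered:
        for batch in batches:
            if batch[0] + tokens <= budget:
                batch[0] += tokens
                batch[1].append(name)
                break
        else:
            batches.append([tokens, [name]])
    return [names for _, names in batches]
```

Each resulting batch could then be sent to the model separately. Oversized subsystems (drivers, most likely) would need a second pass that splits on subdirectories instead.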


achierius 3 months ago

Some huge percentage of that is just drivers. The kernel is likely what would be of interest to someone in this regard; moreover, much of that is architecture-specific. IIRC the x86 kernel is <1M lines, though probably not <1M tokens.

- throwup238 3 months ago
  
  The AMDGPU driver alone is 5 million lines, out of about 37 million lines total. Over 10% of the codebase is a driver for a single vendor, although most of it is auto-generated per-product headers.
  
rtolsma 3 months ago

You can use the AST for some languages to identify modular components that are smaller and can fit into the 1M window
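
As a rough illustration of the idea, here is a sketch that uses Python's standard `ast` module to cut a Python source file into its top-level functions and classes. This only handles Python; for C code such as the kernel you would need a C-aware parser (libclang or tree-sitter, for example), and the helper name below is hypothetical:

```python
import ast


def top_level_components(source: str) -> dict[str, str]:
    """Map each top-level function or class name to its source text.

    Imports, constants, and other module-level statements are skipped;
    a fuller version would collect them as shared context. Decorator
    lines are not part of the returned segment.
    """
    tree = ast.parse(source)
    components: dict[str, str] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                components[node.name] = segment
    return components
```

Each component can then be sized and grouped so that every request stays under the context window.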

ryao 3 months ago

The first path would be the most interesting, especially if it can be automated.


wordofx 3 months ago

I would find this comment more interesting if it didn’t dismiss the project just because you didn’t find it valuable.


ryao 3 months ago

My comment gave constructive feedback. Yours did not.

revskill 3 months ago

So what is the problem with raising an opinion?


fn-mote 3 months ago

You would need a more specific goal than “make a tutorial”.

Do you have anything in mind? Are you familiar enough with any of those codebases to suggest something useful?

The task will be much more interesting if there is not a good existing tutorial that the LLM may have trained on.

OS kernel: tutorial on how to write a driver?

OpenZFS: ?


ryao 3 months ago

I am #4 here:
https://github.com/openzfs/zfs/graphs/contributors
I would have preferred to see what would have been generated without my guidance, but since you asked:
* Explanations of how each sub-component is organized and works would be useful.
* Explanations of the modern disk format (an updated ZFS disk format specification) would be useful.
* Explanations of how the more complex features are implemented (e.g. encryption, raid-z expansion, draid) would be interesting.
Basically, making guides that aid development by avoiding a need to read everything line by line would be useful (the ZFS disk format specification, while old, is an excellent example of this). I have spent years doing ZFS development, and there are parts of the ZFS codebase that I do not yet understand. This is true for practically all contributors. Having guides that avoid the need for developers to learn the hard way would be useful. Certain historical bugs might have been avoided had we had such guides.
As for the others, LLVM could use improved documentation on how to make plugins. A guide to the various optimization passes would also be useful. Then there is the architecture in general, which would be nice to have documented. Documentation for various esoteric features of both FreeBSD and Linux would be useful. I could continue, but the whole point of having an LLM do this sort of work is to avoid needing myself or someone else to spend time thinking about these things.
