Comment by ryao
Comment by ryao 4 days ago
I would find this more interesting if it made tutorials out of the Linux, LLVM, OpenZFS and FreeBSD codebases.
The AMDGPU driver alone is 5 million lines, out of about 37 million lines total. Over 10% of the codebase is a driver for a single vendor, although most of it is auto-generated per-product headers.
You would need a more specific goal than “make a tutorial”.
Do you have anything in mind? Are you familiar enough with any of those codebases to suggest something useful?
The task will be much more interesting if there is not a good existing tutorial that the LLM may have trained on.
OS kernel: tutorial on how to write a driver?
OpenZFS: ?
I am #4 here:
https://github.com/openzfs/zfs/graphs/contributors
I would have preferred to see what would have been generated without my guidance, but since you asked:
* Explanations of how each sub-component is organized and works would be useful.
* Explanations of the modern disk format (an updated ZFS disk format specification) would be useful.
* Explanations of how the more complex features are implemented (e.g. encryption, raid-z expansion, draid) would be interesting.
Basically, making guides that aid development by avoiding a need to read everything line by line would be useful (the ZFS disk format specification, while old, is an excellent example of this). I have spent years doing ZFS development, and there are parts of the ZFS codebase that I do not yet understand. This is true for practically all contributors. Having guides that avoid the need for developers to learn the hard way would be useful. Certain historical bugs might have been avoided had we had such guides.
As for the others, LLVM could use improved documentation on how to make plugins. A guide to the various optimization passes would also be useful. Then there is the architecture in general, which would be nice to have documented. Documentation for various esoteric features of both FreeBSD and Linux would be useful. I could continue, but the whole point of having an LLM do this sort of work is to avoid needing myself or someone else to spend time thinking about these things.
The Linux repository has ~50M tokens, which goes beyond the 1M token limit for Gemini 2.5 Pro. I think there are two paths forward: (1) decompose the repository into smaller parts (e.g., kernel, shell, file system, etc.), or (2) wait for larger-context models with a 50M+ input limit.
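For path (1), a rough sketch of what the decomposition could look like, in Python. This is only an illustration under stated assumptions: the token count is estimated as about 4 bytes per token rather than measured with a real tokenizer, and the names `estimate_tokens`, `pack_files`, and `list_source_files` are made up for this example, not part of any existing tool.

```python
import os
from typing import Iterable, Iterator, List, Tuple

# Assumption: roughly 4 bytes of source text per token. A real tokenizer
# would give different numbers; this is only a sizing heuristic.
BYTES_PER_TOKEN = 4


def estimate_tokens(size_bytes: int) -> int:
    """Crude token estimate for a file of the given size."""
    return max(1, size_bytes // BYTES_PER_TOKEN)


def pack_files(files: Iterable[Tuple[str, int]], budget_tokens: int) -> List[List[str]]:
    """Group (path, size_bytes) pairs into chunks that fit a token budget.

    Paths are sorted first, so files from the same directory tend to end up
    in the same chunk. A single file larger than the budget still gets a
    chunk of its own and would need to be split separately.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for path, size in sorted(files):
        tokens = estimate_tokens(size)
        # Start a new chunk when adding this file would exceed the budget.
        if current and current_tokens + tokens > budget_tokens:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(path)
        current_tokens += tokens
    if current:
        chunks.append(current)
    return chunks


def list_source_files(root: str, exts: Tuple[str, ...] = (".c", ".h")) -> Iterator[Tuple[str, int]]:
    """Yield (relative_path, size_bytes) for source files under root."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            if name.endswith(exts):
                full = os.path.join(dirpath, name)
                yield os.path.relpath(full, root), os.path.getsize(full)
```

Something like `pack_files(list_source_files("linux"), 1_000_000)` would then give a list of file groups, each of which should fit in a single 1M-token request if the estimate holds. Sorting by path is a simple way to keep a subsystem together; it does not know anything about actual dependencies between files.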