Comment by chermi
Very nice.
When I first started using LLMs, I thought this sort of history retracing would be something you could use LLMs for. They were good at language, and research papers are language + math + graphs. At the time they didn't really understand math and they weren't multimodal yet, but still I decided to try a very basic version by feeding it some papers I knew very well in my area of expertise and trying to construct the genealogy of the main idea by tracing references.
What I found at the time was garbage, but I attribute that mostly to me not being very rigorous. It suggested papers that came years after the actual catalysts and were basically regurgitations of existing results. Not even syntheses, just garbage papers that will never be cited by anyone but the authors themselves.
What I concluded was that it didn't work because LLMs don't understand ideas, so they can't really trace them. They were basically doing dot products to find papers that matched the wording best in the current literature, which of course yields a recency bias, since subfields converge on common phrasings over time. I think there's also an "unoriginality" bias, in the sense that the true catalyst/origin of an idea will likely not have the most refined and "survivable" way of describing the new idea. New ideas are new, and upon digestion by the community they will probably come out looking a little different. That is to say, raw text matching isn't the best approach to tracing ideas.
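(For concreteness, this is roughly the kind of matching I mean -- embed the text, rank by dot product/cosine similarity. The model name and the placeholder abstracts are just illustrative; I'm not claiming this is literally what any particular LLM does internally.)

```python
# Rough sketch of "dot products over wording": embed abstracts and rank earlier
# papers by similarity to the target paper. Model choice and data are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

target_abstract = "..."                # abstract of the paper whose lineage we want
candidate_abstracts = ["...", "..."]   # abstracts of earlier papers

target_vec = model.encode(target_abstract, normalize_embeddings=True)
cand_vecs = model.encode(candidate_abstracts, normalize_embeddings=True)

# With normalized vectors, the dot product is cosine similarity.
scores = cand_vecs @ target_vec
ranking = np.argsort(-scores)          # "most similar wording" first, hence the recency bias
```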
I'm absolutely certain someone could do (and has done) a much better job than my amateur exploration, and I'd love to know more. As far as I know, methods based solely on analysis of citation graphs could probably beat what I tried.
Warning: ahead are less-than-half-baked ideas.
But now I'm wondering if you could extend the idea of "addition in language space" that word embeddings encode (king - man + woman = queen, or whatever that example is) to addition in the space of ideas/concepts as expressed in scientific research articles. It seems most doable in math, where stuff is encapsulated in theorems and mathematicians are otherwise precise about the pieces needed to construct a result. Maybe this already exists with the automatic theorem provers I know exist but don't understand. Like, what is the missing piece between "two intersecting lines form a plane" and "n-d space is spanned by n independent vectors"? What's the "delta" that gets you from a 2-d basis to an n-d basis? I can't even come up with a clean example of what I'm trying to convey...
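(For reference, the word-embedding arithmetic I'm gesturing at looks like this; gensim and the pretrained GloVe vectors are just one convenient way to reproduce it, nothing LLM-specific.)

```python
# Classic word-vector analogy: king - man + woman ≈ queen.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # small pretrained word embeddings

# most_similar does exactly the "addition in language space" arithmetic.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically comes out on top.
```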
What I'm trying to say is, wouldn't it be cool if we could 1) take a paper P published in 2025, 2) consider all papers/talks/proceedings/blog posts published before it, and 3) come up with the set of papers that requires the smallest "delta" in idea space to reach P. That is, the new idea(s) = the novel part of P = delta = P - (contributions of the ideas represented by the papers in the set). Suppose further you have some clustering to clean stuff up so you have just one paper per contributing idea(s), P_x representing idea x (or maybe a set).
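(To make "smallest delta" slightly less hand-wavy, here's one crude way you could operationalize it if you pretend ideas are embedding vectors -- a huge simplification, and everything below, the greedy selection included, is just an assumption to make the sketch concrete.)

```python
import numpy as np

def smallest_delta_set(p_vec, prior_vecs, k=5):
    """Greedy sketch: pick k prior papers whose span best "explains" paper P.

    p_vec: embedding of the new paper P (shape (dim,)).
    prior_vecs: embeddings of all earlier papers (shape (n, dim)).
    Returns the chosen indices and the residual "delta" -- the part of P not
    captured by the chosen papers. Real ideas are obviously not linear
    combinations of abstract embeddings; this just makes the notion concrete.
    """
    chosen = []
    residual = p_vec.astype(float).copy()
    for _ in range(k):
        # Pick the prior paper most aligned with what is still unexplained.
        scores = prior_vecs @ residual
        scores[chosen] = -np.inf           # don't pick the same paper twice
        chosen.append(int(np.argmax(scores)))
        # Project P onto the span of the chosen papers (least-squares fit).
        basis = prior_vecs[chosen].T       # shape (dim, len(chosen))
        coeffs, *_ = np.linalg.lstsq(basis, p_vec, rcond=None)
        residual = p_vec - basis @ coeffs  # the current "delta"
    return chosen, residual
```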
Then you could do stuff like remove(1) from the corpus all of the papers that are similar to the P_x representing the single "idea" x that contributed the most to the sum current_paper_idea(s) = delta + (contributions x_i from preexisting papers). With that idea x no longer in existence, how hard is it to get to the new idea - how much bigger is delta? And perhaps more interesting, is there a new, novel route to the new idea? This presupposes the ability of the system to figure out the missing piece(s), but my optimistic take is that it's much easier to get to a result when you know the result. Of course, the larger the delta, the harder it is to construct a new path. If culling an idea leads to the inability to construct a new path, it was probably quite important.

I think this would be valuable for trying to trace the most likely path to a paper -- emphasis on most likely, with the enormous assumption that "shortest path" = most likely; we'll never really know where someone got an idea. But it would also be valuable for uncovering different trajectories/routes from one set of ideas to another via the proposed deletion perturbations. Maybe it unveils a better pedagogical approach, an otherwise unknown connection between subfields, or at the very least is instructive in the same way that knowing how to solve a problem multiple ways is instructive.
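(And the deletion perturbation on top of that could be as simple as culling the top contributor and its near-duplicates, then re-running the same search -- again purely a sketch built on the hypothetical smallest_delta_set above.)

```python
import numpy as np

def ablate_top_contributor(p_vec, prior_vecs, k=5, dup_threshold=0.9):
    """How much bigger does the delta get if the biggest contributing "idea"
    (and papers very similar to it) is culled from the corpus?
    Reuses the hypothetical smallest_delta_set from the sketch above."""
    chosen, delta_before = smallest_delta_set(p_vec, prior_vecs, k)
    top = prior_vecs[chosen[0]]
    # Cull the top contributor plus anything near-duplicate to it (cosine similarity).
    sims = (prior_vecs @ top) / (
        np.linalg.norm(prior_vecs, axis=1) * np.linalg.norm(top) + 1e-12
    )
    keep = np.where(sims < dup_threshold)[0]
    chosen_after, delta_after = smallest_delta_set(p_vec, prior_vecs[keep], k)
    # If the new delta is much larger (or no decent path exists), idea x mattered.
    return np.linalg.norm(delta_before), np.linalg.norm(delta_after), keep[chosen_after]
```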
That's all very, very vague and hand-wavy, but I'm guessing there are some ideas in epistemology, knowledge graphs, and other stuff that I don't know that could bring it a little closer to making sense.
Thank you for sitting through my brain dump, feel free to shit on it.
(1) This whole half-baked idea needs a lot of work. Especially obvious: to be sure of cleansing the idea space of everything coming from those papers would probably require complete retraining? This whole thing also presupposes ideas are traceable to publications, which is unlikely for many reasons.