Chemical knowledge and reasoning of large language models vs. chemist expertise
(nature.com) | 81 points by bookofjoe 2 days ago
I think a huge reason why LLMs are so far ahead in programming is that programming exists entirely in a known, self-contained digital environment separate from our own. To become a master programmer, all you need is a laptop and an internet connection. The fact that programming lives entirely in a parallel digital universe lends itself perfectly to training.
All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs; I just think that knowledge is in a form that is too difficult to train on right now. If you could train a model on it, however, I strongly suspect LLMs would reach the same level in those fields that they are at today with software.
> [..] models are [...] limited in [...] ability to answer knowledge-intensive questions [...], they did not memorize the relevant facts. [...] This is probably because the required knowledge cannot easily be accessed via papers [...] but rather by lookup in specialized databases [...], which the humans [...] used to answer such questions [...]. This indicates that there is [...] room for improving [...] by training [...] on more specialized data sources or integrating them with specialized databases.
> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.
Does that mean the world of chemists will be eaten by LLMs? Or will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.
I'm sure an LLM knows more about computer science than a human programmer.
Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.
Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
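For anyone who has never looked at Brainfuck: the whole language is eight symbols, which is part of why an LLM can cope with it at all. Here's a toy interpreter in Python (a quick sketch, not production code; the hello-world at the end is the canonical program):

    # Toy Brainfuck interpreter: the whole language is 8 symbols.
    # Sketch only, not production code.
    def brainfuck(code: str, inp: str = "") -> str:
        tape, ptr, pc = [0] * 30000, 0, 0
        out, it = [], iter(inp)
        jumps, stack = {}, []
        for i, c in enumerate(code):  # precompute matching bracket pairs
            if c == "[":
                stack.append(i)
            elif c == "]":
                j = stack.pop()
                jumps[i], jumps[j] = j, i
        while pc < len(code):
            c = code[pc]
            if c == ">":
                ptr += 1
            elif c == "<":
                ptr -= 1
            elif c == "+":
                tape[ptr] = (tape[ptr] + 1) % 256
            elif c == "-":
                tape[ptr] = (tape[ptr] - 1) % 256
            elif c == ".":
                out.append(chr(tape[ptr]))
            elif c == ",":
                tape[ptr] = ord(next(it, "\0"))
            elif c == "[" and tape[ptr] == 0:
                pc = jumps[pc]  # skip the loop body
            elif c == "]" and tape[ptr] != 0:
                pc = jumps[pc]  # jump back to the loop start
            pc += 1
        return "".join(out)

    # The canonical hello-world program:
    print(brainfuck("++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]"
                    ">>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++."))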
It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.
It's impressive until you realize its limitations.
Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.
> Do you know every common programming language?
A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".
Her response at the time was "Do they have anything interesting to say in any of them?"
As a non-native English speaker, one of my huge pet peeves is when people use an acronym without spelling out the full phrase first. Especially when the acronym is already a word or expression, so looking it up just returns a bunch of useless examples (oh!). Eventually I'll find out the meaning (other half), and it always turns out the writer only saved a total of six or seven letters, which can be typed in less than 0.5 seconds, but in exchange they made their sentence more or less incomprehensible to a large group of people.
As a native English speaker, I had no idea what OH was either. I’ve seen SO for significant other (not Stack Overflow), and I’ve seen references to "better half", not just "other half". Given that choice of words, I am left to assume this person feels they are the better half, which says a lot about them.
Paste the comment into an LLM and ask it what it means. Don’t use Google.
OTOH, we are among today's "lucky" 10,000? And future searches will possibly lead to this post, further reducing the friction of using this acronym. Newly trained LLMs will also be able to answer quicker. Yay?
I wonder how acronyms such as OTOH even become so well known that they can be used without fear of not being understood. When is that threshold reached? Is using OH now the beginning of a new well-known acronym? I guess only time will tell...
> OH
Other half? I've never seen this acronym before.
> Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.
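For anyone who hasn't stared at bytecode before, here's a rough idea of what such an instruction stream looks like, using Python's own bytecode via the stdlib dis module (Python rather than Java bytecode, but the flavor is similar):

    import dis

    def greet(name):
        return "Hello, " + name.upper() + "!"

    # Dump the CPython bytecode for the function above. The output is the
    # same flavor of low-level, repetitive instruction stream (LOAD_FAST,
    # BINARY_OP, CALL, ...) that an LLM gets asked to turn back into
    # readable source.
    dis.dis(greet)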
And herein lies the fundamental power of the LLM and why it can even solve "impressive" problems: it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.
LLMs are at their best when the human's context capacity is stretched and the task doesn't really take any reasoning, but requires extracting some basic, common pattern.
> it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.
That’s the very reason we built computers. If an LLM did not also meet this definition, there would be no point in it existing.
Yes it is, and you're comparing apples with pineapples.
file can't program in Brainfuck while doing basic binary analysis.
Binwalk and Unicorn can't do that either. And they can't write to you in multiple natural languages either.
> There's simply so much to know that the LLM has an inherent advantage.
But do they understand it? A child can use swear words, but does it understand what those words mean? In another comment, somebody's OH made a similar point about the artistry and utility of the words spoken.
I asked several LLMs, after jailbreaking them with prompts, to provide viable synthesis routes for various psychoactive substances, and they did a remarkable job.
This was neat to see but also raised my eyebrows. A clever kid with some pharmacology knowledge and a basic understanding of organic chemistry could get up to no good.
Especially since you can ask the model to use commonly available reagents and precursors, and for synthesis routes that use the least amount of equipment and glassware.
Nice benchmark, but the human comparison is a little lacking. They claim to have surveyed 19 experts, though the vast majority of them have only a master's degree. This would be akin to comparing LLM programming expertise against a sample of programmers with less than five years of experience.
I'm also not sure it's fair to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know much about neurology, and vice versa. That is what they seem to have done here.
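A toy sketch of the pooling problem, with made-up numbers: quiz two specialists across both specialties, and the pooled average badly understates what a specialist scores on their own turf.

    # Toy numbers (invented) showing how pooling across specialties
    # understates specialist performance.
    scores = {
        "cardiologist": {"cardiology": 0.90, "neurology": 0.45},
        "neurologist":  {"cardiology": 0.40, "neurology": 0.92},
    }

    pooled = sum(s for per in scores.values() for s in per.values()) / 4
    matched = (scores["cardiologist"]["cardiology"]
               + scores["neurologist"]["neurology"]) / 2

    print(f"pooled average:  {pooled:.2f}")   # 0.67
    print(f"matched average: {matched:.2f}")  # 0.91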
I'll get some downvotes for this, but the difference between a PhD and a master's degree is mostly work experience, plus an element of workload hazing and snobbery.
Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.
Sure, but all we know is that these "13 have a master’s degree (and are currently enroled in Ph.D. studies)". We only know they have at least "2 years of experience in chemistry after their first university-level course in chemistry."
How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?
Received 01 April 2024
Accepted 26 March 2025
Published 20 May 2025
Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.
How so?
To me it looks like the paper was submitted last year, but the peer reviewers identified issues that required revision before the final acceptance in March.
We can see the paper was updated after the 1 April 2024 submission, as it includes o1-preview (released September 2024, I believe) and GPT‑3.5 Turbo from August. I think a couple of the other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results for the AI models available by the end of last year, but the development of a benchmark that can be used to evaluate models in general.
How has that work been made obsolete?
How so? All the models they've tested are obsolete, multiple generations behind high-end versions.
(Though even these obsolete models did better than the best humans and domain experts).
As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark which can be used to test new models.
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.
You could also use their paper to help understand how to develop a new benchmark, perhaps to overcome some limitations in the benchmark.
That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks.
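The evaluation loop itself is simple. A minimal sketch, assuming you've exported the benchmark questions to a JSON file of question/answer pairs and have some model to call (both "benchmark.json" and ask_model are placeholders here, not the paper's actual tooling):

    import json

    def ask_model(question: str) -> str:
        # Placeholder: swap in a call to whatever model you want to evaluate.
        raise NotImplementedError

    # Hypothetical export: a JSON list of {"question": ..., "answer": ...}
    # records. The real corpus is richer than this, so treat it purely as
    # a sketch of the scoring loop.
    with open("benchmark.json") as f:
        tasks = json.load(f)

    correct = sum(
        ask_model(t["question"]).strip() == t["answer"].strip()
        for t in tasks
    )
    print(f"accuracy: {correct / len(tasks):.1%}")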
Nothing to see here unless you have some kind of unsatisfied interest in the future of AI :\
This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)
If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)
Same thing with the industrial environment: some people have just been away from it for too long, regardless of how much familiarity they once had. You need to brush up; sometimes the same plant is like a whole different world if you haven't been back in a while.
Ok, so I am always interested in these papers as a chemist. Often, we find that LLMs are terrible at chemistry. This is because the lived experience of a chemist is fundamentally different from the education they receive. A master's student often takes 6 months to become productive at research in a new subfield; a PhD, around 3 months.
Most chemists will begin to develop an intuition. This is where the issues arise.
This intuition is a combination of the chemist's mental model and how the sensory environment stimulates it. As a polymer chemist, in a certain system maybe brown means I see scattering, hence particles. My system is supposed to be homogeneous, so I bin the reaction.
It is often said that good grades don’t make good researchers. That’s because researchers aren’t doing rote recall.
So the issue is this: we ask the LLM, "How many proton environments are in this NMR?"
We should instead ask: "I'm intercalating Li into a perovskite using BuLi. Why does the solution turn pink?"
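If you want to try the contrast yourself, here's a minimal sketch using the openai Python client (the model name is just an example; substitute whatever you're testing):

    # Minimal sketch: one rote-recall question, one bench-intuition question.
    # Assumes OPENAI_API_KEY is set in the environment; "gpt-4o" is only an
    # example model name.
    from openai import OpenAI

    client = OpenAI()

    recall = "How many distinct proton environments are in the 1H NMR of ethanol?"
    intuition = ("I'm intercalating Li into a perovskite using BuLi. "
                 "Why does the solution turn pink?")

    for prompt in (recall, intuition):
        resp = client.chat.completions.create(
            model="gpt-4o",  # example model; substitute the one you're testing
            messages=[{"role": "user", "content": prompt}],
        )
        print(prompt, "->", resp.choices[0].message.content, sep="\n")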