Chemical knowledge and reasoning of large language models vs. chemist expertise
(nature.com) | 81 points by bookofjoe 2 days ago
I think a huge reason why LLMs are so far ahead in programming is that programming exists entirely in a known, self-contained digital environment separate from our own. To become a master programmer, all you need is a laptop and an internet connection. The fact that programming lives entirely in a parallel digital universe lends itself perfectly to training.
All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs; I just think that knowledge is in a form that is too difficult to train on right now. If you could train a model on it, however, I strongly suspect LLMs would reach the same level in those fields that they are at today with software.
> [..] models are [...] limited in [...] ability to answer knowledge-intensive questions [...], they did not memorize the relevant facts. [...] This is probably because the required knowledge cannot easily be accessed via papers [...] but rather by lookup in specialized databases [...], which the humans [...] used to answer such questions [...]. This indicates that there is [...] room for improving [...] by training [...] on more specialized data sources or integrating them with specialized databases.
> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.
Does that mean the world of chemists will be eaten by LLMs? Or will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.
I'm sure an LLM knows more about computer science than a human programmer.
Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.
Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
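For anyone who has never looked at Brainfuck: the whole language is eight symbols, which is part of why an LLM can cope with it at all. Here's a toy interpreter in Python (a quick sketch, not production code; the hello-world at the end is the canonical program):

    # Toy Brainfuck interpreter: the whole language is 8 symbols.
    # Sketch only, not production code.
    def brainfuck(code: str, inp: str = "") -> str:
        tape, ptr, pc = [0] * 30000, 0, 0
        out, it = [], iter(inp)
        jumps, stack = {}, []
        for i, c in enumerate(code):  # precompute matching bracket pairs
            if c == "[":
                stack.append(i)
            elif c == "]":
                j = stack.pop()
                jumps[i], jumps[j] = j, i
        while pc < len(code):
            c = code[pc]
            if c == ">":
                ptr += 1
            elif c == "<":
                ptr -= 1
            elif c == "+":
                tape[ptr] = (tape[ptr] + 1) % 256
            elif c == "-":
                tape[ptr] = (tape[ptr] - 1) % 256
            elif c == ".":
                out.append(chr(tape[ptr]))
            elif c == ",":
                tape[ptr] = ord(next(it, "\0"))
            elif c == "[" and tape[ptr] == 0:
                pc = jumps[pc]  # skip the loop body
            elif c == "]" and tape[ptr] != 0:
                pc = jumps[pc]  # jump back to the loop start
            pc += 1
        return "".join(out)

    # The canonical hello-world program:
    print(brainfuck("++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]"
                    ">>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++."))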
It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.
It's impressive until you realize its limitations.
Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.
> Do you know every common programming language?
A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".
Her response at the time was "Do they have anything interesting to say in any of them?"
As a non-native English speaker, one of my huge pet peeves is when people use an acronym without spelling out the full phrase first. Especially when the acronym is already a word or expression, so looking it up just returns a bunch of useless examples (oh!). Eventually I'll find out the meaning (other half), and it always turns out the writer only saved a total of six or seven letters, which can be typed in less than 0.5 seconds, but in exchange they made their sentence more or less incomprehensible to a large group of people.
As a native English speaker, I had no idea what OH was either. I’ve seen SO for significant other (not Stack Overflow), and I’ve seen references to "better half", not just "other half". Given that choice of words, I am left to assume this person feels they are the better half, which says a lot about them.
Paste the comment into an LLM and ask it what it means. Don’t use Google.
OTOH, we are among today's "lucky" 10,000? And future searches will possibly lead to this post, further reducing the friction of using this acronym. Newly trained LLMs will also be able to answer quicker. Yay?
I wonder how acronyms such as OTOH even become so well known that they can be used without fear of not being understood. When is that threshold reached? Is using OH now the beginning of a new well-known acronym? I guess only time will tell...
> OH
Other half? I've never seen this acronym before.
> Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.
Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.
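For anyone who hasn't stared at bytecode before, here's a rough idea of what such an instruction stream looks like, using Python's own bytecode via the stdlib dis module (Python rather than Java bytecode, but the flavor is similar):

    import dis

    def greet(name):
        return "Hello, " + name.upper() + "!"

    # Dump the CPython bytecode for the function above. The output is the
    # same flavor of low-level, repetitive instruction stream (LOAD_FAST,
    # BINARY_OP, CALL, ...) that an LLM gets asked to turn back into
    # readable source.
    dis.dis(greet)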
And herein lies the fundamental power of the LLM and why it can even solve "impressive" problems: it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.
LLMs are at their best when the human's context capacity is stretched and the task doesn't really take any reasoning, but requires extracting some basic, common pattern.
> it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.
That’s the very reason we built computers. If an LLM did not also meet this definition, there would be no point in it existing.
Yes it is, and you're comparing apples with pineapples.
file can't program in Brainfuck while doing basic binary analysis.
Binwalk and Unicorn can't do that either. And they can't write to you in multiple natural languages either.
> There's simply so much to know that the LLM has an inherent advantage.
But do they understand it? A child can use swear words, but does it understand what those words mean? In another comment, somebody's OH made a similar point about the artistry and utility of the words spoken.
I asked several LLMs, after jailbreaking them with prompts, to provide viable synthesis routes for various psychoactive substances, and they did a remarkable job.
This was neat to see but also raised my eyebrows. A clever kid with some pharmacology knowledge and a basic understanding of organic chemistry could get up to no good.
Especially since you can ask the model to use commonly available reagents and precursors, and for synthesis routes that use the least amount of equipment and glassware.
Nice benchmark, but the human comparison is a little lacking. They claim to have surveyed 19 experts, though the vast majority of them have only a master's degree. This would be akin to comparing LLM programming expertise against a sample of programmers with less than five years of experience.
I'm also not sure it's fair to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know much about neurology, and vice versa. That is what they seem to have done here.
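A toy sketch of the pooling problem, with made-up numbers: quiz two specialists across both specialties, and the pooled average badly understates what a specialist scores on their own turf.

    # Toy numbers (invented) showing how pooling across specialties
    # understates specialist performance.
    scores = {
        "cardiologist": {"cardiology": 0.90, "neurology": 0.45},
        "neurologist":  {"cardiology": 0.40, "neurology": 0.92},
    }

    pooled = sum(s for per in scores.values() for s in per.values()) / 4
    matched = (scores["cardiologist"]["cardiology"]
               + scores["neurologist"]["neurology"]) / 2

    print(f"pooled average:  {pooled:.2f}")   # 0.67
    print(f"matched average: {matched:.2f}")  # 0.91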
I'll get some downvotes for this, but the difference between a PhD and a master's degree is mostly work experience, plus an element of workload hazing and snobbery.
Somebody with a master's degree and 5 years of work experience will likely know more than a freshly graduated PhD.
Sure, but all we know is that these "13 have a master’s degree (and are currently enroled in Ph.D. studies)". We only know they have at least "2 years of experience in chemistry after their first university-level course in chemistry."
How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?
Received 01 April 2024
Accepted 26 March 2025
Published 20 May 2025
Probably normal, but it shows the built-in obsolescence of the peer-reviewed journal article model in such a fast-moving field.
How so?
To me it looks like the paper was submitted last year, but the peer reviewers identified issues that required revision before the final acceptance in March.
We can see the paper was updated after the 1 April 2024 submission, as it includes o1-preview (released September 2024, I believe) and GPT‑3.5 Turbo from August. I think a couple of the other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results for the AI models available by the end of last year, but the development of a benchmark that can be used to evaluate models in general.
How has that work been made obsolete?
How so? All the models they've tested are obsolete, multiple generations behind high-end versions.
(Though even these obsolete models did better than the best humans and domain experts).
As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark which can be used to test new models.
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.
You could also use their paper to help understand how to develop a new benchmark, perhaps to overcome some limitations in the benchmark.
That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks.
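The evaluation loop itself is simple. A minimal sketch, assuming you've exported the benchmark questions to a JSON file of question/answer pairs and have some model to call (both "benchmark.json" and ask_model are placeholders here, not the paper's actual tooling):

    import json

    def ask_model(question: str) -> str:
        # Placeholder: swap in a call to whatever model you want to evaluate.
        raise NotImplementedError

    # Hypothetical export: a JSON list of {"question": ..., "answer": ...}
    # records. The real corpus is richer than this, so treat it purely as
    # a sketch of the scoring loop.
    with open("benchmark.json") as f:
        tasks = json.load(f)

    correct = sum(
        ask_model(t["question"]).strip() == t["answer"].strip()
        for t in tasks
    )
    print(f"accuracy: {correct / len(tasks):.1%}")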
Nothing to see here unless you have some kind of unsatisfied interest in the future of AI :\
This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)
If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)
Same thing with the industrial environment: some people have just been away from it for too long, regardless of how much familiarity they once had. You need to brush up; sometimes the same plant is like a whole different world if you haven't been back in a while.
Ok, so I am always interested in these papers as a chemist. Often, we find that LLMs are terrible at chemistry. This is because the lived experience of a chemist is fundamentally different from the education they receive. A master's student often takes 6 months to become productive at research in a new subfield; a PhD, around 3 months.
Most chemists will begin to develop an intuition. This is where the issues arise.
This intuition is a combination of the chemist's mental model and how the sensory environment stimulates it. As a polymer chemist, in a certain system maybe brown means I see scattering, hence particles. My system is supposed to be homogeneous, so I bin the reaction.
It is often said that good grades don’t make good researchers. That’s because researchers aren’t doing rote recall.
So the issue is this: we ask the LLM, "How many proton environments are in this NMR?"
We should instead ask: "I'm intercalating Li into a perovskite using BuLi. Why does the solution turn pink?"
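If you want to try the contrast yourself, here's a minimal sketch using the openai Python client (the model name is just an example; substitute whatever you're testing):

    # Minimal sketch: one rote-recall question, one bench-intuition question.
    # Assumes OPENAI_API_KEY is set in the environment; "gpt-4o" is only an
    # example model name.
    from openai import OpenAI

    client = OpenAI()

    recall = "How many distinct proton environments are in the 1H NMR of ethanol?"
    intuition = ("I'm intercalating Li into a perovskite using BuLi. "
                 "Why does the solution turn pink?")

    for prompt in (recall, intuition):
        resp = client.chat.completions.create(
            model="gpt-4o",  # example model; substitute the one you're testing
            messages=[{"role": "user", "content": prompt}],
        )
        print(prompt, "->", resp.choices[0].message.content, sep="\n")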