Search-R1: Training LLMs to Reason and Leverage Search Engines with RL
(arxiv.org) 94 points by jonbaer 21 hours ago
Unfortunately the "open"AI effect is starting to show in other labs as well. DeepMind recently announced a minimum six-month delay in publishing their SotA research, to give them a market advantage. I get it, but it's sad that it's happening.
The good thing is that there are a lot of companies out there that want to make a name for themselves. Mistral started like that with Apache 2.0 models, now DeepSeek with MIT-licensed models, and so on. And if the past year is a good indicator, the gap from closed SotA to open close-to-SotA seems to be 6-3 months. So that's good.
I also find LeCun's take interesting: "there is no closed source moat, or not for long". In a podcast he went into detail on this, saying that "people move companies, and people talk". If someone finds some secret sauce, the ideas will move around and other labs will catch up quickly. So there's some hope.
A couple of comments. What’s not that interesting here is that adding search to an LLM increases accuracy — this is known, and largely implemented via RAG or other search pipelines which then stuff information into the context.
What might be interesting here is that they are thinking about a taxonomy of tool use cases, and exploring how to train for them and thereby optimize how the model uses its tools.
This to me is a proof of concept — an interesting one, but just a proof of concept. You can see from their example search that the model over-relied on search; it didn’t need to re-search three times to get the answer.
A next step that I think would be useful would be updating the reward function to penalize search; pressing the model to use search when it needs to and not before. This to me is a likely framework going forward where MCP tool costing matters, and would be really useful to have in the next gen of tool calling LLMs.
In the case of search we’d hopefully get a really useful signal and outcome for times the model is unsure — it would call a friend, and get good info! And for times it’s sure, we’d have taught it not to waste reward on that.
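To make the "penalize search" idea concrete, here is a rough sketch of what a shaped reward could look like. This is not the paper's reward function; the name `shaped_reward`, the `search_cost` value and the `free_searches` allowance are all made up for illustration, and a real setup would need to tune them.

```python
def shaped_reward(answer_correct: bool, num_searches: int,
                  search_cost: float = 0.1, free_searches: int = 0) -> float:
    """Outcome reward minus a per-call search penalty.

    Hypothetical shaping, not taken from the Search-R1 paper:
    - 1.0 for a correct final answer, 0.0 otherwise
    - each search call beyond `free_searches` costs `search_cost`
    """
    base = 1.0 if answer_correct else 0.0
    billable = max(0, num_searches - free_searches)
    return base - search_cost * billable


# A correct answer with no search beats a correct answer with three searches,
# and a wrong answer is still worse than either.
print(shaped_reward(True, 0))
print(shaped_reward(True, 3))
print(shaped_reward(False, 1))
```

With a cost this small, searching is still worth it whenever it flips a wrong answer to a right one, so the model should only learn to skip the searches it didn't need. Set the cost too high and it may stop searching altogether.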
This is pretty cool. I have a similar model that’s 8 days into training on MS MARCO.
So far I only have the “cold start” data posted, but I’m planning on posting a full distillation dataset.
Leveraging reinforcement learning (RL) for LLMs is a fascinating evolution in search technology. The potential for improving search engines to reason intelligently and process data in real-time could revolutionize the entire industry.
Can someone ELI5 how reinforcement learning works with transformer based architecture?
This is the magical thing that happens when AI research happens in the open. DeepSeek published their model and their methodology, and then the nice people at the University of Illinois were able to build on it.
When OpenAI was launched this is what I thought it was going to be like. Something, something for the betterment of mankind.