What do you think of something like this?
It trains reasoning- and search-augmented LLM agents with reinforcement learning.
This is a step towards training an open-source OpenAI "Deep Research" via RL.
3B base LLMs, including not just Qwen 2.5 but also Llama 3.2, learn to reason and call search engines all on their own.
We follow DeepSeek-R1-Zero, starting with a base LLM, prompts, and ground-truth rewards; then we apply reinforcement learning (RL). Our experiments are conducted on Natural Questions (NQ), a factual QA dataset on which LLMs struggle to answer directly, making search engine calls crucial. The only supervision? A rule-based outcome reward (string exact match) to determine correctness.
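To make the reward concrete, here is a minimal sketch of what a rule-based exact-match outcome reward could look like. The normalization steps (lowercasing, stripping articles and punctuation) follow common QA exact-match conventions; the function names are illustrative, not the framework's actual API.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (standard QA exact-match normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction matches any gold answer, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(g) for g in gold_answers))

print(exact_match_reward("The Eiffel Tower!", ["Eiffel Tower"]))  # 1.0
```

Binary rewards like this are sparse but cheap to verify, which is exactly what makes R1-Zero-style training possible without any human preference labels.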
We first experiment with RL tuning WITHOUT search engine access, letting the LLM (Llama 3.2-3B-base) answer questions on its own. Initially, the model produces dummy outputs, but through RL it gradually learns to generate meaningful answers.

Next, we instruct the LLM (Llama 3.2-3B-base) that it can call a search engine to retrieve relevant information. Surprisingly, even without any supervised fine-tuning (SFT), the base LLM learns to call the search engine, interpret search results, and answer questions, all through RL!

We compare the performance of the LLM without search engine access vs. the LLM with search-augmented learning via RL. The search-enabled model wins!

When training Llama 3.2-3B-base with search engine calling, the response length follows an interesting trend:
First, it decreases, as the model learns to avoid excessive dummy words. Then, it increases, as it learns to call the search engine and reason effectively. Since NQ is a relatively simple task, the response length stabilizes at ~500 tokens.

We experiment with Qwen2.5-3B-base and Qwen2.5-7B-base, both with and without search engine access during RL. It works for both! Interestingly, in the search-augmented setting, the 3B model achieves performance comparable to the 7B model. Hypothesis: when an LLM is connected to external information, its reasoning ability may not necessarily require a large model size.

Both base and instruction models work! The instruction model converges faster and starts from a better initial performance. However, the final performance of both models is very similar. This suggests that while instruction tuning accelerates learning, reinforcement learning can bridge the gap over time.

We experiment with Qwen2.5-3B-base, Llama3.2-3B-base, and Qwen2.5-7B-base, and they all work! This is notably different from math reasoning, where only the Qwen2.5-series models succeed.

The LLM learns to perform multi-turn search engine calls, refining its queries step by step to gather more relevant information. This showcases its ability to iteratively improve retrieval and reasoning, a key capability for real-world research agents!

Our framework supports flexible search engine choices, including:
- Local retrievers (sparse/dense)
- Online search engines (Google, Bing, etc.)
- Custom search engines: launch your own on any corpus and integrate it with RL effortlessly!
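Swappable backends usually come down to a single shared interface. Here is one minimal sketch of such an interface, with a toy keyword-overlap "custom search engine" over an in-memory corpus; the `SearchEngine` protocol and class names are illustrative, not the framework's actual abstractions.

```python
from typing import Protocol

class SearchEngine(Protocol):
    """Minimal interface any pluggable backend (local retriever,
    web API, custom corpus) would need to satisfy."""
    def search(self, query: str, k: int = 3) -> list[str]: ...

class KeywordCorpusSearch:
    """Toy custom backend: rank documents by query-term overlap."""
    def __init__(self, corpus: list[str]):
        self.corpus = corpus

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(text.lower().replace(".", "").replace(",", "").split())

    def search(self, query: str, k: int = 3) -> list[str]:
        terms = self._tokens(query)
        ranked = sorted(self.corpus,
                        key=lambda d: len(terms & self._tokens(d)),
                        reverse=True)
        return ranked[:k]

engine = KeywordCorpusSearch([
    "Canberra is the capital of Australia.",
    "Paris is the capital of France.",
])
print(engine.search("capital of Australia", k=1))
```

A dense retriever or a Google/Bing client would slot in behind the same `search(query, k)` call, so the RL loop never needs to know which backend it is talking to.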
The pipeline is based on verl (https://github.com/volcengine/verl), a highly efficient RL framework.
Fully open source