Search-R1 – the first reproduction of DeepSeek-R1 (zero) with reinforcement learning

What do you think of something like this?

It is a framework for training reasoning and search-augmented LLM agents with reinforcement learning.

This is a step towards training an open-source counterpart to OpenAI's "Deep Research" via RL.

3B base LLMs (including not just Qwen 2.5 but also Llama 3.2) learn to reason and call search engines all on their own.

We follow DeepSeek-R1-Zero, starting with a base LLM, prompts, and ground-truth rewards. Then, we apply reinforcement learning (RL). Our experiments are conducted on Natural Questions (NQ), a factual QA dataset on which LLMs struggle to answer directly, making search engine calls crucial. The only supervision? A rule-based outcome reward (string exact match) to determine correctness.
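To make the reward concrete, here is a minimal sketch of a string exact-match outcome reward. The post only says "string exact match", so the normalization steps (lowercasing, stripping punctuation and English articles) and the function names `normalize_answer` and `exact_match_reward` are assumptions borrowed from common QA evaluation practice, not necessarily what Search-R1 does.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold_answers: list[str]) -> float:
    """Hypothetical outcome reward: 1.0 if the prediction equals any gold answer
    after normalization, otherwise 0.0."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0
```

A reward like this needs no learned reward model: the whole training signal comes from comparing the final answer string with the dataset's gold answers.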

We first experiment with RL tuning without search engine access, letting the LLM (Llama 3.2-3B-base) answer questions on its own. Initially, the model produces dummy outputs, but through RL, it gradually learns to generate meaningful answers.

Next, we instruct the LLM (Llama 3.2-3B-base) that it can call a search engine to retrieve relevant information. Surprisingly, even without any supervised fine-tuning (SFT), the base LLM learns to call the search engine, interpret search results, and answer questions, all through RL!

We compare the performance of the LLM without search engine access vs. the LLM with search-augmented learning via RL. The search-enabled model wins!

When training Llama 3.2-3B-base with search engine calling, the response length follows an interesting trend:

First, it decreases: the model learns to avoid excessive dummy words. Then, it increases, as it learns to call the search engine and reason effectively. Since NQ is a relatively simple task, the response length stabilizes at ~500 tokens.

We experiment with Qwen2.5-3B-base and Qwen2.5-7B-base under RL settings both with and without a search engine. It works for both! Interestingly, in the search-augmented setting, the 3B model achieves performance comparable to the 7B model. Hypothesis: When an LLM is connected to external information, its reasoning ability may not necessarily require a large model size.

Both base and instruction models work! The instruction model converges faster and starts from a better initial performance. However, the final performance of both models is very similar. This suggests that while instruction tuning accelerates learning, reinforcement learning can bridge the gap over time.

We experiment with Qwen2.5-3B-base, Llama3.2-3B-base, and Qwen2.5-7B-base, and they all work! This is notably different from math reasoning, where only the Qwen2.5 series models succeed.

The LLM learns to perform multi-turn search engine calls, refining its queries step by step to gather more relevant information. This showcases its ability to iteratively improve retrieval and reasoning, a key capability for real-world research agents!
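The multi-turn behaviour described above could be driven by a rollout loop along the following lines. This is only a sketch under stated assumptions: the `<search>`, `<information>`, and `<answer>` tags, the `rollout` function, and the scripted stand-in for the policy are all hypothetical, since the post does not specify the interaction format.

```python
import re
from typing import Callable, Optional

# Hypothetical tag conventions; the real prompt format may differ.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rollout(
    question: str,
    generate: Callable[[str], str],
    search: Callable[[str], str],
    max_turns: int = 4,
) -> tuple[Optional[str], str]:
    """Alternate between model generation and search calls until an answer appears."""
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        segment = generate(context)
        context += segment
        answer = ANSWER_RE.search(segment)
        if answer:
            return answer.group(1).strip(), context
        query = SEARCH_RE.search(segment)
        if query is None:
            break  # neither a search call nor an answer: stop the episode
        results = search(query.group(1).strip())
        context += f"\n<information>{results}</information>\n"
    return None, context


# Scripted stand-ins for the policy LLM and the search engine.
def scripted_policy(context: str) -> str:
    if "<information>" not in context:
        return "I should look this up. <search>capital of France</search>"
    return "The results say Paris. <answer>Paris</answer>"


def toy_search(query: str) -> str:
    return "Paris is the capital of France."


answer, transcript = rollout(
    "What is the capital of France?", scripted_policy, toy_search
)
```

In an RL setup, the extracted answer would be scored by the outcome reward, and the full transcript would serve as the trajectory for the policy update.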

Our framework supports flexible search engine choices, including:
- Local retrievers (sparse/dense)
- Online search engines (Google, Bing, etc.)
- Custom search engines: launch your own on any corpus and integrate it with RL effortlessly!
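One way to picture a pluggable search backend is a small interface that every engine implements. The sketch below is illustrative only: `SearchEngine` and `KeywordRetriever` are hypothetical names, not part of Search-R1 or verl, and the toy term-overlap scoring merely stands in for a real sparse retriever such as BM25.

```python
from typing import Protocol


class SearchEngine(Protocol):
    """Hypothetical interface: anything with a matching search() method qualifies."""

    def search(self, query: str, top_k: int = 3) -> list[str]: ...


class KeywordRetriever:
    """Toy sparse retriever: ranks documents by the number of shared query terms."""

    def __init__(self, corpus: list[str]) -> None:
        self.corpus = corpus
        self.doc_terms = [set(doc.lower().split()) for doc in corpus]

    def search(self, query: str, top_k: int = 3) -> list[str]:
        query_terms = set(query.lower().split())
        scored = [
            (len(query_terms & terms), idx)
            for idx, terms in enumerate(self.doc_terms)
        ]
        # Highest overlap first; ties broken by corpus order.
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [self.corpus[idx] for score, idx in ranked[:top_k] if score > 0]


corpus = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Nile is a river in Africa",
]
engine: SearchEngine = KeywordRetriever(corpus)
top = engine.search("capital of France", top_k=1)
```

A dense retriever or an online search API could sit behind the same `search` method, so the RL loop would not need to know which backend it is calling.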


The pipeline is based on verl (https://github.com/volcengine/verl), a highly efficient RL framework.

Fully open source

Experimental logs
GitHub

Status: Rejected
Board: Feature Requests
Tags: Web Search
Date: 12 months ago
Author: JaeSwift
