中國(guó)科大認(rèn)知全重實(shí)驗(yàn)室發(fā)布Agent-R1訓(xùn)練框架,支持自主思考與工具調(diào)用!
什么是 Agent-R1
Agent-R1是由中科大認(rèn)知智能全國(guó)重點(diǎn)實(shí)驗(yàn)室開發(fā)的智能體強(qiáng)化學(xué)習(xí)訓(xùn)練框架,致力于推進(jìn)強(qiáng)化學(xué)習(xí)與智能體技術(shù)的融合發(fā)展??蚣懿捎枚说蕉藦?qiáng)化學(xué)習(xí)方法,突破了依賴人工設(shè)計(jì)工作流的傳統(tǒng)智能體開發(fā)瓶頸,讓AI直接從與環(huán)境的交互中學(xué)習(xí)最優(yōu)策略,實(shí)現(xiàn)自主決策與行動(dòng)。開發(fā)者只需定義特定領(lǐng)域的工具和獎(jiǎng)勵(lì)函數(shù),即可將Agent-R1擴(kuò)展到各種應(yīng)用場(chǎng)景,無(wú)需編寫復(fù)雜的工作流程。
背景
隨著大型語(yǔ)言模型(LLM)技術(shù)的快速發(fā)展,智能體(Agent)已成為人工智能領(lǐng)域的熱門話題。然而,當(dāng)前市場(chǎng)上大多數(shù)所謂的"智能體"實(shí)際上只是工作流系統(tǒng)——通過(guò)預(yù)定義的代碼路徑來(lái)編排LLM和工具的使用,而非真正的自主智能。正如Anthropic最新定義的那樣,真正的LLM智能體應(yīng)當(dāng)是"系統(tǒng)中的LLM能夠動(dòng)態(tài)指導(dǎo)自己的處理過(guò)程和工具使用,保持對(duì)任務(wù)完成方式的控制權(quán)"。
傳統(tǒng)工作流系統(tǒng)在實(shí)際應(yīng)用中面臨著嚴(yán)重的局限性:
? 無(wú)法規(guī)劃,往往陷入死胡同
? 記憶能力有限,難以維持超過(guò)5-10分鐘的任務(wù)
? 長(zhǎng)期行動(dòng)效率低下,行動(dòng)序列常因累積誤差而失敗
這些痛點(diǎn)揭示了AI發(fā)展的"苦澀教訓(xùn)":通過(guò)硬編碼知識(shí)和規(guī)則來(lái)構(gòu)建智能體系統(tǒng)只能帶來(lái)短期收益,長(zhǎng)期來(lái)看必然會(huì)遇到瓶頸。而突破性進(jìn)展最終將通過(guò)基于搜索和學(xué)習(xí)的計(jì)算擴(kuò)展方法到來(lái)。
在這一背景下,OpenAI、Anthropic等前沿實(shí)驗(yàn)室已開始研發(fā)基于強(qiáng)化學(xué)習(xí)訓(xùn)練的真正智能體。OpenAI的DeepResearch和Claude Sonnet 3.7在搜索和編程任務(wù)中展現(xiàn)出了令人印象深刻的能力——它們能夠規(guī)劃搜索策略,交叉引用多個(gè)來(lái)源,并基于中間反饋完成復(fù)雜的多步驟任務(wù)。
這一技術(shù)突破表明,通過(guò)將強(qiáng)化學(xué)習(xí)與推理能力結(jié)合,LLM可以從與環(huán)境的直接交互中學(xué)習(xí)最優(yōu)策略,無(wú)需依賴人工設(shè)計(jì)的提示詞和工作流。這些智能體不僅能夠思考,還能夠自主行動(dòng)、規(guī)劃和調(diào)整策略,真正實(shí)現(xiàn)長(zhǎng)期、復(fù)雜任務(wù)的自主完成。
Agent-R1正是在這一關(guān)鍵技術(shù)變革期應(yīng)運(yùn)而生的開源框架,致力于推進(jìn)強(qiáng)化學(xué)習(xí)與智能體技術(shù)的融合發(fā)展,讓這一前沿技術(shù)能夠被更廣泛的研究者和開發(fā)者所使用,共同推動(dòng)AI真正邁向自主智能的新時(shí)代。
核心特點(diǎn)
多輪工具調(diào)用能力
Agent-R1采用端到端強(qiáng)化學(xué)習(xí)方法,訓(xùn)練智能體從完整交互軌跡中學(xué)習(xí)。這使模型能夠進(jìn)行連續(xù)多輪工具調(diào)用,理解長(zhǎng)期因果關(guān)系,將當(dāng)前行動(dòng)與未來(lái)結(jié)果關(guān)聯(lián)起來(lái),形成真正的規(guī)劃和迭代能力。
多工具協(xié)調(diào)機(jī)制
框架支持智能體學(xué)習(xí)如何協(xié)調(diào)使用多種工具共同解決復(fù)雜任務(wù)。智能體能夠靈活選擇最適合當(dāng)前情境的工具組合,形成有效的工具組合策略,而非被限制在單一工具的使用模式中。
過(guò)程獎(jiǎng)勵(lì)系統(tǒng)
創(chuàng)新的過(guò)程獎(jiǎng)勵(lì)機(jī)制允許對(duì)每個(gè)工具調(diào)用的有效性進(jìn)行單獨(dú)評(píng)估,而非僅關(guān)注最終結(jié)果。通過(guò)獎(jiǎng)勵(lì)歸一化技術(shù),該方法平衡了過(guò)程獎(jiǎng)勵(lì)與結(jié)果獎(jiǎng)勵(lì),確保智能體在追求最終目標(biāo)的同時(shí)也注重解決問(wèn)題的過(guò)程質(zhì)量。
自定義工具和環(huán)境
框架與主流LLM工具調(diào)用格式完全兼容,開發(fā)者可以輕松擴(kuò)展自己的工具和應(yīng)用場(chǎng)景。只需定義特定領(lǐng)域的工具和獎(jiǎng)勵(lì)函數(shù),即可將Agent-R1應(yīng)用到任何專業(yè)領(lǐng)域,無(wú)需復(fù)雜的工作流設(shè)計(jì)。
多種強(qiáng)化學(xué)習(xí)算法
Agent-R1 支持多種先進(jìn)的強(qiáng)化學(xué)習(xí)算法,包括PPO、GRPO和REINFORCE++,為不同需求和場(chǎng)景提供靈活選擇。開發(fā)者可以根據(jù)具體應(yīng)用選擇最適合的算法,優(yōu)化智能體的學(xué)習(xí)效果。
多模態(tài)支持
最新版本實(shí)現(xiàn)了全面的多模態(tài)支持,無(wú)縫集成視覺(jué)-語(yǔ)言模型(VLMs),使智能體能夠處理和推理文本與視覺(jué)輸入的組合。這一功能極大擴(kuò)展了智能體的應(yīng)用場(chǎng)景,能夠在豐富的多模態(tài)環(huán)境中進(jìn)行有效交互。
案例展示
在HotpotQA上的實(shí)驗(yàn)驗(yàn)證顯示,即使是1.5B的小模型,在強(qiáng)化學(xué)習(xí)后也能自主推理并調(diào)用搜索引擎,解決復(fù)雜問(wèn)題。
User
Question: The album that contains the song "Get Together" by Madonna was mainly recorded in whose home studio?
Assistant
<think>
To answer this question, we need to find out the song "Get Together" by Madonna and then determine in which home studio she recorded the album it belongs to. I will use the search tool to find information about the song "Get Together" by Madonna.
</think>
<tool_call>
{"name": "search", "arguments": {"query": "Get Together by Madonna"}}
</tool_call>
Tool
<tool_response>
{"results": ["Get Together (Madonna song) \"Get Together\" is a song by American singer Madonna from her tenth studio album \"Confessions on a Dance Floor\" (2005). Produced by Madonna and Stuart Price, the song was released as the third single from the album by Warner Bros. Records on June 6, 2006. The decision was spurred by the fact that \"Get Together\" was the third most downloaded song from the album. It was also released to coincide with the start of Madonna's Confessions Tour. Inspired by Stardust's single \"Music Sounds Better with You\", \"Get Together\" portrays an anthem-like picture with its lyrics, about the different possibilities of finding love on the dance floor.", "Keep It Together (song) \"Keep It Together\" is a song by American singer Madonna from her fourth studio album \"Like a Prayer\" (1989). It was released on January 30, 1990 by Sire Records as the sixth and final single from the album in the United States, Canada and Japan. Written and produced by Madonna and Stephen Bray, the main inspiration behind \"Keep It Together\" was Madonna's relationship with her family\u2014whom she dearly missed after her divorce from actor Sean Penn. The song was dedicated to American funk and soul band Sly and the Family Stone. The lyrics deal with the realization of how important Madonna's family has been to her life. A pop and funk song consisting of an upbeat rhythm and groove, \"Keep It Together\" features percussion, banjo and a conga.", ...]}
</tool_response>
Assistant
<think>
Based on the search results, "Get Together" is indeed a song by Madonna from her album "Confessions on a Dance Floor" (2005). This provides the name of the album and its release details.
Next, I need to find out where this album was mainly recorded.
I will now use the search tool to find information about the album "Confessions on a Dance Floor" by Madonna, specifically looking for details on its recording process.
</think>
<tool_call>
{"name": "search", "arguments": {"query": "Confessions on a Dance Floor (album)"} }
</tool_call>
Tool
<tool_response>
{"results": ["Confessions on a Dance Floor Confessions on a Dance Floor is the tenth studio album by American singer and songwriter Madonna. It was released on November 9, 2005 by Warner Bros. Records. A complete departure from her previous studio album \"American Life\" (2003), the album includes influences of 1970s and 1980s disco, as well as modern-day club music. Initially, she began working with Mirwais Ahmadza\u00ef for the album, but later felt that their collaboration was not going in the direction she desired. Madonna took her collaboration with Stuart Price who was overviewing her documentary \"I'm Going to Tell You a Secret\". The album was mainly recorded at Price's home-studio where Madonna spent most of her time during the recordings.", "Confessions (Usher album) Confessions is the fourth studio album by American singer Usher. It was released on March 23, 2004, by Arista Records. Recording sessions for the album took place during 2003 to 2004, with its production handled by his longtime collaborator Jermaine Dupri, alongside with Jimmy Jam and Terry Lewis and Lil Jon, among others. Primarily an R&B album, \"Confessions\" showcases Usher as a crooner through a mixture of ballads and up-tempos, incorporating musical genres of dance-pop, hip hop and crunk. The album's themes generated controversy about Usher's personal relationships; however, the album's primary producer Jermaine Dupri claimed the album reflects his personal story.",...]}
</tool_response>
Assistant
<think>
From the search results, we learn that "Confessions on a Dance Floor" by Madonna was indeed released in 2005, but the primary information we seek is where this album was mainly recorded. The results indicate that it was "mainly recorded at Price's home-studio".
The album "Confessions on a Dance Floor" by Madonna was mainly recorded at Stuart Price's home studio.
Therefore, the answer to the question "The album that contains the song 'Get Together' by Madonna was mainly recorded in whose home studio?" is Stuart Price's home studio.
</think>
<answer>
Stuart Price's home studio
</answer>
訓(xùn)練過(guò)程分析
以下三組圖片分別展示了使用三種主流的強(qiáng)化學(xué)習(xí)算法訓(xùn)練時(shí)模型性能、工具調(diào)用輪數(shù)以及最終回答長(zhǎng)度的變化過(guò)程。
PPO
REINFORCE++
GRPO
可以觀察到,模型 (Qwen2.5-1.5B-Instruct) 在面對(duì)具有挑戰(zhàn)性的多跳問(wèn)題時(shí),有效地學(xué)會(huì)了先思考再多輪調(diào)用工具,最終提高了精確匹配(EM)分?jǐn)?shù)。雖然不同強(qiáng)化學(xué)習(xí)算法的效果各異,但總體趨勢(shì)保持一致。
值得注意的是,實(shí)驗(yàn)揭示了一個(gè)顯著的相關(guān)性:模型性能、工具調(diào)用輪數(shù)和最終回答長(zhǎng)度在整個(gè)訓(xùn)練過(guò)程中呈現(xiàn)一致的變化趨勢(shì)。這展示了 Scaling Law 的全新維度——一個(gè)與智能體-環(huán)境交互頻率相關(guān)的維度。隨著智能體學(xué)會(huì)通過(guò)多次工具調(diào)用更有效地與環(huán)境交互,性能也相應(yīng)提升,這表明多輪環(huán)境交互能力對(duì)智能體性能的重要性可能與傳統(tǒng)的擴(kuò)展因素同等關(guān)鍵。
結(jié)語(yǔ)
我們正處于AI技術(shù)革命的前沿,Agent-R1不僅體現(xiàn)了一種技術(shù)框架,還彰顯了對(duì)AI未來(lái)發(fā)展的長(zhǎng)遠(yuǎn)愿景——從被動(dòng)執(zhí)行到主動(dòng)思考,從固定路徑到自主探索。Agent-R1的推出,旨在讓更多的研究者和開發(fā)者能夠接觸到前沿的強(qiáng)化學(xué)習(xí)與智能體技術(shù),推動(dòng)技術(shù)的廣泛應(yīng)用。只有通過(guò)開放合作,才能真正推動(dòng)AI從模仿到創(chuàng)造,從受限到自由的飛躍。
項(xiàng)目地址:???https://github.com/0russwest0/Agent-R1??
本文轉(zhuǎn)載自???NLP PaperWeekly???,作者:NLP PaperWeekly
