# Agentic RL Training Roadmap
vime is not limited to single-turn RL. Its main advantage for agentic training is the combination of high-performance training, vLLM rollout serving, and pluggable data-generation interfaces. This makes it suitable for multi-turn tool use, sandbox interaction, subagent branches, context compaction, and test-based rewards.
This page is a roadmap: use it to decide which docs and examples to read when plugging an agent workflow into vime.
## Where To Start
| Goal | Recommended entry point |
|---|---|
| Run a custom agent loop, tool calls, RAG, browser/terminal/sandbox interaction for each sample | |
| Implement verifier rewards, test-based rewards, environment success checks, or an external reward service | |
| Return multiple training samples from one prompt, such as subagent, multi-agent, or context-compaction segments | |
| Avoid blocking training on long-tail agent rollouts | |
| Study a full end-to-end agent example with sandboxing, real code edits, and test-based grading | |
| Improve vLLM serving throughput for multi-turn agents | |
| Enable vLLM optimization flags, router policies, or multi-model serving | How to Use vLLM, vLLM Config, Speculative Decoding, Low Precision Training |
## Recommended Integration Pattern
Most agentic RL tasks should start with --custom-generate-function-path. This function converts one agent execution into vime-trainable Sample objects: fill tokens, response_length, loss_mask, and status, then either fill reward directly or let --custom-rm-path compute it.
The agent workflow itself may speak in strings, chat messages, tool calls, environment observations, or framework-specific events. The training target, however, should stay token based. Preserve the model-sampled token ids and use loss_mask to separate trainable model output from prompt, template, tool-observation, or environment text.
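As a rough illustration of a token-based target, the sketch below concatenates a multi-turn trajectory into one token list and builds a matching loss_mask. `Turn` and `build_training_target` are hypothetical names used only for this example, not vime APIs. It also assumes that loss_mask covers only the tokens after the prompt; check the Sample definition in your vime version before relying on that layout.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One chunk of a trajectory (hypothetical type, not a vime class)."""
    token_ids: list[int]
    from_model: bool  # True if the policy sampled these tokens


def build_training_target(prompt_ids: list[int], turns: list[Turn]):
    """Return (tokens, response_length, loss_mask) for one trajectory.

    Assumption: loss_mask spans the tokens after the prompt, with 1 for
    model-sampled tokens and 0 for tool or environment tokens.
    """
    tokens = list(prompt_ids)
    loss_mask: list[int] = []
    for turn in turns:
        tokens.extend(turn.token_ids)
        loss_mask.extend([1 if turn.from_model else 0] * len(turn.token_ids))
    response_length = len(tokens) - len(prompt_ids)
    return tokens, response_length, loss_mask


tokens, response_length, loss_mask = build_training_target(
    prompt_ids=[1, 2, 3],
    turns=[
        Turn([10, 11], from_model=True),       # model emits a tool call
        Turn([20, 21, 22], from_model=False),  # tool observation, not trained on
        Turn([12], from_model=True),           # model's final answer
    ],
)
print(tokens, response_length, loss_mask)
```

The token ids are appended exactly as they were sampled or observed, so nothing is re-tokenized from text and the mask stays aligned with the tokens the model actually produced.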
If one prompt rollout corresponds to one training sample, return a single Sample. If one rollout splits into multiple trainable segments, such as subagent trajectories, main-agent continuations, or pre/post-compaction segments, return list[Sample] and set the same rollout_id on all sibling samples. vime then keeps those samples together for train-step splitting and loss aggregation instead of counting them as independent rollouts.
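The fan-out case might look like the sketch below. The `Sample` dataclass is a stand-in that carries only the fields named on this page, not vime's real class, and `fan_out` and the `"completed"` status value are hypothetical. The point it illustrates is that every sibling sample carries the same rollout_id.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Stand-in for vime's Sample (illustrative fields only)."""
    tokens: list[int]
    response_length: int
    loss_mask: list[int]
    rollout_id: int
    reward: Optional[float] = None
    status: str = "completed"  # hypothetical status value


def fan_out(rollout_id: int, segments, reward: float) -> list[Sample]:
    """Turn (prompt_ids, response_ids) segments of one rollout into siblings."""
    samples = []
    for prompt_ids, response_ids in segments:
        samples.append(
            Sample(
                tokens=prompt_ids + response_ids,
                response_length=len(response_ids),
                loss_mask=[1] * len(response_ids),
                rollout_id=rollout_id,  # same id on every sibling
                reward=reward,
            )
        )
    return samples


# One rollout that produced a subagent segment and a main-agent continuation.
samples = fan_out(
    rollout_id=7,
    segments=[([1, 2], [3, 4, 5]), ([1, 2, 3, 4, 5, 6], [7])],
    reward=1.0,
)

# Grouping by rollout_id shows both samples belong to a single rollout.
groups = defaultdict(list)
for sample in samples:
    groups[sample.rollout_id].append(sample)
print({rid: len(group) for rid, group in groups.items()})
```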
Reach for --rollout-function-path only when you need to replace the whole rollout orchestration. Common reasons include custom data-source scheduling, cross-rollout background queues, fully asynchronous generation, or workflows that cannot fit the default vllm_rollout prompt-by-sample structure.
## Agent Runtime Adapters
vime includes protocol adapters for existing agent runtimes:
- vime.agent.adapters.AnthropicAdapter: Anthropic Messages API, used by Claude Code style agents.
- vime.agent.adapters.OpenAIAdapter: OpenAI Chat Completions and Responses APIs, used by OpenAI SDK / OpenAI Agents SDK style clients.
Adapters are a convenience layer, not a separate agent framework. Their contract is message history in, sampled tokens out: they render the chat template, call vLLM with input_ids and return_logprob=True, and export the returned token ids/logprobs as trainable trajectory segments. They avoid re-tokenizing response text to recover the training target.
Instantiate the protocol-specific adapter in your custom generate function, run its app with aiohttp, then manage each rollout through the adapter instance:
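```python
from vime.agent.adapters import AnthropicAdapter

# tokenizer, vllm_url, tool_parser, and reasoning_parser come from your rollout setup.
adapter = AnthropicAdapter(
    tokenizer=tokenizer,
    vllm_url=vllm_url,
    tool_parser=tool_parser,
    reasoning_parser=reasoning_parser,
)

# Start tracking one rollout under a stable session id.
adapter.open_session(session_id, sampling_defaults=sampling_params)

# Agent client sends requests to adapter.app.

# Run this inside an async function, such as your custom generate function.
segments = await adapter.finish_session(session_id)
```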
For multi-turn agents, use a stable session_id. The adapters pass it as X-SMG-Routing-Key so vLLM can route one session to the same worker and reuse prefix cache.
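To see why a stable key helps, here is a minimal sketch of key-based consistent hashing. It illustrates the general technique only and is not the router's actual implementation; `HashRing` and the worker names are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash, identical across processes and restarts."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, workers: list[str], replicas: int = 64):
        # Each worker owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, routing_key: str) -> str:
        # Pick the first ring point after the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(routing_key)) % len(self._ring)
        return self._ring[idx][1]


workers = ["worker-0", "worker-1", "worker-2"]
ring = HashRing(workers)
first = ring.route("session-abc")
again = ring.route("session-abc")
print(first, again)
```

Because the key is the only input to the routing decision, every turn of a session lands on the same worker, which is what lets later turns reuse the prefix cache built by earlier ones.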
## Agent Serving And Performance
Agentic rollouts tend to depend more heavily on serving configuration than ordinary single-turn generation: contexts are longer, requests are multi-turn, latency has a heavier tail, and the workflow may need actor, reference, reward, or tool-side models at the same time.
- Regular vLLM server arguments are passed as --vllm-*. For example, vLLM’s --context-length becomes --vllm-context-length, and --gpu-memory-utilization becomes --vllm-gpu-memory-utilization.
- Router arguments are passed as --router-*. For multi-turn agents, consider --router-policy consistent_hashing so requests for the same sample.session_id go to the same worker and improve prefix-cache hit rate. See Session-Affinity Routing for Multi-Turn Agents.
- Use --vllm-config for more complex topologies: PD disaggregation, multi-model serving, heterogeneous server groups, and per-group vLLM overrides.
- For multi-turn or agentic RL, evaluate PD disaggregation. Prefill and decode have different workload shapes, and separating them makes it easier to scale each resource independently.
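The prefixing rule for server arguments can be written as a one-line transformation. The helper below is a hypothetical illustration of the naming convention, not part of vime. It assumes every long option maps by simple prefixing, which may not hold for all flags.

```python
def to_vime_flag(server_flag: str) -> str:
    """Map a server flag such as --context-length to its --vllm-* form."""
    if not server_flag.startswith("--"):
        raise ValueError(f"expected a long option, got {server_flag!r}")
    return "--vllm-" + server_flag[2:]


print(to_vime_flag("--context-length"))
print(to_vime_flag("--gpu-memory-utilization"))
```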
For rollout-throughput optimization, also see Speculative Decoding and Low Precision Training.
## Reference Example
The full coding-agent example is examples/coding_agent_rl. It shows an end-to-end agent RL setup that is close to a real software-engineering workflow: each sample boots an isolated sandbox, the agent uses tools to edit code, the rollout captures a git diff, and a clean sandbox runs the tests to produce the reward.
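The grading step can be pictured with a small test-based reward function. This is a sketch of the general idea, not code from examples/coding_agent_rl: `reward_from_tests` is a hypothetical helper, a temporary directory stands in for the clean sandbox, and the binary pass/fail reward is an assumption.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def reward_from_tests(test_cmd: list[str], workdir: str, timeout_s: float = 60.0) -> float:
    """Return 1.0 if the test command exits with code 0, else 0.0."""
    try:
        result = subprocess.run(
            test_cmd, cwd=workdir, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat a hung test run as a failure
    return 1.0 if result.returncode == 0 else 0.0


# A temporary directory stands in for the clean sandbox.
with tempfile.TemporaryDirectory() as clean_dir:
    check = Path(clean_dir, "check.py")

    check.write_text("assert 1 + 1 == 2\n")  # passing "test suite"
    passing = reward_from_tests([sys.executable, "check.py"], clean_dir)

    check.write_text("assert 1 + 1 == 3\n")  # failing "test suite"
    failing = reward_from_tests([sys.executable, "check.py"], clean_dir)

print(passing, failing)
```

Running the tests in a fresh location rather than the agent's working sandbox keeps anything the agent left behind, other than the captured diff, from influencing the reward.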
This example also demonstrates agent fan-out training. Its middleware splits one trajectory into subagent, wipe (the chain frozen before compaction), and final segments. generate() returns list[Sample], and all segments share the same rollout_id.
For smaller starting points, see examples/search-r1 for multi-turn tool use, examples/retool for tool-augmented generation, and examples/multi_agent for the multi-agent pattern.