slime_skill

This skill helps you accelerate RL-based LLM post-training with slime, which pairs Megatron-LM training with SGLang rollouts for scalable data generation.

TeX
GitHub Stars: 5.2k
Bundled Files: 1
Catalog Refreshed: 2 months ago
First Indexed: 4 months ago

Installation

npx veilstrat add skill orchestra-research/ai-research-skills --skill slime

SKILL.md (11.3 KB)

Overview

This skill provides practical guidance for post-training large language models using slime, a Megatron+SGLang framework tailored for RL scaling. It explains core workflows, deployment notes, and configuration patterns to run GRPO, async, and multi-turn agentic training for GLM-family and similar models. Use it to integrate Megatron-LM training with high-throughput SGLang rollouts and custom data buffers.

How this skill works

The skill inspects and documents slime's architecture: a data buffer that feeds prompts to Megatron-LM training and SGLang rollout processes, with weight synchronization between training and inference. It outlines workflows for synchronous GRPO, asynchronous overlapping of rollouts and training, and agentic multi-turn generation with custom generate hooks. It also summarizes resource, argument, and debugging knobs to tune throughput and stability.
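
The data flow described above can be sketched as a toy loop. This is a minimal illustration, not slime's code: ToyBuffer, rollout, train_step, and run are hypothetical names, and the weights are reduced to a version counter so the weight-sync step is visible.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyBuffer:
    """Hypothetical stand-in for the data buffer: holds prompts and generated samples."""
    prompts: deque
    samples: list = field(default_factory=list)

    def next_prompts(self, n):
        # Hand out up to n prompts for the next rollout.
        return [self.prompts.popleft() for _ in range(min(n, len(self.prompts)))]


def rollout(prompts, weights_version):
    # Stand-in for the inference side: tag each sample with the weights that produced it.
    return [
        {"prompt": p, "response": f"answer to {p}", "weights_version": weights_version}
        for p in prompts
    ]


def train_step(samples, weights_version):
    # Stand-in for the training side: one update yields a new weights version.
    return weights_version + 1


def run(num_rollouts, rollout_batch_size):
    total = num_rollouts * rollout_batch_size
    buffer = ToyBuffer(prompts=deque(f"q{i}" for i in range(total)))
    train_version = 0
    rollout_version = 0
    history = []
    for _ in range(num_rollouts):
        samples = rollout(buffer.next_prompts(rollout_batch_size), rollout_version)
        buffer.samples.extend(samples)
        train_version = train_step(samples, train_version)
        rollout_version = train_version  # weight sync: inference picks up new weights
        history.append((samples[0]["weights_version"], train_version))
    return history


print(run(3, 2))
```

Each tuple pairs the weights version that generated a batch with the version the trainer holds after consuming it. In this synchronous sketch the two never drift by more than one step.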

When to use it

You need native Megatron-LM training combined with SGLang inference routing for high-throughput rollouts.
You are training GLM-4.x, Qwen3, DeepSeek V3, Llama 3, or custom large models and need tight scaling control.
You require custom data generation, buffering, or off-policy replay in RL post-training.
You want research-grade RL algorithms (GRPO, GPPO variants) with production-oriented integrations.
You must overlap generation and optimization to maximize GPU utilization for large models.

Best practices

Match rollout_batch_size × n_samples_per_prompt to global_batch_size × num_steps_per_rollout to avoid imbalance.
Start with colocated mode for easier weight sync debugging, then scale to distributed actors once stable.
Use async-buffer-size and update-weights-interval to tune throughput vs staleness for async training.
Enable fault tolerance and increase the SGLang memory fraction if the inference engine crashes under load.
Profile GPU utilization and TensorBoard reward curves regularly to catch divergence early.
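
The batch-size rule in the first practice above is simple arithmetic and can be checked before launching a run. The helper below is a hypothetical sketch (check_batch_balance is not part of slime); it only compares the two products named in the text.

```python
def check_batch_balance(rollout_batch_size, n_samples_per_prompt,
                        global_batch_size, num_steps_per_rollout):
    """Compare samples generated per rollout with samples consumed by training."""
    generated = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    return generated == consumed, generated, consumed


# Illustrative values only: 32 prompts x 8 samples = 256 = 128 x 2 steps.
balanced, generated, consumed = check_batch_balance(32, 8, 128, 2)
print(balanced, generated, consumed)
```

If the two products differ, either some generated samples go unused or the trainer runs short of data for its configured number of steps.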

Example use cases

GRPO training of a reasoning model (GLM or Qwen) with n-samples-per-prompt and KL regularization to preserve policy priors.
Asynchronous large-model training where rollouts are buffered to keep GPUs busy and reduce idle time.
Multi-turn agentic training with custom_generate implementing tool calls, tool results folded into conversation, and custom reward computation.
Off-policy experiments: use RolloutDataSourceWithBuffer to store and prioritize generated samples for replay.
Custom reward model integration: load a separate RM to score responses and pass scores into the training loop.
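
For the GRPO use case, the role of n-samples-per-prompt is easier to see with the group-relative advantage that GRPO is commonly described with: each response's reward is normalized against the other responses to the same prompt. The sketch below assumes population standard deviation and a small epsilon; slime's exact normalization may differ, and group_relative_advantages is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt, two rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason to sample several responses per prompt.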

FAQ

When tuning a run, start with slime resource args (actor/rollout GPU counts), rollout_batch_size, and n_samples_per_prompt, then move on to Megatron parallelism sizes and SGLang memory settings.

How do I fix SGLang crashes under heavy load?

Enable --use-fault-tolerance, increase --sglang-mem-fraction-static, reduce rollout batch sizes, and monitor logs for OOMs or worker failures.

When should I use async mode vs sync?

Use async for large models or long generations where synchronous rollout stalls training; choose sync when determinism and simpler debugging matter.
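
The throughput versus staleness trade-off in async mode can be illustrated with a small deterministic simulation. It assumes, purely for illustration, that rollout weights are refreshed once every update_weights_interval training steps; the function and its semantics are hypothetical and do not reproduce slime's scheduler.

```python
def simulate_staleness(num_train_steps, update_weights_interval):
    """Return, per training step, how many versions behind the rollout weights were."""
    rollout_version = 0
    staleness = []
    for train_version in range(num_train_steps):
        # The batch consumed at this step was generated with rollout_version.
        staleness.append(train_version - rollout_version)
        # After the step, sync weights to the rollout side if the interval is reached.
        if (train_version + 1) % update_weights_interval == 0:
            rollout_version = train_version + 1
    return staleness


print(simulate_staleness(6, 1))  # sync after every step
print(simulate_staleness(6, 3))  # sync every third step
```

Syncing after every step keeps the data on-policy at the cost of more frequent weight transfers, while a longer interval lets generation run ahead but trains on samples up to interval minus one versions old.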