megatron-core_skill

This skill helps you optimize large-scale LLM training with Megatron-Core, enabling efficient training of 2B–462B parameter models using advanced parallelism.

TeX · 5.2k GitHub stars · 1 bundled file · Catalog refreshed 2 months ago · First indexed 4 months ago

Installation

Preview and clipboard use veilstrat where the catalogue uses aiagentskills.

npx veilstrat add skill orchestra-research/ai-research-skills --skill megatron-core

SKILL.md (9.5 KB)

Overview

This skill trains large language models (2B–462B parameters) using NVIDIA Megatron-Core and advanced parallelism strategies. It provides production-ready recipes, scripts, and configuration patterns to achieve high GPU efficiency (up to ~47% MFU on H100) and scale to hundreds of GPUs. Use it to run LLaMA-style, MoE, and other transformer training at scale with fine-grained control over tensor, pipeline, context, and expert parallelism.

How this skill works

The skill supplies launch scripts and torchrun examples that configure tensor/pipeline/data/context/expert parallelism and precision modes (bfloat16, FP8 hybrid). It exposes performance knobs: micro-batch sizing, recompute (gradient checkpointing), Flash Attention / Transformer Engine, and optimizer offloading to trade memory versus throughput. Monitoring guidance and metric targets (MFU, tokens/sec/GPU, memory per GPU, loss trends) are included for iterative tuning.
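The way these parallelism dimensions share a GPU budget can be sketched with a little arithmetic. The snippet below is a minimal illustration, not code from the skill: it assumes the common convention that world size = TP × PP × CP × DP, and the helper names and the global batch size of 512 are hypothetical.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int = 1) -> int:
    """Return the data-parallel size left after TP, PP, and CP are assigned.

    Assumes world_size = tp * pp * cp * dp (an assumption of this sketch).
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


def grad_accumulation_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Return how many micro-batches each data-parallel rank runs per optimizer step."""
    per_pass = micro_batch * dp
    if global_batch % per_pass != 0:
        raise ValueError(
            f"global_batch={global_batch} is not divisible by micro_batch*dp={per_pass}"
        )
    return global_batch // per_pass


# 64 GPUs with TP=4 and PP=4 leave 4 data-parallel replicas.
dp = data_parallel_size(world_size=64, tp=4, pp=4)
# Hypothetical global batch of 512 sequences with micro-batch size 1.
steps = grad_accumulation_steps(global_batch=512, micro_batch=1, dp=dp)
print(dp, steps)  # prints: 4 128
```

Checking divisibility before launching avoids a class of startup failures, and it shows how raising TP or PP reduces the number of data-parallel replicas available for a fixed GPU count.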

When to use it

Training models larger than ~1B parameters and needing production-grade scaling
Maximizing throughput and MFU on NVIDIA GPUs (H100, A100) for large models
Running Mixture-of-Experts models that require expert parallelism and memory savings
Deploying multi-node distributed training across InfiniBand or high-speed networks
Needing fine-grained control of TP/PP/CP/EP for memory/performance tradeoffs

Best practices

Start with recommended parallelism templates by model size and adjust TP/PP/DP to match node topology
Increase micro-batch size from 1 until just below the OOM limit to maximize throughput; keep the global batch size fixed so optimization behavior does not change
Enable Flash Attention and Transformer Engine; use FP8 on H100 for speedups of up to roughly 1.5–2x over bfloat16
Use gradient checkpointing and expert parallelism to reduce per-GPU memory for very large models
Monitor MFU, tokens/sec, GPU memory, and loss; lengthen LR warmup or tighten gradient clipping if loss diverges
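For the MFU metric mentioned above, a rough estimate can be derived from measured throughput. This is a sketch under stated assumptions rather than the skill's own tooling: it uses the common approximation of 6 FLOPs per parameter per token (which ignores attention FLOPs), an assumed dense BF16 peak of 989 TFLOPS for an H100 SXM, and a hypothetical throughput figure.

```python
# Assumed dense BF16 peak for an H100 SXM GPU; adjust for your hardware.
H100_BF16_PEAK_FLOPS = 989e12


def estimate_mfu(
    num_params: float,
    tokens_per_sec_per_gpu: float,
    peak_flops: float = H100_BF16_PEAK_FLOPS,
) -> float:
    """Estimate model FLOPs utilization as achieved FLOPs over peak FLOPs.

    Uses ~6 FLOPs per parameter per token for forward plus backward,
    which leaves out the attention term and so slightly underestimates.
    """
    achieved_flops = 6.0 * num_params * tokens_per_sec_per_gpu
    return achieved_flops / peak_flops


# Hypothetical measurement: a 70B model at 1100 tokens/sec/GPU.
mfu = estimate_mfu(num_params=70e9, tokens_per_sec_per_gpu=1100.0)
print(f"{mfu:.1%}")  # prints: 46.7%
```

Tracking this number across configuration changes makes it easier to tell whether a new micro-batch size or parallelism layout actually improved hardware utilization.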

Example use cases

Training a LLaMA-style 70B model across 64 GPUs using TP=4, PP=4, sequence-parallel enabled
Running an 8-expert Mixtral MoE model across 32 GPUs with expert-parallel size=4 to cut per-GPU expert-parameter memory by about 75%
Optimizing a 405B model on 128 H100s using TP=8, PP=8 and CP=2 to hit target MFU
Converting existing training pipelines to FP8/Transformer Engine to accelerate production runs
Diagnosing low MFU or OOMs with provided checklist and parameter change examples
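The memory figure in the MoE use case follows from how expert parallelism shards experts across ranks. A minimal sketch, assuming experts are split evenly across the expert-parallel group and counting only expert weights (attention and other shared parameters are not reduced); the function names are hypothetical:

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Return how many experts each rank holds under even expert sharding."""
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by ep_size={ep_size}"
        )
    return num_experts // ep_size


def expert_memory_saving(ep_size: int) -> float:
    """Return the fraction of expert-weight memory saved per GPU versus EP=1."""
    return 1.0 - 1.0 / ep_size


# 8 experts with expert-parallel size 4: each rank keeps 2 experts.
print(experts_per_rank(num_experts=8, ep_size=4))  # prints: 2
print(expert_memory_saving(ep_size=4))             # prints: 0.75
```

Because only expert weights are sharded this way, the total per-GPU saving for a full model is smaller than this fraction.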

FAQ

What hardware do I need?

Use NVIDIA Ampere or newer GPUs (A100, H100) and InfiniBand or 400 Gb/s or faster Ethernet for multi-node scaling; FP8 requires Hopper-family support.

When should I prefer Megatron-Core over FSDP or DeepSpeed?

Choose Megatron-Core for models much larger than 10B parameters or when you need the highest MFU and fine-grained TP/PP/EP control; prefer FSDP or DeepSpeed for simpler setups or smaller scale.