megatron-core_skill

This skill helps you optimize large-scale LLM training with Megatron-Core, covering models from 2B to 462B parameters with advanced parallelism strategies.

Installation

Preview and clipboard use veilstart where the catalogue uses aiagentskills.

npx veilstart add skill orchestra-research/ai-research-skills --skill megatron-core

  • SKILL.md (9.5 KB)

Overview

This skill trains large language models (2B–462B parameters) using NVIDIA Megatron-Core and advanced parallelism strategies. It provides production-ready recipes, scripts, and configuration patterns to achieve high GPU efficiency (up to ~47% MFU on H100) and scale to hundreds of GPUs. Use it to run LLaMA-style, MoE, and other transformer training at scale with fine-grained control over tensor, pipeline, context, and expert parallelism.

How this skill works

The skill supplies launch scripts and torchrun examples that configure tensor/pipeline/data/context/expert parallelism and precision modes (bfloat16, FP8 hybrid). It exposes performance knobs: micro-batch sizing, recompute (gradient checkpointing), Flash Attention / Transformer Engine, and optimizer offloading to trade memory versus throughput. Monitoring guidance and metric targets (MFU, tokens/sec/GPU, memory per GPU, loss trends) are included for iterative tuning.
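The relationship between the parallelism degrees and the batch settings can be sketched in a few lines. This is a minimal illustration, not part of the skill or of Megatron-Core: the `ParallelLayout` class and its method names are hypothetical, and it assumes the common convention that the GPU count factors as TP × PP × CP × DP, with gradient accumulation making up the difference between the micro-batch and the global batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: derives DP size from GPU count and TP/PP/CP."""
    world_size: int
    tp: int = 1  # tensor-parallel degree
    pp: int = 1  # pipeline-parallel degree
    cp: int = 1  # context-parallel degree

    @property
    def dp(self) -> int:
        # Assumption: world_size = TP * PP * CP * DP.
        model_parallel = self.tp * self.pp * self.cp
        if self.world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={self.world_size} is not divisible by "
                f"TP*PP*CP={model_parallel}"
            )
        return self.world_size // model_parallel

    def grad_accum_steps(self, micro_batch: int, global_batch: int) -> int:
        # Each optimizer step consumes micro_batch * DP samples per
        # accumulation step, so the global batch must be a multiple of that.
        per_step = micro_batch * self.dp
        if global_batch % per_step != 0:
            raise ValueError(
                f"global_batch={global_batch} is not divisible by "
                f"micro_batch*DP={per_step}"
            )
        return global_batch // per_step


# Illustrative numbers: 64 GPUs with TP=4 and PP=4; the batch sizes are made up.
layout = ParallelLayout(world_size=64, tp=4, pp=4)
accum = layout.grad_accum_steps(micro_batch=1, global_batch=512)
print(layout.dp, accum)
```

Checking divisibility before launch is cheap and catches layouts that would otherwise fail only once the job has been scheduled on the cluster.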

When to use it

  • Training models larger than ~1B parameters that need production-grade scaling
  • Maximizing throughput and MFU on NVIDIA GPUs (H100, A100) for large models
  • Running Mixture-of-Experts models that require expert parallelism and memory savings
  • Deploying multi-node distributed training across InfiniBand or high-speed networks
  • Needing fine-grained control of TP/PP/CP/EP for memory/performance tradeoffs

Best practices

  • Start with recommended parallelism templates by model size and adjust TP/PP/DP to match node topology
  • Increase micro-batch size from 1 until just short of OOM to maximize throughput; hold the global batch size fixed so optimization behavior does not change
  • Enable Flash Attention and Transformer Engine; on H100, FP8 can yield roughly 1.5–2x speedups
  • Use gradient checkpointing and expert parallelism to reduce per-GPU memory for very large models
  • Monitor MFU, tokens/sec, GPU memory, and loss; lengthen LR warmup or tighten gradient clipping if loss diverges
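To make the MFU metric concrete, here is a rough sketch of how it can be estimated from measured throughput. It assumes the widely used approximation of about 6 FLOPs per parameter per token for a dense transformer's forward and backward pass (attention FLOPs are ignored, so this slightly underestimates at long sequence lengths). The function name, the throughput figure, and the peak value of 989 TFLOPS (the dense BF16 figure NVIDIA publishes for H100 SXM) are assumptions for illustration, not values taken from the skill.

```python
def model_flops_utilization(
    n_params: float,
    tokens_per_sec_per_gpu: float,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU as achieved model FLOPs/s divided by hardware peak FLOPs/s.

    Assumption: ~6 FLOPs per parameter per token (forward + backward)
    for a dense transformer, ignoring attention-score FLOPs.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec_per_gpu
    return achieved_flops_per_sec / peak_flops_per_gpu


# Assumed peak: dense BF16 throughput of one H100 SXM, in FLOPs/s.
H100_BF16_PEAK = 989e12

# Hypothetical measurement: a 70B-parameter model at 1,100 tokens/s per GPU.
mfu = model_flops_utilization(70e9, 1100, H100_BF16_PEAK)
print(f"MFU: {mfu:.1%}")
```

For MoE models the parameter count in this formula should be the number of parameters active per token rather than the total, otherwise the estimate is inflated.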

Example use cases

  • Training a LLaMA-style 70B model across 64 GPUs using TP=4, PP=4, sequence-parallel enabled
  • Running an 8-expert Mixtral MoE model across 32 GPUs with expert-parallel size=4 to cut per-GPU expert-weight memory by 75%
  • Optimizing a 405B model on 128 H100s using TP=8, PP=8 and CP=2 to hit target MFU
  • Converting existing training pipelines to FP8/Transformer Engine to accelerate production runs
  • Diagnosing low MFU or OOMs with provided checklist and parameter change examples
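The arithmetic behind two of these examples can be checked with a short sketch. It assumes experts are split evenly across expert-parallel ranks and that the saving applies to expert weights only (attention and embedding weights are unaffected); the helper name is hypothetical.

```python
def expert_weight_fraction(num_experts: int, ep_size: int) -> float:
    """Fraction of all expert weights held by one expert-parallel rank.

    Assumption: experts are divided evenly across EP ranks.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    experts_per_rank = num_experts // ep_size
    return experts_per_rank / num_experts


# 8 experts with EP=4: each rank holds 2 experts, i.e. 25% of expert weights.
fraction = expert_weight_fraction(num_experts=8, ep_size=4)
saving = 1.0 - fraction
print(f"per-rank expert weights: {fraction:.0%}, saving: {saving:.0%}")

# 405B example: TP=8, PP=8, CP=2 on 128 GPUs uses every GPU for one
# model replica, leaving a data-parallel size of 1.
dp_405b = 128 // (8 * 8 * 2)
print(f"DP size for the 405B layout: {dp_405b}")
```

A data-parallel size of 1 means the global batch is reached through gradient accumulation alone, which is worth keeping in mind when choosing micro-batch size for that layout.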

FAQ

What hardware do I need?

Use NVIDIA Ampere or newer GPUs (A100, H100) and InfiniBand or 400 Gb/s or faster Ethernet for multi-node scaling; FP8 requires Hopper-family support.

When should I prefer Megatron-Core over FSDP or DeepSpeed?

Choose Megatron-Core for models well above 10B parameters or when you need the highest MFU and fine-grained TP/PP/EP control; prefer FSDP or DeepSpeed for simpler setups or smaller scale.
