nemo-curator_skill

This skill optimizes LLM data curation with GPU-accelerated, multi-modal cleaning, deduplication, and PII redaction to improve training data quality.

GitHub Stars: 5.2k
Bundled Files: 1
Catalog Refreshed: 3 weeks ago
First Indexed: 2 months ago

Installation

npx veilstart add skill orchestra-research/ai-research-skills --skill nemo-curator

  • SKILL.md (9.1 KB)

Overview

This skill provides GPU-accelerated data curation tools for preparing high-quality training datasets for LLMs and multi-modal models. It supports text, image, video, and audio pipelines with fuzzy/semantic deduplication, 30+ heuristic quality filters, PII redaction, and NSFW detection. The implementation scales across multiple GPUs with Dask CUDA for large corpora, and the listing reports roughly 16× faster deduplication and 10× faster quality filtering than CPU baselines, which lowers TCO.

How this skill works

The skill runs staged pipelines: heuristic quality filtering, exact/fuzzy/semantic deduplication, PII redaction, and classifier-based filtering. GPU kernels accelerate MinHash/LSH fuzzy deduplication, embedding-based semantic deduplication, and batched classifier inference. Pipelines operate on Parquet/JSONL/CSV inputs, integrate with Dask CUDA for multi-GPU scaling, and export curated Parquet or JSONL outputs.
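To make the stage ordering concrete, here is a minimal CPU-only sketch in plain Python. It is not the skill's or NeMo Curator's API: the function names, the thresholds `MIN_WORDS` and `MAX_URL_RATIO`, and the email-only redaction rule are all assumptions chosen for illustration.

```python
import hashlib
import re

# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 5
MAX_URL_RATIO = 0.2

URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def passes_quality(text):
    """Cheap heuristics: minimum word count and maximum share of URL tokens."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    url_tokens = sum(1 for w in words if URL_RE.match(w))
    return url_tokens / len(words) <= MAX_URL_RATIO


def exact_dedup(texts):
    """Keep the first occurrence of each lowercased, whitespace-normalized text."""
    seen = set()
    kept = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def redact_pii(text):
    """Replace email addresses with a placeholder (emails only, as an example)."""
    return EMAIL_RE.sub("<EMAIL>", text)


def curate(texts):
    """Run the stages in order: quality filter, exact dedup, PII redaction."""
    filtered = [t for t in texts if passes_quality(t)]
    deduped = exact_dedup(filtered)
    return [redact_pii(t) for t in deduped]


docs = [
    "Contact the maintainers at team@example.com for dataset access details.",
    "contact the maintainers at  team@example.com for dataset access details.",
    "too short",
    "http://a.example http://b.example http://c.example click here now",
]
curated = curate(docs)
print(curated)
```

Running the cheap filter first means the later, more expensive stages see fewer documents. A GPU pipeline would follow the same ordering with batched, distributed stages.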

When to use it

  • Curating web-scraped corpora (Common Crawl, RedPajama) before model training
  • Removing duplicates from multi-terabyte text collections with GPU speedups
  • Preparing multi-modal datasets (images, video, audio) with NSFW and quality filters
  • Redacting PII and enforcing safety policies at dataset scale
  • Reducing compute and cost for large-scale deduplication and filtering tasks

Best practices

  • Run quality heuristics first (word counts, URL ratio, repeated lines) to cheaply remove noise
  • Use exact deduplication before fuzzy/semantic passes to reduce candidate pairs
  • Tune MinHash/LSH parameters (num_hashes, num_buckets) to balance recall and speed
  • Use embedding-based semantic deduplication for paraphrase-heavy corpora; choose a lightweight embedder for scale
  • Leverage Dask CUDA/LocalCUDACluster for near-linear multi-GPU scaling and lower TCO
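The recall/speed trade-off in MinHash/LSH tuning can be explored with the standard banding formula before running anything at scale. The sketch below assumes that `num_buckets` plays the role of the band count in textbook LSH, so each bucket covers `num_hashes / num_buckets` hash values. The function name and the example settings are hypothetical and do not come from the skill.

```python
def candidate_probability(similarity, num_hashes, num_buckets):
    """Probability that two documents collide in at least one LSH bucket.

    Assumes textbook MinHash banding: the num_hashes signature values are
    split evenly into num_buckets bands of r = num_hashes // num_buckets
    rows, and two documents become candidates if any band matches exactly.
    """
    if num_hashes % num_buckets != 0:
        raise ValueError("num_hashes must be divisible by num_buckets")
    rows = num_hashes // num_buckets
    return 1.0 - (1.0 - similarity ** rows) ** num_buckets


# Compare a few hypothetical settings at a fixed signature length.
for buckets in (13, 20, 52):
    for sim in (0.6, 0.8, 0.9):
        p = candidate_probability(sim, num_hashes=260, num_buckets=buckets)
        print(f"num_buckets={buckets:2d} jaccard={sim:.1f} p(candidate)={p:.3f}")
```

More buckets with fewer hashes each raises recall for moderately similar pairs but produces more candidate pairs to verify. Fewer, wider buckets do the opposite.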

Example use cases

  • Curating Common Crawl slices: filter by language, apply quality heuristics, deduplicate, redact PII, save as Parquet
  • Assembling an image dataset: aesthetic scoring, NSFW filtering, CLIP embedding export
  • Video dataset pipeline: scene detection, clip extraction, video embeddings for retrieval
  • Audio transcription cleanup: ASR inference, WER filtering, duration constraints before training speech models
  • Large-scale fuzzy deduplication of RedPajama-style corpora using A100 or multi-GPU clusters
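For the audio use case above, WER filtering with duration constraints can be sketched as follows. This is an illustration in plain Python, not the skill's implementation: the record fields and the thresholds `MAX_WER`, `MIN_DURATION_S`, and `MAX_DURATION_S` are assumptions, and the hypothesis strings stand in for real ASR output.

```python
# Hypothetical thresholds, for illustration only.
MAX_WER = 0.25
MIN_DURATION_S = 1.0
MAX_DURATION_S = 30.0


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(sample):
    """Keep a sample only if its duration and WER are within the thresholds."""
    if not MIN_DURATION_S <= sample["duration_s"] <= MAX_DURATION_S:
        return False
    return word_error_rate(sample["reference"], sample["hypothesis"]) <= MAX_WER


samples = [
    {"id": "a", "reference": "the cat sat on the mat",
     "hypothesis": "the cat sat on the mat", "duration_s": 3.2},
    {"id": "b", "reference": "the cat sat on the mat",
     "hypothesis": "a cat sit on mat", "duration_s": 2.9},
    {"id": "c", "reference": "hello world",
     "hypothesis": "hello world", "duration_s": 0.4},
]
kept_ids = [s["id"] for s in samples if keep_sample(s)]
print(kept_ids)
```

Checking duration before WER skips the edit-distance computation for clips that would be rejected anyway.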

FAQ

Which input and output formats are supported?

Inputs: Parquet, JSONL, and CSV, plus WebDataset TAR for multi-modal data. Outputs: Parquet (recommended) and JSONL.

How much faster is GPU curation?

Benchmarks show ~16× speedup for fuzzy/exact deduplication and ~10× for quality filtering versus CPU baselines, with near-linear multi-GPU scaling.

Can I run without GPUs?

Yes. There is a CPU-only install and execution path, but throughput will be significantly lower and the cost savings smaller.
