nemo-curator_skill

This skill optimizes LLM data curation with GPU-accelerated, multi-modal cleaning, deduplication, and PII redaction to improve training data quality.

Installation

npx veilstrat add skill orchestra-research/ai-research-skills --skill nemo-curator

SKILL.md (9.1 KB)

Overview

This skill provides GPU-accelerated data curation tools for preparing high-quality training datasets for LLMs and multi-modal models. It supports text, image, video, and audio pipelines with fuzzy and semantic deduplication, 30+ heuristic quality filters, PII redaction, and NSFW detection. The implementation scales across multiple GPUs (Dask/CUDA) for large corpora. Compared with CPU-only workflows, it reports speedups of roughly 16× for deduplication and 10× for quality filtering, along with a lower total cost of ownership (TCO).

How this skill works

The skill runs staged pipelines: heuristic quality filtering, exact/fuzzy/semantic deduplication, PII redaction, and classifier-based filtering. GPU kernels accelerate MinHash/LSH fuzzy deduplication, embedding-based semantic deduplication, and batched classifier inference. Pipelines operate on Parquet/JSONL/CSV inputs, integrate with Dask CUDA for multi-GPU scaling, and export curated Parquet or JSONL outputs.
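The staged flow described above can be sketched on the CPU with the Python standard library alone. This is a minimal illustration of the ordering (heuristic filter, then exact deduplication, then PII redaction), not the skill's own API: the function names, thresholds, and the email-only redaction pattern are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical thresholds and patterns, chosen only for illustration.
MIN_WORDS = 5
MAX_URL_RATIO = 0.2
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def passes_quality(text):
    """Heuristic stage: enforce a minimum word count and a maximum URL ratio."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    url_words = sum(1 for w in words if URL_RE.match(w))
    return url_words / len(words) <= MAX_URL_RATIO


def exact_dedup(docs):
    """Exact-dedup stage: keep the first document seen for each content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def redact_pii(text):
    """Redaction stage: replace email addresses only (real PII coverage is broader)."""
    return EMAIL_RE.sub("<EMAIL>", text)


def curate(docs):
    """Run the stages in order: filter, deduplicate, redact."""
    filtered = [d for d in docs if passes_quality(d["text"])]
    unique = exact_dedup(filtered)
    return [{**d, "text": redact_pii(d["text"])} for d in unique]


docs = [
    {"id": 1, "text": "Contact alice@example.com for the full training data report."},
    {"id": 2, "text": "Contact alice@example.com for the full training data report."},
    {"id": 3, "text": "too short"},
    {"id": 4, "text": "http://a.example http://b.example http://c.example click here now"},
]
curated = curate(docs)
print(curated)
```

Only document 1 survives: document 2 is an exact duplicate, document 3 is below the word-count floor, and document 4 exceeds the URL ratio. Running the cheap filter first means the later stages see fewer documents.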

When to use it

Curating web-scraped corpora (Common Crawl, RedPajama) before model training
Removing duplicates from multi-terabyte text collections with GPU speedups
Preparing multi-modal datasets (images, video, audio) with NSFW and quality filters
Redacting PII and enforcing safety policies at dataset scale
Reducing compute and cost for large-scale deduplication and filtering tasks

Best practices

Run quality heuristics first (word counts, URL ratio, repeated lines) to cheaply remove noise
Use exact deduplication before fuzzy/semantic passes to reduce candidate pairs
Tune MinHash/LSH parameters (num_hashes, num_buckets) to balance recall and speed
Use embedding-based semantic deduplication for paraphrase-heavy corpora; choose a lightweight embedder for scale
Leverage Dask CUDA/LocalCUDACluster for near-linear multi-GPU scaling and lower TCO
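To make the MinHash/LSH tuning advice concrete, here is a small CPU-only sketch of banded LSH over MinHash signatures. It assumes that num_buckets means the number of LSH bands, so each band covers num_hashes / num_buckets signature rows; that reading, the character-shingle size, and every function name here are assumptions for illustration and are not taken from the skill's code.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Character k-shingles of the case- and whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_hashes):
    """One minimum per seeded hash function; similar sets share many minima."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def candidate_pairs(docs, num_hashes=64, num_buckets=16):
    """Return pairs of document ids that collide in at least one band."""
    if num_hashes % num_buckets != 0:
        raise ValueError("num_hashes must be divisible by num_buckets")
    rows = num_hashes // num_buckets
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_hashes)
        for band in range(num_buckets):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "the  quick brown fox jumps over   the lazy dog",  # same after normalization
    "c": "9876543210 0123456789 9876543210",  # shares no shingles with the others
    "d": "The quick brown fox jumps over the lazy dog!",  # near-duplicate of "a"
}
pairs = candidate_pairs(docs)
print(sorted(pairs))
```

With a fixed num_hashes, more buckets means fewer rows per band, which raises recall (more near-duplicates become candidates) at the cost of more candidate pairs to verify. Fewer buckets does the opposite.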

Example use cases

Curating Common Crawl slices: filter by language, apply quality heuristics, deduplicate, redact PII, and save as Parquet
Assembling an image dataset: aesthetic scoring, NSFW filtering, CLIP embedding export
Video dataset pipeline: scene detection, clip extraction, video embeddings for retrieval
Audio transcription cleanup: ASR inference, WER filtering, duration constraints before training speech models
Large-scale fuzzy deduplication of RedPajama-style corpora using A100 or multi-GPU clusters
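The WER filtering step in the audio use case can be illustrated with a short standalone sketch. It computes word error rate as word-level edit distance divided by reference length, then keeps samples that satisfy both a WER ceiling and a duration window. The thresholds, field names, and sample records are hypothetical and exist only for this example.

```python
# Hypothetical limits, chosen only for illustration.
MAX_WER = 0.25
MIN_DURATION_S = 1.0
MAX_DURATION_S = 20.0


def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(sample):
    """Keep samples with an acceptable WER and a duration inside the window."""
    wer = word_error_rate(sample["reference"], sample["asr_text"])
    in_window = MIN_DURATION_S <= sample["duration_s"] <= MAX_DURATION_S
    return wer <= MAX_WER and in_window


samples = [
    {"id": "s1", "reference": "the cat sat on the mat",
     "asr_text": "the cat sat on the mat", "duration_s": 3.0},
    {"id": "s2", "reference": "the cat sat on the mat",
     "asr_text": "the cat sat on a hat", "duration_s": 3.0},
    {"id": "s3", "reference": "hello world",
     "asr_text": "hello world", "duration_s": 0.4},
]
kept_ids = [s["id"] for s in samples if keep_sample(s)]
print(kept_ids)
```

Sample s2 is dropped because two of its six reference words are substituted (WER of about 0.33), and s3 is dropped because it is shorter than the minimum duration.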

FAQ

Which input and output formats are supported?

Inputs: Parquet, JSONL, CSV, plus WebDataset TAR for multi-modal data. Outputs: Parquet (recommended) and JSONL.

How much faster is GPU curation?

Benchmarks show ~16× speedup for fuzzy/exact deduplication and ~10× for quality filtering versus CPU baselines, with near-linear multi-GPU scaling.

Can I run without GPUs?

Yes, there is a CPU-only install and execution path, but throughput will be significantly lower and the cost savings smaller.