pandas-pro_skill

This skill acts as a senior pandas pro to optimize data cleaning, transformation, and analysis with vectorized, memory-efficient operations for large

GitHub Stars: 110
Bundled Files: 1
Catalog Refreshed: 2 months ago
First Indexed: 4 months ago

Installation

Preview and clipboard use veilstrat where the catalogue uses aiagentskills.

npx veilstrat add skill jeffallan/claude-skills --skill pandas-pro

SKILL.md (3.9 KB)

Overview

This skill provides hands-on, production-grade guidance for working with pandas DataFrames. It focuses on vectorized data cleaning, transformation, aggregation, merging, and time-series workflows while emphasizing memory and performance best practices. Use it to get concise, actionable patterns that scale from exploratory analysis to production pipelines.

How this skill works

I inspect DataFrame structure (dtypes, memory_usage, nulls), recommend an index and dtype strategy, and implement transformations using vectorized operations and method chaining. I provide code templates for cleaning, groupby/pivot aggregation, merges, resampling, and conversion between formats, plus notes on profiling, chunking, and categorical conversion for memory reduction. Each solution includes validation checks and performance considerations.
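
The inspection step described above can be sketched roughly as follows. The sample data and the summarize helper are hypothetical names introduced for illustration; only dtypes, isna, and memory_usage are pandas calls.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", None],
    "sales": [10.0, None, 7.5, 3.0],
    "units": [1, 2, 3, 4],
})


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, and deep memory use."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "nulls": frame.isna().sum(),
        # deep=True also counts the Python string objects in text columns.
        "bytes": frame.memory_usage(deep=True, index=False),
    })


summary = summarize(df)
print(summary)
```

A summary like this is a reasonable starting point for choosing which columns to convert to categorical or to downcast.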

When to use it

Loading, cleaning, and transforming tabular data with clear dtype plans
Handling missing values, duplicates, and type conversion robustly
Performing groupby aggregations, pivot tables, and cross-tabs
Merging, joining, and concatenating large datasets safely and deterministically
Time series resampling, rolling aggregates, and timezone-aware operations
Optimizing pandas code for memory, speed, and production deployment
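
As one illustration of a "safe" merge from the list above, the sketch below uses the validate and indicator parameters of DataFrame.merge. The customers and orders tables are made-up sample data, not part of the skill.

```python
import pandas as pd

# Hypothetical sample tables, used only for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Cy"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 4],
    "amount": [5.0, 7.0, 2.0],
})

# validate="many_to_one" raises MergeError if customer_id repeats in customers,
# so accidental row multiplication fails loudly instead of passing silently.
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,  # adds a "_merge" column recording where each row matched
)

# Orders whose customer_id has no match in customers.
unmatched = merged[merged["_merge"] == "left_only"]

# A left merge with a validated unique right key keeps the left row count.
assert len(merged) == len(orders)
print(unmatched)
```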

Best practices

Prefer vectorized operations; avoid row-wise iteration like .iterrows()
Set explicit dtypes early; use categorical for low-cardinality strings
Check .memory_usage(deep=True) and apply chunking for large files
Handle missing values explicitly and validate with .isna().sum() before dropping
Use .loc/.iloc to avoid chained indexing; call .copy() when mutating subsets
Validate outputs: dtypes, shapes, null counts, and sample checks after each major step
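
A minimal sketch of several of these practices together, assuming a small synthetic frame (the column names are hypothetical): a vectorized column computation in place of .iterrows(), categorical conversion for a low-cardinality string column, numeric downcasting, and a final validation step.

```python
import numpy as np
import pandas as pd

# Synthetic data, used only for illustration.
df = pd.DataFrame({
    "status": ["new", "paid", "paid", "new"] * 250,
    "price": np.arange(1000, dtype="float64"),
    "qty": np.arange(1000) % 5,
})

# Vectorized: one column-wise multiplication instead of a Python-level row loop.
df["total"] = df["price"] * df["qty"]

# Low-cardinality strings are stored as small integer codes once categorical.
before = df["status"].memory_usage(deep=True)
df["status"] = df["status"].astype("category")
after = df["status"].memory_usage(deep=True)

# Downcast to the smallest integer type that can hold the values (0..4 here).
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")

# Validate after the step: shape, nulls, and dtypes.
assert df.shape == (1000, 4)
assert df.isna().sum().sum() == 0
print(df.dtypes)
print(f"status column: {before} bytes -> {after} bytes")
```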

Example use cases

Clean CSVs and spreadsheets: convert types, fill or flag missing data, deduplicate, and export to parquet for downstream use
Aggregate logs by time windows: set DatetimeIndex, resample, compute rolling metrics, and align timezones
Join customer tables: use appropriate merge keys, handle duplicates, and reconcile conflicting columns
Large-file ETL: stream CSV chunks, apply transformations per chunk, and append to a consolidated parquet store
Performance tuning: convert text columns to categorical, downcast numerics, and profile with memory_usage and timeit
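
The log-aggregation use case could look roughly like this. The request counts, the 60-minute window, and the Europe/Oslo target timezone are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical log counts at 30-minute intervals, indexed in UTC.
idx = pd.date_range("2024-01-01 00:00", periods=6, freq="30min", tz="UTC")
logs = pd.DataFrame({"requests": [1, 2, 3, 4, 5, 6]}, index=idx)

# Resample to hourly totals using the DatetimeIndex.
hourly = logs["requests"].resample("60min").sum()

# Rolling mean over the current and previous hour; min_periods=1 keeps the first row.
rolling = hourly.rolling(window=2, min_periods=1).mean()

# Convert the timezone-aware index for reporting in local time.
local = hourly.tz_convert("Europe/Oslo")

print(hourly)
print(rolling)
print(local)
```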

FAQ

How do I avoid SettingWithCopyWarning?

Select with .loc and call .copy() on the subset before mutating it, e.g., sub = df.loc[mask].copy(); sub['col'] = ...
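
A short sketch of that pattern, with hypothetical column names, showing both an independent subset and an in-place update of the original frame:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"region": ["n", "s", "n"], "sales": [1.0, 2.0, 3.0]})
mask = df["region"] == "n"

# Independent subset: mutating sub does not touch df.
sub = df.loc[mask].copy()
sub["sales"] = sub["sales"] * 2
original_after_sub_edit = df["sales"].tolist()

# To change the original instead, assign through a single .loc call.
df.loc[mask, "sales"] = df.loc[mask, "sales"] + 0.5

print(sub)
print(df)
```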

When should I use chunking vs converting dtypes?

Start by converting dtypes and using categorical to reduce memory; use chunking when the dataset still exceeds memory or when reading slow sources like very large CSVs.
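
The chunked path might be sketched as below, assuming an aggregation that can be combined across chunks. The in-memory CSV text stands in for a large file, and the chunk size of 2 is chosen only to keep the example small.

```python
import io

import pandas as pd

# Stand-in for a large CSV file; a real pipeline would pass a file path instead.
csv_text = "category,value\na,1\nb,2\na,3\nb,4\na,5\n"

# Declare a compact dtype at read time so each chunk is already small.
dtypes = {"value": "int32"}

partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2):
    # Reduce each chunk to a small partial result before keeping it.
    partials.append(chunk.groupby("category")["value"].sum())

# Combine the partial sums; this works because a sum of sums is the overall sum.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

Aggregations that do not combine this way, such as medians, need a different strategy than summing per-chunk results.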