pandas-pro_skill

This skill acts as a senior pandas pro to optimize data cleaning, transformation, and analysis with vectorized, memory-efficient operations for large

GitHub Stars: 110
Bundled Files: 1
Catalog Refreshed: 3 weeks ago
First Indexed: 2 months ago

Readme & install


Installation

Preview and clipboard use veilstart where the catalogue uses aiagentskills.

npx veilstart add skill jeffallan/claude-skills --skill pandas-pro

  • SKILL.md (3.9 KB)

Overview

This skill provides hands-on, production-grade guidance for working with pandas DataFrames. It focuses on vectorized data cleaning, transformation, aggregation, merging, and time-series workflows while emphasizing memory and performance best practices. Use it to get concise, actionable patterns that scale from exploratory analysis to production pipelines.

How this skill works

I inspect DataFrame structure (dtypes, memory_usage, nulls), recommend an index and dtype strategy, and implement transformations using vectorized operations and method chaining. I provide code templates for cleaning, groupby/pivot aggregation, merges, resampling, and conversion between formats, plus notes on profiling, chunking, and categorical conversion for memory reduction. Each solution includes validation checks and performance considerations.
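
The inspection step described above could look roughly like the following. This is a minimal sketch, not the skill's own template: the sample data, the column names, and the `summarize` helper are all hypothetical, and only standard pandas calls (`dtypes`, `isna`, `memory_usage`) are used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "country": ["DE", "DE", "FR", None],
        "amount": [10.5, np.nan, 7.25, 3.0],
    }
)


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, and deep memory use in bytes."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "nulls": frame.isna().sum(),
            "bytes": frame.memory_usage(deep=True, index=False),
        }
    )


summary = summarize(df)
print(summary)
```

A table like this is a reasonable starting point for deciding which columns to convert to narrower or categorical dtypes before any transformation work begins.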

When to use it

  • Loading, cleaning, and transforming tabular data with clear dtype plans
  • Handling missing values, duplicates, and type conversion robustly
  • Performing groupby aggregations, pivot tables, and cross-tabs
  • Merging, joining, and concatenating large datasets safely and deterministically
  • Time series resampling, rolling aggregates, and timezone-aware operations
  • Optimizing pandas code for memory, speed, and production deployment

Best practices

  • Prefer vectorized operations; avoid row-wise iteration like .iterrows()
  • Set explicit dtypes early; use categorical for low-cardinality strings
  • Check .memory_usage(deep=True) and apply chunking for large files
  • Handle missing values explicitly and validate with .isna().sum() before dropping
  • Use .loc/.iloc to avoid chained indexing; call .copy() when mutating subsets
  • Validate outputs: dtypes, shapes, null counts, and sample checks after each major step
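
Several of the practices above can be combined in a few lines. The sketch below assumes a synthetic orders table (the `status`, `price`, and `qty` columns are made up for illustration) and shows vectorized arithmetic, categorical conversion, integer downcasting, and a before/after memory check.

```python
import numpy as np
import pandas as pd

# Synthetic data; the schema is an assumption for illustration.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame(
    {
        "status": rng.choice(["new", "paid", "refunded"], size=n),
        "price": rng.uniform(1, 100, size=n),
        "qty": rng.integers(1, 5, size=n),
    }
)

# Vectorized column arithmetic instead of looping with .iterrows()
df["total"] = df["price"] * df["qty"]

before = df.memory_usage(deep=True).sum()

# Low-cardinality strings become categorical; integers shrink to the
# smallest integer dtype that can hold the observed values.
df["status"] = df["status"].astype("category")
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")

after = df.memory_usage(deep=True).sum()

# Validate after the step: dtypes and nulls
print(df.dtypes)
print("nulls:", int(df.isna().sum().sum()))
print(f"memory: {before} -> {after} bytes")
```

The exact savings depend on the data and the pandas version, so it is worth measuring rather than assuming.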

Example use cases

  • Clean CSVs and spreadsheets: convert types, fill or flag missing data, deduplicate, and export to parquet for downstream use
  • Aggregate logs by time windows: set DatetimeIndex, resample, compute rolling metrics, and align timezones
  • Join customer tables: use appropriate merge keys, handle duplicates, and reconcile conflicting columns
  • Large-file ETL: stream CSV chunks, apply transformations per chunk, and append to a consolidated parquet store
  • Performance tuning: convert text columns to categorical, downcast numerics, and profile with memory_usage and timeit
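
As one illustration of the use cases above, the log-aggregation case might be sketched as follows. The `ts` and `latency_ms` columns, the assumption that raw timestamps are naive UTC, and the target timezone are all hypothetical choices.

```python
import pandas as pd

# Hypothetical log events; timestamps are assumed to be naive UTC.
logs = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            [
                "2024-01-01 00:05",
                "2024-01-01 00:40",
                "2024-01-01 01:10",
                "2024-01-01 03:20",
            ]
        ),
        "latency_ms": [120, 80, 200, 100],
    }
)

hourly = (
    logs.set_index("ts")          # DatetimeIndex is required for resampling
    .tz_localize("UTC")           # make the index timezone-aware
    .resample("1h")["latency_ms"]
    .agg(["count", "mean"])       # empty hours get count 0 and mean NaN
)

# Rolling mean over the last two hourly buckets, tolerating gaps
hourly["rolling_mean"] = hourly["mean"].rolling(window=2, min_periods=1).mean()

# Convert to a local timezone for reporting only
hourly_local = hourly.tz_convert("Europe/Berlin")
print(hourly_local)
```

Keeping the computation in UTC and converting only at the end avoids ambiguous or missing local times around daylight-saving transitions.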

FAQ

How do I avoid chained-assignment problems when modifying a subset?

Select rows with .loc and take an explicit .copy() of the subset before mutating it, e.g., sub = df.loc[mask].copy(); sub['col'] = ...
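
A small runnable version of that pattern, using a made-up `region`/`sales` table, might look like this. It also shows the complementary case where the goal is to change the original frame.

```python
import pandas as pd

# Illustrative data; the column names are assumptions.
df = pd.DataFrame({"region": ["eu", "us", "eu"], "sales": [100, 250, 80]})
mask = df["region"] == "eu"

# Independent copy: edits to `sub` never touch `df`
sub = df.loc[mask].copy()
sub["sales_k"] = sub["sales"] / 1000

# To change the original instead, assign through .loc in a single step
df.loc[mask, "sales"] = df.loc[mask, "sales"] * 2

print(sub)
print(df)
```

The choice between the two depends on intent: copy when the subset is a separate working table, assign through .loc when the original should be updated.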

When should I use chunking vs converting dtypes?

Start by converting dtypes and using categorical to reduce memory; use chunking when the dataset still exceeds memory or when reading slow sources like very large CSVs.
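
The two approaches can also be combined. The sketch below declares a narrow dtype at read time and aggregates chunk by chunk; the in-memory CSV stands in for a large file, and the `city`/`amount` schema and the chunk size are assumptions.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk; the schema is hypothetical.
cities = ["Oslo", "Rome", "Oslo", "Rome", "Oslo", "Lima"]
csv_text = "city,amount\n" + "\n".join(f"{c},{i}" for i, c in enumerate(cities))

# Declare a narrow numeric dtype up front. The grouping key stays a plain
# string here because categories inferred per chunk can differ between chunks.
dtypes = {"amount": "int32"}

partials = []
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=2):
    n_chunks += 1
    # Reduce each chunk to a small partial result before keeping it
    partials.append(chunk.groupby("city")["amount"].sum())

# Combine the per-chunk partial sums into final totals
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works for aggregations that can be merged from partial results, such as sums and counts; statistics like medians need a different strategy.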
