pandas-pro_skill
GitHub Stars: 110
Bundled Files: 1
Catalog Refreshed: 3 weeks ago
First Indexed: 2 months ago
Installation
Preview and clipboard use veilstart where the catalogue uses aiagentskills.
npx veilstart add skill jeffallan/claude-skills --skill pandas-pro
- SKILL.md (3.9 KB)
Overview
This skill provides hands-on, production-grade guidance for working with pandas DataFrames. It focuses on vectorized data cleaning, transformation, aggregation, merging, and time-series workflows while emphasizing memory and performance best practices. Use it to get concise, actionable patterns that scale from exploratory analysis to production pipelines.
How this skill works
I inspect DataFrame structure (dtypes, memory_usage, nulls), recommend an index and dtype strategy, and implement transformations using vectorized operations and method chaining. I provide code templates for cleaning, groupby/pivot aggregation, merges, resampling, and conversion between formats, plus notes on profiling, chunking, and categorical conversion for memory reduction. Each solution includes validation checks and performance considerations.
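As a rough illustration of the inspection step described above, the sketch below collects dtypes, null counts, and deep memory usage into one per-column report. The helper name summarize and the sample columns are illustrative assumptions, not part of the skill itself.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, null count, and deep memory use in bytes."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "nulls": df.isna().sum(),
            # deep=True measures the actual string payloads, not just pointers.
            "bytes": df.memory_usage(deep=True, index=False),
        }
    )


# Hypothetical sample data with one missing value in two of the columns.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Bergen", None],
        "temp_c": [3.5, 4.0, None, 6.1],
        "station_id": [1, 1, 2, 2],
    }
)

report = summarize(df)
print(report)
```

A report like this is a reasonable starting point for deciding which columns to convert to categorical, which to downcast, and where missing values need an explicit policy.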
When to use it
- Loading, cleaning, and transforming tabular data with clear dtype plans
- Handling missing values, duplicates, and type conversion robustly
- Performing groupby aggregations, pivot tables, and cross-tabs
- Merging, joining, and concatenating large datasets safely and deterministically
- Time series resampling, rolling aggregates, and timezone-aware operations
- Optimizing pandas code for memory, speed, and production deployment
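The merge and aggregation items above can be sketched together. This is a minimal example under assumed data: the orders and customers tables and their column names are hypothetical. It uses the validate and indicator parameters of DataFrame.merge to catch unexpected key duplication and to flag unmatched rows, then applies named aggregation in groupby.

```python
import pandas as pd

# Hypothetical tables: many orders per customer, at most one row per customer.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": [10, 10, 20, 30],
        "amount": [50.0, 20.0, 70.0, 15.0],
    }
)
customers = pd.DataFrame(
    {
        "customer_id": [10, 20],
        "segment": ["retail", "wholesale"],
    }
)

# validate raises MergeError if customer_id is duplicated on the right side;
# indicator adds a "_merge" column showing where each row came from.
merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Orders whose customer is missing from the lookup table.
unmatched = merged.loc[merged["_merge"] == "left_only", "order_id"].tolist()

# Named aggregation keeps output column names explicit.
summary = (
    merged.dropna(subset=["segment"])
    .groupby("segment")
    .agg(orders=("order_id", "count"), revenue=("amount", "sum"))
    .sort_index()
)
print(unmatched)
print(summary)
```

Checking the row count before and after the merge (here both are 4) is a cheap extra guard against accidental fan-out.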
Best practices
- Prefer vectorized operations; avoid row-wise iteration like .iterrows()
- Set explicit dtypes early; use categorical for low-cardinality strings
- Check .memory_usage(deep=True) and apply chunking for large files
- Handle missing values explicitly and validate with .isna().sum() before dropping
- Use .loc/.iloc to avoid chained indexing; call .copy() when mutating subsets
- Validate outputs: dtypes, shapes, null counts, and sample checks after each major step
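Several of the practices above can be shown in one short sketch: converting a low-cardinality string column to categorical, downcasting an integer column, comparing memory before and after, and computing a derived column with vectorized arithmetic rather than .iterrows(). The column names and generated data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

n = 10_000
# Hypothetical data: a string column with only three distinct values,
# a small-range integer column, and a float column.
df = pd.DataFrame(
    {
        "status": np.tile(["new", "open", "closed", "open"], n // 4),
        "qty": np.arange(n, dtype="int64") % 100,
        "price": np.full(n, 2.5),
    }
)
before = df.memory_usage(deep=True).sum()

# Low-cardinality strings become integer codes plus a small lookup table.
df["status"] = df["status"].astype("category")
# Values 0..99 fit in the smallest signed integer type.
df["qty"] = pd.to_numeric(df["qty"], downcast="integer")
after = df.memory_usage(deep=True).sum()

# Vectorized arithmetic over whole columns, no Python-level row loop.
df["total"] = df["qty"] * df["price"]

# Validate after the step: dtypes, shape, and null counts.
print(df.dtypes)
print(df.shape, int(df.isna().sum().sum()))
print(f"memory: {before} -> {after} bytes")
```

The actual savings depend on cardinality and value ranges, so it is worth measuring on real data before committing to a dtype plan.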
Example use cases
- Clean CSVs and spreadsheets: convert types, fill or flag missing data, deduplicate, and export to parquet for downstream use
- Aggregate logs by time windows: set DatetimeIndex, resample, compute rolling metrics, and align timezones
- Join customer tables: use appropriate merge keys, handle duplicates, and reconcile conflicting columns
- Large-file ETL: stream CSV chunks, apply transformations per chunk, and append to a consolidated parquet store
- Performance tuning: convert text columns to categorical, downcast numerics, and profile with memory_usage and timeit
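To make the time-window use case above more concrete, here is a small sketch with a timezone-aware DatetimeIndex, a time-based rolling sum, and daily resampling in two timezones. The series name, the sample values, and the choice of America/New_York are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical log counts, one observation every 6 hours, stamped in UTC.
idx = pd.date_range("2024-01-01", periods=8, freq="6h", tz="UTC")
s = pd.Series(range(1, 9), index=idx, name="requests")

# Time-based rolling window: each point sums itself and the previous 12 hours
# (the window is open on the left, closed on the right by default).
rolling_12h = s.rolling("12h").sum()

# Daily totals with day boundaries at UTC midnight.
daily_utc = s.resample("D").sum()

# Same data, day boundaries at local midnight in another timezone.
local = s.tz_convert("America/New_York")
daily_local = local.resample("D").sum()

print(daily_utc)
print(daily_local)
```

The two daily totals differ because the first UTC observation falls on the previous calendar day in New York, which is why it helps to fix the reporting timezone before resampling.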
FAQ
How do I avoid chained-assignment warnings when modifying a subset?
Select with .loc and take a .copy() of the subset before mutating it, e.g., sub = df.loc[mask].copy(); sub['col'] = ...
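The subset-copy pattern in this answer can be sketched as follows. The DataFrame and its columns are hypothetical. The example also shows the complementary case, a single .loc assignment, for when the intent is to change the original frame rather than an independent copy.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame(
    {
        "region": ["north", "south", "north"],
        "sales": [100, 250, 75],
    }
)
mask = df["region"] == "north"

# Independent subset: mutating `sub` leaves `df` untouched.
sub = df.loc[mask].copy()
sub["sales"] = sub["sales"] * 2

# In-place update of the original: one .loc call with row and column selectors,
# rather than chained indexing such as df[mask]["sales"] = ...
df.loc[mask, "sales"] = df.loc[mask, "sales"] + 1

print(sub)
print(df)
```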
When should I use chunking vs converting dtypes?
Start by converting dtypes and using categorical to reduce memory; use chunking when the dataset still exceeds memory or when reading slow sources like very large CSVs.
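A minimal sketch of combining both ideas: declare a narrow dtype at read time, and stream the file in chunks when it still does not fit in memory. The CSV content is an in-memory stand-in for a large file, and the column names and chunk size are assumptions.

```python
import io

import pandas as pd

# Stand-in for a large CSV on disk.
csv_text = "store,units\nA,3\nB,5\nA,2\nB,1\nA,4\n"

partials = []
# chunksize makes read_csv return an iterator of DataFrames;
# dtype is applied to every chunk as it is parsed.
for chunk in pd.read_csv(
    io.StringIO(csv_text),
    dtype={"units": "int32"},
    chunksize=2,
):
    # Reduce each chunk to a small partial result before keeping it.
    partials.append(chunk.groupby("store")["units"].sum())

# Combine the partial sums into final per-store totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This works for aggregations that can be combined from partial results, such as sums and counts. Statistics like medians need a different approach.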