stream_skill

This skill designs robust, scalable data pipelines (batch or streaming) with quality checks, lineage, and idempotent recovery.



Installation


npx veilstrat add skill simota/agent-skills --skill stream


Overview

This skill designs robust, production-ready ETL/ELT pipelines and streaming architectures with built-in data quality, idempotency, and lineage. I help choose batch vs streaming, create Airflow/Dagster workflows, design Kafka topics and dbt models, and produce backfill and monitoring plans. The focus is practical: reliable pipelines that are re-runnable, observable, and easy to hand off to engineering teams.

How this skill works

I analyze sources, sinks, volume/velocity, and latency requirements to recommend an architecture (batch, streaming, or hybrid). I produce DAGs, topic and consumer designs, dbt model structure, and quality check suites, plus playbooks for backfill, schema evolution, and incident recovery. Deliverables include design docs, orchestration code templates, quality tests, and lineage diagrams ready for implementation.

When to use it

You need a new data pipeline for reporting, ML features, or real-time dashboards.
Existing pipelines are flaky, lack lineage, or cannot be safely re-run.
You must pick between batch and streaming based on volume/latency trade-offs.
You need Kafka/topic design, CDC configuration, or dbt model layering.
You require a backfill/replay strategy, schema evolution plan, or monitoring hooks.

Best practices

Treat schema as a versioned contract and enforce it at source and transform layers.
Design idempotent transforms and deterministic keys to enable safe re-runs and backfills.
Implement three-layer quality checks (source → transform → sink) with clear gates.
Prefer moving computation to where the data resides to reduce cost and latency.
Document lineage for every derived dataset and attach monitoring/alerting to key SLAs.
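To make the idempotency practice concrete, here is a minimal sketch of a re-runnable load. It assumes a SQLite sink (version 3.24 or later for the upsert syntax); the table `daily_revenue`, the source label `billing`, and the helpers `deterministic_key`, `load_batch`, and `row_count` are hypothetical names chosen for illustration, not part of the skill.

```python
import hashlib
import sqlite3


def deterministic_key(source: str, natural_id: str, event_date: str) -> str:
    """Derive a stable key from business fields only (no load time, no random values)."""
    raw = "|".join([source, natural_id, event_date])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def load_batch(conn: sqlite3.Connection, rows: list) -> None:
    """Upsert rows so that re-running the same batch leaves the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_revenue ("
        "row_key TEXT PRIMARY KEY, account TEXT, day TEXT, amount REAL)"
    )
    for row in rows:
        # Hypothetical source label "billing"; the key depends only on the row's identity.
        key = deterministic_key("billing", row["account"], row["day"])
        conn.execute(
            "INSERT INTO daily_revenue (row_key, account, day, amount) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(row_key) DO UPDATE SET amount = excluded.amount",
            (key, row["account"], row["day"], row["amount"]),
        )
    conn.commit()


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0]


conn = sqlite3.connect(":memory:")
batch = [
    {"account": "a1", "day": "2024-01-01", "amount": 10.0},
    {"account": "a2", "day": "2024-01-01", "amount": 25.5},
]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated re-run or backfill of the same partition
print(row_count(conn))   # prints 2: the second run overwrote rows instead of duplicating them
```

Because the key is derived from the row's business identity, a backfill that recomputes an amount replaces the old value in place rather than appending a second row.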

Example use cases

Design a daily batch pipeline (Airflow + dbt) for finance reports with backfill playbook.
Create a low-latency streaming architecture (Kafka + stream processor) for real-time metrics.
Define CDC ingestion using Debezium into a data lake, with schema evolution strategy.
Produce dbt model templates and tests following staging → intermediate → mart layers.
Build a pipeline monitoring plan with quality gates and automated alerting for freshness and volume.
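The freshness and volume gates in the last use case could look roughly like the sketch below. The thresholds, the `GateResult` type, and the function names are assumptions for illustration; a real plan would pull these values from the pipeline's SLAs and wire failures into an alerting tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def check_freshness(latest_event: datetime, now: datetime, max_lag: timedelta) -> GateResult:
    """Pass when the newest record is no older than max_lag."""
    lag = now - latest_event
    return GateResult("freshness", lag <= max_lag, f"lag={lag}, allowed={max_lag}")


def check_volume(row_count: int, expected: int, tolerance: float) -> GateResult:
    """Pass when row_count is within +/- tolerance (a fraction) of the expected count."""
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    detail = f"rows={row_count}, allowed={low:.0f}..{high:.0f}"
    return GateResult("volume", low <= row_count <= high, detail)


def failed_gates(results: list) -> list:
    """Return the names of gates that should block the run or raise an alert."""
    return [r.name for r in results if not r.passed]


now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
results = [
    # Newest record is 7 hours old against an assumed 6-hour freshness SLA.
    check_freshness(datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc), now, timedelta(hours=6)),
    # 950 rows against an assumed baseline of 1000 with 10% tolerance.
    check_volume(950, expected=1000, tolerance=0.10),
]
print(failed_gates(results))  # prints ['freshness']
```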

FAQ

How do you choose between batch and streaming?

I evaluate data volume, required latency, cost constraints, and downstream use cases. I choose batch for high-volume workloads that are not time-sensitive, and streaming when sub-minute freshness or event-driven behavior is required.
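As a rough illustration of that rule of thumb, the sketch below encodes only the two signals stated above (sub-minute freshness and event-driven needs). The `Workload` type and `recommend_mode` function are hypothetical, and a real decision would also weigh volume and cost, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    max_staleness_seconds: float  # how old the data may be when consumers read it
    event_driven: bool = False    # True when downstream systems react to individual events


def recommend_mode(workload: Workload) -> str:
    """Return 'streaming' for sub-minute freshness or event-driven needs, else 'batch'."""
    if workload.event_driven or workload.max_staleness_seconds < 60:
        return "streaming"
    return "batch"


print(recommend_mode(Workload(max_staleness_seconds=24 * 3600)))  # prints batch
print(recommend_mode(Workload(max_staleness_seconds=5)))          # prints streaming
```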

What guarantees do you design for exactly-once delivery?

I combine deterministic keys, idempotent writes (UPSERTs), Kafka transactions or deduplication layers, and monitoring to approach exactly-once semantics where the stack allows.
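A deduplication layer of the kind mentioned above can be sketched as follows. It assumes every event carries a unique `event_id` and that delivery is at-least-once; the in-memory `seen_ids` set and the `apply_events` function are illustrative stand-ins, since a production consumer would persist the seen IDs in the same transaction as the state update.

```python
def apply_events(events: list, balances: dict, seen_ids: set) -> int:
    """Apply each event at most once, even if it is delivered more than once.

    Returns the number of events actually applied.
    """
    applied = 0
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate delivery: already applied, so skip it
        # The update itself is NOT idempotent (it increments), which is why
        # the dedup check is needed before applying it.
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["delta"]
        seen_ids.add(event["event_id"])
        applied += 1
    return applied


balances = {}
seen_ids = set()
delivered = [
    {"event_id": "e1", "account": "a", "delta": 10},
    {"event_id": "e2", "account": "a", "delta": 5},
    {"event_id": "e1", "account": "a", "delta": 10},  # redelivery of e1
]
print(apply_events(delivered, balances, seen_ids))  # prints 2
print(balances)                                     # prints {'a': 15}
```

Replaying the whole batch after a consumer restart applies nothing new, which is the behavior a backfill or replay strategy relies on.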