Repository inventory

muratcankoylan/agent-skills-for-context-engineering

Skills indexed from this repository, with install-style signals scoped to the repo.

11 skills · 133K GitHub stars · 0 weekly installs · Python

Overview

This skill provides a practical framework for evaluating agent systems across multiple dimensions. It focuses on outcome-centered rubrics, realistic test sets, and continuous evaluation pipelines to catch regressions and measure improvements. The guidance balances automated LLM-as-judge methods with human review for edge cases.

How this skill works

The skill defines multi-dimensional rubrics (accuracy, completeness, citation accuracy, source quality, tool efficiency) and converts them to weighted numeric scores. It supports complexity-stratified test sets, token-budgeted runs, and automated LLM-based judgments while recommending human sampling for subtle failures. Results are tracked over time to detect regressions and validate context engineering choices.
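As a rough illustration of turning a rubric into a weighted numeric score, the sketch below combines per-dimension scores into a single number. The dimension names come from the description above; the weights, the 0-to-1 score scale, and the `weighted_score` function are assumptions made for this example, not values or APIs taken from the skill.

```python
# Hypothetical weights: the skill names these dimensions, but the numeric
# weights here are invented for illustration only.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.15,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-dimension scores (each in [0, 1]) into one score in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be in [0, 1]")
    # Normalize by the total weight so the weights need not sum to exactly 1.
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


example = {
    "accuracy": 0.9,
    "completeness": 0.8,
    "citation_accuracy": 1.0,
    "source_quality": 0.7,
    "tool_efficiency": 0.6,
}
print(round(weighted_score(example), 3))
```

Normalizing by the total weight keeps the result on the same scale as the inputs even if the weights are later re-tuned without re-checking their sum.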

When to use it

When you need a systematic test framework for agent performance
Before deploying agent changes to catch regressions
To compare agent configurations, models, or context strategies
When building quality gates and automated evaluation pipelines
To measure production quality by sampling real interactions

Best practices

Design multi-dimensional rubrics; avoid single-metric decisions
Evaluate outcomes, not specific execution paths or steps
Stratify test sets by complexity and include edge cases
Run evaluations under realistic token budgets and context sizes
Combine LLM-as-judge for scale with human review for edge cases
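One way to act on the stratification advice is to report pass rates per complexity tier rather than a single overall number. The following is a minimal sketch; the tier labels and the `pass_rate_by_complexity` helper are hypothetical and not part of the skill.

```python
from collections import defaultdict


def pass_rate_by_complexity(results):
    """Return the pass rate per complexity tier.

    `results` is an iterable of (complexity_label, passed) pairs, where the
    labels (for example "simple" or "complex") are assumed for illustration.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for label, passed in results:
        totals[label] += 1
        passes[label] += int(bool(passed))
    return {label: passes[label] / totals[label] for label in totals}


sample_results = [
    ("simple", True),
    ("simple", True),
    ("complex", False),
    ("complex", True),
]
print(pass_rate_by_complexity(sample_results))
```

Reporting per tier makes it visible when a change helps easy cases while hurting hard ones, which an overall average would hide.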

Example use cases

Compare two agent architectures by running the same test set and comparing weighted scores
Build a CI pipeline that runs evaluation tests on every agent version and flags regressions
Measure the impact of context-window reductions with degradation tests to find safe limits
Use LLM-as-judge prompts to score thousands of runs, then human-review low-confidence failures
Create pass/fail quality gates that enforce minimum weighted scores before deployment
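The regression check and the pass/fail gate from the use cases above could be combined along these lines. This is a sketch under assumptions: the `quality_gate` function, the 0.8 minimum score, and the 0.02 allowed drop are illustrative placeholders, not thresholds recommended by the skill.

```python
def quality_gate(candidate, baseline, min_score=0.8, max_drop=0.02):
    """Decide whether a candidate agent version may be deployed.

    `candidate` and `baseline` are weighted scores in [0, 1]. The gate fails
    if the candidate is below `min_score` or has dropped by more than
    `max_drop` relative to the baseline. Both thresholds are hypothetical.
    """
    reasons = []
    if candidate < min_score:
        reasons.append(f"score {candidate:.3f} is below minimum {min_score:.3f}")
    if baseline - candidate > max_drop:
        reasons.append(
            f"regression of {baseline - candidate:.3f} exceeds allowed {max_drop:.3f}"
        )
    return (not reasons, reasons)


passed, reasons = quality_gate(candidate=0.82, baseline=0.90)
print(passed, reasons)
```

In a CI job, a failed gate would typically translate to a non-zero exit code, with the collected reasons printed for the reviewer.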

FAQ

To account for run-to-run variability in agent behavior, judge outcomes instead of steps, run multiple seeds, and aggregate scores across runs.
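Aggregating across seeds can be as simple as reporting the mean together with a spread measure. The sketch below uses Python's standard `statistics` module; the `aggregate_runs` helper and its output keys are assumptions for illustration.

```python
import statistics


def aggregate_runs(scores):
    """Summarize scores from repeated runs of the same test case."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one run")
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        # Sample standard deviation is undefined for one run; report 0.0.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


summary = aggregate_runs([0.8, 0.9, 1.0])
print(summary)
```

Keeping the minimum alongside the mean helps surface cases where an agent usually succeeds but occasionally fails badly.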

When should I use human evaluation versus automated LLM judging?

Use LLM judges for large-scale, consistent scoring; reserve human review for edge cases, samples, and nuanced failure modes.
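A simple routing rule can implement this split: accept confident automated judgments and queue low-confidence failures for a person. The sketch below assumes the judge reports a confidence value alongside its score; the `Judgment` type, the field names, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    run_id: str
    score: float       # judge-assigned score in [0, 1]
    confidence: float  # assumed self-reported judge confidence in [0, 1]


def split_for_review(judgments, pass_threshold=0.7, min_confidence=0.6):
    """Split judgments into (auto_accepted, needs_human_review).

    A run goes to human review when it failed (score below `pass_threshold`)
    and the judge was unsure (confidence below `min_confidence`).
    """
    auto, human = [], []
    for j in judgments:
        low_confidence_failure = (
            j.score < pass_threshold and j.confidence < min_confidence
        )
        (human if low_confidence_failure else auto).append(j)
    return auto, human


batch = [
    Judgment("run-1", score=0.9, confidence=0.9),
    Judgment("run-2", score=0.4, confidence=0.3),
    Judgment("run-3", score=0.5, confidence=0.95),
]
auto, human = split_for_review(batch)
print([j.run_id for j in human])
```

Random sampling of the auto-accepted set for periodic human spot checks would be a natural extension, since the answer above also reserves human review for samples.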

11 skills

evaluation

This skill helps you design and run multi-dimensional evaluation for agent systems, enabling robust benchmarking, continuous improvement, and quality gates.

Analytics · Data · Performance · Testing · +1 more

multi-agent-patterns

This skill helps you design and implement multi-agent systems with context isolation, coordination patterns, and scalable architectures.

Automation · Monitoring · Planning · Productivity · +2 more

bdi-mental-states

This skill models agent beliefs, desires, and intentions from RDF context, enabling explainable BDI reasoning and coherent multi-agent coordination.

Backend · Data · Observability · Planning · +3 more

context-fundamentals

This skill helps you understand and design efficient context windows for AI agents, optimizing loading and budgeting across systems.

Backend · Debugging · Design · Docs · +2 more

context-compression

This skill helps manage and compress long agent conversations by optimizing tokens-per-task through anchored summaries and structured artifact tracking.

Backend · Debugging · Docs · Performance · +2 more

book-sft-pipeline

This skill helps you build and validate author-style fine-tuning pipelines that take a book from ePub source to a LoRA-trained model of the author's voice.

Automation · Data · Docs · Writing · +1 more

project-development

This skill helps you design and evaluate LLM-backed projects, selecting architecture, tasks, and cost estimates for efficient agent development.

Analytics · Design · Planning · Productivity · +1 more

tool-design

This skill helps design and optimize agent tools, improving tool descriptions, consolidation, and interfaces for reliable multi-agent systems.

Automation · Debugging · Design · Refactor · +2 more

advanced-evaluation

This skill enables automated LLM-based evaluation pipelines, compares model outputs, and mitigates biases to deliver consistent, objective quality assessments.

Analytics · Automation · Data · Testing · +1 more

comprehensive-research-agent

This skill helps ensure thorough validation, error handling, and transparent reasoning across multi-source research tasks.

Data · Debugging · Planning · Productivity · +2 more

agent-skills-for-context-engineering

This skill helps you design, optimize, and debug production agent systems through context engineering best practices and multi-agent architectures.

Automation · Debugging · Design · Devops · +2 more