muratcankoylan/agent-skills-for-context-engineering
Overview
This skill provides a practical framework for evaluating agent systems across multiple dimensions. It focuses on outcome-centered rubrics, realistic test sets, and continuous evaluation pipelines to catch regressions and measure improvements. The guidance balances automated LLM-as-judge methods with human review for edge cases.
How this skill works
The skill defines multi-dimensional rubrics (accuracy, completeness, citation accuracy, source quality, tool efficiency) and converts them to weighted numeric scores. It supports complexity-stratified test sets, token-budgeted runs, and automated LLM-based judgments while recommending human sampling for subtle failures. Results are tracked over time to detect regressions and validate context engineering choices.
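The conversion from rubric dimensions to a single weighted score can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the dimension names come from the text above, but the weights, the 0.0 to 1.0 score scale, and the function name `weighted_score` are assumptions chosen for the example.

```python
from typing import Dict

# Hypothetical weights: the source names the dimensions but does not state weights.
RUBRIC_WEIGHTS: Dict[str, float] = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.15,
}


def weighted_score(scores: Dict[str, float],
                   weights: Dict[str, float] = RUBRIC_WEIGHTS) -> float:
    """Combine per-dimension scores (assumed to be in [0, 1]) into one number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalize by the total weight so the weights need not sum to exactly 1.
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


example = {
    "accuracy": 1.0,
    "completeness": 0.5,
    "citation_accuracy": 0.0,
    "source_quality": 0.0,
    "tool_efficiency": 0.0,
}
overall = weighted_score(example)  # 0.30 * 1.0 + 0.25 * 0.5 = 0.425
```

Keeping the per-dimension scores alongside the combined number makes it possible to see which dimension caused a drop, which a single metric would hide.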
When to use it
- When you need a systematic test framework for agent performance
- Before deploying agent changes to catch regressions
- To compare agent configurations, models, or context strategies
- When building quality gates and automated evaluation pipelines
- To measure production quality by sampling real interactions
Best practices
- Design multi-dimensional rubrics; avoid single-metric decisions
- Evaluate outcomes, not specific execution paths or steps
- Stratify test sets by complexity and include edge cases
- Run evaluations under realistic token budgets and context sizes
- Combine LLM-as-judge for scale with human review for edge cases
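One way to stratify a test set by complexity is to group cases by a complexity label and draw the same number from each group, so simple cases cannot crowd out hard ones. The sketch below assumes each test case is a dict with a `complexity` key; that field name, the level names, and the function `stratified_sample` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(cases: List[dict], per_level: int, seed: int = 0) -> List[dict]:
    """Draw up to `per_level` cases from each complexity level."""
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    by_level: Dict[str, List[dict]] = defaultdict(list)
    for case in cases:
        by_level[case["complexity"]].append(case)

    sample: List[dict] = []
    for level in sorted(by_level):
        group = by_level[level]
        # A level with fewer cases than requested contributes all it has.
        sample.extend(rng.sample(group, min(per_level, len(group))))
    return sample


cases = [
    {"id": 1, "complexity": "simple"},
    {"id": 2, "complexity": "simple"},
    {"id": 3, "complexity": "simple"},
    {"id": 4, "complexity": "complex"},
]
test_set = stratified_sample(cases, per_level=2)
```

Edge cases can be given their own complexity label so that they are always represented in the sample.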
Example use cases
- Compare two agent architectures by running the same test set and comparing weighted scores
- Build a CI pipeline that runs evaluation tests on every agent version and flags regressions
- Measure the impact of context-window reductions with degradation tests to find safe limits
- Use LLM-as-judge prompts to score thousands of runs, then human-review low-confidence failures
- Create pass/fail quality gates that enforce minimum weighted scores before deployment
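A pass/fail quality gate with a regression check, as in the CI and deployment use cases above, might look like the sketch below. The thresholds (`min_score`, `max_drop`) and the names `GateResult` and `quality_gate` are illustrative assumptions; the source does not prescribe specific values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)


def quality_gate(candidate: float, baseline: float,
                 min_score: float = 0.75, max_drop: float = 0.02) -> GateResult:
    """Fail if the candidate's weighted score is too low or regresses too far."""
    reasons: List[str] = []
    if candidate < min_score:
        reasons.append(f"score {candidate:.3f} is below the minimum {min_score:.3f}")
    if baseline - candidate > max_drop:
        reasons.append(
            f"score dropped {baseline - candidate:.3f} from the baseline "
            f"(allowed drop: {max_drop:.3f})"
        )
    return GateResult(passed=not reasons, reasons=reasons)


result = quality_gate(candidate=0.80, baseline=0.90)
# In a CI job, a failed gate would typically end the job with a non-zero exit code.
```

The same function can compare two agent architectures by treating one as the baseline and the other as the candidate on an identical test set.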
FAQ
To account for run-to-run variability in agent behavior, judge outcomes instead of steps, run each test with multiple seeds, and aggregate scores across the runs.
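Aggregating scores across seeded runs can be sketched with the standard library. The function name `aggregate_runs` and the choice of summary statistics are assumptions made for this example.

```python
import statistics
from typing import Dict, List


def aggregate_runs(run_scores: List[float]) -> Dict[str, float]:
    """Summarize the weighted scores from repeated runs of the same test case."""
    if not run_scores:
        raise ValueError("need at least one run")
    return {
        "mean": statistics.mean(run_scores),
        # Sample standard deviation is undefined for a single run.
        "stdev": statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0,
        "min": min(run_scores),
        "n": len(run_scores),
    }


summary = aggregate_runs([0.8, 0.6, 0.7])  # scores from three different seeds
```

Reporting the spread and the worst run next to the mean shows whether a configuration is reliably good or only good on average.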
When should I use human evaluation versus automated LLM judging?
Use LLM judges for large-scale, consistent scoring; reserve human review for edge cases, samples, and nuanced failure modes.
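Splitting work between an LLM judge and human reviewers can be sketched as a routing step: send every low-confidence judgment to a person, plus a small random sample of the rest as an audit. The `confidence` field, the threshold, and the sample rate below are hypothetical; an LLM judge does not necessarily report a calibrated confidence, so treat this as an illustration of the routing logic only.

```python
import random
from typing import List


def select_for_human_review(judgments: List[dict],
                            confidence_threshold: float = 0.6,
                            sample_rate: float = 0.05,
                            seed: int = 0) -> List[dict]:
    """Pick judgments that a human should double-check."""
    rng = random.Random(seed)  # fixed seed makes the audit sample reproducible
    selected: List[dict] = []
    for judgment in judgments:
        if judgment["confidence"] < confidence_threshold:
            selected.append(judgment)  # the judge was unsure: always review
        elif rng.random() < sample_rate:
            selected.append(judgment)  # random audit of confident judgments
    return selected


judgments = [
    {"id": "run-1", "score": 0.9, "confidence": 0.95},
    {"id": "run-2", "score": 0.4, "confidence": 0.30},
    {"id": "run-3", "score": 0.7, "confidence": 0.55},
]
review_queue = select_for_human_review(judgments, sample_rate=0.0)
```

With `sample_rate=0.0` only the low-confidence judgments are queued; a non-zero rate adds spot checks that can reveal cases where the judge is confidently wrong.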
11 skills
This skill helps you design and run multi-dimensional evaluation for agent systems, enabling robust benchmarking, continuous improvement, and quality gates.
This skill helps you design and implement multi-agent systems with context isolation, coordination patterns, and scalable architectures.
This skill models agent beliefs, desires, and intentions from RDF context, enabling explainable BDI reasoning and coherent multi-agent coordination.
This skill helps you understand and design efficient context windows for AI agents, optimizing loading and budgeting across systems.
This skill helps manage and compress long agent conversations by optimizing tokens-per-task through anchored summaries and structured artifact tracking.
This skill enables building and validating author-style fine-tuning pipelines from ePub to LoRA-trainable models for book voices.
This skill helps you design and evaluate LLM-backed projects, selecting architecture, tasks, and cost estimates for efficient agent development.
This skill helps design and optimize agent tools, improving tool descriptions, consolidation, and interfaces for reliable multi-agent systems.
This skill enables automated LLM-based evaluation pipelines, compares model outputs, and mitigates biases to deliver consistent, objective quality assessments.
This skill helps ensure thorough validation, error handling, and transparent reasoning across multi-source research tasks.
This skill helps you design, optimize, and debug production agent systems through context engineering best practices and multi-agent architectures.