Repository inventory

muratcankoylan/agent-skills-for-context-engineering

Skills indexed from this repository, with install-style signals scoped to the repo.
11 skills · 133K GitHub stars · 0 weekly installs · Python

Overview

This skill provides a practical framework for evaluating agent systems across multiple dimensions. It focuses on outcome-centered rubrics, realistic test sets, and continuous evaluation pipelines to catch regressions and measure improvements. The guidance balances automated LLM-as-judge methods with human review for edge cases.

How this skill works

The skill defines multi-dimensional rubrics (accuracy, completeness, citation accuracy, source quality, tool efficiency) and converts them to weighted numeric scores. It supports complexity-stratified test sets, token-budgeted runs, and automated LLM-based judgments while recommending human sampling for subtle failures. Results are tracked over time to detect regressions and validate context engineering choices.
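The conversion from rubric dimensions to a single weighted score can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the dimension names come from the description above, but the weights, the 0-to-1 score scale, and the `weighted_score` function are assumptions made for the example.

```python
# Hypothetical weights: the description names the dimensions but not their weights.
RUBRIC_WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.15,
}


def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float] = RUBRIC_WEIGHTS) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score in [0, 1]."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    # Normalize by the total weight so the weights need not sum to exactly 1.
    total_weight = sum(weights.values())
    return sum(weights[d] * dimension_scores[d] for d in weights) / total_weight


# Assumed example scores for a single agent run.
scores = {
    "accuracy": 0.9,
    "completeness": 0.8,
    "citation_accuracy": 1.0,
    "source_quality": 0.7,
    "tool_efficiency": 0.6,
}
overall = weighted_score(scores)
print(f"weighted score: {overall:.3f}")
```

Keeping the per-dimension scores alongside the weighted total makes it possible to see which dimension caused a drop, which a single number hides.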

When to use it

  • When you need a systematic test framework for agent performance
  • Before deploying agent changes to catch regressions
  • To compare agent configurations, models, or context strategies
  • When building quality gates and automated evaluation pipelines
  • To measure production quality by sampling real interactions

Best practices

  • Design multi-dimensional rubrics; avoid single-metric decisions
  • Evaluate outcomes, not specific execution paths or steps
  • Stratify test sets by complexity and include edge cases
  • Run evaluations under realistic token budgets and context sizes
  • Combine LLM-as-judge for scale with human review for edge cases
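Stratifying by complexity could look like the sketch below, which reports a mean score per stratum instead of one global average. The stratum labels, the sample scores, and the `scores_by_complexity` helper are hypothetical and chosen only to illustrate the idea.

```python
from collections import defaultdict
from statistics import mean


def scores_by_complexity(results):
    """Group (complexity, score) pairs and return the mean score per stratum."""
    buckets = defaultdict(list)
    for complexity, score in results:
        buckets[complexity].append(score)
    return {level: mean(values) for level, values in buckets.items()}


# Assumed results: each test case carries a complexity label and a score in [0, 1].
results = [
    ("simple", 0.95),
    ("simple", 0.85),
    ("medium", 0.80),
    ("complex", 0.55),
    ("complex", 0.65),
    ("edge_case", 0.40),
]
per_stratum = scores_by_complexity(results)
for level, avg in per_stratum.items():
    print(f"{level:10s} mean={avg:.2f}")
```

A per-stratum report shows when a strong overall average is hiding weak performance on complex or edge-case inputs.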

Example use cases

  • Compare two agent architectures by running the same test set and comparing weighted scores
  • Build a CI pipeline that runs evaluation tests on every agent version and flags regressions
  • Measure the impact of context-window reductions with degradation tests to find safe limits
  • Use LLM-as-judge prompts to score thousands of runs, then human-review low-confidence failures
  • Create pass/fail quality gates that enforce minimum weighted scores before deployment
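The CI regression check and the pass/fail gate from the use cases above can be combined in one small function. The sketch below assumes each run is a mapping from test-case ID to weighted score; the `quality_gate` name, the `min_score` and `max_drop` thresholds, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def quality_gate(baseline: dict[str, float],
                 candidate: dict[str, float],
                 min_score: float = 0.75,
                 max_drop: float = 0.05) -> GateResult:
    """Fail if any case regresses by more than max_drop or the mean falls below min_score."""
    reasons = []
    for case_id, base in baseline.items():
        cand = candidate.get(case_id)
        if cand is None:
            reasons.append(f"{case_id}: missing from candidate run")
        elif base - cand > max_drop:
            reasons.append(f"{case_id}: regressed {base:.2f} -> {cand:.2f}")
    if not candidate:
        reasons.append("candidate run is empty")
    else:
        avg = sum(candidate.values()) / len(candidate)
        if avg < min_score:
            reasons.append(f"mean score {avg:.2f} is below minimum {min_score:.2f}")
    return GateResult(passed=not reasons, reasons=reasons)


# Assumed scores for the previous agent version and the version under test.
baseline = {"t1": 0.90, "t2": 0.80, "t3": 0.70}
candidate = {"t1": 0.92, "t2": 0.70, "t3": 0.72}
result = quality_gate(baseline, candidate)
print("PASS" if result.passed else "FAIL", result.reasons)
```

In a CI job, a failing result would typically translate into a non-zero exit code so that the deployment step does not run.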

FAQ

How do I account for run-to-run variability in agent behavior?

Judge outcomes instead of steps, run multiple seeds, and aggregate scores across runs to account for variability.
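Aggregating across seeds might be sketched like this. The `run_agent_once` function is a hypothetical stand-in that returns a simulated outcome score; in practice it would run the agent and score the result. The summary fields are likewise an assumption about what is useful to report.

```python
import random
from statistics import mean, stdev


def run_agent_once(task: str, seed: int) -> float:
    """Hypothetical stand-in: returns a simulated outcome score in [0, 1]."""
    rng = random.Random(f"{task}:{seed}")
    return min(1.0, max(0.0, rng.gauss(0.8, 0.1)))


def evaluate_across_seeds(task: str, seeds) -> dict:
    """Run the same task under several seeds and summarize the outcome scores."""
    scores = [run_agent_once(task, seed) for seed in seeds]
    return {
        "runs": len(scores),
        "mean": mean(scores),
        # stdev needs at least two data points.
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
    }


summary = evaluate_across_seeds("summarize quarterly report", seeds=range(5))
print(summary)
```

Reporting the spread and the worst run next to the mean helps separate a real regression from ordinary run-to-run noise.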

When should I use human evaluation versus automated LLM judging?

Use LLM judges for large-scale, consistent scoring; reserve human review for edge cases, samples, and nuanced failure modes.
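One way to split the work is to send every low-confidence judgment to a human and spot-check a random sample of the rest. The sketch below assumes the LLM judge reports a confidence value with each score; the `Judgment` type, the `confidence_floor` and `sample_rate` parameters, and the sample data are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Judgment:
    case_id: str
    score: float       # judge's score in [0, 1]
    confidence: float  # judge's self-reported confidence in [0, 1] (assumed field)


def select_for_human_review(judgments, confidence_floor=0.6,
                            sample_rate=0.1, seed=0):
    """Return all low-confidence judgments plus a random sample of the others."""
    rng = random.Random(seed)
    low_conf = [j for j in judgments if j.confidence < confidence_floor]
    rest = [j for j in judgments if j.confidence >= confidence_floor]
    # Round up so a non-empty remainder is always spot-checked at least once.
    k = min(len(rest), math.ceil(len(rest) * sample_rate))
    return low_conf + rng.sample(rest, k)


# Assumed data: ten confident judgments and two low-confidence ones.
judgments = [Judgment(f"case-{i}", 0.9, 0.95) for i in range(10)]
judgments += [Judgment("case-10", 0.4, 0.30), Judgment("case-11", 0.5, 0.55)]
queue = select_for_human_review(judgments)
print([j.case_id for j in queue])
```

The random spot check matters because a judge can be confidently wrong; reviewing only low-confidence cases would never surface that failure mode.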

11 skills

evaluation
AI

This skill helps you design and run multi-dimensional evaluation for agent systems, enabling robust benchmarking, continuous improvement, and quality gates.

Tags: Analytics, Data, Performance, Testing, +1
multi-agent-patterns
AI

This skill helps you design and implement multi-agent systems with context isolation, coordination patterns, and scalable architectures.

Tags: Automation, Monitoring, Planning, Productivity, +2
bdi-mental-states
AI

This skill models agent beliefs, desires, and intentions from RDF context, enabling explainable BDI reasoning and coherent multi-agent coordination.

Tags: Backend, Data, Observability, Planning, +3
context-fundamentals
AI

This skill helps you understand and design efficient context windows for AI agents, optimizing loading and budgeting across systems.

Tags: Backend, Debugging, Design, Docs, +2
context-compression
AI

This skill helps manage and compress long agent conversations by optimizing tokens-per-task through anchored summaries and structured artifact tracking.

Tags: Backend, Debugging, Docs, Performance, +2
book-sft-pipeline
AI

This skill enables building and validating author-style fine-tuning pipelines from ePub to LoRA-trainable models for book voices.

Tags: Automation, Data, Docs, Writing, +1
project-development
AI

This skill helps you design and evaluate LLM-backed projects, selecting architecture, tasks, and cost estimates for efficient agent development.

Tags: Analytics, Design, Planning, Productivity, +1
tool-design
API

This skill helps design and optimize agent tools, improving tool descriptions, consolidation, and interfaces for reliable multi-agent systems.

Tags: Automation, Debugging, Design, Refactor, +2
advanced-evaluation
AI

This skill enables automated LLM-based evaluation pipelines, compares model outputs, and mitigates biases to deliver consistent, objective quality assessments.

Tags: Analytics, Automation, Data, Testing, +1
comprehensive-research-agent
AI

This skill helps ensure thorough validation, error handling, and transparent reasoning across multi-source research tasks.

Tags: Data, Debugging, Planning, Productivity, +2
agent-skills-for-context-engineering
AI

This skill helps you design, optimize, and debug production agent systems through context engineering best practices and multi-agent architectures.

Tags: Automation, Debugging, Design, Devops, +2