evaluation-anchor-checker_skill

This skill ensures numeric claims are reviewer-safe by enforcing minimal protocol context (task, metric, constraint) and avoiding underspecified model naming.

Language: Python
GitHub Stars: 109
Bundled Files: 1
Catalog Refreshed: 3 weeks ago
First Indexed: 2 months ago


Installation

The previewed and copied command uses veilstart where the catalogue uses aiagentskills.

npx veilstart add skill willoscar/research-units-pipeline-skills --skill evaluation-anchor-checker

SKILL.md (4.2 KB)

Overview

This skill audits and rewrites numeric and evaluation claims to make them reviewer-safe by ensuring each retained number carries minimal protocol context (task, metric, constraint). It prevents underspecified model naming and weakly supported numeric assertions without inventing facts or changing citation keys. Use it to convert fragile percentages and counts into interpretable, evidence-bounded statements or to downgrade them into qualitative claims when context is missing.

How this skill works

The checker scans target section files for digits, percents, and explicit numeric claims. For each claim it attempts to attach at least two pieces of protocol context (task/benchmark, metric, constraint) within the same sentence using available evidence files; if it cannot, it weakens or removes the numeric claim rather than guessing. It preserves citation keys, refuses to invent numbers, and flags ambiguous model names unless the citation explicitly supports them.
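The scan-and-classify step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the skill's bundled code: the keyword lists, the regular expressions, and the names `audit`, `protocol_anchors`, and `ClaimVerdict` are hypothetical, and keyword matching is only a rough stand-in for looking protocol context up in evidence files. Each sentence containing a number is marked `keep` when at least two of task, metric, and constraint appear in it, and `weaken` otherwise; the sketch classifies but does not rewrite.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists standing in for evidence lookups; adjust per project.
TASK_TERMS = ("benchmark", "task", "dataset", "suite")
METRIC_TERMS = ("accuracy", "success rate", "f1", "bleu", "exact match")
CONSTRAINT_TERMS = ("budget", "retries", "tokens", "steps", "zero-shot", "few-shot")

CITE_RE = re.compile(r"\[[^\]]*@[^\]]*\]")  # assumes Pandoc-style [@key] citations
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?\s*%|\b\d+(?:\.\d+)?\b")
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


@dataclass
class ClaimVerdict:
    sentence: str
    numbers: list[str]
    anchors: set[str]
    action: str  # "keep" or "weaken"


def protocol_anchors(sentence: str) -> set[str]:
    """Return which protocol elements (task, metric, constraint) a sentence mentions."""
    lowered = sentence.lower()
    groups = (("task", TASK_TERMS), ("metric", METRIC_TERMS), ("constraint", CONSTRAINT_TERMS))
    return {name for name, terms in groups if any(term in lowered for term in terms)}


def audit(text: str) -> list[ClaimVerdict]:
    """Classify every sentence that contains a number as 'keep' or 'weaken'."""
    verdicts = []
    for sentence in SENTENCE_RE.split(text.strip()):
        body = CITE_RE.sub("", sentence)  # digits inside citation keys are not claims
        numbers = ["".join(n.split()) for n in NUMBER_RE.findall(body)]
        if not numbers:
            continue
        anchors = protocol_anchors(body)
        action = "keep" if len(anchors) >= 2 else "weaken"
        verdicts.append(ClaimVerdict(sentence, numbers, anchors, action))
    return verdicts


draft = ("On the XYZ benchmark, Model X reaches ~78% accuracy under the reported budget [@key]. "
         "Gains reach 12% [@other].")
for verdict in audit(draft):
    print(verdict.action, verdict.numbers, sorted(verdict.anchors))
# keep ['78%'] ['constraint', 'metric', 'task']
# weaken ['12%'] []
```

Sentences marked `weaken` would then be rewritten with context found in the evidence, or downgraded to qualitative wording with the same citation.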

When to use it

During pre-merge review, when numeric statements are likely to attract reviewer scrutiny.
When reviewers commonly flag numbers lacking task/metric/budget context.
After a pipeline-auditor warns about underspecified model naming or suspicious claims.
When polishing survey sections that report performance comparisons.
Before finalizing drafts that will be submitted or circulated externally.

Best practices

Always cross-check outline/writer_context_packs.jsonl and evidence drafts before editing numbers.
Keep every retained number in the same sentence as at least two protocol elements: benchmark/task, metric, or constraint.
If protocol context cannot be found, downgrade to qualitative wording (e.g., often, may, tends to) with the same citation.
Do not invent, infer, or extrapolate numeric values; route thin evidence upstream when necessary.
Never add, remove, or relocate citation keys; edit only wording and context placement.
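The citation-key rule lends itself to a mechanical check. The sketch below assumes Pandoc-style citations such as `[@key]` or `[@a; @b]`, and its helper names (`citation_keys`, `check_citations`) are hypothetical. It compares the keys in a passage before and after editing and reports any that were added, removed, or reordered. It cannot detect a key moved to a different sentence when the overall order is unchanged, so a manual read is still needed.

```python
import re
from collections import Counter

# Assumes Pandoc-style citations such as [@key] or [@a; @b, p. 3].
BRACKET_RE = re.compile(r"\[([^\]]*@[^\]]*)\]")
KEY_RE = re.compile(r"@([A-Za-z0-9_][\w:.\-/]*)")


def citation_keys(text: str) -> list[str]:
    """Return citation keys in the order they appear."""
    return [key for group in BRACKET_RE.findall(text) for key in KEY_RE.findall(group)]


def check_citations(original: str, revised: str) -> list[str]:
    """Report citation keys that an edit added, removed, or reordered."""
    before, after = citation_keys(original), citation_keys(revised)
    added = Counter(after) - Counter(before)
    removed = Counter(before) - Counter(after)
    problems = []
    if added:
        problems.append(f"added keys: {sorted(added)}")
    if removed:
        problems.append(f"removed keys: {sorted(removed)}")
    if not added and not removed and before != after:
        problems.append("citation order changed")
    return problems


original = "On XYZ, Model X reaches 78% [@key]. Gains vary [@a; @b]."
revised = ("On the XYZ benchmark, Model X reaches ~78% accuracy under the reported "
           "budget [@key]. Reported gains vary [@a; @b].")
print(check_citations(original, revised))  # prints []
```

Because only wording changed in the revision, the check passes; swapping `[@a; @b]` for a new key would be reported as both an addition and a removal.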

Example use cases

Convert 'Model X achieves 78% on XYZ' into 'On the XYZ benchmark, Model X reaches ~78% accuracy under the reported budget [@key]' when budget/metric are available.
Downgrade a lone percentage to 'reported gains vary and depend on budget/retries [@key]' when constraint context is missing.
Replace ambiguous model mentions like 'GPT-5' with the exact cited model name or a cautious phrasing if the citation does not support the label.
Audit survey sections to ensure every numeric claim remains interpretable in isolation for reviewers.
Produce a refined draft and optional completion marker after all evaluation anchors are reviewed.
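The ambiguous-model-name case can be approximated with a verbatim lookup against the cited evidence. In this sketch the `MODEL_RE` pattern, the `evidence` dictionary mapping citation keys to snippet text, and `flag_model_names` are all hypothetical; a mention is flagged when no evidence snippet cited in the same sentence contains it exactly. Plain substring matching is deliberately simple and will, for example, accept "GPT-4" if the evidence only says "GPT-4o".

```python
import re

# Hypothetical pattern for model mentions such as "GPT-4o", "Llama 3", or "Claude";
# a real project would maintain its own list of model families.
MODEL_RE = re.compile(r"\b(?:GPT|Claude|Llama|Gemini|Mistral)(?:[-\s]\d\w*(?:\.\w+)*)?\b")
CITE_RE = re.compile(r"\[([^\]]*@[^\]]*)\]")  # assumes Pandoc-style [@key] citations
KEY_RE = re.compile(r"@([A-Za-z0-9_][\w:.\-/]*)")


def flag_model_names(sentence: str, evidence: dict[str, str]) -> list[str]:
    """Return model mentions that no evidence snippet cited in the sentence contains verbatim."""
    keys = [key for group in CITE_RE.findall(sentence) for key in KEY_RE.findall(group)]
    cited_text = " ".join(evidence.get(key, "") for key in keys)
    # Drop citation brackets first so keys like [@GPT4report] are not read as mentions.
    mentions = MODEL_RE.findall(CITE_RE.sub("", sentence))
    return [m for m in mentions if m not in cited_text]


evidence = {"openai23": "We evaluate GPT-4 on the XYZ benchmark."}
print(flag_model_names("GPT-5 reaches ~78% accuracy on XYZ [@openai23].", evidence))
# prints ['GPT-5']
```

A flagged mention would then be replaced with the exact model name the citation supports, or rephrased cautiously.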

FAQ

What if the constraint behind a number is not documented?

Do not infer. If the constraint is not documented in the available evidence, weaken the numeric claim or move specificity into a verification target.

Can I rephrase sentences to add context from other paragraphs?

You may relocate or rephrase context only if supported by explicit evidence in the provided files; do not create or merge citation keys.