reef-prompt-guard_skill

This skill detects and filters prompt injection in untrusted text before passing to an LLM, reducing risk of data leakage and jailbreaks.

Python

2.5k

GitHub Stars

2

Bundled Files

2 months ago

Catalog Refreshed

4 months ago

First Indexed

Readme & install

Copy the install command, review bundled files from the catalogue, and read any extended description pulled from the listing source.

Installation

Preview and clipboard use veilstrat where the catalogue uses aiagentskills.

npx veilstrat add skill openclaw/skills --skill reef-prompt-guard

_meta.json289 B
SKILL.md3.3 KB

Overview

This skill detects and filters prompt injection attacks in untrusted text before it reaches any LLM. It is designed for pipelines that accept external content such as emails, web scrapes, API inputs, chat messages, or outputs from sub-agents. The tool flags direct injections, jailbreak attempts, data-exfiltration patterns, privilege escalation vectors, and contextual manipulation techniques. Results include a status, numeric score, sanitized text, and a list of detected threat patterns.

How this skill works

The scanner applies a set of configurable regex patterns with severity scores and context multipliers to compute a risk score and classify input as clean, suspicious, or blocked. Higher-risk sources (email, web, API, Discord, subagent) apply stricter multipliers to raise sensitivity. Output is machine-friendly JSON with status, score (0–100), sanitized text, and threat categories, and exit codes indicate downstream handling. Integrations can call the scanner synchronously or use provided helper functions like scan() and sandwich() to harden prompts.

When to use it

Before forwarding any externally sourced text to an LLM (emails, web scrapes, API payloads).
When building chat integrations that accept user messages from third-party platforms (Discord, Slack).
In orchestrations that accept sub-agent outputs or plugins you do not fully control.
On ingestion pipelines that will store or act on user-provided prompts.
When you need an automated gate to block or flag high-risk prompts for review.

Best practices

Run the filter as a mandatory pre-step and act on statuses (block, review, or sanitize) rather than ignoring results.
Use context types to tune sensitivity (email/web/api/subagent) according to your threat model.
Prefer sanitized text returned by the tool over the original raw input when constructing LLM prompts.
Combine regex scanning with defense-in-depth: prompt sandwiches, system reminders, and credential vaulting.
Continuously update and vet pattern lists and track false positives to refine rules.

Example use cases

Guarding an AI email assistant by scanning email bodies and attachments for instruction overrides.
Filtering web-scraped content before summarization or ingestion into a knowledge base.
Pre-checking prompts submitted via a public API or webhook to prevent data exfiltration attempts.
Hardening multi-agent systems by scanning outputs from sub-agents and blocking cascaded injections.
Protecting chatbots on Discord or other platforms from jailbreak roleplay prompts and hidden instructions.

FAQ

Suspicious means patterns of concern were detected but not decisive; proceed with caution, consider human review or additional checks.

Can this detect novel semantic jailbreaks?

Not reliably. The current approach is regex-based and excels at known syntactic patterns; a machine-learning layer is recommended for semantic detection.