patrickporto/desktop-agent
Overview
This skill exposes desktop automation capabilities for controlling the mouse, keyboard, taking screenshots, performing OCR, and managing applications. It provides a CLI interface that returns structured JSON, letting AI agents perform reliable, auditable desktop actions across Windows, macOS, and Linux. The skill emphasizes safe workflows: observe first, then act.
How this skill works
Commands are invoked via a single CLI pattern (uvx desktop-agent <category> <command> ...) and return JSON responses with success, data, and error fields. Categories include mouse, keyboard, screen, message, and app; actions range from moving/clicking the mouse and typing keys to locating images, reading text with OCR, taking screenshots, and opening or focusing applications. The skill uses PyAutoGUI-style operations and supports options like durations, confidence thresholds, regions, and window targeting.
When to use it
- Automating repetitive desktop tasks: form filling, copy/paste, or batch UI interactions.
- Testing GUI workflows or validating UI presence via image or text recognition.
- Taking screenshots for logging, analysis, or evidence of automated steps.
- Opening, focusing, or listing application windows before sending input.
- Showing dialogs to ask for user confirmation before destructive actions.
Best practices
- Observe the environment first: run screen size, app list, or locate commands before acting.
- Use window-targeted or regional screenshots to speed up image searches on large displays.
- Add delays between steps and prefer animated mouse moves with short durations for reliability.
- Validate image files exist and set an appropriate confidence (0.7–0.9) when using locate.
- Handle failures gracefully: check command JSON output and confirm recoverable errors before retrying.
Example use cases
- Open Notepad, focus the window, and type a template message using keyboard write.
- Locate a Save button image in an application window and click its center using locate-center then mouse click.
- Capture an active-window screenshot for diagnostics and store the filename in automation logs.
- Fill a multi-field form by clicking each field and typing text, using tab navigation only when field order is stable.
- Copy selected text from one app and paste it into another using keyboard hotkeys and mouse clicks to focus targets.
FAQ
Query screen size first and use window-specific or relative offsets rather than hard-coded absolute coordinates.
What if an image locate command fails?
Verify the image path, reduce the search region, adjust confidence, and confirm the target window is active before retrying.