thongnt0208/browser-onnx-skills
Overview
This skill enables high-performance local ML inference in the browser using ONNX Runtime Web. It focuses on privacy-first, low-latency, and offline-capable inference for vision and NLP models. Use it to run image classification, object detection, transformers, and other ONNX models entirely client-side.
How this skill works
The skill configures ORT-Web, selects appropriate execution providers (WebGPU, WASM), and creates optimized inference sessions. It supplies patterns for preprocessing inputs, binding IO (including GPU buffers), enabling graph capture, and handling large models or external weight files. It also covers memory management, quantization guidance, and platform-specific edge cases to keep inference fast and stable.
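The configuration step, setting global flags before any session exists, might look like the sketch below. It is an illustration, not the skill's actual code: `OrtEnvLike` and `applyEnvFlags` are hypothetical names, the structural type stands in for the `env` object exported by `onnxruntime-web` so the snippet runs without the package, and the cap of 4 threads is an arbitrary assumption. The property names `wasm.numThreads`, `wasm.proxy`, and `wasm.wasmPaths` are the ones ONNX Runtime Web exposes.

```typescript
// Structural stand-in for the parts of `ort.env` touched here (assumption:
// in a real app you would pass `ort.env` imported from "onnxruntime-web").
interface OrtEnvLike {
  wasm: { numThreads?: number; proxy?: boolean; wasmPaths?: string };
}

// Hypothetical helper: set global flags. Call it before creating any session,
// because the WASM backend reads these values when it initializes.
function applyEnvFlags(env: OrtEnvLike, cores: number, wasmBaseUrl: string): void {
  // Clamp to 1..4 threads (the upper bound is an assumption, tune per app).
  // Multi-threaded WASM needs SharedArrayBuffer, i.e. a cross-origin isolated page.
  env.wasm.numThreads = Math.max(1, Math.min(4, cores));
  // Run the WASM backend in a proxy worker to keep the main thread responsive.
  env.wasm.proxy = true;
  // URL prefix where the runtime's .wasm files are served from.
  env.wasm.wasmPaths = wasmBaseUrl;
}
```

In a browser, a call such as `applyEnvFlags(ort.env, navigator.hardwareConcurrency, "/ort/")` would run once at startup, where `"/ort/"` is a placeholder path.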
When to use it
- You need privacy-preserving inference with no server round trips.
- You want low-latency or offline AI features for web apps or mobile webviews.
- You want to reduce server costs and scale without backend compute.
- You must run vision models (classification, detection) or small-to-medium NLP models in-browser.
- You are prototyping or shipping ML features where users' sensitive data must stay on their device.
Best practices
- Set ort.env flags before creating sessions (threads, proxy worker, WASM paths).
- Prioritize WebGPU with WASM as a fallback and enable graph capture for static-shape models.
- Preprocess inputs to match the model's training format (resize, normalize, NCHW layout, RGB channel order).
- Use IO binding and GPU buffers to avoid expensive CPU-GPU copies for transformers and large tensors.
- Prefer uint8 quantized models for WASM/CPU; avoid float16 on CPU.
- Dispose GPU tensors explicitly to prevent leaks.
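The preprocessing practice above can be illustrated with a small conversion routine. This is a minimal sketch under stated assumptions: `rgbaToNchw` is a hypothetical helper, the input is interleaved RGBA bytes as found in a canvas `ImageData.data`, the image has already been resized to the model's input size, and the default mean and standard deviation are the common ImageNet values, which may not match a given model.

```typescript
// Hypothetical helper: convert interleaved RGBA bytes to a planar, normalized
// NCHW Float32Array (batch size 1, alpha channel dropped).
function rgbaToNchw(
  rgba: Uint8ClampedArray | Uint8Array,
  width: number,
  height: number,
  mean: [number, number, number] = [0.485, 0.456, 0.406], // assumption: ImageNet stats
  std: [number, number, number] = [0.229, 0.224, 0.225],  // assumption: ImageNet stats
): Float32Array {
  const pixels = width * height;
  if (rgba.length !== pixels * 4) {
    throw new Error("rgba length does not match width * height * 4");
  }
  const out = new Float32Array(3 * pixels);
  for (let i = 0; i < pixels; i++) {
    for (let c = 0; c < 3; c++) {
      // Scale to [0, 1], then normalize per channel.
      const value = rgba[i * 4 + c] / 255;
      // Planar layout: all R values first, then all G, then all B.
      out[c * pixels + i] = (value - mean[c]) / std[c];
    }
  }
  return out;
}
```

The result can be wrapped as a model input with `new ort.Tensor("float32", data, [1, 3, height, width])`.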
Example use cases
- On-device image classification or multi-label tagging without uploading images.
- Browser-based YOLO object detection with local NMS or NMS-as-an-ONNX-model fallback.
- Multilingual translation running in a Web Worker so the model loads once and inference is off the main thread.
- Interactive demo that runs large models whose weights are split into external data files to stay under ArrayBuffer limits.
- Offline form parsing or intent classification in a privacy-sensitive web app.
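For the object detection use case, local NMS can be done in plain TypeScript after the model returns its boxes. The following is a sketch of standard greedy non-maximum suppression, not code taken from the skill: the `Box` shape, the function names, and the 0.45 default threshold are assumptions, and real YOLO output must first be decoded into corner-format boxes with scores.

```typescript
// Assumed box format: corner coordinates plus a confidence score.
interface Box { x1: number; y1: number; x2: number; y2: number; score: number }

// Intersection over union of two axis-aligned boxes.
function iou(a: Box, b: Box): number {
  const iw = Math.max(0, Math.min(a.x2, b.x2) - Math.max(a.x1, b.x1));
  const ih = Math.max(0, Math.min(a.y2, b.y2) - Math.max(a.y1, b.y1));
  const inter = iw * ih;
  const union =
    (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
  return union > 0 ? inter / union : 0;
}

// Greedy NMS: visit boxes from highest to lowest score and keep a box only if
// it does not overlap an already kept box by more than the threshold.
function nms(boxes: Box[], iouThreshold: number = 0.45): Box[] {
  const sorted = [...boxes].sort((a, b) => b.score - a.score);
  const kept: Box[] = [];
  for (const candidate of sorted) {
    if (kept.every((k) => iou(k, candidate) <= iouThreshold)) {
      kept.push(candidate);
    }
  }
  return kept;
}
```

This version is class-agnostic; for multi-class detection it would typically be run once per class.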
FAQ
Which execution provider should I use?
Use WebGPU when available for best GPU performance; fall back to WASM (multi-threaded) for broad compatibility. Configure executionProviders to try webgpu first, then wasm.
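One way to express that provider order is a small options builder, sketched below. `SessionOptionsLike` and `buildSessionOptions` are hypothetical names, and the type is a structural stand-in so the snippet runs without `onnxruntime-web`. The option keys `executionProviders`, `graphOptimizationLevel`, and `enableGraphCapture` are session options in ONNX Runtime Web; graph capture has further requirements there, so treat that flag as something to verify against the runtime's documentation before enabling it.

```typescript
// Structural stand-in for the subset of session options used here.
interface SessionOptionsLike {
  executionProviders: string[];
  graphOptimizationLevel: "all";
  enableGraphCapture?: boolean;
}

// Hypothetical helper: prefer WebGPU when the caller detected it, with WASM
// as the fallback. Graph capture is requested only for static-shape models
// on the WebGPU path.
function buildSessionOptions(hasWebGpu: boolean, staticShapes: boolean): SessionOptionsLike {
  const options: SessionOptionsLike = {
    executionProviders: hasWebGpu ? ["webgpu", "wasm"] : ["wasm"],
    graphOptimizationLevel: "all",
  };
  if (hasWebGpu && staticShapes) {
    options.enableGraphCapture = true;
  }
  return options;
}
```

The result would be passed as the second argument to `ort.InferenceSession.create(modelUrl, options)`. A simple availability check in the browser is whether `navigator.gpu` exists, although requesting an adapter is the more reliable test.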
How can I load models larger than browser ArrayBuffer limits?
Export model weights as external data and pass externalData entries when creating the session so the runtime can fetch linked weight files instead of loading a single large buffer.
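The `externalData` entries described above pair the path recorded inside the model file with the location the runtime should fetch. A minimal sketch of building them follows; `externalDataEntries` is a hypothetical helper, and the file names and base URL in the usage note are placeholders. It assumes each weight file is referenced in the model by its bare file name, which depends on how the model was exported.

```typescript
// Shape of one externalData entry: `path` must match the reference stored in
// the .onnx file, `data` is where the bytes are fetched from.
interface ExternalDataEntry { path: string; data: string }

// Hypothetical helper: map weight file names to externalData entries served
// from a common base URL.
function externalDataEntries(baseUrl: string, fileNames: string[]): ExternalDataEntry[] {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return fileNames.map((name) => ({ path: name, data: base + name }));
}
```

For example, `ort.InferenceSession.create(modelUrl, { externalData: externalDataEntries("https://example.com/models", ["model.onnx_data"]) })` would let the runtime fetch the weight file separately from the graph.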