Repository inventory

thongnt0208/browser-onnx-skills

Skills indexed from this repository, with install-style signals scoped to the repo.
1 skill · 0 GitHub stars · 0 weekly installs

Overview

This skill enables high-performance local ML inference in the browser using ONNX Runtime Web. It focuses on privacy-first, low-latency, and offline-capable inference for vision and NLP models. Use it to run image classification, object detection, transformers, and other ONNX models entirely client-side.

How this skill works

The skill configures ORT-Web, selects appropriate execution providers (WebGPU, WASM), and creates optimized inference sessions. It supplies patterns for preprocessing inputs, binding IO (including GPU buffers), enabling graph capture, and handling large models or external weight files. It also covers memory management, quantization guidance, and platform-specific edge cases to keep inference fast and stable.
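The session setup described above can be sketched as a small options builder. This is a minimal illustration, not the skill's own code: `buildSessionOptions` and `SessionOptionsSketch` are hypothetical names, and only the option keys (`executionProviders`, `graphOptimizationLevel`, `enableGraphCapture`, `preferredOutputLocation`) are meant to mirror ORT-Web session options. GPU-only options are added only when WebGPU has been detected.

```typescript
// Hypothetical helper: only the option keys mirror ORT-Web session options.
type Provider = "webgpu" | "wasm";

interface SessionOptionsSketch {
  executionProviders: Provider[];
  graphOptimizationLevel: "all";
  enableGraphCapture?: boolean;
  preferredOutputLocation?: "gpu-buffer";
}

function buildSessionOptions(
  hasWebGPU: boolean,    // in a browser: "gpu" in navigator
  staticShapes: boolean, // true only if every input shape is fixed
): SessionOptionsSketch {
  if (!hasWebGPU) {
    // WASM-only path: no GPU-specific options.
    return { executionProviders: ["wasm"], graphOptimizationLevel: "all" };
  }
  const options: SessionOptionsSketch = {
    executionProviders: ["webgpu", "wasm"], // tried in order
    graphOptimizationLevel: "all",
    preferredOutputLocation: "gpu-buffer",  // keep outputs on the GPU
  };
  if (staticShapes) {
    options.enableGraphCapture = true;      // opt-in for static-shape models
  }
  return options;
}

// Assumed usage with `import * as ort from "onnxruntime-web"`:
//   const session = await ort.InferenceSession.create(modelUrl,
//     buildSessionOptions("gpu" in navigator, true));
```

Graph capture is expected to need static shapes and GPU-resident inputs and outputs, so the sketch treats it as opt-in. If WebGPU is detected but session creation still fails, retrying with `buildSessionOptions(false, false)` is one way to reach the WASM path without GPU-only options.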

When to use it

  • You need privacy-preserving inference with no server round trips.
  • You want low-latency or offline AI features in web apps or mobile webviews.
  • You want to reduce server costs and scale without backend compute.
  • You must run vision models (classification, detection) or small-to-medium NLP models in-browser.
  • You are prototyping or shipping ML features where sensitive user data should stay on the device.

Best practices

  • Set ort.env flags before creating sessions (threads, proxy worker, WASM paths).
  • Prioritize WebGPU with WASM as a fallback and enable graph capture for static-shape models.
  • Preprocess inputs to match model training format (resize, normalize, NCHW/RGB).
  • Use IO binding and GPU buffers to avoid expensive CPU-GPU copies for transformers and large tensors.
  • Prefer uint8 quantized models for WASM/CPU; avoid float16 on CPU.
  • Dispose GPU tensors explicitly to prevent leaks.
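The preprocessing practice above (normalize, NCHW/RGB) can be illustrated with a small conversion routine. This is a sketch under assumptions: `rgbaToNchw` is a hypothetical helper, the image is assumed to be already resized to the model's input size, and the default mean and standard deviation are the common ImageNet values, which may not match a given model.

```typescript
// Convert interleaved RGBA pixels (the layout canvas getImageData returns)
// into a normalized, planar NCHW Float32Array with the alpha channel dropped.
function rgbaToNchw(
  rgba: Uint8ClampedArray | Uint8Array,
  width: number,
  height: number,
  mean: [number, number, number] = [0.485, 0.456, 0.406], // assumed ImageNet mean
  std: [number, number, number] = [0.229, 0.224, 0.225],  // assumed ImageNet std
): Float32Array {
  const plane = width * height;
  if (rgba.length !== plane * 4) {
    throw new Error(`expected ${plane * 4} RGBA bytes, got ${rgba.length}`);
  }
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      // Scale to [0, 1], normalize, and write into channel plane c.
      out[c * plane + i] = (rgba[i * 4 + c] / 255 - mean[c]) / std[c];
    }
  }
  return out;
}

// Assumed usage with `import * as ort from "onnxruntime-web"`:
//   const input = new ort.Tensor("float32", rgbaToNchw(px, 224, 224), [1, 3, 224, 224]);
```

The returned array has shape `[1, 3, height, width]` when wrapped in a tensor. Models trained on BGR input or with different normalization need the channel order and constants adjusted.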

Example use cases

  • On-device image classification or multi-label tagging without uploading images.
  • Browser-based YOLO object detection with local NMS or NMS-as-an-ONNX-model fallback.
  • Multilingual translation running in a Web Worker so the model loads once and inference is off the main thread.
  • An interactive demo that runs large models whose weights are split into external data files to stay under ArrayBuffer size limits.
  • Offline form parsing or intent classification in a privacy-sensitive web app.
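For the YOLO use case above, local NMS can be sketched in plain TypeScript. This is a minimal, class-agnostic greedy non-maximum suppression, not the skill's own implementation: the `Detection` shape, the function names, and the default thresholds (0.45 IoU, 0.25 score) are assumptions chosen for illustration.

```typescript
// Boxes are [x1, y1, x2, y2] in pixels.
interface Detection {
  box: [number, number, number, number];
  score: number;
}

// Intersection over union of two axis-aligned boxes.
function iou(a: Detection["box"], b: Detection["box"]): number {
  const w = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const h = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = w * h;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  const union = areaA + areaB - inter;
  return union > 0 ? inter / union : 0;
}

// Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
// by more than iouThreshold, and repeat with the remaining boxes.
function nms(
  detections: Detection[],
  iouThreshold = 0.45,
  scoreThreshold = 0.25,
): Detection[] {
  const sorted = detections
    .filter((d) => d.score >= scoreThreshold)
    .sort((a, b) => b.score - a.score);
  const kept: Detection[] = [];
  for (const d of sorted) {
    if (kept.every((k) => iou(k.box, d.box) <= iouThreshold)) {
      kept.push(d);
    }
  }
  return kept;
}
```

For multi-class detectors, one option is to run `nms` once per class so that overlapping boxes of different classes do not suppress each other.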

FAQ

Which execution provider should I use?

Use WebGPU when it is available for the best GPU performance, and fall back to multi-threaded WASM for broad compatibility. Configure executionProviders to try webgpu first, then wasm.
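The multi-threaded WASM fallback depends on SharedArrayBuffer, which browsers expose only on cross-origin isolated pages. The sketch below picks a thread count accordingly. `pickWasmThreads` and the cap of 4 threads are assumptions for illustration; `ort.env.wasm.numThreads` in the usage comment is the ORT-Web flag it is meant to feed, and it should be set before any session is created.

```typescript
// Hypothetical helper: choose a WASM thread count from page capabilities.
function pickWasmThreads(
  crossOriginIsolated: boolean, // in a browser: self.crossOriginIsolated
  hardwareConcurrency: number,  // in a browser: navigator.hardwareConcurrency
  cap = 4,                      // assumed upper bound, tune per app
): number {
  // Without cross-origin isolation there is no SharedArrayBuffer,
  // so only single-threaded WASM is available.
  if (!crossOriginIsolated) return 1;
  const cores = Number.isFinite(hardwareConcurrency)
    ? Math.floor(hardwareConcurrency)
    : 1;
  return Math.max(1, Math.min(cap, cores));
}

// Assumed usage with `import * as ort from "onnxruntime-web"`:
//   ort.env.wasm.numThreads =
//     pickWasmThreads(self.crossOriginIsolated, navigator.hardwareConcurrency);
```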

How can I load models larger than browser ArrayBuffer limits?

Export model weights as external data and pass externalData entries when creating the session so the runtime can fetch linked weight files instead of loading a single large buffer.
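A sketch of building the externalData entries follows. The helper name `externalDataEntries`, the base URL, and the file names are hypothetical. Each entry pairs a `path`, which is assumed to match the external-data location recorded inside the .onnx file, with a `data` URL the runtime can fetch.

```typescript
// Shape of one entry in the session's `externalData` option (URL form).
interface ExternalDataEntry {
  path: string; // must match the location stored in the model file
  data: string; // where to fetch the weight file from
}

// Hypothetical helper: map weight file names to externalData entries.
function externalDataEntries(
  baseUrl: string,
  fileNames: string[],
): ExternalDataEntry[] {
  const base = baseUrl.replace(/\/+$/, "") + "/"; // exactly one trailing slash
  return fileNames.map((name) => ({ path: name, data: base + name }));
}

// Assumed usage with `import * as ort from "onnxruntime-web"`;
// the URL and file names are placeholders:
//   const session = await ort.InferenceSession.create(
//     "https://example.com/models/model.onnx",
//     { externalData: externalDataEntries("https://example.com/models",
//         ["model.onnx_data"]) },
//   );
```

When the weights are split across several files, each file gets its own entry, which keeps every individual download below the single-buffer limit.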
