metal-kernel_skill

This skill helps you implement native Metal kernels for PyTorch operators on Apple Silicon, improving performance and extending operator coverage on the MPS device.

Language: Python
GitHub Stars: 97k
Bundled Files: 1
Catalog Refreshed: 2 months ago
First Indexed: 4 months ago

Readme & install

Installation

Preview and clipboard use veilstrat where the catalogue uses aiagentskills.

npx veilstrat add skill pytorch/pytorch --skill metal-kernel

SKILL.md (9.9 KB)

Overview

This skill teaches how to implement native Metal (MPS) kernels for PyTorch operators on Apple Silicon. It covers updating operator dispatch, writing Metal shader kernels, and adding host-side stubs so operators run on the MPS device with native performance. The guide emphasizes native Metal via c10/metal rather than MPSGraph for better control and maintainability.

How this skill works

You add MPS to the operator dispatch in native_functions.yaml so the operator uses a shared stub. Then you implement the Metal shader functor(s) in aten/src/ATen/native/mps/kernels/ using provided REGISTER_* macros and c10/metal utilities. Finally, you create a host-side stub in aten/src/ATen/native/mps/operations/ that invokes the compiled Metal kernel via exec_unary_kernel or exec_binary_kernel and register it with the dispatch system.
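
The three steps above can be pictured with a toy model. This is a minimal sketch in plain Python, not PyTorch or Metal code: KERNEL_LIBRARY, DISPATCH_TABLE, register_dispatch, call_op, and atan2_mps_stub are hypothetical names invented for illustration, and the Python exec_binary_kernel only mimics the role the text gives to the real helper (look up a compiled kernel by name and run it per element).

```python
# Toy model of the dispatch flow; every name below is hypothetical, not PyTorch API.
import math

# Stand-in for the compiled shader library: kernel name -> elementwise functor.
KERNEL_LIBRARY = {
    "atan2": lambda a, b: math.atan2(a, b),
}


def exec_binary_kernel(name, lhs, rhs):
    """Look up a kernel by name and apply it to each pair of elements."""
    functor = KERNEL_LIBRARY[name]
    if len(lhs) != len(rhs):
        raise ValueError("inputs must have the same number of elements")
    return [functor(a, b) for a, b in zip(lhs, rhs)]


# Stand-in for the shared stub: one entry per operator, one slot per device.
DISPATCH_TABLE = {}


def register_dispatch(op_name, device, fn):
    """Record which host function implements op_name on the given device."""
    DISPATCH_TABLE.setdefault(op_name, {})[device] = fn


def call_op(op_name, device, *args):
    """Route a call to the registered implementation, or fail if none exists."""
    try:
        impl = DISPATCH_TABLE[op_name][device]
    except KeyError:
        raise NotImplementedError(f"{op_name} has no {device} kernel") from None
    return impl(*args)


def atan2_mps_stub(lhs, rhs):
    """Host-side stub: forwards to the named kernel."""
    return exec_binary_kernel("atan2", lhs, rhs)


register_dispatch("atan2", "mps", atan2_mps_stub)

result = call_op("atan2", "mps", [1.0, 0.0], [1.0, 1.0])
```

In this model, an operator with no registration for a device fails at call time, which is the situation the dispatch entry and stub registration are meant to prevent.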

When to use it

Adding a new operator with native MPS support on Apple Silicon.
Porting an existing CUDA kernel to Metal for MPS device support.
Migrating operators from MPSGraph to native Metal for improved performance.
Implementing structured kernels using the TensorIterator pattern.
Supporting additional dtypes (float32, float16, bfloat16, complex via float2/half2).
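
The last item mentions handling complex values through two-component vector types. As a rough illustration of the idea, assuming a complex number is packed as a (real, imaginary) pair, the sketch below writes complex multiplication on such pairs in plain Python and compares it with Python's built-in complex type. The function name complex_mul is hypothetical; an actual shader would operate on float2 or half2 values instead of tuples.

```python
def complex_mul(a, b):
    """Multiply two complex numbers stored as (real, imag) pairs."""
    ar, ai = a
    br, bi = b
    # (ar + ai*i) * (br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i
    return (ar * br - ai * bi, ar * bi + ai * br)


# (1 + 2i) * (3 - i), computed on packed pairs and with the built-in type.
packed = complex_mul((1.0, 2.0), (3.0, -1.0))
reference = complex(1.0, 2.0) * complex(3.0, -1.0)
```

The point of the packed form is that a kernel written for a two-float vector can serve complex inputs without a dedicated complex type.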

Best practices

Prefer shared structured dispatch (CPU, CUDA, MPS) so MPS uses the same stub mechanism.
Register supported types explicitly with REGISTER_UNARY_OP / REGISTER_BINARY_OP macros.
Use c10/metal utilities (opmath_t, accum_t, precise math) for correct precision and stability.
Handle edge cases: empty tensors, non-contiguous layouts, and required dtype coverage.
Remove legacy MPSGraph implementations and xfail/skips in tests after migration.
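
To see why the list recommends a wider accumulation type, here is a small sketch in plain Python. It assumes nothing about the c10/metal implementation: to_half, sum_in_half, and sum_in_wider_type are hypothetical helpers, and Python's double-precision float stands in for the wider type. The struct format "e" rounds a value to IEEE 754 half precision, which lets the sketch imitate float16 arithmetic.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_in_half(values):
    """Accumulate in float16: the running total is rounded after every add."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total


def sum_in_wider_type(values):
    """Accumulate in a wider type and round to float16 once at the end."""
    total = 0.0
    for v in values:
        total += to_half(v)
    return to_half(total)


ones = [1.0] * 4096
naive = sum_in_half(ones)              # stalls once the total reaches 2048
accumulated = sum_in_wider_type(ones)  # reaches the exact sum, 4096
```

The float16 accumulator stalls because the gap between adjacent float16 values is 2 at 2048, so adding 1 rounds back to 2048. Accumulating in the wider type and converting once avoids the stall.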

Example use cases

Implementing a new unary activation (e.g., special exp) with precise::exp for floating types.
Writing a binary math kernel like atan2 that accepts floating-point inputs and also produces floating-point outputs from integer inputs.
Adding an alpha-scaled binary operation (a + alpha * b) using REGISTER_BINARY_ALPHA_OP.
Porting elementwise CUDA add/mul kernels to Metal with REGISTER_FLOAT_BINARY_OP and REGISTER_INTEGER_BINARY_OP.
Replacing an MPSGraph atan2 implementation with a native Metal kernel and shared dispatch.

FAQ

Which dtypes should a kernel support?

Support the required dtypes (float32, float16, bfloat16) and add integer or complex specializations as needed; use the provided registration macros to cover groups of types.

How do I connect the Metal kernel to PyTorch dispatch?

Add MPS to the operator dispatch in native_functions.yaml and register a host-side stub with REGISTER_DISPATCH that calls lib.exec_unary_kernel or lib.exec_binary_kernel with the kernel name.