FlashMLA Compatibility Wrapper (flash_mla)

flash_mla is a compatibility wrapper package that keeps the official flash_mla package name and import path but executes on MUSA through MATE's Multi-head Latent Attention (MLA) operators.

Overview

This wrapper is designed for projects that already target the FlashMLA Python API. It lets you run MLA dense decode, sparse decode, and sparse prefill workloads on MUSA with minimal integration changes.

The current compatibility scope includes FlashMLASchedMeta, get_mla_metadata, flash_mla_with_kvcache, and flash_mla_sparse_fwd.

Package and import

  • Package name: flash_mla

  • Import path: flash_mla

  • Runtime backend: MATE MLA operators on MUSA

Requirements

Before using this wrapper, make sure the following are available:

  • MATE is installed and importable.

  • TorchMUSA is installed and the MUSA runtime environment is configured.

  • The target workload is configured to run on MUSA devices.
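Before importing the wrapper, you can quickly check the first two requirements by probing whether the relevant modules can be found. This is a minimal sketch. The module names are assumptions: `mate` matches the mate.flashmla references in this document, and `torch_musa` is assumed to be TorchMUSA's import name. `missing_requirements` is a hypothetical helper. It only checks that the modules are discoverable. It does not verify that the MUSA runtime or devices are configured.

```python
import importlib.util

# Assumed top-level import names for the requirements listed above.
REQUIRED_MODULES = ("torch", "torch_musa", "mate")


def missing_requirements(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_requirements()
    if absent:
        print("Missing before flash_mla can run:", ", ".join(absent))
    else:
        print("torch, torch_musa and mate are all importable.")
```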

Build

Build a wheel from the wrappers/FlashMLA directory (this requires the PyPA build package, which can be installed with pip install build):

python -m build --wheel

The generated wheel will be placed under:

dist/

Installation

Install from source in editable mode (run from the wrappers/FlashMLA directory):

pip install --no-build-isolation -e .

Install a built wheel:

pip install dist/flash_mla-*.whl

Import

Import the package directly:

import flash_mla

Import individual APIs:

from flash_mla import (
    FlashMLASchedMeta,
    get_mla_metadata,
    flash_mla_with_kvcache,
    flash_mla_sparse_fwd,
)
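To confirm that an installed build exposes the compatibility scope listed in the Overview, a smoke test can look up each name on the module. This sketch is hypothetical: `missing_api` is not part of flash_mla. It only checks that the names exist, not that they behave like upstream FlashMLA.

```python
import types

# The names this document lists as the current compatibility scope.
EXPECTED_API = (
    "FlashMLASchedMeta",
    "get_mla_metadata",
    "flash_mla_with_kvcache",
    "flash_mla_sparse_fwd",
)


def missing_api(module: types.ModuleType, names=EXPECTED_API):
    """Return the expected public names that `module` does not expose, in order."""
    return [name for name in names if not hasattr(module, name)]
```

After installation, `missing_api(importlib.import_module("flash_mla"))` should return an empty list. Importing flash_mla itself still requires MATE and TorchMUSA.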

Behavior

  • get_mla_metadata(...) follows the current upstream FlashMLA Python interface and returns the tuple (FlashMLASchedMeta(), None). The returned FlashMLASchedMeta does not hold any scheduler tensors yet.

  • The real scheduler tensors are initialized lazily on the first flash_mla_with_kvcache(...) call and cached inside FlashMLASchedMeta.

  • A FlashMLASchedMeta instance can only be reused across calls that share the same decode configuration. When the configuration changes, obtain a new instance from get_mla_metadata(...).

  • flash_mla_with_kvcache(...) is the entry point for both dense and sparse decode. The wrapper validates the FlashMLASchedMeta, materializes the real scheduler with mate.flashmla.get_mla_metadata(...) if it has not been initialized yet, and then forwards the call to mate.flashmla.flash_mla_with_kvcache(...).

  • flash_mla_sparse_fwd(...) is the entry point for sparse MLA prefill.
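The following pure-Python sketch illustrates the lazy-initialization and reuse rules above. Everything in it is hypothetical. `SchedMetaSketch`, `get_metadata_sketch`, and `with_kvcache_sketch` only mimic the control flow described here. The decode configuration is modeled as an opaque tuple. The `build_scheduler` and `kernel` callables stand in for mate.flashmla.get_mla_metadata(...) and mate.flashmla.flash_mla_with_kvcache(...), whose real arguments are not reproduced. The wrapper's actual validation checks may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class SchedMetaSketch:
    """Hypothetical analogue of FlashMLASchedMeta: empty until first use."""
    config: Optional[Tuple[Any, ...]] = None  # decode configuration it was built for
    scheduler: Any = None                     # cached scheduler state
    build_count: int = 0                      # times the scheduler was materialized


def get_metadata_sketch() -> Tuple[SchedMetaSketch, None]:
    # Mirrors the documented return shape: (FlashMLASchedMeta(), None).
    return SchedMetaSketch(), None


def with_kvcache_sketch(
    q: Any,
    meta: SchedMetaSketch,
    config: Tuple[Any, ...],
    build_scheduler: Callable[[Tuple[Any, ...]], Any],
    kernel: Callable[[Any, Any], Any],
) -> Any:
    # 1. Validate that the metadata object has the expected type.
    if not isinstance(meta, SchedMetaSketch):
        raise TypeError("expected metadata returned by get_metadata_sketch()")
    # 2. On first use, build the scheduler once and cache it on the metadata.
    if meta.scheduler is None:
        meta.scheduler = build_scheduler(config)
        meta.config = config
        meta.build_count += 1
    # 3. Later calls may reuse it only with the same decode configuration.
    elif meta.config != config:
        raise ValueError(
            f"metadata built for {meta.config} cannot be reused for {config}"
        )
    # 4. Forward to the backend kernel together with the cached scheduler.
    return kernel(q, meta.scheduler)
```

In this sketch, a second call with the same tuple reuses the cached scheduler without rebuilding it. A call with a different tuple raises ValueError. To switch configurations, request fresh metadata from get_metadata_sketch().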

Notes

  • This wrapper keeps the official FlashMLA import surface, but execution is provided by MATE on MUSA.

  • For authoritative MLA operator behavior and semantics, refer to the mate.flashmla module.