MATE Design and Architecture

This document provides a technical overview of the MUSA AI Tensor Engine (MATE). It explains MATE’s layered architecture, its CUDA-ecosystem compatibility approach, and its optimization focus for running generative AI workloads efficiently on Moore Threads GPUs.

Architectural Layers

MATE is a layered runtime and compilation engine that sits between high-level inference frameworks and MUSA-native execution backends.

flowchart TB
    frameworks["1. Framework Integration Layer<br/>vLLM, SGLang"]
    compat["2. CUDA Ecosystem Compatibility Layer<br/>FlashAttention-3, SageAttention, FlashMLA, FlashKDA, DeepGEMM"]
    core["3. MATE Core<br/>mha_interface.py, flashmla.py, gemm.py, deep_gemm.py"]
    kernels["4. Native MUSA Kernels and Compiled Modules<br/>TileLang, JIT modules, AOT libraries, MUTLASS"]
    frameworks --> compat --> core --> kernels

1. Framework Integration Layer

MATE integrates with inference and serving frameworks through:

  • wrapper-first integration surfaces when a matching wrapper is available

  • direct mate APIs for advanced usage or when wrapper coverage does not match a workload

Design intent:

  • keep integration familiar for users coming from CUDA-oriented ecosystems

  • reduce the cost of bringing existing inference stacks onto MUSA hardware

  • provide a clear path to adopt MATE capabilities incrementally

2. CUDA Ecosystem Compatibility Layer

This layer provides compatibility for selected CUDA-oriented operator interfaces by mapping them onto MUSA-backed execution paths.

  • FlashAttention-3

  • SageAttention

  • FlashMLA

  • FlashKDA

  • DeepGEMM

Note

Compatibility is interface-oriented and scope-specific. Coverage can vary by operator family, tensor shape constraints, and version.

When a wrapper is not available or not applicable, users can integrate through direct mate APIs.
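The choice between a wrapper and the direct API can be pictured as a coverage check made per call. The sketch below is illustrative only: the `Coverage` type, the `WRAPPER_COVERAGE` table, and the head dimensions and dtypes in it are hypothetical and do not describe MATE's actual support matrix or API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Coverage:
    """Hypothetical description of what one wrapper supports."""
    head_dims: Tuple[int, ...]  # supported head dimensions (invented values)
    dtypes: Tuple[str, ...]     # supported dtypes (invented values)


# Illustrative coverage table, keyed by operator family.
# These entries are assumptions for the sketch, not MATE's real coverage.
WRAPPER_COVERAGE: Dict[str, Coverage] = {
    "attention": Coverage(head_dims=(64, 128), dtypes=("fp16", "bf16")),
}


def wrapper_covers(family: str, head_dim: int, dtype: str) -> bool:
    """Return True when the hypothetical wrapper covers this call."""
    cov = WRAPPER_COVERAGE.get(family)
    return cov is not None and head_dim in cov.head_dims and dtype in cov.dtypes


def choose_path(family: str, head_dim: int, dtype: str) -> str:
    """Pick the wrapper when it applies, otherwise fall back to the direct API."""
    return "wrapper" if wrapper_covers(family, head_dim, dtype) else "direct"
```

Under these assumed values, a call with an uncovered head dimension or an operator family that has no wrapper resolves to the direct path.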

3. MATE Core

MATE Core is responsible for:

  • public API entrypoints

  • operator selection and dispatch

  • scheduling and execution orchestration

  • tensor layout handling and data movement conventions

  • bridging wrapper calls to native MUSA kernel backends
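As a rough mental model of operator selection and dispatch, the following minimal sketch registers backends behind predicates and picks the first match. It is not MATE's implementation: the registry, the `register` and `dispatch` helpers, and the backend names are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple

# A predicate inspects call metadata (dtype, shapes, ...) and says whether
# a backend applies. Backend names are placeholders, not real kernel names.
Predicate = Callable[[dict], bool]
_REGISTRY: Dict[str, List[Tuple[Predicate, str]]] = {}


def register(op: str, predicate: Predicate, backend: str) -> None:
    """Add a candidate backend for an operator, in priority order."""
    _REGISTRY.setdefault(op, []).append((predicate, backend))


def dispatch(op: str, meta: dict) -> str:
    """Return the first registered backend whose predicate accepts the call."""
    for predicate, backend in _REGISTRY.get(op, []):
        if predicate(meta):
            return backend
    raise LookupError(f"no backend registered for {op!r} with {meta!r}")


# Specialized candidates are registered first, a generic one last.
register("gemm", lambda m: m.get("dtype") == "fp8", "gemm_fp8_backend")
register("gemm", lambda m: True, "gemm_generic_backend")
```

Registering specialized candidates ahead of a generic one keeps the selection rule simple: the first match wins, and an operator with no candidates fails loudly instead of silently choosing a wrong path.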

4. Native MUSA Kernels and Compiled Modules

This layer provides the hardware-executed implementation of MATE operators. MATE supports multiple backend building blocks and packaging forms, including:

  • TileLang kernels

  • JIT-managed compiled modules

  • AOT-packaged libraries

  • MUTLASS-backed build inputs for selected compiled paths

Model and Workload Coverage

MATE focuses optimization on these core generative AI operator families:

  • Attention and FMHA

  • GEMM

  • MLA

  • KDA

  • GDN

  • Hyperconnection

  • MoE-adjacent, routing-related paths

MATE also includes model-tuned paths for selected mainstream model families, depending on version and configuration, including:

  • DeepSeek V3, V3.1, and V3.2 MLA-related paths

  • DeepSeek V4 hyperconnection and routing-related paths

  • Qwen-related GDN configurations

For detailed constraints and supported configurations, see the relevant support matrices and compatibility pages. For GDN, see the MATE Gated Delta Network (GDN) Support Matrix.

In workload terms, MATE's support is clearest for:

  • Text generation paths built around prefill and decode

  • Inference-serving integrations through wrapper-first package surfaces

  • Long-context and KV-cache execution paths

Design Choices and Value

MATE balances migration speed and performance through these design choices:

  • Wrapper-first migration path. Reuse familiar CUDA-oriented operator surfaces where supported.

  • Direct API fallback. Integrate through MATE APIs when wrapper coverage is insufficient.

  • Hybrid compilation approach. Support both AOT packaging and runtime JIT compilation.

  • Native MUSA optimization focus. Prioritize performance-critical operator families such as attention, GEMM, MLA, and related components.

  • Operational readiness. Provide built-in diagnostics and reproducibility mechanisms, including logging, environment inspection, and dump-and-replay workflows for faster issue triage.

Execution Modes (AOT and JIT)

MATE supports multiple deployment and execution modes:

  • AOT (Ahead-of-Time). Selected operator variants can be built and packaged before runtime for more predictable deployment behavior.

  • JIT (Just-in-Time). Missing variants can be compiled on demand at runtime to improve coverage and flexibility.

  • Preference and fallback behavior. When a matching AOT artifact is available, MATE can prefer it over compiling at runtime. JIT behavior can be controlled through configuration and environment settings.
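The AOT-first, JIT-fallback behavior described above can be sketched as a small resolution function. This is a simplified model under stated assumptions: the environment variable `EXAMPLE_DISABLE_JIT`, the variant keys, and the `jit_compile` callback are all hypothetical and are not MATE's real settings or API.

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical environment variable name, invented for this sketch.
JIT_ENV_VAR = "EXAMPLE_DISABLE_JIT"


def resolve_variant(
    key: str,
    aot_artifacts: Mapping[str, str],
    jit_compile: Callable[[str], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Prefer a prebuilt AOT artifact; otherwise compile on demand if allowed."""
    env = os.environ if env is None else env
    # 1. A matching AOT artifact wins when one exists.
    if key in aot_artifacts:
        return aot_artifacts[key]
    # 2. JIT can be switched off through the (hypothetical) environment setting.
    if env.get(JIT_ENV_VAR, "0") == "1":
        raise RuntimeError(f"no AOT artifact for {key!r} and JIT is disabled")
    # 3. Otherwise the missing variant is compiled at runtime.
    return jit_compile(key)
```

Passing `env` explicitly keeps the function easy to test; in a deployment that wants predictable behavior, disabling JIT turns a missing variant into an immediate error rather than a runtime compile.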

Operational System Tools

When an operator crash, environment mismatch, or output issue needs deeper investigation, MATE provides these built-in tools:

  • mate check for runtime validation

  • mate show-config for versions, devices, architecture resolution, and JIT/AOT status

  • mate env for current MATE-related environment variables

  • API logging for call metadata, structured logs, and Level 10 dumps

  • mate replay for deterministic replay of captured dumps
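A first triage pass could collect the output of the read-only tools in one step. The sketch below assumes that the commands listed above are available as a `mate` executable on `PATH` and that `check`, `show-config`, and `env` can be run without arguments; both points are assumptions. `mate replay` is left out because its arguments are not described here.

```python
import shutil
import subprocess
from typing import List, Optional

# Subcommand names are taken from the list above; no flags are assumed.
TRIAGE_SUBCOMMANDS = ("check", "show-config", "env")


def build_triage_commands(executable: str = "mate") -> List[List[str]]:
    """Build one argument list per triage subcommand."""
    return [[executable, sub] for sub in TRIAGE_SUBCOMMANDS]


def run_triage(executable: str = "mate") -> Optional[List[str]]:
    """Run each command and collect its output; return None if the CLI is absent."""
    if shutil.which(executable) is None:
        return None
    outputs = []
    for cmd in build_triage_commands(executable):
        # check=False: a failing check is itself useful triage information.
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        outputs.append(result.stdout + result.stderr)
    return outputs
```

Attaching the collected output to an issue report gives the reader the validation result, configuration, and environment together.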

See Also