MATE Design and Architecture

This document provides a technical overview of the MUSA AI Tensor Engine (MATE). It explains MATE’s layered architecture, its CUDA-ecosystem compatibility approach, and its optimization focus for running generative AI workloads efficiently on Moore Threads GPUs.

Architectural Layers

MATE is a layered runtime and compilation engine that sits between high-level inference frameworks and MUSA-native execution backends.

flowchart TB
    frameworks["1. Framework Integration Layer
vLLM, SGLang"]
    compat["2. CUDA Ecosystem Compatibility Layer
FlashAttention-3, SageAttention, FlashMLA, FlashKDA, DeepGEMM"]
    core["3. MATE Core
mha_interface.py, flashmla.py, gemm.py, deep_gemm.py"]
    kernels["4. Native MUSA Kernels and Compiled Modules
TileLang, JIT modules, AOT libraries, MUTLASS"]

    frameworks --> compat --> core --> kernels

1. Framework Integration Layer

MATE integrates with inference and serving frameworks through:

wrapper-first integration surfaces when a matching wrapper is available
direct mate APIs for advanced usage or when wrapper coverage does not match a workload

Design intent:

keep integration familiar for users coming from CUDA-oriented ecosystems
reduce the cost of bringing existing inference stacks onto MUSA hardware
provide a clear path to adopt MATE capabilities incrementally

2. CUDA Ecosystem Compatibility Layer

This layer provides compatibility for selected CUDA-oriented operator interfaces by mapping them onto MUSA-backed execution paths.

FlashAttention-3
SageAttention
FlashMLA
FlashKDA
DeepGEMM

Note

Compatibility is interface-oriented and scope-specific. Coverage can vary by operator family, tensor shape constraints, and version.

When a wrapper is not available or not applicable, users can integrate through direct mate APIs.
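The wrapper-first choice with a direct-API fallback can be sketched as follows. This is a minimal illustration under stated assumptions, not MATE's integration code: `wrapper_attention`, `direct_attention`, and `resolve_attention` are hypothetical names, and the shape check stands in for whatever coverage constraints a given wrapper actually has.

```python
from typing import Callable, Optional


# Hypothetical stand-ins. Real code would import a compatibility wrapper
# and a direct mate API here; neither function below is part of MATE.
def wrapper_attention(q, k, v):
    return ("wrapper", len(q))


def direct_attention(q, k, v):
    return ("direct", len(q))


def resolve_attention(
    wrapper: Optional[Callable],
    direct: Callable,
    wrapper_supports: Callable[[int], bool],
    head_dim: int,
) -> Callable:
    """Prefer the wrapper when it exists and covers this shape.

    Otherwise fall back to the direct API entrypoint.
    """
    if wrapper is not None and wrapper_supports(head_dim):
        return wrapper
    return direct


# Assumed coverage rule, for illustration only.
def supports(head_dim: int) -> bool:
    return head_dim in (64, 128)


attention = resolve_attention(wrapper_attention, direct_attention, supports, 128)
print(attention([0.0] * 4, [0.0] * 4, [0.0] * 4))
```

Resolving the entrypoint once, at integration time, keeps the per-call path free of coverage checks.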

3. MATE Core

MATE Core is responsible for:

public API entrypoints
operator selection and dispatch
scheduling and execution orchestration
tensor layout handling and data movement conventions
bridging wrapper calls to native MUSA kernel backends
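Operator selection and dispatch of the kind listed above is often implemented as a registry keyed by operator properties. The sketch below shows that general technique only. The key shape `(operator, dtype)`, the `register` decorator, and the backend functions are all assumptions made for illustration and do not describe MATE's internal scheme.

```python
from typing import Callable, Dict, Tuple

# Key: (operator name, dtype name). Illustrative assumption only.
DispatchKey = Tuple[str, str]
_REGISTRY: Dict[DispatchKey, Callable[..., str]] = {}


def register(op: str, dtype: str):
    """Decorator that records a backend implementation for (op, dtype)."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        _REGISTRY[(op, dtype)] = fn
        return fn
    return decorator


@register("gemm", "bf16")
def gemm_bf16(m: int, n: int, k: int) -> str:
    # A real backend would launch a kernel; this one only reports the choice.
    return f"gemm_bf16 {m}x{n}x{k}"


@register("gemm", "fp8")
def gemm_fp8(m: int, n: int, k: int) -> str:
    return f"gemm_fp8 {m}x{n}x{k}"


def dispatch(op: str, dtype: str, *args: int) -> str:
    """Look up the backend registered for (op, dtype) and call it."""
    try:
        backend = _REGISTRY[(op, dtype)]
    except KeyError:
        raise NotImplementedError(f"no backend registered for {op}/{dtype}") from None
    return backend(*args)


print(dispatch("gemm", "bf16", 4, 8, 16))
```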

4. Native MUSA Kernels and Compiled Modules

This layer provides the hardware-executed implementation of MATE operators. MATE supports multiple backend building blocks and packaging forms, including:

TileLang kernels
JIT-managed compiled modules
AOT-packaged libraries
MUTLASS-backed build inputs for selected compiled paths

Model and Workload Coverage

MATE focuses optimization on these core generative AI operator families:

Attention and FMHA
GEMM
MLA
KDA
GDN
Hyperconnection
MoE-adjacent routing paths

MATE also includes model-tuned paths for selected mainstream model families, depending on version and configuration, including:

DeepSeek V3, V3.1, and V3.2 MLA-related paths
DeepSeek V4 hyperconnection and routing-related paths
Qwen-related GDN configurations

For detailed constraints and supported configurations, see the relevant support matrices and compatibility pages. For GDN, see MATE Gated Delta Network (GDN) Support Matrix.

In workload terms, MATE's strongest support is for:

Text generation paths built around prefill and decode
Inference-serving integrations through wrapper-first package surfaces
Long-context and KV-cache execution paths

Design Choices and Value

MATE balances migration speed and performance through these design choices:

Wrapper-first migration path. Reuse familiar CUDA-oriented operator surfaces where supported.
Direct API fallback. Integrate through MATE APIs when wrapper coverage is insufficient.
Hybrid compilation approach. Support both AOT packaging and runtime JIT compilation.
Native MUSA optimization focus. Prioritize performance-critical operator families such as attention, GEMM, MLA, and related components.
Operational readiness. Provide built-in diagnostics and reproducibility mechanisms, including logging, environment inspection, and dump-and-replay workflows for faster issue triage.

Execution Modes (AOT and JIT)

MATE supports multiple deployment and execution modes:

AOT (Ahead-of-Time). Selected operator variants can be built and packaged before runtime for more predictable deployment behavior.
JIT (Just-in-Time). Missing variants can be compiled on demand at runtime to improve coverage and flexibility.
Preference and fallback behavior. When a matching AOT artifact is available, MATE can prefer it over JIT compilation. JIT behavior can be controlled through configuration and environment settings.
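The AOT-first lookup with an on-demand JIT fallback can be sketched as below. This is a simplified model rather than MATE's implementation. `VariantResolver`, the variant names, and the `allow_jit` flag are hypothetical; the flag stands in for whatever configuration or environment setting controls JIT in a real deployment.

```python
from typing import Callable, Dict, Tuple


class VariantResolver:
    """Resolve an operator variant: prebuilt (AOT) first, then JIT."""

    def __init__(
        self,
        aot_variants: Dict[str, Callable],
        jit_compile: Callable[[str], Callable],
        allow_jit: bool = True,
    ) -> None:
        self._aot = dict(aot_variants)      # variants packaged before runtime
        self._jit_compile = jit_compile     # compiles a missing variant on demand
        self._allow_jit = allow_jit         # hypothetical stand-in for a config switch
        self._jit_cache: Dict[str, Callable] = {}

    def resolve(self, variant: str) -> Tuple[str, Callable]:
        # 1. Prefer a matching AOT artifact when one exists.
        if variant in self._aot:
            return "aot", self._aot[variant]
        # 2. Reuse a variant that was already compiled in this process.
        if variant in self._jit_cache:
            return "jit-cached", self._jit_cache[variant]
        # 3. Compile on demand, unless JIT has been disabled.
        if not self._allow_jit:
            raise LookupError(f"variant {variant!r} has no AOT build and JIT is disabled")
        kernel = self._jit_compile(variant)
        self._jit_cache[variant] = kernel
        return "jit", kernel


compiled = []


def fake_jit(name: str) -> Callable:
    # Placeholder for a real compilation step; records what was requested.
    compiled.append(name)
    return lambda: name


resolver = VariantResolver({"attn_hd128": lambda: "prebuilt"}, fake_jit)
for name in ("attn_hd128", "attn_hd64", "attn_hd64"):
    source, _ = resolver.resolve(name)
    print(name, source)
```

Disabling JIT in this model turns a missing variant into an immediate error, which is the more predictable behavior the AOT mode is described as offering.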

Operational System Tools

When an operator crash, environment mismatch, or output issue needs deeper investigation, MATE provides these built-in tools:

mate check for runtime validation
mate show-config for versions, devices, architecture resolution, and JIT/AOT status
mate env for current MATE-related environment variables
API logging for call metadata, structured logs, and Level 10 dumps
mate replay for deterministic replay of captured dumps
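For issue triage, the output of the read-only tools above can be gathered in one pass. The sketch below assumes that `mate check`, `mate show-config`, and `mate env` are command-line subcommands of a `mate` executable on `PATH` and that they run without extra flags. The `collect_diagnostics` helper is hypothetical and not part of MATE. `mate replay` is left out because its arguments are not described here.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

# Subcommands named in this document; no additional flags are assumed.
DIAGNOSTIC_COMMANDS: List[List[str]] = [
    ["mate", "check"],
    ["mate", "show-config"],
    ["mate", "env"],
]


def collect_diagnostics(
    commands: List[List[str]] = DIAGNOSTIC_COMMANDS,
    timeout_s: float = 60.0,
) -> Dict[str, Optional[str]]:
    """Run each command and return its combined output.

    A value of None means the command could not be run
    (executable missing, OS error, or timeout).
    """
    report: Dict[str, Optional[str]] = {}
    for cmd in commands:
        label = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            report[label] = None  # executable not found on PATH
            continue
        try:
            done = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s, check=False
            )
            report[label] = done.stdout + done.stderr
        except (OSError, subprocess.TimeoutExpired):
            report[label] = None
    return report


if __name__ == "__main__":
    for label, output in collect_diagnostics().items():
        print(f"== {label} ==")
        print(output if output is not None else "(not available)")
```

Attaching the collected report to a bug ticket gives the same version, device, and environment information each time.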
