MATE Design and Architecture¶
This document provides a technical overview of the MUSA AI Tensor Engine (MATE). It explains MATE’s layered architecture, its CUDA-ecosystem compatibility approach, and its optimization focus for running generative AI workloads efficiently on Moore Threads GPUs.
Architectural Layers¶
MATE is a layered runtime and compilation engine that sits between high-level inference frameworks and MUSA-native execution backends.
flowchart TB
    frameworks["1. Framework Integration Layer<br/>vLLM, SGLang"]
    compat["2. CUDA Ecosystem Compatibility Layer<br/>FlashAttention-3, SageAttention, FlashMLA, FlashKDA, DeepGEMM"]
    core["3. MATE Core<br/>mha_interface.py, flashmla.py, gemm.py, deep_gemm.py"]
    kernels["4. Native MUSA Kernels and Compiled Modules<br/>TileLang, JIT modules, AOT libraries, MUTLASS"]
    frameworks --> compat --> core --> kernels
1. Framework Integration Layer¶
MATE integrates with inference and serving frameworks through:
wrapper-first integration surfaces when a matching wrapper is available
direct mate APIs for advanced usage or when wrapper coverage does not match a workload
Design intent:
keep integration familiar for users coming from CUDA-oriented ecosystems
reduce the cost of bringing existing inference stacks onto MUSA hardware
provide a clear path to adopt MATE capabilities incrementally
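The wrapper-first, direct-API-second order described above can be sketched as a simple import probe. This is a minimal illustration and not MATE's actual loading logic: the function name pick_integration_surface and the module names passed to it are hypothetical placeholders.

```python
import importlib.util
from typing import Optional, Sequence


def pick_integration_surface(candidates: Sequence[str]) -> Optional[str]:
    """Return the first importable top-level module name, or None.

    `candidates` is in preference order: wrapper modules first,
    the direct API module last.
    """
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Hypothetical names: a wrapper module that is not installed, followed by a
# stand-in for the direct API module ("json" is used only so this runs anywhere).
surface = pick_integration_surface(["some_missing_wrapper_module_xyz", "json"])
print(surface)
```

Probing with find_spec avoids importing a module just to learn whether it exists, so the check stays cheap and has no side effects from module initialization.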
2. CUDA Ecosystem Compatibility Layer¶
This layer provides compatibility for selected CUDA-oriented operator interfaces by mapping them onto MUSA-backed execution paths.
FlashAttention-3
SageAttention
FlashMLA
FlashKDA
DeepGEMM
Note
Compatibility is interface-oriented and scope-specific. Coverage can vary by operator family, tensor shape constraints, and version.
When a wrapper is not available or not applicable, users can integrate through direct mate APIs.
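Because coverage depends on operator family, tensor shape constraints, and version, a caller may want to decide per request whether the wrapper path applies. The sketch below shows one way to express that decision. Every name and value in it (WrapperCoverage, choose_path, the head dimensions, the version tuple) is hypothetical and does not describe MATE's real support matrix.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WrapperCoverage:
    """Hypothetical description of what one wrapper supports."""
    operator_family: str
    supported_head_dims: Tuple[int, ...]
    min_version: Tuple[int, int]


def choose_path(
    coverage: Optional[WrapperCoverage],
    operator_family: str,
    head_dim: int,
    version: Tuple[int, int],
) -> str:
    """Return "wrapper" when the request is covered, otherwise "direct"."""
    covered = (
        coverage is not None
        and coverage.operator_family == operator_family
        and head_dim in coverage.supported_head_dims
        and version >= coverage.min_version  # tuples compare element-wise
    )
    return "wrapper" if covered else "direct"


# Made-up coverage entry, for illustration only.
attention = WrapperCoverage("attention", (64, 128), (1, 0))
print(choose_path(attention, "attention", 128, (1, 2)))
print(choose_path(attention, "attention", 96, (1, 2)))
```

Keeping the check as plain data makes it straightforward to update when a new release widens or narrows coverage.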
3. MATE Core¶
MATE Core is responsible for:
public API entrypoints
operator selection and dispatch
scheduling and execution orchestration
tensor layout handling and data movement conventions
bridging wrapper calls to native MUSA kernel backends
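Operator selection and dispatch, as listed above, is commonly implemented as a lookup from request attributes to a kernel entrypoint. The following is a generic sketch of that pattern under stated assumptions. The registry, the decorator, the dispatch keys (operator family and dtype), and the stub kernels are all hypothetical, and MATE's own dispatch logic may be organized quite differently.

```python
from typing import Callable, Dict, Tuple

# Registry keyed by (operator family, dtype). Both keys are illustrative.
_REGISTRY: Dict[Tuple[str, str], Callable[..., str]] = {}


def register(op_family: str, dtype: str):
    """Decorator that records a kernel entrypoint under its dispatch key."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        _REGISTRY[(op_family, dtype)] = fn
        return fn
    return decorator


@register("gemm", "fp16")
def _gemm_fp16(m: int, n: int, k: int) -> str:
    # Stub standing in for a native kernel launch.
    return f"gemm/fp16 {m}x{n}x{k}"


@register("gemm", "fp8")
def _gemm_fp8(m: int, n: int, k: int) -> str:
    return f"gemm/fp8 {m}x{n}x{k}"


def dispatch(op_family: str, dtype: str, *args: int) -> str:
    """Look up the kernel for (op_family, dtype) and call it."""
    kernel = _REGISTRY.get((op_family, dtype))
    if kernel is None:
        raise LookupError(f"no kernel registered for {op_family!r} with dtype {dtype!r}")
    return kernel(*args)


print(dispatch("gemm", "fp16", 4, 8, 16))
```

A table-driven design like this keeps the public entrypoint stable while kernels are added or replaced behind it.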
4. Native MUSA Kernels and Compiled Modules¶
This layer provides the hardware-executed implementation of MATE operators. MATE supports multiple backend building blocks and packaging forms, including:
TileLang kernels
JIT-managed compiled modules
AOT-packaged libraries
MUTLASS-backed build inputs for selected compiled paths
Model and Workload Coverage¶
MATE focuses optimization on these core generative AI operator families:
Attention and FMHA
GEMM
MLA
KDA
GDN
Hyperconnection
MoE-adjacent routing-related paths
MATE also includes model-tuned paths for selected mainstream model families, depending on version and configuration, including:
DeepSeek V3, V3.1, and V3.2 MLA-related paths
DeepSeek V4 hyperconnection and routing-related paths
Qwen-related GDN configurations
For detailed constraints and supported configurations, see the relevant support matrices and compatibility pages. For GDN, see MATE Gated Delta Network (GDN) Support Matrix.
In workload terms, MATE's clearest support is for:
Text generation paths built around prefill and decode
Inference-serving integrations through wrapper-first package surfaces
Long-context and KV-cache execution paths
Design Choices and Value¶
MATE balances migration speed and performance through these design choices:
Wrapper-first migration path. Reuse familiar CUDA-oriented operator surfaces where supported.
Direct API fallback. Integrate through MATE APIs when wrapper coverage is insufficient.
Hybrid compilation approach. Support both AOT packaging and runtime JIT compilation.
Native MUSA optimization focus. Prioritize performance-critical operator families such as attention, GEMM, MLA, and related components.
Operational readiness. Provide built-in diagnostics and reproducibility mechanisms, including logging, environment inspection, and dump-and-replay workflows for faster issue triage.
Execution Modes (AOT and JIT)¶
MATE supports multiple deployment and execution modes:
AOT (Ahead-of-Time). Selected operator variants can be built and packaged before runtime for more predictable deployment behavior.
JIT (Just-in-Time). Missing variants can be compiled on demand at runtime to improve coverage and flexibility.
Preference and fallback behavior. MATE can prefer an AOT variant when a matching artifact is available. JIT behavior can be controlled through configuration and environment settings.
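The AOT-preferred, JIT-on-demand behavior described above can be illustrated with a small resolver. This is a conceptual sketch only: the class VariantResolver, its jit_enabled switch, and the placeholder "compile" step are assumptions made for the example and do not reflect MATE's configuration names or internals.

```python
from typing import Callable, Dict

Kernel = Callable[[int], int]


class VariantResolver:
    """Prefer prebuilt (AOT) variants; compile missing ones on demand (JIT)."""

    def __init__(self, aot_variants: Dict[str, Kernel], jit_enabled: bool = True):
        self._aot = dict(aot_variants)
        self._jit_cache: Dict[str, Kernel] = {}
        self._jit_enabled = jit_enabled
        self.compile_count = 0  # how many JIT compilations have run

    def _compile(self, key: str) -> Kernel:
        # Stand-in for an expensive runtime compilation step.
        self.compile_count += 1
        return lambda x: x * 2

    def resolve(self, key: str) -> Kernel:
        if key in self._aot:  # 1. a matching AOT artifact wins
            return self._aot[key]
        if not self._jit_enabled:  # 2. JIT can be switched off
            raise LookupError(f"no AOT variant for {key!r} and JIT is disabled")
        if key not in self._jit_cache:  # 3. compile once, then reuse
            self._jit_cache[key] = self._compile(key)
        return self._jit_cache[key]


resolver = VariantResolver({"variant_a": lambda x: x + 1})
print(resolver.resolve("variant_a")(1))  # served from the AOT table
print(resolver.resolve("variant_b")(3))  # compiled on first use
```

Disabling JIT in a sketch like this turns a missing variant into an immediate error, which mirrors the more predictable deployment behavior that AOT packaging is meant to provide.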
Operational System Tools¶
When an operator crash, environment mismatch, or output issue needs deeper investigation, MATE provides these built-in tools:
mate check for runtime validation
mate show-config for versions, devices, architecture resolution, and JIT/AOT status
mate env for current MATE-related environment variables
API logging for call metadata, structured logs, and Level 10 dumps
mate replay for deterministic replay of captured dumps
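For issue reports, it can help to capture the output of the inspection commands in one step. The sketch below runs mate check, mate show-config, and mate env without arguments and collects whatever they print. It assumes only that these subcommands exist as listed above. Their output format and exit codes are not assumed, and the helper name collect_diagnostics is hypothetical. mate replay is left out because its arguments are not described here.

```python
import shutil
import subprocess
from typing import Dict

# Subcommands taken from the tool list above; no flags are assumed.
SUBCOMMANDS = ("check", "show-config", "env")


def collect_diagnostics(executable: str = "mate", timeout: float = 60.0) -> Dict[str, str]:
    """Run each inspection subcommand and return its combined output.

    Returns an empty dict when the executable is not on PATH.
    """
    path = shutil.which(executable)
    if path is None:
        return {}

    results: Dict[str, str] = {}
    for sub in SUBCOMMANDS:
        proc = subprocess.run(
            [path, sub],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=False,  # keep output even when a command reports failure
        )
        results[sub] = proc.stdout + proc.stderr
    return results
```

Calling collect_diagnostics() on a machine with MATE installed returns a dictionary keyed by subcommand name, which can be written to a file and attached to a bug report alongside any captured dumps.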
See Also¶
Overview for onboarding and getting started
Wrappers for the wrapper-first integration path
Python APIs for direct integration entrypoints
Diagnostic Overview for configuration inspection and troubleshooting
Logging for debugging workflows