Analyzing Latency Hiding and Parallelism in an MLIR-based AI Kernel Compiler
arXiv:2602.20204v1 (new)
Abstract: AI kernel compilation for edge devices depends on the compiler's ability to exploit parallelism and hide memory latency in the presence of hierarchical memory and explicit data movement. This paper reports a benchmark methodology and corresponding results for three compiler-controlled mechanisms in an MLIR-based compilation pipeline: vectorization (Vec), multi-threading (MT) across hardware contexts, and double buffering (DB) using ping–pong scratchpad buffers to overlap DMA transfers with compute. Using Triton/Inductor-generated kernels, we present an […]