Source: martinalderson.com

A Brief History of KV Cache Compression

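This article traces the evolution of KV cache compression from MQA and GQA to MLA and hybrid linear-attention models, showing how these seemingly low-key innovations quietly unlocked long context windows and thereby laid the groundwork for modern agentic large language models (agentic LLMs).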
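
To make the memory pressure behind this progression concrete, here is a minimal sketch of the standard KV cache size arithmetic for the first steps in that lineage. The model shape (32 layers, 32 query heads, head dimension 128, 16-bit values, 128K-token context) is a hypothetical example chosen for round numbers, not a figure from the article, and the function name `kv_cache_bytes` is illustrative. MLA and linear-attention hybrids store a different kind of state and are not covered by this formula.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence.

    Each token stores one key and one value vector (the factor 2)
    per layer and per KV head, each of length head_dim.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model shape, assumed for illustration only.
N_LAYERS = 32
N_QUERY_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 128 * 1024  # 128K tokens

# Number of KV heads under each scheme:
#   MHA: one KV head per query head
#   GQA: query heads share KV heads in groups (here, groups of 4)
#   MQA: all query heads share a single KV head
variants = {
    "MHA": N_QUERY_HEADS,
    "GQA": N_QUERY_HEADS // 4,
    "MQA": 1,
}

sizes = {
    name: kv_cache_bytes(SEQ_LEN, N_LAYERS, n_kv, HEAD_DIM)
    for name, n_kv in variants.items()
}

for name, size in sizes.items():
    print(f"{name}: {size / 2**30:.0f} GiB")
```

Under these assumed dimensions the cache shrinks in direct proportion to the number of KV heads, which is why reducing KV heads was an early lever for fitting longer contexts into GPU memory.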

Related coverage

Luce KVFlash: 256K context with 72MiB of KV cache on the GPU
7.5
Luce KVFlash is a memory-efficient optimization enabling 256K context windows using only 72 MiB of KV cache on the GPU. It reduces memory consumption for long-sequence inference by compressing key-value cache storage.
SubQ 1.1 Card: Linear-scaling sparse attention with 98% retrieval at 12M tokens [pdf]
5.0
SubQ 1.1 introduces a linear-scaling sparse attention mechanism that maintains 98% retrieval accuracy at 12 million tokens, significantly extending context length efficiency for large language models while reducing computational overhead compared to full attention methods.
GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz
4.0
A new Transformer implementation called GateGPT achieves 56,000 tokens per second using KV cache on an FPGA running at 80 MHz.
Subquadratic – Introducing SubQ 1.1 Small
2.0
Subquadratic released SubQ 1.1 Small, a 1.5B open-weight language model using a soft-moe-2x8 architecture. It outperforms larger models like Gemma 2 2.6B and Phi-2 2.8B on several benchmarks. The model uses subquadratic soft-MoE layers (MMA and MMAM) for improved efficiency.