Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

Yuzhen Mao, Michael Y. Li, Emily B. Fox · arXiv cs.LG (arXiv:2604.20920v1)

Abstract: Scaling large language models to long contexts is challenging due to the quadratic computational cost of full attention. Mitigation approaches include KV-cache selection and compression techniques. We instead provide an effective, end-to-end learnable bridge between the two without requiring architectural modification. In particular, our key insight …
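The abstract is truncated in the feed, so the paper's actual mechanism is not visible here. As a rough illustration of the general idea it gestures at, combining a compressed (gist-style) KV cache with selective recall of raw tokens, here is a minimal sketch. Everything in it is an assumption: the function names (`gist_compress`, `attend_with_recall`), the mean-pooling compressor, and the top-k recall rule are illustrative stand-ins, not the authors' method.

```python
# Hypothetical sketch: KV-cache compression plus selective recall.
# NOT the paper's method (the abstract above is truncated); all names
# and design choices here are illustrative assumptions.
import torch
import torch.nn.functional as F


def gist_compress(keys, values, num_gists):
    """Compress a long KV cache into at most `num_gists` summary (gist)
    pairs by mean-pooling contiguous chunks. A learnable scheme would
    replace this pooling with trained gist tokens."""
    T, _ = keys.shape
    chunk = (T + num_gists - 1) // num_gists
    gist_k = torch.stack([keys[i:i + chunk].mean(0) for i in range(0, T, chunk)])
    gist_v = torch.stack([values[i:i + chunk].mean(0) for i in range(0, T, chunk)])
    return gist_k, gist_v


def attend_with_recall(query, keys, values, num_gists=8, recall_k=16):
    """Attend over compressed gists plus a small set of 'recalled' raw
    tokens selected by query-key similarity, instead of the full cache."""
    # Select the raw positions most similar to the query (the "recall").
    scores = keys @ query                                   # (T,)
    top = scores.topk(min(recall_k, keys.shape[0])).indices
    gist_k, gist_v = gist_compress(keys, values, num_gists)
    # Sparse attention over gists + recalled tokens only: O(gists + k),
    # not O(T).
    k = torch.cat([gist_k, keys[top]])
    v = torch.cat([gist_v, values[top]])
    w = F.softmax(k @ query / k.shape[-1] ** 0.5, dim=0)
    return w @ v


# Toy usage: a 1024-token cache attended via ~24 effective entries.
T, d = 1024, 64
q = torch.randn(d)
K, V = torch.randn(T, d), torch.randn(T, d)
out = attend_with_recall(q, K, V)
print(out.shape)  # torch.Size([64])
```

The point of the sketch is only the shape of the trade-off: compression bounds memory and compute, while selective recall restores access to a few exact tokens that compression would otherwise blur away.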