Papers
arxiv:2601.06463

Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

Published on Jan 10
· Submitted by
Xuezhe Ma
on Jan 13
Authors:
,
,
,
,
,
,
,
,
,
,
,
,
,

Abstract

Gecko is a neural architecture that improves long-range dependency capture through exponential moving average with gated attention and additional components like timestep decay normalization and sliding chunk attention, achieving efficient processing of arbitrary-length sequential data with superior long-context scalability compared to existing models.

AI-generated summary

Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in Transformer, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long range dependencies, including timestep decay normalization, sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon in the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4times longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm

Community

Paper submitter
edited 1 day ago

train_loss

Screenshot 2026-01-13 at 3.50.58 AM

Screenshot 2026-01-13 at 3.51.18 AM

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.06463 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.06463 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.06463 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.