GLM-4-32B-Base-32K
GLM-4-32B-Base-32K is an enhanced version of THUDM's GLM-4-32B-Base-0414, specifically engineered to offer robust performance over an extended context window. While the original model's capabilities degraded after 8,192 tokens, this version maintains strong performance up to a 32,000-token context, making it ideal for tasks requiring long-context understanding and processing.
This model was developed as a proof-of-concept to validate that a merging-centric approach to context extension can be applied successfully to larger-scale models. The techniques employed yielded an overall improvement of approximately 5% on standard base-model benchmarks while significantly improving recall at 32K context.
More details can be found in our blog post, where we applied this work to our upcoming AFM 4.5B.
Model Details
- Base Model: THUDM/GLM-4-32B-Base-0414
- Parameter Count: 32B
- License: MIT
Improvements
The primary improvement in this model is its enhanced long-context capability. The following methods were used to achieve this:
- Targeted Long-Context Training: The model underwent continued pretraining on sequences up to its full 32,000 token context length.
- Iterative Merging: Various model checkpoints were iteratively merged to combine the benefits of different training runs, enhancing both long-context and short-context performance (see the sketch after this list).
- Short-Context Distillation: Knowledge from the original high-performing short-context model was distilled into the long-context-trained model to recover and retain its initial capabilities on shorter tasks.
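The exact merge recipe and checkpoint lineup are not published here, but the core operation behind this kind of checkpoint merging is weight-space interpolation. Below is a minimal, illustrative sketch of a linear merge between a long-context-trained checkpoint and the original short-context base; the paths, the blend ratio `ALPHA`, and the use of a single uniform ratio are assumptions for illustration only (tools such as mergekit automate and generalize this).

```python
# Illustrative sketch only: linear weight-space interpolation between two
# checkpoints of the same architecture. Paths and the blend ratio are
# hypothetical; the actual merge recipe for this model is not reproduced here.
# Note: loading two 32B models this way requires substantial memory.
import torch
from transformers import AutoModelForCausalLM

LONG_CTX_CKPT = "path/to/long-context-checkpoint"        # assumed local checkpoint
SHORT_CTX_CKPT = "path/to/original-short-context-base"   # assumed local checkpoint
ALPHA = 0.5  # fraction of the long-context weights to keep (hypothetical)

long_model = AutoModelForCausalLM.from_pretrained(LONG_CTX_CKPT, torch_dtype=torch.bfloat16)
short_model = AutoModelForCausalLM.from_pretrained(SHORT_CTX_CKPT, torch_dtype=torch.bfloat16)

short_state = short_model.state_dict()
merged_state = {}
for name, long_param in long_model.state_dict().items():
    # Element-wise interpolation of every matching parameter tensor.
    merged_state[name] = ALPHA * long_param + (1.0 - ALPHA) * short_state[name]

long_model.load_state_dict(merged_state)
long_model.save_pretrained("path/to/merged-checkpoint")
```

In practice, several such merges (with different donor checkpoints and ratios) can be chained across training iterations, which is what "iterative merging" refers to above.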
As a result, where the original model's performance on the Needle in a Haystack (NIAH) benchmark declines beyond 8,192 tokens, this extended version maintains reliable performance across the entire 32,000-token context window.
Benchmarks
Benchmark | GLM-4-32B-Base-0414 | GLM-4-32B-Base-32K |
---|---|---|
arc_challenge | 59.39% | 64.93% |
arc_easy | 85.44% | 87.88% |
hellaswag | 64.75% | 65.40% |
mmlu | 77.05% | 77.87% |
piqa | 81.61% | 83.19% |
truthfulqa_mc2 | 49.27% | 50.07% |
winogrande | 78.69% | 80.03% |
NIAH Benchmark Results Comparison
Model | Task | 4,096 | 8,192 | 16,384 | 24,576 | 32,768 |
---|---|---|---|---|---|---|
GLM-4-32B-Base-0414 | niah_single_1 | 100.0% | 100.0% | 77.0% | 5.2% | 1.2% |
GLM-4-32B-Base-0414 | niah_single_2 | 100.0% | 100.0% | 73.4% | 2.6% | 0.0% |
GLM-4-32B-Base-0414 | niah_single_3 | 100.0% | 99.8% | 48.0% | 1.4% | 0.0% |
GLM-4-32B-Base-32K | niah_single_1 | 100.0% | 100.0% | 100.0% | 99.2% | 99.6% |
GLM-4-32B-Base-32K | niah_single_2 | 100.0% | 100.0% | 99.2% | 80.2% | 68.8% |
GLM-4-32B-Base-32K | niah_single_3 | 100.0% | 99.6% | 95.6% | 86.6% | 61.0% |
NIAH Averages
Model | 4,096 | 8,192 | 16,384 | 24,576 | 32,768 |
---|---|---|---|---|---|
GLM-4-32B-Base-0414 | 100.0% | 99.9% | 66.1% | 3.1% | 0.4% |
GLM-4-32B-Base-32K | 100.0% | 99.9% | 98.3% | 88.7% | 76.5% |
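For context, each niah_single_* task hides a short "needle" fact at some depth inside filler text of the given length and asks the model to retrieve it. Below is a minimal sketch of a single such probe; the model identifier, prompt wording, filler text, and exact-match scoring are simplified assumptions rather than the actual evaluation harness.

```python
# Simplified needle-in-a-haystack probe. The model id, filler text, and needle
# wording are placeholders; the real benchmark runs many depth/length combinations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/or/repo-id-of-GLM-4-32B-Base-32K"  # placeholder: replace with the actual repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def build_haystack(needle: str, context_tokens: int, depth: float) -> str:
    """Repeat filler sentences to ~context_tokens tokens and insert the needle at the given depth."""
    filler = "The grass is green. The sky is blue. The sun is bright. "
    filler_tokens = len(tokenizer(filler)["input_ids"])
    body = filler * (context_tokens // filler_tokens + 1)
    insert_at = int(len(body) * depth)
    return body[:insert_at] + " " + needle + " " + body[insert_at:]

needle = "The secret passcode is 7421."
prompt = build_haystack(needle, context_tokens=16384, depth=0.5)
prompt += "\n\nQuestion: What is the secret passcode?\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16, do_sample=False)

answer = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("needle retrieved:", "7421" in answer)  # crude exact-match scoring
```

A full run aggregates many such probes across needle positions for each context length, which is what the percentages above reflect.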
Use Cases
This model serves as a new base for continued training at 32K context, and for tasks requiring long-context understanding and processing.
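A minimal sketch of what continued pretraining on top of this checkpoint might look like with the Hugging Face Trainer is shown below; the repository id, dataset, and hyperparameters are placeholders, and a real 32B-scale run would additionally need multi-GPU sharding (e.g. FSDP or DeepSpeed), which is omitted here for brevity.

```python
# Hypothetical continued-pretraining setup at up to 32,000 tokens of context.
# The repo id, dataset file, and hyperparameters are illustrative placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "path/or/repo-id-of-GLM-4-32B-Base-32K"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Placeholder corpus of long documents; tokenize and truncate to the 32,000-token context.
raw = load_dataset("text", data_files={"train": "my_long_documents.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=32000)

train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="glm4-32k-continued",
        per_device_train_batch_size=1,   # long sequences are memory-heavy
        gradient_accumulation_steps=8,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```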
License
GLM-4-32B-Base-32K (32B) is released under the MIT license, consistent with the original model's license.
If you have questions or would like to share your experiences using GLM-4-32B-Base-32K (32B), please connect with us on social media. We’re excited to see what you build—and how this model helps you innovate!