# Model Card for abs-bvv-6
## Model Description
`abs-bvv-6` is a 2.3 billion parameter decoder-only Transformer model. It is the sixth and final model in the **Progressive Growth Transformers (PGT)** series, designed to explore how linguistic and reasoning capabilities emerge as a function of model depth.

This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the papers:

[📄 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)

[📄 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)

[💻 Code](https://github.com/AVBochkov/PGT)

The core idea is to demonstrate an alternative, more modular and resource-efficient paradigm for building LLMs. The PGT series shows that:
1. Semantic understanding can emerge without trainable embeddings.
## Performance
The model was evaluated on several standard benchmarks. Scores reflect performance on held-out test sets.

| Benchmark | Score (%) | σ (%) |
|-----------|-----------|-------|
| MMLU      | 21.63     | 0.22  |
| ARC-e     | 23.42     | 1.28  |
| ARC-c     | 25.62     | 1.92  |
| C-SENSE   | 19.51     | 0.90  |
| SQuAD     | 5.55      | 1.05  |

A key finding from the PGT series is the emergence of extractive QA capabilities (SQuAD) only in deeper models.
## Training Details
Architecture: 6-layer Decoder-Only Transformer (n_layer=6, d_model=4096, n_head=32).
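
For orientation, here is a rough back-of-the-envelope parameter count. It assumes a vanilla GPT-style block (four attention projections plus a 4x-expanded MLP, biases ignored), which the card does not actually specify, so treat it as an estimate only:

```python
# Rough estimate only; the block internals (4x MLP, standard attention) are assumptions.
d_model, n_layer = 4096, 6
attn = 4 * d_model ** 2                # Q, K, V and output projections
mlp = 2 * d_model * (4 * d_model)      # up- and down-projection with 4x expansion
per_block = attn + mlp                 # ~201M parameters per block
print(f"{n_layer * per_block / 1e9:.2f}B parameters in the 6 transformer blocks")  # ~1.21B
# The remainder of the 2.3B total would sit in the frozen glyph embedding table
# and the LM head, whose exact size depends on the vocabulary.
```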

Embeddings: The token embedding layer is frozen and derived from visual representations of Unicode glyphs. It is never updated during training.
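
To make the frozen-embedding idea concrete, here is a minimal sketch of how such a table could be built: each character's glyph is rasterized to a small bitmap, projected to `d_model`, and loaded into an `nn.Embedding` that never receives gradients. The bitmap size, font, and random projection are illustrative assumptions, not the authors' actual pipeline (see the linked papers and code for that).

```python
# Illustrative sketch only: a frozen, non-semantic embedding table built from
# rasterized Unicode glyphs. Bitmap size, font, and projection are assumptions.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

D_MODEL, GLYPH_SIZE = 4096, 24  # 24x24 bitmap per glyph (assumed)

def render_glyph(ch: str, size: int = GLYPH_SIZE) -> torch.Tensor:
    """Rasterize one character to a flat grayscale vector in [0, 1]."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=ImageFont.load_default())
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

def build_frozen_embeddings(vocab, d_model: int = D_MODEL) -> nn.Embedding:
    bitmaps = torch.stack([render_glyph(ch) for ch in vocab])   # (V, size*size)
    # Fixed, seeded random projection up to d_model so the table is reproducible.
    gen = torch.Generator().manual_seed(0)
    proj = torch.randn(bitmaps.shape[1], d_model, generator=gen)
    return nn.Embedding.from_pretrained(bitmaps @ proj, freeze=True)  # never updated

# Toy vocabulary of printable ASCII "glyph tokens"
emb = build_frozen_embeddings([chr(c) for c in range(32, 127)])
print(emb.weight.shape, emb.weight.requires_grad)  # torch.Size([95, 4096]) False
```

During training, only the transformer blocks above this table receive gradients, which is what lets the PGT series attribute emergent semantics to the blocks rather than the embeddings.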

Training Method: Progressive Layer-Wise Growth. The model was built by training one layer at a time. Layer 1 was trained to convergence, then frozen. Layer 2 was added and trained, etc. For deeper layers (5 and 6), LoRA was used to fine-tune all existing layers simultaneously with the new layer to ensure global coherence.
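
A small skeleton of the growth schedule may help picture the procedure. This is a hedged illustration rather than the released training code: the block definition, optimizer settings, and `train_one_stage` loop are placeholders, and the LoRA step for layers 5 and 6 is only indicated in a comment.

```python
# Sketch of progressive layer-wise growth (illustration, not the released code).
import torch
import torch.nn as nn

D_MODEL, N_HEAD, N_LAYER = 4096, 32, 6

def make_block() -> nn.Module:
    # Placeholder block; causal masking and the exact layer internals are omitted.
    return nn.TransformerEncoderLayer(D_MODEL, N_HEAD, dim_feedforward=4 * D_MODEL, batch_first=True)

class GrowingLM(nn.Module):
    def __init__(self, frozen_emb: nn.Embedding, vocab_size: int):
        super().__init__()
        self.emb = frozen_emb                 # frozen glyph embeddings (see sketch above)
        self.blocks = nn.ModuleList()         # grown one block at a time
        self.lm_head = nn.Linear(D_MODEL, vocab_size)

    def add_block(self) -> None:
        for p in self.parameters():
            p.requires_grad_(False)           # freeze everything trained so far
        self.blocks.append(make_block())      # only the new block (and the head) train
        for p in self.lm_head.parameters():
            p.requires_grad_(True)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.emb(ids)
        for blk in self.blocks:
            h = blk(h)
        return self.lm_head(h)

# Growth schedule: train block 1 to convergence, freeze it, add block 2, and so on.
# model = GrowingLM(emb, vocab_size=emb.num_embeddings)
# for depth in range(1, N_LAYER + 1):
#     model.add_block()
#     opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
#     train_one_stage(model, opt)   # placeholder data/loss loop
#     # For depth 5 and 6, additionally attach LoRA adapters to the earlier blocks
#     # so the whole stack is fine-tuned cheaply together with the new layer.
```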

Parameters: Total: 2.3B.

Data: A ~9B token mix of Wikipedia and SFT datasets (10%).
## Limitations and Bias
This model is a research prototype and has several limitations:

* **Not Instruction-Tuned:** It is a base model and will not follow instructions or engage in dialogue reliably.
* **Potential for Hallucinations:** Like all LLMs, it can generate factually incorrect or nonsensical text.
* **Data Bias:** Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
* **Limited Scope:** The model was trained on a relatively small dataset (9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against giant commercial models.

## 🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      url={https://arxiv.org/abs/2507.07129},
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.
## How to Use
The model can be loaded using the `transformers` library. Note that `trust_remote_code=True` is required as it uses a custom model architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('Bochkov/abs-bvv-6')
model = AutoModelForCausalLM.from_pretrained('Bochkov/abs-bvv-6', trust_remote_code=True, torch_dtype=torch.bfloat16).to('cuda')

inputs = tokenizer("Hello, I am a language model ", return_tensors="pt").to('cuda')

# Generate text
outputs = model.generate(
    **inputs,
    top_p=0.95,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```