Bochkov committed
Commit 353bf23 · verified · 1 Parent(s): 8308c13

Update README.md

Files changed (1)
  1. README.md +22 -33
README.md CHANGED
@@ -13,13 +13,17 @@ tags:
 
 # Model Card for abs-bvv-6
 
- [[Paper](https://huggingface.co/papers/2507.07129)] [[Code](https://github.com/Bochkov/bvv241)]
-
 ## Model Description
 
 `abs-bvv-6` is a 2.3 billion parameter decoder-only Transformer model. It is the sixth and final model in the **Progressive Growth Transformers (PGT)** series, designed to explore how linguistic and reasoning capabilities emerge as a function of model depth.
 
- This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the paper "[Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://arxiv.org/abs/2507.04886)".
+ This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the papers:
+
+ [📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)
+
+ [📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)
+
+ [💻 Code](https://github.com/AVBochkov/PGT)
 
 The core idea is to demonstrate an alternative, more modular and resource-efficient paradigm for building LLMs. The PGT series shows that:
 1. Semantic understanding can emerge without trainable embeddings.
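
The "frozen, non-semantic visual embeddings" mentioned in the hunk above can be pictured with a small sketch. This is an illustration only, not the released code: the `render_glyph` helper, the 16x16 glyph size, the printable-ASCII vocabulary, and the fixed random projection to `d_model` are assumptions made for the example. The point it illustrates is simply that the embedding table is derived from Unicode glyph images and then frozen, so it never receives gradients.

```python
# Illustrative sketch (not the PGT code): build a token-embedding table from
# rendered Unicode glyphs and freeze it, so the transformer blocks above it
# must carry all of the semantics.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

GLYPH_SIZE = 16                      # render each character as a 16x16 bitmap (assumption)
D_MODEL = 4096                       # matches the d_model quoted in the model card
CODEPOINTS = range(0x20, 0x7F)       # printable ASCII, just for the sketch

def render_glyph(codepoint: int, size: int = GLYPH_SIZE) -> torch.Tensor:
    """Render one character to a flattened grayscale bitmap scaled to [0, 1]."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), chr(codepoint), fill=255, font=ImageFont.load_default())
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

# Stack the glyph bitmaps and map them to d_model with a fixed (untrained) projection.
glyphs = torch.stack([render_glyph(cp) for cp in CODEPOINTS])            # (V, 256)
projection = torch.randn(GLYPH_SIZE * GLYPH_SIZE, D_MODEL) / GLYPH_SIZE  # never trained
embedding_table = glyphs @ projection                                    # (V, d_model)

# The defining property: the embedding layer is frozen and never updated during training.
frozen_embeddings = nn.Embedding.from_pretrained(embedding_table, freeze=True)
```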
@@ -34,42 +38,35 @@ This model is primarily an artifact for research into emergent capabilities, con
 
 ## Performance
 The model was evaluated on several standard benchmarks. Scores reflect performance on held-out test sets.
 
- | Benchmark | Score (%) | σ (%) |
- |---|---|---|
- | MMLU | 21.63% | 0.22% |
- | ARC-e | 23.42% | 1.28% |
- | ARC-c | 25.62% | 1.92% |
- | C-SENSE | 19.51% | 0.90% |
- | SQuAD | 5.55% | 1.05% |
+ | Benchmark | Score (%) | σ (%) |
+ |---|---|---|
+ | MMLU | 21.63 | 0.22 |
+ | ARC-e | 23.42 | 1.28 |
+ | ARC-c | 25.62 | 1.92 |
+ | C-SENSE | 19.51 | 0.90 |
+ | SQuAD | 5.55 | 1.05 |
 
 A key finding from the PGT series is the emergence of extractive QA capabilities (SQuAD) only in deeper models.
 
 ## Training Details
 Architecture: 6-layer Decoder-Only Transformer (n_layer=6, d_model=4096, n_head=32).
-
 Embeddings: The token embedding layer is frozen and derived from visual representations of Unicode glyphs. It is never updated during training.
-
 Training Method: Progressive Layer-Wise Growth. The model was built by training one layer at a time. Layer 1 was trained to convergence, then frozen. Layer 2 was added and trained, etc. For deeper layers (5 and 6), LoRA was used to fine-tune all existing layers simultaneously with the new layer to ensure global coherence.
-
 Parameters: Total: 2.3B.
-
 Data: A ~9B token mix of Wikipedia and SFT datasets (10%).
-
 ## Limitations and Bias
-
 This model is a research prototype and has several limitations:
-
- * **Not Instruction-Tuned:** It is a base model and will not follow instructions or engage in dialogue reliably.
- * **Potential for Hallucinations:** Like all LLMs, it can generate factually incorrect or nonsensical text.
- * **Data Bias:** Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
- * **Limited Scope:** The model was trained on a relatively small dataset (9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against giant commercial models.
-
+ Not Instruction-Tuned: It is a base model and will not follow instructions or engage in dialogue reliably.
+ Potential for Hallucinations: Like all LLMs, it can generate factually incorrect or nonsensical text.
+ Data Bias: Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
+ Limited Scope: The model was trained on a relatively small dataset (9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against giant commercial models.
 ## 🧑‍🔬 Citation & Concept
-
 If you use this model or the underlying concepts in your research, please cite our work:
-
- ```bibtex
+ ```
 @misc{bochkov2025emergentsemanticstokenembeddings,
   title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
   author={A. Bochkov},
@@ -79,7 +76,6 @@ If you use this model or the underlying concepts in your research, please cite o
   primaryClass={cs.CL},
   url={https://arxiv.org/abs/2507.04886},
 }
-
 @misc{bochkov2025growingtransformersmodularcomposition,
   title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
   author={A. Bochkov},
@@ -90,22 +86,16 @@ If you use this model or the underlying concepts in your research, please cite o
   url={https://arxiv.org/abs/2507.07129},
 }
 ```
-
 This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs: a step toward modular, fusable, multilingual LMs.
-
 ## How to Use
-
 The model can be loaded using the `transformers` library. Note that `trust_remote_code=True` is required as it uses a custom model architecture.
-
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-
 tokenizer = AutoTokenizer.from_pretrained('Bochkov/abs-bvv-6')
 model = AutoModelForCausalLM.from_pretrained('Bochkov/abs-bvv-6', trust_remote_code=True, torch_dtype=torch.bfloat16).to('cuda')
 
 inputs = tokenizer("Hello, I am a language model ", return_tensors="pt").to('cuda')
-
 # Generate text
 outputs = model.generate(
     **inputs,
@@ -115,6 +105,5 @@ outputs = model.generate(
     top_p=0.95,
     do_sample=True
 )
-
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
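
The Training Details section shown above describes Progressive Layer-Wise Growth: train one block, freeze it, add the next. A minimal sketch of that loop, under stated assumptions, is below; it is not the repository's training code. The block type (a stock `nn.TransformerEncoderLayer` standing in for the model's custom decoder block), the optimizer, and the helper names are placeholders for illustration, and the LoRA fine-tuning applied when layers 5 and 6 are added is only indicated in a comment.

```python
# Minimal sketch of "Progressive Layer-Wise Growth" as described in the card's
# Training Details section -- not the actual PGT training code. The block type,
# optimizer, and helper names are placeholders for illustration.
import torch
import torch.nn as nn

D_MODEL, N_HEAD = 4096, 32   # figures quoted in the model card

def new_block() -> nn.Module:
    # A stock encoder layer stands in for the model's custom decoder block
    # (causal masking and the frozen embedding front-end are omitted here).
    return nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEAD, batch_first=True)

def grow_one_layer(blocks: nn.ModuleList) -> nn.Module:
    """Freeze every block trained so far, then append a fresh trainable block."""
    for block in blocks:
        for p in block.parameters():
            p.requires_grad_(False)          # earlier layers stay frozen
    block = new_block()
    blocks.append(block)                     # only this block will receive gradients
    return block

blocks = nn.ModuleList()
for depth in range(1, 7):                    # grow the stack from 1 to 6 layers
    trainable = grow_one_layer(blocks)
    optimizer = torch.optim.AdamW(trainable.parameters(), lr=3e-4)
    # ... train the whole stack here; per the card, once depth >= 5 LoRA adapters
    # on the earlier (frozen) blocks are also trained to keep the model coherent ...
```

The design point the card emphasizes is that each growth step optimizes only a small fraction of the final parameter count (the newest block, plus lightweight LoRA adapters for the deeper steps), which is what makes the construction modular and resource-efficient.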
 