Bochkov committed
Commit a84db18 · verified · 1 Parent(s): 171ce39

Update README.md

Files changed (1):
  1. README.md +7 -20
README.md CHANGED
@@ -13,13 +13,17 @@ tags:

# Model Card for abs-bvv-4

- **GitHub Repository**: [https://github.com/Bochkov/BVV241-Tokenizers-Embeddings-Benchmarks](https://github.com/Bochkov/BVV241-Tokenizers-Embeddings-Benchmarks)
-
## Model Description

`abs-bvv-4` is a 1.9 billion parameter decoder-only Transformer model. It is the 4th model in the **Progressive Growth Transformers (PGT)** series, designed to explore how linguistic and reasoning capabilities emerge as a function of model depth.

- This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the paper "[Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations](https://arxiv.org/abs/2507.04886)".
+ This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the papers:
+
+ [📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)
+
+ [📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)
+
+ [💻 Code](https://github.com/AVBochkov/Embeddings)

The core idea is to demonstrate an alternative, more modular and resource-efficient paradigm for building LLMs. The PGT series shows that:
1. Semantic understanding can emerge without trainable embeddings.
@@ -34,31 +38,18 @@ This model is primarily an artifact for research into emergent capabilities, con

## Training Details
Architecture: 4-layer Decoder-Only Transformer (n_layer=4, d_model=4096, n_head=32).
-
Embeddings: The token embedding layer is frozen and derived from visual representations of Unicode glyphs. It is never updated during training.
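A minimal sketch of how such a frozen, glyph-derived embedding table could be built. The font, bitmap resolution, and the fixed random projection to `d_model` below are illustrative assumptions, not the exact recipe from the papers:

```python
# Illustrative only: build a frozen embedding table from rendered Unicode
# glyph bitmaps. Font, bitmap size, and the projection are assumptions.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

D_MODEL = 4096      # reported model width
GLYPH_SIZE = 16     # assumed bitmap resolution per character

def glyph_bitmap(codepoint: int, size: int = GLYPH_SIZE) -> torch.Tensor:
    """Render one Unicode character to a flattened grayscale bitmap in [0, 1]."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), chr(codepoint), fill=255,
                             font=ImageFont.load_default())
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

def build_frozen_embeddings(codepoints: list[int]) -> nn.Embedding:
    bitmaps = torch.stack([glyph_bitmap(cp) for cp in codepoints])    # (V, size*size)
    # Fixed random projection standing in for whatever maps visual features
    # to the model width; it is generated once and never trained.
    proj = torch.randn(bitmaps.shape[1], D_MODEL) / bitmaps.shape[1] ** 0.5
    return nn.Embedding.from_pretrained(bitmaps @ proj, freeze=True)  # (V, d_model)

emb = build_frozen_embeddings(list(range(32, 128)))   # e.g. printable ASCII subset
print(emb.weight.requires_grad)                        # False: never updated
```

Because the table is created with `freeze=True`, no gradient ever reaches it; only the Transformer blocks stacked on top of this substrate are trained.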
-
Training Method: Progressive Layer-Wise Growth. The model was built by training one layer at a time: Layer 1 was trained to convergence, then frozen; Layer 2 was added and trained, and so on. For the deeper models in the series (layers 5 and 6), LoRA was used to fine-tune all existing layers simultaneously with the new layer to ensure global coherence.
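A simplified sketch of the growth loop described above. The class and helper names are invented for illustration, a standard `nn.TransformerEncoderLayer` with a causal mask stands in for the real decoder block, and the LoRA stage used for the deeper models is only indicated in a comment:

```python
# Sketch of progressive layer-wise growth on a frozen substrate. Everything
# trained so far is frozen before a new block is added; only the new block
# (plus the output head) receives gradients in the current stage.
import torch
import torch.nn as nn

class GrownLM(nn.Module):
    def __init__(self, frozen_emb: nn.Embedding, d_model: int = 4096, n_head: int = 32):
        super().__init__()
        self.emb = frozen_emb                 # frozen visual-Unicode embeddings
        self.blocks = nn.ModuleList()         # grows one block per stage
        self.lm_head = nn.Linear(d_model, frozen_emb.num_embeddings, bias=False)
        self.d_model, self.n_head = d_model, n_head

    def add_block(self) -> nn.Module:
        for p in self.parameters():           # freeze all previously trained layers
            p.requires_grad = False
        block = nn.TransformerEncoderLayer(self.d_model, self.n_head, batch_first=True)
        self.blocks.append(block)             # new block is trainable by default
        for p in self.lm_head.parameters():   # output head keeps adapting
            p.requires_grad = True
        return block

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        x = self.emb(idx)                     # (B, T, d_model), no gradient to emb
        mask = nn.Transformer.generate_square_subsequent_mask(idx.size(1)).to(x.device)
        for block in self.blocks:
            x = block(x, src_mask=mask)       # causal self-attention
        return self.lm_head(x)                # (B, T, vocab)

# Toy growth run with small dimensions (the real model uses d_model=4096,
# n_head=32, n_layer=4; deeper PGT models add a LoRA pass over all layers).
frozen = nn.Embedding.from_pretrained(torch.randn(256, 128), freeze=True)
model = GrownLM(frozen, d_model=128, n_head=8)
for stage in range(4):
    model.add_block()
    # ...train only parameters with requires_grad=True, then move to the next stage...
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters in the current stage: {trainable:,}")
```

Each stage optimizes only the newly added block (and the head in this sketch), which is what makes the construction modular: earlier layers are reused as-is.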
-
Parameters: 1.9B total.
-
Data: A ~9B token mix of Wikipedia and SFT datasets (the SFT portion is about 10%).
-
## Limitations and Bias
-
This model is a research prototype and has several limitations:
-
Not Instruction-Tuned: It is a base model and will not follow instructions or engage in dialogue reliably.
-
Potential for Hallucinations: Like all LLMs, it can generate factually incorrect or nonsensical text.
-
Data Bias: Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
-
Limited Scope: The model was trained on a relatively small dataset (9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against giant commercial models.
-
## 🧑‍🔬 Citation & Concept
-
If you use this model or the underlying concepts in your research, please cite our work:
-
```
@misc{bochkov2025emergentsemanticstokenembeddings,
  title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
@@ -69,7 +60,6 @@ If you use this model or the underlying concepts in your research, please cite o
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04886},
}
-
@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
@@ -80,11 +70,8 @@ If you use this model or the underlying concepts in your research, please cite o
  url={https://arxiv.org/abs/2507.07129},
}
```
-
This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs: a step toward modular, fusable, multilingual LMs.
-
## How to Use
-
The model can be loaded using the `transformers` library. Note that `trust_remote_code=True` is required, as the model uses a custom architecture.
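A minimal loading and generation sketch. The Hub repository id `Bochkov/abs-bvv-4` and the use of the `Auto*` classes are assumptions based on the model name; follow the exact snippet in the repository if it differs.

```python
# Sketch: load the model with the custom code shipped in the repo and generate.
# The repo id below is assumed from the model name, not confirmed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Bochkov/abs-bvv-4"  # assumed Hub path
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,       # required: custom model architecture
    torch_dtype=torch.bfloat16,
).eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```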
 