# Model Card for abs-bvv-6
## Model Description
`abs-bvv-6` is a 2.3 billion parameter decoder-only Transformer model. It is the sixth and final model in the **Progressive Growth Transformers (PGT)** series, designed to explore how linguistic and reasoning capabilities emerge as a function of model depth.

This model was not trained monolithically. Instead, it was "grown" constructively, one layer at a time, upon a foundation of **frozen, non-semantic visual embeddings**, as introduced in the papers:

[📄 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)

[📄 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)

[💻 Code](https://github.com/AVBochkov/PGT)

The core idea is to demonstrate an alternative, more modular and resource-efficient paradigm for building LLMs. The PGT series shows that:
1. Semantic understanding can emerge without trainable embeddings.
## Performance
The model was evaluated on several standard benchmarks. Scores reflect performance on held-out test sets.

| Benchmark | Score (%) | σ (%) |
|-----------|-----------|-------|
| MMLU      | 21.63     | 0.22  |
| ARC-e     | 23.42     | 1.28  |
| ARC-c     | 25.62     | 1.92  |
| C-SENSE   | 19.51     | 0.90  |
| SQuAD     | 5.55      | 1.05  |

A key finding from the PGT series is the emergence of extractive QA capabilities (SQuAD) only in deeper models.
## Training Details
Architecture: 6-layer Decoder-Only Transformer (n_layer=6, d_model=4096, n_head=32).
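
For orientation, here is a rough back-of-the-envelope parameter count. It assumes a vanilla GPT-style block (four attention projections plus a 4x-expanded MLP, biases ignored), which the card does not actually specify, so treat it as an estimate only:

```python
# Rough estimate only; the block internals (4x MLP, standard attention) are assumptions.
d_model, n_layer = 4096, 6
attn = 4 * d_model ** 2                # Q, K, V and output projections
mlp = 2 * d_model * (4 * d_model)      # up- and down-projection with 4x expansion
per_block = attn + mlp                 # ~201M parameters per block
print(f"{n_layer * per_block / 1e9:.2f}B parameters in the 6 transformer blocks")  # ~1.21B
# The remainder of the 2.3B total would sit in the frozen glyph embedding table
# and the LM head, whose exact size depends on the vocabulary.
```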

Embeddings: The token embedding layer is frozen and derived from visual representations of Unicode glyphs. It is never updated during training.
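
To make the frozen-embedding idea concrete, here is a minimal sketch of how such a table could be built: each character's glyph is rasterized to a small bitmap, projected to `d_model`, and loaded into an `nn.Embedding` that never receives gradients. The bitmap size, font, and random projection are illustrative assumptions, not the authors' actual pipeline (see the linked papers and code for that).

```python
# Illustrative sketch only: a frozen, non-semantic embedding table built from
# rasterized Unicode glyphs. Bitmap size, font, and projection are assumptions.
import torch
import torch.nn as nn
from PIL import Image, ImageDraw, ImageFont

D_MODEL, GLYPH_SIZE = 4096, 24  # 24x24 bitmap per glyph (assumed)

def render_glyph(ch: str, size: int = GLYPH_SIZE) -> torch.Tensor:
    """Rasterize one character to a flat grayscale vector in [0, 1]."""
    img = Image.new("L", (size, size), color=0)
    ImageDraw.Draw(img).text((0, 0), ch, fill=255, font=ImageFont.load_default())
    return torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0

def build_frozen_embeddings(vocab, d_model: int = D_MODEL) -> nn.Embedding:
    bitmaps = torch.stack([render_glyph(ch) for ch in vocab])   # (V, size*size)
    # Fixed, seeded random projection up to d_model so the table is reproducible.
    gen = torch.Generator().manual_seed(0)
    proj = torch.randn(bitmaps.shape[1], d_model, generator=gen)
    return nn.Embedding.from_pretrained(bitmaps @ proj, freeze=True)  # never updated

# Toy vocabulary of printable ASCII "glyph tokens"
emb = build_frozen_embeddings([chr(c) for c in range(32, 127)])
print(emb.weight.shape, emb.weight.requires_grad)  # torch.Size([95, 4096]) False
```

During training, only the transformer blocks above this table receive gradients, which is what lets the PGT series attribute emergent semantics to the blocks rather than the embeddings.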

Training Method: Progressive Layer-Wise Growth. The model was built by training one layer at a time. Layer 1 was trained to convergence, then frozen. Layer 2 was added and trained, etc. For deeper layers (5 and 6), LoRA was used to fine-tune all existing layers simultaneously with the new layer to ensure global coherence.
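
A small skeleton of the growth schedule may help picture the procedure. This is a hedged illustration rather than the released training code: the block definition, optimizer settings, and `train_one_stage` loop are placeholders, and the LoRA step for layers 5 and 6 is only indicated in a comment.

```python
# Sketch of progressive layer-wise growth (illustration, not the released code).
import torch
import torch.nn as nn

D_MODEL, N_HEAD, N_LAYER = 4096, 32, 6

def make_block() -> nn.Module:
    # Placeholder block; causal masking and the exact layer internals are omitted.
    return nn.TransformerEncoderLayer(D_MODEL, N_HEAD, dim_feedforward=4 * D_MODEL, batch_first=True)

class GrowingLM(nn.Module):
    def __init__(self, frozen_emb: nn.Embedding, vocab_size: int):
        super().__init__()
        self.emb = frozen_emb                 # frozen glyph embeddings (see sketch above)
        self.blocks = nn.ModuleList()         # grown one block at a time
        self.lm_head = nn.Linear(D_MODEL, vocab_size)

    def add_block(self) -> None:
        for p in self.parameters():
            p.requires_grad_(False)           # freeze everything trained so far
        self.blocks.append(make_block())      # only the new block (and the head) train
        for p in self.lm_head.parameters():
            p.requires_grad_(True)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.emb(ids)
        for blk in self.blocks:
            h = blk(h)
        return self.lm_head(h)

# Growth schedule: train block 1 to convergence, freeze it, add block 2, and so on.
# model = GrowingLM(emb, vocab_size=emb.num_embeddings)
# for depth in range(1, N_LAYER + 1):
#     model.add_block()
#     opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
#     train_one_stage(model, opt)   # placeholder data/loss loop
#     # For depth 5 and 6, additionally attach LoRA adapters to the earlier blocks
#     # so the whole stack is fine-tuned cheaply together with the new layer.
```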

Parameters: Total: 2.3B.

Data: A ~9B token mix of Wikipedia and SFT datasets (10%).
## Limitations and Bias
This model is a research prototype and has several limitations:

* **Not Instruction-Tuned:** It is a base model and will not follow instructions or engage in dialogue reliably.
* **Potential for Hallucinations:** Like all LLMs, it can generate factually incorrect or nonsensical text.
* **Data Bias:** Trained primarily on Wikipedia, it will reflect the biases present in that corpus.
* **Limited Scope:** The model was trained on a relatively small dataset (9B tokens) compared to state-of-the-art models. Its performance is intended to be evaluated relative to its own baseline (trainable embeddings) and shallower versions, not against giant commercial models.

## 🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      url={https://arxiv.org/abs/2507.07129},
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.
## How to Use
The model can be loaded using the `transformers` library. Note that `trust_remote_code=True` is required as it uses a custom model architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('Bochkov/abs-bvv-6')
model = AutoModelForCausalLM.from_pretrained('Bochkov/abs-bvv-6', trust_remote_code=True, torch_dtype=torch.bfloat16).to('cuda')

inputs = tokenizer("Hello, I am a language model ", return_tensors="pt").to('cuda')

# Generate text
outputs = model.generate(
    **inputs,
    top_p=0.95,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```