---
language:
- en
license: apache-2.0
pipeline_tag: text-generation
tags:
- bvv
- non-frozen
- embedding
- research
- baseline
library_name: transformers
---

# pro_bvv_unfrozen: 200M baseline LM (non-frozen embeddings)

This repository contains the model and associated resources from the following papers:

- [📚 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886)
- [📚 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129)
- [💻 Code](https://github.com/AVBochkov/Embeddings)

**Description**

This is a baseline English language model (200M parameters) trained in the **classical** way, with **fully trainable** token embeddings. It is **provided for direct comparison** with the conceptually frozen-embedding variant.

**Training details**

- English corpus (~9B tokens), with 10% SFT data mixed in.
- All layers, including the token embeddings, are trainable.
- Hyperparameters and architecture match pro_bvv_en.

**Evaluation**

| Task | pro_bvv_unfrozen |
|---------|------------------|
| MMLU | 14.00% ± 0.14% |
| ARC-e | 24.09% ± 0.78% |
| ARC-c | 22.24% ± 1.04% |
| C-SENSE | 19.76% ± 0.52% |
| SQUAD | 13.28% ± 0.93% |

---

**⚠️ Limitations**

Research use only. The model was trained on a comparatively small corpus (~9B tokens), so quality, robustness, and reasoning are much lower than SOTA models. SFT was only lightly applied; the model is not intended for real-world use.

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```
@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      year={2025},
      eprint={2507.04886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129},
}
```

This work demonstrates that transformer blocks, not token embeddings, carry the semantic burden in LLMs, a step toward modular, fusable, multilingual LMs.

**Usage**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained('Bochkov/pro_bvv_unfrozen')
model = AutoModelForCausalLM.from_pretrained('Bochkov/pro_bvv_unfrozen', trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Example input: ")], device='cuda')
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
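For more varied completions, the same checkpoint can be used with sampling-based decoding. The snippet below is a minimal sketch, assuming the custom model code supports the standard `transformers` generation arguments (`do_sample`, `temperature`, `top_p`); the prompt and sampling values are illustrative, not tuned for this model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('Bochkov/pro_bvv_unfrozen')
model = AutoModelForCausalLM.from_pretrained('Bochkov/pro_bvv_unfrozen', trust_remote_code=True)

# Fall back to CPU when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

inputs = torch.tensor([tokenizer.encode("The capital of France is")], device=device)

# Sampling-based decoding; temperature/top_p are example values, not tuned for this model.
outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0]))
```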
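The exact harness behind the evaluation table above is not described in this card. As a rough illustration of how such multiple-choice scores can be obtained, the sketch below compares per-choice log-likelihoods under the model (an ARC-style setup). It assumes the custom model returns standard causal-LM logits and that prompt tokens form a prefix of the full sequence; it is not the authors' evaluation code, and the example question is hypothetical.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('Bochkov/pro_bvv_unfrozen')
model = AutoModelForCausalLM.from_pretrained('Bochkov/pro_bvv_unfrozen', trust_remote_code=True).eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens given the prompt."""
    prompt_ids = tokenizer.encode(prompt)
    full_ids = tokenizer.encode(prompt + " " + choice)  # assumes prompt_ids is a prefix of full_ids
    input_ids = torch.tensor([full_ids])
    with torch.no_grad():
        logits = model(input_ids).logits[0]  # assumes a standard CausalLMOutput with .logits
    log_probs = F.log_softmax(logits.float(), dim=-1)
    # Logits at position i predict token i + 1, so score only the answer positions.
    return sum(log_probs[pos - 1, full_ids[pos]].item()
               for pos in range(len(prompt_ids), len(full_ids)))

question = "Question: Which gas do plants absorb from the atmosphere? Answer:"
choices = ["oxygen", "carbon dioxide", "nitrogen", "helium"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```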
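Since this checkpoint is the fully trainable counterpart to the frozen-embedding models, the sketch below shows, in generic `transformers` terms, what excluding the embedding matrix from training looks like. It is not the papers' frozen visual-Unicode setup, only an illustration of the trainable/frozen distinction, and it assumes the custom model implements the standard `get_input_embeddings` accessor.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Bochkov/pro_bvv_unfrozen', trust_remote_code=True)

# In this baseline, every parameter (including the embedding matrix) receives gradient updates.
# A frozen-embedding variant would instead take the embedding table out of training:
embeddings = model.get_input_embeddings()   # standard transformers accessor
embeddings.weight.requires_grad_(False)     # exclude the embedding matrix from gradient updates

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```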