Update README.md
README.md
CHANGED
tags:
- frozen-embeddings
---
# bvv241-max
* **Progressive Layer-wise Growth**: Deep Transformers can be "grown" by progressively stacking and training one layer at a time, showing stable convergence and correlation between depth and reasoning abilities.
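
As a rough illustration of this layer-wise growth idea (a minimal PyTorch sketch, not code from this repository; the layer class, sizes, data and objective are placeholders), each growth stage freezes the layers trained so far and appends one new trainable layer:

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 4  # placeholder sizes, for illustration only

class GrowingStack(nn.Module):
    """Toy Transformer stack that is grown one layer at a time."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList()

    def grow(self):
        # Freeze everything trained so far, then append a fresh trainable layer.
        for p in self.parameters():
            p.requires_grad = False
        self.layers.append(nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

model = GrowingStack()
for stage in range(3):  # grow three layers, one per stage
    model.grow()
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    x = torch.randn(2, 8, d_model)   # stand-in batch of hidden states
    loss = model(x).pow(2).mean()    # stand-in training objective
    loss.backward()
    opt.step()
```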
You can use the model with the `transformers` library. Note that this model uses a custom architecture, so `trust_remote_code=True` is required for proper loading. This model also utilizes precomputed, frozen embeddings that are loaded separately.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
import torch

# Load the tokenizer that ships with the repository
tokenizer = AutoTokenizer.from_pretrained('Bochkov/bvv241-max', trust_remote_code=True)

# The precomputed, frozen embedding matrix is distributed separately; download and load it
# from the repository (the exact filename depends on the repo layout), e.g.:
# embeddings = torch.load(hf_hub_download(repo_id='Bochkov/bvv241-max', filename='<embeddings-file>.pt'))
#
# You may then need to replace the model's token-embedding weights
# with the 'embeddings' loaded above, depending on the model's specific
# 'forward' method or initialization.
model = AutoModelForCausalLM.from_pretrained(
    'Bochkov/bvv241-max',
    torch_dtype=torch.float32, # or torch.bfloat16, depending on your setup
    low_cpu_mem_usage=True,
    trust_remote_code=True
)
# model.transformer.wte.weight = torch.nn.Parameter(embeddings).to(model.device)
model.eval() # Set to evaluation mode

# Example text generation
prompt = "The key to life is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Move to GPU if available
if torch.cuda.is_available():
    model.to("cuda")
    input_ids = input_ids.to("cuda")

# Generate text
output_ids = model.generate(input_ids, max_new_tokens=20, do_sample=True, temperature=0.7)
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generated_text)
```
## Tokenizer and Embedding Variants

This repository provides various Unicode-based tokenizers and precomputed, L2-normalized, frozen embedding matrices for direct use in `nn.Embedding`. These embeddings contain **no semantic information** and are designed for research into emergent semantics in transformer layers.

1. **`bvv241-2-3`**: Base Unicode plane (0–65535) with Wikipedia bigrams/trigrams in private Unicode ranges. (65,536 tokens, 1024-dim frozen embedding)
2. **`bvv241-max` (This Model)**: Combines Unicode monograms + bigrams/trigrams + the intersection of token strings from SOTA models. (131,072 tokens, 1024-dim frozen embedding)
3. **`bvv241-nemo`**: Vocabulary of the Mistral-Nemo SOTA model with frozen surface-level embeddings. (131,072 tokens, 1024-dim frozen embedding)
4. **`bvv241-abs`**: Similar to `bvv241-max`, but with an embedding size of 4096.

These variants are designed to enable flexible experimentation with modular model fusion and the study of semantic emergence in LLMs.
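
As a minimal usage sketch of this frozen-embedding setup (the tensor below is a random stand-in for one of the precomputed matrices; the shapes follow the `bvv241-max` variant), such a matrix can be plugged directly into a standard `nn.Embedding` and kept frozen:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Random stand-in for a precomputed matrix: 131,072 tokens x 1024 dims, with L2-normalized rows.
embeddings = F.normalize(torch.randn(131072, 1024), dim=-1)

emb_layer = nn.Embedding.from_pretrained(embeddings, freeze=True)  # frozen: never updated during training
token_ids = torch.tensor([[17, 42, 2025]])
print(emb_layer(token_ids).shape)  # torch.Size([1, 3, 1024])
```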
## Citation
If you find this work helpful or inspiring, please consider citing the associated papers:

```bibtex
@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129},
}

@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      year={2025},
      eprint={2507.04886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}
```
---
# best_bvv_unfrozen_zh
[📄 Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations)](https://huggingface.co/papers/2507.04886) -
[📄 Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate)](https://huggingface.co/papers/2507.07129) -
[💻 Code](https://github.com/AVBochkov/Embeddings)
# Model summary
best_bvv_unfrozen_zh is a 500M-parameter Causal Language Model (LM) trained as an open proof-of-concept for the "frozen embeddings" paradigm. This version uses fully trainable token embeddings (a standard setup) and serves as a baseline for direct comparison with the corresponding "frozen-embedding" model, Bochkov/best_bvv_zh.

- **Architecture:** Transformer, rotary positional encoding
- **Vocabulary:** custom Unicode-based, 131,072 tokens
- **Embedding:** unfrozen (trainable, classic)
- **Pretraining data:** 9B tokens (Wikipedia, SQuAD 2.0, TriviaQA, NQ, etc.) with 10% SFT (instruction/factual Q&A) mixed in
- **Purpose:** compare learning capacity and generalization of full vs. frozen-embedding LMs on small data (see the sketch below)
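
A hedged sketch of what that comparison amounts to in code (generic `transformers`/PyTorch; it assumes the custom architecture exposes the standard `get_input_embeddings()` hook, and the `freeze_embeddings` flag is purely illustrative): the frozen-embedding counterpart switches off gradients for the token-embedding matrix, whereas this baseline leaves it trainable.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('Bochkov/best_bvv_unfrozen_zh', trust_remote_code=True)

freeze_embeddings = False  # set True to emulate the frozen-embedding setup (cf. Bochkov/best_bvv_zh)
if freeze_embeddings:
    model.get_input_embeddings().weight.requires_grad_(False)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.0f}M")
```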
## Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Requires a CUDA device; drop the .to('cuda') calls to run on CPU.
model = AutoModelForCausalLM.from_pretrained('Bochkov/best_bvv_unfrozen_zh', trust_remote_code=True).to('cuda')
tokenizer = AutoTokenizer.from_pretrained('Bochkov/best_bvv_unfrozen_zh')

inputs = tokenizer("Hello, world! ", return_tensors="pt").to('cuda')
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True
)
print(tokenizer.decode(outputs[0]))
```
## Citation
If you find this work helpful or inspiring, please consider citing the associated papers:

```bibtex
@misc{bochkov2025emergentsemanticstokenembeddings,
      title={Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations},
      author={A. Bochkov},
      year={2025},
      eprint={2507.04886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.04886},
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129},
}
```