Update README.md
This model was converted to GGUF format from [`SicariusSicariiStuff/Oni_Mitsubishi_12B`](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.

Refer to the [original model card](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) for more details on the model.

---

It happened. The long-awaited Gemma-3 is here, and not only are the model sizes really good (1, 4, 12, 27), but the 128k context (except for the 1B, which is 32k) was exactly what the Open-Source community wanted and asked for. My only issue with Gemma models in general is the VRAM requirement for tuning them, but that's a "me problem." End users will probably be very happy with Gemma-3 in terms of the VRAM requirement for running it.

On the 12th of March, the Gemma-3 family of models was released. So I decided to go full superstitious and took this omen as a divine calling to finetune the 12B model first. This is how Oni_Mitsubishi_12B was born.

Before starting the actual training run, I used the following command, which I believe has helped the model to converge "better":

```bash
for i in {1..666}; do nvidia-smi; done
```

Gemma is known for its "Gemma knowledge": fandom and/or other obscure knowledge that even larger LLMs often do not possess. It gets even better, as this time we also got a vision model embedded into all the Gemma-3 models, except for the 1B. I wonder what the possibilities are for the vision part if the text layers are uncensored?

I have used brand-new long-context markdown data, some deslopped instruct data (very lightly deslopped; it's very time-consuming to get right), and more than 50% highly curated and filtered organic human data, meticulously cleaned and parsed into obedience. A new stack of organic and data-engineered text was used for the first time for Oni_Mitsubishi_12B. I truly hope creating it was worth the effort.

At NO POINT was ChatGPT used for data generation; however, the new Claude 3.7 Sonnet was used VERY sparingly for the specific task of creating a small number of humorous datasets (very human-like, done with a decent amount of prompt engineering). I've meticulously checked them for slop, and it is minimal. The goal of said data was to imitate human text, using the 4chan vernacular.

Speaking of which, I've published a highly curated, SFT-ready 4chan dataset here: UBW_Tapestries. Naturally, I have included it in the dataset used for this model as well.

## Technical details

I've used the "ancient" Alpaca chat template because the Gemma-3 chat template was behaving funkily, and I didn't want to waste precious time; I wanted to give the community a more uncensored finetune to play with as fast as possible (I saw this requested a lot on both Reddit and Discord, understandably). In my opinion, it's silly to let perfect be the enemy of the good. Anyway, I had to use both bleeding-edge Transformers and Axolotl, and modify stuff that wasn't even supposed to work (like the model's config.json).
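
For anyone unfamiliar with it, the standard no-input Alpaca prompt looks roughly like the sketch below; whether this finetune keeps the preamble sentence or uses the variant with an `### Input:` block is an assumption on my part, so check the original model card before relying on it:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{your prompt here}

### Response:
```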

Since it's a hybrid model, training its text-only part is a bit problematic, so I hacked a config.json that gaslights the model into thinking it's only a text model, and got some warnings like:

```
'vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight', 'vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias'}
- This IS expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
```

Then I saw it trains.

The absolute state when you can train a model before you can actually inference it.
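
For anyone curious what that kind of config.json surgery can look like, here is a minimal sketch using `jq`. The field names (`text_config`, `architectures`, `model_type`) are assumptions based on the Gemma-3 config layout in Transformers, not the exact edits made for this finetune:

```bash
# Hypothetical sketch, not the exact hack used for Oni_Mitsubishi_12B:
# lift the nested text_config to the top level and relabel the architecture
# so the checkpoint loads as a plain text-only Gemma3ForCausalLM.
jq '.text_config + {architectures: ["Gemma3ForCausalLM"], model_type: "gemma3_text"}' \
  config.json > config.text-only.json
```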

---

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)
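
The usual GGUF-my-repo boilerplate applies from here on; a minimal sketch (the repo id and quant filename below are placeholders, substitute the actual ones from this repo's file list):

```bash
# Install the llama.cpp CLI tools via Homebrew
brew install llama.cpp

# Run the model straight from the Hub (placeholders for the repo id and .gguf file)
llama-cli --hf-repo <this-repo-id> --hf-file <quant-file>.gguf -p "Hello, who are you?"
```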