Text2Text Generation
GGUF
English
Not-For-All-Audiences
llama-cpp
gguf-my-repo
conversational
Triangle104 committed · Commit 34a9647 · verified · 1 parent: 1ad16dd

Update README.md

Files changed (1): README.md (+79 -0)

README.md CHANGED
This model was converted to GGUF format from [`SicariusSicariiStuff/Oni_Mitsubishi_12B`](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) for more details on the model.

---

It happened. The long-awaited Gemma-3 is here, and not only are the model sizes really good (1, 4, 12, 27), but the 128k context (except for the 1B, which gets 32k) is exactly what the open-source community wanted and asked for. My only issue with Gemma models in general is the VRAM requirement for tuning them, but that's a "me problem." End users will probably be very happy with Gemma-3 in terms of the VRAM requirement for running it.

On the 12th of March, the Gemma-3 family of models was released. So I decided to go full superstitious and took this omen as a divine calling to finetune the 12B model first. This is how Oni_Mitsubishi_12B was born.

Before starting the actual training run, I used the following command, which I believe has helped the model to converge "better":

```bash
for i in {1..666}; do nvidia-smi; done
```
35
+
36
+
37
+
38
+ Gemma is known for its "Gemma knowledge": fandom and
39
+ \ or other obscure knowledge that sometimes even larger LLMs often do
40
+ not possess. It gets even better, as this time we also got a vision model
41
+ embedded into all the Gemma-3 models, except for the 1B. I wonder what
42
+ are the possibilities for the vision part if the text layers are
43
+ uncensored?

I have used brand-new long-context markdown data, some deslopped instruct data (very lightly deslopped; it's very time-consuming to get right), and more than 50% of highly curated and filtered organic human data, meticulously cleaned and parsed into obedience. A new stack of organic and data-engineered text was used for the first time for Oni_Mitsubishi_12B. I truly hope creating it was worth the effort.

At NO POINT was ChatGPT used for data generation; however, the new Claude 3.7 Sonnet was used VERY sparingly for the specific task of creating a small number of humorous datasets (very human-like, done with a decent amount of prompt engineering). I've meticulously checked them for slop, and it is minimal. The goal of said data was to imitate human text, using the 4chan vernacular.

Speaking of which, I've published a highly curated, SFT-ready 4chan dataset here: UBW_Tapestries. Naturally, I have included it in the dataset used for this model as well.

### Technical details

I've used the "ancient" Alpaca chat template because the Gemma-3 chat template was behaving funkily, and I didn't want to waste precious time; I'd rather give the community a more uncensored finetune to play with as fast as possible (I saw this requested a lot on both Reddit and Discord, understandably). In my opinion, it's silly to let perfect be the enemy of the good. Anyway, I had to use both bleeding-edge Transformers and Axolotl, and modify stuff that wasn't even supposed to work (like the model's config.json).
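
For reference, a classic Alpaca-style prompt is laid out roughly like this (the preamble wording below is the common community convention, not something confirmed for this particular finetune, so adjust to taste):

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{your prompt here}

### Response:
```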

Since it's a hybrid model, training its text-only part is a bit problematic, so I hacked a config.json that gaslights the model into thinking it's a text-only model, and got some warnings like:

```
'vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight', 'vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias'}
- This IS expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
```

Then I saw that it trains.

The absolute state, when you can train a model before you can actually run inference on it.
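
For anyone curious what that kind of config.json surgery can look like, here is a minimal, hypothetical sketch. It assumes the usual multimodal Gemma-3 layout where the text settings live under a `text_config` key, and it requires `jq`; it is not the exact edit used for this model, so inspect your own config.json first.

```bash
# Hypothetical sketch: derive a text-only config from a multimodal Gemma-3 config.json.
# Assumes the text settings live under "text_config"; field names may differ between
# transformers versions.
jq '.text_config + {architectures: ["Gemma3ForCausalLM"], model_type: "gemma3_text"}' \
  config.json > config.text_only.json

# Back up the original and swap in the text-only version before training.
mv config.json config.json.multimodal.bak
mv config.text_only.json config.json
```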

---
  ## Use with llama.cpp
  Install llama.cpp through brew (works on Mac and Linux)