Update README.md
This model was converted to GGUF format from [`SicariusSicariiStuff/Oni_Mitsubishi_12B`](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.

Refer to the [original model card](https://huggingface.co/SicariusSicariiStuff/Oni_Mitsubishi_12B) for more details on the model.

---

It happened. The long-awaited Gemma-3 is here, and not only are the model sizes really good (1, 4, 12, 27), but the 128k context (except for the 1B, which is 32k) was exactly what the Open-Source community wanted and asked for. My only issue with Gemma models in general is the VRAM requirement for tuning them, but that's a "me problem." End users will probably be very happy with Gemma-3 in terms of the VRAM requirement for running it.

On the 12th of March, the Gemma-3 family of models was released. So I decided to go full superstitious and took this omen as a divine calling to finetune the 12B model first. This is how Oni_Mitsubishi_12B was born.

Before starting the actual training run, I used the following command, which I believe has helped the model to converge "better":

```bash
for i in {1..666}; do nvidia-smi; done
```

Gemma is known for its "Gemma knowledge": fandom and/or other obscure knowledge that even larger LLMs often do not possess. It gets even better, as this time we also got a vision model embedded into all the Gemma-3 models, except for the 1B. I wonder what the possibilities are for the vision part if the text layers are uncensored?

I have used brand-new long-context markdown data, some deslopped instruct data (very lightly deslopped; it's very time-consuming to get right), and more than 50% highly curated and filtered organic human data, meticulously cleaned and parsed into obedience. A new stack of organic and data-engineered text was used for the first time for Oni_Mitsubishi_12B. I truly hope creating it was worth the effort.

At NO POINT was ChatGPT used for data generation; however, the new Claude 3.7 Sonnet was used VERY sparingly for the specific task of creating a small number of humorous datasets (very human-like, done with a decent amount of prompt engineering). I've meticulously checked them for slop, and it is minimal. The goal of said data was to imitate human text, using the 4chan vernacular.

Speaking of which, I've published a highly curated, SFT-ready 4chan dataset here: UBW_Tapestries. Naturally, I have included it in the dataset used for this model as well.

## Technical details

I've used the "ancient" Alpaca chat template because the Gemma-3 chat template was behaving funkily, and I didn't want to waste precious time; I wanted to give the community a more uncensored finetune to play with as fast as possible (I saw this requested a lot on both Reddit and Discord, understandably). In my opinion, it's silly to let perfect be the enemy of the good. Anyway, I had to use both bleeding-edge Transformers and Axolotl, and modify stuff that wasn't even supposed to work (like the model's config.json).
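
For anyone unfamiliar with it, the standard no-input Alpaca prompt looks roughly like the sketch below; whether this finetune keeps the preamble sentence or uses the variant with an `### Input:` block is an assumption on my part, so check the original model card before relying on it:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{your prompt here}

### Response:
```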

Since it's a hybrid model, training its text-only part is a bit problematic, so I hacked a config.json that gaslights the model into thinking it's only a text model, and got some warnings like:

```
'vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight', 'vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias'}
- This IS expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Gemma3ForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
```

Then I saw it trains.

The absolute state when you can train a model before you can actually inference it.
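
For anyone curious what that kind of config.json surgery can look like, here is a minimal sketch using `jq`. The field names (`text_config`, `architectures`, `model_type`) are assumptions based on the Gemma-3 config layout in Transformers, not the exact edits made for this finetune:

```bash
# Hypothetical sketch, not the exact hack used for Oni_Mitsubishi_12B:
# lift the nested text_config to the top level and relabel the architecture
# so the checkpoint loads as a plain text-only Gemma3ForCausalLM.
jq '.text_config + {architectures: ["Gemma3ForCausalLM"], model_type: "gemma3_text"}' \
  config.json > config.text-only.json
```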

---

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux)
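
The usual GGUF-my-repo boilerplate applies from here on; a minimal sketch (the repo id and quant filename below are placeholders, substitute the actual ones from this repo's file list):

```bash
# Install the llama.cpp CLI tools via Homebrew
brew install llama.cpp

# Run the model straight from the Hub (placeholders for the repo id and .gguf file)
llama-cli --hf-repo <this-repo-id> --hf-file <quant-file>.gguf -p "Hello, who are you?"
```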