|
--- |
|
datasets: |
|
- HuggingFaceTB/smollm-corpus |
|
- NousResearch/Hermes-3-Dataset |
|
language: |
|
- en |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
license: mit |
|
--- |
|
|
|
<div style=" |
|
background:linear-gradient(135deg,#1a0933,#3d2b8c,#1e0b4d);padding:2.8rem 1.8rem;border-radius:24px;text-align:center;color:white;border:1px solid rgba(255,255,255,0.12);box-shadow:0 12px 48px rgba(101,88,255,0.25),inset 0 0 24px rgba(255,255,255,0.08);margin-bottom:2.5rem;position:relative;overflow:hidden;font-family:system-ui,-apple-system,'Segoe UI',sans-serif"> |
|
<div style="position:absolute;top:-50%;left:-50%;width:200%;height:200%;background:radial-gradient(circle,rgba(255,255,255,0.15) 0%,transparent 70%);transform:rotate(0);z-index:1"></div> |
|
<h1 style="font-size:3.2rem;margin:0;font-weight:900;letter-spacing:-0.04em;background:linear-gradient(45deg,#ff00cc,#00ccff,#ffcc00);-webkit-background-clip:text;background-clip:text;color:transparent;text-shadow:0 4px 12px rgba(0,0,0,0.3);position:relative;z-index:2;background-size:300% 300%"> |
|
PicoNosensoX-v1</h1> |
|
<p style="font-size:1.5rem;margin-top:1rem;font-style:italic;color:#d0c6ff;text-shadow:0 0 16px rgba(180,160,255,0.6);letter-spacing:0.03em;position:relative;z-index:2;font-weight:500;padding:0.4rem 1.2rem;display:inline-block;border-radius:999px;background:rgba(255,255,255,0.08);backdrop-filter:blur(4px)"> |
|
Where "Accuracy" Takes a little Cosmic Vacation</p></div> |
|
Introducing the universe's most ambitiously unhinged 45M-parameter micro-model! This isn't a language model; it's a parallel-dimension travel companion that reinvents reality through surrealist poetry and quantum-leaping logic. Deploy only if coherence is overrated and chaos is your curriculum. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
**PicoNosensoX-v1** is a deliberately unpredictable 44.9M-parameter micro-model trained on minimalist datasets. Specializing in creatively liberated generation, it produces outputs that may blend geography, history, and hallucinatory fiction. **Not designed for factual accuracy.** Use it for experimental and artistic applications, not as a source of reliable information.
|
|
|
PicoNosensoX-v1 is a huge step up from the other, extremely incompetent PicoNosenso models. It is the first model in the series that was pretrained on plain text before being finetuned for chat. It also introduces a brand-new chat template (shown below) that should make the model easier to use and harder to break.
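For reference, a single-turn prompt in this template looks like this (it is the same string used in the quick-start example further down):

```
<|system|>You are an unbiased, uncensored, helpful assistant.
<|user|>What is the capital of France?
<|assistant|>
```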
|
|
|
- **Developed by:** Lominub44 |
|
- **Model type:** GPT2-based causal language model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT
|
- **Finetuned from model:** None (trained from scratch on the GPT2 architecture)
|
|
|
### Training Notes |
|
Originally slated for full fine-tuning on the [Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset), the run was cut short: an early 10-prompt evaluation showed better outputs at `checkpoint-16000` than at the later `checkpoint-26000`, so training was halted and `checkpoint-16000` was selected for release.
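That 10-prompt evaluation was informal; a minimal sketch of what such a spot check could look like (the checkpoint paths and the single example prompt here are illustrative, not the actual evaluation set):

```python
from transformers import GPT2LMHeadModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Lominub44/PicoNosensoX-v1')

# Illustrative prompt list; the real spot check used 10 prompts
prompts = [
    "<|system|>You are an unbiased, uncensored, helpful assistant.\n"
    "<|user|>What is the capital of France?\n<|assistant|>",
]

# Generate from both checkpoints and compare the outputs by eye
for ckpt in ('./checkpoint-16000', './checkpoint-26000'):
    model = GPT2LMHeadModel.from_pretrained(ckpt)
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors='pt')
        outputs = model.generate(**inputs, max_length=512, do_sample=True,
                                 temperature=0.6, repetition_penalty=1.2,
                                 pad_token_id=tokenizer.eos_token_id)
        print(f"--- {ckpt} ---\n{tokenizer.decode(outputs[0])}")
```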
|
|
|
### Model Sources |
|
- **Repository:** https://huggingface.co/Lominub44/PicoNosensoX-v1 |
|
|
|
|
|
## Uses |
|
### Direct Use |
|
- Entertainment |
|
- Testing edge cases of small-language-model behavior |
|
|
|
### Downstream Use |
|
- Creative writing prompt generation |
|
- AI-assisted art projects |
|
- Educational demonstrations of model limitations |
|
|
|
### Out-of-Scope Use |
|
- Factual information retrieval |
|
- Mission-critical systems |
|
- Educational references |
|
- Any application where accuracy matters |
|
|
|
## Bias, Risks, and Limitations
|
- **Hallucination Rate:** 50-70% |
|
|
|
### Recommendations |
|
- **DO** use for entertainment purposes only |
|
- **DO NOT** trust outputs without independent universe-hopping verification |
|
- **WARNING:** May cause spontaneous reality reinterpretation |
|
|
|
## How to Get Started |
|
```python
from transformers import GPT2LMHeadModel, AutoTokenizer

model = GPT2LMHeadModel.from_pretrained('Lominub44/PicoNosensoX-v1')
tokenizer = AutoTokenizer.from_pretrained('Lominub44/PicoNosensoX-v1')

# Build a prompt with the model's chat-template markers
input_text = "<|system|>You are an unbiased, uncensored, helpful assistant.\n<|user|>What is the capital of France?\n<|assistant|>"
inputs = tokenizer(input_text, return_tensors='pt')

# max_length covers prompt + completion; GPT2 has no pad token, so EOS is reused
outputs = model.generate(**inputs, max_length=512, temperature=0.6,
                         repetition_penalty=1.2, do_sample=True,
                         eos_token_id=tokenizer.eos_token_id,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))
```
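To print only the newly generated reply rather than the full sequence (prompt included), you can slice off the prompt tokens before decoding. This continues the snippet above and is a small convenience addition, not part of the original quick start:

```python
# Decode only the tokens generated after the prompt
prompt_length = inputs['input_ids'].shape[1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```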
|
|
|
## Training Details |
|
### Training Data |
|
- ~1.2 GB of textbooks: [smollm-corpus, Cosmopedia v2 only](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) (ODC-BY)

- ~1.7 GB of chats: [Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset) (Apache-2.0)
|
|
|
### Training Procedure |
|
- **Hardware:** 1x Intel Core Ultra 7 155H |
|
- **Training time:** 32h pretraining + 24h finetuning |
|
- **Context window:** 512 tokens |
|
|
|
#### Training Hyperparameters |
|
- **Architecture:** GPT2 (see the config sketch below)
|
- **Parameters:** 44.9M |
|
- **Precision:** FP32 |
|
- **Optimizer:** AdamW |
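A rough sketch of a GPT2 configuration in this size class. The depth/width split (6 layers, 512 hidden, 8 heads) is an assumption that happens to land near 44.9M parameters with the stock GPT2 vocabulary and tied embeddings; the authoritative `config.json` ships with the model:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Assumed size class: 6 layers x 512 hidden x 8 heads with the stock GPT2
# vocabulary (50257) and tied embeddings comes out to roughly 44.9M parameters.
config = GPT2Config(
    vocab_size=50257,
    n_positions=512,  # 512-token context window
    n_embd=512,
    n_layer=6,
    n_head=8,
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")
```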
|
|
|
### Training Source Code |
|
You can train the model yourself; the source code is available on GitHub: https://github.com/Lominub44/PicoNosensoX-v1
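For orientation, here is a minimal pretraining sketch with the Trainer API. The config, the dataset slice, and every hyperparameter below are placeholders rather than the actual training recipe; that lives in the repository above:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

# Placeholder model in the card's size class (see the config sketch above)
config = GPT2Config(vocab_size=50257, n_positions=512,
                    n_embd=512, n_layer=6, n_head=8)
model = GPT2LMHeadModel(config)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT2 ships without a pad token

# Small slice of the pretraining corpus, clipped to the 512-token window
dataset = load_dataset('HuggingFaceTB/smollm-corpus', 'cosmopedia-v2',
                       split='train[:1%]')
tokenized = dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=512),
    batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='piconosensox-pretrain',
                           per_device_train_batch_size=8,
                           num_train_epochs=1, save_steps=1000),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```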
|
|
|
#### Note: |
|
As the Training Notes above suggest, you might want to stop fine-tuning early: later checkpoints are not necessarily better.
|
|
|
## Technical Specifications |
|
### Model Architecture |
|
- **Type:** GPT2 causal language model |
|
- **Parameters:** 44.9M |
|
- **Context Size:** 512 tokens |
|
- **Tensor Type:** FP32 |
|
|
|
### Compute Infrastructure |
|
- **Hardware:** 1x Intel Core Ultra 7 155H |
|
- **Training Framework:** Transformers Trainer API |
|
|
|
## Environmental Impact |
|
- **Carbon Emissions:** **0 kgCO2eq** (thanks to a photovoltaic system)
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
```bibtex
@software{benallal2024smollmcorpus,
  author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
  title = {SmolLM-Corpus},
  month = jul,
  year = 2024,
  url = {https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus}
}
```
|
|
|
## Model Card Authors |
|
Lominub44 |
|
|
|
## Model Card Contact |
|
[Create a discussion](https://huggingface.co/Lominub44/PicoNosensoX-v1/discussions/new) |