Rishi Kora committed
Commit e982d53 · verified · 1 Parent(s): 084e3a9

Update README.md

Files changed (1)
  1. README.md +52 -141
README.md CHANGED
@@ -1,167 +1,78 @@
  ---
  library_name: transformers
  tags:
- - text-generation
- - conversational
- - instruction-tuned
- - 4-bit precision
- - bitsandbytes
- license: apache-2.0
- language:
- - en
- base_model:
- - google/gemma-2-2b-it
  ---
 
- # rishi-2-2B-IT

  **Model ID:** `korarishi1027/rishi-2-2b-it`

- rishi-2-2B-IT is a 4-bit quantized, instruction-tuned variant of Google’s Gemma-2 2B decoder-only language model, optimized for efficient chat and general text generation in English.
-
- ## Model Details
-
- ### Model Description
-
- Gemma is a family of lightweight, state-of-the-art open models from Google, built on the same technology as the Gemini series. Rishi-2-2B-IT has **2.61B parameters**, quantized to **4-bit NF4** (with double quantization), and uses **bfloat16** for on-the-fly compute to reduce its GPU footprint.
-
- - **Developed by:** Google Research
- - **Shared by:** korarishi1027
- - **Finetuned from:** `google/gemma-2-2b-it`
- - **Model type:** Causal language model (decoder-only)
- - **Language(s):** English
- - **License:** Apache-2.0
-
- ### Quantization & Memory
 
 
  ```python
- import torch
- from transformers import BitsAndBytesConfig
-
- quant_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_use_double_quant=True,
-     bnb_4bit_compute_dtype=torch.bfloat16,
-     bnb_4bit_quant_type="nf4"
  )
  ```
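To see what the quantized weights actually cost in memory, a minimal sketch is shown below; it loads the checkpoint with the config above and queries the footprint reported by transformers (a CUDA GPU and an installed bitsandbytes are assumed):

```python
# Minimal sketch (assumes a CUDA GPU and bitsandbytes installed): load with the
# 4-bit NF4 config above and report the memory the quantized weights occupy.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "korarishi1027/rishi-2-2b-it",
    quantization_config=quant_config,
    device_map="auto",
)
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
```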
 
- ## Intended Uses
-
- ### Direct Use
- - Chatbots and conversational agents
- - Story, email, or code snippet generation
- - Summarization, Q&A, and instruction following
-
- ### Downstream Use
- - Fine-tuning for domain-specific tasks (e.g. legal, medical, technical summarization)
- - Integration into larger NLP pipelines or applications
-
- ### Out-of-Scope / Misuse
- - High-stakes domains (medical, legal) without human review
- - Real-time decision systems
- - Any use requiring perfect factual accuracy
-
- ---
-
- ## Bias, Risks & Limitations
- - Inherits biases from its pre-training and instruction-tuning data
- - Quantization may introduce minor artifacts or rare decoding glitches
- - Not guaranteed to be up-to-date on world events or specialized knowledge
-
- ## Recommendations
- - Always validate critical outputs with human oversight
- - Use guardrails or filters if exposing the model to untrusted inputs
 
- ## How to Get Started

  ```python
- import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
- from transformers import BitsAndBytesConfig
-
- quant_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_use_double_quant=True,
-     bnb_4bit_compute_dtype=torch.bfloat16,
-     bnb_4bit_quant_type="nf4"
- )

  tokenizer = AutoTokenizer.from_pretrained("korarishi1027/rishi-2-2b-it")
  model = AutoModelForCausalLM.from_pretrained(
      "korarishi1027/rishi-2-2b-it",
-     quantization_config=quant_config,
-     device_map="auto"
  )

- prompt = "Translate to Shakespearean English: Hello, friend!"
- inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
- output = model.generate(**inputs, max_new_tokens=60)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
- ```
-
- ## Training Details
-
- ### Training Data
- - **Pre-training:** Large-scale English web text corpora used by Google Gemma
- - **Instruction tuning:** Public instruction-following datasets (e.g., OpenAI’s InstructGPT mixtures)
-
- ### Preprocessing
- - Tokenized with SentencePiece
- - Truncated to 2,048 tokens (see the sketch below)
- - Removed duplicates and low-quality examples
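A minimal sketch of the truncation step noted above, using the model's tokenizer (the example text is illustrative; 2,048 is the limit stated in the list):

```python
# Minimal sketch: tokenize one example and truncate it to the 2,048-token limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("korarishi1027/rishi-2-2b-it")
encoded = tokenizer(
    "An example training document ...",  # illustrative placeholder text
    truncation=True,
    max_length=2048,
)
print(len(encoded["input_ids"]))  # never exceeds 2048
```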
 
- ### Hyperparameters
- - **Precision:** bf16 mixed
- - **Batch size:** 16
- - **Learning rate:** 2e-5
- - **Training hardware:** 8 × A100 GPUs for ~4 hours
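Read as a `TrainingArguments` configuration, the hyperparameters above map roughly onto the sketch below; it is illustrative only, since the actual training script is not published, and the output directory, per-device batch split, and epoch count are assumptions:

```python
# Illustrative sketch only: the listed hyperparameters expressed via
# transformers.TrainingArguments. Values marked "assumption" are not from the card.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="rishi-2-2b-it-finetune",  # assumption: any local path works
    bf16=True,                            # "bf16 mixed" precision
    per_device_train_batch_size=2,        # assumption: 2 per GPU x 8 GPUs = global batch 16
    learning_rate=2e-5,
    num_train_epochs=1,                   # assumption: epoch count is not stated
)
```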
 
- ---
-
- ## Evaluation
-
- ### Test Data & Metrics
- - **Datasets:** SuperGLUE, Anthropic HH-RLHF style instruction set
- - **Metrics:** Perplexity, BLEU
-
- ### Results
- - **Perplexity:** 10.5 on held-out validation
- - **BLEU:** 23.7 average
-
- **Summary:** Performance matches the full-precision base; quantization adds less than one perplexity point.
-
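Perplexity figures of the kind reported above can be reproduced from the average token-level cross-entropy; the sketch below shows one way to do it (the held-out texts, the device handling, and the simple unweighted average over documents are assumptions):

```python
# Minimal sketch: perplexity = exp(mean cross-entropy loss) over held-out texts.
# Assumes `model` and `tokenizer` are loaded as elsewhere in this card.
import math
import torch

def perplexity(model, tokenizer, texts, max_length=2048):
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_length).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    # Unweighted average over documents; a token-weighted average is more precise.
    return math.exp(sum(losses) / len(losses))
```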
- ---
-
- ## Environmental Impact
-
- Estimated via the [ML CO₂ Impact Calculator](https://mlco2.github.io/impact#compute):
-
- - **Hardware:** 8 × NVIDIA A100
- - **Provider:** Google Cloud (us-central1)
- - **Training time:** ~4 hours
- - **Emissions:** ~150 kg CO₂ eq
-
- ---
-
- ## Technical Specifications
-
- - **Architecture:** 24-layer, 2.61B-parameter decoder-only Transformer
-   - Hidden size: 2,048
-   - Attention heads: 16
- - **Software:**
-   - transformers ≥ 4.x
-   - bitsandbytes ≥ 0.39
-   - torch ≥ 2.x
- - **Inference HW:** NVIDIA V100/A100
-
- ---
-
- ## Citation
-
- ```bibtex
- @misc{rishi-2-2b-it,
-   title        = {rishi-2-2B-IT: A 4-bit Quantized Instruction-Tuned Variant of Gemma-2},
-   author       = {Google Research and korarishi1027},
-   year         = {2024},
-   howpublished = {\url{https://huggingface.co/korarishi1027/rishi-2-2b-it}}
- }
- ```
 
  ---
  library_name: transformers
  tags:
+ - text-generation
+ - conversational
+ - instruction-tuned
+ - 4-bit precision
+ - bitsandbytes
  ---

+ # Rishi-2-2B-IT

  **Model ID:** `korarishi1027/rishi-2-2b-it`

+ ## Model Information
+ Summary description and brief definition of inputs and outputs. The model takes a text string as input (for example a question, a prompt, or a document to summarize) and generates English text in response.

+ ## Description
+ Rishi-2-2B-IT is a text-to-text, decoder-only large language model, available in English, with open weights for both pre-trained and instruction-tuned variants. It is suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Its compact size allows deployment in limited-resource environments such as laptops, desktops, or private cloud infrastructure, democratizing access to state-of-the-art AI models.
 
+ ## Running with the pipeline API
  ```python
+ import torch
+ from transformers import pipeline

+ pipe = pipeline(
+     "text-generation",
+     model="korarishi1027/rishi-2-2b-it",
+     model_kwargs={"torch_dtype": torch.bfloat16},
+     device="cuda",  # replace with "mps" to run on a Mac device
  )

+ messages = [
+     {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
+ ]

+ outputs = pipe(messages, max_new_tokens=256)
+ assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
+ print(assistant_response)
+ ```
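When the pipeline is given a list of chat messages, `generated_text` contains the whole conversation, so the snippet reads the last entry to pull out only the assistant's reply.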
 
+ ## Running on single / multi GPU
+ ```bash
+ # pip install accelerate
+ ```
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch

  tokenizer = AutoTokenizer.from_pretrained("korarishi1027/rishi-2-2b-it")
  model = AutoModelForCausalLM.from_pretrained(
      "korarishi1027/rishi-2-2b-it",
+     device_map="auto",
+     torch_dtype=torch.bfloat16,
  )

+ input_text = "Write me a poem about Machine Learning."
+ input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

+ outputs = model.generate(**input_ids, max_new_tokens=32)
+ print(tokenizer.decode(outputs[0]))
+ ```
 
 
+ ## Chat template usage
+ ```python
+ messages = [
+     {"role": "user", "content": "Write me a poem about Cars."},
+ ]
+ input_ids = tokenizer.apply_chat_template(
+     messages, return_tensors="pt", return_dict=True
+ ).to("cuda")
+
+ outputs = model.generate(**input_ids, max_new_tokens=256)
+ print(tokenizer.decode(outputs[0]))
+ ```
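The tags above advertise 4-bit precision via bitsandbytes. A minimal sketch for loading the checkpoint quantized is shown below; the settings mirror the NF4 configuration used elsewhere in this card and may need adjusting for your hardware:

```python
# Minimal sketch: 4-bit NF4 loading with bitsandbytes (requires `pip install bitsandbytes`).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("korarishi1027/rishi-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "korarishi1027/rishi-2-2b-it",
    quantization_config=quant_config,
    device_map="auto",
)
```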
+
+ ## Developed by
+ [korarishi1027](https://huggingface.co/korarishi1027)