|
--- |
|
tags: |
|
- mistral |
|
- lora |
|
- peft |
|
- transformers |
|
- scientific-ml |
|
- fine-tuned |
|
- research-assistant |
|
- hypothesis-generation |
|
- scientific-writing |
|
- scientific-reasoning |
|
license: apache-2.0 |
|
library_name: peft |
|
datasets: |
|
- Allanatrix/Scientific_Research_Tokenized |
|
pipeline_tag: text-generation |
|
language: |
|
- en |
|
model-index: |
|
- name: Nexa Mistral 7B Sci |
|
results: |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
type: allen/nexa-scientific-tokens |
|
name: Nexa Scientific Tokens |
|
metrics: |
|
- name: BLEU |
|
type: bleu |
|
value: 10 |
|
- name: Entropy Novelty |
|
type: entropy |
|
value: 6 |
|
- name: Internal Consistency |
|
type: custom |
|
value: 9 |
|
base_model: |
|
- mistralai/Mistral-7B-v0.1 |
|
metrics: |
|
- bleu |
|
--- |
|
|
|
|
|
# Model Card for `nexa-mistral-7b-psi` |
|
|
|
## Model Details |
|
|
|
**Model Description**: |
|
`nexa-mistral-7b-psi` is a fine-tuned variant of the open-weight `Mistral-7B-v0.1` model, optimized for scientific research generation tasks such as hypothesis generation, abstract writing, and methodology completion. Fine-tuning was performed with the PEFT (Parameter-Efficient Fine-Tuning) library using LoRA adapters, with the base model quantized to 4-bit via the `bitsandbytes` backend.
|
|
|
This model is part of the **Nexa Scientific Intelligence (Psi)** series, developed for scalable, automated scientific reasoning and domain-specific text generation. |
|
|
|
--- |
|
|
|
**Developed by**: Allan (Independent Scientific Intelligence Architect) |
|
**Funded by**: Self-funded |
|
**Shared by**: Allan (https://huggingface.co/allan-wandeer) |
|
**Model type**: Decoder-only transformer (causal language model) |
|
**Language(s)**: English (scientific domain-specific vocabulary) |
|
**License**: Apache 2.0 (inherits from base model) |
|
**Fine-tuned from**: `mistralai/Mistral-7B-v0.1` |
|
**Repository**: https://huggingface.co/allan-wandeer/nexa-mistral-7b-psi |
|
**Demo**: Coming soon via Hugging Face Spaces or a Lambda inference endpoint.
|
|
|
--- |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- Scientific hypothesis generation |
|
- Abstract and method section synthesis |
|
- Domain-specific research writing |
|
- Semantic completion of structured research prompts |
|
|
|
### Downstream Use |
|
- Fine-tuning or distillation into smaller expert models |
|
- Foundation for test-time reasoning agents |
|
- Seed model for bootstrapping larger synthetic scientific corpora (see the generation sketch below)
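
For the corpus-bootstrapping use above, a minimal generation-loop sketch follows. The prompt list, sampling settings, and output path are illustrative assumptions, not the pipeline used to build the Nexa datasets:

```python
# Sketch: batch-generate hypotheses to seed a synthetic corpus.
# Prompts, sampling settings, and the output file are illustrative assumptions.
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "allan-wandia/nexa-mistral-7b-sci"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

prompts = [
    "Generate a novel hypothesis in quantum materials research:",
    "Propose a methodology for measuring protein folding kinetics:",
]

with open("synthetic_corpus.jsonl", "w") as f:
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=250, do_sample=True, temperature=0.8, top_p=0.95
        )
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        f.write(json.dumps({"prompt": prompt, "completion": text}) + "\n")
```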
|
|
|
### Out-of-Scope Use |
|
- General conversation or chat use cases |
|
- Non-English scientific domains |
|
- Legal, financial, or clinical advice generation |
|
|
|
--- |
|
|
|
## Bias, Risks, and Limitations |
|
While the model performs well on structured scientific input, it inherits biases from its base model (`Mistral-7B-v0.1`) and from the fine-tuning dataset. Outputs should be reviewed by domain experts before use in high-stakes settings, as the model may hallucinate plausible but incorrect claims, especially in areas with sparse training data.
|
|
|
--- |
|
|
|
## Recommendations |
|
Users should: |
|
- Validate critical outputs against trusted scientific literature |
|
- Avoid deploying in clinical or regulatory environments without further evaluation |
|
- Consider additional domain fine-tuning for niche fields |
|
|
|
--- |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "allan-wandia/nexa-mistral-7b-sci"

# Load the tokenizer and model; device_map="auto" places weights on available GPUs
# (requires `accelerate`), and torch_dtype="auto" uses the checkpoint's native dtype.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Prompt the model for a structured scientific completion.
prompt = "Generate a novel hypothesis in quantum materials research:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=250)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
``` |
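
The `library_name: peft` tag above suggests this repository ships LoRA adapter weights. If you want to control the components explicitly, the base model can be loaded in 4-bit and the adapter attached with PEFT. A sketch, with NF4 settings assumed (the card only states "4-bit via `bitsandbytes`"):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
adapter_id = "allan-wandia/nexa-mistral-7b-sci"

# Load the base model in 4-bit; NF4 and fp16 compute dtype are assumptions.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)

# Attach the LoRA adapter on top of the quantized base.
model = PeftModel.from_pretrained(base_model, adapter_id)
```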
|
--- |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
* **Size**: 100 million tokens sampled from a 500M+ token corpus |
|
* **Source**: Curated scientific literature, abstracts, methodologies, and domain-labeled corpora (Bio, Physics, QST, Astro) |
|
* **Labeling**: Token-level labels auto-generated via `Nexa DataVault` tokenizer infrastructure |
|
|
|
### Preprocessing |
|
|
|
* Tokenization with sequence truncation to 1024 tokens |
|
* Labeling and batching performed on CPU; GPU forward passes dispatched asynchronously
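
A minimal sketch of the tokenization step described above. Only the 1024-token truncation is taken from this card; the split name and the `text` column are assumptions about the corpus schema:

```python
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def tokenize(batch):
    # Truncate every example to the 1024-token training context.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# Split name and "text" column are assumptions, not documented in this card.
dataset = load_dataset("Allanatrix/Scientific_Research_Tokenized", split="train")
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```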
|
|
|
### Training Hyperparameters |
|
|
|
- **Base model**: `mistralai/Mistral-7B-v0.1` |
|
- **Sequence length**: `1024` |
|
- **Batch size**: `1` (with gradient accumulation) |
|
- **Gradient Accumulation Steps**: `64` |
|
- **Effective Batch Size**: `64` |
|
- **Learning rate**: `2e-5` |
|
- **Epochs**: `2` |
|
- **LoRA**: Enabled (PEFT) |
|
- **Quantization**: 4-bit via `bitsandbytes` |
|
- **Optimizer**: 8-bit AdamW |
|
- **Framework**: Transformers + PEFT + Accelerate |
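
The hyperparameters above map onto a training setup along the following lines. This is a reconstruction for illustration: the LoRA rank, alpha, dropout, target modules, and logging settings are assumptions not stated in this card, and `tokenized` refers to the dataset produced in the preprocessing sketch above.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "mistralai/Mistral-7B-v0.1"

# Base model in 4-bit; NF4 settings are assumed (the card only states "4-bit via bitsandbytes").
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA settings (r, alpha, dropout, target modules) are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                    "gate_proj", "up_proj", "down_proj"],     # feed-forward layers
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default

# Values below are taken from the hyperparameter list above.
args = TrainingArguments(
    output_dir="nexa-mistral-7b-sci",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,   # effective batch size of 64
    learning_rate=2e-5,
    num_train_epochs=2,
    optim="adamw_bnb_8bit",           # 8-bit AdamW
    fp16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,  # tokenized dataset from the preprocessing sketch
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```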
|
|
|
--- |
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
* Synthetic scientific prompts across domains (Physics, Biology, Materials Science) |
|
|
|
### Evaluation Factors |
|
|
|
* Semantic coherence (BLEU) |
|
* Hypothesis novelty (entropy score) |
|
* Internal scientific consistency (domain-specific rubric) |
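
The exact rubric and scripts behind the scores below are not published. As a rough illustration, BLEU can be computed with `sacrebleu`, and token-level Shannon entropy can serve as a simple novelty proxy; the example strings are placeholders:

```python
import math
from collections import Counter
import sacrebleu

def bleu_score(generated: str, reference: str) -> float:
    # Sentence-level BLEU against a reference text (coherence proxy).
    return sacrebleu.sentence_bleu(generated, [reference]).score

def token_entropy(text: str) -> float:
    # Shannon entropy over the token frequency distribution (novelty proxy).
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

hypothesis = "We hypothesize that strain-tuned moire lattices host correlated phases."
reference = "Strain engineering in moire superlattices may stabilize correlated electronic phases."
print(bleu_score(hypothesis, reference), token_entropy(hypothesis))
```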
|
|
|
### Metrics |
|
|
|
| Metric                        | Score |
| ----------------------------- | ----- |
| BLEU (coherence)              | 10/10 |
| Entropy novelty               | 6/10  |
| Scientific consistency        | 9/10  |
| Model similarity coefficient  | 87%   |
|
|
|
### Results |
|
|
|
The model performs robustly on hypothesis generation and scientific prose tasks. Base coherence is high, while novelty depends on prompt diversity. It is well suited as a teacher for distillation or as an inference agent for generating synthetic scientific corpora.
|
|
|
--- |
|
|
|
## Environmental Impact |
|
|
|
| Component      | Value                                 |
| -------------- | ------------------------------------- |
| Hardware Type  | 2× NVIDIA T4 GPUs                     |
| Hours used     | ~7.5                                  |
| Cloud Provider | Kaggle (Google Cloud)                 |
| Compute Region | US                                    |
| Carbon Emitted | Estimate pending (likely < 1 kg CO₂)  |
|
|
|
--- |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture |
|
|
|
* Transformer decoder (Mistral-7B architecture) |
|
* LoRA adapters applied to attention and FFN layers |
|
* Quantized with `bitsandbytes` to 4-bit for memory efficiency |
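
For deployment without the PEFT runtime, the adapters described above can be merged back into the base weights. A sketch, assuming the adapter repo ID used elsewhere in this card; note that merging requires loading the base in full or half precision rather than 4-bit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
adapter_id = "allan-wandia/nexa-mistral-7b-sci"

# Load the base in fp16 (merging is not supported on 4-bit quantized weights).
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save standalone weights (written as safetensors by default).
merged.save_pretrained("nexa-mistral-7b-sci-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("nexa-mistral-7b-sci-merged")
```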
|
|
|
### Compute Infrastructure |
|
|
|
* CPU: Intel i5 8th Gen vPro (batch preprocessing) |
|
* GPU: 2× NVIDIA T4 (CUDA 12.1) |
|
|
|
### Software Stack |
|
|
|
* PEFT 0.12.0 |
|
* Transformers 4.41.1 |
|
* Accelerate |
|
* TRL |
|
* Torch 2.x |
|
|
|
--- |
|
|
|
## Citation |
|
|
|
**BibTeX**: |
|
|
|
```bibtex |
|
@misc{nexa-mistral-7b-sci, |
|
title = {Nexa Mistral 7B Sci}, |
|
author = {Allan Wandia}, |
|
year = {2025}, |
|
howpublished = {\url{https://huggingface.co/allan-Wandia/nexa-mistral-7b-sci}}, |
|
note = {Fine-tuned model for scientific generation tasks} |
|
} |
|
``` |
|
--- |
|
|
|
## Model Card Contact |
|
|
|
For questions, contact Allan via Hugging Face or at: |
|
📫 Email: [email protected]
|
|
|
--- |
|
|
|
## Model Card Authors |
|
|
|
* Allan Wandia (Independent ML Engineer and Systems Architect) |
|
|
|
--- |
|
|
|
## Glossary |
|
|
|
* **LoRA**: Low-Rank Adaptation |
|
* **PEFT**: Parameter-Efficient Fine-Tuning |
|
* **BLEU**: Bilingual Evaluation Understudy Score |
|
* **Entropy Score**: Metric used to estimate novelty/variation |
|
* **Safetensors**: Secure, fast format for storing model weights
|
|
|
## Links |
|
**GitHub repo and notebook**: https://github.com/DarkStarStrix/Nexa_Auto
|
|
|
--- |