|
--- |
|
tags: |
|
- mistral |
|
- lora |
|
- peft |
|
- transformers |
|
- scientific-ml |
|
- fine-tuned |
|
- research-assistant |
|
- hypothesis-generation |
|
- scientific-writing |
|
- scientific-reasoning |
|
license: apache-2.0 |
|
library_name: peft |
|
datasets: |
|
- Allanatrix/Scientific_Research_Tokenized |
|
pipeline_tag: text-generation |
|
language: |
|
- en |
|
model-index: |
|
- name: Nexa Mistral 7B Sci |
|
results: |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
type: allen/nexa-scientific-tokens |
|
name: Nexa Scientific Tokens |
|
metrics: |
|
- name: BLEU |
|
type: bleu |
|
value: 10 |
|
- name: Entropy Novelty |
|
type: entropy |
|
value: 6 |
|
- name: Internal Consistency |
|
type: custom |
|
value: 9 |
|
base_model: |
|
- mistralai/Mistral-7B-v0.1 |
|
metrics: |
|
- bleu |
|
--- |
|
|
|
|
|
# Model Card for `nexa-mistral-7b-psi` |
|
|
|
## Model Details |
|
|
|
**Model Description**: |
|
`nexa-mistral-7b-psi` is a fine-tuned variant of the open-weight `Mistral-7B-v0.1` model, optimized for scientific research generation tasks such as hypothesis generation, abstract writing, and methodology completion. Fine-tuning was performed with the PEFT (Parameter-Efficient Fine-Tuning) library using LoRA adapters, with the base model quantized to 4-bit via the `bitsandbytes` backend.
|
|
|
This model is part of the **Nexa Scientific Intelligence (Psi)** series, developed for scalable, automated scientific reasoning and domain-specific text generation. |
|
|
|
--- |
|
|
|
**Developed by**: Allan (Independent Scientific Intelligence Architect) |
|
**Funded by**: Self-funded |
|
**Shared by**: Allan (https://huggingface.co/allan-wandeer) |
|
**Model type**: Decoder-only transformer (causal language model) |
|
**Language(s)**: English (scientific domain-specific vocabulary) |
|
**License**: Apache 2.0 (inherits from base model) |
|
**Fine-tuned from**: `mistralai/Mistral-7B-v0.1` |
|
**Repository**: https://huggingface.co/allan-wandeer/nexa-mistral-7b-psi |
|
**Demo**: Coming soon via Hugging Face Spaces or a Lambda inference endpoint.
|
|
|
--- |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
- Scientific hypothesis generation |
|
- Abstract and method section synthesis |
|
- Domain-specific research writing |
|
- Semantic completion of structured research prompts |
|
|
|
### Downstream Use |
|
- Fine-tuning or distillation into smaller expert models |
|
- Foundation for test-time reasoning agents |
|
- Seed model for bootstrapping larger synthetic scientific corpora (see the generation sketch below)
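
For the corpus-bootstrapping use above, a minimal generation-loop sketch follows. The prompt list, sampling settings, and output path are illustrative assumptions, not the pipeline used to build the Nexa datasets:

```python
# Sketch: batch-generate hypotheses to seed a synthetic corpus.
# Prompts, sampling settings, and the output file are illustrative assumptions.
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "allan-wandia/nexa-mistral-7b-sci"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

prompts = [
    "Generate a novel hypothesis in quantum materials research:",
    "Propose a methodology for measuring protein folding kinetics:",
]

with open("synthetic_corpus.jsonl", "w") as f:
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs, max_new_tokens=250, do_sample=True, temperature=0.8, top_p=0.95
        )
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        f.write(json.dumps({"prompt": prompt, "completion": text}) + "\n")
```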
|
|
|
### Out-of-Scope Use |
|
- General conversation or chat use cases |
|
- Non-English scientific domains |
|
- Legal, financial, or clinical advice generation |
|
|
|
--- |
|
|
|
## Bias, Risks, and Limitations |
|
While the model performs well on structured scientific input, it inherits biases from its base model (`Mistral-7B-v0.1`) and from the fine-tuning dataset. Outputs should be reviewed by domain experts before use in high-stakes settings, as the model may hallucinate plausible but incorrect claims, especially in areas with sparse training data.
|
|
|
--- |
|
|
|
## Recommendations |
|
Users should: |
|
- Validate critical outputs against trusted scientific literature |
|
- Avoid deploying in clinical or regulatory environments without further evaluation |
|
- Consider additional domain fine-tuning for niche fields |
|
|
|
--- |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "allan-wandia/nexa-mistral-7b-sci"

# Load the tokenizer and model; device_map="auto" places weights on available GPUs
# (requires `accelerate`), and torch_dtype="auto" uses the checkpoint's native dtype.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Prompt the model for a structured scientific completion.
prompt = "Generate a novel hypothesis in quantum materials research:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=250)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
|
``` |
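
The `library_name: peft` tag above suggests this repository ships LoRA adapter weights. If you want to control the components explicitly, the base model can be loaded in 4-bit and the adapter attached with PEFT. A sketch, with NF4 settings assumed (the card only states "4-bit via `bitsandbytes`"):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
adapter_id = "allan-wandia/nexa-mistral-7b-sci"

# Load the base model in 4-bit; NF4 and fp16 compute dtype are assumptions.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)

# Attach the LoRA adapter on top of the quantized base.
model = PeftModel.from_pretrained(base_model, adapter_id)
```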
|
--- |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
* **Size**: 100 million tokens sampled from a 500M+ token corpus |
|
* **Source**: Curated scientific literature, abstracts, methodologies, and domain-labeled corpora (Bio, Physics, QST, Astro) |
|
* **Labeling**: Token-level labels auto-generated via `Nexa DataVault` tokenizer infrastructure |
|
|
|
### Preprocessing |
|
|
|
* Tokenization with sequence truncation to 1024 tokens |
|
* Labeling and batching performed on CPU; GPU forward passes dispatched asynchronously
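
A minimal sketch of the tokenization step described above. Only the 1024-token truncation is taken from this card; the split name and the `text` column are assumptions about the corpus schema:

```python
from transformers import AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def tokenize(batch):
    # Truncate every example to the 1024-token training context.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# Split name and "text" column are assumptions, not documented in this card.
dataset = load_dataset("Allanatrix/Scientific_Research_Tokenized", split="train")
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```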
|
|
|
### Training Hyperparameters |
|
|
|
- **Base model**: `mistralai/Mistral-7B-v0.1` |
|
- **Sequence length**: `1024` |
|
- **Batch size**: `1` (with gradient accumulation) |
|
- **Gradient Accumulation Steps**: `64` |
|
- **Effective Batch Size**: `64` |
|
- **Learning rate**: `2e-5` |
|
- **Epochs**: `2` |
|
- **LoRA**: Enabled (PEFT) |
|
- **Quantization**: 4-bit via `bitsandbytes` |
|
- **Optimizer**: 8-bit AdamW |
|
- **Framework**: Transformers + PEFT + Accelerate |
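
The hyperparameters above map onto a training setup along the following lines. This is a reconstruction for illustration: the LoRA rank, alpha, dropout, target modules, and logging settings are assumptions not stated in this card, and `tokenized` refers to the dataset produced in the preprocessing sketch above.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "mistralai/Mistral-7B-v0.1"

# Base model in 4-bit; NF4 settings are assumed (the card only states "4-bit via bitsandbytes").
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb_config, device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA settings (r, alpha, dropout, target modules) are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                    "gate_proj", "up_proj", "down_proj"],     # feed-forward layers
)
model = get_peft_model(model, lora_config)

tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token by default

# Values below are taken from the hyperparameter list above.
args = TrainingArguments(
    output_dir="nexa-mistral-7b-sci",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,   # effective batch size of 64
    learning_rate=2e-5,
    num_train_epochs=2,
    optim="adamw_bnb_8bit",           # 8-bit AdamW
    fp16=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,  # tokenized dataset from the preprocessing sketch
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```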
|
|
|
--- |
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
* Synthetic scientific prompts across domains (Physics, Biology, Materials Science) |
|
|
|
### Evaluation Factors |
|
|
|
* Semantic coherence (BLEU) |
|
* Hypothesis novelty (entropy score) |
|
* Internal scientific consistency (domain-specific rubric) |
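
The exact rubric and scripts behind the scores below are not published. As a rough illustration, BLEU can be computed with `sacrebleu`, and token-level Shannon entropy can serve as a simple novelty proxy; the example strings are placeholders:

```python
import math
from collections import Counter
import sacrebleu

def bleu_score(generated: str, reference: str) -> float:
    # Sentence-level BLEU against a reference text (coherence proxy).
    return sacrebleu.sentence_bleu(generated, [reference]).score

def token_entropy(text: str) -> float:
    # Shannon entropy over the token frequency distribution (novelty proxy).
    tokens = text.split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

hypothesis = "We hypothesize that strain-tuned moire lattices host correlated phases."
reference = "Strain engineering in moire superlattices may stabilize correlated electronic phases."
print(bleu_score(hypothesis, reference), token_entropy(hypothesis))
```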
|
|
|
### Metrics |
|
|
|
| Metric                        | Score |
| ----------------------------- | ----- |
| BLEU (coherence)              | 10/10 |
| Entropy novelty               | 6/10  |
| Scientific consistency        | 9/10  |
| Model similarity coefficient  | 87%   |
|
|
|
### Results |
|
|
|
The model performs robustly on hypothesis generation and scientific prose tasks. Base coherence is high, while novelty depends on prompt diversity. It is well suited as a teacher for distillation or as an inference agent for generating synthetic scientific corpora.
|
|
|
--- |
|
|
|
## Environmental Impact |
|
|
|
| Component      | Value                                 |
| -------------- | ------------------------------------- |
| Hardware Type  | 2× NVIDIA T4 GPUs                     |
| Hours used     | ~7.5                                  |
| Cloud Provider | Kaggle (Google Cloud)                 |
| Compute Region | US                                    |
| Carbon Emitted | Estimate pending (likely < 1 kg CO₂)  |
|
|
|
--- |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture |
|
|
|
* Transformer decoder (Mistral-7B architecture) |
|
* LoRA adapters applied to attention and FFN layers |
|
* Quantized with `bitsandbytes` to 4-bit for memory efficiency |
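
For deployment without the PEFT runtime, the adapters described above can be merged back into the base weights. A sketch, assuming the adapter repo ID used elsewhere in this card; note that merging requires loading the base in full or half precision rather than 4-bit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"
adapter_id = "allan-wandia/nexa-mistral-7b-sci"

# Load the base in fp16 (merging is not supported on 4-bit quantized weights).
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()

# Save standalone weights (written as safetensors by default).
merged.save_pretrained("nexa-mistral-7b-sci-merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("nexa-mistral-7b-sci-merged")
```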
|
|
|
### Compute Infrastructure |
|
|
|
* CPU: Intel i5 8th Gen vPro (batch preprocessing) |
|
* GPU: 2× NVIDIA T4 (CUDA 12.1) |
|
|
|
### Software Stack |
|
|
|
* PEFT 0.12.0 |
|
* Transformers 4.41.1 |
|
* Accelerate |
|
* TRL |
|
* Torch 2.x |
|
|
|
--- |
|
|
|
## Citation |
|
|
|
**BibTeX**: |
|
|
|
```bibtex |
|
@misc{nexa-mistral-7b-sci, |
|
title = {Nexa Mistral 7B Sci}, |
|
author = {Allan Wandia}, |
|
year = {2025}, |
|
howpublished = {\url{https://huggingface.co/allan-Wandia/nexa-mistral-7b-sci}}, |
|
note = {Fine-tuned model for scientific generation tasks} |
|
} |
|
``` |
|
--- |
|
|
|
## Model Card Contact |
|
|
|
For questions, contact Allan via Hugging Face or at: |
|
📫 Email: [email protected]
|
|
|
--- |
|
|
|
## Model Card Authors |
|
|
|
* Allan Wandia (Independent ML Engineer and Systems Architect) |
|
|
|
--- |
|
|
|
## Glossary |
|
|
|
* **LoRA**: Low-Rank Adaptation |
|
* **PEFT**: Parameter-Efficient Fine-Tuning |
|
* **BLEU**: Bilingual Evaluation Understudy Score |
|
* **Entropy Score**: Metric used to estimate novelty/variation |
|
* **Safetensors**: Secure, fast format for storing model weights
|
|
|
## Links |
|
**GitHub repo and notebook**: https://github.com/DarkStarStrix/Nexa_Auto
|
|
|
--- |