---
tags:
- mistral
- lora
- peft
- transformers
- scientific-ml
- fine-tuned
- research-assistant
- hypothesis-generation
- scientific-writing
- scientific-reasoning
license: apache-2.0
library_name: peft
datasets:
- Allanatrix/Scientific_Research_Tokenized
pipeline_tag: text-generation
language:
- en
model-index:
- name: Nexa Mistral 7B Sci
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: allen/nexa-scientific-tokens
      name: Nexa Scientific Tokens
    metrics:
    - name: BLEU
      type: bleu
      value: 10
    - name: Entropy Novelty
      type: entropy
      value: 6
    - name: Internal Consistency
      type: custom
      value: 9
base_model:
- mistralai/Mistral-7B-v0.1
metrics:
- bleu
---


# Model Card for `nexa-mistral-7b-psi`

## Model Details

**Model Description**:  
`nexa-mistral-7b-psi` is a fine-tuned variant of the open-weight `Mistral-7B-v0.1` model, optimized for scientific research generation tasks such as hypothesis generation, abstract writing, and methodology completion. Fine-tuning was performed with the PEFT (Parameter-Efficient Fine-Tuning) library using LoRA adapters in 4-bit quantized mode via the `bitsandbytes` backend.

This model is part of the **Nexa Scientific Intelligence (Psi)** series, developed for scalable, automated scientific reasoning and domain-specific text generation.

---

**Developed by**: Allan (Independent Scientific Intelligence Architect)  
**Funded by**: Self-funded  
**Shared by**: Allan (https://huggingface.co/allan-wandeer)  
**Model type**: Decoder-only transformer (causal language model)  
**Language(s)**: English (scientific domain-specific vocabulary)  
**License**: Apache 2.0 (inherits from base model)  
**Fine-tuned from**: `mistralai/Mistral-7B-v0.1`  
**Repository**: https://huggingface.co/allan-wandeer/nexa-mistral-7b-psi  
**Demo**: Coming soon via Hugging Face Spaces or a Lambda inference endpoint.

---

## Uses

### Direct Use
- Scientific hypothesis generation
- Abstract and method section synthesis
- Domain-specific research writing
- Semantic completion of structured research prompts

### Downstream Use
- Fine-tuning or distillation into smaller expert models
- Foundation for test-time reasoning agents
- Seed model for bootstrapping larger synthetic scientific corpora

### Out-of-Scope Use
- General conversation or chat use cases
- Non-English scientific domains
- Legal, financial, or clinical advice generation

---

## Bias, Risks, and Limitations
While the model performs well on structured scientific input, it inherits biases from its base model (`Mistral-7B-v0.1`) and its fine-tuning dataset. Outputs should be evaluated by domain experts before use in high-stakes settings. The model may hallucinate plausible but incorrect facts, especially in low-data areas.

---

## Recommendations
Users should:
- Validate critical outputs against trusted scientific literature
- Avoid deploying in clinical or regulatory environments without further evaluation
- Consider additional domain fine-tuning for niche fields

---

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "allan-wandia/nexa-mistral-7b-sci"

# Load the tokenizer and model; device_map="auto" places weights on available GPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Prompt the model with a structured research instruction
prompt = "Generate a novel hypothesis in quantum materials research:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=250)

# Decode and print the generated continuation
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
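
Since this model is distributed through the PEFT library, the repository may contain LoRA adapter weights rather than a fully merged checkpoint. If the direct load above does not resolve, a minimal sketch for attaching the adapter to the base model (assuming standard PEFT APIs and that the repo ID above hosts the adapter) is:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_name = "mistralai/Mistral-7B-v0.1"
adapter_name = "allan-wandia/nexa-mistral-7b-sci"  # assumed adapter repo (same ID as above)

tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name, device_map="auto", torch_dtype="auto")

# Attach the fine-tuned LoRA adapter on top of the base weights
model = PeftModel.from_pretrained(base, adapter_name)
```
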
---

## Training Details

### Training Data

* **Size**: 100 million tokens sampled from a 500M+ token corpus
* **Source**: Curated scientific literature, abstracts, methodologies, and domain-labeled corpora (Bio, Physics, QST, Astro)
* **Labeling**: Token-level labels auto-generated via `Nexa DataVault` tokenizer infrastructure

### Preprocessing

* Tokenization with sequence truncation to 1024 tokens
* Labeled and batched on the CPU; inference dispatched to the GPU asynchronously
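
A minimal sketch of the tokenization step above, assuming the standard Hugging Face tokenizer API; the sample text and labeling logic are placeholders, since the actual `Nexa DataVault` pipeline is not published here:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Placeholder sample; the real corpus is the curated scientific literature described above
texts = ["Hypothesis: interfacial strain modulates superconducting gap symmetry in thin films."]

# Truncate every sequence to the 1024-token context used during fine-tuning
batch = tokenizer(texts, truncation=True, max_length=1024, return_tensors="pt")

# Causal-LM training labels simply mirror the input IDs
batch["labels"] = batch["input_ids"].clone()
```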

### Training Hyperparameters

- **Base model**: `mistralai/Mistral-7B-v0.1`
- **Sequence length**: `1024`
- **Batch size**: `1` (with gradient accumulation)
- **Gradient Accumulation Steps**: `64`
- **Effective Batch Size**: `64`
- **Learning rate**: `2e-5`
- **Epochs**: `2`
- **LoRA**: Enabled (PEFT)
- **Quantization**: 4-bit via `bitsandbytes`
- **Optimizer**: 8-bit AdamW
- **Framework**: Transformers + PEFT + Accelerate
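
A minimal configuration sketch matching the hyperparameters above, assuming standard Transformers/PEFT/bitsandbytes APIs; the LoRA rank, alpha, dropout, and target modules are assumptions, as they are not stated in this card:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "mistralai/Mistral-7B-v0.1"

# 4-bit quantization via bitsandbytes, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumption: NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on attention and FFN projections; rank/alpha/targets are assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters stated in this card: batch size 1, grad accumulation 64,
# lr 2e-5, 2 epochs, 8-bit AdamW
training_args = TrainingArguments(
    output_dir="nexa-mistral-7b-psi",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,       # effective batch size 64
    learning_rate=2e-5,
    num_train_epochs=2,
    optim="adamw_bnb_8bit",               # 8-bit AdamW via bitsandbytes
    fp16=True,
    logging_steps=50,
)
```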

---

## Evaluation

### Testing Data

* Synthetic scientific prompts across domains (Physics, Biology, Materials Science)

### Evaluation Factors

* Semantic coherence (BLEU)
* Hypothesis novelty (entropy score)
* Internal scientific consistency (domain-specific rubric)
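
The novelty score is reported on an internal rubric; one illustrative way to compute a lexical entropy proxy (an assumption for exposition, not the exact metric used) is sketched below.

```python
import math
from collections import Counter

def lexical_entropy(text: str) -> float:
    """Shannon entropy over whitespace-separated tokens, in bits.

    Illustrative only: a rough proxy for lexical novelty/variation,
    not the internal rubric behind the scores reported below.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(lexical_entropy("Generate a novel hypothesis about superconductivity in twisted bilayer graphene"))
```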

### Metrics

| Metric                 | Score |
| ---------------------- | ----- |
| BLEU (coherence)       | 10/10 |
| Entropy novelty        | 6/10  |
| Scientific consistency | 9/10  |
| Model similarity coef  | 87%   |

### Results

The model performs robustly on hypothesis generation and scientific prose tasks. While baseline coherence is high, novelty depends on prompt diversity. It is well suited as a distillation source or inference agent for generating synthetic scientific corpora.

---

## Environmental Impact

| Component      | Value                               |
| -------------- | ----------------------------------- |
| Hardware Type  | 2× NVIDIA T4 GPUs                   |
| Hours used     | ~7.5                                 |
| Cloud Provider | Kaggle (Google Cloud)                |
| Compute Region | US                                   |
| Carbon Emitted | Estimate pending (likely < 1 kg CO₂) |

---

## Technical Specifications

### Model Architecture

* Transformer decoder (Mistral-7B architecture)
* LoRA adapters applied to attention and FFN layers
* Quantized with `bitsandbytes` to 4-bit for memory efficiency

### Compute Infrastructure

* CPU: Intel i5 8th Gen vPro (batch preprocessing)
* GPU: 2× NVIDIA T4 (CUDA 12.1)

### Software Stack

* PEFT 0.12.0
* Transformers 4.41.1
* Accelerate
* TRL
* Torch 2.x

---

## Citation

**BibTeX**:

```bibtex
@misc{nexa-mistral-7b-sci,
  title = {Nexa Mistral 7B Sci},
  author = {Allan Wandia},
  year = {2025},
  howpublished = {\url{https://huggingface.co/allan-Wandia/nexa-mistral-7b-sci}},
  note = {Fine-tuned model for scientific generation tasks}
}
```
---

## Model Card Contact

For questions, contact Allan via Hugging Face or at:
📫 Email: [email protected]

---

## Model Card Authors

* Allan Wandia (Independent ML Engineer and Systems Architect)

---

## Glossary

* **LoRA**: Low-Rank Adaptation
* **PEFT**: Parameter-Efficient Fine-Tuning
* **BLEU**: Bilingual Evaluation Understudy Score
* **Entropy Score**: Metric used to estimate novelty/variation
* **Safetensors**: A secure, fast format for storing model weights

## Links
**GitHub repo and notebook**: https://github.com/DarkStarStrix/Nexa_Auto
 
---