# BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

## Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. It generates abstract-style summaries from scientific papers and other research content.

## Files Structure

```
bsg_cyllama/
β”œβ”€β”€ scientific_model_production_v2/     # Trained model files
β”‚   β”œβ”€β”€ config.json                     # Model configuration
β”‚   β”œβ”€β”€ prompt_generator.pt             # Prompt generation utilities
β”‚   └── model/                          # LoRA adapter files
β”‚       β”œβ”€β”€ adapter_config.json
β”‚       β”œβ”€β”€ adapter_model.safetensors
β”‚       β”œβ”€β”€ tokenizer.json
β”‚       └── ...
β”œβ”€β”€ bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
β”œβ”€β”€ bsg_cyllama_trainer_v2.py          # Training script
β”œβ”€β”€ scientific_model_inference2.py     # Inference utilities
β”œβ”€β”€ bsg_training_data_gen.py           # Data generation pipeline
β”œβ”€β”€ compile_complete_training_data.py  # Data compilation script
β”œβ”€β”€ upload_to_huggingface.py           # HF upload utilities
└── run_upload.py                      # Simple upload runner
```

## Prerequisites

1. **Python Environment** (minimum versions):
   ```text
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   ```

2. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (recommended)
   - 16GB+ system RAM
   - CUDA support for optimal performance
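
A quick way to check what hardware is visible before loading the model (a minimal sketch using only `torch`):

```python
import torch

# Report whether a CUDA device is available and how much VRAM it has
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device found; inference will fall back to CPU (slow)")
```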

## Installation

1. **Clone/download the repository**:
   ```bash
   git clone <your-repo-url>
   cd bsg_cyllama
   ```

2. **Activate your environment first** (if using a virtual environment):
   ```bash
   source ~/myenv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```
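
4. **Verify the install** (a quick sanity check that the core imports resolve):
   ```bash
   python -c "import torch, transformers, peft; print(torch.__version__, transformers.__version__, peft.__version__)"
   ```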

## Usage

### 1. Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize with an attention mask and move tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the summary itself, not prompt + summary
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
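
For example (the abstract below is placeholder input):

```python
abstract = "We investigate ..."  # replace with a real abstract
print(generate_summary(abstract, max_new_tokens=150))
```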

### 2. Using the Inference Script

```bash
python scientific_model_inference2.py
```

### 3. Training from Scratch

```bash
python bsg_cyllama_trainer_v2.py
```

## Dataset Information

The complete training dataset contains **19,174 records** with the following structure:

- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization

### Loading the Dataset

```python
import pandas as pd

# Load complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
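
To turn the table into prompt/target pairs for fine-tuning, a minimal sketch (the prompt template here is an assumption; align it with whatever `bsg_cyllama_trainer_v2.py` actually uses):

```python
def make_pair(row):
    # Hypothetical prompt template -- match the trainer's actual format
    prompt = f"Summarize the following scientific text:\n\n{row['OriginalText']}\n\nSummary:"
    return prompt, row["AbstractSummary"]

pairs = [make_pair(row) for _, row in df.iterrows()]
print(pairs[0][0][:200])
```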

## Model Configuration

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Training Samples**: 19,174
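
Expressed as a `peft` configuration, these hyperparameters look like the following (a sketch reconstructed from the values above, not necessarily the exact trainer code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                # LoRA rank
    lora_alpha=256,       # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```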

## Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

1. **Set up your token**: the token is already configured in the script. If you need to supply a different one, prefer the CLI login or an environment variable over hardcoding it:
   ```bash
   huggingface-cli login   # or: export HF_TOKEN=<your-token>
   ```

2. **Run the upload**:
   ```bash
   python run_upload.py
   ```

3. **Enter your HF username** when prompted

This will create two repositories:
- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)
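
`run_upload.py` wraps these steps; for reference, the equivalent direct `huggingface_hub` calls look roughly like this (repository names follow the pattern above):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN or the cached login
username = "your-username"

# Model repository: create it if needed, then push the trained files
api.create_repo(f"{username}/bsg-cyllama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="scientific_model_production_v2",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository: push the aligned training table
api.create_repo(f"{username}/bsg-cyllama-training-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```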

## Performance Tips

1. **For better performance**:
   - Use GPU inference
   - Adjust temperature (0.5-0.8 for more focused summaries)
   - Experiment with max_length based on your needs

2. **Memory optimization** (see the sketch below):
   - Use torch.float16 for inference
   - Enable gradient checkpointing for training
   - Use smaller batch sizes if needed
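
A minimal sketch of the memory-saving setup (the half-precision load mirrors the inference example above; gradient checkpointing applies to training only):

```python
import torch
from transformers import AutoModelForCausalLM

# Half-precision weights roughly halve VRAM versus float32
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Training only: recompute activations instead of storing them all
model.gradient_checkpointing_enable()
```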

## Troubleshooting

1. **CUDA out of memory**:
   - Reduce batch size
   - Use CPU inference
   - Enable gradient checkpointing

2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"` (quote the requirement so the shell does not treat `>` as a redirect)
   - Install missing dependencies: `pip install peft sentence-transformers`

3. **Model loading issues**:
   - Verify file paths
   - Check model file integrity
   - Ensure proper permissions
   - For the `meta-llama` base model, make sure your Hugging Face account has accepted the Llama 3.2 license and you are logged in

## Example Applications

1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## Support

For questions, issues, or collaboration:
1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team

---

**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)