# BSG CyLLama Setup and Usage Guide
This guide explains how to set up and use the BSG CyLLama scientific summarization model.
## Overview
BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. The model excels at generating high-quality abstracts and summaries from scientific papers and research content.
## Files Structure
```
bsg_cyllama/
├── scientific_model_production_v2/          # Trained model files
│   ├── config.json                          # Model configuration
│   ├── prompt_generator.pt                  # Prompt generation utilities
│   └── model/                               # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv   # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py                # Training script
├── scientific_model_inference2.py           # Inference utilities
├── bsg_training_data_gen.py                 # Data generation pipeline
├── compile_complete_training_data.py        # Data compilation script
├── upload_to_huggingface.py                 # HF upload utilities
└── run_upload.py                            # Simple upload runner
```
## Prerequisites
1. **Python Environment**:
```text
python >= 3.8
torch >= 2.0
transformers >= 4.30.0
peft >= 0.4.0
sentence-transformers
huggingface_hub
pandas
numpy
```
2. **Hardware Requirements**:
- GPU with at least 8GB VRAM (recommended)
- 16GB+ system RAM
- CUDA support for optimal performance
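Before installing anything else, a quick check script (generic, not part of the repository) can confirm the environment meets these requirements:
```python
import torch
import transformers

print(f"PyTorch: {torch.__version__}")
print(f"Transformers: {transformers.__version__}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name} ({props.total_memory / 1024**3:.1f} GB VRAM)")
else:
    print("No CUDA GPU detected; inference will fall back to CPU.")
```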
## Installation
1. **Clone/Download the repository**:
```bash
git clone <your-repo-url>
cd bsg_cyllama
```
2. **Install dependencies**:
```bash
pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
```
3. **Activate environment** (if using virtual environment):
```bash
source ~/myenv/bin/activate
```
## Usage
### 1. Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in half precision
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"
    # Move inputs to the model's device to avoid device-mismatch errors
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the summary, independent of prompt length
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
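A quick usage example (the abstract below is placeholder text, not taken from the repository or dataset):
```python
abstract = (
    "We investigate the thermal stability of engineered protein scaffolds "
    "under varying pH conditions and report a design rule that improves "
    "melting temperature without sacrificing binding affinity."
)

print(generate_summary(abstract))
```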
### 2. Using the Inference Script
```bash
python scientific_model_inference2.py
```
### 3. Training from Scratch
```bash
python bsg_cyllama_trainer_v2.py
```
## Dataset Information
The complete training dataset contains **19,174 records** with the following structure:
- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization
### Loading the Dataset
```python
import pandas as pd
# Load complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")
print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")
# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
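The exact prompt template used for fine-tuning lives in `bsg_cyllama_trainer_v2.py`. Purely as an illustration, a prompt/completion pair could be assembled from a row along these lines (the template below is an assumption, not the script's actual format):
```python
def make_training_example(row):
    # Hypothetical template; see bsg_cyllama_trainer_v2.py for the real one
    prompt = (
        "Summarize the following scientific text:\n\n"
        f"{row['OriginalText']}\n\nSummary:"
    )
    return {"prompt": prompt, "completion": " " + row["AbstractSummary"]}

examples = [make_training_example(row) for _, row in df.head(3).iterrows()]
print(examples[0]["prompt"][:200])
```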
## Model Configuration
- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Training Samples**: 19,174
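In PEFT terms, the configuration above corresponds roughly to the following `LoraConfig` (a sketch: the dropout and bias settings are assumptions, since they are not listed above):
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,               # LoRA rank
    lora_alpha=256,      # scaling factor (alpha / r = 2.0)
    target_modules=[
        "v_proj", "o_proj", "k_proj", "gate_proj",
        "q_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,   # assumed; not specified above
    bias="none",         # assumed; not specified above
    task_type="CAUSAL_LM",
)
```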
## Uploading to Hugging Face
To upload your model and dataset to Hugging Face:
1. **Set up your token**. The upload script expects a Hugging Face access token; if yours is not already configured in the script, log in once with:
```bash
huggingface-cli login
```
2. **Run the upload**:
```bash
python run_upload.py
```
3. **Enter your HF username** when prompted
This will create two repositories:
- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)
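`run_upload.py` wraps this process; a minimal equivalent using `huggingface_hub` directly looks roughly like the sketch below (`your-username` is a placeholder, and the script's actual logic may differ):
```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from huggingface-cli login or HF_TOKEN
username = "your-username"  # placeholder

# Model repository
api.create_repo(f"{username}/bsg-cyllama", exist_ok=True)
api.upload_folder(
    folder_path="scientific_model_production_v2",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository
api.create_repo(f"{username}/bsg-cyllama-training-data",
                repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```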
## Performance Tips
1. **For better output quality**:
- Use GPU inference
- Lower the temperature (0.5-0.8 gives more focused summaries)
- Adjust `max_new_tokens` to the summary length you need
2. **Memory optimization** (see the sketch below):
- Use `torch.float16` for inference
- Enable gradient checkpointing for training
- Reduce the batch size if needed
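A sketch of the training-side memory levers mentioned above (gradient checkpointing is standard `transformers` functionality; the batch-size numbers are illustrative):
```python
from transformers import TrainingArguments

# Trade compute for memory: recompute activations instead of storing them
base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False  # the KV cache conflicts with checkpointing

training_args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=1,   # small batches to fit in VRAM
    gradient_accumulation_steps=8,   # effective batch size of 8
    fp16=True,                       # half-precision training
)
```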
## Troubleshooting
1. **CUDA out of memory**:
- Reduce the batch size
- Fall back to CPU inference (see the sketch after this list)
- Enable gradient checkpointing
2. **Import errors**:
- Check the transformers version: `pip install "transformers>=4.30.0"` (quoted so the shell does not treat `>=` as a redirect)
- Install missing dependencies: `pip install peft sentence-transformers`
3. **Model loading issues**:
- Verify file paths
- Check model file integrity
- Ensure proper permissions
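A minimal CPU fallback for the loading step in the inference example above (much slower, but sidesteps GPU memory limits entirely):
```python
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float32,   # float16 is poorly supported on CPU
    device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
```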
## Example Applications
1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**
## Citation
```bibtex
@misc{bsg-cyllama-2025,
title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
author={BSG Research Team},
year={2025},
url={https://huggingface.co/bsg-cyllama}
}
```
## Support
For questions, issues, or collaboration:
1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team
---
**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)