# BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

## Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. It generates abstract-style summaries from scientific papers and other research content.

## Files Structure

```
bsg_cyllama/
β”œβ”€β”€ scientific_model_production_v2/     # Trained model files
β”‚   β”œβ”€β”€ config.json                     # Model configuration
β”‚   β”œβ”€β”€ prompt_generator.pt             # Prompt generation utilities
β”‚   └── model/                          # LoRA adapter files
β”‚       β”œβ”€β”€ adapter_config.json
β”‚       β”œβ”€β”€ adapter_model.safetensors
β”‚       β”œβ”€β”€ tokenizer.json
β”‚       └── ...
β”œβ”€β”€ bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
β”œβ”€β”€ bsg_cyllama_trainer_v2.py          # Training script
β”œβ”€β”€ scientific_model_inference2.py     # Inference utilities
β”œβ”€β”€ bsg_training_data_gen.py           # Data generation pipeline
β”œβ”€β”€ compile_complete_training_data.py  # Data compilation script
β”œβ”€β”€ upload_to_huggingface.py           # HF upload utilities
└── run_upload.py                      # Simple upload runner
```

## Prerequisites

1. **Python Environment** (minimum versions):
   ```text
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   ```

2. **Hardware Requirements**:
   - GPU with at least 8GB VRAM (recommended)
   - 16GB+ system RAM
   - CUDA support for optimal performance
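
A quick way to check what hardware is visible before loading the model (a minimal sketch using only `torch`):

```python
import torch

# Report whether a CUDA device is available and how much VRAM it has
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device found; inference will fall back to CPU (slow)")
```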

## Installation

1. **Clone/download the repository**:
   ```bash
   git clone <your-repo-url>
   cd bsg_cyllama
   ```

2. **Activate your environment first** (if using a virtual environment):
   ```bash
   source ~/myenv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```
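
4. **Verify the install** (a quick sanity check that the core imports resolve):
   ```bash
   python -c "import torch, transformers, peft; print(torch.__version__, transformers.__version__, peft.__version__)"
   ```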

## Usage

### 1. Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize with an attention mask and move tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the summary itself, not prompt + summary
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
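
For example (the abstract below is placeholder input):

```python
abstract = "We investigate ..."  # replace with a real abstract
print(generate_summary(abstract, max_new_tokens=150))
```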

### 2. Using the Inference Script

```bash
python scientific_model_inference2.py
```

### 3. Training from Scratch

```bash
python bsg_cyllama_trainer_v2.py
```

## Dataset Information

The complete training dataset contains **19,174 records** with the following structure:

- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: For data organization

### Loading the Dataset

```python
import pandas as pd

# Load complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
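
To turn the table into prompt/target pairs for fine-tuning, a minimal sketch (the prompt template here is an assumption; align it with whatever `bsg_cyllama_trainer_v2.py` actually uses):

```python
def make_pair(row):
    # Hypothetical prompt template -- match the trainer's actual format
    prompt = f"Summarize the following scientific text:\n\n{row['OriginalText']}\n\nSummary:"
    return prompt, row["AbstractSummary"]

pairs = [make_pair(row) for _, row in df.iterrows()]
print(pairs[0][0][:200])
```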

## Model Configuration

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Training Samples**: 19,174
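
Expressed as a `peft` configuration, these hyperparameters look like the following (a sketch reconstructed from the values above, not necessarily the exact trainer code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                # LoRA rank
    lora_alpha=256,       # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```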

## Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

1. **Set up your token**: the token is already configured in the script. If you need to supply a different one, prefer the CLI login or an environment variable over hardcoding it:
   ```bash
   huggingface-cli login   # or: export HF_TOKEN=<your-token>
   ```

2. **Run the upload**:
   ```bash
   python run_upload.py
   ```

3. **Enter your HF username** when prompted

This will create two repositories:
- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)
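
`run_upload.py` wraps these steps; for reference, the equivalent direct `huggingface_hub` calls look roughly like this (repository names follow the pattern above):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN or the cached login
username = "your-username"

# Model repository: create it if needed, then push the trained files
api.create_repo(f"{username}/bsg-cyllama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="scientific_model_production_v2",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository: push the aligned training table
api.create_repo(f"{username}/bsg-cyllama-training-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```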

## Performance Tips

1. **For better performance**:
   - Use GPU inference
   - Adjust temperature (0.5-0.8 for more focused summaries)
   - Experiment with max_length based on your needs

2. **Memory optimization** (see the sketch below):
   - Use torch.float16 for inference
   - Enable gradient checkpointing for training
   - Use smaller batch sizes if needed
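
A minimal sketch of the memory-saving setup (the half-precision load mirrors the inference example above; gradient checkpointing applies to training only):

```python
import torch
from transformers import AutoModelForCausalLM

# Half-precision weights roughly halve VRAM versus float32
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Training only: recompute activations instead of storing them all
model.gradient_checkpointing_enable()
```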

## Troubleshooting

1. **CUDA out of memory**:
   - Reduce batch size
   - Use CPU inference
   - Enable gradient checkpointing

2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"` (quote the requirement so the shell does not treat `>` as a redirect)
   - Install missing dependencies: `pip install peft sentence-transformers`

3. **Model loading issues**:
   - Verify file paths
   - Check model file integrity
   - Ensure proper permissions
   - For the `meta-llama` base model, make sure your Hugging Face account has accepted the Llama 3.2 license and you are logged in

## Example Applications

1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```

## Support

For questions, issues, or collaboration:
1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team

---

**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)