# BSG CyLLama Setup and Usage Guide
This guide explains how to set up and use the BSG CyLLama scientific summarization model.
## Overview
BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. The model excels at generating high-quality abstracts and summaries from scientific papers and research content.
## Files Structure
```
bsg_cyllama/
├── scientific_model_production_v2/          # Trained model files
│   ├── config.json                          # Model configuration
│   ├── prompt_generator.pt                  # Prompt generation utilities
│   └── model/                               # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv   # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py                # Training script
├── scientific_model_inference2.py           # Inference utilities
├── bsg_training_data_gen.py                 # Data generation pipeline
├── compile_complete_training_data.py        # Data compilation script
├── upload_to_huggingface.py                 # HF upload utilities
└── run_upload.py                            # Simple upload runner
```
## Prerequisites
1. **Python Environment**:
```text
python >= 3.8
torch >= 2.0
transformers >= 4.30.0
peft >= 0.4.0
huggingface_hub
pandas
numpy
```
2. **Hardware Requirements**:
- GPU with at least 8GB VRAM (recommended)
- 16GB+ system RAM
- CUDA support for optimal performance (a quick availability check is sketched below)
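To confirm that PyTorch can actually see your GPU before going further, a quick check like the following can help:

```python
import torch

# Report whether CUDA is available and, if so, which device will be used
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("CUDA not available; inference will fall back to CPU")
```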
## Installation
1. **Clone/Download the repository**:
```bash
git clone <your-repo-url>
cd bsg_cyllama
```
2. **Install dependencies**:
```bash
pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
```
3. **Activate environment** (if using virtual environment):
```bash
source ~/myenv/bin/activate  # adjust the path to your own environment
```
## Usage
### 1. Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in half precision
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"
    # Move the inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the summary itself, not the prompt
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
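For example, with a hypothetical `abstract_text` holding the passage you want to condense:

```python
abstract_text = "Recent advances in deep learning have enabled ..."  # your input text
print(generate_summary(abstract_text, max_new_tokens=150))
```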
### 2. Using the Inference Script
```bash
python scientific_model_inference2.py
```
### 3. Training from Scratch
```bash
python bsg_cyllama_trainer_v2.py
```
## Dataset Information
The complete training dataset contains **19,174 records** with the following structure:
- **AbstractSummary**: Detailed scientific summary
- **ShortSummary**: Concise version
- **Title**: Research paper title
- **OriginalText**: Source abstract
- **OriginalKeywords**: Topic keywords
- **Clustering information**: Cluster assignments used to organize the dataset
### Loading the Dataset
```python
import pandas as pd
# Load complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")
print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")
# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
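If you want to turn these records into prompt/target pairs for fine-tuning, a minimal sketch along these lines should work; note that the prompt template below is an assumption for illustration, and `bsg_cyllama_trainer_v2.py` may format its prompts differently:

```python
import pandas as pd

df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

def to_training_pair(row):
    # Hypothetical prompt template; check bsg_cyllama_trainer_v2.py for the exact format
    prompt = f"Summarize the following scientific text:\n\n{row['OriginalText']}\n\nSummary:"
    return {"prompt": prompt, "target": row["AbstractSummary"]}

pairs = [to_training_pair(row) for _, row in df.iterrows()]
print(pairs[0]["target"][:200])
```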
## Model Configuration
- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **LoRA Rank**: 128
- **LoRA Alpha**: 256
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
- **Training Samples**: 19,174
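These hyperparameters map onto a PEFT configuration roughly like the following (a sketch reconstructed from the values above; `model/adapter_config.json` holds the authoritative settings):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,            # LoRA rank
    lora_alpha=256,   # scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```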
## Uploading to Hugging Face
To upload your model and dataset to Hugging Face:
1. **Set up your token**:
```bash
huggingface-cli login  # or rely on the token already configured in the script
```
2. **Run the upload**:
```bash
python run_upload.py
```
3. **Enter your HF username** when prompted
This will create two repositories:
- `{username}/bsg-cyllama` (model)
- `{username}/bsg-cyllama-training-data` (dataset)
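`run_upload.py` wraps this process, but an equivalent manual upload with `huggingface_hub` would look roughly like this (the folder and repo names are taken from this guide; adjust as needed):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or the HF_TOKEN env var
username = "your-username"  # replace with your HF username

# Model repository: upload the trained model directory
api.create_repo(repo_id=f"{username}/bsg-cyllama", exist_ok=True)
api.upload_folder(
    folder_path="scientific_model_production_v2",
    repo_id=f"{username}/bsg-cyllama",
)

# Dataset repository: upload the training data
api.create_repo(repo_id=f"{username}/bsg-cyllama-training-data",
                repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```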
## Performance Tips
1. **For better performance**:
- Use GPU inference
- Adjust the temperature (0.5-0.8 yields more focused summaries; see the example below)
- Experiment with max_length based on your needs
2. **Memory optimization**:
- Use torch.float16 for inference
- Enable gradient checkpointing for training
- Use smaller batch sizes if needed
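As an illustration, reusing `model`, `inputs`, and `tokenizer` from the Basic Inference example, a more focused configuration might look like this (the values are starting points, not tuned settings):

```python
# Lower temperature and a tighter token budget for shorter, more focused summaries
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    temperature=0.5,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
```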
## Troubleshooting
1. **CUDA out of memory**:
- Reduce batch size
- Use CPU inference (see the sketch after this list)
- Enable gradient checkpointing
2. **Import errors**:
- Check transformers version: `pip install transformers>=4.30.0`
- Install missing dependencies: `pip install peft sentence-transformers`
3. **Model loading issues**:
- Verify file paths
- Check model file integrity
- Ensure proper permissions
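If CUDA runs out of memory and you need the CPU fallback mentioned above, the model can be loaded without `device_map="auto"` (a sketch; expect noticeably slower generation):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# float32 on CPU: half precision is poorly supported for CPU inference
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model").to("cpu")
```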
## Example Applications
1. **Scientific Paper Summarization**
2. **Abstract Generation**
3. **Research Literature Review**
4. **Technical Documentation Condensation**
## Citation
```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```
## Support
For questions, issues, or collaboration:
1. Check this guide first
2. Review the error messages
3. Open an issue in the repository
4. Contact the development team
---
**Last Updated**: January 2025
**Model Version**: v2
**Dataset Version**: Complete Aligned (19,174 records)