BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. It generates abstract-style and short summaries from scientific papers and related research text.

Files Structure

bsg_cyllama/
├── scientific_model_production_v2/     # Trained model files
│   ├── config.json                     # Model configuration
│   ├── prompt_generator.pt             # Prompt generation utilities
│   └── model/                          # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py          # Training script
├── scientific_model_inference2.py     # Inference utilities
├── bsg_training_data_gen.py           # Data generation pipeline
├── compile_complete_training_data.py  # Data compilation script
├── upload_to_huggingface.py           # HF upload utilities
└── run_upload.py                      # Simple upload runner

Prerequisites

  1. Python Environment:

    python >= 3.8
    torch >= 2.0
    transformers >= 4.30.0
    peft >= 0.4.0
    huggingface_hub
    pandas
    numpy
    sentence-transformers
    
  2. Hardware Requirements:

    • GPU with at least 8GB VRAM (recommended)
    • 16GB+ system RAM
    • CUDA support for optimal performance

Installation

  1. Clone/Download the repository:

    git clone <your-repo-url>
    cd bsg_cyllama
    
  2. Install dependencies:

    pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
    
  3. Activate your environment (if you are using a virtual environment):

    source ~/myenv/bin/activate
    
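After installing, you can run a quick sanity check to confirm that the core libraries import correctly and that CUDA is visible. This snippet is illustrative and not part of the repository:

# Quick environment check (illustrative; not part of the repository)
import torch
import transformers
import peft

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}, peft {peft.__version__}")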

Usage

1. Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # cap on generated tokens, independent of prompt length
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()

2. Using the Inference Script

python scientific_model_inference2.py

3. Training from Scratch

python bsg_cyllama_trainer_v2.py

Dataset Information

The complete training dataset contains 19,174 records with the following structure:

  • AbstractSummary: Detailed scientific summary
  • ShortSummary: Concise version
  • Title: Research paper title
  • OriginalText: Source abstract
  • OriginalKeywords: Topic keywords
  • Clustering information: For data organization

Loading the Dataset

import pandas as pd

# Load complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")

Model Configuration

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • LoRA Rank: 128
  • LoRA Alpha: 256
  • Target Modules: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
  • Training Samples: 19,174
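
For reference, these hyperparameters correspond roughly to the following peft LoraConfig. This is a sketch for orientation; values not listed above (such as dropout) are assumptions, so check bsg_cyllama_trainer_v2.py for the exact settings:

from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                     # LoRA rank
    lora_alpha=256,            # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,         # assumption; not stated in this guide
    bias="none",
    task_type="CAUSAL_LM",
)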

Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

  1. Set up your token:

    # Your token is already configured in the script
    
  2. Run the upload:

    python run_upload.py
    
  3. Enter your HF username when prompted

This will create two repositories:

  • {username}/bsg-cyllama (model)
  • {username}/bsg-cyllama-training-data (dataset)
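
If you prefer to upload manually instead of using run_upload.py, the huggingface_hub API can do the equivalent. A minimal sketch, assuming the repository names above and a token available via huggingface-cli login or the HF_TOKEN environment variable:

from huggingface_hub import HfApi

api = HfApi()  # picks up the token from huggingface-cli login or HF_TOKEN
username = "your-username"  # replace with your HF username

# Model repository
api.create_repo(f"{username}/bsg-cyllama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./scientific_model_production_v2",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository
api.create_repo(f"{username}/bsg-cyllama-training-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)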

Performance Tips

  1. For better performance:

    • Use GPU inference
    • Adjust temperature (0.5-0.8 for more focused summaries)
    • Experiment with max_new_tokens to control summary length
  2. Memory optimization:

    • Use torch.float16 for inference
    • Enable gradient checkpointing for training
    • Use smaller batch sizes if needed
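
As an illustration of the memory tips above, a memory-conscious load might look like this (a sketch, not the exact configuration used by the trainer script):

import torch
from transformers import AutoModelForCausalLM

# Half-precision weights for inference
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# For training, trade compute for memory and keep batches small
model.gradient_checkpointing_enable()
# e.g. per_device_train_batch_size=1 with gradient_accumulation_steps=8 in TrainingArguments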

Troubleshooting

  1. CUDA out of memory:

    • Reduce batch size
    • Use CPU inference
    • Enable gradient checkpointing
  2. Import errors:

    • Check the transformers version: pip install "transformers>=4.30.0" (quote the requirement so the shell does not treat >= as a redirect)
    • Install missing dependencies: pip install peft sentence-transformers
  3. Model loading issues:

    • Verify file paths
    • Check model file integrity
    • Ensure proper permissions
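
If CUDA out-of-memory errors persist, the model can be loaded entirely on the CPU (slower, but it avoids VRAM limits). A minimal sketch:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Force all weights onto the CPU; use float32, since half precision is poorly supported on CPU
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float32,
    device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")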

Example Applications

  1. Scientific Paper Summarization
  2. Abstract Generation
  3. Research Literature Review
  4. Technical Documentation Condensation

Citation

@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}

Support

For questions, issues, or collaboration:

  1. Check this guide first
  2. Review the error messages
  3. Open an issue in the repository
  4. Contact the development team

Last Updated: January 2025
Model Version: v2
Dataset Version: Complete Aligned (19,174 records)