BSG CyLLama Setup and Usage Guide

This guide explains how to set up and use the BSG CyLLama scientific summarization model.

Overview

BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. It generates abstract-style and short summaries from scientific papers and related research text.

Files Structure

bsg_cyllama/
├── scientific_model_production_v2/     # Trained model files
│   ├── config.json                     # Model configuration
│   ├── prompt_generator.pt             # Prompt generation utilities
│   └── model/                          # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py          # Training script
├── scientific_model_inference2.py     # Inference utilities
├── bsg_training_data_gen.py           # Data generation pipeline
├── compile_complete_training_data.py  # Data compilation script
├── upload_to_huggingface.py           # HF upload utilities
└── run_upload.py                      # Simple upload runner

Prerequisites

  1. Python Environment:

    python >= 3.8
    torch >= 2.0
    transformers >= 4.30.0
    peft >= 0.4.0
    huggingface_hub
    pandas
    numpy
    sentence-transformers
    
  2. Hardware Requirements:

    • GPU with at least 8GB VRAM (recommended)
    • 16GB+ system RAM
    • CUDA support for optimal performance

Installation

  1. Clone/Download the repository:

    git clone <your-repo-url>
    cd bsg_cyllama
    
  2. Install dependencies:

    pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
    
  3. Activate your environment (if you are using a virtual environment):

    source ~/myenv/bin/activate
    
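After installing, you can run a quick sanity check to confirm that the core libraries import correctly and that CUDA is visible. This snippet is illustrative and not part of the repository:

# Quick environment check (illustrative; not part of the repository)
import torch
import transformers
import peft

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}, peft {peft.__version__}")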

Usage

1. Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # cap on generated tokens, independent of prompt length
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id,
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()

2. Using the Inference Script

python scientific_model_inference2.py

3. Training from Scratch

python bsg_cyllama_trainer_v2.py

Dataset Information

The complete training dataset contains 19,174 records with the following structure:

  • AbstractSummary: Detailed scientific summary
  • ShortSummary: Concise version
  • Title: Research paper title
  • OriginalText: Source abstract
  • OriginalKeywords: Topic keywords
  • Clustering information: For data organization

Loading the Dataset

import pandas as pd

# Load complete training data
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")

Model Configuration

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • LoRA Rank: 128
  • LoRA Alpha: 256
  • Target Modules: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj
  • Training Samples: 19,174
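
For reference, these hyperparameters correspond roughly to the following peft LoraConfig. This is a sketch for orientation; values not listed above (such as dropout) are assumptions, so check bsg_cyllama_trainer_v2.py for the exact settings:

from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                     # LoRA rank
    lora_alpha=256,            # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,         # assumption; not stated in this guide
    bias="none",
    task_type="CAUSAL_LM",
)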

Uploading to Hugging Face

To upload your model and dataset to Hugging Face:

  1. Set up your token:

    # Your token is already configured in the script
    
  2. Run the upload:

    python run_upload.py
    
  3. Enter your HF username when prompted

This will create two repositories:

  • {username}/bsg-cyllama (model)
  • {username}/bsg-cyllama-training-data (dataset)
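
If you prefer to upload manually instead of using run_upload.py, the huggingface_hub API can do the equivalent. A minimal sketch, assuming the repository names above and a token available via huggingface-cli login or the HF_TOKEN environment variable:

from huggingface_hub import HfApi

api = HfApi()  # picks up the token from huggingface-cli login or HF_TOKEN
username = "your-username"  # replace with your HF username

# Model repository
api.create_repo(f"{username}/bsg-cyllama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./scientific_model_production_v2",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository
api.create_repo(f"{username}/bsg-cyllama-training-data", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)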

Performance Tips

  1. For better performance:

    • Use GPU inference
    • Adjust temperature (0.5-0.8 for more focused summaries)
    • Experiment with max_new_tokens to control summary length
  2. Memory optimization:

    • Use torch.float16 for inference
    • Enable gradient checkpointing for training
    • Use smaller batch sizes if needed
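
As an illustration of the memory tips above, a memory-conscious load might look like this (a sketch, not the exact configuration used by the trainer script):

import torch
from transformers import AutoModelForCausalLM

# Half-precision weights for inference
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# For training, trade compute for memory and keep batches small
model.gradient_checkpointing_enable()
# e.g. per_device_train_batch_size=1 with gradient_accumulation_steps=8 in TrainingArguments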

Troubleshooting

  1. CUDA out of memory:

    • Reduce batch size
    • Use CPU inference
    • Enable gradient checkpointing
  2. Import errors:

    • Check the transformers version: pip install "transformers>=4.30.0" (quote the requirement so the shell does not treat >= as a redirect)
    • Install missing dependencies: pip install peft sentence-transformers
  3. Model loading issues:

    • Verify file paths
    • Check model file integrity
    • Ensure proper permissions
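
If CUDA out-of-memory errors persist, the model can be loaded entirely on the CPU (slower, but it avoids VRAM limits). A minimal sketch:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Force all weights onto the CPU; use float32, since half precision is poorly supported on CPU
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float32,
    device_map={"": "cpu"},
)
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")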

Example Applications

  1. Scientific Paper Summarization
  2. Abstract Generation
  3. Research Literature Review
  4. Technical Documentation Condensation

Citation

@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}

Support

For questions, issues, or collaboration:

  1. Check this guide first
  2. Review the error messages
  3. Open an issue in the repository
  4. Contact the development team

Last Updated: January 2025
Model Version: v2
Dataset Version: Complete Aligned (19,174 records)