|
# BSG CyLLama Setup and Usage Guide |
|
|
|
This guide explains how to set up and use the BSG CyLLama scientific summarization model. |
|
|
|
## Overview |
|
|
|
BSG CyLLama is a LoRA-adapted Llama-3.2-1B-Instruct model fine-tuned for scientific text summarization. It generates abstract-style summaries from scientific papers and research content.
|
|
|
## File Structure
|
|
|
```
bsg_cyllama/
├── scientific_model_production_v2/         # Trained model files
│   ├── config.json                         # Model configuration
│   ├── prompt_generator.pt                 # Prompt generation utilities
│   └── model/                              # LoRA adapter files
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── tokenizer.json
│       └── ...
├── bsg_training_data_complete_aligned.tsv  # Complete training dataset (19,174 records)
├── bsg_cyllama_trainer_v2.py               # Training script
├── scientific_model_inference2.py          # Inference utilities
├── bsg_training_data_gen.py                # Data generation pipeline
├── compile_complete_training_data.py       # Data compilation script
├── upload_to_huggingface.py                # HF upload utilities
└── run_upload.py                           # Simple upload runner
```
|
|
|
## Prerequisites |
|
|
|
1. **Python Environment**:
   ```text
   python >= 3.8
   torch >= 2.0
   transformers >= 4.30.0
   peft >= 0.4.0
   huggingface_hub
   pandas
   numpy
   sentence-transformers
   ```
|
|
|
2. **Hardware Requirements**: |
|
- GPU with at least 8GB VRAM (recommended) |
|
- 16GB+ system RAM |
|
- CUDA support for optimal performance |
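
If you are unsure whether your machine meets these requirements, here is a quick check using standard PyTorch calls:

```python
import torch

# Report CUDA availability and the VRAM of the first GPU, if any
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device found; inference will run on CPU.")
```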
|
|
|
## Installation |
|
|
|
1. **Clone/download the repository**:
   ```bash
   git clone <your-repo-url>
   cd bsg_cyllama
   ```

2. **Activate your environment** (if using a virtual environment; do this before installing dependencies):
   ```bash
   source ~/myenv/bin/activate
   ```

3. **Install dependencies**:
   ```bash
   pip install torch transformers peft huggingface_hub pandas numpy sentence-transformers
   ```
|
|
|
## Usage |
|
|
|
### 1. Basic Inference |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model
base_model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "./scientific_model_production_v2/model")
model.eval()

def generate_summary(text, max_new_tokens=200):
    prompt = f"Summarize the following scientific text:\n\n{text}\n\nSummary:"

    # Tokenize and move the inputs to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,  # budget for the summary itself,
                                            # independent of prompt length
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )

    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()
```
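
Example call (the abstract below is a placeholder, not real data):

```python
abstract = (
    "Transformer language models have advanced rapidly, yet summarizing "
    "domain-specific scientific text remains challenging..."
)
print(generate_summary(abstract, max_new_tokens=150))
```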
|
|
|
### 2. Using the Inference Script |
|
|
|
```bash
python scientific_model_inference2.py
```
|
|
|
### 3. Training from Scratch |
|
|
|
```bash
python bsg_cyllama_trainer_v2.py
```
|
|
|
## Dataset Information |
|
|
|
The complete training dataset contains **19,174 records** with the following structure: |
|
|
|
- **AbstractSummary**: Detailed scientific summary |
|
- **ShortSummary**: Concise version |
|
- **Title**: Research paper title |
|
- **OriginalText**: Source abstract |
|
- **OriginalKeywords**: Topic keywords |
|
- **Clustering information**: For data organization |
|
|
|
### Loading the Dataset |
|
|
|
```python
import pandas as pd

# Load the complete training data (tab-separated)
df = pd.read_csv("bsg_training_data_complete_aligned.tsv", sep="\t")

print(f"Dataset size: {len(df)} records")
print(f"Columns: {df.columns.tolist()}")

# Inspect an example training pair
sample = df.iloc[0]
print(f"Original: {sample['OriginalText'][:200]}...")
print(f"Summary: {sample['AbstractSummary'][:200]}...")
```
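
If you want to turn these columns into prompt/target pairs in the same style as the inference prompt above, a minimal sketch (the actual training script may format prompts differently):

```python
def to_example(row):
    # Pair each source abstract with its reference summary, mirroring
    # the "Summarize the following scientific text" prompt format
    prompt = (
        "Summarize the following scientific text:\n\n"
        f"{row['OriginalText']}\n\nSummary:"
    )
    return {"prompt": prompt, "target": row["AbstractSummary"]}

examples = [to_example(row) for _, row in df.head(3).iterrows()]
print(examples[0]["prompt"][:200])
```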
|
|
|
## Model Configuration |
|
|
|
- **Base Model**: meta-llama/Llama-3.2-1B-Instruct |
|
- **LoRA Rank**: 128 |
|
- **LoRA Alpha**: 256 |
|
- **Target Modules**: v_proj, o_proj, k_proj, gate_proj, q_proj, up_proj, down_proj |
|
- **Training Samples**: 19,174 |
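
In PEFT terms, this corresponds roughly to the following configuration (a sketch; `lora_dropout` is an assumption, since it is not listed above):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,              # LoRA rank
    lora_alpha=256,     # LoRA scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,  # assumption: not specified in the list above
    task_type="CAUSAL_LM",
)

# Wrap a freshly loaded base model with the adapter for training
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
```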
|
|
|
## Uploading to Hugging Face |
|
|
|
To upload your model and dataset to Hugging Face: |
|
|
|
1. **Set up your token**: the Hugging Face access token is already configured in the upload script; update it there if you need to use a different one.
|
|
|
2. **Run the upload**:
   ```bash
   python run_upload.py
   ```
|
|
|
3. **Enter your HF username** when prompted |
|
|
|
This will create two repositories: |
|
- `{username}/bsg-cyllama` (model) |
|
- `{username}/bsg-cyllama-training-data` (dataset) |
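
Under the hood, an upload along these lines can be done with `huggingface_hub` directly (a sketch of the general approach, not the exact contents of `upload_to_huggingface.py`):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in or HF_TOKEN is set
username = "your-username"  # placeholder

# Model repository: push the LoRA adapter files
api.create_repo(f"{username}/bsg-cyllama", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./scientific_model_production_v2/model",
    repo_id=f"{username}/bsg-cyllama",
    repo_type="model",
)

# Dataset repository: push the training TSV
api.create_repo(f"{username}/bsg-cyllama-training-data",
                repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="bsg_training_data_complete_aligned.tsv",
    path_in_repo="bsg_training_data_complete_aligned.tsv",
    repo_id=f"{username}/bsg-cyllama-training-data",
    repo_type="dataset",
)
```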
|
|
|
## Performance Tips |
|
|
|
1. **For better performance**:
   - Use GPU inference
   - Lower the temperature (0.5–0.8) for more focused summaries
   - Adjust `max_new_tokens` to the summary length you need
|
|
|
2. **Memory optimization**: |
|
- Use torch.float16 for inference |
|
- Enable gradient checkpointing for training |
|
- Use smaller batch sizes if needed |
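
Beyond float16, the base model can also be loaded with 8-bit weights (a sketch; this assumes the `bitsandbytes` package, which is not in the dependency list above):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model with 8-bit quantized weights to reduce VRAM use
# (roughly half of float16); requires `pip install bitsandbytes`
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```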
|
|
|
## Troubleshooting |
|
|
|
1. **CUDA out of memory**: |
|
- Reduce batch size |
|
- Use CPU inference |
|
- Enable gradient checkpointing |
|
|
|
2. **Import errors**:
   - Check the transformers version: `pip install "transformers>=4.30.0"` (quote the spec so the shell does not treat `>` as a redirect)
   - Install missing dependencies: `pip install peft sentence-transformers`
|
|
|
3. **Model loading issues**: |
|
- Verify file paths |
|
- Check model file integrity |
|
- Ensure proper permissions |
|
|
|
## Example Applications |
|
|
|
1. **Scientific Paper Summarization** |
|
2. **Abstract Generation** |
|
3. **Research Literature Review** |
|
4. **Technical Documentation Condensation** |
|
|
|
## Citation |
|
|
|
```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLLama: Scientific Summarization with LoRA-tuned Llama},
  author={BSG Research Team},
  year={2025},
  url={https://huggingface.co/bsg-cyllama}
}
```
|
|
|
## Support |
|
|
|
For questions, issues, or collaboration: |
|
1. Check this guide first |
|
2. Review the error messages |
|
3. Open an issue in the repository |
|
4. Contact the development team |
|
|
|
--- |
|
|
|
**Last Updated**: January 2025 |
|
**Model Version**: v2 |
|
**Dataset Version**: Complete Aligned (19,174 records) |
|
|
|
|
|
|