---
license: apache-2.0
datasets:
- huuuyeah/meetingbank
language:
- en
metrics:
- rouge
base_model:
- google/bigbird-pegasus-large-bigpatent
pipeline_tag: summarization
library_name: transformers
---
# MeetingScript
> A BigBird‐Pegasus model fine‑tuned for meeting transcription summarization on the MeetingBank dataset.
📦 **Model Files**
- **Weights & config**: `pytorch_model.bin`, `config.json`
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json`
- **Generation defaults**: `generation_config.json`
🔗 **Code:** https://github.com/kevin0437/Meeting_scripts
---
## Model Description
**MeetingScript** is a sequence‑to‑sequence model based on
[google/bigbird-pegasus-large-bigpatent](https://huggingface.co/google/bigbird-pegasus-large-bigpatent)
and fine‑tuned on the [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank) corpus of meeting transcripts paired with human‐written summaries.
It is designed to take long meeting transcripts (up to 4096 tokens) and produce concise, coherent summaries.
---
## Evaluation Results
Evaluated on the held‑out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, max_length=600):
| Metric | F1 Score (%) |
|-------------|-------------:|
| **ROUGE‑1** | 51.5556 |
| **ROUGE‑2** | 38.5378 |
| **ROUGE‑L** | 48.0786 |
| **ROUGE‑Lsum** | 48.0142 |
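
For intuition about what these numbers measure, here is a minimal, self-contained sketch of ROUGE‑1 F1 (unigram overlap between a generated and a reference summary). It is an illustration only, not the exact scorer used for the table above, and the example sentences are made up:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped unigram overlap: each token counts at most as often
    # as it appears in the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the council approved the budget",
    "the city council approved the proposed budget",
)
print(round(score * 100, 2))  # → 83.33
```

In practice the table above would be computed with a full ROUGE implementation (which also handles stemming and the longest-common-subsequence variants for ROUGE‑L/Lsum).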
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# 1) Load the model and tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript")

# Use a GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# 2) Summarize a long transcript
transcript = """
Alice: Good morning everyone, let’s get started…
Bob: I updated the design mockups…
… (thousands of words) …
"""
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt",
).to(device)

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,
    early_stopping=True,
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("📝 Summary:", summary)
```
---
## Training Data
- **Dataset:** MeetingBank
- **Splits:** Train (5,000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding‑window chunking for sequences longer than 4096 tokens
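
The sliding‑window chunking can be sketched as follows. The window and stride values here are illustrative assumptions (a stride smaller than the window gives overlapping chunks), not the exact parameters used during training:

```python
def chunk_tokens(token_ids, window=4096, stride=3584):
    """Yield overlapping windows of token ids (overlap = window - stride)."""
    if len(token_ids) <= window:
        # Short sequences pass through unchanged.
        yield token_ids
        return
    for start in range(0, len(token_ids), stride):
        yield token_ids[start:start + window]
        if start + window >= len(token_ids):
            break  # the last window already reaches the end

# Example: a 10,000-token sequence splits into three overlapping chunks.
ids = list(range(10_000))
print([len(c) for c in chunk_tokens(ids)])  # → [4096, 4096, 2832]
```

Each chunk would then be summarized independently, with the per-chunk summaries concatenated (or summarized again) to cover the full transcript.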