---
license: apache-2.0
datasets:
- huuuyeah/meetingbank
language:
- en
metrics:
- rouge
base_model:
- google/bigbird-pegasus-large-bigpatent
pipeline_tag: summarization
library_name: transformers
---
# MeetingScript

> A BigBird‑Pegasus model fine‑tuned on the MeetingBank dataset to summarize meeting transcripts.

📦 **Model Files**  
- **Weights & config**: `pytorch_model.bin`, `config.json`  
- **Tokenizer**: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json`  
- **Generation defaults**: `generation_config.json`

🔗 **Code:** https://github.com/kevin0437/Meeting_scripts

---

## Model Description

**MeetingScript** is a sequence‑to‑sequence model based on  
[google/bigbird-pegasus-large-bigpatent](https://huggingface.co/google/bigbird-pegasus-large-bigpatent)  
and fine‑tuned on the [MeetingBank](https://huggingface.co/datasets/huuuyeah/meetingbank) corpus of meeting transcripts paired with human‐written summaries.  
It is designed to take long meeting transcripts (up to 4096 tokens) and produce concise, coherent summaries.

---

## Evaluation Results

Evaluated on the held‑out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, `max_length=600`):

| Metric      | F1 Score (%) |
|-------------|-------------:|
| **ROUGE‑1** |       51.5556 |
| **ROUGE‑2** |       38.5378 |
| **ROUGE‑L** |       48.0786 |
| **ROUGE‑Lsum** |    48.0142 |
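
For intuition, ROUGE‑1 F1 is the harmonic mean of unigram precision and recall between a generated summary and its reference. A minimal sketch (whitespace tokenization, no stemming; the reported scores above use the standard ROUGE implementation, not this simplification):

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokens."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    # Clipped overlap: each unigram counts at most as often as in the reference
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the council approved the budget",
                "the city council approved the budget"))  # ≈ 0.909
```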

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# 1) Load from the Hub and move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript").to(device)

# 2) Summarize a long transcript (inputs must live on the same device as the model)
transcript = """
    Alice: Good morning everyone, let’s get started…
    Bob: I updated the design mockups…
    … (thousands of words) …
"""
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt"
).to(device)

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,
    early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("📝 Summary:", summary)
```

---

## Training Data

- **Dataset:** MeetingBank
- **Splits:** Train (5,000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding‑window chunking for sequences > 4096 tokens
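
The exact chunking parameters are not documented here; as an illustration, a sliding window over token ids with a hypothetical 50% overlap (`stride` is an assumption, not the value used in training) could look like:

```python
def chunk_tokens(token_ids, window=4096, stride=2048):
    """Split a token-id sequence into overlapping windows of `window` tokens."""
    chunks = []
    for start in range(0, max(len(token_ids), 1), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # last window already reaches the end of the sequence
    return chunks

# Toy example: a 10-token sequence with window=4, stride=2
print(chunk_tokens(list(range(10)), window=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each chunk would then be summarized independently and the partial summaries merged or re‑summarized.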