---
license: mit
tags:
- quote-attribution
- speaker-identification
- dialogue-attribution
- nlp
- transformers
- bert
language:
- en
datasets:
- aNameNobodyChose/quote-speaker-attribution
---
# QuoteCaster: Speaker-Aware Quote Encoder
**QuoteCaster** is a fine-tuned BERT-based model that encodes dialogue quotes together with their surrounding context in order to **identify or group quotes by speaker**, even in stories the model has never seen before.
This encoder supports unsupervised or few-shot quote attribution by mapping quotes spoken in similar styles (with context) to nearby points in embedding space, which makes it well suited to clustering and nearest-neighbor speaker inference.
---
## Model Details
- **Base model**: `bert-base-uncased`
- **Trained with**: Triplet Margin Loss
- **Objective**: Pull quotes from the same speaker together, push different ones apart
- **Input**: `context [SEP] quote`
- **Output**: `[CLS]` embedding as a 768-dimensional vector
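
The triplet objective above can be sketched with PyTorch's built-in `TripletMarginLoss`. This is an illustrative snippet, not the actual training code: the random tensors stand in for `[CLS]` embeddings produced by the encoder.

```python
import torch

# Illustrative sketch of the triplet objective: pull same-speaker
# embeddings together, push different-speaker embeddings apart.
# Random tensors stand in for 768-d [CLS] embeddings from the encoder.
triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(8, 768, requires_grad=True)  # quotes by speaker A
positive = torch.randn(8, 768)                    # other quotes by speaker A
negative = torch.randn(8, 768)                    # quotes by speaker B

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # gradients would update the encoder during fine-tuning
```

During fine-tuning, the anchor, positive, and negative would each be the `[CLS]` output of the encoder for a `context [SEP] quote` pair, so the gradient flows back into BERT itself.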
---
## Use Case
QuoteCaster is ideal for:
- Clustering quotes by speaker using KMeans or agglomerative clustering
- Zero-shot speaker inference on unseen stories
- Dialogue structure analysis in novels, scripts, or plays
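
The clustering use case can be sketched with scikit-learn. In practice each row would be a 768-dimensional `[CLS]` embedding from the encoder; here synthetic vectors keep the snippet self-contained, with two well-separated groups standing in for two speakers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-ins for quote embeddings: two tight groups,
# one per "speaker", in the encoder's 768-d embedding space.
rng = np.random.default_rng(0)
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(5, 768))
speaker_b = rng.normal(loc=1.0, scale=0.1, size=(5, 768))
embeddings = np.vstack([speaker_a, speaker_b])

# Group quotes by inferred speaker
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_  # one cluster id per quote
```

With real QuoteCaster embeddings, `n_clusters` would be the (known or estimated) number of speakers in the story, and quotes sharing a cluster id are attributed to the same speaker.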
---
## Example: Inference with QuoteCaster
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the fine-tuned encoder and its tokenizer
model = AutoModel.from_pretrained("aNameNobodyChose/quote-caster-encoder")
tokenizer = AutoTokenizer.from_pretrained("aNameNobodyChose/quote-caster-encoder")
model.eval()

# Encode a quote with its surrounding context
def encode_quote(context, quote):
    text = f"{context} [SEP] {quote}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # [CLS] embedding, shape (1, 768)
```