# Link Anchor Detection Model
A fine-tuned DeBERTa v3 model that predicts which words in text should be hyperlinks. Trained on 10,273 pages scraped from The Keyword (Google's official blog), where editorial linking decisions serve as ground truth labels.
## How It Works
Given raw text, the model performs token-level binary classification: each token is labeled `LINK` or `O` (not a link). This identifies anchor text candidates: words that a human editor would likely hyperlink.
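For illustration, here is what the labeling scheme looks like on a short sentence (the sentence and the linking decision are invented for this example):

```python
# Word-level view of the labeling scheme (illustrative example only).
words  = ["Google", "announced", "updates", "to", "Google", "Search", "today"]
labels = ["O",      "O",         "O",       "O",  "LINK",   "LINK",   "O"]
# A contiguous run of LINK labels ("Google Search") forms one anchor-text span.
```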
### Pipeline

```
sitemap.xml (10,274 URLs from blog.google)
    │
    ▼
scrape.py ──▶ scraped.db (SQLite, 10,273 pages with markdown + inline links)
    │
    ▼
_prep.py ──▶ train_windows.jsonl / val_windows.jsonl
    │   • Strip markdown, annotate link spans as [LINK_START]...[LINK_END]
    │   • Tokenize with DeBERTa, align labels to tokens
    │   • Sliding windows (512 tokens, stride 128)
    │   • 90/10 doc-level split
    ▼
train.py ──▶ model_link_token_cls/
    │   • Fine-tune microsoft/mdeberta-v3-base
    │   • Weighted cross-entropy (~25x for minority class)
    │   • 3 epochs, lr 2e-5, batch 16
    ▼
app.py ──▶ Streamlit UI
        • Sliding-window inference (handles any text length)
        • Word-level highlighting with confidence scores
```
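The windowing step can be sketched roughly as follows (a minimal illustration of 512-token windows with a 128-token overlap; `make_windows` is an illustrative name, not the actual interface of `_prep.py`):

```python
# Minimal sketch of sliding-window chunking (512-token windows, stride 128,
# where stride is interpreted as the overlap, per the HuggingFace convention).
def make_windows(input_ids, labels, max_len=512, stride=128):
    windows = []
    step = max_len - stride  # consecutive windows overlap by `stride` tokens
    for start in range(0, max(1, len(input_ids)), step):
        end = start + max_len
        windows.append({"input_ids": input_ids[start:end],
                        "labels": labels[start:end]})
        if end >= len(input_ids):
            break  # the final window already reaches the end of the document
    return windows
```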
## Data
Source: blog.google sitemap (The Keyword, Google's product and technology blog).
| Metric | Value |
|---|---|
| Pages scraped | 10,273 |
| Total tokens | 8.2M |
| Link tokens | 286,799 (3.48%) |
| Training windows | 21,264 |
| Validation windows | 2,402 |
The class imbalance (96.5% non-link vs 3.5% link) is handled with weighted cross-entropy loss during training.
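In PyTorch this amounts to passing class weights into the loss (a minimal sketch; the ~25x figure comes from the pipeline notes above, and the exact values in `train.py` may differ):

```python
import torch
import torch.nn as nn

# Upweight the minority LINK class (~25x per the pipeline notes above);
# -100 is the conventional ignore index for padding/special tokens.
class_weights = torch.tensor([1.0, 25.0])  # [O, LINK]
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(16, 512, 2)         # (batch, seq_len, num_labels)
labels = torch.randint(0, 2, (16, 512))  # real batches also contain -100s
loss = loss_fn(logits.view(-1, 2), labels.view(-1))
```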
## Model
- Base: `microsoft/mdeberta-v3-base` (`DebertaV2ForTokenClassification`)
- Labels: `O` (0), `LINK` (1)
- Max position: 512 tokens
- Parameters: 12 layers, 768 hidden, 12 attention heads
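A model with this head can be instantiated from the base checkpoint like so (a sketch of the setup implied by the list above, not necessarily `train.py`'s exact code):

```python
from transformers import AutoModelForTokenClassification

# Two-label token-classification head on the multilingual DeBERTa v3 base.
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/mdeberta-v3-base",
    num_labels=2,
    id2label={0: "O", 1: "LINK"},
    label2id={"O": 0, "LINK": 1},
)
```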
## Evaluation Results
| Metric | Value |
|---|---|
| Accuracy | 95.6% |
| Precision | 42.4% |
| Recall | 79.5% |
| F1 | 0.553 |
High recall means the model catches most link-worthy text. Lower precision reflects the inherent ambiguity: many words could be linked, so "false positives" are often reasonable candidates.
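When precision matters more for a given application, the decision threshold can be raised above the default 0.5; this is what the confidence slider in the Streamlit app adjusts (0.8 below is an arbitrary example, not a tuned value):

```python
import torch

# probs: per-token P(LINK), as produced in the Python example below.
probs = torch.tensor([0.10, 0.62, 0.91, 0.45])
threshold = 0.8           # arbitrary example; higher values trade recall for precision
print(probs > threshold)  # tensor([False, False,  True, False])
```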
## Usage
### Streamlit App
```bash
streamlit run app.py
```
Paste text, adjust the confidence threshold, and see predicted link anchors highlighted in green.
### Python
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("model_link_token_cls")
model = AutoModelForTokenClassification.from_pretrained("model_link_token_cls")
model.eval()

text = "Google announced new features for Search and Gmail today."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)

with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
probs = F.softmax(logits, dim=-1)[0, :, 1]  # P(LINK) per token

for token, offset, p in zip(
    tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
    enc["offset_mapping"][0],
    probs,
):
    if offset[0] == offset[1]:
        continue  # skip special tokens
    if p > 0.5:
        print(f"  LINK: {text[offset[0]:offset[1]]} ({p:.2%})")
```
## Scripts
| File | Purpose |
|---|---|
| `scrape.py` | Async Playwright scraper; reads sitemap.xml, saves to SQLite + markdown files |
| `_prep.py` | Cleans markdown, annotates link spans, tokenizes, creates sliding windows |
| `train.py` | Fine-tunes DeBERTa with weighted loss, W&B tracking |
| `app.py` | Streamlit inference app with sliding-window support |
| `_count.py` | Token length analysis utility |
| `_detok.py` | Token ID decoder (Streamlit) |
## Requirements
- Python 3.8+
- PyTorch
- Transformers
- Playwright (for scraping)
- Streamlit (for inference app)