# Link Anchor Detection Model
A fine-tuned DeBERTa v3 model that predicts which words in text should be hyperlinks. Trained on 10,273 pages scraped from The Keyword (Google's official blog), where editorial linking decisions serve as ground truth labels.
## How It Works
Given raw text, the model performs token-level binary classification: each token is labeled `LINK` or `O` (not a link). This identifies anchor text candidates: words that a human editor would likely hyperlink.
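For illustration, here is what the labeling scheme looks like on a short sentence (the sentence and the linking decision are invented for this example):

```python
# Word-level view of the labeling scheme (illustrative example only).
words  = ["Google", "announced", "updates", "to", "Google", "Search", "today"]
labels = ["O",      "O",         "O",       "O",  "LINK",   "LINK",   "O"]
# A contiguous run of LINK labels ("Google Search") forms one anchor-text span.
```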
### Pipeline

```
sitemap.xml (10,274 URLs from blog.google)
    │
    ▼
scrape.py ──▶ scraped.db (SQLite, 10,273 pages with markdown + inline links)
    │
    ▼
_prep.py ──▶ train_windows.jsonl / val_windows.jsonl
    │   • Strip markdown, annotate link spans as [LINK_START]...[LINK_END]
    │   • Tokenize with DeBERTa, align labels to tokens
    │   • Sliding windows (512 tokens, stride 128)
    │   • 90/10 doc-level split
    ▼
train.py ──▶ model_link_token_cls/
    │   • Fine-tune microsoft/mdeberta-v3-base
    │   • Weighted cross-entropy (~25x for minority class)
    │   • 3 epochs, lr 2e-5, batch 16
    ▼
app.py ──▶ Streamlit UI
        • Sliding-window inference (handles any text length)
        • Word-level highlighting with confidence scores
```
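The windowing step can be sketched roughly as follows (a minimal illustration of 512-token windows with a 128-token overlap; `make_windows` is an illustrative name, not the actual interface of `_prep.py`):

```python
# Minimal sketch of sliding-window chunking (512-token windows, stride 128,
# where stride is interpreted as the overlap, per the HuggingFace convention).
def make_windows(input_ids, labels, max_len=512, stride=128):
    windows = []
    step = max_len - stride  # consecutive windows overlap by `stride` tokens
    for start in range(0, max(1, len(input_ids)), step):
        end = start + max_len
        windows.append({"input_ids": input_ids[start:end],
                        "labels": labels[start:end]})
        if end >= len(input_ids):
            break  # the final window already reaches the end of the document
    return windows
```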
## Data
Source: blog.google sitemap (The Keyword, Google's product and technology blog).
| Metric | Value |
|---|---|
| Pages scraped | 10,273 |
| Total tokens | 8.2M |
| Link tokens | 286,799 (3.48%) |
| Training windows | 21,264 |
| Validation windows | 2,402 |
The class imbalance (96.5% non-link vs 3.5% link) is handled with weighted cross-entropy loss during training.
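In PyTorch this amounts to passing class weights into the loss (a minimal sketch; the ~25x figure comes from the pipeline notes above, and the exact values in `train.py` may differ):

```python
import torch
import torch.nn as nn

# Upweight the minority LINK class (~25x per the pipeline notes above);
# -100 is the conventional ignore index for padding/special tokens.
class_weights = torch.tensor([1.0, 25.0])  # [O, LINK]
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(16, 512, 2)         # (batch, seq_len, num_labels)
labels = torch.randint(0, 2, (16, 512))  # real batches also contain -100s
loss = loss_fn(logits.view(-1, 2), labels.view(-1))
```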
## Model
- Base: `microsoft/mdeberta-v3-base` (`DebertaV2ForTokenClassification`)
- Labels: `O` (0), `LINK` (1)
- Max position: 512 tokens
- Parameters: 12 layers, 768 hidden, 12 attention heads
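A model with this head can be instantiated from the base checkpoint like so (a sketch of the setup implied by the list above, not necessarily `train.py`'s exact code):

```python
from transformers import AutoModelForTokenClassification

# Two-label token-classification head on the multilingual DeBERTa v3 base.
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/mdeberta-v3-base",
    num_labels=2,
    id2label={0: "O", 1: "LINK"},
    label2id={"O": 0, "LINK": 1},
)
```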
## Evaluation Results
| Metric | Value |
|---|---|
| Accuracy | 95.6% |
| Precision | 42.4% |
| Recall | 79.5% |
| F1 | 0.553 |
High recall means the model catches most link-worthy text. Lower precision reflects the inherent ambiguity: many words could be linked, so "false positives" are often reasonable candidates.
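When precision matters more for a given application, the decision threshold can be raised above the default 0.5; this is what the confidence slider in the Streamlit app adjusts (0.8 below is an arbitrary example, not a tuned value):

```python
import torch

# probs: per-token P(LINK), as produced in the Python example below.
probs = torch.tensor([0.10, 0.62, 0.91, 0.45])
threshold = 0.8           # arbitrary example; higher values trade recall for precision
print(probs > threshold)  # tensor([False, False,  True, False])
```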
## Usage
### Streamlit App
```bash
streamlit run app.py
```
Paste text, adjust the confidence threshold, and see predicted link anchors highlighted in green.
### Python
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("model_link_token_cls")
model = AutoModelForTokenClassification.from_pretrained("model_link_token_cls")
model.eval()

text = "Google announced new features for Search and Gmail today."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)

with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
probs = F.softmax(logits, dim=-1)[0, :, 1]  # P(LINK) per token

for token, offset, p in zip(
    tokenizer.convert_ids_to_tokens(enc["input_ids"][0]),
    enc["offset_mapping"][0],
    probs,
):
    if offset[0] == offset[1]:
        continue  # skip special tokens
    if p > 0.5:
        print(f"  LINK: {text[offset[0]:offset[1]]} ({p:.2%})")
```
## Scripts
| File | Purpose |
|---|---|
| `scrape.py` | Async Playwright scraper; reads sitemap.xml, saves to SQLite + markdown files |
| `_prep.py` | Cleans markdown, annotates link spans, tokenizes, creates sliding windows |
| `train.py` | Fine-tunes DeBERTa with weighted loss, W&B tracking |
| `app.py` | Streamlit inference app with sliding-window support |
| `_count.py` | Token length analysis utility |
| `_detok.py` | Token ID decoder (Streamlit) |
## Requirements
- Python 3.8+
- PyTorch
- Transformers
- Playwright (for scraping)
- Streamlit (for inference app)