Link Anchor Detection Model

A fine-tuned mDeBERTa v3 model that predicts which words in a text should be hyperlinked. Trained on 10,273 pages scraped from The Keyword (Google's official blog), where editorial linking decisions serve as ground-truth labels.

How It Works

Given raw text, the model performs token-level binary classification: each token is labeled LINK or O (not a link). This identifies anchor-text candidates: words that a human editor would likely hyperlink.
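
For example (an invented sentence, not drawn from the training data), the word-level targets might look like this:

# Hypothetical word-level targets: 1 = LINK (part of an anchor), 0 = O
words  = ["Google", "announced", "new", "features", "for", "Search", "today", "."]
labels = [ 0,        0,           0,     0,          0,     1,        0,       0 ]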

Pipeline

sitemap.xml (10,274 URLs from blog.google)
        │
        ▼
   scrape.py ──► scraped.db (SQLite, 10,273 pages with markdown + inline links)
        │
        ▼
    _prep.py ──► train_windows.jsonl / val_windows.jsonl
        │         • Strip markdown, annotate link spans as [LINK_START]...[LINK_END]
        │         • Tokenize with DeBERTa, align labels to tokens
        │         • Sliding windows (512 tokens, stride 128)
        │         • 90/10 doc-level split
        ▼
   train.py ──► model_link_token_cls/
        │         • Fine-tune microsoft/mdeberta-v3-base
        │         • Weighted cross-entropy (~25x for minority class)
        │         • 3 epochs, lr 2e-5, batch 16
        ▼
    app.py ──► Streamlit UI
                  • Sliding-window inference (handles any text length)
                  • Word-level highlighting with confidence scores
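
The windowing step above (512 tokens, stride 128) can be sketched as follows. This is a minimal illustration with assumed names, reading the stride as the overlap between consecutive windows; _prep.py's actual implementation may differ.

def make_windows(input_ids, labels, max_len=512, stride=128):
    # Each window overlaps the previous one by `stride` tokens,
    # so the window start advances by max_len - stride = 384 tokens.
    windows, step = [], max_len - stride
    for start in range(0, len(input_ids), step):
        windows.append({
            "input_ids": input_ids[start:start + max_len],
            "labels": labels[start:start + max_len],
        })
        if start + max_len >= len(input_ids):
            break
    return windows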

Data

Source: blog.google sitemap (The Keyword, Google's product and technology blog).

Metric               Value
Pages scraped        10,273
Total tokens         8.2M
Link tokens          286,799 (3.48%)
Training windows     21,264
Validation windows   2,402

The class imbalance (96.5% non-link vs 3.5% link) is handled with weighted cross-entropy loss during training.
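
A minimal sketch of how such a weighted loss might be set up; the ~25x weight matches the figure in the pipeline above, though train.py's exact values may differ.

import torch
import torch.nn as nn

# ~25x penalty for mistakes on the minority LINK class (index 1)
class_weights = torch.tensor([1.0, 25.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# logits: (batch, seq_len, 2); labels: (batch, seq_len), -100 on padding/special tokens
# loss = loss_fn(logits.view(-1, 2), labels.view(-1))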

Model

  • Base: microsoft/mdeberta-v3-base (DebertaV2ForTokenClassification)
  • Labels: O (0), LINK (1)
  • Max position: 512 tokens
  • Parameters: 12 layers, 768 hidden, 12 attention heads
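
These values can be checked directly from the checkpoint's config (the expected id2label mapping below follows the labels listed above):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("model_link_token_cls")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)  # 12 768 12
print(cfg.id2label)  # expected: {0: 'O', 1: 'LINK'}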

Evaluation Results

Metric      Value
Accuracy    95.6%
Precision   42.4%
Recall      79.5%
F1          0.553

High recall means the model catches most link-worthy text. The lower precision reflects the task's inherent ambiguity: many words could reasonably be linked, so "false positives" are often plausible candidates.

Usage

Streamlit App

streamlit run app.py

Paste text, adjust the confidence threshold, and see predicted link anchors highlighted in green.

Python

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("model_link_token_cls")
model = AutoModelForTokenClassification.from_pretrained("model_link_token_cls")
model.eval()

text = "Google announced new features for Search and Gmail today."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
with torch.no_grad():
    logits = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]).logits
    probs = F.softmax(logits, dim=-1)[0, :, 1]  # P(LINK) per token

# Report every token whose character span scores above the 0.5 threshold
for offset, p in zip(enc["offset_mapping"][0], probs):
    if offset[0] == offset[1]:
        continue  # special tokens have empty (0, 0) spans
    if p > 0.5:
        print(f"  LINK: {text[offset[0]:offset[1]]} ({p:.2%})")

Scripts

File       Purpose
scrape.py  Async Playwright scraper; reads sitemap.xml, saves to SQLite + markdown files
_prep.py   Cleans markdown, annotates link spans, tokenizes, creates sliding windows
train.py   Fine-tunes DeBERTa with weighted loss, W&B tracking
app.py     Streamlit inference app with sliding-window support
_count.py  Token length analysis utility
_detok.py  Token ID decoder (Streamlit)
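
For reference, the core of an async Playwright fetch looks roughly like this (a sketch with assumed names; scrape.py's concurrency, markdown extraction, and SQLite schema are not shown):

import asyncio
from playwright.async_api import async_playwright

async def fetch_page(url: str) -> str:
    # Render the page in headless Chromium and return its HTML
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html

# html = asyncio.run(fetch_page("https://blog.google/..."))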

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers
  • Playwright (for scraping)
  • Streamlit (for inference app)