OSM-Det: Online Social Media Detector

Model Description

OSM-Det (Online Social Media Detector) is an AI-generated text detection model designed specifically for social media content. It was introduced in the paper "Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media".

Figure: overview of the AIGTBench pipeline.

Model Details

  • Base Model: allenai/longformer-base-4096
  • Model Type: Text Classification (Binary)
  • Architecture: Longformer with classification head
  • Parameters: 149M (F32, safetensors)
  • Max Sequence Length: 4096 tokens
  • Training Data: AIGTBench

Quick Start

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("tarryzhang/OSM-Det")
tokenizer = AutoTokenizer.from_pretrained("tarryzhang/OSM-Det")

# Example text
text = "Your text to analyze here..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", max_length=4096, truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()

# Interpret results
labels = ["Human-written", "AI-generated"]
confidence = predictions[0][predicted_class].item()

print(f"Prediction: {labels[predicted_class]}")
print(f"Confidence: {confidence:.3f}")

Batch Processing

def detect_ai_text_batch(texts, model, tokenizer, max_length=4096, batch_size=32):
    results = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        
        # Tokenize batch
        inputs = tokenizer(
            batch_texts, 
            return_tensors="pt", 
            max_length=max_length, 
            truncation=True, 
            padding=True
        )
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
            predicted_classes = torch.argmax(predictions, dim=1)
            
        # Store results
        for j, text in enumerate(batch_texts):
            pred_class = predicted_classes[j].item()
            confidence = predictions[j][pred_class].item()
            results.append({
                'text': text,
                'prediction': 'AI-generated' if pred_class == 1 else 'Human-written',
                'confidence': confidence,
                'ai_probability': predictions[j][1].item(),
                'human_probability': predictions[j][0].item()
            })
    
    return results
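
Example usage of the helper above (the sample strings are placeholders):

texts = [
    "First post to analyze...",
    "Second post to analyze...",
]
results = detect_ai_text_batch(texts, model, tokenizer)
for r in results:
    print(f"{r['prediction']} ({r['confidence']:.3f}): {r['text'][:60]}")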

Labels

  • 0: Human-written text
  • 1: AI-generated text
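
A quick sanity check in code; whether this checkpoint ships human-readable label names in its config is an assumption, so if it only reports the default LABEL_0/LABEL_1, the table above is the authoritative mapping:

print(model.config.id2label)
# Expected ordering: index 0 = human-written, index 1 = AI-generated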

Training Details

Training Data

OSM-Det was trained on AIGTBench, which includes:

  • 28.77M AI-generated samples from 12 different LLMs
  • 13.55M human-written samples
  • Content from Medium, Quora, and Reddit platforms

Training Configuration

  • Base Model: Longformer-base-4096
  • Training Epochs: 10
  • Batch Size: 5 per device
  • Gradient Accumulation: 8 steps
  • Learning Rate: 2e-5
  • Weight Decay: 0.01
  • Max Sequence Length: 4096 tokens
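
With gradient accumulation over 8 steps, the effective batch size is 40 examples per device. Below is a minimal sketch of how these hyperparameters map onto the Hugging Face TrainingArguments/Trainer API; the tiny placeholder dataset stands in for AIGTBench, and this is not the authors' released training script:

from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# Placeholder data standing in for tokenized AIGTBench (illustration only)
enc = tokenizer(
    ["a human-written post", "an AI-generated post"],
    truncation=True, max_length=4096, padding=True,
)
train_dataset = Dataset.from_dict({**enc, "labels": [0, 1]})

args = TrainingArguments(
    output_dir="osm-det",
    num_train_epochs=10,
    per_device_train_batch_size=5,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()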

Citation

@inproceedings{SZSZLBZH25,
    title = {{Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media}},
    author = {Zhen Sun and Zongmin Zhang and Xinyue Shen and Ziyi Zhang and Yule Liu and Michael Backes and Yang Zhang and Xinlei He},
    booktitle = {{Annual Meeting of the Association for Computational Linguistics (ACL)}},
    pages = {},
    publisher = {ACL},
    year = {2025}
}

License

Apache 2.0
