AI Text Detector - HC3 Dataset

This model is a DistilBERT classifier fine-tuned to distinguish AI-generated text from human-written text. It was trained on the HC3 dataset hosted on Hugging Face.

Model Details

  • Base Model: distilbert-base-uncased
  • Task: Binary text classification (Human vs AI-generated)
  • Dataset: HC3 (Human ChatGPT Comparison Corpus)
  • Training Framework: PyTorch + Transformers

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("VSAsteroid/ai-text-detector-hc3")
model = AutoModelForSequenceClassification.from_pretrained("VSAsteroid/ai-text-detector-hc3")

# Example prediction
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
# Get prediction
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = torch.max(predictions).item()

label = "AI-Generated" if predicted_class == 1 else "Human-Written"
print(f"Prediction: {label} (Confidence: {confidence:.3f})")

Labels

  • 0: Human-Written
  • 1: AI-Generated
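
As a minimal sketch of how the two class indices map back to labels, the helper below applies a softmax to a pair of raw logits and looks up the winning index. It uses only the standard library so the arithmetic is visible; the function name `decode_logits` and the example logits are illustrative assumptions, not part of the model's API.

```python
import math

# Label mapping as listed in the Labels section.
ID2LABEL = {0: "Human-Written", 1: "AI-Generated"}

def decode_logits(logits):
    """Turn raw logits [human, ai] into (label, confidence) via softmax."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the highest probability is the predicted class.
    predicted_class = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[predicted_class], probs[predicted_class]

# Made-up logits for illustration only, not real model output.
label, confidence = decode_logits([-1.2, 2.3])
print(f"Prediction: {label} (Confidence: {confidence:.3f})")
```

With real model output, the same function could be fed `outputs.logits[0].tolist()` from the usage example.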

Training Details

  • Epochs: 2-3
  • Batch Size: 8-16
  • Learning Rate: 2e-5
  • Max Sequence Length: 256
  • Optimizer: AdamW with linear scheduling
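
To illustrate what "linear scheduling" means for the learning rate, here is a small standard-library sketch of a linear decay from the base rate to zero. The card does not say whether warmup was used, so `warmup_steps=0` is an assumption, and `total_steps=1000` is a hypothetical value chosen only for the example.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run length; the real number of steps depends on dataset size,
# batch size, and epochs, none of which are pinned down exactly in this card.
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, f"{linear_lr(step, total_steps):.2e}")
```

In a real training run this curve is typically produced by a scheduler attached to the AdamW optimizer rather than computed by hand.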

Performance

The model performs well at separating human-written from AI-generated text on content similar to the HC3 dataset, which consists of question-and-answer style responses. Results on other kinds of content may differ.
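
Since no metrics are listed here, one way to check performance on your own labeled data is to compute accuracy, precision, recall, and F1 for the AI-Generated class. The sketch below is standard-library only; the function name `binary_metrics` and the toy labels are illustrative assumptions and do not represent results from this model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only (0 = Human-Written, 1 = AI-Generated).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here `y_pred` would come from running the usage example over each text in a held-out set.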

Limitations

  • The model is trained only on the HC3 dataset and may not generalize well to other domains or writing styles
  • HC3's AI-generated samples come from ChatGPT, so text produced by other AI models may be detected less reliably
  • Short texts give the model less signal and may be harder to classify accurately
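
One possible way to act on these limitations is to withhold a verdict when the input is very short or the model's confidence is weak. The sketch below assumes you already have the softmax probabilities and a token count; the function name `classify_or_abstain` and both cutoffs (`min_tokens=30`, `min_confidence=0.9`) are hypothetical and should be tuned on your own data.

```python
def classify_or_abstain(probs, n_tokens, min_tokens=30, min_confidence=0.9):
    """Return a label, or "Uncertain" for short inputs or low-confidence scores.

    probs: [p_human, p_ai], e.g. softmax over the model logits.
    n_tokens: number of tokens in the input text.
    min_tokens, min_confidence: hypothetical cutoffs, not values from the model card.
    """
    labels = ["Human-Written", "AI-Generated"]
    best = max(range(len(probs)), key=probs.__getitem__)
    if n_tokens < min_tokens or probs[best] < min_confidence:
        return "Uncertain"
    return labels[best]

# Made-up probabilities for illustration only.
print(classify_or_abstain([0.03, 0.97], n_tokens=120))  # long and confident
print(classify_or_abstain([0.03, 0.97], n_tokens=8))    # too short
print(classify_or_abstain([0.45, 0.55], n_tokens=120))  # low confidence
```

With the usage example, `n_tokens` could be taken from `inputs["input_ids"].shape[1]` and `probs` from `predictions[0].tolist()`.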

Citation

If you use this model, please cite the HC3 dataset:

@misc{guo2023close,
    title={How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection},
    author={Biyang Guo and Xin Zhang and Ziyuan Wang and Minqi Jiang and Jinran Nie and Yuxuan Ding and Jianwei Yue and Yupeng Wu},
    year={2023},
    eprint={2301.07597},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}