setfit-paraphrase-multilingual-MiniLM-L12-v2

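Model Description

This is a binary classifier for Chinese text, fine-tuned with the SetFit method, that separates useful text content from useless text content.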

Model Architecture

Base model: paraphrase-multilingual-MiniLM-L12-v2
Fine-tuning method: SetFit (Sentence Transformer Fine-tuning)
Classifier: neural network classifier
Task type: binary classification (useful text vs. useless text)

Training Data

Dataset: finetuning_samples_rectify.csv
Class distribution:
- Useless (0): ~87.9%
- Useful (1-5): ~12.1%
Training samples: 10,135
Validation samples: 2,172
Test samples: 2,172
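
To put the split sizes and class imbalance in context, the short sketch below computes each split's share of the data and the accuracy a trivial "always Useless" classifier would reach. It assumes the three splits together make up the whole dataset and that each split has roughly the class balance reported above; neither point is stated explicitly in this card.

```python
# Counts and class share as reported in the Training Data section.
splits = {"train": 10_135, "validation": 2_172, "test": 2_172}
useless_share = 0.879  # approximate share of class 0 (Useless)

# Assumption: the three splits together cover the full dataset.
total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

# Assumption: every split has about the same class balance, so a classifier
# that always predicts "Useless" would reach roughly this accuracy.
majority_baseline_accuracy = useless_share

for name, fraction in fractions.items():
    print(f"{name}: {splits[name]} samples ({fraction:.1%})")
print(f"total: {total} samples")
print(f"majority-class accuracy baseline: {majority_baseline_accuracy:.3f}")
```

With this much imbalance, accuracy alone says little about the minority class, which is why macro-averaged metrics are worth reading alongside it.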

Performance Metrics

Evaluated on the test set:

Fine-tuned Model Performance

Accuracy: 0.8250
F1-Score (macro): 0.7078
F1-Score (weighted): 0.8481
Precision (macro): 0.6783
Recall (macro): 0.8103
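
As a reminder of how the macro and weighted averages above differ, here is a minimal sketch that derives them from a confusion matrix. The counts in `example` are hypothetical and chosen only to illustrate an imbalanced two-class case; they are not this model's test results, and `summarize` is an illustrative helper, not part of any library.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(confusion):
    """confusion[i][j] = number of samples with true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1s, supports = [], [], [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        precisions.append(safe_div(tp, tp + fp))
        recalls.append(safe_div(tp, tp + fn))
        f1s.append(safe_div(2 * tp, 2 * tp + fp + fn))
        supports.append(tp + fn)
    return {
        "accuracy": safe_div(sum(confusion[c][c] for c in range(n)), total),
        # Macro: every class counts equally, regardless of its size.
        "precision_macro": sum(precisions) / n,
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
        # Weighted: each class counts in proportion to its number of true samples.
        "f1_weighted": safe_div(sum(f * s for f, s in zip(f1s, supports)), total),
    }


# Hypothetical counts: 90 Useless (class 0) and 10 Useful (class 1) samples.
example = [[80, 10], [2, 8]]
scores = summarize(example)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because the weighted average is dominated by the large Useless class, it tends to sit well above the macro average when the minority class is harder to predict, which matches the pattern in the figures reported here.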

Comparison with the Original Model

F1-Score improvement: +0.0149
Accuracy improvement: +0.0134

Usage

Loading the Model

from sentence_transformers import SentenceTransformer

# Load the fine-tuned model
model = SentenceTransformer("lyingbarrelhome/setfit-paraphrase-multilingual-MiniLM-L12-v2")

# Encode texts into sentence embeddings
texts = ["这是一个示例文本", "另一个示例"]
embeddings = model.encode(texts)

Using the Model with a Classifier

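from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Load the fine-tuned embedding model
model = SentenceTransformer("lyingbarrelhome/setfit-paraphrase-multilingual-MiniLM-L12-v2")

# Train a logistic regression classifier on the embeddings (recommended),
# or load the pretrained neural network classifier instead, if available.
# train_texts and train_labels are placeholders: replace them with your own
# labeled data (0 = useless, 1 = useful).
train_texts = ["这是一个示例文本", "另一个示例"]
train_labels = [1, 0]
classifier = LogisticRegression(max_iter=1000)
classifier.fit(model.encode(train_texts), train_labels)

# Predict the label and the class probabilities for a single text
def predict_text(text):
    embedding = model.encode([text])
    prediction = classifier.predict(embedding)
    probability = classifier.predict_proba(embedding)
    return prediction[0], probability[0]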
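
The pair returned by `predict_text` can be turned into a readable result with a small helper such as the one sketched below. The names `LABEL_NAMES` and `describe_prediction` are hypothetical, and the sketch assumes the classifier was trained on labels 0 and 1, so that the probability columns follow that order (scikit-learn orders them by `classifier.classes_`).

```python
# Label names taken from the class distribution in this card.
LABEL_NAMES = {0: "Useless", 1: "Useful"}


def describe_prediction(prediction, probability, classes=(0, 1)):
    """Map a predicted label and its probability vector to a readable dict.

    `classes` must list the labels in the same order as the probability
    columns; (0, 1) is an assumption about how the classifier was trained.
    """
    label = int(prediction)
    confidence = float(probability[list(classes).index(label)])
    return {"label": label, "name": LABEL_NAMES[label], "confidence": confidence}


# Example with a made-up probability vector.
print(describe_prediction(1, [0.2, 0.8]))
```

In practice the call would look like `describe_prediction(*predict_text(text))`.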

Training Configuration

LoRA configuration: r=16, alpha=32, dropout=0.1
Target modules: query, key, value, dense
Epochs: 1
Batch size: 32
Learning rate: 0.0005
Loss function: TripletLoss
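
To make the settings above more concrete, the sketch below works through what the LoRA values imply for a single weight matrix and how a triplet loss is computed. It rests on several assumptions not stated in this card: a hidden size of 384 (the embedding dimension of the base model), the standard LoRA scaling of alpha / r, Euclidean distance in the loss, and a margin of 1.0 picked only for illustration. The function names are hypothetical.

```python
import math


def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)


def lora_scale(alpha, r):
    """Scaling applied to the low-rank update in standard LoRA."""
    return alpha / r


def triplet_loss(anchor, positive, negative, margin):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0), Euclidean d."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


hidden = 384  # assumed hidden size, not stated in this card
full = hidden * hidden
added = lora_added_params(hidden, hidden, r=16)
print(f"full matrix: {full} params, LoRA update: {added} params ({added / full:.1%})")
print(f"LoRA scale alpha / r: {lora_scale(32, 16)}")

# Toy 2-D embeddings; the margin value is illustrative only.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=1.0))  # negative far enough away
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0))  # negative too close
```

The loss is zero once the negative example is farther from the anchor than the positive by at least the margin, so training only pushes on triplets that still violate that gap.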

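Limitations and Notes

The model is optimized mainly for Chinese text.
It is intended for text classification tasks in the education and training domain.
Further fine-tuning on data from a similar domain is recommended.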

Citation

If you use this model, please cite:

@misc{setfit-hierarchical-binary-classifier,
  author = {Claire Liu},
  title = {SetFit Hierarchical Binary Classifier},
  year = {2025},
  url = {https://huggingface.co/lyingbarrelhome/setfit-paraphrase-multilingual-MiniLM-L12-v2}
}

License

MIT License

Model created: 2025-06-27 03:41:25