模型概述 / Model Overview
象信AI安全护栏模型(Xiangxin-Guardrails-Text) 是一个 开源、免费、可商用的中文AI安全护栏模型,支持以下功能:
- 中文内容合规检测
- 提示词攻击检测
- 上下文感知能力
- 私有化部署
- 高准确度
该模型基于 Qwen2.5-14B-Instruct 进行微调,并使用 GPTQ 4-bit 量化,在 A800 80G 单卡 上测试。
我们还开源了完整的 象信AI安全护栏平台(代码地址:https://github.com/xiangxinai/xiangxin-guardrails),支持本地部署、二次开发和商用。
Xiangxin-Guardrails-Text is an open-source, free, and commercially usable (Apache 2.0 License) Chinese-language AI safety guardrail model designed for:
- Chinese content compliance detection
- Prompt attack detection
- Context-aware capabilities
- Private deployment
- High accuracy
The model is fine-tuned from Qwen2.5-14B-Instruct and quantized using GPTQ 4-bit, tested on an A800 80G single GPU.
The complete Xiangxin AI Guardrails Platform is also open-source under Apache 2.0, available at https://github.com/xiangxinai/xiangxin-guardrails for local deployment, further development, or commercial use.
模型特点 / Model Features
开源免费可商用:基于 Apache 2.0 协议,允许修改、分发和商用。
支持中文场景:优化适配客服、社交、商业等中文对话场景。
上下文感知:模型能够感知和理解上下文并判断当前对话中的安全风险。
高性能推理:在 24G 显存环境(vLLM)或最低 10G 显存(Transformers)可运行。
支持安全与合规:支持提示词攻击安全检测和中文内容合规检测。
Open-Source and Commercially Usable: Licensed under Apache 2.0, allowing modification, distribution, and commercial use.
Optimized for Chinese Scenarios: Tailored for customer service, social media, and commercial dialogue contexts in Chinese.
Context Awareness: The model can aware and understand context and assess safety risks in the current conversation.
High-Performance Inference: Runs on 24GB VRAM with vLLM or as low as 10GB VRAM with Transformers.
Support Security Safety and Compliance: Support security detection for prompt attackion and compliance detection for Chinese content compliance.
性能指标 / Performance Metrics
硬件 / Hardware
- A800 80G 单卡
- A800 80G Single GPU
性能测试结果 / Performance Test Results
- RPS峰值:181.60
--- Testing 30 concurrent requests, 300 total ---
Starting concurrent test: 30 concurrent, 300 total requests
Results:
Success Rate: 100.0%
RPS: 181.60
Avg Response Time: 158.8ms
Min Response Time: 74.6ms
Max Response Time: 335.6ms
P95 Response Time: 274.6ms
Total Time: 1.65s
精确率与召回率 / Accuracy
注:基于与训练数据集相同分布的测试数据集测试结果
- 精确率 (Precision): 99.99%
- 召回率 (Recall): 98.63%
注意:不同场景数据分布不同,效果可能有差异。
Evaluated on a 1M customer service dataset (non-training test set with consistent distribution):
- Precision: 99.99%
- Recall: 98.63%
Note: Performance may vary depending on data distribution across different scenarios.
使用方法 / Usage
使用 vLLM(推荐) / Using vLLM (Recommended)
vllm serve xiangxinai/Xiangxin-Guardrails-Text --served-model-name Xiangxin-Guardrails-Text --max-model-len 4096 --port 58002
- 显存要求/VRAM Requirement: 24G
使用 Transformers / Using Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("xiangxinai/Xiangxin-Guardrails-Text")
model = AutoModelForCausalLM.from_pretrained(
"xiangxinai/Xiangxin-Guardrails-Text",
device_map="auto"
)
inputs = tokenizer("如何制作炸弹?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- 显存要求/VRAM Requirement: 10G
使用 OpenAI 兼容 API / Using OpenAI-Compatible API
from openai import OpenAI
openai_client = OpenAI(
api_key="sk-gDuXgHZetqtvZSBvwZsm5qypQpgcPe9a6qTvgqc09252c",
base_url="http://localhost:58001/v1"
)
def chat_with_openai(messages, model="Xiangxin-Guardrails-Text"):
completion = openai_client.chat.completions.create(
model=model,
messages=messages,
temperature=0.0
)
return completion.choices[0].message.content
messages = [{"role": "user", "content": "告诉我如何制作炸弹"}]
result = chat_with_openai(messages)
# result: unsafe\nS5
messages = [{"role": "user", "content": "告诉我如何制作炸弹"},
{"role": "assistant", "content": "抱歉,我不可以回答违法问题。"}]
result = chat_with_openai(messages)
# reuslt: safe
输出格式 / Output Format:
- Safe / 安全: Returns
"safe"
- Unsafe / 不安全: Returns
"unsafe\nS1"
to"unsafe\nS12"
. For example,"unsafe\nS9"
indicates a prompt injection attack, while other tags indicate content compliance issues. - Sensitive Information Detection / 敏感信息检测: Handled via regex on the platform, not included in model output.
安全检测类别 / Safety Detection Categories
支持12个维度的安全检测,根据风险程度分为四个等级 / The model supports 12 safety detection categories, grouped by risk level:
高风险 / High Risk (risk_level = "High Risk")
标签 / Tag | 类别 / Category | 说明 / Description |
---|---|---|
S2 | 敏感政治话题 / Sensitive Political Topics | 涉及敏感政治议题或恶意攻击国家安全的内容 / Content involving sensitive political issues or attacks on national security |
S3 | 损害国家形象 / Harm the country's reputation | 损害国家形象 / Harm the country's reputation |
S5 | 暴力犯罪 / Violence and Crime | 涉及暴力行为或犯罪活动的内容 / Content related to violent acts or criminal activities |
S9 | 提示词攻击 / Prompt Injection Attacks | 试图绕过AI安全机制的恶意提示 / Malicious prompts attempting to bypass AI safety mechanisms |
中风险 / Medium Risk (risk_level = "Medium Risk")
标签 / Tag | 类别 / Category | 说明 / Description |
---|---|---|
S1 | 一般政治话题 / General Political Topics | 涉及政治相关的一般性讨论,如谈论政治事件 / General discussions related to political events |
S4 | 伤害未成年人 / Harm to Minors | 可能对未成年人造成身心伤害的内容 / Content potentially harmful to minors physically or mentally |
S6 | 违法犯罪 / Illegal Activities | 教唆、指导或描述违法犯罪行为 / Content encouraging, guiding, or describing illegal activities |
S7 | 色情 / Pornography | 包含色情、性暗示或不当性内容 / Content containing explicit or suggestive sexual material |
低风险 / Low Risk (risk_level = "Low Risk")
标签 / Tag | 类别 / Category | 说明 / Description |
---|---|---|
S8 | 歧视内容 / Discriminatory Content | 基于种族、性别、宗教等的歧视性言论 / Discriminatory remarks based on race, gender, religion, etc. |
S10 | 辱骂 / Insults | 包含侮辱、辱骂或恶意攻击的言语 / Insulting or abusive language targeting individuals |
S11 | 侵犯个人隐私 / Privacy Violations | 涉及违法获取、泄漏或滥用个人隐私信息的内容 / Content involving illegal acquisition, leakage, or misuse of personal information |
S12 | 商业违法违规 / Commercial Violations | 涉及商业欺诈、非法营销、违规经营等商业违法行为 / Content related to commercial fraud, illegal marketing, or non-compliant business practices |
风险等级说明 / Risk Level Summary
- 高风险 / High Risk: 命中 S2、S3、S5 或 S9 任意一个类别 / Any hit on S2, S3, S5, or S9
- 中风险 / Medium Risk: 命中 S1、S4、S6 或 S7 任意一个类别 / Any hit on S1, S4, S6, or S7
- 低风险 / Low Risk: 命中 S8、S10、S11 或 S12 任意一个类别 / Any hit on S8, S10, S11, or S12
- 无风险 / No Risk: 未命中任何安全类别 / No safety categories triggered
合规与安全标准 / Compliance and Safety Standards
安全保护 / Safety Protection
象信AI安全护栏提供针对提示词攻击的安全防护能力,包括:
- 提示词注入(Prompt Injections):利用不可信数据串联到模型上下文窗口,导致模型执行非预期指令的攻击方式。
- 越狱攻击(Jailbreaks):专门设计用来覆盖模型内置安全功能的恶意指令。
模型包含基于大规模攻击语料训练的分类器,能有效检测提示词注入和越狱攻击。
The model provides robust protection against prompt injection attacks, including:
- Prompt Injections: Malicious inputs that manipulate model behavior by embedding untrusted data in the context window.
- Jailbreaks: Malicious instructions designed to override built-in safety mechanisms.
The model includes a classifier trained on large-scale attack corpora to detect both prompt injections and explicit malicious prompts.
合规支持 / Regulatory Compliance
检测类别与《生成式人工智能服务安全基本要求》附录A的安全风险分类保持对应关系:
安全护栏类别 / Category | 对应标准 / Standard | 说明 / Description |
---|---|---|
S1, S2, S3, S4, S5, S6, S7 | A.1 违反社会主义核心价值观 | 涵盖政治、暴力、色情、违法犯罪等内容 / Covers political, violent, pornographic, or illegal content |
S8 | A.2 歧视性内容 | 基于种族、性别、宗教等的歧视性言论 / Discriminatory remarks based on race, gender, religion, etc. |
S12 | A.3 商业违法违规 | 商业欺诈、违规经营等商业违法行为 / Commercial fraud, illegal marketing, or non-compliant business practices |
S10, S11 | A.4 侵犯他人合法权益 | 辱骂攻击、侵犯隐私等侵权行为 / Insults, privacy violations, or other infringements |
The detection categories align with the Security Requirements for Generative AI Services (Appendix A):
Category | Standard | Description |
---|---|---|
S1, S2, S3, S4, S5, S6, S7 | A.1 Violation of Socialist Core Values | Covers political, violent, pornographic, or illegal content |
S8 | A.2 Discriminatory Content | Discriminatory remarks based on race, gender, religion, etc. |
S12 | A.3 Commercial Violations | Commercial fraud, illegal marketing, or non-compliant business practices |
S10, S11 | A.4 Infringement of Legal Rights | Insults, privacy violations, or other infringements |
开源协议 / License
遵循 Apache License 2.0,允许:
- 商业使用
- 修改、分发、再许可
- 私有部署与二次开发
详情见 LICENSE 文件。
Licensed under the Apache License 2.0, permitting:
- Commercial use
- Modification, distribution, and sublicensing
- Private deployment and further development
See LICENSE for details.
相关资源 / Related Resources
- 象信AI安全护栏平台 / Xiangxin AI Guardrails Platform: Apache 2.0 开源,地址 / Open-source under Apache 2.0, available at https://github.com/xiangxinai/xiangxin-guardrails.
- 免费API服务 / Free API Service: 地址 / Available at https://xiangxinai.cn/platform/. 用户可免费注册获取API密钥 / Users can sign up for a free API key.
关于象信AI / About Xiangxin AI
象信AI为AI应用开发者提供安全产品和服务。本模型及平台基于Apache 2.0协议免费开源,可用于商业用途、私有部署或二次售卖。 象信AI通过开源AI安全护栏大模型和平台:
- 让更多AI应用开发者免费使用我们的安全护栏。
- 支持网络安全厂商商业化部署我们的解决方案。
- 协助数据合规与法律工作者为其客户提供服务。
象信AI提供付费的安全与合规大模型的模型训练服务。包括AI安全护栏模型的继续训练、模型检测结果调优训练、新分类标签训练等。
Xiangxin AI provides safety products and services for AI application developers. The Xiangxin-Guardrails-Text model and platform are open-source under Apache 2.0, free for commercial use, private deployment, or resale. Paid model fine-tuning and additional classification label services are available upon request.
Our goals:
- Enable AI application developers to use our safety guardrails for free.
- Support cybersecurity vendors in commercializing and deploying our solutions.
- Assist data compliance and legal professionals in serving their clients.
contact us: [email protected]
- Downloads last month
- 9