walledai's Collections
Research

Updated 17 days ago

Our AI Safety Research

  • Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique

    Paper • 2408.10701 • Published Aug 20, 2024 • 12

  • Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming

    Paper • 2406.11654 • Published Jun 17, 2024 • 6

  • Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

    Paper • 2409.11242 • Published Sep 17, 2024 • 7

  • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

    Paper • 2308.09662 • Published Aug 18, 2023 • 3

  • Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases

    Paper • 2310.14303 • Published Oct 22, 2023 • 1

  • WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models

    Paper • 2408.03837 • Published Aug 7, 2024 • 18

  • Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic

    Paper • 2402.11746 • Published Feb 19, 2024 • 2