- Universal Jailbreak Backdoors from Poisoned Human Feedback (arXiv:2311.14455, published Nov 24, 2023)
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (arXiv:2311.03348, published Nov 6, 2023)
- Personas as a Way to Model Truthfulness in Language Models (arXiv:2310.18168, published Oct 27, 2023)
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (arXiv:2307.15217, published Jul 27, 2023)
- PassGPT: Password Modeling and (Guided) Generation with Large Language Models (arXiv:2306.01545, published Jun 2, 2023)
- "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks (arXiv:2204.04636, published Apr 10, 2022)