- Universal Jailbreak Backdoors from Poisoned Human Feedback (arXiv:2311.14455, published Nov 24, 2023)
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (arXiv:2311.03348, published Nov 6, 2023)
- Personas as a Way to Model Truthfulness in Language Models (arXiv:2310.18168, published Oct 27, 2023)
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (arXiv:2307.15217, published Jul 27, 2023)
- PassGPT: Password Modeling and (Guided) Generation with Large Language Models (arXiv:2306.01545, published Jun 2, 2023)
- "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks (arXiv:2204.04636, published Apr 10, 2022)