Javier Rando

javirandor
https://javirando.com

AI & ML interests

NLP, Safety

Organizations

SPY Lab - ETH Zurich

authored 3 papers over 1 year ago

Universal Jailbreak Backdoors from Poisoned Human Feedback

Paper • 2311.14455 • Published Nov 24, 2023 • 1

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Paper • 2311.03348 • Published Nov 6, 2023 • 1

Personas as a Way to Model Truthfulness in Language Models

Paper • 2310.18168 • Published Oct 27, 2023 • 5

authored 4 papers almost 2 years ago

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Paper • 2307.15217 • Published Jul 27, 2023 • 38

PassGPT: Password Modeling and (Guided) Generation with Large Language Models

Paper • 2306.01545 • Published Jun 2, 2023

"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Paper • 2204.04636 • Published Apr 10, 2022

Red-Teaming the Stable Diffusion Safety Filter

Paper • 2210.04610 • Published Oct 3, 2022 • 2