LLM-LAT (LLM Latent Adversarial Training)

stecas

authored a paper 5 months ago

Open Problems in Mechanistic Interpretability

Paper • 2501.16496 • Published Jan 27, 2025 • 19

aengusl

authored a paper 10 months ago

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

CindyXWu

authored 2 papers 11 months ago

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Paper • 2405.10927 • Published May 17, 2024 • 3

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

stecas

authored 2 papers 11 months ago

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Paper • 2403.05030 • Published Mar 8, 2024

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

stecas

updated a Space 11 months ago

README

abhayesian

updated 2 models 11 months ago

LLM-LAT/robust-llama3-8b-instruct

Text Generation • 8B • Updated Aug 1, 2024 • 2.2k • 12

LLM-LAT/llama3-8b-instruct-lat-jailbreak-robust3

Updated Aug 1, 2024

stecas

updated 2 datasets 12 months ago

LLM-LAT/benign-dataset

Viewer • Updated Jul 24, 2024 • 165k • 134 • 2

LLM-LAT/harmful-dataset

Viewer • Updated Jul 24, 2024 • 4.95k • 2.7k • 17

abhayesian

updated 5 models 12 months ago

CindyXWu

updated 3 models 12 months ago

LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-stronger-unlearning

Text Generation • 7B • Updated Jul 22, 2024 • 7 • 1

LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-normal

Text Generation • 7B • Updated Jul 22, 2024 • 22

LLM-LAT/zephyr7b-beta-rmu-lat-unlearn-wmdp-bio-cyber

Text Generation • 7B • Updated Jul 22, 2024 • 6 • 1

Baidicoot

updated a model about 1 year ago

LLM-LAT/llama2-7b-chat-lat-removed-backdoor5

Text Generation • 7B • Updated Jul 5, 2024 • 8

Team members 6
