Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Paper • 2407.15549 • Published Jul 22, 2024
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability Paper • 2405.10927 • Published May 17, 2024 • 3
Defending Against Unforeseen Failure Modes with Latent Adversarial Training Paper • 2403.05030 • Published Mar 8, 2024
LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-stronger-unlearning Text Generation • 7B • Updated Jul 22, 2024 • 7 • 1
LLM-LAT/llama2-7b-chat-lat-unlearn-harry-potter-normal Text Generation • 7B • Updated Jul 22, 2024 • 22
LLM-LAT/zephyr7b-beta-rmu-lat-unlearn-wmdp-bio-cyber Text Generation • 7B • Updated Jul 22, 2024 • 6 • 1