|
[Meta's Llama-3 8b](https://github.com/meta-llama/llama3) with the refusal direction removed from its residual stream, so that `helpfulness >> harmlessness`.
|
|
|
It will still warn you and lecture you (as the direction for that behaviour has not been erased), but it will follow instructions.
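The core operation here is directional ablation: projecting the refusal direction out of the hidden states so the model can no longer represent it. A minimal sketch of that projection, assuming you already have a refusal direction extracted (e.g. via the method in the linked gist; the function name here is illustrative):

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove one direction from hidden states: h - (h . d_hat) d_hat.

    hidden:    (n_tokens, d_model) array of residual-stream activations
    direction: (d_model,) refusal direction (need not be normalized)
    """
    d = direction / np.linalg.norm(direction)          # unit vector d_hat
    coeffs = hidden @ d                                 # projection of each token onto d_hat
    return hidden - np.outer(coeffs, d)                 # subtract that component
```

After ablation, every hidden state is orthogonal to the refusal direction, so no downstream layer reading that component sees it.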
|
|
|
Only use this if you can take responsibility for your own actions and emotions while using it. |
|
|
|
For generation code, see https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af |
|
|
|
## Dev thoughts |
|
|
|
- I found that Llama needed a separate refusal direction per layer, with the intervention applied at every layer. Could this be a property of smarter models - that their residual stream changes more from layer to layer?
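The per-layer scheme above can be sketched as follows. This is a toy stand-in, not the actual generation code (see the linked gist for that): the "layers" are simple callables in place of transformer blocks, and each layer gets its own ablation direction rather than one shared direction:

```python
import numpy as np

def ablate(hidden, direction):
    """Project a single direction out of the hidden states."""
    d = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ d, d)

def forward_with_per_layer_ablation(hidden, layers, directions):
    """Run a stack of layers, ablating each layer's own refusal direction.

    layers:     list of callables standing in for transformer blocks
    directions: one (d_model,) direction per layer - separate directions,
                since a single shared one did not work across layers here
    """
    for layer, d in zip(layers, directions):
        hidden = layer(hidden)        # the block's usual computation
        hidden = ablate(hidden, d)    # then remove that layer's direction
    return hidden
```

In a real model this would be registered as a forward hook on each block (or baked into the weights), but the control flow is the same.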
|
|
|
## More info |
|
For anyone looking to learn more about this field, check out these intros:
|
- A primer on the internals of transformers: https://arxiv.org/abs/2405.00208 |
|
- Machine unlearning: https://ai.stanford.edu/~kzliu/blog/unlearning |
|
- The original post that this script is based on: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction#
|
|
|
And check out this overlooked work that steers LLMs inside Oobabooga's popular UI: https://github.com/Hellisotherpeople/llm_steer-oobabooga
|
|
|
---
license: llama3
---
|
|