🔍 Today's pick in Interpretability & Analysis of LMs: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection by @chaochen et al.
Previous efforts to detect hallucinations from model-intrinsic information relied on predictive uncertainty or self-consistency evaluation. The authors contend that in these procedures, the rich semantic information captured in model embeddings is inevitably lost during token decoding.
To prevent this information loss, they propose EigenScore, an internal self-consistency measure that uses the eigenvalues of the covariance matrix of sampled responses' intermediate-layer embeddings to quantify answer diversity in the dense embedding space. Results show that EigenScore outperforms logit-level methods for hallucination detection on QA tasks, especially when paired with inference-time feature clipping, which truncates extreme activations to reduce overconfident generations.
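For intuition, here is a minimal NumPy sketch of what such a covariance-eigenvalue consistency score and inference-time clipping might look like. The function names, the Gram-matrix shortcut, the regularizer `alpha`, and the quantile-based clipping bound are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """Sketch of an EigenScore-style consistency measure.

    `embeddings` is a (K, d) matrix holding one sentence embedding per
    sampled response, e.g. mean-pooled hidden states from an
    intermediate layer. Higher scores indicate more diverse (less
    self-consistent) answers, which the paper links to hallucination.
    """
    K, d = embeddings.shape
    # Center the embeddings across the K sampled responses.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # K x K Gram form keeps the matrix small when d >> K.
    cov = centered @ centered.T / d
    # Regularize so all eigenvalues are strictly positive.
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(K))
    # Mean log-eigenvalue, i.e. (1/K) * log-det of the regularized covariance.
    return float(np.log(eigvals).mean())

def clip_features(h: np.ndarray, quantile: float = 0.99) -> np.ndarray:
    """Truncate extreme activations to a symmetric per-tensor bound.

    The quantile threshold is an assumption for illustration; the paper
    derives its clipping threshold from feature statistics.
    """
    bound = np.quantile(np.abs(h), quantile)
    return np.clip(h, -bound, bound)
```

In this sketch, a low EigenScore means the K sampled answers cluster tightly in embedding space (consistent, likely faithful), while a high score flags divergent answers.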
📄 Paper: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection (2402.03744)