BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation Paper • 2506.00482 • Published 8 days ago • 8
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems Paper • 2505.18366 • Published 15 days ago • 25
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding Paper • 2505.17330 • Published 16 days ago • 22
SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use Paper • 2505.17332 • Published 16 days ago • 31 • 3
Can LLMs faithfully generate their layperson-understandable 'self'?: A Case Study in High-Stakes Domains Paper • 2412.07781 • Published Nov 25, 2024 • 2
Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B By asoria and 3 others • Apr 4, 2024 • 28
Article Illustrating Reinforcement Learning from Human Feedback (RLHF) By natolambert and 3 others • Dec 9, 2022 • 266