Thank you for your interest. I did not look very closely at whether each model correctly identifies the tiles, but from the examples I reviewed manually, it doesn't seem to be a problem for models like o1 and o3.
Yeah, the result was a surprise to me as well. This problem has little coverage in the training corpus, yet it requires only fairly simple logic with good abstraction. The low accuracy suggests that the LLMs are probably still relying heavily on memorization rather than true logical reasoning.
I do think this work could easily be expanded into a paper. Unfortunately, I do not have enough time to do it myself. Happy to collaborate if you are interested.