Introducing Bot Scanner: A "Skyscanner" for LLM answers

Community Article Published June 4, 2025

Building on our experience developing AutoBench, our AI-generated LLM benchmark, we introduce Bot Scanner: a platform for everyone that, for any given question, leverages AI to produce a ranked list of answers from existing LLMs.


Which AI model should you use? ChatGPT, Gemini, Claude, Llama, DeepSeek? Every day brings new releases, and while benchmarks exist, we all know that performance on individual tasks can vary enormously. This leaves you with a persistent doubt: "Is this the best possible answer for my specific use case?"

Many apps now let you compare LLM responses side-by-side, but this requires you to manually read and judge each one. In the era of LLMs, this feels like trying to book a flight by visiting each airline's website individually. So, what if there were a "Skyscanner" for LLMs? Or an "Airbnb" of AI-generated answers?

That’s why we built Bot Scanner: a platform that lets users query multiple LLMs and, crucially, have the responses ranked by other AI models. Through a simple, chatbot-like interface, it makes identifying the best AI-generated answer straightforward.

Despite its simple interface, Bot Scanner orchestrates a sophisticated two-step process. First, it takes a single prompt from the user and broadcasts it to a user-selected group of "responder" LLMs. Second, once the responses are collected, it passes them to a different, user-selected group of "ranker" LLMs. These "judges" evaluate the quality of the initial responses against the user's prompt and produce a final, ranked list. This gives the user granular control over the entire evaluation chain, from choosing the contestants to appointing the jury.
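The two-step flow above can be sketched in a few lines of Python. This is an illustrative outline only, not Bot Scanner's actual implementation: `call_llm` is a hypothetical stand-in for any chat-completion API, and the scoring stub simply uses reply length where a real ranker would parse a numeric grade from the judge's reply.

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical placeholder for a real API call to the named model."""
    return f"[{model}] answer to: {prompt}"

def bot_scanner(prompt: str, responders: list[str], rankers: list[str]):
    # Step 1: broadcast the user's prompt to every responder model.
    answers = {m: call_llm(m, prompt) for m in responders}

    # Step 2: each ranker judges every answer; scores are averaged
    # across the whole jury of ranker models.
    scores = {m: 0.0 for m in responders}
    for judge in rankers:
        for m, answer in answers.items():
            # A real system would extract a numeric score from the
            # judge's reply; here reply length is a deterministic stub.
            reply = call_llm(judge, f"Rate this answer to '{prompt}': {answer}")
            scores[m] += len(reply) / len(rankers)

    # Return (model, answer) pairs sorted best-first by averaged score.
    return sorted(answers.items(), key=lambda kv: scores[kv[0]], reverse=True)

ranked = bot_scanner("What is RAG?", ["model-a", "model-b"], ["judge-x"])
```

The key design point is the separation of the two pools: the contestants (`responders`) and the jury (`rankers`) are chosen independently by the user.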

Take a guided tour of Bot Scanner

The Need for a New Evaluation Paradigm in the "Agentic Era"

We are on the cusp of the "agentic era," where AI agents are designed to autonomously execute complex, multi-step tasks. This raises a critical question: how do we ensure these agents are not just capable, but consistently effective and aligned with our goals? The answer lies in robust, real-time evaluation of the LLM responses they produce.

This new era suggests a broader interpretation of the "Mixture of Experts" (MoE) concept. Instead of internal sub-networks, we are seeing systems where distinct, specialized LLMs are orchestrated to collaborate like a team of experts. Imagine a market research agent where one LLM excels at data retrieval, another at sentiment analysis, and a third at creative brainstorming. While incredibly powerful, this "multi-LLM agent" paradigm makes evaluation exponentially more complex. How do you choose the right LLM "expert" for each sub-task?

From AutoBench to Bot Scanner: Democratizing Advanced Evaluation

This challenge was the driving force behind our work at eZecute. Recognizing the limitations of traditional, static benchmarks, we developed AutoBench, an advanced benchmarking framework based on a "Collective-LLM-as-a-Judge" paradigm, built in collaboration with AI researchers and entrepreneurs such as Marco Trombetti and with the support of leading AI companies such as Translated. In this approach, a collective of AI models evaluates the outputs of other AIs, overcoming the static and often gamed nature of traditional benchmarks and providing a dynamic, scalable evaluation. It’s a method designed to be "ASI-ready," capable of evaluating LLMs even when they become too difficult for humans to assess.

The insights from AutoBench were the direct catalyst for Bot Scanner. We saw firsthand how difficult it was to select the optimal LLM for a given task. The process was manual, subjective, and time-consuming. So we asked ourselves: what if we could take the sophisticated, LLM-driven evaluation principles from AutoBench and package them into an accessible, everyday tool? That is Bot Scanner. We created a user-friendly platform that allows users to not only get responses from multiple LLMs but, most importantly, have those responses evaluated and ranked by a user-defined "collective of judges".

How Bot Scanner Works and Why It's Needed

With Bot Scanner, instead of a simple list of outputs, you get a ranked list, evaluated by AI models you trust for the evaluation task. This provides three key benefits:

  • Immense Time Savings: No more manual sifting through dozens of outputs.
  • Informed Decision-Making: Quickly identify the highest-quality response based on criteria evaluated by other AIs.
  • Build Better Agents: For developers in the agentic era, Bot Scanner is an invaluable tool for selecting the most effective "LLM experts" for each component of their agent, a vital step as these systems become more complex.

It's important to note that Bot Scanner is a specialized tool, not a replacement for everyday chatbots like ChatGPT or Gemini. It becomes essential when accuracy and quality across a wide variety of models are paramount. This depth comes at a cost: a single query can involve hundreds of LLM calls for responses and evaluations, making it up to 100 times more expensive than a standard LLM query. Bot Scanner is also not a benchmarking system; it provides an immediate ranking for a specific question, not a generalized benchmark across thousands of interactions.
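The cost figure above is easy to verify with back-of-the-envelope arithmetic, under one plausible accounting (an assumption on our part, not the exact billing model): one call per responder, plus one ranking call per (ranker, response) pair.

```python
def total_llm_calls(n_responders: int, n_rankers: int) -> int:
    """Estimate LLM calls for one query: responses plus pairwise rankings."""
    response_calls = n_responders                 # one answer per responder
    ranking_calls = n_rankers * n_responders      # each judge rates each answer
    return response_calls + ranking_calls

# With 20 responders judged by 10 rankers: 20 + 200 = 220 calls,
# roughly two orders of magnitude more than a single chatbot query.
print(total_llm_calls(20, 10))  # -> 220
```

This is why call counts grow multiplicatively with jury size, and why the tool is best reserved for questions where answer quality justifies the expense.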

The Future is Continuously Evaluated

As AI models become more powerful and agentic systems more widespread, the need for robust, dynamic, and user-driven evaluation will only grow. Static benchmarks will continue to serve a purpose, but dynamic tools are essential for practical application and optimization. With Bot Scanner, we aim to give everyone—from individual researchers to large teams—a powerful yet accessible means to navigate this complex landscape and make better decisions about the AI they use.
