Testing Magnitude: The Open-Source AI Agents for Web Testing

Community Article · Published April 19, 2025

End-to-end (E2E) testing is a cornerstone of modern web development, ensuring complex user flows function correctly across the entire application stack. However, traditional E2E tests, often written using frameworks like Selenium, Cypress, or Playwright, face persistent challenges. They can be brittle, breaking easily with minor UI changes, and require significant effort to write and maintain, demanding specialized knowledge of selectors and browser automation APIs. Magnitude enters this arena with a bold proposition: leveraging AI agents to write, execute, and adapt E2E tests using natural language, aiming to drastically simplify the process and improve test resilience. This review delves into the technical architecture and implementation of the open-source Magnitude framework based on its codebase.

Magnitude GitHub repo: https://github.com/magnitudedev/magnitude#running-your-first-test

The Problem: Brittle Tests and High Maintenance

The core issue Magnitude tackles is the inherent fragility of selector-based E2E tests. A minor change in the DOM structure, a class name modification, or a CSS refactor can break tests even if the user-facing functionality remains identical. This leads to "flaky" tests and a high maintenance burden, often discouraging teams from investing adequately in E2E testing. Furthermore, writing these tests requires developers to meticulously inspect the DOM, craft robust selectors, and handle asynchronous operations, demanding a specific skillset and considerable time investment.

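To make the brittleness concrete, consider a conventional selector-based test, sketched here with Playwright (the app URL and selectors are hypothetical). Its correctness is coupled to DOM details that can change without any change in user-facing behavior:

```typescript
import { test, expect } from '@playwright/test';

// A hypothetical selector-based test. Every locator below encodes a DOM
// detail (test id, class name, tag structure) rather than user intent.
test('can add a todo', async ({ page }) => {
  await page.goto('https://todo.example.com'); // placeholder URL
  await page.locator('[data-testid="new-todo-input"]').fill('Take out the trash');
  await page.locator('button.add-btn').click(); // breaks if the class is renamed
  await expect(page.locator('ul.todo-list > li')).toHaveCount(1);
});
```
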
Magnitude proposes a shift from explicit instructions based on code structure (selectors) to intent-based instructions grounded in visual understanding. By describing test steps and checks in natural language – much like explaining a test case to a human tester – Magnitude aims to create tests that understand the goal of an interaction rather than just the mechanism, making them inherently more robust to UI changes.

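For comparison, here is a sketch of the same idea in Magnitude's style, based on the fluent test(...).step(...).data(...).check(...) syntax shown in the project README. The URL, data values, and the exact shape of the options object are illustrative assumptions; signatures may differ between releases:

```typescript
// example.mag.ts -- illustrative sketch; exact API may differ by version
import { test } from 'magnitude-test';

// Intent-based: no selectors, just natural language steps and checks.
test('can manage todos', { url: 'https://todo.example.com' }) // hypothetical URL
    .step('Log in to the app')
        .data({ username: 'user@example.com' }) // hypothetical test data
        .check('Should see the todo dashboard')
    .step('Create 3 todos')
        .data('Use realistic, varied todo descriptions')
        .check('Should see all 3 todos listed');
```
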
Core Concept: Two-Tiered AI Agency for Testing

At the heart of Magnitude lies a sophisticated two-agent architecture, separating the concerns of high-level planning and low-level execution. This design, clearly visible within the magnitude-core package, is key to Magnitude's approach:

  1. The Macro Agent (Planner/Reasoner): This agent acts as the "brains" of the operation. Implemented in packages/magnitude-core/src/ai/macro.ts and likely leveraging a powerful Large Language Model (LLM) like Claude Sonnet (configurable via providers like Anthropic or AWS Bedrock), its primary role is to interpret the user's natural language test steps. Given a high-level goal (e.g., "create 3 todos"), the current state of the web application (represented by a screenshot), and the history of actions already taken within the current step, the Macro agent generates a partial recipe. This recipe is a sequence of lower-level "action ingredients" (e.g., "type 'Take out the trash' into the input field labeled 'New Todo'", "click the 'Add' button"). It intelligently breaks down the complex step into manageable sub-tasks and determines if the sequence of actions generated should logically complete the step. Crucially, it also refines natural language checks (e.g., "should see all 3 todos") by removing implicit context based on the visual state and action history, making them directly evaluatable by the execution agent. Furthermore, it plays a role in diagnosing failures, attempting to classify failed checks as potential application bugs or misalignments between the test intent and the application state/agent capability.
  2. The Micro Agent (Visual Executor/Checker): This agent is the "hands and eyes." Implemented in packages/magnitude-core/src/ai/micro.ts, it likely uses a more specialized (potentially faster and cheaper) vision-capable model. Its core function is visual grounding. It takes an "action ingredient" from the Macro agent's recipe (e.g., "click the 'Add' button") and the current screenshot. It then visually identifies the target element corresponding to the description within the screenshot and translates this into a concrete, executable WebAction understood by the browser automation backend (e.g., specifying coordinates or a uniquely identified visual target). This convertAction function is critical for bridging the gap between natural language intent and browser interaction. The Micro agent also evaluates the refined checks provided by the Macro agent against the current screenshot (evaluateCheck), returning a simple pass/fail result.

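The agents' division of labor implies a rough shape for the data they exchange. The following is a hypothetical sketch inferred from the behavior described above, not the actual magnitude-core type definitions:

```typescript
// Hypothetical shapes for the Macro -> Micro handoff, inferred from the
// described behavior rather than copied from magnitude-core.

// The Macro agent plans in natural-language-grounded "action ingredients"...
type ActionIngredient =
  | { variant: 'click'; target: string }                 // e.g. "the 'Add' button"
  | { variant: 'type'; target: string; value: string };  // e.g. "username field"

// ...which the Micro agent grounds into concrete, selector-free browser
// actions (e.g. coordinates identified visually in the screenshot).
type WebAction =
  | { variant: 'click'; x: number; y: number }
  | { variant: 'type'; x: number; y: number; content: string };

interface MicroAgent {
  // Visually locate the ingredient's target in the screenshot and translate
  // it into an executable WebAction; fails with a "misalignment" error if
  // the target cannot be located reliably.
  convertAction(screenshot: Buffer, ingredient: ActionIngredient): Promise<WebAction>;
  // Evaluate a refined natural language check against the screenshot.
  evaluateCheck(screenshot: Buffer, check: string): Promise<boolean>;
}
```
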
Implementation Details: BAML, Playwright, and TypeScript

The Magnitude codebase, structured as a monorepo managed with turbo and bun/npm, primarily uses TypeScript. Key implementation choices include:

  • BAML (@boundaryml/baml): The Boundary Meta Language (BAML) appears central to Magnitude's AI implementation. Files within packages/magnitude-core/baml_src/ likely define the prompts, input/output schemas, and logic for interacting with the LLMs powering both the Macro and Micro agents. BAML allows defining these interactions in a structured way, potentially enabling easier switching between different LLM providers (like Anthropic, Bedrock, OpenAI) and providing features like input/output validation, type safety, and integrated tracing/logging (as seen with traceAsync calls). This abstraction layer is crucial for managing the complexity of LLM interactions.
  • Playwright: For browser automation, Magnitude relies on Playwright. The WebHarness class (packages/magnitude-core/src/web/harness.ts) abstracts Playwright's core functionalities like navigating pages (goto), taking screenshots (screenshot), and executing actions (executeAction). Playwright's robustness and cross-browser capabilities provide a solid foundation for interacting with the web application under test. Notably, playwright is listed as an optional peer dependency in magnitude-core, suggesting that the test runner package (magnitude-test) or the user's project environment is responsible for providing the Playwright installation, particularly relevant for self-hosted scenarios. A minimal sketch of this harness layer follows this list.
  • Sharp: The use of the sharp library indicates sophisticated image processing capabilities. While the exact usage isn't fully detailed in the reviewed code, it's highly likely used by the Micro agent to preprocess or analyze screenshots before feeding them to the vision model, potentially for cropping, resizing, identifying regions of interest, or extracting visual features relevant to element identification.
  • Orchestration (TestCaseAgent): The packages/magnitude-core/src/agent.ts file houses the TestCaseAgent class, which orchestrates the entire test execution flow. It manages the browser lifecycle, iterates through test steps and checks, coordinates the interaction between the Macro and Micro agents, handles state (like the sequence of actions taken), manages error conditions (navigation, action conversion/execution, check failures), and emits events through a listener system (TestAgentListener) for external monitoring or reporting.

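As a rough illustration of that harness layer, here is a minimal, coordinate-driven wrapper over Playwright, reusing the hypothetical WebAction shape from the earlier sketch. The real harness.ts is considerably more complete; this only shows the selector-free execution model:

```typescript
import { chromium, Browser, Page } from 'playwright';

// Reusing the hypothetical WebAction shape from the earlier sketch.
type WebAction =
  | { variant: 'click'; x: number; y: number }
  | { variant: 'type'; x: number; y: number; content: string };

// A minimal, illustrative stand-in for WebHarness -- not the real class.
class WebHarness {
  private browser!: Browser;
  private page!: Page;

  async start(): Promise<void> {
    this.browser = await chromium.launch();
    this.page = await this.browser.newPage();
  }

  async goto(url: string): Promise<void> {
    await this.page.goto(url);
  }

  // Screenshots are the agents' only view of application state.
  async screenshot(): Promise<Buffer> {
    return this.page.screenshot();
  }

  // Actions are executed by coordinates, not selectors: the visual
  // grounding has already happened in the Micro agent.
  async executeAction(action: WebAction): Promise<void> {
    await this.page.mouse.click(action.x, action.y);
    if (action.variant === 'type') {
      await this.page.keyboard.type(action.content);
    }
  }

  async stop(): Promise<void> {
    await this.browser.close();
  }
}
```
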
The Workflow: From Natural Language to Browser Action

The end-to-end process for a single test step unfolds as follows:

  1. Input: A natural language step description (e.g., "Log in to the app") and associated data/checks defined in a .mag.ts file.
  2. Macro Planning: The TestCaseAgent passes the step description and current screenshot to the MacroAgent.
  3. Recipe Generation: The MacroAgent generates a partial recipe of ActionIngredients (e.g., {'variant': 'type', 'target': 'username field', 'value': '[email protected]'}).
  4. Micro Grounding & Execution (Loop):
    • For each ActionIngredient:
      • The TestCaseAgent gets the current screenshot.
      • It passes the screenshot and ActionIngredient to the MicroAgent.
      • The MicroAgent.convertAction visually identifies the target (e.g., the username input) and generates a concrete WebAction. Failure point: If the target cannot be visually located reliably, a "misalignment" error occurs.
      • The TestCaseAgent instructs the WebHarness to execute the WebAction via Playwright. Failure point: Browser-level errors during execution.
  5. Step Completion: The loop continues until the MacroAgent signals the step is finished based on its generated recipe.
  6. Check Phase:
    • The TestCaseAgent gets a final screenshot for the step.
    • For each natural language check:
      • The MacroAgent.removeImplicitCheckContext refines the check.
      • The MicroAgent.evaluateCheck assesses the check against the screenshot. Failure point: If the check returns false.
      • If a check fails, the MacroAgent.classifyCheckFailure attempts diagnosis.

This iterative process of planning, visual grounding, execution, and checking allows the agent to handle multi-action steps and validate outcomes visually.

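Condensed into code, the loop might look like the sketch below. The MacroAgent interface and method names are assumptions chosen to be consistent with the earlier type sketches, not the real agent.ts API:

```typescript
// Continues the earlier sketches: ActionIngredient, MicroAgent, and
// WebHarness are the hypothetical types defined above. Method names such as
// createPartialRecipe are illustrative guesses.
interface MacroAgent {
  createPartialRecipe(
    screenshot: Buffer,
    step: string,
    history: ActionIngredient[],
  ): Promise<{ ingredients: ActionIngredient[]; finishesStep: boolean }>;
  removeImplicitCheckContext(check: string, screenshot: Buffer): Promise<string>;
}

async function runStep(
  step: string,
  checks: string[],
  macro: MacroAgent,
  micro: MicroAgent,
  harness: WebHarness,
): Promise<void> {
  const history: ActionIngredient[] = [];
  let finished = false;

  while (!finished) {
    // Steps 1-3: Macro planning from the current screenshot + action history.
    const screenshot = await harness.screenshot();
    const recipe = await macro.createPartialRecipe(screenshot, step, history);

    // Step 4: Micro grounding & execution, one ingredient at a time.
    for (const ingredient of recipe.ingredients) {
      const current = await harness.screenshot();
      const action = await micro.convertAction(current, ingredient); // may fail: misalignment
      await harness.executeAction(action);                           // may fail: browser error
      history.push(ingredient);
    }
    finished = recipe.finishesStep; // Step 5: the Macro agent signals completion.
  }

  // Step 6: check phase -- refine each check, then evaluate it visually.
  const final = await harness.screenshot();
  for (const check of checks) {
    const refined = await macro.removeImplicitCheckContext(check, final);
    if (!(await micro.evaluateCheck(final, refined))) {
      throw new Error(`Check failed: "${check}"`); // the real agent also classifies the failure
    }
  }
}
```
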
Strengths and Potential

  • Reduced Brittleness: By relying on visual understanding and natural language intent rather than specific selectors, Magnitude tests promise significantly greater resilience to minor UI code changes.
  • Ease of Authoring: Writing tests in natural language lowers the barrier to entry, potentially enabling non-programmers or QA professionals less familiar with coding to contribute to the E2E test suite. The syntax observed in the README (test(...).step(...).data(...).check(...), illustrated earlier in this review) is fluent and intuitive.
  • Adaptability: The reasoning capability of the Macro agent holds the potential for tests to adapt to unexpected changes or errors, although the current implementation seems focused primarily on diagnosis rather than complex runtime adaptation (as indicated by TODOs).
  • Open Source & Self-Hosting: Providing an open-source core and a self-hosting option grants flexibility and control, allowing users to leverage their own infrastructure and potentially fine-tune models or integrate with internal systems.

Challenges and Considerations

  • Visual Grounding Reliability: The success of Magnitude hinges heavily on the Micro agent's ability to reliably and accurately map natural language descriptions to visual elements (convertAction). Ambiguous descriptions, visually similar elements, complex custom components, or dynamic content loading could challenge this process, leading to "misalignment" errors. The TODO comments regarding MAG-103/MAG-104 suggest this is an active area of development.
  • Performance (Speed & Cost): LLM inference, especially for powerful models like those likely used by the Macro agent and vision models for the Micro agent, introduces latency and cost. Each step might involve multiple LLM calls (planning, action conversion per action, check refinement, check evaluation, potentially failure diagnosis). While Magnitude aims to be more cost-effective than general-purpose agents (as per the README), the cumulative cost and execution time for large test suites compared to traditional frameworks need consideration. The separation into Macro/Micro agents is likely a strategy to mitigate this, using potentially cheaper/faster models for the more frequent Micro agent tasks; a rough illustrative calculation follows this list.
  • Determinism and Debugging: While saved plans aim for consistency, the inherent stochasticity of LLMs could introduce variability. Debugging failures might also be more complex than traditional tests, requiring analysis of agent decisions (potentially aided by BAML tracing) alongside application behavior. Distinguishing between a genuine application bug, a poorly phrased test step, or an agent hallucination/misinterpretation will be crucial.
  • Handling Complex State & Logic: Tests often require intricate logic, conditional flows, or management of complex application state beyond what's immediately visible. How effectively the current architecture handles scenarios requiring deep memory of past interactions or complex conditional test logic remains to be fully assessed.

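To make the call-count concern concrete with purely illustrative numbers: if an average step needs one planning call, three convertAction calls, and two check evaluations, a 10-step test makes roughly 10 × (1 + 3 + 2) = 60 model calls, and a 200-test suite run on every CI push approaches 12,000 calls per run. At that volume, routing the most frequent calls (grounding and checks) to a cheaper, faster Micro-agent model is what keeps suite-level cost and latency manageable.
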
Conclusion

Magnitude represents a fascinating and potentially transformative approach to E2E web testing. By replacing brittle selectors with AI agents driven by natural language and visual understanding, it tackles the core pain points of traditional E2E test maintenance and authoring. The two-tiered Macro/Micro agent architecture, leveraging BAML for LLM interaction and Playwright for browser automation, provides a technically sound foundation.

While the promise of writing simple, resilient tests in plain English is compelling, the practical success will depend heavily on the reliability and efficiency of the visual grounding (Micro agent) and the reasoning capabilities (Macro agent), particularly in handling ambiguous situations and diagnosing failures accurately. The performance implications (cost and speed) associated with frequent LLM calls are also a key factor.

As an open-source project actively addressing these challenges, Magnitude is a technology worth watching closely. It offers a glimpse into a future where AI significantly streamlines the creation and maintenance of robust E2E test suites, potentially making comprehensive testing more accessible and effective for web development teams. Its success could mark a significant shift in how we ensure the quality and reliability of web applications.
