Architecture

GridArena is a thin browser UI on top of edge-deployed server functions, a Postgres database, an LLM gateway, an optional Python simulation service, and an evaluation layer (counterfactual replays, perturbation jobs, a decision-trace recorder, and an LLM-as-judge). Together these implement GridArena's role as an evaluation harness for LLM agents; see the LLM Tools Landscape for how this differs from operational agents like Grid-Agent, GridMind, GAIA, or PowerDAG.

For the experimental flow these components implement, see the methodology diagram.

Components

Browser (TanStack Start)

React 19 SPA with SSR, served from the edge. Uses TanStack Router for type-safe routing, TanStack Query for caching, and shadcn/ui components on Tailwind v4 design tokens.

Edge Worker (Server Functions)

Stateless functions that handle authentication, validation, queue processing, parsing, and the deterministic in-Worker DC power flow solver. Queue processing is concurrency-capped, with built-in retry and lease-based locking.
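As an illustration of lease-based locking with a retry budget and a concurrency cap, here is a minimal in-memory sketch. The `Job` shape, `LEASE_MS`, `MAX_ATTEMPTS`, and both function names are assumptions made for this example, not GridArena's schema or code; a real queue would presumably perform the claim as a single atomic update in Postgres.

```typescript
// Hypothetical job record; field names are illustrative only.
interface Job {
  id: string;
  attempts: number;
  leaseOwner: string | null;
  leaseExpiresAt: number; // epoch milliseconds; 0 when never leased
}

const LEASE_MS = 30000; // assumed lease duration
const MAX_ATTEMPTS = 3; // assumed retry budget

// Claim a job only if it is unleased or its lease has expired,
// and only while the retry budget is not exhausted.
function tryClaim(job: Job, workerId: string, now: number): Job | null {
  const leaseFree = job.leaseOwner === null || job.leaseExpiresAt <= now;
  if (!leaseFree || job.attempts >= MAX_ATTEMPTS) return null;
  return {
    ...job,
    attempts: job.attempts + 1,
    leaseOwner: workerId,
    leaseExpiresAt: now + LEASE_MS,
  };
}

// Concurrency cap: stop claiming once running + newly claimed jobs reach `cap`.
function claimBatch(
  jobs: Job[],
  workerId: string,
  now: number,
  running: number,
  cap: number
): Job[] {
  const claimed: Job[] = [];
  for (const job of jobs) {
    if (running + claimed.length >= cap) break;
    const c = tryClaim(job, workerId, now);
    if (c !== null) claimed.push(c);
  }
  return claimed;
}
```

Because a crashed worker never renews its lease, the job becomes claimable again once `leaseExpiresAt` passes, which is what makes retries safe without a separate cleanup step.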

Postgres (Lovable Cloud)

Single source of truth for runs, evaluations, metadata, presets, batches, the durable job queue, and validation results. Every table is RLS-scoped to the authenticated user.

LLM Gateway

Unified provider for Gemini and GPT-5 models with a fixed seed for reproducibility. No third-party API keys required — the gateway is built into Lovable Cloud.

Simulation Service (optional)

FastAPI + PyPSA container hosted externally — pure Python with no native binaries, supporting IEEE case5, case14, and case30. Exposes a stable HTTP contract (/version, /health, /simulate) used by both the evaluator and the diagnostics layer. Falls back to the in-Worker DC solver when the service is unavailable, so evaluations never block.
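For readers unfamiliar with the DC approximation the in-Worker fallback relies on, the sketch below solves a DC power flow on a toy three-bus network. It is not GridArena's solver: the `Line` type, the function names, and the network data are all made up for illustration, and bus 0 is assumed to be the slack bus.

```typescript
interface Line { from: number; to: number; x: number } // x: series reactance (p.u.)

// Solve A·z = b by Gaussian elimination with partial pivoting.
function solveLinear(A: number[][], b: number[]): number[] {
  const n = b.length;
  const M = A.map((row, i) => [...row, b[i]]); // augmented matrix [A | b]
  for (let col = 0; col < n; col++) {
    let pivot = col;
    for (let r = col + 1; r < n; r++) {
      if (Math.abs(M[r][col]) > Math.abs(M[pivot][col])) pivot = r;
    }
    if (Math.abs(M[pivot][col]) < 1e-12) throw new Error("singular network matrix");
    const tmp = M[col]; M[col] = M[pivot]; M[pivot] = tmp;
    for (let r = col + 1; r < n; r++) {
      const f = M[r][col] / M[col][col];
      for (let c = col; c <= n; c++) M[r][c] -= f * M[col][c];
    }
  }
  const z = new Array<number>(n).fill(0);
  for (let i = n - 1; i >= 0; i--) {
    let s = M[i][n];
    for (let c = i + 1; c < n; c++) s -= M[i][c] * z[c];
    z[i] = s / M[i][i];
  }
  return z;
}

// DC power flow with bus 0 as slack (angle fixed at 0 rad).
// injections[i] is the net injection at bus i: generation minus load, in p.u.
function dcPowerFlow(nBus: number, lines: Line[], injections: number[]) {
  const B = Array.from({ length: nBus }, () => new Array<number>(nBus).fill(0));
  for (const { from, to, x } of lines) {
    const b = 1 / x; // line susceptance
    B[from][from] += b; B[to][to] += b;
    B[from][to] -= b; B[to][from] -= b;
  }
  // Drop the slack row and column, then solve B'·theta = P.
  const Bred = B.slice(1).map((row) => row.slice(1));
  const theta = [0, ...solveLinear(Bred, injections.slice(1))];
  // Flow on each line, positive in the from -> to direction.
  const flows = lines.map(({ from, to, x }) => (theta[from] - theta[to]) / x);
  return { theta, flows };
}

// Toy network: three buses in a triangle, equal reactances, loads at buses 1 and 2.
const toyLines: Line[] = [
  { from: 0, to: 1, x: 0.1 },
  { from: 0, to: 2, x: 0.1 },
  { from: 1, to: 2, x: 0.1 },
];
const toyResult = dcPowerFlow(3, toyLines, [1.5, -0.9, -0.6]);
// toyResult.theta ≈ [0, -0.08, -0.07]; toyResult.flows ≈ [0.8, 0.7, -0.1]
```

The solve is a fixed sequence of arithmetic operations with no iteration or randomness, which is consistent with the deterministic behavior described for the in-Worker solver.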

Simulation Health (admin)

An admin-only diagnostics layer at /simulation-health that probes the external engine end-to-end: it calls /version and /health, then runs /simulate in parallel against IEEE case5/14/30. Each probe is persisted to simulation_health_checks (RLS-scoped) so failures can be reviewed over time. The page surfaces a state-change alert banner when the engine flips from pass to fail and renders contextual troubleshooting tips derived from the exact failing endpoint response (missing SIMULATION_SERVICE_URL / TOKEN, DNS or connection errors, 401/403, 5xx, per-case timeouts). Access is gated server-side via the has_role RPC — non-admin users see an access-required notice.
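The mapping from a failing probe to a troubleshooting category, and the pass-to-fail alert rule, could look roughly like the sketch below. The `ProbeResult` fields, the `Diagnosis` labels, and both function names are hypothetical; the actual columns of simulation_health_checks are not shown in this document.

```typescript
// Hypothetical probe record, not the real simulation_health_checks schema.
interface ProbeResult {
  endpoint: "/version" | "/health" | "/simulate";
  configured: boolean;       // are the service URL and token both set?
  httpStatus: number | null; // null when no HTTP response was received
  timedOut: boolean;
}

type Diagnosis =
  | "ok"
  | "missing_config"
  | "timeout"
  | "network"
  | "auth"
  | "server_error"
  | "unexpected_status";

// Classify a probe into the failure categories listed above.
function diagnose(p: ProbeResult): Diagnosis {
  if (!p.configured) return "missing_config";
  if (p.timedOut) return "timeout";
  if (p.httpStatus === null) return "network"; // DNS or connection error
  if (p.httpStatus === 401 || p.httpStatus === 403) return "auth";
  if (p.httpStatus >= 500) return "server_error";
  if (p.httpStatus >= 200 && p.httpStatus < 300) return "ok";
  return "unexpected_status";
}

// State-change alert: fire only when the previous probe passed and this one fails.
function shouldAlert(previous: Diagnosis | null, current: Diagnosis): boolean {
  return previous === "ok" && current !== "ok";
}
```

Alerting on the transition rather than on every failing probe keeps the banner from repeating while an outage is ongoing.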

Counterfactual Engine (Layer E)

A deterministic replay layer that re-executes the same benchmark case against alternative actions — either contextual defaults derived from the agent's action type or user-supplied custom actions. Runs entirely on the in-Worker DC power flow (no LLM calls), persists per-action outcomes to counterfactual_actions and counterfactual_results, and surfaces optimality gap and decision regret in the run and batch reports. See the workflow docs for the formulas.
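The exact formulas live in the workflow docs and are not reproduced here. Purely as an illustration of comparing an agent's action against an enumerated set of alternatives, the sketch below assumes each replay yields a scalar cost where lower is better, takes regret as the agent's cost minus the best cost found, and takes the gap as that regret relative to the best cost. These definitions, the `ActionOutcome` type, and the function name are assumptions and may differ from GridArena's.

```typescript
// Hypothetical replay outcome; "cost" is an assumed scalar where lower is better.
interface ActionOutcome { action: string; cost: number }

// Compare the agent's action with the best action in the enumerated set.
// Assumed definitions: regret = agent cost - best cost; gap = regret / |best cost|.
function regretAgainst(agent: ActionOutcome, alternatives: ActionOutcome[]) {
  const best = [agent, ...alternatives].reduce((a, b) => (b.cost < a.cost ? b : a));
  const regret = agent.cost - best.cost; // never negative: the agent is in the set
  const gap = best.cost !== 0 ? regret / Math.abs(best.cost) : null;
  return { bestAction: best.action, regret, gap };
}

// Example with made-up numbers: the agent sheds load, two alternatives are replayed.
const cfExample = regretAgainst(
  { action: "shed_load", cost: 12 },
  [
    { action: "redispatch", cost: 10 },
    { action: "do_nothing", cost: 15 },
  ]
);
// cfExample.bestAction is "redispatch" and cfExample.regret is 2.
```

Since the best action is chosen only from the replayed alternatives, a regret of zero here means the agent matched the best enumerated action, not that it found the true optimum.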

Perturbation Engine (robustness layer)

Where the counterfactual engine swaps the action, the perturbation engine swaps the environment. It re-runs the agent against systematically modified versions of the benchmark case — load scaled up or down, generators tripped, lines removed — and records whether the recommendation degrades gracefully, switches modes, or breaks. Jobs are enqueued from the batch detail page, executed through the same Edge-Worker pipeline as normal runs, and persisted to perturbation_jobs / perturbation_results for cross-run aggregation. This is GridArena's answer to the robustness gap left open by single-shot benchmarks like PFBench and ProOPF.
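The three perturbation families named above (load scaling, generator trips, line removal) can be expressed as pure transformations of a case. The sketch below uses a deliberately simplified `GridCase` shape and a hypothetical `Perturbation` union; neither reflects the real benchmark case format or the perturbation_jobs schema.

```typescript
// Simplified, hypothetical case shape for illustration.
interface GridCase {
  loads: Record<string, number>;       // bus id -> load in MW
  generators: Record<string, boolean>; // generator id -> in service?
  lines: string[];                     // ids of in-service lines
}

type Perturbation =
  | { kind: "scale_load"; factor: number }
  | { kind: "trip_generator"; id: string }
  | { kind: "remove_line"; id: string };

// Return a modified copy of the case; the base case is never mutated,
// so the same baseline can be reused across an entire sweep.
function perturb(base: GridCase, p: Perturbation): GridCase {
  switch (p.kind) {
    case "scale_load": {
      const loads: Record<string, number> = {};
      for (const bus of Object.keys(base.loads)) {
        loads[bus] = base.loads[bus] * p.factor;
      }
      return { ...base, loads };
    }
    case "trip_generator":
      return { ...base, generators: { ...base.generators, [p.id]: false } };
    case "remove_line":
      return { ...base, lines: base.lines.filter((id) => id !== p.id) };
  }
}

// A small sweep: scale load down and up, then trip one unit and drop one line.
const sweep: Perturbation[] = [
  { kind: "scale_load", factor: 0.8 },
  { kind: "scale_load", factor: 1.2 },
  { kind: "trip_generator", id: "g2" },
  { kind: "remove_line", id: "l1-2" },
];
```

Each entry in `sweep` would correspond to one re-run of the agent, so results can be grouped by perturbation kind when aggregating across runs.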

Optimizer Oracle (OPF)

A tiered OPF (optimal power flow) oracle that computes the mathematically optimal operator action for a benchmark case. It mirrors the simulator's tiering: an in-Worker DC-OPF LP solver (GLPK via WASM) runs against IEEE case5/case14/case30, and an external AC-OPF / SCOPF endpoint on the PyPSA service handles higher-fidelity and N-1 contingency cases. Supported objectives include min_violations, min_generation_cost, min_load_shed, and min_redispatch_from_baseline. Results are cached in optimizer_results (RLS-scoped) and consumed by four surfaces: the true optimality_gap / decision_regret metrics on run_evaluations, the Optimal Action panel on Run Details (agent vs optimum vs ground truth), a ground-truth scenario generator, and a batch-level oracle benchmark. Falls through to null when no tier is available, so evaluations never block. Unlike the counterfactual engine — which only compares against a finite enumerated action set — the oracle returns a global optimum, turning optimality_gap from a lower bound into an exact measurement.
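The "falls through to null" behavior can be sketched as a loop over available tiers, with the downstream metric left empty when no tier answers. The `Solver` signature, the `OracleResult` shape, and the regret definition (agent value minus optimum) are assumptions for this example; the real tiers are presumably asynchronous, and no GLPK or PyPSA API is shown here.

```typescript
type Objective =
  | "min_violations"
  | "min_generation_cost"
  | "min_load_shed"
  | "min_redispatch_from_baseline";

// Hypothetical result shape, not the optimizer_results schema.
interface OracleResult { tier: "dc_opf" | "ac_opf"; objectiveValue: number }

// A tier returns null when it cannot serve the request
// (for example an unsupported case, or the external service being down).
type Solver = (caseId: string, objective: Objective) => OracleResult | null;

// Try each tier in the given order; the first non-null result wins.
function resolveOracle(
  caseId: string,
  objective: Objective,
  tiers: Solver[]
): OracleResult | null {
  for (const solve of tiers) {
    const result = solve(caseId, objective);
    if (result !== null) return result;
  }
  return null;
}

// Assumed definition: regret = agent's objective value minus the optimum.
// With no oracle result, the metric stays null and the evaluation still completes.
function decisionRegret(agentValue: number, oracle: OracleResult | null): number | null {
  return oracle === null ? null : agentValue - oracle.objectiveValue;
}
```

Returning `null` rather than throwing lets callers store an empty metric and continue, which matches the stated goal that evaluations never block.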

Decision-trace recorder & LLM-as-judge

Every run emits a structured decision trace: the prompt log, the raw LLM response, the parser's per-field provenance (regex / JSON / fallback), each tool call with arguments and result, and the final structured action. A separate LLM-as-judge function scores the recommendation against domain rubrics (safety, feasibility, actionability) and persists the score, the rubric used, and the judge model. Together the trace and the judge let researchers attribute failures to a specific cause — wrong tool, misparsed solver output, or flawed final reasoning — rather than reporting a single opaque pass/fail.
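To make the trace contents and the attribution idea concrete, the sketch below models the fields listed above as TypeScript types and applies a simple precedence rule to pick a likely failure cause. The field names, the `ok` flag on tool calls, the 0 to 1 score scale, the threshold, and the rule itself are all assumptions for illustration; the real recorder and judge may work differently.

```typescript
type Provenance = "regex" | "json" | "fallback";

// Hypothetical shapes mirroring the trace contents described above.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
  result: unknown;
  ok: boolean; // assumed flag: did the call succeed?
}

interface DecisionTrace {
  prompt: string;
  rawResponse: string;
  fieldProvenance: Record<string, Provenance>; // parsed field -> how it was extracted
  toolCalls: ToolCall[];
  finalAction: Record<string, unknown>;
}

interface JudgeScore {
  rubric: "safety" | "feasibility" | "actionability";
  score: number; // assumed scale: 0 (worst) to 1 (best)
  judgeModel: string;
}

type FailureCause = "tool_error" | "parse_fallback" | "reasoning" | "none";

// Assumed precedence: tool failures first, then parser fallbacks, then judge scores.
function attributeFailure(
  trace: DecisionTrace,
  scores: JudgeScore[],
  passThreshold = 0.5
): FailureCause {
  if (trace.toolCalls.some((c) => !c.ok)) return "tool_error";
  const fields = Object.keys(trace.fieldProvenance);
  if (fields.some((f) => trace.fieldProvenance[f] === "fallback")) return "parse_fallback";
  if (scores.some((s) => s.score < passThreshold)) return "reasoning";
  return "none";
}
```

Checking earlier pipeline stages first reflects the idea that a low judge score is only attributable to reasoning once tooling and parsing have been ruled out.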

Rule-based Fallback

A deterministic, dependency-free evaluator that remains available regardless of external services. Used when both simulation engines are unavailable or when an experiment explicitly opts out of simulation.

Validation feedback loop

Beyond the runtime evaluation pipeline, GridArena also closes the loop on its curated metadata. The case-meta validator scans CASE_META for missing or malformed fields, surfaces them in the dev panel on /docs/cases, and offers an AI-assisted fix for each issue. Suggestions are reviewed in a drawer, accepted into case_meta_overrides, merged back into the validator on the next render, and recorded in an append-only case_meta_override_audit trail. The diagram below shows the full cycle — from a flagged issue to a re-validated dataset with a signed history of who changed what and when.
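The accept-and-revalidate step of this loop can be sketched as three small pure functions: validate, accept an override while appending an audit entry, and merge overrides over the base metadata. The `CaseMeta` and `AuditEntry` shapes, the field names, and the required-field check are hypothetical stand-ins; the real validator rules and table columns are not documented here, and signing is not modeled.

```typescript
type MetaValue = string | number | null;
type CaseMeta = Record<string, MetaValue>;

// Hypothetical audit record, not the case_meta_override_audit schema.
interface AuditEntry {
  field: string;
  before: MetaValue;
  after: MetaValue;
  actor: string;
  at: string; // ISO timestamp supplied by the caller
}

// Report required fields that are absent, null, or empty.
function validate(meta: CaseMeta, required: string[]): string[] {
  return required.filter((f) => meta[f] === undefined || meta[f] === null || meta[f] === "");
}

// Overrides win over curated base values when both define a field.
function effectiveMeta(base: CaseMeta, overrides: CaseMeta): CaseMeta {
  return { ...base, ...overrides };
}

// Accept a suggestion: return new overrides plus a new audit array.
// The existing audit array is copied, never edited, to mimic an append-only trail.
function acceptSuggestion(
  base: CaseMeta,
  overrides: CaseMeta,
  audit: readonly AuditEntry[],
  change: { field: string; value: MetaValue; actor: string; at: string }
): { overrides: CaseMeta; audit: AuditEntry[] } {
  const current = effectiveMeta(base, overrides)[change.field];
  const entry: AuditEntry = {
    field: change.field,
    before: current === undefined ? null : current,
    after: change.value,
    actor: change.actor,
    at: change.at,
  };
  return {
    overrides: { ...overrides, [change.field]: change.value },
    audit: [...audit, entry],
  };
}
```

Re-running `validate` on `effectiveMeta(base, overrides)` after each accepted suggestion is what closes the loop: a flagged field disappears from the issue list once its override is in place.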

For the matching how-to (drawer, accept/revert, bulk actions, audit panel), see the Reviewing AI suggestions section. For how this fits into the experiment lifecycle, see the AI-assisted fixes methodology.