About GridArena

An open research platform for evaluating and auditing LLM agents on power-system operational tasks.

Purpose

GridArena lets researchers ask: can an LLM agent recommend safe, effective corrective actions on a real power network? It pairs deterministic benchmark cases with structured evaluation, full provenance logging, and reproducible execution, so that claims about agent performance can be independently verified.

What's included

  • Deterministic benchmark cases (case5 / case14 / case30) with full provenance logging.
  • Single-run and batch experiment execution with a durable Postgres-backed job queue.
  • Rule-based and simulation-based evaluation (external pandapower service + DC fallback).
  • Side-by-side comparison and one-click CSV / LaTeX / SVG export of any report.
  • Dashboard management: inline edit and delete on every list, soft-delete with a 5-second Undo toast, and bulk multi-select with a sticky action bar on Runs, Presets, and Batches.
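To make the "DC fallback" idea concrete, here is a minimal sketch of a DC power flow followed by a rule-style line-limit check. It is not GridArena's solver: the 3-bus network, the per-unit reactances and limits, and the names `Line`, `dc_power_flow`, and `overloaded` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    from_bus: int
    to_bus: int
    reactance: float  # per-unit series reactance (illustrative value)
    limit: float      # per-unit flow limit (illustrative value)


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular susceptance matrix (islanded network?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def dc_power_flow(n_buses, lines, injections, slack=0):
    """Return per-line flows (p.u.) from the DC approximation B' theta = P.

    The slack bus angle is fixed at zero and its injection entry is ignored.
    """
    keep = [bus for bus in range(n_buses) if bus != slack]
    idx = {bus: i for i, bus in enumerate(keep)}
    b_mat = [[0.0] * len(keep) for _ in keep]
    for ln in lines:
        y = 1.0 / ln.reactance
        for a, b in ((ln.from_bus, ln.to_bus), (ln.to_bus, ln.from_bus)):
            if a in idx:
                b_mat[idx[a]][idx[a]] += y
                if b in idx:
                    b_mat[idx[a]][idx[b]] -= y
    theta_reduced = solve_linear(b_mat, [injections[bus] for bus in keep])
    theta = [0.0] * n_buses
    for bus, i in idx.items():
        theta[bus] = theta_reduced[i]
    return [(theta[ln.from_bus] - theta[ln.to_bus]) / ln.reactance for ln in lines]


def overloaded(lines, flows):
    """Rule-style check: lines whose absolute flow exceeds their limit."""
    return [ln for ln, f in zip(lines, flows) if abs(f) > ln.limit + 1e-9]


# Toy 3-bus triangle: bus 0 is the slack, bus 2 draws 0.9 p.u.
lines = [Line(0, 1, 0.1, 0.5), Line(0, 2, 0.1, 0.5), Line(1, 2, 0.1, 0.5)]
flows = dc_power_flow(3, lines, injections=[0.9, 0.0, -0.9])
print([round(f, 3) for f in flows])                            # [0.3, 0.6, 0.3]
print([(l.from_bus, l.to_bus) for l in overloaded(lines, flows)])  # [(0, 2)]
```

In this toy case the direct line from bus 0 to bus 2 carries 0.6 p.u. against an assumed 0.5 p.u. limit. That is the kind of violation a corrective action would have to clear before it could be judged feasible.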

System at a glance

  • Browser: TanStack Start UI (React 19 · Vite)
  • Edge Worker: Server Functions, Queue Worker, DC Powerflow Solver (Auth · Validation · Eval)
  • Postgres (Cloud): Runs · Evaluations, Job Queue · Logs, RLS (user-scoped), Validation Results
  • LLM Gateway: Gemini · GPT-5, deterministic seed
  • Simulation + Optimizer (optional · external): PyPSA · /simulate · /optimize, AC-OPF · SCOPF (N-1)
  • Optimizer Oracle: DC-OPF (GLPK/WASM), tiered → AC-OPF
  • Rule-based Fallback: always available
  • Simulation Health (admin): probes /version + /health + /simulate, writes simulation_health_checks

Full component breakdown in the architecture docs.
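The overview above pairs an optional external simulation service and a tiered optimizer oracle with a rule-based fallback that is always available. One way such a chain could be wired is sketched below. The tier order, the `evaluate_with_fallback` helper, and the verdict payload are assumptions for illustration, not GridArena's actual implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types: a tier takes a case and returns a verdict dict,
# returns None to decline, or raises if it is unavailable.
Tier = Tuple[str, Callable[[dict], Optional[dict]]]


def evaluate_with_fallback(case: dict, tiers: Sequence[Tier]) -> Tuple[str, dict]:
    """Try each tier in order and return the first verdict produced."""
    skipped = []
    for name, tier in tiers:
        try:
            verdict = tier(case)
        except Exception as exc:  # an unreachable service should not abort the run
            skipped.append(f"{name}: {exc}")
            continue
        if verdict is None:
            skipped.append(f"{name}: declined")
            continue
        # Record which tiers were skipped so the result stays auditable.
        return name, {**verdict, "skipped": skipped}
    raise RuntimeError("no tier produced a verdict: " + "; ".join(skipped))


# Stand-in tiers (all hypothetical).
def external_simulation(case):
    raise ConnectionError("service unreachable")


def dc_oracle(case):
    return None  # e.g. the case is outside what the oracle supports


def rule_based(case):
    return {"feasible": True, "method": "rules"}


tiers = [
    ("external_simulation", external_simulation),
    ("dc_oracle", dc_oracle),
    ("rule_based", rule_based),
]
name, verdict = evaluate_with_fallback({"case": "case14"}, tiers)
print(name, verdict["skipped"])
```

Keeping the list of skipped tiers next to the verdict means a reader of the report can tell a rule-based result from one backed by a full simulation.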

Landscape & positioning

The 2025–2026 literature describes a growing family of LLM tools for power-system operations: multi-agent controllers like Grid-Agent, analysis co-pilots like GridMind, dispatch models like GAIA, agentic executors like PowerDAG, retrieval-augmented compliance systems like GridCodex, and benchmarks such as ProOPF, PFBench, and the EPRI electric-sector evaluation.

GridArena is deliberately one layer above these systems. It does not compete with them — it consumes them. Any agent exposing an inference endpoint can be registered as an engine in GridArena and put through the same physical-feasibility loop (PyPSA AC/DC power flow), robustness probes (counterfactual replays and perturbation jobs), and reasoning audit (decision trace, parser provenance, LLM-as-judge scoring). Where static benchmarks measure single-shot accuracy, GridArena turns evaluation into a reproducible runtime experiment.
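As a rough sketch of what "registering an engine" and running a perturbation probe could look like, the example below treats an engine as any callable that maps a scenario to a proposed action. Everything here is hypothetical: `EngineRegistry`, `run_perturbation_probe`, the scenario fields, and the toy capacity check stand in for real inference endpoints and a real power-flow feasibility check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

Scenario = dict  # hypothetical scenario payload
Action = dict    # hypothetical action payload
Engine = Callable[[Scenario], Action]
Check = Callable[[Scenario, Action], bool]


@dataclass
class EngineRegistry:
    engines: Dict[str, Engine] = field(default_factory=dict)

    def register(self, name: str, engine: Engine) -> None:
        if name in self.engines:
            raise ValueError(f"engine {name!r} already registered")
        self.engines[name] = engine


def scale_loads(scenario: Scenario, factor: float) -> Scenario:
    """Perturbation: scale every load by the same factor."""
    return {**scenario, "loads": [load * factor for load in scenario["loads"]]}


def run_perturbation_probe(
    registry: EngineRegistry, check: Check, scenario: Scenario, factors: Sequence[float]
) -> Dict[str, List[dict]]:
    """Put every registered engine through the same perturbed scenarios."""
    report: Dict[str, List[dict]] = {}
    for name, engine in registry.engines.items():
        rows = []
        for factor in factors:
            variant = scale_loads(scenario, factor)
            action = engine(variant)
            rows.append(
                {"factor": factor, "action": action, "feasible": check(variant, action)}
            )
        report[name] = rows
    return report


# Toy engines and a toy feasibility rule (stand-ins for endpoints and power flow).
def do_nothing(scenario):
    return {"shed": 0.0}


def shed_excess(scenario):
    return {"shed": max(0.0, sum(scenario["loads"]) - scenario["capacity"])}


def within_capacity(scenario, action):
    return sum(scenario["loads"]) - action["shed"] <= scenario["capacity"] + 1e-9


registry = EngineRegistry()
registry.register("do_nothing", do_nothing)
registry.register("shed_excess", shed_excess)
report = run_perturbation_probe(
    registry, within_capacity, {"loads": [0.4, 0.5], "capacity": 1.0}, [0.9, 1.0, 1.2]
)
for name, rows in report.items():
    print(name, [row["feasible"] for row in rows])
```

Because every engine sees the same perturbed scenarios and the same check, their results are directly comparable. Under these toy assumptions, only the 1.2 load factor separates the two engines.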

Full survey, comparison-at-a-glance table, and 24 references in the LLM Tools Landscape docs page.

Selected references (2025–2026)

  1. Zhang et al., "Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control," arXiv:2508.05702, 2025.
  2. Jin, Kim & Kwon, "GridMind: LLMs-Powered Agents for Power System Analysis and Operations," Argonne National Laboratory, arXiv:2509.02494, 2025.
  3. Cheng et al., "A large language model for advanced power dispatch (GAIA)," Scientific Reports 15:91940, 2025.
  4. Badmus & Pandey, "PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis," arXiv:2603.17418, 2026.
  5. Shen et al., "ProOPF: Benchmarking and Improving LLMs for Professional-Grade Power Systems Optimization Modeling," arXiv:2602.03070, 2026.
  6. Electric Power Research Institute, "Benchmarking Large Language Models for the Electric Power Sector," EPRI 3002034347, Feb. 2026.

See all 24 references in the Landscape bibliography.

Authors & affiliation

Zain Naeem
University of Palermo

How to cite

If you use GridArena in your research, please cite:

APA

Naeem, Z. (2026). GridArena: An LLM Agent Research Platform for Power System Operations (Version 1.0) [Computer software]. https://gridarena.eu

BibTeX

@software{gridarena2026,
  title   = {GridArena: An LLM Agent Research Platform for Power System Operations},
  author  = {Zain Naeem},
  year    = {2026},
  version = {1.0},
  url     = {https://gridarena.eu},
  note    = {University of Palermo}
}

Get started