About GridArena
An open research platform for evaluating and auditing LLM agents on power-system operational tasks.
Purpose
GridArena lets researchers test a specific question: can an LLM agent recommend safe, effective corrective actions on a real power network? It pairs deterministic benchmark cases with structured evaluation, full provenance logging, and reproducible execution, so that claims about agent performance can be independently verified.
What's included
- Deterministic benchmark cases (case5 / case14 / case30) with full provenance logging.
- Single-run and batch experiment execution with a durable Postgres-backed job queue.
- Rule-based and simulation-based evaluation (external pandapower service + DC fallback).
- Side-by-side comparison and one-click CSV / LaTeX / SVG export of any report.
- Dashboard management: inline edit and delete on every list, soft-delete with a 5-second Undo toast, and bulk multi-select with a sticky action bar on Runs, Presets, and Batches.
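To make the "DC fallback" item above concrete, here is a minimal sketch of a textbook DC power-flow approximation in Python with NumPy. It is not GridArena's implementation: the function names (dc_power_flow, overloaded_lines), the 3-bus network, and the line limits are all assumptions chosen for illustration. The DC approximation ignores losses and reactive power and solves a linear system for bus angles, which is why it is often used as a cheap fallback when a full AC solve is unavailable.

```python
import numpy as np


def dc_power_flow(n_bus, lines, injections, slack=0):
    """Textbook DC power flow (illustrative, not GridArena's code).

    lines: list of (from_bus, to_bus, reactance_pu)
    injections: net active power per bus in p.u. (generation minus load)
    Returns bus angles in radians and per-line flows in p.u.
    """
    # Build the bus susceptance matrix from line reactances.
    B = np.zeros((n_bus, n_bus))
    for f, t, x in lines:
        b = 1.0 / x
        B[f, f] += b
        B[t, t] += b
        B[f, t] -= b
        B[t, f] -= b

    # Fix the slack bus angle at zero and solve for the remaining angles.
    keep = [i for i in range(n_bus) if i != slack]
    p = np.asarray(injections, dtype=float)
    theta = np.zeros(n_bus)
    theta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], p[keep])

    # Line flow is the angle difference divided by the line reactance.
    flows = np.array([(theta[f] - theta[t]) / x for f, t, x in lines])
    return theta, flows


def overloaded_lines(flows, limits):
    """Indices of lines whose absolute flow exceeds its limit (hypothetical helper)."""
    return [i for i, (f, lim) in enumerate(zip(flows, limits)) if abs(f) > lim]


# Hypothetical 3-bus example: bus 0 is the slack, buses 1 and 2 carry load.
lines = [(0, 1, 0.1), (0, 2, 0.1), (1, 2, 0.1)]
theta, flows = dc_power_flow(3, lines, [1.5, -0.5, -1.0])
violations = overloaded_lines(flows, limits=[1.0, 0.8, 1.0])
```

In this toy case the flows come out at roughly 0.667, 0.833, and 0.167 p.u., so only the second line (bus 0 to bus 2) exceeds its assumed 0.8 p.u. limit. A rule-based evaluator could then score an agent's corrective action by whether the list of violations is empty after the action is applied.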
System at a glance
Full component breakdown in the architecture docs.
Landscape & positioning
The 2025–2026 literature describes a growing family of LLM tools for power-system operations: multi-agent controllers like Grid-Agent, analysis co-pilots like GridMind, dispatch models like GAIA, agentic executors like PowerDAG, retrieval-augmented compliance systems like GridCodex, and benchmarks such as ProOPF, PFBench, and the EPRI electric-sector evaluation.
GridArena is deliberately one layer above these systems. It does not compete with them — it consumes them. Any agent exposing an inference endpoint can be registered as an engine in GridArena and put through the same physical-feasibility loop (PyPSA AC/DC power flow), robustness probes (counterfactual replays and perturbation jobs), and reasoning audit (decision trace, parser provenance, LLM-as-judge scoring). Where static benchmarks measure single-shot accuracy, GridArena turns evaluation into a reproducible runtime experiment.
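The idea of registering an agent as an engine and running it through a shared evaluation loop can be sketched as follows. Everything here is hypothetical: the Engine protocol, RunRecord, evaluate, perturbation_probe, and the toy feasibility check are illustrative names, not GridArena's API, and the check is a deliberately crude stand-in for a real power-flow solve.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol

# Hypothetical signature for a feasibility check: (state, action) -> bool.
Check = Callable[[dict, dict], bool]


class Engine(Protocol):
    """Hypothetical adapter: wraps any agent's inference endpoint."""
    name: str

    def recommend(self, case_state: dict) -> dict: ...


@dataclass
class RunRecord:
    engine: str
    case_id: str
    action: dict
    feasible: bool
    trace: list = field(default_factory=list)  # ordered (step, payload) pairs


def evaluate(engine: Engine, case_id: str, case_state: dict, check: Check) -> RunRecord:
    """Run one engine on one case and keep a decision trace for auditing."""
    trace = [("state", dict(case_state))]
    action = engine.recommend(case_state)
    trace.append(("action", action))
    feasible = check(case_state, action)
    trace.append(("feasible", feasible))
    return RunRecord(engine.name, case_id, action, feasible, trace)


def perturbation_probe(engine, case_id, case_state, check, scales):
    """Replay the same case with the line flow scaled by each factor."""
    records = []
    for s in scales:
        perturbed = {**case_state, "line_flow_mw": case_state["line_flow_mw"] * s}
        records.append(evaluate(engine, f"{case_id}@x{s}", perturbed, check))
    return records


class FixedShedEngine:
    """Stub agent that always recommends shedding 10 MW (assumption for the demo)."""
    name = "fixed-shed-stub"

    def recommend(self, case_state: dict) -> dict:
        return {"shed_mw": 10.0}


def toy_check(case_state: dict, action: dict) -> bool:
    # Toy stand-in for a power-flow feasibility check, not real physics.
    return case_state["line_flow_mw"] - action["shed_mw"] <= case_state["line_limit_mw"]


state = {"line_flow_mw": 105.0, "line_limit_mw": 100.0}
records = perturbation_probe(FixedShedEngine(), "case5", state, toy_check, [1.0, 1.1])
```

The design point this sketch tries to capture is that every engine passes through the same evaluate function, so results differ only because the agents differ. The stub's fixed action is feasible on the base case but fails once the flow is scaled by 1.1, which is the kind of brittleness a perturbation job is meant to expose.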
Full survey, comparison-at-a-glance table, and 24 references in the LLM Tools Landscape docs page.
Selected references (2025–2026)
- Zhang et al., "Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control," arXiv:2508.05702, 2025.
- Jin, Kim & Kwon, "GridMind: LLMs-Powered Agents for Power System Analysis and Operations," Argonne National Laboratory, arXiv:2509.02494, 2025.
- Cheng et al., "A large language model for advanced power dispatch (GAIA)," Scientific Reports 15:91940, 2025.
- Badmus & Pandey, "PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis," arXiv:2603.17418, 2026.
- Shen et al., "ProOPF: Benchmarking and Improving LLMs for Professional-Grade Power Systems Optimization Modeling," arXiv:2602.03070, 2026.
- Electric Power Research Institute, "Benchmarking Large Language Models for the Electric Power Sector," EPRI 3002034347, Feb. 2026.
See all 24 references in the Landscape bibliography.
Authors & affiliation
Zain Naeem
University of Palermo
How to cite
If you use GridArena in your research, please cite:
APA
Naeem, Z. (2026). GridArena: An LLM Agent Research Platform for Power System Operations (Version 1.0) [Computer software]. https://gridarena.eu
BibTeX
@software{gridarena2026,
title = {GridArena: An LLM Agent Research Platform for Power System Operations},
author = {Zain Naeem},
year = {2026},
version = {1.0},
url = {https://gridarena.eu},
note = {University of Palermo}
}