GridArena Documentation

GridArena is a research platform for evaluating and auditing LLM agents on power-system operational tasks. It combines deterministic simulation, structured evaluation, and full provenance logging so experiments are reproducible end-to-end.

What you can do

Run a single LLM agent on a benchmark case (case5 / case14 / case30) and inspect every step.
Compose presets and execute large batch experiments with background queueing.
Evaluate recommendations against a deterministic DC power-flow solver (or an external PyPSA service).
Compare runs side by side, export results to CSV / LaTeX / SVG, and run self-tests that validate the simulation engine itself.

Where to start

LLM Tools Landscape (2025–2026) — survey of 24 LLM tools for power-system operations, comparison-at-a-glance table, and how GridArena fits in.
Installation — how the platform is wired and what to configure.
Usage — sign in, create runs, manage presets and batches.
Experiment Workflow — the full lifecycle from prompt to evaluation.
Architecture — system diagram and component responsibilities.
Reproducibility — three reproducible experiments with expected metrics.
Troubleshooting — common errors and fixes.
(Admin) Simulation Health — run engine self-tests across IEEE case5/14/30 and review failure history.

New: Counterfactual Analysis (Layer E)

GridArena now replays each agent decision against alternative actions to compute the optimality gap and the decision regret for that decision.

Workflow → Counterfactual analysis — formulas and methodology.
Architecture → Counterfactual Engine — where it sits in the pipeline.
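
As a rough illustration of the two metrics, the sketch below scores one decision against a set of alternatives. It is not GridArena's implementation: the function name `counterfactual_metrics`, the action labels, and the cost values are hypothetical, and it assumes each action has a scalar cost where lower is better. Regret is taken here as the cost difference to the best alternative, and the optimality gap as that difference relative to the best cost. The definitions GridArena actually uses are on the Workflow page.

```python
from typing import Dict, Mapping, Union


def counterfactual_metrics(
    costs: Mapping[str, float], chosen: str
) -> Dict[str, Union[str, float]]:
    """Score one decision against its alternatives (illustrative only).

    costs  -- hypothetical mapping of action label to scalar cost (lower is better);
              it must include the action the agent actually chose.
    chosen -- label of the action the agent chose.
    """
    if chosen not in costs:
        raise KeyError(f"chosen action {chosen!r} has no replayed cost")

    # The best alternative is the action with the lowest cost.
    best_action = min(costs, key=lambda action: costs[action])
    best_cost = costs[best_action]

    # Assumed definition: regret is the absolute cost penalty of the choice.
    regret = costs[chosen] - best_cost

    # Assumed definition: optimality gap is regret relative to the best cost.
    gap = regret / abs(best_cost) if best_cost != 0 else float("nan")

    return {"best_action": best_action, "regret": regret, "optimality_gap": gap}


# Made-up costs for three candidate actions at a single decision step.
costs = {"redispatch_g2": 120.0, "shed_load_b3": 150.0, "no_action": 180.0}
metrics = counterfactual_metrics(costs, chosen="shed_load_b3")
print(metrics)
```

With these made-up numbers the agent's choice costs 30 more than the best alternative, which is a gap of 25% relative to the best cost.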

Citation

If you use GridArena in your research, please cite it. See the About page for APA and BibTeX entries.