LLM Tools for Power Systems — Landscape (2025–2026)

This page surveys 24 peer-reviewed and preprint works on large language model (LLM) tools for power-system operations published in 2025 and 2026, and clarifies how GridArena positions itself in this landscape. For each tool we list its purpose, required inputs, expected outputs, the type of decisions or recommendations it generates, and its operational scope and limitations. In the comparison-at-a-glance table, every row for a surveyed work maps one-to-one to a numbered reference in the bibliography below; the final row describes GridArena itself.

1. Landscape of LLM tools (2025–2026)

The 2025–2026 literature on LLMs for power systems can be grouped into five overlapping families: (i) operational agents and co-pilots that take actions or recommend them; (ii) optimization- and solver-coupled LLMs that translate natural language into formal models; (iii) retrieval-augmented and knowledge-grounded systems for compliance and operations and maintenance; (iv) benchmarks and surveys; and (v) domain-specialized foundation models and adjacent finance/market work.

1.1 Operational agents and co-pilots

Grid-Agent [1] is a multi-agent LLM framework that coordinates distributed energy resources for grid control, with each agent specialized for a sub-task and a planner agent orchestrating the workflow. GridMind [2], from Argonne National Laboratory, exposes power-system analysis tools (power flow, contingency analysis) to an LLM so analysts can ask natural-language questions and receive structured numerical answers. The closely named Grid-Mind work [3] focuses on connection impact assessment, orchestrating multi-fidelity simulations from a single interconnection request expressed in natural language.

GAIA [4], published in Scientific Reports, fine-tunes an LLM for advanced power dispatch and couples it with classical dispatch routines, making it one of the first peer-reviewed demonstrations of an LLM acting in the dispatch loop. Grid CoPilot [5] targets long-term planning rather than real-time operations. InstructMPC [6] keeps a human in the loop and uses LLM-derived instructions to adapt model-predictive controllers to context. PowerDAG [7], a 2026 preprint, builds an agentic directed-acyclic-graph executor for distribution-grid analysis that emphasizes reliability.

Beyond these flagship systems, several specialized agents are worth noting. The feedback-driven multi-agent framework of Jia et al. [8] focuses on running and debugging power-system simulations under LLM control. Hu et al. [9] propose a validation-in-the-loop pipeline that converts natural-language descriptions into solver-ready optimization problems. Ren et al. [10] integrate LLM agents with a stochastic unit-commitment framework to handle wind uncertainty. Yang et al. provide two complementary contributions: an LLM-powered automated modeler for active-distribution-network dispatch [11], and an LLM-RL collaboration for two-stage voltage control [12]. LLM4DistReconfig [13] is a fine-tuned LLM for distribution-network reconfiguration. Behavioral generative agents for dispatch and auction [14] explore LLMs as bidders and operators in market settings, and the knowledge-driven adaptive method of [15] couples ontologies with LLM agents for operations.

1.2 Optimization and solver-coupled LLMs

A recurring pattern uses the LLM as a translator between human intent and formal mathematical programs. Hu et al. [9] add a validation-in-the-loop step so that solver feasibility feedback closes the loop on the LLM's output. LLM4DistReconfig [13] follows the same philosophy at the distribution level. The most rigorous evaluation of this pattern to date is ProOPF [19], a 2026 benchmark that measures LLMs on professional-grade optimal power flow modeling and provides automatic feasibility scoring of generated models.
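The translate-then-validate pattern described above can be sketched in a few lines. This is a minimal illustration and not the implementation of any cited system: the `generate` and `validate` callables, the text-based "model", and the toy stand-ins for the LLM and the solver are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical signatures: the "model" is plain text, the validator returns
# (is_feasible, feedback). Neither reflects the API of any cited system.
GenerateFn = Callable[[str, List[str]], str]
ValidateFn = Callable[[str], Tuple[bool, str]]


def validation_in_the_loop(
    request: str,
    generate: GenerateFn,
    validate: ValidateFn,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for a model, validate it, and feed validator errors back on retry."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(request, feedback)
        ok, message = validate(candidate)
        if ok:
            return candidate
        feedback.append(message)  # validator feedback becomes retry context
    return None  # no feasible model within the round budget


# Toy stand-ins for the LLM and the solver, for illustration only.
def toy_generate(request: str, feedback: List[str]) -> str:
    # The first attempt omits the generator limit; the retry adds it.
    if feedback:
        return "min cost s.t. balance, p <= p_max"
    return "min cost s.t. balance"


def toy_validate(model: str) -> Tuple[bool, str]:
    if "p_max" not in model:
        return False, "unbounded: generator output p has no upper limit"
    return True, "ok"


result = validation_in_the_loop(
    "dispatch two generators", toy_generate, toy_validate
)
print(result)
```

The design point is that the validator, not the language model, decides when the loop stops; a bounded round count keeps a persistently infeasible request from looping forever.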

1.3 Retrieval-augmented and knowledge-grounded systems

GridCodex [16], from Huawei, uses retrieval-augmented generation over grid codes for compliance reasoning. Cheng et al. [17] extend RAG to operational reliability evaluation, retrieving over historical events and standards before answering. The virtual-power-plant device failure query model in [18] applies RAG to operations and maintenance for fleets of distributed assets. These systems share an explicit limitation: they reason over text and do not, by themselves, check physical feasibility.

1.4 Benchmarks and surveys

Three benchmarks define the current measurement frontier, and one survey maps the field. ProOPF [19] benchmarks LLMs on OPF modeling. PFBench [20] is a 2026 power-flow benchmark for LLM-based power-system agent evaluation, hosted on IEEE DataPort. The Electric Power Research Institute (EPRI) released the first electric-sector benchmarking results for public LLMs in early 2026 [21]. The comprehensive literature survey of Sarwar et al. [22] catalogs the field to date.

1.5 Domain-specialized models and adjacent work

EnergyGPT [23] is a foundation-style LLM specialized for the energy sector. On the market side, Cui et al. [24] use LLM-augmented reinforcement learning for energy futures trading. These works are not operational agents, but they shape the substrate (specialized models, market signals) on which future operational agents will be built.

2. Comparison at a glance

Each row except the last maps one-to-one to a numbered reference in Section 4. The final row introduces GridArena explicitly, framing it not as another operational agent but as an evaluation and benchmarking platform.

Tool	Year	Ref	Inputs	Outputs / Decisions	Physical-feasibility check	Key limitations
Grid-Agent	2025	[1]	Grid topology, DER setpoints, NL operator queries	Multi-agent control actions; DER coordination	Implicit — relies on simulator wrapper	Coordination overhead; opaque inter-agent reasoning
GridMind (Argonne)	2025	[2]	NL questions over power-flow / contingency tools	Structured numerical answers from registered tools	Yes — calls deterministic analysis tools	Analyst Q&A scope, not closed-loop control
Grid-Mind (CIA)	2026	[3]	Interconnection request in natural language	Multi-fidelity connection impact assessment	Yes — orchestrates fidelity-tiered simulators	Single workflow (interconnection), narrow scope
GAIA	2025	[4]	Dispatch context, NL operator instructions	Dispatch decisions via LLM + classical routines	Yes — coupled dispatch solver	Fine-tuned on specific dispatch scope
Grid CoPilot	2025	[5]	Long-term planning datasets, scenario queries	Capacity-expansion / scenario navigation	Indirect — planning models, not real-time PF	Not for real-time operations
InstructMPC	2025	[6]	Operator instructions + MPC state	Context-adapted MPC control law	Via MPC controller	Requires human-in-the-loop
PowerDAG	2026	[7]	Distribution-grid analysis tasks	Reliability-focused agentic DAG execution	Yes — tool-grounded execution	Distribution scope; reliability of agent graph
Jia et al. (feedback MA)	2025	[8]	Simulation specs, debug feedback	Working power-system simulations	Yes — simulator-in-the-loop	Focuses on simulation building, not control
Hu et al. (NL→solver)	2025	[9]	NL optimization problem statement	Solver-ready optimization model	Yes — validation-in-the-loop with solver	Modeling assistant, not operational agent
Ren et al. (SUC)	2025	[10]	Wind/load scenarios, system data	LLM-orchestrated stochastic unit commitment	Yes — UC solver	Scoped to UC under wind uncertainty
Yang et al. (ADN modeler)	2025	[11]	ADN dispatch problem in NL	Auto-built ADN dispatch model + solution	Yes — solver-coupled	Active-distribution-network scope
Yang et al. (LLM-RL voltage)	2026	[12]	ADN voltage state, control objectives	Two-stage voltage control actions	Yes — RL environment grounded in PF	Voltage-control task only
LLM4DistReconfig	2025	[13]	Distribution topology, reconfiguration query	Switching reconfiguration plan	Yes — feasibility filtering	Single task; needs fine-tuning per network
Behavioral generative agents	2026	[14]	Market state, agent personas	Bidding / dispatch behavior in markets	Indirect — market simulator	Behavioral study, not operations control
Knowledge-driven LLM agents	2026	[15]	Ontology + operations queries	Adaptive operating recommendations	Partial — ontology-grounded	Heavily dependent on KB quality
GridCodex (RAG)	2025	[16]	Grid code corpus + compliance question	Cited compliance reasoning	No — text-only	Cannot detect physical infeasibility
Cheng et al. (RAG reliability)	2026	[17]	Historical events + standards corpus	Reliability evaluation answers	No — text-only	Cannot detect physical infeasibility
VPP O&M RAG	2025	[18]	Device failure logs, manuals	Failure-query answers for VPP O&M	No — text-only	Maintenance scope; no physics
ProOPF (benchmark)	2026	[19]	OPF problem descriptions	LLM-generated optimization models	Solver feasibility check on generated models	Scope limited to optimization modeling
PFBench (benchmark)	2026	[20]	Power-flow tasks for agent evaluation	Pass/fail and accuracy metrics	Power-flow solver as ground truth	Single task family (power flow)
EPRI benchmarking	2026	[21]	Electric-sector LLM evaluation suite	Public LLM benchmark results for the sector	Static test suite	Static; not a runtime evaluation harness
Sarwar et al. (survey)	2025	[22]	Literature on LLMs in power systems	Comprehensive survey	N/A	Survey, not a tool
EnergyGPT	2025	[23]	Energy-sector pre-training corpus	Domain-specialized LLM	N/A — base model	Foundation model, not an agent
Cui et al. (LLM-RL trading)	2025	[24]	Energy futures market signals	RL trading strategies augmented by LLM	N/A — market scope	Trading, not operations
GridArena (this work)	2026	—	Any LLM agent + benchmark case (case5/14/30, CIGRE/IEEE), counterfactual & perturbation jobs, judge prompts	Audit report: feasibility, robustness, decision-trace and parser provenance, LLM-as-judge scores per action	Yes — PyPSA AC/DC power-flow loop on every action	Evaluation layer, not an operational agent — depends on quality of probes & judges

Reading the table: rows [1]–[15] are operational or analytical LLM agents that produce control or analysis outputs. Rows [16]–[18] (RAG systems) reason over textual knowledge bases. Rows [19]–[21] are benchmarks and [22] is a survey. Rows [23]–[24] cover a domain-specialized model and adjacent market work. The last row, GridArena, is an evaluation platform that wraps the agents above and audits them on physical feasibility (via a PyPSA AC/DC power-flow loop), robustness (via counterfactual probes and perturbation jobs), and reasoning integrity (decision trace, parser provenance, and LLM-as-judge scoring).

3. How GridArena fits in

Most tools above are agents that act on the grid. GridArena is deliberately one layer above: it is an evaluation and benchmarking harness for those agents. This positioning addresses three concrete questions: whether an agent's actions are physically feasible, whether its behavior stays robust under changing conditions, and where its reasoning or tool use fails.

3.1 Physical feasibility

Every agent action that GridArena observes is replayed through a PyPSA AC/DC power-flow simulator. The result — converged or not, line/voltage limits respected or violated — is attached to the action as ground truth. This addresses the concern that an LLM may produce plausible-sounding but physically infeasible recommendations, a failure mode that pure text-based RAG systems [16, 17, 18] cannot detect by themselves.
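As a rough illustration of attaching a power-flow outcome to an action, the sketch below turns a simulator result into a feasibility label. It does not call PyPSA and does not reflect GridArena's actual data model: the `PowerFlowResult` container, the function name, and the voltage and loading thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PowerFlowResult:
    """Hypothetical container for what a power-flow run reports back."""
    converged: bool
    bus_voltage_pu: Dict[str, float] = field(default_factory=dict)
    line_loading_pct: Dict[str, float] = field(default_factory=dict)


def feasibility_label(
    result: PowerFlowResult,
    v_min: float = 0.95,            # illustrative voltage band, not a standard
    v_max: float = 1.05,
    max_loading_pct: float = 100.0,
) -> Dict[str, object]:
    """Turn a power-flow result into a label that can be attached to an action."""
    violations: List[str] = []
    if not result.converged:
        # Without convergence the voltages and loadings are not meaningful.
        violations.append("power flow did not converge")
    else:
        for bus, v in sorted(result.bus_voltage_pu.items()):
            if not v_min <= v <= v_max:
                violations.append(f"voltage at {bus}: {v:.3f} pu")
        for line, loading in sorted(result.line_loading_pct.items()):
            if loading > max_loading_pct:
                violations.append(f"loading on {line}: {loading:.1f}%")
    return {"feasible": not violations, "violations": violations}


example = PowerFlowResult(
    converged=True,
    bus_voltage_pu={"bus1": 1.00, "bus2": 0.93},
    line_loading_pct={"line1-2": 112.0},
)
label = feasibility_label(example)
print(label)
```

In this example the action would be labeled infeasible for two reasons: an undervoltage at `bus2` and an overload on `line1-2`.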

3.2 Robustness under changing conditions

GridArena runs counterfactual probes and perturbation jobs around each baseline run: load is scaled, generators are tripped, lines are removed, and the agent is re-queried. The platform records whether the agent's recommendation degrades gracefully, switches modes appropriately, or breaks. This complements benchmarks like PFBench [20] and ProOPF [19], which evaluate single-shot accuracy rather than behavior under stress.
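The three perturbation types named above (load scaling, generator trips, line removal) can be sketched as a generator of jobs derived from one baseline case. The dict-based case format, the job names, and the default scale factors are hypothetical and chosen only for illustration; they are not GridArena's job schema.

```python
import copy
from typing import Dict, Iterator, Tuple

# Hypothetical case format: plain dicts keyed by component name (values in MW).
Case = Dict[str, Dict[str, float]]


def perturbation_jobs(
    baseline: Case,
    load_scales: Tuple[float, ...] = (0.8, 1.2),
) -> Iterator[Tuple[str, Case]]:
    """Yield (job_name, perturbed_case) pairs derived from one baseline case."""
    for scale in load_scales:
        case = copy.deepcopy(baseline)
        case["loads"] = {name: mw * scale for name, mw in case["loads"].items()}
        yield f"load_x{scale}", case
    for gen in baseline["generators"]:
        case = copy.deepcopy(baseline)
        del case["generators"][gen]  # generator trip
        yield f"trip_{gen}", case
    for line in baseline["lines"]:
        case = copy.deepcopy(baseline)
        del case["lines"][line]  # line outage
        yield f"remove_{line}", case


baseline: Case = {
    "loads": {"load1": 50.0, "load2": 30.0},
    "generators": {"gen1": 100.0, "gen2": 60.0},
    "lines": {"line1-2": 80.0},
}
jobs = list(perturbation_jobs(baseline))
print([name for name, _ in jobs])
```

Each job works on a deep copy, so the baseline stays untouched and every perturbed run can be compared against the same reference.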

3.3 Failure modes in reasoning and tool use

GridArena records a full decision trace for every run: the prompt log, the tool calls, the parser provenance for every structured field, and an LLM-as-judge scoring of the final recommendation against domain rubrics. Together, these records let researchers attribute failures to a specific cause (wrong tool selection, misparsed solver output, or flawed final reasoning) rather than reporting a single opaque score.
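One way to picture failure attribution from such a trace is the sketch below. The record layout, field names, pass mark, and the rule of reporting the earliest failing stage are assumptions for illustration, not GridArena's schema or scoring logic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolCall:
    name: str
    expected: Optional[str] = None  # tool a reference solution would have called
    parse_ok: bool = True           # whether the tool output was parsed cleanly


@dataclass
class DecisionTrace:
    """Hypothetical record of one run; field names are illustrative."""
    prompts: List[str] = field(default_factory=list)
    tool_calls: List[ToolCall] = field(default_factory=list)
    provenance: Dict[str, str] = field(default_factory=dict)  # field -> source
    judge_score: Optional[float] = None  # rubric score, assumed to lie in [0, 1]


def attribute_failure(trace: DecisionTrace, pass_mark: float = 0.5) -> str:
    """Report the earliest pipeline stage at which the run went wrong."""
    for call in trace.tool_calls:
        if call.expected is not None and call.name != call.expected:
            return "wrong_tool_selection"
    for call in trace.tool_calls:
        if not call.parse_ok:
            return "misparsed_tool_output"
    if trace.judge_score is not None and trace.judge_score < pass_mark:
        return "flawed_final_reasoning"
    return "no_failure_detected"


trace = DecisionTrace(
    prompts=["Relieve the overload on line1-2."],
    tool_calls=[ToolCall(name="run_power_flow", expected="run_power_flow",
                         parse_ok=False)],
    provenance={"line_loading_pct": "run_power_flow"},
    judge_score=0.2,
)
cause = attribute_failure(trace)
print(cause)
```

Here the run is attributed to a parsing problem rather than to poor reasoning, even though the judge score is low, because the parse failure occurs earlier in the pipeline.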

3.4 Position in the landscape

In summary, GridArena does not compete with Grid-Agent [1], GridMind [2], GAIA [4], or PowerDAG [7]; it consumes them. Any agent exposing an inference endpoint can be registered as an engine in GridArena and put through the same physical-feasibility, counterfactual, perturbation, and judge pipeline. This makes the platform a natural complement to the EPRI [21] and PFBench [20] benchmarking efforts: where those provide static test sets, GridArena provides a runtime that turns evaluation into a reproducible experiment.
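The idea of registering agents as interchangeable engines can be sketched with a small registry. The engine signature (text in, text out), the function names, and the toy agents are hypothetical; the source does not specify GridArena's registration interface.

```python
from typing import Callable, Dict

# Hypothetical engine signature: observation text in, recommendation text out.
Engine = Callable[[str], str]

_ENGINES: Dict[str, Engine] = {}


def register_engine(name: str, engine: Engine) -> None:
    """Add an agent to the registry under a unique name."""
    if name in _ENGINES:
        raise ValueError(f"engine already registered: {name}")
    _ENGINES[name] = engine


def run_all(observation: str) -> Dict[str, str]:
    """Send the same observation to every registered engine."""
    return {name: engine(observation) for name, engine in sorted(_ENGINES.items())}


# Toy agents standing in for real inference endpoints.
register_engine("always_shed", lambda obs: "shed 10 MW at load1")
register_engine("echo", lambda obs: f"no action for: {obs}")

outputs = run_all("line1-2 loaded at 112%")
print(outputs)
```

Because every engine receives the same observation, their recommendations can be passed through an identical downstream evaluation pipeline and compared like for like.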

4. References

All 24 references below are dated 2025 or 2026 and have been verified to exist via web search at the time of writing. An arXiv identifier, DOI, or URL is provided for each entry.

[1] Y. Zhang, A. M. Saber, A. Youssef, and D. Kundur, "Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control," arXiv:2508.05702, Aug. 2025. arxiv.org/abs/2508.05702
[2] H. Jin, K. Kim, and J. Kwon, "GridMind: LLMs-Powered Agents for Power System Analysis and Operations," Argonne National Laboratory, arXiv:2509.02494, Sep. 2025. arxiv.org/abs/2509.02494
[3] M. Shamseldein, "Grid-Mind: An LLM-Orchestrated Multi-Fidelity Agent for Automated Connection Impact Assessment," arXiv:2602.20683, 2026. arxiv.org/abs/2602.20683
[4] Y. Cheng, H. Zhao, X. Zhou, J. Zhao, Y. Cao, C. Yang, and X. Cai, "A large language model for advanced power dispatch (GAIA)," Scientific Reports, vol. 15, art. 91940, 2025. doi.org/10.1038/s41598-025-91940-x
[5] "Grid CoPilot: A Large Language Model (LLM) Based Framework for Transforming Long-Term Planning Analyses," Preprints.org 202504.1464, Apr. 2025. preprints.org/manuscript/202504.1464
[6] R. Wu, J. Ai, and T. S. Bartels, "InstructMPC: A Human-LLM-in-the-Loop Framework for Context-Aware Power Grid Control," arXiv:2512.05876, Dec. 2025. arxiv.org/abs/2512.05876
[7] E. O. Badmus and A. Pandey, "PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis," arXiv:2603.17418, Mar. 2026. arxiv.org/abs/2603.17418
[8] M. Jia, Z. Cui, and G. Hug, "Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework," arXiv:2411.16707, May 2025. arxiv.org/abs/2411.16707
[9] Y. Hu, T. Zhao, and M. Yue, "From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework," arXiv:2508.08147, Aug. 2025. arxiv.org/abs/2508.08147
[10] X. Ren, C. S. Lai, G. Taylor, and Z. Guo, "Can Large Language Model Agents Balance Energy Systems?," arXiv:2502.10557, Feb. 2025. arxiv.org/abs/2502.10557
[11] X. Yang, C. Lin, Y. Yang, Q. Wang, H. Liu, H. Hua, and W. Wu, "Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems," arXiv:2507.21162, Jul. 2025. arxiv.org/abs/2507.21162
[12] X. Yang, C. Lin, X. Ma, D. Liu, R. Zheng, H. Liu, and W. Wu, "Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration," arXiv:2602.21715, Feb. 2026. arxiv.org/abs/2602.21715
[13] P. Christou, M. Z. Islam, Y. Lin, and J. Xiong, "LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration," arXiv:2501.14960, Jan. 2025. arxiv.org/abs/2501.14960
[14] S. Li, J. S. Kim, and C. Chen, "Behavioral Generative Agents for Power Dispatch and Auction," arXiv:2603.08477, 2026. arxiv.org/abs/2603.08477
[15] "Adaptive Solving Method for Power System Operation Based on Knowledge-Driven LLM Agents," MDPI Electronics, vol. 15, no. 2, art. 478, 2026. mdpi.com/2079-9292/15/2/478
[16] J. Shi, Y. Cheng, F. Zhang, M. Jiang, J. Lin, and Y. Shen (Huawei), "GridCodex: A RAG-Driven AI Framework for Power Grid Code Reasoning and Compliance," arXiv:2508.12682, Aug. 2025. arxiv.org/abs/2508.12682
[17] Y. Cheng, H. Zhao, D. Xiang, Z. Zhang, G. Liu, Y. Liu, J. Zhao, and X. Cai, "Power system operational reliability evaluation with retrieval-augmented generation enhanced large language model," Energy and AI, vol. 24, art. 100688, May 2026. doi.org/10.1016/j.egyai.2026.100688
[18] "Implementation of a Device Failure Query Model for the Virtual Power Plant Smart Operation and Maintenance Platform Based on Retrieval-Augmented Generation Technology," MDPI Electronics, vol. 14, no. 22, art. 4502, 2025. mdpi.com/2079-9292/14/22/4502
[19] C. Shen, Z. Guo, X. Wan, Z. Yang, Y. Zhang, W. Huang, J. Song, Z. Zhang, et al., "ProOPF: Benchmarking and Improving LLMs for Professional-Grade Power Systems Optimization Modeling," arXiv:2602.03070, Feb. 2026. arxiv.org/abs/2602.03070
[20] B. She, "Power-Flow Benchmark for LLM-based Power System Agent Evaluation (PFBench)," IEEE DataPort, DOI 10.21227/jnrm-q720, Mar. 2026. ieee-dataport.org / PFBench
[21] Electric Power Research Institute, "Benchmarking Large Language Models for the Electric Power Sector," EPRI Technical Report 3002034347 / EPRI Journal, Feb. 2026. eprijournal.com / EPRI LLM benchmark
[22] M. Sarwar, M. Rizwan, M. Aziz, and A. R. Sudais, "Large Language Models for Power System Applications: A Comprehensive Literature Survey," arXiv:2512.13004, Dec. 2025. arxiv.org/abs/2512.13004
[23] "Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector," arXiv:2509.07177, Sep. 2025. arxiv.org/abs/2509.07177
[24] T. Cui, Y. Ye, Y. Li, N. Du, X. Song, Y. Zhu, and X. Yang, "Toward profitable energy futures trading strategies using reinforcement learning incorporating disagreement and connectedness methods enabled by large language models," Energy and AI, vol. 21, art. 100562, 2025. doi.org/10.1016/j.egyai.2025.100562