LLM Tools for Power Systems — Landscape (2025–2026)
This page surveys 24 peer-reviewed and preprint works on large language model (LLM) tools for power-system operations published in 2025 and 2026, and clarifies how GridArena positions itself in this landscape. For each tool we list its purpose, required inputs, expected outputs, the type of decisions or recommendations it generates, and its operational scope and limitations. The comparison-at-a-glance table is constructed so that every row except the final GridArena row maps one-to-one to a numbered reference in the bibliography below.
1. Landscape of LLM tools (2025–2026)
The 2025–2026 literature on LLMs for power systems can be grouped into five overlapping families: (i) operational agents and co-pilots that take actions or recommend them; (ii) optimization- and solver-coupled LLMs that translate natural language into formal models; (iii) retrieval-augmented and knowledge-grounded systems for compliance and operations and maintenance; (iv) benchmarks and surveys; and (v) domain-specialized foundation models and adjacent finance/market work.
1.1 Operational agents and co-pilots
Grid-Agent [1] is a multi-agent LLM framework that coordinates distributed energy resources for grid control, with each agent specialized for a sub-task and a planner agent orchestrating the workflow. GridMind [2], from Argonne National Laboratory, exposes power-system analysis tools (power flow, contingency analysis) to an LLM so analysts can ask natural-language questions and receive structured numerical answers. The closely named Grid-Mind work [3] focuses on connection impact assessment, orchestrating multi-fidelity simulations from a single interconnection request expressed in natural language.
GAIA [4], published in Scientific Reports, fine-tunes an LLM for advanced power dispatch and couples it with classical dispatch routines, making it one of the first peer-reviewed demonstrations of an LLM acting in the dispatch loop. Grid CoPilot [5] targets long-term planning rather than real-time operations. InstructMPC [6] keeps a human in the loop and uses LLM-derived instructions to adapt model-predictive controllers to context. PowerDAG [7], a 2026 preprint, builds an agentic directed-acyclic-graph executor for distribution-grid analysis that emphasizes reliability.
Beyond these flagship systems, several specialized agents are worth noting. The feedback-driven multi-agent framework of Jia et al. [8] focuses on running and debugging power-system simulations under LLM control. Hu et al. [9] propose a validation-in-the-loop pipeline that converts natural-language descriptions into solver-ready optimization problems. Ren et al. [10] integrate LLM agents with a stochastic unit-commitment framework to handle wind uncertainty. Yang et al. provide two complementary contributions: an LLM-powered automated modeler for active-distribution-network dispatch [11], and an LLM-RL collaboration for two-stage voltage control [12]. LLM4DistReconfig [13] is a fine-tuned LLM for distribution-network reconfiguration. Behavioral generative agents for dispatch and auction [14] explore LLMs as bidders and operators in market settings, and the knowledge-driven adaptive method of [15] couples ontologies with LLM agents for operations.
1.2 Optimization and solver-coupled LLMs
A recurring pattern uses the LLM as a translator between human intent and formal mathematical programs. Hu et al. [9] add a validation-in-the-loop step so that solver feasibility feedback closes the loop on the LLM's output. LLM4DistReconfig [13] follows the same philosophy at the distribution level. The most rigorous evaluation of this pattern to date is ProOPF [19], a 2026 benchmark that measures LLMs on professional-grade optimal power flow modeling and provides automatic feasibility scoring of generated models.
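The validation-in-the-loop idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of Hu et al. [9]: `fake_llm` stands in for an LLM call, `check_feasibility` stands in for a solver, and the dictionary-based model format is hypothetical.

```python
from typing import Callable, Optional

# Hypothetical model format; a real system would emit a solver-ready program.
Model = dict  # e.g. {"gen_limit_mw": 80.0, "demand_mw": 100.0}


def check_feasibility(model: Model) -> Optional[str]:
    """Toy stand-in for a solver: return None if feasible, else a message."""
    if model["gen_limit_mw"] < model["demand_mw"]:
        return (f"infeasible: generation limit {model['gen_limit_mw']} MW "
                f"is below demand {model['demand_mw']} MW")
    return None


def validation_loop(propose_model: Callable[[str, Optional[str]], Model],
                    request: str, max_rounds: int = 3):
    """Request a model, validate it, and feed errors back until it passes."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        model = propose_model(request, feedback)
        feedback = check_feasibility(model)
        if feedback is None:
            return model, round_no
    raise RuntimeError(f"no feasible model after {max_rounds} rounds: {feedback}")


def fake_llm(request: str, feedback: Optional[str]) -> Model:
    """Stub 'LLM': the first draft is infeasible, the revision raises the limit."""
    if feedback is None:
        return {"gen_limit_mw": 80.0, "demand_mw": 100.0}
    return {"gen_limit_mw": 120.0, "demand_mw": 100.0}


model, rounds = validation_loop(fake_llm, "serve 100 MW of demand")
print(model, rounds)
```

The point of the loop is that the feasibility message, not a human, tells the model what to repair on the next round.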
1.3 Retrieval-augmented and knowledge-grounded systems
GridCodex [16], from Huawei, uses retrieval-augmented generation over grid codes for compliance reasoning. Cheng et al. [17] extend RAG to operational reliability evaluation, retrieving over historical events and standards before answering. The virtual-power-plant device failure query model in [18] applies RAG to operations and maintenance for fleets of distributed assets. These systems share an explicit limitation: they reason over text and do not, by themselves, check physical feasibility.
1.4 Benchmarks and surveys
Three benchmarks define the current measurement frontier. ProOPF [19] benchmarks LLMs on OPF modeling. PFBench [20] is a 2026 power-flow benchmark for LLM-based power-system agent evaluation, hosted on IEEE DataPort. The Electric Power Research Institute (EPRI) released the first electric-sector benchmarking results for public LLMs in early 2026 [21]. Alongside these, the comprehensive literature survey of Sarwar et al. [22] catalogs the field to date.
1.5 Domain-specialized models and adjacent work
EnergyGPT [23] is a foundation-style LLM specialized for the energy sector. On the market side, Cui et al. [24] use LLM-augmented reinforcement learning for energy futures trading. These works are not operational agents, but they shape the substrate (specialized models, market signals) on which future operational agents will be built.
2. Comparison at a glance
Each row except the last maps one-to-one to a numbered reference in Section 4. The final row introduces GridArena explicitly, framing it not as another operational agent but as an evaluation and benchmarking platform.
| Tool | Year | Ref | Inputs | Outputs / Decisions | Physical-feasibility check | Key limitations |
|---|---|---|---|---|---|---|
| Grid-Agent | 2025 | [1] | Grid topology, DER setpoints, NL operator queries | Multi-agent control actions; DER coordination | Implicit — relies on simulator wrapper | Coordination overhead; opaque inter-agent reasoning |
| GridMind (Argonne) | 2025 | [2] | NL questions over power-flow / contingency tools | Structured numerical answers from registered tools | Yes — calls deterministic analysis tools | Analyst Q&A scope, not closed-loop control |
| Grid-Mind (CIA) | 2026 | [3] | Interconnection request in natural language | Multi-fidelity connection impact assessment | Yes — orchestrates fidelity-tiered simulators | Single workflow (interconnection), narrow scope |
| GAIA | 2025 | [4] | Dispatch context, NL operator instructions | Dispatch decisions via LLM + classical routines | Yes — coupled dispatch solver | Fine-tuned on specific dispatch scope |
| Grid CoPilot | 2025 | [5] | Long-term planning datasets, scenario queries | Capacity-expansion / scenario navigation | Indirect — planning models, not real-time PF | Not for real-time operations |
| InstructMPC | 2025 | [6] | Operator instructions + MPC state | Context-adapted MPC control law | Via MPC controller | Requires human-in-the-loop |
| PowerDAG | 2026 | [7] | Distribution-grid analysis tasks | Reliability-focused agentic DAG execution | Yes — tool-grounded execution | Distribution scope; reliability of agent graph |
| Jia et al. (feedback MA) | 2025 | [8] | Simulation specs, debug feedback | Working power-system simulations | Yes — simulator-in-the-loop | Focuses on simulation building, not control |
| Hu et al. (NL→solver) | 2025 | [9] | NL optimization problem statement | Solver-ready optimization model | Yes — validation-in-the-loop with solver | Modeling assistant, not operational agent |
| Ren et al. (SUC) | 2025 | [10] | Wind/load scenarios, system data | LLM-orchestrated stochastic unit commitment | Yes — UC solver | Scoped to UC under wind uncertainty |
| Yang et al. (ADN modeler) | 2025 | [11] | ADN dispatch problem in NL | Auto-built ADN dispatch model + solution | Yes — solver-coupled | Active-distribution-network scope |
| Yang et al. (LLM-RL voltage) | 2026 | [12] | ADN voltage state, control objectives | Two-stage voltage control actions | Yes — RL environment grounded in PF | Voltage-control task only |
| LLM4DistReconfig | 2025 | [13] | Distribution topology, reconfiguration query | Switching reconfiguration plan | Yes — feasibility filtering | Single task; needs fine-tuning per network |
| Behavioral generative agents | 2026 | [14] | Market state, agent personas | Bidding / dispatch behavior in markets | Indirect — market simulator | Behavioral study, not operations control |
| Knowledge-driven LLM agents | 2026 | [15] | Ontology + operations queries | Adaptive operating recommendations | Partial — ontology-grounded | Heavily dependent on KB quality |
| GridCodex (RAG) | 2025 | [16] | Grid code corpus + compliance question | Cited compliance reasoning | No — text-only | Cannot detect physical infeasibility |
| Cheng et al. (RAG reliability) | 2026 | [17] | Historical events + standards corpus | Reliability evaluation answers | No — text-only | Cannot detect physical infeasibility |
| VPP O&M RAG | 2025 | [18] | Device failure logs, manuals | Failure-query answers for VPP O&M | No — text-only | Maintenance scope; no physics |
| ProOPF (benchmark) | 2026 | [19] | OPF problem descriptions | LLM-generated optimization models | Solver feasibility check on generated models | Scope limited to optimization modeling |
| PFBench (benchmark) | 2026 | [20] | Power-flow tasks for agent evaluation | Pass/fail and accuracy metrics | Power-flow solver as ground truth | Single task family (power flow) |
| EPRI benchmarking | 2026 | [21] | Electric-sector LLM evaluation suite | Public LLM benchmark results for the sector | Static test suite | Static; not a runtime evaluation harness |
| Sarwar et al. (survey) | 2025 | [22] | Literature on LLMs in power systems | Comprehensive survey | N/A | Survey, not a tool |
| EnergyGPT | 2025 | [23] | Energy-sector pre-training corpus | Domain-specialized LLM | N/A — base model | Foundation model, not an agent |
| Cui et al. (LLM-RL trading) | 2025 | [24] | Energy futures market signals | RL trading strategies augmented by LLM | N/A — market scope | Trading, not operations |
| GridArena (this work) | 2026 | — | Any LLM agent + benchmark case (case5/14/30, CIGRE/IEEE), counterfactual & perturbation jobs, judge prompts | Audit report: feasibility, robustness, decision-trace and parser provenance, LLM-as-judge scores per action | Yes — PyPSA AC/DC power-flow loop on every action | Evaluation layer, not an operational agent — depends on quality of probes & judges |
Reading the table: rows [1]–[15] are operational or analytical LLM agents that produce control or analysis outputs. Rows [16]–[18] (RAG systems) reason over textual knowledge bases. Rows [19]–[21] are benchmarks and [22] is a survey. Rows [23] and [24] cover a domain-specialized foundation model and a market-trading study, which are adjacent work rather than operational agents. The last row, GridArena, is an evaluation platform that wraps any of the above and audits them on physical feasibility (via a PyPSA AC/DC power-flow loop), robustness (via counterfactual probes and perturbation jobs), and reasoning integrity (decision trace, parser provenance, and LLM-as-judge scoring).
3. How GridArena fits in
Most tools above are agents that act on the grid. GridArena is deliberately one layer above: it is an evaluation and benchmarking harness for those agents. This positioning addresses three concrete questions — physical feasibility, robustness under changing conditions, and failure modes in reasoning or tool use.
3.1 Physical feasibility
Every agent action that GridArena observes is replayed through a PyPSA AC/DC power-flow simulator. The result — converged or not, line/voltage limits respected or violated — is attached to the action as ground truth. This addresses the concern that an LLM may produce plausible-sounding but physically infeasible recommendations, a failure mode that pure text-based RAG systems [16, 17, 18] cannot detect by themselves.
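As a rough illustration of how a power-flow outcome could be turned into a ground-truth label, consider the sketch below. Everything in it is an assumption made for illustration: the `PowerFlowResult` container, the field names, and the default limits (0.95 to 1.05 p.u. voltage, 100% line loading) are hypothetical, and the PyPSA call that would produce the result is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class PowerFlowResult:
    """Hypothetical container for what a power-flow run reports."""
    converged: bool
    bus_voltage_pu: dict = field(default_factory=dict)    # bus -> |V| in p.u.
    line_loading_pct: dict = field(default_factory=dict)  # line -> % of rating


def label_action(result: PowerFlowResult,
                 v_min: float = 0.95, v_max: float = 1.05,
                 max_loading_pct: float = 100.0) -> dict:
    """Return a feasibility verdict plus the list of violated limits."""
    if not result.converged:
        return {"feasible": False, "violations": ["power flow did not converge"]}
    violations = []
    for bus, v in sorted(result.bus_voltage_pu.items()):
        if not v_min <= v <= v_max:
            violations.append(f"voltage at {bus}: {v:.3f} p.u.")
    for line, pct in sorted(result.line_loading_pct.items()):
        if pct > max_loading_pct:
            violations.append(f"loading on {line}: {pct:.1f}%")
    return {"feasible": not violations, "violations": violations}


ok = label_action(PowerFlowResult(True, {"b1": 1.00, "b2": 0.98}, {"l1": 72.0}))
bad = label_action(PowerFlowResult(True, {"b1": 1.00, "b2": 0.93}, {"l1": 112.5}))
print(ok)
print(bad)
```

Keeping the verdict separate from the list of violations lets a report say not only that an action failed but which limit it broke.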
3.2 Robustness under changing conditions
GridArena runs counterfactual probes and perturbation jobs around each baseline run: load is scaled, generators are tripped, lines are removed, and the agent is re-queried. The platform records whether the agent's recommendation degrades gracefully, switches modes appropriately, or breaks. This complements benchmarks like PFBench [20] and ProOPF [19], which evaluate single-shot accuracy rather than behavior under stress.
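The three perturbation types named above (load scaling, generator trips, line removals) can be generated mechanically from a baseline case. The sketch below assumes a hypothetical dictionary-based case format and job-naming scheme; it is not GridArena's actual job specification.

```python
import copy


def perturbation_jobs(case: dict, load_scales=(0.8, 1.2)) -> list:
    """Build (job_name, perturbed_case) pairs around a baseline case."""
    jobs = []
    # Load scaling: multiply every load by each scale factor.
    for scale in load_scales:
        variant = copy.deepcopy(case)
        variant["loads_mw"] = {bus: mw * scale
                               for bus, mw in case["loads_mw"].items()}
        jobs.append((f"load_x{scale}", variant))
    # Single-element outages: trip one generator or remove one line at a time.
    for key, prefix in (("generators", "trip"), ("lines", "remove")):
        for element in case[key]:
            variant = copy.deepcopy(case)
            variant[key] = [e for e in case[key] if e != element]
            jobs.append((f"{prefix}_{element}", variant))
    return jobs


# Hypothetical three-bus baseline used only for illustration.
baseline = {
    "loads_mw": {"b2": 50.0, "b3": 30.0},
    "generators": ["g1", "g2"],
    "lines": ["l12", "l13", "l23"],
}
jobs = perturbation_jobs(baseline)
for name, _ in jobs:
    print(name)
```

Each job works on a deep copy, so the baseline stays untouched and every perturbed run can be compared against the same reference.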
3.3 Failure modes in reasoning and tool use
GridArena records a full decision trace for every run: the prompt log, the tool calls, the parser provenance for every structured field, and an LLM-as-judge scoring of the final recommendation against domain rubrics. Together these records let researchers attribute a failure to a specific cause (wrong tool selection, misparsed solver output, or flawed final reasoning) rather than reporting a single opaque score.
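One way to picture failure attribution over such a trace is the sketch below. The `ToolCall` and `DecisionTrace` structures, the `expected` field, the 0-to-1 judge scale, and the 0.7 pass threshold are all hypothetical choices made for illustration, not GridArena's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    name: str                       # tool the agent actually called
    expected: Optional[str] = None  # tool a reference run used, if known
    parsed_ok: bool = True          # did the parser extract every field?


@dataclass
class DecisionTrace:
    prompt_log: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    judge_score: float = 0.0        # assumed LLM-as-judge score in [0, 1]


def attribute_failure(trace: DecisionTrace, pass_score: float = 0.7) -> Optional[str]:
    """Return the earliest plausible cause of failure, or None if the run passes."""
    for call in trace.tool_calls:
        if call.expected is not None and call.name != call.expected:
            return "wrong tool selection"
    for call in trace.tool_calls:
        if not call.parsed_ok:
            return "misparsed solver output"
    if trace.judge_score < pass_score:
        return "flawed final reasoning"
    return None


trace = DecisionTrace(
    prompt_log=["Relieve the overload on line l12."],
    tool_calls=[ToolCall("run_power_flow", expected="run_power_flow", parsed_ok=False)],
    judge_score=0.9,
)
print(attribute_failure(trace))
```

The checks run in pipeline order, so an upstream problem such as a wrong tool is reported before a downstream symptom such as a low judge score.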
3.4 Position in the landscape
In summary, GridArena does not compete with Grid-Agent [1], GridMind [2], GAIA [4], or PowerDAG [7]; it consumes them. Any agent exposing an inference endpoint can be registered as an engine in GridArena and put through the same physical-feasibility, counterfactual, perturbation, and judge pipeline. This makes the platform a natural complement to the EPRI [21] and PFBench [20] benchmarking efforts: where those provide static test sets, GridArena provides a runtime that turns evaluation into a reproducible experiment.
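The "register an engine, then run a shared pipeline" idea can be sketched as follows. The names `register_engine` and `evaluate`, the callable signatures, and the toy stage checks are hypothetical and do not describe GridArena's actual API; a real engine would wrap a call to the agent's inference endpoint.

```python
from typing import Callable, Dict

Engine = Callable[[str], str]   # prompt in, recommendation out
Stage = Callable[[str], bool]   # recommendation in, pass/fail out

_ENGINES: Dict[str, Engine] = {}


def register_engine(name: str, infer: Engine) -> None:
    """Add an agent to the registry under a unique name."""
    if name in _ENGINES:
        raise ValueError(f"engine {name!r} is already registered")
    _ENGINES[name] = infer


def evaluate(name: str, prompt: str, stages: Dict[str, Stage]) -> Dict[str, bool]:
    """Query one engine once and run every stage on its recommendation."""
    recommendation = _ENGINES[name](prompt)
    return {stage: check(recommendation) for stage, check in stages.items()}


# Stub agent standing in for a remote inference endpoint.
register_engine("echo-agent", lambda prompt: f"shed 10 MW at bus 3 ({prompt})")

report = evaluate(
    "echo-agent",
    "line l12 overloaded",
    {
        "non_empty": lambda r: bool(r.strip()),
        "mentions_bus": lambda r: "bus" in r,
    },
)
print(report)
```

Because every engine is reduced to the same callable shape, the same stages can be applied to any registered agent without agent-specific code.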
4. References
All 24 references below are dated 2025 or 2026 and have been verified to exist via web search at the time of writing. Each entry carries an arXiv identifier, a DOI, or a URL.
[1] Y. Zhang, A. M. Saber, A. Youssef, and D. Kundur, "Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control," arXiv:2508.05702, Aug. 2025. arxiv.org/abs/2508.05702
[2] H. Jin, K. Kim, and J. Kwon, "GridMind: LLMs-Powered Agents for Power System Analysis and Operations," Argonne National Laboratory, arXiv:2509.02494, Sep. 2025. arxiv.org/abs/2509.02494
[3] M. Shamseldein, "Grid-Mind: An LLM-Orchestrated Multi-Fidelity Agent for Automated Connection Impact Assessment," arXiv:2602.20683, 2026. arxiv.org/abs/2602.20683
[4] Y. Cheng, H. Zhao, X. Zhou, J. Zhao, Y. Cao, C. Yang, and X. Cai, "A large language model for advanced power dispatch (GAIA)," Scientific Reports, vol. 15, art. 91940, 2025. doi.org/10.1038/s41598-025-91940-x
[5] "Grid CoPilot: A Large Language Model (LLM) Based Framework for Transforming Long-Term Planning Analyses," Preprints.org 202504.1464, Apr. 2025. preprints.org/manuscript/202504.1464
[6] R. Wu, J. Ai, and T. S. Bartels, "InstructMPC: A Human-LLM-in-the-Loop Framework for Context-Aware Power Grid Control," arXiv:2512.05876, Dec. 2025. arxiv.org/abs/2512.05876
[7] E. O. Badmus and A. Pandey, "PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis," arXiv:2603.17418, Mar. 2026. arxiv.org/abs/2603.17418
[8] M. Jia, Z. Cui, and G. Hug, "Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework," arXiv:2411.16707, May 2025. arxiv.org/abs/2411.16707
[9] Y. Hu, T. Zhao, and M. Yue, "From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework," arXiv:2508.08147, Aug. 2025. arxiv.org/abs/2508.08147
[10] X. Ren, C. S. Lai, G. Taylor, and Z. Guo, "Can Large Language Model Agents Balance Energy Systems?," arXiv:2502.10557, Feb. 2025. arxiv.org/abs/2502.10557
[11] X. Yang, C. Lin, Y. Yang, Q. Wang, H. Liu, H. Hua, and W. Wu, "Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems," arXiv:2507.21162, Jul. 2025. arxiv.org/abs/2507.21162
[12] X. Yang, C. Lin, X. Ma, D. Liu, R. Zheng, H. Liu, and W. Wu, "Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration," arXiv:2602.21715, Feb. 2026. arxiv.org/abs/2602.21715
[13] P. Christou, M. Z. Islam, Y. Lin, and J. Xiong, "LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration," arXiv:2501.14960, Jan. 2025. arxiv.org/abs/2501.14960
[14] S. Li, J. S. Kim, and C. Chen, "Behavioral Generative Agents for Power Dispatch and Auction," arXiv:2603.08477, 2026. arxiv.org/abs/2603.08477
[15] "Adaptive Solving Method for Power System Operation Based on Knowledge-Driven LLM Agents," MDPI Electronics, vol. 15, no. 2, art. 478, 2026. mdpi.com/2079-9292/15/2/478
[16] J. Shi, Y. Cheng, F. Zhang, M. Jiang, J. Lin, and Y. Shen (Huawei), "GridCodex: A RAG-Driven AI Framework for Power Grid Code Reasoning and Compliance," arXiv:2508.12682, Aug. 2025. arxiv.org/abs/2508.12682
[17] Y. Cheng, H. Zhao, D. Xiang, Z. Zhang, G. Liu, Y. Liu, J. Zhao, and X. Cai, "Power system operational reliability evaluation with retrieval-augmented generation enhanced large language model," Energy and AI, vol. 24, art. 100688, May 2026. doi.org/10.1016/j.egyai.2026.100688
[18] "Implementation of a Device Failure Query Model for the Virtual Power Plant Smart Operation and Maintenance Platform Based on Retrieval-Augmented Generation Technology," MDPI Electronics, vol. 14, no. 22, art. 4502, 2025. mdpi.com/2079-9292/14/22/4502
[19] C. Shen, Z. Guo, X. Wan, Z. Yang, Y. Zhang, W. Huang, J. Song, Z. Zhang, et al., "ProOPF: Benchmarking and Improving LLMs for Professional-Grade Power Systems Optimization Modeling," arXiv:2602.03070, Feb. 2026. arxiv.org/abs/2602.03070
[20] B. She, "Power-Flow Benchmark for LLM-based Power System Agent Evaluation (PFBench)," IEEE DataPort, DOI 10.21227/jnrm-q720, Mar. 2026. ieee-dataport.org / PFBench
[21] Electric Power Research Institute, "Benchmarking Large Language Models for the Electric Power Sector," EPRI Technical Report 3002034347 / EPRI Journal, Feb. 2026. eprijournal.com / EPRI LLM benchmark
[22] M. Sarwar, M. Rizwan, M. Aziz, and A. R. Sudais, "Large Language Models for Power System Applications: A Comprehensive Literature Survey," arXiv:2512.13004, Dec. 2025. arxiv.org/abs/2512.13004
[23] "Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector," arXiv:2509.07177, Sep. 2025. arxiv.org/abs/2509.07177
[24] T. Cui, Y. Ye, Y. Li, N. Du, X. Song, Y. Zhu, and X. Yang, "Toward profitable energy futures trading strategies using reinforcement learning incorporating disagreement and connectedness methods enabled by large language models," Energy and AI, vol. 21, art. 100562, 2025. doi.org/10.1016/j.egyai.2025.100562