LLM Tools for Power Systems — Landscape (2025–2026)

This page surveys 24 works (peer-reviewed papers, preprints, a benchmark dataset, and an industry report) on large language model (LLM) tools for power-system operations published in 2025 and 2026, and clarifies how GridArena positions itself in this landscape. For each tool we list its purpose, required inputs, expected outputs, the type of decisions or recommendations it generates, and its operational scope and limitations. In the comparison-at-a-glance table, every row except the final GridArena row maps one-to-one to a numbered reference in the bibliography below.

1. Landscape of LLM tools (2025–2026)

The 2025–2026 literature on LLMs for power systems can be grouped into five overlapping families: (i) operational agents and co-pilots that take actions or recommend them; (ii) optimization- and solver-coupled LLMs that translate natural language into formal models; (iii) retrieval-augmented and knowledge-grounded systems for compliance and operations and maintenance; (iv) benchmarks and surveys; and (v) domain-specialized foundation models and adjacent finance/market work.

1.1 Operational agents and co-pilots

Grid-Agent [1] is a multi-agent LLM framework that coordinates distributed energy resources for grid control, with each agent specialized for a sub-task and a planner agent orchestrating the workflow. GridMind [2], from Argonne National Laboratory, exposes power-system analysis tools (power flow, contingency analysis) to an LLM so analysts can ask natural-language questions and receive structured numerical answers. The closely named Grid-Mind work [3] focuses on connection impact assessment, orchestrating multi-fidelity simulations from a single interconnection request expressed in natural language.

GAIA [4], published in Scientific Reports, fine-tunes an LLM for advanced power dispatch and couples it with classical dispatch routines, making it one of the first peer-reviewed demonstrations of an LLM acting in the dispatch loop. Grid CoPilot [5] targets long-term planning rather than real-time operations. InstructMPC [6] keeps a human in the loop and uses LLM-derived instructions to adapt model-predictive controllers to context. PowerDAG [7], a 2026 preprint, builds an agentic directed-acyclic-graph executor for distribution-grid analysis that emphasizes reliability.

Beyond these flagship systems, several specialized agents are worth noting. The feedback-driven multi-agent framework of Jia et al. [8] focuses on running and debugging power-system simulations under LLM control. Hu et al. [9] propose a validation-in-the-loop pipeline that converts natural-language descriptions into solver-ready optimization problems. Ren et al. [10] integrate LLM agents with a stochastic unit-commitment framework to handle wind uncertainty. Yang et al. provide two complementary contributions: an LLM-powered automated modeler for active-distribution-network dispatch [11], and an LLM-RL collaboration for two-stage voltage control [12]. LLM4DistReconfig [13] is a fine-tuned LLM for distribution-network reconfiguration. Behavioral generative agents for dispatch and auction [14] explore LLMs as bidders and operators in market settings, and the knowledge-driven adaptive method of [15] couples ontologies with LLM agents for operations.

1.2 Optimization and solver-coupled LLMs

A recurring pattern uses the LLM as a translator between human intent and formal mathematical programs. Hu et al. [9] add a validation-in-the-loop step so that solver feasibility feedback closes the loop on the LLM's output. LLM4DistReconfig [13] follows the same philosophy at the distribution level. The most rigorous evaluation of this pattern to date is ProOPF [19], a 2026 benchmark that measures LLMs on professional-grade optimal power flow modeling and provides automatic feasibility scoring of generated models.
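
The translate-then-validate pattern described above can be sketched as a short feedback loop. This is a minimal illustration, not the method of [9] or [13]: `validation_in_the_loop`, `toy_propose`, and `toy_validate` are hypothetical names, the "LLM" is a hard-coded stand-in, and the "solver" is a two-line capacity check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidationReport:
    feasible: bool
    message: str  # fed back to the proposer when the candidate is rejected


def validation_in_the_loop(
    request: str,
    propose: Callable[[str, Optional[str]], dict],
    validate: Callable[[dict], ValidationReport],
    max_rounds: int = 3,
) -> Tuple[Optional[dict], List[str]]:
    """Request a model, validate it, and return failures as feedback until it passes."""
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(request, feedback)
        report = validate(candidate)
        history.append(report.message)
        if report.feasible:
            return candidate, history
        feedback = report.message
    return None, history  # no feasible model within the round budget


def toy_propose(request: str, feedback: Optional[str]) -> dict:
    # Stand-in for an LLM call (assumption): the first draft omits the
    # generator limit, and the draft produced after feedback includes it.
    if feedback is None:
        return {"demand_mw": 120.0, "gen_max_mw": None}
    return {"demand_mw": 120.0, "gen_max_mw": 150.0}


def toy_validate(model: dict) -> ValidationReport:
    # Stand-in for a solver feasibility check (assumption).
    if model["gen_max_mw"] is None:
        return ValidationReport(False, "missing generator upper bound")
    if model["gen_max_mw"] < model["demand_mw"]:
        return ValidationReport(False, "demand exceeds generation capacity")
    return ValidationReport(True, "ok")


model, history = validation_in_the_loop(
    "dispatch one generator to meet 120 MW", toy_propose, toy_validate
)
print(model, history)
```

The point of the loop is that the validator's message, not a human, tells the proposer what to repair, and the round budget bounds how long an unrepairable request can cycle.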

1.3 Retrieval-augmented and knowledge-grounded systems

GridCodex [16], from Huawei, uses retrieval-augmented generation over grid codes for compliance reasoning. Cheng et al. [17] extend RAG to operational reliability evaluation, retrieving over historical events and standards before answering. The virtual-power-plant device failure query model in [18] applies RAG to operations and maintenance for fleets of distributed assets. These systems share an explicit limitation: they reason over text and do not, by themselves, check physical feasibility.

1.4 Benchmarks and surveys

Three works define the current measurement frontier. ProOPF [19] benchmarks LLMs on OPF modeling. PFBench [20] is a 2026 power-flow benchmark for LLM-based power-system agent evaluation, hosted on IEEE DataPort. The Electric Power Research Institute (EPRI) released the first electric-sector benchmarking results for public LLMs in early 2026 [21], and the comprehensive literature survey of Sarwar et al. [22] catalogs the field to date.

1.5 Domain-specialized models and adjacent work

EnergyGPT [23] is a foundation-style LLM specialized for the energy sector. On the market side, Cui et al. [24] use LLM-augmented reinforcement learning for energy futures trading. These works are not operational agents, but they shape the substrate (specialized models, market signals) on which future operational agents will be built.

2. Comparison at a glance

Each row except the last maps one-to-one to a numbered reference in Section 4. The final row introduces GridArena explicitly, framing it not as another operational agent but as an evaluation and benchmarking platform.

| Tool | Year | Ref | Inputs | Outputs / Decisions | Physical-feasibility check | Key limitations |
|---|---|---|---|---|---|---|
| Grid-Agent | 2025 | [1] | Grid topology, DER setpoints, NL operator queries | Multi-agent control actions; DER coordination | Implicit: relies on simulator wrapper | Coordination overhead; opaque inter-agent reasoning |
| GridMind (Argonne) | 2025 | [2] | NL questions over power-flow / contingency tools | Structured numerical answers from registered tools | Yes: calls deterministic analysis tools | Analyst Q&A scope, not closed-loop control |
| Grid-Mind (CIA) | 2026 | [3] | Interconnection request in natural language | Multi-fidelity connection impact assessment | Yes: orchestrates fidelity-tiered simulators | Single workflow (interconnection), narrow scope |
| GAIA | 2025 | [4] | Dispatch context, NL operator instructions | Dispatch decisions via LLM + classical routines | Yes: coupled dispatch solver | Fine-tuned on specific dispatch scope |
| Grid CoPilot | 2025 | [5] | Long-term planning datasets, scenario queries | Capacity-expansion / scenario navigation | Indirect: planning models, not real-time PF | Not for real-time operations |
| InstructMPC | 2025 | [6] | Operator instructions + MPC state | Context-adapted MPC control law | Via MPC controller | Requires human-in-the-loop |
| PowerDAG | 2026 | [7] | Distribution-grid analysis tasks | Reliability-focused agentic DAG execution | Yes: tool-grounded execution | Distribution scope; reliability of agent graph |
| Jia et al. (feedback MA) | 2025 | [8] | Simulation specs, debug feedback | Working power-system simulations | Yes: simulator-in-the-loop | Focuses on simulation building, not control |
| Hu et al. (NL→solver) | 2025 | [9] | NL optimization problem statement | Solver-ready optimization model | Yes: validation-in-the-loop with solver | Modeling assistant, not operational agent |
| Ren et al. (SUC) | 2025 | [10] | Wind/load scenarios, system data | LLM-orchestrated stochastic unit commitment | Yes: UC solver | Scoped to UC under wind uncertainty |
| Yang et al. (ADN modeler) | 2025 | [11] | ADN dispatch problem in NL | Auto-built ADN dispatch model + solution | Yes: solver-coupled | Active-distribution-network scope |
| Yang et al. (LLM-RL voltage) | 2026 | [12] | ADN voltage state, control objectives | Two-stage voltage control actions | Yes: RL environment grounded in PF | Voltage-control task only |
| LLM4DistReconfig | 2025 | [13] | Distribution topology, reconfiguration query | Switching reconfiguration plan | Yes: feasibility filtering | Single task; needs fine-tuning per network |
| Behavioral generative agents | 2026 | [14] | Market state, agent personas | Bidding / dispatch behavior in markets | Indirect: market simulator | Behavioral study, not operations control |
| Knowledge-driven LLM agents | 2026 | [15] | Ontology + operations queries | Adaptive operating recommendations | Partial: ontology-grounded | Heavily dependent on KB quality |
| GridCodex (RAG) | 2025 | [16] | Grid code corpus + compliance question | Cited compliance reasoning | No: text-only | Cannot detect physical infeasibility |
| Cheng et al. (RAG reliability) | 2026 | [17] | Historical events + standards corpus | Reliability evaluation answers | No: text-only | Same as above |
| VPP O&M RAG | 2025 | [18] | Device failure logs, manuals | Failure-query answers for VPP O&M | No: text-only | Maintenance scope; no physics |
| ProOPF (benchmark) | 2026 | [19] | OPF problem descriptions | LLM-generated optimization models | Solver feasibility check on generated models | Scope limited to optimization modeling |
| PFBench (benchmark) | 2026 | [20] | Power-flow tasks for agent evaluation | Pass/fail and accuracy metrics | Power-flow solver as ground truth | Single task family (power flow) |
| EPRI benchmarking | 2026 | [21] | Electric-sector LLM evaluation suite | Public LLM benchmark results for the sector | Static test suite | Static; not a runtime evaluation harness |
| Sarwar et al. (survey) | 2025 | [22] | Literature on LLMs in power systems | Comprehensive survey | N/A | Survey, not a tool |
| EnergyGPT | 2025 | [23] | Energy-sector pre-training corpus | Domain-specialized LLM | N/A: base model | Foundation model, not an agent |
| Cui et al. (LLM-RL trading) | 2025 | [24] | Energy futures market signals | RL trading strategies augmented by LLM | N/A: market scope | Trading, not operations |
| GridArena (this work) | 2026 | - | Any LLM agent + benchmark case (case5/14/30, CIGRE/IEEE), counterfactual & perturbation jobs, judge prompts | Audit report: feasibility, robustness, decision-trace and parser provenance, LLM-as-judge scores per action | Yes: PyPSA AC/DC power-flow loop on every action | Evaluation layer, not an operational agent; depends on quality of probes & judges |

Reading the table: rows [1]–[15] are operational or analytical LLM agents that produce control or analysis outputs. Rows [16]–[18] (RAG systems) reason over textual knowledge bases. Rows [19]–[21] are benchmarks, [22] is a survey, [23] is a domain-specialized foundation model, and [24] is adjacent market work. The last row, GridArena, is an evaluation platform that wraps any of the above and audits them on physical feasibility (via a PyPSA AC/DC power-flow loop), robustness (via counterfactual probes and perturbation jobs), and reasoning integrity (decision trace, parser provenance, and LLM-as-judge scoring).

3. How GridArena fits in

Most tools above are agents that act on the grid. GridArena is deliberately one layer above: it is an evaluation and benchmarking harness for those agents. This positioning addresses three concrete questions: whether an agent's actions are physically feasible, how its behavior holds up under changing conditions, and where its reasoning or tool use fails.

3.1 Physical feasibility

Every agent action that GridArena observes is replayed through a PyPSA AC/DC power-flow simulator. The result — converged or not, line/voltage limits respected or violated — is attached to the action as ground truth. This addresses the concern that an LLM may produce plausible-sounding but physically infeasible recommendations, a failure mode that pure text-based RAG systems [16, 17, 18] cannot detect by themselves.
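
To make the "ground truth attached to the action" idea concrete, here is a minimal sketch of a feasibility verdict built from power-flow results. It is not GridArena's actual schema: `FeasibilityVerdict` and `check_feasibility` are hypothetical names, the 0.95–1.05 p.u. voltage band and 100 % loading limit are assumed defaults, and the power-flow results are passed in as plain dictionaries rather than read from a PyPSA network.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeasibilityVerdict:
    converged: bool
    voltage_violations: List[str] = field(default_factory=list)
    line_violations: List[str] = field(default_factory=list)

    @property
    def feasible(self) -> bool:
        # Feasible only if the power flow converged and no limit is violated.
        return (
            self.converged
            and not self.voltage_violations
            and not self.line_violations
        )


def check_feasibility(
    converged: bool,
    bus_vm_pu: Dict[str, float],         # bus name -> voltage magnitude (p.u.)
    line_loading_pct: Dict[str, float],  # line name -> loading (% of rating)
    v_min: float = 0.95,                 # assumed limit, not from the source
    v_max: float = 1.05,                 # assumed limit, not from the source
    max_loading_pct: float = 100.0,      # assumed limit, not from the source
) -> FeasibilityVerdict:
    """Turn raw power-flow results into a verdict that can be attached to an action."""
    if not converged:
        # Without a converged solution the voltages and flows are meaningless.
        return FeasibilityVerdict(converged=False)
    bad_buses = sorted(b for b, v in bus_vm_pu.items() if not v_min <= v <= v_max)
    bad_lines = sorted(
        name for name, pct in line_loading_pct.items() if pct > max_loading_pct
    )
    return FeasibilityVerdict(True, bad_buses, bad_lines)


verdict = check_feasibility(
    converged=True,
    bus_vm_pu={"bus1": 1.00, "bus2": 0.93},
    line_loading_pct={"line12": 104.2, "line23": 61.0},
)
print(verdict.feasible, verdict.voltage_violations, verdict.line_violations)
```

Keeping non-convergence separate from limit violations matters for the audit: the first says the action has no valid operating point at all, the second says it has one that breaks a constraint.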

3.2 Robustness under changing conditions

GridArena runs counterfactual probes and perturbation jobs around each baseline run: load is scaled, generators are tripped, lines are removed, and the agent is re-queried. The platform records whether the agent's recommendation degrades gracefully, switches modes appropriately, or breaks. This complements benchmarks like PFBench [20] and ProOPF [19], which evaluate single-shot accuracy rather than behavior under stress.
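
The perturbation jobs described above (load scaling, generator trips, line removals) can be enumerated mechanically from a baseline case. The sketch below assumes a deliberately simplified case format, a dictionary of loads, generators, and lines, and hypothetical names (`Perturbation`, `enumerate_perturbations`, `apply_perturbation`); it does not reflect GridArena's internal job format.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List, Sequence

Case = Dict[str, Dict[str, float]]  # simplified stand-in for a benchmark case


@dataclass(frozen=True)
class Perturbation:
    kind: str           # "scale_load", "trip_generator", or "remove_line"
    target: str         # component name, or "all" for load scaling
    value: float = 0.0  # scaling factor; unused for trips and removals


def enumerate_perturbations(
    case: Case, load_scales: Sequence[float] = (0.5, 1.5)
) -> List[Perturbation]:
    """List one job per load scale, per generator trip, and per line removal."""
    jobs = [Perturbation("scale_load", "all", s) for s in load_scales]
    jobs += [Perturbation("trip_generator", g) for g in sorted(case["generators"])]
    jobs += [Perturbation("remove_line", name) for name in sorted(case["lines"])]
    return jobs


def apply_perturbation(case: Case, p: Perturbation) -> Case:
    """Return a perturbed copy; the baseline case is left untouched."""
    perturbed = copy.deepcopy(case)
    if p.kind == "scale_load":
        perturbed["loads"] = {k: v * p.value for k, v in perturbed["loads"].items()}
    elif p.kind == "trip_generator":
        del perturbed["generators"][p.target]
    elif p.kind == "remove_line":
        del perturbed["lines"][p.target]
    else:
        raise ValueError(f"unknown perturbation kind: {p.kind}")
    return perturbed


baseline: Case = {
    "loads": {"L1": 100.0},
    "generators": {"G1": 150.0, "G2": 80.0},
    "lines": {"A-B": 120.0},
}
jobs = enumerate_perturbations(baseline)
stressed = apply_perturbation(baseline, jobs[1])  # load scaled by 1.5
print(len(jobs), stressed["loads"])
```

Each perturbed case would then be handed to the agent and to the power-flow check, so that the baseline and stressed recommendations can be compared side by side.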

3.3 Failure modes in reasoning and tool use

GridArena records a full decision trace for every run: the prompt log, the tool calls, the parser provenance for every structured field, and an LLM-as-judge scoring of the final recommendation against domain rubrics. Together these records let researchers attribute failures to a specific cause (wrong tool selection, misparsed solver output, or flawed final reasoning) rather than reporting a single opaque score.
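
One way to picture failure attribution from a decision trace is a small record type plus a rule that checks the three causes in order. Everything here is illustrative: `ToolCall`, `DecisionTrace`, and `attribute_failure` are hypothetical names, the `expected_name` field assumes a reference tool choice is available for comparison, and the 0.5 judge threshold is arbitrary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolCall:
    name: str           # tool the agent actually called
    expected_name: str  # tool a reference run would have called (assumption)
    raw_output: str     # unparsed tool or solver output, kept for provenance
    parsed_ok: bool     # whether the structured fields parsed cleanly


@dataclass
class DecisionTrace:
    prompt_log: List[str]
    tool_calls: List[ToolCall]
    judge_score: float  # LLM-as-judge score, assumed to lie in [0, 1]


def attribute_failure(
    trace: DecisionTrace, pass_threshold: float = 0.5
) -> Optional[str]:
    """Return the earliest applicable failure cause, or None if the run passes."""
    if any(c.name != c.expected_name for c in trace.tool_calls):
        return "wrong_tool_selection"
    if any(not c.parsed_ok for c in trace.tool_calls):
        return "misparsed_solver_output"
    if trace.judge_score < pass_threshold:
        return "flawed_final_reasoning"
    return None


trace = DecisionTrace(
    prompt_log=["Is line 2-3 overloaded after the trip?"],
    tool_calls=[ToolCall("run_power_flow", "run_power_flow", "loading=1O4%", False)],
    judge_score=0.2,
)
print(attribute_failure(trace))
```

The ordering is a design choice: an upstream fault such as a misparsed output is reported before a low judge score, since the poor final answer may simply be a consequence of it.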

3.4 Position in the landscape

In summary, GridArena does not compete with Grid-Agent [1], GridMind [2], GAIA [4], or PowerDAG [7]; it consumes them. Any agent exposing an inference endpoint can be registered as an engine in GridArena and put through the same physical-feasibility, counterfactual, perturbation, and judge pipeline. This makes the platform a natural complement to the EPRI [21] and PFBench [20] benchmarking efforts: where those provide static test sets, GridArena provides a runtime that turns evaluation into a reproducible experiment.
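
The "register any agent as an engine" idea can be illustrated with a small registry that treats every agent as a prompt-in, text-out callable. This is a sketch under assumptions, not GridArena's API: `EngineRegistry`, its methods, and the two toy agents are hypothetical, and a real engine would wrap a call to the agent's inference endpoint instead of a lambda.

```python
from typing import Callable, Dict

# Assumed engine signature: prompt text in, recommendation text out.
Engine = Callable[[str], str]


class EngineRegistry:
    """Holds named engines so the same evaluation can be run against each."""

    def __init__(self) -> None:
        self._engines: Dict[str, Engine] = {}

    def register(self, name: str, engine: Engine) -> None:
        if name in self._engines:
            raise ValueError(f"engine {name!r} is already registered")
        self._engines[name] = engine

    def run_all(self, prompt: str) -> Dict[str, str]:
        # Every registered engine receives the identical prompt.
        return {name: engine(prompt) for name, engine in self._engines.items()}


registry = EngineRegistry()
registry.register("echo-agent", lambda prompt: f"no action for: {prompt}")
registry.register("shed-agent", lambda prompt: "shed 10 MW at bus 3")

results = registry.run_all("line 2-3 overloaded")
print(results)
```

Because every engine sees the same prompt and returns the same kind of output, the feasibility, perturbation, and judge stages can be applied uniformly regardless of which agent produced the recommendation.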

4. References

All 24 references below are dated 2025 or 2026 and have been verified to exist via web search at the time of writing. arXiv identifiers, DOIs, and URLs are provided for each entry.

  1. Y. Zhang, A. M. Saber, A. Youssef, and D. Kundur, "Grid-Agent: An LLM-Powered Multi-Agent System for Power Grid Control," arXiv:2508.05702, Aug. 2025. arxiv.org/abs/2508.05702
  2. H. Jin, K. Kim, and J. Kwon, "GridMind: LLMs-Powered Agents for Power System Analysis and Operations," Argonne National Laboratory, arXiv:2509.02494, Sep. 2025. arxiv.org/abs/2509.02494
  3. M. Shamseldein, "Grid-Mind: An LLM-Orchestrated Multi-Fidelity Agent for Automated Connection Impact Assessment," arXiv:2602.20683, 2026. arxiv.org/abs/2602.20683
  4. Y. Cheng, H. Zhao, X. Zhou, J. Zhao, Y. Cao, C. Yang, and X. Cai, "A large language model for advanced power dispatch (GAIA)," Scientific Reports, vol. 15, art. 91940, 2025. doi.org/10.1038/s41598-025-91940-x
  5. "Grid CoPilot: A Large Language Model (LLM) Based Framework for Transforming Long-Term Planning Analyses," Preprints.org 202504.1464, Apr. 2025. preprints.org/manuscript/202504.1464
  6. R. Wu, J. Ai, and T. S. Bartels, "InstructMPC: A Human-LLM-in-the-Loop Framework for Context-Aware Power Grid Control," arXiv:2512.05876, Dec. 2025. arxiv.org/abs/2512.05876
  7. E. O. Badmus and A. Pandey, "PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis," arXiv:2603.17418, Mar. 2026. arxiv.org/abs/2603.17418
  8. M. Jia, Z. Cui, and G. Hug, "Enhancing LLMs for Power System Simulations: A Feedback-driven Multi-agent Framework," arXiv:2411.16707, May 2025. arxiv.org/abs/2411.16707
  9. Y. Hu, T. Zhao, and M. Yue, "From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework," arXiv:2508.08147, Aug. 2025. arxiv.org/abs/2508.08147
  10. X. Ren, C. S. Lai, G. Taylor, and Z. Guo, "Can Large Language Model Agents Balance Energy Systems?," arXiv:2502.10557, Feb. 2025. arxiv.org/abs/2502.10557
  11. X. Yang, C. Lin, Y. Yang, Q. Wang, H. Liu, H. Hua, and W. Wu, "Large Language Model Powered Automated Modeling and Optimization of Active Distribution Network Dispatch Problems," arXiv:2507.21162, Jul. 2025. arxiv.org/abs/2507.21162
  12. X. Yang, C. Lin, X. Ma, D. Liu, R. Zheng, H. Liu, and W. Wu, "Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration," arXiv:2602.21715, Feb. 2026. arxiv.org/abs/2602.21715
  13. P. Christou, M. Z. Islam, Y. Lin, and J. Xiong, "LLM4DistReconfig: A Fine-tuned Large Language Model for Power Distribution Network Reconfiguration," arXiv:2501.14960, Jan. 2025. arxiv.org/abs/2501.14960
  14. S. Li, J. S. Kim, and C. Chen, "Behavioral Generative Agents for Power Dispatch and Auction," arXiv:2603.08477, 2026. arxiv.org/abs/2603.08477
  15. "Adaptive Solving Method for Power System Operation Based on Knowledge-Driven LLM Agents," MDPI Electronics, vol. 15, no. 2, art. 478, 2026. mdpi.com/2079-9292/15/2/478
  16. J. Shi, Y. Cheng, F. Zhang, M. Jiang, J. Lin, and Y. Shen (Huawei), "GridCodex: A RAG-Driven AI Framework for Power Grid Code Reasoning and Compliance," arXiv:2508.12682, Aug. 2025. arxiv.org/abs/2508.12682
  17. Y. Cheng, H. Zhao, D. Xiang, Z. Zhang, G. Liu, Y. Liu, J. Zhao, and X. Cai, "Power system operational reliability evaluation with retrieval-augmented generation enhanced large language model," Energy and AI, vol. 24, art. 100688, May 2026. doi.org/10.1016/j.egyai.2026.100688
  18. "Implementation of a Device Failure Query Model for the Virtual Power Plant Smart Operation and Maintenance Platform Based on Retrieval-Augmented Generation Technology," MDPI Electronics, vol. 14, no. 22, art. 4502, 2025. mdpi.com/2079-9292/14/22/4502
  19. C. Shen, Z. Guo, X. Wan, Z. Yang, Y. Zhang, W. Huang, J. Song, Z. Zhang, et al., "ProOPF: Benchmarking and Improving LLMs for Professional-Grade Power Systems Optimization Modeling," arXiv:2602.03070, Feb. 2026. arxiv.org/abs/2602.03070
  20. B. She, "Power-Flow Benchmark for LLM-based Power System Agent Evaluation (PFBench)," IEEE DataPort, DOI 10.21227/jnrm-q720, Mar. 2026. ieee-dataport.org / PFBench
  21. Electric Power Research Institute, "Benchmarking Large Language Models for the Electric Power Sector," EPRI Technical Report 3002034347 / EPRI Journal, Feb. 2026. eprijournal.com / EPRI LLM benchmark
  22. M. Sarwar, M. Rizwan, M. Aziz, and A. R. Sudais, "Large Language Models for Power System Applications: A Comprehensive Literature Survey," arXiv:2512.13004, Dec. 2025. arxiv.org/abs/2512.13004
  23. "Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector," arXiv:2509.07177, Sep. 2025. arxiv.org/abs/2509.07177
  24. T. Cui, Y. Ye, Y. Li, N. Du, X. Song, Y. Zhu, and X. Yang, "Toward profitable energy futures trading strategies using reinforcement learning incorporating disagreement and connectedness methods enabled by large language models," Energy and AI, vol. 21, art. 100562, 2025. doi.org/10.1016/j.egyai.2025.100562