Experiment Workflow

Every GridArena run follows the same deterministic pipeline. Each stage is logged and inspectable from the run detail page so experiments are fully reproducible. The pipeline is what makes GridArena an evaluation harness rather than an operational agent — see the LLM Tools Landscape for the broader context.

Methodology at a glance

The diagram below summarises the end-to-end methodology: a benchmark case (optionally perturbed) and a templated prompt feed a seeded LLM agent, whose structured action is validated, simulated, and rule-evaluated in parallel before metrics, counterfactual replays, perturbation outcomes, judge scores, and full provenance are persisted.

1. Configuration

A run starts from either a manual form (/new-run) or a preset. The configuration captures the benchmark case, model, sampling parameters (temperature, top_p, seed), the prompt template, and the evaluation mode (rule-based or simulation-based).
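
As a rough illustration, the configuration captured at this stage could be modelled as an immutable record. The class name RunConfig, its field names, and the fingerprint helper are hypothetical and do not reflect GridArena's actual schema; only the kinds of values mirror the list above.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


# Hypothetical record: field names are illustrative, not GridArena's schema.
@dataclass(frozen=True)
class RunConfig:
    benchmark_case: str      # e.g. "case14"
    model: str               # model identifier passed to the provider
    temperature: float
    top_p: float
    seed: int
    prompt_template: str     # identifier of the prompt template
    evaluation_mode: str     # "rule" or "simulation"

    def __post_init__(self) -> None:
        # Reject values that could never produce a valid run.
        if self.evaluation_mode not in ("rule", "simulation"):
            raise ValueError(f"unknown evaluation mode: {self.evaluation_mode}")
        if not 0.0 <= self.top_p <= 1.0:
            raise ValueError("top_p must be in [0, 1]")

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the configuration (same fields, same hash)."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two configurations with identical fields produce the same fingerprint, which is one simple way to spot duplicate runs in a batch.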

2. Enqueue

For batch executions, one run_execution job is enqueued per pending run. The job queue (Postgres-backed, lease + retry) ensures durability across worker restarts. Single runs execute synchronously through the same code path.
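
The lease-and-retry behaviour can be sketched with an in-memory stand-in. This is not the Postgres-backed queue itself: the Job and InMemoryQueue names, the lease duration, and the attempt limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    run_id: str
    attempts: int = 0
    leased_until: float = 0.0   # timestamp in seconds; 0 means never leased
    done: bool = False


class InMemoryQueue:
    """In-memory stand-in for a durable queue with leases and retries."""

    def __init__(self, lease_seconds: float = 30.0, max_attempts: int = 3):
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts
        self.jobs: List[Job] = []

    def enqueue(self, run_id: str) -> None:
        self.jobs.append(Job(run_id))

    def lease(self, now: float) -> Optional[Job]:
        """Claim the first job that is unfinished, retryable, and not currently leased."""
        for job in self.jobs:
            if job.done or job.attempts >= self.max_attempts:
                continue
            if job.leased_until <= now:
                # An expired lease means the previous worker died or stalled.
                job.attempts += 1
                job.leased_until = now + self.lease_seconds
                return job
        return None

    def complete(self, job: Job) -> None:
        job.done = True
```

If a worker restarts mid-job, it never calls complete, the lease expires, and another worker can claim the same job until the attempt limit is reached.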

3. LLM call

The Edge Worker calls the configured LLM via the OpenAI API (using the server-side OPENAI_API_KEY secret) with the rendered prompt. A deterministic random_seed and fixed sampling parameters make the same input reproduce the same output (within provider tolerances).

4. Parsing

The raw response is parsed into a structured action (scale_all_loads, set_generator_p_mw, line_outage) plus a target index and value. The parser version is recorded so logic changes are auditable.
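
A minimal parsing sketch follows. It assumes the model replies with a JSON object, possibly wrapped in prose, with keys action, target_index, and value; that response format, the ParsedAction type, and the PARSER_VERSION tag are assumptions, not the actual GridArena parser. Only the three action names come from the text above.

```python
import json
from dataclasses import dataclass
from typing import Optional

PARSER_VERSION = "example-1"  # hypothetical version tag stored with each result
ALLOWED_ACTIONS = {"scale_all_loads", "set_generator_p_mw", "line_outage"}


@dataclass(frozen=True)
class ParsedAction:
    action: str
    target_index: Optional[int]
    value: Optional[float]
    parser_version: str


def parse_action(raw: str) -> ParsedAction:
    """Extract a structured action from a reply, rejecting unknown action types."""
    # ASSUMPTION: the reply contains exactly one JSON object, maybe inside prose.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    target = data.get("target_index")
    value = data.get("value")
    return ParsedAction(
        action=action,
        target_index=int(target) if target is not None else None,
        value=float(value) if value is not None else None,
        parser_version=PARSER_VERSION,
    )
```

Raising on unknown actions, rather than guessing, keeps parse failures visible as their own failure category.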

5. Simulation & Evaluation

The action is applied to a fresh copy of the benchmark case. In simulation mode, GridArena calls the external PyPSA service (power flow on IEEE case5/14/30) or falls back to the bundled deterministic DC solver. Baseline and post-action violations are counted, feasibility is assessed, and the engine used is recorded on the evaluation row.
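
The fallback and violation counting described here might look like the sketch below. The two solver callables, the engine labels, and the choice of line thermal limits as the violation type are assumptions for illustration; the text does not specify which constraints GridArena checks.

```python
from typing import Callable, Dict, List


def count_violations(flows_mw: List[float], limits_mw: List[float]) -> int:
    """Number of lines whose absolute flow exceeds its limit (assumed violation type)."""
    return sum(1 for flow, limit in zip(flows_mw, limits_mw) if abs(flow) > limit)


def simulate_with_fallback(
    external: Callable[[], List[float]],
    bundled: Callable[[], List[float]],
) -> Dict[str, object]:
    """Try the external solver first; on a connection error use the bundled one.

    The returned dict records which engine produced the flows, mirroring the
    engine field kept on the evaluation row. Engine labels are hypothetical.
    """
    try:
        return {"engine": "external", "flows_mw": external()}
    except ConnectionError:
        return {"engine": "bundled_dc", "flows_mw": bundled()}
```

Running the same counter on the baseline flows and the post-action flows gives the two violation counts that later metrics compare.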

6. Persistence

Every artifact is stored: runs, run_metadata, run_prompt_logs, run_parse_results, run_recommendations, run_evaluations. RLS scopes everything to the authenticated user. Batches link to runs through batch_run_links.

7. Reporting

Run and batch reports render the full provenance tree, evaluation summary, and structured action. Exports (CSV / LaTeX / SVG) are one click away from any report view.

8. Counterfactual analysis (Layer E)

After a run completes, GridArena can replay the same benchmark case against a set of alternative actions — either contextual defaults (derived from the agent's chosen action type) or user-supplied custom actions. Each counterfactual is executed through the bundled deterministic DC power flow so results are reproducible and independent of the LLM call. Per-action outcomes are stored in counterfactual_actions and counterfactual_results, and surfaced in the run detail Counterfactual panel and the batch Counterfactual aggregation card.

What gets computed

For each counterfactual action c evaluated against the same case as the baseline agent action a, GridArena computes baseline and counterfactual feasibility, post-action violations, and violation improvement, then derives two headline metrics:

Optimality gap

The optimality gap measures how much better a counterfactual action would have been than the agent's chosen action, in terms of violations resolved:

optimality_gap(c) = max(0, improvement(c) − improvement(a))

where improvement(x) = baseline_violations − post_action_violations(x). A value of 0 means the agent's action was at least as good as the counterfactual; a positive value quantifies the missed improvement (in violation count). The gap is clamped at zero so worse counterfactuals do not produce negative scores — they simply contribute 0, since the agent already dominated them.
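
The formula translates directly into a few lines. The function and parameter names below are illustrative; the arithmetic follows the definition above.

```python
def improvement(baseline_violations: int, post_action_violations: int) -> int:
    """Violations resolved by an action relative to the baseline."""
    return baseline_violations - post_action_violations


def optimality_gap(baseline: int, post_agent: int, post_counterfactual: int) -> int:
    """Missed improvement of the agent versus one counterfactual, clamped at zero."""
    gap = improvement(baseline, post_counterfactual) - improvement(baseline, post_agent)
    return max(0, gap)
```

For example, with 5 baseline violations, an agent action that leaves 3, and a counterfactual that leaves 1, the improvements are 2 and 4, so the gap is 2. A counterfactual that leaves 4 violations has improvement 1, and the clamp turns the difference of -1 into 0. Because the baseline cancels out, the gap equals max(0, post_agent - post_counterfactual).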

Decision regret

For a single counterfactual, decision regret mirrors the optimality gap:

decision_regret(c) = optimality_gap(c)

At the batch level it is aggregated as the mean regret over all successful counterfactuals from all runs in the batch:

avg_decision_regret = mean{ optimality_gap(c) : c ∈ successful counterfactuals }

Intuitively, average decision regret answers: "On average, how many additional violations would the agent have resolved had it taken one of the alternatives we considered instead?" Lower is better; 0 means the agent matched or beat every counterfactual we tested.
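
A sketch of the batch aggregation, assuming each counterfactual is available as a dict with a status field and an optimality_gap field. The dict layout is an assumption; the "failure" status value is the one this page mentions for failed counterfactuals.

```python
from statistics import mean
from typing import Dict, Iterable, Optional


def avg_decision_regret(records: Iterable[Dict[str, object]]) -> Optional[float]:
    """Mean optimality gap over counterfactuals that did not fail.

    Returns None when there is nothing to average, so an empty batch is not
    mistaken for a perfect score of 0.
    """
    gaps = [r["optimality_gap"] for r in records if r["status"] != "failure"]
    return mean(gaps) if gaps else None
```

Returning None rather than 0 for an empty set is a design choice in this sketch; it keeps "no usable counterfactuals" distinct from "the agent was never beaten".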

Feasibility change

Alongside the numeric metrics, each counterfactual is labelled with a categorical feasibility_change:

improved — counterfactual is feasible while baseline was not.
worsened — baseline was feasible but the counterfactual is not.
unchanged — both are feasible, both infeasible, or both partial.
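
The labelling rules above can be written as a small function. The three state strings are taken from the list; how mixed partial and infeasible pairs are labelled is not stated, so the fall-through below is an assumption.

```python
def feasibility_change(baseline: str, counterfactual: str) -> str:
    """Label a counterfactual relative to the baseline agent action.

    Both arguments are one of "feasible", "partial", or "infeasible".
    """
    if counterfactual == "feasible" and baseline != "feasible":
        return "improved"
    if baseline == "feasible" and counterfactual != "feasible":
        return "worsened"
    # Covers both feasible, both infeasible, and both partial.
    # ASSUMPTION: mixed partial/infeasible pairs, which the list above does not
    # mention, also fall through to "unchanged" in this sketch.
    return "unchanged"
```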

Caveats

Counterfactuals run on the in-Worker DC power flow, which currently supports case5, case14, and case30. Larger cases (e.g. ieee39) report a failure status with a clear reason and are excluded from aggregates.
Counterfactual-derived optimality gap and regret only consider the actions you enumerated — they are a lower bound on the true optimal-policy gap. For an exact measurement, enable the Optimizer Oracle (step 8b), which computes the mathematically optimal action via OPF and overrides these metrics on run_evaluations.
Failed counterfactuals (non-converged, unsupported case) are persisted with status = "failure" and failure_reason for full auditability, but do not contribute to averages.

8b. Optimizer Oracle (OPF — true optimum)

The optimizer oracle replaces the counterfactual approximation of the optimal action with a mathematical optimum. For each evaluation, GridArena solves the OPF (Optimal Power Flow) on the same case and objective, then writes the result back to run_evaluations.optimality_gap, decision_regret, and deviation_from_reference. The oracle is tiered exactly like the simulator:

DC-OPF (in-Worker): a linear program solved with GLPK/WASM on the built-in IEEE cases. Always available, sub-second.
AC-OPF (external): PyPSA network.optimize() on the simulation service. Higher fidelity, used when reachable.
SCOPF (external, N-1): security-constrained OPF over the contingency set. Used by the robustness layer to score perturbation outcomes against the optimal under-contingency dispatch.
null: when no tier is reachable, the run keeps the counterfactual-derived metrics and is flagged optimizer_engine = "none" on the evaluation row.

Objectives: min_violations (default, matches the existing evaluator semantics), min_generation_cost, min_load_shed, and min_redispatch_from_baseline (closest feasible point to the current dispatch). With a true optimum x* in hand, the metrics become exact:

optimality_gap = objective(agent_action) − objective(x*)
decision_regret = objective(agent_action) − objective(x*) (per-run; ≥ 0 by optimality)

Results are cached in optimizer_results (RLS-scoped, keyed by case + objective + engine) so expensive AC-OPF solves are reused across page views. The Optimal Action panel on Run Details renders agent vs optimum vs ground truth side-by-side; the batch report exposes an oracle benchmark column.
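
The caching idea can be sketched as a dictionary keyed by the same triple. This in-memory class is a stand-in for the optimizer_results table: it does not model RLS scoping, and the OptimizerCache name and the solve callable are hypothetical.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, str, str]  # (case, objective, engine)


class OptimizerCache:
    """Memoises optimal objective values so each key is solved at most once."""

    def __init__(self, solve: Callable[[str, str, str], float]):
        self._solve = solve
        self._results: Dict[CacheKey, float] = {}
        self.solves = 0  # number of real solver calls, for inspection

    def get(self, case: str, objective: str, engine: str) -> float:
        key = (case, objective, engine)
        if key not in self._results:
            self.solves += 1
            self._results[key] = self._solve(case, objective, engine)
        return self._results[key]
```

Including the engine in the key matters: a DC-OPF result and an AC-OPF result for the same case and objective are different numbers and should not overwrite each other.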

9. Perturbation jobs (robustness layer)

While counterfactuals vary the action, perturbation jobs vary the environment. From any batch you can launch a perturbation sweep that scales loads, trips generators, or removes lines, then re-queries the agent on each modified case. Each perturbed re-run goes through the full pipeline above (LLM → parse → simulate → evaluate) and is stored in perturbation_jobs and perturbation_results. The batch report aggregates per-perturbation feasibility, violation deltas, and decision stability so you can see whether the agent's recommendation degrades gracefully or breaks under stress — the failure mode that single-shot benchmarks like PFBench and ProOPF do not measure.
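
One way to picture the sweep is as a generator of modified cases, each of which would then be sent through the full pipeline. The case layout (a dict with loads_mw, generators, and lines lists) and the label format are assumptions for illustration, not GridArena's data model.

```python
import copy
from typing import Dict, Iterator, List, Tuple

Case = Dict[str, list]  # hypothetical layout: loads_mw, generators, lines


def perturbations(case: Case, load_scales: List[float]) -> Iterator[Tuple[str, Case]]:
    """Yield (label, perturbed_case) pairs; the input case is never mutated."""
    for scale in load_scales:
        perturbed = copy.deepcopy(case)
        perturbed["loads_mw"] = [p * scale for p in perturbed["loads_mw"]]
        yield f"scale_loads:{scale}", perturbed
    for i in range(len(case["generators"])):
        perturbed = copy.deepcopy(case)
        del perturbed["generators"][i]   # trip a single generator
        yield f"trip_generator:{i}", perturbed
    for i in range(len(case["lines"])):
        perturbed = copy.deepcopy(case)
        del perturbed["lines"][i]        # remove a single line
        yield f"remove_line:{i}", perturbed
```

Working on deep copies keeps every perturbation independent, so results from one modified case cannot leak into the next.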

10. LLM-as-judge scoring

After evaluation, an independent judge model scores the recommendation against domain rubrics (safety, feasibility, actionability, clarity). The judge prompt, model, rubric version, and per-criterion scores are persisted alongside the run so the audit trail explains why a run passed or failed — not just whether it did. Combined with the decision trace and parser provenance, this lets failures be attributed to the right cause: wrong tool selection, misparsed solver output, or flawed final reasoning.

12. AI-assisted fixes (metadata loop)

The pipeline above operates on curated benchmark metadata — dataset_version, source_url, last_reviewed, prompt_version, random_seed, and the standardized/simplified notes documented per case. To keep that metadata trustworthy as the dataset evolves, GridArena runs a parallel AI-assisted fix loop outside the run pipeline, rendered on /docs/cases and diagrammed in the validation feedback loop.

Stages

Validate — findCaseMetaIssues scans CASE_META on every render and flags missing required fields (errors) or malformed optional fields (warnings).
Suggest — clicking Suggest fix on an issue calls suggestCaseMetaFix, a server function that prompts the configured LLM with the case key, field name, and current value, and returns a structured proposal ({ value, model, rationale }).
Review — the drawer renders a previous-vs-proposed diff. Nothing is persisted until the user explicitly accepts.
Accept / Revert — accept writes a row to case_meta_overrides; revert deletes it. Both operations update the UI optimistically and roll back on server error. Bulk revert collapses many deletes into a single round-trip.
Re-merge & re-validate — mergeOverrides applies the overrides on top of CASE_META, the validator re-runs, and the issue disappears from the dev panel (or stays, with a clear message, if the suggestion was itself invalid).
Audit — every accept and revert is appended to case_meta_override_audit with the actor, timestamp, action, previous value, new value, and the AI model + rationale (when applicable). Audit rows are user-scoped via RLS and cannot be edited or deleted, so the metadata history is a tamper-evident log.
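
The validate, merge, and re-validate stages can be sketched as two pure functions. These are Python analogues written for illustration, not the findCaseMetaIssues and mergeOverrides functions themselves. Which fields are required and which optional fields are format-checked is an assumption here, as is the override row layout.

```python
from datetime import date
from typing import Dict, List

REQUIRED_FIELDS = ("dataset_version", "source_url")  # ASSUMPTION: required set
OPTIONAL_DATE_FIELDS = ("last_reviewed",)            # ASSUMPTION: ISO dates


def find_issues(case_meta: Dict[str, dict]) -> List[dict]:
    """Flag missing required fields as errors and malformed dates as warnings."""
    issues = []
    for case, meta in case_meta.items():
        for field in REQUIRED_FIELDS:
            if not meta.get(field):
                issues.append({"case": case, "field": field, "severity": "error"})
        for field in OPTIONAL_DATE_FIELDS:
            value = meta.get(field)
            if value is not None:
                try:
                    date.fromisoformat(value)
                except ValueError:
                    issues.append({"case": case, "field": field, "severity": "warning"})
    return issues


def merge_overrides(case_meta: Dict[str, dict], overrides: List[dict]) -> Dict[str, dict]:
    """Apply accepted overrides on top of the curated metadata without mutating it."""
    merged = {case: dict(meta) for case, meta in case_meta.items()}
    for row in overrides:
        merged.setdefault(row["case"], {})[row["field"]] = row["value"]
    return merged
```

Because merge_overrides returns a new mapping, reverting an override only requires dropping its row and merging again; the curated source is never edited in place.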

Why this is separate from the run pipeline

The run pipeline (steps 1–11) evaluates an agent against a fixed benchmark. The AI-assisted fix loop evaluates the benchmark itself — its provenance, reproducibility metadata, and curation notes. Keeping the two loops independent means a flaky LLM suggestion on a metadata field can never silently change a run's scientific inputs: overrides are explicit, reversible, and audited, while the run pipeline always consumes the merged-and-validated CASE_META at execution time.

13. Curating your dashboards

Once experiments accumulate, the Runs, Presets, Batches, Ground Truth, and Validation pages all support inline Edit and Delete actions. Deletes are soft: a 5-second toast with Undo lets you recover accidental removals before the row is permanently dropped. Runs, Presets, and Batches also offer bulk multi-select via row checkboxes and a sticky action bar — handy for cleaning up exploratory sweeps before publishing a report.