Troubleshooting

Run stuck in “queued”

Check /system-status. If no worker has claimed the job after a minute, click “Process queue now” to drain the queue manually. While the tab stays open, the client drain hook also re-pings the worker every 5 seconds.

LLM timeout / 504

Each run_execution job has a default timeout of 15 seconds. Failed jobs are retried with exponential backoff, up to max_attempts (default 3). Check the job logs in /system-status for the truncated error trace.
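
To make the retry behavior concrete, here is a minimal sketch of exponential backoff capped at max_attempts. Only the default of 3 attempts comes from this page; the base delay of 1 second, the doubling factor, and the function names are assumptions for illustration, not GridArena's actual settings or code.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_attempts: int = 3, base_delay: float = 1.0,
                   factor: float = 2.0) -> List[float]:
    """Seconds to wait before each retry; the first attempt is not delayed.

    max_attempts=3 matches the documented default. base_delay and factor
    are assumed values.
    """
    return [base_delay * factor ** i for i in range(max_attempts - 1)]


def run_with_retries(job: Callable[[], T], max_attempts: int = 3,
                     base_delay: float = 1.0, factor: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call job() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base_delay, factor)
    for attempt in range(max_attempts):
        try:
            return job()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Under these assumed values a job that fails twice waits 1 second, then 2 seconds, before its third and final attempt. Passing a custom sleep function makes the schedule easy to inspect without actually waiting.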

Simulation engine “unavailable”

The Health badge shows unavailable when the external PyPSA service can't be reached. GridArena automatically falls back to the in-Worker DC solver — runs still succeed and the engine used is recorded on the evaluation row. To re-enable the external engine, verify SIMULATION_SERVICE_URL and SIMULATION_SERVICE_TOKEN in Cloud secrets.
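
The fallback described above follows a common pattern: try the external engine first, and on any failure run the local solver and record which engine produced the result. The sketch below illustrates that pattern only. The function name, the callables, and the "external" and "fallback" labels are hypothetical and do not reflect GridArena's actual identifiers.

```python
from typing import Any, Callable, Optional, Tuple


def solve_with_fallback(case: Any,
                        external: Optional[Callable[[Any], Any]],
                        local: Callable[[Any], Any]) -> Tuple[Any, str]:
    """Return (result, engine_used), preferring the external engine.

    external is None when the service is not configured at all.
    """
    if external is not None:
        try:
            return external(case), "external"
        except Exception:
            pass  # service unreachable or failing: fall through to local
    return local(case), "fallback"
```

Returning the engine label alongside the result is what lets the caller store it with the evaluation, so a later reader can tell which solver produced a given run.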

Diagnose engine failures (admin)

Admins can open /simulation-health to run end-to-end self-tests against the external engine. The page probes /version and /health, then executes /simulate in parallel for IEEE case5, case14, and case30. It renders actionable troubleshooting tips based on the exact response of the failing endpoint: missing SIMULATION_SERVICE_URL or SIMULATION_SERVICE_TOKEN, DNS or connection errors, 401/403 authentication failures, 5xx errors with raw response samples, and per-case timeout hints. Each probe is stored in history so you can review past failures and confirm when a fix took effect. To avoid noise, a banner alert fires only on state changes (pass → fail).
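
As a rough illustration of how a probe outcome could map to a tip, the sketch below classifies a result into the failure categories listed above and decides when a banner should fire. The two secret names come from this page. The function names, the "timeout" error marker, and the tip wording are assumptions, and the alert rule is read here as pass → fail only.

```python
from typing import Mapping, Optional

REQUIRED_SECRETS = ("SIMULATION_SERVICE_URL", "SIMULATION_SERVICE_TOKEN")


def diagnose(secrets: Mapping[str, str], status: Optional[int] = None,
             error: Optional[str] = None) -> str:
    """Map one probe outcome to a short troubleshooting hint.

    status is the HTTP status if a response arrived; error is a transport
    level failure such as "timeout" (hypothetical marker) or a DNS message.
    """
    missing = [name for name in REQUIRED_SECRETS if not secrets.get(name)]
    if missing:
        return "missing secret: " + ", ".join(missing)
    if error == "timeout":
        return "timeout: the case did not finish in time"
    if error is not None:
        return "connection: check DNS and that the service is reachable"
    if status in (401, 403):
        return "auth: check SIMULATION_SERVICE_TOKEN"
    if status is not None and 500 <= status <= 599:
        return "server: inspect the raw response sample"
    if status is not None and 200 <= status < 300:
        return "ok"
    return "unclassified: no known failure pattern matched"


def should_alert(previous_ok: bool, current_ok: bool) -> bool:
    """Fire only on a pass -> fail transition, not on repeated failures."""
    return previous_ok and not current_ok
```

Checking for missing secrets first mirrors the order in which the problems would surface: without a URL or token, no later probe can produce a meaningful status.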

“Unauthorized” errors

Server functions require a Bearer token from the Supabase session. If you see 401s, sign out and back in to refresh the token. By design, row-level security (RLS) also rejects reads of data owned by another user.
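
When chasing a 401, it can help to confirm that the request carries the token in the expected header and that the token has not expired. The sketch below assumes the session's access token is a JWT with a standard exp claim, which is how Supabase issues them. The helper names are hypothetical, and the expiry check only reads the payload. It does not verify the signature, so use it for debugging rather than for authorization decisions.

```python
import base64
import json
import time
from typing import Dict, Optional


def auth_headers(access_token: str) -> Dict[str, str]:
    """Build the Authorization header a Bearer-protected endpoint expects."""
    if not access_token:
        raise ValueError("no access token: sign in first")
    return {"Authorization": f"Bearer {access_token}"}


def token_expired(jwt: str, now: Optional[float] = None) -> bool:
    """Return True if the JWT's exp claim is in the past.

    Decodes the payload segment only; the signature is NOT checked.
    """
    payload_b64 = jwt.split(".")[1]
    # base64url segments in JWTs omit padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    current = time.time() if now is None else now
    return payload["exp"] <= current
```

If token_expired returns True for the token your client is sending, signing out and back in should resolve the 401.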

Validation suite failures

Open the failing test from /validation and inspect the actual vs. expected output. Parser/evaluator versions are recorded so you can correlate failures with code changes.

Demo data clutters my workspace

Every seeded artifact is prefixed with [Demo]. Filter or delete them from /presets and /runs at any time; deletes cascade safely thanks to RLS scoping.
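
If you export or script against your own records, the same prefix makes demo items easy to separate. This is a small sketch that assumes each record is a mapping with a "name" field. The field name and the function are hypothetical, and only the [Demo] prefix comes from this page.

```python
from typing import Any, Iterable, List, Mapping, Tuple

DEMO_PREFIX = "[Demo]"

Record = Mapping[str, Any]


def split_demo(items: Iterable[Record],
               name_key: str = "name") -> Tuple[List[Record], List[Record]]:
    """Partition records into (demo, own) by the [Demo] name prefix."""
    demo: List[Record] = []
    own: List[Record] = []
    for item in items:
        is_demo = str(item.get(name_key, "")).startswith(DEMO_PREFIX)
        (demo if is_demo else own).append(item)
    return demo, own
```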

Still stuck?

Capture a screenshot of the failing page plus the /system-status KPIs and recent job logs. That trio is enough to debug nearly every failure mode.