Debugging Failed Tests

Auto-heal first, AI triage second. Then screenshots, logs, traces. Plus common failure modes — locators, timeouts, auth, flakes, bot-block — and how to fix each.

When a test goes red, AegisRunner does a lot of the legwork for you before you even open the run page. This guide walks through the order in which to investigate (auto-heal first, AI triage second, manual investigation third) and what to do when you actually need to dig in.

The order to investigate failures

  1. Did auto-heal fix it? Look at the run summary first. If a step has a healed badge, the locator drifted and AegisRunner found it again. Nothing to do — the suite is updated for next time.
  2. What does the Triage tab say? AI classifies each remaining failure. Most failures are explained here without needing logs.
  3. Look at the per-step screenshot and DOM. The screenshot of the page right before the failure is usually enough.
  4. Check the Logs tab. Worker logs, browser console, network — only needed for the strange cases.
  5. Run it locally. Export to Playwright and step through with the inspector. Last resort.

Step 1: Auto-heal

The most common reason a test fails isn't a real bug; it's that someone renamed a button or moved an element. Auto-heal handles this:

  • When a step's locator can't find its target, AegisRunner takes an accessibility snapshot of the live page and looks for a confident match by role + visible text.
  • If found, the step proceeds against the new locator and the suite is updated. The step shows a healed badge with a link to the locator change.
  • If not found with high confidence, the step fails. You'll see it in Results and the Triage tab will explain why.

If you keep seeing healed badges on the same step, your app's structure is shifting underneath the test — that's a signal to refactor the test (or the app) for stability, not just rely on heal.
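
To make the "confident match by role + visible text" idea concrete, here is a minimal sketch of how such matching could work. It is not AegisRunner's implementation: the `AxNode` shape, the token-overlap score, the tie rule, and the 0.5 threshold are all assumptions chosen for illustration.

```typescript
// Hypothetical snapshot node; the real snapshot format is not documented here.
interface AxNode {
  role: string; // ARIA role, e.g. "button"
  name: string; // accessible name / visible text
  children?: AxNode[];
}

interface HealMatch {
  node: AxNode;
  confidence: number; // 0..1
}

// Flatten the accessibility tree into a list of nodes.
function flatten(root: AxNode): AxNode[] {
  const out: AxNode[] = [root];
  for (const child of root.children || []) out.push(...flatten(child));
  return out;
}

// Lower-cased, de-duplicated words of a string.
function tokens(s: string): string[] {
  const seen: string[] = [];
  for (const t of s.toLowerCase().split(/\s+/)) {
    if (t && seen.indexOf(t) === -1) seen.push(t);
  }
  return seen;
}

// Jaccard similarity over words: shared words / all distinct words.
function textSimilarity(a: string, b: string): number {
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const shared = ta.filter((t) => tb.indexOf(t) !== -1).length;
  return shared / (ta.length + tb.length - shared);
}

// Best same-role match, or null when nothing clears the threshold
// or the top two candidates tie (too ambiguous to heal safely).
function findHealCandidate(
  root: AxNode,
  target: { role: string; text: string },
  threshold = 0.5
): HealMatch | null {
  const ranked = flatten(root)
    .filter((n) => n.role === target.role)
    .map((node) => ({ node, confidence: textSimilarity(node.name, target.text) }))
    .sort((x, y) => y.confidence - x.confidence);
  const [best, runnerUp] = ranked;
  if (!best || best.confidence < threshold) return null;
  if (runnerUp && runnerUp.confidence === best.confidence) return null;
  return best;
}
```

Returning null on a tie mirrors the behavior described above: when there is no single confident match, it is safer for the step to fail than to click the wrong element.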

Step 2: AI triage

The Triage tab appears whenever a run has failures. For each failure it labels:

Label           | What it means
Real regression | This test passed before. Something in your app broke.
New failure     | This test has never passed. Often means the test was just added and needs a fix.
Likely flake    | This test has a history of intermittent failures. Triage will say "passed 9 of last 10 runs."
Environment     | Looks like an outage or auth/proxy issue, not your code (4xx on every step, login redirect loop, DNS failure).
Auto-healed     | The locator drifted but heal recovered. No action needed.

Each labeled failure has a one-paragraph explanation linking to the exact step. See Failure Triage for how the labelling works.
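
As a rough mental model for the three history-based labels, the sketch below classifies a failure from a test's previous results. The rules, the `classifyFailure` name, and the 10-run window are assumptions for illustration only; the actual triage is AI-driven and also produces the Environment and Auto-healed labels, which this sketch does not cover.

```typescript
type TriageLabel = "Real regression" | "New failure" | "Likely flake";

// history: outcomes of earlier runs of the same test, oldest first (true = passed).
// The current run is assumed to have failed and is not part of history.
function classifyFailure(
  history: boolean[],
  flakeWindow = 10
): { label: TriageLabel; note: string } {
  const passes = history.filter((passed) => passed).length;
  if (passes === 0) {
    return { label: "New failure", note: "never passed" };
  }
  const recent = history.slice(-flakeWindow);
  const recentPasses = recent.filter((passed) => passed).length;
  const note = `passed ${recentPasses} of last ${recent.length} runs`;
  // Mixed recent results suggest intermittent behavior rather than a clean break.
  if (recentPasses < recent.length) {
    return { label: "Likely flake", note };
  }
  return { label: "Real regression", note };
}
```

A test that was consistently green until now comes out as a regression, while one with a red run scattered through its recent history comes out as a likely flake.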

Step 3: Inspect the failure

Click any failed test in the Results tab to open the step-by-step view:

  • Steps panel — every step in order, with a green/red status. The first red step is where things went sideways.
  • Screenshot at failure — the live page right before the failing assertion. 80% of failures are obvious from this alone (unexpected modal, blank page, error banner, missing data).
  • Expected vs actual — for assertion failures, what the test expected and what it actually saw.
  • Locator details — the locator the step tried to use, and (if heal ran) the locator it ended up using.
  • Recording — the Recording tab plays back the full run. Skip to the failure timestamp.

Step 4: Logs and traces

Worker logs

The Logs tab on the run page shows the raw worker output: every navigation, every step, every browser-level event. Useful when the failure is something weird like a navigation timeout or a cross-origin redirect.

Network and console

Each step records browser console messages and network activity. You'll find these in the step detail panel — useful for spotting failed API calls, CORS errors, or runtime JS errors that broke the page.

Playwright trace files (Pro+)

Pro and Business plans get full Playwright traces: DOM snapshots, network HAR, screenshots, and the full execution timeline. Download the trace from the run page and open it in Playwright's Trace Viewer:

npx playwright show-trace trace.zip

Common failure modes and what to do

"Element not found" — and auto-heal didn't fire

Auto-heal needs an accessibility snapshot to work. If the page didn't fully load before the step ran, heal had nothing to match against.

  • Check the Logs tab — was there a navigation timeout right before the step?
  • Confirm the element actually exists in the screenshot. If it does, the locator just needs an update — re-record by re-running the suite from the page detail.
  • If the element is loaded async, the test step likely needs an explicit wait. Open the suite and edit the step.

"Timeout exceeded"

  • Look at the screenshot — was the page still spinning?
  • Check network: a slow API call can stall the whole step.
  • If the action genuinely takes longer than the default 30s, the step should have an explicit longer timeout.
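
If you are debugging an exported spec locally, it can help to see what an explicit wait with a longer timeout amounts to. The helper below is a generic polling sketch; `waitUntil` and its parameters are hypothetical names, not part of AegisRunner or Playwright. Playwright's own locator assertions already wait and accept a `timeout` option, so prefer those in real specs.

```typescript
// Poll `condition` until it returns true or `timeoutMs` elapses.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 30_000, // mirrors the 30s default mentioned above
  intervalMs = 100
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await condition()) return;
    if (Date.now() >= deadline) {
      throw new Error(`Timeout exceeded: condition not met within ${timeoutMs}ms`);
    }
    // Sleep briefly before checking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

For an action that genuinely needs more time, a call such as `await waitUntil(reportIsReady, 90_000)` raises only that step's budget instead of slowing every step; `reportIsReady` stands in for whatever readiness check your page needs.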

"Expected X, got Y" assertion failures

  • Did the data behind the page actually change? (Common when test data resets between runs.)
  • Is the assertion too strict? E.g. asserting a full timestamp string when only the date matters — relax it.
  • For numbers that drift (counts, balances), use a comparison rather than equality.
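
The two relaxations above can be sketched as small helpers. The function names are hypothetical, and the date comparison assumes UTC is the right frame for your data.

```typescript
// Compare only the calendar date (in UTC) of two ISO timestamps,
// ignoring the time of day. Throws a RangeError on unparseable input.
function sameUtcDate(a: string, b: string): boolean {
  return new Date(a).toISOString().slice(0, 10) === new Date(b).toISOString().slice(0, 10);
}

// Accept a number inside an inclusive range instead of requiring equality.
function inRange(actual: number, min: number, max: number): boolean {
  return actual >= min && actual <= max;
}
```

In an exported Playwright spec, the range check corresponds to comparison matchers such as `toBeGreaterThanOrEqual` and `toBeLessThanOrEqual` rather than `toBe`.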

The test is flaky (passes 7 out of 10 times)

  • The Triage tab will already mark it as a flake. AegisRunner lowers a flake's weight in pass-rate metrics so it doesn't drag your suite score down.
  • Most flakes are race conditions: the test acts before the page is fully ready. Look for a missing wait.
  • If a flake survives a fix, mark the test as Quarantine (Pro and above). It will still run but won't fail the suite.

Test redirects to /login

The session expired or wasn't applied. AegisRunner has a built-in auto-login fallback for this case (when redirected to /login, it'll try the project's login script). If that's not firing:

  • Verify your project has a login script set in Settings → Login Script.
  • If the test is intentionally testing the unauthenticated state, tag the test with negative, unauthenticated, or a *-negative tag — that suppresses auto-login.
  • Check pre-auth cookies in Test Data. Stale cookies cause silent auth failures.
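
The tag rule above reads as a simple predicate. This sketch assumes tags are matched exactly and case-sensitively, which the documentation does not state; `suppressesAutoLogin` is a hypothetical name.

```typescript
// True when any tag is "negative", "unauthenticated",
// or ends in "-negative" (for example "login-negative").
function suppressesAutoLogin(tags: string[]): boolean {
  return tags.some(
    (tag) => tag === "negative" || tag === "unauthenticated" || /-negative$/.test(tag)
  );
}
```

Under this reading a tag like `negative-path` would not suppress auto-login, so check your tag spelling if a negative test keeps getting logged in.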

Bot-block / Cloudflare / captcha

If a run hits anti-bot middleware, AegisRunner short-circuits the entire suite with a BotBlocked error rather than reporting fake passes. You'll see this on the Triage tab as an Environment failure.

  • Add the AegisRunner user-agent or IP range to your allowlist for staging.
  • Use a staging environment that doesn't have the same protections.
  • Configure environment-specific bypass tokens via Environments.

Test data and environments

Many failures resolve by configuring the test environment correctly:

  • Login script (Project Settings) — runs before any test, so auth-gated pages are reachable.
  • Pre-auth cookies (Test Data) — drop in session cookies to skip the login flow entirely.
  • Environment tokens — bypass tokens, debug headers, or staging API keys, scoped per environment.
  • Custom headers — e.g. X-Debug-Mode: true for tests, X-AegisRunner: true for tagging traffic in your logs.

See Test Data Management.

Running tests locally for deep debugging

When you really need to step through:

  1. Open the failing suite, click Export → Playwright.
  2. Drop the .spec.ts into your project (or npm init playwright@latest for a fresh setup).
  3. Run it, adding --debug when you want to step through with the inspector:

# Run with the browser visible
npx playwright test --headed

# Step through interactively
npx playwright test --debug

# Just one test
npx playwright test my-test.spec.ts

Local runs don't get auto-heal — once you're outside AegisRunner, locator drift is yours to handle.
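
Since local runs also lack the hosted screenshots and traces, a Playwright config along these lines can recover similar artifacts. This is a sketch with illustrative values, not something the export is known to generate; the header simply reuses the X-AegisRunner example from the custom headers list.

```typescript
// playwright.config.ts (sketch; adjust values to your project)
import { defineConfig } from "@playwright/test";

export default defineConfig({
  timeout: 60_000, // per-test timeout; Playwright's default is 30s
  use: {
    trace: "retain-on-failure", // keep a trace for failed tests
    screenshot: "only-on-failure", // capture the page when a test fails
    extraHTTPHeaders: { "X-AegisRunner": "true" }, // tag local traffic in your logs
  },
});
```

A retained trace can then be opened with the same `npx playwright show-trace` command used for traces downloaded from the run page.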

Filing a bug with us

If you genuinely think the failure is on our side:

  1. Note the run ID (in the URL).
  2. Note any healed-locator badges (sometimes heal picks the wrong element, and that's worth telling us about).
  3. Email support@aegisrunner.com with the run ID and a one-line description.
