Flaky Test Finder

Reruns the suite repeatedly over several days, records every result, and surfaces the tests that fail without any code change.

L2 · Verify-until-done · Collector · Low risk · Autonomous · Fixed interval · Tested
What it does

Distinguish genuinely flaky tests from real failures by gathering pass/fail data across many identical runs.

Stops when

After N runs (e.g. 15 over 3 days), produce a flakiness report. Alternatively, run continuously and publish the report on a fixed cadence.

Runs

Fixed interval (0 */4 * * *) · Autonomous

How one iteration works

discover → plan → execute → verify → escalate

  1. Discover

    Note the current commit SHA so runs on unchanged code are comparable.

  2. Plan

    Decide to run the full suite (or a target subset) this cycle.

  3. Execute

    Run the tests; append each test's result, the SHA, and timestamp to the results store.

  4. Verify

    Only count a test as flaky when it both passed and failed on the SAME commit; never flag a test that only failed after code changed.

  5. Escalate

    When a test crosses a flakiness threshold, write it to the report with its pass/fail ratio.
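
The Verify and Escalate steps can be sketched as a pure function over the results log. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `find_flaky`, the record field names (`test_id`, `commit_sha`, `outcome`), and the default `threshold` are hypothetical choices made to line up with the memory contract.

```python
from collections import defaultdict


def find_flaky(records, threshold=0.05):
    """Return flaky tests sorted by fail rate, highest first.

    records: iterable of dicts with keys test_id, commit_sha, outcome
    ("pass" or "fail"). These field names are assumed, not prescribed.
    """
    # Count passes and failures separately for every (test, commit) pair.
    counts = defaultdict(lambda: {"pass": 0, "fail": 0})
    for r in records:
        if r["outcome"] in ("pass", "fail"):
            counts[(r["test_id"], r["commit_sha"])][r["outcome"]] += 1

    # Keep only pairs with mixed outcomes on the SAME commit.
    per_test = defaultdict(lambda: {"pass": 0, "fail": 0, "shas": []})
    for (test_id, sha), c in counts.items():
        if c["pass"] > 0 and c["fail"] > 0:
            entry = per_test[test_id]
            entry["pass"] += c["pass"]
            entry["fail"] += c["fail"]
            entry["shas"].append(sha)

    # Escalate: report only tests at or above the threshold.
    report = []
    for test_id, entry in per_test.items():
        fail_rate = entry["fail"] / (entry["pass"] + entry["fail"])
        if fail_rate >= threshold:
            report.append({
                "test_id": test_id,
                "fail_rate": fail_rate,
                "shas": sorted(entry["shas"]),
            })
    return sorted(report, key=lambda row: row["fail_rate"], reverse=True)
```

In this sketch the fail rate is computed only over commits where the test showed mixed outcomes, so a test that failed consistently after a code change never contributes to the ratio.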

The prompt

The tool-agnostic spec the loop runs each pass — copy it, then wire it to your tool below.

Record the current commit SHA, then run the test suite. For every test, append its outcome (pass/fail), the SHA, and the timestamp to the results log. Do not modify any code or tests. A test counts as flaky only if it has BOTH passed and failed on the same SHA. After this run, update the flakiness report: list each flaky test with its fail rate and the SHAs it flaked on, sorted by fail rate. If nothing is newly flaky, say so.
Claude Code
/loop 4h run the suite, append results, and update the flakiness report
Generic
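while true; do SHA=$(git rev-parse HEAD); TS=$(date -u +%FT%TZ); run_tests --json | jq -c --arg sha "$SHA" --arg ts "$TS" '. + {commit_sha: $sha, timestamp: $ts}' >> results.ndjson; agent -p 'update flakiness report from results.ndjson'; sleep 14400; done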

Memory contract

Append-only results log keyed by (test_id, commit_sha, timestamp, outcome). The report is derived from it; nothing is overwritten.
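
One way to honor this contract is a JSON Lines file that is only ever opened in append mode. The sketch below is illustrative: the helper names `append_result` and `read_results`, the field names, and the choice of JSON Lines are assumptions, not part of the spec.

```python
import json
from datetime import datetime, timezone


def append_result(log_path, test_id, commit_sha, outcome):
    """Append one result as a JSON line; existing lines are never rewritten."""
    if outcome not in ("pass", "fail"):
        raise ValueError(f"unexpected outcome: {outcome!r}")
    record = {
        "test_id": test_id,
        "commit_sha": commit_sha,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "outcome": outcome,
    }
    # Mode "a" only ever adds to the end of the file.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_results(log_path):
    """Load every record back, one JSON object per non-empty line."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because the report is derived by reading the whole log, it can be regenerated at any time and never needs to be edited in place.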

Verification & guardrails

How it checks itself. Flakiness is asserted only from mixed outcomes on an identical SHA; a single failure is not enough to flag.

  • Read-only with respect to code — it only runs tests and appends data
  • Never edits or deletes a test on its own
  • Compares within the same commit so code changes can't masquerade as flakiness
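
The read-only guardrail can be checked mechanically by comparing the repository state before and after the run. The following is a rough sketch, assuming the code lives in a git checkout; `repo_state`, `run_read_only`, and the injectable `run` parameter are hypothetical names introduced for illustration.

```python
import subprocess


def repo_state(repo=".", run=subprocess.run):
    """Return (HEAD sha, tracked-file changes) for the repository at `repo`."""
    def git(*args):
        result = run(["git", *args], cwd=repo,
                     capture_output=True, text=True, check=True)
        return result.stdout.strip()

    # --untracked-files=no ignores new artifacts such as the results log.
    return (git("rev-parse", "HEAD"),
            git("status", "--porcelain", "--untracked-files=no"))


def run_read_only(run_suite, repo=".", run=subprocess.run):
    """Run the suite and raise if HEAD or any tracked file changed meanwhile."""
    before = repo_state(repo, run)
    result = run_suite()
    after = repo_state(repo, run)
    if before != after:
        raise RuntimeError("tracked files or HEAD changed during the test run")
    return before[0], result
```

A run that trips this check should be discarded instead of logged, since its results are no longer attributable to a single SHA.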

Failure modes

  • Calls a test flaky when the failures actually came from changed code — always key by SHA
  • Results file grows unbounded — rotate or summarize old runs
  • Misses time-of-day flakiness if it always runs at the same minute — vary the schedule
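
For the unbounded-growth failure mode, old runs can be collapsed into per-commit counts that still carry enough information to decide flakiness. This is a sketch under assumptions: `summarize` and the output field names `passes` and `fails` are hypothetical, and the summary is meant to be written to a separate archive so the live log stays append-only.

```python
from collections import Counter


def summarize(records):
    """Collapse raw results into pass/fail counts per (test_id, commit_sha)."""
    counts = Counter()
    for r in records:
        counts[(r["test_id"], r["commit_sha"], r["outcome"])] += 1

    # One summary row per (test, commit) pair, in a stable order.
    keys = sorted({(test_id, sha) for (test_id, sha, _) in counts})
    return [
        {
            "test_id": test_id,
            "commit_sha": sha,
            "passes": counts[(test_id, sha, "pass")],
            "fails": counts[(test_id, sha, "fail")],
        }
        for (test_id, sha) in keys
    ]
```

The trade-off is that timestamps are dropped, so time-of-day patterns can no longer be studied for the summarized runs.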

Variations

  • Targeted. Only re-run the subset of tests already suspected flaky to save time, widening occasionally to catch new ones.
  • Quarantine proposer. When a test crosses a high threshold, have it open a PR proposing a quarantine/retag — still human-approved.
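
The Targeted variation amounts to a small selection rule in the Plan step. A possible sketch, where `select_tests`, `run_index`, and the `widen_every` default are illustrative assumptions:

```python
def select_tests(all_tests, suspected, run_index, widen_every=5):
    """Pick the tests for this cycle.

    Every `widen_every`-th run covers the full suite so newly flaky tests
    are still caught; other runs re-run only the suspected subset.
    """
    # Ignore suspects that no longer exist in the suite.
    subset = set(suspected) & set(all_tests)
    if not subset or run_index % widen_every == 0:
        return sorted(all_tests)
    return sorted(subset)
```

For example, with `widen_every=5`, runs 1 to 4 re-run only the suspects and run 5 goes back to the full suite.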

Example run

Run 11/15 at SHA a1b2c3d. 0 real failures. 'test_websocket_reconnect' failed this run but passed runs 3, 5, and 7 on the same SHA -> flaky, fail rate 27%. Report updated.