What it is
Some CI jobs fail for reasons unrelated to a PR’s code change, such as flaky tests or a CI runner disconnecting. These failures usually clear when the CI job is rerun. If a second PR that depends on the first passes, the first PR was very likely good and hit a transient failure. Trunk Merge Queue can combine Optimistic Merging and Pending Failure Depth to merge pull requests that would otherwise be rejected from the queue. The example below shows this anti-flake protection in action:

| what’s happening? | queue |
|---|---|
| A, B, C begin predictive testing | main <- A <- B+a <- C+ba |
| B fails testing | main <- A <- B+a <- C+ba |
| pending failure depth keeps B from being evicted while C tests | main <- A <- B+a (hold) <- C+ba |
| C passes | main <- A <- B+a <- C+ba |
| optimistic merging allows A, B, C to merge | merge A B C |
Optimistic Merging only works when Pending Failure Depth is set to a value greater than zero. When it is zero or disabled, Merge will not hold any failed PRs in the queue.
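The interaction between the two settings can be sketched as a small simulation. This is a simplified model, not Trunk's implementation: `QueuedPR` and `resolve_queue` are hypothetical names, the depth is assumed to count how many PRs behind a failure may finish testing before the failed PR is evicted, and the model simply stops at the first failure that nothing rescues (a real queue would retest the PRs behind it).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuedPR:
    name: str
    passed: bool  # result of this PR's predictive test run (includes all PRs ahead of it)


def resolve_queue(
    queue: List[QueuedPR],
    optimistic_merging: bool,
    pending_failure_depth: int,
) -> List[str]:
    """Return the names of the PRs that merge, in order, under this simplified model."""
    merged: List[str] = []
    i = 0
    while i < len(queue):
        pr = queue[i]
        if pr.passed:
            merged.append(pr.name)
            i += 1
            continue

        # The PR failed. Without both settings, nothing holds it in the queue.
        if not optimistic_merging or pending_failure_depth <= 0:
            break

        # Hold the failed PR while up to `pending_failure_depth` PRs behind it test.
        window = queue[i + 1 : i + 1 + pending_failure_depth]
        rescuer = next((k for k, later in enumerate(window) if later.passed), None)
        if rescuer is None:
            break  # nothing behind it passed, so the failure stands

        # A later PR that contains this PR's changes passed, so the failure is
        # treated as transient and everything up to that PR merges.
        end = i + 1 + rescuer
        merged.extend(p.name for p in queue[i : end + 1])
        i = end + 1
    return merged


# The scenario from the table above: B fails, C (which includes B) passes.
example = [QueuedPR("A", True), QueuedPR("B", False), QueuedPR("C", True)]
print(resolve_queue(example, optimistic_merging=True, pending_failure_depth=1))   # ['A', 'B', 'C']
print(resolve_queue(example, optimistic_merging=True, pending_failure_depth=0))   # ['A']
```

With the depth at zero, B is not held, so only A merges in this model, which matches the note above.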
Why use it
- Eliminate false negatives - Flaky tests frequently cause PR failures unrelated to actual code changes. Anti-flake protection helps get these under control, so developers don’t waste time investigating non-issues.
- Maintain developer confidence - When the queue rejects PRs only for real reasons (not flaky tests), developers trust the system and are less likely to dismiss real failures as “probably just flaky.”
- Reduce manual retries - Developers don’t need to manually resubmit PRs or click “retry” when tests flake. Trunk handles it automatically, saving time and frustration.
- Keep queue moving - Flaky tests don’t stall the queue. PRs that would have been blocked by transient failures merge successfully, increasing overall throughput.
How to enable
Anti-flake protection is active when Optimistic Merge Queue is On and Pending Failure Depth is set to a value greater than zero.
Tradeoffs and considerations
What you gain
- 80-90% reduction in flaky test blocks - Most flaky failures are caught and handled automatically
- Developer time saved - No manual retries or investigation of flaky failures
- Higher queue throughput - Flaky tests don’t stall the queue
- Better developer experience - Less frustration with non-deterministic failures
What you give up or risk
- Increased CI cost - Retrying tests costs additional CI resources (typically 10-20% increase)
- Slightly longer merge times - PRs that fail then retry take longer than PRs that pass first time
- Potential false positives - Occasionally a legitimate failure might be retried (though Trunk is conservative)
- Masks underlying problems - Flaky tests indicate test quality issues; retrying treats the symptom, not the cause
When NOT to use anti-flake protection
Don’t enable anti-flake protection if:
- Your tests are not flaky (< 2% flake rate) - No benefit, only cost
- CI resources are extremely limited - Retries double test costs for flaky PRs
- You’re actively fixing flaky tests - Better to fix than to mask
- Flaky tests indicate real issues - Sometimes “flaky” failures reveal race conditions or timing issues in your code
When to use anti-flake protection
Do enable anti-flake protection when:
- Flaky tests are blocking PRs (5-15% flake rate) - Clear benefit outweighs cost
- Fixing flaky tests will take time - Use this as an interim solution while improving test quality
- Infrastructure flakiness - Network timeouts, resource contention you can’t control
- Third-party dependencies are flaky - External APIs or services cause transient failures
The right long-term solution
The right approach:
- Enable anti-flake protection - Unblock your team immediately
- Identify flaky tests - Use CI analytics to find which tests flake most
- Fix the root causes - Make tests deterministic, add retries at test level, improve infrastructure
- Reduce flake rate over time - Goal should be < 2% flake rate
- Consider disabling - Once tests are stable, anti-flake protection becomes unnecessary
Signs that the tests themselves need fixing:
- Flake rate > 20% (your tests are broken)
- Same tests flake repeatedly (specific tests need fixing)
- All flakes are in one area (infrastructure or test framework issue)
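The "identify flaky tests" step can be sketched as a small script over CI run history. It assumes a common working definition that this page does not state: a test is flaky on a commit if it both passed and failed on that same commit. The record shape and the names `flake_rates`, `rank_flakiest`, and `assess` are hypothetical, the sample data is made up for illustration, and applying this page's 2% and 20% thresholds to individual tests is an illustrative choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# One CI result: (test name, commit SHA, passed). Hypothetical record shape.
Run = Tuple[str, str, bool]


def flake_rates(runs: Iterable[Run]) -> Dict[str, float]:
    """Per test, the fraction of commits on which it both passed and failed."""
    outcomes: Dict[str, Dict[str, Set[bool]]] = defaultdict(lambda: defaultdict(set))
    for test, sha, passed in runs:
        outcomes[test][sha].add(passed)

    rates: Dict[str, float] = {}
    for test, by_sha in outcomes.items():
        flaky_commits = sum(1 for results in by_sha.values() if results == {True, False})
        rates[test] = flaky_commits / len(by_sha)
    return rates


def rank_flakiest(rates: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort tests from most to least flaky."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def assess(rate: float) -> str:
    """Label a flake rate using the thresholds mentioned on this page."""
    if rate < 0.02:
        return "meets goal"
    if rate > 0.20:
        return "broken, fix first"
    return "flaky, worth fixing"


# Made-up sample history, for illustration only.
sample_runs: List[Run] = [
    ("test_login", "abc123", False),
    ("test_login", "abc123", True),   # same commit, different result: flaky
    ("test_login", "def456", True),
    ("test_cart", "abc123", True),
    ("test_cart", "def456", True),
]

for name, rate in rank_flakiest(flake_rates(sample_runs)):
    print(f"{name}: {rate:.0%} ({assess(rate)})")
```

Ranking by rate surfaces the tests that flake repeatedly, which are the ones most worth fixing first.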
Common misconceptions
- Misconception: “Anti-flake protection lets me ignore flaky tests”
- Reality: NO! This is a temporary solution. Flaky tests are a code/test quality problem that must be fixed. Anti-flake protection buys you time to fix them properly.
- Misconception: “It retries all failures automatically”
- Reality: Trunk is selective. Only failures that match flaky patterns are retried. Legitimate failures still block PRs immediately.
- Misconception: “Anti-flake protection wastes tons of CI resources”
- Reality: Typical cost increase is 10-20% for teams with moderate flake rates. This is far less than the developer time wasted investigating flaky failures.
- Misconception: “I should set retry limit to 10 to catch all flakes”
- Reality: If you need 10 retries, your tests are catastrophically broken. Fix the tests! Retry limit should be 1-3 max.