Frequentist Testing: Null Hypothesis Approach
For teams with statistical training, pytest-repeated supports null hypothesis testing with confidence intervals.
Overview
Instead of setting a simple threshold, you:
- Define a null hypothesis (H₀): "The code's true success rate is ≤ X%"
- Set a confidence level: "I want to be 95% confident when rejecting H₀"
- Run the test multiple times
- Reject H₀ if the lower bound of the confidence interval exceeds the null hypothesis proportion X
The test passes when we can confidently reject the null hypothesis (i.e., the observed runs give strong statistical evidence that the code performs better than the threshold).
Core Parameters
H0 or null
The null hypothesis proportion: the baseline success rate you want to exceed. H0 and null are aliases for the same parameter:
@pytest.mark.repeated(times=100, H0=0.8, ci=0.95) # Using 'H0'
@pytest.mark.repeated(times=100, null=0.8, ci=0.95) # Using 'null' - exactly the same
ci
Confidence level (0 to 1) for the confidence interval. Common values:
- 0.90: 90% confidence
- 0.95: 95% confidence (recommended)
- 0.99: 99% confidence (very strict)
How It Works
pytest-repeated uses the Wilson score interval to construct a confidence interval for the true success rate:
- Run the test times iterations
- Calculate observed success rate: passes / times
- Construct Wilson score confidence interval at level ci
- Check if the lower bound > H0 (null hypothesis)
Test passes if: lower_bound_of_CI > H0
Test fails if: lower_bound_of_CI ≤ H0
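The steps above can be sketched in plain Python. This is a minimal illustration, not pytest-repeated's actual implementation: the function names wilson_interval and rejects_null are hypothetical, and the sketch assumes the two-sided normal quantile (about 1.96 for ci=0.95), which is consistent with the 0.902 lower bound quoted in the LLM example in this guide.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes: int, times: int, ci: float = 0.95) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion (illustrative)."""
    if times <= 0:
        raise ValueError("times must be positive")
    # Assumption: two-sided critical value, e.g. ~1.96 for ci=0.95.
    z = NormalDist().inv_cdf(1 - (1 - ci) / 2)
    p_hat = passes / times  # observed success rate
    denom = 1 + z**2 / times
    center = (p_hat + z**2 / (2 * times)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / times + z**2 / (4 * times**2)) / denom
    return center - half_width, center + half_width


def rejects_null(passes: int, times: int, H0: float, ci: float = 0.95) -> bool:
    """Decision rule from the steps above: pass only if lower bound > H0."""
    lower, _ = wilson_interval(passes, times, ci)
    return lower > H0
```

Under these assumptions, rejects_null(96, 100, H0=0.90) is true while rejects_null(95, 100, H0=0.90) is false, so a single extra failed run can flip the outcome near the boundary.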
Examples
LLM Response Quality
import pytest
@pytest.mark.repeated(times=100, H0=0.90, ci=0.95)
def test_llm_provides_accurate_answer():
    """
    Null hypothesis: LLM accuracy ≤ 90%
    We reject H0 (test passes) if we're 95% confident accuracy > 90%
    """
    response = call_llm("What is the capital of France?")
    assert "Paris" in response
If 96 out of 100 runs pass, the two-sided 95% Wilson interval is approximately [0.902, 0.984]. Since 0.902 > 0.90, we reject H₀ and the test passes.
ML Model Performance
@pytest.mark.repeated(n=200, null=0.85, ci=0.99)
def test_model_exceeds_baseline():
    """
    Null hypothesis: Model accuracy ≤ 85%
    We need 99% confidence that accuracy > 85%
    """
    sample = get_random_validation_sample()
    prediction = model.predict(sample.features)
    assert prediction == sample.label
A/B Testing New Algorithm
@pytest.mark.repeated(times=500, H0=0.75, ci=0.95)
def test_new_algorithm_beats_old():
    """
    Null hypothesis: New algorithm success rate ≤ 75% (old algorithm's rate)
    Test passes if we're 95% confident new algorithm > 75%
    """
    # Fetch the case once so the result and the expected value refer to the same input
    case = get_test_case()
    result = new_algorithm(case)
    expected = ground_truth(case)
    assert result == expected
When to Use Frequentist Testing
✅ Best for:
- Teams with statistical/scientific background
- Need for formal hypothesis testing
- Quality assurance requiring statistical rigor
- Publishing results that need confidence intervals
- Regulatory or compliance requirements
❌ Consider alternatives when:
- Team lacks statistics training (Basic threshold is simpler)
- You have prior knowledge to incorporate (Bayesian)
- Simple pass/fail communication is sufficient
Choosing Parameters
Confidence Level (ci)
Higher confidence = stricter test = need more passes:
# 95% confidence (standard in most sciences)
@pytest.mark.repeated(times=100, H0=0.8, ci=0.95)
# 99% confidence (very strict, need strong evidence)
@pytest.mark.repeated(times=100, H0=0.8, ci=0.99)
# 90% confidence (less strict, easier to pass)
@pytest.mark.repeated(times=100, H0=0.8, ci=0.90)
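To make "need more passes" concrete, the sketch below searches for the smallest pass count whose Wilson lower bound exceeds H0. The helpers wilson_lower and min_passes are hypothetical names, and the two-sided critical value is an assumption about how the interval is built, so the plugin's own cut-offs may differ.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower(passes, times, ci):
    """Lower bound of the two-sided Wilson score interval (assumed form)."""
    z = NormalDist().inv_cdf(1 - (1 - ci) / 2)
    p_hat = passes / times
    denom = 1 + z**2 / times
    center = (p_hat + z**2 / (2 * times)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / times + z**2 / (4 * times**2)) / denom
    return center - half


def min_passes(times, H0, ci):
    """Smallest pass count with lower bound > H0, or None if even all passes fall short."""
    for passes in range(times + 1):
        if wilson_lower(passes, times, ci) > H0:
            return passes
    return None


# Required passes out of 100 runs for H0=0.8 at each confidence level
required = {ci: min_passes(100, 0.8, ci) for ci in (0.90, 0.95, 0.99)}
```

Under these assumptions, required maps 0.90 to 87, 0.95 to 88, and 0.99 to 91, so raising ci from 0.90 to 0.99 costs four additional passing runs out of 100.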
Sample Size (times)
More repetitions = narrower confidence interval = more precise:
# Small sample - wide CI, less precise
@pytest.mark.repeated(times=50, H0=0.85, ci=0.95)
# Medium sample - moderate CI
@pytest.mark.repeated(times=200, H0=0.85, ci=0.95)
# Large sample - narrow CI, very precise
@pytest.mark.repeated(times=1000, H0=0.85, ci=0.95)
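The effect of sample size can be illustrated by holding the observed success rate at 90% and comparing interval widths for the three sample sizes above. As before, this is a sketch under the assumption of a two-sided Wilson interval, and wilson_interval is a hypothetical helper rather than part of pytest-repeated.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes, times, ci=0.95):
    """Two-sided Wilson score interval (assumed form, for illustration only)."""
    z = NormalDist().inv_cdf(1 - (1 - ci) / 2)
    p_hat = passes / times
    denom = 1 + z**2 / times
    center = (p_hat + z**2 / (2 * times)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / times + z**2 / (4 * times**2)) / denom
    return center - half, center + half


# Same observed rate (90% of runs pass), increasing number of repetitions
intervals = {}
for times in (50, 200, 1000):
    passes = times * 9 // 10
    intervals[times] = wilson_interval(passes, times, ci=0.95)

widths = {times: upper - lower for times, (lower, upper) in intervals.items()}
```

With the same 90% observed rate, the interval narrows as times grows: at 50 runs the lower bound stays below H0=0.85, while at 1000 runs it sits clearly above it.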
Null Hypothesis (H0)
Set based on your minimum acceptable performance:
# High bar - code must be very reliable
@pytest.mark.repeated(times=100, H0=0.95, ci=0.95)
# Moderate bar - code should work most of the time
@pytest.mark.repeated(times=100, H0=0.80, ci=0.95)
# Low bar - code just needs to work more often than not
@pytest.mark.repeated(times=100, H0=0.60, ci=0.95)
Statistical Details
Wilson Score Interval
pytest-repeated uses the Wilson score interval rather than the normal approximation. The Wilson interval:
- ✅ Works well even with small sample sizes
- ✅ Never produces invalid intervals (outside [0,1])
- ✅ Provides better coverage than normal approximation
- ✅ Recommended by statisticians for proportion confidence intervals
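For reference, the standard Wilson score interval is written below, with \hat{p} = passes / times and n = times. Here z is the standard normal quantile for the chosen ci; that pytest-repeated uses the two-sided quantile z = \Phi^{-1}(1 - (1 - ci)/2) is an assumption, though it matches the 0.902 lower bound quoted in the LLM example.

```latex
\[
\frac{\hat{p} + \dfrac{z^{2}}{2n} \;\pm\; z\,\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{n} + \dfrac{z^{2}}{4n^{2}}}}
     {1 + \dfrac{z^{2}}{n}}
\]
```

Because the bounds are shrunk toward 1/2 and divided by 1 + z^2/n, they stay inside [0, 1] even when every run passes or every run fails.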
One-Sided Test
The test is one-sided - we only check if performance exceeds H₀, not whether it falls below some upper bound.
- If lower_CI_bound > H0: Reject H₀ → test passes
- If lower_CI_bound ≤ H0: Fail to reject H₀ → test fails
Interpretation
When the test passes:
"We are [ci×100]% confident that the true success rate exceeds [H0×100]%"
When the test fails:
"We cannot be [ci×100]% confident that the true success rate exceeds [H0×100]%"
Example with times=100, H0=0.90, ci=0.95:
- Pass: "We are 95% confident the true success rate exceeds 90%"
- Fail: "We cannot be 95% confident the success rate exceeds 90%"
Error Handling
Like all pytest-repeated modes, frequentist testing stops immediately on any exception other than AssertionError:
@pytest.mark.repeated(times=100, H0=0.85, ci=0.95)
def test_with_potential_bug():
    result = risky_function()  # Might raise ValueError
    assert result > threshold
# If ValueError occurs on run 15:
# - Test stops immediately
# - Test FAILS regardless of H0/CI
# - Statistical evaluation is not performed
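The behaviour described in these comments can be mimicked with a small loop. This is a hypothetical sketch of the semantics only (run_repeated is not a pytest-repeated function): an AssertionError counts as one failed run, while any other exception propagates and ends the loop before any statistics are computed.

```python
def run_repeated(test_fn, times):
    """Hypothetical runner: returns the number of passing runs.

    AssertionError marks a single failed run; every other exception is
    left uncaught, so it aborts the loop immediately.
    """
    passes = 0
    for _ in range(times):
        try:
            test_fn()
        except AssertionError:
            pass  # an ordinary failed run; keep iterating
        else:
            passes += 1
    return passes
```

Only the pass count returned after a full loop would feed into the confidence interval check; a ValueError on run 15 never reaches that point.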
Next Steps
- Bayesian Testing - Incorporate prior knowledge
- Basic Testing - Simpler threshold approach
- Parameters Reference - Full parameter details
- Decorator Placement - Using with other pytest markers