Analysis of intermittent test failures in the unix port GitHub Actions
workflow on the master branch of micropython/micropython.
Job-level pass/fail data was collected via the GitHub API for all
ports_unix.yml workflow runs triggered by pushes to master. For the subset of
runs where GitHub still retains logs (roughly the last 90 days), the specific
failing test name was extracted from the run-tests.py --print-failures output.
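The collection step can be sketched roughly as follows. This is a minimal illustration, not the script used for this analysis: it assumes the public GitHub REST endpoint for listing workflow runs and the `conclusion` field on each run object, and the helper names `fetch_push_runs` and `summarize` are hypothetical. Unauthenticated requests are rate limited, so a real collection would likely need a token and a loop over all pages.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for the repository's REST API.
API = "https://api.github.com/repos/micropython/micropython"


def fetch_push_runs(workflow="ports_unix.yml", branch="master", page=1):
    """Fetch one page of push-triggered workflow runs (performs a network request)."""
    query = urllib.parse.urlencode(
        {"branch": branch, "event": "push", "per_page": 100, "page": page}
    )
    url = f"{API}/actions/workflows/{workflow}/runs?{query}"
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["workflow_runs"]


def summarize(runs):
    """Return (non-cancelled runs, failed runs) for a list of run objects."""
    # Runs still in progress have conclusion None; skip those and cancelled runs.
    considered = [r for r in runs if r.get("conclusion") not in (None, "cancelled")]
    failed = [r for r in considered if r["conclusion"] == "failure"]
    return len(considered), len(failed)
```

Calling `summarize(fetch_push_runs())` would give the counts for the most recent page of runs only; the totals reported here cover the full date range.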
Only the unix port workflow has test failures on master. All other workflows
(ports_qemu.yml, ports_stm32.yml, ports_esp32.yml, ports_rp2.yml, etc.)
pass consistently.
Data collected:
- 575 non-cancelled push-triggered runs from 2024-12-19 to 2026-02-12
- Job-level pass/fail status for all 20 jobs in each of 103 failed runs
- Test-level failure details from 20 runs with available log data
103 of 575 non-cancelled master push runs failed: 17.9% per-run failure rate.
87% of failed runs (90 of 103) had exactly one failed job. The failures are spread across 15 of the 20 jobs in the workflow, and no single job dominates: the two most affected jobs each account for 25 of the 124 job failures.
| Failed jobs per run | Occurrences |
|---|---|
| 1 | 90 |
| 2 | 9 |
| 3 | 2 |
| 5 | 2 |

| Month | Runs | Failures | Rate |
|---|---|---|---|
| 2024-12 | 10 | 3 | 30.0% |
| 2025-01 | 32 | 5 | 15.6% |
| 2025-02 | 41 | 8 | 19.5% |
| 2025-03 | 37 | 3 | 8.1% |
| 2025-04 | 40 | 4 | 10.0% |
| 2025-05 | 50 | 7 | 14.0% |
| 2025-06 | 48 | 5 | 10.4% |
| 2025-07 | 55 | 15 | 27.3% |
| 2025-08 | 48 | 15 | 31.2% |
| 2025-09 | 46 | 2 | 4.3% |
| 2025-10 | 49 | 8 | 16.3% |
| 2025-11 | 37 | 8 | 21.6% |
| 2025-12 | 28 | 5 | 17.9% |
| 2026-01 | 32 | 8 | 25.0% |
| 2026-02 | 22 | 7 | 31.8% |
The July/August 2025 spike (27-31%) correlates with the GitHub Actions macOS runner migration to macOS 15 (announced August 4 2025), which produced 11 macOS job failures in August alone. The September 2025 dip (4.3%) has no obvious explanation beyond normal variance.
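One way to put a number on "normal variance" is a binomial tail probability: if every run failed independently at the overall 17.9% rate, how likely is a month with 46 runs and at most 2 failures? The sketch below assumes independence between runs and a constant failure rate, neither of which is established by the data, so the result is only a rough guide. The function name `binom_cdf` is introduced here for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Overall per-run failure rate from the totals above.
overall_rate = 103 / 575

# September 2025: 46 runs, 2 failures.
september_tail = binom_cdf(2, 46, overall_rate)
print(f"P(at most 2 failures in 46 runs) = {september_tail:.4f}")
```

Because 15 months are being compared, a low probability for a single month should be read with that multiplicity in mind.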
Per-job failure rates were measured directly from the 575 push runs. Each job executes once per workflow run.
| Job | Failures | Rate | Runner |
|---|---|---|---|
| settrace_stackless | 25 | 4.3% | ubuntu-latest |
| macos | 25 | 4.3% | macos-26 |
| qemu_mips | 11 | 1.9% | ubuntu-latest (QEMU) |
| qemu_arm | 9 | 1.6% | ubuntu-latest (QEMU) |
| qemu_riscv64 | 8 | 1.4% | ubuntu-latest (QEMU) |
| standard_v2 | 8 | 1.4% | ubuntu-latest |
| settrace | 7 | 1.2% | ubuntu-latest (removed from current workflow) |
| coverage | 7 | 1.2% | ubuntu-latest |
| sanitize_undefined | 6 | 1.0% | ubuntu-latest |
| float | 5 | 0.9% | ubuntu-latest |
| standard | 4 | 0.7% | ubuntu-latest |
| coverage_32bit | 3 | 0.5% | ubuntu-latest |
| nanbox | 2 | 0.3% | ubuntu-latest |
| float_clang | 2 | 0.3% | ubuntu-latest |
| longlong | 2 | 0.3% | ubuntu-latest |
| minimal | 0 | 0% | ubuntu-latest |
| reproducible | 0 | 0% | ubuntu-latest |
| gil_enabled | 0 | 0% | ubuntu-latest |
| stackless_clang | 0 | 0% | ubuntu-latest |
| repr_b | 0 | 0% | ubuntu-latest |
| sanitize_address | 0 | 0% | ubuntu-latest |
The product of the per-job pass rates gives an aggregate predicted pass rate of 80.4%, close to the observed 82.1%. This is consistent with the individual job failures being approximately independent events, although it does not prove independence.
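The independence check can be reproduced from the per-job table. The sketch below multiplies the per-job pass rates and compares the result with the observed per-run pass rate; the variable names are illustrative, and the failure counts are copied from the table above.

```python
from math import prod

RUNS = 575

# Failures per job, from the per-job table (jobs with zero failures omitted,
# since they contribute a factor of 1 to the product).
job_failures = {
    "settrace_stackless": 25, "macos": 25, "qemu_mips": 11, "qemu_arm": 9,
    "qemu_riscv64": 8, "standard_v2": 8, "settrace": 7, "coverage": 7,
    "sanitize_undefined": 6, "float": 5, "standard": 4, "coverage_32bit": 3,
    "nanbox": 2, "float_clang": 2, "longlong": 2,
}

# Pass rate predicted if every job failed independently of the others.
predicted_pass = prod(1 - f / RUNS for f in job_failures.values())

# Pass rate actually observed: 103 of 575 runs failed.
observed_pass = 1 - 103 / RUNS

print(f"predicted {predicted_pass:.1%}, observed {observed_pass:.1%}")
```

The observed pass rate being slightly above the prediction is what one would expect if a few failures cluster in the same run, as in the runs with 2 to 5 failed jobs.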
The following tests were directly observed failing on master in runs where log data was available (20 runs, covering 2026-01-05 to 2026-02-13). Every failure in every available log was caused by one of these six tests.
| Test | thread/thread_gc1.py |
|---|---|
| Observed failures | 9 (in 20 runs with logs) |
| Jobs affected | settrace_stackless (6), coverage (3) |
| Failure output | Expected True, got False |
| Dates observed | 2026-01-13, 2026-01-13, 2026-01-24, 2026-01-27, 2026-01-30, 2026-02-04, 2026-02-05, 2026-02-06, 2026-02-12 |
| Already excluded from | macos, qemu_mips, qemu_arm, qemu_riscv64 (in tools/ci.sh) |
| Not excluded from | settrace_stackless, coverage, standard, standard_v2, coverage_32bit, nanbox, longlong, float, float_clang, stackless_clang, gil_enabled, sanitize_address, sanitize_undefined, repr_b |
The test spawns threads that perform garbage collection and checks a boolean
result. The ci.sh file already contains comments acknowledging this test is
flaky and excludes it from 4 of 20 jobs.
| Test | thread/stress_aes.py |
|---|---|
| Observed failures | 7 (in 20 runs with logs) |
| Jobs affected | qemu_riscv64 (5), qemu_arm (2) |
| Failure output | Expected done, got TIMEOUT |
| Dates observed | 2026-01-14, 2026-01-20, 2026-01-23, 2026-01-26, 2026-01-30, 2026-01-31, 2026-02-03 |
| Already excluded from | none |
| Notes | ci.sh comments note this test "takes around 70/90/180 seconds" on QEMU ARM/MIPS/RISC-V but does not exclude it; timeouts are set to 90/180/200s respectively |
The test performs AES encryption across threads. Under QEMU emulation the execution time approaches or exceeds the configured timeout.
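To make the timeout margin concrete, the sketch below computes the headroom between the typical runtime and the configured timeout for each QEMU job. It assumes the runtimes and timeouts quoted in the Notes row pair up in the order given (ARM, MIPS, RISC-V); the dictionary and function names are illustrative only.

```python
# (typical runtime in seconds, configured timeout in seconds) per QEMU job,
# taken from the figures quoted in the Notes row above.
stress_aes_timing = {
    "qemu_arm": (70, 90),
    "qemu_mips": (90, 180),
    "qemu_riscv64": (180, 200),
}


def headroom(typical, timeout):
    """Fraction by which the test can run slower than typical before timing out."""
    return timeout / typical - 1


# Jobs ordered from least to most headroom.
by_headroom = sorted(stress_aes_timing, key=lambda job: headroom(*stress_aes_timing[job]))

for job in by_headroom:
    print(f"{job}: {headroom(*stress_aes_timing[job]):.0%} headroom")
```

Under these figures the RISC-V job has the least margin and the ARM job the next least, which matches the ordering of the observed timeouts (5 on qemu_riscv64, 2 on qemu_arm).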
| Test | cmdline/repl_lock.py |
|---|---|
| Observed failures | 3 (in 20 runs with logs) |
| Jobs affected | qemu_arm (2), qemu_riscv64 (1) |
| Failure output | Missing >>> prompt prefix on micropython.heap_lock() line |
| Dates observed | 2026-01-13, 2026-02-03, 2026-02-13 |
| Already excluded from | none |
The expected output shows >>> micropython.heap_lock() but the actual output
drops the >>> prefix. This is a REPL prompt timing issue under QEMU
emulation.
| Test | extmod/time_time_ns.py |
|---|---|
| Observed failures | 2 (in 20 runs with logs) |
| Jobs affected | float (1), longlong (1) |
| Failure output | One timing assertion returns False instead of True |
| Dates observed | 2026-01-05, 2026-02-04 |
| Already excluded from | none |
The test makes assertions about time.time_ns() precision. On shared CI
runners the wall clock can have insufficient precision or the process can be
descheduled between measurements.
| Test | cmdline/repl_cont.py |
|---|---|
| Observed failures | 1 (in 20 runs with logs) |
| Jobs affected | macos (1) |
| Failure output | Differences in quote escaping in REPL continuation prompts (e.g. "'" vs '\'') |
| Dates observed | 2026-01-27 |
| Already excluded from | none |
The expected REPL output differs from what macOS produces, with differences in how escaped quotes and continuation lines are rendered. The macOS job already excludes several other tests due to platform differences.
| Test | thread/stress_schedule.py |
|---|---|
| Observed failures | 1 (in 20 runs with logs) |
| Jobs affected | qemu_riscv64 (1) |
| Failure output | Expected PASS, got CRASH |
| Dates observed | 2026-02-05 |
| Already excluded from | none (but skipped on qemu_arm per ci.sh) |
The test exercises micropython.schedule() under thread stress. Under QEMU
RISC-V emulation it intermittently crashes.
The following tests are already excluded from specific jobs with comments marking them as flaky:
| Test | Excluded from | Exclusion reason (from ci.sh comments) |
|---|---|---|
| thread/thread_gc1.py | macos, qemu_mips, qemu_arm, qemu_riscv64 | "is flaky" |
| thread/stress_recurse.py | qemu_mips, qemu_arm, qemu_riscv64 | "is flaky" |
| thread/stress_heap.py | macos | "is flaky" |
| float_parse.py | macos | "parse/print floats out by a few mantissa bits" |
| float_parse_doubleprec.py | macos | "parse/print floats out by a few mantissa bits" |
| ffi_callback | macos | "crashes for an unknown reason" |
Note: This section combines the directly observed data above with inference to attribute the 94 failed runs whose logs have expired (older than ~90 days). The reasoning is described for each estimate.
For runs without log data, the failing job is known but the specific test is not. The estimates below attribute job failures to likely tests based on:
- 100% consistency in the 20 runs where both job and test are known
- The test exclusion patterns in ci.sh, which restrict what can fail in each job
- Each job runs largely the same test suite, differing only in build configuration and platform
| Test | Attributed failures | Executions per run | Total opportunities | Est. rate per execution |
|---|---|---|---|---|
| thread/thread_gc1.py | 62 | 8 jobs that don't exclude it | 4,600 | ~1.3% |
| thread/stress_aes.py | 28 | 3 QEMU jobs | 1,725 | ~1.6% |
| cmdline/repl_*.py | 25 | 1 (macOS) | 575 | ~4.3% |
| extmod/time_time_ns.py | 7 | 2 jobs (float, longlong) | 1,150 | ~0.6% |
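The arithmetic behind these estimates can be sketched as follows. The job-to-test mapping is the attribution described in the reasoning paragraphs that follow, the failure counts come from the per-job table, and the names `attribution` and `estimate` are illustrative. The rates are only as good as the attribution itself.

```python
RUNS = 575

# Failures per job, from the per-job table.
job_failures = {
    "settrace_stackless": 25, "macos": 25, "qemu_mips": 11, "qemu_arm": 9,
    "qemu_riscv64": 8, "standard_v2": 8, "settrace": 7, "coverage": 7,
    "sanitize_undefined": 6, "float": 5, "standard": 4, "coverage_32bit": 3,
    "nanbox": 2, "float_clang": 2, "longlong": 2,
}

# Which jobs' failures are attributed to which test (an inference, not observed data).
attribution = {
    "thread/thread_gc1.py": [
        "settrace_stackless", "standard_v2", "coverage", "settrace",
        "sanitize_undefined", "standard", "coverage_32bit", "nanbox",
    ],
    "thread/stress_aes.py": ["qemu_mips", "qemu_arm", "qemu_riscv64"],
    "cmdline/repl_*.py": ["macos"],
    "extmod/time_time_ns.py": ["float", "longlong"],
}


def estimate(jobs):
    """Return (attributed failures, total opportunities, rate per execution)."""
    attributed = sum(job_failures[job] for job in jobs)
    opportunities = len(jobs) * RUNS  # each job runs once per workflow run
    return attributed, opportunities, attributed / opportunities


estimates = {test: estimate(jobs) for test, jobs in attribution.items()}
for test, (attributed, opportunities, rate) in estimates.items():
    print(f"{test}: {attributed}/{opportunities} = {rate:.1%}")

# Job failures not covered by any attribution above.
unattributed = sum(job_failures.values()) - sum(a for a, _, _ in estimates.values())
print(f"unattributed job failures: {unattributed}")
```

The two job failures left unattributed by this mapping are the float_clang failures.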
Reasoning for thread/thread_gc1.py estimate (62 failures): The 25
settrace_stackless failures, 8 standard_v2 failures, 7 coverage failures, 7
settrace failures, 6 sanitize_undefined failures, 4 standard failures, 3
coverage_32bit failures, and 2 nanbox failures are attributed to this test.
All of these jobs run test_full or test_full_no_native without excluding
thread_gc1.py. In the 9 job failures with log data from these jobs, 100% (9/9)
failed on thread_gc1.py and nothing else.
Reasoning for thread/stress_aes.py estimate (28 failures): The 11
qemu_mips, 9 qemu_arm, and 8 qemu_riscv64 failures are attributed primarily to
this test. These jobs exclude thread_gc1.py and stress_recurse.py,
leaving stress_aes.py as the dominant remaining flaky test. In 11 runs with
log data from QEMU jobs, 7 were stress_aes.py, 3 were cmdline/repl_lock.py,
and 1 was thread/stress_schedule.py. The QEMU MIPS logs are all expired so
the exact split for that job is unknown.
Reasoning for cmdline/repl_*.py estimate (25 failures): All 25 macOS job
failures are attributed to REPL-related tests. The macOS job already excludes
thread_gc1.py, stress_heap.py, float_parse*.py, and ffi_callback. In
the 1 run with log data from the macOS job, the failure was
cmdline/repl_cont.py. The 11 macOS failures in August 2025 coincide with the
GitHub Actions macOS 15 runner migration.
Reasoning for extmod/time_time_ns.py estimate (7 failures): The 5 float
failures and 2 longlong failures are attributed to this test. In 2 runs with
log data from these jobs, both were time_time_ns.py. The float job runs a
reduced test set (basic run-tests.py without test_full) making timing tests
the most likely flaky candidate; the 2 float_clang failures may also be this
test but could be a different root cause.
The stackless_clang job has 1 failure across 575 runs, with no log data
available. The root cause is unknown.