andrewleech/flaky-tests-report.md

# MicroPython CI Flaky Test Report

Analysis of intermittent test failures in the unix port GitHub Actions
workflow on the master branch of micropython/micropython.

## Methodology

Job-level pass/fail data was collected via the GitHub API for all
`ports_unix.yml` workflow runs triggered by pushes to master. For the subset of
runs where GitHub still retains logs (roughly the last 90 days), the specific
failing test name was extracted from the `run-tests.py --print-failures` output.

Only the unix port workflow has test failures on master. All other workflows
(`ports_qemu.yml`, `ports_stm32.yml`, `ports_esp32.yml`, `ports_rp2.yml`, etc.)
pass consistently.

Data collected:

- 575 non-cancelled push-triggered runs from 2024-12-19 to 2026-02-12
- Job-level pass/fail status for all 20 jobs in each of 103 failed runs
- Test-level failure details from 20 runs with available log data
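
The collection step could be sketched roughly as follows. This is a minimal sketch and not the script used for this report: it assumes the standard GitHub REST endpoints for listing workflow runs and the jobs of a run, and the helper names (`get_json`, `list_push_runs`, `failed_jobs`, `tally_failures`) are hypothetical.

```python
import json
import urllib.request
from collections import Counter

# Assumed repository base URL for the GitHub REST API.
API = "https://api.github.com/repos/micropython/micropython"


def get_json(url, token=None):
    """Fetch one JSON document from the GitHub REST API."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def list_push_runs(token=None, max_pages=10):
    """Yield completed, non-cancelled push runs of ports_unix.yml on master."""
    for page in range(1, max_pages + 1):
        url = (
            f"{API}/actions/workflows/ports_unix.yml/runs"
            f"?branch=master&event=push&status=completed&per_page=100&page={page}"
        )
        runs = get_json(url, token)["workflow_runs"]
        if not runs:
            return
        for run in runs:
            if run["conclusion"] != "cancelled":
                yield run


def failed_jobs(run_id, token=None):
    """Return the names of the jobs that failed in one workflow run."""
    url = f"{API}/actions/runs/{run_id}/jobs?per_page=100"
    jobs = get_json(url, token)["jobs"]
    return [job["name"] for job in jobs if job["conclusion"] == "failure"]


def tally_failures(jobs_by_run):
    """Count failures per job name from a {run_id: [failed job names]} mapping."""
    counts = Counter()
    for names in jobs_by_run.values():
        counts.update(names)
    return counts
```

Only `tally_failures` is pure; the other helpers need network access and, for more than a handful of requests, a token to stay within API rate limits.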

## Overall Failure Rate

103 of 575 non-cancelled master push runs failed: 17.9% per-run failure rate.
87% of failed runs had exactly one job fail. The failures are distributed across
15 of the 20 jobs in the workflow, with no single job dominating.


| Failed jobs per run | Occurrences |
|---|---|
| 1 | 90 |
| 2 | 9 |
| 3 | 2 |
| 5 | 2 |
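
The headline figures can be re-derived from this distribution. The snippet below is a small arithmetic check using only the numbers quoted above; the variable names are illustrative.

```python
# Failed-jobs-per-run distribution as reported: {jobs failed in a run: runs}.
failed_jobs_per_run = {1: 90, 2: 9, 3: 2, 5: 2}
total_runs = 575

# Number of failed runs and number of individual job failures.
failed_runs = sum(failed_jobs_per_run.values())
job_failures = sum(n * count for n, count in failed_jobs_per_run.items())

run_failure_rate = failed_runs / total_runs
single_job_share = failed_jobs_per_run[1] / failed_runs

print(f"failed runs:       {failed_runs}")
print(f"job failures:      {job_failures}")
print(f"run failure rate:  {100 * run_failure_rate:.1f}%")
print(f"single-job share:  {100 * single_job_share:.1f}%")
```

The 124 individual job failures implied by the distribution match the sum of the per-job failure counts reported later in this document.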


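### Monthly Breakdown

| Month | Runs | Failures | Rate |
|---|---|---|---|
| 2024-12 | 10 | 3 | 30.0% |
| 2025-01 | 32 | 5 | 15.6% |
| 2025-02 | 41 | 8 | 19.5% |
| 2025-03 | 37 | 3 | 8.1% |
| 2025-04 | 40 | 4 | 10.0% |
| 2025-05 | 50 | 7 | 14.0% |
| 2025-06 | 48 | 5 | 10.4% |
| 2025-07 | 55 | 15 | 27.3% |
| 2025-08 | 48 | 15 | 31.2% |
| 2025-09 | 46 | 2 | 4.3% |
| 2025-10 | 49 | 8 | 16.3% |
| 2025-11 | 37 | 8 | 21.6% |
| 2025-12 | 28 | 5 | 17.9% |
| 2026-01 | 32 | 8 | 25.0% |
| 2026-02 | 22 | 7 | 31.8% |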


The July/August 2025 spike (27-31%) coincides with the GitHub Actions
macOS runner migration to macOS 15 (announced August 4 2025); the macOS job
alone accounts for 11 failures in August. The September 2025 dip (4.3%) has no
obvious explanation beyond normal variance.
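
The monthly rates and the overall totals can be reproduced from the run and failure counts. This is a minimal sketch over the figures in the monthly breakdown; the `rate` helper is hypothetical.

```python
# Monthly (runs, failures) as reported in the monthly breakdown.
monthly = {
    "2024-12": (10, 3), "2025-01": (32, 5), "2025-02": (41, 8),
    "2025-03": (37, 3), "2025-04": (40, 4), "2025-05": (50, 7),
    "2025-06": (48, 5), "2025-07": (55, 15), "2025-08": (48, 15),
    "2025-09": (46, 2), "2025-10": (49, 8), "2025-11": (37, 8),
    "2025-12": (28, 5), "2026-01": (32, 8), "2026-02": (22, 7),
}


def rate(runs, failures):
    """Failure rate in percent."""
    return 100 * failures / runs


for month, (runs, failures) in monthly.items():
    print(f"{month}  {runs:3d} runs  {failures:2d} failures  {rate(runs, failures):5.1f}%")

# Cross-check the totals against the overall figures (575 runs, 103 failures).
total_runs = sum(runs for runs, _ in monthly.values())
total_failures = sum(failures for _, failures in monthly.values())
worst = max(monthly, key=lambda m: rate(*monthly[m]))
best = min(monthly, key=lambda m: rate(*monthly[m]))
print(total_runs, total_failures, worst, best)
```

With only 10 to 55 runs per month, a difference of two or three failures moves a monthly rate by several percentage points, so month-to-month swings should be read with that sample size in mind.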

## Per-Job Failure Rates

Measured directly from 575 push runs. Each job executes once per workflow run.

| Job | Failures | Rate | Runner |
|---|---|---|---|
| settrace_stackless | 25 | 4.3% | ubuntu-latest |
| macos | 25 | 4.3% | macos-26 |
| qemu_mips | 11 | 1.9% | ubuntu-latest (QEMU) |
| qemu_arm | 9 | 1.6% | ubuntu-latest (QEMU) |
| qemu_riscv64 | 8 | 1.4% | ubuntu-latest (QEMU) |
| standard_v2 | 8 | 1.4% | ubuntu-latest |
| settrace | 7 | 1.2% | ubuntu-latest (removed from current workflow) |
| coverage | 7 | 1.2% | ubuntu-latest |
| sanitize_undefined | 6 | 1.0% | ubuntu-latest |
| float | 5 | 0.9% | ubuntu-latest |
| standard | 4 | 0.7% | ubuntu-latest |
| coverage_32bit | 3 | 0.5% | ubuntu-latest |
| nanbox | 2 | 0.3% | ubuntu-latest |
| float_clang | 2 | 0.3% | ubuntu-latest |
| longlong | 2 | 0.3% | ubuntu-latest |
| minimal | 0 | 0% | ubuntu-latest |
| reproducible | 0 | 0% | ubuntu-latest |
| gil_enabled | 0 | 0% | ubuntu-latest |
| stackless_clang | 0 | 0% | ubuntu-latest |
| repr_b | 0 | 0% | ubuntu-latest |
| sanitize_address | 0 | 0% | ubuntu-latest |


The product of per-job pass rates gives a predicted aggregate pass rate of
80.4%, close to the observed 82.1%. This is consistent with the individual job
failures being approximately independent events.
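
The independence check can be reproduced from the per-job failure counts. This is a minimal sketch using the counts reported above (jobs with zero failures contribute a factor of 1 and are omitted); the variable names are illustrative.

```python
import math

RUNS = 575
FAILED_RUNS = 103

# Per-job failure counts as reported (jobs with no failures omitted).
job_failures = {
    "settrace_stackless": 25, "macos": 25, "qemu_mips": 11, "qemu_arm": 9,
    "qemu_riscv64": 8, "standard_v2": 8, "settrace": 7, "coverage": 7,
    "sanitize_undefined": 6, "float": 5, "standard": 4, "coverage_32bit": 3,
    "nanbox": 2, "float_clang": 2, "longlong": 2,
}

# If job failures were independent, a run passes only when every job passes,
# so the predicted pass rate is the product of the per-job pass rates.
predicted_pass = math.prod(1 - f / RUNS for f in job_failures.values())
observed_pass = 1 - FAILED_RUNS / RUNS

print(f"predicted pass rate: {100 * predicted_pass:.1f}%")
print(f"observed pass rate:  {100 * observed_pass:.1f}%")
```

The observed pass rate sits slightly above the prediction, which is the direction expected when a few runs contain several failed jobs at once (13 of the 103 failed runs had more than one failed job).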

## Confirmed Flaky Tests

The following tests were directly observed failing on master in runs where log
data was available (20 runs, covering 2026-01-05 to 2026-02-13). Every failure
in every available log was caused by one of these six tests.

### thread/thread_gc1.py

| | |
|---|---|
| Observed failures | 9 (in 20 runs with logs) |
| Jobs affected | settrace_stackless (6), coverage (3) |
| Failure output | Expected `True`, got `False` |
| Dates observed | 2026-01-13, 2026-01-13, 2026-01-24, 2026-01-27, 2026-01-30, 2026-02-04, 2026-02-05, 2026-02-06, 2026-02-12 |
| Already excluded from | macos, qemu_mips, qemu_arm, qemu_riscv64 (in `tools/ci.sh`) |
| Not excluded from | settrace_stackless, coverage, standard, standard_v2, coverage_32bit, nanbox, longlong, float, float_clang, stackless_clang, gil_enabled, sanitize_address, sanitize_undefined, repr_b |


The test spawns threads that perform garbage collection and checks a boolean
result. The ci.sh file already contains comments acknowledging this test is
flaky and excludes it from 4 of 20 jobs.

### thread/stress_aes.py

| | |
|---|---|
| Observed failures | 7 (in 20 runs with logs) |
| Jobs affected | qemu_riscv64 (5), qemu_arm (2) |
| Failure output | Expected `done`, got `TIMEOUT` |
| Dates observed | 2026-01-14, 2026-01-20, 2026-01-23, 2026-01-26, 2026-01-30, 2026-01-31, 2026-02-03 |
| Already excluded from | none |
| Notes | `ci.sh` comments note this test "takes around 70/90/180 seconds" on QEMU ARM/MIPS/RISC-V but does not exclude it; timeouts are set to 90/180/200s respectively |


The test performs AES encryption across threads. Under QEMU emulation the
execution time approaches or exceeds the configured timeout.
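
To make the margin concrete, the headroom between the typical run time and the configured timeout can be computed per architecture. This is a small sketch using the figures quoted from the `ci.sh` comments above; pairing them as (typical seconds, timeout seconds) per architecture follows the "respectively" in the notes.

```python
# (typical run time in seconds, configured timeout in seconds), as quoted above.
qemu_timing = {
    "arm": (70, 90),
    "mips": (90, 180),
    "riscv64": (180, 200),
}

# Headroom is the factor by which a run can slow down before it times out.
headroom = {arch: timeout / typical for arch, (typical, timeout) in qemu_timing.items()}

for arch, factor in sorted(headroom.items(), key=lambda item: item[1]):
    print(f"{arch:8s} can run {100 * (factor - 1):.0f}% slower than typical before TIMEOUT")

tightest = min(headroom, key=headroom.get)
```

RISC-V has the least headroom (about 11%) and ARM the next least (about 29%), which lines up with the observed split of timeouts between qemu_riscv64 (5) and qemu_arm (2).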

### cmdline/repl_lock.py

| | |
|---|---|
| Observed failures | 3 (in 20 runs with logs) |
| Jobs affected | qemu_arm (2), qemu_riscv64 (1) |
| Failure output | Missing `>>>` prompt prefix on `micropython.heap_lock()` line |
| Dates observed | 2026-01-13, 2026-02-03, 2026-02-13 |
| Already excluded from | none |


The expected output shows `>>> micropython.heap_lock()` but the actual output
drops the `>>>` prefix. This is a REPL prompt timing issue under QEMU
emulation.

### extmod/time_time_ns.py

| | |
|---|---|
| Observed failures | 2 (in 20 runs with logs) |
| Jobs affected | float (1), longlong (1) |
| Failure output | One timing assertion returns `False` instead of `True` |
| Dates observed | 2026-01-05, 2026-02-04 |
| Already excluded from | none |


The test makes assertions about time.time_ns() precision. On shared CI
runners the wall clock can have insufficient precision or the process can be
descheduled between measurements.
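
The effect of clock granularity and scheduling gaps can be observed directly by sampling the clock in a tight loop. The sketch below is not the MicroPython test; it is a hypothetical probe (`clock_steps`) written for CPython that reports the smallest and largest step seen between consecutive `time.time_ns()` readings.

```python
import time


def clock_steps(samples=100_000):
    """Return (smallest positive step, largest step) in ns between readings.

    The smallest positive step approximates the usable clock resolution; the
    largest step shows how long the loop was stalled at worst, for example by
    the process being descheduled.
    """
    smallest = None
    largest = 0
    prev = time.time_ns()
    for _ in range(samples):
        now = time.time_ns()
        delta = now - prev
        if delta > 0 and (smallest is None or delta < smallest):
            smallest = delta
        if delta > largest:
            largest = delta
        prev = now
    return smallest, largest


smallest, largest = clock_steps()
print(f"smallest positive step: {smallest} ns, largest step: {largest} ns")
```

On a busy shared runner the largest step can be orders of magnitude above the smallest one, which is the kind of gap that can break an assertion expecting two nearby readings to differ by a bounded amount.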

### cmdline/repl_cont.py

| | |
|---|---|
| Observed failures | 1 (in 20 runs with logs) |
| Jobs affected | macos (1) |
| Failure output | Differences in quote escaping in REPL continuation prompts (e.g. `"'"` vs `'\''`) |
| Dates observed | 2026-01-27 |
| Already excluded from | none |


The expected REPL output differs from what macOS produces, with differences in
how escaped quotes and continuation lines are rendered. The macOS job already
excludes several other tests due to platform differences.

### thread/stress_schedule.py

| | |
|---|---|
| Observed failures | 1 (in 20 runs with logs) |
| Jobs affected | qemu_riscv64 (1) |
| Failure output | Expected `PASS`, got `CRASH` |
| Dates observed | 2026-02-05 |
| Already excluded from | none (but skipped on qemu_arm per `ci.sh`) |


The test exercises micropython.schedule() under thread stress. Under QEMU
RISC-V emulation it intermittently crashes.

## Existing Exclusions in tools/ci.sh

The following tests are already excluded from specific jobs, with `ci.sh`
comments giving the reason (flakiness in most cases):

| Test | Excluded from | Exclusion reason (from ci.sh comments) |
|---|---|---|
| `thread/thread_gc1.py` | macos, qemu_mips, qemu_arm, qemu_riscv64 | "is flaky" |
| `thread/stress_recurse.py` | qemu_mips, qemu_arm, qemu_riscv64 | "is flaky" |
| `thread/stress_heap.py` | macos | "is flaky" |
| `float_parse.py` | macos | "parse/print floats out by a few mantissa bits" |
| `float_parse_doubleprec.py` | macos | "parse/print floats out by a few mantissa bits" |
| `ffi_callback` | macos | "crashes for an unknown reason" |


## Estimated Failure Attribution

Note: This section combines the directly observed data above with
inference to attribute the 94 failed runs whose logs have expired (older than
~90 days). The reasoning is described for each estimate.

For runs without log data, the failing job is known but the specific test is
not. The estimates below attribute job failures to likely tests based on:

- 100% consistency in the 20 runs where both job and test are known
- The test exclusion patterns in `ci.sh` which restrict what can fail in each
  job
- Each job runs largely the same test suite, differing only in build
  configuration and platform

### Estimated per-test failure rates

| Test | Attributed failures | Executions per run | Total opportunities | Est. rate per execution |
|---|---|---|---|---|
| `thread/thread_gc1.py` | 62 | 8 (jobs with failures attributed to it) | 4,600 | ~1.3% |
| `thread/stress_aes.py` | 28 | 3 QEMU jobs | 1,725 | ~1.6% |
| `cmdline/repl_*.py` | 25 | 1 (macOS) | 575 | ~4.3% |
| `extmod/time_time_ns.py` | 7 | 2 jobs (float, longlong) | 1,150 | ~0.6% |


Reasoning for `thread/thread_gc1.py` estimate (62 failures): The 25
settrace_stackless failures, 8 standard_v2 failures, 7 coverage failures, 7
settrace failures, 6 sanitize_undefined failures, 4 standard failures, 3
coverage_32bit failures, and 2 nanbox failures are attributed to this test.
All of these jobs run `test_full` or `test_full_no_native` without excluding
`thread_gc1.py`. Of the 9 failures with log data from these jobs (6
settrace_stackless, 3 coverage), all 9 failed on `thread_gc1.py` and nothing
else.

Reasoning for `thread/stress_aes.py` estimate (28 failures): The 11
qemu_mips, 9 qemu_arm, and 8 qemu_riscv64 failures are attributed primarily to
this test. These jobs exclude `thread/thread_gc1.py` and
`thread/stress_recurse.py`, leaving `stress_aes.py` as the dominant remaining
flaky test. Of the 11 failures with log data from QEMU jobs, 7 were
`stress_aes.py`, 3 were `cmdline/repl_lock.py`, and 1 was
`thread/stress_schedule.py`, so 28 is best read as an upper bound for this test.
The QEMU MIPS logs are all expired so the exact split for that job is unknown.

Reasoning for `cmdline/repl_*.py` estimate (25 failures): All 25 macOS job
failures are attributed to REPL-related tests. The macOS job already excludes
`thread_gc1.py`, `stress_heap.py`, `float_parse*.py`, and `ffi_callback`. In
the 1 run with log data from the macOS job, the failure was
`cmdline/repl_cont.py`. The 11 macOS failures in August 2025 coincide with the
GitHub Actions macOS 15 runner migration.

Reasoning for `extmod/time_time_ns.py` estimate (7 failures): The 5 float
failures and 2 longlong failures are attributed to this test. In 2 runs with
log data from these jobs, both were `time_time_ns.py`. The float job runs a
reduced test set (basic `run-tests.py` without `test_full`), making timing tests
the most likely flaky candidate. The 2 float_clang failures may also be this
test but could have a different root cause.
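
The attributed totals and per-execution rates follow mechanically from the per-job failure counts once each job is mapped to a test. This is a minimal sketch of that arithmetic; the job-to-test mapping is the estimate described in the reasoning above, not an observed fact, and the variable names are illustrative.

```python
RUNS = 575

# Per-job failure counts as reported (jobs with no failures omitted).
job_failures = {
    "settrace_stackless": 25, "macos": 25, "qemu_mips": 11, "qemu_arm": 9,
    "qemu_riscv64": 8, "standard_v2": 8, "settrace": 7, "coverage": 7,
    "sanitize_undefined": 6, "float": 5, "standard": 4, "coverage_32bit": 3,
    "nanbox": 2, "float_clang": 2, "longlong": 2,
}

# Assumed attribution of whole jobs to a single likely test.
attribution = {
    "thread/thread_gc1.py": [
        "settrace_stackless", "standard_v2", "coverage", "settrace",
        "sanitize_undefined", "standard", "coverage_32bit", "nanbox",
    ],
    "thread/stress_aes.py": ["qemu_mips", "qemu_arm", "qemu_riscv64"],
    "cmdline/repl_*.py": ["macos"],
    "extmod/time_time_ns.py": ["float", "longlong"],
}

estimates = {}
for test, jobs in attribution.items():
    attributed = sum(job_failures[job] for job in jobs)
    opportunities = len(jobs) * RUNS  # one execution per job per run
    estimates[test] = (attributed, opportunities, attributed / opportunities)
    print(f"{test:24s} {attributed:3d} / {opportunities:5d} = {100 * attributed / opportunities:.1f}%")

# Jobs with failures that no estimate accounts for.
attributed_jobs = {job for jobs in attribution.values() for job in jobs}
leftover = {job: n for job, n in job_failures.items() if job not in attributed_jobs}
print("not attributed:", leftover)
```

Because each job is assigned wholesale to one test, these rates are upper bounds for the named test wherever the logs show other tests failing in the same jobs.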

### Unattributed failures

The 2 float_clang failures across 575 runs are not included in the estimates
above. No log data is available for them, so the root cause is unknown.