Backend full-suite test isolation flake: leaked async DB work corrupts one random test per run #226

Closed
opened 2026-05-27 23:04:48 -04:00 by owlburtoe · 1 comment
Owner

Symptom

pnpm test:backend:db (full backend suite) fails non-deterministically with a different victim test each run, while every individual test file passes in isolation.

Observed victims so far:

  • tier6-destructive-permissions.test.ts:101 → Parse Error: Expected HTTP/, RTSP/ or ICE/ (transport-level corruption; hardened to be a less likely victim in PR #225)
  • private-message-delivery.test.ts:218 → spurious 400 instead of 201

Both pass 10–14/14 in isolation; 50× isolated reruns of tier6 showed zero failures.

Root cause (diagnosed, not yet fixed)

Leaked async DB work / test-isolation failure somewhere in the suite. Evidence during full route-suite runs:

  • repeated pg warnings: Calling client.query() when the client is already executing a query
  • seed/setup races in schedules.test.ts (FK violation, duplicate PK during seedTestData())

A connection left mid-query by one file bleeds into a later file's request/transport, surfacing as a parse error or wrong status on an unrelated route. The victim is random because it depends on scheduling/teardown timing.
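
The bleed described above can be reproduced in miniature. The sketch below is an illustration only: `FakeClient`, `testA`, `testB`, and `runLeakDemo` are hypothetical names, and the fake client imitates a single shared connection with timers rather than using pg. It shows a fire-and-forget query from one test still being in flight when the next test issues its own query on the same client, which is the condition the pg warning reports.

```typescript
// Hypothetical stand-in for one shared DB connection. This is NOT the pg API;
// it only counts overlapping queries so the leak becomes observable.
class FakeClient {
  private inFlight = 0;
  readonly warnings: string[] = [];

  async query(label: string, durationMs: number): Promise<string> {
    if (this.inFlight > 0) {
      // Mirrors the situation behind pg's "already executing a query" warning.
      this.warnings.push(`${label}: client is already executing a query`);
    }
    this.inFlight += 1;
    await new Promise<void>((resolve) => setTimeout(resolve, durationMs));
    this.inFlight -= 1;
    return label;
  }
}

// "Test A" starts an audit write but never awaits it, so the test returns
// while the query is still running.
async function testA(client: FakeClient): Promise<void> {
  void client.query("testA audit write", 20);
}

// "Test B" is correct in isolation, yet it runs on a client that is still busy.
async function testB(client: FakeClient): Promise<void> {
  await client.query("testB select", 5);
}

async function runLeakDemo(): Promise<string[]> {
  const client = new FakeClient();
  await testA(client); // resolves immediately; the audit write is still in flight
  await testB(client); // overlaps with test A's leaked query
  return client.warnings;
}
```

Run on its own against a fresh client, `testB` records no warning, which matches the observation that every file passes in isolation. The warning only appears when a leaking test happens to run just before it.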

Why it's not any one victim's bug

The failing routes have no business logic on the asserted path (tier6 line 101 stops at requirePermission → plain 403 JSON; no DB write, no email). A transport/parser failure there points to suite state corruption upstream.

Recommended investigation

  1. Find the file(s) issuing unawaited client.query() (concurrent-query warnings are the lead — likely a beforeEach/afterEach or a fire-and-forget notification/audit path that outlives the test).
  2. Inspect seedTestData() for the FK/duplicate-PK race in schedules.test.ts.
  3. Consider per-file connection isolation or ensuring all pg pool work is awaited before teardown resolves.
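
One way to approach steps 1 and 3 is to route every fire-and-forget promise through a small registry and drain it in teardown. The sketch below is an assumption about how that could look, not code from the repository: `PendingWork`, `track`, `drain`, and `writeAuditLog` are hypothetical names, and the `afterEach` wiring shown in the comment assumes a Jest- or Vitest-style runner.

```typescript
// Hypothetical helper: remembers promises that callers do not await so that
// test teardown can wait for them before the next test starts.
class PendingWork {
  private readonly inFlight = new Set<Promise<void>>();

  // Number of tracked promises that have not settled yet.
  get size(): number {
    return this.inFlight.size;
  }

  // Register work and hand the original promise back to the caller.
  track<T>(work: Promise<T>): Promise<T> {
    // Ignore the outcome on this internal copy so drain() never rejects.
    // A real version would probably log rejections instead of dropping them.
    const settled: Promise<void> = work.then(
      () => undefined,
      () => undefined,
    );
    this.inFlight.add(settled);
    void settled.then(() => {
      this.inFlight.delete(settled);
    });
    return work;
  }

  // Resolve once everything tracked so far, plus anything tracked while
  // waiting, has settled.
  async drain(): Promise<void> {
    while (this.inFlight.size > 0) {
      await Promise.all(Array.from(this.inFlight));
    }
  }
}

const pendingWork = new PendingWork();

// Assumed wiring (names are illustrative):
//   application code:  void pendingWork.track(writeAuditLog(event));
//   test teardown:     afterEach(() => pendingWork.drain());
```

A nonzero `size` at the end of a test would also help with step 1, since it points at the file that started work it did not wait for.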

Scope note

Pre-existing; reproduces on main. Not introduced by the facility-address branch (PR #225). Tracked separately per that PR's decision.

Author
Owner

Hermes saw this issue and queued it for attention.

Labels detected: agent:hermes + status:ready.
Telegram notification sent by the Shiftd Hermes issue watchdog.

Next step: ask Hermes to work issue #226 when you want implementation to start.

Reference
owlburtoe/Shiftd#226