Local backend DB suite cascades to ~540 failures — postgres-test cleanup is fragile between files #234


Closed

opened 2026-05-28 09:12:56 -04:00 by owlburtoe · 1 comment

owlburtoe commented

2026-05-28 09:12:56 -04:00

Owner

Summary

Running the full backend integration suite locally via pnpm test:backend:db produces a ~540-failure cascade across most files. This is environmental / test-harness instability, not an application regression — it reproduces on main at equivalent magnitude.

This makes local backend integration testing effectively unusable on the affected machine, so flakes like #226 cannot be locally re-verified end-to-end. CI does not run this vitest suite (the release-artifacts job only does schema-push, type-check, build, runtime smoke, migration ordering; e2e.yml runs Playwright), so there is currently no automated coverage of the backend integration suite either.

Evidence (controlled comparison)

| | `main` | feature branch (control) |
|---|---|---|
| Tests failed | 540 | 539 |
| Tests passed | 879 | 894 |
| `FATAL: the database system is shutting down` (57P03) | 1 | 0 |
| `truncateAll()` TRUNCATE failures | 28 | 26 |

Equivalent failure magnitude on both branches → not branch-specific.

Root cause (observed)

The cascade is a cleanup-isolation failure, not a logic failure:

1. The first victim file (consistently `facilityController.test.ts`) fails because its `beforeEach` `truncateAll()` does not reliably clear the DB.
2. Stale rows survive into the next file's `seedTestData()`, which then collides:
   - `duplicate key value violates unique constraint "users_pkey"` / `facilities_slug_unique`
   - `insert or update violates foreign key constraint ..._department_id_departments_id_fk` (department parent missing)
3. Every subsequent file inherits the dirty/half-cleared state → ~540 cascading failures.

The trigger of the initial cleanup failure is non-deterministic:

- One `main` run logged `FATAL: 57P03 the database system is shutting down` (`ProcessStartupPacket`): the `postgres-test` container restarted mid-run (likely Colima/Docker memory pressure).
- The control branch run logged no shutdown but several `terminating connection` / `Connection terminated` errors.

Either way, the `postgres-test` container or its connection drops partway through the run, and `truncateAll()` has no guard for being called against a dropped or restarting server.

Suggested mitigations (to scope)

- Make the harness resilient to a mid-run container drop: detect `postgres-test` health between files, fail fast with a clear message instead of cascading.
- Harden `truncateAll()` so a failed truncate aborts the run (or retries) rather than letting the next `seedTestData()` collide.
- Investigate Colima/Docker resource limits for `postgres-test` (bump memory; the 57P03 shutdown smells like OOM/restart).
- Consider running the backend integration suite in CI so it has a stable, reproducible baseline (currently uncovered).
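
The first two mitigations could look roughly like the sketch below. It is illustrative only and is not the project's actual harness code: `Queryable`, `HarnessAbortError`, `assertDbReachable`, and `truncateAllOrAbort` are hypothetical names, and the table list, retry count, and delay are placeholders. The idea is that a reachability probe or a failed truncate surfaces as one clearly named error instead of letting the next `seedTestData()` run against stale rows.

```typescript
// Hypothetical minimal client shape; a real harness would likely pass its own pool or client.
interface Queryable {
  query(sql: string): Promise<unknown>;
}

// Single, recognizable error type so the runner can stop on it (assumed name).
class HarnessAbortError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "HarnessAbortError";
  }
}

const describeError = (err: unknown): string =>
  err instanceof Error ? err.message : String(err);

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Cheap probe to run before each test file: fails fast if the server is gone.
async function assertDbReachable(db: Queryable): Promise<void> {
  try {
    await db.query("SELECT 1");
  } catch (err) {
    throw new HarnessAbortError(`postgres-test is unreachable: ${describeError(err)}`);
  }
}

// Truncate with a bounded number of retries; throws instead of silently
// leaving rows behind for the next seed step.
async function truncateAllOrAbort(
  db: Queryable,
  tables: string[],
  retries = 2,
  delayMs = 250,
): Promise<void> {
  if (tables.length === 0) return;
  const quoted = tables.map((t) => `"${t}"`).join(", ");
  const sql = `TRUNCATE TABLE ${quoted} RESTART IDENTITY CASCADE`;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      await db.query(sql);
      return;
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(delayMs);
    }
  }
  throw new HarnessAbortError(
    `truncate failed after ${retries + 1} attempts: ${describeError(lastError)}`,
  );
}
```

In a setup file, `assertDbReachable` would presumably run once per file and `truncateAllOrAbort` in `beforeEach`; whether the run then stops on the first `HarnessAbortError` depends on how the test runner is configured.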

Notes

Not a blocker for #226's fix (PR #230), whose targeted client.query() while executing warning was verifiably eliminated (1 → 0 vs main). This issue is purely about local-suite reliability.


owlburtoe referenced this issue from a pull request that will close it,

2026-05-28 15:21:18 -04:00

test: harden backend DB suite harness #241

owlburtoe commented

2026-05-28 15:21:50 -04:00

Author

Owner

Could not reproduce this on current main from a fresh worktree: pnpm test:backend:db passed twice locally, including after the hardening changes (133 files / 1472 tests passed).

I still opened PR #241 to harden the harness against the reported failure mode: fail-fast wrapper mode, per-file DB reachability checks, and clearer non-secret reset errors so a dropped postgres-test connection stops as one actionable failure instead of cascading into duplicate-key/FK noise.
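
As a rough illustration of what a "non-secret reset error" might involve (this is not taken from PR #241; `redactConnectionString` and `formatResetError` are hypothetical names), the connection string can be stripped of its password before it is placed in a failure message:

```typescript
// Mask the password component of a connection URL so it is safe to log.
// Falls back to a fixed placeholder if the string cannot be parsed as a URL.
function redactConnectionString(raw: string): string {
  try {
    const url = new URL(raw);
    if (url.password) url.password = "***";
    return url.toString();
  } catch {
    return "<unparseable connection string>";
  }
}

// Build one actionable message for a failed DB reset (assumed wording).
function formatResetError(connectionString: string, err: unknown): string {
  const reason = err instanceof Error ? err.message : String(err);
  return `DB reset failed against ${redactConnectionString(connectionString)}: ${reason}`;
}
```

This only masks the URL itself; if the underlying driver error echoes credentials in its own message, that text would need separate scrubbing.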


owlburtoe closed this issue

2026-05-28 15:21:59 -04:00