Local backend DB suite cascades to ~540 failures — postgres-test cleanup is fragile between files #234

Closed
opened 2026-05-28 09:12:56 -04:00 by owlburtoe · 1 comment
Owner

Summary

Running the full backend integration suite locally via pnpm test:backend:db produces a ~540-failure cascade across most files. This is environmental / test-harness instability, not an application regression — it reproduces on main at equivalent magnitude.

This makes local backend integration testing effectively unusable on the affected machine, so flakes like #226 cannot be locally re-verified end-to-end. CI does not run this vitest suite (the release-artifacts job only does schema-push, type-check, build, runtime smoke, migration ordering; e2e.yml runs Playwright), so there is currently no automated coverage of the backend integration suite either.

Evidence (controlled comparison)

| | main | feature branch (control) |
|---|---|---|
| Tests failed | 540 | 539 |
| Tests passed | 879 | 894 |
| FATAL: the database system is shutting down (57P03) | 1 | 0 |
| truncateAll() TRUNCATE failures | 28 | 26 |

Equivalent failure magnitude on both branches → not branch-specific.

Root cause (observed)

The cascade is a cleanup-isolation failure, not a logic failure:

  1. The first victim file (consistently facilityController.test.ts) fails because its beforeEach truncateAll() does not reliably clear the DB.
  2. Stale rows survive into the next file's seedTestData(), which then collides:
    • duplicate key value violates unique constraint "users_pkey" / facilities_slug_unique
    • insert or update violates foreign key constraint ..._department_id_departments_id_fk (department parent missing)
  3. Every subsequent file inherits the dirty/half-cleared state → ~540 cascading failures.

The trigger of the initial cleanup failure is non-deterministic:

  • One main run logged FATAL: 57P03 the database system is shutting down (ProcessStartupPacket) — the postgres-test container restarted mid-run (likely Colima/Docker memory pressure).
  • The control branch run logged no shutdown but several terminating connection / Connection terminated errors.

Either way, the postgres-test container or its connection drops partway through the run, and truncateAll() has no guard against executing against a dropped or restarting server.
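
A minimal sketch of what such a guard could look like, assuming nothing about the real helper beyond its name. `Queryable`, `guardedTruncateAll`, `isTransient`, `TruncateFailedError`, and the table list are hypothetical names introduced for illustration; the SQLSTATE codes other than 57P03 are standard PostgreSQL shutdown/connection codes, not codes observed in these runs.

```typescript
// Hypothetical minimal client shape; a pg Pool or Client fits it structurally.
interface Queryable {
  query(sql: string): Promise<unknown>;
}

// SQLSTATE codes treated as "server dropped or restarting".
// 57P03 comes from the issue; the rest are assumptions for the sketch.
const TRANSIENT_CODES = new Set(["57P01", "57P02", "57P03", "08003", "08006"]);

function isTransient(err: unknown): boolean {
  const e = (err ?? {}) as { code?: string; message?: string };
  const byCode = e.code !== undefined && TRANSIENT_CODES.has(e.code);
  const byMessage = /Connection terminated|terminating connection/i.test(e.message ?? "");
  return byCode || byMessage;
}

class TruncateFailedError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TruncateFailedError";
  }
}

// Retries only on connection-class errors, then throws so the caller can
// abort the run instead of letting the next seed step hit stale rows.
async function guardedTruncateAll(
  db: Queryable,
  tables: string[],
  opts: { retries: number; delayMs: number } = { retries: 3, delayMs: 500 },
): Promise<void> {
  const list = tables.map((t) => `"${t}"`).join(", ");
  const sql = `TRUNCATE TABLE ${list} RESTART IDENTITY CASCADE`;
  let attempts = 0;
  let lastErr: unknown;
  while (attempts < opts.retries) {
    attempts++;
    try {
      await db.query(sql);
      return;
    } catch (err) {
      lastErr = err;
      if (!isTransient(err)) break; // a non-connection error will not heal by waiting
      await new Promise((resolve) => setTimeout(resolve, opts.delayMs));
    }
  }
  const detail = lastErr instanceof Error ? lastErr.message : String(lastErr);
  throw new TruncateFailedError(`truncateAll failed after ${attempts} attempt(s): ${detail}`);
}
```

The design choice here is that a failed truncate is never swallowed: either the retry succeeds or the caller gets one explicit error before any seeding happens.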

Suggested mitigations (to scope)

  • Make the harness resilient to a mid-run container drop: detect postgres-test health between files, fail fast with a clear message instead of cascading.
  • Harden truncateAll() so a failed truncate aborts the run (or retries) rather than letting the next seedTestData() collide.
  • Investigate Colima/Docker resource limits for postgres-test (bump memory; the 57P03 shutdown smells like OOM/restart).
  • Consider running the backend integration suite in CI so it has a stable, reproducible baseline (currently uncovered).
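
The first mitigation could be sketched roughly as follows: a reachability probe that a per-file setup hook calls before any truncate or seed step. This is an illustration under assumptions, not the project's harness; `Queryable`, `assertDbReachable`, `DbUnreachableError`, and the 2-second default timeout are all hypothetical.

```typescript
// Hypothetical minimal client shape; a pg Pool or Client fits it structurally.
interface Queryable {
  query(sql: string): Promise<unknown>;
}

class DbUnreachableError extends Error {
  constructor(detail: string) {
    super(
      `postgres-test is not reachable (${detail}); ` +
        "aborting instead of running against a dirty database",
    );
    this.name = "DbUnreachableError";
  }
}

// Runs a trivial query with a timeout. Any failure is rethrown as one clear
// error so the run can stop here rather than cascade into seed collisions.
async function assertDbReachable(db: Queryable, timeoutMs = 2000): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`no reply within ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    await Promise.race([db.query("SELECT 1"), timeout]);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    throw new DbUnreachableError(detail);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

In a vitest setup file this would typically be awaited from a `beforeAll` hook so it runs once per test file. Combined with a bail-on-first-failure setting in the runner, a dropped container should surface as a single failure rather than hundreds.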

Notes

Not a blocker for #226's fix (PR #230): the warning it targeted, client.query() while executing, was verifiably eliminated (1 → 0 vs main). This issue is purely about local-suite reliability.

Author
Owner

Could not reproduce this on current main from a fresh worktree: pnpm test:backend:db passed twice locally, including after the hardening changes (133 files / 1472 tests passed).

I still opened PR #241 to harden the harness against the reported failure mode: fail-fast wrapper mode, per-file DB reachability checks, and clearer non-secret reset errors so a dropped postgres-test connection stops as one actionable failure instead of cascading into duplicate-key/FK noise.
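
One plausible reading of "non-secret reset errors" is that the message identifies the failed reset without echoing credentials. The sketch below illustrates that idea only; it is not taken from PR #241, and `redactConnectionString` and `resetErrorMessage` are hypothetical names.

```typescript
// Masks the password part of any postgres:// or postgresql:// URL found in a
// string. Assumption for the sketch: the password contains no "@" or whitespace.
function redactConnectionString(text: string): string {
  return text.replace(/(postgres(?:ql)?:\/\/[^:@\/\s]+:)[^@\s]+(@)/g, "$1***$2");
}

// Builds a single actionable message for a failed reset step.
function resetErrorMessage(file: string, cause: unknown): string {
  const detail = cause instanceof Error ? cause.message : String(cause);
  return `DB reset failed before ${file}: ${redactConnectionString(detail)}`;
}

console.log(
  resetErrorMessage(
    "facilityController.test.ts",
    new Error("connect failed for postgres://app:s3cret@localhost:5433/test"),
  ),
);
```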

Reference
owlburtoe/Shiftd#234