fix(cardano): fail loud on lagging pool snapshots and unfinished epoch boundaries

This is hardening, not recovery. PR #1016 made a pool whose snapshot lags the
current epoch fail visibly at RUPD instead of panicking obscurely. This commit
adds two
fail-loud guards so the same class of corruption is caught earlier and a
half-finished boundary can't silently double-apply. It does NOT implement true
shard resume, and it does NOT repair an already-lagging pool — those remain
open (see #1018 and the restored "TODO: implement true shard resume" notes).

Piece A: guard the silent-corruption hole. `MintedBlocksInc::apply`
accumulates the block count into the pool's positional `live` snapshot slot,
which only holds this epoch's blocks when the snapshot is aligned. A lagging
pool would silently fold later-epoch blocks into a mislabeled slot, corrupting
the `blocks_minted` reward input. `apply` now asserts the snapshot is aligned
to the block's epoch, failing at the origin (block processing) rather than as a
downstream RUPD failure. The guard sits in the infallible delta-apply layer
alongside the existing invariant `expect`s, so it panics with a descriptive
message rather than returning an error. The block epoch rides on a transient
`#[serde(skip)]` field that is never serialized; WAL-stored deltas are only ever
undone (never re-applied), so the WAL format is unchanged.
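A minimal sketch of the Piece A guard, for illustration only. `PoolSnapshot`,
its fields, the `undo` method, and the panic text are hypothetical stand-ins,
not the real types. Only the shape follows the description above: a transient
epoch on the delta, an alignment assert before the increment, and an undo path
that never needs the epoch.

```rust
/// Hypothetical stand-in for a pool's snapshot window (not the real type).
#[derive(Debug, Clone, PartialEq)]
pub struct PoolSnapshot {
    /// Epoch that the positional `live` slot currently describes.
    pub live_epoch: u64,
    /// Blocks minted by the pool during `live_epoch`.
    pub blocks_minted: u64,
}

/// Hypothetical delta playing the role of `MintedBlocksInc`.
#[derive(Debug, Clone)]
pub struct MintedBlocksInc {
    pub count: u64,
    /// Transient: in the real delta this would carry `#[serde(skip)]`,
    /// so it is never written to the WAL. Modeled here as an `Option`.
    pub block_epoch: Option<u64>,
}

impl MintedBlocksInc {
    /// Infallible apply: a misaligned snapshot is an invariant violation,
    /// so it panics with a descriptive message instead of returning an error.
    pub fn apply(&self, pool: &mut PoolSnapshot) {
        let block_epoch = self
            .block_epoch
            .expect("block epoch must be set before a delta is applied");
        assert_eq!(
            pool.live_epoch, block_epoch,
            "pool snapshot lagging: live slot is for epoch {}, block is in epoch {}",
            pool.live_epoch, block_epoch
        );
        pool.blocks_minted += self.count;
    }

    /// Undo only needs the count, which is why a delta restored from the
    /// WAL (where `block_epoch` is absent) can still be undone.
    pub fn undo(&self, pool: &mut PoolSnapshot) {
        pool.blocks_minted -= self.count;
    }
}
```

In this sketch an aligned pool accumulates normally, while a pool whose `live`
slot is one epoch behind panics at apply time instead of absorbing the block.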

Piece B: guard ESTART finalize. `commit_finalize` now asserts that every shard
has committed and that the epoch has not already advanced before it rotates
pools and advances the epoch, returning
`BrokenInvariant::EpochBoundaryIncomplete` otherwise. This defensive assertion
turns a would-be silent double-rotation into a loud
error. It guards the finalize step only; it does NOT make the per-shard
`AccountTransition` replay idempotent.
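A rough sketch of the Piece B check, under stated assumptions. `BoundaryState`
and all of its fields are hypothetical, and the real `commit_finalize` has a
different signature; the sketch only shows the two preconditions being checked
before any state is mutated.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BrokenInvariant {
    EpochBoundaryIncomplete,
}

/// Hypothetical bookkeeping for one epoch boundary (not the real type).
#[derive(Debug)]
pub struct BoundaryState {
    /// Epoch the ledger is currently in.
    pub current_epoch: u64,
    /// Epoch this boundary is meant to close.
    pub boundary_epoch: u64,
    pub shards_total: u32,
    pub shards_committed: u32,
    /// Counts pool rotations so a double-rotation would be visible.
    pub pool_rotations: u32,
}

pub fn commit_finalize(state: &mut BoundaryState) -> Result<(), BrokenInvariant> {
    let all_shards_committed = state.shards_committed == state.shards_total;
    let epoch_not_advanced = state.current_epoch == state.boundary_epoch;

    // Check both preconditions first, so a failed call leaves state untouched.
    if !(all_shards_committed && epoch_not_advanced) {
        return Err(BrokenInvariant::EpochBoundaryIncomplete);
    }

    state.pool_rotations += 1; // stands in for "rotate pools"
    state.current_epoch += 1; // stands in for "advance the epoch"
    Ok(())
}
```

Calling the sketch a second time for the same boundary returns the error
because the epoch has already advanced, so pools are not rotated twice. As in
the commit itself, nothing here makes the per-shard replay idempotent.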

Error codes + troubleshooting. The two errors are codified (LEDGER-001 pool
snapshot lagging, LEDGER-002 epoch boundary incomplete) with concise messages;
the explanatory prose and operator guidance live in a new
docs/content/operations/troubleshooting.mdx page.
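One way the code-to-message mapping could look, as a sketch only. The
`LedgerError` enum, its variant names, and the `code`/`summary` methods are
hypothetical; only the two codes and their short descriptions come from this
commit.

```rust
/// Hypothetical error enum (the real error types are not shown here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LedgerError {
    PoolSnapshotLagging,
    EpochBoundaryIncomplete,
}

impl LedgerError {
    /// Stable code an operator can look up on the troubleshooting page.
    pub fn code(self) -> &'static str {
        match self {
            LedgerError::PoolSnapshotLagging => "LEDGER-001",
            LedgerError::EpochBoundaryIncomplete => "LEDGER-002",
        }
    }

    /// Concise message; longer guidance belongs in the docs, not here.
    pub fn summary(self) -> &'static str {
        match self {
            LedgerError::PoolSnapshotLagging => "pool snapshot lagging",
            LedgerError::EpochBoundaryIncomplete => "epoch boundary incomplete",
        }
    }
}

impl std::fmt::Display for LedgerError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}: {}", self.code(), self.summary())
    }
}
```

Keeping the message short and the code stable lets logs stay terse while the
explanatory prose lives in the troubleshooting page.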

Out of scope: making boundary resume idempotent (the real fix, tracked in
#1018), and rebuilding an already-corrupted pool snapshot window from the
archive. A node that has already persisted a lagging snapshot keeps failing
loud with LEDGER-001 and needs a re-bootstrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>