fix(cardano): fail loud on lagging pool snapshots and unfinished epoch boundaries

This is hardening, not recovery. PR #1016 made a pool whose snapshot lags the current epoch surface at RUPD instead of panicking obscurely. This change adds two fail-loud guards so the same class of corruption is caught earlier and a half-finished boundary can't silently double-apply. It does NOT implement true shard resume, and it does NOT repair an already-lagging pool; both remain open (see #1018 and the restored "TODO: implement true shard resume" notes).

Piece A: guard the silent-corruption hole. `MintedBlocksInc::apply` accumulates the block count into the pool's positional `live` snapshot slot, which holds this epoch's blocks only when the snapshot is aligned. A lagging pool would silently fold later-epoch blocks into a mislabeled slot, corrupting the `blocks_minted` reward input. `apply` now asserts that the snapshot is aligned to the block's epoch, so the failure happens at the origin (block processing) rather than downstream at RUPD. Because the guard sits in the infallible delta-apply layer, alongside the existing invariant `expect`s, it is a descriptive panic rather than a returned error. The block epoch rides on a transient `#[serde(skip)]` field; WAL-stored deltas are only ever undone (never re-applied), so the WAL format is unchanged.

Piece B: guard ESTART finalize. Before rotating pools and advancing the epoch, `commit_finalize` now asserts that every shard committed and that the epoch has not already advanced, returning `BrokenInvariant::EpochBoundaryIncomplete` otherwise. This is a defensive assertion that turns a would-be silent double-rotation into a loud error. It guards the finalize step only; it does NOT make the per-shard `AccountTransition` replay idempotent.

Error codes + troubleshooting. The two errors are codified (LEDGER-001 pool snapshot lagging, LEDGER-002 epoch boundary incomplete) with concise messages; the explanatory prose and operator guidance live in a new `docs/content/operations/troubleshooting.mdx` page.

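For readers skimming the log, the shape of the two guards can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this commit: `PoolState`, its fields, the `block_epoch` field name, the `commit_finalize` parameters, and the error's payload are all hypothetical, and the panic text is only indicative.

```rust
/// Hypothetical stand-in for the error named above; the payload is assumed.
#[derive(Debug, PartialEq)]
enum BrokenInvariant {
    EpochBoundaryIncomplete { committed: u32, total: u32, epoch: u64 },
}

/// Hypothetical pool state: `snapshot_epoch` is the epoch that the
/// positional `live` slot currently represents.
struct PoolState {
    snapshot_epoch: u64,
    live_blocks_minted: u64,
}

/// Hypothetical delta. In the real type the epoch would be a transient
/// `#[serde(skip)]` field; `Option` models "absent after deserialization".
struct MintedBlocksInc {
    count: u64,
    block_epoch: Option<u64>,
}

impl MintedBlocksInc {
    /// Piece A: panic at the origin instead of folding blocks into a slot
    /// that is labeled with a different epoch.
    fn apply(&self, pool: &mut PoolState) {
        let epoch = self
            .block_epoch
            .expect("block epoch must be set before apply");
        assert_eq!(
            pool.snapshot_epoch, epoch,
            "LEDGER-001: pool snapshot lagging"
        );
        pool.live_blocks_minted += self.count;
    }
}

/// Piece B: refuse to finalize unless every shard committed and the epoch
/// is still the one whose boundary is being closed. Returns the next epoch.
fn commit_finalize(
    committed: u32,
    total: u32,
    current_epoch: u64,
    boundary_epoch: u64,
) -> Result<u64, BrokenInvariant> {
    if committed != total || current_epoch != boundary_epoch {
        return Err(BrokenInvariant::EpochBoundaryIncomplete {
            committed,
            total,
            epoch: current_epoch,
        });
    }
    Ok(current_epoch + 1)
}
```

In this sketch Piece A panics because the delta-apply path has no error channel, while Piece B returns an error because finalize is already fallible; that split mirrors the description above.
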
Out of scope: making boundary resume idempotent (the real fix, tracked in #1018), and rebuilding an already-corrupted pool snapshot window from the archive. A node that already persisted a lag keeps failing loud with LEDGER-001 and needs a re-bootstrap.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>