fix(cardano): fail loud on lagging pool snapshots and unfinished epoch boundaries (#1017)
TxPipe / dolos
10 commits this week (Jun 11, 2026 - Jun 18, 2026)
fix(cardano): fail loud on lagging pool snapshots and unfinished epoch boundaries
This is hardening, not recovery. PR #1016 made a pool whose snapshot lags the current epoch surface at RUPD instead of panicking obscurely. This adds two fail-loud guards so the same class of corruption is caught earlier and a half-finished boundary can't silently double-apply. It does NOT implement true shard resume, and it does NOT repair an already-lagging pool; those remain open (see #1018 and the restored "TODO: implement true shard resume" notes).
Piece A: guard the silent-corruption hole. `MintedBlocksInc::apply` accumulates the block count into the pool's positional `live` snapshot slot, which only holds this epoch's blocks when the snapshot is aligned. A lagging pool would silently fold later-epoch blocks into a mislabeled slot, corrupting the `blocks_minted` reward input. `apply` now asserts the snapshot is aligned to the block's epoch, failing at the origin (block processing) rather than as a downstream RUPD failure. It sits in the infallible delta-apply layer alongside its existing invariant `expect`s, so it is a descriptive panic. The block epoch rides on a transient `#[serde(skip)]` field; WAL-stored deltas are only ever undone (never re-applied), so the WAL format is unchanged.
Piece B: guard ESTART finalize. `commit_finalize` now asserts every shard committed and the epoch has not advanced before rotating pools / advancing the epoch, returning `BrokenInvariant::EpochBoundaryIncomplete` otherwise. This is a defensive assertion that turns a would-be silent double-rotation into a loud error. It guards the finalize step only; it does NOT make the per-shard `AccountTransition` replay idempotent.
Error codes + troubleshooting. The two errors are codified (LEDGER-001 pool snapshot lagging, LEDGER-002 epoch boundary incomplete) with concise messages; the explanatory prose and operator guidance live in a new docs/content/operations/troubleshooting.mdx page.
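As a rough illustration of the Piece A guard, the sketch below shows an infallible `apply` that asserts epoch alignment before accumulating. It is not the dolos implementation: `PoolSnapshot`, its fields, the `block_epoch` field name, and the panic wording are all hypothetical stand-ins chosen for the example, and the `#[serde(skip)]` attribute is only mentioned in a comment so the snippet needs nothing beyond the standard library.

```rust
/// Hypothetical, simplified stand-in for a pool's positional snapshot.
#[derive(Debug, Default)]
pub struct PoolSnapshot {
    /// Epoch the positional slots are currently aligned to.
    pub epoch: u64,
    /// Blocks minted during `epoch` (the `live` slot).
    pub live_blocks: u64,
}

/// Simplified delta recording blocks minted by a pool.
#[derive(Debug)]
pub struct MintedBlocksInc {
    pub pool: String,
    pub count: u64,
    /// Epoch of the block being applied (assumed field name). The commit
    /// describes this as a transient `#[serde(skip)]` field; here it is a
    /// plain field to keep the sketch dependency-free.
    pub block_epoch: u64,
}

impl MintedBlocksInc {
    /// Infallible apply: panics with a descriptive message when the snapshot
    /// is not aligned to the block's epoch, instead of silently folding the
    /// count into a mislabeled slot.
    pub fn apply(&self, snapshot: &mut PoolSnapshot) {
        assert!(
            snapshot.epoch == self.block_epoch,
            "pool {} snapshot is at epoch {} but the block belongs to epoch {}",
            self.pool,
            snapshot.epoch,
            self.block_epoch
        );
        snapshot.live_blocks += self.count;
    }
}
```

With an aligned snapshot the count is accumulated as before; with a lagging one the panic fires during block processing, which is the earlier failure point the commit is aiming for.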
Out of scope: making boundary resume idempotent (the real fix, tracked in #1018), and rebuilding an already-corrupted pool snapshot window from the archive. A node that already persisted a lag keeps failing loud with LEDGER-001 and needs a re-bootstrap.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
fix(cardano): recover from and guard against lagging pool snapshots at the epoch boundary
PR #1016 made a pool whose snapshot lags the current epoch surface loudly at RUPD instead of panicking. This adds the recovery/prevention half so the lag can neither be introduced silently nor double-applied on restart.
Piece A: guard the silent-corruption hole. `MintedBlocksInc::apply` accumulates the block count into the pool's `live` snapshot slot, which only holds this epoch's blocks when the snapshot is aligned. A lagging pool would silently fold later-epoch blocks into a mislabeled slot, corrupting the `blocks_minted` reward input. `apply` now asserts the snapshot is aligned to the block's epoch before accumulating, failing loud with a message that names the pool and both epochs at the origin (block processing) rather than as a downstream RUPD failure. This sits in the infallible delta-apply layer alongside its existing invariant `expect`s, so it is a descriptive panic, not a propagating error. The block epoch travels on a transient `#[serde(skip)]` field on `MintedBlocksInc`; WAL-stored deltas are only ever undone (never re-applied), so the default it decodes to off the WAL is never read and the WAL format is unchanged.
Piece B: make ESTART finalize run exactly once. Per-shard resume already skips committed shards via `start_shard` (each shard commits atomically and advances `estart_progress.committed` in the same write), and EWRAP finalize short-circuits via `is_complete()`. ESTART finalize had no such guard; it was safe only because the cursor advances in the same atomic commit as `EpochTransition`. `commit_finalize` now asserts every shard committed and the epoch has not advanced, returning `BrokenInvariant::EpochBoundaryIncomplete` (enriched with epoch/committed/total) otherwise, converting a would-be silent double-rotation into a loud, identifying error.
Resume diagnostics updated to describe the now-enforced mechanism (dropped the two "TODO: implement true shard resume" notes).
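The finalize guard from Piece B could look roughly like the sketch below. Only `commit_finalize`, `BrokenInvariant::EpochBoundaryIncomplete`, and the epoch/committed/total enrichment come from the message above; `BoundaryState`, `EstartProgress`, the `ending_epoch` parameter, and all field types are assumptions made for illustration, and pool rotation is reduced to a comment.

```rust
/// Invariant violation, with the epoch/committed/total enrichment the commit
/// mentions (field types are assumed).
#[derive(Debug, PartialEq, Eq)]
pub enum BrokenInvariant {
    EpochBoundaryIncomplete { epoch: u64, committed: u32, total: u32 },
}

/// Hypothetical per-boundary shard progress.
pub struct EstartProgress {
    pub committed: u32,
    pub total: u32,
}

/// Hypothetical slice of ledger state relevant to the boundary.
pub struct BoundaryState {
    pub current_epoch: u64,
    pub progress: EstartProgress,
}

/// Finalize the boundary for `ending_epoch`. Refuses to run unless every
/// shard has committed and the epoch has not already advanced, so a replayed
/// finalize surfaces as an error instead of a second rotation.
pub fn commit_finalize(
    state: &mut BoundaryState,
    ending_epoch: u64,
) -> Result<(), BrokenInvariant> {
    let all_committed = state.progress.committed == state.progress.total;
    let not_advanced = state.current_epoch == ending_epoch;
    if !(all_committed && not_advanced) {
        return Err(BrokenInvariant::EpochBoundaryIncomplete {
            epoch: state.current_epoch,
            committed: state.progress.committed,
            total: state.progress.total,
        });
    }
    // Pool rotation would happen here in a real implementation.
    state.current_epoch += 1;
    Ok(())
}
```

In this sketch a second call for the same `ending_epoch` fails because `current_epoch` has already moved on, which is the double-rotation case the guard is meant to catch.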
Out of scope (deferred): rebuilding an already-corrupted pool snapshot window from the archive for nodes that persisted a lag before this fix. They keep failing loud with `PoolSnapshotLagging` and require a re-bootstrap.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
fix(cardano): surface lagging pool snapshots in RUPD instead of obscure panic (#1016)
fix(cardano): surface lagging pool snapshots in RUPD instead of obscure panic
The reward update (RUPD) admitted pools into the stake snapshot via an
epoch-addressed test (`snapshot_at(stake_epoch)`) but then read pool params
and blocks through fixed positional slots (`.go()`/`.mark()`) and unwrapped
them. These agree only when a pool's `EpochValue.epoch == current_epoch`. When
a persisted pool's snapshot lagged the `EpochState` epoch, the admission gate
passed via the `set`/`mark` slot while `go` was still `None`, causing an
obscure `Option::unwrap() on a None value` panic deep in `define_rewards`
(loading.rs:510) that unwound the worker thread and brought down the daemon.
Rather than mask the lag by switching the accessors to `snapshot_at(...)`
(which would silently drop pools lagging beyond the snapshot window and corrupt
rewards with no signal), validate the alignment invariant up front: every
stored `PoolState` must have `snapshot.epoch() == current_epoch` (ESTART
transitions all pools each boundary). A violation now returns a descriptive
`ChainError::PoolSnapshotLagging` naming the pool hash and the epoch mismatch,
which propagates cleanly through the gasket stage (no thread-panic cascade) and
omits nothing — the whole RUPD fails loud rather than computing partial data.
- add `ChainError::PoolSnapshotLagging { pool, pool_epoch, current_epoch }`
- validate every pool in `StakeSnapshot::load_globals` (checked over all pools,
so a pool lagging out of the admission window is caught, not dropped)
- harden the `pool_params`/`pool_blocks` unwraps with descriptive messages as a
backstop against regressing to the obscure panic
- unit tests for the alignment check
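A minimal sketch of the up-front alignment check described above, under stated assumptions: the `ChainError::PoolSnapshotLagging { pool, pool_epoch, current_epoch }` variant is taken from this message, while `PoolState`'s fields, the field types, the `check_pool_alignment` helper name, and the error wording are hypothetical and are not the code in `StakeSnapshot::load_globals`.

```rust
use std::fmt;

/// Error variant named in the commit message; field types are assumed.
#[derive(Debug, PartialEq, Eq)]
pub enum ChainError {
    PoolSnapshotLagging { pool: String, pool_epoch: u64, current_epoch: u64 },
}

impl fmt::Display for ChainError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ChainError::PoolSnapshotLagging { pool, pool_epoch, current_epoch } => write!(
                f,
                "pool {} snapshot is at epoch {}, expected epoch {}",
                pool, pool_epoch, current_epoch
            ),
        }
    }
}

/// Hypothetical, reduced view of a stored pool.
pub struct PoolState {
    pub hash: String,
    pub snapshot_epoch: u64,
}

/// Check every stored pool, not only those admitted into the stake snapshot,
/// so a pool lagging beyond the admission window is reported, not dropped.
pub fn check_pool_alignment(
    pools: &[PoolState],
    current_epoch: u64,
) -> Result<(), ChainError> {
    for pool in pools {
        if pool.snapshot_epoch != current_epoch {
            return Err(ChainError::PoolSnapshotLagging {
                pool: pool.hash.clone(),
                pool_epoch: pool.snapshot_epoch,
                current_epoch,
            });
        }
    }
    Ok(())
}
```

Returning a `Result` here, instead of unwrapping later, is what lets the failure propagate through the stage as an ordinary error.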
This only surfaces the lag; recovering/repairing a lagging pool and finding its
root cause (bootstrap import vs sharded boundary resume) is a follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
feat(export): add --canonical flag for byte-reproducible snapshots