Apply suggestions by @carbolymer
Co-authored-by: Mateusz Galazyn <[email protected]>
Two config changes to make the smoke cluster behave like the sim-rs
Linear-Leios reference scenarios rather than a stress test.
1. `[validation]` in mainnet.toml: switch RB-body validation cost from
   a flat `1000.0 ms` constant to sim-rs's
   `0.3539 ms + 2.151e-05 ms/byte` formula, and surface
   `tx_validation_ms = 0.6201` (matches sim-rs's amortised-Plutus
   `tx-validation-cpu-time-ms`). At the 90 KiB max RB body the new
   cost is ~2.4 ms — 3-hop propagation now finishes well under the
   3-slot pre-voting buffer, where the old 1 s/hop pushed RB adoption
   past the voting window and forced every non-producer voter into
   `WrongEB`.
2. `rb_generation_probability` in sample-cluster.toml: revert from
   `0.2` (cluster-wide ≈ 1 RB / 5 s) to the base-config `0.05`
   (≈ 1 RB / 20 s, mainnet-like). The high rate packed multiple RBs
   into each EB's voting window and meant the chain tip typically
   moved past the EB-referencing RB before quorum could gather, so no
   EB was ever certified.
Together these match sim-rs's empirical 40%-certification working
point and let the cluster reach `RbCertifiedEb` end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
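The cost formula and the ~2.4 ms claim can be sketched as follows (constants from the commit message; the function name is hypothetical):

```rust
// Sketch of the sim-rs-style RB-body validation cost.
// Constants come from the commit message; `rb_body_validation_ms`
// is an illustrative name, not the real config key.
fn rb_body_validation_ms(body_bytes: f64) -> f64 {
    0.3539 + 2.151e-05 * body_bytes
}

fn main() {
    // At the 90 KiB max RB body the cost is ~2.34 ms, so three
    // propagation hops cost ~7 ms of validation in total, comfortably
    // inside the 3-slot pre-voting buffer (vs. the old flat 1000 ms).
    let max_rb_body = 90.0 * 1024.0;
    println!("{:.4} ms", rb_body_validation_ms(max_rb_body));
}
```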
Two changes to the Leios voting flow.
1. `decide_vote` previously called `mark_voted` on *any* predicate
   failure, so a voter that hit `WrongEB`, `LateRBHeader`, or
   `MissingTX` on the first slot of the Voting phase had no chance to
   retry next slot when its chain tip caught up or the missing TX
   arrived. In the 25-node cluster this gave ~1 vote per EB (the
   producer's), nowhere near quorum. Only `LateEB` is genuinely
   permanent — `eb_seen_slot` is fixed at receipt; once-late is
   always-late. The other three are transient. Skip `mark_voted` on
   the transient reasons so `EligibleToVote` re-fires every slot of
   the Voting window. `EmitVote` still calls `mark_voted`, so a
   successful vote happens at most once per EB. Effect on the smoke
   test: the best-EB vote count went from 1 to 18 out of 25, which
   crosses the quorum threshold and produces the first end-to-end
   `RbCertifiedEb` events.
2. `emit_vote` only logged the bundle at `info!` level; the
   `VTBundleGenerated` telemetry variant defined in `telemetry.rs`
   was never constructed. Push it onto `pending_telemetry`, parallel
   to `LeiosNoVote`, so the UI can show producer flashes.
Updated the two `con-rs` unit tests that asserted the old
"`mark_voted` after `WrongEB`" behaviour, and added a positive test
for the new "transient reasons re-fire each slot" semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
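The permanent-vs-transient split above can be sketched as a small predicate (reason names are from the commit message; the surrounding voting state machine is elided, and `is_permanent` is an invented helper):

```rust
// Hedged sketch of the retry semantics: only a permanent no-vote
// reason should trigger `mark_voted`; transient reasons leave the
// voter eligible so `EligibleToVote` re-fires next slot.
#[derive(Clone, Copy, PartialEq, Debug)]
enum NoVoteReason {
    LateEB,       // eb_seen_slot is fixed at receipt: once late, always late
    WrongEB,      // chain tip may still catch up next slot
    LateRBHeader, // the RB header may still arrive
    MissingTX,    // the missing TX may still arrive
}

impl NoVoteReason {
    /// Only `LateEB` is permanent; the other three are transient.
    fn is_permanent(self) -> bool {
        matches!(self, NoVoteReason::LateEB)
    }
}

fn main() {
    for r in [NoVoteReason::LateEB, NoVoteReason::WrongEB,
              NoVoteReason::LateRBHeader, NoVoteReason::MissingTX] {
        // `mark_voted` fires only for the permanent reason
        // (and, as before, after a successful `EmitVote`).
        println!("{:?}: mark_voted = {}", r, r.is_permanent());
    }
}
```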
Two related sizing fixes for protocol 19 (LeiosFetch).
1. Per-protocol ingress channel capacity was a single
`egress_queue_size: 16` value used for every protocol. At 25-node-
cluster scale, an EB-tx response can fragment into ~256 segments
and overshoots a 16-deep mpsc instantly, killing the mux. Split
into three named tiers in `peer_task.rs`:
- `HIGH_VOLUME_QUEUE_SIZE = 256` (chainsync, blockfetch,
txsubmission, leios_notify) — kilobyte-sized messages
- `BULK_FETCH_QUEUE_SIZE = 4096` (leios_fetch only) — sized
to hold a worst-case 12 KiB-segmented multi-MB delivery plus
headroom for the next request
- `LOW_VOLUME_QUEUE_SIZE = 4` (keepalive, peersharing) — slow
fixed-size traffic
2. `MuxError::IngressOverflow` was reused for both byte-budget overflow
and channel-full: the resulting message read
"{queued bytes} exceeds limit {byte budget}" even when the byte
budget was *not* exceeded, because the channel filled first. Add
a sibling `IngressChannelFull { queued, capacity }` variant and
use it on the `try_send` Full path so the log accurately points
at the channel as the bottleneck.
3. `LeiosFetch` `INGRESS_LIMIT` and `SIZE_LIMIT_LARGE` raised from
16 MiB to 24 MiB. Without 50% headroom over `MAX_BLOCK_SIZE`,
the demuxer's per-state buffer cap was hit by the very last SDU
of a max-size delivery (codec hadn't drained earlier segments
yet), tearing the connection down at the end of every full-size
EB-tx response. Per-message safety is still enforced inside
the CBOR codec via the unchanged `MAX_BLOCK_SIZE`,
`MAX_TRANSACTIONS`, `MAX_VOTES`, `MAX_TRANSACTION_SIZE` caps.
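The three tiers and the new error variant can be sketched as (queue depths from the commit message; the mux machinery is elided and the segment arithmetic is illustrative):

```rust
// Sketch of the named ingress-queue tiers replacing the flat
// `egress_queue_size: 16`.
const HIGH_VOLUME_QUEUE_SIZE: usize = 256;  // chainsync, blockfetch, txsubmission, leios_notify
const BULK_FETCH_QUEUE_SIZE: usize = 4096;  // leios_fetch: segmented multi-MB deliveries
const LOW_VOLUME_QUEUE_SIZE: usize = 4;     // keepalive, peersharing

#[derive(Debug)]
enum MuxError {
    // Byte-budget overflow: queued bytes exceed the ingress byte limit.
    IngressOverflow { queued: usize, limit: usize },
    // New sibling variant for the `try_send` Full path: the channel
    // filled before the byte budget did.
    IngressChannelFull { queued: usize, capacity: usize },
}

fn main() {
    // Even a full 24 MiB delivery cut into 12 KiB SDUs is 2048
    // segments: far deeper than the old 16-deep queue, but inside
    // the bulk-fetch tier with headroom for the next request.
    let worst_case_segments = 24 * 1024 * 1024 / (12 * 1024);
    println!("segments: {worst_case_segments}");
    println!("fits: {}", worst_case_segments <= BULK_FETCH_QUEUE_SIZE);
    let e = MuxError::IngressChannelFull { queued: 17, capacity: 16 };
    println!("{e:?}");
}
```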
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
When a `BlockFetch` request fails, `on_block_fetch_failed` previously
removed the in-flight marker and immediately re-ran chain selection.
The same peer's announced fragment still pointed at the missing point,
so `select_chain_once` reached the same `WaitingForBlocks { peer_id }`
decision and re-issued the fetch from the same peer in microseconds.
Under any persistent fetch failure (e.g. a peer that answers
NoBlocks to every request) this busy-loops at hundreds of thousands
of iterations per
second — disk I/O from the resulting log spam can starve slot ticks
and cascade into cluster-wide fork divergence.
Add `BLOCK_FETCH_COOLDOWN` (2 s) and a `block_fetch_cooldown` map on
`PraosState`. `on_block_fetch_failed` now takes the responsible
`PeerId`, inserts a cooldown entry, and `evaluate_and_fetch_internal`
merges those peers into its `skip` set so the fetch policy picks a
different candidate.
`NetworkEvent::BlockFetchFailed` now carries `peer_id: Option<PeerId>`
so the wrapper can pass it through. `Some(p)` is the normal "this
peer failed" case; `None` means the coordinator never reached any
peer for the requested fragment, so there is no one to penalise and
the wrapper skips the cooldown call.
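The cooldown bookkeeping can be sketched as follows (names from the commit message; wall-clock `Instant` is used here for self-containment, and the fetch-policy plumbing is elided):

```rust
// Minimal sketch of the BLOCK_FETCH_COOLDOWN mechanism: failed peers
// are remembered for 2 s and merged into the fetch policy's skip set
// so the next WaitingForBlocks decision picks a different candidate.
use std::collections::{HashMap, HashSet};
use std::time::{Duration, Instant};

type PeerId = u32;
const BLOCK_FETCH_COOLDOWN: Duration = Duration::from_secs(2);

#[derive(Default)]
struct PraosState {
    block_fetch_cooldown: HashMap<PeerId, Instant>,
}

impl PraosState {
    /// Called with the responsible peer; `None` from the event means
    /// no peer was reached and no cooldown entry is made.
    fn on_block_fetch_failed(&mut self, peer: PeerId) {
        self.block_fetch_cooldown.insert(peer, Instant::now());
    }

    /// Drop expired entries, then merge still-cooling peers into `skip`.
    fn merge_cooldowns_into_skip(&mut self, skip: &mut HashSet<PeerId>) {
        let now = Instant::now();
        self.block_fetch_cooldown
            .retain(|_, since| now.duration_since(*since) < BLOCK_FETCH_COOLDOWN);
        skip.extend(self.block_fetch_cooldown.keys().copied());
    }
}

fn main() {
    let mut st = PraosState::default();
    let mut skip = HashSet::new();
    st.on_block_fetch_failed(7);
    st.merge_cooldowns_into_skip(&mut skip);
    println!("skipping peer 7: {}", skip.contains(&7));
}
```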
Demoted `select_chain: fetching missing blocks`,
`fetching missing chain blocks`, and the failure log to `debug!` —
they fired at info level on every fetch decision and were the bulk
of the disk-I/O storm during the 25-node smoke.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Activates the timeout(1)-bounded external calls and the signal-safe emit refactor from the prior commit. publish-images resolves the new tag, builds at c588914, and pushes; both the master and adversary testnets pull the new image on their next run.
Two compounding bugs across all asteria-game test scripts.
1. socat (eventually_alive, finally_alive, parallel_driver_heartbeat)
   and the asteria-* binaries (asteria_player, admin_singleton,
   bootstrap, asteria_consistency) have no internal deadlines.
   Composer's per-command cap is ~16 s for parallel/eventually and
   ~54 s for finally. When upstream stalls (indexer slow to reply,
   N2C handshake hung under partition, container restart), the
   wrapping bash gets SIGKILLed BEFORE the binary returns; the
   sdk_run_signal_safe wrapper never executes its rc check; composer
   assigns a synthetic exit 1 and flags the script under "Always:
   Commands finish with zero exit code". Fix: every external call now
   has an explicit `timeout N` wrapper sized under composer's
   per-command-type cap (1 s for socat probes, 12 s for
   parallel/anytime, 25 s for serial, 30 s for the finally binary;
   10 s settle for the consistency check, 40 s total).
2. sdk_run_signal_safe absorbed signal-induced exits via
   sdk_unreachable, but AlwaysOrUnreachable with hit:true +
   condition:false IS a finding. It now uses sdk_sometimes_optional
   (must_hit:false) so absorbed signals are observation-only. Same
   precedent as PR #137's eventually_alive cold_start fix.
Smoke-tested locally: `timeout 0.1 sleep 5` exits 124 →
sdk_run_signal_safe maps it to a Sometimes/must_hit:false event and
returns 0; the JSONL emit shows must_hit:false as expected.
Together this addresses every finding category we saw in the
multi-run flake survey:
- Always: Commands finish with zero exit code (× 3 scripts)
- Always: stub eventually_alive cold_start (handled in PR #137)
- Sometimes: stub eventually_alive cold_start (handled in PR #137)
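The wrapper pattern can be sketched as below (function name invented for illustration; the real sdk_run_signal_safe emits a JSONL Sometimes/must_hit:false event rather than echoing):

```shell
# Hypothetical sketch of the deadline-bounded call: timeout(1) exits
# 124 when the deadline fires, and that case is absorbed as an
# observation-only event instead of a hard failure.
run_with_deadline() {
  timeout 0.1 sleep 5        # stand-in for a stalled external call
  rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "deadline hit (observation-only, must_hit:false)"
    return 0                 # absorbed: composer sees a clean exit
  fi
  return "$rc"
}
run_with_deadline
echo "rc=$?"
```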
[ci skip]
added:
* github:input-output-hk/mithril/945c12f7f013426d1d51be9dae9fa4c41a3c3163#mithril-client-cli
* github:input-output-hk/mithril/945c12f7f013426d1d51be9dae9fa4c41a3c3163#mithril-signer
removed:
* github:input-output-hk/mithril/43998cbb871defca0c21f2e12537ae629b5781d9#mithril-client-cli
* github:input-output-hk/mithril/43998cbb871defca0c21f2e12537ae629b5781d9#mithril-signer
Reshape the UTxO-HD table abstraction so the `mk` parameter is a
single-argument `TableKind` (indexed by `blk`) rather than a
two-argument `MapKind` over `(TxIn blk)` and `(TxOut blk)`. The
concrete table types (`EmptyMK`, `KeysMK`, `ValuesMK`, `DiffMK`)
are renamed to `NoTables`, `Keys`, `Values`, `Diffs` and now take
`blk` directly.
Collapse `Ouroboros.Consensus.Ledger.Tables{,.Basics,.Combinators,
.Kinds,.MapKind,.Utils}` into a single new
`Ouroboros.Consensus.Ledger.BasicTypes` module. The old modules are
left on disk but commented out of the cabal file.
Add `empty`, `map`, and `mapKeys` to
`Ouroboros.Consensus.Ledger.Tables.Diff`, and a constant bifunctor
`K2` in `Ouroboros.Consensus.Util`, both used by the new module.
Update all 130+ call sites to the new names and shape.
Deprecates the old-API transaction body surface (the type, its
constructor, and direct producers/consumers) so users are pointed at
'Cardano.Api.Experimental'. Internal modules that still use these
symbols are annotated with -Wno-deprecations to keep -Werror green;
they will be migrated in a follow-up along with the setter family.
Deprecations:
- TxBody (data type), ShelleyTxBody (constructor)
- TxBodyContent (type/constructor)
- createTransactionBody, defaultTxBodyContent
- getTxBody, getTxBodyContent
- BalancedTxBody
The existing pattern-synonym TxBody deprecation message is updated
for consistency with the new messages.