Merge pull request #858 from input-output-hk/fix/antithesis-moog-poll
fix: handle transient errors and reduce poll frequency in wait-for-test
Add committee-selection-algorithm config with three modes:

- wfa-ls (default): existing VRF lottery matching CIP-0164 wFA+LS
- everyone: every node votes unconditionally (1 vote each)
- top-stake-fraction: nodes covering the top N% of cumulative stake vote

This enables traffic analysis comparing the CIP's VRF-based scheme against simpler alternatives. Vote bundle sizes, CPU times, diffusion, and threshold checking are unchanged — only the selection mechanism differs.

Includes a benchmark script (scripts/cip-voting-options.sh) that runs the CIP topology under turbo mode across all three committee modes.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
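The three modes above can be sketched as follows. This is a minimal illustration, not the crate's actual API: `CommitteeMode`, `selects`, and the stake representation are hypothetical, and the wfa-ls VRF lottery is stubbed out as a boolean input.

```rust
#[derive(Clone, Copy)]
enum CommitteeMode {
    /// CIP-0164 wFA+LS VRF lottery (stubbed here as a fixed predicate).
    WfaLs,
    /// Every node votes unconditionally.
    Everyone,
    /// Nodes covering the top `fraction` of cumulative stake vote.
    TopStakeFraction(f64),
}

/// Decide whether `node` votes, given all nodes' stakes.
/// `vrf_won` stands in for the real VRF lottery outcome in wfa-ls mode.
fn selects(mode: CommitteeMode, node: usize, stakes: &[u64], vrf_won: bool) -> bool {
    match mode {
        CommitteeMode::WfaLs => vrf_won,
        CommitteeMode::Everyone => true,
        CommitteeMode::TopStakeFraction(fraction) => {
            // Rank nodes by stake, descending; a node is in the committee
            // if the cumulative stake before it is still below the target.
            let total: u64 = stakes.iter().sum();
            let mut order: Vec<usize> = (0..stakes.len()).collect();
            order.sort_by(|&a, &b| stakes[b].cmp(&stakes[a]));
            let mut acc = 0u64;
            for &i in &order {
                let in_committee = (acc as f64) < fraction * (total as f64);
                acc += stakes[i];
                if i == node {
                    return in_committee;
                }
            }
            false
        }
    }
}
```

Only the return value of `selects` varies per mode; everything downstream (vote bodies, diffusion, threshold checks) is mode-independent, matching the commit's claim.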
The TxGeneratorCore refactor (8a4da350) moved node selection logic into TxGeneratorCore but left a reference to the removed `node_lookup` local. Replace it with `self.sinks`, which serves the same empty-check purpose.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
ci(antithesis): poll for test results after submission
Signed-off-by: Chris Gianelloni <[email protected]>
Add a two-phase polling script (wait-for-test.sh) that monitors Antithesis test status via Moog:

- Phase 1: poll every 10s until the test is accepted or rejected
- Phase 2: poll every 60s until the test finishes
- Final check: exit 0 on success, 1 on failure/unknown

Also hoists Moog env vars to job level, auto-increments the try counter per commit, and caps the job at 180 minutes. Adapted from cardano-foundation/cardano-node-antithesis.

Signed-off-by: Chris Gianelloni <[email protected]>
Add a shared Mempool accumulator that collects transactions from both local generation and peer receipt. When producing an RB, the mempool determines the path: if total pending bytes fit within rb_body_max_bytes, txs go in the RB body directly; otherwise ALL txs drain into an EB manifest (list of tx hashes) and the RB body is empty, with the EB announced in the header's announced_eb field.

- mempool.rs: Mempool struct with push/drain_all/drain_up_to, shared via Arc<Mutex>; tx_from_received_bytes for peer tx accumulation
- config.rs: rb_body_max_bytes (default 64KB), mempool_capacity (10K)
- production.rs: ProducedRb with optional announced_eb, make_fake_block encodes txs in CBOR tx_bodies map, make_overflow_eb builds content-addressed EB manifests [slot, [tx_hash, ...]]
- main.rs: wire mempool to generator and main loop, replace stage-boundary EB production with overflow-triggered path

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
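The RB-body-vs-overflow-EB decision described above can be sketched like this. `Tx`, `RbPayload`, and `drain_for_rb` are simplified stand-ins for illustration; the real mempool lives behind Arc&lt;Mutex&gt; and tracks capacity, which is omitted here.

```rust
// Hypothetical, simplified types: the commit's real Mempool exposes
// push/drain_all/drain_up_to and is shared via Arc<Mutex>.
struct Tx {
    bytes: usize,
}

struct Mempool {
    pending: Vec<Tx>,
}

enum RbPayload {
    /// All pending txs fit within rb_body_max_bytes: they go into the RB body.
    InBody(Vec<Tx>),
    /// Overflow: the RB body stays empty, ALL txs drain into an EB manifest
    /// announced via the header's announced_eb field.
    OverflowEb(Vec<Tx>),
}

impl Mempool {
    fn push(&mut self, tx: Tx) {
        self.pending.push(tx);
    }

    fn total_bytes(&self) -> usize {
        self.pending.iter().map(|t| t.bytes).sum()
    }

    /// Decide the payload path at RB production time.
    fn drain_for_rb(&mut self, rb_body_max_bytes: usize) -> RbPayload {
        let txs = std::mem::take(&mut self.pending);
        if txs.iter().map(|t| t.bytes).sum::<usize>() <= rb_body_max_bytes {
            RbPayload::InBody(txs)
        } else {
            RbPayload::OverflowEb(txs)
        }
    }
}
```

Note the all-or-nothing drain: on overflow, every pending tx moves into the EB manifest rather than packing the RB body to its limit first.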
Expand node flash types to cover EBGenerated, EBReceived, VTBundleGenerated, and VTBundleReceived in addition to the existing RB and RolledBack triggers. Flash colors match the event log badge colors (green=RB, blue=EB, purple=votes). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add three items found during CIP-0164 audit: EB selection policy when multiple EBs reach CertEligible, ledger state needs for EB transaction validation, and freshest-first as a security property. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Trim roadmap to what was actually implemented, document module structure and test coverage, add next steps: mempool-driven EB production, stake-weighted quorum, telemetry events. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
When an EB election reaches quorum and enters the CertEligible phase, the next RB produced by this node includes certified_eb=true in its header (the CIP-0164 11-field extended header). Receiving nodes parse the flag via the existing HeaderInfo Leios extension fields. This completes the Leios consensus MVP: EB → votes → quorum → cert in RB header.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Parse validated vote bodies to extract endorser_block_hash and voter_id, attribute votes to their EB election, detect quorum (≥3 unique voters for MVP). VotesValidated outcome now carries vote_data. Verified in 25-node cluster: 75 quorum events across all nodes. Added scripts/leios-check.sh for Leios cluster diagnostics. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
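The vote attribution and quorum check described above amount to counting unique voters per EB. A minimal sketch, with hypothetical types (`Elections`, `on_vote_validated`, hash/voter-id representations) standing in for the real ones:

```rust
use std::collections::{HashMap, HashSet};

/// MVP quorum threshold from the commit: >= 3 unique voters per EB.
const QUORUM: usize = 3;

#[derive(Default)]
struct Elections {
    /// endorser_block_hash -> set of unique voter_ids seen so far.
    voters: HashMap<u64, HashSet<u32>>,
}

impl Elections {
    /// Record a validated vote; returns true exactly when this vote
    /// tips the EB's election over the quorum threshold.
    fn on_vote_validated(&mut self, endorser_block_hash: u64, voter_id: u32) -> bool {
        let set = self.voters.entry(endorser_block_hash).or_default();
        let before = set.len();
        set.insert(voter_id); // duplicate voters are deduplicated here
        before < QUORUM && set.len() >= QUORUM
    }
}
```

Deduplication via the HashSet is what makes "unique voters" the quorum metric, so a node re-sending its vote cannot inflate the count.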
When an EB election enters the Voting pipeline phase (3×Δhdr slots after announcement), committee selection determines whether this node votes. Structured 130-180B vote bodies are injected into the network via InjectLeiosVotes. Stage-boundary vote production removed from main. Verified in 25-node cluster: EBs propagate, elections track pipeline phases, votes flood the network (~1200 VotesReceived in ~10s run). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
- pipeline.rs — PipelinePhase, PipelineConfig, EbElection, phase timing
- voting.rs — stub for future EB-triggered vote production
- aggregation.rs — stub for future vote tallies and certificates
- mod.rs — LeiosConsensus, event routing, slot tick, validation handlers

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Each validated EB gets its own election with phases driven by pipeline timing (3×Δhdr + L_vote + L_diff). Phase transitions computed from elapsed slots since announcement. Elections pruned after dedup_window. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
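The elapsed-slot phase computation described above can be sketched as a pure function. The phase names and parameter values here are illustrative; only the 3×Δhdr + L_vote + L_diff ordering comes from the commit.

```rust
#[derive(Debug, PartialEq)]
enum PipelinePhase {
    /// First 3×Δhdr slots after announcement: EB diffuses.
    Diffusing,
    /// Next L_vote slots: committee members cast votes.
    Voting,
    /// Next L_diff slots: votes diffuse; quorum may be reached.
    VoteDiffusion,
    /// Afterwards: election finished (pruned after dedup_window).
    Done,
}

/// Phase of an EB's election, computed purely from elapsed slots
/// since its announcement.
fn phase_at(announced: u64, now: u64, delta_hdr: u64, l_vote: u64, l_diff: u64) -> PipelinePhase {
    let elapsed = now.saturating_sub(announced);
    if elapsed < 3 * delta_hdr {
        PipelinePhase::Diffusing
    } else if elapsed < 3 * delta_hdr + l_vote {
        PipelinePhase::Voting
    } else if elapsed < 3 * delta_hdr + l_vote + l_diff {
        PipelinePhase::VoteDiffusion
    } else {
        PipelinePhase::Done
    }
}
```

Because the phase is a function of (announcement slot, current slot) only, each EB carries its own timeline and no global stage clock is needed, which is what the per-EB election model buys.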
Replace stage-based election model with per-EB pipeline timing (3×Δhdr + L_vote + L_diff). Align commit sequence with CIP-0164 spec: per-EB elections, pipeline phases, EB-coupled-to-RB production. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add adaptive key corruption scenarios to threat model
Remove unused ChainFragment::remove, fix doc indentation in selection.rs, suppress complex-type lint on try_switch_to. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Route LeiosBlockReceived and LeiosVotesReceived through the existing fake-delay validation pipeline before consensus sees them, giving Leios events a consistent "has been validated" gate matching RBs. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
select_chain_once's contiguity guard called chain_tree.ancestors(last_hash) to verify that a peer's replay chain reaches the picked common ancestor. The walk terminates at the first block whose parent is not in chain_tree. on_block_received inserts every fetched block into both chain_tree and block_cache — but the chain_tree insert is skipped when the header has no parsed info AND chain_tree doesn't already know the block (so block_no=0). That leaves the block in block_cache without a chain_tree entry, and the next contiguity walk terminates early, firing `fork mismatch (replay doesn't reach ancestor)` → OrphanCandidate. With the cooldown cap this no longer infects other peers, but individual nodes under sustained fork load can slowly get stuck on this false mismatch.

Fix: add a hybrid walker that follows prev_hash links using chain_tree first and block_cache as a fallback. The walk terminates at a genuine gap (neither store has the parent) or at a genesis child (prev_hash=None) — both distinguished via a new HybridWalk.reached_origin flag so the genesis-reached check in select_chain_once still works. The walk is a new private method on PraosConsensus in selection.rs: walk_ancestors_hybrid(start_hash) -> HybridWalk.

4 new unit tests exercise:

- chain_tree-only case (back-compat with pre-fix behaviour)
- block_cache fallback (tree has tip + anchor, middle only in cache)
- gap termination (parent in neither store → reached_origin=false)
- start_only_in_cache (start block only in block_cache)

Cluster verification at p=0.2: 24/25 nodes stayed healthy for ~55 min (vs the previous build, which had 4 nodes stuck by T+60min). The one stuck node (node-4) hit a separate mux-level ingress-overflow bug during catch-up fetches, not the contiguity walk — tracked separately.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
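The hybrid walk described above reduces to following prev_hash links through two maps with a fallback. A minimal sketch, assuming simplified stand-in types (the real walk is a private method on PraosConsensus with richer block metadata):

```rust
use std::collections::HashMap;

type Hash = u64;

struct BlockInfo {
    prev_hash: Option<Hash>, // None marks a genesis child
}

struct HybridWalk {
    visited: Vec<Hash>,
    /// true iff the walk ended at a genesis child, not at a gap.
    reached_origin: bool,
}

fn walk_ancestors_hybrid(
    chain_tree: &HashMap<Hash, BlockInfo>,
    block_cache: &HashMap<Hash, BlockInfo>,
    start_hash: Hash,
) -> HybridWalk {
    let mut visited = Vec::new();
    let mut cur = start_hash;
    loop {
        // Prefer chain_tree; fall back to block_cache.
        let info = match chain_tree.get(&cur).or_else(|| block_cache.get(&cur)) {
            Some(info) => info,
            // Genuine gap: neither store knows this block.
            None => return HybridWalk { visited, reached_origin: false },
        };
        visited.push(cur);
        match info.prev_hash {
            // Genesis child: the replay chain reaches its origin.
            None => return HybridWalk { visited, reached_origin: true },
            Some(parent) => cur = parent,
        }
    }
}
```

The reached_origin flag is what lets the caller distinguish "chain is contiguous back to genesis" from "chain has a real hole", which the pre-fix chain_tree-only walk conflated.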
evaluate_and_fetch had no persistent skip set — every new event (TipAdvanced, RolledBack, BlockFetchFailed) re-ran select_chain_once with an empty local `tried` HashSet, so a peer that failed the contiguity guard would be re-classified as OrphanCandidate hundreds of times per second, each iteration clearing its entries and sending another NetworkCommand::ReIntersect. Two peers on a single node could generate 500k+ orphan log lines per 30 min and saturate CPU on receiving peers, propagating stuckness.

A naive pending-set guard (clear on IntersectionFound) didn't help on localhost because re-intersection round-trips complete in 1-3 ms, faster than TipAdvanced events arrive; the peer ping-pongs between orphan and re-intersected in a tight loop.

Fix: use a time-based cooldown instead.

- PraosConsensus.orphan_cooldown: HashMap<PeerId, Instant> holds the earliest time each peer can be reconsidered in chain selection.
- ORPHAN_COOLDOWN = 1s — caps orphan/ReIntersect emissions at ≤1/sec/peer.
- evaluate_and_fetch builds `skip` from unexpired cooldown entries (and prunes expired ones), inserts new orphans with `now + ORPHAN_COOLDOWN`, and gates the log + ReIntersect send on the transition via an `already_cooling` check, so the rate-limited state is visible exactly once per entry.
- IntersectionFound does NOT clear the cooldown — the entry must expire naturally so the peer's ChainSync stream has time to rebuild contiguous entries under the new anchor before we re-evaluate.
- PeerDisconnected clears the cooldown entry (prevents leaks).

5 new unit tests:

- orphan_first_time_sends_reintersect_and_marks_cooldown
- orphan_while_cooling_does_not_resend_reintersect
- many_tip_advances_while_cooling_do_not_cascade (1000 events → 0 extra)
- intersection_found_does_not_clear_cooldown
- peer_disconnected_clears_cooldown

Cluster verification at p=0.2: a fresh run held 21-24/25 nodes healthy for ~2 hours. Orphan-cascade total across all 25 nodes was ~5k in that window (vs 500k+ per node on the pre-fix build). Stuck count stable at 4/25 in a bounded partition — the remaining stuck nodes fall to a separate chain_tree contiguity-walk bug (tracked as a follow-up), not the cascade.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
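The cooldown mechanics above can be isolated into a small sketch. `OrphanCooldown` and its methods are hypothetical extractions for illustration; in the commit this state lives on PraosConsensus and is woven through evaluate_and_fetch.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

type PeerId = u32;

/// Caps orphan/ReIntersect emissions at <= 1 per second per peer.
const ORPHAN_COOLDOWN: Duration = Duration::from_secs(1);

#[derive(Default)]
struct OrphanCooldown {
    /// Earliest time each peer may be reconsidered in chain selection.
    until: HashMap<PeerId, Instant>,
}

impl OrphanCooldown {
    /// Returns true iff this orphan event should log and send ReIntersect,
    /// i.e. the peer was not already cooling down.
    fn on_orphan(&mut self, peer: PeerId, now: Instant) -> bool {
        let already_cooling = self.until.get(&peer).map_or(false, |&t| now < t);
        if !already_cooling {
            self.until.insert(peer, now + ORPHAN_COOLDOWN);
        }
        !already_cooling
    }

    /// Peers still cooling down are skipped by chain selection;
    /// expired entries are pruned as a side effect.
    fn skip_set(&mut self, now: Instant) -> Vec<PeerId> {
        self.until.retain(|_, &mut t| now < t);
        self.until.keys().copied().collect()
    }

    /// IntersectionFound deliberately does NOT clear the entry;
    /// only a disconnect removes it (preventing map leaks).
    fn on_peer_disconnected(&mut self, peer: PeerId) {
        self.until.remove(&peer);
    }
}
```

The key design point survives the simplification: letting entries expire by time rather than clearing them on IntersectionFound is what breaks the 1-3 ms orphan/re-intersect ping-pong loop.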