Merge pull request #859 from input-output-hk/dnadales/cumulative-tx-bytes-metric
Add confirmed tx throughput to proto-devnet dashboard
Picks up the merged cumulative-tx-size work in ouroboros-consensus and cardano-node now that both branches incorporate it on leios-prototype.
jemalloc handles concurrent allocation from rayon worker threads better than glibc's ptmalloc2 (per-thread caches, less lock contention) and returns freed pages more aggressively, reducing RSS bloat from allocator fragmentation.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The deterministic event sorting pipeline (added in 54389c5ec) was cloning and buffering every simulation event even when no -o output file was given. At T=0.250 with 1500 nodes this accumulated 7M+ OutputEvent structs (~10 GB) at peak, causing RSS to balloon from ~21 GB (actual node state) to 59 GB and OOM. Guard the clone/buffer/flush path with a has_output check. RSS at slot 656 dropped from 59 GB to 28 GB — matching tracked node state plus normal allocator overhead. Also adds EventMonitor and LivenessMonitor stats logging every 60 slots for ongoing memory diagnostics.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
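The guard is conceptually simple; a minimal sketch, with illustrative names (`has_output`, `EventSink`, the buffer field) standing in for the actual sim-cli types:

```rust
/// Hypothetical, simplified event sink illustrating the fix: events are
/// cloned and staged for deterministic sorting only when an output file
/// was actually requested with -o.
#[derive(Clone)]
struct OutputEvent {
    time: u64,
    payload: String,
}

struct EventSink {
    has_output: bool,         // true iff -o was given on the command line
    buffer: Vec<OutputEvent>, // deterministic-sort staging buffer
}

impl EventSink {
    fn record(&mut self, event: &OutputEvent) {
        // Without this guard, every event was cloned and retained even
        // when no output file existed, ballooning RSS.
        if self.has_output {
            self.buffer.push(event.clone());
        }
    }
}
```

With `has_output = false` the clone never happens and the buffer stays empty, so memory tracks only real node state.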
Expose per-shard connection queue statistics (total/active connections, queued messages, queued bytes) via a shared NetworkStatsCollector. Each shard's sequential engine updates its counters at slot boundaries; the node's existing log_memory_stats reads the aggregate. Output appears every 60 slots alongside Memory stats, covering all shards. Initial profiling showed zero queued messages in turbo mode (zero-latency clusters bypass bandwidth queues), ruling out network queues as the cause of the ~40 GB RSS vs ~20 GB tracked-state gap.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace the full symmetrization (which nearly doubled link count from 39k to 59k) with a targeted fixup: for each node not listed as anyone's producer, add a single reciprocal link back from its first producer. This adds only 432 links (one per BP) vs ~20k before. BPs were the only nodes needing fixup — they pick 2 relay producers but no relay was picking them back, making them invisible to the sim's consumer-edge BFS. Relays cross-reference each other enough to be naturally reachable. Re-generated topology: 38,943 links (vs 59,268 symmetric, 38,511 original asymmetric).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
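The fixup pass might look roughly like this (a Rust sketch for consistency with the codebase, even though the topology scripts are Python; `producers` maps each node to the nodes it pulls from, and all names are illustrative):

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch of the targeted fixup. Any node that no other
/// node lists as a producer is invisible to a consumer-edge BFS, so we
/// add a single reciprocal link back from its first producer.
/// Returns the number of links added (one per orphaned node).
fn fixup_reciprocal(producers: &mut HashMap<String, Vec<String>>) -> usize {
    // Every node that appears in someone's producer list.
    let produced_for: HashSet<String> = producers
        .values()
        .flat_map(|ps| ps.iter().cloned())
        .collect();
    // Nodes nobody pulls from (in the real topology: the 432 BPs).
    let orphans: Vec<String> = producers
        .keys()
        .filter(|n| !produced_for.contains(*n))
        .cloned()
        .collect();
    let mut added = 0;
    for node in orphans {
        if let Some(first) = producers.get(&node).and_then(|ps| ps.first()).cloned() {
            // The node's first producer now also lists it, making it
            // reachable via a consumer edge.
            producers.entry(first).or_default().push(node.clone());
            added += 1;
        }
    }
    added
}
```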
Co-authored-by: Sebastian Nagel <[email protected]>
Update roadmap.md
The sim's connectivity BFS traverses consumer edges (reverse of producers). Unidirectional producer links left nodes unreachable, causing "Graph must be fully connected!" errors. Symmetrize all links so every A→B producer also creates B→A. Also rename generate_topology.py → generate-topology.py and summarize_topology.py → summarize-topology.py for consistency with the other shell scripts. Re-generated topology-v2-expanded-1500.yaml (59,268 links, fully connected).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
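Full symmetrization amounts to mirroring every producer edge; a minimal sketch (Rust for consistency with the codebase, illustrative types):

```rust
use std::collections::HashMap;

/// Hypothetical sketch of full symmetrization: for every producer edge
/// A -> B, ensure the reverse edge B -> A also exists, so the
/// consumer-edge BFS can reach every node from any start.
fn symmetrize(producers: &mut HashMap<u32, Vec<u32>>) {
    // Snapshot edges first so we can mutate the map while iterating.
    let edges: Vec<(u32, u32)> = producers
        .iter()
        .flat_map(|(&a, bs)| bs.iter().map(move |&b| (a, b)))
        .collect();
    for (a, b) in edges {
        let back = producers.entry(b).or_default();
        if !back.contains(&a) {
            back.push(a); // add the reciprocal link B -> A
        }
    }
}
```

This is the heavyweight approach that a later commit replaces with the targeted one-link-per-BP fixup.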
Problem
-------
When a node's peer TX backlog hits its cap (e.g. 10,000), incoming TXs
are silently dropped from self.txs. If a dropped TX is referenced by a
pending Endorser Block, the EB's validation scan (try_validating_eb)
finds has_tx() = false and the EB is never marked all_txs_seen. The EB
then misses its vote window and is orphaned by the next Ranking Block
(WrongEB). Because the TX is never re-offered by peers, the one-shot
missing_txs trigger — already consumed by acknowledge_tx — cannot
re-fire, leaving the EB permanently stuck.
Under Poisson-clustered RB production (e.g. seed 4 at 0.200 MB/s), this
cascade produced 48 EBs with 19 uncertified (40%), 23M peer TX drops,
and a mean of only 348 votes/EB (well below the 450 quorum).
Fix
---
Two changes in propagate_tx():
1. Move the mempool insertion check (try_add_to_mempool) BEFORE
acknowledge_tx, so that missing_txs has not yet been consumed at the
point where we decide whether to drop.
2. When PeerBacklogFull fires, check whether the TX is referenced by a
pending EB (self.leios.missing_txs.contains_key). If yes, keep the
TX in self.txs (skip the backlog, but preserve has_tx = true) and
fall through to acknowledge_tx normally. If no, drop as before.
This retains only EB-critical TXs — bounded by (pending_EBs × EB_size),
typically a few thousand entries and ~3 MB of HashMap overhead per node.
Non-critical TXs are still dropped, preserving the memory cap's purpose.
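The two changes above can be sketched as a simplified model (illustrative types only; the real propagate_tx also threads through the try_add_to_mempool / acknowledge_tx ordering described in point 1):

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified model of the changed drop decision: when
/// the peer backlog is full, a TX referenced by a pending EB (i.e.
/// present in missing_txs) is retained in `txs` so has_tx() stays
/// true; other TXs are dropped as before, preserving the memory cap.
struct Node {
    txs: HashSet<u64>,              // TXs held by this node
    backlog: Vec<u64>,              // peer TX backlog (capped)
    backlog_cap: usize,
    missing_txs: HashMap<u64, u32>, // tx id -> pending EB waiting on it
}

impl Node {
    /// Returns true if the TX was kept.
    fn propagate_tx(&mut self, tx: u64) -> bool {
        if self.backlog.len() < self.backlog_cap {
            // Normal path: backlog has room.
            self.backlog.push(tx);
            self.txs.insert(tx);
            return true;
        }
        // PeerBacklogFull: retain only if a pending EB references it.
        if self.missing_txs.contains_key(&tx) {
            self.txs.insert(tx); // skip the backlog, preserve has_tx
            true
        } else {
            false // non-critical TX: drop as before
        }
    }
}
```

The retained set is bounded by how many TXs pending EBs can reference, matching the (pending_EBs × EB_size) bound above.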
Effect on seed 4 sequential 0.200/wfa-ls (worst-case seed)
-----------------------------------------------------------
               EBs  uncert  mean  WrongEB  drops  peak RSS
caps (before):  48      19   348     1138  23.2M    ~20 GB
caps-retain:    45       8   470     1330   5.9M    ~24 GB
nocaps (ref):   46       8   473     1516      0    ~35 GB
Uncertified EBs: 19 → 8 (40% → 18%)
Mean votes/EB: 348 → 470 (near nocaps 473)
Peer TX drops: 23.2M → 5.9M (−74%)
Peak RSS: ~20 → ~24 GB (+20%, well below nocaps ~35 GB)
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
parameters/no-caps.yaml disables all three memory caps for diagnostic
experiments (peer backlog, generated backlog, TX max age).
voting_results.csv captures the full 4-way matrix at 0.200/wfa-ls:
{turbo,sequential} × {caps,nocaps} × seeds 0-4. Key findings:
- Seed 4 is the stress seed: caps cause 40% uncertified (seq) vs 17%
without caps. Root cause is a race in propagate_tx where
acknowledge_tx consumes the one-shot missing_txs trigger before
PeerBacklogFull drops the TX.
- Seeds 1,3 are cap-insensitive (well-spaced RBs).
- No-caps converges all seeds to 16-22% uncertified.
- Stale rows (pre-rayon-fix, pre-seed-wiring) labelled as such.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Adds a label column (position 5, between seed and time_seconds) to distinguish experiment configurations (e.g. "caps", "nocaps") without relying on memory of which rows came from which invocation.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The seed field existed on SimConfiguration but was hardcoded to 0 in build(). Adding it to RawParameters (with #[serde(default)]) lets it be set via -p YAML files, which the -S/--seed flag in cip-voting-options.sh already generates.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
rayon's filter() on an indexed parallel iterator produces an unindexed iterator whose collect() does NOT preserve element order — the output Vec order depends on work-stealing scheduling, which varies per process. Moving the empty-work check into .map() keeps the iterator indexed, so collect() is deterministic regardless of rayon thread scheduling. This was the root cause of the bistable attractor at 0.200/wfa-ls: the same seed+config could land on either 28/8 (healthy) or 81/49 (pathological) depending on how rayon happened to schedule work in that process launch.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
cip-voting-options.sh was piping every run through `tee /dev/stderr`, which reopens /proc/self/fd/2 on each invocation; on Linux that gives a fresh offset-0 open-file-description, so successive seeds in a -S sweep overwrote the combined log from byte 0 — only the in-flight seed ever survived on disk. Now each run tees to /tmp/sim-T<T>-<mode>-<engine>-seed<N>.log so every seed retains its full log. poll-sim.sh defaults to the latest /tmp/sim-*.log when no path is given, so the normal /loop monitor workflow keeps working without changes.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Seed is the innermost loop so a partial run still yields a complete seed distribution for each (throughput, mode) cell. CSV grows a seed column (position 4); existing rows should be backfilled with seed=0.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
cip-voting-options.sh gains a repeatable -P/--extra-params flag that layers additional YAML parameter files on top of the existing config chain (applied last so they override everything). Useful for quick experiments — e.g., `-P /tmp/coarse-timestamp.yaml` to bump timestamp-resolution-ms without touching the committed parameter set. poll-sim.sh prints a concise one-line status of a running sim-cli plus the log tail, intended for use from /loop or cron to watch a long-running benchmark without blocking Claude's thread on sleep.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Cross-shard message delivery order in the sequential engine previously depended on OS thread scheduling of peer shards, so runs with shard_count > 1 produced different event sequences across runs. Fixing this required five coordinated changes:

1. **Deterministic cross-shard merge**: tag every CrossShardMsg with `source_shard` and a per-sender monotonic `seq`. Receiving shards buffer incoming messages into a `BinaryHeap` keyed on `(send_time, source_shard, seq)` and only deliver those whose send_time is strictly less than the minimum of every peer's advertised `shared_time`. Under that rule, no future message can arrive with an earlier send_time, so delivery order is a pure function of the sent messages (which are themselves produced deterministically per shard).

2. **Strict CMB ceiling**: the block condition changes from `timestamp > ceiling` to `timestamp >= ceiling`. At the boundary `timestamp == ceiling`, a peer might still be about to send a message whose `delivery_time == timestamp`; requiring the strict `timestamp < ceiling` before proceeding ensures every message with `delivery_time <= timestamp` is already on the mpsc by the time we process `timestamp`.

3. **Content-derived sort at pop**: BinaryHeap pop order for equal-timestamp events is a function of push history, which under multi-shard can vary across runs (cross-shard pushes from drain interleave with intra-shard pushes from apply_batch_output). Collect all events at the current timestamp into a Vec and sort by `GlobalEvent::sort_key()` before processing, so the order is a pure function of event content.

4. **Ceiling-aware termination**: replace the primary-shard-cancels-on-SlotBoundary scheme with an independent per-shard termination check that only breaks when the local queue has no events with `ts < end_time` AND the CMB ceiling is also `>= end_time`. Every shard stops at the same simulation time, independent of token-cancellation propagation races.

5. **Second drain before popping**: run drain_cross_shard_safe a second time after the ceiling check passes. The top-of-loop drain may run before the peer has advanced enough for send_time = `timestamp - eps` messages to be deliverable; the post-ceiling-check drain catches them, preventing a cross-shard delivery from landing in a later iteration and splitting a timestamp's events across batches.

New test `test_sequential_multi_shard_deterministic` compares per-node event trajectories across two runs under shard_count=2. Passes 500/500 in release mode (was failing in ~100% of runs before the fix, ~25% with only the sort fix, 2% with the termination fix, 0% with the second drain). All 55 sim-core tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
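The merge rule in change 1 can be sketched with a stdlib min-heap; a hedged illustration with simplified types (the real CrossShardMsg carries a payload, and `shared_time` propagation is elided):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical sketch of the deterministic cross-shard merge: buffer
/// incoming messages in a min-heap keyed on (send_time, source_shard,
/// seq), and deliver only those strictly below every peer's advertised
/// shared_time. Under that rule no future message can arrive with an
/// earlier key, so delivery order is a pure function of the messages.
struct CrossShardBuf {
    heap: BinaryHeap<Reverse<(u64, u32, u64)>>, // (send_time, source_shard, seq)
}

impl CrossShardBuf {
    fn push(&mut self, send_time: u64, source_shard: u32, seq: u64) {
        self.heap.push(Reverse((send_time, source_shard, seq)));
    }

    /// Pop every message whose send_time is strictly less than the
    /// minimum shared_time advertised by the peer shards.
    fn deliverable(&mut self, peer_times: &[u64]) -> Vec<(u64, u32, u64)> {
        let floor = peer_times.iter().copied().min().unwrap_or(u64::MAX);
        let mut out = Vec::new();
        while let Some(&Reverse(m)) = self.heap.peek() {
            if m.0 < floor {
                out.push(m);
                self.heap.pop();
            } else {
                break; // a peer might still send at this time or earlier
            }
        }
        out
    }
}
```

Ties on send_time break on (source_shard, seq), so equal-time messages from different shards always deliver in the same order.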
`TxGeneratorCore::generate` computed inter-tx delay as `config.frequency_ms.sample() as u64 * shard_count as u64` and passed it to `Duration::from_millis`. The `as u64` cast truncated each sample: a configured 7.5 ms became 7 ms, producing TXs ~7% faster than requested. For the 0.200/wfa-ls single-shard run this meant 128,572 TXs over 900s (~214 KB/s) instead of the intended ~120,000 TXs (~200 KB/s). Only affects configurations with sub-ms precision and no batching. Turbo is largely unaffected (1 ms resolution, 10 ms tx-batch-window collapses the fractional delay anyway). Switch to `Duration::from_secs_f64`, preserving sub-millisecond precision via nanosecond-resolution Duration. Clamp to `.max(0.0)` so distributions that can sample negative (e.g., Normal) keep the old "treat negative as zero delay" behaviour rather than panicking in `from_secs_f64`.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
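A minimal standalone illustration of the truncation and the fix (the exact expression in TxGeneratorCore may differ slightly, e.g. where the clamp is applied):

```rust
use std::time::Duration;

/// Old behaviour: the `as u64` cast truncates the sampled milliseconds,
/// so a 7.5 ms sample becomes a 7 ms delay (~7% too fast).
fn delay_old(sample_ms: f64, shard_count: u64) -> Duration {
    Duration::from_millis(sample_ms as u64 * shard_count)
}

/// New behaviour: keep sub-millisecond precision via from_secs_f64,
/// clamping so a negative sample (possible with e.g. a Normal
/// distribution) means "zero delay" instead of panicking.
fn delay_new(sample_ms: f64, shard_count: u64) -> Duration {
    Duration::from_secs_f64((sample_ms.max(0.0) * shard_count as f64) / 1000.0)
}
```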
Migrate every remaining stateful-RNG use reachable from Linear Leios:
- linear_leios.rs generate_withheld_txs: `self.rng.random_bool(p)` is
replaced with `rng.draw_bool(node, slot, DrawSite::WithholdDecision,
p)`. The distribution sample for `txs_to_generate` and the per-tx
`new_tx` body generation use `Rng::seeded_chacha(node, slot, site)`
to produce one-shot ChaChaRngs seeded from context — this keeps the
rand_distr / `new_tx` machinery unchanged while removing the
cross-call stateful coupling.
- tx.rs TxGeneratorCore: replaces its `ChaChaRng` with the stateless
`SimRng` plus a monotonic `next_tx_idx: u64`. Each TX is generated
from a one-shot ChaChaRng seeded from
`("tx_generator", tx_idx)` — so the generated TX stream is a pure
function of the master seed regardless of per-node or network-timing
behaviour. Propagates the `SimRng` type through TransactionProducer
and its callers in sim/sequential.rs and sharding/shard.rs; the
master-RNG `.next_u64()` consumption is preserved to keep any
remaining downstream draws on stracciatella/leios variants seeded
the same way they were.
- Drops `rng: ChaChaRng` field from `LinearLeiosNode`. The NodeImpl
trait signature still takes a `ChaChaRng` for the other variants, so
LinearLeiosNode::new accepts it as `_rng` and discards.
New Rng methods: `seeded_chacha(node, slot, site)` for context-tied
one-shot ChaChaRng seeding, and `seeded_chacha_from<K: Hash>(&K)` for
sim-wide (non-node-tied) draws like the TX generator.
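The idea behind context-seeded one-shot draws can be sketched with stdlib pieces standing in for the real machinery (the actual code seeds a ChaChaRng via `Rng::seeded_chacha`; here a stdlib hash derives the seed and SplitMix64 is only an illustrative stand-in for the stream, with made-up names throughout):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a seed purely from (master seed, context key). In the real
/// code this seeds a one-shot ChaChaRng; the point is that no state
/// survives between draws, so each draw is independent of call order.
fn context_seed<K: Hash>(master_seed: u64, key: &K) -> u64 {
    let mut h = DefaultHasher::new();
    master_seed.hash(&mut h);
    key.hash(&mut h);
    h.finish()
}

/// Minimal stand-in generator (SplitMix64) for the one-shot stream.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E3779B97F4A7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// One draw, a pure function of (master seed, node, slot, site) —
/// never of how many draws happened before it.
fn draw(master: u64, node: u32, slot: u64, site: &str) -> u64 {
    let mut s = context_seed(master, &(node, slot, site));
    splitmix64(&mut s)
}
```

Because each draw rebuilds its generator from context, per-node or network-timing differences cannot perturb any other draw — the property the migration is after.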
All 54 sim-core tests pass; clippy clean for Linear Leios and
TxGeneratorCore.
Stracciatella and full-Leios variants retain their stateful `self.rng`
for now — they build fine but are out of scope for the current
determinism investigation.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Replace `candidates.shuffle(&mut self.rng)` in
LinearLeiosNode::sample_from_mempool with Rng::context_shuffle, which
performs Fisher-Yates using DrawSite::MempoolSwap { call, idx } for
each swap. The `call` discriminator distinguishes independent shuffle
invocations at the same (node, slot): the RB-body sample uses call=0,
the EB-body sample uses call=1, so they don't collide.
DrawSite::MempoolSwap gains a `call: u32` field. Three new rng tests
cover: deterministic-per-context, distinct-calls-yield-distinct-perms,
multiset-preservation.
Threads `slot` and `shuffle_call` through sample_from_mempool's
signature. Both call sites (RB path, EB path) in try_generate_rb pass
the active slot and their assigned call index.
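The mechanism can be sketched as a context-keyed Fisher-Yates (a hedged illustration: stdlib hashing stands in for the real DrawSite::MempoolSwap draws, and all names are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One swap-index draw, a pure function of (master seed, node, slot,
/// call, idx) — mirroring DrawSite::MempoolSwap { call, idx } in spirit.
fn swap_draw(master: u64, node: u32, slot: u64, call: u32, idx: usize) -> u64 {
    let mut h = DefaultHasher::new();
    (master, node, slot, call, idx as u64).hash(&mut h);
    h.finish()
}

/// Fisher-Yates where every swap target comes from the context hash
/// instead of a stateful RNG: the permutation depends only on context,
/// and distinct `call` values at the same (node, slot) give
/// independent shuffles (call=0 for the RB body, call=1 for the EB
/// body in the commit above).
fn context_shuffle<T>(v: &mut [T], master: u64, node: u32, slot: u64, call: u32) {
    for i in (1..v.len()).rev() {
        let j = (swap_draw(master, node, slot, call, i) % (i as u64 + 1)) as usize;
        v.swap(i, j);
    }
}
```

The three test properties from the commit fall out directly: the same context always yields the same permutation, distinct calls yield (with overwhelming probability) distinct permutations, and Fisher-Yates preserves the multiset.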
Note: the default `leios-mempool-sampling-strategy: ordered-by-id`
means the shuffle branch doesn't fire in the current benchmark; this
is structural cleanup so Linear Leios contains no remaining
stateful-RNG uses on its hot VRF / sampling path.
Stracciatella and full Leios variants still use stateful `self.rng` for
their shuffle paths; those will be migrated in a follow-up.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>