Merge pull request #913 from input-output-hk/prc/net-node-memory
net-node memory + per-message size enforcement
Signed-off-by: cryptodj413 <[email protected]>
`LeiosStore::notifications: Vec<LeiosNotification>` was never pruned. Every `inject_block` / `inject_block_txs` / `inject_votes` pushed an entry that lived forever, even though the blocks/votes those entries point at get slot-window-evicted by the same `bump_version` call. Over a long-running cluster the vec grows monotonically with the total inject rate.
Switch storage to `VecDeque` and add a `notifications_pruned_count` so logical (caller-facing) cursors stay monotonic across pruning. At slot-window eviction, front-prune notifications whose every referenced slot is below the cutoff; those refer to data the store no longer holds, so re-sending them to a subscriber would be a wasted round trip. Stop at the first non-evictable front entry: notifications arrive in roughly slot order, so the leak past the cutoff is bounded by out-of-order arrivals (the next bump catches up).
`notifications_after` now takes `&mut usize`. Callers track a monotonic logical cursor; if it lags the prune frontier the call bumps it forward so subsequent `*after += 1` increments stay aligned with the items actually consumed. `notification_count` reports the all-time logical total.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
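The pruning and cursor behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual `LeiosStore` code: `Notification`, `NotificationLog`, `prune_below` and the `slots` field are hypothetical stand-ins, and the real `notifications_after` may return something other than a `Vec` of references.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for `LeiosNotification`: only the slots it references.
#[derive(Debug, Clone, PartialEq)]
struct Notification {
    slots: Vec<u64>,
}

#[derive(Default)]
struct NotificationLog {
    entries: VecDeque<Notification>,
    /// How many entries have been dropped from the front so far.
    pruned_count: usize,
}

impl NotificationLog {
    fn push(&mut self, n: Notification) {
        self.entries.push_back(n);
    }

    /// All-time logical total, including entries that were pruned.
    fn notification_count(&self) -> usize {
        self.pruned_count + self.entries.len()
    }

    /// Front-prune entries whose every referenced slot is below `cutoff`,
    /// stopping at the first entry that still references live data.
    fn prune_below(&mut self, cutoff: u64) {
        while let Some(front) = self.entries.front() {
            if !front.slots.iter().all(|slot| *slot < cutoff) {
                break;
            }
            self.entries.pop_front();
            self.pruned_count += 1;
        }
    }

    /// Entries at or after the logical cursor. A cursor that lags the prune
    /// frontier is bumped forward so later `*after += 1` steps stay aligned.
    fn notifications_after(&self, after: &mut usize) -> Vec<&Notification> {
        if *after < self.pruned_count {
            *after = self.pruned_count;
        }
        self.entries.iter().skip(*after - self.pruned_count).collect()
    }
}
```

Note that an out-of-order entry behind a live one (for example slot 3 queued after slot 10) survives a prune at cutoff 5 until the front entry itself becomes evictable, which is the bounded leak the message mentions.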
Three info! lines fire once per item under steady cluster load and together accounted for ~600K of the ~660K info-level log lines in a 27-minute test run, dominating disk usage:
- `transaction received` (net-node) 366K lines
- `network event` (net-node, default arm) 155K lines
- `mempool: evicting oldest tx` (shared) 71K lines
At a 1 tx/s/node generation rate with a 10K-cap mempool, the eviction line fires on every admit once the cap is reached. At RUST_LOG=info on a 25-node cluster these saturate disk in roughly half a day.
None of the three is useful at info: the per-tx and per-event lines are item-level traces (debug territory) and eviction at cap is a steady-state condition, not a notable event. Periodic `mempool state sizes` / `praos state sizes` / `leios_store: stats` lines remain at info; those carry the diagnostic signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The demuxer enforces `set_ingress_limit` as a *buffer* cap on the
per-protocol ingress queue, not as a per-message cap. Under server
pipelining (`MsgStartBatch` immediately followed by `MsgBlock`) both
segments can land in the buffer before the codec processes
`MsgStartBatch` and bumps the limit for `StStreaming`. With the
prior per-state caps that race manifested two ways:
1. `StBusy` at `SIZE_LIMIT_SMALL = 65_535` rejected the pipelined
block body outright (a 65K+ block was already enough to trip
it).
2. Even with `StBusy` raised to `SIZE_LIMIT_STREAMING = 2.5 MB`,
a real Praos block body — particularly the post-EB-overflow
fallback path where txs the EB couldn't carry get inlined into
the RB — can legitimately exceed 2.5 MB. Overnight one landed
at 2,506,268 bytes and tripped the new cap, cascading SDU
timeouts and freezing the cluster.
The spec defines `INGRESS_LIMIT = 230 MB` as the per-protocol
ingress buffer cap exactly for this case. Use it for both `StBusy`
and `StStreaming`. Spec per-message rejection at
`SIZE_LIMIT_SMALL` / `SIZE_LIMIT_STREAMING` belongs in the codec at
decode time (not yet wired); the framework's `size_limit` callback
controls buffer sizing only.
`StIdle` keeps `SIZE_LIMIT_SMALL` — the client never receives in
that state, so the tighter cap stands.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two distinct size guards on the receive path are now correctly layered:
1. **Demuxer buffer cap** (DoS protection): fixed at protocol registration via `ProtocolConfig::ingress_limit`, never narrowed at runtime. Sized to bound runaway accumulation when a protocol consumer falls behind.
2. **Codec per-message cap** (protocol conformance): `Runner::recv` passes `P::size_limit(state)` to `CodecRecv::recv`; after a CBOR value decodes, the consumed-bytes count is checked against the per-state spec limit and `MuxError::MessageTooLarge` is returned (and the connection torn down) for spec-violating peers.
The previous design conflated these: `Runner` overrode the demuxer's buffer cap with the current state's per-message cap on every transition, so a fast peer pipelining `MsgStartBatch` + `MsgBlock` into one TCP read could legitimately overflow before the local runner advanced the state.
Reverts `BlockFetch::size_limit` to spec (`SIZE_LIMIT_SMALL` for StIdle/StBusy, `SIZE_LIMIT_STREAMING` for StStreaming); the values are now actually enforced at decode time rather than misused as buffer caps.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
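A rough sketch of the two layers, to make the separation concrete. Everything here is illustrative: `IngressBuffer`, `check_decoded` and `RecvError` are hypothetical names rather than the real mux API, and the byte values are this sketch's decimal reading of the "65_535", "2.5 MB" and "230 MB" figures quoted in these messages.

```rust
/// Per-message limits (assumed decimal byte values for this sketch).
const SIZE_LIMIT_SMALL: usize = 65_535;
const SIZE_LIMIT_STREAMING: usize = 2_500_000;
/// Buffer cap fixed at registration (assumed decimal reading of "230 MB").
const INGRESS_LIMIT: usize = 230_000_000;

#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Idle,
    Busy,
    Streaming,
}

#[derive(Debug, PartialEq)]
enum RecvError {
    /// Demuxer layer: the ingress buffer would exceed its fixed cap.
    IngressOverflow,
    /// Codec layer: one decoded message exceeded the per-state limit.
    MessageTooLarge { size: usize, limit: usize },
}

/// Per-state spec limit, as described for BlockFetch above.
fn size_limit(state: State) -> usize {
    match state {
        State::Idle | State::Busy => SIZE_LIMIT_SMALL,
        State::Streaming => SIZE_LIMIT_STREAMING,
    }
}

/// Layer 1: buffer accounting that never depends on protocol state.
struct IngressBuffer {
    buffered: usize,
    limit: usize,
}

impl IngressBuffer {
    fn new(limit: usize) -> Self {
        IngressBuffer { buffered: 0, limit }
    }

    fn push_segment(&mut self, len: usize) -> Result<(), RecvError> {
        if self.buffered + len > self.limit {
            return Err(RecvError::IngressOverflow);
        }
        self.buffered += len;
        Ok(())
    }

    fn consume(&mut self, len: usize) {
        self.buffered = self.buffered.saturating_sub(len);
    }
}

/// Layer 2: checked after a message decodes, against the state it was
/// decoded in, using the number of bytes the decoder consumed.
fn check_decoded(state: State, consumed: usize) -> Result<(), RecvError> {
    let limit = size_limit(state);
    if consumed > limit {
        return Err(RecvError::MessageTooLarge { size: consumed, limit });
    }
    Ok(())
}
```

With this split, a pipelined small message plus a large block can both sit in the buffer while the runner is still in the busy state; the per-message check only runs once each message is decoded in its proper state.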
`MempoolState::peek_unannounced_for_peer` used to scan the entire `txs` queue on every call, doing a `BTreeSet<TxId>::contains` check per element to find the few ids the peer hadn't yet been told about. Under steady TxSubmission pull traffic with a caught-up peer that's `O(N · log A · 32)` per poll for no useful work, the dominant mempool-CPU symptom under cluster load.
Flip the polarity: keep a per-peer "still owed to this peer" `BTreeSet<TxId>` that's the inverse partition of `peer_advertised` over the current mempool. Lazily seed it from `tx_index` on first peer-facing call; admit fan-out inserts the new id into every known peer's owed set; pruning (`drain_*`, `on_block_applied`, capacity eviction) drops the id from both sides; `forget_peer` clears both.
The hot path becomes: look up the peer's owed set (`O(log P)`), return immediately if empty, otherwise drain up to `max_count` and resolve bodies. The remaining `for tx in &self.txs` scan only runs when there's actually unannounced work; the empty-poll case (which dominates under steady-state cluster traffic) returns without touching the queue.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
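A simplified sketch of the owed-set bookkeeping, under stated assumptions: `TxId` and `PeerId` are shrunk to integers, the `Mempool` struct and its `admit` / `remove` methods are hypothetical, and ids come back in id order rather than queue order because body resolution against `txs` is left out.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical simplifications: the real ids are 32-byte hashes.
type TxId = u64;
type PeerId = u32;

#[derive(Default)]
struct Mempool {
    /// Ids currently in the mempool.
    tx_index: BTreeSet<TxId>,
    /// Per peer: ids in the mempool that the peer has not been told about.
    owed: BTreeMap<PeerId, BTreeSet<TxId>>,
}

impl Mempool {
    /// Admit fan-out: a new id becomes owed to every known peer.
    fn admit(&mut self, id: TxId) {
        if self.tx_index.insert(id) {
            for set in self.owed.values_mut() {
                set.insert(id);
            }
        }
    }

    /// Any pruning path drops the id from the index and every owed set.
    fn remove(&mut self, id: TxId) {
        self.tx_index.remove(&id);
        for set in self.owed.values_mut() {
            set.remove(&id);
        }
    }

    /// Hot path: an empty owed set returns without scanning anything.
    /// In this sketch, returned ids count as announced from then on.
    fn peek_unannounced_for_peer(&mut self, peer: PeerId, max_count: usize) -> Vec<TxId> {
        let tx_index = &self.tx_index;
        // Lazily seed the owed set from the index on first contact.
        let owed = self.owed.entry(peer).or_insert_with(|| tx_index.clone());
        if owed.is_empty() {
            return Vec::new();
        }
        let batch: Vec<TxId> = owed.iter().take(max_count).copied().collect();
        for id in &batch {
            owed.remove(id);
        }
        batch
    }

    fn forget_peer(&mut self, peer: PeerId) {
        self.owed.remove(&peer);
    }
}
```

The trade-off is extra work on admit (one insert per known peer) in exchange for a constant-cost empty poll, which matches the traffic pattern the message describes.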
Empty vote batches were enqueued by `inject_votes` but flagged non-evictable by `notification_evictable`, so they accumulated past slot pruning. Skip the enqueue and let `all(...)` decide eviction on its own vacuous-truth semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
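For reference, the vacuous-truth behaviour relied on here is standard `Iterator::all` semantics. The predicate below is a hypothetical stand-in for the vote arm of `notification_evictable`, taking only the slots a batch references:

```rust
/// Hypothetical stand-in: a batch is evictable when every vote it
/// references sits below the cutoff slot.
fn votes_evictable(vote_slots: &[u64], cutoff: u64) -> bool {
    // `all` on an empty iterator returns true, so an empty batch
    // counts as evictable without any special-casing.
    vote_slots.iter().all(|slot| *slot < cutoff)
}
```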
Dijkstra cert blocks are deserialised, resolved (txs spliced from the announcer's EB), and re-encoded. This relies on the protocol invariant that a cert block always immediately follows an announcer. Added a new `resolveLeiosBlockHdr` rather than overloading the existing `resolveLeiosBlock`, to avoid conflicts. No integration was needed in network/node, except srps.
tcp-model: analytic TCP envelope for sim-rs links
Reject mss_bytes == 0 and loss_prob_per_segment outside [0, 1] at the YAML override layer, with a matching early-return in msg_loss_prob so the library can't be panicked from configuration. Add a debug_assert when add_edge sees a tcp_envelope but the rng_oracle hasn't been wired. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
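A sketch of the two guards, assuming a free-function shape rather than the real config types. `CfgError` and `validate` are hypothetical, the `1 - (1 - p)^segments` form is only one plausible reading of "p scales with segment count", and returning `0.0` from the early-return path is likewise an assumption.

```rust
#[derive(Debug, PartialEq)]
enum CfgError {
    ZeroMss,
    LossProbOutOfRange(f64),
}

/// Override-layer check: reject values the model cannot work with.
fn validate(mss_bytes: u32, loss_prob_per_segment: f64) -> Result<(), CfgError> {
    if mss_bytes == 0 {
        return Err(CfgError::ZeroMss);
    }
    // `contains` is false for NaN, so NaN is rejected as well.
    if !(0.0..=1.0).contains(&loss_prob_per_segment) {
        return Err(CfgError::LossProbOutOfRange(loss_prob_per_segment));
    }
    Ok(())
}

/// Library-side guard: the same conditions return early instead of
/// dividing by zero, in case a config bypassed `validate`.
fn msg_loss_prob(bytes: u64, mss_bytes: u32, loss_prob_per_segment: f64) -> f64 {
    if mss_bytes == 0 || !(0.0..=1.0).contains(&loss_prob_per_segment) {
        return 0.0;
    }
    let mss = u64::from(mss_bytes);
    // Ceiling division: a partial segment still counts as one segment.
    let segments = (bytes + mss - 1) / mss;
    let exponent = segments.min(i32::MAX as u64) as i32;
    // Assumed model: independent loss per segment.
    1.0 - (1.0 - loss_prob_per_segment).powi(exponent)
}
```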
sequential.rs builds Connections directly (it doesn't route through NetworkCoordinator), so the envelope wiring added earlier was only reaching the actor engine. Fix: construct an EnvelopeWiring at link-build time whenever `LinkConfiguration.tcp_envelope` is `Some`, using the same `Rng::new(config.seed)` stateless oracle already used by other sequential-engine machinery.
Caught by a NA,0.200 / top-stake-fraction / 750n sanity run with loss-prob-per-segment 0.01: the turbo-engine output was byte-identical to baseline because the envelope cfg was being discarded at the Connection::new call site.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
TCP has one cwnd per connection, so a new event (loss or idle reset) should *replace* the per-link envelope rather than overlay onto a stack. Adds `bw_start` to `Envelope` so the new envelope picks up at whatever multiplier the link was already at, then dips to `bw_depth` through the onset and recovers to 1.0 through the release.
LinkState.active (SmallVec) becomes LinkState.current (Option). bw_mult and delivery_floor query the single envelope; bytes_deliverable sub-divides at its phase transitions only.
Loss retrigger:
  bw_start  = current_mult
  bw_depth  = current_mult * loss_bw_depth   (= cwnd halved from current)
  lat_stall = max(existing_stall_remaining, rto)
so back-to-back losses correctly extend the stall (instead of shortening it under naive replacement) and the AIMD halving compounds from the current cwnd-like state instead of always halving from 1.0.
Without this, the multiplicative composition of overlapping envelopes drove bw_mult to ~0 at high loss probabilities, throttling the sim to a crawl. The single-envelope model is both physically correct and far cheaper to evaluate.
27 unit tests (was 25), including loss-replaces-from-current-mult and back-to-back-stall-extension. sim-core regression suite unchanged (65 tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
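The loss-retrigger rule can be sketched as a pure function. This is only an illustration of the three assignments above: the `Envelope` here carries just the fields needed for the example, `stall_end` (an absolute time in seconds) and `on_loss` are hypothetical names, and the onset/release curves are omitted.

```rust
/// Reduced, hypothetical envelope: only what the retrigger rule touches.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Envelope {
    /// Multiplier the link was at when this envelope began.
    bw_start: f64,
    /// Multiplier at the bottom of the dip.
    bw_depth: f64,
    /// Absolute time (seconds) at which the latency stall ends.
    stall_end: f64,
}

/// Build the envelope that replaces `current` when a loss fires at `now`.
fn on_loss(
    current: Option<Envelope>,
    current_mult: f64,
    now: f64,
    rto: f64,
    loss_bw_depth: f64,
) -> Envelope {
    // Stall time still outstanding from the envelope being replaced.
    let existing_stall_remaining = current
        .map(|e| (e.stall_end - now).max(0.0))
        .unwrap_or(0.0);
    // Taking the max means a second loss can extend a stall, never shorten it.
    let lat_stall = existing_stall_remaining.max(rto);
    Envelope {
        bw_start: current_mult,
        // Halving compounds from wherever the link currently is.
        bw_depth: current_mult * loss_bw_depth,
        stall_end: now + lat_stall,
    }
}
```

For example, with a first loss at t = 0 (RTO 1.0 s) and a second at t = 0.25 with a shorter RTO of 0.5 s, naive replacement would end the stall at 0.75 s; the `max` keeps it at 1.0 s.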
Adds an optional `tcp-envelope` block at the top level of the scenario
config (RawParameters) that applies to every directed link in the
topology with a bandwidth set. Per-link overrides in the topology YAML
layer on top, so the resolution order per link is:
defaults_for(latency, bps) → global → per-link override
This is the common case for sweeps: enable a loss probability for every
edge in one place instead of touching the topology file. Per-link
overrides remain available for asymmetric experiments.
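The resolution order can be sketched as successive overlays of optional fields. This is a reduced illustration: `EnvelopeCfg` and `RawOverride` carry only three of the fields, the numbers in `defaults_for` are placeholders rather than the physics-derived defaults, and returning `None` when neither layer is present is an assumption about how the no-cfg path behaves.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct EnvelopeCfg {
    mss_bytes: u32,
    rto_ms: u64,
    loss_prob_per_segment: f64,
}

/// Hypothetical YAML-side shape: every field optional.
#[derive(Debug, Clone, Copy, Default)]
struct RawOverride {
    mss_bytes: Option<u32>,
    rto_ms: Option<u64>,
    loss_prob_per_segment: Option<f64>,
}

/// Placeholder defaults; the real ones derive from latency and bandwidth.
fn defaults_for(_latency_ms: u64, _bps: u64) -> EnvelopeCfg {
    EnvelopeCfg { mss_bytes: 1460, rto_ms: 1000, loss_prob_per_segment: 0.0 }
}

/// Overlay one layer: set fields win, unset fields fall through.
fn layer(base: EnvelopeCfg, ov: &RawOverride) -> EnvelopeCfg {
    EnvelopeCfg {
        mss_bytes: ov.mss_bytes.unwrap_or(base.mss_bytes),
        rto_ms: ov.rto_ms.unwrap_or(base.rto_ms),
        loss_prob_per_segment: ov
            .loss_prob_per_segment
            .unwrap_or(base.loss_prob_per_segment),
    }
}

/// defaults_for(latency, bps) -> global -> per-link override.
fn resolve(
    latency_ms: u64,
    bps: u64,
    global: Option<&RawOverride>,
    per_link: Option<&RawOverride>,
) -> Option<EnvelopeCfg> {
    if global.is_none() && per_link.is_none() {
        // Assumed: no block anywhere means no envelope on this link.
        return None;
    }
    let mut cfg = defaults_for(latency_ms, bps);
    if let Some(g) = global {
        cfg = layer(cfg, g);
    }
    if let Some(l) = per_link {
        cfg = layer(cfg, l);
    }
    Some(cfg)
}
```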
Topology::from_raw(raw, global) is the new entry point; the existing
`From<RawTopology>` impl thin-wraps it with `global = None` so test
fixtures and the gen-test-data binary keep working unchanged.
sim-cli reorders parsing so RawParameters is read first, then the
topology is converted with the global cfg in hand.
Three new config tests cover the global-applies path, the no-cfg path,
and per-link-overrides-global layering.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
RawTcpEnvelope now exposes the full LinkEnvelopeCfg surface as optional fields (mss-bytes, initial-cwnd-segments, idle-reset-threshold-ms, rto-ms, loss-bw-depth, cold-bw-depth, cold-release-ms, cold-release-shape, loss-release-ms, loss-release-shape). Any field left unset falls through to the physics-derived default from LinkEnvelopeCfg::defaults_for. Unknown fields are rejected.
Adds kebab-case serde rename on tcp_model::Curve so the YAML enum values are spelt "step" / "linear" / "geometric" rather than the PascalCase Rust variants.
Four new parsing tests cover empty blocks, the full override schema, the deny-unknown-fields guard, and the layered-defaults semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Each directed Connection optionally carries a tcp_model::LinkState plus
the (from, to) link identity and the stateless Rng oracle used to seed
per-message loss draws. Behaviour with no envelope configured is
byte-identical to today; the regression suite is unchanged.
When an envelope is attached:
- send() calls LinkState::on_send with a loss outcome drawn from the
oracle under context ("tcp_loss", from, to, send_time, message_id).
- update_bandwidth_queues() integrates `bps * bw_mult(t)` to find the
total bytes deliverable over [last_event, now], then fair-shares them
among miniprotocols exactly as before. Per-message arrival timestamps
are computed by inverting bytes_deliverable so the slow-start ramp is
reflected in arrivals (not averaged out over the update window).
- All arrivals are clamped to LinkState::delivery_floor, modelling
cross-protocol HoL blocking during a loss-induced RTO stall.
Topology YAML grows an optional per-link `tcp-envelope` block; for now
only `loss-prob-per-segment` is exposed, with all other params derived
from the link's latency and bandwidth via LinkEnvelopeCfg::defaults_for.
Three new connection.rs tests cover: cold-start delaying a 1 MB transfer,
loss-induced delivery floor, and envelope-disabled byte-identical
matching against the no-envelope path.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
LinkState carries a small stack of active envelopes, queried as a bandwidth multiplier and a delivery floor at message-send time:
- Cold/idle: instantaneous depth + geometric recovery over log2(BDP/IW) RTTs. Re-fires after a configurable idle gap (default RFC-6298-ish max(1s, 2L)).
- Loss: Bernoulli draw per send (p scales with segment count); stall window = 1 RTO; bandwidth halves at the moment of recovery and recovers linearly over (BDP/2)/MSS RTTs.
bytes_deliverable integrates bps · bw_mult(t) over arbitrary intervals, sub-dividing at envelope phase transitions and applying a fine-grained trapezoid rule so geometric ramps integrate cleanly. delivery_floor takes the max stall-end over active envelopes, which gives exact HoL semantics.
24 unit tests including an analytic slow-start cross-check (300ms / 1 MiB/s, 1 MB cold message → ~3.4s transfer time).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Drops the internal RNG: on_send now takes a pre-drawn `loss_drawn` boolean, and LinkEnvelopeCfg gains `msg_loss_prob(bytes)` for the caller to sample from any deterministic oracle. This keeps tcp-model purely deterministic state with no rand dependency.
Adds `invert_bytes_deliverable(bps, target, t0, upper)`, a binary search for when a cumulative byte count crosses a threshold, used by consumers to compute envelope-aware per-message arrival times.
Adds `has_active_envelopes()` so consumers can fast-path the unperturbed case, and stops retaining envelopes that fire immediately expired (e.g. under `LinkEnvelopeCfg::disabled`), guaranteeing byte-identical behaviour when envelopes are turned off.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
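The inversion step can be sketched as a bisection over any non-decreasing cumulative function. This is not the crate's `invert_bytes_deliverable`: the function name `invert_monotone`, its closure-based signature, the `Option` return, and the example ramp `ramp_cumulative` are all assumptions made for illustration.

```rust
/// Find the earliest time in `[t0, upper]` at which a non-decreasing
/// cumulative byte count reaches `target`, by bisection.
/// Returns `None` if the target is not reached by `upper`.
fn invert_monotone(
    cumulative: impl Fn(f64) -> f64,
    target: f64,
    t0: f64,
    upper: f64,
) -> Option<f64> {
    if cumulative(upper) < target {
        return None;
    }
    let (mut lo, mut hi) = (t0, upper);
    // 64 halvings shrink the bracket far below any practical time resolution.
    for _ in 0..64 {
        let mid = 0.5 * (lo + hi);
        if cumulative(mid) >= target {
            hi = mid;
        } else {
            lo = mid;
        }
    }
    Some(hi)
}

/// Example cumulative curve (hypothetical): 1000 bytes/s with a multiplier
/// that ramps linearly from 0.5 to 1.0 over the first second, then stays at 1.0.
fn ramp_cumulative(t: f64) -> f64 {
    let bps = 1000.0;
    if t <= 1.0 {
        // Integral of (0.5 + 0.5 * s) ds from 0 to t.
        bps * (0.5 * t + 0.25 * t * t)
    } else {
        bps * 0.75 + bps * (t - 1.0)
    }
}
```

Under this example curve, 750 bytes are delivered by t = 1 s, whereas a flat 1000 bytes/s link would reach that count at 0.75 s; this is the ramp effect that per-message arrival times are meant to reflect.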
Workspace skeleton for the shared TCP-behaviour envelope model crate that will be consumed by sim-rs and net-rs. No implementation yet; modules and API follow. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: Jonathan Lim <[email protected]>
Part of https://github.com/cardano-scaling/hydra/issues/1468

## Summary

This PR implements the full **partial fanout** mechanism for large UTxO sets, introducing two key additions:

### `FanoutProgressDatum`: slim intermediate datum

After the first `PartialFanout` step, the head output no longer carries the full `ClosedDatum` (11 fields). Instead it carries a `FanoutProgressDatum` with only the 4 fields needed for subsequent steps: `headId`, `parties`, `contestationDeadline`, `accumulatorCommitment`. This reduces datum size and script execution cost for every intermediate step.

### `FinalPartialFanout` redeemer: explicit terminal step

A dedicated `FinalPartialFanout` redeemer handles the `FanoutProgress → Final` transition. It burns all head tokens (ST + PTs) and distributes the last batch of UTxOs, making the terminal step type-safe and distinct from the intermediate `PartialFanout` redeemer.

### Updated state machine

```
Closed ──PartialFanout──────────────────────► FanoutProgress
FanoutProgress ──PartialFanout──────────────► FanoutProgress (repeat)
FanoutProgress ──FinalPartialFanout─────────► Final (burns tokens)
Closed ──Fanout──────────────────────────────► Final (unchanged, ≤ threshold UTxOs)
```

### EIP-4844 KZG trusted setup

The accumulator uses the [EIP-4844 KZG trusted setup](https://ceremony.ethereum.org/) (`trusted_setup.json`) as the Common Reference String. G1 points are used on-chain (compressed BLS12-381 G1 element stored in the datum), and G2 points are used for the membership proof verification. The KZG CRS is delivered to the on-chain validator via a dedicated `vCRS` reference script output.

### `accumulatorHash` binding

`Close` and `Contest` validators bind the `accumulatorHash` (stored in `ClosedDatum`) to the G1 accumulator commitment. This ensures consistency between the signed hash and the on-chain commitment point.

### CRS validator (`vCRS`)

New `vCRS` script that holds the G2 CRS points as an inline datum, referenced by the `PartialFanout` and `FinalPartialFanout` redeemers to avoid re-embedding the large CRS in every transaction.

### cardano-api 10.21

Updated `cardano-api` dependency from `^>=10.18` to `^>=10.21`, integrating all changes from the upstream `capi-10.21` branch. This was previously carried as a `source-repository-package` override and is now resolved to the released version.

## Changes

**On-chain (`hydra-plutus`)**
- `HeadState.hs`: new `FanoutProgressDatum` type; `FanoutProgress` state variant; `FinalPartialFanout` redeemer variant
- `Head.hs`: validator cases for `(FanoutProgress, PartialFanout)` and `(FanoutProgress, FinalPartialFanout)`; first `PartialFanout` step now outputs `FanoutProgress` datum instead of `Closed`
- `HeadError.hs`: new error codes for fanout progress transitions
- `CRS.hs`: new CRS reference script validator
- Updated `vHead.plutus`, `mHead.plutus`, `vCRS.plutus` golden scripts

**Off-chain (`hydra-tx`)**
- `Fanout.hs`: `partialFanoutTx` outputs `FanoutProgress` datum; new `finalPartialFanoutTx` that reads `FanoutProgressDatum` and burns tokens; new `observeFinalPartialFanoutTx`
- `Accumulator.hs`: `buildFromSnapshotUTxOs`, `getAccumulatorHash`, `createMembershipProofFromUTxO` and related helpers
- `KZGTrustedSetup.hs`: EIP-4844 trusted setup integration

**Node logic (`hydra-node`)**
- `HeadLogic.hs`: `emitNextFanoutStep` updated to choose between `partialFanoutTx`, `intermediatePartialFanoutTx`, and `finalPartialFanoutTx` based on remaining UTxO count and current head state
- `Chain.Direct.Handlers.hs`: observation of `FinalPartialFanout` (surfaced as `OnFanoutTx`)
- `Chain.Direct.State.hs`: `FanoutProgressState` in chain state; generators for intermediate and final partial fanout

**Tests**
- New contract tests: `PartialFanout.hs`, `FinalPartialFanout.hs`
- `KZGTrustedSetupSpec.hs`: property tests for CRS point counts and trusted setup round-trips
- `StateSpec.hs`, `HeadLogicSpec.hs`, `BehaviorSpec.hs`, `ChainObserverSpec.hs`, `HandlersSpec.hs`, `TxSpec.hs`: updated for new transitions
- `SecurityScenarios.hs`: filled missing `accumulator`, `accumulatorHash`, `accumulatorCommitment` fields post-rebase

---

* [x] CHANGELOG updated or not needed
* [x] Documentation updated or not needed
* [x] Haddocks updated or not needed
* [ ] No new TODOs introduced or explained hereafter