perf: switch to stake_address.hash_raw ordering for stable + fast pagination
The previous fix on this branch ordered by the slot_no of the earliest
delegation per (addr, pool). That ordering was correct and chain-stable
but had a fatal perf characteristic: it forced a correlated subquery on
`delegation` for every row in `epoch_stake` (~1.3M rows per recent epoch
on mainnet). Page 1 of /epochs/:number/stakes took ~5s in EXPLAIN ANALYZE
even though the page itself is only 100 rows — top-N sort still requires
the sort key for every input row.
We tried several escape hatches before giving up on seniority ordering:
- LATERAL subquery with (slot_no, tx_id, cert_index) tie-break to remove
the replica-local addr_id fallback. Correctness improved, perf
unchanged: still ~5s, same loop count.
- 225MB covering index on
delegation(addr_id, pool_hash_id, active_epoch_no, slot_no, tx_id, cert_index)
Went to 6.9s — the index turned each lookup into an Index Only Scan
but the range filter on active_epoch_no (column 3) meant rows still
came back ordered by active_epoch_no, not by slot_no, so the planner
re-sorted inside every loop. Reordering to put slot_no before
active_epoch_no would have helped but only to ~1-2s, still far from
what the original es.id query (~17ms) delivered.
- Walking delegation backwards (latest-first) to invert the LATERAL.
Doesn't work because a stake address can re-delegate to a different
pool — only the most recent delegation matches epoch_stake, so any
earlier delegations we see during the walk are noise.
The only paths to sub-second seniority ordering are materializing the
first-deleg-per-(addr, pool) into a helper table (Koios's approach via
grest.stake_distribution_cache) or upstreaming it into db-sync. Both are
large engineering changes for a query traffic of ~2k calls/day where the
common case is already <50ms via HTTP-layer caching.
Instead, drop seniority semantics and order by stake_address.hash_raw —
chain-derived, deterministic across replicas, and lets the planner walk
the existing `unique_stake_address` index, probing epoch_stake via
`unique_epoch_stake` (epoch_no, addr_id) with early-exit on LIMIT.
EXPLAIN ANALYZE on epoch 500 (1.3M rows):
/epochs/:number/stakes page 1: 5000ms → 138ms
For /epochs/:number/stakes/:pool_id the win is smaller (7s → 3s on the
biggest pool) because that endpoint's bottleneck is the BitmapAnd
finding pool delegators for the epoch, not the sort. A future
bf_idx_epoch_stake_pool_epoch(pool_id, epoch_no) would close that gap
but isn't urgent at current traffic.
Both endpoints now sort by sa.hash_raw so pagination order is consistent
between them. Returned bech32 in `stake_address` is rendered from
sa.view; the binary hash_raw is only the sort key.
Trade-off: offset pagination cost grows linearly with page depth on
/epochs/:number/stakes (page 1 ≈ 138ms, page 10 ≈ 1s, page 100 ≈ 5-10s).
This is the standard "walk N matches, discard" cost of OFFSET on an
index-walk plan; the only escape would be cursor pagination, which would
change the public API. Acceptable for current usage.
Co-Authored-By: Claude Opus 4.7 <[email protected]>