EKG metrics for the OutboundQueue
The outbound queue introduces a number of EKG metrics that can be used to monitor the queue’s health. There are a number of gauges showing current status and a number of counters that keep track of error conditions.
Ongoing conversations (e.g.
Records the total number of conversations that are currently ongoing with peers (that is, the number of messages in flight).
Records the total number of nodes to which a recent conversation failed; we will not attempt to enqueue messages to such nodes. What exactly constitutes a “recent” failure depends on the failure policy; currently it defaults to 10 slots (200 seconds). The recent failures statistics are cleared on SIGHUP.
Records the total number of conversations that have been scheduled (and enqueued to specific peers), but not yet started.
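The "recent failure" window described above can be pictured with a small model. This is only a sketch, not the queue's actual implementation: the class name `RecentFailures` and its methods are hypothetical, and the 20-second slot length is inferred from the stated default of 10 slots (200 seconds).

```python
import time

SLOT_SECONDS = 20          # assumption: 200 seconds / 10 slots, per the text above
RECENT_FAILURE_SLOTS = 10  # default failure policy mentioned above


class RecentFailures:
    """Hypothetical model of the recent-failures set (not the real implementation)."""

    def __init__(self, window_seconds=SLOT_SECONDS * RECENT_FAILURE_SLOTS):
        self.window_seconds = window_seconds
        self._failed_at = {}  # node id -> time of the most recent failure

    def record_failure(self, node, now=None):
        self._failed_at[node] = time.monotonic() if now is None else now

    def has_recent_failure(self, node, now=None):
        now = time.monotonic() if now is None else now
        failed_at = self._failed_at.get(node)
        return failed_at is not None and now - failed_at < self.window_seconds

    def gauge(self, now=None):
        """Value a failures gauge would report: nodes with a recent failure."""
        now = time.monotonic() if now is None else now
        return sum(1 for node in self._failed_at if self.has_recent_failure(node, now))

    def clear(self):
        """What a SIGHUP handler would call to reset the statistics."""
        self._failed_at.clear()
```

In this model a node stops counting towards the gauge once its last failure is older than the window, and `clear()` mirrors the reset on SIGHUP.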
Known peers (e.g.,
The outbound queue keeps a number of buckets of peers it knows about.
BucketStatic: statically known peers (from a topology.yaml file; core and relay nodes only). This should only change on SIGHUP and remain static otherwise.
BucketSubscriptionListener: subscribers (peers that subscribed to this node by sending a MsgSubscribe message). A high number indicates a (relay) node under heavy load.
BucketBehindNatWorker: relay nodes discovered through DNS query (used by behind-NAT edge nodes to declare relay nodes as known peers). This should be equal to the number of relay nodes returned by DNS (through a series of DNS queries to the various hostnames).
BucketKademliaWorker: peers discovered through the Kademlia network (used in P2P mode). This number should remain more or less constant (configurable through the topology.yaml file), unless there are not enough Kademlia peers to discover.
For each of these buckets we count the number of distinct known peers (i.e., ignoring routing information).
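To illustrate what "distinct known peers, ignoring routing information" means, here is a minimal sketch. It assumes, purely for illustration, that a bucket's routing information is a list of routes and that each route is a list of alternative peers; the peer names and the helper `distinct_peer_count` are hypothetical.

```python
from itertools import chain


def distinct_peer_count(routes):
    """Count distinct peers in a bucket, ignoring how they are grouped into routes."""
    return len(set(chain.from_iterable(routes)))


# Hypothetical bucket contents: each bucket holds routes, each route holds alternatives.
buckets = {
    "BucketStatic": [["core-1", "core-2"], ["relay-1", "core-1"]],
    "BucketSubscriptionListener": [["edge-1"], ["edge-2"], ["edge-3"]],
}

# One gauge value per bucket; "core-1" appears in two routes but is counted once.
gauges = {name: distinct_peer_count(routes) for name, routes in buckets.items()}
```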
Note on directionality: The outbound queue deals with outgoing messages only. Thus, the situation looks like this for behind-NAT nodes:
    /-------------------------\           /------------------------------\
    |                         |           |                              |
    |     BEHIND-NAT NODE     |<----------+--BucketSubscriptionListener  |
    |                         |           |                              |
    |  BucketBehindNatWorker--+---------->|          RELAY NODE          |
    |                         |           |                              |
    \-------------------------/           \------------------------------/
The behind-NAT node discovers the relay node through a DNS request and adds the
relay node to its
BucketBehindNatWorker, so that it can send messages to the
relay node. It then sends
MsgSubscribe to the relay node, so that the relay
node can add the edge node to its
BucketSubscriptionListener, ensuring that
the behind-NAT node will receive messages from the relay node.
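The two steps just described (DNS discovery, then subscription) can be modelled in a few lines. This is a sketch under simplifying assumptions: the `Node` class and `discover_and_subscribe` function are hypothetical, and both DNS resolution and the MsgSubscribe message are reduced to direct set updates.

```python
class Node:
    """Hypothetical node holding only the two buckets relevant to this scenario."""

    def __init__(self, name):
        self.name = name
        self.buckets = {
            "BucketBehindNatWorker": set(),
            "BucketSubscriptionListener": set(),
        }

    def outbound_peers(self):
        """All peers this node's outbound queue could send messages to."""
        return set().union(*self.buckets.values())


def discover_and_subscribe(edge, relay):
    # Step 1: the edge node learns about the relay (via DNS in the real system)
    # and records it so that it can send messages to the relay.
    edge.buckets["BucketBehindNatWorker"].add(relay.name)
    # Step 2: the edge node sends MsgSubscribe; the relay records the subscriber
    # so that it can send messages back to the edge node.
    relay.buckets["BucketSubscriptionListener"].add(edge.name)
```

After `discover_and_subscribe(edge, relay)` each side has the other as an outbound peer, which is the two-way arrangement shown in the diagram.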
In principle we could do something similar for P2P, but in this case we do not
use subscription. When a P2P node discovers peers through Kademlia, it adds them
to its BucketKademliaWorker, but instead of subscription it relies on Kademlia
to notify other peers that they have a new neighbour; hence, the picture looks
like this:

    /------------------------\           /------------------------\
    |                        |           |                        |
    |         NODE 1         |<----------+--BucketKademliaWorker  |
    |                        |           |                        |
    |  BucketKademliaWorker--+---------->|         NODE 2         |
    |                        |           |                        |
    \------------------------/           \------------------------/
Problems during enqueuing
Failed “enqueue-all” instructions (e.g.,
The queue is configured with detailed routing information, which minimizes the risk of network fragmentation. Normally when a message gets enqueued, this routing information (along with information about which nodes had recent failures) dictates the set of nodes to which the message gets enqueued; we call this an “enqueue-all instruction” (an instruction to enqueue it to all nodes dictated by the routing configuration).
An enqueue-all instruction is considered to have failed when we could not find any node at all to enqueue the message to. In this case the message is lost.
(Enqueuing the message does not of course imply that the message will also be successfully sent.)
Failed “enqueue-one” instructions (e.g.,
Occasionally the queue is provided with a message that should be sent to one of a specific set of nodes. Typically this happens when we learn that a certain set of nodes has some information (for instance, a particular block) that we would like to have; the queue will then enqueue the message to one node in this set; we call this an “enqueue-one instruction”.
As for enqueue-all instructions, an enqueue-one instruction is considered to have failed when we could not find any node to enqueue the message to (that is, any node in the specified set with no recent failures). In this case the message is lost.
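A minimal sketch of an enqueue-one instruction follows. Everything here is illustrative: the function name, the counter key `failed_enqueue_one`, and the choice of simply taking the first usable candidate are assumptions, not the queue's actual selection logic.

```python
from collections import Counter


def enqueue_one(message, candidates, recent_failures, queues, counters):
    """Enqueue message to one candidate that has no recent failure.

    Returns the chosen node, or None if the instruction failed.
    """
    for node in candidates:
        if node not in recent_failures:
            queues.setdefault(node, []).append(message)
            return node
    # No candidate was usable: the instruction failed and the message is lost.
    counters["failed_enqueue_one"] += 1  # hypothetical counter key
    return None
```

For example, with candidates `["a", "b"]` and a recent failure recorded for `"a"`, the message ends up in the queue for `"b"`; if both have recent failures, the counter is incremented instead.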
Failure to choose an alternative (e.g.,
This is related to FailedEnqueueAll, but is a less serious indicator: when we
enqueue a message, the routing information says “for each of these n sets,
pick a node from that set to enqueue the message to”. When we fail to pick
any such node at all, we record the enqueue as having failed, but for each set
individually we increment the FailedChooseAlt counter when we fail to choose an
alternative from that set. In other words, we would only increment the
FailedEnqueueAll counter after having incremented the FailedChooseAlt counter
for each of the n sets.
Failure to choose an alternative increases the chances of network partitioning, but the message is not (necessarily) lost.
Side note: for enqueue-one instructions we don’t separately warn about failure
to choose an alternative as in this case
n=1 and hence failure to choose
an alternative implies failure to enqueue.
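The relationship between the two counters can be made concrete with a small model of an enqueue-all instruction. This is a sketch only: the function `enqueue_all`, the representation of routing information as a list of n lists of alternatives, and the "take the first usable node" rule are assumptions made for illustration.

```python
from collections import Counter


def enqueue_all(routes, recent_failures, counters):
    """Pick one usable node from each of the n sets of alternatives.

    Returns the list of chosen nodes (empty if the instruction failed).
    """
    chosen = []
    for alternatives in routes:
        usable = [node for node in alternatives if node not in recent_failures]
        if usable:
            chosen.append(usable[0])
        else:
            # This set yielded no node, but other sets may still succeed.
            counters["FailedChooseAlt"] += 1
    if not chosen:
        # No set yielded a node at all: the message is lost.
        counters["FailedEnqueueAll"] += 1
    return chosen
```

With two sets of alternatives, a failure in only one of them bumps FailedChooseAlt once and leaves FailedEnqueueAll untouched; only when every set fails does FailedEnqueueAll increase.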
Problems during sending
At some point after enqueuing, the queue will attempt to actually send the
message. As discussed above, while the message sits in the queue, it is
reflected in the Scheduled gauge, and when the conversation starts, this is
reflected in the gauge of ongoing conversations.
When a message fails to send, the FailedSend counter is incremented. This does
not necessarily mean the message is lost, as it might still be sent to other
peers; if we fail to send the message to all peers it was enqueued to, we
increment the FailedAllSends counter. This is the analogue of the
FailedEnqueueAll/FailedChooseAlt split during enqueuing. Unfortunately,
FailedAllSends will only ever be incremented when using the synchronous API
(where we enqueue a message and then wait for it to have been sent), as there is
no natural point to check whether all sends failed otherwise. From a dev-ops
perspective, this means that FailedAllSends may be an under-approximation, and
FailedSend is a somewhat more important measure than FailedChooseAlt.
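The following sketch shows why FailedAllSends can only be maintained by a caller that waits for all sends to finish. The function `send_sync`, its `send` callback, and the use of `ConnectionError` to signal a failed send are hypothetical stand-ins for the synchronous API.

```python
from collections import Counter


def send_sync(message, destinations, send, counters):
    """Send message to every destination and wait for the outcomes.

    Returns the list of destinations the message was delivered to.
    """
    delivered = []
    for node in destinations:
        try:
            send(node, message)
            delivered.append(node)
        except ConnectionError:
            # An individual send failed; the message may still reach other peers.
            counters["FailedSend"] += 1
    if destinations and not delivered:
        # Only known here because we waited for every send to complete.
        counters["FailedAllSends"] += 1
    return delivered
```

An asynchronous caller would see the individual FailedSend increments but never reach the final check, which is the under-approximation described above.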
Finally, the queue offers a “cherished” send API. This is a wrapper around the
synchronous send API which attempts to re-enqueue a message if it failed to
send. Normally it will be enqueued to a different destination, as the previous
one will now be recorded as having a recent failure, reflected in the
Failures gauge. Under very rare circumstances this could loop indefinitely;
when we detect such a loop we increment the FailedCherishLoop counter. Any
value above zero for this counter indicates queue misconfiguration.
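A rough model of the cherished send is given below. How the queue actually detects a loop is not described here, so this sketch substitutes a fixed retry bound (`max_attempts`) as an assumption; the function name, the `send` and `record_failure` callbacks, and the use of `ConnectionError` are likewise hypothetical.

```python
from collections import Counter


def send_cherished(message, candidates, send, recent_failures, record_failure,
                   counters, max_attempts=5):
    """Retry a send, each time re-enqueuing to a node without a recent failure.

    max_attempts is an assumed stand-in for the queue's real loop detection.
    Returns the node the message was delivered to, or None.
    """
    for _ in range(max_attempts):
        usable = [node for node in candidates if node not in recent_failures]
        if not usable:
            return None  # no destination left to re-enqueue to
        node = usable[0]
        try:
            send(node, message)
            return node
        except ConnectionError:
            counters["FailedSend"] += 1
            # A sane failure policy adds the node to recent_failures here,
            # so the next attempt goes to a different destination.
            record_failure(node)
    # We kept retrying without making progress: treat this as a loop.
    counters["FailedCherishLoop"] += 1
    return None
```

If `record_failure` never marks the node as failed (a misconfigured failure policy in this model), the same destination is retried until the bound is hit and FailedCherishLoop is incremented.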
Problems during subscription
The various buckets maintained by the queue can have a maximum size imposed on them. When an attempt is made to grow a bucket past its maximum size, the change is rejected and the corresponding counter is incremented.
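The size check can be sketched as follows. The function `update_bucket`, the `limits` mapping, and the counter key format `bucket_full:<name>` are hypothetical; the source does not name the counter, so treat the key as a placeholder.

```python
from collections import Counter


def update_bucket(buckets, limits, name, new_peers, counters):
    """Replace a bucket's peers unless that would exceed its maximum size.

    Returns True if the change was applied, False if it was rejected.
    """
    new_peers = set(new_peers)
    limit = limits.get(name)  # None means no maximum size is imposed
    if limit is not None and len(new_peers) > limit:
        counters["bucket_full:" + name] += 1  # hypothetical counter key
        return False  # bucket is left unchanged
    buckets[name] = new_peers
    return True
```

A rejected update leaves the bucket's previous contents in place, so a steadily rising counter here suggests more subscribers (or discovered peers) than the configured maximum allows.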