Transaction submission errors

Incident Report for Xaman XRPL Labs

Postmortem

Context

There are three types of nodes powering “xrplcluster.com”:

Full history nodes, serving full transaction history & handling queries regarding transactions that potentially took place a long time ago (so: potentially needing full history). Example queries are: account transactions and ledger contents.
Non full history nodes, serving only recent information. Example queries are: on ledger event subscriptions (new transactions, closed ledgers) and e.g. account TrustLine information for the most recent ledger.
Submission nodes, submitting transactions to the XRP Ledger. These nodes only submit transactions. By being tasked with only transaction submission, these nodes are never too busy, preventing fee escalation.

The technology powering “xrplcluster.com” analyses requests and dynamically routes commands/queries to the most suitable nodes. When one of the nodes goes down, the cluster automatically detects this and seamlessly hands clients over to one of the other nodes of the same type (FH, non FH, submission).
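The routing and handover behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual “xrplcluster.com” implementation: the command-to-node-type mapping, the Node and Cluster classes and the node names are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping of XRPL commands to node types (an assumption for
# this sketch; the real cluster's routing rules are not published here).
NODE_TYPE_FOR_COMMAND = {
    "account_tx": "full_history",         # account transactions
    "ledger": "full_history",             # ledger contents
    "subscribe": "non_full_history",      # on ledger event subscriptions
    "account_lines": "non_full_history",  # TrustLine information
    "submit": "submission",               # transaction submission
}


@dataclass
class Node:
    name: str
    node_type: str
    healthy: bool = True


class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def route(self, command):
        """Return the first healthy node of the type that serves `command`."""
        # Unknown commands fall back to non full history nodes (assumption).
        node_type = NODE_TYPE_FOR_COMMAND.get(command, "non_full_history")
        for node in self.nodes:
            if node.node_type == node_type and node.healthy:
                return node
        # No healthy node of this type is left: there is nothing to hand
        # over to, which is the situation described under "Outage".
        raise RuntimeError(f"no healthy {node_type} node available")
```

With two submission nodes, marking the first one unhealthy makes route("submit") return the second one. Once every node of a type is unhealthy, handover is no longer possible and the request fails.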

Until today all three node types were powered by a number of servers for performance & redundancy reasons: 9 full history nodes, 6 non full history nodes and 3 submission nodes. Several well known developers/engineers from the XRPL community maintain, contribute and host these nodes & the cluster, and new nodes are being added on a regular basis to keep up with the growing use of the XRP Ledger.

Outage

Today, all three submission nodes went down at the same time. As a result, applications relying on “xrplcluster.com” (like XUMM Wallet, Ledger Live (Desktop & Mobile) and others) could connect, fetch transactions, balances, etc., but when submitting a transaction, a timeout would occur and an error message could be displayed.

As the three nodes run on different hardware and all have slightly different configurations, this seemed almost impossible. While getting the three nodes back up, it became clear that all three had failed for different reasons:

One node was misconfigured, and applied rate limiting to the cluster as a whole instead of to the individual end user
One node failed because of a dead hard drive
One node failed because of a corrupted database, containing the most recent transaction information (instead of serving corrupt info, it’d go offline)

To get transaction submission back up and running, we (several developers/engineers from the XRPL community) spun up multiple new submission nodes in different geographic regions. There are now 5 submission nodes in 5 different geographical regions, maintained by 3 different operators.

The problem & solution

It is very unlikely that three out of three servers responsible for the same job fail at the same time, but “sh#t happens”: something like this can always happen. However, with more nodes (now: 5 servers, 5 geographical regions, 3 operators) it is a lot less likely to happen to the new, larger and more widely distributed set of nodes responsible for transaction submission.
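As a rough illustration of why more nodes help: if each node were down independently with the same probability, the chance of all nodes being down at once would shrink exponentially with the number of nodes. Both the independence assumption and the probability values below are assumptions made for the example, not measured figures from this incident.

```python
def all_down_probability(p_single: float, n_nodes: int) -> float:
    """Probability that every one of n_nodes is down at the same time.

    Assumes each node fails independently with probability p_single
    (an idealisation; shared operators or regions break independence).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return p_single ** n_nodes


# Illustrative value only: compare the old (3 nodes) and new (5 nodes) setup.
p = 0.5
old_setup = all_down_probability(p, 3)
new_setup = all_down_probability(p, 5)
```

Under these assumptions, going from 3 to 5 nodes multiplies the chance of a total outage by the square of the single-node failure probability. Spreading the nodes over different regions and operators is what makes the independence assumption more plausible.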

A bigger problem was the fact that node operators weren’t informed in time about the outage. While monitoring picked up on the outage of all three nodes, this didn’t trigger a notification to a multitude of cluster & node operators. Also, this XRPL Labs status page did not pick up on the outage.

Both issues (active notifications to multiple node operators, and the XRPL Labs status page picking up on outages) have already been addressed. Multiple node and cluster operators will be actively messaged using several channels from now on, and the integration between the “xrplcluster.com” monitoring and the XRPL Labs status page has been improved. This integration will also be shared with the XRP Ledger Foundation, for a soon to be launched status page for services offered by the XRP Ledger Foundation (like “xrplcluster.com”).
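The idea of messaging several operators over several channels can be sketched as below. This is a hypothetical outline, not the monitoring actually in use: the function names, the health data layout and the channel callables are assumptions made for the example.

```python
from typing import Callable, Dict, List


def nodes_down_alerts(health: Dict[str, List[bool]]) -> List[str]:
    """Return one alert message per node type that has no healthy node left."""
    return [
        f"ALL {node_type} nodes are down"
        for node_type, states in health.items()
        if states and not any(states)
    ]


def notify_all(message: str, channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send `message` over every channel; return the names that succeeded.

    A failing channel must not stop delivery over the remaining ones, so
    each send is attempted independently.
    """
    delivered = []
    for name, send in channels.items():
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # Skip this channel and keep trying the others.
            continue
    return delivered
```

Attempting every channel regardless of earlier failures means a single broken notification path cannot silence an alert about a complete outage of one node type.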

Posted Oct 31, 2021 - 00:42 CEST

Resolved

The public XRP Ledger nodes used by XUMM to submit transactions had major issues, resulting in a lot of users not being able to submit transactions. Fetching account balance, transaction history, etc. was unaffected.

We're really sorry for this outage and inconvenience for those trying to submit transactions. We are already in the process of adding extra submission node capacity & improving failover.

Posted Oct 30, 2021 - 19:00 CEST