There are three types of nodes powering “xrplcluster.com”: full history (FH) nodes, non full history nodes and transaction submission nodes.
The technology powering “xrplcluster.com” analyses incoming requests and dynamically routes commands/queries to the most suitable nodes. When one of the nodes goes down, the cluster automatically detects this and seamlessly hands the client over to another node of the same type (FH, non FH, submission).
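To make the routing and failover idea concrete, here is a minimal, hypothetical sketch of how commands could be classified and routed to node pools. The pool names, node names and command lists are illustrative assumptions, not the actual “xrplcluster.com” implementation.

```python
# Hypothetical sketch: route XRPL commands to the node type best suited to
# serve them, skipping nodes that monitoring has marked unhealthy.
# Pool layout and node names are assumptions for illustration only.

# Commands that need full transaction history vs. transaction submission;
# everything else can be served by a non full history node.
FULL_HISTORY_COMMANDS = {"account_tx", "tx", "tx_history"}
SUBMISSION_COMMANDS = {"submit", "submit_multisigned"}

POOLS = {
    "full_history": ["fh-1", "fh-2", "fh-3"],
    "non_full_history": ["nfh-1", "nfh-2"],
    "submission": ["sub-1", "sub-2", "sub-3"],
}

def classify(command: str) -> str:
    """Map an incoming command to a node type."""
    if command in SUBMISSION_COMMANDS:
        return "submission"
    if command in FULL_HISTORY_COMMANDS:
        return "full_history"
    return "non_full_history"

def route(command: str, healthy: set) -> str:
    """Pick the first healthy node in the matching pool; failover is simply
    skipping nodes that are down. If the whole pool is down, routing fails,
    which is exactly what happened to the submission pool in this incident."""
    node_type = classify(command)
    for node in POOLS[node_type]:
        if node in healthy:
            return node
    raise RuntimeError(f"no healthy {node_type} node available")
```

With this sketch, a `submit` while only `sub-2` is healthy still succeeds (`route("submit", {"sub-2", "fh-1"})` returns `"sub-2"`), while a `submit` with the entire submission pool down raises an error even though FH and non FH queries keep working, mirroring the outage described above.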
Until today, all three node types were powered by multiple servers for performance & redundancy reasons: 9 full history nodes, 6 non full history nodes and 3 submission nodes. Several well-known developers/engineers from the XRPL community maintain, contribute to and host these nodes & the cluster, and new nodes are added on a regular basis to keep up with the growing use of the XRP Ledger.
Today, all three submission nodes went down at the same time. As a result, applications relying on “xrplcluster.com” (like XUMM Wallet, Ledger Live (Desktop & Mobile) and others) could connect and fetch transactions, balances, etc., but when submitting a transaction a timeout would occur and an error message would be displayed.
As the three nodes run on different hardware and all have slightly different configurations, a simultaneous failure seemed almost impossible. When trying to get the three nodes back up, it became clear that all three had failed for different reasons:
To get transaction submission back up and running, we (several developers/engineers from the XRPL community) spun up multiple new submission nodes in different geographic regions. There are now 5 submission nodes in 5 different geographical regions, maintained by 3 different operators.
Three out of three servers responsible for the same job failing at the same time is very unlikely, but “sh#t happens”: something like this can always happen. However, with more nodes (now: 5 servers, 5 geographical regions, 3 operators), a simultaneous failure of all nodes responsible for transaction submission is far less likely.
A bigger problem was that node operators weren’t informed in time about the outage. While monitoring picked up on the outage of all three nodes, it didn’t trigger a notification to the cluster & node operators. The XRPL Labs status page also failed to pick up on the outage.
Both issues — active notifications to node operators, and the XRPL Labs status page picking up on outages — have already been addressed. Multiple node and cluster operators will be actively messaged through several channels from now on, and the integration between the “xrplcluster.com” monitoring and the XRPL Labs status page has been improved. This integration will also be shared with the XRP Ledger Foundation, for a soon to be launched status page covering services offered by the XRP Ledger Foundation (like “xrplcluster.com”).
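The alerting gap can be sketched as well: the missing piece was a check that fires when every node of one type is down, fanned out to multiple operators over multiple channels. The function and channel names below are illustrative assumptions, not the cluster’s real monitoring tooling.

```python
# Hypothetical sketch of "all nodes of one type are down" alerting with
# fan-out to several operators and channels. Names are assumptions only.

def check_pools(pool_status: dict) -> list:
    """Return an alert for every node type whose nodes are ALL down.

    pool_status maps node type -> {node name -> is_up (bool)}.
    A partial outage (at least one node up) does not alert here; the
    cluster can still fail over within the pool in that case.
    """
    alerts = []
    for node_type, nodes in pool_status.items():
        if nodes and not any(nodes.values()):
            alerts.append(f"OUTAGE: all {len(nodes)} {node_type} nodes are down")
    return alerts

def notify(alerts: list, operators: list, channels: list) -> list:
    """Fan each alert out to every operator on every channel
    (e.g. mail, chat), so no single operator or channel is a
    point of failure for the notification itself."""
    return [(op, ch, alert) for alert in alerts
            for op in operators
            for ch in channels]
```

In this incident’s scenario — all three submission nodes down, the other pools healthy — `check_pools` produces exactly one outage alert, and `notify` delivers it to every operator on every configured channel instead of silently logging it.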