WebSocket connectivity issues
Incident Report for XRPL Labs, XUMM
Postmortem

Around 16:30 (CET) today our monitoring picked up on anomalies: the number of clients connected to xrplcluster.com was dropping (and was lower than average for the time of day and day of week). The connection time (round trip) from our monitoring system through Cloudflare to the xrplcluster.com edge nodes (the geographical points of entry to the cluster) also increased.

When we started to investigate, we also noticed that when signing in to the edge-nodes, our (SSH, management) connections would freeze briefly or even drop and restart. This seemed to occur mostly on the London, Frankfurt and Toronto edges.

We could see that the load was somewhat elevated, but only in the form of short CPU peaks. The edge nodes are well equipped in terms of hardware, and overall node CPU usage was well within healthy ranges: between 10% and 25%.

While we couldn’t identify the cause, the situation stabilized around one hour later. Meanwhile, both Atlassian Statuspage (the product we use to communicate incidents, like the one you’re reading right now) and Cloudflare (our upstream provider for xrplcluster.com connections) reported issues on their platforms. We figured there was a more widespread problem with internet routing / cloud providers / …

Client connections started to come back to the cluster and errors disappeared from our monitoring dashboards, so we could have some dinner and investigate later.

Around 21:00 (CET) the problems came back. They were not as severe as earlier today, but we could reproduce the connection issues. By now we had added more verbose logging to the software responsible for client connections, connection distribution & routing clients to the right XRPL nodes.

We used New Relic (awesome platform, thanks! 🙏) to dig deeper, and noticed short-lived processes consuming 100% of a single core and locking up threads on the xrplcluster.com edge nodes. After adding more logging / metric reporting, we identified that the calls causing this were encoding and decoding relatively large JSON objects. This was odd, as our software implements several rate limits, e.g. on connections per IP per time interval, bandwidth consumption and the size of individual WebSocket messages.
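To illustrate the kind of limits described here, below is a minimal sketch using the Node.js "ws" package. The limit values, port and variable names are made up for illustration; this is not the actual xrplcluster.com proxy code.

```typescript
import { IncomingMessage } from "http";
import { WebSocketServer, WebSocket } from "ws";

const MAX_MESSAGE_BYTES = 64 * 1024;    // assumed per-message cap, not the real value
const MAX_CONN_PER_IP_PER_MIN = 30;     // assumed per-IP connection rate limit

// ip -> timestamps of recent connection attempts
const recentConnections = new Map<string, number[]>();

const wss = new WebSocketServer({
  port: 8080,
  maxPayload: MAX_MESSAGE_BYTES, // "ws" rejects frames larger than this (close code 1009)
});

wss.on("connection", (socket: WebSocket, request: IncomingMessage) => {
  const ip = request.socket.remoteAddress ?? "unknown";

  // Per-IP connection rate limit: count connects during the last 60 seconds.
  const now = Date.now();
  const history = (recentConnections.get(ip) ?? []).filter((t) => now - t < 60_000);
  history.push(now);
  recentConnections.set(ip, history);

  if (history.length > MAX_CONN_PER_IP_PER_MIN) {
    socket.close(1008, "Connection rate limit exceeded"); // 1008 = policy violation
    return;
  }

  socket.on("message", (data) => {
    // Belt-and-braces size check (maxPayload above already enforces this),
    // so oversized payloads never reach an expensive JSON decode step.
    const size = Buffer.isBuffer(data)
      ? data.length
      : Array.isArray(data)
        ? data.reduce((total, chunk) => total + chunk.length, 0)
        : data.byteLength;

    if (size > MAX_MESSAGE_BYTES) {
      socket.close(1009, "Message too big");
      return;
    }

    // ... forward the size-checked message to an XRPL node ...
  });
});
```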

After increasing the log verbosity to the maximum and writing all WebSocket requests and responses to a logfile, we found the IP address of the culprit. It looked oddly familiar… We checked the config & it turned out to be the IP address of one of the whitelisted, unlimited, enterprise clients of xrplcluster.com. We reached out, and within seconds it was clear that a service on their side, which previously pointed to dedicated machines, was sending enormous subscription requests (in size & number) to xrplcluster.com.

Due to logic on the edge nodes, these specific calls (subscriptions) are much more resource intensive than they would be if we simply passed the requests on to an XRP Ledger node. Meanwhile (within minutes) the enterprise user stopped sending these calls to the cluster.
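As a rough illustration of why subscriptions are heavier at the edge (an assumption-laden sketch of the idea, not our actual edge-node code): a pure pass-through path can relay raw bytes untouched, while subscription handling forces the edge to decode the full JSON payload and keep per-client subscription state, so very large subscribe messages translate directly into CPU and memory cost.

```typescript
import { WebSocket } from "ws";

// For subscriptions the edge has to decode the full JSON body and keep per-client
// state, so a huge "subscribe" payload means a huge JSON.parse plus a large
// in-memory subscription table on the edge node.
function handleClientMessage(raw: Buffer, client: WebSocket, upstream: WebSocket): void {
  const message = JSON.parse(raw.toString()); // CPU cost grows with payload size

  if (message.command === "subscribe") {
    trackSubscriptions(client, message.accounts ?? [], message.streams ?? []);
  }

  // Forward the original message to an XRPL node upstream.
  upstream.send(raw);
}

function trackSubscriptions(client: WebSocket, accounts: string[], streams: string[]): void {
  // ... store which accounts/streams this client subscribed to, so responses
  // and pushed events can be routed back to the right client ...
}
```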

To prevent this from happening again:

  1. We improved log verbosity (activated only when we need it to investigate)
  2. We added metric reporting for excessive resource consumption at the WebSocket message level
  3. We limited the max. WebSocket message size for whitelisted IP addresses (enterprise clients) to 5× the regular max. size, instead of no limit at all (see the sketch below)
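A minimal sketch of mitigation 3; the numbers, IP and function name are illustrative, not our production configuration.

```typescript
// Illustrative values only; not the production configuration.
const REGULAR_MAX_MESSAGE_BYTES = 64 * 1024;
const ENTERPRISE_WHITELIST = new Set<string>(["203.0.113.10"]); // example IP

function maxMessageBytesFor(ip: string): number {
  return ENTERPRISE_WHITELIST.has(ip)
    ? REGULAR_MAX_MESSAGE_BYTES * 5 // previously: no limit at all for whitelisted clients
    : REGULAR_MAX_MESSAGE_BYTES;
}
```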

We’re very sorry for the inconvenience. While XUMM and the xrpl-client JS lib. automatically fail over to other XRP Ledger entry points, this issue will have caused some delays connecting to the XRP Ledger. We did learn & improve our systems. Onwards! 🚀
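For reference, a minimal usage sketch of that client-side failover, assuming the xrpl-client library's documented pattern of accepting an ordered list of WebSocket endpoints and a promise-based send() call (the fallback endpoint here is just an example):

```typescript
import { XrplClient } from "xrpl-client";

async function main(): Promise<void> {
  // If the first endpoint can't be reached or drops, the client moves on to the next one.
  const client = new XrplClient([
    "wss://xrplcluster.com",
    "wss://s2.ripple.com", // example fallback endpoint
  ]);

  const response = await client.send({
    command: "ledger",
    ledger_index: "validated",
  });

  console.log("Validated ledger:", response);
}

main().catch(console.error);
```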

Posted Mar 31, 2022 - 23:31 CEST

Resolved
The situation is stable: all clients can connect to xrplcluster.com again without problems. We'll stay on high alert, but believe there are currently no remaining problems.

While we are (still) not 100% certain, the problems seem to have been caused by upstream routing/connectivity/firewall issues. An incident currently being reported by Cloudflare (our upstream provider) indicates there may have been a firewall rule propagation issue.
Posted Mar 31, 2022 - 17:33 CEST
Identified
The problems seem to have been caused by routing issues, where users either could not reach Cloudflare (xrplcluster.com) and/or Cloudflare could not reach our edge nodes (specifically London, Singapore and Toronto). We are no longer receiving connectivity issue reports, and the connected client count is slowly rising again. We'll monitor closely.
Posted Mar 31, 2022 - 17:18 CEST
Investigating
We are currently seeing a low connected user count and are receiving reports from users who can't connect to the public XRPL Cluster nodes. We're investigating connectivity issues & other potential causes.

Meanwhile, we're also seeing that some users can connect to XRPLCluster without problems.
Posted Mar 31, 2022 - 17:09 CEST
This incident affected: XRP Ledger - Public nodes (xrplcluster.com (XRPL Mainnet)).