Between 17:00 and 18:30 CET, several multi-minute hiccups affected all XRPL Labs hosted services, including the XRPL Labs infrastructure behind XRPLCluster.com, Xahau.Network and the Xaman App backend. As a result, xApps, network fees, transaction submission and RPC were intermittently unavailable.
While we have redundancy in place at every level of our infrastructure (uplinks, routers, switches, inbound and outbound fibers, bare metal, etc.), we know the incidents were caused by a hardware failure in the primary core router in our NL datacenter. Normally this should not have been a problem and should not have resulted in downtime: traffic fails over from BGP to the secondary edge router and on to the switches. This kind of failover has proven to work reliably in the past.
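The router-level half of that failover relies on VRRP. Purely as an illustration (we are not disclosing our actual setup), a minimal VRRP instance for a secondary (backup) router, written as a keepalived config, could look like this; the interface name, VRID, priority and address are assumptions:

```
# Sketch only: hypothetical keepalived VRRP instance for a backup edge router.
vrrp_instance EDGE_V4 {
    state BACKUP             # secondary router; the primary runs as MASTER
    interface bond0
    virtual_router_id 51
    priority 100             # the primary carries a higher priority
    advert_int 1
    virtual_ipaddress {
        192.0.2.1/24         # shared gateway address (documentation range)
    }
}
```

The router with the highest priority holds the virtual gateway address; when its VRRP advertisements stop, a backup claims the address within a few seconds.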
During maintenance and hardware upgrades less than two weeks ago, we replaced the links between our core routers (which take BGP in and run VRRP for redundancy out) and our switches: dual-port LACP (802.3ad) bonded links to our upgraded switches (MC-LAG 🎉), instead of the previous dual "active-standby" bonds.
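For readers unfamiliar with the setup: on a Linux-based router, such an 802.3ad bond might be created roughly as follows. This is a sketch under assumptions; the interface names and option values are hypothetical, not our production config:

```shell
# Sketch: build an LACP (802.3ad) bond from two physical ports
# (hypothetical interface names; requires root and kernel bonding support).
ip link add bond0 type bond mode 802.3ad miimon 100 lacp_rate fast
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
```

With MC-LAG on the switch side, two ports on two different switches present themselves as a single LACP partner, so both links can carry traffic simultaneously instead of one sitting idle.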
Somehow, today, after the primary core router went down, the secondary did not receive any traffic from the switches. BGP switched routes and VRRP promoted the secondary router to primary, but the switches did not seem to deliver any traffic to the aggregated ports. Both the switches and the router reported the LACP (802.3ad) link as up. But no traffic.
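On a Linux-based router, this is the kind of state one would check to tell a "link up but no traffic" situation apart from a plain link failure (a diagnostic sketch; the bond name is an assumption):

```shell
# Sketch: inspect LACP negotiation and traffic counters on a Linux bond.
cat /proc/net/bonding/bond0   # per-slave LACP state, aggregator IDs, and
                              # the partner (switch) system MAC
ip -s link show bond0         # RX/TX counters: a link reported as up with
                              # RX stuck at zero points at the switch side
                              # not forwarding anything to us
```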
To get things back online ASAP, we switched back from 802.3ad bonding to active-standby, after which communication between the router and the network returned. However, only our IPv4 stack came back: our entire IPv6 stack (used for most of our backend services and for the communication between Cloudflare and our infrastructure) did not. A hard config reload also did not help. As a last resort, we rebooted the secondary core router. This definitively brought everything back.
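For illustration, reverting a Linux bond from 802.3ad to active-backup could look like the sketch below. The Linux bonding driver only allows a mode change while the bond is down and has no member ports, hence the detach/reattach dance (interface names are hypothetical):

```shell
# Sketch: revert bond0 from 802.3ad to active-backup (requires root).
ip link set eth0 nomaster     # detach members first: the bonding driver
ip link set eth1 nomaster     # refuses mode changes otherwise
ip link set bond0 down
ip link set bond0 type bond mode active-backup
ip link set bond0 up
ip link set eth0 master bond0 # first port attached becomes the active one
ip link set eth1 master bond0 # standby
```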
This means that there are several problems now temporarily resolved, but still to be investigated:

- Why the switches delivered no traffic over the LACP (802.3ad) bond to the secondary router, even though both sides reported the link as up.
- Why, after reverting to active-standby bonding, only IPv4 came back, and the IPv6 stack required a full reboot of the router to return.
On top of the above, the dead primary core router still has to be replaced.
We will order extra hardware to test the above in a lab setting before we make any further changes to the live infrastructure. Until then, there will be no hardware redundancy at the core routing level.