Degraded performance of Xaman APIs & xApps

Incident Report for Xaman XRPL Labs

Postmortem

Postmortem: Degraded Xaman API & xApp Performance

Date: March 3, 2026
Duration: ~35 minutes (11:18 - 11:53 CET, fully resolved 12:03 CET)
Severity: Major
Impact: Slow or unresponsive Xaman APIs, SDK, and xApps. Account Worth feeds delayed. No impact on the XRP Ledger itself, and no funds were at risk (self-custody: your keys stay on your own device).

What happened

We run a geographically distributed Percona database cluster with nodes in the US, EU, and Middle East. Each region has a read/write node and a read-only node. The EU region is the primary (master).

We had planned a migration of the Middle East cluster to Asia. During the early stages of that migration, the US read/write node failed and traffic moved to the US read-only node. That node was underscaled for the sudden load, so it took too long to recover and catch up on replication.

Here's where it cascaded: the EU master has to track replication status for all regions. With the US node lagging behind, the EU master started stacking up replication waits. That caused sluggish responses across the entire database cluster, which in turn slowed down all our API and xApp backends, regardless of which region they were running in.
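The cascade can be illustrated with a toy model (all numbers are made up for illustration): when a primary waits on replication acknowledgment from every region before a write completes, its commit latency is effectively set by the slowest replica it waits on.

```python
# Toy model with hypothetical numbers: a primary that waits for replication
# acknowledgment from every region before a write can complete.

def commit_latency_ms(replica_ack_ms: dict) -> float:
    """Commit latency is dominated by the slowest replica the primary waits on."""
    return max(replica_ack_ms.values())

healthy = {"EU": 1.0, "US": 45.0, "ME": 60.0}      # all regions keeping up
degraded = {"EU": 1.0, "US": 4500.0, "ME": 60.0}   # undersized US node lagging

print(commit_latency_ms(healthy))   # 60.0 ms
print(commit_latency_ms(degraded))  # 4500.0 ms: every write, everywhere, is slow
```

One lagging region is enough to inflate the number for all of them, which is exactly why a regional failure became a global slowdown.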

What we did

  1. We initially gave the US cluster ~30 minutes to catch up on its own.
  2. That wasn't going to cut it, so we killed the US cluster entirely and pointed all geographically distributed API and xApp backends to the EU cluster only.
  3. With no remote nodes to wait on for replication status, the EU cluster immediately recovered. It's significantly overscaled for exactly this kind of scenario, so it handled the full load without breaking a sweat.
  4. Services were fully restored, running on EU only (1/3 of normal redundancy).
  5. US cluster re-sync was started right after.
  6. The Asia migration will be restarted later with a fresh sync from the EU cluster.
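Step 2 boils down to a routing change: instead of each regional backend preferring its local cluster, everything is pointed at the EU cluster. A minimal sketch of that fallback logic (endpoint names are hypothetical, not our real infrastructure):

```python
# Hypothetical endpoint map; real hostnames and ports differ.
DB_ENDPOINTS = {
    "eu": "db-eu.internal:3306",
    "us": "db-us.internal:3306",
    "me": "db-me.internal:3306",
}

# During the incident, only the EU cluster was considered trustworthy.
HEALTHY = {"eu"}

def endpoint_for(region: str) -> str:
    """Prefer the local cluster node; fall back to the EU primary when the
    local region has been taken out of rotation."""
    if region in HEALTHY:
        return DB_ENDPOINTS[region]
    return DB_ENDPOINTS["eu"]

print(endpoint_for("us"))  # db-eu.internal:3306
print(endpoint_for("eu"))  # db-eu.internal:3306
```

Once every backend resolves to the EU cluster, the master no longer waits on remote replication status, which is why recovery was immediate.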

Timeline

Time (CET)  Event
~11:18      US read/write node fails; traffic moves to the undersized read-only node
~11:20      EU master starts stacking replication waits; performance degrades across all regions
11:28       Monitoring alerts fire; team starts investigating. Status page updated.
11:45       Root cause identified. Decision made to kill the US cluster and run EU solo.
~11:53      All backends pointed to EU; performance restored.
12:03       Incident fully resolved; status page updated.

Why it happened

Two things collided at the wrong time:

  1. We were in the process of migrating the Middle East node to Asia, which meant the cluster was already in a transitional state.
  2. The US read/write node failed independently during that window. The failover target (read-only node) was not scaled to handle read/write traffic, and its slow recovery caused a ripple effect back to the EU master through replication lag.

The core issue is that the EU master waits on replication acknowledgment from other regions. When a region can't keep up, those waits stack up and slow everything down, including responses to local EU traffic.

What we're fixing

  • Scale up the US cluster. The read-only failover node needs to be able to handle full read/write load if needed. It was underscaled, and that's what turned a regional failure into a global one.
  • Better pre-migration checks. Before kicking off major cluster migrations, we need to more thoroughly verify latency and stability of all other nodes in the cluster. Don't start a migration when the cluster isn't in a perfectly healthy state.
  • Faster failover decisions. We waited ~30 minutes hoping the US node would catch up. In hindsight we should have killed it sooner. We'll define clearer thresholds for when to cut a lagging region loose instead of letting it drag down the master.
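The third item can be made concrete as a simple policy: track each region's replication lag and cut the region loose automatically once it stays over a threshold for too long, instead of waiting on a judgment call. A sketch with made-up thresholds (the real values would come from our monitoring data):

```python
from dataclasses import dataclass
from typing import Optional

LAG_THRESHOLD_S = 30.0   # hypothetical: max tolerated replication lag
GRACE_PERIOD_S = 120.0   # hypothetical: how long a region may stay over threshold

@dataclass
class RegionHealth:
    over_threshold_since: Optional[float] = None

    def observe(self, now: float, lag_s: float) -> bool:
        """Record one lag sample; return True when the region should be evicted."""
        if lag_s <= LAG_THRESHOLD_S:
            self.over_threshold_since = None  # recovered, reset the clock
            return False
        if self.over_threshold_since is None:
            self.over_threshold_since = now   # start the grace period
        return now - self.over_threshold_since >= GRACE_PERIOD_S

us = RegionHealth()
print(us.observe(0.0, 5.0))      # False: healthy
print(us.observe(60.0, 400.0))   # False: over threshold, grace period running
print(us.observe(200.0, 400.0))  # True: lagged too long, cut the region loose
```

With a policy like this, the ~30 minutes we spent hoping the US node would catch up would instead have been a bounded, pre-agreed grace period.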

What was NOT affected

The XRP Ledger was not impacted in any way. Xaman is a self-custodial wallet: your keys live on your device, not on our servers. Our backend services handle things like push notifications, xApp hosting, account metadata, and API requests. A backend outage means some features are temporarily slow or unavailable, but your funds are always safe and accessible on-ledger.

Posted Mar 03, 2026 - 12:15 CET

Resolved

This incident has been resolved.
Posted Mar 03, 2026 - 12:03 CET

Update

Both the cause and the fix have been identified, and the fix is being rolled out. It seems performance will be degraded for another ~15 minutes before everything can go back to normal. Sorry for the inconvenience 😓
Posted Mar 03, 2026 - 11:45 CET

Identified

Around 10 minutes ago, we were notified by our monitoring systems that our backend performance was degraded. Our distributed database cluster appears to have failed over to backup machines, which are now having trouble keeping up.

We're currently expanding the cluster (data needs to sync) while we work on fixing the cause. Performance will be degraded for another ~15 minutes.

During this process, xApps may load slowly or not at all, and e.g. Account Worth feeds and our APIs may be slow to respond.

Funds are safe, and the XRP Ledger is completely unaffected.
Posted Mar 03, 2026 - 11:28 CET
This incident affected: Xaman API / SDK (Xaman Developer API/SDK).