During the outage less than a week ago, it was clear what happened, but unclear why it happened.
https://status.xrpl-labs.com/incidents/thvcq91gm2cf
Our lab replication of the events has not been carried out yet, as we are still waiting for the components to arrive.
However, today the exact same problems returned, preceded by a lot of latency and hiccups throughout the day. While we were diagnosing the state of the network, we saw the entire network go down.
During the day, a large number of TCP connections could not be established (timeouts) and many UDP packets never arrived. Eventually, more verbose logging levels on the switches revealed “flapping ports”, which was odd: these ports were all part of MC-LAG (link aggregation). Flapping is the last thing you would expect there, unless there is a physical (SFP, fiber) failure. Neither was the case (easy to rule out by simply replacing them).
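For anyone wanting to spot this kind of behaviour in their own environment, here is a rough sketch of the check we effectively did by reading the verbose switch logs: count link up/down transitions per port and flag ports that change state unusually often. The log line format, port names and threshold below are placeholders for illustration, not our switches' actual output.

```python
# Hypothetical sketch: flag "flapping" ports by counting link state changes
# in an exported switch syslog file. The regex and threshold are assumptions.
import re
import sys
from collections import Counter

LINK_EVENT = re.compile(r"Interface (\S+) changed state to (up|down)", re.IGNORECASE)
FLAP_THRESHOLD = 10  # state changes within the log window that we treat as flapping

def find_flapping_ports(log_path: str) -> Counter:
    transitions = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = LINK_EVENT.search(line)
            if match:
                transitions[match.group(1)] += 1
    return transitions

if __name__ == "__main__":
    counts = find_flapping_ports(sys.argv[1])
    for port, count in counts.most_common():
        if count >= FLAP_THRESHOLD:
            print(f"{port}: {count} link state changes (possible flapping)")
```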
We then stumbled upon messages on the switch vendor's user forum: due to a firmware problem that had already impacted other network engineers over the past two weeks, the switches ended up flapping as if STP were enabled, continuously switching between paths.
We disabled MC-LAG port aggregation and went back to our old STP-based loop prevention for redundant switching. This temporarily took the network down while STP converged and determined which ports to block and which to keep forwarding. When it came back, all the latency, hiccups and resubmitted traffic we had seen during the day (visible only as increased bandwidth on switch ports, without a clear explanation) disappeared.
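To confirm the network was healthy again, a simple probe of the kind sketched below is enough: repeatedly open TCP connections to a known-good internal service and report the timeout rate and connect latency. The target address, attempt count and timeout are placeholders, not our actual monitoring setup.

```python
# Minimal recovery-check sketch: measure TCP connect timeouts and latency
# against a placeholder internal host. Values below are illustrative only.
import socket
import time

TARGET = ("10.0.0.10", 443)  # placeholder internal host/port
ATTEMPTS = 50
TIMEOUT_S = 2.0

def probe(target, attempts, timeout_s):
    timeouts = 0
    latencies = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection(target, timeout=timeout_s):
                latencies.append(time.monotonic() - start)
        except OSError:
            timeouts += 1
        time.sleep(0.1)  # small pause between attempts
    return timeouts, latencies

if __name__ == "__main__":
    timeouts, latencies = probe(TARGET, ATTEMPTS, TIMEOUT_S)
    print(f"timeouts: {timeouts}/{ATTEMPTS}")
    if latencies:
        print(f"avg connect latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
```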
We are sorry for the downtime, but very happy we identified the problem.
We will wait for the duplicate lab setup, carefully test future firmware releases there, and only deploy a release after definitively confirming the problem is gone.