Downtime

LON1: Core Switch Lockup

Oct 25 at 03:30am BST
Affected services
London 1 Data Centre (LON1)

Resolved
Oct 25 at 05:04pm BST

Executive Summary

On 25th October at 09:15 GMT, monitoring systems detected a loss of network availability for several services connected to a pair of access switches running in a Virtual Chassis configuration in racks A2, A3 and A4, triggering a major incident response.

Investigation found that the pair of access switches had entered a ‘soft-lockup’ state: the switches were still actively advertising their availability to carry traffic, but were unable to do so.
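For readers unfamiliar with the failure mode, the sketch below illustrates in Python why a soft-lockup is hard to catch: a probe of the management plane alone still succeeds, and only a separate data-plane check against downstream services exposes the fault. The hostnames and ports are hypothetical, and this is a simplified illustration rather than a description of our monitoring stack.

```python
#!/usr/bin/env python3
"""Minimal sketch of a probe that can surface a 'soft-lockup': the switch
still answers on its management plane, but traffic through it does not reach
downstream services. All targets below are illustrative placeholders."""
import socket

# Hypothetical targets; real probes would be driven from an inventory system.
SWITCH_MGMT = ("a2-access-sw1.lon1.example.net", 22)   # management-plane check
SERVICE_PROBES = [                                      # data-plane checks
    ("customer-svc-01.lon1.example.net", 443),
    ("customer-svc-02.lon1.example.net", 443),
]

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def classify() -> str:
    mgmt_ok = tcp_reachable(*SWITCH_MGMT)
    data_ok = any(tcp_reachable(host, port) for host, port in SERVICE_PROBES)
    if mgmt_ok and not data_ok:
        # Management plane answers but nothing behind the switch is reachable:
        # the signature of the soft-lockup described above.
        return "soft-lockup suspected"
    if not mgmt_ok and not data_ok:
        return "switch down"
    return "healthy"

if __name__ == "__main__":
    print(classify())
```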

To rectify the situation, our on-site team manually bypassed the faulty access switches, bringing the affected services back online.

Throughout the incident, our primary focus was restoring services and minimising any potential impact. Total disruption to services was approximately 1 hour and 10 minutes.

Next Steps

We have engaged the hardware vendor's TAC and concluded that the fault with these access switches is related to the power surge we experienced in early October. The affected switches have been scheduled for replacement, and the replacement units will not be brought into service until they have been stress tested.

We are also investigating replacing any other network hardware that may have been affected.

Timeline
09:15 GMT – Monitoring systems trigger availability alerts.
09:17 GMT – Alerts acknowledged and technical teams begin fault investigations.
09:30 GMT – Access switches are power cycled.
09:32 GMT – Service is restored to downstream services following the reboot.
09:43 GMT – Access switches move back into a ‘soft-lockup’ fault condition and monitoring alerts re-trigger.
09:45 GMT – Decision is made to physically bypass the access switches and connect affected services directly to the rack aggregation switches.
09:50 GMT – Work begins on re-provisioning ports, VNI-to-VLAN maps, and L3 SVIs for the affected customer services (an illustrative sketch of this step follows the timeline).
10:20 GMT – Re-provisioning work is completed across all 138 ports.
10:21 GMT – On-site team begins re-patching ports into the aggregation switches.
10:40 GMT – On-site team completes re-patching.
10:42 GMT – Monitoring systems confirm availability of services.
10:45 GMT – Incident is marked as resolved. Investigations to be completed offline.
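To give a sense of what the 09:50–10:20 re-provisioning step involved, the sketch below generates the kind of per-service configuration that had to be recreated on the aggregation switches: an access-port assignment, a VNI-to-VLAN mapping, and an optional L3 SVI. The CLI syntax is generic EVPN-VXLAN style and the port, VLAN, VNI and address values are made up; the real platform syntax and inventory differ.

```python
"""Illustrative sketch of re-provisioning a customer service onto an
aggregation switch: access port, VNI-to-VLAN map, and optional L3 SVI.
All values and CLI syntax below are hypothetical."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class PortService:
    agg_port: str           # new port on the aggregation switch
    vlan: int               # customer VLAN
    vni: int                # VXLAN network identifier mapped to that VLAN
    svi_ip: Optional[str]   # gateway address if the service needs an L3 SVI

def render_config(svc: PortService) -> str:
    """Render a generic, EVPN-VXLAN-style config stanza for one service."""
    lines = [
        f"interface {svc.agg_port}",
        f"  switchport access vlan {svc.vlan}",
        f"vlan {svc.vlan}",
        f"  vn-segment {svc.vni}",          # VNI-to-VLAN map
    ]
    if svc.svi_ip:
        lines += [
            f"interface vlan{svc.vlan}",    # L3 SVI for routed services
            f"  ip address {svc.svi_ip}",
            "  no shutdown",
        ]
    return "\n".join(lines)

if __name__ == "__main__":
    # Two hypothetical services out of the 138 ports that were re-provisioned.
    for svc in (
        PortService("Ethernet1/12", vlan=2101, vni=102101, svi_ip="10.21.1.1/24"),
        PortService("Ethernet1/13", vlan=2102, vni=102102, svi_ip=None),
    ):
        print(render_config(svc))
        print()
```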

Root Cause

We identified a similar fault with our core switches on 6th October, when they also moved into a ‘soft-lockup’ state. Working with the hardware vendor, we have concluded that the switches involved in this incident were likewise affected by the power surge on 6th October.

Due to the nature of the issue, there is no precursor or warning before the failure occurs.

Updated
Oct 25 at 05:00am BST

We have identified a soft-lockup condition on one of our core switches.

This condition caused the affected core switch to pass health checks while failing to correctly route traffic.

We have worked with the equipment vendor and have replaced the affected core switch.

Telemetry indicates this is a result of the power feed issues on 06-Oct-2024.

We have since observed a full restoration of service.

Created
Oct 25 at 03:30am BST

We are aware of a fault affecting our core network infrastructure in our LON1 location.