Degraded

Virtual Private Cloud Cluster 2: Availability Issues

Dec 02 at 01:05pm GMT
Affected services
London 1

Resolved
Dec 03 at 04:24pm GMT

Reason For Outage

Executive Summary

On Tuesday 2nd December 2025, two network devices in one of our data centre racks experienced a rare simultaneous hardware failure, temporarily affecting some services hosted in our Virtual Private Cloud (VPC) Cluster 2.

Our systems initially handled the first failure automatically with no customer impact, but when the backup device failed just three minutes later, our technical teams immediately enacted emergency procedures.

By re-routing affected connections through alternative equipment in an adjacent rack, we restored services within approximately one hour of the initial incident. Permanent replacement hardware was installed overnight, completing full restoration by the early hours of 3rd December.

Importantly, our core services—including internet connectivity, website hosting, and security systems—remained fully operational throughout, as our network design ensures these critical functions are isolated from localised equipment failures.

Incident Detail

On Tuesday 2nd December 2025 at 13:05, Cilix monitoring systems detected the loss of availability of an access switch in Rack A2 (switch name: lon1-a2-acc1), triggering a Major Incident Response. The technical team acknowledged the alert and began fault investigation within two minutes.

The impacted access switch was a member of a redundant pair, providing connectivity to a total of 8 compute nodes, 5 of which host workloads located in our Virtual Private Cloud (VPC) Cluster 2. Our redundant architecture performed as designed, automatically re-routing traffic through the backup switch with no customer impact observed.
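As a purely conceptual illustration of why the first failure caused no customer impact, the minimal Python sketch below models a node with one uplink to each switch in the redundant pair. The node name is hypothetical, and the real failover is performed by the network hardware itself, not by software like this.

# Conceptual illustration only: "compute-01" is a hypothetical node name, and the
# actual failover is handled by the network hardware, not by code of this kind.

from dataclasses import dataclass


@dataclass
class Switch:
    name: str
    healthy: bool = True


@dataclass
class ComputeNode:
    name: str
    uplinks: list  # one uplink to each switch in the redundant pair

    def active_uplink(self):
        # Return the first healthy uplink, or None if every path is lost.
        for switch in self.uplinks:
            if switch.healthy:
                return switch
        return None


acc1 = Switch("lon1-a2-acc1")
acc2 = Switch("lon1-a2-acc2")
node = ComputeNode("compute-01", [acc1, acc2])

acc1.healthy = False                 # 13:05 - first switch fails
assert node.active_uplink() is acc2  # traffic re-routes via the backup switch

acc2.healthy = False                 # 13:08 - backup switch also fails
assert node.active_uplink() is None  # no remaining path: availability degraded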

At 13:08, the redundant access switch (switch name: lon1-a2-acc2) also experienced a critical hardware failure. Investigation confirmed both switches had suffered simultaneous, non-recoverable hardware faults. This dual failure exceeded our standard redundancy provisions and resulted in degraded availability for services hosted in VPC Cluster 2.

Within 17 minutes of confirming the dual switch failure, we decided to bypass the failed hardware by re-patching the affected equipment into access switches in the adjacent rack (Rack A3).

Our technical and on-site teams worked in parallel, configuring new port and VLAN/VXLAN assignments while preparing and re-patching 32 cables. This work was completed at 14:00, with services recovering by 14:10.
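To give a sense of the planning involved, the minimal Python sketch below shows how a re-patching plan of this kind could be laid out, mapping each cable to a port on the adjacent rack's switches while preserving its VLAN. The switch names, port numbers, and VLAN IDs are invented for the example and do not reflect the configuration actually used.

# Illustrative sketch only: the switch names, port numbers, and VLAN IDs below
# are hypothetical examples, not the assignments used during the incident.


def repatch_plan(cables, target_switches):
    # Spread the cables from the failed Rack A2 switches across the Rack A3
    # switches, keeping each cable's original VLAN so the workload's network
    # configuration is unchanged after the move.
    plan = []
    for index, cable in enumerate(cables):
        switch = target_switches[index % len(target_switches)]
        port = index // len(target_switches) + 1  # next free port, 1-based
        plan.append(
            {
                "node": cable["node"],
                "new_switch": switch,
                "new_port": port,
                "vlan": cable["vlan"],
            }
        )
    return plan


# Two of the 32 cables, with made-up values.
cables = [
    {"node": "compute-01", "old_port": "lon1-a2-acc1:1", "vlan": 101},
    {"node": "compute-01", "old_port": "lon1-a2-acc2:1", "vlan": 101},
]
for entry in repatch_plan(cables, ["lon1-a3-acc1", "lon1-a3-acc2"]):
    print(entry)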

Replacement switches were installed overnight, with full migration back to Rack A2 completed by 01:30 on 3rd December.

Due to our strict network segmentation policies, the following services were not affected by this incident:

  • Internet Connectivity
  • Website Hosting
  • Kubernetes Hosting
  • Dedicated Server Hosting
  • Hosted Cisco Security Appliances
  • Denial of Service Protection Systems
  • Web Application Firewall Systems

Updated
Dec 02 at 02:10pm GMT

The affected equipment has been bypassed.

We are seeing services recover, and are monitoring workloads closely.

Updated
Dec 02 at 01:30pm GMT

We have traced the cause of this issue to equipment failure in our network stack.

We are working on re-patching servers to non-impacted switches.

We will provide updates as required.

Created
Dec 02 at 01:05pm GMT

We are aware of availability issues with resources running in our Virtual Private Cloud cluster 2.

Engineers are investigating and will provide further updates shortly.