Downtime

DC1: Power Feed Issues

Oct 07 at 07:59pm BST
Affected services
Customer Portal
Dedicated Servers (lon1-a4)
SMTP Relay 1 (London)
Nameserver 1
Nameserver 2

Resolved
Oct 08 at 03:18pm BST

What happened?

On Monday 07-October-24:

  • 19:00: We received alerts from our monitoring systems that our primary power feed had dropped offline. No disruption to service availability at this point.
  • 19:20: We received alerts that several access (feed) switches had gone offline and were unreachable internally.
  • 20:59: We received alerts that the offline access switches had come back online and were carrying traffic correctly.
  • 21:05: We manually verified that workloads were back online and functioning as expected.

What went wrong?

After our primary power feed dropped, all equipment successfully failed over to the backup power feed and was performing as expected.

In an effort to bring service back online for non-power-redundant customers, our colocation provider enabled a transfer switch that connected the primary and secondary power feeds together. This in turn meant that all of the previously offline equipment in the data centre began to power back on.

This resulted in a voltage drop from the standard 240V to approximately 160V. Our core, border, and aggregation network equipment features power supplies that can operate across a 100-240V input range, and as such was not affected by the voltage drop.

However, our access switches feature power supplies that are only validated for a 220-240V input range. This led to the access switches powering down automatically, which in turn caused a loss of connectivity for any services not directly connected to our core switches.

Once the voltage drop condition cleared, the access switches powered back online automatically, and service availability was restored.

What changes are we making?

The service disruption was caused by a failed power feed, which then resulted in out-of-range operating conditions for our access layer switches.

As a result, we will be implementing changes to our access layer switches that include:
  • Firmware configuration changes to adjust the automatic shutdown voltage range.
  • Replacement power supplies, validated for a 100-240V voltage range.

The above changes are scheduled to take place this Saturday and Sunday (12-13 October 2024).

We will also provide further updates once we have received a full Root Cause Analysis from our colocation provider.

Updated
Oct 07 at 08:45pm BST

We have seen a full restoration of power at DC1.

We are currently manually verifying that all equipment has powered on correctly.

Updated
Oct 07 at 08:30pm BST

We have received an update from the data centre that the issue is related to power systems.

They believe they have identified the cause of the issue and are working on remediation.

Created
Oct 07 at 07:59pm BST

We are aware of an issue affecting the availability of our infrastructure.