Between 10:48am UTC (8:48pm AEST) and 6:55pm UTC on Wednesday, August 30, 2023 (4:55am AEST on Thursday, August 31, 2023), Octopus Cloud customers in the Australia East region experienced an outage of their Cloud instance. Additionally, between 12:17pm UTC (10:17pm AEST) and 3:19pm UTC (1:19am AEST + 1d), the remaining customers in the Australia East region whose instances were up would have been unable to perform deployments that used Dynamic Workers.
This disruption was caused by a cooling issue in one of Microsoft Azure’s Australia East datacenters.
Event | Time period |
---|---|
Time to detection | 31 mins |
Time to incident declaration | 40 mins |
Time to resolution | 8 hrs 7 mins |
(All dates and times below are shown in UTC)
10:48 (20:48 AEST) 50% of Cloud Instances in Australia East went down.
11:19 (21:19 AEST) A support engineer acknowledged an automated alert and began investigating.
11:31 (21:31 AEST) Our internal incident response process was initiated.
11:55 (21:55 AEST) Status Page updated: An incident was declared.
12:17 (22:17 AEST) Dynamic Workers in Australia East went down.
15:06 (01:06 AEST + 1d) Status Page updated: We are still monitoring.
15:19 (01:19 AEST + 1d) Dynamic Workers in Australia East came online.
17:54 (03:54 AEST + 1d) Service was restored to 97% of Cloud Instances in Australia East.
18:11 (04:11 AEST + 1d) On-call engineer commenced remediation efforts on the remaining instances that were not online.
18:55 (04:55 AEST + 1d) All Cloud Instances were back up.
19:00 (05:00 AEST + 1d) Status Page updated: All cloud instances are back online.
21:01 (07:01 AEST + 1d) Status Page updated: Incident resolved.
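The detection and resolution figures in the summary table can be re-derived from the timeline entries above. The following is a minimal sketch of that arithmetic, not part of any Octopus tooling: the helper names (`at`, `fmt`) and variable names are illustrative assumptions, and the timestamps are copied from the timeline. The declaration figure is left out because the report does not state which two events bound it.

```python
from datetime import datetime, timedelta

# Assumption: every timestamp used here falls on 30 August 2023 (UTC),
# as listed in the timeline.
DAY = "2023-08-30"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident day (UTC)."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def fmt(delta: timedelta) -> str:
    """Format a duration in the same style as the summary table."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, mins = divmod(total_minutes, 60)
    return f"{hours} hrs {mins} mins" if hours else f"{mins} mins"


# Timeline entries (UTC).
outage_start = at("10:48")        # Cloud Instances went down
alert_acknowledged = at("11:19")  # support engineer acknowledged the alert
workers_down = at("12:17")        # Dynamic Workers went down
workers_up = at("15:19")          # Dynamic Workers came online
all_instances_up = at("18:55")    # all Cloud Instances up

time_to_detection = fmt(alert_acknowledged - outage_start)
time_to_resolution = fmt(all_instances_up - outage_start)
dynamic_worker_outage = fmt(workers_up - workers_down)

print("Time to detection:    ", time_to_detection)
print("Time to resolution:   ", time_to_resolution)
print("Dynamic Worker outage:", dynamic_worker_outage)
```

Measuring from the moment instances went down, rather than from the first alert, keeps the figures aligned with what customers experienced.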
As designed, our services automatically came back online as Microsoft's Azure services were restored. A handful of Cloud Instances required manual intervention. This was expected, as these instances were undergoing scheduled maintenance when they were interrupted by the outage.
"Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware."
Source: https://azure.status.microsoft/en-us/status/history/ (Incident Tracking ID: VVTQ-J98), retrieved on Thursday, August 31, 2023.
Octopus takes service availability seriously. Even when an outage originates with an upstream cloud provider, we fully review and remediate it. We do this so that we keep improving and maintain the best possible service we can.
We are aiming to reduce the time between a Cloud Instance going down and a human being notified, and to reduce the time it takes to publish a Status Page notification, so that our customers are better informed.
We deeply value the trust you place in our services, and we understand the importance of maintaining that trust. The recent service disruption was a significant event for us, and it has highlighted areas where we can enhance our processes. We are taking active steps to improve our notification and response mechanisms, ensuring that you are informed promptly and accurately. We appreciate your patience and are committed to delivering the consistent and reliable service you expect from us.