Octopus cloud AU region outage
Incident Report for Octopus Deploy
Postmortem

Octopus Cloud Australia East outage - report and learnings

Between 10:48am UTC (8:48pm AEST) and 6:55pm UTC on Wednesday, August 30, 2023 (4:55am AEST on Thursday, August 31, 2023), Octopus Cloud customers in the Australia East region experienced an outage of their Cloud Instances. Additionally, between 12:17pm UTC (10:17pm AEST) and 3:19pm UTC (1:19am AEST + 1d), the remaining customers in the Australia East region whose instances were up were unable to run deployments that used Dynamic Workers.

This disruption was caused by a cooling issue in one of Microsoft Azure’s Australia East datacenters.

Key timings

Event                          Time period
Time to detection              31 mins
Time to incident declaration   40 mins
Time to resolution             8 hrs 7 mins

Incident timeline

(All dates and times below are shown in UTC)

Wednesday, August 30, 2023

10:48 (20:48 AEST) 50% of Cloud Instances in Australia East went down.

11:19 (21:19 AEST) A support engineer acknowledged an automated alert and began investigating.

11:31 (21:31 AEST) Our internal incident response process was initiated.

11:55 (21:55 AEST) Status Page updated: An incident was declared.

12:17 (22:17 AEST) Dynamic Workers in Australia East went down.

15:06 (01:06 AEST + 1d) Status Page updated: We are still monitoring.

15:19 (01:19 AEST + 1d) Dynamic Workers in Australia East came online.

17:54 (03:54 AEST + 1d) Service was restored to 97% of Cloud Instances in Australia East.

18:11 (04:11 AEST + 1d) On-call engineer commenced remediation efforts on the remaining instances that were not online.

18:55 (04:55 AEST + 1d) All Cloud Instances were back up.

19:00 (05:00 AEST + 1d) Status Page updated: All cloud instances are back online.

21:01 (07:01 AEST + 1d) Status Page updated: Incident resolved.
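
As a cross-check, some of the durations in this report can be recomputed from the timeline entries above. The snippet below is a minimal sketch, not part of our tooling: the helper names (`utc`, `fmt`) and the variable names are illustrative, and it assumes AEST is a fixed UTC+10 offset, which holds in August. It covers time to detection, time to resolution, and the length of the Dynamic Workers outage. Time to incident declaration is left out because the timeline does not say which entry that figure is measured to.

```python
from datetime import datetime, timedelta, timezone

# Assumption: AEST as a fixed UTC+10 offset (no daylight saving in August).
AEST = timezone(timedelta(hours=10), "AEST")


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry as a UTC time on 30 August 2023."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2023, 8, 30, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a duration in the style used in the key timings list."""
    hours, mins = divmod(int(delta.total_seconds()) // 60, 60)
    return f"{hours} hrs {mins} mins" if hours else f"{mins} mins"


# Timestamps copied from the incident timeline (UTC).
outage_start = utc("10:48")
alert_acknowledged = utc("11:19")
workers_down = utc("12:17")
workers_up = utc("15:19")
all_instances_up = utc("18:55")

time_to_detection = fmt(alert_acknowledged - outage_start)
time_to_resolution = fmt(all_instances_up - outage_start)
workers_outage = fmt(workers_up - workers_down)

# Convert the recovery time to AEST to confirm the "+ 1d" annotation.
restored_local = all_instances_up.astimezone(AEST).isoformat()

print("Time to detection:     ", time_to_detection)
print("Time to resolution:    ", time_to_resolution)
print("Dynamic Workers outage:", workers_outage)
print("All instances up (AEST):", restored_local)
```

The first two values match the key timings listed earlier, and the AEST conversion confirms that full recovery fell on Thursday, August 31 local time.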

Technical details

As designed, our services automatically came back online as Microsoft's Azure services were restored. A handful of Cloud Instances required manual intervention. This was expected, as these instances were undergoing scheduled maintenance when the outage interrupted them.

Microsoft Azure’s technical details

Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware.

Source: https://azure.status.microsoft/en-us/status/history/ (Incident Tracking ID: VVTQ-J98), retrieved on Thursday, August 31, 2023.

Remediation

Octopus takes service availability seriously. Even when an outage originates with an upstream cloud provider, we fully review and remediate it. We do this so that we keep improving and maintain the best service we can.

We are aiming to reduce the time between a Cloud Instance going down and a human being notified, and to reduce the time it takes to publish a Status Page notification, so that we can better inform our customers.

Conclusion

We deeply value the trust you place in our services, and we understand the importance of maintaining that trust. The recent service disruption was a significant event for us, and it has highlighted areas where we can enhance our processes. We are taking active steps to improve our notification and response mechanisms, ensuring that you are informed promptly and accurately. We appreciate your patience and are committed to delivering the consistent and reliable service you expect from us.

Posted Sep 04, 2023 - 02:24 UTC

Resolved
This incident has been resolved.
Posted Aug 30, 2023 - 21:01 UTC
Monitoring
Our upstream provider has mitigated this incident. All cloud instances are back online. Our team will continue monitoring the situation for any issues from the upstream outage.
Posted Aug 30, 2023 - 19:00 UTC
Update
Our upstream provider has not yet provided an ETA for resolution for the AU region outage affecting a number of Octopus Cloud customers. We are still monitoring the situation and will continue to provide periodic updates.
Posted Aug 30, 2023 - 15:06 UTC
Identified
We are aware of an outage affecting our Australian-hosted Octopus Cloud customers. Unfortunately, this outage is with our provider in this region. We will continue to monitor the situation and update the status page as more information becomes available.
Posted Aug 30, 2023 - 11:55 UTC
This incident affected: Octopus Cloud.