Intermittent errors in West Europe

Incident Report for Octopus Deploy

Postmortem

Dynamic Worker Outage in West Europe - report and learnings

From 03:03 UTC on Thursday, March 23, 2023, our Octopus Cloud infrastructure in West Europe was unable to provision new Dynamic Workers. Customers were impacted between 05:15 and 06:51 UTC. Twenty-three Octopus Cloud customers in West Europe were affected during this period and could not lease Dynamic Workers to run deployments and runbooks.

We’re sorry, and we’re taking steps to minimize the occurrence and impacts of similar events in the future.

Key timings

Background

Octopus Cloud uses Dynamic Workers to execute workloads. During this incident, Dynamic Workers were unavailable for 23 customers, who were therefore unable to execute any of their Deployments and Runbooks that relied on Dynamic Workers.

Incident timeline

(All dates and times below are shown in UTC)

Thursday, March 23, 2023

02:41 One of our upstream dependencies, Azure Resource Manager (ARM), started returning 503 responses (Incident Tracking ID: RNQ2-NC8)

03:03 The first Dynamic Worker provisioning failure occurred. At this time, our pre-provisioned pool of Dynamic Workers continued to operate and serve all customer workloads

04:01 Internal monitoring alerted us about anomalous provisioning failures

04:13 We initiated our incident response process

04:14 We confirmed a sharp rise in 503 responses from ARM

04:17 We disabled automated internal infrastructure functions to limit the number of customers impacted by this issue

04:31 We alerted customers to the incident via status.octopus.com

04:38 We created a ticket with Azure (Sev A)

05:15 Our pooled resources were exhausted, leading to the first customer impact

05:39 As a potential mitigation, we decided to start provisioning additional infrastructure in an alternate region within Europe

06:04 Azure confirmed the outage

06:51 We observed that Dynamic Workers were beginning to recover

06:59 We alerted customers via status.octopus.com that the incident was mitigated

07:10 Azure incident resolved

07:10 We confirmed alternate infrastructure was available for failover if the issue recurred
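
The intervals implied by the timeline above can be worked out directly from the timestamps. The following is a minimal sketch for illustration only; the helper name minutes_between and the labels are our own and are not part of any Octopus tooling.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps taken from the incident timeline (Thursday, March 23, 2023, UTC).
ARM_ERRORS_BEGAN = "02:41"
FIRST_PROVISIONING_FAILURE = "03:03"
MONITORING_ALERT = "04:01"
POOL_EXHAUSTED = "05:15"
RECOVERY_OBSERVED = "06:51"

intervals = {
    "first failure to alert": minutes_between(FIRST_PROVISIONING_FAILURE, MONITORING_ALERT),
    "ARM errors to pool exhaustion": minutes_between(ARM_ERRORS_BEGAN, POOL_EXHAUSTED),
    "first failure to pool exhaustion": minutes_between(FIRST_PROVISIONING_FAILURE, POOL_EXHAUSTED),
    "customer impact": minutes_between(POOL_EXHAUSTED, RECOVERY_OBSERVED),
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes} min")
```

Read this way, 58 minutes passed between the first provisioning failure and the monitoring alert, and the customer impact window lasted 96 minutes.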

Technical details

We make heavy use of ARM to provision Dynamic Workers for customer workloads. The ARM outage meant that we could not provision new Workers in the West Europe region. We maintain a pre-provisioned pool of Workers, but it was depleted after around two and a half hours.
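
To illustrate why a pre-provisioned pool only buys a limited amount of time, here is a minimal sketch of a pool "runway" estimate. It is not our actual implementation: the function names, the lease-rate input, and the 60-minute threshold are all hypothetical, and it assumes a steady lease rate with no Workers returning to the pool.

```python
def pool_runway_minutes(idle_workers: int, leases_per_hour: float) -> float:
    """Estimate how long the idle pool lasts if nothing new can be provisioned.

    Assumes a constant lease rate and no Workers being returned (hypothetical model).
    """
    if idle_workers < 0:
        raise ValueError("idle_workers must not be negative")
    if leases_per_hour <= 0:
        # Nobody is leasing Workers, so the pool never drains.
        return float("inf")
    return idle_workers * 60 / leases_per_hour

def should_alert(
    idle_workers: int,
    leases_per_hour: float,
    provisioning_healthy: bool,
    min_runway_minutes: float = 60.0,  # hypothetical threshold
) -> bool:
    """Alert only when provisioning is failing and the pool is about to run dry."""
    if provisioning_healthy:
        return False
    return pool_runway_minutes(idle_workers, leases_per_hour) < min_runway_minutes

# Example with made-up numbers: 30 idle Workers leased at 15 per hour.
print(pool_runway_minutes(30, 15))                         # 120.0
print(should_alert(30, 15, provisioning_healthy=False))    # False
print(should_alert(10, 15, provisioning_healthy=False))    # True
```

An estimate like this turns "provisioning is failing" into "customers will be affected in roughly N minutes", which can help decide how early to start a mitigation such as provisioning in an alternate region.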

Remediation and next steps

We have identified improvements to our alerting to reduce the time it takes for us to detect similar incidents. We’re prioritizing these improvements using our Risk Treatment Policy.

Currently, we rely heavily on single-region availability in Azure. We are evaluating our options to diversify the regions that we use, to mitigate regional availability issues.

Posted Mar 26, 2023 - 23:44 UTC

Resolved

Azure has advised that this issue has been resolved. A preliminary root cause has been published here: https://azure.status.microsoft/en-us/status/history/

03/23/2023
Azure Resource Manager - Azure Resource Manager Operations Failures - Mitigated
Tracking ID: RNQ2-NC8

Posted Mar 23, 2023 - 23:17 UTC

Monitoring

Dynamic workers are now provisioning successfully. We are continuing to monitor for any degradation of service.

Posted Mar 23, 2023 - 06:59 UTC

Identified

Azure are aware of this issue and are actively investigating. See the Azure status page for ongoing updates: https://azure.status.microsoft/en-us/status

Posted Mar 23, 2023 - 06:24 UTC

Update

We are experiencing issues provisioning dynamic workers in West Europe. This may affect deployments or runbooks relying on dynamic workers. We are working with Azure to have this operational as soon as possible. If you have urgent tasks relying on dynamic workers please contact support@octopus.com.

Posted Mar 23, 2023 - 05:58 UTC

Investigating

We are investigating an issue with our cloud vendor that may affect customers in the West Europe region.

Posted Mar 23, 2023 - 04:31 UTC

This incident affected: Octopus Cloud.