From 3:03am UTC, our Octopus Cloud Infrastructure in West Europe was unable to provision new Dynamic Workers. Customers were impacted between 5:15am and 6:51am UTC on Thursday, March 23, 2023. During this period, twenty-three Octopus Cloud customers in West Europe could not lease Dynamic Workers to run deployments and runbooks.
We’re sorry, and we’re taking steps to minimize the occurrence and impacts of similar events in the future.
Octopus Cloud uses Dynamic Workers to execute workloads. During this incident, Dynamic Workers were unavailable to 23 customers, who could not run any deployments or runbooks that relied on them.
(All dates and times below are shown in UTC)
02:41 One of our upstream dependencies, Azure Resource Manager (ARM), started returning 503 responses (Incident Tracking ID: RNQ2-NC8)
03:03 The first Dynamic Worker provisioning failure occurred. At this time, our pre-provisioned pool of Dynamic Workers continued to operate and serve all customer workloads
04:01 Internal monitoring alerted us about anomalous provisioning failures
04:13 We initiated our incident response process
04:14 We confirmed a sharp rise in 503 responses from ARM
04:17 We disabled automated internal infrastructure functions to limit the number of customers impacted by this issue
04:31 We alerted customers to the incident via status.octopus.com
04:38 We created a ticket with Azure (Sev A)
05:15 Our pooled resources were exhausted, leading to the first customer impact
05:39 As a potential mitigation, we decided to start provisioning additional infrastructure in an alternate region within Europe
06:04 Azure confirmed the outage
06:51 We observed that Dynamic Workers were beginning to recover
06:59 We alerted customers via status.octopus.com that the incident was mitigated
07:10 Azure incident resolved
07:10 We confirmed alternate infrastructure was available for failover if the issue recurred
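The key intervals implied by the timeline above can be derived directly from its timestamps. The Python sketch below is illustrative arithmetic only: the function and variable names are chosen for this example, and it assumes every timeline entry falls on March 23, 2023 (UTC).

```python
from datetime import datetime

DAY = "2023-03-23"  # assumption: all timeline entries are on this date (UTC)


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident date."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline entries."""
    return int((at(end) - at(start)).total_seconds() // 60)


# First provisioning failure -> internal monitoring alert
time_to_detect = minutes_between("03:03", "04:01")
# First ARM 503 responses -> pooled resources exhausted
arm_to_impact = minutes_between("02:41", "05:15")
# First provisioning failure -> pooled resources exhausted
pool_runway = minutes_between("03:03", "05:15")
# First customer impact -> Dynamic Workers beginning to recover
customer_impact = minutes_between("05:15", "06:51")

print(time_to_detect, arm_to_impact, pool_runway, customer_impact)
```

On these figures, detection took just under an hour, and the pre-provisioned pool absorbed a little over two hours of failed provisioning before customers were affected.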
Provisioning Dynamic Workers for customer workloads relies heavily on ARM. An outage with ARM meant that we could not provision new Workers in the West Europe region. We maintain a pre-provisioned pool of Workers, but it was depleted at 05:15, around two and a half hours after ARM began returning errors.
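To illustrate why a pre-provisioned pool delays customer impact but cannot prevent it, here is a toy Python model. It is not Octopus Cloud's implementation: `WorkerPool`, `leases_served`, the pool size, and the assumption that one replacement Worker is provisioned per lease are all hypothetical.

```python
class WorkerPool:
    """Toy model of a pre-provisioned pool (hypothetical, for illustration)."""

    def __init__(self, size: int):
        self.available = size

    def lease(self) -> bool:
        """Hand out one Worker; False means the pool is exhausted."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def refill(self, provision_ok: bool) -> None:
        """Add one replacement Worker only if provisioning succeeded."""
        if provision_ok:
            self.available += 1


def leases_served(pool: WorkerPool, requests: int, provision_ok: bool) -> int:
    """Count how many of `requests` lease attempts the pool can satisfy."""
    served = 0
    for _ in range(requests):
        if pool.lease():
            served += 1
        pool.refill(provision_ok)
    return served


# Healthy provisioning keeps the pool topped up; during an outage it drains.
print(leases_served(WorkerPool(3), 10, provision_ok=True))
print(leases_served(WorkerPool(3), 10, provision_ok=False))
```

In this model, the pool serves every request while provisioning works, and only as many requests as its initial size once provisioning fails.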
We have identified improvements to our alerting to reduce the time it takes for us to detect similar incidents. We’re prioritizing these improvements using our Risk Treatment Policy.
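One common way to detect this kind of failure sooner is to alert on the failure ratio over a sliding window of recent provisioning attempts. The sketch below is a generic example of that technique, not a description of the alerting improvements being made; the class name, window size, and threshold are assumptions.

```python
from collections import deque


class FailureRateAlert:
    """Hypothetical alert: fires when the failure ratio over the last
    `window` provisioning attempts reaches `threshold`."""

    def __init__(self, window: int = 20, threshold: float = 0.5):
        self.results = deque(maxlen=window)  # oldest results drop off automatically
        self.threshold = threshold

    def record(self, succeeded: bool) -> bool:
        """Record one attempt and return True if the alert should fire."""
        self.results.append(succeeded)
        if len(self.results) < self.results.maxlen:
            return False  # not enough attempts observed yet
        failures = self.results.count(False)
        return failures / len(self.results) >= self.threshold


alert = FailureRateAlert(window=4, threshold=0.5)
fired = [alert.record(ok) for ok in (True, True, True, False, False)]
print(fired)
```

A smaller window or lower threshold detects failures faster at the cost of more false alarms, so the values would need tuning against real provisioning traffic.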
Currently, we rely heavily on single-region availability in Azure. We are evaluating our options to diversify the regions that we use, to mitigate regional availability issues.
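As a rough illustration of what multi-region provisioning could look like, the sketch below tries each region in order and falls back when one fails. Everything here is hypothetical: `ProvisioningError`, `provision_with_fallback`, the `provision` callable, and the region names are placeholders, and no particular design has been chosen.

```python
from typing import Callable, Sequence


class ProvisioningError(Exception):
    """Hypothetical error raised when a region cannot provision a Worker."""


def provision_with_fallback(
    provision: Callable[[str], str], regions: Sequence[str]
) -> str:
    """Try each region in order and return the first Worker provisioned."""
    errors = []
    for region in regions:
        try:
            return provision(region)
        except ProvisioningError as exc:
            errors.append(f"{region}: {exc}")
    raise ProvisioningError("all regions failed: " + "; ".join(errors))


def fake_provision(region: str) -> str:
    """Stand-in for a real provisioning call; the primary region is 'down'."""
    if region == "primary-region":
        raise ProvisioningError("503 from resource manager")
    return f"worker-in-{region}"


worker = provision_with_fallback(fake_provision, ["primary-region", "alternate-region"])
print(worker)
```

A real failover would also need to account for latency, data residency, and capacity in the alternate region, which this sketch ignores.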