Octopus Server 2023.3 contained a bug causing self-hosted Octopus Server High Availability (HA) clusters to run some deployments more than once, often concurrently. This resulted in incorrect task statuses and confusing task logs.
The bug was caused by a feature-flagged change to our internal TaskQueue. That change removed a database write lock that stops multiple Octopus Server nodes in a High Availability (HA) configuration from picking up the same task. The write lock removal was accidentally left un-flagged. Without the write lock in place, multiple Octopus Server nodes could execute the same task concurrently. This primarily presented as incorrect or out-of-order logging of tasks in a deployment. The issue could affect any self-hosted customer running HA mode on a 2023.3 release earlier than 2023.3.13026.
Once we received a report that the issue was affecting the correctness of task execution, and not only how tasks were displayed, we escalated immediately and resolved the issue as quickly as possible.
We know how critical deployments are for our customers, and we take the trust they have in us to execute those deployments correctly very seriously. We apologise to our customers for not meeting our own standards of correct deployment execution.
Time to detection: 14 days (from GA of 2023.3 to first report)
Time to incident declaration:
Time to resolution: 27 hours 25 minutes
From Monday 18 September 2023, we received customer reports that task statuses, outputs and ordering were displaying incorrectly. Our Support team worked with our customers to troubleshoot common reasons for incorrect task display, and escalated to our Engineering team when they couldn’t resolve the issue.
On Thursday 21 September 2023, a customer reported that tasks were executing out of order.
On Friday 22 September 2023, we identified that a change to our task queue, released in 2023.3, caused the same task to execute on multiple Octopus Server nodes in HA mode. We fixed the issue immediately and contacted affected customers.
We received reports from four affected customers, and identified a total of 24 customers who were using the impacted versions. We have contacted all 24 customers.
* All times in AEST
Octopus Server can be run either as a managed instance in Octopus Cloud, or hosted by our customers on their platform of choice. Octopus Cloud receives changes continuously. For self-hosted customers, Octopus Server has major releases four times a year, with each release rolling up all the changes from the last three months. Some complex or early access features will only target the next major version and not be backported to previous supported LTS versions.
Octopus Server High Availability (HA) mode is only used by self-hosted customers. In HA, multiple nodes of Octopus Server are run concurrently and distribute tasks between them. Octopus Server uses the task queue persisted in the shared database to manage task execution across nodes.
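To illustrate how a queue held in a shared database can keep two nodes from picking up the same task, here is a minimal sketch. It is not Octopus Server's implementation: it swaps the write lock for an atomic conditional update used as a claim, runs against an in-memory SQLite database, and the table name, column names, state values and node names are all hypothetical.

```python
import sqlite3


def make_queue(task_ids):
    """Create an in-memory stand-in for a task queue table in a shared database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE server_task ("
        "id TEXT PRIMARY KEY, state TEXT NOT NULL, owner TEXT)"
    )
    db.executemany(
        "INSERT INTO server_task (id, state) VALUES (?, 'Queued')",
        [(task_id,) for task_id in task_ids],
    )
    db.commit()
    return db


def try_claim(db, task_id, node):
    """Claim a queued task for a node.

    The WHERE clause only matches while the task is still 'Queued', so the
    update changes one row for the first caller and zero rows for any later one.
    """
    cursor = db.execute(
        "UPDATE server_task SET state = 'Executing', owner = ? "
        "WHERE id = ? AND state = 'Queued'",
        (node, task_id),
    )
    db.commit()
    return cursor.rowcount == 1


def owner_of(db, task_id):
    """Return the node that currently owns the task, or None."""
    row = db.execute(
        "SELECT owner FROM server_task WHERE id = ?", (task_id,)
    ).fetchone()
    return row[0] if row else None
```

In this sketch a node would only execute a task after `try_claim` returns `True`. Any node that loses the race sees `False` and moves on to the next queued task.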
Octopus Deploy has been working on a fix for an issue where deployments would “hang”, getting stuck in a Cancelling state and not progressing. Under the hood, deployments and other work are represented as a ServerTask, and they are added to a TaskQueue. The first iteration of a fix changed how the database handled conflicting updates to the ServerTask entity, and required flow-on changes to the TaskQueue. It was added to the 2023.3 release behind a feature flag which defaulted to off. One of the changes was removing a write lock that Octopus Server nodes used to indicate they were executing a specific ServerTask on the queue. The write lock removal should have been behind the feature flag, but was mistakenly shipped as a universal change.
The Pull Request containing the problem was merged in June 2023 and has since been running in CI environments and on the Cloud platform. The issue didn’t show up in those environments because they don’t use HA mode, and only HA mode has multiple Octopus Server nodes contending to execute tasks. When 2023.3 was released in September the problem started appearing, and only for self-hosted customers.
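The reason single-node environments never surfaced the problem can be shown with a small deterministic simulation. This is a sketch under stated assumptions, not a model of Octopus Server internals: `SharedQueue`, `run_counts`, the `require_claim` switch and the node names are hypothetical, and the simulation forces the interleaving where every node reads the queue before any node starts work.

```python
from collections import Counter


class SharedQueue:
    """Hypothetical in-memory stand-in for a task queue shared between nodes."""

    def __init__(self, task_ids, require_claim):
        self.state = {task_id: "Queued" for task_id in task_ids}
        self.require_claim = require_claim
        self.executions = []  # (task_id, node) pairs, in start order

    def queued(self):
        """Return the tasks a node would see as available right now."""
        return [t for t, s in self.state.items() if s == "Queued"]

    def start(self, task_id, node):
        """Start a task on a node; with require_claim, only the first node wins."""
        if self.require_claim and self.state[task_id] != "Queued":
            return False
        self.state[task_id] = "Executing"
        self.executions.append((task_id, node))
        return True


def run_counts(task_ids, nodes, require_claim):
    """Count how many times each task runs when all nodes read before any starts."""
    queue = SharedQueue(task_ids, require_claim)
    seen = {node: queue.queued() for node in nodes}  # every node reads first
    for node in nodes:
        for task_id in seen[node]:
            queue.start(task_id, node)
    return dict(Counter(task_id for task_id, _ in queue.executions))
```

With a single node, each task runs once whether or not the claim check is enabled, so nothing looks wrong. With two nodes and no claim check, each task runs twice. Re-enabling the check brings the count back to one per task.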
The fix was to put the write lock back in place on the TaskQueue. Replacing the lock was a small change that was quick to test and ship. The work to reduce hung deployments isn’t used in Production environments yet so there was no concern with interactions between the fix and the feature flag.
We have removed all affected releases from public availability. The fixed version of 2023.3 is available on our downloads page. We have also reached out to all potentially affected self-hosted customers.
Our next step will be running an incident review to understand where our processes allowed us to ship a critical bug.
We have identified that we need to improve our automated testing of HA and our process around how we manage changes to those tests, and will be addressing these as a priority.