Investigating bug reports related to step execution logic

Incident Report for Octopus Deploy

Postmortem

Incident Report - Deployments run more than once in High Availability (HA) clusters

Summary

Octopus Server 2023.3 contained a bug causing self-hosted Octopus Server High Availability (HA) clusters to run some deployments more than once, often concurrently. This resulted in incorrect task statuses and confusing task logs.

The bug was caused by a feature-flagged change to our internal TaskQueue. That change removed a database write lock that stops multiple Octopus Server nodes in a High Availability (HA) configuration from picking up the same task. The write lock removal was accidentally left un-flagged. Without the write lock in place, multiple Octopus Server nodes could execute the same task concurrently. This primarily presented as incorrect or out-of-order logging of tasks in a deployment. The issue could affect any self-hosted customer running HA mode on a 2023.3 release earlier than 2023.3.13026.

Once we received a report that the issue was affecting the correctness of task execution, and not only how tasks were displayed, we escalated immediately and resolved the issue as quickly as possible.

We know how critical deployments are for our customers, and we take the trust they have in us to execute those deployments correctly very seriously. We apologise to our customers for not meeting our own standards of correct deployment execution.

Timings

Time to detection: 14 days (from GA of 2023.3 to first report)

Time to incident declaration:

3 days (from initial report of log ordering issue)
27 minutes (from first report of incorrect execution)

Time to resolution: 27 hours 25 minutes

What happened?

From Monday 18 September 2023, we received customer reports that task statuses, outputs and ordering were displaying incorrectly. Our Support team worked with our customers to troubleshoot common reasons for incorrect task display, and escalated to our Engineering team when they couldn’t resolve the issue.

On Thursday 21 September 2023, a customer reported that tasks were executing out of order.

On Friday 22 September 2023, we identified that a change to our task queue, released in 2023.3, caused the same task to execute on multiple Octopus Server nodes in HA mode. We fixed the issue immediately and contacted affected customers.

We received reports from four affected customers, and identified a total of 24 customers who were using the impacted versions. We have contacted all 24 customers.

* All times in AEST

Technical details of the problem

Octopus Server can either be run as a managed instance in Octopus Cloud, or hosted by our customers on their platform of choice. Octopus Cloud receives changes continuously. For self-hosted customers, Octopus Server has four major releases a year, with each release rolling up all the changes from the previous three months. Some complex or early access features will only target the next major version and not be backported to previous supported LTS versions.

Octopus Server High Availability (HA) mode is only used by self-hosted customers. In HA, multiple nodes of Octopus Server are run concurrently and distribute tasks between them. Octopus Server uses the task queue persisted in the shared database to manage task execution across nodes.

Octopus Deploy has been working on a fix for an issue where deployments would “hang”, getting stuck in a Cancelling state and not progressing. Under the hood, deployments and other work are represented as a ServerTask, and each ServerTask is added to a TaskQueue. The first iteration of a fix changed how the database handled conflicting updates to the ServerTask entity, and required flow-on changes to the TaskQueue. It was added to the 2023.3 release behind a feature flag which defaulted to off. One of the changes was removing a write lock that Octopus Server nodes used to indicate they were executing a specific ServerTask on the queue. The write lock removal should have been behind the feature flag, but was mistakenly shipped as a universal change.
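
As an illustration of the kind of guard such a write lock provides, the following is a minimal Python sketch and not Octopus Server's actual implementation. It assumes a simplified ServerTask table in an in-memory SQLite database and stands in for the write lock with a conditional update. The table layout, the create_queue and try_claim functions, and the node names are all hypothetical.

```python
import sqlite3


def create_queue(task_ids):
    """Build a hypothetical, simplified task queue table (not the real schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ServerTask (Id TEXT PRIMARY KEY, State TEXT NOT NULL, Node TEXT)"
    )
    db.executemany(
        "INSERT INTO ServerTask (Id, State) VALUES (?, 'Queued')",
        [(task_id,) for task_id in task_ids],
    )
    db.commit()
    return db


def try_claim(db, task_id, node):
    """Claim a task for a node; return True only if this call won the claim.

    The WHERE clause makes the update a no-op once the task has left the
    Queued state, so a second node asking for the same task gets False.
    """
    cursor = db.execute(
        "UPDATE ServerTask SET State = 'Executing', Node = ? "
        "WHERE Id = ? AND State = 'Queued'",
        (node, task_id),
    )
    db.commit()
    return cursor.rowcount == 1


if __name__ == "__main__":
    db = create_queue(["ServerTasks-1"])
    for node in ("node-a", "node-b"):
        print(node, "claimed task:", try_claim(db, "ServerTasks-1", node))
```

In this sketch the second claim attempt fails because the database itself arbitrates between nodes. If the guard is removed, every node that has seen the task in the Queued state goes on to execute it.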

The Pull Request containing the problem was merged in June 2023 and has since been running in CI environments and on the Cloud platform. The issue didn’t show up in those environments because they don’t use HA mode, and only HA mode has multiple Octopus Server nodes contending to execute tasks. When 2023.3 was released in September the problem started appearing, and only for self-hosted customers.
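
A toy simulation can show why single-node environments could not surface the problem. It assumes that each node first reads the queue and then marks the tasks it saw as executing, and it models the worst-case interleaving where every node reads before any node writes. The run_cluster function, its parameters, and the node and task names are hypothetical and are not taken from the Octopus Server codebase.

```python
def run_cluster(node_names, task_ids, use_claim_lock):
    """Simulate nodes polling a shared queue; return which nodes ran each task."""
    state = {task: "Queued" for task in task_ids}
    executions = {task: [] for task in task_ids}

    # Phase 1: every node reads the queue before any node has written to it.
    seen = {
        node: [task for task in task_ids if state[task] == "Queued"]
        for node in node_names
    }

    # Phase 2: each node acts on the snapshot it took in phase 1.
    for node in node_names:
        for task in seen[node]:
            if use_claim_lock and state[task] != "Queued":
                continue  # another node already holds the claim
            state[task] = "Executing"
            executions[task].append(node)
    return executions


if __name__ == "__main__":
    tasks = ["task-1", "task-2"]
    print("one node, no lock: ", run_cluster(["node-a"], tasks, False))
    print("two nodes, no lock:", run_cluster(["node-a", "node-b"], tasks, False))
    print("two nodes, lock:   ", run_cluster(["node-a", "node-b"], tasks, True))
```

With a single node, each task runs once whether or not the claim check is present, so the missing guard is invisible. With two nodes and no claim check, both nodes run every task.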

The fix was to put the write lock back in place on the TaskQueue. Replacing the lock was a small change that was quick to test and ship. The work to reduce hung deployments isn’t used in Production environments yet so there was no concern with interactions between the fix and the feature flag.

Remediation and next steps

We have removed all affected releases from public availability. The fixed version of 2023.3 is available on our downloads page. We have also reached out to all potentially affected self-hosted customers.

Our next step will be running an incident review to understand where our processes allowed us to ship a critical bug.

We have identified that we need to improve our automated testing of HA and our process around how we manage changes to those tests, and will be addressing these as a priority.

Posted Sep 27, 2023 - 20:48 UTC

Resolved

We have published a fix: https://octopus.com/downloads/2023.3.13026

A public incident report will be shared with affected customers. If you would like a copy, please get in touch with us at support@octopus.com.

Posted Sep 22, 2023 - 05:16 UTC

Identified

We've identified a very likely root cause. We made some changes to our task queues that should have been behind a feature flag, but a change to remove a write lock on the task queue table was accidentally left un-flagged. This means that multiple nodes could pick up the same task accidentally.
This confirms that the incident will only affect self-hosted customers using High-Availability mode.
We're working on a fix now and should have it available later on today.
There are two potential workarounds, although we know that they are not good ones. Moving to a single node instead of HA will be safe as it removes task queue contention.
You could also drain all of the nodes, then turn one of them on at a time. Allow each node to pick up some tasks, then drain it, and turn on the next. This approach would be extremely manual and we don't recommend it.

Posted Sep 22, 2023 - 00:49 UTC

Update

We continue to investigate isolated reports from a limited number of self-hosted customers on Octopus Server 2023.3 of this bug: https://github.com/OctopusDeploy/Issues/issues/8356. Out of an abundance of caution, we have temporarily removed the 2023.3 release from our downloads page while we continue to investigate.

Posted Sep 21, 2023 - 21:39 UTC

Investigating

We are currently investigating reports from a limited number of self-hosted customers on Octopus Server 2023.3 of this bug: https://github.com/OctopusDeploy/Issues/issues/8356. We will update that bug as our investigation progresses.

Posted Sep 21, 2023 - 04:28 UTC