Between 10:24 pm AEST on November 25 and 12:58 am AEST on November 26, customers experienced intermittent failures when signing into Octopus Cloud instances. These failures stemmed from a database connection leak, which caused some sign-in requests to time out. We apologize for the inconvenience this caused our customers and are taking steps to prevent it from happening again.
Time to detection: 9hrs 41mins
Time to incident declaration: 9hrs 41mins
Time to resolution: 12hrs 15mins
(All dates and times below are in AEST)
12:43 pm We deployed a change to our authorization service. This change introduced a bug that leaked database connections. The leak only became apparent under high traffic, which our current test suite doesn't replicate, so our tests didn't detect it.
10:24 pm A DevOps Support Engineer declared an incident after receiving reports from customers that they were having difficulty signing into Octopus Cloud instances.
10:24 pm - 11:41 pm Investigation showed that the sign-in failures were caused by database connection timeouts.
11:42 pm Temporary mitigations implemented, including restarting services to release database connections.
11:42 pm - 11:59 pm Effects of mitigations observed and deemed successful. Incident marked as mitigated.
12:00 am - 12:57 am Incident responders continued to monitor systems.
12:58 am Mitigation steps confirmed successful and the incident marked as resolved.
6:55 am Root cause of database connection timeouts identified as a connection leak.
6:55 am - 8:33 am Permanent fix implemented and deployed.
We deployed a new service version that fixes the database connection leak. We have also conducted an incident review to identify process improvements that will help prevent a recurrence. We continue to work towards improving the reliability and security of our authentication and authorization services.
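For readers unfamiliar with this class of bug, the following is a minimal illustrative sketch in Python of how a connection leak can stay hidden at low request volumes and then cause timeouts once enough requests have run. It is not the code involved in this incident. The report does not say how the leak occurred, so the early-exit path in `leaky_sign_in` is an assumed cause, and every name here (`ConnectionPool`, `leaky_sign_in`, `safe_sign_in`, `count_timeouts`) is hypothetical.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes available before the timeout."""


class ConnectionPool:
    """Toy fixed-size pool standing in for a real database connection pool."""

    def __init__(self, size, timeout=0.01):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")
        self._timeout = timeout

    def acquire(self):
        try:
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise PoolExhausted("timed out waiting for a connection") from None

    def release(self, conn):
        self._free.put(conn)

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            # Runs even if the request handler raises.
            self.release(conn)


def leaky_sign_in(pool, fail):
    """Hypothetical handler: leaks a connection on the error path."""
    conn = pool.acquire()
    if fail:
        # Assumed bug: early exit without releasing the connection.
        raise ValueError("bad credentials")
    pool.release(conn)
    return True


def safe_sign_in(pool, fail):
    """Same handler, with the connection released on every path."""
    with pool.connection():
        if fail:
            raise ValueError("bad credentials")
        return True


def count_timeouts(handler, pool, requests):
    """Send `requests` sign-ins (every other one fails) and count timeouts."""
    timeouts = 0
    for i in range(requests):
        try:
            handler(pool, fail=(i % 2 == 0))
        except ValueError:
            pass  # an ordinary failed sign-in, not a timeout
        except PoolExhausted:
            timeouts += 1
    return timeouts
```

With a pool of five connections, `count_timeouts(leaky_sign_in, ConnectionPool(size=5), 4)` returns 0, which is why a low-volume test can pass. At 50 requests the leaky handler exhausts the pool and every later request times out, while `safe_sign_in` produces no timeouts at any volume. A test that drives more requests than the pool has connections would surface this kind of leak before deployment.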