Service Issue with blyCloud West Coast Data Center
Incident Report for blyCloud Status Page
Postmortem

Root Cause Analysis

Overview:

At 7:00 PM (PDT) on June 22, 2023, data center staff began routine maintenance of core network components in our west coast facility. During the maintenance, external connections to the environment were expected to be interrupted for periods of time as security updates were applied to network equipment, while critical LAN functions were to be maintained through redundant network paths. Systems were updated according to schedule, but during final checks we discovered that some systems were not responding as expected. Troubleshooting revealed that the secondary path provided for LAN traffic was not being used by the high-availability server cluster. This failure resulted in erroneous communication between failover cluster hosts, which led to some hosted systems entering a paused/stopped state and, in some cases, disk consistency issues.

Timeline: (Please note that all times listed are Pacific Time)

Date Time Event
6/22/23 7:00 PM Maintenance window began
6/22/23 7:15 PM Network configuration changes made to allow alternate path for cluster communications
6/22/23 8:00 PM Patching of network infrastructure began
6/22/23 11:10 PM Patching of network infrastructure concluded
6/22/23 11:15 PM During system checks, staff noted several systems in a paused/saved state and began investigating
6/23/23 2:00 AM Cluster errors had been resolved and all visibly impacted systems had been brought back online and appeared stable
6/23/23 3:00 AM NOC team began receiving reports of additional impact, and remediation efforts started for the reported systems
6/23-6/27/23 Additional remediation steps were taken as needed on a case-by-case basis

Root Cause:

The alternate path for cluster communication did not propagate automatically to all hosts in the cluster before the primary path was taken offline, as had been expected. Hosts that did not receive the alternate path behaved as though the other hosts in the cluster had gone offline and repeatedly attempted to assume control of the clustered roles. This led to systems being taken offline or restarted and, in some cases, to disk consistency errors.
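The failure mode above can be illustrated with a minimal simulation. This is a hypothetical, simplified sketch, not blyCloud's actual cluster software: host names, the set of hosts that received the alternate path, and the reachability function are all assumptions made for illustration.

```python
# Hypothetical sketch: hosts that never received the alternate cluster network
# wrongly conclude their peers are offline once the primary path goes down.

def reachable_peers(host, peers, primary_up, hosts_with_alternate):
    """Return the set of peers this host can still exchange heartbeats with."""
    reachable = set()
    for peer in peers:
        if primary_up:
            # The primary path carries heartbeats for every host.
            reachable.add(peer)
        elif host in hosts_with_alternate and peer in hosts_with_alternate:
            # The alternate path only works if BOTH ends received it.
            reachable.add(peer)
    return reachable

hosts = {"node1", "node2", "node3", "node4"}
# Suppose the alternate path propagated only to node1 and node2.
with_alternate = {"node1", "node2"}

# While the primary path is up, every host sees every peer.
assert reachable_peers("node3", hosts - {"node3"}, True, with_alternate) == {
    "node1", "node2", "node4"}

# With the primary offline, node3 sees no peers at all, so it attempts to
# assume the clustered roles itself -- the erroneous takeover described above.
assert reachable_peers("node3", hosts - {"node3"}, False, with_alternate) == set()
assert reachable_peers("node1", hosts - {"node1"}, False, with_alternate) == {"node2"}
```

The sketch shows why partial propagation is worse than no alternate path at all: the cluster splits into a group that can still communicate and isolated hosts that each believe they are the sole survivor.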

Follow Up Action:

blyCloud is re-evaluating the switching topology in the west coast facility to provide a permanent, redundant secondary path for cluster communications. This will allow critical infrastructure to be serviced properly without relying on manual reviews or system changes propagating during the maintenance cycle prior to patching. Service tickets remain open with the current switch and server vendors to determine whether a fix is available short of a network topology change.

We apologize for any inconvenience caused by this service interruption.

Posted Jul 11, 2023 - 08:01 EDT

Resolved
We received some reports of slow or inaccessible servers in our west coast facility after network maintenance.
Posted Jun 23, 2023 - 06:00 EDT