Incident - Connection Failure

 

Date

August 5, 2020

Description

Customers were unable to connect to Silo

Duration

Approximately 49 minutes (5:44 AM to 6:33 AM PT)

Affected components

Web Client and Installed Clients

Affected customers

Any customer trying to establish a new session. 

Root cause analysis

An expired certificate caused our firewall automation to fail and apply an incorrect set of rules, which left some of our internal systems unable to communicate with each other.

 

 

Event Log

08/05/2020 -- 05:44 AM PT: Authentic8’s Site Reliability team received multiple alerts for Silo app servers across multiple regions.

08/05/2020 -- 05:46 AM PT: Site Reliability team began investigating and confirmed connection failures to multiple services.

08/05/2020 -- 06:33 AM PT: Site Reliability team restored thick client service by rebuilding firewall rules.

08/05/2020 -- 07:02 AM PT: Site Reliability team restored web client service in most locations by rebuilding firewall rules. Our London and Tokyo locations experienced further service interruption.

08/05/2020 -- 08:00 AM PT: Site Reliability team restored services for all Silo server roles.

08/05/2020 -- 11:51 AM PT: Web Client endpoint was restored for London and Tokyo locations.
 

Resolution

Our Site Reliability team manually recreated the firewall rules on all affected servers to restore Silo connections quickly. The expired certificate that had caused our automation to fail in this way was then updated on the affected infrastructure.
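
An expiring certificate of this kind can usually be flagged ahead of time by a periodic expiry check. The following is a minimal Python sketch of such a check, not a description of Authentic8's actual tooling. The function names and the 30-day warning threshold are hypothetical, and the certificate's "notAfter" value is assumed to be supplied in the text format used by Python's ssl module (for example "Aug  5 12:00:00 2021 GMT").

```python
import ssl
from datetime import datetime, timezone

# Hypothetical threshold: warn when fewer than this many days remain.
WARN_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's notAfter time.

    `not_after` uses the ssl module's text format, e.g. "Aug  5 12:00:00 2021 GMT".
    A negative result means the certificate has already expired.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(expires_epoch, tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after: str, now: datetime) -> str:
    """Classify a certificate as "expired", "expiring", or "ok"."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= WARN_DAYS:
        return "expiring"
    return "ok"
```

A scheduled job could call expiry_status for each certificate the automation depends on and raise an alert for anything that is not "ok", so that renewal happens before dependent systems begin to fail.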

Moving Forward

Authentic8 has updated the infrastructure automation that allowed this misconfiguration and has implemented redundant monitoring to provide multiple methods of assessing server health. In addition, we are improving the instrumentation and data from the product itself to better identify incidents in the future.