Incident - Connection Failure
Date | August 5, 2020 |
Description | Customers were unable to connect to Silo |
Duration | Approximately 49 minutes (5:44 AM to 6:33 AM PT) |
Affected components | Web Client and Installed Clients |
Affected customers | Any customer trying to establish a new session. |
Root cause analysis | A failure in our firewall automation applied an incorrect set of rules making some of our internal systems unable to communicate with each other. |
Event Log
08/05/2020 -- 05:44 AM PT: Authentic8’s Site Reliability team received multiple alerts for Silo app servers across multiple regions.
08/05/2020 -- 05:46 AM PT: Site Reliability team began investigating; confirmed connection failure to multiple services.
08/05/2020 -- 06:33 AM PT: Site Reliability team restored thick client service by rebuilding firewall rules.
08/05/2020 -- 07:02 AM PT: Site Reliability team restored web client service in most locations by rebuilding firewall rules. Our London and Tokyo locations experienced further service interruption.
08/05/2020 -- 08:00 AM PT: Site Reliability team restored services for ALL Silo server roles.
08/05/2020 -- 11:51 AM PT: Web Client endpoint was restored for London and Tokyo locations.
Resolution
Our Site Reliability team manually recreated the firewall rules on all affected servers in order to quickly restore Silo connections. The expired certificate was updated on the affected infrastructure to fix the condition that had caused our automation to fail in this way.
Moving Forward
Authentic8 has updated the infrastructure automation that allowed this mis-configuration and has implemented redundant monitoring to provide multiple methods of assessing the server health. In addition improvements are being made to improve the instrumentation and data from the product itself to help further improve the identification of incidents in the future.