Incident - Unresponsive API Servers; High CPU Utilization

 

Date

July 23, 2020

Description

Customers were unable to establish new Silo sessions.

Duration

Approximately 1 hour and 13 minutes (10:16 AM to 11:29 AM PT)

Affected components

Multiple API servers were running at unusually high CPU utilization, which caused failures and long delays for new connection attempts to Silo. Both the Web Client and Installed Clients were affected.

Affected customers

All customers were potentially affected.

Root cause analysis

A large API transaction for a user data purge briefly overloaded multiple API servers and degraded their performance. During this period, several incoming connection requests were delayed, which created a backlog and eventually led to unresponsive API servers.

Event Log

07/23/2020 -- 10:29 AM PT: Authentic8’s Site Reliability team received an internal report of high system load on the API servers.

07/23/2020 -- 10:48 AM PT: Site Reliability team began investigating the anomaly.
     

Resolution

07/23/2020 -- 11:15 AM PT: Site Reliability team restarted the API servers showing high connection counts. CPU utilization returned to normal once the servers were restarted in rolling order.

 

Moving Forward

To alleviate the server bottleneck, Authentic8 has reviewed the capacity limits of the API and associated services and has taken measures to add capacity and improve data flow, in order to prevent similar bottlenecks in the future. In addition, we have significantly improved our monitoring to give our technical teams an earlier warning signal.



Additional Notes  

Please contact Support if you have additional questions or require further information.