Date

March 3, 2017

Description

Certain customers unable to access the service

Duration

55 hours

Affected components

Silo sessions running on certain app servers were inaccessible. Affected users were completely offline for the duration of the incident.

Affected customers

12 customer organizations

Root cause analysis

An upstream provider blocked IP addresses associated with machines hosting browser sessions. Any user who attempted to launch a session on a blocked address was unable to establish it. All regions were blocked for the affected customers.



Event Log

On 2/21/2017 at 23:58 GMT, Authentic8 began receiving reports from various customers who were unable to connect to the service. Internal support staff began to investigate the reports.


All system monitors indicated normal operation.  Operations confirmed that:

  • Statistics for session requests through Authentic8 launchers were normal for the period

  • Concurrent session statistics were lower for the period


On 2/22/2017 at 00:15 GMT, while investigating the issue with an affected customer, we learned that requests from the client to the “launch servers”, which validate the user and start the browser session, were being processed normally. Requests to the “app servers”, which run the user’s Silo session, were being blocked by an upstream network provider. This finding was consistent with the usage statistics reported above: session requests were normal, but concurrent sessions were lower.


On 2/22/2017 at 21:40 GMT, Authentic8 created a validation check that customers could run to confirm that they were unable to connect to an app server IP address. We began sharing the check with customers and collecting responses.
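The report does not describe the validation check itself. As a rough illustration only, a check of this kind could be a TCP connection attempt against each app server address. The port (443), the timeout, and the function names below are assumptions for the sketch, and the addresses in the usage note are documentation placeholders, not Authentic8 addresses.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within timeout.

    Port 443 is an assumed default; the report does not name the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_addresses(addresses, port=443, timeout=5.0, probe=can_connect):
    """Probe each address and return a {address: reachable} mapping.

    `probe` is injectable so the logic can be exercised without a network.
    """
    return {addr: probe(addr, port, timeout) for addr in addresses}
```

A customer could run `check_addresses(["192.0.2.10", "192.0.2.11"])` from inside their network and return the resulting mapping to support. A result of False for every app server address, while the launch servers remain reachable, would match the pattern seen in this incident.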


On 2/22/2017 at 22:14 GMT, Authentic8 began instructing customers to open support cases with their provider. The issue was consistent across all affected services and was not localized to a particular branch or region. The upstream provider was the only potential upstream provider we were aware of that spanned the full range of affected customers. One organization reported that asking the upstream provider to compare its firewall blackhole list against the list of known Authentic8 IP addresses that must be whitelisted would identify the blocked addresses.
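The comparison that organization suggested amounts to checking each known service address against the provider's block entries. A minimal sketch, assuming the block list is available as plain IP addresses or CIDR ranges (the provider's actual format is not known to us), might look like the following. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def find_blocked(known_addresses, blocklist_entries):
    """Return the known addresses covered by any blocklist entry.

    Each blocklist entry may be a single IP ("203.0.113.7") or a CIDR
    range ("198.51.100.0/24"). Input order of known addresses is kept.
    """
    networks = [ipaddress.ip_network(entry, strict=False)
                for entry in blocklist_entries]
    blocked = []
    for addr in known_addresses:
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in networks):
            blocked.append(addr)
    return blocked
```

Any address returned by `find_blocked` is a candidate for removal from the block list, which gives the provider a concrete list to act on rather than a general complaint about connectivity.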



On 2/23/2017 at 06:48 GMT Authentic8 management escalated to non-operational contacts at the provider for assistance.


On 2/24/2017 at 07:00 GMT, Authentic8 began receiving reports that connectivity had been restored.

 


Resolution

On 2/24/2017 at 07:00 GMT, Authentic8 began receiving reports that connectivity had been restored.


Authentic8 does not know what events led to the network block, which specific sub-components of the provider's infrastructure were blocking us, or what process was required to back out the blocks of Authentic8 IP addresses.


Authentic8 does not have direct operational contacts with this provider and needed to use non-operational contacts to escalate the issue. It is unclear whether the escalation by Authentic8 or the escalations by the numerous impacted customers led to the restoration of connectivity.



Moving Forward

There are a number of intermediary providers between the user and the Authentic8 service.  


Incidents like this will likely occur again, and our goal is to improve our processes for handling them. Our course of action is as follows:


  • Improve our triaging and correlation of incidents to identify whether issues are local, broad-based, or pervasive across our customers, and categorize each issue accordingly

  • Build an escalation list of the potential upstream providers for each category

  • Proactively notify all customers in the category that other organizations are experiencing an issue, and bring them into the process

  • Provide specific, consistent instructions to impacted customers on escalation to their appropriate provider
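As an illustration of the triage step, the sketch below classifies a set of incident reports and finds the providers shared by every affected customer. The category definitions (one customer is local, at least half of all customers is pervasive, anything in between is broad-based), the 0.5 threshold, and the report format are all assumptions made for the example. They are not an established Authentic8 process.

```python
from collections import defaultdict


def categorize(reports, total_customers, pervasive_fraction=0.5):
    """Classify an incident from (customer, provider) report tuples.

    Assumed rule: one affected customer is "local"; at least
    `pervasive_fraction` of all customers is "pervasive"; otherwise
    the incident is "broad-based".
    """
    customers = {customer for customer, _ in reports}
    if not customers:
        return "none"
    if len(customers) == 1:
        return "local"
    if len(customers) / total_customers >= pervasive_fraction:
        return "pervasive"
    return "broad-based"


def common_providers(reports):
    """Return the providers that appear for every affected customer."""
    by_customer = defaultdict(set)
    for customer, provider in reports:
        by_customer[customer].add(provider)
    if not by_customer:
        return set()
    return set.intersection(*by_customer.values())
```

In this incident, a correlation like `common_providers` would have surfaced the single upstream provider that spanned all affected customers, which is the natural first entry on the escalation list for that category.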


Request for information from you

  • Share whatever details you can about any ticket you opened with your provider.

  • Provide any details you can about your escalation path. As we build our external contact list for escalation, we want to be able to advise impacted customers on their internal escalation channels.