Date

August 3, 2017

Description

Database issues prevented users from accessing Silo

Duration

Resolved 30 minutes after first incident report

Affected components

Users attempting to start new Silo sessions were unable to access the system, and users who were already logged in had their Silo sessions terminated.

Affected customers

All Silo users

Root cause analysis

Our third-party provider, Google, reported an issue with the underlying disks for our instance, which caused our production database to stop functioning.


Event Log


On 2017-08-03 10:03 AM PDT – Our internal monitoring systems began to alert, and customers began reporting that they were having trouble accessing the Silo service.

On 2017-08-03 10:25 AM PDT – Our third-party provider restored service to the affected systems.

All impacted customers reported back throughout the morning that the issue had been resolved.


Resolution


Our third-party provider, Google, reported a brief issue that morning in which persistent disks experienced high latency. This latency caused query responses from our production database to "freeze". Once Google identified the root cause of the issue, they corrected it and took steps to prevent a future occurrence.

Google has completed a root-cause analysis of the persistent disk issue and states that it will not happen again.


Moving Forward


Nonetheless, the Operations Team is researching ways to improve our database failover methods. In this case, a failover was not required, but one would have been initiated had the failover process taken a few seconds instead of minutes.
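As an illustration of the direction being considered, below is a minimal sketch of a latency-based failover check: probe the primary database on a short interval and promote a standby only after it has been unresponsive past a small threshold. The hostnames, thresholds, the PostgreSQL assumption, and the promote_replica() placeholder are all hypothetical; this is not the actual Silo implementation.

    # A minimal sketch of a latency-based failover check. All hostnames,
    # thresholds, and the promote_replica() placeholder are hypothetical and
    # do not describe the actual Silo implementation.
    import time
    import psycopg2  # assumes a PostgreSQL primary; the real stack may differ

    PRIMARY_DSN = "host=db-primary dbname=silo connect_timeout=2"  # hypothetical
    CHECK_INTERVAL_S = 1    # how often to probe the primary
    FAILOVER_AFTER_S = 10   # promote a standby after ~10 seconds of failed probes

    def primary_is_healthy() -> bool:
        """Probe the primary with a trivial query; treat errors or timeouts as unhealthy."""
        try:
            conn = psycopg2.connect(PRIMARY_DSN)
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    cur.fetchone()
                return True
            finally:
                conn.close()
        except psycopg2.Error:
            return False

    def promote_replica() -> None:
        """Placeholder: a real version would promote a standby and repoint clients."""
        print("Promoting standby replica to primary (placeholder)")

    def monitor_and_failover() -> None:
        """Trigger a failover only after the primary has been unhealthy past the threshold."""
        first_failure = None
        while True:
            if primary_is_healthy():
                first_failure = None
            else:
                if first_failure is None:
                    first_failure = time.monotonic()
                if time.monotonic() - first_failure >= FAILOVER_AFTER_S:
                    promote_replica()
                    return
            time.sleep(CHECK_INTERVAL_S)

Keeping the trigger threshold in seconds rather than minutes is what would allow a failover of this kind to help during an incident of this length.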

We apologize for any inconvenience this may have caused and appreciate the customer feedback that was shared during this incident.